We are building a small cluster of servers for computer vision tasks at a university lab, mostly deep learning on GPUs. Each box will have 8 GPUs, so roughly 8 people can work on them at once. Training these models means feeding a lot of data into the GPUs, so we need two things: a fast network and fast storage. I would estimate that maybe 16 processes across 3 servers will be constantly reading from the file server.
Most of the operations are read-only: we put the dataset (images, videos, audio, etc.) on the store and then mount it on a node via NFS to be read while training the models. We don't need very much space; at the moment we run everything on disks with around 1.5 TB of storage, roughly 600 GB of it free.
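For the new box I was planning a plain read-only NFS export along these lines (paths, subnet and host name are just placeholders, and nconnect needs a reasonably recent client kernel):

    # /etc/exports on the file server
    /data  10.0.0.0/24(ro,async,no_subtree_check)

    # mount on a GPU node
    mount -t nfs -o ro,vers=4.2,rsize=1048576,nconnect=8 fileserver:/data /mnt/data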
Networking I have worked out: 40 GbE with used parts from eBay, as our budget is quite limited. One Mellanox ConnectX-3 card per node and two in the file server. InfiniBand might be cheaper, but I have the feeling 40 GbE is just easier to integrate into an existing environment.
For the file server I have a Dell R720 I can use: dual E5-2630 (v1) with 96 GB RAM and an H710 with 512 MB cache. My goal is to saturate a link-aggregated 40 GbE connection, so roughly 80 Gbit/s (10 GByte/s) for reading. I don't care about write speeds; 100 MB/s would be fine. Files range from small 100 KB files to gigabyte-sized HDF5 database files.
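For the link aggregation I was picturing a plain 802.3ad bond on the file server, roughly like this (interface names and addresses are made up). One caveat I'm aware of: LACP hashes per flow, so a single NFS client stream tops out at 40 Gbit/s, and I'd be relying on several clients reading at once to actually use both links:

    # Debian-style /etc/network/interfaces sketch
    auto bond0
    iface bond0 inet static
        address 10.0.0.10/24
        bond-slaves enp65s0 enp66s0
        bond-mode 802.3ad
        bond-miimon 100
        bond-xmit-hash-policy layer3+4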
One idea that crossed my mind was to get three 512 GB Samsung 960 Pro NVMe drives on PCIe adapter cards. Each of them is rated at roughly 3.5 GByte/s sequential read. In a RAID 0 setup I would expect to get around 8 GByte/s.
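Concretely, that idea would look something like the following (device names are placeholders), plus a quick fio run to check whether the array actually delivers the sequential read numbers I'm hoping for:

    # stripe the three NVMe drives and put XFS on top
    mdadm --create /dev/md0 --level=0 --raid-devices=3 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1
    mkfs.xfs /dev/md0
    mount /dev/md0 /mnt/fast

    # sequential read test with a few parallel readers
    fio --name=seqread --directory=/mnt/fast --rw=read --bs=1M --size=20G \
        --numjobs=4 --iodepth=32 --ioengine=libaio --direct=1 --group_reporting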
The data will be backed up every 24 hours to a NAS. While there is development code on this share, we have a Git server for code; if a drive dies I don't care if that code is lost, it should be on the Git server anyway. We can afford downtime (this isn't production, it's all R&D).
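The backup itself would just be a nightly rsync to the NAS, roughly like this (host and paths are made up):

    # /etc/cron.d/dataset-backup
    0 3 * * * root rsync -a --delete /data/ backup-nas:/backups/data/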
While this looked like a good idea for read speeds, two things were bothering me:
1.) The 960 Pro has a write endurance of 400 TBW. We aren't writing much, but that still seems awfully low.
2.) I don't like RAID 0, except for the speed of course.
Now, I was wondering if anyone has helpful input on how we can get maximum read speed, keep the cost low, and not build a ticking time bomb. I also had something like a hybrid solution with SATA SSDs and NVMe in mind, but didn't find anything useful yet. Would ZFS maybe help us out here?
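In case it helps frame answers: if ZFS is the way to go, what I'm picturing is a plain striped pool over the three NVMe drives (pool name and devices are made up), which as far as I understand has the same single-drive-failure exposure as RAID 0, but would give us a large ARC read cache on top given the 96 GB of RAM:

    # striped pool over the three NVMe drives
    zpool create -o ashift=12 fastpool /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1
    zfs set recordsize=1M fastpool
    zfs set compression=lz4 fastpool
    zfs set atime=off fastpool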