AI/Deep Learning/Machine Learning/HPC Build

TectraTech

New Member
Jan 25, 2020
Swindon
Hi all,

I am interested in building a workstation or server that can tackle the above tasks, and I have a few GPUs knocking about. I have read numerous guides online, but most advocate GTX/RTX cards and desktop hardware. I am in a relatively uncommon situation as I am an IT broker (primarily enterprise) and manage my own stock, so I have random smatterings of components everywhere. If there is anyone out there with experience configuring enterprise equipment for these tasks, I would love to pick your brain!

Proposed build:

Chassis:
- Dell PowerEdge R630/R620/R610/T330/T620/T320
or
- HP Z420 or Dell Precision T7910/T5810/T5800
- CPU (2x E5-2690 V1 or 2x E5-2695 V3): is this overkill, given that the GPU will be handling the crunching? I can always sub in a more modest CPU such as a single E5-2620 V1/V3.

- RAM (16GB DDR3 PC3-12800R or 16GB DDR4 2133 MT/s, depending on the system): 32GB modules are too expensive to squirrel away for this hobby, so I would prefer to use 16GB sticks, and I have enough to fully populate any of the above machines. Should I max out the RAM, or will I see diminishing returns after a rough figure?

- HDD (mechanical 15Ks or SAS/SATA SSDs?): Should I be running an all-flash system, or have an OS SSD and mechanical drives for data? Or is this not that important?

- OS (Ubuntu or Windows?)

- GPU: This is the part I need the most help with.

As far as I understand, Nvidia GPU hardware is extremely similar underneath, but features are typically enabled/disabled depending on the card (ECC etc.). I have Quadros (K4200/M4000/Q4000), Teslas (K20/K40/K80) and a few lower-end GTX cards (GTX 770) in stock that I can use - but which one should I be using?! I know Quadros are more suited to CAD, Teslas to scientific computation and GTX to gaming. It might sound like I have answered my own question, but there are drivers, ECC, memory bandwidth, cooling, power supplies and other things to take into consideration.

Any advice would be appreciated

Thanks all in advance!

Cixelyn

Researcher
Nov 7, 2018
San Francisco
It'd be great to know a bit more about your exact requirements and what type of jobs you expect to run on this system.

I'm not super familiar with the HPC world, but I have a bit of passing familiarity with the AI/DL world. If this is a server for bleeding edge deep learning research, you need as much VRAM as possible.

Of the cards you already have in your inventory, realistically only the K40 (12GB) and the K80 (24GB) are still relevant; everything else is much too small/old. Do note that the K80 is a bit weird (it's two GPUs tied together on one board). And with only ~8TF of single-precision (FP32) performance, you're more than 50% slower than the 16.3TF of a single Titan RTX. Real-world performance, without Tensor Cores, is probably even worse (likely 1/4th the speed of a modern GPU), so if you're speccing this for business purposes, make sure you account for the fact that a model that would normally train in 1 day might take 3-5 days on this cobbled-together setup.
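To make that concrete, here's a back-of-envelope sketch of how the 1-day-becomes-3-to-5-days estimate falls out of the numbers above. The FLOPS figures are the ones quoted in this post, and the extra penalty factor is an assumption standing in for the missing Tensor Cores and older architecture, not a benchmark:

```python
# Back-of-envelope: scale expected training time by relative FP32 throughput.
# ~8 TFLOPS FP32 for a K80 board vs ~16.3 TFLOPS for a Titan RTX (figures
# quoted above). extra_penalty is an assumed fudge factor for the lack of
# Tensor Cores and the older architecture.
def scaled_days(baseline_days, modern_tflops, old_tflops, extra_penalty=1.0):
    return baseline_days * (modern_tflops / old_tflops) * extra_penalty

print(f"{scaled_days(1.0, 16.3, 8.0):.1f} days")                      # raw FLOPS ratio: ~2 days
print(f"{scaled_days(1.0, 16.3, 8.0, extra_penalty=2.0):.1f} days")   # with assumed penalty: ~4 days
```

With the raw FLOPS ratio alone you land around 2x slower; add a plausible penalty for the architectural gap and you're in the 3-5x range quoted above.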

If you're only doing inference, you might be able to get away with some of the smaller cards. I still wouldn't go with any of the pre-Pascal cards you have in stock though, the effort required is probably just not worth the performance you'll get out of them.

As for the other stuff:
  • Chassis is w/e as long as there's enough space to fit the cards + airflow to cool the cards + large enough PSU to power the cards.
  • CPU is mostly w/e as long as the GPUs are being fed data fast enough via PCIe. (assuming you're running GPU-bound jobs. Different story if you're doing HPC work + large amounts of data preprocessing)
  • RAM: you need at least as much system RAM as you have total VRAM, ideally 2x if you can swing it; otherwise you'll bottleneck getting data onto and off the GPU.
  • HDD: you should get enough flash to fit and manipulate whatever your working-set/target datasets are (which will depend on the type of job you're doing). You might not need this if you have a large and fast-enough NAS connected.
  • OS: Ubuntu 18.04 is probably the best supported for DL libraries at the moment. We support Windows users in our lab (and to be frank, it's really not worth the effort)
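The RAM rule of thumb above (system RAM of at least total VRAM, ideally ~2x) is easy to sanity-check against your card inventory. A minimal sketch, using the VRAM figures for the two cards discussed in this thread (and remembering the K80 is a 2x 12GB dual-GPU board):

```python
# Rule of thumb from this thread: system RAM >= ~2x total VRAM, so data
# can be staged in host memory without starving the GPUs.
VRAM_GB = {"K40": 12, "K80": 24}  # K80 = two 12 GB GPUs on one board

def min_system_ram_gb(cards, multiplier=2):
    """Suggested minimum system RAM (GB) for a given set of cards."""
    total_vram = sum(VRAM_GB[c] for c in cards)
    return total_vram * multiplier

print(min_system_ram_gb(["K80"]))          # single K80 -> 48 GB suggested
print(min_system_ram_gb(["K80", "K40"]))   # K80 + K40 -> 72 GB suggested
```

So with 16GB DIMMs, a dual-socket board populated to 64GB+ comfortably covers a K80, and you'd want more slots filled before adding a second card.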