ML / DL system build for learning vs cloud


I saw that guide and was totally bummed. It has a bunch of incorrect information and is missing a lot.

One easy example: most people will run the 1080 Tis at a 300W power limit instead of 250W. Either way, a fully loaded system training for a day or two will draw more power than most 15A / 120V circuits can deliver continuously and will push the power supply past 1.3kW.
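To put rough numbers on the power point, here is a minimal sketch of the arithmetic. Only the 300W GPU limit and the 15A / 120V circuit come from the post. The CPU draw, the "everything else" draw, the PSU efficiency, and the 80% continuous-load derating are assumptions I picked for illustration, so plug in your own figures.

```python
# Rough wall-power estimate for a 4x GPU training box.
GPU_COUNT = 4
GPU_POWER_LIMIT_W = 300    # from the post: 300W limit per 1080 Ti

CPU_W = 180                # assumption: CPU package power under load
OTHER_W = 100              # assumption: drives, fans, RAM, motherboard
PSU_EFFICIENCY = 0.90      # assumption: PSU efficiency at this load

CIRCUIT_AMPS = 15          # from the post
CIRCUIT_VOLTS = 120        # from the post
CONTINUOUS_DERATE = 0.80   # assumption: rule of thumb for loads running for hours


def dc_load_w():
    """Power the PSU has to deliver to the components."""
    return GPU_COUNT * GPU_POWER_LIMIT_W + CPU_W + OTHER_W


def wall_draw_w():
    """Power pulled from the outlet once PSU losses are included."""
    return dc_load_w() / PSU_EFFICIENCY


def circuit_continuous_w():
    """What the circuit can carry for a long-running load."""
    return CIRCUIT_AMPS * CIRCUIT_VOLTS * CONTINUOUS_DERATE


if __name__ == "__main__":
    print(f"DC load:            {dc_load_w():.0f} W")
    print(f"Wall draw:          {wall_draw_w():.0f} W")
    print(f"Circuit continuous: {circuit_continuous_w():.0f} W")
    print("Over budget" if wall_draw_w() > circuit_continuous_w() else "Within budget")
```

With these assumed figures the box is over the continuous budget of the circuit before you plug in a monitor or anything else on the same breaker.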

The general guideline is around 2x as much system RAM as GPU memory. With 44GB of GPU memory (4x 11GB), that works out to 88GB, so on Threadripper you need to move to a 128GB memory configuration.
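The RAM sizing works out like this. It is a quick sketch: the 2x ratio and the 4x 11GB cards are from the post, while the list of candidate capacities is just my assumption about typical matched-DIMM totals, not a statement about any particular motherboard.

```python
GPU_COUNT = 4
GPU_MEMORY_GB = 11        # GTX 1080 Ti
RAM_TO_VRAM_RATIO = 2     # the general guideline mentioned above

# Assumption: total capacities you can realistically build with matched DIMMs.
CANDIDATE_RAM_GB = [32, 64, 128]


def required_ram_gb(gpu_count=GPU_COUNT, gpu_memory_gb=GPU_MEMORY_GB,
                    ratio=RAM_TO_VRAM_RATIO):
    """System RAM suggested by the ratio guideline."""
    return ratio * gpu_count * gpu_memory_gb


def pick_ram_config(required_gb, candidates=CANDIDATE_RAM_GB):
    """Smallest candidate capacity that meets the requirement, or None."""
    for capacity in sorted(candidates):
        if capacity >= required_gb:
            return capacity
    return None


if __name__ == "__main__":
    need = required_ram_gb()
    print(f"Guideline RAM: {need} GB, pick: {pick_ram_config(need)} GB")
```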

The cost comparison did not include air conditioning or soundproofing. A 4x 1080 Ti FE machine will be extremely loud if you keep it cool enough to avoid severe GPU throttling.

A single Threadripper system does not have enough PCIe lanes to run 4x GPUs plus storage and everything else at full speed.
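Here is a back-of-the-envelope lane budget for that. The numbers are my assumptions: 64 CPU lanes with 4 reserved for the chipset link on a first-generation Threadripper, one x4 NVMe drive, and one x8 network adapter. Check your own platform and device list before relying on them.

```python
# PCIe lane budget sketch. All lane counts here are assumptions for illustration.
CPU_LANES_TOTAL = 64       # assumption: first-generation Threadripper
CHIPSET_LINK_LANES = 4     # assumption: lanes reserved for the chipset

# (device name, lanes wanted at full speed)
DEVICES = [
    ("GPU 0", 16),
    ("GPU 1", 16),
    ("GPU 2", 16),
    ("GPU 3", 16),
    ("NVMe SSD", 4),         # assumption: one x4 drive
    ("Network adapter", 8),  # assumption: one x8 high-speed NIC
]


def usable_lanes():
    """Lanes left for add-in devices after the chipset link."""
    return CPU_LANES_TOTAL - CHIPSET_LINK_LANES


def lanes_wanted(devices=DEVICES):
    """Total lanes needed to run every device at full width."""
    return sum(lanes for _, lanes in devices)


def shortfall(devices=DEVICES):
    """How many lanes short the platform is (0 if everything fits)."""
    return max(0, lanes_wanted(devices) - usable_lanes())


if __name__ == "__main__":
    print(f"Usable: {usable_lanes()}, wanted: {lanes_wanted()}, short by: {shortfall()}")
```

Under these assumptions the four GPUs alone already ask for more lanes than are usable, so something has to drop to x8.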

Threadripper is worse than Xeon E5 V4 because all of the GPUs will be on different PCIe roots. You cannot use NCCL with Threadripper, and NCCL is what gives you a huge multi-GPU scaling boost. Beyond that, with Threadripper you only have a maximum of PCIe x16 between two pairs of GPUs.

If you are doing a single machine with a small dataset, that storage and networking setup works. Most of the higher-end setups use a NAS to share data among many systems and to hold more data. They also tend to have higher-speed networking, with FDR Infiniband being the minimum.

There is quite a bit more, but that Medium post is really dangerous. It is basically "hey, I can build a PC and run deep learning on it." That is fine, but saying that the single-GPU system scales properly to 4 GPUs is incorrect. You may save a few bucks, but it is a sub-optimal build.

There is a reason type systems are so popular. That is a system with proper networking, proper PCIe topology, and proper power and cooling. The cost per GPU on a DeepLearning11 class system is ~$1600 to set up. In that Medium article, the author is spending $2963 per GPU. At 2x GPUs it is still around $1800 per GPU, and an Intel Xeon E5 has a better PCIe root.

As a "hey, I built a PC with a 1080 Ti and it is cheaper in performance per dollar than AWS" post, that is fair. But it is not scalable, so it is stuck doing low-end training, which it is fine for. While it is cheaper, it is also the wrong architecture for multi-GPU training.

It is a bit frustrating, since we have all of this online and lots of people read it.