Looking for advice for my first Deep Learning system


LenE

New Member
Jan 29, 2020
I've been dabbling with deep learning for about a year, and I finally hit the point (of frustration) where I'm actually going to invest some time and money into building a deep-learning-specific workstation. This post is my attempt to air out what has been rattling around my head, so that more knowledgeable people can disabuse me of any crazy notions I may have grabbed hold of.

I should start off by saying my goal, or target, is to build a system that will help me become competitive in Kaggle competitions. Kaggle isn't my real goal; it is just a convenient stand-in for the level of performance I'd like to hit with this system. My assumption is that a system in the 2-4 GPU range would be about right. As of now, I'm looking at RTX 2070 Super cards as a baseline. My intention is to start with two GPUs and add more as time goes on.

I know that I could build this system with either Intel or AMD (Zen 2), but I'm favoring the latter at the moment. I also know that most of the work will be done on the GPU, but I want a CPU that runs fast for Python code that doesn't get accelerated by a GPU, and for other compilation work. This is where I'm hitting my first quandary: should I go with Epyc or Threadripper? Both have enough PCIe lanes for four GPUs. My gut wants to go with an Epyc 7302P or 7402P for lower power usage, but I keep looking at the Threadripper 3960X in the same price range, with a wider assortment of motherboards that could accommodate four GPUs and a lot of other "nice" built-in features.

I'm looking to run Linux only on this machine, and have no current intention of doing anything other than deep learning with it. I saw mention of Docker and VMs in other threads, but I'm a bit too much of a Luddite to fathom what advantages those would give in this context.

Any thoughts and/or suggestions will be greatly appreciated.
 

Cixelyn

Researcher
Nov 7, 2018
San Francisco
GPU Opinion: Start with 1x RTX 2080 Ti before moving on.

When first learning deep learning, you are significantly more likely to run individual models, tweak them, and then move on, rather than running multiple models in parallel. Optimize here for single-job performance. The 11 GB of VRAM will also help significantly over 8 GB when you start getting into more complex models.

CPU Opinion: Similar to the GPU -- go with Threadripper. Your single-threaded performance will be better, and your overall component cost will be lower than with Epyc. Only go with Epyc if you know you're planning on server-quantity amounts of RAM (256 GB or more). Weak suggestion to reconsider Intel as well -- some of the scientific libraries you want may be built against Intel MKL, which can give a reasonable boost on Intel CPUs (workload-specific, of course).

Motherboard Opinion: Optimizing for 4 GPUs on a single board seems cool and future-proof, but once you do it you'll realize how much of a heat-dissipation and power-delivery challenge it is. On power: if you're in the USA, a typical household 15 A / 120 V wall circuit can only supply 1440 W continuously, which is a bit below a safe total system power budget for a 4-GPU build (I would actually budget 1.6 kW to be safe). So plugging in or using anything else on that circuit (which may span multiple physical outlets in the house) will trip the breaker.
 

LenE

New Member
Jan 29, 2020
Thanks for the reply! The potential power draw was why I was looking at the lower TDPs of both the smaller Epyc and the RTX 2070 Super. I am in the USA, and had considered running a new 20 A circuit specifically to support this machine once it's fully built out.

I get that the 2080 Ti has more memory, but will it be worth more than double the price of a 2070 Super for deep learning? I have not used big models yet, because most of my training to this point has been CPU-only on a laptop. When I was able to borrow a laptop with a mobile 1070, model training went 50x faster. I expect any RTX card with tensor cores to be much faster still.

Perhaps my understanding of how multiple GPUs can be used is in error. I use TensorFlow, and my belief is that it can and will split training of models across multiple GPUs. In this way, using two GPUs might be ~1.9x faster than using one. Is my understanding wrong?

My affinity for AMD was driven by my four-GPU end vision for the system, and the PCIe lanes required. In full disclosure, I invest in companies that I believe in, and own both Nvidia and AMD stock, and no stake in Intel. That is secondary to my distaste for Intel's pricing and market segmentation.

MKL is a big boost for Intel. My hope is that as AMD continues retiring its debt, it can again invest in better tailoring its compiler tools and specialized libraries.

If you were building a new system as a research box housed in a residential setting, what exactly would you build today?
 

Cixelyn

Researcher
Nov 7, 2018
San Francisco
Multi-GPU training requires that the model / training loop be specifically written to support distributed training.
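For TensorFlow (which you mentioned you use), the usual opt-in is tf.distribute.MirroredStrategy. Here's a minimal sketch with a made-up toy model and random data, just to show where the strategy goes -- not a tuned example:

```python
import numpy as np
import tensorflow as tf

# MirroredStrategy replicates the model onto every local GPU and
# all-reduces gradients between them; the code has to opt in explicitly.
strategy = tf.distribute.MirroredStrategy()
print("Replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu", input_shape=(512,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="sgd",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# Random toy data; each replica gets a slice of every global batch.
x = np.random.randn(1024, 512).astype("float32")
y = np.random.randint(0, 10, size=(1024,))
model.fit(x, y, batch_size=256, epochs=1)
```

Without the strategy scope, the same script just trains on one GPU, which is why two cards don't automatically give you ~1.9x.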

You can "kind of" do unified memory via an NVLink bridge, but there are a whole host of caveats related to getting that to work cleanly, including the fact that in naive implementations you will have 2x the memory but only 1x the compute.

Re: the last question, we "technically" build "research box" systems in "residential settings" today, with extreme caveats on each of those quoted words, haha. I've been meaning to write it up in detail online at some point... but the general gist is that we shove Titan RTXs into a small rack of ESC8000 G4s. This requires running multiple 240 V lines from the laundry room.

We're currently in the process of doing a new test 4-GPU build on the Rome platform. We have the G242-Z10 planned for that.
 

LenE

New Member
Jan 29, 2020
Whoa! That’s really cool, but way more than I’m looking to do at this point.

I thought that by looking at the 2070 Super instead of the higher-end Ti and Titan cards, I was hitting a cost-effective sweet spot for a hobbyist-level dedicated machine. Now I'm wondering if I'm just aiming too low.
 

Cixelyn

Researcher
Nov 7, 2018
San Francisco
It really depends on what sorts of problems you work on.

For traditional machine learning, simple classification, or even simple RL agent models, you should be able to get away with the 2070 Super on a local development machine. To be fair, a lot of Kaggle problems fall into this bucket.

If you're diving into any sort of large model (e.g. multi-scale image classification, large transformer models, generative image models, etc.) you'll often find yourself entirely VRAM bound (and a lot of github examples are going to just refuse to run on your card w/out significant tweaking or reduction of batch size).

Here's a possible suggestion: start with 1x 2070 and see how it goes.

I would definitely recommend against buying two 2070s, though. If you're at the point where you need a second 2070, then you're probably also at the point where you need a 2080 Ti or higher (Titan RTX, Quadro 6000, or Quadro 8000), at which point you should just get the bigger card rather than two small cards.
 

balnazzar

Active Member
Mar 6, 2019
I'll give you my two cents.

1. Lanes: you will be OK with 8 lanes per GPU. There will be no significant speed loss (i.e., you will be fine with a 40/48-lane processor).

2. CPU: Mind that AMD is still a bit behind on MKL performance. Alternatives are feasible (OpenBLAS), but expect some hassle. If you decide to go with AMD, the third-gen Threadrippers are costly as hell; better to go with Epyc, in my opinion. See the thread about the HPE deals -- you could get a 16-core Rome for $600. As for Intel, you can find Xeon Scalable qualification samples at good prices on eBay; it's a much more tested platform with a wider choice of motherboards.

3. RAM: another advantage of Epyc and Scalable is that you can use RDIMMs: less costly, higher density per module. Also, the more RAM you have, the more you want ECC and registered (buffered) modules.

4. GPU: The 2060 Super (8 GB) is the current price/performance champion (~$350). With the money you'd spend on a 2080 Ti, you can buy three of them and have the combined memory of a Titan. I have yet to encounter a model that refuses to run with PyTorch DataParallel (see the sketch after this list). Sure, you won't scale the computing power linearly, but you will always have the combined memory. If a model takes longer to train, you can just wait; if a model doesn't fit into memory, you have to compromise on important things like batch size or architecture. Think about big transformers or the bigger EfficientNets. Last but not least, learning to deal with parallelization is a valuable skill in itself.
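To show what I mean, here is a minimal DataParallel sketch (an invented toy model and random data, not code from any real project) -- wrapping the model is essentially a one-line change:

```python
import torch
import torch.nn as nn

# Toy model and data, only to show where the DataParallel wrapper goes.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
if torch.cuda.device_count() > 1:
    # Replicates the model on every visible GPU and splits each batch among them.
    model = nn.DataParallel(model)
model = model.cuda()

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

inputs = torch.randn(256, 512).cuda()           # one global batch, split across GPUs
targets = torch.randint(0, 10, (256,)).cuda()

optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()
optimizer.step()
```

Each replica only sees batch_size / num_gpus examples, which is why the per-GPU activation memory goes down (the weights are still replicated on every card).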
 

balnazzar

Active Member
Mar 6, 2019
I would definitely recommend against buying two 2070s, though. If you're at the point where you need a second 2070, then you're probably also at the point where you need a 2080 Ti or higher (Titan RTX, Quadro 6000, or Quadro 8000), at which point you should just get the bigger card rather than two small cards.
Could you elaborate? Two smaller cards (e.g. 8 GB class) do have more combined memory than a single 11 GB card, and cost considerably less.
Have you ever encountered a model that you couldn't run in parallel across multiple cards?
 

balnazzar

Active Member
Mar 6, 2019
You can "kind of" do unified memory via an NVLink bridge, but there are a whole host of caveats related to getting that to work cleanly, including the fact that in naive implementations you will have 2x the memory but only 1x the compute.
AFAIK you don't need NVLink to combine memory. I could very well be wrong (e.g. I do not use TensorFlow), but PyTorch DataParallel has always let me combine memory without NVLink -- just the plain PCI Express bus.
 

Cixelyn

Researcher
Nov 7, 2018
San Francisco
Could you elaborate? Two smaller cards (e.g. 8 GB class) do have more combined memory than a single 11 GB card, and cost considerably less.
Have you ever encountered a model that you couldn't run in parallel across multiple cards?
As an example, the old StyleGAN repo had 11 GB as a minimum (per worker GPU). And ideally, to accelerate that job, you'd have eight of these in a single chassis.

AFAIK you don't need NVLink to combine memory. I could very well be wrong (e.g. I do not use TensorFlow), but PyTorch DataParallel has always let me combine memory without NVLink -- just the plain PCI Express bus.
PyTorch DataParallel is not actually letting you combine memory; it's letting you run a larger batch size in a data-parallel fashion. Each GPU runs its own identical copy of the graph and just pulls training examples from the shared batch.

When I refer to combining memory, I'm referring to allowing a single CUDA kernel to pretend it has 2x the memory it actually physically has on the card. This is useful if your model is not trivially shardable in a data-parallel fashion; and oftentimes model parallelism is way more work than it's worth just for experimentation. Of course, for this to work you need either GPU-Direct (IIRC disabled in the 20xx series of consumer cards) or NVLink.
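For contrast, manual model parallelism in PyTorch looks roughly like the sketch below (layer sizes are invented and it assumes two visible GPUs): you place pieces of the network on different devices and copy activations between them by hand, which is exactly the extra work I'm talking about.

```python
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """Toy manual model parallelism: half of the layers live on each GPU."""
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(512, 1024), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(1024, 10)).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # Activations have to be copied between cards explicitly.
        return self.part2(x.to("cuda:1"))

model = TwoGPUModel()
out = model(torch.randn(64, 512))   # the output ends up on cuda:1
```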

Obviously I'm colored a bit by my own experiences (we have a small homelab that does research on large-scale image generation models), so take my recommendations with that in mind. But we did very rapidly retire every card <= 8 GB into an inference-only chassis and moved training onto >8 GB cards.
 

balnazzar

Active Member
Mar 6, 2019
PyTorch DataParallel is not actually letting you combine memory; it's letting you run a larger batch size in a data-parallel fashion. Each GPU runs its own identical copy of the graph and just pulls training examples from the shared batch.
That's true. But the end result, as inelegant as it is, is that memory occupancy per GPU is 1/n (or nearly so), where 1 is the occupancy it would have on a single GPU.

Allow me one question: how does one leverage true parallelism with PyTorch on professional cards connected with NVLink (as would be the case on a large AWS instance with four/eight V100s)?

EDIT: Doing a quick search, I found: https://devtalk.nvidia.com/default/.../can-nvlink-combine-2x-gpus-into-1x-big-gpu-/

The Nvidia guy says that NVLink is actually meant for quicker communication between GPUs for data parallelism, which is (as you correctly pointed out) exactly what DataParallel does.

When I refer to combining memory, I'm referring to allowing a single CUDA kernel to pretend it has 2x the memory it actually physically has on the card. This is useful if your model is not trivially shardable in a data-parallel fashion; and oftentimes model parallelism is way more work than it's worth just for experimentation. Of course, for this to work you need either GPU-Direct (IIRC disabled in the 20xx series of consumer cards) or NVLink.
But even NVLink on consumer cards (anything less than the RTX Titan) operates in "SLI mode" (that is, it mostly just exchanges synchronization signals for gaming), at least as far as I know.


Obviously I'm colored a bit by my own experiences (we have a small homelab that does research on large-scale image generation models), so take my recommendations with that in mind. But we did very rapidly retire every card <= 8 GB into an inference-only chassis and moved training onto >8 GB cards.
It is clear that 11 GB cards are better than 8 GB cards, but then there is the usual price/performance question. For ~$1000-1100, is it better to have one 2080 Ti, or three 2060 Supers to use in DataParallel (or for running multiple smaller experiments)?
I'd be inclined towards three 8 GB cards, at least for my use cases. I wasn't aware of that limit for StyleGAN, but I'd be curious to see other such examples.
That said, if budget is not an issue, one could go with a couple of RTX 8000s with 48 GB each, combined with NVLink (the real thing). If you make real money with deep learning, an investment of $12k in GPUs could be worth it.
 

LenE

New Member
Jan 29, 2020
I really appreciate this discussion, as it mirrors the internal debate in my head, but with valuable been-there-done-that experience from both of you.

At this point, I haven’t developed anything novel that would attract any attention from anyone looking for deep learning tech. Maybe that’s part of my problem, in that I am not in any way focused on solving a specific problem in a marketable way. I just want to build much more experience and understanding of the technology so that when I do eventually get a brilliant flash of forward vision, I will be better prepared to justify an investment into a heavyweight system with the proper insight into how it would be best configured.

If I were already on that path, I wouldn't be sweating the $3-5k cost that I'm budgeting for this system. Analysis paralysis even has me contemplating what may be around the corner this summer with Zen 3 and Ampere. Given the lack of price shifts for the RTX cards without direct AMD competition, I don't have much hope for Ampere being a cost-effective upgrade, but maybe its launch will bring the 2080 Ti price down.
 

balnazzar

Active Member
Mar 6, 2019
If the Pascal-to-Turing shift teaches us anything, I think you will just find used cards on eBay at marginally more approachable prices. The problem is Nvidia's de facto monopoly. Look at Maxwell --> Pascal: they doubled the onboard memory for each tier. Then they realized that people were putting 1080 Tis into servers and ruining the market for the Teslas. So for Pascal --> Turing we got the same memory amount per tier, despite a considerable price increase. $750 seemed too much for a 1080 Ti, but now we have to pay $1200 for the top-tier consumer card, with no memory increase.

I don't know if you are interested in Pascal, but I found out that Pascal cards are perfectly capable of operating in FP16, albeit with a very modest performance gain (5-10%). The effective memory is doubled nonetheless, and you can get an 11 GB card for $500.
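If you want to try it, half precision in PyTorch is mostly a matter of casting the model and the inputs. A bare-bones sketch with a made-up toy model; for real training you'd also want loss scaling so the FP16 gradients don't underflow:

```python
import torch
import torch.nn as nn

# Casting weights and activations to FP16 roughly halves the memory they use.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10)).cuda().half()
inputs = torch.randn(64, 512).cuda().half()
outputs = model(inputs)   # forward pass runs entirely in FP16
```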
 

Cixelyn

Researcher
Nov 7, 2018
San Francisco
The Nvidia guy says that NVLink is actually meant for quicker communication between GPUs for data parallelism, which is (as you correctly pointed out) exactly what DataParallel does.
You can also use it for CUDA unified memory. Obviously, yes, it's slower, but it still lets you do things without having to recode everything in a data-parallel way, while still being significantly faster than a full trip to system RAM. See this TF config option.
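If I remember the knob correctly, it's the per_process_gpu_memory_fraction field in the TF1-style GPUOptions: values above 1.0 are documented to turn on CUDA unified memory and let you oversubscribe the card. A sketch from memory, so double-check it against the docs:

```python
import tensorflow as tf

# TF1-style session config; a fraction > 1.0 is documented to enable
# CUDA unified memory, i.e. oversubscribing the physical GPU memory.
gpu_options = tf.compat.v1.GPUOptions(per_process_gpu_memory_fraction=2.0)
config = tf.compat.v1.ConfigProto(gpu_options=gpu_options)
sess = tf.compat.v1.Session(config=config)
```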

But even NVLink on consumer cards (anything less than the RTX Titan) operates in "SLI mode" (that is, it mostly just exchanges synchronization signals for gaming), at least as far as I know.
Incorrect -- GPU P2P is fully supported. See: RTX 2080Ti with NVLINK - TensorFlow Performance (Includes Comparison with GTX 1080Ti, RTX 2070, 2080, 2080Ti and Titan V)
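If you want to sanity-check P2P on your own box, recent PyTorch builds expose a small helper for it (assuming your version has it):

```python
import torch

# True if CUDA peer-to-peer access is possible between devices 0 and 1
# (needs a driver/topology where P2P is actually enabled).
if torch.cuda.device_count() >= 2:
    print(torch.cuda.can_device_access_peer(0, 1))
```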

I really appreciate this discussion, as it mirrors the internal debate in my head, but with valuable been-there-done-that experience from both of you.
Thanks LenE! I think the best thing to do, honestly, is to just get your feet wet with _any_ GPU regardless of what it is. Starting is definitely better than not starting; a very fast, prudent course of action could be to just shove a GPU into whatever desktop system you might already have. Start learning there, and then continue onwards!

I think as hardware geeks, sometimes we unnecessarily fetishize the system-building and specifications part. I'm definitely guilty of that myself ^^;; It's not worth being paralyzed by though; in the end the hardware is just the means to the end (i.e. running the interesting DL algorithms), and not the end itself.
 

LenE

New Member
Jan 29, 2020
NVLink seems to be an interesting option, and both the 2080 Ti and 2070 Super support it. Can NVLink be used between cards of different capabilities, or do they have to be identical?

I guess what I'm asking is: if I start with a lesser card like a 2070 and decide later to add a 2080 Ti, can they be linked? Similarly, to get any parallel benefit, will I be limited to the batch sizes that fit on the 2070, or could I step up to larger batches with peering?
 

nthu9280

Well-Known Member
Feb 3, 2016
San Antonio, TX
Following this thread... I only have a GTX 1060 6 GB model in my home workstation and am bumping into its limitations even with the fast.ai tutorials. I'll first muddle my way through and get a good foundation within those confines before looking at upgrade options.
 

balnazzar

Active Member
Mar 6, 2019
But look at the numbers: for ResNet-50, you get 776 images/s with NVLink versus 750 without it. It would be interesting to see a similar experiment on professional cards; I'd expect much more substantial gains.

I think as hardware geeks, sometimes we unnecessarily fetishize the system-building and specifications part. I'm definitely guilty of that myself ^^;;
True. I usually get stuck for months on every build I do, since every configuration seems unsatisfactory somehow.
 

LenE

New Member
Jan 29, 2020
Has anyone been able to get the 2-slot version of NVLink to work with the 20x0 RTX cards? The 3- and 4-slot NVLink bridges Nvidia sells seem to be an anti-consumer lever to push deep learning people away from their gamer cards and onto their much more expensive server/professional cards.

As far as I can tell, the max for an ATX-ish form factor will be a pair of 20x0 cards linked and a single card not linked to anything, whereas a 2-slot bridge would allow two linked pairs.
 

balnazzar

Active Member
Mar 6, 2019
Has anyone been able to get the 2-slot version of NVLink to work with the 20x0 RTX cards? The 3- and 4-slot NVLink bridges Nvidia sells seem to be an anti-consumer lever to push deep learning people away from their gamer cards and onto their much more expensive server/professional cards.

As far as I can tell, the max for an ATX-ish form factor will be a pair of 20x0 cards linked and a single card not linked to anything, whereas a 2-slot bridge would allow two linked pairs.
Yes, it has been done: see 4x RTX 2080 TI with Quadro Nvlink | Performance Test

Note that with the Quadro bridge the performance gains are more substantial, which is in line with what I said previously. Furthermore, it seems (not surprisingly) that the benefits of NVLink shrink as the amount of data exchanged between the GPUs goes down: in FP16 you get just some 10-11%, whereas in FP32 you get a good 20%. (In a 2-GPU config, 499 images/sec without NVLink versus 535 images/sec with NVLink is probably not worth the hassle and the price of the bridge.)

Some applications are more sensitive. See for example: 2 x RTX2070 Super with NVLINK TensorFlow Performance Comparison

The 2070 Supers are completely indifferent to NVLink in vision tasks, whereas the speedup with LSTMs seems substantial.

Finally, see NVIDIA NVLink Bridge Compatibility Chart.

As a side note, as far as I understand from reading the articles, those people are all using data parallelism. Please correct me if I misunderstood.
 