3080 deep learning rig questions

josh

Active Member
Oct 21, 2013
427
101
43
So I've managed to secure a 3080. Wondering if I need to upgrade the rest of the hardware for it.

I intend to stick it in a dual E5-2650 v2 Proxmox host that runs all the various other VMs for the homelab, with the GPU on passthrough. Way past EOL, yes, but great because DDR3 is much cheaper. The chips were also dirt cheap ($40 each), and the GPU does most of the heavy lifting anyway.

The next tier of hardware would be to upgrade to E5-2680 v4s. Not exactly optimal, as DDR4 costs add up and I'd have to upgrade RAM for VMs that don't really need it.

Another option would be to stick the card in my desktop rig, which might get upgraded to Threadripper after Black Friday. That means I'd have to keep the desktop on 24/7 when training models. Not ideal, but if it's the only option where the 3080 isn't bottlenecked, I'll have to do it.

Are there any people out there who run 3080/3090s with older gen hardware?

Thanks
 

larrysb

Member
Nov 7, 2018
43
13
8
Try it. I think you might get better results with a faster generation of CPU/mobo/memory; I'm running Xeon E5 v4 and v3 hardware on PLX'd motherboards with Turing cards. The main bottleneck is storage: NVMe really helps, and that might be reason enough to make the upgrade.

But if it works well enough, it works well enough.
 

Magic8Ball

Member
Nov 27, 2019
30
5
8
If you've ever followed Tim Dettmers' superb blog on GPUs and hardware, you may have noticed his (now outdated) PC recommendations, which advise a Ryzen 3600 system. With Black Friday looming and Zen 3 incoming, I expect you can pick up a real bargain on last-gen parts and build a fast system for not much money. It would be a pity if the system were the bottleneck with such a nice new GPU.

Also, don't underestimate the benefits of a dedicated ML box: you can build a system that's well designed for the purpose, bare metal is just simpler and easier (passthrough can be problematic), and when you train a model and max out the system resources, nothing else gets impacted.
 

larrysb

Member
Nov 7, 2018
43
13
8
I'm familiar with Tim's guide. My experience in running things 24x7 in my startup business was a bit different than his blog.

I think you'll find really good info on Puget Systems' HPC blog.

As far as CPUs go, there really isn't anything better than the Xeon E5-v4 series in the "used" parts bin, especially the low-core-count parts. They are incredibly reliable, don't run hot, and the NUMA architecture on a PLX'd motherboard allows for as many as 4 GPU cards with excellent memory/PCIe bandwidth. The Asus X99-E WS series motherboards are really good. Loaded up with a frequency-optimized Xeon and ECC memory, they're really hard to beat. I ran lots of 4x 1080 Ti systems on those. I also ran the Titan V in 4x. I was really ticked when I found that Nvidia clock-crippled them in compute mode: they were actually slower than the 1080 Ti on compute, especially FP32, and FP16 was only somewhat better than running some models in FP32 on the 1080 Ti. The Volta GPU really shines best at FP64, for other HPC use; most deep-learning models train nicely at FP16. The other big bottleneck on the Titan V is the lack of NVLink. NV did un-cripple the Titan V driver a little after I sold my last set of them.

The Turing cards were a better deal than the Titan V, with the Titan RTX being the best of the bunch. The extra VRAM helps a lot.

The other thing I found is that most power supplies are not up to the job, no matter how many watts they're rated for. Even some well-respected brands with ample power ratings don't hold up to the "surge" nature of DL training. It isn't like gaming or rendering, where there's a steady load: the DL batches get loaded into card memory and run in half- to one-second-long "surges". The EVGA SuperNOVA 1600W handles it well, with second place going to the Seasonic Prime 1300W. I've tried other PSUs with poor results, especially after many hours of work.

If you look inside the Volta version of the Nvidia DGX Station, it is built on an Asus X99-E WS/10G motherboard, an EVGA SuperNOVA 2 1600W power supply, and a Xeon E5-2699 v4 or 2698 v4 CPU. The Volta cards used in it are proprietary, but very much like the Quadro GV100 (16 GB, or 32 GB in later units, each), with special firmware and water-cooling blocks (EKWB, I'm pretty sure). The unobtainium part is the NVLink bridge that connects the four cards.

I'm running a few configs now, either 2x Titan RTX or 2x 2080 Ti blower cards. The fan coolers on the Titan RTX limit things a lot, and NVLink being only a 2-card arrangement limits things further. That's OK, because I'm also running Mellanox 25Gb Ethernet with RDMA, and that uses a full PCIe slot.

Even with just 2 cards in the workstation, at full speed, it will make the lights in the office blink. Two running systems with 2x GPUs each (which I run on the 25Gb network) will pop the 15 A breaker for my home office. I usually power-cap the GPUs with nvidia-smi to keep things cooler and not pop the breakers.
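For what it's worth, the capping itself is a one-liner per card; something like this (the 250 W figure is just an example, check the supported range for your card first):

```shell
# show the card's default and min/max supported power limits
nvidia-smi -q -d POWER | grep -i 'power limit'

# cap GPUs 0 and 1 at 250 W (example value; pick one inside the supported range)
sudo nvidia-smi -i 0 -pl 250
sudo nvidia-smi -i 1 -pl 250
```

Note the limit doesn't survive a reboot, so it usually ends up in a startup script.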

It looks like the current generation of AMD has a lot going for it; were I buying new, I'd probably be looking at them. The Xeon Skylake series run hot and have a funky PCIe root complex that didn't get along well with peer memory or RDMA. The Skylake HEDT (consumer) parts didn't pan out at all for me: hot-running, and nearly all the mobos are gamer-oriented, with all kinds of overclocking nonsense that makes for unstable systems.

For the OP's question: Just pop the 3080 in and see what it does. I'd guess it will do fine. You can compare benchmarks at the Puget Systems blog.
 
  • Like
Reactions: Magic8Ball

josh

Active Member
Oct 21, 2013
427
101
43
Thanks for the detailed replies, will spend some time pondering the new information.

Do you guys know a good chassis that will provide good airflow for a 3080 while we wait for 3080 blower cards to be released, or at least until the 3090 Turbo comes down in price (and/or availability improves)?

Is a tower the only way to go for triple-fan cards right now?
 

larrysb

Member
Nov 7, 2018
43
13
8
For fan cards, I like the Corsair Carbide 540. It's discontinued, but lots of them are still new in the box, and they're not overly expensive. It will easily fit larger CEB mobos too, with a slight mod to fit the oversize EVGA 1600 power supply. It's not a quiet case, as it is basically a cube with mesh screens all over it.

Impresses the ladies more than a new Corvette too. ;)

I like the Fractal Design Define 7 for quiet; CEB-size boards just fit in it. Very nicely made, with lots of options for cooling too, though I'm using blower cards in it.
 

josh

Active Member
Oct 21, 2013
427
101
43
For fan cards, I like the Corsair Carbide 540. It's discontinued, but lots of them are still new in the box, and they're not overly expensive. It will easily fit larger CEB mobos too, with a slight mod to fit the oversize EVGA 1600 power supply. It's not a quiet case, as it is basically a cube with mesh screens all over it.

Impresses the ladies more than a new Corvette too. ;)

I like the Fractal Design Define 7 for quiet; CEB-size boards just fit in it. Very nicely made, with lots of options for cooling too, though I'm using blower cards in it.
My mobo is E-ATX though, and I would require 8x 3.5" drive bays.

Currently looking at the Fractal Design XL (first edition, not the R2). The Fractals don't really have good airflow; they sacrificed it for noise levels. Are they good for these cards?

Which blower cards are you using? As far as I know there's only one 3090 with a blower and none for the 3070/3080.
 

Magic8Ball

Member
Nov 27, 2019
30
5
8
Still waiting for my 3080 :mad: but everything I've read suggests that unless you intend to undervolt them, airflow is extremely important to getting all the performance you just paid for. A lot of gamers are having to upgrade their cases to avoid the cards automatically down-clocking.

I suggest that if you intend to run it at full load for any length of time, you make airflow/thermals a priority; given the lack of blower cards, this probably means thinking about how to get the hot air out of the case fast enough.
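One quick way to check whether a card is actually down-clocking for thermal reasons is nvidia-smi's performance query (standard flags; assumes only that the Nvidia driver is installed):

```shell
# show active throttle reasons (look for 'SW Thermal Slowdown' / 'HW Slowdown')
nvidia-smi -q -d PERFORMANCE

# or log temperature, SM clock, and power draw every 5 s while training
nvidia-smi --query-gpu=temperature.gpu,clocks.sm,power.draw --format=csv -l 5
```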

Personally I'm going for a 4U rack chassis with a powerful fan wall, but my rack is nicely out of earshot.
 

josh

Active Member
Oct 21, 2013
427
101
43
Still waiting for my 3080 :mad: but everything I've read suggests that unless you intend to undervolt them, airflow is extremely important to getting all the performance you just paid for. A lot of gamers are having to upgrade their cases to avoid the cards automatically down-clocking.

I suggest that if you intend to run it at full load for any length of time, you make airflow/thermals a priority; given the lack of blower cards, this probably means thinking about how to get the hot air out of the case fast enough.

Personally I'm going for a 4U rack chassis with a powerful fan wall, but my rack is nicely out of earshot.
I'm thinking of moving to 4U eventually, but right now I only have one card and it seems like a waste of space for that.
 

Magic8Ball

Member
Nov 27, 2019
30
5
8
If I were going to buy a tower case for these GPUs, I'd definitely be looking at ones with fans at the very bottom that blow cold external air directly onto the card and PCIe slots from close proximity. I'd use high-static-pressure fans for that, and let the other fans in the case shift the bulk volume of air through.
 

josh

Active Member
Oct 21, 2013
427
101
43
Which 4U chassis are you guys looking at, btw? I'm trying to get one of the hot-swap Rosewills, but they don't seem to produce them anymore. I might just get a second-hand Dell/SM server, but those seem to squeeze in as many drive bays as possible (they're all 36-bay, with 12 at the back that eat into the GPU width).
 

larrysb

Member
Nov 7, 2018
43
13
8
If you're using them for deep learning or other GPU compute, you may well be better off setting a power or clock limit on them anyway. I haven't set hands upon a 3080 yet. The early benchmarks I've read weren't all that much better than the Turing cards, which makes me wonder if they are clock limited in compute mode, as the Titan-V driver was.
 

josh

Active Member
Oct 21, 2013
427
101
43
If you're using them for deep learning or other GPU compute, you may well be better off setting a power or clock limit on them anyway. I haven't set hands upon a 3080 yet. The early benchmarks I've read weren't all that much better than the Turing cards, which makes me wonder if they are clock limited in compute mode, as the Titan-V driver was.
I read that the 3090s are intentionally handicapped at the driver level, but I'm not sure if that extends to the 3080 and 3070.
 

larrysb

Member
Nov 7, 2018
43
13
8
Yeah, it would be really interesting to start 'nvidia-smi dmon' and then begin a GPU compute session, compared to the same with a graphics (gaming) session.

That's how I confirmed the issue on the Titan V. Even wrote Jensen Huang an email about it. He (or his office) responded! They subsequently un-crippled the Titan V compute clock, but I had moved on to the Turing cards by then.
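For anyone wanting to reproduce that check, the dmon invocation is along these lines (standard nvidia-smi options):

```shell
# sample GPU 0 once per second: p = power/temp, u = utilization, c = clocks
nvidia-smi dmon -i 0 -s puc -d 1
# run it once alongside a compute job and once alongside a graphics load,
# then compare the reported SM clocks between the two
```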
 

josh

Active Member
Oct 21, 2013
427
101
43
Do any of the AIBs matter when picking a GPU? I went for the cheapest I could get my hands on, which was the Gigabyte Eagle. But I sometimes wonder whether those huge coolers like the Gigabyte Aorus or the ROG Strix make any difference at all.
 

funkywizard

mmm.... bandwidth.
Jan 15, 2017
688
289
63
USA
ioflood.com
I'm familiar with Tim's guide. My experience in running things 24x7 in my startup business was a bit different than his blog.

I think you'll find really good info on Puget Systems' HPC blog.

As far as CPUs go, there really isn't anything better than the Xeon E5-v4 series in the "used" parts bin, especially the low-core-count parts. They are incredibly reliable, don't run hot, and the NUMA architecture on a PLX'd motherboard allows for as many as 4 GPU cards with excellent memory/PCIe bandwidth. The Asus X99-E WS series motherboards are really good. Loaded up with a frequency-optimized Xeon and ECC memory, they're really hard to beat. I ran lots of 4x 1080 Ti systems on those. I also ran the Titan V in 4x. I was really ticked when I found that Nvidia clock-crippled them in compute mode: they were actually slower than the 1080 Ti on compute, especially FP32, and FP16 was only somewhat better than running some models in FP32 on the 1080 Ti. The Volta GPU really shines best at FP64, for other HPC use; most deep-learning models train nicely at FP16. The other big bottleneck on the Titan V is the lack of NVLink. NV did un-cripple the Titan V driver a little after I sold my last set of them.

The Turing cards were a better deal than the Titan V, with the Titan RTX being the best of the bunch. The extra VRAM helps a lot.

The other thing I found is that most power supplies are not up to the job, no matter how many watts they're rated for. Even some well-respected brands with ample power ratings don't hold up to the "surge" nature of DL training. It isn't like gaming or rendering, where there's a steady load: the DL batches get loaded into card memory and run in half- to one-second-long "surges". The EVGA SuperNOVA 1600W handles it well, with second place going to the Seasonic Prime 1300W. I've tried other PSUs with poor results, especially after many hours of work.

If you look inside the Volta version of the Nvidia DGX Station, it is built on an Asus X99-E WS/10G motherboard, an EVGA SuperNOVA 2 1600W power supply, and a Xeon E5-2699 v4 or 2698 v4 CPU. The Volta cards used in it are proprietary, but very much like the Quadro GV100 (16 GB, or 32 GB in later units, each), with special firmware and water-cooling blocks (EKWB, I'm pretty sure). The unobtainium part is the NVLink bridge that connects the four cards.

I'm running a few configs now, either 2x Titan RTX or 2x 2080 Ti blower cards. The fan coolers on the Titan RTX limit things a lot, and NVLink being only a 2-card arrangement limits things further. That's OK, because I'm also running Mellanox 25Gb Ethernet with RDMA, and that uses a full PCIe slot.

Even with just 2 cards in the workstation, at full speed, it will make the lights in the office blink. Two running systems with 2x GPUs each (which I run on the 25Gb network) will pop the 15 A breaker for my home office. I usually power-cap the GPUs with nvidia-smi to keep things cooler and not pop the breakers.

It looks like the current generation of AMD has a lot going for it; were I buying new, I'd probably be looking at them. The Xeon Skylake series run hot and have a funky PCIe root complex that didn't get along well with peer memory or RDMA. The Skylake HEDT (consumer) parts didn't pan out at all for me: hot-running, and nearly all the mobos are gamer-oriented, with all kinds of overclocking nonsense that makes for unstable systems.

For the OP's question: Just pop the 3080 in and see what it does. I'd guess it will do fine. You can compare benchmarks at the Puget Systems blog.
Would probably get better results running these on 208v / 240v.

The psu will have a much easier time supplying its rated capacity when run at the higher input voltage.
 

larrysb

Member
Nov 7, 2018
43
13
8
Would probably get better results running these on 208v / 240v.

The psu will have a much easier time supplying its rated capacity when run at the higher input voltage.

Well, it isn't the power supply in the computer that's struggling. It's the power supply in the wall outlet and circuit breaker box! LOL

Yes, it would be better to run them on 240v/30A power, like an air conditioner, oven or clothes dryer.

The point being, there are physical limits to how much GPU computing you can do in a home or office, given the typical 120v/15A limits of residential and office buildings.
 

larrysb

Member
Nov 7, 2018
43
13
8
Do any of the AIBs matter when picking a GPU? I went for the cheapest I could get my hands on, which was the Gigabyte Eagle. But I sometimes wonder whether those huge coolers like the Gigabyte Aorus or the ROG Strix make any difference at all.

Sigh, well, my experience with gaming cards is that it's pretty much pot luck. You won't get more than you pay for, that much is certain, and a lot of them are built with RGB lighting as the most prominent selling point other than price. The Founders Edition cards have been pretty good for the last couple of generations. The less overclocking, the better.

The professional cards cost an arm and a leg and are usually slower than the gaming cards. But they tend to be more reliable and intended to run 24x7 under compute loads. They also support some software features that aren't enabled on the gaming cards.
 

funkywizard

mmm.... bandwidth.
Jan 15, 2017
688
289
63
USA
ioflood.com
Well, it isn't the power supply in the computer that's struggling. It's the power supply in the wall outlet and circuit breaker box! LOL

Yes, it would be better to run them on 240v/30A power, like an air conditioner, oven or clothes dryer.

The point being, there are physical limits to how much GPU computing you can do in a home or office, given the typical 120v/15A limits of residential and office buildings.
The PSU could better tolerate voltage sag and more closely match its rated output at the higher input voltage. Of course, you're right that an even larger benefit is that a 208/240V circuit will be rated for more total watts as well, giving you more headroom there.
 

larrysb

Member
Nov 7, 2018
43
13
8
Well, so far, the limit for me has been the wall outlet, not the PSU.

Typical US wiring for homes and offices:
120V × 15A = 1800W

Any higher draw than that and the breaker or fuse pops at the panel. Usually, all the outlets in a single bedroom or office are on one 15A breaker.

Two computers, each with a pair of high-end GPUs, can easily exceed that much power and pop the breaker. Makes you really popular with the office landlord. :oops:

I had to cap the power limits on the GPU's in software so we wouldn't pop breakers.
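To put rough numbers on it, here's a small back-of-the-envelope sketch; the per-GPU cap and host-overhead figures are illustrative assumptions, and the 80% continuous-load derating is the usual US electrical-code rule of thumb:

```python
def circuit_budget_w(volts=120, amps=15, derate=0.8):
    """Usable watts on a circuit, with an 80% continuous-load derating."""
    return volts * amps * derate

def gpus_that_fit(gpu_cap_w=250, host_overhead_w=300, volts=120, amps=15):
    """How many power-capped GPUs fit alongside one host on a circuit."""
    usable = circuit_budget_w(volts, amps)
    return max(0, int((usable - host_overhead_w) // gpu_cap_w))

print(circuit_budget_w())                 # 1440.0 W usable on a 120 V / 15 A circuit
print(gpus_that_fit())                    # 4 capped cards plus one host
print(gpus_that_fit(volts=240, amps=30))  # 21 on a dryer-style 240 V / 30 A circuit
```

By this model, two boxes with two 250 W-capped GPUs each (plus ~300 W of host overhead apiece) is 1600 W, already past the 1440 W budget, which lines up with the popped breakers.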
 
  • Like
Reactions: funkywizard