3090 driver handicap?


josh

Active Member
Oct 21, 2013
There have been reports of the 3090s being handicapped at the driver level by Nvidia for deep learning.

Can anyone with real-world experience confirm whether they're still worth the large bump in price over the 3080s?

If so, which models have the best airflow, so they don't thermally throttle in multi-GPU systems?

Thanks!
 

balnazzar

Active Member
Mar 6, 2019
If it's driver-capped, the handicap isn't all that severe.

Other benchmarks show it's a bit behind the A6000 (which has the same GPU but is a 300W design with GDDR6 rather than GDDR6X, and is not driver-capped).

I still think the 3090 is the current king in terms of price/performance. Two of them perform better than a single A6000, cost less, and give you the same total memory, while offering more flexibility for running multiple experiments and a chance to learn how to parallelize.
 
  • Like
Reactions: Patrick

josh

Active Member
Oct 21, 2013
If it's driver-capped, the handicap isn't all that severe.

Other benchmarks show it's a bit behind the A6000 (which has the same GPU but is a 300W design with GDDR6 rather than GDDR6X, and is not driver-capped).

I still think the 3090 is the current king in terms of price/performance. Two of them perform better than a single A6000, cost less, and give you the same total memory, while offering more flexibility for running multiple experiments and a chance to learn how to parallelize.
I'm thinking of starting with a single Gigabyte Turbo 3090 and then adding another down the road.
It's the only blower-style 3090 until the ASUS model becomes available, but once I buy a blower card I'll be in it for the long run, because these cards have terrible resale value.
Should I get a regular open-air card instead if I don't plan on scaling beyond 2x 3090s?
 

balnazzar

Active Member
Mar 6, 2019
I'm thinking of starting with a single Gigabyte Turbo 3090 and then adding another down the road.
It's the only blower-style 3090 until the ASUS model becomes available, but once I buy a blower card I'll be in it for the long run, because these cards have terrible resale value.
Should I get a regular open-air card instead if I don't plan on scaling beyond 2x 3090s?
TL;DR: yes, in my opinion. But bear in mind that you could change your mind in the future, and that will happen as soon as you hit a model that doesn't fit into VRAM with a decent batch size (batch size heavily affects model convergence, not just speed).
Also be aware that you cannot NVLink two cards of different heights.
If you still want to go with an open-air card, my vote is for the FE. Totally silent under full load, top-notch build quality.

EDIT: Actually, I misread. You do want to scale up to two cards. In that case I'd be a bit more cautious. You will have ~700W of heat dumped into your case, and the worst part is that the cards will end up preventing each other from cooling properly; the upper card will take the brunt of it. AFAIK, no one has managed to keep two RTX Titans under the throttling limit in a dual setup, and those were a 280W design.
Other components could suffer as well.
 

josh

Active Member
Oct 21, 2013
TL;DR: yes, in my opinion. But bear in mind that you could change your mind in the future, and that will happen as soon as you hit a model that doesn't fit into VRAM with a decent batch size (batch size heavily affects model convergence, not just speed).
Also be aware that you cannot NVLink two cards of different heights.
If you still want to go with an open-air card, my vote is for the FE. Totally silent under full load, top-notch build quality.

EDIT: Actually, I misread. You do want to scale up to two cards. In that case I'd be a bit more cautious. You will have ~700W of heat dumped into your case, and the worst part is that the cards will end up preventing each other from cooling properly; the upper card will take the brunt of it. AFAIK, no one has managed to keep two RTX Titans under the throttling limit in a dual setup, and those were a 280W design.
Other components could suffer as well.
The FEs were the ideal card for me. Unfortunately, Nvidia's decision to sell only through Best Buy has locked out that possibility for me entirely, as Best Buy will not ship to international forwarders at all. I wish this weren't the case, as the Gigabyte Turbo is impossible to find anywhere (even on Amazon) and the local distributors are pricing it 35% above the US MSRP.
 
  • Like
Reactions: T_Minus

Cixelyn

Researcher
Nov 7, 2018
We've had issues attempting to use 3090s in distributed training. I tweeted some graphs here. Not sure if this is driver handicapping or not.

Granted, this was in October of last year, so the driver situation might have improved a bit, but at the time the downclocking issue was severe enough that we ended up not building out any 3090 machines, as they were underperforming our Titan RTXs.
 
  • Like
Reactions: vv111y

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
We've had issues attempting to use 3090s in distributed training. I tweeted some graphs here. Not sure if this is driver handicapping or not.

Granted, this was in October of last year, so the driver situation might have improved a bit, but at the time the downclocking issue was severe enough that we ended up not building out any 3090 machines, as they were underperforming our Titan RTXs.
Do you have any left to re-test?
 

josh

Active Member
Oct 21, 2013
We've had issues attempting to use 3090s in distributed training. I tweeted some graphs here. Not sure if this is driver handicapping or not.

Granted, this was in October of last year, so the driver situation might have improved a bit, but at the time the downclocking issue was severe enough that we ended up not building out any 3090 machines, as they were underperforming our Titan RTXs.
What about 2-NVLinked 3090s vs 1 Titan?
 

balnazzar

Active Member
Mar 6, 2019
The FEs were the ideal card for me. Unfortunately, Nvidia's decision to sell only through Best Buy has locked out that possibility for me entirely, as Best Buy will not ship to international forwarders at all. I wish this weren't the case, as the Gigabyte Turbo is impossible to find anywhere (even on Amazon) and the local distributors are pricing it 35% above the US MSRP.
I paid 2,000 EUR per card on Amazon (EU) for two Turbos, since I needed them badly for my consulting work.
They stayed at that price for two days, then went up to 2,499 EUR, then disappeared.


We've had issues attempting to use 3090s in distributed training. I tweeted some graphs here. Not sure if this is driver handicapping or not.

Granted, this was in October of last year, so the driver situation might have improved a bit, but at the time the downclocking issue was severe enough that we ended up not building out any 3090 machines, as they were underperforming our Titan RTXs.
That's very strange. I don't have a Titan RTX to compare against, but you can see in any professional review that the 3090 comes out ahead of it, and, as an example of a non-professional, quickly reproducible real-world benchmark, you can look at eugeneware/benchmark-transformers.

What kind of workloads did you throw at them?
 

josh

Active Member
Oct 21, 2013
I paid 2,000 EUR per card on Amazon (EU) for two Turbos, since I needed them badly for my consulting work.
They stayed at that price for two days, then went up to 2,499 EUR, then disappeared.
Yeah, I haven't seen a Turbo come in stock on Amazon at a reasonable price since launch. The local distributor charges 2,900 SGD, so I guess that's still somewhat better than what you paid. It's still the priciest 3090 out there aside from the ASUS Strix, and a huge markup over the 1,499 USD MSRP (which is highly unrealistic anyway).
 
  • Sad
Reactions: balnazzar

Cixelyn

Researcher
Nov 7, 2018
Do you have any left to re-test?
lol I wish. The stock situation hasn't improved at all in the past 4 months. The whole parts shortage + scalper situation is driving me nuts; currently trying to get our hands on the AMD 5000 series for workstation builds and can't :confused:

What about 2-NVLinked 3090s vs 1 Titan?
It wasn't in consideration due to chassis constraints. Our current workloads use 8 Titans, and AFAIK it's not currently possible to shove 16x 3090s into a single chassis. Also, that would absolutely destroy our total power budget lol.

What kind of workloads did you throw at them?
I think in the tweeted graph we were benchmarking vanilla StyleGAN2 on the remote test system. All our primary workloads are massive convnets & GANs, so it's a good approximation of performance for us. We don't really do much transformer work here, so I can't comment on those benchmarks -- sorry!
 

balnazzar

Active Member
Mar 6, 2019
I think in the tweeted graph we were benchmarking vanilla StyleGAN2 on the remote test system. All our primary workloads are massive convnets & GANs, so it's a good approximation of performance for us. We don't really do much transformer work here, so I can't comment on those benchmarks -- sorry!
Hmm, if you can point me to some ready-made repository that I can run on my two 3090s without too much hassle, I'll run it.
Even better if you can provide a Titan RTX baseline with it.

Two further questions: how do you manage to keep the Titans under their throttling temperature? And do you NVLink them in pairs or just let them communicate over the PCIe bus?

Thanks!
 

balnazzar

Active Member
Mar 6, 2019
lol I wish. The stock situation hasn't improved at all in the past 4 months. The whole parts shortage + scalper situation is driving me nuts; currently trying to get our hands on the AMD 5000 series for workstation builds and can't :confused:
If you live in the EU, I can tell you where to find the 5000s, but at a premium over their list prices.
 

Cixelyn

Researcher
Nov 7, 2018
Hmm, if you can point me to some ready-made repository that I can run on my two 3090s without too much hassle, I'll run it.
Even better if you can provide a Titan RTX baseline with it.
The original TF-based StyleGAN2 repo is what we were using to test: config-f with batch size 32.

Two further questions: how do you manage to keep the Titans under their throttling temperature?
Chassis fans are all maxed. Rack also has a rear door with a giant maxed active fan.

And do you NVLink them in pairs or just let them communicate over the PCIe bus?
We have one system with pairs NVLinked and one without, but it didn't make a big enough difference for our particular workload, since we're more compute- than IO-bottlenecked.

If you live in the EU, I can tell you where to find the 5000s, but at a premium over their list prices.
California, USA unfortunately. Thanks for the offer, though!
 
  • Like
Reactions: balnazzar

larrysb

Active Member
Nov 7, 2018
You can (and in many cases should) use the nvidia-smi tool to set a power or clock limit on the cards. While this reduces performance, it also reduces heat and tends to increase reliability. The "pro" cards (Quadro) have lower power and thermal caps in firmware than their gaming equivalents: the Quadro RTX 6000 and the Titan RTX are essentially the same card, and the Titan RTX scores better in benchmarks because the Quadro RTX 6000 is power/clock capped.
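For reference, here's a minimal sketch of that kind of capping (assuming the standard nvidia-smi CLI is on the PATH and the script runs with root/admin rights; the GPU indices and the 300W figure are placeholders to adjust for your own cards):

```python
# Sketch: lower the software power limit on each card via nvidia-smi.
import subprocess

GPU_IDS = [0, 1]        # placeholder: the GPUs you want to cap
POWER_LIMIT_W = 300     # placeholder: e.g. 300W on a 350W 3090 costs little throughput

for gpu_id in GPU_IDS:
    # "-i" selects the GPU, "-pl" sets the power limit in watts
    subprocess.run(
        ["nvidia-smi", "-i", str(gpu_id), "-pl", str(POWER_LIMIT_W)],
        check=True,
    )
```

The same tool can also lock clocks (nvidia-smi -lgc) if you prefer capping frequency rather than power.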

But the Quadro has ECC, memory pooling, and it allows GPUDirect RDMA.

The Titan RTX has more bang for the buck, but the Quadro RTX 6000 can scale to more than two cards, and it has a blower cooler for dense packing and heat removal.

For scaling, GPUDirect RDMA is a big plus, since high-speed RDMA/RoCE cards are fairly inexpensive now and available secondhand. ConnectX-5 or better will DMA directly into Quadro and Tesla cards, but that's disabled on consumer gaming cards. Distributed GPU training across multiple systems over a high-speed fabric does scale pretty well, if you need it.

For a 2-GPU standalone workstation, it is hard to justify the considerable expense of Quadro-level cards. I've also been bitten by the gamer cards from Nvidia board partners enough to avoid them for deep learning work. Honestly, most of them are garbage: unreliable and poorly made. The Nvidia FE and Titan boards have been better in my experience. I'm not going to pay scalper prices for a 3090 FE, though. I'd rather go with an RTX Quadro for scaling reasons.
 

balnazzar

Active Member
Mar 6, 2019
But the Quadro has ECC, memory pooling, and it allows GPUDirect RDMA.
The Titan also supports memory pooling, at least in theory.

The Titan RTX has more bang for the buck, but the Quadro RTX 6000 can scale to more than two cards
You cannot NVLink more than two Quadros, and I don't think you can NVLink more than two A6000s/A40s either. On the other hand, the guy above packed 8 Titans into a single host. And GPUDirect RDMA is essentially useless if you have to stay within 4 cards.

I've also been bitten by the gamer cards from Nvidia board partners enough to avoid them for deep learning work. Honestly, most of them are garbage: unreliable and poorly made.
What kind of problems did you have with AIB cards?

The Nvidia FE and Titan boards have been better in my experience
It's very hard to keep a Titan RTX cool, let alone two of them inside the same case. On the other hand, I had no problems at all stacking four 2060 Supers (blower) into the same host, and they ran cool and reasonably quiet.

I'm not going to pay scalper prices for a 3090 FE, though. I'd rather go with an RTX Quadro for scaling reasons.
Consider that *two* 3090 FEs at scalpers' prices are still cheaper than a single A6000.
If you have to scale to >4 GPUs, that's an entirely different discussion.
 

Cixelyn

Researcher
Nov 7, 2018
Consider that *two* 3090 FEs at scalpers' prices are still cheaper than a single A6000.
If you have to scale to >4 GPUs, that's an entirely different discussion.
Yeah, in our experience the scaling breakpoints are:

- 1 GPU: easiest, you don't need to write any special model or data parallel code and stuff "just works" as long as you have enough VRAM.
- 4 GPUs: no longer fits in a single EATX chassis _and_ you've maxed out an entire 15A standard USA socket (rough arithmetic right after this list).
- 8 GPUs: no longer fits in a single 4U chassis, so all code has to be multi-node capable now, or you need PCIe extenders. Note this limit is 16 if you're rich AF and can just get a 16x V100 or A100 DGX node.
- 24ish GPUs: you've pretty much maxed out the thermal and power capacity of a normal 10KW colo cabinet, so hopefully you know what you're doing.
- 72ish GPUs: you've maxed out an entire 42U rack (assuming you left space for switches 'n stuff) and are pretty darn close to the thermal + power capacity of a 35KW chilled rear-door cabinet.
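To put rough numbers on the 4-GPU / 15A breakpoint above (back-of-the-envelope only, assuming 120V mains, the usual 80% continuous-load rule, stock 350W card limits, and a guessed figure for the rest of the system):

```python
# Back-of-the-envelope: why ~4 GPUs saturates a standard 15A / 120V US circuit.
volts = 120
amps = 15
continuous_factor = 0.8                        # don't load a circuit to 100% continuously

usable_w = volts * amps * continuous_factor    # ~1440 W usable budget
gpu_w = 4 * 350                                # four 3090s at their stock power limit
rest_w = 300                                   # rough guess: CPU, drives, fans, PSU losses

print(f"budget {usable_w:.0f} W vs draw ~{gpu_w + rest_w} W")   # 1440 W vs ~1700 W
```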

In our opinion, the Quadros only start to make sense if a) you need to go >8 GPUs for a _single_ job, or b) you're at 2-8 GPUs and your IO characteristics are bad enough that you take a 2x speed penalty without NVLink (which in our experience only really happens when doing fancy model-parallel stuff).
 
  • Like
Reactions: balnazzar

balnazzar

Active Member
Mar 6, 2019
- 1 GPU: easiest, you don't need to write any special model or data parallel code and stuff "just works" as long as you have enough VRAM.
- 4 GPUs: no longer fits in a single EATX chassis _and_ you've maxed out an entire 15A standard USA socket
I'm commenting from a workstation-ish viewpoint rather than a server one.

Four cards easily fit into a chassis without even needing EATX: just buy four Turbos and a regular ATX case/board.
The wall sockets here in the EU are 10A/230V and 16A/230V, so they're not really a problem. In the US, you can just use the power limiter (300W per card will barely touch real-world performance) or limit yourself to 3 cards.

As for parallelism, if you are content with data parallelism you don't really have to write any special code except for two additional lines (PyTorch). Model parallelism is an entirely different matter.
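For reference, a minimal sketch of those two extra lines (assuming a plain PyTorch model; the model and batch here are placeholders):

```python
# Sketch: single-host data parallelism in PyTorch with nn.DataParallel.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))  # placeholder model

model = nn.DataParallel(model)    # extra line 1: replicate across all visible GPUs
model = model.cuda()              # extra line 2: move parameters to GPU 0 (replicas are made per forward pass)

x = torch.randn(64, 512).cuda()   # each forward pass splits the batch across the cards
out = model(x)                    # outputs are gathered back on GPU 0
```

For multi-node setups you'd switch to DistributedDataParallel instead, but within one workstation this is all it takes.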
 

larrysb

Active Member
Nov 7, 2018
I've certainly set up workstations with 4x GPUs. I used the same motherboard that Nvidia did in their DGX-series workstations to build them, and I even use the same EVGA 1600W power supply. (The Corsair Carbide 540 case is excellent for this, by the way.)

I used the Titan V in that 4x format as well. However, not only was NVLink software-disabled on those cards (the fingers are there, and electrically connected!), they were also severely clock-limited by the driver in GPU compute mode. I even wrote Jensen an email about it, and he responded!

The RTX series' two-card NVLink limit puts a bit of a damper on the number of cards you can really make use of, even if you have the slots and PLX'd PCIe lanes to work with. In a workstation situation, you might as well go with 2x GPUs if you need it.

I hit limits on VRAM, though, as our model complexity increased. Honestly, I could make use of the 48GB cards.

I can also only pull so many amps from a single outlet. Then there's the heat problem and the noise problem; even two RTX Titans put out a lot of heat.

Where the Quadros begin to shine is scaling beyond 2 GPUs. If you go to a distributed compute model with Horovod or other methods, it can be useful to pull in multiple workstations when they're available. With high-speed network cards of 25Gb or better and direct RDMA now readily available, GPU and storage scaling can be pretty good on a "mini-fabric".
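As a rough illustration of that distributed route, here's a minimal sketch in plain PyTorch (DistributedDataParallel rather than Horovod; assumes each workstation launches it with torchrun, and the model is a placeholder):

```python
# Sketch: multi-node data parallelism with PyTorch DDP over NCCL.
# Hypothetical launch on each node, e.g.:
#   torchrun --nnodes=2 --nproc_per_node=2 --node_rank=<0|1> \
#            --master_addr=<head-node-ip> --master_port=29500 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")      # rank/world size come from torchrun's env vars
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)   # placeholder model
    model = DDP(model, device_ids=[local_rank])

    # ...build a DataLoader with a DistributedSampler and train as usual...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Over a 25Gb+ RDMA-capable fabric, NCCL can use the fast interconnect (InfiniBand/RoCE) for the all-reduce traffic; without it you fall back to plain TCP.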

As for the partner cards, I've found them not to be so reliable in many cases. They're designed for gamers and tend to have overclocking enabled on everything, different throttling and fan curves, and honestly poor cooling solutions that give way to RGB lighting and plastic dress-up kits. A lot of cost optimization goes on compared to the reference designs so they can eke out a little more margin in competitive markets. Not all the FE cards are that great either, but the last couple of generations have generally been better than the AIBs in the long run. I've taken the occasional problem-child GPU out of service and run it in a game demo, and in many cases I start to see artifacts on screen, or glitches and mystery crashes. Early on in our process we thought, "hey, it's the same GPU on this card as the other and it's $x.xx cheaper!" But you live and learn that you never get more than you pay for, and you often get less.

The Quadros we've used have been generally very reliable. ECC memory is a plus in my book, and the more conservative clocking and throttling tends to make them more dependable. The last thing you want after 50 hours of model training is a GPU-related crash or a situation that leads you to suspect memory corruption. (BTDT.) I just wish they weren't so ripping expensive.

On the new-generation RTX A6000, the virtualization support is an interesting prospect. There are many cases where it would be nice to have one honking big GPU that can be virtualized out to several workloads, especially for development. Even on a single workstation I can think of good reasons to do that. I'd like to play with it and see how well it works with several containerized workloads, each with its own vGPU. Sharing a GPU through virtualization could be useful in other HPC applications as well, especially on a high-speed local network.

I readily agree that scaling beyond 4 physical GPUs in a typical office room, much less in a single computer chassis, can be pretty tough. I became real popular with my office landlord after popping circuit breakers with the load imposed by multiple GPU workstations.