3090 driver handicap?

josh · Feb 13, 2021

funkywizard said:
Although theoretically there is some limited ddr3 support for 2011-3, I've never seen it in the wild.

DDR4 isn't all that expensive anymore, at least the slower ones like ddr4-2400 are ~$45 for a 16gb stick. A dual E5v4 should have 8 sticks for maximum speed, so, around $360 for 128gb ddr4 ram. Compared to the cost of GPUs that's pretty minimal. The higher speed ram you'll want to use for AMD (ddr4-3200), does cost more, however.

Yea but I have a ton of ddr3 from old setups that I paid only 11-13 per 16gb stick. There's a couple of dual X99 Chinese motherboards that take DDR3 and the dual CPU setup also means unlimited PCIe lanes.

funkywizard · Feb 13, 2021

josh said:
Yea but I have a ton of ddr3 from old setups that I paid only 11-13 per 16gb stick. There's a couple of dual X99 Chinese motherboards that take DDR3 and the dual CPU setup also means unlimited PCIe lanes.

Sounds like a pretty good setup then.

josh · Feb 14, 2021

Do I need a separate card for the monitors if I'm using the ampere card for deep learning? Does it drop the training rates if I'm using 6-8 monitors over DP chaining?

funkywizard · Feb 15, 2021

josh said:
Do I need a separate card for the monitors if I'm using the ampere card for deep learning? Does it drop the training rates if I'm using 6-8 monitors over DP chaining?

If youre just doing standard desktop stuff, the impact should be little or nothing.

josh · Feb 15, 2021

funkywizard said:
If youre just doing standard desktop stuff, the impact should be little or nothing.

Occasional games. Currently use a standalone RX 580 but thinking of selling it with the premium people are offering for these cards.

balnazzar · Feb 15, 2021

josh said:
Do I need a separate card for the monitors if I'm using the ampere card for deep learning? Does it drop the training rates if I'm using 6-8 monitors over DP chaining?

It depends. It impacts very little during regular desktop work.
A ton of running browser tabs with hardware acceleration enabled will impact much more.
And of course gaming will impact even more.
Particularly, amount of occupied VRAM. Just buy a gt710/1030 or use the integrated IPMI graphics if you have it. If you like gaming, a 1030 will be ok-ish at 1080p. You can buy them used on amazon for 60-70eur.

iceisfun · Feb 27, 2021

We are doing awesome with the RTX 3090 for pytorch and vision. It seems to be faster than RTX Titan and A100 SXM

balnazzar · Feb 27, 2021

iceisfun said:
and A100 SXM

Come on, that's not possible.

Were you meaning the V100 SXM?

iceisfun · Feb 27, 2021

balnazzar said:
Come on, that's not possible.

Were you meaning the V100 SXM?

balnazzar · Feb 27, 2021

iceisfun said:
|===============================+======================+======================|
| 0 A100-SXM4-40GB On | 00000000:01:00.0 Off | 0 |
| N/A 29C P0 62W / 400W | 6776MiB / 40536MiB | 2% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 A100-SXM4-40GB On | 00000000:41:00.0 Off | 0 |
| N/A 30C P0 60W / 400W | 6716MiB / 40536MiB | 4% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 A100-SXM4-40GB On | 00000000:81:00.0 Off | 0 |
| N/A 26C P0 57W / 400W | 6576MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 A100-SXM4-40GB On | 00000000:C1:00.0 Off | 0 |
| N/A 27C P0 60W / 400W | 6648MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+

You bought the new DGX station, uh? You lucky SOB

balnazzar · Feb 27, 2021

iceisfun said:
And by faster I ment lower latency per frame processed in small batch sizes, not more frames processed in big batches.

No, we are trying to figure out if its just too early with pytorch or a driver issue or something else?

It's possible. Try and benchmark with tensorflow.

Cixelyn · Feb 27, 2021

iceisfun said:
It seems to be faster than RTX Titan and A100 SXM

balnazzar said:
Come on, that's not possible.

I could definitely see an A100 having higher latency in specific cases; especially if the pinning isn't right, and data has to traverse both a PCIe Switch as well as an Infinity Fabric link before hitting the target GPU.

Just due to the nature of 3090 systems, they're much more likely to be single CPU and have direct PCIe links between CPU and GPU with no switches inbetween. So I could believe lower latency for copy-bound tasks.

But in general, yeah no way the A100 should blow the 3090 out of the water on basically everything else.

iceisfun · Feb 27, 2021

Cixelyn said:
I could definitely see an A100 having higher latency in specific cases; especially if the pinning isn't right, and data has to traverse both a PCIe Switch as well as an Infinity Fabric link before hitting the target GPU.

Just due to the nature of 3090 systems, they're much more likely to be single CPU and have direct PCIe links between CPU and GPU with no switches inbetween. So I could believe lower latency for copy-bound tasks.

But in general, yeah no way the A100 should blow the 3090 out of the water on basically everything else.

Right, we were investigating numa domains and such.

Also correct that our 3090 test rig is a single processor i9 9900k, so much simpler topology.

And we're talking about ~10ms/req, it seems much higher than could be possible with a numa issue and the machine not being under significant load.

balnazzar · Feb 28, 2021

Cixelyn said:
I could definitely see an A100 having higher latency in specific cases; especially if the pinning isn't right, and data has to traverse both a PCIe Switch as well as an Infinity Fabric link before hitting the target GPU.

Just due to the nature of 3090 systems, they're much more likely to be single CPU and have direct PCIe links between CPU and GPU with no switches inbetween. So I could believe lower latency for copy-bound tasks.

If you have the dgx a100, it shouldn't use any pcie switch. That's due to the presence of a sufficient of lanes for 4 gpus. The same happens if you just build your own server/workstation using a motherboard without pcie switches.

Cixelyn · Feb 28, 2021

balnazzar said:
If you have the dgx a100, it shouldn't use any pcie switch.

From page 13 of the user manual:

The switch is actually mandatory since you need the ConnectX-6 cards directly linked to the GPU on the same PCIe Root in order for GPUDirect RDMA to work for large multi-node deployments. You'll notice that each GPU also has a direct path to one of the NICs on the far right regardless of where in the topology it is (GPU -> Switch -> Switch -> NIC)

I think the 4x A100 Redstone boards can be directly connected to the CPUs, but the Supermicro 2124GQ-NART still uses two CPUs making NUMA domains an issue.

larrysb · Feb 28, 2021

Cixelyn said:
I think the 4x A100 Redstone boards can be directly connected to the CPUs, but the Supermicro 2124GQ-NART still uses two CPUs making NUMA domains an issue.

Wow, the SuperMicro 2124GQ-NART system manual is pretty confusing about which GPU is mapped to what CPU. They're numbered differently in IPMI and Linux. It would take a while to sort out NUMA and processor affinity. Honestly, I don't know much about AMD, since I'm bottom feeding on last generation Intel hardware.

just curious what nvidia-smi topo -m (-mp) shows?

balnazzar · Mar 1, 2021

Cixelyn said:
but the Supermicro 2124GQ-NART still uses two CPUs making NUMA domains an issue.

Got it. I was under the (wrong) impression that he had the single-cpu dgx station a100.

iceisfun · Mar 1, 2021

Machine (252GB total)
Package L#0
NUMANode L#0 (P#0 126GB)
L3 L#0 (16MB)
L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#32)
L2 L#1 (512KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
PU L#2 (P#1)
PU L#3 (P#33)
L3 L#1 (16MB)
L2 L#2 (512KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
PU L#4 (P#2)
PU L#5 (P#34)
L2 L#3 (512KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
PU L#6 (P#3)
PU L#7 (P#35)
L3 L#2 (16MB)
L2 L#4 (512KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
PU L#8 (P#4)
PU L#9 (P#36)
L2 L#5 (512KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
PU L#10 (P#5)
PU L#11 (P#37)
L3 L#3 (16MB)
L2 L#6 (512KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
PU L#12 (P#6)
PU L#13 (P#38)
L2 L#7 (512KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
PU L#14 (P#7)
PU L#15 (P#39)
L3 L#4 (16MB)
L2 L#8 (512KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
PU L#16 (P#8)
PU L#17 (P#40)
L2 L#9 (512KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
PU L#18 (P#9)
PU L#19 (P#41)
L3 L#5 (16MB)
L2 L#10 (512KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
PU L#20 (P#10)
PU L#21 (P#42)
L2 L#11 (512KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
PU L#22 (P#11)
PU L#23 (P#43)
L3 L#6 (16MB)
L2 L#12 (512KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
PU L#24 (P#12)
PU L#25 (P#44)
L2 L#13 (512KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
PU L#26 (P#13)
PU L#27 (P#45)
L3 L#7 (16MB)
L2 L#14 (512KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
PU L#28 (P#14)
PU L#29 (P#46)
L2 L#15 (512KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
PU L#30 (P#15)
PU L#31 (P#47)
HostBridge
PCIBridge
PCI 01:00.0 (3D)
CoProc(OpenCL) "opencl0d0"
HostBridge
PCIBridge
PCI 23:00.0 (SATA)
PCIBridge
PCI 24:00.0 (SATA)
HostBridge
PCIBridge
PCI 41:00.0 (3D)
CoProc(OpenCL) "opencl0d1"
PCIBridge
PCI 44:00.0 (SATA)
PCIBridge
PCI 45:00.0 (SATA)
HostBridge
PCIBridge
PCI 61:00.0 (Ethernet)
Net "eno1"
PCI 61:00.1 (Ethernet)
Net "eno2"
PCIBridge
PCIBridge
PCI 64:00.0 (VGA)
Package L#1
NUMANode L#1 (P#1 126GB)
L3 L#8 (16MB)
L2 L#16 (512KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
PU L#32 (P#16)
PU L#33 (P#48)
L2 L#17 (512KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
PU L#34 (P#17)
PU L#35 (P#49)
L3 L#9 (16MB)
L2 L#18 (512KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
PU L#36 (P#18)
PU L#37 (P#50)
L2 L#19 (512KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
PU L#38 (P#19)
PU L#39 (P#51)
L3 L#10 (16MB)
L2 L#20 (512KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20
PU L#40 (P#20)
PU L#41 (P#52)
L2 L#21 (512KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21
PU L#42 (P#21)
PU L#43 (P#53)
L3 L#11 (16MB)
L2 L#22 (512KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22
PU L#44 (P#22)
PU L#45 (P#54)
L2 L#23 (512KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23
PU L#46 (P#23)
PU L#47 (P#55)
L3 L#12 (16MB)
L2 L#24 (512KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24
PU L#48 (P#24)
PU L#49 (P#56)
L2 L#25 (512KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25
PU L#50 (P#25)
PU L#51 (P#57)
L3 L#13 (16MB)
L2 L#26 (512KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26
PU L#52 (P#26)
PU L#53 (P#58)
L2 L#27 (512KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27
PU L#54 (P#27)
PU L#55 (P#59)
L3 L#14 (16MB)
L2 L#28 (512KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28
PU L#56 (P#28)
PU L#57 (P#60)
L2 L#29 (512KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29
PU L#58 (P#29)
PU L#59 (P#61)
L3 L#15 (16MB)
L2 L#30 (512KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30
PU L#60 (P#30)
PU L#61 (P#62)
L2 L#31 (512KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31
PU L#62 (P#31)
PU L#63 (P#63)
HostBridge
PCIBridge
PCI 81:00.0 (3D)
CoProc(OpenCL) "opencl0d2"
HostBridge
PCIBridge
PCI a3:00.0 (SATA)
PCIBridge
PCI a4:00.0 (SATA)
HostBridge
PCIBridge
PCI c1:00.0 (3D)
CoProc(OpenCL) "opencl0d3"
PCIBridge
PCI c2:00.0 (NVMExp)
Block(Disk) "nvme0n1"
PCIBridge
PCI c3:00.0 (NVMExp)
Block(Disk) "nvme1n1"
PCIBridge
PCI c8:00.0 (SATA)
PCIBridge
PCI c9:00.0 (SATA)

Mon Mar 1 15:54:59 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04 Driver Version: 460.27.04 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB On | 00000000:01:00.0 Off | 0 |
| N/A 31C P0 63W / 400W | 6822MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 A100-SXM4-40GB On | 00000000:41:00.0 Off | 0 |
| N/A 31C P0 66W / 400W | 6822MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 A100-SXM4-40GB On | 00000000:81:00.0 Off | 0 |
| N/A 28C P0 58W / 400W | 6718MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 A100-SXM4-40GB On | 00000000:C1:00.0 Off | 0 |
| N/A 29C P0 61W / 400W | 6810MiB / 40536MiB | 2% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+

josh · Mar 1, 2021

Blower-style RTX 3090 cards are disappearing, and that’s bad for prosumers

Versions of the powerful RTX 3090 with blower-style coolers are disappearing from the web, with ominous implications for prosumers who need workstations for work and play.

www.pcworld.com

Nvidia forcing AIBs to remove blower 3090s. Thankfully I already bought 2 of them

larrysb · Mar 1, 2021

Doesn’t surprise me much. They really like playing the “market segmentation game” here in Silicon Valley.

3090 driver handicap?

Active Member

mmm.... bandwidth.

Active Member

mmm.... bandwidth.

Active Member

Active Member

Member

Active Member

Member

Active Member

Active Member

Researcher

Member

Active Member

Researcher

Active Member

Active Member

Member

Active Member

Active Member