3090 driver handicap?


josh

Active Member
Oct 21, 2013
621
196
43
Although theoretically there is some limited DDR3 support for LGA 2011-3, I've never seen it in the wild.

DDR4 isn't all that expensive anymore; at least the slower kinds like DDR4-2400 are ~$45 for a 16GB stick. A dual E5 v4 should have 8 sticks for maximum speed, so around $360 for 128GB of DDR4. Compared to the cost of GPUs, that's pretty minimal. The higher-speed RAM you'll want for AMD (DDR4-3200) does cost more, however.
Yeah, but I have a ton of DDR3 from old setups that I paid only $11-13 per 16GB stick for. There are a couple of dual-socket Chinese X99 motherboards that take DDR3, and the dual-CPU setup also means far more PCIe lanes.
 
  • Haha
Reactions: balnazzar

josh

Active Member
Oct 21, 2013
621
196
43
Do I need a separate card for the monitors if I'm using the Ampere card for deep learning? Does it reduce training throughput if I'm driving 6-8 monitors over DisplayPort daisy-chaining?
 

funkywizard

mmm.... bandwidth.
Jan 15, 2017
849
402
63
USA
ioflood.com
Do I need a separate card for the monitors if I'm using the Ampere card for deep learning? Does it reduce training throughput if I'm driving 6-8 monitors over DisplayPort daisy-chaining?
If you're just doing standard desktop stuff, the impact should be little or nothing.
 

josh

Active Member
Oct 21, 2013
621
196
43
If you're just doing standard desktop stuff, the impact should be little or nothing.
Occasional games. I currently use a standalone RX 580, but I'm thinking of selling it given the premium people are offering for these cards.
 

balnazzar

Active Member
Mar 6, 2019
221
30
28
Do I need a separate card for the monitors if I'm using the Ampere card for deep learning? Does it reduce training throughput if I'm driving 6-8 monitors over DisplayPort daisy-chaining?
It depends. It impacts very little during regular desktop work.
A ton of open browser tabs with hardware acceleration enabled will have a bigger impact.
And of course gaming will impact it even more.
What matters most is the amount of VRAM they occupy. Just buy a GT 710/1030, or use the integrated IPMI graphics if you have it. If you like gaming, a 1030 will be OK-ish at 1080p. You can buy them used on Amazon for €60-70.
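If you want to see how much VRAM the desktop stuff is actually holding on the compute card, a quick check like this works (rough sketch using the nvidia-ml-py / pynvml bindings; GPU index 0 is just an example):

# Rough sketch: list total VRAM usage and the processes holding it on GPU 0.
# Assumes the nvidia-ml-py (pynvml) bindings are installed: pip install nvidia-ml-py
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU index 0; adjust for your box

mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"VRAM used: {mem.used / 2**20:.0f} MiB of {mem.total / 2**20:.0f} MiB")

# Graphics clients (desktop compositor, hardware-accelerated browser tabs, games)
for p in pynvml.nvmlDeviceGetGraphicsRunningProcesses(handle):
    print(f"graphics pid {p.pid}: {(p.usedGpuMemory or 0) / 2**20:.0f} MiB")

# Compute clients (your training jobs)
for p in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    print(f"compute  pid {p.pid}: {(p.usedGpuMemory or 0) / 2**20:.0f} MiB")

pynvml.nvmlShutdown()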
 

iceisfun

Member
Jul 19, 2014
31
5
8
We're doing really well with the RTX 3090 for PyTorch and vision. It seems to be faster than RTX Titan and A100 SXM
 

iceisfun

Member
Jul 19, 2014
31
5
8
Come on, that's not possible.

Did you mean the V100 SXM?
No. We're trying to figure out whether it's just too early for PyTorch, a driver issue, or something else.

And by faster I meant lower latency per frame at small batch sizes, not more frames processed at large batch sizes.

|===============================+======================+======================|
|   0  A100-SXM4-40GB      On   | 00000000:01:00.0 Off |                    0 |
| N/A   29C    P0    62W / 400W |   6776MiB / 40536MiB |      2%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-SXM4-40GB      On   | 00000000:41:00.0 Off |                    0 |
| N/A   30C    P0    60W / 400W |   6716MiB / 40536MiB |      4%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-SXM4-40GB      On   | 00000000:81:00.0 Off |                    0 |
| N/A   26C    P0    57W / 400W |   6576MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  A100-SXM4-40GB      On   | 00000000:C1:00.0 Off |                    0 |
| N/A   27C    P0    60W / 400W |   6648MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
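For reference, the kind of measurement meant by "faster" above is per-request latency at batch size 1, along the lines of this rough sketch (torchvision's resnet50 is just a stand-in for the actual vision model; batch size and iteration counts are arbitrary):

# Minimal latency sketch: time single-image inference, not bulk throughput.
# Assumes torch + torchvision are installed; resnet50 is only a stand-in model.
import torch
import torchvision

model = torchvision.models.resnet50().eval().cuda()
x = torch.randn(1, 3, 224, 224, device="cuda")    # batch size 1 = latency case

with torch.no_grad():
    for _ in range(10):                           # warm-up (cuDNN autotune, etc.)
        model(x)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    times = []
    for _ in range(100):
        start.record()
        model(x)
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))     # milliseconds

times.sort()
print(f"median {times[len(times)//2]:.2f} ms, p95 {times[int(len(times)*0.95)]:.2f} ms")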
 
  • Like
Reactions: balnazzar

balnazzar

Active Member
Mar 6, 2019
221
30
28
|===============================+======================+======================|
|   0  A100-SXM4-40GB      On   | 00000000:01:00.0 Off |                    0 |
| N/A   29C    P0    62W / 400W |   6776MiB / 40536MiB |      2%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-SXM4-40GB      On   | 00000000:41:00.0 Off |                    0 |
| N/A   30C    P0    60W / 400W |   6716MiB / 40536MiB |      4%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-SXM4-40GB      On   | 00000000:81:00.0 Off |                    0 |
| N/A   26C    P0    57W / 400W |   6576MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  A100-SXM4-40GB      On   | 00000000:C1:00.0 Off |                    0 |
| N/A   27C    P0    60W / 400W |   6648MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
You bought the new DGX Station, huh? You lucky SOB :)
 

balnazzar

Active Member
Mar 6, 2019
221
30
28
And by faster I meant lower latency per frame at small batch sizes, not more frames processed at large batch sizes.

No. We're trying to figure out whether it's just too early for PyTorch, a driver issue, or something else.
It's possible. Try benchmarking with TensorFlow as a cross-check.
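Something minimal like this would do for the cross-check (rough sketch; Keras ResNet50 with random weights at batch size 1 is just a stand-in, only meant to compare per-inference latency between the frameworks):

# Rough TensorFlow cross-check: per-inference latency at batch size 1.
# ResNet50 with random weights is only a stand-in for the real model.
import time
import tensorflow as tf

model = tf.keras.applications.ResNet50(weights=None)
x = tf.random.normal([1, 224, 224, 3])

for _ in range(10):                     # warm-up
    model(x, training=False)

times = []
for _ in range(100):
    t0 = time.perf_counter()
    y = model(x, training=False)
    _ = y.numpy()                       # force the result back to the host
    times.append((time.perf_counter() - t0) * 1000.0)

times.sort()
print(f"median {times[len(times)//2]:.2f} ms")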
 

Cixelyn

Researcher
Nov 7, 2018
50
30
18
San Francisco
It seems to be faster than RTX Titan and A100 SXM
Come on, that's not possible.
I could definitely see an A100 having higher latency in specific cases, especially if the pinning isn't right and data has to traverse both a PCIe switch and an Infinity Fabric link before hitting the target GPU.

Just due to the nature of 3090 systems, they're much more likely to be single-CPU, with direct PCIe links between CPU and GPU and no switches in between. So I could believe lower latency for copy-bound tasks.

But in general, yeah, no way: the A100 should blow the 3090 out of the water on basically everything else.
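A quick way to check whether copies are the culprit is to time host-to-device transfers with pageable vs. pinned memory on both boxes (rough sketch; the frame size is arbitrary):

# Rough sketch: compare H2D copy latency with pageable vs pinned host memory.
import torch

def h2d_ms(host_tensor, iters=100):
    """Average host-to-device copy time in milliseconds."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    host_tensor.to("cuda", non_blocking=True)     # warm-up copy
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        host_tensor.to("cuda", non_blocking=True)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

x = torch.randn(1, 3, 1080, 1920)                 # ~24 MB frame, arbitrary size
print(f"pageable: {h2d_ms(x):.3f} ms")
print(f"pinned:   {h2d_ms(x.pin_memory()):.3f} ms")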
 

iceisfun

Member
Jul 19, 2014
31
5
8
I could definitely see an A100 having higher latency in specific cases, especially if the pinning isn't right and data has to traverse both a PCIe switch and an Infinity Fabric link before hitting the target GPU.

Just due to the nature of 3090 systems, they're much more likely to be single-CPU, with direct PCIe links between CPU and GPU and no switches in between. So I could believe lower latency for copy-bound tasks.

But in general, yeah, no way: the A100 should blow the 3090 out of the water on basically everything else.
Right, we were investigating NUMA domains and the like.

Also correct that our 3090 test rig is a single-processor i9-9900K, so a much simpler topology.

And we're talking about ~10 ms per request, which seems much higher than a NUMA issue could explain, given that the machine isn't under significant load.
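One cheap experiment along those lines (rough sketch, Linux-only; node 0 is just an example, use whatever node nvidia-smi topo -m reports as local to the GPU): pin the serving process to the cores of the GPU-local NUMA node before loading the model and see whether the ~10 ms moves.

# Rough sketch (Linux): pin this process to the CPUs of the NUMA node
# that is local to the GPU, then check whether request latency changes.
import os

def pin_to_numa_node(node: int) -> None:
    """Restrict the current process to the CPUs of one NUMA node."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpulist = f.read().strip()              # e.g. "0-15,32-47"
    cpus = set()
    for part in cpulist.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    os.sched_setaffinity(0, cpus)               # pid 0 = this process

pin_to_numa_node(0)   # node 0 is an example; use the GPU's local node
print("running on CPUs:", sorted(os.sched_getaffinity(0)))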
 

balnazzar

Active Member
Mar 6, 2019
221
30
28
I could definitely see an A100 having higher latency in specific cases, especially if the pinning isn't right and data has to traverse both a PCIe switch and an Infinity Fabric link before hitting the target GPU.

Just due to the nature of 3090 systems, they're much more likely to be single-CPU, with direct PCIe links between CPU and GPU and no switches in between. So I could believe lower latency for copy-bound tasks.
If you have the DGX A100, it shouldn't use any PCIe switch, since there are enough lanes for 4 GPUs. The same applies if you build your own server/workstation using a motherboard without PCIe switches.
 

Cixelyn

Researcher
Nov 7, 2018
50
30
18
San Francisco
If you have the DGX A100, it shouldn't use any PCIe switch.
From page 13 of the user manual:

[attached image: DGX A100 system topology diagram from the user manual]

The switch is actually mandatory since you need the ConnectX-6 cards directly linked to the GPU on the same PCIe Root in order for GPUDirect RDMA to work for large multi-node deployments. You'll notice that each GPU also has a direct path to one of the NICs on the far right regardless of where in the topology it is (GPU -> Switch -> Switch -> NIC)

I think the 4x A100 Redstone boards can be directly connected to the CPUs, but the Supermicro 2124GQ-NART still uses two CPUs making NUMA domains an issue.
 

larrysb

Active Member
Nov 7, 2018
108
49
28
I think the 4x A100 Redstone boards can be directly connected to the CPUs, but the Supermicro 2124GQ-NART still uses two CPUs making NUMA domains an issue.

Wow, the Supermicro 2124GQ-NART system manual is pretty confusing about which GPU is mapped to which CPU. They're numbered differently in IPMI and in Linux. It would take a while to sort out NUMA and processor affinity. Honestly, I don't know much about AMD, since I'm bottom-feeding on last-generation Intel hardware.

Just curious, what does nvidia-smi topo -m (or -mp) show?
 

iceisfun

Member
Jul 19, 2014
31
5
8
Machine (252GB total)
  Package L#0
    NUMANode L#0 (P#0 126GB)
    L3 L#0 (16MB)
      L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#32)
      L2 L#1 (512KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#33)
    L3 L#1 (16MB)
      L2 L#2 (512KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#34)
      L2 L#3 (512KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#35)
    L3 L#2 (16MB)
      L2 L#4 (512KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#36)
      L2 L#5 (512KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#37)
    L3 L#3 (16MB)
      L2 L#6 (512KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#38)
      L2 L#7 (512KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#39)
    L3 L#4 (16MB)
      L2 L#8 (512KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
        PU L#16 (P#8)
        PU L#17 (P#40)
      L2 L#9 (512KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
        PU L#18 (P#9)
        PU L#19 (P#41)
    L3 L#5 (16MB)
      L2 L#10 (512KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
        PU L#20 (P#10)
        PU L#21 (P#42)
      L2 L#11 (512KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
        PU L#22 (P#11)
        PU L#23 (P#43)
    L3 L#6 (16MB)
      L2 L#12 (512KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
        PU L#24 (P#12)
        PU L#25 (P#44)
      L2 L#13 (512KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
        PU L#26 (P#13)
        PU L#27 (P#45)
    L3 L#7 (16MB)
      L2 L#14 (512KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
        PU L#28 (P#14)
        PU L#29 (P#46)
      L2 L#15 (512KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
        PU L#30 (P#15)
        PU L#31 (P#47)
    HostBridge
      PCIBridge
        PCI 01:00.0 (3D)
          CoProc(OpenCL) "opencl0d0"
    HostBridge
      PCIBridge
        PCI 23:00.0 (SATA)
      PCIBridge
        PCI 24:00.0 (SATA)
    HostBridge
      PCIBridge
        PCI 41:00.0 (3D)
          CoProc(OpenCL) "opencl0d1"
      PCIBridge
        PCI 44:00.0 (SATA)
      PCIBridge
        PCI 45:00.0 (SATA)
    HostBridge
      PCIBridge
        PCI 61:00.0 (Ethernet)
          Net "eno1"
        PCI 61:00.1 (Ethernet)
          Net "eno2"
      PCIBridge
        PCIBridge
          PCI 64:00.0 (VGA)
  Package L#1
    NUMANode L#1 (P#1 126GB)
    L3 L#8 (16MB)
      L2 L#16 (512KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
        PU L#32 (P#16)
        PU L#33 (P#48)
      L2 L#17 (512KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
        PU L#34 (P#17)
        PU L#35 (P#49)
    L3 L#9 (16MB)
      L2 L#18 (512KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
        PU L#36 (P#18)
        PU L#37 (P#50)
      L2 L#19 (512KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
        PU L#38 (P#19)
        PU L#39 (P#51)
    L3 L#10 (16MB)
      L2 L#20 (512KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20
        PU L#40 (P#20)
        PU L#41 (P#52)
      L2 L#21 (512KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21
        PU L#42 (P#21)
        PU L#43 (P#53)
    L3 L#11 (16MB)
      L2 L#22 (512KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22
        PU L#44 (P#22)
        PU L#45 (P#54)
      L2 L#23 (512KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23
        PU L#46 (P#23)
        PU L#47 (P#55)
    L3 L#12 (16MB)
      L2 L#24 (512KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24
        PU L#48 (P#24)
        PU L#49 (P#56)
      L2 L#25 (512KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25
        PU L#50 (P#25)
        PU L#51 (P#57)
    L3 L#13 (16MB)
      L2 L#26 (512KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26
        PU L#52 (P#26)
        PU L#53 (P#58)
      L2 L#27 (512KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27
        PU L#54 (P#27)
        PU L#55 (P#59)
    L3 L#14 (16MB)
      L2 L#28 (512KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28
        PU L#56 (P#28)
        PU L#57 (P#60)
      L2 L#29 (512KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29
        PU L#58 (P#29)
        PU L#59 (P#61)
    L3 L#15 (16MB)
      L2 L#30 (512KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30
        PU L#60 (P#30)
        PU L#61 (P#62)
      L2 L#31 (512KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31
        PU L#62 (P#31)
        PU L#63 (P#63)
    HostBridge
      PCIBridge
        PCI 81:00.0 (3D)
          CoProc(OpenCL) "opencl0d2"
    HostBridge
      PCIBridge
        PCI a3:00.0 (SATA)
      PCIBridge
        PCI a4:00.0 (SATA)
    HostBridge
      PCIBridge
        PCI c1:00.0 (3D)
          CoProc(OpenCL) "opencl0d3"
      PCIBridge
        PCI c2:00.0 (NVMExp)
          Block(Disk) "nvme0n1"
      PCIBridge
        PCI c3:00.0 (NVMExp)
          Block(Disk) "nvme1n1"
      PCIBridge
        PCI c8:00.0 (SATA)
      PCIBridge
        PCI c9:00.0 (SATA)


Mon Mar 1 15:54:59 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      On   | 00000000:01:00.0 Off |                    0 |
| N/A   31C    P0    63W / 400W |   6822MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-SXM4-40GB      On   | 00000000:41:00.0 Off |                    0 |
| N/A   31C    P0    66W / 400W |   6822MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-SXM4-40GB      On   | 00000000:81:00.0 Off |                    0 |
| N/A   28C    P0    58W / 400W |   6718MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  A100-SXM4-40GB      On   | 00000000:C1:00.0 Off |                    0 |
| N/A   29C    P0    61W / 400W |   6810MiB / 40536MiB |      2%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
 

larrysb

Active Member
Nov 7, 2018
108
49
28
Doesn’t surprise me much. They really like playing the “market segmentation game” here in Silicon Valley.