Automotive A100 SXM2 for FSD? (NVIDIA DRIVE A100)


gsrcrxsi

Active Member
Dec 12, 2018
Yes, it can be used on the Supermicro AOM-SXMV board
Hmm, interesting. Was there any need for some kind of modification or configuration to the AOM-SXMV board to get them recognized?

I wouldn't have expected a fringe GPU like the Drive A100 to work when even a P100 doesn't work on it.
 

Leiko

New Member
Aug 15, 2021
Hmm, interesting. Was there any need for some kind of modification or configuration to the AOM-SXMV board to get them recognized?

I wouldn't have expected a fringe GPU like the Drive A100 to work when even a P100 doesn't work on it.
IIRC they work out of the box on the adapters (on Linux, at least). I expect them to do the same on the AOM-SXMV. I will try.
 

Leiko

New Member
Aug 15, 2021
Will add, for anyone thinking of trying it, that:
- the heatsink mount is not standard SXM2 (might be SXM4?)
- the cards are indeed missing a few SMs (108 → 96)
- the Linux driver works out of the box, but the Windows driver needs to be modded
 

xdever

New Member
Jun 29, 2021
Oh, the heatsink sounds bad. I got a card from eBay, but I have not yet been able to test it. How did you solve the problem? Were you able to mount it somehow?
 

xdever

New Member
Jun 29, 2021
Oh, the heatsink sounds bad. I got a card from eBay, but I have not yet been able to test it. How did you solve the problem? Were you able to mount it somehow?
By overlaying pictures of this card and a V100, it looks like it should be easy to drill a new set of holes in the heatsink, which would solve the problem. I'm still curious about your solution.
 

Leiko

New Member
Aug 15, 2021
By overlaying pictures of this card and a V100, it looks like it should be easy to drill a new set of holes in the heatsink, which would solve the problem. I'm still curious about your solution.
In China, custom brackets are sometimes sold with the card. One seller was selling some standalone, but they were expensive. I'm almost sure the width is SXM4, and I've seen pics of SXM4 coolers mounted on these cards.

Worst case, you can always just bolt down one side of the heatsink, as it doesn't rely on pressure.
 

xdever

New Member
Jun 29, 2021
OK, I got it working in my Chinese adapter (with the 5 V modified to come directly from the power supply). The fan thermal sensor had to be removed and its trace cut, because unlike the V100 and P100, there is no hole where the sensor sits. I mounted an SXM2 heatsink. The difference is that the holes are 4 mm farther apart than on SXM2. I made a slot with a bit of filing, so now I can use the heatsink for both the V100 and the A100.

Performance-wise, using the Triton matmul example, it is 7% slower than a real A100 SXM4 40 GB at the biggest size in BF16. In real-world testing, it's 25% slower than the real A100 for small (200M-parameter) Transformer training, but even so it is 2x as fast as a 3090. The comparison might be a bit unfair, because the A100 SXM4 is in a DGX with a much more powerful and newer CPU than my desktop with the Drive A100 and the 3090, although this should have minimal influence. Also, my desktop uses PCIe gen 3, while the DGX uses gen 4.
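A quick sanity check on those numbers: with 96 of the full GA100's 108 SMs enabled, the expected compute deficit at equal clocks would be about 11%, which brackets the measured 7% matmul gap (equal clocks and equal per-SM throughput are assumptions here, not measurements):

```python
# Expected compute deficit from the reduced SM count alone,
# assuming equal clocks and equal per-SM throughput.
full_sms = 108   # A100 SXM4
drive_sms = 96   # Drive A100

deficit = 1 - drive_sms / full_sms
print(f"Expected deficit: {deficit:.1%}")  # Expected deficit: 11.1%
```

The measured matmul gap being smaller than the SM deficit suggests the biggest sizes are partly memory-bound, so throughput doesn't scale linearly with SM count.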

Cooling the card silently is very challenging. Currently I have a server fan alu-taped to the heatsink, and it sounds like a jet engine. 8 cm Noctua fans do not provide anywhere near enough cooling power to keep the card cool. I'd like to hear any suggestions on how to cool it silently without water cooling.
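For scale, a back-of-the-envelope airflow estimate (assuming a 20 K air temperature rise through the fin stack, plus standard air density and specific heat) shows why small quiet fans struggle at this power level:

```python
# Airflow needed to carry away 400 W: Q = P / (rho * c_p * dT).
P = 400.0      # W, sustained draw reported in the thread
rho = 1.2      # kg/m^3, air density at room temperature
c_p = 1005.0   # J/(kg*K), specific heat of air
dT = 20.0      # K, assumed air temperature rise through the heatsink

q_m3s = P / (rho * c_p * dT)
q_cfm = q_m3s * 2118.88  # 1 m^3/s = 2118.88 CFM
print(f"{q_cfm:.0f} CFM")  # 35 CFM
```

35 CFM sounds modest, but it must be forced through a dense server fin stack, i.e. delivered at high static pressure, which is exactly where quiet 8 cm axial fans fall short.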

The idle power consumption of the Drive A100 is 48W compared to 20W for the 3090.
 

gsrcrxsi

Active Member
Dec 12, 2018
OK, I got it working in my Chinese adapter (with the 5 V modified to come directly from the power supply). The fan thermal sensor had to be removed and its trace cut, because unlike the V100 and P100, there is no hole where the sensor sits. I mounted an SXM2 heatsink. The difference is that the holes are 4 mm farther apart than on SXM2. I made a slot with a bit of filing, so now I can use the heatsink for both the V100 and the A100.

Performance-wise, using the Triton matmul example, it is 7% slower than a real A100 SXM4 40 GB at the biggest size in BF16. In real-world testing, it's 25% slower than the real A100 for small (200M-parameter) Transformer training, but even so it is 2x as fast as a 3090. The comparison might be a bit unfair, because the A100 SXM4 is in a DGX with a much more powerful and newer CPU than my desktop with the Drive A100 and the 3090, although this should have minimal influence. Also, my desktop uses PCIe gen 3, while the DGX uses gen 4.

Cooling the card silently is very challenging. Currently I have a server fan alu-taped to the heatsink, and it sounds like a jet engine. 8 cm Noctua fans do not provide anywhere near enough cooling power to keep the card cool. I'd like to hear any suggestions on how to cool it silently without water cooling.

The idle power consumption of the Drive A100 is 48W compared to 20W for the 3090.
Can you post a pic of the necessary heatsink modification?
 

xdever

New Member
Jun 29, 2021
Can you post a pic of the necessary heatsink modification?
This is the whole contraption for now: Drive A100.

Unfortunately, the power consumption regularly spikes to 400 W, and the card makes plenty of coil whine, maybe because the Chinese adapter doesn't have the NVLink connector populated, which carries a bunch of grounds. I'm also worried about the power connectors; perhaps I should add one more.

This power consumption explains why I killed some of the servers.
 

Leiko

New Member
Aug 15, 2021
This is the whole contraption for now: Drive A100.

Unfortunately, the power consumption regularly spikes to 400 W, and the card makes plenty of coil whine, maybe because the Chinese adapter doesn't have the NVLink connector populated, which carries a bunch of grounds. I'm also worried about the power connectors; perhaps I should add one more.

This power consumption explains why I killed some of the servers.
Looks like the old version; the latest one seems, weirdly enough, to have QS instead of CS engraved on the heat spreader. The QS version has more performance, going by the info I've seen on the Chinese internet.

The heatsink is an SXM4 mount; that's why you had to modify it.

The card also ISN'T a full GA100 implementation (the A100 isn't either, but it's closer): it has only 96 SMs and 4 HBM2 stacks.
 

xdever

New Member
Jun 29, 2021
Has anyone experienced the card falling off the bus when pushed to the max for ~5 minutes? I added two more power connectors, so current on the 12 V rail is definitely not an issue. It runs constantly above its 380 W TDP, the temperature sits at a constant 81 °C, and the memory is at 70 °C. The thermal shutdown temperature is 95 °C, and it is definitely not reaching that. I found no way to check the VRM temperatures.
 

gsrcrxsi

Active Member
Dec 12, 2018
Falling off the bus is usually from PCIe issues, in my experience (with GPUs in general, not this card specifically). Sometimes it's from sketchy/defective power cables.

Check the logs in the OS: do you see evidence of PCIe errors before or when it drops?

How is the card attached to the host? Are you using any risers on the PCIe-SXM adapter?
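For anyone chasing the same symptom, the driver logs these events as NVRM Xid lines in dmesg; here is a small sketch to pull them out (the log format is taken from the output quoted in this thread):

```python
import re

# NVRM Xid lines in dmesg look like:
#   NVRM: Xid (PCI:0000:09:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+), (.*)")

def find_xid_events(log_text):
    """Return (pci_addr, xid_code, message) for every NVRM Xid event in the log."""
    return [(m.group(1), int(m.group(2)), m.group(3))
            for m in map(XID_RE.search, log_text.splitlines()) if m]

sample = "[19058.484857] NVRM: Xid (PCI:0000:09:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus."
print(find_xid_events(sample))
```

Feed it the output of `dmesg`; Xid 79 is the "GPU has fallen off the bus" event, and any PCIe AER messages shortly before it would point at the link rather than power.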
 

xdever

New Member
Jun 29, 2021
There are no PCIe errors in dmesg. Only:

```
[19058.484849] NVRM: GPU at PCI:0000:09:00: GPU-11469866-5ce5-2380-8eb6-a7e74ce1d5dc
[19058.484857] NVRM: Xid (PCI:0000:09:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[19058.484861] NVRM: GPU 0000:09:00.0: GPU has fallen off the bus.
```


The PCIe-SXM adapter is plugged directly into the motherboard without a riser, and it is a PCIe gen 3 board and CPU, which is less finicky than PCIe gen 4. It always happens after roughly the same amount of time, which would be explained by overheating, except that I don't see any sign of it.

Has anyone seen this card run in an adapter without the NVLink port populated? I'm running it in this kind of adapter, with the thermal sensor moved to the side, two more 12 V ports added, and the 5 V coming directly from the PC PSU instead of the weak SMPS the adapter came with. My fear is that the unpopulated NVLink port has tons of ground pins, and I'm not sure the grounds on the PCIe side are enough. They probably are, as there are more of them than the 12 V and 5 V lines together. Additionally, there are zero capacitors on the board, but it seems the fancy adapter here doesn't have any either.
 

gsrcrxsi

Active Member
Dec 12, 2018
Can you post a picture of how you added more connectors? I have a feeling that your adapter board just can't supply enough power through the PCB, regardless of how many power connectors you add externally.

You could test whether power delivery is the culprit: just power-limit the card to 200 W or so and see if you have the same problem.
 

xdever

New Member
Jun 29, 2021
18
0
1
I have a feeling that your adapter board just can't supply enough power through the PCB, regardless of how many power connectors you add externally.
I have the same fear, but the ground planes are quite big. The top plane near the edge is +12 V and the bottom is GND, so I just scratched away the solder mask and put the connector on the edge (there is actually a thin GND trace right next to the edge, but the 12 V is next to it and the pins can easily reach it). I added it at the closest possible point to the SXM PCI+power connector to minimize wire lengths.

The card doesn't seem to support power limiting; even querying power information doesn't work properly. Should it support it? If yes, is this failed power management on the card or a bad driver?

I also noticed that the power draw reported by nvidia-smi jumps around a lot, occasionally above 420 W. Thermal throttling doesn't work either: at first I had a weaker fan, and the card just shut down at 95 °C instead of starting to throttle at 85 °C like standard GPUs do.

```
$ nvidia-smi -q -d POWER
Timestamp                     : Mon Sep 9 16:24:55 2024
Driver Version                : 535.183.01
CUDA Version                  : 12.2
...
GPU 00000000:09:00.0
    GPU Power Readings
        Power Draw            : 44.16 W
        Current Power Limit   : N/A
        Requested Power Limit : N/A
        Default Power Limit   : N/A
        Min Power Limit       : N/A
        Max Power Limit       : N/A
    Power Samples
        Duration              : Not Found
        Number of Samples     : Not Found
        Max                   : Not Found
        Min                   : Not Found
        Avg                   : Not Found
    Module Power Readings
        Power Draw            : N/A
        Current Power Limit   : N/A
        Requested Power Limit : N/A
        Default Power Limit   : N/A
        Min Power Limit       : N/A
        Max Power Limit       : N/A

$ sudo nvidia-smi -i 1 -pl 200
Changing power management limit is not supported in current scope for GPU: 00000000:09:00.0.
All done.
```
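Since all the limits read N/A, one way to at least catch the spikes is to poll the instantaneous draw and keep the peak. This sketch uses standard `nvidia-smi` query fields (`power.draw`, `temperature.gpu`); how faithfully this particular card reports them is an open question:

```python
import subprocess
import time

def parse_sample(line):
    """Parse one 'power.draw, temperature.gpu' CSV line, e.g. '425.30, 81'."""
    power_s, temp_s = line.split(",")
    return float(power_s), float(temp_s)

def watch(seconds=60.0, interval=0.2):
    """Poll nvidia-smi and report the peak power draw seen (spikes are short)."""
    peak = 0.0
    deadline = time.time() + seconds
    while time.time() < deadline:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=power.draw,temperature.gpu",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True).stdout.strip()
        power, _temp = parse_sample(out.splitlines()[0])
        peak = max(peak, power)
        time.sleep(interval)
    print(f"Peak draw over {seconds:.0f}s: {peak:.1f} W")

# watch()  # run on the machine with the card installed
```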


pcie_connector.jpg
 

gsrcrxsi

Active Member
Dec 12, 2018
362
119
43
Power jumping around is probably normal.

Hard to say whether the GPU or the adapter is the culprit. I would try to get a different adapter that can handle the power you intend to supply. RGL, on the SXM2-to-PCIe thread, posted a link to some engineering boards that are better made and stated to work with this GPU.
 

xdever

New Member
Jun 29, 2021
18
0
1
It looks to me like the power spikes get bigger as the card heats up. In the beginning it barely reaches 400 W, while at the end it jumps to around 425 W. Does anybody know whether this card is supposed to support power limits and throttling?