Automotive A100 SXM2 for FSD? (NVIDIA DRIVE A100)


xdever

Member
Jun 29, 2021
34
4
8
I can confirm NVLink does not work, because they are missing the traces on these GPUs.
How certain are you (or the Chinese forums, which seem to be up to date) that this is because of the missing traces versus the Nvlink clock being different from 156.25MHz, which is the default for the P100 and V100? If the board is meant to be used in a dual card setup, at least some links should be present, right?

To be clear, I don't know what the clock for the A100 should be, but I can imagine it's different.
 
Last edited:

jenapper

New Member
May 21, 2023
10
7
3
How certain are you (or the Chinese forums, which seem to be up to date) that this is because of the missing traces versus the Nvlink clock being different from 156.25MHz, which is the default for the P100 and V100? If the board is meant to be used in a dual card setup, at least some links should be present, right?

To be clear, I don't know what the clock for the A100 should be, but I can imagine it's different.
They seem to be pretty certain the traces are missing. I actually asked whether it's possible to get it working somehow; their feedback was that they have had success by migrating the whole chip to a PCIe donor board. I assume they tested the circuits on the SXM2 board before going to that length, but that's all the info I have.
 

jenapper

New Member
May 21, 2023
10
7
3
I also have another interesting development on my side...

Out of the 4 QS I have received, 1 of them was faulty. I have shipped it back and the seller has prepared another unit for exchange, and it seems like they will be sending me the CS version.

I will keep an eye out on any differences (if any) I can notice on my system.
 

Leiko

New Member
Aug 15, 2021
29
6
3
I also have another interesting development on my side...

Out of the 4 QS I have received, 1 of them was faulty. I have shipped it back and the seller has prepared another unit for exchange, and it seems like they will be sending me the CS version.

I will keep an eye out on any differences (if any) I can notice on my system.
Mine have been sitting in the corner of my room for months now; I hope none are faulty.
 

aosudh

Member
Jan 25, 2023
58
15
8
I finally had a chance to test the Bykski water block. I can't get the GPU above 50 degrees Celsius, but it still crashes under sufficient load. With the following Python code, the power draw goes up to 460W+:

Python:
import torch
import torch.nn.functional as F

# two 32768x32768 bf16 matrices; F.linear computes a @ w.T in a tight loop,
# keeping the tensor cores fully loaded
a = torch.randn(32768, 32768, device="cuda", dtype=torch.bfloat16)
w = torch.randn(32768, 32768, device="cuda", dtype=torch.bfloat16)

print("START")
with torch.no_grad():
    while True:
        F.linear(a, w)

The VRMs are significantly hotter than the core, but it doesn't seem catastrophic. Does anybody have an idea what is happening?

View attachment 41781
I currently have approximately 64 of these A100 PROD cards in a cluster using GPUDirect RDMA. When running inference tasks, the power consumption is normal. However, during training the cards do drop out. Monitoring showed this is caused by excessively high HBM temperature: the HBM soon reaches 110 degrees Celsius while the core exceeds 80 degrees Celsius. I removed the top cover of the card and reapplied the thermal paste, and everything is normal now.
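For reference, the load in the quoted script is easy to quantify: each F.linear call is a 32768x32768 by 32768x32768 matmul, so iterations per second translate directly into sustained TFLOPS. A back-of-the-envelope sketch in plain Python (the 312 TFLOPS peak is NVIDIA's published dense bf16 tensor-core figure for the A100; treat it as an upper bound, not a measured number):

```python
# Rough FLOP math for the stress loop quoted above.
N = 32768                      # square matrix dimension used in the script
flops_per_call = 2 * N ** 3    # one multiply + one add per inner-product term

tflops_per_call = flops_per_call / 1e12
print(f"{tflops_per_call:.1f} TFLOP per F.linear call")  # ~70.4 TFLOP

# A100 peak dense bf16 tensor-core throughput is ~312 TFLOPS, so even at
# theoretical peak the loop only manages a few iterations per second:
peak_tflops = 312
print(f"~{peak_tflops / tflops_per_call:.1f} iterations/s at theoretical peak")
```

In other words, the loop keeps the tensor cores saturated back-to-back with essentially no idle time, which is consistent with the 460W+ draw.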
 

gsrcrxsi

Active Member
Dec 12, 2018
420
141
43
I currently have approximately 64 of these A100 PROD cards in a cluster using GPUDirect RDMA. When running inference tasks, the power consumption is normal. However, during training the cards do drop out. Monitoring showed this is caused by excessively high HBM temperature: the HBM soon reaches 110 degrees Celsius while the core exceeds 80 degrees Celsius. I removed the top cover of the card and reapplied the thermal paste, and everything is normal now.
Using the script there, I had the GPU drop out when the core was below 80C and the HBM was below 60C.
 

jenapper

New Member
May 21, 2023
10
7
3
I currently have approximately 64 of these A100 PROD cards in a cluster using GPUDirect RDMA. When running inference tasks, the power consumption is normal. However, during training the cards do drop out. Monitoring showed this is caused by excessively high HBM temperature: the HBM soon reaches 110 degrees Celsius while the core exceeds 80 degrees Celsius. I removed the top cover of the card and reapplied the thermal paste, and everything is normal now.
Interesting...

How are you getting the HBM temperatures?

So you delidded the heat spreader and reapplied the thermal paste underneath right?
 

aosudh

Member
Jan 25, 2023
58
15
8
Using the script there, I had the GPU drop out when the core was below 80C and the HBM was below 60C.
This card doesn't have any downclocking strategy. Once either the core or the HBM exceeds its temperature limit, the card malfunctions and the system freezes.
 

aosudh

Member
Jan 25, 2023
58
15
8
Interesting...

How are you getting the HBM temperatures?

So you delidded the heat spreader and reapplied the thermal paste underneath right?
This temperature can be seen through other SMI commands; what nvidia-smi shows by default is just the core temperature. I completely removed the integrated heat spreader (IHS) from the core. After removal, I found that the thermal paste inside had dried out and become ineffective, but the removed IHS had also deformed and could no longer be used, so I made the radiator contact the core directly.
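For anyone else chasing the memory-sensor readout mentioned above: on reasonably recent drivers, nvidia-smi's query interface can report the HBM temperature separately from the core. A sketch (sensor availability depends on driver version and board support, so verify the output on your own system):

```shell
# Full temperature section; supported boards show a "Memory Current Temp" line
nvidia-smi -q -d TEMPERATURE

# Or poll core temperature, power, and SM clock once per second while the
# stress loop runs, to catch a spike just before a drop-out
nvidia-smi --query-gpu=temperature.gpu,power.draw,clocks.sm --format=csv -l 1
```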
 

gsrcrxsi

Active Member
Dec 12, 2018
420
141
43
This card doesn't have any downclocking strategy. Once either the core or the HBM exceeds its temperature limit, the card malfunctions and the system freezes.
That's the point I was making. The card did not exceed any temperature limit, since it was below 80C on the core and 60C on the HBM.
 

xdever

Member
Jun 29, 2021
34
4
8
That's the point I was making. The card did not exceed any temperature limit, since it was below 80C on the core and 60C on the HBM.
I'm pretty sure it depends on the temperature limit, because if I make the radiator contact worse or lower the pump speed, it dies under much less load. Maybe the temperature spikes briefly and the card dies before nvidia-smi can show the issue.
 

CyklonDX

Well-Known Member
Nov 8, 2022
1,484
500
113
benefits of A100 vs V100
If you run in lower-precision modes, then you can beat the V100 hands down.
You can run vGPU on both*, but the A100 is a bit better there, especially when it comes to RT.
(*yes, you can run the A100 for games with vGPU, as long as you pass a never-released Quadro's PCI ID, or a GRID's)
// those PCI IDs exist, all you have to do is find them - I typically use the DeviceHunt site.


if you do BOINC I'm sure you can get some benefits there too.
 

gsrcrxsi

Active Member
Dec 12, 2018
420
141
43
For Einstein it's about 60% faster/more productive than a V100.

I haven't tested many other projects extensively. I ran a bunch of PrimeGrid tests on their tasks, and they are "just OK" IMO, but PrimeGrid likes fast clock speeds, lots of cores, and a large L2 cache.

Will test GPUGRID when they come back with Quantum Chemistry work, which has always benefited from FP64 and lots of VRAM.

The A100-Drive is best for things that are memory constrained, use FP64 heavily, or both, or for AI/ML loads that can take advantage of the newer data types introduced with Ampere.
 
Last edited:

TomGhostSmith

New Member
Mar 14, 2025
1
0
1
If it is just LLM inference, this A100 32G card is not much faster than a V100 32G.
Really? I heard that tensor cores on the V100 only support FP16, but on the A100 they support TF32. Actually, I'm just wondering if I should choose this A100 rather than the V100, since I'm going to run model inference in FP32.
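To clarify the precision point: A100 tensor cores don't run true FP32. They run TF32, which keeps FP32's 8-bit exponent but only 10 mantissa bits (FP16-level precision). A small pure-Python sketch of what that rounding costs (this truncates the low 13 mantissa bits; real hardware rounds rather than truncates, so this slightly overstates the error):

```python
import math
import struct

def to_tf32(x: float) -> float:
    """Truncate an FP32 value to TF32 precision (10 mantissa bits)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits &= ~((1 << 13) - 1)   # drop the low 13 of FP32's 23 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]

print(to_tf32(math.pi))                           # 3.140625, vs 3.14159...
print(abs(to_tf32(math.pi) - math.pi) / math.pi)  # relative error ~3e-4
```

PyTorch exposes this trade-off via the `torch.backends.cuda.matmul.allow_tf32` flag: with it enabled, "FP32" matmuls on the A100 run on the tensor cores at TF32 precision, which is where the large speedup over the V100 comes from. Whether ~3 decimal digits of matmul precision is acceptable depends on the model.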