> I thought running too hot hurt the HBM, is that the case?

From what I have read from various sources, damage starts kicking in at 110C and usually occurs around 120C.
> I can confirm NVLink does not work, because they are missing the traces on these GPUs.

How certain are you (or the Chinese forums, which seem to be up to date) that this is because of the missing traces, versus the NVLink clock being different from 156.25MHz, which is the default for the P100 and V100? If the board is meant to be used in a dual-card setup, at least some links should be present, right?
> How certain are you (or the Chinese forums, which seem to be up to date) that this is because of the missing traces, versus the NVLink clock being different from 156.25MHz, which is the default for the P100 and V100? If the board is meant to be used in a dual-card setup, at least some links should be present, right?

They seem to be pretty certain the traces are missing. I actually asked whether it's possible to get it working somehow; their feedback was that they have had success by migrating the whole chip to a PCIe donor board. I assume they have tested the circuits on the SXM2 board if they have gone to that length, but that's all the info I have.
To be clear, I don't know what the clock for the A100 should be, but I can imagine it's different.
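If anyone wants to sanity-check this from software, a rough way is to ask NVML for the per-link state. Below is a minimal sketch assuming the nvidia-ml-py (pynvml) package and that the link-state query is supported on these boards, so treat it as a probe rather than a definitive test; if the traces really are missing, every link should come back inactive or "Not Supported".

Python:
# Minimal probe of NVLink link state via NVML (assumes `pip install nvidia-ml-py`).
# If the traces are physically absent, every link should report inactive or raise
# "Not Supported"; an active link would point at a clock/config issue instead.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU; adjust the index as needed

for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
    try:
        state = pynvml.nvmlDeviceGetNvLinkState(handle, link)
        active = state == pynvml.NVML_FEATURE_ENABLED
        print(f"link {link}: {'active' if active else 'inactive'}")
    except pynvml.NVMLError as err:
        print(f"link {link}: {err}")

pynvml.nvmlShutdown()

On a board where the links do work, `nvidia-smi nvlink -s` should tell the same story.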
> I also have another interesting development on my side... Out of the 4 QS I have received, 1 of them was faulty. I have shipped it back and the seller has prepared another unit for exchange, and it seems like they will be sending me the CS version. I will keep an eye out for any differences (if any) I notice on my system.

Mine have been sitting in the corner of my room for months now; I hope none are faulty.
> Mine have been sitting in the corner of my room for months now; I hope none are faulty.

Why not use them?
> Why not use them?

The rest of the setup isn't ready yet. It almost is, though.
> I finally had a chance to test the Bykski water block. I can't make the GPU exceed 50 degrees Celsius, but it still crashes under sufficient load. With the following Python code, the power usage goes up to 460W+:
>
> Python:
> import torch
> import torch.nn.functional as F
>
> a = torch.randn(32768, 32768, device="cuda", dtype=torch.bfloat16)
> w = torch.randn(32768, 32768, device="cuda", dtype=torch.bfloat16)
> print("START")
> with torch.no_grad():
>     while True:
>         F.linear(a, w)
>
> The VRMs are significantly hotter than the core, but it doesn't seem catastrophic. Does anybody have an idea what is happening?
>
> View attachment 41781

I currently have approximately 64 of these A100 PROD cards in a cluster using GPUDirect RDMA. When running inference tasks, the power consumption is normal. However, during training the cards do drop out. Monitoring shows this is caused by excessively high HBM temperature: the HBM soon reaches 110 degrees Celsius, while the core exceeds 80 degrees Celsius. I removed the top cover of the card and reapplied the thermal paste, and everything is normal now.
> I currently have approximately 64 of these A100 PROD cards in a cluster using GPUDirect RDMA. When running inference tasks, the power consumption is normal. However, during training the cards do drop out. Monitoring shows this is caused by excessively high HBM temperature: the HBM soon reaches 110 degrees Celsius, while the core exceeds 80 degrees Celsius. I removed the top cover of the card and reapplied the thermal paste, and everything is normal now.

Using the script there, I had the GPU drop out when the core was less than 80C and the HBM was less than 60C.
> I currently have approximately 64 of these A100 PROD cards in a cluster using GPUDirect RDMA. When running inference tasks, the power consumption is normal. However, during training the cards do drop out. Monitoring shows this is caused by excessively high HBM temperature: the HBM soon reaches 110 degrees Celsius, while the core exceeds 80 degrees Celsius. I removed the top cover of the card and reapplied the thermal paste, and everything is normal now.

Interesting...
> Using the script there, I had the GPU drop out when the core was less than 80C and the HBM was less than 60C.

This card doesn't have any downclocking strategy. Therefore, once either the core or the HBM exceeds its temperature limit, the card will malfunction and the system will freeze.
> Interesting...

The HBM temperature can be seen through other SMI commands; what nvidia-smi shows is just the core temperature. I completely removed the Integrated Heat Spreader (IHS) from the core. After removal, I found that the thermal paste inside had become ineffective. However, the removed IHS had also deformed and could no longer be reused, so I made the cooler contact the core directly.
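For reference, here is roughly what reading both sensors looks like from Python. It is a minimal sketch assuming the nvidia-ml-py (pynvml) package; the NVML_FI_DEV_MEMORY_TEMP field and the .value.uiVal accessor are based on recent NVML releases, and older drivers may not expose the memory sensor at all.

Python:
# Minimal sketch: read core and HBM temperature through NVML
# (assumes `pip install nvidia-ml-py`; the memory-temperature field and the
# value accessor may differ on older driver/library versions).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

core_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
hbm = pynvml.nvmlDeviceGetFieldValues(handle, [pynvml.NVML_FI_DEV_MEMORY_TEMP])[0]
print(f"core: {core_c} C, HBM: {hbm.value.uiVal} C")

pynvml.nvmlShutdown()

`nvidia-smi -q -d TEMPERATURE` should also print a memory-temperature line on these data-center parts when the driver exposes it.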
> I removed the top cover of the card and reapplied the thermal paste, and everything is normal now.

How are you getting the HBM temperatures?
So you delidded the heat spreader and reapplied the thermal paste underneath, right?
Do you have some instructions/photos on how to remove it safely?
> This card doesn't have any downclocking strategy. Therefore, once either the core or the HBM exceeds its temperature limit, the card will malfunction and the system will freeze.

That's the point I was making. The card did not exceed any temperature limit, since it was below 80C on the core and 60C on the HBM.
> That's the point I was making. The card did not exceed any temperature limit, since it was below 80C on the core and 60C on the HBM.

I'm pretty sure it does depend on temperature, because if I worsen the radiator contact or lower the pump speed, it dies with much less load. Maybe the temperature spikes briefly and the card dies before nvidia-smi would show the issue.
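One way to test that theory is to poll the sensors much faster than a manual nvidia-smi refresh while the stress script runs in another process. Below is a minimal logging sketch with the same nvidia-ml-py and memory-temperature-field assumptions as the snippet earlier; the file name and poll interval are arbitrary.

Python:
# Minimal sketch: log core/HBM temperature and power several times per second
# while the stress script runs in another process, to catch short spikes that
# a manual nvidia-smi refresh might miss. Assumes `pip install nvidia-ml-py`;
# the memory-temperature field is the same assumption as in the sketch above.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

with open("temps.csv", "w") as log:
    log.write("time,core_c,hbm_c,power_w\n")
    while True:
        core = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        hbm = pynvml.nvmlDeviceGetFieldValues(
            handle, [pynvml.NVML_FI_DEV_MEMORY_TEMP]
        )[0].value.uiVal
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # milliwatts -> watts
        log.write(f"{time.time():.2f},{core},{hbm},{power_w:.0f}\n")
        log.flush()  # keep the last samples on disk if the card takes the system down
        time.sleep(0.1)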
> If it is just LLM inference, the inference speed of this A100 32G card is not much faster than that of the V100 32G.

Really? I heard that the tensor cores on the V100 only support FP16, but on the A100 they support FP32. Actually, I'm just wondering whether I should choose this A100 rather than a V100, since I'm going to run model inference in FP32.
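For what it's worth, Ampere tensor cores accelerate FP32 matmuls through TF32 (FP32 range, 10-bit mantissa in the multiply), which Volta does not have, so whether that counts as "FP32 support" depends on how much precision your model can tolerate losing. In PyTorch it is toggled by a flag; below is a rough comparison sketch, assuming a CUDA build of PyTorch, with the matrix size and iteration count chosen only for illustration.

Python:
# Rough comparison of FP32 matmul throughput with and without TF32 tensor cores
# on an Ampere GPU (a sketch; on a V100 the flag has no effect because Volta
# has no TF32 path).
import time
import torch

def bench(label):
    # plain FP32 inputs; size chosen only so each matmul takes a measurable time
    a = torch.randn(8192, 8192, device="cuda")
    b = torch.randn(8192, 8192, device="cuda")
    a @ b  # warm-up so one-time startup cost isn't included in the timing
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(50):
        a @ b
    torch.cuda.synchronize()
    print(f"{label}: {time.time() - t0:.2f} s for 50 matmuls")

torch.backends.cuda.matmul.allow_tf32 = False  # strict FP32 on the CUDA cores
bench("FP32")
torch.backends.cuda.matmul.allow_tf32 = True   # let matmuls use the TF32 tensor cores
bench("TF32")

If TF32 accuracy is acceptable for your model, that is where the A100 pulls well ahead for FP32 inference; in strict FP32 the gap to the V100 is much smaller.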