Automotive A100 SXM2 for FSD? (NVIDIA DRIVE A100)


blackcat1402

New Member
Dec 10, 2024
13
2
3
I'm not sure how to differentiate what is "Production" from what is "pre-production" or "ES/QS".

My unit came in an NVIDIA-branded box, and the GPU itself has a serial number matching the box. Does that mean it's "Production"?

Yet the die heat spreader markings indicate "QS", nvidia-smi labels it as "PG199-PROD", and the VBIOS revision (92.00.79.00.01) is exactly the same as the one in your "pre-production" screenshot.

A lot of conflicting information.
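
If it helps with comparing units, here is a minimal sketch (assuming the pynvml package is installed and the card shows up as GPU 0; some fields may simply be unsupported on an automotive board) that reads the name, VBIOS version, board part number, and serial straight from NVML, which is where nvidia-smi gets this information:

Python:
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumes the DRIVE A100 is GPU 0

def show(label, getter):
    # Older pynvml versions return bytes, newer ones return str; some fields
    # may be unsupported on this board, so don't let one failure stop the rest.
    try:
        value = getter(handle)
        print(label, value.decode() if isinstance(value, bytes) else value)
    except pynvml.NVMLError as err:
        print(label, f"(unsupported: {err})")

show("Name:        ", pynvml.nvmlDeviceGetName)
show("VBIOS:       ", pynvml.nvmlDeviceGetVbiosVersion)
show("Board part #:", pynvml.nvmlDeviceGetBoardPartNumber)
show("Serial:      ", pynvml.nvmlDeviceGetSerial)
pynvml.nvmlShutdown()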
I've been curious about getting one of these and an adapter to try out for some AI stuff as well as other things. I'm wondering, though, how does Windows support work? What did you have to do to get it working?

Also, will any SXM2 to PCIe adapter work for this? I've been eyeing this one on eBay: SXM2 To PCIE Adapter For Nvidia Tesla V100 A100 SXM2 GPU Computing Graphics | eBay

That listing says the adapter works for the A100, but I can't find much info on what the automotive A100 actually works with. If that adapter does work, it would be really awesome to make a slim dual-slot A100.
It does not work due to wrong heatsink screw-hole spacing; it works for the P100 and V100 only, but the A100 SXM2 is different.
 

xdever

Member
Jun 29, 2021
34
4
8
I haven't seen the complete heatsink that comes with those eBay adapters, but it looks like a solid block of copper with fins. Totally insufficient to cool the A100, IMO. Maybe barely enough to cool a power-limited V100.
I had one of these that came with my adapter; it's complete junk. It can't cool even my V100 for more than a few seconds, even at max fan speed. It is also super light. The Bykski water block is made from copper and is ~2x thicker but 10x heavier, so I wonder what material the heatsink that came with the adapter is made of (the color looks like copper, but I doubt it actually is).
 
  • Like
Reactions: gsrcrxsi

gsrcrxsi

Active Member
Dec 12, 2018
420
141
43
It does not work due to wrong heatsink screw-hole spacing; it works for the P100 and V100 only, but the A100 SXM2 is different.
Since the heatsink screws into the SXM2 module itself and not the board, that board will work if you use a different heatsink instead of the one they send you, or if you modify the heatsink holes with a file or drill to widen the spacing.

But I don't think it will properly cool an A100-Drive unit, and maybe not even a V100. The heatsink seems too small.
 

Leiko

Member
Aug 15, 2021
38
6
8
I've been curious about getting one of these and an adapter to try out for some AI stuff as well as other things. I'm wondering, though, how does Windows support work? What did you have to do to get it working?

Also, will any SXM2 to PCIe adapter work for this? I've been eyeing this one on eBay: SXM2 To PCIE Adapter For Nvidia Tesla V100 A100 SXM2 GPU Computing Graphics | eBay

That listing says the adapter works for the A100, but I can't find much info on what the automotive A100 actually works with. If that adapter does work, it would be really awesome to make a slim dual-slot A100.
These cards WON'T work by default on Windows. You can do some funky driver-install tricks to get a driver to install, but even then you'll blue-screen often. I'd only recommend Linux for these.
Also, any SXM2 to PCIe adapter should work. But don't even think about making it a slim card; you won't be able to cool it, nor power-limit it enough to be able to.
 

xdever

Member
Jun 29, 2021
34
4
8
I finally had a chance to test the Bykski water block. I can't make the GPU exceed 50 degrees Celsius, but it still crashes under sufficient load. With the following Python code, the power usage goes up to 460W+:

Python:
import torch
import torch.nn.functional as F

# Two 32768x32768 bf16 matrices (~2 GiB each), kept resident on the GPU.
a = torch.randn(32768, 32768, device="cuda", dtype=torch.bfloat16)
w = torch.randn(32768, 32768, device="cuda", dtype=torch.bfloat16)

print("START")
# Hammer the card with back-to-back large matmuls until interrupted.
with torch.no_grad():
    while True:
        F.linear(a, w)
The VRMs are significantly hotter than the core, but it doesn't seem catastrophic. Does anybody have an idea what is happening?

Xinf_250208_210942_349.jpg
 
  • Like
Reactions: blackcat1402

gsrcrxsi

Active Member
Dec 12, 2018
420
141
43
Are you using thermal pads? 60C isn't really that hot at all for VRMs. I'd guess heat probably isn't the reason for your crashes. Are you still using that modified SXM2->PCIe adapter? The issue might be more related to that.

Another thing to check: look in the kernel log for something like PCIe errors.

The A100-Drive can run at PCIe Gen 4, and if your CPU/motherboard support Gen 4, it may be auto-negotiating that speed. The adapter board you are using might not be capable of running at Gen 4.

Try setting the PCIe slot to Gen 3 instead and see if it helps.
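
If you want to verify what the link actually negotiated, here is a minimal sketch that reads the link state out of sysfs on Linux. The PCI address is a placeholder: substitute the bus ID that lspci or nvidia-smi shows for your card, and read it while the GPU is under load, since the link can downtrain to 2.5 GT/s when idle.

Python:
from pathlib import Path

# Placeholder bus ID - replace with your card's address from lspci / nvidia-smi.
BUS_ID = "0000:3b:00.0"
dev = Path("/sys/bus/pci/devices") / BUS_ID

for attr in ("current_link_speed", "max_link_speed",
             "current_link_width", "max_link_width"):
    print(attr, "=", (dev / attr).read_text().strip())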
 
Last edited:

xdever

Member
Jun 29, 2021
34
4
8
I use 1.5mm thermal pads on the VRMs and no thermal pads on the coils. I also tried putting thermal pads on the coils, but no matter how long I tried, I couldn't get them right: either they don't make contact, or they are too thick and the core doesn't make contact (the thermal paste is not spread out). Do you happen to know what the proper thickness of the pads is?

I'm still using the modified adapter, but as far as the modifications go, I'm pretty sure that it receives enough power. That said, I can't verify the thickness of the traces that they used below the socket.

My motherboard and CPU only support PCIe Gen 3. I doubt that this is the issue, because the crash only happens when the card goes above 450W.

I'm wondering if my card has some issue with its power limits, since even the SXM4 "real" A100 has a power limit of 400W, and this card clearly goes above that. If I change the matmul shape above from 32768 to 16384, the power level stays at around 420W and the card is stable. That is still more than the official specification for any version of the A100. Can this be related to the card being the CS version and not the production one?

Can somebody try the script from above and check the maximum power usage of their card using nvidia-smi?
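
If it is easier than watching nvidia-smi, here is a minimal sketch (assuming pynvml is installed and the card is GPU 0) that prints the enforced power limit and then polls the power draw, keeping track of the peak; run it in a second terminal while the stress script is going:

Python:
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumes the DRIVE A100 is GPU 0

# NVML reports power in milliwatts.
limit_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000.0
print(f"Enforced power limit: {limit_w:.0f} W")

peak_w = 0.0
try:
    while True:
        draw_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
        peak_w = max(peak_w, draw_w)
        print(f"now {draw_w:6.1f} W   peak {peak_w:6.1f} W", end="\r")
        time.sleep(0.5)
except KeyboardInterrupt:
    print(f"\nPeak power draw: {peak_w:.1f} W")
finally:
    pynvml.nvmlShutdown()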
 

gsrcrxsi

Active Member
Dec 12, 2018
420
141
43
I'm running your script now. I have the QS version. How long should it run? It's been going for about 5 minutes without any problem.

GPU core - 78C
GPU mem - 57C

I see the power draw reported at around 400-450W, but it bounces around quite a bit. The highest I've seen is about 465W.

Edit: spoke too soon. It did indeed crash and dropped the GPU off the bus.
 

gsrcrxsi

Active Member
Dec 12, 2018
420
141
43
If you think it's power related, can you try setting the GPU to the minimum supported core clocks?

When I set my core clocks to 1140MHz, power use drops dramatically, to around 300-350W on your same script.
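
A minimal sketch of doing that programmatically, assuming pynvml, root privileges, and GPU index 0; this is roughly equivalent to nvidia-smi -lgc, and the 300/1140 MHz bounds are just placeholders (1140 being the value mentioned above), not a recommendation:

Python:
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumes the DRIVE A100 is GPU 0

# Lock the graphics clock into a low range to cap power draw (needs root).
# The 300/1140 MHz bounds are placeholders - adjust for your card.
pynvml.nvmlDeviceSetGpuLockedClocks(handle, 300, 1140)

# ... run the workload, then restore the default clock behaviour:
pynvml.nvmlDeviceResetGpuLockedClocks(handle)
pynvml.nvmlShutdown()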
 

jenapper

New Member
May 21, 2023
10
7
3
I have purchased 4 of these and will be joining the discussion next week.

From what I have gathered:
CS, ES, and QS are more or less the same. There were "engineering samples", but from what the community has gathered, they were all produced in such large quantities that they are all considered the same as PROD in the Chinese community. One community member who has had their hands on multiple units of all three says they were likely just sold to different car manufacturers.

NVLink is possible but perhaps not quite worth it: someone from the community managed to migrate the chip from the SXM2 board onto a custom PCIe board with NVLink and got full NVLink capability. The risk and the labor involved are quite expensive.

The SXM2 board does not have the necessary traces to enable the NVLink that the chip is capable of.

The board fits into SXM2 slots but requires the SXM4 A100 heatsink.

Photos to come next week as my parts arrive.
 
  • Like
Reactions: blackcat1402

gsrcrxsi

Active Member
Dec 12, 2018
420
141
43
How can it require an SXM4 heatsink when the VRM layout is SXM2? I don't think the SXM4 heatsink would fit properly.

So far, the only thing that seems to fit the card with zero modifications is the Bykski NVV100 waterblock.
 

jenapper

New Member
May 21, 2023
10
7
3
I am still just getting info from the community member who designs heatsinks and mods for the PG199.

Once I get my hands on the cards and heatsink in a few days I will post more updates.

I did ask them; they said my heatsink comes with the A100 golden hood.
 

xdever

Member
Jun 29, 2021
34
4
8
I limited the clock speed with sudo nvidia-smi -i 1 -ac 1404,1260, and now it seems stable with basically no performance difference.

I still worry about the VRM temps; maybe the thermal pad is too thin. The temperature near the VRMs is 73 degrees C, and that is not the top of the chips, which I can't see because of the heatsink. There is a hole where there is no FET, so I can see the coils; they are at 64 degrees C. Also, the heatsink immediately on top of the VRMs is cold. I wonder if the thermal pad isn't thick enough.

Even on the back of the adapter board, you can see that the core area is cooler than the VRM area.

Screenshot 2025-02-09 at 07.42.29.png

Xinf_250209_074056_241.jpg
 

Attachments

  • Like
Reactions: CyklonDX

jenapper

New Member
May 21, 2023
10
7
3
What height of thermal pads is everyone using?

Are you using different heights for the VRM chips and RAM chips?
 

gsrcrxsi

Active Member
Dec 12, 2018
420
141
43
What height of thermal pads is everyone using?

Are you using different heights for the VRM chips and RAM chips?
RAM chips? What RAM chips? This card has HBM under the main heat spreader, packaged with the GPU core.

VRM thermal pad thickness will depend on what cooler you're using. The waterblock uses much thinner pads (like 1mm?) than the HP 3U air cooler (like 3-4mm, I'd guess). The copper 2U air coolers don't have any contact with the VRMs at all and need direct air cooling with a fast screamer fan.

It all depends.
 

xdever

Member
Jun 29, 2021
34
4
8
For the VRMs, I used 1.5mm, and based on the imprint and the spread of the thermal paste, it seems to be the perfect height.
For the coils, I tried a bunch of different configurations today. Currently, I have 1mm; 1.25mm is still too thick and prevents proper contact with the GPU.

All of this is for Bykski N-NVV100-NVLink-X. The rest of the cooling setup is:
  • Radiator: EKWB EK-Coolstream CE 420
  • Pump: EK-Quantum Kinetic TBE 200 D5 PWM
  • 3x EKWB Vardar Evo 140ER fans
  • Pipe connectors: Alphacool 13/10mm icicle G1/4 - six pack
  • Alphacool plastic hose 13mm outside/10mm inside diameter
  • Noctua NT-H2 thermal paste
The order is pump -> GPU -> radiator top connector -> radiator bottom connector -> pump.

Because PC cooling liquids are next to impossible to get in my country, I'm running it on a ~22% monoethylene glycol / distilled water mix to prevent algae and reduce corrosion (red Valeo Protectiv 100 G12 car coolant).

Currently, the fans are controlled by the adapter board's thermal sensor, and the pump is always running at max speed. I will try to connect the pump to the PWM next.

There is no separate VRAM. They are HBM chiplets integrated close to the die under the built-in heat spreader.
 

jenapper

New Member
May 21, 2023
10
7
3
For the VRMs, I used 1.5mm, and based on the imprint and the spread of the thermal paste, it seems to be the perfect height.
For the coils, I tried a bunch of different configurations today. Currently, I have 1mm; 1.25mm is still too thick and prevents proper contact with the GPU.
Oh yes, sorry, I meant VRMs and coils. I plan on running an air-cooled setup and was wondering whether it's possible to use additional smaller heatsinks etc. to cool the VRMs.

Interesting water-cooling setup; I will post updates on mine as soon as the parts arrive.
 

xdever

Member
Jun 29, 2021
34
4
8
Update on the pads: my thermal pads were too hard. I updated the VRM pads to 1.75mm, very soft ones; now the whole setup is very stable and runs for many hours at max load without going above 54 degrees. The VRMs are also at just ~65 degC. For the coils, I kept the 1mm ones, but a soft 1.1-1.15mm one might work better.
 

gsrcrxsi

Active Member
Dec 12, 2018
420
141
43
I didn't put anything on the coils; they usually don't need any cooling. If you wanted to put something there, you could probably use thermal putty to take the guesswork out of the thickness.
 

xdever

Member
Jun 29, 2021
34
4
8
I don't think the coils themselves need cooling, but they conduct heat away from the area near the VRMs. Before I did this, the back of the adapter board behind the VRM area was untouchably hot. Now it is noticeably cooler.