> V100 only supports FP16, but the A100 supports FP32.

TF32 is actually 2x FP16 as I recall *(someone correct me). There's no benefit to running a TF32 accumulator vs a single FP16 one, since TF32 values don't actually have the full precision length of FP16.
> Really? I heard that tensor cores on the V100 only support FP16, but on the A100 they support FP32. Actually I'm just wondering if I should choose this A100 rather than the V100, since I'm going to run model inference in FP32.

I'm not sure if you are talking about LLM inference or some very specific use case that's unclear to me, but why would you do inference in FP32? Even the training of basically all modern networks is done in BF16, with a copy of FP32 master weights kept in the optimizer (used only to prevent underflow when accumulating gradients). People often even int-quantize the models, although that costs a marginal amount of accuracy. I am not aware of any degradation in BF16.
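For anyone unsure what that looks like in practice, a minimal BF16 inference sketch (assuming a Hugging Face transformers checkpoint; the model id is just a placeholder, and the bfloat16 dtype is the only change versus an FP32 load):

```python
# Minimal BF16 inference sketch. The checkpoint name is a placeholder; swap in
# whatever model you actually run. Loading in bfloat16 halves the weight VRAM
# vs FP32, and no loss scaling is needed since BF16 keeps FP32's exponent range.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")
model.eval()

inputs = tok("Tensor cores on the A100", return_tensors="pt").to("cuda")
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```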
The key thing is that running an LLM locally in FP32 just isn't practical; it demands way too much VRAM, LOL.
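To put rough numbers on that, a weights-only back-of-the-envelope estimate (the parameter count is a placeholder; KV cache and activations come on top of this):

```python
# Rough weight-only memory estimate per storage format.
params = 70e9  # placeholder, e.g. a 70B-parameter model

for fmt, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1), ("int4/fp4", 0.5)]:
    print(f"{fmt:9s} ~{params * bytes_per_param / 1e9:6.0f} GB")
# fp32 needs ~280 GB for the weights alone, fp16 ~140 GB, int4 ~35 GB.
```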
This conclusion is simply nonsense. In actual testing, the inference performance of the DRIVE PG199 is roughly equivalent to that of an RTX 4090. I don't know which inference framework produces the supposed "equivalent to a V100" result, but at least in both vLLM and SGLang it performs remarkably well. Besides, running inference in FP32 buys you very little: leaving aside the fact that mainstream models are now trained at low precision such as FP8 or even FP4, even the older models only go up to FP16, and very few models support FP32 at all, let alone for inference.
The A100 doesn't have FP32 tensor cores, and TF32 is 19 bits IIRC.
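For reference, TF32 keeps FP32's 8-bit exponent but truncates the mantissa to 10 bits (1 sign + 8 exponent + 10 mantissa = 19 bits), i.e. FP16's precision with FP32's range. A quick sketch of what that costs in a matmul, using the PyTorch TF32 toggle against a float64 reference (sizes are arbitrary):

```python
# Compare FP32 vs TF32 tensor-core matmul error against a float64 reference.
import torch

n = 2048
a = torch.randn(n, n, device="cuda")
b = torch.randn(n, n, device="cuda")
ref = a.double() @ b.double()

for allow_tf32 in (False, True):
    torch.backends.cuda.matmul.allow_tf32 = allow_tf32  # Ampere or newer
    err = (a @ b).double().sub(ref).abs().max().item()
    print(f"allow_tf32={allow_tf32}: max abs error {err:.3e}")
```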
> The 4090 should be 2x faster than the DRIVE A100 (and the normal A100); by any chance were you using sparsity on the A100? I recall it was disabled on the 4090 (or at least it was in the past, as I see mixed results after 2024).

No, what's nerfed by NVIDIA on the 4090 is FP32 accumulation. They don't do that on the A6000 or the A100 (workstation/pro and datacenter GPUs).
> The 4090 is the rightmost column; it's two times slower with FP32 accumulation. The A100 and the professional Ada cards aren't nerfed like this.

Note that the DRIVE A100 is at least 250 TF in FP16 w/ FP32 accumulation (which is a ton more than the 4090, and that's what is used for training neural nets).
[attachment 42571: spec table image]
> The 4090 has its 330.4 TF written with the FP32 accumulator. *(sparsity)

Sparsity is a scam; LLM weight tensors are still dense to this day.
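If anyone wants to sanity-check that on a checkpoint they have loaded, a rough sketch (assumes an already-loaded torch model object; `model` here is a placeholder, not any particular API):

```python
# Count the fraction of non-zero entries across all Linear weights of a model.
import torch

def weight_density(model: torch.nn.Module, tol: float = 0.0) -> float:
    total, nonzero = 0, 0
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            w = module.weight.detach()
            total += w.numel()
            nonzero += (w.abs() > tol).sum().item()
    return nonzero / max(total, 1)

# print(f"non-zero fraction: {weight_density(model):.4f}")  # typically ~1.0 for LLMs
```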
> Well, if sparsity isn't used on either then they are still about the same.

Well, no, because what's used when running FP16 is FP32 accumulation by default, because that's what is stable enough. You will have issues trying to train with FP16 accumulation. This is why training on a 4090 will be slower (it's almost 100 TF slower).
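A toy illustration of why FP16 accumulation is shaky (just the rounding effect, not a training benchmark):

```python
# Sum 10,000 copies of 0.1: an FP16 running total stalls around 256 because the
# additions start rounding to nothing, while accumulating the same FP16 inputs
# into an FP32 total lands near the true ~1000.
import torch

x = torch.full((10_000,), 0.1, dtype=torch.float16)

acc16 = torch.zeros((), dtype=torch.float16)
acc32 = torch.zeros((), dtype=torch.float32)
for v in x:
    acc16 = acc16 + v          # FP16 accumulator
    acc32 = acc32 + v.float()  # FP32 accumulator over the same FP16 inputs
print(acc16.item(), acc32.item())
```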
Whenever NVIDIA marks a figure with an asterisk*, you can simply halve it in your calculations. It's a well-known scheme to prop up their inflated stock price.
I'm only arguing what they say in their papers.
Well, for the A100, based on what they say in their papers:
(without sparsity)

A100 40GB SXM4:
  FP16 w/ FP32 accumulate: 156 TF
  FP16 w/ FP16 accumulate: 312 TF
  TF32: 156 TF (the only major difference)

RTX 4090:
  FP16 w/ FP32 accumulate: 165.2 TF
  FP16 w/ FP16 accumulate: 330.3 TF
  TF32: 82.6 TF (the only major difference)
*And it's very likely that the A100's TF32 figure was written in error and actually measured with sparsity. *(Not definitive proof, but the A40 without sparsity lists only 74 TF for TF32, which would match perfectly.) One could also argue the A100 simply gets more out of TF32 because Ada is handicapped by the total memory bandwidth it has.
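If someone wants to compare those paper numbers to what their own card actually delivers, here's a rough benchmark sketch (matrix size and iteration count are arbitrary; measured throughput will land below the theoretical peak and depends on clocks and power limits):

```python
# Time a large FP16 matmul and report achieved TFLOPS.
import time
import torch

n = 8192
a = torch.randn(n, n, device="cuda", dtype=torch.float16)
b = torch.randn(n, n, device="cuda", dtype=torch.float16)

for _ in range(10):  # warm-up
    a @ b
torch.cuda.synchronize()

iters = 50
t0 = time.perf_counter()
for _ in range(iters):
    a @ b
torch.cuda.synchronize()
dt = (time.perf_counter() - t0) / iters

flops = 2 * n ** 3  # multiply-adds in an n x n x n matmul
print(f"achieved ~{flops / dt / 1e12:.1f} TFLOPS (FP16 inputs, cuBLAS default accumulation)")
```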