Automotive A100 SXM2 for FSD? (NVIDIA DRIVE A100)


CyklonDX

Well-Known Member
Nov 8, 2022
1,531
511
113
V100 only support fp16, but on A100 they support fp32
TF32 is actually 2x FP16 as I recall *(someone correct me); no benefit to running a TF32 accumulator vs. a single FP16 one, since TF32 doesn't actually carry any more mantissa precision than FP16.
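For reference, here is how the formats being argued about break down bit-wise (a quick illustrative sketch; the bit widths are the standard definitions, nothing measured here):

```python
# Bit layouts of the formats under discussion. TF32 keeps fp32's exponent range
# but only fp16's mantissa, which is where the "19 bits" figure comes from.
formats = {
    "fp32": (8, 23),  # (exponent bits, mantissa bits)
    "tf32": (8, 10),
    "fp16": (5, 10),
    "bf16": (8, 7),
}
for name, (exp, man) in formats.items():
    print(f"{name}: 1 sign + {exp} exponent + {man} mantissa = {1 + exp + man} bits")
```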
 

xdever

Member
Jun 29, 2021
34
4
8
Really? I heard that tensor cores on V100 only support fp16, but on A100 they support fp32. Actually I'm just wondering if I should choose this A100 rather than V100, since I'm going to run model inference on fp32
I'm not sure if you are talking about LLM inference or some very specific use case that is unclear to me, but why would you do inference in fp32? Even the training of basically all modern networks is done in BF16, with a copy of FP32 master weights in the optimizer (used only to prevent underflow when accumulating the gradients). People often even int-quantize the models, although that reduces quality marginally. I am not aware of any degradation in BF16.
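As a rough illustration of the point, a minimal PyTorch sketch comparing fp32 and bf16 inference on a toy model (the model is made up for the example; assumes an Ampere-or-newer GPU):

```python
import torch
from torch import nn

# Toy stand-in for "a model"; nothing from this thread.
model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096)).cuda().eval()
x = torch.randn(8, 4096, device="cuda")

with torch.inference_mode():
    ref = model(x)  # plain fp32 path
    with torch.autocast("cuda", dtype=torch.bfloat16):
        out = model(x)  # bf16 tensor-core path

# The deviation is usually tiny, which is why fp32 inference rarely pays off.
print((ref - out.float()).abs().max())
```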
 

blackcat1402

New Member
Dec 10, 2024
13
3
3
Really? I heard that tensor cores on V100 only support fp16, but on A100 they support fp32. Actually I'm just wondering if I should choose this A100 rather than V100, since I'm going to run model inference on fp32
The key thing is that running an LLM locally in fp32 is not practical; it demands way too much VRAM, LOL.
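A quick back-of-the-envelope sketch of the VRAM math (the model sizes below are just illustrative examples, not anything from this thread):

```python
# Weight memory alone, per precision, ignoring KV cache and activations.
GIB = 1024 ** 3
for params_b in (7, 70):
    for name, bytes_per_weight in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
        gib = params_b * 1e9 * bytes_per_weight / GIB
        print(f"{params_b}B params @ {name}: ~{gib:.0f} GiB")
# A 70B model in fp32 is ~260 GiB of weights alone; in fp16 it is still ~130 GiB.
```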
 

aosudh

Member
Jan 25, 2023
64
15
8
Really? I heard that tensor cores on V100 only support fp16, but on A100 they support fp32. Actually I'm just wondering if I should choose this A100 rather than V100, since I'm going to run model inference on fp32
This conclusion is simply nonsense. From actual testing, the inference performance of the DRIVE PG199 is roughly equivalent to that of an RTX 4090. I don't know through what inference framework the so-called V100-equivalent performance was obtained, but at least in both vLLM and SGLang it has demonstrated remarkable performance. In addition, running an fp32 inference model seems to have little point. Leaving aside the fact that mainstream models have already been trained at low precision such as fp8 or even fp4, even the relatively older models at most support fp16, and there are very few models that support fp32 at all, let alone in an inference context.
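For context, a minimal sketch of the kind of fp16 vLLM run being described (the model name is only a placeholder, not something tested in this thread):

```python
from vllm import LLM, SamplingParams

# Placeholder model; any fp16/bf16 checkpoint that fits in VRAM works the same way.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", dtype="float16")
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["What precision do Ampere tensor cores support?"], params)
print(outputs[0].outputs[0].text)
```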
 

Leiko

Member
Aug 15, 2021
38
7
8
TF32 is actually 2x FP16 as I recall *(someone correct me); no benefit to running a TF32 accumulator vs. a single FP16 one, since TF32 doesn't actually carry any more mantissa precision than FP16.
The A100 doesn't have FP32 tensor cores, and TF32 is 19 bits IIRC.
This card (from my benchmarks) beats the 4090 in fp16/bf16 FLOPS with fp32 accumulate, and in memory bandwidth.
It should be significantly (around 50%) faster than a 4090 at LLM inference in bf16/fp16 when the matmuls are accumulated in fp32.
It should also destroy it in training, since bf16/fp16 training uses fp32 accumulation for stability reasons.
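A minimal sketch of how this kind of fp16-matmul FLOPS comparison can be reproduced in PyTorch; note the flag below only *permits* reduced-precision accumulation, the actual kernel choice is up to cuBLAS:

```python
import time
import torch

def bench_fp16_matmul(allow_fp16_acc, n=8192, iters=50):
    # Toggle whether cuBLAS may accumulate fp16 GEMMs in reduced precision.
    torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = allow_fp16_acc
    a = torch.randn(n, n, device="cuda", dtype=torch.float16)
    b = torch.randn(n, n, device="cuda", dtype=torch.float16)
    for _ in range(5):  # warm-up
        a @ b
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    return 2 * n**3 * iters / (time.perf_counter() - t0) / 1e12  # dense TFLOPS

print("fp16 in, fp32-style accumulate:", bench_fp16_matmul(False), "TFLOPS")
print("fp16 in, reduced-precision acc:", bench_fp16_matmul(True), "TFLOPS")
```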
 

CyklonDX

Well-Known Member
Nov 8, 2022
1,531
511
113
The 4090 should be 2x faster than the DRIVE A100 (and the normal A100); by any chance were you using sparsity on the A100? I recall it was disabled on the 4090 (or at least it was in the past, as I see mixed results after 2024).
 

Leiko

Member
Aug 15, 2021
38
7
8
The 4090 should be 2x faster than the DRIVE A100 (and the normal A100); by any chance were you using sparsity on the A100? I recall it was disabled on the 4090 (or at least it was in the past, as I see mixed results after 2024).
No, what's nerfed by NVIDIA on the 4090 is fp32 accumulation. They don't do that on the A6000 or the A100 (workstation/pro and datacenter GPUs).
FLOPS for different GPUs: GitHub - mag-/gpu_benchmark: Gpu benchmark
4090: https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf

EDIT: corrected the note about fp8 FLOPS; it's the right white paper.
 

CyklonDX

Well-Known Member
Nov 8, 2022
1,531
511
113
[Attached image: 1742319025390.png]
The 4090 should be slightly faster than the 40G A100 SXM4 card. *(It may be losing half its speed on memory, but that depends on how the workload was written and whether in-flight compression was used; the 4090 could still outperform it in effective memory bandwidth due to the latency of HBM.)
 

CyklonDX

Well-Known Member
Nov 8, 2022
1,531
511
113
the 4090 has its 330.4 TF figure written with an fp32 accumulator. *(sparsity)
 
Leiko

Member
Aug 15, 2021
38
7
8
the 4090 has its 330.4 TF figure written with an fp32 accumulator. *(sparsity)
Sparsity is a scam; LLM weight tensors are still dense to this day.
You don't train with sparsity either.
Don't look at the sparsity FLOPS, they're a scam, and no, the 4090 isn't faster for training (and probably isn't faster for inference either, unless you successfully use fp8 inference).
 

Leiko

Member
Aug 15, 2021
38
7
8
well if sparsity isn't used on either then they are still about the same.
Well, no, because what's used with fp16 is fp32 accumulation by default, since it's stable enough. You will have issues trying to train with fp16 accumulation. This is why training on a 4090 will be slower (it's almost 100 TF slower...).
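For reference, the standard fp16 mixed-precision training recipe being referred to, as a minimal PyTorch sketch (needs a reasonably recent PyTorch for torch.amp.GradScaler): parameters stay in fp32, autocast casts the matmuls to fp16, and the loss scaler guards against fp16 gradient underflow.

```python
import torch
from torch import nn

model = nn.Linear(1024, 1024).cuda()                  # parameters stay in fp32
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.amp.GradScaler("cuda")                  # loss scaling for fp16

x = torch.randn(32, 1024, device="cuda")
target = torch.randn(32, 1024, device="cuda")

for _ in range(10):
    opt.zero_grad(set_to_none=True)
    with torch.autocast("cuda", dtype=torch.float16):
        loss = nn.functional.mse_loss(model(x), target)  # fp16 matmuls
    scaler.scale(loss).backward()   # scale up to avoid fp16 gradient underflow
    scaler.step(opt)                # unscales, skips the step on inf/nan
    scaler.update()
```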
 

aosudh

Member
Jan 25, 2023
64
15
8
the 4090 has its 330.4 TF figure written with an fp32 accumulator. *(sparsity)
Whenever Nvidia marks data with an asterisk*, you can simply halve it in calculations. This is a well-known scheme to prop up their inflated stock price.
 

CyklonDX

Well-Known Member
Nov 8, 2022
1,531
511
113
Well, no, because what's used with fp16 is fp32 accumulation by default, since it's stable enough. You will have issues trying to train with fp16 accumulation. This is why training on a 4090 will be slower (it's almost 100 TF slower...).
I'm only arguing what they say in their papers.

(without sparsity)

A100 40G SXM4
fp16 acc fp32: 156 TF
fp16 acc fp16: 312 TF
TF32: 156 TF (only major difference)

4090
fp16 acc fp32: 165.2 TF
fp16 acc fp16: 330.3 TF
TF32: 82.6 TF (only major difference)

*And it's very likely that was written in error and actually run with sparsity for TF32 on the A100. *(While not definitive proof, the A40 without sparsity has only 74 TF of TF32, which would match perfectly.) We can argue the A100 gets more TF32 because Ada is handicapped by the total memory bandwidth it has.
 

Leiko

Member
Aug 15, 2021
38
7
8
I'm only arguing what they say in their papers.

(without sparsity)

A100 40G SXM4
fp16 acc fp32: 156 TF
fp16 acc fp16: 312 TF
TF32: 156 TF (only major difference)

4090
fp16 acc fp32: 165.2 TF
fp16 acc fp16: 330.3 TF
TF32: 82.6 TF (only major difference)

*And it's very likely that was written in error and actually run with sparsity for TF32 on the A100. *(While not definitive proof, the A40 without sparsity has only 74 TF of TF32, which would match perfectly.) We can argue the A100 gets more TF32 because Ada is handicapped by the total memory bandwidth it has.
Well, here's the A100, based on what they say in their papers:
 

Attachments