> V100 only supports FP16, but the A100 supports FP32.

TF32 is actually 2x FP16 as I recall *(someone correct me). There's no benefit to running a TF32 accumulator vs a single FP16 one, since TF32 values don't actually have the full precision length of FP16.
> Really? I heard that tensor cores on the V100 only support FP16, but on the A100 they support FP32. Actually I'm just wondering if I should choose this A100 rather than the V100, since I'm going to run model inference in FP32.

I'm not sure if you are talking about LLM inference or some very specific use case that's unclear to me, but why would you do inference in FP32? Even the training of basically all modern networks is done in BF16, with a copy of FP32 master weights kept in the optimizer (used only to prevent underflow when accumulating gradients). People often even int-quantize the models, although that costs a marginal amount of accuracy. I am not aware of any degradation in BF16.
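For anyone unsure what that looks like in practice, a minimal BF16 inference sketch (assuming a Hugging Face transformers checkpoint; the model id is just a placeholder, and the bfloat16 dtype is the only change versus an FP32 load):

```python
# Minimal BF16 inference sketch. The checkpoint name is a placeholder; swap in
# whatever model you actually run. Loading in bfloat16 halves the weight VRAM
# vs FP32, and no loss scaling is needed since BF16 keeps FP32's exponent range.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")
model.eval()

inputs = tok("Tensor cores on the A100", return_tensors="pt").to("cuda")
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```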
The key thing is that running an LLM locally in FP32 just isn't practical; it demands way too much VRAM, LOL.
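To put rough numbers on that, a weights-only back-of-the-envelope estimate (the parameter count is a placeholder; KV cache and activations come on top of this):

```python
# Rough weight-only memory estimate per storage format.
params = 70e9  # placeholder, e.g. a 70B-parameter model

for fmt, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1), ("int4/fp4", 0.5)]:
    print(f"{fmt:9s} ~{params * bytes_per_param / 1e9:6.0f} GB")
# fp32 needs ~280 GB for the weights alone, fp16 ~140 GB, int4 ~35 GB.
```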
This conclusion is simply nonsense. In actual testing, the inference performance of the DRIVE PG199 is roughly equivalent to that of an RTX 4090. I don't know which inference framework produces the supposed "equivalent to a V100" result, but at least in both vLLM and SGLang it performs remarkably well. Besides, running inference in FP32 buys you very little: leaving aside the fact that mainstream models are now trained at low precision such as FP8 or even FP4, even the older models only go up to FP16, and very few models support FP32 at all, let alone for inference.
The A100 doesn't have FP32 tensor cores, and TF32 is 19 bits IIRC.
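For reference, TF32 keeps FP32's 8-bit exponent but truncates the mantissa to 10 bits (1 sign + 8 exponent + 10 mantissa = 19 bits), i.e. FP16's precision with FP32's range. A quick sketch of what that costs in a matmul, using the PyTorch TF32 toggle against a float64 reference (sizes are arbitrary):

```python
# Compare FP32 vs TF32 tensor-core matmul error against a float64 reference.
import torch

n = 2048
a = torch.randn(n, n, device="cuda")
b = torch.randn(n, n, device="cuda")
ref = a.double() @ b.double()

for allow_tf32 in (False, True):
    torch.backends.cuda.matmul.allow_tf32 = allow_tf32  # Ampere or newer
    err = (a @ b).double().sub(ref).abs().max().item()
    print(f"allow_tf32={allow_tf32}: max abs error {err:.3e}")
```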
> The 4090 should be 2x faster than the DRIVE A100 (and the normal A100); by any chance were you using sparsity on the A100? I recall it was disabled on the 4090 (or at least it was in the past, as I see mixed results after 2024).

No, what's nerfed by NVIDIA on the 4090 is FP32 accumulation. They don't do that on the A6000 or the A100 (workstation/pro and datacenter GPUs).
> The 4090 is the rightmost column; it's two times slower with FP32 accumulation. The A100 and the professional Ada cards aren't nerfed like this.

Note that the DRIVE A100 is at least 250 TF in FP16 w/ FP32 accumulation (which is a ton more than the 4090, and that's what is used for training neural nets).
[attachment 42571: spec table image]
> The 4090 has its 330.4 TF written with the FP32 accumulator. *(sparsity)

Sparsity is a scam; LLM weight tensors are still dense to this day.
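If anyone wants to sanity-check that on a checkpoint they have loaded, a rough sketch (assumes an already-loaded torch model object; `model` here is a placeholder, not any particular API):

```python
# Count the fraction of non-zero entries across all Linear weights of a model.
import torch

def weight_density(model: torch.nn.Module, tol: float = 0.0) -> float:
    total, nonzero = 0, 0
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            w = module.weight.detach()
            total += w.numel()
            nonzero += (w.abs() > tol).sum().item()
    return nonzero / max(total, 1)

# print(f"non-zero fraction: {weight_density(model):.4f}")  # typically ~1.0 for LLMs
```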
> Well, if sparsity isn't used on either then they are still about the same.

Well, no, because what's used when running FP16 is FP32 accumulation by default, because that's what is stable enough. You will have issues trying to train with FP16 accumulation. This is why training on a 4090 will be slower (it's almost 100 TF slower).
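A toy illustration of why FP16 accumulation is shaky (just the rounding effect, not a training benchmark):

```python
# Sum 10,000 copies of 0.1: an FP16 running total stalls around 256 because the
# additions start rounding to nothing, while accumulating the same FP16 inputs
# into an FP32 total lands near the true ~1000.
import torch

x = torch.full((10_000,), 0.1, dtype=torch.float16)

acc16 = torch.zeros((), dtype=torch.float16)
acc32 = torch.zeros((), dtype=torch.float32)
for v in x:
    acc16 = acc16 + v          # FP16 accumulator
    acc32 = acc32 + v.float()  # FP32 accumulator over the same FP16 inputs
print(acc16.item(), acc32.item())
```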
Whenever NVIDIA marks a figure with an asterisk*, you can simply halve it in your calculations. It's a well-known scheme to prop up their inflated stock price.
I'm only arguing what they say in their papers.
Well, for the A100, based on what they say in their papers:
(without sparsity)

A100 40GB SXM4:
  FP16 w/ FP32 accumulate: 156 TF
  FP16 w/ FP16 accumulate: 312 TF
  TF32: 156 TF (the only major difference)

RTX 4090:
  FP16 w/ FP32 accumulate: 165.2 TF
  FP16 w/ FP16 accumulate: 330.3 TF
  TF32: 82.6 TF (the only major difference)
*And it's very likely that the A100's TF32 figure was written in error and actually measured with sparsity. *(Not definitive proof, but the A40 without sparsity lists only 74 TF for TF32, which would match perfectly.) One could also argue the A100 simply gets more out of TF32 because Ada is handicapped by the total memory bandwidth it has.
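If someone wants to compare those paper numbers to what their own card actually delivers, here's a rough benchmark sketch (matrix size and iteration count are arbitrary; measured throughput will land below the theoretical peak and depends on clocks and power limits):

```python
# Time a large FP16 matmul and report achieved TFLOPS.
import time
import torch

n = 8192
a = torch.randn(n, n, device="cuda", dtype=torch.float16)
b = torch.randn(n, n, device="cuda", dtype=torch.float16)

for _ in range(10):  # warm-up
    a @ b
torch.cuda.synchronize()

iters = 50
t0 = time.perf_counter()
for _ in range(iters):
    a @ b
torch.cuda.synchronize()
dt = (time.perf_counter() - t0) / iters

flops = 2 * n ** 3  # multiply-adds in an n x n x n matmul
print(f"achieved ~{flops / dt / 1e12:.1f} TFLOPS (FP16 inputs, cuBLAS default accumulation)")
```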