> Can somebody tell me what heatsink this card uses? SXM4 doesn't have the pads for the VRMs like SXM2 does, and also overlaying the image of an SXM4 heatsink on the card seems to show that the mounting points still don't match (although I only found a pretty bad image of the SXM4 from the bottom). Has anyone actually tried to fit an SXM4 heatsink on it?

I have seen multiple people using SXM4 heatsinks on these on Xianyu.
Some SXM2 heatsinks don't have pads for the VRMs either (the copper 2U ones). I'm guessing they are intended to be cooled by massive chassis airflow instead. But I do like that the 3U SXM2 heatsinks cool the VRMs directly, so it's not much of a worry and you can run quieter fans.
This Bykski N-N...-Taobao Malaysia looks like exactly what you're going to need, doesn't it?
That's for SXM2 V100s. It would only fit the A100 Drive if it used the SXM2 heatsink, but other replies seem to indicate that it uses the SXM4 bolt pattern.
> (I would dissuade anyone from getting those automotive A100s unless you need FP16 tensor performance; they aren't all that much stronger than a V100.)

What about TF32?
I can't run tests now because the GPU is without a heatsink in a different country than I am, but what I can confirm is that when computing gradients through Llama 2 7B loaded in bfloat16 without any quantization, it's almost as fast as the real A100, and it is much faster than a 3090. And the 3090 is much faster than the V100. On top of that, the V100 supports only float16 but not bfloat16, which means you need gradient scaling to keep the range of the gradients in meaningful bounds and keep training stable. Also, OpenAI is not very keen on maintaining Triton for the V100, and Triton is the basis of torch.compile() (for more than half a year they had a bug that resulted in all matmuls on V100 tensor cores returning 0s).
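For reference, this is roughly what the float16 path looks like in PyTorch; a minimal sketch with a stand-in linear model and dummy loss (the names are placeholders, not my actual training code). On bfloat16-capable hardware like the A100 or 3090 you can drop the scaler entirely:

```python
import torch

model = torch.nn.Linear(4096, 4096).cuda()   # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters())
scaler = torch.cuda.amp.GradScaler()         # needed for float16; pointless for bfloat16

x = torch.randn(8, 4096, device="cuda")
optimizer.zero_grad()
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = model(x).float().pow(2).mean()    # dummy loss

scaler.scale(loss).backward()  # scale the loss up so small grads don't underflow in fp16
scaler.step(optimizer)         # unscales grads; skips the step if inf/NaN showed up
scaler.update()                # adapts the scale factor for the next iteration
```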
Kind of looks good with a bit of filing (see my other SXM2 heatsink that I used for my tests), but how do I get access to this without being in China?
> BTW, have you tested the MIG option (NVIDIA Multi-Instance GPU User Guide r560) on the A100/32GB? Does it work at all on this card, or does it need an A100 equipped with 40/80GB?

I haven't; 32 GB of memory is not that much.
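If anyone wants to check, this is a quick way to see what the driver reports (assuming a reasonably recent driver; on parts without MIG the fields come back as "[N/A]"):

```python
import subprocess

# Ask the driver whether MIG mode is available/enabled on GPU 0.
out = subprocess.run(
    ["nvidia-smi", "-i", "0",
     "--query-gpu=mig.mode.current,mig.mode.pending",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(out.stdout.strip())  # e.g. "Disabled, Disabled", or "[N/A], [N/A]" if unsupported
```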
Dunno, why not ask them whether they are willing to ship it to you? In that case you'll probably need to register on Taobao, but that's a fairly simple task if you choose the so-called 'business account'. The same N-NVV100-NVLink-X item is also easily available on eBay, but sadly not for the same money.
> What about TF32?

> I can't run tests now because the GPU […]

I'm not sure this answers my question about TF32.

A100 Drive 32GB SXM2 & A100 Tesla 40G SXM4
HBM2e 1.87 TB/s (latency better but still bad; I'm guessing, but it will likely top out at ~750 GB/s-1.4 TB/s in practice, since the A100 can't reach full throughput due to locality issues; see the bandwidth sketch after this list)
FP16 77.97 TFlops (the only good reason to buy this; good for AI)
FP32 19.49 TFlops
FP64 9.7 TFlops
TDP 400W
V100 Tesla 32GB SXM2
HBM2 898 GB/s (latency)
FP16 33.3 TFlops
FP32 15.6 TFlops
FP64 7.8 TFlops
TDP 250W
RTX 3090 24GB Turbo/Blower
GDDR6X 938 GB/s (better than the V100, including better latency; con: lack of ECC - though note that ECC slows operations by some 20-25%)
FP16 35.5 TFlops
FP32 35.58 TFlops (almost twice the A100 Drive/Automotive model)
FP64 556 GFlops
TDP 350W
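For the bandwidth guess above: if someone has one of these mounted, a rough way to measure effective memory bandwidth is timing a big device-to-device copy. This is only a sketch; a clone() counts as one read plus one write of the buffer:

```python
import time
import torch

x = torch.empty(1024**3, dtype=torch.uint8, device="cuda")  # 1 GiB buffer
x.clone()                       # warm-up
torch.cuda.synchronize()

iters = 50
t0 = time.perf_counter()
for _ in range(iters):
    y = x.clone()               # reads 1 GiB, writes 1 GiB
torch.cuda.synchronize()
dt = time.perf_counter() - t0

print(f"~{2 * x.numel() * iters / dt / 1e9:.0f} GB/s effective")
```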
> FP32 35.58 TFlops (almost twice the A100 Drive/Automotive model)

This is true for the full A100 as well: for FP32, the gaming GPUs or the A6000 series are significantly faster.
> FP16 77.97 TFlops (the only good reason to buy this; good for AI)

These are non-tensor-core TFlops. The tensor cores should be around 270-280 TFlops for the Drive A100. The 3090 should also be around ~71 TFlops with BF16 I/O and FP32 accumulation (typically used for NNs), and ~142 with BF16 I/O and BF16 accumulation.
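If someone with the card wants to sanity-check those tensor-core figures, here is a crude matmul microbenchmark sketch (a square matmul is 2*n^3 FLOPs; sustained numbers will land somewhat below the paper spec):

```python
import time
import torch

n, iters = 8192, 50
a = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)
b = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)
a @ b                           # warm-up
torch.cuda.synchronize()

t0 = time.perf_counter()
for _ in range(iters):
    c = a @ b
torch.cuda.synchronize()
dt = time.perf_counter() - t0

print(f"{2 * n**3 * iters / dt / 1e12:.1f} TFlops")  # 2*n^3 FLOPs per matmul
```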
> I'm not sure this answers my question about TF32.

On paper you can expect around 150 TFlops in TF32 (a 19-bit format).
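Worth noting if you benchmark it: recent PyTorch versions leave TF32 matmuls off by default, so you have to opt in first; a minimal sketch:

```python
import torch

# Opt in to TF32 for matmuls and cuDNN convolutions on Ampere-class GPUs.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
# Equivalent for matmuls: torch.set_float32_matmul_precision("high")
```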