GPU Memory Bandwidth Benchmark


CyklonDX

Well-Known Member
Nov 8, 2022
Who is Ian? Are you referring to me? Because I have all those cards and my name is Ian. I am “Ian&Steve” on Einstein.
Yes

V100 - 2500/3 = 833s per task
I presume that if you enable ECC and retest, you will get slower results, closer to the Titan V.
(Wouldn't a 3080Ti be faster than a Titan V, if the CUDA app is sitting closer to the memory-bandwidth limit?)


//(I would still recommend trying the tensor cores; I presume it will give you a greater performance uplift if you can still pass validation. As I recall, tensor-related features can also speed up memory movement ("TMA"), though I'm not sure that works on anything but Hopper.)

and you can try this one and see if it works on Ampere/Volta

This part of the code doesn't state that you need Hopper *(it should work on Ampere) - and I think this is what you want.

(sample screenshot attached)

and a video about it
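For context on the "should work on Ampere" point, here is a minimal sketch of the async copy path that Ampere exposes without TMA, using cooperative_groups::memcpy_async (which compiles to cp.async on sm_80+ and falls back to a plain synchronous copy on Volta). The kernel and its names are my own illustration, not code taken from the linked sample:

// Sketch only: stage tiles of global memory through shared memory with
// cooperative_groups::memcpy_async. Launch with blockDim.x * sizeof(float)
// bytes of dynamic shared memory.
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

__global__ void staged_copy(const float* __restrict__ in, float* __restrict__ out, size_t n)
{
    extern __shared__ float tile[];                      // blockDim.x floats
    cg::thread_block block = cg::this_thread_block();

    for (size_t base = (size_t)blockIdx.x * blockDim.x; base < n;
         base += (size_t)gridDim.x * blockDim.x)
    {
        size_t count = (n - base < blockDim.x) ? (n - base) : blockDim.x;

        // The whole block cooperatively issues the asynchronous global->shared copy.
        cg::memcpy_async(block, tile, in + base, sizeof(float) * count);
        cg::wait(block);                                 // wait for the tile to land

        if (threadIdx.x < count)
            out[base + threadIdx.x] = tile[threadIdx.x];
        block.sync();                                    // don't overwrite the tile early
    }
}

This form compiles for Volta and up; only sm_80+ actually gets the hardware-accelerated copy, and Hopper can go further with TMA.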
 

gsrcrxsi

Active Member
Dec 12, 2018
I presume that if you enable ECC and retest, you will get slower results, closer to the Titan V.
(Wouldn't a 3080Ti be faster than a Titan V, if the CUDA app is sitting closer to the memory-bandwidth limit?)
Enabling or disabling ECC doesn't meaningfully change computation speed with Einstein that I've seen, but it does seem to use more power when turned on, based on my testing with the A100-Drive system.

Yes, a 3080Ti is faster than a Titan V for Einstein, but it uses roughly 2x the power; the Titan V is much more efficient, which is why I use it (and V100s) instead. The 3080Ti is a little faster than a V100 (~10%) for Einstein when all settings and the app are the same.


//(I would still recommend trying the tensor cores; I presume it will give you a greater performance uplift if you can still pass validation. As I recall, tensor-related features can also speed up memory movement ("TMA"), though I'm not sure that works on anything but Hopper.)
I'm not sure how much of the computation is matrix multiplication, so I'm not sure the tensor cores can be used that much. I would assume that the compiler and/or scheduler would recognize any matching computations and send them to the tensor cores, wouldn't it? Tensor cores on the Titan V and V100 don't have as much flexibility as on Ampere and up.

I'll take a look at the other resources you linked to.
 

CyklonDX

Well-Known Member
Nov 8, 2022
Tensor cores on the Titan V and V100 don't have as much flexibility as on Ampere and up.
Looks like there are no benefits on Volta except for FP16 using the Tensor Cores; only Ampere and up actually show good gains.

*(but I would note that the 4090/5090 do have locked tensor performance, likely to stop them from cutting into their other Tesla/Quadro products)
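For reference, the FP16 path that Volta does accelerate is the warp-level WMMA API: 16x16x16 FP16 fragments accumulated in FP32. A minimal sketch, with my own names, not tied to the Einstein app:

// Sketch only (needs sm_70+): one warp computes a 16x16x16 tile D = A*B + C
// with FP16 inputs and FP32 accumulation on the tensor cores.
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void wmma_tile(const __half* A, const __half* B, const float* C, float* D)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::load_matrix_sync(a_frag, A, 16);                        // leading dimension = 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::load_matrix_sync(acc_frag, C, 16, wmma::mem_row_major);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);           // the tensor-core MMA
    wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
}

Volta only supports the FP16-input fragment types; Ampere adds TF32, BF16, and FP64 shapes, which is the flexibility gap mentioned above.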


scheduler would recognize any matching computations and send them to the tensor cores, wouldn't it?
It's not enabled by default; you would need to call it explicitly and enable mixed precision.
TF32 isn't really FP32. As I understand it, there are a few different approaches to using the tensor cores, and the most common GEMM path is the model where two FP16 inputs are multiplied and accumulated into a single FP32 result. Thus it might be impossible to reach a precision level that the validator will accept.

I'm not sure how it really works in code; NVIDIA supplies at least a couple of different tensor modes producing different numbers (sparsity, accumulation types, and so on), so each might have its own uses and caveats.
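To illustrate the explicit opt-in, here is a rough sketch of what the two common modes look like through cuBLAS (my own example, not the Einstein code): FP16 inputs with FP32 accumulation via cublasGemmEx, and TF32 switched on for FP32 GEMMs via cublasSetMathMode.

// Sketch only: explicit mixed-precision GEMM through cuBLAS (CUDA 11+).
// d_A, d_B are __half device buffers; d_C is a float device buffer;
// matrices are m x k, k x n, m x n, column-major.
#include <cublas_v2.h>
#include <cuda_fp16.h>

void gemm_fp16_in_fp32_acc(cublasHandle_t handle, int m, int n, int k,
                           const __half* d_A, const __half* d_B, float* d_C)
{
    const float alpha = 1.0f, beta = 0.0f;

    // FP16 inputs, FP32 accumulation - the tensor-core path on Volta and up.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                 &alpha,
                 d_A, CUDA_R_16F, m,
                 d_B, CUDA_R_16F, k,
                 &beta,
                 d_C, CUDA_R_32F, m,
                 CUBLAS_COMPUTE_32F,          // accumulate in FP32
                 CUBLAS_GEMM_DEFAULT);
}

void enable_tf32(cublasHandle_t handle)
{
    // Opt in to TF32 for FP32 GEMMs (Ampere and up); TF32 is not full FP32 precision.
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);
}

Whether the numbers from either mode still pass the project validator is exactly the open question above.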