> So this is sorta like a 1080ti with more VRAM and no limitations in drivers? I didn't know the P models, so I was hoping this would be a 3090 equivalent for cheap. Oh well.

Pretty straightforward naming scheme.
Processing Prompt [BLAS] (1486 / 1486 tokens)
Generating (357 / 2048 tokens)
(EOS token triggered!)
CtxLimit: 1843/8192, Process:16.32s (11.0ms/T = 91.03T/s), Generate:80.01s (224.1ms/T = 4.46T/s), Total:96.33s (3.71T/s)
Processing Prompt (17 / 17 tokens)
Generating (723 / 2048 tokens)
(EOS token triggered!)
CtxLimit: 2583/8192, Process:1.08s (63.7ms/T = 15.70T/s), Generate:168.33s (232.8ms/T = 4.30T/s), Total:169.41s (4.27T/s)
got prompt
Requested to load StableCascadeClipModel
Loading 1 new model
Requested to load StableCascade_C
Loading 1 new model
100%|██████████| 30/30 [02:31<00:00, 5.04s/it]
Requested to load StableCascade_B
Loading 1 new model
100%|██████████| 20/20 [02:07<00:00, 6.35s/it]
Requested to load StageA
Loading 1 new model
Prompt executed in 320.39 seconds
> Power consumption:
> without usage - ~9W each
> with just a model loaded - ~50-52W each
> while inferencing - 70-80W on one and 170-190W on the other (they switch often).

How did you tune the idle power usage?
> Disable ecc on ram, and you'll likely jump some 30% in performance.

ECC is already disabled, if I'm reading it right:

ECC Mode
    Current : Disabled
    Pending : Disabled
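For anyone who'd rather script these checks than parse nvidia-smi output, here's a minimal sketch using the nvidia-ml-py (pynvml) bindings, under the assumption that they expose NVML's power and ECC queries under the names below. Actually flipping ECC is still done with `nvidia-smi -e 0` plus a reboot.

```python
# Sketch: read per-GPU power draw and ECC mode via NVML (pip install nvidia-ml-py).
# Assumes pynvml exposes nvmlDeviceGetPowerUsage / nvmlDeviceGetEccMode as below;
# ECC queries only work on cards that have ECC at all (Tesla/Quadro class).
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)

        watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # reported in milliwatts

        # Returns (current, pending): 0 = disabled, 1 = enabled. Changing it is
        # done with `nvidia-smi -e 0` and takes effect after a reboot.
        current, pending = pynvml.nvmlDeviceGetEccMode(handle)

        print(f"GPU {i}: {watts:.1f} W, ECC current={current} pending={pending}")
finally:
    pynvml.nvmlShutdown()
```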
> Not just that, the P100 has FP16, which GP102/the P40 doesn't. Only 16GB VRAM though!

If I'm looking to build xgboost and pytorch models on tabular data, does this matter? Asking for suggestions - I can't afford a 3090, but was considering the P40 for its VRAM.
> If I'm looking to build xgboost and pytorch models on tabular data, does this matter? Asking for suggestions - I can't afford a 3090, but was considering the P40 for its VRAM.

I would just save up for a 3090 instead. CUDA support can be yanked at any time for the P40.
> If I'm looking to build xgboost and pytorch models on tabular data, does this matter? Asking for suggestions - I can't afford a 3090, but was considering the P40 for its VRAM.

Strictly speaking, if a card does FP32 it does FP16, but potentially with no performance uplift.
> Strictly speaking, if a card does FP32 it does FP16, but potentially with no performance uplift.

It's actually downhill for this card.
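To make the "no uplift, or worse" point concrete, here's a rough timing sketch one could run (matrix size and iteration count are arbitrary). On GP102 cards like the P40, fp16 runs at 1/64 of the fp32 rate, so the second number should come out far lower; on a P100 or anything newer it should at least match.

```python
# Rough GEMM throughput check: fp32 vs fp16 on whatever GPU is visible.
# Numbers are only meaningful relative to each other, not as peak TFLOPS.
import time
import torch

def matmul_tflops(dtype, n=4096, iters=20):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    # A square n x n matmul costs roughly 2*n^3 floating-point operations.
    return 2 * n ** 3 * iters / (time.time() - start) / 1e12

print(f"fp32: {matmul_tflops(torch.float32):.2f} TFLOPS")
print(f"fp16: {matmul_tflops(torch.float16):.2f} TFLOPS")
```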
> The P100 is probably the only one of that generation worth getting at today's prices. The real place it shines is float64 workloads in scientific computing - consumer cards simply take too much of a performance hit, where the P100 is only 1/2 the 32-bit rate.

I can agree with that, as I have a P100 and it is a great card for running molecular dynamics simulations such as in the gpugrid.net project. Looking at today's top hosts sorted by "recent average credit", this Pascal-generation card still shows up in the top 20 hosts, since it is relatively cheap compared to the Titan V, 3090, 3080 Ti, etc. on that list. The FP64 FLOPS and high memory bandwidth provide an advantage in that specific use case, or in any computational task requiring high FP64 throughput and memory bandwidth.
> It's actually downhill for this card.
> (but it does have meh-but-decent-enough int8)
> NVIDIA Tesla P40 Specs - NVIDIA GP102, 1531 MHz, 3840 Cores, 240 TMUs, 96 ROPs, 24576 MB GDDR5, 1808 MHz, 384-bit (www.techpowerup.com)

Not using any kind of NVLink or SLI with my P40s. Each one just works on its own. Blender renders are 1 card = 1 frame. A moderately complex frame at full HD or 4K takes 5 minutes or more on a P40, but 20 minutes on my fastest CPU. A few minutes of a Blender animation is thousands of frames, so cheap rendering is welcome, and queuing up renders across a bunch of GPUs is pretty simple.
@drros Must be the heat/low clocks then - or your system scheduler/power profile. (I ran more complicated models for AI imaging and was in the 250-second range.)
@anewsome
btw, how are you running NVLink/SLI with those cards? I don't think they support it. I don't think you can spread a single model query, or training, over multiple GPUs without it? (Or are you just scheduling separate model queries to each card?)
(An A4000 beats 4x P40s in AI image generation by 4x.)
On my P40 setup a single query takes around 250 sec to produce an image; if you had 4 of those it'd be around 62.5 sec per image amortized, while a single A4000 with thermal issues gets me the same model's image in 22 sec. The 4060 Ti has about the same specs as the A4000 (the 4060 Ti is a newer architecture), making it ~11x faster than a single P40.
(I know people say "they can wait", but time is money, literally in this case.)
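On the "spread a single model query over multiple GPUs without NVLink" question above: for inference the usual approach is splitting layers across cards over plain PCIe - llama.cpp-based backends do this with their tensor-split option, and in the PyTorch world Hugging Face Accelerate does it via device_map="auto". A minimal sketch of the latter; the model id is a placeholder, not something from this thread:

```python
# Sketch: shard one model's layers across every visible GPU over PCIe
# (no NVLink/SLI needed) using Transformers + Accelerate.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-model-here"  # placeholder: any causal LM checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" (requires the accelerate package) splits layers across all
# GPUs and spills to CPU RAM if they still don't fit; activations hop between
# cards over PCIe during generation.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,  # on a P40, float32 may actually be the faster choice
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Training is a different story: DDP (discussed below) replicates the whole model on every GPU rather than splitting it.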
> multiple GPUs using something called "DDP".

I'm interested in the DDP part, but it looks like a hell of a lot to read up on - does it do a merge/join? What happens if your workload exceeds 24G during the join? Is it done in system memory? Swapping?
> I was able to train faster on 3x P40s compared to 1 4060ti.

Do you recall the precision you were using? I could see that if you were using fp32.
> I'm interested in the DDP part, but it looks like a hell of a lot to read up on - does it do a merge/join? What happens if your workload exceeds 24G during the join? Is it done in system memory? Swapping?

The training for this particular project was pretty stable and easy to train with. I had no issues with a training run exceeding GPU memory, since the training would fail immediately if the batch size was set too high and GPU memory would be exceeded. And by immediately, I mean before the first pass was complete, which could take 5, 10 or 20 minutes depending on the parameters used and how many voice samples there were. Very reliably, if the first pass completed, it would run until the model training was complete.
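For anyone curious what the multi-GPU training setup roughly looks like, here's a minimal, self-contained DDP sketch with a toy model - not the voice-training code discussed above. It also answers the 24G question: DDP is data-parallel, so each GPU holds a complete copy of the model and only gradients are averaged across cards; a single replica still has to fit in one card's VRAM, and nothing is merged or joined in system memory.

```python
# Minimal PyTorch DistributedDataParallel sketch: one process per GPU,
# full model replica on each, gradients averaged with NCCL all-reduce.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Toy model; in real use each rank also gets its own shard of the data,
    # usually via DistributedSampler.
    model = DDP(torch.nn.Linear(512, 512).cuda(rank), device_ids=[rank])
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    for _ in range(10):
        x = torch.randn(64, 512, device=f"cuda:{rank}")
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()  # gradients are all-reduced (averaged) across GPUs here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # e.g. 3 on a 3x P40 box
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

mp.spawn keeps the example self-contained; torchrun is the more common launcher in practice.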
> Do you recall the precision you were using? I could see that if you were using fp32.
(As the 4060 Ti has only ~22 TFLOPS, while 3x P40s come to somewhere around 35 TFLOPS.)
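Back-of-the-envelope check on those figures, using spec-sheet FP32 numbers (real scaling across three cards is never perfectly linear, and this ignores memory bandwidth and PCIe overhead):

```python
# Paper FP32 throughput only; bandwidth and interconnect effects will eat
# into the aggregate number in practice.
p40_fp32_tflops = 11.76        # per card, GP102 at boost clock
rtx4060ti_fp32_tflops = 22.06

aggregate = 3 * p40_fp32_tflops
print(f"3x P40: {aggregate:.1f} TFLOPS")                        # ~35.3
print(f"vs 4060 Ti: {aggregate / rtx4060ti_fp32_tflops:.2f}x")  # ~1.6x on paper
```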