Nvidia Tesla P40 24GB for $149.99 + shipping


efschu3

Active Member
But it can do INT8 and INT4?

Probably more interesting than FP16 for local LLMs.

(I have a 2080 Ti and a P40, and I'm running Llama 2 on it.)
 

EasyRhino

Well-Known Member
So this is sorta like a 1080 Ti with more VRAM and no driver limitations?

I didn't know the P models, so I was hoping this would be a cheap 3090 equivalent. Oh well.
 

Patriot

Moderator
So this is sorta like a 1080 Ti with more VRAM and no driver limitations?

I didn't know the P models, so I was hoping this would be a cheap 3090 equivalent. Oh well.
Pretty straightforward naming scheme: M is Maxwell, K is Kepler, P is Pascal, V is Volta, T is Turing... it gets a little screwy with Ampere, Ada Lovelace, Hopper, and Blackwell in six months.
Volta was the first generation with matrix (tensor) cores and Turing the second (RTX 2000 generation). Ampere re-synced desktop and server (A100 and RTX 3000); after that the lines split again: the "Quadro" RTX 4000-series workstation cards are Ada Lovelace, while the server parts are Hopper (H100).

I am chilling with first- and second-gen tensor cores:
V100s, RTX 8000s, MI100s.


Having run the full Llama 3 models across 4 MI100s... just get a 7900 XTX, 3090, or 4090 and run the condensed models; the bigger ones are cool to be able to run, but they are not terribly more accurate, IMO.
 

piranha32

Well-Known Member
The biggest advantage of the P40 is that you get 24 GB of VRAM for peanuts. It can run Stable Diffusion at reasonable speed, and decently sized LLMs at 10+ tokens per second.
Although a stock 2080 is more modern and faster, it is not a replacement for the P40, due to its much smaller VRAM. With the desktop running, I could only run SD on postage-stamp-sized images (300-400 px range). I did not try to run LLMs on it, but judging by RAM usage, I would not go much beyond Ollama with 3B, maaaaybe 7B models. The P40 can run 30B models without breaking a sweat, or even 70B models, but with much degraded performance (low single-digit tokens per second, or even slower).
It's a different story if you want to train or fine-tune a model, but for just using an LLM, even with its high power usage, the P40 is IMHO still the sweet spot for shoestring-budget builds.
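As a rough illustration (not from this post), here is a minimal llama-cpp-python sketch of running a quantized GGUF model fully offloaded to a single 24 GB card; the model file, context size, and prompt are placeholders:
Code:
# Minimal sketch with llama-cpp-python; the model path, context size and
# prompt are illustrative placeholders, not values from this thread.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3-8b-instruct.Q4_K_M.gguf",  # hypothetical local GGUF file
    n_gpu_layers=-1,  # offload all layers; an 8B Q4 model fits easily in 24 GB
    n_ctx=8192,       # context window
)

out = llm("Why does 24 GB of VRAM matter for local LLMs?", max_tokens=128)
print(out["choices"][0]["text"])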
 

UhClem

just another Bozo on the bus
FYI: If you want more than one, same seller has another listing (qty2 for $300) [Link], which accepts offers, has free shipping, AND (for 2 more days) has a 10% eBay coupon.
 

drros

New Member
This is what I'm getting from mine, a Dell R730 with 2x P40.
LLM: a GGUF of Llama 3 70B, Q4_K_M (with imatrix; not sure it affects performance), with 8k context:
Code:
Processing Prompt [BLAS] (1486 / 1486 tokens)
Generating (357 / 2048 tokens)
(EOS token triggered!)
CtxLimit: 1843/8192, Process:16.32s (11.0ms/T = 91.03T/s), Generate:80.01s (224.1ms/T = 4.46T/s), Total:96.33s (3.71T/s)

Processing Prompt (17 / 17 tokens)
Generating (723 / 2048 tokens)
(EOS token triggered!)
CtxLimit: 2583/8192, Process:1.08s (63.7ms/T = 15.70T/s), Generate:168.33s (232.8ms/T = 4.30T/s), Total:169.41s (4.27T/s)
It can be slower towards the end of the context, but rarely below 3 t/s. This is with koboldcpp 1.65. Overall, 70B models are absolutely usable.

Text to image generation:
Code:
got prompt
Requested to load StableCascadeClipModel
Loading 1 new model
Requested to load StableCascade_C
Loading 1 new model
100%|██████████| 30/30 [02:31<00:00,  5.04s/it]
Requested to load StableCascade_B
Loading 1 new model
100%|██████████| 20/20 [02:07<00:00,  6.35s/it]
Requested to load StageA
Loading 1 new model
Prompt executed in 320.39 seconds
This is Stable Cascade in ComfyUI generating a 1536x1536 image - 50 iterations overall.

Power consumption:
Idle - ~9 W each
With just a model loaded - ~50-52 W each
While inferencing - 70-80 W on one and 170-190 W on the other (they switch often).
 

piranha32

Well-Known Member
Power consumption:
Idle - ~9 W each
With just a model loaded - ~50-52 W each
While inferencing - 70-80 W on one and 170-190 W on the other (they switch often).
How did you tune the idle power usage?
 

CyklonDX

Well-Known Member
The P40 has normal power states, so ASPM handles it.
(The P100 doesn't.)


@drros those are really bad numbers; check whether you have ECC memory enabled.
Disable ECC on the VRAM and you'll likely jump some 30% in performance.

Code:
nvidia-smi -q
nvidia-smi --ecc-config=0
reboot
nvidia-smi -q   (confirm ECC is disabled)

(There should be a few other ways to go about it if this solution doesn't work.)
Note: some models are configured to use FP16 by default; you would need to check whether you can force INT8 on them, and if not, just use FP32 (anything is faster than the FP16 pipe on the P40).
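To make the "just use FP32" point concrete, here is a hedged PyTorch/diffusers sketch that keeps inference in full precision instead of half precision; the pipeline choice and model ID are placeholders, not the poster's actual setup:
Code:
# Sketch only: stay in FP32 on a P40 rather than casting to FP16,
# since GP102's FP16 throughput is crippled. Model ID is a placeholder.
import torch
from diffusers import StableDiffusionPipeline  # assumed choice of library

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # placeholder model
    torch_dtype=torch.float32,          # torch.float16 would be far slower on this card
).to("cuda")

image = pipe("a test prompt", num_inference_steps=30).images[0]
image.save("out.png")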

// Even so, I would recommend modded 2080s or a normal used 3090 for some 500-700 USD; they are many times faster (like 50-100x in some cases) for less power (not to mention noise).
 

Filez

Active Member
Not just that: the P100 has FP16, which GP102/the P40 doesn't. Only 16 GB of VRAM though :(
If I am looking to build XGBoost and PyTorch models on tabular data, does this matter? Asking for suggestions; I can't afford a 3090 but was considering a P40 for its VRAM.
 

Wasmachineman_NL

Wittgenstein the Supercomputer FTW!
If I am looking to build XGBoost and PyTorch models on tabular data, does this matter? Asking for suggestions; I can't afford a 3090 but was considering a P40 for its VRAM.
I would just save up for a 3090 instead. CUDA support can be yanked at any time for the P40.
 

Kahooli

Member
If I am looking to build XGBoost and PyTorch models on tabular data, does this matter? Asking for suggestions; I can't afford a 3090 but was considering a P40 for its VRAM.
Strictly speaking, if a card does FP32 it does FP16, but potentially with no performance uplift.
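A quick way to verify this on any given card is to time the same matmul in both precisions; a rough PyTorch sketch (matrix size and iteration count chosen arbitrarily):
Code:
# Rough sketch: compare FP32 vs FP16 matmul throughput on the installed GPU.
# On Pascal GP102 (the P40), FP16 typically comes out far slower, not faster.
import time
import torch

def bench(dtype, n=4096, iters=20):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        _ = a @ b
    torch.cuda.synchronize()
    return iters * 2 * n**3 / (time.time() - t0) / 1e12  # rough TFLOPS

print(f"FP32: {bench(torch.float32):.1f} TFLOPS")
print(f"FP16: {bench(torch.float16):.1f} TFLOPS")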
 

anewsome

Active Member
I think my take on these cards is a bit different from everyone else's here. I bought 4 of them at $175 each and I stand by my belief: it's the most GPU bang for the buck. The four of them cost just a bit more than my 4060 Ti/16 GB, and they give me way more VRAM and flexibility for allocating GPU where it's needed. Yes, they are slower than modern cards, but still pretty impressive, especially at $150 each.

I did a bunch of model training on them, rendered thousands of Blender frames, and sampled a bunch of LLMs that won't even run on my 4060 Ti. I used a local LLaVA model to auto-caption over 5,000 photos from my photo collection. Admittedly, most of those were captioned by the 4060 Ti, but the P40s chipped in quite a bit!

I don't have the exact numbers in front of me, but before captioning, my script resized images to no more than 1024 pixels on the long side. The 4060 Ti averaged maybe less than 10 seconds per image; a P40 was maybe 1-2 minutes per image. With all 4 P40s working in parallel, they weren't that much slower than the 4060 Ti.

I stand by it. Great deal at $150. I might even buy more.
 

CyklonDX

Well-Known Member
Strictly speaking, if a card does FP32 it does FP16, but potentially with no performance uplift.
It's actually downhill for this card.
(But it does have meh-but-decent-enough INT8.)

@drros it must be the heat/low clocks then, or your system scheduler/power profile. (I ran more complicated models for AI imaging and was in the 250-second range.)

@anewsome
BTW, how are you running NVLink/SLI with those cards? I don't think they support it. I don't think you can spread a single model query, or training, over multiple GPUs without it? (Or are you just scheduling separate model queries to each GPU?)
(An A4000 beats 4x P40s in AI image generation by 4x.)

// On my P40 setup a single query takes around 250 s to produce an image; with 4 of those it'd be around 62.5 s per image in aggregate, while a single A4000 with thermal issues gets me the same model's image in 22 s. The 4060 Ti has about the same specs as the A4000 (the 4060 Ti is a newer arch), making it roughly 11x faster than a single P40.

(I know people say "they can wait", but time is money, literally in this case.)
 

pututu

Member
The P100 is probably the only card of that generation worth getting at today's prices. The real place it shines is FP64 workloads in scientific computing - consumer cards simply take too much of a performance hit there, whereas the P100's FP64 rate is only 1/2 its FP32 rate.
I can agree with that, as I have a P100 and it is a great card for running molecular dynamics simulations, such as in the gpugrid.net project. Looking at today's top hosts sorted by "recent average credits", this Pascal-generation card still shows up in the top 20 hosts, since it is relatively cheap compared to the Titan V, 3090, 3080 Ti, etc. on that list. The FP64 FLOPS and high memory bandwidth provide an advantage in that specific use case, or in any computational task requiring high FP64 throughput and memory bandwidth.
 

anewsome

Active Member
It's actually downhill for this card.
(But it does have meh-but-decent-enough INT8.)

@drros it must be the heat/low clocks then, or your system scheduler/power profile. (I ran more complicated models for AI imaging and was in the 250-second range.)

@anewsome
BTW, how are you running NVLink/SLI with those cards? I don't think they support it. I don't think you can spread a single model query, or training, over multiple GPUs without it? (Or are you just scheduling separate model queries to each GPU?)
(An A4000 beats 4x P40s in AI image generation by 4x.)

// On my P40 setup a single query takes around 250 s to produce an image; with 4 of those it'd be around 62.5 s per image in aggregate, while a single A4000 with thermal issues gets me the same model's image in 22 s. The 4060 Ti has about the same specs as the A4000 (the 4060 Ti is a newer arch), making it roughly 11x faster than a single P40.

(I know people say "they can wait", but time is money, literally in this case.)
Not using any kind of NVLink or SLI with my P40s; each one just works on its own. Blender renders are 1 card = 1 frame. A moderately complex frame at full HD or 4K is 5 minutes or more on a P40, but 20 minutes on my fastest CPU. A few minutes of a Blender animation is thousands of frames, so cheap rendering is welcome, and queuing up renders across a bunch of GPUs is pretty simple.

For the photo auto-captioning, same thing, 1 photo - 1 GPU. Even with 7 total GPUs working in parallel, captioning 5,000+ photos still tied up ALL the GPUs for a few days, but the P40s demonstrated their value.

Training the Piper TTS voice models was a bit different. The Piper TTS training scripts use PyTorch Lightning, which is able to train across multiple GPUs using something called "DDP". Detailed here: GPU training (Intermediate) — PyTorch Lightning 2.2.5 documentation

I was able to get DDP working across the network, but it was useless: the 1 GbE network wasn't fast enough to help. With 10 or 25 GbE, DDP over a network might have been worthwhile. I had 3 of the P40s in one system, so I used the DDP strategy to train one of my TTS models on those three. I also trained a few models on the 4060 Ti, and if I remember correctly, I was able to train faster on 3x P40s compared to one 4060 Ti. The batch sizes could be bigger with 3x 24 GB of VRAM, compared to 16 GB of VRAM on the 4060 Ti. It still took weeks to train, but it was fun doing it with those trash-heap P40s.
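For anyone curious, a minimal sketch of what single-node, multi-GPU DDP looks like in PyTorch Lightning 2.x; the toy model and random data are placeholders, not the actual Piper training code:
Code:
# Minimal single-node DDP sketch with PyTorch Lightning (2.x API).
# ToyModel and the random dataset are placeholders, not the Piper TTS code.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import lightning as L

class ToyModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

if __name__ == "__main__":
    data = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
    trainer = L.Trainer(
        accelerator="gpu",
        devices=3,           # e.g. three P40s in one box
        strategy="ddp",      # one process per GPU, gradients all-reduced each step
        precision="32-true", # full precision; FP16 is crippled on Pascal
    )
    trainer.fit(ToyModel(), DataLoader(data, batch_size=64))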

I run my local LLMs using Ollama, which utilizes all the GPUs in the system too, and it seems to work pretty well at spreading the load across multiple GPUs. The P40s were useless for interactive LLMs, though: the responses were so slow that I usually just went to my other Ollama instance running on faster GPUs. Stable Diffusion, same story - I'm usually not patient enough to wait for the P40s to make an image.
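As an aside, here is a minimal sketch of querying a local Ollama instance from Python; the model tag is a placeholder, and which GPUs actually serve the request is determined by the environment the Ollama service was started with (e.g. CUDA_VISIBLE_DEVICES), not by the client:
Code:
# Minimal sketch: query a local Ollama server over its HTTP API.
# The model tag is a placeholder; GPU selection happens on the server side.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3:8b",  # placeholder tag
        "prompt": "Summarize the trade-offs of a Tesla P40 for local LLMs.",
        "stream": False,       # return a single JSON object instead of a stream
    },
    timeout=300,
)
print(resp.json()["response"])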

I'm not planning on ditching the P40s, but they have been powered off for a few weeks. I'll bring them back online when I have something for them to do. All of my GPU-requiring loads are currently running across 3x 4060 Tis.
 

CyklonDX

Well-Known Member
multiple GPUs using something called "DDP".
I'm interested in DDP, but it looks like a hell of a lot to read up on. Does it do a merge/join? What happens if your workload exceeds 24 GB during the join? Is it done in system memory? Swapping?


I was able to train faster on 3x P40s compared to one 4060 Ti
Do you recall the precision you were using? I could see that if you were using FP32
(as the 4060 Ti has only ~22 TFLOPS, while 3x P40s come to somewhere around 35 TFLOPS).
 

anewsome

Active Member
I'm interested in DDP, but it looks like a hell of a lot to read up on. Does it do a merge/join? What happens if your workload exceeds 24 GB during the join? Is it done in system memory? Swapping?

Do you recall the precision you were using? I could see that if you were using FP32
(as the 4060 Ti has only ~22 TFLOPS, while 3x P40s come to somewhere around 35 TFLOPS).
The training for this particular project was pretty stable and easy to work with. I had no issues with a training run exceeding GPU memory, since training would fail immediately if the batch size was set too high and GPU memory would be exceeded. And by immediately, I mean before the first pass was complete, which could take 5, 10, or 20 minutes depending on the parameters used and how many voice samples there were. Very reliably, if the first pass completed, it would run until model training was complete.

Running across the 3 GPUs didn't let me set a higher batch size compared to 1 GPU, but 3 GPUs at batch size X was definitely faster than 1 GPU at batch size X.

I can't recall if the training used FP32 or not. I remember using "--precision 32" as one of the arguments in the training script, so yeah, maybe.

But on the point of using DDP to utilize multiple GPUs: it was a lot of tinkering, since training models is not really something I do or know much about. The project's documentation had zero hints on how to actually do it, no one on the project's GitHub offered any advice, and the PyTorch Lightning documentation was too general to be useful, so I just kept tinkering until I got it to work.

I detailed some of my notes on GitHub: Guidance or examples on multi-node training · Issue #330 · rhasspy/piper
 

CyklonDX

Well-Known Member
OK, so nothing revolutionary; I was hoping they allowed for datasets larger than a single GPU ~ with some magic trick ~.
The main problem I'm having in Scala with Spark RAPIDS is that, after processing part of the training data on each node, it goes back to a single GPU node for the join (and that can blow up if the combined training data is larger than the local VRAM of a single GPU, unless the cards are running over NVLink and can pool VRAM).