Using a Tesla M40 24GB VRAM for Hunyuan AI & video

There's no GPU board here, so I'm posting in general chat...

I'm wanting to get a Tesla M40 in the 24GB VRAM size to play around with the first AI project of any interest to me - the Hunyuan video model recently shared on GitHub (GitHub - kijai/ComfyUI-HunyuanVideoWrapper). I know NOTHING about AI, and it hadn't been on my radar as something I could use with affordable hardware, but then someone patched it to work with as little as 12-24GB of VRAM, so now I want to experiment with it.
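(For anyone checking their own hardware first: a quick, untested sketch, assuming a working CUDA build of PyTorch, that just prints how much VRAM each card reports - nothing Hunyuan-specific.)

```python
import torch  # assumes PyTorch built with CUDA support

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        # total_memory is in bytes; a 24GB M40 should report roughly 24 GiB
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB VRAM")
else:
    print("No CUDA device visible to PyTorch")
```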

It will serve a secondary purpose for DaVinci Resolve at high resolutions; 24GB should be enough even for 8-12K footage in the future, I'd think. A few games will probably get played too, but that's not at all the purpose of buying it or going through the hassle for what is otherwise roughly a GTX 1070.

This might serve as my gateway drug into other AI stuff I probably can't afford for a while, since everything seems to be VRAM-limited (or power-bill- and time-limited if you don't run it on a GPU), and I'm aware of no other remotely affordable way to run AI workloads. (I don't suppose these tools can make much use of multiple cards if they're not NVLinked?)

I'm aware there are issues with power, cooling, physical mounting, and possibly drivers... just wondering if anyone else has done it and has suggestions, guides, or good videos (I've searched and found a few, which are a bit lo-fi). It seems you need an EPS12V CPU-type connector to power it and have to rig up cooling (I've seen it done more than one way: end blowers and side fans). It's been long enough since I watched the videos that I forget what the physical mounting issue is, but I remember hearing of one at the time...
 

CyklonDX

Don't. The card is very weak. You'll be wasting watts and your time.

Get either a 3090 (~$700) or a Titan RTX (~$700) -> those are the best value for your time,

or a P40
*but again, the P40 is slow as hell - still much faster than the M40.
 
I'm not sure if I'll have $700 though. :P What kind of speed difference is there, and is there a site that tests and compares different GPUs with the same VRAM? I know VRAM is the main bottleneck.

I was mostly trying to get things working "at all," even if not perfectly well; I don't know if generating a 5-second video would take minutes, hours, or all day. I have zero experience playing with AI stuff. This just seemed like an entry-level way to do anything.
 

CyklonDX

M40 -> 3090: somewhere around 45x-100x.
e.g. generating a 1080p image on a 3090 would take ~5 sec; the same thing on an M40 would take around 10-20 min.
On a P40 I'd expect 2-5 min,
and on a Titan RTX, 15-25 sec.

Tensor cores really make a difference for AI stuff; support kinda started with the 20 series, but real support started with the 30 series (Ampere).
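If you want to see the gap yourself, here's a rough micro-benchmark sketch (my own illustration, assuming a CUDA build of PyTorch - not taken from any of the tools above): plain FP32 matmuls vs FP16 ones, which the 20-series and newer route through tensor cores.

```python
import time
import torch  # assumes PyTorch with CUDA support

def time_matmul(dtype, n=4096, iters=20):
    """Average the time of n x n matrix multiplies at the given precision."""
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        _ = a @ b
    torch.cuda.synchronize()
    return (time.time() - start) / iters

fp32 = time_matmul(torch.float32)
fp16 = time_matmul(torch.float16)  # hits tensor cores on 20-series and newer
print(f"FP32: {fp32*1000:.1f} ms, FP16: {fp16*1000:.1f} ms, speedup: {fp32/fp16:.1f}x")
```

On a Maxwell card the FP16 number won't improve much (no tensor cores); on Ampere it should come out several times faster.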
 

piranha32

The AI craze has reached even the ancient P40s. They used to sell for well below $200; today a quick search on eBay did not find any cards for less than $400, which is completely bonkers.
 
Hmm, if it's that huge of a time difference, I can't disagree. I just don't know if that's universal or something unique to image generation - I don't know enough about how AI works. I just want to get in on the absolute ground floor, since I don't know what I don't know, never having had a 24GB VRAM card to even try.

I think the ability to run things on lower-end cards might've raised the prices - things were cheaper two months back. :-/
 

CyklonDX

Most AI stuff runs at INT8 or FP16, or the tensor-core versions of those for a real uptick in performance.

Here's what you can expect in terms of raw performance:
M40: 6.8 TFLOPS, FP32 only
P40: 11 TFLOPS, FP32 only (it does have FP16 support, but it's dog sht), 47 TOPS INT8
P100 16GB: 19 TFLOPS with FP16 support, likely much higher TOPS than the P40
Titan RTX: 32 TFLOPS FP16, TOPS, and tensor FP16 support, likely giving you around 130 TFLOPS
3090: 35 TFLOPS FP16, TOPS, and tensor support for FP32, FP16, BF16, INT8, INT4 *which can be a game changer


(note the gain in TFLOPS with tensor cores on the 3090)
[attached chart: 3090 TFLOPS with and without tensor cores]
With AI workloads, tensor support for FP16 and INT8 is what puts Ampere ahead.
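If you're unsure what a given card actually has, here's a tiny sketch (my own, assuming PyTorch): NVIDIA exposes a compute capability per GPU, and tensor cores roughly begin at 7.0 (Volta/Turing), with Ampere at 8.x; Maxwell (M40) reports 5.2 and Pascal (P40) 6.1, so both fall below the line.

```python
import torch  # assumes PyTorch with CUDA support

major, minor = torch.cuda.get_device_capability(0)
name = torch.cuda.get_device_name(0)
# Rough generational cutoffs: Maxwell 5.x, Pascal 6.x, Volta/Turing 7.x, Ampere 8.x
if major >= 7:
    print(f"{name} (sm_{major}{minor}): tensor cores available")
else:
    print(f"{name} (sm_{major}{minor}): pre-tensor-core card, plain FP32 is the ceiling")
```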

Here's also an old test using tensor cores (it's a good visualization of how massive a difference they can make):
[attached chart: older tensor-core benchmark]


Here's a screenshot showing some AI image-gen comparisons *(though today, with ZLUDA, one could potentially run ComfyUI decently with all models @ FP16).
[attached screenshot: AI image-generation speed comparison]
 
I feel like I need to improve my understanding of AI in general before I can even make proper use of some of the information you posted. Is there somewhere you'd recommend reading or viewing to get myself up to speed? I read threads full of language I don't understand (https://www.reddit.com/r/LocalLLaMA/comments/1iczucy), and I don't know what tokens are or what offloading layers means. I know some people run things in RAM or on SSD, and I'd imagine that might be a slam dunk for Optane, but there's too much to catch up on to be an intelligent part of the conversation. So instead I'll just ask: where do you suggest I start learning?


I'll also mention I still haven't ruled out an M40, because as I said, I have a second use for it in DaVinci Resolve, which has to process in VRAM - although the card is a worse value than it was two months ago. A card used for video "that can run AI, but not well" is still better than one that can't run AI at all. A 386 isn't much of a computer, but if you have NO computer, it's more useful than nothing. I just want to dabble and get my feet wet, unless I'd be spending more on power at home than the roughly dollar-per-5-second-video the commercial services charge. If it's not very efficient until I get a serious card upgrade, well, I sort of expect that.

And I realize everything will have performance bottlenecks, but I don't know if Hunyuan has the same bottlenecks generating video as training, inference, image generation, or language models. I'm just trying to do college-student-budget video work, mostly not for pay yet, and anything beyond running this one program is gravy until I can afford an RTX 6090 in two years. My workload is small enough that the difference between 5 video generations and 30 in one night probably doesn't matter, since I'm the only user. As soon as I see the limits and get a "minimum cost of entry" baseline, I'll look to improve it, once that upgrade is worth more than a better DSLR or the other video upgrades I'll be choosing between. I'll ask other users how much faster their rigs are at making clips and see if that's what I most need at the time.
 

CyklonDX

OK, so running DeepSeek R1 (the full 323GB model) from SSD as an offload from memory is a no-no.
You will kill yourself before you get a response to "Hi".

Offloading to RAM or disk is slow, very slow, but it can get the job done if you have no other choice:
4x NVMe in RAID 0 is already a decent option for running large models - and I say decent, but in reality we're in shthole territory. Medium/long term it will actually be more expensive than owning 768GB or more of RAM.

Optane persistent memory is really good - it's fast (not as fast as RAM), but decent enough for large models.

How you should think about it:
If you have a 3090 with 24GB of VRAM and a model that needs 40GB, and image gen would take 10 sec using just VRAM, you can expect that offloading the excess to system RAM will push it to 1-2 minutes (if the RAM is decently fast).
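The mainstream diffusion tooling already wraps this kind of offload up for you; here's a minimal sketch, assuming the Hugging Face diffusers library (plus accelerate), using a Stable Diffusion checkpoint purely as a stand-in - the model name is illustrative, not anything Hunyuan-specific:

```python
import torch
from diffusers import StableDiffusionPipeline  # assumes diffusers + accelerate installed

# Illustrative checkpoint; any diffusers-format model loads the same way
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)

# Streams weights between system RAM and VRAM one module at a time:
# the slowest option, but it lets a model bigger than your VRAM run at all
pipe.enable_sequential_cpu_offload()

image = pipe("a test prompt").images[0]
image.save("test.png")
```

That's the tradeoff in the 10 sec vs 1-2 min example above: every layer takes a round trip over PCIe instead of staying resident in VRAM.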
 
Just to share, for others possibly interested, my own current thinking...


I'm still not set against an M40, but I might not buy the now-$200 24GB ones (which were $100 a month ago). The $40ish 12GB cards have me curious, but my next GPU will probably have 12GB one way or another, even if it's not an NVIDIA (I'm not sure how Intel Arc or AMD compare for speed). I also have to remember I'm ALSO doing video on this card in DaVinci Resolve, so it's not a one-trick pony. My experiments might just dead-end sooner than I planned if the age of the card is more of a problem than the VRAM.

That said...

I've recently learned there are other projects allowing video-clip generation at home on cards with up to 24GB or so of VRAM. I don't know if they require newer architectures (I'm hearing even the P40 is discouraged for some newer software, though it's only a year newer); maybe FP16 on the GPU plus a lot of system RAM is faster than putting the whole model in VRAM. I realize there's more to learn before I can decide on this AI issue, so I'm going to shift most of this conversation to the AI board here.

Random example.

Also, there's something called FastHunyuan - I saw someone posting about video made on a lowly 8GB card, and it's much faster than the original Hunyuan model. Plus other models I hadn't heard of before. So I'm going back to the research board for a while. :) If anyone has tail-end comments about PHYSICALLY USING the M40 card, even if it's no longer for AI and just for DaVinci liking a big-VRAM card, please post about the cooling, power, mounting, and driver problems you encountered or solved, because at $40 in one of my other old workstations to play around with, that's still negligible.
 

CyklonDX

(I'm hearing even the P40 is discouraged for some newer software, though it's only a year newer); maybe FP16 on the GPU plus a lot of system RAM is faster than putting the whole model in VRAM
I think you should first get a low-end GPU like a 2080/Ti and see what you really need for your use case.
As mentioned, the P40 has only two things going for it: P-states and VRAM - otherwise it's shit, and the M40 is many times worse.


With AMD and ZLUDA *(on Linux, for cards that support ROCm) you can potentially run at FP16
*(but note that newer NVIDIA GPUs run tensor INT8/FP16 at much higher performance for models that support tensor cores).

(The Radeon VII would be the only candidate on ROCm 5 - an outdated version that works with ZLUDA to a certain degree.)
With Intel, you are potentially in a worse situation than with AMD.



// noting
I did use an M40 some 8-10 years ago for PrimeGrid, and it was worse than even a 980 Ti in normal FP32 compute *(because of the clocks).
Both are Maxwell cards - and both became outdated overnight with the Pascal architecture.
 
Don't get me wrong, I'm not against taking your advice when it's time to do anything even slightly serious about AI.

Maybe it would help if I mention I'm building more than one computer? My girlfriend is going to have something from the RTX era by summer, maybe a 4070 or 5070 - I just can't use it most of the time if she's using it. So I'm trying to figure out what, IF ANY, micro-budget AI video is even worth attempting on a small scale. It might be that, at the end of the day, renting remote time on someone's AI workstation makes more sense than owning a slower card yourself (some things I've been reading suggest that), because that's cheaper than just using a commercial AI video service.

We can self-host a cloud, but we can also just rent existing ones when they make more sense. There's an experimental, hands-on homelab-learning aspect to all of this.

I still have a use for a cheap 12-24GB VRAM card in DaVinci Resolve at the end of the day too. :p None of the "here's why it sucks for AI" changes that secondary use, which may become primary, even if all AI use of it is verified "not worth the time" within 3 months.

Hence why I've tried to shift the AI discussion to a different thread and turn this back into a "physically using the M40" question. I promise I'm taking your AI insights into consideration; I just have more than one homelab project going on in parallel that could make use of a high-VRAM card.