Best hardware option for Qwen 72b and 80b models without spending 10k€?


Mashie

Member
Jun 26, 2020
40
11
8
@mashrooms thank you. I dug a bit deeper and it seems that both a DGX Spark and Ryzen AI Max+ are plenty fast for large MoE models but struggle with the bigger dense models due to their limited memory bandwidth. There is no hard data to be found on this other than the usual benchmaxed numbers and some user experiences.
I do still plan to build a local AI infrastructure, but I think I should first try the models via openrouter, deepinfra or a similar service. While many hail "open" models like Qwen as nearly as good as ChatGPT or Claude, it might just be that they would not cope well with the things I intend to do with them ... which is not vibe coding until it appears to work and calling it a day.
I'm not sure you will find anything within your budget that will work well with a large dense model. I mean a dual H100 setup would be perfect but that is 8x what you want to spend.
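To put rough numbers on the bandwidth point: during token generation every active weight has to be read from memory once per token, so memory bandwidth divided by the bytes of active weights gives an upper bound on tokens per second. A minimal sketch follows. The 256 GB/s bandwidth, the 4-bit quantization (0.5 bytes per parameter) and the 3B active parameters for the MoE case are assumptions for illustration, not measurements of any specific machine, and real throughput will be lower.

```python
def decode_tokens_per_second(bandwidth_gb_s, active_params_billion, bytes_per_param):
    """Upper bound on decode speed when generation is memory-bandwidth bound.

    Each generated token requires reading all active weights once, so
    tokens/s <= bandwidth / (active parameters * bytes per parameter).
    """
    gb_read_per_token = active_params_billion * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


# Assumed values for illustration only (not benchmarks):
BANDWIDTH_GB_S = 256.0   # hypothetical unified-memory machine
BYTES_PER_PARAM = 0.5    # 4-bit quantization

dense = decode_tokens_per_second(BANDWIDTH_GB_S, 72, BYTES_PER_PARAM)  # dense 72B
moe = decode_tokens_per_second(BANDWIDTH_GB_S, 3, BYTES_PER_PARAM)     # MoE, 3B active

print(f"dense 72B:      {dense:.1f} tok/s upper bound")
print(f"MoE, 3B active: {moe:.1f} tok/s upper bound")
```

Under these assumptions the dense model tops out at single-digit tokens per second while the MoE model has more than an order of magnitude of headroom, which matches the "fast for MoE, struggles with dense" experience quoted above.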
 

unwind-protect

Well-Known Member
Mar 7, 2016
617
252
63
Boston
I just can't do a multi-GPU setup. For one, I'll probably waste more of my very scarce time building and fiddling with it than using it, and I also don't have the space for it.
Has anyone actually used a DGX Spark, Ryzen AI Max or Mac Studio M1/M3 Ultra and can comment on whether it's usable in a single-user, non-work environment? I do not really care if I get an answer to my prompt in 500ms or 5s.
On second thought, I think my thread title was ill-conceived.
The Philadelphia Linux User Group has several people with Ryzen 395 128 GB systems for playing with agents. They are satisfied, but I'm not up to date on which models they are running or how fast. The same people warned me that Macs don't run all models well due to limitations in the math libraries provided by Apple.
 

TrashMaster

Active Member
Sep 8, 2024
116
87
28
If you are trying to do CPU-assisted model running, a couple of points.

Get some kind of GPU, any kind of GPU. Your prompt processing will improve dramatically. This is huge when you have more than a tiny context. Otherwise expect to wait minutes for the first output token, even if most of the model is offloaded to RAM.

Memory speed is king. I'm not saying you need fast RAM: you can get way more DRAM bandwidth from more channels with slower DIMMs. If you are buying ES/QS EPYCs, be aware that most of them have maybe 60% of the memory channels the production versions have (and much slower max clock speeds).
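The channels-versus-DIMM-speed point can be checked with simple arithmetic: theoretical peak DRAM bandwidth is channels × transfer rate × 8 bytes (a 64-bit data bus per channel). A small sketch, where both configurations are hypothetical examples picked for illustration rather than specific products:

```python
def peak_bandwidth_gb_s(channels, mega_transfers_per_s, bus_width_bytes=8):
    """Theoretical peak DRAM bandwidth in GB/s.

    channels * MT/s * bytes per transfer gives MB/s; divide by 1000 for GB/s.
    Sustained real-world bandwidth is lower than this peak.
    """
    return channels * mega_transfers_per_s * bus_width_bytes / 1000


# Hypothetical configurations for illustration:
desktop = peak_bandwidth_gb_s(channels=2, mega_transfers_per_s=6400)   # few channels, fast DIMMs
server = peak_bandwidth_gb_s(channels=12, mega_transfers_per_s=4800)   # many channels, slower DIMMs

print(f"2 channels  @ 6400 MT/s: {desktop:.1f} GB/s")
print(f"12 channels @ 4800 MT/s: {server:.1f} GB/s")
```

The many-channel setup with slower DIMMs comes out at roughly 4.5x the peak bandwidth of the dual-channel one, which is also why a part with fewer working memory channels loses a proportional share of its bandwidth.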
 