Best hardware option for Qwen 72b and 80b models without spending 10k€?

MBastian

Active Member
Jul 17, 2016
337
99
28
Germany
Currently I am chugging along with a single RTX 3090 FE in my workstation. Qwen 27B is nice, but 32B is already rather awkward and I have to go headless to get a halfway usable context window. I considered adding a second 3090, but decided against it for various reasons.

Right now I have my eyes set on the Qwen3-next-80B model at NVFP4 or Q5. But maybe that's just me dreaming and 32B would already be fine for my (non-work related) needs?
With the latest price hikes I am just unwilling to throw even more money at Anthropic or OpenAI. They are still burning through money, and my personal opinion is that they will need to raise prices massively. I also do not think we will see any meaningful price reductions after the bubble bursts. Imho, it will be quite the opposite, as all AI addicts will scramble to secure whatever hardware they can to keep going.

The most expensive option would be to shell out nearly 10k€ for an Nvidia Pro 6000 Blackwell with 96GB VRAM. The 72GB Pro 5000 model could work as well, but it still requires RAM offloading, which can lead to erratic TPS, and it still costs around 7200€. Right below that sits the M3 Ultra Mac Studio with 128GB unified memory, but it currently is not available in Europe. The 96GB model feels somewhat like a trap to me.
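
As a rough sanity check on which of these capacities can hold an 80B model without offloading, the weight footprint can be estimated from parameter count and bits per weight. This is only a sketch: the bits-per-weight values and the `reserve_gb` headroom for KV cache and runtime overhead are assumptions, not measured file sizes.

```python
# Rough weight-only footprint; KV cache and runtime overhead come on top.
# The bits-per-weight values below are assumptions, not measured file sizes.
ASSUMED_BITS = {"nvfp4": 4.5, "q5": 5.5, "q8": 8.5, "bf16": 16.0}


def weight_gb(params_billion, bits_per_weight):
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion, fmt, vram_gb, reserve_gb=8.0):
    """True if weights plus a hypothetical reserve fit into vram_gb."""
    return weight_gb(params_billion, ASSUMED_BITS[fmt]) + reserve_gb <= vram_gb


for fmt in ("nvfp4", "q5", "q8"):
    size = weight_gb(80, ASSUMED_BITS[fmt])
    print(f"80B @ {fmt}: ~{size:.0f} GB, "
          f"fits 48 GB: {fits(80, fmt, 48)}, fits 96 GB: {fits(80, fmt, 96)}")
```

Under these assumptions an 80B model at roughly 4.5 bits per weight needs about 45 GB for weights alone, which is why a 48GB card would have to offload once context is added.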

The more affordable options for me are:
  • Nvidia Pro 5000 48GB — Even with NVFP4 and RAM offloading it probably would be unbearably slow for Qwen-next-80B.
  • DGX Spark — On paper this seems more potent than a Ryzen AI Max+ system, but user experiences do not really seem to support that. The dev environment also appears to be rather beta right now.
  • Ryzen AI Max+ mini PCs — Probably the most budget-friendly option, but they seem sluggish with larger models due to the mediocre memory bandwidth. They are also not dramatically cheaper than a DGX Spark.
Does anyone have some experience with these? It is very hard to find hard data on what these platforms are actually capable of and what the downsides are. Asking Claude or ChatGPT (both paid versions with "frontier" models available) only adds to my confusion.

I would settle for this: It is OK if it takes upwards of two minutes from the first prompt on a larger software project, as long as subsequent prompts are reasonably fast. It is not for work, but for ambitious home projects.
 

foureight84

Well-Known Member
Jun 26, 2018
458
387
63
Since you already have a 3090, I would think that getting three more 3090s would be the more cost-effective route, and that gives you 96GB of VRAM. I would also use ik_llama.cpp over llama.cpp as your inference engine, or even vLLM (best for concurrent requests). I prefer ik_llama.cpp over llama.cpp since it's more CUDA-focused and has cutting-edge features that will get the most performance out of your Nvidia hardware.

I'm currently using three RTX 3090s with 27B Q8 and I get about 48-62 tokens/s in real-world coding (ik_llama.cpp with MTP and dual speculative decoding, thinking turned off). My usage strategy is to use Deepseek V4 Pro for thinking/planning/code review. Qwen3.6-27b is solely for coding and carrying out those planned tasks.

But keep in mind that you're splitting between 4 cards. I don't believe there are GGUFs for ik_llama.cpp with MTP for this model, but MTP should work with vLLM. With P2P driver hacks enabled, or with the 4 cards on the same PLX switch plane, you'll probably get somewhere in the high 20s of tokens/s with vLLM (without MTP, probably much higher with MTP) and probably slightly lower numbers with ik_llama.cpp.
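
To give a feel for why MTP and speculative decoding can raise throughput so much, here is a minimal sketch of the usual back-of-the-envelope estimate: if each drafted token is accepted independently with probability `accept_rate`, one verification pass of the main model yields on average (1 - a^(k+1)) / (1 - a) tokens for a draft length of k. The independence assumption and the example acceptance rates are assumptions for illustration, and the cost of producing the drafts is ignored.

```python
def expected_tokens_per_pass(accept_rate, draft_len):
    """Average tokens emitted per main-model pass under i.i.d. acceptance."""
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if accept_rate == 1.0:
        # Every drafted token is accepted, plus the one token the pass itself yields.
        return draft_len + 1.0
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


# Hypothetical acceptance rates, not measurements from any model.
for rate in (0.5, 0.7, 0.9):
    print(f"accept={rate}: {expected_tokens_per_pass(rate, 3):.2f} tokens per pass")
```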
 

MBastian

Why Qwen3-Next if 3.6 is long since out?
I mixed that up, it's qwen3-coder-next. If you trust the internet it's still superior to Qwen3.6-72b when it comes to coding tasks but needs hand holding. Anyway, let's just settle on 72b to 80b models.

Since you already have a 3090, I would think that getting three more 3090s would be the more cost-effective route, and that gives you 96GB of VRAM.
I thought about that but I simply do not have the slots available in my main rig. Best I could do would be two 3090s but that would require buying a bigger case so I can mount either one 3090 or the XXV710-DA2 vertically. For three 3090s I would have to build a new box from scratch. Given current prices that would cost nearly as much as buying a Ryzen AI Max or DGX Spark as I would also need to buy another GPU for my workstation.
 

foureight84

I mixed that up, it's qwen3-coder-next. If you trust the internet it's still superior to Qwen3.6-72b when it comes to coding tasks but needs hand holding. Anyway, let's just settle on 72b to 80b models.


I thought about that but I simply do not have the slots available in my main rig. Best I could do would be two 3090s but that would require buying a bigger case so I can mount either one 3090 or the XXV710-DA2 vertically. For three 3090s I would have to build a new box from scratch. Given current prices that would cost nearly as much as buying a Ryzen AI Max or DGX Spark as I would also need to buy another GPU for my workstation.
You can get a PCIe PLX extension and put all 4 cards on the same plane: https://forums.servethehome.com/index.php?threads/new-chinese-pcie-switch-board-gpu-testing.52488/ Combine it with an open case such as https://www.amazon.com/Mining-Computer-Currency-Bitcoin-Accessories/dp/B09CNG58R1 for a relatively cheap swap.
 

MBastian

Interesting concept, but far too bulky (and probably noisy) for the limited space I have available. I'd rather have something that either fits in my workstation without alterations or is silent and compact enough to sit on a shelf above my workplace. So the budget option for me would be either a DGX Spark or Ryzen AI, provided they aren't too sluggish with larger models. Even a Mac Studio M3 Ultra with 96GB would be cheaper than an Nvidia Pro 6000 Blackwell.
Instead of going with claims from people on Reddit or YouTube, or whatever Claude or ChatGPT claims, I thought I had better ask here whether someone has first-hand experience.
 

bayleyw

Active Member
Jan 8, 2014
347
125
43
I wouldn't go under 128GB of VRAM if your aim is to buy something to run LLMs on. 96 is really tight - it's barely enough for Qwen-122B at full context, which means that once more models with 1M context, or something with 150B params, roll out, you won't be able to take advantage of them. Here are a few unpopular options that are nevertheless interesting:
  • 4x MI250/MI255: The Supermicro MI250 carrier will run in a normal computer with some effort. MI250 is actually 2x MI210 on a single board, so the resulting system is topologically similar to a DGX-1 but has 512GB of VRAM and bf16. May or may not be supported by anything.
  • 8x Habana Gaudi2: 768GB of VRAM for $18K, but is it really worth saving $25K to not have Nvidia?
  • 8x V100-SXM2: Volta is aging well, there are several projects that aim to bring modern models to V100. Unfortunately, not having bf16 sucks.
  • A single MI250 on a carrier: really not fun, because OAM modules are 48V...
  • 4x RTX 8000: 192GB, but same problem as Volta - no bf16.
  • 4x Intel Max 1100: 192GB for $8K, but it is unclear if PyTorch runs on Max 1100.
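
To illustrate why long context eats into a 96GB or 128GB budget, here is a minimal sketch of the standard KV cache size estimate for a plain transformer with grouped-query attention. The layer count, KV head count, and head dimension below are a hypothetical configuration, not the published values of any Qwen model, and architectures that mix in sliding-window or linear attention layers need less than this.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """KV cache size in GiB for full attention in every layer."""
    # Factor 2 covers keys and values; bytes_per_value=2 means fp16/bf16 cache.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens / 2**30


# Hypothetical model shape: 64 layers, 8 KV heads, head dimension 128.
for ctx in (32_768, 262_144, 1_048_576):
    print(f"{ctx:>9} tokens: {kv_cache_gib(64, 8, 128, ctx):.0f} GiB of KV cache")
```

With this assumed shape, a 256K-token context alone takes 64 GiB on top of the weights, which is the kind of pressure that makes 96GB feel tight.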
 

Styp

Member
Aug 1, 2018
81
31
18
I have a solid background in Computer Vision, but I’ve also done a fair share of LLM work; not just using them, but also fine-tuning, post-training, DPO, RL, etc. I wanted to share my perspective on current hardware choices…
My general take: if you're primarily a user looking for a unified memory architecture to run inference, cross-platform options like the Mac Studio Ultra or one of the new AMD Ryzen AI Max+ 395 (Strix Halo) mini-PCs with 128GB are solid choices, but consider more than 128GB, as this is where the platform shines (Studio Ultra!).
However, once you transition into research, tinkering, post-training, and deep dives into PyTorch, the NVIDIA ecosystem is just way better. I’ve given this advice numerous times to people asking me what to buy:
  • For learning and entry-level tinkering: Go with NVIDIA. An RTX 5080 16GB is perfectly fine for learning and understanding how the pipeline works: take a 3B model and just apply all the techniques you want.
  • For heavy models (like the new Qwen 3.6): You’re going to want massive VRAM on a single card. The NVIDIA RTX PRO 6000 Blackwell (96GB) is the card to aim for here - and it is the only option in my opinion!
  • The NVIDIA DGX Spark: This is a fantastic desktop platform if you eventually need a seamless scaling path to a DGX Workstation or a multi-GPU cloud setup; it just works. It’s decently fast, though at ~$3k for a standalone device, it’s quite expensive for the form factor. Beta-ish behavior, portability issues, etc. (I am struggling to commit here as well…).
  • The AMD Alternative for VRAM on a budget: If you need VRAM but want to avoid the NVIDIA price, look into a dual AMD Radeon PRO W7800/7900 48GB setup. Yes, multi-GPU can be a struggle to set up for some workloads, but getting 96GB of VRAM for under $3,500 USD with almost 900 GB/s of bandwidth is a very compelling value proposition. In terms of throughput, it should be roughly on par with an RTX 4090, maybe slightly slower, but you simply won't find another 96GB solution at that price point.
 

bayleyw

Interesting concept, but far too bulky (and probably noisy) for the limited space I have available. I'd rather have something that either fits in my workstation without alterations or is silent and compact enough to sit on a shelf above my workplace. So the budget option for me would be either a DGX Spark or Ryzen AI, provided they aren't too sluggish with larger models. Even a Mac Studio M3 Ultra with 96GB would be cheaper than an Nvidia Pro 6000 Blackwell.
Instead of going with claims from people on Reddit or YouTube, or whatever Claude or ChatGPT claims, I thought I had better ask here whether someone has first-hand experience.
I think at this type of spend you really need to define your goals. If your goal is to not pay for Claude for non-work projects, just use GLM 5.1 or Kimi 2.6 on OpenRouter. The cloud providers get more out of their hardware than you do, because they can batch requests to increase GPU utilization and hide memory accesses.
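
The batching argument can be sketched with a toy roofline model: during decoding the weights are streamed from memory once per step no matter how many sequences share that step, so aggregate throughput grows with batch size until compute becomes the limit. All hardware numbers below describe a hypothetical accelerator, and KV cache traffic is ignored.

```python
def aggregate_tps(batch, weight_gb, bandwidth_gbs, params_billion, tflops):
    """Toy estimate of total tokens/s for one decode step shared by `batch` sequences."""
    mem_s = weight_gb / bandwidth_gbs  # weights are read once per step
    # Roughly 2 FLOPs per parameter per generated token, for each sequence.
    compute_s = batch * 2 * params_billion * 1e9 / (tflops * 1e12)
    return batch / max(mem_s, compute_s)


# Hypothetical accelerator: 1000 GB/s, 200 TFLOPS, 40B dense model at 8 bits (40 GB).
for batch in (1, 8, 32, 200):
    total = aggregate_tps(batch, 40, 1000, 40, 200)
    print(f"batch {batch:>3}: ~{total:.0f} tokens/s in total")
```

In this sketch a single user gets about 25 tokens/s out of hardware that could serve 2500 tokens/s in aggregate, which is the utilization gap a home setup cannot close.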

I think you can make it financially make sense for certain types of projects - if I needed to build a demo app with multiple backend services, but each service wasn't particularly intellectually challenging, I'd feel comfortable setting Minimax-M2.7, MiMo-2.5, or Deepseek-V4-Flash loose to tackle them. I wouldn't leave a 300B-class model unattended to build complicated numerical simulations or write number theory algorithms with tons of control flow.
 

Styp

I wouldn't go under 128GB of VRAM if your aim is to buy something to run LLMs on. 96 is really tight - it's barely enough for Qwen-122B at full context, which means that once more models with 1M context, or something with 150B params, roll out, you won't be able to take advantage of them. Here are a few unpopular options that are nevertheless interesting:
  • 4x MI250/MI255: The Supermicro MI250 carrier will run in a normal computer with some effort. MI250 is actually 2x MI210 on a single board, so the resulting system is topologically similar to a DGX-1 but has 512GB of VRAM and bf16. May or may not be supported by anything.
  • 8x Habana Gaudi2: 768GB of VRAM for $18K, but is it really worth saving $25K to not have Nvidia?
  • 8x V100-SXM2: Volta is aging well, there are several projects that aim to bring modern models to V100. Unfortunately, not having bf16 sucks.
  • A single MI250 on a carrier: really not fun, because OAM modules are 48V...
  • 4x RTX 8000: 192GB, but same problem as Volta - no bf16.
  • 4x Intel Max 1100: 192GB for $8K, but it is unclear if PyTorch runs on Max 1100.
Some great ideas here, but some of them make no economic sense. For homelabs OK, but I couldn't justify setting up Habana Gaudi2 and PyTorch no matter the price - the knowledge is just too scarce...
 

bayleyw

  • For heavy models (like the new Qwen 3.6): You’re going to want massive VRAM on a single card. The NVIDIA RTX PRO 6000 Blackwell (96GB) is the card to aim for here - and it is the only option in my opinion!
  • The NVIDIA DGX Spark: This is a fantastic desktop platform if you eventually need a seamless scaling path to a DGX Workstation or a multi-GPU cloud setup; it just works. It’s decently fast, though at ~$3k for a standalone device, it’s quite expensive for the form factor. Beta-ish behavior, portability issues, etc. (I am struggling to commit here as well…).
In general I think multi-GPU scaling (or the lack thereof) is worth discussing here. Modern models are effectively "deep and narrow" during inference because they are all really sparse. If you are generating at 100 tokens per second, you have 10 ms per forward pass. If your model has 48 layers, you have only about 200 microseconds per layer, shared between every kernel in that layer, so suddenly a few microseconds spent launching NCCL kernels or waiting for AllReduce to finish feels like an eternity.
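
The time budget can be worked through in a few lines. At 100 tokens per second a forward pass gets 10 ms, and dividing that over 48 layers leaves roughly 208 microseconds per layer. The sketch below also estimates what share of that budget communication takes, assuming two AllReduce calls per layer (as in a typical tensor-parallel split) and a hypothetical 20 microsecond latency per call; neither number is a measurement.

```python
def per_layer_budget_us(tokens_per_s, n_layers):
    """Microseconds available per layer for one generated token."""
    return 1e6 / tokens_per_s / n_layers


def comm_share(tokens_per_s, n_layers, allreduces_per_layer, allreduce_latency_us):
    """Fraction of the per-layer budget spent waiting on AllReduce."""
    budget = per_layer_budget_us(tokens_per_s, n_layers)
    return allreduces_per_layer * allreduce_latency_us / budget


print(f"budget per layer: {per_layer_budget_us(100, 48):.0f} us")
# Assumed: 2 AllReduce calls per layer at 20 us each (hypothetical latency).
print(f"share lost to communication: {comm_share(100, 48, 2, 20):.0%}")
```

Since the budget shrinks as the target token rate goes up, the same fixed latency takes a larger share the faster you try to generate.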

There is a bit of a caveat on DGX Spark: we all wish it were what Jensen advertised - a very slow B200 that you could prototype on - but GB200 is SM10_0 and Spark is SM12_1, so there's a bit of drama regarding kernel compatibility right now...
 

bayleyw

Some great ideas here, but some of them make no economic sense. For homelabs OK, but I couldn't justify setting up Habana Gaudi2 and PyTorch no matter the price - the knowledge is just too scarce...
I think if your only goal was to serve GLM5.1 or Kimi K2.6, it's a viable choice - Intel does maintain a working vLLM implementation which is only a couple of months behind upstream and covers most of the interesting models. 768GB for $18K really is in a class of its own, and I think if it creeps down to $13K in a few months these will be squarely in hobbyist range.
 

MBastian

Some great information here. Thank you all. But for me it boils down to: would I tear my hair out trying to use a 72/80b Qwen model on a DGX Spark or a Ryzen AI mini PC, or not? I do not need it for work, just for some ambitious coding projects.
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
7,883
2,219
113
meh, I think it's silly to consider a 128GB 4000$ mini-PC with (relatively) slow performance compared to your NVIDIA options as your daily-driver AI, that would drive me INSANE.

Personally, I find you need something snappy/fast for daily use, i.e. a 5090 or RTX6000, and then you need a LOT of RAM for the top-tier big ones, i.e. 600GB+ just to fit the models. I've yet to say "wish I could fit that 120b model fast", but I sure wish I had a 2nd 512GB Studio to run the 600GB+ models.


The new MTP models are really fast too, have you tried those? Even dipping into system RAM for large context, it was impressively fast.
 

Styp

meh, I think it's silly to consider a 128GB 4000$ mini-PC with (relatively) slow performance compared to your NVIDIA options as your daily-driver AI, that would drive me INSANE.

Personally, I find you need something snappy/fast for daily use, i.e. a 5090 or RTX6000, and then you need a LOT of RAM for the top-tier big ones, i.e. 600GB+ just to fit the models. I've yet to say "wish I could fit that 120b model fast", but I sure wish I had a 2nd 512GB Studio to run the 600GB+ models.


The new MTP models are really fast too, have you tried those? Even dipping into system RAM for large context, it was impressively fast.
I partially agree that you need an RTX 6000 Blackwell or an RTX 5090. It's a delicate balance; self-hosting for a solo operation or an SME is challenging because pooling hardware resources through a larger entity is generally much more efficient. While data privacy is a valid reason to self-host, sometimes it's done just for the sake of doing it. That said, for batch-processing workloads like OpenClaw, slower hardware isn't a big issue; it just takes a bit longer to run.
 

Patriot

Moderator
Apr 18, 2011
1,513
834
113
I just re-imaged my old gpu setup, I can run some performance numbers if you tell me what you are interested in...
Have a pair of old RTX8000s Nvlinked together for 96gb vram and an Epyc 7763 w/512gb 2933

If you ask real nice... I might do the wiring required for external blowers on a quad set of v100s but I don't particularly fancy the heat or noise for daily running. It's been a bit since I ran local LLMs but I did so on a hive of mi100s.
 

TrashMaster

Active Member
Sep 8, 2024
116
87
28
multiple rtx6k remains the only quasi "affordable" option for running the big capable boys right now in a truly usable state (not some tragic 2.13 bit quant at 4t/s). let's call usable 8kt/s pp and 50t/s gen with concurrency 4, and 8-16 bit weights.
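
To put those thresholds into wall-clock terms, here is a small sketch that turns prompt-processing and generation rates into response time. The 8000 t/s and 50 t/s defaults are the figures named above; the prompt and output sizes, and the idea that the runtime can reuse a cached prompt prefix on follow-up requests, are assumptions for illustration.

```python
def response_seconds(prompt_tokens, output_tokens, pp_tps=8000.0, gen_tps=50.0,
                     cached_prompt_tokens=0):
    """Seconds to process the uncached prompt part and generate the output."""
    uncached = max(prompt_tokens - cached_prompt_tokens, 0)
    return uncached / pp_tps + output_tokens / gen_tps


# Hypothetical coding session: 100k-token project context, 1000-token answers.
first = response_seconds(100_000, 1_000)
follow_up = response_seconds(100_000, 1_000, cached_prompt_tokens=98_000)
print(f"first prompt: {first:.1f} s, follow-up with cached prefix: {follow_up:.2f} s")
```

At slower prompt-processing rates the first term dominates quickly, which is why large-context coding work is more sensitive to pp speed than to generation speed.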

if precision matters to you (e.g. quality of output) then nvfp4 (and frankly most 4bit weight options) are going to suffer. any quantization of activations or kvcache will likewise result in noticeable degradation as context grows.

the sm120 and sm121 software wall is real, these devices are not sm100 (real blackwell) and will not be able to use many software components without you hand-rolling your own cuda tiles and making serious concessions around accuracy and/or performance.

if you are trying to stitch together a dozen 3090s (or even 5090s) I have been down that road, built those rigs, run the tests, and ultimately come full circle back to a pile of rtx6k's. if i had the cash for h200nvl's, that would be the obvious choice. your limiting factor is going to be power, cooling, and complexity rather than plugging the GPUs in. There is a severe curve of diminishing returns especially past 4 GPUs due to all the card to card communications and PCIE being a bottleneck.
 

MBastian

I just can't do a multi-GPU setup. For one, I'd probably waste more of my very scarce time building and fiddling with it than using it, and I also don't have the space for it.
Has anyone actually used a DGX Spark, Ryzen AI Max or Mac Studio M1/M3 Ultra and can comment on whether it's usable in a single-user, non-work environment? I do not really care if I get an answer to my prompt in 500ms or 5s.
On second thought, I think my thread title was ill-conceived.
 

Mashie

Member
Jun 26, 2020
40
11
8
I just can't do a multi-GPU setup. For one, I'd probably waste more of my very scarce time building and fiddling with it than using it, and I also don't have the space for it.
Has anyone actually used a DGX Spark, Ryzen AI Max or Mac Studio M1/M3 Ultra and can comment on whether it's usable in a single-user, non-work environment? I do not really care if I get an answer to my prompt in 500ms or 5s.
On second thought, I think my thread title was ill-conceived.
There are quite a few over at the Nvidia dev forums using the DGX Sparks as daily drivers for code assist and whatever they need using the Qwen 3.5 122b models or the smaller dense ones. Give it another month and Deepseek 4 Flash should be stable/fast enough on a 2-node setup as a daily driver.

I have not started to use my 2-node setup in anger yet, so I can't provide much feedback myself.
 

MBastian

@mashrooms thank you. I dug a bit deeper and it seems that both a DGX Spark and Ryzen AI Max+ are plenty fast for large MoE models but struggle with the bigger dense models due to their limited memory bandwidth. There is no hard data to be found on this other than the usual benchmaxed numbers and some user experiences.
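
The MoE versus dense difference follows from a simple bandwidth bound: for a single user, each generated token requires reading every active weight once, so tokens per second cannot exceed memory bandwidth divided by the bytes of active weights. The sketch below uses approximate published bandwidth figures and an assumed 4.5 bits per weight; treat all of these as assumptions, and treat the results as upper limits rather than expected speeds.

```python
# Approximate published memory bandwidths in GB/s; treat them as assumptions.
ASSUMED_BANDWIDTH_GBS = {
    "DGX Spark": 273.0,
    "Ryzen AI Max+ 395": 256.0,
}


def max_decode_tps(bandwidth_gbs, active_params_billion, bits_per_weight):
    """Upper bound on single-user tokens/s if each active weight is read once per token."""
    gb_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gbs / gb_per_token


for name, bw in ASSUMED_BANDWIDTH_GBS.items():
    dense = max_decode_tps(bw, 72, 4.5)  # dense 72B: every weight is active
    moe = max_decode_tps(bw, 3, 4.5)     # MoE with about 3B active parameters (assumed)
    print(f"{name}: dense 72B <= {dense:.1f} t/s, 3B-active MoE <= {moe:.0f} t/s")
```

Under these assumptions a dense 72B model is capped at single-digit tokens per second on either box, while an MoE with a few billion active parameters has a ceiling more than twenty times higher.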
I still plan to have a local AI infrastructure, but I think I should first try the models via openrouter, deepinfra or a similar service. While many hail "open" models like Qwen as nearly as good as ChatGPT or Claude, it might just be that they would not cope well with the things I intend to do with them ... which is not vibe coding until it appears to work and calling it a day.