CPU+mobo for LLM inference offloading from 2x3090


InternetExplorer

New Member
May 26, 2026
3
0
1
Hi guys! I have a 128 GB Strix Halo / Ryzen AI MAX 395 (Minisforum, bought out of curiosity) and a gaming PC with an RTX 3090, a Ryzen 5 8400F, and 16+16 GB of DDR5-6000 RAM on a basic single-GPU motherboard.

I love using the Halo for LLM inference for coding agents, but having two PCs is too much, and the Halo is slower than the 3090 with smaller and dense models. I want to sell the Halo and upgrade the PC with a second 3090 and a CPU+mobo with multiple x16 PCIe slots and 8-channel DDR4. As I understand it, that is the best budget option for running models larger than 48 GB with relatively fast RAM offload. Budget for CPU+mobo: less than $1200.

I also game occasionally, and it would be nice if the CPU were not much slower than the Ryzen 5 8400F for consumer/gaming tasks. Maybe there are some models with few but fast cores?

Thank you!
 

MBastian

Active Member
Jul 17, 2016
337
99
28
Germany
You could just swap the motherboard. A viable option would be an ASUS B850-Creator Neo. You'll lose the x4 PCH slot unless your case allows vertical PCIe mounts, but it's a solid option with two x8 PCIe 5.0 slots and the ability to upgrade to a more powerful CPU in the future. You'll most likely have to invest in a new PSU, ideally a good 1200W one. 1000W can do in a pinch if you undervolt the two 3090s or lower their power limit to 200W.

the Halo is slower with smaller and dense models than the 3090
How much slower? Unusable? One 3090 can only run Qwen 3.6 27b with (emulated) FP4. What model did you use and did you use (or try) FP8 or BF16 on the Halo?
 

InternetExplorer

Hi man! Thanks for the reply :)

Yeah, as a budget option I could go for a consumer board with x8/x8 PCIe, and that would get me a fast CPU for desktop/gaming tasks. I'm planning to buy used, and CPUs and mobos are often sold as a kit, so there's little difference between swapping both or just the mobo. However, I think both fast PCIe slots on the mobo have to be wired to the CPU for fast processing, and the CPU has to provide enough lanes. HEDT/server hardware has that covered; not all consumer gear does.

But I think 2-channel DDR5 + x8/x8 PCIe consumer gear will be slower than 8-channel DDR4 + x16/x16 pro gear when offloading to RAM. Here a guy goes from 2-channel DDR5 to a server DDR4 setup with a 62% t/s speedup: https://www.reddit.com/r/LocalLLaMA/comments/1nsm53q
Ideally I want a functional replacement for the Strix Halo for big models (running at no less than 2/3 of the Halo's speed) and something better than the Halo for smaller ones (<48 GB).

I actually rented a server with 2x3090 to check their performance.

Qwen3.6-27B-Q8_0.gguf on llama.cpp on Ubuntu was:
~8 t/s on Halo
~24 t/s on 2x3090
Which is expected due to big bandwidth difference. The model fits in VRAM in both cases.

gpt-oss-120b-Q4_K_M (does not fit in 3090s, gets offloaded):
- Halo: 56 t/s
- 2x3090: 8.8 t/s
Here I guess I didn't run the big model properly on the 3090s, I'll try again later.

That is without MTP/DFlash, etc.
 

MBastian

gpt-oss-120b-Q4_K_M (does not fit in 3090s, gets offloaded):
- Halo: 56 t/s
- 2x3090: 8.8 t/s
Here I guess I didn't run the big model properly on the 3090s, I'll try again later.
That's an MoE model. Each token only uses 5.1B parameters. Since the Halo has unified memory, its low memory bandwidth does not hurt as much as with a dense model like the Qwen3.6 27B you used. I do not think you'll see better numbers with the 3090s, as the RAM offloading will really hurt.
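
A rough way to see why the active-parameter count matters: if decoding is memory-bandwidth bound, each generated token has to read roughly all active weights once, so bandwidth divided by bytes per token gives a ceiling on t/s. The sketch below is only a back-of-the-envelope model. The 256 GB/s Halo bandwidth and the bits-per-weight values are assumptions for illustration, not figures from this thread, and the function name is hypothetical.

```python
def decode_tps_upper_bound(active_params_billion, bits_per_weight, bandwidth_gb_s):
    """Ceiling on decode tokens/s if every active weight is read once per token."""
    # billions of parameters * bits / 8 = gigabytes read per generated token
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


# Assumed inputs, not measurements from this thread:
HALO_BANDWIDTH_GB_S = 256.0  # assumed peak memory bandwidth of the Strix Halo
Q8_BITS = 8.5                # assumed effective bits per weight for Q8_0
Q4_BITS = 4.5                # assumed effective bits per weight for a 4-bit quant

# Dense model: all 27B parameters are active for every token.
dense_tps = decode_tps_upper_bound(27.0, Q8_BITS, HALO_BANDWIDTH_GB_S)
# MoE model: only about 5.1B parameters are active per token.
moe_tps = decode_tps_upper_bound(5.1, Q4_BITS, HALO_BANDWIDTH_GB_S)

print(f"dense 27B @ Q8:       <= {dense_tps:.1f} t/s")
print(f"MoE 5.1B active @ Q4: <= {moe_tps:.1f} t/s")
```

Under these assumptions the dense ceiling comes out just under 9 t/s, in the same range as the ~8 t/s measured on the Halo, while the MoE ceiling is about ten times higher. Real numbers land below the ceiling because of KV-cache reads, compute, and overhead.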

Qwen3.6-27B-Q8_0.gguf on llama.cpp on Ubuntu was:
~8 t/s on Halo
~24 t/s on 2x3090
Which is expected due to big bandwidth difference. The model fits in VRAM in both cases.
With Q8 you will have a tiny context window on the 3090 and a good chance it'll run OOM frequently. For a more realistic comparison you should use the Q4 or, better, the Q5 version. It does not lose much compared to the Q8 one. The Halo should be around 12-14 t/s with the Q5 version.

Edit: When it comes to performance numbers and tuning options you should take anything on the internet with a huge grain of salt, especially on Reddit. I am still learning too and it's a PITA to find some truth in all that mess. There is an overwhelming amount of plainly wrong information floating around. "Remember to turn the chicken's head to the left side, it makes your soup more flavorful!"

Last edit, I promise: The most important thing is to actually be sure the open models are acceptable for your use case. I am at that stage now. It would be a huge letdown to build and tune something and then discover that Qwen or whatever LLM is not really up to the task. I'd rather shell out some money for Codex, Claude or Qwen Max than spend my time handholding a sub-par local LLM I originally envisioned as a time-saver.
 

InternetExplorer

Again, thank you for your interest.

FYI Qwen3.6 27B Q8 with MTP on Halo gives about 17-18 t/s output. On 2x3090s it's around 50 t/s.

I want to have a local solution in addition to cloud ones; for smaller, easier tasks local LLMs are fine for me.

Regarding what people say about performance and my results: you might actually get way more than 8.8 t/s for gpt-oss-120b-Q4_K_M (or other bigger MoE models) on 2x3090 with RAM offload, since I was testing on a virtual machine and RAM bandwidth there is bad. For example, I got these results on another 2x3090 VM.

ubuntu@double3090test:~$ mbw -t0 15000
Long uses 8 bytes. Allocating 2*1966080000 elements = 31457280000 bytes of memory.
Getting down to business... Doing 10 runs per test.
0 Method: MEMCPY Elapsed: 2.16906 MiB: 15000.00000 Copy: 6915.438 MiB/s
1 Method: MEMCPY Elapsed: 2.15624 MiB: 15000.00000 Copy: 6956.544 MiB/s
2 Method: MEMCPY Elapsed: 2.16562 MiB: 15000.00000 Copy: 6926.413 MiB/s
3 Method: MEMCPY Elapsed: 2.17202 MiB: 15000.00000 Copy: 6906.004 MiB/s
4 Method: MEMCPY Elapsed: 2.16786 MiB: 15000.00000 Copy: 6919.272 MiB/s
5 Method: MEMCPY Elapsed: 2.17107 MiB: 15000.00000 Copy: 6909.042 MiB/s
6 Method: MEMCPY Elapsed: 2.17539 MiB: 15000.00000 Copy: 6895.303 MiB/s
7 Method: MEMCPY Elapsed: 2.17570 MiB: 15000.00000 Copy: 6894.330 MiB/s
8 Method: MEMCPY Elapsed: 2.17242 MiB: 15000.00000 Copy: 6904.742 MiB/s
9 Method: MEMCPY Elapsed: 2.17889 MiB: 15000.00000 Copy: 6884.246 MiB/s
AVG Method: MEMCPY Elapsed: 2.17043 MiB: 15000.00000 Copy: 6911.080 MiB/s


That's about 7 GB/s, while 2-channel DDR5 can theoretically get you roughly 100 GB/s. You need a dedicated machine to get true performance numbers. With the capabilities of Qwen3.6 27B, double 3090 setups look promising.
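
For reference, the theoretical figure follows from channels x 8 bytes per transfer x transfer rate, and the mbw result needs a MiB/s to GB/s conversion before the two can be compared. A minimal sketch of that arithmetic; the helper names are made up for illustration, and DDR4-3200 is an assumed speed grade for the 8-channel case.

```python
def theoretical_bandwidth_gb_s(channels, megatransfers_per_s, bytes_per_transfer=8):
    """Peak DRAM bandwidth in GB/s, assuming a 64-bit (8-byte) bus per channel."""
    return channels * bytes_per_transfer * megatransfers_per_s / 1000


def mib_s_to_gb_s(mib_per_s):
    """Convert MiB/s (as reported by mbw) to decimal GB/s."""
    return mib_per_s * 1024**2 / 1e9


dual_ddr5_6000 = theoretical_bandwidth_gb_s(2, 6000)   # consumer 2-channel DDR5-6000
octa_ddr4_3200 = theoretical_bandwidth_gb_s(8, 3200)   # assumed 8-channel DDR4-3200
vm_measured = mib_s_to_gb_s(6911.080)                  # AVG MEMCPY result from the VM

print(f"2-channel DDR5-6000 peak: {dual_ddr5_6000:.1f} GB/s")
print(f"8-channel DDR4-3200 peak: {octa_ddr4_3200:.1f} GB/s")
print(f"VM mbw MEMCPY average:    {vm_measured:.2f} GB/s")
```

These are peak numbers. A single-threaded memcpy will not reach them even on bare metal, so the gap to the VM result overstates what dedicated hardware would gain.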
 

MBastian

AVG Method: MEMCPY Elapsed: 2.17043 MiB: 15000.00000 Copy: 6911.080 MiB/s
Ran that mbw command on my workstation: Ryzen 9 9900X, 2x32GB Kingston Fury Beast DDR5-6000 CL30, Arch Linux, kernel 6.18.33-1-lts.
The AVG result is 22805.464 MiB/s.
So you will see an improvement, but do not expect miracles. Maybe if the same experts are always used and they fit into the GPUs' memory you'll see way better numbers. But for multi-user or agentic workflows I rather doubt that.
 

Kizune

Member
Dec 2, 2022
96
69
18
I do not think this mbw command shows the correct picture. I mean, if you get 22800 on dual-channel DDR5-6000 and I get 24300 on 12-channel DDR5-5600, that just doesn't add up to anything useful.

OK, after some digging I found out that mbw is basically a single-threaded application, so all you can test is how much memory bandwidth is available to one thread running on one CCD. That is unrealistic for a real multi-threaded workload. So I found this post: "Testing Memory I/O Bandwidth", where the author proposes running mbw concurrently on all cores, making the instances compete for memory bandwidth, and then adding up the averages for each method.

So I ran his shell script and here are my results:

EPYC 7742 (64 cores 128 threads), 8 channel DDR4 3200:
Starting test on 128 cores
Waiting for tests to finish
MEMCPY Total AVG: 42677.861 MiB/s
DUMB Total AVG: 65461.118 MiB/s
MCBLOCK Total AVG: 67056.455 MiB/s

EPYC 9755 (128 cores, 256 threads), 12 channel DDR5 5600:
Starting test on 256 cores
Waiting for tests to finish
MEMCPY Total AVG: 215104.170 MiB/s
DUMB Total AVG: 132518.049 MiB/s
MCBLOCK Total AVG: 139659.745 MiB/s

Now that paints a bit different picture doesn't it?
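
The aggregation step described above (adding up the per-instance averages for each method) can be sketched like this. It is not the script from the linked post, just a hypothetical parser for the `AVG` lines that mbw prints, in the format shown earlier in the thread. The two sample outputs are made up by reusing that one AVG line.

```python
import re
from collections import defaultdict

# Matches lines such as:
# AVG Method: MEMCPY Elapsed: 2.17043 MiB: 15000.00000 Copy: 6911.080 MiB/s
AVG_LINE = re.compile(
    r"^AVG\s+Method:\s+(\w+)\s+Elapsed:\s+[\d.]+\s+MiB:\s+[\d.]+\s+Copy:\s+([\d.]+)\s+MiB/s"
)


def sum_mbw_averages(outputs):
    """Sum the AVG copy rate per method across several concurrent mbw runs."""
    totals = defaultdict(float)
    for text in outputs:
        for line in text.splitlines():
            match = AVG_LINE.match(line.strip())
            if match:
                method, rate = match.group(1), float(match.group(2))
                totals[method] += rate
    return dict(totals)


# Made-up sample: two instances that each reported the same MEMCPY average.
sample_outputs = [
    "AVG Method: MEMCPY Elapsed: 2.17043 MiB: 15000.00000 Copy: 6911.080 MiB/s",
    "AVG Method: MEMCPY Elapsed: 2.17043 MiB: 15000.00000 Copy: 6911.080 MiB/s",
]
totals = sum_mbw_averages(sample_outputs)
for method, total in totals.items():
    print(f"{method} Total AVG: {total:.3f} MiB/s")
```

In practice each string would be the captured output of one mbw instance, with one instance started per core at the same time.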
 

MBastian

Now that paints a bit different picture doesn't it?
Yes, but only if the process that handles offloading can run multiple threads across different cores. Even then you only get about 2.8 times the memory speed per EPYC CPU. I wonder how this benchmark scales with cores; I'll check in the evening on my workstation.
 

MBastian

Two things.
1. An RTX 3090 cannot transfer more than 32 GB/s through its PCIe 4.0 x16 interface. On a consumer platform two 3090s would have x8 each. IMHO the main benefit of a server platform would be the doubled throughput to the two 3090s and not the higher memory bandwidth. Edit: I am assuming here that there is no CPU inferencing in play.
2. mbw is an (intentionally unoptimized) memory copy benchmark, which explains the low results. But that's not what's happening in offloading: data is pushed to the GPU and not to some other memory region.
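
The x16 versus x8 difference in point 1 can be put into numbers with the PCIe 4.0 link parameters (16 GT/s per lane, 128b/130b encoding). This is a sketch of the theoretical per-direction rate only; the function name and the 10 GB transfer size are arbitrary choices for illustration, and real transfers lose some of this to protocol overhead.

```python
def pcie_gb_s(lanes, gigatransfers_per_s=16.0, encoding_efficiency=128 / 130):
    """Theoretical one-direction PCIe bandwidth in GB/s (defaults: PCIe 4.0)."""
    # Each transfer carries one bit per lane; divide by 8 to get bytes.
    return lanes * gigatransfers_per_s * encoding_efficiency / 8


x16 = pcie_gb_s(16)  # server/HEDT board: full x16 link per 3090
x8 = pcie_gb_s(8)    # consumer board: x8/x8 split across the two cards

weights_gb = 10.0  # arbitrary example: data pushed from system RAM to one GPU
print(f"x16: {x16:.2f} GB/s, {weights_gb / x16:.2f} s for {weights_gb:.0f} GB")
print(f"x8:  {x8:.2f} GB/s, {weights_gb / x8:.2f} s for {weights_gb:.0f} GB")
```

Halving the lane count exactly halves the ceiling, so any workload that streams weights over the bus for every token would feel the x8/x8 split directly.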
 

Kizune

Usually when people look for server platforms they assume a large amount of RAM and at least the possibility of hybrid inference. I just finished building my new LLM rig and am still in the process of optimizing it. I will definitely do hybrid inference using CUDA and ZenDNN-optimized modules. But I'm still at a very early stage of testing various combinations to see what will work, so I can't give any recommendations myself.