New Chinese PCIE Switch Board GPU Testing


josh

Active Member
Oct 21, 2013
644
212
43
@josh, for 16 GPUs on the same board, you would need a PLX88096 switch with 10 SFF8654 ports.

The one you linked would work, setting the dip switch to X4, which would let you connect up to 20 cards at X4 4.0.

View attachment 47849

You would need some SFF8654 to 2*SFF8654 4i cables and some SFF8654i device adapters.

Some cables could be https://es.aliexpress.com/item/1005005316396748.html

And device adapters could be either https://es.aliexpress.com/item/1005005818101660.html or SlimSAS PCIe gen4 Device Adapter x4

You could also get some SlimSAS 8654 8i cables and use an adapter like this https://es.aliexpress.com/item/1005010314638714.html

As long as you keep everything inside the switch (i.e. with the P2P driver), the uplink shouldn't matter.
Sorry I meant 16 cards per motherboard. I only want 8 cards per switch unless there's some sort of 8i to 2x8i cable out there because I don't want to change the rest of my setup (have the 8i cables + risers already). My plan is to eventually stack more and more of these switches populated with 8-10 cards each.
 

TrashMaster

Active Member
Sep 8, 2024
116
87
28
Why do you think it will perform like absolute trash? Speed-wise or quant-wise? K2.5 only needs Q4 to be full precision so a 2.5-3bpw fully offloaded will be good and fast.

I already have 16x3090s running on direct CPU connections and I'm thinking that moving all of that GPU traffic off the interconnect would actually speed things up a lot.
This is where your complete stack comes into play. What inference engine are you using, and how have you configured it?

Running pipeline parallelism between N cards means every token must pass through each card on its long slow path to output.

Running tensor parallelism (and EP if you are talking about a big model like K2.5) requires sharding the model, which results in an incredible amount of PCIe bandwidth directly between the cards for what most people would likely classify as 'good' performance. 4 lanes of PCIe 4.0 per card is gonna get tight. The first part of processing the prompt will be memory bandwidth bound, at ~7GB/s. At that point, you might as well do it on a CPU with DDR5. A PCIe switch actually does help with this; however, you will need the custom P2P drivers for that to work on consumer GPUs like 3090s.
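For reference, a rough sketch of where a figure like ~7GB/s for x4 comes from. It assumes the standard PCIe 4.0 signalling rate of 16 GT/s per lane with 128b/130b encoding; the function name is only for illustration, and the result is a theoretical per-direction ceiling before protocol overhead, not a measurement.

```python
# Theoretical per-direction PCIe bandwidth (before TLP/protocol overhead).
# Transfer rates in GT/s per lane; gens 3-5 all use 128b/130b encoding.
GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}
ENCODING = 128 / 130


def pcie_gb_per_s(gen: int, lanes: int) -> float:
    """Raw link bandwidth in GB/s for one direction."""
    return GT_PER_S[gen] * ENCODING / 8 * lanes


for lanes in (4, 8, 16):
    print(f"PCIe 4.0 x{lanes}: {pcie_gb_per_s(4, lanes):.1f} GB/s")
```

x4 comes out just under 8 GB/s raw, so roughly 7GB/s of usable bandwidth after overhead is about what you would expect per card.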

Overall we would have to go way deeper into your storage, PCI topology, IOMMU, BIOS/drivers/kernel/GRUB, inference engine, and all the configs to explore all the interesting and irritating bottlenecks.

So out of curiosity, what are the details behind the current setup and how many PP/TG do you get?
 

TrashMaster

Even if you had 16 x 3090, that's only 16 * 24GB VRAM = 384GB. Kimi K2.5 is still 374GB at Q2_K. So you would be crippled on max prompt size even assuming you could load it.
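Spelling that arithmetic out as a quick sketch, using only the numbers quoted above. It ignores the GB vs GiB difference, and the leftover has to cover KV cache, compute buffers and the CUDA context on every card.

```python
# Does a 374 GB quant fit in 16 x 24 GB, and what is left for KV cache?
# Numbers are the ones quoted above; GB vs GiB differences are ignored.
num_gpus = 16
vram_per_gpu_gb = 24
model_size_gb = 374

total_vram_gb = num_gpus * vram_per_gpu_gb
headroom_gb = total_vram_gb - model_size_gb
headroom_per_gpu_gb = headroom_gb / num_gpus

print(f"total VRAM: {total_vram_gb} GB")
print(f"headroom:   {headroom_gb} GB ({headroom_per_gpu_gb:.3f} GB per GPU)")
```

That is well under 1 GB of slack per card, which is why the max prompt size would be the problem even if the weights load.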
 

josh

The GPUs are currently split on 2 rigs:

1. 3x3090s + 768GB DDR5 with ik_llama.cpp. It does 5-10t/s TG on Q4 K2.5 full context but the PP is where it goes to the absolute shitter at around 25t/s.
2. 13x3090s running exl3. It does 20t/s TG on 5bpw (almost full precision), full context for GLM 4.7 with PP around 200t/s.

The problem I have now is I really like K2.5 but it takes so long to PP on codebase ingestion it times out on my IDE and renders it unusable. So I'm thinking of trying to just add more 3090s to my second rig.

The second reason why I'm looking at this setup is that I believe my current 13x setup is suffering from massive inefficiency anyway, due to 13 GPUs constantly shouting on the interconnect on every AllReduce operation, and I'm hoping offloading this onto dedicated switch fabric will drastically improve performance. Connecting 10 cards to the 10-port board means we have a minimum of 4.0 x8 on each port, which should be more than enough for basic inference.
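A quick lane-budget sketch for one of these boards. The 96 lanes and 10 x8 ports are from the thread; that the leftover lanes form the single x16 uplink to the host is an assumption, although it matches the x16 uplink mentioned elsewhere in the thread.

```python
# Lane budget for one 96-lane switch with 10 downstream x8 ports.
# ASSUMPTION: the lanes left over form the single x16 uplink to the host.
SWITCH_LANES = 96
downstream_ports = 10
lanes_per_port = 8

downstream_lanes = downstream_ports * lanes_per_port
uplink_lanes = SWITCH_LANES - downstream_lanes
oversubscription = downstream_lanes / uplink_lanes

print(f"downstream: {downstream_lanes} lanes, uplink: x{uplink_lanes}")
print(f"oversubscription for traffic leaving the switch: {oversubscription:.0f}:1")
```

The 5:1 ratio only applies to traffic that leaves the switch (to the CPU or to another switch). GPU-to-GPU traffic that stays inside the switch with P2P enabled does not touch the uplink.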

Theoretically, even if we do PP across 2 of these switches, the interconnect is only used once per forward pass, when the activations move from GPU7 to GPU8, which should drastically improve performance. Also, theoretically, vLLM with TP=8, PP=2 should work if I somehow manage to convey this topology information to it.
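A minimal sketch of what "conveying the topology" could boil down to. vLLM does expose `tensor_parallel_size` and `pipeline_parallel_size`, but the rank layout below (TP innermost, so consecutive devices form a TP group) is an assumption to verify against the engine version in use, and the GPU-to-switch map is hypothetical; the real one can be read from `nvidia-smi topo -m`.

```python
# Sketch: 16 GPUs, TP=8 inside each switch, PP=2 across switches.
# ASSUMPTION: the engine lays ranks out with TP innermost
# (rank = pp_rank * tp_size + tp_rank), so device order decides the grouping.
tp_size, pp_size = 8, 2

# Hypothetical physical layout: GPU index -> switch it hangs off.
gpu_to_switch = {gpu: gpu // 8 for gpu in range(16)}


def tp_groups(device_order):
    """Split an ordered device list into consecutive TP groups."""
    return [device_order[s * tp_size:(s + 1) * tp_size] for s in range(pp_size)]


def crosses_switch(group):
    """True if a TP group spans more than one switch."""
    return len({gpu_to_switch[g] for g in group}) > 1


device_order = list(range(16))  # e.g. what CUDA_VISIBLE_DEVICES would list
for stage, group in enumerate(tp_groups(device_order)):
    print(f"PP stage {stage}: GPUs {group}, crosses switch: {crosses_switch(group)}")
```

If a group reports that it crosses a switch, reordering the device list (for example via `CUDA_VISIBLE_DEVICES`) until each TP group sits on one switch is the cheap thing to try first.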
 

josh

The great thing about K2.5 is that it's INT4 native, which means that not only can we run a 4bpw quant at full precision, we can also strip the vision model out and save even more. Then we add mixed bpw on top and get some very usable quants:
ubergarm/Kimi-K2.5-GGUF · Hugging Face
 

TrashMaster

Running ik_llama.cpp with that GGUF would get you right back into 7 t/s PP jail. Pipeline parallelism does not work if you need usable prompt processing speed on huge models across many cards; you gotta have TP.

The problem with TP is you don't just need ports to connect the cards, you need the bandwidth between all cards. That prompt processing is not going to improve unless you have the cards connected at higher speed with more lanes. The fundamental issue is that prompt processing is memory bound not within a single card but across all the cards in a full mesh, as you mentioned, with that big ol' all-reduce. Like I said, running P2P helps in that specific case because, through the constant nagging of some people, NCCL is a valid backend for tabby, but you would need a much bigger PCIe switch than just 96 lanes.

Assuming you split the cards at 8 cards per switch, those groups of 8 still need an immense amount of bandwidth between them. I have done what you are doing with tabbyAPI / exl2 and exl3 and tensor parallelism, and the PCIe traffic regularly maxes out every bit of available bandwidth on the bus. I have custom monitoring scripts built for this now. So your PCIe 4.0 x16 (at 28GB/s) between the CPU and each switch would become another pinch point.

Anyways, depending on the size/scale/money you have available, going to a 160 lane dual EPYC setup might actually cost less and provide more bandwidth. Through the miracle of the Infinity Fabric, you can bypass the compute tile and main memory bottlenecks so long as you watch out for the xGMI links. This is where Turin really shines.

Anyways, if you are on any of the AI or L1T discords, DM me your username and I'll add you.
 

josh

Both my setups are 160 lane EPYCs.

The interconnect adds way too much latency. Going through the CPU is 2 hops per P2P connection, vs 8 GPUs talking to each other at full x16 per P2P connection in a single hop. I would still be using the same number of lanes per GPU, but I would have removed a massive latency problem.

The idea is that if I reduce the number of interconnects between switches, I don't need to rely on the shitty bandwidth of the interconnect. So for PP it would only cross once, when we go from GPU7 to GPU8. Cross-switch AR (AllReduce) is where it could get nasty, but that's where we start doing hybrid processing with vLLM. I would need some way to tell it to TP within the switches and PP across them.

I'm on exllama, tabby, unsloth, vllm. DMed you.
 

panchovix

Member
Nov 11, 2025
63
17
8
At least in my case, when running Kimi K2.5 Q3_K_M (split between RAM on a consumer 9900X and 272GB VRAM), I get about 300-400 t/s PP and 12-14 t/s TG, using llamacpp with:

Code:
./llama-server -m '/run/media/pancho/MyDrive/models_llm_2tb/Kimi-K2.5-Q3_K_M-00001-of-00011.gguf' -c 32768 --no-mmap -mg 0 -ub 2048
I.e.

Code:
prompt eval time =   11646.12 ms /  4394 tokens (    2.65 ms per token,   377.29 tokens per second)
       eval time =   50754.89 ms /   633 tokens (   80.18 ms per token,    12.47 tokens per second)
It's not much but pretty decent to have things on RAM! I would guess an Epyc/Threadripper would be noticeably faster on TG, as I'm limited to about 70-75GB/s of bandwidth with my RAM. PP seems to be limited by both compute and PCIe (it maxes transfers to 64GB/s from the CPU to the main (CUDA 0) GPU, so I guess if PCIe 6.0 X16 existed on consumer boards and GPUs it would be even faster)
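For anyone comparing their own runs, a small sketch that recomputes tokens/s from timing lines in the format quoted above. The regex is written against these two lines only, so treating it as valid for other llama.cpp versions or log formats is an assumption.

```python
import re

# Recompute tokens/s from the llama-server timing lines quoted above.
LOG = """\
prompt eval time =   11646.12 ms /  4394 tokens (    2.65 ms per token,   377.29 tokens per second)
       eval time =   50754.89 ms /   633 tokens (   80.18 ms per token,    12.47 tokens per second)
"""

PATTERN = re.compile(
    r"^\s*(prompt eval|eval) time\s*=\s*([\d.]+) ms\s*/\s*(\d+) tokens", re.M
)


def tokens_per_second(log: str) -> dict:
    """Map each phase name to tokens/s computed from ms and token counts."""
    rates = {}
    for phase, ms, tokens in PATTERN.findall(log):
        rates[phase] = int(tokens) / (float(ms) / 1000.0)
    return rates


rates = tokens_per_second(LOG)
print({phase: round(rate, 2) for phase, rate in rates.items()})
```

"prompt eval" is the PP number and "eval" is the TG number; both recomputed values agree with the figures the server printed.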

I was planning to get a Threadripper, but 256GB of DDR5 RDIMM 6000MHz RAM now costs more than all the GPUs I have gotten (8) combined, just insane.

When RAM stops being that overpriced, probably in a few years, I will get a next-gen Epyc/Threadripper or something.
 

josh

3090s? Are they attached to a switch or via interconnect? Pretty interesting partial offload speeds with no -ot tweaks. What's the % split?
 

panchovix

4x5090, 2x4090, 1xA6000 and 1xA40. llamacpp nowadays with -fit (which is enabled by default) makes the partial offload better than my manual -ot one haha.

About 220GB RAM, 270GB VRAM (it uses more VRAM because context, cache, etc)
 

josh

But these are all on a single PLX88096? If so, impressive bandwidth.
 

panchovix

It is a PM50100 switch (100 lanes), which has 4x5090 connected, and the last 2 MCIO downstream ports are used to connect a PLX88096 switch, where I connect the other 4 GPUs.
 

josh

Very interesting, I didn't know you could daisy chain like that. That must be the weirdest topology, and I'm surprised P2P is functioning. Are you using any custom P2P software?
 

TrashMaster

You can chain PCIE switches, a single socket topology works with cpu -> pcie1 -> pcie2.


But the more you chain the more you bottleneck the groups and latency goes up.

PCIE latency is almost as important as bandwidth when it comes to that precious performance.
 

panchovix

P2P works just fine. Because of the cascading, the PLX88096 has less bandwidth when all cards are in use at the same time and also moving data to the CPU (not very common though, since inference or training traffic can be kept inside the switch). Note that different GPU architectures have to do an extra jump besides PCIe anyway.

I use the P2P driver from here GitHub - aikitoria/open-gpu-kernel-modules: NVIDIA Linux open GPU with P2P support
 

Visual-Synthesizer

New Member
Mar 14, 2026
2
0
1
Hey all, been following this thread closely

I have a specific topology question that I don't think anyone has tested yet:

Gen4 motherboard → Gen5 PCIe switch (PM50100) → Gen5 GPUs

Do the downstream GPU ports negotiate Gen5 P2P with each other even when the upstream port to the CPU is Gen4?

My setup: TRX40 board (Threadripper 3960X, 72 Gen4 lanes, 256GB DDR4 ECC) with two 6000 Blackwells. I've been avoiding upgrading to Gen5 due to the RAM shortages and would rather buy more GPUs. If I could get P2P at Gen5 speeds, it would hold me over until the RAM supply chain eases up.
 

TrashMaster

Your presence is requested in the hardware -> pcie switching channel over at the /r/blackwellperformance discord: Join the RTX6kPRO Discord Server!

Also yes, this will work. It will help with P2P DMA assuming you have all that set up (detailed guide in the p2p dma channel), and several people have lab gear to test this kind of setup. Also c-payne is there ;D
 

mjwlod

New Member
Feb 26, 2026
1
0
1

How deep can the daisy chain go?
Could you run 5 of these in a line, TP grouping of 4 (careful to re-order them in the tree), pipeline parallel between the PLXes?
PCIE is bidirectional and the "upstream" for the whole chain would almost exclusively go to putting the finished product at the beginning again for layer 0 (in terms of Kimi K2.5).
Cross-layer bandwidth would be minimal right?
Unhindered multi-user throughput would be roughly the number of TP groups you have.
For the same reason though, it may be easier to just connect all 5 to the PCIE root instead of chaining them, except for cable length.
Someone tell me I'm crazy. I want to do this now.
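A back-of-the-envelope sketch for the "cross-layer bandwidth would be minimal" guess: with pipeline parallelism, each token sends one hidden-state vector across a stage boundary. The hidden size, activation width and token rate below are all assumptions for illustration, not measured values; only the 28GB/s link figure comes from earlier in the thread.

```python
# Traffic across ONE pipeline boundary: one hidden-state vector per token.
# ASSUMPTIONS: hidden size 7168, 2-byte (BF16/FP16) activations, and an
# illustrative 400 tokens/s through the boundary. None of these are measured.
hidden_size = 7168
bytes_per_value = 2
tokens_per_s = 400

bytes_per_token = hidden_size * bytes_per_value
boundary_mb_per_s = bytes_per_token * tokens_per_s / 1e6

uplink_gb_per_s = 28  # PCIe 4.0 x16 figure quoted earlier in the thread
share = boundary_mb_per_s / (uplink_gb_per_s * 1000)

print(f"{bytes_per_token} bytes per token per boundary")
print(f"{boundary_mb_per_s:.2f} MB/s, {share:.4%} of a {uplink_gb_per_s} GB/s link")
```

Under those assumptions the boundary traffic is a few MB/s, so bandwidth between PP stages is not the concern. It says nothing about the added latency per hop, or about the all-reduce traffic inside each TP group, which is where the bandwidth actually goes.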
 

TrashMaster

On the discord right now there is one guy with this setup:
1773614070476.png

He has 3 of those c-payne 100 lane Gen5 switches. One is set up as a root bridge to 2 leaves, and each leaf is chopped in half into two partitions. Each partition has a dedicated x16 path back to the root. So it's really 4 switches hanging off 1 switch hanging off his Threadripper.

So yes, you can do TP16 with EP on 16x RTX 6000 Pro Blackwells, and it's disgusting to see the benchmarks lol


I'm actually in the benchmarking channel right now testing out this latest Qwen 3.5 397B release from Luke, all tuned up and screaming fast in vLLM:
1773614354908.png

Check out the GPU bandwidth numbers for TP lol