New Chinese PCIE Switch Board GPU Testing


foureight84

Well-Known Member
Jun 26, 2018
460
389
63
I ended up buying this https://www.ebay.com/itm/127622597672 -- good trade-off for the price. I just want to get the 4 RTX 3090s on the same plane, and it's still an upgrade over the current 3 cards on PCIe 3.0 x16/x16/x8. Adding a 4th card somewhere down the line wouldn't be a problem. I'll probably get Qwen3.6-27B Q4_K_M to somewhere around 60 tk/s (45 tk/s with the current setup) with MTP, but at a much lower latency between the cards. LLM min/maxing is too expensive.
 

TrashMaster

Active Member
Sep 8, 2024
116
87
28
Eh, you will be getting less bandwidth with only 8 lanes per card. Idk how that will impact your performance; you might want to test that against the full x16 on a 3090. It's the difference between 12GB/s and 24GB/s effective on the old gen4 Ampere cards.
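
As a rough check on those figures, here is a minimal sketch of the theoretical one-direction PCIe bandwidth for a given generation and lane count. It assumes the standard per-lane transfer rates (8 GT/s for gen 3, 16 GT/s for gen 4) and 128b/130b encoding; the function name is made up for illustration. Real throughput lands below these numbers because of protocol overhead, which is consistent with the 12 and 24 GB/s effective figures quoted here.

```python
# Theoretical one-direction PCIe bandwidth, before protocol overhead.
# Per-lane transfer rates in GT/s; gen 3 and later use 128b/130b encoding.
GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}
ENCODING = 128 / 130


def pcie_gb_per_s(gen: int, lanes: int) -> float:
    """Return theoretical bandwidth in GB/s for one direction."""
    bits_per_s = GT_PER_S[gen] * 1e9 * ENCODING * lanes
    return bits_per_s / 8 / 1e9


for gen, lanes in [(3, 8), (3, 16), (4, 8), (4, 16)]:
    print(f"gen {gen} x{lanes}: {pcie_gb_per_s(gen, lanes):.2f} GB/s")
```

On paper, gen 4 x8 and gen 3 x16 come out the same (about 15.75 GB/s), and gen 4 x16 is double that.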

If you plan to run TP4 in VLLM it will absolutely consume all your pcie bandwidth: 1778412282265.png
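
To get a feel for why tensor parallelism leans on the interconnect, here is a back-of-the-envelope sketch of per-GPU all-reduce traffic. Everything in it is an assumption: the hidden size and layer count are placeholders rather than the real Qwen config, it assumes two all-reduces per transformer layer (one after attention, one after the MLP), and it assumes a ring all-reduce, which may not be what vLLM actually uses on a given setup.

```python
# Rough per-GPU all-reduce traffic under tensor parallelism.
# All model dimensions below are placeholders, NOT the real Qwen config.
HIDDEN_SIZE = 5120        # hypothetical hidden dimension
NUM_LAYERS = 64           # hypothetical transformer layer count
BYTES_PER_VALUE = 2       # fp16/bf16 activations
ALLREDUCES_PER_LAYER = 2  # assumed: one after attention, one after the MLP


def allreduce_bytes_per_token(tp_size: int) -> float:
    """Bytes each GPU sends per token, assuming a ring all-reduce."""
    payload = HIDDEN_SIZE * BYTES_PER_VALUE * ALLREDUCES_PER_LAYER * NUM_LAYERS
    # A ring all-reduce moves 2 * (N - 1) / N of the payload per GPU.
    return payload * 2 * (tp_size - 1) / tp_size


def traffic_gb_per_s(tp_size: int, tokens_per_s: float) -> float:
    """Sustained per-GPU traffic in GB/s at a given token rate."""
    return allreduce_bytes_per_token(tp_size) * tokens_per_s / 1e9


print(f"decode, 60 tok/s:    {traffic_gb_per_s(4, 60):.3f} GB/s")
print(f"prefill, 5000 tok/s: {traffic_gb_per_s(4, 5000):.3f} GB/s")
```

Under these made-up numbers, single-stream decode moves only around 0.12 GB/s per GPU, so latency matters more than bandwidth there, while prefill or heavily batched work at thousands of tokens per second gets close to 10 GB/s, which is in the range of what an x8 gen 4 link can carry.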

You should also be aware that Ampere doesn't have native 4-bit instructions, so you will probably get better performance and quality staying at 8bpw. The Qwen 3.6 27B INT8 quant is likely your best bet for quality/speed.


Despite how tempting it might be to try and squeeze out more context, that quant from the dude is W8A16. If you care about the accuracy/quality of the output, don't quantize the activations or KV cache.
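
For a sense of what the 4-bit vs 8-bit choice costs in VRAM, here is a minimal weights-only estimate. The parameter count is simply read off the "27B" in the model name, and the bit widths are nominal; real quant formats such as Q4_K_M carry scales and mixed widths, so they average somewhat more than their nominal bit count.

```python
# Approximate weight memory for a 27B-parameter model at different widths.
PARAMS = 27e9  # taken from the "27B" in the model name; an approximation


def weight_gb(bits_per_weight: float) -> float:
    """Weights only: excludes KV cache, activations, and runtime overhead."""
    return PARAMS * bits_per_weight / 8 / 1e9


for label, bpw in [("4 bpw", 4.0), ("8 bpw", 8.0), ("16 bpw", 16.0)]:
    print(f"{label}: {weight_gb(bpw):.1f} GB")
```

At a nominal 8 bpw the weights alone come to about 27 GB, which is more than a single 24 GB 3090 holds but leaves room for context when split across three or four cards.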
 

foureight84

Well-Known Member
Jun 26, 2018
460
389
63
It's actually a slight upgrade. I'm on pcie 3 atm and not all cards are on x16 just two of them. There's also latency of one of the cards having to go through pch. At least all cards on this will have the same speed equivalent to pcie 3 x16 and not have to go through the pch for one of the cards.

Yea. I thought about going back to INT8 and I probably will test that again with this card. As long as it's around 40 tk/s then it's usable.

I'll probably upgrade this card when they come down in price or get another one. Although the two groups of 3090s will have to talk to each other on PCIe 3.0 x16. At least that's connected to the CPU.
 

foureight84

Well-Known Member
Jun 26, 2018
460
389
63
Good call on the INT8 btw. I went back to Q8 with MTP, spent some time tuning the MTP values, and it's at 40~43 tok/s with ik_llama.cpp. I'm pretty happy with that atm. I think the cheaper connector with all of them on x8 will still be a good uplift that gets me to the sweet spot of money spent vs performance goal.
 

Bad Apple

New Member
Nov 18, 2023
11
9
3
I've tried several cables I purchased recently (amazon link). I only have 0.8m ones that seemed to work fine in a bifurcation card (although that has issues with negotiating gen 4 and I had to use gen 3 on them). I tried both pcie slots (the one linked to the cpu directly and the one behind the motherboard chip) and the card shows up fine, but still no gpus on the other side. The card shows green lights on all the leds besides the SYSTEM-FAIL one.

Regarding the GPUs themselves, the fans spin up and I know the power cables work since they work in the bifurcation setup. As for the power being supplied to the PCIe switch, I assume it draws from the motherboard; otherwise there's no way to provide additional power to it.

I can't say whether it's fake or not, it looks exactly like in the picture from KingKaido on l1tech thread. The seller wasn't exactly helpful; they don't seem technical. I did contact them again to see whether they could help somehow (with firmware files or something).

What I've tried so far in bios: lowering the pcie slot to gen 3; pcie ari enumeration, acs, iommu.

It could definitely be the cables as well, since the pcie bifurcation card I have can't use gen 4 properly (in this case, I think the card is not working as expected since I can get gen 4 x8 [but not gen 4 x16] on rdna 4, while rdna 3 refuses to do gen 4 completely [can use 8x gen 3, but if I plug both cables in it falls to gen 1 x16]).
What type of SFF-8654 to PCIe adapter board are you using? Some adapter boards may have different pin definitions, which will cause the device to fail to be recognized after connection.
 

itterative

New Member
May 2, 2026
8
2
3
This is also what the seller was suggesting, that the pcie detection doesn't work on the adapter card. They said the board I have requires that to work. I have one board on the way from them, so I'll see if it works soon.

As for the boards I already have, I have 3 but they are all the same (3 from the same seller, one from amazon). I tried to look at the trace lines for where the PCIe detection looked like it would be, and I think I'm missing 2 resistors on the adapters. However, I haven't done electrical engineering, so I might be wrong. Plus, I wouldn't be able to solder these resistors myself.

pcie_adapter.jpeg
 

TrashMaster

Active Member
Sep 8, 2024
116
87
28
Those look pretty rough... have you considered swapping to MCIO-based risers? Long PCB traces on the risers have been the number 1 pain in my back dealing with long-distance GPU connections. The pcie-megathread over at L1T has a lot of good info/pics/links.
 

itterative

New Member
May 2, 2026
8
2
3
Do you mean using a slimsas to mcio cable and the mcio gen 5 adapters? I might consider it if the new adapter from the switch seller (one which should have the pcie detection working) doesn't work. From the looks of it, it would be around 100 euro for each gpu though (30x2 for cables and 45-50 for board).

Otherwise, if you mean the pcie gen 5 switches with mcio, I'd rather stay on the gen 4 switch since the gen 5 ones are a lot more expensive and that's way beyond the budget I had in mind for my build.

I've seen your posts about the adapters you've been using, but I haven't seen these on either Amazon or AliExpress.
 

TrashMaster

Active Member
Sep 8, 2024
116
87
28
Even at 80cm, you should not be having signal problems with gen4. Something is wonky with the setup, and it's cheap/easy to figure out one GPU at a time.

1778933895850.png
1778933902180.png

I have used both of these, you could start with a single riser and single cable to see if your issue is resolved. Fairly low cost option for an experiment, then you could buy more over time.

These long-distance GPU setups are problem-prone because there are so many different pieces and places a problem can occur, and it's hard to detect which piece is being flaky: the GPU, the riser, the cable(s), the switch itself, etc.
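
One way to narrow down a flaky piece is to compare each GPU's negotiated link against its maximum. The sketch below parses the CSV that `nvidia-smi --query-gpu=... --format=csv,noheader` prints for the `pcie.link.gen.*` and `pcie.link.width.*` fields. The helper names and the sample text are made up for illustration, and this only covers NVIDIA cards with a working driver.

```python
import csv
import io
import subprocess

QUERY = (
    "index,name,pcie.link.gen.current,pcie.link.gen.max,"
    "pcie.link.width.current,pcie.link.width.max"
)


def read_link_csv() -> str:
    """Run nvidia-smi and return its CSV output (requires the NVIDIA driver)."""
    cmd = ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def degraded_links(csv_text: str) -> list[str]:
    """Return one message per GPU whose link is below its reported maximum."""
    problems = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 6:
            continue  # skip blank or malformed lines
        index, name, gen_cur, gen_max, width_cur, width_max = (f.strip() for f in row)
        if gen_cur != gen_max or width_cur != width_max:
            problems.append(
                f"GPU {index} ({name}): gen {gen_cur}/{gen_max}, x{width_cur}/x{width_max}"
            )
    return problems


# Made-up sample output for illustration; use read_link_csv() on a real host.
sample = (
    "0, NVIDIA GeForce RTX 3090, 4, 4, 16, 16\n"
    "1, NVIDIA GeForce RTX 3090, 1, 4, 16, 16\n"
)
for line in degraded_links(sample):
    print(line)
```

GPUs often drop to a lower link generation at idle to save power, so a reading like this is only meaningful while the card is under load. Swapping one riser or cable at a time and re-running the check is in the spirit of the one-GPU-at-a-time approach.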
 

itterative

New Member
May 2, 2026
8
2
3
Small update on my end: I received a new adapter from the same seller and this one works with my board. It seems like the board requires a PCIe insertion detection function (from what the seller told me), and none of the adapters I had have this.

I've attached photos of the board and adapter (left one is the one that works).
 

Attachments

unphased

Active Member
Jun 9, 2022
192
43
28
Great info here. Hey @TrashMaster what is that sexy vLLM dashboard? is it part of vllm or a separate tool of yours?
I am converging on the same aesthetic for most of my tooling. I've been using tmux for 15+ years and last few years I'm really pushing tmux, it seems to want to start to lag after I open 50 or so panes in it. So I might abandon tmux but terminal UI overall is clearly gonna stay around for another hundred years...
 

unphased

Active Member
Jun 9, 2022
192
43
28
Nice, thanks. That's a familiar name, I've been on that rtx6000 discord and it will continue to be a good resource now that I'm acquiring even more Blackwell GPUs. All baby GPUs, but these exciting P2P developments are lifting all boats.

I'm trying to figure out what I should do. I have a bunch of cost effective hardware but it has ended up being kind of a grab bag without a really clear strategy of how I should put it all together in...

- 5950X on Dark Hero
- 5600G on Asrock B550 ITX (limited to gen 3! shit!) -- i had my NAS on here...
- TR 1950X on Zenith Extreme (gen 3 x16/x8/x16/x8)
- pair of x99 i gotta get a second xeon for, but been collecting dust... I realized my NAS really also wants x8 for the connectx card, so it's planned for moving into one of these platforms.

- 2x3090 of differing heights (an FTW3 and a XLR8)
- 3090Ti FE
- 4 slot spacing 3090 NVLink bridge
- 2x5060Ti 16GB

At first I got super excited (I'm still excited) about PEX88096 being $270 for leveraging P2P, but I now know that this was just for a bare card that also only has 6x 8i ports, to get actual 5x gen 4 x16 I will need to spend $550, not $270, on either a GPU slot backplane (for hosting x16 GPUs) or a 10x 8i card (with DIP switch! thanks for the heads up!) and a ton of cables to run e.g. 8x or 10x GPUs.
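
The port counts above can be sanity-checked with a small lane budget. This sketch assumes 96 usable lanes (as the PEX88096 part number suggests), a x16 uplink to the host, and two 8i ports per x16 GPU; the function name is made up for illustration.

```python
# Lane budget for a PCIe switch feeding GPUs over SFF-8654 8i ports.
SWITCH_LANES = 96   # assumed from the PEX88096 part number
LANES_PER_8I = 8    # an 8i port carries 8 lanes
UPLINK_LANES = 16   # assumed x16 uplink to the host


def max_x16_gpus(num_8i_ports: int) -> int:
    """How many x16 GPUs the exposed ports can feed (two 8i ports each)."""
    port_lanes = num_8i_ports * LANES_PER_8I
    downstream_budget = SWITCH_LANES - UPLINK_LANES
    return min(port_lanes, downstream_budget) // 16


for ports in (6, 10):
    print(f"{ports} x 8i ports -> {max_x16_gpus(ports)} GPUs at x16")
```

Under those assumptions the 6-port card tops out at three x16 GPUs (or six at x8), while ten ports use the full 80 downstream lanes for five x16 GPUs.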

Right now my best strategy so far seems like to spend $470 or something on a 5x gen 4 x16 GPU breakout board with PEX88096 onboard. Then some $50 of adapters on top of this would let me connect all 5 GPUs into one host with decent P2P across the GPUs. Sure the 5060s won't really wanna talk to the 3090s but it's okay, and it seems like the 110 aggregate GB/s of the nvlink will come in handy to make an nvlinked pair of 3090s. One question is if i will be able to actually nvlink the FTW3 3090 to the 3090Ti FE, they have the right height. Couple issues... slot spacing of 4 slots i think will need me to get a slot gpu plane, not a 5 slot one (which looks like has 3 slot spacing)

On the other hand I could prolly wait a bit, and I bet I can do a bunch of awesome stuff with just the one nvlink pair that I can run. It's ridiculous to set up, needs my MacGyvered lifted PCI bracket mount, but at least I already know that I can run the XLR8 3090 and FTW3 3090 together on nvlink.

It's just so damn awkward that my 5600G lacks gen 4, so the 5060Ti testing I can do now without a gen 4 PEX (or throwing them in as primary cards on the x570) will gimp them down to a nearly pointless 8GB/s of P2P bw. Kinda wish I didn't go out to acquire 2 more GPUs before I realized the 88096's still cost $500. Oh well.
 

unphased

Active Member
Jun 9, 2022
192
43
28
wow i didn't realize there was so much design and strategy to work through. I went through a lot of realizations.
  1. So yes, PEX88096 is somewhat expensive, and, my existing hardware wouldn't gain THAT much off of it. I'd be able to get gen 4 x16 3-way 3090 P2P and gen 4 x8 5060Ti P2P, and likely marginal or no P2P between the groups. All on one machine would be a mild consolidation win.
  2. I could just run 3090 on x4 (one off DMI, ew), connect them on NVLink, plug the 5060ti on mobo's main slots for gen 4 x8.
  3. the above would require me to bring the 3090s out and because of how important nvlink is now, i have to make a rigid mount for them rather than having them just lie down.
  4. Acquire a Ryzen 3600 to downgrade my B550 ITX to gain Gen 4. This is a massive win. Seeing if I can pick one up locally for $30.
  5. I will be able to run a lean and mean 32GB VRAM LLM node with 2x5060Ti at a reasonable dual 4.0 x8 off the Ryzen 3600!
  6. Slap the remaining 3090Ti somewhere. hardly matters. it can even work if i can swing 16 lanes for it in the NAS rig, since it only adds 10W to idle power. (xeon on x99 lacks a GPU so this is less dumb than it sounds)
Even if i had the PEX88096, I will still be wanting to set up the NVLink on the 3090s, for the extra p2p speed for various needs, so since I already have the bridge...
 

TrashMaster

Active Member
Sep 8, 2024
116
87
28
Remember, none of that p2p/dma/nvlink stuff matters if your software can't take advantage of it. Building a peak performance GPU rig these days (being your own systems integrator) requires that kind of radical co-design, so you are the only person who would be able to know your workload. It's way easier to rent hardware and test rather than keep buying pieces and discovering new bottlenecks as they move around.
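
A quick first check on the software side is whether the driver even reports peer access between each pair of GPUs. This is a minimal sketch assuming PyTorch with CUDA support is installed; `p2p_matrix` is a made-up helper name, and a "yes" here only means peer access is possible, not that a given inference stack will actually use it.

```python
def p2p_matrix() -> list[list[bool]]:
    """Return matrix[i][j] = True if GPU i can directly access GPU j."""
    try:
        import torch  # assumption: PyTorch with CUDA support is installed
    except ImportError:
        return []
    n = torch.cuda.device_count()
    return [
        [i != j and torch.cuda.can_device_access_peer(i, j) for j in range(n)]
        for i in range(n)
    ]


for i, row in enumerate(p2p_matrix()):
    print(f"GPU {i}: " + " ".join("Y" if ok else "-" for ok in row))
```

On a machine without PyTorch or without visible GPUs this prints nothing, which makes it safe to try on a rented box before buying more parts.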