Multi-NVMe (m.2, u.2) adapters that do not require bifurcation

beatle

Member
Mar 23, 2017
73
15
8
That looks about right. Then you just need an NVMe to USB enclosure and a means to secure the drive inside the case. Velcro or zipties should suffice.
 
  • Like
Reactions: nexox

kryten

Member
Apr 10, 2023
111
21
18
Can't speak for whichever model you have, but I had the gen3 4x SFF-8643 model. It definitely didn't have hotplug implemented correctly. It seemed to only have one hotplug slot when it should have had one for each downstream port.

If you can boot into Linux, do a `tree /sys/bus/pci/slots` and check `/sys/bus/pci/slots/*/address` to see which ones correspond to what. If you can't, then you can check in hwinfo64 and you should see something like this for each port:
View attachment 44066

You also need BIOS support for hotplug events. Fortunately, Thunderbolt has forced most manufacturers to start supporting hotplug events.
Have you come across a card that has hot plug on all slots yet?
I haven't seen a PCIE4 card yet
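
The sysfs check quoted above (`/sys/bus/pci/slots/*/address`) can also be scripted. This is only a minimal sketch, assuming a Linux host that exposes hotplug slots under that directory; the function name `list_pci_slots` is made up for illustration.

```python
from pathlib import Path


def list_pci_slots(root="/sys/bus/pci/slots"):
    """Return {slot_name: address} for every slot directory under root.

    Hypothetical helper: it only reads the 'address' file that the kernel
    places in each slot directory and skips entries that lack one.
    """
    slots = {}
    base = Path(root)
    if not base.is_dir():
        # No hotplug slots registered (or not a Linux host).
        return slots
    for entry in sorted(base.iterdir()):
        addr_file = entry / "address"
        if addr_file.is_file():
            slots[entry.name] = addr_file.read_text().strip()
    return slots


if __name__ == "__main__":
    for name, addr in list_pci_slots().items():
        print(f"slot {name}: {addr}")
```

If the card implements hotplug per downstream port, you would expect one entry per port here; a single entry would match the behaviour described for the gen3 SFF-8643 card.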
 

mattventura

Well-Known Member
Nov 9, 2022
775
432
63
Have you come across a card that has hot plug on all slots yet?
I haven't seen a PCIE4 card yet
I haven't bought one, but I believe the best options would be some of the higher-end HighPoint cards and Broadcom's in-house cards (pure PCIe switches, NOT tri-mode). Both of those advertise hotplug and backplane management support.

Theoretically, the serialcables.com card with external ports that I have should also work for this, but I haven't been able to do anything useful since it apparently uses a non-standard pinout.
 
  • Like
Reactions: nexox

panchovix

Member
Nov 11, 2025
63
17
8
Hello there, sorry to necro. Does someone know if there's a switch that does X4 to X2X2 but PCIe Gen 4? I'm finding some ASM2812 cards that do PCIe X4 to X2X2 but Gen 3.
 

Mithril

Active Member
Sep 13, 2019
477
162
43
If any of the Gen4 stuff is affordable yet I'd also love to know (any) to (any)2x or 1x as well

Also if anyone has seen any gen4/5 with the downstream being only Gen3 that could be interesting
 

klosz007

Member
Jul 26, 2021
30
15
8
On Aliexpress I already saw PLX88024-based cards which are not that expensive (~200USD). PCIe 4.0 x8 to 4x M.2 PCIe 4.0 (x4 I think). This might be the early birds of PCIe 4.0-enabled switch cards.
 
  • Like
Reactions: panchovix and abq

panchovix

Member
Nov 11, 2025
63
17
8
What would happen if you put a PLX88024-based card into an X4 slot? Would each NVMe slot run at X1/X1/X1/X1 automatically, or would one NVMe take all the X4 lanes?
 

thigobr

Member
Apr 29, 2020
63
19
8
Switch chips don't work by equally splitting lanes in the way you mentioned. All the downstream devices share the bandwidth of the upstream link. If the PLX88024 is configured with a 4X upstream link and four 4X downstream links, a single device can use its full bandwidth as long as no other device is using the link at the same time.

So a device is never limited to 1X if the others are just idling. If every device is using the bus at the same time, you will see an average speed of 1X for each in this case. But as soon as some device stops using the bus, the remaining active devices see more bandwidth right away, because the upstream is now shared by a smaller group of devices.

A useful analogy here is a funnel. The maximum going out of the PLX card is 4X or 8X, depending on the upstream link. The maximum coming in from each device is 4X. But the maximum speed a device can realize, no matter its link speed to the PLX chip, is the shared upstream bandwidth available.
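
To put numbers on the funnel analogy, here is a rough sketch of the sharing arithmetic. It assumes identical devices that split the upstream link evenly while active and it ignores protocol overhead; `per_device_lanes` is a made-up name, and real switch arbitration may not be this even.

```python
def per_device_lanes(upstream_lanes, downstream_lanes, active_devices):
    """Idealized "lane-equivalents" of bandwidth each active device sees.

    Assumption: all active devices are identical and share the upstream
    link evenly. A device can never exceed its own downstream link.
    """
    if active_devices < 1:
        raise ValueError("need at least one active device")
    return min(downstream_lanes, upstream_lanes / active_devices)


if __name__ == "__main__":
    # x4 upstream, x4 downstream links, as in the example above.
    for active in (1, 2, 4):
        share = per_device_lanes(4, 4, active)
        print(f"{active} active device(s): about x{share:g} each")
```

With an x8 upstream link instead, two active x4 devices would each still see their full x4 in this model, since the funnel's outlet is then wide enough for both.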
 

panchovix

Member
Nov 11, 2025
63
17
8
Switch chips don't work by equally splitting lanes in the way you mentioned. All the downstream devices share the bandwidth of the upstream link. If the PLX88024 is configured with upstream 4X and downstream 4x 4X, a single device can use it's full bandwidth, if there's no other device using the link at the same time.
So a device is never limited at 1X if others are just idling. Now if every device is using the bus at the same time you will see an average speed of 1X for each in this case. But as soon some device stops using the bus the remaining active devices start to see more bandwidth right away.
That's pretty nice to know! Then if I use 2 NVMes on that card and use both at the same time, they will max out at about X2 4.0 each?
And if you use just 1, it can do the full X4 4.0? If yes I will buy this immediately hahaha.
 

nexox

Well-Known Member
May 3, 2023
2,000
1,001
113
And if you use just 1, it can do the full X4 4.0?
Yeah, that's how it works, with a bit of added latency from the switch. Note also that PCIe, and thus most PCIe switches, is full duplex, so you could, for instance, copy a file from one NVMe device to the other and get full 4.0 x4 bandwidth reading from the source drive and full 4.0 x4 bandwidth writing to the destination drive, all on a single x4 host slot.
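
For a sense of scale, the raw per-direction bandwidth of a link can be worked out from the transfer rate and the 128b/130b line encoding used by Gen 3 through Gen 5. This is a back-of-the-envelope sketch only: it ignores packet and protocol overhead, so real transfers land lower, and `link_bandwidth_gbs` is a name invented for the example.

```python
# Per-lane transfer rates in GT/s for the generations discussed here.
GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}

# Gen 3 and later use 128b/130b encoding: 128 payload bits per 130 sent.
ENCODING_EFFICIENCY = 128 / 130


def link_bandwidth_gbs(gen, lanes):
    """Raw bandwidth in GB/s for ONE direction of a PCIe link.

    Excludes TLP/DLLP overhead, so measured throughput will be lower.
    """
    return GT_PER_S[gen] * lanes * ENCODING_EFFICIENCY / 8


if __name__ == "__main__":
    one_way = link_bandwidth_gbs(4, 4)
    print(f"Gen4 x4, one direction: {one_way:.2f} GB/s")
    # Full duplex: the same figure is available in the other direction
    # at the same time, which is what makes the drive-to-drive copy work.
    print(f"Gen4 x4, both directions: {2 * one_way:.2f} GB/s")
```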
 

thigobr

Member
Apr 29, 2020
63
19
8
That's pretty nice to know! Then if I use 2 NVMes on that card and use both at the same time, they will max out at about X2 4.0 each?
And if you use just 1, it can do the full X4 4.0? If yes I will buy this immediately hahaha.
Correct! If you're reading from or writing to one drive at a time, that drive will see the full uplink bandwidth!
 
  • Like
Reactions: panchovix

mtg

Active Member
Feb 12, 2019
111
71
28
That's why PCIe is so cool compared to an actual "bus" like PCI: you get packetized switching and can share bandwidth really nicely. Shared buses like PCI can't be split and geared up/down in the same way. Too bad consumer PCIe switches are quite rare these days.
 
  • Like
Reactions: panchovix and nexox

panchovix

Member
Nov 11, 2025
63
17
8
Sorry to bump again! Besides the PLX88024 board I got, which is X8 4.0 to 4 NVMe (it should arrive in a few weeks), is there another affordable one like this, but X16 4.0 to 4 NVMe?
 

panchovix

Member
Nov 11, 2025
63
17
8
I also have a semi-crazy idea: using M2 to PCIe adapters to attach GPUs. I have done that with a card that bifurcates, and it works fine with a 5090/4090/3090, so I wonder whether it would work with switches.

So I asked Claude this:


Say on M2_1 I put 2 GPUs, a 4090 and an RTX A6000, and on M2_2 I put 3 GPUs, a 4090, an A40 and an RTX 3090.

I have a driver that lets me do P2P on all cards of the same gen.

So if I use P2P between the 4090s, it would be a total of 4.0 X4 + 4.0 X4, right? Because if I instead put the 2 4090s on the same M2 slot, the max would be 4.0 X4, right?

M2_1 and M2_2 are independent of each other and are both on CPU lanes.

So I see it like this:

If on M2_1 I'm using only the 4090, and on M2_2 I'm using only the other 4090, then when doing GPU<->GPU transfers (e.g. training with multiple GPUs), P2P would go through PCIe directly (skipping CPU RAM), so the total would be PCIe 4.0 X4 from M2_1 + PCIe 4.0 X4 from M2_2.

I know it is like that without a switch, because I have them connected like that right now, each on a single M2 to PCIe adapter, and going from unidirectional to bidirectional the bandwidth jumps from ~6.5 GiB/s to ~13 GiB/s.

But if I connect both 4090s on the same slot, I think the max bandwidth between the 2 would be PCIe 4.0 X4 (aka ~6.5GiB/s) when doing P2P. Am I wrong?

Would the same case happen with A6000 + A40 + 3090?
And Claude answered me this:

You're thinking about this correctly! Let me break it down:
Scenario: 4090s on DIFFERENT switches (M2_1 and M2_2)
P2P path: 4090_A → M2_1 switch → M2_1 x4 to CPU → CPU PCIe root complex → M2_2 x4 → M2_2 switch → 4090_B
Bandwidth (if only these 4090s are active):

Unidirectional: ~6.5 GB/s (one direction at a time)
Bidirectional: ~13 GB/s (both directions simultaneously)

Each M.2 slot has its own x4, so you get x4 + x4 worth of full-duplex bandwidth. ✅ This matches what you're seeing now!

Scenario: Both 4090s on SAME switch (e.g., both on M2_1)
P2P path: 4090_A → switch internal fabric → 4090_B (stays local!)
Bandwidth:

Unidirectional: ~6.5 GB/s
Bidirectional: ~13 GB/s

Why? Modern PLX switches (including the 88024) support local peer-to-peer routing. P2P traffic between GPUs behind the same switch does NOT traverse the upstream x4 link to the CPU - it's routed entirely within the switch's internal crossbar fabric.

Each GPU still has its own x4 electrical connection to the switch
The switch fabric has enough bandwidth to handle multiple x4 ports simultaneously
The upstream x4 is only used for CPU↔GPU traffic, not GPU↔GPU P2P

So you'd get the same ~13 GB/s bidirectional whether they're on the same or different switches! The difference is:

Same switch: Lower latency (one less hop), P2P is local
Different switches: Slightly higher latency (goes through CPU root complex), but same bandwidth


A6000 + A40 + 3090 on same switch (M2_2 with x4 upstream)
For P2P between any two GPUs on this switch:

P2P stays local to the switch (doesn't use upstream x4)
Each GPU has x4 electrical downstream connection
A6000 ↔ A40: ~6.5 GB/s unidirectional, ~13 GB/s bidirectional
A40 ↔ 3090: Same
A6000 ↔ 3090: Same

The x4 upstream only matters when:

All 3 GPUs are talking to the CPU simultaneously (they'd share: x1.33 each)
Or if you're doing CPU↔GPU transfers on multiple GPUs at once

But for GPU↔GPU P2P, the switch routes it internally!
Do you guys think this is correct? I still think that a single X4 4.0 port wouldn't give me X4 8.0 speeds when using P2P on 2 GPUs on the same switch, right?
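
One way to reason about the two scenarios is to list which links a transfer has to cross. The sketch below is purely illustrative and rests on the assumption stated in the quoted answer, namely that the switch forwards peer traffic locally instead of sending it up to the root complex. The topology, the node names (`root`, `sw1`, `gpuA`, ...) and the helper functions are all made up for the example.

```python
def path_to_root(node, parent):
    """Return [node, its parent, ..., root] following the parent map."""
    path = [node]
    while node in parent:
        node = parent[node][0]
        path.append(node)
    return path


def p2p_links(a, b, parent):
    """Links (child, parent, lanes) that a transfer between a and b crosses.

    Assumption: traffic turns around at the lowest common ancestor,
    i.e. a switch routes peer-to-peer traffic between its own ports.
    """
    up_a = path_to_root(a, parent)
    up_b = path_to_root(b, parent)
    common = next(n for n in up_a if n in up_b)
    links = []
    for chain in (up_a, up_b):
        for node in chain:
            if node == common:
                break
            links.append((node, parent[node][0], parent[node][1]))
    return links


# Hypothetical topology: child -> (parent, lane width of that link).
topology = {
    "sw1": ("root", 4), "sw2": ("root", 4),
    "gpuA": ("sw1", 4), "gpuB": ("sw1", 4), "gpuC": ("sw2", 4),
}

if __name__ == "__main__":
    print("same switch:     ", p2p_links("gpuA", "gpuB", topology))
    print("different switch:", p2p_links("gpuA", "gpuC", topology))
```

Under that assumption, the same-switch path never touches the `sw1` to `root` uplink, while the cross-switch path crosses both uplinks. Whether a given card actually behaves this way has to be measured.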
 

panchovix

Member
Nov 11, 2025
63
17
8
Okay, in the end I got 3 PLX88024-based cards! This one: https://es.aliexpress.com/item/1005010049715182.html?spm=a2g0o.order_list.order_list_main.84.1de218024DjeC9&gatewayAdapt=glo2esp. But only one has arrived so far.

And I did something crazy. I connected an M2 to PCIe adapter on CPU lanes, plugged the PLX88024 card into it, and then connected 4 more M2 to PCIe adapters to the card.

Then I connected 4 GPUs to it lol, using 2 F43SP and 2 F43SG from ADT Link.

I'm using the modded P2P driver for reference: GitHub - aikitoria/open-gpu-kernel-modules: NVIDIA Linux open GPU with P2P support

So I have my setup like this:

AM5 Gigabyte X670E Aorus Master

RTX 5090 x2: Each using X8/X8 5.0 from the main X16 slot with a c-payne bifurcator.

With the p2pBandwidthLatencyTest from cuda-samples, I get this:

Code:
pancho@fedora:~/cuda-samples/build/Samples/5_Domain_Specific/p2pBandwidthLatencyTest$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 5090, pciBusID: 1, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 5090, pciBusID: c, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0       1     1
     1       1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 1755.62  24.86
     1  24.89 1565.63
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1
     0 1743.80  28.67
     1  28.67 1547.03
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 1761.46  30.34
     1  30.31 1541.64
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 1761.49  56.25
     1  56.26 1541.62
P2P=Disabled Latency Matrix (us)
   GPU     0      1
     0   2.07  14.19
     1  14.17   2.07

   CPU     0      1
     0   1.56   4.14
     1   4.00   1.53
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1
     0   2.07   0.43
     1   0.36   2.07

   CPU     0      1
     0   1.55   1.06
     1   1.07   1.53
2x 4090: one at PCIe X4 4.0 directly from the CPU through an M2 to PCIe adapter, and one at PCIe X4 4.0(?) from one slot of the PLX board.

I get these results:

Code:
pancho@fedora:~/cuda-samples/build/Samples/5_Domain_Specific/p2pBandwidthLatencyTest$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 4090, pciBusID: 8, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 4090, pciBusID: 1e, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0       1     1
     1       1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 917.50   6.32
     1   6.29 927.30
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1
     0 919.66   6.58
     1   6.58 946.06
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 922.10   8.65
     1   8.63 926.72
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 921.29  12.78
     1  12.76 924.56
P2P=Disabled Latency Matrix (us)
   GPU     0      1
     0   1.31  11.64
     1  14.87   1.28

   CPU     0      1
     0   1.48   4.74
     1   4.74   1.47
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1
     0   1.31   1.09
     1   0.91   1.27

   CPU     0      1
     0   1.51   1.22
     1   1.21   1.48
Then, the craziest one. On the other 3 M2 slots, I connected an RTX A6000, an RTX 3090 and an NVIDIA A40.

And the result is this one:

Code:
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA RTX A6000, pciBusID: 6, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA A40, pciBusID: 7, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA GeForce RTX 3090, pciBusID: 9, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2
     0       1     1     1
     1       1     1     1
     2       1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2
     0 764.81   5.33   3.16
     1   5.32 644.86   3.16
     2   3.16   3.16 835.11
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2
     0 766.31   6.60   6.60
     1   6.60 646.20   6.60
     2   6.60   6.60 836.90
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2
     0 771.03   4.91   3.25
     1   4.89 648.21   3.25
     2   3.25   3.25 839.83
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2
     0 770.46  12.87  12.87
     1  12.87 647.67  12.87
     2  12.87  12.87 839.15
P2P=Disabled Latency Matrix (us)
   GPU     0      1      2
     0   1.74  13.59  12.71
     1  16.37   1.81  12.74
     2  16.07  18.26   1.58

   CPU     0      1      2
     0   1.51   4.61   4.68
     1   4.56   1.39   4.54
     2   4.70   4.48   1.46
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2
     0   1.70   1.30   1.33
     1   1.33   1.68   1.32
     2   1.24   1.23   1.54

   CPU     0      1      2
     0   1.56   1.22   1.23
     1   1.17   1.42   1.16
     2   1.23   1.21   1.48
I also tested it on LLMs and such, and it works! I can't believe it. Many thanks @nexox and @thigobr for your help!
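
For comparing runs like the three above, the matrices can be pulled out of the pasted output with a few lines of Python. This is only a sketch that assumes the text layout shown in this post (a header line, a `D\D` column row, then one row per device); `parse_matrix` is a made-up helper, not part of cuda-samples.

```python
def parse_matrix(output, header):
    """Return the numeric matrix that follows the line starting with header.

    Assumes the layout printed by the test above: header line, a 'D\\D'
    column-label row, then rows of '<index> <value> <value> ...'.
    Blank lines are skipped so double-spaced pastes also work.
    """
    lines = output.splitlines()
    start = None
    for i, line in enumerate(lines):
        if line.strip().startswith(header):
            start = i
            break
    if start is None:
        raise ValueError(f"header not found: {header!r}")
    rows = []
    for line in lines[start + 1:]:
        tokens = line.split()
        if not tokens or tokens[0] == "D\\D":
            continue
        try:
            values = [float(t) for t in tokens]
        except ValueError:
            break  # reached the next section header
        rows.append(values[1:])  # drop the leading device index
    return rows


def peer_bandwidths(matrix):
    """Off-diagonal entries, i.e. the GPU-to-GPU figures."""
    return [v for i, row in enumerate(matrix)
            for j, v in enumerate(row) if i != j]
```

Feeding it the 2x 4090 output with the header `Bidirectional P2P=Enabled` would return the 2x2 matrix, and `peer_bandwidths` would then give just the two cross-GPU values.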
 

unphased

Active Member
Jun 9, 2022
195
43
28
PCIe 4 PLX cards could be really great in consumer platforms tbh. Gen 3 NVMe sometimes leaves some perf on the table, and Gen 5 NVMe is arguably overkill or impractical. Gen 4 is a definite sweet spot and will be for a while, because even the most affordable basic configs provide Gen 4 M.2 interfaces. So this is quite exciting, since it can make loading up on NVMe and GPUs in lane-constricted environments a whole lot more flexible. I'll be over the moon if they can get to around $100 soon.

I can't say I had the most problem-free setup with my ceacent gen 3 x16 PLX card (some of my NVMes tend to not show up or to stop showing up, though I had almost broken some of them during heatsink removal), but I did run an LSI card off one of the slots for a while and it worked great. I have no reason to believe GPUs wouldn't also work.
 
  • Like
Reactions: panchovix