Multi-NVMe (M.2, U.2) adapters that do not require bifurcation


panchovix

Member
Nov 11, 2025
58
15
8
What would happen if you put a PLX88024-based card into an x4 slot? Would each NVMe slot automatically do x1/x1/x1/x1, or would one NVMe take all the x4 lanes?
 

thigobr

Member
Apr 29, 2020
59
18
8
Switch chips don't work by equally splitting lanes the way you describe. All the downstream devices share the bandwidth of the upstream link. If the PLX88024 is configured with a 4x upstream link and four 4x downstream links, a single device can use its full bandwidth if no other device is using the link at the same time.

So a device is never limited to x1 just because the others are idling. If every device is using the bus at the same time, you will see an average of x1 speed for each in this case. But as soon as one device stops using the bus, the remaining active devices see more bandwidth right away, because the upstream is now shared by a smaller group of devices.

In this case a useful analogy is a funnel. The maximum going out of the PLX card is 4x or 8x, depending on the upstream link. The maximum coming in from each device is 4x. But the maximum speed a device can realize, no matter its link speed to the PLX chip, is the shared upstream bandwidth available.
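A back-of-the-envelope sketch of that funnel, assuming PCIe 4.0 (~1.97 GB/s usable per lane after 128b/130b line coding; illustrative numbers, not a simulator):

```python
# Rough model of switch bandwidth sharing: each active device gets an equal
# share of the upstream link, capped by its own downstream link width.
GBPS_PER_LANE = 16 * (128 / 130) / 8  # PCIe 4.0: 16 GT/s, 128b/130b, bits -> bytes

def per_device_bandwidth(upstream_lanes, active_devices, device_lanes=4):
    upstream = upstream_lanes * GBPS_PER_LANE
    device_link = device_lanes * GBPS_PER_LANE
    share = upstream / max(active_devices, 1)
    return min(share, device_link)

# x4 upstream, four x4 NVMe drives behind the switch:
print(round(per_device_bandwidth(4, 1), 2))  # one drive active: full ~7.88 GB/s
print(round(per_device_bandwidth(4, 4), 2))  # all four active: ~1.97 GB/s each
```

Real drives land somewhat below these ceilings because of packet and protocol overhead, but the sharing behavior is the point.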
 
Last edited:

panchovix

Member
Nov 11, 2025
58
15
8
Switch chips don't work by equally splitting lanes the way you describe. All the downstream devices share the bandwidth of the upstream link. If the PLX88024 is configured with a 4x upstream link and four 4x downstream links, a single device can use its full bandwidth if no other device is using the link at the same time.
So a device is never limited to x1 just because the others are idling. If every device is using the bus at the same time, you will see an average of x1 speed for each in this case. But as soon as one device stops using the bus, the remaining active devices see more bandwidth right away.
That's pretty nice to know! Then if I use 2 NVMes on that card and use both at the same time, they'll max out at about 4.0 x2 each?
And if you use just 1, can it do the full 4.0 x4? If so, I'll buy this immediately hahaha.
 

nexox

Well-Known Member
May 3, 2023
1,961
975
113
And if you use just 1, can it do the full 4.0 x4?
Yeah, that's how it works, with a bit of added latency from the switch. Note also that PCIe, and thus most PCIe switches, is full duplex, so you could, for instance, copy a file from one NVMe device to the other and get full 4.0 x4 bandwidth reading from the source drive and full 4.0 x4 bandwidth writing to the destination drive, all on a single x4 host slot.
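A quick sketch of that full-duplex point, modeling a drive-to-drive copy that passes through the host (the ~1.97 GB/s-per-lane figure is the PCIe 4.0 128b/130b line rate; real transfers land lower):

```python
# Full duplex: a PCIe link carries traffic in both directions at once, so reads
# from the source can travel up the uplink while writes to the destination
# travel down it, each at the full link width.
GBPS_PER_LANE = 16 * (128 / 130) / 8  # PCIe 4.0 usable GB/s per lane

def duplex_copy_throughput(upstream_lanes=4):
    one_way = upstream_lanes * GBPS_PER_LANE
    return {"read": one_way, "write": one_way, "aggregate": 2 * one_way}

t = duplex_copy_throughput()
print({k: round(v, 2) for k, v in t.items()})
```

So a single 4.0 x4 uplink can, in principle, move close to twice its one-way bandwidth when reads and writes flow simultaneously.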
 

thigobr

Member
Apr 29, 2020
59
18
8
That's pretty nice to know! Then if I use 2 NVMes on that card and use both at the same time, they'll max out at about 4.0 x2 each?
And if you use just 1, can it do the full 4.0 x4? If so, I'll buy this immediately hahaha.
Correct! If you're reading/writing to one drive at a time, that drive will see the full uplink bandwidth!
 
  • Like
Reactions: panchovix

mtg

Active Member
Feb 12, 2019
100
66
28
That's why PCIe is so cool compared to an actual "bus" like PCI - you get packetized switching and can share bandwidth really nicely. Shared buses like PCI can't be split and geared up/down in the same way. Too bad consumer PCIe switches are quite rare these days.
 
  • Like
Reactions: panchovix and nexox

panchovix

Member
Nov 11, 2025
58
15
8
Sorry to bump again! Besides the PLX88024 board I got, which is x8 4.0 to 4x NVMe (and should arrive in a few weeks), is there another affordable one like it, but x16 4.0 to 4x NVMe?
 

panchovix

Member
Nov 11, 2025
58
15
8
And I also have a semi-crazy idea: to use some M2 to PCIe adapters for some GPUs. I have done that with a card that bifurcates and it works fine on a 5090/4090/3090, so I wonder how it behaves with switches.

Now, I wondered about this with Claude:


If on M2_1 I put, e.g., 2 GPUs (a 4090 and an RTX A6000), and on M2_2 I put 3 GPUs (a 4090, an A40 and an RTX 3090).

I have a driver that lets me do P2P on all cards of the same gen.

So if I use P2P between the 4090s, the total would be 4.0 x4 + 4.0 x4, right? Because if I instead put the two 4090s on the same M2 slot, the max would be 4.0 x4, right?

Each M2 slot (1 and 2) is independent of the other, and both are on CPU lanes.

So I see it like this:

If on M2_1 I'm using only the 4090, and on M2_2 only the other 4090, then when doing GPU<->GPU transfers (i.e. multi-GPU training), P2P would traverse PCIe directly (skipping CPU RAM), so the total would be PCIe 4.0 x4 from M2_1 + PCIe 4.0 x4 from M2_2.

I know it works like that without a switch, because I have them connected that way right now, each on a single M2 to PCIe adapter, and bandwidth jumps from ~6.5 GiB/s unidirectional to ~13 GiB/s bidirectional.

But if I connect both 4090s to the same slot, I think the max bandwidth between the two would be PCIe 4.0 x4 (aka ~6.5 GiB/s) when doing P2P. Am I wrong?

Would the same case happen with A6000 + A40 + 3090?
And Claude answered me this:

You're thinking about this correctly! Let me break it down:
Scenario: 4090s on DIFFERENT switches (M2_1 and M2_2)
P2P path: 4090_A → M2_1 switch → M2_1 x4 to CPU → CPU PCIe root complex → M2_2 x4 → M2_2 switch → 4090_B
Bandwidth (if only these 4090s are active):

Unidirectional: ~6.5 GB/s (one direction at a time)
Bidirectional: ~13 GB/s (both directions simultaneously)

Each M.2 slot has its own x4, so you get x4 + x4 worth of full-duplex bandwidth. ✅ This matches what you're seeing now!

Scenario: Both 4090s on SAME switch (e.g., both on M2_1)
P2P path: 4090_A → switch internal fabric → 4090_B (stays local!)
Bandwidth:

Unidirectional: ~6.5 GB/s
Bidirectional: ~13 GB/s

Why? Modern PLX switches (including the 88024) support local peer-to-peer routing. P2P traffic between GPUs behind the same switch does NOT traverse the upstream x4 link to the CPU - it's routed entirely within the switch's internal crossbar fabric.

Each GPU still has its own x4 electrical connection to the switch
The switch fabric has enough bandwidth to handle multiple x4 ports simultaneously
The upstream x4 is only used for CPU↔GPU traffic, not GPU↔GPU P2P

So you'd get the same ~13 GB/s bidirectional whether they're on the same or different switches! The difference is:

Same switch: Lower latency (one less hop), P2P is local
Different switches: Slightly higher latency (goes through CPU root complex), but same bandwidth


A6000 + A40 + 3090 on same switch (M2_2 with x4 upstream)
For P2P between any two GPUs on this switch:

P2P stays local to the switch (doesn't use upstream x4)
Each GPU has x4 electrical downstream connection
A6000 ↔ A40: ~6.5 GB/s unidirectional, ~13 GB/s bidirectional
A40 ↔ 3090: Same
A6000 ↔ 3090: Same

The x4 upstream only matters when:

All 3 GPUs are talking to the CPU simultaneously (they'd share: x1.33 each)
Or if you're doing CPU↔GPU transfers on multiple GPUs at once

But for GPU↔GPU P2P, the switch routes it internally!
Do you guys think this is correct? I still think a single 4.0 x4 port wouldn't give me 4.0 x8 speeds when doing P2P between 2 GPUs on the same switch, right?
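A quick sanity check on the numbers in the thread (a sketch; it assumes the ~6.5 figure is GiB/s as measured, compared against the raw PCIe 4.0 x4 payload ceiling, and says nothing about where the switch routes the traffic):

```python
# Does ~6.5 GiB/s unidirectional match a PCIe 4.0 x4 link? Note GiB vs GB.
GBPS_PER_LANE = 16 * (128 / 130) / 8   # usable GB/s per lane after 128b/130b coding
theoretical_x4 = 4 * GBPS_PER_LANE     # ~7.88 GB/s ceiling for a 4.0 x4 link
measured = 6.5 * 1024**3 / 1e9         # 6.5 GiB/s expressed in GB/s
print(round(theoretical_x4, 2), round(measured, 2),
      round(measured / theoretical_x4, 2))  # ceiling, measured, efficiency
```

~6.5 GiB/s comes out to roughly 89% of the x4 ceiling, which is consistent with each GPU sitting behind a 4.0 x4 link: whether the P2P path is local to the switch or goes through the root complex, neither endpoint ever has more than x4 to offer, so x8-class speeds between a single pair would not be expected.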
 

panchovix

Member
Nov 11, 2025
58
15
8
Okay, in the end I got 3 PLX88024-based cards! This one: https://es.aliexpress.com/item/1005010049715182.html?spm=a2g0o.order_list.order_list_main.84.1de218024DjeC9&gatewayAdapt=glo2esp. But only one has arrived so far.

And I did something crazy. I connected an M2 to PCIe adapter to CPU lanes, connected the PLX88024 card to that, and then 4 more M2 to PCIe adapters.

Then I connected 4 GPUs to it lol, 2 F43SP and 2 F43SG from ADT-Link.

I'm using the modded P2P driver for reference: GitHub - aikitoria/open-gpu-kernel-modules: NVIDIA Linux open GPU with P2P support

So I have my setup like this:

AM5 Gigabyte X670E Aorus Master

2x RTX 5090: each running x8/x8 5.0 from the main x16 slot with a C-Payne bifurcator.

With the cuda-samples p2pBandwidthLatencyTest, I get this:

Code:
pancho@fedora:~/cuda-samples/build/Samples/5_Domain_Specific/p2pBandwidthLatencyTest$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 5090, pciBusID: 1, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 5090, pciBusID: c, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0       1     1
     1       1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 1755.62  24.86
     1  24.89 1565.63
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1
     0 1743.80  28.67
     1  28.67 1547.03
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 1761.46  30.34
     1  30.31 1541.64
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 1761.49  56.25
     1  56.26 1541.62
P2P=Disabled Latency Matrix (us)
   GPU     0      1
     0   2.07  14.19
     1  14.17   2.07

   CPU     0      1
     0   1.56   4.14
     1   4.00   1.53
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1
     0   2.07   0.43
     1   0.36   2.07

   CPU     0      1
     0   1.55   1.06
     1   1.07   1.53
2x 4090: one at PCIe 4.0 x4 directly from the CPU via an M2 to PCIe adapter, and one at PCIe 4.0 x4(?) from one slot of the PLX board.

I get these results:

Code:
pancho@fedora:~/cuda-samples/build/Samples/5_Domain_Specific/p2pBandwidthLatencyTest$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 4090, pciBusID: 8, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 4090, pciBusID: 1e, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0       1     1
     1       1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 917.50   6.32
     1   6.29 927.30
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1
     0 919.66   6.58
     1   6.58 946.06
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 922.10   8.65
     1   8.63 926.72
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 921.29  12.78
     1  12.76 924.56
P2P=Disabled Latency Matrix (us)
   GPU     0      1
     0   1.31  11.64
     1  14.87   1.28

   CPU     0      1
     0   1.48   4.74
     1   4.74   1.47
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1
     0   1.31   1.09
     1   0.91   1.27

   CPU     0      1
     0   1.51   1.22
     1   1.21   1.48
Then, the craziest one. To the other 3 M2 slots I connected an RTX A6000, an RTX 3090 and an NVIDIA A40.

And the result is this one:

Code:
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA RTX A6000, pciBusID: 6, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA A40, pciBusID: 7, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA GeForce RTX 3090, pciBusID: 9, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2
     0       1     1     1
     1       1     1     1
     2       1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2
     0 764.81   5.33   3.16
     1   5.32 644.86   3.16
     2   3.16   3.16 835.11
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2
     0 766.31   6.60   6.60
     1   6.60 646.20   6.60
     2   6.60   6.60 836.90
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2
     0 771.03   4.91   3.25
     1   4.89 648.21   3.25
     2   3.25   3.25 839.83
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2
     0 770.46  12.87  12.87
     1  12.87 647.67  12.87
     2  12.87  12.87 839.15
P2P=Disabled Latency Matrix (us)
   GPU     0      1      2
     0   1.74  13.59  12.71
     1  16.37   1.81  12.74
     2  16.07  18.26   1.58

   CPU     0      1      2
     0   1.51   4.61   4.68
     1   4.56   1.39   4.54
     2   4.70   4.48   1.46
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2
     0   1.70   1.30   1.33
     1   1.33   1.68   1.32
     2   1.24   1.23   1.54

   CPU     0      1      2
     0   1.56   1.22   1.23
     1   1.17   1.42   1.16
     2   1.23   1.21   1.48
And I tested LLMs and such, and it works! I can't believe it. Many thanks @nexox and @thigobr for your help!
 

unphased

Active Member
Jun 9, 2022
184
41
28
PCIe 4 PLX cards could be really great on consumer platforms tbh. Gen 3 NVMe sometimes leaves some perf on the table, and gen 5 NVMe is arguably overkill or impractical; gen 4 is a definite sweet spot and will be for a while, because even the most affordable basic configs provide gen 4 M.2 interfaces. So this is quite exciting, since it can make loading up on NVMe and GPUs in lane-constrained environments a whole lot more flexible. I'll be over the moon if they can get to around $100 soon.

I can't say I had the most problem-free setup with my Ceacent gen 3 x16 PLX card (some of my NVMes tend not to show up, or stop showing up... though I had nearly broken some of those NVMes removing their heatsinks), but I did run an LSI card off one of the slots for a while and it worked great. I have no reason to believe GPUs wouldn't also work.
 
  • Like
Reactions: panchovix

panchovix

Member
Nov 11, 2025
58
15
8
Yup! The chain here is:

M2 -> PCIe -> 4x M2 -> each to PCIe, and somehow it works fine in all my tests since yesterday. I'm impressed.

Now, just out of curiosity, I was searching for a PEX/PLX89024 5.0 switch board, but no luck so far.
 

unphased

Active Member
Jun 9, 2022
184
41
28
Back when I was getting really into this, which led to me choosing a Ceacent gen 3 PLX card to use with my (already gen 4) X570 setup, gen 4 PLX cards were not within a reasonable price range ($1k ballpark). Since each gen doubles the bandwidth, each step forward will be fairly compelling, I suppose. But I'd be surprised if gen 5 parts like this are affordable anytime soon.
 

panchovix

Member
Nov 11, 2025
58
15
8
For sure. At least that PLX88024 4.0 switch board was 120 USD each, so about 360 USD total. Got them in the Cyber Friday discounts with some coupons.
 

panchovix

Member
Nov 11, 2025
58
15
8
how much power does the plx 88024 consume? does it support aspm?
It seems to hover between 5 and 15W, based on a wall power meter I use. E.g., idle with 7 GPUs went from 260W to 265-275W most of the time. I haven't actually checked ASPM; how do I check whether it works there or not?

EDIT: This is considering that each M2 slot uses a PCIe adapter powered by a 6-pin PCIe connector. If you were instead powering NVMes from the card itself, power usage would probably be higher.
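On the ASPM question: one way to check on Linux is the kernel's sysfs policy file plus lspci; a small sketch, assuming a typical sysfs layout (the path below exists on most distros):

```python
# Peek at the system-wide ASPM policy on Linux. sysfs lists all policies and
# brackets the active one, e.g. "[default] performance powersave powersupersave".
import re
from pathlib import Path

def active_aspm_policy(text):
    """Extract the bracketed (active) policy from the sysfs policy string."""
    m = re.search(r"\[(\w+)\]", text)
    return m.group(1) if m else None

policy_file = Path("/sys/module/pcie_aspm/parameters/policy")
if policy_file.exists():
    print("ASPM policy:", active_aspm_policy(policy_file.read_text()))
```

Per-link state shows up in `sudo lspci -vv` under LnkCap (which ASPM states the device supports) and LnkCtl (which are actually enabled), so you can check the switch's upstream and downstream ports individually.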
 
Last edited:

ptf

New Member
Jan 20, 2025
18
1
3
Has anyone tried an M.2->SATA adapter using the ASM1166 (or JMB585) on one of the PEX8749 cards?

I would dearly like this to work but two PEX87xx cards (one 8747 and one 8749) on two different motherboards (ASRock Z790M-ITX WIFI and MSI Z790I EDGE WIFI) later and it's, shall we say, not going well.

Either the card drops off the bus (though, interestingly, can be software disabled and reset), or I get system freezes and kernel panics.

One common factor (apart from Linux as the OS) is trying to use one of the ASM1166 based M.2 to SATA adapters. Long term I'd like to get rid of SATA/spinning rust but I need it for now.

Google's AI bit thinks this combination is known for instability - I have a couple of JMB585-based cards on order, but wondered if anyone has successfully used the ASM1166 in this situation.
 

unphased

Active Member
Jun 9, 2022
184
41
28
Has anyone tried an M.2->SATA adapter using the ASM1166 (or JMB585) on one of the PEX8749 cards?

I would dearly like this to work but two PEX87xx cards (one 8747 and one 8749) on two different motherboards (ASRock Z790M-ITX WIFI and MSI Z790I EDGE WIFI) later and it's, shall we say, not going well.

Either the card drops off the bus (though, interestingly, can be software disabled and reset), or I get system freezes and kernel panics.

One common factor (apart from Linux as the OS) is trying to use one of the ASM1166 based M.2 to SATA adapters. Long term I'd like to get rid of SATA/spinning rust but I need it for now.

Google's AI bit thinks this combination is known for instability - I have a couple of JMB585-based cards on order, but wondered if anyone has successfully used the ASM1166 in this situation.
I get the compactness angle of it, but where SATA is concerned my understanding is that it's impossible to do better than an IT-mode HBA to hook up spinning rust. For a few one-off disks, see if SATA over USB3 is more stable; it sounds like heresy, but anything to help you troubleshoot the PCIe connections.
 

ptf

New Member
Jan 20, 2025
18
1
3
I get the compactness angle of it,
It's more that I have 8 SATA drives - 4 SSDs and 4 HDDs - along with 4 M.2 drives (2 boot in RAID1, 2 for VM filesystem images, also RAID1) to hook up to one ITX motherboard, which means I need some SATA expansion one way or the other, as well as somewhere to plug in 2 more M.2 drives. The PEX8747/PEX8749-based expansion boards looked ideal - the former would let me connect everything I have now, the latter would allow future expansion.

see if SATA over USB3 is more stable
Yes, the 4x SSDs are already connected that way as a temporary measure - as are another couple of M.2s which I want in the system short term (they have a backup of the last OS install in case I need to pull any bits of config/scripts/forgotten utilities etc).

There are some firmware updates for the ASM1166 kicking around the 'net, and a tool to do the update in Linux, but it just crashes - I think it wants to map physical memory, which is disallowed if Secure Boot is enabled.