That's pretty nice to know! Then if I use 2 NVMes on that card and use both at the same time, will they each max out at about 4.0 x2?

Switch chips don't work by equally splitting lanes the way you describe. All the downstream devices share the bandwidth of the upstream link. If the PLX88024 is configured with an x4 upstream and four x4 downstream ports, a single device can use its full bandwidth as long as no other device is using the link at the same time.
So a device is never limited to x1 just because others exist but are idling. If every device is using the bus at the same time, you'll see an average of roughly x1 worth of bandwidth each in this case, but as soon as one device stops using the bus, the remaining active devices see more bandwidth right away.
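The sharing behaviour described above can be sketched as a toy model (assumptions: ideal arbitration with no switch overhead, and the ~6.5 GB/s real-world figure for a PCIe 4.0 x4 link that comes up later in the thread):

```python
# Toy model of how a PCIe switch shares its upstream link.
# ASSUMPTION: ideal round-robin arbitration, no switch overhead.
# ~6.5 GB/s is the real-world PCIe 4.0 x4 throughput reported in this thread.
UPLINK_GBPS = 6.5

def per_device_bandwidth(active_devices: int, uplink: float = UPLINK_GBPS) -> float:
    """Bandwidth each *active* device sees; idle devices don't count."""
    if active_devices <= 0:
        return 0.0
    return uplink / active_devices

# One busy NVMe gets the whole uplink; four busy ones split it evenly.
print(per_device_bandwidth(1))  # 6.5
print(per_device_bandwidth(4))  # 1.625
```

The point of the model: the divisor is the number of *currently active* devices, not the number of devices plugged in, which is why a lone drive always sees the full uplink.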
Yeah, that's how it works, with a bit of added latency from the switch. Note also that PCIe (and thus most PCIe switches) is full duplex, so you could, for instance, copy a file from one NVMe device to the other and get full 4.0 x4 bandwidth reading from the source drive and full 4.0 x4 bandwidth writing to the destination drive, all on a single x4 host slot.

And if you use just 1, it can do the full 4.0 x4?
Correct! If you're reading/writing to one drive at a time, that drive will see the full uplink bandwidth!

That's pretty nice to know! Then if I use 2 NVMes on that card and use both at the same time, will they each max out at about 4.0 x2?
And if you use just 1, it can do the full 4.0 x4? If yes I will buy this immediately hahaha.
And Claude answered me this:

If on M2_1 I put, say, 2 GPUs (a 4090 and an RTX A6000), and on M2_2 I put 3 GPUs (a 4090, an A40, and an RTX 3090).
I have a driver that lets me do P2P on all cards of the same gen.
So if I use P2P between the 4090s, the total would be 4.0 x4 + 4.0 x4, right? Because if instead I put the two 4090s on the same M.2 slot, the max would be 4.0 x4, right?
Each M2 (1 and 2) are independent of each other and are on CPU lanes.
So I see it like this:
If on M2_1 I'm using only one 4090, and on M2_2 I'm using only the other 4090, then when going GPU↔GPU (i.e. training with multiGPU), P2P would traverse PCIe directly (skipping CPU RAM), so the total would be PCIe 4.0 x4 from M2_1 + PCIe 4.0 x4 from M2_2.
I know it works like that without a switch, because I have them connected that way right now using plain M.2-to-PCIe adapters, and going from unidirectional to bidirectional jumps from ~6.5 GiB/s to ~13 GiB/s.
But if I connect both 4090s behind the same slot, I think the max bandwidth between the two would be PCIe 4.0 x4 (i.e. ~6.5 GiB/s) when doing P2P. Am I wrong?
Would the same case happen with A6000 + A40 + 3090?
Do you guys think this is correct? I still think a single 4.0 x4 port wouldn't give me 8.0 x4 speeds when using P2P between 2 GPUs on the same switch, right?

You're thinking about this correctly! Let me break it down:
Scenario: 4090s on DIFFERENT switches (M2_1 and M2_2)
P2P path: 4090_A → M2_1 switch → M2_1 x4 to CPU → CPU PCIe root complex → M2_2 x4 → M2_2 switch → 4090_B
Bandwidth (if only these 4090s are active):
Unidirectional: ~6.5 GB/s (one direction at a time)
Bidirectional: ~13 GB/s (both directions simultaneously)
Each M.2 slot has its own x4 link, so you get x4 + x4 worth of full-duplex bandwidth.

This matches what you're seeing now!
Scenario: Both 4090s on SAME switch (e.g., both on M2_1)
P2P path: 4090_A → switch internal fabric → 4090_B (stays local!)
Bandwidth:
Unidirectional: ~6.5 GB/s
Bidirectional: ~13 GB/s
Why? Modern PLX switches (including the 88024) support local peer-to-peer routing. P2P traffic between GPUs behind the same switch does NOT traverse the upstream x4 link to the CPU - it's routed entirely within the switch's internal crossbar fabric.
Each GPU still has its own x4 electrical connection to the switch
The switch fabric has enough bandwidth to handle multiple x4 ports simultaneously
The upstream x4 is only used for CPU↔GPU traffic, not GPU↔GPU P2P
So you'd get the same ~13 GB/s bidirectional whether they're on the same or different switches! The difference is:
Same switch: Lower latency (one less hop), P2P is local
Different switches: Slightly higher latency (goes through CPU root complex), but same bandwidth
A6000 + A40 + 3090 on same switch (M2_2 with x4 upstream)
For P2P between any two GPUs on this switch:
P2P stays local to the switch (doesn't use upstream x4)
Each GPU has x4 electrical downstream connection
A6000↔A40: ~6.5 GB/s unidirectional, ~13 GB/s bidirectional
A40↔3090: Same
A6000↔3090: Same
The x4 upstream only matters when:
All 3 GPUs are talking to the CPU simultaneously (they'd share the x4: ~x1.33 each)
Or if you're doing CPU↔GPU transfers on multiple GPUs at once
But for GPU↔GPU P2P, the switch routes it internally!
pancho@fedora:~/cuda-samples/build/Samples/5_Domain_Specific/p2pBandwidthLatencyTest$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 5090, pciBusID: 1, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 5090, pciBusID: c, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
D\D 0 1
0 1 1
1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 1755.62 24.86
1 24.89 1565.63
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1
0 1743.80 28.67
1 28.67 1547.03
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 1761.46 30.34
1 30.31 1541.64
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1
0 1761.49 56.25
1 56.26 1541.62
P2P=Disabled Latency Matrix (us)
GPU 0 1
0 2.07 14.19
1 14.17 2.07
CPU 0 1
0 1.56 4.14
1 4.00 1.53
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1
0 2.07 0.43
1 0.36 2.07
CPU 0 1
0 1.55 1.06
1 1.07 1.53
pancho@fedora:~/cuda-samples/build/Samples/5_Domain_Specific/p2pBandwidthLatencyTest$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 4090, pciBusID: 8, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 4090, pciBusID: 1e, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
D\D 0 1
0 1 1
1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 917.50 6.32
1 6.29 927.30
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1
0 919.66 6.58
1 6.58 946.06
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 922.10 8.65
1 8.63 926.72
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1
0 921.29 12.78
1 12.76 924.56
P2P=Disabled Latency Matrix (us)
GPU 0 1
0 1.31 11.64
1 14.87 1.28
CPU 0 1
0 1.48 4.74
1 4.74 1.47
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1
0 1.31 1.09
1 0.91 1.27
CPU 0 1
0 1.51 1.22
1 1.21 1.48
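The measured numbers above can be sanity-checked against theory. A quick sketch of the theoretical per-direction PCIe bandwidth (Gen 3/4/5 all use 128b/130b encoding; the per-lane rates in GT/s are from the PCIe spec):

```python
# Theoretical one-direction PCIe bandwidth per generation and lane count.
# Gen 3/4/5 use 128b/130b line encoding; rates below are GT/s per lane.
GT_PER_LANE = {3: 8.0, 4: 16.0, 5: 32.0}

def pcie_gbps(gen: int, lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s (gens 3-5 only)."""
    return GT_PER_LANE[gen] * (128 / 130) / 8 * lanes

print(round(pcie_gbps(4, 4), 2))  # 7.88 -> theoretical ceiling for 4.0 x4
```

Against that ~7.88 GB/s ceiling, the ~6.6 GB/s unidirectional P2P figure in the 4090 run above is roughly 84% efficiency (protocol overhead, packet headers, etc.), and the ~12.8 GB/s bidirectional figure is close to double it, consistent with the full-duplex behaviour discussed earlier.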
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA RTX A6000, pciBusID: 6, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA A40, pciBusID: 7, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA GeForce RTX 3090, pciBusID: 9, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
D\D 0 1 2
0 1 1 1
1 1 1 1
2 1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2
0 764.81 5.33 3.16
1 5.32 644.86 3.16
2 3.16 3.16 835.11
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1 2
0 766.31 6.60 6.60
1 6.60 646.20 6.60
2 6.60 6.60 836.90
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2
0 771.03 4.91 3.25
1 4.89 648.21 3.25
2 3.25 3.25 839.83
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2
0 770.46 12.87 12.87
1 12.87 647.67 12.87
2 12.87 12.87 839.15
P2P=Disabled Latency Matrix (us)
GPU 0 1 2
0 1.74 13.59 12.71
1 16.37 1.81 12.74
2 16.07 18.26 1.58
CPU 0 1 2
0 1.51 4.61 4.68
1 4.56 1.39 4.54
2 4.70 4.48 1.46
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1 2
0 1.70 1.30 1.33
1 1.33 1.68 1.32
2 1.24 1.23 1.54
CPU 0 1 2
0 1.56 1.22 1.23
1 1.17 1.42 1.16
2 1.23 1.21 1.48
It seems to hover between 5 and 15 W, based on a wall power meter I use. E.g. idle with 7 GPUs went from 260 W to 265-275 W most of the time. I haven't actually checked ASPM; how do I check whether it works there or not?

How much power does the PLX88024 consume? Does it support ASPM?
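For checking ASPM on Linux: `sudo lspci -vv` prints, per device, a LnkCap line (which ASPM states the link advertises) and a LnkCtl line (which states are actually enabled), and `cat /sys/module/pcie_aspm/parameters/policy` shows the kernel's global policy. A small sketch that pulls those two fields out of the `lspci -vv` text for one device (the sample output below is illustrative, not from the card in question):

```python
import re

# Sketch: parse `sudo lspci -vv` output for one device and report which
# ASPM states the link advertises (LnkCap) vs. has enabled (LnkCtl).
def aspm_status(lspci_vv: str) -> dict:
    cap = re.search(r"LnkCap:.*?ASPM ([^,]+)", lspci_vv)
    ctl = re.search(r"LnkCtl:.*?ASPM ([^;]+)", lspci_vv)
    return {
        "supported": cap.group(1).strip() if cap else "not reported",
        "enabled": ctl.group(1).strip() if ctl else "not reported",
    }

# Illustrative sample of the two relevant lines from `lspci -vv`:
sample = """\
	LnkCap:	Port #0, Speed 16GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
	LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
"""
print(aspm_status(sample))  # {'supported': 'L1', 'enabled': 'Disabled'}
```

If LnkCtl says "Disabled" on the switch's upstream port even though LnkCap advertises L1, ASPM isn't actually saving you anything there.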
I get the compactness angle of it, but where SATA is concerned, my understanding is that it's impossible to do better than an IT-mode HBA to hook up spinning rust. For a few one-off disks, see if SATA over USB3 is more stable; sounds like heresy, but anything to help you troubleshoot the PCIe connections.

Has anyone tried an M.2->SATA adapter using the ASM1166 (or JMB585) on one of the PEX8749 cards?
I would dearly like this to work but two PEX87xx cards (one 8747 and one 8749) on two different motherboards (ASRock Z790M-ITX WIFI and MSI Z790I EDGE WIFI) later and it's, shall we say, not going well.
Either the card drops off the bus (though, interestingly, can be software disabled and reset), or I get system freezes and kernel panics.
One common factor (apart from Linux as the OS) is trying to use one of the ASM1166 based M.2 to SATA adapters. Long term I'd like to get rid of SATA/spinning rust but I need it for now.
Google's AI bit thinks this combination is known for instability - I have a couple of JMB585-based cards on order, but wondered if anyone has successfully used the ASM1166 in this situation.
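One quick troubleshooting aid when a card "drops off the bus": capture `lspci` output and check whether the SATA controller is still enumerated behind the switch. A small sketch that filters the SATA controller lines out of `lspci` text (the sample output below is illustrative; run `lspci` yourself and feed in the real text):

```python
import re

# Sketch: pick the SATA controllers out of `lspci` output so you can
# confirm which chip (ASM1166, JMB585, ...) actually enumerated behind
# the PCIe switch, and whether it is still present after a hang.
def sata_controllers(lspci_out: str) -> list:
    return [line for line in lspci_out.splitlines()
            if re.search(r"SATA controller", line, re.IGNORECASE)]

# Illustrative sample output (not from a real machine):
sample = """\
02:00.0 PCI bridge: PLX Technology, Inc. Device 8749
03:04.0 SATA controller: ASMedia Technology Inc. ASM1166 Serial ATA Controller
"""
print(sata_controllers(sample))
```

Diffing this list before and after a freeze (e.g. from another machine over SSH, or from a serial console) tells you whether the adapter itself fell off the bus or the whole switch did.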
It's more that I have 8 SATA drives - 4 SSDs and 4 HDDs - along with 4 M.2 drives (2 boot drives in RAID1, 2 for VM filesystem images, also RAID1) to hook up to one ITX motherboard, which means I need some SATA expansion one way or another, as well as somewhere to plug in 2 more M.2 drives. The PEX8747/PEX8749-based expansion boards looked ideal: the former would let me connect everything I have now, the latter would allow future expansion.

I get the compactness angle of it,
Yes, the 4x SSDs are already connected that way as a temporary measure - as are another couple of M.2s which I want in the system short term (they have a backup of the last OS install in case I need to pull any bits of config/scripts/forgotten utilities etc.).

See if SATA over USB3 is more stable