New Chinese PCIE Switch Board GPU Testing

TrashMaster

Member
Sep 8, 2024
81
61
18
This is a bit of a crosspost, but I thought it was worth sharing my latest experimental hardware acquisition in case some here didn't read the other forums.

About a month ago I ordered a PEX88096 (96-lane PCIe Gen4 switch) board off AliExpress for around 500 bones. It took 2.5-3 weeks to arrive via the usual delivery channels.

These are the vendor's pictures from the listing, which still shows ~90 cards available. I will reply to my own post with my actual product pics and some quick test results.
 

Attachments

TrashMaster

Member
Sep 8, 2024
81
61
18
I got the card screwed down to a tray made out of scrap 2020 aluminum extrusion with standoffs for testing. First I used some junk 10/25G NICs and a disposable NVMe storage card in case anything went "poorly".

The board by itself consumes about 45-50 watts powered on. Using a PCIe retimer card on the motherboard delayed the PCIe switch board's automated power on/off feature (which worked rather nicely after I replaced that impediment with a passive SHINREAL MCIO adapter).

Here is a video of the whole setup connected to a single PCIe slot on the EPYC 7R13 / ASRock ROMED8-2T test bench:


After initial testing confirmed it wasn't going to let the magic smoke escape, I went for 4x 3090s and fired up llama-bench on Llama 3.3 70B Q8 for the lulz.

Code:
root@rome:~/llama.cpp/build/bin# ./llama-bench   -m /root/Llama-3.3-70B-Instruct-Q8-unsloth-00001-of-00002.gguf -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 3: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 70B Q8_0                 |  69.82 GiB |    70.55 B | CUDA       |  99 |           pp512 |        595.64 ± 0.83 |
| llama 70B Q8_0                 |  69.82 GiB |    70.55 B | CUDA       |  99 |           tg128 |         11.25 ± 0.00 |

build: 1faa13a1 (6725)

The NVIDIA topology shows that it is indeed switching without crossing the CPU's root bridge, but the topology is a bit interesting.

Code:
root@rome:~/llama.cpp/build/bin# nvidia-smi
Fri Oct 10 01:53:40 2025      
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:85:00.0 Off |                  N/A |
|  0%   47C    P8             25W /  350W |       1MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  |   00000000:88:00.0 Off |                  N/A |
|  0%   46C    P8             23W /  350W |       1MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090        On  |   00000000:8C:00.0 Off |                  N/A |
| 46%   55C    P8             27W /  350W |       1MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 3090        On  |   00000000:8D:00.0 Off |                  N/A |
| 49%   54C    P8             25W /  350W |       1MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
root@rome:~/llama.cpp/build/bin# nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PXB     PXB     PXB     0-95    0               N/A
GPU1    PXB      X      PXB     PXB     0-95    0               N/A
GPU2    PXB     PXB      X      PIX     0-95    0               N/A
GPU3    PXB     PXB     PIX      X      0-95    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
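If you want to sanity-check the hierarchy independently of the NVIDIA tools, plain lspci shows the same thing; a quick sketch (the bus ID is GPU0's from the output above, adjust to your system):

Code:
# tree view: all four GPUs should hang off the switch's downstream ports,
# not off separate CPU root ports
lspci -tv

# negotiated link speed/width for one GPU (85:00.0 = GPU0 above)
sudo lspci -vvv -s 85:00.0 | grep -E 'LnkCap:|LnkSta:'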
I have the CUDA samples with the P2P bandwidth and latency test, but the 3090s won't do that; I would need some workstation or datacenter cards to finish out the PCIe switch testing.
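(For anyone who wants the same binary: it lives in NVIDIA's cuda-samples repo; a rough build sketch, assuming the CUDA toolkit and CMake are installed:)

Code:
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples && mkdir build && cd build
cmake .. && make -j$(nproc) p2pBandwidthLatencyTest
./Samples/5_Domain_Specific/p2pBandwidthLatencyTest/p2pBandwidthLatencyTest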
 

Attachments

nexox

Well-Known Member
May 3, 2023
1,962
976
113
I can't possibly make any good choices with the answer to this question, but are you equipped to try a passive dual or quad NVMe adapter to see if the switch is configured to support them?
 

TrashMaster

Member
Sep 8, 2024
81
61
18
I can't possibly make any good choices with the answer to this question, but are you equipped to try a passive dual or quad NVMe adapter to see if the switch is configured to support them?
The board is configured by default to operate only in x16 mode, so you can't use a passive M.2 splitter; it doesn't dynamically reconfigure itself to bifurcate any of the slots.

As discussed in the other thread, you can actually reprogram the switch yourself (if you are feeling brave) to handle splitting into x8/x8, x8/x4/x4, or x4/x4/x4/x4... the toolkit from broadscum is over at PCI/PCIe Software Development Kits

Users over at a certain Chinese site were discussing this kind of thing (as well as Gen5 versions of the switch chip and board). However, if you are really looking for a storage switch card, check out the pictures I dropped into Imgur:
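If you ever want to confirm what a slot actually trained at after plugging something in, the negotiated width is easy to read back (bus ID is a placeholder, substitute your device's):

Code:
# LnkCap = what the port advertises, LnkSta = what was actually negotiated
sudo lspci -vv -s <bus:dev.fn> | grep -E 'LnkCap:|LnkSta:'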
 

Attachments

  • Like
Reactions: nexox

calmingaura

New Member
Nov 25, 2025
1
1
1
Thanks for sharing this. This is exactly what I'm looking for, as I've been trying to make an 8-GPU build work. I've placed an order for the 10x SFF-8654 8i card and some SFF-8654 to PCIe x16 physical slot adapters, and plan to run everything at PCIe Gen4 x8. The boards seem cleaner, but I'm not sure they can accommodate my 3-slot cards.
 
  • Like
Reactions: Aluminat

TrashMaster

Member
Sep 8, 2024
81
61
18
Thanks for sharing this. This is exactly what I'm looking for, as I've been trying to make an 8-GPU build work. I've placed an order for the 10x SFF-8654 8i card and some SFF-8654 to PCIe x16 physical slot adapters, and plan to run everything at PCIe Gen4 x8. The boards seem cleaner, but I'm not sure they can accommodate my 3-slot cards.
I just finished upgrading my lower shelf to Gen5 x16 with slightly different risers and cables.
 

Attachments

panchovix

Member
Nov 11, 2025
58
15
8
Hey @TrashMaster, nice work! You're thr3e on the L1 Tech forums?

I wonder, how does P2P work on those 3090s with the CUDA samples (p2pBandwidthLatencyTest, I forgot the exact name)? Do you get the max the lanes offer, or does it provide more?
 

TrashMaster

Member
Sep 8, 2024
81
61
18
In the words of GN: big number go up, line go up good, latency more lower.

Code:
# CUDA_VISIBLE_DEVICES=6,7,8 /usr/share/doc/nvidia-cuda-toolkit/examples/bin/x86_64/linux/release/p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 3090, pciBusID: c1, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 3090, pciBusID: e1, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA GeForce RTX 3090, pciBusID: f1, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2
     0       1     1     1
     1       1     1     1
     2       1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2
     0 830.68  11.52  11.59
     1  11.46 833.78  11.59
     2  11.35  11.41 834.22
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2
     0 833.78  26.40  26.37
     1  26.40 834.67  26.40
     2  26.40  26.40 835.11
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2
     0 838.27  16.92  16.93
     1  16.85 839.15  17.05
     2  17.11  16.99 839.83
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2
     0 839.83  52.20  52.19
     1  52.19 839.97  52.20
     2  52.20  52.16 839.15
P2P=Disabled Latency Matrix (us)
   GPU     0      1      2
     0   1.48  13.28  13.71
     1  13.15   1.56  13.91
     2  12.73  13.82   1.56

   CPU     0      1      2
     0   2.00   5.76   5.31
     1   5.61   1.90   5.39
     2   5.40   5.53   1.80
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2
     0   1.56   1.02   1.01
     1   1.04   1.48   1.04
     2   0.97   0.97   1.58

   CPU     0      1      2
     0   1.91   1.49   1.51
     1   1.59   1.94   1.60
     2   1.47   1.44   1.88
 
  • Love
Reactions: panchovix

panchovix

Member
Nov 11, 2025
58
15
8
Wow so you get an amazing improvement with P2P! Well, I know what I will purchase next then haha
 

panchovix

Member
Nov 11, 2025
58
15
8
I basically want to get some boards to make some bad decisions haha. I assume the 3x 3090 are on the same switch, right?

I.e. I want to test these cases with a switch:

- x4 4.0 into 3 GPUs. With P2P, would bidirectional bandwidth be PCIe 4.0 x1.33 * 3 (aka PCIe 4.0 x4 total), or PCIe 4.0 x4 * 3? I guess the former?
- Same question with x8 and probably x16 (but you're showing x16, and PCIe 4.0 x16 max is 32 GiB/s while you're getting 52 GiB/s, so maybe it is the latter case?)

Your other GPUs are 5090s, right? How do 5090s look with 3 GPUs on the same switch?
 

TrashMaster

Member
Sep 8, 2024
81
61
18
Switching PCIe is vastly superior to bifurcating into 8 or 4 (yuck) lanes. But this all depends on understanding YOUR specific workload, model, inference tools, etc.
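On the 32 vs 52 GB/s question: PCIe is full duplex, so the bidirectional matrices count traffic in both directions at once. Rough numbers for the Gen4 x16 3090 results above:

Code:
PCIe 4.0 x16, one direction: 16 GT/s * 16 lanes * 128/130 / 8 bits ~= 31.5 GB/s
Unidirectional P2P result:   ~26.4 GB/s  (under the ~31.5 GB/s ceiling)
Bidirectional ceiling:       ~63 GB/s    (both directions summed)
Bidirectional P2P result:    ~52.2 GB/s  (~2 x 26.4, so nothing exceeds spec)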

These are all PCIe Gen5 devices, so not using the switch from this thread, but extremely impressive on their own:

Code:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 /usr/share/doc/nvidia-cuda-toolkit/examples/bin/x86_64/linux/release/p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA RTX PRO 6000 Blackwell Workstation Edition, pciBusID: 1, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 5090, pciBusID: 11, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA GeForce RTX 5090, pciBusID: 61, pciDeviceID: 0, pciDomainID:0
Device: 3, NVIDIA GeForce RTX 5090, pciBusID: 71, pciDeviceID: 0, pciDomainID:0
Device: 4, NVIDIA GeForce RTX 5090, pciBusID: 81, pciDeviceID: 0, pciDomainID:0
Device: 5, NVIDIA GeForce RTX 5090, pciBusID: 91, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=0 CAN Access Peer Device=4
Device=0 CAN Access Peer Device=5
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=4
Device=1 CAN Access Peer Device=5
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=4
Device=2 CAN Access Peer Device=5
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
Device=3 CAN Access Peer Device=4
Device=3 CAN Access Peer Device=5
Device=4 CAN Access Peer Device=0
Device=4 CAN Access Peer Device=1
Device=4 CAN Access Peer Device=2
Device=4 CAN Access Peer Device=3
Device=4 CAN Access Peer Device=5
Device=5 CAN Access Peer Device=0
Device=5 CAN Access Peer Device=1
Device=5 CAN Access Peer Device=2
Device=5 CAN Access Peer Device=3
Device=5 CAN Access Peer Device=4

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2     3     4     5
     0       1     1     1     1     1     1
     1       1     1     1     1     1     1
     2       1     1     1     1     1     1
     3       1     1     1     1     1     1
     4       1     1     1     1     1     1
     5       1     1     1     1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5
     0 1496.69  42.63  42.68  42.81  43.21  43.07
     1  42.63 1550.15  42.68  42.66  43.14  43.06
     2  42.69  42.57 1553.23  42.70  43.10  43.13
     3  42.75  42.72  42.66 1553.18  43.00  42.93
     4  42.97  42.85  42.89  42.89 1553.23  43.43
     5  43.01  42.89  42.91  42.95  43.73 1553.23
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2      3      4      5
     0 1493.83  56.57  56.55  56.55  55.85  55.86
     1  56.54 1537.89  56.55  56.57  55.71  55.63
     2  56.58  56.58 1534.87  56.56  55.56  55.85
     3  56.55  56.55  56.54 1543.97  55.83  55.82
     4  55.54  55.59  55.50  55.49 1537.89  56.55
     5  55.60  55.62  55.63  55.63  56.58 1543.97
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5
     0 1483.79  56.50  56.59  56.77  56.92  57.14
     1  56.21 1538.60  56.55  56.54  56.82  56.67
     2  56.27  56.47 1539.36  56.72  56.89  57.12
     3  56.40  56.58  56.21 1540.12  56.99  56.81
     4  56.75  56.81  56.73  56.89 1540.88  56.85
     5  56.71  56.85  57.05  56.87  56.77 1539.36
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5
     0 1483.81 111.33 111.39 111.39 110.88 110.88
     1 111.38 1534.80 111.38 111.38  55.36 110.01
     2 111.38 111.34 1534.07 111.39 110.76 110.90
     3 111.38 111.38 111.34 1538.60 110.80 110.80
     4 110.73 110.86 110.89 110.91 1537.85 111.39
     5 110.92 110.83 110.93 110.91 111.39 1537.07
P2P=Disabled Latency Matrix (us)
   GPU     0      1      2      3      4      5
     0   2.07  14.34  14.30  14.30  14.29  14.29
     1  14.30   2.07  14.32  14.32  14.32  14.32
     2  14.32  14.31   2.07  14.32  14.32  14.32
     3  14.32  14.32  14.34   2.07  14.33  14.33
     4  14.32  14.34  14.31  14.23   2.07  14.33
     5  14.30  14.32  14.30  14.22  14.32   2.07

   CPU     0      1      2      3      4      5
     0   2.35   6.88   6.77   6.41   5.68   5.93
     1   6.65   2.39   7.07   6.95   6.09   6.15
     2   6.70   6.86   2.40   6.62   5.87   6.13
     3   6.43   6.71   6.74   2.29   5.69   5.92
     4   5.90   6.23   6.18   5.89   2.03   5.46
     5   6.12   6.42   6.44   6.15   5.43   2.16
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2      3      4      5
     0   2.07   0.37   0.36   0.43   0.36   0.36
     1   0.46   2.07   0.45   0.38   0.38   0.38
     2   0.39   0.37   2.07   0.37   0.38   0.37
     3   0.37   0.38   0.36   2.07   0.37   0.37
     4   0.38   0.43   0.44   0.37   2.07   0.38
     5   0.38   0.37   0.37   0.44   0.37   2.07

   CPU     0      1      2      3      4      5
     0   2.36   1.69   1.64   1.64   1.65   1.75
     1   1.79   2.45   1.75   1.87   1.89   1.88
     2   1.80   1.73   2.49   1.78   1.78   1.82
     3   1.70   1.65   1.66   2.30   1.67   1.71
     4   1.47   1.50   1.46   1.45   2.07   1.46
     5   1.59   1.54   1.54   1.52   1.53   2.15
If you are doing tensor parallel inference, it can be extremely bandwidth heavy. Going peer to peer effectively cuts latency and can improve card-to-card bandwidth. It took a number of BIOS changes and GRUB options to get this working, in addition to the driver changes I mentioned in That Other Place.
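(The exact BIOS options and kernel parameters are platform specific, so treat this as a sketch only: on the firmware side the usual suspects are Above 4G Decoding and Resizable BAR, and on the kernel side something like putting the IOMMU in passthrough so it doesn't remap the peer-to-peer DMA:)

Code:
# /etc/default/grub -- example only; which flags you need depends on the platform
GRUB_CMDLINE_LINUX_DEFAULT="iommu=pt"

# regenerate the bootloader config and reboot
sudo update-grub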
 
  • Like
Reactions: panchovix

panchovix

Member
Nov 11, 2025
58
15
8
Okay, that's amazing! In theory x4 4.0 for 4 slots would be pretty decent, so gonna get some x8-to-4-slot ones. I haven't found a 4.0 x16 to 8 or 4 slots on AliExpress :(

For 5.0 I checked some prices and I think I can't afford it lol but it would help a lot.
 

panchovix

Member
Nov 11, 2025
58
15
8
Wondering how you connect that board. You would need to get a PCIe x16 to 2x SlimSAS (or 4) adapter and then connect those to that board?
 

TrashMaster

Member
Sep 8, 2024
81
61
18
Wondering how you connect that board. You would need to get a PCIe x16 to 2x SlimSAS (or 4) adapter and then connect those to that board?
What board do you have? It depends on your board. Some have MCIO or SlimSAS connectors directly on them. Others need an adapter card to add those connectors, or a redriver/retimer card if the signal quality is too low / the cables and traces are too long.

If you buy cards and risers and stuff, check out posts on Level1Techs for tons of pictures and product recommendations. For all new purchases I highly suggest standardizing on MCIO x8 for future Gen5 compatibility, and using cables to either split or convert those to whatever other ends you need.

My video up in the second-ish post above has one such example, with MCIO to SlimSAS and a SHINREAL adapter from PCIe x16 to 2x MCIO 8i.
 
  • Like
Reactions: nexox and panchovix

panchovix

Member
Nov 11, 2025
58
15
8
I have normal consumer motherboards (Gigabyte Aorus Master X670E and MSI Carbon X670E). They have a PCIe x16 5.0 slot, but no SlimSAS or MCIO.

Will check out the video!
 

nexox

Well-Known Member
May 3, 2023
1,962
976
113
Note that just because the adapter has two cables does not mean you need bifurcation: if they both end up at the switch in order, they make one x16 connection, just with two bundles of wires (and maybe the switch can re-order the lanes if the cables are backwards; not sure, but I think it's technically possible).
 
  • Like
Reactions: panchovix

panchovix

Member
Nov 11, 2025
58
15
8
Okay, at the end I got 3 PLX88024-based cards! This one: https://es.aliexpress.com/item/1005010049715182.html?spm=a2g0o.order_list.order_list_main.84.1de218024DjeC9&gatewayAdapt=glo2esp. But only one has arrived so far.

And I did something crazy. I connected an M.2-to-PCIe adapter from CPU lanes, then connected the PLX88024 card there, and then 4 more M.2-to-PCIe adapters.

Then I connected 4 GPUs into it lol, 2x F43SP and 2x F43SG from ADT-Link.

I'm using the modded P2P driver for reference: GitHub - aikitoria/open-gpu-kernel-modules: NVIDIA Linux open GPU with P2P support
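(For anyone following along, building that fork is roughly the standard open-gpu-kernel-modules procedure; a sketch, assuming the matching userspace driver is already installed and you check out the branch matching your driver version:)

Code:
git clone https://github.com/aikitoria/open-gpu-kernel-modules.git
cd open-gpu-kernel-modules
make modules -j$(nproc)
sudo make modules_install -j$(nproc)
# reboot, then check the P2P read capability matrix:
nvidia-smi topo -p2p r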

So I have my setup like this:

AM5 Gigabyte X670E Aorus Master

RTX 5090 x2: each using x8/x8 5.0 from the main x16 slot with a C-Payne bifurcator.

With cuda-samples' p2pBandwidthLatencyTest, I get this:

Code:
pancho@fedora:~/cuda-samples/build/Samples/5_Domain_Specific/p2pBandwidthLatencyTest$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 5090, pciBusID: 1, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 5090, pciBusID: c, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0       1     1
     1       1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 1755.62  24.86
     1  24.89 1565.63
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1
     0 1743.80  28.67
     1  28.67 1547.03
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 1761.46  30.34
     1  30.31 1541.64
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 1761.49  56.25
     1  56.26 1541.62
P2P=Disabled Latency Matrix (us)
   GPU     0      1
     0   2.07  14.19
     1  14.17   2.07

   CPU     0      1
     0   1.56   4.14
     1   4.00   1.53
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1
     0   2.07   0.43
     1   0.36   2.07

   CPU     0      1
     0   1.55   1.06
     1   1.07   1.53
2x 4090: one at PCIe x4 4.0 directly from the CPU with an M.2-to-PCIe adapter, and one at PCIe x4 4.0(?) from one slot of the PLX board.

I get these results:

Code:
pancho@fedora:~/cuda-samples/build/Samples/5_Domain_Specific/p2pBandwidthLatencyTest$ ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 4090, pciBusID: 8, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 4090, pciBusID: 1e, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0       1     1
     1       1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 917.50   6.32
     1   6.29 927.30
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1
     0 919.66   6.58
     1   6.58 946.06
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 922.10   8.65
     1   8.63 926.72
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 921.29  12.78
     1  12.76 924.56
P2P=Disabled Latency Matrix (us)
   GPU     0      1
     0   1.31  11.64
     1  14.87   1.28

   CPU     0      1
     0   1.48   4.74
     1   4.74   1.47
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1
     0   1.31   1.09
     1   0.91   1.27

   CPU     0      1
     0   1.51   1.22
     1   1.21   1.48
Then, the craziest one: into the other 3 M.2 slots I connected an RTX A6000, an RTX 3090, and an NVIDIA A40.

And the result is this one:

Code:
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA RTX A6000, pciBusID: 6, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA A40, pciBusID: 7, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA GeForce RTX 3090, pciBusID: 9, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2
     0       1     1     1
     1       1     1     1
     2       1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2
     0 764.81   5.33   3.16
     1   5.32 644.86   3.16
     2   3.16   3.16 835.11
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2
     0 766.31   6.60   6.60
     1   6.60 646.20   6.60
     2   6.60   6.60 836.90
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2
     0 771.03   4.91   3.25
     1   4.89 648.21   3.25
     2   3.25   3.25 839.83
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2
     0 770.46  12.87  12.87
     1  12.87 647.67  12.87
     2  12.87  12.87 839.15
P2P=Disabled Latency Matrix (us)
   GPU     0      1      2
     0   1.74  13.59  12.71
     1  16.37   1.81  12.74
     2  16.07  18.26   1.58

   CPU     0      1      2
     0   1.51   4.61   4.68
     1   4.56   1.39   4.54
     2   4.70   4.48   1.46
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2
     0   1.70   1.30   1.33
     1   1.33   1.68   1.32
     2   1.24   1.23   1.54

   CPU     0      1      2
     0   1.56   1.22   1.23
     1   1.17   1.42   1.16
     2   1.23   1.21   1.48
And I tested LLMs and such, and it works! I can't believe it. Many thanks @nexox, @TrashMaster and @thigobr for your help!
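(For anyone repeating this, a quick way to confirm every card participates is forcing a layer split in llama.cpp; the model path is a placeholder:)

Code:
# spread layers across all visible CUDA devices; watch nvidia-smi while it runs
./llama-cli -m /path/to/model.gguf -ngl 99 --split-mode layer -p "test" -n 32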
 
  • Like
Reactions: TrashMaster

TrashMaster

Member
Sep 8, 2024
81
61
18
Congrats on the upgrade, and welcome to the council of mad scientists.

Want to go really, really unhinged? You can actually install Ubuntu 24.04 and upgrade it to ngreedia DGX OS 7 using steps like this: Customizing Ubuntu Installation with DGX Software — NVIDIA DGX OS 7 User Guide (highly recommend some 100G or faster Mellanox cards for this). DGX-at-home is my new experiment of the month, and getting all their crazy features and tools working is bound to be a wild ride.
 
  • Like
  • Wow
Reactions: panchovix and nexox