Slow Speeds in InfiniBand Mode for ConnectX-3 40GbE/56Gbps


erock

Member
Jul 19, 2023
I have two Mellanox ConnectX-3 cards connected directly with Mellanox MC2207130 (56Gb/40Gb FDR) DACs (no switch). Speed testing on Ubuntu 22.04 in 40Gb Ethernet mode with parallel iperf yields the expected results, around 32-38 Gbit/s using 8 parallel processes.
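For context, the Ethernet-mode number above came from a run along these lines (a sketch only; I am showing iperf3 with eight parallel streams here, and the exact flags I used may have differed):

iperf3 -s                          # server side (10.16.16.50)
iperf3 -c 10.16.16.50 -P 8 -t 10   # client side (10.16.16.51), 8 parallel streams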

However, when testing on Ubuntu in InfiniBand mode with RDMA, I am getting results that I do not understand.

Speed tests with iperf3 show a maximum bandwidth of less than 10 Gbit/s even if I run multiple servers in parallel. However, a test with ib_send_bw reports an average bandwidth of about 47 Gbit/s. Is the low iperf3 speed possibly caused by my configuration? Is this the expected speed for IPoIB? What am I missing?

Note that when I run ibdiagnet (below) I get a "Suboptimal rate for group" warning in the IPoIB Subnets Check. Also, the multiple-server iperf3 tests look worse than the single-server test.
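The multiple-server runs shown further down were set up roughly like this, with one iperf3 listener per port (a sketch; the exact options may have differed):

# server side (10.16.16.50): one iperf3 instance per port
for p in 5101 5102 5103; do iperf3 -s -p "$p" & done

# client side (10.16.16.51): one client per server port
for p in 5101 5102 5103; do iperf3 -c 10.16.16.50 -p "$p" -t 10 & done
wait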

Similar results for Arch Linux were shown here: InfiniBand - ArchWiki.

Any guidance and insights that may point me in the right direction would be appreciated (I am new to networking).

See below for test and system information:

ib_send_bw test:
ib_send_bw -s 65535 -i 1 -F --report_gbits

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
Send BW Test
Dual-port : OFF Device : ibp68s0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : OFF
RX depth : 512
CQ Moderation : 1
Mtu : 2048
Link type : IB
Max inline data : 0
rdma_cm QPs : OFF
Data ex. method : Ethernet
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
65535 1000 0.00 46.50 0.088700
---------------------------------------------------------------------------------------

iperf3 tests using a single server:

[ 5] local 10.16.16.50 port 5101 connected to 10.16.16.51 port 56104
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 283 MBytes 2.37 Gbits/sec
[ 5] 1.00-2.00 sec 299 MBytes 2.51 Gbits/sec
[ 5] 2.00-3.00 sec 293 MBytes 2.45 Gbits/sec
[ 5] 3.00-4.00 sec 290 MBytes 2.43 Gbits/sec
[ 5] 4.00-5.00 sec 293 MBytes 2.46 Gbits/sec
[ 5] 5.00-6.00 sec 292 MBytes 2.45 Gbits/sec
[ 5] 6.00-7.00 sec 287 MBytes 2.41 Gbits/sec
[ 5] 7.00-8.00 sec 291 MBytes 2.44 Gbits/sec
[ 5] 8.00-9.00 sec 294 MBytes 2.47 Gbits/sec
[ 5] 9.00-10.00 sec 453 MBytes 3.80 Gbits/sec
[ 5] 10.00-10.04 sec 25.1 MBytes 4.98 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate
[ 5] 0.00-10.04 sec 3.03 GBytes 2.59 Gbits/sec receiver

iperf3 test for 3 servers:
Accepted connection from 10.16.16.51, port 37608
[ 5] local 10.16.16.50 port 5101 connected to 10.16.16.51 port 37610
Accepted connection from 10.16.16.51, port 47732
[ 5] local 10.16.16.50 port 5102 connected to 10.16.16.51 port 47748
Accepted connection from 10.16.16.51, port 35500
[ 5] local 10.16.16.50 port 5103 connected to 10.16.16.51 port 35514
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 208 MBytes 1.75 Gbits/sec
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 108 MBytes 907 Mbits/sec
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 89.5 MBytes 750 Mbits/sec
[ 5] 1.00-2.00 sec 107 MBytes 894 Mbits/sec
[ 5] 1.00-2.00 sec 91.7 MBytes 769 Mbits/sec
[ 5] 1.00-2.00 sec 94.6 MBytes 794 Mbits/sec
[ 5] 2.00-3.00 sec 96.3 MBytes 808 Mbits/sec
[ 5] 2.00-3.00 sec 109 MBytes 918 Mbits/sec
[ 5] 2.00-3.00 sec 110 MBytes 926 Mbits/sec
[ 5] 3.00-4.00 sec 89.5 MBytes 751 Mbits/sec
[ 5] 3.00-4.00 sec 100 MBytes 840 Mbits/sec
[ 5] 3.00-4.00 sec 110 MBytes 924 Mbits/sec
[ 5] 4.00-5.00 sec 92.6 MBytes 777 Mbits/sec
[ 5] 4.00-5.00 sec 89.6 MBytes 751 Mbits/sec
[ 5] 4.00-5.00 sec 116 MBytes 977 Mbits/sec
[ 5] 5.00-6.00 sec 106 MBytes 890 Mbits/sec
[ 5] 5.00-6.00 sec 94.0 MBytes 789 Mbits/sec
[ 5] 5.00-6.00 sec 123 MBytes 1.03 Gbits/sec
[ 5] 6.00-7.00 sec 120 MBytes 1.01 Gbits/sec
[ 5] 6.00-7.00 sec 112 MBytes 938 Mbits/sec
[ 5] 6.00-7.00 sec 114 MBytes 956 Mbits/sec
[ 5] 7.00-8.00 sec 128 MBytes 1.07 Gbits/sec
[ 5] 7.00-8.00 sec 113 MBytes 948 Mbits/sec
[ 5] 7.00-8.00 sec 122 MBytes 1.02 Gbits/sec
[ 5] 8.00-9.00 sec 121 MBytes 1.01 Gbits/sec
[ 5] 8.00-9.00 sec 109 MBytes 916 Mbits/sec
[ 5] 8.00-9.00 sec 118 MBytes 993 Mbits/sec
[ 5] 9.00-10.00 sec 117 MBytes 978 Mbits/sec
[ 5] 10.00-10.04 sec 4.74 MBytes 949 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate
[ 5] 0.00-10.04 sec 1.16 GBytes 993 Mbits/sec receiver
-----------------------------------------------------------
Server listening on 5101
-----------------------------------------------------------
[ 5] 9.00-10.00 sec 141 MBytes 1.18 Gbits/sec
[ 5] 10.00-10.04 sec 7.36 MBytes 1.54 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate
[ 5] 0.00-10.04 sec 1.05 GBytes 898 Mbits/sec receiver
-----------------------------------------------------------
Server listening on 5102
-----------------------------------------------------------
[ 5] 9.00-10.00 sec 239 MBytes 2.01 Gbits/sec
[ 5] 10.00-10.04 sec 13.5 MBytes 2.52 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate
[ 5] 0.00-10.04 sec 1.22 GBytes 1.04 Gbits/sec receiver
-----------------------------------------------------------
Server listening on 5103
-----------------------------------------------------------

# sudo mstconfig -d 40:00.0 q

Device #1:
----------

Device type: ConnectX3
Device: 44:00.0

Configurations: Next Boot
SRIOV_EN False(0)
NUM_OF_VFS 8
LINK_TYPE_P1 IB(1)
LINK_TYPE_P2 IB(1)
LOG_BAR_SIZE 3
BOOT_PKEY_P1 0
BOOT_PKEY_P2 0
BOOT_OPTION_ROM_EN_P1 True(1)
BOOT_VLAN_EN_P1 False(0)
BOOT_RETRY_CNT_P1 0
LEGACY_BOOT_PROTOCOL_P1 PXE(1)
BOOT_VLAN_P1 1
BOOT_OPTION_ROM_EN_P2 True(1)
BOOT_VLAN_EN_P2 False(0)
BOOT_RETRY_CNT_P2 0
LEGACY_BOOT_PROTOCOL_P2 PXE(1)
BOOT_VLAN_P2 1
IP_VER_P1 IPv4(0)
IP_VER_P2 IPv4(0)
CQ_TIMESTAMP True(1)

# ibstatus
Infiniband device 'ibp68s0' port 1 status:
default gid:
base lid: 0x1
sm lid: 0x1
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 56 Gb/sec (4X FDR)
link_layer: InfiniBand

Infiniband device 'ibp68s0' port 2 status:
default gid:
base lid: 0x0
sm lid: 0x0
state: 1: DOWN
phys state: 2: Polling
rate: 10 Gb/sec (4X SDR)
link_layer: InfiniBand

# ibdiagnet
Loading IBDIAGNET from: /usr/lib/x86_64-linux-gnu/ibdiagnet1.5.7
-W- Topology file is not specified.
Reports regarding cluster links will use direct routes.
Loading IBDM from: /usr/lib/x86_64-linux-gnu/ibdm1.5.7
-I- Using port 1 as the local port.
-I- Discovering ... 2 nodes (0 Switches & 2 CA-s) discovered.


-I---------------------------------------------------
-I- Bad Guids/LIDs Info
-I---------------------------------------------------
-I- No bad Guids were found

-I---------------------------------------------------
-I- Links With Logical State = INIT
-I---------------------------------------------------
-I- No bad Links (with logical state = INIT) were found

-I---------------------------------------------------
-I- General Device Info
-I---------------------------------------------------

-I---------------------------------------------------
-I- PM Counters Info
-I---------------------------------------------------
-I- No illegal PM counters values were found

-I---------------------------------------------------
-I- Fabric Partitions Report (see ibdiagnet.pkey for a full hosts list)
-I---------------------------------------------------
-I- PKey:0x7fff Hosts:2 full:2 limited:0

-I---------------------------------------------------
-I- IPoIB Subnets Check
-I---------------------------------------------------
-I- Subnet: IPv4 PKey: MTU:2048Byte rate:10Gbps SL:0x00
-W- Suboptimal rate for group. Lowest member rate:40Gbps > group-rate:10Gbps

-I---------------------------------------------------
-I- Bad Links Info
-I- No bad link were found
-I---------------------------------------------------
----------------------------------------------------------------
-I- Stages Status Report:
STAGE Errors Warnings
Bad GUIDs/LIDs Check 0 0
Link State Active Check 0 0
General Devices Info Report 0 0
Performance Counters Report 0 0
Partitions Check 0 0
IPoIB Subnets Check 0 1

Please see /var/cache/ibutils/ibdiagnet.log for complete log
----------------------------------------------------------------

-I- Done. Run time was 0 seconds.
 

nexox

Well-Known Member
May 3, 2023
I haven't played with it in several years, but I never saw more than about 2 Gbit/s using IPoIB on 40G ConnectX-3. The consensus at the time was that it was just slow and there was nothing to be done about it.
 

erock

Member
Jul 19, 2023
IPoIB has always come with lots of caveats; it is strongly CPU-dependent. Those cards can do RoCE v1; would that suffice for your needs?
I tried to get RoCE working in Ethernet mode on these old cards but was not successful. It appears that the drivers required to make this work on ConnectX-3 are only available for Ubuntu 20.04 and a particular kernel that is not well documented. After hitting a wall on RoCE, and after reading several threads that concluded it was not possible due to the lack of support after end of life, I pursued RDMA with IB instead. Please let me know if you have any insights on how to make RoCE v1 work with Ubuntu 22.04.
 

erock

Member
Jul 19, 2023
IPoIB has always come with lots of caveats; it is strongly CPU-dependent. Those cards can do RoCE v1; would that suffice for your needs?
I was able to get RoCE to work in Ubuntu 22.04 and Pop!_OS on my 40GbE/56Gb ConnectX-3 cards. The key was to not use the LTS Mellanox Linux drivers. Here are the steps I used (a consolidated command sketch follows the list):

  1. Get device ID of ConnectX-3 card
    1. sudo lspci | grep Mellanox (to get the device ID, e.g. 40:00.0)
  2. Install Mellanox Driver Tool and set link type to Ethernet
    1. sudo apt install mstflint
    2. sudo mstconfig -d <device ID> s LINK_TYPE_P1=ETH
    3. sudo mstconfig -d <device ID> s LINK_TYPE_P2=ETH
  3. Reboot
  4. Set IP, netmask etc.. using NetworkManager (nmtui)
  5. Install Infiniband, rdma and testing software
    1. sudo apt install rdma-core opensm ibutils infiniband-diags
    2. sudo apt install ibverbs-utils rdmacm-utils perftest
    3. sudo apt install udaddy
  6. Start opensm via sudo opensm
  7. Reboot
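Consolidated, the steps above look roughly like this (a sketch only; 40:00.0 is the device ID from my system, substitute your own):

# 1. find the ConnectX-3 device ID
sudo lspci | grep Mellanox

# 2.-3. switch both ports from IB to Ethernet, then reboot
sudo apt install mstflint
sudo mstconfig -d 40:00.0 s LINK_TYPE_P1=ETH LINK_TYPE_P2=ETH
sudo reboot

# 4.-5. after setting IPs with nmtui, install the RDMA stack and test tools
sudo apt install rdma-core opensm ibutils infiniband-diags \
                 ibverbs-utils rdmacm-utils perftest

# 6.-7. start the subnet manager (strictly only needed for an InfiniBand link layer), then reboot
sudo opensm &
sudo reboot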
After these steps I ran ibv_devinfo and got the output shown below, which shows a device name that starts with 'roce', a transport of InfiniBand, and a link layer of Ethernet (all of which I believe indicate RoCE). I also ran tests using udaddy, rdma_server, ib_send_bw and ib_send_lat, all of which indicate RoCE is working. I was not able to get the cma_roce_mode tool to work, but I did experiment with the RoCE v1 vs v2 configuration step from a fantastic InfiniBand/RDMA chapter from Red Hat located here. That Red Hat resource had most of what I needed to get this working on Ubuntu/Pop!_OS. Another good resource is here, but some of its steps need to be modified for your specific setup. The one Mellanox resource I found useful is here, which shows helpful RDMA testing workflows.
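For anyone repeating this, the bandwidth/latency verification was along these lines (a sketch; rocep68s0 and the 10.16.16.x addresses are from my setup, and you may need to add -x <gid_index> if the default GID is not the one you want):

# bandwidth: run the first command on the server (10.16.16.50), the second on the client
ib_send_bw -d rocep68s0 -i 1 -F --report_gbits
ib_send_bw -d rocep68s0 -i 1 -F --report_gbits 10.16.16.50

# latency: same pattern
ib_send_lat -d rocep68s0 -i 1 -F
ib_send_lat -d rocep68s0 -i 1 -F 10.16.16.50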

My next step is to get MPI to work with RoCE. I am not sure how hard this will be since I can't rely on the tuned version of OpenMPI (HPC-X) that typically comes with the Mellanox drivers. Any tips?
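My current plan is to start with the distro OpenMPI and steer it onto the RoCE device through UCX, roughly like this (an untested sketch; it assumes Ubuntu's openmpi-bin is built with UCX support, and node1/node2 and osu_bw are placeholders for my hosts and benchmark binary):

sudo apt install openmpi-bin libopenmpi-dev

# force the UCX point-to-point layer and pin it to the RoCE device/port
mpirun -np 2 -H node1,node2 \
       --mca pml ucx \
       -x UCX_NET_DEVICES=rocep68s0:1 \
       ./osu_bw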

# ibv_devinfo
hca_id: rocep68s0
transport: InfiniBand (0)
fw_ver: 2.42.5000
node_guid: xxxxx
sys_image_guid: xxxxx
vendor_id: 0x02c9
vendor_part_id: 4099
hw_ver: 0x1
board_id: MT_1090120019
phys_port_cnt: 2
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet

port: 2
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet