Need help setting up ConnectX-3 on Ubuntu


heavyarms2112

Member
Feb 11, 2022
Hello everyone,

I've looked into the previous threads about updating the firmware and converting an MCX354A-QCBT into an MCX354A-FCBT.
I have 2x MCX354A-FCBT cards and a NetApp X6558-R6 cable that I'm using to pair two machines.

I want to understand my hardware limitations and what maximum transfer speeds I should expect between these two machines.
My aim is a 40G link, or 56G if feasible.

I grabbed the latest drivers for Ubuntu 20.04 and installed them with the --skip-devices-check flag.

Code:
root@testub:~# mlxconfig -d /dev/mst/mt4099_pci_cr0 query

Device #1:
----------

Device type:    ConnectX3
Device:         /dev/mst/mt4099_pci_cr0

Configurations:                              Next Boot
         SRIOV_EN                            True(1)
         NUM_OF_VFS                          16
         LINK_TYPE_P1                        VPI(3)
         LINK_TYPE_P2                        VPI(3)
         LOG_BAR_SIZE                        3
         BOOT_PKEY_P1                        0
         BOOT_PKEY_P2                        0
         BOOT_OPTION_ROM_EN_P1               True(1)
         BOOT_VLAN_EN_P1                     False(0)
         BOOT_RETRY_CNT_P1                   0
         LEGACY_BOOT_PROTOCOL_P1             PXE(1)
         BOOT_VLAN_P1                        1
         BOOT_OPTION_ROM_EN_P2               True(1)
         BOOT_VLAN_EN_P2                     False(0)
         BOOT_RETRY_CNT_P2                   0
         LEGACY_BOOT_PROTOCOL_P2             PXE(1)
         BOOT_VLAN_P2                        1
         IP_VER_P1                           IPv4(0)
         IP_VER_P2                           IPv4(0)
         CQ_TIMESTAMP                        True(1)
 
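For reference, roughly the sequence I used to get to that query (assuming the MLNX_OFED bundle for Ubuntu 20.04 with its MFT tools; the device path comes from mst status and may differ on your system):
Code:
./mlnxofedinstall --skip-devices-check          # the flag mentioned above, to get past the device check
mst start                                       # creates the /dev/mst/* device nodes
mst status                                      # shows the device path (mt4099_pci_cr0 here)
mlxconfig -d /dev/mst/mt4099_pci_cr0 query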

prdtabim

Active Member
Jan 29, 2022
Looks like the driver installations are limited. Not sure which one to go for on Ubuntu 20.04.
From the screenshot I think the firmware of the card is outdated.
I use a point-to-point configuration between two ConnectX-3 Pro cards with an AOC cable between them, achieving 40Gb/s in Ethernet mode.
The first change I suggest is to set LINK_TYPE_P1 and LINK_TYPE_P2 to 2 (Ethernet), unless you want to use InfiniBand.
Code:
mlxconfig -d /dev/mst/mt4099_pci_cr0 set LINK_TYPE_P1=2 LINK_TYPE_P2=2
This fixes the ports in Ethernet mode and avoids the negotiation errors you can get with VPI ...

Look at:
Code:
ethtool <device name>       # parameters of the NIC
ethtool -i <device name>    # driver version
ethtool -m <device name>    # GBIC/transceiver data, if present
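Note that the query in the first post lists "Next Boot" values, so the new link type only takes effect after a reboot. A rough way to confirm it afterwards (device name as reported by ip link):
Code:
reboot
# after the reboot:
ethtool <device name> | grep -i speed    # should now report an Ethernet link speed
ethtool -i <device name>                 # the mlx4_en driver should be bound to the port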
 

heavyarms2112

Member
Feb 11, 2022
Thanks a lot. Looks like I do have a link now.
So VPI auto mode doesn't work? Are driver version 4.0 and firmware 2.42.5000 quite old?

Code:
root@testub:~# ethtool -i  enp3s0
driver: mlx4_en
version: 4.0-0
firmware-version: 2.42.5000
expansion-rom-version:
bus-info: 0000:03:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes
 

prdtabim

Active Member
Jan 29, 2022
Not very old; my cards have firmware version 2.42.7000.
Other users have already noticed that VPI between two cards tends to fail when negotiating between InfiniBand and Ethernet. Fixing the ports in the desired mode is best.
The maximum rate between the cards depends on the cable used (56Gb/s only with Mellanox cards and a cable that supports it) and on the cards' firmware; some are artificially limited to 10Gb/s in Ethernet mode.
Another detail is that 56Gb/s isn't offered through autonegotiation (at least on my cards), so you would probably need to set the speed on both sides using ethtool -s <device name> speed 56000 (assuming you have a compatible cable).
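Something like this on each host, purely as a sketch (the device name is the one from your ethtool output; forcing autoneg off is my assumption, since 56G isn't autonegotiated here, and it needs a 56Gb/s-capable Mellanox cable):
Code:
ethtool -s enp3s0 speed 56000 autoneg off    # run on both ends, each with its own device name
ethtool enp3s0 | grep -i speed               # check what the link actually came up at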
 

heavyarms2112

Member
Feb 11, 2022
Thanks again. Is it possible to use both links at the same time? Not sure if link aggregation is a mess to set up.
 

prdtabim

Active Member
Jan 29, 2022
Both links can be used at the same time.
Just a few notes:
- Depending on the card/firmware, the links max out at 40 and 10Gb/s. In my case it supports 40 and 40Gb/s.
- PCIe bottleneck: the card is usually PCIe 3.0 x8, roughly 63Gb/s of usable bandwidth (8 GT/s per lane with 128b/130b encoding), so two 40Gb/s links will be limited in aggregate by the PCIe bus.
- Bonding in Linux is easy. Use mode 0 (round robin) or mode 4 (802.3ad LACP), and use the layer3+4 hash to get the most out of the bond.
Code:
# example of bonding - /etc/network/interfaces

iface enp50s0f0 inet manual
iface enp50s0f1 inet manual

auto bond0
iface bond0 inet static
       address 192.168.200.1
       netmask 255.255.255.0

       bond-slaves enp50s0f0 enp50s0f1
       bond_mode 4
       bond-miimon 100
       bond_downdelay 500
       bond_updelay 2000
       bond_xmit_hash_policy layer3+4
       up ip link set enp50s0f0 master bond0
       up ip link set enp50s0f1 master bond0
This example creates an interface bond0 from the two physical devices enp50s0f0 and enp50s0f1 in mode 4 (802.3ad LACP) with the layer3+4 hash policy. Using layer3+4 hashes the aggregation on IP/port rather than MAC/IP as layer2+3 does, so SMB3 can use the bond at full speed even between just two hosts; you only have to establish more than one connection.
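Once the bond is up, a quick way to check the mode and the state of the member links (standard paths for the Linux bonding driver; bond0 and the interface names are the ones from the example above):
Code:
cat /proc/net/bonding/bond0      # bonding mode, LACP partner info and per-slave state
ip -br link show master bond0    # brief view of the interfaces enslaved to bond0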

Edit. Another note: LACP only likes links with the same speed.
 

heavyarms2112

Member
Feb 11, 2022
Thanks again for the instructions.

Code:
Connecting to host 192.168.200.2, port 5201
[  5] local 192.168.200.1 port 49254 connected to 192.168.200.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  2.20 GBytes  18.9 Gbits/sec    0   2.59 MBytes
[  5]   1.00-2.00   sec  2.21 GBytes  18.9 Gbits/sec    0   2.76 MBytes
[  5]   2.00-3.00   sec  2.21 GBytes  19.0 Gbits/sec    0   2.76 MBytes
[  5]   3.00-4.00   sec  2.19 GBytes  18.8 Gbits/sec    0   2.76 MBytes
[  5]   4.00-5.00   sec  2.21 GBytes  18.9 Gbits/sec    0   2.76 MBytes
[  5]   5.00-6.00   sec  2.15 GBytes  18.4 Gbits/sec    0   2.76 MBytes
[  5]   6.00-7.00   sec  2.20 GBytes  18.9 Gbits/sec    0   2.76 MBytes
[  5]   7.00-8.00   sec  2.20 GBytes  18.9 Gbits/sec    0   2.92 MBytes
[  5]   8.00-9.00   sec  2.20 GBytes  18.9 Gbits/sec    0   3.13 MBytes
[  5]   9.00-10.00  sec  2.21 GBytes  19.0 Gbits/sec    0   3.13 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  22.0 GBytes  18.9 Gbits/sec    0             sender
[  5]   0.00-10.00  sec  22.0 GBytes  18.9 Gbits/sec                  receiver

iperf Done.
So now I have two bonded interfaces: bond0 (192.168.200.1) on the client and bond1 (192.168.200.2) on the server. Shouldn't the iperf results be higher than what I see on a single link?
Code:
Settings for bond0:
        Supported ports: [ ]
        Supported link modes:   Not reported
        Supported pause frame use: No
        Supports auto-negotiation: No
        Supported FEC modes: Not reported
        Advertised link modes:  Not reported
        Advertised pause frame use: No
        Advertised auto-negotiation: No
        Advertised FEC modes: Not reported
        Speed: 80000Mb/s
        Duplex: Full
        Port: Other
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: off
        Link detected: yes
 

prdtabim

Active Member
Jan 29, 2022
Hmmm.
bond0 shows 80Gb/s, OK.
A single iperf3 run is only one connection, so it should be near 40Gb/s. Does running the iperf3 client with "-P 2" increase the rate? (-P n runs n parallel transfers.)
Run "lspci -vv | more" and look for "LnkCap:" and "LnkSta:" for the Mellanox device. The expected values are "Speed 8GT/s (ok), Width x8 (ok)" for both; we are looking for bottlenecks on the PCIe bus.
Is the CPU usage very high during the test?

Edit: run the iperf3 client with -R to test the transfer in the reverse direction.

Edit 2: did you set the MTU for bond0 and the member interfaces to 9000 (jumbo frames)?
Did you set the txqueuelen to a value like 10000 or greater?
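If you want to try those, something along these lines on both hosts (a sketch only; interface names are the ones from my bonding example and will be different on your machines):
Code:
ip link set enp50s0f0 mtu 9000
ip link set enp50s0f1 mtu 9000
ip link set bond0 mtu 9000
ip link set bond0 txqueuelen 10000
iperf3 -c 192.168.200.2 -P 2    # two parallel streams; add -R to test the reverse direction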
 

heavyarms2112

Member
Feb 11, 2022
It's the same performance in both directions.
Code:
                LnkCap: Port #8, Speed 8GT/s, Width x8, ASPM L0s, Exit Latency L0s unlimited
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s (ok), Width x4 (downgraded)
Looks like I'm running at x4 instead of x8. I need to check the BIOS and see whether the slot supports x8.
I haven't set anything on the interfaces; everything is at the defaults.
EDIT: Thanks for the tips. False alarm on the speeds; it was probably because I had an rsync running in the background.
I still need to sort out the PCIe link width.
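For reference, the exact check I ran (bus address 0000:03:00.0 from the earlier ethtool -i output):
Code:
lspci -vv -s 03:00.0 | grep -E 'LnkCap|LnkSta'    # compare capable vs. negotiated PCIe width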
 

heavyarms2112

Member
Feb 11, 2022
Okay, so the slot is wired x4, so I can't do much about it except swap the card into the x16 slot. But even then I should be getting 40G on the bonded connection, and using the nload utility I see activity on only one link.
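To watch both physical links at once while a transfer runs, nload can take several device names (placeholders here; substitute the two port names from ip link):
Code:
nload -m <port1> <port2>    # -m shows multiple devices at the same time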
 

heavyarms2112

Member
Feb 11, 2022
Will mode 4 bonding work without a switch?

EDIT: I got 40G when I switched to x8 lanes. Still need to figure out why bonding isn't working.
 

prdtabim

Active Member
Jan 29, 2022
Good progress ...

Mode 4 is the only one compatible with managed switches (802.3ad); as far as I know, the other modes are Linux-only.
Look here: 7.7. Using Channel Bonding Red Hat Enterprise Linux 7 | Red Hat Customer Portal

The limitation of bonding with the layer3+4 hash is that a SINGLE connection will use only ONE of the interfaces to carry data.
Once you add another connection, it will use the other interface. That's the reason to test the iperf client with "-P 2" or higher, since it opens at least two connections on different ports.
From the moment the transfers use both interfaces, you can expect 60+ Gb/s, limited by the PCIe bus.
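As an illustration of the test I mean (server address as in your earlier runs):
Code:
iperf3 -s                             # on 192.168.200.2
iperf3 -c 192.168.200.2 -P 2 -t 30    # on the client; at least two streams so the hash can spread them over both slaves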
 

heavyarms2112

Member
Feb 11, 2022
Right. So I don't have a switch; mine are direct-attached using two links on each side. So I guess I should use mode 0?
I think I already tried parallel streams and didn't get higher performance before at x4. I'll try at x8.


Code:
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-30.00  sec  66.1 GBytes  18.9 Gbits/sec  10245             sender
[  5]   0.00-30.00  sec  66.0 GBytes  18.9 Gbits/sec                  receiver
[  7]   0.00-30.00  sec  65.7 GBytes  18.8 Gbits/sec  1339             sender
[  7]   0.00-30.00  sec  65.7 GBytes  18.8 Gbits/sec                  receiver
[SUM]   0.00-30.00  sec   132 GBytes  37.7 Gbits/sec  11584             sender
[SUM]   0.00-30.00  sec   132 GBytes  37.7 Gbits/sec                  receiver

iperf Done.
Looks like I got traffic on both interfaces (evenly), but I'm still limited to about 40G.
 

prdtabim

Active Member
Jan 29, 2022
Well, you could test with mode 0, since both ends are Linux.
I'm intrigued by those retransmission counts ... maybe CPU/core overload. Look at packet steering: Configuring RPS (Receive Packet Steering). This will distribute the packet load across CPU cores.
Have you tried increasing the MTU to 9000 and the txqueuelen to 10000?

Another test, still in mode 4, is to disconnect one cable. Mode 4 must detect the failure and keep the link up over the remaining connection.
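A minimal RPS sketch, assuming you want to spread receive processing for one slave over several cores (the CPU mask and queue name are only examples; check /sys/class/net/<device>/queues for the real queue names):
Code:
echo ffff > /sys/class/net/enp50s0f0/queues/rx-0/rps_cpus    # allow CPUs 0-15 to handle RX packet steering for queue 0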
 

heavyarms2112

Member
Feb 11, 2022
I'm yet to test the bonding configs. I did test with MTU 9000 and txqueuelen at 10k and 20k, and I am able to saturate the bandwidth in the iperf tests. It turns out I'll be limited to the x4 slot, since the X570 board has just two slots and the x16 slot is going to be used for 4x NVMe storage.

P.S. I don't think I'd be CPU-limited; it's a 5950X.
 

prdtabim

Active Member
Jan 29, 2022
OK. If you're going to use the x16 slot for the NVMe storage, the only remaining choice to maximize throughput is to find a PCIe 4.0 NIC to put in the x4 slot. I have a 5950X on an ASRock X570 Creator and use the x8/x8 configuration on the two CPU PCIe slots (NIC / VGA).
 