RoCE on Linux

Blue)(Fusion

Active Member
Mar 1, 2017
133
49
28
Chicago
I'm trying to use RoCE on my CX3 cards for faster Gluster storage access. I'm currently stuck at the testing phase here.


I am able to use rping between the hosts without error but when trying to test ibv_rc_pingpong, I end up with the following error:
Code:
area51 ~ # ibv_rc_pingpong -g 0 -d mlx4_0 -i 2 10.1.4.75
  local address:  LID 0x0000, QPN 0x000225, PSN 0x8b59d0, GID fe80::202:c9ff:fe1b:fe11
  remote address: LID 0x0000, QPN 0x00022b, PSN 0x30bece, GID fe80::202:c9ff:fe1c:4680
Failed status transport retry counter exceeded (12) for wr_id 2
parse WC failed 1
Both CX3 cards are on the same VLAN and subnet.
 
  • Like
Reactions: gigatexal

i386

Well-Known Member
Mar 18, 2016
3,389
1,137
113
33
Germany
ibv_rc_pingpong uses InfiniBand; RoCE is for Ethernet.
RoCE = RDMA over Converged Ethernet
IB = InfiniBand
 
  • Like
Reactions: fossxplorer

zxv

The more I C, the less I see.
Sep 10, 2017
156
55
28
There is a perftest package that has ib_send_bw.
Run ib_send_bw with no arguments on one host (it acts as the server) and ib_send_bw <host> on the other to measure bandwidth between them.
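A minimal sketch of the two-sided invocation, assuming the perftest package is installed (the -d device and -i port values here are borrowed from the ibv_rc_pingpong example earlier in the thread; adjust to your setup):

```shell
# On the server host: with no host argument, ib_send_bw listens for a client
ib_send_bw -d mlx4_0 -i 2

# On the client host: point it at the server's IP to run the bandwidth test
ib_send_bw -d mlx4_0 -i 2 10.1.4.75
```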
 
  • Like
Reactions: fossxplorer

Blue)(Fusion

Active Member
Mar 1, 2017
133
49
28
Chicago
Fair enough, I'm working on getting the perftest package working on my uncommon distro. What about qperf? The man page says it tests RDMA; it doesn't say it has to be over InfiniBand.

But when I try polling RDMA numbers, I get:
Code:
area51 ~ # qperf 10.1.4.50 ud_lat ud_bw
ud_lat:
failed to create address handle
area51 ~ # qperf 10.1.4.50 rc_bi_bw
rc_bi_bw:
failed to modify QP to RTR: Invalid argument
The modules are already loaded on both servers:
Code:
area51 ~ # lsmod | egrep 'ib_|_ib|_rdma|mlx4|iw_'
iw_cm                  45056  1 rdma_cm
ib_cm                  53248  1 rdma_cm
ib_umad                28672  0
mlx4_ib               200704  0
ib_uverbs             110592  2 mlx4_ib,rdma_ucm
ib_core               245760  7 rdma_cm,mlx4_ib,iw_cm,ib_umad,rdma_ucm,ib_uverbs,ib_cm
mlx4_en               135168  0
mlx4_core             327680  2 mlx4_ib,mlx4_en
devlink                69632  3 mlx4_core,mlx4_ib,mlx4_en

Basically, I want to know that RDMA transport is fully functional before I change my Gluster setup to use RDMA transport (over Ethernet).
 

zxv

The more I C, the less I see.
Sep 10, 2017
156
55
28
qperf is great if you have it. Not all distros have it in their main repos.

Check the MTU on your interfaces and any switches in between, and make sure they are 4k or higher.
Many RDMA protocols use 4k messages.

In my setups, I set the interfaces that participate in RDMA to 9000, and the switches to 9000 or greater if they allow it. Some OSes only allow 1500 and 9000, so those are the common denominators.
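For reference, a quick way to check and raise the MTU on Linux (the interface name eth0 is a placeholder; the switch ports in the path need a matching configuration too):

```shell
# Show the current MTU of the interface
ip link show dev eth0 | grep -o 'mtu [0-9]*'

# Raise the MTU to 9000 (requires root)
ip link set dev eth0 mtu 9000
```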
 

zxv

The more I C, the less I see.
Sep 10, 2017
156
55
28
It's not an MTU issue. I built qperf on Debian stretch, and it fails with the same error message, while ib_send_bw works on the same system. The system and switch use jumbo frames. So it's something else.

The error message I see is:
rc_rdma_read_bw:
server: failed to modify QP to RTR: Network is unreachable

This may be a hint at whatever serves as the connection manager, which provides the GIDs for both ends of the connection.

So I tried:
ibv_rc_pingpong
local address: LID 0x0000, QPN 0x000245, PSN 0x0b827c, GID ::
Failed to modify QP to RTR
Couldn't connect to remote QP

It looks like an empty GID.
I'm not running a subnet manager, and I'm not familiar with setting one up, so I don't have a solution at the moment.
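For what it's worth, with RoCE the GIDs come from the port's GID table rather than from a subnet manager, so one way to check is to read the table from sysfs (device and port here match the mlx4_0 / port 2 setup from earlier in the thread; adjust as needed). Mellanox OFED also ships a show_gids script that prints the same information:

```shell
# List the GID table for port 2 of mlx4_0. A valid RoCE GID is derived from
# the port's MAC (and IP for RoCEv2); an all-zero entry means no GID is set up.
for g in /sys/class/infiniband/mlx4_0/ports/2/gids/*; do
    echo "$g: $(cat "$g")"
done
```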
 

zxv

The more I C, the less I see.
Sep 10, 2017
156
55
28
Page 74 of the Mellanox OFED for Linux Manual says that the subnet manager is not needed for Ethernet:

"When working with RDMA applications over Ethernet link layer the following points should be noted: The presence of a Subnet Manager (SM) is not required in the fabric. Thus, operations that require communication with the SM are managed in a different way in RoCE. This does not affect the API but only the actions such as joining multicast group, that need to be taken when using the API"

http://www.mellanox.com/related-docs/prod_software/Mellanox_OFED_Linux_User_Manual_v4_5.pdf

I've been using inbox drivers, and I'm about to try switching to the Mellanox OFED packages. My impression is that the capabilities to troubleshoot may be much better with the Mellanox OFED because it has more utilities to show the hardware, driver, etc. configuration, and in greater detail.
 

Blue)(Fusion

Active Member
Mar 1, 2017
133
49
28
Chicago
Thanks, zxv, for looking into this.

I was curious about the subnet manager. I installed OpenSM but never configured it, so it appears it's not needed in this case, as you said. Those are the errors I have been seeing as well. I have no idea what "QP" is referencing, so I don't know where to turn next.

I do have the VLAN and the adapters that RoCE is to be used on set to 9k jumbo frames. They've been this way for a year now, so the jumbo frames are not the issue.
 

zxv

The more I C, the less I see.
Sep 10, 2017
156
55
28
I installed Mellanox OFED 4.5 on Ubuntu 18.04, and then qperf.
qperf fails with:

libibverbs: GRH is mandatory For RoCE address handle
failed to create address handle

This looks similar to "Support RoCE protocol in tests" (Issue #10, gpudirect/libgdsync),
which says the application doesn't support RoCE, only InfiniBand.
It looks like the application did not make the API calls necessary to set up the GIDs for each side of the RoCE connection.

If using Mellanox cards, one option for troubleshooting is to install a Mellanox OFED release on a supported platform for the purpose of getting a baseline. Given the potential confusion about whether issues are hardware, driver or software related, this may help narrow the envelope, and potentially provide some known good results for comparing to results on other distros. Here's a link to the OFED release: http://www.mellanox.com/page/products_dyn?product_family=26
 

Blue)(Fusion

Active Member
Mar 1, 2017
133
49
28
Chicago
I installed all the OFED packages. I made slight progress, but just upgraded to CX3 cards which I know for sure support RoCE and are running up to date firmware.

I'm now trying to test Gluster with RDMA transport.

I can start the volume with only RDMA transport, but when checking the status I get the following:
Code:
proton mnt # gluster volume status rdmatest
Status of volume: rdmatest
Gluster process                                      TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick proton.gluster.rgnet:/bricks/brick1/rdmatest   N/A       N/A        N       N/A
Brick neutron.gluster.rgnet:/bricks/brick1/rdmatest  N/A       N/A        N       N/A

Task Status of Volume rdmatest
------------------------------------------------------------------------------
There are no active volume tasks


proton mnt # gluster volume info rdmatest
Volume Name: rdmatest
Type: Distribute
Volume ID: b7c19928-060e-4e65-a27f-6164de30e251
Status: Started
Snapshot Count: 0
Number of Bricks: 2
Transport-type: rdma
Bricks:
Brick1: proton.gluster.rgnet:/bricks/brick1/rdmatest
Brick2: neutron.gluster.rgnet:/bricks/brick1/rdmatest
Options Reconfigured:
nfs.disable: on


proton mnt # lsmod | grep 'rdma\|_ib\|ib_\|_cm'
rpcrdma               204800  0
sunrpc                335872  1 rpcrdma
ib_umad                28672  0
rdma_ucm               32768  1
rdma_cm                65536  2 rpcrdma,rdma_ucm
iw_cm                  45056  1 rdma_cm
ib_cm                  53248  1 rdma_cm
configfs               40960  2 rdma_cm
mlx4_ib               200704  0
ib_uverbs             110592  2 mlx4_ib,rdma_ucm
ib_core               245760  8 rdma_cm,rpcrdma,mlx4_ib,iw_cm,ib_umad,rdma_ucm,ib_uverbs,ib_cm
mlx4_core             331776  2 mlx4_ib,mlx4_en
devlink                69632  3 mlx4_core,mlx4_ib,mlx4_en
Ideas?

Also, the servers are running the latest firmware for my CX3 cards:

Code:
proton mnt # mstfwmanager -d 09:00.0
Querying Mellanox devices firmware ...
Device #1:
----------
Device Type: ConnectX3
Part Number: MCX312A-XCB_A2-A6
Description: ConnectX-3 EN network interface card; 10GigE; dual-port SFP+; PCIe3.0 x8 8GT/s; RoHS R6
PSID: MT_1080120023
PCI Device Name: 09:00.0
Port1 MAC: 0002c93b6130
Port2 MAC: 0002c93b6131
Versions:     Current      Available
   FW         2.42.5000    N/A
   PXE        3.4.0752     N/A
 

zxv

The more I C, the less I see.
Sep 10, 2017
156
55
28
I have not used gluster, but in the case of iscsi/iser, what I've found helpful is to: 1) make sure flow control and ECN are working, such that tests consistently show over 90% of full bandwidth; 2) test the storage protocols over regular IP without RDMA to ensure all of the storage configuration parameters are valid; and then enable RDMA.

There's an option to test bandwidth using a range of RDMA message sizes:
ib_send_bw -a

When there are flow control or ECN issues, the bandwidth will increase with message size for small messages, then drop sharply and vary inconsistently at larger message sizes.
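For example, the message-size sweep looks like this (device name and server IP assumed from earlier posts):

```shell
# Server side: accept the test
ib_send_bw -a -d mlx4_0

# Client side: -a sweeps message sizes from 2 bytes up to 8 MB,
# reporting bandwidth for each size
ib_send_bw -a -d mlx4_0 10.1.4.75
```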
 

gerby

SREious Engineer
Apr 3, 2021
37
19
8
Stumbled on this from a google search and figured I'd contribute an answer I've now found; if you're testing with qperf and roce you need to use the -cm1 flag on the client side eg:

Code:
qperf -cm1 -t 60 --use_bits_per_sec 172.18.200.22 rc_bw
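For completeness, the matching server side just runs qperf with no arguments (it listens until a client connects):

```shell
# On the server host
qperf
```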
 
  • Like
Reactions: jode

Blue)(Fusion

Active Member
Mar 1, 2017
133
49
28
Chicago
Stumbled on this from a google search and figured I'd contribute an answer I've now found; if you're testing with qperf and roce you need to use the -cm1 flag on the client side eg:

Code:
qperf -cm1 -t 60 --use_bits_per_sec 172.18.200.22 rc_bw
Thanks for bumping this topic! This trick worked for qperf. I've since given up on Gluster/RDMA, but it's good to know RoCE is working.