RoCE on Linux

Blue)(Fusion

Active Member
Mar 1, 2017
133
49
28
Chicago
I'm trying to use RoCE on my CX3 cards for faster Gluster storage access. I'm currently stuck at the testing phase here.


I am able to use rping between the hosts without error but when trying to test ibv_rc_pingpong, I end up with the following error:
Code:
area51 ~ # ibv_rc_pingpong -g 0 -d mlx4_0 -i 2 10.1.4.75
  local address:  LID 0x0000, QPN 0x000225, PSN 0x8b59d0, GID fe80::202:c9ff:fe1b:fe11
  remote address: LID 0x0000, QPN 0x00022b, PSN 0x30bece, GID fe80::202:c9ff:fe1c:4680
Failed status transport retry counter exceeded (12) for wr_id 2
parse WC failed 1
Both CX3 cards are on the same VLAN and subnet.
 
  • Like
Reactions: gigatexal

i386

Well-Known Member
Mar 18, 2016
3,389
1,137
113
33
Germany
ibv_rc_pingpong uses InfiniBand; RoCE is for Ethernet.
RoCE = RDMA over Converged Ethernet
IB = InfiniBand
 
  • Like
Reactions: fossxplorer

zxv

The more I C, the less I see.
Sep 10, 2017
156
55
28
There is a perftest package that has ib_send_bw.
Run ib_send_bw with no arguments on one host (it acts as the server) and ib_send_bw <host> on the other to measure bandwidth between them.
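A minimal sketch of the two-sided invocation, assuming the perftest package is installed (the -d device and -i port values here are borrowed from the ibv_rc_pingpong example earlier in the thread; adjust to your setup):

```shell
# On the server host: with no host argument, ib_send_bw listens for a client
ib_send_bw -d mlx4_0 -i 2

# On the client host: point it at the server's IP to run the bandwidth test
ib_send_bw -d mlx4_0 -i 2 10.1.4.75
```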
 
  • Like
Reactions: fossxplorer

Blue)(Fusion

Active Member
Mar 1, 2017
133
49
28
Chicago
Fair enough, I'm working on getting the perftest package working on my uncommon distro. What about qperf? The man page says it tests RDMA; it doesn't say it has to be over InfiniBand.

But when I try polling RDMA numbers, I get:
Code:
area51 ~ # qperf 10.1.4.50 ud_lat ud_bw
ud_lat:
failed to create address handle
area51 ~ # qperf 10.1.4.50 rc_bi_bw
rc_bi_bw:
failed to modify QP to RTR: Invalid argument
The modules are already loaded on both servers:
Code:
area51 ~ # lsmod | egrep 'ib_|_ib|_rdma|mlx4|iw_'
iw_cm                  45056  1 rdma_cm
ib_cm                  53248  1 rdma_cm
ib_umad                28672  0
mlx4_ib               200704  0
ib_uverbs             110592  2 mlx4_ib,rdma_ucm
ib_core               245760  7 rdma_cm,mlx4_ib,iw_cm,ib_umad,rdma_ucm,ib_uverbs,ib_cm
mlx4_en               135168  0
mlx4_core             327680  2 mlx4_ib,mlx4_en
devlink                69632  3 mlx4_core,mlx4_ib,mlx4_en

Basically, I want to know that RDMA transport is fully functional before I change my Gluster setup to use RDMA transport (over Ethernet).
 

zxv

The more I C, the less I see.
Sep 10, 2017
156
55
28
qperf is great if you have it. Not all distros have it in their main repos.

Check the MTU on your interfaces and any switches in between, and make sure they are 4k or higher.
Many RDMA protocols use 4k messages.

In my setups, I set the interfaces that participate in RDMA to 9000, and the switches to 9000 or greater if they allow it. Some OSes only allow 1500 and 9000, so those are the common denominators.
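For reference, a quick way to check and raise the MTU on Linux (the interface name eth0 is a placeholder; the switch ports in the path need a matching configuration too):

```shell
# Show the current MTU of the interface
ip link show dev eth0 | grep -o 'mtu [0-9]*'

# Raise the MTU to 9000 (requires root)
ip link set dev eth0 mtu 9000
```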
 

zxv

The more I C, the less I see.
Sep 10, 2017
156
55
28
It's not an MTU issue. I built qperf on Debian stretch, and it fails with the same error message, while ib_send_bw works on the same system. The system and switch use jumbo frames. So it's something else.

The error message I see is:
rc_rdma_read_bw:
server: failed to modify QP to RTR: Network is unreachable

This may be a hint at whatever serves as the connection manager, which provides the GIDs for both ends of the connection.

So I tried:
ibv_rc_pingpong
local address: LID 0x0000, QPN 0x000245, PSN 0x0b827c, GID ::
Failed to modify QP to RTR
Couldn't connect to remote QP

It looks like an empty GID.
I'm not running a subnet manager, and I'm not familiar with setting one up, so I don't have a solution at the moment.
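For what it's worth, with RoCE the GIDs come from the port's GID table rather than from a subnet manager, so one way to check is to read the table from sysfs (device and port here match the mlx4_0 / port 2 setup from earlier in the thread; adjust as needed). Mellanox OFED also ships a show_gids script that prints the same information:

```shell
# List the GID table for port 2 of mlx4_0. A valid RoCE GID is derived from
# the port's MAC (and IP for RoCEv2); an all-zero entry means no GID is set up.
for g in /sys/class/infiniband/mlx4_0/ports/2/gids/*; do
    echo "$g: $(cat "$g")"
done
```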
 

zxv

The more I C, the less I see.
Sep 10, 2017
156
55
28
Page 74 of the Mellanox OFED for Linux Manual says that the subnet manager is not needed for Ethernet:

"When working with RDMA applications over Ethernet link layer the following points should be noted: The presence of a Subnet Manager (SM) is not required in the fabric. Thus, operations that require communication with the SM are managed in a different way in RoCE. This does not affect the API but only the actions such as joining multicast group, that need to be taken when using the API"

http://www.mellanox.com/related-docs/prod_software/Mellanox_OFED_Linux_User_Manual_v4_5.pdf

I've been using inbox drivers, and I'm about to try switching to the Mellanox OFED packages. My impression is that the capabilities to troubleshoot may be much better with the Mellanox OFED because it has more utilities to show the hardware, driver, etc. configuration, and in greater detail.
 

Blue)(Fusion

Active Member
Mar 1, 2017
133
49
28
Chicago
Thanks, zxv, for looking into this.

I was curious about the subnet manager. I installed OpenSM but never configured it, so it appears it's not needed in this case, as you said. Those are the errors I have been seeing as well. I have no idea what "QP" is referencing, so I don't know where to turn next.

I do have the VLAN and the adapters that RoCE is to be used on set to 9k jumbo frames. They've been this way for a year now, so the jumbo frames are not the issue.
 

zxv

The more I C, the less I see.
Sep 10, 2017
156
55
28
I installed Mellanox OFED 4.5 on Ubuntu 18.04, and then qperf.
qperf fails with:

libibverbs: GRH is mandatory For RoCE address handle
failed to create address handle

This looks similar to "Support RoCE protocol in tests" (Issue #10, gpudirect/libgdsync),
which says the application doesn't support RoCE, only InfiniBand.
It looks like the application did not make the API calls necessary to set up the GIDs for each side of the RoCE connection.

If using Mellanox cards, one option for troubleshooting is to install a Mellanox OFED release on a supported platform for the purpose of getting a baseline. Given the potential confusion about whether issues are hardware, driver or software related, this may help narrow the envelope, and potentially provide some known good results for comparing to results on other distros. Here's a link to the OFED release: http://www.mellanox.com/page/products_dyn?product_family=26
 

Blue)(Fusion

Active Member
Mar 1, 2017
133
49
28
Chicago
I installed all the OFED packages. I made slight progress, but just upgraded to CX3 cards which I know for sure support RoCE and are running up to date firmware.

I'm now trying to test Gluster with RDMA transport.

I can start the volume with only RDMA transport, but when checking the status I get the following:
Code:
proton mnt # gluster volume status rdmatest
Status of volume: rdmatest
Gluster process                                      TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick proton.gluster.rgnet:/bricks/brick1/rdmatest   N/A       N/A        N       N/A
Brick neutron.gluster.rgnet:/bricks/brick1/rdmatest  N/A       N/A        N       N/A

Task Status of Volume rdmatest
------------------------------------------------------------------------------
There are no active volume tasks


proton mnt # gluster volume info rdmatest
Volume Name: rdmatest
Type: Distribute
Volume ID: b7c19928-060e-4e65-a27f-6164de30e251
Status: Started
Snapshot Count: 0
Number of Bricks: 2
Transport-type: rdma
Bricks:
Brick1: proton.gluster.rgnet:/bricks/brick1/rdmatest
Brick2: neutron.gluster.rgnet:/bricks/brick1/rdmatest
Options Reconfigured:
nfs.disable: on


proton mnt # lsmod | grep 'rdma\|_ib\|ib_\|_cm'
rpcrdma               204800  0
sunrpc                335872  1 rpcrdma
ib_umad                28672  0
rdma_ucm               32768  1
rdma_cm                65536  2 rpcrdma,rdma_ucm
iw_cm                  45056  1 rdma_cm
ib_cm                  53248  1 rdma_cm
configfs               40960  2 rdma_cm
mlx4_ib               200704  0
ib_uverbs             110592  2 mlx4_ib,rdma_ucm
ib_core               245760  8 rdma_cm,rpcrdma,mlx4_ib,iw_cm,ib_umad,rdma_ucm,ib_uverbs,ib_cm
mlx4_core             331776  2 mlx4_ib,mlx4_en
devlink                69632  3 mlx4_core,mlx4_ib,mlx4_en
Ideas?

Also, the servers are running the latest firmware for my CX3 cards:

Code:
proton mnt # mstfwmanager -d 09:00.0
Querying Mellanox devices firmware ...
Device #1:
----------
Device Type: ConnectX3
Part Number: MCX312A-XCB_A2-A6
Description: ConnectX-3 EN network interface card; 10GigE; dual-port SFP+; PCIe3.0 x8 8GT/s; RoHS R6
PSID: MT_1080120023
PCI Device Name: 09:00.0
Port1 MAC: 0002c93b6130
Port2 MAC: 0002c93b6131
Versions:     Current      Available
   FW         2.42.5000    N/A
   PXE        3.4.0752     N/A
 

zxv

The more I C, the less I see.
Sep 10, 2017
156
55
28
I have not used gluster, but in the case of iscsi/iser, what I've found helpful is to: 1) make sure flow control and ECN are working, such that tests consistently show over 90% of full bandwidth; 2) test the storage protocols over regular IP without RDMA to ensure all of the storage configuration parameters are valid; and then enable RDMA.

There's an option to test bandwidth using a range of RDMA message sizes:
ib_send_bw -a

When there are flow control or ECN issues, the bandwidth will increase with message size for small messages, then drop sharply and vary inconsistently at larger message sizes.
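For example, the message-size sweep looks like this (device name and server IP assumed from earlier posts):

```shell
# Server side: accept the test
ib_send_bw -a -d mlx4_0

# Client side: -a sweeps message sizes from 2 bytes up to 8 MB,
# reporting bandwidth for each size
ib_send_bw -a -d mlx4_0 10.1.4.75
```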
 

gerby

SREious Engineer
Apr 3, 2021
37
19
8
Stumbled on this from a google search and figured I'd contribute an answer I've now found; if you're testing with qperf and roce you need to use the -cm1 flag on the client side eg:

Code:
qperf -cm1 -t 60 --use_bits_per_sec 172.18.200.22 rc_bw
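For completeness, the matching server side just runs qperf with no arguments (it listens until a client connects):

```shell
# On the server host
qperf
```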
 
  • Like
Reactions: jode

Blue)(Fusion

Active Member
Mar 1, 2017
133
49
28
Chicago
Stumbled on this from a google search and figured I'd contribute an answer I've now found; if you're testing with qperf and roce you need to use the -cm1 flag on the client side eg:

Code:
qperf -cm1 -t 60 --use_bits_per_sec 172.18.200.22 rc_bw
Thanks for bumping this topic! This trick worked for qperf. I've since given up on Gluster/RDMA, but it's good to know RoCE is working.