rping (RoCE) cannot connect after restarting network service


Meng Wang

Dec 19, 2018
Hello,

We set up two servers with ConnectX-4 NICs (model: MCX414A-BCAT), both connected to one SX6036 switch with flow control enabled. We want to test RoCE on this setup, so we first use rping to check RoCE connectivity between the servers. Initially, they can establish RoCE connections (rping works in both directions). However, after we ifdown/ifup the interfaces, restart the network service, or reboot the two servers, connectivity between them may become abnormal. Sometimes the active side gets stuck at "rdma_connect" with no CM event generated afterwards; sometimes the connection is established, but the sender fails to send messages to the receiver (error 12: IBV_WC_RETRY_EXC_ERR). If we repeat ifdown/ifup on the affected interface or restart the network service several times, connectivity eventually returns to normal.

We tested on CentOS 7.4, CentOS 7.9 and Ubuntu 22.04 with different drivers (OFED 4.9, 5.4, 5.6) and kernels (default kernels: 3.10.0-693 for CentOS 7.4, 3.10.0-1160 for CentOS 7.9, 5.15 for Ubuntu 22.04; and vanilla kernels 5.10 and 5.18 from the Linux kernel archive), and the problem can be reproduced on all of these setups. TCP/IP connections always work as expected.

Has anyone seen this issue before? Is there any method to troubleshoot this problem? Any suggestion is appreciated. Thanks.
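For anyone who wants to reproduce this, the ifdown/ifup plus rping sequence described above could be scripted roughly as follows. This is only a sketch under assumptions: the helper names (`run_round`, `classify`, and so on), the 30-second timeout, and the iteration count are illustrative and not part of the original test; it assumes an rping server (`rping -s`) is already listening on the peer, that `ifdown`/`ifup` are available (as on CentOS 7), and that rping exits non-zero when the connection or transfer fails.

```python
import subprocess
from collections import Counter


def rping_client_cmd(server_ip, count=5):
    # rping client: -c client mode, -a server address,
    # -C number of ping-pong iterations, -v verbose output.
    return ["rping", "-c", "-a", server_ip, "-C", str(count), "-v"]


def bounce_cmds(iface):
    # Take the interface down and bring it back up, as in the report.
    return [["ifdown", iface], ["ifup", iface]]


def classify(returncode, timed_out):
    # "hang"  : rping did not finish in time (e.g. stuck in rdma_connect)
    # "error" : rping exited non-zero (assumed to cover completion errors
    #           such as IBV_WC_RETRY_EXC_ERR)
    # "ok"    : rping completed normally
    if timed_out:
        return "hang"
    return "ok" if returncode == 0 else "error"


def run_round(iface, server_ip, timeout_s=30, run=subprocess.run):
    # One test round: bounce the interface, then try an rping client run.
    # `run` is injectable so the logic can be exercised without hardware.
    for cmd in bounce_cmds(iface):
        run(cmd, check=True)
    try:
        proc = run(rping_client_cmd(server_ip), timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return classify(None, True)
    return classify(proc.returncode, False)


def run_rounds(iface, server_ip, rounds, **kwargs):
    # Repeat the round several times and tally the outcomes.
    return Counter(run_round(iface, server_ip, **kwargs) for _ in range(rounds))
```

Calling `run_rounds("<interface>", "<peer RoCE IP>", 20)` as root on the client would then give a rough count of how often a round ends in a hang, an error, or a working connection. Both arguments are placeholders to fill in for the actual setup.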