RNIC functions on one host aren't working. I know it's a config problem, but where to look?

naptastic

New Member
Jan 27, 2023
Edit: Everything below is... well, it's not wrong, except RDMA is actually not working for ANY hosts. Everything is going through IPoIB and it's awful. I thought it was working for other hosts but it is not.

Original post:

This host ("shark") currently has four RDMA-aware functions, and none of them are working. NFS falls back to TCP mode without giving an error message. iSCSI refuses to turn on iSER; dmesg says "isert: isert_setup_id: rdma_bind_addr() failed: -99". The NVMe target module can be loaded, but it doesn't create files in sysfs like it's supposed to. Looking Glass Proxy (lgproxy) also fails to set up an RDMA connection, and the source program eventually segfaults.
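For anyone reading along, the -99 in that dmesg line is a negated Linux errno. Below is a minimal sketch for decoding such lines. The helper name `decode_failure` and the small lookup table are made up for illustration; the numeric values are the standard Linux ones, and only a handful are included.

```python
import re

# Small excerpt of Linux errno values. Kernel functions return them negated,
# which is why dmesg shows "-99". This table is deliberately incomplete.
LINUX_ERRNO = {
    19: ("ENODEV", "No such device"),
    22: ("EINVAL", "Invalid argument"),
    98: ("EADDRINUSE", "Address already in use"),
    99: ("EADDRNOTAVAIL", "Cannot assign requested address"),
}

# Matches fragments such as "rdma_bind_addr() failed: -99".
FAILED_RE = re.compile(r"(?P<func>\w+)\(\) failed: (?P<code>-?\d+)")


def decode_failure(line):
    """Return (function, errno name, description) for a '<func>() failed: <n>' line.

    Returns None when the line does not contain that pattern.
    """
    match = FAILED_RE.search(line)
    if match is None:
        return None
    code = abs(int(match.group("code")))
    name, text = LINUX_ERRNO.get(code, ("UNKNOWN", "not in this small table"))
    return match.group("func"), name, text


if __name__ == "__main__":
    print(decode_failure("isert: isert_setup_id: rdma_bind_addr() failed: -99"))
```

So the bind is failing with EADDRNOTAVAIL. That suggests, though it does not prove, that the address being bound is not one the RDMA stack can tie to a local RDMA device.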

The thing is, these are IP-over-InfiniBand interfaces. rdma_bind_addr shouldn't fail. It also used to work... and it's just this one host. Everything else works for everyone else; the only thing I can't test elsewhere is lgproxy, because every other host on the network is headless.
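Since rdma_bind_addr has to map an IP address onto an RDMA device, one cheap sanity check is whether the kernel has registered the devices at all and what state their ports are in. The sketch below reads the standard `/sys/class/infiniband/<dev>/ports/<n>/state` and `link_layer` files. The function name `rdma_port_states` is hypothetical, and the `root` parameter exists only so the function can be pointed at a test directory.

```python
from pathlib import Path


def _read(path):
    """Return the stripped contents of a sysfs file, or '?' if it is unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return "?"


def rdma_port_states(root="/sys/class/infiniband"):
    """Return {(device, port): (state, link_layer)} as reported by sysfs."""
    result = {}
    base = Path(root)
    if not base.is_dir():
        # No RDMA devices registered, or the driver modules are not loaded.
        return result
    for dev in sorted(base.iterdir()):
        ports = dev / "ports"
        if not ports.is_dir():
            continue
        for port in sorted(ports.iterdir()):
            result[(dev.name, port.name)] = (
                _read(port / "state"),
                _read(port / "link_layer"),
            )
    return result


if __name__ == "__main__":
    for (dev, port), (state, layer) in rdma_port_states().items():
        print(f"{dev} port {port}: {state} ({layer})")
```

A healthy InfiniBand port normally reports a state like "4: ACTIVE". An empty result would point at the driver or module side rather than at the ULPs (NFS, iSER, NVMe-oF).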

Things I have tried so far:
  • stock Debian kernel instead of custom kernel
  • going back to major version 6.1 (the custom kernel is 6.5 right now)
  • removing and rebuilding everything under iscsi/ in targetcli
  • HCA settings: different numbers of VLs, different BAR sizes, disabled SR-IOV (doesn't work on this host anyway)
  • a different HCA of the same generation (ConnectX-3 FCBT, uses mlx4* modules)
  • a different HCA of a different generation (Connect-IB FCAT, uses mlx5* modules)
  • compared everything in /etc/rdma, /etc/iscsi, /etc/modprobe.d, with a working system
  • checked opensm's configuration. (Did I do something silly like assign partition membership by GUID? No, I didn't.)
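The config comparison in the list above can be made repeatable with a small script. This is only a sketch: it assumes the working host's directory (say, a copy of /etc/rdma) has been copied somewhere locally, and the function name `config_diff` and the report keys are hypothetical. It uses the standard library's `filecmp.dircmp`.

```python
import filecmp
from pathlib import Path


def config_diff(working, broken):
    """Recursively list paths that differ between two copies of a config tree.

    Note: dircmp compares files by os.stat() signature first (type, size,
    mtime) and only reads contents when the signatures differ, so two files
    with identical size and mtime are treated as equal.
    """
    report = {"only_working": [], "only_broken": [], "differs": []}

    def walk(cmp, prefix):
        report["only_working"] += [str(prefix / n) for n in cmp.left_only]
        report["only_broken"] += [str(prefix / n) for n in cmp.right_only]
        report["differs"] += [str(prefix / n) for n in cmp.diff_files]
        # Descend into directories that exist on both sides.
        for name, sub in cmp.subdirs.items():
            walk(sub, prefix / name)

    walk(filecmp.dircmp(working, broken), Path("."))
    for paths in report.values():
        paths.sort()
    return report


if __name__ == "__main__":
    import sys
    # Usage: python config_diff.py <copy-from-working-host> <local-dir>
    for kind, paths in config_diff(sys.argv[1], sys.argv[2]).items():
        for p in paths:
            print(f"{kind}: {p}")
```

Running it once per directory (/etc/rdma, /etc/iscsi, /etc/modprobe.d) gives a list that can be pasted into the thread, which is easier to review than an eyeball comparison.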
I know that I've made changes all over the place and unfortunately didn't take adequate notes. I also don't know when the problem started.

This is Debian Bookworm. It is tainted only by the Mellanox non-free firmware and DKMS modules, and Steam. Otherwise it's vanilla.

I'm at a total loss. Any suggestions?
 