Edit: Everything below still stands, except that RDMA is actually not working on ANY host. Everything is going through IPoIB and it's awful. I thought RDMA was working on the other hosts, but it is not.
Original post:
This host ("shark") currently has four RDMA-aware functions, and none of them are working. NFS silently falls back to TCP mode without giving an error message. iSCSI refuses to turn on iSER (dmesg says "isert: isert_setup_id: rdma_bind_addr() failed: -99"). The NVMe target module can be loaded, but it doesn't create the files in sysfs that it's supposed to. Looking Glass Proxy (lgproxy) also fails to set up an RDMA connection, and the source program eventually segfaults.
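For reference, the -99 in that dmesg line is a negated errno value, which is how kernel functions report errors. On Linux, errno 99 is EADDRNOTAVAIL ("Cannot assign requested address"). A minimal Python sketch for decoding such codes follows; the helper name `decode_kernel_errno` is made up for illustration, and it uses the errno table of whatever platform runs it, so it should be run on Linux to decode Linux dmesg output.

```python
import errno
import os


def decode_kernel_errno(ret):
    """Map a kernel-style negative return code to (symbolic name, message).

    Hypothetical helper for illustration only. It relies on the errno table
    of the platform running it, so run it on Linux for Linux dmesg output.
    """
    code = abs(ret)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # Decode the value reported by isert_setup_id in the dmesg line above.
    print(decode_kernel_errno(-99))
```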
The thing is, these are IP-over-InfiniBand interfaces, so rdma_bind_addr shouldn't fail. It also used to work, and it's just this one host. Everything works on the other hosts; lgproxy is the only thing I can't test there, because every other host on the network is headless.
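One sanity check that fits here is confirming that the kernel really reports those interfaces as InfiniBand link-layer devices, since the address being bound has to live on such an interface. The sketch below reads the standard `type` and `operstate` files under /sys/class/net and keeps the interfaces whose type is 32 (ARPHRD_INFINIBAND). The function name `ipoib_interfaces` is an assumption for illustration, not part of any tool mentioned in this post.

```python
from pathlib import Path

ARPHRD_INFINIBAND = 32  # link-layer type the kernel reports for IPoIB netdevs


def ipoib_interfaces(sys_class_net="/sys/class/net"):
    """Return {interface name: operstate} for InfiniBand-type interfaces.

    Hypothetical helper: the root path is a parameter so the function can be
    pointed at a copy of the tree (or a test fixture) instead of live sysfs.
    """
    found = {}
    for iface in sorted(Path(sys_class_net).iterdir()):
        try:
            if int((iface / "type").read_text().strip()) != ARPHRD_INFINIBAND:
                continue
            state = (iface / "operstate").read_text().strip()
        except (OSError, ValueError):
            # Skip entries that are not interface directories or lack the files.
            continue
        found[iface.name] = state
    return found


if __name__ == "__main__":
    try:
        print(ipoib_interfaces())
    except FileNotFoundError:
        print("no /sys/class/net here; this only works on Linux")
```

An empty result on a host with IPoIB configured would suggest the interfaces are not what they appear to be; a non-empty result only rules out that one possibility.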
Things I have tried so far:
- stock Debian kernel instead of custom kernel
- going back to major version 6.1 (the custom kernel is 6.5 right now)
- removing and rebuilding everything under iscsi/ in targetcli
- HCA settings: different numbers of VLs, different BAR sizes, SR-IOV disabled (it doesn't work on this host anyway)
- different HCA of the same generation (ConnectX-3 FCBT, uses mlx4* modules)
- different HCA of different generation (Connect-IB FCAT, uses mlx5* modules)
- compared everything in /etc/rdma, /etc/iscsi, and /etc/modprobe.d against a working system
- checked opensm's configuration. (Did I do something silly like assign partition membership by GUID? No, I didn't.)
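The configuration comparison in the list above can be scripted so that it is repeatable. This is a minimal sketch using Python's standard `filecmp.dircmp`, assuming the working system's directory has already been copied somewhere local; the function name `config_differences` and any paths passed to it are hypothetical.

```python
import filecmp
from pathlib import Path


def config_differences(left, right):
    """Recursively compare two directory trees.

    Returns a sorted list of (kind, relative path) tuples, where kind is
    "only_left", "only_right", or "differs".
    Note: dircmp uses a shallow comparison by default, so two files with the
    same size and modification time are treated as equal without being read.
    """
    results = []

    def walk(cmp, rel):
        for name in cmp.left_only:
            results.append(("only_left", str(rel / name)))
        for name in cmp.right_only:
            results.append(("only_right", str(rel / name)))
        for name in cmp.diff_files:
            results.append(("differs", str(rel / name)))
        for name, sub in cmp.subdirs.items():
            walk(sub, rel / name)

    walk(filecmp.dircmp(left, right), Path("."))
    return sorted(results)
```

For example, calling `config_differences("/etc/rdma", "/tmp/working-host/etc/rdma")` (the second path being wherever the copy from the working system was placed) would list every file that exists on only one side or whose contents differ.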
This is Debian Bookworm. It is tainted only by the Mellanox non-free firmware and DKMS modules, and Steam. Otherwise it's vanilla.
I'm at a total loss. Any suggestions?