As I wanted to learn to work with RDMA, I got myself a ConnectX-5 (MCX512A-ACAT; 2x25GbE); a second one will follow soon. In addition to doing RDMA (and zero-touch RoCE in particular), these chips also have extended eSwitch functionality that is exposed through the Linux switchdev framework. I have installed the card in a Proxmox 8 machine (running a Linux 6.8 kernel), and I am using the standard mlx5_core module for it.
Getting the card to work was easy enough; it pretty much worked out of the box. However, I have struggled to get the most out of the switchdev functionality. I see various pieces of documentation online, and it is not always clear what requires the Mellanox (now NVIDIA) proprietary driver and what should work with the standard kernel module in the Linux 6.8 range.
The relevant part of my current /etc/network/interfaces looks like:
auto enp101s0f0np0
iface enp101s0f0np0 inet manual
    post-up ip link set dev $IFACE promisc on
    # Disable RX VLAN filtering in hardware offload
    pre-up ethtool -K $IFACE rx-vlan-filter off

auto vmbr0
iface vmbr0 inet manual
    bridge-ports enp101s0f0np0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-510
    post-up devlink dev eswitch set pci/0000:65:00.0 mode switchdev
    post-up ip link set enp101s0f0np0 master vmbr0
Proxmox then adds VMs to vmbr0. I am not 100% sure about the pre-up ethtool -K $IFACE rx-vlan-filter off part; that was suggested somewhere (I forgot where) to get SR-IOV working, but more on that later.
Sure enough, the above works: data flows between the host and the network, between VMs and the host, and between VMs and the network. However, how can I tell how much of the networking is actually offloaded to the eSwitch via switchdev, if any at all? The other question is: what is required for switchdev offload to work for VMs? Do I need to use SR-IOV virtual functions or, conversely, must I avoid them? Does anyone have experience with this?
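To make the offload question concrete, here is a minimal sketch of the checks that could be run, assuming the device names from the config above (pci/0000:65:00.0, enp101s0f0np0, vmbr0). The helper summarize_fdb is a hypothetical name introduced for illustration; it relies on the "offload" flag that iproute2 prints for FDB entries the hardware has taken over. Whether anything shows up as offloaded on this particular setup is exactly the open question, so treat this as a starting point rather than a definitive test.

```shell
#!/bin/sh
# Assumed names, taken from the /etc/network/interfaces excerpt above.
PCI_DEV="pci/0000:65:00.0"
UPLINK="enp101s0f0np0"
BRIDGE="vmbr0"

# Hypothetical helper: reads `bridge fdb show` output on stdin and counts
# how many entries carry the "offload" flag versus the total.
summarize_fdb() {
    awk '{ total++
           for (i = 1; i <= NF; i++) if ($i == "offload") { hw++; break } }
         END { printf "offloaded=%d total=%d\n", hw, total }'
}

# Only touch the system when explicitly asked to, e.g. `sh check.sh check`.
if [ "${1:-}" = "check" ]; then
    # Current eSwitch mode (legacy or switchdev).
    devlink dev eswitch show "$PCI_DEV"
    # Whether TC hardware offload is enabled on the uplink.
    ethtool -k "$UPLINK" | grep hw-tc-offload
    # Bridge FDB entries, summarized by offload flag.
    bridge fdb show br "$BRIDGE" | summarize_fdb
    # TC filters, if any; offloaded rules are marked in_hw, others not_in_hw.
    tc -s filter show dev "$UPLINK" ingress
fi
```

If the FDB summary reports offloaded=0 while traffic is flowing, that would suggest the bridge is forwarding in software, at least for those entries.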
Finally, the NVIDIA docs seem to indicate that if you use SR-IOV, you should in fact be able to use RoCE/RDMA from a VM. However, it is not clear to me whether, in that situation, the virtual function is even connected to the eSwitch, and to what extent the host and the VM can still communicate with each other at that point. Any pointers?
Thanks in advance