nftables flowtables hardware offload with ConnectX-5


jsingh04

New Member
Mar 4, 2024
6
0
1
I recently bought a ConnectX-5 and have been playing about with it - I'm reasonably confident that I've gotten all the offloading working.
I use Proxmox and it was mostly a case of installing the DOCA / ASAP2 drivers and using an OVS bridge, whilst setting up switchdev mode + adding the representors.

I am struggling to get DOCA offload set up for my ConnectX-4 Lx on Proxmox (I have tried kernel 6.5 and am now working on kernel 6.8). Could you share how you set it up? Right now openvswitch throws this error when I try to add my PF to the bridge:

E-Switch port metadata is required when using HWS but it is disabled
 

mrpops2ko

New Member
Feb 12, 2017
19
14
3
35
list out the steps you've taken? maybe paste your history?

remember also if you change kernels you need to reinstall the modules with dkms or else they will be tainted
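
A minimal sketch of that check, assuming the DOCA/OFED kernel modules were installed through DKMS. The helper name `dkms_built_for` is made up for illustration, and the exact `dkms status` line format differs between DKMS versions, so the match is deliberately loose.

```shell
#!/bin/bash
# Hypothetical helper: reads `dkms status` output on stdin and succeeds only if
# at least one module is reported as installed for the given kernel version.
dkms_built_for() {
    local kernel="$1"
    # -F: treat the kernel version as a literal string, not a regex.
    grep -F -- "$kernel" | grep -q 'installed'
}

# Typical use on the host (the rebuild needs root):
#   dkms status | dkms_built_for "$(uname -r)" || dkms autoinstall -k "$(uname -r)"
```

If the check fails for the running kernel, `dkms autoinstall` rebuilds every registered module for it. A reboot (or module reload) is still needed afterwards.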
 

nasbdh9

Active Member
Aug 4, 2019
224
149
43
Requires at least ConnectX-5
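
To see which generation a card actually reports, something like the sketch below may help. It only reads the model name out of `lspci` output. It is not a support matrix, and the helper name `connectx_models` is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: pulls the "ConnectX-..." model names out of lspci
# output read on stdin, one unique model per line.
connectx_models() {
    grep -o 'ConnectX-[0-9][^]]*' | sort -u
}

# Typical use on the host (15b3 is the Mellanox PCI vendor ID):
#   lspci -d 15b3: | connectx_models
```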
 

jsingh04

My hardware is an AMD Epyc 7532 with a Supermicro H12SSL-C. SR-IOV is enabled in bios.

So on a fresh install of proxmox 8.4:

I have the following kernels installed:

6.5.13-6-pve (tried on this)
6.8.12-13-pve (currently using this)
6.8.12-9-pve
I have installed Proxmox headers for both the 6.8 and 6.5 kernels.

- set up the basic GPU passthrough steps (amd_iommu=pt) (added vfio modules...)
- installed openvswitch-switch and openvswitch-switch-dpdk (for dpdk helper functions)
- installed the latest doca-networking package, version 3.0, from the NVIDIA website (I was pinned on kernel 6.5 when I installed this)

The config is based on the post I saw on STH forum here:

Bash:
[Unit]
Description=Script to enable SR-IOV on boot
After=ovs-vswitchd.service
# networking.service needs interfaces that we create here
Before=networking.service

[Service]
Type=oneshot
# Init SR-IOV
ExecStart=/usr/bin/bash -c '/opt/ovs-doca-config/ovs-doca.sh'
[Install]
WantedBy=multi-user.target network-online.target

Code:
#!/bin/bash

# Primary device name and location.
set -x

DEVNAME=enp198s0f0np0
DEVNAME2=enp198s0f1np1
DEVPCIBASE=0000:c6:00
DEVPCIBASE2=0000:c6:01
DEVPCIBASE3=0000:c6:02
MAX_WAIT=30

# Function to wait for openibd.service
wait_for_openibd() {
    local wait_time=0
    echo "Waiting for openibd.service to complete..."
    while ! systemctl is-active --quiet openibd.service; do
        if [ $wait_time -ge $MAX_WAIT ]; then
            echo "Error: openibd.service did not become active within $MAX_WAIT seconds"
            exit 1
        fi
        echo "openibd.service is not yet active, waiting..."
        sleep 1
        ((wait_time++))
    done
    echo "openibd.service is active"
}
systemctl stop openvswitch-switch.service

mkdir -p /hugepages
mount -t hugetlbfs hugetlbfs /hugepages
echo 4096 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages

# Add SR-IOV virtual functions.
/usr/bin/echo 4 > /sys/class/net/${DEVNAME}/device/sriov_numvfs


#Set MAC addresses for the virtual functions.
/usr/bin/ip link set ${DEVNAME} vf 0 mac BC:24:11:5A:F6:00
/usr/bin/ip link set ${DEVNAME} vf 1 mac BC:24:11:5A:F6:01
/usr/bin/ip link set ${DEVNAME} vf 2 mac BC:24:11:5A:F6:02
/usr/bin/ip link set ${DEVNAME} vf 3 mac BC:24:11:5A:F6:03


#rename interfaces
/usr/bin/ip link set dev eth0 name ovs-sw1pf0vf0
/usr/bin/ip link set dev eth1 name ovs-sw1pf0vf1
/usr/bin/ip link set dev eth2 name ovs-sw1pf0vf2
/usr/bin/ip link set dev eth3 name ovs-sw1pf0vf3

#enable spoofchk
for i in $(seq 0 3); do
   /usr/bin/ip link set ${DEVNAME} vf ${i} spoofchk on
done;

#for i in `ls /sys/class/net/${DEVNAME}/device/sriov/`; do
#  echo ON | tee /sys/class/net/${DEVNAME}/device/sriov/${i}/trust
#done;

# Unbind the virtual functions.
/usr/bin/echo ${DEVPCIBASE}.2 > /sys/bus/pci/drivers/mlx5_core/unbind
/usr/bin/echo ${DEVPCIBASE}.3 > /sys/bus/pci/drivers/mlx5_core/unbind
/usr/bin/echo ${DEVPCIBASE}.4 > /sys/bus/pci/drivers/mlx5_core/unbind
/usr/bin/echo ${DEVPCIBASE}.5 > /sys/bus/pci/drivers/mlx5_core/unbind

#Enable the eSwitch.
/usr/sbin/devlink dev eswitch set pci/${DEVPCIBASE}.0 mode switchdev
/usr/sbin/devlink dev eswitch set pci/${DEVPCIBASE}.1 mode switchdev

#echo switchdev > /sys/class/net/${DEVNAME}/compat/devlink/mode
#echo switchdev > /sys/class/net/${DEVNAME2}/compat/devlink/mode


wait_for_openibd
# Ensure vfio-pci module is loaded
modprobe vfio-pci
if ! lsmod | grep -q vfio_pci; then
    echo "Error: Failed to load vfio-pci module"
    exit 1
fi

#Bind VF to host
/usr/bin/echo ${DEVPCIBASE}.2 > /sys/bus/pci/drivers/mlx5_core/bind
/usr/bin/echo ${DEVPCIBASE}.3 > /sys/bus/pci/drivers/mlx5_core/bind
/usr/bin/echo ${DEVPCIBASE}.4 > /sys/bus/pci/drivers/mlx5_core/bind
/usr/bin/echo ${DEVPCIBASE}.5 > /sys/bus/pci/drivers/mlx5_core/bind

#setup huge pages
mkdir -p /hugepages
mount -t hugetlbfs hugetlbfs /hugepages
echo 4096 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages

# Set up OpenVSwitch.
systemctl start openvswitch-switch.service

#/usr/bin/ovs-vsctl add-br vmbr0
/usr/bin/ovs-vsctl --no-wait set Open_vSwitch . other_config:doca-init=true
/usr/bin/ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
/usr/bin/ovs-vsctl set Open_vSwitch . other_config:lacp-fallback-ab=true
/usr/bin/ovs-vsctl set Open_vSwitch . other_config:tc-policy=skip_sw

systemctl restart openvswitch-switch.service

/usr/bin/ovs-vsctl set Open_vSwitch . other_config:default-datapath-type=netdev

/usr/bin/ovs-vsctl add-br vmbr0
/usr/bin/ovs-vsctl add-bond vmbr0 bond0 ${DEVNAME} ${DEVNAME2} lacp=active bond_mode=balance-tcp vlan_mode=native-untagged tag=1 other_config:lacp-time=fast

#rename interfaces
/usr/bin/ip link set dev eth0 name ovs-sw1pf0vf0
/usr/bin/ip link set dev eth1 name ovs-sw1pf0vf1
/usr/bin/ip link set dev eth2 name ovs-sw1pf0vf2
/usr/bin/ip link set dev eth3 name ovs-sw1pf0vf3

#Add the first network port as well as the representer devices.
for i in `ls /sys/class/net/${DEVNAME}/device/net/ | grep -v ${DEVNAME}`; do
  /usr/bin/ovs-vsctl add-port vmbr0 ${i};
done;

#Bring up the switch, the physical port, and the representer devices.
/usr/bin/ip link set dev ${DEVNAME} up
/usr/bin/ip link set dev ${DEVNAME2} up
/usr/bin/ip link set dev vmbr0 up
for i in `ls /sys/class/net/${DEVNAME}/device/net/ | grep -v ${DEVNAME}`; do
  /usr/bin/ip link set dev ${i} up
done;

When I check the openvswitch logs, I see these lines:
|00089|dpdk|WARN|mlx5_net: Unified FDB is not supported with this FW version.
|00090|dpdk|WARN|mlx5_net: No available register for sampler.
|00091|dpdk|ERR|mlx5_net: E-Switch port metadata is required when using HWS but it is disabled (configure it through devlink)
|dpdk|ERR|mlx5_net: probe of PCI device 0000:c6:00.0 aborted after encountering an error: Operation not supported
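
For anyone hunting for the same lines, a small sketch that keeps only the error-level entries from the mlx5 DPDK driver. The helper name `mlx5_probe_errors` is made up, and the log path in the usage comment is the usual Debian/Proxmox location, assumed here rather than taken from the post.

```shell
#!/bin/bash
# Hypothetical helper: keeps only the error-level DPDK lines emitted by the
# mlx5 PMD from an ovs-vswitchd log read on stdin.
mlx5_probe_errors() {
    grep -F '|dpdk|ERR|mlx5_net:'
}

# Typical use (log path is an assumption, adjust for your install):
#   mlx5_probe_errors < /var/log/openvswitch/ovs-vswitchd.log
```

The WARN lines above are dropped by this filter. Only the two ERR lines would remain.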
 

Scott Laird

Well-Known Member
Aug 30, 2014
436
270
63
I don't think flowtable offloading is supported until the CX5, but finding actual proof of this on nVidia/Mellanox's website is hard. Between the links that they broke redoing the Mellanox site, the links that they broke when moving to nVidia's domain, and their constant naming churn I'm not sure what I even want to search for.

I *think* this says that it's only supported w/ ConnectX-5 cards, but the doc specifically covers MLX_OFED with OVS.

Here are a few other docs that at least mention related topics and have some instructions, but don't mention which cards are supported:

 

jsingh04

There goes a month of work. I think it's getting stuck not on flowtable offloading but on port metadata, or is that the same thing?

EDIT: following the document links you sent me, I ended up here:


"esw_port_metadata" is not supported on ConnectX-4 Lx.

Nvidia documentation is hard to decode. Thanks for your help, otherwise I would've bashed my brain till it hurt
 

Scott Laird

I'm still not seeing ConnectX-4 or -5 mentioned there. Is this from comparing the mlx5 and mlx4 devlink pages? The mlx4 driver is for ConnectX-3 devices, and mlx5 covers everything newer. At least they haven't added mlx6 yet, just to add confusion.

I *think* offloading only works with the ConnectX-5, but I can't find a source for that.
 

mrpops2ko

So those errors look related to DPDK rather than DOCA. I've not set up DPDK because I don't have a single application that needs super high throughput, and I don't want to pay the DPDK tax of having core(s) churning away at 100% usage when doing nothing.

From the order of doing things you mentioned, it's likely you have tainted your kernel, so I would do a dkms check and have it rebuild those kernel modules since you've upgraded kernels (you never mentioned if you did this already).

I'd also boot up Windows and update the firmware on your NIC (just in case; things do change and it might solve this problem).

Additionally, what I'd do is make use of the onboard NIC and set vmbr0 to that whilst you properly set up switchdev (that's what I did).

After doing that, you can rework /etc/network/interfaces so that vmbr0 is on that OVS bridge.

I had issues putting everything into /etc/network/interfaces, so I split it in two: the post-up stuff went into a one-shot script and the rest stayed in /etc/network/interfaces.
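
For the "properly set up switchdev" step, a quick sanity check before touching the bridge config could look like this sketch. The helper name `eswitch_mode` is made up, and the parsing assumes the usual one-line `devlink dev eswitch show` output of the form `pci/<addr>: mode <mode> ...`.

```shell
#!/bin/bash
# Hypothetical helper: extracts the eswitch mode ("legacy" or "switchdev")
# from `devlink dev eswitch show` output read on stdin.
eswitch_mode() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "mode") { print $(i + 1); exit } }'
}

# Typical use, with the PCI address from this thread:
#   devlink dev eswitch show pci/0000:c6:00.0 | eswitch_mode
```

If this still prints `legacy` after the one-shot script has run, the representors will not exist and adding them to the OVS bridge will fail.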
 

jsingh04

You are right that connectx-4 does not support Flow table offloading, even after updating the firmware. The ovs log error was a hint. Furthermore, when I try to change the parameters this is what I get, so I think this confirms it.

Bash:
root@server:~# devlink dev param set pci/0000:c6:00.0 name flow_steering_mode value "smfs" cmode runtime
Error: mlx5_core: Software managed steering is not supported by current device.
kernel answers: Operation not supported
root@server:~# devlink dev param set pci/0000:c6:00.0 name esw_port_metadata value "true" cmode runtime
kernel answers: Operation not supported
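
A sketch for checking up front whether the driver exposes a given devlink parameter at all, rather than finding out from a failed `set`. The helper name `has_devlink_param` is made up, and the parsing assumes the usual `name <param> type ...` lines in `devlink dev param show` output. Note that a parameter being listed does not guarantee every value is accepted, as the `smfs` attempt above shows.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the named parameter appears in
# `devlink dev param show` output read on stdin.
has_devlink_param() {
    local name="$1"
    grep -Eq "^[[:space:]]*name ${name}( |\$)"
}

# Typical use, with the PCI address from this thread:
#   devlink dev param show pci/0000:c6:00.0 | has_devlink_param esw_port_metadata \
#       || echo "esw_port_metadata is not exposed by this device/driver"
```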
 

jsingh04

Yes, they are DPDK errors, but DOCA is a pathway built on the DPDK-type pathway. Even when you check the DOCA documentation here:
You can see that even for DOCA the datapath setup is of type dpdk:
Bash:
ovs-vsctl add-port br0-ovs enp4s0f0 -- set Interface enp4s0f0 type=dpdk
I did update the firmware to the latest version but it didn't change anything. The minimum for DOCA is ConnectX-5. ConnectX-4 does not support dv_flow_en=2 (full hardware steering support), only 0 and 1, as ConnectX-4 only supports DMFS, not SMFS.

I did end up setting it to a linux bridge on the motherboard nic while I fiddled with this.


BTW do you use an OVS bridge or linux bridge with SR-IOV?
 

mrpops2ko

I'm using an OVS bridge. You don't get all the offloading capabilities with regular Linux bridges, and from what I understand from having read about it, you don't require DPDK because it's effectively doing DPDK on the NIC / eSwitch. That's why it's marketed as being a big NFV upgrade: the representors all talk very fast with each other.

I'm getting 45us on pings between VMs, and that's not even the 'real' amount because ping has overhead. If you wanted to check its real latency you'd need to use proper PTP / hardware timestamp tracking.