nftables flowtables hardware offload with ConnectX-5


nexox

Well-Known Member
May 3, 2023
I realized last week that my router's NIC supports hardware offload for bridging, forwarding, and NAT, and while I don't need the acceleration at all, I've got nothing to do next week but break my network trying to see how it works.

I think I read just about everything there is to find on the subject, which isn't very much, and I was curious if anyone here had some experience to share before I jump in.

The most useful source I've found so far is this very detailed blog post: Flowtables - Part 1: A Netfilter/Nftables Fastpath [Thermalcircle.de] and beyond that it's pretty much just the patch description and sources found here: netfilter flowtable hardware offload [LWN.net]. All other references seem too high-level and vague to offer much help.
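For anyone skimming, the software fastpath described in that blog post boils down to a flowtable plus a `flow offload` rule. A minimal sketch (the interface names and chain layout here are placeholders for illustration, not an actual config):

```nft
table inet filter {
    flowtable ft {
        hook ingress priority 0;
        # interfaces that traffic is forwarded between
        devices = { eth0, eth1 };
    }

    chain forward {
        type filter hook forward priority 0; policy drop;
        # once a connection is established, its packets take the
        # flowtable fastpath and skip the rest of this chain
        meta l4proto { tcp, udp } ct state established flow offload @ft
        ct state established,related accept
    }
}
```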
 

blunden

Well-Known Member
Nov 29, 2019
You could perhaps find a bit more information from the documentation for VyOS and OpenWrt.

At least VyOS supports it, so if nothing else, looking at the source code for the script that generates the nftables rules from the VyOS configuration might give you some hints.
 

nexox

Well-Known Member
May 3, 2023
Update: Converting my reasonably simple iptables script to nftables went smoothly, and adding a flow table for software offload was easy enough, but I'm stuck trying to get hardware offload working: every time I try to add `flags offload` to the flow table definition I get `Error: Could not process rule: Operation not supported`.

There are a small number of threads out there where people have hit this same issue, but none really reach a conclusion. At first I thought it was because I was using the in-tree driver, so I built Mellanox's out-of-tree module, but the result is still the same, even after setting `hw-tc-offload on` via ethtool.

Thanks to the VyOS sources I did find the spot in the nftables source where this error comes from, but it's based on a return code from a function called via a nest of pointers and I haven't had time to dive in and figure out what, specifically, is causing the operation to fail. Maybe it doesn't like something with my bridge and VLAN config.
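For context, the failing definition is nothing exotic; it's just the software flowtable with the offload flag added (device names are placeholders):

```nft
table inet filter {
    flowtable ft {
        hook ingress priority 0;
        devices = { eth0, eth1 };
        # this is the line that triggers
        # "Error: Could not process rule: Operation not supported"
        flags offload;
    }
}
```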
 

flo-hm

New Member
May 4, 2024
Hi!

I am struggling with some similar issues. Maybe I got a little further with my debugging and digging into the Linux kernel. Offloading only seems to work on the representor devices when you use the network card in switchdev mode. So far I sadly haven't managed to combine all the pieces so that the representor NICs have network connectivity. Neither OVS nor a normal Linux NIC setup worked in my first tests, but hopefully I will figure out how to fix that in the next few weeks.
 

Hellokitty123

New Member
Jun 25, 2024
Hello flo-hm,

Thanks very much for your deeper investigation. I got my nftables flowtables hardware offloading working on the VF representor.
The requirement to use the representor device seems reasonable: for hardware offloading to work, the NIC has to be programmable to load forwarding rules such as flowtable entries, and that is what switchdev provides. Once offloading is working, the representor is the programming entry point that nftables uses.
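For reference, the rough sequence I mean, sketched for a hypothetical mlx5 card at PCI address 0000:01:00.0 whose PF netdev is eth0 (adjust for your system; the exact ordering and sysfs paths can vary by driver version):

```shell
# 1. create VFs on the PF (sysfs path varies by distro/device)
echo 2 > /sys/class/net/eth0/device/sriov_numvfs

# 2. switch the NIC's embedded switch from legacy to switchdev mode;
#    this is what makes the VF representor netdevs appear
devlink dev eswitch set pci/0000:01:00.0 mode switchdev

# 3. enable TC offload on the PF and on each representor netdev
ethtool -K eth0 hw-tc-offload on
```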
 

flo-hm

New Member
May 4, 2024
Do you maybe have some more details for your setup? Do you assign IPs to the Representors? Where do you bind the VF itself? I did not have that much time the last weeks but if you have some hints for building a complete setup it would be great!
 

Scott Laird

Well-Known Member
Aug 30, 2014
After wandering all over trying to figure out what's wrong with VyOS offloading, I find it somewhat annoying that Hellokitty123's comment above is apparently the single best source of info for this on the entire Internet.

STH for the win, I guess :)
 

Scott Laird

Well-Known Member
Aug 30, 2014
After spending a few hours trying to get offloading working on a scratch machine, I've come to the conclusion that *something* about my NIC doesn't want to let me enable VFs and SR-IOV, so no VF representors, and so no flowtable offloading. This is a Minisforum MS-01 with a ConnectX-5 in its PCIe slot. The kernel shows IOMMUs are working, and the built-in Intel X710 shows VFs just fine. I have 3 almost-identical MS-01s; the ones with CX5s don't work, and the one with a CX4 shows VFs.

Presumably I need to install mstconfig and/or flash the firmware on this thing, which is always a blast from VyOS.

Yaks all the way down...
 

Scott Laird

Well-Known Member
Aug 30, 2014
Okay. Minor progress. I installed mstconfig on the non-VyOS (Ubuntu 24.04) machine with a CX5. Didn't work. Apparently secure boot was still on, and disabling it is a pain.

Copied mstconfig from that to the VyOS machine. Library hell.

Copied Debian Bookworm /etc/apt/sources.list onto the VyOS machine and installed mstconfig via apt. Ran
`mstconfig -d 01:00.0 set SRIOV_EN=1 NUM_OF_VFS=4`. Rebooted. Re-enabled switchdev (`devlink dev eswitch set pci/0000:01:00.0 mode switchdev`). Created a VF (`echo 1 > /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/sriov_numvfs`). This created eth6 and eth7. Tried applying `firewall flowtable X offload hardware` to each. Both failed with `Interface "ethX" does not support hardware offload`.

So presumably I need a different flavor of representor and/or different config settings for the card. But hey -- progress.
 

gerby

SREious Engineer
Apr 3, 2021
I'm wondering if this is a VyOS bug; I've got a VyOS instance with vifs on a CX4-Lx with the same behavior. I'm not using virtual functions, since the newer drivers (including mainline) can enable the eswitch without SR-IOV.
Code:
vyos@vyos:~$ sudo devlink dev eswitch show pci/0000:07:00.0
pci/0000:07:00.0: mode switchdev inline-mode link encap-mode basic
hw-tc-offload is showing on [fixed] according to ethtool:
Code:
vyos@vyos# ethtool -k eth4 | grep hw-tc
hw-tc-offload: on [fixed]
However vyos doesn't see the offload:
Code:
vyos@vyos:~$ show int eth eth4 physical offload
rx-checksumming               on
tx-checksumming               on
tx-checksum-ip-generic        on
scatter-gather                on
tx-scatter-gather             on
tcp-segmentation-offload      on
tx-tcp-segmentation           on
tx-tcp-mangleid-segmentation  off
tx-tcp6-segmentation          on
generic-segmentation-offload  on
generic-receive-offload       on
large-receive-offload         off
rx-vlan-offload               off  [requested  on]
tx-vlan-offload               on
ntuple-filters                off
receive-hashing               on
rx-vlan-filter                off  [requested  on]
tx-gre-segmentation           on
tx-gre-csum-segmentation      on
tx-udp_tnl-segmentation       on
tx-udp_tnl-csum-segmentation  on
tx-gso-partial                on
tx-udp-segmentation           on
tx-nocache-copy               off
rx-fcs                        off
rx-all                        off
tx-vlan-stag-hw-insert        on
rx-udp_tunnel-port-offload    on
rx-gro-list                   off
macsec-hw-offload             on
rx-udp-gro-forwarding         off
 

Scott Laird

Well-Known Member
Aug 30, 2014
Hmm. I didn't check `show int ... physical offload`, but it mostly looks like reformatted output from `ethtool`. The big issue is that the generated nftables commands return errors whenever hardware offload is enabled, even with eswitch mode and hw-tc-offload on. I *think* that's just because I'm not using the right representor device, but frankly the odds of getting this to actually be usable without at least minor VyOS surgery are getting pretty slim at this point.
 

gerby

SREious Engineer
Apr 3, 2021
I agree that this is gonna require VyOS surgery; I'm wondering if they're dropping offload lookups based on the driver loaded, i.e. seeing mlx5 and not bothering to look for tc_offload.
 

Endanger5372

New Member
Dec 4, 2024
I haven't had any more success, but some other resources I've found helpful are the docs from the manufacturers:
  1. Intel
  2. Mellanox/NVIDIA
Both have examples (mostly using tc) that clearly show all the hardware programming and configuration is happening using the representors.

I think there might be a missing piece to this as far as nftables is concerned. The initial packets will be coming in through the VFs and PFs but the flows need to be configured on the representors. Does nftables know how to look up the representors or does it assume that the VFs/PFs are programmable?

I also found this LKML thread that indicates that flowtable offload like this may not be supported by drivers. They need to support TC_SETUP_FT. Looking at kernel source, it looks like that basically means you need to use mlx5_core or some Mediatek driver. Intel for example supports hw-tc-offload but not TC_SETUP_FT, so flowtable offload won't work there.

It looks like someone using a Mediatek device actually managed to get things working pretty normally but if I understand correctly, they programmed the physical ports directly and didn't need to use SR-IOV at all.

@nexox or @flo-hm, assuming you're using a device that supports mlx5_core, could you try a configuration like the Mediatek one above, just forwarding between two PFs without SR-IOV?
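If you have a kernel tree checked out, a quick (rough) way to see which in-tree drivers reference TC_SETUP_FT is just to grep for it:

```shell
# list ethernet drivers referencing the TC_SETUP_FT callback type
# (run from the root of a Linux kernel source tree)
grep -rl "TC_SETUP_FT" drivers/net/ethernet/
```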
 

Endanger5372

New Member
Dec 4, 2024
Very short follow-up: I trawled through the kernel code and concluded:
  1. mlx5 has two sub-variants: core and rep
  2. rep supports TC_SETUP_FT, core does not
  3. rep is used for PFs and representors, core is used for VFs
  4. VFs are defined as auxiliary devices using the kernel's auxiliary bus.
  5. In theory, I think nftables/netfilter could use the auxiliary bus API to find the representors but I'm not sure whether this would work well. What I can say with more certainty is that I couldn't find any evidence that such a feature already exists.
So as far as I can tell from the code, configuring things with PFs should work but nftables flow offload won't work with SR-IOV VFs.

I don't have a ConnectX-5 to test with (though if this works I might change that), so folks who have one: please let me know if you can get it working.

Another thing that might work for SR-IOV: can you put the representors/PFs in the flowtable definition's "devices" option? The kernel's flowtable documentation says this parameter specifies the devices the flowtable applies to, but it doesn't say the flow add action has to be defined on a rule for the same device. (It does say the flowtable hash key includes the interface of the first incoming packet, but if the flow is offloaded to hardware, maybe that won't matter.)

Maybe nftables or the kernel will complain, but maybe that's the secret to getting it to work ¯\_(ツ)_/¯
 

nexox

Well-Known Member
May 3, 2023
I have a single-port ConnectX-5, and I'm pretty sure I once saw documentation on how to create multiple PFs for a single port, but now I can't find anything about it. Does anyone know if that's possible? I could definitely do one PF per VLAN in my setup.
 

Endanger5372

New Member
Dec 4, 2024
5
3
3
Have you looked at devlink port split? I know it can be used to split 1x100GbE to 4x25GbE by breaking up the physical I/O channels. That might count as 4x PFs. Not sure how it'll play with any SFP modules or DACs you're using though so unless you have a breakout cable, I'd suggest doing it without an SFP installed.
 

nexox

Well-Known Member
May 3, 2023
1,949
968
113
That's an interesting idea, but after a couple of quick searches I didn't come up with any information on how to configure such a thing. Do you have a link to some documentation?
 

Endanger5372

New Member
Dec 4, 2024
5
3
3
I haven't done it myself but as far as I can see, there's not much to it. "devlink port show" will tell you whether a port can be split and into how many parts, "devlink port split <identifier> count <number>" does the split. I only know about it because I saw a QSFP breakout cable for sale once and was curious how it worked.
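In other words, something like this (the port identifier is an example; use whatever `devlink port show` reports for your card):

```shell
# check which ports exist and whether they are splittable
devlink port show

# split a splittable port into 4 lanes (e.g. 1x100G -> 4x25G)
devlink port split pci/0000:01:00.0/0 count 4

# undo the split later
devlink port unsplit pci/0000:01:00.0/0
```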
 

nexox

Well-Known Member
May 3, 2023
1,949
968
113
Well it looks like mine isn't splittable:
Code:
auxiliary/mlx5_core.eth.0/65535: type eth netdev eth4 flavour physical port 0 splittable false
 

Endanger5372

New Member
Dec 4, 2024
5
3
3
That's unfortunate; could you try my other idea and set up a flowtable with the representors in the flowtable devices but the VFs in the filters/actions?

Something along these lines:
Code:
table inet x {

    flowtable f {
        hook ingress priority 0;
        # port representor devices go here
        devices = { <vf1_rep>, <vf2_rep> };
        # request hardware offload (the thing we're testing)
        flags offload;
    }

    chain forward {
        type filter hook forward priority 0; policy drop;

        tcp dport { 80, 443 } ct state established flow offload @f counter
        ct state vmap { established : accept, related : accept, invalid : drop }

        # vf devices go here
        iifname <vf1> oifname <vf2> counter accept
    }
}