nftables flowtables hardware offload with ConnectX-5


nexox

Well-Known Member
May 3, 2023
1,825
881
113
That's unfortunate; could you try my other idea and try setting up a flowtable with representors in the flowtable devices but the VFs in the filters/actions?
I will have to try that next time I get the opportunity to mess around with that system; it's my primary router, and people tend not to like it when I break the internet for experiments.
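If I'm reading that suggestion right, the shape would be roughly this (completely untested, and the representor and VF names are just placeholders for whatever they're called on this box):

# Rough sketch only: representors in the flowtable device list,
# the VF netdevs referenced in the rule itself.
table inet ft_test {
    flowtable ft {
        hook ingress priority 0
        devices = { pf0vf0, pf0vf1 }
        flags offload
    }
    chain forward {
        type filter hook forward priority 0; policy accept;
        iifname { "enp65s0f0v0", "enp65s0f0v1" } ip protocol { tcp, udp } flow add @ft
    }
}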
 

t00l1024

New Member
Jul 31, 2020
9
9
3
I recognize the age of this thread, but I just came across it. I went down this path several years ago, and there were a few things that prohibited using the hardware offload features for routing:
  1. The ConnectX-5 NICs (actually CX4-LX, CX5, and newer) don't allow HW flows across PHY interfaces. So you can get a flow inserted, but you still have to process it in software.
  2. The HW offloading feature is really geared towards a VM or application talking to the NIC PHY directly, not PHY <-> PHY kinds of flows.
  3. For NAT, something has to keep state. The kernel? Then just process the flow in SW.
The approach I took was Open vSwitch, which makes flow handling easy. It also handles the decision to insert a HW flow and when to expire it. Two OVS bridges represented the WAN and LAN networks and could handle flow insertion into HW, but I still needed an app to handle the NAT state decision process.
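The basic shape, from memory (port names are placeholders, and the exact service name depends on the distro):

# Turn on OVS hardware offload (TC-based) and restart OVS so it takes effect.
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
systemctl restart openvswitch-switch    # service name varies by distro

# One bridge per side; enp1s0f0 = WAN port, enp1s0f1 = LAN port (placeholders).
ovs-vsctl add-br br-wan
ovs-vsctl add-port br-wan enp1s0f0
ovs-vsctl add-br br-lan
ovs-vsctl add-port br-lan enp1s0f1

# See which datapath flows actually made it into hardware:
ovs-appctl dpctl/dump-flows type=offloaded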
 
  • Like
Reactions: blunden and nexox

gerby

SREious Engineer
Apr 3, 2021
62
24
8
I've been trying to get OVS offloading working in a separate thread as well; everything _looks_ correct, but I never see OVS generate any offloaded flows for my SR-IOV-attached guests. This is a simple switching setup with VLANs out to the rest of the environment (CX4-LX, Proxmox).
 

nexox

Well-Known Member
May 3, 2023
1,825
881
113
3. For NAT, something has to keep state. The kernel? Then just process the flow in SW
The way I understand it, the kernel should be able to make a NAT decision, hand the flow off to the NIC, and then never see another packet from that flow until the connection ends or there's an error, in theory using much less CPU than software NAT. Obviously that's not simple.
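In nftables terms I'm picturing something like this (just a sketch, not a working config; wan0/lan0 are placeholder interface names):

table inet nat_offload {
    flowtable ft {
        hook ingress priority 0
        devices = { wan0, lan0 }
        flags offload    # ask the NIC to take the flow; drop this line for the software fast path
    }
    chain forward {
        type filter hook forward priority 0; policy accept;
        # once conntrack has made the NAT decision, hand established flows to the flowtable
        ct state established flow add @ft
    }
    chain postrouting {
        type nat hook postrouting priority srcnat; policy accept;
        oifname "wan0" masquerade
    }
}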
 

t00l1024

New Member
Jul 31, 2020
9
9
3
The way I understand it, the kernel should be able to make a NAT decision, hand the flow off to the NIC, and then never see another packet from that flow until the connection ends or there's an error, in theory using much less CPU than software NAT. Obviously that's not simple.
The key issue is having some code that can make the NAT decision and program the NIC with the right flow info.

Note: I never tried this with nftables, as in your post topic. Just with OVS, for the use case I was interested in.

The diagram on the Flowtables page of the nftables wiki would imply it's possible. I'll have to dig up all my notes on setting this up and will try again with nftables. I might have a CX4-LX, but I know for certain I have a CX5 for testing.

Are you wanting to use both CX5 interfaces? One for ingress and one for egress? Or to a different NIC on the host? Or to a VM?
 

nexox

Well-Known Member
May 3, 2023
1,825
881
113
The key issue is having some code that can make the NAT decision and program the NIC with the right flow info.
Flowtables was made for that, and the Mellanox driver supports it, though I haven't found any evidence that anyone has gotten it to work.

Are you wanting to use both CX5 interfaces? One for ingress and one for egress? Or to a different NIC on the host? Or to a VM?
A single interface routing between VLANs, bare metal. I suspect either the VLANs or the lack of a dual-port NIC is breaking things, but I haven't gotten around to digging through the layers of driver code to find out exactly why I'm hitting the error in my first post.
 
  • Like
Reactions: t00l1024

t00l1024

New Member
Jul 31, 2020
9
9
3
Found my old notes. Setting up my test systems now. I went down the Intel E810 route too. IIRC there wasn't too much difference between the Intel and Mellanox setups. Some worthwhile reading:
Unfortunately, Mellanox/NVIDIA moved around a bunch of their documentation so other URLs I have are dead. :-(
 

t00l1024

New Member
Jul 31, 2020
9
9
3
Flowtables was made for that, and the Mellanox driver supports it, though I haven't found any evidence that anyone has gotten it to work.


A single interface routing between VLANs, bare metal. I suspect either the VLANs or the lack of a dual-port NIC is breaking things, but I haven't gotten around to digging through the layers of driver code to find out exactly why I'm hitting the error in my first post.
I'm new to nftables. Are you using nftables to set the VLAN tags, or setting them on the representor interfaces?
 

nexox

Well-Known Member
May 3, 2023
1,825
881
113
I'm new to nftables. Are you using nftables to set the VLAN tags, or setting them on the representor interfaces?
Haven't had a chance to try representors, just doing standard kernel VLANs on one interface, which creates child interfaces, which I then use in nftables (mostly in the same way as iptables.)
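Concretely, something like this (port name and VLAN IDs are placeholders), with the VLAN child interfaces going into the flowtable device list:

ip link add link enp65s0 name enp65s0.10 type vlan id 10
ip link add link enp65s0 name enp65s0.20 type vlan id 20

table inet vlan_fwd {
    flowtable ft {
        hook ingress priority 0
        devices = { "enp65s0.10", "enp65s0.20" }
        flags offload    # presumably the combination the driver is rejecting
    }
    chain forward {
        type filter hook forward priority 0; policy accept;
        ip protocol { tcp, udp } flow add @ft
    }
}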
 
  • Like
Reactions: t00l1024

Scott Laird

Active Member
Aug 30, 2014
425
248
43
Given the similarity between Mellanox's NICs and their switch chips, I find it kind of funny that their NICs can (supposedly) do flowtable offloads when in switchdev mode under Linux, while the switch chips can do L3 offloads in switchdev mode, but supposedly the switches can't do flowtables and the NICs can't do L3.

If the switches could do flow offloads for NAT, then an SN2700 or SN2100 might actually be a viable home router :).
 
  • Like
Reactions: blunden and nexox

t00l1024

New Member
Jul 31, 2020
9
9
3
I feel like we need some agreement on what "offload" really means.

Modern NICs can "offload" some of the packet checksumming to reduce [kernel] overhead when processing packets (TX and/or RX). The purpose of enabling switchdev mode on a NIC is to place the packet (based on a flow tuple) in the right region of memory, reducing memory copies, interrupts, etc. This is typically done in conjunction with SR-IOV and virtual machines, where these problems exist. So "offload" in this context is more of an L2/L3/L4 header match.

Intel had some good videos on this topic years ago:
and

I'm not saying it isn't possible to treat a PCIe NIC as an in-hardware router, but some application is going to need to make decisions about what to do with a new flow (e.g. tc, the learning switch in OVS, etc.).

Note: If you want to use the eswitch/switchdev mode, you need to be using the created representor interfaces. Not the actual ones.
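For reference, flipping a ConnectX port into switchdev mode looks roughly like this (PCI address, VF count, and interface name are placeholders):

# Create the VFs, then switch the eswitch to switchdev mode. You may need to
# unbind the VFs from mlx5_core first for the mode change to succeed.
echo 2 > /sys/class/net/enp1s0f0/device/sriov_numvfs
devlink dev eswitch set pci/0000:01:00.0 mode switchdev

# Representor netdevs now appear on the host; those are what go into
# OVS / tc / flowtable rules, not the VFs or the PF itself.
devlink port show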
 
  • Like
Reactions: abq

t00l1024

New Member
Jul 31, 2020
9
9
3
Haven't had a chance to try representors, just doing standard kernel VLANs on one interface, which creates child interfaces, which I then use in nftables (mostly in the same way as iptables.)
I didn't get far enough to troubleshoot the nftables setup (VERY new to this). I might be able to cheat and use OVS, but that's probably not what you wanted to do. Will try later this week.
 
  • Like
Reactions: nexox

Scott Laird

Active Member
Aug 30, 2014
425
248
43
I feel like we need some agreement on what "offload" really means.

Modern NICs can "offload" some of the packet checksumming to reduce [kernel] overhead when processing packets (TX and/or RX). The purpose of enabling switchdev mode on a NIC is to place the packet (based on a flow tuple) in the right region of memory, reducing memory copies, interrupts, etc. This is typically done in conjunction with SR-IOV and virtual machines, where these problems exist. So "offload" in this context is more of an L2/L3/L4 header match.

Intel had some good videos on this topic years ago:
and

I'm not saying it isn't possible to treat a PCIe NIC as an in-hardware router, but some application is going to need to make decisions about what to do with a new flow (e.g. tc, the learning switch in OVS, etc.).

Note: If you want to use the eswitch/switchdev mode, you need to be using the created representor interfaces. Not the actual ones.
In general, "offload" in this sort of context (to me at least) just means that the hardware (NIC in this case) is able to do at least *some* of the work without the main system (and OS) having to do all of the work itself. So, for TCP send offloading, the OS can send a TCP header template and a blob of data (>1500 bytes) and the NIC can segment it and create packets on its own. Receive offloading works the same way in reverse. Segmentation isn't that hard, but it means that the OS can handle one interrupt per ~64k rather than one for every 1500 bytes, effectively getting better-than-jumbo-frames performance without needing any extra config.

The point is that the hardware is just doing (at the OS's behest) what the OS would have done itself anyway, just faster.

For flow/NAT/etc offloading, I want the OS to maintain a map of what goes where, but for some of the traffic, I want the NIC to be able to recognize a pattern provided by the OS (5-tuple, etc) and take actions on its own without involving the OS. Obviously offloading is going to be less flexible than doing it through the OS, but it should be dramatically faster *when the hardware supports the operations needed*. And presumably the OS can skip offloading especially difficult connections, because everything will just fall back to the OS's slow path anyway.
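One concrete way this shows up on Linux is tc flower with skip_sw, which refuses to install a rule unless the NIC actually takes it (interface names and addresses below are placeholders):

# skip_sw = the rule must live in hardware; the command fails if the NIC can't take it.
tc qdisc add dev enp1s0f0_0 ingress
tc filter add dev enp1s0f0_0 ingress protocol ip flower skip_sw \
    ip_proto tcp dst_ip 192.0.2.10 dst_port 443 \
    action mirred egress redirect dev enp1s0f0_1
tc -s filter show dev enp1s0f0_0 ingress    # offloaded rules are marked in_hw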

Linux seems to have a reasonable framework for a lot of network-related offloading; see IPng Networks - Debian on Mellanox SN2700 (32x100G) for example, where he installs Debian on a Mellanox SN2700 switch and Linux offloads basically all L3 routing to hardware *while still using the usual Linux management interfaces*.

The examples that I've seen for NAT and VXLAN offloading aren't quite as clean, but I'm not sure if those are just bad examples or if the offload interface for those isn't as well done. Plus, nVidia/Mellanox keeps moving their docs, *and* it's hard to tell which docs work with the ConnectX family and which require BlueField DPUs.
 

tjjh89017

New Member
Feb 10, 2021
6
0
1
Hi

I just found that a BF2 in DPU mode lets you set "flags offload" in the stock Ubuntu image to offload traffic to the eswitch on the ARM side.
But I cannot make VyOS on BlueField work with the hw offload features; I guess I'm missing some kernel config or something.

Another limitation: TC_SETUP_FT only works on mlx5_rep netdevs.
And some netdevs need to be in a bridge to offload.
I tested pf0hpf and pf1hpf, and they offloaded when bridged (pf0hpf is br0's slave, and the IP is on br0 so it acts as a router).
p0 and p1 are not mlx5_rep, so I guess I will need to create an SF, bridge the SF with p0, and put the IP on the SF to act as a router.

But I still cannot figure out what I'm missing for the hw offload, even though conntrack and mlx5 tracing showed HW_OFFLOAD and mlx5 called the driver-level flow setup.
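For reference, the checks I'm using (assuming conntrack-tools is installed):

nft list flowtables                # the flowtable should show "flags offload"
conntrack -L | grep HW_OFFLOAD     # hardware-offloaded conntrack entries carry this flag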
 


hmy

New Member
May 27, 2025
1
0
1
If you use a bridge, NAT performance will be lower than with software offload between two PHY interfaces.
 

mrpops2ko

New Member
Feb 12, 2017
19
14
3
34
I recently bought a ConnectX-5 and have been playing about with it - I'm reasonably confident that I've gotten all the offloading working.

I use Proxmox, and it was mostly a case of installing the DOCA / ASAP2 drivers and using an OVS bridge, while setting up switchdev mode and adding the representors.

The host and the VMs need the DOCA / ASAP2 drivers (which only work on Linux, not FreeBSD) to support it. You can, though, use LXC containers for the platforms the installer doesn't support and it'll work (since LXCs use the host kernel). You still need to pass the SR-IOV VFs through into the LXC container, though. So that would be your play if you wanted to set up VyOS with it, I suspect (I've set up OpenWrt in an LXC and it works there).

Be aware, though, that the VMs / LXCs aren't natively going to be aware that what they're doing is being offloaded; you won't see it represented in the VMs / LXCs themselves. The appropriate place to check for the offloading is on the host, using ovs-appctl dpctl/dump-flows type=offloaded.

I made a little script to parse that and some other commands so I could watch the traffic in real time to confirm it was all happening.
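Nothing fancy, roughly along these lines (from memory, not the exact script):

#!/bin/sh
# Refresh every second: a sample of offloaded flows plus offloaded vs software counts.
watch -n1 '
  ovs-appctl dpctl/dump-flows type=offloaded | head -15
  echo
  printf "offloaded flows: "; ovs-appctl dpctl/dump-flows type=offloaded | wc -l
  printf "software flows:  "; ovs-appctl dpctl/dump-flows type=ovs | wc -l
'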



To be further sure, I created some nested routing chains (VM > router > LXC) where the VM can only talk to the LXC via the router, and I gave them all 1x vCPU.

I could push 20 Gbit through them bidirectionally using iperf3, and if you cumulatively add it all up that's 80 Gbit of bidirectional traffic flowing through 1 vCPU. That is enough to convince me that it's all working as it should, because 160 Gbit of traffic shouldn't be possible to service on 1x vCPU with only minimal gains in utilisation (the core remained usable and the VM + LXCs seemed fine).

Some things to note, though: you need to re-engineer your networking a little, because a bunch of stuff will break the hardware offload. You can't use any kind of software bridge (so things like ipvlan and other custom Docker networks are out if you want full end-to-end offload).

With Docker you'd have to use host networking for your containers. If you do, though, you'll get the full offloads.

Also, CAKE SQM is out; that'll drop to software, it seems (or at least I'm basing that on the halving of bandwidth when trying it out). What is kind of neat, though, is that FQ_CODEL seems fine, so at least we still have one powerful SQM that can be offloaded.
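If you want to repeat the comparison, swapping the qdisc is a one-liner each way (eth0 and the rate are placeholders):

tc qdisc replace dev eth0 root cake bandwidth 10gbit   # CAKE: bandwidth halves, so presumably the software path
tc qdisc replace dev eth0 root fq_codel                # fq_codel: full speed in my testing
tc -s qdisc show dev eth0                              # check drops/backlog after an iperf3 run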
 

mrpops2ko

New Member
Feb 12, 2017
19
14
3
34
Just to clarify, I was incorrect about Docker ipvlan being out; it works fine with the hardware offloads, as do VLAN sub-interfaces in any VMs / LXCs. Likely macvlan will equally be fine, but I've not bothered to test.

This is assuming that you are binding ipvlan to eth0 / the interface directly. If you do something like a software bridge and then bind it to that, then it won't work. (Some implementations like Unraid do this, so you might need to change it if you make use of that.)
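i.e. the working shape is binding the ipvlan network to the interface itself (names and subnets are placeholders):

# Works: parent is the physical / SR-IOV interface directly.
docker network create -d ipvlan \
  --subnet 192.168.10.0/24 --gateway 192.168.10.1 \
  -o parent=eth0 lan_ipvlan

# Breaks the offload: parent is a software bridge (the Unraid-style setup).
# docker network create -d ipvlan -o parent=br0 ...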
 

nexox

Well-Known Member
May 3, 2023
1,825
881
113
I've been trying to do it with software bridges; I guess that's not going to work.
 

mrpops2ko

New Member
Feb 12, 2017
19
14
3
34

Did some SQM testing to see if anything specific could be observed. The above was completely uncapped (i.e. setting the limiters beyond what they could achieve, 100 Gbit).


And this was the data when I put them at 10 Gbit limiters.

I don't really know how to interpret the data, in that CAKE must end up dropping to software, because when uncapped its speed drops so much; but the sirq value is very low in the capped profiles, which I wasn't expecting to see. I'm going to give layer cake a shot when I finally finish migrating from pfSense+ to OpenWrt.
 
  • Like
Reactions: nasbdh9