Example of bridging across multiple SR-IOV links-- with one unsolved routing problem


nmpu

This is a proof of concept; there are still some security issues to iron out. My goal was to use SR-IOV between VMs instead of the usual software bridge. I have not run actual performance tests on throughput or CPU utilization. From what I've read, the virtual (hardware) NICs will only pass traffic if the physical link is active, and the 'virtual' bandwidth is limited to the external link rate. That sounds decent for a 10Gb link; for 1Gb, it's probably better to stick with the software bridge between VMs.

My test box (Dell/VMware Edge 680) has 2 10Gb links. I don't have anything else that can use 10Gb, so I put a Mellanox ConnectX-3 Pro in my PC. This gives me fast local access to Proxmox and OpenWrt. Because Home Assistant (HAOS) is more of a black box, I decided to just bridge across the 1Gb switch port. I've configured Debian (Proxmox host) to offer DHCP on 3 'admin' ports. That way I can just plug in a computer without messing with settings. I don't get access to the rest of the network (including internet) until the OpenWrt VM is running. The admin ports are configured to automatically disable routes when the link is down. I've also added route metrics so that a 10Gb bridge gets used when available. Here's what it looks like:
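The DHCP piece is just a standard setup; shown here as a dnsmasq sketch with illustrative pools, one per admin-port subnet (any DHCP server would do):
Code:
# /etc/dnsmasq.d/admin-ports.conf -- sketch only; pools are illustrative
# Only answer DHCP on the three admin ports, one pool per subnet.
interface=eno7
interface=eno8
interface=eno6
bind-interfaces
dhcp-range=10.0.1.100,10.0.1.150,12h
dhcp-range=10.0.2.100,10.0.2.150,12h
dhcp-range=10.0.6.100,10.0.6.150,12h
The 'disable routes when the link is down' behavior comes from the linkdown sysctl (per-interface or 'all'); that's what produces the 'dead linkdown' flags in the route dump further down:
Code:
sysctl -w net.ipv4.conf.all.ignore_routes_with_linkdown=1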

network.png
For those unfamiliar with SR-IOV, think of each vertical path as an internal switch. Everything works the way I want except the route from PC2 to PC1. A Debian console can ping everything. PC1 can ping PC2. PC2 can ping 10.0.5.1, but not 10.0.5.4 or 10.0.5.9. If I start the OpenWrt VM and keep just the default routes in Debian, then it also works. However, I shouldn't need the alternate path through OpenWrt. I would prefer to solve this with 'ip route add x'. I'm not prepared to delve into nftables just yet. I think the routes are required either way. I left the purple switch in my diagram, but the same issue remains if PC1 is directly connected to GE5.
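Before touching nftables, a couple of 'ip route get' checks at least confirm which way the kernel would forward things (10.0.1.100 is just a stand-in for whatever address PC2 picked up from DHCP):
Code:
# Sanity checks from the Debian host -- not a fix, just showing the kernel's routing decision.
ip route get 10.0.5.9 from 10.0.1.100 iif eno7   # path for a packet forwarded on behalf of PC2
ip route get 10.0.5.9                            # path for traffic generated on the host itself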

Here's the Debian network config:
Code:
auto lo
iface lo inet loopback

auto eno7
iface eno7 inet static
    address 10.0.1.1/24
    gateway 10.0.1.2
    up ip route replace dev eno8 metric 0 default via 10.0.1.2 # <- delete this line -- can't do strikethrough in a code block
    up echo 1 > /sys/devices/pci0000:00/0000:00:16.0/0000:05:00.1/sriov_numvfs

auto eno8
iface eno8 inet static
    address 10.0.2.1/24
    up ip route replace dev eno8 metric 1 default via 10.0.2.2
    up echo 1 > /sys/devices/pci0000:00/0000:00:16.0/0000:05:00.0/sriov_numvfs

auto eno6
iface eno6 inet static
    address 10.0.6.1/24
    up ip route replace dev eno6 metric 2 default via 10.0.6.2
    up echo 1 > /sys/devices/pci0000:00/0000:00:17.0/0000:07:00.0/sriov_numvfs

auto eno5
iface eno5 inet static
    address 10.0.5.1/24
    up ip route replace dev eno5 metric 3 default via 10.0.5.2
    #up ip route del 10.0.5.0/24
    up echo 2 > /sys/devices/pci0000:00/0000:00:17.0/0000:07:00.1/sriov_numvfs
and my current routes:
Code:
default via 10.0.1.2 dev eno7 proto kernel onlink
default via 10.0.2.2 dev eno8 metric 1 dead linkdown
default via 10.0.6.2 dev eno6 metric 2 dead linkdown
default via 10.0.5.2 dev eno5 metric 3
10.0.1.0/24 dev eno7 proto kernel scope link src 10.0.1.1
10.0.2.0/24 dev eno8 proto kernel scope link src 10.0.2.1 dead linkdown
10.0.5.0/24 dev eno5 proto kernel scope link src 10.0.5.1
10.0.6.0/24 dev eno6 proto kernel scope link src 10.0.6.1 dead linkdown
It's got to be something simple, right?
 

nmpu

OK, I don't think this is a routing problem. The route between the PCs should be symmetrical, but I get different results when I swap the PCs. That suggests timing differences. Could be lots of things. Even a bug beyond my control.
 

mattventura

If PC1 can ping PC2, but not the other way around, it sounds like maybe somewhere, there's a NAT rule when it should just be plain non-NAT routing. Without any sort of connection-tracking like with NAT, a successful ping would need correct routes in both directions. Maybe there's a duplicate IP or something like that somewhere.

I notice in your 'iface eno7' section, your 'ip route' command mentions eno8 - is that intentional?

As for a few other things:

From what I've read, virtual (hardware) NICs will only work if the link is active.
Some NICs allow the virtual links to be forced on even if the physical link is down.
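For example, on NICs/drivers that support it, something like this on the PF keeps the VF link up regardless of the cable (support varies by driver, so treat it as a sketch):
Code:
# Force the VF's link state instead of mirroring the physical link (driver-dependent):
ip link set eno5 vf 0 state enable
# 'auto' follows the physical link again; 'disable' forces the VF link down.
ip link set eno5 vf 0 state auto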

The 'virtual' bandwidth is also limited to the external link rate.
It depends. You can also get bottlenecked by the PCIe link itself, since there's no reason (apart from VM-to-VM SR-IOV traffic) to have more PCIe link speed than ethernet link speed.
 

nmpu

If PC1 can ping PC2, but not the other way around, it sounds like maybe somewhere, there's a NAT rule when it should just be plain non-NAT routing. Without any sort of connection-tracking like with NAT, a successful ping would need correct routes in both directions. Maybe there's a duplicate IP or something like that somewhere.
There's literally nothing other than the config I posted. It's classic subnet-to-subnet routing, and nftables is empty. This all happens outside of Proxmox, on the Debian host; no VMs are running. Since only my desktop has SFP ports, I can only test certain permutations. The desktop as PC2 is never able to ping the laptop as PC1, no matter whether it's connected via eno6, eno7, or eno8. The laptop as PC2 can always ping the desktop as PC1. Neither the desktop nor the laptop, as PC2, can ping the switch. All the routes work as expected from within the Debian console; the problem only occurs when traffic originates outside. If I delete the '10.0.5.0/24 dev eno5 proto kernel scope link src 10.0.5.1' route and fire up the OpenWrt VM (also currently devoid of any firewall rules), the traffic is successfully routed via the 'default' routes.

It's all very strange. My first thought was some sort of feedback loop causing a flood/timeout, but I don't see any opportunity. I really think there's a bug. How many people have attempted something this elaborate on this platform? I know that virtualization wasn't even possible until recently. I guess it's possible to trigger logging with nftables. That would be interesting.
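If I do try the logging idea, I think it's only a few rules -- something like this untested sketch:
Code:
# Untested sketch -- log everything the Debian host forwards between the 10.0.x.x subnets:
nft add table inet trace
nft 'add chain inet trace fwd { type filter hook forward priority 0; policy accept; }'
nft 'add rule inet trace fwd ip saddr 10.0.0.0/16 log prefix "fwd: "'
# then watch for the entries:
journalctl -kf | grep 'fwd:'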

There's still a bunch of things I can try:

1) Maybe leave the base NICs unused and spin off more virtual copies, or swap how they're used. I wasn't sure I could pass through the base NIC, so I kept it with the Debian host. I also wasn't sure how to configure a virtual interface that doesn't exist until it's spawned (see the sketch after this list).

2) Try vanilla Debian instead of the Proxmox install. I also ran some 'optimization' script. Who knows what that did.

3) Turn off virtualization. This would then be garden-variety routing.

4) Use different NICs. eno5-eno8 are part of the C3958. eno0-eno3 are external I350.
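On the 'interface that doesn't exist yet' part of point 1, I think ifupdown's allow-hotplug covers it -- the stanza just sits there until the VF actually appears. Something like this, assuming the VF gets a predictable name (it depends on the driver/udev, so check 'ip link' after writing sriov_numvfs):
Code:
# Sketch for point 1 -- 'eno8v0' and the address are placeholders for whatever
# the VF is actually named and whichever subnet it should live on.
allow-hotplug eno8v0
iface eno8v0 inet static
    address 10.0.2.3/24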


I notice in your 'iface eno7' section, your 'ip route' command mentions eno8 - is that intentional?
Nice catch. That line should be removed. The default route for eno7 is derived from the gateway setting. I couldn't use the gateway trick for the subsequent interfaces because it simply replaces any existing default route. The dump shows the correct routes, so I must have gotten lucky.


As far as transfer speed is concerned, I really can't anticipate a situation where I'd be transferring data between VMs and not already be limited by the external connection. To force the link on, is there some common setting in /sys/class/net?

Thanks for your interest. I was going to post this on one of those 'stack' forums, but they've got so many different websites and too many rules. They don't want a 'discussion'. Everybody wants Q/A stuff they can feed to an AI bot.
 

nmpu

I grabbed another laptop, and while disabling its Windows firewall, I decided to double-check the other machines. I discovered that the first laptop still had the Windows firewall enabled for 'public' networks. That's a perfectly reasonable setting if the laptop is used outside the home. I turned it off, and now I've got all 3 computers talking as expected.

The switch (Dell X1026P) allows multiple access IPs (different VLAN/subnet), but there's no associated gateway field. I haven't tried the DHCP option. Fortunately, the switch has an L2+ mode where you can add static routes. I was able to add a default route so that the switch could respond via the 10.0.5.1 gateway.

So, all is good. The sky is not falling. I knew it had to be something simple. I'm glad nothing is actually broken.

According to iperf3, I'm getting 12.4Gb/s across the 10Gb SR-IOV path and 1.24Gb/s across the 1Gb SR-IOV path. I do need at least 4 CPU cores to saturate the 10Gb link. However, we're talking about low-power C3x58 cores.
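The test itself is nothing fancy -- roughly the stock iperf3 pair (the address is a placeholder for whatever the far end actually has; -P runs parallel streams if a single one can't fill the link):
Code:
# Receiver:
iperf3 -s
# Sender (substitute the receiver's address; 4 parallel streams for 30 seconds):
iperf3 -c 10.0.5.4 -P 4 -t 30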
 