Fully utilizing the ConnectX-5 eSwitch/switchdev functionality in Proxmox VE

niekbergboer

Because I wanted to learn to work with RDMA, I got myself a ConnectX-5 (MCX512A-ACAT; 2x25GbE); a second one will follow soon. In addition to doing RDMA (and zero-touch RoCE in particular), these chips also have extended eSwitch functionality that is exposed through the Linux switchdev framework. I have installed the card in a Proxmox VE 8 machine (running a Linux 6.8 kernel), and I am using the standard mlx5_core module for it.

Getting the card to work was easy enough; it pretty much worked out of the gate. However, I have struggled to get everything out of the switchdev functionality. I see various pieces of documentation online, and it is not always clear what requires the Mellanox (now NVIDIA) proprietary driver and what should work with the standard kernel module in the Linux 6.8 range.

The relevant part of my current /etc/network/interfaces looks like:

auto enp101s0f0np0
iface enp101s0f0np0 inet manual
    post-up ip link set dev $IFACE promisc on
#Disable RX VLAN filtering in hardware offload
    pre-up ethtool -K $IFACE rx-vlan-filter off

auto vmbr0
iface vmbr0 inet manual
    bridge-ports enp101s0f0np0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-510
    post-up devlink dev eswitch set pci/0000:65:00.0 mode switchdev
    post-up ip link set enp101s0f0np0 master vmbr0


Proxmox then adds VMs to vmbr0. I am not 100% sure about the pre-up ethtool -K $IFACE rx-vlan-filter off part; that was suggested somewhere (I forgot where) to get SR-IOV working, but more on that later.

Sure enough, the above works: data flows between the host and the network, between VMs and the host, and between VMs and the network. However, how can I tell how much of the networking is offloaded to the switchdev, if any at all? The other question is: what is required for switchdev to work for VMs? Do I need to use SR-IOV virtual functions or, conversely, must I not? Does anyone have any experience here?
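
Two basic things can be checked before looking at individual flows: which mode the eSwitch is in, and whether TC offload is enabled on the port. The sketch below only parses the text that `devlink dev eswitch show pci/<address>` and `ethtool -k <device>` print. The function names are made up for illustration, and the assumed output shapes (a `mode <value>` pair from devlink, `feature: on|off` lines from ethtool) should be checked against your own tool versions. Note that `hw-tc-offload: on` only means rules may be offloaded, not that any traffic actually is.

```shell
#!/bin/bash

# Print the eSwitch mode ("legacy" or "switchdev") from the output of
#   devlink dev eswitch show pci/<address>
# read on stdin. Assumes the output contains the word "mode" followed by its value.
eswitch_mode() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "mode") { print $(i + 1); exit } }'
}

# Print the state ("on" or "off") of the feature named in $1 from the output of
#   ethtool -k <device>
# read on stdin, for example hw-tc-offload or rx-vlan-filter.
ethtool_feature() {
  awk -v f="$1:" '$1 == f { print $2; exit }'
}
```

Usage would look like `devlink dev eswitch show pci/0000:65:00.0 | eswitch_mode` and `ethtool -k enp101s0f0np0 | ethtool_feature hw-tc-offload`.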

Finally, the docs at NVIDIA seem to indicate that if you use SR-IOV, you should in fact be able to use RoCE/RDMA from a VM. However, it is not clear to me whether, in that situation, the virtual function is even connected to the eSwitch, and to what extent you can even communicate between the host and the VM at that point. Any pointers?

Thanks in advance
 

NablaSquaredG

I think offloads are only supported if you use Open vSwitch (see here: "OVS Offload Using ASAP² Direct").

niekbergboer said:
However, how can I tell how much of the networking is offloaded to the switchdev, if any at all?

Code:
Run traffic from the VFs and observe the rules added to the OVS data-path.

# ovs-dpctl dump-flows
 
recirc_id(0),in_port(3),eth(src=e4:11:22:33:44:50,dst=e4:1d:2d:a5:f3:9d),
eth_type(0x0800),ipv4(frag=no), packets:33, bytes:3234, used:1.196s, actions:2
 
recirc_id(0),in_port(2),eth(src=e4:1d:2d:a5:f3:9d,dst=e4:11:22:33:44:50),
eth_type(0x0800),ipv4(frag=no), packets:34, bytes:3332, used:1.196s, actions:3
In the example above, the ping was initiated from VF0 (OVS port 3) to the outer node (OVS port 2), where the VF MAC is e4:11:22:33:44:50 and the outer node MAC is e4:1d:2d:a5:f3:9d
As shown above, two OVS rules were added, one in each direction.
Note that you can also verify offloaded packets by adding type=offloaded to the command. For example:

# ovs-appctl dpctl/dump-flows type=offloaded
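
To turn the two dump commands above into a single number, the flows in each dump can be counted and compared. This is only a rough sketch: the function names are made up, and it assumes every datapath flow in the dump carries exactly one `actions:` field, as in the sample output above.

```shell
#!/bin/bash

# Count datapath flows in `ovs-appctl dpctl/dump-flows` output read on stdin.
# Every flow is assumed to contain exactly one "actions:" field.
count_flows() {
  grep -c 'actions:' || true
}

# offload_pct TOTAL OFFLOADED
# Print the share of offloaded flows as a percentage with one decimal.
offload_pct() {
  awk -v t="$1" -v o="$2" 'BEGIN {
    if (t <= 0) { print "0.0"; exit }
    printf "%.1f\n", 100 * o / t
  }'
}
```

For example: `total=$(ovs-appctl dpctl/dump-flows | count_flows)`, `off=$(ovs-appctl dpctl/dump-flows type=offloaded | count_flows)`, then `offload_pct "$total" "$off"`. Flows come and go, so the two dumps are snapshots taken at slightly different moments.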
niekbergboer said:
Finally, the docs at NVIDIA seem to indicate that if you use SR-IOV, you should in fact be able to use RoCE/RDMA from a VM. However, it is not clear to me whether, in that situation, the virtual function is even connected to the eSwitch, and to what extent you can even communicate between the host and the VM at that point. Any pointers?

Generally, you probably won't configure all the OVS stuff via /etc/network/interfaces, as it does not support everything you need. Expect to write shell scripts.

There are two ways to connect VMs to the OVS bridge:
1. Via normal VirtIO -> Instead of the normal kernel bridge, you use an OVS bridge. The configuration is similar; you just need to add the name of the OVS bridge to /etc/network/interfaces so that you can select it in the UI.
2. Via SR-IOV -> You create VFs, add their representors to the OVS bridge, and then pass the VFs through to the VM.


An example of a pretty lame, verbose and hardcoded config script:
Code:
[Unit]
Description=Script to enable SR-IOV on boot
After=ovs-vswitchd.service
# networking.service needs interfaces that we create here
Before=networking.service

[Service]
Type=oneshot
# Init SR-IOV
ExecStart=/usr/bin/bash -c '/usr/bin/echo 8 > /sys/class/net/ens21f0np0/device/sriov_numvfs'
ExecStart=/usr/bin/bash -c '/usr/bin/echo 0 > /sys/class/net/ens21f1np1/device/sriov_numvfs'

# Set static MAC for VFs
ExecStart=/usr/bin/bash -c '/usr/bin/ip link set ens21f0np0 vf 0 mac d2:77:a4:7c:0e:02'
ExecStart=/usr/bin/bash -c '/usr/bin/ip link set ens21f0np0 vf 1 mac d2:77:a4:7c:0e:03'
ExecStart=/usr/bin/bash -c '/usr/bin/ip link set ens21f0np0 vf 2 mac d2:77:a4:7c:0e:04'
ExecStart=/usr/bin/bash -c '/usr/bin/ip link set ens21f0np0 vf 3 mac d2:77:a4:7c:0e:05'
ExecStart=/usr/bin/bash -c '/usr/bin/ip link set ens21f0np0 vf 4 mac d2:77:a4:7c:0e:06'
ExecStart=/usr/bin/bash -c '/usr/bin/ip link set ens21f0np0 vf 5 mac d2:77:a4:7c:0e:07'
ExecStart=/usr/bin/bash -c '/usr/bin/ip link set ens21f0np0 vf 6 mac d2:77:a4:7c:0e:08'
ExecStart=/usr/bin/bash -c '/usr/bin/ip link set ens21f0np0 vf 7 mac d2:77:a4:7c:0e:09'


# Unbind VFs
ExecStart=/usr/bin/bash -c 'echo 0000:b3:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind'
ExecStart=/usr/bin/bash -c 'echo 0000:b3:00.3 > /sys/bus/pci/drivers/mlx5_core/unbind'
ExecStart=/usr/bin/bash -c 'echo 0000:b3:00.4 > /sys/bus/pci/drivers/mlx5_core/unbind'
ExecStart=/usr/bin/bash -c 'echo 0000:b3:00.5 > /sys/bus/pci/drivers/mlx5_core/unbind'
ExecStart=/usr/bin/bash -c 'echo 0000:b3:00.6 > /sys/bus/pci/drivers/mlx5_core/unbind'
ExecStart=/usr/bin/bash -c 'echo 0000:b3:00.7 > /sys/bus/pci/drivers/mlx5_core/unbind'
ExecStart=/usr/bin/bash -c 'echo 0000:b3:01.0 > /sys/bus/pci/drivers/mlx5_core/unbind'
ExecStart=/usr/bin/bash -c 'echo 0000:b3:01.1 > /sys/bus/pci/drivers/mlx5_core/unbind'


# Change eSwitch mode from Legacy to Switchdev
ExecStart=/usr/bin/bash -c '/usr/sbin/devlink dev eswitch set pci/0000:b3:00.0 mode switchdev'
ExecStart=/usr/bin/bash -c '/usr/sbin/devlink dev eswitch set pci/0000:b3:00.1 mode switchdev'

# Add bond network device
ExecStart=/usr/bin/bash -c '/usr/bin/ip link add dev bond1 type bond mode 802.3ad xmit_hash_policy layer2+3 lacp_rate fast miimon 100'
ExecStart=/usr/bin/bash -c '/usr/bin/ip link set dev ens21f0np0 master bond1'
ExecStart=/usr/bin/bash -c '/usr/bin/ip link set dev ens21f1np1 master bond1'

# Start init of openvswitch
ExecStart=/usr/bin/bash -c '/usr/bin/ovs-vsctl add-br ovs-sriov'
ExecStart=/usr/bin/bash -c '/usr/bin/ovs-vsctl set Open_vSwitch . other_config:hw-offload=true'
ExecStart=/usr/bin/bash -c 'systemctl restart openvswitch-switch.service'

ExecStart=/usr/bin/bash -c '/usr/bin/ovs-vsctl add-port ovs-sriov bond1'

ExecStart=/usr/bin/bash -c '/usr/bin/ovs-vsctl add-port ovs-sriov ens21f0npf0vf0'
ExecStart=/usr/bin/bash -c '/usr/bin/ovs-vsctl add-port ovs-sriov ens21f0npf0vf1'
ExecStart=/usr/bin/bash -c '/usr/bin/ovs-vsctl add-port ovs-sriov ens21f0npf0vf2'
ExecStart=/usr/bin/bash -c '/usr/bin/ovs-vsctl add-port ovs-sriov ens21f0npf0vf3'
ExecStart=/usr/bin/bash -c '/usr/bin/ovs-vsctl add-port ovs-sriov ens21f0npf0vf4'
ExecStart=/usr/bin/bash -c '/usr/bin/ovs-vsctl add-port ovs-sriov ens21f0npf0vf5'
ExecStart=/usr/bin/bash -c '/usr/bin/ovs-vsctl add-port ovs-sriov ens21f0npf0vf6'
ExecStart=/usr/bin/bash -c '/usr/bin/ovs-vsctl add-port ovs-sriov ens21f0npf0vf7'


# Up all the ifs
ExecStart=/usr/bin/bash -c '/usr/bin/ip link set dev ovs-sriov up'
ExecStart=/usr/bin/bash -c '/usr/bin/ip link set dev bond1 up'
ExecStart=/usr/bin/bash -c '/usr/bin/ip link set dev ens21f0np0 up'
ExecStart=/usr/bin/bash -c '/usr/bin/ip link set dev ens21f1np1 up'
ExecStart=/usr/bin/bash -c '/usr/bin/ip link set dev ens21f0npf0vf0 up'
ExecStart=/usr/bin/bash -c '/usr/bin/ip link set dev ens21f0npf0vf1 up'
ExecStart=/usr/bin/bash -c '/usr/bin/ip link set dev ens21f0npf0vf2 up'
ExecStart=/usr/bin/bash -c '/usr/bin/ip link set dev ens21f0npf0vf3 up'
ExecStart=/usr/bin/bash -c '/usr/bin/ip link set dev ens21f0npf0vf4 up'
ExecStart=/usr/bin/bash -c '/usr/bin/ip link set dev ens21f0npf0vf5 up'
ExecStart=/usr/bin/bash -c '/usr/bin/ip link set dev ens21f0npf0vf6 up'
ExecStart=/usr/bin/bash -c '/usr/bin/ip link set dev ens21f0npf0vf7 up'

# Bind first VF to host
ExecStart=/usr/bin/bash -c 'echo 0000:b3:00.2 > /sys/bus/pci/drivers/mlx5_core/bind'

[Install]
WantedBy=multi-user.target network-online.target
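
The repeated per-VF lines in a unit like the one above can also be generated in a loop. A rough sketch, assuming the same MAC series as above and relying on the `virtfnN` symlinks that the kernel creates under the PF's device directory in sysfs; the function names and the `SYSFS_NET` override are made up for illustration:

```shell
#!/bin/bash

# Root of the netdev tree in sysfs. Overridable (an assumption of this sketch)
# so that the functions can be exercised without real hardware.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# Print the MAC address for VF index $1, continuing the
# d2:77:a4:7c:0e:02, :03, ... series used above (valid for indexes 0..253).
vf_mac() {
  printf 'd2:77:a4:7c:0e:%02x\n' "$(( $1 + 2 ))"
}

# Print the PCI address of every VF of the PF netdev $1, one per line,
# in VF index order, by following the virtfnN symlinks.
list_vf_pci() {
  local pf="$1" i=0 link
  while link="${SYSFS_NET}/${pf}/device/virtfn${i}" && [ -L "$link" ]; do
    basename "$(readlink "$link")"
    i=$(( i + 1 ))
  done
}
```

With these, the MAC assignment becomes `ip link set ens21f0np0 vf "$i" mac "$(vf_mac "$i")"` inside a counting loop, and the unbind step becomes a loop over `list_vf_pci ens21f0np0` instead of a hardcoded list of PCI addresses.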
 

niekbergboer

It took me a bit of tinkering, but I managed to get it working. The concept that threw me off is the difference between the actual SR-IOV virtual function interfaces and the "representor interfaces" that the host OS talks to (regardless of whether the VFs are bound to the host).

This works for me as a script in /usr/local/etc/connectx5.sh, which is started as the only ExecStart command in the connectx5.service file:

Code:
#!/bin/bash

# Primary device name and location.
DEVNAME=enp101s0f0np0
DEVPCIBASE=0000:65:00

# Add SR-IOV virtual functions.
/usr/bin/echo 4 > /sys/class/net/${DEVNAME}/device/sriov_numvfs

# Set MAC addresses for the virtual functions.
/usr/bin/ip link set ${DEVNAME} vf 0 mac d2:77:a4:7c:0e:02
/usr/bin/ip link set ${DEVNAME} vf 1 mac d2:77:a4:7c:0e:03
/usr/bin/ip link set ${DEVNAME} vf 2 mac d2:77:a4:7c:0e:04
/usr/bin/ip link set ${DEVNAME} vf 3 mac d2:77:a4:7c:0e:05

# Unbind the virtual functions.
/usr/bin/echo ${DEVPCIBASE}.2 > /sys/bus/pci/drivers/mlx5_core/unbind
/usr/bin/echo ${DEVPCIBASE}.3 > /sys/bus/pci/drivers/mlx5_core/unbind
/usr/bin/echo ${DEVPCIBASE}.4 > /sys/bus/pci/drivers/mlx5_core/unbind
/usr/bin/echo ${DEVPCIBASE}.5 > /sys/bus/pci/drivers/mlx5_core/unbind

# Enable the eSwitch.
/usr/sbin/devlink dev eswitch set pci/${DEVPCIBASE}.0 mode switchdev

# Set up OpenVSwitch.
/usr/bin/ovs-vsctl add-br vmbr2
/usr/bin/ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
/usr/bin/ovs-vsctl set Open_vSwitch . other_config:tc-policy=skip_sw
systemctl restart openvswitch-switch.service

# Add the first network port as well as the representor devices.
/usr/bin/ovs-vsctl add-port vmbr2 "${DEVNAME}"
for i in $(ls "/sys/class/net/${DEVNAME}/device/net/" | grep -vx "${DEVNAME}"); do
  /usr/bin/ovs-vsctl add-port vmbr2 "${i}"
done

# Bring up the switch, the physical port, and the representor devices.
/usr/bin/ip link set dev vmbr2 up
/usr/bin/ip link set dev "${DEVNAME}" up
for i in $(ls "/sys/class/net/${DEVNAME}/device/net/" | grep -vx "${DEVNAME}"); do
  /usr/bin/ip link set dev "${i}" up
done

# Bind first VF to host
echo ${DEVPCIBASE}.2 > /sys/bus/pci/drivers/mlx5_core/bind
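
An alternative way to find the representor devices, which does not depend on the listing under the PF's device directory, is to look at the `phys_port_name` attribute of each netdev. This sketch assumes that VF representors report names of the form `pf<N>vf<M>` there (as mlx5 does, to my knowledge); the function name and the `SYSFS_NET` override are made up for illustration. It does not check `phys_switch_id`, so on a host with several eSwitches it would list the representors of all of them.

```shell
#!/bin/bash

# Root of the netdev tree in sysfs. Overridable (an assumption of this sketch)
# so that the function can be exercised without real hardware.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# Print "<netdev> <phys_port_name>" for every netdev whose phys_port_name
# looks like a VF representor (pf<N>vf<M>). Netdevs without a readable
# phys_port_name are skipped.
list_vf_representors() {
  local dev name
  for dev in "${SYSFS_NET}"/*; do
    name="$(cat "${dev}/phys_port_name" 2>/dev/null)" || continue
    case "$name" in
      pf[0-9]*vf[0-9]*) printf '%s %s\n' "$(basename "$dev")" "$name" ;;
    esac
  done
}
```

The first column of the output can then be fed to the `ovs-vsctl add-port` and `ip link set ... up` loops.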
 

niekbergboer

As an addendum to the above: I ran into networking issues, and I found that adding

/usr/bin/ovs-vsctl set Open_vSwitch . other_config:tc-policy=skip_sw

to the Open_vSwitch config helped. Essentially, all networking would lock up, and the console would say:
.. No such timeout policy "ovs_test_tp"

I found this info in Estonian (good thing Chrome can translate pages!): Intel E810 – Imre kasutab arvutit
 

gerby

I'm quite curious what difference you're seeing in packet processing rate between software-switched SR-IOV and offloaded SR-IOV; have you run TRex or anything else to test the limits?
 

gerby

Does anyone have any info on the difference in hardware acceleration present in the CX4LX vs the later cards? I have allegedly got this all configured correctly; guests are passing traffic over SR-IOV interfaces ... but I'm not seeing any offloaded flows.
 

florian21de

gerby said:
Does anyone have any info on the difference in hardware acceleration present in the CX4LX vs the later cards? I have allegedly got this all configured correctly; guests are passing traffic over SRIOV interfaces ... but I'm not seeing any offloaded flows.

Hi, I'm encountering the same issue with a CX4LX.
 
Jun 21, 2023
61
18
8
I just spent around 18 hours straight, with no sleep, messing around with this. I am trying on bare metal, which I see no one talking about.
I had no luck in Unraid or TrueNAS.
Right near the end I installed Windows Server 2022, since my main PC runs Windows 11, and this made getting RDMA working much easier.
I tried with some 40 gig cards; all I have on hand is one single-port 40 gig CX4 and one dual-port 40 gig CX3 card.
I set the CX4 card to run RoCE v1, since the dual-port 40 gig CX3 cards can only do RoCE v1.

I was only able to get a steady 20 Gbps file transfer (18 GB Linux ISOs), peaking briefly at 36 Gbps, since my main daily PC only has an x4 slot available, and the limit on that is 31 Gbps, not including any overhead.

I am going to mess around with it some more; the Windows implementation for setting up iSCSI is kind of janky IMHO compared to something like TrueNAS.

I am eyeballing an Epyc system soon.


I have one single-port 25GbE CX4 card and am about to order another one. I am trying to learn more about it: is it best to use dual-port NICs?
 

gerby

I've had no problems getting RoCE v2 working in Windows or ksmbd, either virtual or bare metal; what this thread is focusing on is the eSwitch capabilities that provide tc offload and ASAP² functionality in *nix. All that said, get the number of ports you're going to need for your workload!
 

niekbergboer

The latest upgrade to 6.14.8 broke the setup for me: somehow, either switching on the eSwitch does not work, or the representor devices are no longer listed under /sys/class/net/<NIC>/device/net.

I reverted to 6.14.5 for now, but this is a blocker for me to upgrade to Proxmox VE 9.0.

Did anyone else see such trouble when upgrading to 6.14.8?
 

thulle

What speeds are people getting between VMs with this?
I started to evaluate this on my desktop with the config script above, minus the bonding part, setting up two VMs with 4 cores and a VF each. I am getting iperf speeds slightly below 10 Gbps with both 1 and 4 threads, which was a bit underwhelming seeing that a software virtio bridge would do 16 and 36 Gbps.
I tried disabling all configuration where the interface is part of software bridges, just to avoid them maybe forcing software involvement in the offloaded bridge, but it made no difference.
All benchmarked on a Ryzen 5950X with boost frequencies disabled to give more stable results.
 

mattventura

thulle said:
What speeds are people getting between VMs with this?
I started to evaluate this on my desktop and with the config script above minus the bonding part, setting up two VMs with 4 cores and a VF each, and I am getting iperf speeds at 1 & 4 threads slightly below 10Gbps in both cases, which was a bit underwhelming seeing that a software virtio bridge would do 16 & 36Gbps.
I tried disabling all configuration where the interface is part of software bridges just to avoid them maybe forcing software involvement of the offloaded bridge, but no difference.
All benchmarked @ ryzen 5950x with boosting frequencies disabled to give more stable results.

What PCIe link speed/width? Keep in mind that everything would have to go through your PCIe link. It can result in a lower ceiling than software bridging as a result.
 

thulle

mattventura said:
What PCIe link speed/width? Keep in mind that everything would have to go through your PCIe link. It can result in a lower ceiling than software bridging as a result.

The card is bottlenecked by a PCIe 4.0 x4 link, so ~63 Gbit. I've gotten slightly above 50 Gbps from a VM on a server. Even if data has to go over that link twice, shouldn't it be a bit higher?
The plan is to shift things around a bit so I can use a PCIe 4.0 x8 slot instead, but I have to replace the 4-slot-wide GPU for that. But if the PCIe link were the issue, I'd still only double to 20 Gbps?
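
For reference, the link figures quoted in this thread (~32 Gbit for PCIe 3.0 x4, ~63 Gbit for PCIe 4.0 x4) follow from the per-lane transfer rate and the line encoding. A minimal sketch of that arithmetic; the function name `pcie_gbps` is made up for illustration, and the result is the raw link rate per direction before packet (TLP/DLLP) overhead, so achievable throughput is lower:

```shell
#!/bin/bash

# pcie_gbps GENERATION LANES
# Print the raw per-direction link rate in Gbit/s for PCIe generations 1..5.
# Gen 1/2 use 8b/10b encoding, gen 3 and later use 128b/130b.
pcie_gbps() {
  awk -v gen="$1" -v lanes="$2" 'BEGIN {
    rate[1] = 2.5; rate[2] = 5; rate[3] = 8; rate[4] = 16; rate[5] = 32   # GT/s per lane
    if (!(gen in rate)) exit 1
    enc = (gen <= 2) ? 8 / 10 : 128 / 130
    printf "%.1f\n", rate[gen] * enc * lanes
  }'
}

pcie_gbps 3 4   # prints 31.5
pcie_gbps 4 4   # prints 63.0
pcie_gbps 3 8   # prints 63.0
```

This also shows why a PCIe 3.0 x8 card behind a PCIe 4.0 x4 uplink loses nothing on paper: both come out at about 63 Gbit/s.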
 

adolfotregosa

thulle said:
The card is bottlenecked by a PCIe4x4 link, so ~63Gbit. I've gotten slightly above 50Gbps from a VM on a server. Even if data has to go twice over that link it should be a bit higher?
Plan is to shift things around a bit so I can use a PCIe4x8 instead, but gotta replace the 4slot wide GPU for that. But if the PCIe link was the issue I'd still only double to 20Gbps?

Which Mellanox card are you using? They are pretty much all PCIe 3.0, so ~32 Gbit at PCIe 3.0 x4!
 

thulle

adolfotregosa said:
which mellanox card are you using ? They are pretty much almost all pcie 3.0, so 32Gbit at pcie 3.0 4x !

Then I would be capped at ~32 Gbit from the server instead of the ~50 I get, wouldn't I?

Part Number: MCX556A-EDA_Ax_Bx
Description: ConnectX-5 Ex VPI adapter card; EDR IB (100Gb/s) and 100GbE; dual-port QSFP28; PCIe4.0 x16
The chipset PCIe slot on my motherboard is PCIe 4.0 x8 electrically, though, so I can get the same speed with a PCIe 3.0 x8 card too. I used an MCX455A-ECAT before this, running PCIe 3.0 x8 to the chipset, which then talks PCIe 4.0 x4 with the CPU.

lspci says:

LnkSta: Speed 16GT/s, Width x8 (downgraded)
16GT/s would be PCIe4, and width is x8 to the chipset, and then there's the x4 chokepoint which maybe is why it says downgraded? Or maybe it just says that because it doesn't have the full x16 link that the card wants.
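
For scripting this check, the speed and width can be pulled out of the `LnkSta` line of `lspci -vv`. A small sketch with a made-up function name; it accepts both the format shown above and the variant with a status after the speed, such as `Speed 16GT/s (ok)`. As far as I know, lspci prints "(downgraded)" when the negotiated value is below what the device advertises in `LnkCap`, and it only describes the link directly at the card, so a narrower link further upstream is not visible here.

```shell
#!/bin/bash

# Read `lspci -vv` output on stdin and print "<speed in GT/s> <lane count>"
# for every LnkSta line found.
parse_lnksta() {
  sed -n 's/.*LnkSta:[[:space:]]*Speed \([0-9.]*\)GT\/s[^,]*, Width x\([0-9]*\).*/\1 \2/p'
}

echo 'LnkSta: Speed 16GT/s, Width x8 (downgraded)' | parse_lnksta   # prints 16 8
```

The same values are also exposed in sysfs as `current_link_speed` and `current_link_width` under `/sys/bus/pci/devices/<address>/`.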

I've just upgraded the CPU in the server (thanks tugm4470) and thus wiped the CPU pinnings that can give a bit of extra speed, but the card genuinely is capable of delivering more than 32 Gbit:

$ iperf -c 172.17.37.33 -P4
------------------------------------------------------------
Client connecting to 172.17.37.33, TCP port 5001
TCP window size: 256 KByte (default)
------------------------------------------------------------
[ 2] local 172.17.37.34 port 47748 connected with 172.17.37.33 port 5001
[ 4] local 172.17.37.34 port 47776 connected with 172.17.37.33 port 5001
[ 3] local 172.17.37.34 port 47770 connected with 172.17.37.33 port 5001
[ 1] local 172.17.37.34 port 47758 connected with 172.17.37.33 port 5001
[ ID] Interval Transfer Bandwidth
[ 1] 0.00-10.02 sec 14.0 GBytes 12.0 Gbits/sec
[ 3] 0.00-10.02 sec 13.8 GBytes 11.9 Gbits/sec
[ 2] 0.00-10.02 sec 13.9 GBytes 11.9 Gbits/sec
[ 4] 0.00-10.02 sec 13.9 GBytes 11.9 Gbits/sec
[SUM] 0.00-10.02 sec 55.6 GBytes 47.7 Gbits/sec
 

adolfotregosa

thulle said:
Then I would be capped at ~32Gbit from the server instead of the ~50 i get, wouldn't I?
The chipset PCIe-slot on my motherboard is PCIe4x8 electrically though, so I can get the same speed with a PCIe3x8 too, I used a MCX455A-ECAT before this, running PCIe3x8 to the chipset that then talks PCIe4x4 with the CPU.
lspci says:
16GT/s would be PCIe4, and width is x8 to the chipset, and then there's the x4 chokepoint which maybe is why it says downgraded? Or maybe it just says that because it doesn't have the full x16 link that the card wants.
I've just upgraded the cpu in the server (thanks tugm4470) and thus wiped the CPU pinnings that can give a bit of extra speed, but the card genuinely is capable of delivering more than the 32Gbit:

That’s why I asked for the model of your card. Yours is a PCIe 4.0 card. Most Mellanox cards we buy off eBay to play around with are PCIe 3.0 :)

I’m currently learning and experimenting with SR-IOV and eSwitch features out of curiosity. So far, there’s only one situation where SR-IOV clearly wins: when the goal is to reduce host CPU usage and power consumption, and you need to move traffic in or out of the PF, as with an OPNsense VM. For inter-VM traffic (I’ve just been running iperf3 benchmarks), with either Linux bridge or OVS and virtio NICs, I can reach much higher throughput on my consumer-grade system: around 60 Gbit VM-to-VM and around 100 Gbit between the host and VMs or LXC containers.

Right now, I have an Intel X710 PCIe 3.0 x8 card in a PCIe 4.0 x4 slot. iperf3 between VFs gives me around 20 to 23 Gbit. I tried an E810 (so I could test the eSwitch features), which is PCIe 4.0, and got 40 to 43 Gbit in legacy mode but only around 20 Gbit in switchdev mode. CPU usage and power consumption did not go down either (with iperf3 locked at comparable speeds, of course), which is where my knowledge pretty much runs out. I was expecting inter-VM traffic to be hardware offloaded as well, but clearly I do not fully understand what is going on.

When I cap iperf3 at e.g. 20 Gbit (the max of the X710) or 40 Gbit (the max of the E810) and check CPU usage and wall power consumption, I get basically the same readings whether I am using VFs or Linux bridge/OVS with virtio NICs for inter-VM iperf3 runs.
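
One way to put a number on the host CPU side of such a comparison is to sample the aggregate `cpu` line of `/proc/stat` on the host before and after a capped run (iperf3 accepts a target bitrate via `-b`). A sketch with a made-up function name; it sums only the fields from user through steal, because guest time is already included in user time in `/proc/stat`, and it counts iowait as idle:

```shell
#!/bin/bash

# cpu_busy_pct "<first cpu line>" "<second cpu line>"
# Print the percentage of non-idle CPU time between two samples of the
# aggregate "cpu" line from /proc/stat.
cpu_busy_pct() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    {
      total = 0
      for (i = 2; i <= 9; i++) total += $i   # user nice system idle iowait irq softirq steal
      idle = $5 + $6                          # idle + iowait
      if (NR == 1) { t0 = total; i0 = idle } else { t1 = total; i1 = idle }
    }
    END {
      dt = t1 - t0
      if (dt <= 0) exit 1
      printf "%.1f\n", 100 * (dt - (i1 - i0)) / dt
    }'
}
```

Usage on the host would be along the lines of `a=$(head -n1 /proc/stat); sleep 10; b=$(head -n1 /proc/stat); cpu_busy_pct "$a" "$b"` while the benchmark is running in the VMs.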

I’m waiting on a ConnectX-4 Lx from eBay to see if its eSwitch changes anything. So far, the only situation where SR-IOV has shown a clear advantage is with my OPNsense VM NICs. When I run internet speed tests from my other VMs, host CPU usage and power consumption are much higher with Linux bridge/OVS and virtio NICs than with SR-IOV VFs passed through to the OPNsense VM alone. Whether the other VMs use virtio NICs or VFs does not make a difference.

Can anyone confirm if my findings are correct so far? I might just be expecting lower host CPU usage for inter-VM traffic when in reality it’s already about as good as it gets, no?

Thank you
 

niekbergboer

To make progress on the PVE 8-to-9 upgrade, I reverted to the setup from the very beginning of this thread. This now works.
 

kapone

adolfotregosa said:
So far, there’s only one situation where SR-IOV clearly wins: when the goal is to reduce host CPU usage and power consumption,

Nope. SR-IOV is needed when you need hardware-offloaded, distinct NICs for VMs/containers. It's also needed when your traffic patterns demand it, e.g. promiscuous mode, port snooping, DPDK, etc. Any power consumption savings are incidental.

adolfotregosa said:
either Linux Bridge or OVS and virtio nics I can reach much higher throughput on my consumer-grade system, around 60 Gbit VM-to-VM and around 100 Gbit between the host and VMs or LXC containers

Using virtio NICs, and VM<->VM or host<->VM, you're essentially doing a memory transfer. It should be higher than PCIe speeds. The complexity comes in when traffic has to go in/out of the whole box.

adolfotregosa said:
I have an Intel X710 PCIe 3.0 x8 card in a PCIe 4.0 x4 slot. iperf3 between VFs gives me around 20 to 23 Gbit

Right. So, the card is running at PCIe 3.0 x4 speeds, which is ~32 Gbps. Your speeds are a bit low, but I don't know what else might be bottlenecking it. You should be seeing ~28-30 Gbps at a minimum. iperf3 settings correct? Not single stream? Slow CPU? Slow memory?

adolfotregosa said:
I tried an E810 (so I could test the eSwitch features) that’s PCIe 4.0 and got 40 to 43 Gbit in legacy mode

Right. So, the card is running at PCIe 4.0 x4 speeds, which is ~64 Gbps. Your speeds are a bit low, but I don't know what else might be bottlenecking it. You should be seeing ~60 Gbps at a minimum. iperf3 settings correct? Not single stream? Slow CPU? Slow memory?

adolfotregosa said:
host CPU usage and power consumption are much higher with Linux Bridge/OVS and virtio NICs compared to SR-IOV passthrough VFs only on the OPNsense VM

Exactly. Like I said above: hardware-offloaded, distinct NICs, and the fact that OPNsense traffic patterns are a perfect fit for SR-IOV.

adolfotregosa said:
I might just be expecting lower host CPU usage for inter-VM traffic when in reality it’s already about as good as it gets, no?

You're more right than wrong. :)