SR-IOV on Kubuntu 23.04 ConnectX-5 - ip link set "Cannot find device"


Oliver Mack

New Member
Sep 25, 2014
23
0
1
49
I can't get any further here. So far everything looks good, but the step that assigns the MAC address to the VF does not work.
Maybe it's because of NetworkManager; should I switch to netplan?

ConnectX-5 driver in use is version
MLNX_OFED_LINUX-23.07-0.5.1.2-ubuntu23.04-x86_64

root@kubu01:~# ip link set enp5s0f2v0 vf 0 mac 00:22:33:44:55:66
Cannot find device "enp5s0f2v0"

After installing WinOF2 in Win11 22H2, the VF shows error code 10. I tried versions MLNX_WinOF2-3_10_52010 and MLNX_WinOF2-23_7_50000.

edit: I followed this guide


Code:
root@kubu01:~# echo 4 > /sys/class/net/enp5s0f0np0/device/sriov_numvfs
root@kubu01:~# ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether 58:11:22:d3:8e:21 brd ff:ff:ff:ff:ff:ff
    altname enp10s0
3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master br3 state UP mode DEFAULT group default qlen 1000
    link/ether 58:11:22:d3:8e:20 brd ff:ff:ff:ff:ff:ff
    altname enp11s0
4: enp5s0f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 1c:34:da:71:b8:3a brd ff:ff:ff:ff:ff:ff
    vf 0     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
    vf 1     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
    vf 2     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
    vf 3     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
5: enp5s0f1np1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 1c:34:da:71:b8:3b brd ff:ff:ff:ff:ff:ff
6: virbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 52:54:00:91:bb:4a brd ff:ff:ff:ff:ff:ff
7: br3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 1a:a3:4d:c6:1a:75 brd ff:ff:ff:ff:ff:ff
8: enp5s0f2v0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether a6:1f:d4:0d:2e:77 brd ff:ff:ff:ff:ff:ff permaddr ca:2e:a1:3e:fd:1d
9: enp5s0f3v1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 7a:e2:15:27:e1:03 brd ff:ff:ff:ff:ff:ff permaddr 5e:6a:88:16:e6:be
10: enp5s0f4v2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 86:fb:e6:c4:45:6f brd ff:ff:ff:ff:ff:ff permaddr be:81:e3:32:6c:41
11: enp5s0f5v3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 22:be:bb:30:13:87 brd ff:ff:ff:ff:ff:ff permaddr aa:9c:e5:95:0f:cb
root@kubu01:~#
root@kubu01:~# lspci -D | grep Mellanox
0000:05:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
0000:05:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
0000:05:00.2 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
0000:05:00.3 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
0000:05:00.4 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
0000:05:00.5 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
root@kubu01:~#
root@kubu01:~#
root@kubu01:~#
root@kubu01:~#
root@kubu01:~# echo 0000:05:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind
root@kubu01:~# ip link set enp5s0f2v0 vf 0 mac 00:22:33:44:55:66
Cannot find device "enp5s0f2v0"

ip a shows the random MAC addresses. Do I only have to set them if I want specific MAC addresses?
Code:
9: enp5s0f3v1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 7a:e2:15:27:e1:03 brd ff:ff:ff:ff:ff:ff permaddr 5e:6a:88:16:e6:be
10: enp5s0f4v2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 86:fb:e6:c4:45:6f brd ff:ff:ff:ff:ff:ff permaddr be:81:e3:32:6c:41
11: enp5s0f5v3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 22:be:bb:30:13:87 brd ff:ff:ff:ff:ff:ff permaddr aa:9c:e5:95:0f:cb
12: enp5s0f2v0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether a6:1f:d4:0d:2e:77 brd ff:ff:ff:ff:ff:ff permaddr 72:bc:0e:e9:de:40
    inet6 fe80::c57f:7a8f:fa2c:6ee4/64 scope link noprefixroute
       valid_lft forever preferred_lft forever

dmesg
Code:
[  213.939668] mlx5_core 0000:05:00.0: E-Switch: Enable: mode(LEGACY), nvfs(4), necvfs(0), active vports(5)
[  214.045845] pci 0000:05:00.2: [15b3:101a] type 00 class 0x020000
[  214.045906] pci 0000:05:00.2: enabling Extended Tags
[  214.046663] pci 0000:05:00.2: Adding to iommu group 37
[  214.046806] mlx5_core 0000:05:00.2: enabling device (0000 -> 0002)
[  214.046869] mlx5_core 0000:05:00.2: firmware version: 16.35.2000
[  214.193334] mlx5_core 0000:05:00.2: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
[  214.206808] mlx5_core 0000:05:00.2: Assigned random MAC address ca:2e:a1:3e:fd:1d
[  214.358497] mlx5_core 0000:05:00.2: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0 basic)
[  214.360252] mlx5_core 0000:05:00.2 enp5s0f2v0: renamed from eth0
[  214.413264] pci 0000:05:00.3: [15b3:101a] type 00 class 0x020000
[  214.413326] pci 0000:05:00.3: enabling Extended Tags
[  214.414088] pci 0000:05:00.3: Adding to iommu group 38
[  214.414211] mlx5_core 0000:05:00.3: enabling device (0000 -> 0002)
[  214.414272] mlx5_core 0000:05:00.3: firmware version: 16.35.2000
[  214.561109] mlx5_core 0000:05:00.2 enp5s0f2v0: Link up
[  214.563182] IPv6: ADDRCONF(NETDEV_CHANGE): enp5s0f2v0: link becomes ready
[  214.573143] mlx5_core 0000:05:00.3: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
[  214.601312] mlx5_core 0000:05:00.3: Assigned random MAC address 5e:6a:88:16:e6:be
[  214.753151] mlx5_core 0000:05:00.3: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0 basic)
[  214.754694] mlx5_core 0000:05:00.3 enp5s0f3v1: renamed from eth0
[  214.808717] pci 0000:05:00.4: [15b3:101a] type 00 class 0x020000
[  214.808779] pci 0000:05:00.4: enabling Extended Tags
[  214.809536] pci 0000:05:00.4: Adding to iommu group 39
[  214.809663] mlx5_core 0000:05:00.4: enabling device (0000 -> 0002)
[  214.809721] mlx5_core 0000:05:00.4: firmware version: 16.35.2000
[  214.957738] mlx5_core 0000:05:00.3 enp5s0f3v1: Link up
[  214.969887] mlx5_core 0000:05:00.4: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
[  215.070126] mlx5_core 0000:05:00.4: Assigned random MAC address be:81:e3:32:6c:41
[  215.221655] mlx5_core 0000:05:00.4: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0 basic)
[  215.223078] mlx5_core 0000:05:00.4 enp5s0f4v2: renamed from eth0
[  215.277798] pci 0000:05:00.5: [15b3:101a] type 00 class 0x020000
[  215.277862] pci 0000:05:00.5: enabling Extended Tags
[  215.278623] pci 0000:05:00.5: Adding to iommu group 40
[  215.278739] mlx5_core 0000:05:00.5: enabling device (0000 -> 0002)
[  215.278798] mlx5_core 0000:05:00.5: firmware version: 16.35.2000
[  215.415312] mlx5_core 0000:05:00.4 enp5s0f4v2: Link up
[  215.451869] mlx5_core 0000:05:00.5: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
[  215.542135] mlx5_core 0000:05:00.5: Assigned random MAC address aa:9c:e5:95:0f:cb
[  215.576065] IPv6: ADDRCONF(NETDEV_CHANGE): enp5s0f3v1: link becomes ready
[  215.578626] IPv6: ADDRCONF(NETDEV_CHANGE): enp5s0f4v2: link becomes ready
[  215.696464] mlx5_core 0000:05:00.5: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0 basic)
[  215.697499] mlx5_core 0000:05:00.5 enp5s0f5v3: renamed from eth0
[  215.859625] mlx5_core 0000:05:00.5 enp5s0f5v3: Link up
[  216.594386] IPv6: ADDRCONF(NETDEV_CHANGE): enp5s0f5v3: link becomes ready
 

Oliver Mack

looks like that's my problem

edit: argh, I knew I would be punished for buying a desktop system. I also have a Zen4 system.



Code:
[ 2987.068846] ioremap memtype_reserve failed -16
[ 2987.080846] x86/PAT: CPU 2/KVM:102943 conflicting memory types fcf6c00000-fcf6e00000 uncached-minus<->write-combining
[ 2987.080847] x86/PAT: memtype_reserve failed [mem 0xfcf6c00000-0xfcf6dfffff], track uncached-minus, req uncached-minus
[ 2987.080848] ioremap memtype_reserve failed -16
[ 2987.092830] x86/PAT: CPU 2/KVM:102943 conflicting memory types fcf6c00000-fcf6e00000 uncached-minus<->write-combining
[ 2987.092831] x86/PAT: memtype_reserve failed [mem 0xfcf6c00000-0xfcf6dfffff], track uncached-minus, req uncached-minus
[ 2987.092832] ioremap memtype_reserve failed -16
[ 2987.104843] x86/PAT: CPU 2/KVM:102943 conflicting memory types fcf6c00000-fcf6e00000 uncached-minus<->write-combining
[ 2987.104845] x86/PAT: memtype_reserve failed [mem 0xfcf6c00000-0xfcf6dfffff], track uncached-minus, req uncached-minus
[ 2987.104845] ioremap memtype_reserve failed -16
[ 2987.116844] x86/PAT: CPU 2/KVM:102943 conflicting memory types fcf6c00000-fcf6e00000 uncached-minus<->write-combining
[ 2987.116846] x86/PAT: memtype_reserve failed [mem 0xfcf6c00000-0xfcf6dfffff], track uncached-minus, req uncached-minus
[ 2987.116847] ioremap memtype_reserve failed -16
[ 2987.128840] x86/PAT: CPU 2/KVM:102943 conflicting memory types fcf6c00000-fcf6e00000 uncached-minus<->write-combining
[ 2987.128841] x86/PAT: memtype_reserve failed [mem 0xfcf6c00000-0xfcf6dfffff], track uncached-minus, req uncached-minus
[ 2987.128842] ioremap memtype_reserve failed -16
[ 2987.140842] x86/PAT: CPU 2/KVM:102943 conflicting memory types fcf6c00000-fcf6e00000 uncached-minus<->write-combining
[ 2987.140844] x86/PAT: memtype_reserve failed [mem 0xfcf6c00000-0xfcf6dfffff], track uncached-minus, req uncached-minus
[ 2987.140844] ioremap memtype_reserve failed -16
 

Oliver Mack

"Try using the base device and not the VF."
That worked, thanks!
At least this issue is solved, but the main problem still exists: I need either an Intel E810 or something comparable, or a new system. I don't think AMD is interested in SR-IOV problems on AM5.
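
The fix above fits how ip link handles the "vf N" argument: the VF index is an attribute of the physical function, so the command is addressed to the PF netdev (enp5s0f0np0 in the earlier output) rather than to the VF's own interface (enp5s0f2v0). The all-zero MACs listed under the PF suggest that no administrative MAC was set, which would explain why mlx5_core reports "Assigned random MAC address"; setting one explicitly mainly matters when a guest needs a stable address. A minimal sketch follows, where set_vf_mac and DRY_RUN are hypothetical names used only for illustration:

```shell
#!/bin/sh
# Hypothetical helper: set the MAC of VF <index> through its parent PF.
# With DRY_RUN=1 the command is only printed, so it can be reviewed first.
set_vf_mac() {
    pf="$1"     # PF netdev, e.g. enp5s0f0np0 (not the VF netdev enp5s0f2v0)
    vf="$2"     # VF index as listed under the PF in `ip link show`
    mac="$3"    # administratively assigned MAC address
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "ip link set $pf vf $vf mac $mac"
    else
        ip link set "$pf" vf "$vf" mac "$mac"
    fi
}
```

Calling it as DRY_RUN=1 set_vf_mac enp5s0f0np0 0 00:22:33:44:55:66 prints the command that would run; without DRY_RUN it needs root.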
 

TheRac25

New Member
May 30, 2023
12
0
1
Looks like you are probing the VFs; pretty sure you don't want that if you are passing them through.
 

Oliver Mack

"looks like you are probing the vfs, pretty sure you dont want that if you are passing them through"
Then I have to use netplan and not NetworkManager, because NetworkManager is already probing the VFs, right?

"Assigned random MAC address 5e:6a:88:16:e6:be"
 

TheRac25

Uh, it's a function of mlx5_core that makes the VFs show up as usable interfaces on the system, i.e. enp5s0f2v0, if I'm not mistaken.
I could be wrong and that's not what those are; I'm just saying that's what they look like.
Since they are claimed as devices for the host to use, there will be problems passing them through.
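
In mainline kernels, whether a host driver binds to newly created VFs is controlled by the PF's sriov_drivers_autoprobe sysfs attribute, not by the network configuration tool. A rough sketch of recreating the VFs without host probing, assuming the PF name from the first post; create_unprobed_vfs and SYSFS_ROOT are hypothetical names, the latter only there so the steps can be tried against a scratch directory:

```shell
#!/bin/sh
# Sketch: recreate VFs on a PF without letting a host driver bind to them.
# SYSFS_ROOT is a hypothetical override (defaults to the real /sys).
create_unprobed_vfs() {
    pf="$1"       # PF netdev, e.g. enp5s0f0np0
    count="$2"    # number of VFs to create
    dev="${SYSFS_ROOT:-/sys}/class/net/$pf/device"
    echo 0 > "$dev/sriov_numvfs" || return 1              # existing VFs must be removed first
    echo 0 > "$dev/sriov_drivers_autoprobe" || return 1   # new VFs stay driverless on the host
    echo "$count" > "$dev/sriov_numvfs"                   # create the VFs
}
```

On a real host this would be run as root, e.g. create_unprobed_vfs enp5s0f0np0 4. The VFs should then still appear in lspci but without enp5s0f*v* interfaces on the host.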
 

Oliver Mack

"cx5 stores config on the card if im not mistaken,"

That sounded right, but I can't find it and there's no mention of it in the NVIDIA docs. The two settings that must be set are these:

SRIOV_EN 0 1
NUM_OF_VFS 0 4

I did a factory reset and disabled NetworkManager; that didn't help.


Code:
mstconfig -d 05:00.0 q

Device #1:
----------

Device type: ConnectX5
Name: MCX516A-CDA_Ax_Bx
Description: ConnectX-5 Ex EN network interface card; 100GbE dual-port QSFP28; PCIe4.0 x16; tall bracket; ROHS R6
Device: 05:00.0

Configurations: Next Boot
MEMIC_BAR_SIZE 0
MEMIC_SIZE_LIMIT _256KB(1)
HOST_CHAINING_MODE DISABLED(0)
HOST_CHAINING_CACHE_DISABLE False(0)
HOST_CHAINING_DESCRIPTORS Array[0..7]
HOST_CHAINING_TOTAL_BUFFER_SIZE Array[0..7]
FLEX_PARSER_PROFILE_ENABLE 0
FLEX_IPV4_OVER_VXLAN_PORT 0
ROCE_NEXT_PROTOCOL 254
ESWITCH_HAIRPIN_DESCRIPTORS Array[0..7]
ESWITCH_HAIRPIN_TOT_BUFFER_SIZE Array[0..7]
PF_BAR2_SIZE 0
PF_NUM_OF_VF_VALID False(0)
NON_PREFETCHABLE_PF_BAR False(0)
VF_VPD_ENABLE False(0)
PF_NUM_PF_MSIX_VALID False(0)
PER_PF_NUM_SF False(0)
STRICT_VF_MSIX_NUM False(0)
VF_NODNIC_ENABLE False(0)
NUM_PF_MSIX_VALID True(1)
NUM_OF_VFS 4
NUM_OF_PF 2
PF_BAR2_ENABLE False(0)
SRIOV_EN True(1)
PF_LOG_BAR_SIZE 5
VF_LOG_BAR_SIZE 1
NUM_PF_MSIX 63
NUM_VF_MSIX 11
INT_LOG_MAX_PAYLOAD_SIZE AUTOMATIC(0)
PCIE_CREDIT_TOKEN_TIMEOUT 0
ACCURATE_TX_SCHEDULER False(0)
PARTIAL_RESET_EN False(0)
SW_RECOVERY_ON_ERRORS False(0)
RESET_WITH_HOST_ON_ERRORS False(0)
ADVANCED_POWER_SETTINGS False(0)
CQE_COMPRESSION BALANCED(0)
IP_OVER_VXLAN_EN False(0)
MKEY_BY_NAME False(0)
ESWITCH_IPV4_TTL_MODIFY_ENABLE False(0)
PRIO_TAG_REQUIRED_EN False(0)
UCTX_EN True(1)
PCI_ATOMIC_MODE PCI_ATOMIC_DISABLED_EXT_ATOMIC_ENABLED(0)
TUNNEL_ECN_COPY_DISABLE False(0)
LRO_LOG_TIMEOUT0 6
LRO_LOG_TIMEOUT1 7
LRO_LOG_TIMEOUT2 8
LRO_LOG_TIMEOUT3 13
LOG_TX_PSN_WINDOW 7
LOG_MAX_OUTSTANDING_WQE 7
ROCE_ADAPTIVE_ROUTING_EN False(0)
TUNNEL_IP_PROTO_ENTROPY_DISABLE False(0)
ICM_CACHE_MODE DEVICE_DEFAULT(0)
TX_SCHEDULER_BURST 0
ZERO_TOUCH_TUNING_ENABLE False(0)
LOG_MAX_QUEUE 17
LOG_DCR_HASH_TABLE_SIZE 11
DCR_LIFO_SIZE 16384
ROCE_CC_PRIO_MASK_P1 255
ROCE_CC_PRIO_MASK_P2 255
CLAMP_TGT_RATE_AFTER_TIME_INC_P1 True(1)
CLAMP_TGT_RATE_P1 False(0)
RPG_TIME_RESET_P1 300
RPG_BYTE_RESET_P1 32767
RPG_THRESHOLD_P1 1
RPG_MAX_RATE_P1 0
RPG_AI_RATE_P1 5
RPG_HAI_RATE_P1 50
RPG_GD_P1 11
RPG_MIN_DEC_FAC_P1 50
RPG_MIN_RATE_P1 1
RATE_TO_SET_ON_FIRST_CNP_P1 0
DCE_TCP_G_P1 1019
DCE_TCP_RTT_P1 1
RATE_REDUCE_MONITOR_PERIOD_P1 4
INITIAL_ALPHA_VALUE_P1 1023
MIN_TIME_BETWEEN_CNPS_P1 4
CNP_802P_PRIO_P1 6
CNP_DSCP_P1 48
CLAMP_TGT_RATE_AFTER_TIME_INC_P2 True(1)
CLAMP_TGT_RATE_P2 False(0)
RPG_TIME_RESET_P2 300
RPG_BYTE_RESET_P2 32767
RPG_THRESHOLD_P2 1
RPG_MAX_RATE_P2 0
RPG_AI_RATE_P2 5
RPG_HAI_RATE_P2 50
RPG_GD_P2 11
RPG_MIN_DEC_FAC_P2 50
RPG_MIN_RATE_P2 1
RATE_TO_SET_ON_FIRST_CNP_P2 0
DCE_TCP_G_P2 1019
DCE_TCP_RTT_P2 1
RATE_REDUCE_MONITOR_PERIOD_P2 4
INITIAL_ALPHA_VALUE_P2 1023
MIN_TIME_BETWEEN_CNPS_P2 4
CNP_802P_PRIO_P2 6
CNP_DSCP_P2 48
LLDP_NB_DCBX_P1 False(0)
LLDP_NB_RX_MODE_P1 OFF(0)
LLDP_NB_TX_MODE_P1 OFF(0)
LLDP_NB_DCBX_P2 False(0)
LLDP_NB_RX_MODE_P2 OFF(0)
LLDP_NB_TX_MODE_P2 OFF(0)
DCBX_IEEE_P1 True(1)
DCBX_CEE_P1 True(1)
DCBX_WILLING_P1 True(1)
DCBX_IEEE_P2 True(1)
DCBX_CEE_P2 True(1)
DCBX_WILLING_P2 True(1)
KEEP_ETH_LINK_UP_P1 True(1)
KEEP_IB_LINK_UP_P1 False(0)
KEEP_LINK_UP_ON_BOOT_P1 False(0)
KEEP_LINK_UP_ON_STANDBY_P1 False(0)
DO_NOT_CLEAR_PORT_STATS_P1 False(0)
AUTO_POWER_SAVE_LINK_DOWN_P1 False(0)
KEEP_ETH_LINK_UP_P2 True(1)
KEEP_IB_LINK_UP_P2 False(0)
KEEP_LINK_UP_ON_BOOT_P2 False(0)
KEEP_LINK_UP_ON_STANDBY_P2 False(0)
DO_NOT_CLEAR_PORT_STATS_P2 False(0)
AUTO_POWER_SAVE_LINK_DOWN_P2 False(0)
NUM_OF_VL_P1 _4_VLs(3)
NUM_OF_TC_P1 _8_TCs(0)
NUM_OF_PFC_P1 8
VL15_BUFFER_SIZE_P1 0
NUM_OF_VL_P2 _4_VLs(3)
NUM_OF_TC_P2 _8_TCs(0)
NUM_OF_PFC_P2 8
VL15_BUFFER_SIZE_P2 0
DUP_MAC_ACTION_P1 LAST_CFG(0)
MPFS_MC_LOOPBACK_DISABLE_P1 False(0)
MPFS_UC_LOOPBACK_DISABLE_P1 False(0)
UNKNOWN_UPLINK_MAC_FLOOD_P1 False(0)
SRIOV_IB_ROUTING_MODE_P1 LID(1)
IB_ROUTING_MODE_P1 LID(1)
DUP_MAC_ACTION_P2 LAST_CFG(0)
MPFS_MC_LOOPBACK_DISABLE_P2 False(0)
MPFS_UC_LOOPBACK_DISABLE_P2 False(0)
UNKNOWN_UPLINK_MAC_FLOOD_P2 False(0)
SRIOV_IB_ROUTING_MODE_P2 LID(1)
IB_ROUTING_MODE_P2 LID(1)
PHY_AUTO_NEG_P1 DEVICE_DEFAULT(0)
PHY_RATE_MASK_OVERRIDE_P1 False(0)
PHY_FEC_OVERRIDE_P1 DEVICE_DEFAULT(0)
PHY_AUTO_NEG_P2 DEVICE_DEFAULT(0)
PHY_RATE_MASK_OVERRIDE_P2 False(0)
PHY_FEC_OVERRIDE_P2 DEVICE_DEFAULT(0)
PF_TOTAL_SF 0
PF_SF_BAR_SIZE 0
PF_NUM_PF_MSIX 63
ROCE_CONTROL ROCE_ENABLE(2)
PCI_WR_ORDERING per_mkey(0)
MULTI_PORT_VHCA_EN False(0)
PORT_OWNER True(1)
ALLOW_RD_COUNTERS True(1)
RENEG_ON_CHANGE True(1)
TRACER_ENABLE True(1)
IP_VER IPv4(0)
BOOT_UNDI_NETWORK_WAIT 0
UEFI_HII_EN True(1)
BOOT_DBG_LOG False(0)
UEFI_LOGS DISABLED(0)
BOOT_VLAN 1
LEGACY_BOOT_PROTOCOL PXE(1)
BOOT_RETRY_CNT NONE(0)
BOOT_INTERRUPT_DIS False(0)
BOOT_LACP_DIS True(1)
BOOT_VLAN_EN False(0)
BOOT_PKEY 0
P2P_ORDERING_MODE DEVICE_DEFAULT(0)
ATS_ENABLED False(0)
DYNAMIC_VF_MSIX_TABLE False(0)
EXP_ROM_UEFI_ARM_ENABLE False(0)
EXP_ROM_UEFI_x86_ENABLE False(0)
EXP_ROM_PXE_ENABLE True(1)
ADVANCED_PCI_SETTINGS False(0)
SAFE_MODE_THRESHOLD 10
SAFE_MODE_ENABLE True(1)
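
Only two lines of this dump matter for SR-IOV, so it can help to filter the query output. The sketch below reads mstconfig query output on stdin and keeps just those two parameters; sriov_fw_settings is a hypothetical helper name. As far as I know the values themselves are changed with mstconfig -d 05:00.0 set SRIOV_EN=1 NUM_OF_VFS=4, and since the dump is labelled "Next Boot" they should only take effect after a reboot or firmware reset.

```shell
#!/bin/sh
# Sketch: keep only the SR-IOV firmware settings from `mstconfig ... q` output on stdin.
# Exact field matching avoids picking up similar names such as PF_NUM_OF_VF_VALID.
sriov_fw_settings() {
    awk '$1 == "SRIOV_EN" || $1 == "NUM_OF_VFS" { print $1 "=" $2 }'
}
```

Used as mstconfig -d 05:00.0 q | sriov_fw_settings, the dump above would reduce to NUM_OF_VFS=4 and SRIOV_EN=True(1).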
 

Oliver Mack

Guys, I'm very grateful for the help. Coming from the CX3 Pro, where it was totally easy and never caused any problems except code 43 with Windows VMs, the change to the CX5 is like going from an angel of a child to Kevin from Home Alone, at least in combination with an Asus consumer board.

I will try this now with another distro with a newer kernel


After deactivating "probe_vf" and creating the VFs, it looks good at first:
Code:
[ 1090.264169] mlx5_core 0000:05:00.0: E-Switch: Enable: mode(LEGACY), nvfs(4), necvfs(0), active vports(5)
[ 1090.372427] pci 0000:05:00.2: [15b3:101a] type 00 class 0x020000
[ 1090.372488] pci 0000:05:00.2: enabling Extended Tags
[ 1090.373248] pci 0000:05:00.2: Adding to iommu group 37
[ 1090.373338] mlx5_core 0000:05:00.2: Avoid probing VFs
[ 1090.373394] pci 0000:05:00.3: [15b3:101a] type 00 class 0x020000
[ 1090.373453] pci 0000:05:00.3: enabling Extended Tags
[ 1090.374345] pci 0000:05:00.3: Adding to iommu group 38
[ 1090.374400] mlx5_core 0000:05:00.3: Avoid probing VFs
[ 1090.374495] pci 0000:05:00.4: [15b3:101a] type 00 class 0x020000
[ 1090.374592] pci 0000:05:00.4: enabling Extended Tags
[ 1090.375762] pci 0000:05:00.4: Adding to iommu group 39
[ 1090.375819] mlx5_core 0000:05:00.4: Avoid probing VFs
[ 1090.375914] pci 0000:05:00.5: [15b3:101a] type 00 class 0x020000
[ 1090.376029] pci 0000:05:00.5: enabling Extended Tags
[ 1090.377202] pci 0000:05:00.5: Adding to iommu group 40
[ 1090.377236] mlx5_core 0000:05:00.5: Avoid probing VFs

But now I get this when starting the VM
Code:
[ 1217.170043] BUG: unable to handle page fault for address: 000000000002d0b0
[ 1217.170047] #PF: supervisor read access in kernel mode
[ 1217.170048] #PF: error_code(0x0000) - not-present page
[ 1217.170050] PGD 0 P4D 0
[ 1217.170052] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 1217.170054] CPU: 6 PID: 2628 Comm: rpc-libvirtd Tainted: P OE 6.2.0-33-generic #33-Ubuntu
[ 1217.170057] Hardware name: ASUS System Product Name/ProArt B650-CREATOR, BIOS 1602 08/15/2023
[ 1217.170058] RIP: 0010:remove_one+0x32/0x140 [mlx5_core]
[ 1217.170105] Code: e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 8b 9f 48 01 00 00 48 89 df e8 1c d1 bf dd 41 80 bc 24 43 08 00 00 00 49 89 c5 79 1c <44> 0f b6 b3 b0 d0 02 00 41 80 fe 01 0f 87 8c 34 13 00 41 83 e6 01
[ 1217.170106] RSP: 0018:ffffbb3bc1f8bca0 EFLAGS: 00010282
[ 1217.170108] RAX: fffffffffffffe20 RBX: 0000000000000000 RCX: 0000000000000000
[ 1217.170110] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 1217.170110] RBP: ffffbb3bc1f8bcc8 R08: 0000000000000000 R09: 0000000000000000
[ 1217.170111] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9594b45a9000
[ 1217.170112] R13: fffffffffffffe20 R14: ffff9594b45a9150 R15: ffff959400bf2150
[ 1217.170113] FS: 00007f96acdfe6c0(0000) GS:ffff95aa99b80000(0000) knlGS:0000000000000000
[ 1217.170114] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1217.170116] CR2: 000000000002d0b0 CR3: 00000001214d4000 CR4: 0000000000750ee0
[ 1217.170117] PKRU: 55555554
[ 1217.170118] Call Trace:
[ 1217.170119] <TASK>
[ 1217.170121] ? show_regs+0x6d/0x80
[ 1217.170124] ? __die+0x24/0x80
[ 1217.170126] ? page_fault_oops+0x99/0x1b0
[ 1217.170130] ? do_user_addr_fault+0x2f3/0x620
[ 1217.170131] ? exc_page_fault+0x80/0x1b0
[ 1217.170134] ? asm_exc_page_fault+0x27/0x30
[ 1217.170138] ? remove_one+0x32/0x140 [mlx5_core]
[ 1217.170179] pci_device_remove+0x36/0xb0
[ 1217.170182] device_remove+0x40/0x80
[ 1217.170184] device_release_driver_internal+0x222/0x2a0
[ 1217.170187] device_driver_detach+0x14/0x20
[ 1217.170189] unbind_store+0x102/0x130
[ 1217.170190] drv_attr_store+0x21/0x50
[ 1217.170193] sysfs_kf_write+0x3b/0x60
[ 1217.170195] kernfs_fop_write_iter+0x130/0x210
[ 1217.170197] vfs_write+0x24e/0x410
[ 1217.170200] ksys_write+0x73/0x100
[ 1217.170202] __x64_sys_write+0x19/0x30
[ 1217.170203] do_syscall_64+0x58/0x90
[ 1217.170205] ? do_syscall_64+0x67/0x90
[ 1217.170207] ? syscall_exit_to_user_mode+0x37/0x60
[ 1217.170209] ? do_syscall_64+0x67/0x90
[ 1217.170211] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 1217.170212] RIP: 0033:0x7f96b130ba1f
[ 1217.170214] Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 69 f5 f7 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 44 24 08 e8 bc f5 f7 ff 48
[ 1217.170215] RSP: 002b:00007f96acdfd340 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
[ 1217.170217] RAX: ffffffffffffffda RBX: 000000000000001c RCX: 00007f96b130ba1f
[ 1217.170218] RDX: 000000000000000c RSI: 00007f969c063ee0 RDI: 000000000000001c
[ 1217.170219] RBP: 000000000000000c R08: 0000000000000000 R09: 00007f96acdfc670
[ 1217.170219] R10: 0000000000000000 R11: 0000000000000293 R12: 00007f969c063ee0
[ 1217.170220] R13: 000000000000001c R14: 0000000000000000 R15: 00007f96b1b4f50b
[ 1217.170222] </TASK>
[ 1217.170223] Modules linked in: snd_seq_dummy snd_hrtimer xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables libcrc32c nfnetlink bridge stp llc vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd cuse rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) snd_hda_codec_hdmi zfs(PO) snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_usb_audio snd_hda_codec snd_hda_core snd_usbmidi_lib mc snd_hwdep zunicode(PO) snd_pcm intel_rapl_msr zzstd(O) intel_rapl_common snd_seq_midi snd_seq_midi_event edac_mce_amd zlua(O) snd_rawmidi zavl(PO) nls_iso8859_1 kvm_amd snd_seq icp(PO) mlx5_ib(OE) snd_seq_device kvm zcommon(PO) snd_timer irqbypass ib_uverbs(OE) znvpair(PO) snd wmi_bmof asus_nb_wmi rapl k10temp joydev spl(O) input_leds ib_core(OE) ccp soundcore mac_hid binfmt_misc knem(OE) msr parport_pc ppdev lp parport efi_pstore dmi_sysfs ip_tables x_tables autofs4 amdgpu
[ 1217.170265] mlx5_core(OE) iommu_v2 drm_buddy gpu_sched i2c_algo_bit drm_ttm_helper ttm drm_display_helper cec mfd_aaeon rc_core crct10dif_pclmul hid_generic asus_wmi drm_kms_helper crc32_pclmul ledtrig_audio polyval_clmulni mlxdevm(OE) syscopyarea polyval_generic sparse_keymap sysfillrect ghash_clmulni_intel mlxfw(OE) usbhid hid sha512_ssse3 aesni_intel nvme psample sysimgblt crypto_simd ucsi_ccg platform_profile drm r8169 cryptd tls ahci nvme_core video xhci_pci i2c_designware_pci i2c_piix4 libahci realtek mlx_compat(OE) xhci_pci_renesas i2c_ccgx_ucsi nvme_common ucsi_acpi pci_hyperv_intf typec_ucsi wmi typec gpio_amdpt
[ 1217.170291] CR2: 000000000002d0b0
[ 1217.170293] ---[ end trace 0000000000000000 ]---
[ 1217.809695] RIP: 0010:remove_one+0x32/0x140 [mlx5_core]
[ 1217.809741] Code: e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 8b 9f 48 01 00 00 48 89 df e8 1c d1 bf dd 41 80 bc 24 43 08 00 00 00 49 89 c5 79 1c <44> 0f b6 b3 b0 d0 02 00 41 80 fe 01 0f 87 8c 34 13 00 41 83 e6 01
[ 1217.809743] RSP: 0018:ffffbb3bc1f8bca0 EFLAGS: 00010282
[ 1217.809745] RAX: fffffffffffffe20 RBX: 0000000000000000 RCX: 0000000000000000
[ 1217.809746] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 1217.809747] RBP: ffffbb3bc1f8bcc8 R08: 0000000000000000 R09: 0000000000000000
[ 1217.809748] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9594b45a9000
[ 1217.809749] R13: fffffffffffffe20 R14: ffff9594b45a9150 R15: ffff959400bf2150
[ 1217.809750] FS: 00007f96acdfe6c0(0000) GS:ffff95aa99b80000(0000) knlGS:0000000000000000
[ 1217.809751] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1217.809752] CR2: 000000000002d0b0 CR3: 00000001214d4000 CR4: 0000000000750ee0
[ 1217.809754] PKRU: 55555554
[ 1217.809755] note: rpc-libvirtd[2628] exited with irqs disabled
 

Oliver Mack

It works with Manjaro on kernel 6.4: the VF is connected in Windows 11. With Kubuntu and MLNX_OFED, libvirtd died when starting the VM.
I had to create a udev rule to disable "probe_vf" at boot and create the VFs, thanks to @llowrey for his post.

I still get this error, but I couldn't see any further problems in this context. Many thanks for your help, guys!

[ 787.827394] CPU: 29 PID: 4322 Comm: rpc-libvirtd Tainted: P OE 6.4.16-1-MANJARO #1 b75fe5796da2edc38c34cd1a3d5a0deee650c91e
[ 787.827397] Hardware name: ASUS System Product Name/ProArt B650-CREATOR, BIOS 1602 08/15/2023
[ 787.827398] RIP: 0010:devm_free_irq+0x73/0x80
 

cromo

Member
Jun 6, 2019
94
24
8
I disabled autoprobing and now my VMs won’t start, ironically. The kvm error line is:

kvm: -device vfio-pci,host=0000:05:01.3,id=hostpci1,bus=pci.0,addr=0x11: vfio 0000:05:01.3: failed to open /dev/vfio/18: No such file or directory


Looks like the VFIO device is not created for them with autoprobing off. I don’t see anyone mentioning this anywhere as an explicit step. Any ideas?
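
One possible explanation, not confirmed anywhere in this thread: /dev/vfio/18 is the node for IOMMU group 18, and vfio-pci only creates it once a device in that group is bound to vfio-pci. With autoprobing off the VF has no driver at all, so if the hypervisor tooling does not do the binding itself, it can be done by hand through the PCI driver_override attribute. A sketch under those assumptions; bind_vf_to_vfio and SYSFS_ROOT are hypothetical names, the latter only for trying the steps on a scratch directory:

```shell
#!/bin/sh
# Sketch: bind one driverless VF to vfio-pci via driver_override.
# On a real host, load the module first (modprobe vfio-pci) and run as root.
# SYSFS_ROOT is a hypothetical override (defaults to the real /sys).
bind_vf_to_vfio() {
    bdf="$1"    # full PCI address, e.g. 0000:05:01.3
    pci="${SYSFS_ROOT:-/sys}/bus/pci"
    echo vfio-pci > "$pci/devices/$bdf/driver_override" || return 1   # restrict matching to vfio-pci
    echo "$bdf" > "$pci/drivers_probe"                                # ask the PCI core to probe the device
}
```

After bind_vf_to_vfio 0000:05:01.3, readlink /sys/bus/pci/devices/0000:05:01.3/iommu_group shows which group number to expect under /dev/vfio/.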
 

Oliver Mack

Have you enabled SR-IOV in your BIOS and NIC firmware?
If so, you only need this udev rule:

cat /etc/udev/rules.d/99-sriov.rules
ACTION=="add", SUBSYSTEM=="net", ATTRS{vendor}=="0x15b3", ATTRS{device}=="0x1019", ATTR{device/sriov_drivers_autoprobe}="0", ATTR{device/sriov_numvfs}="4"

[manja-02 ~]# lspci | grep Mella

01:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
01:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
01:00.2 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
01:00.3 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
01:00.4 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
01:00.5 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
01:00.6 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
01:00.7 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
01:01.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
01:01.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
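
The rule matches on vendor and device ID, so it applies to both ports of the card; four VFs per PF gives the eight VFs in the lspci output. To check that it took effect, the PF's sysfs attributes can be read back. A small sketch, where sriov_status and SYSFS_ROOT are hypothetical names and the PF netdev name on this host is not shown in the thread:

```shell
#!/bin/sh
# Sketch: print the SR-IOV state the udev rule is expected to leave on a PF.
# SYSFS_ROOT is a hypothetical override (defaults to the real /sys).
sriov_status() {
    pf="$1"    # PF netdev name
    dev="${SYSFS_ROOT:-/sys}/class/net/$pf/device"
    echo "autoprobe=$(cat "$dev/sriov_drivers_autoprobe")"   # 0 means VFs are left driverless
    echo "numvfs=$(cat "$dev/sriov_numvfs")"                 # VFs currently created
    echo "totalvfs=$(cat "$dev/sriov_totalvfs")"             # maximum allowed by the firmware
}
```

Without a reboot, the rule can probably be re-applied with udevadm control --reload-rules followed by udevadm trigger --action=add --subsystem-match=net.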

One problem I have is that MTU 9000 doesn't perform well with SR-IOV; I've given up trying to find the cause and use 1500.
 

cromo

"have you enabled SR-IOV in your Bios and NIC Firmware? If so, you only need this udev rule [...]"
I have all of that and was already using SR-IOV. Everything worked flawlessly until I added the Radeon GPU today. Even so, disabling autoprobing should not stop those VMs from binding to the PCI IDs. Are you sure you didn't do any additional manual steps with regard to the vfio-pci kernel module, like kernel parameters or some extra initialization via sysfs?

I added a separate post with more details: https://forums.servethehome.com/ind...-driver-in-memory-conflict-with-amdgpu.43434/
 