Hey folks,
Having an oddity and wanted to see if anyone else has run into this. A quick breakdown:
server - Ubuntu 22.04 - Mellanox ConnectX-3, 40Gbps, connected to switch port 1/2/1 via QSFP cable
Normally, I have 1/2/1 tagged in VLANs 2 and 50
switch - ICX 6610-48P
Issue came up when I moved a bunch of services from an old server to this one and began noticing packet loss when trying to communicate with the server, only on VLAN 2 (VLAN 50 is, by design in this scenario, supposed to be unreachable). Roughly 7-10% of straight pings are lost, even just pinging from the ICX switch to the server (e.g. ping 192.168.1.10 count 150 results in packet loss).
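To put a number on the loss from a Linux box instead of eyeballing it, a rough sketch like the one below could help. It assumes the iputils `ping` summary line format (the ICX's own ping output looks different), and the function names are just ones I made up for illustration.

```python
import re
import subprocess

# Matches the iputils summary line, e.g.
# "150 packets transmitted, 138 received, 8% packet loss, time 149213ms"
SUMMARY_RE = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")


def parse_ping_summary(output):
    """Return (sent, received, loss_percent) parsed from ping output text."""
    match = SUMMARY_RE.search(output)
    if match is None:
        raise ValueError("no ping summary line found")
    sent, received = int(match.group(1)), int(match.group(2))
    # Recompute the loss ourselves so rounding in the ping output does not matter.
    loss = 100.0 * (sent - received) / sent if sent else 0.0
    return sent, received, loss


def run_ping(host, count=150):
    """Run the system ping (Linux iputils flags assumed) and summarise it."""
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True, text=True, check=False,
    )
    return parse_ping_summary(result.stdout)
```

Calling `run_ping("192.168.1.10")` once with VLAN 50 tagged on the port and once without would give two comparable loss figures.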
After a whole lot of troubleshooting, I noticed something. When I remove port 1/2/1 from VLAN 50, everything is fine. Zero packet loss. No other network change necessary, just removing that VLAN from the port. If I re-add it (vlan 50 => tagged ethernet 1/2/1), the packet loss immediately starts up again.
Double checked my netplan on the server: there is a default route for VLAN 2, as expected, and everything else is fine. For completeness in testing, I removed the VLAN 50 config from netplan, applied, and even rebooted. Even then, if I tag the port with VLAN 50, the packet loss comes back. Double checked the usual suspects like mismatched MTU, speed, duplex, etc., and as far as I can see, all the config is correct.
Checked for any errors on the server (with ifconfig and ethtool) and on the switch (with sh int et 1/2/1), but the interfaces both look perfectly clean. Zero errors of any kind on both sides.
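Since the headline error counters are clean, one more thing that might be worth trying is diffing the full `ethtool -S` output before and after a ping run, to see whether any less obvious counter moves. This is only a sketch: it assumes the usual `name: value` layout of `ethtool -S`, the helper names are hypothetical, and the counter names in the usage note are placeholders because the real ones are driver-specific.

```python
def parse_ethtool_stats(text):
    """Parse `ethtool -S <iface>` style output into a {counter: int} dict."""
    counters = {}
    for line in text.splitlines():
        # Split on the last colon so counter names containing colons survive.
        name, sep, value = line.rpartition(":")
        if not sep:
            continue
        try:
            counters[name.strip()] = int(value.strip())
        except ValueError:
            # Header lines such as "NIC statistics:" have no numeric value.
            continue
    return counters


def changed_counters(before, after):
    """Return {counter: delta} for counters whose value changed between snapshots."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }
```

Saving one snapshot before the ping test and one after, then running `changed_counters(parse_ethtool_stats(before_text), parse_ethtool_stats(after_text))`, leaves only the counters that actually incremented, which is easier to scan than two full dumps.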
I don't have a spare 40G card handy (other than an XL710, and well, we all know how that'll go). I just ordered another QSFP cable to rule that out, though the one I'm using is only ~2 months old. My next steps will be to change ports on both the ConnectX and the ICX to see if that helps, and if it does I'll of course let you all know.
For giggles, I created a new VLAN on the switch, vlan 101. I tagged 1/2/1 with that VLAN, and the packet loss is still present, just less, which makes even less sense.
I do have a spare 10G card I can drop in to test. An older server (also 22.04), on the same switch, but 10G only, tagged with the same VLANs, is not having the issue. So I'm really thinking it's something with the ConnectX or the cable, but to me that doesn't explain why the loss goes away once I remove the VLAN from the switchport.