I have two Arista 7150S-R switches configured as an MLAG pair, and all my servers use bonds across the two switches. Four of my servers are experiencing random port flaps on the 10G links. If I reboot a server, it stays up for 5-7 days and then starts flapping again.
Switch OS 4.18
Server OS Ubuntu 18.04
Arista-compatible DACs
What I have done...
1. Changed out the DACs. I had Quanta switches before, so I needed the Arista ones (I thought).
2. The MCX312C-XCBT cards came with Dell/EMC firmware, so I cross-flashed them to the OEM Mellanox firmware (didn't fix it).
3. Compiled the latest driver from Mellanox as a DKMS kernel driver instead of using the default one (didn't fix it).
What I'm thinking about doing
1. Configuring 4 more MLAG ports on the switch and moving the affected servers to them. Ports 1-4 seem affected, but 5-7 and 12-24 seem fine.
2. Changing to Intel X520 SFP+ cards
Error log from one of the servers:
Jul 4 08:31:08 cnode1 kernel: [597752.519787] mlx4_en: enp2s0: Steering Mode 2
Jul 4 08:31:08 cnode1 kernel: [597752.523119] mlx4_en: enp2s0: Setting RSS context tunnel type to RSS on inner headers
Jul 4 08:31:08 cnode1 systemd-networkd[1187]: enp2s0: Lost carrier
Jul 4 08:31:08 cnode1 kernel: [597752.546951] mlx4_en: enp2s0: Link Down
Jul 4 08:31:08 cnode1 kernel: [597752.548155] mlx4_core 0000:02:00.0 enp2s0: speed changed to 0 for port enp2s0
Jul 4 08:31:08 cnode1 systemd-networkd[1187]: enp2s0: Gained carrier
Jul 4 08:31:08 cnode1 systemd-networkd[1187]: enp2s0: Configured
Jul 4 08:31:08 cnode1 kernel: [597752.571156] mlx4_en: enp2s0: Link Up
Jul 4 08:31:08 cnode1 kernel: [597752.586374] bond0: link status definitely up for interface enp2s0, 10000 Mbps full duplex
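
To see whether the flaps follow a pattern (same interface every time, how long each one lasts), something like the sketch below could pull the Link Down / Link Up pairs out of the kernel log. It only assumes the message format shown in the excerpt above; the function name `summarize_flaps` and the log path in the usage note are my own choices, not anything from the driver or the switch.

```python
import re
from collections import defaultdict

# Matches kernel lines in the format shown in the excerpt above, e.g.
#   "... kernel: [597752.546951] mlx4_en: enp2s0: Link Down"
LINK_EVENT = re.compile(
    r"kernel: \[\s*(?P<ts>\d+\.\d+)\] mlx4_en: (?P<iface>\S+): Link (?P<state>Up|Down)"
)

# Two lines copied from the log excerpt, used as sample input.
SAMPLE = [
    "Jul 4 08:31:08 cnode1 kernel: [597752.546951] mlx4_en: enp2s0: Link Down",
    "Jul 4 08:31:08 cnode1 kernel: [597752.571156] mlx4_en: enp2s0: Link Up",
]


def summarize_flaps(lines):
    """Return {iface: [(down_ts, up_ts), ...]} using the kernel uptime stamps.

    summarize_flaps is a hypothetical helper name. A flap is counted as a
    Link Down followed by the next Link Up on the same interface.
    """
    flaps = defaultdict(list)
    pending_down = {}  # iface -> timestamp of a Link Down not yet matched
    for line in lines:
        m = LINK_EVENT.search(line)
        if not m:
            continue
        iface, ts = m["iface"], float(m["ts"])
        if m["state"] == "Down":
            pending_down[iface] = ts
        elif iface in pending_down:
            flaps[iface].append((pending_down.pop(iface), ts))
    return dict(flaps)


if __name__ == "__main__":
    for iface, events in summarize_flaps(SAMPLE).items():
        for down, up in events:
            print(f"{iface}: down for {up - down:.3f}s at uptime {down:.0f}s")
```

To run it against a real log instead of the sample, pass an open file, for example `summarize_flaps(open("/var/log/kern.log"))` (the path is an assumption based on a default Ubuntu rsyslog setup). Comparing the per-interface counts with the switch port each server is plugged into would help confirm whether ports 1-4 really are the common factor.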