Port Flapping with Arista Switch and Mellanox ConnectX-3 Pro

mTek

New Member
Nov 18, 2018
15
6
3
I have two Arista 7150S-R switches in an MLAG pair, and all my servers use bonds that span both switches. Four of my servers are experiencing random port flaps on the 10G links. If I reboot a server, it stays up for 5-7 days and then the flapping starts again.

Switch OS 4.18
Server OS Ubuntu 18.04
Arista-compatible DACs


What I have done...

1. Swapped out the DACs. I had Quanta switches before, so I needed the Arista ones (I thought).
2. The MCX312C-XCBT cards came with Dell/EMC firmware, so I cross-flashed them to the OEM Mellanox firmware (didn't fix it).
3. Compiled the latest driver from Mellanox as a DKMS kernel module to replace the default one (didn't fix it).

What I'm thinking about doing
1. Configuring four more MLAG ports on the switches and moving the affected servers to them. Ports 1-4 seem affected, but 5-7 and 12-24 seem fine.
2. Changing to Intel X520 SFP+ cards

Error log from server.
Jul 4 08:31:08 cnode1 kernel: [597752.519787] mlx4_en: enp2s0: Steering Mode 2
Jul 4 08:31:08 cnode1 kernel: [597752.523119] mlx4_en: enp2s0: Setting RSS context tunnel type to RSS on inner headers
Jul 4 08:31:08 cnode1 systemd-networkd[1187]: enp2s0: Lost carrier
Jul 4 08:31:08 cnode1 kernel: [597752.546951] mlx4_en: enp2s0: Link Down
Jul 4 08:31:08 cnode1 kernel: [597752.548155] mlx4_core 0000:02:00.0 enp2s0: speed changed to 0 for port enp2s0
Jul 4 08:31:08 cnode1 systemd-networkd[1187]: enp2s0: Gained carrier
Jul 4 08:31:08 cnode1 systemd-networkd[1187]: enp2s0: Configured
Jul 4 08:31:08 cnode1 kernel: [597752.571156] mlx4_en: enp2s0: Link Up
Jul 4 08:31:08 cnode1 kernel: [597752.586374] bond0: link status definitely up for interface enp2s0, 10000 Mbps full duplex
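To see how often a port flaps and how long after boot it happens, the Link Down / Link Up pairs in a log like the one above can be extracted with a short script. This is only a minimal sketch: the function name find_flaps and the regular expression are my own, and it assumes the kernel lines keep the exact "mlx4_en: <iface>: Link Down/Up" format shown here, with the bracketed number being seconds since boot.

```python
import re
from collections import defaultdict

# Matches mlx4_en kernel lines such as:
#   "... kernel: [597752.546951] mlx4_en: enp2s0: Link Down"
# Assumption: the bracketed value is the kernel uptime in seconds.
LINK_RE = re.compile(r"kernel: \[\s*(\d+\.\d+)\] mlx4_en: (\S+): Link (Down|Up)")


def find_flaps(lines):
    """Return {interface: [(uptime_s_at_link_down, outage_s), ...]}."""
    down_at = {}                 # interface -> uptime of the pending Link Down
    flaps = defaultdict(list)
    for line in lines:
        m = LINK_RE.search(line)
        if not m:
            continue
        ts, iface, state = float(m.group(1)), m.group(2), m.group(3)
        if state == "Down":
            down_at[iface] = ts
        elif iface in down_at:   # Link Up that closes an earlier Link Down
            start = down_at.pop(iface)
            flaps[iface].append((start, ts - start))
    return dict(flaps)


# The two relevant lines from the log in this post.
SAMPLE = [
    "Jul 4 08:31:08 cnode1 kernel: [597752.546951] mlx4_en: enp2s0: Link Down",
    "Jul 4 08:31:08 cnode1 kernel: [597752.571156] mlx4_en: enp2s0: Link Up",
]

if __name__ == "__main__":
    for iface, events in find_flaps(SAMPLE).items():
        for start, outage in events:
            print(f"{iface}: down at {start / 86400:.1f} days uptime, "
                  f"back after {outage * 1000:.1f} ms")
```

On the sample lines this reports one flap on enp2s0 at roughly 6.9 days of uptime, with the link back after about 24 ms, which lines up with the "5-7 days after reboot" pattern. To run it against a full log, pass an open file instead of SAMPLE, for example find_flaps(open("/var/log/kern.log")); that path is an assumption about where the kernel log lives on the server.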
 

mTek

The final fix was to get rid of Ubuntu. I installed Proxmox on these nodes and used it as an easy way to run the Ceph cluster. I probably could have just switched to Debian, but once it was stable I left it alone.