NAS: sporadic momentary connection drops

mediacomposerman · Sep 18, 2023

We have a pair of S4810 switches (OS 9.14) in a stack. They are the core switches in an isolated layer 2 LAN.

A Synology SA3600 NAS server with an Intel XL710-QDA2 NIC is connected via 802.3ad over 2 x 40GbE QSFPs.

For many months now, the NAS connection has been dropping for about 1 second every 3-90 hours. (I'm only troubleshooting it now because the issue was underreported and I thought just 1 client was experiencing it, which was incorrect.)

The switch's logs track it. Typical example: (full log attached)

Code:

Sep 14 14:31:24 %STKUNIT0-M:CP %LACP-5-PORT-GROUPED: PortChannel-056-Grouped: Interface Fo 0/56 joined port-channel 56.
Sep 14 14:31:24 %STKUNIT0-M:CP %IFMGR-5-OSTATE_UP: Changed interface state to up: Po 56
Sep 14 14:31:24 %STKUNIT0-M:CP %LACP-5-PORT-GROUPED: PortChannel-056-Grouped: Interface Fo 1/56 joined port-channel 56.
Sep 14 14:31:23 %STKUNIT0-M:CP %IFMGR-5-OSTATE_DN: Changed interface state to down: Po 56
Sep 14 14:31:23 %STKUNIT0-M:CP %LACP-5-PORT-UNGROUPED: PortChannel-056-Ungrouped: Interface Fo 1/56 exited port-channel 56.
Sep 14 14:31:23 %STKUNIT0-M:CP %LACP-5-PORT-UNGROUPED: PortChannel-056-Ungrouped: Interface Fo 0/56 exited port-channel 56.

According to Synology's tech support, the NAS logs don't show any link drops, but show dropped packets. Conversely, the switch sees link drops (above) but no packet issues: 0 runts, 0 giants, 0 throttles, 0 CRC, 0 overrun, 0 discarded on both ports as well as the port-channel.
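
A minimal sketch (Python) for pulling the flap times out of the full log, assuming every line follows the format of the excerpt above. The function name `find_flaps` and the `year` default are assumptions for illustration; the log timestamps carry no year.

```python
import re
from datetime import datetime

# Matches the port-channel state lines (OSTATE_UP / OSTATE_DN) from the excerpt.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s.*%IFMGR-5-OSTATE_(?P<state>UP|DN):"
    r".*:\s(?P<iface>Po \d+)\s*$"
)

def find_flaps(lines, iface="Po 56", year=2023):
    """Return (down_time, up_time) pairs for one port-channel.

    The year is an assumption: the switch timestamps do not include it.
    """
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m.group("iface") == iface:
            ts = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S")
            events.append((ts, m.group("state")))
    # The switch prints newest entries first, so sort by time (DN before UP on ties).
    events.sort(key=lambda e: (e[0], e[1] != "DN"))
    flaps, down_at = [], None
    for ts, state in events:
        if state == "DN":
            down_at = ts
        elif down_at is not None:
            flaps.append((down_at, ts))
            down_at = None
    return flaps
```

Fed the excerpt above, this yields a single flap lasting 1 second. Run over the full log, the gaps between successive flaps would show whether the 3-90 hour spacing follows any pattern.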

I updated the NAS software, but the problem persists.
I don't think it's the cables/connections, since both LACP links always drop together (and return together).
I'm not seeing this issue or any errors on other ports/devices, though I do see “Configuration mismatch with neighbor” on other ports. Is that normal when a port goes live in a stack?

Any thoughts? Does it "smell" like a hardware problem? Or similar to known issues you've seen in certain driver versions, configurations, etc.?

Config:

Code:

interface fortyGigE 0/56
 description NAS03
 no ip address
! 
 port-channel-protocol LACP
  port-channel 56 mode active
 no shutdown
!
interface fortyGigE 1/56
 description NAS03
 no ip address
! 
 port-channel-protocol LACP
  port-channel 56 mode active
 no shutdown
!
interface Port-channel 56
 description NAS03
 no ip address
 switchport
 no shutdown

ano · Sep 18, 2023

last time for me, it was a bad cable

Railgun · Sep 19, 2023

Use a static LAG unless you have a reason to let it negotiate. I see this all the time in our “legacy” environment. For such short, directly connected links, the chances of LACP issues alone dropping the link are going to be slim.

Are you taking CRC errors? If so, LACP negotiation doesn’t matter.

In my experience, Arista blamed it on SolarFlare, and SF blamed it on Arista. YMMV

mediacomposerman · Oct 5, 2023

My new suspicion is that it has to do with LACPDUs and mismatched timers. That would make a lot of sense and could cause the LAG to flap.
Interestingly, the default in Dell FTOS is the "fast" timeout of 1 second, yet they recommend a slow timeout.

Somehow, though, switching to the slow timeout affected the NAS instead of the switch. With the fast timeout, both sides were sending/replying to LACPDUs every 1 second, and once every 14,000 to 320,000 LACPDUs the NAS slipped, so the switch took the LAG down.
Now, with the switch's timer set to 30 seconds, the switch still sends them every 1 second and the NAS sends them every 27.8 seconds on average, so we're again only ~1 second away from flapping the link.
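
To put numbers on that margin, here is a small sketch in Python. The constants are the timer values defined in IEEE 802.1AX (formerly 802.3ad), where the receive timeout is three times the periodic interval. Whether FTOS 9.14 expires the partner after 30 seconds (as assumed above) or after the standard's 90 seconds is not confirmed here, so both cases are computed. `margin_seconds` is a hypothetical helper name.

```python
# Timer constants from IEEE 802.1AX, in seconds.
FAST_PERIODIC = 1    # LACPDU interval requested with the short timeout
SLOW_PERIODIC = 30   # LACPDU interval requested with the long timeout
SHORT_TIMEOUT = 3    # 3 x FAST_PERIODIC
LONG_TIMEOUT = 90    # 3 x SLOW_PERIODIC

def margin_seconds(timeout_s, partner_interval_s):
    """Slack between the partner's LACPDU interval and the receive timeout."""
    return timeout_s - partner_interval_s

# Observed NAS interval from the post: 27.8 s on average.
nas_interval = 27.8

# Case 1 (assumption from the post): the switch times out after 30 s.
print(round(margin_seconds(30, nas_interval), 1))            # 2.2

# Case 2 (standard long timeout): the switch times out after 90 s.
print(round(margin_seconds(LONG_TIMEOUT, nas_interval), 1))  # 62.2
```

If the switch follows the standard's long timeout, roughly three consecutive LACPDUs from the NAS would have to go missing before the LAG is taken down, which is a much wider margin than the 30-second reading suggests.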

Another solution may be to fix the LACP priorities.

If I give up on LACP, do static LAGs load-balance?
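
For context on that question: LACP only negotiates which ports belong to the LAG, while the choice of member link for each frame is generally made by hashing header fields, which applies to static LAGs as well. A toy sketch in Python of that idea follows. It is not the S4810's actual hash; the CRC32 choice, the hashed fields, and the name `pick_member` are all assumptions for illustration.

```python
import zlib

def pick_member(src_ip, dst_ip, members):
    """Pick one LAG member for a flow by hashing its addresses (toy example)."""
    key = f"{src_ip}-{dst_ip}".encode()
    return members[zlib.crc32(key) % len(members)]

members = ["Fo 0/56", "Fo 1/56"]

# Every packet of a given flow hashes to the same member, so a single
# client-to-NAS flow never exceeds one link's bandwidth; different
# flows can land on different members.
for client in ("10.0.0.11", "10.0.0.12", "10.0.0.13", "10.0.0.14"):
    print(client, "->", pick_member(client, "10.0.0.250", members))
```

The IP addresses are made up. The point is only that distribution is per flow and does not depend on LACPDUs being exchanged.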
