NAS: sporadic momentary connection drops


mediacomposerman

New Member
Jan 17, 2022
3
0
1
We have a pair of Dell S4810 switches (OS 9.14) in a stack. They are the core switches in an isolated layer 2 LAN.

A Synology SA3600 NAS with an Intel XL710-QDA2 NIC is connected to the stack through an 802.3ad (LACP) bond over 2 x 40GbE QSFP+ links.


For many months now, the NAS connection has been dropping for about 1 second, every 3-90 hours. (I'm only troubleshooting it now because the issue was underreported and I thought just 1 client was affected, which turned out to be incorrect.)

The switch's logs track it. Typical example: (full log attached)
Code:
Sep 14 14:31:24 %STKUNIT0-M:CP %LACP-5-PORT-GROUPED: PortChannel-056-Grouped: Interface Fo 0/56 joined port-channel 56.
Sep 14 14:31:24 %STKUNIT0-M:CP %IFMGR-5-OSTATE_UP: Changed interface state to up: Po 56
Sep 14 14:31:24 %STKUNIT0-M:CP %LACP-5-PORT-GROUPED: PortChannel-056-Grouped: Interface Fo 1/56 joined port-channel 56.
Sep 14 14:31:23 %STKUNIT0-M:CP %IFMGR-5-OSTATE_DN: Changed interface state to down: Po 56
Sep 14 14:31:23 %STKUNIT0-M:CP %LACP-5-PORT-UNGROUPED: PortChannel-056-Ungrouped: Interface Fo 1/56 exited port-channel 56.
Sep 14 14:31:23 %STKUNIT0-M:CP %LACP-5-PORT-UNGROUPED: PortChannel-056-Ungrouped: Interface Fo 0/56 exited port-channel 56.
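
To measure how long and how often the port-channel goes down across the full log, the down/up events can be paired up with a small script. This is only a rough sketch: it assumes every line follows the format of the excerpt above, the function name po_outages is made up for illustration, and the year is an assumption because the switch timestamps don't include one.

```python
import re
from datetime import datetime

# Excerpt copied from the switch log above (newest entry first).
LOG = """\
Sep 14 14:31:24 %STKUNIT0-M:CP %LACP-5-PORT-GROUPED: PortChannel-056-Grouped: Interface Fo 0/56 joined port-channel 56.
Sep 14 14:31:24 %STKUNIT0-M:CP %IFMGR-5-OSTATE_UP: Changed interface state to up: Po 56
Sep 14 14:31:24 %STKUNIT0-M:CP %LACP-5-PORT-GROUPED: PortChannel-056-Grouped: Interface Fo 1/56 joined port-channel 56.
Sep 14 14:31:23 %STKUNIT0-M:CP %IFMGR-5-OSTATE_DN: Changed interface state to down: Po 56
Sep 14 14:31:23 %STKUNIT0-M:CP %LACP-5-PORT-UNGROUPED: PortChannel-056-Ungrouped: Interface Fo 1/56 exited port-channel 56.
Sep 14 14:31:23 %STKUNIT0-M:CP %LACP-5-PORT-UNGROUPED: PortChannel-056-Ungrouped: Interface Fo 0/56 exited port-channel 56.
"""

# Matches only the port-channel state changes (IFMGR OSTATE_UP / OSTATE_DN).
EVENT = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d\d:\d\d:\d\d) .*%IFMGR-5-OSTATE_(?P<state>UP|DN): "
    r"Changed interface state to (?:up|down): (?P<ifname>Po \d+)"
)


def po_outages(log_text, year=2022):
    """Return (interface, down_time, seconds_down) for each down/up pair.

    The year is an assumption: the log lines carry no year.
    """
    events = []
    for line in log_text.splitlines():
        m = EVENT.match(line.strip())
        if m:
            ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
            events.append((ts, m["ifname"], m["state"]))

    # The switch prints newest first, so sort into chronological order.
    events.sort(key=lambda e: e[0])

    down_at = {}
    outages = []
    for ts, ifname, state in events:
        if state == "DN":
            down_at[ifname] = ts
        elif ifname in down_at:
            start = down_at.pop(ifname)
            outages.append((ifname, start, (ts - start).total_seconds()))
    return outages


for ifname, start, seconds in po_outages(LOG):
    print(f"{ifname} went down at {start:%b %d %H:%M:%S} for about {seconds:.0f} s")
```

The log only has 1-second resolution, so the reported duration is approximate. Running this over the full log should show whether the gaps between flaps follow any pattern.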

According to Synology's tech support, the NAS logs don't show any link drops, but show dropped packets. Conversely, the switch sees link drops (above) but no packet issues: 0 runts, 0 giants, 0 throttles, 0 CRC, 0 overrun, 0 discarded on both ports as well as the port-channel.

  • I updated the NAS software; the problem still persists.

  • I don't think it's the cables/connections since both LACP connections always drop together (and return together).

  • I'm not seeing this issue, or any errors, on other ports/devices. Though I do see “Configuration mismatch with neighbor” on other ports. Is that just normal when a port goes live in a stack?

Any thoughts? Does it "smell" like a hardware problem? Or similar to known issues you've seen in certain driver versions, configurations, etc.?



Config:
Code:
interface fortyGigE 0/56
 description NAS03
 no ip address
! 
 port-channel-protocol LACP
  port-channel 56 mode active
 no shutdown
!
interface fortyGigE 1/56
 description NAS03
 no ip address
! 
 port-channel-protocol LACP
  port-channel 56 mode active
 no shutdown
!
interface Port-channel 56
 description NAS03
 no ip address
 switchport
 no shutdown
 


Railgun

Active Member
Jul 28, 2018
150
57
28
Use a static LAG unless you have a reason to let LACP negotiate. I see this all the time in our “legacy” environment. For such short, directly connected links, the chances of an issue that drops the link because of LACP alone are going to be slim.

Taking CRC errors? LACP negotiation doesn’t matter.

In my experience, Arista blamed it on SolarFlare, and SF blamed it on Arista. YMMV
 

mediacomposerman

New Member
Jan 17, 2022
3
0
1
My new suspicion is that it has to do with LACPDUs and mismatched timers. It makes a lot of sense and could cause the LAG to flap.
Interestingly, the default in Dell FTOS is the "fast" timeout of 1 second, yet Dell recommends the slow timeout.

Though somehow switching to the slow timeout affected the NAS instead of the switch. With the fast timeout, both sides were sending/replying to LACPDUs every 1 second, and once every 14,000 to 320,000 LACPDUs the NAS slipped, so the switch took the LAG down.
Now, with the switch's timer set to 30 seconds, the switch still sends them every 1 second and the NAS sends them every 27.8 seconds on average, so we're again ~1 second away from flapping the link.
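
For reference, here is the timer arithmetic as a rough sketch. It assumes the standard 802.3ad / 802.1AX values (LACPDUs every 1 second in fast mode and every 30 seconds in slow mode, with the receiver timing out after 3 missed intervals); whether this switch and NAS actually apply those values is something I still need to verify. In standard LACP, each side's timeout setting tells its partner how often to transmit, which would explain why changing the switch changed the NAS's sending rate. The function names are made up for illustration.

```python
# Standard LACP timer values (802.3ad / 802.1AX). Assumed here, not read
# from the switch or the NAS.
FAST_PERIODIC_S = 1.0    # partner sends an LACPDU every 1 s
SLOW_PERIODIC_S = 30.0   # partner sends an LACPDU every 30 s
TIMEOUT_MULTIPLIER = 3   # receiver gives up after 3 periodic intervals


def timeout_s(periodic_s):
    """Seconds without an LACPDU before the receiver expires its partner."""
    return periodic_s * TIMEOUT_MULTIPLIER


def pdus_between_flaps(hours, interval_s):
    """How many LACPDUs are sent in the given number of hours."""
    return hours * 3600 / interval_s


def headroom_s(timeout, observed_interval_s):
    """Slack left when LACPDUs arrive at the observed average interval."""
    return timeout - observed_interval_s


print("fast timeout:", timeout_s(FAST_PERIODIC_S), "s")
print("slow timeout:", timeout_s(SLOW_PERIODIC_S), "s")

# One flap every 3-90 hours at 1 LACPDU per second:
print("LACPDUs per flap:", pdus_between_flaps(3, 1.0), "to", pdus_between_flaps(90, 1.0))

# NAS observed sending every 27.8 s on average after the change:
print("headroom at 27.8 s:", round(headroom_s(timeout_s(SLOW_PERIODIC_S), 27.8), 1), "s")
```

If the switch really waits three intervals, the margin in slow mode is much larger than 1 second. If it expires the partner after a single 30-second interval, my ~1 second estimate above holds.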

Another solution may be to fix the LACP priorities.

If I give up on LACP, do static LAGs load-balance?
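
As I understand it, traffic distribution across LAG members is done by hashing frame header fields, independent of whether LACP or a static config forms the bundle. A minimal sketch of the idea follows. The CRC32 hash and the choice of fields are placeholders for illustration only, not the S4810's actual algorithm, and pick_member is a made-up name.

```python
import zlib

# The two member ports from the config above.
MEMBERS = ["Fo 0/56", "Fo 1/56"]


def pick_member(flow, members):
    """Map a flow (tuple of header fields) to one member link.

    CRC32 is a stand-in hash; real switches use their own hash
    and a configurable set of fields.
    """
    key = "|".join(str(field) for field in flow).encode()
    return members[zlib.crc32(key) % len(members)]


# Hypothetical flows: (source IP, destination IP, source port, destination port)
flows = [
    ("10.0.0.5", "10.0.0.10", 51000, 445),
    ("10.0.0.6", "10.0.0.10", 51001, 445),
    ("10.0.0.7", "10.0.0.10", 51002, 445),
]

for flow in flows:
    print(flow, "->", pick_member(flow, MEMBERS))
```

The point is that a given flow always lands on the same member link, so a single client stream never exceeds one link's bandwidth, with or without LACP.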