I have four Arista 7050QX-32S switches that are unfortunately no longer under support, running the latest recommended release, 4.23.4M (with the updated Aboot from Field Notice 0044; see item 1 below). The switches are configured in pairs, and all four are connected together via MLAGs. All four are connected via an uplink to the colo data center network, and the pairs run VARP (a simpler config than VRRP) to provide redundancy for incoming IP traffic.
Everything was working fine for the past few weeks, but starting last Friday one of the switches has been rebooting approximately every 13 hours. The rebooting switch happens to carry the majority of our traffic, so we can't tell whether this is traffic-related versus a hardware or software issue. Yesterday we investigated hardening the switch to eliminate externally exposed services (specifically SSH and SNMP), since we were seeing dozens of SSH attempts per minute.
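For the hardening piece, the approach we're leaning toward is a service ACL rather than a firewall. A rough sketch of what we have in mind (the ACL name and the permitted management subnet are placeholders, and the exact syntax may vary by EOS release):
Switch-01(config)#ip access-list MGMT-ONLY
Switch-01(config-acl-MGMT-ONLY)#10 permit tcp 10.0.0.0/8 any eq ssh
Switch-01(config-acl-MGMT-ONLY)#20 deny tcp any any eq ssh
Switch-01(config-acl-MGMT-ONLY)#exit
Switch-01(config)#management ssh
Switch-01(config-mgmt-ssh)#ip access-group MGMT-ONLY in
and similarly for SNMP, restricting the community to the same ACL:
Switch-01(config)#snmp-server community ourcommunity ro MGMT-ONLY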
We've spent a ton of time troubleshooting, and here are some of the finer details that may or may not be related to the issue:
1. Output of show reload cause: This looks exactly like the scenario outlined in Arista Field Notice 0044.
Switch-01#show reload cause
Reload Cause:
-------------
Unknown
Debugging Information:
----------------------
I've double- and triple-checked that the Aboot patch has been applied. I even tried to apply the other patch, v1.0.0, but v1.0.2 is newer and already installed, so v1.0.0 wouldn't install over it.
Switch-01#show extensions detail
Name: Aboot-patch-v1.0.2-419257.i686.rpm
Version: 1.0.2
Release: 16431296.edchienabootupgrade0.1
Presence: available
Status: not installed
Vendor:
Summary: Patch for BUG419257
Packages:
Total size: 0 bytes
Description:
Program the Aboot image that will disable memory power save mode
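One extra sanity check worth doing beyond show extensions: if I'm remembering the mechanism right on these fixed platforms, the kernel command line includes an Aboot= parameter identifying the bootloader image that actually booted the box, so you can confirm the patched Aboot is really in use:
Switch-01#bash cat /proc/cmdline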
2. Transceivers: We are using third-party 1000Base-TX SFP+ transceivers (in all four switches) that are not supported by Arista. But if it were the transceivers, we'd expect to see the issue on the other switches, or at least something in the logs if the switch noticed or the transceiver triggered a problem.
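If anyone thinks the optics are worth chasing, we can pull the DOM readings (temperature, voltage, Tx/Rx power) from the rebooting switch and compare them against the other three:
Switch-01#show interfaces transceiver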
3. Logging: We turned on syslog, but not much is being sent, especially around the time leading up to the reboot. When the switch reboots, all of the old logs are wiped out, so if anything is being logged right before the reboot, it isn't being saved. I just noticed there is a persistent logging option and turned it on, but I'll have to wait until the next reboot to see if it captures anything.
Switch-01#show logging
Syslog logging: enabled
Buffer logging: level debugging
Console logging: level errors
Persistent logging: level debugging
Monitor logging: level errors
Synchronous logging: disabled
Trap logging: level informational
Logging to '10.0.24.100' port 514 in VRF default via udp
Sequence numbers: disabled
Syslog facility: local4
Hostname format: Hostname only
Repeat logging interval: disabled
Repeat messages: disabled
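For reference, these are the knobs involved. We've enabled the first (persistent logging writes to flash, so it should survive the reload), and we're considering raising the trap level so more detail reaches the syslog server before the next reboot:
Switch-01(config)#logging persistent
Switch-01(config)#logging trap debugging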
4. Bad Traffic: We sniffed the traffic on the external interface and identified a lot of unnecessary traffic that the switch CPU was having to deal with: ICMP Destination Unreachable / Host Unreachable responses for a /22 subnet (1,024 addresses) that is currently unused. One of our original assessments was that there was too much script-kiddie traffic (people port-scanning and trying to hack our environment) for the switch to deal with, so we removed the upstream route to reduce the amount of scan traffic the switch would need to respond to. Short of creating a complex ACL whitelist or putting a firewall between the incoming circuit and the switch, we haven't come up with another way to mitigate this type of traffic. That said, even with all the bad traffic hitting the CPU, the 7050QX-32S should be able to handle this volume of "bad" traffic without rebooting, and EOS should be hardened enough at this point not to crash from public internet traffic traversing a 1 GbE interface.
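One lighter-weight mitigation we're considering, assuming the /22 really is unused: blackhole it locally so the scan traffic is dropped in hardware instead of being punted to the CPU to generate ICMP unreachables. The prefix below is just a placeholder for our unused /22:
Switch-01(config)#ip route 198.18.0.0/22 Null0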
5. show tech-support: I've combed through the output looking for anything out of the ordinary, and I've compared the rebooting switch's logs against the other three to try to identify an anomaly.
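One thing that may help close the visibility gap here: EOS keeps periodic gzipped tech-support snapshots on flash via a built-in scheduler (hourly by default, if I remember correctly), so there may already be a capture from shortly before each crash, and the interval can be shortened. The exact scheduler syntax varies by release, so check the defaults first:
Switch-01#bash ls /mnt/flash/schedule/tech-support/
Switch-01#show running-config all | include schedule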
Since it's happening so regularly, we're starting to rule out bad traffic, or hardware failing at almost exactly the same time every 13 hours. It's a strange interval and almost seems like a software watchdog timer is being triggered and forcing the reboot.
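Along those lines, after the next reboot we also plan to check whether any EOS agent left a crash trace or core file behind:
Switch-01#show agent logs crash
Switch-01#bash ls /var/core/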
Any ideas on how to turn up debugging or logging to capture any error messages?
Any ideas on what to do next to troubleshoot further?