Arista 7050QX-32S Rebooting every 13 hours

BigHye · Jul 7, 2020

I have four Arista 7050QX-32S switches that unfortunately are no longer under support, running the latest recommended release, 4.23.4M (with the updated Aboot from Field Notice 0044; see #1 below). The switches are configured in pairs, and all four are connected together via MLAG. All four switches have an uplink to the colo data center network, and the pairs run VARP (simpler config than VRRP) to provide redundancy for the incoming IP traffic. Everything was working fine for the past few weeks, but starting last Friday, one of the switches has been rebooting approximately every 13 hours. The switch that is rebooting happens to take the majority of the traffic, so we can't tell whether it's a traffic, hardware, or software issue. Yesterday we looked into hardening the switch to eliminate externally exposed services (specifically SSH & SNMP), since we were seeing dozens of SSH attempts per minute.

We've spent a ton of time trying to troubleshoot and here are some of the finer details that may or may not be causing the issue:

1. Output of show reload cause: This looks exactly like the scenario outlined in the Arista Field Notice 0044.
Switch-01#show reload cause
Reload Cause:
-------------
Unknown

Debugging Information:
----------------------

I've double & triple checked that the Aboot patch has been applied. I even tried to apply the other patch v1.0.0 but v1.0.2 is newer and already installed so v1.0.0 wouldn't install over it.

Switch-01#show extensions detail
Name: Aboot-patch-v1.0.2-419257.i686.rpm
Version: 1.0.2
Release: 16431296.edchienabootupgrade0.1
Presence: available
Status: not installed
Vendor:
Summary: Patch for BUG419257
Packages:
Total size: 0 bytes
Description:
Program the Aboot image that will disable memory power save mode

2. Transceivers: We are using 3rd party 1000Base-TX SFP+ transceivers (in all four switches) that are not supported by Arista. But if the transceivers were the cause, we'd expect to see the issue on the other switches too, or at least something in the logs when the switch noticed a problem.

3. Logging: We turned on syslog, but not much is being sent, especially around the time before the reboot. When the switch reboots, all the old logs are wiped out, so if something is being logged before the reboot, it's not being saved. I just noticed there is a persistent logging option and turned it on, but I'll have to wait until the next reboot to see if it captures anything.

Switch-01#show logging
Syslog logging: enabled
Buffer logging: level debugging
Console logging: level errors
Persistent logging: level debugging
Monitor logging: level errors
Synchronous logging: disabled
Trap logging: level informational
Logging to '10.0.24.100' port 514 in VRF default via udp
Sequence numbers: disabled
Syslog facility: local4
Hostname format: Hostname only
Repeat logging interval: disabled
Repeat messages: disabled

4. Bad Traffic: We sniffed the traffic on the external interface and identified a bunch of unnecessary traffic that the Arista switch CPU was having to deal with (ICMP Destination Unreachable / Host Unreachable) on a /22 subnet (1024 addresses) that is currently not being used. One of our original assessments was that the switch was having to deal with too much script-kiddie traffic from people trying to portscan / hack our environment, so we removed the upstream route to reduce the amount of portscan traffic that the switch would need to respond to. Short of creating a complex ACL whitelist or putting a firewall between the incoming circuit and the Arista switch, we haven't been able to come up with another solution to mitigate this type of traffic. Even with all the bad traffic hitting the CPU, the Arista 7050QX-32S should be able to handle it without rebooting, and Arista EOS should be hardened enough at this point not to crash due to a bunch of public internet traffic traversing a 1 GbE interface.

5. show tech-support: I've combed through the logs looking for anything out of the ordinary. I've compared the logs of the rebooting switch to the others trying to see if I can identify anything that might be an anomaly.

Since it's happening so regularly, at almost exactly the same time every 13 hours, we are starting to rule out bad traffic or a hardware failure as the cause. It's a pretty strange time frame, and it almost seems like a software watchdog timer is getting triggered and causing the reboot.

Any ideas on how to turn up debugging or logging to capture any error messages?

Any ideas on what to do next to troubleshoot further?
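On the logging question, one low-effort approach is to mine the remote syslog copy rather than the switch's own buffer, since the copy on the server survives the reboot. The sketch below assumes the server at 10.0.24.100 writes the switch's messages to a plain-text file using the usual "Mon D HH:MM:SS host ..." prefix. The function name, the two-minute silence threshold, and the file path in the usage note are all hypothetical. It lists the last line seen before each long silence, which is where a pre-reboot error would be expected to appear.

```python
from datetime import datetime, timedelta


def last_lines_before_gaps(lines, year, min_gap=timedelta(minutes=2)):
    """Return (line_before, line_after, gap) for each silence >= min_gap.

    Classic syslog timestamps carry no year, so the caller supplies one.
    Lines whose prefix does not parse as a timestamp are skipped.
    """
    parsed = []
    for line in lines:
        parts = line.split(None, 3)  # month, day, time, rest of the message
        if len(parts) < 4:
            continue
        stamp = " ".join(parts[:3])
        try:
            when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue
        parsed.append((when, line.rstrip("\n")))

    gaps = []
    # Compare each message with the one that follows it.
    for (t0, before), (t1, after) in zip(parsed, parsed[1:]):
        if t1 - t0 >= min_gap:
            gaps.append((before, after, t1 - t0))
    return gaps
```

Usage would look something like `last_lines_before_gaps(open("/var/log/switch-01.log"), 2020)`, where the path is a placeholder for wherever the syslog server stores the file. The threshold should be tuned to be shorter than a reboot but longer than normal quiet periods on the switch.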

WANg · Jul 7, 2020

Swap the RAM module (the chassis uses standard desktop DDR3) and see if the issue goes away?

Labs · Jul 7, 2020

I noticed this in your show extensions output...

Presence: available
Status: not installed

Looks like it was installed but not applied.

Can you check whether the output of show version detail | grep Aboot-norcal matches the one in the PDF file?

BigHye · Jul 7, 2020

WANg said:
Swap the RAM module (the chassis uses standard desktop DDR3) and see if the issue goes away?

Interesting idea. I'll need to order some RAM and give that a try, but I need to come up with a Plan B since the colo is in another state and I'll have to fly there to do the hands-on work. Thanks for the suggestion.

Makes me also think, since the Arista runs Linux I might be able to run a memory test to see if that's the issue... Thoughts?
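Short of a full memory test, the Linux kernel's EDAC subsystem keeps running counters of corrected (CE) and uncorrected (UE) memory errors in sysfs. The sketch below reads them, assuming the EDAC driver is loaded on the switch, that ECC memory is fitted so errors are actually reported, and that a Python 3 interpreter is reachable from the switch's Linux shell. None of this is confirmed for EOS here, and the function name is made up for illustration.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {"mc0": {"ce_count": int, "ue_count": int}, ...}.

    An empty dict means no EDAC memory controllers were found, for
    example because the driver is not loaded. A value of None means
    the counter file was missing or unreadable.
    """
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    for controller, entry in read_edac_counts().items():
        print(controller, entry)
```

A corrected-error count that keeps climbing between checks would point at the DIMM without needing to take the switch out of service. A count of zero proves little on non-ECC memory, since there is nothing to detect the errors.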

BigHye · Jul 7, 2020

Labs said:
I noticed this in your show extensions output...

Presence: available
Status: not installed

Looks like it was installed but not applied.

Can you check whether the output of show version detail | grep Aboot-norcal matches the one in the PDF file?

I noticed that too but it shows the same on all 4 switches.

Switch-01#show version detail | grep Aboot-norcal
Aboot Aboot-norcal4-4.0.7-13599834

I checked the PDF again, and the last step shows deleting the extension after installing and rebooting. Once it's cleaned up, it now shows:
Switch-01#show extensions detail
! No extensions are available

I really wish that was the problem & solution!

vangoose · Jul 7, 2020

BigHye said:
I noticed that too but it shows the same on all 4 switches.

Switch-01#show version detail | grep Aboot-norcal
Aboot Aboot-norcal4-4.0.7-13599834

I checked the PDF again, and the last step shows deleting the extension after installing and rebooting. Once it's cleaned up, it now shows:
Switch-01#show extensions detail
! No extensions are available

I really wish that was the problem & solution!

Thanks for this, I didn't know there was an Aboot update.

I just tried to install it. If you reboot and then run sh extensions again, it will show NI (not installed). If you run sh extensions before the reboot, it shows A, I.

After running sh ver, Aboot is updated to 6.1.7 from 6.1.2 on my 7060CX-32S:
show version detail | grep Aboot-norcal
Aboot Aboot-norcal6-6.1.7-13531819

BigHye · Jul 7, 2020

WANg said:
Swap the RAM module (the chassis uses standard desktop DDR3) and see if the issue goes away?

After a quick search, it looks like you're the guy who knows these inside out... For the ram, would this work or do you recommend something else?

Crucial 8GB Single DDR3L 1600 MT/s (PC3L-12800) Unbuffered UDIMM Memory

https://www.amazon.com/Crucial-PC3L-12800-Unbuffered-Memory-CT2K102464BD160B/dp/B008KSHQBU

WANg · Jul 7, 2020

BigHye said:
After a quick search, it looks like you're the guy who knows these inside out... For the ram, would this work or do you recommend something else?

Crucial 8GB Single DDR3L 1600 MT/s (PC3L-12800) Unbuffered UDIMM Memory

https://www.amazon.com/Crucial-PC3L-12800-Unbuffered-Memory-CT2K102464BD160B/dp/B008KSHQBU

Yeah, it should work, although I would go ECC (unregistered, unbuffered) if it's going into a mission-critical application. Otherwise, regular desktop RAM will work. Consider tossing in a 16GB DIMM (unofficially supported) just in case it's a memory leak or something like that. I am not sure if Arista has a memtest utility built into their userland, but considering that the switch has only a single RAM slot, the memory is something to keep an eye on "just in case".

Oh, and you might want to seek a second opinion from @fohdeesha - he's our switchgear yoda.

BigHye · Jul 8, 2020

I ordered the Crucial 8GB ECC as you suggested since this is going into a data center environment.

I'm looking at ordering another 7050QX-32S to have just in case this doesn't work but I've been eyeing the 7060CX-32S since I have Dual Port Mellanox ConnectX-5 100GbE nics in the servers. I think buying new QSFP100 cables is going to cost more than the switches...

I'm also looking at replacing the SFP's with ones from FS.com that are coded for Arista.

oddball · Jul 8, 2020

We have some 7050qx-32's in a DC as well. We're on an older version, but have no stability issues.

We did upgrade one of them with a 16GB ram module, zero problems.

One thing I have noticed at the DC... we have conditioned power. But we also have a feed straight off the breakers for the UPS. From looking at UPS logs it looks like the breaker has a voltage dip on a very regular basis. The UPS picks this up and runs for 1-3 seconds before the voltage is corrected. Since the other gear is running on conditioned power we don't have any blips.

I'm wondering if you're seeing a power issue. Could be a slight voltage intolerance that knocks a reboot. Is there a way to put a UPS between the switch and the outlet? Even just a single PSU will work.

vangoose · Jul 8, 2020

BigHye said:
I ordered the Crucial 8GB ECC as you suggested since this is going into a data center environment.

I'm looking at ordering another 7050QX-32S to have just in case this doesn't work but I've been eyeing the 7060CX-32S since I have Dual Port Mellanox ConnectX-5 100GbE nics in the servers. I think buying new QSFP100 cables is going to cost more than the switches...

I'm also looking at replacing the SFP's with ones from FS.com that are coded for Arista.

The 7060CX-32S takes 40G as well, so you can use your existing HBAs and cables. You can also use 40G to 4x10G or 100G to 4x25G breakout cables.

100G DAC from FS is dirt cheap.

BigHye · Jul 8, 2020

oddball said:
We have some 7050qx-32's in a DC as well. We're on an older version, but have no stability issues.

We did upgrade one of them with a 16GB ram module, zero problems.

One thing I have noticed at the DC... we have conditioned power. But we also have a feed straight off the breakers for the UPS. From looking at UPS logs it looks like the breaker has a voltage dip on a very regular basis. The UPS picks this up and runs for 1-3 seconds before the voltage is corrected. Since the other gear is running on conditioned power we don't have any blips.

I'm wondering if you're seeing a power issue. Could be a slight voltage intolerance that knocks a reboot. Is there a way to put a UPS between the switch and the outlet? Even just a single PSU will work.

I wasn't able to find a 16GB Unbuffered/ECC DIMM. Seems like 8GB is (was) the max for that generation of DDR3. If you were able to get a 16GB Unbuffered/ECC DIMM, can you share the part number?

I'm hosted in a high end facility with very clean power. Each circuit is on 3Phase 30AMP circuit going into a Smart PDU where I can see the power draw per port. There is a second 7050QX-32S in the rack slot below it connected to the same PDU. Nothing out of the ordinary and none of the other devices in the rack are having an issue.

If it were the power, why would it reboot every 13hrs like clockwork?

The idea that it's the RAM and some type of watchdog timer kicking in makes the most logical sense.

BigHye · Jul 8, 2020

Just checked the logs and found this gem:

Jul 8 05:30:39 Switch-01 kernel: [44985.324612] mce: [Hardware Error]: Machine check events logged
Jul 8 05:30:39 Switch-01 kernel: [44985.324629] mce: [EDAC]: The following error was corrected by the hardware. Further action is likely not required.
Jul 8 05:30:39 Switch-01 kernel: [44985.324633] [Hardware Error]: Corrected error, no action required.
Jul 8 05:30:39 Switch-01 kernel: [44985.324637] [Hardware Error]: CPU:0 (16:0:1) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x9c0d40006e080813
Jul 8 05:30:39 Switch-01 kernel: [44985.324643] [Hardware Error]: Error Addr: 0x000000008e9fad20
Jul 8 05:30:39 Switch-01 kernel: [44985.324645] [Hardware Error]: MC4 Error (node 0): DRAM ECC error detected on the NB.
Jul 8 05:30:39 Switch-01 kernel: [44985.324662] EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x8e9fa offset:0xd20 grain:0 syndrome:0x6e1a)
Jul 8 05:30:39 Switch-01 kernel: [44985.324664] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
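Two numbers in this log are worth decoding. The bracketed kernel timestamp is seconds since boot, and the EDAC line reports the failing location as a page number plus an offset. The sketch below pulls both out of the EDAC line and rebuilds the physical address. It assumes 4 KiB pages (a shift of 12), and the function name and returned keys are illustrative only.

```python
import re

# Matches the "EDAC MCn: N CE on ... (csrow:.. channel:.. page:.. offset:..)" line.
EDAC_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+EDAC MC(?P<mc>\d+): (?P<count>\d+) CE .*?"
    r"csrow:(?P<csrow>\d+) channel:(?P<channel>\d+) "
    r"page:0x(?P<page>[0-9a-fA-F]+) offset:0x(?P<offset>[0-9a-fA-F]+)"
)


def parse_edac_ce(line, page_shift=12):
    """Decode one EDAC corrected-error line, or return None if it does not match."""
    m = EDAC_RE.search(line)
    if m is None:
        return None
    # Kernel timestamps count seconds since boot.
    hours, rem = divmod(int(float(m["uptime"])), 3600)
    minutes, seconds = divmod(rem, 60)
    # Physical address = page number shifted by the page size, plus the offset.
    address = (int(m["page"], 16) << page_shift) | int(m["offset"], 16)
    return {
        "uptime_hms": (hours, minutes, seconds),
        "mc": int(m["mc"]),
        "csrow": int(m["csrow"]),
        "channel": int(m["channel"]),
        "address": address,
    }


line = ("Jul 8 05:30:39 Switch-01 kernel: [44985.324662] EDAC MC0: 1 CE on "
        "mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x8e9fa offset:0xd20 "
        "grain:0 syndrome:0x6e1a)")
info = parse_edac_ce(line)
print(info["uptime_hms"])     # (12, 29, 45)
print(hex(info["address"]))   # 0x8e9fad20
```

The rebuilt address matches the "Error Addr: 0x000000008e9fad20" line above, and an uptime of about 12.5 hours puts this corrected error late in one of the roughly 13-hour cycles.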

So @WANg, your theory that the RAM is the culprit might be right after all. You'd think if the switch was experiencing hardware issues, it would have better notification...

Unfortunately, the RAM won't arrive until Monday at the earliest.

BigHye · Jul 16, 2020

Switch died on Monday with a memory error... Finally able to get out to the Data Center today and after about an hour of playing cable Jenga, we were able to get the switch out of the rack. Just put the 8GB DIMM in and heading to get some food to see if it barfs before putting it back in the rack.

The best part about this whole experience was that the failover has been seamless. Our previous setup with 10GbE Force10 S4810s took so long to converge that half the VMs would lose their minds and drop their connections to the datastores.

Big test will be in 13 hours... Gotta see if this guy is going to reboot again.

BigHye · Jul 20, 2020

Replacing the DIMM did the trick and the switch has been healthy since Thursday!

Thank you everyone for your help and suggestions.
