Arista DCS-7050QX-32S Spontaneous Restart Issue


LodeRunner

Active Member
Apr 27, 2019
540
227
43
So I've had this switch for a few weeks now and twice it's just restarted. "sh reload cause" reports unknown. Both times the lastlog file in /var/log was over a gigabyte, and "bash sudo less /var/log/lastlog" reported it as a binary file, which rendered as gibberish on screen (as expected of a binary file, but lastlog should be a copy of messages, which is not binary).

The first reboot was 3-4 days after enabling MVRP, and /var/log/messages was full of MVRP spam where it was not properly talking to my Brocades. I disabled MVRP and it made it 18 days this time. The GZ files in /mnt/flash/scheduled/tech-support from after the restart all report the same uptime, which I know to be false based on the actual uptime listed in "sh ver" and "bash uptime".

Switch is running 4.25.3M-21804478.4253M and I assume it has the factory stock SSD/eMMC module and RAM as the seller made no note of parts having been replaced. What tests can I run without downing the switch to see if the storage is failing?

Should I stand up a syslog server on a VM and point the switch at it as well, to hopefully catch what's happening? What logging options should I enable to try pinning this down? I see "logging persistent" to write to the flash, but if I suspect the flash storage, that's not a great idea.
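For the syslog-server-on-a-VM idea, a minimal UDP listener could look something like the sketch below. It assumes the switch sends plain syslog over UDP with the usual leading <PRI> field; the log file name, the function names and the log_path attribute are made up for illustration. On the switch side, EOS is pointed at a collector with "logging host <address>" in config mode.

```python
import re
import socketserver
from datetime import datetime, timezone

# Syslog severity names in numeric order (0 = most severe).
SEVERITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]

PRI_RE = re.compile(r"^<(\d{1,3})>")


def decode_pri(message):
    """Split the leading <PRI> field into (facility, severity, rest).

    PRI = facility * 8 + severity. Returns (None, None, message) when the
    message has no valid PRI field.
    """
    match = PRI_RE.match(message)
    if not match:
        return None, None, message
    pri = int(match.group(1))
    if pri > 191:  # highest valid value: facility 23, severity 7
        return None, None, message
    return pri // 8, pri % 8, message[match.end():]


def format_line(source_ip, message, now=None):
    """Build one log line: receive time, sender, severity name, text."""
    now = now or datetime.now(timezone.utc)
    _facility, severity, rest = decode_pri(message)
    name = SEVERITIES[severity] if severity is not None else "unknown"
    return f"{now.isoformat()} {source_ip} {name} {rest.strip()}"


class SyslogHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # For UDP servers, self.request is a (data, socket) pair.
        data = self.request[0]
        text = data.decode("utf-8", errors="replace")
        line = format_line(self.client_address[0], text)
        with open(self.server.log_path, "a", encoding="utf-8") as log:
            log.write(line + "\n")


def serve(bind_addr="0.0.0.0", port=514, log_path="switch-syslog.log"):
    """Run until interrupted. Port 514 normally needs elevated privileges."""
    with socketserver.UDPServer((bind_addr, port), SyslogHandler) as server:
        server.log_path = log_path  # hypothetical attribute read by the handler
        server.serve_forever()
```

Calling serve() starts the listener. Each message is stamped with the collector's own receive time, so the last lines before a reset are easy to line up even if the switch clock is off. Messages that never leave the switch before it goes down will of course not show up here.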
 

WANg

Well-Known Member
Jun 10, 2018
1,308
971
113
46
New York, NY
Whatever it was, it probably caused a crash-and-coredump, and I doubt rsyslog would help in those situations. I'd suspect either bad RAM (more likely) or bad flash media (less likely) first, then maybe look at your SFP modules (if applicable) and see if any of them seem unstable in terms of heat, voltage or power output. If you've got spare DDR3 RAM (even desktop DIMMs), swap it in first, fire it up and see if the issue goes away. As for whether the flash is failing, eeeeeh, take it out, run smartctl on it in another machine and see if it reports anything funky?
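If the flash module does get pulled and attached to another machine, the smartctl check could be scripted roughly like this. It is only a sketch, assuming smartmontools 7.0 or newer (for --json output) and a module that actually exposes SMART data, which eMMC-style parts may not. /dev/sdX is a placeholder and the helper names are invented for illustration.

```python
import json
import subprocess


def summarize_smart(report):
    """Reduce a parsed `smartctl --json` report to a one-word verdict.

    Returns "passed", "failed", or "unknown" (no SMART status present,
    for example when the device does not support SMART).
    """
    if not isinstance(report, dict):
        return "unknown"
    status = report.get("smart_status")
    if not isinstance(status, dict) or "passed" not in status:
        return "unknown"
    return "passed" if status["passed"] else "failed"


def check_device(device):
    """Run `smartctl --json -H <device>` and summarize the result.

    smartctl uses non-zero exit codes as a bitmask for warnings, so the
    exit status is deliberately not treated as an error here.
    """
    proc = subprocess.run(
        ["smartctl", "--json", "-H", device],
        capture_output=True, text=True, check=False,
    )
    try:
        report = json.loads(proc.stdout)
    except json.JSONDecodeError:
        return "unknown"
    return summarize_smart(report)
```

Something like print(check_device("/dev/sdX")), run as root, gives the overall verdict. A "passed" result only means the drive has not flagged itself as failing, so the full attribute dump from "smartctl -a" is still worth a read.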
 

thefloyd

New Member
Dec 21, 2020
29
7
3
'lastlog' is a binary database file that contains the last login information of each user on the system (lastlog(8) - Linux man page). Where did you get the idea that it was a copy of 'messages'?
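To make the "binary database" point concrete: lastlog is an array of fixed-size records indexed by UID, and it is normally a sparse file, so the size shown by ls can be far larger than what it occupies on disk. Below is a rough sketch of reading it, assuming the common glibc layout on little-endian Linux (32-bit timestamp, 32-byte terminal name, 256-byte host name, 292 bytes per record). Other platforms may differ, and the function name is just for illustration.

```python
import struct
from datetime import datetime, timezone

# Assumed glibc layout: uint32 ll_time, char ll_line[32], char ll_host[256].
RECORD = struct.Struct("<I32s256s")


def parse_lastlog(data):
    """Yield (uid, login_time, terminal, host) for every UID that has logged in.

    The record for UID n lives at offset n * RECORD.size; slots that were
    never written are all zero bytes and are skipped.
    """
    for uid in range(len(data) // RECORD.size):
        ts, line, host = RECORD.unpack_from(data, uid * RECORD.size)
        if ts == 0:
            continue
        yield (
            uid,
            datetime.fromtimestamp(ts, tz=timezone.utc),
            line.split(b"\0", 1)[0].decode(errors="replace"),
            host.split(b"\0", 1)[0].decode(errors="replace"),
        )
```

Feeding it the bytes of /var/log/lastlog lists who last logged in and from where. For a file with a very large apparent size, reading it in chunks or through mmap would be kinder than loading the whole thing at once.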
 

LodeRunner

I confused it with the behavior of another system and a different file name.

If it supports registered ECC, I have some unused 16GB sticks from a Dell server (Samsung M393B2G70QH0-YK0). Otherwise I’ll have to order something.
 

thefloyd

well the good news is you don't have to chase down a file corruption problem! :D
 

LodeRunner

Seems like it was bad RAM. I found an old 8GB UDIMM from when I rebuilt my wife's machine and it's running happily on that. My only ECC DIMMs were in fact registered, which, as I understand it, the management board does not support.

When I had MVRP enabled, it would crash at random every 3 to 5 days, I suspect because logging filled memory until it hit the bad address. With MVRP disabled, the switch generated far fewer logging events and made it 18-40 days between resets.

With the new DIMM, it's at 18 days with MVRP enabled and spamming the log. Hopefully that puts paid to the random resets and I can re-establish my cluster (having a switch death cause a VSAN split-brain is un-fun; I should have bought two of these to do MLAG before they damn near tripled in price).

Current status
Code:
core-40g#bash uptime
 08:30:26 up 18 days,  1:42,  1 user,  load average: 0.15, 0.18, 0.22
core-40g#bash free -m
             total       used       free     shared    buffers     cached
Mem:          7909       4142       3766          0        213       2836
-/+ buffers/cache:       1093       6815
Swap:            0          0          0
core-40g#sh ver
Arista DCS-7050QX-32S-F
Hardware version: 11.34
Serial number: 
Hardware MAC address: 
System MAC address: 

Software image version: 4.25.3M
Architecture: i686
Internal build version: 4.25.3M-21804478.4253M
Internal build ID: 9d5b03e9-8d80-47e4-a439-70c83d1ae0ef

Uptime: 2 weeks, 4 days, 3 hours and 28 minutes
Total memory: 8099124 kB
Free memory: 6629676 kB
Thanks for pointing out RAM and that it would take practically any DDR3 UDIMM.