Arista DCS-7050QX-32S Spontaneous Restart Issue

LodeRunner · Dec 18, 2021

So I've had this switch for a few weeks now and twice it's just restarted. "sh reload cause" reports the cause as unknown. Both times the lastlog file in /var/log was over a gigabyte, and "bash sudo less /var/log/lastlog" reported it as a binary file, which rendered as gibberish on screen (as expected of a binary file, but lastlog should be a copy of messages, which is not binary).

The first reboot was 3-4 days after enabling MVRP, and /var/log/messages was full of MVRP spam because it was not talking properly to my Brocades. I disabled MVRP and it made it 18 days this time. The GZ files in /mnt/flash/scheduled/tech-support from after the restart all report the same uptime, which I know to be false based on the actual uptime listed in "sh ver" and "bash uptime".

Switch is running 4.25.3M-21804478.4253M and I assume it has the factory stock SSD/eMMC module and RAM as the seller made no note of parts having been replaced. What tests can I run without downing the switch to see if the storage is failing?

Should I stand up a syslog server on a VM and point the switch at it as well, to hopefully catch what's happening? What logging options should I enable to try pinning this down? I see "logging persistent" writes to the flash, but if I suspect the flash storage, that's not a great idea.
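
For anyone wanting to try the remote-logging idea, here is a minimal sketch of a UDP syslog catcher in Python. It assumes the switch sends classic BSD-style syslog over UDP, where each message starts with `<PRI>` and PRI = facility * 8 + severity. The port (5514, chosen to avoid needing root for 514) and the log file name are placeholders, not anything from the switch. A real rsyslog or syslog-ng install would be sturdier, and any remote collector only sees what the switch manages to send before it goes down.

```python
import socketserver

# Hypothetical output file; pick any path on the VM.
LOG_PATH = "switch-syslog.log"

# Severity names for the low three bits of the PRI value.
SEVERITIES = ["emerg", "alert", "crit", "err",
              "warning", "notice", "info", "debug"]


def parse_pri(message):
    """Split a leading <PRI> into (facility, severity, rest), or None."""
    if not message.startswith("<"):
        return None
    end = message.find(">")
    if end < 2:
        return None
    try:
        pri = int(message[1:end])
    except ValueError:
        return None
    return pri // 8, pri % 8, message[end + 1:]


class SyslogHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # For UDP servers, self.request is a (data, socket) pair.
        data, _sock = self.request
        text = data.decode("utf-8", errors="replace").strip()
        parsed = parse_pri(text)
        label = SEVERITIES[parsed[1]] if parsed else "?"
        # Append immediately so the last lines before a reset are on disk.
        with open(LOG_PATH, "a") as fh:
            fh.write(f"{self.client_address[0]} [{label}] {text}\n")


def serve(host="0.0.0.0", port=5514):
    """Listen until interrupted; the port is an assumption."""
    with socketserver.UDPServer((host, port), SyslogHandler) as server:
        server.serve_forever()
```

Calling `serve()` on the VM starts the listener; the switch then needs to be pointed at that address and port.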

WANg · Dec 18, 2021

Whatever it was, it probably caused a crash-and-coredump; I doubt rsyslog would help in those situations. I'd suspect either bad RAM (more likely) or bad flash media (less likely) first, then maybe look at your SFP modules (if applicable) and see if any of them seem unstable in terms of heat, voltage or power output. If you've got spare DDR3 RAM (even desktop DIMMs), swap it out first, fire it up and see if the issue goes away. As for whether the flash is failing, eeeeeh, take it out, run smartctl against it on another machine and see if it reports anything funky?

thefloyd · Dec 19, 2021

LodeRunner said:
So I've had this switch for a few weeks now and twice it's just restarted. "sh reload cause" reports the cause as unknown. Both times the lastlog file in /var/log was over a gigabyte, and "bash sudo less /var/log/lastlog" reported it as a binary file, which rendered as gibberish on screen (as expected of a binary file, but lastlog should be a copy of messages, which is not binary).

The first reboot was 3-4 days after enabling MVRP, and /var/log/messages was full of MVRP spam because it was not talking properly to my Brocades. I disabled MVRP and it made it 18 days this time. The GZ files in /mnt/flash/scheduled/tech-support from after the restart all report the same uptime, which I know to be false based on the actual uptime listed in "sh ver" and "bash uptime".

Switch is running 4.25.3M-21804478.4253M and I assume it has the factory stock SSD/eMMC module and RAM as the seller made no note of parts having been replaced. What tests can I run without downing the switch to see if the storage is failing?

Should I stand up a syslog server on a VM and point the switch at it as well, to hopefully catch what's happening? What logging options should I enable to try pinning this down? I see "logging persistent" writes to the flash, but if I suspect the flash storage, that's not a great idea.

'lastlog' is a binary database file that contains the last login information of each user on the system (lastlog(8) - Linux man page). Where did you get the idea that it was a copy of 'messages'?
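
To make the "binary database" point concrete, here is a rough Python sketch that decodes lastlog records. It assumes the usual glibc layout on Linux with a 32-bit time field: a 4-byte login time, a 32-byte terminal name and a 256-byte host name, so 292 bytes per record, with the record position equal to the UID. I have not checked this layout on EOS itself, so treat the format string as an assumption. On typical Linux systems the file is also sparse, so a large apparent size does not have to mean that much space is really in use; whether that explains the gigabyte seen here is a guess.

```python
import struct

# Assumed layout: int32 ll_time, char ll_line[32], char ll_host[256].
# "<" disables padding, so the record size is 4 + 32 + 256 = 292 bytes.
RECORD = struct.Struct("<i32s256s")


def _cstr(raw):
    """Decode a NUL-padded C string field."""
    return raw.split(b"\0", 1)[0].decode("utf-8", errors="replace")


def read_lastlog(raw):
    """Yield (uid, login_time, terminal, host) for every non-empty record."""
    for uid in range(len(raw) // RECORD.size):
        ts, line, host = RECORD.unpack_from(raw, uid * RECORD.size)
        if ts == 0:
            # A zeroed record means this UID has never logged in.
            continue
        yield uid, ts, _cstr(line), _cstr(host)
```

A copy of the file pulled off the switch could be fed in with `read_lastlog(open("lastlog", "rb").read())`, though a very large file would be better read in chunks.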

LodeRunner · Dec 19, 2021

I confused it with a differently named file that behaves that way on another system.

If it supports registered ECC, I have some unused 16GB sticks from a Dell server (Samsung M393B2G70QH0-YK0). Otherwise I’ll have to order something.

thefloyd · Dec 19, 2021

LodeRunner said:
I confused it with a differently named file that behaves that way on another system.

If it supports registered ECC, I have some unused 16GB sticks from a Dell server. Otherwise I’ll have to order something.

well the good news is you don't have to chase down a file corruption problem!

LodeRunner · Dec 19, 2021

thefloyd said:
well the good news is you don't have to chase down a file corruption problem!

Ayup.

LodeRunner · May 18, 2022

WANg said:
Whatever it was, it probably caused a crash-and-coredump; I doubt rsyslog would help in those situations. I'd suspect either bad RAM (more likely) or bad flash media (less likely) first, then maybe look at your SFP modules (if applicable) and see if any of them seem unstable in terms of heat, voltage or power output. If you've got spare DDR3 RAM (even desktop DIMMs), swap it out first, fire it up and see if the issue goes away. As for whether the flash is failing, eeeeeh, take it out, run smartctl against it on another machine and see if it reports anything funky?

Seems like it was bad RAM. I found an old 8GB UDIMM from when I rebuilt my wife's machine and it's running happily on that. My only ECC DIMMs were in fact registered, which, as I understand it, the management board does not support.

When I had MVRP enabled, it would crash at random every 3 to 5 days, I suspect because logging filled memory until it hit the bad address. With MVRP disabled, the switch generated far fewer logging events and made it 18-40 days between resets.

With the new DIMM, it's at 18 days with MVRP enabled and spamming the log. Hopefully that puts paid to the random resets and I can re-establish my cluster (having a switch death cause a VSAN split-brain is un-fun; I should have bought two of these to do MLAG before they damn near tripled in price).

Current status

Code:

core-40g#bash uptime
 08:30:26 up 18 days,  1:42,  1 user,  load average: 0.15, 0.18, 0.22
core-40g#bash free -m
             total       used       free     shared    buffers     cached
Mem:          7909       4142       3766          0        213       2836
-/+ buffers/cache:       1093       6815
Swap:            0          0          0
core-40g#sh ver
Arista DCS-7050QX-32S-F
Hardware version: 11.34
Serial number: 
Hardware MAC address: 
System MAC address: 

Software image version: 4.25.3M
Architecture: i686
Internal build version: 4.25.3M-21804478.4253M
Internal build ID: 9d5b03e9-8d80-47e4-a439-70c83d1ae0ef

Uptime: 2 weeks, 4 days, 3 hours and 28 minutes
Total memory: 8099124 kB
Free memory: 6629676 kB
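
For anyone reading the numbers above, a small Python sketch of how the `free -m` lines relate to each other. It assumes the older procps column order shown in the paste (total, used, free, shared, buffers, cached); the function name and dictionary keys are just illustrative.

```python
def summarize_free(mem_line):
    """Recompute the '-/+ buffers/cache' figures from the 'Mem:' line."""
    fields = mem_line.split()
    total, used, free, shared, buffers, cached = (int(x) for x in fields[1:7])
    return {
        # Memory held by processes once reclaimable buffers/cache are excluded.
        "used_by_apps": used - buffers - cached,
        # Memory that could be handed out if buffers/cache were dropped.
        "available": free + buffers + cached,
        "total": total,
    }


line = "Mem:          7909       4142       3766          0        213       2836"
print(summarize_free(line))
```

With the line pasted above this gives 1093 MB used by applications and 6815 MB available, matching the "-/+ buffers/cache" row, so most of the "used" figure is cache rather than process memory.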

Thanks for pointing out RAM and that it would take practically any DDR3 UDIMM.
