Mellanox SX6036 died - 100% CPU utilization


NablaSquaredG

Layer 1 Magician
Aug 17, 2020
797
408
63
Hey,

So I was trying to get an MLAG running with two SX6036s, but it didn't work with the CX3...

Somehow, the switch broke.

There are two things that happened:
First, I managed to segfault the CLI process (duh!) by trying to abort a write configuration text process...
Later, I tried to debug the LACP issues and ran "debug generate dump", which took forever!

Well, at some point (I don't know exactly when), the switch started sitting at 100% CPU utilization constantly.

I tried everything: rebooting, removing the CMOS battery, etc.

At some point I was able to get into the switch, and after some time the "module" (probably the management module) was initialised and I was able to get into the CLI.
Still showed 100% CPU, but after about 10 minutes CPU dropped to low levels.

I entered the license key and changed to eth-single-switch, and again got the "Modules are being configured" message, with "1 module remains to be configured" taking forever.
Logging into the switch takes forever and CPU usage is very high, even though nothing is connected except serial.


Any ideas what this could be? Dead flash? NVRAM issue? I'm out of ideas.
 

Stephan

Well-Known Member
Apr 21, 2017
700
491
63
Germany
No experience with the SX6036, but slowness might mean a corrupt bootstrap EEPROM. See if you can dump it (16 bytes):

/opt/tms/bin/mellaggra _read 0 0x52 0 1 16

This is for the SX6012 and might be vastly different on the SX6036. On the SX6012 the output should be either

DDR 166 MHz: 86 82 96 1a d9 80 0 e0 c0 8 23 50 d 5 0 0 or
DDR 200 MHz: 86 82 96 19 b9 80 0 e0 c0 8 23 50 d 5 0 0

I have attached a zipped XLS from the CPU design team, which lets you check values in case they are different and suspicious. Ideally you want to dump every byte of flash and EEPROM of such devices when you get them, so you can later correct any corruption.
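
If you want to check a dump against the two SX6012 values above without eyeballing hex, here is a minimal Python sketch to run on your own machine (not on the switch). The function names are made up for illustration, and it assumes the read command prints 16 space-separated hex bytes as shown above. A mismatch on an SX6036 does not by itself prove corruption, since the expected values there may simply differ.

```python
# Reference dumps quoted in this post; they are for the SX6012 only.
SX6012_REFERENCE = {
    "DDR 166 MHz": "86 82 96 1a d9 80 0 e0 c0 8 23 50 d 5 0 0",
    "DDR 200 MHz": "86 82 96 19 b9 80 0 e0 c0 8 23 50 d 5 0 0",
}


def parse_dump(text):
    """Parse space-separated hex bytes (as printed by the read command) into ints."""
    values = [int(token, 16) for token in text.split()]
    if len(values) != 16 or any(not 0 <= v <= 0xFF for v in values):
        raise ValueError("expected exactly 16 byte values")
    return values


def match_reference(dump_text):
    """Return the name of the matching SX6012 reference, or None if neither matches."""
    dump = parse_dump(dump_text)
    for name, reference in SX6012_REFERENCE.items():
        if dump == parse_dump(reference):
            return name
    return None


if __name__ == "__main__":
    # Paste the line printed by the switch here (this sample is the 200 MHz reference).
    print(match_reference("86 82 96 19 b9 80 0 e0 c0 8 23 50 d 5 0 0"))
```

Parsing to integers first means "d" and "0d" compare equal, so leading zeros in a pasted dump don't cause a false mismatch.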
 

Attachments


NablaSquaredG

Layer 1 Magician
Aug 17, 2020
797
408
63
I recommend re-flashing the 3.6.8012 firmware from manufacturing mode. It seems that updating from older firmware versions to 3.6.8012 results in a much slower CLI than a completely fresh install of 3.6.8012.
 

the_imperfectionest

New Member
Nov 7, 2021
6
0
1
I believe it's the HPE SX6036 variant. It has no external HPE stickers, though it does appear someone badly removed some stickers (and I was able to register it with HPE using the serial number).

 

the_imperfectionest

New Member
Nov 7, 2021
6
0
1
Interesting,

Maybe different, maybe corrupt. I'm not versed in the fun world of EEPROM magic.


Here's mine off that 6036:


[admin@switch-6304dc var]# /opt/tms/bin/mellaggra _read 0 0x52 0 1 16
86 82 96 19 b9 80 0 a0 40 8 23 50 5 d 0 0
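
To see exactly where this differs from the SX6012 200 MHz value posted earlier in the thread, here's a rough Python sketch that lists the differing byte offsets. The helper names are just for illustration. It reports offsets 7, 8, 12 and 13 as different; whether that means a different board or corruption, I can't tell from this alone.

```python
# SX6012 DDR 200 MHz reference from earlier in the thread, and the dump from this SX6036.
SX6012_DDR_200 = "86 82 96 19 b9 80 0 e0 c0 8 23 50 d 5 0 0"
SX6036_DUMP = "86 82 96 19 b9 80 0 a0 40 8 23 50 5 d 0 0"


def to_bytes(text):
    """Convert space-separated hex bytes into a list of ints."""
    return [int(token, 16) for token in text.split()]


def diff_dumps(reference, dump):
    """Return (offset, reference_byte, dump_byte) for every position that differs."""
    ref, got = to_bytes(reference), to_bytes(dump)
    if len(ref) != len(got):
        raise ValueError("dumps have different lengths")
    return [(i, r, g) for i, (r, g) in enumerate(zip(ref, got)) if r != g]


if __name__ == "__main__":
    for offset, expected, actual in diff_dumps(SX6012_DDR_200, SX6036_DUMP):
        print(f"offset {offset}: reference 0x{expected:02x}, this dump 0x{actual:02x}")
```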
 

NablaSquaredG

Layer 1 Magician
Aug 17, 2020
797
408
63
@NablaSquaredG - I can't find this other thread, and I don't understand where to enter the mfg_nodhcp command.
You can find more info in the conversion guide:

And the thread:
 