Mellanox Switches - Tips & Tricks


NablaSquaredG

Layer 1 Magician
Aug 17, 2020
1,346
820
113
Thanks for this summary. Is there any difference between the 7800 and the 7890 apart from the network module? If I change the network module in a 7890, will it boot up like a 7800?
The SB7890 is unmanaged; the SB7800 is managed. You would need to swap the complete control plane PCB, cables, etc. The ASIC firmware is also different.
 

hardwaretuner

New Member
Feb 7, 2024
5
0
1
5. Download Onyx 3.10 installer from mega.nz link in first page (should work - if your switch was on ancient firmware before, use the 3.9 instead)
Does that mean the firmware for my SB7800 is included in Mellanox -> Recovery -> 3.9.3202 -> X86_64-3.9.3202-installer.bin, or do I need a separate version?
 

NablaSquaredG

Does that mean the firmware for my SB7800 is included in Mellanox -> Recovery -> 3.9.3202 -> X86_64-3.9.3202-installer.bin, or do I need a separate version?
Yes, that's the right one.

You can then upgrade to a newer version, like 3.10.5000 (3.10.6XXX and 3.11.XXXX are NOT supported on the SB7800).
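
The version limit above can be written down as a small check. This is only a sketch of the rule as stated in the post, not an official compatibility table. The function names are made up for illustration, and reading "3.10.6XXX" as "every build from 3.10.6000 upward" is an assumption.

```python
import re
from typing import Tuple


def parse_onyx_version(text: str) -> Tuple[int, int, int]:
    """Split a version such as '3.10.5000' into integers (3, 10, 5000)."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"not a three-part version: {text!r}")
    major, minor, build = match.groups()
    return int(major), int(minor), int(build)


def supported_on_sb7800(text: str) -> bool:
    """Apply only the rule from the post: 3.10.6XXX and 3.11.XXXX are out."""
    # Assumption: "3.10.6XXX" covers every build from 3.10.6000 upward.
    # The rule says nothing about how old a release may be.
    return parse_onyx_version(text) < (3, 10, 6000)


# 3.9.3202 and 3.10.5000 appear in the thread; the last two build numbers
# are hypothetical and only exercise the "not supported" branch.
for candidate in ("3.9.3202", "3.10.5000", "3.10.6000", "3.11.1000"):
    print(candidate, supported_on_sb7800(candidate))
```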
 

Civiloid

Member
Jan 15, 2024
39
22
8
I'm unsure if that would be helpful, but I'll describe my journey with SN2100 recovery.

I got my unit cheaply (cheap enough to risk it) for a home lab, as it was sold as "not working, doesn't turn on". After poking around with a multimeter, I saw that with the mains cable plugged in there was 220V on the PSU input but nothing on its output. That was the case on both PSUs. (I initially thought only one PSU had failed and the other was getting no input voltage, but later I realised that the mains cable there needed to be pushed in much harder than I had been doing.)

So I replaced the PSU with a Meanwell RPS-300-12. (Mellanox ships a Delta mds300apb-12; ordering the Meanwell was probably a small mistake, but it was half the price of the Delta and had a better reputation among everybody I asked for advice.) One problem with the Meanwell as a replacement is that neither the screw sizes nor the screw positions match the Delta's (I plan to model a small adapter plate and will share it if I ever do), and the connectors are different as well. In my case, someone had tried to fix the switch before, and the sensing cable (the one that reports PWR_OK and other useful signals) was damaged anyway.

However, I screwed in the PSU (on top of some insulator), and the switch started to boot. I bought a random RJ45->DB9 cable, plugged in my Prolific serial adapter (which I have had for a very long time), and saw a BIOS screen, GRUB and then some old firmware (3.6.2102). The BIOS password was not `admin`, so no luck there. After boot, input was not working at all; I have no idea why.
I bought a new M.2 SATA SSD (the only one available in the local store was a 512GB Transcend 430, which should be good enough for the OS). With the fresh SSD, the switch happily booted from flash, and I could first embed ONIE and then use "Install ONIE" from disk to flash the OS (I put the 3.9.3202 recovery firmware on an HTTP server running on my router, and the switch happily downloaded it).
During that time, input over serial was not working very well, like something was broken (sometimes characters resulted in garbage being sent); I'm not sure if that is just a PL2303 problem or something else.
After the recovery image was flashed, the switch happily booted and reported a relatively healthy state. I still have only one PSU, so it thinks the second one is faulty, but the SN2100 is happy even when PWR_OK reporting is missing completely; from its point of view the first PSU is healthy. It seems to be working just fine.
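
For anyone who would rather serve the installer from a PC than from a router, here is a minimal Python sketch of a static HTTP server using only the standard library. The function name `serve_directory`, the default port and the idea of keeping the file in its own directory are assumptions for illustration; the post does not say which server software or file name was used.

```python
import functools
import http.server
import threading
from pathlib import Path


def serve_directory(
    directory: Path, host: str = "0.0.0.0", port: int = 8000
) -> http.server.ThreadingHTTPServer:
    """Start a background HTTP server that exposes the files in `directory`."""
    # SimpleHTTPRequestHandler serves static files; `directory` pins its root
    # so the server does not depend on the current working directory.
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(directory)
    )
    server = http.server.ThreadingHTTPServer((host, port), handler)
    # Daemon thread: the server stops when the main program exits.
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `serve_directory(Path("firmware"))` and keeping the process alive would expose everything in a hypothetical `firmware` directory. Which URL or file name the switch requests depends on how the install is started and is not covered by this sketch.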
 

Dade49

New Member
Mar 26, 2021
8
3
3
During that time, input over serial was not working very well, like something was broken (sometimes characters resulted in garbage being sent);
Check your baud rate. If it's an older BIOS, you may need 9600, but newer BIOS wants 115200. I saw some weird characters when connecting at 9600 on my SN2700M, but that all cleared up when connected at 115200. Perhaps your issue is something else though.
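
A rough way to tell a wrong baud rate from a real fault is to look at how much of the captured console output is readable text. The sketch below is a heuristic only: the function names, the 0.9 threshold and the sample bytes are assumptions, and it works on bytes you have already captured rather than talking to a serial port.

```python
from typing import Dict, Optional


def printable_ratio(data: bytes) -> float:
    """Fraction of bytes that are printable ASCII, tab, CR or LF."""
    if not data:
        return 0.0
    readable = sum(1 for b in data if 32 <= b <= 126 or b in (9, 10, 13))
    return readable / len(data)


def likely_baud(captures: Dict[int, bytes], threshold: float = 0.9) -> Optional[int]:
    """Return the baud rate whose capture looks most like text, if any does."""
    best = max(captures, key=lambda rate: printable_ratio(captures[rate]), default=None)
    if best is None or printable_ratio(captures[best]) < threshold:
        return None  # nothing looks like text: suspect cable, adapter or port
    return best


# Made-up sample captures: line noise at 9600, readable text at 115200.
samples = {
    9600: b"\xfe\x00\x80\xf8\xe0\x00\x9c",
    115200: b"login: \r\n",
}
print(likely_baud(samples))
```

If no rate clears the threshold, the baud setting is probably not the only problem, which matches the "perhaps your issue is something else" caveat above.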
 

Dade49

Short summary

1. MOST IMPORTANT STEP: REPLACE SSD - I recommend Transcend MSA452T-I 128GB
2. Download ONIE recovery image, flash to USB stick, insert USB stick into switch
3. Power Cycle switch, hammer CTRL+B to get into BIOS, if password required, try „admin“ (without quotes)
4. boot from USB (UEFI Mode!!), select „Embed ONIE“ once ONIE has booted
5. Download Onyx 3.10 installer from mega.nz link in first page (should work - if your switch was on ancient firmware before, use the 3.9 instead)
6. Copy the Onyx installer to a FAT32 formatted USB, rename installer file to onie-installer (no extension)
7. insert usb stick with onyx installer into switch, power cycle switch, on ONIE prompt select „Install OS“
8. let the installer finish. Switch should come up ready to use
I found a pair of HPE SN2700M [HPE-branded SN2700] switches for a somewhat reasonable price ($1200 each). It's a lot of money for a home lab, but I wanted a step up from my SX6036 switches to play with 100GbE. I have some DACs and NICs that all run 56Gb Ethernet, and I'm happy to report that the SN2700 supports this speed just like the SX6036 does. The Nvidia documentation was unclear to me, so hopefully this will help others.

After reading this thread, I also immediately ordered two Transcend mSATA SSDs [64GB TS64GMSA452T] for $20 each. They showed up before the switches did. I cannot find the industrial variant here in the US, but these seem to work just fine.

After changing my COM port speed to 115200 in Windows Device Manager and then connecting via serial in PuTTY, I saw each unit boot up. The first one was throwing the following errors on the console and rebooting over and over.
[ 74.491846] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[ 74.491846]
[ 74.501078] CPU: 0 PID: 1 Comm: init Not tainted 4.15.15-1.el7MELLANOXsmp-x86_64 #1
[ 74.508811] Hardware name: Mellanox Technologies Ltd. MSN2700/SA000874, BIOS 4.6.5 09/13/2018
[ 74.517419] Call Trace:
[ 74.519906] dump_stack+0x46/0x64
[ 74.523255] panic+0xd2/0x23c
[ 74.526243] do_exit+0xb10/0xb10
[ 74.529502] do_group_exit+0x39/0xa0
[ 74.533115] get_signal+0x1c6/0x580
[ 74.536636] do_signal+0x23/0x5c0
[ 74.539988] ? __printk_safe_exit+0x5/0x10
[ 74.544120] ? __printk_safe_exit+0x5/0x10
[ 74.548256] ? down_trylock+0x25/0x30
[ 74.551957] ? signal_wake_up_state+0x15/0x30
[ 74.556350] ? kick_process+0x5/0x40
[ 74.559963] ? __send_signal+0x19a/0x490
[ 74.563924] exit_to_usermode_loop+0x34/0x7a A2
[ 74.568233] ? general_protection+0x2f/0x50
[ 74.572459] prepare_exit_to_usermode+0x53/0x80
[ 74.577045] retint_user+0x8/0x8
[ 74.580313] RIP: 0033:0x7f8a8f65bc40
[ 74.583925] RSP: 002b:00007fffe6aad388 EFLAGS: 00010202
[ 74.589196] RAX: 08107f8a8f82b228 RBX: 000000037ffff1a0 RCX: 00007f8a8f6696e7
[ 74.596397] RDX: 0000000000000001 RSI: 00007f8a8f86f698 RDI: 00007f8a8f86f658
[ 74.603589] RBP: 00007fffe6aad4f0 R08: 00007f8a8f42a000 R09: 0000000070000021
[ 74.610800] R10: 0000000000000031 R11: 0000000000000206 R12: 00007f8a8f86f658
[ 74.617986] R13: 00007fffe6aad5d0 R14: 0000000000000003 R15: 000000006ffffeff
[ 74.625197] Kernel Offset: disabled
[ 74.628717] Rebooting in 10 seconds..
[ 84.633556] ACPI MEMORY or I/O RESET_REG.

The other switch had Cumulus installed. I had planned to remove the factory SSDs anyway, and the error was resolved after a new SSD and a fresh install.
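
If the serial session is logged to a file, a reboot loop like the one above can be confirmed by counting panic lines. This is a minimal sketch that assumes the console lines keep the `[ seconds] message` shape shown in the log; the name `find_panics` is made up, and the sample reuses three lines from that log.

```python
import re
from typing import List, Tuple

# Matches lines such as "[ 74.491846] Kernel panic - not syncing: <reason>".
PANIC_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+Kernel panic - not syncing: (.*)$")


def find_panics(console_text: str) -> List[Tuple[float, str]]:
    """Return (timestamp, reason) for every kernel panic line in the text."""
    panics = []
    for line in console_text.splitlines():
        match = PANIC_RE.match(line.strip())
        if match:
            panics.append((float(match.group(1)), match.group(2)))
    return panics


sample = """\
[ 74.491846] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[ 74.625197] Kernel Offset: disabled
[ 74.628717] Rebooting in 10 seconds..
"""
print(find_panics(sample))
```

More than one entry in a single capture would mean the switch panicked again after rebooting.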

Using the post above was key for me to understand the recovery process for an SN2700. It's very simple.

1) Install a new SSD into the switch
2) Boot into the BIOS with CTRL + B (PW: admin worked for me)
3) Disable CSM & reboot
4) Boot from USB using the recovery image (I used Rufus to write 'onie-recovery-x86_64-mlnx_x86-r0.iso' [in the 2020.11-5.3.0005-115200 folder] to the USB drive)
5) Select Embed ONIE, which formats the new SSD and installs the bootloader. After this is complete, remove the USB & reboot. You should see 'Install ONIE'.
6) On your PC, delete all partitions on your USB drive created earlier by Rufus. Manually create a new partition and format with FAT32. Copy the file 'X86_64-3.9.3202-installer.bin' onto the FAT32 partition (don't use Rufus). Rename 'X86_64-3.9.3202-installer.bin' to 'onie-installer' without a file extension.
7) With the new installer file on your USB drive, insert the USB drive and reboot the switch. Select 'Install ONIE' and wait for completion.
8) After the install is complete, remove the USB drive while rebooting. Leaving it in will cause the switch to not boot and give you this error: 'disk `hd0,4' not found'. This was confusing to me. You no longer need this USB drive.
9) Make sure the switch boots up. Connect to the mgmt0 web interface from your PC and login. Browse to System --> Onyx Upgrade.
10) Rename 'onyx-X86_64-3.10.4404.zip' to 'onyx-X86_64-3.10.4404.img'. Select Install from local file and upgrade the switch to 3.10.4404 via the browser.
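
The two file-handling steps (6 and 10) can be sketched in Python. This assumes the USB drive is already partitioned, formatted as FAT32 and mounted, since the script does none of that. The function names and the paths passed to them are hypothetical.

```python
import shutil
from pathlib import Path


def stage_onie_installer(installer: Path, usb_root: Path) -> Path:
    """Copy the Onyx installer to the USB root as 'onie-installer' (step 6)."""
    if not installer.is_file():
        raise FileNotFoundError(installer)
    target = usb_root / "onie-installer"  # no file extension
    shutil.copyfile(installer, target)
    # Cheap sanity check: a short copy usually means the drive filled up.
    if target.stat().st_size != installer.stat().st_size:
        raise OSError(f"incomplete copy: {target}")
    return target


def zip_to_img(archive: Path) -> Path:
    """Rename e.g. onyx-X86_64-3.10.4404.zip to the same name with .img (step 10)."""
    if archive.suffix.lower() != ".zip":
        raise ValueError(f"expected a .zip file, got {archive.name}")
    return archive.rename(archive.with_suffix(".img"))
```

For example, `stage_onie_installer(Path("X86_64-3.9.3202-installer.bin"), Path("E:/"))` would cover step 6 on a Windows machine where the stick happens to be drive E:, which is an assumption about the local setup.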

This worked well for me on two of these switches. As previously mentioned in this thread, you might not be able to jump to 3.10.4404 directly if you need a BIOS upgrade and/or ASIC FW upgrade. You might want to do it one version at a time. It's comforting to know that you can just start over though, if you make a mistake.
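
Going "one version at a time" amounts to sorting the images you have and installing them in ascending order. The sketch below only orders version strings; it does not know which intermediate releases are actually required. The function names are made up, and 3.10.2002 in the example is a hypothetical intermediate build.

```python
from typing import List, Tuple


def version_key(version: str) -> Tuple[int, ...]:
    """Turn '3.10.4404' into (3, 10, 4404) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def upgrade_steps(current: str, target: str, available: List[str]) -> List[str]:
    """List available versions above `current` up to `target`, oldest first."""
    low, high = version_key(current), version_key(target)
    steps = sorted(
        {v for v in available if low < version_key(v) <= high}, key=version_key
    )
    if not steps or steps[-1] != target:
        raise ValueError(f"target {target} is not among the usable images")
    return steps


# 3.9.3202 and 3.10.4404 come from the post; 3.10.2002 is hypothetical.
print(upgrade_steps("3.9.3202", "3.10.4404", ["3.10.4404", "3.10.2002", "3.9.3202"]))
```

Comparing integer tuples rather than raw strings matters here, because as text "3.10" sorts before "3.9".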
 

Civiloid

Check your baud rate. If it's an older BIOS, you may need 9600, but newer BIOS wants 115200. I saw some weird characters when connecting at 9600 on my SN2700M, but that all cleared up when connected at 115200. Perhaps your issue is something else though.
I was using 115200, as my BIOS was new enough for that. With 9600 I got only garbage (that was expected, though).
 

NablaSquaredG

Another failure mode encountered today:

LPCI2C bridge is busy

Seems to be some kind of hardware defect.



Update: Issue fixed.

Apparently, due to bad packaging and excessive shock / vibration, one of the connectors of the SE Harness (the flat cable connecting the MNG and switch boards, carrying I²C signals and other stuff) came loose.
I didn't see it at first, because it was still halfway plugged in (the connector was tilted, so some pins made contact and some did not).

Reseated the connector and everything works perfectly fine now.
 

hardwaretuner

yes, that's the right one.

You can then upgrade to a newer version, like 3.10.5000
After a few tries, we cloned the old mSATA SSD to a new one with dd under Linux, checked the filesystem with GParted and put the new SSD back into the switch. After a reboot and three upgrades it was still working fine. It has now been running for over a week without problems.
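
A dd clone can be checked before the new SSD goes back into the switch by hashing the same number of bytes from both disks. This is a sketch under assumptions: the function names are made up, the paths would be the two block devices or image files on your machine, and `length` would be the size of the old SSD in bytes, because the new one may be larger.

```python
import hashlib
from pathlib import Path


def digest_prefix(path: Path, length: int, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of the first `length` bytes of a file or block device."""
    sha = hashlib.sha256()
    remaining = length
    with open(path, "rb") as handle:
        while remaining > 0:
            chunk = handle.read(min(chunk_size, remaining))
            if not chunk:
                raise ValueError(f"{path} is shorter than {length} bytes")
            sha.update(chunk)
            remaining -= len(chunk)
    return sha.hexdigest()


def clone_matches(source: Path, clone: Path, length: int) -> bool:
    """True if the first `length` bytes of both disks are identical."""
    return digest_prefix(source, length) == digest_prefix(clone, length)
```

Reading raw block devices normally needs root, and the check is only meaningful while neither disk is mounted read-write.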
 

10Base5

New Member
Mar 5, 2024
2
0
1
What would be the recommended upgrade path for taking an SN2010 from 3.8 to 3.10? Is it necessary to upgrade to a 3.9 release first?