Mellanox Switches - Tips & Tricks


smallwolf

New Member
Jul 22, 2023
1
0
1
Similar to the glorious Brocade ICX Series (cheap & powerful 10gbE/40gbE switching) Thread, I'd like to use this thread to compile some knowledge about Mellanox switches that is spread over many places in the forum, which can be hard to find for newbies.

I will expand this thread / post over time, so if you have some information that you'd like to have added, just write it in a comment.

(MAXIMUM) SUPPORTED FIRMWARE VERSIONS:

Upgrading beyond these versions or using other versions may brick your switch and require a recovery procedure!


Name / Types | Version
SwitchX / SwitchX-2 PowerPC (SX6036, SX6012, SX1016, etc.) | 3.6.8012 (no newer build for PowerPC available)
SwitchX / SwitchX-2 x86 (SX6710, SX1410, etc.) | 3.6.8012 (upgrading beyond WILL brick your switches' BIOS)
Spectrum x86 (SN2100, SN2700, SN2010, etc.) | 3.10.4XXX (DO NOT upgrade to 3.10.5000, 3.10.6004 etc.)
SwitchIB (SB7700) | 3.9.3124
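
If you want to sanity-check a target version before flashing, the limits in the table above can be put into a small lookup. This is only a rough sketch: the platform keys are labels I made up for illustration, and reading "3.10.4XXX" as "anything below 3.10.5000" is my assumption.

```python
# Maximum versions taken from the table above.
# The platform keys are hypothetical labels chosen for this sketch.
MAX_INCLUSIVE = {
    "switchx-ppc": (3, 6, 8012),
    "switchx-x86": (3, 6, 8012),
    "switch-ib": (3, 9, 3124),
}

# Spectrum is listed as 3.10.4XXX, so this sketch assumes that
# everything from 3.10.5000 upward is unsafe.
FIRST_UNSAFE = {
    "spectrum-x86": (3, 10, 5000),
}


def parse_version(text):
    """Turn '3.10.4404' into (3, 10, 4404) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def is_safe_target(platform, version):
    """Return True if the version is within the limit listed for the platform."""
    parsed = parse_version(version)
    if platform in MAX_INCLUSIVE:
        return parsed <= MAX_INCLUSIVE[platform]
    if platform in FIRST_UNSAFE:
        return parsed < FIRST_UNSAFE[platform]
    raise KeyError(f"unknown platform: {platform}")
```

For example, is_safe_target("spectrum-x86", "3.10.4404") is True, while "3.10.5000" on the same platform is rejected.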



As I don't have a lot of time right now (but want to get this done before I forget it), I'll just start with some basics



Overview of current Mellanox switch series that may be interesting for homelabs:

- Mellanox IS5000 Series: Old, Infiniband only switches. They should generally be avoided.

- Mellanox SX Series: 40G / 56G switches with SwitchX / SwitchX-2 chip. They come in different flavours (unmanaged / managed, full width, half width, etc. TODO: Full Model Listing). They are very energy efficient (TODO: Exact numbers). Most of them are PowerPC, but some are x86.

Managed SX Series switches can do VPI - That means you can have some ports do Infiniband, some Ethernet at the same time and even use integrated IPoIB gateway functionality.

Highest supported firmware version (for managed switches): 3.6.8012
DO NOT try to update to a version beyond that on x86 switches - It may brick your switch by automatically doing a BIOS update that prevents the switch ASIC from being detected!


- Mellanox SB77XX/78XX series: 100G Infiniband switches with Switch-IB / Switch-IB 2 chip (no VPI like SX Series). x86 control plane, highest supported version: 3.9.3124
Very power efficient! SB7700 needs only 53W in IDLE (one PSU)

- Mellanox SN2XXX series: 100G Ethernet switches with Spectrum chip (no VPI like SX series). x86 control plane, highest supported version: apparently 3.10.4100 (LTS)
Very power efficient. SN2700 needs only 51W in IDLE (one PSU), SN2100 about 35W iirc (after the fans have spun down)
They come in different flavours, but generally you have:
- SN2700: 19" 32x100G, Celeron 1047U, mSATA SSD (TODO: Expand)
- SN2100: 9.5" (half-width), Atom C2XXX (possibly affected by AVR54 bug?), M.2 SATA SSD
... and more (TODO: Full model listing)


Tips & Tricks for working with those switches:

- SN2XXX / SB7XXX series: Replace your SSDs and make backups! The original Innodisk 3ME SSDs are prone to failure (one died while I was taking an image).
You can use whatever mSATA (apparently all except SN2100) or M.2 SATA (apparently only SN2100) SSD you want. My go-to model is the Transcend 452T2 (e.g. TS128GMSA452T2)

- If you get an SN2XXX with ONIE, Cumulus or no OS, you can easily flash it to ONYX / MLNX-OS by taking a good MLNX-OS / ONYX image from another switch and patching the embedded database to the correct model number, number of ports, MAC Addresses, etc... (TODO: Guide)

TODO: Expand
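
For the SSD backup tip above, here is a minimal sketch of taking an image with a checksum, assuming the SSD (or an existing image file) is readable as a plain file from another machine. The function name and paths are my own placeholders, not anything from MLNX-OS.

```python
import hashlib


def image_disk(source_path, image_path, chunk_size=4 * 1024 * 1024):
    """Copy source_path to image_path in chunks and return the SHA-256 of the data.

    source_path is a placeholder: it could be a block device or an existing
    image file, depending on how the SSD is attached.
    """
    digest = hashlib.sha256()
    with open(source_path, "rb") as src, open(image_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            digest.update(chunk)
    return digest.hexdigest()
```

Keeping the returned checksum next to the image makes it easy to verify later that the backup has not been corrupted.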
I have an sb7800 without any system. Can you send me a download link for this system?
 

NablaSquaredG

Layer 1 Magician
Aug 17, 2020
1,345
820
113
PSA:
there is official confirmation that SN2100 and SN2010 ARE affected by AVR54 bug.

Reference:



Code:
Environment

This issue might affect the following switches on the NVIDIA hardware compatibility list:

100G switches: Dell Z9100-ON, Edge-Core AS7712-32X, HPE Altoline 6960, NVIDIA Spectrum SN2100, Penguin Arctica 3200C, QCT QuantaMesh BMS T7032-IX1, Supermicro SSE-C3632S
40G switches: Dell S6010-ON, Edge-Core AS6712-32X, Edge-Core AS6812-32X, HPE Altoline 6940, HPE Altoline 6941, NVIDIA Spectrum SN2100B, Penguin Arctica 3200XLP, QCT QuantaMesh BMS T5032-LY6-x86
10G switches: Dell S4048-ON, Dell S4048T-ON, Edge-Core AS5712-54X, Edge-Core AS5812-54T, Edge-Core AS5812-54X, HPE Altoline 6920, HPE Altoline 6921, HPE Altoline 6921T, Penguin Arctica 4806XP, QCT QuantaMesh BMS T3048-LY8, QCT QuantaMesh BMS T3048-LY9, Supermicro SSE-X3648S
1G switches: Dell S3048-ON, Penguin Arctica 4804iq, Supermicro SSE-G3648B
Root Cause

The CPU might fail to boot after the SoC LPC_CLKOUT0 and/or LPC_CLKOUT1 signals (low pin count bus clock outputs) stops functioning. For more information, read AVR54 in the Intel Atom Processor C2000 Product Family Specification Update from January 2017.

If this issue occurs, no messages or output appears on the serial console, the switch fans continue to run at full speed, and the state (on/off/color) of the LEDs all remain the same.
 

i386

Well-Known Member
Mar 18, 2016
4,245
1,546
113
34
Germany
Is there a way to find out what the default value for "power_persent" is?
I am trying to change the fan rpms/reduce the noise of my 6036 :D
 

dbTH

Member
Apr 9, 2017
149
59
28
This seems to be the old Intel Atom C2000 bug found in early 2017, which was fixed in the later C0 stepping, as reported on STH: Intel Atom C2000 C0 Stepping Fixing the AVR54 Bug
So, if you have an SN2100 or SN2010 switch that was manufactured in a later year, I guess you should be fine
 

NablaSquaredG

Layer 1 Magician
Aug 17, 2020
1,345
820
113
Yes, but there was never official confirmation.
There can be layouts where the AVR54 bug does not matter, and I hoped that SN2010 / SN2100 is one of them.
That‘s the reason why I sold my SN2100s…

You can check with rootshell and lspci.
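
A rough sketch of pulling the revision out of saved lspci output, assuming the SoC devices show up with "C2000" in the description and a trailing "(rev XX)". Which revision corresponds to which stepping is not covered in this thread, so look that up in the Intel specification update rather than trusting this snippet.

```python
import re

# Matches an lspci line that mentions the Atom C2000 SoC and ends with a
# revision, e.g. "... Atom processor C2000 SoC Transaction Router (rev 02)".
C2000_LINE = re.compile(
    r"^(?P<slot>\S+)\s+(?P<desc>.*C2000.*?)\s+\(rev (?P<rev>[0-9a-fA-F]{2})\)\s*$"
)


def c2000_revisions(lspci_output):
    """Return the sorted, unique revision strings of all C2000 devices found."""
    revisions = set()
    for line in lspci_output.splitlines():
        match = C2000_LINE.match(line)
        if match:
            revisions.add(match.group("rev").lower())
    return sorted(revisions)
```

Paste the lspci text from the rootshell into a string and pass it to c2000_revisions; an empty list means no matching lines were found.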
 

Rand__

Well-Known Member
Mar 6, 2014
6,634
1,767
113
Has anyone experienced SX6012's management port behaving weirdly? It works fine if I plug it directly into my laptop but it refuses to connect to my lan. Both dhcp and static do not work. Looking at wireshark captures it is just completely ignoring arp, not making any dhcp requests, etc. The rest of the router works perfectly fine.
Sorry, late to this party, but yes, I had this on one of mine too.
Seemed to work better (more often) when using a dumb switch (vs a managed one) for it, no idea why.
 

NablaSquaredG

Layer 1 Magician
Aug 17, 2020
1,345
820
113
Release Notes 3.10.4404

Internal Ref. | Category | Description
3527932 | VLAN | After removing the VLAN interface using the IPL configuration and rebooting the switch, errors are seen in the log.
3571204 | LACP | When combining LACP fast rate and LACP slow rate on different devices, the LACP link flapping occurs.
3510518 | Cables | Tx bias current shows N/A on some cables.
3700976 | LDAP | Fixed the "group-attribute/group-dn" LDAP configurations. If the "group-dn" is set, a user must be a member of this group or the user will not be authorized to log in as the membership of group is set by the group-attribute.
3565862 | IGMP | IGMP snooping sends Source and Group Specific queries as a response to the received IGMPv3 Current-State-Record Membership report.
 

NablaSquaredG

Layer 1 Magician
Aug 17, 2020
1,345
820
113
FYI, here is the update sequence, with the "show asic-version" value after each step, when updating an old SN2410 to the current version. You cannot directly upgrade a switch with 3.7 on it to 3.10, as 3.10 will fail to recognise the ASIC

Code:
3.7.1134 -> 3.8.2004 (13.2000.2162)
3.8.2004 -> 3.9.1020 (13.2008.1310)
3.9.1020 -> 3.9.3124 (13.2008.3226)
3.9.3124 -> 3.10.2102 (13.2010.2108)
3.10.2102 -> 3.10.4206 (13.2010.4208)
-> Fresh install 3.10.4206 (13.2010.4208)
3.10.4206 -> 3.10.4404 (13.2010.4406)
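
The sequence above can be written down as a small helper that tells you which hops are still left from a given version. This is just a sketch of the list as posted for the SN2410; the 3.10.4206 step is taken from the fresh-install line, the fresh install itself is not modelled, and other models may need a different chain.

```python
# Upgrade chain as listed in the thread for an SN2410.
UPGRADE_CHAIN = [
    "3.7.1134",
    "3.8.2004",
    "3.9.1020",
    "3.9.3124",
    "3.10.2102",
    "3.10.4206",
    "3.10.4404",
]


def remaining_steps(current, target="3.10.4404"):
    """Return the (from, to) hops needed to get from current to target.

    Raises ValueError if either version is not part of the listed chain
    or if the target is older than the current version.
    """
    start = UPGRADE_CHAIN.index(current)
    end = UPGRADE_CHAIN.index(target)
    if end < start:
        raise ValueError("target is older than the current version")
    return list(zip(UPGRADE_CHAIN[start:end], UPGRADE_CHAIN[start + 1:end + 1]))
```

Calling remaining_steps("3.9.3124") gives three hops, ending at 3.10.4404.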
 

NablaSquaredG

Layer 1 Magician
Aug 17, 2020
1,345
820
113
Does anyone know the exact specifications for the screws of the top cover & SSD? Got a switch with all those missing and no idea what to buy
 

NablaSquaredG

Layer 1 Magician
Aug 17, 2020
1,345
820
113
Top cover screw seems to be "Flat Head 100° 4-40x3/16" Phillips Patch Screw" (mind the 100° angle)

SSD screw seems to be M2x5mm or something


Another note why you should disable CSM on SN2700 / SN2410 when reinstalling ONIE (onie-recovery-x86_64-mlnx_x86-r0-2020.11-5.3.0005) and MLNX-OS 3.10 X86_64-3.10.4206-installer.bin and newest BIOS version (Version 2.15.1236. Copyright (C) 2012 American Megatrends, Inc. - BIOS Date: 09/13/2018 12:00:00 Ver: 0ABZS017_02.02.004)...

However, when trying to reproduce the issues I had with legacy boot, I couldn't reproduce them. Weird. Maybe it was an older BIOS version I had issues with back then? Dunno...
 

NablaSquaredG

Layer 1 Magician
Aug 17, 2020
1,345
820
113
PSA:

When recovering SN2100, DO NOT SET TO UEFI ONLY as you would do on SN2700.
SN2100 needs legacy boot. Otherwise it will get stuck in an infinite loop trying to update the BIOS, because it cannot read the DMI and the auto BIOS update script is terribly written.

Just verified this again
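
To keep the boot mode straight per model, a tiny lookup can help. This is only a sketch of what has been reported in this thread (SN2100 needs legacy boot; CSM disabled, i.e. UEFI, on SN2700 / SN2410); the mode names are my own labels and other models are deliberately left out.

```python
# Boot mode to use during recovery, as reported in this thread.
# "uefi" and "legacy" are labels chosen for this sketch.
RECOVERY_BOOT_MODE = {
    "SN2700": "uefi",
    "SN2410": "uefi",
    "SN2100": "legacy",
}


def recovery_boot_mode(model):
    """Return the reported boot mode for a model, or raise KeyError if unknown."""
    key = model.strip().upper()
    if key not in RECOVERY_BOOT_MODE:
        raise KeyError(f"no boot mode reported for {model!r}")
    return RECOVERY_BOOT_MODE[key]
```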
 

NablaSquaredG

Layer 1 Magician
Aug 17, 2020
1,345
820
113
Added two new tips / tricks:

Code:
- The serial terminal behaves a bit weird sometimes when editing large commands. To fix this, use cli session terminal resize

- To download firmware from an USB stick, use image fetch scp://admin:admin@127.0.0.1/var/mnt/usb1/image-X86_64-3.10.4404.img
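
If you do this often, the USB fetch command above can be assembled from the file name. A minimal sketch, assuming the same admin:admin credentials and /var/mnt/usb1 mount point as in the tip; the function name is hypothetical.

```python
def usb_image_fetch_command(filename, user="admin", password="admin",
                            usb_mount="/var/mnt/usb1"):
    """Build the 'image fetch' CLI line for an image file on the USB stick.

    The defaults mirror the example in the tip; adjust them if your
    credentials or mount point differ.
    """
    return f"image fetch scp://{user}:{password}@127.0.0.1{usb_mount}/{filename}"
```

The returned string is meant to be pasted into the switch CLI; nothing here talks to the switch.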
 

hardwaretuner

New Member
Feb 7, 2024
5
0
1
(MAXIMUM) SUPPORTED FIRMWARE VERSIONS:

Upgrading beyond these versions or using other versions may brick your switch and require a recovery procedure!


Name / Types | Version
SwitchX / SwitchX-2 PowerPC (SX6036, SX6012, SX1016, etc.) | 3.6.8012 (no newer build for PowerPC available)
SwitchX / SwitchX-2 x86 (SX6710, SX1410, etc.) | 3.6.8012 (upgrading beyond WILL brick your switches' BIOS)
SwitchIB (SB7700) | 3.9.3124
Spectrum x86 (SN2100, SN2700, SN2010, etc.) | 3.10.4XXX (3.10.4404 as of 24th January 2024; DO NOT upgrade to 3.10.5000, 3.10.6004, 3.11.XXXX etc.)


- Mellanox SB77XX/78XX series: 100G Infiniband switches with Switch-IB / Switch-IB 2 chip (no VPI like SX Series). x86 control plane, highest supported version: 3.9.3124
Very power efficient! SB7700 needs only 53W in IDLE (one PSU)
Thanks for this summary. Is there any difference between the 7800 and the 7890 apart from the network module? If I change the network module to 7890, does it boot up like a 7800?


My Mellanox SB7800 doesn't boot any more after a power cycle. Is it possible to repair the mSATA SSD or clone another one from an SB7890? Or is there any other option to get it to boot again?

serial connection shows:
"Version 2.15.1236. Copyright (C) 2012 American Megatrends, Inc.
BIOS Date: 03/08/2016 12:00:00 Ver: 0ABZS017_01.01.013

Reboot and Select proper Boot device
or Insert Boot Media in selected Boot device and press a key"

thanks
 

NablaSquaredG

Layer 1 Magician
Aug 17, 2020
1,345
820
113
Short summary

1. MOST IMPORTANT STEP: REPLACE SSD - I recommend Transcend MSA452T-I 128GB
2. Download the ONIE recovery image, flash it to a USB stick, and insert the USB stick into the switch
3. Power cycle the switch and hammer CTRL+B to get into the BIOS; if a password is required, try "admin" (without quotes)
4. Boot from USB (UEFI Mode!!) and select "Embed ONIE" once ONIE has booted
5. Download the Onyx 3.10 installer from the mega.nz link on the first page (should work; if your switch was on ancient firmware before, use the 3.9 one instead)
6. Copy the Onyx installer to a FAT32-formatted USB stick and rename the installer file to onie-installer (no extension)
7. Insert the USB stick with the Onyx installer into the switch, power cycle the switch, and select "Install OS" at the ONIE prompt
8. Let the installer finish. The switch should come up ready to use
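
Step 6 is easy to get wrong (a leftover .bin extension is a common mistake), so here is a minimal sketch of it, assuming the FAT32 stick is already formatted and mounted. The function name and both paths are placeholders.

```python
import shutil
from pathlib import Path


def stage_onie_installer(installer_path, usb_mount):
    """Copy the Onyx installer onto the USB stick as 'onie-installer'.

    usb_mount is wherever the FAT32 stick is mounted on your machine;
    the target file name has no extension, as described in step 6.
    """
    target = Path(usb_mount) / "onie-installer"
    shutil.copyfile(installer_path, target)
    return target
```

After copying, unmount the stick cleanly before pulling it so the file is fully written.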