Mellanox Switches - Tips & Tricks


NablaSquaredG

Bringing 100G switches to homelabs
Aug 17, 2020
1,618
1,072
113
Oooffff... I'm not sure, but there might be a board level fix. Maybe you're lucky.
 

BackupProphet

Well-Known Member
Jul 2, 2014
1,187
769
113
Stavanger, Norway
intellistream.ai
So I managed to use a Raspberry Pi to create a fake virtual USB block device. So now I could just use these old mini-usb to usb-a cables.

The steps to do this are quite simple:

Using Raspberry Pi OS Lite and a Raspberry Pi A with a single USB port (the first model, released a decade ago)

First copy the onie-recovery-x86_64-mlnx_x86-r0.iso over to the rpi, using wifi or the sdcard.
Then edit /boot/firmware/config.txt
Change this at the end of the file:

[cm5]
#dtoverlay=dwc2,dr_mode=host

[all]
dtoverlay=dwc2,dr_mode=peripheral

Reboot the rpi

Then load the mass-storage gadget module with the ISO as its backing file:
sudo modprobe g_mass_storage file=/home/pi/onie-recovery-x86_64-mlnx_x86-r0.iso stall=0 removable=1

Connect the rpi's USB port to the SN2100 and success :cool:
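
The steps above could be wrapped in a small script. This is only a sketch under stated assumptions: the helper names has_peripheral_overlay and expose_iso and the DRY_RUN variable are made up for illustration, and the overlay check is a plain text match that does not evaluate config.txt section filters such as [cm5] or [all].

```shell
#!/bin/sh
# Sketch only: helper names and DRY_RUN are assumptions, not part of the post.

# Succeeds if FILE contains an uncommented dwc2 peripheral-mode overlay line.
# It does not evaluate [section] filters such as [cm5] or [all].
has_peripheral_overlay() {
    grep -Eq '^[[:space:]]*dtoverlay=dwc2,dr_mode=peripheral' "$1"
}

# Exposes an image file as a removable USB mass-storage device.
# With DRY_RUN=1 (the default here) it only prints the command.
expose_iso() {
    iso="$1"
    if [ ! -f "$iso" ]; then
        echo "image not found: $iso" >&2
        return 1
    fi
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "modprobe g_mass_storage file=$iso stall=0 removable=1"
    else
        # Same module options as in the post: no stall, removable media
        sudo modprobe g_mass_storage "file=$iso" stall=0 removable=1
    fi
}
```

After the reboot, something like `has_peripheral_overlay /boot/firmware/config.txt && DRY_RUN=0 expose_iso /home/pi/onie-recovery-x86_64-mlnx_x86-r0.iso` would run the real command; leaving DRY_RUN unset only prints it.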

After reinstalling on the Intel mSATA SSD, it was MUCH MUCH faster!
 

klui

༺༻
Feb 3, 2019
924
530
93
Sorry, late to this party, but yes, I had this on one of mine too.
Seemed to work better (more often) when using a dumb switch (vs a managed one) for it, no idea why.
Hey I kind of ran into this when I was converting mine around a month back.

The symptom only happened in U-Boot when retrieving the images through tftp. From MLNX-OS, it is fine.
Code:
=> tftp C00000 192.168.127.15:/net/mlnx/mlnx460ex/rootfs
Using ppc_4xx_eth0 device
TFTP from server 192.168.127.15; our IP address is 192.168.127.66
Filename '/net/mlnx/mlnx460ex/rootfs'.
Load address: 0xc00000
Loading: ########################T ###################T #########T #####T ########
         ###T #####T T #####T ######T T
Retry count exceeded; starting again
Waiting for PHY auto negotiation to complete...... TIMEOUT !
 done
ENET Speed is 10 Mbps - HALF duplex connection (EMAC1)
This behavior occurs with an untagged switchport of an Aruba S2500, and also with a dumb switch connected to an untagged port on the same S2500.

But it doesn't happen if I connect it to an HPE OfficeConnect 1820-8g. The 1820 is connected to the S2500 through a trunk and is PoE-powered from the S2500. The port I'm using is untagged on the trunk. If I use a port that's in a tagged VLAN on the 1820's trunk I still get timeouts, just not as often.
 

darkfader

New Member
Mar 29, 2022
6
0
1
Munich
Anyone know how to clean up disk space in / filesystem?
Out of 4 SN2010, there's one whose root filesystem is at 91%. I have no idea WHY and also couldn't find a way to get a peek at what's there. I've let it delete old logs, looked at the show files submenu etc., but there's nothing that has a good reason to use up that much space.

It's visible over SNMP:
91.09% used (1.74 of 1.91 GB, warn/crit at 80.00%/90.00%), trend: -53.47 kB / 24 hours
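
For reference, the percentage and the warn/crit state in that reading can be reproduced with a little arithmetic. A rough sketch, assuming the two sizes and the thresholds are simply passed in by hand; the function names used_pct and classify are made up here, and the result from the rounded GB values can differ in the last digit from what the monitoring computes from exact byte counts.

```shell
#!/bin/sh
# Sketch only: used_pct and classify are hypothetical helper names.

# used_pct USED TOTAL -> usage in percent, two decimals
used_pct() {
    awk -v u="$1" -v t="$2" 'BEGIN { printf "%.2f\n", 100 * u / t }'
}

# classify PCT WARN CRIT -> prints ok, warn or crit
classify() {
    awk -v p="$1" -v w="$2" -v c="$3" 'BEGIN {
        if (p + 0 >= c + 0)      print "crit"
        else if (p + 0 >= w + 0) print "warn"
        else                     print "ok"
    }'
}
```

With the numbers above, `used_pct 1.74 1.91` lands just past the 90% critical threshold, so `classify "$(used_pct 1.74 1.91)" 80 90` reports crit.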

show files system etc. only shows me /config and /var:

Code:
Statistics for /config filesystem:
  Space Total          182 MB
  Space Used           3 MB
  Space Free           179 MB
  Space Available      170 MB
  Space Percent Free   98%
  Inodes Percent Free  99%

Statistics for /var filesystem:
  Space Total          9668 MB
  Space Used           1209 MB
  Space Free           8459 MB
  Space Available      7961 MB
  Space Percent Free   87%
  Inodes Percent Free  99%

There's the logging configuration, and I have (now) added the max size and rotation interval:
Code:
sw-vv-core [my-mlag-vip-domain: master] # show logging

Local logging level                          : notice
Override for class debug-module              : notice
Default remote logging level                 : notice
Allow receiving of messages from remote hosts: no
Number of archived log files to keep         : 3
Log rotation frequency                       : weekly (Once per week)
Log rotation (debug) size threshold          : 20 megabytes
Log format                                   : standard
Subsecond timestamp field                    : disabled
MAC address masking                          : enabled
One thing that might cause it is that the MLAG partner is there but the MLAG domain is down (waiting for people to do stuff for a year or so... never mind that). I dunno if that is spamming the logs.

No VMs, no docker containers, no nada

Running HPE flavor of 3.9.2110

version: X86_64 3.9.2110 2021-01-14 08:52:51 x86_6


If I don't find a way to clean this up I'll need to spend a night or two doing the stuff to bring the MLAG partner back up, and then I can reboot this one. But the general question stands: any idea how to see what's there sucking up space in /?

(I did an EMC^2 SX6012 mod long ago, and honestly loved doing that since it had been so long since I had reason to do anything similar, but this one is doing actual work so I'm not keen on picking apart the OS :)
 

darkfader

New Member
Mar 29, 2022
6
0
1
Munich
I can confirm that Onyx will be discontinued in the near future (MLNX-OS is not affected, but the eth product will not be supported)
Later support for eth products will be Cumulus Linux and SONiC, or...switchdev
It is so frustrating btw! I heard about that in early '23 and since then I've wished every day they would just give up that plan.
Mellanox switches + MLNX-OS were the best thing for envs with a few hundred users: a datacenter/core solution they can still understand, which allowed a safe boundary between cheap access (UniFi) and core. It's different in global corporations where the IT staff and needs are greater, but for "normal" this was _the_ perfect solution. Those people are left with Netgear, FS, Huawei, and none of that will compare.
 

klui

༺༻
Feb 3, 2019
924
530
93
I think you're worrying about nothing. Root is mounted read-only and not meant to be modified during normal use. What do you want to remove?

The SX6012's root file system is 95% full. All runtime changes are stored in /var, /config and tmpfs.
Code:
[admin@sx6012a ~]# df -k
Filesystem     1K-blocks   Used Available Use% Mounted on
/dev/mtdblock6    524288 493684     30604  95% /
/dev/mtdblock8    102400   3340     99060   4% /config
/dev/mtdblock9    880640 389116    491524  45% /var
tmpfs            1038052   6668   1031384   1% /dev/shm
tmpfs              65536   9668     55868  15% /vtmp
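
If the goal is to only alert on filesystems that can actually fill up at runtime, the df output can be filtered so the read-only root is skipped. A minimal sketch, assuming POSIX-style df columns as shown above; the function name check_df and its two arguments (limit in percent, one mount point to ignore) are made up for illustration.

```shell
#!/bin/sh
# Sketch only: check_df is a hypothetical helper, not a switch command.

# check_df LIMIT SKIP: reads df output on stdin and prints every mount
# point whose Use% is at or above LIMIT, except the mount point SKIP.
check_df() {
    awk -v limit="$1" -v skip="$2" '
        NR == 1 { next }                 # skip the header line
        {
            pct = $5
            sub(/%/, "", pct)            # "95%" -> "95"
            if ($6 == skip) next         # e.g. the read-only root
            if (pct + 0 >= limit + 0) print $6, pct "%"
        }'
}
```

On a box with shell access, `df -kP | check_df 90 /` would then stay quiet about a 95% full root and only report other mounts at or above 90%.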
 

darkfader

New Member
Mar 29, 2022
6
0
1
Munich
Hi,

Thank you. It was just odd that it only appears on one of the whole bunch. But this is helpful. I'll adjust the monitoring to assume 95%ish is normal.
 

BoGs

Member
Feb 18, 2019
65
11
8
Would love some help: I'm keeping an eye out for an SN2700 and was wondering whether the model matters if I plan to follow the install guide in this thread. CS2F? CS2FC? CS2FO? It seems most are the FO models.
 

Charles.Chiao

New Member
Sep 9, 2024
1
0
1
I am currently using NVIDIA SN3700, SN4700 and SN5600 switches. The currently installed system version is Cumulus 5.8.0. I tried to upgrade to Cumulus 5.9.2 or 5.10.0, but none of the three devices comes up in the new Cumulus version. An error occurs both when the system is installed using the onie-install command and when it is installed from ONIE at power-on. I disabled the SecureBoot option in the switch BIOS as the prompt suggested, but the installation error is still displayed. Has anyone encountered this situation before? What should I do?
 


Matta

Member
Oct 16, 2022
64
16
8
Mellanox SN2100 CB2R, made in 2020, for 1100 EUR: is that an OK price? Is the Atom bug rectified on it?
I cannot find anything better, also looked for SN2700 but no luck.
 

Koop

Well-Known Member
Jan 24, 2024
369
267
63
So I recently got a pair of SN2410s- one which I'm having issues with and made a topic on. While I am still troubleshooting that one (waiting to get a DIMM I ordered) I went ahead and at least got the working one turned up in my rack. It looks like it was used as part of a whole pre-configured IBM rack solution.

I'll be honest that I picked these up with not much planning so I'm just learning as I go as someone who has never touched switches like this running linux, etc - does anyone have recommendations on where or how to start? Going to start reading through this thread first.

First thing I did was to get a complete clone of the internal SATA disk. I was able to log into the Cumulus install with some default credentials and started just prodding around to see what info I could find. Anything specific I should capture?

I've been trying to read up on how the heck I can update it? Should I specifically be looking for IBM provided software/firmware/etc or can I use stuff provided by NVIDIA? Can I/should I update the bios (firmware?) on it? Can I/Should I update ONIE? Cumulus? Are there limits or concerns software wise? Any documentation or online guides on how to do any of these things?

Going off the OP I see the warning for max firmware levels. Anything new versus what is in the OP?

"NOS for dummies" or "Mellanox for dummies" guides out there for someone totally new to it all? Best to just start from scratch with ONIE on a USB or something?

Just want to not do something completely dumb and brick this thing somehow. Thanks for any guidance and advice.

Going to read through the thread now.
 

nasbdh9

Active Member
Aug 4, 2019
180
114
43
I've been trying to read up on how the heck I can update it? Should I specifically be looking for IBM provided software/firmware/etc or can I use stuff provided by NVIDIA? Can I/should I update the bios (firmware?) on it? Can I/Should I update ONIE? Cumulus? Are there limits or concerns software wise? Any documentation or online guides on how to do any of these things?
BIOS, ASIC firmware, etc. will be automatically updated with the installed NOS, at least in Onyx, Cumulus Linux and SONiC.
ONIE does not need to be updated.
 

NablaSquaredG

Bringing 100G switches to homelabs
Aug 17, 2020
1,618
1,072
113
First thing I did was to get a complete clone of the internal SATA disk.
Replace the mSATA disk while you're at it. A majority of the original SSDs are prone to failure. Innodisk is very bad.
 

Koop

Well-Known Member
Jan 24, 2024
369
267
63
BIOS, ASIC firmware, etc. will be automatically updated with the installed NOS, at least in Onyx, Cumulus Linux and SONiC.
ONIE does not need to be updated.
Ah okay, that helps, thanks. I got a few different versions of both Cumulus and Onyx following NVIDIA's "Software versions path" in case it mattered for my particular situation.

Replace the mSATA disk while you're at it. A majority of the original SSDs are prone to failure. Innodisk is very bad.
Yeah thanks for all the warnings regarding this, I always make clones/backups of drives when I can for hardware like this but after all the comments I knew to triple make sure I did it here. Both the switches I got had StorFly 300XE 16GB drives in them just for the record.

Chugging along with this currently to start somewhere:
 

Koop

Well-Known Member
Jan 24, 2024
369
267
63
Well I got some super basic Cumulus setup going so that's cool. I have absolutely no idea wtf I am doing but that quick start guide was helpful. If I can ever figure out what is wrong with the second switch I'd like to check out MLAG and such. I'm pretty much just reading what I can to learn the basics at this point. Been over a decade since I focused on anything network-wise.

I figured the way things are these days you really need to just know it all, so I'd really like to see and understand distributed NVMe storage working. Not sure what the best way to do that is (Ceph? vSAN? Lustre? VAST? WEKA?)


Question on the whole firmware thing then: how exactly do I get firmware updates and apply them? Providing the PSID to NVIDIA on their page is what I saw, but how do I find my PSID? Is it physically on the switch label? I just want to heed the caution of the OP as well.
 

NablaSquaredG

Bringing 100G switches to homelabs
Aug 17, 2020
1,618
1,072
113
Question on the whole firmware thing then?
What do you mean by Firmware? BIOS? ASIC?

Those are included in the Network Operating System like Cumulus. How you acquire newer versions of Cumulus is a different thing.
 