Beware of EMC switches sold as Mellanox SX6XXX on eBay

kapone · Dec 22, 2024

klui said:
Maybe forgot to mount / rw?

Code:

/dev/mtdblock7 524288 491388 32900 94% /
/dev/mtdblock7 on / type jffs2 (ro,noatime)

I'm actually more curious if you did anything special to get your 6036's fans' PWM to 15%. I haven't unboxed my 6036 but my 6012s' fans will surge between PWM percentage of < 30% and maximum every 5-10 seconds.

I don't know if it makes a difference or not...this is a brand new 6036, plain vanilla Mellanox.

That said, I started where everyone else did, at 27% and started going lower until the fans decided to not cooperate and surge. 15-16% seems right where they don't surge. Right now, I have a single PSU connected, but I might try the second one as well, to see what the power draw looks like.

Edit: At 15-16% the fans are running ~4200-4300rpm.

Code:

# mount -nwo remount,rw /
# vi /etc/rc.d/rc.local

FAN_MIN="15"
FAN_MAX="50"
WAIT_MAX="10" # 10 retries x 30 s = 5 minutes

MDREQ1="/opt/tms/bin/mdreq action /system/chassis/actions/set-fan-speed fan_module string"
MDREQ2="fan_number int8 1 fan_speed int8"
MDREQ3="set_max uint8"

# Wait for clusterd to come up, then set the fan speed once and leave the loop.
i=1
while :; do
    PID=$(pidof clusterd)
    if [ -n "$PID" ]; then
        sleep 60
        echo "Adjusting fan speed"
        $MDREQ1 "/FAN/FAN1" $MDREQ2 $FAN_MIN $MDREQ3 $FAN_MAX
        $MDREQ1 "/FAN/FAN2" $MDREQ2 $FAN_MIN $MDREQ3 $FAN_MAX
        $MDREQ1 "/FAN/FAN3" $MDREQ2 $FAN_MIN $MDREQ3 $FAN_MAX
        $MDREQ1 "/FAN/FAN4" $MDREQ2 $FAN_MIN $MDREQ3 $FAN_MAX
        $MDREQ1 "/PS1/FAN1" $MDREQ2 $FAN_MIN $MDREQ3 $FAN_MAX
        break
    else
        sleep 30
        i=$((i+1))
        if [ "$i" -gt "$WAIT_MAX" ]; then
            echo "Timeout waiting for clusterd"
            break
        fi
    fi
done

exit 0

klui · Dec 22, 2024

My 6012 won't achieve stable fan rotation below 26%. It's probably due to how its fan curves are defined seeing how the fans are controlled by just /MGMT/FAN1 and there is no /PS1/FAN[12]. I'm also using one power cord.

I leave the PWM at around 35% with 4 connections: 2 40G: 1 AOC DAC, 1 MPO MM; 2 10G: 1 passive DAC, 1 RJ45 (HP 2x 10G RJ45).

Code:

sx6012c [standalone: master] # show temperature
---------------------------------------------------------
Module      Component              Reg  CurTemp    Status
                                        (Celsius)
---------------------------------------------------------
MGMT        SX                     T1   47.00      OK
MGMT        QSFP_TEMP1             T1   34.50      OK
MGMT        QSFP_TEMP2             T1   38.50      OK
MGMT        QSFP_TEMP3             T1   35.50      OK
MGMT        BOARD_MONITOR          T1   38.00      OK
MGMT        CPU_BOARD_MONITOR      T1   36.00      OK
MGMT        CPU_BOARD_MONITOR      T2   60.00      OK
sx6012c [standalone: master] # show fan
-----------------------------------------------------
Module          Device          Fan  Speed     Status
                                     (RPM)
-----------------------------------------------------
MGMT            FAN1            F1   6600.00   OK
MGMT            FAN2            F1   6600.00   OK
MGMT            FAN3            F1   6510.00   OK
MGMT            FAN4            F1   6840.00   OK
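If you are logging readings while tuning the PWM, a small helper can pull the hottest sensor out of saved "show temperature" output. This is only a sketch: the function name `hottest` is made up here, and it assumes the five-column layout shown above with the temperature in the fourth column.

```shell
# Hypothetical helper: read "show temperature" output on stdin and print
# the component, register and value of the highest reading.
hottest() {
    awk 'NF == 5 && $4 ~ /^[0-9]+(\.[0-9]+)?$/ {
             if ($4 + 0 > max) { max = $4 + 0; name = $2 " " $3 }
         }
         END { if (name != "") printf "%s %.2f\n", name, max }'
}

# Sample rows copied from the output above.
sample='MGMT        SX                     T1   47.00      OK
MGMT        QSFP_TEMP1             T1   34.50      OK
MGMT        CPU_BOARD_MONITOR      T1   36.00      OK
MGMT        CPU_BOARD_MONITOR      T2   60.00      OK'

printf '%s\n' "$sample" | hottest
```

Header, separator and prompt lines are skipped because they do not have exactly five fields with a numeric fourth field.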

kapone · Dec 22, 2024

My CPU_BOARD_MONITOR temp on the 6036, which is the highest (everything else is in the 20s...) is ~55-56 at 15-16% fan speed.

But that's idle, nothing connected. That said, I'm using DACs for the most part, no transceivers, so we'll see.

ZFSZealot · Jan 1, 2025

Wondering if anyone has any insight on some errors I'm seeing. I have converted a SX6012 and upgraded it nearly all the way per the 1.12 guide but I'm having trouble with 3.6.8010. 3.6.5000 works just fine. When booting 3.6.8010 I see errors when mount and fsck try to use /lib/libblkid.so.1:

Code:

Starting udev: mount: /lib/libblkid.so.1: no version information available (required by /lib/libmount.so.1)
mount: /lib/libblkid.so.1: no version information available (required by /lib/libmount.so.1)
mount: /lib/libblkid.so.1: no version information available (required by /lib/libmount.so.1)
[  OK  ]
Setting clock  (utc): Wed Jan  1 18:38:54 UTC 2025 [  OK  ]
Setting hostname localhost:  [  OK  ]
Checking filesystems
fsck: /lib/libblkid.so.1: no version information available (required by fsck)
fsck: /lib/libblkid.so.1: no version information available (required by fsck)
fsck: /lib/libblkid.so.1: no version information available (required by /lib/libmount.so.1)
fsck: /lib/libblkid.so.1: no version information available (required by /lib/libmount.so.1)
fsck: /lib/libblkid.so.1: no version information available (required by /lib/libmount.so.1)
Checking all file systems.
[  OK  ]
Remounting root filesystem in read-write mode:  mount: /lib/libblkid.so.1: no version information available (required by /lib/libmount.so.1)
mount: /lib/libblkid.so.1: no version information available (required by /lib/libmount.so.1)
mount: /lib/libblkid.so.1: no version information available (required by /lib/libmount.so.1)
[  OK  ]
mount: /lib/libblkid.so.1: no version information available (required by /lib/libmount.so.1)
mount: /lib/libblkid.so.1: no version information available (required by /lib/libmount.so.1)
mount: /lib/libblkid.so.1: no version information available (required by /lib/libmount.so.1)
Mounting local filesystems:  mount: /lib/libblkid.so.1: no version information available (required by /lib/libmount.so.1)
mount: /lib/libblkid.so.1: no version information available (required by /lib/libmount.so.1)
mount: /lib/libblkid.so.1: no version information available (required by /lib/libmount.so.1)
[  OK  ]

There are also a couple other errors that I would expect are caused by missing files because filesystems are not getting mounted:

Code:

Running firstboot script error reading information on service arp_responder: No such file or directory
...
file /opt/tms/customization_files/customization.6012 does not exists - safely using default values

I do see mention of this earlier in the thread, but in one place the response was that reinstalling 8010 fixed the problem, and in another that someone had copied kernel and related files out? of the u-boot storage. A third seemed to be related to trying to go directly to 8010 during the manufacturing stage:

https://forums.servethehome.com/index.php?threads/beware-of-emc-switches-sold-as-mellanox-sx6xxx-on-ebay.10786/post-396378

https://forums.servethehome.com/index.php?threads/beware-of-emc-switches-sold-as-mellanox-sx6xxx-on-ebay.10786/post-266798

https://forums.servethehome.com/index.php?threads/beware-of-emc-switches-sold-as-mellanox-sx6xxx-on-ebay.10786/post-306120

I've followed the 1.12 version of the guide that's posted earlier in the thread as a Google Doc, and up to this point it's almost been too easy. I deviated by using my already existing FreeBSD based TFTPD and web servers, and by only upgrading one partition to a new image version before making that partition active, rebooting and then installing the next image version to the newly inactive partition - I believe this is in keeping with the intent of having two different partitions - to be able to fall back to a good version - and it seems to have paid off here.

Here is the current "show images" output with it booted to 8010 and with those errors in the console:

Code:

sx6012 [standalone: master] # show images

Installed images:
  Partition 1:
    version: PPC_M460EX 3.6.8010 2018-08-20 18:04:16 ppc

  Partition 2:
    version: PPC_M460EX 3.6.5000 2017-11-10 18:14:29 ppc

Last boot partition: 1
Next boot partition: 1

Images available to be installed:
  1:
    Image  : image-PPC_M460EX-3.6.8010.img
    Version: PPC_M460EX 3.6.8010 2018-08-20 18:04:16 ppc

Serve image files via HTTP/HTTPS: no

No image install currently in progress.
No boot manager password is set.

Image signing              : trusted signature always required
Admin require signed images: yes

Settings for next boot only:
  Fallback reboot on configuration failure: yes (default)

The only thing I notice that seems a little off is the "/" filesystem seems very full, is this normal? When running the 3.6.5000 image:

Code:

[admin@sx6012 ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/mtdblock7  512M  500M   13M  98% /
/dev/mtdblock8  100M  3.7M   97M   4% /config
/dev/mtdblock9  860M  392M  469M  46% /var
tmpfs          1014M  6.5M 1008M   1% /dev/shm
tmpfs            64M  8.6M   56M  14% /vtmp

When running 3.6.8010:

Code:

[admin@sx6012 ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/mtdblock6  512M  481M   32M  94% /
/dev/mtdblock8  100M  3.7M   97M   4% /config
/dev/mtdblock9  860M  390M  471M  46% /var
tmpfs          1014M  6.6M 1008M   1% /dev/shm
tmpfs            64M  9.4M   55M  15% /vtmp
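To keep an eye on how full "/" is between image versions, something like the following could be used on saved df output. It is a rough sketch: `usage_pct` is a name invented for this example, and it assumes one line per filesystem with the mount point in the last column, as in the output above.

```shell
# Hypothetical helper: print the Use% value (without the % sign) for the
# mount point given as $1, reading df-style output on stdin.
usage_pct() {
    awk -v mp="$1" 'NR > 1 && $NF == mp { sub(/%/, "", $(NF-1)); print $(NF-1) }'
}

# Sample copied from the 3.6.8010 output above.
sample='Filesystem      Size  Used Avail Use% Mounted on
/dev/mtdblock6  512M  481M   32M  94% /
/dev/mtdblock8  100M  3.7M   97M   4% /config
/dev/mtdblock9  860M  390M  471M  46% /var'

pct=$(printf '%s\n' "$sample" | usage_pct /)
if [ "$pct" -ge 90 ]; then
    echo "root filesystem is ${pct}% full"
fi
```

On a live shell the same function could be fed from `df -P`, assuming the df on the switch accepts that option.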

The web interface seems to be running fine in 8010, but did move from http to https with the 8010 upgrade. I'm wondering if this is just a problem with 8010 and it would be fixed with 8012, but I hate to clobber the known good 3.6.5000 partition to try, and I'm not sure if I can go directly to 3.6.8012 from 3.6.5000.

Anyone have any insight? Is this some kind of problem with the kernel not matching the user space utilities and there's some additional manual step I need to take?

NablaSquaredG · Jan 1, 2025

ZFSZealot said:
A third seemed to be related to trying to go directly to 8010 during the manufacturing stage:

I can only recommend going directly to 8012 using the manufacturing process (you will need some extra flags: see here https://forums.servethehome.com/ind...anox-sx6xxx-on-ebay.10786/page-62#post-357920). Just be aware that it will take a lot longer.

For me, it feels like the switch runs MUCH better when you manufacture it directly with 8012 instead of going through the upgrade process (for whatever reason...)

ZFSZealot · Jan 1, 2025

Interesting - what gotchas are there when manufacturing directly to 3.6.8012 since I already "converted" this switch? I know some things are different on a previously converted switch - for example the notes in the v1.12 document about changing the device bus number to 8 instead of 1 when writing the modified FRU back.

So strange that with 3.6.5000 and 3.6.8012 that the mounted filesystems look the same, just /, /config and /var, even though one seems to have errors running fsck and mount during boot and the other doesn't... I wonder if these are just warnings - but then there are those errors about it not being able to find the customization file and having trouble during the firstboot script - maybe not even related?

klui · Jan 1, 2025

I went from 3.4.0012 to 3.6.8012 when I first started. I also manufactured directly to 3.6.8012.

The error messages are also there when you first boot after installing 8012. They no longer show up on subsequent reboots. Space usage on 8012 is the same as 8010.

andvalb · Jan 3, 2025

klui said:
I went from 3.4.0012 to 3.6.8012 when I first started. I also manufactured directly to 3.6.8012.

The error messages are also there when you first boot after installing 8012. They no longer show up on subsequent reboots. Space usage on 8012 is the same as 8010.

Errors at first boot are normal.

ZFSZealot · Jan 6, 2025

andvalb said:
Errors at first boot are normal.

Thanks for the response but these are not the CRC errors mentioned in the 1.12 guide. I'll reboot again but I'm almost certain I've tried rebooting on 3.6.8010 and the errors are the same each time. Kind of wondering if this is something others haven't noticed if they didn't keep the console up while rebooting through the various upgrades when following the 1.12 guide.

EDIT: @andvalb I stand corrected and owe you and @klui an apology. It does look like after rebooting into 3.6.8010 - after issuing "reload" from 3.6.8010 - the errors are gone. I had booted back and forth between 3.6.5000 and 3.6.8010 and saw the errors when booting 3.6.8010 each time - but that was after "reload" from 3.6.5000 each time.

Thanks so much for pointing me in the right direction and I'm hoping this is a worthwhile contribution to the thread - reinforcing that not only CRC errors but these shared library (libblkid.so.1) related errors are expected on first boot into a new image version.
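For anyone who wants to confirm this on their own switch, one option is to save the serial console output of each boot and count the warnings. A minimal sketch, assuming the console text has been captured to a file (the file used below is a temporary stand-in, and `count_blkid_warnings` is a made-up name):

```shell
# Hypothetical helper: count libblkid version warnings in a console capture.
count_blkid_warnings() {
    grep -c 'libblkid.so.1: no version information available' "$1"
}

# Stand-in capture built from a few lines of the boot log quoted earlier.
log=$(mktemp)
cat > "$log" <<'EOF'
Starting udev: mount: /lib/libblkid.so.1: no version information available (required by /lib/libmount.so.1)
fsck: /lib/libblkid.so.1: no version information available (required by fsck)
Checking all file systems.
EOF

echo "warnings in this boot: $(count_blkid_warnings "$log")"
rm -f "$log"
```

Comparing the count from the first boot into a new image with the count after a "reload" from that same image should show whether the warnings went away.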

nongio · Jan 15, 2025

Guys, awesome work.
Here are two more SX6012s now working. Thanks to everyone.

CoryC · Feb 3, 2025

Here's another switch in 2025 upgraded. I don't know how people figured out this process, but I'm grateful.

BoGs · Feb 5, 2025

Same here finally got around to the script and playing with port splits - soooooo very nice! Thanks all

Marc_ · Feb 6, 2025

Morning all, picked up another sx6018 cheap to convert but I've hit a snag. It looks like the u-boot env has been defaulted and lost a lot of what was there. When I run the boot_mlxlinux command I get :
=> run boot_mlxlinux
## Error: "boot_mlxlinux" not defined

Looking at the env I kept from the last one I did it should be:
boot_mlxlinux=setenv flash_jffs2 run mlxlinux; setenv menu_usb run mlxlinux; saveenv

However I have set this and it now drops back to the => prompt when I run the command.

I also tried setting the mlxlinux=run jffs2_args boot_common_args;bootm ${kernel_addr} - ${fdt_addr} env but this gives:
Wrong Image Format for bootm command
ERROR: can't get kernel image!

Not sure what else needs to be set in order to get past this to flash the damn thing.

CoryC · Feb 6, 2025

At one point, I used resetenv or envreset. Then I entered the specific environment variables from the guide.

Try that.

Also, you can copy/paste the appendix that has all of them to set that as a baseline.

Marc_ · Feb 6, 2025

CoryC said:
At one point, I used resetenv or envreset. Then I entered the specific environment variables from the guide.

Try that.

Also, you can copy/paste the appendix that has all of them to set that as a baseline.

Did that still work as the guide variables are for the mellanox u-boot and not the emc? I'll give it a go either way as what's the worst that can happen.

CoryC · Feb 6, 2025

Marc_ said:
Did that still work as the guide variables are for the mellanox u-boot and not the emc? I'll give it a go either way as what's the worst that can happen.

Well, the EMC specific partitions are different from the Mellanox FF00 stuff vs FFXX specified by EMC variables. Shorter answer is EMC stuff won't boot anymore. Backup the variables. I did a printenv, copy/paste to text file when using putty. That way I had the old stuff. Then I did the reset.
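Once the printenv output is saved to a text file before and after the reset, the two dumps can be compared to see which variables were lost. The sketch below assumes each dump has one name=value pair per line; `missing_vars` is a name invented here, and the sample values other than boot_mlxlinux (quoted earlier in the thread) are placeholders.

```shell
# Hypothetical helper: list variable names present in the first dump ($1)
# but absent from the second ($2).
missing_vars() {
    old_names=$(mktemp)
    new_names=$(mktemp)
    grep '=' "$1" | cut -d= -f1 | sort -u > "$old_names"
    grep '=' "$2" | cut -d= -f1 | sort -u > "$new_names"
    comm -23 "$old_names" "$new_names"
    rm -f "$old_names" "$new_names"
}

# Stand-in dumps; real ones would be pasted from the terminal session.
old=$(mktemp)
new=$(mktemp)
cat > "$old" <<'EOF'
baudrate=115200
boot_mlxlinux=setenv flash_jffs2 run mlxlinux; setenv menu_usb run mlxlinux; saveenv
EOF
cat > "$new" <<'EOF'
baudrate=115200
EOF

missing_vars "$old" "$new"
rm -f "$old" "$new"
```

Only the variable names are compared, so a variable that still exists with a different value will not be reported.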

So did it work ultimately, yes, I was able to follow the guide. Each time I needed to boot into a temp linux environment or boot from the TFTP hosted version, I pasted in the env's the guide had in the step's instructions. After the install, it worked fine as the installation script the guide has you run also sets the needed env stuff it needed to work during the install.

Edit: This thread is so epically long. Might make sense to start a new one?

Marc_ · Feb 6, 2025

I managed to get it to work. Reset the environment to default, made sure anything dhcp related was there and proceeded with the guide. The issue I had was it wasn't bringing the mgmt0 port alive without dhcp. Once that was solved I managed to reflash the switch.

BoGs · Feb 7, 2025

6036 does not like LR4 and my SMF - so much for my PLR4 qsfp modules, time for MMF as it's so nice and quiet with the script

NablaSquaredG · Feb 7, 2025

BoGs said:
6036 does not like LR4 and my SMF - so much for my PLR4 qsfp modules, time for MMF as it's so nice and quiet with the script

Classic case of "Read the manual"...

LR4 transceivers are only supported on ports 1, 3, 33 and 35.

You also need to do fae cable-stamping-unlock 40g_lr4

You could also patch the firmware of the ASIC to enable high power on all ports

BoGs · Feb 7, 2025

I read about patching the hardware in this thread but all of it was so secret, wondered if the downside is burning down your house lol. Is there anywhere I could read up on that? I only run like 5 modules and max length is 300m not like I am shooting 80km.
