RoCE v1 implementation (SX6036 heatsink/silence mod running log!)


unphased

Active Member
Jun 9, 2022
148
26
28
I'm new to fiber, even though I'm very acquainted with this one $26, 15m length of MPO fiber that I have routed around my house two times now. OS2 fiber is *really cheap* for short lengths and I'm reading that you can reach absurdly long distances with this technology. It's just so cool to me.

I just read up on the technologies that drive 100G fiber, and CWDM4 is the option with the second-longest range. It is incredible to be able to get this tech for $4 a pop.
 

unphased

Active Member
Jun 9, 2022
148
26
28
Alright I just pulled the trigger for $161 after tax on a SX6036. Will have some fun with this thing figuring out how to quiet it down and it's probably going to spend years leaning against the wall in the basement. It's such a mess there I couldn't make space for a rack if I wanted to. However, I might think about standardizing on racks, because I think I will carry on the tradition of doing serious work on consumer hardware to get peak single thread speed as well as performance per dollar, but tower cases just really aren't gonna be doing it for me. I'm envisioning some kind of 2U or 4U sliding racks with modular designs so I can have "disk modules" and "GPU modules" and maybe "water cooling modules" efficiently laid out.

I made the decision on the switch, because as I read more, all signs started pointing toward it. I also learned after making the purchase that this thing idles at less than 50 watts. That's a huge win considering that none of the other choices in the running here come even close.
 

DavidWJohnston

Active Member
Sep 30, 2020
242
191
43
Wow that sounds like a really good deal - I don't know much about the SX6036. I've read some models might need a license for Ethernet support (some kind of "gateway" license), but there are also eBay listings for "license assistance", probably someone with a certain piece of software that takes the serial number as an input and generates some kind of desirable output.

For my homelab, I have too many non-standard size items, and I change stuff around too frequently to use a proper rack. I bought these two black shelves, and this corner of the basement is dedicated to my lab. Generally I put heavier stuff on the bottom (UPSs). Cable management obviously needs work!! But something semi-structured like this might work for you.

[Attached image: 1696878411433.png]

[Attached image: 1696878660103.png]
 

i386

Well-Known Member
Mar 18, 2016
4,245
1,546
113
34
Germany
but there are also eBay listings for "license assistance", probably someone with a certain piece of software that takes the serial number as an input and generates some kind of desirable output.
f*** these people selling "cracked" keys for profits
it's possible to generate the keys/licenses, search this forum for ltrace to find more information
 

unphased

Active Member
Jun 9, 2022
148
26
28
Yeah I forget his handle, but I came across a patron of this forum who helps people out with licensing on these units in DMs so I will surely hit him up regarding that.

@DavidWJohnston Thank you for sharing photos. Yes I don't think it will be a problem at all to just stack these things on shelves!

Do you need to do any special configuration to get the CWDM4 100G transceivers to operate at 40G with 40G equipment? The 40G SR4 MPO optics I'm currently using work fine between my Thunderbolt enclosure CX4 and a CX3 card, so I'd imagine they will be fine in the switch. I hope I'll be able to use those cheap CWDM4 100G transceivers from CX3s to the switch and carry them forward to 100G once I upgrade to a 100G switch and CX4/CX5 NICs. (Realistically, alongside this I'll also need to explore RDMA software more fully and dip back into HEDT to leverage 100G because of the PCIe lanes. I will definitely consider x8/x8 bifurcation, though, as GPUs are typically not starved at x8, and I've been led to believe 80Gbps is possible on x8, but 50Gbit might be more realistic.)
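As a rough sanity check on the x8 numbers above, here is a minimal sketch that computes raw PCIe link bandwidth from the per-lane transfer rate and line encoding. It ignores packet (TLP/DLLP) overhead, so real throughput will land somewhat lower. By this arithmetic a PCIe 3.0 x8 link tops out near 63 Gbit/s, so 80 Gbit/s on x8 would need a PCIe 4.0 link; which generation a given NIC and slot negotiate is something to check per card.

```python
# Raw PCIe bandwidth per direction, before packet (TLP/DLLP) overhead.
# Per-lane rates and encodings are the standard figures for each generation.
PCIE_GENERATIONS = {
    # generation: (gigatransfers per second per lane, encoding efficiency)
    2: (5.0, 8 / 10),      # 8b/10b encoding
    3: (8.0, 128 / 130),   # 128b/130b encoding
    4: (16.0, 128 / 130),  # 128b/130b encoding
}


def pcie_gbps(generation: int, lanes: int) -> float:
    """Return the raw link bandwidth in Gbit/s for one direction."""
    rate, efficiency = PCIE_GENERATIONS[generation]
    return rate * efficiency * lanes


for gen, lanes in [(3, 8), (3, 16), (4, 8)]:
    print(f"PCIe {gen}.0 x{lanes}: {pcie_gbps(gen, lanes):.1f} Gbit/s")
```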
 

unphased

Active Member
Jun 9, 2022
148
26
28
@DavidWJohnston

I got the switch in, but it is likely best for my sanity to wait for delivery of my serial RJ45-to-USB cable before trying to interface with it. It is not providing DHCP and I do not know if it will be feasible to connect to the webui. tcpdump isn't giving me an IP. I am going to try manually assigning an IP.

I had ordered 4 of these ColorChip 100G CWDM4 transceivers since they were the cheapest at $4 each. However it seems like I have one already that is dead. With the other 3 it seems like I can push up to 24Gbit with -P8 in iperf3, sometimes, and also these things appear to get pretty warm. It seems like they warm up even when not transmitting much. Maybe that’s just how CWDM4 works. I shall want to not use these without active cooling as I suspect they will toast themselves. I think active cooling might allow the speeds to actually be good, needs more testing but it feels to me so far like once they warm up the speeds go to shit.

I will order some of these $8 KAIAM 40G-LR4 ones and the Intel ones I was eyeing. Hard to say if it's this 20 meter fiber that is cheap and bad or if it is the transceivers, but I'm seeing a ton of retries (like, thousands per second) in iperf3. Sometimes I was getting sub-1 Gbit speeds with hundreds of retries. It seems like for short distances a DAC cannot be beat.
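To compare transceivers on retries rather than by eyeballing the scrolling output, one option is iperf3's JSON mode (`-J`) and pulling the totals out of the end-of-test summary. This is only a sketch: the sample report below is hand-made in the shape iperf3 prints for a TCP client run, and its numbers are made up for illustration.

```python
import json


def summarize_iperf3(report: dict) -> dict:
    """Extract throughput and retransmit rate from an iperf3 -J client report."""
    sent = report["end"]["sum_sent"]
    seconds = sent["seconds"]
    retransmits = sent.get("retransmits", 0)  # only present for TCP tests
    return {
        "gbps": sent["bits_per_second"] / 1e9,
        "retransmits": retransmits,
        "retransmits_per_sec": retransmits / seconds if seconds else 0.0,
    }


# Hand-made sample, not a real measurement.
sample = json.loads(
    '{"end": {"sum_sent": {"seconds": 10.0, '
    '"bits_per_second": 24.0e9, "retransmits": 31000}}}'
)
print(summarize_iperf3(sample))
```

Running the same summary once per transceiver and once over a DAC would make it easier to tell whether the retries follow the optics or the fiber.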

Oh I also just found on ebay some roughly $5 a pop FFB0412VHN Delta fans which hopefully can be somewhat quieter. Gonna grab a few of those.

Fun times.
 

DavidWJohnston

Active Member
Sep 30, 2020
242
191
43
Cool, glad to hear you received it. For sure you'll need serial console access to get started, and troubleshoot/reset if something goes wrong.

I'm not that surprised one of the transceivers is broken, it happens. CWDM4 transceivers do run hot. Make sure to keep spares on-hand. It also might just be dirty...

Make sure to keep your fiber connections surgically clean. Buy a bunch of those 1-click cleaners, and use them, even on new cables. When the glass fiber ferrules come together at the connection point, the amount of pressure (PSI) between them is extreme, and the material is very hard. Any tiny piece of dust can cause huge problems and even crack the glass. Even if I unplug a fiber for a moment, I use the 1-click every time on both ends. Fiber cleanliness is an art, and there are lots of great YT videos about it.

Iperf3 does not use multiple cores of your CPU even with -Px, but you can run multiple instances, that might help. You really need RoCE though to get high throughput. For a single iperf3 I think I can get around 35G, after messing with the parameters, jumbo frames, etc. Check your CPU usage.
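The "run multiple instances" idea can be sketched as below: one iperf3 server and one client per port, so each pair gets its own process. The server address is a hypothetical placeholder, and the instance count is arbitrary. As far as I know, iperf3 releases from 3.16 onward service `-P` streams on separate threads, so this workaround matters mostly for older builds.

```python
import shlex


def iperf3_commands(server: str, instances: int, base_port: int = 5201,
                    seconds: int = 10):
    """Build one (server, client) command pair per instance, each on its own port."""
    pairs = []
    for i in range(instances):
        port = base_port + i  # 5201 is iperf3's default port
        server_cmd = ["iperf3", "-s", "-p", str(port)]
        client_cmd = ["iperf3", "-c", server, "-p", str(port),
                      "-t", str(seconds), "-J"]
        pairs.append((server_cmd, client_cmd))
    return pairs


# 10.22.1.2 is a made-up address for the machine running the servers.
for server_cmd, client_cmd in iperf3_commands("10.22.1.2", instances=4):
    print(shlex.join(server_cmd), "|", shlex.join(client_cmd))
```

Start the server commands on one machine, the client commands at the same time on the other, and add up the per-instance throughput.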
 

unphased

Active Member
Jun 9, 2022
148
26
28
Interesting! The fiber clicks into the transceiver and there is some spring/slop there, but it's possible I suppose that the inner part is sliding or something and it remains in contact the whole time. I was reading about UPC/APC and it is unclear how to tell what type a given fiber or transceiver is, given I'm always using the cheapest of the cheap stuff... Will look into the cleaners, thanks for the tip.

Yeah, I am aware iperf3 won't use more cores even with -P, and I am not sure what it means that I get a lot more throughput with higher -P, but that is what I was seeing there.

With a DAC I was getting 20Gbit going one way and 27 the other way. With these CWDM4 I got up to 25ish one way and around 1.2 or something (it varies greatly) going the other way. Not sure what it means yet.
 

DavidWJohnston

Active Member
Sep 30, 2020
242
191
43
Yeah the ferrules are in spring tension. When fully seated if you push them together a little harder you will be compressing the spring(s). The 1-click cleaners are highly effective and cheap.

Generally, based on the color of the connector: UPC is blue, and APC is green. Transceivers are usually UPC, except for GPON (usually SC) and some long-haul transceivers. Best to check the specs.

There is a way you can sort of tell with these cheap fiber microscopes:


The included tips face the microscope's focal point straight on, so it is impossible to see the whole face of an APC connector in focus. With a little practice and persistence, if you slightly tilt an APC connector while viewing it, you can catch a glimpse of the APC angle from the changing focal point. This only works with patch cables.

If you get vastly different speeds with CWDM4 vs. DAC (and everything else constant) probably it's dirty fiber, bad transceiver, or incorrect FEC settings. For most 100G transceivers you should set FEC=rs.
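On a Linux host, the FEC mode on the NIC side can usually be inspected and changed with ethtool. The sketch below only builds the commands. The interface name is a hypothetical placeholder, and whether a given NIC and driver accept `--set-fec` is an assumption to verify on your own hardware.

```python
FEC_MODES = {"auto", "off", "rs", "baser"}


def fec_commands(interface: str, mode: str = "rs"):
    """Return the ethtool commands to show and then set the FEC encoding."""
    if mode not in FEC_MODES:
        raise ValueError(f"unsupported FEC mode: {mode}")
    return [
        ["ethtool", "--show-fec", interface],
        ["ethtool", "--set-fec", interface, "encoding", mode],
    ]


# "enp1s0" is a made-up interface name; substitute your own.
for cmd in fec_commands("enp1s0"):
    print(" ".join(cmd))
```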
 

klui

Well-Known Member
Feb 3, 2019
842
462
63
Ignoring most older types of connectors, single mode MPO is almost always APC. SM LC is typically UPC. I've only seen UPC on multimode LC/MPO. Refer to the green/blue codes @DavidWJohnston stated for LC/SC.

EDIT: just saw today multi-mode MPO-16 come in APC (for 400G).
 

unphased

Active Member
Jun 9, 2022
148
26
28
I reckon it's probably safe to say that there is no UPC/APC mismatch if any sort of functioning connection is established!

This is fascinating knowledge. thank you.
 

erock

Member
Jul 19, 2023
84
17
8
1) RoCE can be enabled/disabled via config
3) Some older drivers are shipped with Windows and don't have all features enabled. Use the newest available mlnx drivers.
The other support depends on the software used.
4) Get an Arista switch and set the fan speed to 30%. It's still not silent, but far away from a jet engine.
Does anyone know if RoCE can be configured with ConnectX-3 (non-Pro) cards on Ubuntu Linux 22.04? It seems that the Mellanox drivers are for old Linux versions and kernels and can't be installed.
 

unphased

Active Member
Jun 9, 2022
148
26
28
Hm this is a good point. 20.04 and/or kernel 5.x might actually be the end of the road
 

unphased

Active Member
Jun 9, 2022
148
26
28
Wow thank you all for the knowledge and links!

I received my console cable and, after some faffing about to learn that 9600 is the correct baud rate, I am into the console! As a recap, I'm working toward bringing up the SX6036 I just got!
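For anyone scripting the console instead of using a terminal program, here is a minimal sketch of the serial settings. Only the 9600 baud rate comes from my experience above; 8 data bits, no parity, 1 stop bit and the /dev/ttyUSB0 device path are assumptions, and `read_banner` needs the third-party pyserial package.

```python
# Only the baud rate is confirmed above; 8N1 framing is an assumption.
CONSOLE_SETTINGS = {
    "baudrate": 9600,
    "bytesize": 8,
    "parity": "N",
    "stopbits": 1,
    "timeout": 2,  # seconds to wait for output before giving up
}


def screen_command(port: str = "/dev/ttyUSB0"):
    """Equivalent interactive session using GNU screen."""
    return ["screen", port, str(CONSOLE_SETTINGS["baudrate"])]


def read_banner(port: str = "/dev/ttyUSB0") -> str:
    """Send a newline and return whatever the switch prints back."""
    import serial  # pyserial, imported lazily so the rest works without it

    with serial.Serial(port, **CONSOLE_SETTINGS) as console:
        console.write(b"\r\n")
        return console.read(256).decode(errors="replace")


print(" ".join(screen_command()))
```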

So far:

Code:
switch-8a36c2 [standalone: master] > show system profile

Profile         : ib
Number of SWIDs : 1
Adaptive Routing: yes

switch-8a36c2 [standalone: master] > show uboot
UBOOT version : U-Boot 2009.01 SX_PPC_M460EX SX_3.2.0330-82 ppc (Dec 20 2012 - 17:53:54)

switch-8a36c2 [standalone: master] > show images

Installed images:
  Partition 1:
    version: PPC_M460EX 3.6.8010 2018-08-20 18:04:16 ppc

  Partition 2:
    version: PPC_M460EX 3.6.5009 2018-01-02 07:42:18 ppc

Last boot partition: 1
Next boot partition: 1

Images available to be installed:
  No image files are available to be installed.

Serve image files via HTTP/HTTPS: no

No image install currently in progress.
Boot manager password is set.

Image signing              : trusted signature always required
Admin require signed images: yes

Settings for next boot only:
  Fallback reboot on configuration failure: yes (default)

switch-8a36c2 [standalone: master] > show asic-version
---------------------------------------------------
Module             Device              Version
---------------------------------------------------
MGMT               SX                  9.4.5070

switch-8a36c2 [standalone: master] > show inventory
-----------------------------------------------------------------------------
Module           Part Number        Serial Number        Asic Rev.    HW Rev.
-----------------------------------------------------------------------------
CHASSIS          MSX6036T-1SFS      MT1645X09712         N/A          AB
MGMT             MSX6036T-1SFS      MT1645X09712         2            AB
FAN              MSX60-FF           MT1642X02818         N/A          A1
PS1              MSX60-PF           MT1643X02475         N/A          A1

switch-8a36c2 [standalone: master] > show protocols

Infiniband:               enabled
sm:                     disabled
router:                 disabled

switch-8a36c2 [standalone: master] > show voltage
------------------------------------------------------------------------------------------------
Module   Power Meter              Reg                    Expected  Actual   Status  High   Low
                                                         Voltage   Voltage          Range  Range
------------------------------------------------------------------------------------------------
MGMT     BOARD_MONITOR            USB 5V                 5.00      5.08     OK      5.75   4.25
MGMT     BOARD_MONITOR            Asic I/0               2.27      2.17     OK      2.61   1.93
MGMT     BOARD_MONITOR            1.8V                   1.80      1.81     OK      2.07   1.53
MGMT     BOARD_MONITOR            SYS 3.3V               3.30      3.31     OK      3.79   2.80
MGMT     BOARD_MONITOR            CPU 0.9V               0.90      0.89     OK      1.10   0.81
MGMT     BOARD_MONITOR            1.2V                   1.20      1.19     OK      1.38   1.02
MGMT     CURR_MONITOR             12V                    12.00     11.70    OK      13.80  10.20
MGMT     CPU_BOARD_MONITOR        2.5V                   2.50      2.48     OK      2.88   2.12
MGMT     CPU_BOARD_MONITOR        SYS 3.3V               3.30      3.34     OK      3.79   2.80
MGMT     CPU_BOARD_MONITOR        SYS 3.3V-SEC           3.30      3.30     OK      3.79   2.80
MGMT     CPU_BOARD_MONITOR        1.8V                   1.80      1.81     OK      2.07   1.53
MGMT     CPU_BOARD_MONITOR        1.2V                   1.20      1.24     OK      1.38   1.02
I guess it's running mlnx_os? And is not an EMC switch? But I already knew that because it's black and blue? Anyway I quite like this switch, it is a well-built piece of hardware. certainly a steal for what I paid. Already got some quieter fans on order. Seems like swapping fans on the PSUs might be slightly scary but shouldn't be a big deal. I loosened one of the fans to confirm the headers are consistent across all 6 fans...

Now scratching my head as to what the next steps will be. I probably need to update the software. But it's also not clear what to do first on the road to enabling Ethernet.
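Since the `show voltage` table is fixed-width text, a quick way to keep an eye on the rails after a fan swap is to parse it and compare each reading against its own High/Low range. This is a sketch that assumes columns are separated by two or more spaces, as in the output above; the two sample rows are copied from that output.

```python
import re

SAMPLE = """\
Module   Power Meter              Reg                    Expected  Actual   Status  High   Low
MGMT     BOARD_MONITOR            USB 5V                 5.00      5.08     OK      5.75   4.25
MGMT     CURR_MONITOR             12V                    12.00     11.70    OK      13.80  10.20
"""


def parse_voltage_rows(text: str):
    """Return one dict per data row, skipping headers and separator lines."""
    rows = []
    for line in text.splitlines():
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) != 8:
            continue  # separator line or wrapped header
        try:
            expected, actual = float(fields[3]), float(fields[4])
            high, low = float(fields[6]), float(fields[7])
        except ValueError:
            continue  # the header row has text in the numeric columns
        rows.append({
            "reg": fields[2],
            "actual": actual,
            "deviation_pct": (actual - expected) / expected * 100,
            "in_range": low <= actual <= high,
        })
    return rows


for row in parse_voltage_rows(SAMPLE):
    print(f'{row["reg"]:<8} {row["actual"]:6.2f} V '
          f'{row["deviation_pct"]:+.1f}% in_range={row["in_range"]}')
```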
 

erock

Member
Jul 19, 2023
84
17
8
Thank you for sharing. I have been eyeing the SX6036 but decided to build my small cluster without a switch until I decide to add more than 5 nodes. I am looking forward to learning from your journey. I am particularly interested in whether or not Ethernet mode works as-is.
 

unphased

Active Member
Jun 9, 2022
148
26
28
Yes. This is why I am posting, to share and empower others like us! @erock what is your method for connecting your cluster without a switch? Do you run a daisy chain network? I planned to do that for the longest time as you can see in my first post here at least with my computers, but, I realized making a switch is the only practical way to approach it... proper routing/switching/bridging, and especially trying to do that at huge speeds even ignoring CPU overheads, is likely to be difficult/impractical/silly in ways that I never got close to being able to estimate. The thought of maxing out CPU on every computer in the chain of communication is frankly preposterous, haha.

I have found the megathread about converting EMC switches to run MLNX_OS, which certainly looks like a journey. But I am not starting from behind like that; I have a bona fide Mellanox switch here. I am being super careful about editing things willy-nilly in the CLI right now, but it does feel very nice to have the serial console up and running.

Still poking around atm to work out how to enable IPv4 DHCP on the management interfaces. I did find the initial wizard workflow, ran it, told it to turn on DHCP and gave it an IP, but it does not seem to be working. I even found a CLI command to reboot the switch and that worked too. Still no IP address getting assigned.
 

unphased

Active Member
Jun 9, 2022
148
26
28
Update:

So, I have been going off of this doc: https://delivery04.dhe.ibm.com/sar/CMA/XSA/MLNX-OS_VPI_v3_4_3002_UM.pdf

I ran the wizard again (over the serial console) and this time instead of choosing DHCP I chose to assign the switch a static address.

From curl I see

Code:
❯ curl 10.22.1.1:443                                            
curl: (1) Received HTTP/0.9 when not allowed
which is kind of funny but I get that this is enterprise equipment

The webui totally loads! woot.
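About that curl error: with no scheme given, curl defaults to plain http://, so it most likely sent a cleartext request to a port that speaks TLS and could not make sense of the reply. `curl -k https://10.22.1.1/` should behave better. The same check in Python is sketched below; skipping certificate verification assumes the switch presents a self-signed certificate, which I have not confirmed.

```python
import ssl
import urllib.request


def webui_url(host: str, port: int = 443) -> str:
    """Build an https:// URL, leaving out the port when it is the default."""
    return f"https://{host}/" if port == 443 else f"https://{host}:{port}/"


def fetch_status(host: str) -> int:
    """Return the HTTP status of the web UI, skipping certificate checks."""
    context = ssl.create_default_context()
    context.check_hostname = False       # must be disabled before verify_mode
    context.verify_mode = ssl.CERT_NONE  # assumption: self-signed certificate
    with urllib.request.urlopen(webui_url(host), context=context,
                                timeout=5) as response:
        return response.status


print(webui_url("10.22.1.1"))
# fetch_status("10.22.1.1") would perform the actual request.
```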
 

unphased

Active Member
Jun 9, 2022
148
26
28
OK so summarizing what the sparsely distributed clues are... we are to download an x86 image for switchx, unpack it, and look with ltrace in one of the programs to fetch a license key, which will be able to unlock the VPI for ethernet. Great.

Strangely this .img download is actually a zip, and inside is a .tbz (a bzip2 tarball) ... and inside that the programs can be found.
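To see what is inside without unpacking everything to disk, the nesting described above (a zip wrapping a bzip2 tarball) can be walked with the standard library. This is a sketch: the function name and the file name in the usage comment are hypothetical, and it reads each inner tarball fully into memory, which may be heavy for a large image.

```python
import io
import tarfile
import zipfile


def list_nested_members(img_path: str) -> dict:
    """Map each bzip2 tarball inside the zip to the list of paths it contains."""
    listing = {}
    with zipfile.ZipFile(img_path) as outer:
        for name in outer.namelist():
            if not name.endswith((".tbz", ".tbz2", ".tar.bz2")):
                continue
            inner_bytes = io.BytesIO(outer.read(name))
            with tarfile.open(fileobj=inner_bytes, mode="r:bz2") as inner:
                listing[name] = inner.getnames()
    return listing


# Usage, with a made-up file name:
# for tarball, members in list_nested_members("image.img").items():
#     print(tarball, len(members), "entries")
```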

I had to fiddle with them for like an hour, but I actually got an Ethernet license enabled via the WebUI. (Edit: Nope, I still haven't worked out the right string names to use, but it seems like I have some kind of procedure for obtaining "valid" keys.)

Now to figure out how to turn it on... probably the CLI is more discoverable in this regard. The webUI is so slow to switch tabs.
 