For my setup, I already have four CX3 non-Pro cards. I needed to get a single CX4 card to support the mlx5 driver that is now bundled with macOS, so my Macs can now be on the network via Thunderbolt.
I am interested in setting up a simple point-to-point network at home. The topology diagram below gives a good idea of what I'm working with.
Yes, I have a lot of slots here for NICs I have not obtained yet. As the diagram shows, I am thinking of making a poor man's switch out of the Threadripper, and I'll be able to load up the rest of its PCIe lanes with NVMe storage. Unfortunately, my fastest systems also have GPUs occupying the large PCIe slots, so only the 5600G and my two X99 systems (I don't recall the model number on that Xeon) will benefit from having more than 4 PCIe gen 3 lanes, which these NICs need for full bandwidth. But that is okay.
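For concreteness, here's the back-of-envelope lane math (just a sketch using the usual PCIe gen 3 figures):

```shell
# PCIe gen 3 payload bandwidth per direction: 8 GT/s per lane with
# 128b/130b encoding, versus the 40 Gbit/s line rate of these NICs.
gen3_x4=$(awk 'BEGIN { printf "%.1f", 8 * 128/130 * 4 }')
gen3_x8=$(awk 'BEGIN { printf "%.1f", 8 * 128/130 * 8 }')
echo "gen3 x4: ${gen3_x4} Gbit/s (under 40), gen3 x8: ${gen3_x8} Gbit/s"
# -> gen3 x4: 31.5 Gbit/s (under 40), gen3 x8: 63.0 Gbit/s
```

So an x4 slot caps out around 31.5 Gbit/s, which is why only the systems with a free x8-or-wider slot will see the full 40 Gbit/s.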
I am interested in learning how to implement RoCE v1, which I understand to be a layer 2 protocol, and I think the limited needs of my home network mean I can "get away with" RoCE v1 to achieve RDMA acceleration. Please correct me if I'm wrong, but layer 3 and routing don't seem like something I'll need or want for a setup like this.
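As a concrete way to see the v1/v2 distinction on Linux, the RoCE version is exposed per GID in sysfs. A sketch (`mlx4_0` and port 1 are assumptions for one of the CX3s; substitute whatever `ibv_devices` reports):

```shell
# Sketch: list which RoCE version each GID on the port uses.
# mlx4_0 and port 1 are placeholders for one of the CX3 cards.
gid_dir=/sys/class/infiniband/mlx4_0/ports/1/gid_attrs/types
found=0
for t in "$gid_dir"/*; do
    if [ -r "$t" ]; then
        printf '%s: %s\n' "${t##*/}" "$(cat "$t")"
        found=$((found + 1))
    fi
done
if [ "$found" -eq 0 ]; then
    echo "no RDMA device at $gid_dir (card not installed or different name)"
fi
# Entries reading "IB/RoCE v1" are v1 GIDs (own ethertype, layer 2 only,
# not routable); "RoCE v2" entries are UDP/IP-encapsulated and routable.
```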
I still have lots of things I do not understand and am not familiar with.
Code:
┌──────┐       ┌────────┐      ┌──────────┐      ┌────────┐
│M1 Max├──TB3──┤ MCX455 ├──────┤  MCX354  ├──────┤ MCX354 ├─ R9 5950X
└──────┘       └────────┘      │ TR 1950X │      └────────┘
               ┌────────┐      │          │      ┌────────┐
    R5 5600G ──┤ ?????? ├──────┤  MCX354  ├──────┤ MCX353 ├─ R7 5800X3D
               └────────┘      │          │      └────────┘
               ┌────────┐      │          │      ┌────────┐
       5820K ──┤ ?????  ├──────┤  ??????  ├──────┤ ?????? ├─ E5-2696?
               └────────┘      └──────────┘      └────────┘
- Is RoCE something you simply turn on, provided both ends of a link have supporting equipment (typically both NICs and the switch; in my case, just both NICs), after which the regular Ethernet connections between the machines suddenly become RDMA-enabled? If it is not that simple, what would be a sensible way to reason about it?
- What software would I use to implement the bridging that makes the Threadripper node act as a switch? If I work out how to enable RoCE, would a Linux software bridge (bridge-utils/brctl) then gain RDMA capability and keep traffic that traverses the Threadripper offloaded from its CPU?
- Is there RoCE support in the CX3 drivers for Windows (which will be running on the 5800X3D system)?
- Obviously one thing to consider is to not load up the Threadripper with NICs and instead get a 40Gbit switch. The main concern there is that it seems I would need to heavily modify a switch to keep it from sounding like a jet engine in my office. Granted, if I do resurrect all these systems into this Frankenstein network-and-compute cluster, it will be kind of loud and warm anyway, so adding an enterprise switch might not be unreasonable. But for now I am interested in whether there is something useful/edifying I could do along the above lines. The fact that it gives my Threadripper a sentimental second chance at life is a factor, yes.
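On the first question, here is my rough picture of what "checking that RoCE is on" looks like on Linux with rdma-core (a sketch; device names and IPs are placeholders):

```shell
# Sketch: with both NICs cabled and rdma-core installed, the verbs device
# should simply show up; there is no separate "enable RDMA" switch for the
# link itself.
if command -v ibv_devinfo >/dev/null 2>&1; then
    # "link_layer: Ethernet" means the port is in Ethernet mode,
    # i.e. RoCE rather than native InfiniBand.
    ibv_devinfo | grep -E 'hca_id|link_layer|state'
    verbs=yes
else
    echo "install rdma-core / ibverbs-utils first"
    verbs=no
fi
# End-to-end smoke test with rping from librdmacm-utils (run on two nodes):
#   server: rping -s -a <server-ip> -v
#   client: rping -c -a <server-ip> -v -C 4
```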
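On the bridging question, my understanding is that iproute2 has replaced brctl as the current tool. A sketch of what I'd run on the Threadripper (interface names are placeholders; substitute whatever `ip link` actually shows):

```shell
# Sketch: a plain Linux software bridge over the CX3 ports, iproute2 style.
# enp65s0/enp66s0 are made-up names for two of the NIC ports.
ip link add name br0 type bridge
ip link set enp65s0 master br0
ip link set enp66s0 master br0
ip link set br0 up
bridge link show br0    # lists the enslaved ports
# Note: the bridge forwards frames with the CPU, so this gives plain
# Ethernet switching; whether RoCE traffic can usefully traverse it with
# any offload is exactly what I'm asking above.
```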