Is RoCE v2 worth it?


bryan_v

Active Member
Nov 5, 2021
135
64
28
Toronto, Ontario
www.linkedin.com
Hey everyone,

I'm trying to decide between the Mellanox MCX354A-FCBT and MCX354A-FCCT, the second of which supports RoCE v2 and commands a higher price.

Has anyone used RoCE v2 and noticed an appreciable improvement either in Quality of Life or performance in a homelab setting?

Cheers,
Bryan
 

anthros

New Member
Dec 16, 2021
10
4
3
Portland, OR USA
I'm 3-5 weeks away from implementing a four-node RoCE cluster using Mellanox X5 cards' host-chaining functionality, so I'm curious about this too.

For the code I run, latency is the principal concern, and in theory RoCE should be great for that. But hard benchmark numbers are thin on the ground right now.
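In the meantime I'll probably end up generating my own numbers: ib_send_lat/ib_write_lat from the perftest package for the RDMA side, plus something like this quick TCP ping-pong as a kernel-stack baseline to put next to them. Just a rough sketch; the port, message size, and iteration count are arbitrary placeholders.

```python
# Rough kernel-stack latency baseline to compare against perftest (ib_write_lat) numbers.
# Start "python3 pingpong.py server" on one node, then
# "python3 pingpong.py client <server_ip>" on the other.
import socket
import sys
import time

PORT = 18515      # arbitrary placeholder port
MSG_LEN = 64      # small message so we stay latency-bound, not bandwidth-bound
ITERS = 10000

def recv_exact(conn, n):
    """Read exactly n bytes (TCP may deliver partial reads)."""
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed connection")
        buf += chunk
    return buf

def server():
    with socket.create_server(("", PORT)) as srv:
        conn, _ = srv.accept()
        with conn:
            conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
            for _ in range(ITERS):
                conn.sendall(recv_exact(conn, MSG_LEN))   # echo back

def client(host):
    with socket.create_connection((host, PORT)) as conn:
        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        msg = b"x" * MSG_LEN
        start = time.perf_counter()
        for _ in range(ITERS):
            conn.sendall(msg)
            recv_exact(conn, MSG_LEN)
        elapsed = time.perf_counter() - start
        print(f"avg round trip: {elapsed / ITERS * 1e6:.1f} us")

if __name__ == "__main__":
    if sys.argv[1] == "client":
        client(sys.argv[2])
    else:
        server()
```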
 
  • Like
Reactions: bryan_v

i386

Well-Known Member
Mar 18, 2016
4,221
1,540
113
34
Germany
Has anyone used RoCE v2 and noticed an appreciable improvement either in Quality of Life or performance in a homelab setting?
1. Yes, Windows Server 2022 fileserver + Windows 10/11 clients with CX-4 VPI NICs.
2. No performance differences in my everyday usage compared to CX-3 VPI NICs.

Edit: also no differences in CPU usage between CX-3 and CX-4.
 
  • Like
Reactions: bryan_v

Freebsd1976

Active Member
Feb 23, 2018
387
73
28
I'm 3-5 weeks away from implementing a four-node RoCE cluster using Mellanox X5 cards' host-chaining functionality, so I'm curious about this too.

For the code I run, latency is the principal concern, and in theory RoCE should be great for that. But hard benchmark numbers are thin on the ground right now.
CX5 with host chaining is worth a try.
 
  • Like
Reactions: anthros

anthros

New Member
Dec 16, 2021
10
4
3
Portland, OR USA
CX5 with host chaining is worth a try.
Is it worth a try? I sure hope so. But as far as I can tell, host chaining is entirely undocumented. Posts asking for help seem to be met universally with “oh, it’s not working for you? Call us directly!”

The first time I saw that, I thought, "What great tech support!" But I've seen it often enough that I can't help wondering if NVIDIA is trying to keep information about host chaining on the DL. I could speculate about that, but when every support request is met with "Let's discuss this in private," something seems off.

(And yes, I wrote X5 but obviously meant CX5. Thanks for the correction).

P.S.: We paid for support contracts when we bought our cards, so I fully expect help in making this work. But I’m getting a strange hush-hush vibe on this.
 

bryan_v

Active Member
Nov 5, 2021
135
64
28
Toronto, Ontario
www.linkedin.com
So the way I understand the new Mellanox/Nvidia/Cumulus business model, the CX cards are loss leaders used to sell Mellanox IB/CE switches with Cumulus support licenses.

That might be why host chaining isn't well documented. Host chaining would allow you to create small clusters without expensive switches, effectively breaking the model. Unless you're a hyperscaler, I'd expect most RoCE use cases to be a collection of smaller clusters rather than one giant disaggregated environment.
 
  • Like
Reactions: anthros

anthros

New Member
Dec 16, 2021
10
4
3
Portland, OR USA
Amazing @anthros, I'll keep an eye out for your feedback.

Out of curiosity, I assume those are dual-port 100G CX5 cards and you're going to try NVMeoF with RoCE? Or is your DMA even lower level than that?
None of those assumptions are right, but they're pretty close! We're using dual-port 25GbE CX5s to build a four-node cluster for running finite element simulations using ANSYS. ANSYS solving speeds are mostly sensitive to latency, not bandwidth, so 25Gb ROCE is what we want. We're starting off with four nodes at 18 cores each. We've been acquiring 25GbE CX4 cards for our other workstations in the hope of eventually connecting all of them (including the cluster nodes) to a ROCE switch so we can solve using the maximum number of cores our license allows: 132 of them.

Unfortunately, PFC/ROCE switches seem to start around $3K and sound like turbine engines, so they fit neither our budget nor our cube-farm-located rack. We'll benchmark our simulations with (a) the four nodes host-chained (with ROCE) and (b) connected via an SFP+ 10GbE Mikrotik switch to see how much benefit we get from ROCE with <10 solving nodes. Like you, we're essentially asking whether ROCE is worth it.
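For the benchmarking itself I'm not planning anything fancier than wall-clock timing of the same solve under each fabric configuration, along these lines (the launch command is a placeholder for however we end up invoking the distributed ANSYS solve):

```python
# Crude A/B timing harness: run the same solve N times and report the spread.
# Re-run once per fabric configuration (host-chained RoCE vs. 10GbE switch)
# and compare the medians. The solve command is whatever you pass on the CLI,
# e.g. your mpirun/ANSYS launch line -- a placeholder here.
import statistics
import subprocess
import sys
import time

RUNS = 5
solve_cmd = sys.argv[1:]   # placeholder, e.g. ["mpirun", "-np", "72", ...]

times = []
for i in range(RUNS):
    start = time.perf_counter()
    subprocess.run(solve_cmd, check=True)
    times.append(time.perf_counter() - start)
    print(f"run {i + 1}/{RUNS}: {times[-1]:.1f} s")

print(f"median {statistics.median(times):.1f} s  "
      f"(min {min(times):.1f} s, max {max(times):.1f} s)")
```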

We will almost never actually run on 132 cores. Our license allows that, but we can also run one simulation on 36 cores, a second on 12, and a third on 4 cores, which will be much more common for us. Still, if ROCE buys us a measurable reduction in solve time with four nodes, it'll likely be worth it for us to pony up for a switch. The FS S5800-48MBQ looks like a likely suspect at $2,400 new.

I strongly suspect that ROCE over lossy fabrics would be ideal for us, but I gather that requires the switch to support explicit congestion control. I haven't yet found a switch that supports ECS without also supporting PFC (which would enable lossless ROCE), so the distinction seems moot at this point. Is there a less-expensive ECN-enabled 10GbE or 25GbE switch out there? I'd love to know about it.