Amazing
@anthros, I'll keep an eye out for your feedback.
Out of curiosity, I assume those are dual-port 100G CX5 cards and you're going to try NVMeoF with RoCE? Or is your DMA even lower level than that?
None of those assumptions are right, but they're pretty close! We're using dual-port 25GbE CX5s to build a four-node cluster for running finite element simulations in ANSYS. ANSYS solving speeds are mostly sensitive to latency, not bandwidth, so 25GbE RoCE is what we want. We're starting off with four nodes at 18 cores each. We've been acquiring 25GbE CX4 cards for our other workstations in the hope of eventually connecting all of them (including the cluster nodes) to a RoCE switch so we can solve using the maximum number of cores our license allows: 132 of them.
Unfortunately, PFC/RoCE switches seem to start around $3K and sound like turbine engines, so they fit neither our budget nor our cube-farm-located rack. We'll benchmark our simulations with (a) the four nodes host-chained (with RoCE) and (b) connected via an SFP+ 10GbE MikroTik switch, to see how much benefit we get from RoCE with <10 solving nodes. Like you, we're essentially asking whether RoCE is worth it.
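Before the full ANSYS benchmarks, we'll probably sanity-check the raw small-message latency each path delivers with an MPI ping-pong, since that's the number the distributed solver actually cares about. Here's a minimal sketch using mpi4py (assuming mpi4py and numpy are installed on the nodes and the MPI stack is built to use the RoCE path; the script name and hostnames are just placeholders):

```python
# pingpong.py -- tiny MPI round-trip latency probe (placeholder name).
# Run across two nodes, e.g. with Open MPI:
#   mpirun -np 2 --host node1,node2 python pingpong.py
# Compare the result over the host-chained RoCE link vs. the 10GbE switch path.
from mpi4py import MPI
import numpy as np
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

iters = 10000
buf = np.zeros(8, dtype=np.uint8)  # tiny message: we want latency, not bandwidth

comm.Barrier()
start = time.perf_counter()
for _ in range(iters):
    if rank == 0:
        comm.Send(buf, dest=1, tag=0)    # rank 0 sends, then waits for the echo
        comm.Recv(buf, source=1, tag=0)
    else:
        comm.Recv(buf, source=0, tag=0)  # rank 1 echoes everything back
        comm.Send(buf, dest=0, tag=0)
elapsed = time.perf_counter() - start

if rank == 0:
    # Each iteration is one full round trip, so halve it for one-way latency.
    print(f"one-way latency: {elapsed / iters / 2 * 1e6:.2f} us")
```

If the RoCE number isn't meaningfully lower than the switched 10GbE number at small message sizes, that pretty much answers the question before we even touch the solver.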
We'll almost never actually run on all 132 cores. The license allows that, but it also lets us run one simulation on 36 cores, a second on 12, and a third on 4, which will be much more common for us. Still, if RoCE buys us a measurable reduction in solve time with four nodes, it'll likely be worth it for us to pony up for a switch. The FS S5800-48MBQ looks like a likely suspect at about $2,400 new.
I strongly suspect that RoCE over lossy fabrics would be ideal for us, but I gather that requires the switch to support Explicit Congestion Notification (ECN). I haven't yet found a switch that supports ECN without also supporting PFC (which would enable lossless RoCE), so the distinction seems moot at this point. Is there a less-expensive ECN-enabled 10GbE or 25GbE switch out there? I'd love to know about it.