School me on mlnx infini or EN and GPU direct


larrysb

Active Member
Nov 7, 2018
I've been working with GPU computing, mostly deep learning and some HPC applications, in Linux/Docker. I have two Xeon E5 v3/v4 workstations with Nvidia GPU cards. I had been using Mellanox ConnectX-3 Ethernet at 10gb for sharing data and regular IP-level comms between them.

Nvidia's Docker containers include support for mlx5 or better cards for doing RDMA and so on with the NCCL library. On a lark, I bought a couple of cheap pulled ConnectX-5 25gb Ethernet cards. They work great as Ethernet, direct-connected between the two workstations. I know they support RoCE, RDMA, virtual networking and some other cool things, but those aren't of particular interest here.

But my knowledge of the whole InfiniBand vs. Ethernet split in the Mellanox world is kinda fuzzy. It looks like Nvidia's plan is to connect machines with InfiniBand. The software does see the 25gb EN cards, but when I try MPI jobs across them, it reports "...no connection scheme exists..." for the fabric, and what I can deduce is that the available software is looking for InfiniBand, not Ethernet.

The goal in mind is simple: use the 2x GPUs in each of the 2 workstations together, so I get 4x GPU compute, exploiting the speed of the 25gb Ethernet and RDMA/RoCE.

The ConnectX-5 EN cards do show up in the normal ibv_devices and ibv_devinfo output, and some of the other ibv_* commands from OFED work, while others don't.

I know these are Ethernet-only hardware, which is different from the VPI cards. They show up in the system as 40gb InfiniBand devices, hard-configured for 25gb Ethernet ports (not changeable, of course).
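
For what it's worth, a quick check against the verbs API (rough sketch below, assuming the libibverbs headers from OFED or rdma-core are installed) shows how that looks from software: the EN card enumerates as a verbs ("InfiniBand") device, but the port's link layer reports Ethernet (RoCE), not InfiniBand.

```c
/* Minimal sketch: enumerate verbs devices and report the link layer of
 * port 1, assuming libibverbs (MLNX_OFED or rdma-core) is installed.
 * Build (roughly): gcc check_link.c -libverbs -o check_link
 */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **list = ibv_get_device_list(&num);
    if (!list) { perror("ibv_get_device_list"); return 1; }

    for (int i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(list[i]);
        if (!ctx) continue;

        struct ibv_port_attr port;
        if (ibv_query_port(ctx, 1, &port) == 0) {
            /* An EN-only ConnectX-5 still shows up here, but its link
             * layer is Ethernet (RoCE), not InfiniBand. */
            printf("%s: link layer = %s\n",
                   ibv_get_device_name(list[i]),
                   port.link_layer == IBV_LINK_LAYER_ETHERNET ?
                       "Ethernet (RoCE)" : "InfiniBand");
        }
        ibv_close_device(ctx);
    }
    ibv_free_device_list(list);
    return 0;
}
```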

So, in a nutshell, what's the difference in capability between these ConnectX-5 512F 25gb Ethernet cards and the ConnectX-5 InfiniBand cards?

Is it possible to use NCCL to connect them through GPUDirect RDMA?

Is there a primer somewhere I should read? Networking is not my main domain, so I'm ignorant, but reasonable.
 

Fallen Kell

Member
Mar 10, 2020
InfiniBand will get you 56gb connections between the systems (or faster with ConnectX-5 and later EDR/HDR-capable cards in x16 PCIe 3.0 slots), but with a lower-level networking stack: it basically bypasses many of the more time-consuming parts of the standard TCP stack, though it has some drawbacks as well. The benefit of this lower-level connection is that it works really well for applications that need to send lots of "tiny" pieces of data between systems, such as MPI shared memory/variables for applications running across multiple computers at the same time.
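
To illustrate what I mean by "tiny" data, the textbook test is an MPI ping-pong like the sketch below (any MPI implementation should run it with two ranks). Per-message latency is where kernel bypass pays off, not the headline bandwidth.

```c
/* Sketch of a small-message ping-pong, the kind of traffic where a
 * kernel-bypass fabric (IB or RoCE) shows its latency advantage over
 * the plain TCP stack. Run with exactly 2 ranks.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char buf[8] = {0};                 /* 8-byte payload: "tiny" data */
    const int iters = 10000;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg one-way latency: %.2f us\n",
               (t1 - t0) / iters / 2 * 1e6);

    MPI_Finalize();
    return 0;
}
```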

With only a couple of systems you will probably not see the benefits of InfiniBand over using the cards as 25/40GbE: you can run MPI over Ethernet, and you don't have the quintessential InfiniBand application of a Lustre file system for distributed cluster storage. (The real problem with clustered systems is shared storage, since things like NFS simply cannot scale to hundreds of systems accessing the same files for inputs and outputs; they eventually hit problems with TCP connections and NFS threads.)
 

larrysb

Active Member
Nov 7, 2018
It's mostly for my own education and experimentation, plus the fact that the software is conveniently present in Nvidia's pre-built Docker containers with TensorFlow and a lot of Nvidia stuff, like NCCL. I know that in their data center products ($$$$$$) they offer all kinds of cool things like virtual GPUs and direct-to-GPU RDMA between systems.

Running MPI-distributed GPU sessions over regular networking helps a little, but the results touted for GPUDirect methods show much better scaling. That's why I'm interested. Of course, with just a couple of machines and a handful of GPUs, it's more for proof-of-concept and my education than for any productive purpose.
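
The kind of thing I mean is sketched below: with a CUDA-aware MPI you can hand a device pointer straight to MPI_Send/MPI_Recv, and with GPUDirect RDMA underneath, the NIC is supposed to move GPU memory without a host staging copy. (The MPIX_* names are Open MPI's extension API; I'm assuming Open MPI here, and other MPIs differ.)

```c
/* Hedged sketch: with a CUDA-aware MPI (e.g. Open MPI + UCX), a device
 * pointer can be handed straight to MPI; with GPUDirect RDMA underneath,
 * the NIC can DMA it without a host staging copy. The MPIX_* names below
 * are the Open MPI extension API and may not exist elsewhere.
 */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#if defined(OPEN_MPI)
#include <mpi-ext.h>   /* MPIX_Query_cuda_support(), Open MPI specific */
#endif

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int cuda_aware = 0;
#if defined(MPIX_CUDA_AWARE_SUPPORT)
    cuda_aware = MPIX_Query_cuda_support();   /* run-time check */
#endif

    float *dbuf;                              /* buffer lives on the GPU */
    cudaMalloc(&dbuf, 1 << 20);

    if (cuda_aware) {
        /* Device pointer goes straight into MPI; no cudaMemcpy to host. */
        if (rank == 0)
            MPI_Send(dbuf, (1 << 20) / 4, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(dbuf, (1 << 20) / 4, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    } else if (rank == 0) {
        printf("this MPI build is not CUDA-aware; a host staging copy would be needed\n");
    }

    cudaFree(dbuf);
    MPI_Finalize();
    return 0;
}
```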

What I'm derailed on a little bit is that all the marketing docs tout that you can do this with RoCE, which these 512F EN cards I got do indeed support; I've run the tests to prove that. But it looks like Nvidia's software (NCCL), so far, only looks for InfiniBand fabrics: in the container it connects to the /sys/infiniband device and says "I don't like that". Of course, the nvidia-smi tool does find the mlx5 cards and shows the GPU/CPU topology and everything.
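
For reference, the pattern I've been trying to get working is the usual MPI-bootstrapped NCCL init, roughly like the sketch below. The environment variables in the comments are the knobs the NCCL docs list for picking a verbs/RoCE device; the specific values are guesses for my setup, not something I've verified.

```c
/* Sketch of the usual multi-node NCCL bootstrap: MPI broadcasts the NCCL
 * unique id, then each rank joins the communicator. The env vars in the
 * comments are documented NCCL knobs; the values shown are guesses.
 *
 *   NCCL_DEBUG=INFO          # log which transport (IB/RoCE vs socket) is picked
 *   NCCL_IB_HCA=mlx5_0       # verbs device to use (the ConnectX-5 EN shows up here)
 *   NCCL_IB_GID_INDEX=3      # RoCE v2 GID index, often 3 -- check the GID table
 *   NCCL_SOCKET_IFNAME=...   # interface for the bootstrap / TCP fallback
 */
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    cudaSetDevice(rank % 2);            /* assumes two ranks (GPUs) per workstation */

    ncclUniqueId id;
    if (rank == 0) ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

    ncclComm_t comm;
    ncclCommInitRank(&comm, nranks, id, rank);

    /* one float per rank, all-reduced across both machines */
    float *buf;
    cudaMalloc(&buf, sizeof(float));
    ncclAllReduce(buf, buf, 1, ncclFloat, ncclSum, comm, 0 /* default stream */);
    cudaStreamSynchronize(0);

    ncclCommDestroy(comm);
    cudaFree(buf);
    MPI_Finalize();
    return 0;
}
```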

I'm just sort of guessing that the 512F Ethernet card exposes some of the InfiniBand capability, but not everything that a VPI card with InfiniBand cables would.
 

i386

Well-Known Member
Mar 18, 2016
Germany
Ethernet and InfiniBand are different technologies. Your NIC is an Ethernet NIC, and there is no firmware for it that runs InfiniBand (unlike the bigger NICs with QSFP28 ports, which could possibly be crossflashed to VPI firmware).

larrysb said:
I'm just sort of guessing that the 512F Ethernet card exposes some of the InfiniBand capability, but not everything that a VPI card with InfiniBand cables would.

Your NIC exposes RoCE capabilities, which is another implementation of RDMA. More about RDMA: Remote direct memory access - Wikipedia
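
Rough sketch of what I mean: the verbs calls an RDMA application makes (here just opening a device, allocating a protection domain and registering a buffer) are the same whether the link layer underneath is InfiniBand or Ethernet/RoCE; only the addressing and fabric management differ.

```c
/* Rough sketch: the core RDMA primitive -- registering memory so the NIC
 * can DMA it -- is the same verbs call over RoCE and over InfiniBand.
 * Assumes libibverbs and at least one RDMA-capable device.
 */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    struct ibv_device **list = ibv_get_device_list(NULL);
    if (!list || !list[0]) { fprintf(stderr, "no RDMA devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(list[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);          /* protection domain */

    size_t len = 1 << 20;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,    /* pin + register    */
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("ibv_reg_mr"); return 1; }
    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n",
           len, mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(list);
    return 0;
}
```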
 

larrysb

Active Member
Nov 7, 2018
Yeah, I can run the RDMA (RoCE) test programs between machines and it definitely is working.

I have the Dell OEM'd 512F 25gb Ethernet.


I'm most interested in the "Mellanox PeerDirect RDMA aka GPUDirect" feature.
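
As I understand it, that feature means the verbs stack can register GPU memory directly, so the NIC DMAs to and from the card without bouncing through host RAM. A rough sketch of the idea is below; it assumes the nvidia-peermem (formerly nv_peer_mem) kernel module is loaded, and I haven't verified it on the 512F.

```c
/* Rough sketch of the GPUDirect RDMA (PeerDirect) idea: register a
 * cudaMalloc'd buffer with the verbs stack so the NIC can DMA GPU memory
 * directly. Requires the nvidia-peermem (formerly nv_peer_mem) kernel
 * module; whether this works on the 512F is an untested assumption.
 */
#include <stdio.h>
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

int main(void)
{
    struct ibv_device **list = ibv_get_device_list(NULL);
    if (!list || !list[0]) { fprintf(stderr, "no RDMA devices\n"); return 1; }
    struct ibv_context *ctx = ibv_open_device(list[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    void *gpu_buf;
    cudaMalloc(&gpu_buf, 1 << 20);                 /* buffer on the GPU */

    /* With the peer-memory module loaded, ibv_reg_mr accepts the device
     * pointer and pins the GPU pages for the NIC; without it, this fails. */
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, 1 << 20,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr)
        perror("ibv_reg_mr on GPU memory (GPUDirect RDMA unavailable?)");
    else
        printf("GPU buffer registered, rkey=0x%x\n", mr->rkey);

    if (mr) ibv_dereg_mr(mr);
    cudaFree(gpu_buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(list);
    return 0;
}
```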

At least from the docs I've dug into, the VPI cards have ports that can be configured as either InfiniBand or Ethernet with a software tool. As near as I can tell, my 512F Ethernet card acts, on the software side, like a VPI card that's been locked into Ethernet mode.

But I'm still fumbling around in the OFED software and figuring out how the rest of the stack works.