I've been working with GPU computing, mostly deep learning and some HPC applications under Linux/Docker. I have two Xeon E5 v3/v4 workstations with NVIDIA GPU cards. I had been using Mellanox ConnectX-3 Ethernet at 10 GbE to share data and handle regular IP-level comms between them.
There is support in NVIDIA's Docker containers for mlx5-class (ConnectX-4 or newer) cards to do RDMA and so on with the NCCL library. On a lark, I bought a couple of cheap pulled ConnectX-5 25 GbE cards. They work great as Ethernet, direct-connected between the two workstations. I know they support RoCE, RDMA, virtual networking, and some other cool things, though most of that isn't of particular interest.
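For reference, this is roughly how I've been launching the containers; the image tag is just a placeholder and the device path is what worked on my boxes, so treat both as assumptions:

```
# Hand the RDMA device nodes to the container and allow pinned memory.
# Image tag is a placeholder -- any recent CUDA/NCCL image should behave the same.
docker run --rm --gpus all \
    --device=/dev/infiniband \
    --cap-add=IPC_LOCK \
    --network=host \
    nvcr.io/nvidia/pytorch:24.01-py3 \
    bash
```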
But my knowledge of the whole InfiniBand-vs-Ethernet split in the Mellanox world is kinda fuzzy. It looks like the NVIDIA plan is for connecting machines with InfiniBand: the software sees the 25 GbE cards, but when I try MPI jobs on them, it fails with "...no connection scheme exists..." in the fabric, and what I can deduce is that the available software is looking for InfiniBand, not Ethernet.
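For what it's worth, this is the kind of invocation I've been trying; the device name mlx5_0:1, host names, and binary are assumptions for my setup — substitute whatever ibv_devices and your own hosts file report:

```
# Force Open MPI onto the UCX PML and point UCX at the RoCE port explicitly.
# mlx5_0:1 is a guess for my boxes -- check ibv_devices for the real name.
mpirun -np 2 -H host1,host2 \
    --mca pml ucx \
    -x UCX_NET_DEVICES=mlx5_0:1 \
    -x UCX_TLS=rc,cuda_copy,cuda_ipc \
    ./my_mpi_app
```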
The goal in mind is simple: use the 2x GPUs in each of the two workstations together, so I get 4x GPU compute, exploiting the speed of the 25 GbE link and RDMA/RoCE.
The ConnectX-5 EN cards do show up in the normal ibv_devices and ibv_devinfo calls, and some of the other ibv_* commands from OFED work, while others don't.
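For example, ibv_devinfo on these cards prints something like the trimmed output below. As I understand it, the transport line always reads InfiniBand because that's the verbs transport layer; it's the link_layer line that tells you the port is actually Ethernet/RoCE:

```
$ ibv_devinfo -d mlx5_0
hca_id: mlx5_0
        transport:              InfiniBand (0)
        ...
        port:   1
                state:          PORT_ACTIVE (4)
                ...
                link_layer:     Ethernet
```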
I know these are Ethernet-only cards, which is different hardware than the VPI cards. Yet they show up in the system as 40 Gb InfiniBand devices, hard-configured as 25 GbE Ethernet ports (not changeable, of course).
So, in a nutshell: what's the difference in capability between these ConnectX-5 512F 25 GbE Ethernet cards and the ConnectX-5 InfiniBand cards?
Is it possible to use NCCL to connect the cards through GPUDirect RDMA?
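In case it helps frame the question, this is roughly the NCCL environment I'd expect to use on the RoCE link; the GID index and interface name are guesses for my direct-connect setup:

```
# NCCL settings for a RoCE link (values are assumptions for back-to-back hosts):
export NCCL_IB_DISABLE=0         # don't disable the IB/RoCE transport
export NCCL_IB_HCA=mlx5          # match the mlx5 devices
export NCCL_IB_GID_INDEX=3       # RoCE v2 GID index -- verify with show_gids
export NCCL_SOCKET_IFNAME=ens1   # interface name is a placeholder
export NCCL_DEBUG=INFO           # log which transport NCCL actually picks
```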
Is there a primer somewhere I need to read? Networking is not my main domain, so I'm ignorant, but reasonable.