InfiniBand Network for a Small GPU Cluster


L.P.

New Member
Feb 17, 2019
Hi everyone,

I am currently in the process of planning an InfiniBand network for our new GPU cluster. We want to be able to run simulation and AI workloads across the big GPU nodes and are therefore looking to connect everything with a 100 Gb/s InfiniBand network. At first, I was thinking of the following configuration:
  • Switch: Mellanox SB7800, 36-port 100 Gb/s EDR InfiniBand
  • NICs: Mellanox ConnectX-4 VPI
  • Cabling: NVIDIA MCP1600-E002E30
This would lead to the following configuration for the individual node types (a quick port-count sketch follows the list):
  • 2 fat GPU nodes -> 8 links, 4 NICs (4 links and 2 NICs per server with 8x A6000, i.e. 1 NIC per block of 4 A6000)
  • 1 storage block -> 4 links, 2 NICs
  • 1 thin GPU node -> 2 links, 1 NIC (2 links, 1 NIC for a block of 4 A6000)
  • 1 small GPU node -> 2 links, 1 NIC (2 links, 1 NIC covering 2 blocks of 4 RTX 2080 Ti each)
  • 1 head node -> 1 link, 1 NIC
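
To double-check the port budget, I put the link counts from the list above into a small Python sketch (the 36 is just the SB7800's port count; the per-node figures are my own plan, so treat it as a back-of-the-envelope check rather than anything authoritative):

```python
# Back-of-the-envelope port budget for a single 36-port SB7800,
# using the per-node EDR link counts from the list above.
EDR_GBPS = 100  # nominal bandwidth of one EDR link

links_per_node = {
    "fat GPU node 1 (8x A6000)": 4,      # 2 dual-port NICs
    "fat GPU node 2 (8x A6000)": 4,
    "storage block": 4,
    "thin GPU node (4x A6000)": 2,
    "small GPU node (8x RTX 2080 Ti)": 2,
    "head node": 1,
}

total_links = sum(links_per_node.values())
print(f"switch ports used: {total_links} of 36")
print(f"ports left for growth: {36 - total_links}")

for name, links in links_per_node.items():
    print(f"{name}: {links * EDR_GBPS} Gb/s aggregate into the fabric")
```

That comes out to 17 of 36 ports used, so a single switch would leave plenty of room to grow.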

As I have never planned an InfiniBand network before, I am unsure whether there are errors in this configuration, or whether I am over-configuring individual parts, so I would be incredibly thankful for any thoughts, pointers, and suggestions.

Best, -L
 

necr

Active Member
Dec 27, 2017
1. If you can, get a ConnectX-5 instead: it has a higher message rate (it can push more packets per second). And if you ever have to migrate to Ethernet, or want to run one Ethernet and one InfiniBand physical subnet, you'll have nice offloads in Ethernet mode.
2. Make sure your servers have enough PCIe lanes and are at least PCIe Gen 4 (rough numbers in the sketch below).
3. If you only need the head node to run OpenSM (the subnet manager), you can probably do without it - it's possible to run OpenSM on the switch or on any other node instead.
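
To put rough numbers on point 2, here's a quick back-of-the-envelope sketch (assumes x16 slots and ignores everything except the 128b/130b line encoding, so real-world numbers will be a bit lower):

```python
# Rough PCIe x16 slot bandwidth vs. EDR port bandwidth.
def pcie_x16_gbps(gen: int) -> float:
    rate_gtps = {3: 8.0, 4: 16.0}[gen]  # per-lane transfer rate in GT/s
    encoding = 128 / 130                # 128b/130b line encoding (Gen3/Gen4)
    return rate_gtps * encoding * 16    # 16 lanes -> usable Gb/s

EDR_PORT_GBPS = 100  # nominal bandwidth of one EDR port

for gen in (3, 4):
    slot = pcie_x16_gbps(gen)
    ports = int(slot // EDR_PORT_GBPS)
    print(f"PCIe Gen{gen} x16: ~{slot:.0f} Gb/s -> {ports} EDR port(s) at line rate")

# Gen3 x16 (~126 Gb/s) can feed one EDR port; a dual-port card gets slot-limited.
# Gen4 x16 (~252 Gb/s) has headroom for both ports of a dual-port NIC.
```

So with the 2-links-per-NIC layout above, Gen 3 slots would cap each card at roughly one port's worth of bandwidth, while Gen 4 x16 covers both ports.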