Hi everyone,
I am currently planning an InfiniBand network for our new GPU cluster. We want to run simulation and AI workloads across the big GPU nodes and are hence looking to connect everything with a 100 Gb/s InfiniBand network. At first, I was thinking of the following configuration:
- Switch: Mellanox SB7800, 36-port 100 Gb/s EDR InfiniBand
- NICs: Mellanox ConnectX-4 VPI
- Cabling: NVIDIA MCP1600-E002E30

This would lead to the following configurations for the individual node types:
- 2 fat GPU Nodes -> 8 Links, 4 NICs (4 Links, 2 NICs for each server with 8 A6000s, i.e. 1 NIC per block of 4 A6000s)
- 1 Storage Block -> 4 Links, 2 NICs
- 1 thin GPU Node -> 2 Links, 1 NIC (2 Links, 1 NIC for a block of 4 A6000s)
- 1 small GPU Node -> 2 Links, 1 NIC (2 Links, 1 NIC for 2 blocks of 4 RTX 2080 Ti each)
- 1 Head Node -> 1 Link, 1 NIC
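To sanity-check the port and bandwidth budget, here is my quick back-of-the-envelope tally (a minimal sketch in Python; the node names and per-node link counts simply restate the plan above, and 100 Gb/s is the nominal EDR rate per link):

```python
# Rough port/bandwidth budget for the planned fabric.
# Assumes a single SB7800 (36 EDR ports) and the nominal 100 Gb/s per link.

SWITCH_PORTS = 36
LINK_GBPS = 100  # nominal EDR rate per link

# node type -> (node count, links per node), taken from the plan above
nodes = {
    "fat GPU node (8x A6000)": (2, 4),
    "storage block": (1, 4),
    "thin GPU node (4x A6000)": (1, 2),
    "small GPU node (8x 2080 Ti)": (1, 2),
    "head node": (1, 1),
}

total_links = sum(count * links for count, links in nodes.values())
print(f"links used: {total_links} / {SWITCH_PORTS} switch ports "
      f"({SWITCH_PORTS - total_links} ports free for growth)")

for name, (count, links) in nodes.items():
    print(f"{name}: {count} x {links} links = {count * links * LINK_GBPS} Gb/s aggregate")
```

That comes out to 17 of the 36 ports used, so a single SB7800 should leave comfortable headroom.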
As I have never planned an InfiniBand network before, I am unsure whether this configuration has errors or over-provisions individual parts, and I would hence be incredibly thankful for any thoughts, pointers, and suggestions.
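For reference, once everything is cabled I was planning to verify the fabric with the standard OFED diagnostics (a rough sketch, assuming infiniband-diags from MLNX_OFED is installed and a subnet manager such as opensm is running on the head node; not a full acceptance test):

```python
import subprocess

# Assumed environment: infiniband-diags installed, subnet manager running.
checks = [
    ["ibstat"],      # local HCA state and link rate (should report 100 Gb/s EDR)
    ["ibswitches"],  # should list the single SB7800
    ["ibhosts"],     # should list every HCA in the fabric
]

for cmd in checks:
    print(f"--- {' '.join(cmd)} ---")
    subprocess.run(cmd, check=True)
```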
Best, -L