Hello, all. I have been given 3 servers to start learning how to stand up a GPU cluster. I have a pretty solid background in intra-node architecture, but very little understanding of the networking that's needed for this, so please forgive my ignorance in this area : )
I plan to use one server as the login node and will not populate it with any GPUs due to power considerations. I'll use the other two servers as the compute nodes, which will be populated with GPUs. Each server has a Mellanox Technologies MT28908 Family [ConnectX-6] HCA/NIC, and I plan to use Ethernet rather than InfiniBand due to cost. My budget for the network is, say, $500-$1000, so I don't want anything fancy; just something to learn how to configure a cluster.
As I understand it, I have two options here:
1) Swap the Mellanox NICs out for ordinary Ethernet NICs, buy an 8- or 10-port Ethernet switch, and buy Ethernet cables
2) Keep the existing Mellanox NICs, buy an 8- to 10-port switch with SFP+/SFP28/QSFP ports that the ConnectX-6 cards can connect to, and buy the matching DAC cables or transceivers (I've sketched out the port-mode change I think this option also needs right after this list)
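In case it matters for option 2: my understanding is that the ConnectX-6 is a VPI card whose ports can be switched between InfiniBand and Ethernet with the NVIDIA/Mellanox firmware tools (MFT), and this is roughly what I was planning to run. The device path (mt4123_pciconf0) is just what I'd expect to see for these cards and may well differ on my systems, so please tell me if I've got this wrong:

# Start the Mellanox Software Tools service and list the devices it finds
sudo mst start
sudo mst status

# Check the current port protocol (LINK_TYPE: 1 = InfiniBand, 2 = Ethernet);
# the device path below is my guess for a ConnectX-6
sudo mlxconfig -d /dev/mst/mt4123_pciconf0 query | grep LINK_TYPE

# Switch the ports to Ethernet (drop LINK_TYPE_P2 if the card is single-port),
# then reboot for the change to take effect
sudo mlxconfig -d /dev/mst/mt4123_pciconf0 set LINK_TYPE_P1=2 LINK_TYPE_P2=2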
ADDITIONAL CONSIDERATIONS:
- The network must also support direct communication between GPUs on different compute nodes (i.e., RDMA over Converged Ethernet, RoCE); I've put the sanity check I had in mind for this after the list
- I'll need a management network that uses the BMCs on the nodes, so the switch should be managed. I don't believe I need any additional hardware beyond extra cables, but please correct me if I'm wrong (there's also a sketch after the list of how I planned to reach the BMCs)
- I'm honestly not sure whether I should go with 1, 10, or 100 GbE for this :shame:
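On the RoCE point, the sanity check I had in mind once two compute nodes are cabled up (assuming the perftest package from MLNX_OFED is installed, and that the port shows up as mlx5_0 with an IP like 192.168.10.11 — both of those are placeholders) is something like:

# On compute node 1 (server side of the test)
ib_write_bw -d mlx5_0 -R --report_gbits

# On compute node 2, pointing at node 1's address on the Mellanox port
# (-R sets up the connection with rdma_cm, which is what I understand RoCE wants)
ib_write_bw -d mlx5_0 -R --report_gbits 192.168.10.11

If that works, my understanding is that NCCL can then use the same path for the actual GPU-to-GPU traffic (with NCCL_IB_HCA / NCCL_IB_GID_INDEX as hints if needed), but please correct me if that's not the right way to verify RoCE.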
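And on the BMC side, this is roughly how I expected to reach the nodes from the login node once the BMC ports are cabled to the switch (assuming ipmitool; the IP and credentials below are placeholders):

# Query a compute node's BMC over the network (lanplus = IPMI v2.0)
ipmitool -I lanplus -H 192.168.20.11 -U admin -P changeme chassis status
ipmitool -I lanplus -H 192.168.20.11 -U admin -P changeme sensor list

# Locally on a node, check how its own BMC's network interface is configured
sudo ipmitool lan print 1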