1/10G Ethernet Network for 3-Node GPU Cluster


super-luminous (New Member)

Hello, all. I have been given 3 servers to start learning how to stand up a GPU cluster. I have a pretty solid background in intra-node architecture, but very little understanding of the networking needed for this, so please forgive my ignorance in this area : )

I plan to use one server as the login node and will not populate it with any GPUs due to power considerations. I'll use the other two servers as compute nodes, which will be populated with GPUs. Each server has a Mellanox Technologies MT28908 Family [ConnectX-6] HCA/NIC, and I plan to use Ethernet rather than InfiniBand due to cost. My budget for the network is, say, $500-$1000, so I don't want anything fancy; just something to learn how to configure a cluster.
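From what I've read, the ConnectX-6 is a VPI card whose ports can be flipped between InfiniBand and Ethernet mode in firmware, so my assumption is that on each node I'd do something roughly like this with the NVIDIA/Mellanox firmware tools (the device path below is just an example; the real one comes from mst status):

  # rough sketch, assuming the MFT package is installed
  mst start
  mst status                                  # find the device, e.g. /dev/mst/mt4123_pciconf0
  mlxconfig -d /dev/mst/mt4123_pciconf0 query | grep LINK_TYPE
  mlxconfig -d /dev/mst/mt4123_pciconf0 set LINK_TYPE_P1=2 LINK_TYPE_P2=2   # 2 = Ethernet
  # reboot for the new port type to take effect

Please correct me if that's not how people normally do it.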

As I understand it, I have two options here:

1) Buy Ethernet NICs to swap out for the Mellanox NICs, buy an 8- or 10-port Ethernet switch, and buy Ethernet cables

2) Use the existing Mellanox NICs, buy a special 8- to 10-port Ethernet switch that can be used with the Mellanox NICs, and buy special cables that can be used with both the Mellanox NICs and that switch

ADDITIONAL CONSIDERATIONS:
  • The network must also support direct communication between GPUs on different compute nodes (i.e., RoCE); see the rough verification sketch after this list
  • I'll need a management network that uses the BMCs on the nodes, so the switch should be managed. I don't believe I need any additional hardware beyond extra cables, but please correct me if I'm wrong
  • I'm honestly not sure if I should have 1, 10, or 100G for this :shame:
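For the RoCE point, my rough plan for proving RDMA actually works between the compute nodes over Ethernet is something like the following, assuming the OFED stack and the perftest package are installed (device name and address are placeholders):

  ibv_devinfo                     # confirm the mlx5 devices show up
  show_gids                       # confirm RoCE GIDs exist on the Ethernet ports
  # on compute node 1 (server side)
  ib_write_bw -d mlx5_0 -R
  # on compute node 2, pointing at node 1's data-network IP
  ib_write_bw -d mlx5_0 -R 192.168.10.11

I've also read that RoCE likes PFC/ECN configured on the switch ports for lossless behavior, which sounds like another reason to get a managed switch, but I'd welcome corrections there too.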
Can this excellent community help me spec this network? Any suggestions you can offer would be greatly appreciated. Thank you so much!
 

MountainBofh (Beating my users into submission)

super-luminous (New Member)

Thank you for your reply. Would you also be willing to help me understand how to connect the nodes to the switch? I would like to have a primary network and a management network using a single switch if possible. So do I just have 2 cables from each node to the switch? Or do I also need a cable between the 2 compute nodes? Also, if I plan to put a couple extra SSDs in the login node to serve as a /home partition, should I have additional cables for this? Sorry for all the questions. I just want to make sure I fully understand : )
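For context on the /home piece, my rough idea is just to export it from the login node over NFS to the two compute nodes, riding on the same primary network, something like this (subnet and addresses are made up):

  # /etc/exports on the login node
  /home    192.168.10.0/24(rw,sync,no_subtree_check)

  # /etc/fstab entry on each compute node (login node at 192.168.10.10 in this example)
  192.168.10.10:/home    /home    nfs    defaults,_netdev    0 0

After editing /etc/exports I'd run exportfs -ra on the login node. So my guess is the /home share doesn't need any extra cables of its own, but please tell me if I'm wrong about that.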