Hi All,
I recently built several computers for scientific computing and would now like to network them so parallel calculations can be run with MPI. My problem is that my current network setup is very slow (see the definition of slow below) when running computations across two machines. I am currently not sure whether this is a software, network-configuration, and/or network-hardware problem, and I would like your guidance on the best (low-latency, high-bandwidth) switch, NICs, and network settings for a 4-8 node cluster so I can rule this aspect out. Here are my specs:
Each node has a Supermicro H11DSi motherboard with two 16-core EPYC 7F52 processors, 256 GB of DDR4-3200 ECC RAM, 2 TB M.2 drives, a 1 TB SSD running Pop!_OS, and two 1 GbE Ethernet ports.
Nodes are linked together with Cat 6 cables plugged into one of the two 1 GbE ports on each motherboard and a Netgear 1 GbE switch.
One node acts as a gateway and shares its internet connection with the other nodes on the network.
I am using GlusterFS to manage a distributed network file system and run the MPI applications from a Gluster volume directory shared across the network. MPI has been tested and can send and reduce data across multiple nodes. Here are links from someone using a similar setup that describe the approach I am following:
Slowness description: when I run the MPI pi test code on a single master or slave node I get approximately linear speedup with core count. However, if I run an equivalent number of cores distributed between two machines, a 30-60 second overhead is added to the calculation (a sketch of the test follows the numbers below):
- Run code on 1 core of the master node: 120 s
- Run code on 2 cores of the master node: 63 s
- Run code on 4 cores of the master node: 35 s
- Run code on 2 master cores and 2 slave cores: 121 s
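
For reference, here is a minimal sketch of the shape of the pi test I am timing (an illustrative reconstruction, not my exact code; host names are placeholders). Each rank integrates a strided share of the midpoint rule and a single MPI_Reduce combines the partial sums:

```c
/* pi_mpi.c -- illustrative midpoint-rule pi benchmark (sketch, not my exact code).
 * Build: mpicc -O2 pi_mpi.c -o pi_mpi
 * Run:   mpiexec -n 4 -hosts node1,node2 ./pi_mpi   (placeholder host names)
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const long long n = 2000000000LL;   /* rectangle count; adjust to taste */
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double t0 = MPI_Wtime();
    double h = 1.0 / (double)n, sum = 0.0;
    for (long long i = rank; i < n; i += size) {   /* strided share per rank */
        double x = h * ((double)i + 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    double local = h * sum, pi = 0.0;
    MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("pi ~= %.12f  wall = %.2f s on %d ranks\n",
               pi, MPI_Wtime() - t0, size);
    MPI_Finalize();
    return 0;
}
```

Note that the only communication in a test like this is one 8-byte reduction plus job startup, which is why a fixed 30-60 second penalty is so puzzling.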
So in terms of possible causes, here is where I am:
- My switch is too slow. What should I buy for a small cluster?
- I need to ditch the 1 GbE port and install a better NIC on the motherboard (if yes, which one would be best?). The ping-pong sketch after this list measures what the current link actually delivers.
- Something is wrong with the network settings, possibly the two Ethernet ports on the motherboard confusing MPI (see the interface-pinning commands after this list).
- My MPICH installation is suboptimal or not configured properly.
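
On point 3: both MPICH's Hydra launcher and Open MPI can be told explicitly which interface to use, which should quickly rule dual-port confusion in or out. With MPICH I can pass `-iface` to `mpiexec`, e.g. `mpiexec -iface enp4s0f0 -n 8 -hosts node1,node2 ./pi_mpi`; the Open MPI equivalent is `mpirun --mca btl_tcp_if_include enp4s0f0 ...`. The interface and host names here are placeholders; the real ones come from `ip addr` on each node. Downing the unused second port is another quick test.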
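
On points 1 and 2, before buying a faster switch or NICs it may be worth measuring what the current link actually delivers. Here is a minimal ping-pong sketch (assuming MPICH's mpicc/mpiexec; host names are placeholders) that prints round-trip latency and effective bandwidth when run with one rank on each machine:

```c
/* pingpong.c -- rough round-trip latency/bandwidth probe between two ranks.
 * Build: mpicc -O2 pingpong.c -o pingpong
 * Run:   mpiexec -n 2 -hosts node1,node2 ./pingpong   (placeholder host names)
 */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    enum { REPS = 1000, MAXB = 1 << 20 };   /* up to 1 MB messages */
    static char buf[MAXB];
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    memset(buf, 0, sizeof buf);

    for (int bytes = 8; bytes <= MAXB; bytes *= 16) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < REPS; i++) {
            if (rank == 0) {            /* send, then wait for the echo */
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else {                    /* echo everything straight back */
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double dt = MPI_Wtime() - t0;
        if (rank == 0)
            printf("%8d bytes: %8.1f us round trip, %7.1f MB/s\n",
                   bytes, 1e6 * dt / REPS, 2.0 * bytes * REPS / dt / 1e6);
    }
    MPI_Finalize();
    return 0;
}
```

On healthy 1 GbE I would expect very roughly 0.05-0.2 ms round trips for small messages and on the order of 110 MB/s for large ones; numbers far off that would point at the switch, cabling, or NIC settings, while good numbers would push suspicion back onto the MPI/Gluster configuration.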