Networking Hardware for Parallel Computing on a Small Cluster


Sean Ho
seanho.com
erock said:
The pi test is just a simple “can I get the basics of MPI to work” test. The real use case involves solving large systems of equations using a software package called PETSc. Solving the system of equations typically takes ~80% of the CPU time in serial, and these calculations scale very well on shared-memory machines and optimized distributed-memory clusters. So the algorithm involves the master node spawning an MPI calculation over multiple nodes; each node builds its local chunk of the system (typically ~20-150G of RAM per node) and the system is solved with inter-node communication. When this is done, a 1-5G solution file is saved to disk and read by the main program running on the master node. I am using a GlusterFS volume as a network file system from which the MPI code is executed. I am open to alternatives for spawning parallel MPI calculations and am only using GlusterFS because the examples above suggested it as a better alternative to NFS.

I do not have huge storage needs since this cluster is really about performing calculations that involve large amounts of RAM as quickly as possible. After calculations are completed I copy the results over the network (~100-300G) to my Threadripper workstation for visualization and post-processing. This cluster is used by a very small team, so there is no need to worry about managing many users, etc.
This is very helpful. So it sounds like the worker nodes do not really need local storage; it's only the master node that periodically outputs a 5GB solution file. In which case gluster is not really needed (as opposed to NFS or just local storage on the master node), and also that network bandwidth might not be a critical issue (although perhaps latency, depending on the frequency and nature of inter-node MPI communication). Algebraic solvers tend to be very very CPU-intensive, secondarily RAM, then network and lastly storage.

Have you considered offloading BLAS operations to GPUs?
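(For anyone following along: the "pi test" mentioned above is typically just the classic MPI pi estimator. A generic sketch, not erock's actual code, looks something like this.)

Code:
/* Classic MPI pi test: each rank integrates part of 4/(1+x^2) on [0,1].
 * Build:  mpicc pi.c -o pi
 * Run:    mpirun --hostfile hosts.txt -np 32 ./pi
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const long n = 100000000;                 /* number of intervals */
    int rank, size;
    double h, local_sum = 0.0, pi = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    h = 1.0 / (double)n;
    for (long i = rank; i < n; i += size) {   /* strided split of the work */
        double x = h * ((double)i + 0.5);
        local_sum += 4.0 / (1.0 + x * x);
    }
    local_sum *= h;

    /* a single reduction is the only inter-node communication */
    MPI_Reduce(&local_sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("pi ~= %.12f\n", pi);

    MPI_Finalize();
    return 0;
}

The only communication is one MPI_Reduce at the end, which is why a test like this says little about network latency for a real solver.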
 

erock
Sean Ho said:
This is very helpful. So it sounds like the worker nodes do not really need local storage; it's only the master node that periodically outputs a 5GB solution file. In which case gluster is not really needed (as opposed to NFS or just local storage on the master node), and also that network bandwidth might not be a critical issue (although perhaps latency, depending on the frequency and nature of inter-node MPI communication). Algebraic solvers tend to be very very CPU-intensive, secondarily RAM, then network and lastly storage.

Have you considered offloading BLAS operations to GPUs?
I am still exploring what makes the most sense in terms of something like GlusterFS, NFS or some other alternative. I just need all machines to have access to the same code and to run MPI from the shared directory. It seems from the comments on this forum that GlusterFS is a suboptimal choice and may create bottlenecks since I have so few nodes. I can switch to NFS and would like to hear alternative ways to execute MPI code for a small cluster.
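For concreteness, the kind of minimal NFS setup I have in mind looks roughly like this (a sketch only; hostnames, paths and the subnet are placeholders):

Code:
# master node: export the shared code/working directory
sudo apt install nfs-kernel-server
echo '/srv/cluster 10.0.0.0/24(rw,sync,no_subtree_check)' | sudo tee -a /etc/exports
sudo exportfs -ra

# each worker node: mount the same path so every rank sees the same binaries
sudo apt install nfs-common
sudo mount master:/srv/cluster /srv/cluster

# launch MPI from the shared directory (Open MPI hostfile syntax)
cd /srv/cluster/mycase
mpirun --hostfile hosts.txt -np 32 ./solver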

Yes, the commonly used solvers are very CPU-intensive. Also, my application domain is planetary science, so the problem size is very large, leading to the need for large amounts of RAM.

I have been looking into the application of GPUs and am experimenting with a 24GB RTX 3090. GPU-based methods are not as mature as traditional CPU-based computational tools, but things are moving fast, so give it a few years. There are some powerful GPU-based direct solvers, BLAS acceleration tools and applications of generative AI for solving equations, but these approaches are not always easy to integrate into existing code bases and require a lot of trial and error with hardware and software. One of the key challenges with full GPU integration in CFD modeling is that much of the codebase has to be rewritten using a completely different coding approach, but again this space is moving fast and things will get easier.

My small cluster concept is motivated by the abundance of low-cost Epyc Rome CPUs and motherboards. I find it interesting that the performance of my dual-socket Epyc 7F52 system (2x 16-core CPUs + motherboard + 16 memory channels, bought for $950 in new and used parts from tugm4470 on eBay) is only 8% behind my Threadripper 3970X (32-core CPU + motherboard + 4 memory channels, bought new at $3500) for single-node calculations when memory bandwidth bottlenecks are not reached, and the 7F52s destroy the 3970X for parallel calculations that need more memory channels for linear speed-up. I just need to get over this networking learning curve to fully leverage what Rome has to offer. This forum has really helped me get closer to that ultimate goal. Any additional input would be much appreciated.
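To put rough numbers on the memory-channel point (theoretical peak only, assuming DDR4-3200 on every channel at ~25.6 GB/s per channel): the 16 channels on the dual 7F52 box give roughly 410 GB/s aggregate, while the 3970X's 4 channels top out around 102 GB/s, so about a 4x gap before any other bottleneck comes into play.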
 

erock
gb00s said:
No, you don't need a switch. You can do a direct connection between 2 computers, install the mandatory IB stuff, load some modules, and add the OpenSM service on one of the workstations. Give each card on each side an IP and that's it. You could also get 2 cheap 100G Omni-Path cards and connect both without a switch. Even better, but keep in mind CPU load ;)



EDIT: Added example for Debian net config
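If I follow the direct-connection idea, on Debian it would look roughly like this (my sketch, assuming the cards show up as ib0 and picking an arbitrary private subnet):

Code:
# both nodes: userspace IB tools and the IPoIB kernel module
sudo apt install rdma-core infiniband-diags
sudo modprobe ib_ipoib

# /etc/network/interfaces stanza on node A (node B uses 10.10.10.2)
auto ib0
iface ib0 inet static
    address 10.10.10.1
    netmask 255.255.255.0

# one node only: run the subnet manager, since there is no switch to provide one
sudo apt install opensm
sudo systemctl enable --now opensm

# sanity checks
ibstat
ping 10.10.10.2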
gb00s, could you elaborate on the CPU load issue? It is not clear to me how CPU load and networking are connected.
 

Sean Ho
seanho.com
Regarding modelling for GPUs: although it is indeed a fast-moving area, with libraries like cuBLAS you often only need to recompile or relink to take advantage of CUDA; minimal code change.
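For example, NVIDIA's NVBLAS drop-in intercepts large level-3 BLAS calls and routes them to cuBLAS on the GPU, so an existing binary can often be accelerated without touching the source at all; roughly (a sketch, library paths will differ per install):

Code:
# nvblas.conf: fall back to a CPU BLAS for whatever NVBLAS does not offload
NVBLAS_CPU_BLAS_LIB /usr/lib/x86_64-linux-gnu/libopenblas.so
NVBLAS_GPU_LIST ALL

# run the unmodified binary with NVBLAS preloaded
export NVBLAS_CONFIG_FILE=$PWD/nvblas.conf
LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so ./solver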
 

erock
Sean Ho said:
Regarding modelling for GPUs: although it is indeed a fast-moving area, with libraries like cuBLAS you often only need to recompile or relink to take advantage of CUDA; minimal code change.
For sure, but the complicated algorithms that go into building the system of equations in complex simulations often take up a big chunk of compute time and are non-trivial to re-code for GPUs.
 

erock
Sean Ho said:
Regarding modelling for GPUs: although it is indeed a fast-moving area, with libraries like cuBLAS you often only need to recompile or relink to take advantage of CUDA; minimal code change.
OK, thank you for the clarification.
 

erock
I figured out the issue with slow performance. It had nothing to do with my gigabit switch. The issue was associated with heterogeneous processors. When I ran my parallel pi test I thought I was using two nodes with Epyc 7F52 processors, but one of the nodes actually had a slower Epyc 7302 from an experimental build. When MPI was executed with processors evenly distributed between the nodes, the 7302 machine was the bottleneck. To rebalance the load on this heterogeneous setup, I had to assign two to three times as many cores on the 7302 node as on the 7F52.
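In practice the rebalance was just an uneven hostfile, roughly like this (hostnames and slot counts are illustrative; Open MPI syntax):

Code:
# hosts.txt: give the slower 7302 node a larger share of the ranks
node-7f52 slots=16
node-7302 slots=32

mpirun --hostfile hosts.txt -np 48 ./solver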

So for a small cluster the gigabit switch does show reasonable performance (I am getting close to linear speed-up when loads are balanced) and provides a cost-effective way of aggregating CPU resources for parallel scientific and engineering calculations. I will update this post on the performance of the gigabit switch as I expand the small cluster.

I also moved from GlusterFS to a simple NFS setup since NFS is easier to manage and storage is not a critical factor for me right now. This change did not affect MPI performance on my small network.

I plan on experimenting with the Mellanox card and IB switch referenced above since they are affordable and will provide a valuable learning experience.

Thank you for your input!
 

Sean Ho
seanho.com
Glad you're seeing performance within your expectations. A lot of CFD, simulation, etc. is "embarrassingly parallel" and compute-heavy, so gigabit networking may suffice (and be much cheaper and simpler to administer). To the point that often the lightweight MPI semantics are not even needed, and old-school batching (Slurm, PBS/Torque, et al.) with manual division of tasks works just fine.
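For instance, a bare-bones Slurm array job that splits the work into independent chunks, no MPI involved, is just a few lines (a sketch; the --chunk/--of flags are hypothetical solver options):

Code:
#!/bin/bash
#SBATCH --job-name=chunks
#SBATCH --array=0-15            # 16 independent pieces of work
#SBATCH --cpus-per-task=4
#SBATCH --time=04:00:00

# each array element works on its own slice of the problem
./solver --chunk "$SLURM_ARRAY_TASK_ID" --of 16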
 