Networking Hardware for Parallel Computing on a Small Cluster

erock

Member
Jul 19, 2023
84
17
8
Hi All,

I recently built several computers for scientific computing and would now like to build a network so parallel calculations can be performed using MPI. My problem is that my current network setup is very slow (see the definition of slow below) when running computations on two machines. I am not sure whether this is a software, network-configuration, and/or networking-hardware problem, and would like your guidance on the best (low-latency, high-bandwidth) switch, NICs, and network settings for a 4-8 node cluster so I can rule this aspect out. Here are my specs:

Each node has a Supermicro H11DSi motherboard with two 16-core EPYC 7F52 processors, 256 GB of 3200 MT/s ECC RAM, 2 TB M.2 drives, a 1 TB SSD with Pop!_OS installed, and two 1 Gb Ethernet ports.

Nodes are linked together using Cat 6 cables plugged into one of the two 1 Gb ports on each motherboard and a Netgear 1 Gb switch.

One node acts as a gateway to the internet and shares the internet with other nodes on the network.

I am using GlusterFS to manage a distributed network file system and run MPI applications from a Gluster volume directory shared across the network. MPI was tested and can send and reduce data across multiple nodes. Here are links from an individual using a similar setup that describe the approach I am using:
Slowness description: When I run MPI test code for calculating pi on a single master node or slave node, I get approximately linear speed-up. However, if I run an equivalent number of cores distributed between two machines, a 30-60 second overhead is added to the calculation:
  1. Run code on 1 core of the master node: cpu sec = 120
  2. Run code on 2 cores of the master node: cpu sec = 63
  3. Run code on 4 cores of the master node: cpu sec = 35
  4. Run code on 2 master cores and 2 slave cores: cpu sec = 121
I have run MPI bandwidth tests that show about 3000 MB/s when both ranks are on the master node, whereas I get around 175 MB/s for internode tests.
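
For reference, the internode number comes from a simple ping-pong style bandwidth test; a minimal sketch of that kind of measurement (not my exact code, and node1/node2 are placeholders for my hostnames) looks like this:

/* Minimal MPI ping-pong bandwidth sketch (not the exact benchmark I ran).
 * Rank 0 bounces a 4 MB buffer off rank 1; bandwidth = 2 * bytes / time.
 * Build: mpicc pingpong.c -o pingpong
 * Run:   mpiexec -n 2 -hosts node1,node2 ./pingpong   (hostnames are placeholders)
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    const int nbytes = 4 * 1024 * 1024;   /* message size: 4 MB */
    const int iters  = 100;
    char *buf = calloc(nbytes, 1);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("bandwidth: %.1f MB/s\n", 2.0 * nbytes * iters / (t1 - t0) / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}

With both ranks on one node this reports the shared-memory number; across two nodes it exercises whatever interface MPI picks.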

So in terms of possible causes, here is where I am:
  1. My switch is too slow. What should I buy for a small cluster?
  2. I need to ditch the 1 Gb port and install a better NIC on the motherboard (if yes, which one would be best?).
  3. Something is wrong with the network settings, possibly because the two Ethernet ports on the motherboard are confusing MPI (see the interface-pinning sketch after this list).
  4. My MPICH installation is suboptimal or not configured properly.
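
For point 3, my plan is to pin MPI to a single interface explicitly and see if anything changes; a rough sketch of the options I have found (assuming MPICH's Hydra launcher, with Open MPI shown as an alternative; eno1 and the hostnames are placeholders):

# MPICH (Hydra launcher): force traffic onto one specific NIC
mpiexec -iface eno1 -n 8 -hosts node1,node2 ./a.out

# Open MPI equivalent: restrict the TCP transport to one interface
mpirun --mca btl_tcp_if_include eno1 -np 8 --host node1:4,node2:4 ./a.out
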
Any help, guidance and perspectives would be much appreciated.
 
Last edited:

piranha32

Active Member
Mar 4, 2023
250
180
43
Your best option is Mellanox ConnectX cards. You can use them in InfiniBand or Ethernet mode. InfiniBand has been designed with computation in mind and offers high throughput and low latency, but you need a special switch supporting this protocol.
Ethernet should probably be fine as well; you'll have to do a little digging to see which mode works better for you.
As for speed, even with ancient ConnectX-3 cards you can get up to 56 Gbps wire speed and low latency. How much you'll actually get depends on auxiliary equipment, like fiber modules and switches.
What you really want is support for RDMA (RoCE for Ethernet). With MPI you should see a significant boost in inter-node communication performance. There are other cards which support RoCE, but Mellanox cards are very popular in the HPC community, and it should be fairly easy to get them working.
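
As a rough illustration (assuming the rdma-core tools and an MPI built with UCX support; the device name is a placeholder, e.g. mlx4_0 for ConnectX-3, and ./osu_bw stands in for whatever MPI benchmark you use), checking that RDMA is actually in use looks something like this:

# List RDMA-capable devices and their link state
ibv_devinfo

# Ask Open MPI/UCX to use the RDMA device instead of plain TCP
mpirun -np 2 --host node1,node2 --mca pml ucx -x UCX_NET_DEVICES=mlx4_0:1 ./osu_bw
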
 

gb00s

Well-Known Member
Jul 25, 2018
1,198
603
113
Poland
Your best option is Mellanox ConnectX cards. You can use them in InfiniBand or Ethernet mode. InfiniBand has been designed with computation in mind and offers high throughput and low latency, but you need a special switch supporting this protocol.
No, you don't need a switch. You can do a direct connection between 2 computers, install the mandatory IB stuff, load some modules, and add the OpenSM service on one of the workstations. Give each card on each side an IP and that's it. You could also get 2 cheap 100G Omni-Path cards and connect both without a switch. Even better, but keep in mind the CPU load ;)

auto ibs3
iface ibs3 inet static
address 172.20.20.100
netmask 255.255.255.0
#broadcast 172.20.20.255
#hwaddress ether random
mtu 65520
pre-up modprobe ib_ipoib
pre-up echo connected > /sys/class/net/ibs3/mode
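
And roughly what "install the mandatory IB stuff, load some modules, add OpenSM" means on a Debian-based box (package and module names may differ per distro and card generation):

# User-space IB stack, subnet manager and diagnostics
apt install rdma-core opensm infiniband-diags

# HCA driver (mlx4_ib for ConnectX-3 era cards) and IPoIB
modprobe mlx4_ib
modprobe ib_ipoib

# Run the subnet manager on ONE node of the directly connected pair
systemctl enable --now opensm

# Check that the link comes up
ibstat
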
EDIT: Added example for Debian net config
 

piranha32

Active Member
Mar 4, 2023
250
180
43
No, you don't need a switch. You can do a direct connection between 2 computers, install the mandatory IB stuff, load some modules, and add the OpenSM service on one of the workstations. Give each card on each side an IP and that's it. You could also get 2 cheap 100G Omni-Path cards and connect both without a switch. Even better, but keep in mind the CPU load ;)
For 2 nodes, yes, a switch is not required. But the OP writes about 4-8 nodes. The best you can do then is a partial mesh with routing via nodes, which kind of sucks from a performance point of view.
 
  • Like
Reactions: abq and AXm77

gb00s

Well-Known Member
Jul 25, 2018
1,198
603
113
Poland
Yeah, this was an oversight. But with 2 dual-port cards in each workstation you should still be able to run a 5-node IB-based cluster without a switch. ;) Done that. Worked .. 3 cards = .... :D
 

piranha32

Active Member
Mar 4, 2023
250
180
43
Yeah, this was an oversight. But with 2 dual-port cards in each workstation you should still be able to run a 5-node IB-based cluster without a switch. ;) Done that. Worked .. 3 cards = .... :D
Sure... But is it worth the extra effort if you can get an InfiniBand switch on eBay for less than $100? You'll probably spend more on the extra cards, modules and cables.
 

erock

Member
Jul 19, 2023
84
17
8
No, you don't need a switch. You can do a direct connection between 2 computers, install the mandatory IB stuff, load some modules, and add the OpenSM service on one of the workstations. Give each card on each side an IP and that's it. You could also get 2 cheap 100G Omni-Path cards and connect both without a switch. Even better, but keep in mind the CPU load ;)



EDIT: Added example for Debian net config
I really like this idea of directly connecting the machines for testing and getting a feel of how the networking hardware works. Could you give me a few more recommendations on specific hardware to make this work? Specifically, what networking cards and cables should I buy for both machines?

Also, what is the IB stuff? Finally, I have been using openssh-server. Does OpenSM provide similar functionality?

Sorry for the basic questions. My background is in scientific computing software and high-end workstations. Networking is a new domain for me.

Thank you for your help with this!
 

erock

Member
Jul 19, 2023
84
17
8
Sure... But is it worth the extra effort if you can get an InfiniBand switch on eBay for less than $100? You'll probably spend more on the extra cards, modules and cables.
I also like the idea of eventually expanding my system using your switch. Could you provide a bit more detail on hardware? What network cards, specific InfiniBand switch, and cables should I buy?
 

erock

Member
Jul 19, 2023
84
17
8
Yeah, this was an oversight. But with 2 dual-port cards in each workstation you should still be able to run a 5-node IB-based cluster without a switch. ;) Done that. Worked .. 3 cards = .... :D
Could you provide a recommendation for specific dual-port cards and cables? Also, could you elaborate on how the cable connections work in terms of cables to ports? Thank you so much for your help!
 

piranha32

Active Member
Mar 4, 2023
250
180
43
I also like the idea of eventually expanding my system using your switch. Could you provide a bit more detail on hardware? What network cards, specific InfiniBand switch, and cables should I buy?
I'm not an expert on InfiniBand networking, but hopefully others will chime in.

Searching for infiniband switches on ebay returned these results on the top of the list:
12 port SX6005 for $99+ship: Mellanox SX6005 12-Port Unmanaged Infiniband Switch | eBay
8 port IS5022 for $75: Mellanox IS5022 8 Port InifinScale IV QDR InfiniBand Switch | eBay

Cards: MCX354A-FCBT - IB and Eth at 40 and 56 Gbps, available on eBay for ~$20 (or better).

Fiber modules: no idea, I'll be glad to see recommendations. On short distances you can also use DAC cables.
 
Last edited:

nexox

Well-Known Member
May 3, 2023
700
289
63
I would say that unless you want to learn InfiniBand and your application(s) support it natively, just stick with Ethernet.

The cheapest faster Ethernet is 10G SFP+ gear; 40G QSFP is just a little more expensive, but it tends to be louder. If your machines are all relatively close together you don't need to mess with any fiber, just use copper DACs. The exact hardware which fits your needs best depends on how much noise is acceptable, how much time and money you want to spend, how many ports you need, and what is supported in the OS you're using.

That said, if you have enough ports on your current netgear switch you could wire up the second ports of your machines to run glusterfs on a different subnet from the first ports and reduce contention with the application on the first 1G port. That would be cheap and not terribly difficult.
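
Something along these lines, per node (interface names and addresses are just placeholders): the second port gets its own subnet, and the gluster peers are probed via those addresses so brick traffic stays off the link MPI is using.

# /etc/network/interfaces stanza for the second onboard port
auto eno2
iface eno2 inet static
address 10.0.1.11
netmask 255.255.255.0
# use 10.0.1.11-14 across the four nodes

# one-time, from any node, using the storage-subnet addresses:
gluster peer probe 10.0.1.12
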
 

Sean Ho

seanho.com
Nov 19, 2019
774
357
63
Vancouver, BC
seanho.com
I wouldn't point OP to IB unless they're interested in tinkering. 10GbE SFP+ switches (e.g., ICX series, see mega-mega-thread) are affordable and plug-and-play. If RDMA is a significant benefit and worth the hassle, then RoCE.

How much storage? Is clustered storage necessary? Is a significant amount of data flying around between nodes, or is each node mostly just churning on its own local chunk of data? The pi test is very contrived and has basically no storage needs.
 

piranha32

Active Member
Mar 4, 2023
250
180
43
Regardless of the IB/Eth dispute, what would be the best cheap MM modules for use with MCX354A?
 

gb00s

Well-Known Member
Jul 25, 2018
1,198
603
113
Poland
Before discussing IB vs. Ethernet performance, I would first discuss GlusterFS vs. the alternatives. I assumed the subject here is parallel computing with some storage behind it, and for that IB is superior to Ethernet. I admit, if the OP is OK with a single point of failure in his cluster, then use an IB switch. But my experience with a 5-node IB cluster, with 2x 56 Gbps cards in each node connected as a mesh, was great performance-wise. The only time-consuming part was writing the net config on each node without messing it up. From an expansion standpoint it's a no-go, of course.
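
For illustration, the per-node config was basically one tiny point-to-point subnet per link, something like this (interface names and addresses are placeholders, not my actual config):

# node1, link to node2 (first port, first card)
auto ibs1
iface ibs1 inet static
address 172.20.12.1
netmask 255.255.255.252

# node1, link to node3 (second port, first card)
auto ibs2
iface ibs2 inet static
address 172.20.13.1
netmask 255.255.255.252

# ...and so on for the remaining point-to-point links on each node
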
 

Railgun

Active Member
Jul 28, 2018
148
56
28
As you're explicitly referencing latency, what is the latency target you are aiming for?

Oftentimes, when people talk about latency, they don’t actually understand what it is they need.

If you want the absolute fastest, then you need an L1 switch, which is on the order of about 5 ns. The SX6005 referenced looks like it’s about 170 ns port to port.
 

erock

Member
Jul 19, 2023
84
17
8
I would say that unless you want to learn InfiniBand and your application(s) support it natively, just stick with Ethernet.

The cheapest faster Ethernet is 10G SFP+ gear; 40G QSFP is just a little more expensive, but it tends to be louder. If your machines are all relatively close together you don't need to mess with any fiber, just use copper DACs. The exact hardware which fits your needs best depends on how much noise is acceptable, how much time and money you want to spend, how many ports you need, and what is supported in the OS you're using.

That said, if you have enough ports on your current netgear switch you could wire up the second ports of your machines to run glusterfs on a different subnet from the first ports and reduce contention with the application on the first 1G port. That would be cheap and not terribly difficult.
Thank you for sharing your thoughts on this. Does the optimization using both ports on the mobos require a managed switch? My current switch is an 8-port unmanaged Netgear switch.

See below for responses to your questions about requirements. If you could provide specific hardware and Linux OS recommendations, that would be very helpful so I can use this as a starting point for traveling up the learning curve. Thank you again!

Noise: For this small cluster I chose full towers for the nodes because it was easier to cool the CPUs with tall Noctua coolers, which do not fit in a 4U chassis, while reducing noise relative to a 4U server chassis (Dark Base Pro 900 towers are working very well with the E-ATX Supermicro server mobos). But the noise level is still high relative to a typical workstation because the fans run at full speed under heavy load. So my guess is that some additional noise from networking hardware will not make much difference. I also have a TR 3970X workstation which has loud and annoying MOSFET fans that may mask the networking hardware noise.

Total Networking Budget: I would prefer to stay around $500 but could go as high as $1000 if the performance gain is worth it.

Time: I am willing to invest some time, since short-term project goals can be satisfied by running parallel calculations on single 32-core nodes with 8 memory channels. However, I would prefer to maximize time on the science, coding, and algorithm-design components of my projects. I do think, though, that my long-term objective can only be reached by getting this networking stuff to work well. I also want to avoid AWS since my calculations often run for multiple days, it is much easier for me to iterate and experiment on a small cluster, and I only need around 120 cores with sufficient memory channels to not create a bottleneck (the dual EPYC 7F52 is great in this regard, especially considering that the price is ~$300 used from great eBay dealers).

Ports: I have a total of 5 machines, 4 of which I would like to cluster; the fifth is a power-hungry Threadripper 3970X with only 4 memory channels available for 32 cores (not good for bandwidth-limited parallel calcs). I need all machines to be on a network for access to the internet. My current Netgear switch has 8 ports, and I don’t think I need many more than this unless there is an optimization I can do by using additional ports, which was referred to in a different reply.

OS: I am currently using Pop!_OS since it is great for graphics drivers and overall easy to use, but I have experience with Ubuntu and Fedora. I can easily switch to any recommended Linux OS.
 
Last edited:

erock

Member
Jul 19, 2023
84
17
8
I wouldn't point OP to IB unless they're interested in tinkering. 10GbE SFP+ switches (e.g., ICX series, see mega-mega-thread) are affordable and plug-and-play. If RDMA is a significant benefit and worth the hassle, then RoCE.

How much storage? Is clustered storage necessary? Is a significant amount of data flying around between nodes, or is each node mostly just churning on its own local chunk of data? The pi test is very contrived and has basically no storage needs.
The pi test is just a simple “can I get the basics of MPI to work” test. The real use case involves solving large systems of equations using a software package called PETSc. Solving the system of equations typically takes 80% of the CPU time in serial, and these types of calculations scale very well on shared-memory machines and optimized distributed-memory clusters. So the algorithm involves the master node spawning an MPI calculation over multiple nodes; each node builds its local chunk of the system (typically ~20-150 GB of RAM per node) and the system is solved with internode communication. When this is done, a 1-5 GB solution file is saved to disk and read by the main program running on the master node. I am using a GlusterFS volume as a network file system from which the MPI code is executed. I am open to alternatives for spawning parallel MPI calculations and am only using GlusterFS because the examples above suggested it as a better alternative to NFS.
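
For context, the distributed solves are launched with something along these lines (hostnames, core counts, the solver binary name, and the PETSc options here are placeholders, not my actual job script):

# machinefile in MPICH/Hydra "host:slots" format
node1:32
node2:32
node3:32
node4:32

# launch the solver across all four nodes
mpiexec -f machinefile -n 128 ./petsc_solver -ksp_type gmres -pc_type bjacobi
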

I do not have huge storage needs, since this cluster is really about performing calculations that involve large amounts of RAM as quickly as possible. After calculations are completed, I copy the results over the network (~100-300 GB) to my Threadripper workstation for visualization and post-processing. This cluster is used by a very small team, so there is no need to worry about managing many users, etc.
 

erock

Member
Jul 19, 2023
84
17
8
As you're explicitly referencing latency, what is the latency target you are aiming for?

Oftentimes, when people talk about latency, they don’t actually understand what it is they need.

If you want the absolute fastest, then you need an L1 switch, which is on the order of about 5 ns. The SX6005 referenced looks like it’s about 170 ns port to port.
I may be in the category of those who do not actually understand, since I am a total noob when it comes to networking. I just want my parallel calculations across nodes to be as fast as possible given a $500-$1000 budget. The key requirement is that the overhead of the network communication between nodes must be smaller than the speed-up associated with the parallel calculation being done on the nodes, so any reduction in internode communication time is key. My very limited understanding of networking led me to assume that internode communication speed improves with both lower latency and higher bandwidth.

Could you provide some links to L1 switches that you think would work well? Thank you for your feedback!
 

Railgun

Active Member
Jul 28, 2018
148
56
28
Speed/bandwidth ≠ latency in this context. You could have 400 Gbps throughput at 3 μs port-to-port latency, and 1 Gb at 5 ns. Given your budget, chances are you won't get to the lower end of that range. Without knowing the ins and outs of this application, I'm just making educated guesses at what you need.

If you could find an inexpensive Arista 7150, which is about 350 ns best case, that may be your best bet. While you could go InfiniBand to get better latencies, that is a lot of additional hardware that you don't seem to want to spend much on.

That said, this all depends on what your bottleneck is. If you can measure a tangible difference based on node-to-node network latency, then OK.
 
Last edited:
  • Like
Reactions: nexox

alex_stief

Well-Known Member
May 31, 2016
884
312
63
39
InfiniBand has been the industry standard for HPC clusters for quite a while now. No need to reinvent the wheel.
And the hardware is well within your budget, at least when buying used on eBay.
 
  • Like
Reactions: abq