10-node cluster, InfiniBand disconnects frequently

sharonyue

New Member
Apr 20, 2022
27
2
3
Guys,

I have a small cluster connected by ten 100G Mellanox ConnectX-4 InfiniBand adapters.
Most of the time they run smoothly, but sometimes one of the nodes disconnects.
Once it is disconnected, the light on the adapter turns yellow, which means it's offline.
I don't know why. Any hints? How can I debug this?

Thanks.
 

necr

Active Member
Dec 27, 2017
156
49
28
124
Are they connected to one switch?
Are there any logs (opensm.log)?
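To catch the moment a port drops, so it can be lined up with opensm.log, one option is to poll the port state that the kernel exposes in sysfs. A minimal Python sketch, assuming a Linux node where the adapter shows up as `mlx5_0` port 1 (both are assumptions; list `/sys/class/infiniband/` on your nodes to check) and an arbitrary 5-second poll interval:

```python
import time
from pathlib import Path

# Assumed device and port; list /sys/class/infiniband/ to find the real ones.
DEVICE = "mlx5_0"
PORT = 1


def parse_port_state(text):
    """Split a sysfs state string such as '4: ACTIVE' into (4, 'ACTIVE')."""
    code, _, name = text.strip().partition(":")
    return int(code), name.strip()


def read_port_state(device=DEVICE, port=PORT):
    """Read the logical port state from sysfs."""
    path = Path("/sys/class/infiniband") / device / "ports" / str(port) / "state"
    return parse_port_state(path.read_text())


def watch(interval=5.0):
    """Print a timestamped line every time the port state changes."""
    last = None
    while True:
        state = read_port_state()
        if state != last:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(f"{stamp} {DEVICE} port {PORT}: {state[0]} {state[1]}", flush=True)
            last = state
        time.sleep(interval)
```

Calling `watch()` on each node with the output redirected to a file gives a timestamp for every state change, which can then be compared against the subnet manager's log.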
 

sharonyue

First, I know we can use AIDA64 to test the CPU and RAM, but is there any way to stress-test the adapters and the switch?
 

sharonyue

It's not the same one each time; it's a random node, and only under load. The NIC temperature was around 80 degrees, so we suspected heat. We installed fans on the NICs and the temperature dropped to around 55 degrees, but that did not solve the problem.

Here is the log file.
 

Sean Ho

seanho.com
Nov 19, 2019
775
359
63
Vancouver, BC
seanho.com
In theory, they should automatically select a primary amongst themselves, but only one is really necessary. And if you have a managed switch there'll already be a subnet manager running on the switch.
 

sharonyue

Thanks. Let me try using one opensm and get back to you guys.
 

sharonyue

Yesterday everything went well. But I am still trying to stress-test it.

BTW, if this problem comes from running multiple opensm instances, why does it take so long for the problem to show up?
 

Sean Ho

Oh, it is good to hear that it's running longer than before without dropping connections! I do not know, but in my observation people usually run at most two subnet managers on an IB segment -- one primary, one backup. With managed switches, most of the time folks don't even run opensm on any of the nodes, just the RDMA client drivers.
 

sharonyue

Thanks, bro. I ran it for another 6 hours and it ran well. Over the weekend I will try to run it for another 48 hours; if everything goes well, I would say this problem is solved. Mine is an unmanaged switch :) Let me see if I can run it for 2 whole days.
 

sharonyue

Sean Ho said: "Excellent! Is it running the stress-test load, or production load now? That's very encouraging news."

Bro, English is not my mother tongue, and I did not get your point. If I understand correctly: I connect the nodes to run real software (ANSYS CAE), not a stress-test app from Mellanox. I tried to find out how to stress-test the switch and adapters, but I did not find anything that can do it.
 

Sean Ho

My apologies! I was asking if so far you've only been putting synthetic (artificial, test, fake) traffic on the switch, or traffic for the actual real-world work (end-user, client, business use). The reason is that sometimes the network runs fine under stress-test but not with real-world traffic, which can be frustrating.
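For the synthetic side, one commonly used tool is `ib_write_bw` from the perftest package, which runs between one server and one client at a time. A minimal Python sketch that only prints a test plan, pairing every node with every other node over successive rounds. The host names, the `mlx5_0` device name, the 60-second duration, and the use of `ssh` are all assumptions for illustration; nothing here is from the thread.

```python
def round_robin_pairs(hosts):
    """Return rounds of (server, client) pairs so every host meets every other once."""
    nodes = list(hosts)
    if len(nodes) % 2:
        nodes.append(None)  # odd count: one host sits out each round
    n = len(nodes)
    rounds = []
    for _ in range(n - 1):
        pairs = [
            (nodes[i], nodes[n - 1 - i])
            for i in range(n // 2)
            if nodes[i] is not None and nodes[n - 1 - i] is not None
        ]
        rounds.append(pairs)
        # Circle method: keep the first host fixed, rotate the rest by one.
        nodes = [nodes[0]] + [nodes[-1]] + nodes[1:-1]
    return rounds


def perftest_commands(server, client, device="mlx5_0", seconds=60):
    """Build the server-side and client-side ib_write_bw command lines."""
    base = ["ib_write_bw", "-d", device, "-D", str(seconds)]
    server_cmd = ["ssh", server] + base
    client_cmd = ["ssh", client] + base + [server]  # client is given the server's name
    return server_cmd, client_cmd


# Hypothetical host names for a 10-node cluster.
HOSTS = [f"node{i:02d}" for i in range(1, 11)]

for number, pairs in enumerate(round_robin_pairs(HOSTS), start=1):
    print(f"# round {number}")
    for server, client in pairs:
        server_cmd, client_cmd = perftest_commands(server, client)
        print(" ".join(server_cmd))
        print(" ".join(client_cmd))
```

Within a round, each server command has to be started before its matching client command. Running all pairs of a round at the same time loads every adapter and the switch together, which is closer to what a cluster job does than a single point-to-point test.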
 

sharonyue

Unfortunately, one of the nodes is offline again... It ran for 8 hours, then went offline.

Last night I updated the firmware as noted here: Firmware Compatible Products - Mellanox Switch-IB Firmware v11.2008.0236 - NVIDIA Networking Docs. This time it went offline even faster: after around 10 minutes, one of the nodes was offline.

Alas...

It is used for actual real-world work (end-user, client, business use)...
 

necr

It would be great if you could provide a normal text log instead of screenshots. It seems you were still facing problems after an SM failover (Master to Standby), which will cause client re-registration and break IPoIB and new RDMA connections. Can you keep OpenSM on only one server and remove it from all the others? Also, can you keep the server running the master OpenSM idle?
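To confirm that only one node is still running a subnet manager, a quick check is to look for an `opensm` process on every node. A minimal Python sketch, assuming passwordless `ssh` to each node and that `pgrep` is installed there; the host names and the choice of which node keeps OpenSM are hypothetical:

```python
import subprocess


def opensm_running(host):
    """Return True if an opensm process is found on host (checked over ssh)."""
    result = subprocess.run(
        ["ssh", host, "pgrep", "-x", "opensm"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # pgrep exits 0 only on a match; an ssh failure also gives a non-zero code.
    return result.returncode == 0


def hosts_to_stop(running_by_host, keep):
    """Given {host: bool}, list the hosts where opensm should be stopped."""
    return sorted(h for h, running in running_by_host.items() if running and h != keep)


def survey(hosts, keep):
    """Check every host and return those still running opensm besides `keep`."""
    return hosts_to_stop({h: opensm_running(h) for h in hosts}, keep)
```

For example, `survey(["node01", "node02", "node03"], keep="node01")` returns the nodes other than `node01` that still have opensm running. An unreachable node is reported as not running it, so it is worth checking connectivity separately. How to stop and disable the service depends on how OpenSM was installed on those nodes.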
 