10-node cluster, InfiniBand disconnects frequently


sharonyue

New Member
Apr 20, 2022
> It would be great if you could provide a normal text log instead of screenshots. It seems you were still facing problems after an SM failover (Master to Standby), which causes client re-registration and breaks IPoIB and new RDMA connections. Can you keep OpenSM on only 1 server and remove it from all the other servers? Also, can you keep the server running the master OpenSM idle?
Thanks Necr, here is my opensm.log:

Code:
May 07 07:56:23 390872 [5060E740] 0x03 -> OpenSM 5.9.1.MLNX20210811.517c4ae
OpenSM 5.9.1.MLNX20210811.517c4ae

May 07 07:56:23 391028 [5060E740] 0x80 -> OpenSM 5.9.1.MLNX20210811.517c4ae
May 07 07:56:23 394646 [5060E740] 0x02 -> osm_vendor_init: 1000 pending umads specified
May 07 07:56:23 394711 [5060E740] 0x02 -> osm_vendor_init: 1000 pending umads specified
Using default GUID 0x506b4b0300eeb1da
Entering DISCOVERING state

May 07 07:56:23 411211 [5060E740] 0x80 -> Entering DISCOVERING state
May 07 07:56:23 411421 [5060E740] 0x02 -> osm_vendor_bind: Mgmt class 0x81 binding to port GUID 0x506b4b0300eeb1da
May 07 07:56:23 451059 [5060E740] 0x02 -> osm_vendor_bind: Mgmt class 0x03 binding to port GUID 0x506b4b0300eeb1da
May 07 07:56:23 490305 [5060E740] 0x02 -> osm_vendor_bind: Mgmt class 0x04 binding to port GUID 0x506b4b0300eeb1da
May 07 07:56:23 490342 [5060E740] 0x02 -> osm_vendor_bind: Mgmt class 0x21 binding to port GUID 0x506b4b0300eeb1da
May 07 07:56:23 490367 [5060E740] 0x02 -> osm_vendor_bind: Mgmt class 0x0a binding to port GUID 0x506b4b0300eeb1da
May 07 07:56:23 490397 [5060E740] 0x02 -> osm_opensm_bind: Setting IS_SM on port 0x506b4b0300eeb1da
May 07 07:56:23 493025 [4C99A700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
May 07 07:56:23 498566 [4D19B700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
May 07 07:56:23 498645 [43988700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
May 07 07:56:23 498929 [41183700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
May 07 07:56:23 499009 [3D17B700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
May 07 07:56:23 499163 [3A976700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
May 07 07:56:23 499225 [3A175700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
May 07 07:56:23 499248 [3A175700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
May 07 07:56:23 499605 [33167700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
May 07 07:56:23 499676 [2E95E700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
Entering MASTER state

May 07 07:56:23 501793 [AF17700] 0x80 -> Entering MASTER state
May 07 07:56:23 617615 [AF17700] 0x02 -> osm_ucast_mgr_process: minhop tables configured on all switches
May 07 07:56:23 624011 [AF17700] 0x02 -> SUBNET UP
May 07 07:56:23 628697 [2E15D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:401b:ffff::ffff:ffff
May 07 07:56:23 634274 [2E15D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:401b:ffff::1
May 07 07:56:23 634307 [2E15D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:401b:ffff::fb
May 07 07:56:23 634327 [2E15D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1
May 07 07:56:23 634362 [2E15D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ff99:ffa6
May 07 07:56:23 634415 [2E15D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ff6c:be4d
May 07 07:56:23 634465 [2E15D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ff99:c40e
May 07 07:56:23 681210 [4A195700] 0x01 -> log_trap_info: Received Generic Notice type:1 num:128 (Link state change) Producer:2 (Switch) from LID:3 TID:0x0000000d00000080
May 07 07:56:23 681288 [4A195700] 0x02 -> SM class trap 128: Directed Path Dump of 1 hop path: Path = 0,1
May 07 07:56:23 681303 [4A195700] 0x02 -> log_notice: Reporting Generic Notice type:1 num:128 (Link state change) from LID:3 GID:fe80::ec0d:9a03:1c:1d20
May 07 07:56:23 682879 [42986700] 0x02 -> osm_spst_rcv_process: Switch 0xec0d9a03001c1d20 SwitchIB Mellanox Technologies port 37 changed state from DOWN to INIT
May 07 07:56:23 696022 [AF17700] 0x02 -> osm_ucast_mgr_process: minhop tables configured on all switches
May 07 07:56:23 704986 [AF17700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:64 (GID in service) from LID:6 GID:fe80::ec0d:9a03:1c:1d28
May 07 07:56:23 704999 [AF17700] 0x02 -> state_mgr_report_new_ports: Discovered new port with GUID:0xec0d9a03001c1d28 LID range [7,7] of node: Mellanox Technologies Aggregation Node
May 07 07:56:23 705202 [AF17700] 0x02 -> SUBNET UP
May 07 07:56:24 133635 [2E15D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ffea:38d
May 07 07:56:24 133706 [2E15D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::16
May 07 07:56:24 146657 [2E15D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ff2d:94a
May 07 07:56:24 152715 [2E15D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ff7e:6615
May 07 07:56:24 154254 [2E15D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ff83:1191
May 07 07:56:24 154847 [2E15D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ff3e:490b
May 07 07:56:24 169344 [2E15D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ff86:dccc
May 07 07:56:24 173977 [2E15D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ff05:4519
May 07 07:56:25 153851 [2E15D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::2
May 07 07:57:52 479766 [2E15D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:67 (Mcast group deleted) from LID:6 GID:ff12:601b:ffff::16
May 07 07:58:03 123973 [2E15D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:67 (Mcast group deleted) from LID:6 GID:ff12:601b:ffff::2
May 07 08:16:27 150381 [1ACAF740] 0x03 -> OpenSM 5.9.1.MLNX20210811.517c4ae
OpenSM 5.9.1.MLNX20210811.517c4ae

May 07 08:16:27 150548 [1ACAF740] 0x80 -> OpenSM 5.9.1.MLNX20210811.517c4ae
May 07 08:16:27 154546 [1ACAF740] 0x02 -> osm_vendor_init: 1000 pending umads specified
May 07 08:16:27 154590 [1ACAF740] 0x02 -> osm_vendor_init: 1000 pending umads specified
Using default GUID 0x506b4b0300eeb1da
Entering DISCOVERING state

May 07 08:16:27 171431 [1ACAF740] 0x80 -> Entering DISCOVERING state
May 07 08:16:27 171600 [1ACAF740] 0x02 -> osm_vendor_bind: Mgmt class 0x81 binding to port GUID 0x506b4b0300eeb1da
May 07 08:16:27 212350 [1ACAF740] 0x02 -> osm_vendor_bind: Mgmt class 0x03 binding to port GUID 0x506b4b0300eeb1da
May 07 08:16:27 252995 [1ACAF740] 0x02 -> osm_vendor_bind: Mgmt class 0x04 binding to port GUID 0x506b4b0300eeb1da
May 07 08:16:27 253022 [1ACAF740] 0x02 -> osm_vendor_bind: Mgmt class 0x21 binding to port GUID 0x506b4b0300eeb1da
May 07 08:16:27 253042 [1ACAF740] 0x02 -> osm_vendor_bind: Mgmt class 0x0a binding to port GUID 0x506b4b0300eeb1da
May 07 08:16:27 253064 [1ACAF740] 0x02 -> osm_opensm_bind: Setting IS_SM on port 0x506b4b0300eeb1da
May 07 08:16:27 255565 [1703B700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
May 07 08:16:27 260954 [15838700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
May 07 08:16:27 261058 [12832700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
May 07 08:16:27 261255 [1002D700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
May 07 08:16:27 261557 [801D700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
May 07 08:16:27 261634 [901F700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
May 07 08:16:27 261911 [80E700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
May 07 08:16:27 262101 [FE80A700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
May 07 08:16:27 262266 [FD007700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
May 07 08:16:27 262301 [FB804700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
Entering MASTER state

May 07 08:16:27 264215 [D55B8700] 0x80 -> Entering MASTER state
May 07 08:16:27 379871 [D55B8700] 0x02 -> osm_ucast_mgr_process: minhop tables configured on all switches
May 07 08:16:27 386543 [D55B8700] 0x02 -> SUBNET UP
May 07 08:16:27 390885 [F87FE700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:401b:ffff::ffff:ffff
May 07 08:16:27 403083 [13033700] 0x01 -> log_trap_info: Received Generic Notice type:1 num:128 (Link state change) Producer:2 (Switch) from LID:3 TID:0x0000000b00000080
May 07 08:16:27 403166 [13033700] 0x02 -> SM class trap 128: Directed Path Dump of 1 hop path: Path = 0,1
May 07 08:16:27 403183 [13033700] 0x02 -> log_notice: Reporting Generic Notice type:1 num:128 (Link state change) from LID:3 GID:fe80::ec0d:9a03:1c:1d20
May 07 08:16:27 404502 [F02B700] 0x02 -> osm_spst_rcv_process: Switch 0xec0d9a03001c1d20 SwitchIB Mellanox Technologies port 37 changed state from DOWN to INIT
May 07 08:16:27 417411 [D55B8700] 0x02 -> osm_ucast_mgr_process: minhop tables configured on all switches
May 07 08:16:27 426313 [D55B8700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:64 (GID in service) from LID:6 GID:fe80::ec0d:9a03:1c:1d28
May 07 08:16:27 426330 [D55B8700] 0x02 -> state_mgr_report_new_ports: Discovered new port with GUID:0xec0d9a03001c1d28 LID range [7,7] of node: Mellanox Technologies Aggregation Node
May 07 08:16:27 426566 [D55B8700] 0x02 -> SUBNET UP
May 07 08:16:27 894293 [F87FE700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:401b:ffff::1
May 07 08:16:27 894359 [F87FE700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:401b:ffff::fb
May 07 08:16:27 894417 [F87FE700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1
May 07 08:16:27 894457 [F87FE700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ff99:ffa6
May 07 08:16:27 896557 [F87FE700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ffea:38d
May 07 08:16:27 896593 [F87FE700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::16
May 07 08:16:27 899413 [F87FE700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ff86:dccc
May 07 08:16:27 904320 [F87FE700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ff7e:6615
May 07 08:16:27 909901 [F87FE700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ff3e:490b
May 07 08:16:27 912734 [F87FE700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ff05:4519
May 07 08:16:27 913246 [F87FE700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ff2d:94a
May 07 08:16:27 916368 [F87FE700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ff99:c40e
May 07 08:16:27 917071 [F87FE700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ff83:1191
May 07 08:16:27 917702 [F87FE700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ff6c:be4d
May 07 08:16:28 945871 [F87FE700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::2
May 07 08:17:59 974116 [F87FE700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:67 (Mcast group deleted) from LID:6 GID:ff12:601b:ffff::16
May 07 08:18:06 765047 [F87FE700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:67 (Mcast group deleted) from LID:6 GID:ff12:601b:ffff::2
May 07 09:01:55 058145 [12031700] 0x01 -> log_trap_info: Received Generic Notice type:1 num:128 (Link state change) Producer:2 (Switch) from LID:3 TID:0x0000000e00000080
May 07 09:01:55 058225 [12031700] 0x02 -> SM class trap 128: Directed Path Dump of 1 hop path: Path = 0,1
May 07 09:01:55 058236 [12031700] 0x02 -> log_notice: Reporting Generic Notice type:1 num:128 (Link state change) from LID:3 GID:fe80::ec0d:9a03:1c:1d20
May 07 09:01:55 059897 [E82A700] 0x02 -> osm_spst_rcv_process: Switch 0xec0d9a03001c1d20 SwitchIB Mellanox Technologies port 5 changed state from ACTIVE to DOWN
May 07 09:01:55 062428 [D55B8700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:67 (Mcast group deleted) from LID:6 GID:ff12:601b:ffff::1:ff7e:6615
May 07 09:01:55 062442 [D55B8700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:65 (GID out of service) from LID:6 GID:fe80::506b:4b03:28:d06a
May 07 09:01:55 062453 [D55B8700] 0x02 -> drop_mgr_remove_port: Removed port with GUID:0x506b4b030028d06a LID range [14, 14] of node:inode5 HCA-1
May 07 09:01:55 073259 [D55B8700] 0x02 -> osm_ucast_mgr_process: minhop tables configured on all switches
May 07 09:01:55 076831 [D55B8700] 0x02 -> SUBNET UP
Note that we are now running only one opensm, or at least we should be.

I am also wondering whether I need to switch to AOC cables. Right now I am using five 3-meter DAC cables and six 2-meter DAC cables.
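One way to see whether the drops follow particular cables is to tally which switch ports go DOWN in opensm.log and compare that with which cable is on each port. The snippet below is only a rough sketch: the regular expression is modelled on the osm_spst_rcv_process lines in the excerpt above, the function name is made up for illustration, and other OpenSM versions may word the line differently.

```python
import re
from collections import Counter

# Modelled on lines such as:
# "... osm_spst_rcv_process: Switch 0xec0d9a03001c1d20 SwitchIB Mellanox
#  Technologies port 5 changed state from ACTIVE to DOWN"
PORT_STATE = re.compile(
    r"osm_spst_rcv_process: Switch (?P<guid>0x[0-9a-fA-F]+) .* "
    r"port (?P<port>\d+) changed state from (?P<old>\w+) to (?P<new>\w+)"
)

def count_link_drops(lines):
    """Count transitions to DOWN per (switch GUID, port number)."""
    drops = Counter()
    for line in lines:
        match = PORT_STATE.search(line)
        if match and match.group("new") == "DOWN":
            drops[(match.group("guid"), int(match.group("port")))] += 1
    return drops

# Two lines copied from the log excerpt above.
sample = [
    "May 07 07:56:23 682879 [42986700] 0x02 -> osm_spst_rcv_process: Switch 0xec0d9a03001c1d20 SwitchIB Mellanox Technologies port 37 changed state from DOWN to INIT",
    "May 07 09:01:55 059897 [E82A700] 0x02 -> osm_spst_rcv_process: Switch 0xec0d9a03001c1d20 SwitchIB Mellanox Technologies port 5 changed state from ACTIVE to DOWN",
]

for (guid, port), count in count_link_drops(sample).most_common():
    print(f"switch {guid} port {port}: {count} drop(s)")
```

If most of the drops land on ports that carry the 3-meter cables, that would point toward the cabling; if they are spread evenly, the cables are a less likely cause.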

PS. I also attach the full log, which is too big to upload here.
 


necr

Active Member
Dec 27, 2017
Okay, something's wrong with the system where OpenSM runs. Every time "OpenSM 5.9.1.MLNX20210811.517c4ae" shows up in the log, it means the OpenSM service has restarted, and I count 40 restarts in the log. Is this system idle? Is there anything in syslog that would explain the restarts (crashes?)
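For anyone who wants to reproduce that count, here is a small sketch that lists every start banner with its timestamp. It assumes each start writes exactly one level-0x03 banner line, as in the excerpt posted above; the function name is made up for illustration.

```python
import re

# Matches the level-0x03 version banner that opens each OpenSM start, e.g.
# "May 07 07:56:23 390872 [5060E740] 0x03 -> OpenSM 5.9.1.MLNX20210811.517c4ae"
BANNER = re.compile(
    r"^(?P<stamp>\w{3} \d{2} \d{2}:\d{2}:\d{2}) \d+ "
    r"\[[0-9A-Fa-f]+\] 0x03 -> OpenSM (?P<version>\S+)"
)

def find_restarts(lines):
    """Return a (timestamp, version) pair for every start banner in lines."""
    restarts = []
    for line in lines:
        match = BANNER.match(line)
        if match:
            restarts.append((match.group("stamp"), match.group("version")))
    return restarts

# Lines copied from the log excerpt; only the 0x03 banners should be counted.
sample = [
    "May 07 07:56:23 390872 [5060E740] 0x03 -> OpenSM 5.9.1.MLNX20210811.517c4ae",
    "OpenSM 5.9.1.MLNX20210811.517c4ae",
    "May 07 07:56:23 391028 [5060E740] 0x80 -> OpenSM 5.9.1.MLNX20210811.517c4ae",
    "May 07 07:56:23 624011 [AF17700] 0x02 -> SUBNET UP",
    "May 07 08:16:27 150381 [1ACAF740] 0x03 -> OpenSM 5.9.1.MLNX20210811.517c4ae",
]

starts = find_restarts(sample)
print(f"{len(starts)} OpenSM starts found")
for stamp, version in starts:
    print(stamp, version)
```

To run it on the full log, pass an open file handle instead of `sample`. The timestamps can then be lined up against syslog entries from the same moments.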
 

sharonyue

New Member
Apr 20, 2022
Um, let me go over my procedure again. I want to run a code on our cluster. We installed CentOS and the Mellanox driver on each node (so each node has opensm), but we only start opensm on one of the nodes. All the nodes are connected over ssh. We create a shared folder with mount and run our code in that shared folder. Then, some time later, opensm crashes... Am I doing anything illogical?
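Since every node has opensm installed, it may be worth confirming that only one of them is actually running it. The sketch below asks each node over ssh whether a process named exactly `opensm` exists. The process name, the helper names, and passwordless ssh between nodes are all assumptions here, not something confirmed in this thread.

```python
import subprocess

def opensm_running(host, run=subprocess.run):
    """Return True if a process named exactly 'opensm' is running on host."""
    # pgrep exits with 0 when at least one process matches and 1 when none do.
    result = run(
        ["ssh", host, "pgrep", "-x", "opensm"],
        capture_output=True,
        text=True,
    )
    if result.returncode == 0:
        return True
    if result.returncode == 1:
        return False
    # Anything else (for example 255 from ssh itself) means the check failed.
    raise RuntimeError(f"could not check {host}: exit code {result.returncode}")

def hosts_running_opensm(hosts, run=subprocess.run):
    """Return the subset of hosts on which opensm is currently running."""
    return [host for host in hosts if opensm_running(host, run)]
```

Calling `hosts_running_opensm(["inode1", "inode2", "inode3"])` with your real hostnames (these are placeholders) should return a single entry if only one subnet manager is running.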
 

sharonyue

New Member
Apr 20, 2022
> Just run OpenSM and don't run any other code on it.

We haven't tried it yet. I guess it would take us a very long time.

We bought another three cx555 cards and tested them with three nodes; that runs fine. We then put these three cx555 cards into the cluster and tested 9 nodes, three with cx555 and six with cx455, and it still has this issue. One of the nodes would drop, even a cx555 one.

...