"It would be great if you could provide a normal text log instead of screenshots. It seems you were still facing problems after an SM failover (Master to Standby), which will cause client re-registration and break IPoIB and new RDMA connections. Can you keep OpenSM on only one server and remove it from all the other servers? Also, can you keep the server running the master OpenSM idle?"

Thanks Necr, here is my opensm.log:
Code:
May 07 07:56:23 390872 [5060E740] 0x03 -> OpenSM 5.9.1.MLNX20210811.517c4ae
OpenSM 5.9.1.MLNX20210811.517c4ae
May 07 07:56:23 391028 [5060E740] 0x80 -> OpenSM 5.9.1.MLNX20210811.517c4ae
May 07 07:56:23 394646 [5060E740] 0x02 -> osm_vendor_init: 1000 pending umads specified
May 07 07:56:23 394711 [5060E740] 0x02 -> osm_vendor_init: 1000 pending umads specified
Using default GUID 0x506b4b0300eeb1da
Entering DISCOVERING state
May 07 07:56:23 411211 [5060E740] 0x80 -> Entering DISCOVERING state
May 07 07:56:23 411421 [5060E740] 0x02 -> osm_vendor_bind: Mgmt class 0x81 binding to port GUID 0x506b4b0300eeb1da
May 07 07:56:23 451059 [5060E740] 0x02 -> osm_vendor_bind: Mgmt class 0x03 binding to port GUID 0x506b4b0300eeb1da
May 07 07:56:23 490305 [5060E740] 0x02 -> osm_vendor_bind: Mgmt class 0x04 binding to port GUID 0x506b4b0300eeb1da
May 07 07:56:23 490342 [5060E740] 0x02 -> osm_vendor_bind: Mgmt class 0x21 binding to port GUID 0x506b4b0300eeb1da
May 07 07:56:23 490367 [5060E740] 0x02 -> osm_vendor_bind: Mgmt class 0x0a binding to port GUID 0x506b4b0300eeb1da
May 07 07:56:23 490397 [5060E740] 0x02 -> osm_opensm_bind: Setting IS_SM on port 0x506b4b0300eeb1da
May 07 07:56:23 493025 [4C99A700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
May 07 07:56:23 498566 [4D19B700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
May 07 07:56:23 498645 [43988700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
May 07 07:56:23 498929 [41183700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
May 07 07:56:23 499009 [3D17B700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
May 07 07:56:23 499163 [3A976700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
May 07 07:56:23 499225 [3A175700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
May 07 07:56:23 499248 [3A175700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
May 07 07:56:23 499605 [33167700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
May 07 07:56:23 499676 [2E95E700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
Entering MASTER state
May 07 07:56:23 501793 [AF17700] 0x80 -> Entering MASTER state
May 07 07:56:23 617615 [AF17700] 0x02 -> osm_ucast_mgr_process: minhop tables configured on all switches
May 07 07:56:23 624011 [AF17700] 0x02 -> SUBNET UP
May 07 07:56:23 628697 [2E15D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:401b:ffff::ffff:ffff
May 07 07:56:23 634274 [2E15D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:401b:ffff::1
May 07 07:56:23 634307 [2E15D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:401b:ffff::fb
May 07 07:56:23 634327 [2E15D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1
May 07 07:56:23 634362 [2E15D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ff99:ffa6
May 07 07:56:23 634415 [2E15D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ff6c:be4d
May 07 07:56:23 634465 [2E15D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ff99:c40e
May 07 07:56:23 681210 [4A195700] 0x01 -> log_trap_info: Received Generic Notice type:1 num:128 (Link state change) Producer:2 (Switch) from LID:3 TID:0x0000000d00000080
May 07 07:56:23 681288 [4A195700] 0x02 -> SM class trap 128: Directed Path Dump of 1 hop path: Path = 0,1
May 07 07:56:23 681303 [4A195700] 0x02 -> log_notice: Reporting Generic Notice type:1 num:128 (Link state change) from LID:3 GID:fe80::ec0d:9a03:1c:1d20
May 07 07:56:23 682879 [42986700] 0x02 -> osm_spst_rcv_process: Switch 0xec0d9a03001c1d20 SwitchIB Mellanox Technologies port 37 changed state from DOWN to INIT
May 07 07:56:23 696022 [AF17700] 0x02 -> osm_ucast_mgr_process: minhop tables configured on all switches
May 07 07:56:23 704986 [AF17700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:64 (GID in service) from LID:6 GID:fe80::ec0d:9a03:1c:1d28
May 07 07:56:23 704999 [AF17700] 0x02 -> state_mgr_report_new_ports: Discovered new port with GUID:0xec0d9a03001c1d28 LID range [7,7] of node: Mellanox Technologies Aggregation Node
May 07 07:56:23 705202 [AF17700] 0x02 -> SUBNET UP
May 07 07:56:24 133635 [2E15D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ffea:38d
May 07 07:56:24 133706 [2E15D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::16
May 07 07:56:24 146657 [2E15D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ff2d:94a
May 07 07:56:24 152715 [2E15D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ff7e:6615
May 07 07:56:24 154254 [2E15D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ff83:1191
May 07 07:56:24 154847 [2E15D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ff3e:490b
May 07 07:56:24 169344 [2E15D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ff86:dccc
May 07 07:56:24 173977 [2E15D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ff05:4519
May 07 07:56:25 153851 [2E15D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::2
May 07 07:57:52 479766 [2E15D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:67 (Mcast group deleted) from LID:6 GID:ff12:601b:ffff::16
May 07 07:58:03 123973 [2E15D700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:67 (Mcast group deleted) from LID:6 GID:ff12:601b:ffff::2
May 07 08:16:27 150381 [1ACAF740] 0x03 -> OpenSM 5.9.1.MLNX20210811.517c4ae
OpenSM 5.9.1.MLNX20210811.517c4ae
May 07 08:16:27 150548 [1ACAF740] 0x80 -> OpenSM 5.9.1.MLNX20210811.517c4ae
May 07 08:16:27 154546 [1ACAF740] 0x02 -> osm_vendor_init: 1000 pending umads specified
May 07 08:16:27 154590 [1ACAF740] 0x02 -> osm_vendor_init: 1000 pending umads specified
Using default GUID 0x506b4b0300eeb1da
Entering DISCOVERING state
May 07 08:16:27 171431 [1ACAF740] 0x80 -> Entering DISCOVERING state
May 07 08:16:27 171600 [1ACAF740] 0x02 -> osm_vendor_bind: Mgmt class 0x81 binding to port GUID 0x506b4b0300eeb1da
May 07 08:16:27 212350 [1ACAF740] 0x02 -> osm_vendor_bind: Mgmt class 0x03 binding to port GUID 0x506b4b0300eeb1da
May 07 08:16:27 252995 [1ACAF740] 0x02 -> osm_vendor_bind: Mgmt class 0x04 binding to port GUID 0x506b4b0300eeb1da
May 07 08:16:27 253022 [1ACAF740] 0x02 -> osm_vendor_bind: Mgmt class 0x21 binding to port GUID 0x506b4b0300eeb1da
May 07 08:16:27 253042 [1ACAF740] 0x02 -> osm_vendor_bind: Mgmt class 0x0a binding to port GUID 0x506b4b0300eeb1da
May 07 08:16:27 253064 [1ACAF740] 0x02 -> osm_opensm_bind: Setting IS_SM on port 0x506b4b0300eeb1da
May 07 08:16:27 255565 [1703B700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
May 07 08:16:27 260954 [15838700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
May 07 08:16:27 261058 [12832700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
May 07 08:16:27 261255 [1002D700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
May 07 08:16:27 261557 [801D700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
May 07 08:16:27 261634 [901F700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
May 07 08:16:27 261911 [80E700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
May 07 08:16:27 262101 [FE80A700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
May 07 08:16:27 262266 [FD007700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
May 07 08:16:27 262301 [FB804700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
Entering MASTER state
May 07 08:16:27 264215 [D55B8700] 0x80 -> Entering MASTER state
May 07 08:16:27 379871 [D55B8700] 0x02 -> osm_ucast_mgr_process: minhop tables configured on all switches
May 07 08:16:27 386543 [D55B8700] 0x02 -> SUBNET UP
May 07 08:16:27 390885 [F87FE700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:401b:ffff::ffff:ffff
May 07 08:16:27 403083 [13033700] 0x01 -> log_trap_info: Received Generic Notice type:1 num:128 (Link state change) Producer:2 (Switch) from LID:3 TID:0x0000000b00000080
May 07 08:16:27 403166 [13033700] 0x02 -> SM class trap 128: Directed Path Dump of 1 hop path: Path = 0,1
May 07 08:16:27 403183 [13033700] 0x02 -> log_notice: Reporting Generic Notice type:1 num:128 (Link state change) from LID:3 GID:fe80::ec0d:9a03:1c:1d20
May 07 08:16:27 404502 [F02B700] 0x02 -> osm_spst_rcv_process: Switch 0xec0d9a03001c1d20 SwitchIB Mellanox Technologies port 37 changed state from DOWN to INIT
May 07 08:16:27 417411 [D55B8700] 0x02 -> osm_ucast_mgr_process: minhop tables configured on all switches
May 07 08:16:27 426313 [D55B8700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:64 (GID in service) from LID:6 GID:fe80::ec0d:9a03:1c:1d28
May 07 08:16:27 426330 [D55B8700] 0x02 -> state_mgr_report_new_ports: Discovered new port with GUID:0xec0d9a03001c1d28 LID range [7,7] of node: Mellanox Technologies Aggregation Node
May 07 08:16:27 426566 [D55B8700] 0x02 -> SUBNET UP
May 07 08:16:27 894293 [F87FE700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:401b:ffff::1
May 07 08:16:27 894359 [F87FE700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:401b:ffff::fb
May 07 08:16:27 894417 [F87FE700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1
May 07 08:16:27 894457 [F87FE700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ff99:ffa6
May 07 08:16:27 896557 [F87FE700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ffea:38d
May 07 08:16:27 896593 [F87FE700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::16
May 07 08:16:27 899413 [F87FE700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ff86:dccc
May 07 08:16:27 904320 [F87FE700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ff7e:6615
May 07 08:16:27 909901 [F87FE700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ff3e:490b
May 07 08:16:27 912734 [F87FE700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ff05:4519
May 07 08:16:27 913246 [F87FE700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ff2d:94a
May 07 08:16:27 916368 [F87FE700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ff99:c40e
May 07 08:16:27 917071 [F87FE700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ff83:1191
May 07 08:16:27 917702 [F87FE700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::1:ff6c:be4d
May 07 08:16:28 945871 [F87FE700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:66 (New mcast group created) from LID:6 GID:ff12:601b:ffff::2
May 07 08:17:59 974116 [F87FE700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:67 (Mcast group deleted) from LID:6 GID:ff12:601b:ffff::16
May 07 08:18:06 765047 [F87FE700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:67 (Mcast group deleted) from LID:6 GID:ff12:601b:ffff::2
May 07 09:01:55 058145 [12031700] 0x01 -> log_trap_info: Received Generic Notice type:1 num:128 (Link state change) Producer:2 (Switch) from LID:3 TID:0x0000000e00000080
May 07 09:01:55 058225 [12031700] 0x02 -> SM class trap 128: Directed Path Dump of 1 hop path: Path = 0,1
May 07 09:01:55 058236 [12031700] 0x02 -> log_notice: Reporting Generic Notice type:1 num:128 (Link state change) from LID:3 GID:fe80::ec0d:9a03:1c:1d20
May 07 09:01:55 059897 [E82A700] 0x02 -> osm_spst_rcv_process: Switch 0xec0d9a03001c1d20 SwitchIB Mellanox Technologies port 5 changed state from ACTIVE to DOWN
May 07 09:01:55 062428 [D55B8700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:67 (Mcast group deleted) from LID:6 GID:ff12:601b:ffff::1:ff7e:6615
May 07 09:01:55 062442 [D55B8700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:65 (GID out of service) from LID:6 GID:fe80::506b:4b03:28:d06a
May 07 09:01:55 062453 [D55B8700] 0x02 -> drop_mgr_remove_port: Removed port with GUID:0x506b4b030028d06a LID range [14, 14] of node:inode5 HCA-1
May 07 09:01:55 073259 [D55B8700] 0x02 -> osm_ucast_mgr_process: minhop tables configured on all switches
May 07 09:01:55 076831 [D55B8700] 0x02 -> SUBNET UP
I am also wondering whether I need to switch to AOC cables. Right now I am using five 3-meter DAC cables and six 2-meter DAC cables.
P.S. I have also attached the full log, which is too big to paste here.
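
Since the full log is long, a small script can pull out the events that matter for a failover question: OpenSM start banners, transitions to MASTER, switch port state changes, removed ports, and error codes. The following is only a minimal sketch, assuming the line layout seen in the excerpt above (`date time usec [thread] level -> message`); the function name `summarize` and the shape of the returned dictionary are my own choices, not part of OpenSM.

```python
import re
from collections import Counter

# Layout assumed from the excerpt: "May 07 07:56:23 390872 [5060E740] 0x03 -> message"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2}) \d{6} "
    r"\[(?P<thread>[0-9A-Fa-f]+)\] (?P<level>0x[0-9A-Fa-f]{2}) -> (?P<msg>.*)$"
)
PORT_RE = re.compile(r"port (\d+) changed state from (\w+) to (\w+)")
ERR_RE = re.compile(r"ERR ([0-9A-F]{4})")


def summarize(lines):
    """Collect start banners, MASTER transitions, port changes, removals, errors."""
    summary = {
        "starts": [],        # timestamps of OpenSM version banners (level 0x03)
        "master": [],        # timestamps of "Entering MASTER state"
        "port_changes": [],  # (timestamp, port, old_state, new_state)
        "removed": [],       # (timestamp, text of the drop_mgr_remove_port message)
        "errors": Counter(), # ERR code -> number of occurrences
    }
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines without a timestamp prefix (console echoes)
        ts, level, msg = m["ts"], m["level"], m["msg"]
        if level == "0x03" and msg.startswith("OpenSM "):
            summary["starts"].append(ts)
        if msg == "Entering MASTER state":
            summary["master"].append(ts)
        if msg.startswith("drop_mgr_remove_port: "):
            summary["removed"].append((ts, msg.split(": ", 1)[1]))
        port = PORT_RE.search(msg)
        if port:
            summary["port_changes"].append(
                (ts, int(port.group(1)), port.group(2), port.group(3))
            )
        err = ERR_RE.search(msg)
        if err:
            summary["errors"][err.group(1)] += 1
    return summary
```

To run it on the attached file, pass the open file object, for example `summarize(open("opensm.log"))` (the path is hypothetical; use wherever your log lives). In the excerpt above the start banner appears at 07:56:23 and again at 08:16:27, each followed by a transition to MASTER, so the `starts` and `master` lists should make that kind of pattern easy to spot across the whole log.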