I use a VM running OmniOS r151024ap and napp-it 18.01b as a storage server for a VMware 6.0u3 cluster. Around 30 VMs use 8 iSCSI LUNs for their storage. It is mostly very stable, but about twice a year the storage VM stops responding to iSCSI requests from the ESXi hosts. After an incident, the vmkernel log shows messages like those in the first excerpt below.
The hostd log shows all LUNs being lost shortly after the iSCSI "connection closed by peer" messages (second excerpt).
The LUNs remain disconnected until the storage VM is rebooted. Before the reboot, the VM responds to ping on all of its NICs and does not appear overloaded. After the reboot, all LUNs quickly come back and the VMware datastores are operational again.
Code:
2019-06-02T15:11:57.699Z cpu23:32870)nmlx4_en: vmnic3: nmlx4_en_RxQAlloc - (partners/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:628) RX queue 2 is allocated
2019-06-02T15:11:57.708Z cpu23:32870)nmlx4_en: vmnic3: nmlx4_en_QueueApplyFilter - (partners/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2145) MAC RX filter (class 1) at index 0 is applied on
2019-06-02T15:11:57.708Z cpu23:32870)nmlx4_en: vmnic3: nmlx4_en_QueueApplyFilter - (partners/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2152) RX ring 2, QP[0x5a], Mac address 00:0c:29:e3:b6:ca
2019-06-02T15:13:22.708Z cpu26:32870)nmlx4_en: vmnic3: nmlx4_en_QueueRemoveFilter - (partners/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2325) MAC RX filter (class 1) at index 0 is removed from
2019-06-02T15:13:22.708Z cpu26:32870)nmlx4_en: vmnic3: nmlx4_en_QueueRemoveFilter - (partners/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:2332) RX ring 2, QP[0x5a], Mac address 00:0c:29:e3:b6:ca
2019-06-02T15:13:22.708Z cpu26:32870)nmlx4_en: vmnic3: nmlx4_en_RxQFree - (partners/mlnx/nmlx4/nmlx4_en/nmlx4_en_multiq.c:789) RX queue 2 is freed
...
2019-06-02T15:14:48.081Z cpu0:33527)WARNING: iscsi_vmk: iscsivmk_ConnReceiveAtomic: vmhba37:CH:0 T:1 CN:0: Failed to receive data: Connection closed by peer
2019-06-02T15:14:48.081Z cpu0:33527)WARNING: iscsi_vmk: iscsivmk_ConnReceiveAtomic: Sess [ISID: 00023d000001 TARGET: iqn.2010-08.org.illumos:san04-t0 TPGT: 1 TSIH: 0]
2019-06-02T15:14:48.081Z cpu0:33527)WARNING: iscsi_vmk: iscsivmk_ConnReceiveAtomic: Conn [CID: 0 L: 10.2.5.14:34960 R: 10.2.5.32:3260]
2019-06-02T15:14:48.081Z cpu0:33527)iscsi_vmk: iscsivmk_ConnRxNotifyFailure: vmhba37:CH:0 T:1 CN:0: Connection rx notifying failure: Failed to Receive. State=Online
2019-06-02T15:14:48.081Z cpu0:33527)iscsi_vmk: iscsivmk_ConnRxNotifyFailure: Sess [ISID: 00023d000001 TARGET: iqn.2010-08.org.illumos:san04-t0 TPGT: 1 TSIH: 0]
2019-06-02T15:14:48.081Z cpu0:33527)iscsi_vmk: iscsivmk_ConnRxNotifyFailure: Conn [CID: 0 L: 10.2.5.14:34960 R: 10.2.5.32:3260]
2019-06-02T15:14:48.081Z cpu0:33527)WARNING: iscsi_vmk: iscsivmk_StopConnection: vmhba37:CH:0 T:1 CN:0: iSCSI connection is being marked "OFFLINE" (Event:6)
2019-06-02T15:14:48.081Z cpu0:33527)WARNING: iscsi_vmk: iscsivmk_StopConnection: Sess [ISID: 00023d000001 TARGET: iqn.2010-08.org.illumos:san04-t0 TPGT: 1 TSIH: 0]
2019-06-02T15:14:48.081Z cpu0:33527)WARNING: iscsi_vmk: iscsivmk_StopConnection: Conn [CID: 0 L: 10.2.5.14:34960 R: 10.2.5.32:3260]
Code:
2019-06-02T15:14:57.519Z info hostd[42A03B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 101143 : Lost access to volume 5b9fdd93-bdc2efeb-3fe7-002590c7c9c0 (l0401) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
2019-06-02T15:14:57.520Z info hostd[42A03B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 101144 : Lost access to volume 5a5e0f16-7d3c0820-d1c0-90e2ba160560 (l0402) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
2019-06-02T15:14:57.521Z info hostd[42A03B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 101145 : Lost access to volume 57c089bc-d0040083-f68c-90e2ba160560 (l0403) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
...
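When this happens the sessions never recover on their own. Next time, before rebooting the storage VM, I plan to check the initiator side from the affected ESXi host with something like the following (adapter name taken from the log above; vmk1 is only a placeholder for the storage vmkernel port):
Code:
# state of the iSCSI sessions/connections on the software iSCSI adapter
esxcli iscsi session list --adapter=vmhba37
esxcli iscsi session connection list --adapter=vmhba37

# is the target portal still reachable from the storage vmkernel port?
# (replace vmk1 with the actual storage vmk interface)
vmkping -I vmk1 10.2.5.32

# try a rescan to see whether paths come back without rebooting the storage VM
esxcli storage core adapter rescan --adapter=vmhba37
esxcli storage core path list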
The physical host uses ConnectX-3 Pro 10G NICs; the nmlx4 driver reports no errors in the logs. The physical storage NIC is on a separate storage VLAN.
The storage VM is configured with one e1000 NIC for LAN access and two vmxnet3 NICs serving two storage subnets. open-vm-tools is version 10.1.15 from r151024ap.
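For reference, this is roughly how I check the NIC layout inside the storage VM (e1000g*/vmxnet3s* are the default illumos driver names; 10.2.5.32 is the target address from the vmkernel log, the second storage subnet is omitted here):
Code:
# links: expect e1000g0 (LAN) plus vmxnet3s0 / vmxnet3s1 (storage)
dladm show-link

# addresses: one per storage subnet, e.g. vmxnet3s0 on 10.2.5.32
ipadm show-addr

# confirm the MTU matches the vSwitch / vmkernel port MTU
dladm show-linkprop -p mtu vmxnet3s0 vmxnet3s1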
Incidents don't seem correlated with periods of high storage activity. I haven't increased logging levels in OmniOS, and the default logs don't show much.
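For the next incident I intend to capture the target-side state inside the storage VM before rebooting, roughly like this (service FMRIs are the stock OmniOS/COMSTAR ones; I have not yet verified that any of this catches the culprit):
Code:
# overall COMSTAR / iSCSI target service health
svcs -xv svc:/system/stmf:default svc:/network/iscsi/target:default
stmfadm list-state        # STMF operational status
stmfadm list-lu -v        # are the LUs still online?
itadm list-target -v      # target state and session count

# is the portal still listening, are the ESXi connections stuck?
netstat -an -f inet | grep 3260

# anything from FMA or the system log around the time of the incident?
fmdump -eV | tail -n 50
tail -n 100 /var/adm/messages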
Has anyone seen similar behavior? Any suggestions for preventing this?
Thanks.