I have a lot of time sunk into this, including a couple of days with MS Support (who was useless), but I haven't been able to get to the bottom of this issue. Any help would be greatly appreciated.
Environment:
Windows Server 2019 Cluster (fully patched - both OS and HP firmware)
2 hosts/nodes, each with 10 NICs in total: (8) 1 Gb and (2) 10 Gb
iSCSI storage array directly cabled to the 10 Gb ports on the hosts
MPIO enabled
File Share Witness for quorum
Cluster Shared Volumes
No Exchange or SQL (no Availability Groups)
All functionality working for several years (backups, live migrations, etc.)
Cluster passes all validation testing before and after the issue (the driver was reinstalled)
To get to the point, the NIC driver on Host2 failed. Those NICs disappeared from Control Panel/Network & Sharing. They were still visible in Device Manager, but in an offline/error state. This driver serves the (8) NICs [(2) 4-port cards] that make up the VMTeam. When this happened, the heartbeat was lost and both hosts were ejected from the cluster (Host1 ejected Host2, and Host2 ejected Host1). Despite the isolation time being set to the default 240 seconds, the VMs paused for at least 10-15, if not longer. During that time, the VM status showed Paused - Critical I/O issues. Most of the VMs eventually resumed, but they were ALL corrupted so badly that they needed full restores. The iSCSI connections on Host2 were never impacted, since they use a different NIC with a different driver. The Host2 OS was stable, with no other issues.
Question 1:
When Host2 had the NIC failure, it lost connectivity with the File Share Witness, but Host1 did not. Given this, why wasn't Host1 elected the "winner" sooner? I thought that was the point of the witness.
Question 2:
How did the VMs on Host1 become corrupted? It didn't experience any software or hardware failures. I don't think even the VMs on Host2 should have been corrupted, but I really can't understand what could have caused such an impact on the VMs on Host1.
Question 3:
How much of this is expected behavior in Hyper-V clusters? I've experienced similar problems in ESXi environments, and there were only minor impacts.
Thanks very much!