Hyper-V Cluster Failure - trying to ID root cause


general_sigh

New Member
Mar 24, 2025
I have a lot of time sunk into this, including a couple of days with MS Support (who was useless), but I haven't been able to get to the bottom of this issue. Any help would be greatly appreciated.

Environment:
Windows Server 2019 Cluster (fully patched - both OS and HP firmware)
2 hosts/nodes, each with 10 NICs total: (8) 1 GbE and (2) 10 GbE
iSCSI storage array directly cabled to the 10 GbE ports on the hosts
MPIO enabled
File Share Witness for quorum
Cluster Shared Volumes
No Exchange or SQL (No availability Groups)
All functionality working for several years (backups, live migrations, etc)
Cluster passes all validation testing before and after the issue (Driver was reinstalled)

To get to the point: the NIC driver on Host2 failed. Those NICs disappeared from Control Panel / Network & Sharing; they were still visible in Device Manager, but in an offline/error state. This driver covers the (8) NICs [(2) 4-port cards] that make up the VMTeam. When it happened, the heartbeat was lost and each host ejected the other from the cluster (Host1 ejected Host2, Host2 ejected Host1). Despite the isolation time being set to the default 240 seconds, the VMs paused for at least 10-15 minutes, if not longer. During that time, the VM status showed "Paused - Critical I/O issues". Most of the VMs eventually resumed, but they were ALL corrupted so badly they needed full restores. The iSCSI connections on Host2 were never impacted (different NIC, different driver), and the Host2 OS was otherwise stable with no other issues.
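For reference, the isolation/resiliency settings can be confirmed from either node with something like the below. The property names are the standard 2016/2019 cluster properties; I'm only assuming this cluster still has the defaults (ResiliencyDefaultPeriod = 240).

  # VM resiliency / isolation and heartbeat settings for the cluster
  Get-Cluster | Format-List Name, ResiliencyLevel, ResiliencyDefaultPeriod, QuarantineThreshold, QuarantineDuration, SameSubnetDelay, SameSubnetThreshold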



Question 1:
When Host 2 had the NIC failure, it lost connectivity with the File Share Witness, but Host 1 did not. Given this, why wasn't Host1 elected the "winner" sooner? I thought this was the point of the witness?

Question 2:
How did the VMs on Host 1 become corrupted? It didn't experience any software/hardware failures. I don't think the VMs on Host 2 should have even been corrupted, but I really can't understand what could have happened to cause such an impact to the VMs on Host 1.

Question 3:
How much of this is expected behavior in Hyper-V Clusters? I've experienced similar problems in ESXi environments and there were only minor impacts.

Thanks very much !
 

DavidRa

Infrastructure Architect
Aug 3, 2015
Central Coast of NSW
www.pdconsec.net
OK so I have some questions.
  • Where is the FSW? Is it in cluster or external?
  • What vSwitches do you have and which pNICs are assigned to those vSwitches? What vNICs do you have configured?
  • How are cluster roles assigned to NICs? Specifically - which vNICs are used for CSV, which vNICs for LM, which vNICs are used for Cluster comms vs Client comms? Are roles assigned to pNICS too?
  • Which host owned the CSVs during the failure? What's the configuration for preferred and possible owners?
For your questions:

1. If the two hosts could still communicate on any network (hence my question about what cluster roles are assigned and where) they won't force an election.
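To see what the cluster actually thinks it has for networks, and what each one is allowed to carry, run something like this on either node (names will obviously differ on your cluster):

  # Cluster networks, their role (None / Cluster only / Cluster and Client) and state
  Get-ClusterNetwork | Format-Table Name, Role, State, Address, Metric
  # Which adapter backs each cluster network on each node
  Get-ClusterNetworkInterface | Format-Table Node, Network, Adapter, State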

2. My initial hypothesis is that Host 2 was the volume owner, so it's responsible for metadata management on the volume. This might have led to the corruption, but it's a weak hypothesis IMO.
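You can check the current owners easily enough, though it only shows where they live now, not where they were at the time of the failure:

  # Current coordinator node for each CSV
  Get-ClusterSharedVolume | Format-Table Name, OwnerNode, State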

3. Not much. You've seen it run for years already without issue. I've deployed HV clusters for nearly 2 decades without seeing this type of thing; it's not something that comes into planning.
 

general_sigh

New Member
Mar 24, 2025
  • Where is the FSW? Is it in cluster or external?
    • A NAS that's connected to the same switch as the nodes
  • What vSwitches do you have and which pNICs are assigned to those vSwitches? What vNICs do you have configured?
    • Just one - VMTeam. All 8 of the pNICs are assigned to it. The only vNIC is the vEthernet (VMTeam)
  • How are cluster roles assigned to NICs? Specifically - which vNICs are used for CSV, which vNICs for LM, which vNICs are used for Cluster comms vs Client comms? Are roles assigned to pNICS too?
    • Cluster Network 2 = Cluster Only
    • Cluster Network 3 = Cluster & Client
    • No roles are assigned to pNICS
  • Which host owned the CSVs during the failure? What's the configuration for preferred and possible owners?
    • I don't know... is there any way to look at it historically?
    • For preferred owners, the only thing we have set is for our DCs: DC1 is preferred for Host1 and DC2 is preferred for Host2

Thank you very much for your help with this!
 

general_sigh

New Member
Mar 24, 2025
Some further information regarding your question about which host owned the CSV:
I have two CSVs: DS-A [Volume1] and DS-B [Volume2], and my VMs are spread across them. When the NICs on Host2 failed, Host1 lost access to DS-A and Host2 lost access to DS-B, so to answer your question:
When the cluster failed, Host1 owned DS-B and Host2 owned DS-A.

Sample error message from HOST1: Cluster Shared Volume 'Volume1' ('DS-A') has entered a paused state because of 'STATUS_CONNECTION_DISCONNECTED(c000020c)'. All I/O will temporarily be queued until a path to the volume is reestablished.
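In case it's useful to anyone else reading: that's event 5120 from the System log, and something along these lines should pull the related events on each host (5142 is the "no longer accessible" one as I understand it; I'm less sure which IDs track ownership moves):

  # CSV paused / lost events from the System log (FailoverClustering provider)
  Get-WinEvent -FilterHashtable @{
      LogName      = 'System'
      ProviderName = 'Microsoft-Windows-FailoverClustering'
      Id           = 5120, 5142
  } | Format-Table TimeCreated, Id, Message

  # Or generate the detailed cluster log for the window around the failure (minutes)
  Get-ClusterLog -Destination C:\Temp -TimeSpan 60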



Mapping overview (Host / VM / CSV):
Host1 - VM-34 - DS-A
Host1 - VM-X - DS-B
Host1 - VM-26 - DS-B
Host1 - VM3 - DS-A

Host2 - VM-35 - DS-B
Host2 - VM-33 - DS-B
Host2 - VM-29 - DS-B
Host2 - VM-1 - DS-A

Questions:
1) Why did each host lose access to the other's CSV despite their iSCSI connections staying up? I read something that said the CSV owner shares access via an SMB2 share, but if that's true, it seems like poor design on Microsoft's side to have an iSCSI volume's access depend on LAN connectivity. In VMware, I'm pretty sure there is a storage heartbeat from each host to the storage and no reliance from host to host?

2) Should I have been purposely setting preferences so that VMs run on the host that owns the CSV they are stored on? For example, should VMs stored in DS-B have had a preference for Host1 since it is/was the owner?

3) Both servers also show this log at the time of failure:
The Cluster Service service terminated with the following service-specific error:
A quorum of cluster nodes was not present to form a cluster.

I would expect this on Host2 because it lost all LAN connectivity, but Host1 did not, and it should have been able to communicate with the File Share Witness. Isn't this what the File Share Witness is for?

This is a crap ton of questions... thanks for any and all answers.
 

DavidRa

Infrastructure Architect
Aug 3, 2015
Central Coast of NSW
www.pdconsec.net
OK that looks to me like each node lost contact with the other and with the FSW. That would likely be because the access from each node to both additional votes is via the single point of failure - your single vNIC for Cluster and Client access. In that scenario, you have each node with one out of two (or three) votes - which is not more than 50%. Note the language - 50% of the votes isn't enough.

Right, Q1.

CSVs have multiple paths to things. In general, you have one node which manages the metadata and all other nodes use SMB3 to co-ordinate. This is for things like actually opening the file, extending the file size in the case of a VHDX, and so on - changes to the underlying NTFS or ReFS filesystems.

The iSCSI paths (and you should have two, with independent connectivity to the storage controllers) are used for block access - read the block at location 0x32198560, write the block at 0x12581920 and so forth. So the majority of disk IO is direct.
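A quick way to see which mode each node is actually using for a CSV (direct block I/O vs. redirected over SMB) is:

  # Per-node CSV access mode: Direct, FileSystemRedirected or BlockRedirected
  Get-ClusterSharedVolumeState | Format-Table Name, Node, StateInfo, FileSystemRedirectedIOReason, BlockRedirectedIOReason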

The fact that you have only a single vNIC on the (8 port) vSwitch, but your cluster configuration lists "Cluster Network 2" and "Cluster Network 3" suggests there's some minor misconfiguration happening. Maybe you don't have quite the connectivity you expected. Rather than try to divine your specific config, I'll describe what I'd generally do here:
  • Your 8 ports in a vSwitch are fine, though honestly, overkill. You'll only get 1Gb per stream out of this configuration regardless of whether you use SET, NetLbfo teaming with LACP (no longer supported) etc. I don't usually go above 4 ports in a SET vSwitch.
  • vNIC 1 for Host Management (Mgmt)
    • Add-VMNetworkAdapter -ManagementOS -SwitchName vSwitch0 -Name Mgmt
  • vNIC 2 for CSV (optional VLAN, but recommended)
    • Add-VMNetworkAdapter -ManagementOS -SwitchName vSwitch0 -Name CSV; Set-VMNetworkAdapterVlan -ManagementOS -VMNetworkAdapterName CSV -Access -VlanId 200
  • vNIC 3 for Live Migration (optional VLAN, but recommended)
    • Add-VMNetworkAdapter -ManagementOS -SwitchName vSwitch0 -Name LM; Set-VMNetworkAdapterVlan -ManagementOS -VMNetworkAdapterName LM -Access -VlanId 300
  • Optionally, you can double-up on the CSV and LM vNICs, and use adapter pinning to control which underlying adapters are used - but given your failure mode, this won't help. If you're interested, you want the Lenovo S2D documentation (Microsoft Storage Spaces Direct (S2D) Deployment Guide)
  • pNIC1 for iSCSI (subnet A)
  • pNIC2 for iSCSI (subnet B)
  • Configure the cluster to prioritize CSV traffic on CSV vNIC and LM vNIC (or CSV1 and CSV2)
  • Configure the cluster to prioritize Live Migration traffic on LM vNIC and CSV vNIC (or LM1 and LM2)
  • Configure the cluster admin point to use the Mgmt vNIC
  • Configure the cluster networks as follows (there's a PowerShell sketch after this list):
    • vNIC1 Mgmt - Cluster and Client
    • vNIC2 CSV - Cluster Only
    • vNIC3 LM - Cluster Only
    • pNIC1 - None
    • pNIC2 - None
  • Make sure you have multiple connections to the storage, and MPIO correctly configured - I expect you did already, but ... common problem. The specific MPIO policy isn't important in the general case, so I go round-robin by preference but it's fine to set it up with a primary/failover config if that suits the array better.
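Very roughly, the non-iSCSI pieces of that look like the below in PowerShell. Treat it as a sketch - the adapter and cluster network names are placeholders, so substitute whatever Get-NetAdapter and Get-ClusterNetwork actually report on your hosts.

  # 4-port SET vSwitch (adapter names are placeholders)
  New-VMSwitch -Name vSwitch0 -NetAdapterName 'NIC1','NIC2','NIC3','NIC4' -EnableEmbeddedTeaming $true -AllowManagementOS $false

  # Cluster network roles: 3 = Cluster and Client, 1 = Cluster only, 0 = None
  (Get-ClusterNetwork 'Mgmt').Role    = 3
  (Get-ClusterNetwork 'CSV').Role     = 1
  (Get-ClusterNetwork 'LM').Role      = 1
  (Get-ClusterNetwork 'iSCSI-A').Role = 0
  (Get-ClusterNetwork 'iSCSI-B').Role = 0

  # Lower metric = preferred for CSV / cluster traffic (setting it disables AutoMetric)
  (Get-ClusterNetwork 'CSV').Metric = 900
  (Get-ClusterNetwork 'LM').Metric  = 1000

Live Migration network preference is easiest to set in Failover Cluster Manager (Networks > Live Migration Settings).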
Since you have an iSCSI array, which I assume provides HA (multiple controllers) and hopefully ALUA access, I'd slightly prefer a small 1 GB quorum disk over an FSW, though it's definitely an "old school" approach and not a requirement. This specific failure may not have been as bad in that alternate configuration, but it's not a slam dunk - mostly because there's no resiliency there for the CSV access.
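If you did go that way, swapping the witness is a one-liner once the small LUN has been added as a cluster disk (the disk resource name below is just an example):

  # Replace the file share witness with a disk witness
  Set-ClusterQuorum -DiskWitness 'Cluster Disk 3'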

So Q2.

Nope. No point. The whole idea is that the volumes are controlled anywhere in the cluster, and the VMs run anywhere in the cluster. I actually suspect that both CSVs were owned by the node that had the network failure.

Q3. Check your quorum and vote configurations. Node 1 should have had access to the FSW and taken ownership. The fact that it didn't suggests it could not access/control that FSW vote.
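Once the cluster is healthy, something like this will show the witness and how the votes are currently assigned:

  # Witness / quorum configuration
  Get-ClusterQuorum | Format-List Cluster, QuorumResource, QuorumType
  # NodeWeight is the configured vote, DynamicWeight is what the cluster is currently counting
  Get-ClusterNode | Format-Table Name, State, NodeWeight, DynamicWeight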
 

general_sigh

New Member
Mar 24, 2025
Thank you for all the detail. I had just been reviewing best practices for the storage array in use here and, along with what you noted, the best practice is to use two subnets for the iSCSI communication. The current setup uses only one. How much of a role could that have played in the VM corruption?
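I assume the current sessions/paths can be sanity-checked with something like this while I rework it (Microsoft iSCSI initiator and the in-box MPIO DSM assumed):

  # Each session/connection shows the initiator/target address pair in use;
  # with two subnets and MPIO there should be sessions on both
  Get-IscsiSession
  Get-IscsiConnection

  # MPIO's view of the claimed disks and paths
  mpclaim.exe -s -d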