Hyper-V Two-Node Failover Cluster Help


chadtn

New Member
Jan 15, 2025
6
0
1
Hoping you guys could point me in the right direction for configuring a Hyper-V failover cluster that can handle live migrations. I think I have everything set up to the point that I just need to bring some shared storage online, but I started getting a little overwhelmed with all of the different technologies available.

Background:
I started out with a single Server 2022 Datacenter host running Hyper-V. It's a dual-CPU Epyc 7B13 (128 cores total) setup with 512GB of DDR4. For VM storage I set up a simple (striped) four-column Storage Spaces pool on four 2TB NVMe drives. I also set up another storage pool running a two-column mirror on four 12TB spinning drives. So the VMs all run on 8TB of striped NVMe storage and I've been making VM backups on 24TB of redundant HDD storage. Everything lived behind a virtualized Opnsense firewall and I've had about a dozen or so VMs chugging along for the last six months.

I've recently added a second node with the same hardware and software specs as the first one. The only real change so far is that I added Mellanox ConnectX-5 CDAT cards and have the hosts directly connected with DAC cables over a 200Gb SET team. The second node is hosting another Opnsense firewall and I've spent the last 2-3 weeks sorting out all of the HA network configuration. I've been able to successfully fail over the network between virtual firewalls without dropping any connections on the VMs.

Where I'm at right now:
I migrated all of my VMs to a different NVMe drive and removed the old VM storage pool. The cluster has been deployed and I set up an SMB file share witness running on a Raspberry Pi. I've created separate vNICs on the 200Gb SET team for cluster, cluster storage, and live migration traffic. The only warnings I'm getting from the cluster validation wizard are that I haven't mapped vNICs to physical adapters and that RDMA is enabled but the vNICs are reporting "Unknown" for the RDMA technology.
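For anyone chasing the same two validation warnings from PowerShell, here is a minimal sketch. The node names and the vNIC name are placeholders (assumptions, not taken from the post). The cmdlets are the stock FailoverClusters and NetAdapter ones.

```powershell
# Placeholder node names; substitute your own.
$nodes = "HV-NODE1", "HV-NODE2"

# Re-run only the network and S2D validation categories instead of the full wizard.
Test-Cluster -Node $nodes -Include "Network", "Storage Spaces Direct"

# Confirm the file share witness is the active quorum resource.
Get-ClusterQuorum

# Show which adapters (physical and vEthernet) report RDMA as enabled.
Get-NetAdapterRdma | Format-Table Name, InterfaceDescription, Enabled

# Host vNICs usually need RDMA enabled explicitly.
# "vEthernet (Storage)" is a hypothetical adapter name.
Enable-NetAdapterRdma -Name "vEthernet (Storage)"
```

Whether this clears the "Unknown" RDMA technology warning probably also depends on the vNIC-to-physical mapping, so treat it as a starting point rather than a fix.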

Here is where I may have screwed up. I ran "Enable-ClusterStorageSpacesDirect" and didn't notice that it scooped up my redundant HDD storage pool on node 1 and is presenting it as Clustered Windows Storage. I had intended that to remain standalone. Is it OK to leave it that way, or should I try removing it? Other than that, everything looks good. Both nodes show eight NVMe drives (four on each side) as available disks in the primordial Clustered Windows Storage pool.

What I had originally envisioned was each node striping its four drives and mirroring changes to the other node for the shared storage. I thought this was something I could configure directly through S2D, but I started reading about nested S2D resiliency, scale-out file servers, and several other options that left me a little overwhelmed. heh.. Could anyone give some advice on how to proceed? I'm not sure if I'm supposed to create the storage pools for each node in Server Manager and then use Cluster Manager to mirror those striped volumes as a cluster pool, or if I should be looking at some of the other options.

Thanks!

Chad
 

chadtn

Maybe I wasn't as far off as I thought. I just watched a video that showed I was supposed to pass some extra parameters naming the CSV storage when I ran "Enable-ClusterStorageSpacesDirect" so that it would show up as a pool in Cluster Manager. I might wait to see if anyone has some advice before messing with it any more.

Should I try to roll back "Enable-ClusterStorageSpacesDirect", then run it again naming the NVMe CSV storage pool and excluding my old HDD pool? I'm mainly worried about the HDD pool since it has all my backups on it. The NVMe disks were already empty in anticipation of rolling out the shared storage.
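Before rolling anything back, it may help to see exactly which pools and disks the cluster now owns. This is a read-only sketch that changes nothing. "BackupPool" is a hypothetical name standing in for the HDD pool.

```powershell
# List every pool and whether the cluster considers it clustered or primordial.
Get-StoragePool |
    Format-Table FriendlyName, IsPrimordial, IsClustered, IsReadOnly, HealthStatus

# Show the member disks of the backup pool ("BackupPool" is a placeholder name).
Get-StoragePool -FriendlyName "BackupPool" |
    Get-PhysicalDisk |
    Format-Table FriendlyName, SerialNumber, MediaType, Size

# Show disks that are still unclaimed and eligible for pooling.
Get-PhysicalDisk -CanPool $true |
    Format-Table FriendlyName, SerialNumber, BusType, MediaType, Size
```

Recording the serial numbers of the backup disks first makes it easier to confirm later that none of them ended up somewhere unintended.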

Thanks!

Chad
 

chadtn

I was able to successfully create an S2D two-way mirror CSV with four columns. I migrated all of the VMs over to the clustered storage last night and did some testing. I was able to RDP into a VM and live migrate it to a different node without any noticeable impact while I was clicking on random web pages in the browser.
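For reference, a volume like the one described can be created with something along these lines. This is a sketch under assumptions, not the exact command used: the volume name and size are made up, and "S2D*" assumes the pool kept the default "S2D on <cluster name>" naming.

```powershell
# Create a two-way mirror CSV with four columns.
# "VMStore" and the 4TB size are illustrative placeholders.
New-Volume -StoragePoolFriendlyName "S2D*" `
           -FriendlyName "VMStore" `
           -FileSystem CSVFS_ReFS `
           -ResiliencySettingName Mirror `
           -PhysicalDiskRedundancy 1 `
           -NumberOfColumns 4 `
           -Size 4TB

# Check that the resulting virtual disk has the expected layout.
Get-VirtualDisk -FriendlyName "VMStore" |
    Format-List FriendlyName, ResiliencySettingName, NumberOfDataCopies, NumberOfColumns, FootprintOnPool
```

With a two-way mirror, NumberOfDataCopies should report 2 and FootprintOnPool should be roughly twice the volume size.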

So far everything appears to be working, but I may need to work on the network settings some more. I didn't map vNICs to physical NICs when I created the SET team since both 100Gb interfaces are on the same card and the hosts are directly connected. I'm reading that this can affect how SMB3 handles multichannel streams. One thing I noticed during the migrations was that all of the traffic appears to be going over the vNIC I created for live migrations, but it's capping out at 10Gbps of throughput. Prior to clustering I was able to push a sustained 130Gbps over the SET team just by throwing a few extra cores at iperf. Not sure if that was hitting a PCIe, single-thread CPU, or DDR4 limit, but I was happy enough at the time to move on to something else. heh..
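The vNIC-to-physical mapping mentioned above is done with Set-VMNetworkAdapterTeamMapping. This is a minimal sketch. The vNIC names ("Storage", "LiveMigration") and port names ("Port1", "Port2") are placeholders, so list the real ones first.

```powershell
# Discover the actual host vNIC and physical adapter names.
Get-VMNetworkAdapter -ManagementOS | Format-Table Name, SwitchName
Get-NetAdapter -Physical | Format-Table Name, InterfaceDescription, LinkSpeed

# Pin each host vNIC to one physical port of the SET team.
# All four names below are hypothetical.
Set-VMNetworkAdapterTeamMapping -ManagementOS `
    -VMNetworkAdapterName "Storage" -PhysicalNetAdapterName "Port1"
Set-VMNetworkAdapterTeamMapping -ManagementOS `
    -VMNetworkAdapterName "LiveMigration" -PhysicalNetAdapterName "Port2"

# Confirm the mappings took effect.
Get-VMNetworkAdapterTeamMapping -ManagementOS
```

The same mapping would need to be applied on both nodes. Whether it changes throughput on a back-to-back, single-card setup like this one is something to measure rather than assume.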

Does anyone know how to specify which vNIC S2D uses for SMB3 traffic? My internet search recommended "Enable-ClusterS2D -S2DPreferredNetwork", but I think that may have been deprecated. From what I read, it sounds like SMB3 tries to use all available pathways instead of a single one. I primarily created the different vNICs to segregate traffic so I could apply QoS settings in the future, but the 200Gb pipeline may not really need it.
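One way to see what SMB is actually doing, and to experiment with steering it, is sketched below. The node name and interface alias are placeholders. Whether the live migration performance option explains the 10Gbps ceiling is only a guess, so treat it as something to test rather than a known fix.

```powershell
# Interfaces SMB considers usable, including RSS/RDMA capability and link speed.
Get-SmbClientNetworkInterface

# Active multichannel connections and the interfaces they selected.
Get-SmbMultichannelConnection

# Optionally restrict SMB traffic toward the other node to one interface.
# "HV-NODE2" and "vEthernet (Storage)" are hypothetical names.
New-SmbMultichannelConstraint -ServerName "HV-NODE2" -InterfaceAlias "vEthernet (Storage)"

# Check how live migration moves data (TCPIP, Compression, or SMB).
Get-VMHost | Format-List VirtualMachineMigrationPerformanceOption, MaximumVirtualMachineMigrations

# Switch live migration to SMB so it can use multichannel and RDMA.
Set-VMHost -VirtualMachineMigrationPerformanceOption SMB
```

A constraint can be removed again with Remove-SmbMultichannelConstraint if it turns out to hurt more than help.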

Thanks!

Chad
 

chadtn

I'll throw a few tips in here that I learned the hard way. Might help people with similar setups or someone can tell me what else I did wrong. heh..

For my virtual Opnsense HA networking to work with Hyper-V, I had to make sure all of the vNICs assigned to the firewalls had "Enable MAC address spoofing" checked on the Advanced Features tab. I also never could get Opnsense CARP to work if the vSwitch was created with SR-IOV enabled. All of the networking started working perfectly as soon as I created a new vSwitch without SR-IOV and moved everything over to that one. I spent a week or two trying to figure those two simple little things out. If I had known that from the start, things would have been sooo much easier on the network side.
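Both settings can also be checked and applied from PowerShell. This is a sketch with made-up names: "opnsense-fw1" for the firewall VM, "SET-NoIov" for the new switch, and "Port1"/"Port2" for the physical adapters.

```powershell
# Turn on MAC address spoofing for every vNIC on the firewall VM.
Get-VMNetworkAdapter -VMName "opnsense-fw1" |
    Set-VMNetworkAdapter -MacAddressSpoofing On

# See which existing switches were created with SR-IOV.
Get-VMSwitch | Format-Table Name, IovEnabled, EmbeddedTeamingEnabled

# SR-IOV is fixed when a switch is created, so avoiding it means building a new switch.
# The adapters must not already be bound to another vSwitch.
New-VMSwitch -Name "SET-NoIov" `
             -NetAdapterName "Port1", "Port2" `
             -EnableEmbeddedTeaming $true `
             -EnableIov $false `
             -AllowManagementOS $true
```

Creating or rebinding a switch interrupts host networking on those adapters, so it is best done from a console session rather than over the link being changed.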

On the storage side of the house, I had been using standalone Storage Spaces pools for a couple of years and didn't pay much attention when I enabled clustered S2D. BE VERY CAREFUL HERE....Clustered S2D is a whole different ball game. I was good at first because I already had ~200TB of spinning drives offline and passed through directly to a VM for Chia farming. I had another 48TB of spinning drives already pooled on one host for backups. The only unclaimed disks when I enabled clustered S2D were the eight NVMe drives I wanted to use for shared VM storage. I was able to create a two-way mirror on the NVMe drives and everything ran GREAT for a week until I decided to add some more drives to the system.

If you just turn clustered S2D on in Server 2022 with "Enable-ClusterStorageSpacesDirect", its default behavior is to claim every poolable drive in the system and place it into one pool. I was fine when the only available drives were the ones I wanted in that pool. I found out the hard way that the default behavior is also to take any drive you add after the fact and slam it into that same storage pool.

Here's where I really screwed up. I wanted to build out some additional storage pools and play around with different levels of tiering and redundancy for workloads other than just hosting my VM system files. I went through my closet, pulled out 12 old SATA SSDs and another 48TB of spinning drives of unknown condition, and split them evenly across both nodes. S2D snatched them all up into my VM storage pool and I've just spent the last week or so migrating my data off. I ended up in storage hell because at least one of the drives I added turned out to be bad. The more I tried to salvage the initial cluster, the longer the rebuilds/repairs would take. I finally settled on migrating everything off, destroying the cluster, and recreating the shared storage with clean disks specifically added to storage pools of my choosing. heh..
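For anyone wanting to avoid the same trap, the auto-claim behavior can be switched off. This is a sketch assuming a fresh cluster with nothing pooled yet. "VMPool" is a made-up pool name, and filtering on NVMe is just one way to pick the disks.

```powershell
# Enable S2D without letting it build a pool automatically.
Enable-ClusterStorageSpacesDirect -Autoconfig:0

# Stop the health service from adding newly inserted disks to an existing pool.
Get-StorageSubSystem Cluster* |
    Set-StorageHealthSetting -Name "System.Storage.PhysicalDisk.AutoPool.Enabled" -Value False

# Confirm the setting.
Get-StorageSubSystem Cluster* |
    Get-StorageHealthSetting -Name "System.Storage.PhysicalDisk.AutoPool.Enabled"

# Pick only the disks that should go into the VM pool (here: poolable NVMe drives).
$nvme = Get-PhysicalDisk -CanPool $true | Where-Object { $_.BusType -eq 'NVMe' }

# Build the pool by hand from exactly those disks.
New-StoragePool -StorageSubSystemFriendlyName "*Cluster*" `
                -FriendlyName "VMPool" `
                -PhysicalDisks $nvme
```

On a cluster that already has a pool, the AutoPool setting alone should be enough to keep later additions out of it, but it is worth checking Get-PhysicalDisk -CanPool $true after adding a drive to be sure it stayed unclaimed.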

Thanks!

Chad
 