Technical debt 1+2: Solving the quad bay raidz1 issue / migrating away from iSCSI
So this one is relatively easy to solve, and as Nick from Bad Obsession Motorsports would've put it, "it's just money".
To summarize, at the start we have:
1. An HP t740 hypervisor with a 64GB eMMC drive containing VMWare ESXi 6.5, connected to the HP Microserver G7 N40L over 40GbE using a Mellanox ConnectX-3 VPI card...and the N40L hosts its iSCSI datastore. There are 12 VMs hosted there, one of which is a HomeAssistant VM instance which controls a bunch of devices via a ZWave hub (passed through from the t740's USB port) and app integration with veSync (which is used for controlling various smart light sockets in the house).
2. An HP Microserver G7 N40L with 16GB of ECC DDR3, 4 4TB HGST Deskstar drives, a Mellanox ConnectX-3 VPI card, and a single 256GB SATA SSD hosting the OS (TrueNAS Core 13.0 U6). It's the home SMB/CIFS server and the iSCSI storage target for the t740. The 4 4TB drives are arranged in a raidz1 zpool (basically raid5 for zfs). The same zpool is used for both SMB/CIFS and the iSCSI bit bucket.
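As a rough aid for comparing the pool layouts that come up in this post, here is a minimal sketch of the nominal capacity arithmetic. The function name `usable_tb` is made up for illustration, and the numbers ignore ZFS metadata, padding and the TB-vs-TiB difference, so real pools will report somewhat less.

```shell
# usable_tb LAYOUT DRIVES SIZE_TB
# Hypothetical helper: prints nominal usable capacity in TB for a single vdev
# (or, for striped-mirror, a set of 2-way mirror pairs).
usable_tb() {
    layout=$1; drives=$2; size=$3
    case "$layout" in
        raidz1)         echo $(( (drives - 1) * size )) ;;  # one drive's worth of parity
        raidz2)         echo $(( (drives - 2) * size )) ;;  # two drives' worth of parity
        mirror)         echo "$size" ;;                     # every drive holds the same data
        striped-mirror) echo $(( drives / 2 * size )) ;;    # 2-way mirror pairs, striped
        *) echo "unknown layout: $layout" >&2; return 1 ;;
    esac
}

usable_tb raidz1 4 4   # the pool described above: prints 12
```

With 4 8TB drives, both `usable_tb raidz2 4 8` and `usable_tb striped-mirror 4 8` come out at 16, versus 24 for raidz1, which is the trade-off to keep in mind when the pool gets rebuilt later.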
The first thing to do is something about the raidz1 array (which contains ~2.9TB of data), and that will require 4 power-downs, one per drive. First we shut down most of the VMs on the ESXi hypervisor (on the t740) so we can unmount the iSCSI datastore, and once that's done, turn off the t740, deactivate iSCSI and SMB/CIFS sharing on TrueNAS Core, and then go into the storage/disk page to mark the failing drive as offline. Once that's done, power the N40L down and swap the bad 4TB drive out for an 8TB drive (I got a bunch of Seagate Skyhawk AI drives at 180 USD each during Black Friday). Then power it back up, go into TrueNAS Core, mark the new drive as the replacement for the offline one, and watch the resilver run (it took ~2h 30 minutes).
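For the curious, the GUI steps above roughly correspond to a handful of ZFS commands. The sketch below only prints them rather than running them. The pool name `pool01` is assumed from the `/mnt/pool01` path used later in this post, the device names `ada1`/`ada4` and the function name `swap_plan` are made up, and on TrueNAS the web GUI remains the sensible way to do this since it handles partitioning and labels for you.

```shell
# swap_plan POOL OLD_DEV NEW_DEV
# Hypothetical helper: prints (does not run) the ZFS commands behind one drive swap.
swap_plan() {
    pool=$1; old=$2; new=$3
    echo "zpool offline $pool $old"       # take the outgoing drive offline first
    echo "# power down, swap the physical drive, power up"
    echo "zpool replace $pool $old $new"  # attach the new drive; this starts the resilver
    echo "zpool status $pool"             # shows resilver progress and any errors
}

swap_plan pool01 ada1 ada4
```

If the new drive shows up under the same device name as the old one, `zpool replace` also accepts just the pool and that one device name.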
While this is happening, we prep the new t755 - 64GB of RAM, a large SSD (I have a 1TB Crucial P3, the retail version of the Micron 2550 NVMe drive), a Mellanox ConnectX-3 VPI (not needed, but I still want one installed just in case) and Proxmox 8.2 (the newest release at the time the project started; it is now on Proxmox 8.3). Installing took a few minutes, and it's as easy as connecting it to the network and waiting for the install to finish, then going through the Proxmox web GUI to set up the rest.
In the meantime, the next step is to disconnect the thin client from the 40GbE link(s), open it up and pop in a new SSD - in this case, I got a pile of Teamgroup MP33 512GB SSDs for a decent price, and the way the t740/t755s are set up, you have 2 slots, either of which can do NVMe or SD7 eMMC. The t740 can also do SATA, but I don't think the t755 can. The first slot on the t740 has a 64GB eMMC drive used for VMWare ESXi 6.5's datastore1. Normally the hypervisor boots up, connects to the iSCSI target on the Microserver and mounts the datastore so the VMs can be seen/started. So what do we do? Well, stick a blank 512GB MP33 into the second slot, plug the 40GbE links back in, and then bring the t740 with ESXi back up. Leave iSCSI off on the N40L for now.
In the meantime, get my laptop (an HP EliteBook 845 G9) ready with VMWare Workstation 17 and VMWare Converter installed...which meant dealing with Broadcom's obnoxious new SSO scheme/paywall. I had to use my work email to log into Broadcom and grab VMWare Workstation/Player (free for personal use, but you will need a non-gmail email to access it).
By this time the resilvering of the first drive should be done, so the N40L with TrueNAS Core should have 3 4TB drives and 1 8TB drive. Offline another 4TB drive, power the N40L down again and swap it out for an 8TB. That leaves us with a "good" used 4TB drive that we can repurpose...which is a good thing. Power up the N40L, do the replacement assignment, and resilver. In the meantime, take that 4TB drive and connect it to a SATA to USB3 adapter (they should be fairly common), and then connect it to the t755. I think you should have a good idea what I am planning to do. At this stage you should have the t755 Proxmox host (essentially running Linux) waiting for VMs to be loaded, which can also be used to store up to 4TB of data.
So at this point I have 2 options - I can do an rsync over ssh between the t755 and the TrueNAS box and copy the files off the zpool (which at this point has 2 new drives and 2 old drives) onto that good 4TB drive (and hope like hell it won't fail)...or I can replace/resilver all 4 old drives in the raidz1 zpool with new ones first to ensure that it's safe, and then do something with the 2 other old 4TB drives. I chose the latter, and waiting for it was not a waste of time since I had iSCSI turned back on during the resilver.
I then go to the t740, remount the iSCSI datastore, go into the datastore browser within ESXi 6.5, format/prep the new SSD and designate it as datastore2, and make sure that the iSCSI datastore is readable and that all VMs are off. Then within the datastore browser we just have to move the various folders containing the VMs from the iSCSI datastore to the datastore2 SSD (while the pool is resilvering in the background), unregister the VMs from the old location and re-register them from the new one. I was able to copy everything off the iSCSI datastore during one drive's resilver, so I could then unmount the iSCSI datastore and turn iSCSI off on the N40L. This effectively ended the iSCSI dependency between the t740 and the N40L, and I could then spend the next 3 hours swapping/reassigning/resilvering the 2 remaining 4TB drives with 8TB replacements. So what did I do with the 2 other 4TB drives?
I got my wife a Terramaster F2-423 dual-bay chassis (for 240 USD during Black Friday) so she can use it as an SMB-hosted Time Machine backup target for her Macbook Pro. The software is honestly mediocre, but the chassis is easy to upgrade and fun to mess with. Pop in 2 x 8GB of DDR4 SODIMM (I have some left over from upgrading my Framework 13 Alder Lake i5-1240p laptop) and another Teamgroup MP33 512GB SSD (the Terramaster has 2 NVMe slots), yank out their stupid little 4GB flash drive containing their operating system, and in goes TrueNAS Core as well. That one is getting a mirrored (yeah, raid1) zpool, but before I turn on SMB/CIFS, it'll be used as a secondary backup for the N40L's raidz1 zpool.
Once it's done...fire up VMWare Converter on the laptop and copy the VMs out of the ESXi hypervisor into a folder on the laptop (I have a 512GB external SSD formatted to exFAT ready to go). The smart thing to do is to power each VM on the t740 hypervisor back up as soon as its copy is done so the downtime window is limited - starting with the HomeAssistant VM. So at the end of the day you want the following:
1. One working 2 bay mini-NAS with a ~4TB zpool.
2. One t740 hypervisor with its VMs hosted on its own datastore.
3. One HP Microserver Gen 7 N40L with its 4 4TB HGST Deskstar storage drives (1 with error messages) swapped out and replaced with 4 8TB Seagate Skyhawk AI drives...which is still used to host the raidz1 zpool. The OS drive is a 256GB SSD sitting where the optical drive was supposed to be. That one might be failing though.
4. One laptop with an attached external exFAT-formatted USB drive that contains a bunch of VMs migrated off the hypervisor.
5. One HP t755 hypervisor running Proxmox, with a single 4TB SATA HDD attached as a USB drive, just waiting for something to happen.
Then all we need to do is set up ssh keys between the t755, the N40L and the mini-NAS, run screen (so you can detach from the ssh session without killing the copy) and on both the t755 and the Terramaster, do an rsync over ssh to the effect of:
rsync -h -v -r -P -t root@N40L:/mnt/pool01/FileShare01/ FileShare01/
(or wherever you want the fileshare contents to be hosted). This rsync copy from the N40L to both the t755 and the 2 bay NAS took about 13 hours and moved roughly 2.9TB to each, and at 40-80MB/sec, it's no joke.
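As a sanity check on those numbers, here is a quick sketch of the transfer-time arithmetic. The function name `transfer_hours` is made up, and it uses decimal units (1 TB = 1000 GB = 1,000,000 MB), so treat the output as a ballpark.

```shell
# transfer_hours SIZE_GB RATE_MB_PER_S
# Hypothetical helper: prints the copy time in hours, to one decimal place.
transfer_hours() {
    awk -v gb="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", gb * 1000 / rate / 3600 }'
}

transfer_hours 2900 40   # slow end of the observed range: 20.1
transfer_hours 2900 80   # fast end of the observed range: 10.1
transfer_hours 2900 62   # an average of ~62MB/sec: 13.0
```

So 13 hours for 2.9TB works out to an average of a bit over 60MB/sec, right in the middle of the observed range.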
When it's all done, do some spot checks, delete the raidz1 zpool on the N40L, recreate it using the 4 8TB drives under raidz2 or mirrored/striped (same as raid10), and then rsync everything back in from either the Terramaster OR the t755, and once done, turn SMB/CIFS back on. For me, this took another 9 hours (since this time I am only receiving from one source, rather than sending to 2). This effectively solved the raidz1 issue.
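One way to do the spot checks before deleting anything is to compare checksums on a random sample of files. This is a minimal sketch, not what I necessarily ran: the function name `spot_check` is made up, it assumes both trees are reachable as local paths on a Linux box with GNU coreutils (`shuf`, `sha256sum`), such as the Proxmox host, and it will trip over filenames containing newlines.

```shell
# spot_check SRC_DIR DST_DIR [COUNT]
# Hypothetical helper: compares SHA-256 checksums for up to COUNT randomly
# chosen files (default 20) under SRC_DIR against the same relative paths
# under DST_DIR. Returns non-zero if any sampled file is missing or differs.
spot_check() {
    src=$1; dst=$2; count=${3:-20}
    tmp=$(mktemp) || return 1
    # Build the sample list as paths relative to SRC_DIR.
    (cd "$src" && find . -type f | shuf -n "$count") > "$tmp"
    rc=0
    while IFS= read -r f; do
        if [ ! -f "$dst/$f" ]; then
            echo "MISSING $f"; rc=1
            continue
        fi
        a=$(sha256sum < "$src/$f" | cut -d' ' -f1)
        b=$(sha256sum < "$dst/$f" | cut -d' ' -f1)
        if [ "$a" != "$b" ]; then
            echo "DIFFERS $f"; rc=1
        fi
    done < "$tmp"
    rm -f "$tmp"
    return $rc
}
```

Called as `spot_check /path/to/original /path/to/copy 50` (both paths are placeholders), it prints nothing and returns 0 when every sampled file matches. For a full comparison instead of a sample, a dry-run rsync with `-n -c` does the job at the cost of reading everything again.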
In the meantime, we could detach the external USB drive containing the VMs from the laptop, connect it to the t755 directly, and start importing the VMDKs/making new replacement VMs for the Proxmox hypervisor.
That would be for...Technical debt 3 - how to migrate VMs from ESXi to Proxmox.