I gave up, or "The history of my new ZFS filer"

Rand__

Well-Known Member
Mar 6, 2014
4,593
912
113
Sorry for the long post - it was an even longer process - but I am sure it will be of interest to some, or at least entertaining for the others :)

Prelude:

I have been running 3 and 4 node vSan clusters for 3+ years now, since I was looking for reliable VM hosting for my home environment. I run almost everything on VMs: firewall, monitoring, AD, file services and especially desktops & gaming boxes. My wife opted out in the early days (too much downtime), so it's only the kid and I on it, but I do have a quite high expectation set and he is o/c not known for his patience ;)

The original reason to move everything to VMs was noise and power saving - the former was achieved (zero clients), the latter ... well, let's not talk about that.

I have been running various vSan configurations during that period (2 node + remote witness, 3 node, and finally 4 node) and decided to stay at 4 nodes due to the increased resilience. O/c the 2 node vSan with witness on my VPN-connected remote backup server worked fine - until one day, due to I can't remember what, both of my local hosts went down at the same time ...
O/c I was running vCenter on vSan so it could move around ... but no vCenter, no dvSwitch... no dvSwitch, no vSan... no vSan, no vCenter... o/c that also meant no firewall/no VPN, no AD and actually no vSan hosted desktops. What a fun day :p

Since that time I have had an 'emergency admin box' (physical) in my office (directly connected to the router) so I can at least google things in case of errors.

The move from 3 nodes to 4 came when I was doing maintenance on one node and it did not come up properly any more, for whatever reason - no fun running on two nodes when downtime of either of them would cause the cluster to fail again... so eventually I moved to 4 for peace of mind.
This was reinforced when one day an ESXi update went haywire and knocked two of my 4 boxes out of the cluster due to some dvSwitch issues (no communication despite the vmks being there and the hosts being members of the dvSwitch; I needed to remove the host from the switch and recreate the vmks by re-adding it). This issue happened at least 10 times in those years... there was no proper CLI for the dvSwitch either, and the clone option in the GUI sucked. They have it under control now I think - it has not happened in a while.

So at that point there was a more or less stable cluster, so I could think about other aspects - performance.
During that unstable phase I had been moving VMs on and off vSan a lot - to local SSDs or a FreeNas box - and it sucked. vSan-hosted VMs also did not perform that well with file operations, so I decided I needed more performance. At that point I was looking to get 500MB/s for vMotions - it sounded like a nice number and would keep vMotion times acceptable even for those 40G Windows boxes or the 100G+ game VMs.
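To put that 500MB/s number in perspective, here is a quick back-of-the-envelope sketch (the VM sizes are the ones mentioned above; the lower throughput figures are just illustrative comparison points):

```python
# How long does a vMotion of a given VM image take at a sustained
# throughput? VM sizes (40 GB desktop, 100 GB game box) are from the
# post above; 100 and 250 MB/s are illustrative comparison points.

def transfer_minutes(vm_gb: float, mb_per_s: float) -> float:
    """Minutes to move vm_gb gigabytes at mb_per_s sustained MB/s."""
    return vm_gb * 1024 / mb_per_s / 60

for vm_gb in (40, 100):
    for speed in (100, 250, 500):
        t = transfer_minutes(vm_gb, speed)
        print(f"{vm_gb:>4} GB VM @ {speed} MB/s: {t:5.1f} min")
```

At the 500MB/s target, even the 100G game VMs move in under four minutes; at 100MB/s the same move takes over a quarter of an hour, which is where the impatience kicks in.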
 

Rand__

Chapter 1 - The need for speed
After having started with a ragtag collection of leftover drives (https://forums.servethehome.com/index.php?threads/vsan-design-considerations-for-soho.12444/) I had moved on to NVMe pretty quickly (6 months later: https://forums.servethehome.com/ind...an-for-hyperconverged-environment-home.14847/) but it did not work as expected.
At around that point I settled on a setup of Intel 750s as capacity drives and P3700s (400GB) as cache drives, which was in my opinion pretty beefy HW for what I was looking for ... but still no great performance for me.


O/c I do have a non-enterprise use case - single user (max 2), no hard-hitting VMs (IO-wise), so no parallelism to speak of... I tried to create some by playing around with the available vSan options (FTT, multiple disk groups, more hosts [up to 6 at a time]) and with more disks instead (8 S3700s in a disk group) - nothing helped increase single user performance.
So for my expectation level this still was not satisfying (https://forums.servethehome.com/index.php?threads/vmware-vsan-performance.19308/#post-187060).

It took me a very long time to realize that this is working as designed - vSan is an enterprise product, optimized to deliver acceptable performance for as many users as possible. Therefore individual performance is limited to a fraction of what is theoretically possible. Don't get me wrong, I am sure vSan can perform quite well if you hit it with typical enterprise scenarios - I probably could have run 200 users or more on the hardware I ran for 2...

upload_2020-4-11_13-4-19.png
Note: memory consumption includes 2 FreeNas boxes with significant allocation.

The only way I found to increase single user performance is to increase individual drive performance (remember this, I will come back to it) - so eventually I moved my vSan cluster from P3700s to Optane 280s and just recently to P4800Xs.

But back in the day I was looking for alternative solutions providing better performance.
 

Rand__

Chapter 2 - Looking for alternatives

An alternative to vSan should provide similar functionality while delivering better performance, so it needed to:
  • run as an HA-capable setup (2+ nodes)
  • provide safe hosting for VMs (no data loss/corruption on unexpected shutdowns)
  • be free or within an acceptable price range for non-production use
  • support all-flash/NVMe setups
I looked at everything I could find, but had to skip most due to cost (no free version) or supported-HW limitations (no all-flash, no cluster).
I think I tried Starwind, Compuverde, and ScaleIO on the HCI front, of course Napp-It/FreeNas on the ZFS side, and I also tried Ceph.

Starwind was performing very well, but required hardware RAID controllers and was Windows-only at the time. I strongly considered them, especially later when they had a Linux version and supported advanced features like RDMA/NVDimms, but they have a weird NFR policy - you can get an NFR license only once, and only for a single year. I didn't know that (the first part especially), so I didn't use it while I had it, since I had no hardware RAID controllers at the time. When I was ready to test again they graciously offered another 3 month test period, which I couldn't use due to yet another issue; at that point I learned about the NFR-only-once rule and decided to skip them. Yes, there is a free version, but at this point I was not willing to run that.
Also I think it's quite complicated to set up (or I never got the design), and the blog posts never seemed to describe my use case - they were either outdated or for the wrong OS.

Compuverde was nice since they have a cool display of what block sizes writes hit the storage with - I liked that very much for understanding the activity during vMotions or local machine tests. Unfortunately performance was not up to my expectation level, and despite them being very helpful it was not the tool I was looking for.

ScaleIO was very nice (slightly weird setup, CLI-only management, but ok) and was also performing very, very well (they slice writes into small chunks of 1 MB and distribute them over 2 of the 3 nodes, so they offered parallelism even on single thread writes). I was eagerly awaiting the new version, which was supposed to have better NVMe support iirc, but right then Dell moved it to customers only (https://forums.servethehome.com/ind...roduction-scaleio-gone-with-the-merger.19151/) so that was that.

I ran Ceph on a small 3 node cluster with everything on NVMe drives (no optimizations), and it performed below expectations (https://forums.servethehome.com/index.php?threads/ceph-benchmark-request-nvme.23601/), so I gave up on it.

That meant I was left with ZFS based systems, which I did not have the best success with either ... (https://forums.servethehome.com/index.php?threads/napp-it-not-scaling-well.17154/).
But since I've been running FreeNas since 2013, I at least know my way around it, and o/c reliability is key, right next to my performance expectations :)
 

Rand__

Chapter 3: Run baby, run

There are 3 aspects that make a ZFS filer good storage for VMs (actually there are 4, but I kind of ignored the HA aspect at first, or I could have given up at once).
  1. Local pool performance - if the pool does not perform adequately in local tests, it will not provide good performance to remote systems
  2. A proper SLOG - we want safe sync writes without the risk of VM corruption - yes, unlikely, but part of my expectation set
  3. Network performance - if I can't get it to the ESXi hosts, it's of no use
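For point 1, the kind of local test I mean is a single-job, queue-depth-1 fio run. A minimal sketch that just assembles such a command line (the flags are standard fio options; the target path, sizes and job name are placeholders to adapt):

```python
# Assemble a single-job, queue-depth-1 fio command of the kind used for
# local pool tests. All flags are standard fio options; the target path,
# sizes and job name are placeholders.

def fio_qd1_cmd(target: str, bs: str = "128k", size: str = "16g",
                rw: str = "write", sync: bool = False) -> list:
    cmd = [
        "fio",
        "--name=qd1test",
        f"--filename={target}",
        f"--rw={rw}",
        f"--bs={bs}",
        f"--size={size}",
        "--ioengine=psync",    # plain blocking IO: one request in flight
        "--iodepth=1",         # QD1
        "--numjobs=1",         # 1J: the interactive single-user case
        "--direct=1",          # bypass the client-side page cache
        "--group_reporting",
    ]
    if sync:
        cmd.append("--sync=1")  # O_SYNC writes - this is what hits the SLOG
    return cmd

print(" ".join(fio_qd1_cmd("/mnt/tank/fiotest")))
```

Run the same command with and without `sync=True` and you see exactly the two numbers that matter for a VM filer: raw pool speed and acknowledged-sync speed.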

Initially I have been focusing on the first two items since they are the ones that can be worked on most easily and directly.

The general wisdom for ZFS says that mirrors are better for VMs (IOPS) and that more devices in the pool will yield higher performance. Now, the former is getting less true by the day due to increased individual drive performance - an NVMe drive has massive IOPS already, so you can get away with RaidZ as well - and in my use case especially, "more drives = more performance" is just not true (Pool performance scaling at 1J QD1).
Yes, if you are running multiple threads *then* performance will scale with more vdevs to distribute the load on, but for my use case (1 user, interactive activity) this does not apply.
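Since QD1 throughput does not reward wide pools in my case, the layout decision mostly becomes capacity vs redundancy. A simplified sketch of usable capacity per layout (it ignores ZFS allocation overhead, padding and fill-level guidelines):

```python
# Simplified usable-capacity comparison: 2-way mirror vdevs vs a single
# RaidZ vdev. Ignores ZFS allocation overhead, padding and the usual
# fill-level guidelines - rough numbers only.

def mirror_usable_tb(n_drives: int, drive_tb: float) -> float:
    """Drives paired into 2-way mirror vdevs; half the raw capacity."""
    return (n_drives // 2) * drive_tb

def raidz_usable_tb(n_drives: int, drive_tb: float, parity: int = 1) -> float:
    """One RaidZ vdev across all drives, losing `parity` drives' worth."""
    return (n_drives - parity) * drive_tb

# Example: 12 x 800GB drives, as in the eventual build
print(f"6x2 mirror: {mirror_usable_tb(12, 0.8):.1f} TB usable")
print(f"RaidZ2    : {raidz_usable_tb(12, 0.8, parity=2):.1f} TB usable")
```

With QD1 performance roughly flat across layouts, the RaidZ option buys a lot of extra capacity - which is exactly why "few fast drives in a mirror or even a RaidZ" becomes defensible for this use case.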

This is the second truth, and it took me even longer to realize it.

I have been through a lot of hardware to get to that point; I started with 12 S3700s, moved up to 24 of them; moved to SAS SSDs since that was supposed to be better; even ran a striped NVMe pool of six Optane 900Ps - and it holds true ...

Pool performance scaling at 1J QD1

So at this point I arrived at the same result I found with vSan - to improve single user performance you need to improve individual drive performance (told you I'd come back to that). So despite all the general 'knowledge', for low user counts it's better to have a few faster, larger drives than lots of slower, smaller ones...

O/c this is at least slightly remediated by point #2 - the choice of ZFS SLOG, which can slow things down - and of course by ZFS caching, which ingests all writes and then flushes them to the storage drives in a more orderly way. O/c maybe I just have not found the correct tuning parameters to speed up the move from cache to drives and everything I just told you is crap - that would be great, b/c it would mean I could get more performance out of my drives after all - and I don't mind being wrong on this :)
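To illustrate why the SLOG device dominates sync-write performance: every sync write must be on stable storage before it can be acknowledged. This little experiment shows the cost of that guarantee on any plain filesystem (no ZFS needed), by comparing buffered writes against per-write fsync:

```python
import os
import tempfile
import time

# Why sync write latency (and hence the SLOG device) matters: each fsync
# forces data to stable storage before the write returns, which is what
# NFS/iSCSI sync semantics demand of a ZFS filer on every acknowledged write.

def write_file(path: str, n_blocks: int, block: bytes, sync_each: bool) -> float:
    """Write n_blocks copies of `block`; return elapsed seconds."""
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(n_blocks):
            f.write(block)
            if sync_each:
                f.flush()
                os.fsync(f.fileno())  # wait for the device, like a sync write
    return time.perf_counter() - start

block = b"x" * 4096
with tempfile.TemporaryDirectory() as d:
    t_async = write_file(os.path.join(d, "async.bin"), 256, block, False)
    t_sync = write_file(os.path.join(d, "sync.bin"), 256, block, True)
    print(f"buffered: {t_async * 1000:.1f} ms, fsync per write: {t_sync * 1000:.1f} ms")
```

The gap between the two numbers is what a fast SLOG (or NVDimm) shrinks: it gives the fsync path somewhere very fast to land.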
 

Rand__

Interjection:

As a side note - I have probably run tens of thousands of fio tests to get to that result - I have a 5000-line script for automated testing which I need to share one day. It's not done yet - I still have a ton of ideas to dump into it and probably a few bugs too - but since I am not sure whether I'll continue working on it, I just might post it somewhere if anyone is interested. It's for FreeNas at this point; it uses @nephri's disklist.pl, which prevents it from running on ZoL or Solarish - I was too lazy to recreate that functionality.
It runs fio and dd, has automated pool creation/test/destruction, and gathers a lot of metrics while running (never got around to putting those into Grafana or another time-series database, though). Output is a bit unpolished for various reasons but could easily be boiled down to a directly graphable CSV file.
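On the "directly graphable CSV" point - if fio is run with `--output-format=json`, a few lines of Python flatten the result into CSV rows. The field names below follow fio's JSON layout as I know it (bw in KiB/s, clat_ns.mean in nanoseconds); treat the exact paths as assumptions to verify against your fio version:

```python
import csv
import io
import json

# Reduce a fio --output-format=json result to one CSV row per job.
# Field names follow fio's JSON layout (bw in KiB/s, clat_ns.mean in ns);
# verify against your fio version, they have shifted across releases.

def fio_json_to_csv(fio_json: str, direction: str = "write") -> str:
    data = json.loads(fio_json)
    out = io.StringIO()
    w = csv.writer(out)
    w.writerow(["job", "bw_MBps", "iops", "lat_ms"])
    for job in data["jobs"]:
        d = job[direction]
        w.writerow([
            job["jobname"],
            round(d["bw"] / 1024, 1),              # KiB/s -> MiB/s
            round(d["iops"], 1),
            round(d["clat_ns"]["mean"] / 1e6, 3),  # ns -> ms
        ])
    return out.getvalue()

# Minimal synthetic example in fio's shape:
sample = json.dumps({"jobs": [{"jobname": "qd1test",
                               "write": {"bw": 512000, "iops": 4000.0,
                                         "clat_ns": {"mean": 250000.0}}}]})
print(fio_json_to_csv(sample))
```

From there it is one step to feed the rows into a spreadsheet or a time-series database.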
 

Rand__

Back to the main story:

So, SLOG - the single most important piece of hardware in deciding ZFS filer VM performance. I think I had most of the commonly available ones at one point or another ... S3700, P3700, Zeus in the early days, 900p, 4800X, NVRam and finally NVDimm (-N).

Short version of a long, long search - a 900p is probably fine for you, but if you have a compatible board, get NVDimm-Ns (not -P / Optane Persistent Memory - I think those are slower, but I have not tested them since I don't have second generation Xeon Scalable).
SLOG benchmarking and finding the best SLOG - they rock.

O/c "if you have a compatible board" is the very, very short culmination of 6 months of work, which I need to do a separate post about one day. Let's just say it was quite complicated, but maybe @DinoRS can help you out if in need - he fixed his issues impressively fast :)


The third aspect is network performance or rather the current lack thereof.
I've been running high speed networking for years now - a 10G FreeNas filer in 2013 when I started, moving to FDR when @mpogr created his brilliant conversion thread - and I even have way too much 100G stuff at home at this point, since I felt limited in single thread performance.
O/c this was before I realized truth #2, but until then I thought that maybe the 14Gbit a single process can push over a single FDR link might somehow be causing the slow network performance I saw for my filers or vSan.
I see the folly now, especially after discovering truth #2, but at that time I was grasping at straws ...

However, one good thing came out of the whole 100G experiment - an understanding of RDMA (not that you need 100G for that, but it was a side effect of looking deeper into networking options, capabilities and error analysis).
If you look at the chart you will see that remote performance is abysmal - the faster the local pool performs, the less of it (percentage-wise) you get on the client end [bandwidth in MB/s left, percentage of remote vs local perf right]

(fio BS=128K, pool recordsize=128K, no atime; FreeNas = Xeon Gold 6150 (18C), 512GB RAM, HGST SS300 800GB or Optane 4800X 375GB U.2; client = dual Xeon Gold 5115; both with Chelsio T62100-LP-CR adapters in x16 slots; SLOG for sync was a 16GB NVDimm-N DDR4-2666)

upload_2020-4-11_13-5-6.png

The industry solution for this is o/c RDMA, but unfortunately RDMA is not there yet (for ESXi & ZFS filers - I will discuss why ESXi and FreeNas shortly). I expect this will be the next major boost in on-premises performance and a major step towards Fibre Channel latency levels (in a good way). But - not there yet (Building/Compiling FreeNAS with OFED/iSER support). I am still sad VMware canned IB, since this would have had RDMA for free :/
 

Rand__

The competition:
O/c there are a lot of alternative solutions out there which potentially might provide better performance (talking about Proxmox, or a ZoL filer). The problem with that is that I am running VMWare Horizon as my VDI solution, and I have no idea whether I could simply migrate that to Proxmox. Also, I was only ever looking to improve the performance of what I had, which already gave me a ton of topics to look into without adding another hypervisor to the mix - maybe a mistake - happy to hear it if that's so :)

Re ZoL - I strongly considered this, but at this very point (with TrueNas on Linux looming ahead, ZoL honestly not being really stable yet performance-wise, it not being where I wanted it, and a new construction site to boot) I decided to skip it for now.

NappIT/Solarish* - I like it very much (even if the GUI is a bit antique), and @gea is a great guy and so very helpful all the time - I would have loved to run it with a pro-for-home license to give at least something back.
Unfortunately Solarish* does not do NVDimms - can't do it, sorry.
 

Rand__

Finals:

So, a few weeks back, after coming to terms with the scalability issue (and the resulting needless spending on tons of hardware), I realized that I had kind of lost the drive to continue searching. I decided that I would give up and settle for a less than optimal solution. The area I decided to give up on was Perfect ZFS Filer aspect #4 - HA capability.
Up until now I had always envisioned that, after finding the proper hardware, I would run it on @gea's vCluster-in-a-Box design to have a fully redundant, HA-capable filer that I could replace vSan with.

Now the lack of NVDimm support in Solaris put that thought to an end. The realization that performance does not scale with drive count (for QD1 1J) removed the need to scale up to 20+ drives per filer... I probably can get away with a few large ones in a mirror, or even a RaidZ - ZFS currently does not scale well beyond one or two devices for that workload anyway.
So I can put away the HGST SAS3 JBODs and the Supermicro CSE-417 I got for scaling up; I also won't need the tons of SSDs I bought ... but so be it :)

I would still love an HA-capable filer, but there is no out-of-the-box solution for this - FreeNas does not do it, and while it's possible on ZoL, at this point I have no nerve left for it.
I also realized one other thing - vSan has become much more stable with time. Originally I had to move VMs between local drives, vSan and FreeNas all the time, and that was the part which was really annoying.
Nowadays I rarely move VMs around/off of vSan (except for backup purposes), and it runs ok-ish for that.

So I decided that I will stay with vSan as the main HA-capable daily driver and will only move high performance VMs to the ZFS filer. The filer will also be the backup/clone/temp target location, b/c it just works (and is so much faster too).
(I updated vSan to P4800Xs in the meantime, with P3600/P4510 2TB as capacity.)
 

Rand__

The solution:

After this long journey, here is the build - a physical FreeNas box, since ESXi does not like my NVDimms.

Board: Supermicro X11SPH-nCTPF (BIOS 2.1, not the newer 2.2 - that didn't play nice)
CPU: Xeon Gold 5122 (single thread perf for QD1 1J)
Memory: 512GB DDR4-2666 (what can I say, I got lucky on a few 128G modules. But board - memory - NVDimm compatibility is another story to tell)
NVDimm: single 16GB Micron NVDimm-N DDR4-2666 with PGem
Drives: 12 SAS3 HGST SS300 800GB in a 6x2 mirror after all, with NVDimm slog
Network: Mellanox CX5 (I had it, not really needed o/c) running on an SX6036
HBA: Broadcom 9305-24i

It draws about 200-250 W from the line depending on load, but that also includes my main storage pool with an extra HBA and a 900p in it.

I have not run a lot of performance tests, to be honest - a few vMotions on and off vSan and a single CDM run (https://forums.servethehome.com/index.php?threads/vsan-3node-real-numbers.27193/#post-259760) - but it's fast enough for me (it satisfies the 500MB/s goal) for now. Or maybe I am just tired of tinkering for a change.

Hope you enjoyed the read, felt I should share after all this time :)
 

itronin

Well-Known Member
Nov 24, 2018
420
255
63
Denver, Colorado
Brilliant! Well written, engaging! Thank you @Rand__ for sharing your experience and journey! You've confirmed a lot of what I've been getting a gut feeling for over the last year. I pulled out my CSE-417 last month and sold the SAS3 drives.

I must say though I do like the TrueNAS HA I deployed at a customer. But it ain't free, won't ever be free...

Ultimately I draw my own conclusion: sometimes IT just needs to be good enough - but you still have to do it "right".
 
  • Like
Reactions: SRussell

Stephan

IT Professional
Apr 21, 2017
140
46
28
Bavaria, Germany
> O/c i was running vCenter on vSan so it could move around ... but no vCenter, no dvSwitch... no dvSwitch no vSan... no vSan no vCenter... o/c that meant also no Firewall/no VPN, no AD and actually no vSan hosted desktops

That line made me cringe. :) It is actually a very typical mistake I see often with clients. System was built for redundancy, but in the end one huge dependency chain exists where everything is down once any one component breaks. Whoops.

What I have found from my home lab is that semi-pro/pro hardware has gotten good enough that within, say, a 5 year cycle of using the hardware there will be only one big failure in the whole stack. And it will be easier to just fix that one problem than to invest days on end into sophisticated HA setups, which introduce more problems than they solve. The once-in-5-years problem might be anything from a broken PSU, to dead drives (I do use RAID for HDDs, but not for SSDs anymore), to a lightning strike taking out the VDSL modem (I have a couple as spares), etc.

Much more key these days, imho, is how fast you can restore data from backup after user error. I switched everything to borgbackup and Bareos with LTO5+6 tape backup. Not the cheapest option, but the data can be read back from the media after 10 years, no problem. I also like tape's data durability, which is impossible with flash storage, a big gamble with SATA drives and still a smallish gamble with SAS drives.
 
  • Like
Reactions: SRussell and gb00s

Rand__

> O/c i was running vCenter on vSan so it could move around ... but no vCenter, no dvSwitch... no dvSwitch no vSan... no vSan no vCenter... o/c that meant also no Firewall/no VPN, no AD and actually no vSan hosted desktops

That line made me cringe. :) It is actually a very typical mistake I see often with clients. System was built for redundancy, but in the end one huge dependency chain exists where everything is down once any one component breaks. Whoops.
I agree - it ran fine for a while, was the cheapest option (power saving), and I actually had no clue about the dvSwitch's dependency on vCenter ... lesson learned. I ran a vCenter cluster for a while, but now I make sure I have a fairly recent backup available on the filer.
I do have tape too for long term storage, so not all is lost in case of major issues, but honestly it's not as current as it should be. Issues with the tape chassis... I have a new build for that on the (large) ToDo list :)
 

vangoose

Active Member
May 21, 2019
268
70
28
Canada
I'm down to 5 physical servers from a dozen: 4 ESXi and 1 backup. The FW is a physical Juniper.
1 ESXi for management and storage: 1 AD, 2 vCenters and 2 FreeNAS/Solaris VMs.
3 ESXi in a cluster for the rest.
1 CentOS backup server running NetBackup.

Tried vSan but was not happy with it. Didn't like FreeNAS either, but 11.3 is kind of ok. I have 2 storage VMs: 1 NVMe storage box for the VM datastore, 1 SAS capacity server for NFS/CIFS.
 

muhfugen

Active Member
Dec 5, 2016
132
39
28
> O/c i was running vCenter on vSan so it could move around ... but no vCenter, no dvSwitch... no dvSwitch no vSan... no vSan no vCenter... o/c that meant also no Firewall/no VPN, no AD and actually no vSan hosted desktops

That line made me cringe. :) It is actually a very typical mistake I see often with clients. System was built for redundancy, but in the end one huge dependency chain exists where everything is down once any one component breaks. Whoops.
The fix for that is to use ephemeral port binding on the port group carrying the ESXi and vCenter traffic - that way vCenter isn't required to be up to bind a port.
 
  • Like
Reactions: SRussell and Rand__

muhfugen

Thanks - will have to look into that :)
Here is a VMware KB article on the subject:
You can assign a virtual machine to a distributed port group with ephemeral port binding on ESX/ESXi and vCenter, giving you the flexibility to manage virtual machine connections through the host when vCenter is down. Although only ephemeral binding allows you to modify virtual machine network connections when vCenter is down, network traffic is unaffected by vCenter failure regardless of port binding type.
Note: Ephemeral port groups are generally only used for recovery purposes, when there is a need to provision ports directly on a host, bypassing vCenter Server.
VMware Validated Designs, for example, use these for the Management Domain to help allow flexibility in the management cluster in the event of a vCenter outage.
However it would be discouraged to make every port group ephemeral:

And where you set it:
 
Last edited:
  • Like
Reactions: Rand__

Rand__

Very good thanks :)
O/c at this time, with 4 nodes, there are few issues; everything recovers fine even from a full power down - but this would be an important thing in case I go smaller again.

I dimly remember some dvSwitch issues with ephemeral ports too, but that was years ago - no idea re the details.