The ultimate ZFS ESXi datastore for the advanced single User (want, not have)

Discussion in 'Solaris, Nexenta, OpenIndiana, and napp-it' started by Rand__, Nov 3, 2019.

  1. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,592
    Likes Received:
    544
    So I've been trying to build the ultimate ZFS storage system for a while now, and have thrown tons of time and probably even more money at it without actually getting *there*. So I thought I'd start fresh and ask for help from real experts, since my tinkering didn't work out:p

    I will leave all the old issues out for now and start with a fresh idea if possible (but of course reuse hardware where I can in the actual build).

    Now I am totally aware that this is highly dependent on what the goal/expectation is, so here are mine (I am sure I have stated these quite often in my various rants/complaint threads, but just to have it all in one place):

    • I want to run an ESXi cluster with shared storage on a ZFS system, ideally via NFS (so sync writes).
    • I want the ZFS system to be HA capable with at least 2 nodes.
    • I want a write speed of 500 MB/s per VM when moving from off-system to on-system storage and vice versa (cold or hot vMotion).
    Now let's look at this regardless of the hardware I already have - if I had unlimited funds, what hardware would be capable of achieving this?
    If you think an off-the-shelf all-flash array can, then let's hear that too (although most don't cater to single users), but of course that is most likely not the way to go. Even the info that, say, a 5-node cluster with 24 NVMe drives each might be needed would be helpful (if you have actually seen it deliver and haven't only read the specs {which rarely show QD1/1T values anyway})

    Thanks:)
     
    #1
  2. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,273
    Likes Received:
    752
    The cluster/HA aspect is secondary if you use a ZFS dual-head active/passive cluster with multiport SAS/NVMe. It only becomes special with iSCSI/FC or single-port SATA/NVMe on ESXi, where disk/network/ESXi performance may be a limiting factor depending on the setup.

    If you have a single-server ZFS configuration with dual-port SAS/NVMe disks that meets your performance needs, the cluster just means adding a second head (barebone or virtualized) connected to the second ports of the SAS/NVMe disks.
     
    #2
  3. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,592
    Likes Received:
    544
    I agree, but so far I haven't found even a single-server solution that satisfies my requirements (with ZFS).

    From your experience, what kind of hardware (number/type of disks + CPU) might be able to provide the desired performance?:)
     
    #3
  4. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,273
    Likes Received:
    752
    500 MB/s sequential pool performance is not a problem; easy to go > 1 GB/s. With a lot of concurrent or small IO this looks very different. What I would try is to start with some Optane 900s in a raid-0 setup with as much RAM as available and a recsize like 32k or 64k when using VMs. This marks the upper end of what is possible with ZFS on your hardware. Then you can scale back to fewer Optanes, to NVMe or 12G SAS/SATA SSDs instead, or use other cost-limiting options like a dedicated Slog/special vdev with a slower pool. If you care about sync performance, that again is very special: Optane can land at around 800 MB/s sequentially, NVMe flash maybe at 500 MB/s.
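    As a rough sketch of that upper-end test (pool, dataset and device names are just examples, adjust to your controllers):

    Code:
    # names below are examples only
    # stripe (raid-0) of two Optane 900P to find the upper end of the hardware
    zpool create testpool c1t0d0 c2t0d0

    # smaller recsize for the VM filesystem, lz4 is practically free
    zfs create -o recordsize=64k -o compression=lz4 testpool/vmstore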

    When your pool can satisfy your needs, then it comes down to the network. 10G Ethernet without TCP tuning and Jumboframes lands at around 400 MB/s. With Jumboframes and tuning this can double. Above 1 GB/s you are in the region of a very special and uncommon setup.
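    For Jumboframes that means something like this on both ends (interface, vSwitch and vmkernel names are just examples):

    Code:
    # OmniOS/illumos side: raise the link MTU (ixgbe0 is an example interface)
    dladm set-linkprop -p mtu=9000 ixgbe0

    # ESXi side: the vSwitch and the NFS/vMotion vmkernel port must match
    esxcli network vswitch standard set -v vSwitch1 -m 9000
    esxcli network ip interface set -i vmk1 -m 9000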

    As ZFS is a local, high-security filesystem, performance is limited by the performance of a local pool (and is not as good as filesystems without the safeguards of ZFS). A cluster filesystem can scale beyond that, as every node only has to deliver a part of the overall performance. But as there is overhead as well, there is a break-even point of x nodes required to perform better than a single local filesystem.
     
    #4
    Last edited: Nov 4, 2019
  5. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,592
    Likes Received:
    544
    Sequential is indeed not an issue, but up until now I have not managed to get VMs to move at that speed. Not sure whether that's due to the smaller blocksize VMware is using or the not exclusively sequential IO.

    My primary issue though was the inability of ZFS (on both OmniOS and FreeBSD) to scale even close to linearly with more devices (even sequentially).
    I have run tests with 4 Optane drives and 12 SAS3 (HGST SS300) and neither reached the expected levels -

    here is a fio chart showing 1-6 mirror vdevs, 3 GHz CPU, 1M blocksize, 1M recordsize (QD1, numjobs=1)

    [attached chart: fio results for 1-6 mirror vdevs, 1M blocksize/recordsize, QD1/T1]

    And I have similar issues all the time whenever I get to 1 GB/s or more (sometimes it's as bad as here, sometimes the next mirror just adds a single-digit performance improvement)
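    For reference, a run like the ones in the chart would look roughly like this with fio (the mount path and size are placeholders):

    Code:
    # sequential write, 1M blocks, QD1 / 1 thread (path/size are examples)
    fio --name=seqwrite --directory=/testpool/fio \
        --rw=write --bs=1M --size=10G \
        --ioengine=psync --iodepth=1 --numjobs=1 --group_reporting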
     
    #5
    SRussell likes this.
  6. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,273
    Likes Received:
    752
    With a typical ZFS server, you look at the triangle price <> capacity <> performance. You select two of the parameters and the third is the result. If you are not satisfied with the result, modify the input parameters.

    For ultimate performance, the triangle may change to price <> performance <> data security. This can mean that an ext4 solution is faster than a ZFS solution, but without the security that comes with CopyOnWrite and checksums, which cost performance.

    In theory, a RAID (ZFS or otherwise) scales sequentially with the number of data disks. Basically this is correct, and with large file streams (zfs send) it may scale up to a certain limit.

    In reality a RAID does not work purely sequentially. ZFS, for example, tries to spread data quite evenly over the pool, which means a lot of the load turns into random IO.

    Another aspect is ESXi, where you create a filesystem like ext4 or ntfs on top of a virtual disk (vmdk) file. From the view of the VM this is a block device with, say, 8k blocksize. The VM is optimized to update data based on this blocksize as fast as possible (expecting a physical disk blocksize of 512B/4k).

    If the "file" itself is on ZFS, all real IO happens at the recsize of the ZFS filesystem. If that is 1M, it is highly un-optimized when the guest filesystem wants to read/write 8k and a full 1M record must be processed each time. As ZFS becomes slow with very low recsizes (checksums, dedup and compress work per record), you should try a lower recsize, e.g. 32k or 64k, for best VM performance.
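    A sketch of that change (the dataset name is an example; a new recsize only applies to newly written blocks, so existing vmdks need to be copied or moved to pick it up):

    Code:
    # dataset name is an example
    zfs set recordsize=32k tank/nfs_vmstore
    zfs get recordsize,compression tank/nfs_vmstore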

    As your benchmark is sequential ZFS IO, it may give a different picture than the VM view. Normal flash additionally has the problem that it needs to erase/write a whole large page to write a small data block (Optane is superior here, not affected by this performance "break").
     
    #6
    Last edited: Nov 4, 2019
    SRussell and T_Minus like this.
  7. T_Minus

    T_Minus Moderator

    Joined:
    Feb 15, 2015
    Messages:
    6,838
    Likes Received:
    1,493
    Some thoughts... QD1/T1 performance improvements, once you've 'maxed' drive performance, will come from improving latency:

    - CPU Frequency (& Available Cores)
    - Cable Connections.
    - Cable Quality\Type.
    - Drivers
    - Firmware
    - Drive configuration\pool setup (hardware. IE: Which HBA, Physical Pool\vdev configuration)
    - Memory Performance
    - CPU Various Configurations in BIOS
    - ZFS Various Configurations
    - You're using an E3 not an E5, so memory performance is not nearly as good and there are too few PCIe lanes to get top performance out of the Optanes + 12 SAS3 drives, etc... (unsure if you tested with something else, just throwing out ideas)
    - (If there's network then 1gb vs 10, vs 100, tuning, network stuff... etc...)
     
    #7
  8. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,273
    Likes Received:
    752
    Real world is unfair.
    In theory everything is predictable and the answer to every serious question is "42".

    A solution seems perfect. Then you step on a cable and suddenly it is 100 Mb/s instead of 10 Gb/s (1:100).

    Mostly you can only define the most important features, try to find a standard solution, and if the result is worse than expected: start troubleshooting/bug hunting.
     
    #8
    Evan, Rand__ and T_Minus like this.
  9. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,592
    Likes Received:
    544
    That bench was basically just a min/max size test when I got a bunch of new drives - here is the same run with 4k blocksize/4k recordsize (didn't do 32/64k on that run unfortunately). Will need to see if I have some 32/64k results stored away somewhere (likely also from Optanes)

    [attached chart: same fio run with 4k blocksize/4k recordsize]



    Basically I am trying to do a few things here:
    1. Establish a basic HW set that should be able to meet my goals
    2. Establish the criteria to measure that (beyond moving VMs back and forth) - from what you said (and I agree), VMs should use 32/64k blocksize and will have a certain amount of random IO included (see the fio sketch below)

    3. Also I am trying to find out why adding more mirrors seems to be detrimental to performance (worst case) or not helping as expected (best case)
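    Something like this could serve as the measurement from point 2 (a sketch; the 70/30 mix, path and size are assumptions, not a VMware-defined profile):

    Code:
    # 64k mixed random read/write at QD1/T1 as a rough stand-in for VM traffic
    fio --name=vmlike --directory=/testpool/fio \
        --rw=randrw --rwmixread=70 --bs=64k --size=10G \
        --ioengine=psync --iodepth=1 --numjobs=1 --group_reporting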
     
    #9
  10. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,592
    Likes Received:
    544
    Yeah, that's what I did - throw more hardware at it :p
    I started with drives (Optane, SAS3), then more drives... then more network (useless at this point), then better CPUs (E5/Scalable), then better slogs (at this point I have a 4800X, an NVDIMM and a NV1616 lying here), and nothing really worked out as I had hoped...
    That's why I thought I'd ask somebody who knows this stuff better than I do ;)
    I.e. more money than sense, and trying to rectify that ;)

    Edit:
    What is missing is a baseline, i.e. a realistic number - apparently it's not
    vdevs * <single drive performance>
    but I have not been able to find many results at QD1/T1 since it's not the typical enterprise use case and most normal people are not running similar hw;)
     
    #10
    Last edited: Nov 4, 2019
  11. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,592
    Likes Received:
    544
    Here is a 64k bs/recsize test with two 900Ps in a single mirror and different slogs - this is on a 6150 (again QD1/T1)

    The NVDIMM comes close;)

    [attached charts: 64k fio results for the 900P mirror with the different slog devices]
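    Swapping the slog between runs can be done roughly like this (pool and device names are just examples):

    Code:
    # pool and device names are examples
    zpool add tank log c3t0d0      # add the slog candidate under test
    zpool status tank              # verify the log vdev shows up
    zpool remove tank c3t0d0       # remove it again before the next candidate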
     
    #11
    Last edited: Nov 4, 2019
  12. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,273
    Likes Received:
    752
    If I read it correctly (and as a mirror does not help on writes):
    - 900P without slog is nearly as good as 900P + 4800X slog
    - 900P + NVDIMM = 4000 vs 6000 write IOPS (1.5x)

    The question remains:
    does this improve a VM move by 5% or by 30%?
    (your initial real-world concern)
     
    #12
  13. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,592
    Likes Received:
    544
    Yeah :p
    Too much testing, not enough real-world experience.

    Will have the target board back from SM tomorrow and will be able to do some real tests once I've rebuilt the system. Will let you know;)
     
    #13
  14. i386

    i386 Well-Known Member

    Joined:
    Mar 18, 2016
    Messages:
    1,683
    Likes Received:
    412
    If changing hardware doesn't improve the performance, then it's time to look at the software/OS stack :D
     
    #14
  15. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,592
    Likes Received:
    544
    I contemplated going back to RAID with Starwind or Open-E (got an HA license with some hardware a few months back), but ZFS is nicer;)

    As mentioned, one of my problems is the lack of comparison data; there are just too few published numbers from high-speed QD1/T1 setups, so I don't know whether it is possible at all with a ZFS-based system...
     
    #15
  16. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,592
    Likes Received:
    544
    So I did a not-so-realistic test
    - a stripe of 900Ps with the NVDIMM (slog) as the recipient; a 4800X as the source, both on a FreeNAS box exported via NFSv3 (and sync=always).
    vMotion from datastore A to B.

    Also, the box is quite beefy: a 6150 (16 cores, 3.4 GHz all-core) and 280GB memory (so the test VM might have been cached in RAM completely, it's only 22GB).

    zpool iostat stripe_900p 1: 25 ticks for 19GB, so ~760 MB/s actual transfer rate.

    Will need to run some further tests with remote transfer and a more realistic drive set.

    Code:
    stripe_900p  5.60M   888G      0      0      0      0
    stripe_900p  5.60M   888G      0      0      0      0
    stripe_900p  5.60M   888G      0      0      0      0
    stripe_900p  5.60M   888G      0      0      0      0
    stripe_900p  5.60M   888G      0      0      0      0
    stripe_900p   203M   888G      0  6.70K      0   484M
    stripe_900p   956M   887G      0  20.7K      0  1.61G
    stripe_900p  1.74G   886G      0  21.7K      0  1.68G
    stripe_900p  2.67G   885G      0  25.8K      0  1.93G
    stripe_900p  3.41G   885G      0  20.5K      0  1.59G
    stripe_900p  4.27G   884G      0  23.9K      0  1.78G
    stripe_900p  5.01G   883G      0  22.1K      0  1.69G
    stripe_900p  5.87G   882G      0  22.4K      0  1.71G
    stripe_900p  6.80G   881G      0  25.0K      0  1.93G
    stripe_900p  7.42G   881G      0  16.1K      0  1.23G
    stripe_900p  7.79G   880G      0  10.9K      0   844M
    stripe_900p  8.53G   879G      0  20.0K      0  1.53G
    stripe_900p  9.34G   879G      0  22.8K      0  1.72G
    stripe_900p  10.1G   878G      0  24.4K      0  1.85G
    stripe_900p  10.9G   877G      0  20.7K      0  1.58G
    stripe_900p  11.8G   876G      0  25.3K      0  1.88G
    stripe_900p  12.6G   875G      0  24.7K      0  1.86G
    stripe_900p  13.4G   875G      0  22.2K      0  1.63G
    stripe_900p  14.3G   874G      0  24.9K      0  1.88G
    stripe_900p  15.1G   873G      0  22.9K      0  1.73G
    stripe_900p  15.9G   872G      0  23.8K      0  1.81G
    stripe_900p  16.7G   871G      0  22.2K      0  1.70G
    stripe_900p  17.6G   870G      0  25.9K      0  1.94G
    stripe_900p  18.5G   869G      0  26.7K      0  2.02G
    stripe_900p  19.2G   869G      0  15.8K      0  1.16G
    stripe_900p  19.2G   869G      0      0      0      0
    stripe_900p  19.2G   869G      0      0      0      0
    stripe_900p  19.2G   869G      0      0      0      0
    stripe_900p  19.2G   869G      0      0      0  3.76K
    stripe_900p  19.2G   869G      0    594      0  38.9M
    stripe_900p  19.2G   869G      0      0      0      0
    stripe_900p  19.2G   869G      0      0      0      0
    stripe_900p  19.2G   869G      0      0      0      0
    stripe_900p  19.2G   869G      0    340      0  22.6M
    stripe_900p  19.2G   869G      0  2.39K      0   186M
    stripe_900p  20.1G   868G      0  24.4K      0  1.81G
    stripe_900p  21.0G   867G      0  24.2K      0  1.82G
    stripe_900p  21.9G   866G      0  26.2K      0  2.03G
    stripe_900p  22.6G   865G      0  19.9K      0  1.48G
    stripe_900p  22.6G   865G      0      0      0      0
    stripe_900p  22.6G   865G      0      0      0      0
    stripe_900p  22.6G   865G      0      0      0      0
    stripe_900p  22.6G   865G      0      0      0      0
    stripe_900p  22.6G   865G      0    637      0  28.2M
    stripe_900p  23.2G   865G      0  16.9K      0  1.29G
    stripe_900p  23.2G   865G      0     14      0  73.6K
    stripe_900p  23.2G   865G      0      0      0      0
    stripe_900p  23.2G   865G      0      0      0      0
    stripe_900p  23.2G   865G      0      0      0      0
    stripe_900p  23.3G   865G      0    515      0  42.3M
    stripe_900p  23.3G   865G      0      0      0      0
    stripe_900p  23.3G   865G      0      0      0      0
    stripe_900p  23.3G   865G      0      0      0      0
    stripe_900p  23.3G   865G      0      0      0      0
    stripe_900p  23.3G   865G      0      0      0      0
    
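    For reference, the sync=always part of a setup like this boils down to one property on the receiving dataset (the dataset name is an example; the NFS export itself comes from the FreeNAS sharing config):

    Code:
    # dataset name is an example
    zfs set sync=always stripe_900p/vmds
    zfs get sync stripe_900p/vmds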
     
    #16