ZFS planning question


ipreferpie

Member
May 27, 2018
Hi there, I'm rather new to napp-it and am looking to use several drives to their full potential. I was wondering if anyone can comment on whether this makes sense; apologies ahead of time if this isn't the right sub-forum. My primary use case is a large virtualized Nextcloud host plus an iSCSI target for ESXi VMs at home, for 5-8 users on a 40GbE (may upgrade to 100GbE) network. The ZFS pool will have the following resources:

1) up to 24 cores from a Xeon 8176M Platinum
2) up to 256GB DDR4 RAM allocated
3) 3x Optane 905p 960GB U.2 drives
4) 2x Intel D4502 7.68TB QLC drives
5) 3x Samsung PM1733 15.36TB TLC drives

a) I'm looking to create 2 namespaces per 15.36TB drive (at 7.68TB each) so that they match the 2x 7.68TB drives, giving me 8x 7.68TB to work with
b) should I overprovision each drive (7.68TB -> 6.4TB) and use RAIDZ1, granting me 44.8TB usable, or skip overprovisioning and use RAIDZ2, granting me 46.08TB usable?
c) with the Optane drives, I'm planning on using one for SLOG and mirroring the other 2 as a special vdev. Is this the best way to utilize the drives? (Rough sketch of what I mean below.)
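
For reference, here's roughly how I picture (b)/(c) coming together. This is only a sketch of my current thinking; the device names are placeholders for my drives/namespaces, not real paths:

# 8x 7.68TB data providers (2x D4502 + 6 namespaces carved from the 3x PM1733)
zpool create tank raidz2 ssd1 ssd2 ns1a ns1b ns2a ns2b ns3a ns3b
# one Optane as SLOG, the other two as a mirrored special vdev
zpool add tank log optane1
zpool add tank special mirror optane2 optane3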

I might be missing something, so any thoughts are much appreciated in advance!
 

mattventura

Active Member
Nov 9, 2022
The thing that immediately stands out to me is that you probably don't want to use a single drive as SLOG, as it would be a single point of failure. You really want a mirror for safety.

Unless you have a workload where you know that RAIDZ will perform well, mirrors are the way to go. Doing multiple namespaces per drive might also confuse ZFS, since it may believe a drive is idle when in reality the other namespace on that same drive is under heavy load. Mirrors avoid the need for namespaces in the first place, since mirror vdevs in the same pool don't have to be the same size. You'd still be stuck with a leftover 15.36TB drive, though; there isn't a particularly good solution there.
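
If it helps, a rough sketch of the kind of mirror layout I mean (device names are placeholders, adjust to your system):

# mixed-size mirror vdevs in one pool - the vdevs don't have to match each other
zpool create tank mirror d4502_1 d4502_2 mirror pm1733_1 pm1733_2
# the leftover 15.36TB drive could sit as a hot spare until you buy it a partner
zpool add tank spare pm1733_3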
 

ipreferpie

Member
May 27, 2018
Those are some really good points. In regards to the mirror approach, I'll only get 23.04TB usable across the vdevs plus an extra 15.36TB drive left over. I'm willing to take a performance hit to trade speed for capacity and redundancy.

Perhaps this points me toward RAIDZ3 (without overprovisioning), granting me 38.4TB of space plus resilvering safety, and I'm willing to take a hit in performance if I can fully utilize the disks I already have.

For Optane usage, perhaps I should use 2x mirrored Optane drives for SLOG, 1x for L2ARC, and no special vdev (since it's all NVMe)?
 

Bjorn Smith

Well-Known Member
Sep 3, 2019
  • I would drop the QLC drives entirely - if you are 100% sure they are in fact QLC (I think they are TLC)
  • Use mirrors for VM hosting; they give you maximum IOPS, which is what you want for virtual machines or basically anything that is not backup
  • Creating namespaces will split a drive in two, but you are building a ticking time bomb for yourself, since one lost drive can suddenly break two pool members. I would not do that - if you absolutely have to, at least make sure that no drive ends up in more than one mirror
  • The Optanes are WAY too big for SLOG - you most likely need less than 32GB

All in all, in my opinion, you have a very unfortunate set of drives that you are trying to mix into something usable.

If you cannot return the drives and get matching sizes, I would probably make two pools (sketched below):

One pool with the D4502 drives (mirror) -> 7.68TB usable
One pool with 3x PM1733 (mirror, one spare) -> 15.36TB usable
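
Roughly like this (disk names are only placeholders):

# pool 1: the two D4502s as a mirror
zpool create fastpool mirror d4502_1 d4502_2
# pool 2: two PM1733s as a mirror, the third kept as hot spare
zpool create bigpool mirror pm1733_1 pm1733_2
zpool add bigpool spare pm1733_3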

Drop the Optanes entirely as a SLOG - I doubt you really need one when you have fast enterprise NVMe drives.

Also, you do not need an L2ARC - add more RAM instead if you can.

22TB of storage for VMs is a lot - you have to remember that ZFS compresses.

I am running 32 VMs on a Ceph pool with only around 1TB of data used - so 22TB will let you run a lot of VMs unless you plan on storing a lot of binary data on them.
 

sko

Active Member
Jun 11, 2021
RAIDZ loses at least ~15% (more realistically ~20%, worst case easily up to 30%, especially with very few disks like in your case) to padding and metadata/parity. So forget about those "usable space" estimates for RAIDZ.

As @mattventura said: unless you have a special use case where you *need* RAIDZ, you should _always_ go with mirrors for small pools, *especially* because this gives you much more flexibility for future upgrades.
It also gives you the combined IOPS of the drives, which benefits VMs greatly (especially unoptimized ones like Windows, which thrashes its registry HARD at boot).

Regarding SLOG: do you have workloads with A LOT of synchronous writes (e.g. huge databases)? If not, you won't need a SLOG. And the RAM that would be eaten up by L2ARC header tables is almost always better spent on the ARC itself.
For VMs and general-purpose usage a special device would give the most advantage. You absolutely want a mirrored vdev for that: the special device holds most of the metadata - if it's gone, all data is gone.
OTOH, with fast drives, any additional caching or a special device on only slightly faster storage is rather pointless. Due to the additional housekeeping and memory usage for that cache or special device, you might not see any improvement for the pool.

Regarding multiple namespaces: don't do that. If one drive fails, two providers in your pool fail at once. The resilver load this puts on the pool might be enough to push another ageing drive over the edge.
 

gea

Well-Known Member
Dec 31, 2010
some thoughts
Making 2 x 7.5TB namespaces from a 15TB SSD to build a pool together with the other 7.5TB SSDs makes no sense. If a 15TB drive fails, it takes out two pool members at once, leaving a Z2 without redundancy.

A Slog does not need to be mirrored. If it fails, the pool reverts sync logging to the on-pool ZIL with reduced performance. The worst case is a crash together with a Slog failure; this is the only, but very rare, case of data loss. Given the minimal improvement of an Optane Slog when the pool is already on enterprise SSD/NVMe with PLP, I would probably skip the whole idea of a Slog and just enable sync.

ZFS read caching works on a read-last/read-most optimization of ZFS datablocks. It does not cache files. With 256GB RAM the L2ARC cache hit rate will be near zero. There is a small advantage of an L2ARC SSD over ARC RAM in that L2ARC is persistent and allows enabling read-ahead; both can help a little. A read-cache L2ARC can fail without data loss. L2ARC should not be larger than, say, 5% of RAM as it needs RAM to organize.

For the VMs I would prefer NFS over iSCSI. With the same sync settings both are quite equal regarding performance, but NFS is much simpler and offers a file/VM-based revert to a former state for any of the VMs on NFS. With iSCSI you must revert the whole target, which means you should use one target per VM. Napp-it also includes a mechanism to embed crash-safe ESXi memory-state or quiesced snaps into ZFS snaps.
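
As an example only (pool/filesystem names are just placeholders), an NFS VM store could look like this:

# create a filesystem for the VMs and share it via NFS (plus SMB for easy access)
zfs create tank/vmstore
zfs set sharenfs=on tank/vmstore
zfs set sharesmb=on tank/vmstore
# force sync writes so VM data is crash-safe
zfs set sync=always tank/vmstore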

With enterprise SSD/NVMe you have a high, steady IOPS rate. Mirrors are mostly not needed for their better IOPS performance; RAID-Z is good enough.

What I would suggest
Use two or three Z1 pools with different performance levels built from the different disks. An option would be a hybrid pool with an Optane mirror, where small I/O, metadata or critical ZFS filesystems land on Optane and the rest on the slower disks, but do not expect too much from a special vdev as your NVMe drives are already quite fast. Hybrid pools work best with HDD pools.
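
Such a hybrid pool would look roughly like this (only a sketch, disk names are placeholders):

# Z1 data vdev plus a mirrored Optane special vdev
zpool create tank raidz1 pm1733_1 pm1733_2 pm1733_3 special mirror optane1 optane2
# route small blocks of a critical filesystem to the Optane mirror
zfs create tank/critical
zfs set special_small_blocks=64K tank/critical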

I would probably try to swap the 7.5TB Intels for one or two more 15TB Samsungs to create a larger pool from them, with an Optane special vdev mirror for the most critical data, or a second Optane Z1 pool. Check the performance improvement of an L2Arc or Slog in the napp-it menu Pool > Benchmark regarding sync write and IOPS performance, and check the L2Arc hit rate.

Important!
What is your backup/recovery strategy?
The minimum is a local HDD backup pool; best is a second backup server in a different location. If the backup server has ESXi as a base, you can even use it as a failover option with reduced performance, where a special vdev makes sense for the VMs.
 

ano

Well-Known Member
Nov 7, 2022
We don't use SLOGs with fast devices.

Z1 and Z2 have less overhead and loss than most people think, given enough CPU.

Usually you hit weird limits within ZFS before parity and CPU are exhausted.

Don't bother with L2ARC.

Use lz4 (quick example below).

Don't bother with namespaces.

You have very few drives and so many different types - not ideal.
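
For the lz4 point, something like this (pool name is only an example):

# enable lz4 on the pool root so new datasets inherit it
zfs set compression=lz4 tank
# check what compression actually buys you
zfs get compressratio tank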
 

ipreferpie

Member
May 27, 2018
Thanks very much for everyone's comments so far. The density of information has been enlightening and has led me to reevaluate my options in more depth. As such, I might do this instead:

iSCSI target ---> NFS share (I'm way out of date on my options from years ago), and no NVMe namespaces
a) 2x 7.68TB vdev, mirrored (temporarily, until I swap out as 15.36TB drives get cheaper) ---> used for Nextcloud
b) 3x 15.36TB vdev, RAIDz1 (to be upgraded to 5x later) ---> used for ESXi VMs
-2x Optane 960GB special vdev, mirrored
-1x Optane 60GB partition 1, SLOG for sync writes
-1x Optane 900GB partition 2, L2ARC
I'll allocate 64GB DDR4 RAM (instead of the 256GB earlier, and will reallocate the remainder to other VMs). Rough commands for this layout are sketched below.
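
In command form, the plan for (b) would look something like this - device and partition names below are placeholders, not my actual paths:

# 3x 15.36TB RAIDZ1 data vdev plus a mirrored Optane special vdev
zpool create vmpool raidz1 pm1733_1 pm1733_2 pm1733_3 special mirror optane1 optane2
# third Optane split into two partitions: a small SLOG and a large L2ARC
zpool add vmpool log optane3_part1
zpool add vmpool cache optane3_part2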

However, I'm still unsure about 2 things:
1) is it ok to partition an Optane drive for both SLOG and L2ARC?
2) Is my RAM (64GB) to L2ARC (900GB) ratio suitable?

Last note is that I really like the Optane drives not just for their random I/O speed, but also because they can provide a layer of write endurance and PLP for the data vdevs.
 

mattventura

Active Member
Nov 9, 2022
Still a few issues with that setup:

1. Losing your SLOG won't leave your pool in an unrecoverable state, but it can still cause problems for some applications, because they think a write is fully committed when it was actually lost due to the SLOG failure.
2. You've got only a single L2ARC device, meaning it would probably bottleneck, especially if you start adding more drives later, or upgrade to gen4 drives, etc. I'd test with and without to see if you're actually getting any benefit.
3. Attaching more drives to a RAIDZ may make the performance gap larger.
4. I'm not sure what you mean by "iSCSI target ---> NFS share". I think the other poster was saying they prefer NFS as an alternative to iSCSI; you can do either. Though, I'm not sure whether ESXi has an integration to automatically create zvols for each VM the way libvirt does.

Do you plan to immediately use >22TB? If not, you can just do a mirror of 2x7.68TB and a mirror of 2x15.36TB in a pool. Then, once you buy another 15TB drive, you can mirror that one with the spare 15TB, bringing your total capacity to ~38TB. Mirrors are nice because they give you the most opportunities to expand capacity. You can add two more drives, or replace two drives with larger drives.
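
As a sketch (placeholder device names again), that expansion path would be:

# start with two mirrors now
zpool create tank mirror d4502_1 d4502_2 mirror pm1733_1 pm1733_2
# later, when you buy a fourth 15.36TB drive, pair it with the spare as a third mirror
zpool add tank mirror pm1733_3 pm1733_4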

Don't forget, VMs tend to dedup and compress reasonably well, especially if you have a bunch of similar VMs.
 

ericloewe

Active Member
Apr 24, 2017
Last note is that I really like the Optane drives not just for their random I/O speed, but also because they can provide a layer of write endurance and PLP for the data vdevs.
Not by any meaningful measure, no. Writes to the rest of the pool are unaffected (relative to the sync=off scenario) so it's a serious stretch to say that Optane will benefit the rest of the pool.
However, I'm still unsure about 2 things:
1) is it ok to partition an Optane drive for both SLOG and L2ARC?
2) Is my RAM (64GB) to L2ARC (900GB) ratio suitable?
It depends, and it's unclear that you need either of them.

1. Losing your SLOG won't leave your pool in an unrecoverable state, but it can still cause problems for some applications, because they think a write is fully committed when it was actually lost due to the SLOG failure.
You're not wrong, but only if the SLOG fails at the same time the system goes down. Is that a serious concern for a home user? I tend to say "no".
b) 3x 15.36TB vdev, RAIDz1 (to be upgraded to 5x later) ---> used for ESXi VMs
3. Attaching more drives to a RAIDZ may make the performance gap larger.
Beyond that, space efficiency with small blocks will be impacted. With a three-wide RAIDZ vdev, you could go as low as 8k blocks (assuming 4k sectors on disk) without taking a hit from not being able to break up a block over multiple columns of the vdev. At five-wide, that's 16k... Not a serious concern for large files, but it is meaningful for block storage.
Don't forget, VMs tend to dedup
Ehh... That's a big can of worms.
 

gea

Well-Known Member
Dec 31, 2010
Regarding iSCSI -> NFS: put all VMs on a single NFS share instead of an iSCSI target.
Another advantage is that you can share the same filesystem via NFS and SMB.
You can then connect via SMB to copy/move/clone/edit VMs and access ZFS snaps via Windows "Previous Versions".

Data loss due to a Slog failure
- The Slog is only read after a crash with uncompleted writes. If the Slog fails, all writes are logged to the on-pool ZIL instead, without interruption. You need a crash plus a faulted Slog at that moment for data loss. Not very likely.

Partition Optane for Slog and L2Arc
- This is possible, either as real partitions or, more simply, via ESXi vdisks, with a Slog size upwards of 10GB.

3 disk z1 vdev -> 5 disk vdev
- RAIDZ expansion is nearly ready in OpenZFS as a beta. I would wait some time until stability is proven. Illumos may include it with a delay.

64GB RAM with 900 GB L2Arc
- A bad idea, as you want RAM for fast ARC caching, not for organizing the L2Arc. With 64GB RAM you may use 32-64GB max for L2Arc.

Dedup
- I would avoid it. You want RAM for caching and performance, not for dedup tables. A special vdev for dedup is a solution, but only if dedup rates are >>5x.

Number of disks per vdev
- There is no longer a "golden number" of disks per vdev. You can set recsize with any number of disks. If you enable compression (as you mostly should), the real block size on the pool depends on the compressibility of the data, so there is no fixed block size based on the number of disks per vdev.

Small recsize filesystem settings
- You should avoid very low ZFS recsizes like 8k without strong reasons, as this hurts ZFS efficiency (checksums, dedup, compression, encryption etc. operate on ZFS datablocks of variable size up to the filesystem's recsize setting). For databases and VMs a 32-64k recsize is usually good, for most data the 128k default is fine, and for a media filer prefer 1M.
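
For example (filesystem names are only illustrative):

# VMs / databases
zfs set recordsize=64K tank/vmstore
# general data keeps the 128K default
zfs set recordsize=128K tank/data
# media filer
zfs set recordsize=1M tank/media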
 