ZFS filesystem sync compr dedup settings

ARNiTECT · Apr 19, 2020

Hopefully this is an easy question for more experienced users.

I have used Napp-it for many years, but never tried deduplication, compression or changing the sync-write settings.

All-in-One for home use
ESXI 6.7
OmniOS r151032e / Napp-it 19.12a1 noncommercial homeuse (will upgrade to pro soon for replication)
128GB of RAM, 24GB to OmniOS, proposed increase to 48GB.
Xeon E-2278g, 2x vCPU

What ZFS settings should I set for each of the following Pools/Filesystems (SMB & NFS are enabled on all 3):

Tank1/ZFS1 - 2x 2TB NVMe (NFS share for VM OSs eg c:\ drives)
SYNC: standard / COMPR: off / DEDUP: off
Performance is important, saving space would be nice as long as the performance impact is minimal.

Tank2/ZFS2 - 4x 8TB HDD in RAID10 (NFS share for Filer virtual disks and VEEAM Backup repository virtual disk)
SYNC: standard / COMPR: off / DEDUP: off
Storage space is important for backups, but I don't want to impact the performance of the Filer too badly.

Tank3/ZFS3 - 8x 3TB HDD in RAIDZ2 for Replication of ZFS1 & ZFS2
SYNC: standard / COMPR: off / DEDUP: off
Storage space is important, performance not much of a concern as long as it doesn't impact the use of ZFS1 & ZFS2.

Will deduplication and compression only apply to newly written data, or is it possible to run a manual job for these on existing data?

Will OmniOS/Napp-it organise the ram itself between each pool for optimal performance, depending on deduplication feature, read/write cache etc.? Or are there settings elsewhere I need to adjust to get the performance I need?

ARNiTECT · Apr 20, 2020

I have continued my research:

tl;dr
sync: mixed
compr: on
dedup: off

Tank1/ZFS1 - 2x 2TB NVMe (NFS share for VM OSs eg c:\ drives)
Priority: Speed
SYNC: standard - write sync important for VM OSs
COMPR: on - my system should be fast enough
DEDUP: off - at first it would appear to make sense to enable dedup, as there is a lot of repetition across multiple VMs of the same OS, with only a small percentage of difference, but I have read of huge impacts on performance even with lots of RAM. I would prefer the extra RAM to be used for increased performance rather than extra capacity.

Tank2/ZFS2 - 4x 8TB HDD in RAID10 (NFS share for Filer virtual disks and VEEAM Backup repository virtual disk)
Balance: Speed/Capacity
SYNC: disabled - I'm hoping in real use of moving files around during the day and backup at night, the risk won't be an issue.
COMPR: on - useful for filer, I will also enable compression on Veeam B&R to reduce traffic, I assume extra ZFS compression can't do any harm to compressed Veeam backups.
DEDUP: off
- For the filer, there would be repetition within many files that have been incrementally 'saved-as' and badly organised copies of files in multiple locations, but a large portion of data (photos, music, videos) would not benefit from deduplication. Perhaps I just enable NTFS deduplication on the Windows Server VM file server, which includes folder redirection.
- For Veeam backups, I will enable deduplication in Veeam

Tank3/ZFS3 - 8x 3TB HDD in RAIDZ2 for Replication of ZFS1 & ZFS2
Priority: Capacity
SYNC: standard & disabled - not sure on this. If ZFS1 goes down, I would run the VMs from this pool with sync enabled; if ZFS2 goes down I don't think I would need sync. Perhaps I need 2 file systems in the pool with different settings.
COMPR: on, can't think of any reason not to here
DEDUP: off - It looks like dedup would only have an effect on the ZFS1 data, but if I needed to run the deduped VMs from here then performance would be massively impacted, on top of going from NVMe to HDD. Also, I understand dedup applies to the pool, not individual file systems.

gea · Apr 20, 2020

Only a few base rules

Realtime dedup as in ZFS works block-based across a whole pool but can be enabled per filesystem. As it is realtime, you need to hold a dedup table either in RAM (count up to 5 GB RAM per TB of dedup data, in addition to the RAM you want to use for read/write caching) or on the pool. On the pool is usually far too slow (a snap destroy can last hours) unless you use a special vdev for dedup data, e.g. a mirror of two Optane NVMe. The special vdev construct from Intel may be a very interesting option for large pools where you want to enable dedup.

As a rule of thumb I would say: with a dedup rate of 10+ and enough RAM, normal ZFS dedup is worth considering; otherwise add disks with LZ4, as this is cheaper or more efficient. If you want dedup with a lower dedup rate or without the RAM need, use a special vdev mirror, https://www.napp-it.org/doc/downloads/special-vdev.pdf (take care that the special vdev has the same ashift as the pool!)
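To put the 5 GB per TB figure in context, here is a minimal Python sketch of the resulting RAM budget. The 48 GB total is the value proposed earlier in the thread; the 2 GB OS reserve and all function names are assumptions made for illustration only.

```python
# Rule of thumb from the thread: up to 5 GB RAM per TB of deduplicated data,
# in addition to the RAM wanted for read/write caching.
GB_PER_TB_DEDUP = 5  # upper-bound figure quoted in the thread


def dedup_ram_gb(dedup_data_tb, gb_per_tb=GB_PER_TB_DEDUP):
    """Worst-case RAM (GB) needed to hold the dedup table."""
    return dedup_data_tb * gb_per_tb


def ram_left_for_cache_gb(total_ram_gb, dedup_data_tb, os_reserve_gb=2):
    """RAM (GB) left for caching after the dedup table and an ASSUMED OS reserve."""
    return total_ram_gb - os_reserve_gb - dedup_ram_gb(dedup_data_tb)


if __name__ == "__main__":
    # Illustrative data sizes: today's VM pool, the grown VM pool, a large HDD pool.
    for tb in (0.5, 2, 16):
        print(f"{tb} TB dedup data: table up to {dedup_ram_gb(tb)} GB, "
              f"{ram_left_for_cache_gb(48, tb)} GB of 48 GB left for caching")
```

A negative result simply means the table would not fit in RAM under this rule, which is the case where the thread points to a special vdev or to skipping dedup.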

Compression via LZ4 reduces the amount of data in most cases and can usually be enabled without negative impact.

Sync is a method to protect the RAM-based write cache. After a crash, committed writes that are in an Slog but not yet on the pool are written to the pool on the next reboot. As the consistency of ZFS itself is not affected by a crash, sync security is needed only for VM storage or transactional databases, where a crash can otherwise result in a corrupt guest filesystem or database.

For a normal NFS or SMB filer without VMs or databases you can disable sync, as sync can reduce write performance down to 10% of the unsync value. Only with fast NVMe pools with plp, or a disk pool with an Optane Slog (e.g. 4801), is the performance loss acceptable, so you may want the additional sync security even on a filer. This would protect data/committed writes and small files that are in the write cache but not yet on the pool.

ARNiTECT · Apr 20, 2020

Thanks for the detailed response gea!

I have a few follow up questions:

Deduplication:
- My ZFS1 NVMe VM pool is currently using approx 500GB and I expect it will grow to 2TB in the future. I didn't think it was large enough to warrant an Optane special vdev. If I allow more than 5GB of RAM per 1TB of data, would this improve performance, say 10GB per 1TB (20GB)? Could I expect similar performance to a non-dedup filesystem for VMs, or is this unnecessary? Your tests in https://napp-it.org/doc/downloads/performance_smb2.pdf seem to result in about a 20% loss of performance with dedup enabled.
- If the dedup table is held in ram, would a restart of OmniOS simply result in a wait while it repopulates the table into ram?
- On my ZFS2, would I expect to see significant savings in space using ZFS realtime dedup only, across the whole pool, vs. no ZFS dedup and Windows Server dedup for the Filer and Veeam dedup for the backups? If there isn't a significant difference, I would save a huge amount of RAM by not enabling dedup here. A pair of more affordable Optane 900P 280GB might be a future option here. If so, I understand I would have to destroy and recreate the pools to dedup the existing data anyway.
- On the ZFS3 (replication destination), could I share mirrored Optane used on ZFS2?
- However, currently, with no Optane special vdev, ZFS1 with dedup potentially enabled (as proposed above) and ZFS2 with dedup disabled: on ZFS3, would I need to allow for the same amount of RAM again as required for ZFS1? And would I also need to allow for the ZFS2 data, as I understand dedup is applied across the whole pool (ZFS3), which I definitely don't have enough RAM for?

Compression:
I will enable LZ4 on all 3 pools.
Can I apply compression to existing data, or do I have to copy the data off and back on again?

Sync:
I will enable sync on ZFS1; I understand the Corsair MP510 has PLP.
I will disable sync on ZFS2. If the server crashes, can I estimate how much data I would lose, e.g. the previous 10 seconds?
ZFS3, perhaps I need to separate into 2 file systems (ZFS3&4) with sync settings as ZFS1&2, I would prefer 1 large filesystem filling the whole pool, but I haven't looked into how Replication will work yet.

gea · Apr 20, 2020

Dedup
The greatest problem with dedup is that RAM for caching is more relevant than RAM for dedup, and mostly dedup rates are not big enough.

If you use an Slog, you can use a single disk. A mirror avoids the performance degradation if it fails, and avoids the very small chance of data loss when the Slog fails with a crash at the same time.

The amount of RAM needed depends on the dedup table size, which is related to the amount of dedup data.

Compress
If you enable compress, this affects only new data. For current data you need a move between filesystems or a replication.

Sync
The default RAM write cache size on Open-ZFS is 10% of RAM, max 4 GB. Half of that is the active cache, which is the amount of data lost on a crash. But it is not the size that matters: one wrong bit in metadata can lead to a corrupt VM filesystem.
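As a rough illustration of the write cache sizing described above, the following Python sketch applies the "10% of RAM, max 4 GB, half of it active" rule to the 24 GB and 48 GB RAM sizes discussed in the thread. The function names are made up for this example, and the numbers are only an upper bound on lost data, not a statement about consistency.

```python
def write_cache_gb(ram_gb, percent=10, cap_gb=4.0):
    """Default RAM write cache per the post: 10% of RAM, capped at 4 GB."""
    return min(ram_gb * percent / 100, cap_gb)


def active_cache_gb(ram_gb):
    """Half of the write cache is active; this bounds the data lost on a crash."""
    return write_cache_gb(ram_gb) / 2


if __name__ == "__main__":
    for ram in (24, 48):  # current and proposed RAM for the OmniOS VM
        print(f"{ram} GB RAM: write cache {write_cache_gb(ram):.1f} GB, "
              f"up to {active_cache_gb(ram):.1f} GB at risk with sync disabled")
```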

ARNiTECT · Apr 22, 2020

Thanks gea,

I'm thinking of adding more Filesystems per pool to help with the backup strategy, but maybe it also helps by allowing more granular settings:

Is there any problem with applying different sync settings to individual filesystems on the same pool? eg:
ZFS1/VMs/ sync:enabled
ZFS1/Files/ sync:disabled

And the same with dedup: can individual filesystems have different dedup settings? I remember reading somewhere that dedup is applied across the whole pool. Does this mean the dedup table is sized for the whole pool, or just for the filesystems with dedup enabled? Does it work such that if a file is saved to a dedup-enabled filesystem, it is checked for duplicate blocks against the whole pool or just the same filesystem, and if a file is saved to a dedup-disabled filesystem, it is not checked for duplicates? eg:
ZFS2/Backups/ dedup:enabled
ZFS2/Files/dedup:disabled

If I could be more targeted with RAM or special vdev allocation, dedup might be worth it. Otherwise, it doesn't seem worth it on any of my pools, as they each contain mixed data types, which I understand creates a large set of data for a dedup table while only a small portion has high dedup ratios. With at least 48GB of RAM spare and perhaps a new affordable single Optane 900P 280GB, I don't think this is enough to make dedup worth it unless I can be more targeted.

gea · Apr 22, 2020

Sync is per filesystem. Different settings are normal.

Dedup is also per filesystem, but dedup is processed against the whole pool. This is why you can never disable dedup on a pool once it was enabled on a filesystem. The size of the dedup table does not depend on pool size but on the amount of dedup data.

If you want to add a special vdev for dedup, take care about redundancy, as a lost vdev means a lost pool (use a 2/3-way mirror). Also take care to use the same ashift on pool and special vdev (important!)
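The per-filesystem settings discussed here map onto ordinary `zfs set` properties. The Python sketch below only builds the command strings and does not execute anything. The dataset names are the examples from the question, and mapping "sync enabled" to `sync=always` is an assumption of this sketch (`sync=standard` would instead honour whatever the client requests).

```python
VALID_SYNC = {"standard", "always", "disabled"}  # values accepted by the ZFS sync property


def zfs_set_commands(plan):
    """Turn {dataset: {property: value}} into `zfs set` command strings.

    Nothing is executed; the caller decides whether to run the commands.
    """
    commands = []
    for dataset, props in plan.items():
        if "sync" in props and props["sync"] not in VALID_SYNC:
            raise ValueError(f"unknown sync value for {dataset}: {props['sync']}")
        for name, value in props.items():
            commands.append(f"zfs set {name}={value} {dataset}")
    return commands


# Example layout taken from the question; purely illustrative.
plan = {
    "ZFS1/VMs": {"sync": "always"},
    "ZFS1/Files": {"sync": "disabled"},
    "ZFS2/Backups": {"dedup": "on"},
    "ZFS2/Files": {"dedup": "off"},
}

if __name__ == "__main__":
    for line in zfs_set_commands(plan):
        print(line)
```

Keeping the plan as data makes it easy to review the intended settings per filesystem before touching a live pool.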

ARNiTECT · Apr 22, 2020

Thanks gea,

For special vdev redundancy, unfortunately I can't justify 2x Optane at the moment.

Do I understand this correctly, I am extrapolating from half understanding other forum posts:

Pool size, example:
Total size of data on disk (allocated?) = 6TB
Total size of all data (referenced?) = 9TB
ratio 1.5?
is the amount of dedup data the 'allocated' 6TB amount, or the difference 3TB?
I have read the RAM can be calculated as number of blocks x 320 bytes, e.g. 6TB at 64K = 32GB, which is about 5GB/1TB as you mentioned (plus RAM for ARC and the OS etc.)
But I have also read the size of the dedup table on a special vdev is 5% of the dedup data, which for 6TB is about 308GB. Is this correct?
If so, an extra couple of 3TB hard disks would be much cheaper than 2x 480GB Optane drives. 2x 500GB NVMe would be affordable, but I'm not sure if consumer grade NVMe (with PLP) is suitable.
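The two estimates quoted in this post can be reproduced with a few lines of Python. This is only arithmetic on the figures from the thread: the 320 bytes per block and the 5% ratio are the numbers the poster read elsewhere, not values taken from a ZFS specification, and the function names are invented for the example.

```python
TIB = 2**40
GIB = 2**30
DDT_ENTRY_BYTES = 320  # per-block figure quoted in the post; an estimate only


def ddt_ram_bytes(data_bytes, block_bytes, entry_bytes=DDT_ENTRY_BYTES):
    """RAM estimate: one dedup table entry per block of data."""
    return (data_bytes // block_bytes) * entry_bytes


def five_percent_bytes(data_bytes):
    """Alternative estimate quoted in the post: 5% of the dedup data."""
    return data_bytes * 5 // 100


if __name__ == "__main__":
    data = 6 * TIB
    print(ddt_ram_bytes(data, 64 * 1024) / GIB)  # 30.0 GiB, roughly 32 GB decimal
    print(ddt_ram_bytes(data, 8 * 1024) / GIB)   # smaller blocks mean more entries
    print(five_percent_bytes(data) / GIB)        # the "5%" estimate for comparison
```

With 64K blocks the per-block estimate works out to exactly 5 GiB per TiB, which matches the rule of thumb given earlier. The 5% figure is about ten times larger; the sketch does not settle which of the two is the right one for a special vdev.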

ARNiTECT · Apr 24, 2020

As luck would have it, the Optane 900P 280GB drives are today just £166, so I ordered 2.
This may have been hasty
...either I figure out how to use them as mirrored dedup devs & slog in my primary server, or if it doesn't work out, single L2Arc & slog in 2 servers.

gea · Apr 24, 2020

Optane 900 is perfect.
Even without "official plp" I would expect very good powerloss behavior. In the very first Intel specs, plp was even guaranteed; it was removed later.

A 3rd option would be a pool from a mirror of them for high performance needs. If you create a special vdev mirror, take care with ashift: you must force the same ashift as the pool, mostly ashift=12! Only then can you remove it later if wanted.

Dedup table size is related to the amount of data with a dedup reference and to blocksize, so I would assume the 3TB is relevant, but I have not looked at dedup internals for a long time.

ARNiTECT · Apr 24, 2020

Great.
I will set ashift=12 for special vdev and pool to allow removal.

I understood that if the special vdev for a pool is lost, then the whole pool is down. Mirrored drives give me drive redundancy; however, if I lose a controller or the whole system, PLP would ensure the data is written and hopefully I could just remove the special vdev drives and the rest of the pool and import it all into another server as usual.

I already have a pair of NVMe drives as a high performance pool for VMs. I presume a single 20GB Slog is used across all pools in the system and even this NVMe pool would benefit from the Optane Slog on write-sync.

I'm unsure if I should pass-through the Optanes to OmniOS, like all my other devices, or keep them in ESXi and place vdisks on them.

From a hardware point of view, I have limited PCIe options on my very full motherboard (X11-SCA-F); this is my proposed layout:

Slot x8 - AOC-SHG3-4M2P PLX switch:
> 2x M.2 Corsair MP510 Stripe [Pool1:VMs]
> 2x M.2-U.2 Optane 900P Mirror [slog, special-vdev or L2arc]
Slot x8 - GPU1
x4 PCH:
> Slot x1 - GPU2
> Slot x4 - 9400-16i HBA - 4x 8TB HDD [Pool2:Files/Backup]
> Slot x4 - 2x 10G NIC
> AHCI 8x SATA - 8xHDD [Pool3:Replication]
> PCI 32bit - SATA card - ESXi Datastores
> 10x USB2.0-3.1g2

Is any of this relevant?

ARNiTECT · May 1, 2020

For Slog, L2arc and special vdevs in an AIO, can I pass-through the Optane 900P drives to OmniOS, or is there still a bug preventing this, meaning I must keep them in ESXi as datastores and place vdisks on them? I read some people have had success editing the passthru.map and adding a VM config parameter.

gea · May 2, 2020

In my last tests 900P worked in pass-through mode with ESXi 6.7 unlike some other Intel NVMe.

ARNiTECT · May 4, 2020

Great, I'll try it.

So do you now recommend passing through the Optane 900P drive to OmniOS and partitioning the drive for Slog and L2Arc? Or is it better to use vdisks on an Optane datastore in ESXi?

Pool1. (Fast, virtual desktops for 3D creation & gaming, server VMs)
>Should I expect a performance boost adding a 20GB Optane 900p slog to my sync:enabled 2x Corsair 2TB MP510 NVMe drives?
>would L2arc on the Optane benefit? It looks like Special vdevs are no use here.

Pool2. (Large, for 'file storage' vdisks on NFS datastores, NFS backup repository for Veeam, SMB shares, minIO S3, iSCSI)
>Are any of these types of shares more at risk of data corruption and should have sync:write enabled? eg, if there is a failure during a write to an NFS share, the whole share would be corrupted, but a failure when writing to an SMB only affects the files being written? I could then balance the risk/performance.
>Would a special vdev (metadata/smallio) or L2arc benefit here?

Pool3: (Replication target for disaster recovery, for rebuilding Pool1 or 2, or importing pool into another machine)
>Should I enable sync here? I expect neither special vdevs nor L2arc are required here.

OmniOS to have 48GB RAM

Optane partitions:

1x Optane 900p (no special vdevs):
partitions: 3x 20GB slogs for each pool and 200GB for L2arcs

2x Optane 900p (special vdevs):
A drive partitions: 3x 20GB slogs and 200GB for special vdev mirror
B drive partitions: 1x 60GB for L2arc and 200GB for special vdev mirror
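A quick sanity check of the partition plans above, as a small Python sketch. The 280 GB figure is the nominal drive size mentioned in the thread; usable capacity after partitioning overhead is not modelled, and the layout keys are just labels chosen for this example.

```python
DRIVE_GB = 280  # nominal Optane 900P size from the thread; usable space may be lower

layouts = {
    "1x Optane": {"slog_pool1": 20, "slog_pool2": 20, "slog_pool3": 20, "l2arc": 200},
    "2x Optane, drive A": {"slog_pool1": 20, "slog_pool2": 20, "slog_pool3": 20,
                           "special_mirror": 200},
    "2x Optane, drive B": {"l2arc": 60, "special_mirror": 200},
}


def used_gb(layout):
    """Sum of all partition sizes in one layout (GB)."""
    return sum(layout.values())


def fits(layout, drive_gb=DRIVE_GB):
    """True if the partitions fit on one drive of the given nominal size."""
    return used_gb(layout) <= drive_gb


if __name__ == "__main__":
    for name, layout in layouts.items():
        print(f"{name}: {used_gb(layout)} GB used, fits: {fits(layout)}")
    # Both halves of a mirror need the same size.
    a = layouts["2x Optane, drive A"]["special_mirror"]
    b = layouts["2x Optane, drive B"]["special_mirror"]
    print("special vdev mirror halves match:", a == b)
```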

gea · May 5, 2020

- With 48 GB RAM I would not expect too much from an L2Arc. You can check arcstat to verify.

- Only for VM storage do you need or want to enable sync. If you simply use ZFS without foreign filesystems on it (VMs), ZFS will never corrupt on a crash due to Copy On Write. So for regular filer use, disable sync unless you want maximum security for small files, like on a mailserver.

- If you want superior data security, you should care about powerloss protection (pool and slog). Only the datacenter Optane 4800 has plp guaranteed by Intel, but I would expect very uncritical behaviour from the Optane 900, though not from desktop SSDs/NVMes.

I would simply add some RAM and disable sync for all filesystems but critical VMs. An additional Optane Slog to an NVMe pool without plp may help a little regarding performance and security but in the end it is "neither fish nor fowl". I would simply enable sync for critical filesystems on pool1 and check performance. If you want more security and performance, add a smaller pool from an Optane mirror where you put critical VMs onto.

Main use case for special vdev is a large disk based pool where you want a better overall performance or force some filesystems to the special vdev.

Using a vmdk on an Optane is a good option due to the extraordinary performance of Optane, especially as it makes handling of small partitions easier. For a simple high performance Optane mirror, prefer pass-through.
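For the arcstat check suggested in this post, the number of interest is the ARC hit ratio. A minimal Python sketch of that calculation follows. The counter values would come from arcstat or the ARC kstats, the sample numbers are made up, and the 90% threshold is an arbitrary assumption for illustration, not a recommendation from the thread.

```python
def arc_hit_ratio(hits, misses):
    """Fraction of reads served from ARC, or None if there were no reads yet."""
    total = hits + misses
    if total == 0:
        return None
    return hits / total


def l2arc_worth_a_look(hits, misses, threshold=0.90):
    """True if the hit ratio is below an ASSUMED threshold, hinting at cache pressure."""
    ratio = arc_hit_ratio(hits, misses)
    return ratio is not None and ratio < threshold


if __name__ == "__main__":
    # Hypothetical counters, not measurements from the system in this thread.
    print(arc_hit_ratio(950, 50))
    print(l2arc_worth_a_look(950, 50))
    print(l2arc_worth_a_look(700, 300))
```

A ratio that stays high after the system has been running for a while supports the expectation that an L2Arc adds little with 48 GB of RAM.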

ARNiTECT · May 5, 2020

Thanks gea!

No special vdevs then (I might experiment with dedup on another system one day)
I will check arcstat after the system has been running for a while, before considering adding any L2arc.
After some use, I'll also know if I can spare more ram for OmniOS.

I understood the Corsair MP510 have some PLP, but it sounds like it is not up to the standard of an Optane 900p.

I'll disable sync for the Pool1 virtual desktops for performance and if there is an issue, I'll recover them from snapshots or backups. I might leave sync enabled for the servers and other appliance VMs.

For Pool2, my query about vmdk on NFS shares was relating to the note in Napp-it below sync=disabled: "...very dangerous as ZFS is ignoring the synchronous transaction demands of applications such as databases or NFS" The NFS part confused me, where I have file/folder only NTFS formatted vmdk's on NFS shares. So for example, I deploy a Windows VM with a vmdk for c:\ drive (OS & apps) on a Pool1 NFS share with sync enabled for security, and another vmdk for f:\ drive (files & folders) on a Pool2 NFS share for capacity with sync disabled. The f:\ drive is important and quite large and I would rather not have to spend time recovering this, but if my only risk is the last few files written to the vmdk, then I can manage this, as long as the entire vmdk doesn't get corrupted.

If a vmdk does become corrupt due to a powerloss or hardware failure (assuming the pool still has redundancy), then I understand the ZFS pool is ok due to Copy On Write. Can I then revert the corrupted vmdk to the previous snapshot?

Pool3. Does my disaster recovery replication pool require sync to be enabled for all target filesystems, or just the ones where the sources are sync enabled? It sounds like, due to Copy On Write, I don't need sync enabled as if there is a problem mid-replication, the pool is ok, but the replication would fail and the previous snapshot would still be available.

gea · May 5, 2020

pool2
On a crash during write, the RAM-based write cache is lost unless you enable sync. The file currently being written (if not yet completely in cache) is always lost, regardless of sync. The problem is foreign filesystems (VMs). From the ZFS view they are like a large file. From the VM view they are organized like a disk with data and metadata. On a crash, data and metadata can become corrupt. This is where sync helps to keep the VM filesystem consistent.

This is not related to NFS or SMB. For a simple NFS filer on ZFS you can disable sync as you mostly do with SMB.
On a crash, you can use ZFS snapshots to go back. Main problem: which snap is ok, and from which one onward is it corrupt?

pool 3
This is a backup pool. No need to enable sync. If a replication fails, simply restart and it will create a new snap pair.
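The sync guidance given across these replies can be condensed into a small decision rule. The Python sketch below is one reading of that guidance, not an official recommendation: the function name and the example filesystem descriptions are hypothetical, and "enabled" deliberately leaves open whether that means `sync=standard` or `sync=always` on the ZFS side.

```python
def recommended_sync(holds_foreign_fs, is_replication_target=False):
    """Return 'enabled' or 'disabled' following the rule of thumb in the thread.

    holds_foreign_fs: the filesystem stores VM disks or transactional databases,
    i.e. data with its own metadata that a lost write can corrupt.
    """
    if is_replication_target:
        # A failed replication is simply restarted, so sync is not needed.
        return "disabled"
    return "enabled" if holds_foreign_fs else "disabled"


# Illustrative filesystems based on the pools described in the thread.
examples = {
    "Pool1 NFS share with VM OS disks": recommended_sync(True),
    "Pool2 NFS share with data vmdk": recommended_sync(True),
    "Pool2 plain SMB/NFS filer": recommended_sync(False),
    "Pool3 replication target": recommended_sync(True, is_replication_target=True),
}

if __name__ == "__main__":
    for name, setting in examples.items():
        print(f"{name}: sync {setting}")
```

The share protocol does not appear as an input because, as stated above, the distinction is about what is stored, not about NFS versus SMB.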

herby · May 5, 2020

I'm pretty sure I saw some benchmarks where a pool with LZ4, rather than taking a performance hit, was actually faster than with compression off. Of course that's anecdotal and I'm far from expert. Strangely enough for a major ZFS feature, I've never heard of anyone recommending or using dedup.

ARNiTECT · May 5, 2020

Thanks gea,

Pool2
When you say foreign filesystems (VMs), I understand this includes all vmdk files, whether the vmdk contains an OS, or the vmdk just contains files & folders, as each vmdk is just one large file and therefore in all cases, there is a risk of data & metadata corruption.

I understand a simple NFS Filer is where files are stored directly on the NFS share, files including: iso's, photos, docs etc and here sync can be disabled as it would be with an SMB Filer. When the NFS share contains VM files such as vmdk vmx etc. then this must be sync enabled.

Pool3, thanks

Don.key · May 5, 2020

herby said:
I'm pretty sure I saw some benchmarks where a pool with LZ4, rather than taking a performance hit, was actually faster than with compression off. Of course that's anecdotal and I'm far from expert. Strangely enough for a major ZFS feature, I've never heard of anyone recommending or using dedup.

This is easily the case with spinner based pools if the data can be compressed. Even on NVMe based volumes performance improves if the CPU is able to keep up with compression.

Dedup? Naaah, disk space is cheap and ZFS dedup was/is a can of worms that is better avoided.
