ZFS Allocation Classes / Special VDEV

luckylinux

New Member
Mar 18, 2012
Hi,

I just discovered that ZFS on Linux has supported allocation classes since version 0.8.0 :rolleyes:.
Before learning this, I had already rebuilt my NAS pool to upgrade its layout for better performance (raidz2 -> striped mirrors).

I would like to install a mirrored special VDEV (2-way mirror).

Concerning this new ZFS special VDEV I would like to know some things:
a. How would you add this to the pool and force a "migration" / upgrade / "resync" of all metadata (as well as small files) so that it moves from the slow HDDs to the new fast SSDs? If I have enough space I guess I could achieve this with
Code:
zfs snapshot -r poolname/mydataset@migrate
zfs send -R poolname/mydataset@migrate | zfs receive poolname/mydataset_special

zfs rename poolname/mydataset_special poolname/mydataset
Is there a better way? And would this actually work (e.g. with existing snapshots)?

b. Would you recommend setting a medium value for recordsize / special_small_blocks (around 1M) so that even small files (e.g. Word documents, Excel spreadsheets, ...) get stored on the fast SSD?
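For concreteness, this is roughly what I mean (dataset name is just an example; as I understand it, blocks up to and including special_small_blocks go to the special vdev, so setting it equal to recordsize would push all data there):

```shell
# Example dataset; adjust names to your pool.
# Blocks <= special_small_blocks are stored on the special vdev.
zfs set recordsize=1M poolname/mydataset
# A 128K threshold: small files land on the SSD, large files stay on the HDDs.
# Setting special_small_blocks=1M (== recordsize) would force ALL data to the SSD.
zfs set special_small_blocks=128K poolname/mydataset
```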
c. Which ashift setting should I use? ashift=12 seems to lead to a lot of overhead, so some people suggest ashift=9, which however doesn't seem very stable or recommended, because it differs from the HDD/SSD "native" block size.
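For reference, I understand the ashift of the existing vdevs can be checked like this before adding anything (zdb output format may vary by platform):

```shell
# Show the configured ashift of each vdev in the pool
zdb -C poolname | grep ashift
# On recent OpenZFS, ashift is also exposed as a pool property
zpool get ashift poolname
```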

Would a dedicated SLOG / ZIL device still be recommended in such a setup?

Thank you for your help!
 

gea

Well-Known Member
Dec 31, 2010
DE
a. Any file creation (copy, move, replication between filesystems) will rebalance a pool or fill a special vdev.
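A low-tech sketch of such a rewrite (demo on a scratch directory; on a real pool you would point DIR at the dataset mountpoint, and note that existing snapshots keep the old block copies on the HDDs):

```shell
# Demo on a scratch directory; on a real pool set DIR to the dataset
# mountpoint, e.g. DIR=/poolname/mydataset
DIR=$(mktemp -d)
echo "example" > "$DIR/file.txt"
# Rewrite every regular file so its blocks (and metadata) are reallocated;
# on a pool with a new special vdev, small blocks/metadata then land on it.
find "$DIR" -type f -exec sh -c '
  cp -a -- "$1" "$1.tmp" && mv -- "$1.tmp" "$1"
' _ {} \;
cat "$DIR/file.txt"   # content is unchanged, blocks were rewritten
```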

b. A special vdev can hold metadata, dedup data and small I/O. With the small-I/O feature you can control via recordsize whether a whole filesystem is forced onto the special vdev.

See my performance tests in
napp-it // webbased ZFS NAS/SAN appliance for OmniOS, OpenIndiana and Solaris : Manual (Chapter 8)

c. There is still a bug in the special vdev feature (on Illumos and ZoL; FreeBSD does not currently support special vdevs) when you add a special vdev with a different ashift than the pool: at the very least, you then cannot remove the special vdev. This is why all vdevs (normal and special) should use the same ashift.
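So the safe way to add the special mirror is to pin the ashift explicitly (device names are examples; this assumes the pool was created with ashift=12):

```shell
# Add the special mirror with the same ashift as the existing vdevs
zpool add -o ashift=12 poolname special mirror /dev/sda /dev/sdb
```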

d. If you force a filesystem onto a special vdev and enable sync, or if the pool is fast enough for sync writes (e.g. when the special vdev is an Optane mirror), you may not need an Slog; otherwise you want an Slog for sync writes. In any case, make sure the disks handling sync writes (pool or special vdev) have powerloss protection.
 

luckylinux

New Member
Mar 18, 2012
Thank you for your answer gea.

a. I made a mistake in my code. I meant:
Code:
zfs snapshot -r poolname/mydataset@migrate
zfs send -R poolname/mydataset@migrate | zfs receive poolname/mydataset_special
zfs destroy -r poolname/mydataset
zfs rename poolname/mydataset_special poolname/mydataset
b. I saw your tests at http://napp-it.org/doc/downloads/special-vdev.pdf. Even without seeing those tests I think I need this special vdev. Simply browsing through a Samba share is ultra slow, due to the sheer number of small files on that pool.

c. I saw that issue mentioned several times on this forum by you. I start from the assumption that I won't be removing the special vdev at all. Why is it even allowed to remove the special vdev, though? If all the metadata is on it, removing it would seem to leave you without any metadata at all.

d. This is a 4-way stripe of 2-way mirrors of 7200 rpm HDDs. Not that fast. ZFS scrubs were doing between 400 MB/s and 500 MB/s.
This is quite an old system (Intel Xeon v3) and, due to the limited number of PCIe slots, I cannot use an NVMe mirror. I was simply thinking about using a mirror of 2 x Intel S3700 200GB SSDs (hopefully that will be enough), or possibly 2 x Crucial MX500 500GB (although their power loss protection is not that good, i.e. only partial protection is provided). For the SLOG / ZIL (if really needed) I was thinking about the Winkom SLC 32GB that I bought back in 2015 based on your recommendations. Quite old, but it should still be much faster than the HDDs for small files (lots of IOPS). Sequentially the pool can do 500 MB/s, so the ZIL won't help there at all (and neither will the special vdev).

Care to elaborate on the power loss protection part? What exactly is the risk if a sudden power loss occurs? And why does it only matter for sync writes? I can imagine e.g. an NFS client writing data to a share and that data being lost. However, if it's "just" modifying existing data when the power loss occurs, then since ZFS is CoW the old data should still be there (even if the new data is missing or partial).

For the ZIL I went with a single disk in the past, using the Winkom SLC 32GB. I gradually removed it from many systems that are SSD-only, because I feel there is very little need for a dedicated SLOG there. Is a mirrored SLOG (again) recommended nowadays?
 

gea

Well-Known Member
Dec 31, 2010
DE
b.
A special vdev helps with small I/O, not small files as such. All writes to ZFS go to the RAM-based write cache and are then written out as large, fast blocks of recordsize; only the small overhead of data that does not fill a full recordsize block ends up as small I/O. This changes when you force a filesystem to use the special vdev for all I/O via a recordsize setting below the small-block threshold. You may see a small improvement from metadata being on the special vdev. For active data (in the ARC) more RAM may be more helpful; for random access on a large pool a special vdev is superior. The multithreaded, kernel-based SMB server on Solarish systems may also be an option to improve SMB performance over Samba.
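To verify what actually lands on the special vdev once it is set up, per-vdev allocation can be inspected (pool and dataset names are examples):

```shell
# -v lists capacity per vdev, including the special mirror
zpool list -v poolname
# Confirm the properties steering small blocks to the special vdev
zfs get recordsize,special_small_blocks poolname/mydataset
```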

c.
A vdev removal (you can remove Slogs, L2ARC, mirrors and special vdevs) copies its content back to the pool (the other vdevs). To be safe, just force the special vdev to the same ashift as the other vdevs.
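For completeness, a removal sketch (the vdev name is whatever zpool status shows for the special mirror; zpool wait requires a recent OpenZFS):

```shell
zpool status poolname           # find the vdev name, e.g. mirror-1
zpool remove poolname mirror-1  # evacuates the special vdev back to the pool
zpool wait -t remove poolname   # block until the evacuation finishes
```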

d.
All writes go to the RAM-based write cache (up to several GB). On a crash during a write, these already committed writes are lost. This does not affect ZFS consistency but can result in corrupted files and data loss. A large file in mid-transfer is always lost (only a client application like Word can mitigate this with temp files), but small files already completely in the cache, transactional databases and VM storage are also affected. Since a VM is a single large file from the filesystem's point of view, there is a good chance of corrupting a VM on a crash, as ZFS cannot guarantee atomic writes (data + metadata) inside a guest filesystem. That guarantee comes with sync write, where every committed write is logged on stable storage and replayed to the pool on the next import. For a regular NFS/SMB filer, if you can accept losing a few small files on a crash, disable sync.

If you just enable sync, this logging is done on-pool in an area not affected by fragmentation and optimized for small I/O. You can instead use an Slog, a device better suited to this type of workload.
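A sketch with example device names (sync is a per-dataset property, so you only pay the cost where the guarantee matters):

```shell
# Mirrored Slog built from two fast devices with powerloss protection
zpool add poolname log mirror /dev/nvme0n1 /dev/nvme1n1
# Force sync writes on the datasets that need the guarantee (e.g. VM storage)
zfs set sync=always poolname/vmstore
```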

A typical Raid-10-style pool of mirrored disks may give a sequential performance of, say, 400 MB/s. If you enable sync without an Slog, sync write performance may drop to 40 MB/s. With a good Slog, e.g. an Optane 48xx/9xx, sync write performance may be in the 300 MB/s range; with an S3700 it may be around 100 MB/s.

As sync write is a security measure against data loss from a crash/power outage, bought at the price of much lower performance, it would not be wise to accept the performance degradation without gaining the security. If the Slog (or the pool, when you just enable sync without an Slog) does not offer powerloss protection, it cannot protect against data loss, which makes the whole sync idea rather pointless.

An Slog mirror helps in two ways.
If an Slog dies without a crash, ZFS simply reverts to on-pool logging (ZIL); the mirror helps avoid the performance degradation in that situation.

If an Slog dies during a crash, its content is lost for good; the mirror protects against that.
 