ZFS Allocation Classes // performance benchmarks

gea · Oct 12, 2019

Open-ZFS Allocation classes are a new vdev type to hold dedup tables, metadata, small io or single filesystems.

They should offer redundancy comparable to that of the pool. Besides basic vdevs (not recommended, as losing the disk means losing the pool), you can use n-way mirrors. With several mirrors, the load is distributed across them.

I have run some performance benchmarks:
http://napp-it.org/doc/downloads/special-vdev.pdf

I am really impressed by the results, as this allows you to use a slow disk pool and decide per ZFS filesystem, based on the "recsize" vs "special_small_blocks" settings, whether the data of this filesystem lands on the special vdev, e.g. an Intel Optane.
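As a rough illustration of that per-filesystem decision, the sketch below classifies a filesystem by comparing the two properties. It assumes the OpenZFS rule that data blocks no larger than special_small_blocks are allocated on the special vdev, and that a value of 0 means metadata only. The function name and the example filesystems are hypothetical.

```python
K = 1024


def special_vdev_role(recordsize, special_small_blocks):
    """Describe what a filesystem stores on the special vdev (sizes in bytes)."""
    if special_small_blocks == 0:
        # Default setting: only metadata goes to the special vdev.
        return "metadata only"
    if special_small_blocks >= recordsize:
        # No data block can exceed recordsize, so every block is "small".
        return "metadata and all data"
    return "metadata and small blocks"


# Hypothetical filesystems: (name, recordsize, special_small_blocks)
examples = [
    ("tank/archive", 128 * K, 0),
    ("tank/home", 128 * K, 32 * K),
    ("tank/vm", 32 * K, 32 * K),
]
for name, recsize, small in examples:
    print(name, "->", special_vdev_role(recsize, small))
```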

gea · Oct 16, 2019

Update

Dedup and special vdevs are removable vdevs. Removal works only when all vdevs in the pool have the same ashift setting, e.g. ashift=12, which is best for 4k disks.

At least in current Illumos there is a problem: a pool crashes (and is corrupted) when you try to remove a special vdev from a pool with different ashift settings, e.g. a pool with ashift=12 vdevs and a special vdev with ashift=9. In current napp-it-dev I therefore set ashift=12 instead of "auto" as the default when creating or extending a pool.

If you want to remove a special or dedup vdev, first check the ashift setting of all vdevs (menu Pool, click on the datapool). I have sent a mail to illumos-dev and hope that this bug is fixed before the next OmniOS stable.

If you create or extend a pool, I suggest making sure that all vdevs use the same ashift. When you try to remove a regular vdev (e.g. basic, mirror) from a pool whose vdevs have different ashift values, the removal stops with a message that this cannot be done due to different ashift settings (but without a crash as with special vdevs).
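The pre-check described above can be sketched as follows. This is only an illustration of the rule "remove a vdev only if every vdev has the same ashift". The vdev names and ashift values are made up, and reading the real values from a pool is left out.

```python
def ashift_consistent(vdev_ashift):
    """Return True when every vdev in the mapping uses the same ashift."""
    return len(set(vdev_ashift.values())) <= 1


def mismatched_vdevs(vdev_ashift):
    """List vdevs whose ashift differs from the most common value."""
    values = list(vdev_ashift.values())
    if not values:
        return []
    common = max(set(values), key=values.count)
    return [name for name, shift in vdev_ashift.items() if shift != common]


# Hypothetical pool layout: data mirrors with ashift=12, special vdev with ashift=9
pool = {"mirror-0": 12, "mirror-1": 12, "special-0": 9}

if ashift_consistent(pool):
    print("same ashift everywhere, removal should be possible")
else:
    print("do not remove a vdev, ashift differs on:", mismatched_vdevs(pool))
```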

gea · Oct 19, 2019

A fix for the special vdev remove bug is under way
Bug #11851: ZFS special vdev ashift mismatch causes panic on removal - illumos gate - illumos

gea · Oct 21, 2019

Just for completeness, for those using ZoL:
panic when removing vdev from pool with different-ashift special/dedup vdev · Issue #9363 · zfsonlinux/zfs

gea · Oct 26, 2019

In my first tests, I compared the performance of a slow disk-based basic pool with and without an Optane added as special vdev. As expected, the filesystem on the special vdev was superior.

In a second round, I tried a faster pool (Multi-Raid 10), and this time the results need a more differentiated view: in a benchmark situation the disk pool was faster without the special vdev than with it (an Intel P3600-400 in this case).

I have updated http://napp-it.org/doc/downloads/special-vdev.pdf

Allocation Classes
Content:

1. About Allocation Classes
2. Performance of a slow diskbased pool
3. With special vdev (metadata only)
4. With special vdev (for a single filesystem)
5. With special vdev (for a single filesystem) and Slog (Optane)
6. Performance of a fast diskbased pool
7. Fast diskbased pool with special vdev
8. NVMe Pool vs special vdev (same NVMe)
9. Compare Results
10. Conclusion
11. When is a special vdev helpful
12. When not
13. General suggestions

edit: I have added NVMe pool vs special vdev from same NVMe

Rand__ · Oct 27, 2019

Thanks for answering my untold question

gea · Oct 28, 2019

The more I play with special vdevs, the more fun and questions I have.

I have now added a benchmark comparing a pure Optane 900 basic pool, a disk pool with Optane as special vdev and Slog, and a disk pool with an affordable NVMe as special vdev and a small Optane as Slog.

Conclusion
A huge and cheap disk pool paired with affordable SSD/12G SAS/NVMe special vdev mirrors for metadata and selected filesystems, plus a small, expensive Slog (e.g. 4801x-100/200, WD SS530), allows you to build a single multi-purpose pool that offers huge capacity and, where needed, superior performance at a decent price.

geppi · Nov 19, 2019

@gea In your benchmark report you write:

"Result compared to 2.) is very different.
As the filesystem recordsize is equal to the special_small_block size, all data land on the Optane. This is why you want this feature, to decide if a filesystem writes to regular vdevs or the special vdev."

Well, not really. If I want all data of a filesystem to go to a particular drive configuration, I set up a pool with that drive configuration, create the filesystem on it and call it a day.

The idea behind the special_small_blocks property is to divide the data into one part that fits the performance characteristics of the pool's normal vdev configuration and another part that is better served by the pool's special vdev configuration. With recordsize=128K it would therefore be reasonable to set, for example, special_small_blocks=32K. The actual value for special_small_blocks depends on the amount of data you expect in the filesystem with this block size or smaller. You don't want to fill the special vdev beyond the usual 80%, because block allocation on it would then suffer from the usual ZFS problems of metaslab spacemap loading and gang block creation. If the special vdev were 100% full, data would go to the normal vdevs anyhow. So depending on the size of the special vdev, your data pattern and the performance characteristics, the value for special_small_blocks could be smaller than 32K or a little bigger, but definitely not 128K for a filesystem with this recordsize.
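To make that sizing argument concrete, here is a minimal sketch that picks the largest threshold which keeps the special vdev at or below 80% full. It assumes you already have an estimate of how much data is stored per block size. The histogram, the capacity and the metadata figure are invented for illustration, and the function name is hypothetical.

```python
K = 1024
GiB = 1024 ** 3


def pick_special_small_blocks(capacity, metadata_bytes, bytes_by_blocksize, max_fill=0.8):
    """Return the largest block-size threshold that keeps fill <= max_fill.

    bytes_by_blocksize maps a block size to the total bytes stored in blocks
    of exactly that size. A result of 0 means "metadata only".
    """
    best = 0
    used = metadata_bytes
    for size in sorted(bytes_by_blocksize):
        used += bytes_by_blocksize[size]
        if used / capacity > max_fill:
            break  # including this block size would overfill the special vdev
        best = size
    return best


# Invented example: 400 GiB special vdev, 50 GiB of metadata
histogram = {
    4 * K: 20 * GiB,
    16 * K: 60 * GiB,
    32 * K: 120 * GiB,
    64 * K: 300 * GiB,
    128 * K: 4000 * GiB,
}
threshold = pick_special_small_blocks(400 * GiB, 50 * GiB, histogram)
print(threshold // K, "K")  # 32 K with these numbers
```

With these made-up numbers, 32K keeps the special vdev at about 63% full, while 64K would push it well past 100%.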

Now for the actual benchmarks. The disk pool you use in the majority of your tests has a normal vdev configuration of 3-way mirrors in a 5 wide stripe. This is a pretty performant disk pool that trades storage efficiency (only 33%) for increased fault tolerance and higher performance. An alternative for configuring those 15 disks would be for example to create a single raidz3 pool with a storage efficiency of 80%. The reason not to use this kind of configuration is in most cases the low random r/w performance of such a pool. This is exactly what allocation classes and in particular the special_small_blocks setting promise to change.
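The storage efficiency figures quoted above follow from simple arithmetic, sketched here. The calculation ignores padding, metadata and other real-world overhead.

```python
def mirror_efficiency(ways):
    """Usable fraction of raw capacity for n-way mirrors (any stripe width)."""
    return 1 / ways


def raidz_efficiency(disks, parity):
    """Usable fraction of raw capacity for one raidz vdev, ignoring padding."""
    return (disks - parity) / disks


# 15 disks as five 3-way mirrors vs. one 15-disk raidz3
print(f"3-way mirrors: {mirror_efficiency(3):.0%}")    # 33%
print(f"raidz3 of 15:  {raidz_efficiency(15, 3):.0%}")  # 80%
```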

It would be very interesting to see the performance numbers for such a raidz3 pool without a special vdev compared to the same pool with a special vdev made from a mirror of SSDs with special_small_blocks set to e.g. 16K or 32K.

On the issue of measuring random r/w performance, I would say that you should in general disable all caching, because otherwise you will at least partially just measure cache performance. For measuring sequential r/w performance it is sufficient to run streaming benchmarks long enough that the accumulated data overflows the caches, but for random r/w benchmarking you would have to take more sophisticated measures.

However, at the end of the day all those benchmarks are somewhat artificial, and the real benefit (or lack of it) would only show in a real-world multi-user scenario with mixed access patterns, like office applications and video streaming running in parallel on the same pool.

BoredSysadmin · Nov 19, 2019

8. Pool with NVMe vs special vdev same NVMe than the fromer special vdev

I assume typo, should be former

gea · Nov 19, 2019

This is how I see, or assume, special vdevs to work (as there is very little information around).

The Intel concept of special vdevs is a new idea for designing datapools from a mix of disks with different performance levels, such as mechanical disks and NVMe. Its main purpose is to improve overall performance, but it can also be used to force the most critical data onto the faster disks, as a more intelligent alternative to classical data-tiering storage where you move critical data between a faster and a slower part of an array.

Some aspects are quite obvious but hard to measure with a classical benchmark. One is the metadata aspect. Metadata for active data is in the RAM cache, so there is no benefit when it is stored on faster vdevs. But in use cases with many users and volatile data, much of the metadata is not in cache. When all metadata is on a special vdev, you can read and write it faster, which should improve overall read/write performance. To measure this you would need a special benchmark that constantly reads and writes random data for many users.

Another point is the small blocks aspect for a whole pool. When you have, for example, a filesystem recsize of the default 128k and a special small block size of 32k, I would expect a write performance improvement for the whole pool, but only for a very small amount of data. When you write, for example, 544k of data with enough RAM, it goes to the RAM-based write cache first and is then flushed to disk as datablocks in recsize (128k, or smaller when compressed). In this example that means writing 4 datablocks of 128k and one block of 32k. Only the last block (where the disk pool is slow) lands on the special vdev. For reads, the situation depends very much on whether the data is already in the Arc; if it is not, reading the 32k block is faster from the special vdev than from the regular vdevs. The effect of this on the whole pool is also not easy to measure, as it requires a test scenario similar to the one for the metadata improvement. But as metadata is around 1% and small blocks with a size smaller than recsize make up only a small amount of data, better overall pool performance can be achieved with a mirror of quite small NVMe as special vdev. This is the focus of most articles I have read.
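The 544k example can be worked through with a short sketch. It follows the simplified model used in this post: a write is cut into recsize blocks plus one smaller tail block, and a block no larger than the small block size goes to the special vdev. Real on-disk block sizes also depend on compression and on how ZFS handles the last record of a file, so treat this only as an illustration. The function names are hypothetical.

```python
K = 1024


def split_into_blocks(write_size, recsize):
    """Cut a write into full recsize blocks plus one smaller tail block."""
    full, rest = divmod(write_size, recsize)
    blocks = [recsize] * full
    if rest:
        blocks.append(rest)
    return blocks


def route_blocks(blocks, special_small_blocks):
    """Pair each block with the vdev class it would land on in this model."""
    return [
        (size, "special" if 0 < size <= special_small_blocks else "regular")
        for size in blocks
    ]


blocks = split_into_blocks(544 * K, 128 * K)
for size, target in route_blocks(blocks, 32 * K):
    print(f"{size // K}k -> {target}")

on_special = sum(size for size, target in route_blocks(blocks, 32 * K) if target == "special")
print(f"share on special vdev: {on_special / (544 * K):.1%}")
```

In this model four 128k blocks go to the regular vdevs and the single 32k block goes to the special vdev, which is about 5.9% of the written data.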

The above is not an alternative to classic data tiering, where you want to massively improve performance for a special class of data. You have mentioned a second pool built from faster SSD/NVMe for this. This is the traditional ZFS way.

What I mainly want to show with my tests is that special vdevs can achieve tiering-like functionality in a very flexible way without moving data. If you set a filesystem to a recsize of up to 128k (smaller is good for some data like VMs) and set the small block size for this filesystem to the same value, you force all datablocks of this filesystem onto the special vdev. If this is no longer needed, you can disable small blocks for this filesystem, and all subsequent changes go to the slower pool, which empties the special vdev again unless snapshots keep the old blocks in place.

A nearly full pool or vdev is uncritical in general, and with NVMe not very performance relevant (with Optane a nearly full disk behaves much like an empty one). When a special vdev gets full, new data lands on the slower vdevs, which only means lower performance. To guarantee superior performance for special data, you only need large enough special vdevs. This is a matter of calculating the size depending on the use case. Nobody said that the special vdev size cannot be 30% of a pool when you want 30% of the pool (or more exactly, some filesystems) to be much faster. For me this aspect is at least as important as the global pool improvement aspect.

@BoredSysadmin
thanks, fixed

Rand__ · Feb 12, 2021

Hey @gea ,

this has been around a while in the Solarish world (it is new to FreeBSD/TrueNAS), so maybe you have more experience and can answer my question.
I run an NVMe pool with an NVDIMM Slog and plenty of memory; do you think that pool could benefit from a special vdev (Optane mirror) for metadata? From the usual understanding it just might (assuming that Optane is faster than the NVMe pool, but slower than the NVDIMM).

Just wondering, since I have some 900Ps doing nothing at this point and wondered whether I could find a use for them.

gea · Feb 13, 2021

In my tests the special vdev is superior on a slower pool when you force a filesystem onto it. For dedup I would also expect an advantage over RAM-based dedup tables as long as you do not have a huge amount of RAM.

In my tests, a special vdev for small io and metadata gave no improvement in benchmarks; it was sometimes even slower. My explanation is that for current data, e.g. while running benchmarks, the metadata is already in the RAM cache, which is even faster than the special vdev. A special vdev can only give an improvement for metadata that has to be read anew. In terms of workload, this may be the case on a mail server with many users and millions of files.

A benchmark to test this would probably require a script that generates many small files randomly. Then reboot to empty the RAM cache and maybe copy all files to /dev/null with a time counter. Then redo the same with a special vdev, which would mean all metadata is on it, which may improve read performance.
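A minimal sketch of such a script could look like this. It is not a finished benchmark: the file count, file size and directory layout are arbitrary assumptions, the demo writes to a temporary directory instead of a real ZFS filesystem, and the reboot between the write and read phases has to be done by hand.

```python
import os
import tempfile
import time
from pathlib import Path


def create_small_files(root, count, size=4096):
    """Write `count` files of `size` random bytes, spread over 100 subfolders."""
    for i in range(count):
        folder = root / f"d{i % 100:02d}"
        folder.mkdir(parents=True, exist_ok=True)
        (folder / f"f{i:07d}.bin").write_bytes(os.urandom(size))


def timed_read_all(root):
    """Read every generated file and return (bytes_read, seconds)."""
    start = time.perf_counter()
    total = 0
    for path in root.rglob("*.bin"):
        total += len(path.read_bytes())
    return total, time.perf_counter() - start


if __name__ == "__main__":
    # Placeholder target: replace with a directory on the filesystem under test.
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp)
        create_small_files(target, count=1000)
        # On a real test: reboot here so that the RAM cache is empty,
        # then run only the read phase.
        total, seconds = timed_read_all(target)
        print(f"read {total} bytes in {seconds:.3f} s")
```

Run the read phase once on the pool without a special vdev and once with it, and compare the two times.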

The test with the Optane special vdev probably wins by a small margin. I see it mainly as an option for disk-based pools with a huge performance gap to the special vdev and high volatility of data.

Rand__ · Feb 13, 2021

Thanks,
that's similar to what I deduced from your excellent performance tests: a cheap(er) way to speed up slower pools, but not really useful for already fast pools.
