Use case and performance considerations for an OmniOS/OpenIndiana/Solaris based ZFS server
This is what I am asked quite often.
If you simply want the best performance, durability and security, order a server with a very new CPU with a frequency > 3GHz and 6 cores or more, 256 GB RAM and a huge Flash only storage with 2 x 12G multipath SAS (10dwpd) or NVMe in a multi mirror setup - with datacenter quality powerloss protection to keep data safe on a powerloss during writes or background garbage collection. Do not forget to order twice as you need a backup at a second location, at least for a disaster like fire, theft or ransomware.
Maybe you can follow this simple suggestion, but mostly you are looking for a compromise between price, performance and capacity under a given use scenario. Be aware that when you define two of the three parameters, the third is a result of your choice, e.g. low price + high capacity = low performance.
When your main concern is a well balanced workable solution, you should not start with a price restriction but with your use case and the needed performance for that (low, medium, high, extreme). With a few users and mainly office documents, your performance need is low. Even a small server with a 1.5 GHz dualcore CPU, 4-8 GB RAM and a mirror from two SSDs or HDs can be good enough. Add some external USB disks for a rolling daily backup and you are ready.
If you are a media firm with many users that want to edit multitrack 4k video from ZFS storage, you need an extreme solution regarding pool performance (> 1GB/s sequential read/write), network (multiple 10G) and capacity according to your needs. Maybe you come to the conclusion to prefer a local NVMe for hot data and a medium class disk based storage for shared file access and versioning only. Do not forget to add a disaster backup solution.
After you have defined the performance class/use case (low, medium, high, extreme), select the needed components.
For lower performance needs and 1G networks, you can skip this. Even a cheap dual/quadcore CPU is good enough. If your performance need is high or extreme with a high throughput in a 10G network or when you need encryption, be aware that ZFS is quite CPU hungry, as you can see in https://www.napp-it.org/doc/downloads/epyc_performance.pdf. If you have the choice, prefer higher frequency over more cores. If you need sync write (VM storage or databases), avoid encryption as encrypted small sync writes are always very slow, and add an Slog for disk based pools.
Solaris based ZFS systems are very resource efficient due to the deep integration of iSCSI, NFS and SMB into the Solaris kernel that was developed around ZFS from the beginning. You need less than 3 GB for a 64bit Solaris based OS itself to be stable with any pool size. Use at least 4-8 GB RAM to allow some caching for low to medium needs with only a few users.
As ZFS uses most of the RAM (unless it is dynamically demanded by other processes) for ultrafast read/write caching to improve performance, you may want to add more RAM. Per default Open-ZFS uses 10% of RAM for write caching. As a rule of thumb you should collect all small writes < 128k in the rambased write cache as smaller writes are slower or very slow. As you can only use half of the write cache before the content must be written to disk, you want at least 256k write cache that you can have with 4 GB RAM in a single user scenario. This RAM need for write caching scales with the number of users that write concurrently, so add around 0.5 GB RAM per active concurrent user.
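The rule of thumb above can be turned into a quick estimate. The following Python sketch is only an illustration: the 10% write cache share, the "only half is usable" rule and the 0.5 GB per concurrent user are taken from the text, while the function names and the reading of 4 GB as the base value that already covers the first user are my own assumptions.

```python
# Rough RAM estimate for an Open-ZFS server, based on the rules of thumb in the text.
# Function names and the 4 GB base value are illustrative assumptions.

WRITE_CACHE_SHARE = 0.10   # Open-ZFS default: about 10% of RAM for write caching
BASE_RAM_GB = 4.0          # assumed starting point that covers a single user
RAM_PER_USER_GB = 0.5      # extra RAM per additional concurrent writer


def usable_write_cache_gb(ram_gb: float) -> float:
    """Only about half of the write cache can fill before it is flushed to disk."""
    return ram_gb * WRITE_CACHE_SHARE / 2


def suggested_ram_gb(concurrent_writers: int) -> float:
    """Base RAM plus 0.5 GB for every concurrent writer beyond the first."""
    extra_users = max(concurrent_writers - 1, 0)
    return BASE_RAM_GB + RAM_PER_USER_GB * extra_users


if __name__ == "__main__":
    for users in (1, 10, 50):
        ram = suggested_ram_gb(users)
        print(f"{users:3d} writers: {ram:5.1f} GB RAM, "
              f"usable write cache {usable_write_cache_gb(ram):.2f} GB")
```

This covers write caching only. RAM wanted for read caching comes on top.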
Oracle Solaris with native ZFS works differently. The rambased write cache holds the last 5s of writes and can consume up to 1/8 of total RAM. In general this often leads to similar RAM needs as OI/OmniOS with Open-ZFS. On a faster 10G network with a max write of 1 GB/s this means 8GB RAM min + RAM wanted for read caching.
Most of the remaining RAM is used for ultrafast rambased read caching (Arc). The read cache works only for small io and is optimized for most recently and most frequently read data. Large files are not cached at all. Cache hits are therefore for metadata and small random io. Check napp-it menu System > Basic Statistic > Arc after some time of storage usage. Unless you have a use scenario with many users, many small files and a high volatility (e.g. a larger mailserver), cache hit rate should be > 80% and metadata hit rate > 90%. If results are lower you should add more RAM or use high performance storage like NVMe where caching is not so important.
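To make the two thresholds concrete, here is a minimal Python sketch that computes a hit rate from hit/miss counters and checks it against the 80% and 90% values from the text. The function names are hypothetical and the counters in the usage example are made-up sample numbers, not real measurements. Take real values from the napp-it Arc statistic.

```python
# Check ARC hit rates against the thresholds mentioned in the text.
# Function names are illustrative; sample counters are made up.

def hit_rate(hits: int, misses: int) -> float:
    """Return the hit rate in percent (0.0 if there was no access at all)."""
    total = hits + misses
    return 100.0 * hits / total if total else 0.0


def needs_more_ram(cache_rate: float, metadata_rate: float) -> bool:
    """True when a rate is below the rule of thumb (80% overall, 90% metadata)."""
    return cache_rate < 80.0 or metadata_rate < 90.0


if __name__ == "__main__":
    overall = hit_rate(hits=900, misses=100)
    metadata = hit_rate(hits=950, misses=50)
    print(f"cache hit rate {overall:.1f}%, metadata hit rate {metadata:.1f}%")
    print("add RAM or faster storage" if needs_more_ram(overall, metadata) else "ok")
```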
If you read about 1GB RAM per TB storage, forget this. It is a myth unless you activate rambased realtime dedup (not recommended at all, or when dedup is needed use fast NVMe as a special vdev mirror for dedup). Needed RAM size depends on number of users, files or wanted cache hit rate, not on pool size.
L2Arc is an SSD or at best an NVMe that can be used to extend the rambased Arc. L2Arc is not as fast as RAM but can increase cache size when more RAM is not an option or when the server is rebooted more often, as L2Arc is persistent. As L2Arc needs RAM to organize, do not use more than say 5x RAM as L2Arc. Additionally you can enable read ahead on L2Arc that may improve sequential reads a little (add "set zfs:l2arc_noprefetch=0" to /etc/system or use napp-it System > Tuning).
RAM can help a lot to improve ZFS performance with the help of read/write caching. For larger sequential writes and reads or many small io it is only raw storage performance that counts. If you look at the specs of disks, the two most important values are the sequential transfer rate for large transfers and the iops that count when you read or write small datablocks.
On mechanical disks you find values of around 200-300 MB/s max sequential transfer rate and around 100 iops. As a Copy on Write filesystem like ZFS is not optimized for a single user/single datastream load, it spreads data quite evenly over the pool for the best multiuser/multithread performance. It is therefore affected by fragmentation with many smaller datablocks spread over the whole pool, where performance is limited more by iops than by sequential values. On average use you will often see no more than 100-150 MB/s per disk. When you enable sync write on a single mechanical disk, write performance is not better than say 10 MB/s due to the low iops rating.
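Why a low iops rating caps throughput so hard can be shown with simple arithmetic: when every io needs its own head positioning, throughput is roughly iops x block size. The Python sketch below uses the 100 iops from the text. The block sizes are my own example values and the function name is hypothetical.

```python
# Throughput of an iops-bound disk: every io moves one block.
# 100 iops is the value from the text; block sizes are example assumptions.

def iops_limited_mib_per_s(iops: int, block_kib: int) -> float:
    """Upper bound in MiB/s when each io transfers a single block."""
    return iops * block_kib / 1024


if __name__ == "__main__":
    for block_kib in (4, 32, 128):
        rate = iops_limited_mib_per_s(100, block_kib)
        print(f"100 iops x {block_kib:3d} KiB = {rate:6.2f} MiB/s")
```

Even with 128 KiB per io a 100 iops disk stays in the range of the roughly 10 MB/s mentioned above, far below its sequential rating.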
Desktop Sata SSDs can achieve around 500 MB/s (6G Sata) and a few thousand iops. Often iops values from specs are only valid for a short time until performance drops down to a fraction on steady writes.
Enterprise SSDs can hold their performance and offer powerloss protection (PLP). Without PLP the last writes are not safe on a power outage during write, and neither is data already on disk that is touched by background operations like the firmware based garbage collection that keeps SSD performance high.
Enterprise SSDs are often available as 6G Sata or 2 x 12G multipath SAS. When you have an SAS HBA, prefer 12G SAS models due to the higher performance (up to 4x faster than 6G Sata) and as SAS is full duplex while Sata is only half duplex. SAS also has a more robust signalling with up to 10m cable length (Sata 1m). The best of all SAS SSDs can achieve up to 2 GB/s transfer rate and over 300k iops on steady 4k writes. SAS is also a way to easily build a storage with more than 100 hotplug disks with the help of SAS expanders.
NVMes are the fastest option for storage. The best like Intel Optane 5800x rate at 1.6M iops and 6.4 GB/s transfer rate. In general desktop NVMe lack powerloss protection and cannot hold their write iops on steady writes, so prefer datacenter models with PLP. While NVMe are ultrafast, it is not as easy to use many of them as each wants a 4x pci lane connection (pci-e card, M.2 or oculink/U.2 connector). For a larger capacity SAS storage is often nearly as fast and easier to implement, especially when hotplug is wanted. NVMe is perfect for a second smaller high performance pool for databases/VMs or to tune a ZFS pool with an Slog for faster sync write on disk based pools, a persistent L2Arc or a special vdev mirror.
ZFS Pool Layout
ZFS groups disks to a vdev and stripes several vdevs to a pool to improve performance or reliability. While a ZFS pool from a single disk vdev without redundancy performs as described above, a vdev from several disks can behave better.
Raid-0 pool (ZFS always stripes data over vdevs in a raid-0)
You can create a pool from a single disk (this is a basic vdev) or a mirror/raid-Z vdev and add more vdevs to create a raid-0 configuration. In theory overall read/write performance is number of vdevs x performance of a single vdev, as each must only process 1/n of the data. Real world performance does not reach the full factor n. Expect more like a factor of 1.5 to 1.8 where the math says 2, depending on disks or disk caches, and the gain decreases with more vdevs. Keep this in mind when you want to decide if ZFS performance is "as expected".
A pool from a single n-way mirror vdev
You can mirror two or more disks to create a mirror vdev. Mostly you mirror to improve data security, as write performance of an n-way mirror is equal to a single disk (a write is done when it is on all disks). As ZFS can read from all disks simultaneously, read performance and read iops scale with n. When a single disk rates at 100 MB/s and 100 iops, a 3way mirror can give up to 300 MB/s and 300 iops. If you run a napp-it Pool > Benchmark with a singlestream read benchmark vs a fivestream one, you can see the effect. In a 3way mirror any two disks can fail without a dataloss.
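The scaling of an n-way mirror can be written down as a small Python sketch. It only restates the best case math from the text (reads scale with n, writes behave like a single disk, n-1 disks may fail). The class and function names are hypothetical and real results will be lower.

```python
# Best case estimate for a single n-way mirror vdev, following the text.
# DiskSpec and mirror_estimate are illustrative names, not a napp-it or ZFS API.
from dataclasses import dataclass


@dataclass
class DiskSpec:
    mb_per_s: float   # sustained transfer rate of one disk
    iops: float       # iops rating of one disk


def mirror_estimate(disk: DiskSpec, ways: int) -> dict:
    """Reads can be served by all disks, a write must land on every disk."""
    if ways < 2:
        raise ValueError("a mirror needs at least two disks")
    return {
        "read_mb_per_s": disk.mb_per_s * ways,
        "read_iops": disk.iops * ways,
        "write_mb_per_s": disk.mb_per_s,
        "write_iops": disk.iops,
        "tolerated_failures": ways - 1,
    }


if __name__ == "__main__":
    print(mirror_estimate(DiskSpec(mb_per_s=100, iops=100), ways=3))
```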
A pool from multiple n-way mirror vdevs
Some years ago a ZFS pool from many striped mirror vdevs was the preferred method for faster pools. Nowadays I would use mirrors only when one mirror is enough or when an easy extension to a later Raid-10 setup, e.g. from 4 disks, is planned. If you really need performance, use SSD/NVMe as they are by far superior.
A pool from a single Z1 vdev
A Z1 vdev is good to combine up to say 4 disks. Such a 4 disk Z1 vdev gives the capacity of 3 disks. One disk of the vdev is allowed to fail without a dataloss. Unlike other raid types like raid-5, a read error in a degraded Z1 does not mean a lost pool but only a file reported as damaged, namely the one that is affected by the read error. This is why Z1 is much better than raid-5 and named differently. Sequential read/write performance of such a vdev is similar to a 3 disk raid-0 but iops is only like a single disk (all heads must be in position prior to an io).
A pool from a single Z2 vdev
A Z2 vdev is good to combine say 5-10 disks. A 7 disk Z2 vdev gives the capacity of 5 disks. Any two disks of the vdev are allowed to fail without a dataloss. Unlike other raid types like raid-6, a read error in a totally degraded Z2 does not mean a lost pool but only a file reported as damaged, namely the one that is affected by the read error. This is why Z2 is much better than raid-6 and named differently. Sequential read/write performance of such a vdev is similar to a 5 disk raid-0 but iops is only like a single disk (all heads must be in position prior to an io).
A pool from a single Z3 vdev
A Z3 vdev is good to combine say 11-20 disks. A 13 disk Z3 vdev gives the capacity of 10 disks. Any three disks of the vdev are allowed to fail without a dataloss. There is no equivalent to Z3 in traditional raid. Sequential read/write performance of such a vdev is similar to a 10 disk raid-0 but iops is only like a single disk (all heads must be in position prior to an io).
A pool from multiple raid Z[1-3] vdevs
Such a pool stripes the vdevs, which means sequential performance and iops scale with the number of vdevs (not linearly, similar to the raid-0 degression with more disks).
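The capacity figures for raid-Z vdevs follow one simple rule: number of disks minus parity disks. The Python sketch below reproduces the three layouts discussed above (4 disk Z1, 7 disk Z2, 13 disk Z3). Names are hypothetical, and the result ignores ZFS overhead such as padding and reserved space, so treat it as a rough estimate only.

```python
# Data disks and raw usable capacity of a raid-Z vdev (overhead ignored).
# PARITY and raidz_vdev are illustrative names, not a ZFS API.

PARITY = {"z1": 1, "z2": 2, "z3": 3}


def raidz_vdev(level: str, disks: int, disk_tb: float) -> dict:
    """Return data disks, rough capacity and tolerated disk failures."""
    parity = PARITY[level]
    if disks <= parity:
        raise ValueError(f"{level} needs more than {parity} disks")
    data_disks = disks - parity
    return {
        "data_disks": data_disks,            # sequential rate like a raid-0 of this size
        "capacity_tb": data_disks * disk_tb,
        "tolerated_failures": parity,        # iops stay at single disk level
    }


if __name__ == "__main__":
    for level, disks in (("z1", 4), ("z2", 7), ("z3", 13)):
        print(level, disks, raidz_vdev(level, disks, disk_tb=10.0))
```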
Many small disks vs fewer larger disks
Many small disks can be faster but are more power hungry. As the performance improvement is not linear and the failure rate scales with the number of parts, I would always prefer fewer but larger disks. The same is true for the number of vdevs. Prefer a pool from fewer vdevs. If you have a pool of say 100 disks and an annual failure rate of 5%, you have 5 bad disks per year. If you assume a resilver time of 5 days per disk, you can expect 3-4 weeks per year where a resilver is running with a noticeable performance degradation.
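The failure estimate is plain arithmetic and easy to redo for your own pool. A minimal Python sketch with the numbers from the text (100 disks, 5% annual failure rate, 5 days resilver per disk). The function names are hypothetical and the model assumes that resilvers do not overlap.

```python
# Expected disk failures and resilver time per year.
# Simple expectation values, assuming one resilver at a time.

def expected_failures_per_year(disks: int, annual_failure_rate: float) -> float:
    """Average number of failed disks per year."""
    return disks * annual_failure_rate


def resilver_weeks_per_year(disks: int, annual_failure_rate: float,
                            resilver_days: float) -> float:
    """Weeks per year in which a resilver is running."""
    failures = expected_failures_per_year(disks, annual_failure_rate)
    return failures * resilver_days / 7


if __name__ == "__main__":
    print(expected_failures_per_year(100, 0.05))          # about 5 disks
    print(round(resilver_weeks_per_year(100, 0.05, 5), 1))  # about 3.6 weeks
```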
Some high end storages offer tiering where active or performance sensitive files can be placed on a faster part of an array. ZFS does not offer traditional tiering but you can place critical data on a faster special vdev of a ZFS pool based on its physical size (small io), its type (dedup or metadata) or the recsize setting of a filesystem. The main advantage is that you do not need to copy files around, so this is often a superior approach as mostly the really slow data is data with a small physical file or block size. As a lost vdev means a lost pool, always use special vdevs in an n-way mirror. Use the same ashift as all other vdevs (mostly ashift=12 for 4k physical disks) to allow a later removal of the special vdev.
To use a special vdev, use menu Pools > Extend and select a mirror (best a fast SSD/NVMe mirror with PLP) with type=special. Allocations in the special class are dedicated to specific block types. By default this includes all metadata, the indirect blocks of user data, and any deduplication tables. The class can also be provisioned to accept small file blocks. This means you can force all data of a certain filesystem to the special vdev when you set the ZFS property "special_small_blocks", e.g. special_small_blocks=128K for a filesystem with a recsize setting smaller or equal. In such a case all small io and some critical filesystems are on the faster vdev, all others on the regular pool. If you add another special vdev mirror, load is distributed over both vdevs. If a special vdev is too full, data is stored on the other slower vdevs.
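The placement rule can be summarized in a few lines of Python. This is a strongly simplified model of the behaviour described above, not the ZFS allocator: the function name, the block kinds and the "full" flag are hypothetical, and indirect blocks are treated as metadata.

```python
# Simplified model: which blocks land on the special vdev?
# Illustrative only; names and parameters are assumptions, not a ZFS API.

KIB = 1024


def goes_to_special_vdev(block_kind: str, block_size: int,
                         special_small_blocks: int = 0,
                         special_vdev_full: bool = False) -> bool:
    """block_kind is 'metadata', 'dedup' or 'data'; sizes are in bytes."""
    if special_vdev_full:
        return False                      # falls back to the regular vdevs
    if block_kind in ("metadata", "dedup"):
        return True                       # default content of the special class
    # data blocks qualify only up to the special_small_blocks threshold
    return block_kind == "data" and 0 < block_size <= special_small_blocks


if __name__ == "__main__":
    limit = 128 * KIB                     # special_small_blocks=128K
    print(goes_to_special_vdev("data", 64 * KIB, limit))     # small block
    print(goes_to_special_vdev("data", 1024 * KIB, limit))   # large block
```

With a recsize of 128K or less and special_small_blocks=128K, every data block of that filesystem passes the size check, which is how a whole filesystem ends up on the special vdev.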
With ZFS all writes always go to the rambased write cache (there may be a direct io option in a future ZFS) and are written to the pool as a fast large transfer with a delay. On a crash during write the content of the write cache is lost (up to several MB). Filesystems on VM storage or databases may get corrupted. If you cannot allow such a dataloss you can enable sync write for a filesystem. This forces every committed write to be logged immediately to a faster ZIL area of the pool or to a fast dedicated Slog device that can be much faster than the pool ZIL area, and additionally in a second step to be written as a regular cached write. Every bit that you write is written twice, once directly and once collected in the write cache. This can never be as fast as a regular write via the write cache. So an Slog is not a performance option but a security option when you want acceptable sync write performance. The Slog is never read except after a power outage to redo missing writes on next reboot, similar to the BBU protection of hardware raid.
Add an Slog only when you need sync write and buy the best that you can afford regarding low latency, high endurance and 4k write iops. The Slog can be quite small (min 10GB). Intel datacenter Optane models are widely used.
Besides the above "physical" options you have a few tuning options. For faster 10G+ networks you can increase tcp buffers or NFS settings in menu System > Tuning. Another option is Jumboframes that you can set in menu System > Network Eth, e.g. to a "payload" of 9000. Do not forget to set all switches to the highest possible mtu value or at least to 9126 (to include ip headers).
Another setting is ZFS recsize. For VM storage with filesystems on it I would set it to 32K or 64K (not lower as ZFS becomes inefficient then). For media data a higher value of 512K or 1M may be faster.