ZFS sync settings with PLP SSDs and Optane


DanielWood

Member
Sep 14, 2018
I'm a bit confused on ZFS sync when it comes to PLP arrays.

I'm setting up a bunch of HGST S840 2TB SAS drives and have been playing around with 6x of them in a RAID-Z1 to create an 11TB NFS SAN backend for my ESXi hosts over 4x 10GbE.

With sync=always I get approximately 100MB/s write speeds on the array. With sync=disabled I get approximately 200MB/s.
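For reference, I've been toggling sync per dataset for these tests roughly like this (the pool and dataset names are just placeholders for my setup):

    # assuming a pool "tank" with an NFS-exported dataset "tank/vmstore" (placeholder names)
    zfs set sync=always tank/vmstore     # every committed write is logged via the ZIL
    zfs set sync=disabled tank/vmstore   # never log, rely on the RAM write cache only
    zfs set sync=standard tank/vmstore   # default: honour whatever the client requests
    zfs get sync tank/vmstore            # confirm the current setting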

These numbers seem off, considering I get pretty similar results on a single drive, but I'm going to pull things apart tomorrow and make sure it's not a performance issue caused by PCIe passthrough or something else.

The question I have is: since these disks have PLP in the form of massive capacitors, do I need sync=always, or is sync=disabled fine? This will be a VM storage backend, so integrity is important. If the answer is that I need sync=always, I'll just follow @gea's recommendation and buy an Optane 800P for ZIL and call it a day.

I guess the second part of the question is: will the 800P ZIL act as an extended write cache? Meaning, if I were to write over 100GB of long sequential writes and the backend couldn't keep up, would it just accumulate on the 118GB 800P until it could flush to disk?
 

gea

Well-Known Member
Dec 31, 2010
For VM storage you mainly need IOPS.
As a RAID-Z1 has the same IOPS as a single disk, I would prefer a multi-mirror (RAID-10) setup, as in that case read IOPS scale with the number of disks and write IOPS with the number of mirrors.
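As a rough sketch (pool and device names are only examples), that layout is a stripe of mirror vdevs:

    # striped mirrors ("RAID-10"): three 2-way mirror vdevs (example device names)
    zpool create tank mirror da0 da1 mirror da2 da3 mirror da4 da5
    # compare: one RAID-Z1 vdev over the same six disks, with the IOPS of a single disk
    # zpool create tank raidz1 da0 da1 da2 da3 da4 da5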

On writes, ZFS always uses RAM as a write cache (OpenZFS default: 10% of RAM, max 4GB) to do large, fast sequential writes and to avoid small, slow random writes. On a crash the content of this cache is lost. That does not affect ZFS filesystem consistency thanks to Copy on Write, but when you use ZFS as VM storage there are other filesystems on top of ZFS, and those guest filesystems may become corrupted on a crash.
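On Linux with OpenZFS, you can check the size of this RAM write cache via the module parameters (a sketch; the exact defaults depend on the ZFS version):

    cat /sys/module/zfs/parameters/zfs_dirty_data_max          # write cache limit in bytes
    cat /sys/module/zfs/parameters/zfs_dirty_data_max_percent  # limit as a percentage of RAM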

To protect the content of the RAM-based write cache you can enable sync write. This logs every single committed write (which still goes to the RAM-based write cache) either to an on-pool ZIL or to a dedicated Slog device. The Slog allows you to use specialized disks optimized for the log load (ultra-low latency, high steady write IOPS, PLP). Think of the Slog like the battery unit on a hardware RAID controller.
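As a sketch (pool and device names are only examples), the Slog is just a log vdev added to the pool:

    zpool add tank log nvme0n1                   # dedicated Slog device
    # zpool add tank log mirror nvme0n1 nvme1n1  # or mirrored, for redundancy of the log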

Unlike L2ARC, which is an SSD/NVMe extension of the RAM-based ARC read cache, the Slog is not an extension of the RAM-based write cache. It only protects the content of the RAM-based write cache. The content of the ZIL/Slog is never read except after a crash, to redo missing but committed writes on the next bootup.
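The distinction also shows in how the devices are attached (again with example names):

    zpool add tank cache nvme2n1   # cache vdev = L2ARC, extends the ARC read cache
    # compare the log vdev above, which only holds committed writes until they reach the pool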

Intel Optane (800P and up) is a perfect Slog for a SoHo/lab setup. If this is a production setup with heavy writes you may prefer the 900P due to its higher write endurance. For a critical production system you may even prefer the 4800-series Optane, as Intel guarantees PLP only on those models (while I would expect very good PLP behaviour from Optane in general).
 

DanielWood

Member
Sep 14, 2018
gea said:
For VM storage you mainly need IOPS.
As a RAID-Z1 has the same IOPS as a single disk, I would prefer a multi-mirror (RAID-10) setup, as in that case read IOPS scale with the number of disks and write IOPS with the number of mirrors.
That is already under consideration for when I reconfigure things this week to figure out why I'm only getting single-disk write performance (single-disk IOPS is expected, but I expected sequential writes to scale somewhat with more drives). As it stands, I'll probably switch to 3x mirrors; that knocks me down to 6TB, but it also makes scaling up really simple in that I just add more mirrors (the R720 is my SAN, so I'll have another 10 bays free).
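Scaling up later would then just be adding another mirror vdev to the pool (device names are placeholders):

    zpool add tank mirror da6 da7   # grows the stripe by one more mirror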


gea said:
On writes, ZFS always uses RAM as a write cache (OpenZFS default: 10% of RAM, max 4GB) to do large, fast sequential writes and to avoid small, slow random writes. On a crash the content of this cache is lost. That does not affect ZFS filesystem consistency thanks to Copy on Write, but when you use ZFS as VM storage there are other filesystems on top of ZFS, and those guest filesystems may become corrupted on a crash.
And this was the key bit of information I was looking for. While I could technically tolerate this in the environment it's intended for (24 hourly snapshots, so just roll back to the last one), I would rather have the set-and-forget of a SLOG.
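(For reference, that fallback would just be the usual snapshot rollback; the snapshot name here is hypothetical:)

    zfs list -t snapshot tank/vmstore                    # find the last good hourly snapshot
    zfs rollback -r tank/vmstore@hourly-2018-09-14-0300  # roll back, destroying any newer snapshots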

gea said:
To protect the content of the RAM-based write cache you can enable sync write. This logs every single committed write (which still goes to the RAM-based write cache) either to an on-pool ZIL or to a dedicated Slog device. The Slog allows you to use specialized disks optimized for the log load (ultra-low latency, high steady write IOPS, PLP). Think of the Slog like the battery unit on a hardware RAID controller.

Unlike L2ARC, which is an SSD/NVMe extension of the RAM-based ARC read cache, the Slog is not an extension of the RAM-based write cache. It only protects the content of the RAM-based write cache. The content of the ZIL/Slog is never read except after a crash, to redo missing but committed writes on the next bootup.

Intel Optane (800P and up) is a perfect Slog for a SoHo/lab setup. If this is a production setup with heavy writes you may prefer the 900P due to its higher write endurance. For a critical production system you may even prefer the 4800-series Optane, as Intel guarantees PLP only on those models (while I would expect very good PLP behaviour from Optane in general).
I doubt I'm going to come close to the 365 TBW rating before this system is retired in another 2-3 years. While it will potentially be 50-200 VMs in the coming months, the overall write load will likely be far under 1TB/day; on an average day it will probably not even break 200GB. (Which is a good thing, because the off-site backup that will be getting all those snapshots is on the other end of a 100Mbps link, so anything more than about 1TB/day means backlogs.)
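Back-of-the-envelope with those numbers, assuming roughly all of that write load is sync and hits the Slog:

    echo $((200 * 365 * 3)) GB   # ~219000 GB ≈ 219 TB over 3 years, well under the 800P's 365 TBW rating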

That said, if we start pushing higher, I can always swap in a 900P.

Many thanks, @gea .