@mattventura - I know all this.
It is and always will be. If you give ZFS bare access to an HDD:
1. It disables the disk drive's write cache, which it should. RAID controllers do the same.
2. But with bare access, ZFS is now writing directly to the disk with no buffer in between. With a RAID controller, the DRAM cache is the buffer (and it's power-protected).
The apples-to-apples comparison here would be to use SSDs with PLP (power-loss protection) as your SLOG. That way, data is written to a capacitor-backed DRAM cache, and the drive can safely finish writing it to flash in the event of a power loss.
Agreed. But that's my point: a power-protected NVMe drive is no different than a power-protected controller cache, is it?
They fill the same role, yes. But being able to directly expose the drive to ZFS is preferable for performance.
But that's no different than the failure domain of the NVMe drive (in this case), is it? Unless you mean a mirrored NVMe SLOG vs. a non-mirrored RAID controller?
Yes, you should mirror your SLOG to avoid having a single point of failure which could cause you to lose unwritten data in the event of a power loss. The same is true if you use a single non-redundant RAID controller - you can lose data if the RAID controller itself dies.
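For a concrete sketch of that mirrored-SLOG setup (the pool name and device paths here are hypothetical; substitute your own):

```shell
# Add a mirrored log vdev to an existing pool named "tank".
# /dev/nvme0n1 and /dev/nvme1n1 are placeholder PLP NVMe device paths.
zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1

# Confirm the log mirror appears in the pool layout:
zpool status tank
```

If one log device dies, ZFS keeps writing to the surviving half of the mirror, so a single SLOG failure no longer puts in-flight sync writes at risk.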
It's not about buggy software. If I'm going to run ZFS in my use case, it has to run with sync=always. The data is realtime and fast (which is why it runs in an Equinix DC with very low latency to the NYSE; the NBBO feed can approach 40 Gbps) and has to be written reliably.
Let me explain this in more detail.
When an application writes data, it can do so synchronously or asynchronously. When an application performs a synchronous write, it expects the write call not to return until the data is truly committed (written to permanent storage or a battery-backed cache). This matters for things like databases, where the DB must not tell the client that data is committed unless it really is. Async writes, on the other hand, return control to the program immediately. This is useful when the program is dealing with less-important or temporary data and can tolerate losing it, or for things like simple log files, where you don't want to stall the program just to write a log line.
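As a minimal sketch of that distinction at the application level (file names here are arbitrary), the same data can be written either way on a POSIX system:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "journal.log")

# Synchronous write: O_SYNC makes each write() block until the data
# is on stable storage, so the call returning means "committed".
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o644)
os.write(fd, b"committed record\n")
os.close(fd)

# Asynchronous (normal buffered) write: returns as soon as the data
# is in the OS page cache; a crash before writeback loses it.
with open(path, "ab") as f:
    f.write(b"best-effort log line\n")
    # An application that needed durability here would instead call:
    #   f.flush(); os.fsync(f.fileno())

print(os.path.getsize(path))  # both records visible: prints 38
```

Under sync=always, ZFS would treat even the second, buffered-style write as if the application had asked for the first kind.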
By default, i.e. `sync=default`, ZFS treats a synchronous write as complete when it has been written to the actual storage devices, or, if a SLOG device is present, to the SLOG. Similarly, a RAID card with a battery- or supercap-backed write cache reports a sync write as "done" once it is in the cache. ZFS treats async writes as complete as soon as they are in the RAM-based write cache. Async writes never actually hit the SLOG; ZFS writes them directly to main storage. It doesn't try to use the SLOG as an accelerator for them, because nothing is waiting for the write to complete.
If you set sync=always, nothing changes for sync writes. All it does is make ZFS treat async writes as sync writes too. Likewise, setting sync=disabled causes ZFS to treat sync writes as if they were async (dangerous, but it has some niche uses). Thus, the only reason to use sync=always is if your application really should be using sync writes, but is actually performing async writes.
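Concretely (the pool and dataset names here are hypothetical), the property is set per dataset:

```shell
# Force every write on this dataset through the sync path (ZIL/SLOG):
zfs set sync=always tank/marketdata

# Check the current value and where it was inherited from:
zfs get sync tank/marketdata
```

Because the property is per dataset, you can confine the sync=always cost to the one dataset holding the latency-critical data.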
No solution, not even hardware RAID with a BBU, will prevent all "data loss". The point isn't to never lose data - it's to never lose data that the storage layer claimed was committed. Similarly, the purpose of a database transaction is not to guarantee that the data will be written - it's to ensure that if the DB says the data was written, it actually was.