ESXi 5.5 and ZFS - sync, async NFS + iSCSI powerloss data security

Discussion in 'Solaris, Nexenta, OpenIndiana, and napp-it' started by J-san, Dec 6, 2014.

  1. J-san

    J-san Member

    Joined:
    Nov 27, 2014
    Messages:
    66
    Likes Received:
    42
    Here's a table showing power-loss data safety by datastore type for ZFS-backed NFS and iSCSI storage, as well as for storage connected/mounted directly by guest VMs.

    Why you should care:

    Operating systems, filesystems, and databases perform sensitive writes in "write-through" mode, synchronously: the call waits until the write has succeeded and the data is on stable storage before returning to the calling code. These "write-through" writes are tagged as such so that they bypass volatile disk write caches and are saved directly to disk.
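    To illustrate, the difference between an async write and a "write-through" write is visible at the system-call level. A minimal sketch on Linux using GNU dd (the temp file name is arbitrary):

```shell
# Same 4 KiB write issued two ways (GNU coreutils dd; illustrative only).
f=$(mktemp)

# Async (default): dd returns as soon as the data is in the OS page cache.
# A power loss before the kernel flushes the cache loses the data.
dd if=/dev/zero of="$f" bs=4k count=1 status=none

# Write-through: oflag=sync opens the file O_SYNC, so each write blocks
# until the device reports the data is on stable storage.
dd if=/dev/zero of="$f" bs=4k count=1 oflag=sync status=none

stat -c %s "$f"
```

    The second dd is noticeably slower on a disk with its cache honoring sync; that latency is exactly what a battery-backed cache or ZIL device hides.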

    If important filesystem data is written and the "write-through" (sync) request isn't honored by the disks or RAID controller, then guest filesystem corruption or fatal filesystem damage is possible (and probable) on a loss of power. Traditional true "hardware" RAID controllers will usually acknowledge such writes immediately (i.e. lie to the requesting system), but they contain a battery-backed cache to preserve the important data until power is restored. Once power is restored, the cache writes out its contents to the stable storage disks, preserving filesystem integrity.

    Datastore connections from ESXi to ZFS storage are used to save and access virtual machine disk files (VMDKs), which contain the filesystems and data of each virtual machine. In ESXi, VMDKs are often written to a datastore that is mounted over the NFS or iSCSI network protocols.

    When a guest writes to a VMDK stored on a ZFS filesystem that is set to sync=disabled for the ESXi NFS datastore:
    - The virtual machine's OS writes some sensitive filesystem data, which ESXi sends to the datastore.
    - The write is received via NFS or iSCSI on Solaris, FreeBSD, or ZoL.
    - The sensitive write data ends up in volatile memory, in ZFS's asynchronous write queue on the receiving end (or in the iSCSI target's volatile write cache).
    - Everything is fine, until power is lost and it's not.
    - The ZFS filesystem that the VMDK files are stored on is fine.
    - But the guest OS filesystem inside the VMDK could have missed vital data being written and could now be corrupted.
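    For reference, the ZFS sync behaviour described above is controlled per filesystem. A hedged sketch, assuming a pool/filesystem named tank/nfs-ds (names are illustrative):

```shell
# Check the current setting on the datastore's filesystem
zfs get sync tank/nfs-ds

# Honor each write's own sync/async request (ESXi requests sync over NFS)
zfs set sync=standard tank/nfs-ds

# Force every write to be committed to stable storage (safest, slowest)
zfs set sync=always tank/nfs-ds

# Acknowledge everything immediately from RAM (fast, unsafe on power loss)
zfs set sync=disabled tank/nfs-ds
```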

    Disclaimers:

    1) I define "corrupted" as the filesystem inside the VMDK becoming corrupted from the guest's perspective.

    2) Async writes 'lost' are those that have not been committed to stable storage (e.g. writes in the writeback cache pending flush, or flushed async lazy writes).

    If you have updates or corrections, please post below and I'll update the image.

    [​IMG]


    Guest-connected iSCSI disks with the write cache enabled (and zfs sync=standard on the iSCSI LU) are more similar to a non-virtualized OS connected to a single physical disk with a non-battery-backed cache. All file writes to these disks are written asynchronously, and unless written in "write-through" mode by the OS or database/program, any in-flight data will be lost if the power is lost.

    Note: If using VM guest iSCSI for data disks, you must ensure that the iSCSI write cache settings in the guest are set up correctly and are supported. Otherwise "write-through" sensitive writes may also be lost on power loss, causing potential data loss and/or filesystem corruption.
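    On a Linux guest, the disk write cache state referred to above can be inspected with sdparm (the SCSI WCE bit); the device path /dev/sdb is hypothetical:

```shell
# Show whether the drive's volatile write cache (WCE bit) is enabled
sdparm --get=WCE /dev/sdb

# Clear WCE so the disk commits writes to stable storage before acking
sdparm --clear=WCE /dev/sdb
```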

    Note 2: ESXi VM snapshots will not cover guest iSCSI filesystems, as they are external to the VM's VMDK file; a ZFS snapshot must also be coordinated.

    Note 3: Quiescing guest-mounted iSCSI disks is required for snapshots. Otherwise, a ZFS snapshot of an iSCSI disk mounted by a powered-on guest may contain inconsistent data (similar to a corrupted filesystem). You must power down the VM that has the iSCSI disk mounted, or use some other method of guaranteeing that no writes will be made. (The ZFS filesystem snapshot itself will be fine, but when the snapshotted iSCSI disk is mounted from a client OS, its filesystem may be in an inconsistent state.)
     
    #1
    Last edited: Dec 27, 2014
    lmk, poto, wlee and 4 others like this.
  2. Biren78

    Biren78 Active Member

    Joined:
    Jan 16, 2013
    Messages:
    550
    Likes Received:
    94
    Nice table! Thanks
     
    #2
  3. Patrick

    Patrick Administrator
    Staff Member

    Joined:
    Dec 21, 2010
    Messages:
    11,543
    Likes Received:
    4,467
    I like this guide. It may help (also possible for a main site post) to have a bit on why it is important and to define each column.
     
    #3
  4. J-san

    J-san Member

    Joined:
    Nov 27, 2014
    Messages:
    66
    Likes Received:
    42
    I expanded the post with the reasoning behind why it's so important. If you have any corrections or suggestions for the preamble or table, let me know.
     
    #4
    Patrick likes this.
  5. TuxDude

    TuxDude Well-Known Member

    Joined:
    Sep 17, 2011
    Messages:
    615
    Likes Received:
    336
    I'm pretty sure that ESX does not force all writes to sync mode on NFS. I haven't verified it recently, but at least on 5.0 and earlier I know that async NFS to ESX is not safe.
     
    #5
  6. J-san

    J-san Member

    Joined:
    Nov 27, 2014
    Messages:
    66
    Likes Received:
    42
    Hi TuxDude, I guess it isn't clear, but I'm talking about async writes from within a VM to its VMDK on an NFS-mounted datastore: when the backing ZFS filesystem's sync attribute is set to standard or always, those writes are safe because they are treated as sync writes (at least in ESXi 5.5).

    I updated the image to make the different types of writes generated from a VM guest's perspective (either async or "write-through") to the VM's VMDK clearer.
     
    #6
    Last edited: Dec 7, 2014
  7. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,241
    Likes Received:
    742
    Hello J-San
    Thanks, everyone using storage should be aware of such problems

    I want to add the following:
    - This is not originally a ZFS problem but a general one.
    Every good hardware RAID controller, and many operating systems and disk tools, allow you to enable/disable write-back caching, so a user can choose between security and performance.

    See a nice article about disk caching in general:
    Write-through, write-around, write-back: Cache explained

    - Caches are often cascaded, from OS caching through RAID-controller caching to disk caching.
    If you disable all OS caching and use sync writes, but disk caching is still active (and your disk is not powerloss-safe, for example a desktop SSD with a cache but without powerloss protection), you may end up with a corrupted filesystem anyway, despite all secure ZFS sync and Comstar writeback settings, when not using a powerloss-safe ZIL.

    - If you want to avoid the powerloss problem on writes with older non-copy-on-write filesystems, a hardware RAID controller with a BBU can help. While a BBU mostly targets the write-hole problem of inconsistent RAID sets, you must combine it there with a safe write-back cache setting. With copy-on-write filesystems like ZFS, you can use ZIL logging, either on the pool or on a dedicated, faster ZIL device, to get an even higher level of protection. But for really critical things, you should also use powerloss-safe disks, e.g. enterprise SSDs with powerloss protection, or a dedicated ZIL.

    - The writeback/sync/cache behaviour of ZFS is a little different from the write-back settings in an OS such as Windows, because ZFS always converts many small random writes into a faster, cached sequential write to increase performance. This is done even with sync=always. The ZIL that is activated in that case is not a write cache but a log device that logs and commits every single write; it is only read in case of a crash, on the next boot. This is why you can achieve high write performance even with sync enabled, given a fast, powerloss-safe dedicated ZIL.
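    As a concrete example of the dedicated ZIL (SLOG) gea describes, the log device is attached with zpool; the pool and device names below are hypothetical:

```shell
# Attach a mirrored, powerloss-protected SLOG to pool "tank"
# (device names are illustrative; use your own enterprise SSDs)
zpool add tank log mirror c4t0d0 c4t1d0

# Confirm the log vdev is present and healthy
zpool status tank
```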

    - A corrupted filesystem can occur on a sudden powerloss or shutdown, but it does not have to.
    For home setups with backups, a cheap UPS can be an alternative.
    For critical production systems, secure write behaviour is a must.

    - Regarding write-back settings vs sync settings:
    I also do not know whether the iSCSI writeback setting overrides or uses the sync setting of the underlying/parent ZFS filesystem, but I would expect these cache settings to work in a cascaded way.

    If the LU is a file, I would expect that sync=always or sync=standard must be used, while sync=disabled makes the writeback setting irrelevant.

    If you use a zvol, I would expect the parent ZFS filesystem's setting to be inherited, so sync=disabled should not be used there.

    And last, do not forget the write-back behaviour of the disk itself. The ZFS iSCSI setting reflects the usual disk setting, but the write-back setting of the real underlying disks stays active. The danger of the disk cache is not as serious, as that cache is only a fraction of the size of the other caches. If you really care, you must use powerloss-protected enterprise SSDs, or a dedicated powerloss-safe ZIL to log writes that may otherwise be lost on the pool.
     
    #7
    Last edited: Dec 7, 2014
  8. TuxDude

    TuxDude Well-Known Member

    Joined:
    Sep 17, 2011
    Messages:
    615
    Likes Received:
    336
    I did a bit of searching around, and it seems that Solaris (and its free clones) has no concept of sync vs async at the NFS layer, which Linux (what I am used to using) does. So for most ZFS-backed shared storage setups around here it is not something to worry about, though ZoL may be different; I haven't done the research there. I know that at least with ext4, xfs, btrfs, etc. on Linux, if you set up your NFS share with the 'async' option, then writes are cached in the storage host's RAM and immediately acknowledged, and data loss can occur if the storage host loses power or crashes before those writes are pushed down to disk.
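    The Linux-side behaviour described above is controlled per export in /etc/exports; a sketch with illustrative paths and subnet:

```shell
# /etc/exports (illustrative entries)
# "sync": the server commits data to stable storage before replying
/export/vmstore   192.168.1.0/24(rw,sync,no_subtree_check)
# "async": the server replies before data hits disk; in-flight writes
# are lost on a server crash or power loss, even if the client asked
# for a stable (sync) write
/export/scratch   192.168.1.0/24(rw,async,no_subtree_check)
```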
     
    #8
  9. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,241
    Likes Received:
    742
    On Solaris, sync is a ZFS filesystem property, just like NFS or CIFS sharing, and I would expect ZoL to behave the same way with ZFS. You cannot set sync per protocol. Usually you leave sync at standard, where the writing application decides whether to use sync or not; for example, the CIFS server writes async, while ESXi over shared NFS always requests sync writes.
     
    #9
  10. J-san

    J-san Member

    Joined:
    Nov 27, 2014
    Messages:
    66
    Likes Received:
    42
    Hi TuxDude, yes, I believe at the NFS layer in Solaris a write is either sync or not based upon the underlying ZFS filesystem's "sync" property.

    Setting sync=always on the ZFS filesystem that's shared over NFS causes all writes to be treated as sync/write-through, even if the write was not originally requested as such; with sync=standard, ESXi's NFS writes still end up synchronous because ESXi requests sync for them.

    I see you are talking about a Linux client OS (a guest VM or physical machine) mounting an NFS filesystem exported in "async" mode from a Linux server host. Yes, any writes made to this "async"-exported filesystem will be immediately OK'd as written, even though they are just in RAM on the server host.

    In this situation the Linux client OS can write in two modes, "async" or requesting "sync/write-through", to the NFS-mounted filesystem, and the server will say OK, it's written to disk (I swear!), but it really isn't safe from power loss or a crash.
    -----

    In the graphic, I'm trying to show that when ESXi mounts the datastore for VMDK files over NFS from Solaris, you must be careful setting the ZFS 'sync' property on the datastore's filesystem, and to show the implications of those settings.
     
    #10
    Last edited: Dec 27, 2014