OpenZFS (OmniOS) write throttle tuning - NFS benchmarking 4k@QD32

J-san

Member
Nov 27, 2014
67
42
18
40
Vancouver, BC
I just noticed something that may help others.

ZFS write throttling can kick in to slow down your write performance.

I ran into this while benchmarking when I reduced the memory given to my OmniOS VM from 16GB down to 13GB.

The pool I was benchmarking is a stripe of two mirrors of 2TB RE4 drives, with a 100GB S3700 SSD as a separate log (SLOG) device:
Code:
tank
mirror-0
  c21t50014EE05926E121d0  RE4 2TB
  c20t50014EE25F104CBEd0  RE4 2TB
mirror-1
  c13t50014EE2B50E9766d0  RE4 2TB
  c14t50014EE25FB90889d0  RE4 2TB
logs
  c17t55CD2E404B65494Ed0  100GB S3700 SSD
with sd.conf tuning:
Code:
# DISK tuning
# Set correct non-volatile settings for Intel S3500 + S3700 SSDs
# WARNING: Do not set this for any other SSDs unless they have powerloss protection built-in
# WARNING: It is the equivalent to running zfs with sync=disabled if your SSD does not have powerloss protection.
sd-config-list=
"ATA  INTEL SSDSC2BB48", "physical-block-size:4096, cache-nonvolatile:true, throttle-max:32, disksort:false",
"ATA  INTEL SSDSC2BA10", "physical-block-size:4096, cache-nonvolatile:true, throttle-max:32, disksort:false";
I'm benchmarking over an all-in-one virtual 10G ESXi vSwitch connection (MTU 9000) to the NFS datastore on the pool above.

Compression=off, sync=standard (effectively the same as 'always' for an ESXi 5.5 NFS datastore, since ESXi requests sync writes over NFS)

Benchmarking with:

OmniOS VM memory at 16GB
CrystalDiskMark 3.0.3 x64
Test size: 1000MB
47 MB/s 4K@QD32 write speed

OmniOS VM memory at 13GB
CrystalDiskMark 3.0.3 x64
Test size: 1000MB
20 MB/s 4K@QD32 write speed

I was testing other settings (NUMA) so I didn't realize this was the byproduct of memory changes.
After a while I realized that the memory size had changed so I changed it back.

Performance was back to where it was.
But why did it decrease so much?

Once the amount of dirty data passes a set percentage of zfs_dirty_data_max (60% by default, via zfs_delay_min_dirty_percent), ZFS starts delaying writes. zfs_dirty_data_max itself defaults to 10% of RAM, which is why it changed along with the VM's memory size.
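To put numbers on that, here is a rough Python sketch of how the limit scales with RAM. The tunable names and defaults (10% of RAM, 4GB cap, delays starting at 60%) are what I understand the OpenZFS defaults to be from the blog post linked at the end of this post, so treat them as assumptions and check your own system.

```python
# Sketch of how the write throttle limits scale with RAM.
# The defaults below are assumptions taken from the OpenZFS write throttle
# write-up; verify them on your own system before relying on them.
GIB = 1024 ** 3
MIB = 1024 ** 2

DIRTY_DATA_MAX_PERCENT = 10      # assumed zfs_dirty_data_max_percent default
DIRTY_DATA_MAX_MAX = 4 * GIB     # assumed zfs_dirty_data_max_max default cap
DELAY_MIN_DIRTY_PERCENT = 60     # assumed zfs_delay_min_dirty_percent default


def default_dirty_data_max(physmem_bytes):
    """Default zfs_dirty_data_max: a percentage of RAM, capped."""
    return min(physmem_bytes * DIRTY_DATA_MAX_PERCENT // 100, DIRTY_DATA_MAX_MAX)


def throttle_start(dirty_data_max_bytes):
    """Dirty data level at which write delays begin."""
    return dirty_data_max_bytes * DELAY_MIN_DIRTY_PERCENT // 100


for ram_gib in (13, 14, 16):
    dmax = default_dirty_data_max(ram_gib * GIB)
    print(f"{ram_gib}GB RAM: max {dmax // MIB}MB, "
          f"delays start near {throttle_start(dmax) // MIB}MB")
```

The real value is computed from the physical memory the OS reports, so it can come out a little lower than this estimate.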

So I ran a test with the VM memory set to 14GB, and the result fell between the two above.

I ran a dtrace script while running the 1000MB 4k@QD32 write test:

Code:
~# dtrace -s dirty.d tank
...
  1  4181  txg_sync_thread:txg-syncing  0MB of 1432MB used
  0  4181  txg_sync_thread:txg-syncing  0MB of 1432MB used
  0  4181  txg_sync_thread:txg-syncing  64MB of 1432MB used
  0  4181  txg_sync_thread:txg-syncing  516MB of 1432MB used
  0  4181  txg_sync_thread:txg-syncing  637MB of 1432MB used
  1  4181  txg_sync_thread:txg-syncing  933MB of 1432MB used
  0  4181  txg_sync_thread:txg-syncing  927MB of 1432MB used
  1  4181  txg_sync_thread:txg-syncing  932MB of 1432MB used
  1  4181  txg_sync_thread:txg-syncing  940MB of 1432MB used
  1  4181  txg_sync_thread:txg-syncing  925MB of 1432MB used
  0  4181  txg_sync_thread:txg-syncing  932MB of 1432MB used
  0  4181  txg_sync_thread:txg-syncing  935MB of 1432MB used
  0  4181  txg_sync_thread:txg-syncing  752MB of 1432MB used
  0  4181  txg_sync_thread:txg-syncing  0MB of 1432MB used
It's write throttling as the dirty data gets closer to the limit, which means my RE4 vdevs can't absorb the data ZFS is throwing at them async after the S3700 SSD log device has written and acknowledged it.
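For anyone who wants to check their own dirty.d output, here is a small Python sketch that parses lines like the ones above and estimates the delay the throttle adds per write. The delay formula and the defaults (60% start point, zfs_delay_scale of 500,000ns) are my reading of the write throttle blog post, so they are assumptions, not something I measured.

```python
import re

DELAY_MIN_DIRTY_PERCENT = 60   # assumed default of zfs_delay_min_dirty_percent
DELAY_SCALE_NS = 500_000       # assumed default of zfs_delay_scale

LINE_RE = re.compile(r"(\d+)MB of\s+(\d+)MB used")


def parse_dirty(lines):
    """Return (dirty_mb, max_mb) pairs from dirty.d output lines."""
    pairs = []
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            pairs.append((int(m.group(1)), int(m.group(2))))
    return pairs


def min_delay_ns(dirty_mb, max_mb):
    """Approximate per-write delay: scale * (dirty - start) / (max - dirty)."""
    start = max_mb * DELAY_MIN_DIRTY_PERCENT / 100
    if dirty_mb <= start:
        return 0.0              # below the start point: no delay
    if dirty_mb >= max_mb:
        return float("inf")     # at the limit: writers have to wait
    return DELAY_SCALE_NS * (dirty_mb - start) / (max_mb - dirty_mb)


sample = [
    "  0  4181  txg_sync_thread:txg-syncing  637MB of 1432MB used",
    "  1  4181  txg_sync_thread:txg-syncing  933MB of 1432MB used",
    "  1  4181  txg_sync_thread:txg-syncing  940MB of 1432MB used",
]
for dirty, limit in parse_dirty(sample):
    print(f"{dirty}MB is {100 * dirty / limit:.0f}% of {limit}MB, "
          f"delay about {min_delay_ns(dirty, limit) / 1000:.0f}us per write")
```

The plateau around 930MB is past 60% of 1432MB, which fits with the throttle holding the dirty data there.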

So I temporarily bumped up the max dirty amount to 2495MB so I have lots of room:

Code:
# echo zfs_dirty_data_max/W0t2617101363 | mdb -kw

# dtrace -s dirty.d tank

  0  4181  txg_sync_thread:txg-syncing  0MB of 2495MB used
  1  4181  txg_sync_thread:txg-syncing  64MB of 2495MB used
  1  4181  txg_sync_thread:txg-syncing  567MB of 2495MB used
  1  4181  txg_sync_thread:txg-syncing  667MB of 2495MB used
  0  4181  txg_sync_thread:txg-syncing 1001MB of 2495MB used
  1  4181  txg_sync_thread:txg-syncing 1001MB of 2495MB used
  1  4181  txg_sync_thread:txg-syncing 1001MB of 2495MB used
  0  4181  txg_sync_thread:txg-syncing 1001MB of 2495MB used
  0  4181  txg_sync_thread:txg-syncing 1001MB of 2495MB used
  0  4181  txg_sync_thread:txg-syncing 1002MB of 2495MB used
  1  4181  txg_sync_thread:txg-syncing  936MB of 2495MB used
  1  4181  txg_sync_thread:txg-syncing  0MB of 2495MB used
Result:

OmniOS VM memory at 14GB
CrystalDiskMark 3.0.3 x64
Test size: 1000MB
zfs_dirty_data_max=2495MB

127 MB/s 4K@QD32 write speed

(yes 127MB/s that's not a typo, with sync=standard)
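A quick sanity check of the numbers in this run, as a Python sketch. The 60% start point is the assumed default of zfs_delay_min_dirty_percent; the byte value and the peak are copied from the mdb command and dtrace output above.

```python
MIB = 1024 ** 2
DELAY_MIN_DIRTY_PERCENT = 60  # assumed default of zfs_delay_min_dirty_percent

new_max_bytes = 2617101363           # value written with mdb in this post
new_max_mb = new_max_bytes // MIB    # dirty.d prints whole MB
delay_start_mb = new_max_mb * DELAY_MIN_DIRTY_PERCENT // 100
peak_dirty_mb = 1002                 # highest value in the second dtrace run

print(f"zfs_dirty_data_max = {new_max_mb}MB")
print(f"delays start near {delay_start_mb}MB, peak dirty was {peak_dirty_mb}MB")
print("throttled" if peak_dirty_mb > delay_start_mb else "not throttled")
```

The peak of about 1000MB matches the 1000MB test size, so the whole test probably fit in the dirty data buffer without reaching the point where delays begin.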

I then lowered zfs_dirty_data_max, ran the 500MB test, and got the same results.

So just make sure you either:

A) Ensure your slower vdev disk pool can handle the writes thrown at it if you have a hybrid (SSD + slower disk) pool and you want to max out sustained 4k @QD32 throughput of your SLOG device.

B) Give the VM enough memory to handle the incoming amount of data that your SLOG can acknowledge.
(which will raise zfs_dirty_data_max as a byproduct)

C) Increase zfs_dirty_data_max independently if you can't spare more RAM but are willing to take memory away from your ARC to accommodate the longest bursts of SLOG write data.

Hope people find this useful.

zfs throttle info and scripts from:
Adam Leventhal's blog » Tuning the OpenZFS write throttle
 

Hank C

Active Member
Jun 16, 2014
644
66
28
I'll have 192GB of RAM for ZFS (possibly FreeNAS).
What is the ZFS dirty data setting for? Is that in RAM or on the SLOG?
Have you tried turning off sync writes to see what happens?
 

J-san

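zfs_dirty_data_max
The limit on how much dirty data (written by applications but not yet on the pool disks) ZFS will hold in RAM. As dirty data approaches this limit, writes get delayed.

zfs_dirty_data_sync
The amount of dirty data in RAM that triggers a txg sync out to the pool disks (64MB by default).

Both settings are about RAM, not the SLOG.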

The "cache-nonvolatile:true" setting in sd.conf tells the sd driver that the write cache on the S3700 and S3500 is non-volatile, which essentially turns off cache flushing for those drives ONLY.
(sync is still honored correctly in the filesystem/pool for the slower 7200rpm RE4 data drives by the separate SSD SLOG)

If you disable sync for the filesystem (datastore), writes bypass the separate SSD SLOG and are handled async instead, probably making performance quite similar to the hybrid pool above.

BUT
if you disable sync for a ZFS filesystem that the ESXi datastore connects to, then upon powerloss your VMDKs may become corrupt and unrecoverable. This is because writes that ESXi asked to be committed to stable storage are acknowledged as written while they are still only in RAM. ESXi happily continues on, assuming the important data is stable in the VMDK, but the writes in RAM are lost upon powerloss or a system crash/hang.
 

J-san

Hank, I actually am not using L2ARC on my system. Just regular RAM (ARC) only at this point.

However, from what I've read, the best choice for L2ARC should be an SSD with really fast reads, and it doesn't need powerloss protection.
Maybe a Samsung 850 Pro?

From what I've read, though, you should max out your regular RAM first, since the headers for a huge L2ARC are kept in your regular, much faster RAM and eat into it.
If your commonly used data all fits in your regular RAM ARC then you don't need the L2ARC.
 