OpenZFS (OmniOS) write throttle tuning - NFS benchmarking 4k@QD32

Discussion in 'Solaris, Nexenta, OpenIndiana, and napp-it' started by J-san, Dec 10, 2014.

  1. J-san

    I just noticed something that may help others.

    ZFS write throttling can kick in to slow down your write performance.

    I ran into this while benchmarking when I reduced the memory given to my OmniOS VM from 16GB down to 13GB.

    The pool I was benchmarking is a striped-mirror pool of 2TB WD RE4 drives with a 100GB Intel S3700 SSD as a separate log (SLOG) device:
    Code:
    tank
    mirror-0
      c21t50014EE05926E121d0  RE4 2TB
      c20t50014EE25F104CBEd0  RE4 2TB
    mirror-1
      c13t50014EE2B50E9766d0  RE4 2TB
      c14t50014EE25FB90889d0  RE4 2TB
    logs
      c17t55CD2E404B65494Ed0  100GB S3700 SSD
    
    With sd.conf tuning:
    Code:
    # DISK tuning
    # Set correct non-volatile settings for Intel S3500 + S3700 SSDs
    # WARNING: Do not set this for any other SSDs unless they have powerloss protection built-in
    # WARNING: It is equivalent to running ZFS with sync=disabled if your SSD does not have powerloss protection.
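    #
    # NOTE: after editing /kernel/drv/sd.conf the sd driver has to re-read it
    # before new settings take effect; on illumos/OmniOS this can typically be
    # done with "update_drv -vf sd" (or by rebooting).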
    sd-config-list=
    "ATA  INTEL SSDSC2BB48", "physical-block-size:4096, cache-nonvolatile:true, throttle-max:32, disksort:false",
    "ATA  INTEL SSDSC2BA10", "physical-block-size:4096, cache-nonvolatile:true, throttle-max:32, disksort:false";
    
    I'm benchmarking at 9000 MTU, with the NFS datastore on the pool above mounted over an all-in-one virtual 10G ESXi vSwitch connection.

    Compression=off, sync=standard (effectively equal to 'always' for an ESXi 5.5 NFS datastore, since ESXi issues its NFS writes as sync writes)
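
    (For reference, those dataset properties are just set with zfs set; a minimal sketch, where the dataset name tank/nfs_ds is made up:)
    Code:
    # hypothetical dataset name - substitute your NFS datastore filesystem
    zfs set compression=off tank/nfs_ds
    zfs set sync=standard tank/nfs_ds
    zfs get compression,sync tank/nfs_ds
    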

    Benchmarking with:

    OmniOS VM memory at 16GB
    CrystalDiskMark 3.0.3 x64
    Test size: 1000MB
    47 MB/s 4K@QD32 write speed

    OmniOS VM memory at 13GB
    CrystalDiskMark 3.0.3 x64
    Test size: 1000MB
    20 MB/s 4K@QD32 write speed

    I was testing other settings (NUMA) at the time, so I didn't realize the drop was a byproduct of the memory change.
    Once I noticed the memory size had changed, I set it back to 16GB.

    Performance was back to where it was.
    But why did it decrease so much?

    Once the amount of dirty (not-yet-synced) data gets close to a percentage of zfs_dirty_data_max, ZFS starts delaying, i.e. throttling, writes. zfs_dirty_data_max defaults to 10% of RAM (capped at 4GB), which is why shrinking the VM's memory shrank my headroom, and why the dtrace output below shows a limit of about 1.4GB with 14GB of RAM.
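
    (You can check the current limit on a running system with mdb; a quick sketch, values are in bytes. In the OpenZFS write throttle, the delays start once dirty data passes zfs_delay_min_dirty_percent of that limit, 60% by default.)
    Code:
    # read the current dirty data cap (bytes, 64-bit decimal)
    echo "zfs_dirty_data_max/E" | mdb -k
    
    # percentage of the cap at which ZFS begins delaying writes (default 60)
    echo "zfs_delay_min_dirty_percent/D" | mdb -k
    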

    So I ran another test with the VM memory set to 14GB, and the result landed between the two above.

    I watched the dirty data with a DTrace script (dirty.d) while running the 1000MB 4k@QD32 write test:

    Code:
    ~# dtrace -s dirty.d tank
    ...
      1  4181  txg_sync_thread:txg-syncing  0MB of 1432MB used
      0  4181  txg_sync_thread:txg-syncing  0MB of 1432MB used
      0  4181  txg_sync_thread:txg-syncing  64MB of 1432MB used
      0  4181  txg_sync_thread:txg-syncing  516MB of 1432MB used
      0  4181  txg_sync_thread:txg-syncing  637MB of 1432MB used
      1  4181  txg_sync_thread:txg-syncing  933MB of 1432MB used
      0  4181  txg_sync_thread:txg-syncing  927MB of 1432MB used
      1  4181  txg_sync_thread:txg-syncing  932MB of 1432MB used
      1  4181  txg_sync_thread:txg-syncing  940MB of 1432MB used
      1  4181  txg_sync_thread:txg-syncing  925MB of 1432MB used
      0  4181  txg_sync_thread:txg-syncing  932MB of 1432MB used
      0  4181  txg_sync_thread:txg-syncing  935MB of 1432MB used
      0  4181  txg_sync_thread:txg-syncing  752MB of 1432MB used
      0  4181  txg_sync_thread:txg-syncing  0MB of 1432MB used
    
    ZFS is throttling the writes as the dirty data gets closer to the limit, which means my RE4 vdevs can't
    absorb the data being flushed to them asynchronously by ZFS after the S3700 SSD log device has already written and acknowledged it.
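
    (For reference, dirty.d is essentially the short DTrace script from Adam Leventhal's blog post linked at the bottom; roughly:)
    Code:
    /* print dirty data vs. zfs_dirty_data_max each time a txg starts syncing,
       for the pool named in the first argument, e.g. dtrace -s dirty.d tank */
    txg-syncing
    {
            this->dp = (dsl_pool_t *)arg0;
    }
    
    txg-syncing
    /this->dp->dp_spa->spa_name == $$1/
    {
            printf("%4dMB of %4dMB used", this->dp->dp_dirty_total / 1024 / 1024,
                `zfs_dirty_data_max / 1024 / 1024);
    }
    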

    So I temporarily bumped up the max dirty amount to 2495MB (2,617,101,363 bytes) so I had lots of headroom:

    Code:
    # echo zfs_dirty_data_max/W0t2617101363 | mdb -kw
    
    # dtrace -s dirty.d tank
    
      0  4181  txg_sync_thread:txg-syncing  0MB of 2495MB used
      1  4181  txg_sync_thread:txg-syncing  64MB of 2495MB used
      1  4181  txg_sync_thread:txg-syncing  567MB of 2495MB used
      1  4181  txg_sync_thread:txg-syncing  667MB of 2495MB used
      0  4181  txg_sync_thread:txg-syncing 1001MB of 2495MB used
      1  4181  txg_sync_thread:txg-syncing 1001MB of 2495MB used
      1  4181  txg_sync_thread:txg-syncing 1001MB of 2495MB used
      0  4181  txg_sync_thread:txg-syncing 1001MB of 2495MB used
      0  4181  txg_sync_thread:txg-syncing 1001MB of 2495MB used
      0  4181  txg_sync_thread:txg-syncing 1002MB of 2495MB used
      1  4181  txg_sync_thread:txg-syncing  936MB of 2495MB used
      1  4181  txg_sync_thread:txg-syncing  0MB of 2495MB used
    
    Result:

    OmniOS VM memory at 14GB
    CrystalDiskMark 3.0.3 x64
    Test size: 1000MB
    zfs_dirty_data_max=2495MB

    127 MB/s 4K@QD32 write speed

    (yes, 127MB/s is not a typo, and that's with sync=standard)

    I dropped the zfs_dirty_data_max lower and ran the 500MB test and got the same results.

    So just make sure you either:

    A) If you have a hybrid (SSD SLOG + slower disk) pool and you want to max out sustained 4k@QD32 throughput of your SLOG device, make sure the slower data vdevs can actually absorb the writes being thrown at them (see the zpool iostat sketch after this list).

    B) Give the VM enough memory to handle the incoming amount of data that your SLOG can acknowledge.
    (which will raise zfs_dirty_data_max as a byproduct)

    C) If you can't spare more RAM but are willing to steal some from your ARC cache, increase zfs_dirty_data_max independently so it can accommodate your longest bursts of SLOG write data (see the /etc/system sketch after this list for making the change persistent).
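
    For A), one quick way to see whether the data vdevs are keeping up is per-vdev iostat while the benchmark runs (a sketch, using the pool from above):
    Code:
    # per-vdev bandwidth/IOPS every 5 seconds while the test is running
    zpool iostat -v tank 5
    
    For C), the mdb write above only lasts until reboot; on illumos/OmniOS a tunable like this is normally made persistent via /etc/system (a sketch, using the same ~2495MB value):
    Code:
    * /etc/system entry - takes effect on the next boot
    set zfs:zfs_dirty_data_max = 2617101363
    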

    Hope people find this useful.

    zfs throttle info and scripts from:
    Adam Leventhal's blog » Tuning the OpenZFS write throttle
     
    #1
    Last edited: Dec 14, 2017
  2. Hank C

    I'll have 192GB of RAM for the ZFS box (possibly FreeNAS).
    What is the dirty data setting for? Is it about the RAM or the SLOG?
    Have you tried turning off sync writes to see what happens?
     
    #2
  3. J-san

    zfs_dirty_data_max
    The cap on dirty (not-yet-synced) write data that ZFS will buffer in RAM; the write throttle kicks in as you approach it.

    zfs_dirty_data_sync
    The amount of dirty data that makes ZFS start syncing out a transaction group early.
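
    (Both are just kernel variables on illumos, so you can peek at the current values with mdb; a quick sketch, values are in bytes:)
    Code:
    echo "zfs_dirty_data_max/E" | mdb -k
    echo "zfs_dirty_data_sync/E" | mdb -k
    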

    The "cache-nonvolatile:true" setting in sd.conf tells the sd driver that the S3700/S3500's write cache is non-volatile, so cache-flush requests become effectively no-ops for those drives ONLY.
    (Sync is still honored correctly for the filesystem/pool as a whole: the slower 7200rpm RE4 data drives are covered by the separate SSD SLOG.)

    If you disable sync for the filesystem (datastore), writes bypass the separate SSD SLOG entirely and are written async instead, which would probably give performance quite similar to the tuned hybrid pool above.

    BUT
    if you disable sync for a ZFS filesystem that an ESXi datastore connects to, then on powerloss your VMDKs may become corrupt and unrecoverable. Writes that ESXi asked to be committed synchronously to stable storage get acknowledged while they are still only in RAM, so ESXi happily continues on assuming that data is stable in the VMDK; if the box then loses power or crashes/hangs, those in-RAM writes are simply lost.
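
    (A quick sanity check that a datastore filesystem is still honoring sync writes; a sketch with a made-up dataset name:)
    Code:
    # should report standard (or always), never disabled for ESXi datastores
    zfs get sync tank/vm_datastore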
     
    #3
    Last edited: Dec 10, 2014
  4. Hank C

    What L2ARC SSD do you recommend?
     
    #4
  5. J-san

    Hank, I actually am not using L2ARC on my system. Just regular RAM (ARC) only at this point.

    However, from what I've read, the best choice is a really fast-reading SSD; it doesn't need powerloss protection, since L2ARC only holds a copy of data that's already safe on the pool.
    Maybe a Samsung 850 Pro?

    From what I've read, though, you should max out your regular RAM first, since a large L2ARC eats into your much faster RAM (its headers have to be tracked in ARC).
    If your commonly used data already fits in the RAM-based ARC, then you don't need an L2ARC.
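
    (If you want to see how the existing ARC is doing before buying an L2ARC device, the arcstats kstats give a quick look, and a cache vdev can be added later with zpool add; a sketch, the device name below is made up:)
    Code:
    # current ARC size plus hit/miss counters
    kstat -p zfs:0:arcstats:size
    kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses
    
    # add an L2ARC (cache) device to the pool if it turns out you need one
    zpool add tank cache c0t5000000000000000d0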
     
    #5
    Last edited: Dec 11, 2014