ZFS bottlenecks

Discussion in 'Solaris, Nexenta, OpenIndiana, and napp-it' started by Stril, Dec 20, 2018.

  1. Stril

    Stril Member

    Joined:
    Sep 26, 2017
    Messages:
    151
    Likes Received:
    6
    Hi!

    I really like ZFS, but I have not been able to get REALLY good performance out of a ZFS iSCSI target with small IO.

    Today, my tests to explore its limits were:

    - Pool of only one Optane 900p
    - Pool of a stripe of two Optane 900p
    - Pool of one Optane as simple-volume + one Optane as SLOG

    The result was always the same:
    No more than 60,000 IOPS at 8 KB, 50% write, with diskspd. Read was about 110,000 IOPS.
    ...tested with Open-E JovianDSS and vSphere as the client.

    The same test with Starwind was MUCH faster.

    What I could see was a huge CPU load - up to 100%!

    Do you think higher IOPS are possible with faster CPUs, or are these just inherent limits of ZFS?

    Thank you for your help!


    Stril.

    ...I know Open-E runs ZoL, which is not as fast as the others, but I need commercial support...
     
    #1
  2. T_Minus

    T_Minus Moderator

    Joined:
    Feb 15, 2015
    Messages:
    6,513
    Likes Received:
    1,344
    Might want to post up your CPU specs, RAM specs, etc. :)
     
    #2
  3. Monoman

    Monoman Active Member

    Joined:
    Oct 16, 2013
    Messages:
    279
    Likes Received:
    68
    Did your Starwind and ZoL tests run on the same hardware? This will be important.

    FreeNAS has commercial support. Consider testing their ZFS as well. I'm close to performing these tests myself.
     
    #3
  4. Stril

    Stril Member

    Joined:
    Sep 26, 2017
    Messages:
    151
    Likes Received:
    6
    Hi!

    The CPU is one Intel Xeon Scalable Silver 4112, with 48 GB RAM, and the Starwind test ran on the same system.

    The question for me is: how much performance is possible with:
    - More cores (AMD Epyc?)
    - Or few but faster cores

    I have not found any commercial reseller for iXsystems in Germany yet...
     
    #4
  5. Stril

    Stril Member

    Joined:
    Sep 26, 2017
    Messages:
    151
    Likes Received:
    6
    Additional info:
    Intel 540 NICs, 2x 10 GbE MPIO without switches
     
    #5
  6. m4r1k

    m4r1k Member

    Joined:
    Nov 4, 2016
    Messages:
    44
    Likes Received:
    5
    If you need commercial support, the best and fastest ZFS implementation is Oracle's. ZFS is first and foremost about data consistency. In some cases, that is more important than pure speed.

    You could try Solaris 11.4 with napp-it and see what performance you get.
    You could also try pure napp-it on OmniOS CE to see whether it is an Open-E issue or something else.
     
    #6
  7. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,043
    Likes Received:
    654
    Have you enabled sync write on ZFS?
    With iSCSI this can be set either via the sync property of the underlying zvol or via writeback cache=disabled.

    The default value means that the writing app decides, but you can force it on or off.
    The SLOG protects the content of the RAM-based write cache. An Optane pool with an extra Optane as SLOG makes no sense, as this is not really faster than the Optane with the on-pool ZIL for logging.

    If Starwind is not just faster but much faster, then I suppose ZFS does sync writes while Starwind does not. Set sync=disabled / writeback cache=enabled to check this, or force sync on Starwind.
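
    As a quick check on the zvol side, something along these lines should work on any ZFS platform (the pool/zvol name is just an example, not from this thread):

    Code:
    # show the current sync setting of the zvol behind the LUN (example name)
    zfs get sync tank/iscsivol

    # force sync for every write, or disable it for a quick comparison test
    zfs set sync=always tank/iscsivol
    zfs set sync=disabled tank/iscsivol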

    Besides that, ntfs is faster than ZFS, especially with less RAM. The extra checksums on data and metadata and copy-on-write cost performance but give superior data security. If you switch to ReFS on Windows you will see a similar degradation (at a lower level; ZFS is then much faster).

    How much RAM?
    ZFS uses 10% of RAM, up to 4 GB, as write cache. This is essential for performance.
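
    On ZoL, this write-cache limit is the zfs_dirty_data_max module parameter; assuming a standard ZoL install, you can check it like this (path from memory, verify on your system):

    Code:
    # current RAM write-cache (dirty data) limit in bytes on ZFS on Linux
    cat /sys/module/zfs/parameters/zfs_dirty_data_max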

    All Open-ZFS platforms should perform similarly, with currently a slight advantage for Illumos-based systems, as Open-ZFS memory management is still Solaris-like even on BSD or ZoL. Oracle Solaris with genuine ZFS was always the fastest in all my tests.
     
    #7
    Last edited: Dec 20, 2018
    gigatexal and RageBone like this.
  8. Stril

    Stril Member

    Joined:
    Sep 26, 2017
    Messages:
    151
    Likes Received:
    6
    Hi!

    I also did some tests with "sync=disabled". The performance was not much better - I think the NVMe already runs into one of my bottlenecks.

    Starwind seems to keep it simple with "raw devices" (no snapshots, no async replication): better hardware, more performance.

    I am doing two kinds of tests:
    1. Windows-client -> iSCSI MPIO -> Storage with diskspd
    2. WindowsVM on ESXi -> iSCSI MPIO -> Storage with diskspd


    In test 1, I was able to get:
    Starwind: 350,000 IOPS
    Open-E ZFS: 80,000 IOPS

    In test 2, I was able to get:
    Starwind: 100,000 IOPS
    Open-E ZFS: 52,000 IOPS

    BUT:
    I see nearly the same performance with ZFS even if I use more and better hardware, while Starwind scales.


    So my thought was: is there a CPU bottleneck for checksumming etc. with ZFS? Do you see that huge CPU load, too?
     
    #8
  9. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,043
    Likes Received:
    654
    Translate 100% CPU as "as fast as I can".
    A faster CPU may help, but possibly only a little, as ZFS is more RAM-limited than CPU-limited.

    The reason is that ZFS does not write small blocks. The smallest data block is the ZFS blocksize, default 128k. Writes are always collected in the RAM-based write cache and flushed as one large sequential write (up to several gigabytes). If the disk is 4k physical, such a ZFS block is divided accordingly.

    If your real load is small data blocks, e.g. iSCSI with 8k, you should reduce the ZFS blocksize from 128k to 64k or 32k, not less. This can improve performance, especially when sync is enabled to avoid a write-cache data loss in case of a crash.

    In general, sync=disabled vs sync=always must give a performance difference: in one case you only write via the RAM-based write cache (large sequential writes), while in the other you must additionally log every small committed ZFS block. If you see no difference, then check not only the sync setting of the zvol behind the LUN but also the corresponding writeback cache setting of the target (sync=disabled and writeback cache=enabled means no sync writes).
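
    For a zvol-backed LUN the property in question is volblocksize, and it can only be chosen when the zvol is created; a minimal sketch with example names:

    Code:
    # create a sparse 100G zvol with 32k blocks and sync forced on (names are examples)
    zfs create -s -V 100G -o volblocksize=32K tank/iscsivol
    zfs set sync=always tank/iscsivol
    zfs get volblocksize,sync tank/iscsivol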

    Open-E is ZoL (ZFS on Linux).
    A Solaris with genuine ZFS, or an Illumos- or FreeBSD-based Open-ZFS appliance, may be faster, but do not expect the same performance from ZFS as from ntfs or ext4. Security comes at a price. While an Optane may be capable of >200k IOPS (8k), I would consider 50k a good value for 8k writes with no RAM write cache involved. A stripe set of several vdevs scales IOPS with the number of vdevs, so this is an option to increase the numbers on ZFS.
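
    A striped pool is simply one with several top-level vdevs, for example (ZoL device names, just an illustration):

    Code:
    # two Optanes as two top-level vdevs: pool IOPS scale roughly with the vdev count
    zpool create -o ashift=12 tank nvme0n1 nvme1n1
    zpool status tank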

    See also COMSTAR, the enterprise-class iSCSI framework of Solaris and Illumos/OmniOS:
    Configuring Storage Devices With COMSTAR - Oracle Solaris Administration: Devices and File Systems
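
    For reference, the rough COMSTAR workflow on OmniOS/Solaris looks like this (the zvol name is an example; see the linked docs for the details):

    Code:
    # enable the COMSTAR framework and the iSCSI target service
    svcadm enable stmf
    svcadm enable -r svc:/network/iscsi/target:default

    # expose a zvol as a LUN and make it visible to initiators
    stmfadm create-lu /dev/zvol/rdsk/tank/iscsivol
    stmfadm add-view <LU-GUID-from-create-lu>

    # create an iSCSI target
    itadm create-target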

    If you need commercial support in Germany, you can also look at Oracle (Solaris) or NexentaStor (Illumos), e.g. via zstor.de. OmniOS, while open source, comes with a commercial support option from the devs (located in the UK and Switzerland): Commercial Support
     
    #9
  10. Stril

    Stril Member

    Joined:
    Sep 26, 2017
    Messages:
    151
    Likes Received:
    6
    Hi!

    I think RAM is not the current bottleneck. The biggest problem is write performance, and with sync=disabled and 48 GB of memory, I should be seeing the maximum "effect" - right?

    The goal is to have a "cluster" with sync=always and the best performance possible. sync=disabled is only a debugging test for me.

    Did you ever see the CPU being the bottleneck?

    I will test writeback cache=enabled - but Open-E does not offer that option at pool level.

    @Open-E
    I just do not have enough knowledge to be brave enough to run a "self-made" storage cluster, because I cannot handle an outage.
    I used Nexenta in the past but was VERY unhappy with their support. Today, I will test FreeNAS to see if it is faster.

    Thank you VERY much for all your input!
     
    #10
  11. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    2,934
    Likes Received:
    403
    If you can describe your test setup (clients, commands used, etc.) more closely, I might be able to duplicate it when I find the time over the holidays. I've got a test box (ESXi-based, but that should not matter too much) if you're interested.
     
    #11
  12. Stril

    Stril Member

    Joined:
    Sep 26, 2017
    Messages:
    151
    Likes Received:
    6
    Hi!

    That would be great. Current setup is simple:

    Two identical servers, Intel Xeon Silver 4112, 48 GB RAM, interconnected with 2x 10 GbE and jumbo frames.
    The test command is:

    Code:
    diskspd -c30G -w50 -b8K -F8 -r -o32 -W0 -d20 -Sh e:\testfile.dat
    
    ...on Windows 2016 with all patches
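
    For anyone reproducing this, my reading of the diskspd switches (descriptions added here, not part of the original command):

    Code:
    -c30G   create a 30 GB test file
    -w50    50% writes / 50% reads
    -b8K    8 KB I/O size
    -F8     8 worker threads in total
    -r      random I/O instead of sequential
    -o32    32 outstanding I/Os per thread
    -W0     no warm-up time
    -d20    20 second test duration
    -Sh     disable software caching and hardware write caching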

    @gea
    I just did a test with blocksize=32k:
    It seems to be faster. I will provide results in a few hours...
     
    #12
  13. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    2,934
    Likes Received:
    403
    E: is an iSCSI volume hosted by the ZFS box - what size? Any optimizations (network [except jumbo frames], energy-saving mode, etc.)?
    What's the iperf speed?
     
    #13
  14. Stril

    Stril Member

    Joined:
    Sep 26, 2017
    Messages:
    151
    Likes Received:
    6
    Hi!

    Yes, iSCSI on ZFS. I tested with 100 GB volumes. The only optimizations are disabled power management and jumbo frames. Nothing else...
     
    #14
  15. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,043
    Likes Received:
    654
    For iSCSI you should try Comstar (Oracle Solaris or OmniOS)
     
    #15
  16. NISMO1968

    NISMO1968 [ ... ]

    Joined:
    Oct 19, 2013
    Messages:
    70
    Likes Received:
    10
    Open-E is a pretty shitty product in terms of performance, software quality/maturity, and especially support, which is pretty much AWOL. If you plan to stick with ZFS, I'd recommend either plain vanilla FreeBSD or Linux + ZoL done right (don't be afraid of ZoL; the next version of FreeBSD is going to use ZoL ported to FreeBSD rather than the existing Illumos source code).

     
    #16
  17. NISMO1968

    NISMO1968 [ ... ]

    Joined:
    Oct 19, 2013
    Messages:
    70
    Likes Received:
    10
    If you like the StarWind performance, you could think about replacing the non-RDMA Intel NICs with Mellanox CX3 (they are cheap on eBay) or CX4 cards to get iSER rather than plain iSCSI, for both east-west traffic and the vSphere uplinks. RDMA is king :)
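
    On the vSphere side, enabling iSER should (from memory, please verify for your ESXi version) be roughly just adding a software iSER adapter bound to the RDMA-capable vmnic:

    Code:
    esxcli rdma iser add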

     
    #17
  18. i386

    i386 Well-Known Member

    Joined:
    Mar 18, 2016
    Messages:
    1,533
    Likes Received:
    355
    Can you test it again but without the -r argument?

    Code:
    diskspd -c30G -w50 -b8K -F8 -o32 -W0 -d20 -Sh e:\testfile.dat
    
     
    #18
  19. Stril

    Stril Member

    Joined:
    Sep 26, 2017
    Messages:
    151
    Likes Received:
    6
    Hi!

    @i386:
    Without "-r" performance is VERY good (about 95.000 IOPS in 50% mix), but with "-r" performance goes down to 50.000 in my current setup. Shouldn't this be equal with ZFS?

    @NISMO1968
    Mellanox cards are great, but my VMware hosts are on Intel cards.
    I do not like Open-E, but they provide a product with support. I am just afraid of building a cluster by myself without support.

    @gea
    I will give FreeNAS a try next week...
     
    #19
  20. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,043
    Likes Received:
    654
    FreeNAS is FreeBSD-based;
    Comstar is Solarish only
     
    #20
