The ultimate ZFS ESXi datastore for the advanced single User (want, not have)

Discussion in 'Solaris, Nexenta, OpenIndiana, and napp-it' started by Rand__, Nov 3, 2019.

  1. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,592
    Likes Received:
    544
    So I've been trying to build the ultimate ZFS storage system for a while now, and have thrown tons of time and probably even more money at it without actually getting *there*. So I thought I'd start fresh and ask for help from real experts, since my tinkering didn't work out:p

    I will leave all the old issues out for now and start with a fresh idea if possible (but of course reuse hardware where I can in the actual build).

    Now I am totally aware that this is highly dependent on what the goal/expectation is, so here are mine (I am sure I have stated these quite often in my various rants/complaint threads, but just to have it all in one place):

    • I want to run an ESXi cluster with shared storage on a ZFS system, ideally via NFS (so sync writes).
    • I want the ZFS system to be HA capable with at least 2 nodes.
    • I want a write speed of 500 MB/s per VM when moving from off-system to on-system storage and vice versa (cold or hot vMotion).
    Now let's look at this regardless of the hardware I already have - if I had unlimited funds, what hardware would be capable of achieving this?
    If you think an off-the-shelf all-flash array can, then let's hear that too (although most don't cater to single users), but of course that is most likely not the way to go. Even the info that, say, a 5-node cluster with 24 NVMe drives each might be needed would be helpful (if you have actually seen it deliver and haven't only read the specs {which rarely show QD1/1T values anyway})

    Thanks:)
     
    #1
  2. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,273
    Likes Received:
    752
    The cluster/HA aspect is secondary if you use a ZFS dual-head active/passive cluster with multiport SAS/NVMe. It only becomes special with iSCSI/FC or single-port SATA/NVMe on ESXi, where disk/network/ESXi performance may be a limiting factor depending on the setup.

    If you have a single-server ZFS configuration with dual-port SAS/NVMe disks that meets your performance needs, the cluster just means adding a second head (barebone or virtualized) connected to the second ports of the SAS/NVMe disks.
     
    #2
  3. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,592
    Likes Received:
    544
    I agree, but so far I haven't found even a single-server solution that satisfies my requirements (with ZFS).

    From your experience, what kind of hardware (number/type of disks + CPU) might be able to provide the desired performance?:)
     
    #3
  4. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,273
    Likes Received:
    752
    500 MB/s sequential pool performance is not a problem; easy to go > 1 GB/s. With a lot of concurrent or small IO this looks very different. What I would try is to start with some Optane 900s in a raid-0 setup with as much RAM as available and a recsize like 32k or 64k when using VMs. This marks the upper end of what is possible with ZFS on your hardware. Then you can scale back to fewer Optanes, to NVMe or 12G SAS/SATA SSDs instead, or use other cost-limiting options like a dedicated Slog/special vdev with a slower pool. If you care about sync performance, that again is very special: Optane can land at around 800 MB/s sequentially, NVMe flash maybe at 500 MB/s.
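    As a rough sketch of that upper-end test (pool, dataset and device names are just examples, adjust to your controllers):

    Code:
    # names below are examples only
    # stripe (raid-0) of two Optane 900P to find the upper end of the hardware
    zpool create testpool c1t0d0 c2t0d0

    # smaller recsize for the VM filesystem, lz4 is practically free
    zfs create -o recordsize=64k -o compression=lz4 testpool/vmstore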

    When your pool can satisfy your needs, then it comes down to the network. 10G Ethernet without TCP tuning and Jumboframes lands at around 400 MB/s. With Jumboframes and tuning this can double. Above 1 GB/s you are in the region of a very special and uncommon setup.
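    For Jumboframes that means something like this on both ends (interface, vSwitch and vmkernel names are just examples):

    Code:
    # OmniOS/illumos side: raise the link MTU (ixgbe0 is an example interface)
    dladm set-linkprop -p mtu=9000 ixgbe0

    # ESXi side: the vSwitch and the NFS/vMotion vmkernel port must match
    esxcli network vswitch standard set -v vSwitch1 -m 9000
    esxcli network ip interface set -i vmk1 -m 9000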

    As ZFS is a local, high-security filesystem, performance is limited by the performance of a local pool (and is not as good as filesystems without the safeguards of ZFS). A cluster filesystem can scale beyond that, as every node only has to deliver a part of the overall performance. But as there is overhead as well, there is a break-even point of x nodes required to perform better than a single local filesystem.
     
    #4
    Last edited: Nov 4, 2019
  5. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,592
    Likes Received:
    544
    Sequential is indeed not an issue, but up until now I have not managed to get VMs to move at that speed. Not sure whether that's due to the smaller blocksize VMware is using or the not exclusively sequential IO.

    My primary issue though was the inability of ZFS (on both OmniOS and FreeBSD) to scale even close to linearly with more devices (even sequentially).
    I have run tests with 4 Optane drives and 12 SAS3 (HGST SS300) and neither reached the expected levels -

    here is a fio chart showing 1-6 mirror vdevs, 3 GHz CPU, 1M blocksize, 1M recordsize (QD1, numjobs=1)

    [attached chart: fio results for 1-6 mirror vdevs, 1M blocksize/recordsize, QD1/T1]

    And I have similar issues all the time whenever I get to 1 GB/s or more (sometimes it's as bad as here, sometimes the next mirror just adds a single-digit performance improvement)
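    For reference, a run like the ones in the chart would look roughly like this with fio (the mount path and size are placeholders):

    Code:
    # sequential write, 1M blocks, QD1 / 1 thread (path/size are examples)
    fio --name=seqwrite --directory=/testpool/fio \
        --rw=write --bs=1M --size=10G \
        --ioengine=psync --iodepth=1 --numjobs=1 --group_reporting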
     
    #5
    SRussell likes this.
  6. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,273
    Likes Received:
    752
    With a typical ZFS server, you look at the triangle price <> capacity <> performance. You select two of the parameters and the third is the result. If you are not satisfied with the result, modify the input parameters.

    For ultimate performance, the triangle may change to price <> performance <> data security. This can mean that an ext4 solution is faster than a ZFS solution, but without the security that comes with CopyOnWrite and checksums, which cost performance.

    In theory, a RAID (ZFS or otherwise) scales sequentially with the number of data disks. Basically this is correct, and with large file streams (zfs send) it may scale up to a certain limit.

    In reality a RAID does not work purely sequentially. ZFS, for example, tries to spread data quite evenly over the pool, which means a lot of the load turns into random IO.

    Another aspect is ESXi, where you create a filesystem like ext4 or ntfs on top of a virtual disk (vmdk) file. From the view of the VM this is a block device with, say, 8k blocksize. The VM is optimized to update data based on this blocksize as fast as possible (expecting a physical disk blocksize of 512B/4k).

    If the "file" itself is on ZFS, all real IO happens at the recsize of the ZFS filesystem. If that is 1M, it is highly un-optimized when the guest filesystem wants to read/write 8k and a full 1M record must be processed each time. As ZFS becomes slow with very low recsizes (checksums, dedup and compress work per record), you should try a lower recsize, e.g. 32k or 64k, for best VM performance.
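    A sketch of that change (the dataset name is an example; a new recsize only applies to newly written blocks, so existing vmdks need to be copied or moved to pick it up):

    Code:
    # dataset name is an example
    zfs set recordsize=32k tank/nfs_vmstore
    zfs get recordsize,compression tank/nfs_vmstore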

    As your benchmark is sequential ZFS IO, it may give a different picture than the VM view. Normal flash additionally has the problem that it needs to erase/write a whole large page to write a small data block (Optane is superior here, not affected by this performance "break").
     
    #6
    Last edited: Nov 4, 2019
    SRussell and T_Minus like this.
  7. T_Minus

    T_Minus Moderator

    Joined:
    Feb 15, 2015
    Messages:
    6,838
    Likes Received:
    1,493
    Some thoughts... QD1/T1 performance improvements, once you've 'maxed' drive performance, will come from improving latency:

    - CPU Frequency (& Available Cores)
    - Cable Connections.
    - Cable Quality\Type.
    - Drivers
    - Firmware
    - Drive configuration\pool setup (hardware. IE: Which HBA, Physical Pool\vdev configuration)
    - Memory Performance
    - CPU Various Configurations in BIOS
    - ZFS Various Configurations
    - You're using an E3 not an E5, so memory performance is not nearly as good and there are too few PCIe lanes to get top performance out of the Optanes + 12 SAS3 drives, etc... (unsure if you tested with something else, just throwing out ideas)
    - (If there's network then 1gb vs 10, vs 100, tuning, network stuff... etc...)
     
    #7
  8. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,273
    Likes Received:
    752
    Real world is unfair.
    In theory everything is predictable and the answer to every serious question is "42".

    A solution seems perfect. Then you step on a cable and suddenly it is 100 Mb/s instead of 10 Gb/s (1:100).

    Mostly you can only define the most important features, try to find a standard solution, and if the result is worse than expected: start troubleshooting/bug hunting.
     
    #8
    Evan, Rand__ and T_Minus like this.
  9. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,592
    Likes Received:
    544
    That bench was basically just a min/max size test when I got a bunch of new drives - here is the same run with 4k blocksize/4k recordsize (didn't do 32/64k on that run unfortunately). Will need to see if I have some 32/64k results stored away somewhere (likely also from Optanes)

    [attached chart: same fio run with 4k blocksize/4k recordsize]



    Basically I am trying to do a few things here:
    1. Establish a basic HW set that should be able to meet my goals
    2. Establish the criteria to measure that (beyond moving VMs back and forth) - from what you said (and I agree), VMs should use 32/64k blocksize and will have a certain amount of random IO included (see the fio sketch below)

    3. Also I am trying to find out why adding more mirrors seems to be detrimental to performance (worst case) or not helping as expected (best case)
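    Something like this could serve as the measurement from point 2 (a sketch; the 70/30 mix, path and size are assumptions, not a VMware-defined profile):

    Code:
    # 64k mixed random read/write at QD1/T1 as a rough stand-in for VM traffic
    fio --name=vmlike --directory=/testpool/fio \
        --rw=randrw --rwmixread=70 --bs=64k --size=10G \
        --ioengine=psync --iodepth=1 --numjobs=1 --group_reporting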
     
    #9
  10. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,592
    Likes Received:
    544
    Yeah, that's what I did - throw more hardware at it :p
    I started with drives (Optane, SAS3), then more drives... then more network (useless at this point), then better CPUs (E5/Scalable), then better slogs (at this point I have a 4800X, an NVDIMM and a NV1616 lying here), and nothing really worked out as I had hoped...
    That's why I thought I'd ask somebody who knows this stuff better than I do ;)
    I.e. more money than sense, and trying to rectify that ;)

    Edit:
    What is missing is a baseline, i.e. a realistic number - apparently it's not
    vdevs * <single drive performance>
    but I have not been able to find many results at QD1/T1 since it's not the typical enterprise use case and most normal people are not running similar hw;)
     
    #10
    Last edited: Nov 4, 2019
  11. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,592
    Likes Received:
    544
    Here is a 64k bs/recsize test with two 900Ps in a single mirror and different slogs - this is on a 6150 (again QD1/T1)

    The NVDIMM comes close;)

    [attached charts: 64k fio results for the 900P mirror with the different slog devices]
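    Swapping the slog between runs can be done roughly like this (pool and device names are just examples):

    Code:
    # pool and device names are examples
    zpool add tank log c3t0d0      # add the slog candidate under test
    zpool status tank              # verify the log vdev shows up
    zpool remove tank c3t0d0       # remove it again before the next candidate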
     
    #11
    Last edited: Nov 4, 2019
  12. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,273
    Likes Received:
    752
    If I read it correctly (and as a mirror does not help on writes):
    - 900P without slog is nearly as good as 900P + 4800X slog
    - 900P + NVDIMM = 4000 vs 6000 write IOPS (1.5x)

    The question remains:
    does this improve a VM move by 5% or by 30%?
    (your initial real-world concern)
     
    #12
  13. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,592
    Likes Received:
    544
    Yeah :p
    Too much testing, not enough real-world experience.

    Will have the target board back from SM tomorrow and will be able to do some real tests once I've rebuilt the system. Will let you know;)
     
    #13
  14. i386

    i386 Well-Known Member

    Joined:
    Mar 18, 2016
    Messages:
    1,683
    Likes Received:
    412
    If changing hardware doesn't improve the performance, then it's time to look at the software/OS stack :D
     
    #14
  15. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,592
    Likes Received:
    544
    I contemplated going back to RAID with Starwind or Open-E (got an HA license with some hardware a few months back), but ZFS is nicer;)

    As mentioned, one of my problems is the lack of comparison data; there are just too few published numbers from high-speed QD1/T1 setups, so I don't know whether it is possible at all with a ZFS-based system...
     
    #15
  16. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,592
    Likes Received:
    544
    So I did a not-so-realistic test
    - a stripe of 900Ps with the NVDIMM (slog) as the recipient; a 4800X as the source, both on a FreeNAS box exported via NFSv3 (and sync=always).
    vMotion from datastore A to B.

    Also, the box is quite beefy: a 6150 (16 cores, 3.4 GHz all-core) and 280GB memory (so the test VM might have been cached in RAM completely, it's only 22GB).

    zpool iostat stripe_900p 1: 25 ticks for 19GB, so ~760 MB/s actual transfer rate.

    Will need to run some further tests with remote transfer and a more realistic drive set.

    Code:
    stripe_900p  5.60M   888G      0      0      0      0
    stripe_900p  5.60M   888G      0      0      0      0
    stripe_900p  5.60M   888G      0      0      0      0
    stripe_900p  5.60M   888G      0      0      0      0
    stripe_900p  5.60M   888G      0      0      0      0
    stripe_900p   203M   888G      0  6.70K      0   484M
    stripe_900p   956M   887G      0  20.7K      0  1.61G
    stripe_900p  1.74G   886G      0  21.7K      0  1.68G
    stripe_900p  2.67G   885G      0  25.8K      0  1.93G
    stripe_900p  3.41G   885G      0  20.5K      0  1.59G
    stripe_900p  4.27G   884G      0  23.9K      0  1.78G
    stripe_900p  5.01G   883G      0  22.1K      0  1.69G
    stripe_900p  5.87G   882G      0  22.4K      0  1.71G
    stripe_900p  6.80G   881G      0  25.0K      0  1.93G
    stripe_900p  7.42G   881G      0  16.1K      0  1.23G
    stripe_900p  7.79G   880G      0  10.9K      0   844M
    stripe_900p  8.53G   879G      0  20.0K      0  1.53G
    stripe_900p  9.34G   879G      0  22.8K      0  1.72G
    stripe_900p  10.1G   878G      0  24.4K      0  1.85G
    stripe_900p  10.9G   877G      0  20.7K      0  1.58G
    stripe_900p  11.8G   876G      0  25.3K      0  1.88G
    stripe_900p  12.6G   875G      0  24.7K      0  1.86G
    stripe_900p  13.4G   875G      0  22.2K      0  1.63G
    stripe_900p  14.3G   874G      0  24.9K      0  1.88G
    stripe_900p  15.1G   873G      0  22.9K      0  1.73G
    stripe_900p  15.9G   872G      0  23.8K      0  1.81G
    stripe_900p  16.7G   871G      0  22.2K      0  1.70G
    stripe_900p  17.6G   870G      0  25.9K      0  1.94G
    stripe_900p  18.5G   869G      0  26.7K      0  2.02G
    stripe_900p  19.2G   869G      0  15.8K      0  1.16G
    stripe_900p  19.2G   869G      0      0      0      0
    stripe_900p  19.2G   869G      0      0      0      0
    stripe_900p  19.2G   869G      0      0      0      0
    stripe_900p  19.2G   869G      0      0      0  3.76K
    stripe_900p  19.2G   869G      0    594      0  38.9M
    stripe_900p  19.2G   869G      0      0      0      0
    stripe_900p  19.2G   869G      0      0      0      0
    stripe_900p  19.2G   869G      0      0      0      0
    stripe_900p  19.2G   869G      0    340      0  22.6M
    stripe_900p  19.2G   869G      0  2.39K      0   186M
    stripe_900p  20.1G   868G      0  24.4K      0  1.81G
    stripe_900p  21.0G   867G      0  24.2K      0  1.82G
    stripe_900p  21.9G   866G      0  26.2K      0  2.03G
    stripe_900p  22.6G   865G      0  19.9K      0  1.48G
    stripe_900p  22.6G   865G      0      0      0      0
    stripe_900p  22.6G   865G      0      0      0      0
    stripe_900p  22.6G   865G      0      0      0      0
    stripe_900p  22.6G   865G      0      0      0      0
    stripe_900p  22.6G   865G      0    637      0  28.2M
    stripe_900p  23.2G   865G      0  16.9K      0  1.29G
    stripe_900p  23.2G   865G      0     14      0  73.6K
    stripe_900p  23.2G   865G      0      0      0      0
    stripe_900p  23.2G   865G      0      0      0      0
    stripe_900p  23.2G   865G      0      0      0      0
    stripe_900p  23.3G   865G      0    515      0  42.3M
    stripe_900p  23.3G   865G      0      0      0      0
    stripe_900p  23.3G   865G      0      0      0      0
    stripe_900p  23.3G   865G      0      0      0      0
    stripe_900p  23.3G   865G      0      0      0      0
    stripe_900p  23.3G   865G      0      0      0      0
    
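    For reference, the sync=always part of a setup like this boils down to one property on the receiving dataset (the dataset name is an example; the NFS export itself comes from the FreeNAS sharing config):

    Code:
    # dataset name is an example
    zfs set sync=always stripe_900p/vmds
    zfs get sync stripe_900p/vmds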
     
    #16