
ZFS Advice for new Setup

Discussion in 'Linux Admins, Storage and Virtualization' started by humbleThC, Jan 2, 2017.

  1. mpogr

    mpogr Member

    Joined:
    Jul 14, 2016
    Messages:
    78
    Likes Received:
    34
Good point. I do have a UPS + a crash-free history on this server (despite hardware failures) + daily backups of all important VMs, so I'll stick with sync=disabled for the time being...
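For reference, toggling this on an existing dataset is a one-liner. This is a sketch; the pool/dataset name is a placeholder:

```shell
# Hypothetical pool/dataset name. Disabling sync trades a small data-loss
# window on power failure for write speed, hence the UPS + backups caveat:
zfs set sync=disabled tank/vmstore
zfs get sync tank/vmstore   # confirm the property took effect
```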

    Just for some additional charts for comparison:

    ATTO w/all optimisations ------------------------------------- w/sync=always (w/SLOG)
    upload_2017-1-16_18-53-48.png
     
    #121
    humbleThC likes this.
  2. mpogr

    mpogr Member

    Joined:
    Jul 14, 2016
    Messages:
    78
    Likes Received:
    34
    I've taken a different approach to this. On one hand, I want to have redundancy over Ethernet, so I exposed the iSCSI target just via the Ethernet IP on the server (using the "allowed_portal" parameter of SCST). On the other hand, I don't want this link to be used unless SRP fails, so I don't use round robin. Instead, I use the "Fixed" multipathing policy with the first SRP adapter link as "preferred". At the same time, I don't use IPoIB on that port, I haven't even defined a VMkernel NIC on top of it.
    Now, I do use the IPoIB capability of the second port for very fast inter-ESXi networking. There are so many possible uses for it, e.g. VMotion, exposing CIFS shares etc. All you need to do is add another NIC to your VM and connect it to the corresponding vSwitch and, voila, your VM is on a 56Gbps (in my case) network.
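The Fixed-with-preferred-path setup described above can also be scripted on the ESXi side. A sketch with placeholder identifiers; your naa device ID and vmhba path will differ:

```shell
# Set the Fixed path selection policy on the LUN (device ID is a placeholder):
esxcli storage nmp device set --device naa.XXXXXXXXXXXXXXXX --psp VMW_PSP_FIXED

# Mark the first SRP adapter path as preferred (path name is a placeholder):
esxcli storage nmp psp fixed deviceconfig set \
  --device naa.XXXXXXXXXXXXXXXX --path vmhba32:C0:T0:L0
```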
     
    #122
  3. humbleThC

    humbleThC Member

    Joined:
    Nov 7, 2016
    Messages:
    98
    Likes Received:
    6
I'm in no hurry :), and I work in all time zones. Although I'm near Chicago, my active engagements right now are in the UK, NL, and Canada, so I'm rocking -3 hrs in one time zone and +6 hrs in another.

Also, that's the difference between CX2 and CX3/4s... I can't get SR-IOV to work (from what I've read), so none of my VMs will technically ever be on the IB network :( For me it's a 'SAN' fabric, but I do like the idea of using it as a VMotion network, and maybe provisioning. Plus I've done several tests: RR enabled on all SRP targets, RR enabled on all SRP+IPoIB targets, and Fixed on a single SRP target. No noticeable difference in my configuration (still can't break 1.7 GB/s in data transfer). Which is about as fast as I've ever gotten over SMB3/RDMA from Win10 to Win2016 copying from RAM drive to RAM drive (if I remember correctly).
     
    #123
    Last edited: Jan 17, 2017
  4. humbleThC

    humbleThC Member

    Joined:
    Nov 7, 2016
    Messages:
    98
    Likes Received:
    6
Also, I really need help partitioning these Intel S3710s.

From my research, the reason you want to 'whole disk it' is twofold. First, an initialized whole-disk ZFS device gets a GPT partition that is already aligned to 4K sectors. Second, ZFS apparently won't enable a device's onboard write cache unless it owns the whole disk, to avoid cache-coherency problems that could exist if the other partitions were used for UFS etc.

Obviously I don't want to use the whole disk for ZIL. Ideally, I'm thinking I want to size my ZIL for 4 GB/s × 5 seconds = 20 GB, mirrored across all (4) Intel SSDs for performance and redundancy. So 20 GB usable = 10 GB partitions in RAID 1/0.

I can't use the napp-it GUI to partition the disks, because it defaults to an MBR partition scheme, which isn't 4K-aligned, and I don't want to take a huge hit on SSD performance accordingly :(.

Next, is there a way to override the 'use disk cache' behavior when you're only using a partition and not the whole disk? I promise I'll never use UFS on the other partitions :) I still plan to create a ~1.2 TB L2ARC across the remainder of the Intel SSDs in RAID 0, and I need those partitions to be GPT/aligned as well.

Pretty sure my initial benchmark results were gimped by the alignment of the SSD ZIL/L2ARC partitions the way I created them via the GUI.
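The sizing and alignment math above can be sketched out in shell. The parted invocation in the comments is hypothetical (device name assumed); the rule of thumb is that a partition start is 4K-aligned when its 512-byte sector number is divisible by 8:

```shell
# SLOG sizing rule of thumb: a few seconds of peak synchronous writes.
rate=4        # assumed peak sync write rate, GB/s
window=5      # seconds of dirty data to cover
usable=$((rate * window))
echo "usable SLOG: ${usable} GB"

# Four SSDs as two mirrored pairs, striped -> each partition is usable/2:
per_ssd=$((usable / 2))
echo "per-SSD slog partition: ${per_ssd} GB"

# 4K alignment: start sector (512-byte units) must be divisible by 8.
# parted's default 1 MiB offset (sector 2048) qualifies:
start=2048
[ $((start % 8)) -eq 0 ] && echo "sector ${start} is 4K-aligned"

# A hypothetical parted layout for one S3710 (device name assumed):
#   parted -s /dev/sdb mklabel gpt \
#     mkpart slog  1MiB 10GiB \
#     mkpart l2arc 10GiB 90%
```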
     
    #124
    Last edited: Jan 17, 2017
  5. ttabbal

    ttabbal Active Member

    Joined:
    Mar 10, 2016
    Messages:
    439
    Likes Received:
    116
For SLOG/ZIL, you just add them to the pool as log devices; ZFS can mirror them itself, so there's no separate RAID layer involved, nor should there be. Just make 10GB partitions and add those as log devices.

I haven't used napp-it for a long time now, but you might have to go CLI for GPT. gparted should be able to create them without much hassle.

    For the L2ARC sharing the drive... that's kind of frowned on, but it may work fine for you. The catch is that you are now using controller bandwidth for SLOG and L2ARC traffic. If you do it, I would recommend leaving at least 25% of the disk as unpartitioned space. That leaves some room for the disk's controller and firmware to clean up, so there are always erased blocks ready to use.

    I seem to remember reading that the whole 4k alignment thing wasn't really an issue with newer SSDs. Might want to look into that, but maybe that was more due to GPT becoming popular than the disks getting better about it.

    I don't know what, if any, effect partitions vs full disks has these days. I know it used to be a big deal, but that might have changed. Since you're benchmarking, it might be an interesting test. Make a mirror with drives, test, then one with partitions, test...
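Adding those partitions might look something like this. A sketch only: the pool name and by-id device names are placeholders for the four S3710s, with -part1 as the 10GB SLOG slice and -part2 as the L2ARC slice:

```shell
# Log devices mirrored by ZFS itself (two mirrored pairs, striped):
zpool add tank log \
  mirror /dev/disk/by-id/ata-INTEL_SSDSC2BA400G4_A-part1 \
         /dev/disk/by-id/ata-INTEL_SSDSC2BA400G4_B-part1 \
  mirror /dev/disk/by-id/ata-INTEL_SSDSC2BA400G4_C-part1 \
         /dev/disk/by-id/ata-INTEL_SSDSC2BA400G4_D-part1

# Cache (L2ARC) devices are always striped; they carry no redundancy:
zpool add tank cache \
  /dev/disk/by-id/ata-INTEL_SSDSC2BA400G4_A-part2 \
  /dev/disk/by-id/ata-INTEL_SSDSC2BA400G4_B-part2
```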
     
    #125
  6. mpogr

    mpogr Member

    Joined:
    Jul 14, 2016
    Messages:
    78
    Likes Received:
    34
I didn't actually mean SR-IOV, nor can I even use it. FYI, SR-IOV is only supported on VMware drivers 2.x and on, but those have neither SRP nor iSER over IB, so I stick with 1.8.2.5.
    What I did mean was regular VMware networking via the VMkernel NIC on top of the second Mellanox card IPoIB interface. It won't be full speed by any means, but still much better than 1 Gbps. This will serve VMotion/Provisioning out of the box + you can connect virtual NICs from your VMs to it.
    I'm pretty sure RR won't provide you with any additional benefit on top of a single link, and, if your second IB port is used for IPoIB networking, it may actually cause more harm than provide any benefit.
As for the 1.7 GB/s peak you're unable to break, I'd rather suspect you're single-core CPU limited on the server side, and I'm afraid you have very limited (if any) means of breaking that on Solaris. On Linux, as you could see from my posts, you have much more control in terms of employing multithreading, which can lead to a significant increase in overall throughput.
     
    #126
  7. humbleThC

    humbleThC Member

    Joined:
    Nov 7, 2016
    Messages:
    98
    Likes Received:
    6
    What would you do in my position?
    ZoL via CentOS 7.3, and if so got any good links of where to start?
     
    #127
  8. mpogr

    mpogr Member

    Joined:
    Jul 14, 2016
    Messages:
    78
    Likes Received:
    34
I think I'd give Linux a go (not necessarily CentOS). You need to find info on the following:
    * ZoL
    * Mellanox OFED
    * SCST
    Luckily, all of these are very well documented.
     
    #128
  9. humbleThC

    humbleThC Member

    Joined:
    Nov 7, 2016
    Messages:
    98
    Likes Received:
    6
    Messing around with Ubuntu 16.04.1.
    Updated to latest 4.4.0-59-generic kernel.
    Adding MLNX OFED 3.4.2 Drivers with --add-kernel-support for 4.4.0-59.

    root@nas01:~/MLNX_OFED_LINUX-3.4-2.0.0.0-ubuntu16.04-x86_64# hca_self_test.ofed

    ---- Performing Adapter Device Self Test ----
    Number of CAs Detected ................. 1
    PCI Device Check ....................... PASS
    Kernel Arch ............................ x86_64
    Host Driver Version .................... MLNX_OFED_LINUX-3.4-2.0.0.0 (OFED-3.4-2.0.0): 4.4.0-59-generic
    Host Driver RPM Check .................. PASS
    Firmware on CA #0 HCA .................. v2.10.0720
    Host Driver Initialization ............. PASS
    Number of CA Ports Active .............. 0
    Error Counter Check on CA #0 (HCA)...... PASS
    Kernel Syslog Check .................... PASS
    Node GUID on CA #0 (HCA) ............... NA
    ------------------ DONE ---------------------



    Decided to try Ubuntu next, but will want to try CentOS as well.
    I saw that napp-it was available for Linux, specifically older Ubuntu, is that still a thing?

I'm on attempt #2 at compiling SCST :) so there's that... got a good link for that?
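For completeness, the install sequence being described is roughly this (directory name from the self-test output above; --add-kernel-support is the standard Mellanox installer flag for rebuilding against a non-stock kernel):

```shell
tar xzf MLNX_OFED_LINUX-3.4-2.0.0.0-ubuntu16.04-x86_64.tgz
cd MLNX_OFED_LINUX-3.4-2.0.0.0-ubuntu16.04-x86_64

# Rebuild the driver packages against the running kernel (4.4.0-59 here):
./mlnxofedinstall --add-kernel-support

# Restart the IB stack and re-check the adapter:
/etc/init.d/openibd restart
hca_self_test.ofed
```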
     
    #129
  10. mpogr

    mpogr Member

    Joined:
    Jul 14, 2016
    Messages:
    78
    Likes Received:
    34
    I'd stay away from trunk of SCST, grab the 3.2.x branch, which should be stable:

    svn checkout ...

    Then in the main directory:

    make 2perf

    Then:

    make scst scstadmin srpt iscsi-scst
    sudo make install scst scstadmin srpt iscsi-scst


    That should be it. Next you need to create /etc/scst.conf, google for examples.
     
    #130
  11. humbleThC

    humbleThC Member

    Joined:
    Nov 7, 2016
    Messages:
    98
    Likes Received:
    6
The initial compile failed due to a QLogic 2x00 device not being present; I removed that dir and recompiled, so far so good.

    I have HowTo Configure SCST Block Storage Target Enabl... | Mellanox Interconnect Community for a reference of scst.conf
    And this old guide How to Set up an Infiniband SRP Target on Ubuntu 12.04

But I'm not sure my SCST is installed correctly atm; is there a way to validate?
I don't have an /etc/init.d/scst, for example.

Never mind, I decided CentOS was a better starting place for me personally, due to its relationship to Red Hat and the value I'd get from learning this Linux variant.

    Will try again from scratch tomorrow :)
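On the validation question above, one way to sanity-check an SCST install is via the kernel modules and sysfs. A sketch; the exact scstadmin flag may vary by version:

```shell
# Are the SCST core and target-driver modules loaded?
lsmod | grep -E 'scst|ib_srpt|iscsi_scst'

# SCST publishes its state under sysfs once the core module is loaded:
ls /sys/kernel/scst_tgt/targets

# scstadmin can dump the configured targets (verify the flag for your build):
scstadmin -list_target
```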
     
    #131
    Last edited: Jan 18, 2017
  12. humbleThC

    humbleThC Member

    Joined:
    Nov 7, 2016
    Messages:
    98
    Likes Received:
    6
Long couple of days at work building AWS clouds :) Finally got back to the NAS project...

This time I'm messing around with Red Hat Server 7.3:
I have the Mellanox OFED 3.4.2 drivers successfully installed.
I have the SCST 3.2.x package successfully installed.
I have ZoL version 0.6.5.8-1.el7_3.centos successfully installed.

    Now to figure out where to go next :)
    But I feel that i'm close to being able to test.

    *Update*
    Almost ready to start benchmarking!!

Got ZFS set up: ARC limited to 64GB, pools created (with properly aligned partitions), ashift=12, and the Intels split 20GB/365GB for the RAID 1/0 ZIL and RAID 0 L2ARC.
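For the ARC cap, the usual ZoL mechanism is a module parameter set in a config fragment; 64 GiB expressed in bytes:

```shell
# /etc/modprobe.d/zfs.conf -- cap ARC at 64 GiB (64 * 1024^3 = 68719476864):
options zfs zfs_arc_max=68719476864
```

It can also be changed on a running system by writing the same value to /sys/module/zfs/parameters/zfs_arc_max.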

    Just need to figure out how to properly configure the infiniband adapters now.
    I'm thinking Port 1 - will be for IPoIB CIFS/NFS
    Port 2 - Will be for iSCSI only.

    Any suggestions on how to get to this config?
     
    #132
    Last edited: Jan 22, 2017
  13. mpogr

    mpogr Member

    Joined:
    Jul 14, 2016
    Messages:
    78
    Likes Received:
    34
    I hope you meant Port 2 will be SRP-only. If you're after IB-RDMA for ESXi clients (as you should be), SRP is the only viable option at the moment. You can (and probably should) define additional iSCSI targets for the same volume exposed via SRP, but make them visible only via the 1 Gbps Ethernet and use this path as a backup (in case IB fails for some reason).

Here's what my scst.conf looks like:

HANDLER vdisk_blockio {
    DEVICE vmfs {
        filename /dev/zvol/pool1/vmfsroot
        nv_cache 1
        rotational 1
        write_through 0
        threads_num 4
    }
    DEVICE vmfs2 {
        filename /dev/zvol/pool2/vmfsroot
        nv_cache 1
        rotational 0
        write_through 0
        threads_num 4
    }
}

TARGET_DRIVER iscsi {
    enabled 1

    TARGET iqn.2014-12.org.xxx.xxx.xxx:storage:zfs-sn-798798789 {
        allowed_portal 192.168.120.53
        QueuedCommands 128
        rel_tgt_id 3
        LUN 0 vmfs
        LUN 1 vmfs2
        enabled 1
    }
}

TARGET_DRIVER ib_srpt {

    TARGET fe80:0000:0000:0000:f452:1403:007c:15d1 {
        enabled 1
        rel_tgt_id 1
        LUN 0 vmfs
        LUN 1 vmfs2
    }

    TARGET fe80:0000:0000:0000:f452:1403:007c:15d2 {
        enabled 1
        rel_tgt_id 2
        LUN 0 vmfs
        LUN 1 vmfs2
    }
}


    In my case, "vmfs" is a volume on the RAID-10 HDD pool and "vmfs2" is a volume on the RAID1 SSD pool.
    "nv_cache 1" results in effective sync=disabled, unless you explicitly override it on ZFS level.
    "allowed_portal" allows the iSCSI target to be exposed only on 1 IP address (the Ethernet one).
    "QueueCommands 128" is an important tweak for iSCSI, the default is too low (32).
    Make sure you have the right addresses for your SRP targets (RTM on how to find them). You can enable only one of them, but then your dmesg will be flooded with unsuccessful login attempt messages for the second port. Only one will be affectively used anyway.
    Finally, "threads_num" was the parameter that allowed to break the single core limitation (for connections coming from a single initiator) and get the impressive results I posted several days ago.

    Good luck!
     
    #133
    T_Minus and humbleThC like this.
  14. humbleThC

    humbleThC Member

    Joined:
    Nov 7, 2016
    Messages:
    98
    Likes Received:
    6
MASSIVE thanks to mpogr for helping me with the following setup. It wasn't easy, and I was failing for days without him; it probably would have been weeks, or I'd have given up entirely. But now I have something to work/test/compare with!!! :)

    RedHat 7.3 Server [Linux 3.10.0-514.el7.x86_64]
    --- Latest as of today, fully updated/patched.

    Mellanox OFED Drivers [MLNX_OFED_LINUX-3.4-2.0.0.0-rhel7.3-x86_64]
    -- Drivers custom make/installed with --add-kernel-support

    SCST 3.2.x
    --- custom make/installed leveraging the MLNX_OFED SRP drivers, that were kernel integrated.
iSCSI w/ SRP to ESX 6.0 U2 [using the v1.8.2.5 OFED drivers for ESX 5.x]

    ZoL latest
    ---

via ConnectX-2 40Gb QDR HCAs attached to a Mellanox 4036.

    Initial Test Benchmarks.

Same Hitachi RAIDZ1 (4+1)×2 setup, with the Intel S3710s partitioned for both ZIL and L2ARC, but this time with partitions properly aligned to 4K sectors, ruling out any potential impact there.
I'd had a 'user issue' trying to partition/align in Solarish (the default GUI used MBR, i.e. mis-aligned partitions), but it turns out that either it wasn't a factor, or it is a factor and my bottleneck is somewhere else.

    upload_2017-1-23_0-33-58.png

So far, nearly identical to my OmniOS+napp-it results (using the same disks, in the same exact way ZFS-wise, with the same ZFS settings).
But a massive difference in kernel/OS between OmniOS and Red Hat Server, for sure.

I definitely notice the ZFS & iSCSI SRP traffic being balanced across all 16 CPUs.
So my assumption that 1.7 GB/s was an upper cap due to only a third of my memory bandwidth being used via the SunOS 5.x kernel didn't pan out.


     
    #134
    Last edited: Jan 23, 2017
  15. whitey

    whitey Moderator

    Joined:
    Jun 30, 2014
    Messages:
    2,113
    Likes Received:
    636
Either one of you gonna publish a definitive 'using SRP storage in vSphere' guide? Not talking a hand-holding, soup-to-nuts work instruction, but a primer w/ golden nuggets of knowledge/lessons learned/config files/solid links to articles that cover each component of the config would sure be nice. I have a config almost there (vSphere w/ 1.8.2.5 SRP VIBs, CentOS 7 ZoL latest from OpenZFS, OFED latest, missing SCST) but apparently need to bone up on SCST. Oh, and I got sidetracked w/ another project.

I see a scst.conf file attached, so there's part of the puzzle; any other critical config files? Remind me @humbleThC, what IB switch are you using? Looks like CX2 cards for sure.
     
    #135
  16. humbleThC

    humbleThC Member

    Joined:
    Nov 7, 2016
    Messages:
    98
    Likes Received:
    6
Some things that confuse me in my overall math/benchmarking over the course of this project:

The ZFS server RAM appears to have more 'speed/power':
# dd if=/dev/zero of=/home/humble/tempfs/testfile bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 1.56663 s, 2.7 GB/s

iperf3 (over IPoIB) can be as high as 18 Gb/s ≈ 2.2 GB/s,
and I'm using SRP, which should be considerably more efficient.

The pool/filesystem appears to have more 'speed/power':
    #dd if=/dev/zero of=/Hitachi/data/vdisk0 bs=1M count=1048576
    1048576+0 records in
    1048576+0 records out
    1099511627776 bytes (1.1 TB) copied, 460.256 s, 2.4 GB/s

But no matter what I do, I can only ever see 1.7 GB/s with any protocol, any way, from any OS, with any driver :)
i.e. Windows 2012 R2 / Windows 2016
OmniOS + napp-it w/ native ZFS (latest)
RedHat + SCST + ZoL (latest)

Testing from ESX 6.0 U2 hosts via NFS/iSCSI/SRP
Testing from a Windows 10 host via SMB3/iSCSI

Mix any of the above with any of the above, and it's 1.7 GB/s any way you slice it
(which is part of the reason I keep testing :)
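For what it's worth, the wire itself shouldn't be the ceiling here. QDR InfiniBand signals at 40 Gb/s but uses 8b/10b encoding, so the usable data rate works out to 32 Gb/s:

```shell
# QDR: 40 Gb/s raw, 8b/10b encoding -> 8/10 of that is usable data:
raw_gbps=40
data_gbps=$((raw_gbps * 8 / 10))
data_GBps=$((data_gbps / 8))
echo "usable QDR data rate: ${data_gbps} Gb/s (${data_GBps} GB/s)"
# Observed 1.7 GB/s * 8 = 13.6 Gb/s, well under the 32 Gb/s wire,
# which points at a bottleneck elsewhere (CPU, ZFS pipeline, queueing).
```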
     
    #136
    Last edited: Jan 23, 2017
  17. humbleThC

    humbleThC Member

    Joined:
    Nov 7, 2016
    Messages:
    98
    Likes Received:
    6
    I very much plan to do the homework required for this.
But I'll gladly build it and hand it over to mpogr for final editing and publishing.
Without his instruction I wouldn't understand it well enough to document it in the first place.
And he's quite active on the Mellanox user forums, where I've read his years of commitment to this science.

ConnectX-2 HCAs (MT26428) with v2.10.720 FW
(But I can say this will work for CX2/CX3/CX4; they all use the same exact driver installation package, and the 'trick' is installing it with kernel support, or the ib_srpt module won't load due to ib_srpt.ko being the wrong version.)
And this will work the same on CentOS/RedHat as well as Ubuntu/others supported by the same Linux package (theoretically).

Mellanox 4036 switch (with the subnet manager running)
(But I can also say this can easily be substituted with the software-based OpenSM.)
In fact, in small environments (2-3 hosts) one might argue it's worth skipping the switch even if you have one, because the 4036 maxes out at a 4092/4096-byte MTU.
With direct-attached hosts (or perhaps newer switches) you could enjoy a 65520-byte MTU.
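For reference, the higher MTU mentioned above requires IPoIB connected mode rather than datagram mode. A sketch, assuming the interface is named ib0:

```shell
# Datagram mode caps the IPoIB MTU around 4092; connected mode allows 65520:
echo connected > /sys/class/net/ib0/mode
ip link set ib0 mtu 65520
```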

I was stuck exactly where you are now. A huge piece of the conversation was 68 posts back and forth directly between mpogr and myself, so as not to muck up this thread. But there are amazing troubleshooting notes in there, and ultimately I documented it soup-to-nuts, and I plan to rebuild this a few more times from scratch just to prove I can (before I'm comfortable enough to consider it "production" and start deploying my actual ESX homelab on it again).

So like I said, I'm happy to do the homework, as long as mpogr edits it, and he deserves the publish credit^^
     
    #137
    Last edited: Jan 23, 2017
    whitey likes this.
  18. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    1,312
    Likes Received:
    354
I do not use IB in my setups, but regarding CIFS or iSCSI, Oracle Solaris 11.3 was mostly 5-30% faster than OmniOS. While I support Solaris for most functions, this would mean a move from Open-ZFS to Oracle's closed-source ZFS v37.
     
    #138
  19. humbleThC

    humbleThC Member

    Joined:
    Nov 7, 2016
    Messages:
    98
    Likes Received:
    6
The company I work for will basically let me leverage our partner relationships for home-lab licensing, so Oracle/RedHat/Microsoft/other licenses are available (if I need one).

Solaris 11.3 does have the drivers I need, "in theory". I'll have to do a whole new research path on the state of SRP in in-band Solaris, vs. Mellanox-provided, vs. the capabilities/dependencies of the iSCSI daemon, vs. COMSTAR, and see if there's a solution in there that's better than anything I've seen/tested.

The RedHat/CentOS path I'm down now is viable (once you have a solid runbook and are willing to custom-build your entire driver/SCST/ZFS stack every time you want to update your OS kernel). Granted, it's insanely hard to figure out the first time, but once you've troubleshot every layer of everything once, it all makes sense, and tearing it down and rebuilding seems easy enough. So doing it as a maintenance activity around key updates is acceptable to some people who want to stay down this path.

However, there's an unbelievably huge advantage to OmniOS+napp-it: you're up in 15 minutes, it works out of the box, and there's a GUI, monitoring, and all the extra NAS/SAN stuff. It's very impressive, and I'm amazed by what you've done for that entire storage path. I might be back down this path shortly :)

I need to test the hell out of this setup (and document it, and publish it for others and myself, in case I come back or do it again someday)... and see where it stands.
     
    #139
  20. T_Minus

    T_Minus Moderator

    Joined:
    Feb 15, 2015
    Messages:
    4,858
    Likes Received:
    871
RE: the 4096-byte MTU max of the 4036.
Have you compared IOPS and latency when doing transfers to and from high-performing SSD or NVMe pools, and the effects the MTU limitation may have vs. direct connect with the higher value?
     
    #140