ZFS iSER SAN

Discussion in 'Linux Admins, Storage and Virtualization' started by efschu2, Feb 14, 2019.

  1. efschu2

    efschu2 New Member

    Joined:
    Feb 14, 2019
    Messages:
    20
    Likes Received:
    1
    I'm planning a high-performance ZFS iSER SAN as storage for 4 ESXi hosts, and I have a couple of questions. This does not need to be a zero-downtime high-availability cluster; if the primary node fails and the SAN is back online within about 10 seconds, that will be just fine. I will use pcsd/corosync/pacemaker for this.
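
    Roughly, I expect the pcs side to look something like this - just a sketch, where the pool name, the floating IP and the SCST systemd unit are placeholders, and the resource agents are assumed to come from the stock resource-agents package:

    # sketch only: pool "tank", portal IP 192.168.100.10/24 and the "scst" unit are placeholders
    pcs resource create san_pool ocf:heartbeat:ZFS pool=tank op monitor interval=10s
    pcs resource create san_ip ocf:heartbeat:IPaddr2 ip=192.168.100.10 cidr_netmask=24
    pcs resource create san_target systemd:scst
    # a resource group starts its members in order (pool -> IP -> target) and keeps them on one node
    pcs resource group add san_group san_pool san_ip san_target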

    First, the planned configuration:
    • Debian, Ubuntu or CentOS - not decided yet, probably Ubuntu 18.04
    • Supermicro SuperStorage Server SSG-2029P-DN2R24L (2-node in a box system)
    • 2x Intel Xeon Bronze 3104 (6x 1.7GHz, no HT, no Turbo) each node
    • 12x16GB (primary-node) + 4x16GB (failover-node)
    • 8-12 Samsung SSD PM1725a 3.2TB U.2
    • dual-ported 40GbE NIC for each node - not decided yet
    • dual-ported 10GbE NIC for each ESXi host - not decided yet
    • some 10GbE + 40GbE switches - not decided yet

    My question is about NICs and switches. For iSER I need RDMA-capable NICs and switches.
    I'm looking at the Cavium FastLinQ QL45412HLCU-CI (2x 40GbE per node) for the SAN and the Cavium FastLinQ QL41132HLCU (2x 10GbE) for each ESXi host, but I'm totally unsure which switch to pick. I know DCB and PFC are a must-have, but I'm confused about the need for ETS - do the switches really need it for iSER?
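
    On the NIC side I expect the PFC part to be simple; a sketch of what I'd run, assuming Mellanox-style tooling ends up being used (the interface name and the choice of priority 3 are placeholders) - it's the switch side that I'm unsure about:

    # sketch: enable PFC only on priority 3, the usual lossless class for storage traffic
    mlnx_qos -i ens1f0 --pfc 0,0,0,1,0,0,0,0
    # show the resulting per-priority configuration
    mlnx_qos -i ens1f0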

    Because of some budget limitations (well, it's still kind of enough budget :D ) we will reuse some existing hardware like CPUs and RAM, so I do not want to spend another 20k just for the two switches.
    Maybe someone has a suggestion for which switches I should buy (used hardware from eBay is OK too). In fact, a switch with 4 SFP+ (or maybe RJ45?) 10GbE and 2 QSFP+ 40GbE ports would be enough for throughput; more 10/40GbE ports would just be a nice-to-have.

    I'm looking forward to your (hopefully many) suggestions.

    Regards
     
    #1
  2. efschu2

    efschu2 New Member

    Joined:
    Feb 14, 2019
    Messages:
    20
    Likes Received:
    1
    Would an Arista Networks 7050Q-16 be capable?
     
    #2
  3. markpower28

    markpower28 Active Member

    Joined:
    Apr 9, 2013
    Messages:
    391
    Likes Received:
    98
    I would stay with Mellanox ConnectX-4 and above.
     
    #3
  4. efschu2

    efschu2 New Member

    Joined:
    Feb 14, 2019
    Messages:
    20
    Likes Received:
    1
    Is it possible to connect the QSFP28 ports of the CX-4 to the QSFP+ ports of the Arista 7050Q? Which cable would I need for this? A 1m cable length would be enough - is it possible to use a DAC?

    Otherwise I would go for some CX-3 Pros - which DACs would you recommend for those?

    Do I understand correctly that I can switch the port type from IB to Ethernet on any Mellanox CX3/4 (Pro)?
     
    #4
  5. zxv

    zxv The more I C, the less I see.

    Joined:
    Sep 10, 2017
    Messages:
    96
    Likes Received:
    33
    Every 40gb qsfp DAC cable I've tried has worked between Mellanox CX3 and Arista 7050QX.
    Here's a sample of part numbers I'm currently using.
    Amphenol 530-4445-01-
    Arista Networks CAB-Q-Q-2M
    Mellanox MC2206130-002

    The Arista reports zero errors for all of them, so I can't point to any practical differences.
    I like the mechanical quality of the Mellanox.

    Any length should be fine. I have none shorter than 2m, but I plan to use shorter ones, including 0.5m.

    Be wary of NetApp SAS QSFP cables. I've heard of others using them, but they have very poor signal quality at 40G and can completely fail to link due to errors.
     
    #5
    BoredSysadmin likes this.
  6. Barbapappa

    Barbapappa New Member

    Joined:
    Jan 2, 2017
    Messages:
    5
    Likes Received:
    2
    I would recommend higher clock speeds to fully utilize the higher-speed networking gear. Otherwise, just go with 10 GbE.
     
    #6
  7. efschu2

    efschu2 New Member

    Joined:
    Feb 14, 2019
    Messages:
    20
    Likes Received:
    1
    Could you explain your calculation? I mean, if clock speed matters that much, then I would need some non-existent 6.8GHz Xeons for 40GbE (and what about the people using 100GbE?). But with iSER the data path is offloaded to the NICs via RDMA, and I have never seen high CPU load from ZFS - for checksumming, each node has 24 AVX-512 units, which should be more than capable of handling 40Gb/s.

    So which process, exactly, would benefit enough from roughly 50% higher clock speed to quadruple the throughput or IOPS?
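
    To actually see where the limit is, I'd run something like this against an exported LUN and watch per-core CPU load while it runs (the device path, block size and queue depth are just placeholders for the test):

    # sketch: 4k random reads, QD32, 4 workers, direct I/O against the iSER LUN
    fio --name=iser-test --filename=/dev/sdX --ioengine=libaio --direct=1 \
        --rw=randread --bs=4k --iodepth=32 --numjobs=4 --runtime=60 --time_based --group_reporting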
     
    #7
  8. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    2,942
    Likes Received:
    406
    So how did it go so far?
    I was considering something similar with OEL and InfiniBand intra-cluster communication the other day, so I was wondering how it is going for you :)
     
    #8
  9. BoredSysadmin

    BoredSysadmin Member

    Joined:
    Mar 2, 2019
    Messages:
    57
    Likes Received:
    7
    #9
  10. zxv

    zxv The more I C, the less I see.

    Joined:
    Sep 10, 2017
    Messages:
    96
    Likes Received:
    33
    I'm interested as well.

    Has anyone been successful using a CX3 Pro and ESXi 6.x for RoCE?

    I have not found a way to enable ECN (Explicit Congestion Notification) on a CX3 Pro on ESXi 6.7 with the current tools. I see stalls in RoCE traffic, and that's why I'm looking at ECN.

    The Mellanox CLI reports:
    esxcli mellanox uplink ecn rRoceRp enable -u vmnic4
    Error: Did not detect compatible driver / NIC with nmlxcli
    Error: For Mellanox ConnectX-3 NIC required driver ver. 3.X.9.8 or greater,for ConnectX-4/5 required driver ver. 4.X.12.10 or greater

    I've tried both the inbox (version 3.17.9.12) and a couple of versions of Mellanox's drivers (3.15.11.10 and 3.15.5.5), with the same result.
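
    For reference, a quick way to check what is actually installed on the host (the vmnic name is whatever the Mellanox port enumerates as on your system):

    # list Mellanox-related VIBs (driver and nmlxcli tool) and the driver bound to the uplink
    esxcli software vib list | grep -i nmlx
    esxcli network nic list
    esxcli network nic get -n vmnic4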
     
    #10
  11. zxv

    zxv The more I C, the less I see.

    Joined:
    Sep 10, 2017
    Messages:
    96
    Likes Received:
    33
    The CX3 Pro VPI cards should automatically switch to Ethernet and auto-negotiate 10 or 40G. You don't need to make any configuration changes.

    There are certain protocols and features that won't work if one port is Ethernet and the other IB. The release notes for the driver should cover those limitations.
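
    If you want to pin a port to Ethernet explicitly instead of relying on auto-sensing, mlxconfig from the Mellanox Firmware Tools can do it. A rough sketch - the device path varies per card (check mst status), and 2 = ETH, 1 = IB:

    # sketch: force both ports of the card to Ethernet, then reload the driver or reboot
    mst start
    mlxconfig -d /dev/mst/mt4103_pciconf0 set LINK_TYPE_P1=2 LINK_TYPE_P2=2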
     
    #11
  12. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    2,942
    Likes Received:
    406
    Me personally,
    I run vSAN now - too slow for my use case (very few high-perf users). I don't like Nutanix's always-on requirement, and Ceph (untweaked) was also too slow, so I was hoping to get an IB-based system running. :)
    Still early in planning though, not sure it will actually work ;)
     
    #12
  13. BoredSysadmin

    BoredSysadmin Member

    Joined:
    Mar 2, 2019
    Messages:
    57
    Likes Received:
    7
    I was able to get functioning servers running on an older [low-end] Nutanix 3-node system on a 1-gig network with hybrid storage. I wonder how fast it would be with all-flash and an RDMA 40-gig network ...
    Could you expand on this, please? What exactly do you mean - Nutanix support for always-on clusters?
     
    #13
  14. superempie

    superempie New Member

    Joined:
    Sep 25, 2015
    Messages:
    21
    Likes Received:
    3
    Did anyone try running a Linux VM as a ZFS file server with iSER using PVRDMA in ESXi 6.7, serving it to another ESXi host?
     
    #14
  15. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    2,942
    Likes Received:
    406
    No, they need you to have an active internet connection so they can... improve their system... by tracking your utilization.
    At least that's how I understood it from the scarce public documentation available, and that's how it was a while ago when I looked into it.
     
    #15
  16. BoredSysadmin

    BoredSysadmin Member

    Joined:
    Mar 2, 2019
    Messages:
    57
    Likes Received:
    7
    Interesting. There is also this (last entry in table):
    Nutanix Portal

    I don't think this applies to the commercial editions, though.
     
    #16
  17. zxv

    zxv The more I C, the less I see.

    Joined:
    Sep 10, 2017
    Messages:
    96
    Likes Received:
    33
    Slide 10 of this: https://www.openfabrics.org/images/eventpresos/2016presentations/102parardma.pdf
    says PVRDMA, "Can only work when both endpoints are VMs."

    SR-IOV will allow a VM to exchange RoCE traffic with other hosts. The VM has to have all of its memory reserved (pinned) to allow this. Also, the flow control is configured on the ESXi host, not in the VM, which is an added bit of complexity.
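
    If anyone tries the SR-IOV route: as far as I know, VFs on a ConnectX-3 under ESXi are exposed via the native driver's module parameter - treat the parameter name and count below as assumptions and check the driver docs:

    # sketch: ask nmlx4_core for 4 virtual functions, verify, then reboot the host
    # and attach a VF to the VM as a PCI passthrough device (with all memory reserved)
    esxcli system module parameters set -m nmlx4_core -p "max_vfs=4"
    esxcli system module parameters list -m nmlx4_core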

    BTW, Qemu has a PVRDMA device that can communicate with bare metal peers: qemu/qemu
     
    #17
    superempie and BoredSysadmin like this.
  18. superempie

    superempie New Member

    Joined:
    Sep 25, 2015
    Messages:
    21
    Likes Received:
    3
    Thanks, will check it out.
     
    #18
  19. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    2,942
    Likes Received:
    406
    Yeah. And probably not.
     
    #19
  20. efschu2

    efschu2 New Member

    Joined:
    Feb 14, 2019
    Messages:
    20
    Likes Received:
    1
    Well, I'm still in the "planning phase" because we have a lot of other stuff to do right now. But for sure I will report back here.

    Thanks for the advice, but I don't like Ceph's performance.
     
    #20
    Last edited: Mar 12, 2019