ZFS iSER SAN


efschu2

Member
Feb 14, 2019
68
12
8
I'm planning a high-performance ZFS iSER SAN as storage for 4 ESXi hosts, and I have a couple of questions. This does not have to be a zero-downtime high-availability cluster; if the primary node fails and the SAN is back online within about 10 seconds, that will be just fine. I will use pcsd/corosync/pacemaker for this (rough failover sketch after the hardware list below).

First, the planned configuration:
  • Debian, Ubuntu or CentOS - not decided yet, probably Ubuntu 18.04
  • Supermicro SuperStorage Server SSG-2029P-DN2R24L (2-node-in-a-box system)
  • 2x Intel Xeon Bronze 3104 (6x 1.7GHz, no HT, no Turbo) per node
  • 12x 16GB RAM (primary node) + 4x 16GB (failover node)
  • 8-12x Samsung SSD PM1725a 3.2TB U.2
  • dual-ported 40GbE NIC per node - not decided yet
  • dual-ported 10GbE NIC per ESXi host - not decided yet
  • some 10GbE + 40GbE switches - not decided yet
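
For the failover itself, roughly what I have in mind is one Pacemaker resource group that moves the pool import, the iSER target and a floating portal IP together. Just a sketch with made-up names (san-cluster, tank, 10.0.0.10), assuming the ZFS and iSCSITarget resource agents from the ClusterLabs resource-agents package are available:

pcs cluster setup --name san-cluster node-a node-b
pcs cluster start --all
# imports/exports the shared pool on failover (ZFS resource agent)
pcs resource create tank-pool ocf:heartbeat:ZFS pool=tank
# the iSER/iSCSI target (LIO); the IQN is just an example
pcs resource create iser-tgt ocf:heartbeat:iSCSITarget implementation=lio-t iqn=iqn.2019-02.local.san:tank
# floating portal IP the ESXi initiators log in to
pcs resource create iser-ip ocf:heartbeat:IPaddr2 ip=10.0.0.10 cidr_netmask=24
# keep everything on the same node and start in order: pool -> target -> IP
pcs resource group add san-group tank-pool iser-tgt iser-ip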

My question is about NICs and switches. For iSER I need RDMA-capable NICs and switches.
I'm looking at the Cavium FastLinQ QL45412HLCU-CI (2x 40GbE per SAN node) and the Cavium FastLinQ QL41132HLCU (2x 10GbE per ESXi host), but I'm totally unsure which switch to pick. I know DCB and PFC are a must-have, but I'm confused about the need for ETS - do the switches really need it for iSER?
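
On the NIC side, by the way, PFC is enabled per priority; on Mellanox cards, for example, that is done with mlnx_qos - just a sketch, assuming the storage traffic is mapped to priority 3:

# enable PFC only on priority 3 (the eight flags are priorities 0-7)
mlnx_qos -i eth2 --pfc 0,0,0,1,0,0,0,0
# show the resulting QoS/PFC state
mlnx_qos -i eth2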

Because of some budget limitations (well, it's still kind of enough budget :D) we will reuse some existing hardware like CPUs and RAM, so I don't want to spend another 20k just for the two switches.
Maybe someone has a suggestion which switches to buy for this (used hardware from eBay is fine too). In fact, a switch with 4 SFP+ (or maybe RJ45?) 10GbE ports and 2 QSFP+ 40GbE ports would be enough for the throughput; more 10/40GbE ports would just be nice to have.

I'm looking forward to your (hopefully many) suggestions.

Regards
 

efschu2

Member
Feb 14, 2019
68
12
8
Is it possible to connect the QSFP28 ports of the CX-4 to the QSFP+ ports of the Arista 7050Q? Which cable would I need for this? A 1m cable length would be enough - can I use a DAC?

Otherwise I would go for some CX-3 Pros - which DACs would you recommend for those?

Do I understand correctly that I can switch the port type from IB to Ethernet on any Mellanox CX3/4 (Pro)?
 

zxv

The more I C, the less I see.
Sep 10, 2017
156
57
28
Is it possible to connect the QSFP28 ports of the CX-4 to the QSFP+ ports of the Arista 7050Q? Which cable would I need for this? A 1m cable length would be enough - can I use a DAC?

Otherwise I would go for some CX-3 Pros - which DACs would you recommend for those?

Do I understand correctly that I can switch the port type from IB to Ethernet on any Mellanox CX3/4 (Pro)?
Every 40Gb QSFP DAC cable I've tried has worked between a Mellanox CX3 and an Arista 7050QX.
Here's a sample of part numbers I'm currently using.
Amphenol 530-4445-01-
Arista Networks CAB-Q-Q-2M
Mellanox MC2206130-002

The Arista reports zero errors for all of them, so I can't point to any practical differences.
I like the mechanical quality of the Mellanox.

Any length should be fine. I have none shorter than 2m, but plan to use shorter ones, including 0.5m.

Be wary of NetApp SAS QSFP cables. I've heard of others using them, but they have very poor signal quality at 40G and can completely fail to link due to errors.
 

efschu2

Member
Feb 14, 2019
68
12
8
Could you explain your calculation? I mean, if clock speed matters that much, then I would need some non-existent 6.8GHz Xeons for 40GbE (and what about the people running 100GbE?). But with iSER I'm offloading TCP/IP processing to the NICs, and I have never seen high CPU load from ZFS - for checksumming, each node has 24 AVX-512 units, which should be more than capable of 40Gb/s.

So which process exactly would benefit enough from roughly 50% higher clock speed to quadruple the throughput or IOPS?
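
For what it's worth, ZFS on Linux benchmarks its fletcher4 checksum implementations when the module loads and exposes the results as a kstat, so it's easy to verify that the AVX-512 path is picked and how fast it is (paths as on current ZoL releases; adjust to your build):

cat /proc/spl/kstat/zfs/fletcher_4_bench             # per-implementation throughput, incl. avx512f
cat /sys/module/zfs/parameters/zfs_fletcher_4_impl   # implementation currently in use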
 

Rand__

Well-Known Member
Mar 6, 2014
6,626
1,767
113
So how did it go so far?
I was hypothesizing about something similar with OEL and InfiniBand intra-cluster communication the other day, so I wondered how it was going for you :)
 

zxv

The more I C, the less I see.
Sep 10, 2017
156
57
28
I'm interested as well.

Otherwise I would go for some CX-3 Pros - which DACs would you recommend for those?
Has anyone been successful using a CX3 Pro with ESXi 6.x for RoCE?

I have not found a way to enable ECN (Explicit Congestion Notification) on a CX3 Pro under ESXi 6.7 with the current tools. I see stalls in RoCE traffic, which is why I'm looking at ECN.

The Mellanox CLI reports:
esxcli mellanox uplink ecn rRoceRp enable -u vmnic4
Error: Did not detect compatible driver / NIC with nmlxcli
Error: For Mellanox ConnectX-3 NIC required driver ver. 3.X.9.8 or greater,for ConnectX-4/5 required driver ver. 4.X.12.10 or greater

I've tried both the inbox (version 3.17.9.12) and a couple of versions of Mellanox's drivers (3.15.11.10 and 3.15.5.5), with the same result.
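
In case someone wants to compare, this is how I check which driver and tools VIBs the host actually has loaded (just a sketch; VIB names differ between the inbox and Mellanox packages):

esxcli software vib list | grep -i nmlx      # installed nmlx driver and nmlxcli VIBs
esxcli system module get -m nmlx4_core       # version of the loaded ConnectX-3 module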
 

zxv

The more I C, the less I see.
Sep 10, 2017
156
57
28
Do I understand correctly that I can switch the port type from IB to Ethernet on any Mellanox CX3/4 (Pro)?
The CX3 Pro VPI cards should automatically switch to Ethernet and to 10 or 40G; you don't need to make any configuration changes.

There are certain protocols and features that won't work if one port is Ethernet and the other IB. The release notes for the driver should cover those limitations.
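
If a port ever needs to be pinned to Ethernet rather than auto-detected, the link type can also be set persistently in firmware with mlxconfig from the Mellanox Firmware Tools. A sketch - the MST device path is just an example, check mst status for yours:

mst start
mst status                                    # list MST devices
mlxconfig -d /dev/mst/mt4103_pci_cr0 set LINK_TYPE_P1=2 LINK_TYPE_P2=2    # 1=IB, 2=ETH, 3=VPI/auto
# reboot or reload the driver for the new port type to take effect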
 

Rand__

Well-Known Member
Mar 6, 2014
6,626
1,767
113
If ESXi is required, I'd also check vSAN. Nutanix CE is free for up to 4 nodes. If you seem comfortable with Linux, why not run Ceph?
Nutanix Portal
https://community.mellanox.com/s/article/bring-up-ceph-rdma---developer-s-guide
RDMA support in vSphere | vSphere 6.7 Core Storage | VMware
Me personally: I run vSAN now - too slow for my use case (very few high-perf users); I don't like Nutanix's always-on requirement; Ceph was also too slow (untweaked); and I was hoping to get an IB-based system running. :)
Still early in planning though, not sure it will actually work ;)
 

BoredSysadmin

Not affiliated with Maxell
Mar 2, 2019
1,050
437
83
I was able to get functioning servers running on an older [low-end] Nutanix 3-node system with a 1-gig network and hybrid storage. I wonder how fast it would be with all-flash and an RDMA 40-gig network...
I don't like Nutanix's always-on requirement
Could you expand on this, please? What exactly do you mean - Nutanix support for Always-On clusters?
 

Rand__

Well-Known Member
Mar 6, 2014
6,626
1,767
113
I was able to get functioning servers running on an older [low-end] Nutanix 3-node system with a 1-gig network and hybrid storage. I wonder how fast it would be with all-flash and an RDMA 40-gig network...
Could you expand on this, please? What exactly do you mean - Nutanix support for Always-On clusters?
No, they need you to have an active internet connection so they can... improve their system... by tracking your utilization.
At least that's how I understood it from the scarce public documentation available.
At least it was that way a while ago, when I looked into it.
 

BoredSysadmin

Not affiliated with Maxell
Mar 2, 2019
1,050
437
83
No, they need you to have an active internet connection so they can... improve their system... by tracking your utilization.
At least that's how I understood it from the scarce public documentation available.
At least it was that way a while ago, when I looked into it.
Interesting. There is also this (last entry in table):
Nutanix Portal

I don't think this applies to the commercial editions, though.
 

zxv

The more I C, the less I see.
Sep 10, 2017
156
57
28
Did anyone try running a Linux VM as a ZFS file server with iSER using PVRDMA on ESXi 6.7, serving storage to another ESXi host?
Slide 10 of this: https://www.openfabrics.org/images/eventpresos/2016presentations/102parardma.pdf
says PVRDMA, "Can only work when both endpoints are VMs."

SR-IOV will allow a VM to exchange RoCE traffic with other hosts. The VM has to have all of its memory reserved (pinned) to allow this. Also, flow control is configured on the ESXi host, not in the VM, which adds a bit of complexity.
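
For reference, turning on the virtual functions is a host-level module parameter plus a reboot; a sketch for the ConnectX-3 native driver (the VF count is just an example):

esxcli system module parameters set -m nmlx4_core -p "max_vfs=4"
# after the reboot, pass a VF through to the VM and reserve all of its memory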

BTW, QEMU has a PVRDMA device that can communicate with bare-metal peers: qemu/qemu
 

efschu2

Member
Feb 14, 2019
68
12
8
So how did it go so far?
I was hypothesizing about something similar with OEL and InfiniBand intra-cluster communication the other day, so I wondered how it was going for you :)
Well, I'm still in the "planning phase" because we have a lot of other stuff to do right now. But for sure I will report back here.

If ESXi is required, I'd also check vSAN. Nutanix CE is free for up to 4 nodes. If you seem comfortable with Linux, why not run Ceph?
Nutanix Portal
https://community.mellanox.com/s/article/bring-up-ceph-rdma---developer-s-guide
RDMA support in vSphere | vSphere 6.7 Core Storage | VMware
Thanks for the advice, but I dislike Ceph's performance.
 