ZFS Advice for new Setup

mpogr · Jan 15, 2017

Good point, I do have a UPS + crash-free history of the server (despite hardware failures) + daily backups of all important VMs. So I'll stick with sync=disabled for the dime being...

Just for some additional charts for comparison:

ATTO w/all optimisations ------------------------------------- w/sync=always (w/SLOG)

mpogr · Jan 16, 2017

humbleThC said:
For reference:
View attachment 4258
I'm trying Round Robin to load balance my IOs across both SRP interfaces. (vmhba32 and vmhba33)
I've manually disabled the IPoIB paths which dont leverage SRP. (vmhba39)
If i'm not mistaken, this will be the ideal setup.

I've taken a different approach to this. On one hand, I want to have redundancy over Ethernet, so I exposed the iSCSI target just via the Ethernet IP on the server (using the "allowed_portal" parameter of SCST). On the other hand, I don't want this link to be used unless SRP fails, so I don't use round robin. Instead, I use the "Fixed" multipathing policy with the first SRP adapter link as "preferred". At the same time, I don't use IPoIB on that port, I haven't even defined a VMkernel NIC on top of it.
Now, I do use the IPoIB capability of the second port for very fast inter-ESXi networking. There are so many possible uses for it, e.g. VMotion, exposing CIFS shares etc. All you need to do is add another NIC to your VM and connect it to the corresponding vSwitch and, voila, your VM is on a 56Gbps (in my case) network.

humbleThC · Jan 16, 2017

mpogr said:
Well, the real beauty of Infiniband (with RDMA) + ZFS is that you have BETTER THAN SSD speeds across the wire, thanks to the massive in-memory caching on the server side...

That would be a bit difficult to organise, considering I'm in the land down under and the work week has already started here :-(

I'm in no hurry

, and I work in all time zones.. Although i'm near Chicago, my active engagements right now are in UK, NL, and Canada. I'm rocking -3hrs in one timezone and +6hrs in another.

Also thats the difference between CX2 and CX3/4s ... I can't get SRV-IO to work (from what i've read) so none of my VMs will technically ever be on the IB network

For me i consider it a 'SAN' fabric, but I do like the idea of using it as VMotion Network, and maybe Provisioning. Plus i've done several tests with RR enabled on all SRP targets. RR enabled on all SRP+IPoIB targets, and Fixed on a single SRP target. No noticeable difference in my configuration. (still cant break 1.7GBs in data xfer). Which is about as fast as i've ever got over SMB3/RDMA from Win10 to Win2016 copying from RAMdrive to RAMdrive (if i remember correctly).

humbleThC · Jan 17, 2017

Also I really need help on partitioning these Intel S3710s.

From my research, the reason why you want to 'whole disk it' is two fold. One a initialized ZFS disk has a blank GPT partition, which by itself offsets/aligns to 4K sectors. Another main reason is apparently ZFS wont use the onboard disk cache of a device, unless you whole partition it, to prevent some type of UFS file system caching issue that could exist, if you use the other partitions for UFS etc.

Obviously i dont want to use the whole disk for ZIL, ideally i'm thinking I want to size my ZIL for 4GBs * 5 seconds = 20GB. And Ideally I want it mirrored across all (4) Intel SSDs for performance and redundancy. so 20GB Usable = 10GB partitions in RAID1/0.

I can't use the napp-it GUI to partition the disks, because it defaults to MBR partition scheme, which isnt 4K sector aligned. And I dont want to take a huge hit on the SSD performance accordingly

.

Next, is there a way to override the 'use disk cache' on disk, where you are only using a partition, and not the whole disk? I promise i'll never use UFS on the other partitions

I still plan to create a 1.2TB'ish L2ARC across the remainder of the Intel SSDs in RAID0. And I need these partitions to be GPT/aligned as well.

Pretty sure my initial benchmark results were gimped, due to the alignment of the SSDs for ZIL/L2ARC the way I partitioned it via GUI.

ttabbal · Jan 17, 2017

For SLOG/ZIL, you just add them to the pool as log devices, there's no RAID involved, nor should there be. Just make 10GB partitions and add those as log devices.

I haven't used napp-it for a long time now, but you might have to go CLI for GPT. gparted should be able to create them wouldn't much hassle.

For the L2ARC sharing the drive... that's kind of frowned on, but it may work fine for you. The catch is that you are now using controller bandwidth for SLOG and L2ARC traffic. If you do it, I would recommend leaving at least 25% of the disk as unpartitioned space. That leaves some room for the disk's controller and firmware to clean up, so there are always erased blocks ready to use.

I seem to remember reading that the whole 4k alignment thing wasn't really an issue with newer SSDs. Might want to look into that, but maybe that was more due to GPT becoming popular than the disks getting better about it.

I don't know what, if any, effect partitions vs full disks has these days. I know it used to be a big deal, but that might have changed. Since you're benchmarking, it might be an interesting test. Make a mirror with drives, test, then one with partitions, test...

mpogr · Jan 17, 2017

humbleThC said:
...the difference between CX2 and CX3/4s ... I can't get SRV-IO to work (from what i've read) so none of my VMs will technically ever be on the IB network ...

I haven't actually meant SRV-IO, nor I can even use it. FYI, SRV-IO is only supported on VMware drivers 2.x and on, but those have neither SRP nor iSER over IB, so I stick to 1.8.2.5.
What I did mean was regular VMware networking via the VMkernel NIC on top of the second Mellanox card IPoIB interface. It won't be full speed by any means, but still much better than 1 Gbps. This will serve VMotion/Provisioning out of the box + you can connect virtual NICs from your VMs to it.

humbleThC said:
...i've done several tests with RR enabled on all SRP targets. RR enabled on all SRP+IPoIB targets, and Fixed on a single SRP target. No noticeable difference in my configuration. (still cant break 1.7GBs in data xfer)...

I'm pretty sure RR won't provide you with any additional benefit on top of a single link, and, if your second IB port is used for IPoIB networking, it may actually cause more harm than provide any benefit.
As for the 1.7 GB/s peak you're unable to break, I'd rather suspect you're single-core CPU limited on the server side, and I'm afraid you have very limited (if any) means of breaking it on Solaris. On Linux, as you could see from my posts, you have much more control in terms of employing multithreading, which could lead to significant increase of the overall throughput.

humbleThC · Jan 17, 2017

mpogr said:
I haven't actually meant SRV-IO, nor I can even use it. FYI, SRV-IO is only supported on VMware drivers 2.x and on, but those have neither SRP nor iSER over IB, so I stick to 1.8.2.5.
What I did mean was regular VMware networking via the VMkernel NIC on top of the second Mellanox card IPoIB interface. It won't be full speed by any means, but still much better than 1 Gbps. This will serve VMotion/Provisioning out of the box + you can connect virtual NICs from your VMs to it.

I'm pretty sure RR won't provide you with any additional benefit on top of a single link, and, if your second IB port is used for IPoIB networking, it may actually cause more harm than provide any benefit.
As for the 1.7 GB/s peak you're unable to break, I'd rather suspect you're single-core CPU limited on the server side, and I'm afraid you have very limited (if any) means of breaking it on Solaris. On Linux, as you could see from my posts, you have much more control in terms of employing multithreading, which could lead to significant increase of the overall throughput.

What would you do in my position?
ZoL via CentOS 7.3, and if so got any good links of where to start?

mpogr · Jan 17, 2017

I think I'd given Linux a go (not necessarily CentOS). You need to find info on the following:
* ZoL
* Mellanox OFED
* SCST
Luckily, all of these are very well documented.

humbleThC · Jan 18, 2017

mpogr said:
I think I'd given Linux a go (not necessarily CentOS). You need to find info on the following:
* ZoL
* Mellanox OFED
* SCST
Luckily, all of these are very well documented.

Messing around with Ubuntu 16.04.1.
Updated to latest 4.4.0-59-generic kernel.
Adding MLNX OFED 3.4.2 Drivers with --add-kernel-support for 4.4.0-59.

root@nas01:~/MLNX_OFED_LINUX-3.4-2.0.0.0-ubuntu16.04-x86_64# hca_self_test.ofed

---- Performing Adapter Device Self Test ----
Number of CAs Detected ................. 1
PCI Device Check ....................... PASS
Kernel Arch ............................ x86_64
Host Driver Version .................... MLNX_OFED_LINUX-3.4-2.0.0.0 (OFED-3.4-2.0.0): 4.4.0-59-generic
Host Driver RPM Check .................. PASS
Firmware on CA #0 HCA .................. v2.10.0720
Host Driver Initialization ............. PASS
Number of CA Ports Active .............. 0
Error Counter Check on CA #0 (HCA)...... PASS
Kernel Syslog Check .................... PASS
Node GUID on CA #0 (HCA) ............... NA
------------------ DONE ---------------------

Decided to try Ubuntu next, but will want to try CentOS as well.
I saw that napp-it was available for Linux, specifically older Ubuntu, is that still a thing?

I'm on attempt#2 for the SCST compiling

so there's that.. got a good link for that?

mpogr · Jan 18, 2017

I'd stay away from trunk of SCST, grab the 3.2.x branch, which should be stable:

svn checkout ...

Then in the main directory:

make 2perf

Then:

make scst scstadmin srpt iscsi-scst
sudo make install scst scstadmin srpt iscsi-scst

That should be it. Next you need to create /etc/scst.conf, google for examples.

humbleThC · Jan 18, 2017

mpogr said:
I'd stay away from trunk of SCST, grab the 3.2.x branch, which should be stable:

svn checkout ...

Then in the main directory:

make 2perf

Then:

make scst scstadmin srpt iscsi-scst
sudo make install scst scstadmin srpt iscsi-scst

That should be it. Next you need to create /etc/scst.conf, google for examples.

Initial compile failed, due to a Qlogic 2000 device not being present, removed that dir , and recompiled, so far so good.

I have HowTo Configure SCST Block Storage Target Enabl... | Mellanox Interconnect Community for a reference of scst.conf
And this old guide How to Set up an Infiniband SRP Target on Ubuntu 12.04

But i'm not sure my SCST is installed correctly atm, is there a way to validate?
i dont have an /etc/init.d/scst for example.

NVM, decided CentOS was a better starting place for me personally, due to it's relationship to RedHat, and the value add i'd get from learning this Linux variant.

Will try again from scratch tomorrow

humbleThC · Jan 21, 2017

Long couple days @ work building AWS clouds

Finally got back to the NAS project...

This time i'm messing around with RedHat Server 7.3
I have the Mellanox OFED 3.4.2 Drivers successfully installed
I have the SCST 3.2.x package successfully installed.
I have ZoL Version 0.6.5.8-1.el7_3.centos successfully installed.

Now to figure out where to go next

But I feel that i'm close to being able to test.

*Update*
Almost ready to start benchmarking!!

Got ZFS setup, ARC limited to 64GB, Pools are created (with properly aligned partitions) , Ashift=12, and Intels are split 20GB/365GB for RAID1/0 ZIL and RAID0 L2ARC.

Just need to figure out how to properly configure the infiniband adapters now.
I'm thinking Port 1 - will be for IPoIB CIFS/NFS
Port 2 - Will be for iSCSI only.

Any suggestions on how to get to this config?

mpogr · Jan 22, 2017

humbleThC said:
'm thinking Port 1 - will be for IPoIB CIFS/NFS
Port 2 - Will be for iSCSI only.

I hope you meant Port 2 will be SRP-only. If you're after IB-RDMA for ESXi clients (as you should be), SRP is the only viable option at the moment. You can (and probably should) define additional iSCSI targets for the same volume exposed via SRP, but make them visible only via the 1 Gbps Ethernet and use this path as a backup (in case IB fails for some reason).

Here's how my scst.conf looks like:

HANDLER vdisk_blockio {
DEVICE vmfs {
filename /dev/zvol/pool1/vmfsroot
nv_cache 1
rotational 1
write_through 0
threads_num 4
}
DEVICE vmfs2 {
filename /dev/zvol/pool2/vmfsroot
nv_cache 1
rotational 0
write_through 0
threads_num 4
}
}

TARGET_DRIVER iscsi {
enabled 1

TARGET iqn.2014-12.org.xxx.xxx.xxx:storage:zfs-sn-798798789 {
allowed_portal 192.168.120.53
QueuedCommands 128
rel_tgt_id 3
LUN 0 vmfs
LUN 1 vmfs2
enabled 1
}
}

TARGET_DRIVER ib_srpt {

TARGET fe80:0000:0000:0000:f452:1403:007c:15d1 {
enabled 1
rel_tgt_id 1
LUN 0 vmfs
LUN 1 vmfs2
}

TARGET fe80:0000:0000:0000:f452:1403:007c:15d2 {
enabled 1
rel_tgt_id 2
LUN 0 vmfs
LUN 1 vmfs2
}
}

In my case, "vmfs" is a volume on the RAID-10 HDD pool and "vmfs2" is a volume on the RAID1 SSD pool.
"nv_cache 1" results in effective sync=disabled, unless you explicitly override it on ZFS level.
"allowed_portal" allows the iSCSI target to be exposed only on 1 IP address (the Ethernet one).
"QueueCommands 128" is an important tweak for iSCSI, the default is too low (32).
Make sure you have the right addresses for your SRP targets (RTM on how to find them). You can enable only one of them, but then your dmesg will be flooded with unsuccessful login attempt messages for the second port. Only one will be affectively used anyway.
Finally, "threads_num" was the parameter that allowed to break the single core limitation (for connections coming from a single initiator) and get the impressive results I posted several days ago.

Good luck!

humbleThC · Jan 22, 2017

MASSIVE thanks to mpogr for helping me with the following setup, it wasnt easy, and I was failing for days without him, probably would have been weeks, or GTFU. But now I have something to work/test/compare with!!!

RedHat 7.3 Server [Linux 3.10.0-514.el7.x86_64]
--- Latest as of today, fully updated/patched.

Mellanox OFED Drivers [MLNX_OFED_LINUX-3.4-2.0.0.0-rhel7.3-x86_64]
-- Drivers custom make/installed with --add-kernel-support

SCST 3.2.x
--- custom make/installed leveraging the MLNX_OFED SRP drivers, that were kernel integrated.
ISCSI w/ SRP to ESX 6.0 u2 [Using v1.8.2.5 OFED drivers for ESX 5.x]

ZoL latest
---

via ConnectX-2 40Gb QDQ HCAs attached to Mellanox 4036.

Initial Test Benchmarks.

Same Hitachi RaidZ1 (4+1)*2 setup, with Intel S3710s partitioned for both ZIL and L2ARC, but this time with properly aligned partitions to 4K sectors, so ruled out any potential impact there.
I had a 'user issue' trying to partition/align in Solarish , and the default GUI used MBR (mis-aligned partitions) but turns out , that wasnt a thing, or it is a thing, and my bottleneck is somewhere else.

So far, nearly identical to my OmniOS+NappIt results. (using the same disks, in the same exact way ZFS wise, with the same ZFS settings)
But a massive difference in kernel/OS between OmniOS and Redhat server for sure.

Definately notice the ZFS & iSCSI SRP traffic being balanced across all 16x CPUs.

So my assumption of 1.7GBs being an upper cap due to only 1/3rd of my memory bandwidth being uses via SunOS 5.x kernel didnt pan out.

whitey · Jan 23, 2017

Either one of you gonna publish a definitive 'using SRP storage in vSphere' guide? Not talking a 'hand holding' soup-to-nuts work instruction but maybe a primer w/ golden nuggets of knowledge/lessons learned/config files/solid links to articles that cover each component of config would sure be nice. I have a config almost there (vSphere w/ 1.8.2.5 SRP vib's, CentOS7 ZoL latest from openzfs, OFED latest, missing SCST) but need to bone up on SCST apparently. Oh and I got sidetracked w/ another project.

I see a scst.conf file attached so there's part of the puzzle, any other critical config files? Remind me @humbleThC what IB switch are you using? Looks like CX2 cards for sure.

humbleThC · Jan 23, 2017

Some 'things' that confuse me in my overall math/benchmarking over the course of this project:

The ZFS Server RAM appears to have more 'speed/power'
# dd if=/dev/zero of=/home/humble/tempfs/testfile bs=1M count=4096

4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 1.56663 s, 2.7 GB/s

iperf3 (over IPoIB) can be as high as 18Gbs = > 2.0 GB/s
and i'm using SRP which should be amazingly more efficient.

The Pool/FileSystem appears to have more 'speed/power'
#dd if=/dev/zero of=/Hitachi/data/vdisk0 bs=1M count=1048576

1048576+0 records in
1048576+0 records out
1099511627776 bytes (1.1 TB) copied, 460.256 s, 2.4 GB/s

But no matter what I do, I can only ever see 1.7GB/s in any protocol, in any way, from any OS, with any driver

i.e. Windows 2012 R2 / Windows 2016
OmniOs + Napp-it w/ native ZFS (latest)
RedHat + SCST + ZoL (latest)

Testing from ESX 6.0 u2 Hosts via NFS/iSCSI/SRP
Testing from Windows 10 Host via SMB3/iSCSI

Mix any of the above with any of the above, and it's 1.7GBs any way you slice it.
(which is part of the reason i keep testing

humbleThC · Jan 23, 2017

whitey said:
Either one of you gonna publish a definitive 'using SRP storage in vSphere' guide? Not talking a 'hand holding' soup-to-nuts work instruction but maybe a primer w/ golden nuggets of knowledge/lessons learned/config files/solid links to articles that cover each component of config would sure be nice. I have a config almost there (vSphere w/ 1.8.2.5 SRP vib's, CentOS7 ZoL latest from openzfs, OFED latest, missing SCST) but need to bone up on SCST apparently. Oh and I got sidetracked w/ another project.

I see a scst.conf file attached so there's part of the puzzle, any other critical config files? Remind me @humbleThC what IB switch are you using? Looks like CX2 cards for sure.

I very much plan to do the homework required for this.
But I'll gladly build it, and hand it over to MPOGR for final editing, and publishing.
Without his instruction I wouldnt understand it well enough to document it in the first place.
And he's quite active on Mellanox user forums, where i've read his years of commitment to this science.

ConnectX-2 HBAs (MT26428) with v2.10.720 FW

(But I can say this will work for CX2/CX3/CX4, all use the same exact driver installation package, and the 'trick' is installing it with kernel support, or the ib_srpt module wont load due to the ib_srpt.ko being the wrong version)
And this will work the same for CentOS/RedHat as well as Ubuntu/Others that are supported by the same linux package (theoretically).

Mellanox 4036 Switch (with subnet mgr running)

(But i can also say this can easily be substituted for the software based openSM)
In fact, in small environments (2-3 hosts) one might argu it might be worth it
even if you have a switch, not to use it, due to the 4036 switch maxing out on 4092/4096 Byte MTU.
With direct attach hosts (or perhaps new switches) you could enjoy 65520 Byte MTU.

I was stuck exactly exactly where you are now. And a huge piece of the conversation was 68 posts back and forth directly between MPOGR and myself, as to not muck up this thread. But there is amazing troubleshooting notes, and ultimately i documented it from a soup-to-nuts fashion, and plan to rebuild this a few more times from scratch just to prove I can (before I'm comfortable enough to consider it "production" and start deploying my actual ESX homeLAB on it again).

So like I said i'm happy to do the homework, as long as MPOGR edits it, and he deserves the publish credit^^

gea · Jan 23, 2017

humbleThC said:
But no matter what I do, I can only ever see 1.7GB/s in any protocol, in any way, from any OS, with any driver

i.e. Windows 2012 R2 / Windows 2016
OmniOs + Napp-it w/ native ZFS (latest)
RedHat + SCST + ZoL (latest)

Testing from ESX 6.0 u2 Hosts via NFS/iSCSI/SRP
Testing from Windows 10 Host via SMB3/iSCSI

Mix any of the above with any of the above, and it's 1.7GBs any way you slice it.
(which is part of the reason i keep testing

I do not use IB in my setups but regarding CIFS or iSCSI Oracle Solaris 11.3 was mostly 5-30% faster than OmniOS. While I support Solaris in most functions this would mean a move from Open-ZFS to Oracle's closed source ZFS v37.

humbleThC · Jan 23, 2017

gea said:
I do not use IB in my setups but regarding CIFS or iSCSI Oracle Solaris 11.3 was mostly 5-30% faster than OmniOS. While I support Solaris in most functions this would mean a move from Open-ZFS to Oracle's closed source ZFS v37.

The company I work for, will basically let me leverage our partner relationships for homeLab licensing. So Oracle/Redhat/Microsoft/other Licenses are available, (if I need one).

Solaris 11.3 does have the drivers I need, "in theory" i'll have to do a whole new research path on the state of SRP in inband Solaris, -vs- Mellanox provided , -vs- the capabilities/dependency of the iSCSI daemon, -vs- COMSTAR. And see if there's a solution in there that's better than anything i've seen/tested.

The RedHat/CentOS path i'm down now, is viable (once you have a solid runbook, and are willing to custom make your entire driver/scst/zfs stack, every time you want to update your OS kernel.) Granted its insanely hard to figure this out the 1st time, but once you troubleshoot every layer of every thing once, it all makes sense, and tearing it down and rebuilding seems easy enough. Thus doing it as a maintenance activity around key updates is an acceptable thing to some people, who want to stay down this path.

However, there's a unbelievably huge advantage on the OmniOS+Napp-It from a, you are up in 15minutes, and it works out of the box, and oh, there's a GUI and monitoring and all the extra NAS/SAN stuff , its very impressive, and i'm amazed by what you've done for that entire storage path. And I might be back down this path shortly

Need to test the hell out of this setup, (and document it, and publish it for others, and myself, incase i come back, or do it again someday)... And see where it stands.

T_Minus · Jan 23, 2017

RE: 4096 Byte MTU max of the 4036.
Have you compared IOPs and latency when doing transfers to & from high performing SSD or NVME pools and the affects the MTU limitation may have vs direct connect with the higher value?

ZFS Advice for new Setup

Active Member

Active Member

Member

Member

Active Member

Active Member

Member

Active Member

Member

Active Member

Member

Member

Active Member

Member

Moderator

Member

Member

Well-Known Member

Member

Build. Break. Fix. Repeat