ZFS Advice for new Setup


Rand__

Well-Known Member
Mar 6, 2014
6,626
1,767
113
I believe FreeNAS 9 has SMBv3 if you want to give that a whirl.
It has other issues, of course, but it's not all bad ;)
Not sure whether version 4 will be in the upcoming FN10 or if it sticks with 3.
 

gea

Well-Known Member
Dec 31, 2010
3,141
1,184
113
DE
Yes, you can also set higher NFS server, TCP, and vmxnet3s buffers manually.

btw
You may try SAMBA with Solaris, but aside from SMB3 I would always prefer the Solaris SMB server (multithreaded, better permission and snapshot integration, Windows SMB groups, etc.).
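Roughly, on OmniOS/Solarish that tuning looks like the following (values are only examples, adjust to your RAM and workload; the vmxnet3s ring/buffer sizes live in the driver's config file and the exact parameter names depend on the driver build):

  # more NFS server threads
  sharectl set -p servers=64 nfs
  # larger TCP socket buffers
  ipadm set-prop -p max_buf=4194304 tcp
  ipadm set-prop -p send_buf=1048576 tcp
  ipadm set-prop -p recv_buf=1048576 tcp
  # vmxnet3s buffers: edit /kernel/drv/vmxnet3s.conf and reboot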
 

humbleThC

Member
Nov 7, 2016
99
9
8
48
Long day o' benchmarking :)
I'm compiling a massive Word Doc with all my setups/results. Should have it ready in a day or two.

So to start, I ran every benchmark in Napp-It against all (3) of the unique pools (rough zpool equivalents sketched below):
10x Hitachi in RAIDZ1 (4+1) * 2 = RAID 50 equivalent
4x Samsung in RAID1 (1+1) * 2 = RAID 10 equivalent
4x Intel in RAID1 (1+1) * 2 = RAID 10 equivalent
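The three layouts are roughly equivalent to the following (pool and device names are made up; the real pools were built through the napp-it GUI):

  # 10x Hitachi: two 5-disk RAIDZ1 vdevs striped (RAID50-ish)
  zpool create hitachi raidz1 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 \
                       raidz1 c1t5d0 c1t6d0 c1t7d0 c1t8d0 c1t9d0
  # 4x Samsung and 4x Intel: two mirrors striped (RAID10-ish)
  zpool create samsung mirror c2t0d0 c2t1d0 mirror c2t2d0 c2t3d0
  zpool create intel   mirror c3t0d0 c3t1d0 mirror c3t2d0 c3t3d0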

Then, without any TCP/IB/CIFS/NFS tuning, I'm running ATTO Disk Benchmark against each of the ZFS filesystems mounted at the root of each pool via SMB2.1.

I'm running all benchmarks against all pools, with sync on the pool set to each of (standard, disabled, always).
Then I plan to tune TCP/IB/CIFS/NFS and re-run the ATTO benchmark against all (3) pools as-is again, in all sync modes.
Then I plan to add the Intels back into the Hitachi HDD pool in various fashions (L2ARC/ZIL) and run them again.
I should be able to tell soon™ exactly which workloads they help with, at what IO size, and the bandwidth/IOPS for each.
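(For reference, the sync setting being toggled is just a per-pool/per-filesystem ZFS property; the pool name below is an example:)

  zfs set sync=disabled hitachi   # never wait on the ZIL: fastest, least safe for NFS/iSCSI
  zfs set sync=standard hitachi   # honour client sync requests (default)
  zfs set sync=always   hitachi   # force every write through the ZIL
  zfs get sync hitachi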

NFS/iSCSI benchmarks maybe next week :)
 

humbleThC

Member
Nov 7, 2016
99
9
8
48
Yes, you can also set higher NFS server, TCP, and vmxnet3s buffers manually.

btw
You may try SAMBA with Solaris, but aside from SMB3 I would always prefer the Solaris SMB server (multithreaded, better permission and snapshot integration, Windows SMB groups, etc.).
Are you saying there's a better SMB package/service I can use over the built-in SMB2.1 share of a ZFS filesystem via the napp-it GUI?
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
7,625
2,043
113
Keep in mind the benchmarks you're doing are cool for seeing the 'best case' scenario, but in ZFS, multi-user, VM, etc. environments they're not close to real world. SSD or HDD, you're always going to have some reads or writes going on if it's multi-user or used as a VM datastore. An SSD may not drop performance for 20, 30, 40 hours depending on the load, and what happens next is where the real quality enterprise SSDs shine.

Enterprise or consumer, SSDs drop performance; there's no question about that.

Keep in mind you're seeing 'best case' performance you may NEVER see again in your real usage.
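If you want to provoke that drop-off yourself, a long sustained random-write run will push a consumer SSD well past its fresh, cached state. A minimal fio sketch (hypothetical target path; run it for hours, not minutes):

  fio --name=steady-state --filename=/tank/fio.test --size=100g \
      --rw=randwrite --bs=4k --numjobs=8 --ioengine=psync \
      --time_based --runtime=7200 --group_reporting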
 

mpogr

Active Member
Jul 14, 2016
115
95
28
53
Inspired by this thread, I decided to run some benchmarks on my system.

The client:
  • 4 vCPUs
  • 16GB RAM
  • OS: Windows 10 Enterprise
The client ESXi host:
  • Intel E3-1270 v2 CPU (run in Sandy Bridge compatibility mode)
  • 32GB 1600 MHz ECC RAM
  • Mellanox Connect-X3 IB FDR
  • Mellanox drivers 1.8.2.5
The main datastore server (bare metal):
  • Intel E3-1275 v2 CPU
  • 32GB 1600 MHz ECC RAM
  • LSI 2008 8-port internal controller
  • Mellanox Connect-X3 IB FDR
  • OS: CentOS 7.3
  • Mellanox OFED 3.4.2 (latest)
  • 3 x 1Gbps Ethernet NICs (teamed)
  • SCST 3.2.x (latest) compiled with SRP (for IB) and iSCSI (as a backup for Ethernet) targets
Datastore 1 on the box above: ZFS pool consisting of:
  • 3 x 2 WD RE 4TB HDD (RAID10) + 1 hot spare
  • Intel S3700 100GB SSD (overprovisioned to 16GB) as log
  • Intel 335 240GB SSD as cache
  • All the drives except the S3700 are connected to the LSI 2008. The S3700 is connected to the onboard SATA3 port.
  • ZFS iSCSI volume of 6TB is exposed
Datastore 2 on the box above: ZFS pool consisting of:
  • 2 x Intel S3610 480GB SSD (RAID1)
  • The drives are now connected to the SATA2 onboard ports (pending arrival of the new controller)
  • ZFS iSCSI volume is exposed off the max available space (~410GB)
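In ZFS-on-Linux terms, the two pools amount to roughly the following (device names are placeholders; the SCST export of each zvol is a separate step via scst.conf/scstadmin, omitted here):

  # Datastore 1: three mirrored pairs + hot spare, overprovisioned S3700 as log, Intel 335 as cache
  zpool create tank mirror WD1 WD2 mirror WD3 WD4 mirror WD5 WD6 \
        spare WD7 log S3700 cache INTEL335
  zfs create -s -V 6T tank/datastore1      # sparse 6TB zvol exposed over SRP/iSCSI

  # Datastore 2: single S3610 mirror
  zpool create flash mirror S3610A S3610B
  zfs create -s -V 410G flash/datastore2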
Virtual HDD (48GB) backed by the datastore 1:

upload_2017-1-15_1-8-32.png
Virtual HDD (48GB) backed by the datastore 2:

upload_2017-1-15_1-15-4.png
As you can see, the read results are virtually identical, meaning they come exclusively from the ARC (RAM) and are limited by the server CPU (one of the cores is maxed out). The lower write results of the second datastore (SSD) are likely explained by the fact that its drives are connected to the SATA2 ports. I'll rerun the tests when I get the new controller.
 
Last edited:
  • Like
Reactions: humbleThC

gea

Well-Known Member
Dec 31, 2010
3,141
1,184
113
DE
Are you saying there's a better SMB package/service I can use over the built-in SMB2.1 share of a ZFS filesystem via the napp-it GUI?
On Solarish you can use the same SAMBA SMB server that is available on BSD or Linux as an alternative to the Solarish SMB server. In napp-it I only support the Solarish CIFS server; SAMBA must be managed via the CLI. Each has its own advantages: SAMBA offers SMB3 and shares that are independent of ZFS filesystems, while the multithreaded Solarish SMB server offers much better, more Windows-like ACL and snapshot behaviour and perfect integration with ZFS.
 
  • Like
Reactions: humbleThC

humbleThC

Member
Nov 7, 2016
99
9
8
48
Nice post, mpogr! Definitely welcome more benchmarks/setups/protocols to reference and contrast against.

I'm a bit jelly of your numbers; it makes me want to skip the rest of my CIFS testing and go straight to iSCSI/SRP to my ESX. I still have my ESX boxes powered off until I'm finished tweaking the NAS.
 

humbleThC

Member
Nov 7, 2016
99
9
8
48
On Solarish you can use the same SAMBA SMB server that is available on BSD or Linux as an alternative to the Solarish SMB server. In napp-it I only support the Solarish CIFS server; SAMBA must be managed via the CLI. Each has its own advantages: SAMBA offers SMB3 and shares that are independent of ZFS filesystems, while the multithreaded Solarish SMB server offers much better, more Windows-like ACL and snapshot behaviour and perfect integration with ZFS.
Thank you! Now I know what to tell my wife I'll be working on today!

So I'm clear: the built-in SMB server has better ACL integration, better multithreading, and better snapshot integration.

The external SAMBA-based package supports SMB3, but you have to manage it manually via the CLI, and likely the background snapshots aren't file-open aware. So you get the block copy of the snap, with whatever files are open at the time being inconsistent?

Just took a break from CIFS benchmarking to get iSCSI set up and working with iSER.
Partially via GUI, partially via CLI. About to spin up a new Win2016 server and do some iSCSI benchmarks of an ESX datastore from the ZFS volume export.
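For anyone following along, the CLI side of a COMSTAR iSCSI target on Solarish is roughly (zvol name/size are examples; the LU GUID comes from the create-lu output):

  svcadm enable -r svc:/network/iscsi/target:default
  zfs create -V 2T tank/esx-lun0
  stmfadm create-lu /dev/zvol/rdsk/tank/esx-lun0
  stmfadm add-view 600144f0...          # GUID printed by create-lu
  itadm create-target                   # repeat per portal/IB partition if desired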

Interesting: in ESX my Mellanox shows up as a pair of SCSI adapters (vmhba32, vmhba33).
I've always ignored these, as I didn't know how they could even be used.
Anyway, when setting up the iSCSI initiators on the ESX hosts (v1.8.2.5 drivers) and discovering the iSCSI target/LUN for the first time, it said 'features = iscsi, parallel scsi'.

What's interesting is, when I discover the iSCSI LUN, I see the expected 2 targets, 1 device, 2 paths. But when I manage paths on the device, I actually see (6) total paths:
(2) paths via vmhba39 (my software iSCSI initiator, which is bound to two separate IB NICs)
(4) paths via vmhba32 and vmhba33 (I'm guessing this is where the 'parallel scsi' feature comes in, and it's using additional software devices to handle additional IO streams?)

Still have a bit before my test VM is ready for benchmarking.
 
Last edited:

gea

Well-Known Member
Dec 31, 2010
3,141
1,184
113
DE
The external SAMBA-based package supports SMB3, but you have to manage it manually via the CLI, and likely the background snapshots aren't file-open aware. So you get the block copy of the snap, with whatever files are open at the time being inconsistent?
No, a ZFS snap is always the state you would get after a sudden power-off.

SAMBA is simply not ZFS specific. It should run on any filesystem and on just about any box that can add one and one in binary. Since a share in SAMBA is a simple folder, while a snap in ZFS is strictly tied to a filesystem, it is quite complicated to surface ZFS snaps as Windows 'previous versions' under SAMBA, while this just works on Solarish because shares are strictly tied to ZFS filesystems as a ZFS property.
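That strict mapping is literally a filesystem property on Solarish, e.g. (hypothetical names):

  zfs create tank/data
  zfs set sharesmb=name=data tank/data    # the SMB share follows the filesystem
  zfs snapshot tank/data@daily-1          # snaps of that filesystem show up as 'previous versions'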
 

humbleThC

Member
Nov 7, 2016
99
9
8
48
I'm debating how to display my results .... because I have a lot of them :)
Some zoom required :)

Default Pool Configuration
upload_2017-1-14_16-26-6.png
Hitachi HDD Pool = RAIDZ1 (4+1) * 2 = 'RAID5/0'
Intel SSD (Enterprise) Pool = RAID1 (1+1) * 2 = 'RAID1/0'
Samsung SSD (Consumer) Pool = RAID1 (1+1) * 2 = 'RAID1/0'

Initial Pool Benchmarks using Napp-It
Bonnie++
upload_2017-1-14_16-29-7.png
Iozone
upload_2017-1-14_16-29-11.png
FileBench – (fileserver.f) – 60sec
upload_2017-1-14_16-29-21.png

Sync = Standard with SMB2.1 – TCP buffers/xmit/recv_hiwat tuned / IB MTU 4092
Hitachi Pool----------------Samsung Pool----------------Intel Pool
upload_2017-1-14_16-30-50.pngupload_2017-1-14_16-30-55.pngupload_2017-1-14_16-31-13.png

These were just mounting a standard Solarish SMB2.1 share off a ZFS file system on each pool.
All tests were done from my Win10 Desktop CX2 (2.10.720)
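(The xmit/recv_hiwat tuning mentioned above was along these lines; values are only examples, and ndd settings do not persist across a reboot:)

  ndd -set /dev/tcp tcp_max_buf    4194304
  ndd -set /dev/tcp tcp_xmit_hiwat 1048576
  ndd -set /dev/tcp tcp_recv_hiwat 1048576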
 
Last edited:

humbleThC

Member
Nov 7, 2016
99
9
8
48
And now is where things start to get interesting... iSCSI/iSER/SRP with MPIO on 2 initiators and 2 targets, on separate IB partitions/VLANs, for proper multi-pathing.

All tests were done from a fresh Windows 2016 DC Server installation, against the C:\ of the VM.
I moved the VM around different datastores to test the pools against each other.

Hitachi Pool (RAIDz1*2 no SSD) – ZFS Volume

Sync Disabled----------------------------Sync Always
upload_2017-1-14_16-35-22.pngupload_2017-1-14_16-35-25.png
Here we see that HDDs are amazing with sync disabled: writes around ~1.1-1.3GB/s, reads around ~1-1.7GB/s.
We also see that HDDs are terrible with sync enabled: writes averaging 30MB/s!! Reads were still amazing at ~1.6GB/s.
The assumption that HDDs should only be used for non-sync workloads makes sense.

Intel SSD Pool (RAID1*2) – ZFS Volume
Sync Disabled----------------------------Sync Always
upload_2017-1-14_16-37-40.pngupload_2017-1-14_16-37-44.png
Here we see that the Intel S3710 enterprise SSDs perform no better than the HDDs with sync disabled (you were right! <insert you here>).
We also see that these SSDs perform pretty amazingly with sync enabled: 500MB/s writes, 700MB/s reads!
The assumption that these SSDs by themselves in a dedicated pool give consistent speed for ESX 'sync mode' NFS/iSCSI is also true.

Samsung SSD Pool (RAID1*2) – ZFS Volume
Sync Disabled----------------------------Sync Always
upload_2017-1-14_16-40-11.pngupload_2017-1-14_16-40-14.png
Here we see that the Samsung 850 Evo consumer SSDs perform 1/2 to 1/3 as well as the Intel enterprise SSDs in a sync workload!
I really didn't believe it until I saw it, but here it is... (you were right! <insert you here>)
Don't believe manufacturer specifications, or even reviews with benchmarks that don't apply to the environment you plan to use them in.
In a sync-disabled workload they under-perform the HDDs as well. So they would have added only a very marginal performance gain, and only on sync writes, if used for ZIL. And without PLP, relying on the onboard disk cache, they are a risky proposition for not much gain.
 
Last edited:
  • Like
Reactions: liv3010m

humbleThC

Member
Nov 7, 2016
99
9
8
48
And my ultimate question... can the Intel S3710s be used in a meaningful way to accelerate the performance of the HDD pool, such that I can have my cake and eat it too? (i.e. use the capacity of the HDD pool for ESX lab space, while only actively using a small portion of it at any specific time, i.e. within my L2ARC space after warm-up)

upload_2017-1-14_16-49-50.png

Carved up the (4) Intel S3710 400GB SSDs into (2) partitions each: 85% (336GB) and 15% (56GB).
With the SSDs split across separate controllers, and the small partitions used as RAID1/0 for the ZIL = 112GB with two drives' worth of write performance.
Then used the 336GB * 4 partitions as non-mirrored L2ARC = ~1.3TB of L2ARC.
This might not be the most optimal sizing/scheme, but it's a start...
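In zpool terms the layout is roughly the following (slice/partition names are made up):

  # small partitions as two mirrored log vdevs (striped ~112GB),
  # large partitions as four plain cache devices (~1.3TB)
  zpool add hitachi \
    log   mirror c4t0d0s1 c5t0d0s1 mirror c4t1d0s1 c5t1d0s1 \
    cache c4t0d0s0 c5t0d0s0 c4t1d0s0 c5t1d0s0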

Hitachi Pool (RAIDz1*2 with SSD) – ZFS Volume
Sync Disabled (QD4)-------- Sync Always (QD4)----------Sync Always (QD10)
upload_2017-1-14_16-52-56.pngupload_2017-1-14_16-52-59.pngupload_2017-1-14_16-53-2.png

Here we see that the pool gets HDD speeds on non-sync workloads, which we expect.
The interesting finding is with sync enabled and QD4:
We're getting 80% of the pure Intel pool speeds on sync writes across all IO sizes.
But we're also getting 100% of the pure Hitachi pool speeds on reads, which is faster than using the Intels in a pool by themselves.
Testing again with QD10 just shows you can squeeze a little more performance out of very small IOs; larger blocks are about the same.

My current thoughts...
I'm kind of loving the Intels accelerating the Hitachi Pool at the moment.
I'm getting the best of both worlds, in that I can:
  • Use the Hitachi HDD pool for large archive/CIFS based workloads (30TB~)
  • And also use the same reservoir of space for iSCSI ESX Sync workloads
  • I won't get full pure SSD speeds always and forever. But 75-80% is very livable, if that's what it ends up being (where it counts)
  • Especially since the Intels by themselves in RAID1/0 are < ~800GB usable for ESX (I need 5TB+ to start, with < 1TB active)
  • Plus I'm getting better than Intel SSD speeds on all reads, so they aren't hurting
iperf benchmarks on 4 threads were consistently around 15.5Gbit/s.
Add iSCSI/iSER/SRP overhead, and seeing 1.7GB/s reads/writes is pretty solid.
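(For reference, that iperf number came from a plain multi-stream run, something like:)

  iperf -s                        # on the NAS
  iperf -c 10.0.0.4 -P 4 -t 30    # from the client: 4 parallel streams, 30 seconds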

I'm also seeing the benefit of the latest and greatest CPU and memory architectures when it comes to raw RAM performance, and how that pertains to ZFS performance. I've also got my eye on the Intel 750 PCIe NVMe 400GB card, even if using it in my PCIe 2.0 x8 configuration will gimp it.

I could theoretically sell my (4) Samsung 850 Evos and my (4) Intel S3710s, and just put that PCIe card behind my Hitachis and end up better off. (you were right!!! <insert you here>)

But unless I'm incredibly mistaken, and just plain doing it wrong... I kind of feel vindicated that there is a meaningful way to use the Intels to accelerate the Hitachis, without impacting the performance of the HDDs, and only helping where it counts.
 
Last edited:
  • Like
Reactions: liv3010m and azev

humbleThC

Member
Nov 7, 2016
99
9
8
48
Now the debate on whether deduplication would help me inside my ESX lab, and how much RAM is required to store the tables, is another story.
Obviously having a dedicated SSD pool of only 1-2TB would make that easier / more efficient.

Whereas trying to enable dedupe against my entire 30TB pool could use enough memory to slow the overall performance substantially. (maybe?)
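Before flipping it on, ZFS can at least simulate the dedup table against existing data, which gives a feel for the RAM cost (the rule of thumb often quoted is around 5GB of RAM per TB of deduped data, but it really depends on block size and how unique the data is):

  zdb -S hitachi     # prints a simulated DDT histogram and the estimated dedup ratio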

Tomorrow I explore NFS :) And maybe SMB3
 

mpogr

Active Member
Jul 14, 2016
115
95
28
53
A couple of notes:
1. The dedicated storage adapters you see in ESXi are SRP. You should make one of these the preferred path to your datastores. Don't use the paths you're getting from the software iSCSI initiator in ESXi; these will utilise IPoIB without RDMA. I'd even suggest making your iSCSI target on the server listen only on the Ethernet IP (and not on the IPoIB one), so the initiators don't get confused. This will make your ESXi use that path only when the primary (SRP) fails.
2. It is very likely the read throughput you're seeing is limited by your server CPU. That's the penalty of using a very old platform. You can confirm it by monitoring the CPU while your tests are running.
3. It is very likely you won't see any difference if you consolidate your drives onto just 2 storage controllers instead of 5. Your server will be much safer this way (less current drawn, better airflow), plus you can sell the spare controllers off while there is still demand.
4. Be very careful with dedupe! I tried it several times and have always been burnt. The biggest problem arises when the deduplication needs to be undone. Also, the deduplication tables depend heavily on the available RAM and REALLY don't like it when your filesystem finds itself in an environment with LESS RAM than the original amount (e.g. if you moved it from a physical box to a virtual one). The result is that file operations get stuck or run very slowly. My advice: if you want your system to be fast and trouble-free, don't mess with dedupe.
 
  • Like
Reactions: humbleThC

humbleThC

Member
Nov 7, 2016
99
9
8
48
A couple of notes:
1. The dedicated storage adapters you see in ESXi are SRP. You should make one of these the preferred path to your datastores. Don't use the paths you're getting from the software iSCSI initiator in ESXi; these will utilise IPoIB without RDMA. I'd even suggest making your iSCSI target on the server listen only on the Ethernet IP (and not on the IPoIB one), so the initiators don't get confused. This will make your ESXi use that path only when the primary (SRP) fails.
2. It is very likely the read throughput you're seeing is limited by your server CPU. That's the penalty of using a very old platform. You can confirm it by monitoring the CPU while your tests are running.
3. It is very likely you won't see any difference if you consolidate your drives onto just 2 storage controllers instead of 5. Your server will be much safer this way (less current drawn, better airflow), plus you can sell the spare controllers off while there is still demand.
4. Be very careful with dedupe! I tried it several times and have always been burnt. The biggest problem arises when the deduplication needs to be undone. Also, the deduplication tables depend heavily on the available RAM and REALLY don't like it when your filesystem finds itself in an environment with LESS RAM than the original amount (e.g. if you moved it from a physical box to a virtual one). The result is that file operations get stuck or run very slowly. My advice: if you want your system to be fast and trouble-free, don't mess with dedupe.
1. Ahh, that makes sense. I've always seen them under Storage Adapters when I load the 1.8.2.4 or 1.8.2.5 drivers, and I'm like: SCSI??? How/who/what would use this...
You make an interesting point that might be related to my confusion about why I see 6 paths.
I expect 2 paths:
10.0.0.4 (SRP Target) --> 10.0.0.10 (SRP Client)
10.0.1.4 (SRP Target) --> 10.0.1.10 (SRP Client)
But I'm seeing (6), where (2) appear to be from the iSCSI software initiator (which I believe you are alluding to being IPoIB without RDMA),
and (4) which come from the virtual SCSI adapters.

With my MPIO policy of Round Robin, I'm effectively load balancing across all 6 right now,
where 4 are preferred SRP and 2 are non-preferred IPoIB.

I will look into setting that to exclude the (2) IPoIB paths and see if that makes any improvement.
I also may not be partitioning/VLANing the way I thought I was,
because I shouldn't see 4 SRP paths, I should see 2.
Looks like my static discovery cross-connected my initiators and targets as well.
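Per device, something like this should take care of it (the device identifier and path names below are placeholders pulled from the path list output; double-check the exact esxcli syntax on your build):

  esxcli storage core path list -d naa.600144f0xxxxxxxx
  esxcli storage core path set --state off --path vmhba39:C0:T0:L0    # park an IPoIB path
  esxcli storage nmp device set -d naa.600144f0xxxxxxxx --psp VMW_PSP_RR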

2. Yeah, I'm starting to see that, especially when I compare to people pushing 3-4GB/s reads/writes using lower-quantity/quality drives than I have in this box atm.
But now that I monitor CPU during benchmarking, I'm not seeing a CPU bottleneck at all.
In fact the CPUs are all quite cold.
I think the main issue with older CPUs is the older memory architecture and speeds associated with them. With ZFS you can actually benchmark the difference between 1300MHz and 1866MHz DDR3, and between DDR3 and DDR4. ECC memory is never as fast and always has higher latency. So a good single-CPU modern desktop can out-ZFS a two-generation-old server any day of the week.

3. I do like being able to maintain controller redundancy. The way I have it now, if any one HBA fails, my system stays online in degraded mode.
4. Totally. In theory it would only make sense on a dedicated pool where you know you're doing VDI or some insanely dedupe-friendly workload, most likely on ultra-fast NVMe/PCIe storage, so you can squeeze tens of TB of 'virtual space' out of 1-2TB of real space. But I don't plan on testing VDI boot storms in my home lab, so I'll skip that hardware/test :)
 
Last edited:

mpogr

Active Member
Jul 14, 2016
115
95
28
53
Two points:
1. SRP is not an IP-based protocol. You can have no IPs assigned to either NIC (client, server or both) and still see the link. Had you been using iSER, that wouldn't be the case, because the initiation phase of iSER is IP-based and only then does it switch to RDMA.
2. Have a look at INDIVIDUAL CORE CPU consumption on the server (see the sketch below). If you have 4 cores and only one of them is maxed out, the overall consumption will be around 25% and the CPU will stay quite cool.
Memory-wise, aside from the sheer frequency, there is not much difference between DDR3-based platforms. I have 32GB (4x8GB UDIMMs) of 1600MHz ECC RAM, which is the maximum my Xeon E3 platform allows.
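On the CentOS storage box, per-core utilisation is easy to watch while a test runs, for example:

  mpstat -P ALL 2    # per-core stats every 2 seconds (sysstat package); look for one core pegged near 100%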
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
7,625
2,043
113
Did you limit the ARC so reads were hitting the disks, to get a real idea of which pool setup and drives perform best when reading from the actual drives and not from ARC (RAM)?
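(Capping the ARC is a one-liner on either platform; the 4GB values below are just examples:)

  # OmniOS/Solarish: add to /etc/system and reboot (value in bytes)
  set zfs:zfs_arc_max = 0x100000000
  # ZFS on Linux: runtime cap (persist via 'options zfs zfs_arc_max=4294967296' in /etc/modprobe.d/zfs.conf)
  echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max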
 

mpogr

Active Member
Jul 14, 2016
115
95
28
53
I didn't disable ARC. Will do and repost the results.

Sent from my SM-G920I using Tapatalk