ZFS bottlenecks


Stril

Member
Sep 26, 2017
Hi!

I really like ZFS, but I have not been able to get REALLY good performance from it as a ZFS iSCSI target for small IO.

Today, my test to look at its limits was:

- A pool of a single Optane 900p
- A pool of two Optane 900p striped
- A pool of one Optane as a simple volume + one Optane as SLOG

The result was always:
Not more than 60,000 IOPS at 8K, 50% write, with diskspd. Reads were about 110,000 IOPS.
...tested with Open-E JovianDSS and vSphere as the client.

The same test with Starwind was MUCH faster.

What I could see was a huge CPU load - up to 100%!

Do you think higher IOPS are possible with faster CPUs, or are there just some limits in ZFS?

Thank you for your help!


Stril.

...I know, Open-E runs ZoL, which is not as fast as other implementations, but I need commercial support...
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
Might want to post up your CPU specs, RAM Specs, etc :)
 

Monoman

Active Member
Oct 16, 2013
Did your Starwind and ZoL tests run on the same hardware? That will be important.

FreeNAS has commercial support. Consider testing their ZFS as well. I'm close to performing these tests myself.
 

Stril

Member
Sep 26, 2017
Hi!

The CPU is one Intel Scalable Silver 4112, with 48 GB RAM, and the Starwind test ran on the same system.

The question for me is: how much performance is possible with:
- More cores (AMD Epyc?)
- Or fewer but faster cores

I have not found a commercial reseller for iXsystems in Germany yet...
 

m4r1k

Member
Nov 4, 2016
If you need commercial support, the best and fastest ZFS implementation is Oracle's. ZFS is first and foremost about data consistency; in some cases, that is more important than pure speed.

You could try Solaris 11.4 with Napp-it and see what performance you get.
You could also try pure Napp-it on OmniOSce to see whether it is an Open-E issue or something else.
 

gea

Well-Known Member
Dec 31, 2010
DE
Have you enabled sync write on ZFS?
With iSCSI this can be set either via the sync property of the underlying zvol or via writeback cache=disabled on the target.

The default (sync=standard) means the writing application decides, but you can force it on or off.
The Slog protects the content of the RAM-based write cache. On an Optane pool, an extra Optane as Slog makes no sense, as it is not really faster than logging to the on-pool ZIL.

If Starwind is not just faster but much faster, then I suppose ZFS does sync write while Starwind does not. Set sync=disabled / writeback cache=enabled to check this, or force sync on Starwind.
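
For example, on the storage box (a sketch assuming a pool tank and a zvol tank/lun0 behind the LUN; adjust the names to your setup):

Code:
# show the current sync setting of the zvol
zfs get sync tank/lun0

# force sync write off for a quick comparison (not safe for production data)
zfs set sync=disabled tank/lun0

# force sync write on
zfs set sync=always tank/lun0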

Beside that, ntfs is faster than ZFS, especially with less RAM. The extra checksums on data and metadata and copy-on-write cost performance but give superior data security. If you switch to ReFS on Windows you will see a similar degradation (at a lower level; ZFS is then much faster).

How much RAM?
ZFS uses 10% of RAM, up to 4 GB, as write cache. This is essential for performance.

All Open-ZFS platforms should perform similarly, with currently a slight advantage for Illumos-based systems, as Open-ZFS memory management is still Solaris-like even on BSD or ZoL. Oracle Solaris with the genuine ZFS was always faster in all my tests.
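
On ZoL you can check (and for a test raise) the size of this RAM write cache via a module parameter; a sketch, assuming a Linux-based box like Open-E (the path is ZoL-specific):

Code:
# current dirty data (write cache) limit in bytes
cat /sys/module/zfs/parameters/zfs_dirty_data_max

# raise it to 4 GB for a test (reverts on reboot)
echo 4294967296 > /sys/module/zfs/parameters/zfs_dirty_data_max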
 

Stril

Member
Sep 26, 2017
Hi!

I also did some tests with "sync=disabled". The performance was not much better - I think I am already hitting one of my other bottlenecks before the NVMe.

Starwind seems to keep things simple with "raw devices" (no snapshots, no async replication): better hardware, more performance.

I am doing two kinds of tests:
1. Windows-client -> iSCSI MPIO -> Storage with diskspd
2. WindowsVM on ESXi -> iSCSI MPIO -> Storage with diskspd


In test 1, I was able to get:
Starwind: 350,000 IOPS
OpenE-ZFS: 80,000 IOPS

In test 2, I was able to get:
Starwind: 100,000 IOPS
OpenE-ZFS: 52,000 IOPS

BUT:
I see nearly the same performance with ZFS even if I use more and better hardware, while Starwind scales.


So my thought was: is there a CPU bottleneck for checksumming etc. with ZFS? Do you see that huge CPU load, too?
 

gea

Well-Known Member
Dec 31, 2010
DE
Translate 100% CPU as "as fast as I can".
A faster CPU may help, but probably only a little, as ZFS is more RAM-limited than CPU-limited.

The reason is that ZFS does not write small blocks. The smallest data block is the ZFS blocksize, default 128k. Writes are always collected in the RAM-based write cache and flushed as one large sequential write (up to several gigabytes). If the disk is physically 4k, such a ZFS block is divided accordingly.

If your real load is small data blocks, e.g. iSCSI with 8k, you should reduce the ZFS blocksize from 128k to 64k or 32k, not less. This can improve performance, especially if sync is enabled to avoid write cache data loss in case of a crash. In general, sync=disabled vs sync=always must show a performance difference: in one case you only write via the RAM-based write cache (large sequential writes), in the other you must additionally log every small committed ZFS block. If you see no difference, check not only the sync setting of the zvol behind the LUN but also the corresponding writeback cache setting of the target (sync=disabled and writeback cache=enabled means no sync write).
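
A minimal sketch of such a zvol, again assuming a pool tank (the blocksize of a zvol is fixed at creation, so a new zvol is needed to change it):

Code:
# create a sparse 100G zvol with 32k blocksize for the iSCSI LUN
zfs create -s -V 100G -o volblocksize=32K tank/lun32k

# verify blocksize and sync setting
zfs get volblocksize,sync tank/lun32k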

Open-E is ZoL (ZFS on Linux).
A Solaris with the genuine ZFS, or an Illumos or FreeBSD appliance with Open-ZFS, may be faster, but do not expect the same performance on ZFS as on ntfs or ext4. Security comes at a price. While an Optane may be capable of >200k IOPS (8k), I would consider 50k a good value for 8k writes with no RAM write cache involved. A stripe set of several vdevs scales IOPS with the number of vdevs, so that is an option to increase the values on ZFS.
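
If you want to verify whether checksumming is what eats your CPU, you can switch it off briefly as a pure diagnostic; a sketch on the hypothetical tank/lun32k zvol (never leave this off, you lose the ZFS self-healing):

Code:
# diagnostic only: rerun the benchmark with checksums and compression off
zfs set checksum=off tank/lun32k
zfs set compression=off tank/lun32k

# restore the defaults afterwards
zfs inherit checksum tank/lun32k
zfs inherit compression tank/lun32k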

See also COMSTAR, the enterprise-class iSCSI framework of Solaris and Illumos/OmniOS:
Configuring Storage Devices With COMSTAR - Oracle Solaris Administration: Devices and File Systems

If you need commercial support in Germany, you can also look at Oracle (Solaris) or NexentaStor (Illumos), e.g. via zstor.de. OmniOS, while OpenSource, comes with a commercial support option from the devs (located in the UK and Switzerland), see Commercial Support
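
A rough sketch of the COMSTAR steps on OmniOS, assuming the tank pool from above (names are placeholders; napp-it does the same via its web GUI):

Code:
# enable the COMSTAR framework and the iSCSI target service
svcadm enable -r svc:/system/stmf:default
svcadm enable -r svc:/network/iscsi/target:default

# create a zvol and register it as a logical unit
zfs create -V 100G tank/lun0
stmfadm create-lu /dev/zvol/rdsk/tank/lun0

# make the LU visible to all initiators (use the GUID printed by create-lu)
stmfadm add-view <LU-GUID>

# create the iSCSI target
itadm create-target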
 

Stril

Member
Sep 26, 2017
Hi!

I think RAM is not the current bottleneck. The biggest problem is write performance, and with sync=disabled and 48 GB of memory, I should see the maximum "effect" - right?

The goal is to have a "cluster" with sync=always and the best performance possible. sync=disabled is only a debugging test for me.

Did you ever see the CPU be the bottleneck?

I will test writebackcache=enabled - but Open-E does not have that option at the pool level.

@open-E
I just do not have enough knowledge to be brave enough to use a "self-made" storage cluster, because I cannot handle an outage.
I was using Nexenta in the past but was VERY unhappy with their support. Today I will test FreeNAS to see if it is faster.

Thank you VERY much for all your input!
 

Rand__

Well-Known Member
Mar 6, 2014
If you can describe your test setup (clients, commands used, etc.) more closely, I might be able to duplicate it when I find time over the holidays. I've got a test box (ESXi-based, but that should not matter too much) if you're interested.
 

Stril

Member
Sep 26, 2017
Hi!

That would be great. The current setup is simple:

Two identical servers, Intel 4112, 48 GB RAM, interconnected via 2x 10 GbE with jumbo frames.
The test command is:

Code:
diskspd -c30G -w50 -b8K -F8 -r -o32 -W0 -d20 -Sh e:\testfile.dat
...on Windows 2016 with all patches

@gea
I just did a test with blocksize=32k:
Seems to be faster. I will provide results in a few hours...
 

Rand__

Well-Known Member
Mar 6, 2014
E: is an iSCSI volume hosted by the ZFS box - what size? Any optimizations (network [except jumbo frames], energy-saving mode, etc.)?
What's the iperf speed?
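
If you haven't measured it yet, a quick sketch (the address is a placeholder; start the server side on the storage box first):

Code:
# on the storage box
iperf -s

# on the client: 4 parallel streams for 30 seconds
iperf -c 192.168.1.10 -P 4 -t 30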
 

Stril

Member
Sep 26, 2017
Hi!

Yes, iSCSI on ZFS. I tested with 100 GB volumes. The only optimizations are disabled power management and jumbo frames, nothing else...
 

NISMO1968

[ ... ]
Oct 19, 2013
San Antonio, TX
Open-E is a pretty shitty product in terms of performance, software quality/maturity, and especially support, which is pretty much AWOL. If you plan to stick with ZFS, I'd recommend either plain vanilla FreeBSD or Linux + ZoL done right (don't be afraid of ZoL; the next version of FreeBSD is going to have ZoL ported over, rather than keep using the existing Illumos source code).

 

i386

Well-Known Member
Mar 18, 2016
Germany
Stril said:
Code:
diskspd -c30G -w50 -b8K -F8 -r -o32 -W0 -d20 -Sh e:\testfile.dat

Can you test it again, but without the -r argument (i.e., sequential instead of random access)?

Code:
diskspd -c30G -w50 -b8K -F8 -o32 -W0 -d20 -Sh e:\testfile.dat
 

Stril

Member
Sep 26, 2017
Hi!

@i386:
Without "-r" performance is VERY good (about 95.000 IOPS in 50% mix), but with "-r" performance goes down to 50.000 in my current setup. Shouldn't this be equal with ZFS?

@NISMO1968
Mellanox cards are great, but my VMware hosts are on Intel cards.
I do not like Open-E, but they provide a product with support. I am just afraid of building a cluster by myself without support.

@gea
I will give FreeNAS a try next week...