10G write speed for ZFS filer

Rand__

Well-Known Member
Mar 6, 2014
4,491
877
113
Continuation from https://forums.servethehome.com/index.php?threads/zfs-vcluster-in-a-box.22460/#post-209541

Is it even possible to get 10G sync write speed on non-specialized hardware? I am getting nowhere near that with an Optane ZIL and a bunch of enterprise-class SSDs.
I don't know, hence the question. But now that @gea has found a solution for the HA problem, this will be the next major hurdle, I think (at least in my search for the ultimate high-speed shared VM storage ;))
The Optane (900P/905P) can do ~1.2 GByte/s at 1T/1QD with large (1 MB+) sequential writes.

The question is what workload: how many I/O requests, which I/O sizes, sequential or random writes? :D
But not at sync write speed, can it?

FreeNAS filer with spinners + SLOG:
[screenshot: upload_2018-11-10_10-21-34.png]

FreeNAS filer with 4x2 mirrored SSDs (no SLOG):

[screenshot: upload_2018-11-10_10-22-17.png]

I also ran some tests without sync, but those were inconsistent, so I didn't record them...
 

gea

Well-Known Member
Dec 31, 2010
2,485
837
113
DE
In my tests (https://napp-it.org/doc/downloads/optane_slog_pool_performane.pdf) I was able to get up to 500 MB/s sync write performance (filebench singlestreamwrite) from 4 HGST He8 disks in RAID-0 + an Optane 900P SLOG (unsync > 1000 MB/s, chapter 3.5).

With two Optane 900P in RAID-0, I got around 800 MB/s sync and 1700 MB/s unsync (chapter 2.14). A quad 900P setup was only slightly faster.

So if around 500 MB/s sync write is enough (this is what you get from 10G without tunings), a disk-based pool, e.g. of at least 4 disks in RAID-0, is OK. If you want 10G = up to 1000 MB/s sync write, you need at least a RAID-0 of 2 Optanes. If you want redundancy with a RAID-10 setup, double the numbers.

These are benchmarks with a focus on non-cacheable data on reads. For writes, you have the RAM-based write cache of ZFS that transforms small random writes into large and fast sequential writes. Most real workloads benefit from the caches, so results may be even better than the benchmark.

Due to the massive use of caches (large ARC/L2ARC read cache and up to 4 GB RAM-based write cache) on most workloads, sync write values of SSD pools are essentially not much better than disk pools (with an Optane that keeps small sync writes away). The real effect depends on the workload (how much random load cannot benefit from the caches).
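As a sanity check on the 10G numbers above, the raw line-rate math can be sketched in the shell (the ~10% protocol-overhead factor is my assumption for a rough usable figure, not a measured value):

```shell
# 10 GbE raw line rate in MB/s, and a rough usable figure
# assuming ~10% Ethernet/IP/TCP framing overhead.
raw=$(awk 'BEGIN { printf "%.0f", 10e9 / 8 / 1e6 }')
usable=$(awk 'BEGIN { printf "%.0f", 10e9 / 8 / 1e6 * 0.9 }')
echo "10GbE raw: ${raw} MB/s, usable: ~${usable} MB/s"
# prints: 10GbE raw: 1250 MB/s, usable: ~1125 MB/s
```

So "10G = up to 1000 MB/s" is roughly the practical ceiling once protocol and service overhead are taken into account.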
 

Rand__

So if around 500 MB/s sync write is enough (this is what you get from 10G without tunings), a disk-based pool, e.g. of at least 4 disks in RAID-0, is OK. If you want 10G = up to 1000 MB/s sync write, you need at least a RAID-0 of 2 Optanes. If you want redundancy with a RAID-10 setup, double the numbers.
Those were 2 actual drives and not 2 partitions/virtual disks of the same drive, IIRC?

These are benchmarks with a focus on non-cacheable data on reads. For writes, you have the RAM-based write cache of ZFS that transforms small random writes into large and fast sequential writes. Most real workloads benefit from the caches, so results may be even better than the benchmark.

Due to the massive use of caches (large ARC/L2ARC read cache and up to 4 GB RAM-based write cache) on most workloads, sync write values of SSD pools are essentially not much better than disk pools (with an Optane that keeps small sync writes away). The real effect depends on the workload (how much random load cannot benefit from the caches).
So what workload can profit from the 4 GB memory write cache?
I'd imagine that should easily max out 10G, but I never actually see that...
 

gea

I used two physical Optane 900P in RAID-0.

On Open-ZFS (at least illumos, but I suppose BSD uses the same defaults) the RAM-based write cache uses 10% of RAM, up to 4 GB, so you need at least 40 GB RAM to get the full 4 GB. You can increase this with less RAM, but you want RAM for the ARC read cache as well. This is different from genuine Solaris ZFS, where the write cache is not related to RAM size but to time (the last 5 s of writes), which means you need a larger SLOG with a faster network.
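On illumos that cap can be raised via /etc/system; a minimal sketch (the 8 GiB value is an arbitrary example assuming plenty of RAM, and the parameter name should be verified against your release):

```shell
# Append to /etc/system and reboot; value is in bytes.
# 0x200000000 = 8 GiB (example value only).
echo "set zfs:zfs_dirty_data_max=0x200000000" >> /etc/system
```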

In general, all writes, sync included, benefit from this write cache. Only the sync write logging itself cannot use the write cache, as every single committed write must be written immediately to the logging device (ZIL or SLOG). This is why you need a fast SLOG, especially an Optane, which is the best option short of very special DRAM-based ones (and even the old ZeusRAM is not as good as an Optane).
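Attaching an Optane as a dedicated SLOG is a single zpool command; a sketch with hypothetical pool and device names:

```shell
# 'tank' and c2t1d0 are placeholders -- substitute your pool name
# and the Optane's device name (see `format` on illumos).
zpool add tank log c2t1d0
zpool status tank    # the device should now appear under 'logs'
```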

This affects local pool performance. For a filer you must additionally care about the network (NIC, buffers, MTU, etc.) and the services (e.g. SAMBA, or the multithreaded SMB server on Solaris). To reach 10G sync writes on a client, everything must fit. In my tests the best I could achieve with SMB was around 900 MB/s on OmniOS and quite a lot more with Solaris, each on 40G so the network was not a limit.
 

Rand__

OK, so it's still valid that to speed up NFS-based VM write speed, one needs to speed up sync write logging.

That makes me wonder:
1. Why is a pure Optane pool faster than an Optane-SLOG-backed pool (if sync write logging speed is the only factor, and not the actual pool sync write speed)?
2. How far can we scale this up (if one uses 4/8 Optane drives [potentially not even that expensive when using M.2 drives])?
3. Would an SSD-only pool scale up sync write logging speed with more vdevs?
 

gea

1.
ZFS does all writes as large sequential writes via the write cache, and additionally it logs every commit on the SLOG. Optane not only has near-perfect behaviour on small random writes (ultra-low latency, 500k IOPS), it also has superior sequential performance without the SSD hassles like erase-before-write, TRIM, garbage collection, etc. This is why a pure Optane pool is faster.

2.
A single Optane 900P gave me 670 MB/s sync write, a RAID-0 of two 824 MB/s, and 4 Optanes in RAID-0 only 876 MB/s, so not much more. I suppose it's hard to go beyond 1 GB/s sync write.

3.
More vdevs help sequentially, but not so much when used as a ZIL or with a single SSD as SLOG. An Optane is much better than SSDs, so an Optane as SLOG would be better, even for an SSD pool. A very good enterprise SSD does around 50k IOPS on small random writes; the Optane does 500k.
 

Rand__

OK, so I'm toying with a new build and trying to get max performance out of it. If that works out fine, I might switch my vSAN to dual ZFS boxes...

I will perform the initial tests on an X11DPI-T with 6150s and 192 GB RAM, running ESXi 6.7U1. The BIOS is on the latest SM version, and performance mode is active in the BIOS too.

I currently have a bunch of NVMe drives attached to it (Optanes, P3700s, Intel 750s) to have a proper basis for testing.
I got the OmniOS build and updated it to the latest version (currently omnios-r151028-d3d0427bff) and updated napp-it to the latest free one. (BTW, I tried to use Solaris 11.4 but couldn't get it to run in my VM, so since this was not the primary focus I skipped it for now.)

I dabbled around a bit and got speeds below expectations, so now I want to check what the issue is.

Since my goal is to optimize for QD1/T1, I am looking at single-thread performance here. The final environment will have 40G networking, but for now I am intra-ESXi until a second node makes sense, so I should not be limited by this.

I started by analyzing networking (of course I performed appliance tuning first) since I saw hard limits in Atto.

To identify the culprit I ran two Win2012 VMs and the napp-it VM (where I installed iperf2 in addition to iperf3, to be able to easily run multi-process tests).

Network speed between my two identical WinVMs:
Code:
C:\>iperf -c 192.168.124.28
------------------------------------------------------------
Client connecting to 192.168.124.28, TCP port 5001
TCP window size: 64.0 KByte (default)
------------------------------------------------------------
[216] local 192.168.124.32 port 49170 connected with 192.168.124.28 port 5001
[ ID] Interval       Transfer     Bandwidth
[216]  0.0-10.0 sec  26.3 GBytes  22.6 Gbits/sec

C:\>iperf -c 192.168.124.28 -w 1M
------------------------------------------------------------
Client connecting to 192.168.124.28, TCP port 5001
TCP window size: 1.00 MByte
------------------------------------------------------------
[216] local 192.168.124.32 port 49172 connected with 192.168.124.28 port 5001
[ ID] Interval       Transfer     Bandwidth
[216]  0.0-10.0 sec  25.6 GBytes  22.0 Gbits/sec
So perfectly fine for this test.

Unfortunately, speed to OmniOS is far lower (with MTU 9000):
Napp-It Client to Win Server

Code:
~/iperf-2.0.12# /usr/local/bin/iperf -c 192.168.124.28 -w 64K -
/usr/local/bin/iperf: ignoring extra argument -- -
------------------------------------------------------------
Client connecting to 192.168.124.28, TCP port 5001
TCP window size: 64.0 KByte
------------------------------------------------------------
[  3] local 192.168.124.30 port 55601 connected with 192.168.124.28 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  8.44 GBytes  7.25 Gbits/sec
Win Client to NappIt Server:
Code:
C:\>iperf -c 192.168.124.30
------------------------------------------------------------
Client connecting to 192.168.124.30, TCP port 5001
TCP window size: 64.0 KByte (default)
------------------------------------------------------------
[220] local 192.168.124.32 port 49216 connected with 192.168.124.30 port 5001
[ ID] Interval       Transfer     Bandwidth
[220]  0.0-10.0 sec  9.16 GBytes  7.86 Gbits/sec

C:\>iperf -c 192.168.124.30 -P 4
------------------------------------------------------------
Client connecting to 192.168.124.30, TCP port 5001
TCP window size: 64.0 KByte (default)
------------------------------------------------------------
[292] local 192.168.124.32 port 49220 connected with 192.168.124.30 port 5001
[268] local 192.168.124.32 port 49219 connected with 192.168.124.30 port 5001
[252] local 192.168.124.32 port 49217 connected with 192.168.124.30 port 5001
[264] local 192.168.124.32 port 49218 connected with 192.168.124.30 port 5001
[ ID] Interval       Transfer     Bandwidth
[292]  0.0-10.2 sec  2.70 GBytes  2.27 Gbits/sec
[268]  0.0-10.0 sec  2.39 GBytes  2.05 Gbits/sec
[252]  0.0-10.4 sec  2.78 GBytes  2.30 Gbits/sec
[264]  0.0-10.0 sec  3.37 GBytes  2.89 Gbits/sec
[SUM]  0.0-10.4 sec  11.2 GBytes  9.28 Gbits/sec

C:\>iperf -c 192.168.124.30 -w 1M
------------------------------------------------------------
Client connecting to 192.168.124.30, TCP port 5001
TCP window size: 1.00 MByte
------------------------------------------------------------
[216] local 192.168.124.32 port 49221 connected with 192.168.124.30 port 5001
[ ID] Interval       Transfer     Bandwidth
[216]  0.0-10.0 sec  11.8 GBytes  10.1 Gbits/sec
So with a 1 MB window size I theoretically get 10G speed, but that's far from the 22G I get between the Windows VMs. This might or might not be the cause of the below-expectation speed, but it's something I should work on to take the potential impact out of the way.

The question, of course, is why OmniOS is so 'slow'. I played around with a larger send buffer, but that did not help :(
 

dswartz

Active Member
Jul 14, 2011
393
33
28
I found that with a very high-speed Ethernet connection, I needed to increase zfs_dirty_data_max. At least on ZoL it is capped at only 4 GB by default. Of course, my server has 128 GB RAM, so increasing it may not be practical for people with smaller amounts of RAM.
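On ZoL, zfs_dirty_data_max is a runtime module parameter; a sketch of checking and raising it (the 8 GiB target is an example, assuming enough free RAM):

```shell
# Show the current cap (bytes):
cat /sys/module/zfs/parameters/zfs_dirty_data_max
# Raise it to 8 GiB until the next reboot (run as root):
echo $((8 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_dirty_data_max
# To persist across reboots, set it in /etc/modprobe.d/zfs.conf instead:
#   options zfs zfs_dirty_data_max=8589934592
```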
 

gea

The default size of the write buffer on all Open-ZFS is 10% of RAM, max 4 GB, but this should not affect iperf, only write performance to the pool. See: Tuning the OpenZFS write throttle | Delphix

On Solaris the size is variable; it is time-limited (5 s of writes).

Did you
- use vmxnet3 and increase the buffers in the vmxnet3 settings?
- increase the TCP buffers?
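On OmniOS the TCP buffers in question are set with ipadm; a sketch with common 10G starting values (the exact numbers are my assumption, tune to taste):

```shell
# Raise the ceiling first, then the default send/receive buffers.
ipadm set-prop -p max_buf=4194304 tcp
ipadm set-prop -p send_buf=1048576 tcp
ipadm set-prop -p recv_buf=1048576 tcp
ipadm show-prop -p max_buf,send_buf,recv_buf tcp   # verify
```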
 

Rand__

vmxnet3, of course.
Which buffers are the vmxnet3 ones, if they are not the ones affected by the tuning?
max_buf/recv_buf/send_buf? I adjusted send_buf, but no change.
 

gea

For vmxnet3 it's the following (default 256).
You can set (store and compare) these settings in napp-it under System > Tunings.

TxRingSize=4096,4096,4096,4096,4096,4096,4096,4096,4096,4096;
RxRingSize=4096,4096,4096,4096,4096,4096,4096,4096,4096,4096;
RxBufPoolLimit=4096,4096,4096,4096,4096,4096,4096,4096,4096,4096;
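For reference, these ring-size settings live in /kernel/drv/vmxnet3s.conf inside the guest (one comma-separated value per driver instance); a quick way to check what is configured after editing:

```shell
# Show the tuned lines in the guest-side driver config:
grep -E 'TxRingSize|RxRingSize|RxBufPoolLimit' /kernel/drv/vmxnet3s.conf
```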

I can give you an eval key for the Pro features if needed.
 

Rand__

It was already set accordingly, but the MTU was at 1500 in /kernel/drv/vmxnet3s.conf.
Do both (the MTU there and on the NIC) need to be set for it to take effect?

I tried both ways, but no change.
iperf from Win to napp-it: 10 Gb/s; to another Win host still 22 Gb/s :/

And I am still in the 30-day fresh-install period, but thanks for the offer :)
 

Rand__

I have set MTU 9000 across the board, but thanks for the suggestion :)
 

gea

If you set jumbo frames, do not forget to enable them on the ESXi vSwitches as well.
But as ESXi-internal traffic is handled in software only, settings that only affect real Ethernet hardware do not change anything.
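Setting the vSwitch MTU can be done from the ESXi host shell; a sketch (the vSwitch name is a placeholder):

```shell
# List standard vSwitches, then set a 9000-byte MTU on the one
# carrying the storage traffic ('vSwitch1' is an assumption):
esxcli network vswitch standard list
esxcli network vswitch standard set --vswitch-name=vSwitch1 --mtu=9000
```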
 

Rand__

@gea, no ideas left?
What are your iperf rates on AIOs, if I may ask?
It's weird that this can be so different on pretty standard setups.
 

gea

I saw up to 10 Gb/s in my tests on OmniOS and more on Solaris (40G NIC), but I never dug deeper into >10G performance.
Maybe ask at omniosorg/Lobby, where the OmniOS devs are around.
 

Rand__

So I set up a Sol 11.4 box: still no more than 10G (well, 11.4 Gb/s) with -P 1, even locally via 127.0.0.1 (different box with a 2.4 GHz CPU though).
Will play around some more.