ZFS performance vs RAM, AiO vs barebone, HD vs SSD/NVMe, ZeusRAM Slog vs NVMe/Optane


gea

Well-Known Member
Dec 31, 2010
3,157
1,195
113
DE
I have extended my benchmarks to answer some basic questions

- How good is an AiO system compared to a barebone storage server
- Effect of RAM on ZFS performance (random/sequential, read/write)
(2/4/8/16/24G RAM)
- Scaling of ZFS over vdevs
- Difference between HD vs SSD vs NVMe vs Optane
- Slog: SSD vs ZeusRAM vs NVMe vs Optane

current state:
http://napp-it.org/doc/downloads/optane_slog_pool_performane.pdf
 

gea

Well-Known Member
Dec 31, 2010
3,157
1,195
113
DE
My menu Pools > Benchmarks (this one is in 17.07dev) is a simple Perl script. In the current benchmark set it uses some Filebench workloads for random, sequential and mixed r/w workloads. The other options are dd and a simple write loop of 8k or larger writes via echo. The script executes the benchmarks one by one, switching sync for writes automatically, and allows modification of some settings directly or via shellscript (for ZFS tuning) to avoid the tedious manual switching of settings between large benchmark series, as each run consists of 7 benchmarks (write random, write sequential, both sync and async, read random, r/w and read sequential). For the many benchmarks to be run I selected ones that give a reasonably accurate result but have a short runtime. Therefore the differences from run to run are around 10%, but this should not affect the general result.
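As an illustration of the sync switching, a minimal sketch with a placeholder pool name and a plain dd load (not the actual bench.sh code) could look like:

#!/bin/sh
# sketch only: toggle sync for a pool and run a simple sequential write each time
POOL=tank                                   # placeholder pool name
TESTFILE=/$POOL/bench.tmp

for SYNC in disabled always; do
    zfs set sync=$SYNC $POOL                # switch sync mode for the whole pool
    echo "sync=$SYNC"
    time dd if=/dev/zero of=$TESTFILE bs=1024k count=1024   # 1 GB sequential write
    rm -f $TESTFILE
done
zfs set sync=standard $POOL                 # restore the ZFS default afterwards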

So it should work on ZoL, and I would expect similar results there; maybe you need some extra RAM for the same values.
 
Last edited:

_alex

Active Member
Jan 28, 2016
866
97
28
Bavaria / Germany
ok, will set up an AiO on Proxmox and have a look into it.
i just would like to have a point of comparison for ZoL / some idea what numbers to expect with the Optane, but i usually use fio, which doesn't compare well with your numbers.
i want to try what happens if i export a partition of the Optane via NVMe-oF and use it as Slog on the initiator for some local hdds.

makes sense to turn on/off sync by script in a benchmark series :)
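For the NVMe-oF idea, a rough sketch of the initiator side on Linux with nvme-cli could be (NQN, address and device names are made up, and the target export of the Optane partition has to be configured separately):

nvme connect -t tcp -a 192.168.1.10 -s 4420 -n nqn.2018-01.org.example:optane-slog
# use -t rdma instead of -t tcp when going over an RDMA transport
# the remote namespace then shows up as a local NVMe device, e.g. /dev/nvme1n1
zpool add hddpool log /dev/nvme1n1          # attach it as Slog to the local hdd pool
zfs set sync=always hddpool                 # force sync writes so the Slog is actually used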
 

Rand__

Well-Known Member
Mar 6, 2014
6,634
1,767
113
So when will auto pool creation/destruction/composition based on a config file be added?
Looking forward to running it on my SSD or potential NVMe pool ;)

edit:
typo :
4.5 A SSD based pool via LSI pass-through (4 x Intel DV 3510 vdev)

and other places same error
 

gea

Well-Known Member
Dec 31, 2010
3,157
1,195
113
DE
My mind is faster than my fingers...
Will correct them.

At the moment the whole benchmark series is a voluntary extra task.
And now one wants to classify the results as well.

In German we say: Wer misst, misst Mist ("who measures, measures rubbish").
 
  • Like
Reactions: _alex

Rand__

Well-Known Member
Mar 6, 2014
6,634
1,767
113
Yeah - i thought you probably have most of the stuff scripted anyway - and it might make your next run simpler too;)
 

_alex

Active Member
Jan 28, 2016
866
97
28
Bavaria / Germany
i like benchmarks for the purpose of spotting other bottlenecks/misconfigurations.
so, if there is a clear range of what should be reached, there must be something wrong if your own results are orders of magnitude below ;)
 

azev

Well-Known Member
Jan 18, 2013
769
251
63
@gea which NVMe driver did you use in your test? The native one from the ESXi installation, or did you install the Intel NVMe driver from the VMware website?
 

gea

Well-Known Member
Dec 31, 2010
3,157
1,195
113
DE
another two days later in the lab
http://napp-it.org/doc/downloads/optane_slog_pool_performane.pdf

Can someone with Solaris verify these results vs OmniOS from Windows (SMB and iSCSI)?
They are too good!

TL;DR ("too long; didn't read")

This benchmark sequence was intended to answer some basic questions about disks, SSDs, NVMe/Optane, the effect of RAM, and the difference between native ZFS v.37 in Oracle Solaris vs OpenZFS in the free Solaris fork OmniOS. If you want to build a ZFS system, this may help you optimize it.

1. The most important factor is RAM

Whenever your workload can be processed mainly within your RAM, even a slow HD pool is nearly as fast as an ultimate Optane pool. Calculate around 2 GB for your OS. Then add the desired RAM-based write cache (OmniOS default: 10% of RAM, max 4 GB) and add the RAM that you want as read cache.
If your workload exceeds your RAM, or cannot use the RAM (as with sync writes), performance can drop dramatically. In a home server / media server / SoHo filer environment with a few users and 1G networks, 4-8 GB RAM is ok. In a multiuser environment or with a large amount of random data (VMs, larger databases), use 16-32 GB RAM. If you have a faster network (10/40G), add more RAM and use 32-64 GB or more.
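As a worked example of this sizing rule (the 32 GB figure is only an illustration, not one of the tested configurations):

# hypothetical 32 GB filer
#   OS reserve:                        ~2 GB
#   RAM write cache (OmniOS default):  min(10% of 32 GB, 4 GB) = 3.2 GB
#   left over for the ARC read cache:  32 - 2 - 3.2 = ~26.8 GB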

2. Even a pure HD pool can be nearly as fast as an NVMe pool.

In my tests I used a pool of 4 x HGST HE8 disks with a combined sequential read/write performance of more than 1000 MB/s. As long as you can process your workload mainly from RAM, it is tremendously fast. The huge drop when using sync write can be nearly eliminated by a fast Optane Slog like the 900P. Such a combination can be nearly as fast as a pure SSD pool, at a fraction of the cost and with higher capacity.
Even an SMB filer with a secure write behaviour (sync-write=always) is now possible, as 4 x HGST HE8 disks in a Raid-0 + an Optane 900P offered 1.6 GB/s sync write performance on Solaris and 380 MB/s on OmniOS.
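A minimal sketch of such a setup with the usual zpool/zfs commands (pool and device names are placeholders):

zpool create hddpool c1t0d0 c1t1d0 c1t2d0 c1t3d0   # 4 x HGST HE8 striped (Raid-0)
zpool add hddpool log c2t0d0                       # Optane 900P as dedicated Slog
zfs set sync=always hddpool                        # secure write behaviour for the whole pool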

3. Critical workloads (many users, lots of random data)

In such a situation, use SSD-only pools. A dedicated Slog is not needed, but prefer SSDs with
powerloss protection when you want sync write.

4. Ultra critical or performance sensitive workloads

Intel Optane is unbeaten!
Compared to a fast NVMe it reduces latency from 30 us down to 10 us and increases iops from 80k to 500k.
On most workloads you will not see much difference, as most workloads are more sequential or the RAM takes the load, but some are different. If you really need small random read/write performance, you have no alternative. Additionally, Optane is organized more like RAM. This means no trim, garbage collection or erase-before-write is needed as on Flash. Even a concurrent read/write workload does not affect performance in the same way as it does on Flash.
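To put the latency numbers into perspective, some simple queue-depth-1 arithmetic (an estimate, not a measured value from the PDF):

# iops at queue depth 1 ≈ 1 / latency
#   fast NVMe at 30 us:  1 / 0.000030 s ≈  33,000 iops per outstanding request
#   Optane at 10 us:     1 / 0.000010 s  = 100,000 iops per outstanding request
# sync writes to a Slog are largely latency-bound (close to QD1), which is where Optane shines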

5. Oracle Solaris with native ZFS v.37 beats OpenZFS

OmniOS, a free Solaris fork, is known to be one of the fastest OpenZFS systems, but native ZFS v.37 on
Solaris plays in a different league, both in pool performance and in services like SMB. What I have found is that Solaris starts writes very fast and then stalls for a short time. OmniOS with its write throttling does not seem as fast regarding overall write performance but can guarantee a lower latency.
RAM efficiency regarding caching seems to be the major advantage of Solaris, and even with little RAM
for caching, sync write performance even on harddisks is top.
 

Rand__

Well-Known Member
Mar 6, 2014
6,634
1,767
113
Still the Intel DV 3510 typo ;)
Also "Optana 900p", sungle, Junboframes, rersulta,AStto, qualitzy wi-the,requirte, wrize-ramcache

On Solaris only two of ny Optane 900P were detected, so a compare 4 Optans on OmniOS vs 2 Optane on Solaris
On Solaris only two of ny Optane 900P were detected, so a comparison of 4 Optans on OmniOS vs 2 Optane on Solaris


Else very nice, thanks a lot for the extensive testing
 

jp83

New Member
Dec 29, 2017
7
0
1
40
He is running the slog on an optane nvme drive;)
I see that, but I thought I saw that he couldn't pass it through natively and was using a virtual disk to make it available to the VM. That's where my question is, because I can't seem to get any decent performance like that.
 

Rand__

Well-Known Member
Mar 6, 2014
6,634
1,767
113
Yes he is using that via vmdk, but i think the magic is in the drive not the setup. The same setup with a P3700 was not really worthwhile a couple of months ago.
 

james23

Active Member
Nov 18, 2014
441
122
43
52
wow gea, what a beautiful pdf document. thank you for it.

i had a question(s)

(pls correct me if wrong) i've been trying to bench freenas (with many different disk and hardware setups) for a few months. i often can't get past the ARC messing with my read results.

I see in a lot of your tests you set readcache=all or readcache=none - is this enabling/disabling the ARC/ram read cache?
on freenas, the best i've been able to come up with (and these aren't very good, as the read and write speeds become very poor) is:

zfs set primarycache=metadata MYPOOL
or
first: hw.physmem 8294967296 (and reboot, so that freenas at the OS level only "sees/uses" 8 GB of ram total)
and then sysctl vfs.zfs.arc_max=1514128320 (= ~1.5 GB of ram for the arc). if i don't do hw.physmem, then my real 128 GB or 256 GB of memory is active and i can't set sysctl vfs.zfs.arc_max any lower than 16 GB.
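(side note, just a sketch with the standard FreeBSD loader tunables and the same values as above, not tested: the cap can also be set as a boot-time tunable, e.g. in /boot/loader.conf or via the FreeNAS Tunables page, which may avoid the runtime 16 GB lower bound)

vfs.zfs.arc_max="1514128320"    # ~1.5 GB ARC cap, applied at boot
hw.physmem="8294967296"         # optionally limit visible RAM to ~8 GB as well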

i then use these tools to TRY to get some consistent benchmarks.
dd (often if=/dev/null or of=)
fio (seems to give wrong/bogus results if i increase the threads or job count)
iozone
bonnie++
cp xyz /dev/null

(sync=disabled or sync=always / atime=disabled, no compression)

is there anything else i should be trying to get more consistent/repeatable speeds (mainly for read)?
(or is there a way i can modify your napp-it bench1.sh script to run on freebsd/freenas?) it looks pretty consistent.

(i've mostly given up on read, and only benchmark write results)
maybe this isn't the best indicator, but i always watch the updates of gstat -s -p
(if i don't see my drives being close to maxed out in % terms, i assume something is keeping me from the max speed i could be seeing in my benchmark)


My goal is to get some more stable info and notes (i have tons already) on my actual Freenas performance with different pool layouts, so i can pick the best one to go with for my final build.
(i've been at this for months, and still have many more months to keep playing with Freenas / my layout before i commit to something). I have a lot of equipment to test and play with for this (ie ~40x 3 TB HGST sas disks, ~20 HGST sas3 SSDs, 6x enterprise NVMes and one 280 GB optane, + a SM X10 and a few X9 systems, all sitting idle for my testing).

thanks for any input
 

james23

Active Member
Nov 18, 2014
441
122
43
52
my benchmark notes are a bit of a mess (but i of course know what's what) so it might be hard to read, but i'll post this one since it's easy to post.

I think the best way will be to post/share it, then if you have questions or need specifics of what i was running, ask and i'll answer / give more info. here is one i can grab now (it's a huge excel spreadsheet so i figure the best way is via a google sheets share).

I'll post some others (that are in google docs, not excel format). nb: my freenas box is 11.1 or 11.2, on baremetal X9DR3-LN4F+, 128 GB ram, 9207-8i and 9207-8e with a 4U 846 TQ backplane (i moved to an expander BP in my later tests, in future docs i'll post):

ZFS DISK BENCH SHEET _ JAN 2019 excel XLSx.xlsx
(those "new110" comparisons way off to the right are a windows box i have with adaptec 8 series raid as a comparison)

for some reason the formatting looks correct when you preview with google sheets, but when you open it with google sheets, it loses a lot of the formatting.

NOTE: A LOT OF THE NON-COLORED text is results i copied/pasted from https://calomel.org/zfs_raid_speed_capacity.html
and my own tests (with the same type of pool) are in colored text

any windows images/screenshots you see are run via SMB (or some iSCSI, but mostly smb) against the given pool config (using 2012r2 on a separate esxi host, via 10G to the FN box).
unfortunately, it was only recently that i found that windows 10 / srv 2016 gives you MUCH better SMB performance (i think because those OSes support SMB multichannel, which works better with freenas's single-threaded SMB). ie with win7 or 2012r2, even baremetal, i rarely get above 500-600 MB/s with a windows file copy; with win10 / 2016 i can get 1000 MB/s on a fast pool (ie a striped HGST ssd pool).

some of the results i'll post tomorrow have more of the SSD results, and are easier to read / follow. (a lot of the spreadsheet above was from when i was only 1 month into learning FN, vs 3 or 4 months of playing with FN now)
----

EDIT: this is a 2nd set of benches and might be easier to follow (maybe :/ ). a lot of the tests towards the top are from a RAID card on 2012r2 (not zfs), which i did for my own comparisons. it's a pdf of a google doc shared via google drive:

(pt 3of3) 2019- Huge disk Benchmarks - Google Docs.pdf

(page ~22 is where FN stuff is mostly, esp page 27)
 
Last edited:

gea

Well-Known Member
Dec 31, 2010
3,157
1,195
113
DE
Any benchmark produces a special workload case, which means every benchmark must give different results. What I have done is create a series of benchmarks (Filebench in my case) as there you can select a workload per benchmark (e.g. more sequential or random workload, filer, webserver etc). The goal was not to get absolute values but to get some idea of how to design a pool, RAM needs or configuration settings in a real setup, where you can modify settings in the triangle of price, performance and capacity to reach an optimal setting for a given use case (e.g. should I add more RAM, use more disks, or use Raid-10 instead of Z2 for a new machine with a given use case).

My tests are based on a series where every write benchmark is done with sync enabled and disabled. Readcache (Arc) all vs none shows the effect of RAM. Only with Intel Optane does readcache=none give similar results for both settings (meaning RAM is not so relevant) and across different numbers of disks and raid settings. With Arc enabled I have additionally done tests with different amounts of RAM. As the benchmark series is scripted, it was easy to run series of it with different settings. The bench.sh would also allow adding your own benchmarks to every run.
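For reference, a sketch of the corresponding plain ZFS commands (pool name is a placeholder; the readcache setting in the benchmarks corresponds to the ZFS primarycache property):

zfs set primarycache=none tank      # readcache=none: bypass the ARC, shows performance without RAM
zfs set primarycache=all tank       # readcache=all: data + metadata cached in the ARC (default)
zfs set sync=disabled tank          # async write run
zfs set sync=always tank            # sync write run
zfs set sync=standard tank          # back to the ZFS default when done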

As RAM is used for read and write caching, it affects read and write performance (even on writes you must read metadata, which can be cached in the Arc). Flash is faster on reads than on writes. Disks should perform similarly in both directions, so with disks a pure write test (sync vs async, sequential vs random load) can give enough information.

With FreeBSD you should get at least a similar behaviour, as the ZFS principles are the same. You may need some more RAM on FreeNAS than on OmniOS/Solaris. This is partly related to FreeNAS and partly to the ZFS internal RAM management, which is (still) based on Solaris, but OS related differences in Open-ZFS are becoming smaller and smaller. Even on the Illumos dev mailing list I have seen efforts to include commits, e.g. from Linux, more or less directly.
 
Last edited: