ZFS performance vs RAM, AiO vs barebone, HD vs SSD/NVMe, ZeusRAM Slog vs NVMe/Optane

Discussion in 'Solaris, Nexenta, OpenIndiana, and napp-it' started by gea, Dec 6, 2017.

  1. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,167
    Likes Received:
    705
    I have extended my benchmarks to answer some basic questions:

    - How good is an AiO system compared to a barebone storage server
    - Effect of RAM on ZFS performance (random/sequential, read/write)
    (2/4/8/16/24 GB RAM)
    - Scaling of ZFS over vdevs
    - Difference between HD vs SSD vs NVMe vs Optane
    - Slog: SSD vs ZeusRAM vs NVMe vs Optane

    current state:
    http://napp-it.org/doc/downloads/optane_slog_pool_performane.pdf
     
    #1
    gigatexal, james23, MikeWebb and 5 others like this.
  2. _alex

    _alex Active Member

    Joined:
    Jan 28, 2016
    Messages:
    873
    Likes Received:
    94
    What exactly does the benchmark you call from the GUI do?
    Any chance this can be reproduced on ZoL to compare?
     
    #2
  3. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,167
    Likes Received:
    705
    My menu Pools > Benchmarks (this one is in 17.07dev) is a simple Perl script. In the current benchmark set it uses some Filebench workloads for random, sequential and mixed r/w loads. The other options are dd and a simple write loop of 8k or larger writes via echo. The script executes the benchmarks one by one, switching sync for writes automatically, and allows settings to be modified directly or via shell script (for ZFS tuning). This avoids the tedious manual switching of settings between large benchmark series, as each run consists of 7 benchmarks (write random, write sequential, both sync and async, read random, r/w and read sequential). Because many benchmarks have to be run, I selected ones that give a reasonably accurate result with a short runtime. Therefore the results differ by about 10% from run to run, but this should not affect the general conclusions.

    So it should work on ZoL, and I would expect similar results there; maybe you need some extra RAM for the same values.
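
    To give an idea of the structure, a minimal sketch of such a scripted run (not the actual bench.sh; the pool/filesystem name and workload files are only placeholders, and the workload files must have $dir pointed at the dataset):

    # toggle sync around identical write workloads to get async and sync numbers
    FS=tank/bench
    for SYNC in disabled always; do
        zfs set sync=$SYNC $FS
        filebench -f randomwrite.f           # random write workload
        filebench -f singlestreamwrite.f     # sequential write workload
    done
    zfs inherit sync $FS                     # back to the default (sync=standard)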
     
    #3
    Last edited: Dec 6, 2017
  4. _alex

    _alex Active Member

    Joined:
    Jan 28, 2016
    Messages:
    873
    Likes Received:
    94
    OK, I will set up an AiO on Proxmox and have a look into it.
    I just would like to have a point of comparison for ZoL / some idea what numbers to expect with the Optane, but I usually use fio, which doesn't compare well with your numbers.
    I want to try what happens if I export a partition of the Optane via NVMe-oF and use it as Slog on the initiator for some local HDDs (rough sketch below).

    It makes sense to turn sync on/off by script in a benchmark series :)
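
    For reference, roughly how that export could look on Linux (a sketch only, using the kernel nvmet configfs interface on the target and nvme-cli on the initiator; the address, partition, subsystem and pool names are placeholders, and the rdma transport can be used instead of tcp on older kernels):

    # --- target host: export an Optane partition ---
    modprobe nvmet-tcp
    mkdir /sys/kernel/config/nvmet/subsystems/optane-slog
    echo 1 > /sys/kernel/config/nvmet/subsystems/optane-slog/attr_allow_any_host
    mkdir /sys/kernel/config/nvmet/subsystems/optane-slog/namespaces/1
    echo /dev/nvme0n1p2 > /sys/kernel/config/nvmet/subsystems/optane-slog/namespaces/1/device_path
    echo 1 > /sys/kernel/config/nvmet/subsystems/optane-slog/namespaces/1/enable
    mkdir /sys/kernel/config/nvmet/ports/1
    echo tcp > /sys/kernel/config/nvmet/ports/1/addr_trtype
    echo ipv4 > /sys/kernel/config/nvmet/ports/1/addr_adrfam
    echo 192.168.1.10 > /sys/kernel/config/nvmet/ports/1/addr_traddr
    echo 4420 > /sys/kernel/config/nvmet/ports/1/addr_trsvcid
    ln -s /sys/kernel/config/nvmet/subsystems/optane-slog /sys/kernel/config/nvmet/ports/1/subsystems/optane-slog

    # --- initiator host: attach the remote namespace and use it as Slog ---
    modprobe nvme-tcp
    nvme connect -t tcp -a 192.168.1.10 -s 4420 -n optane-slog
    zpool add hddpool log /dev/nvme1n1    # the device name depends on what appears after connect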
     
    #4
  5. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,408
    Likes Received:
    482
    So when will auto pool creation/destruction/composition based on a config file be added?
    Looking forward to running it on my SSD or potential NVMe pool ;)

    Edit:
    Typo:
    4.5 A SSD based pool via LSI pass-through (4 x Intel DV 3510 vdev)

    and the same error in other places
     
    #5
  6. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,167
    Likes Received:
    705
    My mind is faster than my fingers...
    Will correct them.

    At the moment the whole benchmark series is a voluntary extra task.
    Now one wants to classify the results.

    In German we say: Wer misst, misst Mist (whoever measures, measures rubbish).
     
    #6
    _alex likes this.
  7. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,408
    Likes Received:
    482
    Yeah - I thought you probably have most of the stuff scripted anyway - and it might make your next run simpler too ;)
     
    #7
  8. _alex

    _alex Active Member

    Joined:
    Jan 28, 2016
    Messages:
    873
    Likes Received:
    94
    I like benchmarks for spotting other bottlenecks/misconfigurations.
    So if there is a clear range of what should be reached, there must be something wrong if your own results are orders of magnitude below ;)
     
    #8
  9. azev

    azev Active Member

    Joined:
    Jan 18, 2013
    Messages:
    603
    Likes Received:
    149
    @gea which NVMe driver did you use in your test? The native one from the ESXi installation, or did you install the Intel NVMe driver from the VMware website?
     
    #9
  10. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,167
    Likes Received:
    705
    ESXi native
     
    #10
  11. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,167
    Likes Received:
    705
    Another two days later in the lab:
    http://napp-it.org/doc/downloads/optane_slog_pool_performane.pdf

    Can someone with Solaris verify these results vs OmniOS from Windows (SMB and iSCSI)?
    They seem too good!

     
    #11
  12. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,408
    Likes Received:
    482
    Still the Intel DV 3510 typo ;)
    Also "Optana 900p", sungle, Junboframes, rersulta,AStto, qualitzy wi-the,requirte, wrize-ramcache

    "On Solaris only two of ny Optane 900P were detected, so a compare 4 Optans on OmniOS vs 2 Optane on Solaris"
    should probably read
    "On Solaris only two of my Optane 900P were detected, so a comparison of 4 Optanes on OmniOS vs 2 Optanes on Solaris"


    Otherwise very nice, thanks a lot for the extensive testing.
     
    #12
  13. jp83

    jp83 New Member

    Joined:
    Dec 29, 2017
    Messages:
    7
    Likes Received:
    0
    #13
  14. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,408
    Likes Received:
    482
    He is running the Slog on an Optane NVMe drive ;)
     
    #14
  15. jp83

    jp83 New Member

    Joined:
    Dec 29, 2017
    Messages:
    7
    Likes Received:
    0
    I see that, but I thought I saw that he couldn't pass it through natively and was using a virtual disk to make it available to the VM. That's where my question is, because I can't seem to get any decent performance like that.
     
    #15
  16. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,408
    Likes Received:
    482
    Yes, he is using it via vmdk, but I think the magic is in the drive, not the setup. The same setup with a P3700 was not really worthwhile a couple of months ago.
     
    #16
  17. james23

    james23 Active Member

    Joined:
    Nov 18, 2014
    Messages:
    385
    Likes Received:
    64
    Wow gea, what a beautiful PDF document. Thank you for it.

    I have a question (or a few).

    (Please correct me if I'm wrong.) I've been trying to benchmark FreeNAS setups (with many different disks and hardware) for a few months. I often can't get past the ARC messing with my read results.

    I see that in a lot of your tests you set readcache=all or readcache=none. Does this enable/disable the ARC/RAM read cache?
    On FreeNAS, the best I've been able to come up with (and these aren't very good, as the read and write speeds become very poor) is:

    zfs set primarycache=metadata MYPOOL
    or
    first: hw.physmem=8294967296 (and reboot, so that FreeNAS at the OS level only "sees/uses" 8 GB of RAM total)
    and then sysctl vfs.zfs.arc_max=1514128320 (~1.5 GB of RAM for the ARC). If I don't set hw.physmem, then my real 128 GB or 256 GB of memory is active and I can't set vfs.zfs.arc_max any lower than 16 GB.
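
    For reference, a hedged sketch of that ARC-capping approach as loader tunables (the values are the 8 GB / ~1.5 GB examples from above, rounded to exact GiB figures; hw.physmem needs a reboot to take effect):

    # /boot/loader.conf (or FreeNAS loader tunables)
    hw.physmem="8589934592"         # let the OS see only 8 GiB of total RAM
    vfs.zfs.arc_max="1610612736"    # cap the ARC at ~1.5 GiB at boot

    # per pool/dataset, to keep file data out of the ARC for read tests
    zfs set primarycache=metadata MYPOOL    # cache metadata only
    # zfs set primarycache=none MYPOOL      # or bypass the ARC completely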

    I then use these tools to TRY to get some consistent benchmarks:
    dd (often if=/dev/null or of=)
    fio (seems to give wrong/bogus results if I increase the threads or job count; example run below)
    iozone
    bonnie++
    cp xyz /dev/null

    (sync=disabled or sync=always / atime=off, no compression)
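
    As an illustration of a fio read run that tends to stay repeatable (a sketch only; the pool path, file size and runtime are placeholders, with the file sized well above the ARC cap so cached reads matter less):

    # single-job 8k random read against a file much larger than the ARC
    fio --name=randread --filename=/mnt/MYPOOL/fio.testfile --size=32G \
        --rw=randread --bs=8k --ioengine=posixaio --iodepth=16 --numjobs=1 \
        --time_based --runtime=60 --group_reporting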

    Is there anything else I should be trying to get more consistent/repeatable speeds (mainly for reads)?
    (Or is there a way I can modify your napp-it bench1.sh script to run on FreeBSD/FreeNAS? It looks pretty consistent.)

    (I've mostly given up on reads and only benchmark write results.)
    Maybe this isn't the best indicator, but I always watch the updates of gstat -s -p.
    (If I don't see my drives being close to maxed out in % busy terms, I assume something is keeping me from the max speed I could be seeing in my benchmark.)

    My goal is to get more stable info and notes (I have tons already) on my actual FreeNAS performance with different pool layouts, so I can pick the best one for my final build.
    (I've been at this for months and still have many more months to keep playing with FreeNAS / my layout before I commit to something.) I have a lot of equipment to test and play with for this (i.e. ~40x 3 TB HGST SAS disks, ~20 HGST SAS3 SSDs, 6x enterprise NVMe drives and one 280 GB Optane, plus a SM X10 and a few X9 systems, all sitting idle for my testing).

    Thanks for any input.
     
    #17
  18. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,408
    Likes Received:
    482
    Are you going to write that all up and share it with us? :)
     
    #18
  19. james23

    james23 Active Member

    Joined:
    Nov 18, 2014
    Messages:
    385
    Likes Received:
    64
    My benchmark notes are a bit of a mess (but I of course know what's what), so they might be hard to read, but I'll post this one since it's easy to post.

    I think the best approach is that I'll post/share it, and then if you have questions or need specifics of what I was running, ask and I'll answer / give more info. Here is one I can grab now (it's a huge Excel spreadsheet, so I figure the best way is via a Google Sheets share).

    I'll post some others (that are in Google Docs, not Excel format). NB: my FreeNAS box is 11.1 or 11.2 on a baremetal X9DR3-LN4F+, 128 GB RAM, 9207-8i and 9207-8e with a 4U 846 TQ backplane (I move to an expander backplane in later tests, in future docs I'll post):

    ZFS DISK BENCH SHEET _ JAN 2019 excel XLSx.xlsx
    (Those "new110" comparisons way off to the right are a Windows box I have with an Adaptec 8 series RAID, as a comparison.)

    For some reason the formatting looks correct in the Google Sheets preview, but when you open it with Google Sheets it loses a lot of the formatting.

    NOTE: a lot of the non-colored text is results I copied/pasted from https://calomel.org/zfs_raid_speed_capacity.html
    and my own tests (with the same type of pool) are in colored text.

    Any Windows images/screenshots you see were run via SMB (or some iSCSI, but mostly SMB) against the given pool config (using 2012R2 on a separate ESXi host, via 10G to the FreeNAS box).
    Unfortunately, it was only recently that I found that Windows 10 / Server 2016 gives you MUCH better SMB performance (I think because those OSes support SMB multichannel, which works better with FreeNAS's single-threaded SMB). With Win7 or 2012R2, even on baremetal, I rarely get above 500-600 MB/s with a Windows file copy; with Win10 / 2016 I can get 1000 MB/s on a fast pool (i.e. an HGST SSD striped pool).

    Some of the results I'll post tomorrow have more of the SSD results and are easier to read / follow. (A lot of the spreadsheet above was from when I was only 1 month into learning FreeNAS, vs 3 or 4 months of playing with FreeNAS now.)
    ----

    EDIT: this is a 2nd set of benches and might be easier to follow (maybe :/ ). A lot of the tests towards the top are from a RAID card on 2012R2 (not ZFS) that I did for my own comparisons. It's a PDF of a Google Doc shared via Google Drive:

    (pt 3of3) 2019- Huge disk Benchmarks - Google Docs.pdf

    (Page ~22 is where the FreeNAS stuff mostly is, especially page 27.)
     
    #19
    Last edited: Feb 9, 2019
  20. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,167
    Likes Received:
    705
    Any benchmark produces a special workload case, which means every benchmark must give different results. What I have done is create a series of benchmarks (Filebench in my case), as there you can select the workload for a benchmark (e.g. a more sequential or random workload, filer, webserver etc.). The goal was not to get absolute values but to get some idea of how to design a pool, RAM needs or configuration settings in a real setup, where you can modify settings within the triangle of price, performance and capacity to get an optimal setup for a given use case (e.g. should I add more RAM, use more disks, or use Raid-10 instead of Z2 for a new machine with a given use case).

    My tests are based on a series where every write benchmark is done with sync enabled and disabled. Readcache (ARC) all vs none shows the effect of RAM. Only with Intel Optane does readcache=none give results similar to readcache=all (meaning RAM is not so relevant), and this holds across different numbers of disks and raid settings. With the ARC enabled I have additionally done tests with different amounts of RAM. As the benchmark series is scripted, it was easy to run it several times with different settings. The bench.sh would allow adding your own benchmarks to every run.

    As RAM is used for read and write caching, it affects read and write performance (even on writes you must read metadata, which can be cached in the ARC). Flash is faster on reads than on writes. Disks should perform similarly for both, so with disks a pure write test (sync vs async, sequential vs random load) can give enough information.

    With FreeBSD you should get at least similar behaviour, as the ZFS principles are the same. You may need some more RAM on FreeNAS than on OmniOS/Solaris. This is partly related to FreeNAS and partly to the ZFS-internal RAM management that is (still) based on Solaris, but OS-related differences in Open-ZFS are becoming smaller and smaller. Even on the Illumos dev mailing list I have seen efforts to include commits, e.g. from Linux, more or less directly.
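
    As a rough illustration of the readcache all vs none comparison on any Open-ZFS platform (a sketch only, assuming the napp-it readcache setting maps to the ZFS primarycache property; pool/dataset and file names are placeholders):

    zfs set primarycache=all tank/bench       # reads may be served from the ARC (RAM)
    dd if=/tank/bench/testfile of=/dev/null bs=1M
    zfs set primarycache=none tank/bench      # force reads from the pool devices
    dd if=/tank/bench/testfile of=/dev/null bs=1M
    zfs inherit primarycache tank/bench       # restore the default afterwards
    # for a clean comparison, use a freshly written file or export/import the pool
    # between the two runs so no blocks are still sitting in the ARC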
     
    #20
    Last edited: Feb 9, 2019