ZFS storage - is it worth going full flash?


Stril

Member
Sep 26, 2017
Hi!

I need to buy a new HA-storage and want to use ZFS.
In the past, I always used:
- HA-Cluster
- 2x head + 2x JBOD
- mirrored SLOG with fast SSDs
- mirrored storage with 10k SAS-drives

Now I am wondering whether a full-flash system would be a good idea, but I could not find ANY benchmark showing how much faster a full-flash system would be, or whether ZFS itself would be the bottleneck.


What do you think? Have you ever tried a full-flash ZFS HA cluster?
My setup would be:
- 2x 10 SAS SSDs with 800 GB
- No SLOG?

Thank you for your thoughts!
 

i386

Well-Known Member
Mar 18, 2016
Germany
How fast do you want to go? 10 GbE? 40 GbE?

I tried 40 GbE, and the TCP/IP handling alone consumed 70% of a Xeon E5 v4 at 3.7 GHz.
 

Rand__

Well-Known Member
Mar 6, 2014
And what kind of workload? My playing around (see the Solaris subforum) indicated (to me) that all-flash on ZFS is not really worth it...
You get 10G easily, but beyond that ...

But I have not tried it on Linux, so take the results with a grain (or more) of salt.
 

Stril

Member
Sep 26, 2017
Hi!

I want to use the system for VM workloads - mixed Linux and Windows with some databases.
The plan is 4x 10 GbE or 2x 40 GbE uplinks on the ZFS cluster.

I am just not sure whether ZFS will be the bottleneck, so that there is nearly NO advantage in buying the all-flash system over a good hybrid solution (which would be about $15,000 cheaper...).
 

Rand__

Well-Known Member
Mar 6, 2014
As I said, check the various threads in the Solaris subforum (especially @gea's test results).
At the moment the best option I see is to size for the needed capacity and add Optane Slog(s) to either the disk or the SSD pool.

Note that most tests do not cater for many users at the same time; it's quite possible that SSDs will shine there (if you can't satisfy the read requests from cache).
 

gea

Well-Known Member
Dec 31, 2010
DE
With ZFS, all small random writes are collected in the RAM-based write cache and then written out as a large sequential write. For this pattern, disks are nearly as fast as SSDs.

On sync writes, the logging for crash protection requires that every small random write is written immediately to disk. With ZFS you can use a dedicated Slog for this. If you use an Optane as Slog (the best of all options right now), the disk pool is nearly as fast as the SSD pool.

Most small random reads are delivered by the RAM-based read cache, so from the second access on, pool performance is not relevant.

For the above cases, a fast disk-based pool is nearly as good as an SSD pool. A real, and then possibly huge, difference shows up with reads that are not served by the read cache, which happens with low RAM, many users, or a large amount of different data being processed. In such a case an SSD-only pool is much faster due to its much higher IOPS:

Hard disk: around 100 IOPS
SATA enterprise SSD: 30k-80k IOPS, 100 µs+ latency
Flash NVMe: up to 200k IOPS, 50 µs+ latency
Optane NVMe: 500k IOPS, 10 µs latency

For HA: as there is no multipath Optane NVMe, a failover will result in the Slog going offline (reduced performance), but for the main system (with Optane) such an Slog is much faster than any flash-based Slog or even the older DRAM-based Slogs like the ZeusRAM (which offers dual-path SAS).
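
A rough back-of-the-envelope sketch of what those uncached random reads mean in practice (the per-device figures are assumptions within the ranges above, and the linear scaling with the number of data devices ignores controller, network and ZFS overhead):

```python
# Rough sketch, not a benchmark: how long a burst of uncached 4k random reads
# takes on pools built from the device classes listed above.
# Assumption: pool read IOPS scale roughly linearly with the number of data
# devices, which ignores controller, network and ZFS overhead.

DEVICE_READ_IOPS = {            # per-device 4k random read IOPS (rough figures)
    "hard disk": 100,
    "SATA enterprise SSD": 50_000,
    "flash NVMe": 200_000,
    "Optane NVMe": 500_000,
}

def pool_read_iops(device: str, data_devices: int) -> int:
    """Optimistic pool read IOPS: per-device IOPS x number of data devices."""
    return DEVICE_READ_IOPS[device] * data_devices

def seconds_for_uncached_reads(device: str, data_devices: int,
                               working_set_gib: float, block_kib: int = 4) -> float:
    """Time to read a fully uncached working set with random reads of block_kib size."""
    blocks = working_set_gib * 1024 * 1024 / block_kib
    return blocks / pool_read_iops(device, data_devices)

if __name__ == "__main__":
    # Example: 20 data devices (like the proposed 2x10 SSDs) and 100 GiB of
    # reads that miss the ARC.
    for dev in DEVICE_READ_IOPS:
        t = seconds_for_uncached_reads(dev, data_devices=20, working_set_gib=100)
        print(f"{dev:22s}: ~{t:,.0f} s")
```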
 

Stril

Member
Sep 26, 2017
Hi!

My problem is that I need a real cluster solution without data loss. I did not find any way to use NVMe in that case, so I think the only option is SAS SSDs, which are much slower.
I thought that it is NOT possible to use an NVMe that is only accessible by one node, because it would not just cause a performance degradation in case of a failover - it would also lead to losing the last ~5 s of writes (which is not a good idea for VMs).
Am I right?

I just never had the chance to benchmark a "full-flash ZFS cluster" - only hybrid systems.

Or another question: what would you use for that use case? The only system that comes to mind is Starwind, which can work with NVMe and full flash and which reached 300k IOPS in my SATA SSD setup, but as it runs on Windows, I would have some licensing issues...

Stril
 

gea

Well-Known Member
Dec 31, 2010
DE
1. Slog
An Slog is only read during boot-up, to redo committed writes that are not yet on stable storage. It is never read during regular operation. So data loss can only happen after a crash during writes when the Slog is not available on the next boot-up. In an HA environment this can be a problem if the first head crashes during writes and the Slog is not available after the pool failover to the second head. This is not a problem for a planned failover, as in that case all writes are finished prior to the failover, and after the pool failover (with the Slog missing) the on-pool ZIL is used instead.
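
As a minimal sketch of the arithmetic behind the "last ~5 s of writes" concern (assuming the default ZFS transaction group timeout of 5 seconds; the real window can be shorter because dirty-data limits also force commits):

```python
# Minimal sketch: if a head crashes during writes and the (non-multipath) Slog
# is not available at the next import, the sync writes acknowledged since the
# last committed transaction group are lost.
# Assumption: default zfs_txg_timeout of 5 seconds; dirty-data limits can force
# earlier commits, so this is an upper bound.

def data_at_risk_mib(sync_write_mib_per_s: float, txg_timeout_s: float = 5.0) -> float:
    """Upper bound on acknowledged-but-uncommitted data if the Slog is lost in a crash."""
    return sync_write_mib_per_s * txg_timeout_s

if __name__ == "__main__":
    # Example: VMs pushing 400 MiB/s of sync writes to the cluster.
    print(f"~{data_at_risk_mib(400):.0f} MiB of acknowledged writes at risk")
```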

This is a general problem in any HA config that is based on two servers with shared multipath disks and pool failover, like RSF-1 for example. There are a few options here: use multipath SAS SSDs that are good enough to operate without an additional Slog, use a multipath SAS Slog (like the ZeusRAM that was often used in former times), or use multipath iSCSI targets based on NVMe Optanes.

Another option would be a method that is not based on multipath disks but works at the VM, service or OS level. As a proof of concept (without support), I have added a storage/service failover concept to napp-it that is based on a pool mirrored over two iSCSI targets from two heads, with failover.

Another possible (untested) option would be to use ESXi as the base system with a vdisk (from an Optane) as Slog.

If you need production quality with support, the current ZFS offer is RSF-1 with multipath SAS.
 

Stril

Member
Sep 26, 2017
If you need production quality with support, the current ZFS offer is RSF-1 with multipath SAS.
...and in this case: do you think a full-flash SAS cluster (perhaps without Slog) would be much faster than a hybrid pool with a SAS SSD?
 

gea

Well-Known Member
Dec 31, 2010
DE
It depends on the workload.

On a sequential non-sync workload, a disk pool (with a few more disks) can be as fast as an SSD pool.
(There are practically no random writes to a pool, due to the RAM-based write cache.)

On a sequential sync workload, the Slog (or on-pool ZIL) determines overall performance.
This means a disk pool with a good Slog can be nearly as fast as an SSD pool without a Slog.

Random reads that are not cached in the ARC make the real difference. For those, SSDs are much faster.

So in the end, it's all about cost.
SSDs are faster than a hybrid pool but more expensive. Enterprise SSDs do quite well with a sync workload even without a Slog.

In an SSD pool you only need an Slog if it is much faster than the SSDs themselves (regarding latency and write IOPS).
Usually this means such an Slog is not flash-based but an Intel Optane or one of the very expensive DRAM-based models.
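
A rough model (my assumption, not a measurement) of why that is: a single synchronous write stream is limited by the latency of the log device, because each write must be acknowledged before the next one is issued. The latency figures below are illustrative:

```python
# Rough model: queue-depth-1 sync write throughput is bounded by log-device
# latency, since every write has to be acknowledged before the next is issued.
# The latency figures below are illustrative assumptions, not measurements.

def qd1_sync_write_iops(log_latency_us: float) -> float:
    """Sync write IOPS of a single stream, limited by log-device latency."""
    return 1_000_000 / log_latency_us

if __name__ == "__main__":
    for name, lat_us in [("flash SSD as log", 100),
                         ("flash NVMe as log", 50),
                         ("Optane as log", 10)]:
        print(f"{name:18s}: ~{qd1_sync_write_iops(lat_us):,.0f} IOPS per sync stream")
```

So a flash Slog in front of flash SSDs buys little, while an Optane (or DRAM-based) Slog still helps.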
 

Stril

Member
Sep 26, 2017
Hi!

Thank you for your answer - sounds like I should go with SSDs...

My database workloads are mostly "random"...

It would just be great to have a benchmark showing how many IOPS I can expect from such a setup...
 

gea

Well-Known Member
Dec 31, 2010
DE
As a rough expectation:

Enterprise-class flash-based SAS/SATA SSDs deliver around 30k-80k write IOPS at about 100 µs latency.
If you build a pool from RAID-Zn vdevs, your total IOPS is the number of vdevs times the IOPS of a single SSD, so with a single vdev that is all you get.

If you instead build a pool from mirrors, write IOPS scale with the number of mirrors (read IOPS = 2x the number of mirrors), so a pool of 10 mirrors can go up to 300k-800k write IOPS and twice that in read IOPS.

With sync writes, latency is what matters. As latency does not scale (it does not get lower with more vdevs), sync write performance does not scale the same way, only a little with the number of mirrors.

In benchmarks, the values may be higher due to RAM effects.
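
As a back-of-the-envelope sketch of those scaling rules (the 50k write IOPS per SSD is an assumption within the 30k-80k range above; real numbers depend on recordsize, sync settings and RAM):

```python
# Back-of-the-envelope pool IOPS estimate following the scaling rules above.
# Illustrative only: 50k write IOPS per SSD is an assumed mid-range figure.

def raidz_pool_iops(ssd_iops: int, vdevs: int) -> tuple[int, int]:
    """RAID-Zn: read and write IOPS scale with the number of vdevs, not disks."""
    return ssd_iops * vdevs, ssd_iops * vdevs

def mirror_pool_iops(ssd_iops: int, mirrors: int) -> tuple[int, int]:
    """2-way mirrors: writes scale with the mirrors, reads with both sides of each mirror."""
    return ssd_iops * mirrors, ssd_iops * mirrors * 2

if __name__ == "__main__":
    ssd_iops = 50_000                               # assumed per-SSD write IOPS
    w, r = mirror_pool_iops(ssd_iops, mirrors=10)   # e.g. 2x10 SSDs as 10 mirrors
    print(f"10 mirrors    : ~{w:,} write IOPS, ~{r:,} read IOPS")
    w, r = raidz_pool_iops(ssd_iops, vdevs=2)       # e.g. the same SSDs as 2 RAID-Z vdevs
    print(f"2 RAID-Z vdevs: ~{w:,} write IOPS, ~{r:,} read IOPS")
```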
 