ZFS read caching with SSD: is cache used as well as storage, or instead?


TheBloke

Active Member
Feb 23, 2017
Brighton, UK
Hi all

I am upgrading my Solaris 11.3 home server/NAS. I have 27 x 2TB SATA3 drives, which will be configured as three 9-drive VDEVs; either 3 x RAIDZ2 or 3 x RAIDZ3, I haven't decided yet.

I also plan to add a couple of SSDs for SLOG and read cache, and my question for this thread is about read cache (L2ARC).

When data is in read cache/L2ARC, does ZFS read this data only from the cache, or does it read some from L2ARC and some from the underlying disks?

In other words, when looking to boost performance with L2ARC devices, do I need those devices to always be faster than the underlying disks, or is the performance of L2ARC additional to the performance of the disks?

I want to add read cache primarily to boost my IOPS, where SSDs will of course be vastly better than spinning disks. But my spinners already give me great sequential read/write performance, and that matters for at least some of my workload (very large media files, such as MP4 videos of up to 20GB).

In my current benchmarks, using iozone, I am recording sequential speeds of 2.2GB/s writes and 1.7GB/s reads.
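
For anyone curious about the numbers, the sort of run I'm doing looks roughly like this (flags from memory, and the file/thread sizes are illustrative rather than my exact command):

Code:
      # sequential write (-i 0) and read (-i 1) tests, 1MB records,
      # 8 threads each working on a 16GB file (kept large so the ARC doesn't skew results)
      iozone -i 0 -i 1 -r 1m -s 16g -t 8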

My issue is that my planned cache devices likely won't be as fast for sequential reads. I was thinking of getting 2 x Samsung 850 EVO drives, which are capable of about 550MB/s reads each. Two of these striped would give me roughly 1.1GB/s sequential, quite a bit slower than the 1.7GB/s sequential reads my spinning disks can sustain.

So what would be perfect is if ZFS could use the read cache in addition to the underlying storage - i.e. if it striped reads so that some come from the L2ARC devices and some from the pool, making total throughput anything up to the sum of cache and pool bandwidth.

Conversely, if cached data is read only from L2ARC, bypassing the pool completely, then it seems to me that in some circumstances I would actually see a drop in performance with my proposed config of 2 x 550MB/s SSDs.

I'd be most grateful for any thoughts or comments on how this actually works. It's not something I can easily test until I have the SSDs, but I don't want to order them until I understand this better, as it might well affect my purchasing decision.

For example, if I found that I really do need L2ARC devices that are at least as fast as my disks for sequential reads, I might consider something exotic like a PCI-E flash drive. I can currently get a 1.2TB LSI Nytro WarpDrive on eBay for about the same price as 3 x 250GB Samsung 850 SSDs. That's rather more than I planned to spend (I was only planning to get 2 x 250GB), but I will consider it if it's the only way to ensure caching improves my performance in all use cases.
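
For concreteness, what I have in mind for attaching the SSDs is roughly the following - device names are placeholders and I'd probably partition the SSDs rather than dedicate whole drives, so treat it as a sketch:

Code:
      # mirrored SLOG for sync writes (placeholder device names)
      zpool add tank log mirror c1t0d0s0 c1t1d0s0
      # two L2ARC cache devices, which ZFS stripes across (cache devices can't be mirrored)
      zpool add tank cache c1t0d0s1 c1t1d0s1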

TIA!
 

cliffr

Member
Apr 2, 2017
9 drives with Z3? That's crazy. If you had all 27 in Z3 you'd be fine. Triple parity.

ZFS looks to ARC and L2ARC to see if it can find data, then looks to disks.
 

TheBloke

Active Member
Feb 23, 2017
Brighton, UK
9 drives with Z3? That's crazy. If you had all 27 in Z3 you'd be fine. Triple parity.

ZFS looks to ARC and L2ARC to see if it can find data, then looks to disks.
I don't have a huge amount of practical experience with high-performance ZFS arrays, but I have read a fair bit, and everything I've read says that very wide RAIDZ VDEVs are definitely not recommended.

33% redundancy may be overkill, yes - I am also considering 3 x RAIDZ2 - but I certainly don't agree that one 27-drive VDEV is a good idea. I'd have the IOPS of a single drive, and more importantly the rebuild times would be catastrophic.

And I believe it's those rebuild times that bring the most risk. If a pool takes multiple days to rebuild, hammering every disk at 100% the whole time, it's far more likely to suffer further failures before the rebuild completes.

That's one reason why mirrors are so strongly recommended: although in theory you can lose the pool to just two successive disk failures, rebuild times are so short (the time to copy one disk's worth of data, with no impact on any other disk in the pool) that the probability of a second failure hitting an already-degraded mirror VDEV is much smaller. Compare that with RAIDZ rebuilds, which touch every block of data on every disk in the affected VDEV - every disk in the pool, in the case of a single wide stripe.

I personally still don't want to take the extra risk I perceive in mirrors with my old drives - not to mention that I'd like more than 50% space efficiency - despite the big increase in IOPS they would bring. But rebuild time is an important factor, and one I've weighed in deciding on several narrower VDEVs rather than one huge one.

I've spent the last few days testing many configs - 2 x RAIDZ2, 2 x RAIDZ3, 3 x RAIDZ2 and 3 x RAIDZ3, plus a couple of mirrored layouts. I also considered and tested a couple of 4-VDEV configs.

Overall I think my sweet spot is three VDEVs. I'm very happy with the sequential read and write performance I get from 27 disks in three VDEVs; it provides enough space to last me at least several years; and the large amount of redundancy is comforting - especially bearing in mind that my array will be far too large for me to keep meaningful backups of 99% of the data stored. I know IOPS won't be great, but I hope the SSDs will help a lot with that. (And it'll still be 3x better than a single VDEV.)
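
For reference, the layout I'm leaning towards would be created along these lines - placeholder device names, and swap raidz2 for raidz3 if I go that way:

Code:
      zpool create tank \
        raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0 c2t8d0 \
        raidz2 c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t6d0 c3t7d0 c3t8d0 \
        raidz2 c4t0d0 c4t1d0 c4t2d0 c4t3d0 c4t4d0 c4t5d0 c4t6d0 c4t7d0 c4t8d0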

Here are a couple of good articles discussing this point. As the titles suggest, they take different views on mirrors vs RAIDZ, but their recommendations on stripe width are broadly the same:
 

TheBloke

Active Member
Feb 23, 2017
Brighton, UK
Back to my original question: I have done more research in the meantime and found some useful info from one of the creators of L2ARC, Brendan Gregg, from 2008 when L2ARC was first released (ahh, back in those halcyon Sun days, when sunshine and happiness still ruled over the Shire, before the Dark Eye of Sauron Ellison plunged everything into wailing and despair.)

Anyway. In his blog post he says:

What's bad about the L2ARC?

It was designed to either improve performance or do nothing, so there isn't anything that should be bad. To explain what I mean by do nothing - if you use the L2ARC for a streaming or sequential workload, then the L2ARC will mostly ignore it and not cache it. This is because the default L2ARC settings assume you are using current SSD devices, where caching random read workloads is most favourable; with future SSDs (or other storage technology), we can use the L2ARC for streaming workloads as well.

So that pretty neatly answers my question - I should have found that before posting!

L2ARC at the very least can't make things worse. Unless things have changed since 2008, for sequential/streaming workloads it simply won't cache the data at all; it only tries to cache random workloads, which benefit most from an SSD's much better IOPS. That sounds fine for my purposes, although not quite as good as what I had hoped for - reading from both L2ARC and the disks in parallel. But perhaps that was unrealistic to expect.
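
Once I do have cache devices in the pool, I assume I can confirm this behaviour by watching the L2ARC counters, something along these lines:

Code:
      # Solaris: dump the L2ARC-related ARC statistics (hits, misses, size written, etc.)
      kstat -p zfs:0:arcstats | grep l2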

What I'm interested to know now is whether the future changes he talks about ever happened: has L2ARC been extended in the intervening eight years so that it will now try to cache streaming workloads as well, given how much SSDs have improved in that time? Especially now that we have PCI-E cards like the Fusion-io and the WarpDrive I mentioned, offering both huge IOPS and sequential speeds in the multi-GB/s range, which would likely benefit almost every workload in nearly every pool.

Ideally it would actually measure the relative performance of the cache and pool devices, and decide what to cache - and where to read it from - according to whatever gives the biggest benefit. Perhaps even reading in parallel from multiple sources, as I first thought it might.

I really do need to try to test this, I think, so I may go ahead and get a couple of Samsung SSDs sooner rather than later. I'd still dearly love that WarpDrive, but I don't know if I can justify the cost - not to mention that I'd have to cut open my case and work out a way to support the card, as it's full-height and I have a 2U low-profile chassis. It's worth it though! I think! :)
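
When I do get them, I assume something like this will show how much data actually ends up on the cache devices and how hard they're being hit:

Code:
      # per-vdev capacity and bandwidth, refreshed every 10 seconds;
      # cache devices are listed in their own section of the output
      zpool iostat -v tank 10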

In the meantime I'd be most grateful to hear from anyone who has already tested this and has real-life numbers - and/or from anyone who knows how things work in OpenZFS, which inherited Sun's code as of 2010, a couple of years after the first L2ARC release, and may have improved it further since then.

EDIT: Answering my own question again, it looks like this may be tunable.

FreeBSD definitely allows tuning the L2ARC caching of streaming/sequential workloads:

By default the L2ARC does not attempt to cache streaming/sequential workloads, on the assumption that the combined throughput of your pool disks exceeds the throughput of the L2ARC devices, and therefore, this workload is best left for the pool disks to serve. This is usually the case. If you believe otherwise (number of L2ARC devices X their max throughput > number of pool disks X their max throughput), then this can be toggled with the following sysctl:
vfs.zfs.l2arc_noprefetch
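
For anyone on FreeBSD following along, I believe checking and flipping it would look something like this (untested by me, so verify before relying on it):

Code:
      # show the current value (1 = don't cache prefetched/streaming buffers, the default)
      sysctl vfs.zfs.l2arc_noprefetch
      # allow streaming/prefetched data into L2ARC on the running system
      sysctl vfs.zfs.l2arc_noprefetch=0
      # or make it persistent across reboots
      echo 'vfs.zfs.l2arc_noprefetch=0' >> /etc/sysctl.conf
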
As for Solaris, I haven't yet found 100% confirmation that the same is possible. But there are a number of kernel tunables that can be set in /etc/system, such as zfs_prefetch_disable. That certainly sounds similar to the FreeBSD sysctl mentioned above, but might not be the same thing. EDIT 2: actually, it seems this is enabled by default, so it's unlikely to be the same as FreeBSD's L2ARC-specific sysctl.

Of course, this may be a moot point for me unless I do get a WarpDrive with 2+ GB/s sequential read speeds. My initial concern was that using SSDs with 'only' 1GB/s sequential read would hurt performance; I now see that definitely shouldn't be the case.
 

TheBloke

Active Member
Feb 23, 2017
Brighton, UK
Add the following to /etc/system to enable l2arc caching for sequential data on Solarish


Solaris Tunable Parameters Reference Manual
napp-it Pro user can set in menu System > Appliance tuning
Awesome, thanks gea!

I did read the Oracle Tunable Parameters docs, and I don't see that param listed anywhere. The one you linked is actually for Solaris 9 :) Here are the 11.3 docs, but they don't list many ZFS params.

So I guess these L2ARC params are undocumented? Though now that I look again, I see they were mentioned in that original 2008 Brendan Gregg article.

It's great to hear that napp-it can set these - I will have to check it out to see what else it can do! I like doing things on the command line, so I haven't tried your UI yet, but I will definitely install it as a reference at least, as it sounds like it will expose new features and tunables to me.
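
For anyone who finds this thread later: I assume the /etc/system entry gea is referring to is something along these lines, based on Brendan's parameter list below - double-check it on your own system before relying on it:

Code:
      * /etc/system - allow streaming/prefetched reads to be cached in L2ARC
      set zfs:l2arc_noprefetch=0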

For future reference in this thread, these are the params Brendan described back in 2008:
Code:
      l2arc_write_max         max write bytes per interval
      l2arc_write_boost       extra write bytes during device warmup
      l2arc_noprefetch        skip caching prefetched buffers
      l2arc_headroom          number of max device writes to precache
      l2arc_feed_secs         seconds between L2ARC writing
 