Tiered storage configuration

ColPanic · Jul 4, 2022

I’m rebuilding my home/lab server and wondering if anything has changed with tiered storage. I haven’t paid much attention over the last 4 or 5 years. I have an assortment of 8 and 10 TB HDDs a bunch of 960GB Sata SSDs and some NVME drives I’d like to put together in a server. When I last looked into tiers, windows server and S2D was the only viable option and was problematic to say the least so I ended up just doing separate TrueNAS pools for SSD and HDD. Has that changed any and are there any other options? What I would love to do is pool everything together in some sort of smart tiered system. No ZFS tiers yet?

Sean Ho · Jul 4, 2022

Storage tiers are still a good idea in the sense of tailoring storage to the use-case, e.g., all-flash for DBs, big and slow spinners for bulk media. If you're looking for a "smart tiered" system, you're really talking about workload-agnostic caching, which has a limited scope of utility. Cache that's mismatched to the workload can even harm performance, e.g., what happens to many folks who prematurely add L2ARC.

ColPanic · Jul 5, 2022

I’m more interested in write cache for ingesting media and general drive pooling than more read cache I’m not even using L2 cache drives. I just give zfs gobs of RAM (64GB) and it does a pretty good job with read cache. I just wish there were a way with zfs so that anytime I saved a big file it would write to nvme and saturate my 10gb connection then move it down to the slower tiers in an intelligent way.I currently only sustain 300-400MB/s to the NAS on 10gb but if I go from one nvme workstation to another I can sustain full 1.2GB/s line speed. (I have ZIL drives but they don’t make much of a difference).

Windows storage spaces direct supposedly does this with 3 storage tiers. I just don’t know if the performance is any good or not. I know it’s garbage with parity but I’m ok with raid 10. Here are the drives I have but I’ll probably add a few more 10TB.
For storage:
NVME - 4@ 1TB
SATA SSD - 16 @960GB (zfs raid 10)
HDD - 4@ 10TB + 8@8TB (currently zfs Raid10)
2@100GB ssd (currently slog or zil drives)

For ESXI Datastores:
HGST SAS SSD 2@800GB (mirrored) + 4@400GB (Raid Z1)

i386 · Jul 5, 2022

ColPanic said:
2@100GB ssd (currently slog or zil drives)

What exact models are these?

ColPanic · Jul 5, 2022

i386 said:
What exact models are these?

I believe they are Intel DC p3700

ColPanic · Jul 5, 2022

I set up a quick test system with WS2019 with 1 HDDs, 1 SSD and 1 NVME and it looks promising. I'm getting over 800MB in either direction.

nabsltd · Jul 6, 2022

ColPanic said:
I just wish there were a way with zfs so that anytime I saved a big file it would write to nvme and saturate my 10gb connection then move it down to the slower tiers in an intelligent way.

The problem is that ZFS is very old tech, and when it was designed, there was no device that was both big enough and fast enough to handle incoming writes that were 10x the size of RAM.

So, there was just no design thought for a special vdev that would hold the transaction queue for when it overflows RAM. Now, with NVMe able to keep up with even fairly fast network connections, it seems like a no-brainer to add as a feature.

I believe they are Intel DC p3700

Have you partitioned or over-provisioned them? I don't see any P3700 smaller than 400GB. I have the 800GB model, and was thinking I'd partition down to about 200GB for SLOG, and the rest for whatever special thing might come along.

ColPanic · Jul 6, 2022

My mistake. They are DC S3700. I may give windows server 2019 a try. I’ve been testing it for a couple days and I can saturate a 10gb link all day long. Being able to pool different types of storage together with multiple tiers would be a big plus. I know that parity raid is garbage but I’m fine mirroring everything for now.

ecosse · Jul 7, 2022

There's a few third party options for a windows setup, couldn't tell you if they were any good but if storage spaces works its probably not worth bothering

PrimoCache - Excellent Software Caching Solution to Accelerate Storage

Home page of PrimoCache product which is a supplementary software caching scheme to improve the system performance.

www.romexsoftware.com

RAM-SSD-Zwischenspeicherungssoftware

Block-Level-RAM und SSD-Cache. Beschleunigt HDD basierte Volumes über die Geschwindigkeit der SSD.

elitebytes.com

Stablebit has a SSD plugin if you were using this.

StableBit - The home of StableBit CloudDrive, StableBit DrivePool and the StableBit Scanner

stablebit.com

gea · Jul 7, 2022

Some thoughts..

Tiering is ok to separate "hot/fast" and "cold/slow" data but
there is a lot of storage load to move between (reduced overall storage performance during move)

The idea of ZFS special vdev (from Intel) is a different and i would say superiour approach. You separate data based on physical data properties like ZFS blocksize or ZFS recsize. This will force small/slow datablocks to a faster device or you can force some data with a recsize setting < small block size to a ZFS filesytem on a faster device (ex NVMe mirror).

For these data (based on physical blocksize or selected data per ZFS filesystem) you get full NVMe read/write performance, https://www.napp-it.org/doc/downloads/special-vdev.pdf

btw
Slog is not a writecache.
When sync is enabled writes are always slow but your rambased writecache is protected. For filer use, always disable sync and forget about Slogs.

Writecache on ZFS is RAM (nothing else, 10% RAM, default max 4GB on Open-ZFS).

10G (1GB/s) write performance without sync is no problem for a ZFS system based on discs, with Intel Optane even 10G sync write performance is achievable ex https://napp-it.org/doc/downloads/optane_slog_pool_performane.pdf or https://napp-it.org/doc/downloads/epyc_performance.pdf

ColPanic · Jul 7, 2022

gea said:
Some thoughts..

Tiering is ok to separate "hot/fast" and "cold/slow" data but
there is a lot of storage load to move between (reduced overall storage performance during move)

The idea of ZFS special vdev (from Intel) is a different and i would say superiour approach. You separate data based on physical data properties like ZFS blocksize or ZFS recsize. This will force small/slow datablocks to a faster device or you can force some data with a recsize setting < small block size to a ZFS filesytem on a faster device (ex NVMe mirror).

For these data (based on physical blocksize or selected data per ZFS filesystem) you get full NVMe read/write performance, https://www.napp-it.org/doc/downloads/special-vdev.pdf

btw
Slog is not a writecache.
When sync is enabled writes are always slow but your rambased writecache is protected. For filer use, always disable sync and forget about Slogs.

Writecache on ZFS is RAM (nothing else, 10% RAM, default max 4GB on Open-ZFS).

10G (1GB/s) write performance without sync is no problem for a ZFS system based on discs, with Intel Optane even 10G sync write performance is achievable ex https://napp-it.org/doc/downloads/optane_slog_pool_performane.pdf or https://napp-it.org/doc/downloads/epyc_performance.pdf

Thanks. I know what zil/slogs are and aren't and the difference between sync and non-sync writes. I've been using ZFS forever and will continue to do so where it makes sense. What I'm trying to solve for now is bulk storage of large files. I have a lot of HDDs, SSDs and now several NVME drives that I would like to 1) pool together into a single network share that will 2) saturate a 10gb connection with a sequential write. And 3) it would also be nice if I could expand the pool with pairs of drives later on. I don't really care if this system runs on Linux, FreeBSD, Windows, Fortran, MacOS or something else and I'm OK using raid10 rather than parity.

Currently, AFAIK, Windows Server is the only game in town that will do all three of those things without relying on 3rd party software. I currently use TrueNAS and have different pools for HDDs and SSDs. This is very inconvenient but currently there is no other option with ZFS. I have not specifically tried mixing SSD and HDD vdevs in a zfs pool, but I'm pretty certain it's a bad idea. What I would really like is for IX Systems to implement some sort of tiering so we can take advantage of the abundance of cheap and fast SSD storage and the resiliency of zfs without having to sacrifice somewhere else. That seems like a feature that a lot of people would jump on rather than all of the other rabbit holes they seem to have gone down over the last few years.

gea · Jul 8, 2022

1. Why do you want sync write for a media filer? ZFS is never in danger of filesystem corruptions on a crash during write due CoW and with large files there is no advantage in securing the content of the rambased writecache. On a crash during write the currently written file is damaged even with sync unless its is already completely in RAM writecache. Very unlikely.

Without sync, even an average diskpool is capable of 10G read/write performance. With sync enabled and an Optane Slog even 10G sync write is possible.

2. Mixing SSD and HD per normal vdev is quite senseless as the slower disks will limit performance. This is why you use special vdevs then to improve performance of problematic small datablocks (or per filesystem) or metadata. They are the only problem especially for hd pools.

3. Mirrors are best to optimize a pool for io. While a Raid Z vdev gives the iops of a single disk on read/write, each mirror gives 2 x read io compared of a single disk.

4. As said tiering can improve performance unless you do not move data between tiering stages. It is quite inflexible on dynamic use cases where the special vdev approach is more flexible as it improves performane based on data structures not on a hot/cold setting.

5. The fastest SMB server for 10/40G that I have found is Oracle Solaris with its multithreadad kernelbased SMB server and native ZFS. The free Solaris forks like OmniOS include the same multithreaded, kernelbased SMB server but are a little slower with Open-ZFS.

6. Any ZFS filer is faster than Windows with ReFS that is needed if you want only nearly comparable security like ZFS. And sync write security on Windows requires a hardware raid with flash/BBU protection - usually much slower than ZFS sync with a high performance Slog concept.

ColPanic · Jul 8, 2022

gea said:
1. Why do you want sync write for a media filer? ZFS is never in danger of filesystem corruptions on a crash during write due CoW and with large files there is no advantage in securing the content of the rambased writecache. On a crash during write the currently written file is damaged even with sync unless its is already completely in RAM writecache. Very unlikely.

Without sync, even an average diskpool is capable of 10G read/write performance. With sync enabled and an Optane Slog even 10G sync write is possible.

2. Mixing SSD and HD per normal vdev is quite senseless as the slower disks will limit performance. This is why you use special vdevs then to improve performance of problematic small datablocks (or per filesystem) or metadata. They are the only problem especially for hd pools.

3. Mirrors are best to optimize a pool for io. While a Raid Z vdev gives the iops of a single disk on read/write, each mirror gives 2 x read io compared of a single disk.

4. As said tiering can improve performance unless you do not move data between tiering stages. It is quite inflexible on dynamic use cases where the special vdev approach is more flexible as it improves performane based on data structures not on a hot/cold setting.

5. The fastest SMB server for 10/40G that I have found is Oracle Solaris with its multithreadad kernelbased SMB server and native ZFS. The free Solaris forks like OmniOS include the same multithreaded, kernelbased SMB server but are a little slower with Open-ZFS.

6. Any ZFS filer is faster than Windows with ReFS that is needed if you want only nearly comparable security like ZFS. And sync write security on Windows requires a hardware raid with flash/BBU protection - usually much slower than ZFS sync with a high performance Slog concept.

Either I’m not communicating properly or you aren’t reading what I’m saying. I don’t want sync writes. I never said I did. What I want (the #1 thing listed above) is to pool different types of storage together. I.e some ssd, some hdd and some nvme together into a single network share. I don’t know much about zfs special vDevs but my (limited) understanding is that they store metadata, dedupe tables and the like but do not contribute to overall capacity, similar to how a zil does not contribute to capacity. It would not be reasonable to put all of the flash storage I want to utilize into a special vDev unless it contributes to total capacity.

gea · Jul 8, 2022

It depends on what you want. 10G read/write ( say >600 MB/s) or prove of concept.

Special vdev is a method to improve io to performance sensitive data. If you do not need sync, a good ZFS pool up from say 4 mirrors is capable for 10G r/w, no real need for expensive NVMe or SSD or tiering beside the smaller datablocks that are really slow outside the rambased read/write caches of ZFS.

If you simply want 10G performance on ZFS, use enough RAM for read/write caching, normal disks for capacity (prefer mirrors for best iops) and optionally add a special vdev mirror only for those datablocks like metadata or small io that are otherwise slow on read/write. For larger datablock count up to 200 MB/s per hd and multiple per ZFS mirror on write and 2x on read.

If you just have SSDs/NVMes, add them as normal vdevs to improve capacity and slightly improve overall performance.

btw
I know a Windows server is very good regarding SMB performance but mainly if you avoid the resilent filesystem ReFS

Sean Ho · Jul 8, 2022

very clearly expressed, thx gea!

Mithril · Jul 8, 2022

It feels like a lot of the articles and testing around higher performance ZFS are old enough to be potentially misleading. There have been large changes on the software side (for example memory needs of dedupe are like 10% of what they used to be), and on the hardware landscape.

Some things remain very valid, ZLOG needs to be able to sustain massive small block sync writes for pool heath and speed.
Other things have changed, like persistence l2arc is now a reality and changes how useful it is to home users.

Related, I wouldn't personally use mirrors, the risk isn't worth the performance. With mirrors any 2-drive loss is potentially the entire pool. I'd rather run drives in sets of 4 disk Raid-z2 where I can still lose *any* 2 drives, or 6/8 sets if capacity if more important than performance. Maybe if the pool in question is intentionally less resilient and snapshots get exported to a safer pool

Lix · Jul 9, 2022

Have a look at this:

ColPanic · Jul 9, 2022

gea said:
It depends on what you want. 10G read/write ( say >600 MB/s) or prove of concept.

Special vdev is a method to improve io to performance sensitive data. If you do not need sync, a good ZFS pool up from say 4 mirrors is capable for 10G r/w, no real need for expensive NVMe or SSD or tiering beside the smaller datablocks that are really slow outside the rambased read/write caches of ZFS.

If you simply want 10G performance on ZFS, use enough RAM for read/write caching, normal disks for capacity (prefer mirrors for best iops) and optionally add a special vdev mirror only for those datablocks like metadata or small io that are otherwise slow on read/write. For larger datablock count up to 200 MB/s per hd and multiple per ZFS mirror on write and 2x on read.

If you just have SSDs/NVMes, add them as normal vdevs to improve capacity and slightly improve overall performance.

btw
I know a Windows server is very good regarding SMB performance but mainly if you avoid the resilent filesystem ReFS

Sure, but zfs will not currently allow me to combine different types of storage together into one share, correct? I already have (16) 960GB SSD drives that I want to use along with a bunch of 10TB spinners. If the special vdev contributed to capacity, then it would be perfect, but I cant find anything that indicates that it does.

And to be clear, I am currently using TrueNAS with 3 pools. One with spinning rust drives and slogs others with only flash storage. That's what I want to get away from.

Lix · Jul 9, 2022

Something like this then?

GitHub - 45Drives/autotier: A passthrough FUSE filesystem that intelligently moves files between storage tiers based on frequency of use, file age, and tier fullness.

A passthrough FUSE filesystem that intelligently moves files between storage tiers based on frequency of use, file age, and tier fullness. - 45Drives/autotier

github.com

Rttg · Jul 9, 2022

ColPanic said:
If the special vdev contributed to capacity, then it would be perfect

It does although you’ll likely need need to use the `special_small_blocks` tunable to force data (i.e., non metadata) blocks to the special vdev.

Tiered storage configuration

Member

seanho.com

Member

Well-Known Member

Member

Member

Attachments

Well-Known Member

Member

Active Member

Well-Known Member

Member

Well-Known Member

Member

Well-Known Member

seanho.com

Active Member

Member

Member

Member

Member