Tiered storage configuration


ColPanic

Member
Feb 14, 2016
130
23
18
ATX
I'm rebuilding my home/lab server and wondering if anything has changed with tiered storage. I haven't paid much attention over the last 4 or 5 years. I have an assortment of 8 and 10 TB HDDs, a bunch of 960GB SATA SSDs, and some NVMe drives I'd like to put together in a server. When I last looked into tiers, Windows Server with S2D was the only viable option, and it was problematic to say the least, so I ended up just doing separate TrueNAS pools for SSD and HDD. Has that changed at all, and are there any other options? What I would love to do is pool everything together in some sort of smart tiered system. No ZFS tiers yet?
 

Sean Ho

seanho.com
Nov 19, 2019
823
385
63
Vancouver, BC
seanho.com
Storage tiers are still a good idea in the sense of tailoring storage to the use-case, e.g., all-flash for DBs, big and slow spinners for bulk media. If you're looking for a "smart tiered" system, you're really talking about workload-agnostic caching, which has a limited scope of utility. Cache that's mismatched to the workload can even harm performance, e.g., what happens to many folks who prematurely add L2ARC.
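Before adding L2ARC it's worth checking whether the RAM-based ARC is already absorbing the read workload; on Linux OpenZFS the standard tools make this quick (a minimal sketch, nothing pool-specific assumed):

# summary of ARC/L2ARC sizes and hit ratios (ships with OpenZFS)
arc_summary
# or pull the raw hit/miss counters directly
grep -E '^(hits|misses)' /proc/spl/kstat/zfs/arcstats

If the hit ratio is already high, an L2ARC device mostly just spends ARC RAM on headers for little gain.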
 

ColPanic

Member
Feb 14, 2016
130
23
18
ATX
I'm more interested in write cache for ingesting media and general drive pooling than in more read cache; I'm not even using L2ARC drives. I just give ZFS gobs of RAM (64GB) and it does a pretty good job with read cache. I just wish there were a way with ZFS so that anytime I saved a big file it would write to NVMe and saturate my 10gb connection, then move it down to the slower tiers in an intelligent way. I currently only sustain 300-400MB/s to the NAS over 10gb, but if I go from one NVMe workstation to another I can sustain the full 1.2GB/s line speed. (I have ZIL drives but they don't make much of a difference.)

Windows Storage Spaces Direct supposedly does this with 3 storage tiers. I just don't know if the performance is any good or not. I know it's garbage with parity, but I'm OK with RAID 10. Here are the drives I have, but I'll probably add a few more 10TB.
For storage:
NVMe: 4 @ 1TB
SATA SSD: 16 @ 960GB (ZFS RAID 10)
HDD: 4 @ 10TB + 8 @ 8TB (currently ZFS RAID 10)
SSD: 2 @ 100GB (currently SLOG/ZIL drives)

For ESXi datastores:
HGST SAS SSD: 2 @ 800GB (mirrored) + 4 @ 400GB (RAID-Z1)
 

ColPanic

Member
Feb 14, 2016
130
23
18
ATX
I set up a quick test system on WS2019 with 1 HDD, 1 SSD and 1 NVMe drive and it looks promising. I'm getting over 800MB/s in either direction.
 


nabsltd

Well-Known Member
Jan 26, 2022
589
421
63
I just wish there were a way with ZFS so that anytime I saved a big file it would write to NVMe and saturate my 10gb connection, then move it down to the slower tiers in an intelligent way.
The problem is that ZFS is very old tech, and when it was designed, there was no device that was both big enough and fast enough to handle incoming writes that were 10x the size of RAM.

So, there was just no design thought given to a special vdev that would hold the transaction queue when it overflows RAM. Now, with NVMe able to keep up with even fairly fast network connections, it seems like a no-brainer to add as a feature.
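Short of that feature existing, the closest knob today is simply making that RAM write buffer bigger. On Linux OpenZFS the ceiling is the zfs_dirty_data_max module parameter (10% of RAM, capped at 4GB by default); a rough sketch, with the 8 GiB figure purely as an example:

# current limit for in-flight (dirty) write data, in bytes
cat /sys/module/zfs/parameters/zfs_dirty_data_max
# raise it until the next reboot (run as root; size it to your RAM and workload)
echo $((8*1024*1024*1024)) > /sys/module/zfs/parameters/zfs_dirty_data_max

That only buys a larger burst buffer, though; it isn't the NVMe landing tier being asked for.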

I believe they are Intel DC p3700
Have you partitioned or over-provisioned them? I don't see any P3700 smaller than 400GB. I have the 800GB model, and was thinking I'd partition down to about 200GB for SLOG, and the rest for whatever special thing might come along.
 

ColPanic

Member
Feb 14, 2016
130
23
18
ATX
My mistake, they are DC S3700. I may give Windows Server 2019 a try. I've been testing it for a couple of days and I can saturate a 10gb link all day long. Being able to pool different types of storage together with multiple tiers would be a big plus. I know that parity RAID is garbage there, but I'm fine mirroring everything for now.
 

ecosse

Active Member
Jul 2, 2013
466
113
43
There are a few third-party options for a Windows setup; I couldn't tell you if they're any good, but if Storage Spaces works it's probably not worth bothering :)

StableBit has an SSD plugin if you were using that.
 

gea

Well-Known Member
Dec 31, 2010
3,351
1,307
113
DE
Some thoughts..

Tiering is OK for separating "hot/fast" and "cold/slow" data, but there is a lot of storage load when moving data between tiers (reduced overall storage performance during the move).

The ZFS special vdev idea (from Intel) is a different and, I would say, superior approach. You separate data based on physical data properties like ZFS blocksize or ZFS recordsize. This forces small/slow datablocks onto a faster device, or you can force selected data onto a faster device (e.g. an NVMe mirror) by giving a ZFS filesystem a recordsize below the small-block threshold.

For that data (selected by physical blocksize or per ZFS filesystem) you get full NVMe read/write performance: https://www.napp-it.org/doc/downloads/special-vdev.pdf
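A rough sketch of that setup (pool, dataset and device names are placeholders, and the exact thresholds depend on your Open-ZFS version):

# attach a mirrored special vdev of two NVMe devices to an existing pool
zpool add tank special mirror nvme0n1 nvme1n1
# metadata now lands on the special vdev; also route small data blocks there
zfs set special_small_blocks=64K tank
# a filesystem whose recordsize sits below that threshold lands entirely on NVMe
zfs create -o recordsize=32K tank/fastdata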

btw
A Slog is not a write cache.
When sync is enabled, writes are always slower, but your RAM-based write cache is protected. For filer use, disable sync and forget about Slogs.
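For a filer dataset that is a one-liner (dataset name is a placeholder):

# treat all writes as async for this dataset; the ZIL/Slog is no longer used for it
zfs set sync=disabled tank/media
zfs get sync tank/media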

The write cache on ZFS is RAM (nothing else; 10% of RAM, default max 4GB on Open-ZFS).

10G (1GB/s) write performance without sync is no problem for a disk-based ZFS system; with Intel Optane even 10G sync write performance is achievable, e.g. https://napp-it.org/doc/downloads/optane_slog_pool_performane.pdf or https://napp-it.org/doc/downloads/epyc_performance.pdf
 
  • Like
Reactions: Sean Ho

ColPanic

Member
Feb 14, 2016
130
23
18
ATX
Thanks. I know what ZIL/Slogs are and aren't, and the difference between sync and non-sync writes. I've been using ZFS forever and will continue to do so where it makes sense. What I'm trying to solve for now is bulk storage of large files. I have a lot of HDDs, SSDs and now several NVMe drives that I would like to 1) pool together into a single network share that will 2) saturate a 10gb connection with a sequential write. And 3) it would also be nice if I could expand the pool with pairs of drives later on. I don't really care if this system runs on Linux, FreeBSD, Windows, Fortran, MacOS or something else, and I'm OK using RAID 10 rather than parity.

Currently, AFAIK, Windows Server is the only game in town that will do all three of those things without relying on third-party software. I currently use TrueNAS and have different pools for HDDs and SSDs. This is very inconvenient, but currently there is no other option with ZFS. I have not specifically tried mixing SSD and HDD vdevs in a ZFS pool, but I'm pretty certain it's a bad idea. What I would really like is for iXsystems to implement some sort of tiering so we can take advantage of the abundance of cheap, fast SSD storage and the resiliency of ZFS without having to sacrifice somewhere else. That seems like a feature a lot of people would jump on, rather than all of the other rabbit holes they seem to have gone down over the last few years.
 

gea

Well-Known Member
Dec 31, 2010
3,351
1,307
113
DE
1. Why do you want sync write for a media filer? ZFS is never in danger of filesystem corruption on a crash during write, thanks to CoW, and with large files there is no advantage in securing the content of the RAM-based write cache. On a crash during write, the file currently being written is damaged even with sync, unless it is already completely in the RAM write cache. Very unlikely.

Without sync, even an average disk pool is capable of 10G read/write performance. With sync enabled and an Optane Slog, even 10G sync write is possible.

2. Mixing SSDs and HDs within normal vdevs is fairly pointless, as the slower disks will limit performance. This is why you use special vdevs instead, to improve the performance of the problematic small datablocks (or of selected filesystems) and metadata. Those are the real problem, especially for HD pools.

3. Mirrors are best for optimizing a pool for IO. While a RAID-Z vdev gives the iops of a single disk on read and write, each mirror gives 2x the read iops of a single disk (see the pool layout sketch after this list).

4. As said, tiering can improve performance as long as you are not constantly moving data between tiers. It is quite inflexible for dynamic use cases, where the special vdev approach is more flexible because it improves performance based on data structure rather than a hot/cold classification.

5. The fastest SMB server for 10/40G that I have found is Oracle Solaris with its multithreaded, kernel-based SMB server and native ZFS. The free Solaris forks like OmniOS include the same multithreaded, kernel-based SMB server but are a little slower with Open-ZFS.

6. Any ZFS filer is faster than Windows with ReFS, which is what you need if you want data security even close to ZFS. And sync-write security on Windows requires a hardware RAID with flash/BBU protection, usually much slower than ZFS sync with a high-performance Slog setup.
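For point 3, a pool of striped mirrors looks roughly like this (pool and device names are placeholders); it also covers the wish to grow the pool in pairs of drives later:

# six disks as three 2-way mirrors, striped together
zpool create tank mirror sda sdb mirror sdc sdd mirror sde sdf
# expand later by adding another mirror pair
zpool add tank mirror sdg sdh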
 

ColPanic

Member
Feb 14, 2016
130
23
18
ATX
Either I'm not communicating properly or you aren't reading what I'm saying. I don't want sync writes; I never said I did. What I want (the #1 thing listed above) is to pool different types of storage together, i.e. some SSD, some HDD and some NVMe, into a single network share. I don't know much about ZFS special vdevs, but my (limited) understanding is that they store metadata, dedup tables and the like but do not contribute to overall capacity, similar to how a ZIL does not contribute to capacity. It would not be reasonable to put all of the flash storage I want to utilize into a special vdev unless it contributes to total capacity.
 

gea

Well-Known Member
Dec 31, 2010
3,351
1,307
113
DE
It depends on what you want: 10G read/write (say >600 MB/s) or a proof of concept.

A special vdev is a method to improve IO for performance-sensitive data. If you do not need sync, a good ZFS pool from about 4 mirrors up is capable of 10G r/w; there is no real need for expensive NVMe or SSD or tiering, apart from the smaller datablocks that are really slow outside ZFS's RAM-based read/write caches.

If you simply want 10G performance on ZFS, use enough RAM for read/write caching, normal disks for capacity (prefer mirrors for best iops), and optionally add a special vdev mirror just for the datablocks that are otherwise slow on read/write, like metadata and small IO. For larger datablocks, count on up to 200 MB/s per HD on write, multiplied per ZFS mirror vdev, and 2x that on read.

If you just have SSDs/NVMes, add them as normal vdevs to improve capacity and slightly improve overall performance.

btw
I know a Windows server is very good regarding SMB performance, but mainly if you avoid the resilient filesystem ReFS.
 
  • Like
Reactions: Sean Ho

Mithril

Active Member
Sep 13, 2019
432
148
43
It feels like a lot of the articles and testing around higher-performance ZFS are old enough to be potentially misleading. There have been large changes on the software side (for example, the memory needs of dedupe are something like 10% of what they used to be) and in the hardware landscape.

Some things remain very valid: a SLOG needs to be able to sustain massive small-block sync writes for pool health and speed.
Other things have changed; persistent L2ARC, for example, is now a reality and changes how useful it is to home users.


Related: I wouldn't personally use mirrors; the risk isn't worth the performance. With mirrors, any 2-drive loss is potentially the entire pool. I'd rather run drives in sets of 4-disk RAID-Z2, where I can still lose *any* 2 drives, or 6- or 8-disk sets if capacity is more important than performance. Maybe I'd use mirrors if the pool in question is intentionally less resilient and snapshots get exported to a safer pool.
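For illustration, eight disks laid out that way would look something like this (pool and device names are placeholders):

# two 4-disk raidz2 vdevs striped together; any two disks can fail without data loss
zpool create tank raidz2 sda sdb sdc sdd raidz2 sde sdf sdg sdh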
 

ColPanic

Member
Feb 14, 2016
130
23
18
ATX
A special vdev is a method to improve IO for performance-sensitive data. [...] If you just have SSDs/NVMes, add them as normal vdevs to improve capacity and slightly improve overall performance.
Sure, but ZFS will not currently allow me to combine different types of storage together into one share, correct? I already have (16) 960GB SSDs that I want to use along with a bunch of 10TB spinners. If the special vdev contributed to capacity, then it would be perfect, but I can't find anything that indicates it does.

And to be clear, I am currently using TrueNAS with 3 pools: one with spinning-rust drives and Slogs, the others with only flash storage. That's what I want to get away from.
 

Rttg

Member
May 21, 2020
73
48
18
If the special vdev contributed to capacity, then it would be perfect
It does, although you'll likely need to use the `special_small_blocks` tunable to force data (i.e., non-metadata) blocks to the special vdev.
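A minimal sketch of that, assuming a pool named tank that already has a special mirror attached and a dataset tank/fast (both names are placeholders; the allowed threshold sizes vary by OpenZFS version):

# send every data block of this dataset to the special vdev by making the
# small-block threshold at least as large as the dataset's recordsize
zfs set recordsize=128K tank/fast
zfs set special_small_blocks=128K tank/fast
# check where space is actually being allocated, per vdev
zpool list -v tank

Keep in mind that once the special vdev fills up, new blocks simply spill back onto the regular vdevs.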