TrueNAS general purpose write-caching

cheezehead · May 4, 2024

Looking for others to help up vote a general purpose write-caching method for TrueNAS.

Log in with Atlassian account

mrpasc · May 4, 2024

I believe that upvote will not have success. Don’t think iX will bork openZFS for that. Their development aimes towards enterprise customers. If those need fast writes they go all flash / all NVME.
Consider switching to something like BeCacheFS or just unRaid for home use.

Chriggel · May 4, 2024

I can't even read what has been suggested without logging in.

Anyway, into the blue, I'd say this is pointless. There are already several effective methods in place that speed up writes and some of them are cache or cache-like / cache-adjacent methods.

Writes are already collected in transaction groups before they're written to disk and that happens in RAM. Data is collected until the TXG is too big or the timeout is reached. You can tune these settings to cache more dirty data in RAM if you like.

You can decide if you want sync or async writes. If you want sync writes to slow pools, move the ZIL to a separate fast SLOG vdev.

You can define the Special Allocation Class to not only take metadata but also small files that would otherwise slow down your low IOPS pool and redirect them to a faster, separate vdev.

And if all that is not enough, you just make your pool faster so that basically all forms of caching and similar methods becomes completely irrelevant for you.

So, is there something missing here that a "general purpose write cache" could achieve?

nabsltd · May 4, 2024

Chriggel said:
Anyway, into the blue, I'd say this is pointless. There are already several effective methods in place that speed up writes and some of them are cache or cache-like / cache-adjacent methods.

Writes are already collected in transaction groups before they're written to disk and that happens in RAM. Data is collected until the TXG is too big or the timeout is reached. You can tune these settings to cache more dirty data in RAM if you like.

These really have limited effect, as you pay a lot of performance penalty for the safety of ZFS.

I have a Ubuntu system with 128GB of RAM that has a LSI 9361-8i with 6x 4TB drives in RAID6. Using a 3.2TB U.2 NVMe drive (HGST SN100) as cache for lvmcache, I can burst to well over 10GBytes/sec and sustain 2GB/sec for 5 minutes. I literally don't have enough network speed (only 10Gbps) to keep up.

To get that same kind of speed with TrueNAS, I'd need a pool with all SSD (maybe even NVMe). This means I can have much less expensive hardware (e.g., no need to mirror spinning disks for speed, and no need for a lot of SSDs) for the same amount of useable storage.

It seems silly that ZFS doesn't have some sort of tiered storage capability, where you could front spinning disks with a mirror of SSDs, and once the data is committed to the SSDs, ZFS would be happy. Eventually, the data would be copied from the SSD to the spinning disk. If the SSD size is large enough, it could even be used as L2ARC for the main pool.

mrpasc · May 4, 2024

ZFS was made (and is still intended for) hosting a massive amount of data which is important to be safe. Think long term archival of critical business data (in EU company’s have to store their business files for 10 years) Mostly on spinning rust. So you get lot of spindles and thus lot of write IO even with harddisks. ZFS starts to shine if you have hundreds of disks or thousands. It was never designed to deliver massive IO on our home lab gear being some disks or some flash. And I think (and I hope) the ZFS development will focus on reliability even in the future.
There are other developments like BeCacheFS which try to offer those kind of data tiering and write caching, so let’s see if there is a real demand in the business world. Us homelabbers and enthusiasts are a really small niche, I wouldn’t expect any company like iX would spent engineering times for such a request.

Chriggel · May 4, 2024

nabsltd said:
These really have limited effect, as you pay a lot of performance penalty for the safety of ZFS.

Well, if someone made the decision to use ZFS, they know about its properties and probably made it a deliberate choice. It's not as if ZFS is inherently slow, it's just slower than some alternatives but it all depends on your setup. And you say that TXGs have limited effect, but that's why I said you can tune those to whatever you think is appropriate. I don't know if there are any hard caps, but the default TXG timeout is 5 seconds and you can increase this. Some people use timeouts in the range of several minutes, which isn't necessarily the intention of ZFS, but if this fits your use case, you can do it. It will allow you to keep minutes worth of changes in RAM, basically caching everything at RAM speeds, if it fits in your RAM and doesn't exceed the TXG size limit.

The TXG concept is right at the core of ZFS, so that's going to stay anyway. I'm not a filesystem developer, but I guess the considerations one needs to make to even think about adding another storage tier are substantial. If this would be easy and fit the ZFS concept, they would have done it long ago, I guess.

And the end of the day, every kind of cache is only a temporary solution. If your pool isn't fast enough and you have a constant stream of new data and changes, your pool will never catch up and when your cache is saturated, performance will drop down to pool speeds. No cache in the world will prevent this, you can't have an arbitrarily slow pool and think that caching will fix this for you.

People who have performance issues with ZFS usually misunderstood some concepts and/or have unsuitable pool layouts and configurations in general or specifically for their use case.

If someone is extremely performance oriented to the point where not having a SSD cache between TXGs and the pool is a serious concern, they should probably use something else.

cheezehead · May 4, 2024

Thanks for the roasting. IXSystems like all businesses are looking to make money, if it makes financial sense they could. There is no fork to ZFS needed to support it. The functionality is there already with regular linux distros, it's really a question of if they want to take advantage of what's already out there.

They are not ZFS, they can choose to do as they will. While they are focused on development for businesses, "enterprise" is a lofty goal but there is a lot of SMB covering the bills (excluding labs). In general in the SMB space, they are competing against other hybrid arrays with better write tiering/caching options. Take a 12-36 HDDs, add a pair of SSD's and let the SSD's take the hits with destaging down to the rust. These make good arrays for SMB and good tier-2 arrays in enterprise. Once your in the all NVMe space, there's little sense for something like it unless your pairing Optane with lower performance/high capacity NVMe drives that can't take the DPW workloads...but even then; given the price point many enterprises have the budget/support requirements/purchasing restrictions and will end up with Pure, PowerStore, ect.

BcacheFS isn't ready for primetime yet. Lvmcache was mentioned as was regular Bcache (BcafeFS is based upon this). These are similar caching methods to how NetApp, Synology, ect handle their hybrid arrays.

Mithril · May 5, 2024

So, I've looked into a few "on top of" ZFS ideas for both tiered storage as well as OFFLINE dedupe at the file level. There *are* several solutions out there, and with TrueNAS core it would be easier to add as several are agnostic as to file system or the features they need ZFS has. However, the big "gotcha" with all of these (so far as I have found) is that it "breaks" snapshoting. By that I mean if you have HOT-COLD tiers (or more) and data moves around, since each tier is a ZFS pool you get thrashing; same problem with offline dedupe. It also complicates replication via snapshots.

Of my way too many projects here is what I am thinking is a solution for homelabs and SOHO/SMB: Mirrored (or raidz) optane namespaced or partitioned for ZIL and "special dev" (metadata), with *persistent* (this is key) L2ARC either on optane or an enterprise grade SSD depending on hardware and usecase. I currently have this *somewhat* deployed but really need to get some time to do benchmarks and compare with some real datasets as well.

- ZIL on optane sized and speced for the network speed and pool speed alleviates much of your SYNC write woes.
- the special dev: store metadata speeding up many file system lookups, store dedupe tables making it unlike to run into "dedupe results in ram starvation" (yes, tables are still stored in ram to a degree, but being backed by NVMe/optane means no death spiral or reading from HDD)
- Set the special dev to store small "files", this acts as a partial write and read cache for the data most impacted by random IO which HDD lack.
- persistent L2 arc acts as a "frequently used" cache, since this is a cache only it does NOT need to be mirrored (you could stripe it if you want).

Optane (the better ones, not the little 16GB) tend to be well suited to being slammed constantly with small reads and writes and handling mixed IO. My limited testing of using a pair of 118GB M2 optane for both ZIL and metadata (with small files and dedupe) has been reasonable positive on a machine with 10Gb networking.

Given the rising cost of flash for the foreseeable future due to the "it's not collusion to raise prices if we don't directly talk" I expect all but the biggest players to be *interested* in any solution that is not "just make a full flash array".

gea · May 5, 2024

Be aware

With enough RAM, L2Arc only helps when you reboot quite often and you have many users with many volatile files
For a pure filer, sync write is not needed and only lowers write performance without gain.
A special vdev is not a cache but the only place where this data is stored. Unlike L2Arc or Slog, a special vdev failure is a pool lost.

Mithril · May 7, 2024

gea said:
Be aware

With enough RAM, L2Arc only helps when you reboot quite often and you have many users with many volatile files

For a pure filer, sync write is not needed and only lowers write performance without gain.

A special vdev is not a cache but the only place where this data is stored. Unlike L2Arc or Slog, a special vdev failure is a pool lost.

I'll have to read up again, but when I checked it seemed like *persistent* L2Arc (newish to TrueNAS) behaves differently and ends up being a frequency based read cache; any read cache is only as good as the cachehit ratio, and we do lose the idea of a HOT tier (recently written data being automatically in the read cache by nature. Not flawless but also fully "within" ZFS. Recent changes reduce the RAM needed for persistent L2arc as well.

I'm aware, I don't force sync. Since we are using namespace or partitions we're only giving up 10-20GB of the optane to prevent anything that does request sync writes to not grind to a halt, IMHO worth the trade.

Correct, which is why I call out doing a mirror here (as in, minimal fault tolerance of a single device) . Also we all backup our NAS and VM hosts right?

Personally I run my backup with Raid-Z2 vdevs as mirrors can fail with only 2 lost drives, but a pool of mirror vdevs is much snappier for primary use. Mirror special on a pool of mirrors can be removed fairly cleanly so long as at least 1 of the mirror drives is ok in an emergency. Personally I think everything else in my system is going to fail before mirrored optane

nabsltd · May 7, 2024

mrpasc said:
ZFS was made (and is still intended for) hosting a massive amount of data which is important to be safe. Think long term archival of critical business data (in EU company’s have to store their business files for 10 years) Mostly on spinning rust.

The safest and cheapest way to "long term archive" a "massive amount of data" is tape. If you actually need spinning-rust speed access to the data on a regular basis, then it's not really archive.

That said, I'd much rather use a distributed filesystem for a such an archive, if I really wanted safety and needed occasional live access. Any storage with a single point of failure and corruption (e.g., one system with bad RAM, a glitchy HBA, etc.) isn't as safe as it could be.

nabsltd · May 7, 2024

Chriggel said:
And the end of the day, every kind of cache is only a temporary solution. If your pool isn't fast enough and you have a constant stream of new data and changes, your pool will never catch up and when your cache is saturated, performance will drop down to pool speeds. No cache in the world will prevent this, you can't have an arbitrarily slow pool and think that caching will fix this for you.

Yes, if you truly ingest 24/7 at faster than the pool write speed, then you will eventually slow to that. In that case, though, you designed your system incorrectly.

And note that tiered storage isn't cache...you don't care about getting the data from the ingest to the pool within some fixed time limit. Once the data is written to the fast tier, it's safe. Eventually it gets copied to the pool, but that can happen when the pool is idle. The great part about tiered storage is that you can always add more fast tier if your have designed incorrectly and your ingest is so fast that everything in the fast tier still needs to be written to the pool.

I really can't see how anybody could design a system so poorly that they have NVMe drives that can keep up with 100Gbps Ethernet, but only get enough of them to store less than a minute of saturated network. 1TB of NVMe is good for nearly 2 minutes of saturated 100Gbps. If you are writing that fast, you can afford to buy 30-60 minutes worth of network saturation.

Chriggel · May 8, 2024

nabsltd said:
Yes, if you truly ingest 24/7 at faster than the pool write speed, then you will eventually slow to that. In that case, though, you designed your system incorrectly.

Oh, absolutely. But following your example, it's not just ingesting 24/7 and then eventually hit the limit at some point. It could be as little as minutes, maybe even seconds, until you've reached the point. Or never achieve expected speeds in the first place. Incorrect design and incorrect expectations happen very frequently.

nabsltd · May 8, 2024

Chriggel said:
It could be as little as minutes, maybe even seconds, until you've reached the point.

Again, there are many datacenter NVMe drives than can sustain close to 2GB/sec for the entire size of the drive. A 3.8TB drive could handle that speed for over 30 minutes. This is a trivial solution for up to 25Gbps on the ingest network.

Basically, the speed of the ingest network determines the size and config of the "fast" tier. This is so simple that it should never be done incorrectly by anyone with half a brain.

Chriggel · May 9, 2024

Not sure if you got this the wrong way, I don't disagree with you.

Recap:
cheezehead is asking for general purpose write cache. If it's really write cache or just another persistent storage tier is another question.
Neither exists in ZFS right now, but there are other methods in place to speed things up and ZFS can be reasonably fast, IF you got your storage layout and configuration right and have the correct expectations. There are many examples all over the internet where people failed in at least one category. This is the situation right now and you're correct, it maybe would make it more trivial if such a solution would exist, but it doesn't, so here we are.

Am I against this feature in ZFS? No. But I don't think it will happen anytime soon, or maybe ever. And in the the meantime, there are less trivial (and/or more expensive) ways to optimize performance, so it's not as if you don't have options already. Not everyone will like all of them, for sure, for different reasons. Understanding all the inner workings of ZFS isn't trivial for everyone. Spending $$$ on all flash storage is also not everyone's thing.

Search

TrueNAS general purpose write-caching

cheezehead

Active Member

mrpasc

Well-Known Member

Chriggel

Member

nabsltd

Well-Known Member

mrpasc

Well-Known Member

Chriggel

Member

cheezehead

Active Member

Mithril

Active Member

gea

Well-Known Member

Mithril

Active Member

nabsltd

Well-Known Member

nabsltd

Well-Known Member

Chriggel

Member

nabsltd

Well-Known Member

Chriggel

Member