Cache plans for ZFS server

Discussion in 'DIY Server and Workstation Builds' started by loxsmith, Sep 30, 2019.

  1. loxsmith

    loxsmith New Member

    Joined:
    Jun 12, 2019
    Messages:
    7
    Likes Received:
    1
    So I have taken the plunge and am building a Linux lab server around an E-2100 Xeon. I have six 8TB SATA drives that I plan to put into a ZFS RAIDZ2 pool. I am considering adding some L2ARC and ZIL/SLOG cache to increase IOPS, and I have a 240GB M.2 SSD (Corsair MP510) for that. To complete the picture, I'll have 32GB RAM initially and plan to upgrade when larger UDIMMs are available: either increase to 96GB or replace with the maximum possible, 128GB. Network connectivity is 1Gb Ethernet, and the system will be used mostly by myself for storage, containers and virtual machines, plus 2 or 3 others who will mostly store files over NFS and/or SMB.

    I also plan to port over another two 4TB SATA drives that I already have (so I may as well use them), which I'll use either as a mirror pool or as two separate non-redundant (striped) pools. Not sure which, as I think the drives may have different RPMs. This space will be used for ad-hoc unimportant stuff (so lack of redundancy is acceptable) and secondary backups of some datasets on the main pool. I can also put in a second MP510 SSD for more cache if that makes sense.

    I'm not sure how to arrange the cache SSDs, which will need to be partitioned (because the SSD will also be the boot drive) and potentially used for both ZIL/SLOG and L2ARC, which I may also want for the other pools. Not sure yet.

    How big do the cache partitions need to be? I've read that the SLOG need be no more than 1GB (based on the network bandwidth) and the L2ARC about 4x RAM, which would be 128GB. I understand I need separate cache devices for each pool that I want to cache. I've also read that log/cache devices can be added to and removed from pools, so these things could be changed around if necessary.
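    As a sanity check on that 1GB figure, here's my own back-of-envelope arithmetic (the link speed and transaction-group window are my assumptions, not from any doc):

```shell
# A SLOG only needs to hold a few seconds of incoming sync writes.
# Assumed numbers: 1Gb/s link, two 5-second transaction-group windows.
link_mbit=1000
txg_seconds=10
slog_mb=$(( link_mbit / 8 * txg_seconds ))   # MB/s x seconds
echo "SLOG needs ~${slog_mb} MB"             # ~1250 MB, so a few GB is plenty
```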

    I could mirror the log partitions and use unmirrored partitions for the cache. That would allow me, with two SSDs, to provide L2ARC for each of two pools and mirrored ZIL/SLOG for all pools. That seems like a reasonable compromise given budget and intended use.
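    For the record, the arrangement I have in mind would look something like this (device and pool names are placeholders; I haven't run these yet):

```shell
# Two NVMe SSDs, each partitioned: p2 = small log slice, p3 = cache slice.
# Mirrored SLOG for the main pool:
zpool add tank log mirror /dev/nvme0n1p2 /dev/nvme1n1p2

# Unmirrored L2ARC, one partition per pool:
zpool add tank  cache /dev/nvme0n1p3
zpool add tank2 cache /dev/nvme1n1p3

# Both kinds of device can be removed again if the layout changes:
zpool remove tank /dev/nvme0n1p3
```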

    I realise using partitions isn't what one would do on a production commercial system, but this is for a lab environment with a limited budget. Given that, is the above reasonable?

    Thoughts/comments/suggestions appreciated.
     
    #1
  2. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,241
    Likes Received:
    742
    some remarks about cache

    Cache on ZFS is basically RAM.
    For the write cache, genuine Solaris ZFS caches around the last 5s of writes; Open-ZFS defaults to a write cache of 10% of RAM, max 4GB. The main read cache is the ARC, which is also RAM-based. It caches on a most-read/last-read basis, but only small random reads and metadata, not sequential data.

    You can extend the ARC with an L2ARC SSD or NVMe (max 5-10x RAM), but this also caches only random reads, not large sequential ones. You can enable read-ahead for the L2ARC, which can improve performance a little. With 32GB RAM or more I would not expect an improvement from L2ARC unless you have a multi-user environment with many small files. I would skip the L2ARC in favour of more RAM.

    Slog is NOT a write cache. A cache device sits between RAM and datapool; a Slog does not. It is a protector for the RAM-based write cache when sync is enabled, as it logs the last writes. It is only read after a crash, to write the last cache content to the pool on reboot. For a Slog you need ultra-low latency, high steady write iops and powerloss protection. Your SSD offers none of these. So either use a proper Slog (e.g. Intel Optane), just enable sync (slow), or do not use sync at all. For a pure filer this is uncritical; for databases and VM storage, sync is a must.
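    To illustrate, sync is a per-filesystem property (dataset names here are only examples):

```shell
zfs set sync=standard tank/files    # honour application fsync requests (default)
zfs set sync=always   tank/vms      # every write goes through the ZIL/Slog
zfs set sync=disabled tank/scratch  # fast, but last seconds are lost on a crash
```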

    btw
    I would always use a small system disk/SSD and separate disks for the datapool. With an Optane you can use partitions without a serious performance degradation due to concurrent use, but in general this is not a suggested setup.
     
    #2
    Last edited: Sep 30, 2019
    frogtech and loxsmith like this.
  3. loxsmith

    loxsmith New Member

    Joined:
    Jun 12, 2019
    Messages:
    7
    Likes Received:
    1
    Thanks for this informative answer @gea. I finally get why the write "cache" is called a "log".

    So, if I understand correctly, for write protection I should just get an Optane SSD and partition it so my pools can have Slog, and enable sync for them. This is worth doing as I plan on running some VMs, but separate Optane SSDs for this isn't practical for my situation, at least to begin with.

    L2ARC I could stick on the partitioned SSD I have, although there isn't much gain to be had. I will upgrade the memory at some point, once larger UDIMMs are available. Maybe I'll leave L2ARC until then.
     
    #3
    nikalai likes this.
  4. DedoBOT

    DedoBOT Member

    Joined:
    Dec 24, 2018
    Messages:
    36
    Likes Received:
    5
    What Gea said. I will add that the bottleneck will be the 1Gb/s network, not the filer itself.
     
    #4
    nikalai likes this.
  5. dragonme

    dragonme Active Member

    Joined:
    Apr 12, 2016
    Messages:
    282
    Likes Received:
    28
    So many people try to use (or, more correctly, misuse) every feature, knob and dial on the ZFS file system without ever figuring out if they even need it.

    I have been running ZFS backing several servers since 2005, and now an ESXi install, and have never once needed or used a cache device. Further, I don't run my primary pools with ANY redundancy ... that's right, none ... because redundancy, i.e. mirror/RAIDZ, is NOT a backup. It's not to prevent the disaster of data loss; it's to keep a box going IF a drive goes down (for the most part - there are other use cases, but few) so that a server doesn't have to be pulled offline to repair a dead or dying drive. Just have a BACKUP pool that has redundancy, so that you can fix bit-rot and have a more robust backup. As long as you back up regularly, there is little reason to run with data-correction/redundancy pools ... unless you just like spending money on cache devices to speed back up what you purposely slowed down. It's like driving with the parking brake on ... on purpose.

    So running 3-5 drive stripes of even spinning rust is way faster than most people need (again, there are edge cases - really big data or databases - but most small businesses don't even have that requirement).

    ZFS is designed to be 'ENGINEERED STORAGE' ... i.e. built for a purpose by someone who knows the file system, its capabilities, and the system's requirements. So if you don't NEED the iops, don't spend the money on it.

    my 2 cents
     
    #5
    SRussell and DedoBOT like this.
  6. Thomas H

    Thomas H Member

    Joined:
    Dec 2, 2017
    Messages:
    55
    Likes Received:
    19
    Are you saying layout ZFS pools as so:
    Primary Pool:
    • stripe all drives (i.e., no RAIDZ redundancy/mirror)
    • caching not needed (even RAM?)
    • Pros: fastest data with the benefits of ZFS
    • Cons: any drive down will lose all data (from last backup), downtime until restored, is prone to bit-rot??
    Backup Pool: for backing up primary pool
    • use RAIDZ redundancy/mirror
    • fixes bit-rot in backups
    • Pros: robust, reliable backups
    So in summary, a fast striped primary pool combined with a backup pool using RAIDZ redundancy is best, knowing this is for non-mission-critical use, some server downtime until the drive is replaced and data restored from backup is tolerable, and loss of some data since the last backup is acceptable?
     
    #6
  7. dragonme

    dragonme Active Member

    Joined:
    Apr 12, 2016
    Messages:
    282
    Likes Received:
    28

    Not exactly, and given your statements above I would suggest reading a couple of ZFS reference materials to actually learn how ZFS works, its terminology, and its major settings and preferences, without relying on what is mostly bad internet commando ju-ju out here.

    What I can say with definitiveness, not knowing your actual storage needs first hand, are these generalities and widely accepted statements:

    ZFS is not a 'consumer level' file system. It's not DOS. It's got lots of bells and whistles and is a file system that was designed to be set up and run by storage engineers for enterprise data. And yes, the benefits of ZFS are applicable to home use, but you have to know what you are doing.

    In MY USE CASE, yes, striping drives on my online pool and running redundancy on my backup pool is a good fit for ME. The online pool is 100% storage, WAY faster than anything hitting it, with no expensive L2ARC or ZIL device (RAM caching is not a log device, and is hard to fully turn off, not that you would want to).
    YES, one of the limitations of running without redundancy is that if there is bit-rot or a failed drive I MAY have data loss not yet backed up to my backup pool. Again, zpool status has flags you can use if errors are present, and it will let you know what files, if any, are affected. Running without raidz does not mean it's not checksumming; it just can't repair. That is what my backups are for.

    RAIDZ, as explained to me both in documentation and by an engineer, is so the pool can remain online if a drive fails until a backup array is brought up. It keeps a website from going down or a database from completely stopping. But most say they would never rebuild while it was actually hot; they would transfer the workload to a snapshot backup, pull the bad pool offline and likely scrap it. Rebuilding a pool with a failed drive is false economy, because if one drive failed there is now a likelihood that others in the array are not too far behind.

    Additionally, running without RAIDZ allows me to expand the hot pool 1, 2 or 5 drives at a time, rather than having to expand an 8-drive RAIDZ2 with ANOTHER 8-drive RAIDZ2.

    What I do is expand in 2s or 3s: as a 2-drive pool fills, add another 2 drives, etc. That gives me plenty enough bandwidth/IOPS for what I need. Then, when the first pair starts getting up around 3-4 years old, or when much higher capacity drives are around, I snap and send to a new pool of higher-capacity drives, move the 4-5 old drives over to my backup pool, and add another 5-drive raidz stripe.

    This saves me a bunch of money, since I don't have to buy 5 expensive 10TB drives for a 30TB raidz2. I can buy 20TB worth in 2 drives now; next year or the year after, when they start getting full, those drives will cost me HALF as much. Add 2 more for hundreds less than if I had bought them all at the same time.

    etc etc..

    ZFS is a massive toolbox ... no one tool better than another, no setup more right than another. You have to build your pool to your requirements: space, speed, and COST. And remember, no single pool, regardless of how much redundancy it has, is a backup. No backup, no safety.
     
    #7
  8. ttabbal

    ttabbal Active Member

    Joined:
    Mar 10, 2016
    Messages:
    723
    Likes Received:
    193
    Personal choice. There is no "one true way", there are only various tradeoffs. ZFS isn't any more magic than any other setup. It does things in a way a lot of us find useful though.

    If you are fine with some downtime and are willing to use a rigorous backup strategy, there is nothing wrong with that approach. Note that the RAM cache, ARC in ZFS terms, is always there and used. A cache device (L2ARC) has different uses, and for most people in a home setting, is probably not all that helpful with the default settings. It can be for some things, but you have to hit the rust at some point. Gea covers that well in his various posts, no need to rehash it here. A non-redundant pool will not have undetected bit-rot, but will likely get some eventually. It will not be able to auto-repair though, as there is no redundancy to repair from. The checksums still work though, so you will at least know there is a problem. You could also do copies=2 for some redundancy/auto-repair, but at that point you might as well just do real redundancy. Perhaps for really important data that changes quickly it might be useful, as it can be a per-filesystem setting.
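    A quick sketch of that copies=2 idea (dataset name is made up, not from this thread):

```shell
# Store every block twice even on a single-disk vdev, so a detected
# checksum error can often self-repair, at the cost of half the space:
zfs set copies=2 tank/important
zfs get copies tank/important
```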

    Redundant setups add the ability to auto-repair bit-rot and prevent downtime if you suffer a disk failure, which everyone does eventually. That capability comes with a cost, as all features do. This one is speed. More work means less speed. If you are on 1Gb networking, don't worry about that. You will likely never be bottlenecked on the drives, with the possible exception of small chunk IOPS. Particularly with RAIDZ.

    I chose to go another route: striped mirrors, AKA RAID10. I still get redundancy, but much higher performance compared to RAIDZ. It is still lower than pure stripes (RAID0) in most cases. The upside is I also get the ability to expand in 2-disk sets. However, the potential for total failure of the array is higher than a RAIDZ2 with a similar number of drives, so you have to be willing to be vigilant and keep an eye on things. In theory I could use higher mirror counts to mitigate that to some degree, but it's good enough for my needs this way.

    And yes, redundancy/RAID is not a backup. I have a whole different server with backup pools in RAIDZ. Slower, but faster and cheaper than tape for my needs.
     
    #8
  9. dragonme

    dragonme Active Member

    Joined:
    Apr 12, 2016
    Messages:
    282
    Likes Received:
    28
    @ttabbal

    "A non-redundant pool will not have undetected bit-rot, but will likely get some eventually."

    Um, no, it will never have undetected bit-rot unless you manually turn checksumming off. It MAY get DETECTED bit-rot which it can't self-repair. But having run ZFS at home, on consumer SATA drives (with ECC for much of it) for more than a decade, the only errors I've seen were the result of a loose SATA cable. I have never witnessed the cosmic bit-flip phenomenon ... yet.

    I have had several hard drives on non-redundant pools begin failing with reallocated sectors, and other than slowing the pool during writes, ZFS never laid down a single bad bit. I have not lost a file in more than 10 years since switching to ZFS ... now touch wood ...


     
    #9
  10. dragonme

    dragonme Active Member

    Joined:
    Apr 12, 2016
    Messages:
    282
    Likes Received:
    28
    "I chose to go another route, stripped mirrors, AKA RAID10."

    You forgot to mention the most costly tradeoff of RAID10/mirrors: you only get 50% storage efficiency, i.e. your TB stored costs double what mine does. And it's still not a backup, so you still need to spend money on a backup pool that also will (or should) have less than 100% usable storage. Lots of overhead there.


     
    #10
  11. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,241
    Likes Received:
    742
    In the past, a multi-mirror pool was the only way to improve iops, as iops scale on reads with 2x the number of mirrors and on writes with the number of mirrors. Today nobody uses mirrors when high iops is a concern; you add enough RAM for read/write caching or use SSDs instead. Newer ZFS features like special/dedup vdevs even allow mixed-vdev pools of disks and SSD/NVMe, where metadata, dedup tables, small io or special filesystems can land on the faster vdevs.

    The only advantages of mirrors nowadays are the easier capacity increase by adding more mirrors (until adding a disk to a Raid-Z lands in ZFS) and a simple n-way mirror when you mainly need superior data security. I know of several pools with a single 3-way mirror for this reason.

    A pool without redundancy (with backup only) seems a no-go to me. A disk failure is guaranteed over time, and the backup is then like bread: always from yesterday, or worse. Redundancy and snaps are most important for data security. This is your everyday insurance policy.

    Backup with ZFS is for the disaster case (fire, theft, lightning, amok hardware etc.)
     
    #11
    Last edited: Oct 24, 2019
  12. ttabbal

    ttabbal Active Member

    Joined:
    Mar 10, 2016
    Messages:
    723
    Likes Received:
    193
    @dragonme ... Um, in the first reply we're saying the same thing... :) Note the "not".

    What happens when a drive goes bad depends on the specific case. Some slow down with reallocated sectors, some just disappear. I have had them completely die more often than do the slow-down thing, but it's also a sample size of 6 or so, so not really enough to say which is the more common mode.

    50% capacity, yeah, that's a downer, but not one I choose to care about at the moment. It should be obvious that is the case, if one knows anything at all about RAID setups, but yes, it's valid to point it out. It helps that my drives are cheapo 2TB for the most part, and I don't really need more space at the moment, so the cost isn't really a big deal. Like I said though, there is no "one true way", just which tradeoffs work for me or for you and our use cases. I only mentioned my setup as mirrors hadn't been brought up, so I figured it might be interesting to the OP along with my personal use case/reasons for using them.

    @gea ... Are those new features Oracle only? I only run OpenZFS, and haven't really paid much attention to the Solaris only stuff. While my personal use is unlikely to be of any interest to Oracle, they are aggressive enough that I choose not to deal with them. It is good to be aware of them though.
     
    #12
  13. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,241
    Likes Received:
    742
    Special/dedup vdevs are an Open-ZFS feature and not in Oracle Solaris. You can try them, for example, in the newest Illumos (OmniOS 151031+ / OpenIndiana) or ZoL 0.8+.
     
    #13
  14. dragonme

    dragonme Active Member

    Joined:
    Apr 12, 2016
    Messages:
    282
    Likes Received:
    28
    @gea

    'In the past, a multi mirror pool was the only way to improve iops'

    Again, not really true: your IOPS improve with vdevs, so adding more raidz vdevs improves IOPS. I have a backup pool of 3 x 5-drive raidz vdevs and that pool reads and writes at lightning speed.

    Now, what is true is that in general it's cheaper and easier to expand pairs than it is to expand 3+-wide raidz.




     
    #14
  15. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,241
    Likes Received:
    742
    While this is correct (iops scale with vdevs), a multi-vdev Raid-Z will never reach iops similar to mirrors.

    E.g. your pool with 15 disks in 3 x raid-z vdevs gives you 3x the iops of a single disk on read and write. If you estimate around 100 iops per disk, your pool has around 300 iops on read and write.

    A pool of 7 x mirrors has less capacity but would offer 700 iops on writes and 1400 iops on reads (as ZFS can read from both disks simultaneously).
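    The same estimate as arithmetic (100 iops per disk is an assumption for 7200rpm disks, not a measurement):

```shell
disk_iops=100
raidz_vdevs=3
mirror_vdevs=7
echo "raidz pool : $(( raidz_vdevs * disk_iops )) iops read and write"   # 300
echo "mirror pool: $(( mirror_vdevs * disk_iops )) write iops"           # 700
echo "mirror pool: $(( 2 * mirror_vdevs * disk_iops )) read iops"        # 1400
```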
     
    #15
  16. dragonme

    dragonme Active Member

    Joined:
    Apr 12, 2016
    Messages:
    282
    Likes Received:
    28
    I never said it did. But blanket statements like ZFS has to be this, has to use that, only mirrors this, or you have to use cache, or you need 1GB of memory for every TB of data ... all just absolutely wrong BS.

    And that is why most people who just read these blanket statements, instead of actual tech documents on ZFS, will either never get it working right, spend way too much on an overcomplicated setup they don't need, lose data, or a combination of all the above.




     
    #16
  17. dragonme

    dragonme Active Member

    Joined:
    Apr 12, 2016
    Messages:
    282
    Likes Received:
    28
    Hehe, that is what I thought you probably meant to say. But after the comma, "but will likely get some (undetected bit-rot) eventually" was kinda misleading to the uninitiated.

    Provided checksumming is on, ZFS will never read or write bad data without letting you know. If you are not using ECC memory there is the slightest chance that ZFS gets handed garbage, and it will happily checksum that garbage and store it without complaint, but that is a different discussion. I ran 24TB of storage behind 8GB of total system RAM (non-ECC) on my first ZFS box for 9 years without a single issue, on SATA disks and a cheap Core 2 Duo E8400, and never could max it out on my network.

    ZFS not only checksums on write, it does it on every read. Scrubs really don't need to be run as often as most people run them; it's not like ECC or a hardware RAID where you want it 'on patrol'.



     
    #17
  18. DedoBOT

    DedoBOT Member

    Joined:
    Dec 24, 2018
    Messages:
    36
    Likes Received:
    5
    Too much philosophy; real life shows this (no compression or SSDs involved):
    time mkfile 20g ttt
    real 0m4.638s
    user 0m0.144s
    sys 0m4.445s
    :D
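    Working out the implied rate from those numbers (20g being GiB for mkfile), it is presumably mostly the RAM write cache being measured, since no spinning pool sustains this:

```shell
# 20 GiB written in 4.638 s of wall-clock time:
awk 'BEGIN { printf "%.0f MB/s\n", 20 * 1024 / 4.638 }'   # ~4416 MB/s
```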
     
    #18
  19. am45931472

    am45931472 Member

    Joined:
    Feb 26, 2019
    Messages:
    35
    Likes Received:
    4
    I think my question is relevant here since this is essentially a workload ZFS discussion.

    I've always hosted my Proxmox virtual machines on local storage: mirrored Intel P3600s on ZFS on Linux. But I have always wondered what the performance/speed/latency would be like hosting them over a 10 or 40Gb network instead, on, say, FreeNAS. I have a 10Gb switch, but I understand that getting network speeds higher than 10Gb can be difficult to set up. Obviously, when done locally, NVMe disks like the P3600 do at least 2000-3000MB a sec, way over 10Gbit. No idea what hosting over the network would do to I/O.

    Many of my VMs are Windows 10. Win 10 tends to like low latency and high clock speed, so as not to hurt the experience of running Windows. A little different than just hosting a database VM.
     
    #19