Advice on ZFS/FreeNAS for 40TB


fossxplorer

Active Member
Mar 17, 2016
Oslo, Norway
Hi all, it's now time for me to expand storage for a production environment. Until now I've been running mdadm, LVM and XFS on 4x4TB HGST NAS disks. It's used for file storage only (no VMs, databases, etc.).
Now I've bought 5x8TB Seagate IronWolf drives for either mdadm RAID 10 or ZFS raidz2.
I need advice on the following:
Option 1: I'd really like to keep the host OS CentOS 7,
so could I install FreeNAS as a KVM guest, pass the disks through to create a raidz2, AND export it using NFS? There are at least 5 VMs needing access to the NFS export.

Option 2: I've read people mention ZoL on Ubuntu 16.04. What would be different here compared to FreeNAS as a guest handling the raidz?

The most important thing is not losing the files or running into issues with rebuilds etc.
Thanks a billion for any advice.
 

Blinky 42

Active Member
Aug 6, 2015
PA, USA
Are you adding the 5x8TB drives into the same server or replacing the current 4x4TB ?

From a keep-it-simple perspective, you can just build another mdadm RAID array (16TB as RAID 10 plus a hot spare, 24TB as RAID 6, or 32TB as RAID 5 if you are feeling lucky and have a cold-spare 8TB drive on hand, ready to swap in ASAP if needed).
Make that a PV and add it to your existing VG, or to a new VG, and you are done. Everything can stay in the main C7 instance without any need for VMs or passing drives through.
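Something like this, as a rough sketch of the RAID 6 variant (the md device, disk names and VG name are placeholders for whatever your system uses):

mdadm --create /dev/md1 --level=6 --raid-devices=5 /dev/sd[b-f]
pvcreate /dev/md1
vgextend vg_data /dev/md1                  # or vgcreate a new VG instead
lvcreate -l 100%FREE -n lv_storage vg_data
mkfs.xfs /dev/vg_data/lv_storage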

You can set up NFS, CIFS and/or iSCSI exports of your new/expanded storage pool pretty easily in C7 by just adding the correct daemon for each.
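For NFS, for example, it's roughly this (the export path and subnet are placeholders):

yum install -y nfs-utils
echo '/srv/storage 192.168.1.0/24(rw,sync,no_root_squash)' >> /etc/exports
systemctl enable nfs-server
systemctl start nfs-server
exportfs -rav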
 

ttabbal

Active Member
Mar 10, 2016
You could add to the existing mdadm setup... Or do you not want to? Since you're asking about ZFS/FreeNAS...

Yes, you can run FreeNAS as a KVM guest. However, do not pass the individual disks through to the VM. That seems to be problematic. The best practice is to pass through a controller and let FreeNAS manage that. If reliability is the priority, this is the best option for a storage VM.
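As a rough sketch of controller passthrough with libvirt (the PCI address and the guest name "freenas-vm" below are just examples; find the real address with lspci, and make sure IOMMU is enabled in the BIOS and on the kernel command line first):

lspci -nn | grep -i LSI                  # note the HBA's PCI address, e.g. 03:00.0
virsh nodedev-detach pci_0000_03_00_0    # detach the HBA from the host
virsh edit freenas-vm                    # add a <hostdev> entry pointing at 0000:03:00.0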

Another option is to install ZoL on the CentOS host. There is some information here about that. RHEL & CentOS · zfsonlinux/zfs Wiki · GitHub
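The install itself is roughly this on C7, going by that wiki page (the exact zfs-release repo RPM name changes between point releases, so take its URL from the wiki rather than from here):

yum install -y <zfs-release RPM from the wiki>
yum install -y kernel-devel zfs    # DKMS builds the modules against the running kernel
modprobe zfs
lsmod | grep zfs                   # confirm the module loaded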

NFS is available either way.

You mention mdadm raid10 or zfs raidz2... Why not mdadm raid6 or zfs mirrors? You can make a "raid10" on ZFS, just keep adding mirrors to the pool. I've switched my home system to this setup as you can upgrade/add 2 drives at a time and rebuilds are significantly faster.
 

fossxplorer

Active Member
Mar 17, 2016
Oslo, Norway
I was not clear about it (sorry for that), but my plan is to remove the old mdraid 10 and build an mdraid 6 to get more space.
@Blinky 42 : I'm adding the new disks into a new server (SM 826 chassis with a Supermicro H8DG6-F board). Since space is very important for me, I would like to go for RAID 6 rather than RAID 10 with the new disks, to get 24TB usable.
@ttabbal : Right, I don't want to add to the existing setup :) OK, good to know. I plan on adding another M1015 in addition to the LSI SAS 2008 onboard the motherboard, so I could pass the M1015 (IIRC it's in IT mode) through to the FreeNAS VM. Yes, reliability is the #1 priority, since I'll be storing critical user data. My plan after setting up the new server with 5x8TB on ZFS is, once the data is migrated over of course, to add one more 4TB HGST disk to the old server to get a 5x4TB mdraid 6 or ZFS array for backup purposes. That way I'll get 12TB of usable backup space and can expand easily with more 4TB HGST disks.

I'd love to use mirrors on ZFS, but it was an economic decision I had to make in order to get those 24TB (which will be much less, of course, due to TB vs TiB). Disk usage is growing, and at a minimum I need 20TiB for the coming years. Also, with RAID 6 I can simply add another 8TB drive to get 32TB usable.

Regarding ZoL on CentOS: IIRC, a couple of years back someone on #centos on Freenode mentioned it wasn't rock solid at the time, with some hangs they couldn't explain. I think I even had a chat with one of the developers of OpenIndiana or similar who told me the same story.
If it's very stable, I'd like to take that route to avoid the VM/FreeNAS layer. I love CentOS and have been managing it well for years now :)
I run VMs/containers that are ALL CentOS, so...

Btw, when expanding a ZFS raidz2 by adding another 8TB drive, how will that affect performance of the running system?

Thanks a lot for your input, really appreciated, as I'm a newbie to the ZFS world and a bit cautious since it's production data.
 

ttabbal

Active Member
Mar 10, 2016
If CentOS uses the same ZoL source, and there's no reason it wouldn't, it should behave about the same as on any other distro. I'm running it on Proxmox now, and I've done it on Debian and Ubuntu without issues as well. The Linux distros all seem to have standardized on the same ZFS code, so you should be good to go.

If you want to be able to expand an existing RAID set one disk at a time, you don't want ZFS raidz; it can't do that. You can add more vdevs to the pool, but you can't just add one drive to an existing vdev. That's a big reason I switched to mirrors: I can add 2 drives as a mirror to the pool, whereas I'd want a minimum of 6 for a raidz2. Upgrades and resilvers are much easier and faster as well, and performance is significantly better. If you're using 8TB drives, you need 6 drives for a mirror pool of 24TB usable, minus overhead and the base-10/base-2 conversion. That leaves you another 6 bays for future expansion, and you could reuse the existing 4TB drives, adding one more, for an additional 12TB.

If data safety is a priority, ZFS is the only filesystem I use. The ability to do scrubs and online checksum verification on reads is worth it. BTRFS can do scrubs too, but it's much newer code without as much testing, and it's not considered stable for parity arrays, so you're limited to mirrors with it anyway. Might as well use ZFS and benefit from the extra time it's had to mature.
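Scrubs are also easy to schedule, e.g. from a monthly cron job (assuming a pool named "tank"):

zpool scrub tank
zpool status -v tank    # shows scrub progress and any checksum errors found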

One more consideration for reliability... mirrors have a complete copy of all the data, while parity arrays only keep parity information to reconstruct from. In a disk-failure scenario I prefer knowing I have a full copy on the live system as well as in backup. Disk is cheap, take advantage of that. :)

As it's for a production system, you would be well advised to set up a test system and test things like drive failure and recovery. When a drive dies, you need to be able to replace it without having to think too much about how to do it. :) It also gives you a chance to do burn-in tests on all the hardware, drives in particular. Even brand-new drives I don't trust without running badblocks and SMART tests on them. I would also recommend keeping a tested spare drive of the largest size in your arrays on the shelf, so you can swap it in right away when needed. I keep one in a static bag and pull it every few months to run badblocks in a backup machine, to make sure it's still ready to go.
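A typical per-drive burn-in pass looks roughly like this (badblocks -w is destructive, so only on empty disks; /dev/sdX is a placeholder):

smartctl -t short /dev/sdX         # quick initial self-test
badblocks -b 4096 -wsv /dev/sdX    # full write/read pass; takes a long time on 8TB drives
smartctl -t long /dev/sdX          # extended self-test afterwards
smartctl -a /dev/sdX               # check reallocated/pending sector counts before trusting it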

Thinking of backups, don't forget those. And don't forget to test them to make sure they can be restored. Nothing is quite so painful as going to restore and realizing that your backups don't restore.
 

fossxplorer

Active Member
Mar 17, 2016
Oslo, Norway
OK, then I'll definitely try ZoL on C7 and avoid FreeNAS.
Huuuh, that's a major downside of ZFS, I have to say. RAID 6 is pretty common in the industry, and being able to expand with one disk is pretty normal; I took it for granted with ZFS :( At my work we are now waiting for an HGST 10TB, 60-bay SAN using multiple RAID 6 sets (14+2 IIRC), though all controlled by HGST controllers.
So my understanding is that if you plan well, say by using almost all of the bays of the chassis, then raidz might work, since you are probably not going to add more disks. Or you have to be prepared to expand with the same number of drives you started with, but then you lose twice as many disks to parity, right? So this has to be thought through carefully before going the raidz route!
Anyway, I really would like to go the ZFS way, especially now after reading about compression (are there any cons to enabling it?), which could be a trade-off for me versus RAID 10/mirrors.
Now I need to get at least one more 8TB disk, and ideally two so I have a spare as well.

Yes, the server is at my home, and I've just found a place to run it for a few days of burn-in testing, following [How To] Hard Drive Burn-In Testing, on all five of my 8TB Seagate disks. The 6th one I'll be ordering shipped directly to the datacenter (I bought the five disks locally), so its tests will have to be run later.
The server will be shipped to a DC in the Netherlands.

What do you recommend for backup other than scripting with rsync? My plan is to use the 4TB HGST disks for backup, and for that I think I'm going to use RAID 6 with mdadm, just to keep the costs (and bay usage) down. I know that means relying on parity for the backup, but yeah... I have to think about the costs too :)
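If you stick with rsync, a minimal sketch is something like this (host and paths are placeholders); -aHAX preserves hardlinks, ACLs and xattrs, and --delete mirrors removals:

rsync -aHAX --delete --numeric-ids /srv/storage/ backuphost:/backup/storage/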
 

ttabbal

Active Member
Mar 10, 2016
Online expansion has pros and cons. ZFS was designed as an enterprise system, buying a pile of disks isn't an issue there. Note that there is a maximum number of drives you should consider per raidz set as well. I personally wouldn't go above about 10 for a raidz2.

You don't need to add the same number of drives per vdev. You can have a pool with a 6-disk raidz2 and an 8-disk raidz2, no problem. Best practice is to keep them the same, mostly for admin sanity. :)

With a raidz pool, you have 2 options to increase capacity later. You can add another vdev; yes, each vdev uses its own parity, so you lose that space to redundancy for each vdev (I don't personally consider it "lost", it's being used, but that's just perspective). Or you can replace the individual drives in a vdev with bigger ones, one or two at a time, waiting for the resilver between each. Once you've replaced them all, the vdev will auto-expand. Last time I did this, each disk took about 14 hours to resilver. Performance tanked, but user I/O does get priority, which helps some. Resilvering a mirror of the same size took about an hour.
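The replace-and-grow path looks roughly like this, assuming a pool named "tank" (disk names are placeholders):

zpool set autoexpand=on tank
zpool replace tank <old-disk> <new-disk>    # one disk at a time, wait for each resilver
zpool status tank                           # watch resilver progress
zpool list tank                             # extra capacity appears after the last replace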

Compression is nice. It uses a bit more CPU to decompress, but unless the data is mostly photos, videos or other already-compressed material that doesn't compress well, it's worth having. Even pretty old, cheap CPUs handle the compression so fast that you won't notice it unless you're CPU-starved for some reason. What you want to avoid, unless you have a really good reason, is dedupe: it needs a ton of memory and only helps if you have a lot of identical blocks.
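Enabling it is a one-liner per dataset, and you can check what it saves (pool/dataset names are placeholders):

zfs set compression=lz4 tank/data
zfs get compression,compressratio tank/data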

For backup there are lots of options. rsync isn't bad, but if you want complete copies, consider ZFS send/receive (both sides have to be ZFS, though). It copies snapshots at the block level, and after the first one they are incremental, so they are quite fast. I already have Crashplan running for client backups, so I just have the backup server run it as well and set the machines to back up to both, with the server also set to back up to the backup box. I don't back up the full set, just the important stuff, so my backup server runs a ZFS mirror as well: only about 4TB of space total, and I'm using about 1TB of it, so it's fine.
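A minimal send/receive sketch between two ZFS boxes (pool, dataset, snapshot and host names are placeholders):

zfs snapshot -r tank/data@weekly-01
zfs send -R tank/data@weekly-01 | ssh backuphost zfs receive -F backup/data
# later runs only send the difference between two snapshots:
zfs snapshot -r tank/data@weekly-02
zfs send -R -i weekly-01 tank/data@weekly-02 | ssh backuphost zfs receive backup/data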

My only concern with using mdadm for backup storage would be that I'd want some way to verify the backup without having to restore the whole thing. I do that with ZFS scrubs, and Crashplan has a similar mechanism. I've had data die from bit rot, so I'm a little paranoid about that. As long as you have a good way to check that the backup data is correct, you should be good; rsync helps there, as it compares checksums to decide what to send, so it will cover that for you. And if you use something like Crashplan, you have to run a restore once in a while to make sure it works. A backup that you can't restore, isn't. :)
 

nk215

Active Member
Oct 6, 2015
fossxplorer said: "I'd love to use mirrors on ZFS..."
When people talk about adding mirrored pairs of HDDs as ZFS vdevs, they don't tell you about the risk: if both drives in a pair fail, the entire pool is gone. The chance of this is very small, but in a large enough array it can happen. A ZFS pool is only as reliable as its weakest vdev.
 

wildchild

Active Member
Feb 4, 2014
nk215 said: "if both drives in a pair fail, the entire pool is gone... A ZFS pool is only as reliable as its weakest vdev."
This is exactly the reason you would have one or two disks as hot spares, with autoreplace set to on, in a rather large array. That's no different from any other enterprise array type.
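For example (pool and device names are placeholders):

zpool add tank spare /dev/sdg    # add a tested drive as a hot spare
zpool set autoreplace=on tank    # automatically use a new disk found in a failed disk's slot
zpool status tank                # the spare shows up under "spares"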
 

fossxplorer

Active Member
Mar 17, 2016
Oslo, Norway
Yeah, that's the risk of running RAID 10 in general: you lose the whole array if the 2 disks of the same mirror pair fail.
While with RAID 6 you can lose ANY 2 disks and still have the data. But ZFS seems to have a major limitation in that it doesn't support single-disk expansion of a raidz2, which works fine with regular RAID 6.
For my use case it looks like I'm more or less forced to use mirrored pairs, since I need to be able to expand in the future without having to destroy and re-create. I wouldn't have anywhere else to put 24TB of data while I destroy, expand and re-create the raidz2 with more disks :)

One option I'm considering is just buying another disk, for a total of 6, and still using raidz2 to get roughly 32TB usable. That would cover my storage needs for many years to come, but it's a risk to take, of course!

 

ttabbal

Active Member
Mar 10, 2016
All software has bugs. The severity is the important thing. That one is pretty minor, all things considered. I personally have never seen it, but I don't hit snapshots that often.
 

grogthegreat

New Member
Apr 21, 2016
Since we are on the topic of ZFS downsides....
While using mirrors with ZFS makes it easier to expand, since you only need to add two disks at a time, ZFS does not rebalance data after you add the drives. This means that if you add two new drives to your pool of mirrors when the pool is nearly full (a likely scenario), most of your writes will go only to those two drives. Reads of that data later will also be limited to those two drives. This has a significant performance impact, which is one of the reasons people don't recommend letting a ZFS pool get anywhere near full.
 

ttabbal

Active Member
Mar 10, 2016
The common recommendation is to stay below about 80% full. This is mostly because ZFS is a copy-on-write system: any time you write to a block, it reads it, modifies the data, and writes it somewhere else, so it needs free space to have somewhere to put things. Much like how an SSD benefits from TRIM or overprovisioning.
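A quick way to keep an eye on that is zpool list (pool name is a placeholder):

zpool list -o name,size,alloc,free,capacity tank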

While it doesn't rebalance the data, that's not a big issue in practice for most people. Writes will generally spread out over time, and stored data that doesn't get rewritten is no worse off than before: it still has the same read performance it had before you added disks, which is generally sufficient. If you want more performance for existing data, you can copy it onto the new layout, but you need the free space to do that. It's not perfect, and it would be nice if there were an automated way to do it without having to copy the whole dataset. Interestingly, from what I've read, the same thing needed to expand raidz vdevs would also enable rebalancing: block pointer rewrite. Sadly, nobody seems very interested in coding it.
 

fossxplorer

Active Member
Mar 17, 2016
Oslo, Norway
Hmm @grogthegreat, that's a hugely important detail for me!
With standard RAID 6 the read performance increases as you add more disks, and that's not the case with ZFS, as far as I understand now?
It looks like I'm going to stick with mdraid 6 + LVM + XFS for my prod data and just play with ZFS to get more knowledge.
 

wildchild

Active Member
Feb 4, 2014
fossxplorer said: "With standard RAID 6 the read performance increases as you add more disks, and that's not the case with ZFS?"
Of course it does.
But be aware that if you fill the pool too much before expanding, there is a penalty to pay.
Regarding rebalancing in general, I agree it would be nice. However, if you build your pool with mirrors, it's pretty easy to break a mirror, build a new pool out of the new disks, zfs send the data to the new pool, remove the old pool and add its disks back as mirrors. Roughly like this:
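(A sketch with placeholder pool and disk names; only destroy the old pool after the copy is verified.)

zpool create newtank mirror sdg sdh mirror sdi sdj    # new pool from the new disks
zfs snapshot -r tank@migrate
zfs send -R tank@migrate | zfs receive -F newtank     # copy everything over
zpool destroy tank
zpool add newtank mirror sdc sdd mirror sde sdf       # re-add the old disks as extra mirrors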
 

gea

Well-Known Member
Dec 31, 2010
DE
ZFS was not created for best performance but for ultimate data security. That is why it includes copy-on-write (an always-consistent, crash-resistant filesystem with no RAID write-hole problem) and data checksums (data is always verified). This means higher fragmentation than a non-CoW filesystem and more computation and data movement, so it cannot be as fast. You will see this quite clearly when you use ZFS with little RAM, or with only a single disk or a single vdev.

Another design goal was huge storage with many, many disks. With RAID 5/6 you can go up to around a dozen disks; a larger RAID 6 array does not make sense, as a rebuild after problems takes too long. Another problem is that the iops of a RAID 6 array are like those of a single disk. Only sequential performance scales with the number of data disks.

ZFS uses a pool built from multiple vdevs. While a single vdev has the same limitations as a RAID 6 regarding rebuild time and iops, you can use as many vdevs as needed. Capacity, iops and sequential performance scale with the number of vdevs, up to petabytes.

To overcome this conceptual performance degradation, ZFS adds caching concepts like ARC and L2ARC that are among the best there are. A ZFS system can deliver most reads from the read cache, and it avoids small random writes with a large RAM-based write cache that turns several seconds of random writes into a single large sequential write.
 

fossxplorer

Active Member
Mar 17, 2016
Oslo, Norway
Thanks all for the awesome insight into ZFS, really.

Now I've got the 6th disk and am ready to set up ZFS. I'm just going through HOWTO install EL7 (CentOS RHEL) to a Native ZFS Root Filesystem · zfsonlinux/pkg-zfs Wiki · GitHub and chose to go with DKMS since I use the ML kernel 4.10 from elrepo.
But I can't see clearly how to create my mirrored vdevs. Should I simply create 3 mirrored vdevs with 2 disks in each and then create a pool out of them? Any help appreciated!

EDIT: is it vital to create partitions as described in that guide, rather than using whole disks?
It seems whole disks are used here: HowTo : Create Striped Mirror Vdev ZPool » ZFS Build, though.
 

ttabbal

Active Member
Mar 10, 2016
There's a bunch of examples here.

19.3. zpool Administration

multi-mirror create is handy..

zpool create mypool mirror /dev/ada1 /dev/ada2 mirror /dev/ada3 /dev/ada4


Add a mirror to a pool..

zpool add mypool mirror ada2p3 ada3p3


There are arguments both ways on using partitions vs whole devices. I use whole devices, as I don't see the point of partitions when the disk is only going to be used with ZFS; it makes more sense to me to just let ZFS deal with it however it likes. There were some performance downsides to using partitions a while back, but I think they're resolved now. The only downside I can see to using a whole disk is that some models/manufacturers will sell a 2TB drive that's 1.999TB while the next one is 2.0001TB. If you try to replace a mirror member and happen to come up a little short, that can be really annoying. It's never happened to me, but I do tend to use replacements as an excuse to upgrade to the next size up most of the time.

There's also a parameter you can use to see what a command would do without actually doing it; I believe it's "-n". That's particularly nice when adding devices, to make sure you don't accidentally add a single drive and end up unable to fix it. It's not as big an issue with mirrors, since you can "zpool attach" a mirror device to a single drive, but it doesn't work that way for raidz. I like to use it to check that what I typed is what I want, then use command-line editing to remove the parameter and really do it.
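For example, preview an add first, then re-run without -n to actually do it:

zpool add -n mypool mirror ada4p3 ada5p3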