How are very large ZFS pools configured?


Fritz

Well-Known Member
Apr 6, 2015
Conventional wisdom seems to suggest that vdevs over 10 drives are not a good thing. Mirrored vdevs seem to waste too much space, and it seems like they would get hard to manage beyond a certain number. So how would you configure a ZFS pool that's 100TB or larger in size?
 

Fritz

Well-Known Member
Apr 6, 2015
14 2TB drives. I asked at the FreeNAS forum and they said no, stick with a 10 disk vdev. So is ZFS not good for large pools? Is there an alternative that is?
 

gea

Well-Known Member
Dec 31, 2010
DE
Mirrors to gain high iops are no longer a good idea with disks.
If you really need high iops, use raid-Z with SSDs, e.g. the 3.84 TB Samsung PM863.

A ZFS pool built from multiple raid-Z2 vdevs with 6 or 10 disks per vdev is the recommended pool layout, as it offers high capacity with an acceptable rebuild time in case of disk failures.

If you use raid-Z2 with 6 or 10 disks per vdev as the capacity-optimized option, you can:

- Use a 24-bay case with 3 x 8-port HBAs and SATA disks (avoid expander + SATA). With 2 x 10-disk Z2 vdevs you have 16 data disks: with 10 TB disks that gives you 160 TB usable, with 6 TB disks you are at 96 TB usable.

- Use 4 x 6-disk Z2 vdevs instead of 2 x 10-disk: same capacity, but twice the iops.

- If you need more than this 160 TB limit, use a SuperMicro JBOD case with an expander and up to 90 disks per case. With up to 9 x 10-disk Z2 vdevs per JBOD you have 72 data disks; with 8 TB SAS disks you are at 576 TB per JBOD.

With two of these cases you are at a petabyte.
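
If you want to play with these layout numbers yourself, here is a minimal sketch (plain Python, hypothetical helper name, ignoring slop space, metadata and padding overhead) that reproduces the usable-capacity and rough iops arithmetic above:

```python
# Rough ZFS pool sizing helper (sketch only; ignores slop space, metadata,
# TB-vs-TiB marketing and ashift/recordsize padding losses).

def raidz_pool(vdevs: int, disks_per_vdev: int, parity: int, disk_tb: float):
    """Return (usable_tb, approx_iops) for a pool of identical raid-Z vdevs."""
    data_disks = vdevs * (disks_per_vdev - parity)
    usable_tb = data_disks * disk_tb
    # Each raid-Z vdev delivers roughly the random iops of a single disk.
    approx_iops = vdevs * 150          # assume ~150 iops per 7200 rpm disk
    return usable_tb, approx_iops

# 24-bay case, 2 x 10-disk Z2 with 10 TB disks -> 160 TB usable
print(raidz_pool(vdevs=2, disks_per_vdev=10, parity=2, disk_tb=10))
# Same data disks as 4 x 6-disk Z2 -> same 160 TB usable, but twice the iops
print(raidz_pool(vdevs=4, disks_per_vdev=6, parity=2, disk_tb=10))
# 90-bay JBOD, 9 x 10-disk Z2 with 8 TB disks -> 576 TB usable
print(raidz_pool(vdevs=9, disks_per_vdev=10, parity=2, disk_tb=8))
```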
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
I use, and like RaidZ2 with 6 disks -- any higher and it's very costly to expand the pool.

I don't know about you but I'd rather add 6 at a time not 10 :)

That's for my storage/media/etc. For VMs I like SSDs in a pool of mirrors.
 

azev

Well-Known Member
Jan 18, 2013
I know I am beating a dead horse here, but with an SSD pool of mirrors, do you still need a ZIL & L2ARC??
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
L2ARC - yes, in almost all cases, especially if it's NVMe.
SLOG - my $0.02 is that it depends on the number of mirrored pairs, the performance and type of the SSDs used, and the planned SLOG device.
I'm actually going to test this out next week as I install 8x 400GB SAS SSDs in my system. I want to see whether a 12Gb/s SSD as SLOG is a benefit, or whether a ZeusRAM benefits an already very nice pool :) I really wish pass-through of NVMe from ESXi to OmniOS worked!
 

Terry Kennedy

Well-Known Member
Jun 25, 2015
New York City
www.glaver.org
Conventional wisdom seems to suggest that vdevs over 10 drives are not a good thing. Mirrored vdevs seem to waste too much space, and it seems like they would get hard to manage beyond a certain number. So how would you configure a ZFS pool that's 100TB or larger in size?
Are you aiming for:
  • Maximum speed
  • Maximum space efficiency
  • Maximum survivability
or some combination of those? There's a lot of folklore around ZFS, such as using prime numbers of drives for raidz1, prime+1 for raidz2, etc. Another one is that pool performance is the sum of vdev performance. I've benchmarked 4 * 4-drive raidz1's vs. 3 * 5-drive raidz1's and didn't see much of a difference. I'm using 3 * 5 as that gives me a hot spare in a 16-bay chassis.

I have the data replicated on another server as well as backed up to tape, so raidz1 vs. raidz2 is not an issue for me. Besides, if the idea behind raidz2 is that a second drive is likely to fail during a resilver, why not a 3rd drive, or a 4th? While it is possible for another drive or drives to fault out of the pool and possibly render it non-recoverable, at some point you run into diminishing returns. And, just like "RAID is not backup", "ZFS is not backup".
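
To make the diminishing-returns point concrete, here is a back-of-the-envelope sketch (assuming independent failures and a constant annual failure rate, which is optimistic, so treat the numbers as illustrative only):

```python
# Sketch: rough chance that k additional disks fail while one resilver runs,
# assuming independent failures and a constant annual failure rate (AFR).
# This is illustrative arithmetic, not a reliability analysis.
from math import comb

def p_fail_during_resilver(afr: float, resilver_hours: float) -> float:
    """Probability that a single disk fails within the resilver window."""
    return 1 - (1 - afr) ** (resilver_hours / (365 * 24))

def p_at_least_k(n_disks: int, k: int, p: float) -> float:
    """Probability that at least k of n_disks fail, each with probability p."""
    return sum(comb(n_disks, i) * p**i * (1 - p)**(n_disks - i)
               for i in range(k, n_disks + 1))

p = p_fail_during_resilver(afr=0.05, resilver_hours=24)   # 5% AFR, 1-day resilver
for k in (1, 2, 3):
    print(f"P(>= {k} more failures among 9 remaining disks) ~ "
          f"{p_at_least_k(9, k, p):.6f}")
```

Each extra parity level protects against an event that is already orders of magnitude less likely than the previous one, which is the diminishing-returns argument in numbers.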
 

gea

Well-Known Member
Dec 31, 2010
DE
Raid is not backup. This is not only true but, with Raid on a conventional filesystem, a daily necessity, as every power outage may produce a corrupted Raid or filesystem. Any virus or accidental delete requires the backup. So backup is part of daily work.

ZFS is not backup. This is true, but only in case of a real disaster like sabotage, fire, theft or a lightning strike. ZFS is designed not to lose data during regular operation or virus/Locky attacks, thanks to CopyOnWrite (crash resistant, no corrupted filesystem or Raid by design) and versioning with snaps. I do backups to two backup systems, one in another building, but I have not needed the backups since I started using ZFS. Access to snaps is the daily recovery option. So backup with ZFS is disaster backup.

About the number of disks per vdev:
ZFS does not care if you use a raid-Z1 with 2 or 20 disks. The rule that the data disks per vdev should be a power of two (2, 4, 8, 16) plus the redundancy disks comes mostly from the idea that you write datablocks of 128k, 64k, 32k etc. With a different number of data disks you always have a small amount of padding waste and a slightly reduced capacity compared to what you bought.
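
As a rough illustration of that padding effect (a simplified model, assuming ashift=12 and ignoring how raid-Z actually groups parity and padding sectors):

```python
# Sketch: why powers of two for data disks keep raid-Z stripes "clean".
# A 128 KiB record split over N data disks gives 128/N KiB per disk; with a
# non-power-of-two N the per-disk chunk is not a whole number of 4 KiB
# sectors, so extra padding is written (simplified view of the real allocator).

RECORD_KIB = 128
SECTOR_KIB = 4          # ashift=12, i.e. 4 KiB sectors

for data_disks in (4, 5, 6, 8):
    per_disk_kib = RECORD_KIB / data_disks
    sectors = -(-per_disk_kib // SECTOR_KIB)               # ceiling division
    waste_pct = (sectors * SECTOR_KIB - per_disk_kib) / per_disk_kib * 100
    print(f"{data_disks} data disks: {per_disk_kib:5.1f} KiB/disk -> "
          f"{int(sectors)} sectors, ~{waste_pct:.0f}% padding")
```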

Performance-wise, the number of vdevs is not relevant for sequential performance; that scales with the number of data disks only. But since every disk in a vdev must be positioned to write or read a datablock, the iops of a vdev are roughly those of a single disk, so the more vdevs, the higher the iops. This is the reason for the massive n x Raid-10 configs in the past. Today you use SSDs if you need iops.

About a Slog in an SSD-only pool:
If you enable sync, you must write every datablock twice, once immediately as a slow small write and once as a large, fast sequential write via the write cache. If your Slog is not substantially faster (lower latency, higher write iops) than your pool, a Slog is useless. Only one problem remains: if you use an SSD pool without powerloss protection, you should always add a fast Slog SSD or NVMe with powerloss protection, like an Intel P750/3600/3700. But for critical data you should always use pool SSDs with powerloss protection anyway, to allow background garbage collection without data risk in case of a power outage.
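
A very rough way to see whether a separate Slog can help at all is to compare commit latencies; a minimal sketch follows (the latency figures are assumptions for illustration, not measurements):

```python
# Sketch: a sync write cannot complete faster than the ZIL commit, so
# single-threaded sync iops are roughly bounded by log-device write latency.
# Latency numbers below are assumptions for illustration only.

def sync_iops_bound(log_write_latency_us: float) -> float:
    """Upper bound on single-threaded sync-write iops for a given ZIL latency."""
    return 1_000_000 / log_write_latency_us

pool_ssd_latency_us = 60    # assumed SATA datacenter SSD sync-write latency
slog_nvme_latency_us = 20   # assumed NVMe (e.g. P3700-class) sync-write latency

print("ZIL on pool SSDs :", round(sync_iops_bound(pool_ssd_latency_us)), "sync iops/thread")
print("NVMe Slog        :", round(sync_iops_bound(slog_nvme_latency_us)), "sync iops/thread")
# If the Slog's latency is not clearly lower than the pool devices',
# it adds nothing -- which is the point made above.
```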
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
No and no.

L2ARC will use plenty of your RAM for mapping the L2ARC. You would rather have that RAM for caching data and metadata directly.
Not using L2ARC because of RAM is a valid concern, but it should not hold you back if you have sufficient RAM.

With everyone picking up cheap DDR3 RDIMMs lately, I'm thinking most people have sufficient RAM for an L2ARC of the most common SSD sizes.
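
To put a rough number on the RAM cost: each record cached in L2ARC needs an ARC header in RAM, commonly cited as somewhere around 70-100 bytes depending on the ZFS release (the 88 bytes below is an assumption):

```python
# Sketch: approximate RAM needed to index an L2ARC device.
# header_bytes per cached record is an assumption; it varies by ZFS release.

def l2arc_header_ram_gb(l2arc_gb: float, avg_record_kib: float,
                        header_bytes: int = 88) -> float:
    records = (l2arc_gb * 1024 * 1024) / avg_record_kib   # device size in KiB / record size
    return records * header_bytes / (1024 ** 3)

# 480 GB SSD as L2ARC full of 8 KiB records (VM-style random I/O)
print(round(l2arc_header_ram_gb(480, 8), 1), "GB of headers")
# Same SSD full of 128 KiB records (media / sequential workload)
print(round(l2arc_header_ram_gb(480, 128), 2), "GB of headers")
```

So for large-record media pools the overhead is trivial, while a small-record L2ARC on a RAM-starved box really can eat into the ARC.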
 

Terry Kennedy

Well-Known Member
Jun 25, 2015
New York City
www.glaver.org
ZFS is not backup. This is true, but only in case of a real disaster like sabotage, fire, theft or a lightning strike. ZFS is designed not to lose data during regular operation or virus/Locky attacks, thanks to CopyOnWrite (crash resistant, no corrupted filesystem or Raid by design) and versioning with snaps. I do backups to two backup systems, one in another building, but I have not needed the backups since I started using ZFS. Access to snaps is the daily recovery option. So backup with ZFS is disaster backup.
Strange things can happen, many of which aren't ZFS's fault but can lead to a pool becoming unavailable. Two I've run into in the past were:

1) A ZIL device failed on a ZFS v18 pool. Since log devices could not be removed until ZFS v19, any attempt to write to the pool caused an immediate panic. I don't know if a "zfs send" would have replicated the fault to the backup pool or not.

2) A SAS expander backplane developed a fault where random drives would drop offline and come back online a random amount of time later. Eventually the pool was left with insufficient redundancy to survive.

In each case the hardware problem was corrected and the pool's contents restored from tape. ZFS v28/v5000 is in wide enough use that most bugs have been found and fixed, or worked around. But there may still be some corner cases where it doesn't "do the right thing", and then there are hardware issues outside of ZFS's control.

A backup is a copy of the data, on a different medium in a different format, stored at a different location (and no, a different floor in the same house / office doesn't count).
 

gea

Well-Known Member
Dec 31, 2010
DE
You have a valid point if you are talking about SAS1 expanders. But please elaborate on what's wrong with SAS2 expanders and SATA disks?

Works very well and is the only way to go for larger deployments.
There is only one reason to use a SAS expander + SATA disks: the disks are slightly bigger and cheaper.

While SAS expander + SATA disks works in general, you will hardly find a professional setup with support for such a combo. I have seen discussions in the Illumos/OmniOS community with requests for help on problems where a single (semi-dead) disk caused errors on the expander itself and suspicious problems on other disks, without a clear indication of which disk was the real culprit. In such cases using SAS disks is much more robust.

The usual comment, e.g. from OmniTI, was: for such a combo we would even refuse paid support.

In my own setups, and where I am involved, I do not suggest expander + SATA.
If one needs more disks than available ports, use SAS disks.

btw.
I use backup systems in a Chenbro case with 50 bays where I wanted to reuse SATA disks.
Even such a system can be built without an expander, with 6 x 8-port HBAs or 3 x 16-port HBAs.
Then you do not have to worry about such problems, and it is a faster solution with fast disks.

And for the really big or pro systems the premium for SAS disks is not an issue, while for smaller setups adding another HBA instead of an expander is the faster and more robust solution.
 

gea

Well-Known Member
Dec 31, 2010
DE
Strange things can happen, many of which aren't ZFS's fault but can lead to a pool becoming unavailable. Two I've run into in the past were:

1) A ZIL device failed on a ZFS v18 pool. Since log devices could not be removed until ZFS v19, any attempt to write to the pool caused an immediate panic. I don't know if a "zfs send" would have replicated the fault to the backup pool or not.

2) A SAS expander backplane developed a fault where random drives would drop offline and come back online a random amount of time later. Eventually the pool was left with insufficient redundancy to survive.
Yes, you cannot skip a disaster backup, even with the newest ZFS releases. Newer features may introduce their own bugs as well, like the L2ARC problem last year. And a ZFS bug is then a disaster, just like a fire.

But ZFS is very robust. Importing a ZFS pool with a missing Slog has been possible since ZFS v19. Flipping disks will ruin a hardware Raid completely, but ZFS is more forgiving: it should mean no more than a degraded state if one disk fails, or an offline state if more disks fail, regardless of whether they come and go.

In the end, when all disks are back, clearing the errors should be enough, thanks to the CopyOnWrite-with-checksums nature of ZFS.
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
So, to all you ZFS masters (you guys know way more than I do), simplify this down for me as well...

Why aren't mirrors a good thing anymore? It seemed like a good idea to increase iops and make it super easy to expand, if I did my reading right. I was planning on converting to mirrors instead of raidz2. I currently only use raidz1. I know, I know, I am an idiot...

I'm just so greedy for my drives! I don't want to give up any more drives to ZFS than I have to. Besides, my backup plan is super simple: no backup. I use the venerable Karen's Directory Printer.

I just need to know what I lost. If the day comes that I do lose my precious media and porn, I will just re-download it. Chip 'n Dale Rescue Rangers isn't that hard to find through torrent, and Abella Anderson videos abound all over the web. Data like pictures and documents that I don't want to lose is placed on several different high-capacity flash drives. Several of them.
Pools of mirrors are still good and are used for expanding, increasing iops/performance, etc.
I do this with SSDs, and I know many, many others here do the same.
 

gea

Well-Known Member
Dec 31, 2010
DE
Pools of mirrors are still good and are used for expanding, increasing iops/performance, etc.
I do this with SSDs, and I know many, many others here do the same.
Are your iops needs really that high?
A single disk mirror gives you around 100-200 iops, so with 50 mirrors you have about 10,000 iops.

A single SSD gives you the same iops.
This is why I use raid-Z2 with SSDs: simply more than good enough (if you use quality SSDs),
even for high-iops demands, and much cheaper as you need fewer SSDs.
 

jgreco

New Member
Sep 7, 2013
14 2TB drives. I asked at the FreeNAS forum and they said no, stick with a 10 disk vdev. So is ZFS not good for large pools? Is there an alternative that is?
Going wider than 10 is *possible* (and 14 is within that), but as you go farther out past ~10-12, the chance of problems increases.

Mirrors to gain high iops are no longer a good idea with disks.
If you really need high iops, use raid-Z with SSDs, e.g. the 3.84 TB Samsung PM863.
Depends on what you need... that's a hideously unqualified statement to make. RAIDZ is entirely unsuited to some use cases, especially live VM storage. If you need 16TB of pool space, that's six PM863s at about $13,000 in cost, and you still only get a single vdev's performance out of it, plus all the worries of the variable storage involved in RAIDZ2. It's very possible to build a 16TB pool out of something cheaper, like 2TB drives and some fast L2ARC. To get a similar level of protection, 24 2TB drives in three-way mirrors -- like 24 ST2000NX0273 and two 512GB NVMe L2ARC SSDs -- will run you about $10K, or you can go the cheapie route with 24 ST2000LM003 and two 512GB NVMe SSDs for about $3K.

If you really need high IOPS, use mirrors with SSD's - not RAIDZ.

So, to all you ZFS masters (you guys know way more than I do), simplify this down for me as well...

Why aren't mirrors a good thing anymore? It seemed like a good idea to increase iops and make it super easy to expand, if I did my reading right. I was planning on converting to mirrors instead of raidz2. I currently only use raidz1. I know, I know, I am an idiot...
Mirrors are fine. Three way mirrors are generally optimal for performance, but they're hideously expensive. It's probably the only way to go for certain classes of storage, especially VM storage.

RAIDZ2/RAIDZ3 is much more efficient for bulk storage, but with that comes significantly reduced performance. Our 11-drive RAIDZ3 filer tends to wail in laggy agony if there's too much going on at once. There just aren't enough IOPS.

Feel free to head on over to the FreeNAS forums to debate this more :)
 
Apr 2, 2015
My first FreeNAS box was built with 12 x 1TB drives in raidz1, and it ran for 1+ years, surviving 2 disk failures/rebuilds during that time (not simultaneously, of course).


After it saved my ass a few times, I rebuilt it with 2 raidz1 vdevs on 8 x 2TB drives, but this time using a supported HBA that actually lets me see the SMART status of the drives and does correct passthrough of the drives (as opposed to the first version, which just presented 12 hardware RAID0s to ZFS).

So basically it depends on what you need. For me it served the purpose of backup space for Veeam (I had a 9-ish TB iSCSI target on it).
ZFS will build whatever config you want. Weigh all your needs and whatever bothers you the most (rebuild time, iops, total space, redundancy) and go from there.

For me total space was the most important at first, but now I value rebuild times more, and also expandability, so I went for 2 vdevs of 4 disks in raidz1, with room to expand with another similar vdev in the future.
