How are very large ZFS pools configured?


gea

Well-Known Member
Dec 31, 2010
3,161
1,195
113
DE
Going wider than 10 is *possible* (and 14 is within that), but as you go farther out past ~10-12, the chance of problems increases


Depends on what you need... that's a hideously unqualified statement to make. RAIDZ is entirely unsuited to some use cases, especially including live VM storage. If you need 16TB of pool space, that's six PM 863's at about $13,000 in cost and you still only get a single vdev's performance out of it, plus all the worries of the variable storage involved in RAIDZ2. It's very possible to build a 16TB pool out of something cheaper, like 2TB drives and some fast L2ARC. To get a similar level of protection, 24 2TB drives in three-way mirrors, like 24 ST2000NX0273 and two 512GB NVMe L2ARC SSD's, will run you about $10K, or you can go the cheapie route with 24 ST2000LM003 and two 512GB NVMe SSD's for about $3K.

If you really need high IOPS, use mirrors with SSD's - not RAIDZ.
You must look at sequential performance and IOPS separately.
Sequential write performance in a RAID-Z scales with the number of data disks (in a mirror pool, with the number of vdevs), while IOPS equals that of a single disk per vdev; in a pool of mirrors it scales with half the number of disks. With a 10-disk Z2 of spinning disks vs. a 5 x mirror pool this means 100 IOPS vs. 500 IOPS, while the Z2 may be slightly faster sequentially.

But how many IOPS do you need?
In the past, with disks, this was easy: you needed 10 mirrors to go above 1,000 IOPS.
But when a single enterprise SSD can give 20,000-80,000 IOPS under constant load, the real question is: do you need so many IOPS that it is worth losing 50% of the capacity to redundancy in a RAID-10, compared to 20% in, for example, a 10-disk RAID-Z2? That, plus the fact that you can lose any two disks, makes it clear for me.

With SSD-only pools, multiple RAID-Z vdevs instead of multiple mirrors are the economical way to build high-performance SSD storage (unless you really need more than the > 80,000 IOPS you can achieve with one SSD RAID-Z vdev), which is already the equivalent of hundreds of spindle disks in a RAID-10.

And: IOPS scale with the number of RAID-Z vdevs as well, with fewer SSDs needed than with mirrors.
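
To make the scaling rules above concrete, here is a minimal back-of-envelope sketch in Python; the per-disk figures of 100 IOPS, 150 MB/s and 2 TB are illustrative assumptions, not measurements from this thread.

```python
# Rough comparison of pool layouts using the rules of thumb from the post
# above: random IOPS ~ one member disk per vdev, sequential write scales
# with the number of data disks, usable capacity depends on redundancy.

DISK_IOPS = 100      # assumed random IOPS of one 7200 rpm disk
DISK_SEQ_MBS = 150   # assumed sequential MB/s of one disk
DISK_TB = 2          # assumed disk size in TB

def raidz(disks_per_vdev, parity, vdevs=1):
    data_disks = (disks_per_vdev - parity) * vdevs
    return {"layout": f"{vdevs} x RAID-Z{parity} ({disks_per_vdev} disks/vdev)",
            "iops": DISK_IOPS * vdevs,             # one disk's IOPS per vdev
            "seq_mbs": DISK_SEQ_MBS * data_disks,  # scales with data disks
            "usable_tb": DISK_TB * data_disks}

def mirrors(vdevs, width=2):
    return {"layout": f"{vdevs} x {width}-way mirror",
            "iops": DISK_IOPS * vdevs,             # one disk's IOPS per vdev
            "seq_mbs": DISK_SEQ_MBS * vdevs,       # one data disk per vdev
            "usable_tb": DISK_TB * vdevs}

for pool in (raidz(10, parity=2), mirrors(5)):
    print(f'{pool["layout"]:32} {pool["iops"]:5d} IOPS  '
          f'{pool["seq_mbs"]:5d} MB/s seq write  {pool["usable_tb"]:3d} TB usable')
```

This reproduces the 100 vs. 500 IOPS comparison above, with the 10-disk Z2 ahead on sequential writes and usable space.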
 

PigLover

Moderator
Jan 26, 2011
3,186
1,545
113
@gea - can I paraphrase what I think you are saying to make sure I've understood you?

Regarding Raid-10 performance for SSD: You are not saying you can't still configure them this way. You're not even saying you shouldn't configure them in Raid-10 (vs RaidZ). But I think you are saying that the reason people used to prefer this config in the past, to gain write-IOP performance, is no longer relevant in most use cases when using SSDs, so why waste the space. In most use cases the single-disk write performance of an SSD is high enough that you won't gain much of practical value anymore from Raid-10 and multiple vdevs vs RaidZ.

Did I understand you correctly?
 

jgreco

New Member
Sep 7, 2013
28
16
3
You must look at sequential performance and IOPS separately.
Sequential write performance in a RAID-Z scales with the number of data disks (in a mirror pool, with the number of vdevs), while IOPS equals that of a single disk per vdev; in a pool of mirrors it scales with half the number of disks. With a 10-disk Z2 of spinning disks vs. a 5 x mirror pool this means 100 IOPS vs. 500 IOPS, while the Z2 may be slightly faster sequentially.

But how many IOPS do you need?
In the past, with disks, this was easy: you needed 10 mirrors to go above 1,000 IOPS.
But when a single enterprise SSD can give 20,000-80,000 IOPS under constant load, the real question is: do you need so many IOPS that it is worth losing 50% of the capacity to redundancy in a RAID-10, compared to 20% in, for example, a 10-disk RAID-Z2? That, plus the fact that you can lose any two disks, makes it clear for me.

With SSD-only pools, multiple RAID-Z vdevs instead of multiple mirrors are the economical way to build high-performance SSD storage (unless you really need more than the > 80,000 IOPS you can achieve with one SSD RAID-Z vdev), which is already the equivalent of hundreds of spindle disks in a RAID-10.

And: IOPS scale with the number of RAID-Z vdevs as well, with fewer SSDs needed than with mirrors.
Of course you have to look at sequential performance and IOPS separately. But IOPS under ZFS is a complicated topic. ZFS being able to give you two thousand or more "random write" IOPS on a traditional hard disk with 10% occupancy throws a real wrench into the whole thing.

But the point here is that you can't really make generalizations; what you're saying basically takes some very simplistic viewpoints and extrapolates them onto an extremely complex system. In many cases you can balance things out to get reasonable or even excellent performance without going all-SSD. In cases where you actually need high IOPS, RAIDZ is probably going to be bad, because the database/VM storage scenarios that typically need it are going to suffer from the extra I/O (or the cost of extra space) resulting from stripe-size/record-size issues. In general, unless you're doing massive writes (and often even then, unless they're actually random), it is often more convenient and less expensive to have a large HDD-based pool, control the occupancy rate to mitigate fragmentation's impact on write speeds, and rely on ARC and L2ARC to accelerate reads on the working set. That generally works out to be much less expensive than SSD at a good fraction of SSD's speed, even today, although maybe the price part of that will change in the next year or two.

Yes, I know that won't work for EVERY scenario, but it seems to address a lot of them.

As for hundreds of spindle disks in a RAID10, the real problem is that right now you can build a petabyte array at a not-horrifying cost out of HDD, and you'll be getting a lot more space for a lot less cost. There seems to be a lot of resistance out there to tiering storage appropriately, but I guess that's another discussion.
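
As a rough illustration of the "big HDD pool plus ARC/L2ARC" argument, the sketch below (my numbers, not jgreco's; the hit rates and per-tier IOPS are assumptions) estimates effective read IOPS when most of the working set is served from cache.

```python
# Rough model: reads that hit ARC (RAM) or L2ARC (SSD) are fast; only the
# misses fall through to the spinning-disk vdevs. All figures are
# illustrative assumptions, not benchmark results.

def effective_read_iops(arc_hit, l2arc_hit, hdd_pool_iops,
                        arc_iops=500_000, l2arc_iops=50_000):
    """Latency-weighted blend of the three tiers."""
    miss = 1.0 - arc_hit - l2arc_hit
    assert miss >= 0.0
    avg_time = arc_hit / arc_iops + l2arc_hit / l2arc_iops + miss / hdd_pool_iops
    return 1.0 / avg_time

# Assume a 6-vdev HDD pool worth ~600 random read IOPS raw.
for arc, l2 in [(0.0, 0.0), (0.70, 0.25), (0.80, 0.19)]:
    print(f"ARC {arc:.0%} / L2ARC {l2:.0%} hits -> "
          f"{effective_read_iops(arc, l2, hdd_pool_iops=600):>9,.0f} effective read IOPS")
```

The point is only that a mostly cached working set moves the pool from hundreds to tens of thousands of effective read IOPS; writes are a separate story.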
 

gea

Well-Known Member
Dec 31, 2010
3,161
1,195
113
DE
In the first place, IOPS is related to the mechanical properties of a disk, like rpm, or to the quality of an SSD;
see IOPS - Wikipedia, the free encyclopedia

Whenever a benchmark gives you higher values, you must check for cache effects. Whenever your values are lower, you must check for other effects, e.g. consumer SSDs that can hold a high IOPS value only for a short time.

Throughput is different, as it is more of a sequential value that, for disks, is a function of fill rate, mainly due to fragmentation. While fragmentation is irrelevant for SSDs, their performance drops with fill rate because you must erase a block before re-using it. With enough free blocks, e.g. after a secure erase, writing is faster. With SSDs, quality is essential for performance. While a desktop SSD and an enterprise SSD like an Intel S3700 both claim 80,000 IOPS and similar sequential performance per the datasheet, the Intel keeps these values under constant load, while desktop SSDs may drop to 10% of that after some time.
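
To illustrate the burst vs. sustained distinction, here is a small sketch; the burst duration and steady-state figures are invented for illustration and are not datasheet values for any particular drive.

```python
# Time-weighted average IOPS over a sustained 4K random-write run.
# The "consumer" drive bursts at a high rate until its fast cache is
# exhausted, then drops to a much lower steady-state rate; the "enterprise"
# drive is assumed to hold its rated IOPS throughout. Numbers are
# illustrative assumptions only.

def avg_iops(duration_s, burst_iops, burst_seconds, steady_iops):
    burst = min(duration_s, burst_seconds)
    steady = duration_s - burst
    return (burst * burst_iops + steady * steady_iops) / duration_s

for seconds in (30, 300, 3600):
    consumer = avg_iops(seconds, burst_iops=80_000, burst_seconds=60, steady_iops=8_000)
    enterprise = avg_iops(seconds, burst_iops=75_000, burst_seconds=seconds, steady_iops=75_000)
    print(f"{seconds:5d}s run: consumer ~{consumer:7,.0f} IOPS, enterprise ~{enterprise:7,.0f} IOPS")
```

A 30-second benchmark makes both drives look the same; an hour of constant load is where the datasheet claims diverge.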
 

jgreco

New Member
Sep 7, 2013
28
16
3
In the first place, IOPS is related to the mechanical properties of a disk, like rpm, or to the quality of an SSD;
see IOPS - Wikipedia, the free encyclopedia

Whenever a benchmark gives you higher values, you must check for cache effects. Whenever your values are lower, you must check for other effects, e.g. consumer SSDs that can hold a high IOPS value only for a short time.

Throughput is different, as it is more of a sequential value that, for disks, is a function of fill rate, mainly due to fragmentation. While fragmentation is irrelevant for SSDs, their performance drops with fill rate because you must erase a block before re-using it. With enough free blocks, e.g. after a secure erase, writing is faster. With SSDs, quality is essential for performance. While a desktop SSD and an enterprise SSD like an Intel S3700 both claim 80,000 IOPS and similar sequential performance per the datasheet, the Intel keeps these values under constant load, while desktop SSDs may drop to 10% of that after some time.
IOPS was at one time related to mechanical properties of a disk, but these days is a general measure of storage subsystem workload. With all these new software defined storage systems, especially including ZFS, deciding that a storage system is capable of a particular number of IOPS is a complicated game, which is one of the reasons that we frequently do handwaving when discussing ZFS. It's moved far beyond an easily measurable number and involves a knowledge of anticipated workload, performance over time, pool architecture, and other issues.

ZFS in particular has some interesting qualities due to the CoW fragmentation that occurs; as I previously mentioned, it is perfectly possible to observe 2000 write IOPS on a single hard drive, despite the fact that if the drive were actually seeking around you'd only be seeing ~200.
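
A minimal sketch of why a CoW filesystem can post far more "random write" IOPS than the drive could ever seek: small writes are batched into a transaction group and flushed as a mostly sequential stream, so the effective rate is bounded by sequential bandwidth rather than seek time. The bandwidth, record size and efficiency factor below are illustrative assumptions.

```python
# Illustrative: seek-bound random writes vs. CoW-style transaction-group
# flushing, where queued 4K writes are laid down as one mostly sequential burst.

SEEK_BOUND_IOPS = 200   # assumed: what a 7200 rpm disk does if every write seeks
SEQ_MBS = 150           # assumed sequential write bandwidth, MB/s
RECORD_KB = 4           # size of each "random" write

def coalesced_write_iops(seq_mbs=SEQ_MBS, record_kb=RECORD_KB, efficiency=0.10):
    """Effective small-write IOPS if only a fraction of the sequential
    bandwidth ends up laying the coalesced records down contiguously."""
    return seq_mbs * 1024 * efficiency / record_kb

print(f"seek-bound:    ~{SEEK_BOUND_IOPS} write IOPS")
print(f"CoW coalesced: ~{coalesced_write_iops():,.0f} write IOPS "
      f"(at only 10% of sequential bandwidth)")
```

The effect shrinks as the pool fills and fragments, which is why occupancy control matters in the HDD-pool approach described above.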

In any case, I disagree with making expensive generalizations, which "buy $13K in SSD's" appears to me to be.
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
7,640
2,058
113
Are your IOPS needs really that high?
A single mirror vdev of disks can give you around 100-200 IOPS, so with 50 mirrors you have 10,000 IOPS.

A single SSD gives you the same IOPS.
This is why I use RAID-Z2 with SSDs - simply more than good enough (if you use quality SSDs),
even for high IOPS demands (and much cheaper, as you need fewer SSDs).
That high?
Even cheap VMs leased from places online come with thousands of sustainable IOPS, not 100-200.
The entire purpose of SSD and NVMe (for me at least) is to get better performance than spinning disks (and better performance than the VMs I can lease for $/mo), not simply to restructure the storage for the same performance.

Just because you can get by with an OS at 100-200 IOPS does not mean you can't tell the difference when the OS has more, and since my projects are all data-related, yes, I can tell the difference.

Sure, RaidZ eats up fewer disks and you get more storage capacity, but depending on what the VMs are doing you may end up with more unusable capacity because it's so slow.

With the price of enterprise SSDs nowadays I don't see a reason not to use pools of mirrors. Obviously if you're supporting 1000s of VMs and 100s of hosts you're going to have to factor in cost a lot more than with 5-20 hosts.

Obviously if your machines sit idle and don't do anything, then SSD RaidZ may work perfectly fine with a SLOG and L2ARC --- I think it all just depends on your usage, but there's def. a very valid point to using pools of mirrored SSDs still :)
 

gea

Well-Known Member
Dec 31, 2010
3,161
1,195
113
DE
Mostly you need either high performance or high capacity, not both.
With $13k you are now near 12 TB of SSD-only storage in a RAID-Z. Usually 1-2 TB is enough for the high-performance needs, and that is available for $2k or less.

With disks you get 10 times the capacity with similar throughput but a fraction of the IOPS.
While ARC, L2ARC and a ZIL can help, in the end only raw, real disk power counts.
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
7,640
2,058
113
Mostly you need either high performance or high capacity, not both.
With $13k you are now near 12 TB of SSD-only storage in a RAID-Z. Usually 1-2 TB is enough for the high-performance needs, and that is available for $2k or less.

With disks you get 10 times the capacity with similar throughput but a fraction of the IOPS.
While ARC, L2ARC and a ZIL can help, in the end only raw, real disk power counts.
I guess it depends on whether you're using it for local/AIO storage or a SAN. I could see 12TB raw / 6TB usable utilized with a handful of host systems.

I look at a goal of 1TB per host with 24-28 cores; I'd likely see 500-700GB utilized. For enterprise SSD that's still really cheap, especially if they're not write-intensive... For instance 24x 120GB S3500 would yield 1.4TB and cost $1600 or less. That's some killer performance that should withstand most any workload, and has the capacity needed per host. 300GB for $100 (common) * 8 = 1.2TB for $800, even cheaper... slightly less (or nearly the same) performance due to fewer vdevs, but the larger-capacity drives perform better.

300-400GB is the "sweet spot" for SSD price and performance right now, with 6-8x mirror pools.


Again this is all based on not letting the VMs slow down to a crawl if they're "all" doing something.
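
A small helper for the cost-per-usable-TB arithmetic in this post; the drive prices and sizes are just the rough figures quoted above, treated as assumptions rather than current market data.

```python
# Cost / usable-capacity comparison for pools of striped 2-way mirrors.

def mirror_pool(n_drives, drive_gb, drive_price):
    vdevs = n_drives // 2                 # 2-way mirrors
    usable_tb = vdevs * drive_gb / 1000
    cost = n_drives * drive_price
    return usable_tb, cost, cost / usable_tb

configs = [
    ("24 x 120GB S3500 ", 24, 120, 65),   # ~$1600 for 24, per the post
    (" 8 x 300GB (used)",  8, 300, 100),  # ~$100 each, per the post
]
for name, n, gb, price in configs:
    usable, cost, per_tb = mirror_pool(n, gb, price)
    print(f"{name}: {usable:.2f} TB usable, ${cost}, ~${per_tb:,.0f}/usable TB")
```

Either way the per-host target lands around 1.2-1.4 TB usable; the main difference is vdev count, and therefore IOPS.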
 

gea

Well-Known Member
Dec 31, 2010
3,161
1,195
113
DE
It's always a compromise, even if you skip desktop parts:

- larger SSDs are faster than smaller ones
- mirrors are faster than RAID-Z regarding IOPS
- Intel DC S3700/P3700 are faster than an S3500 or a Samsung PM

If one wants around 8-10 TB of usable SSD capacity:
The fastest solution of all would be a 5 x mirror of 2 TB P3700s
(OK, you will have problems finding 10 PCIe slots, but just for the math),
but this is around 30,000 Euro and you cannot hot-replace a PCIe device
(hot-replaceable NVMe is the future).

A more price-sensitive compromise with nearly similar performance would be
the Intel S3610, the cheaper variant of the 3700, where you need
12 x 1.6 TB drives in mirrors, which gives you around 10 TB for 14,000 Euro.

If you use a RAID-Z2 of 10 x 1.2 TB Intel S3610 instead, you need
around 9,000 Euro for about the same sequential performance but fewer
IOPS ("only" around 28k at 4k random write, the value of a single S3610).

If your workload is more read-oriented, you may select the Intel S3510
or Samsung SM/PM 863 instead.

A 5 x mirror of 10 x Samsung PM863 1.9 TB also costs around 9,000 Euro,
and the question we are currently discussing is: is this pool faster or
slower than the RAID-Z of faster Intel S3610s?

The cheapest solution with these enterprise SSDs would be a RAID-Z2
of 10 x 960 GB Samsung PM863 at around 4,000 Euro.

The middle way:
n x RAID-Z2 with 6 SSDs per vdev, where price and IOPS fall between mirrors and a single-vdev Z2,
e.g. 12 x 960 GB in 2 x RAID-Z2 with 6 SSDs per vdev
e.g. 24 x 480 GB in 4 x RAID-Z2 with 6 SSDs per vdev

I would skip the smaller SSDs, as they are slower than the larger ones
and you would need more vdevs just to compensate for the slower SSDs.
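
For readability, the same options can be tabulated; usable capacity follows the redundancy math (mirrors lose half, RAID-Z2 loses two disks per vdev) and the Euro figures are simply the approximate prices quoted above, taken as assumptions.

```python
# Usable capacity and rough price for the SSD layouts discussed above.
# Prices are the approximate Euro figures from the post, not live quotes.

def mirror_usable(drives, tb_each, way=2):
    return (drives // way) * tb_each

def raidz2_usable(drives_per_vdev, tb_each, vdevs=1):
    return (drives_per_vdev - 2) * tb_each * vdevs   # 2 parity disks per vdev

options = [
    ("5 x mirror, 2.0TB P3700 (NVMe)",   mirror_usable(10, 2.0),          30000),
    ("6 x mirror, 1.6TB S3610",          mirror_usable(12, 1.6),          14000),
    ("1 x RAID-Z2, 10 x 1.2TB S3610",    raidz2_usable(10, 1.2),           9000),
    ("5 x mirror, 1.9TB PM863",          mirror_usable(10, 1.9),           9000),
    ("1 x RAID-Z2, 10 x 960GB PM863",    raidz2_usable(10, 0.96),          4000),
    ("2 x RAID-Z2, 6 x 960GB per vdev",  raidz2_usable(6, 0.96, vdevs=2),  None),
    ("4 x RAID-Z2, 6 x 480GB per vdev",  raidz2_usable(6, 0.48, vdevs=4),  None),
]
for name, usable_tb, eur in options:
    price = f"~{eur:,} EUR" if eur else "price not given"
    print(f"{name:36} {usable_tb:5.2f} TB usable  {price}")
```

All of these land in the same 7.5-10 TB usable band; what changes is the IOPS (vdev count) and the price.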
 

unwind-protect

Active Member
Mar 7, 2016
416
156
43
Boston
Mirrors are fine. There is a huge advantage for robustness there. You can rip any single drive out of a mirror and have a usable, full copy of the data.

Speedups can and usually should happen for reads. For writes you could speed things up in exchange for safety, but I haven't seen that done anywhere.

I do not recommend mirrors on top of same-type SSDs. The theory that a write pattern which is identical across all drives might trigger a particular problem of a particular SSD model on every drive at once is very real to me. And there isn't any disadvantage to non-homogeneous (mixed-drive) mirrors and raids if you use ZFS's internal mechanisms for them.
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
7,640
2,058
113
Mirrors are fine. There is a huge advantage for robustness there. You can rip any single drive out of a mirror and have a usable, full copy of the data.

Speedups can and usually should happen for reads. For writes you could speed things up in exchange for safety, but I haven't seen that done anywhere.

I do not recommend mirrors on top of same-type SSDs. The theory that a write pattern which is identical across all drives might trigger a particular problem of a particular SSD model on every drive at once is very real to me. And there isn't any disadvantage to non-homogeneous (mixed-drive) mirrors and raids if you use ZFS's internal mechanisms for them.
Still waiting to be shown all these issues from using the 'same' SSD in arrays/pools... this is what's done with huge numbers of disks at the enterprise level, at the data/service-provider level, etc. There's a much larger number of people using the same SSDs without issue. I also believe every solution provider, from entry level to enterprise, sells their chassis loaded with identical drives.

I'd like you to show me not only all these issues but vendors that offer chassis with say 50% Intel SSD and 50% OCZ... they don't exist.
 
  • Like
Reactions: rubylaser

PigLover

Moderator
Jan 26, 2011
3,186
1,545
113
...I do not recommend mirrors on top of same-type SSDs. The theory that a write pattern which is identical across all drives might trigger a particular problem of a particular SSD model on every drive at once is very real to me. And there isn't any disadvantage to non-homogeneous (mixed-drive) mirrors and raids if you use ZFS's internal mechanisms for them.
I really do wish you'd step up and defend this strange idea of yours. When challenged in the other thread you chose to just stop posting and ignore it. But really - I believe you are spreading complete FUD here and need to stop it. This site is a forum for serious, fact based discussion - not random strange and very wrong ideas that will lead people astray.

If you can provide any defensible support for this idea I'll back off. But so far - even though asked over and over and over - you've provided none.
 

RyC

Active Member
Oct 17, 2013
359
88
28
Respectfully, on the contrary Slyek, I believe T_Minus and PigLover are in fact raising serious concerns in a productive and non-confrontational way. There is serious danger when FUD is spread, because people who don't know any better read it, wrongly believe it, and harm themselves. And when it's spread continually, without defense or explanation, there is no debate or discussion! If he doesn't want to share his rationale, he shouldn't post his suggestion, especially after being asked about it multiple times in several topics. We should always be prepared to explain our decisions, especially on a forum such as this, because, as you said, we are eager to learn, and to learn why!

We're nowhere near the climate of the FreeNAS forums, and I don't believe posts like this tilt us towards it in the least. I wish we could be buddy-buddy and not serious all the time, my friend, but sometimes people just need to be called out!
 
Apr 2, 2015
48
1
8
46
I'd like you to show me not only all these issues but vendors that offer chassis with say 50% Intel SSD and 50% OCZ... they don't exist.
You gotta admit T_Minus, it would make an interesting marketing idea. Manufacturing defects are a real thing, and you could say that one could manifest at the same time on two devices from the same production batch. But it's a general thing and definitely does not apply only to SSDs; actually, I think spinners could manifest this way more often.

But yeah, it would be interesting to see a professional NAS marketed with at least two different device types, to cater to people scared of this possibility :)

 

Patrick

Administrator
Staff member
Dec 21, 2010
12,513
5,804
113
Sorry - if you are using modern, non-consumer drives just about nobody orders servers like this. (I am sure there are a handful of people so I do not want to say nobody.)

I asked three different major SSD vendors over drinks today at Computex and the answer was that it does not happen.

I do mix drives in the STH Ceph cluster nodes but that is more of a function of getting drives inexpensively at the capacity points I want in each node. Distributed file systems are different since you are unlikely to have the same simultaneous write patterns on drives in the cluster. All of the ZFS mirrors I run use the same drives.
 

mackle

Active Member
Nov 13, 2013
221
40
28
If I'd mixed HDDs when I built my Raid-5 array of Seagate 7200.12's back in the day I might have experienced half the failures... [Any excuse to bring up that situation]. But that's very different to the theory posited above.
 

ttabbal

Active Member
Mar 10, 2016
747
207
43
47
I haven't done SSD arrays, but with spinners I've done both all-the-same and mix-and-match. Never had a problem either way, so long as I thoroughly test the drives. On SSDs I would be even less concerned about it, particularly with used units. The built-in wear leveling is likely randomized at least a little by the drive firmware, as is the order in which flash cells fail, so write patterns won't be exactly the same even in a mirror. With used units, the existing data use should push them even further out of sync, if they ever were in sync.

All this is based on ZFS. I've read some people say that at least older hardware RAID required identical drives. ZFS doesn't seem to care either way, all that matters is that replacements be >= the old drive size. New arrays use the lowest common size. I've even done mirrors with a 120GB and a 500GB. The 500 was a replacement and just happened to be the only available known good drive I had sitting about when one side of the mirror died. It worked fine.

Oh... the Seagate 7200.12 debacle ... I feel for you. If it makes you feel any better, I had about 6 of the infamous 75GXPs die in the same month. The others were burned in effigy. :)
 

Terry Kennedy

Well-Known Member
Jun 25, 2015
1,142
594
113
New York City
www.glaver.org
I've read some people say that at least older hardware RAID required identical drives.
There were lots of drive-level features to support older RAID controllers, or even non-RAID controllers. First was Rotational Position Sensing, which was an option* on washing-machine-sized drives. It told the controller how far away the first sector was. This was pretty important when the disk was only spinning at 1200 to 1500 RPM. CPU speed increased far more rapidly than disk speed, so even when disk speed was standardized at 2400 and 3600 RPM it was still important.

Somewhat later, Spindle Sync was introduced. This caused all drives to operate at exactly the same speed. It was beneficial because there were usually multiple drives on the (SMD / SCSI / whatever) data bus and the system could order writes so that a multi-sector write could be performed across the drive array without "blowing a rev". This could mean the difference between being able to write the data in one revolution vs. multiple revolutions.

The last use for Spindle Sync was to get updated firmware from drive vendors that normally didn't want to make updates available. You could get a drive back from an RMA with a newer firmware revision and you were normally out of luck. But going "I need spindle sync!" was pretty much guaranteed to get you the new firmware for your other drives.

* You might think this was something silly to charge for. But "way back when", Direct Seek was an option. If you didn't buy the option, every time the drive did a seek it would recalibrate to Track 0 first, then seek to the desired track. So if you were on track 117 and wanted 118, it actually went 117 - 0 - 118. Totally clobbered performance.
 
  • Like
Reactions: Fritz

unwind-protect

Active Member
Mar 7, 2016
416
156
43
Boston
Unwind my friend, if you are reading this, try to give a bit of clarity to your position eh? You haven't responded as yet, but we would like to hear about your thoughts if possible.
Well, unfortunately I'm not (yet) in the habit of saving all web links.

To understand my position on homogeneous mirrors on SSDs, you first need to understand that, in my experience, SSDs practically never die because their expected lifetime of writes is exceeded. I have a stack of SSDs that I killed (some by accident, some deliberately) that all had their controllers (the ones on the SSD) die under specific "non-expected" write patterns. Doing that was pretty easy, and with every new generation everybody came out saying that those problems were well known and now fixed in the current generation. Then the hyped Intel SSDs that were all the rage behaved the same way for me, and I'd had it. The Samsung 850 was the first one that I couldn't kill within a day, but somebody over at Tomshardware did discover a pattern to do that. As I mentioned elsewhere, I have non-fatal problems with the Samsung 850, too.

So, if you believe, as I do, that SSDs can be killed by specific write patterns, then very obviously you don't want to build a mirror out of drives that will all die on the same write pattern.

I do remember that both those patterns and the "don't do RAID-1 on SSDs" issue were popular enough to be Slashdotted at some point. Today people just say "RAID is not a backup, what did you expect" when people lose a critical number of drives in an array, so no real discussion happens anymore.