ServeTheHome's RAID MTTDL Calculator Bug and Suggestion Box

Patrick

Administrator
Staff member
Dec 21, 2010
11,905
4,865
113
Starting a thread to discuss bugs in the RAID mean time to data loss (MTTDL) calculator on ServeTheHome. We do need help testing, so any assistance is greatly appreciated.
 

Patrick

Administrator
Staff member
Dec 21, 2010
11,905
4,865
113
Found an "undocumented feature" this morning when migrating to production. For those wondering, this is something you can use to figure out ballpark data loss chances if you have, say, an 8-drive array and are choosing between RAID 5 and RAID 6.
 

Jeggs101

Well-Known Member
Dec 29, 2010
1,484
222
63
that is awesome! holy cr@p! i haven't ever seen something online like that. i know it isn't a fullblown simulation model but makes enough sense to be useful. really thanks for having this done.
 

mobilenvidia

Moderator
Sep 25, 2011
1,767
60
48
New Zealand
Hmmm, it's working too well, I may have to rethink my RAID5 array and go RAID6 now.
This calculator is costing me, I will need to get 2 more drives now.

Was cheaper not to see this :)

Working well and makes one think, and possibly rework a setup
 

dwm

New Member
Dec 9, 2012
4
1
3
Michigan
Just explaining the units would be helpful. I assume the second table is probability on a scale from 0 to 100 (i.e. percent chance)? Is that for loss or failure?

Would be nice to see raidz1 and raidz2 in the tables even if the model is the same/similar as others, just to avoid newbie confusion ("Why raidz3 and no raidz2 or raidz1?").

Thanks for the work!
 

Thatguy

New Member
Dec 30, 2012
45
0
0
The option 'Volumes' does not work the way I think it should, at least for raidz3

If, for example, I have 36 drives (four 8-drive RAIDZ3s and 4 hot spares), I should be able to say that my 36-drive RAIDZ3 is 4 volumes, or something like that.

Maybe I'm just lazy :)
 

Maltz

New Member
Jan 7, 2013
1
0
0
I LOVE this calculator!!

But... I think it may be VASTLY underrating the importance of uncorrectable bit error rate. Take this extreme example:

RAID5
MTBF: 36.5k hrs
UBER: 10^6
2TB drives
4 drives
1 volume
15MB/s rebuild rate

Now, unless I'm mis-reading the table, it's estimating 59.64 years before data loss for a RAID5 array. That doesn't make sense at all. When the first HD dies (~4 years) with an UBER of 10^6, you're pretty much guaranteed to run into thousands of unrecoverable errors while reading the 6TB required to rebuild the array. Then you've lost data. The UBER isn't that important while the RAID is functioning, but in a degraded state, it's VERY important on large arrays like that.

Am I misunderstanding the table? Or is the uncorrectable error rate during a rebuild not being taken into account properly?

Great work though! This is a tremendously helpful tool, and quite a bit of googling kept taking me back here. lol All the more reason it needs to be right, though. :)
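
To put rough numbers on the rebuild argument above, here is a minimal sketch. It assumes "UBER: 10^6" means one unrecoverable error per 10^6 bits read, that errors are independent, and that a rebuild reads every surviving drive in full. The function name and the 10^14 and 10^15 comparison values are illustrative and are not taken from the calculator.

```python
import math

TB = 10**12  # decimal terabyte, as drive vendors count it

def p_ure_during_rebuild(surviving_drives, drive_bytes, bits_per_error):
    """Return (expected errors, P(at least one error)) for a full rebuild read.

    Assumes independent bit errors, so the error count is roughly Poisson.
    """
    bits_read = surviving_drives * drive_bytes * 8
    expected_errors = bits_read / bits_per_error
    # 1 - exp(-x), computed with expm1 for accuracy when x is small
    p_at_least_one = -math.expm1(-expected_errors)
    return expected_errors, p_at_least_one

# 4-drive RAID5 of 2TB drives: a rebuild reads the 3 survivors (6TB).
for exponent in (6, 14, 15):
    expected, p = p_ure_during_rebuild(3, 2 * TB, 10**exponent)
    print(f"1 error per 10^{exponent} bits: "
          f"expected errors = {expected:.3g}, P(>=1 error) = {p:.4f}")
```

Under these assumptions the 10^6 case makes an error during rebuild a practical certainty, while the other rates leave a real chance of a clean rebuild.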
 

nitrobass24

Moderator
Dec 26, 2010
1,083
127
63
TX
That seems about right to me. Granted, MTTDL is not the best representation as far as accuracy is concerned, but it should still be useful for relative comparisons across RAID levels, MTBFs, etc.

With only 3 drives in a degraded state, you are not likely to encounter a URE. Take those same numbers and bump the number of drives up to 10, and you will see that it drops drastically to less than 3 years.
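
The effect of drive count can be sketched with a simplified RAID5 model. This is not the calculator's actual model, which this thread does not describe. It assumes data loss happens when a first drive failure is followed, during the rebuild, by either a second failure or an unrecoverable read error. All function names are hypothetical, and the 1-in-10^14 error rate is an assumed comparison value.

```python
import math

HOURS_PER_YEAR = 8766  # 365.25 days

def rebuild_hours(drive_bytes, rebuild_bytes_per_s):
    """Time to rewrite one full drive at a constant rebuild rate."""
    return drive_bytes / rebuild_bytes_per_s / 3600

def raid5_mttdl_hours(n_drives, mtbf_h, drive_bytes,
                      rebuild_bytes_per_s, bits_per_error):
    """Simplified MTTDL: first failure, then a second failure or a URE."""
    mttr_h = rebuild_hours(drive_bytes, rebuild_bytes_per_s)
    survivors = n_drives - 1
    # Chance that any surviving drive fails before the rebuild finishes
    p_second = -math.expm1(-survivors * mttr_h / mtbf_h)
    # Chance of at least one URE while reading all survivors in full
    expected_ures = survivors * drive_bytes * 8 / bits_per_error
    p_ure = -math.expm1(-expected_ures)
    p_loss = 1 - (1 - p_second) * (1 - p_ure)
    first_failure_rate = n_drives / mtbf_h  # failures per hour, whole array
    return 1 / (first_failure_rate * p_loss)

# MTBF 36.5k hours, 2TB drives, 15MB/s rebuild; error rate is an assumption.
for n in (4, 10):
    hours = raid5_mttdl_hours(n, 36_500, 2 * 10**12, 15 * 10**6, 10**14)
    print(f"{n} drives: MTTDL ~ {hours / HOURS_PER_YEAR:.2f} years")
```

The absolute values will not match the calculator's output. The point is only that both the first-failure rate and the rebuild risk grow with the number of drives, so MTTDL falls faster than linearly.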

Also something to keep in mind: no one, not even the HDD manufacturers, has a clue about MTBF or UREs. If it was really scientific, do you think we would have such perfectly round numbers? No way! :)
A bunch of engineers, marketing, and accounting people got together, took the rough crap data they had, and came up with something that marketing can use and accounting/finance can use to set price points and determine warranties without putting huge liabilities on their books.

My brother-in-law works for a large semiconductor manufacturer here in Dallas, TX. Even he will tell you they test and test, but at the end of the day it's just an extrapolation from a few algorithms to come up with a pretty good guess. Other people take this guess and adjust it to make it work in the market and on the books. The chips they make for TVs are the same ones they sell to the military, but the military ones have a higher MTBF and better SLA/warranty... all for a higher price. Just how the world works.
 

Patrick

Administrator
Staff member
Dec 21, 2010
11,905
4,865
113
Planning to revamp this weekend. Off to CES. Thanks for the feedback and keep it coming!
 

matt_garman

Active Member
Feb 7, 2011
205
36
28
Is MTTDL assumed to be a normal distribution (i.e. bell curve)? Looking at the numbers, "mean" to me means I have a 50% chance of data loss after X years, depending on the parameters I input. What I'd be interested in is the standard deviation, and also higher "confidence" tiers.

In other words, 50/50 odds doesn't mean much to me. I'd like to know, how many years (or months or days) do I have e.g. 75%, 90% and 99% chance of no data loss.
 

Patrick

Administrator
Staff member
Dec 21, 2010
11,905
4,865
113
Poisson distribution. An updated MTTDL model is coming next week, which should make this clearer.
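
A minimal sketch of what that implies for the confidence tiers asked about above. It assumes data-loss events arrive as a Poisson process with a constant rate of 1/MTTDL, which makes the time to first loss exponentially distributed. That is an assumption about the model, not something confirmed here, and the function names are hypothetical. One consequence is that the chance of no data loss at t = MTTDL is about 37%, not 50%.

```python
import math

def survival_probability(t_years, mttdl_years):
    """P(no data loss by time t) under a constant loss rate of 1/MTTDL."""
    return math.exp(-t_years / mttdl_years)

def years_until_survival_drops_to(p, mttdl_years):
    """Time at which P(no data loss) has fallen to p (0 < p <= 1)."""
    return -mttdl_years * math.log(p)

mttdl = 59.64  # years, the figure quoted earlier in this thread
for p in (0.99, 0.90, 0.75, 0.50):
    t = years_until_survival_drops_to(p, mttdl)
    print(f"{p:.0%} chance of no data loss up to ~{t:.2f} years")
print(f"P(no loss at t = MTTDL) = {survival_probability(mttdl, mttdl):.3f}")
```

Under this assumption the standard deviation equals the mean, so MTTDL on its own says little about the early years. The percentile times above are the more useful numbers.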
 

matt_garman

Active Member
Feb 7, 2011
205
36
28
Another question. With Quantity of Disks = 6, the numbers for RAID1 and RAID10 look the same. Is that right?

I guess, the first question is, what does RAID1 mean with 6 disks? Does that mean three independent duplicate-copy sets? Or does that mean one set with 6x redundancy? If I reduce the number of disks to 2, then RAID1 MTTDL number goes up by about a factor of three...

For that matter, "six disk" RAID10 is ill-defined, as you could stripe across two triple-redundant RAID1 sets. :) But I'll assume the calculator uses the traditional RAID10 of striping across two-way mirrors.
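
One way to read the factor of three, sketched under assumptions the thread does not confirm: the calculator treats 6-disk "RAID1" as three independent two-way mirrors, each pair follows the textbook approximation MTTDL = MTBF^2 / (2 * MTTR), and independent sets divide the MTTDL by their count. The function names and the 37-hour repair time are illustrative.

```python
def mirror_pair_mttdl_hours(mtbf_h, mttr_h):
    """Textbook approximation for one two-way mirror (valid when MTTR << MTBF)."""
    return mtbf_h ** 2 / (2 * mttr_h)

def independent_sets_mttdl_hours(set_mttdl_h, n_sets):
    """Losing any one of n independent sets loses data, so loss rates add."""
    return set_mttdl_h / n_sets

MTBF_H = 36_500  # hours, from the example earlier in the thread
MTTR_H = 37      # hours, assumed (about 2TB rebuilt at 15MB/s)

pair = mirror_pair_mttdl_hours(MTBF_H, MTTR_H)
three_pairs = independent_sets_mttdl_hours(pair, 3)
print(f"2 disks: {pair:.3g} h, 6 disks as 3 pairs: {three_pairs:.3g} h, "
      f"ratio: {pair / three_pairs:.1f}")
```

If that reading is right, it would also explain identical RAID1 and RAID10 numbers at 6 disks, since striping across three pairs yields the same set of independent mirrors.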
 

Ron Dennison

New Member
Aug 9, 2013
1
0
0
When "other" is selected for MTBF the calculated data seems to go to a strange default which is insensitive to whatever is entered into the "Enter # for MTBF (base 10): " box