storage gold rush?

funkywizard

mmm.... bandwidth.
Jan 15, 2017
847
400
63
USA
ioflood.com
Guess I'm not going to argue with you on that. You're only staying at the surface, talking about a drive "failure" without caring to discuss what a "failure" actually is or why it happens. And you don't even seem to understand the difference between clearing SMART, clearing the G-list, and clearing pending sectors. I'll stop here.
The issue is that the risk of a future failure increases dramatically on a drive that has ever experienced these errors, so such a drive should not be used in any environment where a drive failure would inconvenience you.
 

msg7086

Active Member
May 2, 2017
404
147
43
35
SMH. I typed a lot but deleted it all. I shouldn't waste time explaining how HDDs work here. I'll leave it to someone else.
 

TedB

Active Member
Dec 2, 2016
123
33
28
44
SMH. I typed a lot but deleted it all. I shouldn't waste time explaining how HDDs work here. I'll leave it to someone else.
I think that what @funkywizard is trying to say is that, according to Backblaze, once an HDD gets a bad sector it is statistically more likely to fail. Backblaze is highly regarded for providing reliable statistics about hard drive reliability.

@funkywizard, can you send a link to a Backblaze document that makes this claim?
 
  • Like
Reactions: funkywizard

msg7086

I think that what @funkywizard is trying to say is that, according to Backblaze, once an HDD gets a bad sector it is statistically more likely to fail. Backblaze is highly regarded for providing reliable statistics about hard drive reliability.
Now, this claim IS true, because having bad sectors and failing are indeed correlated events.

However, claiming that a drive is going to fail BECAUSE a bad sector appeared is wrong. Bad sectors can appear for many reasons, and only some of them are related to disk failure.

A man can have cancer or a cold. Saying that a man who gets sick is more likely to die soon could be true, but saying that curing a cold would cover up his cancer makes no sense. The failure needs to be triaged, which is why I suggested a full wipe, to see whether it's a natural bitflip or a failing sector.
 

funkywizard

I think that what @funkywizard is trying to say is that, according to Backblaze, once an HDD gets a bad sector it is statistically more likely to fail. Backblaze is highly regarded for providing reliable statistics about hard drive reliability.

@funkywizard, can you send a link to a Backblaze document that makes this claim?
Correct.


What Is a Failure?

Backblaze counts a drive as failed when it is removed from a Storage Pod and replaced because it has 1) totally stopped working, or 2) because it has shown evidence of failing soon.
From experience, we have found the following five SMART metrics indicate impending disk drive failure:
  • SMART 5: Reallocated_Sector_Count.
  • SMART 187: Reported_Uncorrectable_Errors.
  • SMART 188: Command_Timeout.
  • SMART 197: Current_Pending_Sector_Count.
  • SMART 198: Offline_Uncorrectable.

We chose these five stats based on our experience and input from others in the industry because they are consistent across manufacturers and they are good predictors of failure.

Annual failure rate of drives with 0 pending sectors -- 0%
Annual failure rate of drives with 1 or 2 pending sectors -- 34%
Annual failure rate of drives with 2 - 6 pending sectors -- 68%
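For reference, those five attributes can be read with smartmontools; here is a minimal sketch that parses a `smartctl -A`-style report. The sample report text and the replace-on-any-nonzero rule are illustrative (they echo the thread's rule of thumb, not Backblaze's actual tooling):

```python
# Sketch: flag a drive based on the five SMART attributes listed above.
# The sample report text is hypothetical; real output comes from `smartctl -A /dev/sdX`.

SAMPLE_REPORT = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Always       -       0
"""

WATCHED_IDS = {5, 187, 188, 197, 198}

def parse_smart_table(text):
    """Return {attribute_id: raw_value} for the watched attributes."""
    raw = {}
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0].isdigit():
            attr_id = int(fields[0])
            if attr_id in WATCHED_IDS:
                raw[attr_id] = int(fields[-1])
    return raw

def should_replace(raw):
    """Per the thread's rule of thumb: any nonzero watched attribute means replace."""
    return any(v > 0 for v in raw.values())

values = parse_smart_table(SAMPLE_REPORT)
print(values)                   # attribute id -> raw value
print(should_replace(values))   # True (one pending sector in the sample)
```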


So it's not so much "someone has a cold and then recovers from it." It's more like "someone got shot in the head and then you put a band-aid over it."
 

Brian Puccio

Member
Jul 26, 2014
68
32
18
39
Correct.

Annual failure rate of drives with 0 pending sectors -- 0%
Annual failure rate of drives with 1 or 2 pending sectors -- 34%
Annual failure rate of drives with 2 - 6 pending sectors -- 68%


So it's not so much "someone has a cold and then recovers from it." It's more like "someone got shot in the head and then you put a band-aid over it."
If we’re going to use health analogies, a better one might be “you can get cancer and keep on living, but your chances of dying just went up drastically, even with chemo”.
 

TedB

I believe @funkywizard's claim is that, based on Backblaze's large sample, a single bad sector gives a statistically high chance of the drive failing. From a statistical point of view, the cause of the bad sector is therefore irrelevant.

@msg7086 Many years ago I had a scientific and engineering approach similar to yours, but after a few decades in the business I trust statistics as well :)

That said, I agree this drive might work for many years without a failure or any additional bad sectors. It all depends on what he wants to use it for. Chia mining or storing plots is a perfect example where a few bad blocks can't hurt.
 

msg7086

So it's not so much "someone has a cold and then recovers from it." It's more like "someone got shot in the head and then you put a band-aid over it."
A pending is a pending, meaning unconfirmed. To use the gunshot analogy: a pending means you blindly shot at someone far away but have no idea whether he was actually hit. So you walk up to the person and check whether there's actually a hole in the body.

A pending sector is a sector that fails its ECC checksum. That's it, no more and no less. As for why there is a checksum error, there are many possible reasons. A random bitflip can cause one, and that's exactly why we have ECC to correct such issues. The easiest way to get a bitflip is to let the drive sit for a long time and wait for a bit to flip by itself.

Every write to a sector refreshes it (in your words, "covering a failure"). Sectors that are never rewritten accumulate bitflips and eventually cause a checksum error. You may find a bad sector simply because it was last written five years ago.

In data centers, there's less chance for data to sit still for years without any writes. Data is constantly changing and flowing, and drives are constantly being repurposed. For example, whenever someone terminates their service, you wipe the drives, essentially refreshing the sectors (in your words, "covering a failure"). You won't see this kind of error unless you do a read test BEFORE a full wipe, because ONLY reads detect a bitflipped sector. Even so, data rarely stays put long enough for these errors to appear.
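The read-before-wipe point can be sketched with a toy model (all rates and sizes invented): sectors silently accumulate bitflips while they sit unwritten, only a read scan surfaces the ones whose errors exceed ECC capacity as "pending," and a full wipe rewrites every sector and clears the evidence:

```python
import random

random.seed(1)

ECC_CORRECTABLE_BITS = 50   # invented ECC capacity, echoing the 50-bit example later in the thread
SECTORS = 1000

# bit_errors[i] = accumulated flipped bits in sector i since it was last written
bit_errors = [0] * SECTORS

def age(years):
    """Toy model: each sector picks up a few random bitflips per year it sits unwritten."""
    for _ in range(years):
        for i in range(SECTORS):
            bit_errors[i] += random.randint(0, 20)

def read_scan():
    """Only a read notices sectors whose errors exceed ECC capacity ('pending')."""
    return [i for i, e in enumerate(bit_errors) if e > ECC_CORRECTABLE_BITS]

def wipe():
    """A full wipe rewrites every sector, resetting accumulated bitflips."""
    for i in range(SECTORS):
        bit_errors[i] = 0

age(5)
pending_before = read_scan()     # a read test BEFORE the wipe surfaces the rot
wipe()
pending_after = read_scan()      # after the wipe, the evidence is gone
print(len(pending_before) > 0, len(pending_after) == 0)
```

Reading first and wiping second reveals the weak sectors; wiping first hides them, which is the whole dispute in this thread.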

Do you always wipe customers' drives after they cancel service? If so, you are doing exactly what I described above, in your data centers, on your servers, every day, every week. In that scenario, if you still find a pending sector even after wiping the drives so many times over the years, then yes, they are weak and will fail soon.

But that ONLY applies to data center scenarios.
 
  • Like
Reactions: Jellyfish

msg7086

Also, that's an advantage of SAS drives: SAS drives count correctable errors and uncorrectable errors separately, so you can clearly see which sectors were corrected and which were not.

Assume an ECC code can correct 50 bits. On SATA drives, every sector with fewer than 50 broken bits is considered "good," and any sector with more than 50 broken bits is a "bad" sector. On SAS drives, however, you can clearly see the corrected-sector count increase even while all those sectors are still considered "good."
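The 50-bit example works out like this (all numbers illustrative):

```python
ECC_CORRECTABLE_BITS = 50  # illustrative capacity from the example above

# broken-bit counts for a handful of hypothetical sectors
sectors = [0, 3, 49, 50, 51, 120]

# SATA view: only sectors beyond ECC capacity surface as "bad"
sata_bad = [b for b in sectors if b > ECC_CORRECTABLE_BITS]

# SAS view: separate counters for corrected and uncorrectable sectors
sas_corrected   = [b for b in sectors if 0 < b <= ECC_CORRECTABLE_BITS]
sas_uncorrected = [b for b in sectors if b > ECC_CORRECTABLE_BITS]

print(sata_bad)         # [51, 120]
print(sas_corrected)    # [3, 49, 50] -- still "good", but visibly degrading
print(sas_uncorrected)  # [51, 120]
```

The SATA view reports only the last list; the SAS view exposes the middle one too, which is the early-warning signal being described.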
 

funkywizard

A pending is a pending, meaning unconfirmed. To use the gunshot analogy: a pending means you blindly shot at someone far away but have no idea whether he was actually hit. So you walk up to the person and check whether there's actually a hole in the body.

A pending sector is a sector that fails its ECC checksum. That's it, no more and no less. As for why there is a checksum error, there are many possible reasons. A random bitflip can cause one, and that's exactly why we have ECC to correct such issues. The easiest way to get a bitflip is to let the drive sit for a long time and wait for a bit to flip by itself.

Every write to a sector refreshes it (in your words, "covering a failure"). Sectors that are never rewritten accumulate bitflips and eventually cause a checksum error. You may find a bad sector simply because it was last written five years ago.

In data centers, there's less chance for data to sit still for years without any writes. Data is constantly changing and flowing, and drives are constantly being repurposed. For example, whenever someone terminates their service, you wipe the drives, essentially refreshing the sectors (in your words, "covering a failure"). You won't see this kind of error unless you do a read test BEFORE a full wipe, because ONLY reads detect a bitflipped sector. Even so, data rarely stays put long enough for these errors to appear.

Do you always wipe customers' drives after they cancel service? If so, you are doing exactly what I described above, in your data centers, on your servers, every day, every week. In that scenario, if you still find a pending sector even after wiping the drives so many times over the years, then yes, they are weak and will fail soon.

But that ONLY applies to data center scenarios.
Pending sectors are a huge indicator of impending failure. It's like drowning in a river of your own blood: it doesn't point to one exact medical problem (you can't see the gunshot wound, only the river of blood), but who cares? You're about to bleed to death.

Going from a 0% annual failure rate to above 30% from a single pending sector is about as bad as it gets. It's hard to imagine anything short of an outright complete failure that would need to be taken more seriously.
 

funkywizard

Also, that's an advantage of SAS drives: SAS drives count correctable errors and uncorrectable errors separately, so you can clearly see which sectors were corrected and which were not.

Assume an ECC code can correct 50 bits. On SATA drives, every sector with fewer than 50 broken bits is considered "good," and any sector with more than 50 broken bits is a "bad" sector. On SAS drives, however, you can clearly see the corrected-sector count increase even while all those sectors are still considered "good."
Sadly, SAS drives provide you with far less data than SATA drives. Uncorrectable errors are extremely important (if we get even one, we RMA or toss the drive), but SAS drives don't report pending writes the way SATA drives do, so you miss out on extremely important indications of impending failure.

Correctable errors are far less of an issue. You don't want a sky-high number of them either, but it's not a metric I pay attention to.

I mostly look at the "dead man walking" statistics, and pending sectors are one of them.
 

funkywizard

A pending is a pending, meaning unconfirmed. To use the gunshot analogy: a pending means you blindly shot at someone far away but have no idea whether he was actually hit. So you walk up to the person and check whether there's actually a hole in the body.

A pending sector is a sector that fails its ECC checksum. That's it, no more and no less. As for why there is a checksum error, there are many possible reasons. A random bitflip can cause one, and that's exactly why we have ECC to correct such issues. The easiest way to get a bitflip is to let the drive sit for a long time and wait for a bit to flip by itself.

Every write to a sector refreshes it (in your words, "covering a failure"). Sectors that are never rewritten accumulate bitflips and eventually cause a checksum error. You may find a bad sector simply because it was last written five years ago.

In data centers, there's less chance for data to sit still for years without any writes. Data is constantly changing and flowing, and drives are constantly being repurposed. For example, whenever someone terminates their service, you wipe the drives, essentially refreshing the sectors (in your words, "covering a failure"). You won't see this kind of error unless you do a read test BEFORE a full wipe, because ONLY reads detect a bitflipped sector. Even so, data rarely stays put long enough for these errors to appear.

Do you always wipe customers' drives after they cancel service? If so, you are doing exactly what I described above, in your data centers, on your servers, every day, every week. In that scenario, if you still find a pending sector even after wiping the drives so many times over the years, then yes, they are weak and will fail soon.

But that ONLY applies to data center scenarios.
We do in fact wipe drives whenever they get removed from a server. If there is a pending sector -before- the wipe, the drive has failed our testing process. If there is a pending sector -after- the wipe, the drive has also failed our testing process.

It would be wildly inappropriate to use one of these drives to hold customer data. Even if someone didn't care about customer data, do you really want to have to replace the drive in the middle of the night when it fails (which it will)? No.
 

msg7086

If there is a pending sector -before- the wipe, the drive has failed our testing process. If there is a pending sector -after- the wipe, the drive has also failed our testing process.
See, a pending sector can only be detected by a read. Without a full read, how many hidden failures have you "covered"?

When I refurbish drives for myself, I always do a full latency and read test, and then a full wipe. Only with this order of procedures would you know the real condition of the drives. You claim that you care a lot about drive healthy, but you are not even doing it correctly. Doing a full wipe before a full read, is like throwing a bottle of anti-biotics on a patient without even trying to do a blood test. You not only misunderstood the process of treating, but also misunderstood the process of diagnosing. Relying on observing a pending sector before a wipe, is like buying a lottery. The former customer must be so lucky to read the broken sector to trigger a pending sector. If a broken sector is never read, a pending error is never raised.

Let me rephrase it. Someone may or may not be sick. If someone has a fever (a pending sector), you diagnose it (do a latency test) and try to treat it with anti-biotics (do a wipe), and check their temperature again. Even someone doesn't have a fever (no pending sectors) they could be in suboptimal health, and you can still diagnose it. Now you were claiming that giving anti-biotics can cover a hidden cancer, so we shouldn't treat it, we should simply claim the patient is gonna die soon.

And in the quote, you are basically blindly throwing anti-biotics to anyone who's visiting, and claim that they are healthy without knowing how to correctly check them.

Don't get me wrong, in businesses it's common to do things like this, because this seems to be most economical. You can just hire some guy for cheap, to look at the numbers. 0, good, 3, bad, job done, easy peasy. You can't expect them to have 5 years of hard drive diagnosing experience on the resume, right?

There are of course other business strategies, like all servers are decommissioned after support period, all hard drives are drilled to pieces to prevent data leaking, etc etc.

But again, those business strategy don't apply to home users, or individual cases. I have no problem with you doing so in your data centers, but if you walk into a doctor and ask why he's bothered to do a blood test or a CT scan or give anti-biotics instead of just burying the patient in the backyard, it would be a silly move.
 
  • Like
Reactions: Jellyfish and NateS

funkywizard

See, a pending sector can only be detected by a read. Without a full read, how many hidden failures have you "covered"?

When I refurbish drives for myself, I always do a full latency and read test first, and then a full wipe. Only in that order do you learn the real condition of the drives. You claim to care a lot about drive health, but you are not even doing it correctly. Doing a full wipe before a full read is like throwing a bottle of antibiotics at a patient without even running a blood test. You misunderstand not only the treatment but also the diagnosis. Relying on observing a pending sector before a wipe is like buying a lottery ticket: the former customer must be lucky enough to read the broken sector to trigger a pending sector. If a broken sector is never read, a pending error is never raised.

Let me rephrase. Someone may or may not be sick. If someone has a fever (a pending sector), you diagnose it (run a latency test), try to treat it with antibiotics (do a wipe), and then check their temperature again. Even if someone doesn't have a fever (no pending sectors), they could still be in suboptimal health, and you can still diagnose that. Now you are claiming that giving antibiotics can cover up a hidden cancer, so we shouldn't treat the patient; we should simply declare that he is going to die soon.

And in the quote, you are basically blindly throwing antibiotics at everyone who visits, and claiming they are healthy without knowing how to check them correctly.

Don't get me wrong: in business it's common to do things this way, because it seems the most economical. You can hire someone cheap to look at the numbers: 0, good; 3, bad; job done, easy peasy. You can't expect them to have five years of hard drive diagnostics on their resume, right?

There are of course other business strategies, such as decommissioning all servers after the support period, or drilling every hard drive to pieces to prevent data leaks.

But again, those business strategies don't apply to home users or individual cases. I have no problem with you doing this in your data centers, but walking into a doctor's office and asking why he bothers with a blood test, a CT scan, or antibiotics instead of just burying the patient in the backyard would be a silly move.
The pending sector is a pending write, meaning a write attempt that never completed. That is why wiping the drive erases evidence of the failure, provided the more recent attempt to write to that sector succeeds.
 

NateS

Active Member
Apr 19, 2021
159
88
28
Sacramento, CA, US
Pending sectors are a huge indicator of impending failure. It's like drowning in a river of your own blood: it doesn't point to one exact medical problem (you can't see the gunshot wound, only the river of blood), but who cares? You're about to bleed to death.

Going from a 0% annual failure rate to above 30% from a single pending sector is about as bad as it gets. It's hard to imagine anything short of an outright complete failure that would need to be taken more seriously.
The distribution of that 30% matters a lot, though. There's a big difference between every drive individually having a 30% chance of failure within a year, and 30% of the drives having a 100% chance of failure within a year while the remaining 70% have no elevated chance at all.
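The fleet-level math behind that distinction can be written out (toy numbers, including a made-up 1% baseline rate for the non-doomed drives):

```python
FLEET = 1000  # hypothetical fleet of drives that each showed one pending sector

# Model A: every flagged drive independently has a 30% annual failure chance.
expected_a = FLEET * 0.30

# Model B: 30% of flagged drives are doomed (100% chance), the rest see only a
# 1% baseline rate (an invented figure for illustration).
doomed = int(FLEET * 0.30)
healthy = FLEET - doomed
expected_b = doomed * 1.00 + healthy * 0.01

print(expected_a)  # 300.0
print(expected_b)  # 307.0
# Nearly identical fleet-level failure counts, yet under model B a follow-up
# "biopsy" test could clear 70% of the drives entirely.
```

The aggregate statistics cannot distinguish the two models; only per-drive diagnosis can, which is the point being made here.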

To expand on the medical analogies a bit further, finding a pending sector is a bit like finding a lump, and when you go to the doctor, they tell you there's a 30% chance it's terminal cancer, and a 70% chance it's just a benign cyst. Doing some extra testing like msg7086 is suggesting is like doing a biopsy to find out if it's actually cancer or not.
 

funkywizard

Awesome. Now we know where the misunderstanding comes from. I'll stop here and let others correct you about this.
Even if you're right, you're intentionally missing the point.

If the data ever shows >0 errors of that type, the drive should be treated as though it could fail at any time, as it probably will.
 

funkywizard

The distribution of that 30% matters a lot though. There's a big difference between each drive having an individual 30% chance of failure within a year and 30% of the drives having 100% chance of failure within a year, while the remaining 70% of drives have no elevated chance of failure.

To expand on the medical analogies a bit further, finding a pending sector is a bit like finding a lump, and when you go to the doctor, they tell you there's a 30% chance it's terminal cancer, and a 70% chance it's just a benign cyst. Doing some extra testing like msg7086 is suggesting is like doing a biopsy to find out if it's actually cancer or not.
It really doesn't matter. You go from a 0% annualized failure rate with zero errors of this type, to 30% with one error, to 60% with more. If you want to argue semantics at this point, be my guest, but you're missing the point: any drive with more than zero errors of that type should be replaced immediately if you care about the data stored on it or the uptime of the system it's connected to.
 

msg7086

Even if you're right, you're intentionally missing the point.

If the data ever shows >0 errors of that type, the drive should be treated as though it could fail at any time, as it probably will.
That's not even what we are discussing. Every drive should be treated as though it could fail at any time, as it probably will. I never missed that point, because we were never arguing about it. My reply offered a way to answer the question "is that fatal?", and you came to confront me over my troubleshooting process. What does troubleshooting have to do with how a drive is treated? I don't get it; what exactly is your point?
 

funkywizard

That's not even what we are discussing. Every drive should be treated as though it could fail at any time, as it probably will. I never missed that point, because we were never arguing about it. My reply offered a way to answer the question "is that fatal?", and you came to confront me over my troubleshooting process. What does troubleshooting have to do with how a drive is treated? I don't get it; what exactly is your point?
The question was "should I treat this drive as good enough and use it like any other drive?" and the answer is a resounding "No!"