How long does it take for a dead HD to die?


Fritz

Well-Known Member
Apr 6, 2015
I have a WD WDC WD4001FYYG-0

BADHD.PNG

It's been like this for over a year now. I regularly exercise it. I've lost track of the number of times I've filled it up, dumped it, and filled it up again. I have a load of garbage set aside just for this purpose. I've done several long formats and low-level formats and have never gotten any indication of a problem. This drive has become somewhat of an enigma. How can it fail to this point and then teeter on the edge for over a year?
 

Stephan

Well-Known Member
Apr 21, 2017
Germany
Which SMART attribute has the pre-failure flag active? Hard to say without a "smartctl -ax /dev/sdX". Can you post that?
 

oneplane

Well-Known Member
Jul 23, 2021
You do indeed need the actual SMART values to know what that super generic "drive bad, scary!" warning is actually based on. It's possible that it's triggered by UDMA errors and it's simply a bad cable.
 
  • Like
Reactions: fohdeesha

Fritz

Well-Known Member
Apr 6, 2015
C:\Program Files\smartmontools\bin>smartctl -ax pd2
smartctl 6.5 2016-05-07 r4318 [x86_64-w64-mingw32-2012r2] (sf-6.5-1)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: WD
Product: WDC WD4001FYYG-0
Revision: VR07
Compliance: SPC-4
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Logical block size: 512 bytes
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x50000c0f01d138ac
Serial number: WMC1F0887091
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Tue Apr 26 14:17:10 2022 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
Read Cache is: Enabled
Writeback Cache is: Disabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: HARDWARE IMPENDING FAILURE TOO MANY BLOCK REASSIGNS [asc=5d, ascq=14]

Current Drive Temperature: 35 C
Drive Trip Temperature: 69 C

Manufactured in week 46 of year 2013
Specified cycle count over device lifetime: 1048576
Accumulated start-stop cycles: 899
Specified load-unload count over device lifetime: 1114112
Accumulated load-unload cycles: 1
Elements in grown defect list: 1821

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/     errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:      87584      608      3250      88192          613     161096.649           5
write:    138916      824      3422     139740          882      86809.261           0

Non-medium error count: 2355

SMART Self-test log
Num  Test           Status     segment  LifeTime  LBA_first_err  [SK ASC ASQ]
     Description               number   (hours)
# 1  Default        Completed        -      1028              -  [-   -    -]
# 2  Default        Completed        -      1028              -  [-   -    -]

Long (extended) Self Test duration: 31120 seconds [518.7 minutes]

Background scan results log
Status: no scans active
Accumulated power on time, hours:minutes 4926:05 [295565 minutes]
Number of background scans performed: 0, scan progress: 0.00%
Number of background medium scans performed: 0

Protocol Specific port log page for SAS SSP
relative target port id = 1
generation code = 0
number of phys = 1
phy identifier = 0
attached device type: SAS or SATA device
attached reason: unknown
reason: unknown
negotiated logical link rate: phy enabled; 6 Gbps
attached initiator port: ssp=1 stp=1 smp=1
attached target port: ssp=0 stp=0 smp=0
SAS address = 0x50000c0f01d138ae
attached SAS address = 0x5003048011bf3c01
attached phy identifier = 7
Invalid DWORD count = 0
Running disparity error count = 0
Loss of DWORD synchronization = 1
Phy reset problem = 0
Phy event descriptors:
Transmitted SSP frame error count: 0
Received SSP frame error count: 0
relative target port id = 2
generation code = 0
number of phys = 1
phy identifier = 1
attached device type: no device attached
attached reason: unknown
reason: unknown
negotiated logical link rate: phy enabled; unknown
attached initiator port: ssp=0 stp=0 smp=0
attached target port: ssp=0 stp=0 smp=0
SAS address = 0x50000c0f01d138af
attached SAS address = 0x0
attached phy identifier = 0
Invalid DWORD count = 0
Running disparity error count = 0
Loss of DWORD synchronization = 0
Phy reset problem = 0
Phy event descriptors:
Transmitted SSP frame error count: 0
Received SSP frame error count: 0


C:\Program Files\smartmontools\bin>
 

sko

Active Member
Jun 11, 2021
I've had several drives whose SMART data said "I'm perfectly fine" while they went dead for several seconds on each write or just trickled out (bad) data at a few bytes/sec. The worst drives are usually SATA, because they seem particularly reluctant to just die and instead drag the whole system down to a crawl for hours.

So yes, the overall SMART health status is absolutely useless. Look at the attributes for actual errors (bad sectors, unrecoverable errors, etc.). That said, I've often seen ZFS report checksum errors long before the drive firmware logged the first reallocated/bad sectors (SATA) or uncorrected errors (SAS). Most SAS drives seem to be a bit more honest about this, and in particular won't waste endless time trying to correct anything when they obviously only have borked data to return...

As for your specific drive: yes, this thing will give out any time soon. Lots of logged defects and insanely high load/unload cycles (I'd even say this might be the reason why the drive is dying).
Those old RE drives usually were amazingly reliable - I replaced the last 2 RE3s IIRC last year, only because they became _really_ old and I ran out of drive bays. The 500GB RE2 were absolutely immortal - I haven't replaced a single one of those because of a failure, only because they became too small. Some of them were used in clients or test systems for several years afterwards.

Edit:
I've somehow mixed up the specified/accumulated numbers for load cycles. The start/stop cycles still seem a tad high for that runtime, but nowhere near any value that might/should lead to considerable mechanical wear...
 
Last edited:
  • Like
Reactions: Fritz

Stephan

Well-Known Member
Apr 21, 2017
Germany
So, 1821 grown defects. Wow. But WD seems to have made that disk with enough reserve sectors to compensate. The trouble is, you write some file and it works for a while, but then that sector also drops dead and your file is kaputt, and you won't find out until you want to read the file again, which can be months or years later. Basically a disaster waiting to happen. ;-) A reasonably paranoid filesystem like ZFS would of course flag read or checksum errors.

Edit: Yep @sko, what's up with those insane load/unload cycles... WD Green had the same issue; you had to run a special utility to set the unload timer to something reasonable like 2 minutes instead of 5 seconds.
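To make the silent-corruption scenario above concrete, here is a minimal sketch of the checksum-on-write, verify-on-read idea that ZFS gives you for free. The file names and helper functions are made up for illustration; this is not how ZFS actually stores its checksums.

import hashlib
from pathlib import Path

# Toy version of what a checksumming filesystem does: remember a hash at
# write time and verify it at read time, so a sector that rots months
# later is detected instead of silently returning garbage.

def _sidecar(path: Path) -> Path:
    # Hypothetical naming scheme: the checksum lives next to the data file.
    return path.parent / (path.name + ".sha256")

def write_with_checksum(path: Path, data: bytes) -> None:
    path.write_bytes(data)
    _sidecar(path).write_text(hashlib.sha256(data).hexdigest())

def read_with_checksum(path: Path) -> bytes:
    data = path.read_bytes()
    if hashlib.sha256(data).hexdigest() != _sidecar(path).read_text().strip():
        raise IOError(f"checksum mismatch: {path} has gone bad since it was written")
    return data

# Hypothetical usage:
# write_with_checksum(Path("backup.img"), payload)
# ... months later, possibly after the sector has died ...
# payload = read_with_checksum(Path("backup.img"))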
 
  • Like
Reactions: Fritz

oneplane

Well-Known Member
Jul 23, 2021
Yeah, sometimes the blanket "this drive is GOOD/BAD" status really doesn't say anything useful and might flip at any time. Actual SMART values are important (so not -ax but -A).

But considering the dead sectors: once your drive runs out of spare sectors it can't remap them anymore and it just slowly dies. In this case, I suspect some other defect (e.g. a speck of dust or a manufacturing impurity) is making sectors die abnormally fast. You basically can't trust the drive either way.
 

Fritz

Well-Known Member
Apr 6, 2015
I wouldn't consider using this drive for anything important. I'm just wondering how long it will take to actually, physically die. To date, the only indication it was failing has been HD Sentinel. Nothing else has given even the slightest hint. Windows says it's healthy, and so does WD's own Lifeguard tool.

Lifeguard1.PNG

I'm running the long test now but last time it found nothing.
 

sko

Active Member
Jun 11, 2021
I've kept two such "should fail anytime soon" drives as third providers in ZFS mirror vdevs in my backup server at home, and they survived almost another year after showing the first failing sectors.
The bad sectors on both increased in bursts, at ever shorter intervals and in growing numbers: at the beginning only 2-3 sectors at a time with several weeks in between, at the end several dozen sectors every few days, until one of them finally died completely and I ended the little experiment.
ZFS was already piling up checksum errors on those drives and kicked the dying one out of the pool, because of too many errors, even before it went dark...
 

oneplane

Well-Known Member
Jul 23, 2021
It really depends on a lot of things; Windows, WDLG, and other tools try to give some all-round "everything is good" indication, but it's not really that binary. A drive can be good and degrading at the same time. It just depends on who wrote the software and what kind of message they decided to put out.

Some software is written to report *any* failed sectors as "THIS DRIVE BE DEAD" while others only look for a SMART self-report of "I am about to fail". There is no real universal answer, but you can do the following:

- Figure out how many reserved backup sectors you have
- See how many sectors go bad per day
- Divide the remaining backup sectors by the number of sectors that die every day and you have a rough estimate of the remaining days (see the sketch below for a worked example)

On the other hand, if it's variable, e.g. some days it's 100 dead sectors, some days it's 0, and other days it's 1000, it becomes much harder to know how fast the drive is dying. Keep in mind that it is a complicated mechanical device where a bunch of tiny heads float nanometers above a spinning metal platter, and every speck of dust or wrong vibration can wreak havoc. Maybe there is a bit of damage on a specific track and the drive only notices it whenever it happens to need to read something there. Maybe it's a single bad head. It's impossible to know at this point. What we do know is that the dead sectors are likely to cause data loss, and remapping is only a temporary fix until not enough data remains to be useful.
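A back-of-the-envelope version of that estimate, assuming you log the "Elements in grown defect list" value every few days. The spare-pool size here is a pure guess (drives don't report it over SMART), and all of the numbers are made up:

from datetime import date

# Hypothetical readings of "Elements in grown defect list", logged over time.
readings = {
    date(2022, 4, 20): 1680,
    date(2022, 4, 23): 1745,
    date(2022, 4, 26): 1821,
}
spare_pool = 4000  # assumed total spare sectors; the drive does not report this

days = sorted(readings)
span = (days[-1] - days[0]).days                      # 6 days of observation
new_defects = readings[days[-1]] - readings[days[0]]  # 141 newly grown defects
rate = new_defects / span                             # ~23.5 sectors/day

spares_left = spare_pool - readings[days[-1]]         # ~2179 spares remaining (guess)
print(f"~{rate:.1f} sectors/day -> roughly {spares_left / rate:.0f} days of spares left")

Of course, if the failure rate isn't steady, the estimate is mostly noise, which is exactly the problem described above.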
 

Fritz

Well-Known Member
Apr 6, 2015
I've been hanging out at hddguru.com lately just to see if I could pick up a few tips. What I learned was that HDs are far more complicated than I imagined. Most of the discussions there are gibberish to me, almost like they're speaking a foreign language. What we see is simply data in and data out. There's waaaay more to it than that.
 
  • Like
Reactions: ecosse

itronin

Well-Known Member
Nov 24, 2018
Denver, Colorado
@Fritz polynomial maths (storage, and EC), aerodynamics, fluid dynamics, control systems, servo control systems, black magic magnetism - and my knowledge is from the early 90's...
 
  • Like
Reactions: Fritz

ari2asem

Active Member
Dec 26, 2018
The Netherlands, Groningen
Forget HD Sentinel.
Use GSmartControl with the latest smartmontools installed on your Windows machine.

GSmartControl is a very nice GUI for smartmontools.

I only use GSmartControl.

And look at the attributes; they are more important than the good/bad notifications.
 
  • Like
Reactions: Fritz

Fritz

Well-Known Member
Apr 6, 2015
After 8 hours the Lifeguard long scan finally finished.

Lifeguard2.PNG

Hard to imagine how this drive can complete a grueling 8-hour test without a hint of problems.
 

oneplane

Well-Known Member
Jul 23, 2021
Fritz said:
After 8 hours the Lifeguard long scan finally finished.

View attachment 22668

Hard to imagine how this drive can complete a grueling 8-hour test without a hint of problems.
If that's the extended test from SMART, it's not really a special "Western Digital secret test that does the bestest tests of all" but just another SMART command any tool can send. AFAIK, the 'extended' test is just a surface scan. Surface defects would have popped up really quickly, but it's also possible that it was just a 'few' defects and the remap counter has simply gone up again. That's not something the extended test will report.

It's also possible that there is some MEMS device that only has a problem in a specific idle range or a specific temperature range, and only exhibits the behaviour on writes, or something like that. Again, there is no way to tell, and no 'test' over the standard interface is going to say "yes, this drive is good". The only thing tests can do is say "it is bad for sure at this time"; everything else is just a guess. There are no guarantees.
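One thing any tool can do over the standard interface is track the grown defect list between self-tests; if that count keeps climbing while the surface scan still "passes", the pass doesn't mean much. A rough sketch, assuming smartctl is on the PATH and the drive reports "Elements in grown defect list" the way the SAS output earlier in this thread does (the device name is just an example):

import re
import subprocess

def grown_defects(device: str) -> int:
    """Return the 'Elements in grown defect list' count that smartctl reports."""
    out = subprocess.run(
        ["smartctl", "-x", device],
        capture_output=True, text=True, check=False,  # smartctl exits non-zero on failing drives
    ).stdout
    match = re.search(r"Elements in grown defect list:\s*(\d+)", out)
    if match is None:
        raise RuntimeError(f"no grown defect count found for {device}")
    return int(match.group(1))

# Example (device name as used earlier in the thread; adjust for your system):
# before = grown_defects("pd2")
# ... run the extended self-test and wait the ~8 hours ...
# after = grown_defects("pd2")
# print(f"defects grown during the test: {after - before}")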
 

nabsltd

Well-Known Member
Jan 26, 2022
sko said:
Those old RE drives usually were amazingly reliable - I replaced the last 2 RE3s IIRC last year, only because they became _really_ old and I ran out of drive bays. The 500GB RE2 were absolutely immortal - I haven't replaced a single one of those because of a failure, only because they became too small. Some of them were used in clients or test systems for several years afterwards.
The last place I worked, we decommissioned a rack of storage that had 2TB RE3s in dense pack (70+ per 4U). After putting some of the drives to use in various other applications, we still had around 300 disks. There was no desire to try and sell them, so management said that employees could take them (after we erased them, of course) as long as they didn't re-sell them. After people took what they wanted, there were still around 80 disks left, so I grabbed around 50, as I was just starting to build out storage at home.

I've had about 10 of them fail so far, but the remaining disks all have close to 10 years of power-on time now and are still chugging along. Like you, the only reason they're getting replaced in my rack is that they're too small per bay.
 
  • Like
Reactions: Fritz

oneplane

Well-Known Member
Jul 23, 2021
Similar experience here. Amazingly, same with the very old consumer Samsung SpinPoint 1TB drives that just won’t die.
 
  • Like
Reactions: Fritz

Fritz

Well-Known Member
Apr 6, 2015
oneplane said:
Similar experience here. Amazingly, same with the very old consumer Samsung SpinPoint 1TB drives that just won’t die.
Yea, same here, I have 4 Samsung drives. They're all older than dirt but still show 100% health.
 

Fritz

Well-Known Member
Apr 6, 2015
I accept that the drive is as untrustworthy as it can possibly be while still functioning, and I would never trust it. But still, the original question stands. I guess only time will tell.
 
  • Like
Reactions: oneplane