HDD Test / recertify options


mervincm

Active Member
Jun 18, 2014
162
45
28
My Synology 1813+ NAS recently notified me that 2 drives have failed the SMART test, and looking at the admin portal, it seems they each developed 4 bad sectors on the same day. Earlier that day we had a brownout, and I suspect that this was the actual cause of the bad sectors. I quickly did a complete backup and then scrubbed the volume. Both finished without incident. I put in a replacement for one of the bad disks, and the array is rebuilding now. In the meantime, I want to test this disk to see if it's actually bad, and "refresh" it and put it back into service if it is actually OK. It's a 3TB Red, so I grabbed the latest version of the WD diagnostics, installed the bad disk into a spare system, and am currently running the long diagnostic.

In the olden days, I would have used SpinRite, which would verify whether bad sectors are indeed bad, etc.

What can I do today? Ideally under Windows or a bootable Linux distro.
 

rubylaser

Active Member
Jan 4, 2013
846
236
43
Michigan, USA
What does the actual SMART data look like? Run this from a Linux live CD:
Code:
smartctl -a /dev/sdX
where sdX is your questionable hard drive. Please paste the entire results back here in a code block.
 

mervincm

Thank you for the suggestion. Tomorrow, when the WD long test is done, I will grab that info.

Here is the SMART info for the remaining bad disk, from the Synology admin portal.

Sorry, Google Drive won't post it as an inline image, but...

disk5.png - Google Drive
 

rubylaser

You do have 4 pending sectors, but nothing else looks bad about the drive (Reallocated_Sector_Ct, UDMA_CRC_Error_Count, etc.). If you can, I would pull this disk too once you have your replacement, run a full destructive badblocks pass on the drive (this will take many hours on a large disk), and then re-check the SMART values from a Linux live CD. If they increase, I would RMA this disk too.

Let's say your disk shows up as /dev/sdb

Code:
badblocks -wsv /dev/sdb
smartctl -a /dev/sdb
 

mervincm

Having difficulty getting that far with the Linux end.

I created a live Ubuntu 14.04 LTS disk, booted it, and opened a terminal session, but smartctl is not included. I tried the Ubuntu Software Center; it seemed to freeze, but smartctl did install.

sudo fdisk -l told me the 3TB drive was /dev/sdc

sudo smartctl -a /dev/sdc seemed to work, and I got the output, copied it into a blank doc, and saved it to another USB key:
Code:
ubuntu@ubuntu:~$ sudo smartctl -a /dev/sdc
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-24-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red (AF)
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-WMC4Nxxxxxx
LU WWN Device Id: 5 0014ee 6599b1456
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Oct  9 03:18:15 2014 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (39240) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 394) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x703d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       689
  3 Spin_Up_Time            0x0027   214   179   021    Pre-fail  Always       -       4258
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       195
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2774
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       92
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       36
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       310
194 Temperature_Celsius     0x0022   122   117   000    Old_age   Always       -       28
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       5
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      2739         9437194
# 2  Short offline       Completed: read failure       90%      2730         9437194
# 3  Short offline       Completed: read failure       90%      2706         9437194
# 4  Short offline       Completed without error       00%      2682         -
# 5  Short offline       Completed without error       00%      2658         -
# 6  Short offline       Completed without error       00%      2634         -
# 7  Short offline       Completed without error       00%      2610         -
# 8  Short offline       Completed without error       00%      2586         -
# 9  Short offline       Completed without error       00%      2562         -
#10  Short offline       Completed without error       00%      2538         -
#11  Short offline       Completed without error       00%      2514         -
#12  Short offline       Completed without error       00%      2490         -
#13  Short offline       Completed without error       00%      2466         -
#14  Short offline       Completed without error       00%      2442         -
#15  Short offline       Completed without error       00%      2418         -
#16  Short offline       Completed without error       00%      2394         -
#17  Short offline       Completed without error       00%      2370         -
#18  Short offline       Completed without error       00%      2346         -
#19  Short offline       Completed without error       00%      2322         -
#20  Short offline       Completed without error       00%      2298         -
#21  Short offline       Completed without error       00%      2274         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Then I tried badblocks -wsv /dev/sdc
and got an error message: /dev/sdc is apparently in use by the system; it's not safe to run badblocks!

I don't get this, since the drive likely doesn't even have a partition table; I suspect the WD long test killed all that.

I tried to unmount with umount /dev/sdc, but got an error message saying it's not mounted according to mtab.

Then I tried badblocks -wsvf /dev/sdc to force it, and it's running, but I'm not sure if it was OK to force it...

PS: I included all the steps here for the next guy.
 

rubylaser

If you are using the LiveCD, it might have mounted the disk if it still had a partition table and filesystem on it. The WD long test doesn't overwrite data, so that shouldn't have affected anything. It's fine to force badblocks on a questionable disk, but you only want to run this test if the data on the disk is not important. It is a destructive test, so it completely wipes out the data on this disk.
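
If umount says the disk isn't mounted but badblocks still refuses, something else may be holding it open. As a rough sketch (this is a guess, nothing in the thread confirms it): besides a mounted filesystem, the live environment could have enabled a swap partition or auto-assembled a Linux md RAID array from the old NAS partitions. The helper name who_uses is made up for illustration; it just greps the usual /proc tables for the device name.

```shell
# Hypothetical helper: list lines that mention a device name (e.g. "sdc")
# in the kernel's mount, swap and md-RAID tables.
# Extra arguments replace the default files, which is handy for a dry run.
who_uses() {
    name=$1
    shift
    [ "$#" -gt 0 ] || set -- /proc/mounts /proc/swaps /proc/mdstat
    for f in "$@"; do
        if [ -r "$f" ]; then
            # -H prefixes each hit with the file it came from
            grep -H -- "$name" "$f" || true
        fi
    done
}

# Example: who_uses sdc
# No output means none of these tables reference the disk.
```

If something does show up, releasing it (unmounting, turning swap off, or stopping the array) is cleaner than forcing badblocks, although on a disk you intend to wipe anyway the end result is the same.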
 

mervincm

I left badblocks running last night. I estimated about 4 hours to complete the 3 TB, based on the first few %. Strangely, this AM it was not complete. After 10 hours it was only 60% done. I am concerned that the explanation for the slowdown might be that it ran into "bad" or "weak" areas of the disk that were recovered. So far no errors have been identified by badblocks. The only other idea I had was that perhaps it was a power saving issue; maybe it was asleep?
 

ColdCanuck

Member
Jul 23, 2013
38
3
8
Halifax NS
Ok a few things.

i)

The smartctl -a output you posted shows two interesting things:

a) Current_Pending_Sector is 5, which means there are 5 LBAs which don't pass the read ECC, but which *have not* yet been reallocated. They get dealt with when new data is written to those LBAs: the drive can either rewrite the sector in place or remap it to a spare, so they may or may not then show up as reallocated blocks. In my experience WD firmware is inconsistent about this; I have some drives which behave as expected, and others which just quietly remap the LBA. You have also already gone from 4 to 5, which is not looking great for that disk.

197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       5


b) The Synology was running a SMART short test every 24 hours.

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      2739         9437194
# 2  Short offline       Completed: read failure       90%      2730         9437194
# 3  Short offline       Completed: read failure       90%      2706         9437194
# 4  Short offline       Completed without error       00%      2682         -
# 5  Short offline       Completed without error       00%      2658         -
# 6  Short offline       Completed without error       00%      2634         -


Item #4 and older passed. Items #1, 2, and 3 show errors (consistent with the Current_Pending_Sector count). So the NAS correctly told you the disk's SMART tests were failing. The extended self-test was probably you using the WD tools.


ii) As noted above the WD test and the NAS SMART self tests are non-destructive.


iii) You don't need a partition table to have a filesystem on the disk, if that filesystem uses all the disk. Don't know if Synology does this or not, but it certainly is possible.


iv) By running badblocks you have certainly now destroyed the filesystem.

v) Assuming a 100MB/s average rate for the WD Red (close enough), it will take at least 8 hours to write and 8 hours to read the disk. Badblocks does this 4 times, so expect this to take about 67 hours.

vi) badblocks will write to the "bad" blocks, and should cause the 197 counter to go to zero. I don't know whether the blocks get reallocated and show on Attribute 5 of the SMART data; none of my Reds has thrown an error yet. Certainly most of the 'Greens' I have do not, but all the Blacks I have do.

So good luck, and watch the SMART output; it contains all the info on the health of the drive.
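
For watching those counters between runs, here is a minimal sketch that pulls just the sector-health attributes (5, 196, 197, 198) out of smartctl -A style output. The function name sector_counters is my own invention, and the sample rows are copied from the output posted earlier in the thread.

```shell
# Print ID, name and raw value for the sector-health attributes
# (5, 196, 197, 198) from `smartctl -A`-style text on stdin.
# RAW_VALUE is the 10th whitespace-separated column in that table.
sector_counters() {
    awk '$1 ~ /^(5|196|197|198)$/ { print $1, $2, $10 }'
}

# Sample rows copied from the smartctl output earlier in the thread.
sample='  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       5
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0'

printf '%s\n' "$sample" | sector_counters
```

On a real system the input would come from something like sudo smartctl -A /dev/sdX instead of the sample text. Saving the result before and after a badblocks run makes it easy to diff the two.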
 

mervincm

Thanks for the detailed feedback!
Can you tell me what the 90% is in the self-test log lines you quoted?
Any idea why the WDC diagnostic long SMART test passed, since this seems to indicate it failed?

I am hoping this disk can either be confirmed bad or stable (with a few bad sectors), so I can either purchase a replacement or reuse it to replace the similarly failing disk still in my Synology (the pic from above on Google Drive).

I am hoping that the read error was simply caused by the brownout I had earlier that day. It was very quick; the PCs didn't even reboot, although a monitor did :) Maybe the disk doesn't actually have bad sectors at all, and it just registered the error because I was reading from the disk at the time (I was indeed streaming a movie from it).

PS: I am using a bootable USB drive to live boot Ubuntu, and it works well enough! I hoped that with a USB stick the installed package might survive a reboot, but unfortunately not. I might have to dig out an old HDD and install it.
 

ColdCanuck

mervincm said: "Can you tell me what the 90% is in these lines?"
The 90% means 90% remaining. In other words, the test had covered at most about 10% of the disk before it bailed with an error. The "9437194" number is the LBA of the bad block. (The percent remaining only counts down in 10% steps, so it is pretty coarse.)

Have a look at the Bad block HOWTO for smartmontools, which shows how to use this number to simply zap the bad block.

In your case it is even easier: since there is/was no partition table on the disk, the bad block is 9437194 512B LBAs into the disk (or about 4.5 GiB in).
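
The arithmetic behind that figure, as a quick sketch. The LBA, logical sector size and capacity are taken from the smartctl output posted earlier; the variable names are just for illustration.

```shell
lba=9437194                  # LBA_of_first_error from the self-test log
logical=512                  # logical sector size reported by smartctl
capacity=3000592982016       # "User Capacity" in bytes, from the same output

offset=$(( lba * logical ))  # byte offset of the bad block
echo "byte offset: $offset"

# Shell arithmetic is integer-only, so use awk for the fractional values.
awk -v off="$offset" -v cap="$capacity" 'BEGIN {
    printf "GiB into the disk: %.2f\n", off / (1024 * 1024 * 1024)
    printf "percent of the disk: %.2f\n", 100 * off / cap
}'
```

That works out to 4831843328 bytes, i.e. about 4.5 GiB, which is well under 1% of the way into the disk. It also shows how coarse the 10% steps of the "Remaining" column are.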


mervincm said: "Any idea why the WDC diagnostic long smart test passed since this seems to indicate it failed?"


The glib answer is that the tools are no good. The longer answer is that SMART says:



Code:
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Luckily for you the NAS looked a little deeper. Your disk might be fine if you simply overwrite the bad area and rerun the SMART long test.


A quick and dirty test would be:

Code:
dd if=/dev/zero of=/dev/sdx oflag=direct seek=9437194 bs=512 count=1k
This will write 512 KiB of zeros starting at that spot on the disk.

Double-check that you use the correct /dev/sdx for the disk which has errors and that the offset is correct; otherwise your day will suddenly get a whole lot worse.
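
One possible refinement, offered as a sketch rather than a tested recipe: smartctl reports this drive as 512 bytes logical / 4096 bytes physical, so LBA 9437194 sits partway through a physical sector. Starting the write on a 4096-byte boundary means each physical sector gets overwritten in full, which may save the drive from having to read the damaged sector first (that last part is an assumption about how the firmware behaves). The snippet only prints a dd command; /dev/sdX is a placeholder you would have to replace yourself.

```shell
lba=9437194                            # first failing LBA from the self-test log
logical=512                            # bytes per logical sector
physical=4096                          # bytes per physical sector
ratio=$(( physical / logical ))        # logical sectors per physical sector

aligned_lba=$(( lba - lba % ratio ))   # round down to a physical-sector boundary
seek_4k=$(( aligned_lba / ratio ))     # same position counted in 4096-byte blocks
count_4k=$(( 512 * 1024 / physical ))  # cover a 512 KiB span, as the dd line above does

# Dry run: print the command instead of executing it.
echo "dd if=/dev/zero of=/dev/sdX oflag=direct bs=$physical seek=$seek_4k count=$count_4k"
```

Because dd counts seek in units of the output block size, the seek value changes when bs changes; mixing the 512-byte LBA with bs=4096 would land in the wrong place.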


Your disk definitely has a corrupt area; it was found by 3 separate SMART tests over more than 24 hours. That said, with luck, you can reallocate those sectors by writing to that area.
 

rubylaser

The badblocks destructive passes will write to all portions of the disk four times. I would look at the SMART data once that's done, and re-run a long SMART test.
 

mervincm

Grr, Ubuntu crashed 24 hours into the darn test. The ATI card was at full fan the whole time, so maybe the video driver crashed? No errors had been found by badblocks the last time I checked. I will install Ubuntu on a spare system and try again.
 

rubylaser

You could just run Finnix instead of running a full Ubuntu Desktop environment LiveCD. It's command line only, but not too hard to use with these few commands. Just start it up and...
Code:
apt-get update && apt-get install smartmontools ssh screen -y
screen <press enter>
badblocks -wsv /dev/sdc
badblocks is already installed, so you should be good to go.

This is a little more complicated, but you could even use PuTTY to SSH into the machine if you want to make copying and pasting back here easier. Running this in screen will keep it running even if you disconnect or your connected computer goes into standby. This isn't necessary, but it can make copying the output a lot easier if you don't know Linux well enough to pipe output to a file and then scp the results to another system.
 

mervincm

I had an unused machine to install Ubuntu on, so that was no big deal. The short, conveyance, and long SMART tests all passed. I grabbed the output of smartctl -a /dev/sdb, then started another round of destructive badblocks. In a few days I will rerun smartctl and compare.
 

mervincm

Progress report, 28 hours into it.
Notes for the next guy: the percentage indicated is NOT for the overall test, but rather for the current phase of the test. I believe there are 8 phases by default: the write (Testing with pattern ...), then the verification of success (Reading and comparing ...), for each of 4 different patterns.
Another thought is that, because of the time involved, it would have been best to do both disks at the same time! This link seems to indicate that it's not a problem.

[How To] Hard Drive Burn-In Testing | FreeNAS Community
and another using Ubuntu:
Testing New Hard Drives
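
To put a rough number on the total runtime, here is a back-of-the-envelope sketch using the 8 phases described above, the capacity from the smartctl output, and the ~100 MB/s average rate suggested earlier in the thread. That rate is an assumption; real throughput varies across the platters.

```shell
capacity=3000592982016   # bytes, "User Capacity" from smartctl
rate=100000000           # bytes per second (assumed 100 MB/s average)
phases=8                 # 4 patterns x (write pass + read-and-compare pass)

per_phase=$(( capacity / rate ))                        # seconds for one full pass
total_hours=$(( (per_phase * phases + 1800) / 3600 ))   # rounded to the nearest hour

echo "one phase: about $(( per_phase / 60 )) minutes"
echo "all $phases phases: about $total_hours hours"
```

With these inputs a single phase is roughly 500 minutes and the whole run about 67 hours, so a per-phase percentage that looks nearly done can still leave days of testing ahead.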
 

rubylaser

Yes, you can certainly test multiple disks at once. I normally run them in a screen session so that I can disconnect and check on the processes many hours later.
 

mervincm

As an update, both drives finished badblocks, followed by a few short SMART tests and 1 long one. They both now report 0 in all bad sector categories. I have put one back into the Synology, let it rebuild, and then did a scrub; no errors remain and the drives seem functional. I will wait another few weeks, then swap the other back in (freeing up the 4TB Red that I used as a temporary replacement).
Thanks to all for the ongoing advice and assistance.