Is this drive salvageable or does it need replacement?


Tinkerer

While moving and migrating data to make changes to my ZFS pool, one drive is acting up. ZFS shows:
Code:
# zpool status
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 93.9M in 02:20:47 with 0 errors on Sat Jul  1 22:38:26 2023
config:

    NAME                                   STATE     READ WRITE CKSUM
    tank                                   ONLINE       0     0     0
      mirror-0                             ONLINE       0     0     0
        ata-HGST_HUH728080ALN600_ABCDWXYZ  ONLINE       0     0     0
        ata-HGST_HUH728080ALN600_ABCDWXYZ  ONLINE       0     0     0
      mirror-1                             ONLINE       0     0     0
        ata-HGST_HUH728080ALN600_ABCDWXYZ  ONLINE       0     0     0
        ata-HGST_HUH728080ALN600_ABCDWXYZ  ONLINE       0     0     0
      mirror-2                             ONLINE       0     0     0
        ata-HGST_HUH728080ALN600_ABCDWXYZ  ONLINE       0     0     0
        ata-HGST_HUH728080ALN600_ABCDWXYZ  ONLINE       0     0     0
      mirror-3                             ONLINE       0     0     0
        ata-HGST_HUH728080ALN600_ABCDWXYZ  ONLINE       0     0     0
        ata-HGST_HUH728080ALN600_ABCDWXYZ  ONLINE      26     0    12
    special   
      mirror-4                             ONLINE       0     0     0
        nvme1n1                            ONLINE       0     0     0
        nvme2n1                            ONLINE       0     0     0

errors: No known data errors
The smartctl output shows:
Code:
# smartctl -a /dev/disk/by-id/ata-HGST_HUH728080ALN600_ABCDXYZ
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.37-1-lts] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     HGST Ultrastar He8
Device Model:     HGST HUH728080ALN600
Serial Number:    ABCDXYZ
LU WWN Device Id: 5 000cca 260e841eb
Firmware Version: A4GNT907
User Capacity:    8,001,563,222,016 bytes [8.00 TB]
Sector Size:      4096 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Jul  9 19:22:24 2023 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (  101) seconds.
Offline data collection
capabilities:              (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      (1175) minutes.
SCT capabilities:            (0x003d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   133   133   054    Pre-fail  Offline      -       108
  3 Spin_Up_Time            0x0007   158   158   024    Pre-fail  Always       -       397 (Average 437)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       328
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   097   097   067    Pre-fail  Always       -       3
  8 Seek_Time_Performance   0x0005   128   128   020    Pre-fail  Offline      -       18
  9 Power_On_Hours          0x0012   098   098   000    Old_age   Always       -       14054
10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       326
22 Helium_Level            0x0023   100   100   025    Pre-fail  Always       -       100
192 Power-Off_Retract_Count 0x0032   097   097   000    Old_age   Always       -       4004
193 Load_Cycle_Count        0x0012   097   097   000    Old_age   Always       -       4004
194 Temperature_Celsius     0x0002   150   150   000    Old_age   Always       -       40 (Min/Max 14/45)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       25
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       1

SMART Error Log Version: 1
ATA Error Count: 4497 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 4497 occurred at disk power-on lifetime: 14054 hours (585 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 18 00 05 52 c6 40 00      07:24:42.449  READ FPDMA QUEUED
  60 34 08 0d df 40 40 00      07:24:40.460  READ FPDMA QUEUED
  60 03 10 54 37 46 40 00      07:24:40.460  READ FPDMA QUEUED
  60 03 00 be 0e 80 40 00      07:24:40.458  READ FPDMA QUEUED
  60 15 00 9a 42 46 40 00      07:24:40.456  READ FPDMA QUEUED

Error 4496 occurred at disk power-on lifetime: 14054 hours (585 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 26 00 9f 57 44 40 00      07:24:34.619  READ FPDMA QUEUED
  60 01 10 2e 12 82 40 00      07:24:32.508  READ FPDMA QUEUED
  60 1b 08 f6 7d 48 40 00      07:24:32.503  READ FPDMA QUEUED
  60 08 10 77 57 44 40 00      07:24:32.501  READ FPDMA QUEUED
  60 20 08 05 7f 48 40 00      07:24:32.501  READ FPDMA QUEUED

Error 4495 occurred at disk power-on lifetime: 14054 hours (585 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 06 08 ab ae 44 40 00      07:24:32.120  READ FPDMA QUEUED
  60 0a 08 81 ae 44 40 00      07:24:29.306  READ FPDMA QUEUED
  60 06 08 6d ae 44 40 00      07:24:29.304  READ FPDMA QUEUED
  60 10 08 29 ae 44 40 00      07:24:29.301  READ FPDMA QUEUED
  60 10 10 8b 58 44 40 00      07:24:29.299  READ FPDMA QUEUED

Error 4494 occurred at disk power-on lifetime: 14053 hours (585 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 20 08 c7 96 05 40 00      06:15:58.001  READ FPDMA QUEUED
  60 60 08 27 96 05 40 00      06:15:55.197  READ FPDMA QUEUED
  60 20 08 69 ef c1 40 00      06:15:55.180  READ FPDMA QUEUED
  60 20 00 eb 83 c1 40 00      06:15:55.180  READ FPDMA QUEUED
  60 20 08 05 57 c1 40 00      06:15:55.176  READ FPDMA QUEUED

Error 4493 occurred at disk power-on lifetime: 14053 hours (585 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 20 00 9a db 05 40 00      06:15:54.821  READ FPDMA QUEUED
  60 20 00 3a db 05 40 00      06:15:52.025  READ FPDMA QUEUED
  60 a0 00 9a da 05 40 00      06:15:51.974  READ FPDMA QUEUED
  60 20 18 9f be 42 40 00      06:15:51.969  READ FPDMA QUEUED
  60 40 10 9a d8 05 40 00      06:15:51.969  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      9837         -
# 2  Extended offline    Completed: read failure       90%      5593         1468010836

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Is this fixable?

Backup to a spare pool is almost complete and I have backups offsite, so I'm not worried about losing data. It's just that I don't want to buy another 8TB of spinning rust, as I am about to replace the pool with SSDs. The first two are already in as the special vdev, and more are coming.

Perhaps a surface scan to mark sectors?
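(For reference, the two actions that status message offers look like this; the device name follows the anonymized IDs above and the replacement path is a placeholder.)
Code:
# Option 1: reset the pool's error counters and keep watching the drive
zpool clear tank ata-HGST_HUH728080ALN600_ABCDWXYZ

# Option 2: swap the suspect disk out of its mirror for a new device
zpool replace tank ata-HGST_HUH728080ALN600_ABCDWXYZ /dev/disk/by-id/<new-disk>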

Thanks!

Tinkerer

Quote:
Disk is certainly usable, very much so, as a paperweight...
Cheers, appreciate the reply. Not to be argumentative or anything, but that one has been there forever, way before these ZFS errors occurred.

In the past 25 minutes:
Code:
       NAME                                   STATE     READ WRITE CKSUM
           ata-HGST_HUH728080ALN600_ABCDWXYZ  ONLINE     117     1    71
While these stayed unchanged:
Code:
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       25

Tinkerer

Thanks for the reply. I would appreciate it even more if people could back their suspicions with reasoning :cool:

I actually have some experience with those kinds of issues, and I don't think either one of those is the case. When I ran a pool of disks on a system with a PSU that barely had the juice to keep things going, I got SATA errors in the kernel log. The errors would occur randomly on all disks. A bad SATA cable gave similar errors in the kernel log, but obviously limited to that one disk. I could replicate the errors on another disk by moving that cable to a good disk and watching the errors move with it.

I don't think it's the PSU for the aforementioned reason (it's on a single disk), and I think I can rule out SATA cabling, as I have switched cables from the internal SATA controllers to an M1015 controller. I also tend to just randomly connect the cables, because ZFS doesn't care about the order. The problem would have moved to another disk.

Lastly, this is also occurring:
Code:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%     14067         1468064631
# 2  Extended offline    Completed: read failure       90%      9837         -
# 3  Extended offline    Completed: read failure       90%      5593         1468010836
Entry #1 is a manual run of the short test with smartctl -t short, and it didn't complete.
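(For completeness, that sequence looks roughly like this; the device path follows the anonymized IDs above.)
Code:
# Kick off a short self-test; the drive runs it in the background
smartctl -t short /dev/disk/by-id/ata-HGST_HUH728080ALN600_ABCDWXYZ
# After the recommended polling time (~2 minutes for this drive), read the log
smartctl -l selftest /dev/disk/by-id/ata-HGST_HUH728080ALN600_ABCDWXYZ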

At this point I have taken the disk out of the pool; a complete backup that finished last night is on another pool, and I am awaiting new disks this week.
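(For a two-way mirror, that amounts to something like the following, with the device name anonymized as before.)
Code:
# Detach the suspect disk from its mirror; the healthy half keeps serving data
zpool detach tank ata-HGST_HUH728080ALN600_ABCDWXYZ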

I'm running tests on the faulty disk right now to see what can be done.

Tinkerer

Quote:
Disk is certainly usable, very much so, as a paperweight...
So yeah, you're right; I just had to discover it for myself ;).

I triggered these manually:
Code:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     14067         1468064631
# 2  Short offline       Completed: read failure       90%     14067         1468064631
Created a partition that starts right before that sector:
Code:
Number  Start (sector)    End (sector)  Size       Code  Name
   1      1468009984      1553506640   326.1 GiB   8300  Linux filesystem
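(I didn't paste the partitioning command itself; done non-interactively it would be something like this sgdisk call, using the start and end sectors from the table above.)
Code:
# Hypothetical reconstruction: partition 1 covering the suspect region
# (GPT type 8300 = Linux filesystem)
sgdisk --new=1:1468009984:1553506640 --typecode=1:8300 \
    /dev/disk/by-id/ata-HGST_HUH728080ALN600_VLJVKBYY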
Ran badblocks:
Code:
# badblocks -b 4096 -c 1024 -s /dev/disk/by-id/ata-HGST_HUH728080ALN600_VLJVKBYY-part1
Checking for bad blocks (read-only test): 9 0.00% done, 0:04 elapsed. (0/0/0 errors)
100.00% done, 0:07 elapsed. (1/0/0 errors)
110.00% done, 0:09 elapsed. (2/0/0 errors)
120.00% done, 0:11 elapsed. (3/0/0 errors)
130.00% done, 0:13 elapsed. (4/0/0 errors)
140.00% done, 0:15 elapsed. (5/0/0 errors)
150.00% done, 0:17 elapsed. (6/0/0 errors)
160.00% done, 0:20 elapsed. (7/0/0 errors)
170.00% done, 0:22 elapsed. (8/0/0 errors)
It's only reporting read errors. It's not updating reallocated sectors, just:
Code:
  7 Seek_Error_Rate         0x000b   066   066   067    Pre-fail  Always   FAILING_NOW 1966194

BackupProphet

If you run badblocks, you can usually fix those LBA errors you get from a SMART test. The drive firmware will usually remap the bad sectors to working spare sectors. The badblocks command I use is:
Code:
# WARNING: -w runs a destructive read-write test and erases all data on the disk
badblocks -wsv -t random -b 4096 /dev/sdc
ZFS read and write errors are usually connection issues caused by a bad cable, and you also already have UDMA_CRC_Error_Count=1.

The easiest way to find out whether it's the hard drive or the rest of the hardware causing issues is to try this hard drive in another computer.

Stephan

If it's not a sub-par PSU, SATA cable, or SATA port (soldering problem on the PCB, etc.), then it's a bad disk. Your tests also hint at that. You can try taking it offline and running shred -vzn1 to give the drive a chance to relocate all bad sectors to its pool of spares. But at more than a handful of bad sectors, it is circling the drain. UDMA_CRC_Error_Count is not zero; are you sure you don't have a slight cabling problem as well? Good time to make sure your backup works.
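(Spelled out with a placeholder device path; note this destroys everything on the disk.)
Code:
# DESTRUCTIVE: one pass of random data (-n1) plus a final zero pass (-z),
# with verbose progress (-v); triple-check the device path first
shred -vzn1 /dev/disk/by-id/ata-HGST_HUH728080ALN600_ABCDWXYZ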

Edit: And now I read what @BackupProphet wrote... sorry for the duplication.

Stephan

If this is Linux or similar, I also recommend running smartd so you get an e-mail right away when some pre-fail attribute changes. Maybe run a short self-test weekly and a ZFS scrub monthly. I think the zed daemon is a must-have, and I like to get e-mail about the monthly scrub results no matter the outcome. Usually there are no errors, but I want to know that notifications still work.
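(A minimal smartd.conf sketch of that setup; the schedule and address are placeholders.)
Code:
# /etc/smartd.conf
# -a: monitor all SMART attributes; -o on: enable automatic offline tests
# -S on: enable attribute autosave
# -s (S/../../6/03): run a short self-test every Saturday at 03:00
# -m: address that receives failure and attribute-change mail
DEVICESCAN -a -o on -S on -s (S/../../6/03) -m admin@example.com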
 

Tinkerer

Thanks!

No more warranty, I'm afraid. They are recertified HGST Ultrastars. Off-topic, but I will never buy recertified drives again. One died within days (disappeared completely), and two leaked helium like a balloon; those three got replaced under warranty. Yet another drive dropped to ~70% helium in a matter of days. I tried to RMA that too, but it got refused, and it's been stable since, so I guess it's not an issue. Now this one at <20,000 hours.

@Stephan what do you do with the zed daemon? How did you set that up? And yes, Linux.

I am pretty sure the UDMA errors are from a power cut. I once pulled the plug when the machine wouldn't turn off and got the notifications on the next boot. To be sure, I will mark the cables. I have 3 sets of mini-SAS to 4x SATA breakout cables, 2 of which are in use. I have swapped them before but never tracked errors to cables. Also, this pool was connected to an M1015 with another set of cables, with the same drive reporting these same SMART errors. The pool was filling up when ZFS hit that area and started reporting those read errors.

At this point I have 2 sets of full backups tested and working, and all critical data is stored encrypted offsite (which I also tested the other day).

@BackupProphet like I said, the badblocks command only triggers Seek_Error_Rate; it's not relocating sectors. At another point (I'm not sure with which command) I got an error that no data was received when reading that area.

Today I will receive 4 new drives (SSDs) and will set those up, mark the cables, and note the serial numbers.

Again, thanks for thinking along, I appreciate all the suggestions.

Stephan

ZED is the ZFS Event Daemon; see man zed.

It will notify you by mail when ZFS recognizes a so-far-unknown problem. Handy to avoid sudden pool death because too many devices have dropped dead over the years...

I have this in /etc/zfs/zed.d/zed.rc:
Code:
ZED_EMAIL_ADDR="your@address.com"
ZED_EMAIL_PROG="mail"
ZED_EMAIL_OPTS="-s '@SUBJECT@' @ADDRESS@"
ZED_NOTIFY_INTERVAL_SECS=21600
ZED_NOTIFY_VERBOSE=1
ZED_USE_ENCLOSURE_LEDS=1
ZED_SYSLOG_SUBCLASS_EXCLUDE="history_event"

This needs a working local mail command, of course, and the daemon enabled. Run a scrub and see if you get an e-mail after it is done.
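(That last check, concretely, with the pool name from earlier in the thread:)
Code:
# Start a scrub and confirm ZED mails the result when it finishes
zpool scrub tank
zpool status tank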