Is it possible to "re-flash" a WD-Red drive?

zicoz

Member
Jan 7, 2011
140
0
16
I keep running into an issue when I "overfill" my ZFS pools on my FreeNAS server (past the 80% mark), where a hard drive starts reporting a SMART error. This only seems to happen to WD60EFRX drives, and I've had 5 of them with this issue now.

When I run WDDLD on them I get the following message:

Quick test on drive 3 did not complete!
Status code = 03 (Fatal or unknown error), Failure Checkpoint = 98 (Unknown test))
SMART self-test did not complete on drive 3!

I thought I'd try flashing them with a new firmware, but when I use the tool from WD I get a message saying "Drive update not applicable"

I am guessing this means they already have the latest firmware, but is there some way to re-flash the drives with the latest firmware to see if that solves the problem?
 

Rain

Active Member
May 13, 2013
240
81
28
"Overfilling" ZFS volumes may affect performance but it certainly won't directly cause SMART errors. If the drive is throwing SMART errors and is failing self-tests, the drive is dying.

This only seems to happen to WD60EFRX drives, and I've had 5 of them with this issue now.
Post the SMART output for us to review.
 

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
1,218
412
83
If there isn't a known problem with the drives' firmware that necessitates an upgrade, then attempting to update to some other random firmware is a Bad Idea. Drives within the same product line frequently have differing, and sometimes incompatible, firmware, and there's no guarantee of actually fixing anything.
 

zicoz

The drives are already "shelved", so the fact that "something might go wrong" isn't really a problem.

edit:

As a side note, when I run an extended test in WDDLD it passes.
 

Terry Kennedy

Well-Known Member
Jun 25, 2015
1,067
504
113
New York City
www.glaver.org
I keep running into an issue when I "overfill" my ZFS pools on my FreeNAS server (past the 80% mark), where a hard drive starts reporting a SMART error. This only seems to happen to WD60EFRX drives, and I've had 5 of them with this issue now.
Which SMART error?
When I run WDDLD on them I get the following message:
I don't know what tests Data Lifeguard runs, but drives from all manufacturers log far more data than you can see with SMART. WD has a tool that will gather between 2MB and 8MB of compressed log data per drive, depending on the model. Unfortunately, the only people with the tools to analyze those logs are at WD, and they don't deal with end users (nor is the utility available to end users).
I thought I'd try flashing them with a new firmware, but when I use the tool from WD I get a message saying "Drive update not applicable"
Drive manufacturers are very stingy with their firmware (and have been for years). Even if there is a firmware fix for a specific drive, their tools will only update the drive to the fixed version, not the most recent firmware. Part of that is that the internals of the drive frequently change without a change to the model number.

If the drives are under warranty I'd RMA them, saying "your tool says these are bad". I've never had a manufacturer decline a drive RMA for even a single grown defect reported via SMART. That said, the manufacturers definitely keep track of users and RMAs, and you'll hear from them if you RMA a bunch of drives that test as good. I actually triggered this at WD (in one of my end-user guises) because they kept replacing RE drives with non-RE drives, and then I'd reject the non-RE drives and we'd start the cycle all over again. When I talked to a WD engineering manager, he did say "I see you've returned a bunch of drives that came up 'no problem found'" and I had to explain that those were unopened drives they sent me by mistake. The manager shipped me a bunch of the right RE drives and sent a memo to the RMA center explaining the meaning of "replace like with like".
 

EffrafaxOfWug

Did you run the SMART tests with the drives in the same chassis, or did you pull them out and place them elsewhere?

The fact that you've got the same set of UDMA CRC errors on multiple drives within a short space of time might be an indicator of a dodgy cable or connection - so it'd be worthwhile putting the affected drives in another computer and performing the tests on them there too.
 

zicoz

The error first occurred in the server, but the drives give the same error when they are in another computer. I'm wondering if this is a "stored error", that is, whether there might be an "internal counter" stored in the drive, which then triggers an error when I test it in another computer.

That's the reason why I want to flash it to see if that "resets" that counter.
 

Terry Kennedy

The error first occurred in the server, but the drives give the same error when they are in another computer. I'm wondering if this is a "stored error", that is, whether there might be an "internal counter" stored in the drive, which then triggers an error when I test it in another computer.
Almost all SMART values are "event happened" counters, not "current status" values (an example of a current status value would be drive temperature - SMART 194).
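Since these are counters, one way to tell a stored count from an ongoing problem is to read the same attribute twice and look at the difference. The sketch below is only an illustration: it assumes the ten-column attribute table that `smartctl -A` prints, the helper names `raw_values` and `counter_deltas` are made up here, and both sample readings are invented data rather than output from any particular drive.

```python
def raw_values(smartctl_text):
    """Map attribute ID -> raw value, parsed from a smartctl attribute table."""
    values = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            try:
                values[int(fields[0])] = int(fields[9])
            except ValueError:
                continue  # raw value is not a plain integer; skip it
    return values


def counter_deltas(before_text, after_text, ids=(199,)):
    """Return {attribute_id: after - before} for the requested counters."""
    before, after = raw_values(before_text), raw_values(after_text)
    return {i: after[i] - before[i] for i in ids if i in before and i in after}


# Invented sample readings (hypothetical): same drive, before and after a move.
reading_before = "199 UDMA_CRC_Error_Count    0x0032   200   170   000    Old_age   Always       -       1000"
reading_after = "199 UDMA_CRC_Error_Count    0x0032   200   170   000    Old_age   Always       -       1000"

print(counter_deltas(reading_before, reading_after))
```

A delta of zero after moving the drive and pushing some I/O through it would suggest the count is history rather than something still happening on the new cable and controller; a growing count would point at the current setup.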

A UDMA CRC error indicates a transmission error on the cable. The cause could be the drive (unlikely if several all started doing this), the cable(s), or the controller(s). It is theoretically possible that it could be host software, but I haven't seen that since the era of the "ultra dumb parallel port" IDE adapters.
That's the reason why I want to flash it to see if that "resets" that counter.
If flashing drives cleared the counters, you can bet that eBay sellers would be doing it. Even the sellers who use the OEM "initialize log area" commands to hide the previous SMART history leave a few clues.

I don't know about your HDDSCAN utility, but smartmontools will give you detailed event status, which should include the number of power-on hours the drive had at the time a particular event happened.
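As a rough sketch of pulling those power-on hours out of the text that `smartctl -a` prints, something like the following could work. It assumes the "Error N occurred at disk power-on lifetime: H hours" wording used in smartctl's ATA error log section; the function name `error_hours` and the shortened sample text are made up for illustration.

```python
import re

# Assumed line format from the SMART error log section of `smartctl -a`.
ERROR_RE = re.compile(
    r"^Error (\d+) occurred at disk power-on lifetime: (\d+) hours",
    re.MULTILINE,
)


def error_hours(smartctl_text):
    """Return [(error_number, power_on_hours), ...] in the order logged."""
    return [(int(num), int(hours)) for num, hours in ERROR_RE.findall(smartctl_text)]


# Shortened, hypothetical excerpt in the same layout.
error_log_sample = """\
Error 3 occurred at disk power-on lifetime: 5000 hours (208 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

Error 2 occurred at disk power-on lifetime: 4990 hours (207 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.
"""

print(error_hours(error_log_sample))
```

Comparing those hours against the drive's current Power_On_Hours (attribute 9) shows whether the logged errors are recent or date from before the drive was moved.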

What kind (brand/model) of case / power supply / cables are involved here?
 

zicoz

Thank you.

Case: SuperMicro CSE-846TQ
PSU: Corsair AX850
Cables: Unknown breakout cable

smartctl -a output for one of the drives:

Code:
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (  57) A fatal error or unknown test error
                                        occurred while the device was executing
                                        its self-test routine and the device
                                        was unable to complete the self-test
                                        routine.
Total time to complete Offline
data collection:                (   60) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (   6) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   199   051    Pre-fail  Always       -       106
  3 Spin_Up_Time            0x0027   203   197   021    Pre-fail  Always       -       8833
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       208
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   063   063   000    Old_age   Always       -       27028
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       208
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       203
193 Load_Cycle_Count        0x0032   197   197   000    Old_age   Always       -       10321
194 Temperature_Celsius     0x0022   115   095   000    Old_age   Always       -       37
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   170   000    Old_age   Always       -       213471
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 9 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 9 occurred at disk power-on lifetime: 26364 hours (1098 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 00 00 00 00 40

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  e3 00 00 00 00 00 40 00      00:08:34.049  IDLE
  ef 42 00 00 00 00 40 00      00:07:34.104  SET FEATURES [Enable AAM] [OBS-ACS-2]
  ef 85 00 00 00 00 40 00      00:07:34.080  SET FEATURES [Disable APM]

Error 8 occurred at disk power-on lifetime: 26364 hours (1098 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 00 00 00 00 40  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 42 00 00 00 00 40 00      00:07:34.104  SET FEATURES [Enable AAM] [OBS-ACS-2]
  ef 85 00 00 00 00 40 00      00:07:34.080  SET FEATURES [Disable APM]

Error 7 occurred at disk power-on lifetime: 26364 hours (1098 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 00 00 00 00 40  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 85 00 00 00 00 40 00      00:07:34.080  SET FEATURES [Disable APM]

Error 6 occurred at disk power-on lifetime: 26364 hours (1098 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 00 00 00 00 40

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  e3 00 00 00 00 00 40 00   3d+04:05:45.687  IDLE
  ef 42 00 00 00 00 40 00   3d+04:04:45.711  SET FEATURES [Enable AAM] [OBS-ACS-2]
  ef 85 00 00 00 00 40 00   3d+04:04:45.686  SET FEATURES [Disable APM]

Error 5 occurred at disk power-on lifetime: 26364 hours (1098 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 61 00 00 00 00 40  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 42 00 00 00 00 40 00   3d+04:04:45.711  SET FEATURES [Enable AAM] [OBS-ACS-2]
  ef 85 00 00 00 00 40 00   3d+04:04:45.686  SET FEATURES [Disable APM]

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Fatal or unknown error        90%     27028         -
# 2  Conveyance offline  Fatal or unknown error        90%     26882         -
# 3  Conveyance offline  Fatal or unknown error        90%     26857         -
# 4  Extended offline    Fatal or unknown error        90%     26840         -
# 5  Extended offline    Aborted by host               90%     26840         -
# 6  Extended offline    Fatal or unknown error        90%     26840         -
# 7  Conveyance offline  Fatal or unknown error        90%     26840         -
# 8  Conveyance offline  Fatal or unknown error        90%     26837         -
# 9  Conveyance offline  Fatal or unknown error        90%     26597         -
#10  Conveyance offline  Fatal or unknown error        90%     26573         -
#11  Conveyance offline  Fatal or unknown error        90%     26546         -
#12  Extended offline    Interrupted (host reset)      40%      7289         -
#13  Conveyance offline  Completed without error       00%      7283         -
#14  Short offline       Completed without error       00%      7283         -
#15  Extended offline    Completed without error       00%      6950         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
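
For reading a self-test log like the one above, a small script can report the last power-on hour at which a test completed and the first hour at which one failed. This is a sketch only: it assumes the column layout smartctl prints for the self-test log, the names `self_test_rows` and `pass_fail_window` are made up, and the sample is a handful of rows copied from the log above.

```python
import re

# Assumed row layout: "#NN  <test>  <status>  <remaining>%  <lifetime hours>  <LBA>"
ROW_RE = re.compile(
    r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)",
    re.MULTILINE,
)


def self_test_rows(text):
    """Parse self-test log rows into dicts."""
    return [
        {"num": int(num), "test": test, "status": status, "hours": int(hours)}
        for num, test, status, _remaining, hours, _lba in ROW_RE.findall(text)
    ]


def pass_fail_window(text):
    """Return (last hour a test completed, first hour a test failed fatally)."""
    rows = self_test_rows(text)
    passed = [r["hours"] for r in rows if r["status"].startswith("Completed")]
    failed = [r["hours"] for r in rows if r["status"].startswith("Fatal")]
    return (max(passed) if passed else None, min(failed) if failed else None)


self_test_sample = """\
# 1  Short offline       Fatal or unknown error        90%     27028         -
#11  Conveyance offline  Fatal or unknown error        90%     26546         -
#12  Extended offline    Interrupted (host reset)      40%      7289         -
#13  Conveyance offline  Completed without error       00%      7283         -
"""

print(pass_fail_window(self_test_sample))
```

Host-interrupted or host-aborted tests are deliberately counted as neither pass nor fail, since they say nothing about the drive itself.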
 