Is it possible to "re-flash" a WD-Red drive?

Discussion in 'Hard Drives and Solid State Drives' started by zicoz, Aug 17, 2018.

  1. zicoz

    zicoz Member

    Joined:
    Jan 7, 2011
    Messages:
    140
    Likes Received:
    0
    I keep running into an issue when I "overfill" the ZFS pools on my FreeNAS server (past the 80% mark), where a hard drive starts reporting a SMART error. This only seems to happen to WD60EFRX drives, and I've had 5 of them with this issue now.

    When I run WDDLD on them I get the following message:

    Quick test on drive 3 did not complete!
    Status code = 03 (Fatal or unknown error), Failure Checkpoint = 98 (Unknown test))
    SMART self-test did not complete on drive 3!

    I thought I'd try flashing them with a new firmware, but when I use the tool from WD I get a message saying "Drive update not applicable"

    I am guessing this means they already have the latest firmware, but is there some way to re-flash the drives with the latest firmware to see if that solves the problem?
     
    #1
    Last edited: Aug 18, 2018
  2. Rain

    Rain Active Member

    Joined:
    May 13, 2013
    Messages:
    226
    Likes Received:
    71
    "Overfilling" ZFS volumes may affect performance but it certainly won't directly cause SMART errors. If the drive is throwing SMART errors and is failing self-tests, the drive is dying.

    Post the SMART output for us to review.
     
    #2
  3. zicoz

    zicoz Member

    Joined:
    Jan 7, 2011
    Messages:
    140
    Likes Received:
    0
    Yes, that's what my research said as well, but when it's happened 5 times with the same scenario I'm starting to wonder.

    SMART from WDDLD is posted above.

    From HDDScan I get the following:

    WDC WD60EFRX-68MYMN1-WD-WX51D6427147-SMART.mht
     
    #3
    Last edited: Aug 18, 2018
  4. EffrafaxOfWug

    EffrafaxOfWug Radioactive Member

    Joined:
    Feb 12, 2015
    Messages:
    1,064
    Likes Received:
    353
    If there isn't a known problem with the drives' firmware that necessitates an upgrade, then attempting to update to some other random firmware is a Bad Idea. Drives within the same product line will frequently have differing - and sometimes incompatible - firmwares, and there's no guarantee of actually fixing anything.
     
    #4
  5. zicoz

    zicoz Member

    Joined:
    Jan 7, 2011
    Messages:
    140
    Likes Received:
    0
    The drives are already "shelved", so the fact that "something might go wrong" isn't really a problem.

    edit:

    As a side note, when I run an extended test in WDDLD it passes.
     
    #5
    Last edited: Aug 18, 2018
  6. Terry Kennedy

    Terry Kennedy Well-Known Member

    Joined:
    Jun 25, 2015
    Messages:
    1,016
    Likes Received:
    473
    Which SMART error?
    I don't know what tests Data Lifeguard runs, but drives from all manufacturers log far more data than you can see with SMART. WD has a tool that will gather between 2MB and 8MB of compressed log data per drive, depending on the model. Unfortunately, the only people with the tools to analyze those logs are at WD, and they don't deal with end users (nor is the utility available to end users).
    Drive manufacturers are very stingy with their firmware (and have been for years). Even if there is a firmware fix for a specific drive, their tools will only update the drive to the fixed version, not the most recent firmware. Part of that is that the internals of the drive frequently change without a change to the model number.

    If the drives are under warranty I'd RMA them, saying "your tool says these are bad". I've never had a manufacturer decline a drive RMA for even a single grown defect reported via SMART. That said, the manufacturers definitely keep track of users and RMAs, and you'll hear from them if you RMA a bunch of drives that test as good. I actually triggered this at WD (in one of my end-user guises) because they kept replacing RE drives with non-RE drives and then I'd reject the non-RE drives and we'd start the cycle all over again. When talking to a WD engineering manager, he did say "I see you've returned a bunch of drives that came up 'no problem found'" and I had to explain that those were unopened drives they sent me by mistake. The manager shipped me a bunch of the right RE drives and sent a memo to the RMA center explaining the meaning of "replace like with like".
     
    #6
  7. zicoz

    zicoz Member

    Joined:
    Jan 7, 2011
    Messages:
    140
    Likes Received:
    0
    #7
  8. EffrafaxOfWug

    EffrafaxOfWug Radioactive Member

    Joined:
    Feb 12, 2015
    Messages:
    1,064
    Likes Received:
    353
    Did you run the smart tests with the drives in the same chassis or did you pull them out and place them elsewhere?

    The fact that you've got the same set of UDMA CRC errors on multiple drives within a short space of time might be an indicator of a dodgy cable or connection - so it'd be worthwhile putting the affected drives in another computer and performing the tests on them there too.
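The "same errors on several drives" reasoning can be made concrete with a small check like the Python sketch below. It is only a heuristic under stated assumptions: the `DriveCounters` type, the function names and all sample values are made up for illustration, and the raw values would have to be read from each drive's SMART attributes 5, 197 and 199.

```python
from dataclasses import dataclass


@dataclass
class DriveCounters:
    label: str                # hypothetical name for the drive
    udma_crc_errors: int      # raw value of SMART 199
    reallocated_sectors: int  # raw value of SMART 5
    pending_sectors: int      # raw value of SMART 197


def looks_like_link_problem(drive):
    """CRC errors with no media errors point at the data path, not the platters."""
    return (
        drive.udma_crc_errors > 0
        and drive.reallocated_sectors == 0
        and drive.pending_sectors == 0
    )


def shared_path_suspected(drives, min_drives=2):
    """True when several drives show the link-problem pattern at once."""
    return sum(looks_like_link_problem(d) for d in drives) >= min_drives


# Illustrative values only, not measurements from real drives.
drives = [
    DriveCounters("bay-3", 213471, 0, 0),
    DriveCounters("bay-4", 1500, 0, 0),
    DriveCounters("bay-5", 0, 0, 0),
]
suspect_cabling = shared_path_suspected(drives)
```

When this returns True, the parts the drives have in common (breakout cable, backplane, controller) are better suspects than the drives themselves, which is why retesting in another computer is worthwhile.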
     
    #8
  9. zicoz

    zicoz Member

    Joined:
    Jan 7, 2011
    Messages:
    140
    Likes Received:
    0
    The error first occurred while the drives were in the server, but they give the same error in another computer. But I'm wondering if this is a "stored error", that is, whether there is an "internal counter" stored in the drive which then reports an error when I test it in another computer.

    That's the reason why I want to flash it to see if that "resets" that counter.
     
    #9
  10. Terry Kennedy

    Terry Kennedy Well-Known Member

    Joined:
    Jun 25, 2015
    Messages:
    1,016
    Likes Received:
    473
    Almost all SMART values are "event happened" counters, not "current status" values (an example of a current status value would be drive temperature - SMART 194).
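Because the counters are cumulative, a single reading cannot tell whether errors are still happening. A rough Python sketch of comparing two readings of the attribute table printed by `smartctl -A` follows; the function names are hypothetical and both sample snapshots are illustrative rather than a real before/after pair.

```python
import re

# One row of the smartctl attribute table:
# ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+\d+\s+\d+\s+\d+"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def parse_raw_values(text):
    """Map attribute ID -> (name, raw value) for every table row found."""
    values = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            values[int(m.group(1))] = (m.group(2), int(m.group(3)))
    return values


def counters_that_grew(before, after):
    """Return {name: increase} for attributes whose raw value went up."""
    grew = {}
    for attr_id, (name, new_raw) in after.items():
        old = before.get(attr_id)
        if old is not None and new_raw > old[1]:
            grew[name] = new_raw - old[1]
    return grew


# Illustrative first reading.
BEFORE = """\
  9 Power_On_Hours          0x0032   063   063   000    Old_age   Always       -       27028
194 Temperature_Celsius     0x0022   115   095   000    Old_age   Always       -       37
199 UDMA_CRC_Error_Count    0x0032   200   170   000    Old_age   Always       -       213471
"""
# Illustrative second reading, a day later on a different cable.
AFTER = """\
  9 Power_On_Hours          0x0032   063   063   000    Old_age   Always       -       27052
194 Temperature_Celsius     0x0022   115   095   000    Old_age   Always       -       35
199 UDMA_CRC_Error_Count    0x0032   200   170   000    Old_age   Always       -       213471
"""

grew = counters_that_grew(parse_raw_values(BEFORE), parse_raw_values(AFTER))
print(grew)  # {'Power_On_Hours': 24}
```

If attribute 199 stays flat after the drive is moved to a different cable and controller, the old count is history rather than an ongoing fault.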

    A UDMA CRC error (SMART 199) indicates a transmission error on the cable - the cause could be the drive (unlikely if several all started doing this), cable(s), or controller(s). It is theoretically possible that it could be host software, but I haven't seen that since the era of the "ultra dumb parallel port" IDE adapters.
    If flashing drives cleared the counters, you can bet that eBay sellers would be doing it. Even the sellers who use the OEM "initialize log area" commands to hide the previous SMART history leave a few clues.

    I don't know about your HDDSCAN utility, but smartmontools will give you detailed event status, which should include the number of power-on hours the drive had at the time a particular event happened.
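As a minimal sketch of reading those power-on-hour stamps, the Python below pulls them out of saved `smartctl -a` text and compares them with the drive's current Power_On_Hours. The function name is made up for illustration, and the sample lines only mimic the layout smartctl prints.

```python
import re

# "Error N occurred at disk power-on lifetime: H hours (...)"
ERROR_RE = re.compile(
    r"Error (\d+) occurred at disk power-on lifetime: (\d+) hours"
)
# Attribute 9 row; the raw value is the last number on the line.
POWER_ON_RE = re.compile(
    r"^\s*9\s+Power_On_Hours\s.*\s(\d+)\s*$", re.MULTILINE
)


def error_ages(smartctl_text):
    """Return (error number, hours at error, hours elapsed since) tuples."""
    match = POWER_ON_RE.search(smartctl_text)
    if match is None:
        raise ValueError("Power_On_Hours attribute not found")
    current = int(match.group(1))
    return [
        (int(num), int(hours), current - int(hours))
        for num, hours in ERROR_RE.findall(smartctl_text)
    ]


# Sample excerpt in smartctl's output layout.
SAMPLE = """\
  9 Power_On_Hours          0x0032   063   063   000    Old_age   Always       -       27028
Error 9 occurred at disk power-on lifetime: 26364 hours (1098 days + 12 hours)
Error 8 occurred at disk power-on lifetime: 26364 hours (1098 days + 12 hours)
"""

ages = error_ages(SAMPLE)
print(ages)  # [(9, 26364, 664), (8, 26364, 664)]
```

A large gap between the last logged error and the current hour count suggests the logged events are old rather than still occurring.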

    What kind (brand/model) of case / power supply / cables are involved here?
     
    #10
  11. zicoz

    zicoz Member

    Joined:
    Jan 7, 2011
    Messages:
    140
    Likes Received:
    0
    Thank you.

    Case: SuperMicro CSE-846TQ
    PSU: Corsair AX850
    Cables: Unknown breakout cable

    smartctl -a output for one of the drives:

    Code:
    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
    
    General SMART Values:
    Offline data collection status:  (0x00) Offline data collection activity
                                            was never started.
                                            Auto Offline Data Collection: Disabled.
    Self-test execution status:      (  57) A fatal error or unknown test error
                                            occurred while the device was executing
                                            its self-test routine and the device
                                            was unable to complete the self-test
                                            routine.
    Total time to complete Offline
    data collection:                (   60) seconds.
    Offline data collection
    capabilities:                    (0x7b) SMART execute Offline immediate.
                                            Auto Offline data collection on/off support.
                                            Suspend Offline collection upon new
                                            command.
                                            Offline surface scan supported.
                                            Self-test supported.
                                            Conveyance Self-test supported.
                                            Selective Self-test supported.
    SMART capabilities:            (0x0003) Saves SMART data before entering
                                            power-saving mode.
                                            Supports SMART auto save timer.
    Error logging capability:        (0x01) Error logging supported.
                                            General Purpose Logging supported.
    Short self-test routine
    recommended polling time:        (   2) minutes.
    Extended self-test routine
    recommended polling time:        (   6) minutes.
    Conveyance self-test routine
    recommended polling time:        (   5) minutes.
    SCT capabilities:              (0x303d) SCT Status supported.
                                            SCT Error Recovery Control supported.
                                            SCT Feature Control supported.
                                            SCT Data Table supported.
    
    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x002f   200   199   051    Pre-fail  Always       -       106
      3 Spin_Up_Time            0x0027   203   197   021    Pre-fail  Always       -       8833
      4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       208
      5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
      9 Power_On_Hours          0x0032   063   063   000    Old_age   Always       -       27028
     10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
     11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       208
    192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       203
    193 Load_Cycle_Count        0x0032   197   197   000    Old_age   Always       -       10321
    194 Temperature_Celsius     0x0022   115   095   000    Old_age   Always       -       37
    196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x0032   200   170   000    Old_age   Always       -       213471
    200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
    
    SMART Error Log Version: 1
    ATA Error Count: 9 (device log contains only the most recent five errors)
            CR = Command Register [HEX]
            FR = Features Register [HEX]
            SC = Sector Count Register [HEX]
            SN = Sector Number Register [HEX]
            CL = Cylinder Low Register [HEX]
            CH = Cylinder High Register [HEX]
            DH = Device/Head Register [HEX]
            DC = Device Command Register [HEX]
            ER = Error register [HEX]
            ST = Status register [HEX]
    Powered_Up_Time is measured from power on, and printed as
    DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
    SS=sec, and sss=millisec. It "wraps" after 49.710 days.
    
    Error 9 occurred at disk power-on lifetime: 26364 hours (1098 days + 12 hours)
      When the command that caused the error occurred, the device was active or idle.
    
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      04 61 00 00 00 00 40
    
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      e3 00 00 00 00 00 40 00      00:08:34.049  IDLE
      ef 42 00 00 00 00 40 00      00:07:34.104  SET FEATURES [Enable AAM] [OBS-ACS-2]
      ef 85 00 00 00 00 40 00      00:07:34.080  SET FEATURES [Disable APM]
    
    Error 8 occurred at disk power-on lifetime: 26364 hours (1098 days + 12 hours)
      When the command that caused the error occurred, the device was active or idle.
    
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      04 61 00 00 00 00 40  Device Fault; Error: ABRT
    
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      ef 42 00 00 00 00 40 00      00:07:34.104  SET FEATURES [Enable AAM] [OBS-ACS-2]
      ef 85 00 00 00 00 40 00      00:07:34.080  SET FEATURES [Disable APM]
    
    Error 7 occurred at disk power-on lifetime: 26364 hours (1098 days + 12 hours)
      When the command that caused the error occurred, the device was active or idle.
    
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      04 61 00 00 00 00 40  Device Fault; Error: ABRT
    
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      ef 85 00 00 00 00 40 00      00:07:34.080  SET FEATURES [Disable APM]
    
    Error 6 occurred at disk power-on lifetime: 26364 hours (1098 days + 12 hours)
      When the command that caused the error occurred, the device was active or idle.
    
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      04 61 00 00 00 00 40
    
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      e3 00 00 00 00 00 40 00   3d+04:05:45.687  IDLE
      ef 42 00 00 00 00 40 00   3d+04:04:45.711  SET FEATURES [Enable AAM] [OBS-ACS-2]
      ef 85 00 00 00 00 40 00   3d+04:04:45.686  SET FEATURES [Disable APM]
    
    Error 5 occurred at disk power-on lifetime: 26364 hours (1098 days + 12 hours)
      When the command that caused the error occurred, the device was active or idle.
    
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      04 61 00 00 00 00 40  Device Fault; Error: ABRT
    
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      ef 42 00 00 00 00 40 00   3d+04:04:45.711  SET FEATURES [Enable AAM] [OBS-ACS-2]
      ef 85 00 00 00 00 40 00   3d+04:04:45.686  SET FEATURES [Disable APM]
    
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Short offline       Fatal or unknown error        90%     27028         -
    # 2  Conveyance offline  Fatal or unknown error        90%     26882         -
    # 3  Conveyance offline  Fatal or unknown error        90%     26857         -
    # 4  Extended offline    Fatal or unknown error        90%     26840         -
    # 5  Extended offline    Aborted by host               90%     26840         -
    # 6  Extended offline    Fatal or unknown error        90%     26840         -
    # 7  Conveyance offline  Fatal or unknown error        90%     26840         -
    # 8  Conveyance offline  Fatal or unknown error        90%     26837         -
    # 9  Conveyance offline  Fatal or unknown error        90%     26597         -
    #10  Conveyance offline  Fatal or unknown error        90%     26573         -
    #11  Conveyance offline  Fatal or unknown error        90%     26546         -
    #12  Extended offline    Interrupted (host reset)      40%      7289         -
    #13  Conveyance offline  Completed without error       00%      7283         -
    #14  Short offline       Completed without error       00%      7283         -
    #15  Extended offline    Completed without error       00%      6950         -
    
    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.
    
     
    #11
    Last edited: Aug 26, 2018
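The self-test log in the output above can also be read programmatically to see roughly when tests stopped completing. The Python below is a rough sketch under stated assumptions: the function names are hypothetical, the regular expression relies on the column layout shown in that log, and the sample is a subset of the posted rows.

```python
import re

# One row of the "SMART Self-test log" table:
# "#NN  <test type>  <status>  <remaining>%  <lifetime hours>  <LBA>"
ROW = re.compile(
    r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s{2,}(\d+)%\s+(\d+)\s+(\S+)"
)


def parse_selftest_log(text):
    """Return (test type, status, lifetime hours) for each log row."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            rows.append((m.group(2), m.group(3), int(m.group(5))))
    return rows


def failure_window(rows):
    """Return (latest hours with a pass, earliest hours with a fatal error)."""
    passes = [h for _, status, h in rows if status == "Completed without error"]
    fatals = [h for _, status, h in rows if status.startswith("Fatal")]
    return (
        max(passes) if passes else None,
        min(fatals) if fatals else None,
    )


# Subset of the rows posted above.
SAMPLE = """\
# 1  Short offline       Fatal or unknown error        90%     27028         -
#11  Conveyance offline  Fatal or unknown error        90%     26546         -
#12  Extended offline    Interrupted (host reset)      40%      7289         -
#13  Conveyance offline  Completed without error       00%      7283         -
#15  Extended offline    Completed without error       00%      6950         -
"""

window = failure_window(parse_selftest_log(SAMPLE))
print(window)  # (7283, 26546)
```

For this drive the log only narrows the onset to somewhere between 7283 and 26546 power-on hours, because no self-test completed or failed in between.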