BestBuy - WD - Easystore 10TB with 32GB Flash Drive - $180

svtkobra7

Active Member
Jan 2, 2017
362
87
28
Here is the SMART data (too long for prior post):

Code:
########## SMART status report for da7 drive (: REDACTED) ##########
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0004   130   130   054    Old_age   Offline      -       108
  3 Spin_Up_Time            0x0007   209   209   024    Pre-fail  Always       -       328 (Average 301)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       16
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   100   100   067    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0004   128   128   020    Old_age   Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       1201
 10 Spin_Retry_Count        0x0012   100   100   060    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       16
 22 Unknown_Attribute       0x0023   100   100   025    Pre-fail  Always       -       100
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       57
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       57
194 Temperature_Celsius     0x0002   196   196   000    Old_age   Always       -       33 (Min/Max 22/35)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       1

ATA Error Count: 1
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 1150 hours (47 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 48 18 b3 98 40 00   5d+07:00:16.718  READ FPDMA QUEUED
  60 00 58 18 b5 98 40 00   5d+07:00:14.966  READ FPDMA QUEUED
  60 00 50 18 b4 98 40 00   5d+07:00:14.966  READ FPDMA QUEUED
  60 00 40 18 b2 98 40 00   5d+07:00:14.966  READ FPDMA QUEUED
  60 00 78 80 99 98 40 00   5d+07:00:14.955  READ FPDMA QUEUED

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Extended offline    Completed without error       00%       822         -
Short offline       Completed without error       00%      1200         -

########## SMART status report for da8 drive (: REDACTED) ##########
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0004   130   130   054    Old_age   Offline      -       109
  3 Spin_Up_Time            0x0007   212   212   024    Pre-fail  Always       -       324 (Average 297)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       16
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   100   100   067    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0004   128   128   020    Old_age   Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       1201
 10 Spin_Retry_Count        0x0012   100   100   060    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       16
 22 Unknown_Attribute       0x0023   100   100   025    Pre-fail  Always       -       100
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       54
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       54
194 Temperature_Celsius     0x0002   196   196   000    Old_age   Always       -       33 (Min/Max 21/35)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       1

ATA Error Count: 1
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 1150 hours (47 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 20 80 76 98 40 00   5d+07:00:14.514  READ FPDMA QUEUED
  60 00 78 80 99 98 40 00   5d+07:00:12.759  READ FPDMA QUEUED
  60 00 70 80 98 98 40 00   5d+07:00:12.759  READ FPDMA QUEUED
  60 00 68 80 97 98 40 00   5d+07:00:12.759  READ FPDMA QUEUED
  60 00 60 80 96 98 40 00   5d+07:00:12.759  READ FPDMA QUEUED

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Extended offline    Completed without error       00%       822         -
Short offline       Completed without error       00%      1200         -

########## SMART status report for da9 drive (: REDACTED) ##########
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0004   131   131   054    Old_age   Offline      -       104
  3 Spin_Up_Time            0x0007   197   197   024    Pre-fail  Always       -       318 (Average 349)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       17
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   100   100   067    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0004   128   128   020    Old_age   Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       1201
 10 Spin_Retry_Count        0x0012   100   100   060    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       17
 22 Unknown_Attribute       0x0023   100   100   025    Pre-fail  Always       -       100
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       54
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       54
194 Temperature_Celsius     0x0002   196   196   000    Old_age   Always       -       33 (Min/Max 22/34)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       1

ATA Error Count: 1
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 1150 hours (47 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 40 80 4e 8d 40 00   5d+07:00:14.649  READ FPDMA QUEUED
  60 00 78 80 61 8d 40 00   5d+07:00:12.895  READ FPDMA QUEUED
  60 00 70 80 60 8d 40 00   5d+07:00:12.895  READ FPDMA QUEUED
  60 00 68 80 5f 8d 40 00   5d+07:00:12.895  READ FPDMA QUEUED
  60 00 60 80 5e 8d 40 00   5d+07:00:12.894  READ FPDMA QUEUED

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Extended offline    Completed without error       00%       822         -
Short offline       Completed without error       00%      1200         -

########## SMART status report for da11 drive (: REDACTED) ##########
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0004   130   130   054    Old_age   Offline      -       108
  3 Spin_Up_Time            0x0007   192   192   024    Pre-fail  Always       -       328 (Average 357)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       18
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   100   100   067    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0004   128   128   020    Old_age   Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       1201
 10 Spin_Retry_Count        0x0012   100   100   060    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       18
 22 Unknown_Attribute       0x0023   100   100   025    Pre-fail  Always       -       100
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       55
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       55
194 Temperature_Celsius     0x0002   196   196   000    Old_age   Always       -       33 (Min/Max 22/35)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       1

ATA Error Count: 1
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 1150 hours (47 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 70 80 50 8d 40 00   5d+07:00:15.040  READ FPDMA QUEUED
  60 00 78 80 51 8d 40 00   5d+07:00:13.283  READ FPDMA QUEUED
  60 00 68 80 4f 8d 40 00   5d+07:00:13.283  READ FPDMA QUEUED
  60 00 60 80 4e 8d 40 00   5d+07:00:13.283  READ FPDMA QUEUED
  60 00 58 80 3d 8d 40 00   5d+07:00:13.281  READ FPDMA QUEUED

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Extended offline    Completed without error       00%       822         -
Short offline       Completed without error       00%      1200         -

########## SMART status report for da14 drive (: REDACTED) ##########
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0004   128   128   054    Old_age   Offline      -       116
  3 Spin_Up_Time            0x0007   209   209   024    Pre-fail  Always       -       297 (Average 333)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       17
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   100   100   067    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0004   128   128   020    Old_age   Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       1201
 10 Spin_Retry_Count        0x0012   100   100   060    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       17
 22 Unknown_Attribute       0x0023   100   100   025    Pre-fail  Always       -       100
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       55
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       55
194 Temperature_Celsius     0x0002   203   203   000    Old_age   Always       -       32 (Min/Max 21/34)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       1

ATA Error Count: 1
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 1150 hours (47 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 e0 50 d1 40 00   5d+07:00:12.908  READ FPDMA QUEUED
  60 00 98 e0 73 d1 40 00   5d+07:00:11.158  READ FPDMA QUEUED
  60 00 90 e0 72 d1 40 00   5d+07:00:11.158  READ FPDMA QUEUED
  60 00 88 e0 71 d1 40 00   5d+07:00:11.158  READ FPDMA QUEUED
  60 00 80 e0 70 d1 40 00   5d+07:00:11.158  READ FPDMA QUEUED

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Extended offline    Completed without error       00%       823         -
Short offline       Completed without error       00%      1200         -
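
For anyone reading the error logs above: the ER and ST bytes are bit fields, and the "ICRC, ABRT" text that smartctl prints is a decoding of ER = 0x84. Below is a minimal sketch of that decoding in Python. The bit tables are a partial list based on my understanding of the ATA register layout (only bits I'm confident about are named, the rest are reported by position), and the helper names are made up for illustration. It also re-checks the "1150 hours = 47 days + 22 hours" conversion and the 49.710-day wrap, which is what a 32-bit millisecond counter would give (my assumption about why it wraps there).

```python
# Partial bit tables (assumption: standard ATA register layout; bits not
# listed here are reported by position rather than guessed at).
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}


def decode_bits(value, table):
    """Return the names of all bits set in value, highest bit first."""
    names = []
    for bit in range(7, -1, -1):
        mask = 1 << bit
        if value & mask:
            names.append(table.get(mask, f"bit{bit}"))
    return names


def hours_to_days(hours):
    """Split a power-on-hours figure into (days, remaining hours)."""
    return divmod(hours, 24)


def wrap_days(counter_bits=32):
    """Days until a millisecond counter of the given width wraps around."""
    return 2 ** counter_bits / 1000 / 86400


if __name__ == "__main__":
    print(decode_bits(0x84, ERROR_BITS))   # ER byte from the logs above
    print(decode_bits(0x41, STATUS_BITS))  # ST byte from the logs above
    print(hours_to_days(1150))
    print(f"{wrap_days():.3f}")
```

Under those assumptions, 0x84 decodes to ICRC plus ABRT (an interface CRC error that aborted the command) and 0x41 to DRDY plus ERR, which agrees with what smartctl printed.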
 

msg7086

Active Member
May 2, 2017
423
148
43
36
UDMA CRC errors are known to be caused by bad cables or poor signal quality on the cable. No matter how hard you burn in the drive, the burn-in won't detect or fix a cable / signal problem.
 

svtkobra7

UDMA CRC errors are known to be caused by bad cables or poor signal quality on the cable. No matter how hard you burn in the drive, the burn-in won't detect or fix a cable / signal problem.
  • Right - I understand the definition; however, I'm not sure how they cropped up in my scenario (which is why I mentioned that other drives had been connected to the exact same backplane and cable for a longer period of time).
  • So, as a follow-up, two questions (implied in my rather lengthy post - sorry):
    • Where you have an integrated 2208 (flashed to a 9207-8i) connected to the same backplane that all drives are on, how can you attribute this to a bad cable? Especially considering other drives had been connected to that same SAS cable & backplane. Not debating your posit, rather seeking understanding.
    • If that SMART error is 100% unrelated to drive health, why would HGST ever honor an RMA for a single UDMA CRC error (where that error = only SMART error on the drive)? I ask as that seems like a nonsensical RMA policy and despite the definition, have always wondered if it had anything to do with drive function.
 

Rain

Active Member
May 13, 2013
276
124
43
Where you have an integrated 2208 (flashed to a 9207-8i) connected to the same backplane that all drives are on, how can you attribute this to a bad cable? Especially considering other drives had been connected to that same SAS cable & backplane. Not debating your posit, rather seeking understanding.
If the value continues to increase, I would suspect something is wrong with the backplane. If the value was zero when you finished testing the drives and "1" the first time you checked after plugging them into the system they currently reside in, my first question would be: Did you disconnect the drives from the machine you tested them on while they were powered and data was possibly being read from or written to them?

If that SMART error is 100% unrelated to drive health, why would HGST ever honor an RMA for a single UDMA CRC error (where that error = only SMART error on the drive)? I ask as that seems like a nonsensical RMA policy and despite the definition, have always wondered if it had anything to do with drive function.
The RMA system is mostly automated, I'd imagine. HGST isn't worried about you RMA'ing one or two drives with CRC errors that are likely insignificant. If you ordered a pallet of drives, plugged them in at your datacenter, experienced CRC errors, and tried to RMA them all they'd probably want more information.
 

svtkobra7

If the value continues to increase, I would suspect something is wrong with the backplane. If the value was zero when you finished testing the drives and "1" the first time you checked after plugging them into the system they currently reside in, my first question would be: Did you disconnect the drives from the machine you tested them on while they were powered and data was possibly being read from or written to them?
  • To clarify prior remarks, the WDC drives were burned in on the same system in which they sit today. Heck, they haven't even been removed from the bay they were inserted into for burn in (but they only have 50 days of power on hours).
  • No power loss events / all shutdowns were graceful.
  • If you look at the SMART data I posted (or just take my word for it) ;), you will notice that error occurred at power on hour 1150, where the current power on hour as of SMART short run time was 1200.
The RMA system is mostly automated, I'd imagine. HGST isn't worried about you RMA'ing one or two drives with CRC errors that are likely insignificant. If you ordered a pallet of drives, plugged them in at your datacenter, experienced CRC errors, and tried to RMA them all they'd probably want more information.
  • Your assertion is correct regarding the RMA process being automated: Insofar as I can tell, customer creates RMA > if serial # is within warranty period, then automated approval > if drive received, replacement sent.
    • To your point about being absolutely insignificant to HGST/WDC, I couldn't agree more. I've considered that they may not even check the SMART data upon receipt, unless the hypothetical pallet is returned, or other behavior triggers more intensive review.
  • I'm not suggesting that HGST deems UDMA_CRC_Error_Count RMA-worthy simply because I shipped them a drive and received a replacement (although that is what ultimately happened); rather, something in the warranty verbiage prompted me to create a ticket in advance and confirm whether this was grounds for an RMA. (I didn't want to pay for return shipping only to get the same drive back, or something like that, if the RMA was denied.)
    • My ticket: "a S.M.A.R.T. Test shows errors for the following ID / Attribute and I would like to clarify whether this qualifies the drive for RMA: SMART ID = 199 / SMART ATTRIBUTE NAME = UDMA_CRC_Error_Count"
    • HGST's response: "Yes, this S.M.A.R.T error would confirm that the unit would need to be replaced." But to your point, maybe that was a bot.
    • And now I can never look at that SMART error as just a cable issue (which it is by definition)! ;)
 

Rand__

Well-Known Member
Mar 6, 2014
6,626
1,767
113
I can tell you from my personal experience that if it is a cable issue, the count will increase. It's also unlikely to hit 5 drives at once (assuming all occurred at the same or a similar PoH)?
5 is also an uncommon number, as it should be 4 (HBA cable) or 4 (836 backplane) or 3 (826 backplane) (provided the chassis was filled completely; otherwise a combination can theoretically occur). You could now map the failed drives to bays, if you still have that data, to see whether it's a clustered or scattered distribution...
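
A simple way to check whether the count increases is to compare the raw value of attribute 199 between two smartctl reports taken some time apart. Here is a minimal sketch in Python, assuming the attribute table layout shown in the SMART output earlier in the thread (ten whitespace-separated columns with RAW_VALUE as the tenth). The function names and the second snapshot are made up for illustration.

```python
def raw_value(report, attr_id):
    """Return the integer RAW_VALUE for a SMART attribute ID, or None if absent."""
    for line in report.splitlines():
        fields = line.split()
        # Columns: ID, NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED,
        # WHEN_FAILED, RAW_VALUE (any trailing "(...)" detail is ignored).
        if len(fields) >= 10 and fields[0] == str(attr_id):
            return int(fields[9])
    return None


def crc_increase(earlier_report, later_report, attr_id=199):
    """Return how much the attribute's raw value grew between two reports."""
    before = raw_value(earlier_report, attr_id)
    after = raw_value(later_report, attr_id)
    if before is None or after is None:
        raise ValueError(f"attribute {attr_id} missing from a report")
    return after - before


# EARLIER is the attribute 199 line from the da7 report; LATER is a
# hypothetical second snapshot, not real data.
EARLIER = "199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       1"
LATER = "199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       1"

if __name__ == "__main__":
    delta = crc_increase(EARLIER, LATER)
    print("still climbing, suspect cable/backplane" if delta > 0 else "count is stable")
```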
 

svtkobra7

I can tell you from my personal experience that if it is a cable issue, the count will increase. It's also unlikely to hit 5 drives at once (assuming all occurred at the same or a similar PoH)?
  • Same exact PoH.
5 is also an uncommon number, as it should be 4 (HBA cable) or 4 (836 backplane) or 3 (826 backplane) (provided the chassis was filled completely; otherwise a combination can theoretically occur). You could now map the failed drives to bays, if you still have that data, to see whether it's a clustered or scattered distribution...
  • 12/12 populated ... errors as per below ... no pattern that I can see
Code:
X = ERROR
2       5 X     8 X     11 X
1       4 X     7       10
0       3       6 X     9
(more of a curiosity than genuine concern)
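
To put a number on "clustered or scattered", the errored bays can be tallied by column and by row of the bay map above. This is a minimal sketch in Python, assuming the bays are numbered bottom-to-top within each column of three, as drawn in the map. Which bays actually share a cable or backplane lane is not something the map tells us, so the grouping is only by physical position.

```python
from collections import Counter

BAYS_PER_COLUMN = 3               # from the 4 x 3 map above
TOTAL_BAYS = 12                   # 12/12 populated
ERRORED_BAYS = {4, 5, 6, 8, 11}   # bays marked X in the map


def tally(bays, key):
    """Count errored bays per group, including groups with zero errors."""
    counts = Counter(key(bay) for bay in bays)
    groups = sorted({key(bay) for bay in range(TOTAL_BAYS)})
    return {group: counts.get(group, 0) for group in groups}


def by_column(bays):
    """Column 0 holds bays 0-2, column 1 holds bays 3-5, and so on."""
    return tally(bays, lambda bay: bay // BAYS_PER_COLUMN)


def by_row(bays):
    """Row 0 is the bottom row (bays 0, 3, 6, 9); row 2 is the top row."""
    return tally(bays, lambda bay: bay % BAYS_PER_COLUMN)


if __name__ == "__main__":
    print("per column:", by_column(ERRORED_BAYS))
    print("per row:   ", by_row(ERRORED_BAYS))
```

With the map above this gives 0/2/2/1 errors across the four columns and 1/1/3 across the rows (bottom to top): three of the five sit in the top row, but no single column or row accounts for all of them.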
 

msg7086

I'd say there aren't enough clues for me to figure out what's going on. The only thing I know is that CRC errors usually occur between the controller chip on the HDD and the chip on the HBA / MB. Bad cable, bad plug, bad contacts on the backplane, etc.: there are many possible causes. From the further information you provided above, I can't come to a conclusion. But the bottom line is that I don't think it's an issue with the drive itself.

As for the SMART question, it's probably just a routine answer from a level 1 customer service rep who probably has little knowledge about HDDs.

Also check my other post https://forums.servethehome.com/index.php?threads/wd100emaz-strange-noise.23031/ about a defective drive that looks almost completely normal. Burn-in tests don't necessarily tell the truth.
 

Rain

Where you have an integrated 2208 (flashed to a 9207-8i) connected to the same backplane that all drives are on, how can you attribute this to a bad cable? Especially considering other drives had been connected to that same SAS cable & backplane.
Based on this, your 826 HDD backplane has an expander, correct? Specifically, I'm assuming the backplane model number is BPN-SAS2-826EL1 or BPN-SAS2-826EL2?

I would have assumed the SAS expander would "NAK" (spec manual, courtesy Seagate, page 76) corrupt frames instead of sending them to the drive, but according to these slides from HP and the SCSI Trade Association:
Note that frame CRC errors are not included -- Expanders do not examine frame contents (except for SMP frames addressed to themselves)
(Slide/Page 36)

So, based on that, it's entirely possible the 8087 cable to the backplane could have caused CRC issues on multiple drives.

Note that those slides are from 2003; it's possible that modern expanders do check frames for CRC errors before forwarding them and all of the above is nonsense. Regardless, I'd swap the cable if the CRC error count continues to rise. If even modern expanders don't perform CRC checks on all frames, I most certainly would have lost a bet!
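
As a toy illustration of the point in the slides: if a hop in the middle forwards frames without examining them, a frame damaged on the cable is only caught when the receiving end checks the CRC, so the error gets counted at the drive. Here is a minimal sketch in Python with some loud assumptions: zlib.crc32 is only a stand-in for the real SAS/SATA frame CRC, the frame layout (payload followed by a 4-byte CRC) is simplified, and forward_unchecked is a hypothetical pass-through rather than a model of any particular expander.

```python
import zlib


def make_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload (simplified frame layout)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def frame_is_valid(frame: bytes) -> bool:
    """Recompute the CRC over the payload and compare with the trailer."""
    payload, received = frame[:-4], int.from_bytes(frame[-4:], "big")
    return zlib.crc32(payload) == received


def flip_bit(frame: bytes, bit_index: int) -> bytes:
    """Simulate a single-bit error picked up on a marginal cable."""
    corrupted = bytearray(frame)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def forward_unchecked(frame: bytes) -> bytes:
    """Hypothetical middle hop that passes frames on without looking inside."""
    return frame


if __name__ == "__main__":
    frame = make_frame(b"READ FPDMA QUEUED payload")
    damaged = flip_bit(frame, 10)             # corruption before the middle hop
    delivered = forward_unchecked(damaged)    # the hop does not notice
    print("clean frame valid:  ", frame_is_valid(forward_unchecked(frame)))
    print("damaged frame valid:", frame_is_valid(delivered))
```

In this toy model the damage happens upstream of the pass-through hop, yet the only place it shows up is the CRC check at the receiving end, which is the scenario described above for a single cable affecting several drives.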
 

svtkobra7

Based on this, your 826 HDD backplane has an expander, correct? Specifically, I'm assuming the backplane model number is BPN-SAS2-826EL1 or BPN-SAS2-826EL2?
  • Correct - BPN-SAS2-826EL1 (sorry for the delayed reply to your question).
I would have assumed the SAS expander would "NAK" (spec manual, courtesy Seagate, page 76) corrupt frames instead of sending them to the drive, but according to these slides from HP and the SCSI Trade Association:
(Slide/Page 36)

So, based on that, it's entirely possible the 8087 cable to the backplane could have caused CRC issues on multiple drives.

Note that those slides are from 2003; it's possible that modern expanders do check frames for CRC errors before forwarding them and all of the above is nonsense. Regardless, I'd swap the cable if the CRC error count continues to rise. If even modern expanders don't perform CRC checks on all frames, I most certainly would have lost a bet!
  • Quite interesting, thank you very much for the informative and thoughtful reply.
  • I will keep an eye on it. I'd say I should be more diligent about running SMART tests and reviewing the output, but I feel as though I already am:
    • Periodic SMART Tests scheduled in FreeNAS: Long = 8th and 22nd of month @ 4 AM + Short = 5th, 12th, 19th, 26th @ 3 AM
    • Further, I do review the output of a daily cronjob which summarizes zpool status and SMART status (and frankly it is one of the few emails I receive every day that I actually look forward to). ;)
    • (any enhancements you care to offer here are certainly welcomed)
  • One bit I might add regarding the Self-test logs that I posted in code tags here:
    • Upon looking into the script that produces the daily SMART status, it is only presenting the last entry for short & extended offline tests (see spoiler 1 below) and appending that info below the dashboard.
    • When I ran smartctl -a I confirmed that Self-tests are indeed being executed as intended (see spoiler 2 below), in line with the periodic testing regime mentioned above, and there are 16 log entries not 2 as suggested.
    • Note to self = Never copy/paste output from an email to a forum, in lieu of running the cmd myself! :oops:
  • Since my last post on the topic, an Extended offline test (which takes forever on a 10 TB drive) executed and UltraDMA CRC Error count hasn't increased.
  • Now if I could just figure out how to update the drivedb ("Device is: Not in smartctl database"), not that it matters (the drive isn't in the database anyway, so I just submitted a ticket - not sure how actively drivedb requests are handled, judging by the number of outstanding drivedb tickets).
@ forum generally, my apologies for taking this thread off topic, but with a solid sample set of these drives (12) running nearly 24/7, had I actually found an issue, I figure this may have been the best place to call that out

@msg7086 @Rain @Rand__ : My sincere thanks for your kind input / replies. Greatly appreciated! :)
Code:
Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Extended offline    Completed without error       00%       822         -
Short offline       Completed without error       00%      1200         -
Code:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      1232         -
# 2  Short offline       Completed without error       00%      1200         -
!!! ERROR @ HOUR 1150 HERE !!!
# 3  Short offline       Completed without error       00%      1137         -
# 4  Short offline       Completed without error       00%       899         -
# 5  Extended offline    Completed without error       00%       822         -
# 6  Short offline       Completed without error       00%       732         -
# 7  Short offline       Completed without error       00%       707         -
# 8  Short offline       Completed without error       00%       605         -
# 9  Short offline       Completed without error       00%       564         -
#10  Short offline       Completed without error       00%       396         -
#11  Short offline       Completed without error       00%       357         -
#12  Extended offline    Completed without error       00%       178         -
#13  Short offline       Completed without error       00%       160         -
#14  Extended offline    Completed without error       00%        18         -
#15  Short offline       Completed without error       00%         0         -
#16  Short offline       Completed without error       00%         0         -
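
The gap between the 2-line summary and the 16-entry log is easy to reproduce: anything that keeps only the newest entry per test type will always print two lines. Below is a minimal sketch of that reduction in Python, plus a lookup of which tests bracket the error at hour 1150. The regex assumes the self-test log layout shown above, the sample is the first few entries from that log, and all names are made up for illustration (this is not the actual script that produces the daily email).

```python
import re

# Matches entries such as:
# "# 1  Extended offline    Completed without error       00%      1232         -"
ENTRY = re.compile(
    r"^#\s*(\d+)\s+(Short offline|Extended offline)\s+(.*?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$"
)

SAMPLE_LOG = """\
# 1  Extended offline    Completed without error       00%      1232         -
# 2  Short offline       Completed without error       00%      1200         -
!!! ERROR @ HOUR 1150 HERE !!!
# 3  Short offline       Completed without error       00%      1137         -
# 4  Short offline       Completed without error       00%       899         -
# 5  Extended offline    Completed without error       00%       822         -
"""


def parse_log(text):
    """Return (test_type, status, lifetime_hours) per entry; other lines are skipped."""
    entries = []
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match:
            entries.append((match.group(2), match.group(3), int(match.group(5))))
    return entries


def latest_per_type(entries):
    """Keep only the entry with the highest lifetime hours for each test type."""
    latest = {}
    for test_type, status, hours in entries:
        if test_type not in latest or hours > latest[test_type][1]:
            latest[test_type] = (status, hours)
    return latest


def bracketing_tests(entries, error_hour):
    """Hours of the last test at or before, and the first test after, error_hour."""
    before = [hours for _, _, hours in entries if hours <= error_hour]
    after = [hours for _, _, hours in entries if hours > error_hour]
    return (max(before) if before else None, min(after) if after else None)


if __name__ == "__main__":
    entries = parse_log(SAMPLE_LOG)
    print(latest_per_type(entries))
    print(bracketing_tests(entries, 1150))
```

On the sample, the error at hour 1150 falls between the short tests at hours 1137 and 1200, both of which completed without error.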
Code:
root@FreeNAS-01[~]# update-smart-drivedb -v
Download branches/RELEASE_6_6_DRIVEDB/drivedb.h with fetch
fetch --no-redirect -o /usr/local/share/smartmontools/drivedb.h.new https://svn.code.sf.net/p/smartmontools/code/branches/RELEASE_6_6_DRIVEDB/smartmontools/drivedb.h
/usr/local/share/smartmontools/drivedb.h.new  100% of  186 kB 1513 kBps 00m00s
Download branches/RELEASE_6_6_DRIVEDB/drivedb.h.raw.asc with fetch
fetch --no-redirect -o /usr/local/share/smartmontools/drivedb.h.new.raw.asc https://svn.code.sf.net/p/smartmontools/code/branches/RELEASE_6_6_DRIVEDB/smartmontools/drivedb.h.raw.asc
/usr/local/share/smartmontools/drivedb.h.new.r100% of  455  B 3339 kBps 00m00s
gpg: Warning: using insecure memory!
gpg: can't connect to the agent: IPC connect call failed
[/SPOILER]
 

Rand__

Well-Known Member
Mar 6, 2014
6,626
1,767
113
Your google-fu is weak today... ;)

search term 'smartctl updatedb' ->
FAQ – smartmontools ->
Download – smartmontools

... quick test on FN...
Code:
root@freenas:~ # which update-smart-drivedb
/usr/local/sbin/update-smart-drivedb
root@freenas:~ # /usr/local/sbin/update-smart-drivedb
/usr/local/sbin/update-smart-drivedb: gpg: not found ('--no-verify' to ignore)
root@freenas:~ # /usr/local/sbin/update-smart-drivedb --no-verify
/usr/local/share/smartmontools/drivedb.h updated from branches/RELEASE_6_6_DRIVEDB (NOT VERIFIED)
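
A small sketch of the same decision, for anyone scripting it: add `--no-verify` only when `gpg` is missing, which is the case shown above. The function name `drivedb_update_cmd` is hypothetical, the default path is just the one `which` returned here, and the function only prints the command line rather than running it. Note that `--no-verify` skips the signature check, so the downloaded drivedb.h is taken on trust.

```shell
#!/bin/sh
# drivedb_update_cmd [PATH_TO_TOOL]  (hypothetical helper)
# Prints the update-smart-drivedb command line to use on this box:
# plain when gpg is on PATH, with --no-verify when it is not.
drivedb_update_cmd() {
    tool=${1:-/usr/local/sbin/update-smart-drivedb}
    if command -v gpg >/dev/null 2>&1; then
        printf '%s\n' "$tool"               # gpg found: let the script verify
    else
        printf '%s --no-verify\n' "$tool"   # gpg missing: unverified download
    fi
}

# Example: show what would be run, then run it as root if it looks right.
#   drivedb_update_cmd
```

This does not cover the other failure in this thread, where gpg is present but cannot reach its agent.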
 

svtkobra7

Your google-fu is weak today... ;)
  • What is this Google ... I've been using this gem => AOL Search
root@freenas:~ # /usr/local/sbin/update-smart-drivedb --no-verify
  • So it worked, nice! And thanks! But I assume no accountability here =>
  • IMO: If a storage OS were to keep any utility up to date for you, well, I don't know, it might be smartctl?!?
Anyway, Mr. Smartctl: How about you create a VIB for smartmontools v7 and send it my way (like this)? It's always been super annoying how little info ESXi presents [1], and the fact that that port won't work with Optane (but I'm guessing now I can just update the drive db and it will? ... will try tonight).

[1]
Code:
esxcli storage core device smart get -d [device name]
 

Rand__

Nice, never checked for a smartctl VIB. I'd assume it will have a header file that you can simply replace, if need be, with one from a similar-version installation on a Linux box, should the update not work on ESXi.
 

svtkobra7

Nice, never checked for a smartctl VIB. I'd assume it will have a header file that you can simply replace, if need be, with one from a similar-version installation on a Linux box, should the update not work on ESXi.
  • Not so easy to update, as there is no drivedb.h to update (I speculate you knew that, based on your reply). :(
  • Beyond that, the dated port (smartctl 6.6 2016-05-10) doesn't support NVMe, so even if the drivedb could be updated, I believe it lacks the functionality to be of any use (to me, i.e. the only drives local to ESXi = Optane / but I've used it before with an Intel S3500, so I know it works) [see spoilers]
    • And considering isdct doesn't work with "enthusiast" Optane, i.e. 900p, I don't know of any way to achieve line of sight to drive info. <= annoying
  • You give me way too much credit if you think I could ever update some header when I can't even update a normal smartmontools install, let alone a ported VIB. ;)
Code:
[root@ESXi-01:/opt/smartmontools] esxcli storage core device smart get -d t10.NVMe____INTEL_blah_blah
Parameter                     Value  Threshold  Worst
----------------------------  -----  ---------  -----
Health Status                 OK     N/A        N/A
Media Wearout Indicator       N/A    N/A        N/A
Write Error Count             N/A    N/A        N/A
Read Error Count              N/A    N/A        N/A
Power-on Hours                2457   N/A        N/A
Power Cycle Count             23     N/A        N/A
Reallocated Sector Count      0      100        N/A
Raw Read Error Rate           N/A    N/A        N/A
Drive Temperature             35     80         N/A
Driver Rated Max Temperature  N/A    N/A        N/A
Write Sectors TOT Count       N/A    N/A        N/A
Read Sectors TOT Count        N/A    N/A        N/A
Initial Bad Block Count       N/A    N/A        N/A
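
To make the "how little info ESXi presents" point concrete, here is a rough filter (a sketch, not an ESXi tool) that drops every row whose Value is N/A and prints the rest as `Parameter=Value`. The name `smart_reported` is made up, and it assumes the four-column layout shown above, with Value, Threshold and Worst each being a single token.

```shell
#!/bin/sh
# smart_reported  (hypothetical helper)
# Reads `esxcli storage core device smart get` output on stdin and prints
# "Parameter=Value" for each row that reports a real value.
smart_reported() {
    awk '
        /^Parameter[ \t]/ || /^--/ { next }   # header row and dashed rule
        NF < 4 { next }                       # blank or malformed lines
        {
            value = $(NF - 2)                 # last three columns: Value, Threshold, Worst
            if (value == "N/A") next
            name = $1                         # parameter name may span several words
            for (i = 2; i <= NF - 3; i++) name = name " " $i
            print name "=" value
        }'
}

# Example (device name as used in this post):
#   esxcli storage core device smart get -d t10.NVMe____INTEL_blah_blah | smart_reported
```

For the table above, only Health Status, Power-on Hours, Power Cycle Count, Reallocated Sector Count and Drive Temperature survive the filter.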

Code:
[root@ESXi-01:/opt/smartmontools] ./smartctl -d nvme --all /dev/disks/t10.NVMe____INTEL_blah_blah
smartctl 6.6 2016-05-10 r4321 [x86_64-linux-6.7.0] (daily-20160510)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

Read NVMe Identify Controller failed: NVME_IOCTL_ADMIN_CMD: Function not implemented