How many corrected errors is too many?

nabsltd · Jul 26, 2022

I bought 6x HUS726060AL5215 (6TB SAS3) and ran badblocks on them. Five of them had runs that looked very similar to this:

Code:

# time badblocks -svw -b 4096 -c 131072 /dev/da1
Checking for bad blocks in read-write mode
From block 0 to 1465130645
Pass completed, 0 bad blocks found. (0/0/0 errors)

real    4545m36.927s
user    60m19.007s
sys     20m51.058s

The sixth had the following:

Code:

# time badblocks -svw -b 4096 -c 131072 /dev/da0
Checking for bad blocks in read-write mode
From block 0 to 1465130645
Pass completed, 0 bad blocks found. (0/0/0 errors)

real    7307m25.928s
user    54m20.283s
sys     20m34.253s
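
The average throughput implied by each run can be estimated from the block range and the wall-clock time. A minimal Python sketch, assuming badblocks -w used its default four test patterns, each written across the whole device and then read back (eight full passes in total); the function name is only illustrative:

```python
def avg_throughput_mb_s(last_block, block_size, real_minutes, passes=8):
    """Average MB/s (10^6 bytes per second) over a whole badblocks -w run.

    passes=8 is an assumption: four default patterns, each written to
    the whole device and then read back for comparison.
    """
    total_bytes = (last_block + 1) * block_size * passes
    return total_bytes / (real_minutes * 60) / 1e6


# "real" times from the two runs above, converted to minutes
fast_minutes = 4545 + 36.927 / 60
slow_minutes = 7307 + 25.928 / 60

fast = avg_throughput_mb_s(1465130645, 4096, fast_minutes)
slow = avg_throughput_mb_s(1465130645, 4096, slow_minutes)

print(f"fast: {fast:.1f} MB/s, slow: {slow:.1f} MB/s")
print(f"runtime ratio: {slow_minutes / fast_minutes:.2f}")
```

If the pass count assumption is wrong, the absolute numbers scale with it, but the ratio between the two disks does not change.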

You'll notice the running time is 160% as long, dropping the overall average read/write from about 175MB/sec down to about 110MB/sec. The culprit seems to be corrected errors. Fast disks:

Code:

Accumulated power on time, hours:minutes 32734:58
Accumulated start-stop cycles:  30
Accumulated load-unload cycles:  1220
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 53830322436964352

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0      853         0       853   22857415    2228793.886           0
write:         0       90         0        90    3996424     580876.925           0
verify:        0        0         0         0     365101          2.214           0

Slow disk:

Code:

Accumulated power on time, hours:minutes 32905:16
Accumulated start-stop cycles:  34
Accumulated load-unload cycles:  1423
Elements in grown defect list: 83

Vendor (Seagate Cache) information
  Blocks sent to initiator = 8059375755198464

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0   866516         0    866516   17891861    1342161.269           0
write:         0       69         0        69    5686799      84367.620           0
verify:        0        0         0         0     717638          1.713           0

Despite having about the same PoH, the slow disk has read and written far less data, and has 1000x as many corrected errors.
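
To put the two logs on a comparable footing, the corrected read errors can be normalized by the amount of data read. A rough Python sketch using the read rows above (the helper name is illustrative, and treating "Gigabytes processed" as 10^9 bytes follows the column header):

```python
def corrected_per_tb(total_corrected, gigabytes_processed):
    """Corrected errors per terabyte (10^12 bytes) processed."""
    return total_corrected / (gigabytes_processed / 1000)


# read rows from the two error counter logs above
fast_rate = corrected_per_tb(853, 2228793.886)
slow_rate = corrected_per_tb(866516, 1342161.269)

raw_ratio = 866516 / 853
per_tb_ratio = slow_rate / fast_rate

print(f"fast: {fast_rate:.3f} corrected errors per TB read")
print(f"slow: {slow_rate:.1f} corrected errors per TB read")
print(f"raw count ratio: {raw_ratio:.0f}x, per-TB ratio: {per_tb_ratio:.0f}x")
```

Because the slow disk has read less data, the per-TB gap comes out wider than the raw count ratio.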

So, dead disk walking?

Mithril · Jul 26, 2022

Does seem suspect to me. It doesn't claim to be doing full re-reads for those ECC errors, at least. Maybe put it through badblocks again and see if it behaves the same? Might be worth contacting the seller with the same finding and seeing what they say; start the convo as open ended: "hey by the way, one of the disks behaved like this".

RolloZ170 · Jul 26, 2022

make sure it is not the cable first.

Mithril · Jul 26, 2022

RolloZ170 said:
make sure it is not the cable first.

A very good point! I spent ages chasing down odd drive/ZFS behavior years ago that turned out to be dodgy SATA cables.

nabsltd · Jul 26, 2022

RolloZ170 said:
make sure it is not the cable first.

All the drives are connected to an SAS3-EL1 backplane, and the problem moves with the drive. So, it's the drive.

Now, if you have any ideas about a hardware issue with the drive's connector that I might be able to see, I'll check.

Mithril said:
Might be worth contacting the seller with the same finding and seeing what they say; start the convo as open ended: "hey by the way, one of the disks behaved like this".

I'm running a long SMART test now to see if anything else appears. If there is anything wonky from that, I'll definitely just start a return. But, I think you are right, and I'll still see what they can do.

RolloZ170 · Jul 26, 2022

the errors are only in reads.
maybe this is raw data that is reported differently between FW revisions?
check if the FW is the same on all drives.

Mithril · Jul 26, 2022

The drive's slower speed suggests that *something* is going on.

RolloZ170 · Jul 26, 2022

Elements in grown defect list: 83
oooops.

Stephan · Jul 26, 2022

Dead disk walking indeed; do not put it into production. Once grown defects grow beyond a handful, it's downhill.
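
A check like this is easy to script when screening a batch of used drives. A minimal Python sketch that pulls the grown defect count out of log text in the format posted earlier in the thread; the function names and the threshold of 5 standing in for "a handful" are assumptions, not a vendor figure:

```python
import re


def grown_defects(smart_text):
    """Return the grown defect count found in the text, or None if absent."""
    match = re.search(r"Elements in grown defect list:\s*(\d+)", smart_text)
    return int(match.group(1)) if match else None


def looks_suspect(smart_text, threshold=5):
    """True if the grown defect count exceeds the (assumed) threshold."""
    count = grown_defects(smart_text)
    return count is not None and count > threshold


# the relevant lines from the fast and slow disks in this thread
fast_disk = "Elements in grown defect list: 0"
slow_disk = "Elements in grown defect list: 83"

for name, text in (("fast", fast_disk), ("slow", slow_disk)):
    verdict = "suspect" if looks_suspect(text) else "ok"
    print(name, grown_defects(text), verdict)
```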

Mithril · Jul 26, 2022

Oh, I totally missed the defect count, yeeeeeeah. That's a drive you give to someone you want revenge on.
