Is this SAS drive really failing?

eduncan911 · Apr 30, 2022

So, Total uncorrected errors = 0. Meaning, it was able to correct all errors?

The smartctl long test hit the same READ error twice, which halted the test both times:

Bash:

# smartctl -l selftest /dev/sdb
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.4.174-2-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Failed in segment -->       7   52547        7071673278 [0x3 0x5d 0x1]
# 2  Background long   Failed in segment -->       7   52540        7071673278 [0x3 0x5d 0x1]

Long (extended) Self-test duration: 37452 seconds [624.2 minutes]

And I have this showing up on one of the SAS drive log pages after thrashing the disk with fio random rw, write, read tests over several hours:

Bash:

# sg_logs -a /dev/sdb > sg_logs.sdb
(...snip...)

Read error counter page  [0x3]
  Errors corrected without substantial delay = 127464
  Errors corrected with possible delays = 17
  Total rewrites or rereads = 0
  Total errors corrected = 127481
  Total times correction algorithm processed = 1213208
  Total bytes processed = 6714450675400
  Total uncorrected errors = 0

Just wondering, since that "Total uncorrected errors" is zero... Does that mean, it's still viable?
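
For what it's worth, the counters on that page are at least internally consistent: the two "corrected" rows add up to the reported total. A minimal sketch of that check, with the values pasted in from the page above; the `counter` helper is just an illustrative name, not part of sg3_utils:

```shell
#!/bin/sh
# Read error counter values as posted above (sg_logs page 0x3).
page='Read error counter page  [0x3]
  Errors corrected without substantial delay = 127464
  Errors corrected with possible delays = 17
  Total rewrites or rereads = 0
  Total errors corrected = 127481
  Total times correction algorithm processed = 1213208
  Total bytes processed = 6714450675400
  Total uncorrected errors = 0'

# counter NAME: print the value that follows "NAME = " (hypothetical helper).
counter() {
  printf '%s\n' "$page" | awk -F' = ' -v name="$1" '
    { key = $1; sub(/^[ \t]+/, "", key) }
    key == name { print $2 }'
}

fast=$(counter 'Errors corrected without substantial delay')
delayed=$(counter 'Errors corrected with possible delays')
total=$(counter 'Total errors corrected')
uncorrected=$(counter 'Total uncorrected errors')

echo "fast + delayed = $((fast + delayed)) (reported total: $total)"
echo "uncorrected = $uncorrected"
```

So every error tallied on this page was corrected one way or the other; the page by itself says nothing about the sector the self-test tripped over.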

itronin · Apr 30, 2022

Smart halting like that is abnormal in my experience.
If I saw that in a production environment I'd proactively replace the disk if this was the only disk on the system exhibiting this behavior.
If multiple disks were exhibiting the behavior I'd double check my backup status and then see if there was maybe an OS or driver issue.
I'm just as concerned about the "Errors corrected with possible delays" count.
Is that number incrementing?

There's a lot of (to me) scary math in how drives store data and correct errors, so some error correction is to be expected as normal. To me, zero uncorrected errors says I can trust the data that is on the drive as of that check.

What kind of HBA are you using, and what OS/software? There was a 3008 driver or firmware bug that showed up with FreeNAS 11.3 a while back, but I'm not sure that is applicable here.

eduncan911 · May 1, 2022

itronin said:
Smart halting like that is abnormal in my experience.

It gets down to 90% and then the test halts; afterwards the status shows no scans in progress or queued. It did this twice.

itronin said:
If I saw that in a production environment I'd proactively replace the disk if this was the only disk on the system exhibiting this behavior.
If multiple disks were exhibiting the behavior I'd double check my backup status and then see if there was maybe an OS or driver issue.

I do have 2 spares. It's just, I bought these a year ago and just now getting around to install them and... frack.

itronin said:
I'm just as concerned about the "Errors corrected with possible delays" count.
Is that number incrementing?

So I kicked off another randrw for a few hours to see...

Code:

# before randrw for 2 hours
Read error counter page  [0x3]
  Errors corrected without substantial delay = 127464
  Errors corrected with possible delays = 17
  Total rewrites or rereads = 0
  Total errors corrected = 127481
  Total times correction algorithm processed = 1213208
  Total bytes processed = 6714450675400
  Total uncorrected errors = 0

# after
Read error counter page  [0x3]
  Errors corrected without substantial delay = 127779
  Errors corrected with possible delays = 17
  Total rewrites or rereads = 0
  Total errors corrected = 127796
  Total times correction algorithm processed = 1213398
  Total bytes processed = 6714734595400
  Total uncorrected errors = 0

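So yeah, it's increasing: "Errors corrected without substantial delay" went from 127464 to 127779, although "Errors corrected with possible delays" stayed at 17.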
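
To put numbers on the change between the two snapshots, here is a small sketch that just subtracts the values copied from the pages above (the variable names are mine):

```shell
#!/bin/sh
# Counter values copied from the before/after pages above.
fast_before=127464
fast_after=127779
delayed_before=17
delayed_after=17
bytes_before=6714450675400
bytes_after=6714734595400

fast_delta=$((fast_after - fast_before))
delayed_delta=$((delayed_after - delayed_before))
bytes_delta=$((bytes_after - bytes_before))

echo "corrected without substantial delay: +$fast_delta"
echo "corrected with possible delays:      +$delayed_delta"
echo "bytes processed:                     +$bytes_delta"
```

That works out to 315 more fast corrections and no new delayed ones over 283920000 bytes read.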

itronin said:
What kind of HBA are you using, and what OS/software? There was a 3008 driver or firmware bug that showed up with FreeNAS 11.3 a while back, but I'm not sure that is applicable here.

Supermicro HBA which is an LSI SAS 2308 under the covers (BIOS says on boot that it's already in IT mode). Proxmox VE is the OS (Debian under the hood).

eduncan911 · May 1, 2022

I also found another log entry:

Code:

Background scan results page  [0x15]
  Status parameters:
    Accumulated power on minutes: 3159404 [h:m  52656:44]
    Status: background scan enabled, none active (waiting for BMS interval timer to expire)
    Number of background scans performed: 324
    Background medium scan progress: 0.00 %
    Number of background medium scans performed: 324
  Medium scan parameter # 1 [0x1]
    Power on minutes when error detected: 3107100 [51785:0]
    Reassignment pending receipt of Reassign or Write command
    sense key: Medium Error  [sk,asc,ascq: 0x3,0x11,0x0]
      Additional sense: Unrecovered read error
    LBA (associated with medium error): 0x00000001a5812bbe

The fact that it says Medium Error tells me it's physical. Yeah, I'll scrap the drive...
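
A quick sanity check: the background scan page reports the LBA in hex while the self-test log reports it in decimal. A minimal sketch that converts one to the other, using only the two values posted above:

```shell
#!/bin/sh
scan_lba_hex=0x00000001a5812bbe   # from the background scan results page
selftest_lba=7071673278           # LBA_first_err from the self-test log

# Shell arithmetic accepts 0x-prefixed hex constants.
scan_lba=$(( $scan_lba_hex ))
echo "background scan LBA: $scan_lba"

if [ "$scan_lba" -eq "$selftest_lba" ]; then
  echo "same sector as the self-test failure"
else
  echo "different sector"
fi
```

Both logs point at the same LBA, so this looks like one bad sector seen by two different mechanisms rather than two separate defects.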

UhClem · May 1, 2022

eduncan911 said:
... Yeah, I'll scrap the drive...

Might be a bit drastic.
After 6+ years of active life (and 300+ error-free scans), the drive has its first bad sector. [If you were in charge of HHS, would you replace all hospitals with crematoriums?]

eduncan911 said:
So, Total uncorrected errors = 0. Meaning, it was able to correct all errors?

Background vs Foreground
You're seeing the results you've posted because the bad sector was discovered/encountered in a background scan (vs host-initiated command).

If[**] you were to do:

Code:

dd if=/dev/sdb of=/dev/null bs=512 count=1 skip=7071673278

you would (almost certainly) get an I/O error, and a dmesg | tail -33 would show the kernel reaction (to this Foreground action). And Total _uncorrected would be *1*.

[**] But, before doing that ...

eduncan911 said:
I bought these a year ago and just now getting around to install them ...

Since it sounds like you might not have any (non-expendable) data on that drive, I'd prefer that you instead do:

Code:

dd if=/dev/zero of=/dev/sdb bs=512 count=1 seek=7071673278

Rationale:
You've definitely got a bad sector, and no matter what, before making ANY use of the drive, you want that sector spared (reallocated) and added to the grown defect list. That is accomplished with the 2nd dd command.

By NOT doing the first dd command, I'm thinking we can avoid tallying up an Uncorrected_error (just for geek-grins).

Are you game?

eduncan911 · May 1, 2022

UhClem said:
Might be a bit drastic.
After 6+ years of active life (and 300+ error-free scans), the drive has its first bad sector. [If you were in charge of HHS, would you replace all hospitals with crematoriums?]

Background vs Foreground
You're seeing the results you've posted because the bad sector was discovered/encountered in a background scan (vs host-initiated command).

If[**] you were to do:

Code:

dd if=/dev/sdb of=/dev/null bs=512 count=1 skip=7071673278

you would (almost certainly) get an I/O error, and a dmesg | tail -33 would show the kernel reaction (to this Foreground action). And Total _uncorrected would be *1*.

[**] But, before doing that ...

Since it sounds like you might not have any (non-expendable) data on that drive, I'd prefer that you instead do:

Code:

dd if=/dev/zero of=/dev/sdb bs=512 count=1 seek=7071673278

Rationale:
You've definitely got a bad sector, and no matter what, before making ANY use of the drive, you want that sector spared (reallocated) and added to the grown defect list. That is accomplished with the 2nd dd command.

By NOT doing the first dd command, I'm thinking we can avoid tallying up an Uncorrected_error (just for geek-grins).

Are you game?

Let's do it!

These are used 4TB SAS drives I got from eBay a while back. Never been in a pool or used on my side, so we can do anything destructive/fixes you want.

Not sure which commands you want me to run first after that post; but, feel free to give instructions. Machine is just sitting idle this weekend as I do housework.

UhClem · May 1, 2022

eduncan911 said:
Let's do it!

Do:

Code:

dmesg | tail -3
smartctl -a /dev/sdb > sdb_before
dd if=/dev/zero of=/dev/sdb bs=512 count=1 seek=7071673278
sync
smartctl -a /dev/sdb > sdb_after
diff sdb_before sdb_after
echo ===
dmesg | tail -3

and copy/paste the results. tnx

eduncan911 · May 1, 2022

Fantastic!! You nailed it, spot on! And I learned how to map sectors to dd reads/writes!

Code:

# dmesg | tail -3
[   21.422176] vmbr0: port 1(enp9s0f0) entered blocking state
[   21.422192] vmbr0: port 1(enp9s0f0) entered forwarding state
[   21.422573] IPv6: ADDRCONF(NETDEV_CHANGE): vmbr0: link becomes ready

Code:

# dd if=/dev/zero of=/dev/sdb bs=512 count=1 seek=7071673278
dd: error writing '/dev/sdb': Input/output error
1+0 records in
0+0 records out
0 bytes copied, 1.83909 s, 0.0 kB/s

Code:

# dmesg | tail -3
[ 9321.129915] sd 1:0:1:0: [sdb] tag#9408 Add. Sense: Unrecovered read error
[ 9321.129942] sd 1:0:1:0: [sdb] tag#9408 CDB: Read(16) 88 20 00 00 00 01 a5 81 2b b8 00 00 00 08 00 00
[ 9321.129971] blk_update_request: critical medium error, dev sdb, sector 7071673278 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
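
Note that the kernel logged a READ even though dd was writing. Decoding the logged CDB helps show what was actually requested. This is a minimal sketch that assumes the standard READ(16) layout (opcode, flags, 8-byte LBA, 4-byte transfer length, group, control) and uses only the bytes from the dmesg line above:

```shell
#!/bin/sh
# CDB bytes exactly as logged by the kernel above.
cdb='88 20 00 00 00 01 a5 81 2b b8 00 00 00 08 00 00'

# READ(16): byte 0 opcode (0x88), byte 1 flags, bytes 2-9 LBA,
# bytes 10-13 transfer length in logical blocks, bytes 14-15 group/control.
set -- $cdb
opcode=$1
shift 2                         # drop opcode and flags
lba_hex="$1$2$3$4$5$6$7$8"
shift 8                         # drop the LBA bytes
len_hex="$1$2$3$4"

first=$((0x$lba_hex))
count=$((0x$len_hex))
last=$((first + count - 1))

echo "opcode 0x$opcode: read $count blocks, LBA $first..$last"
```

The request covers 8 blocks (4 KiB at 512 bytes per block), LBA 7071673272 through 7071673279, which includes the bad sector 7071673278. My guess, and it is only an assumption the log does not confirm, is that the buffered 512-byte write made the kernel read the surrounding 4 KiB first, and that read is what failed.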

And time for some diffs...

Code:

# diff -u sdb_before sdb_after
--- sdb_before  2022-05-02 01:48:21.117924114 -0400
+++ sdb_after   2022-05-02 01:54:37.714936224 -0400

(snip)

@@ -43,9 +43,9 @@
            Errors Corrected by           Total   Correction     Gigabytes    Total
                ECC          rereads/    errors   algorithm      processed    uncorrected
            fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
-read:     127779       17         0    127796    1213400       6714.744           0
+read:     127779       17         0    127796    1213547       6714.747           1
write:         0        0         0         0    1268483      99150.319           0
-verify:        0        0         0         0     144469          0.000           0
+verify:        0        0         0         0     144471          0.000           0

Non-medium error count:        0

There it is, total uncorrected errors.
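
If you want to pull that one number out without eyeballing the diff, here is a small sketch. It assumes the uncorrected count is the last column of the "read:" row, as in the smartctl table above; the `last_field` helper is just an illustrative name:

```shell
#!/bin/sh
# The "read:" rows from sdb_before and sdb_after, as shown in the diff above.
before='read:     127779       17         0    127796    1213400       6714.744           0'
after='read:     127779       17         0    127796    1213547       6714.747           1'

# last_field LINE: print the final whitespace-separated column (hypothetical helper).
last_field() {
  printf '%s\n' "$1" | awk '{ print $NF }'
}

ub=$(last_field "$before")
ua=$(last_field "$after")
echo "uncorrected read errors: $ub -> $ua (delta $((ua - ub)))"
```

Against the real files, the same awk expression could be fed from something like `grep read: sdb_after`.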

And in the SAS log pages:

Code:

# diff -u sdb_before.sg sdb_after.sg
--- sdb_before.sg 2022-05-02 01:49:15.133508735 -0400
+++ sdb_after.sg  2022-05-02 01:54:52.234817570 -0400
@@ -32,16 +32,16 @@
   Errors corrected with possible delays = 17
   Total rewrites or rereads = 0
   Total errors corrected = 127796
-  Total times correction algorithm processed = 1213400
-  Total bytes processed = 6714744891400
-  Total uncorrected errors = 0
+  Total times correction algorithm processed = 1213547
+  Total bytes processed = 6714747125320
+  Total uncorrected errors = 1

Verify error counter page  [0x5]
   Errors corrected without substantial delay = 0
   Errors corrected with possible delays = 0
   Total rewrites or rereads = 0
   Total errors corrected = 0
-  Total times correction algorithm processed = 144469
+  Total times correction algorithm processed = 144471
   Total bytes processed = 0
   Total uncorrected errors = 0

(snip)

UhClem · May 2, 2022

We're only half-way thru the metamorphosis.
Let's finish it and try to learn something extra in the process.

Pls copy/paste the following into a file named "file":

Code:

smartctl -a /dev/sdb | grep -e read: -e write:
dd if=/dev/sdb of=/dev/null bs=512 count=1 skip=7071673278 iflag=direct 2> /dev/null
echo ===
smartctl -a /dev/sdb | grep -e read: -e write:
dd if=/dev/zero of=/dev/sdb bs=512 count=1 seek=7071673278 oflag=direct 2> /dev/null
echo ===
smartctl -a /dev/sdb | grep -e read: -e write:

Then run tr -d "\r" < file > file2, chmod 755 file2, and ./file2 > file2.out
Pls copy/paste contents of file2.out into your reply. Thanks
[An explanation will follow.]
