Is this SAS drive really failing?

eduncan911

The New James Dean
Jul 27, 2015
648
506
93
eduncan911.com
So, Total uncorrected errors = 0. Meaning, it was able to correct all errors?

The smartctl long test hit the same read error twice, and each time the test halted:

Bash:
# smartctl -l selftest /dev/sdb
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.4.174-2-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Failed in segment -->       7   52547        7071673278 [0x3 0x5d 0x1]
# 2  Background long   Failed in segment -->       7   52540        7071673278 [0x3 0x5d 0x1]

Long (extended) Self-test duration: 37452 seconds [624.2 minutes]
And I have this showing up on one of the SAS drive log pages after thrashing the disk with fio random rw, write, read tests over several hours:

Bash:
# sg_logs -a /dev/sdb > sg_logs.sdb
(...snip...)

Read error counter page  [0x3]
  Errors corrected without substantial delay = 127464
  Errors corrected with possible delays = 17
  Total rewrites or rereads = 0
  Total errors corrected = 127481
  Total times correction algorithm processed = 1213208
  Total bytes processed = 6714450675400
  Total uncorrected errors = 0
Just wondering: since "Total uncorrected errors" is zero, does that mean the drive is still viable?
 

itronin

Well-Known Member
Nov 24, 2018
1,234
793
113
Denver, Colorado
A SMART test halting like that is abnormal in my experience.
If I saw that in a production environment I'd proactively replace the disk if this was the only disk on the system exhibiting this behavior.
If multiple disks were exhibiting the behavior I'd double check my backup status and then see if there was maybe an OS or driver issue.
I'm just as concerned about the "Errors corrected with possible delays" count.
Is that number incrementing?

There's a lot of (to me) scary math in how drives store data and correct errors, so some error correction is to be expected as normal. To me, zero uncorrected errors says I can trust the data that is on the drive as of that check.

What kind of HBA are you using, and what OS/software? There was a 3008 driver or firmware bug that showed up with FreeNAS 11.3 a while back, but I'm not sure that is applicable here.
 

eduncan911

itronin said: "SMART halting like that is abnormal in my experience."
It gets down to 90% and then the test halts; the status shows no scans in progress or queued. It did this twice.

itronin said: "If I saw that in a production environment I'd proactively replace the disk if this was the only disk on the system exhibiting this behavior. If multiple disks were exhibiting the behavior I'd double check my backup status and then see if there was maybe an OS or driver issue."
I do have 2 spares. It's just that I bought these a year ago and am only now getting around to installing them and... frack.

itronin said: "I'm as concerned about the errors corrected with possible delay count number. Is that number incrementing?"

So I kicked off another randrw for a few hours to see...

Code:
# before randrw for 2 hours
Read error counter page  [0x3]
  Errors corrected without substantial delay = 127464
  Errors corrected with possible delays = 17
  Total rewrites or rereads = 0
  Total errors corrected = 127481
  Total times correction algorithm processed = 1213208
  Total bytes processed = 6714450675400
  Total uncorrected errors = 0

# after
Read error counter page  [0x3]
  Errors corrected without substantial delay = 127779
  Errors corrected with possible delays = 17
  Total rewrites or rereads = 0
  Total errors corrected = 127796
  Total times correction algorithm processed = 1213398
  Total bytes processed = 6714734595400
  Total uncorrected errors = 0
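So yeah, it's increasing: "Errors corrected without substantial delay" went up by 315 (127464 to 127779), although "Errors corrected with possible delays" stayed at 17.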
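For anyone repeating this comparison, the before/after subtraction can be scripted instead of eyeballed. This is only a sketch: counter_deltas is a hypothetical helper name, and it assumes the snapshots are saved sg_logs output in the "name = value" layout shown above, with each log page introduced by a "... page  [0x..]" header line.

```shell
#!/bin/sh
# Print the non-zero counter deltas between two saved sg_logs snapshots.
# Usage: counter_deltas BEFORE_FILE AFTER_FILE   (hypothetical helper name)
counter_deltas() {
    awk -F' = ' '
        / page +\[0x/ { page = $0; next }              # remember which log page we are in
        NF != 2       { next }                          # skip lines that are not "name = value"
        NR == FNR     { before[page, $1] = $2; next }   # first file: store the values
        (page, $1) in before {
            d = $2 - before[page, $1]
            if (d != 0) {
                name = $1
                sub(/^ +/, "", name)                    # trim the leading indent
                printf "%s: %+d\n", name, d
            }
        }
    ' "$1" "$2"
}

# Demo using the read error counter values posted above.
before=$(mktemp) && after=$(mktemp) || exit 1
cat > "$before" <<'EOF'
Read error counter page  [0x3]
  Errors corrected without substantial delay = 127464
  Errors corrected with possible delays = 17
  Total rewrites or rereads = 0
  Total errors corrected = 127481
  Total times correction algorithm processed = 1213208
  Total bytes processed = 6714450675400
  Total uncorrected errors = 0
EOF
cat > "$after" <<'EOF'
Read error counter page  [0x3]
  Errors corrected without substantial delay = 127779
  Errors corrected with possible delays = 17
  Total rewrites or rereads = 0
  Total errors corrected = 127796
  Total times correction algorithm processed = 1213398
  Total bytes processed = 6714734595400
  Total uncorrected errors = 0
EOF
counter_deltas "$before" "$after"
rm -f "$before" "$after"
```

Unchanged counters are omitted, so the delayed-correction and uncorrected counts do not appear in the output for these two snapshots.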

itronin said: "What kind of HBA are you using and what OS/software? There was a 3008 driver or firmware bug that showed up with FreeNAS 11.3 a while back but I'm not sure that is applicable here."
Supermicro HBA which is an LSI SAS 2308 under the covers (BIOS says on boot that it's already in IT mode). Proxmox VE is the OS (Debian under the hood).
 

eduncan911

I also found another log entry:

Code:
Background scan results page  [0x15]
  Status parameters:
    Accumulated power on minutes: 3159404 [h:m  52656:44]
    Status: background scan enabled, none active (waiting for BMS interval timer to expire)
    Number of background scans performed: 324
    Background medium scan progress: 0.00 %
    Number of background medium scans performed: 324
  Medium scan parameter # 1 [0x1]
    Power on minutes when error detected: 3107100 [51785:0]
    Reassignment pending receipt of Reassign or Write command
    sense key: Medium Error  [sk,asc,ascq: 0x3,0x11,0x0]
      Additional sense: Unrecovered read error
    LBA (associated with medium error): 0x00000001a5812bbe
The fact that it says "Medium Error" tells me it's physical. Yeah, I'll scrap the drive...
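As a quick cross-check, the hex LBA on the background scan page can be converted to decimal and compared with the LBA_first_err value from the self-test log. A minimal sketch using plain shell arithmetic (the variable names are only illustrative):

```shell
#!/bin/sh
# LBA from the background scan results page (hex) and from the self-test log (decimal).
bms_lba_hex=0x00000001a5812bbe
selftest_lba=7071673278

# Shell arithmetic accepts 0x-prefixed hex constants.
bms_lba=$(($bms_lba_hex))
echo "background scan LBA (decimal): $bms_lba"

if [ "$bms_lba" -eq "$selftest_lba" ]; then
    echo "same sector as the self-test failure"
else
    echo "different sector"
fi
```

Both logs point at the same LBA, so the evidence so far is for one bad sector rather than two separate ones.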
 

UhClem

just another Bozo on the bus
Jun 26, 2012
433
247
43
NH, USA
eduncan911 said: "... Yeah, I'll scrap the drive..."
Might be a bit:) drastic.
After 6+ years of active life (and 300+ error-free scans), the drive has its first bad sector. [If you were in charge of HHS, would you replace all hospitals with crematoriums?;)]
eduncan911 said: "So, Total uncorrected errors = 0. Meaning, it was able to correct all errors?"
Background vs Foreground
You're seeing the results you've posted because the bad sector was discovered/encountered in a background scan (vs host-initiated command).

If[**] you were to do:
Code:
dd if=/dev/sdb of=/dev/null bs=512 count=1 skip=7071673278
you would (almost certainly) get an I/O error, and a dmesg | tail -33 would show the kernel reaction (to this Foreground action). And "Total uncorrected errors" would be *1*.

[**] But, before doing that ...
eduncan911 said: "I bought these a year ago and just now getting around to install them ..."
Since it sounds like you might not have any (non-expendable) data on that drive, I'd prefer that you instead do:
Code:
dd if=/dev/zero of=/dev/sdb bs=512 count=1 seek=7071673278
Rationale:
You've definitely got a bad sector, and no matter what, before making ANY use of the drive, you want that sector spared (reallocated) and added to the grown defect list. That is accomplished with the 2nd dd command.
Code:
   Reassignment pending receipt of Reassign or Write command
By NOT doing the first dd command, I'm thinking we can avoid tallying up an Uncorrected_error (just for geek-grins).
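Both dd commands rely on bs matching the drive's logical block size, so that skip= and seek= count in LBAs. A rough sketch of that arithmetic, assuming 512-byte logical blocks (the thread does not state this; blockdev --getss /dev/sdb should report it). lba_to_bytes is a hypothetical helper, and the dd commands are only printed for review, not run:

```shell
#!/bin/sh
# Byte offset addressed by "dd bs=BS skip=LBA" (read) or "dd bs=BS seek=LBA" (write).
lba_to_bytes() {
    # $1: LBA, $2: block size in bytes (must equal the drive's logical block size)
    echo $(( $1 * $2 ))
}

lba=7071673278   # LBA_first_err from the self-test log
bs=512           # assumption: 512-byte logical blocks; verify with: blockdev --getss /dev/sdb

echo "byte offset: $(lba_to_bytes "$lba" "$bs")"

# Print (do not run) the two commands so they can be checked before use.
echo "read : dd if=/dev/sdb of=/dev/null bs=$bs count=1 skip=$lba"
echo "write: dd if=/dev/zero of=/dev/sdb bs=$bs count=1 seek=$lba"
```

If the drive reported 4096-byte logical blocks instead, bs would have to change to match, otherwise skip/seek would land on the wrong sector.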

Are you game?
 

eduncan911

Let's do it!

These are used 4TB SAS drives I got from eBay a while back. Never been in a pool or used on my side, so we can do anything destructive/fixes you want.

Not sure which commands you want me to run first after that post; but, feel free to give instructions. Machine is just sitting idle this weekend as I do housework.
 

UhClem

eduncan911 said: "Let's do it!"
Do:
Code:
dmesg | tail -3
smartctl -a /dev/sdb > sdb_before
dd if=/dev/zero of=/dev/sdb bs=512 count=1 seek=7071673278
sync
smartctl -a /dev/sdb > sdb_after
diff sdb_before sdb_after
echo ===
dmesg | tail -3
and copy/paste the results. tnx
 

eduncan911

Fantastic!! You nailed it, spot on! And I learned how to map sector numbers to dd reads/writes! :)

Code:
# dmesg | tail -3                               
[   21.422176] vmbr0: port 1(enp9s0f0) entered blocking state
[   21.422192] vmbr0: port 1(enp9s0f0) entered forwarding state
[   21.422573] IPv6: ADDRCONF(NETDEV_CHANGE): vmbr0: link becomes ready
Code:
# dd if=/dev/zero of=/dev/sdb bs=512 count=1 seek=7071673278
dd: error writing '/dev/sdb': Input/output error
1+0 records in
0+0 records out
0 bytes copied, 1.83909 s, 0.0 kB/s
Code:
# dmesg | tail -3
[ 9321.129915] sd 1:0:1:0: [sdb] tag#9408 Add. Sense: Unrecovered read error
[ 9321.129942] sd 1:0:1:0: [sdb] tag#9408 CDB: Read(16) 88 20 00 00 00 01 a5 81 2b b8 00 00 00 08 00 00
[ 9321.129971] blk_update_request: critical medium error, dev sdb, sector 7071673278 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
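One detail in the dmesg output is worth decoding: the failing request is a READ even though dd was writing. The CDB bytes can be unpacked by hand. A minimal sketch, assuming the standard READ(16) layout (opcode 0x88, 8-byte big-endian LBA in bytes 2-9, 4-byte transfer length in bytes 10-13); decode_read16 is a hypothetical helper name:

```shell
#!/bin/sh
# Decode the LBA and transfer length from a SCSI READ(16) CDB as printed by the kernel.
decode_read16() {
    # $1: CDB bytes as space-separated hex, e.g. "88 20 00 ..."
    set -- $1                                  # split the bytes into $1..$16
    [ "$1" = "88" ] || { echo "not a READ(16) CDB" >&2; return 1; }
    lba=$((0x$3$4$5$6$7$8$9${10}))             # bytes 2-9: starting LBA
    len=$((0x${11}${12}${13}${14}))            # bytes 10-13: number of blocks
    echo "lba=$lba blocks=$len last=$((lba + len - 1))"
}

decode_read16 "88 20 00 00 00 01 a5 81 2b b8 00 00 00 08 00 00"
```

With the CDB above this reports an 8-block read starting at LBA 7071673272, a range that includes the bad sector 7071673278. One plausible reading, not confirmed in the thread, is that the buffered 512-byte write made the kernel read the surrounding 4 KiB first, and that read is what hit the bad sector.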
And time for some diffs...

Code:
# diff -u sdb_before sdb_after
--- sdb_before  2022-05-02 01:48:21.117924114 -0400
+++ sdb_after   2022-05-02 01:54:37.714936224 -0400

(snip)

@@ -43,9 +43,9 @@
            Errors Corrected by           Total   Correction     Gigabytes    Total
                ECC          rereads/    errors   algorithm      processed    uncorrected
            fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
-read:     127779       17         0    127796    1213400       6714.744           0
+read:     127779       17         0    127796    1213547       6714.747           1
write:         0        0         0         0    1268483      99150.319           0
-verify:        0        0         0         0     144469          0.000           0
+verify:        0        0         0         0     144471          0.000           0

Non-medium error count:        0
There it is: total uncorrected errors for reads went from 0 to 1.
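To watch just that one number rather than diffing whole reports, the last column of the read: row can be pulled out with awk. A small sketch, assuming the error counter table layout shown in the diff above (uncorrected errors as the final column); uncorrected_reads is a hypothetical helper name:

```shell
#!/bin/sh
# Print the "total uncorrected errors" column of the read: row
# from smartctl -a output supplied on stdin.
uncorrected_reads() {
    awk '$1 == "read:" { print $NF }'
}

# Demo on the two rows from the diff above.
# A live check would instead be: smartctl -a /dev/sdb | uncorrected_reads
before='read:     127779       17         0    127796    1213400       6714.744           0'
after='read:     127779       17         0    127796    1213547       6714.747           1'

b=$(printf '%s\n' "$before" | uncorrected_reads)
a=$(printf '%s\n' "$after"  | uncorrected_reads)
echo "uncorrected read errors: $b -> $a"
```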

And in the SAS log pages:

Code:
# diff -u sdb_before.sg sdb_after.sg
--- sdb_before.sg 2022-05-02 01:49:15.133508735 -0400
+++ sdb_after.sg  2022-05-02 01:54:52.234817570 -0400
@@ -32,16 +32,16 @@
   Errors corrected with possible delays = 17
   Total rewrites or rereads = 0
   Total errors corrected = 127796
-  Total times correction algorithm processed = 1213400
-  Total bytes processed = 6714744891400
-  Total uncorrected errors = 0
+  Total times correction algorithm processed = 1213547
+  Total bytes processed = 6714747125320
+  Total uncorrected errors = 1

Verify error counter page  [0x5]
   Errors corrected without substantial delay = 0
   Errors corrected with possible delays = 0
   Total rewrites or rereads = 0
   Total errors corrected = 0
-  Total times correction algorithm processed = 144469
+  Total times correction algorithm processed = 144471
   Total bytes processed = 0
   Total uncorrected errors = 0

(snip)
 

UhClem

We're only half-way thru the metamorphosis.
Let's finish it and try to learn something extra in the process.

Pls copy/paste the following into a file named file:
Code:
smartctl -a /dev/sdb | grep -e read: -e write:
dd if=/dev/sdb of=/dev/null bs=512 count=1 skip=7071673278 iflag=direct 2> /dev/null
echo ===
smartctl -a /dev/sdb | grep -e read: -e write:
dd if=/dev/zero of=/dev/sdb bs=512 count=1 seek=7071673278 oflag=direct 2> /dev/null
echo ===
smartctl -a /dev/sdb | grep -e read: -e write:
Then run tr -d "\r" < file > file2, chmod 755 file2, and ./file2 > file2.out
Pls copy/paste contents of file2.out into your reply. Thanks
[An explanation will follow.]