ZFS on CentOS: rapid SATA drive failures

gsxrrcr · Oct 26, 2020

I have a system built on a Supermicro with several JBODs. It was originally all 4TB SAS, but 8TB SATA drives were put in. It has had more drives fault in 6 months than I have seen in my 20 years in storage. I am an enterprise storage guy (NetApp etc.). This was dropped in my lap and I can't figure out why this is happening. Any help would be appreciated. Here's what dropped over the weekend. CentOS 7, ZFS 0.8.4-1.

raidz2-2                                DEGRADED     0     0     0
  ata-ST8000NM0055-1RM112_ZA15ZNRP      ONLINE       0     0     0
  ata-ST8000NM0055-1RM112_ZA15ZPEZ      ONLINE       0     0     0
  ata-ST8000NM0055-1RM112_ZA15ZPQC      ONLINE       0     0     0
  ata-ST8000NM0055-1RM112_ZA15ZPQD      ONLINE       0     0     0
  ata-ST8000NM0055-1RM112_ZA15ZQ30      ONLINE       0     0     0
  spare-5                               DEGRADED     0     0     0
    ata-ST8000NM0055-1RM112_ZA15ZQHN    FAULTED     16     0     0  too many errors
    ata-ST8000NM000A-2KE101_WKD1CTC1    ONLINE       0     0     0
  ata-ST8000NM0055-1RM112_ZA15ZX6D      ONLINE       0     0     0
  ata-ST8000NM0055-1RM112_ZA15ZY4D      ONLINE       0     0     0
  ata-ST8000NM0055-1RM112_ZA160WPZ      ONLINE       0     0     0
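
With several JBODs the status listing gets long, so it can help to filter it down to the rows that are not ONLINE. The following is only a rough sketch: it assumes each device row has the form "name state read write cksum [note]" as in the listing above, and the function name `unhealthy_devices` is made up for illustration. Counters are kept as strings because `zpool status` may abbreviate large values.

```python
import re

# One device row: name, state, READ/WRITE/CKSUM counters, optional trailing note.
ROW = re.compile(
    r"^(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)(?:\s+(.*))?$"
)

def unhealthy_devices(status_text):
    """Return (name, state, read, write, cksum, note) for every row that is not ONLINE."""
    rows = []
    for line in status_text.splitlines():
        m = ROW.match(line.strip())
        if m and m.group(2) != "ONLINE":
            name, state, rd, wr, ck, note = m.groups()
            rows.append((name, state, rd, wr, ck, note or ""))
    return rows

# A few rows copied from the listing in this post.
sample = """\
raidz2-2 DEGRADED 0 0 0
ata-ST8000NM0055-1RM112_ZA15ZQ30 ONLINE 0 0 0
spare-5 DEGRADED 0 0 0
ata-ST8000NM0055-1RM112_ZA15ZQHN FAULTED 16 0 0 too many errors
ata-ST8000NM000A-2KE101_WKD1CTC1 ONLINE 0 0 0
"""

for row in unhealthy_devices(sample):
    print(row)
```

In practice the text would come from running `zpool status` rather than from a pasted string.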

ttabbal · Oct 26, 2020

What controller/HBA? Any errors in dmesg? smartctl -a /dev/sdX show anything?

I'm debugging something similar with a replaced drive. It works at first, but scrubs fail later. I started getting "Unaligned partial completion" errors in dmesg, which I found are commonly linked to HBA and drive firmware. It turned out I had one card with older firmware, so I updated it. I'm waiting for the scrub now, but it has already run longer than the previous one and looks to be working, so hopefully that fixed it. The new drive seemed to trigger the problem; the system was working fine before I replaced an old drive. In my case the new drive is an ST6000VN001, which makes me wonder whether newer/larger Seagates combine with the HBA to trip things up. The drive itself has been through a few rounds of badblocks on a test machine, along with SMART testing, so I really don't think the drive is at fault.

Hope it helps someone to mention it here.
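
When checking dmesg on a box with this many disks, a per-device tally of block-layer I/O errors makes it easier to see whether one disk or a whole group is affected. This is a minimal sketch, assuming the kernel logs errors in the "blk_update_request: I/O error, dev X, sector N" form that appears in the dmesg output posted further down in this thread. The helper name `io_errors_by_device` is hypothetical.

```python
import re
from collections import Counter

# Block-layer error line format as it appears in the dmesg output in this thread.
IO_ERR = re.compile(r"blk_update_request: I/O error, dev (\w+), sector (\d+)")

def io_errors_by_device(dmesg_text):
    """Count block-layer I/O error lines per device name."""
    counts = Counter()
    for dev, _sector in IO_ERR.findall(dmesg_text):
        counts[dev] += 1
    return counts

# Three lines copied from the dmesg output posted later in the thread.
sample = """\
[ 142.527563] blk_update_request: I/O error, dev sdfs, sector 268435416
[ 142.810575] blk_update_request: I/O error, dev sdfs, sector 15628053160
[ 143.093593] blk_update_request: I/O error, dev sdfs, sector 15628053160
"""

for dev, n in io_errors_by_device(sample).most_common():
    print(dev, n)
```

If the errors cluster on disks behind one HBA or one enclosure, that points away from the individual drives.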

gsxrrcr · Oct 27, 2020

Thanks! Here's what I see in dmesg. Not the same as yours, but I think you have pointed me in a good direction. I did update all the drive firmware, but not anything else. I think the problems actually started before the updates, but I will check the HBA out as well.

[ 142.244535] blk_update_request: I/O error, dev sdfs, sector 268435416
[ 142.527517] sd 13:0:44:0: [sdfs] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[ 142.527532] sd 13:0:44:0: [sdfs] tag#0 Sense Key : Aborted Command [current]
[ 142.527539] sd 13:0:44:0: [sdfs] tag#0 Add. Sense: No additional sense information
[ 142.527560] sd 13:0:44:0: [sdfs] tag#0 CDB: Read(16) 88 00 00 00 00 00 0f ff ff d8 00 00 00 08 00 00
[ 142.527563] blk_update_request: I/O error, dev sdfs, sector 268435416
[ 142.527565] Buffer I/O error on dev sdfs, logical block 33554427, async page read
[ 142.810539] sd 13:0:44:0: [sdfs] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[ 142.810555] sd 13:0:44:0: [sdfs] tag#0 Sense Key : Aborted Command [current]
[ 142.810561] sd 13:0:44:0: [sdfs] tag#0 Add. Sense: No additional sense information
[ 142.810569] sd 13:0:44:0: [sdfs] tag#0 CDB: Read(16) 88 00 00 00 00 03 a3 81 2a a8 00 00 00 08 00 00
[ 142.810575] blk_update_request: I/O error, dev sdfs, sector 15628053160
[ 143.093557] sd 13:0:44:0: [sdfs] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[ 143.093573] sd 13:0:44:0: [sdfs] tag#0 Sense Key : Aborted Command [current]
[ 143.093580] sd 13:0:44:0: [sdfs] tag#0 Add. Sense: No additional sense information
[ 143.093587] sd 13:0:44:0: [sdfs] tag#0 CDB: Read(16) 88 00 00 00 00 03 a3 81 2a a8 00 00 00 08 00 00
[ 143.093593] blk_update_request: I/O error, dev sdfs, sector 15628053160
[ 143.093599] Buffer I/O error on dev sdfs, logical block 1953506645, async page read
[ 143.659545] Buffer I/O error on dev sdfs, logical block 1953506645, async page read
[ 186.534945] scsi 13:0:50:0: Direct-Access ATA ST8000NM000A-2KE SN02 PQ: 0 ANSI: 6
[ 186.534956] scsi 13:0:50:0: SATA: handle(0x003b), sas_addr(0x50050cc118059923), phy(35), device_name(0x0000000000000000)
[ 186.534959] scsi 13:0:50:0: enclosure logical id (0x50050cc102055160), slot(0)
[ 186.535073] scsi 13:0:50:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
[ 186.535078] scsi 13:0:50:0: qdepth(32), tagged(1), simple(0), ordered(0), scsi_level(7), cmd_que(1)
[ 187.287292] sd 13:0:50:0: Power-on or device reset occurred
[ 187.287370] sd 13:0:50:0: Attached scsi generic sg188 type 0
[ 187.289071] sd 13:0:50:0: [sdfw] 15628053168 512-byte logical blocks: (8.00 TB/7.27 TiB)
[ 187.289074] sd 13:0:50:0: [sdfw] 4096-byte physical blocks
[ 217.884839] sd 13:0:50:0: attempting task abort! scmd(ffff9fbb23aaca00)
[ 217.884846] sd 13:0:50:0: tag#0 CDB: Mode Sense(6) 1a 00 3f 00 04 00
[ 217.884848] scsi target13:0:50: handle(0x003b), sas_address(0x50050cc118059923), phy(35)
[ 217.884850] scsi target13:0:50: enclosure logical id(0x50050cc102055160), slot(0)
[ 218.797459] sd 13:0:50:0: device_block, handle(0x003b)
[ 221.042118] sd 13:0:50:0: device_unblock and setting to running, handle(0x003b)
[ 221.782440] sd 13:0:50:0: task abort: SUCCESS scmd(ffff9fbb23aaca00)
[ 221.782459] sd 13:0:50:0: [sdfw] Write Protect is off
[ 221.782463] sd 13:0:50:0: [sdfw] Mode Sense: 00 00 00 00
[ 221.782482] sd 13:0:50:0: [sdfw] Asking for cache data failed
[ 221.782485] sd 13:0:50:0: [sdfw] Assuming drive cache: write through
[ 221.782934] sd 13:0:50:0: [sdfw] Read Capacity(16) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[ 221.782937] sd 13:0:50:0: [sdfw] Sense not available.
[ 221.782975] sd 13:0:50:0: [sdfw] Read Capacity(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[ 221.782978] sd 13:0:50:0: [sdfw] Sense not available.
[ 221.783040] sd 13:0:50:0: [sdfw] Attached SCSI disk
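
The CDB lines in the log can be decoded to confirm which blocks the failing reads targeted. A small sketch, assuming the standard READ(16) layout (opcode 0x88, an 8-byte big-endian LBA in bytes 2-9, a 4-byte transfer length in bytes 10-13); the function name `decode_read16` is made up for illustration.

```python
def decode_read16(cdb_hex):
    """Decode a SCSI READ(16) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a 16-byte READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2-9: logical block address
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10-13: transfer length in blocks
    return lba, blocks

# The two distinct CDBs from the log above.
first = decode_read16("88 00 00 00 00 00 0f ff ff d8 00 00 00 08 00 00")
second = decode_read16("88 00 00 00 00 03 a3 81 2a a8 00 00 00 08 00 00")
print(first)   # (268435416, 8)
print(second)  # (15628053160, 8)
```

Both decoded LBAs match the sector numbers in the blk_update_request lines, and each read is only 8 blocks. The second read ends at block 15628053168, which is the block count the kernel reports for this drive model in the same log, so it covers the last 8 sectors of the disk. That suggests even small reads are being aborted, though it does not by itself say whether the drive, the cabling, or the HBA is at fault.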

[root@dataserver0 log]# smartctl -a /dev/sdfs
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1127.19.1.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model: ST8000NM000A-2KE101
Serial Number: WKD0B211
LU WWN Device Id: 5 000c50 0c2a53721
Firmware Version: SN02
User Capacity: 8,001,563,222,016 bytes [8.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-4 (minor revision not indicated)
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Oct 27 14:05:40 2020 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Read SMART Data failed: scsi error aborted command

=== START OF READ SMART DATA SECTION ===
SMART Status command failed: scsi error aborted command
SMART overall-health self-assessment test result: UNKNOWN!
SMART Status, Attributes and Thresholds cannot be read.

Read SMART Log Directory failed: scsi error aborted command

Read SMART Error Log failed: scsi error aborted command

Read SMART Self-test Log failed: scsi error aborted command

Selective Self-tests/Logging not supported

gsxrrcr · Oct 27, 2020

Serial Attached SCSI controller: Broadcom / LSI SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)
82:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS3216 PCI-Express Fusion-MPT SAS-3 (rev 01)
83:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS3216 PCI-Express Fusion-MPT SAS-3 (rev 01)
84:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)

Stephan · Oct 27, 2020

I suggest connecting the disk to a normal SATA controller with a different cable and running smartctl -ax <device> again. If the problem persists, the disk is likely defective. Make sure you are not using any standby software with the LSI; that is asking for trouble.

gea · Oct 28, 2020

I would also check cabling first (swap with another disk that is OK).
If the problem stays with this disk, remove it and check it with a low-level tool such as WD Data Lifeguard (intensive check), e.g. on Windows.

A problem that can be associated with the HBA is firmware.
The best option for ZFS is IT firmware. Firmware 20.0 up to 20.0.0.4 is quite buggy; if you are on one of those, update to 20.0.0.7.

Tinkerer · Dec 5, 2020

I've had similar errors in a system due to the power supply. I did the math on power consumption, and while it should have been sufficient (on the edge), drives would report errors similar to the ones you are seeing. After the 3rd or 4th drive failure, I figured something else was going on. I replaced the PSU with a bigger and, more importantly, better one, and all my drive problems were gone.

So I'd definitely look into that if I were you.
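
The power math can be sketched roughly as below. This is only an illustration: the function name and every number in the example call are hypothetical placeholders, not measurements from either system in this thread. Real figures should come from the drive datasheets (spin-up draw in particular) and the PSU's per-rail ratings.

```python
def psu_headroom_watts(psu_watts, drive_count, watts_per_drive_spinup, other_load_watts):
    """Rated PSU capacity minus estimated worst-case load.

    A negative result means the estimate exceeds the PSU rating.
    """
    worst_case = drive_count * watts_per_drive_spinup + other_load_watts
    return psu_watts - worst_case

# Hypothetical example: 24 drives at 25 W each during spin-up,
# 300 W for everything else, on an 800 W supply.
print(psu_headroom_watts(800, 24, 25, 300))
```

A total that comes out near zero is the "on the edge" case described above. A supply that looks sufficient on paper can still sag under simultaneous spin-up or when it ages.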
