ZFS pool faulted "too many errors", but no errors?

Kybber

My proxmox host sent me the following mail half an hour ago:
Code:
The number of I/O errors associated with a ZFS device exceeded
acceptable levels. ZFS has marked the device as faulted.
So I logged on to my host and found this:
Code:
# zpool status
  pool: rpool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 0h38m with 0 errors on Sun Mar 10 01:02:39 2019
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            sdi2    ONLINE       0     0     0
            sdk2    FAULTED      0     0     0  too many errors

errors: No known data errors
How do I interpret this? If 0 errors is too many, then what is normal? o_O

Both drives are Intel DC S3710. Here are the smartctl attributes for the faulted drive:
Code:
# smartctl -A /dev/sdk
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.18-9-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       25381
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       70
170 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
174 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       69
175 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always       -       6870 (100 2757)
183 SATA_Downshift_Count    0x0032   100   100   000    Old_age   Always       -       2
184 End-to-End_Error        0x0033   100   100   090    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Temperature_Case        0x0022   076   069   000    Old_age   Always       -       24 (Min/Max 21/33)
192 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       69
194 Temperature_Internal    0x0022   100   100   000    Old_age   Always       -       24
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       26312405
226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       7618
227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       15
228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       1522689
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   093   093   000    Old_age   Always       -       0
234 Thermal_Throttle        0x0032   100   100   000    Old_age   Always       -       0/0
241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       26312405
242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       4944506
Looks fine, no?
 

PigLover

Seems very odd, almost like something was done to clear the error states.

Take a backup of it right away (just to be safe). Then do a zpool clear to reset the drive, and run a new scrub to see if it reports more errors. If all is well then it's "fixed", but I'd certainly be cautious with it and do more frequent backups.
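
For reference, the clear-then-scrub sequence could be scripted roughly like this. It's only a sketch: `recovery_commands` and `run_recovery` are made-up helper names, the pool and device names are the ones from the zpool status output above, and it assumes you run it as root with the backup already done.
Code:
```python
import subprocess


def recovery_commands(pool, device):
    """Return the zpool invocations for clear -> scrub -> status, in order."""
    return [
        ["zpool", "clear", pool, device],  # reset error counters / FAULTED state
        ["zpool", "scrub", pool],          # start re-reading and verifying all data
        ["zpool", "status", "-v", pool],   # show scrub progress and any new errors
    ]


def run_recovery(pool, device):
    """Run the commands one by one; raises on the first one that fails."""
    for cmd in recovery_commands(pool, device):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run: only print what would be executed.
    for cmd in recovery_commands("rpool", "sdk2"):
        print(" ".join(cmd))
```
The scrub runs in the background, so the status call right after it only shows progress. Check zpool status again once the scrub has finished.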
 

Kybber

Thanks, guys. I did a zpool clear followed by a scrub. No issues whatsoever. Syslog entries pasted below. Not sure exactly what happened, but I guess it was (at least reasonably) benign. I'll keep this as a record so I can check back if something similar happens again. Also running a SMART long test just in case.

Syslog entries:
Code:
[Wed Mar 27 15:45:43 2019] sd 7:0:1:0: attempting task abort! scmd(00000000352cca5a)
[Wed Mar 27 15:45:43 2019] sd 7:0:1:0: [sdk] tag#3 CDB: Write(10) 2a 00 23 46 75 78 00 00 10 00
[Wed Mar 27 15:45:43 2019] scsi target7:0:1: handle(0x0009), sas_address(0x4433221100000000), phy(0)
[Wed Mar 27 15:45:43 2019] scsi target7:0:1: enclosure logical id(0x5000000080000000), slot(0)
[Wed Mar 27 15:45:43 2019] sd 7:0:1:0: [sdk] tag#11 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[Wed Mar 27 15:45:43 2019] sd 7:0:1:0: [sdk] tag#9 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[Wed Mar 27 15:45:43 2019] sd 7:0:1:0: [sdk] tag#8 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[Wed Mar 27 15:45:43 2019] sd 7:0:1:0: [sdk] tag#11 CDB: Write(10) 2a 00 23 48 2b 08 00 00 50 00
[Wed Mar 27 15:45:43 2019] sd 7:0:1:0: [sdk] tag#9 CDB: Write(10) 2a 00 23 48 82 78 00 00 18 00
[Wed Mar 27 15:45:43 2019] print_req_error: I/O error, dev sdk, sector 591954552
[Wed Mar 27 15:45:43 2019] sd 7:0:1:0: [sdk] tag#8 CDB: Read(10) 28 00 25 63 f6 b0 00 00 10 00
[Wed Mar 27 15:45:43 2019] print_req_error: I/O error, dev sdk, sector 591932168
[Wed Mar 27 15:45:43 2019] sd 7:0:1:0: [sdk] tag#7 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[Wed Mar 27 15:45:43 2019] print_req_error: I/O error, dev sdk, sector 627308208
[Wed Mar 27 15:45:43 2019] sd 7:0:1:0: [sdk] tag#7 CDB: Write(10) 2a 00 23 42 c2 48 00 00 38 00
[Wed Mar 27 15:45:43 2019] sd 7:0:1:0: [sdk] tag#10 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[Wed Mar 27 15:45:43 2019] print_req_error: I/O error, dev sdk, sector 591577672
[Wed Mar 27 15:45:43 2019] sd 7:0:1:0: [sdk] tag#10 CDB: Write(10) 2a 00 23 48 28 48 00 00 48 00
[Wed Mar 27 15:45:43 2019] print_req_error: I/O error, dev sdk, sector 591931464
[Wed Mar 27 15:45:43 2019] sd 7:0:1:0: [sdk] tag#5 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[Wed Mar 27 15:45:43 2019] sd 7:0:1:0: [sdk] tag#5 CDB: Write(10) 2a 00 23 46 76 18 00 00 20 00
[Wed Mar 27 15:45:43 2019] print_req_error: I/O error, dev sdk, sector 591820312
[Wed Mar 27 15:45:43 2019] sd 7:0:1:0: [sdk] tag#4 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[Wed Mar 27 15:45:43 2019] sd 7:0:1:0: [sdk] tag#4 CDB: Write(10) 2a 00 23 46 6a 40 00 00 20 00
[Wed Mar 27 15:45:43 2019] print_req_error: I/O error, dev sdk, sector 591817280
[Wed Mar 27 15:45:43 2019] sd 7:0:1:0: [sdk] tag#6 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[Wed Mar 27 15:45:43 2019] sd 7:0:1:0: [sdk] tag#6 CDB: Write(10) 2a 00 23 41 56 28 00 00 40 00
[Wed Mar 27 15:45:43 2019] print_req_error: I/O error, dev sdk, sector 591484456
[Wed Mar 27 15:45:43 2019] sd 7:0:1:0: task abort: SUCCESS scmd(00000000352cca5a)
[Wed Mar 27 15:45:43 2019] sd 7:0:1:0: [sdk] tag#3 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[Wed Mar 27 15:45:43 2019] sd 7:0:1:0: [sdk] tag#3 CDB: Write(10) 2a 00 23 46 75 78 00 00 10 00
[Wed Mar 27 15:45:43 2019] print_req_error: I/O error, dev sdk, sector 591820152
[Wed Mar 27 15:45:43 2019] sd 7:0:1:0: attempting task abort! scmd(00000000fffe4eb8)
[Wed Mar 27 15:45:43 2019] sd 7:0:1:0: [sdk] tag#2 CDB: Write(10) 2a 00 23 48 24 a0 00 00 20 00
[Wed Mar 27 15:45:43 2019] scsi target7:0:1: handle(0x0009), sas_address(0x4433221100000000), phy(0)
[Wed Mar 27 15:45:43 2019] scsi target7:0:1: enclosure logical id(0x5000000080000000), slot(0)
[Wed Mar 27 15:45:43 2019] sd 7:0:1:0: task abort: SUCCESS scmd(00000000fffe4eb8)
[Wed Mar 27 15:45:43 2019] sd 7:0:1:0: [sdk] tag#2 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[Wed Mar 27 15:45:43 2019] sd 7:0:1:0: [sdk] tag#2 CDB: Write(10) 2a 00 23 48 24 a0 00 00 20 00
[Wed Mar 27 15:45:43 2019] print_req_error: I/O error, dev sdk, sector 591930528
[Wed Mar 27 15:45:43 2019] sd 7:0:1:0: attempting task abort! scmd(00000000f880943d)
[Wed Mar 27 15:45:43 2019] sd 7:0:1:0: [sdk] tag#1 CDB: Write(10) 2a 00 23 47 4a 20 00 01 00 00
[Wed Mar 27 15:45:43 2019] scsi target7:0:1: handle(0x0009), sas_address(0x4433221100000000), phy(0)
[Wed Mar 27 15:45:43 2019] scsi target7:0:1: enclosure logical id(0x5000000080000000), slot(0)
[Wed Mar 27 15:45:43 2019] sd 7:0:1:0: task abort: SUCCESS scmd(00000000f880943d)
[Wed Mar 27 15:45:43 2019] sd 7:0:1:0: Power-on or device reset occurred
[Wed Mar 27 15:45:44 2019] sd 7:0:1:0: Power-on or device reset occurred
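
A note for reading the log: the CDB lines and the print_req_error lines describe the same requests. In a Read(10)/Write(10) CDB, bytes 2-5 are the big-endian logical block address and bytes 7-8 the transfer length in blocks. A small sketch to decode them (`decode_rw10` is just a name I picked, not a tool from any package):
Code:
```python
def decode_rw10(cdb_hex):
    """Decode a SCSI READ(10)/WRITE(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)  # fromhex ignores the spaces between bytes
    if len(cdb) != 10 or cdb[0] not in (0x28, 0x2A):
        raise ValueError("not a READ(10)/WRITE(10) CDB")
    op = "Read(10)" if cdb[0] == 0x28 else "Write(10)"
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return op, lba, blocks


# tag#9 from the log above
print(decode_rw10("2a 00 23 48 82 78 00 00 18 00"))
# -> ('Write(10)', 591954552, 24)
```
The decoded address 591954552 is the same number as in "I/O error, dev sdk, sector 591954552", which is what you'd expect with 512-byte logical blocks. The tag#8 Read(10) decodes to 627308208 in the same way.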
 

czl

Based on the syslog you posted, it looks to me like the Linux kernel is having trouble communicating with the drive. ZFS retries and eventually succeeds, hence no ZFS errors, just the faulted device due to the trouble ZFS has using it. The cause of the problem could be the drive or the connection from the MB to the drive. You'd be able to tell if you swap drives around and see if the error follows the drive, stays with the slot, or disappears because you reseated it.
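
To back that up a bit: every failure in the log carries a hostbyte of DID_SOFT_ERROR or DID_TIME_OUT with driverbyte=DRIVER_OK. As far as I understand, those codes are set on the host side (HBA driver / SCSI layer) and not by the drive reporting a bad sector. Here is a rough way to tally them from a saved log, so you can compare before and after swapping slots. The helper name `tally_hostbytes` is made up, and the sample is just three lines copied from the log above:
Code:
```python
import re
from collections import Counter

# Matches lines like: "[sdk] tag#11 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK"
FAILED_RE = re.compile(
    r"\[(sd[a-z]+)\] tag#\d+ FAILED Result: hostbyte=(\w+) driverbyte=(\w+)"
)


def tally_hostbytes(log_text):
    """Count FAILED results per (device, hostbyte) pair in kernel log text."""
    counts = Counter()
    for dev, hostbyte, _driverbyte in FAILED_RE.findall(log_text):
        counts[(dev, hostbyte)] += 1
    return counts


sample = """\
[Wed Mar 27 15:45:43 2019] sd 7:0:1:0: [sdk] tag#11 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[Wed Mar 27 15:45:43 2019] sd 7:0:1:0: [sdk] tag#9 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[Wed Mar 27 15:45:43 2019] sd 7:0:1:0: [sdk] tag#3 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
"""

for (dev, hostbyte), n in sorted(tally_hostbytes(sample).items()):
    print(dev, hostbyte, n)
```
If the same counts show up on a different sdX name after you move the drive to another slot, that points at the drive. If they stay with the slot, look at the cable, backplane or HBA port.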
 