ZFS checksum error (on scrub) – how do I see affected files?

nle

Member
Oct 24, 2012
200
11
18
This is the first time I've ever gotten this kind of error across the board; I've only experienced bad drives before.

Code:
zpool status
  pool: datapool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 2M in 10h6m with 0 errors on Sun Jun 30 05:06:34 2019
config:

        NAME                       STATE     READ WRITE CKSUM
        datapool                   ONLINE       0     0     0
          raidz2-0                 ONLINE       0     0     0
            c8t5000CCA24CD8473Ad0  ONLINE       0     0    12
            c8t5000CCA24CD84746d0  ONLINE       0     0    10
            c8t5000CCA24CD847C6d0  ONLINE       0     0    11
            c8t5000CCA24CD847F7d0  ONLINE       0     0    10
            c8t5000CCA24CD84816d0  ONLINE       0     0    13
            c8t5000CCA24CD85170d0  ONLINE       0     0     8
        cache
          c8t5E83A97E17FC0D84d0    ONLINE       0     0     0
        spares
          c8t5000CCA22BF5E927d0    AVAIL
How do I get info about which files are affected?
 

nephri

Active Member
Sep 23, 2015
535
104
43
42
Paris, France
At this time, no file has actually been affected.
The checksum errors were successfully handled by your raidz2.

ZFS is warning you that there is a risk of failure in the future (near or far).

Especially since all the data disks in your pool experienced checksum errors, it could be a controller issue (cabling, ...)?

Start by performing a backup if you haven't already.

Since you have a spare, you could try a replace and start a resilver on one disk.

The resilver process will stress the other disks, and it can be a nerve-wracking wait: the longer it takes, the more nervous you will be.
You will then see whether the other disks accumulate checksum errors or not.
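If you go that route, a replace using the hot spare would look roughly like this (a sketch only; the device names come from the zpool status output above, and which disk to replace first is illustrative):

```shell
# Pull the hot spare in for the disk showing the most checksum errors
# (c8t5000CCA24CD84816d0 had 13 CKSUM errors above). ZFS starts a
# resilver automatically and keeps the old disk attached until it finishes.
zpool replace datapool c8t5000CCA24CD84816d0 c8t5000CCA22BF5E927d0

# Watch resilver progress and whether CKSUM counters climb on the other disks.
zpool status datapool
```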
 
  • Like
Reactions: nle and thulle

thulle

New Member
Apr 11, 2019
19
9
3
At this time, no file has actually been affected.
The checksum errors were successfully handled by your raidz2.
The handling of the errors is indicated by:
scrub repaired 2M in 10h6m with 0 errors
If they were permanent errors, the status message would tell you to run zpool status -v and list them like this:

errors: Permanent errors have been detected in the following files:
/tank/damaged_file.iso
 
  • Like
Reactions: nle

nle

Member
Oct 24, 2012
200
11
18
Thanks guys.

I got hung up on all the checksum errors across all the drives. I'll clear the errors and run a new scrub to double-check.

(I do have offsite backups as of Friday night.)
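For reference, clearing the counters and re-scrubbing is just two commands (pool name taken from the status output above):

```shell
# Reset the per-device READ/WRITE/CKSUM counters...
zpool clear datapool
# ...then re-read and verify every allocated block in the pool.
zpool scrub datapool

# If the counters climb again on a fresh scrub, suspect hardware
# (RAM, HBA, cabling) rather than a one-off event.
zpool status datapool
```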
 

nle

Member
Oct 24, 2012
200
11
18
Quick follow up on this one.

I recently got a lot more errors, this time affecting files. I'll probably write a more thorough standalone post, but I have a couple of quick questions:

Question 1:
How common is it for multiple drives to throw "too many errors" at roughly the same time? Do drives (same batch, same install date) usually fail together?

As far as I can see, the SMART data looks okay.
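(If anyone wants to check the same thing: with smartmontools installed, the interesting counters can be pulled per disk. A sketch; the device path below is illustrative for an illumos system.)

```shell
# Reallocated/pending/uncorrectable sectors and CRC errors are the
# attributes that usually reveal a genuinely failing disk.
smartctl -a /dev/rdsk/c8t5000CCA24CD8473Ad0 \
    | egrep -i 'reallocat|pending|uncorrect|crc'
```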

Question 2:
All 17 files with errors are the same file, just in different snapshots.

Code:
errors: Permanent errors have been detected in the following files:

        datapool/Lager@daily-1427810489_2019.07.24.08.00.14:/dir/dir/dir/dir/dir/image_257.jpg
        datapool/Lager@daily-1427810489_2019.07.28.08.00.15:/dir/dir/dir/dir/dir/image_257.jpg
        datapool/Lager@daily-1427810489_2019.07.16.08.00.15:/dir/dir/dir/dir/dir/image_257.jpg
        datapool/Lager@daily-1427810489_2019.07.14.08.00.14:/dir/dir/dir/dir/dir/image_257.jpg
        datapool/Lager@daily-1427810489_2019.07.17.08.00.15:/dir/dir/dir/dir/dir/image_257.jpg
        datapool/Lager@daily-1427810489_2019.07.29.08.00.15:/dir/dir/dir/dir/dir/image_257.jpg
        datapool/Lager@daily-1427810489_2019.07.27.08.00.14:/dir/dir/dir/dir/dir/image_257.jpg
        datapool/Lager@daily-1427810489_2019.07.15.08.00.15:/dir/dir/dir/dir/dir/image_257.jpg
        datapool/Lager@daily-1427810489_2019.07.23.08.00.15:/dir/dir/dir/dir/dir/image_257.jpg
        datapool/Lager@daily-1427810489_2019.07.20.08.00.13:/dir/dir/dir/dir/dir/image_257.jpg
        datapool/Lager@daily-1427810489_2019.07.13.08.00.14:/dir/dir/dir/dir/dir/image_257.jpg
        datapool/Lager@daily-1427810489_2019.07.21.08.00.15:/dir/dir/dir/dir/dir/image_257.jpg
        datapool/Lager@daily-1427810489_2019.07.25.08.00.15:/dir/dir/dir/dir/dir/image_257.jpg
        datapool/Lager@daily-1427810489_2019.07.18.08.00.15:/dir/dir/dir/dir/dir/image_257.jpg
        datapool/Lager@daily-1427810489_2019.07.22.08.00.14:/dir/dir/dir/dir/dir/image_257.jpg
        datapool/Lager@daily-1427810489_2019.07.19.08.00.14:/dir/dir/dir/dir/dir/image_257.jpg
        datapool/Lager@daily-1427810489_2019.07.26.08.00.14:/dir/dir/dir/dir/dir/image_257.jpg
Is that normal?
 

gea

Well-Known Member
Dec 31, 2010
2,500
842
113
DE
It is very unlikely that multiple disks fail at the same time.
Most likely is a RAM problem, followed by PSU, HBA, cabling or other hardware problems.
 
  • Like
Reactions: nle

nle

Member
Oct 24, 2012
200
11
18
Ok, thank you.

How would you start to debug that issue? Do you have any go-to tips?

And how likely is it that RAM suddenly starts failing? (This is unfortunately non-ECC memory.)
 

gea

Well-Known Member
Dec 31, 2010
2,500
842
113
DE
You can either boot a test tool like memtest86, which offers a memory check, or remove half of the RAM and check whether the system is stable; if not, try the other half.
 
  • Like
Reactions: nle

nle

Member
Oct 24, 2012
200
11
18
Ok, thanks! I'm going to let the resilver finish, and then run a memtest overnight.

If I'm lucky I can get away with just replacing the memory. This is a pretty old setup anyway, so the cheaper the better.
 

thulle

New Member
Apr 11, 2019
19
9
3
Question 2: The error is in one actual file; that file exists in 17 snapshots, which makes the error show up as 17 files. Just so you don't think the file is OK in the earlier snapshots.
 
  • Like
Reactions: nle

nle

Member
Oct 24, 2012
200
11
18
Thanks.

When I check the file I get an I/O error (I was trying to use md5sum) in OmniOS. I have a working copy of the file from backup.

Additional question:
I cleared the error (it's currently resilvering), but after I clear it, things work for a while and then the errors come back. Always on the same drives.

It went from:

Code:
  pool: datapool
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Jul 30 10:39:42 2019
        5.85T scanned out of 10.0T at 277M/s, 4h24m to go
    993G resilvered, 58.24% done
config:

        NAME                         STATE     READ WRITE CKSUM
        datapool                     ONLINE       0     0     0
          raidz2-0                   ONLINE       0     0     0
            c8t5000CCA24CD8473Ad0    ONLINE       0     0     0
            spare-1                  ONLINE       0     0     0
              c8t5000CCA22BF708C6d0  ONLINE       0     0     0
              c8t5000CCA22BF5E927d0  ONLINE       0     0     0
            c8t5000CCA24CD847C6d0    ONLINE       0     0     1
            c8t5000CCA24CD847F7d0    ONLINE       0     0     0
            c8t5000CCA24CD84816d0    ONLINE       0     0     0
            c8t5000CCA24CD85170d0    ONLINE       0     0     0
        cache
          c8t5E83A97E17FC0D84d0      ONLINE       0     0     0
        spares
          c8t5000CCA22BF5E927d0      INUSE     currently in use

errors: No known data errors
To:
Code:
  pool: datapool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Jul 30 10:39:42 2019
        5.96T scanned out of 10.0T at 275M/s, 4h19m to go
    1010G resilvered, 59.27% done
config:

        NAME                         STATE     READ WRITE CKSUM
        datapool                     DEGRADED     0     0    14
          raidz2-0                   DEGRADED     0     0    56
            c8t5000CCA24CD8473Ad0    DEGRADED     0     0     0  too many errors
            spare-1                  ONLINE       0     0     0
              c8t5000CCA22BF708C6d0  ONLINE       0     0     0
              c8t5000CCA22BF5E927d0  ONLINE       0     0     0
            c8t5000CCA24CD847C6d0    DEGRADED     0     0     1  too many errors
            c8t5000CCA24CD847F7d0    ONLINE       0     0     0
            c8t5000CCA24CD84816d0    ONLINE       0     0     0
            c8t5000CCA24CD85170d0    DEGRADED     0     0     0  too many errors
        cache
          c8t5E83A97E17FC0D84d0      ONLINE       0     0     0
        spares
          c8t5000CCA22BF5E927d0      INUSE     currently in use

errors: Permanent errors have been detected in the following files:

        datapool/Lager:<0x5162f8>
Does that still point to RAM as being the issue?

That was also a very weird file to have an error in.
My plan is to run memtest once the resilvering has completed.

I'm also wondering why I have two drives under "spare-1", since I replaced one of those drives. I had to remove the replaced one manually since it did not get removed automatically; that triggered a resilver. My hope is that the zpool layout goes back to "normal" after the resilver is complete (i.e. no spare-1 with two drives).
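(If the spare does not detach by itself once the resilver completes, it can usually be returned to the spares list manually; a sketch using the device names from the status output above:)

```shell
# Detach the hot spare from the spare-1 vdev; the replacement disk
# c8t5000CCA22BF708C6d0 stays in the raidz2 and the spare goes back to AVAIL.
zpool detach datapool c8t5000CCA22BF5E927d0
```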

Appreciate all the input. It's (thankfully) been so long since my last issue that it's a bit like starting from scratch. :)
 

thulle

New Member
Apr 11, 2019
19
9
3
You seem to have a lot of checksum errors on the pool that aren't tied to one specific drive. That sounds like memory errors.
The odd filename is due to the metadata for that filesystem/folder being corrupt. Unless you have backups of all the data on the pool, I'd suggest you import it read-only, or don't import it at all until you have this figured out, to avoid permanent corruption and data loss. Remember that ZFS's data-safety guarantees don't hold if it can't trust its memory.
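A read-only import is a standard zpool import option; roughly:

```shell
# Export the pool if it is currently imported, then re-import it read-only
# so nothing (snapshots, atime updates, transaction syncs) gets written.
zpool export datapool
zpool import -o readonly=on datapool
```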
 
  • Like
Reactions: nle

thulle

New Member
Apr 11, 2019
19
9
3
Resilvering requires writing the corrected data, so I'm pretty sure you can't. But if you're having memory issues, you can't be sure you're writing correct data anyway.
 
  • Like
Reactions: nle

nle

Member
Oct 24, 2012
200
11
18
As long as nothing new is written to the pool, am I still at risk of corrupting data? The corrupted image file was two years old and, as far as I know, hadn't been accessed recently. (Sorry if this is very obvious, I just don't want to take any chances.)

What would you do? Pull the plug, and run memtest to check for memory errors?

(And I do have backup.)
 

thulle

New Member
Apr 11, 2019
19
9
3
There is write activity going on anyway: new scheduled snapshots, updates to the metadata recording when files were last read, and probably more that I'm not thinking of. The most dangerous scenarios, I think, would be a transaction written with erroneous pointers to where earlier transactions are (which would force you to import the pool state at a previous transaction number, and that can get messy), or the free-space maps being read incorrectly, so that ZFS thinks space holding actual data is free and overwrites it.
 
  • Like
Reactions: nle

nle

Member
Oct 24, 2012
200
11
18
Ok. I disabled all automatic snapshots and turned off all shares. I'm going to wait for the resilvering to finish, then boot memtest.
 

nle

Member
Oct 24, 2012
200
11
18
Yes, I guess I could. But I'm upgrading to ECC unbuffered memory with 8 GB DIMMs.

(I've mounted the datapool read-only as suggested. That way I have access to the files without risking further damage to them, and I have backups.)