ZFS checksum error (on scrub) – how do I see affected files?

nle

Member
Oct 24, 2012
200
11
18
This is the first time I've ever gotten this kind of error across the board; I've only experienced bad drives before.

Code:
zpool status
  pool: datapool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 2M in 10h6m with 0 errors on Sun Jun 30 05:06:34 2019
config:

        NAME                       STATE     READ WRITE CKSUM
        datapool                   ONLINE       0     0     0
          raidz2-0                 ONLINE       0     0     0
            c8t5000CCA24CD8473Ad0  ONLINE       0     0    12
            c8t5000CCA24CD84746d0  ONLINE       0     0    10
            c8t5000CCA24CD847C6d0  ONLINE       0     0    11
            c8t5000CCA24CD847F7d0  ONLINE       0     0    10
            c8t5000CCA24CD84816d0  ONLINE       0     0    13
            c8t5000CCA24CD85170d0  ONLINE       0     0     8
        cache
          c8t5E83A97E17FC0D84d0    ONLINE       0     0     0
        spares
          c8t5000CCA22BF5E927d0    AVAIL
How do I get info about which files are affected?
 

nephri

Active Member
Sep 23, 2015
535
104
43
42
Paris, France
At this time, no file has actually been affected.
The checksum errors were successfully handled by your raidz2.

ZFS is warning you that there is a risk of failure in the future (near or far).

Especially since all the data disks in your pool experienced checksum errors, it could be a controller issue (cabling, ...)?

Start by performing a backup if you haven't already.

Since you have a spare, you could try a replace and start a resilver on one disk.

The resilver process will stress the other disks, and it can be a nerve-wracking wait: the longer it takes, the more nervous you will be.
You will then see whether the other disks accumulate checksum errors or not.
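If you go that route, a replace using the hot spare would look roughly like this (a sketch only; the device names come from the zpool status output above, and which disk to replace first is illustrative):

```shell
# Pull the hot spare in for the disk showing the most checksum errors
# (c8t5000CCA24CD84816d0 had 13 CKSUM errors above). ZFS starts a
# resilver automatically and keeps the old disk attached until it finishes.
zpool replace datapool c8t5000CCA24CD84816d0 c8t5000CCA22BF5E927d0

# Watch resilver progress and whether CKSUM counters climb on the other disks.
zpool status datapool
```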
 
  • Like
Reactions: nle and thulle

thulle

New Member
Apr 11, 2019
19
9
3
At this time, no file has actually been affected.
The checksum errors were successfully handled by your raidz2.
The handling of the errors is indicated by:
scrub repaired 2M in 10h6m with 0 errors
If they were permanent errors, the status message would tell you to run zpool status -v and list them like this:

errors: Permanent errors have been detected in the following files:
/tank/damaged_file.iso
 
  • Like
Reactions: nle

nle

Member
Oct 24, 2012
200
11
18
Thanks guys.

I got hung up on all the checksum errors across all the drives. I'll clear the errors and run a new scrub to double-check.

(I do have offsite backups as of Friday night.)
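For reference, clearing the counters and re-scrubbing is just two commands (pool name taken from the status output above):

```shell
# Reset the per-device READ/WRITE/CKSUM counters...
zpool clear datapool
# ...then re-read and verify every allocated block in the pool.
zpool scrub datapool

# If the counters climb again on a fresh scrub, suspect hardware
# (RAM, HBA, cabling) rather than a one-off event.
zpool status datapool
```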
 

nle

Member
Oct 24, 2012
200
11
18
Quick follow up on this one.

I recently got a lot more errors, this time affecting files. I'll probably write a more thorough standalone post, but I have a couple of quick questions:

Question 1:
How common is it for multiple drives to throw "too many errors" at roughly the same time? Do drives (same batch, same install date) usually fail together?

As far as I can see, the SMART data looks okay.
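(If anyone wants to check the same thing: with smartmontools installed, the interesting counters can be pulled per disk. A sketch; the device path below is illustrative for an illumos system.)

```shell
# Reallocated/pending/uncorrectable sectors and CRC errors are the
# attributes that usually reveal a genuinely failing disk.
smartctl -a /dev/rdsk/c8t5000CCA24CD8473Ad0 \
    | egrep -i 'reallocat|pending|uncorrect|crc'
```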

Question 2:
All 17 files with errors are the same file, just in different snapshots.

Code:
errors: Permanent errors have been detected in the following files:

        datapool/Lager@daily-1427810489_2019.07.24.08.00.14:/dir/dir/dir/dir/dir/image_257.jpg
        datapool/Lager@daily-1427810489_2019.07.28.08.00.15:/dir/dir/dir/dir/dir/image_257.jpg
        datapool/Lager@daily-1427810489_2019.07.16.08.00.15:/dir/dir/dir/dir/dir/image_257.jpg
        datapool/Lager@daily-1427810489_2019.07.14.08.00.14:/dir/dir/dir/dir/dir/image_257.jpg
        datapool/Lager@daily-1427810489_2019.07.17.08.00.15:/dir/dir/dir/dir/dir/image_257.jpg
        datapool/Lager@daily-1427810489_2019.07.29.08.00.15:/dir/dir/dir/dir/dir/image_257.jpg
        datapool/Lager@daily-1427810489_2019.07.27.08.00.14:/dir/dir/dir/dir/dir/image_257.jpg
        datapool/Lager@daily-1427810489_2019.07.15.08.00.15:/dir/dir/dir/dir/dir/image_257.jpg
        datapool/Lager@daily-1427810489_2019.07.23.08.00.15:/dir/dir/dir/dir/dir/image_257.jpg
        datapool/Lager@daily-1427810489_2019.07.20.08.00.13:/dir/dir/dir/dir/dir/image_257.jpg
        datapool/Lager@daily-1427810489_2019.07.13.08.00.14:/dir/dir/dir/dir/dir/image_257.jpg
        datapool/Lager@daily-1427810489_2019.07.21.08.00.15:/dir/dir/dir/dir/dir/image_257.jpg
        datapool/Lager@daily-1427810489_2019.07.25.08.00.15:/dir/dir/dir/dir/dir/image_257.jpg
        datapool/Lager@daily-1427810489_2019.07.18.08.00.15:/dir/dir/dir/dir/dir/image_257.jpg
        datapool/Lager@daily-1427810489_2019.07.22.08.00.14:/dir/dir/dir/dir/dir/image_257.jpg
        datapool/Lager@daily-1427810489_2019.07.19.08.00.14:/dir/dir/dir/dir/dir/image_257.jpg
        datapool/Lager@daily-1427810489_2019.07.26.08.00.14:/dir/dir/dir/dir/dir/image_257.jpg
Is that normal?
 

gea

Well-Known Member
Dec 31, 2010
2,500
842
113
DE
It is very unlikely that multiple disks fail at the same time.
Most likely is a RAM problem, followed by PSU, HBA, cabling or other hardware problems.
 
  • Like
Reactions: nle

nle

Member
Oct 24, 2012
200
11
18
Ok, thank you.

How would you start to debug that issue? Do you have any go-to tips?

And how likely is it that RAM suddenly starts failing? (This is unfortunately non-ECC memory.)
 

gea

Well-Known Member
Dec 31, 2010
2,500
842
113
DE
You can either boot a test tool like memtest86, which offers a memory check, or remove half of the RAM and check whether the system is stable; if not, try the other half.
 
  • Like
Reactions: nle

nle

Member
Oct 24, 2012
200
11
18
Ok, thanks! I'm going to let the resilver finish, and then run a memtest overnight.

If I'm lucky I can get away with just replacing the memory. This is a pretty old setup anyway, so the cheaper the better.
 

thulle

New Member
Apr 11, 2019
19
9
3
Question 2: The error is in one actual file; that file exists in 17 snapshots, which makes the error show up as 17 files. Just so you don't think the file is OK in the earlier snapshots.
 
  • Like
Reactions: nle

nle

Member
Oct 24, 2012
200
11
18
Thanks.

When I check the file I get an I/O error (I was trying to use md5sum) in OmniOS. I have a working copy of the file from backup.

Additional question:
I cleared the error (it's currently resilvering), but after I clear it, things work for a while and then the errors come back. Always on the same drives.

It went from:

Code:
  pool: datapool
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Jul 30 10:39:42 2019
        5.85T scanned out of 10.0T at 277M/s, 4h24m to go
    993G resilvered, 58.24% done
config:

        NAME                         STATE     READ WRITE CKSUM
        datapool                     ONLINE       0     0     0
          raidz2-0                   ONLINE       0     0     0
            c8t5000CCA24CD8473Ad0    ONLINE       0     0     0
            spare-1                  ONLINE       0     0     0
              c8t5000CCA22BF708C6d0  ONLINE       0     0     0
              c8t5000CCA22BF5E927d0  ONLINE       0     0     0
            c8t5000CCA24CD847C6d0    ONLINE       0     0     1
            c8t5000CCA24CD847F7d0    ONLINE       0     0     0
            c8t5000CCA24CD84816d0    ONLINE       0     0     0
            c8t5000CCA24CD85170d0    ONLINE       0     0     0
        cache
          c8t5E83A97E17FC0D84d0      ONLINE       0     0     0
        spares
          c8t5000CCA22BF5E927d0      INUSE     currently in use

errors: No known data errors
To:
Code:
  pool: datapool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Jul 30 10:39:42 2019
        5.96T scanned out of 10.0T at 275M/s, 4h19m to go
    1010G resilvered, 59.27% done
config:

        NAME                         STATE     READ WRITE CKSUM
        datapool                     DEGRADED     0     0    14
          raidz2-0                   DEGRADED     0     0    56
            c8t5000CCA24CD8473Ad0    DEGRADED     0     0     0  too many errors
            spare-1                  ONLINE       0     0     0
              c8t5000CCA22BF708C6d0  ONLINE       0     0     0
              c8t5000CCA22BF5E927d0  ONLINE       0     0     0
            c8t5000CCA24CD847C6d0    DEGRADED     0     0     1  too many errors
            c8t5000CCA24CD847F7d0    ONLINE       0     0     0
            c8t5000CCA24CD84816d0    ONLINE       0     0     0
            c8t5000CCA24CD85170d0    DEGRADED     0     0     0  too many errors
        cache
          c8t5E83A97E17FC0D84d0      ONLINE       0     0     0
        spares
          c8t5000CCA22BF5E927d0      INUSE     currently in use

errors: Permanent errors have been detected in the following files:

        datapool/Lager:<0x5162f8>
Does that still point to RAM as being the issue?

That was also a very weird file to have an error in.
My plan is to run memtest once the resilvering has completed.

I'm also wondering why I have two drives under "spare-1", since I replaced one of those drives. I had to remove the replaced one manually since it did not get removed automatically; that triggered a resilver. My hope is that the zpool layout goes back to "normal" after the resilver is complete (i.e. no spare-1 with two drives).
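(If the spare does not detach by itself once the resilver completes, it can usually be returned to the spares list manually; a sketch using the device names from the status output above:)

```shell
# Detach the hot spare from the spare-1 vdev; the replacement disk
# c8t5000CCA22BF708C6d0 stays in the raidz2 and the spare goes back to AVAIL.
zpool detach datapool c8t5000CCA22BF5E927d0
```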

Appreciate all the input. It's (thankfully) been so long since my last issue that it's a bit like starting from scratch. :)
 

thulle

New Member
Apr 11, 2019
19
9
3
You seem to have a lot of checksum errors on the pool that aren't tied to one specific drive. That sounds like memory errors.
The odd filename is due to the metadata for that filesystem/folder being corrupt. Unless you have backups of all the data on the pool, I'd suggest you import it read-only, or don't import it at all until you have this figured out, to avoid permanent corruption and data loss. Remember that ZFS's data-safety guarantees don't hold if it can't trust its memory.
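A read-only import is a standard zpool import option; roughly:

```shell
# Export the pool if it is currently imported, then re-import it read-only
# so nothing (snapshots, atime updates, transaction syncs) gets written.
zpool export datapool
zpool import -o readonly=on datapool
```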
 
  • Like
Reactions: nle

thulle

New Member
Apr 11, 2019
19
9
3
Resilvering requires writing the corrected data, so I'm pretty sure you can't. But if you're having memory issues, you can't be sure you're writing correct data anyway.
 
  • Like
Reactions: nle

nle

Member
Oct 24, 2012
200
11
18
As long as nothing new is written to the pool, am I still at risk of corrupting data? The corrupted image file was two years old and, as far as I know, hadn't been accessed recently. (Sorry if this is very obvious, I just don't want to take any chances.)

What would you do? Pull the plug, and run memtest to check for memory errors?

(And I do have backup.)
 

thulle

New Member
Apr 11, 2019
19
9
3
There is write activity going on anyway: new scheduled snapshots, updates to the metadata recording when files were last read, and probably more that I'm not thinking of. The most dangerous scenarios, I think, would be a transaction written with erroneous pointers to where earlier transactions are (which would force you to import the pool state at a previous transaction number, and that can get messy), or the free-space maps being read incorrectly, so that ZFS thinks space holding actual data is free and overwrites it.
 
  • Like
Reactions: nle

nle

Member
Oct 24, 2012
200
11
18
Ok. I disabled all automatic snapshots and turned off all shares. I'm going to wait for the resilvering to finish, then boot memtest.
 

nle

Member
Oct 24, 2012
200
11
18
Yes, I guess I could. But I'm upgrading to ECC unbuffered memory with 8 GB DIMMs.

(I've mounted the datapool read-only as suggested. That way I have access to the files without risking further damage to them, and I have backups.)