ZFS checksum error (on scrub) – how do I see affected files?


gea

A good example of:

Why do you want ECC?
- To avoid these problems.

Why do you want a filesystem with data checksums?
- To be informed about problems before serious damage occurs.

And the myth of "scrub to death" on ZFS:
- Mostly, a "too many errors" state is simply the result of an underlying hardware problem, not of the scrub itself.
 

nle

Looks like the RAM was the reason. I've put in new ECC RAM and put all the original drives back in. Everything is normal again.

Thanks for the help!
 

nle

Bumping this again.

It looks like a "permanent error" is stuck in my pool. I have restored the affected files from backup.

After running scrubs I got errors with a hex (?) reference (since the files are deleted) in some of my snapshots. I ended up destroying all snapshots to try to get rid of it and ran a scrub again, but the "permanent error" is still there in the pool.
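Roughly the commands for that cleanup (a sketch – exact snapshot names vary):
Code:
# list every snapshot under the pool and destroy them one by one
zfs list -H -o name -t snapshot -r datapool | xargs -n1 zfs destroy

# then run a fresh scrub and check the result
zpool scrub datapool
zpool status -v datapool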

Code:
# zpool status -v
  pool: datapool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub in progress since Wed Aug 28 09:34:52 2019
        7.11T scanned out of 10.0T at 357M/s, 2h22m to go
    0 repaired, 70.97% done
config:

        NAME                       STATE     READ WRITE CKSUM
        datapool                   ONLINE       0     0     2
          raidz2-0                 ONLINE       0     0     8
            c8t5000CCA24CD8473Ad0  ONLINE       0     0     0
            c8t5000CCA24CD84746d0  ONLINE       0     0     0
            c8t5000CCA24CD847C6d0  ONLINE       0     0     0
            c8t5000CCA24CD847F7d0  ONLINE       0     0     0
            c8t5000CCA24CD84816d0  ONLINE       0     0     0
            c8t5000CCA24CD85170d0  ONLINE       0     0     0
        cache
          c8t5E83A97E17FC0D84d0    ONLINE       0     0     0
        spares
          c8t5000CCA22BF5E927d0    AVAIL

errors: Permanent errors have been detected in the following files:

        datapool/Lager:<0x5162f8>
How do I get rid of the 0x5162f8 error without destroying the pool?
 

gea

Wait until the scrub has completed. The reference to the damaged files should then vanish.

Other pool errors can be cleared with zpool clear (menu Pools > clear errors).
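From the command line that is roughly the following (the napp-it menu item runs the equivalent), using the pool name from this thread:
Code:
# check scrub progress; wait for it to finish
zpool status -v datapool

# then reset the per-device error counters on the pool
zpool clear datapool

# optionally verify with another scrub
zpool scrub datapool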
 

nle

I have cleared the error (the permanent error entry clears if you start and then stop a scrub), destroyed the snapshots, and ran a new scrub. It is that last scrub that is reporting the error.

I read that a process could keep the reference active (or something along those lines), so I have now rebooted and am running a new scrub. I'll report back when it's done.
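(The start/stop trick is just this, as a sketch – the -s flag stops a running scrub:)
Code:
zpool scrub datapool       # start a scrub
zpool scrub -s datapool    # stop it again; the stale error entry is often dropped at this point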
 

thulle

From the blog post:

errors: Permanent errors have been detected in the following files:

<0x398>:<0x40229c>

[...]

Errors in Blocks Belonging to Deleted Files
The code on the right, 0x40229c, is the inode number of the deleted file.

But you have a path to the left, not a hex value:
datapool/Lager:<0x5162f8>

So, some metadata for the filesystem/folder "datapool/Lager" is permanently corrupted. You'll probably have to recreate/restore that filesystem/folder.
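A rough way to turn such an entry into something you can look up, using the values from this thread (the object number is plain hex):
Code:
# datapool/Lager:<0x5162f8>  ->  dataset "datapool/Lager", object (inode) 0x5162f8
printf '%d\n' 0x5162f8           # 5333752, the object number in decimal

# dump ZFS's view of that object, if it still exists in the dataset
zdb -dddd datapool/Lager 5333752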
 

nle

@thulle Aha, thanks, that could make sense.

This is the output of my last scrub:
Code:
zpool status -v
  pool: datapool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 9h37m with 1 errors on Thu Aug 29 01:37:00 2019
config:

        NAME                       STATE     READ WRITE CKSUM
        datapool                   ONLINE       0     0     3
          raidz2-0                 ONLINE       0     0    12
            c8t5000CCA24CD8473Ad0  ONLINE       0     0     0
            c8t5000CCA24CD84746d0  ONLINE       0     0     0
            c8t5000CCA24CD847C6d0  ONLINE       0     0     0
            c8t5000CCA24CD847F7d0  ONLINE       0     0     0
            c8t5000CCA24CD84816d0  ONLINE       0     0     0
            c8t5000CCA24CD85170d0  ONLINE       0     0     0
        cache
          c8t5E83A97E17FC0D84d0    ONLINE       0     0     0
        spares
          c8t5000CCA22BF5E927d0    AVAIL

errors: Permanent errors have been detected in the following files:

        datapool/Lager:<0x5162f8>
I did find this post about someone clearing a similar issue, but it did not work for me. I did get output from "zdb -dddd /datapool/Lager 0x5162f8", but I couldn't find anything with "find /datapool/Lager -inode 5333752 -print".

Anyhow. It looks like a scrub does not fix this.

Is there anything else I could try? Could an option be to create a new ZFS filesystem (e.g. "Lager_tmp") and move the files internally within the same pool? I don't have enough free space on the system to duplicate the data.
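(Note: the standard find primary for inode numbers is -inum rather than -inode, so if the object were an ordinary file that still existed, the lookup would presumably be something like this:)
Code:
# 0x5162f8 == 5333752; search the dataset's mountpoint for that inode number
find /datapool/Lager -inum 5333752 -print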
 

thulle

Could an option be to create a new ZFS filesystem (i.e "Lager_tmp"), and move files internally inside the same pool? I don't have enough free space on the system to duplicate the data.
That's what I meant by recreating the filesystem/folder, i.e. not recreating the whole pool.
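A minimal sketch of that, assuming a temporary name like Lager_new (moving rather than copying, so the data never has to exist twice – provided no snapshots are holding the old blocks):
Code:
# create a new dataset alongside the damaged one
zfs create datapool/Lager_new

# move the contents across; mv copies then deletes file by file,
# so space is freed as it goes (hidden files need an extra pass)
mv /datapool/Lager/* /datapool/Lager_new/

# once everything is verified, drop the damaged dataset and take over its name
zfs destroy -r datapool/Lager
zfs rename datapool/Lager_new datapool/Lager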
 

nle

Just wanted to report back. Creating a new ZFS filesystem did the trick.

Thanks for the help!
 

nle

Ok, I'm back. The story continues.

I thought everything was fine, but today I removed the old filesystem and errors started showing:

Code:
# zpool status -v
  pool: datapool
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 25h50m with 0 errors on Tue Oct  8 17:06:06 2019
config:

        NAME                       STATE     READ WRITE CKSUM
        datapool                   DEGRADED     0     0 7.51K
          raidz2-0                 DEGRADED     0     0 30.0K
            c8t5000CCA24CD8473Ad0  DEGRADED     0     0     0  too many errors
            c8t5000CCA24CD84746d0  DEGRADED     0     0     0  too many errors
            c8t5000CCA24CD847C6d0  DEGRADED     0     0     0  too many errors
            c8t5000CCA24CD847F7d0  ONLINE       0     0     0
            c8t5000CCA24CD84816d0  ONLINE       0     0     0
            c8t5000CCA24CD85170d0  DEGRADED     0     0     0  too many errors
        cache
          c8t5E83A97E17FC0D84d0    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0xffffffffffffffff>:<0x5162f8>
So creating a new ZFS filesystem, moving all files from the old one to the new one, and deleting the old filesystem did not work.

I have tried clearing the error, starting/stopping scrubs, rebooting, etc. Nothing has worked so far.

How do I fix this?

Do I have to copy all the files somewhere, destroy my pool, create a new pool and copy everything back? I could get one large drive in the spare slot, create a new ZFS pool on that one drive, copy all the data over, destroy the old pool, recreate it, and copy the data back. That should work, right?

(I removed the spare drive from the pool for the time being since it starts resilvering instantly.)

All suggestions are welcome!
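Roughly what I have in mind, as a sketch with zfs send/receive (pool and device names are illustrative):
Code:
# single-disk temporary pool on the large drive (no redundancy, so keep the offsite backup current)
zpool create tmppool <large-disk>

# replicate everything over
zfs snapshot -r datapool@evacuate
zfs send -R datapool@evacuate | zfs receive -Fdu tmppool

# destroy and recreate the main pool, then replicate the data back
zpool destroy datapool
zpool create datapool raidz2 <disk1> <disk2> <disk3> <disk4> <disk5> <disk6>
zfs snapshot -r tmppool@restore
zfs send -R tmppool@restore | zfs receive -Fdu datapool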
 

pricklypunter

How is the power supply to your disks doing?

Seeing those errors reminds me of a problem I recently fixed for someone. It was in fact the power supply to blame. The supply worked perfectly, and still does; it just couldn't keep up with the number of disks pulling down the 12V rail during any kind of heavy pool write cycle. Every time a scrub/resilver began, the disks would be fine for maybe 5-10 minutes, then start reporting errors just like this. I was able to read from the pool in bursts, but as soon as a heavy write began I would get a whole slew of checksum errors, resulting in "too many errors", degradation of the pool, and finally the pool going offline. I confirmed the issue before actually replacing the supply by quickly adding a second one I had lying around and running just the disks from it for a few hours :)
 

gea

I have seen only one case in over ten years where multiple bad disks were the reason for a similar state. That was with 3TB Seagate disks that died like flies after around two years. So the problem is most probably a non-disk hardware problem, perhaps one involving two parts, like a power supply problem that only shows up above a certain number of disks.

In my server room I would power down the machine, move all disks to another server, and run a scrub there. Then back up all data and move the pool back for troubleshooting. Then replace parts of the machine until the problem is gone, followed by a longer period of test use. The most likely parts for such problems are indeed memory, power supply, and backplane/cabling, followed by the HBA and mainboard.

If you do not have a second server, export and import the pool read-only to avoid further damage. Then take care of a backup. Once all data is secure, you can start replacing parts in the order memory, power supply, etc. After each replacement, run a zpool clear plus a scrub to check whether the problem is still there.
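The read-only step, as a sketch:
Code:
zpool export datapool
zpool import -o readonly=on datapool   # nothing can be written to the pool while you take the backup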
 

nle

Thanks for the input. This is the only system I have on hand, I'm afraid.

Anyhow, shouldn't this stem from the same issue that started this thread, since the permanent error is identical? The memory was confirmed bad and replaced with ECC memory.

Code:
errors: Permanent errors have been detected in the following files:

        <0xffffffffffffffff>:<0x5162f8>
0xffffffffffffffff = /datapool/Lager-old (which was the name I gave the old filesystem; it changed to "0xffffffffffffffff" when I destroyed it).
0x5162f8 is the same reference (to metadata?) as in the original problem.

And the individual drives show no checksum errors?
 

gea

A simple checksum error on proper hardware is no problem. If you delete a damaged file and run a scrub, everything should be OK again. Metadata is stored twice, so that should usually not be a problem either.

Your problem seems to be that:
- every scrub creates errors again, and/or
- your pool has a structural problem due to damaged metadata

The first problem can only be fixed by repairing the hardware (replacing the bad parts). The second probably only with a pool destroy/recreate and a restore from backup. But unless your hardware is OK, you will get the errors again after recreating the pool.

Assuming you have a backup: rerun a memory check, recreate the pool (optionally at first with only half of the disks), and do stress tests (fill up the pool) followed by a scrub. If the problem persists, use only onboard SATA instead of the HBA, or only the HBA instead of SATA.

Without a second set of hardware, finding the problem is not easy. If you can rule out memory, SATA/HBA problems, and backplane/cabling, the remaining parts are the PSU and the mainboard; you would need a second PSU or mainboard to rule those out.
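A sketch of such a test cycle, once the data is safely backed up and the old pool has been destroyed (pool and disk names are illustrative):
Code:
# build a small test pool from part of the disks (this erases whatever is on them)
zpool create -f testpool raidz <disk1> <disk2> <disk3>

# stress-write until the pool is nearly full, then verify every written block
dd if=/dev/urandom of=/testpool/fill.bin bs=1M
zpool scrub testpool
zpool status -v testpool    # any new CKSUM errors point at the hardware still in the test setup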
 

nle

Thanks.

Since it is reporting the same error that occurred with the bad memory, and there have been no other errors since (as far as I can tell), I think the way forward is to copy all the data onto another drive, destroy the pool, recreate it, and copy everything back – and hopefully all will be good.
 

dragonme

In my experience, the more you move data around on a machine that is throwing ZFS errors, the greater the chance you will actually corrupt something (if it hasn't happened already).

The gold-standard answer here is that you should destroy the pool, retire the disks (i.e. not use them again) and restore from a backup.. but you probably don't have one, because most people think RAIDZ is some magical unicorn backup that will never go bad..

RAIDZ is not a backup.. it's redundancy with checksums that permits self-healing, and the ability for the filesystem to stay alive if a device fails (provided the vdev is redundant).

Personally, I would move the pool to a different server, re-import it, scrub, and see if you are still getting the errors..

I have seen SATA cables bent at too sharp an angle (pinched) throw errors. I have seen non-RAID-rated (consumer) desktop drives throw errors due to vibration (and latency numbers go up even on enterprise drives when there is too much vibration). I have seen a server throw a cooling fan blade, and the vibration made it drop an array...

Never lost a single file though, and I never run RAIDZ on the production pool.. just on my backup pools.. which in almost all cases eliminates the need for expensive and temperamental cache devices, further eliminating points of failure.
 

nle

I do have a fresh backup, offsite, taken every day, and somewhat less fresh local backups.

I've run scrubs every week for as long as this pool has been alive, and it has given me errors once (hence this thread), and that was because of faulty memory (proven with memcheck). I have replaced all the memory with new ECC memory (not using ECC in the first place was my mistake).

After I replaced the memory I also restored all the affected files from backup (thankfully not many), but the remaining error is something related to ZFS pool metadata, and that will not go away easily.

I did create a new ZFS filesystem and moved all files over with no issue. I ran it like that for a couple of weeks with scrubbing; no issues on the new ZFS filesystem. Then I destroyed the old filesystem to reclaim the space and hopefully remove the error, but no go: the metadata error was still present on the system.

Since it's one error – same hex, same everything – I believe that destroying the pool (and creating a new one) will solve the issue. Or at least I hope so. If not, I guess I have to go the hardware-elimination route (or buy a brand new system, since this one is pretty old anyway).

And unfortunately I don't have a second server easily available here, so I just have to make do with what I have.
 

dragonme

An old military motto... 2 is 1 and 1 is none, and with 3 I just might make it out alive...

I.e. backup systems and full data backups are a necessity, and given the cost per TB vs the value of a TB of data, there really is no reason not to have more than one. I get nervous any time I have only one backup pool (even with redundancy) and have to, or want to, destroy the primary pool...

I usually won't destroy an old pool until I have built the new pool (larger-capacity spinning rust usually being the reason), so that during the restore I have two sets of data.. I did however have an issue with napp-it and ESXi recently where napp-it added a drive to an existing pool incorrectly, and I will likely have to destroy that pool and rebuild it. The backup is a 15-drive array of three 5-drive RAIDZ stripes.. very performant, and it can survive three drive failures during the rebuild provided there is not more than one failure in any 5-drive stripe.. the risk is low, but there is still a risk. I may consider building another pool just to back it up twice.

If you are having memory issues, remember that ZFS can only checksum what it is handed. If it's handed garbage, it will checksum the garbage and store it... and never complain. That is why ECC is a must on critical ZFS storage servers. Garbage in, garbage out.

I have never seen ZFS get corrupted in a way that it can't identify and repair, but I can see several vectors by which that might be possible.

If the pool structure and metadata are questionable, my suggestion would be to build a new pool on hardware that can be trusted (i.e. likely not the one you suspect of having issues), reconstitute the bulk of the data from a last-known-good state on the offsite pool that has no errors, then attempt to salvage whatever newer data was not part of that snapshot from the corrupted pool.. if you absolutely need it.. because in my mind that data is suspect. Remember: if corruption happened in memory, the corrupt file will be saved and ZFS checksums will show no issue with it. If the metadata or storage tables are damaged, the losses compound.