ZFS checksum error (on scrub) – how do I see affected files?

Discussion in 'Solaris, Nexenta, OpenIndiana, and napp-it' started by nle, Jun 30, 2019.

  1. shanester

    This happened to me a couple of weeks ago. It turned out to be a bad DIMM.
     
    #21
  2. gea

    A good example of:

    Why do you want ECC?
    - To avoid problems like this.

    Why do you want a filesystem with data checksums?
    - To be informed about problems before serious damage occurs.

    The myth about "scrub to death" on ZFS:
    - Mostly, a "too many errors" state is simply the result of an underlying hardware fault that the scrub uncovers, not something the scrub itself causes.
     
    #22
    Last edited: Jul 31, 2019
  3. nle

    Looks like the RAM was the reason. I've put in new ECC RAM and reconnected all the original drives. Everything is normal again.

    Thanks for the help!
     
    #23
  4. nle

    Bumping this again.

    It looks like a "permanent error" is stuck in my pool. I have restored the affected files from backup.

    After running scrubs I got errors with a hex (?) reference (since the file is deleted) in some of my snapshots. I ended up destroying all the snapshots to try to get rid of it and ran a scrub again, but the "permanent error" is still there in the pool.
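    For reference, listing and destroying every snapshot in a pool can be done roughly like this (a sketch only, and destructive, so review the list before piping it into zfs destroy). Here is the current status after doing that:
    Code:
    # list every snapshot in the pool
    zfs list -H -t snapshot -o name -r datapool
    # then destroy them one by one
    zfs list -H -t snapshot -o name -r datapool | xargs -n1 zfs destroy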

    Code:
    # zpool status -v
      pool: datapool
     state: ONLINE
    status: One or more devices has experienced an error resulting in data
            corruption.  Applications may be affected.
    action: Restore the file in question if possible.  Otherwise restore the
            entire pool from backup.
       see: http://illumos.org/msg/ZFS-8000-8A
      scan: scrub in progress since Wed Aug 28 09:34:52 2019
            7.11T scanned out of 10.0T at 357M/s, 2h22m to go
        0 repaired, 70.97% done
    config:
    
            NAME                       STATE     READ WRITE CKSUM
            datapool                   ONLINE       0     0     2
              raidz2-0                 ONLINE       0     0     8
                c8t5000CCA24CD8473Ad0  ONLINE       0     0     0
                c8t5000CCA24CD84746d0  ONLINE       0     0     0
                c8t5000CCA24CD847C6d0  ONLINE       0     0     0
                c8t5000CCA24CD847F7d0  ONLINE       0     0     0
                c8t5000CCA24CD84816d0  ONLINE       0     0     0
                c8t5000CCA24CD85170d0  ONLINE       0     0     0
            cache
              c8t5E83A97E17FC0D84d0    ONLINE       0     0     0
            spares
              c8t5000CCA22BF5E927d0    AVAIL
    
    errors: Permanent errors have been detected in the following files:
    
            datapool/Lager:<0x5162f8>
    How do I get rid of the 0x5162f8 error without destroying the pool?
     
    #24
  5. gea

    Wait until the scrub is completed.
    The reference to the damaged files should then vanish.

    Other pool errors can be cleared with a zpool clear (menu Pools > clear errors).
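    On the command line this is roughly (a sketch, using the pool name datapool from the status output above):
    Code:
    # clear the pool's error counters, then verify with a fresh scrub
    zpool clear datapool
    zpool scrub datapool
    zpool status -v datapool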
     
    #25
  6. nle

    I have cleared the error (the permanent error clears if you start and then stop a scrub), destroyed the snapshots and ran a new scrub. It is that last scrub that is reporting the error.

    I read that a process could "keep" the reference active (or something along those lines), so I have now tried a reboot and am running a new scrub. I'll report back when it's done.
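    (For reference, the start-then-stop trick is roughly the following; a sketch using the same pool name:)
    Code:
    # start a scrub, then cancel it; the stale error reference was dropped afterwards
    zpool scrub datapool
    zpool scrub -s datapool
    zpool status -v datapool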
     
    #26
  7. thulle

    From the blog post:

    errors: Permanent errors have been detected in the following files:

    <0x398>:<0x40229c>

    [...]

    Errors in Blocks Belonging to Deleted Files
    The code on the right, 0x40229c, is the inode number of the deleted file.

    But you have a path to the left, not a hex value:
    datapool/Lager:<0x5162f8>

    So, some metadata for the filesystem/folder "datapool/Lager" is permanently corrupted. You'll probably have to recreate/restore that filesystem/folder.
     
    #27
  8. nle

    @thulle Aha, thanks, that could make sense.

    This is the output of my last scrub:
    Code:
    zpool status -v
      pool: datapool
     state: ONLINE
    status: One or more devices has experienced an error resulting in data
            corruption.  Applications may be affected.
    action: Restore the file in question if possible.  Otherwise restore the
            entire pool from backup.
       see: http://illumos.org/msg/ZFS-8000-8A
      scan: scrub repaired 0 in 9h37m with 1 errors on Thu Aug 29 01:37:00 2019
    config:
    
            NAME                       STATE     READ WRITE CKSUM
            datapool                   ONLINE       0     0     3
              raidz2-0                 ONLINE       0     0    12
                c8t5000CCA24CD8473Ad0  ONLINE       0     0     0
                c8t5000CCA24CD84746d0  ONLINE       0     0     0
                c8t5000CCA24CD847C6d0  ONLINE       0     0     0
                c8t5000CCA24CD847F7d0  ONLINE       0     0     0
                c8t5000CCA24CD84816d0  ONLINE       0     0     0
                c8t5000CCA24CD85170d0  ONLINE       0     0     0
            cache
              c8t5E83A97E17FC0D84d0    ONLINE       0     0     0
            spares
              c8t5000CCA22BF5E927d0    AVAIL
    
    errors: Permanent errors have been detected in the following files:
    
            datapool/Lager:<0x5162f8>
    I did find a post about someone clearing a similar issue, but it did not work for me. I did get output from "zdb -dddd /datapool/Lager 0x5162f8", but I couldn't find anything with "find /datapool/Lager -inode 5333752 -print".
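    For reference, that lookup written out (a sketch; note that the standard find option for inode numbers is -inum, and 5333752 is simply 0x5162f8 in decimal):
    Code:
    # dump the object's dnode from the dataset to see what it refers to
    zdb -dddd datapool/Lager 0x5162f8
    # then search the mounted filesystem for a file with that inode number
    find /datapool/Lager -inum 5333752 -print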

    Anyhow. It looks like a scrub does not fix this.

    Is there anything else I could try? Could an option be to create a new ZFS filesystem (e.g. "Lager_tmp") and move the files internally within the same pool? I don't have enough free space on the system to duplicate the data.
     
    #28
    Last edited: Aug 29, 2019
  9. thulle

    That's what I meant by recreating the filesystem/folder, i.e. not recreating the whole pool.
     
    #29
  10. nle

    Just wanted to report back. Creating a new ZFS filesystem did the trick.

    Thanks for the help!
     
    #30
  11. nle

    Ok, I'm back. The story continues.

    I thought everything was fine, but today I removed the old filesystem and errors started showing:

    Code:
    # zpool status -v
      pool: datapool
     state: DEGRADED
    status: One or more devices has experienced an error resulting in data
            corruption.  Applications may be affected.
    action: Restore the file in question if possible.  Otherwise restore the
            entire pool from backup.
       see: http://illumos.org/msg/ZFS-8000-8A
      scan: scrub repaired 0 in 25h50m with 0 errors on Tue Oct  8 17:06:06 2019
    config:
    
            NAME                       STATE     READ WRITE CKSUM
            datapool                   DEGRADED     0     0 7.51K
              raidz2-0                 DEGRADED     0     0 30.0K
                c8t5000CCA24CD8473Ad0  DEGRADED     0     0     0  too many errors
                c8t5000CCA24CD84746d0  DEGRADED     0     0     0  too many errors
                c8t5000CCA24CD847C6d0  DEGRADED     0     0     0  too many errors
                c8t5000CCA24CD847F7d0  ONLINE       0     0     0
                c8t5000CCA24CD84816d0  ONLINE       0     0     0
                c8t5000CCA24CD85170d0  DEGRADED     0     0     0  too many errors
            cache
              c8t5E83A97E17FC0D84d0    ONLINE       0     0     0
    
    errors: Permanent errors have been detected in the following files:
    
            <0xffffffffffffffff>:<0x5162f8>
    
    So, creating a new ZFS filesystem, moving all files from old to new, deleting the old ZFS filesystem, did not work.

    I have tried clearing the error, starting/stopping scrubs, rebooting, etc. Nothing has worked so far.

    How do I fix this?

    Do I have to copy all files somewhere, destroy my pool, create a new pool and copy back? I could get one large drive in the spare slot, create a new ZFS pool on that one drive, copy all the data, destroy the old pool, recreate it, and copy back. That should work, right?
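    In outline, that shuffle would look something like this (a sketch only; "temppool" and the spare-slot device name are placeholders, and the single-disk pool holds the only local copy while datapool is destroyed, so the offsite backup matters):
    Code:
    # create a temporary single-disk pool on the large drive (device name is a placeholder)
    zpool create temppool c8tXXXXXXXXXXXXXXXXd0
    # replicate everything to the temporary pool
    zfs snapshot -r datapool@migrate
    zfs send -R datapool@migrate | zfs receive -dF temppool
    # after verifying the copy: destroy and recreate datapool, then send the data back
    zpool destroy datapool
    zpool create datapool raidz2 c8t5000CCA24CD8473Ad0 c8t5000CCA24CD84746d0 c8t5000CCA24CD847C6d0 c8t5000CCA24CD847F7d0 c8t5000CCA24CD84816d0 c8t5000CCA24CD85170d0
    zfs snapshot -r temppool@restore
    zfs send -R temppool@restore | zfs receive -dF datapool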

    (I removed the spare drive from the pool for the time being since it starts resilvering instantly.)

    All suggestions are welcome!
     
    #31
  12. pricklypunter

    How's your power supply to your disks doing?

    Seeing those errors reminds me of a problem I recently fixed for someone. It was in fact the power supply to blame. The supply worked perfectly, and still does; it just couldn't keep up with the number of disks pulling down the 12V rail during any kind of heavy pool write cycle. Every time a scrub/resilver began, the disks would be fine for maybe 5-10 minutes, then start reporting errors just like this. I was able to read from the pool in bursts, but as soon as a heavy write began I would get a whole slew of checksum errors, resulting in "too many errors" and degradation of the pool, then finally the pool going offline. I confirmed the issue before actually replacing the supply by quickly adding a second one I had lying around and running just the disks from it for a few hours :)
     
    #32
  13. gea

    I have seen only one case in over ten years where multiple bad disks were the reason for a similar state. That was with 3TB Seagate disks that died like flies after around two years. So the problem is most probably a non-disk hardware problem, maybe one involving the interaction of two parts, like a power supply that only struggles once a certain number of disks is active.

    In my server room I would power down the machine, move all disks to another server and run a scrub there. Then back up all data and move the pool back for troubleshooting. Then replace parts of the machine until the problem is gone, followed by a longer test period. The most likely parts for such problems are indeed memory, power supply and backplane cabling, followed by the HBA and mainboard.

    If you do not have a second server, export the pool and re-import it read-only to avoid further damage. Then take care of backups. Once all data is secure, you can start replacing parts in the order memory, power supply, etc. After each replacement, run a zpool clear plus a scrub to check whether the problem is still there.
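    A minimal sketch of the read-only import, assuming the pool can be taken offline for a moment:
    Code:
    # export the pool, then re-import it read-only so nothing further is written to it
    zpool export datapool
    zpool import -o readonly=on datapool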
     
    #33
  14. nle

    Thanks for the input. This is the only system I have on hand, I'm afraid.

    Anyhow, shouldn't this be from the same issue that started this thread, since the permanent error is identical? The memory was confirmed bad and replaced with ECC memory.

    Code:
    errors: Permanent errors have been detected in the following files:
    
            <0xffffffffffffffff>:<0x5162f8>
    
    0xffffffffffffffff = /datapool/Lager-old (which was the name I gave the old filesystem; it changed to "0xffffffffffffffff" when I destroyed it).
    0x5162f8 is the same reference (to metadata?) as in the original problem.

    And the drives show no checksum errors?
     
    #34
    Last edited: Oct 9, 2019
  15. gea

    A simple checksum error on proper hardware is no problem. If you delete a damaged file and run a scrub, everything should be OK again. Metadata is stored twice, so usually this should not be a problem either.

    Your problem seems to be that
    - every scrub creates errors again, and/or
    - your pool has a structural problem due to damaged metadata.

    The first problem can only be fixed by repairing the hardware (replacing the bad parts). The second probably only with a pool destroy/recreate and a restore from backup. But unless your hardware is OK, you will get the errors again after recreating the pool.

    As I assume you have a backup: rerun a memory check, recreate the pool (optionally first with only half of the disks) and do stress tests (fill up the pool), followed by a scrub. If the problem persists, use only onboard SATA instead of the HBA, or only the HBA instead of SATA.

    Without a second set of hardware, finding the problem is not easy. If you can rule out memory, SATA/HBA and backplane/cabling problems, the remaining parts are the PSU and mainboard. You would need a second PSU or mainboard to rule those out.
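    Such a stress test could look roughly like this (a sketch; the dataset name, the default mountpoint and the amount of data written are arbitrary):
    Code:
    # fill a scratch filesystem with random data, then verify the pool with a scrub
    zfs create datapool/stresstest
    dd if=/dev/urandom of=/datapool/stresstest/fill.bin bs=1024k count=100000
    zpool scrub datapool
    zpool status -v datapool
    zfs destroy datapool/stresstest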
     
    #35
    Last edited: Oct 9, 2019
  16. nle

    Thanks.

    Since it is reporting the same error that occurred with the bad memory, and no other errors since (as far as I can tell), I think the way forward is to copy all data onto another drive, delete the pool, recreate the pool and copy back, and hopefully all will be good.
     
    #36
  17. dragonme

    In my experience, the more you move data around on a machine that is throwing ZFS errors, the greater the chance you will actually corrupt something (if it hasn't happened already).

    The gold-standard answer here is that you should destroy the pool, destroy the disks (i.e. not use them again) and restore from a backup. But you probably don't have one, because most people think that RAIDZ is some magical unicorn backup that will never go bad.

    RAIDZ is not a backup. It is redundancy with checksums that permits self-healing, or the ability for a filesystem to remain alive if a device fails (provided the vdev is redundant).

    Personally, I would move the pool to a different server, re-import it, scrub, and see if you are still getting the errors.

    I have seen SATA cables bent at too sharp an angle (pinched) throw errors. I have seen non-RAID-rated (consumer) desktop drives throw errors due to vibration (and latency numbers go up even on enterprise drives when there is too much vibration). I have seen a server throw a cooling fan blade, and the resulting vibration made the server drop an array.

    I have never lost a single file, though, and I never run RAIDZ on the production pool, just on my backup pools. That eliminates, in almost all cases, the need for expensive and temperamental cache devices, which further eliminates points of failure.
     
    #37
  18. nle

    I do have a fresh backup, offsite, taken every day, plus less fresh local backups.

    I've run scrubs every week for as long as this pool has been alive, and it has given me errors once (hence this thread), and that was because of faulty memory (proven with memcheck). I have replaced all the memory with new ECC memory (not using ECC in the first place was my mistake).

    After I replaced the memory I also restored all the affected files from backup (thankfully not many), but the remaining error is something related to ZFS pool metadata, and that will not go away easily.

    I did try creating a new ZFS filesystem and moved all the files over with no issue. I ran it like that for a couple of weeks with scrubbing, and saw no issues on the new ZFS filesystem. Then I destroyed the old filesystem to reclaim the space and hopefully remove the error, but no go: the metadata error was still present on the system.

    Since it's one error, same hex reference, same everything, that leads me to believe that destroying the pool (and creating a new one) will solve the issue. Or at least I hope so. If not, I guess I have to go the hardware-elimination route (or buy a brand new system, since this one is pretty old anyway).

    And unfortunately I don't have a second server easily available to me here, so I just have to make do with what I have.
     
    #38
  19. dragonme

    An old military motto: two is one, one is none, and with three I just might make it out alive.

    I.e., backup systems and full data backups are a necessity, and given the cost per TB versus the value of a TB of data, there really is no reason not to have more than one. I get nervous any time I have only one backup pool (even with redundancy) and have to, or want to, destroy the primary pool.

    I usually won't destroy an old pool until I have built the new pool (larger-capacity spinning rust is usually the reason), so that during the restore I have two sets of data. I did, however, recently have an issue with napp-it and ESXi where napp-it added a drive to an existing pool incorrectly, and I will likely have to destroy that pool and rebuild it. The backup is a 15-drive array of three 5-drive RAIDZ vdevs: very performant, and it can survive three drive failures during the rebuild provided there is no more than one per 5-drive vdev. The risk is low, but there is still a risk. I may consider building another pool just to back it up twice.

    If you are having memory issues, remember that ZFS can only checksum what it is handed. If it is handed garbage, it will checksum the garbage and store it, and never complain. That is why ECC is a must on critical ZFS storage controllers. Garbage in, garbage out.

    I have never seen ZFS get corrupted in a way that it can't identify and repair, but I can see several vectors by which that might be possible.

    If the pool structure and metadata are questionable, my suggestion would be to build a new pool on hardware that can be trusted (i.e. likely not the one you suspect of having issues), reconstitute the bulk of the data from a last known good state on the offsite pool that has no errors, then salvage whatever newer data was not part of that snapshot from the corrupted pool, if you absolutely need it, because in my mind that data is suspect. Remember: if corruption happened in memory, the corrupt file will be saved and ZFS checksums will show no issue with it. If the metadata or storage tables are damaged, the loss gets exponential.
     
    #39