ZFS checksum error (on scrub) – how do I see affected files?

Discussion in 'Solaris, Nexenta, OpenIndiana, and napp-it' started by nle, Jun 30, 2019.

  1. nle

    nle Member

    Joined:
    Oct 24, 2012
    Messages:
    185
    Likes Received:
    6
    This is the first time I've ever gotten this kind of error across the board; before this I've only experienced bad drives.

    Code:
    zpool status
      pool: datapool
     state: ONLINE
    status: One or more devices has experienced an unrecoverable error.  An
            attempt was made to correct the error.  Applications are unaffected.
    action: Determine if the device needs to be replaced, and clear the errors
            using 'zpool clear' or replace the device with 'zpool replace'.
       see: http://illumos.org/msg/ZFS-8000-9P
      scan: scrub repaired 2M in 10h6m with 0 errors on Sun Jun 30 05:06:34 2019
    config:
    
            NAME                       STATE     READ WRITE CKSUM
            datapool                   ONLINE       0     0     0
              raidz2-0                 ONLINE       0     0     0
                c8t5000CCA24CD8473Ad0  ONLINE       0     0    12
                c8t5000CCA24CD84746d0  ONLINE       0     0    10
                c8t5000CCA24CD847C6d0  ONLINE       0     0    11
                c8t5000CCA24CD847F7d0  ONLINE       0     0    10
                c8t5000CCA24CD84816d0  ONLINE       0     0    13
                c8t5000CCA24CD85170d0  ONLINE       0     0     8
            cache
              c8t5E83A97E17FC0D84d0    ONLINE       0     0     0
            spares
              c8t5000CCA22BF5E927d0    AVAIL
    How do I get info about which files are affected?
     
    #1
  2. nephri

    nephri Active Member

    Joined:
    Sep 23, 2015
    Messages:
    498
    Likes Received:
    82
    At this time, no files have actually been affected.
    The checksum errors were successfully handled by your raid-z2.

    ZFS is warning you that you have a risk of failure in the future (near or far).

    Especially since all of the data disks in your pool experienced checksum errors, it could be a controller issue (cabling, ...).

    Start by performing a backup if you haven't already.

    Since you have a spare, you could try a replace and start a resilver on one disk (see the sketch below).

    The resilver process will stress the other disks and can be a nerve-wracking process: the longer it takes, the more nervous you will be. You will see whether the other disks accumulate checksum errors or not.
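    For reference, the swap to the spare is a zpool replace with the spare as the new device; a sketch using device names from your status output (double-check which disk you actually want to replace):

    Code:
    zpool replace datapool c8t5000CCA24CD84746d0 c8t5000CCA22BF5E927d0
    zpool status datapool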
     
    #2
    nle and thulle like this.
  3. thulle

    thulle New Member

    Joined:
    Apr 11, 2019
    Messages:
    13
    Likes Received:
    7
    The handling of the errors is indicated by the "scrub repaired 2M ... with 0 errors" line in your status output. If there were permanent errors, the status message would tell you to run zpool status -v and give you a message like:

    Code:
    errors: Permanent errors have been detected in the following files:

            /tank/damaged_file.iso
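    That -v flag is also the answer to your original question, since it lists the affected paths:

    Code:
    zpool status -v datapool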
     
    #3
    nle likes this.
  4. nle

    nle Member

    Joined:
    Oct 24, 2012
    Messages:
    185
    Likes Received:
    6
    Thanks guys.

    I got hung up on all the checksum errors across all the drives. I'll clear the errors and run a new scrub to double-check.

    (I do have offsite backups as of Friday night.)
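    For reference, the clear-and-scrub sequence I'm planning:

    Code:
    zpool clear datapool
    zpool scrub datapool
    zpool status -v datapool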
     
    #4
  5. nle

    nle Member

    Joined:
    Oct 24, 2012
    Messages:
    185
    Likes Received:
    6
    Quick follow-up on this one.

    I recently got a lot more errors, this time also affecting files. I'll probably do a more thorough standalone post, but I have a couple of quick questions:

    Question 1:
    How common is it for multiple drives to throw "too many errors" at roughly the same time? Do drives (same batch, same install date) usually fail together?

    As far as I can see, the SMART data looks okay.
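    (For reference, I'm reading the SMART data per disk roughly like this, assuming smartmontools is installed; the exact device syntax can vary on illumos:)

    Code:
    smartctl -a /dev/rdsk/c8t5000CCA24CD8473Ad0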

    Question 2:
    All 17 of the files with errors are the same file, just in different snapshots.

    Code:
    errors: Permanent errors have been detected in the following files:
    
            datapool/Lager@daily-1427810489_2019.07.24.08.00.14:/dir/dir/dir/dir/dir/image_257.jpg
            datapool/Lager@daily-1427810489_2019.07.28.08.00.15:/dir/dir/dir/dir/dir/image_257.jpg
            datapool/Lager@daily-1427810489_2019.07.16.08.00.15:/dir/dir/dir/dir/dir/image_257.jpg
            datapool/Lager@daily-1427810489_2019.07.14.08.00.14:/dir/dir/dir/dir/dir/image_257.jpg
            datapool/Lager@daily-1427810489_2019.07.17.08.00.15:/dir/dir/dir/dir/dir/image_257.jpg
            datapool/Lager@daily-1427810489_2019.07.29.08.00.15:/dir/dir/dir/dir/dir/image_257.jpg
            datapool/Lager@daily-1427810489_2019.07.27.08.00.14:/dir/dir/dir/dir/dir/image_257.jpg
            datapool/Lager@daily-1427810489_2019.07.15.08.00.15:/dir/dir/dir/dir/dir/image_257.jpg
            datapool/Lager@daily-1427810489_2019.07.23.08.00.15:/dir/dir/dir/dir/dir/image_257.jpg
            datapool/Lager@daily-1427810489_2019.07.20.08.00.13:/dir/dir/dir/dir/dir/image_257.jpg
            datapool/Lager@daily-1427810489_2019.07.13.08.00.14:/dir/dir/dir/dir/dir/image_257.jpg
            datapool/Lager@daily-1427810489_2019.07.21.08.00.15:/dir/dir/dir/dir/dir/image_257.jpg
            datapool/Lager@daily-1427810489_2019.07.25.08.00.15:/dir/dir/dir/dir/dir/image_257.jpg
            datapool/Lager@daily-1427810489_2019.07.18.08.00.15:/dir/dir/dir/dir/dir/image_257.jpg
            datapool/Lager@daily-1427810489_2019.07.22.08.00.14:/dir/dir/dir/dir/dir/image_257.jpg
            datapool/Lager@daily-1427810489_2019.07.19.08.00.14:/dir/dir/dir/dir/dir/image_257.jpg
            datapool/Lager@daily-1427810489_2019.07.26.08.00.14:/dir/dir/dir/dir/dir/image_257.jpg
    
    Is that normal?
     
    #5
    Last edited: Jul 30, 2019
  6. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,248
    Likes Received:
    745
    It is very unlikely that multiple disks fail at the same time.
    Most likely it is a RAM problem, followed by PSU, HBA, cabling, or other hardware problems.
     
    #6
    nle likes this.
  7. nle

    nle Member

    Joined:
    Oct 24, 2012
    Messages:
    185
    Likes Received:
    6
    Ok, thank you.

    How would you start to debug that issue? Do you have any go-to tips?

    And how likely is it that RAM suddenly starts failing? (This is unfortunately non-ECC memory.)
     
    #7
  8. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,248
    Likes Received:
    745
    You can either boot a test tool like memtest86 that offers a memory check, or remove half of the RAM and check whether the system is stable; if not, try the other half.
     
    #8
    nle likes this.
  9. nle

    nle Member

    Joined:
    Oct 24, 2012
    Messages:
    185
    Likes Received:
    6
    Ok, thanks! I'm going to let the resilver finish and then run a memtest overnight.

    If I'm lucky I can get away with just replacing the memory. This is a pretty old setup anyway, so the cheaper the better.
     
    #9
  10. thulle

    thulle New Member

    Joined:
    Apr 11, 2019
    Messages:
    13
    Likes Received:
    7
    Question 2: The error is in one actual file; that file appears in 17 snapshots, making the error show up as 17 files. Just so you don't think the file is OK in the earlier snapshots.
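    If you want to confirm it is the same corrupt blocks, you can try reading the file through the hidden .zfs directory of any snapshot; each read should fail the same way. A sketch, assuming the default mountpoint (the dir/dir path is the placeholder from your listing):

    Code:
    md5sum /datapool/Lager/.zfs/snapshot/daily-1427810489_2019.07.13.08.00.14/dir/dir/dir/dir/dir/image_257.jpg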
     
    #10
    nle likes this.
  11. nle

    nle Member

    Joined:
    Oct 24, 2012
    Messages:
    185
    Likes Received:
    6
    Thanks.

    When I check the file I get an I/O error (I was trying to use md5sum) in OmniOS. I have a working copy of the file from backup.

    Additional question:
    I cleared the error (it's currently resilvering), but after I clear it, it works for a while and then the errors come back. Always on the same drives.

    It went from:

    Code:
      pool: datapool
     state: ONLINE
    status: One or more devices is currently being resilvered.  The pool will
            continue to function, possibly in a degraded state.
    action: Wait for the resilver to complete.
      scan: resilver in progress since Tue Jul 30 10:39:42 2019
            5.85T scanned out of 10.0T at 277M/s, 4h24m to go
        993G resilvered, 58.24% done
    config:
    
            NAME                         STATE     READ WRITE CKSUM
            datapool                     ONLINE       0     0     0
              raidz2-0                   ONLINE       0     0     0
                c8t5000CCA24CD8473Ad0    ONLINE       0     0     0
                spare-1                  ONLINE       0     0     0
                  c8t5000CCA22BF708C6d0  ONLINE       0     0     0
                  c8t5000CCA22BF5E927d0  ONLINE       0     0     0
                c8t5000CCA24CD847C6d0    ONLINE       0     0     1
                c8t5000CCA24CD847F7d0    ONLINE       0     0     0
                c8t5000CCA24CD84816d0    ONLINE       0     0     0
                c8t5000CCA24CD85170d0    ONLINE       0     0     0
            cache
              c8t5E83A97E17FC0D84d0      ONLINE       0     0     0
            spares
              c8t5000CCA22BF5E927d0      INUSE     currently in use
    
    errors: No known data errors
    To:
    Code:
      pool: datapool
     state: DEGRADED
    status: One or more devices is currently being resilvered.  The pool will
            continue to function, possibly in a degraded state.
    action: Wait for the resilver to complete.
      scan: resilver in progress since Tue Jul 30 10:39:42 2019
            5.96T scanned out of 10.0T at 275M/s, 4h19m to go
        1010G resilvered, 59.27% done
    config:
    
            NAME                         STATE     READ WRITE CKSUM
            datapool                     DEGRADED     0     0    14
              raidz2-0                   DEGRADED     0     0    56
                c8t5000CCA24CD8473Ad0    DEGRADED     0     0     0  too many errors
                spare-1                  ONLINE       0     0     0
                  c8t5000CCA22BF708C6d0  ONLINE       0     0     0
                  c8t5000CCA22BF5E927d0  ONLINE       0     0     0
                c8t5000CCA24CD847C6d0    DEGRADED     0     0     1  too many errors
                c8t5000CCA24CD847F7d0    ONLINE       0     0     0
                c8t5000CCA24CD84816d0    ONLINE       0     0     0
                c8t5000CCA24CD85170d0    DEGRADED     0     0     0  too many errors
            cache
              c8t5E83A97E17FC0D84d0      ONLINE       0     0     0
            spares
              c8t5000CCA22BF5E927d0      INUSE     currently in use
    
    errors: Permanent errors have been detected in the following files:
    
            datapool/Lager:<0x5162f8>
    
    Does that still point to RAM as being the issue?

    That's also a very weird file to have an error in.
    My plan is to run memtest after the resilvering is complete.

    I'm also wondering why I have two drives under "spare-1", since I replaced one of those drives. I had to remove the replaced one since it did not get removed automatically, and that triggered a resilver. My hope is that the zpool layout goes back to "normal" after the resilver is complete (i.e., no spare-1 with two drives).
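    If it doesn't clear on its own, my understanding is that detaching the spare once the resilver completes should return it to AVAIL; a sketch:

    Code:
    zpool detach datapool c8t5000CCA22BF5E927d0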

    Appreciate all the input. It's so long between issues (thankfully) that it's a bit like starting fresh. :)
     
    #11
  12. thulle

    thulle New Member

    Joined:
    Apr 11, 2019
    Messages:
    13
    Likes Received:
    7
    You seem to have a lot of checksum errors on the pool that aren't tied to one specific drive. Sounds like memory errors.
    The odd filename is due to the metadata for that filesystem/folder being corrupt. Unless you have backups of all the data on the pool, I'd suggest you import it read-only, or don't import it at all until you have this figured out, to avoid permanent corruption and data loss. Remember that ZFS's data safety guarantees don't hold if it can't trust its memory.
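    A read-only import would look something like this (export first if the pool is currently imported):

    Code:
    zpool export datapool
    zpool import -o readonly=on datapool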
     
    #12
    nle likes this.
  13. nle

    nle Member

    Joined:
    Oct 24, 2012
    Messages:
    185
    Likes Received:
    6
    Thanks. Can I do that in the midst of resilvering?
     
    #13
  14. thulle

    thulle New Member

    Joined:
    Apr 11, 2019
    Messages:
    13
    Likes Received:
    7
    Resilvering requires writing the corrected data, so I'm pretty sure you can't. But if you're having memory issues, you can't be sure you're writing correct data anyway.
     
    #14
    nle likes this.
  15. nle

    nle Member

    Joined:
    Oct 24, 2012
    Messages:
    185
    Likes Received:
    6
    As long as nothing new is written to the pool, am I still at risk of corrupting data? The corrupted image file was two years old and not accessed recently, AFAIK. (Sorry if this is very obvious, I just don't want to take any chances.)

    What would you do? Pull the plug and run memtest to check for memory errors?

    (And I do have backups.)
     
    #15
  16. thulle

    thulle New Member

    Joined:
    Apr 11, 2019
    Messages:
    13
    Likes Received:
    7
    There is write activity going on anyway: new scheduled snapshots, updates to the metadata recording when files were last read, and probably more I'm not thinking of. The most dangerous cases, I think, would be a transaction written with erroneous pointers to where earlier transactions are, which would force you to import the pool state at a previous transaction number (and that can get messy), or the free-space maps being read erroneously, so that ZFS thinks space holding actual data is free and overwrites it.
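    If you keep the pool imported anyway, you can at least stop the last-read (atime) metadata updates; a small mitigation only, a read-only import is still safer:

    Code:
    zfs set atime=off datapool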
     
    #16
    nle likes this.
  17. nle

    nle Member

    Joined:
    Oct 24, 2012
    Messages:
    185
    Likes Received:
    6
    Ok. I disabled all automatic snapshots and turned off all shares. I'm going to wait for the resilvering, then boot up memtest.
     
    #17
  18. nle

    nle Member

    Joined:
    Oct 24, 2012
    Messages:
    185
    Likes Received:
    6
    Looks like you guys were spot on (as usual). I need to order new RAM.

    [image: memtest86 results]
     
    #18
  19. nephri

    nephri Active Member

    Joined:
    Sep 23, 2015
    Messages:
    498
    Likes Received:
    82
    You could try to isolate which DIMM is failing; you probably have only one faulty DIMM!
     
    #19
    nle likes this.
  20. nle

    nle Member

    Joined:
    Oct 24, 2012
    Messages:
    185
    Likes Received:
    6
    Yes, I guess I could. But I'm upgrading to unbuffered ECC and 8 GB DIMMs.

    (I've mounted the datapool as read-only as suggested. That way I have access to the files without risking damage to them, and I have backups.)
     
    #20