Rebuild failure with an unexpected response indicating all drives are degraded


mauzilla

New Member
Aug 11, 2022
I have a 6 x 4TB disk RAIDZ2 pool.

A couple of weeks ago one drive failed, so I pulled it and sent it back to the supplier, only to have another drive start giving unrecoverable errors a couple of days later. We replaced the original drive and started an import, which ran a scrub and completed 100%; however, upon looking at the logs I get:

Pool RAIDZ2-32TB-VMBACKUPS state is DEGRADED: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
The following devices are not healthy:
  • Disk ATA WDC WD40EFRX-68N WD-WCC7K6NVUZTJ is DEGRADED
  • Disk ATA WDC WD40EFRX-68N WD-WCC7K5NZTY39 is DEGRADED
  • Disk ATA WDC WD40EFRX-68N WD-WCC7K5FR9KY4 is DEGRADED
  • Disk ATA WDC WD40EFRX-68N WD-WCC7K0EHKP5E is DEGRADED
  • Disk ATA HGST HUS726T4TAL V1GTLZ6H is DEGRADED
I am assuming the rebuild failed, likely due to the 2nd drive giving unrecoverable errors, but what I don't understand is why all of the other drives are shown as degraded. I ran a SMART test on the 1st one and it came back without any faults, so I'm not sure what to do.
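
For what it's worth, the SMART check was along these lines (the /dev/ada1 device name is just a stand-in for the actual drive):

Code:
# kick off a long self-test, then review the attributes and error log once it finishes
smartctl -t long /dev/ada1
smartctl -a /dev/ada1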

It does appear that I had some data loss, as a large amount of data is missing. Fortunately this is a backup of a backup, so a lesson learnt (just not sure what lesson yet :smile: ).

What are my next steps? Does this simply mean the entire pool is degraded beyond repair? Again, I cannot imagine all 6 drives failing (and no, it's an IT-mode HBA, not a RAID controller).
 

CyklonDX

Well-Known Member
Nov 8, 2022
How about 'zpool status'?

If all disks report a bad state and have errors on them, there's likely something wrong with either your RAM, CPU, backplane, or controller.
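
One quick way to separate cabling/backplane/controller trouble from actual disk trouble is to look at the interface CRC counters - roughly like this (device names are only examples for a FreeBSD-based TrueNAS; adjust the glob to your disk names):

Code:
# rising CRC error counts usually point at cables, backplane or controller,
# not the platters; reallocated/pending sectors point at the drives themselves
for d in /dev/da?; do
  echo "== $d =="
  smartctl -A "$d" | grep -i -e crc -e reallocated -e pending
done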
 

sko

Active Member
Jun 11, 2021
Can you give a proper output of 'zpool status -v'? That list is pretty much useless...

What happens if you issue a 'zpool clear' or 'zpool online' to individual drives?
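
Roughly what that would look like (pool name taken from the post above; the gptid is a placeholder for whichever device is listed as degraded):

Code:
# clear the error counters / fault state for a single device, or for the whole pool
zpool clear RAIDZ2-32TB-VMBACKUPS gptid/<device-id>
zpool clear RAIDZ2-32TB-VMBACKUPS
# try to bring a device that ZFS has faulted back into the pool
zpool online RAIDZ2-32TB-VMBACKUPS gptid/<device-id>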
 

mauzilla

New Member
Aug 11, 2022
Here's the output of 'zpool status -v':


Code:
root@truenas[~]# zpool status -v
  pool: RAIDZ2-32TB-VMBACKUPS
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilvered 1.69T in 08:01:20 with 178 errors on Thu Mar 16 18:01:03 2023
config:

        NAME                                            STATE     READ WRITE CKSUM
        RAIDZ2-32TB-VMBACKUPS                           DEGRADED     0     0 0
          raidz1-0                                      DEGRADED     0     0 0
            gptid/291f5a9f-8c2f-11ed-adc4-b8ac6f9a40e9  DEGRADED     0     0 0  too many errors
            gptid/2935b20a-8c2f-11ed-adc4-b8ac6f9a40e9  DEGRADED     0     0 0  too many errors
            gptid/292b6695-8c2f-11ed-adc4-b8ac6f9a40e9  DEGRADED     0     0 0  too many errors
            gptid/291064be-8c2f-11ed-adc4-b8ac6f9a40e9  DEGRADED     0     0 0  too many errors
            gptid/280891d4-8c2f-11ed-adc4-b8ac6f9a40e9  DEGRADED     0     0 0  too many errors
            gptid/8289a0ef-c3d0-11ed-b3d2-b8ac6f9a40e9  ONLINE       0     0 0

errors: Permanent errors have been detected in the following files:

        RAIDZ2-32TB-VMBACKUPS /Weekly@auto-2023-01-05_12-43:/xo-vm-backups/c83c16bb-67a2-ba87-4ed4-f2e734b9a17c/vdis/4106566b-22ff-490a-a796-75d3ce77cfd7/bc0cf564-91fc-456b-a7b3-4ec5e4a0bf95/20221224T135809Z.vhd

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:25 with 0 errors on Mon Mar 20 03:45:25 2023
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool
I will try the zpool clear commands and post shortly
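
For reference, the sequence I have in mind is roughly this (same pool name as above):

Code:
zpool clear RAIDZ2-32TB-VMBACKUPS
zpool scrub RAIDZ2-32TB-VMBACKUPS
# re-check once the scrub has finished
zpool status -v RAIDZ2-32TB-VMBACKUPS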
 

sko

Active Member
Jun 11, 2021
Aren't these the SMR WD Red drives?
Now that you mention it - the EFRX *are* SMR drives.

So in short: @mauzilla nuke the pool, throw away those drives and rebuild from backups with CMR drives.

And if you want something more flexible, A LOT faster (not only to resilver) and with much less loss to padding: DON'T use raidz, especially not as a single vdev, and especially not raidz1, where a second drive failure during the resilver (which is rather common with raidz due to the extremely long resilver times) will nuke the whole vdev/pool.
IMHO RAIDZ is only somewhat justified for very large pools and/or if you are restricted by physical space and can't use bigger drives. For small pools just use mirror vdevs - they are the most practical and safe option, especially with consumer-grade drives (which WD RED simply are...).
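
For illustration, a pool of three mirror vdevs built from six disks would be created roughly like this (pool and device names are made up; on TrueNAS you would normally do this through the GUI):

Code:
# three 2-way mirrors striped together: each mirror can lose one drive,
# and a resilver only has to read the surviving half of that mirror
zpool create tank \
  mirror da0 da1 \
  mirror da2 da3 \
  mirror da4 da5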