Pool Degraded - Help!

Discussion in 'Solaris, Nexenta, OpenIndiana, and napp-it' started by ZzBloopzZ, May 25, 2019.

  1. ZzBloopzZ

    ZzBloopzZ Member

    Joined:
    Jan 7, 2013
    Messages:
    91
    Likes Received:
    13
    Hello,

I noticed my file server acting really slow the last few weeks but was too busy to investigate until now. Napp-it could not even retrieve anything from the Disks or Pools sections; it would keep loading and loading, even an hour later. Finally, I remembered to SSH into OmniOS directly and run "zpool status". I have attached my results. I'm running
OmniOS v11 r151022, and the 10x 3TB drives are connected to two LSI 9211-8i HBA cards.

I have the pool set up with 2 spares. Now I am worried: from the screenshot, does it seem that 4 drives are bad, or just the 2 faulted ones? What exactly should I do next? Replace all 4 drives, or just the 2 faulted/degraded ones? Right now I'm thinking of backing up important data to a few spare USB external drives I have and then focusing on troubleshooting/fixing the pool. Appreciate any help please!

    Thank you!
     

    Attached Files:

    #1
    Last edited: May 25, 2019
  2. redeamon

    redeamon Member

    Joined:
    Jun 10, 2018
    Messages:
    96
    Likes Received:
    27
    Hey there, we need more info to help. What types of drives, size, etc. and a full "zpool status -v" would help. Have you run any scrubs?
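For anyone following along, the equivalent console commands on OmniOS would be roughly the following (pool name "tank" is a placeholder, not from the thread):

```shell
# Placeholder pool name "tank"; substitute your own pool.
zpool status -v tank   # full status, per-device error counters, damaged files
iostat -En             # Solaris-family per-device soft/hard/transport error counts
zpool scrub tank       # start a scrub to verify every block's checksum
```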
     
    #2
  3. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,303
    Likes Received:
    761
It is quite unlikely that three disks fail all together. I have only seen two cases where this happened. One was with Seagate 3TB disks that died like flies after three years. The other was a server room where the air conditioning failed over a weekend and some disks died from overtemperature.

At the moment you have three damaged files (unrepairable, with checksum errors) and two faulted disks. If they are really dead, the next disk failure means the pool is lost. If that happens and a disk comes back, the pool reverts to degraded or online.

    What I would do:
Check disk temperatures and SMART status (menu Disks) and back up the most important data. Then power off and check cables; maybe a bad data or power connector is the reason. Then power on and do a menu Pools > Clear to clear the errors.

To completely clear the error, you must delete the damaged files and start a scrub. If the disks come back and the reason remains unclear, do an intensive disk test with SeaTools, WD Data Lifeguard, or a similar tool.
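Sketched as console commands (pool name "tank" and the file path are placeholders, not from the thread):

```shell
# zpool status -v lists the paths of permanently damaged files at the end.
zpool status -v tank
# Delete each damaged file it reports (placeholder path):
rm "/tank/path/to/damaged-file"
# Reset the error counters, then scrub so ZFS re-verifies everything:
zpool clear tank
zpool scrub tank
```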
     
    #3
    Last edited: May 25, 2019
  4. ZzBloopzZ

    ZzBloopzZ Member

    Joined:
    Jan 7, 2013
    Messages:
    91
    Likes Received:
    13
Firstly, thank you everyone for the support. I ended up deleting the three corrupt files as they were not important. Then I updated napp-it, and finally the menus were working again. I identified two drives that seemed to have many errors under the Disks menu, so I replaced those two with the last 2 spares I had. I also blew the dust out of the server and unplugged and firmly replugged all power and SATA/HBA cable connections.

I then cleared the errors with "zpool clear tank" and replaced the two drives in napp-it, so the pool is now resilvering. After replacing the two drives, I am finally able to run smartinfo in napp-it, which I was unable to do before. It shows "!Failed" for two drives. Does that mean they are bad as well? Strangely, one drive has many errors but its SMART passed, while the two that failed do not show any errors under S, H, and T. SMART results are attached.

    FS Smart.PNG

Also, one of the drives I replaced had hundreds of errors under H and T, while the other had 50-60 H and a few T. The pool is currently resilvering:

    Code:
    pool: pool30tb
     state: DEGRADED
    status: One or more devices is currently being resilvered.  The pool will
            continue to function, possibly in a degraded state.
    action: Wait for the resilver to complete.
      scan: resilver in progress since Sat May 25 21:01:36 2019
        131G scanned out of 22.6T at 100M/s, 65h12m to go
        24.9G resilvered, 0.57% done
    config:
    
            NAME                         STATE     READ WRITE CKSUM
            pool30tb                     DEGRADED     0     0     0
              raidz2-0                   DEGRADED     0     0     0
                c3t5000CCA37EC13F7Cd0    ONLINE       0     0     0
                c3t5000CCA37EC1C4B1d0    ONLINE       0     0     0
                c3t5000CCA37EC1C4E4d0    ONLINE       0     0     0
                c3t5000CCA37EC1CD01d0    ONLINE       0     0     2
                c3t5000CCA37EC1EB05d0    ONLINE       0     0     0
                c3t5000CCA37EC1ED1Cd0    ONLINE       0     0     0
                replacing-6              UNAVAIL      0     0     0
                  c3t5000CCA37EC3F74Ed0  UNAVAIL      0     0     0  cannot open
                  c3t5000039FF4E7B7E5d0  ONLINE       0     0     0  (resilvering)
                c3t5000039FF4E7BF5Fd0    ONLINE       0     0     0
                replacing-8              UNAVAIL      0     0     0
                  c3t5000CCA37EC21035d0  UNAVAIL      0     0     0  cannot open
                  c3t5000039FF4E7B791d0  ONLINE       0     0     0  (resilvering)
                c3t5000CCA37EC2292Bd0    ONLINE       0     0     0
    
    errors: No known data errors
    
      pool: rpool
     state: ONLINE
      scan: none requested
    config:
    
            NAME        STATE     READ WRITE CKSUM
            rpool       ONLINE       0     0     0
              c2t0d0s0  ONLINE       0     0     0
    
    errors: No known data errors
Although the system is still resilvering, it is already responding much quicker. I know for a fact that one of the drives I pulled was bad. I am going to connect them to my computer and run advanced HD diagnostics on them to verify whether they are defective.
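For reference, the two replacements visible above as replacing-6 and replacing-8 correspond to commands of roughly this form (old, unavailable disk first, new disk second; device names taken from the status output):

```shell
# Each replace creates a temporary "replacing-N" vdev until the resilver finishes.
zpool replace pool30tb c3t5000CCA37EC3F74Ed0 c3t5000039FF4E7B7E5d0
zpool replace pool30tb c3t5000CCA37EC21035d0 c3t5000039FF4E7B791d0
```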
     
    #4
    Last edited: May 25, 2019
  5. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,303
    Likes Received:
    761
If you click on the serial number in menu Disks > Smart, you can see the detailed SMART log. If a disk reports SMART failed, I would replace it and at least do an intensive disk test (WD Data Lifeguard or similar).
     
    #5
  6. ZzBloopzZ

    ZzBloopzZ Member

    Joined:
    Jan 7, 2013
    Messages:
    91
    Likes Received:
    13
Update: the drives are still resilvering, and I have been monitoring them randomly throughout the week. As of this morning, it is showing most of the drives as 'DEGRADED'. I do know for a fact that the two drives I originally pulled and replaced with brand-new drives are indeed bad, as I ran SMART on them in another machine with WD Data Lifeguard Diagnostics. They fail quickly in both the short and extended tests.

    Here is the current zpool status:

Also, I have attached napp-it Disks and SmartInfo screenshots. There are tons of errors on two drives, but according to SMART, one of those error drives is bad while another drive it flags as bad does not show any errors.

I'm worried. At this point I guess I have to wait for the two newest drives I installed to finish resilvering, then plan to replace two more drives with new ones. What a nightmare... it looks like a perfect storm that all these drives are failing, granted they are old. The server was running perfectly quick 6 weeks ago. It's not even used much, and it sits in a cool basement with plenty of airflow.

    Should I just wait patiently or is there anything else I should do?
     

    Attached Files:

    #6
  7. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,303
    Likes Received:
    761
If disks are known to be bad, fixing the problem is easy: just replace them. If the disk state is unclear, it can be a bad disk, a repairable disk with bad sectors, the backplane, power, or an expander with SATA disks, where a bad disk can block or confuse the expander.

The best method to be sure is an external test, e.g. via WD Data Lifeguard (full test). SMART is a good indicator when it reports failed. If you use non-destructive tests, you can shut down the server and test the disks prior to resilvering. Such a test may also repair or remap bad sectors, which may help a RAID resilver.
     
    #7
  8. ZzBloopzZ

    ZzBloopzZ Member

    Joined:
    Jan 7, 2013
    Messages:
    91
    Likes Received:
    13
    Is there a command to save all directory and file names in the pool to a text file? I googled around and could not find a solution. :c/
     
    #8
  9. pricklypunter

    pricklypunter Well-Known Member

    Joined:
    Nov 10, 2015
    Messages:
    1,546
    Likes Received:
    442
I don't know Solaris/OmniOS all that well, but maybe ls -R > your.txt, or install tree, if that's available for that OS, and use that to format the output how you like?
     
    #9
  10. EffrafaxOfWug

    EffrafaxOfWug Radioactive Member

    Joined:
    Feb 12, 2015
    Messages:
    1,097
    Likes Received:
    364
One very simple way of doing that without installing tree would be to run something like:
Code:
find / > /somedir/all_my_files_and_dirs.txt
I'm not 100% sure what Solaris' find options are like compared to the GNU find I'm used to, but I'm fairly certain the above should work anywhere.
     
    #10
  11. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,303
    Likes Received:
    761
You can use different find implementations on OmniOS; locate them with
    find / -name find

    /usr/bin/find
    /usr/xpg4/bin/find
    /usr/gnu/bin/find
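Any of these will do for dumping a full listing. A minimal, portable sketch, demonstrated on a throwaway directory (in practice you would point find at the pool's mountpoint instead):

```shell
# Build a small throwaway tree so the example is self-contained.
demo=$(mktemp -d)
out=$(mktemp)
mkdir -p "$demo/docs"
touch "$demo/docs/a.txt" "$demo/b.txt"
# One path per line: the directory itself, the subdir, and both files.
find "$demo" -print > "$out"
wc -l < "$out"   # counts the 4 paths found
```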
     
    #11
  12. ZzBloopzZ

    ZzBloopzZ Member

    Joined:
    Jan 7, 2013
    Messages:
    91
    Likes Received:
    13
    Hello,

I wanted to post an update. The pool has now been resilvering for over 5 weeks, and other drives have started resilvering as well. I have a strong feeling I'm SOL now. Network transfer speeds have been so slow since the beginning that I can't copy any files off.

Is there a way I could "undo" the two disks I originally replaced and put them back in the pool, then try to replace the two worst drives instead? The reason: when I originally pulled two drives, I did it based on the number of errors, since the SMART feature was not working/loading in napp-it at the time, even after reboots. So I replaced 2 drives and let them resilver. Afterwards, I updated napp-it, and that is when the SMART feature started to work. At the same time, the two drives I pulled fully fail SMART with WD diagnostics: almost instantly in the short test, and eventually in the long test (one of them failed right away). I have a strong feeling I am screwed here overall, huh?
     
    #12
  13. pricklypunter

    pricklypunter Well-Known Member

    Joined:
    Nov 10, 2015
    Messages:
    1,546
    Likes Received:
    442
    I have nothing new to offer you, but I certainly feel your pain and understand where you're coming from.

I really wish ZFS had a freeze-state option: something that just freezes the state of the pool and does nothing more than mount it, warts and all, in read-only mode. At least that way there's a chance of pulling what data is left intact off a pool. All this re-silvering that can't be stopped etc. is total bollox. I get it, the pool is trying to self-heal, and if your hardware is still intact there's a very good chance it will succeed in doing just that, but there comes a point, especially when disks are failing and you have already lost pool redundancy, that there is going to be data loss no matter how many re-silvers it goes through. In that scenario there's little point in re-silvering anything: just stop the bus, park it in read-only mode and retrieve what you can. The rest can come from the latest back-up that you, hopefully, were sensible enough to take.

I love ZFS and the features it brings, it's beyond amazing, but it's absolutely awful at helping you get your data off a pool when things begin to break. In fact, I would say it is more likely to actively obstruct that process than help you with it, such is the self-healing nature of the beast :)

Oh, and I think you are SOL getting your pool back, if my experience of failed pools is anything to go by...
     
    #13
  14. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,303
    Likes Received:
    761
As long as no more than two disks fail, the pool is degraded but intact. If a third disk fails, the pool is offline until one of the three disks comes back.

If the resilver lasts very long, this indicates that at least one disk is not fully dead but is very slow to read/write, e.g. due to bad sectors. Check iostat, e.g. in menu System > Basic Statistics > Disk. In a good pool, b% and w% should be similar on all pool disks. If one disk shows bad wait or busy values, you should remove it unless it is essential for pool redundancy.
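From the console, the same numbers come from iostat's extended statistics, e.g. three 5-second samples:

```shell
# %w (wait) and %b (busy) columns: a disk sitting far above its
# siblings on these is the one dragging the resilver.
iostat -xn 5 3
```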

If one or both of the disks being replaced are the problem, simply remove them, optionally back up data first, or insert a new disk and restart the replace (original unavail > new).

If suddenly more disks show problems, such as too many errors on a disk while another is resilvering, power down and do a RAM check, e.g. via memtest86, as this often indicates RAM problems, especially when not using ECC RAM.

If a third disk fails and you still have the two (formerly unavail) disks, and they work after a WD Data Lifeguard check (without the destructive test), you can try to insert them and see if the pool comes back at least to a degraded state.

To freeze the pool state, export the pool and import it read-only. A scrub can be cancelled to reduce load, although it is a low-priority process and should not be a problem. A replace can be cancelled, at least by removing the "replace with" disk.
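As commands (pool name pool30tb and the example device name are taken from the status output earlier in the thread):

```shell
# Freeze: export, then re-import read-only so nothing gets written.
zpool export pool30tb
zpool import -o readonly=on pool30tb
# Cancel a running scrub:
zpool scrub -s pool30tb
# Cancel a replace by detaching the "replace with" disk:
zpool detach pool30tb c3t5000039FF4E7B7E5d0
```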
     
    #14
  15. pricklypunter

    pricklypunter Well-Known Member

    Joined:
    Nov 10, 2015
    Messages:
    1,546
    Likes Received:
    442
Except when it is, or was previously, re-silvering, as that process cannot be stopped once it has begun. There should be an off switch for it. In the event of a last-ditch attempt at data retrieval, all file-system housekeeping, self-healing or otherwise, should be halted and the pool mounted read-only, warts and all, but that's only my opinion :)

Read-only mode in ZFS only prevents the user from interacting with the filesystem; as far as I can tell, it doesn't stop ZFS happily doing its thing in the background, which often means frustrating attempts to retrieve data from failing disks.

None of this has anything to do with your napp-it of course; the issue lies with ZFS being far too clever for its own good, believing that all data loss/corruption is fixable and attempting to fix it ad infinitum :D
     
    #15
  16. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,303
    Likes Received:
    761
    ZFS is right to do so.

The problem behind the endless repair is a disk or other hardware. As long as a disk does not finally fail (i.e. it answers within the disk timeout, 60s per default), ZFS continues the repair effort even if you get only a byte per minute. You can fix this with a lower timeout, e.g. the 7s most disks in a hardware RAID use, or by simply unplugging that disk.
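On drives that support SCT ERC, that 7s recovery limit can be set with smartctl. A sketch: the device path is one from this thread, and whether the drive accepts it (or needs a -d device-type option) varies:

```shell
# 70 deciseconds = 7.0 s for read and write error recovery;
# drives without SCT ERC support will reject this command.
smartctl -l scterc,70,70 /dev/rdsk/c3t5000CCA37EC13F7Cd0
```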

ZFS is unbreakable by design. If there is a chance (and no bug within ZFS), any problem can be repaired, unless it is not repairable at all. In that case you have a disaster (like fire, lightning, theft, amok hardware, or more disks failing than the redundancy level allows), for which you need a backup.
     
    #16