ZFS - status DEGRADED


fedoracore

New Member
Nov 29, 2022
Hello All,

My ZFS pool contains three 2TB disks, all of them about 5 years old. Two days ago I noticed that one of them was missing after the PC booted.
While checking the ZFS status, I noticed the pool is DEGRADED, although I was unable to get any more detailed information about it.

[root@bugsy tux]# zpool list
NAME   SIZE   ALLOC   FREE  CAP  DEDUP  HEALTH    ALTROOT
Tank  5.44T   1.25T  4.19T  23%  1.00x  DEGRADED  -
[root@bugsy tux]# zpool status
  pool: Tank
 state: DEGRADED
zpool: cmd/zpool/zpool_main.c:3675: status_callback: Assertion `reason == ZPOOL_STATUS_OK' failed.
Aborted (core dumped)
[root@bugsy tux]# zpool iostat -v
                                         capacity     operations    bandwidth
pool                                    alloc   free   read  write   read  write
-------------------------------------  -----  -----  -----  -----  -----  -----
Tank                                   1.25T  4.19T     20      5  2.36M  29.1K
  raidz1                               1.25T  4.19T     20      5  2.36M  29.1K
    disk/by-id/wwn-0x50014ee207137b73      -      -     11      2  1.21M  18.5K
    disk/by-id/wwn-0x50014ee0ae0b7e4e      -      -     11      2  1.22M  18.5K
    5845484241457792992                    -      -      0      0      0      0
-------------------------------------  -----  -----  -----  -----  -----  -----

I already tried to unplug and replug the cables - with no luck.

I'd appreciate your help with trying to locate the bad hard drive.
 

Bjorn Smith

Well-Known Member
Sep 3, 2019
Also, your disks might have a label on them that shows the WWN - so you can write down the numbers from the disks that are still working, shut down the server, unplug the disks one at a time, and look for the one whose number is not among the two you wrote down.
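Something like this should show the WWNs of the two members that are still alive - a rough sketch, only the pool name Tank is taken from your output:
Bash:
# wwn links the OS currently sees (skip the partition links)
ls -l /dev/disk/by-id/wwn-* | grep -v part

# the two healthy members show up by wwn here; the missing one is just a numeric guid
zpool iostat -v Tank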
 

fedoracore

New Member
Nov 29, 2022
Run
Bash:
zpool status -vLP
And post the output of that - hopefully that will show a little more info.

Thanks for your reply.
Unfortunately, it does not give any useful output:

[root@bugsy tux]# zpool status -vLP
invalid option 'L'
usage:
status [-vx] [-T d|u] [pool] ... [interval [count]]
 

Bjorn Smith

Well-Known Member
Sep 3, 2019
Thanks for your reply.
Unfortunately, it does not give any useful output:

[root@bugsy tux]# zpool status -vLP
invalid option 'L'
usage:
status [-vx] [-T d|u] [pool] ... [interval [count]]
I assumed ZoL (ZFS on Linux) - so try with just
zpool status -v - hopefully that should show you which disk is missing. And when you paste the output here, try adding 'CODE' tags around it so it gets formatted nicely.
 

fedoracore

New Member
Nov 29, 2022
I assumed ZoL (ZFS on Linux) - so try with just
zpool status -v - hopefully that should show you which disk is missing. And when you paste the output here, try adding 'CODE' tags around it so it gets formatted nicely.

Same output as with plain zpool status.
 

Bjorn Smith

Well-Known Member
Sep 3, 2019
I have never seen zpool status crash like yours does.

So I think you should unplug the drives and look for labels with the WWN numbers on them - I am certain the disks have them printed - and then you can see which disk is no longer part of the pool.

Alternatively, you unmount everything that uses the pool and unplug the drives one at a time - if the first drive you unplug does not change the status of the pool, i.e. you still have two drives in it, you are golden and have found the broken one - otherwise you continue until you find the one that is broken.
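Roughly like this between unplugs - just a sketch, and since zpool status crashes for you, these two commands are the fallback:
Bash:
# after pulling one drive: still DEGRADED with the same two wwn members left
# means the pulled drive is the dead one; if the pool drops further (UNAVAIL),
# you pulled a good one - put it back and try the next drive.
zpool list -H -o health Tank   # prints only the pool health
zpool iostat -v Tank           # shows which members are still present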
 

CyklonDX

Well-Known Member
Nov 8, 2022
Can you take a picture/snip of the terminal output from zpool status? (It's hard to read without proper formatting.)

A good disk should have no counters in READ/WRITE/CKSUM.
[Screenshot: example zpool status output with zeroed READ/WRITE/CKSUM columns]

Bad disks will have them; if I read it right, two disks in your setup have started to die or have died.
If you think the disks are good, you will need to find which is which, use the replace functionality on them, and then run zpool clear.
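Roughly like this - just a sketch, the wwn paths are placeholders and Tank is the pool name from the first post:
Bash:
# swap a failing member for a new disk and let it resilver
zpool replace Tank /dev/disk/by-id/wwn-0xOLDDISK /dev/disk/by-id/wwn-0xNEWDISK

# once the resilver finishes and the pool looks healthy, reset the error counters
zpool clear Tank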
 

Bjorn Smith

Well-Known Member
Sep 3, 2019
Bad disks will have them; if I read it right, two disks in your setup have started to die or have died.
No - what he wrote was that he had a pool with 3 disks - and now zpool status only shows two of them.

Which is really strange, since usually ZFS should show the last disk as offline - but because zpool status crashes, it's hard to know - which is why I suggested that he pulls the disks one at a time, looking for the WWN label.

Also this is not true:
A good disk should have no counters in READ/WRITE/CKSUM.
You can have a good disk with counters - if you have memory issues, which I had. But yes, as a general rule - if you have 0's across the board, everything should be peachy.
 

fedoracore

New Member
Nov 29, 2022
No - what he wrote was that he had a pool with 3 disks - and now zpool status only shows two of them.

Which is really strange, since usually ZFS should show the last disk as offline - but because zpool status crashes, it's hard to know - which is why I suggested that he pulls the disks one at a time, looking for the WWN label.

Also this is not true:

A good disk should have no counters in READ/WRITE/CKSUM.
You can have a good disk with counters - if you have memory issues, which I had. But yes, as a general rule - if you have 0's across the board, everything should be peachy.

My apologies for the late response - I was quite sick for a few days.
I succeeded in finding the "bad" hard drive, and just bought a new 2TB HD.

What is the correct process for replacing a bad HD with a new one in ZFS?

Thanks in advance.
 

fedoracore

New Member
Nov 29, 2022
I found this article, which seems to make it pretty easy:


Although, as you may remember, the degraded HD was NOT listed when performing zpool status, so I'm not sure the replace command would work.
 

Bjorn Smith

Well-Known Member
Sep 3, 2019
Normally you would do:
Bash:
zpool replace pool old_device new_device
But as @fedoracore said, the old degraded device was not even being shown - so perhaps it was somehow never truly part of the pool, or perhaps it was just a spare device.

So if you have removed the old disk from the system, then show what

Bash:
zpool status
gives you.
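If the old device still does not show up by name, it can usually be referenced by the numeric GUID that zpool iostat -v printed for it - a sketch only, the new wwn path below is a placeholder:
Bash:
# 5845484241457792992 is the guid shown for the missing member in the first post
zpool replace Tank 5845484241457792992 /dev/disk/by-id/wwn-0xNEWDISK
zpool status Tank    # watch the resilver (assuming status works again by then)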
 

fedoracore

New Member
Nov 29, 2022
I'm 100% sure I had the 3 disks configured as RAIDZ1.
Notice the screenshot below: SIZE 5.44T for three 2TB HDs.

[Screenshot: zpool list output showing SIZE 5.44T]

@Bjorn Smith
Still the same DEGRADED output after one HD was physically removed :(

[Screenshot: terminal output, pool still DEGRADED]
 

Bjorn Smith

Well-Known Member
Sep 3, 2019
Very strange - I have never seen the zpool binary crash like yours does.

If you can remember the WWN name, you should be able to do a zpool replace.

If not, then you might have to inspect the pool with zdb.

What OS are you running zfs on?

You could try to update your OS, to see if the crash bug has been fixed. It should be safe for the pool, but it might make it easier to fix whatever is wrong.
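Roughly like this - a sketch, assuming a dnf-based distro with ZFS installed from the OpenZFS repository:
Bash:
zfs version             # userland and kernel-module versions
dnf upgrade 'zfs*'      # pull in a newer zfs / zfs-dkms build
reboot                  # load the updated module
zpool status Tank       # check whether the assertion is gone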
 

fedoracore

New Member
Nov 29, 2022
I'm running Fedora 35.
I've tried installing new updates, but still the same broken output.

I'm starting to think this troubleshooting is not worth the effort.
Perhaps it would be easier to copy all the data to an external drive and reconfigure the ZFS pool with the 3 hard drives.
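If I go that route, I guess the copy itself would look roughly like this - the external-disk path is just a placeholder, and nothing gets destroyed until the copy is verified:
Bash:
# create a single-disk pool on the external drive (placeholder wwn)
zpool create backup /dev/disk/by-id/wwn-0xEXTERNALDISK

# snapshot everything and replicate it to the backup pool
zfs snapshot -r Tank@migrate
zfs send -R Tank@migrate | zfs receive -F backup/Tank

# only after verifying the copy:
# zpool destroy Tank
# zpool create Tank raidz1 <wwn-1> <wwn-2> <wwn-3>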

In case you have a better idea - please let me know :)
 

Stephan

Well-Known Member
Apr 21, 2017
Suggestions:

1) Copy off all data as backup to be on the safe side.

2) WWNs known to the system can be found here: ls -la /dev/disk/by-id/wwn-*

Ignore the -part1... partition links, just look at the ../../sdX targets. Use hdparm -i /dev/sdX to get the serial numbers (first line, SerialNo=...). Note the two working disks. Power off, check labels. (A small sketch follows below.)

3) Those zpool errors are scary. To debug, check your hostid (see the thread titled "zpool status" errors).

If you are using fuse-zfs (ugh) then maybe dump that and switch to the archzfs packages on Arch if you are flexible with distributions, or use TrueNAS, or if you want to stay with Fedora, maybe switch to simple md RAID-5. If your distribution isn't following OpenZFS promptly, i.e. it doesn't offer a ZFS 2.1.7 dkms package by now, I'd dump it.

Edit: If you stay with ZFS, always activate the ZED daemon and let it send you mails if something happens. Scrub pools. I scrub monthly, and want to see the results. Again, ZED will send them, even if no errors were detected.
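A rough sketch of step 2 - device names are placeholders, and the ZED bits assume the stock OpenZFS file layout:
Bash:
# map each wwn link to its kernel device and serial number
for d in /dev/disk/by-id/wwn-*; do
  case "$d" in *-part*) continue ;; esac   # skip partition links
  dev=$(readlink -f "$d")                  # e.g. /dev/sda
  echo "$d -> $dev"
  hdparm -i "$dev" | grep -i serial        # Model=..., FwRev=..., SerialNo=...
done

# ZED mails: set ZED_EMAIL_ADDR in /etc/zfs/zed.d/zed.rc, then
systemctl enable --now zfs-zed
zpool scrub Tank                           # run this regularly, e.g. monthly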
 

fedoracore

New Member
Nov 29, 2022
Well, it appears that the SATA cable was bad; after replacing it, the HD is back alive. I spent way too many hours because of this!!
Now the ZFS status is back to ONLINE, but still with the same broken output:

[Screenshot: terminal output showing the pool ONLINE and the same zpool status error]

Any idea what could be done to mitigate this?
 

Bjorn Smith

Well-Known Member
Sep 3, 2019
Well, it appears that the SATA cable was bad; after replacing it, the HD is back alive. I spent way too many hours because of this!!
Now the ZFS status is back to ONLINE, but still with the same broken output:
Any idea what could be done to mitigate this?
I am happy that it seems like your data is okay - but I would not trust that zfs installation because it crashes when doing a zpool status.

So if I were you, and I had a place to copy the data off to, I would copy the data to a safe place, nuke the pool, and see if a new pool behaved the same way - and if it did, I would reinstall ZFS, and if that did not help, do a fresh OS installation.

And if a fresh OS installation still behaves the same with the same disks, something is probably off somewhere - either with the disks, the disk controller, or something else with the motherboard/RAM/CPU.

But now that your data seems to be back, I would copy it off to a backup location - preferably ZFS.