ZFS - status DEGRADED


fedoracore

New Member
Nov 29, 2022
Hello All,

My ZFS pool contains three 2TB disks, all of them about 5 years old. Two days ago I noticed that one of them was missing after the PC booted.
While checking the ZFS status, I noticed the pool is DEGRADED, although I was unable to get any more detailed information about it.

[root@bugsy tux]# zpool list
NAME   SIZE   ALLOC   FREE  CAP  DEDUP  HEALTH    ALTROOT
Tank  5.44T   1.25T  4.19T  23%  1.00x  DEGRADED  -
[root@bugsy tux]# zpool status
  pool: Tank
 state: DEGRADED
zpool: cmd/zpool/zpool_main.c:3675: status_callback: Assertion `reason == ZPOOL_STATUS_OK' failed.
Aborted (core dumped)
[root@bugsy tux]# zpool iostat -v
                                         capacity     operations    bandwidth
pool                                    alloc   free   read  write   read  write
-------------------------------------  -----  -----  -----  -----  -----  -----
Tank                                   1.25T  4.19T     20      5  2.36M  29.1K
  raidz1                               1.25T  4.19T     20      5  2.36M  29.1K
    disk/by-id/wwn-0x50014ee207137b73      -      -     11      2  1.21M  18.5K
    disk/by-id/wwn-0x50014ee0ae0b7e4e      -      -     11      2  1.22M  18.5K
    5845484241457792992                    -      -      0      0      0      0
-------------------------------------  -----  -----  -----  -----  -----  -----

I already tried to unplug and replug the cables - with no luck.

I'd appreciate your help with trying to locate the bad hard drive.
 

Bjorn Smith

Well-Known Member
Sep 3, 2019
Also, your disks might have a label on them that shows the WWN - so you can write down the numbers from the disks that are still working, shut down the server, unplug the disks one at a time, and look for the one whose number is not among the two you wrote down.
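Something like this should show the WWNs of the two members that are still alive - a rough sketch, only the pool name Tank is taken from your output:
Bash:
# wwn links the OS currently sees (skip the partition links)
ls -l /dev/disk/by-id/wwn-* | grep -v part

# the two healthy members show up by wwn here; the missing one is just a numeric guid
zpool iostat -v Tank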
 

fedoracore

New Member
Nov 29, 2022
Run
Bash:
zpool status -vLP
And post the output of that - hopefully that will show a little more info.

Thanks for your reply.
Unfortunately, it does not give any useful output:

[root@bugsy tux]# zpool status -vLP
invalid option 'L'
usage:
status [-vx] [-T d|u] [pool] ... [interval [count]]
 

Bjorn Smith

Well-Known Member
Sep 3, 2019
Thanks for your reply.
Unfortunately, it does not give any useful output:

[root@bugsy tux]# zpool status -vLP
invalid option 'L'
usage:
status [-vx] [-T d|u] [pool] ... [interval [count]]
I assumed ZoL (ZFS on Linux) - so try with just
zpool status -v - hopefully that should show you which disk is missing. And when you paste the output here, try adding 'CODE' tags around it so it gets formatted nicely.
 

fedoracore

New Member
Nov 29, 2022
I assumed ZoL (ZFS on Linux) - so try with just
zpool status -v - hopefully that should show you which disk is missing. And when you paste the output here, try adding 'CODE' tags around it so it gets formatted nicely.

Same output as with plain zpool status.
 

Bjorn Smith

Well-Known Member
Sep 3, 2019
I have never seen zpool status crash like yours does.

So I think you should unplug the drives and look for labels with the WWN numbers on them - I am certain the disks have them printed - and then you can see which disk is no longer part of the pool.

Alternatively, you unmount everything that uses the pool and unplug the drives one at a time - if the first drive you unplug does not change the status of the pool, i.e. you still have two drives in it, you are golden and have found the broken one - otherwise you continue until you find the one that is broken.
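Roughly like this between unplugs - just a sketch, and since zpool status crashes for you, these two commands are the fallback:
Bash:
# after pulling one drive: still DEGRADED with the same two wwn members left
# means the pulled drive is the dead one; if the pool drops further (UNAVAIL),
# you pulled a good one - put it back and try the next drive.
zpool list -H -o health Tank   # prints only the pool health
zpool iostat -v Tank           # shows which members are still present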
 

CyklonDX

Well-Known Member
Nov 8, 2022
Can you take a picture/snip of the terminal output from zpool status? (It's hard to read without proper formatting.)

A good disk should have no counters in READ/WRITE/CKSUM.
[Screenshot: example zpool status output with zeroed READ/WRITE/CKSUM columns]

Bad disks will have them; if I read it right, two disks in your setup have started to die or have died.
If you think the disks are good, you will need to find which is which, use the replace functionality on them, and then run zpool clear.
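Roughly like this - just a sketch, the wwn paths are placeholders and Tank is the pool name from the first post:
Bash:
# swap a failing member for a new disk and let it resilver
zpool replace Tank /dev/disk/by-id/wwn-0xOLDDISK /dev/disk/by-id/wwn-0xNEWDISK

# once the resilver finishes and the pool looks healthy, reset the error counters
zpool clear Tank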
 

Bjorn Smith

Well-Known Member
Sep 3, 2019
Bad disks will have them; if I read it right, two disks in your setup have started to die or have died.
No - what he wrote was that he had a pool with 3 disks - and now zpool status only shows two of them.

Which is really strange, since usually ZFS should show the last disk as offline - but because zpool status crashes, it's hard to know - which is why I suggested that he pulls the disks one at a time, looking for the WWN label.

Also this is not true:
A good disk should have no counters in READ/WRITE/CKSUM.
You can have a good disk with counters - if you have memory issues, which I had. But yes, as a general rule - if you have 0's across the board, everything should be peachy.
 

fedoracore

New Member
Nov 29, 2022
No - what he wrote was that he had a pool with 3 disks - and now zpool status only shows two of them.

Which is really strange, since usually ZFS should show the last disk as offline - but because zpool status crashes, it's hard to know - which is why I suggested that he pulls the disks one at a time, looking for the WWN label.

Also this is not true:

A good disk should have no counters in READ/WRITE/CKSUM.
You can have a good disk with counters - if you have memory issues, which I had. But yes, as a general rule - if you have 0's across the board, everything should be peachy.

My apologies for the late response - I was quite sick for a few days.
I succeeded in finding the "bad" hard drive, and just bought a new 2TB HD.

What is the correct process for replacing a bad HD with a new one in ZFS?

Thanks in advance.
 

fedoracore

New Member
Nov 29, 2022
I found this article, which seems to make it pretty easy:


Although, as you may remember, the degraded HD was NOT listed when performing zpool status, so I'm not sure the replace command would work.
 

Bjorn Smith

Well-Known Member
Sep 3, 2019
Normally you would do:
Bash:
zpool replace pool old_device new_device
But as @fedoracore said, the old degraded device was not even being shown - so perhaps it was somehow never truly part of the pool, or perhaps it was just a spare device.

So if you have removed the old disk from the system, then show what

Bash:
zpool status
gives you.
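If the old device still does not show up by name, it can usually be referenced by the numeric GUID that zpool iostat -v printed for it - a sketch only, the new wwn path below is a placeholder:
Bash:
# 5845484241457792992 is the guid shown for the missing member in the first post
zpool replace Tank 5845484241457792992 /dev/disk/by-id/wwn-0xNEWDISK
zpool status Tank    # watch the resilver (assuming status works again by then)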
 

fedoracore

New Member
Nov 29, 2022
I'm 100% sure I had the 3 disks configured as RAIDZ1.
Notice the screenshot below: SIZE 5.44T for three 2TB HDs.

[Screenshot: zpool list output showing SIZE 5.44T]

@Bjorn Smith
Still the same DEGRADED output after one HD was physically removed :(

[Screenshot: terminal output, pool still DEGRADED]
 

Bjorn Smith

Well-Known Member
Sep 3, 2019
Very strange - I have never seen the zpool binary crash like yours does.

If you can remember the WWN name, you should be able to do a zpool replace.

If not, then you might have to inspect the pool with zdb.

What OS are you running zfs on?

You could try to update your OS, to see if the crash bug has been fixed. It should be safe for the pool, but it might make it easier to fix whatever is wrong.
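Roughly like this - a sketch, assuming a dnf-based distro with ZFS installed from the OpenZFS repository:
Bash:
zfs version             # userland and kernel-module versions
dnf upgrade 'zfs*'      # pull in a newer zfs / zfs-dkms build
reboot                  # load the updated module
zpool status Tank       # check whether the assertion is gone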
 

fedoracore

New Member
Nov 29, 2022
I'm running Fedora 35.
I've tried installing new updates, but still the same broken output.

I'm starting to think this troubleshooting is not worth the effort.
Perhaps it would be easier to copy all the data to an external drive and reconfigure the ZFS pool with the 3 hard drives.
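If I go that route, I guess the copy itself would look roughly like this - the external-disk path is just a placeholder, and nothing gets destroyed until the copy is verified:
Bash:
# create a single-disk pool on the external drive (placeholder wwn)
zpool create backup /dev/disk/by-id/wwn-0xEXTERNALDISK

# snapshot everything and replicate it to the backup pool
zfs snapshot -r Tank@migrate
zfs send -R Tank@migrate | zfs receive -F backup/Tank

# only after verifying the copy:
# zpool destroy Tank
# zpool create Tank raidz1 <wwn-1> <wwn-2> <wwn-3>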

In case you have a better idea - please let me know :)
 

Stephan

Well-Known Member
Apr 21, 2017
Suggestions:

1) Copy off all data as backup to be on the safe side.

2) WWNs known to the system can be found here: ls -la /dev/disk/by-id/wwn-*

Ignore the -part1... partition links, just look at the ../../sdX targets. Use hdparm -i /dev/sdX to get the serial numbers (first line, SerialNo=...). Note the two working disks. Power off, check labels. (A small sketch follows below.)

3) Those zpool errors are scary. To debug, check your hostid (see the thread titled "zpool status" errors).

If you are using fuse-zfs (ugh) then maybe dump that and switch to the archzfs packages on Arch if you are flexible with distributions, or use TrueNAS, or if you want to stay with Fedora, maybe switch to simple md RAID-5. If your distribution isn't following OpenZFS promptly, i.e. it doesn't offer a ZFS 2.1.7 dkms package by now, I'd dump it.

Edit: If you stay with ZFS, always activate the ZED daemon and let it send you mails if something happens. Scrub pools. I scrub monthly, and want to see the results. Again, ZED will send them, even if no errors were detected.
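A rough sketch of step 2 - device names are placeholders, and the ZED bits assume the stock OpenZFS file layout:
Bash:
# map each wwn link to its kernel device and serial number
for d in /dev/disk/by-id/wwn-*; do
  case "$d" in *-part*) continue ;; esac   # skip partition links
  dev=$(readlink -f "$d")                  # e.g. /dev/sda
  echo "$d -> $dev"
  hdparm -i "$dev" | grep -i serial        # Model=..., FwRev=..., SerialNo=...
done

# ZED mails: set ZED_EMAIL_ADDR in /etc/zfs/zed.d/zed.rc, then
systemctl enable --now zfs-zed
zpool scrub Tank                           # run this regularly, e.g. monthly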
 

fedoracore

New Member
Nov 29, 2022
Well, it appears that the SATA cable was bad; after replacing it, the HD is back alive. I spent way too many hours because of this!!
Now the ZFS status is back to ONLINE, but still with the same broken output:

[Screenshot: terminal output showing the pool ONLINE and the same zpool status error]

Any idea what could be done to mitigate this?
 

Bjorn Smith

Well-Known Member
Sep 3, 2019
Well, it appears that the SATA cable was bad; after replacing it, the HD is back alive. I spent way too many hours because of this!!
Now the ZFS status is back to ONLINE, but still with the same broken output:
Any idea what could be done to mitigate this?
I am happy that it seems like your data is okay - but I would not trust that zfs installation because it crashes when doing a zpool status.

So if I were you, and I had a place to copy the data off to, I would copy the data to a safe place, nuke the pool, and see if a new pool behaved the same way - and if it did, I would reinstall ZFS, and if that did not help, do a fresh OS installation.

And if a fresh OS installation still behaves the same with the same disks, something is probably off somewhere - either with the disks, the disk controller, or something else with the motherboard/RAM/CPU.

But now that your data seems to be back, I would copy it off to a backup location - preferably ZFS.