ZFS faulted drive disappeared after upgrade to 2.0.0

lancethepants · Dec 8, 2020

I have a pool of 4x 4TB drives in 2 mirrored vdevs. Last week one of my drives faulted and the pool became degraded. I ordered a replacement drive to fix the pool ASAP. In the meantime I upgraded to 0.8.5, and afterwards it attempted on its own to resilver the existing bad drive, which still faulted. Then ZFS 2.0.0 came out, so I upgraded to that. Once upgraded, it again attempted on its own to resilver the drive, but this time it says it was successful. The replacement drives arrived today, but the upgraded pool has now been operating seemingly with no errors since the upgrade to 2.0.0 several days ago.
So now I'm not sure what to think. Since it resilvered that drive that's basically a scrub as well isn't it? Should I now run a scrub and see if it results in any errors? I just don't know whether I should be trusting of this drive or if I should go ahead and replace it.

thanks

Evan · Dec 9, 2020

Run a scrub and watch the SMART data. But I assume you already checked the SMART data on the drive and it's generally looking healthy? If not, do that as well now.

dandanio · Dec 10, 2020

Wait, you upgraded a degraded array? Who does that? I hope that was not a production dataset. Also, how are your backups?

lancethepants · Dec 11, 2020

Well, I ran a scrub and it didn't find any issues or errors. SMART says the self-evaluation PASSED. I did a short test, but maybe I'll try a long one.
SMART does tell me about some of the previous errors the disk had. It looks like some of them happened "at disk power-on".

=== START OF INFORMATION SECTION ===Model Family: Seagate Barracuda 3.5D - Pastebin.com

Always have another backup. An upgraded degraded pool will just be an upgraded pool that's a degraded pool. Except not in my case, but whether it was chance or maybe something that zfs 2.0.0 handles better, I can't say yet.

Tinkerer · Dec 13, 2020

Well, by the looks of it, it seems to me you're completely in the dark about whether you can rely on that disk and on your pool in general.

Out of interest, which distro are you running?

What brand/type are the disks? This is important because some manufacturers use smart stats for their own purposes.
You say the drive was degraded, but nowhere do I see you give a reason for it. I assume you investigated, but what did you find? Did it occur out of the blue, or did something crash? Was there anything at all that might have caused it?
You say there were SMART errors before, but what were they?
Were those SMART errors on the disk that failed in the pool?
Did you compare SMART stats from before and after the scrub, and did any of them increase?
Were there any SATA errors in dmesg?
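One way to do the before/after comparison is to save the attribute table from `smartctl -A` before and after the scrub and diff the raw values. The following is a minimal sketch under that assumption. The two sample tables contain made-up values for illustration and are not taken from the drives in this thread, and the function names are hypothetical.

```python
# Sketch: compare raw SMART attribute values from two saved `smartctl -A` outputs.
# The sample tables below are invented for illustration only.

def parse_attributes(text):
    """Return {attribute_name: raw_value} for rows with a plain integer raw value."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and carry a hex flag in column 3.
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            raw = fields[9]
            if raw.isdigit():
                attrs[fields[1]] = int(raw)
    return attrs


def changed(before, after):
    """Return {name: (old, new)} for attributes whose raw value differs."""
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and after[name] != before[name]
    }


BEFORE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       3
"""
AFTER = BEFORE.replace("-       3", "-       7")

print(changed(parse_attributes(BEFORE), parse_attributes(AFTER)))
```

A rising CRC error counter usually points at the cable or port rather than the platters, which is why it is worth looking at which counter moved and not only whether one did.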

You run mirrors, so at best 2 drives (one in each mirror) can fail before you lose the pool, but if the dice roll against you and the second failure hits the same mirror, you lose the entire pool.
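To put numbers on that, the two-disk failure cases for a pool of two 2-way mirrors can simply be enumerated. This is a small sketch, not a reliability model: the disk names are placeholders, and it assumes every pair of disks is equally likely to fail.

```python
from itertools import combinations

# Placeholder layout matching the thread: two 2-way mirror vdevs.
POOL = {
    "mirror-0": {"disk-a", "disk-b"},
    "mirror-1": {"disk-c", "disk-d"},
}


def pool_survives(failed):
    """A mirror survives while one member is healthy; the pool needs every vdev."""
    return all(members - failed for members in POOL.values())


disks = sorted(set().union(*POOL.values()))
pairs = list(combinations(disks, 2))
fatal = [pair for pair in pairs if not pool_survives(set(pair))]

print(f"{len(fatal)} of {len(pairs)} two-disk failures lose the pool")
```

Under those assumptions, two of the six possible pairs take the pool down, namely the two pairs that sit inside the same mirror.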

To know whether you're good, I think a few more answers are needed to clear up the current situation.

Did you create your pool using disk-by-id or /dev/sda, /dev/sdb, etc.? If you used disk-by-id, did you record the serial of the disk that failed? Do you still know which one it was? If you used /dev/sda, /dev/sdb, etc., ZFS won't get confused, but you might: the order in which the disks are initialized during boot can change from one boot to another. That's not likely, but it can happen, especially if one of the drives is intermittently faulting.
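If the by-id names are in use, the mapping from a serial-bearing name to the current /dev/sdX node can be read from the symlinks in /dev/disk/by-id. A minimal sketch, assuming a Linux system with udev-managed by-id links; the function name and the "ata-" filter are illustrative choices, not something from the thread.

```python
import os


def by_id_map(by_id_dir="/dev/disk/by-id"):
    """Map whole-disk 'ata-...' names to the device node they currently point at."""
    mapping = {}
    for name in sorted(os.listdir(by_id_dir)):
        # Skip partition links such as '...-part1' and non-ATA aliases (wwn-, scsi-).
        if name.startswith("ata-") and "-part" not in name:
            mapping[name] = os.path.realpath(os.path.join(by_id_dir, name))
    return mapping


if __name__ == "__main__":
    if os.path.isdir("/dev/disk/by-id"):
        for name, device in by_id_map().items():
            print(f"{name} -> {device}")
```

Recording this output while the pool is healthy gives a reference for which physical drive to pull later, even if the sdX letters shift between boots.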

lancethepants · Dec 14, 2020

@Tinkerer
That's right, I have no idea what to think at this point.

I am running Debian 10.
2x Seagate ST4000DM000
2x Seagate ST4000DM005

All I have to go off of is the pastebin link I put in my last post that has the SMART output. I just got an email from my server that one of my disks had faulted and the pool had degraded some time in the night. I saw it when I woke up. The disks shouldn't have been on any workload at the time. After the scrub I didn't get any more smart errors, and the scrub didn't return a single error either. Upon discovering the pool had degraded I looked at dmesg, and it did show that particular disk, /dev/sde, had some sata errors. I've rebooted since then and nothing appears now.

One interesting thing I just saw. My tower is just an old HP workstation from a university. I'm not sure all my SATA ports are equal on this thing. That particular drive is labeled as "removable". All drives are internal of course, but that port is colored differently, like it was meant to be used for a hot-swap disk.

Code:

[    1.759075]  sda: sda1 sda2 sda3
[    1.759300] sd 0:0:0:0: [sda] Attached SCSI disk
[    1.804448]  sdc: sdc1 sdc9
[    1.804872] sd 2:0:0:0: [sdc] Attached SCSI disk
[    1.815379]  sdb: sdb1 sdb9
[    1.815747] sd 1:0:0:0: [sdb] Attached SCSI disk
[    1.818675]  sde: sde1 sde9
[    1.818891] sd 4:0:0:0: [sde] Attached SCSI removable disk
[    1.821600]  sdd: sdd1 sdd9
[    1.821898] sd 3:0:0:0: [sdd] Attached SCSI disk
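For anyone who wants to pick out the odd port programmatically, the "Attached SCSI ... disk" lines can be filtered for the removable flag. This is a rough sketch that works on the dmesg lines pasted above; the regular expression and function name are illustrative. On a live Linux system the same flag should also be readable from /sys/block/<device>/removable.

```python
import re

# Attach lines copied from the dmesg excerpt above.
DMESG = """\
[    1.759300] sd 0:0:0:0: [sda] Attached SCSI disk
[    1.804872] sd 2:0:0:0: [sdc] Attached SCSI disk
[    1.815747] sd 1:0:0:0: [sdb] Attached SCSI disk
[    1.818891] sd 4:0:0:0: [sde] Attached SCSI removable disk
[    1.821898] sd 3:0:0:0: [sdd] Attached SCSI disk
"""

ATTACH = re.compile(r"sd (\d+):\d+:\d+:\d+: \[(\w+)\] Attached SCSI (removable )?disk")


def removable_disks(log):
    """Return (device, scsi_host) for every disk the kernel flagged as removable."""
    return [
        (match.group(2), int(match.group(1)))
        for match in ATTACH.finditer(log)
        if match.group(3)
    ]


print(removable_disks(DMESG))
```

Here only sde, on SCSI host 4, carries the flag, which matches the differently colored port.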

Not all my disks seem to be running at 6.0 Gb/s. Maybe I ought to get an HBA card.

Code:

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Desktop HDD.15
Device Model:     ST4000DM000-1F2168
Serial Number:    Z304ZAED
LU WWN Device Id: 5 000c50 086f33029
Firmware Version: CC54
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5900 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Dec 14 07:42:45 2020 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

lance@MediaBox2:~$ sudo smartctl -i /dev/sdc
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-13-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Desktop HDD.15
Device Model:     ST4000DM000-1F2168
Serial Number:    Z301JHRX
LU WWN Device Id: 5 000c50 0669314b5
Firmware Version: CC54
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5900 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Mon Dec 14 07:42:47 2020 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

lance@MediaBox2:~$ sudo smartctl -i /dev/sdd
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-13-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 3.5
Device Model:     ST4000DM005-2DP166
Serial Number:    ZDH183YW
LU WWN Device Id: 5 000c50 0a293aaab
Firmware Version: 0001
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5980 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Mon Dec 14 07:42:49 2020 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

lance@MediaBox2:~$ sudo smartctl -i /dev/sde
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-13-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 3.5
Device Model:     ST4000DM005-2DP166
Serial Number:    ZDH1B3DW
LU WWN Device Id: 5 000c50 0a2a9896c
Firmware Version: 0001
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5980 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Mon Dec 14 07:42:50 2020 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
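The negotiated link speed can be pulled out of the "SATA Version is" line of each `smartctl -i` output instead of reading the blocks by eye. A minimal sketch: the four lines are copied from the outputs above and keyed by serial number, while the regular expression and function name are illustrative.

```python
import re

SATA_LINE = re.compile(
    r"SATA Version is:\s+SATA [\d.]+, ([\d.]+) Gb/s \(current: ([\d.]+) Gb/s\)"
)


def link_speeds(smartctl_info):
    """Return (supported, current) link speed in Gb/s, or None if the line is absent."""
    match = SATA_LINE.search(smartctl_info)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


# "SATA Version is" lines from the four outputs above, keyed by drive serial.
SAMPLES = {
    "Z304ZAED": "SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)",
    "Z301JHRX": "SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)",
    "ZDH183YW": "SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)",
    "ZDH1B3DW": "SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)",
}

downgraded = [
    serial
    for serial, text in SAMPLES.items()
    if link_speeds(text)[1] < link_speeds(text)[0]
]
print(downgraded)
```

Three of the four drives negotiated 3.0 Gb/s, which suggests those ports are simply 3.0 Gb/s ports rather than that anything is wrong with the drives.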

Tinkerer · Dec 14, 2020

I'm on mobile right now.

What is the output of zpool status?

lancethepants · Dec 14, 2020

Code:

  pool: storage
 state: ONLINE
  scan: scrub repaired 0B in 06:48:37 with 0 errors on Thu Dec 10 17:50:26 2020
config:

        NAME                                 STATE     READ WRITE CKSUM
        storage                              ONLINE       0     0     0
          mirror-0                           ONLINE       0     0     0
            ata-ST4000DM000-1F2168_Z304ZAED  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_Z301JHRX  ONLINE       0     0     0
          mirror-1                           ONLINE       0     0     0
            ata-ST4000DM005-2DP166_ZDH183YW  ONLINE       0     0     0
            ata-ST4000DM005-2DP166_ZDH1B3DW  ONLINE       0     0     0

errors: No known data errors
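A status like this can also be checked by a script, flagging any device that is not ONLINE or has a non-zero error counter. The sketch below parses the config table exactly as pasted above; the function name is hypothetical and the parsing assumes the five-column layout shown here.

```python
# Config table copied from the zpool status output above.
STATUS = """\
config:

        NAME                                 STATE     READ WRITE CKSUM
        storage                              ONLINE       0     0     0
          mirror-0                           ONLINE       0     0     0
            ata-ST4000DM000-1F2168_Z304ZAED  ONLINE       0     0     0
            ata-ST4000DM000-1F2168_Z301JHRX  ONLINE       0     0     0
          mirror-1                           ONLINE       0     0     0
            ata-ST4000DM005-2DP166_ZDH183YW  ONLINE       0     0     0
            ata-ST4000DM005-2DP166_ZDH1B3DW  ONLINE       0     0     0

errors: No known data errors
"""


def unhealthy(status_text):
    """Return (name, state, [read, write, cksum]) for rows that look unhealthy."""
    problems = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True  # rows after this header belong to the config table
            continue
        if in_config:
            if len(fields) < 5:
                break  # a blank or short line ends the table
            name, state, *counters = fields[:5]
            if state != "ONLINE" or any(count != "0" for count in counters):
                problems.append((name, state, counters))
    return problems


print(unhealthy(STATUS))
```

For the status shown here the list comes back empty, consistent with the clean scrub.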

Tinkerer · Dec 15, 2020

Well, my gut feeling is you're good for now, but I would take some precautions if I were you.

I'd double-check and verify my backups (do a restore test!). Double-check that backups are running properly. Back up the OS disk, at least important directories like /etc, /var, and /usr. I'd probably make a second backup to another location, too.

Then I'd swap the SATA cables of sda and sde, if that workstation allows booting from that SATA port. Is it legacy BIOS or UEFI? You may have to jump through some hoops to get it to boot after swapping SATA cables around (create a rescue boot USB thumb drive so you can chroot into your installed OS).

Set up a monitor for SMART errors, via a cron job or something similar, with alerting if certain counters change. Test that spare disk in another system for good measure, for a few days or weeks, just to ensure it won't fail in the first few hours of a resilver. A lot of disk failures happen shortly after a disk is put into use.
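The "alert if certain counters change" part could look roughly like the following, run from cron. This is only a sketch under assumptions: the watched attribute names are a common choice rather than something prescribed in this thread, the counters are expected to be supplied already parsed as a dict, and how the returned messages are delivered (mail, log, and so on) is left open.

```python
import json
import os

# Assumed set of counters worth alerting on; adjust per drive model.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "UDMA_CRC_Error_Count")


def check_counters(current, state_path):
    """Compare watched counters with the last saved run and return alert messages."""
    previous = {}
    if os.path.exists(state_path):
        with open(state_path) as handle:
            previous = json.load(handle)

    alerts = [
        f"{name}: {previous[name]} -> {current[name]}"
        for name in WATCHED
        if name in previous and name in current and current[name] != previous[name]
    ]

    # Persist the current values so the next run has a baseline to compare against.
    with open(state_path, "w") as handle:
        json.dump({name: current[name] for name in WATCHED if name in current}, handle)
    return alerts
```

The first run only records a baseline and returns no alerts; later runs report any watched counter whose raw value moved since the previous run.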

If you can't test that spare disk in another system, you could consider replacing a disk from the other mirror (not the mirror sde is part of), and keep the used disk as a spare for sde, as it's proven to be good. This is a bit more risky, but if you like living on the edge ...

Other than that, I don't think you can do much else than sit back and wait for it to fail. Just make sure you're prepared for that failure. Every disk is going to fail sooner or later; it's never a question of if it's going to fail, only when. Preparing for it is the best advice I can give you right now.

Tinkerer · Dec 15, 2020

PS. I don't think 3 Gb/s SATA links pose any issues for stability or performance. Those disks won't saturate 150 MB/s, so there wouldn't be any benefit to having them on 6 Gb/s links. I wouldn't get an HBA card just for that reason.
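The arithmetic behind that can be sketched quickly. SATA uses 8b/10b line encoding, so every 10 bits on the wire carry one byte of data; the function name below is illustrative, and protocol overhead beyond the encoding is ignored.

```python
def sata_payload_mb_per_s(line_rate_gbps):
    """Approximate usable bandwidth of a SATA link in MB/s.

    With 8b/10b encoding, 10 line bits carry 8 data bits (one byte),
    so bytes per second = line bits per second / 10.
    """
    return line_rate_gbps * 1e9 / 10 / 1e6


for rate in (3.0, 6.0):
    print(f"{rate} Gb/s link: about {sata_payload_mb_per_s(rate):.0f} MB/s")
```

A 3.0 Gb/s link therefore has room for roughly 300 MB/s, so a single spinning disk that stays under the figure quoted above is limited by the drive and not by the link.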
