Napp-it resilvering

markpower28

Active Member
Apr 9, 2013
413
104
43
A drive failed in a zpool made up of 12 x 2 TB drives in mirrors. I swapped the failed drive for a new one and started the drive replace process.

It looks like the zpool is somehow stuck in a resilvering loop. The resilver reaches 100%, then starts all over again, and keeps doing so. This has been going on for 2 weeks. (I did restart the server about a week ago.) Is there anything that can be done for the pool?
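One way to tell a restart loop from a slow resilver is to compare the `scan:` section of `zpool status` over time. Below is a minimal Python sketch, assuming the usual "resilver in progress since <date>" and "NN.NN% done" wording in that output. The helper names and the pool name passed to `read_status` are hypothetical and not taken from this system.

```python
import re
import subprocess

# Assumed wording of the scan section in `zpool status` output.
SINCE_RE = re.compile(r"resilver in progress since (.+)")
DONE_RE = re.compile(r"(\d+(?:\.\d+)?)% done")


def parse_resilver(status_text):
    """Return (start_time, percent_done), or None if no resilver is running."""
    since = SINCE_RE.search(status_text)
    if since is None:
        return None
    done = DONE_RE.search(status_text)
    percent = float(done.group(1)) if done else None
    return since.group(1).strip(), percent


def resilver_restarted(previous, current):
    """True if two samples show resilvers with different start times."""
    if previous is None or current is None:
        return False
    return previous[0] != current[0]


def read_status(pool):
    """Run `zpool status <pool>` and return its text output."""
    return subprocess.run(
        ["zpool", "status", pool], capture_output=True, text=True, check=True
    ).stdout
```

Sampling `read_status` every few minutes and feeding consecutive results to `resilver_restarted` would show whether the start time keeps changing, which is what a loop looks like, as opposed to one long resilver whose percentage only grows.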

19.12b10 on OmniOS v11 r151034
Intel E5-2430 V2
96 GB RAM
HBA LSI 2308
NIC ConnectX-4 Lx 40Gbe
log drive: Intel optane P900 280GB
cache drive: Intel P750 400 GB
HDD: 12 x 2 TB SAS drive
 

gea

Well-Known Member
Dec 31, 2010
3,141
1,184
113
DE
If a ZFS resilver fails or restarts, I would suspect that the "new" disk is also bad. Can you either try another disk or run an intensive test on the new disk, e.g. with a disk tool like WD Data Lifeguard?
 

markpower28

Thanks @gea

I added another clean drive using the replace drive function.

As you can see, after it reached 100% it started the process again and reported data corruption. I don't have a backup or snapshot. Have I just lost the entire pool? (The highlighted drives are the new drives I put in the pool.)

2020-12-07_23-31-52.png
 

gea

The resilvering message on many disks in the first screenshot is what worries me, as this mostly indicates a hardware problem such as RAM or power. I would back up important data before any further tests.

The second screenshot shows that the pool is working in degraded mode. Unless you lose a vdev completely, this is not a problem.

In mirror 3 you have the situation that a failed drive has been replaced twice with other disks. To solve this you must remove 2 disks from this "3-way mirror" situation via menu Disk > Remove:

- remove the failed/unavail disk
- remove one of the mirrors, e.g. ..67ea7d0

Because mirror 3 lost its redundancy, checksum errors with damaged files were detected. Check for these files, then delete them or restore them from backup.
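The damaged files are listed at the end of `zpool status -v`, under the line "errors: Permanent errors have been detected in the following files:". A minimal Python sketch for pulling that list out of the command's text output could look like this. The function name is hypothetical, and the parsing assumes the standard layout where the entries follow that line and a possible next pool starts with a "pool:" line.

```python
import re


def parse_permanent_errors(status_text):
    """Return the entries listed under 'Permanent errors' in `zpool status -v` text."""
    entries = []
    in_errors = False
    for line in status_text.splitlines():
        if line.startswith("errors: Permanent errors"):
            in_errors = True
            continue
        if not in_errors:
            continue
        if re.match(r"\s*pool: ", line):
            # The next pool's section begins, so stop collecting.
            in_errors = False
            continue
        if line.strip():
            entries.append(line.strip())
    return entries
```

Entries that start with `/` are regular files. Entries of the form `dataset:<0x...>` refer to an object inside a dataset or zvol, so there is no single file to delete in that case.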
 
markpower28

Thanks @gea

I could not remove either of the drives mentioned above. I get the following error: cannot detach c6t5000039628D97C26d0: no valid replicas

Is there anything else I could do?
 

gea

- Can you remove the unavail disk?

What happens if you clear the pool errors first (menu Pools), as both of the disks in "replacing" have different checksum errors? If this does not help, run a resilver again and then retry the error clear.
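On the command line, clearing the pool errors corresponds to `zpool clear`. The sketch below is a rough Python outline of the clear-then-rescan step, with the loud assumption that `zpool scrub` stands in for "run a resilver again"; the function name and the pool name `tank` in the usage line are made up for illustration. It defaults to a dry run that only prints the commands.

```python
import subprocess


def clear_then_rescan(pool, dry_run=True):
    """Clear error counters, then start a new scan of the pool.

    Returns the commands in the order they are (or would be) run.
    """
    commands = [
        ["zpool", "clear", pool],  # reset the per-device error counters
        ["zpool", "scrub", pool],  # assumption: a scrub stands in for "resilver again"
    ]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return commands
```

Calling `clear_then_rescan("tank")` only prints the two commands. Note that `zpool scrub` returns as soon as the scan has started and is refused while a resilver is still running, so the second error clear would be run by hand once `zpool status` reports the scan as finished.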

The checksum errors on both replacement disks are a bad sign. This should not happen with good disks after a replacement, and it indicates hardware problems beyond a simple failed disk. Have you backed up your critical files?
 
markpower28

When I try to remove the unavail disk I get the following error: cannot detach 6421037529530817339: no valid replicas

I only use the storage to present iSCSI LUNs to VMware, so the error message points to the LUN itself.

2020-12-08_13-23-11.png
 

gea

Back up all data as fast as possible, move the production system to another server, and check the system as a whole (mainly RAM and PSU).