15 drives out of 45 are in UNAVAIL status, incorrect labeling data

EduardCH

New Member
Jan 30, 2021
2
0
1
Hello,

I have a storage server built on top of an HPE 380p Gen8 server, a Supermicro JBOD, ZFS and napp-it. This ZFS storage was used for storing non-critical data and was configured with three raidz3 vdevs of 15 drives each. Today it failed with exactly 15 drives down, 5 failed drives in each vdev, and the pool became unavailable. Reboots didn't help. There was a system error on the failed LSI controller, which I replaced. Then I exported and imported the pool, and one of the three vdevs came back online; all five drives in that vdev are now online. So there are only 10 UNAVAIL disks in the pool right now, all with the "incorrect labeling data" status. What else can be done to fix this? All drives are connected and OK: there are no SMART errors (I've checked all failed drives) or any other errors.

Thanks for any advice

This is the status of the pool:

   pool: Pool1
     id: 6639334169153027388
  state: UNAVAIL
 status: One or more devices are unavailable.
 action: The pool cannot be imported due to unavailable devices or data.
 config:

        Pool1                      UNAVAIL  insufficient replicas
          raidz3-0                 UNAVAIL  insufficient replicas
            c0t50014EE003896026d0  ONLINE
            c0t50014EE0038AD80Ed0  ONLINE
            c0t50014EE0038AD966d0  ONLINE
            c0t50014EE0038AE436d0  ONLINE
            c0t50014EE0038AF4B0d0  UNAVAIL  incorrect labeling data
            c0t50014EE0038B1096d0  ONLINE
            c0t50014EE0038B1489d0  ONLINE
            c0t50014EE0038B1494d0  UNAVAIL  incorrect labeling data
            c0t50014EE058DEF533d0  ONLINE
            c0t50014EE058DF22ACd0  UNAVAIL  incorrect labeling data
            c0t50014EE058E00B16d0  UNAVAIL  incorrect labeling data
            c0t50014EE058E0251Ed0  UNAVAIL  incorrect labeling data
            c0t50014EE058E04342d0  ONLINE
            c0t50014EE058E0435Dd0  ONLINE
            c0t50014EE058E04529d0  ONLINE
          raidz3-1                 ONLINE
            c0t50014EE058E04598d0  ONLINE
            c0t50014EE0AE35BEFCd0  ONLINE
            c0t50014EE0AE35C56Dd0  ONLINE
            c0t50014EE0AE35C682d0  ONLINE
            c0t50014EE0038B1494d0  ONLINE
            c0t50014EE0AE35D550d0  ONLINE
            c0t50014EE0AE35E395d0  ONLINE
            c0t50014EE6ADE2DD25d0  ONLINE
            c0t50014EE0038AF4B0d0  ONLINE
            c0t50014EE60332A49Fd0  ONLINE
            c0t50014EE60337A401d0  ONLINE
            c0t50014EE60337B88Cd0  ONLINE
            c0t50014EE603384593d0  ONLINE
            c0t50014EE603384947d0  ONLINE
            c0t50014EE6033864FFd0  ONLINE
          raidz3-2                 UNAVAIL  insufficient replicas
            c0t50014EE6033865C5d0  ONLINE
            c0t50014EE6588D024Fd0  ONLINE
            c0t50014EE6588D0CFBd0  UNAVAIL  incorrect labeling data
            c0t50014EE6588D0D2Fd0  ONLINE
            c0t50014EE6588D10F6d0  ONLINE
            c0t50014EE6588D1120d0  ONLINE
            c0t50014EE6588DA81Bd0  UNAVAIL  incorrect labeling data
            c0t50014EE6ADE21C50d0  ONLINE
            c0t50014EE6ADE220C6d0  UNAVAIL  incorrect labeling data
            c0t50014EE6ADE22BC3d0  ONLINE
            c0t50014EE6ADE2C915d0  ONLINE
            c0t50014EE6ADE2DCFDd0  ONLINE
            c0t50014EE6ADE2DD25d0  UNAVAIL  incorrect labeling data
            c0t50014EE6ADE2E2EAd0  UNAVAIL  incorrect labeling data
            c0t50014EE6ADE2E317d0  ONLINE

device details:

    c0t50014EE0038AF4B0d0  UNAVAIL  incorrect labeling data
    status: ZFS detected errors on this device.
            The device has bad label or disk contents.

    c0t50014EE0038B1494d0  UNAVAIL  incorrect labeling data
    status: ZFS detected errors on this device.
            The device has bad label or disk contents.

    c0t50014EE058DF22ACd0  UNAVAIL  incorrect labeling data
    status: ZFS detected errors on this device.
            The device has bad label or disk contents.

    c0t50014EE058E00B16d0  UNAVAIL  incorrect labeling data
    status: ZFS detected errors on this device.
            The device has bad label or disk contents.

    c0t50014EE058E0251Ed0  UNAVAIL  incorrect labeling data
    status: ZFS detected errors on this device.
            The device has bad label or disk contents.

    c0t50014EE6588D0CFBd0  UNAVAIL  incorrect labeling data
    status: ZFS detected errors on this device.
            The device has bad label or disk contents.

    c0t50014EE6588DA81Bd0  UNAVAIL  incorrect labeling data
    status: ZFS detected errors on this device.
            The device has bad label or disk contents.

    c0t50014EE6ADE220C6d0  UNAVAIL  incorrect labeling data
    status: ZFS detected errors on this device.
            The device has bad label or disk contents.

    c0t50014EE6ADE2DD25d0  UNAVAIL  incorrect labeling data
    status: ZFS detected errors on this device.
            The device has bad label or disk contents.

    c0t50014EE6ADE2E2EAd0  UNAVAIL  incorrect labeling data
    status: ZFS detected errors on this device.
            The device has bad label or disk contents.
 

EduardCH

New Member
Jan 30, 2021
2
0
1
I have realized that duplicate drive IDs are shown when I try to import the pool, e.g. c0t50014EE0038AF4B0d0: the same ID appears in different vdevs. How is that possible? It looks like ZFS somehow mixed up drives or drive labels. Physically all drives are in place and all IDs are unique. Why is ZFS using wrong drive IDs in some vdevs? Is it possible to import the pool using some other method of disk identification?
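A quick way to double-check for duplicates is to save the config section of the 'zpool import' output to a file and print any device ID that appears more than once. A minimal sketch (pool-import.txt is a hypothetical file holding the pasted listing above, not something from the original post):

```shell
# Extract every WWN-based device ID (c0t + 16 hex digits + d0),
# sort them, and print only the IDs that occur more than once.
grep -o 'c0t[0-9A-F]\{16\}d0' pool-import.txt | sort | uniq -d
```

On the config listing above this prints the three IDs that appear in two vdevs at once: c0t50014EE0038AF4B0d0, c0t50014EE0038B1494d0 and c0t50014EE6ADE2DD25d0.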
 

gea

Well-Known Member
Dec 31, 2010
2,649
908
113
DE
A WWN must be unique within a server, as it is the disk identifier.
The WWN itself is not assigned by the OS; the disk manufacturer writes it to the disk, like the MAC address of a NIC. Normally there should never be two disks with the same WWN worldwide.
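For illustration (a minimal sketch, not from the original post): WWNs like the ones in this thread follow the standard NAA-5 layout, i.e. one hex digit of NAA type, a 6-hex-digit IEEE OUI identifying the manufacturer (0014EE is Western Digital's OUI), and a vendor-assigned remainder. Splitting one of the IDs from the listing:

```shell
# Split an NAA-5 WWN into its fields with POSIX cut:
# 1 hex digit NAA type, 6 hex digits IEEE OUI (manufacturer),
# 9 hex digits assigned by the vendor per drive.
wwn=50014EE0038AF4B0
printf 'NAA type: %s\n' "$(echo "$wwn" | cut -c1)"
printf 'OUI:      %s\n' "$(echo "$wwn" | cut -c2-7)"
printf 'Vendor:   %s\n' "$(echo "$wwn" | cut -c8-16)"
```

Reading the WWN off both drives that show the same ID would tell you whether the duplication is really stamped in the hardware or only in ZFS's view of the labels.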

In the past, with the first SATA SSDs, I have seen similar problems where disks from the same production line had the same WWN, but that was years ago.

If the disks are new, send them back, as this is definitely a defect. If you can't replace them, you have only two options: don't use them in the same server, or use them on a controller that detects disks not by WWN but by controller port ID, e.g. SATA AHCI, which identifies disks by the controller connector, ex c1t3d0 (third disk on controller 1).
 

andrewbedia

Active Member
Jan 11, 2013
666
203
43
Not sure if this helps, but if Solaris maps drives in multiple ways like a lot of GNU/Linux things do, you may be able to search for the drives differently during import to get at your data. In ZoL it would be something like "zpool import -d /dev/disk/by-id Pool1" (or by-path). I'm not a Solaris expert, but see if something like -d might help.
 

gea

Well-Known Member
Dec 31, 2010
2,649
908
113
DE
These multiple, user-controllable options for disk detection are a Linux-only thing.

On Solarish with a current HBA, WWN detection is the only option, and arguably the best option, as a WWN is a worldwide-unique disk identifier that remains the same even on a new server and, unlike a serial number, is well defined. With a current LSI HBA this ID even stays the same when you move a disk between HBAs.

The other alternative is detection by controller port connector, but you have no control over this, as it depends on hardware and driver; e.g. SATA and Atto SAS HBAs detect disks this way. All LSI HBAs use WWN only on Solarish.

btw
This is one of the reasons that disk detection and disk moves just work properly on Solarish, as long as the disk manufacturer has written a proper WWN.