Proxmox: Odd ZFS behaviour with missing partitions after reboot


bob_dvb

Active Member
Sep 7, 2018
214
116
43
Not quite London
www.orbit.me.uk
Hi All,

I wanted to share an odd experience from last night, and to make what I did indexable for the future!

So, the first part is unrelated: I was reflashing my Mellanox CX3-Pro 40GbE card (4099) and needed to reboot.

When it came back up, half of my ZFS pools were missing and the other half broken. WTF?

When I did
zpool import -d /dev/sda1
it said that some of the devices were UNAVAIL
   pool: bigone
     id: 18208768974987298556
  state: UNAVAIL
 status: One or more devices contains corrupted data.
 action: The pool cannot be imported due to damaged devices or data.
    see: Message ID: ZFS-8000-5E — OpenZFS documentation
 config:

        bigone                               UNAVAIL  insufficient replicas
          raidz2-0                           UNAVAIL  insufficient replicas
            sda                              ONLINE
            ata-ST4000VN006-3CW104_ZW60BY21  ONLINE
            ata-ST4000VN008-2DR166_ZDHAHMJG  UNAVAIL
            ata-ST4000VN008-2DR166_ZGY97GD3  UNAVAIL
            ata-ST4000VN006-3CW104_ZW60896Z  ONLINE
            ata-ST4000VN006-3CW104_ZW6088TR  ONLINE
            ata-ST4000VN008-2DR166_ZDHAH42Q  UNAVAIL
            ata-ST4000VN008-2DR166_ZDHAX86T  UNAVAIL
Then I ran zdb -dep with -G to get more debug output; it was saying the devices didn't exist.
root@vmhost1:/usr/sbin# zdb -dep /dev/sdb1 -G bigone
zdb: can't open 'bigone': No such file or directory

ZFS_DBGMSG(zdb) START:
spa.c:6098:spa_import(): spa_import: importing bigone
spa_misc.c:418:spa_load_note(): spa_load(bigone, config trusted): LOADING
spa_misc.c:418:spa_load_note(): spa_load(bigone, config untrusted): vdev tree has 1 missing top-level vdevs.
spa_misc.c:418:spa_load_note(): spa_load(bigone, config untrusted): current settings allow for maximum 0 missing top-level vdevs at this stage.
spa_misc.c:403:spa_load_failed(): spa_load(bigone, config untrusted): FAILED: unable to open vdev tree [error=2]
vdev.c:212:vdev_dbgmsg_print_tree(): vdev 0: root, guid: 18208768974987298556, path: N/A, can't open
vdev.c:212:vdev_dbgmsg_print_tree(): vdev 0: raidz, guid: 1389112146403432605, path: N/A, can't open
vdev.c:212:vdev_dbgmsg_print_tree(): vdev 0: disk, guid: 6576567205135368536, path: /dev/disk/by-id/ata-ST4000VN008-2DR166_ZDHAX6MF-part1, can't open
vdev.c:212:vdev_dbgmsg_print_tree(): vdev 1: disk, guid: 11223351249977905860, path: /dev/sdb1, healthy
vdev.c:212:vdev_dbgmsg_print_tree(): vdev 2: disk, guid: 18406418524549592766, path: /dev/disk/by-id/ata-ST4000VN008-2DR166_ZDHAHMJG-part1, can't open
vdev.c:212:vdev_dbgmsg_print_tree(): vdev 3: disk, guid: 12249813091560993336, path: /dev/disk/by-id/ata-ST4000VN008-2DR166_ZGY97GD3-part1, can't open
vdev.c:212:vdev_dbgmsg_print_tree(): vdev 4: disk, guid: 16845159498843104203, path: /dev/disk/by-id/ata-ST4000VN006-3CW104_ZW60896Z-part1, healthy
vdev.c:212:vdev_dbgmsg_print_tree(): vdev 5: disk, guid: 3727851619167910069, path: /dev/disk/by-id/ata-ST4000VN006-3CW104_ZW6088TR-part1, healthy
vdev.c:212:vdev_dbgmsg_print_tree(): vdev 6: disk, guid: 14363383109377104818, path: /dev/disk/by-id/ata-ST4000VN008-2DR166_ZDHAH42Q-part1, can't open
vdev.c:212:vdev_dbgmsg_print_tree(): vdev 7: disk, guid: 6356202158419617414, path: /dev/disk/by-id/ata-ST4000VN008-2DR166_ZDHAX86T-part1, can't open
spa_misc.c:418:spa_load_note(): spa_load(bigone, config untrusted): UNLOADING
ZFS_DBGMSG(zdb) END
But fdisk -l showed the partitions existed!?!
Then ls -la /dev/disk/by-id/ showed that the partitions weren't showing up in the device tree?!
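For anyone hitting the same symptom, this is roughly how to see the mismatch (a quick sketch, not a transcript of my session; device names will differ):

# partitions the kernel knows about
cat /proc/partitions
lsblk -o NAME,SIZE,SERIAL,WWN
# symlinks udev should have created (the -part1 entries were the ones missing here)
ls -la /dev/disk/by-id/ | grep -- -part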

After some searching, I found it could have been related to udev functionality in systemd.
udevadm test --action=add /sys/class/block/sd*
I believe this replays the 'add' event for udev, which wakes things up? (Can someone confirm?)
Then I ran partprobe and it recreated the symlinks in the device tree for the missing partitions. partprobe alone didn't seem to recover things, but running udevadm first definitely helped.
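Putting it together, the recovery looked roughly like this (a sketch, not a copy-paste of my shell history; adjust pool/device names):

# replay the 'add' event for each disk; I ran udevadm test per device
# (udevadm trigger --subsystem-match=block --action=add should be the tidier way to replay events)
for d in /sys/class/block/sd*; do udevadm test --action=add "$d"; done
# re-read the partition tables so the -partN symlinks come back
partprobe
# then import via the stable by-id names
zpool import -d /dev/disk/by-id bigone
zpool status bigone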

  pool: bigone
 state: ONLINE
  scan: scrub in progress since Mon May 8 01:24:08 2023
        250G scanned at 9.98G/s, 2.14M issued at 87.8K/s, 6.13T total
        0B repaired, 0.00% done, no estimated completion time
config:

        NAME                                 STATE     READ WRITE CKSUM
        bigone                               ONLINE       0     0     0
          raidz2-0                           ONLINE       0     0     0
            ata-ST4000VN008-2DR166_ZDHAX6MF  ONLINE       0     0     0
            ata-ST4000VN006-3CW104_ZW60BY21  ONLINE       0     0     0
            ata-ST4000VN008-2DR166_ZDHAHMJG  ONLINE       0     0     0
            ata-ST4000VN008-2DR166_ZGY97GD3  ONLINE       0     0     0
            ata-ST4000VN006-3CW104_ZW60896Z  ONLINE       0     0     0
            ata-ST4000VN006-3CW104_ZW6088TR  ONLINE       0     0     0
            ata-ST4000VN008-2DR166_ZDHAH42Q  ONLINE       0     0     0
            ata-ST4000VN008-2DR166_ZDHAX86T  ONLINE       0     0     0

errors: No known data errors

If anyone knows what the root cause of this is, I'd appreciate it! Because next time the server reboots I don't want to do this again.
 

CyklonDX

Well-Known Member
Nov 8, 2022
819
267
63
Just to confirm: you weren't using your Mellanox card to attach those storage devices (JBOD)?

I do recall issues after flashing an Intel X540 with FCoE firmware. (I have 2 LSI controllers, internal and external.) After the reboot, the 2nd LSI controller came up and then vanished during the LSI utility POST in a really odd way. I had to do a cold boot. (The disks still don't show in the LSI POST utility, only within the utility itself.) Ever since, the disks' original sdX names have changed to scsi names.
[attached screenshot: 1683598671475.png]
(zfs3 and zfs4 are on the same backplane and same controller; zfs4 was added later. I could not get the proper names back anymore - I haven't messed with it a lot - and it works, so I left it.)
 

gea

Well-Known Member
Dec 31, 2010
3,156
1,195
113
DE
Solaris with native ZFS, and its forks with Open-ZFS, always use worldwide-unique WWN disk names with LSI HBAs, and controller-based names with SATA/NVMe.

On Linux you can switch from controller-port/cable-based naming like sda to unique disk-based naming during zpool import,
by pointing the import at the wanted /dev/disk/ directory, e.g.:

sudo zpool export zfs3
sudo zpool import -d /dev/disk/by-id zfs3

In general, on a small home server, controller-port-based names are easier to handle. On installations with many servers, HBAs, and disks you are lost without unique disk names: you can move disks around and they keep their names, and ZFS (and you) will detect them even on another disk bay, HBA, or server. WWN (printed on the disk case together with the serial) is preferable to the serial number, as serials are often not encoded properly and have no standardized naming.

With port-based names, ZFS cannot find its disks when you move them to another port or when the OS changes the port-based assignments for whatever reason. A pool export + import can fix this.
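For the pool in this thread that would be something like (untested sketch):

zpool export bigone
zpool import -d /dev/disk/by-id bigone
# vdevs are now listed under their stable by-id names
zpool status bigone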
 

sko

Active Member
Jun 11, 2021
240
128
43
The provider names as shown by 'zpool status' have nothing to do with how ZFS assembles the pools. They are just a human-readable representation and can be changed, e.g. by setting the 'kern.geom.label.disk_ident.enable' or 'kern.geom.label.gptid.enable' sysctls on FreeBSD.
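For reference, those are usually set as boot-time tunables, e.g. in /boot/loader.conf (sketch; 1 enables the label provider, 0 disables it, which changes which provider names exist and therefore how they display):

# /boot/loader.conf
kern.geom.label.disk_ident.enable="0"
kern.geom.label.gptid.enable="0"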

ZFS uses none of those; it uses its own identifiers, stored in the metadata of the providers, to assemble a pool, so the representation by the OS, e.g. via 'zpool status', is purely cosmetic and changes completely when, say, moving a pool from illumos to FreeBSD or vice versa.
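You can see those identifiers yourself by dumping the vdev label from any member device, e.g. with one of the devices from the first post (sketch):

# prints the pool guid, the per-vdev guid and the last-known 'path' stored in the label
zdb -l /dev/sdb1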


You usually shouldn't import a pool by specifying a single provider; always use its name or GUID.
What does a simple 'zpool import' show?
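Something like this (sketch; the numeric id is the one from the output above):

# list importable pools without importing anything
zpool import
# then import by pool name ...
zpool import bigone
# ... or by the numeric pool id it prints
zpool import 18208768974987298556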
Are you using multiple HBAs? Are all cables OK? Half of the drives 'disappearing' screams broken HBA, bad cabling, or missing power to part of the backplane. With dual-HBA setups (or even just 2 separate channels/cables and direct-attach backplanes), for better resilience (and flexibility, space efficiency, rebuild times, and several other benefits...) always use mirrors and put each of the mirror providers on its own HBA/channel. This way a failing controller won't bring down the whole pool.

Try moving the disks around and check if the errors correspond to specific drive slots. (Again: the 'names' of the drives are completely irrelevant to ZFS.)



FTR: proprietary Oracle Solaris doesn't have an OpenZFS-compatible ZFS implementation (i.e. it's heavily outdated and missing essential features), so don't mix anything Solaris-related with modern OpenZFS. (But even on Solaris the WWN is only the human-readable representation, not what ZFS uses when assembling a pool.)
 

gea

Well-Known Member
Dec 31, 2010
3,156
1,195
113
DE
Yes and no.
ZFS internally uses a disk guid. You can use this instead of the disk label when you replace a disk. But if you remove disks and plug them back in in a different order, the pool will only keep working without further actions (like an export/import, where all disk labels are re-read) if you use disk-based detection like WWN and not a port-based one like sda.
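For example, roughly (sketch; the guid is one from the zdb output above, and the new-disk path is just a placeholder):

# show the pool with numeric vdev guids instead of device names
zpool status -g bigone
# replace a vdev by its guid rather than by its (possibly changed) device label
zpool replace bigone 6576567205135368536 /dev/disk/by-id/<new-disk>   # <new-disk> is a placeholder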

Oracle Solaris with native ZFS is incompatible with Open-ZFS but not outdated. Until recently (and in some aspects even now) it was more feature-rich than Open-ZFS and is still one of the fastest ZFS servers. The free Solaris fork Illumos (e.g. Nexenta, OmniOS, OpenIndiana or SmartOS) is quite in sync with Linux Open-ZFS. While Illumos is no longer the upstream for Open-ZFS (that is now Linux, due to more development power) as it was until recently, it is still about the most resource-efficient ZFS platform, with the OS-integrated Solaris/Illumos SMB server as a killer feature. Regarding OS updates/downgrades, ZFS integration, and ease of setup, especially around SMB, I perceive Solarish as still far ahead of any Linux ZFS implementation with SAMBA.
 