Napp-IT / ESXi / OmniOS - Disks going offline (raidz2-0)

ssherwood

New Member
Oct 7, 2011
19
2
3
Hello,

I'm running Napp-IT on OmniOS in a VM running on ESXi.

ESXi : 6.7.0 Update 1 (Build 10764712)
OmniOS: OmniOS v11 r151030ex
Napp-IT: 21.06a7
LSI 2008 HBAs being passed through to Napp-IT (x3)

3 storage pools running total:
1) Mirrored SSDs (primary storage for VMs)
2) Striped mirrors (20TB) (retired pool - now just used for temporary items - mostly empty)
3) Raidz2 pool (40TB) - this is the pool that's giving me trouble. It's fairly full - about 4.5TB available.

This configuration has been running for many years with minimal intervention.

I have backups of the most critical data from this pool (documents, photos etc.), but it stores a ton of media which I don't typically backup. Total pool size (useable) is just under 40TB.

Since yesterday evening, I've been troubleshooting an issue that has cropped up with a raidz2 pool on my home fileserver. I noted it was showing 2 disks as removed.

After shutting down all of the running VMs, I was able to shut down the server as well. I physically looked at the front of my chassis for obvious signs of trouble, and seeing nothing out of the ordinary, reseated all of the drives by removing and reinserting them while the server was powered off. After restart, the drives showed up as online, and a resilver completed successfully.

I had been running older versions of OmniOS and Napp-IT, so I decided to update both. Napp-IT went from 18.12 to 21.06, and OmniOS from r151028 to r151030.

After all of this completed successfully, I decided to initiate a scrub of the pool and called it a night after monitoring its progress for an hour or so. This morning I woke up to the same issue again - 2 disks showing as removed from the same pool. It could be my imagination (I don't think it is), but I believe 2 different drives showed as removed this time.

I shut down the running VMs and rebooted Napp-IT (no shutdown - just an init 6 from the CLI), and the drives showed up again as online and started resilvering again.

So - I'm looking for suggestions as to what I should do next. I can confirm that I've already physically looked at the server, but I haven't cracked it open. I haven't been inside the chassis for a few months. The server is quite old (2x Xeon(R) CPU E5-2650 v2 CPUs), as are the HBAs (3x IBM Serveraid M1015), but still meets my performance needs.

Thanks in advance for your help!

PS - I have a copy of the dmesg output, but I'm not clear on the correct way to share that kind of info on STH. It shows "multipath status: degraded" errors for the two devices that were "removed" (sd26 and sd30)
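For reference, on illumos/OmniOS the usual first stops for this kind of disk drop are the pool status, the per-device error counters, and the FMA error log. A minimal sketch (the pool name `tank` is a placeholder for the affected raidz2 pool):

```shell
# Pool health plus per-vdev read/write/checksum error counters
zpool status -v tank

# Per-device soft/hard/transport error counters; transport errors
# often point at cabling or the backplane rather than the disk itself
iostat -En

# FMA error telemetry in full detail (can be long)
fmdump -eV | less

# Map an sd instance from dmesg (e.g. sd26) back to a physical device path;
# path_to_inst lines look like: "/pci@.../disk@..." 26 "sd"
grep '26 "sd"' /etc/path_to_inst
```

If the same physical slots keep dropping regardless of which disk is in them, that tends to implicate the backplane or cabling rather than the drives.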
 

ssherwood

Hi again - just a short update to say that the 2nd resilver finished and the pool has been back online for the rest of the day without further issue. I decided to power down the server again. I discharged things by disconnecting power and holding down the power button. I checked that the breakout cables between the HBAs and the backplane were all connected, and things looked OK.

I closed things back up and powered on again, and as noted above, haven't seen a return of the problem. I'm thinking I may try to source a few "new" HBAs (Dell H310 seem available locally) and if the issues return, I'll look to swap out the HBAs.

If I do that, would I need to export and then import the ZFS pools, or would they be picked up again without doing that, as long as the new HBAs are in IT mode? I'm sure ESXi would see them as different cards, and I'd need to pass them through to the VM etc..

I'm also wondering if I should consider a more drastic change and replace the entire system. As I said before, it's been running for several years, and so it might just be time to retire this rig and move onto something else. I'd prefer not to spend the $$$ if not needed, but I can if necessary. It's a Gigabyte GA-7PESH2 motherboard with 128GB of ECC DDR3 RAM and a pair of Xeon LGA 2011 processors in a 24-bay LFF Norco chassis. Still plenty of horsepower for what I'm doing today, but I just don't have time between work and family these days to be troubleshooting the home system all the time. Thoughts?

Thanks.
 

gea

Well-Known Member
Dec 31, 2010
2,817
975
113
DE
If disks fail randomly, it's probably a problem around the backplane, PSU, or cabling. If it is always the same disks, check them, e.g. via WD Data Lifeguard, with an intensive test, e.g. from a Hiren's USB stick, Hiren's BootCD PE

You may also disable mpio in menu Disks > Details and mpio > SAS conf
set mpxio-disable="yes"; and reboot.
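For reference, napp-it's "SAS conf" menu edits the illumos multipath driver configuration; the equivalent change by hand would look something like this (on current OmniOS the administrator's copy lives under /etc/driver/drv/, with the shipped default in /kernel/drv/ - this is a sketch of the setting, not necessarily napp-it's exact mechanism):

```
# /etc/driver/drv/scsi_vhci.conf
# Disable MPxIO (SCSI multipathing) globally; requires a reboot to take effect
mpxio-disable="yes";
```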
 

ssherwood

If disks fail randomly, it's probably a problem around the backplane, PSU, or cabling. If it is always the same disks, check them, e.g. via WD Data Lifeguard, with an intensive test, e.g. from a Hiren's USB stick, Hiren's BootCD PE

You may also disable mpio in menu Disks > Details and mpio > SAS conf
set mpxio-disable="yes"; and reboot.
Thanks @gea - when I made the change to SAS conf, it immediately removed many disks from my pools. Not sure it matters, but my disks are all SATA, not SAS. After a reboot, it seems like the disks came back online.
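In case it helps anyone else hitting this, a way to sanity-check what MPxIO is (or is no longer) managing after a change like this - a sketch using the standard illumos tools:

```shell
# List logical units currently under scsi_vhci (MPxIO) control;
# after mpxio-disable="yes" and a reboot this should come back empty
mpathadm list lu

# Show non-STMS to STMS device-name mappings for supported controllers
stmsboot -L
```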
 

gea

Your error "multipath status: degraded" may indicate a problem around mpio, which can be avoided by disabling mpio when it is not needed.
 

ssherwood

@gea thanks - I disabled the feature, and that seemed to improve things for the last week or so, but I've had a repeat of the problem come up again. I've just ordered 3 new H310 HBAs and will look to change out the existing cards once they arrive.

When I do this, should I first export (with the original M1015 HBAs) and then import the pools back after they are connected to the new H310 HBAs?

I will likely replace the cabling between the HBAs and the backplane connectors as well. Then if the issue returns, that pretty much means it's the backplane in the Norco 4224. I believe these are replaceable as well, but unfortunately Norco no longer exists, and the few threads I've looked into seem to indicate that the backplanes are a bit of a unicorn find these days.

Looks like if it is the backplane, I'll be looking for a new chassis at a minimum...
 

gea

Normally export + import is suggested.
But as long as the disks are detected and the pool is intact, it remains importable, at worst via zpool import -f
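A sketch of that sequence, with `tank` standing in for each pool name:

```shell
# Before pulling the old HBAs: cleanly export each pool
zpool export tank

# After booting with the new HBAs passed through: list importable pools
zpool import

# Import by name; add -f only if the pool was not cleanly exported
zpool import tank
zpool import -f tank   # last resort, per the note above
```

Since ZFS identifies pool members by on-disk labels rather than controller paths, the pool should be found on the new HBAs either way; the clean export just avoids the "pool was in use on another system" warning.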