ZFS - M1015 Failed, Replaced with new M1015 - Infinite resilvering?


Animosity

New Member
Apr 17, 2015
I have a 6x2TB-disk raidz2 ZFS pool hosted on Solaris 11.

My original M1015 (flashed to IT mode), installed two years ago when I built this server, failed and was no longer recognized on the PCI-E bus. I replaced it yesterday with another M1015 (also flashed to IT mode), and Solaris found all the disks again.

The ZFS pool, however, was put into the SUSPENDED state (probably because the previous M1015 died while the system was running and all the disks vanished at once), and I see resilvering reported on 2 disks (??), with all disks listed as unavailable.

I have no doubt there are some errors in the pool, but I have cleared the faults (fmadm repaired, and zpool clear) in hopes that the pool could be remounted in a degraded state. However, upon reboot the pool first comes up as DEGRADED (some disks show as unavailable, some as degraded), then immediately transitions to SUSPENDED with all disks showing as unavailable, and resilvering starts.
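For reference, the clearing steps were roughly the following (a sketch; the pool is named tank, and the FMRI shown is only a placeholder for whatever fmadm faulty actually reports):

# list outstanding faults and note their FMRIs/UUIDs
fmadm faulty

# mark the ZFS fault as repaired (replace the FMRI with the one fmadm faulty printed)
fmadm repaired zfs://pool=tank

# clear the error counters / suspended flag on the pool
zpool clear tank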

The resilvering speed starts at about 100 MB/s and rapidly ramps down to 50 KB/s or less, which equates to several hundred hours of expected resilvering time. What's more, iostat shows ZERO transactions occurring on any of the disks in the pool.
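(Those numbers come from watching roughly these two views; a sketch:)

# per-vdev throughput on the pool, refreshed every 5 seconds
zpool iostat -v tank 5

# per-device extended statistics from the OS side
iostat -xn 5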

I have no way to offline any of the disks or export the pool while it is in the suspended state (and I have no real idea why it keeps entering the suspended state when the "fmadm faulty" entries are all reported as repaired).

Where have I gone wrong in replacing the SAS controller and how do I recover?
 

Animosity

New Member
Apr 17, 2015
They're SATA drives. The architecture of the server: I'm running VMware ESXi 5.1, the Solaris VM lives on an SSD (attached to the motherboard SATA controller), and the SAS controller with my raidz2 disks is passed through to Solaris.

I can't pass through the motherboard SATA controller as that controller has the SSD my Solaris VM is on.

This is the ServeTheHome All-In-One architecture written about many times, except running Solaris 11 with Napp-it.
 

dswartz

Active Member
Jul 14, 2011
Can you swap roles? E.g., pass through the motherboard SATA controller and not the HBA? Just wondering. I have seen Solaris-derivative OSes show flaky behavior with the mptsas driver.
 

Animosity

New Member
Apr 17, 2015
I probably could but I'm not sure what that would prove. I can already see all the disks on the new M1015.

Does ZFS care about the order of the disks attached to the controller channels, or can it resolve the pool independently of where the disks are attached? E.g., if a disk was on M1015 channel 0 and I connected it to the motherboard SATA controller's channel 2, is that problematic?
 

dswartz

Active Member
Jul 14, 2011
What it (might) prove is the existence of a Solaris mptsas bug. Not that that would help you directly, but if you could get the pool healthy again, you could switch back to the current arrangement.
 

MatrixMJK

Member
Aug 21, 2014
ZFS does not care where the disks are attached or in what order; as long as it can see them all, it will re-assemble the pool.

I would build another VM, or use the napp-it/OmniOS image, pass the M1015 through to it, and see if it can re-assemble the pool there (a sketch of the steps is below). If you are using a zpool version higher than 28 it will have a problem, though; in that case I'd still make another Solaris VM and try it there.
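Roughly, from the test VM with the M1015 passed through, it would look like this (a sketch; the pool name tank is taken from the output later in the thread):

# list pools visible on the passed-through disks
zpool import

# force-import, since the pool was never cleanly exported from Solaris
zpool import -f tank

# confirm the on-disk pool version (28 or lower is fine for OmniOS/OpenZFS)
zpool get version tank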
 

dswartz

Active Member
Jul 14, 2011
Right. My point was that there may be a Solaris mptsas driver bug/issue; I'm just trying to eliminate that. The symptoms he is describing are very weird...
 

gea

Well-Known Member
Dec 31, 2010
Since it worked with the first IBM controller, I would expect a firmware problem first, e.g. with the P20 LSI firmware. If you are using the P20 IT firmware, go back to P19 (see the sketch below).

Next, you may have a disk that is blocking the controller. Power off, remove all pool disks, and reboot. Then insert the disks one by one, waiting until each is detected. It may be that a single disk blocks all the others: remove/replace that disk.

The last option is a barebone setup, optionally via SATA and best on another computer, to check pool health.

Clear the suspended message with menu Pools > Clear error (zpool clear).
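If the replacement card does turn out to be on P20, the downgrade is roughly this (a sketch; 2118it.bin is an assumed file name from the P19 IT package for the 9211-8i/M1015, so use whatever your package ships):

# show installed LSI controllers and their current firmware versions
sas2flash -listall

# write the P19 IT firmware (note your controller's SAS address first,
# in case a full erase/re-program is ever needed)
sas2flash -o -f 2118it.bin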
 

Animosity

New Member
Apr 17, 2015
Update on methods attempted:

I realized that I may have flashed the replacement M1015 with newer/different firmware than the original had. I was able to locate the files I used two years ago (thankfully I had backed the firmware up) and reflash the M1015. Unfortunately, this made no difference in operation.

I then attempted to pass through the motherboard SATA controller to the Solaris VM, but that controller is not available for passthrough.

I've now installed another HDD and set up the napp-it/OmniOS VMware appliance on it (with the new M1015 passed through), and it was able to import the pool (with -f). It threw a number of errors that scrolled past the console buffer; all I caught was that it thought my pool was version 6 and needed to be updated to 28. However, the pool IS version 28. While the import appeared to fail, the pool does show up in zpool status and is currently resilvering 2 disks at a sustained 155 MB/s, which I intend to leave alone until completion.
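(To double-check the version complaint, this is the property I looked at; a sketch:)

# pool version as recorded on disk - reports 28 here, despite the import warning
zpool get version tank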

zpool status said:
  pool: tank
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Apr 17 02:37:54 2015
        170G scanned out of 8.71T at 158M/s, 15h48m to go
        25.7G resilvered, 1.90% done
config:

        NAME                       STATE     READ WRITE CKSUM
        tank                       ONLINE       0     0     0
          raidz2-0                 ONLINE       0     0     0
            c4t5000C5005E169C55d0  ONLINE       0     0     2
            c4t5000C5005C08BE07d0  ONLINE       0     0     0
            c4t5000C5005C07780Ad0  ONLINE       0     0     0
            c4t5000C5005E21AE92d0  ONLINE       0     0     0  (resilvering)
            c4t5000C5005E0C5056d0  ONLINE       0     0     3
            c4t5000C5005C04F982d0  ONLINE       0     0     0  (resilvering)
 

Animosity

New Member
Apr 17, 2015
The error pertaining to version 6 is about how OmniOS can't mount the filesystems because it thinks they are version 6 (they're not). The pool is imported but unmountable.
 

gea

Well-Known Member
Dec 31, 2010
I suppose your pool's filesystems really are ZFS v6 (check the properties, as in the sketch below). Even with pool version 28, that is incompatible with OpenZFS, which only supports filesystem version 5.

But if the resilver finishes without problems, you should be able to import the pool in Solaris again. If a disk fails again, remove/replace that disk.

The only thing you must avoid is upgrading the pool in OmniOS, as that would leave you with a v5000 pool holding v6 filesystems that no OS can fully use: Solaris cannot import a v5000 pool, and OmniOS cannot mount the v6 filesystems (the only remaining option then would be a zfs send back to Solaris).
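To check the versions from the OmniOS side, something like this (a sketch):

# pool (zpool) version - should report 28
zpool get version tank

# filesystem (zfs) version on every dataset - Solaris 11 creates v6,
# which OpenZFS cannot mount
zfs get -r version tank

# do NOT run 'zpool upgrade tank' here, for the reason above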
 

Animosity

New Member
Apr 17, 2015
After the resilvering completed in OmniOS, I was able to boot into the Solaris 11 VM and see that the pool and all disks were ONLINE. I was then able to remount the pool and access all the files, so there's a happy ending.
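(For anyone finding this later, the hand-back was roughly the following; a sketch, assuming the pool still had to be moved between the VMs explicitly:)

# on the OmniOS VM, before shutting it down
zpool export tank

# back on the Solaris 11 VM
zpool import tank
zpool status tank
zfs mount -a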