help, hotswap hangs entire system


legen

Active Member
Hello everyone! Sorry for the amount of text, please bear with me :)

I have a problem with the hotswap capabilities of our SAN. We are using:
Chassis: Supermicro CSE-216BA-R920L
RAID controller: LSI 9211-8i
SSDs: Crucial M500
OS: OmniOS (SunOS 5.11 omnios-8c08411)

I have 8 SSDs in a raidz2 configuration. As far as I know, this setup should support hotswap without any problems.

I have read the following,
[1] http://docs.oracle.com/cd/E19253-01/819-5461/6n7ht6r7p/index.html
[2] http://docs.oracle.com/cd/E26502_01/html/E29006/devconfig2-8.html

I tested removing one drive and then re-inserting it after ~2 min. This is what happens:
1. Pull the drive:
Code:
root@SanOne:~# zpool status -v ssdPoolOne
  pool: ssdPoolOne
state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 4.34M in 0h0m with 0 errors on Fri Aug 22 12:34:54 2014
config:

        NAME                       STATE     READ WRITE CKSUM
        ssdPoolOne                 DEGRADED     0     0     0
          raidz2-0                 DEGRADED     0     0     0
            c4t500A0751095162C1d0  ONLINE       0     0     0
            c4t500A075109520F01d0  ONLINE       0     0     0
            c4t500A075109520F06d0  ONLINE       0     0     0
            c4t500A075109520F1Cd0  REMOVED      0     0     0
            c4t500A07510953CD84d0  ONLINE       0     0     0
            c4t500A07510953CF6Cd0  ONLINE       0     0     0
            c4t500A07510953CF93d0  ONLINE       0     0     0
            c4t500A0751095E5163d0  ONLINE       0     0     0
2. Re-insert the drive:
Code:
root@SanOne:~# zpool status -v ssdPoolOne
  pool: ssdPoolOne
state: ONLINE
  scan: resilvered 1.28M in 0h0m with 0 errors on Fri Aug 22 12:58:54 2014
config:

        NAME                       STATE     READ WRITE CKSUM
        ssdPoolOne                 ONLINE       0     0     0
          raidz2-0                 ONLINE       0     0     0
            c4t500A0751095162C1d0  ONLINE       0     0     0
            c4t500A075109520F01d0  ONLINE       0     0     0
            c4t500A075109520F06d0  ONLINE       0     0     0
            c4t500A075109520F1Cd0  ONLINE       0     0     0
            c4t500A07510953CD84d0  ONLINE       0     0     0
            c4t500A07510953CF6Cd0  ONLINE       0     0     0
            c4t500A07510953CF93d0  ONLINE       0     0     0
            c4t500A0751095E5163d0  ONLINE       0     0     0

errors: No known data errors
Everything seems to work OK. I can even scrub without any issues.


However, the sources above specify that I should use cfgadm -al and unconfigure the disk when hotswapping. Napp-it also says that one should unconfigure the disk (home > Disks > Hotswap).
This is where the system hangs.
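For reference, the sequence the Oracle docs [1]/[2] describe looks roughly like the following. This is only a sketch: the Ap_Id and device names are taken from my own output further down, so adjust them for your system.

```shell
# Documented SCSI hot-plug sequence (sketch, per the Oracle docs above);
# Ap_Id and device names are from my system and purely illustrative.

# 1. Take the disk offline in ZFS before pulling it
zpool offline ssdPoolOne c4t500A0751095E5163d0

# 2. Unconfigure the attachment point, then physically remove the disk
cfgadm -c unconfigure c13::w500a0751095e5163,0

# 3. Insert the new/old disk, configure it, and bring it back online
cfgadm -c configure c13::w500a0751095e5163,0
zpool online ssdPoolOne c4t500A0751095E5163d0
```

It is the unconfigure/configure part of this sequence that goes wrong for me, as shown by the cfgadm output below.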

1. Before re-inserting the drive:
Code:
Ap_Id  Type  Receptacle  Occupant  Condition
c13  scsi-sas  connected  configured  unknown
c13::w0000000000000000,0  disk-path  connected  configured  unknown
2. After re-attaching the drive
Code:
c13  scsi-sas  connected  configured  unknown
c13::w0000000000000000,0  disk-path  connected  configured  unknown
c13::w500a0751095e5163,0  disk-path  connected  unconfigured unknown
Here cfgadm identifies two IDs on the c13 port. The Napp-it hotswap page also shows the disk as:
state: ONLINE, busy: unconfigured
All other disks are still state: ONLINE, busy: configured.

When I now do another cfgadm -al, or a scan from Napp-it home > Disks > Hotswap, the system hangs.
I managed to check zpool status and it shows this,
Code:
root@SanOne:~# zpool status -v ssdPoolOne
  pool: ssdPoolOne
state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
  see: http://illumos.org/msg/ZFS-8000-HC
  scan: scrub canceled on Fri Aug 22 13:00:59 2014
config:

        NAME                       STATE     READ WRITE CKSUM
        ssdPoolOne                 ONLINE       0    31     0
          raidz2-0                 ONLINE       0     4     0
            c4t500A0751095162C1d0  ONLINE       0     4     0
            c4t500A075109520F01d0  ONLINE       0     4     0
            c4t500A075109520F06d0  ONLINE       0     4     0
            c4t500A075109520F1Cd0  ONLINE       0     0     0
            c4t500A07510953CD84d0  ONLINE       0     4     0
            c4t500A07510953CF6Cd0  ONLINE       0     4     0
            c4t500A07510953CF93d0  ONLINE       0     4     0
            c4t500A0751095E5163d0  ONLINE       0     7     0

errors: List of errors unavailable (insufficient privileges)
So basically everything looks to be working until I run the cfgadm rescan. That makes all terminals hang, and the only way to recover is to power-cycle the whole SAN.

What am I doing wrong here? It worries me that simply removing and re-attaching a disk can bring the whole SAN down. Any tips on how to debug this are welcome.

EDIT: The Napp-it log shows:
Code:
Will be unloaded upon reboot.
Forcing update of sd.conf. p1main, /05_disks and controller/02_hotswap/01_scan/action.pl, line 28
exe: update_drv -f mpt
devfsadm: driver failed to attach: mpt
Warning: Driver (mpt) successfully added to system but failed to attach
p1main, /05_disks and controller/02_hotswap/01_scan/action.pl, line 29
exe: update_drv -f mpt_sas
Also, when I do cfgadm -c unconfigure c13, nothing happens; c13 still says configured (in the Oracle example it should say unconfigured). Is cfgadm or mpt_sas broken on OmniOS?

EDIT2:
Found some possibly relevant links:
Sun's ZFS file system on Solaris
Configuring ZFS to gracefully deal with pool failures

However, while failmode=continue makes it possible to still run zpool status, the machine itself still hangs: no new SSH sessions can be created, etc.
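For anyone trying the same workaround, this is roughly how the property is set (pool name from this thread; the failmode values are standard ZFS pool properties):

```shell
# "wait" (the default) blocks all pool I/O until the device returns;
# "continue" returns EIO on new write requests instead of hanging the pool.
zpool set failmode=continue ssdPoolOne
zpool get failmode ssdPoolOne
```

In my case this only kept zpool status responsive; it did not prevent the rest of the machine from hanging after the cfgadm rescan.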

It looks like things work as long as I don't do a cfgadm scan or the Napp-it equivalent :/
 

gea

Well-Known Member
With your hotplug-capable hardware, ignore the unconfigure option.
Just hot-unplug/replug the disks.
 

legen

I suspected as much :)
I have now tried just pulling the drive, re-inserting it, and doing zpool online drive-id.
This has been running solid for ~14h+ now.
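In case it helps anyone else, the procedure that has been stable for me looks like this (the device name is the disk I pulled in my tests; substitute your own):

```shell
# 1. Physically pull the disk; the pool goes DEGRADED (raidz2 tolerates
#    up to two missing disks). No cfgadm involved at any point.
# 2. Re-insert the disk and give the HBA a few seconds to detect it.
# 3. Bring it back online and let ZFS resilver, then verify:
zpool online ssdPoolOne c4t500A0751095E5163d0
zpool status -v ssdPoolOne
```

The resilver after a short removal is small (a few MB in my case), since ZFS only rewrites the data that changed while the disk was out.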

Still, I am worried that one simple mis-click in Napp-it, or running a cfgadm scan by mistake, brings the whole SAN down. I would love to know why, how to debug it, and how to fix it.
 

legen

You can read the details at, for example, SCSI Hot-Plugging With the cfgadm Command (Task Map) - Oracle Solaris 11.1 Administration: Devices and File Systems

main problem:
Different hardware requires different ways to manage hot-plugging, and these menus are needed for some hardware. As most people now use LSI HBAs, I will remove the scan and un/configure menus from napp-it to avoid such problems.
OK. From which Napp-it version will the scan and unconfigure options be removed?
We will also ask for a Napp-it license quote later today via your website :)