help, hotswap hangs entire system


legen

Active Member
Hello everyone! Sorry for the amount of text, please bear with me :)

I have a problem with the hotswap capabilities of our SAN. We are using:
Chassis: Supermicro CSE-216BA-R920L
RAID controller: LSI 9211-8i
SSDs: Crucial M500
OS: OmniOS (SunOS 5.11 omnios-8c08411)

I have 8 SSDs in a raidz2 configuration. As far as I know, this setup should support hotswap without any problems.

I have read the following,
[1] http://docs.oracle.com/cd/E19253-01/819-5461/6n7ht6r7p/index.html
[2] http://docs.oracle.com/cd/E26502_01/html/E29006/devconfig2-8.html

I tested removing one drive and then re-inserting it after ~2 min. This is what happens:
1. Pull the drive:
Code:
root@SanOne:~# zpool status -v ssdPoolOne
  pool: ssdPoolOne
state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 4.34M in 0h0m with 0 errors on Fri Aug 22 12:34:54 2014
config:

        NAME                       STATE     READ WRITE CKSUM
        ssdPoolOne                 DEGRADED     0     0     0
          raidz2-0                 DEGRADED     0     0     0
            c4t500A0751095162C1d0  ONLINE       0     0     0
            c4t500A075109520F01d0  ONLINE       0     0     0
            c4t500A075109520F06d0  ONLINE       0     0     0
            c4t500A075109520F1Cd0  REMOVED      0     0     0
            c4t500A07510953CD84d0  ONLINE       0     0     0
            c4t500A07510953CF6Cd0  ONLINE       0     0     0
            c4t500A07510953CF93d0  ONLINE       0     0     0
            c4t500A0751095E5163d0  ONLINE       0     0     0
2. Re-insert the drive:
Code:
root@SanOne:~# zpool status -v ssdPoolOne
  pool: ssdPoolOne
state: ONLINE
  scan: resilvered 1.28M in 0h0m with 0 errors on Fri Aug 22 12:58:54 2014
config:

        NAME                       STATE     READ WRITE CKSUM
        ssdPoolOne                 ONLINE       0     0     0
          raidz2-0                 ONLINE       0     0     0
            c4t500A0751095162C1d0  ONLINE       0     0     0
            c4t500A075109520F01d0  ONLINE       0     0     0
            c4t500A075109520F06d0  ONLINE       0     0     0
            c4t500A075109520F1Cd0  ONLINE       0     0     0
            c4t500A07510953CD84d0  ONLINE       0     0     0
            c4t500A07510953CF6Cd0  ONLINE       0     0     0
            c4t500A07510953CF93d0  ONLINE       0     0     0
            c4t500A0751095E5163d0  ONLINE       0     0     0

errors: No known data errors
Everything seems to work OK. I can even scrub without any issues.


However, the sources above specify that I should use cfgadm -al and unconfigure the disk when hotswapping. Napp-it also says that one should unconfigure the disk (home > Disks > Hotswap).
This is where the system hangs.
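For reference, the sequence the Oracle docs [1]/[2] describe looks roughly like the following. This is only a sketch: the Ap_Id and device names are taken from my own output further down, so adjust them for your system.

```shell
# Documented SCSI hot-plug sequence (sketch, per the Oracle docs above);
# Ap_Id and device names are from my system and purely illustrative.

# 1. Take the disk offline in ZFS before pulling it
zpool offline ssdPoolOne c4t500A0751095E5163d0

# 2. Unconfigure the attachment point, then physically remove the disk
cfgadm -c unconfigure c13::w500a0751095e5163,0

# 3. Insert the new/old disk, configure it, and bring it back online
cfgadm -c configure c13::w500a0751095e5163,0
zpool online ssdPoolOne c4t500A0751095E5163d0
```

It is the unconfigure/configure part of this sequence that goes wrong for me, as shown by the cfgadm output below.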

1. Before re-inserting the drive:
Code:
Ap_Id  Type  Receptacle  Occupant  Condition
c13  scsi-sas  connected  configured  unknown
c13::w0000000000000000,0  disk-path  connected  configured  unknown
2. After re-attaching the drive
Code:
c13  scsi-sas  connected  configured  unknown
c13::w0000000000000000,0  disk-path  connected  configured  unknown
c13::w500a0751095e5163,0  disk-path  connected  unconfigured unknown
Here cfgadm identifies two IDs on the c13 port. The Napp-it hotswap page also shows the disk as:
state: ONLINE, busy: unconfigured
All other disks are still state: ONLINE, busy: configured.

When I now do another cfgadm -al, or a scan from Napp-it home > Disks > Hotswap, the system hangs.
I managed to check zpool status and it shows this,
Code:
root@SanOne:~# zpool status -v ssdPoolOne
  pool: ssdPoolOne
state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
  see: http://illumos.org/msg/ZFS-8000-HC
  scan: scrub canceled on Fri Aug 22 13:00:59 2014
config:

        NAME                       STATE     READ WRITE CKSUM
        ssdPoolOne                 ONLINE       0    31     0
          raidz2-0                 ONLINE       0     4     0
            c4t500A0751095162C1d0  ONLINE       0     4     0
            c4t500A075109520F01d0  ONLINE       0     4     0
            c4t500A075109520F06d0  ONLINE       0     4     0
            c4t500A075109520F1Cd0  ONLINE       0     0     0
            c4t500A07510953CD84d0  ONLINE       0     4     0
            c4t500A07510953CF6Cd0  ONLINE       0     4     0
            c4t500A07510953CF93d0  ONLINE       0     4     0
            c4t500A0751095E5163d0  ONLINE       0     7     0

errors: List of errors unavailable (insufficient privileges)
So basically everything looks to be working until I run the cfgadm rescan. That makes all terminals hang, and the only way to recover is to power-cycle the whole SAN.

What am I doing wrong here? It worries me that simply removing and re-attaching a disk can bring the whole SAN down. Any tips on how to debug this are welcome.

EDIT: The Napp-it log shows:
Code:
Will be unloaded upon reboot.
Forcing update of sd.conf. p1main, /05_disks and controller/02_hotswap/01_scan/action.pl, line 28
exe: update_drv -f mpt
devfsadm: driver failed to attach: mpt
Warning: Driver (mpt) successfully added to system but failed to attach
p1main, /05_disks and controller/02_hotswap/01_scan/action.pl, line 29
exe: update_drv -f mpt_sas
Also, when I do cfgadm -c unconfigure c13, nothing happens; c13 still says configured (in the Oracle example it should say unconfigured). Is cfgadm or mpt_sas broken on OmniOS?

EDIT2:
Found some possibly relevant links:
Sun's ZFS file system on Solaris
Configuring ZFS to gracefully deal with pool failures

However, while failmode=continue makes it possible to still run zpool status, the machine itself still hangs: no new SSH sessions can be created, etc.
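For anyone trying the same workaround, this is roughly how the property is set (pool name from this thread; the failmode values are standard ZFS pool properties):

```shell
# "wait" (the default) blocks all pool I/O until the device returns;
# "continue" returns EIO on new write requests instead of hanging the pool.
zpool set failmode=continue ssdPoolOne
zpool get failmode ssdPoolOne
```

In my case this only kept zpool status responsive; it did not prevent the rest of the machine from hanging after the cfgadm rescan.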

It looks like things work as long as I don't do a cfgadm scan or the Napp-it equivalent :/
 

gea

Well-Known Member
With your hotplug-capable hardware, ignore the unconfigure option.
Just hot-unplug/replug the disks.
 

legen

I suspected as much :)
I have now tried just pulling the drive, re-inserting it, and doing zpool online drive-id.
This has been running solid for ~14h+ now.
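In case it helps anyone else, the procedure that has been stable for me looks like this (the device name is the disk I pulled in my tests; substitute your own):

```shell
# 1. Physically pull the disk; the pool goes DEGRADED (raidz2 tolerates
#    up to two missing disks). No cfgadm involved at any point.
# 2. Re-insert the disk and give the HBA a few seconds to detect it.
# 3. Bring it back online and let ZFS resilver, then verify:
zpool online ssdPoolOne c4t500A0751095E5163d0
zpool status -v ssdPoolOne
```

The resilver after a short removal is small (a few MB in my case), since ZFS only rewrites the data that changed while the disk was out.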

Still, I am worried that one simple mis-click in Napp-it, or running a cfgadm scan by mistake, brings the whole SAN down. I would love to know why, how to debug it, and how to fix it.
 

legen

You can read the details at, for example, SCSI Hot-Plugging With the cfgadm Command (Task Map) - Oracle Solaris 11.1 Administration: Devices and File Systems

main problem:
Different hardware requires different ways to manage hot-plugging, and these menus are needed for some hardware. As most people now use LSI HBAs, I will remove the scan and un/configure menus from napp-it to avoid such problems.
OK. From which Napp-it version will the scan and unconfigure options be removed?
We will also ask for a Napp-it license quote later today via your website :)