Hello everyone! Sorry for the amount of text, please bear with me
I have a problem with the hotswap capabilities of our SAN. We are using:
Chassis: Supermicro CSE-216BA-R920L
RAID controller: LSI 9211-8i
SSDs: Crucial M500
OS: OmniOS (SanOne 5.11 omnios-8c08411)
I have 8 SSDs in a raidz2 configuration. As far as I know this setup should support hotswap without any problems.
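For completeness, it is a single eight-disk raidz2 vdev, i.e. it was created with something along these lines (device names taken from the zpool output below, other properties omitted):
Code:
zpool create ssdPoolOne raidz2 \
  c4t500A0751095162C1d0 c4t500A075109520F01d0 c4t500A075109520F06d0 c4t500A075109520F1Cd0 \
  c4t500A07510953CD84d0 c4t500A07510953CF6Cd0 c4t500A07510953CF93d0 c4t500A0751095E5163d0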
I have read the following:
[1] http://docs.oracle.com/cd/E19253-01/819-5461/6n7ht6r7p/index.html
[2] http://docs.oracle.com/cd/E26502_01/html/E29006/devconfig2-8.html
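My reading of [1] and [2] is that the recommended flow is roughly the sketch below; the ap_id and device name are only examples taken from my own system further down, so treat it as a sketch, not the exact commands from the docs:
Code:
cfgadm -al                                        # find the attachment point of the disk
cfgadm -c unconfigure c13::w500a0751095e5163,0    # tell the framework the disk is going away
# ...physically pull and re-insert / replace the disk...
cfgadm -c configure c13::w500a0751095e5163,0      # bring the disk back under control
zpool online ssdPoolOne c4t500A0751095E5163d0     # or 'zpool replace' if it is a new disk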
I tested removing one drive and then re-plugging it after ~2 min. This is what happens:
1. Unplug the drive:
Code:
root@SanOne:~# zpool status -v ssdPoolOne
  pool: ssdPoolOne
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 4.34M in 0h0m with 0 errors on Fri Aug 22 12:34:54 2014
config:

        NAME                       STATE     READ WRITE CKSUM
        ssdPoolOne                 DEGRADED     0     0     0
          raidz2-0                 DEGRADED     0     0     0
            c4t500A0751095162C1d0  ONLINE       0     0     0
            c4t500A075109520F01d0  ONLINE       0     0     0
            c4t500A075109520F06d0  ONLINE       0     0     0
            c4t500A075109520F1Cd0  REMOVED      0     0     0
            c4t500A07510953CD84d0  ONLINE       0     0     0
            c4t500A07510953CF6Cd0  ONLINE       0     0     0
            c4t500A07510953CF93d0  ONLINE       0     0     0
            c4t500A0751095E5163d0  ONLINE       0     0     0
2. Replug the drive:
Code:
root@SanOne:~# zpool status -v ssdPoolOne
  pool: ssdPoolOne
 state: ONLINE
  scan: resilvered 1.28M in 0h0m with 0 errors on Fri Aug 22 12:58:54 2014
config:

        NAME                       STATE     READ WRITE CKSUM
        ssdPoolOne                 ONLINE       0     0     0
          raidz2-0                 ONLINE       0     0     0
            c4t500A0751095162C1d0  ONLINE       0     0     0
            c4t500A075109520F01d0  ONLINE       0     0     0
            c4t500A075109520F06d0  ONLINE       0     0     0
            c4t500A075109520F1Cd0  ONLINE       0     0     0
            c4t500A07510953CD84d0  ONLINE       0     0     0
            c4t500A07510953CF6Cd0  ONLINE       0     0     0
            c4t500A07510953CF93d0  ONLINE       0     0     0
            c4t500A0751095E5163d0  ONLINE       0     0     0

errors: No known data errors
Everything seems to work OK. I can even scrub without any issues.
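For the scrub test I just used the standard commands, nothing fancy:
Code:
zpool scrub ssdPoolOne
zpool status -v ssdPoolOne    # finishes with 0 errors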
However, my sources above specify that I should use cfgadm -al and unconfigure the disk when hotswapping. Napp-it also says that one should unconfigure the disk (home > Disks > Hotswap).
This is where the system hangs:
1. Before re-attaching the drive:
Code:
Ap_Id                          Type         Receptacle   Occupant     Condition
c13                            scsi-sas     connected    configured   unknown
c13::w0000000000000000,0       disk-path    connected    configured   unknown
2. After re-attaching the drive:
Code:
c13                            scsi-sas     connected    configured   unknown
c13::w0000000000000000,0       disk-path    connected    configured   unknown
c13::w500a0751095e5163,0       disk-path    connected    unconfigured unknown
Here cfgadm identifies two IDs on the c13 port. Also, the Napp-it hotswap page shows the disk as:
state:ONLINE, busy:unconfigured
All other disks are still state:ONLINE, busy:configured
When I now do another cfgadm -al or a Napp-it home > Disks > Hotswap scan, the system hangs.
I managed to check zpool status and it shows this:
Code:
root@SanOne:~# zpool status -v ssdPoolOne
  pool: ssdPoolOne
 state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-HC
  scan: scrub canceled on Fri Aug 22 13:00:59 2014
config:

        NAME                       STATE     READ WRITE CKSUM
        ssdPoolOne                 ONLINE       0    31     0
          raidz2-0                 ONLINE       0     4     0
            c4t500A0751095162C1d0  ONLINE       0     4     0
            c4t500A075109520F01d0  ONLINE       0     4     0
            c4t500A075109520F06d0  ONLINE       0     4     0
            c4t500A075109520F1Cd0  ONLINE       0     0     0
            c4t500A07510953CD84d0  ONLINE       0     4     0
            c4t500A07510953CF6Cd0  ONLINE       0     4     0
            c4t500A07510953CF93d0  ONLINE       0     4     0
            c4t500A0751095E5163d0  ONLINE       0     7     0

errors: List of errors unavailable (insufficient privileges)
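For reference, the action line in that output suggests the write errors should clear once the devices respond again; I assume that would simply be:
Code:
zpool clear ssdPoolOne
zpool status -v ssdPoolOne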
So basically everything looks to be working until I run the cfgadm rescan. That hangs all terminals, and the only way to recover is to power-cycle the whole SAN.
What am I doing wrong here? It worries me that just by removing and re-attaching a disk I can bring the whole SAN down. All tips on how I can debug this are welcome.
EDIT: The Napp-it log shows:
Code:
Will be unloaded upon reboot.
Forcing update of sd.conf.
p1main, /05_disks and controller/02_hotswap/01_scan/action.pl, line 28
exe: update_drv -f mpt
devfsadm: driver failed to attach: mpt
Warning: Driver (mpt) successfully added to system but failed to attach
p1main, /05_disks and controller/02_hotswap/01_scan/action.pl, line 29
exe: update_drv -f mpt_sas
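If I read that right, napp-it force-reloads the old mpt driver (which fails to attach) and then mpt_sas. To check which driver actually binds the 9211-8i, I assume something like this should work:
Code:
prtconf -D | grep -i mpt        # driver bound to each device node
grep mpt /etc/driver_aliases    # PCI IDs mapped to mpt vs mpt_sas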
Also, when I do cfgadm -c unconfigure c13 nothing happens; c13 still says configured (in the Oracle example it should say unconfigured). Is cfgadm or mpt_sas broken on OmniOS?
EDIT2:
Found some maybe relevant links:
Sun's ZFS file system on Solaris
Configuring ZFS to gracefully deal with pool failures
However, while failmode=continue makes it possible to still run zpool status, the machine itself still hangs: no new SSH sessions can be created, etc.
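For anyone wanting to reproduce this, failmode is just the normal pool property:
Code:
zpool set failmode=continue ssdPoolOne
zpool get failmode ssdPoolOne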
It looks like things work as long as I don't do a cfgadm scan or the Napp-it equivalent :/