OmniOS: adding new vdev to RAIDZ2 pool, 4 hrs later server went down

Discussion in 'Solaris, Nexenta, OpenIndiana, and napp-it' started by methos, Jun 19, 2014.

  1. methos (New Member)

    Subject says it all. The server build is detailed here. Added (4) more identical SAS drives to expand the pool:

    2014-06-17.13:34:13 zpool add ISCSI raidz2 c9t50000C0F01DE465Ed0 c10t50000C0F01299476d0 c11t50000C0F01DE3792d0 c12t50000C0F01DE37AEd0
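    (Worth noting for anyone following along: zpool add accepts -n for a dry run, which prints the layout it would create without touching the pool. Something like the line below, with the same device names, would have shown the new raidz2 vdev before committing. Sketch only.)

    root@san03:~# zpool add -n ISCSI raidz2 c9t50000C0F01DE465Ed0 c10t50000C0F01299476d0 c11t50000C0F01DE3792d0 c12t50000C0F01DE37AEd0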

    About 4:59pm that same day, my XenServers started texting me that the iSCSI SR was down. Connecting via SSH took 5 minutes, and any zfs/zpool command hung. I had to power cycle the box; connected to the KVM, it just kept rebooting, reporting that a drive had failed. Booted into another environment and was able to log in, but again, any zfs/zpool command hung for 2-4 minutes. Identified the drive, one of the new ones, and offlined it. Reseated the drive and brought it back online; no change. I have OmniTI support, but they just had me run some diags...
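    (For reference, the offline/reseat/online sequence described above maps to roughly these commands; a sketch, the exact invocations may have differed, device name taken from the status output below:)

    root@san03:~# zpool offline ISCSI c9t50000C0F01DE465Ed0
    root@san03:~# zpool online ISCSI c9t50000C0F01DE465Ed0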

    Has anyone else had a drive failure take a Solaris box down? This was not a root-pool drive.


    root@san03:~# zpool status -v
      pool: ISCSI
     state: DEGRADED
    status: One or more devices has been taken offline by the administrator.
            Sufficient replicas exist for the pool to continue functioning in a
            degraded state.
    action: Online the device using 'zpool online' or replace the device with
            'zpool replace'.
      scan: resilvered 3.50K in 0h0m with 0 errors on Tue Jun 17 20:36:45 2014
    config:

            NAME                       STATE     READ WRITE CKSUM
            ISCSI                      DEGRADED     0     0     0
              raidz2-0                 ONLINE       0     0     0
                c5t50000C0F02792D2Ad0  ONLINE       0     0     0
                c6t50000C0F02795CE6d0  ONLINE       0     0     0
                c7t50000C0F02795142d0  ONLINE       0     0     0
                c8t50000C0F02CF836Ad0  ONLINE       0     0     0
              raidz2-2                 DEGRADED     0     0     0
                c9t50000C0F01DE465Ed0  OFFLINE      0     0     0
                c10t50000C0F01299476d0 ONLINE       0     0     0
                c11t50000C0F01DE3792d0 ONLINE       0     0     0
                c12t50000C0F01DE37AEd0 ONLINE       0     0     0
            logs
              c1t1d0                   ONLINE       0     0     0

    errors: No known data errors
     
    #1
  2. gea (Well-Known Member)

    Even if you take a disk offline, it can still block the controller. You should physically remove the disk (with a SAS2 controller you can hot-remove it), then test the disk with a low-level tool from the disk manufacturer.

    Insert a new disk and do a disk replace (missing/faulted -> new).
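    The replace itself would look roughly like this (a sketch; the failed device name is taken from the status output above, and <new-disk> is a placeholder for whatever name the replacement enumerates as):

    zpool replace ISCSI c9t50000C0F01DE465Ed0 <new-disk>
    zpool status ISCSI

    zpool status then shows the resilver progress onto the new disk.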
     
    #2
  3. methos (New Member)

    Well, that's what we did: we removed the drive physically. But why would rebooting the box into the default BE keep reboot-looping? In my 20 years of experience with servers I've never had a NON-ROOT drive cause this... and I've seen a lot. I have OmniTI looking into it... their answer: upgrade to the latest stable (sigh).
     
    #3
  4. s0lid (Active Member)

    This is because Solaris depends too heavily on ZFS: if the ZFS processing at boot hangs because a pool is doing stupid stuff, the whole server won't actually boot. Even if the server does boot, you cannot log in, because normal shell sessions also end up waiting on the pool status.

    I kind of learned this the hard way last Xmas, when I pulled a pool's worth of disks from a running server without exporting the pool first and then left my apartment for about two weeks...
    SMB worked at snail-mail speeds, i.e. 10-30 KB/s over VPN. Thankfully COMSTAR didn't die, as the server provided storage to my ESXi cluster. You couldn't log in to the server at all, via SSH or over IPMI.
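    If the box does come up far enough to give you a console, the SMF services are a reasonable place to look for what is wedged during mounting (a sketch; filesystem/local is the stock service that mounts local filesystems on OmniOS, output will vary):

    svcs -xv
    svcs -l svc:/system/filesystem/local:default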
     
    #4
  5. gea (Well-Known Member)

    I expect the major issue is that OmniOS tries to automount the pool on startup. If there is a problem with a hanging/blocking disk, this can stall bootup for quite a long time until a timeout occurs. You can reduce the problem with TLER disks, which give up on error recovery after a shorter timeout, but that only helps if the disk electronics remain healthy and the fault is purely mechanical.

    Nexenta is working on better error handling for such cases. I hope we will see this in OmniOS at some point as well.
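    One workaround for a pool that blocks bootup is to keep it from being auto-imported and bring it in manually once the system is up. A sketch, assuming the default cachefile location; the pool is only auto-imported at boot because it is listed in /etc/zfs/zpool.cache:

    (from a rescue environment or alternate BE, with the affected BE's root filesystem mounted)
    mv /etc/zfs/zpool.cache /etc/zfs/zpool.cache.bak
    (reboot; once the faulty disk is pulled, import by hand)
    zpool import ISCSI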
     
    #5