OmniOS: adding new vdev to RAIDZ2 pool, 4 hrs later server went down

Discussion in 'Solaris, Nexenta, OpenIndiana, and napp-it' started by methos, Jun 19, 2014.

  1. methos (New Member)

    Subject says it all. The server build is detailed here. Added (4) more identical SAS drives to expand the pool:

    2014-06-17.13:34:13 zpool add ISCSI raidz2 c9t50000C0F01DE465Ed0 c10t50000C0F01299476d0 c11t50000C0F01DE3792d0 c12t50000C0F01DE37AEd0
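    (Worth noting for anyone following along: zpool add accepts -n for a dry run, which prints the layout it would create without touching the pool. Something like the line below, with the same device names, would have shown the new raidz2 vdev before committing. Sketch only.)

    root@san03:~# zpool add -n ISCSI raidz2 c9t50000C0F01DE465Ed0 c10t50000C0F01299476d0 c11t50000C0F01DE3792d0 c12t50000C0F01DE37AEd0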

    About 4:59pm that same day, my XenServers started texting me that the iSCSI SR was down. Connecting via SSH took 5 minutes, and any zfs/zpool command hung. I had to power cycle the box; connected to the KVM, it just kept rebooting, reporting that a drive had failed. Booted into another environment and was able to log in, but again, any zfs/zpool command hung for 2-4 minutes. Identified the drive, one of the new ones, and offlined it. Reseated the drive and brought it back online; no change. I have OmniTI support, but they just had me run some diags...
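    (For reference, the offline/reseat/online sequence described above maps to roughly these commands; a sketch, the exact invocations may have differed, device name taken from the status output below:)

    root@san03:~# zpool offline ISCSI c9t50000C0F01DE465Ed0
    root@san03:~# zpool online ISCSI c9t50000C0F01DE465Ed0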

    Has anyone else had a drive failure take a Solaris box down? This was not a root-pool drive.


    root@san03:~# zpool status -v
      pool: ISCSI
     state: DEGRADED
    status: One or more devices has been taken offline by the administrator.
            Sufficient replicas exist for the pool to continue functioning in a
            degraded state.
    action: Online the device using 'zpool online' or replace the device with
            'zpool replace'.
      scan: resilvered 3.50K in 0h0m with 0 errors on Tue Jun 17 20:36:45 2014
    config:

            NAME                       STATE     READ WRITE CKSUM
            ISCSI                      DEGRADED     0     0     0
              raidz2-0                 ONLINE       0     0     0
                c5t50000C0F02792D2Ad0  ONLINE       0     0     0
                c6t50000C0F02795CE6d0  ONLINE       0     0     0
                c7t50000C0F02795142d0  ONLINE       0     0     0
                c8t50000C0F02CF836Ad0  ONLINE       0     0     0
              raidz2-2                 DEGRADED     0     0     0
                c9t50000C0F01DE465Ed0  OFFLINE      0     0     0
                c10t50000C0F01299476d0 ONLINE       0     0     0
                c11t50000C0F01DE3792d0 ONLINE       0     0     0
                c12t50000C0F01DE37AEd0 ONLINE       0     0     0
            logs
              c1t1d0                   ONLINE       0     0     0

    errors: No known data errors
     
    #1
  2. gea (Well-Known Member)

    Even if you take a disk offline, it can still block the controller. You should physically remove the disk (with a SAS2 controller you can hot-remove it), then test the disk with a low-level tool from the disk manufacturer.

    Insert a new disk and do a disk replace (missing/faulted -> new).
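    The replace itself would look roughly like this (a sketch; the failed device name is taken from the status output above, and <new-disk> is a placeholder for whatever name the replacement enumerates as):

    zpool replace ISCSI c9t50000C0F01DE465Ed0 <new-disk>
    zpool status ISCSI

    zpool status then shows the resilver progress onto the new disk.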
     
    #2
  3. methos (New Member)

    Well, that's what we did: we removed the drive physically. But why would rebooting the box into the default BE keep reboot-looping? In my 20 years of experience with servers I've never had a NON-ROOT drive cause this... and I've seen a lot. I have OmniTI looking into it... their answer: upgrade to the latest stable (sigh).
     
    #3
  4. s0lid (Active Member)

    This is because Solaris depends too heavily on ZFS: if the ZFS processing at boot hangs because a pool is doing stupid stuff, the whole server won't actually boot. Even if the server does boot, you cannot log in, because normal shell sessions also end up waiting on the pool status.

    I kind of learned this the hard way last Xmas, when I pulled a pool's worth of disks from a running server without exporting the pool first and then left my apartment for about two weeks...
    SMB worked at snail-mail speeds, i.e. 10-30 KB/s over VPN. Thankfully COMSTAR didn't die, as the server provided storage to my ESXi cluster. You couldn't log in to the server at all, via SSH or over IPMI.
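    If the box does come up far enough to give you a console, the SMF services are a reasonable place to look for what is wedged during mounting (a sketch; filesystem/local is the stock service that mounts local filesystems on OmniOS, output will vary):

    svcs -xv
    svcs -l svc:/system/filesystem/local:default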
     
    #4
  5. gea (Well-Known Member)

    I expect the major issue is that OmniOS tries to automount the pool on startup. If there is a problem with a hanging/blocking disk, this can stall bootup for quite a long time until a timeout occurs. You can reduce the problem with TLER disks, which give up on error recovery after a shorter timeout, but that only helps if the disk electronics remain healthy and the fault is purely mechanical.

    Nexenta is working on better error handling for such cases. I hope we will see this in OmniOS at some point as well.
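    One workaround for a pool that blocks bootup is to keep it from being auto-imported and bring it in manually once the system is up. A sketch, assuming the default cachefile location; the pool is only auto-imported at boot because it is listed in /etc/zfs/zpool.cache:

    (from a rescue environment or alternate BE, with the affected BE's root filesystem mounted)
    mv /etc/zfs/zpool.cache /etc/zfs/zpool.cache.bak
    (reboot; once the faulty disk is pulled, import by hand)
    zpool import ISCSI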
     
    #5