Can't take failed disk offline - no such device in pool?

pocomo

New Member
Aug 7, 2013
Hi Guys -

I'm fairly clueless about ZFS and looking for some help. We have an OpenSolaris file server which has been running well for 3.5 years but finally has a failed disk. I'm trying to offline the disk so I can replace it, but the zpool command is giving me an error.

Note that it's not reporting errors here because I cleared them (in frustration) trying to make things work:


root@lincoln:~# zpool status
  pool: p1
 state: ONLINE
  scan: scrub canceled on Wed Aug 7 20:43:52 2013
config:

        NAME        STATE     READ WRITE CKSUM
        p1          ONLINE       0     0     0
          c1t2d0    ONLINE       0     0     0
          c1t3d0    ONLINE       0     0     0
          c1t4d0    ONLINE       0     0     0
          c1t5d0    ONLINE       0     0     0
        logs
          c1t1d0    ONLINE       0     0     0

errors: No known data errors

  pool: syspool
 state: ONLINE
  scan: scrub repaired 0 in 0h1m with 0 errors on Fri Apr 5 17:45:48 2013
config:

        NAME          STATE     READ WRITE CKSUM
        syspool       ONLINE       0     0     0
          c1t0d0s0    ONLINE       0     0     0

errors: No known data errors


The bad device is 'c1t3d0', but for some reason I cannot take it offline:

root@lincoln:~# zpool offline p1 c1t3d0
cannot offline c1t3d0: no such pool or dataset

It seems like it doesn't recognize that ID as a valid device name, so I tried adding a slice number:

root@lincoln:~# zpool offline p1 c1t3d0s0
cannot offline c1t3d0s0: no such device in pool
root@lincoln:~# zpool offline p1 c1t3d0s1
cannot offline c1t3d0s1: no such device in pool
root@lincoln:~# zpool offline p1 c1t3d0s2
cannot offline c1t3d0s2: no such device in pool


I am afraid to go ahead with replacing the disk: the server is still up, and if things get screwed up further with unrecognized devices we may lose data.

'zdb' gives some interesting output, but none of it has helped me. I tried using the GUID for the disk and the /dev/dsk name from zdb in the offline command, and neither worked:

root@lincoln:~# zdb
p1:
version: 26
name: 'p1'
state: 0
txg: 4648880
pool_guid: 4968018013301160962
hostid: 8985187
hostname: 'lincoln'
vdev_children: 5
vdev_tree:

type: 'root'
id: 0
guid: 4968018013301160962

children[0]:
type: 'disk'
id: 0
guid: 13134804042246883594
path: '/dev/dsk/c1t2d0s0'
devid: 'id1,sd@n6000c29521146953aa8896db510f7245/a'
phys_path: '/pci@0,0/pci15ad,1976@10/sd@2,0:a'
whole_disk: 1
metaslab_array: 30
metaslab_shift: 34
ashift: 9
asize: 1997146423296
is_log: 0
DTL: 65
create_txg: 4

children[1]:
type: 'disk'
id: 1
guid: 11536494763799534620
path: '/dev/dsk/c1t3d0s0'
devid: 'id1,sd@n6000c29035e47d3b4dac7f9c56d8ee85/a'
phys_path: '/pci@0,0/pci15ad,1976@10/sd@3,0:a'
whole_disk: 1
metaslab_array: 28
metaslab_shift: 34
ashift: 9
asize: 1997146423296
is_log: 0
DTL: 64
create_txg: 4

children[2]:
type: 'disk'
id: 2
guid: 14498981475727854810
path: '/dev/dsk/c1t4d0s0'
devid: 'id1,sd@n6000c291e255a1ffb0d60be9506ec2a7/a'
phys_path: '/pci@0,0/pci15ad,1976@10/sd@4,0:a'
whole_disk: 1
metaslab_array: 27
metaslab_shift: 34
ashift: 9
asize: 1997146423296
is_log: 0
DTL: 62
create_txg: 4

children[3]:
type: 'disk'
id: 3
guid: 13953919250895671919
path: '/dev/dsk/c1t5d0s0'
devid: 'id1,sd@n6000c29d8d53e0d9102b282e3fe90e29/a'
phys_path: '/pci@0,0/pci15ad,1976@10/sd@5,0:a'
whole_disk: 1
metaslab_array: 25
metaslab_shift: 34
ashift: 9
asize: 1997146423296
is_log: 0
DTL: 63
create_txg: 4

children[4]:
type: 'disk'
id: 4
guid: 12386275127038724319
path: '/dev/dsk/c1t1d0s0'
devid: 'id1,sd@n6000c294669285e7430f3a2b0d9f3da6/a'
phys_path: '/pci@0,0/pci15ad,1976@10/sd@1,0:a'
whole_disk: 1
metaslab_array: 24
metaslab_shift: 28
ashift: 9
asize: 38641336320
is_log: 1
DTL: 61
create_txg: 4


Any suggestions?
 

sotech

Member
Jul 13, 2011
Australia
Perhaps try exporting and importing the pool? If there's any chance of data loss if things go south, though, make a backup while everything's still running fine.
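
Something along these lines, using the pool name from your output (untested on your setup, so just a sketch; note the pool will be unavailable while it's exported):

zpool export p1
zpool import p1
# if import can't find the pool, point it at the device directory
zpool import -d /dev/dsk p1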
 

pocomo

New Member
Aug 7, 2013
Thanks for the response @sotech.

Is there any downside risk to doing 'zpool export' / 'zpool import'?

I will try a backup after I get through a scrub operation. Unfortunately, the last backup attempt ran so slowly that it was unable to complete in less-than-infinite time. But the server actually seems to be performing better after a reboot, so there is hope.
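
If the scrub and reboot keep things stable, one option I'm considering for the backup is a snapshot plus zfs send to another box, roughly like this (the receiving host 'backuphost' and target pool 'backup' are just placeholders for whatever I can scrape together):

# snapshot the whole pool, then stream it to another machine
zfs snapshot -r p1@pre-replace
zfs send -R p1@pre-replace | ssh backuphost zfs receive -F backup/p1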
 

gea

Well-Known Member
Dec 31, 2010
DE
Oh My Goodness!

You have built your pool from basic devices without any redundancy. If any disk fails, all data is lost.
You cannot offline a disk, because without redundancy the pool would be history!

Do a complete backup immediately and rebuild your pool with redundancy (i.e. RAID-Z1..3).
Then you can offline a disk. But I would just do a disk replace, faulted -> new.
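
For example (the new disk names below are only placeholders, use whatever you add):

# replace the failing disk in place with a new one (resilvers old -> new)
zpool replace p1 c1t3d0 c1t6d0

# or, only after the backup is verified: recreate the pool with redundancy
zpool destroy p1
zpool create p1 raidz c1t2d0 c1t4d0 c1t5d0 c1t6d0 log c1t1d0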

I would also update to a newer OS like OmniOS.
 

pocomo

New Member
Aug 7, 2013
Oh My Goodness!

You have built your pool from basic devices without any redundancy. If any disk fails, all data is lost.
You cannot offline a disk, because without redundancy the pool would be history!

I would also update to a newer OS like OmniOS.
Thanks gea, this confirms my growing fear that it was not built correctly in the first place.

The glimmer of hope is that this is all running inside ESXi. I should be able to copy the VMDK from the flaky disk to a new disk to fix the hardware issue without having to offline or zpool replace anything. That will buy me time to build a new server with redundancy.
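
For anyone following along, the plan is roughly this from the ESXi console (the datastore paths below are made-up examples for my layout):

# clone the VMDK off the flaky datastore onto a healthy one
vmkfstools -i /vmfs/volumes/flaky-ds/lincoln/lincoln_2.vmdk /vmfs/volumes/new-ds/lincoln/lincoln_2.vmdk

Then I point the VM's virtual disk at the new copy, and ZFS is none the wiser.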

Three quick questions:

- I still plan to run this under a hypervisor (ESXi). Is this OK if I'm not concerned with 'ultimate' performance?
- You would recommend OmniOS over straight OpenIndiana?
- Is dedupe stable enough now to depend on? Back in 2009, when we set this one up, it was not.

Thank you very much for chiming in.
 

pocomo

New Member
Aug 7, 2013
@gea

I just found your recent post about "Affordable High-end Storage" which basically answered all my questions above. Thank you again!

I will post again when I resolve this situation, or disaster strikes, whichever comes first.
 

pocomo

New Member
Aug 7, 2013
@gea

I will post again when I resolve this situation, or disaster strikes, whichever comes first.
Well, disaster struck a few days after posting this message. Over that weekend I tried to complete a backup (mostly successful before running out of time), followed by a couple of attempts to clone the VMDK for the virtual disk that was stored on the failing physical drive. That operation was never able to complete due to increasingly lengthy disk timeouts, and I was never able to bring the pool back online after that point.

So I finally bit the bullet and destroyed the pool, replaced the disk, built a new RAID-Z pool, and restored from backup (which took days). Our file shares pretty much survived, but the VM storage was incompletely backed up, and I ran into trouble with permissions while trying to share out the new pool over NFS. It's been a week, and our critical VMs are temporarily running off local disk on the VM servers.

I have constructed a new storage server using a refurb Dell FS12-NV7 from eBay; with 64G of RAM onboard it's significantly better provisioned than the 8G server it's replacing. It's running ESXi 4.1U3, OmniOS and napp-it, and I synced the file share data to it over the weekend, so it will be put into service today. I believe the pool construction looks a lot better this time around ;)
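
For reference, a pool with the layout below would be created with something close to this (a sketch using the device names from the status output, not an exact transcript of what I ran):

zpool create p1 raidz2 c2t10d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t8d0 c2t9d0 log c2t1d0 cache c2t2d0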

root@lincoln:/p1/vmstore# zpool status
  pool: p1
 state: ONLINE
  scan: scrub repaired 0 in 1h19m with 0 errors on Sun Aug 18 09:42:15 2013
config:

        NAME         STATE     READ WRITE CKSUM
        p1           ONLINE       0     0     0
          raidz2-0   ONLINE       0     0     0
            c2t10d0  ONLINE       0     0     0
            c2t3d0   ONLINE       0     0     0
            c2t4d0   ONLINE       0     0     0
            c2t5d0   ONLINE       0     0     0
            c2t6d0   ONLINE       0     0     0
            c2t8d0   ONLINE       0     0     0
            c2t9d0   ONLINE       0     0     0
        logs
          c2t1d0     ONLINE       0     0     0
        cache
          c2t2d0     ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c2t0d0s0  ONLINE       0     0     0

errors: No known data errors


Thanks, folks, for your helpful advice. I will be building a replica server and setting up ZFS replication soon, so I may have more questions for the group.
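
The rough replication plan, for anyone curious (the host name 'replica' and target pool 'backup' are placeholders; in practice this will probably be scripted or handled by napp-it):

# initial full copy
zfs snapshot -r p1@rep-1
zfs send -R p1@rep-1 | ssh replica zfs receive -Fu backup/p1
# afterwards, periodic incrementals between snapshots
zfs snapshot -r p1@rep-2
zfs send -R -i p1@rep-1 p1@rep-2 | ssh replica zfs receive -Fu backup/p1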
 

pocomo

New Member
Aug 7, 2013
You should mirror your log drive for safety purposes; cache doesn't need mirroring, but logs should be mirrored.
Sorry, I was away on holiday.

According to ZFS – To Mirror or Not to Mirror the ZIL | RackTop Systems, I should be OK: this is not a super-critical server, and I think the chances of losing both memory and the SSD-based ZIL at the same time are very low. If I lose the ZIL, write performance will drop until it can be replaced, but I can live with that. Plus, I have no more SATA ports available!
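
If I do free up a port later, my understanding is that a single log device can be turned into a mirror in place with zpool attach, along these lines (c2t11d0 being a hypothetical second SSD):

# attach a second device to the existing log device to form a mirrored log
zpool attach p1 c2t1d0 c2t11d0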

Thanks for reviewing my config. Replacement server has been up and working well for about 3 weeks now, very happy with it.