Can't remove SLOG device

soundscribe · Sep 7, 2016

I'm running an all-in-one using ESXi 5.0, with OmniOS as a virtual machine that shares the ZFS storage back via NFS. When I built the machine, I had the bright idea of getting a dedicated SSD to use for the ZIL.

The SSD failed, and while the pool is still accessible and appears to be working fine, I cannot remove the SLOG from the pool.

Running:

pfexec zpool remove deep 11742054345802958404

returns without an error, but the device does not get removed. Trying to remove the device "c3t5..." doesn't work either -- device doesn't exist.

I've removed the failed drive from the system and rebooted, and I still see the same behavior. Short of rebuilding the pool, what else can be done? The OmniOS version is current as of about 2 weeks ago...

The pool shows as:

dave@flanders:/export/home/dave$ pfexec zpool status
  pool: deep
 state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: ZFS-8000-2Q
  scan: scrub repaired 0 in 7h42m with 0 errors on Wed Sep 7 08:42:26 2016
config:

        NAME                       STATE     READ WRITE CKSUM
        deep                       DEGRADED     0     0     0
          mirror-0                 ONLINE       0     0     0
            c3t50014EE2B66D3B97d0  ONLINE       0     0     0
            c3t50014EE20BC3FFF1d0  ONLINE       0     0     0
        logs
          11742054345802958404     UNAVAIL      0     0     0  was /dev/dsk/c3t5001517972EC11BAd0s0

errors: No known data errors

  pool: rpool
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Wed Sep 7 01:00:44 2016
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c2t0d0s0  ONLINE       0     0     0

errors: No known data errors
dave@flanders:/export/home/dave$ uname -a
SunOS flanders 5.11 omnios-b13298f i86pc i386 i86pc
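
Because `zpool remove` returns silently here, the only way to tell whether it worked is to look at `zpool status` again. A minimal sketch of that check in Python, assuming the status text has already been captured into a string; the function name `log_devices` and the trimmed `SAMPLE` text are illustrative and not part of any ZFS tooling:

```python
def log_devices(status_text, pool):
    """Return [(device, state)] listed under 'logs' for one pool.

    Works on `zpool status` text with or without its original indentation,
    because every line is stripped and split on whitespace.
    """
    devices = []
    current_pool = None
    in_config = False
    in_logs = False
    for raw in status_text.splitlines():
        line = raw.strip()
        if line.startswith("pool:"):
            current_pool = line.split(":", 1)[1].strip()
            in_config = in_logs = False
            continue
        if current_pool != pool:
            continue
        if line.startswith("config:"):
            in_config = True
            continue
        if line.startswith("errors:"):
            in_config = in_logs = False
            continue
        if not in_config or not line:
            continue
        fields = line.split()
        # Section headers inside the config table have no state column.
        if fields[0] in ("logs", "cache", "spares"):
            in_logs = fields[0] == "logs"
            continue
        if in_logs and len(fields) >= 2:
            devices.append((fields[0], fields[1]))
    return devices


# Trimmed copy of the status output posted above.
SAMPLE = """\
pool: deep
state: DEGRADED
config:

NAME STATE READ WRITE CKSUM
deep DEGRADED 0 0 0
mirror-0 ONLINE 0 0 0
c3t50014EE2B66D3B97d0 ONLINE 0 0 0
c3t50014EE20BC3FFF1d0 ONLINE 0 0 0
logs
11742054345802958404 UNAVAIL 0 0 0 was /dev/dsk/c3t5001517972EC11BAd0s0

errors: No known data errors

pool: rpool
state: ONLINE
config:

NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
c2t0d0s0 ONLINE 0 0 0

errors: No known data errors
"""

print(log_devices(SAMPLE, "deep"))
```

If the list is still non-empty after a `zpool remove`, the removal did not take effect. With a mirrored log, the `mirror-N` line and its member disks would all show up in the list.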

gea · Sep 8, 2016

You can fix such a situation with a disk replace or a disk remove.
Your command to remove an slog from a pool is correct.

Try again: log in as root and re-enter the command without pfexec.

soundscribe · Sep 8, 2016

Thanks, gea, for responding. I tried logging in as root, but the result is the same. When removing the SLOG device, the command doesn't return an error; it just goes back to the command prompt. I also tried using the /dev/dsk name, with the same result. I'm at a loss as to how to fix this short of rebuilding the pool...

root@flanders:/root# zpool status
  pool: deep
 state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: ZFS-8000-2Q
  scan: scrub repaired 0 in 7h42m with 0 errors on Wed Sep 7 08:42:26 2016
config:

        NAME                       STATE     READ WRITE CKSUM
        deep                       DEGRADED     0     0     0
          mirror-0                 ONLINE       0     0     0
            c3t50014EE2B66D3B97d0  ONLINE       0     0     0
            c3t50014EE20BC3FFF1d0  ONLINE       0     0     0
        logs
          11742054345802958404     UNAVAIL      0     0     0  was /dev/dsk/c3t5001517972EC11BAd0s0

errors: No known data errors

  pool: rpool
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Wed Sep 7 01:00:44 2016
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c2t0d0s0  ONLINE       0     0     0

errors: No known data errors
root@flanders:/root# zpool remove deep 11742054345802958404
root@flanders:/root#
root@flanders:/root# zpool clear deep
root@flanders:/root# zpool status
  pool: deep
 state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: ZFS-8000-2Q
  scan: scrub repaired 0 in 7h42m with 0 errors on Wed Sep 7 08:42:26 2016
config:

        NAME                       STATE     READ WRITE CKSUM
        deep                       DEGRADED     0     0     0
          mirror-0                 ONLINE       0     0     0
            c3t50014EE2B66D3B97d0  ONLINE       0     0     0
            c3t50014EE20BC3FFF1d0  ONLINE       0     0     0
        logs
          11742054345802958404     UNAVAIL      0     0     0  was /dev/dsk/c3t5001517972EC11BAd0s0

errors: No known data errors

  pool: rpool
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Wed Sep 7 01:00:44 2016
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c2t0d0s0  ONLINE       0     0     0

errors: No known data errors
root@flanders:/root#
root@flanders:/root# zpool remove deep /dev/dsk/c3t5001517972EC11BAd0s0
root@flanders:/root# zpool status
  pool: deep
 state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: ZFS-8000-2Q
  scan: scrub repaired 0 in 7h42m with 0 errors on Wed Sep 7 08:42:26 2016
config:

        NAME                       STATE     READ WRITE CKSUM
        deep                       DEGRADED     0     0     0
          mirror-0                 ONLINE       0     0     0
            c3t50014EE2B66D3B97d0  ONLINE       0     0     0
            c3t50014EE20BC3FFF1d0  ONLINE       0     0     0
        logs
          11742054345802958404     UNAVAIL      0     0     0  was /dev/dsk/c3t5001517972EC11BAd0s0

errors: No known data errors

  pool: rpool
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Wed Sep 7 01:00:44 2016
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c2t0d0s0  ONLINE       0     0     0

errors: No known data errors
root@flanders:/root#

gea · Sep 9, 2016

The remaining options are
- reboot after a power off and retry
- disk replace
- export/import (may produce another problem due to the missing slog)

I am not aware of a general problem around disk remove of a faulted slog.
Rebuilding the pool is always the last option.

dicecca112 · Sep 9, 2016

What version of ZFS are you running?

Code:

zpool get version <pool>

If it's not version 19 or newer, then you can't remove the log device.
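
The version rule above can be written down as a small check. This is only a sketch in Python: `supports_log_removal` is a made-up helper name, the sample strings are hand-written to resemble `zpool get version` output, and the handling of `-` rests on the assumption that pools using feature flags report their legacy version that way.

```python
def supports_log_removal(version_output):
    """Decide from `zpool get version <pool>` text whether log removal is possible.

    Log device removal arrived with pool version 19.
    """
    for line in version_output.splitlines():
        fields = line.split()
        # Expected columns: NAME PROPERTY VALUE SOURCE
        if len(fields) >= 3 and fields[1] == "version":
            value = fields[2]
            if value == "-":
                # Assumption: '-' means a feature-flag pool, which is newer
                # than every numbered version and so includes log removal.
                return True
            return int(value) >= 19
    raise ValueError("no 'version' property found in the given text")


# Hand-written sample resembling the command's output for a version 28 pool.
print(supports_log_removal("NAME PROPERTY VALUE SOURCE\nlfpool version 28 local\n"))
```

A `True` result only rules out the version as the cause; it does not mean the removal will succeed.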

wallenford · Nov 20, 2016

I have encountered the same situation in my zpool: I just cannot remove the slog from the pool,

with the same symptoms as described in "zpool remove on mirrored logs fails silently #1422"

* Symptoms:
* - Removing the log device with: zpool remove <pool> <slog_device>
* has been run with no errors, but device still shows in zpool status -v and zpool iostat -v
* with status: ONLINE
* - After that examining the output of: zdb -C <poolname>
* the slog device shows the property: removing: 1
* - Although the slog device is ONLINE, no writes are being sent to the slog, causing all sync io to go to
* other log devices if present or the main pool vdevs.

The ZFS version is 28, but no matter how I import the pool with -fm, the log device still exists with no I/O.
I even tried this fix, but only got a core dump and a reboot.

Can anyone help with this??

Config:
napp-it Pro + OmniOS 151018 on a Dell R520 with 4 mirror vdevs of 2 x 3TB drives each,
and 2 Plextor SSDs attached via an LSI 1064E

T_Minus · Nov 20, 2016

Weird.

I've had no problem adding and removing SLOG devices in the last ~2 years.

sorry I can't be more help!!

pricklypunter · Nov 20, 2016

I would think adding a replacement slog device, letting it rebuild and get all happy and fuzzy again, and then removing it should work. I have only tried doing this a couple of times during experiments, but it worked perfectly for me.

wallenford · Nov 21, 2016

It has always happened on my two Dell R520s over the last two years.

I can only remove the slog device while the pool is new;
after a while, I can never successfully remove it.

To avoid that, I converted one server from all-in-one passthrough to a
standalone napp-it server. It didn't work out; I still got the stuck slog!!

wallenford · Nov 21, 2016

I tried to replace the slog device, but got this reply:

cannot replace c4t62d0 with c6t4d0: devices have different sector alignment

Which is interesting, because the physical-block-size setting in sd.conf is the same:

"ATA PLEXTOR PX-256M5", "physical-block-size:4096,cache-nonvolatile:true,throttle-max:32,disksort:false",
"ATA PLEXTOR PX-128M6", "physical-block-size:4096,cache-nonvolatile:true,throttle-max:32,disksort:false",
"ATA PLEXTOR PX-256M6", "physical-block-size:4096,cache-nonvolatile:true,throttle-max:32,disksort:false";

pricklypunter · Nov 21, 2016

I'm still very green on ZFS, and there are others here who have far more experience with the finer details than I do. I could be way wrong here, but I think this might have something to do with the ashift value having been set, or even auto-detected and set. As I remember, this setting is persistent. Your replacement disk is likely reporting a different sector size. I seem to remember you can force 512-byte sectors by using an ashift value of 9, but only try this if you have a complete backup to go to if it all goes sideways.

Not much help I know, maybe someone else has a better idea?

wallenford · Nov 22, 2016

pricklypunter said:
Not much help I know, maybe someone else has a better idea?

Thanks for help anyway !!

I've forced the ashift of the slog device to 12 (4096) in sd.conf,
but I don't know why it's changing (or not?) to ashift=9.

pricklypunter · Nov 22, 2016

Did forcing it allow you to add the new slog back to the pool?

wallenford · Nov 22, 2016

I mean, when I created the zpool,
I had already modified /kernel/drv/sd.conf to force ashift=12 before adding the slog,
but I don't know whether it worked, or whether something happened after I tried to remove the slog.

pricklypunter · Nov 22, 2016

I still suspect the issue is that the disk you are trying to add, is reporting a higher ashift value than the pool currently has. If I'm understanding you correctly, when you built the pool originally, you set ashift at 12 before adding your slog device, but if your new one is reporting ashift at 13, for example, not 4K but 8K sectors, you will have a problem. As I remember you can't add a disk with a higher ashift value than what the pool currently uses, and it also can't be changed afterward, at least not without moving the data off the pool, destroying it and doing it over again with the new ashift value. What command are you using to replace the old one with the new one? Have you checked the pool to see what ashift value it currently has?
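
For reference, the ashift values mentioned here are just base-2 exponents of the sector size. A small Python sketch of the arithmetic (the helper names `sector_size` and `ashift_for` are made up for illustration):

```python
def sector_size(ashift):
    """Sector size in bytes for a given ashift: 2 ** ashift."""
    return 1 << ashift


def ashift_for(sector_bytes):
    """Inverse mapping; the sector size must be a power of two."""
    if sector_bytes <= 0 or sector_bytes & (sector_bytes - 1):
        raise ValueError("sector size must be a positive power of two")
    return sector_bytes.bit_length() - 1


for shift in (9, 12, 13):
    print(shift, sector_size(shift))
```

So ashift 9 corresponds to 512-byte sectors, 12 to 4K, and 13 to 8K. Whether the replacement SSD really reports something other than what the existing vdev uses is not established at this point in the thread.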

wallenford · Nov 22, 2016

I agree that the failed replace is caused by the ashift difference.

But even a successful replacement still doesn't fix the stuck slog problem.
The device still gets no I/O even after it's replaced; I know because I've done that before @@

pricklypunter · Nov 22, 2016

Might it be worth running zdb on the pool and seeing what it reports for ashift and see if any other errors are being reported?

wallenford · Nov 22, 2016

zdb output:

version: 28
name: 'lfpool'
state: 0
txg: 98400
pool_guid: 855304150347047243
hostid: 1943544519
hostname: 'zfs01'
vdev_children: 5
vdev_tree:
    type: 'root'
    id: 0
    guid: 855304150347047243
    children[0]:
        type: 'mirror'
        id: 0
        guid: 5606181845343030696
        metaslab_array: 35
        metaslab_shift: 34
        ashift: 12
        asize: 3000579653632
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 10105136713486591048
            path: '/dev/dsk/c1t0d1s0'
            devid: 'id1,sd@n50014ee604cb970f/a'
            phys_path: '/pci@0,0/pci8086,e02@1/pci1028,1f51@0/sd@0,1:a'
            whole_disk: 1
            create_txg: 4
        children[1]:
            type: 'disk'
            id: 1
            guid: 7724209816156984522
            path: '/dev/dsk/c1t1d1s0'
            devid: 'id1,sd@n5000cca225e3c0dd/a'
            phys_path: '/pci@0,0/pci8086,e02@1/pci1028,1f51@0/sd@1,1:a'
            whole_disk: 1
            create_txg: 4
    children[1]:
        type: 'mirror'
        id: 1
        guid: 6032537474633327085
        metaslab_array: 33
        metaslab_shift: 34
        ashift: 12
        asize: 3000579653632
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 16390446207703533043
            path: '/dev/dsk/c1t2d1s0'
            devid: 'id1,sd@n50014ee604cbb2c4/a'
            phys_path: '/pci@0,0/pci8086,e02@1/pci1028,1f51@0/sd@2,1:a'
            whole_disk: 1
            create_txg: 4
        children[1]:
            type: 'disk'
            id: 1
            guid: 12033065879898295788
            path: '/dev/dsk/c1t3d1s0'
            devid: 'id1,sd@n5000cca225e5eaba/a'
            phys_path: '/pci@0,0/pci8086,e02@1/pci1028,1f51@0/sd@3,1:a'
            whole_disk: 1
            create_txg: 4
    children[2]:
        type: 'mirror'
        id: 2
        guid: 6299752295000376478
        metaslab_array: 32
        metaslab_shift: 34
        ashift: 12
        asize: 3000579653632
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 17926222897294719945
            path: '/dev/dsk/c1t4d1s0'
            devid: 'id1,sd@n50014ee6af762fbc/a'
            phys_path: '/pci@0,0/pci8086,e02@1/pci1028,1f51@0/sd@4,1:a'
            whole_disk: 1
            create_txg: 4
        children[1]:
            type: 'disk'
            id: 1
            guid: 7055628440860584055
            path: '/dev/dsk/c1t5d1s0'
            devid: 'id1,sd@n5000cca225e5c677/a'
            phys_path: '/pci@0,0/pci8086,e02@1/pci1028,1f51@0/sd@5,1:a'
            whole_disk: 1
            create_txg: 4
    children[3]:
        type: 'mirror'
        id: 3
        guid: 17572300864705394276
        metaslab_array: 30
        metaslab_shift: 34
        ashift: 12
        asize: 3000579653632
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 11587211460424558933
            path: '/dev/dsk/c1t6d1s0'
            devid: 'id1,sd@n50014ee003dfaa3f/a'
            phys_path: '/pci@0,0/pci8086,e02@1/pci1028,1f51@0/sd@6,1:a'
            whole_disk: 1
            create_txg: 4
        children[1]:
            type: 'disk'
            id: 1
            guid: 8162337843667859742
            path: '/dev/dsk/c1t7d1s0'
            devid: 'id1,sd@n5000cca225e50790/a'
            phys_path: '/pci@0,0/pci8086,e02@1/pci1028,1f51@0/sd@7,1:a'
            whole_disk: 1
            create_txg: 4
    children[4]:
        type: 'disk'
        id: 4
        guid: 8208230845645423860
        path: '/dev/dsk/c4t62d0s0'
        devid: 'id1,sd@n500230310028c7cb/a'
        phys_path: '/pci@0,0/pci8086,1d10@1c/pci1014,3bb@0/sd@3e,0:a'
        whole_disk: 1
        metaslab_array: 422
        metaslab_shift: 31
        ashift: 9
        asize: 256047054848
        is_log: 1
        removing: 1
        create_txg: 22105
        offline: 1
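
The interesting part of that dump is the last vdev: `is_log: 1`, `removing: 1`, and `ashift: 9` while the data mirrors are at 12. A rough Python sketch for pulling that out of saved `zdb -C` text; `log_vdevs` is a hypothetical helper, the sample is an abbreviated copy of the output above, and the parser assumes a single-disk log vdev like this one:

```python
def log_vdevs(zdb_text):
    """Return one dict of properties per vdev entry that has is_log set to 1.

    Each 'children[N]:' line starts a new entry, so this also works when the
    indentation of the zdb output has been lost.
    """
    entries = []
    current = None
    for raw in zdb_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Split on the first colon only; values such as phys_path contain colons.
        key, _, value = line.partition(":")
        value = value.strip().strip("'")
        if key.startswith("children["):
            current = {}
            entries.append(current)
        elif current is not None:
            current[key] = value
    return [entry for entry in entries if entry.get("is_log") == "1"]


# Abbreviated from the zdb output above.
SAMPLE = """\
vdev_tree:
type: 'root'
children[0]:
type: 'mirror'
ashift: 12
is_log: 0
children[0]:
type: 'disk'
path: '/dev/dsk/c1t0d1s0'
children[4]:
type: 'disk'
path: '/dev/dsk/c4t62d0s0'
ashift: 9
is_log: 1
removing: 1
offline: 1
"""

for vdev in log_vdevs(SAMPLE):
    print(vdev["path"], "ashift", vdev["ashift"], "removing", vdev.get("removing", "0"))
```

For a mirrored log, the `is_log` flag sits on the mirror entry and the device paths are in the entries that follow it, so this sketch would need extending.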

wallenford · Nov 22, 2016

The symptom fits perfectly with what someone reported on zfsonlinux, as noted previously.
And there is a fix on Linux which originally comes from OmniOS.

But when I tried the OmniOS fix, the system just core dumped!!
After that, whenever I run zdb -C lfpool, I get an error message like this:

assertion failed for thread 0xfffffd7fff172a40, thread-id 1: space_map_allocated(msp->ms_sm) == 0 (0x3000 == 0x0), file ../../../uts/common/fs/zfs/metaslab.c, line 1445
Abort (core dumped)

I think it's because I messed with the allocation of the slog!!

But interestingly, zdb still gives me results, as long as it's not zdb -C.

pricklypunter · Nov 22, 2016

I'm pretty much at a loss at this point I'm afraid, I don't think I'll be much more help to you. About the only way I can think of to get around the issue, although it doesn't explain why it's happening, would be to offline the whole thing, remove the device, import the pool back in and then remove the orphan disk normally. It might just be simpler and safer long term though to just create a new pool from scratch and run a restore from backup
