Can't remove SLOG device

pricklypunter

Well-Known Member
Nov 10, 2015
1,606
470
83
Canada
I know, crap answer, I hate not getting to the bottom of a problem too, especially one that's likely to bite me in the rear again later, but sometimes I have to just admit defeat and move forward. If you do ever get to the bottom of it, I would be interested to hear how you fix it :)
 

wallenford

Member
Nov 20, 2016
32
3
8
47
Maybe it's because of some tweak I made to the system settings of the NFS server.
I am going to destroy the pool and reinstall the latest OmniOS to see if that goes well.
I will update here if I find anything.
 

wallenford

I reinstalled the system with omnios-r151020-b5b8c75 and re-created the zpool.

Config:
1. 12 TB zpool, 4 vdevs * 2 * 3 TB SATA disks on a Dell H310 HBA
2. 3 SSDs in the system:
   PLEXTOR-256M6 + PLEXTOR-M6 on SAS1064E
   PLEXTOR-256M5 on Intel SATA
3. Dell R520

sd.conf:
sd-config-list =
"ATA PLEXTOR PX-256M6", "physical-block-size:4096",
"ATA PLEXTOR PX-128M6", "physical-block-size:4096",
"ATA PLEXTOR PX-256M61.01", "physical-block-size:4096",
"ATA PLEXTOR PX-128M61.01", "physical-block-size:4096",
"ATA PLEXTOR PX-256M5", "physical-block-size:4096",
"ATA QEMU HARDDISK ", "physical-block-size:4096",
"IET VIRTUAL-DISK ", "physical-block-size:4096",
"NETAPP LUN ", "physical-block-size:4096";

I found something interesting that may be related to the silent failure when removing the slog:

1. I cannot force the Plextor-256M6 or Plextor-128M6 to ashift=12:
   whether I add the log to the zpool after creating it or add the log while creating the zpool, zdb -C always shows the ashift of the slog as 9.
2. I can force the Plextor-256M5 on Intel SATA to ashift=12.
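One way to compare the ashift of every vdev, including the log device, is to filter the `zdb -C` output down to a few fields. This is only a sketch: the helper name `ashift_summary` is made up for illustration, and it assumes the usual indented `key: value` layout that zdb prints.

```shell
# ashift_summary is a hypothetical helper, not part of ZFS.
# It reads `zdb -C` output on stdin and keeps only the type, path,
# ashift and is_log fields, stripped of their indentation.
ashift_summary() {
    grep -E '^[[:space:]]*(type|path|ashift|is_log):' | sed 's/^[[:space:]]*//'
}

# Usage against the pool from this thread:
#   zdb -C lfpool | ashift_summary
```

The log vdev is the one reported with `is_log: 1`. Because the helper reads stdin, it also works on a saved copy of the zdb output.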


Now I'm using the Plextor-256M5 as the slog device. So far I can add and remove the slog. I will test more as soon as the data is restored.

 

wallenford

After 2 TB of zfs send over netcat, the slog still functioned well
(maybe because there was very little sync write traffic to the zpool).

So I mounted one datastore on ESXi 5.5 over NFS and tried to upload a lot of files with the vSphere Client.

After 26 GB of small files had been uploaded, the slog device became stuck and could not be removed from the pool.
(While I was uploading the files, a 1.1 TB zfs receive operation failed.)

zdb failed:
root@zfs01:/root# zdb -C lfpool
assertion failed for thread 0xfffffd7fff142a40, thread-id 1: space_map_allocated(msp->ms_sm) == 0 (0x12000 == 0x0), file ../../../uts/common/fs/zfs/metaslab.c, line 1551
Abort (core dumped)
 

wallenford

Again, I destroyed the pool and rebuilt it.

With the following steps, the slog device got stuck:

1. create zpool + cache + log at the same time
2. create 10 zfs filesystems
3. start nc to receive ZFS from another host
4. successfully receive all zfs filesystems
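For anyone trying to reproduce this, the steps above could be scripted roughly as below. This is a sketch under assumptions, not the exact commands used in the thread: the device names (`c1t0d0` and so on) are hypothetical, only one mirror vdev is shown instead of four, and the `run` and `repro` helpers are made up for illustration. With `DRY_RUN=1` (the default here) the commands are only printed.

```shell
# Hypothetical names: adjust before use. DRY_RUN=1 only prints the commands.
POOL=${POOL:-lfpool}
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

repro() {
    # 1. create the pool with log and cache devices in a single command
    run zpool create "$POOL" mirror c1t0d0 c1t1d0 log c2t0d0 cache c2t1d0
    # 2. create 10 filesystems
    i=1
    while [ "$i" -le 10 ]; do
        run zfs create "$POOL/fs$i"
        i=$((i + 1))
    done
    # 3. and 4. receive each filesystem from the other host, for example:
    #   nc -l 9000 | zfs receive -F lfpool/fs1
    # (the nc listen syntax differs between implementations)
    # Finally, try to remove the log device; this is the step that got stuck.
    run zpool remove "$POOL" c2t0d0
}
```

Running `repro` with the defaults prints twelve commands, which can be reviewed before setting `DRY_RUN=0` on a test machine.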

I will replace the Plextor M6/SATA with an Intel 535 120G,
then see what happens.
 

wallenford

I thought the issue was caused by the Plextor, too.
But after I replaced the Plextor M5Pro with an Intel 535,
the log got stuck again.


The Intel 535 worked well and could easily be removed and re-added multiple times during the first 3 TB of data transfer.

But after sending 4 TB of ZFS data, it became stuck again!!
I did set zfs_nocacheflush to true while I was sending the data; I don't know if that's the cause.


I will try to update the Dell H310 firmware to see if anything changes.
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
6,992
1,568
113
CA
@wallenford I appreciate your constant updates on what you're trying and what isn't working. That will prove useful for others in the future, I'm sure. Always good to document it too :)
 

pricklypunter

Did you note what was being reported for ashift when you re-created the pool with the Intel disk? If so, did this value change after the 4TB of writes when you could no longer remove the SLOG?
 

wallenford

pricklypunter said: "Did you note what was being reported for ashift when you re-created the pool with the Intel disk? If so, did this value change after the 4TB of writes when you could no longer remove the SLOG?"

Yes, the ashift is 13, the same as the value in sd.conf.
The value didn't change (as far as I can remember). I will take a look after I fix my Dell R520
(because I did an idiotic thing: I tried to crossflash the embedded H310 to LSI 9211-8i firmware,
and now it won't pass the internal storage slot check while the server boots).
 

pricklypunter

I am pretty much stumped now. If the ashift value is not changing or reverting with the Intel disk, and you still cannot remove the SLOG from the pool without error, I wonder whether this is a bug in your implementation of ZFS itself or in its interaction with your OS. I admit that seems unlikely, since others would have similar issues and I imagine it would be much more widely reported. Or perhaps we are on the wrong track altogether and it has nothing to do with the ashift value, but rather something more fundamental is going wrong?

It might be worth taking this to one of the groups where the developers hang out and see if they can re-create your issue, maybe help track it down quicker, they do after all have an intimate knowledge of how it all hangs together :)
 

wallenford

Well, after Dell service replaced the embedded H310 storage card, the server came back to life.
I destroyed the zpool, reset all the NFS tuning settings, and reset zfs_nocacheflush to false.
Then I restored the pool using zfs send over nc.

While sending, I used "zpool iostat -v" to monitor the IOPS of the slog.

In fact, there was hardly any activity on the slog device during the process,
but after the send finished, the slog got stuck again.
I am sure the ashift of the slog device didn't change.

I just don't know what happened to my zfs @@
 

pricklypunter

Actually @cperalt1 might have a good idea here, in that an rsync will be able to pull just your data across. As far as I remember a ZFS receive also pulls your original file system over with your data. I'm not exactly sure how or even if the file system could affect the use of a SLOG disk though, but at this point, I wouldn't rule out any possible misbehavior. Have you tried pulling just your data to back-up disk(s), then moving that data onto a fresh pool/ SLOG setup? I don't use/ have never used Napp-it specifically and know very little about it. As an experiment, perhaps to help rule out an OS/ implementation problem, have you tried replicating this using something else entirely, like Debian/ ZoL?
 

cperalt1

Active Member
Feb 23, 2015
178
51
28
39
The reason I recommend rsync is that it would rule out any issue having to do with a corrupted ZFS file system. Like any file system, ZFS is not magic, but it does protect you from many things. The other benefit is that it will eliminate any fragmentation present in your ZFS pool.
 

wallenford

cperalt1 said: "The reason I recommend rsync is that it would rule out any issue having to do with a corrupted ZFS file system. Like any file system, ZFS is not magic, but it does protect you from many things. The other benefit is that it will eliminate any fragmentation present in your ZFS pool."

I see, I will try to use rsync to restore the data.
 

gea

Well-Known Member
Dec 31, 2010
2,485
837
113
DE
ZFS fragments more than older filesystems because of copy-on-write. Using rsync instead of zfs send has no effect on that. Missing ZIL activity during a replication is to be expected, as a log device is not useful there. The main advantage of zfs send is much higher performance, as you do not need a file compare like rsync does. It's a simple stream, based on a snapshot, of the modified data blocks. For a large filesystem this is the only option for keeping two copies in sync in near real time.
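To illustrate the snapshot-based stream described here, a minimal sketch follows. The helper name `send_cmd`, the snapshot names, the host `backuphost` and the port are all hypothetical; the helper only prints the `zfs send` command so it can be checked before anything is run.

```shell
# send_cmd is a hypothetical helper that only prints the command to run.
# $1 = dataset, $2 = new snapshot name, $3 = previous snapshot name (optional)
send_cmd() {
    if [ -n "${3:-}" ]; then
        # incremental: only blocks changed between the two snapshots
        echo "zfs send -i $1@$3 $1@$2"
    else
        # first run: full stream of the snapshot
        echo "zfs send $1@$2"
    fi
}

# Example pipeline on the sending host (hypothetical host and port):
#   zfs snapshot lfpool/fs1@snap2
#   zfs send -i lfpool/fs1@snap1 lfpool/fs1@snap2 | nc backuphost 9000
```

Because the incremental stream carries only the blocks that changed since the previous snapshot, repeated runs stay small even on a large filesystem, with no file-by-file compare.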

Have you modified sd.conf in this setup to force a special ashift for the SSD?
I would avoid this.

I would also avoid desktop-class SSDs for an Slog.
An Slog must offer power-loss protection, ultra-low latency and high write IOPS. Otherwise you should avoid an Slog, as it may be even slower than using the on-pool ZIL for sync writes.
 

wallenford

gea said: "Have you modified sd.conf in this setup to force a special ashift for the SSD? I would avoid this."

Yes, I did modify sd.conf.
I will remove the change after I destroy the zpool.

gea said: "I would also avoid desktop-class SSDs for an Slog. An Slog must offer power-loss protection, ultra-low latency and high write IOPS. Otherwise you should avoid an Slog, as it may be even slower than using the on-pool ZIL for sync writes."

But there has been no power loss the whole time...
Or can you recommend an SSD?

I've tried the Plextor M5Pro, the Plextor M6 (M.2 2280) and the Intel 535.
All of them get stuck.