LOG device on an encrypted pool = single point of failure? Can't unlock pool anymore after losing SLOG...


Mastakilla

Member
Jul 23, 2019
Hi,

I am running TrueNAS 12 on my server and recently had some data corruption issues, which caused me to temporarily remove my Intel Optane. After finding out the issues were caused by a broken CPU, I replaced the CPU and wanted to re-add my Intel Optane.

But something went wrong and now I can't unlock my pool anymore... Here's what happened:

I actually followed the same steps that I successfully performed on FreeNAS 11, when initially installing my Intel Optane about 6 months ago.
  • Create a GPT partition table
    • gpart create -s gpt nvd0
  • Create 3 partitions (I didn't do anything yet with the swap)
    • gpart add -i 1 -b 128 -t freebsd-swap -s 16g nvd0
    • gpart add -i 2 -t freebsd-zfs -s 20g nvd0
    • gpart add -i 3 -t freebsd-zfs nvd0
  • Add the 2nd partition as a SLOG to my existing password-encrypted RAIDZ2 pool
    • zpool add hgstpool log /dev/gptid/cf3dbeb8-c5dc-11ea-8b09-d05099d3fdfe
  • Create a non-redundant test pool from partition 3
    • zpool create -m /mnt/optanepool optanepool /dev/gptid/d23e29f0-c5dc-11ea-8b09-d05099d3fdfe
    • zpool export optanepool
    • Import optanepool in the GUI
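For reference, a quick way to sanity-check the resulting layout and confirm the log vdev actually got attached would be something like this (just a sketch; hgstpool is my encrypted RAIDZ2 pool):
Code:
# show the partition table on the Optane
gpart show nvd0
# confirm the log vdev is attached to the encrypted pool
zpool status hgstpool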
Next I wanted to enable lz4 compression on optanepool in the GUI, but I got an error:
Code:
Error: Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/middlewared/main.py", line 137, in call_method
    result = await self.middleware._call(message['method'], serviceobj, methodobj, params, app=self,
  File "/usr/local/lib/python3.8/site-packages/middlewared/main.py", line 1195, in _call
    return await methodobj(*prepared_call.args)
  File "/usr/local/lib/python3.8/site-packages/middlewared/service.py", line 463, in update
    rv = await self.middleware._call(
  File "/usr/local/lib/python3.8/site-packages/middlewared/main.py", line 1195, in _call
    return await methodobj(*prepared_call.args)
  File "/usr/local/lib/python3.8/site-packages/middlewared/schema.py", line 972, in nf
    args, kwargs = clean_and_validate_args(args, kwargs)
  File "/usr/local/lib/python3.8/site-packages/middlewared/schema.py", line 930, in clean_and_validate_args
    value = attr.clean(args[args_index + i])
  File "/usr/local/lib/python3.8/site-packages/middlewared/schema.py", line 607, in clean
    data[key] = attr.clean(value)
  File "/usr/local/lib/python3.8/site-packages/middlewared/schema.py", line 167, in clean
    value = super(Str, self).clean(value)
  File "/usr/local/lib/python3.8/site-packages/middlewared/schema.py", line 53, in clean
    raise Error(self.name, f'Invalid choice: {value}')
middlewared.schema.Error: [aclmode] Invalid choice: DISCARD
This step did work in FreeNAS 11, but for some reason it doesn't work anymore in TrueNAS 12.
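(In hindsight, the compression property could probably also have been set from the CLI instead of the GUI, something along these lines, as a sketch:)
Code:
# enable lz4 compression on the test pool's root dataset
zfs set compression=lz4 optanepool
# verify the new setting
zfs get compression optanepool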

I thought: OK, no problem. I'll just re-create optanepool and enable compression with the "zpool create" command itself. So I exported the pool in the TrueNAS GUI, but then I couldn't create a new zpool over the old one, as the old one was still there, just not visible, because it was exported.
Then I made the mistake of thinking it might be better to destroy optanepool from the GUI, so I re-imported optanepool and then exported it again in the GUI, this time with the destroy checkbox selected. At that time my main RAIDZ2 pool was actually still locked after a reboot (I even thought it would be safer to keep it locked).

But, here it comes: destroying optanepool from the TrueNAS GUI didn't just destroy the optanepool partition. No, it destroyed ALL partitions on my Intel Optane, including the partition I was using as SLOG for my password-encrypted RAIDZ2 pool :eek::eek::eek:

When I now try to unlock my RAIDZ2 pool, it fails with the following error:
Code:
Error: concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/concurrent/futures/process.py", line 239, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.8/site-packages/middlewared/worker.py", line 91, in main_worker
    res = MIDDLEWARE._run(*call_args)
  File "/usr/local/lib/python3.8/site-packages/middlewared/worker.py", line 45, in _run
    return self._call(name, serviceobj, methodobj, args, job=job)
  File "/usr/local/lib/python3.8/site-packages/middlewared/worker.py", line 39, in _call
    return methodobj(*params)
  File "/usr/local/lib/python3.8/site-packages/middlewared/worker.py", line 39, in _call
    return methodobj(*params)
  File "/usr/local/lib/python3.8/site-packages/middlewared/schema.py", line 977, in nf
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/middlewared/plugins/zfs.py", line 371, in import_pool
    self.logger.error(
  File "libzfs.pyx", line 391, in libzfs.ZFS.__exit__
  File "/usr/local/lib/python3.8/site-packages/middlewared/plugins/zfs.py", line 365, in import_pool
    zfs.import_pool(found, new_name or found.name, options, any_host=any_host)
  File "libzfs.pyx", line 1095, in libzfs.ZFS.import_pool
  File "libzfs.pyx", line 1123, in libzfs.ZFS.__import_pool
libzfs.ZFSException: one or more devices is currently unavailable
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/middlewared/job.py", line 361, in run
    await self.future
  File "/usr/local/lib/python3.8/site-packages/middlewared/job.py", line 397, in __run_body
    rv = await self.method(*([self] + args))
  File "/usr/local/lib/python3.8/site-packages/middlewared/schema.py", line 973, in nf
    return await f(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/middlewared/plugins/pool_/encryption_freebsd.py", line 290, in unlock
    raise e
  File "/usr/local/lib/python3.8/site-packages/middlewared/plugins/pool_/encryption_freebsd.py", line 272, in unlock
    await self.middleware.call('zfs.pool.import_pool', pool['guid'], {
  File "/usr/local/lib/python3.8/site-packages/middlewared/main.py", line 1238, in call
    return await self._call(
  File "/usr/local/lib/python3.8/site-packages/middlewared/main.py", line 1203, in _call
    return await self._call_worker(name, *prepared_call.args)
  File "/usr/local/lib/python3.8/site-packages/middlewared/main.py", line 1209, in _call_worker
    return await self.run_in_proc(main_worker, name, args, job)
  File "/usr/local/lib/python3.8/site-packages/middlewared/main.py", line 1136, in run_in_proc
    return await self.run_in_executor(self.__procpool, method, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/middlewared/main.py", line 1110, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
libzfs.ZFSException: ('one or more devices is currently unavailable',)
In the last line, you can clearly read "one or more devices is currently unavailable". In the console, you can also see that it successfully unlocks each HDD, one by one, and after unlocking all 8 HDDs, it detaches (destroys) the .eli devices again:
Code:
Feb  4 10:35:29 data GEOM_ELI: Device gptid/ef1cec0a-cdd4-11ea-a82a-d05099d3fdfe.eli created.
Feb  4 10:35:29 data GEOM_ELI: Encryption: AES-XTS 256
Feb  4 10:35:29 data GEOM_ELI:     Crypto: hardware
Feb  4 10:35:31 data GEOM_ELI: Device gptid/ef5b1297-cdd4-11ea-a82a-d05099d3fdfe.eli created.
Feb  4 10:35:31 data GEOM_ELI: Encryption: AES-XTS 256
Feb  4 10:35:31 data GEOM_ELI:     Crypto: hardware
Feb  4 10:35:32 data GEOM_ELI: Device gptid/efde27c0-cdd4-11ea-a82a-d05099d3fdfe.eli created.
Feb  4 10:35:32 data GEOM_ELI: Encryption: AES-XTS 256
Feb  4 10:35:32 data GEOM_ELI:     Crypto: hardware
Feb  4 10:35:34 data GEOM_ELI: Device gptid/f00c5450-cdd4-11ea-a82a-d05099d3fdfe.eli created.
Feb  4 10:35:34 data GEOM_ELI: Encryption: AES-XTS 256
Feb  4 10:35:34 data GEOM_ELI:     Crypto: hardware
Feb  4 10:35:36 data GEOM_ELI: Device gptid/efd8967a-cdd4-11ea-a82a-d05099d3fdfe.eli created.
Feb  4 10:35:36 data GEOM_ELI: Encryption: AES-XTS 256
Feb  4 10:35:36 data GEOM_ELI:     Crypto: hardware
Feb  4 10:35:38 data GEOM_ELI: Device gptid/f01fd8e3-cdd4-11ea-a82a-d05099d3fdfe.eli created.
Feb  4 10:35:38 data GEOM_ELI: Encryption: AES-XTS 256
Feb  4 10:35:38 data GEOM_ELI:     Crypto: hardware
Feb  4 10:35:39 data GEOM_ELI: Device gptid/f03634f0-cdd4-11ea-a82a-d05099d3fdfe.eli created.
Feb  4 10:35:39 data GEOM_ELI: Encryption: AES-XTS 256
Feb  4 10:35:39 data GEOM_ELI:     Crypto: hardware
Feb  4 10:35:41 data GEOM_ELI: Device gptid/f04a1e2c-cdd4-11ea-a82a-d05099d3fdfe.eli created.
Feb  4 10:35:41 data GEOM_ELI: Encryption: AES-XTS 256
Feb  4 10:35:41 data GEOM_ELI:     Crypto: hardware
Feb  4 10:35:42 data GEOM_ELI: Device gptid/ef1cec0a-cdd4-11ea-a82a-d05099d3fdfe.eli destroyed.
Feb  4 10:35:42 data GEOM_ELI: Device gptid/ef5b1297-cdd4-11ea-a82a-d05099d3fdfe.eli destroyed.
Feb  4 10:35:42 data GEOM_ELI: Device gptid/efde27c0-cdd4-11ea-a82a-d05099d3fdfe.eli destroyed.
Feb  4 10:35:42 data GEOM_ELI: Device gptid/f00c5450-cdd4-11ea-a82a-d05099d3fdfe.eli destroyed.
Feb  4 10:35:42 data GEOM_ELI: Device gptid/efd8967a-cdd4-11ea-a82a-d05099d3fdfe.eli destroyed.
Feb  4 10:35:42 data GEOM_ELI: Device gptid/f01fd8e3-cdd4-11ea-a82a-d05099d3fdfe.eli destroyed.
Feb  4 10:35:42 data GEOM_ELI: Device gptid/f03634f0-cdd4-11ea-a82a-d05099d3fdfe.eli destroyed.
Feb  4 10:35:42 data GEOM_ELI: Device gptid/f04a1e2c-cdd4-11ea-a82a-d05099d3fdfe.eli destroyed.
So apparently my accidentally destroyed SLOG device is making the pool impossible to unlock. o_O
  • I can't add a new / replacement SLOG, because I need an unlocked pool for that
  • I can't "remove" the destroyed SLOG from my pool, because I need an unlocked pool for that
  • I tried physically removing the Intel Optane, so that TrueNAS might realize the SLOG is no longer there, but I still can't unlock my pool
I have some ideas on how to move forward, but my ZFS knowledge is too limited to know whether they are safe / good ideas. I hope someone here can give some feedback...

I was thinking of unlocking my HDDs one by one using the command line. Perhaps then, even though I can't unlock the destroyed SLOG anymore, I'll be able to "see" and modify my unlocked RAIDZ2 pool, and perhaps remove the SLOG so that TrueNAS no longer expects one while unlocking. I found the following commands to do this:

geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key /dev/gptid/efde27c0-cdd4-11ea-a82a-d05099d3fdfe
geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key /dev/gptid/ef5b1297-cdd4-11ea-a82a-d05099d3fdfe
...
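If the geli attach commands succeed, I assume something like this would show whether the pool can then be seen despite the missing log device (just a sketch):
Code:
# list the attached .eli providers
geli status
# list pools available for import; hgstpool should show up here, with its log vdev reported as missing
zpool import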


Is this a good idea?
 

Mastakilla

Member
Jul 23, 2019
Here is what made my data available again (not sure if TrueNAS is still 100% healthy, but that's a worry for later):
  • The steps below made my pool visible / available again in the CLI (unlock each HDD, one by one)
    • geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key /dev/gptid/efde27c0-cdd4-11ea-a82a-d05099d3fdfe
    • geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key /dev/gptid/ef5b1297-cdd4-11ea-a82a-d05099d3fdfe
    • geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key /dev/gptid/f03634f0-cdd4-11ea-a82a-d05099d3fdfe
    • geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key /dev/gptid/efd8967a-cdd4-11ea-a82a-d05099d3fdfe
    • geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key /dev/gptid/f00c5450-cdd4-11ea-a82a-d05099d3fdfe
    • geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key /dev/gptid/f04a1e2c-cdd4-11ea-a82a-d05099d3fdfe
    • geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key /dev/gptid/ef1cec0a-cdd4-11ea-a82a-d05099d3fdfe
    • geli attach -k /data/geli/012fff85-1065-49d7-a500-a4efe62e27a8.key /dev/gptid/f01fd8e3-cdd4-11ea-a82a-d05099d3fdfe
  • The next step made my pool fully visible in the GUI (degraded, because of the missing SLOG)
    • zpool import -m hgstpool
  • This removed the log device from the pool and made it show as healthy again in the GUI
    • zpool remove hgstpool gptid/f8725bbb-64c5-11eb-bbcd-d05099d3fdfe
After a reboot, I could unlock my pool again in the normal way via the GUI… I’ll now start updating my offline backups first
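Before fully relying on the pool again, it's probably worth checking its state and running a scrub once the backups are up to date; roughly (a sketch):
Code:
# check pool health and any reported errors
zpool status -v hgstpool
# start a scrub to re-verify all data checksums
zpool scrub hgstpool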
 

Mastakilla

Member
Jul 23, 2019
  • Don't use the GUI to export / disconnect a pool that is using manually created partitions, as TrueNAS only supports full-disk pools
  • It's OK to unlock the pool via the CLI for emergencies

(but I guess you wanted me to say that you shouldn't use multi-partitioned disks for creating pools in TrueNAS, right? :p)
 

Rand__

Well-Known Member
Mar 6, 2014
Nah, I don't do that anyway. I just thought it might be good for other users to learn from your experiences ;)