Proxmox Ceph RBD snapshot hangs when ceph-fuse mount present in LXC


zdude

What is a Computer?
Aug 29, 2017
62
8
8
43
I have been trying to get a backup solution configured for my system. The system is a hyperconverged Ceph cluster running unprivileged LXC containers, and each container's root image is stored on a Ceph RBD pool. Inside some containers, a CephFS filesystem is mounted via ceph-fuse.

If I take a snapshot of a container that does not have a CephFS FUSE mount present, it always works correctly.
If I take a snapshot of a container with a CephFS FUSE mount present and no files open from the mount, it works correctly (this is partly a theory, based on when and how the snapshots fail).
If I take a snapshot of a container with a CephFS FUSE mount present and at least one file open, it seems to always fail.
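To sort containers into the three cases above before snapshotting, one option is to check each container for a ceph-fuse mount. Here is a minimal sketch that parses the text of `/proc/mounts`. It assumes the mount shows up with filesystem type `fuse.ceph-fuse`, which is how ceph-fuse mounts typically appear, and the function names are just illustrative.

```python
def ceph_fuse_mounts(mounts_text):
    """Return the mount points whose filesystem type is fuse.ceph-fuse.

    mounts_text is the content of /proc/mounts (or /proc/<pid>/mounts).
    Assumption: ceph-fuse mounts are listed with type "fuse.ceph-fuse".
    """
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 3 and fields[2] == "fuse.ceph-fuse":
            points.append(fields[1])
    return points


def read_mounts(path="/proc/mounts"):
    """Read a mounts table; run inside the container, or point at /proc/<pid>/mounts."""
    with open(path) as fh:
        return fh.read()
```

Calling `ceph_fuse_mounts(read_mounts())` inside a container returns an empty list when no CephFS FUSE mount is present. Note that mount points containing spaces are escaped in `/proc/mounts` (for example `\040`), which this sketch does not decode.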


Once a snapshot has failed, some process appears to be left locked, and clearing it requires a reboot of the host. After rebooting, the Proxmox GUI shows that a snapshot exists, but it is not a parent of "NOW". When I examine the RBD image directly, there is no snapshot present on it.
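I have not confirmed what the locked process actually is. If it is a task stuck in uninterruptible sleep (state `D`), something like the following sketch could list the candidates on the host by reading `/proc/<pid>/stat`. That it is a `D`-state task is an assumption, and the function names are made up for illustration.

```python
import glob


def parse_stat(stat_line):
    """Return (pid, comm, state) from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the first '(' and the last ')' instead of splitting on spaces.
    open_idx = stat_line.index("(")
    close_idx = stat_line.rindex(")")
    pid = int(stat_line[:open_idx].strip())
    comm = stat_line[open_idx + 1:close_idx]
    state = stat_line[close_idx + 1:].split()[0]
    return pid, comm, state


def uninterruptible(stat_lines):
    """Return (pid, comm) for every task whose state is 'D'."""
    parsed = (parse_stat(line) for line in stat_lines)
    return [(pid, comm) for pid, comm, state in parsed if state == "D"]


def read_all_stats():
    """Collect the stat line of every process currently visible in /proc."""
    lines = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as fh:
                lines.append(fh.read())
        except (FileNotFoundError, ProcessLookupError):
            # The process exited between the glob and the read.
            continue
    return lines
```

Running `uninterruptible(read_all_stats())` on the host after a failed snapshot would show whether anything is stuck in state `D` and under which command name.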

The only thing I can find that might be relevant is an error in syslog indicating that my Ceph admin key is not present (followed by the error key), even though it very much is. It almost looks as if a non-root process is trying to read the Ceph admin key, but I can't find any such process.

Has anybody seen something like this before, and if so, any suggestions for getting around it?
 

zdude

I have been able to get a little more detail and isolate what is causing the problem.

The sequence of events that causes a PVE hang appears to be the following:

An LXC container with a ceph-fuse mount inside it (not just with an RBD root; it also hangs with a ZFS container root).
A process writing to a file inside the FUSE mount in the container (I have just been using fio in my test environment). Reads don't seem to cause any problems.
A snapshot of the LXC container's root.
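For anyone who wants to reproduce this on a test host, the sequence above can be sketched as two commands: a sustained fio write inside the container, and a `pct snapshot` issued while the write is running. The helper below only builds the argument lists and does not run anything. The container ID, mount path, snapshot name, and fio sizes are placeholders, not my actual settings.

```python
import shlex


def repro_commands(vmid, fuse_path, snapname="hangtest", runtime_s=120):
    """Build the two commands for the reproduction; nothing is executed here.

    vmid, fuse_path, snapname and the fio parameters are hypothetical values.
    """
    # Step 2: sustained sequential write to a file on the ceph-fuse mount,
    # started from the host inside the container via `pct exec`.
    fio = [
        "pct", "exec", str(vmid), "--",
        "fio", "--name=snapwrite",
        f"--filename={fuse_path}/fio-test.bin",
        "--rw=write", "--bs=4k", "--size=1G",
        "--time_based", f"--runtime={runtime_s}",
    ]
    # Step 3: snapshot the container while the write is still in progress.
    snapshot = ["pct", "snapshot", str(vmid), snapname]
    return fio, snapshot


def as_shell(cmd):
    """Render an argument list as a copy-pasteable shell line."""
    return shlex.join(cmd)
```

To actually trigger the hang, the fio command would need to be started in the background (for example with `subprocess.Popen`) before the snapshot command is run. Only do that on a host you can afford to reboot.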


I also found this excerpt from Proxmox's documentation:
[Image: 1709486750846.png, screenshot of the documentation excerpt]

It appears to be a somewhat known problem. Unfortunately, making about 50000 bind mounts on the host is just not feasible for me.
 