Thanks @oneplane for your reply! We run large KVM clusters; for a new app we're looking at bare metal with LXD/LXC containers, and I'm evaluating MaaS or other tools to handle those deployments, with Ansible for post-install configuration and for LXC deployment.
I think MaaS mostly adds value if you don't have the manpower internally to set up automated PXE (which is what pretty much everything under the sun uses for bare-metal provisioning). That used to be 'part of the job', but these days specialisation and sector growth mean you probably have to pick what you want to work on internally to add value, and what you just buy COTS.
What it all boils down to is controlling the servers to make them boot over the network, and aside from some iSCSI magic, that's still PXE most of the time. How the boot method gets set depends on the hardware (which is also what MaaS hints at): in many cases the only way is to provision and then lock down IPMI (or whatever else the BMC speaks) so you can use it to dictate power and boot control remotely (which makes total sense).
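As a concrete illustration of that remote power/boot control: assuming a BMC reachable over IPMI-over-LAN (the address and credentials below are placeholders), the usual ipmitool incantation is just two commands. This is a sketch, not something you'd run against production without checking your vendor's quirks first:

```shell
# Tell the BMC to PXE-boot on the next startup only
# (append options=persistent if you want it to stick across reboots).
ipmitool -I lanplus -H 10.0.0.50 -U admin -P 'secret' chassis bootdev pxe

# Power-cycle the machine so it actually boots from the network.
ipmitool -I lanplus -H 10.0.0.50 -U admin -P 'secret' chassis power cycle
```

The catch, as above, is that how reliably the BMC honours `bootdev` varies wildly between firmware revisions, which is exactly why tools like MaaS carry per-vendor power drivers.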
For us, we didn't want to deal with the many variations of BMC software directly (even within the same vendor, the same systems and the same mainboard series, consistency is severely lacking), so we only configured the boot method and went from there -- or wired it up: in some cases you can wire up a reset line, and the default boot method tends to be PXE. First boot would then change the BMC from the host side over the non-network BMC interface. It's a PITA, but it beats repetitive manual action.

As for why we went the 'wire up a reset' route: BMC and UEFI/BIOS firmware is notoriously insecure and buggy, especially when you control the OS on the hardware, and no amount of SGX, TPM, OPAL or SecureBoot fixes this. Spending time making that vulnerable bit less relevant was preferable to trying to test and fix something that realistically only the vendor should.
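For the host-side part -- reconfiguring the BMC on first boot over the non-network interface -- the same tool works through the kernel's in-band IPMI device instead of the LAN. A minimal sketch (channel number, address and user ID are assumptions; they vary per board):

```shell
# Load the in-band IPMI drivers so /dev/ipmi0 exists.
modprobe ipmi_si ipmi_devintf

# Talk to the local BMC over the system interface -- no network needed.
ipmitool -I open lan print 1                       # inspect LAN channel 1
ipmitool -I open lan set 1 ipsrc static
ipmitool -I open lan set 1 ipaddr 10.0.0.50
ipmitool -I open user set password 2 'new-bmc-pw'  # rotate the default user
```

Run from cloud-init or a first-boot Ansible play, this is how you can bring the BMC into a known state without ever touching its (inconsistent) web UI.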
I'd stay with PVE for this, but they don't support live migration of LXCs, which is why I'm looking at BMR with LXD to manage our container deployments, since LXD allows live migration.
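For reference, moving a container between LXD cluster members is a one-liner; stateful live migration additionally needs CRIU enabled on both ends. Names below are placeholders, and CRIU support is famously hit-or-miss depending on what the container is running:

```shell
# On the snap-packaged LXD, opt in to CRIU-based live migration.
snap set lxd criu.enable=true

# Mark the container's runtime state as migratable.
lxc config set web01 migration.stateful true

# Move the container to another cluster member.
lxc move web01 --target node2
```

Worth testing with your actual workload before committing to it, since CRIU checkpointing can fail on containers with certain open resources.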
In this case I'd advocate running PVE on it, and then a single VM inside that where you do the LXD stuff. That way you can control the entire lifecycle of the LXD guest machine without ever having to touch the server itself. Purely from a Terraform and automation perspective that would be a must-have for us (especially the near-instant snapshotting -- super useful for debugging a production system without actually touching the running system).
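That snapshot workflow is trivial from the PVE host; assuming the LXD guest is VM 100 (the VMID and snapshot name are made up here):

```shell
# Snapshot the running VM, including its RAM state, before poking at it.
qm snapshot 100 pre-debug --vmstate 1

# ...debug away, then roll the whole guest back as if nothing happened.
qm rollback 100 pre-debug
```

With RAM state included, the rollback restores the guest mid-flight, which is what makes it so handy for reproducing production issues.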
PS - How are you liking Harvester? Also curious what other auto-clustering KVM stacks you are looking at.
It's pretty neat, but we didn't get the performance we wanted right away, so as with most big systems with lots of components, it takes some effort. For us, a sidestep was to use a SAN (which breaks the whole idea of HCI) for now while we figure out storage. It might also simply be too young to match the specs and capabilities of a legacy/traditional compute/network/storage stack where everything is rigid and separate.
We've also tried some other stacks: distros that do Kubernetes plus KVM (so you get KubeVirt capabilities), oVirt with some cloud-init and Ansible magic (it doesn't really do auto-clustering properly otherwise), and OpenNebula, which isn't a full solution -- but once you're in Packer and cloud-init territory, everything else becomes just another instance.
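To give a flavour of that 'Packer and cloud-init territory': a minimal user-data fragment like the one below works the same whether the instance lands on oVirt, OpenNebula or KubeVirt, which is what makes everything 'just another instance'. The user name and key are placeholders:

```yaml
#cloud-config
hostname: node01
users:
  - name: ops
    sudo: ALL=(ALL) NOPASSWD:ALL
    ssh_authorized_keys:
      - ssh-ed25519 AAAA... ops@example
package_update: true
packages:
  - qemu-guest-agent
runcmd:
  - systemctl enable --now qemu-guest-agent
```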