Thanks @oneplane for your reply! We run large KVM clusters; for a new app we're looking at bare metal with LXD/LXC containers, and I'm evaluating MaaS or other tools to handle those deployments, with Ansible for post-install configuration and for LXC deployment.
I think MaaS mostly adds value if you don't have the manpower internally to set up automated PXE (which is what pretty much everything under the sun uses for bare-metal provisioning). That used to be 'part of the job', but these days specialisation and sector growth mean you probably have to pick what you want to work on internally to add value, and what you just buy COTS.
What it all boils down to is controlling the servers to make them boot over the network, and aside from some iSCSI magic, that's still PXE most of the time. How the boot method gets set depends on the hardware (which is also what MaaS hints at): in many cases the only way is to provision and then lock down IPMI (or whatever else the BMC speaks) so you can use it to dictate power and boot control remotely (which makes total sense).
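As a concrete illustration of that remote power/boot control: assuming a BMC reachable over IPMI-over-LAN (the address and credentials below are placeholders), the usual ipmitool incantation is just two commands. This is a sketch, not something you'd run against production without checking your vendor's quirks first:

```shell
# Tell the BMC to PXE-boot on the next startup only
# (append options=persistent if you want it to stick across reboots).
ipmitool -I lanplus -H 10.0.0.50 -U admin -P 'secret' chassis bootdev pxe

# Power-cycle the machine so it actually boots from the network.
ipmitool -I lanplus -H 10.0.0.50 -U admin -P 'secret' chassis power cycle
```

The catch, as above, is that how reliably the BMC honours `bootdev` varies wildly between firmware revisions, which is exactly why tools like MaaS carry per-vendor power drivers.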
For us, we didn't want to deal with the many variations of BMC software directly (even within the same vendor, the same systems and the same mainboard series, consistency is severely lacking), so we only configured the boot method and went from there -- or wired it up: in some cases you can wire up a reset line, and the default boot method tends to be PXE. First boot would then change the BMC from the host side over the non-network BMC interface. It's a PITA, but it beats repetitive manual action.

As for why we went the 'wire up a reset' route: BMC and UEFI/BIOS firmware is notoriously insecure and buggy, especially when you control the OS on the hardware, and no amount of SGX, TPM, OPAL or SecureBoot fixes this. Spending time making that vulnerable bit less relevant was preferable to trying to test and fix something that realistically only the vendor should.
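For the host-side part -- reconfiguring the BMC on first boot over the non-network interface -- the same tool works through the kernel's in-band IPMI device instead of the LAN. A minimal sketch (channel number, address and user ID are assumptions; they vary per board):

```shell
# Load the in-band IPMI drivers so /dev/ipmi0 exists.
modprobe ipmi_si ipmi_devintf

# Talk to the local BMC over the system interface -- no network needed.
ipmitool -I open lan print 1                       # inspect LAN channel 1
ipmitool -I open lan set 1 ipsrc static
ipmitool -I open lan set 1 ipaddr 10.0.0.50
ipmitool -I open user set password 2 'new-bmc-pw'  # rotate the default user
```

Run from cloud-init or a first-boot Ansible play, this is how you can bring the BMC into a known state without ever touching its (inconsistent) web UI.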
I'd stay with PVE for this, but they don't support live migration of LXCs, which is why I'm looking at BMR with LXD to manage our container deployments, since LXD allows live migration.
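For reference, moving a container between LXD cluster members is a one-liner; stateful live migration additionally needs CRIU enabled on both ends. Names below are placeholders, and CRIU support is famously hit-or-miss depending on what the container is running:

```shell
# On the snap-packaged LXD, opt in to CRIU-based live migration.
snap set lxd criu.enable=true

# Mark the container's runtime state as migratable.
lxc config set web01 migration.stateful true

# Move the container to another cluster member.
lxc move web01 --target node2
```

Worth testing with your actual workload before committing to it, since CRIU checkpointing can fail on containers with certain open resources.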
In this case I'd advocate running PVE on it, and then a single VM inside that where you do the LXD stuff. That way you can control the entire lifecycle of the LXD guest machine without ever having to touch the server itself. Purely from a Terraform and automation perspective that would be a must-have for us (especially the near-instant snapshotting -- super useful for debugging a production system without actually touching the running system).
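That snapshot workflow is trivial from the PVE host; assuming the LXD guest is VM 100 (the VMID and snapshot name are made up here):

```shell
# Snapshot the running VM, including its RAM state, before poking at it.
qm snapshot 100 pre-debug --vmstate 1

# ...debug away, then roll the whole guest back as if nothing happened.
qm rollback 100 pre-debug
```

With RAM state included, the rollback restores the guest mid-flight, which is what makes it so handy for reproducing production issues.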
PS - How are you liking Harvester? Also curious what other auto-clustering KVM stacks you are looking at.
It's pretty neat, but we didn't get the performance we wanted right away, so as with most big systems with lots of components, it takes some effort. For us, a sidestep was to use a SAN (which breaks the whole idea of HCI) for now while we figure out storage. It might also simply be too young to match the specs and capabilities of a legacy/traditional compute/network/storage stack where everything is rigid and separate.
We've also tried some other stacks: distros that do Kubernetes plus KVM (so you get KubeVirt capabilities), oVirt with some cloud-init and Ansible magic (it doesn't really do auto-clustering properly otherwise), and OpenNebula, which isn't a full solution -- but once you're in Packer and cloud-init territory, everything else becomes just another instance.
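To give a flavour of that 'Packer and cloud-init territory': a minimal user-data fragment like the one below works the same whether the instance lands on oVirt, OpenNebula or KubeVirt, which is what makes everything 'just another instance'. The user name and key are placeholders:

```yaml
#cloud-config
hostname: node01
users:
  - name: ops
    sudo: ALL=(ALL) NOPASSWD:ALL
    ssh_authorized_keys:
      - ssh-ed25519 AAAA... ops@example
package_update: true
packages:
  - qemu-guest-agent
runcmd:
  - systemctl enable --now qemu-guest-agent
```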