Ceph advice

Bjorn Smith

Well-Known Member
Sep 3, 2019
r00t.dk
Hi,

I have just bought yet another batch of 5x Dell Wyse 5070 - this time the extended versions - and I intend to deploy a small ceph cluster to use as storage for my kubernetes cluster, which I built on another batch of Dell Wyse 5070's.

I know these machines are tiny - but they have 4 cores each and should easily be upgradeable to 32 GB of memory if required. I plan on adding a separate NIC so ceph has an isolated storage network, and then use the public network for clients.

I will install 2 or 3 SSDs (somehow).

So I just wonder if there are any gotchas, or any advice people who run ceph today can give me.

I plan on running with 3 replicas - and I read that 5 is a good number of nodes, since it allows some downtime on individual nodes.

Thanks for all advice you can give.

P.S. I know I will not have awesome performance, since it's only a 1Gbit network, but I don't want speed demons - I want stability and secure storage for my kubernetes cluster.

Edit: I will probably use something similar to this to add two extra SATA M.2 drives: Controller Card - PCIe 2x M.2 SATA SSD - SATA Controller Cards | Denmark
 

oneplane

Active Member
Jul 23, 2021
I'll be following along on this one. I too am considering a 'tiny' Ceph cluster for Kubernetes or Harvester labs where speed really isn't much of a thing (heck, 30MB/s would be enough) but reliability, availability, and durability are important. Most Ceph guidance assumes high bandwidth and IOPS, and if you don't need that the advice often devolves into 'use something else', which misses the point of wanting to use Ceph.
 

PigLover

Moderator
Jan 26, 2011
I'd strongly recommend that you run Kubernetes on the machines serving as your Ceph cluster as well - and then use Rook to install/manage Ceph on them. Rook is a Kubernetes "operator" that will deploy and manage all of the Ceph components for you, e.g., to do upgrades you just update the image version tags in the Rook config YAML and then Rook & K8s do their magic to get the upgrade done.
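To make that concrete, here's a minimal sketch of a Rook CephCluster manifest - the image tag, mon count, and names are illustrative, not a recommendation:

```yaml
# Minimal sketch of a Rook CephCluster; an upgrade is just bumping the
# image tag below and re-applying the manifest. Versions illustrative.
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v17.2.6   # bump this tag to upgrade Ceph
  dataDirHostPath: /var/lib/rook       # small amount of local state per node
  mon:
    count: 3
  storage:
    useAllNodes: true
    useAllDevices: true
```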
 

Bjorn Smith

Well-Known Member
Sep 3, 2019
r00t.dk
I'd strongly recommend that you run Kubernetes on the machines serving as your Ceph cluster as well - and then use Rook to install/manage Ceph on them. Rook is a Kubernetes "operator" that will deploy and manage all of the Ceph components for you, e.g., to do upgrades you just update the image version tags in the Rook config YAML and then Rook & K8s do their magic to get the upgrade done.
So basically expand my kubernetes cluster to include the 5 new nodes - but then have ceph running only on these 5 new nodes - I guess that would simplify things in terms of installing and managing ceph.

Do you know if the ceph cluster can be accessed from the outside of kubernetes? Just in case I want to use some of the storage from external machines.
 

PigLover

Moderator
Jan 26, 2011
So basically expand my kubernetes cluster to include the 5 new nodes - but then have ceph running only on these 5 new nodes - I guess that would simplify things in terms of installing and managing ceph.

Do you know if the ceph cluster can be accessed from the outside of kubernetes? Just in case I want to use some of the storage from external machines.
Yes, that would probably be the best approach. Though you could set the Ceph cluster machines up as a separate/independent Ceph cluster and then expose it to your primary Kubernetes cluster.

In either case it will take doing some reading on Rook to get things right - running as one big cluster you'd need to put together labels to make sure the Ceph components run on the right machines (and only those machines) and to ensure other workloads don't get scheduled there. Running as two clusters you'd have to do some work to expose Ceph between them (not too hard - and, BTW, that answers your other question: yes, you can expose Ceph outside the cluster).
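For the one-big-cluster route, Rook's CephCluster spec has a `placement` section that does the label pinning. Something like this - the `role=ceph` label is just an example name, and you'd still taint the nodes to keep other workloads off:

```yaml
# Excerpt from a CephCluster spec: pin every Ceph daemon to nodes
# carrying a (hypothetical) role=ceph label, so Ceph only runs on the
# five storage nodes. Keeping other workloads off those nodes is the
# usual taint/toleration dance on top of this.
spec:
  placement:
    all:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: role
                  operator: In
                  values: ["ceph"]
```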

What I do is just run Ceph and my K8s workloads as one cluster, sorta hyper-converged. I say "sorta" because I'm really only using Ceph for persistent storage (PVs) inside K8s. My primary datastore still lives on a NAS.

Best idea would be to go take a look at Rook and see if it will work for you.
 

oneplane

Active Member
Jul 23, 2021
I've been looking at Rook for this as well, but it still has the chicken-and-egg problem: you need to store Kubernetes data before Kubernetes is up and running, but you can't store it in Ceph because the cluster isn't up yet.

One way to bootstrap this would be some sort of bootstrap cluster using something more simplistic and then use that to start a bootstrap-Rook-Ceph to store application cluster data, then have a second cluster run Rook as well (on different nodes) and use that for daily use.

Harvester bypasses this problem a little bit but still has the bootstrap problem in a different way.
 

PigLover

Moderator
Jan 26, 2011
I've been looking at Rook for this as well, but it still has the chicken-and-egg problem: you need to store Kubernetes data before Kubernetes is up and running, but you can't store it in Ceph because the cluster isn't up yet.

One way to bootstrap this would be some sort of bootstrap cluster using something more simplistic and then use that to start a bootstrap-Rook-Ceph to store application cluster data, then have a second cluster run Rook as well (on different nodes) and use that for daily use.

Harvester bypasses this problem a little bit but still has the bootstrap problem in a different way.
It really doesn't have that problem. The only state required to get started is a small amount of local storage for etcd, which runs on the control plane node (or nodes, if you are running a redundant control plane) and doesn't depend on PVs at all.

Kubernetes' baseline expectation is that applications are stateless. You only have to provide storage if you are going to bring up a stateful application, which you can do after you bring up Rook/Ceph.

So the bootstrap process is:
- build your cluster and install K8s on it.
- edit the Rook YAML files to describe your OSDs and Networking
- install Rook (which is as simple as applying the Rook YAML files)
- wait a bit for your OSDs to initialize and come on line
- define your PVs on Rook (again - as simple as applying the right YAML files, which you can find in the Rook examples)
- Install whatever else you want on the K8s cluster
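The "define your PVs" step boils down to a pool plus a StorageClass, e.g. (trimmed from the Rook examples - real manifests also need the CSI secret parameters, omitted here for brevity):

```yaml
# A 3-replica RBD pool and a StorageClass that workloads can request
# PVCs from. Trimmed from the Rook examples; the csi secret parameters
# are omitted for brevity.
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  replicated:
    size: 3            # matches the 3-replica plan above
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: replicapool
  csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Delete
```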

There is no "chicken & egg" conundrum.
 

oneplane

Active Member
Jul 23, 2021
It does have that problem for me, because the control plane nodes are virtual machines that need to be stored somewhere ;-)
So the "- build your cluster and install K8s on it." for me would be stored on Ceph. I think I'm probably ending up doing two clusters, one in Proxmox just for VM storage and one in Kubernetes just for PV.
 

PigLover

Moderator
Jan 26, 2011
It does have that problem for me, because the control plane nodes are virtual machines that need to be stored somewhere ;-)
So the "- build your cluster and install K8s on it." for me would be stored on Ceph. I think I'm probably ending up doing two clusters, one in Proxmox just for VM storage and one in Kubernetes just for PV.
Fair play - but you are mixing two different problems in a way that creates a chicken & egg issue for you.
 

Sean Ho

seanho.com
Nov 19, 2019
Vancouver, BC
Each node, whether ceph or k8s, will still want to have some local storage beyond OSD drives. Rook/ceph needs a bit in `/var/lib/rook` for each osd and mon, and each k8s node needs a fair bit of space for the CRI (container images and writable layers), not just etcd.

Virtualization is a separate issue from distributed storage. If you want your k8s central/server to be in a VM, that's fine, just put the VM image on something that doesn't depend on k8s (could be iSCSI target from pve if you like). Similarly, the ceph daemons run by rook are all containers, but (by default) use local mounts for volumes.

If you want to migrate your VMs onto your k8s cluster, consider kubevirt.
 

oneplane

Active Member
Jul 23, 2021
Well, that's the thing; if you kubevirt your controlplane you're back in chicken-and-egg territory. Harvester tries to solve it by making every node be a possible apiserver, scheduler, etcd node, controller manager etc. but that also comes with distributed storage (but not PV, only object storage IIRC?).

You could run Kubernetes bare-metal, but why would anyone do that? You'd be beholden to IPMI and SoL and KVM... and every node would be a special hands-on install. Kubernetes always requires storage, and storage local to the OS can still be remote storage. It doesn't need to be distributed, of course, but unless you want to always have two separate clusters (a storage cluster and a Kubernetes cluster), you'll probably want that. Saying it's a different thing is true, but it's comparable to saying CPU and memory are not a singular thing for a node - while true, you can't really use one without the other.
 

Sean Ho

seanho.com
Nov 19, 2019
Vancouver, BC
Yep, I am suggesting running k8s bare-metal. The failover design of k8s is not to live-migrate k8s nodes to another machine, but to let nodes fail and build the application to be resilient - e.g., stateless application servers with state stored in HA DB clusters and bulk storage in systems like ceph.

I run a six-node HCI k3s cluster on much beefier hardware than the Dell 5070, using rook/ceph. Each node has an SSD for root, and runs debian and k3s bare-metal, deployed via ansible. K3d is another option.
 

oneplane

Active Member
Jul 23, 2021
Oh definitely, I wasn't suggesting failing over virtual machines, but just not having free-standing operating systems whenever possible. I like my hardware nodes to be compute and memory and nothing else (except if we're talking about specialised hardware like coral accelerators and GPUs but that's what labels and taints are for).

I suppose one thing that could work is a local-only virtualisation setup where the only point of failure would be having zero control plane VMs remaining, but in that scenario you'd re-deploy from manifests and charts in Git anyway. But I always make things needlessly complex and run non-kubernetes virtual machines that do need distributed storage (mainly because it doesn't fit in a single 6-disk node). Perhaps that's also where my solution could be: KubeVirt has no problem with nested KVM, so I could do node-local storage, run K8S and non-K8S VMs on that, but first run K8S with Rook, run a RADOS GW on that, and then have the non-K8S VMs use those.

Edit: and now I've realised that I'd essentially be re-inventing Harvester.
 

Sean Ho

seanho.com
Nov 19, 2019
Vancouver, BC
I like my hardware nodes to be compute and memory and nothing else (except if we're talking about specialised hardware like coral accelerators and GPUs but that's what labels and taints are for).
I completely empathise; I also PXE boot all my machines (which also works around issues like old BIOSes not booting off NVMe) and have a couple of diskless compute nodes (not ceph OSDs!) using NFS root. The inescapable tension that k8s folks have struggled with for a long time is providing persistent storage via cattle nodes. You can make nodes interchangeable with regard to compute and networking, but storage has inertia. Disks are specialised hardware just like GPUs; resilvers/backfills take time.

This is also why despite HCI being great for homelabs, in production even if the cluster is designed for HCI, often it gravitates towards a group of storage-heavy nodes and a group of compute-heavy nodes. Add in a dedicated back-end network for ceph, and we're basically back to a SAN, just using ethernet instead of FC.
 

oneplane

Active Member
Jul 23, 2021
You can make nodes interchangeable with regard to compute and networking, but storage has inertia. Disks are specialised hardware just like GPUs; resilvers/backfills take time.

This is also why despite HCI being great for homelabs, in production even if the cluster is designed for HCI, often it gravitates towards a group of storage-heavy nodes and a group of compute-heavy nodes. Add in a dedicated back-end network for ceph, and we're basically back to a SAN, just using ethernet instead of FC.
Yep, that's what I keep coming back to as well. Before K8S (and before Mesos + Marathon) we had things like PXE+NFS root as well (well, technically a rather fat initrd with /usr and some of /etc and /var mounted on top of that), and for application persistent storage it was mostly NBD and iSCSI. Then we had some OCFS and even SMB before we went for Lustre. That worked great and was a PITA in equal measure, and iSCSI was kept side-by-side. Before we resolved most of the issues or even had a Ceph cluster in production, nearly all workloads were shipped off to various clouds, as doing all that internally really wasn't a core business benefit to the company or customers; it didn't (and doesn't) make the customer products better, so why bother, even if it did marginally save costs.

Currently all that is left in the classic storage method is iSCSI and SMB, mostly due to legacy services that aren't going to be migrated but are being shut down one by one as they reach end-of-life anyway.

That does not, of course, mean that the knowledge becomes useless, or that a homelab can't benefit from it ;-) And there are small projects (even on the business end) where it still makes sense to run things locally instead of hosted.

I did have a full-PXE+NFS Xen cluster running at home at some point, but virtual machine disks over NFS aren't terribly performant, even with NFSv4 multipath. On top of that, Xen has fallen out of favour and KVM works fine. That makes Proxmox and KubeVirt (or Harvester) on top of that a pretty good target (but I haven't measured nested hypervisor performance losses yet).

Edit: sorry for the partial thread hijacks and braindump posts, it's almost as if I'm rubberducking the forum or something like that :oops: For everyone else who stumbles upon this: if you just want PVs for K8S and Ceph fits your hardware, Rook is definitely the way to go. If you make your life hard (like I seem to be doing), well, it's going to be harder than that :D