Ceph advice

Bjorn Smith

Well-Known Member
Sep 3, 2019
r00t.dk
Hi,

I have just bought yet another batch of 5x Dell Wyse 5070 - this time the extended versions - and I intend to deploy a small ceph cluster to use as storage for my kubernetes cluster, which I built on another batch of Dell Wyse 5070's.

I know these machines are tiny - but they have 4 cores each and should easily be upgradeable to 32 GB of memory if required. I plan on adding a separate NIC so ceph has an isolated storage network, and then use the public network for clients.

I will install 2 or 3 SSDs (somehow).

So I just wonder if there are any gotchas, or any advice people who run ceph today can give me.

I plan on running with 3 replicas - and I read that 5 is a good number of nodes, since it allows some downtime on individual nodes.

Thanks for all advice you can give.

P.S. I know I will not have awesome performance, since it's only a 1Gbit network, but I don't want speed demons - I want stability and secure storage for my kubernetes cluster.

Edit: I will probably use something similar to this to add two extra SATA M.2 drives: Controller Card - PCIe 2x M.2 SATA SSD - SATA Controller Cards | Denmark
 

oneplane

Active Member
Jul 23, 2021
I'll be following along on this one. I too am considering a 'tiny' Ceph cluster for Kubernetes or Harvester labs where speed really isn't much of a thing (heck, 30MB/s would be enough) but reliability, availability, and durability are important. Most Ceph guidance assumes high bandwidth and IOPS, and if you don't need that the advice often devolves into 'use something else', which misses the point of wanting to use Ceph.
 

PigLover

Moderator
Jan 26, 2011
I'd strongly recommend that you run Kubernetes on the machines serving as your Ceph cluster as well - and then use Rook to install/manage Ceph on them. Rook is a Kubernetes "operator" that will deploy and manage all of the Ceph components for you, e.g., to do upgrades you just update the image version tags in the Rook config YAML and then Rook & K8s do their magic to get the upgrade done.
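To make that concrete, here's a minimal sketch of a Rook CephCluster manifest - the image tag, mon count, and names are illustrative, not a recommendation:

```yaml
# Minimal sketch of a Rook CephCluster; an upgrade is just bumping the
# image tag below and re-applying the manifest. Versions illustrative.
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v17.2.6   # bump this tag to upgrade Ceph
  dataDirHostPath: /var/lib/rook       # small amount of local state per node
  mon:
    count: 3
  storage:
    useAllNodes: true
    useAllDevices: true
```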
 

Bjorn Smith

Well-Known Member
Sep 3, 2019
r00t.dk
I'd strongly recommend that you run Kubernetes on the machines serving as your Ceph cluster as well - and then use Rook to install/manage Ceph on them. Rook is a Kubernetes "operator" that will deploy and manage all of the Ceph components for you, e.g., to do upgrades you just update the image version tags in the Rook config YAML and then Rook & K8s do their magic to get the upgrade done.
So basically expand my kubernetes cluster to include the 5 new nodes - but then have ceph running only on these 5 new nodes - I guess that would simplify things in terms of installing and managing ceph.

Do you know if the ceph cluster can be accessed from the outside of kubernetes? Just in case I want to use some of the storage from external machines.
 

PigLover

Moderator
Jan 26, 2011
So basically expand my kubernetes cluster to include the 5 new nodes - but then have ceph running only on these 5 new nodes - I guess that would simplify things in terms of installing and managing ceph.

Do you know if the ceph cluster can be accessed from the outside of kubernetes? Just in case I want to use some of the storage from external machines.
Yes, that would probably be the best approach. Though you could set the Ceph cluster machines up as a separate/independent Ceph cluster and then expose it to your primary Kubernetes cluster.

In either case it will take doing some reading on Rook to get things right - running as one big cluster you'd need to put together labels to make sure the Ceph components run on the right machines (and only those machines) and to ensure other workloads don't get scheduled there. Running as two clusters you'd have to do some work to expose Ceph between them (not too hard - and, BTW, that answers your other question: yes, you can expose Ceph outside the cluster).
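For the one-big-cluster route, Rook's CephCluster spec has a `placement` section that does the label pinning. Something like this - the `role=ceph` label is just an example name, and you'd still taint the nodes to keep other workloads off:

```yaml
# Excerpt from a CephCluster spec: pin every Ceph daemon to nodes
# carrying a (hypothetical) role=ceph label, so Ceph only runs on the
# five storage nodes. Keeping other workloads off those nodes is the
# usual taint/toleration dance on top of this.
spec:
  placement:
    all:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: role
                  operator: In
                  values: ["ceph"]
```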

What I do is just run Ceph and my K8s workloads as one cluster, sorta hyper-converged. I say "sorta" because I'm really only using Ceph for persistent storage (PVs) inside K8s. My primary datastore still lives on a NAS.

Best idea would be to go take a look at Rook and see if it will work for you.
 

oneplane

Active Member
Jul 23, 2021
I've been looking at Rook for this as well, but it still has the chicken-and-egg problem: you need to store Kubernetes data before Kubernetes is up and running, but you can't store it in Ceph because the cluster isn't up yet.

One way to bootstrap this would be some sort of bootstrap cluster using something more simplistic and then use that to start a bootstrap-Rook-Ceph to store application cluster data, then have a second cluster run Rook as well (on different nodes) and use that for daily use.

Harvester bypasses this problem a little bit but still has the bootstrap problem in a different way.
 

PigLover

Moderator
Jan 26, 2011
I've been looking at Rook for this as well, but it still has the chicken-and-egg problem: you need to store Kubernetes data before Kubernetes is up and running, but you can't store it in Ceph because the cluster isn't up yet.

One way to bootstrap this would be some sort of bootstrap cluster using something more simplistic and then use that to start a bootstrap-Rook-Ceph to store application cluster data, then have a second cluster run Rook as well (on different nodes) and use that for daily use.

Harvester bypasses this problem a little bit but still has the bootstrap problem in a different way.
It really doesn't have that problem. The only state required to get started is a small amount of local storage for etcd, which runs on the control plane node (or nodes, if you are running a redundant control plane) and doesn't depend on PVs at all.

Kubernetes' baseline expectation is that applications are stateless. You only have to provide storage if you are going to bring up a stateful application, which you can do after you bring up Rook/Ceph.

So the bootstrap process is:
- build your cluster and install K8s on it.
- edit the Rook YAML files to describe your OSDs and Networking
- install Rook (which is as simple as applying the Rook YAML files)
- wait a bit for your OSDs to initialize and come on line
- define your PVs on Rook (again - as simple as applying the right YAML files, which you can find in the Rook examples)
- Install whatever else you want on the K8s cluster
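The "define your PVs" step boils down to a pool plus a StorageClass, e.g. (trimmed from the Rook examples - real manifests also need the CSI secret parameters, omitted here for brevity):

```yaml
# A 3-replica RBD pool and a StorageClass that workloads can request
# PVCs from. Trimmed from the Rook examples; the csi secret parameters
# are omitted for brevity.
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  replicated:
    size: 3            # matches the 3-replica plan above
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: replicapool
  csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Delete
```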

There is no "chicken & egg" conundrum.
 

oneplane

Active Member
Jul 23, 2021
It does have that problem for me, because the control plane nodes are virtual machines that need to be stored somewhere ;-)
So the "- build your cluster and install K8s on it." for me would be stored on Ceph. I think I'm probably ending up doing two clusters, one in Proxmox just for VM storage and one in Kubernetes just for PV.
 

PigLover

Moderator
Jan 26, 2011
It does have that problem for me, because the control plane nodes are virtual machines that need to be stored somewhere ;-)
So the "- build your cluster and install K8s on it." for me would be stored on Ceph. I think I'm probably ending up doing two clusters, one in Proxmox just for VM storage and one in Kubernetes just for PV.
Fair play - but you are mixing two different problems in a way that creates a chicken & egg issue for you.
 

Sean Ho

seanho.com
Nov 19, 2019
Vancouver, BC
Each node, whether ceph or k8s, will still want to have some local storage beyond OSD drives. Rook/ceph needs a bit in `/var/lib/rook` for each osd and mon, and each k8s node needs a fair bit of space for the CRI (container images and writable layers), not just etcd.

Virtualization is a separate issue from distributed storage. If you want your k8s central/server to be in a VM, that's fine, just put the VM image on something that doesn't depend on k8s (could be iSCSI target from pve if you like). Similarly, the ceph daemons run by rook are all containers, but (by default) use local mounts for volumes.

If you want to migrate your VMs onto your k8s cluster, consider kubevirt.
 

oneplane

Active Member
Jul 23, 2021
Well, that's the thing; if you kubevirt your controlplane you're back in chicken-and-egg territory. Harvester tries to solve it by making every node be a possible apiserver, scheduler, etcd node, controller manager etc. but that also comes with distributed storage (but not PV, only object storage IIRC?).

You could run Kubernetes bare-metal, but why would anyone do that? You'd be beholden to IPMI and SoL and KVM... and every node would be a special hands-on install. Kubernetes always requires storage, and storage local to the OS can still be remote storage. It doesn't need to be distributed, of course, but unless you want to always have two separate clusters (a storage cluster and a Kubernetes cluster), you'll probably want that. Saying it's a different thing is true, but it's comparable to saying CPU and memory are not a singular thing for a node - while true, you can't really use one without the other.
 

Sean Ho

seanho.com
Nov 19, 2019
Vancouver, BC
Yep, I am suggesting running k8s bare-metal. The failover design of k8s is not to live-migrate k8s nodes to another machine, but to let nodes fail and build the application to be resilient - e.g., stateless application servers with state stored in HA DB clusters and bulk storage in systems like ceph.

I run a six-node HCI k3s cluster on much beefier hardware than the Dell 5070, using rook/ceph. Each node has an SSD for root, and runs debian and k3s bare-metal, deployed via ansible. K3d is another option.
 

oneplane

Active Member
Jul 23, 2021
Oh definitely, I wasn't suggesting failing over virtual machines, but just not having free-standing operating systems whenever possible. I like my hardware nodes to be compute and memory and nothing else (except if we're talking about specialised hardware like coral accelerators and GPUs but that's what labels and taints are for).

I suppose one thing that could work is a local-only virtualisation setup where the only point of failure would be having zero control plane VMs remaining, but in that scenario you'd re-deploy from manifests and charts in Git anyway. But I always make things needlessly complex and run non-kubernetes virtual machines that do need distributed storage (mainly because it doesn't fit in a single 6-disk node). Perhaps that's also where my solution could be: KubeVirt has no problem with nested KVM, so I could do node-local storage, run K8S and non-K8S VMs on that, but first run K8S with Rook, run a RADOS GW on that, and then have the non-K8S VMs use those.

Edit: and now I've realised that I'd essentially be re-inventing Harvester.
 

Sean Ho

seanho.com
Nov 19, 2019
Vancouver, BC
I like my hardware nodes to be compute and memory and nothing else (except if we're talking about specialised hardware like coral accelerators and GPUs but that's what labels and taints are for).
I completely empathise; I also PXE boot all my machines (which also works around issues like old BIOSes not booting off NVMe) and have a couple of diskless compute nodes (not ceph OSDs!) using NFS root. The inescapable tension that k8s folks have struggled with for a long time is providing persistent storage via cattle nodes. You can make nodes interchangeable with regard to compute and networking, but storage has inertia. Disks are specialised hardware just like GPUs; resilvers/backfills take time.

This is also why despite HCI being great for homelabs, in production even if the cluster is designed for HCI, often it gravitates towards a group of storage-heavy nodes and a group of compute-heavy nodes. Add in a dedicated back-end network for ceph, and we're basically back to a SAN, just using ethernet instead of FC.
 

oneplane

Active Member
Jul 23, 2021
You can make nodes interchangeable with regard to compute and networking, but storage has inertia. Disks are specialised hardware just like GPUs; resilvers/backfills take time.

This is also why despite HCI being great for homelabs, in production even if the cluster is designed for HCI, often it gravitates towards a group of storage-heavy nodes and a group of compute-heavy nodes. Add in a dedicated back-end network for ceph, and we're basically back to a SAN, just using ethernet instead of FC.
Yep, that's what I keep coming back to as well. Before K8S (and before Mesos + Marathon) we had things like PXE+NFS root as well (well, technically a rather fat initrd with /usr and some of /etc and /var mounted on top of that), and for application persistent storage it was mostly NBD and iSCSI. Then we had some OCFS and even SMB before we went for Lustre. That worked great and was a PITA in equal measure, and iSCSI was kept side-by-side. Before we resolved most of the issues or even had a Ceph cluster in production, nearly all workloads were shipped off to various clouds, as doing all that internally really wasn't a core business benefit to the company or customers; it didn't (and doesn't) make the customer products better, so why bother, even if it did marginally save costs.

Currently all that is left in the classic storage method is iSCSI and SMB, mostly due to legacy services that aren't going to be migrated but are being shut down one by one as they reach end-of-life anyway.

That does not, of course, mean that the knowledge becomes useless, or that a homelab can't benefit from it ;-) And there are small projects (even on the business end) where it still makes sense to run things locally instead of hosted.

I did have a full-PXE+NFS Xen cluster running at home at some point, but virtual machine disks over NFS aren't terribly performant, even with NFSv4 multipath. On top of that, Xen has fallen out of favour and KVM works fine. That makes Proxmox and KubeVirt (or Harvester) on top of that a pretty good target (but I haven't measured nested hypervisor performance losses yet).

Edit: sorry for the partial thread hijacks and braindump posts, it's almost as if I'm rubberducking the forum or something like that :oops: For everyone else who stumbles upon this: if you just want PVs for K8S and Ceph fits your hardware, Rook is definitely the way to go. If you make your life hard (like I seem to be doing), well, it's going to be harder than that :D