Confused about high availability clusters, swarms, etc.

Octavio Masomenos · Jul 8, 2022

I’m building a home lab and I want to self-host a few websites and several services like NextCloud, Joplin, PhotoPrism, etc. I plan to run each service in a Docker swarm for high availability and load balancing. I’d like to run all these containers on an HA cluster of servers; I have several older SFF/UCFF (Intel NUC, etc.) machines and I want to put them to use in a cluster so that all of these high-value services will remain available and performant even if one or more containers/nodes fail and/or if one or more of the physical servers fails.

Although I won’t fully understand how all this works until I roll up my sleeves and start getting my hands dirty, I think I understand the basics of Docker swarms. I’m struggling, though, to wrap my head around the server clusters. I found some articles/tutorials/guides but they all seem to revolve around setting up a web server. That’s fine; I want/need to set up a web server but these guides seem to want the web stack to sit on top of the hardware cluster - non-containerized - and that’s not how I want to do things. They make it seem like the service is/has to be integrated into the cluster setup. Can’t I just set up the server cluster, then whatever I run on it (swarms for the various services) just automatically has the benefits of the HA cluster? Or am I missing something?

I think this is how I intend to approach things…

Set up a Debian server on a single machine.
Install Docker and get a LAMP container running with a single, very simple, static website.
Set up a Cloudflare secure tunnel to serve that website up to the world.
Set up a swarm and get the web server/website running on the swarm (confirming that it’s still publicly accessible).
Back all that up, then add a couple more physical machines and try to create the server cluster.
Once that’s done, get the web server/site back up.
Deploy other containers/nodes/swarms/services.

Please let me know if you see any problems with my approach and/or if you have suggestions for improving on it. Thank you for your time and consideration.

Sean Ho · Jul 8, 2022

Lots of ways to do it! Plan out what your failure modes might be and what level of resilience you want for each. At home you presumably only have a single pipe to the Internet, and a single power feed from the outside (though perhaps you have multiple circuits to your rack from your breaker panel). UPS, PDU, switch, cooling can all be single points of failure. Most of the time when folks are thinking about swarm/k8s they're just thinking about nodes as failure domains.

Even just focusing on node failure, persistent storage is tricky. It's easy to fail over stateless services like the reverse proxy and php-fpm, but your HA nextcloud service is useless if the single node storing the actual data is inaccessible. So we're thinking about gluster, ceph, longhorn, minio, etc. HA DB is another "fun" rabbit hole; for performance you generally don't want to run a single DB on top of clustered storage. Rather, let the DB do its own replication across multiple instances.
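To put rough numbers on the storage point, here is a minimal Python sketch of the arithmetic. The 99% per-node uptime figure and the assumption that nodes fail independently are purely illustrative, not measurements from anyone's lab.

```python
from math import prod


def series(*parts: float) -> float:
    """Availability of a chain where every part must be up."""
    return prod(parts)


def parallel(node: float, replicas: int) -> float:
    """Availability when at least one of `replicas` independent copies must be up."""
    return 1 - (1 - node) ** replicas


# Assumed, illustrative figure: each node is up 99% of the time.
NODE = 0.99

web_tier = parallel(NODE, 3)   # three stateless replicas on different nodes
single_storage = NODE          # one node holds the actual data

print(f"stateless tier alone:        {web_tier:.6f}")
print(f"stateless tier + one storage: {series(web_tier, single_storage):.6f}")
```

Under these assumptions the combined figure ends up slightly below that of a single node: replicating only the stateless tier cannot lift the service above the availability of the one storage node it depends on.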

Services on the cluster are run through the cluster's own overlay network so traffic is seamlessly routed to the proper node. So deployment of services is usually containerised.

Octavio Masomenos · Jul 8, 2022

Sean Ho said:
At home you presumably only have a single pipe to the Internet, and a single power feed from the outside (though perhaps you have multiple circuits to your rack from your breaker panel). UPS, PDU, switch, cooling can all be single points of failure. Most of the time when folks are thinking about swarm/k8s they're just thinking about nodes as failure domains.

Thanks for the response, Sean! Yeah, I did give a little thought to all that but I’m just going to have to live with downtime if my internet goes down. Maybe later I’ll give some thought to a failover (maybe 5G?) device if it doesn’t cost me a lot. I plan to get a UPS later to deal with brief power outages. I can definitely put the modem, router, switch, and all the servers on it. I didn’t really consider UPS, PDU, and switch failures. I’ll add that to my “to think about later” list! This whole self-hosting thing sure has its pitfalls! But learning about all this stuff is kinda the whole reason I’m doing it so the more I have to think about and learn, the better off I’ll be.

Sean Ho said:
your HA nextcloud service is useless if the single node storing the actual data is inaccessible.

Single node? Crap. I’m very new to all this, but I was assuming that once I clustered 3-5 servers, there’d be some sort of data store that spanned them all. Seems like some kind of UnionFS would just be built in. Hmmm. I’ll have to Google gluster, ceph, longhorn, minio, etc. - never heard of those but maybe the answer is in there somewhere. I am going to have a machine that’s nothing but a big disk array with SnapRAID and MergerFS. But I see what you mean; ultimately, it’s just a single node.

Sean Ho said:
Services on the cluster are run through the cluster's own overlay network so traffic is seamlessly routed to the proper node. So deployment of services is usually containerised.

OK. That makes sense. So am I right that all I need to do is get the server cluster up and then whatever (containers/services) I run on there automatically has the benefits of HA without having to be specifically configured for it?

Sean Ho · Jul 8, 2022

Ah, well it's not quite so effortless, but a lot of the heavy lifting is done for you by swarm/k8s. You'd still need to configure a deployment with multiple instances of the service (e.g. php process) running on different nodes. In k8s, you'd typically have:

a container (docker) with the binaries, volume mount points, etc.,
a Pod with ports to expose, volumes, and potentially other helper/sidecar containers,
a Deployment that specifies how the Pod is run on nodes (e.g. how many replicas you want),
a Service that maps ports and lets other processes on the cluster access the app,
an Ingress that tells the ingress controller (ingress-nginx, traefik) what domain name to use and CN for the TLS cert.

These would all just be yaml files. For a common app like NextCloud, you can easily find a Helm chart that packages all those together.
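As a rough illustration of how the Deployment and Service pieces fit together, here is a minimal Python sketch that builds both as plain dictionaries and prints them as JSON, which kubectl accepts alongside YAML. The app name "web" and the nginx:stable image are placeholders chosen for the example, and the Ingress and volumes are left out to keep it short.

```python
import json

APP = "web"  # hypothetical app name, used as the label that ties everything together

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": APP},
    "spec": {
        "replicas": 3,  # how many copies of the Pod to keep running
        "selector": {"matchLabels": {"app": APP}},
        "template": {  # the Pod template
            "metadata": {"labels": {"app": APP}},
            "spec": {
                "containers": [
                    {
                        "name": APP,
                        "image": "nginx:stable",  # placeholder image
                        "ports": [{"containerPort": 80}],
                    }
                ]
            },
        },
    },
}

service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": APP},
    # The selector matches the Pod labels above, so traffic reaches every replica.
    "spec": {"selector": {"app": APP}, "ports": [{"port": 80, "targetPort": 80}]},
}

manifest = {"apiVersion": "v1", "kind": "List", "items": [deployment, service]}
print(json.dumps(manifest, indent=2))
```

If this were saved as a script, its output could be piped to kubectl apply -f - on a test cluster. The key detail is that the Deployment's selector, the Pod template's labels, and the Service's selector all carry the same label.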

I'm not as familiar with swarm, but I'm sure it has something similar.

You're off to a good start; enjoy the journey!

Sean Ho · Jul 8, 2022

I learned k8s through a mixture of k3s documentation for the bare-metal install, plus k8s documentation for various cluster resources. I went back and reviewed Kelsey Hightower's "Kubernetes the Hard Way" afterward; it is very helpful to see how the various modular bits of k8s fit together in practise, although (1) portions need to be adapted if running on bare-metal rather than on cloud VMs, and (2) most of the modular bits are combined in k3s and handled for you anyway.

I had a bit of a learning curve with k8s, but a much steeper one (and still ongoing!) with ceph. Clustered storage is not a small task. You might start with just letting your storage node be a SPoF and having your containers mount via NFS or iSCSI or whatnot.

i386 · Jul 9, 2022

How many nodes do you want to run?
How many nodes are allowed to fail without operations being impacted?
How many nodes must fail before operations are stopped (automatically through policies or manually)?*
A+B power + generator(s)?
Redundant internet links?

These kinds of questions are what I ask myself when I'm thinking about "HA", and that's before even looking at the software side ._.

*stuff you learn when it happens ._.
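For the "how many nodes are allowed to fail" question, the control plane has a hard answer of its own: Swarm managers and the etcd store behind k8s both use Raft-style consensus, which needs a majority of members to stay up. A small Python sketch of that arithmetic follows; it covers only manager/etcd members, not worker nodes or storage.

```python
def quorum(managers: int) -> int:
    """Smallest majority of a group of `managers` members."""
    return managers // 2 + 1


def tolerated_failures(managers: int) -> int:
    """How many members can be lost while a majority still remains."""
    return managers - quorum(managers)


for n in range(1, 8):
    print(f"{n} managers: quorum {quorum(n)}, can lose {tolerated_failures(n)}")
```

Going from 3 to 4 members does not raise the number of tolerated failures, which is why odd counts such as 3 or 5 are the usual recommendation.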

Bjorn Smith · Aug 26, 2022

I run my Kubernetes cluster on 5 small Dell Wyse 5070 machines, with storage via NFS. All configuration etc. is stored on the NFS server; the nodes mount the configuration read-only, and whatever a service needs write access to as r/w. Obviously, services that need r/w do not run with multiple instances, but the beauty of Kubernetes is that when a node dies, the pod just gets deployed to another node and everything should just work.
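A layout like this could be sketched roughly as follows, again as a Python dictionary printed as JSON. The NFS server name, export paths, pod name, and image are all hypothetical placeholders, not details of the setup described above.

```python
import json

NFS_SERVER = "nas.example.lan"  # hypothetical NFS server address


def nfs_volume(name: str, path: str, read_only: bool) -> dict:
    """Build a Pod volume entry backed by an NFS export."""
    return {
        "name": name,
        "nfs": {"server": NFS_SERVER, "path": path, "readOnly": read_only},
    }


pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "app"},
    "spec": {
        "containers": [
            {
                "name": "app",
                "image": "nginx:stable",  # placeholder image
                "volumeMounts": [
                    # Configuration is mounted read-only.
                    {"name": "config", "mountPath": "/config", "readOnly": True},
                    # Data the service must write to is mounted read-write.
                    {"name": "data", "mountPath": "/data"},
                ],
            }
        ],
        "volumes": [
            nfs_volume("config", "/export/app/config", True),
            nfs_volume("data", "/export/app/data", False),
        ],
    },
}

print(json.dumps(pod, indent=2))
```

Because the data lives on the NFS server rather than on any one node, the pod can be rescheduled elsewhere and mount the same exports, though the NFS server itself remains a single point of failure.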

I have no idea about Docker swarm. I think it's "easier" than k8s, but I might be wrong.
