Proxmox Ceph hardware question

vrod · Mar 12, 2018

Hi all,

I am looking to start a hosting firm and find Proxmox with Ceph an interesting platform for it. I am planning to use my Dell C6220 box for it. However, I do have some questions regarding the hardware. The boxes have this hardware:

2x E5-2680
256GB 800MHz DDR3 ECC (quad rank)
dual port 10g + dual port 1g onboard

The chassis is the 12x 3.5" version, so I was looking to populate 3 nodes with 4x 7.2K SATA 2/3TB drives (12 OSDs total) and pop in an Optane 900p 280GB for journaling in the remaining PCIe slot.

My question now is whether this disk configuration makes sense. I will be using HGST UltraStar drives, with which I've had good experiences, but are there more sensible drives to use?

I have also heard some people saying that 3-node clusters are not a good idea. Would it be better to use my 4th node (which would require me to recable the hotswap bay for 3,3,3,3 instead of 4,4,4) instead of just 3?

Any advice/suggestion is greatly appreciated.

IT33513 · Mar 16, 2018

>I have also heard some people saying that 3-node clusters are not a good idea. Would it be better to use my 4th node (which would require me to recable the hotswap bay for 3,3,3,3 instead of 4,4,4) instead of just 3?

A 3-node cluster is just fine.

What about switching? Are you planning to use cross-over connection or switching?

MiniKnight · Mar 16, 2018

That's probably OK to get started with. I'd prefer doing 3,3,3,3 + 1 extra node for a 5th if you can. When you're hosting other people's apps and data, you don't want downtime.

IT33513 · Mar 16, 2018

>When you're hosting other people's apps and data, you don't want downtime.

That's true, but I won't say that you really need 5 nodes to provide decent redundancy level.

MiniKnight · Mar 16, 2018

IT33513 said:
That's true, but I won't say that you really need 5 nodes to provide decent redundancy level.

What happens during a failure when 1/3 of your cluster is down and it begins to rebalance? Let's say you use 2 replicas, so 3 copies of data: Ceph cannot put the third copy on a new machine. You've got 33.3% of your data needing to go over the network to rebalance, putting lots of load on your existing drives.

I've done 3 nodes. Now I won't do fewer than 5. 3 works fine when everything is online, less so when a node is down.

Even on the RAM, if you need to migrate VMs to existing nodes each remaining node will see a 50% increase in required RAM if 1 of 3 nodes fails.
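
To put numbers on the RAM point, here is a rough sketch. It assumes VM memory is spread evenly across nodes and gets redistributed evenly across the survivors, which real clusters only approximate. The function name is just for illustration.

```python
def extra_ram_fraction(total_nodes: int, failed_nodes: int = 1) -> float:
    """Fractional RAM increase each surviving node must absorb.

    Assumption: VM memory was evenly spread before the failure and is
    evenly redistributed afterwards.
    """
    survivors = total_nodes - failed_nodes
    if survivors <= 0:
        raise ValueError("no surviving nodes to migrate VMs to")
    return failed_nodes / survivors


for n in (3, 4, 5):
    print(f"{n} nodes, 1 failed: +{extra_ram_fraction(n):.0%} RAM per surviving node")
```

With 3 nodes the survivors each take on half again as much; with 5 nodes the hit drops to a quarter.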

PigLover · Mar 16, 2018

@IT33513 - it all depends on what you are trying to achieve.

3 nodes is fine for a lab. It works and can operate during some failure modes.
4 nodes is OK for production if you don't promise "real" HA and can keep a "maintenance window" where things might not work because you take a node down on purpose.
5 nodes is the minimum set for true HA production operation.

tl;dr:

If it's for a lab and you want to experiment with Proxmox/Ceph in a clustered environment then 3 is fine. When all nodes are up and running and things are normal then 3 nodes works really well. When a node faults Ceph will normally keep running, though replica writes will "pend", which might eventually cause the cluster to refuse to complete writes. In fact, 3 is really good for this kind of lab because it will give you the opportunity to experience and learn, when things go badly, why so many people say 3 is not enough.

At the end of the day, the whole purpose of the cluster is to be able to survive and thrive during node outages. Most users start out thinking about unplanned outages (faults) - but in the long run faults are actually rare with modern hardware and the really big value comes when you actually trust the cluster to work right, which allows you to change your day-to-day operations to allow more flexibility in "planned" outages for maintenance, etc. There are big $$$ savings on operational expense from reducing overtime and night-work premiums and generally being able to get stuff done faster and with fewer people.

In order to achieve the level of "trust" required to change operational procedures, you need to be confident that (a) the system will operate normally if you take a node offline on purpose and (b) that the system will still survive a fault while you have a node offline "on purpose".

In order to do that with Ceph (and to some extent Proxmox) you need to be able to recover the cluster to a completely balanced normal operating mode even with a node out of service. This requires that you have a "+1" node in your Ceph cluster. For reasons I won't debate here, Ceph with 1 replica (2 copies) is a bad idea, so you want 2 replicas (3 copies). This yields 4 nodes (3+1) as the minimum set to actually achieve resilient and reliable service when a node is offline. Subject to the note below, 4 nodes might retain resilience even with two nodes offline (if they are the right two nodes...).
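
The "+1" rule can be sketched as a simple check. This assumes a CRUSH rule that puts at most one copy on each host and ignores disk capacity. Both are simplifications, and the function name is made up for illustration.

```python
def can_fully_rebalance(total_nodes: int, offline_nodes: int, pool_size: int = 3) -> bool:
    """True if every copy can still be placed on its own node.

    Assumptions: one copy per host (host-level failure domain) and
    enough free disk space on the surviving nodes.
    """
    return total_nodes - offline_nodes >= pool_size


for nodes in (3, 4, 5):
    for offline in (1, 2):
        state = "can rebalance" if can_fully_rebalance(nodes, offline) else "stays degraded"
        print(f"{nodes} nodes, {offline} offline: {state}")
```

This only covers getting back to a fully balanced state. Whether the cluster keeps running at all while degraded is a separate question.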

We are discussing here Ceph running in a hyperconverged configuration, with Ceph MON, Ceph OSD and user workloads (Proxmox) all running together on the same nodes. With Ceph, you not only need to maintain a set of OSDs sufficient to contain the data Placement Groups and satisfy the resiliency rules (the CRUSH map), you also need to maintain a quorum among the MONs. Quorum is a strict majority ("half + 1"), so in your initial configuration you would run MONs on all three nodes, which still maintains quorum even with one node failed. With four nodes it makes no sense to add a MON on the fourth node because you don't gain any additional resilience; in fact you actually get less. With 4 MONs quorum needs 3, so if any two nodes fail you lose quorum among the MONs and your Ceph cluster is offline (for writes). With four nodes and 3 MONs you at least stand a chance of maintaining quorum with two failed nodes as long as one of them is the node with no MON, which is half of the 2-node failure cases (3 of 6). Note that you also lose Proxmox quorum with two nodes failed in a 4-node cluster, but this is mostly about Ceph.

You fix this with a 5th node. With 5 nodes you actually gain value by adding MONs on the 4th and 5th nodes. You now have really good OSD resiliency (copies + 2). And you survive any two-node outage - more than survive: subject to available space on your disks you can actually achieve fully balanced "normal" operation of your pools. As an operator, if your business continuity risk policies permit, you could safely take a node out for maintenance at any time and have confidence that a single-node fault in the remaining nodes would have little or no meaningful impact on your system operation. This is the stable, production quality minimum state for a Proxmox+Ceph HA cluster.
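
If you want to check the MON quorum arithmetic yourself, a small enumeration does it. This is a sketch only: it treats quorum as a strict majority of the configured MONs and counts every combination of failed nodes as equally likely. The node numbering and function name are made up for illustration.

```python
from itertools import combinations


def quorum_survival(total_nodes: int, mon_nodes: set[int], failures: int) -> tuple[int, int]:
    """Return (cases where MON quorum survives, total failure cases)."""
    needed = len(mon_nodes) // 2 + 1  # strict majority of configured MONs
    cases = list(combinations(range(total_nodes), failures))
    surviving = sum(1 for failed in cases if len(mon_nodes - set(failed)) >= needed)
    return surviving, len(cases)


layouts = [
    ("3 nodes, 3 MONs, 1 failure", 3, {0, 1, 2}, 1),
    ("4 nodes, 4 MONs, 2 failures", 4, {0, 1, 2, 3}, 2),
    ("4 nodes, 3 MONs, 2 failures", 4, {0, 1, 2}, 2),
    ("5 nodes, 5 MONs, 2 failures", 5, {0, 1, 2, 3, 4}, 2),
]
for label, nodes, mons, failures in layouts:
    ok, total = quorum_survival(nodes, mons, failures)
    print(f"{label}: quorum survives in {ok} of {total} cases")
```

The 4-node layout with a MON on every node never survives a double failure, while 5 nodes with 5 MONs survives every one.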

For further growth, you can add additional nodes with OSD and Proxmox workloads in any number needed. You do not need to (and probably should not) add additional MONs beyond the initial 5. At some point, if the cluster grows enough, it might make sense to isolate the MONs onto their own servers. But that is a different discussion - and with today's hardware your cluster has to get pretty large before it is required for capacity (though if you are getting large it might make sense for other operational procedure reasons).

All of this leads to the three summary line bullets above...
