MVC - Minimum Viable Ceph


OBasel

Active Member
Dec 28, 2010
494
62
28
What I really want is the ability to start with 2 boxes and grow to 5 over time. Add a new box with bigger hard drives every year for three years, then start retiring the older ones.

I'd prefer to use the small SMC 4-bay chassis with Xeon D-1508s. That'll give me 10G and they're cheap. Upgrade a few of the CPUs every other year and use the oddball ones for simple VMs and containers like Ubiquiti, maybe Nextcloud, that sort of thing.

I also know there isn't a 2 box ceph. FreeNAS has been my tried and true but it just seems so long in the tooth. I installed FreeNAS 11 and tried the new interface. We have a QNAP at the office and that thing is light years beyond FreeNAS.

In a home environment I'm nervous about a long power outage. With 3-5 Ceph nodes, if a utility line is downed by a car or weather, it means a full cluster reboot.

In a DC, Ceph is easy. At home it's a lot to take on with few nodes.

Even with 12 drives in 3 systems, is that going to be too small to be usable on 10G? I'm planning to have 3 replicas of everything, so if a system goes down I'm fine. I'm also considering using md RAID and keeping it simple, but then I don't have replicas. Rsync is a solution too, but it doesn't let me expand easily using small form factors.
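Back-of-the-envelope, 3 replicas eat a lot of raw space. A quick Python sketch of the math (the 4 TB drive size is just a placeholder assumption, not hardware I've actually picked):

```python
# Rough sketch of usable capacity for the 12-drive / 3-node / 3-replica idea.
# The 4 TB drive size is an assumed placeholder.

DRIVES = 12
DRIVE_TB = 4          # assumed drive size, not from the build plan
REPLICAS = 3          # 3 copies of every object

raw_tb = DRIVES * DRIVE_TB
usable_tb = raw_tb / REPLICAS

print(f"Raw: {raw_tb} TB, usable with {REPLICAS} replicas: ~{usable_tb:.0f} TB")
# Raw: 48 TB, usable with 3 replicas: ~16 TB
```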

Maybe what I really need is 2 versions of the SMC: one with slow drives for bulk storage and a virtualization version with 8 drives. They're not making that, so it's not an option without the bigger FreeNAS case.
 

Patrick

Administrator
Staff member
Dec 21, 2010
12,512
5,800
113
I am not going to be much help here. The smallest implementation I have done and used for any amount of time is ~50 drives IIRC.
 

niekbergboer

Active Member
Jun 21, 2016
154
59
28
46
Switzerland
I was confused by this as well, thinking that I could build an n+2 setup with three boxes. But it isn't n+2: it's n+1, and the problem is not the OSDs but the monitors:

The problem with 2 boxes is that you cannot maintain monitor quorum when one of the boxes is down: the number of monitors you need for quorum is the smallest integer greater than half of your machine count, which is 2 for 2 machines. Thus, if one machine is down or rebooted, you lose quorum and your cluster goes (or should go) either read-only or fully into an error state.

For 3 machines, minimum quorum is also 2, so there you can afford to seamlessly down/reboot one machine. It isn't until you have 5 machines that you can down 2 at the same time.
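The majority rule is just floor(n/2)+1. A tiny Python sketch of that arithmetic, nothing Ceph-specific, just to show where the 2-, 3- and 5-machine numbers come from:

```python
# Quorum math sketch: a Ceph monitor quorum needs a strict majority,
# i.e. the smallest integer greater than half the monitor count.

def quorum_size(monitors: int) -> int:
    """Smallest number of monitors forming a majority: floor(n/2) + 1."""
    return monitors // 2 + 1

def tolerable_failures(monitors: int) -> int:
    """Monitors that can be down while a quorum is still possible."""
    return monitors - quorum_size(monitors)

for n in (2, 3, 5):
    print(f"{n} monitors: quorum needs {quorum_size(n)}, "
          f"tolerates {tolerable_failures(n)} down")
# 2 monitors: quorum needs 2, tolerates 0 down
# 3 monitors: quorum needs 2, tolerates 1 down
# 5 monitors: quorum needs 3, tolerates 2 down
```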

As for speed: I run 3 boxes with 2 OSDs on Intel S3500s each, and I can read some 600 MByte/s off CephFS. I didn't benchmark raw RBDs.

Edit: as for power, my X11SSL-F boxes (with a Core i3-6100T, effectively a dual-core Xeon (!)) consume 22W idle and some 25-27W in normal use. Three of those shouldn't break the bank power-wise.
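To put a rough number on "shouldn't break the bank", a quick sketch (the electricity rate is an assumed placeholder, plug in your local price):

```python
# Ballpark running cost for three low-power nodes at the draw quoted above.

NODES = 3
WATTS_PER_NODE = 25        # roughly the idle-to-light-load figure above
PRICE_PER_KWH = 0.25       # assumed rate in local currency per kWh

kwh_per_year = NODES * WATTS_PER_NODE * 24 * 365 / 1000
print(f"~{kwh_per_year:.0f} kWh/year, "
      f"~{kwh_per_year * PRICE_PER_KWH:.0f}/year at {PRICE_PER_KWH}/kWh")
# ~657 kWh/year, ~164/year at 0.25/kWh
```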
 
Last edited:

Evan

Well-Known Member
Jan 6, 2016
3,346
598
113
The easy option for Ceph (or vSAN and others) is to run a 3rd box with a VM on local disk as a monitor.
None of these solutions perform well with a low number of nodes/disks.

Local RAID and failover/replication (e.g. ZFS send) generally seems to be a better solution.
Having said that, if the performance is OK for your needs, Ceph is fun to play with.
 

PigLover

Moderator
Jan 26, 2011
3,186
1,545
113
I've played with this 9 ways to Sunday, including small home/lab configs all the way up to massive configs at work (one with over 1,000 OSDs).

After all was said and done, I do not find Ceph very viable for small configs and have moved away from it for most home/lab use cases. It's just not worth the effort. For larger configs the extra effort is worth it to get the easy scalability and flexibility, but at small scale it really isn't.

Personally, for two nodes and home use, I'd stick with ZFS + replication. FreeNAS 11 makes that easy, but if you think FreeNAS is getting too long-in-the-tooth or bloated there are certainly a lot of other ways to achieve ZFS + Replication.

More directly on-topic:
  • Minimum Viable Ceph: As @Evan noted, for two-node Ceph you can set OSD replicas to "1" (two copies) and use a small server or VM as the third MON. I think you'll quickly find this config disappointing, and I would not recommend doing it.

  • As @niekbergboer notes, if power is your concern and you want a simple implementation, I would look at the X11SSH-F with a low-power i3 or v6 Xeon. DO NOT use the X11SSH-TF to get 10GbE, as that board borked the PCIe distribution and has a useless 60mm M.2 slot (one of the few design screw-ups in SM's product line). Get the -F board and a used Mellanox card for 10GbE. 20-25W idle is very doable. You won't get system power much below 40W with the D-1508.

  • You can do it with three nodes. In nominal operation 3 nodes is fully stable and it will still operate with 2 nodes running. However, with a node offline the OSDs cannot recover to a fully "stable" state and there is a very small chance you could see writes blocked.

  • For me, Minimum Practical Ceph (MPC?) is really 5 nodes, 10 OSDs. This is the configuration that meets all of the best practices (odd number of nodes, etc.), allows replica 2 (3 copies), is reasonably performant, and is able to re-balance the OSDs to a fully stable state during a node failure (and bonus: can survive a 2nd node failure at the same time). If you use the low-power nodes noted above you can keep this at about 120W idle, so it won't break the bank on electricity use.

  • Lastly, if you're comparing to ZFS and want efficient disk use for bulk storage, you also want to use Ceph erasure coding. I'll leave out the details, but to get reasonable redundancy, reasonably efficient disk use, and the same rules noted above, MVC with erasure coding is really 7 nodes (assuming 5+2 erasure coding for reasonable efficiency) and MPC (maintaining fully stable operation during a failure) is 9 nodes; the rough capacity math is sketched below. This is the reason I have effectively abandoned Ceph for most home/lab use (but still love it for the things I work with on the job).
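For anyone who wants the rough capacity math behind those node counts, here's a small Python sketch of raw-to-usable efficiency; it just restates the 3-copy vs 5+2 arithmetic, nothing Ceph-specific:

```python
# Disk-efficiency sketch behind the node counts above. In the post's
# terminology "replica 2" means 3 total copies; 5+2 erasure coding stores
# 5 data chunks plus 2 coding chunks per object.

def replicated_efficiency(copies: int) -> float:
    """Fraction of raw capacity usable with n-copy replication."""
    return 1.0 / copies

def erasure_efficiency(k: int, m: int) -> float:
    """Fraction of raw capacity usable with k+m erasure coding."""
    return k / (k + m)

print(f"3-copy replication: {replicated_efficiency(3):.0%} usable")  # ~33%
print(f"5+2 erasure coding: {erasure_efficiency(5, 2):.0%} usable")  # ~71%

# Assuming one chunk per node (as described above), a 5+2 profile spreads
# each object across 7 nodes, so 7 nodes is the floor, and keeping fully
# stable operation through a node failure pushes that toward 9.
```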
 

niekbergboer

Active Member
Jun 21, 2016
154
59
28
46
Switzerland
Thanks for that extra info on the X11SSH-TF @PigLover ! I've been close to pulling the trigger on one of those at least twice ;)

The fact that taking down one node in a 3-node setup leaves the cluster in a degraded (though functional) state is indeed a pain; after one node broke (the mobo went belly up), I couldn't reboot either of the other two boxes until I got a replacement.

That said: things *did* still work despite one system breaking, and that's the main reason I have the cluster (I use Proxmox VE with its Ceph management). I don't care enough about performance to go 5 or more nodes.

Also, with three nodes and a workstation, the very affordable TP-Link 24x GbE + 4x SFP+ switch (at only 14-15W) gets the whole thing connected at 10GbE. Going to more nodes means either much more expensive or much more power-hungry switches.

As for capacity, one of my nodes has a sizeable ZFS config, and backups to an off-site box.
 

kroem

Active Member
Aug 16, 2014
251
44
28
38
Stealing your thread a bit maybe - I want/need shared storage for a 3-node PVE cluster (one is a monitoring node only...). Is a one-SSD-per-box solution even viable? So two SSDs in two hosts. I don't really need great performance, I just want to be able to do some HA.