Planning 40Gbps Ceph Cluster Lab Build - Gotchas / Pitfalls?


Yves

Member
Apr 4, 2017
Hi there,

I already almost hijacked boe's thread, so I thought I'd start a new one here. I want to learn, experiment and try out building a Ceph cluster as a storage area network for my ESXi compute cluster lab. Since I already have 24x 8TB HDDs from WD lying around (I'd buy one more so it divides evenly across the 5 servers), my idea is to get 5x DL380p Gen8 (they are quite inexpensive these days) and equip each of them with:

5x WD RED 8TB (Ceph Storage Disk)
1x Samsung 512GB 850 Pro SSD or Samsung 500GB 960 Evo (Ceph Cache Disk)
1x HP InfiniBand FDR/Ethernet 10/40Gb 2-port 544FLR-QSFP
1x HPE Dual 120GB Value Endurance Solid State Drives M.2 Enablement Kit 777894-B21 (RHEL or CentOS System Disk)

and for everything to connect:
2x Mellanox SX6036 36ports QSFP 56Gb/s managed

After setting up the Ceph cluster I would use iSCSI multipathing to connect it to my ESXi compute cluster, which is not yet 40Gbps, but with multipathing I would at least be able to use the full dual 10Gbps I have.
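Quick sanity check on the numbers (just a sketch: the 3x replication factor, TB as 10^12 bytes, and the assumption that both 10Gbps iSCSI paths can be driven in parallel are mine, not measured):

```python
# Rough sizing sketch for the planned 5-node build.
# Assumptions: default 3x replicated pool, decimal TB, both iSCSI paths usable in parallel.

nodes = 5
osds_per_node = 5          # 5x WD RED 8TB per server
osd_size_tb = 8
replication = 3            # default replicated pool size

raw_tb = nodes * osds_per_node * osd_size_tb
usable_tb = raw_tb / replication

# Client-side ceiling over iSCSI multipath (2x 10Gbps), ignoring protocol overhead
iscsi_gbytes_s = 2 * 10 / 8

print(f"Raw capacity:    {raw_tb} TB")          # 200 TB
print(f"Usable (3x rep): {usable_tb:.0f} TB")   # ~67 TB
print(f"iSCSI ceiling:   ~{iscsi_gbytes_s:.1f} GB/s before overhead")
```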

Possible gotchas / pitfalls:
  • Where do I put the 2.5" drive if I go with that option instead of the NVMe?
  • Does the HP 544FLR, which seems to be a ConnectX-3, really work with that switch?
  • The HP controller connects to the backplane with only 2 cables, which makes me think 2x4 SAS lanes, so what if I attach 12 SATA disks? Will it even work, or will it underperform? (see the quick math after this list)
  • The HP P420i controller has an HBA mode which I hope will accept non-HP HDDs
  • I've never done Ceph before, so I am not sure if 512GB is enough cache for that many 8TB disks. Would the NVMe change much performance-wise compared to the SATA SSD in Ceph? (also touched on in the math below)
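Some rough numbers for the SAS-lane and cache-size bullets above (a sketch only: the SAS2 line rate, ~180 MB/s per HDD, and the often-quoted 1-4% block.db rule of thumb are assumptions, not facts about this exact hardware):

```python
# Back-of-envelope check for the 2x4 SAS lane and shared SSD sizing questions.
# Assumptions: SAS2 (6 Gbps/lane) between controller and backplane,
# ~180 MB/s sequential per 8TB SATA HDD, 1-4% of OSD size for block.db.

lanes = 2 * 4
lane_gbps = 6                                   # SAS2 per-lane line rate
backplane_gb_s = lanes * lane_gbps / 8 * 0.8    # ~80% left after 8b/10b encoding

hdds = 12
hdd_gb_s = hdds * 180 / 1000

print(f"Backplane ceiling: ~{backplane_gb_s:.1f} GB/s")   # ~4.8 GB/s
print(f"12 HDDs combined:  ~{hdd_gb_s:.1f} GB/s")         # ~2.2 GB/s, well under it

# Shared SSD sizing for the 5 OSDs in one node
osd_tb = 8
for pct in (1, 2, 4):
    per_osd_gb = osd_tb * 1000 * pct / 100
    print(f"{pct}% rule: {per_osd_gb:.0f} GB per OSD, {5 * per_osd_gb:.0f} GB for 5 OSDs")
```

So the 512GB SSD roughly covers the 1% end of that range for 5 OSDs, but not the 4% end.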

I also attached a little PDF which I drew in Visio as an idea of how to connect it all.
 


MiniKnight

Well-Known Member
Mar 30, 2012
NYC
I'd double-check that that is really what you want switch-wise. Some of those are set to do InfiniBand only.
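On the adapter side you can at least check and flip the ConnectX-3 port type from the OS; a minimal sketch assuming the mlx4_core driver (the PCI address is a placeholder, and the card/firmware has to allow Ethernet mode):

```python
# Check (and optionally switch) the port type of a ConnectX-3 / 544FLR port
# via the mlx4_core sysfs interface. Needs root; the PCI address is a placeholder
# (find yours with: lspci | grep Mellanox).
from pathlib import Path

PCI_ADDR = "0000:04:00.0"   # placeholder
port_file = Path(f"/sys/bus/pci/devices/{PCI_ADDR}/mlx4_port1")

current = port_file.read_text().strip()
print(f"Port 1 is currently: {current}")     # "ib" or "eth"

if current != "eth":
    port_file.write_text("eth\n")            # persist via /etc/rdma/mlx4.conf on RHEL/CentOS
    print("Requested switch to Ethernet mode")
```

The switch itself still needs its ports running Ethernet (and licensed for it), which is the part to verify before buying.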
 

Rand__

Well-Known Member
Mar 6, 2014
So you will of course need the appropriate license for the switches.

12 SATA disks on 2x4 SAS lanes should be no problem.

I tried Ceph with 3 Optane-based nodes once (no optimization, out of the box) and it was quite slow, so I'm not sure it will be usable.
But of course it will be fine for learning Ceph :)
 

Mautobu

New Member
May 6, 2019
Reds are 5400 rpm drives and will limit your speed. Ultrastars are a much better choice if you can afford the difference. I'd like to echo that those Samsung drives are questionable: you want something with high endurance, ideally SLC or 2-bit MLC, which I doubt is very common anymore; TLC is likely what you'll find. Check this out: Enterprise Storage Solution - Enterprise SSD | Samsung SSD
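Rough endurance math to show why (a sketch: the 20 MB/s sustained write rate and both TBW figures are illustrative assumptions, so check the datasheet of whatever drive you actually pick):

```python
# Why endurance matters for a shared journal/DB SSD: every client write to the
# OSDs behind it also lands on that one drive. All numbers below are assumptions.

avg_write_mb_s = 20                        # assumed sustained writes hitting the node
tb_written_per_year = avg_write_mb_s * 86400 * 365 / 1e6
print(f"~{tb_written_per_year:.0f} TB written per year at {avg_write_mb_s} MB/s")

for name, tbw in [("consumer 500GB class", 300), ("enterprise write-intensive class", 3600)]:
    print(f"{name}: ~{tbw} TBW rating -> worn through in ~{tbw / tb_written_per_year:.1f} years")
```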
 

MikeWebb

Member
Jan 28, 2018
Just starting out with Ceph. Did a POC with VMs to get an insight into the technology... very nice, and so far there's still a lot to get my head around. Nothing provisioned on hardware yet, as I've learnt my nodes are too fat and too (2) few. So I speak with no authority, just Google knowledge.

5 OSDs per node over 5 nodes. Nice! Good resilience and rebalance speed with either an OSD or a node failure.

40GbE will probably give you around 21-25Gb/s of throughput, but with low latency. I don't think you would see that speed on the client side even if ESXi had the support, but your cluster network will enjoy it, especially as it rebalances when you lose or take down a node.

Will this be replicated (3/2, 3/1) or EC?
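For comparison, rough usable-capacity numbers for the ~200TB raw (a sketch: the 3+2 EC profile is just an example that fits 5 hosts with a host failure domain, and note that RBD on an EC data pool needs EC overwrites enabled):

```python
# Usable capacity of ~200 TB raw under replication vs erasure coding.
# k=3, m=2 is only an example: k+m has to fit the 5 hosts when the
# failure domain is "host". Not a recommendation.

raw_tb = 200

rep3 = raw_tb / 3
k, m = 3, 2
ec = raw_tb * k / (k + m)

print(f"Replica 3: ~{rep3:.0f} TB usable, tolerates 2 lost hosts")
print(f"EC {k}+{m}:    ~{ec:.0f} TB usable, tolerates {m} lost hosts, more CPU and latency")
```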

With 5 OSDs sharing the SSD for DB and WAL, I would look at an enterprise grade SSD (S3700?) over a consumer NVMe: lose your journal and you lose all 5 OSDs. Maybe start without the separate DB/WAL, see how you go, take a baseline benchmark, then add the SSD (sticking it anywhere with velcro seems to be the go) and benchmark again; bask in the glory of your decision or head down the rabbit hole of optimisation.
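Something like this works for the baseline-then-compare runs (a sketch: pool name, PG count and runtimes are arbitrary, rados bench gives cluster-side numbers rather than what ESXi will see over iSCSI, and deleting the pool needs mon_allow_pool_delete enabled):

```python
# Minimal before/after benchmark wrapper around "rados bench".
# Run it before and after adding the SSD DB/WAL and compare the output.
import subprocess

def run(cmd):
    print("+", cmd)
    subprocess.run(cmd, shell=True, check=True)

run("ceph osd pool create bench 128")                            # throwaway test pool
run("rados bench -p bench 60 write -b 4M -t 16 --no-cleanup")    # 60s of 4MB writes
run("rados bench -p bench 60 seq -t 16")                         # read the objects back
run("rados -p bench cleanup")
run("ceph osd pool delete bench bench --yes-i-really-really-mean-it")
```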

Will the MONs be on the same nodes, along with any other gateways? You need to consider resource allocation for those, though they really don't need much.

How much memory are you looking at running? You could get away with 48GB, but 64GB would be safest (more = more cache). Depends what else you are running on it.
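Rough maths behind that (a sketch: 4GB is the default BlueStore osd_memory_target, while the MON and OS allowances are just ballpark assumptions):

```python
# Ballpark memory budget per node. 4 GB is the default BlueStore
# osd_memory_target; the MON and OS allowances are rough guesses.

osds_per_node = 5
osd_memory_target_gb = 4      # BlueStore default
mon_gb = 2                    # if a MON runs on the same node
os_and_slack_gb = 6

total = osds_per_node * osd_memory_target_gb + mon_gb + os_and_slack_gb
print(f"~{total} GB baseline -> 48 GB works, 64 GB leaves headroom for recovery spikes")
```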

CPU? Balance clock speed against core count and price. Probably an E5-2637 v2, E5-2630 v2 or the L variant. Again, it depends what else you are running on it; looking at the use case, I think the lower clock speed wouldn't affect the IOPS you see at the client. Dual CPU... overkill, plus NUMA headaches.

Overall, sizing should be done not for normal use but for recovery; failing to do so could lead to cascading failures when the cluster is getting smashed by a rebalance while also serving its intended use... ouch.

Redundant OS drives (and power supplies) IMHO are not really needed for this setup. You have 5 nodes using a solution whose philosophy is "stuff will fail", and 5 nodes are best managed with Ansible. I would have a spare disk on hand and, in the event of a node failure due to the OS drive dying, replace the drive and use Kickstart (if using CentOS) plus Ansible. You'll have the node up and running quick smart. (For my POC, I provisioned with Vagrant and Ansible... very fast to blow away and rebuild, and I find Kickstart files easier than Vagrant.)

Anyway, that's about the limit of my notes and knowledge. Hopefully someone will come along and correct this if I'm completely off the mark... then we've both learnt something.