Proxmox & CEPH

gb00s · Jan 19, 2024

I would like to start a short poll that shall run for 90 days here. Any thoughts about Ceph deployment and/or other distributed HA storage solutions are very much appreciated.

Thanks and keep voting.

ano · Jan 19, 2024

Do you mean specifically for Proxmox now?

We run Ceph for a lot of stuff, and for RGW and CephFS we separate it out. For Proxmox, yes, I think it's cool to let it handle Ceph, but also to have an external cluster; it really depends on use cases and other parameters.

For vSAN replacement, we plan to run a Proxmox cluster and let Proxmox handle storage.

For other cases, where compute is high and storage options are limited, external makes more sense.

hats off to the people of proxmox who integrated ceph (and even reef) so effortlessly.

Rand__ · Jan 19, 2024

Can you elaborate on the vsan replacement? (sorry for the stupid question, totally new to Proxmox)

I think that's going to be a hot topic going forward (for myself included) ...

ano · Jan 19, 2024

Yes I can, but what exactly? What vSAN is, or how Proxmox and Ceph replace it? (Not the stretch bit, but yes.)

Rand__ · Jan 19, 2024

ano said:
For vSAN replacement, we plan to run a Proxmox cluster and let Proxmox handle storage.

This part

When I look at a Proxmox cluster, it only seems to be the clustering of the nodes and not related to providing shared storage made up of local disks (which is what vSAN did).

Terry Wallace · Jan 19, 2024

I run a large Proxmox cluster. All machines are part of the cluster. 6 of the machines have Ceph added (through the Proxmox GUI) and have the extra NIC added for Ceph backbone traffic. I don't place any VMs/CTs on those nodes, letting them effectively be storage-only nodes. All other nodes run their VMs off disks stored in the Ceph cluster. (You can run it the other way, where Ceph is added to every node, but you need the extra NIC for every node, and taking a node down for work generally means Ceph rebalancing, as you'd lose OSDs.)

gb00s · Jan 19, 2024

Terry Wallace said:
... (You can run it the other way, where Ceph is added to every node, but you need the extra NIC for every node, and taking a node down for work generally means Ceph rebalancing, as you'd lose OSDs.)

Does this mean there's no option to avoid rebalancing while you maintain one node? As an example, with MooseFS you can put a node into 'Maintenance Mode' and the cluster does not care at all and no rebalancing is happening. The Cluster just assumes the node is ok. Otherwise, I would gladly combine storage and compute in nodes for efficiency purposes. The extra NIC is not that big of a thing I guess.

zunder1990 · Jan 19, 2024

gb00s said:
Does this mean there's no option to avoid rebalancing while you maintain one node? As an example, with MooseFS you can put a node into 'Maintenance Mode' and the cluster does not care at all and no rebalancing is happening. The Cluster just assumes the node is ok.

Yes, you can do the same thing with Ceph.
Set those two flags on the host that is about to reboot.

Remove the flags after the host is back online.

ano · Jan 19, 2024

Setting noout/norebalance is what you normally do; then it doesn't autoheal/rebuild during short maintenance.
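The maintenance flags named above can be sketched roughly as follows. This is only a minimal sketch: it assumes the two flags meant in this thread are `noout` and `norebalance`, set cluster-wide with the standard `ceph osd set` / `ceph osd unset` commands. The helper names `flag_commands` and `apply_flags` are hypothetical, and the script defaults to a dry run that only prints the commands.

```python
import subprocess

# Assumption: these are the two flags discussed in the thread.
MAINTENANCE_FLAGS = ("noout", "norebalance")


def flag_commands(action, flags=MAINTENANCE_FLAGS):
    """Return the ceph CLI invocations that set or unset each flag."""
    if action not in ("set", "unset"):
        raise ValueError("action must be 'set' or 'unset'")
    return [["ceph", "osd", action, flag] for flag in flags]


def apply_flags(action, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in flag_commands(action):
        print(" ".join(cmd))
        if not dry_run:
            # Requires the ceph CLI and admin credentials on this machine.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    apply_flags("set")    # before taking the node down
    apply_flags("unset")  # after the node and its OSDs are back
```

Calling `apply_flags("set", dry_run=False)` on a node with admin access would run the commands for real. Forgetting to unset the flags afterwards leaves the cluster unable to heal itself, so the unset step matters as much as the set step.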

You don't need an extra NIC. The reason for an extra NIC is bandwidth for backfilling/recovery operations. If you run, say, 2x40G in LACP, or 2x100, or 2x25 etc., you can get away with using just that.

SSD/NVMe and 1x10G? You're gonna fill that fast.

Terry Wallace · Jan 19, 2024

The recommended way is to set 'noout' for maintenance. I have some nodes that have better reliability and power backup than others, so if I have a compute node crash I'm not worried about it, unlike a Ceph node crashing. (That's just me and my setup.)

On the Ceph backend network: if you don't have the backend network, it's very easy to overwhelm your entire network if you have any type of unplanned rebalance event (some SSD dies, for instance). I don't build Ceph without one. The people that wrote it knew what they were doing.
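To put rough numbers on the bandwidth concern, here is a back-of-the-envelope sketch of how long a link stays saturated while one drive's worth of data is re-replicated. The 4 TB drive size and the helper name `transfer_seconds` are hypothetical, and the result is idealized: real recovery is spread across many OSDs and links, is throttled by Ceph, and competes with client traffic.

```python
def transfer_seconds(data_tb, link_gbit, efficiency=1.0):
    """Seconds needed to move data_tb terabytes over a link_gbit Gbit/s link.

    efficiency is the usable fraction of line rate (1.0 = ideal).
    """
    data_bits = data_tb * 1e12 * 8
    return data_bits / (link_gbit * 1e9 * efficiency)


if __name__ == "__main__":
    # Hypothetical example: one failed 4 TB SSD that has to be re-replicated.
    for gbit in (10, 25, 40, 100):
        minutes = transfer_seconds(4.0, gbit) / 60
        print(f"{gbit:>3} Gbit/s: about {minutes:.0f} min at line rate")
```

On a single 10 Gbit/s link that works out to roughly 53 minutes of a fully saturated link, which is the situation a separate or much faster network is meant to keep away from client traffic.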

geonap · Jan 19, 2024

here is the set up i'm thinking about building.

two 40G Mellanox switches with MLAG

6 servers in total
dual 7313 epyc milan 16c/32t each
512gb ram
1 p4800x
4 p4610
40g mlx3

looking for at least two failures to tolerate, 3 would be a plus.
based off of the 6.4tb drives, 3 replicas means i could get a comfortable 30-40TB with 3 failures.
initial data migration is about 20TB maybe reaching 30TB in two years.

the systems have a total of 192 cores, total needed cores for migration from standalone boxes are about 130 cores, there is more than enough memory to cover those.

there's a possibility i can bring up a 7th node if that'll help.

trying to run around 10-15 vm's right now, mostly windows with stable and protected storage.
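The capacity and core figures in this plan can be checked with a small sketch. It assumes all four P4610 drives per node are the 6.4 TB model and that the pool is only filled to about 80% to leave headroom for recovery; that fill ratio and the function names are hypothetical, not Ceph defaults.

```python
def usable_tb(nodes, drives_per_node, drive_tb, replicas, fill_ratio=0.8):
    """Usable capacity of a replicated pool after replication and headroom."""
    raw_tb = nodes * drives_per_node * drive_tb
    return raw_tb / replicas * fill_ratio


def total_cores(nodes, sockets_per_node, cores_per_socket):
    """Physical core count across the whole cluster."""
    return nodes * sockets_per_node * cores_per_socket


if __name__ == "__main__":
    for nodes in (6, 7):
        cap = usable_tb(nodes, drives_per_node=4, drive_tb=6.4, replicas=3)
        cores = total_cores(nodes, sockets_per_node=2, cores_per_socket=16)
        print(f"{nodes} nodes: ~{cap:.1f} TB usable, {cores} cores")
```

With 6 nodes this gives about 41 TB usable and 192 cores, in line with the 30-40TB and 192-core figures above. Note that a replica count of 3 keeps three copies of each object, so data survives the loss of two copies, not three.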

Terry Wallace · Jan 19, 2024

LOL, that's pretty close to what I used to run before upgrading: Mellanox SX6036 and ConnectX-3 Pro VPI cards in Ethernet mode.
Nodes had dual Samsung 500G drives as boot pairs (ZFS mirror install), 1 Optane 280G HHHL card for the Ceph WAL, and HGST 3.8TB enterprise SSDs for storage.

since swapped cards out to Connectx 5 ( 2x100g ) and switches out to Arista 32x100.

If you ever want to chat specifics hit me up in PM.

NPS · Jan 19, 2024

geonap said:
dual 7313 epyc milan 16c/32t each
512gb ram

What's your reason to go dual socket but low core count? I always came to the conclusion that single socket is more cost efficient unless you want to have more than 64 cores or need more than 512GB of RAM (1TB can also be possible for acceptable cost with 2DPC)

Terry Wallace · Jan 19, 2024

For me at the time it was cost. Picking up used DL380s got me dual socket, more RAM slots (cheaper RAM at smaller sizes), hot-swap power supplies and enterprise-class iLO, all on the cheap. Then I dropped some 10c/20-thread CPUs in each socket (cheaply, as it's older hardware). I was into enterprise-grade nodes for around $400 each at the time (not counting the SSDs).

NPS · Jan 19, 2024

Terry Wallace said:
For me at the time it was cost. Picking up used DL380s got me dual socket, more RAM slots (cheaper RAM at smaller sizes), hot-swap power supplies and enterprise-class iLO, all on the cheap. Then I dropped some 10c/20-thread CPUs in each socket (cheaply, as it's older hardware). I was into enterprise-grade nodes for around $400 each at the time (not counting the SSDs).

Totally understandable but for EPYC I always came to different results.

ano · Jan 19, 2024

geonap said:
here is the set up i'm thinking about building.

two 40G Mellanox switches with MLAG

6 servers in total
dual 7313 epyc milan 16c/32t each
512gb ram
1 p4800x
4 p4610
40g mlx3

looking for at least two failures to tolerate, 3 would be a plus.
based off of the 6.4tb drives, 3 replicas means i could get a comfortable 30-40TB with 3 failures.
initial data migration is about 20TB maybe reaching 30TB in two years.

the systems have a total of 192 cores, total needed cores for migration from standalone boxes are about 130 cores, there is more than enough memory to cover those.

there's a possibility i can bring up a 7th node if that'll help.

trying to run around 10-15 vm's right now, mostly windows with stable and protected storage.

The P4800X is kinda pointless; apart from that, it will be very nice.

ano · Jan 19, 2024

Terry Wallace said:
On the Ceph backend network: if you don't have the backend network, it's very easy to overwhelm your entire network if you have any type of unplanned rebalance event (some SSD dies, for instance). I don't build Ceph without one. The people that wrote it knew what they were doing.

The project's recommendations are changing these days. Usually people just VLAN stuff instead of running a separate network (on 2x100 and other high-bandwidth links).

ano · Jan 19, 2024

As for HPE DL385(?) nodes: I'd rather do a single 7763, 7443 etc. than dual 7313.

Dual EPYC 3rd gen is kinda.. underwhelming. We have tested dual 7402, 7313, 7443, 7763, 75F3 etc. vs. single, and single is of course.. very epyc! And the 7313s are very nice!

Beware that you MUST tune BIOS + OS for performance on AMD 2nd/3rd gen, or you will lose 25-35% performance with VMs and most workloads.

ceph REALLY likes more nodes.

only reason for dual epyc on 3rd gen with 7313, would be not having enough ram slots depending on motherboard etc

geonap · Jan 19, 2024

i got an amazing deal on 7313’s a year ago, also 7443’s. I am using the 7443 servers another way.

ano · Jan 19, 2024

Well, they are superb CPUs. We use the 7313 for stuff with per-core licensing that doesn't need the per-core performance of the 75F3 and such. SPLA and others are much cheaper to license that way for VMs.

What's your preferred method to install and run Ceph storage in your environment?

Ceph storage deployment within Proxmox cluster

Separate Ceph storage cluster deployment

I use another distributed storage solution

I stay with VMware/vSAN
