Proxmox & CEPH

What's your preferred method to install and run Ceph storage in your environment?


  • Total voters
    17
  • Poll closed.

gb00s

Well-Known Member
Jul 25, 2018
1,217
624
113
Poland
I would like to start a short poll here that will run for 90 days. Any thoughts about Ceph deployment and/or other distributed HA storage solutions are very much appreciated.

Thanks and keep voting.
 

ano

Well-Known Member
Nov 7, 2022
680
290
63
Do you mean specifically for Proxmox?

We run Ceph for a lot of stuff, and for RGW and CephFS we separate those out. For Proxmox, yes, I think it's cool to let Proxmox handle Ceph, but an external cluster can also make sense, depending on use cases and other parameters.

As a vSAN replacement, we plan to run a Proxmox cluster and let Proxmox handle storage.

For other cases, where compute demand is high and storage options are limited, external makes more sense.

Hats off to the Proxmox people who integrated Ceph (and even Reef) so effortlessly.
 

Rand__

Well-Known Member
Mar 6, 2014
6,639
1,774
113
Can you elaborate on the vSAN replacement? (Sorry for the stupid question, I'm totally new to Proxmox.)

I think that's going to be a hot topic going forward (for myself included) ...
 

ano

Yes I can, but what exactly? What vSAN is, or how Proxmox and Ceph replace it? (Not the stretched-cluster bit, but otherwise yes.)
 

Rand__

As a vSAN replacement, we plan to run a Proxmox cluster and let Proxmox handle storage.
This part :)

When I look at a Proxmox cluster, it only seems to be about clustering the nodes, and not related to providing shared storage made up of local disks (which is what vSAN did).
 

Terry Wallace

PsyOps SysOp
Aug 13, 2018
201
131
43
Central Time Zone
I run a large Proxmox cluster. All machines are part of the cluster. Six of them have Ceph added (through the Proxmox GUI) and have an extra NIC for Ceph backbone traffic. I don't place any VMs/CTs on those nodes, which effectively makes them storage-only nodes. All other nodes run their VMs off disks stored in the Ceph cluster. (You can run it the other way, with Ceph added to every node, but then you need the extra NIC in every node, and taking a node down for work generally means Ceph rebalancing, as you'd lose OSDs.)


px1.png

px2.png
 

gb00s

... (You can run it the other way, with Ceph added to every node, but then you need the extra NIC in every node, and taking a node down for work generally means Ceph rebalancing, as you'd lose OSDs.)
Does this mean there's no option to avoid rebalancing while you maintain one node? As an example, with MooseFS you can put a node into 'Maintenance Mode' and the cluster does not care at all and no rebalancing is happening. The Cluster just assumes the node is ok. Otherwise, I would gladly combine storage and compute in nodes for efficiency purposes. The extra NIC is not that big of a thing I guess.
 

zunder1990

Active Member
Nov 15, 2012
219
74
28
Does this mean there's no option to avoid rebalancing while you maintain one node? As an example, with MooseFS you can put a node into 'Maintenance Mode' and the cluster does not care at all and no rebalancing is happening. The Cluster just assumes the node is ok.
Yes, you can do the same thing with Ceph.
Set those two flags on the host that is about to reboot.
1705679588745.png
Remove the flags after the host is back online.
 

ano

Setting noout/norebalance is what you normally do; then the cluster doesn't auto-heal/rebuild during short maintenance.
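
As a rough illustration of that workflow, the sketch below wraps the `ceph osd set` / `ceph osd unset` commands for the two flags named here. The helper names (`flag_commands`, `apply_flags`, `run_for_real`) and the injectable `run` parameter are made up for this example and are not part of any Ceph or Proxmox tooling. Note that flags set this way apply cluster-wide. By default the sketch only prints the commands, so treat it as a dry-run outline rather than a vetted maintenance script.

```python
import subprocess
from typing import Callable, List, Sequence

# The two flags named in the post above.
MAINTENANCE_FLAGS = ("noout", "norebalance")


def flag_commands(action: str, flags: Sequence[str] = MAINTENANCE_FLAGS) -> List[List[str]]:
    """Build the `ceph osd set|unset <flag>` command lines (hypothetical helper)."""
    if action not in ("set", "unset"):
        raise ValueError("action must be 'set' or 'unset'")
    return [["ceph", "osd", action, flag] for flag in flags]


def apply_flags(action: str, run: Callable[[List[str]], object] = print) -> None:
    """Hand each command to `run`; the default just prints it (dry run)."""
    for cmd in flag_commands(action):
        run(cmd)


def run_for_real(cmd: List[str]) -> None:
    """Assumes the `ceph` CLI and an admin keyring are available on this host."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    apply_flags("set")      # before taking the node down
    # ... reboot or service the node here ...
    apply_flags("unset")    # once the node and its OSDs are back up
```

To actually change the flags, pass `run=run_for_real` on a node with Ceph admin access.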

You don't need an extra NIC. The reason for an extra NIC is bandwidth for backfilling/recovery operations. If you run, say, 2x40G in LACP, or 2x100G, or 2x25G etc., you can get away with using just that.

SSD/NVMe on 1x10G? You're going to fill that fast.
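
To put rough numbers on the bandwidth point, here is a back-of-the-envelope sketch of how long it takes to push a given amount of recovery data through a single link at line rate. The 4 TB figure is a hypothetical amount of data to re-replicate, not a number from this thread, and the function name is made up. Real recovery is spread across several hosts and throttled by Ceph, so this is only an idealized estimate for one saturated link.

```python
def transfer_minutes(data_tb: float, link_gbit: float, efficiency: float = 1.0) -> float:
    """Minutes to move `data_tb` terabytes over one `link_gbit` Gbit/s link."""
    bytes_total = data_tb * 1e12                          # decimal terabytes
    bytes_per_second = link_gbit * 1e9 / 8 * efficiency   # bits to bytes
    return bytes_total / bytes_per_second / 60


if __name__ == "__main__":
    # 4 TB is an assumed amount of data to re-replicate after a failure.
    for gbit in (10, 25, 40, 100):
        print(f"{gbit:>3} Gbit/s: {transfer_minutes(4, gbit):6.1f} min")
```

Lowering `efficiency` below 1.0 models a link that is shared with client traffic.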
 

Terry Wallace

Setting 'noout' for the node is the recommended way to do maintenance. I have some nodes with better reliability and power backup than others, so if a compute node crashes I'm not worried about it, unlike a Ceph node crashing. (That's just me and my setup.)

On the Ceph backend network: if you don't have one, it's very easy to overwhelm your entire network during any type of unplanned rebalance event (an SSD dying, for instance). I don't build Ceph without one. The people who wrote it knew what they were doing.
 

geonap

Member
Mar 2, 2021
76
75
18
Here is the setup I'm thinking about building.

Two 40G Mellanox switches with MLAG

6 servers in total, each with:
dual EPYC 7313 (Milan), 16c/32t each
512 GB RAM
1x P4800X
4x P4610
40G mlx3

Looking for at least two failures to tolerate; 3 would be a plus.
Based on the 6.4 TB drives, 3 replicas means I could get a comfortable 30-40 TB with 3 failures.
Initial data migration is about 20 TB, maybe reaching 30 TB in two years.

The systems have a total of 192 cores. The migration from standalone boxes needs about 130 cores, and there is more than enough memory to cover those.

There's a possibility I can bring up a 7th node if that'll help.

Trying to run around 10-15 VMs right now, mostly Windows, with stable and protected storage.
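
The capacity estimate above can be sanity-checked with some simple arithmetic. This sketch assumes all 24 P4610 drives are 6.4 TB data OSDs (the P4800X is not counted), a replicated pool, and a plan to stay under roughly 85% full, which matches Ceph's default nearfull warning ratio. The function names are made up for illustration. It also assumes a pool with N replicas can lose at most N-1 copies of an object without losing data, so tolerating 3 failures would need 4 replicas rather than 3.

```python
def usable_tb(nodes: int, drives_per_node: int, drive_tb: float,
              replicas: int, fill_ratio: float = 0.85) -> float:
    """Usable capacity of a replicated pool, keeping `fill_ratio` headroom."""
    raw_tb = nodes * drives_per_node * drive_tb
    return raw_tb / replicas * fill_ratio


def copies_that_can_be_lost(replicas: int) -> int:
    """With N replicas, up to N-1 copies can be lost without losing data."""
    return replicas - 1


if __name__ == "__main__":
    for nodes in (6, 7):          # 7 covers the possible extra node
        for replicas in (3, 4):
            print(f"{nodes} nodes, size={replicas}: "
                  f"{usable_tb(nodes, 4, 6.4, replicas):.1f} TB usable, "
                  f"tolerates {copies_that_can_be_lost(replicas)} lost copies")
```

Under these assumptions, both replica counts leave room for the 20-30 TB migration on six nodes, with size 4 being the tighter fit.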
 

Terry Wallace

LOL, that's pretty close to what I used to run before upgrading: Mellanox SX6036, ConnectX-3 Pro VPI cards in Ethernet mode.
Nodes had dual Samsung 500 GB drives as boot pairs (ZFS mirror install), one Optane 280 GB HHHL card for the Ceph WAL, and HGST 3.8 TB enterprise SSDs for storage.

I've since swapped the cards for ConnectX-5 (2x100G) and the switches for Arista 32x100G.

If you ever want to chat specifics hit me up in PM.
 

NPS

Active Member
Jan 14, 2021
147
45
28
dual EPYC 7313 (Milan), 16c/32t each
512 GB RAM
What's your reason to go dual socket but low core count? I always came to the conclusion that single socket is more cost efficient unless you want to have more than 64 cores or need more than 512GB of RAM (1TB can also be possible for acceptable cost with 2DPC)
 

Terry Wallace

For me at the time it was cost. Picking up used DL380s got me dual socket, more RAM slots (cheaper RAM at smaller sizes), hot-swap power supplies, and enterprise-class iLO, all on the cheap. Then I dropped 10c/20t CPUs into each socket (cheap, as it's older hardware). I was into enterprise-grade nodes for around $400 each at the time (not counting the SSDs).
 

NPS

For me at the time it was cost. Picking up used DL380s got me dual socket, more RAM slots (cheaper RAM at smaller sizes), hot-swap power supplies, and enterprise-class iLO, all on the cheap. Then I dropped 10c/20t CPUs into each socket (cheap, as it's older hardware). I was into enterprise-grade nodes for around $400 each at the time (not counting the SSDs).
Totally understandable, but for EPYC I always came to a different conclusion.
 

ano

Here is the setup I'm thinking about building.

Two 40G Mellanox switches with MLAG

6 servers in total, each with:
dual EPYC 7313 (Milan), 16c/32t each
512 GB RAM
1x P4800X
4x P4610
40G mlx3

Looking for at least two failures to tolerate; 3 would be a plus.
Based on the 6.4 TB drives, 3 replicas means I could get a comfortable 30-40 TB with 3 failures.
Initial data migration is about 20 TB, maybe reaching 30 TB in two years.

The systems have a total of 192 cores. The migration from standalone boxes needs about 130 cores, and there is more than enough memory to cover those.

There's a possibility I can bring up a 7th node if that'll help.

Trying to run around 10-15 VMs right now, mostly Windows, with stable and protected storage.
The P4800X is kind of pointless; apart from that, it will be very nice.
 

ano

On the Ceph backend network: if you don't have one, it's very easy to overwhelm your entire network during any type of unplanned rebalance event (an SSD dying, for instance). I don't build Ceph without one. The people who wrote it knew what they were doing.
Recommendations from the project are changing these days. Usually people just use VLANs instead of separate physical networks (on 2x100G and other high-bandwidth links).
 

ano

As for the nodes (HPE DL385?): I'd rather do a single 7763, 7443, etc. than dual 7313.

Dual EPYC 3rd gen is kind of underwhelming. We have tested dual 7402, 7313, 7443, 7763, 75F3, etc. vs. single, and single is of course very epyc! And the 7313s are very nice!

Beware that you MUST tune the BIOS and OS for performance on AMD 2nd/3rd gen, or you will lose 25-35% performance with VMs and most workloads.

Ceph REALLY likes more nodes.

The only reason for dual EPYC on 3rd gen with the 7313 would be not having enough RAM slots, depending on the motherboard etc.
 

geonap

I got an amazing deal on 7313s a year ago, and on 7443s too. I am using the 7443 servers another way.
 

ano

Well, they are superb CPUs. We use the 7313 for workloads with per-core licensing that don't need the per-core performance of a 75F3 and such. SPLA and others are much cheaper to license for VMs that way.