Is Ceph right for my needs? (Keeping an in-sync backup at a remote location...) (pve-zsync vs Ceph?)


kroem

Active Member
(This turned out to be a long thread start...sorry...)
I'm redoing my servers at home and looking to maybe redo the storage too.

Today I run ESXi on two hosts and do periodic backups to external storage, because I never really found a way to do it properly (VM backups via Unitrends).
Media storage was/is in a virtual napp-it zpool, RAID-Z2 of 8x4TB + a 200GB S3700; external storage has been 4x4TB on an HP MicroServer.
VM storage was a mix, most of it on a 500GB hw RAID10 (which died, separate thread...).

Now I'm migrating to Proxmox for several reasons - I prefer Debian, it's "free", I could run ZoL, etc. etc.

But what I'm looking at now is changing my storage too. ZFS has been very good to me, no real problems whatsoever, but I would like to see what would be smart in my situation in terms of backup and availability (and obviously labbing, testing, learning :) )

My nodes would be:

1. SM mobo, E5-1620v1, 96GB RAM (this is where the zpool is today)
2. SM mobo, E5-2640v1, 32GB RAM (On the way, Thanks @T_Minus :) )
3. SM mobo, Atom C2750, 32GB RAM (This one I planned to sell, but then I had the idea to put it remote instead)

My storage today is somewhere around 10-12TB, mostly media, and I could limit it. My issue with running ZFS on the backup node is that the pool would be almost full, and well over the recommended fill level of a zpool.

So, the real question: would Ceph be a valid direction to go for my storage needs? The way I see it, I'd then spread 4x4TB across the nodes (and SSDs for journals?), keeping them all in sync, effectively giving me a backup at the remote location (which I'd have to connect to via... VPN?).

Alternative two would be a zfs send or pve-zsync setup. I've never used zfs send/receive, but people seem to like it. But since this is media etc., I'd really like the idea of having an actual filesystem to browse on the backup, not snapshots. Is restoring from snapshots a fairly simple task?

Alternative three is some nightly rsync script, which could work. But it's not as fancy, is it?

Maybe, and probably, I'm overthinking all of this, but it's nice to learn!
 

PigLover

Moderator
Ceph could do what you want - but it's probably not the right answer. Your config is too small. Only three nodes, with one of them remote over a VPN (a high-latency link), isn't going to give you the best results.

In most cases, to do Ceph "remote" replication you want a viable Ceph cluster at both sites (3 nodes per site, better with 5 or more to get the resiliency model right). With replication of placement groups over that high-latency link you'll suffer IO delays that will drive you nucking futs.

I think you may be better off using ZFS snapshot replication (zfs send / pve-zsync).
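For reference, the manual version of that is only a couple of commands; pool, dataset and host names below are made up, and pve-zsync essentially wraps the same steps in a cron job:

Code:
# take a recursive snapshot of the dataset you want to protect
zfs snapshot -r tank/media@backup-initial

# initial full copy to the backup box over SSH (slow the first time for 10-12TB)
zfs send -R tank/media@backup-initial | ssh backupbox zfs receive -F backuppool/media

The received dataset is a normal, browsable filesystem on the target, which also covers the "I'd like to browse the backup" point.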
 

Patrick

Administrator
Staff member
I would tend to agree with @PigLover on the Ceph cluster being a bit too small. The docs say 3 nodes is OK, but things got hairy every time I built a cluster with fewer than 5 nodes.

zfs send/receive will handle replication. If you have media files you want browsable, just make an rsync target and use that.
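A hedged sketch of that rsync target, with made-up paths and a host called backupbox reachable over SSH:

Code:
# nightly mirror of the media tree to the remote box;
# --delete also removes files deleted locally, so pair it with snapshots
# on the target if you want any history
rsync -aH --delete --partial /tank/media/ backupbox:/backuppool/media/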
 

kroem

Active Member
Yarr... It sounds very reasonable, and kind of what I figured... I might try something smaller though, keeping ZFS and running Ceph too, to learn.
 

fractal

Active Member
How does GlusterFS fit into the discussion? I was half considering a three-node GlusterFS setup with two local mirrors and a remote node with geo-replication (Synchronise a GlusterFS volume to a remote site using geo replication | JamesCoyle.net).

I currently rely on rsync to keep the two copies of my NAS in sync and sneakernet (well, car-net) to update my off-site backups, but at least on paper, Gluster seems to offer a more elegant solution.

Oh, and my backup target is an HP MicroServer too. I was thinking of sticking one at a relative's house and using it for rsync backups before I ran into GlusterFS with geo-replication.
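For what it's worth, geo-replication is configured per volume, roughly like this (volume and host names are invented, and the exact options vary a bit between Gluster versions):

Code:
# on the primary cluster: distribute SSH keys and create the geo-rep session
gluster volume geo-replication mediavol backupbox::mediavol-backup create push-pem

# start asynchronous replication to the remote volume and check on it
gluster volume geo-replication mediavol backupbox::mediavol-backup start
gluster volume geo-replication mediavol backupbox::mediavol-backup status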
 

PigLover

Moderator
GlusterFS has pretty much the same issues with Geo-replication as Ceph. Basic GlusterFS installs expect LAN-grade performance between nodes (good throughput, low latency, little/no jitter, very low packet drops). Like Ceph, they offer a separate method to geo-replicate and synchronize two GlusterFS clusters - but just like the Ceph-based solutions they do not fit the use-case you describe very well.

I do get that technologies like GlusterFS and Ceph are the current "shiny object" but your use case is small and simple and these are both "heavy" solutions. You really are better off using a simple ZFS-sync as described before. Not nearly as "cool" or "fun" but it will meet your requirements at much lower cost and with a lot less headaches.

Ceph, Gluster and related tools become appropriate when the headaches they solve exceed the headaches they cause. Until you reach that point - KISS.
 

gea

Well-Known Member
The problem with backup and replication is that they are never up to date. Even with replication you have a gap between data and backup. To solve this you must replace async technologies like backup or replication with sync technologies. This can be a cluster filesystem or a simple network mirror between two storage nodes (e.g. your current filer and your backup system). Both are options if your network is fast enough.

I am currently playing with ZFS netraid, where I build a ZFS mirror or RAID-Z over iSCSI LUNs provided by Solaris COMSTAR, with a master and a slave node. It seems very promising on a 10G or faster network as a simple enough HA solution with manual or automatic service failover. Data in such a netraid is always in sync between both servers, and each can continue to provide services with the current data state if the other fails.

About my concept, see http://napp-it.de/doc/downloads/napp-it.pdf
chapter 25
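Very roughly, and only as a sketch with invented pool, zvol and address names, the building blocks on Solaris/illumos look like this (the slave exports a zvol as an iSCSI LUN via COMSTAR; the master mirrors a local disk against it):

Code:
# --- slave node: carve out a zvol and export it via COMSTAR ---
zfs create -V 4T slavepool/lun0
stmfadm create-lu /dev/zvol/rdsk/slavepool/lun0
stmfadm add-view 600144f0xxxxxxxx        # the GUID printed by create-lu
itadm create-target

# --- master node: attach the LUN and mirror it against a local disk ---
iscsiadm add discovery-address 192.168.10.2:3260
iscsiadm modify discovery --sendtargets enable
zpool create netmirror mirror c2t0d0 c0t600144F0XXXXXXXXd0   # local disk + iSCSI LUN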
 

Patrick

Administrator
Staff member
gea said:
I am currently playing with ZFS netraid, where I build a ZFS mirror or RAID-Z over iSCSI LUNs provided by Solaris COMSTAR, with a master and a slave node. It seems very promising on a 10G or faster network as a simple enough HA solution with manual or automatic service failover. Data in such a netraid is always in sync between both servers, and each can continue to provide services with the current data state if the other fails.
I briefly tried something like this. The extra latency and (generally) lower bandwidth of a WAN connection versus a LAN connection turned me off of going netraid/sync.
 

gea

Well-Known Member
From some tests, sequential netraid-1 performance is very good; latency-sensitive performance may be a little lower.
But I would not use netraid over WAN. 10G or the upcoming 40G is the key technology for pool-sync technologies like HA/netraid.
 

PigLover

Moderator
gea said:
From some tests, sequential netraid-1 performance is very good; latency-sensitive performance may be a little lower.
But I would not use netraid over WAN. 10G or the upcoming 40G is the key technology for pool-sync technologies like HA/netraid.
WRT the OP's question: he was looking for a solution for a remote site without adding a bunch of extra nodes. As I already noted, Ceph isn't that answer. GlusterFS isn't that answer. And now we have another proposal - which also does not address his question, because it depends on 10G/40G links (which I'm betting he doesn't have between his house and the remote site...).

And this brings us full circle: while it isn't "real time replication", the best solution for his use case is still probably ZFS snaps and rsync (in his specific case on Proxmox, all packaged up as "pve-zsync").
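On Proxmox that is basically one pve-zsync job per VM or dataset; the IDs, host and pool names here are invented for illustration (the tool also drops a matching cron entry):

Code:
# replicate VM 100 to a pool on the remote box, keeping the last 7 snapshots
pve-zsync create --source 100 --dest 192.168.1.50:backuppool --name vm100 --maxsnap 7 --verbose

# the source can also be a ZFS dataset instead of a VM ID
pve-zsync create --source tank/media --dest 192.168.1.50:backuppool --name media --maxsnap 7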
 

whitey

Moderator
PigLover said:
GlusterFS has pretty much the same issues with Geo-replication as Ceph. Basic GlusterFS installs expect LAN-grade performance between nodes (good throughput, low latency, little/no jitter, very low packet drops). Like Ceph, they offer a separate method to geo-replicate and synchronize two GlusterFS clusters - but just like the Ceph-based solutions they do not fit the use-case you describe very well.

I do get that technologies like GlusterFS and Ceph are the current "shiny object" but your use case is small and simple and these are both "heavy" solutions. You really are better off using a simple ZFS-sync as described before. Not nearly as "cool" or "fun" but it will meet your requirements at much lower cost and with a lot less headaches.

Ceph, Gluster and related tools become appropriate when the headaches they solve exceed the headaches they cause. Until you reach that point - KISS.
Well said good sir...well said! I won't even add any comments to this thread because the goodies have already been delivered. ZFS send/recv FTW...FreeNAS replication even easier.
 

gea

Well-Known Member
For async backups I would prefer ZFS replication over rsync, as it is more efficient. It sends a stream of just the modified blocks (no file-by-file compare needed) and is faster, especially with many small files or larger pools.

Maybe I am in a privileged position. My campus is 10G all over town, and even in our country all sites/universities are connected via 10G - larger universities with 100G.

With 10G you have the full performance of a larger ZFS pool, and within a site a netmirror is possible. With 1G connectivity you can expect the performance of a single disk, around 100MB/s; if performance is not too critical, you can still use that for sync technologies. If your WAN is slower, there is no alternative to async technologies unless you are satisfied with USB2-class performance.
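To make the "stream of modified blocks" point concrete, an incremental send ships only the blocks that changed between two snapshots (dataset and host names made up):

Code:
# nightly: take a new snapshot, then send only the delta since the previous one
zfs snapshot tank/media@2016-10-02
# -F rolls the target back to the last common snapshot before applying the delta
zfs send -i tank/media@2016-10-01 tank/media@2016-10-02 | ssh backupbox zfs receive -F backuppool/media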
 

_alex

Active Member
A bit late, but DRBD should do the trick, even over WAN ...
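(For WAN use the interesting bit is DRBD's asynchronous protocol A. A minimal sketch of a resource config - hostnames, devices and addresses invented, and option placement differs a little between DRBD 8.x and 9:)

Code:
# /etc/drbd.d/r0.res on both nodes; protocol A = asynchronous, WAN-tolerant replication
cat > /etc/drbd.d/r0.res <<'EOF'
resource r0 {
  protocol  A;
  device    /dev/drbd0;
  disk      /dev/sdb1;
  meta-disk internal;
  on nodeA { address 10.0.0.1:7789; }
  on nodeB { address 203.0.113.10:7789; }
}
EOF

# initialise metadata and bring the resource up on both nodes,
# then promote the node that holds the current data
drbdadm create-md r0
drbdadm up r0
drbdadm primary --force r0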

@gea, quite OT: I have roughly the setup you describe in your paper running with SRP on SCST targets and mdraid over two chassis; it's quite stable and currently in prod. Each chassis has 4x SSD forming a RAID10 + 2x HDD for another RAID10. For failover I use keepalived and some homebrewn shell scripts. The SAN network is dual QDR InfiniBand on two separate switches, backed by multipath.

I wonder if it would be possible to do the same with ZFS instead of mdraid - what do you think? Would ZFS assemble fast enough (<10 sec) after the master changes, so that it can be backed by multipath and VMs don't go read-only? And is there something like the write-intent bitmaps in mdraid that keep resync time low when a node comes back after a maintenance outage or failure?
 

gea

Well-Known Member
You need to do a pool import, where the import time relates to the number of disks.
As you mirror over pools, the import affects only two LUNs - very fast, even if you use a ZFS pool with many disks as the base for a LUN.

On Solaris, SMB and NFS (and zvols for iSCSI) are part of ZFS, so service management is handled by ZFS and a pool import. On Linux you must additionally switch Samba or NFS over via service management.

Resyncing (resilvering) a ZFS mirror is always very efficient, as only modified data must be resynced, not the whole mirror.

But on Solarish I have not yet solved the problem that a node failure can lead to a hanging mirror. This should be solvable with a cfgadm unconfigure, which I was not able to get working. Maybe a problem with the Solaris initiator.
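On the Linux side, the takeover step would amount to roughly this (pool, service and interface names are placeholders; keepalived or a notify script would run it when the master role moves):

Code:
# on the node taking over: force-import the pool, remount, restart file services
zpool import -f netmirror
zfs mount -a
systemctl start smbd nmbd nfs-server

# move the service IP over as well, if keepalived doesn't already do it
ip addr add 192.168.10.100/24 dev eth0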
 

_alex

Active Member
Ah, OK - I didn't get that you export one LUN with RAID-Z per storage node.
I currently do it the other way round. Each HDD of the storage nodes is exported as a single disk. Then on the master, for N disks in the storage nodes, N RAID-1s are assembled, and a RAID-0 over these forms the final target that is exported. The master can either be one of the storage nodes or an external node, with failover between them. Besides mdadm and SCST there is currently keepalived for HA, dm-multipath and also ALUA on the targets involved, which makes it a bit complex.

I would love to get this switched to ZFS mirrors with a stripe over these for the final target.
When I find some time I'll try it in a lab setup and see how far I can get it to work, first without MPIO to keep it simple.

Edit: maybe keeping mdadm for the N mirrors and just replacing the RAID-0 with a ZFS stripe could also be worth considering... as it wouldn't nest multiple zpools?
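For reference, the all-ZFS version of that layout is a single pool of mirror vdevs - ZFS stripes across them automatically, so no separate RAID-0 layer is needed. Device names below are placeholders for the LUNs exported by the two storage nodes, one pool per tier:

Code:
# each mirror pairs a LUN from storage node A with its counterpart from node B;
# ZFS stripes writes across the mirror vdevs, giving a RAID-10-style layout
zpool create santank \
  mirror lunA-ssd1 lunB-ssd1 \
  mirror lunA-ssd2 lunB-ssd2 \
  mirror lunA-ssd3 lunB-ssd3 \
  mirror lunA-ssd4 lunB-ssd4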
 

gea

Well-Known Member
ZFS on top of hardware RAID or mdraid would be a bad idea.
ZFS will detect errors that the RAID below cannot detect, but it can only report them, not repair them, since it has no redundancy of its own to repair from.

An option with a few disks may be a LUN per raw disk.
 

_alex

Active Member
Ah, so the layering needs to be sort of a zvol on top of the physical drives, like this.

Code:
    phy. HDD(s)
        |
      ZVOL                      Storage-Box(es)
 (single HDDs or RaidZ)
        |
 LUN(s) / export(s)
        |                     -------------------
      ZVOL 
 (over multiple LUNs)
        |                           Master
    LUN / export
        |                     -------------------
   Hypervisors
Looks like there is no way to prevent the 'cost' of ZFS doubling - once on the storage boxes and again on the master that exports the final storage...
 

gea

Well-Known Member
_alex said:
Ah, so the layering needs to be sort of a zvol on top of the physical drives.
A network LUN can be built on top of a zvol (a ZFS dataset treated as a block device, like a disk), but also on top of a file or a real/raw disk.

With a raw disk you use the disk directly as the base for a LUN and avoid the first ZFS layer in the chain (ZFS pool -> one zvol -> one LUN -> ZFS netraid/mirror over two nodes), at the price of a more complicated setup.
The chain is then disks -> LUNs -> ZFS netraid over all LUNs on both nodes, where the netraid must tolerate the failure of half of the LUNs.
The chain is then Disks -> LUNs -> ZFS netraid over all LUNS on both nodes where the netraid must allow a failure of half of the LUNs.