Shared Multi-Host Storage for Docker and Data Volumes


Kevin Maschke

New Member
Jul 11, 2017
Hi!

I'm trying to come up with a design for an initially small-to-medium infrastructure that uses Docker and shared multi-host storage, but I'm not entirely sure which option would suit us best or be the most feasible...

I apologise if this is not the right forum for this thread and it would be better placed in the Docker forum...

The idea is to set up a Docker infrastructure with access to shared storage on which all images, data and shared configuration files are stored.


Initially we would have 4 hosts in two different locations/DCs (two hosts in each DC), and maybe in the future we'd add more. All 4 hosts, and all containers on them, need access to the same storage (using data volumes).

This storage is to hold the following (a container mount sketch follows the list):
  • Docker images. The idea is to have one image on the storage, and be able to deploy multiple containers from the same image, avoiding the need to create the image multiple times.
  • Common/Static config files. The idea is that for some apps and containers running the same app, we have one set of configuration files on the storage, and all apps on the containers read the configuration from there. This way if we need to do a change to the configuration, we can change it just once to affect all apps on all containers.
  • Application data. The idea is that data generated by/within the apps is also stored on the shared storage, and accessible by other containers running the same app.
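To make the consumption side concrete, here is a rough sketch (with placeholder paths and image names, assuming the shared storage is already mounted on each host under /mnt/shared) of how a container on any host would consume it via data volumes:

Code:
# Shared config mounted read-only, app data read-write; all paths are hypothetical
docker run -d --name myapp-1 \
  -v /mnt/shared/config/myapp:/etc/myapp:ro \
  -v /mnt/shared/data/myapp:/var/lib/myapp \
  myorg/myapp:latest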
We've thought of two different ways of doing this:

Using a High Availability (HA) NFS server.
The idea would be to have either A) one NFS server in each DC, making one the primary and replicating it to the other, or B) two NFS servers in each DC, where Server A is replicated to Server B within DC1, and this is then replicated to DC2.

HANFS Option A

HANFS Option B
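For the NFS variants, a minimal sketch of how each Docker host could consume the export as a named volume, assuming the HA pair sits behind a floating IP/hostname such as nfs.internal (names and paths are placeholders, not a final design):

Code:
# Named volume backed by the (hypothetical) HA NFS export
docker volume create --driver local \
  --opt type=nfs \
  --opt o=addr=nfs.internal,rw \
  --opt device=:/export/docker \
  shared_docker

# Any container on any host then sees the same data
docker run -d -v shared_docker:/data myorg/myapp:latest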

Using Ceph
The idea would be to either A) keep the Docker hosts set up as described initially, add another set of servers across the DCs to run Ceph, and have the Docker hosts and containers mount that storage, or B) configure the Docker hosts to also run Ceph on a separate disk/partition.

Ceph Option A


Ceph Option B is the one that seems best to us.

Ceph Option B

The storage would be set up in the following way:

Ceph Option B Storage Design
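In case it helps to picture it, a rough sketch of how the Docker hosts could consume that Ceph storage via CephFS, assuming a healthy cluster with monitors mon1-mon3, an MDS for CephFS and a client key for a 'docker' user (all names are placeholders; an RBD volume plugin would be the alternative if per-container block devices are preferred):

Code:
# Mount CephFS once per Docker host (hypothetical monitors and key file)
sudo mkdir -p /mnt/cephfs
sudo mount -t ceph mon1:6789,mon2:6789,mon3:6789:/ /mnt/cephfs \
  -o name=docker,secretfile=/etc/ceph/docker.secret

# Containers then bind-mount subdirectories of the shared filesystem
docker run -d -v /mnt/cephfs/myapp/data:/var/lib/myapp myorg/myapp:latest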


What do you guys think? Do you believe this is a viable option? Any better idea? Has anyone tried/done something like this?

Any comment/suggestion will be appreciated!

Thanks,
Kevin.
 

Evan

Well-Known Member
Jan 6, 2016
How is the link between the DCs in terms of latency and throughput?
This type of setup can be difficult even in 2 DCs connected with private dark fiber; over the internet with VPNs etc. I am not sure you will win with Ceph, even if it looks like the best solution in many ways.
 

Kevin Maschke

New Member
Jul 11, 2017
How is the link between the DCs in terms of latency and throughput?
This type of setup can be difficult even in 2 DCs connected with private dark fiber; over the internet with VPNs etc. I am not sure you will win with Ceph, even if it looks like the best solution in many ways.
Thanks for your response Evan!

The link between the DCs is... let's say "normal". It's not a 10 Gbps link, hell, not even 1 Gbps. That is one of the things I had in mind too when I started looking into this. In our case it would be a matter of setting it up and testing it to see if it is usable, or whether we have to think about either improving connectivity or changing the plan.

I am also not 100% sure yet that Ceph is the best option, but it does look like the best one compared to the others we've looked at. Do you have any other recommendation or idea?

Regards,
Kevin.
 

_alex

Active Member
Jan 28, 2016
Bavaria / Germany
Isn't this exactly where DRBD shines, with Option A or B?
But if the link is 'not even 1 Gbps' you will surely get massive problems with whatever option you choose.
As this is obviously WAN/routed through the public net, you can expect that the link is not always reliable.
There might be no route / a wrong route between the DCs for just a few hundred ms or a few seconds, for any number of reasons.

In the worst case this might end with a full resync or, even worse, split-brain (at least with DRBD; I don't know how Ceph would handle this). So imho there should be something like a third quorum system, independent of the route between the two DCs. Could be public DNS or whatever, just away from the interconnect.

I have been thinking about a similar setup too, but gave it up after quite some time thinking about it.
Imho, dark fiber between two DCs is the only way to get a reliable sync. Doing this over public WAN (even with VPN or GRE, whatever) will always cause serious trouble, and it's not a question of if but only of when, and how serious the impact will be.

If there is a reliable, fault-resilient solution to mirror/sync storage in realtime between two DCs over a 1 Gbps WAN, please let me know ;)

Alex
 

Evan

Well-Known Member
Jan 6, 2016
Hi,

Just for fun background info: our versions of "normal" differ ;)
My normal for a DC cluster pair 10-20 km away from each other is 200G+ of doubly protected dark fiber, and between DC pairs, say Europe to Asia, multiple gigabits of private MPLS network :)

So it's probably not surprising that I can't really say from experience what may work in your low-bandwidth situation. As for Ceph, I really don't think it will like your bandwidth and latency at all, but you can test, or others may have played with Ceph in this config; I have never touched it on less than 10G networks and less than 1 ms latency between sites.

What I can offer as an idea: a while back we did some tests using DRBD and Pacemaker. You can create different replication resources so that some data replicates each way, sort of active/active, sit NFS on top, and use KVM or containers to then run your Docker instances, or maybe just have the Docker instances work directly on DRBD/Pacemaker/NFS.
At the time we tested, Docker was not in the picture, but it looks like a good fit for you; just Google and you will find some setup examples for a DRBD & Pacemaker HA cluster. This will certainly work on lower-bandwidth configs.
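To illustrate the DRBD part of that idea, a minimal sketch of an asynchronous resource (protocol A, which copes with WAN latency better than the default synchronous protocol C) between one host in each DC; hostnames, IPs and the backing disk are placeholders, and the Pacemaker/NFS layering on top is left out:

Code:
# /etc/drbd.d/r0.res, identical on both nodes (hypothetical hosts dc1-host / dc2-host)
cat > /etc/drbd.d/r0.res <<'EOF'
resource r0 {
  net {
    protocol A;          # asynchronous replication, better suited to a slow WAN link
  }
  device    /dev/drbd0;
  disk      /dev/sdb1;   # dedicated backing partition on each host
  meta-disk internal;
  on dc1-host { address 10.0.1.10:7789; }
  on dc2-host { address 10.0.2.10:7789; }
}
EOF

drbdadm create-md r0   # initialise metadata
drbdadm up r0          # bring the resource up, then promote one side and put NFS on it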

Evan

I am assuming you're looking for a free/open-source solution, right, and not a big $$$ solution? If the latter, then I could give a few other hints on what may work with lower bandwidth.
 

Evan

Well-Known Member
Jan 6, 2016
@_alex jumped in while I was typing with a similar/same suggestion, but to answer the question about mirroring/syncing storage in realtime over less than 1G... not absolute realtime easily, but closer to realtime there are some options from storage vendors that do a pretty good job; it also depends on how many TB you need to keep synced ;)

100 TB+ over 100 Mbit MPLS or P2P connections at 100 ms latency is certainly possible out of the box from, say, the IBM storage systems. (Not internet quality though; it needs to be a stable connection.)

I suspect the vSAN local sync -> slightly delayed remote sync solution in the latest release could also work. But again, there are some dollars involved.
 

_alex

Active Member
Jan 28, 2016
Bavaria / Germany
Well, I have something similar within a rack/DC, where I use SRP over multiple paths with ALUA, mdraid with internal bitmaps between mirrors where each disk is on one of the two hosts, keepalived and some shell/bash scripts. This works fine and is quite fast, but that is with 2x 40Gb IB in each host, over two switches.

In theory, mdraid with bitmaps and iSCSI instead of SRP 'could work' over WAN, but I just wouldn't want to try it :D
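For what it's worth, the iSCSI flavour of that idea would look roughly like this; the target name, portal and device names are made up, and marking the remote leg write-mostly (so reads stay on the local disk) is my own addition:

Code:
# Attach the remote host's disk over iSCSI (hypothetical target/portal)
iscsiadm -m node -T iqn.2017-07.example:remote-disk -p 10.0.2.10 --login

# RAID1 with a write-intent bitmap: local /dev/sdb mirrored to the iSCSI LUN /dev/sdc
mdadm --create /dev/md0 --level=1 --raid-devices=2 \
  --bitmap=internal /dev/sdb --write-mostly /dev/sdc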
 

PigLover

Moderator
Jan 26, 2011
I think that your optimal design may depend on requirements you haven't stated. For example, what are your performance expectations (IOPS)? What level of consistency between the two sites do you require? Can you tolerate latency in achieving that consistency (i.e., do you require 'always consistent' or can you accept 'eventually consistent')?

I think your option B Ceph design could be functionally quite nice, but if you require high IOPS and can't tolerate delay in achieving consistency then you would likely be disappointed.

Also remember to consider failure modes. Two-site distributed designs are at risk of unresolvable split-brain faults. You need clear business rules about how to put things back together afterward.

Sent from my SM-G950U using Tapatalk
 

nitrobass24

Moderator
Dec 26, 2010
TX
@Kevin Maschke - First off, welcome to STH! I am super excited you posted about this topic, because when I was doing research on this about a year ago I did not find much in the way of content.

@PigLover is on point with his questions as ultimately your business requirements will dictate your functional design.
 

_alex

Active Member
Jan 28, 2016
Bavaria / Germany
I would do everything to keep this as simple as possible.

In general, with the two Ceph options, I wonder
- how rebalancing over the small bandwidth will affect stability; that link will suffer and there might be congestion
- how performance will be with only 4 OSDs (if I got this right)
- about split-brain (also an issue with NFS)

Here is an 'idea', assuming the goal is not to set up distributed storage across data centers, but to have the containers running in two DCs working with the same data:

  • Docker images. The idea is to have one image on the storage, and be able to deploy multiple containers from the same image, avoiding the need to create the image multiple times.
These won't change often; periodic ZFS send/recv, rsync, lsyncd, csync2 and others might work well.
Also, putting them into some public cloud and syncing them back to a local filesystem for fast reads could be a way to handle this.
Implementing some sort of 'locking' while data is synced might be necessary to avoid accessing old/incomplete data.
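As a trivial example of the periodic-sync approach (paths and hostname are hypothetical; lsyncd would do the same thing continuously, driven by inotify):

Code:
# Cron job on the 'primary' DC: one-way sync of the images/config share to the other DC
rsync -az --delete /srv/shared/images/ dc2-host:/srv/shared/images/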

  • Common/Static config files. The idea is that for some apps and containers running the same app, we have one set of configuration files on the storage, and all apps on the containers read the configuration from there. This way if we need to do a change to the configuration, we can change it just once to affect all apps on all containers.
See above, basically the same.

  • Application data. The idea is that data generated by/within the apps is also stored on the shared storage, and accessible by other containers running the same app.
Here imho it starts to get interesting.

Do all instances of the app need to write to this data?
Or is there one instance that generates the data and the other instances only need to read?
Do the apps require sync writes?

I think for something like, let's say, Apache serving static files, this could be handled quite easily, just the same way as the images/configuration files, with some drawbacks on consistency.

For databases, clustering at the application level over WAN (e.g. Galera) could be easier and more reliable.
In general, when there is a way to handle this at the application layer, I'd just do it there.
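As a pointer in that direction, the Galera-specific part of a MariaDB config might look something like this (node names are placeholders and the provider path varies by distro); each DC's app instances would then simply talk to their local DB node:

Code:
# Hypothetical /etc/mysql/conf.d/galera.cnf, same idea on every node
cat > /etc/mysql/conf.d/galera.cnf <<'EOF'
[mysqld]
wsrep_on                 = ON
wsrep_provider           = /usr/lib/galera/libgalera_smm.so
wsrep_cluster_name       = app_cluster
wsrep_cluster_address    = gcomm://db-dc1-a,db-dc1-b,db-dc2-a
binlog_format            = ROW
default_storage_engine   = InnoDB
innodb_autoinc_lock_mode = 2
EOF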

Otherwise, my feeling is that a pair of DRBD volumes in active/passive, one active in each DC for the 'local instances', is maybe best suited.

Alex
 

_alex

Active Member
Jan 28, 2016
Bavaria / Germany
but to answer the question about mirroring/syncing storage in realtime over less than 1G... not absolute realtime easily, but closer to realtime there are some options from storage vendors that do a pretty good job; it also depends on how many TB you need to keep synced ;)

100 TB+ over 100 Mbit MPLS or P2P connections at 100 ms latency is certainly possible out of the box from, say, the IBM storage systems. (Not internet quality though; it needs to be a stable connection.)

I suspect the vSAN local sync -> slightly delayed remote sync solution in the latest release could also work. But again, there are some dollars involved.
I'm generally interested in solutions in that field, especially real-world experiences of what works and what doesn't. And also in the amount of $$$ needed for entry-level solutions including support/maintenance; most vendors don't publish any prices, which is the point where I usually stop looking at the solution altogether.