GODDAMN I love it when a plan comes together!
This is gonna be long/ugly...fair warning. :-D
Got the itch to finally peel back the layers of the CEPH onion...pretty darned impressed so far and I KNOW I am JUST scratching the surface.
Started w/ a LOT of RTFM, still have a world of this to do. Since you all know I am a glutton for punishment and pay the mortgage on a Linux skillset, I figured I HAD to take the 'from scratch' route and started 'experimenting' w/ a CEPH setup consisting of 1 ceph-admin (headend) VM, 3 mon VMs, and 3 OSD VMs, with the intention of going to dedicated VT-d setups if this all worked out. Deployed 7 RHEL7 boxes via Foreman and 15 mins later I was cooking w/ gasoline.
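For anyone wanting to follow along, a Jewel-era bring-up with ceph-deploy from the admin node looks roughly like this (hostnames are placeholders for my lab layout, not gospel; OSD prep with the journal SSD is further down):
Code:
# run from the ceph-admin node; hostnames are placeholders
ceph-deploy new ceph-mon01 ceph-mon02 ceph-mon03
ceph-deploy install ceph-admin ceph-mon01 ceph-mon02 ceph-mon03 ceph-osd01 ceph-osd02 ceph-osd03
ceph-deploy mon create-initial
ceph-deploy admin ceph-admin ceph-mon01 ceph-mon02 ceph-mon03 ceph-osd01 ceph-osd02 ceph-osd03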
Flash forward, got CEPH Jewel up on that VM config above and learned quite a bit. Still need to dig deeper under the CephFS hood, but for now RADOS Block Device is 'DA BOMB' :-D
Phase 2: AKA, Let's get 'froggy'!
Blew up my original CEPH cluster (HAHAH I know, lol, some proficient Linux guy, right). I had snapshots and could have beaten myself over the head for recovery, but it was a 'sandbox' playground. What did me in was trying to add three more VT-d CEPH-configured OSDs to the original 3-node VM OSD cluster/tree map and BOOM. I did have a different number of disks in the virtual vs. physical nodes, but I digress.
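(For reference, the 'cluster/tree map' I keep mentioning is just what 'ceph osd tree' prints; these are the usual commands for keeping an eye on things while OSDs come and go:)
Code:
ceph osd tree   # hosts/OSDs and their CRUSH weights
ceph -s         # overall health and PG states while it rebalances
ceph -w         # watch the cluster log live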
Phase 3: OK I need an ACTUAL 'applicable use case'
As a lot of you know, I do most of my storage protection strategy between two different VT-d AIO FreeNAS boxes...a periodic snapshot/replication schedule is a blessing, no surprise there, right. But I wanted to dig deeper into the underpinnings of this distributed/replicated scale-out storage...sure, a FreeNAS box is great, but you are at the mercy of a single node failure minus HA (which not a lot of us have, or have had much luck implementing on ZFS solutions). CEPH makes this interesting to me much in the way that vSAN does: the pool/CRUSH map I think I have set to 2 copies, so I can effectively lose one node...scale OSDs/MONs to your heart's content to increase fault-tolerant/resilient domains across nodes, or increase the PAXOS MON quorum/slave/master/CEPH black magic as far as I am concerned until I do more RTFM...I need to do more research here...much to learn.
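The '2 copies' bit is just a pool-level replication setting; assuming the default 'rbd' pool (which is what my block devices live in later on), the knobs look like this:
Code:
ceph osd pool get rbd size         # current replica count
ceph osd pool set rbd size 2       # keep 2 copies of every object
ceph osd pool set rbd min_size 1   # keep serving I/O with only 1 copy left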
So I have those 3 VT-d CEPH OSD nodes now, each with one 200GB HUSSL SSD for the journal and a 450GB 10K Hitachi for the XFS-formatted data disk...one thought that comes to mind is I would sure like to see cache/journal stats; I saw a 'ceph osd perf' cmd that looked like it rolls up some latency stats per OSD.
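For the journal-on-SSD layout, ceph-deploy takes host:data-disk:journal-disk (device names below are placeholders), and 'ceph osd perf' is where the per-OSD latency rolls up:
Code:
# data disk on the 10K Hitachi, journal on the HUSSL SSD (placeholder device names)
ceph-deploy osd create ceph-osd01:/dev/sdb:/dev/sdc

# per-OSD filestore commit/apply latency in ms
ceph osd perf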
Back to the topic: so far I have set up Veeam Backup & Replication 9.5 and hooked it to what I will call my 'CEPH gateway' for now. It's really just a client host mapping RBD devices and formatting them XFS, with the gateway added to Veeam as a Linux backup repo over ssh, so the backups effectively land in the CEPH pool. Backups run at about 90MB/s and restores at 250MB/s or so...not too shabby. I smile because I was on the FreeNAS IRC the other week and saw some sysadmin complaining of Veeam restores pushing a measly 6-9MB/s and taking forever...I only have 6 disks but they seem performant enough.
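The 'CEPH gateway' plumbing is nothing exotic, just the standard kernel RBD client dance on the RHEL7 box; the size here is made up, the names match what shows up later:
Code:
# older kernel RBD clients only grok the layering feature, hence the flag
rbd create veeam --pool rbd --size 51200 --image-feature layering
rbd map veeam --pool rbd          # shows up as /dev/rbd0
mkfs.xfs /dev/rbd0
mkdir -p /mnt/veeam
mount /dev/rbd0 /mnt/veeam
# then point the Veeam backup repository at /mnt/veeam on the gateway host over ssh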
Still following? :-D
Next up I had to bite off what I think/hope CephFS addresses: file access, i.e. POSIX-aware utilities/FS support for NFS/SMB. More research needed. Figured a brilliant way around it though, and it is working like a champ.
Installed targetcli (the LIO framework) on the CEPH gateway RHEL7 box that was already hosting the Veeam backup repo via /dev/rbd0, so I just created another RBD dev with 'rbd create'/'rbd map' and 'Whitey's your uncle' heh :-D
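Same dance for the iSCSI backing device; the image name matches the resize example further down, the size is just a for-instance:
Code:
rbd create iscsivol01 --pool rbd --size 51200 --image-feature layering
rbd map iscsivol01 --pool rbd     # next free device, so /dev/rbd1 here (rbd0 is the Veeam repo)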
Configured targetcli and vSphere, BOOM (a lil bit of struggle here but no biggie). Deployed a Foreman-provisioned Ubuntu 16.04 box to the new CEPH iSCSI vSphere VMFS-mounted storage. AWESOME!
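The targetcli side boils down to roughly this; the target IQN and the ESXi initiator IQN below are placeholders, swap in your own:
Code:
targetcli /backstores/block create name=iscsivol01 dev=/dev/rbd1
targetcli /iscsi create iqn.2017-01.lab.local:ceph-gw
targetcli /iscsi/iqn.2017-01.lab.local:ceph-gw/tpg1/luns create /backstores/block/iscsivol01
# placeholder ESXi software iSCSI initiator IQN
targetcli /iscsi/iqn.2017-01.lab.local:ceph-gw/tpg1/acls create iqn.1998-01.com.vmware:esxi01-12345678
targetcli saveconfig
systemctl enable target   # restore the LIO config at boot
# vSphere side: add the gateway IP under dynamic discovery on the software iSCSI adapter, rescan, create the VMFS datastore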
As a bonus, check this out: HOT CEPH filesystem resize, for both Linux (RBD/XFS) and vSphere (iSCSI/VMFS).
Code:
# Linux (RBD + XFS)
rbd --pool=rbd --size=102400 resize veeam
xfs_growfs /mnt/veeam

# vSphere (RBD + iSCSI + VMFS)
rbd --pool=rbd --size=102400 resize iscsivol01
# then use the vSphere web client to resize the device/datastore HOT/live
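One footnote on the vSphere half: if the new size doesn't show up right away, a rescan on the ESXi side should nudge it along before you grow the datastore in the web client:
Code:
esxcli storage core adapter rescan --all   # pick up the new LUN capacity
vmkfstools -V                              # refresh VMFS volume metadata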
Will post a slew of good pics tomorrow; I need to make sure I captured the end-to-end. I got dang good notes but need to shore up some documentation.
Sorry for the novel, had to decompress somewhere. Pretty stoked though, I'm sure my dreams will be crushed 'somewhere' along my battle-scarred journey!