CEPH write performance pisses me off!


grogthegreat

New Member
Apr 21, 2016
Ceph is awesome since it does file, block, and object. On the other hand, if you only need block and performance is a concern, I've been happy with ScaleIO. Just three SSDs get me 4.3Gbps write and 10.9Gbps read.

 

whitey

Moderator
Jun 30, 2014
Impressive. I'm pretty happy w/ 3 OSDs and 3 SSDs behind each OSD, delivering 2Gbps read, 1Gbps write.

What type of SSDs are you using in your ScaleIO config?
 

marcoi

Well-Known Member
Apr 6, 2013
Is ScaleIO completely free? I'm assuming both of these solutions require three nodes? I'm trying to decide how to do home storage: either a single server running FreeNAS giving the other two servers NFS/iSCSI storage for both VM datastores and archival storage (i.e. media, pictures, docs, etc.), or something like ScaleIO, where each box has a few sets of drives and everything is shared around.

Maybe I'm off on my idea of what ScaleIO or Ceph is supposed to be used for?

Right now the two ESXi nodes each have their own local storage. I'm building out a third server which is limited to eight 2.5" HDDs, so I'm trying to decide whether it's best to pool all my storage under FreeNAS and then serve it out, or do something else with it. I may just stick with local storage to keep it simple.

Any thoughts appreciated.
 

PigLover

Moderator
Jan 26, 2011
@marcoi - no, ScaleIO is not "completely free". It is a commercial product that has a "free option" for small-scale deployments. That is sorta the opposite of Ceph, which is FOSS but has a paid "support" option from Red Hat/Inktank.

For the use case you describe Ceph or ScaleIO could work - but they are probably more trouble for you than value. Your initial thought of a storage server serving iSCSI/NFS to two workload platforms is a good one - and will be much easier to manage. You do still have the single point of failure in the storage server. But in practice - unless you really need 3 to 5 "9s" reliability - that is of little consequence (if it stops you just fix it, move on and be happy).

Both Ceph and ScaleIO are "scale out" solutions. They are really designed for use cases where your storage needs are quite large and likely growing. They are designed to "scale horizontally", which means you can add additional disks/servers/racks/etc. as you need them without impacting active service and with performance scaling linearly (in theory...). In a horizontal-scale environment, getting consistent and predictable performance as you grow is usually more important than getting the absolute maximum performance possible, though ScaleIO does emphasize performance while Ceph tends to emphasize flexibility and consistency of performance.

Neither solution is for the faint of heart to install and manage. To the extent they exist at all, their installation, operation and monitoring tools are - to be kind - "maturing". There is a fair amount of learning that has to take place before you really understand how to manage it well.

At the small end of the reasonable use case, for users like @Patrick, the high-availability features are very important (largely because we all start harassing him if STH stops working :)). And a side effect of the horizontal scalability design allows him to make good use of the mixed bag of disks he's collected over the years. But Ceph is only one part of the whole picture of how he achieves high availability.

Of course - if you want to learn the "scale out" approaches or test them and understand them and play with them - have at it. Join me, @whitey, @grogthegreat, @Patrick and many others. There is lots to learn and share - and both Ceph and ScaleIO have community involvement so you might even be able to contribute.
 

marcoi

Well-Known Member
Apr 6, 2013
@PigLover thanks for the clarification. I don't build out datacenters for a living, so my use case for learning all of this new tech and software is personal pleasure and keeping up with a broad scope of IT knowledge. It tends to help with other work-related IT stuff I do.

With that said, both of these technologies, as interesting as they are, would be way overkill for home use. I might play with them down the road, but for now I want to just get home stuff running with a simple solution that works and is easy to support.
 

Marsh

Moderator
May 12, 2013
@marcoi

You don't need to wait to play with the new stuff.
Using your ESXi boxes, you have the opportunity to create dozens of nested virtual labs:
vSphere vSAN cluster, Proxmox cluster, Ceph cluster, ScaleIO cluster, S2D cluster, Cisco network lab, etc.
If you want speed, add more RAM and a PCIe SSD; a 10Gbit virtual network and switch are built into vSphere.
 

whitey

Moderator
Jun 30, 2014
@PigLover hit it on the head; nothing to add here other than saying well done, sir! My vote, having seen a few of your requests: stick w/ either a dedicated NAS box (I won't warp your mind here) or an all-in-one if you want a lil' more utility out of the setup.
 

PigLover

Moderator
Jan 26, 2011
OK - so confession time. I have more toys available than I need and run a 7 node (soon growing to 9) Proxmox/Ceph hyperconverged cluster for testing and playing and stuff...but for the things I care about I keep it really simple.

FreeNAS storage host stuffed with 4TB Seagate portable "cracked" drives for volume, a few SSDs in a pool for speed, and one small Proxmox host for VMs. Another FreeNAS host for local backups. And stuff I really care about backed up to the Cloud. This part of my network I never need to worry about - it just runs and runs and runs.

And @Marsh is spot-on - you can do all the playing/learning part on VMs built inside the two ESXi nodes. Without spending any more money (to make the wifey happy).
 

grogthegreat

New Member
Apr 21, 2016
@whitey If you didn't see it in the picture, the SSDs are PM853T 960GB SATA drives. Found the amazing eBay deal for them on this site, in fact.

@marcoi Since no one directly answered the minimum-node question: yes, ScaleIO does need at least three nodes. I've had both three and four nodes in my cluster and find that four nodes is much better, since you get a higher usable percentage out of your raw total storage. Plus I don't like to be at the limits of anything. That benchmark is with only three nodes, each with only one SSD.
While PigLover is correct that it has a free option, the great thing is that the free option is identical to the paid version, with no limits other than no support and not being allowed to use it in production.

I'm not the target audience, but I do use ScaleIO at home to run my home lab as well as storage for movies, TV shows, music, etc. I like that I don't need to do anything if a hard drive or an entire server dies. The data gets re-replicated right away instead of waiting for me to replace hardware. I can add hard drives or servers as my needs grow and the data gets rebalanced onto the new hardware. After researching Ceph and giving up, I found ScaleIO easy to install and manage. There are downsides to software that isn't FOSS: Dell (after taking over EMC) could decide to abandon the software or stop releasing the unlimited, free-for-personal-use version. That risk is worth the benefits to me, but it might not be for you.
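For a rough sense of why four nodes give a better usable percentage, here is a minimal sketch. It assumes ScaleIO's two-copy mirroring and one node's worth of raw capacity held back as rebuild spare; both are my assumptions, so check the sizing guide rather than taking these numbers as gospel.

Code:
# Hedged back-of-the-envelope: usable fraction of raw capacity in a
# two-copy mirrored pool with one node of raw space reserved as spare.
# These assumptions are mine, not grogthegreat's measured figures.
def usable_fraction(nodes: int, node_raw_tb: float) -> float:
    raw = nodes * node_raw_tb
    spare = node_raw_tb              # rebuild headroom: one node's worth
    return (raw - spare) / 2 / raw   # two copies halve whatever is left

for n in (3, 4, 5):
    print(f"{n} nodes: ~{usable_fraction(n, 10.0):.0%} of raw usable")
# -> 3 nodes: ~33%, 4 nodes: ~38%, 5 nodes: ~40%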
 


marcoi

Well-Known Member
Apr 6, 2013
I see ScaleIO has a VM version. Has anyone tried that?

Here is my idea, dunno if it's good.
Create a ScaleIO VM on each of three separate ESXi nodes. Each will have one RAID card in pass-through, or, if that doesn't matter, I'll use a local datastore with the same amount of HDD space assigned, or one 250GB SSD in pass-through since I have four of the same model. Each will be assigned to a new vSwitch with an iSCSI VMkernel port and two physical 1GbE NICs (each node has at least 4x 1GbE NICs). My idea is to network the two NICs of each host to allow cross-host access for replicating the data, while the internal switch provides access to the local ScaleIO VM.

Not sure if that is possible? Would it be worth testing?
 

grogthegreat

New Member
Apr 21, 2016
What you described is called 'hyperconverged' if you like tech lingo. ScaleIO is neat in that it can run hyperconverged (storage and hypervisor in the same box), or not.
Not only does ScaleIO support it, it even has a vCenter web client add-on that will deploy that config for you: creating the VMs, setting up the networking, and attaching the disks. I had the software deploy the ScaleIO VMs to 32GB SATADOMs. The disks that you choose get directly attached to the VMs as part of the install process. ScaleIO doesn't like RAID cards, so use an HBA or plain SATA ports. Changing ScaleIO networking after it is set up is a pain, so I kept it simple and gave the ScaleIO VMs a single IP address and virtual NIC used for ScaleIO VM-to-VM communication as well as ScaleIO VM-to-ESXi communication. Since you already have SSDs and will probably get more in the future, I strongly recommend getting a 10Gb switch. I'm using a Quanta LB6M switch and ConnectX-2 single-port NICs, which kept the cost low. Using VLANs lets me have as many subnets as I could want, all on the single NIC.

The free ScaleIO download includes getting-started, user, and deployment guides. Read through them and you'll get a good understanding of how the software works and whether it matches what you are trying to do.
 

tomtom13

New Member
Aug 21, 2017
Hi,

Just wanted to chip in to straighten out one misconception people have about Ceph write performance sucking big time.
Most people will tell you that Ceph slows down 3x due to triple redundancy ... which is just a lie, and attribute a 2x slowdown to the journal ... which is again not entirely true.

So I actually described this situation elsewhere before but I'll just copy one bit here because I don't want to write it again.

When using Ceph, people stick with the defaults and use XFS for the OSD partitions, which incurs a penalty on writes ... the XFS guys did a great job fixing the metadata-bound bottleneck, but the FS by design will still suffer some slowdown. Now, I'm not knocking XFS in any shape or form; it's an extremely mature FS where you can seriously trust your data ... unfortunately, a FS with that level of consistency will always carry a penalty. XFS really shines when serialising large accesses from more than 8 different processes - where everything else more or less sucks - but in Ceph you get a single process accessing each partition/FS, hence the slowdown.

The second slowdown comes from the fact that Ceph does not trust the FS with journalling and keeps its own journal, while still NOT disabling the FS journalling. This behaviour results in the following:
Data enters the OSD
-> write to the XFS journal of the journal partition
-> commit the journal entry to disk on the journal partition
-> trim the XFS journal of the journal partition
-> write the data to the XFS journal of the data partition
-> commit the data to the data partition
-> trim the XFS journal of the data partition
-> write to the XFS journal of the journal partition that the buffered data from the journal partition can be deleted
-> commit that to disk on the journal partition
-> trim the XFS journal of the journal partition.
Now, one could claim that XFS has a trick for this called journal checkpointing (a trick stolen from ext3), but since Ceph requires atomicity that setting is overridden and XFS is forced to commit on almost every single write (a major slowdown).
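To make the amplification concrete, here is the step list above turned into a quick tally. It is only the argument restated in code, not a measured trace, and whether a journal trim actually touches the media depends on the device.

Code:
# Tally of the per-write touches described above (FileStore journal on one
# partition, object data on another, both on XFS).  Illustrative only.
from collections import Counter

steps = [
    ("journal partition", "XFS journal write for the Ceph journal entry"),
    ("journal partition", "commit the Ceph journal entry to disk"),
    ("journal partition", "trim the XFS journal"),
    ("data partition",    "XFS journal write for the object data"),
    ("data partition",    "commit the object data to disk"),
    ("data partition",    "trim the XFS journal"),
    ("journal partition", "XFS journal write marking the entry reclaimable"),
    ("journal partition", "commit that bookkeeping to disk"),
    ("journal partition", "trim the XFS journal"),
]

touches = Counter(device for device, _ in steps)
print(touches)
# -> journal partition touched 6 times, data partition 3 times,
#    all for a single logical write entering the OSD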

In most scenarios a mere mortal will not see this problem: if you drop 10GB into Ceph with 100GB of combined journal space, Ceph will just consume journal space and the write problem will not become apparent. But if you drop 3TB of data into the same Ceph cluster, you will see all the problems first hand and experience a "seek the disks to death" scenario that you can't stop or postpone.

On top of this there is an issue with how objects are actually stored within the XFS file system. There is a video on YouTube that explains it in great detail (something with BlueStore in the title). To cut a long story short, all objects live in folders ... but to reach them quicker they are grouped into folders of roughly 50-200 objects each. If a folder crosses the object (file) limit, it gets split into subfolders with the objects spread among them ... this is very metadata heavy and follows the same path of being written to the same disk six times.
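If I remember the FileStore tunables correctly, the split point is governed by filestore_merge_threshold and filestore_split_multiple, with a directory splitting once it exceeds roughly split_multiple * |merge_threshold| * 16 files. Treat the formula and the defaults below as my recollection, not gospel, and verify them against the docs for your Ceph version.

Code:
# Assumed FileStore directory-split threshold (my recollection of the
# documented formula; verify before relying on it):
#   split when files_in_dir > filestore_split_multiple
#                             * abs(filestore_merge_threshold) * 16
filestore_merge_threshold = 10   # assumed default
filestore_split_multiple = 2     # assumed default

split_at = filestore_split_multiple * abs(filestore_merge_threshold) * 16
print(f"directories split beyond ~{split_at} objects")  # ~320 with these values

Raising those values delays the metadata-heavy split storms described above, at the cost of larger directories.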

BlueStore solves most if not all of those issues, and in my experience the difference is night and day.

Also, in terms of benchmarking:
- if you want to know the absolute speed of your Ceph deployment, you need to keep throwing data at it until you starve Ceph of journal space (rough sketch after this list).
- if you want to know real-world performance, you should measure only the journal-backed part, since you should size the journal for your typical deployment requirements.
- if you want the best R/RW performance from Ceph, switch to BlueStore ... it will not cure cancer, but it may save a stray cat :)
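For the "absolute speed" test in the first bullet, something along these lines is one way to keep pushing writes until the journal is exhausted. It just shells out to rados bench; the pool name, runtime, block size, and thread count are placeholders you would adjust for your own cluster.

Code:
# Rough benchmarking sketch: run a long rados write benchmark so the journal
# fills and you see steady-state (journal-starved) throughput, then clean up.
import subprocess

POOL = "bench-pool"   # hypothetical pool created just for testing

subprocess.run(
    ["rados", "bench", "-p", POOL, "600", "write",
     "-b", "4194304",          # 4 MiB objects
     "-t", "16",               # 16 concurrent writers
     "--no-cleanup"],          # keep objects in case you want a read bench next
    check=True,
)

# Remove the benchmark objects afterwards.
subprocess.run(["rados", "-p", POOL, "cleanup"], check=True)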

(Note to the original poster: due to the seek multiplication from layering a POSIX FS journal on top of Ceph's own, you are killing the performance of those SSDs; they will perform FAR better on BlueStore.)
 