Hi,
Just wanted to chip in and straighten out one misconception people have about Ceph write performance sucking big time.
Most people will tell you that Ceph slows down 3x due to triple redundancy ... which is simply not true, and attribute a 2x slowdown to the journal ... which is again not entirely accurate.
I've actually described this situation elsewhere before, but I'll copy one bit here because I don't want to write it all again.
So, when using Ceph, most people stick with the defaults and use XFS for the OSD partitions, which incurs a write penalty. The XFS guys did a great job fixing the metadata-bound bottleneck, but the FS by design will still suffer some slowdown. Now, I'm not knocking XFS in any shape or form; it's an extremely mature fs where you can seriously trust your data ... unfortunately, having an fs with that level of consistency will always carry a penalty. XFS really shines when serialising large accesses from more than 8 concurrent processes, where everything else more or less sucks, but in Ceph you get single-process access per partition/fs, hence the slowdown.
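To see what I mean, here's a rough probe you can run yourself. It's just a sketch, not a rigorous benchmark; the /mnt/xfs-test mount point and the sizes are my assumptions, adjust to taste:

```python
# Rough concurrency probe: time N processes each writing a large file
# to the same filesystem, and compare aggregate throughput.
import os
import time
import multiprocessing

TEST_DIR = "/mnt/xfs-test"          # hypothetical scratch dir on the fs under test
CHUNK = b"\0" * (4 * 1024 * 1024)   # 4 MiB per write
CHUNKS_PER_WORKER = 256             # ~1 GiB per worker

def writer(idx: int) -> None:
    path = os.path.join(TEST_DIR, f"worker-{idx}.dat")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    try:
        for _ in range(CHUNKS_PER_WORKER):
            os.write(fd, CHUNK)
        os.fsync(fd)                # make sure the data actually hits the disk
    finally:
        os.close(fd)

if __name__ == "__main__":
    os.makedirs(TEST_DIR, exist_ok=True)
    for nproc in (1, 4, 8, 16):
        start = time.monotonic()
        with multiprocessing.Pool(nproc) as pool:
            pool.map(writer, range(nproc))
        elapsed = time.monotonic() - start
        total_mib = nproc * CHUNKS_PER_WORKER * 4
        print(f"{nproc:2d} writers: {total_mib / elapsed:8.1f} MiB/s aggregate")
```

On a setup like the one I'm describing you'd expect the per-writer numbers at 8+ processes to hold up much better on XFS than on most alternatives, while the single-writer case (which is what each OSD actually looks like) gains nothing.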
The second slowdown comes from the fact that Ceph (FileStore) does not trust the FS with journalling and keeps its own journal, while still NOT disabling the fs journalling. This behaviour results in the following write path:
1. data enters the OSD
2. write to the XFS journal of the journal partition
3. commit the journal to disk on the journal partition
4. trim the journal of the journal partition
5. write the data to the journal of the data partition
6. commit the journal to the data partition
7. trim the journal of the data partition
8. write to the journal of the journal partition that the buffered data from the journal partition should be deleted
9. commit the journal of the journal partition to disk
10. trim the journal of the journal partition
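If you want to see the amplification for yourself, here's a quick-and-dirty probe; the device name and test path are assumptions, and you'll want an otherwise-idle disk so the counters aren't polluted:

```python
# Compare sectors written at the block layer (from /proc/diskstats, Linux
# only) against the bytes we actually submitted.
import os

DEV = "sdb"                           # hypothetical disk holding the fs under test
TARGET = "/mnt/osd-test/blob"         # hypothetical file on that disk
PAYLOAD = b"x" * (64 * 1024 * 1024)   # 64 MiB logical write

def sectors_written(dev: str) -> int:
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == dev:
                return int(fields[9])  # 10th field: sectors written
    raise ValueError(f"device {dev} not found")

before = sectors_written(DEV)
fd = os.open(TARGET, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
os.write(fd, PAYLOAD)
os.fsync(fd)
os.close(fd)
after = sectors_written(DEV)

physical = (after - before) * 512      # diskstats counts 512-byte sectors
print(f"logical bytes : {len(PAYLOAD)}")
print(f"physical bytes: {physical}")
print(f"amplification : {physical / len(PAYLOAD):.2f}x")
```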
Now, one could come in and claim that XFS has a trick for this called journal checkpointing (a trick stolen from ext3), but since Ceph requires atomicity this setting is overridden and XFS is forced to commit on almost every single write (a major slowdown).
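To get a feel for what committing on every write costs, here's a minimal sketch (the test path is an assumption) comparing buffered writes against writes that force a commit each time via O_DSYNC:

```python
# Time COUNT small writes, once buffered and once with O_DSYNC, which
# forces the data (and needed metadata) to disk on every single write.
import os
import time

PATH = "/mnt/osd-test/sync-probe"   # hypothetical file on the fs under test
BLOCK = b"\0" * 4096
COUNT = 1000

def timed_writes(flags: int) -> float:
    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | flags)
    start = time.monotonic()
    for _ in range(COUNT):
        os.write(fd, BLOCK)
    os.fsync(fd)   # make the buffered case pay for its deferred commit too
    elapsed = time.monotonic() - start
    os.close(fd)
    return elapsed

buffered = timed_writes(0)
dsync = timed_writes(os.O_DSYNC)
print(f"buffered + one fsync: {buffered:.3f}s")
print(f"O_DSYNC every write : {dsync:.3f}s ({dsync / buffered:.1f}x slower)")
```

On spinning disks expect the O_DSYNC case to be orders of magnitude slower, which is exactly the regime FileStore pushes XFS into.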
In most scenarios a mere mortal will not see this problem: if you drop 10GB into a Ceph cluster with 100GB of combined journal space, Ceph just consumes journal space and the write problem never becomes apparent. Drop 3TB of data into the same cluster and you will see all the problems first hand, and experience a "seek the disk to death" scenario that you can't stop or postpone.
On top of this there is the issue of how objects are actually stored within the XFS file system. There is a video on YouTube that explains it in great detail (something with BlueStore in the title). To cut a long story short, all objects live in folders, but to reach them quicker they are grouped into folders of roughly 50-200 objects each. Once a folder crosses 200 objects (files), it gets split into two subfolders with the objects spread among them ... this is very metadata-heavy, and every one of those moves follows the same path of hitting the same disk 6 times.
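Here's a toy model of that splitting behaviour. It is NOT Ceph's actual hashing code, just an illustration of how much metadata churn a 200-object threshold creates:

```python
# Toy model of FileStore-style directory splitting: folders hold at most
# MAX_PER_DIR objects; an overflowing folder is split in two and its
# contents redistributed. We count metadata ops to show the churn.
MAX_PER_DIR = 200

class Folder:
    def __init__(self):
        self.objects = []
        self.children = None   # None until this folder has been split

    def insert(self, name: str) -> int:
        """Insert an object, returning the number of metadata ops incurred."""
        if self.children is not None:
            # Folder was already split: route by a cheap hash.
            return self.children[hash(name) % 2].insert(name)
        self.objects.append(name)
        if len(self.objects) <= MAX_PER_DIR:
            return 1   # one create
        # Overflow: split into two subfolders, moving every object.
        # Each move is a rename that goes through the full
        # journal-on-journal write path described above.
        self.children = [Folder(), Folder()]
        moved = len(self.objects)
        for obj in self.objects:
            self.children[hash(obj) % 2].objects.append(obj)
        self.objects = []
        return 1 + moved   # the create plus one rename per object

root = Folder()
ops = sum(root.insert(f"obj-{i}") for i in range(100_000))
print(f"100k inserts cost {ops} metadata ops ({ops / 100_000:.2f}x amplification)")
```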
BlueStore solves most if not all of those issues, and in my experience the difference is heaven and earth.
Also, in terms of benchmarking:
- if you want to know the absolute speed of your Ceph deployment, you need to keep chucking data at it until you starve Ceph of journal space (see the sketch after this list).
- if you want to know real-world performance, you should measure only the journal-backed part, since you should have a journal big enough for your typical deployment requirements anyway.
- if you want the best R/RW performance from Ceph, switch to BlueStore ... it will not cure cancer, but it may save a stray cat.
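If you want to script the "starve the journal" test from the first bullet, here's a minimal sketch using the python3-rados bindings; the pool name, object count, and conffile path are assumptions:

```python
# Write 4 MiB objects in a loop and watch per-object throughput. Once the
# journals fill, the numbers drop to what the backing disks can really do.
import time
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("bench")        # hypothetical test pool

payload = b"\0" * (4 * 1024 * 1024)        # 4 MiB per object
try:
    for i in range(50_000):                # ~200GB total; size this past your journal space
        start = time.monotonic()
        ioctx.write_full(f"bench-obj-{i}", payload)
        latency = time.monotonic() - start
        if i % 100 == 0:
            print(f"obj {i:6d}: {len(payload) / latency / 1e6:7.1f} MB/s")
finally:
    ioctx.close()
    cluster.shutdown()
```

The early iterations give you the journal-absorbed number (the second bullet); the steady state after the cliff is your absolute speed (the first bullet).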
(note to the original poster: due to the seek multiplication from using a POSIX fs with another journal on top, you are killing the performance of those SSDs; they will perform FAR better on BlueStore)