Ceph very slow writes

skunky · Apr 25, 2017

Hello everyone,
I have a pretty big problem with a 3 node ceph cluster on CentOS7. It just hangs at a point when tranferring large ( 10GB ) files to it, regardless if using rbd or cephfs.
For example, from the client machine i start to transfer 3x10GB files, it tranfers a half of the 30GB content, and at a point both "fs_apply_latency" and "fs_commit_latency" go up to 3000-4000 ms ( sometimes even 30000 !!!) , resulting in about "100 requests are blocked > 32 sec". At this point the transfer just freezes, and at a point it starts again, and againg freeze...and so on, until it finish.
My hardware setup is not very apropiate for a ceph cluster, since both public and cluster network are on 1GB nics. On each server I have 1 x 10Gb card, that i used at the beginning for the public_network and 1Gb for cluster_network , but this didn't help since i guess cluster was ingesting too much traffic to be able to handle withing the 1Gb cluster_network. So i switched over to 1gb for both ceph client and ceph cluster.
The problem is that i just need to isolate the issue as much as it can be done and figure out if there's a ceph ,network, OS misconfiguration, or just bad hardware for ceph.
So, there are 4 hp's DL 160 G6 94Gb ram, 3 for the cluster ( mon, osd, mds ) and 1 for the ceph client. They all have p410 smart array controllers ( cache disabled ) but write cache ( smartarrayaccelerator ) enabled for all logical volumes, including the journal ssd disk. Centos7, all kernel updated to "4.10.12-1.el7.elrepo.x86_64".
HW:
There are 2 x samsung 850 ssd 120gb ( one for OS , one for ceph journal ) , 2 x 1TB sandisk ultra II ( for 2 osd's ) , 4x5TB seagate ( for 4 spinning osds's ) on each server.
They are all capable of at least 3Gbps ( i've just noticed that hpssacli is reporting 1.5Gpbs on one ceph node, will stick to fix that meanwhile ).
They are all connected to sw's that share 10Gb links between them, cluster_network is on a separate vlan on the same sw's. ( cannot add sw at the moment, all my work is done remotelly ).
Here is my NEW ceph.conf accordingly to
Tuning for All Flash Deployments - Ceph - Ceph :

[global]
fsid = 2806fecf-4c9a-4805-a16a-10d01f3b9e22
mon_initial_members = storage4, storage5, storage6
mon_host = 10.10.6.14,10.10.6.15,10.10.6.16
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
cluster network = 10.10.60.0/24
public network = 10.10.6.0/24
mon pg warn max per osd = 0
mds cache size = 500000
mon lease = 50
mon lease renew interval = 30
mon lease ack timeout = 100
mon osd min down reporters = 4
osd crush update on start = false
filestore_xattr_use_omap = true
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcatcher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0
[mon]
mon_pg_warn_max_per_osd=5000
mon_max_pool_pg_num=106496
[client]
rbd cache = false
[osd]
osd mkfs type = xfs
osd mount options xfs = rw,noatime,,nodiratime,inode64,logbsize=256k,delaylog
osd mkfs options xfs = -f -i size=2048
filestore_queue_max_ops=5000
filestore_queue_max_bytes = 1048576000
filestore_max_sync_interval = 10
filestore_merge_threshold = 500
filestore_split_multiple = 100
osd_op_shard_threads = 8
journal_max_write_entries = 5000
journal_max_write_bytes = 1048576000
journal_queueu_max_ops = 3000
journal_queue_max_bytes = 1048576000
ms_dispatch_throttle_bytes = 1048576000
objecter_inflight_op_bytes = 1048576000

I've also added the following to sysctl.conf:

fs.file-max = 6553600
net.ipv4.ip_local_port_range = 1024 65000
net.ipv4.tcp_fin_timeout = 20
net.ipv4.tcp_max_syn_backlog = 819200
net.ipv4.tcp_keepalive_time = 20
kernel.msgmni = 2878
kernel.sem = 256 32000 100 142
kernel.shmmni = 4096
net.core.rmem_default = 1048576
net.core.rmem_max = 1048576
net.core.wmem_default = 1048576
net.core.wmem_max = 1048576
net.core.somaxconn = 40000
net.core.netdev_max_backlog = 300000
net.ipv4.tcp_max_tw_buckets = 10000
I just don't understand why are those latencies appearing, and to be honest is not much better than i used the same hdd for osd and it's journal. Now, the journal is on one of the 120gb ssd, which is splittled into multiple 18GB partitions.
Does anyone have any ideea in wich directions should i debug more ( raid, hdd, netwok, ceph.conf, os ) ?
I'm already since a week with this issue...
Have a nice day, and many thanks in advance !

Jeggs101 · Apr 25, 2017

Here's what I'm seeing:

You've got 3 nodes. I think that's small for a Ceph and Ceph does better with many nodes (like 100) rather than few.

You've got consumer SSDs on older HP RAID controllers. I don't think that's best.

Need 10g network with jumbo frames on.

Need 20-40 OSD.

skunky · Apr 25, 2017

Thanks a lot.
So the main bottleneck in this 3 nodes case would be network & journal ssd type. In one way or other i have to make them spin as much as i can, unfortunatelly. At a point I will be able to add some intel DC ssd's for journaling, and one 10GB sw ( configured with two vlans for public/cluster network ). Also considering changing raid controllers.
Do you think it would get some extra performance this way ?
Those latencies flapping and getting very high are a big concern for me since they translate into blocked requests at a point ( at least thats what i guess ).

PigLover · Apr 25, 2017

I would guess that your main bottleneck is replication traffic on the cluster network. Your stalls are likely related to the journals filling up because you are writing from the client faster than you can get the replication done. You need to get that cluster network up to 10Gbe.

You might also look at upgrading to Luminous and using Bluestore on the OSDs. The improved cache write design could really help this use case. Luminous is currently pre-release so you may not want to use it yet.

After you get the 10Gbe cluster network done a better SSD for the journal would be a great idea. That Samsung 850 is unlikely to deal with the concerncy you are throwing at it very well.

Lastly - ditch the raid card and get a plain old HBA (LSI card in IT mode). The raid functions and extra cache are not helping you at all.

Sorry for any typos. Posting from mobile.

Sent from my SM-G950U using Tapatalk

skunky · May 2, 2017

Sirs, thank you very much !
Your posts helped me a lot in the process of ceph understanding.
I would have some other questions...
I would prefer to buy only 1 x 10GB Ethernet switch, split it into 2 vlans, for public/cluster networks.
The 3 servers ( they are all mons and osd's nodes ) have 2 port 10GB ethernet cards, each. So i can use 1 port for cluster and one port for public, on each server.
Do you think it would be a good ideea to have both vlans on the same switch ? - It would reduce some costs instead of buying two sw's .
Second, regarding HBA card. Do you have any recomandations on the LSI side about that ?
I saw some cards like ( LSI SAS 9217-8i 6Gbps SAS )
LSI SAS 9217-8i 6Gbps 8 Port SAS HBA R76Y4 Ref
Do you think it can nicelly address the latencies and iop's issues ?
It worries me that this card is a raid card too, so buying it, it will get me into raid latencies again....
Again, thank you very much for your help !

Leemur · May 2, 2017

Your main problem is the samsung SSDs, they are known to be very slow for CEPH journal, it can be less than 1MB/Sec.
I have been figthing the exact same problem on a Ceph cluster I took over from somebody else.
For now I have moved all journals back to the HDDs and are seeing improved write speeds.

[Edit]
here is a link with information about how to test Journal performance and also some results.
Ceph: how to test if your SSD is suitable as a journal device? | Sébastien Han

Samsung 850 Pro 128GB: 1.2 MB/s

skunky · May 2, 2017

Thank you.
It's in plan to put one Intel DC S3700 per each server, leave the rest samsungs for osd's data, replace raid cards with IT-mode HBA's, and add 10GB's Ethernet sw's ( jumbo frames enabled on cluster network ports).
Not sure about choosing the right LSI controller & 10gb switch ( vlan's or not ) though..

Leemur · May 2, 2017

I would personally avoid any Samsung Evo SSD's in a ceph cluster.

The problem is the way the disks are built; the disk it self have poorly write speeds and therefor Samsung have put a few GB (size depends on model and size of disk) of fast write cache in front of it. This works fine when used in a workstation but not in ceph; you will end up with slower write speeds than a 7200RPM sata HDD.

On my cluster at home I tried with a proper journal in front, but still had problems.
I ended up replacing them with sata HDD's with the same SSD's as journals and are seeing better performance as I avoid huge spikes in the write latency.

maze · May 2, 2017

Shouldnt be an issue to use 2x 10g ports. You Can even Bond Them and do a trunk with both your vlans on. Should give you som resilience..

skunky · May 2, 2017

Thanks for good info ! Actually, now I have 6 x sandisk ultra II 1TB for osd data and samsung for osd journal. They are cosumer hdd's though... Any experience with these sandisks? I would try to put journaling on one of them on each server just for sake of testing.

skunky · May 2, 2017

Thanks Maze. Yeah.. the trunk over bonded would be a great ideea.

skunky · May 9, 2017

Wow. Just moved journaling from samsung to sandisk for testing purposes and this improved a lot journal writes.
Now I have to find out a proper ~200gb mlc ssd for journaling ( sandisks are 1tb and they were meant to host osds ). I saw sebastien's ssd's comparation page, and i just can decide which of those three should I go for journaling: intel 3500 , intel 3600, samsung 863. Samsung is cheaper, but is samsung, and I already had experiences with those evo's. Intel S3500 is cheaper than S3700, but regarding that report it handles more troughput than s3700...
Which of one do you think is more robust for journaling ?
Ceph: how to test if your SSD is suitable as a journal device? | Sébastien Han
Thanks a lot !

whitey · May 9, 2017

I've saw that chart LONG ago as well, sat there perplexed and scratched my head. I think I'd opt for s3700's for sure over the s3500's no matter what that report said lol...just saying. All the ??? marks splattering his charts don't exactly leave a warm fuzzy. One things for sure based on his consumer list they are garbage for CEPH use...the NVMe's seem to have all fared very well on the ent class list.

My CEPH 3 node w/ husmm 12Gbps sas ssd dev's (SSD800MM/SSD1600MM pure flash CEPH pool) kills it on I/O performance. I've posted several threads over the last year on here covering details/numbers (w/ s3600 dev's back then) but a refresher wouldn't hurt I guess.

skunky · May 10, 2017

Now I'm scrathing my head too

. Not sure that report is entirely correct/relevant especially knowing that 3700 is double than 3500 @ IOPS...
I would touch those 863 samsungs ( NOT 863a ) but I'm little scared that "samsung" is written on my TV too

Tough choice I have to do, knowing that I have to order them, fligh to London DC, install, test, flight back, & benchmark & see results...
To be onest, everywhere I look: on ceph conferences, internet, forums, mailing lists, all guys are generally using s3600/s3700's for journaling & s3610's for osd data. So I just need a good point and reason that why should I buy expensier 3700 instead of 3500 or samsung 863...

i386 · May 10, 2017

skunky said:
I just need a good point and reason that why should I buy expensier 3700 instead of 3500 or samsung 863...

iops/$
The intel s3700/3710 get the max performance out the sata interface, there are no better performing sata ssds.

whitey · May 10, 2017

skunky said:
Now I'm scrathing my head too . Not sure that report is entirely correct/relevant especially knowing that 3700 is double than 3500 @ IOPS...
I would touch those 863 samsungs ( NOT 863a ) but I'm little scared that "samsung" is written on my TV too
Tough choice I have to do, knowing that I have to order them, fligh to London DC, install, test, flight back, & benchmark & see results...
To be onest, everywhere I look: on ceph conferences, internet, forums, mailing lists, all guys are generally using s3600/s3700's for journaling & s3610's for osd data. So I just need a good point and reason that why should I buy expensier 3700 instead of 3500 or samsung 863...

Assuming you meant s3700/s3610 for journal and s3500's for OSD?

skunky · May 10, 2017

Yes, that's right. So nobody had ceph journaling experiences with Samsung 863 240GB until now. Regarding that chart they are 64.7MB/s ( @140 pound ) compared to s3700 200GB - 22.5MB/s ( @400 pound ) & s3500 240GB - 39.1MB/s ( @182 pound) .
Still scratching my head...

whitey · May 10, 2017

Don't believe the hype...s3700's rock, and you all know I love my ent class HGST ssd's so...don't get me going there :-D

I hear allright things about some of the better ent class sammy's I believe IIRC, just no experience but someone will chime in I'm fairly certain.

I'm biased probably because that is all I will use (Intel/HGST ssd's) really due to nothing but stellar bang for buck and reliability IMHO.

i386 · May 10, 2017

skunky said:
Yes, that's right. So nobody had ceph journaling experiences with Samsung 863 240GB until now. Regarding that chart they are 64.7MB/s ( @140 pound ) compared to s3700 200GB - 22.5MB/s ( @400 pound ) & s3500 240GB - 39.1MB/s ( @182 pound) .
Still scratching my head...

I wouldn't rely on the numbers posted on that list, see this screenshot:

skunky · May 10, 2017

whitey said:
Assuming you meant s3700/s3610 for journal and s3500's for OSD?

No,
s3500 vs s3700 for journaling...
Osd's will remain on those Sandisks ultra II for a while ( increased fs_apply_latency maybe after ) along with the 5TB's sata spinning disks.

Ceph very slow writes

New Member

Well-Known Member

New Member

Moderator

New Member

New Member

New Member

New Member

Active Member

New Member

New Member

New Member

Moderator

New Member

Well-Known Member

Moderator

New Member

Moderator

Well-Known Member

New Member