Hello everyone,
I have a pretty big problem with a three-node Ceph cluster on CentOS 7. It just hangs at some point when transferring large (10 GB) files to it, regardless of whether I use RBD or CephFS.
For example, when I start transferring 3 x 10 GB files from the client machine, about half of the 30 GB goes through, and then both "fs_apply_latency" and "fs_commit_latency" climb to 3000-4000 ms (sometimes even 30000!), resulting in warnings like "100 requests are blocked > 32 sec". At that point the transfer just freezes, then resumes, then freezes again... and so on, until it finishes.
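For reference, this is roughly how I watch the stalls while a transfer runs (osd.0 is just an example ID, any OSD works):

# per-OSD commit/apply latency, refreshed every 2 seconds
watch -n 2 'ceph osd perf'

# which requests are blocked, and on which OSDs
ceph health detail | grep -i blocked

# on the node hosting a slow OSD, dump the slowest recent ops
# to see which phase (journal, filestore, network) eats the time
ceph daemon osd.0 dump_historic_ops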
My hardware setup is not very appropriate for a Ceph cluster, since both the public and the cluster network are on 1 Gb NICs. Each server also has one 10 Gb card, which I used at the beginning for the public_network (with 1 Gb for the cluster_network), but that didn't help; I guess the cluster was ingesting more traffic than the 1 Gb cluster_network could handle. So I switched to 1 Gb for both the Ceph public and cluster networks.
I just need to isolate the issue as much as possible and figure out whether there is a Ceph, network, or OS misconfiguration, or whether the hardware is simply bad for Ceph.
So, there are four HP DL160 G6 servers with 94 GB RAM each: three for the cluster (mon, osd, mds) and one for the Ceph client. They all have P410 Smart Array controllers (cache disabled) but with write cache (the Smart Array Accelerator) enabled for all logical volumes, including the journal SSD. CentOS 7, all kernels updated to 4.10.12-1.el7.elrepo.x86_64.
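For what it's worth, this is how I check the controller state (slot 0 is just an example; adjust to the actual slot):

# controller, cache and battery/capacitor status
hpssacli ctrl all show status

# per-logical-drive cache settings
hpssacli ctrl slot=0 show config detail | grep -i cache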
HW:
There are 2 x Samsung 850 120 GB SSDs (one for the OS, one for the Ceph journal), 2 x 1 TB SanDisk Ultra II (for 2 OSDs), and 4 x 5 TB Seagate drives (for 4 spinning OSDs) in each server.
They are all capable of at least 3 Gbps (I've just noticed that hpssacli is reporting 1.5 Gbps on one Ceph node; I will look into fixing that in the meantime).
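This is how I read the negotiated link rate per disk (again, slot 0 is just an example, and the exact field name may vary with controller firmware):

# physical drive details, including the negotiated transfer rate
hpssacli ctrl slot=0 pd all show detail | grep -i -e physicaldrive -e 'transfer rate'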
They are all connected to switches that share 10 Gb links between them; the cluster_network is on a separate VLAN on the same switches. (I cannot add a switch at the moment, since all my work is done remotely.)
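To rule out the network itself, I test raw TCP throughput between the nodes on both VLANs (assuming storage5's addresses are 10.10.6.15 on the public network and 10.10.60.15 on the cluster network):

# on storage5
iperf3 -s

# from another node, first the public, then the cluster address
iperf3 -c 10.10.6.15 -t 30
iperf3 -c 10.10.60.15 -t 30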
Here is my NEW ceph.conf, put together according to the "Tuning for All Flash Deployments" guide from the Ceph site:
[global]
fsid = 2806fecf-4c9a-4805-a16a-10d01f3b9e22
mon_initial_members = storage4, storage5, storage6
mon_host = 10.10.6.14,10.10.6.15,10.10.6.16
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
cluster network = 10.10.60.0/24
public network = 10.10.6.0/24
mon pg warn max per osd = 0
mds cache size = 500000
mon lease = 50
mon lease renew interval = 30
mon lease ack timeout = 100
mon osd min down reporters = 4
osd crush update on start = false
filestore_xattr_use_omap = true
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcacher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0
[mon]
mon_pg_warn_max_per_osd=5000
mon_max_pool_pg_num=106496
[client]
rbd cache = false
[osd]
osd mkfs type = xfs
osd mount options xfs = rw,noatime,nodiratime,inode64,logbsize=256k,delaylog
osd mkfs options xfs = -f -i size=2048
filestore_queue_max_ops=5000
filestore_queue_max_bytes = 1048576000
filestore_max_sync_interval = 10
filestore_merge_threshold = 500
filestore_split_multiple = 100
osd_op_shard_threads = 8
journal_max_write_entries = 5000
journal_max_write_bytes = 1048576000
journal_queue_max_ops = 3000
journal_queue_max_bytes = 1048576000
ms_dispatch_throttle_bytes = 1048576000
objecter_inflight_op_bytes = 1048576000
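To make sure the daemons actually picked these values up (and not just the file on disk), I check through the admin socket; osd.0 is just an example:

# what the running daemon is really using
ceph daemon osd.0 config show | grep -e filestore_queue -e journal_max

# a value can also be changed at runtime, without restarting
ceph tell osd.* injectargs '--filestore_max_sync_interval=10'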
I've also added the following to sysctl.conf:
fs.file-max = 6553600
net.ipv4.ip_local_port_range = 1024 65000
net.ipv4.tcp_fin_timeout = 20
net.ipv4.tcp_max_syn_backlog = 819200
net.ipv4.tcp_keepalive_time = 20
kernel.msgmni = 2878
kernel.sem = 256 32000 100 142
kernel.shmmni = 4096
net.core.rmem_default = 1048576
net.core.rmem_max = 1048576
net.core.wmem_default = 1048576
net.core.wmem_max = 1048576
net.core.somaxconn = 40000
net.core.netdev_max_backlog = 300000
net.ipv4.tcp_max_tw_buckets = 10000
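These are applied with sysctl -p; afterwards I spot-check a couple of the values:

# load sysctl.conf and verify
sysctl -p
sysctl net.core.rmem_max net.core.wmem_max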
I just don't understand why those latencies are appearing; to be honest, it is not much better than when I used the same HDD for each OSD and its journal. Now the journals are on one of the 120 GB SSDs, which is split into multiple 18 GB partitions.
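I've read that consumer SSDs can be very slow at the synchronous O_DSYNC writes the filestore journal issues, so I intend to benchmark one journal partition directly (WARNING: this writes to the device; /dev/sdb2 is only an example, point it at an unused journal partition):

# 4k synchronous direct writes, the pattern the journal produces
fio --name=journal-test --filename=/dev/sdb2 --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based

If this reports only a few hundred IOPS, the journal SSD itself would be the bottleneck.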
Does anyone have any idea in which direction I should debug further (RAID, HDD, network, ceph.conf, OS)?
I've been struggling with this issue for a week already...
Have a nice day, and many thanks in advance!