Ceph BlueStore over RDMA performance gain


EluRex

Active Member
Apr 28, 2015
Los Angeles, CA
I want to share the following test results with you.

A 4-node PVE cluster with 3 Ceph BlueStore nodes, 36 OSDs in total.
  1. OSD: Seagate ST6000NM0034
  2. block.db & block.wal device: Samsung SM961 512GB
  3. NIC: Mellanox ConnectX-3 VPI dual-port 40 Gbps
  4. Switch: Mellanox SX6036T
  5. Network: IPoIB, with separate public and cluster networks
This shows Ceph over RDMA is successfully enabled.
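For anyone who wants to try the same thing: the RDMA side is only a couple of lines in ceph.conf, roughly like the sketch below (the device name is only an example, check yours with ibv_devinfo, and double-check option names against your Ceph release):

[global]
# switch the async messenger from posix to rdma
ms_type = async+rdma
# RDMA device to use - mlx4_0 is an example for ConnectX-3
ms_async_rdma_device_name = mlx4_0

The daemons also need to be able to pin enough memory for the RDMA queues, so it is usually necessary to raise the locked-memory limit as well (for example LimitMEMLOCK=infinity in the ceph systemd units).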


Ceph over RDMA - rados bench -p rbd 60 write -b 4M -t 16

2454.72 MB/s

Standard TCP/IP - rados bench -p rbd 60 write -b 4M -t 16

2053.9 MB/s

Total performance gain is about 20% (2454.72 MB/s vs. 2053.9 MB/s)

Total pool performance with 4 rados bench instances running simultaneously - rados bench -p rbd 60 write -b 4M -t 16
4856.72 MB/s
 

mrktt

New Member
Aug 21, 2018
EluRex said:
A 4-node PVE cluster with 3 Ceph BlueStore nodes, 36 OSDs in total. [...] This shows Ceph over RDMA is successfully enabled.
I've a setup nearly identical to yours, but I cannot start the cluster in RDMA mode. The OSDs come up and then mark themselves down because they cannot communicate with each other across hosts.
Can you please share your ceph.conf and/or your distribution/kernel version? Is this done with the inbox InfiniBand drivers or with OFED?
Thanks!
 

EluRex

Active Member
Apr 28, 2015
Los Angeles, CA
mrktt said:
I've a setup nearly identical to yours, but I cannot start the cluster in RDMA mode. [...] Is this done with the inbox InfiniBand drivers or with OFED?
Check the error log for each OSD: /var/log/ceph/ceph-osd.[id].log

Typically the problem can be solved with:

ceph-disk activate /dev/sd[x] --reactivate

or

systemctl disable ceph-osd@[id].service; systemctl enable ceph-osd@[id].service
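If the log points at RDMA errors rather than a disk problem, it is also worth confirming that the HCA itself is usable on that node, for example with the standard InfiniBand userspace tools (adjust the log path and OSD id to your setup):

grep -i rdma /var/log/ceph/ceph-osd.*.log | tail   # any RDMA-related errors in the OSD logs?
ibv_devinfo                                        # is the HCA visible and the port ACTIVE?
ibstat                                             # per-port link state and rate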
 

arglebargle

H̸̖̅ȩ̸̐l̷̦͋l̴̰̈ỏ̶̱ ̸̢͋W̵͖̌ò̴͚r̴͇̀l̵̼͗d̷͕̈
Jul 15, 2018
I'd be curious to see some CPU utilization numbers with and without RDMA, even if it's something as simple as a netdata utilization graph before and after.
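Even something as simple as sar/pidstat on one OSD node while the bench is running would do, e.g. (sar and pidstat come from the sysstat package):

sar -u 5 12                             # whole-node CPU, every 5 seconds for one minute
pidstat -p $(pgrep -d, ceph-osd) 5 12   # CPU per ceph-osd process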
 

mrktt

New Member
Aug 21, 2018
EluRex said:
Check the error log for each OSD: /var/log/ceph/ceph-osd.[id].log [...]
Thanks, but no effect. The OSD service starts as "up, in" and goes down after a couple of minutes with a bunch of
heartbeat_check: no reply from x.x.x.x
messages for every OSD on the other host.
So I don't think it's an OSD problem per se; the services just can't communicate via RDMA for unknown reasons.
Thanks anyway.
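One way to narrow it down is to check whether plain RDMA traffic works between the OSD hosts at all, outside of Ceph, for example with the perftest tools (the host address below is a placeholder):

ib_send_bw                      # run on the first OSD host (server side)
ib_send_bw <first-host-ip>      # run on the second OSD host (client side)

If that also fails, the problem is below Ceph, in the fabric or driver setup.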
 

EluRex

Active Member
Apr 28, 2015
Los Angeles, CA
mrktt said:
Thanks, but no effect. [...] So I don't think it's an OSD problem per se; the services just can't communicate via RDMA for unknown reasons.
It seems your RoCE is not up.
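A quick check is the link layer the ports are actually running: RoCE shows up as Ethernet, plain IB shows up as InfiniBand:

ibv_devinfo | grep -E 'hca_id|link_layer'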
 

mrktt

New Member
Aug 21, 2018
EluRex said:
It seems your RoCE is not up.
It is not. I'm using a relatively old InfiniBand switch that supports only pure old-style IB, no RoCE.
So I have RDMA communication between the nodes (all the test utilities work perfectly), but not RoCE.
If Ceph can communicate only over RoCE, that explains all my problems.
I suspected something like this, but until now I wasn't able to find any reference to RoCE being mandatory in the official Ceph documentation.
Thanks
 

mrktt

New Member
Aug 21, 2018
EluRex said:
Hmmm, strange... because I am also running an SX6036 IB switch, and what I use is IPoIB.
I think the SX6036 supports both VPI and Ethernet, so you can have a RoCE connection.
I'm using an old QLogic 12200, which I think can't do RoCE.
With ESXi 6.5 there's the same problem: the new drivers have only the _EN variant, and I can bring the network up only with a direct connection between the ports; if I go through the switch, the ports stay down.
I'll try the same test with the Ceph nodes: if the internal network comes up over a direct cable connection, the problem is probably the missing RoCE capability of the switch.
 

cek

New Member
Jul 21, 2021
To anyone coming to this thread -- I'd suggest abandoning the InfiniBand idea and going with EN/RoCE if you're at the initial stage of your cluster build. I had the misfortune of picking IB without thinking through the consequences, and now I'm stuck with IPoIB, which is really a band-aid and provides mediocre performance for Ceph. I was able to achieve a maximum of about 30k IOPS on an RBD block device over IPoIB. Pick EN cards/switch[es] instead!

P.S.: It also seems Ceph has ditched the idea of supporting any type of RDMA and will just rely on whatever the TCP stack is capable of (which should still be faster with RoCE than with IPoIB).
 

gerby

SREious Engineer
Apr 3, 2021
I have an existing Pacific cluster that I was considering migrating to RoCEv2. It's currently a containerized deployment created with cephadm; are there any gotchas or tips you've got for those exploring this space? I had planned on using RoCE only on the cluster network, leaving the public network on standard TCP.
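From what I've read, the knob for that split is ms_cluster_type, so on paper it would be something like the untested sketch below (the device name is an example); with cephadm the containers would presumably also need the /dev/infiniband devices and a high memlock limit passed through:

[global]
# public network stays on the default async+posix (TCP)
ms_cluster_type = async+rdma
# example device name - check with ibv_devinfo inside the container
ms_async_rdma_device_name = mlx5_0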
 

EluRex

Active Member
Apr 28, 2015
Los Angeles, CA
Ceph does not support RDMA in production yet, development on it seems extremely slow, Mellanox's commitment to it has ceased, and it is now just part of the async messenger.

Development on supporting DPDK + SPDK is probably moving faster than waiting for RDMA.
 

epycftw

New Member
Jul 1, 2021
Oh, that's not good. I was counting on Ceph to give me 50 GB/s if I paired it with some NVMe drives and ConnectX EDR cards. What's a good, fast, free cluster file system that talks efficiently over InfiniBand, scales out well, and is like Ceph? Do people run CephFS, or mount a block device with XFS/whatever filesystem? Any recent tests or benchmarks with fast NVMes, like 7 GB/s read / 2 GB/s write?