Ceph Bluestore over RDMA performance gain

Discussion in 'Linux Admins, Storage and Virtualization' started by EluRex, Jun 2, 2018.

  1. EluRex

    EluRex Active Member

    Joined:
    Apr 28, 2015
    Messages:
    205
    Likes Received:
    70
    I want to share the following testing with you.

    A 4-node PVE cluster with 3 Ceph Bluestore nodes, 36 OSDs in total.
    1. OSD: Seagate ST6000NM0034
    2. block.db & block.wal device: Samsung SM961 512GB
    3. NIC: Mellanox ConnectX-3 VPI dual port 40 Gbps
    4. Switch: Mellanox SX6036T
    5. Network: IPoIB, separate public and cluster networks
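
    For reference, here is a rough sketch of the ceph.conf settings that enable the RDMA messenger (the device name and port below are placeholders, take the real values from ibv_devinfo on your nodes; use ms_cluster_type instead of ms_type if you only want RDMA on the cluster network):

    [global]
    ms_type = async+rdma
    ms_async_rdma_device_name = mlx4_0
    ms_async_rdma_port_num = 1
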
    The screenshot below shows Ceph over RDMA successfully enabled:
    [screenshot]

    Ceph over RDMA - rados bench -p rbd 60 write -b 4M -t 16
    [screenshot]
    2454.72 MB/s

    Standard TCP/IP - rados bench -p rbd 60 write -b 4M -t 16
    [screenshot]
    2053.9 MB/s

    Total performance gain is about 20% (2454.72 vs 2053.9 MB/s).

    Total pool performance with 4 tests running - rados bench -p rbd 60 write -b 4M -t 16
    [screenshot]
    4856.72 MB/s
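
    (For anyone reproducing the aggregate number: it is simply four instances of the same bench running concurrently, roughly like this; distinct run names keep the benchmark objects separate, and the instances can just as well be launched one per node.)

    for i in 1 2 3 4; do
      rados bench -p rbd 60 write -b 4M -t 16 --run-name bench$i --no-cleanup &
    done
    wait
    for i in 1 2 3 4; do rados -p rbd cleanup --run-name bench$i; done
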
     
    #1
  2. mrktt

    mrktt New Member

    Joined:
    Aug 21, 2018
    Messages:
    4
    Likes Received:
    0
    I have a setup nearly identical to yours, but I cannot start the cluster in RDMA mode. The OSDs come up and then mark themselves down because they cannot communicate with each other across hosts.
    Can you please share your ceph.conf and/or your distribution/kernel version? Is this done with the inbox InfiniBand drivers or with OFED?
    Thanks!
     
    #2
  3. EluRex

    EluRex Active Member

    Joined:
    Apr 28, 2015
    Messages:
    205
    Likes Received:
    70
    Check the error log for each OSD: /var/log/ceph/ceph-osd.[id].log

    Typically the problem can be solved with

    ceph-disk activate /dev/sd[x] --reactivate

    or

    systemctl disable ceph-osd@[id].service; systemctl enable ceph-osd@[id].service
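
    If the log shows ibverbs/RDMA errors instead (e.g. failing to register memory or create queue pairs), the locked-memory limit of the OSD unit is another common cause. A sketch of a systemd drop-in (the file name is just an example):

    # /etc/systemd/system/ceph-osd@.service.d/rdma.conf
    [Service]
    LimitMEMLOCK=infinity
    PrivateDevices=no

    systemctl daemon-reload
    systemctl restart ceph-osd@[id].service

    LimitMEMLOCK lets the messenger pin its RDMA buffers, and PrivateDevices=no lets the daemon see /dev/infiniband.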
     
    #3
  4. arglebargle

    arglebargle H̸̖̅ȩ̸̐l̷̦͋l̴̰̈ỏ̶̱ ̸̢͋W̵͖̌ò̴͚r̴͇̀l̵̼͗d̷͕̈

    Joined:
    Jul 15, 2018
    Messages:
    315
    Likes Received:
    84
    I'd be curious to see some CPU utilization numbers with and without RDMA, even if it's something as simple as a netdata utilization graph before and after.
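
    Even something as simple as sampling with sar or mpstat on each OSD node while the bench runs would do, e.g.:

    sar -u 5 12          # overall CPU, 5-second samples for a minute
    mpstat -P ALL 5 12   # or per core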
     
    #4
  5. EluRex

    EluRex Active Member

    Joined:
    Apr 28, 2015
    Messages:
    205
    Likes Received:
    70
    The lab environment has already moved on to testing other things... no netdata or anything else is available at this point.
     
    #5
  6. mrktt

    mrktt New Member

    Joined:
    Aug 21, 2018
    Messages:
    4
    Likes Received:
    0
    Thanks, but no effect. The OSD service starts as "up, in" and goes down after a couple of minutes with a bunch of
    heartbeat_check: no reply from x.x.x.x
    for every OSD on the other host.
    So I don't think it's an OSD problem per se; the services just can't communicate with each other via RDMA for unknown reasons.
    Thanks anyway.
     
    #6
  7. EluRex

    EluRex Active Member

    Joined:
    Apr 28, 2015
    Messages:
    205
    Likes Received:
    70
    It seems your RoCE is not up.
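
    You can check what the adapter is actually doing, and whether raw RDMA works between two nodes at all, with something like this (the device name is an example, take it from ibv_devinfo):

    ibstat mlx4_0                      # "Link layer:" shows InfiniBand vs Ethernet (RoCE)
    ib_send_bw -d mlx4_0               # on node A (server)
    ib_send_bw -d mlx4_0 <node-A-IP>   # on node B (client)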
     
    #7
  8. mrktt

    mrktt New Member

    Joined:
    Aug 21, 2018
    Messages:
    4
    Likes Received:
    0
    It is not. I'm using a relatively old InfiniBand switch that supports only pure old-style IB, no RoCE.
    So I have RDMA communication between the nodes (all the test utilities work perfectly), but not RoCE.
    If Ceph can communicate only with RoCE, that explains all my problems.
    I suspected something like this, but so far I haven't been able to find any reference to RoCE being mandatory in the official Ceph documentation.
    Thanks
     
    #8
  9. EluRex

    EluRex Active Member

    Joined:
    Apr 28, 2015
    Messages:
    205
    Likes Received:
    70
    Hmmm, strange... because I am also running on an SX6036 IB switch, and what I use is IPoIB.
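
    (You can see how an IPoIB interface is set up with something like the following; ib0 is just the typical interface name:)

    cat /sys/class/net/ib0/mode   # datagram or connected
    ip -d link show ib0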
     
    #9
  10. mrktt

    mrktt New Member

    Joined:
    Aug 21, 2018
    Messages:
    4
    Likes Received:
    0
    I think the SX6036 supports both VPI and Ethernet, so you can have a RoCE connection.
    I'm using an old QLogic 12200, which I think can't do RoCE.
    With ESXi 6.5 there's the same problem: the new drivers have only the _EN variant, and I can bring up the network only with a direct connection between the ports; if I go through the switch, the ports stay down.
    I'll try the same test with the Ceph nodes: if the internal network comes up with a direct cable connection, the problem is probably the missing RoCE capability of the switch.
     
    #10
  11. i386

    i386 Well-Known Member

    Joined:
    Mar 18, 2016
    Messages:
    1,450
    Likes Received:
    331
    Yes and no: the SX6036G supports Ethernet/VPI out of the box, while the standard SX6036 requires a gateway license ($3k+).
     
    #11
