
CEPH write performance pisses me off!

Discussion in 'Linux Admins, Storage and Virtualization' started by whitey, Jan 25, 2017.

  1. whitey

    whitey Moderator

    So as to save myself the typing, I JUST posted this on the CEPH IRC channel on freenode.

    Any thoughts/takers?

    CEPH-perf-irc.png
     
    #1
  2. Patrick

    Patrick Administrator
    Staff Member

    How many OSDs total?
     
    #2
  3. whitey

    whitey Moderator

    Three, each with a single S3610 800GB SSD. Read throughput is great: 2Gbps on the network / 250MB/s to the array. Write is weak sauce: something like 300Mbps on the network / 35-40MB/s to the array.

    SURELY I should NOT see that type of performance dropoff. I know they were touting BlueStore as 'the next best thing since sliced bread', but hell, this POSIX (FileStore) backend is garbage I guess. I did read to check CPU usage, since writes can use it a bit more liberally, but each OSD node's CPU is at 30-40% during active read/write operations. Reads show maybe a lil' lower CPU utilization while delivering higher throughput...what gives? hah

    CEPH-perf-read.png
    CEPH-perf-write.png

    Funny how you can watch the replication traffic do its thing at a furious rate. Not sure how I'd correlate 2Gbps of active traffic to 2Gbps of replication traffic when the write direction runs at 300Mbps active and 600Mbps replication (double the live traffic). That almost makes me think 'OK, double replication traffic to the two other OSDs', but then why was the READ replication traffic lockstepped at 2Gbps/2Gbps, if you follow what I am explaining in the screenshots?
     
    #3
    Last edited: Jan 25, 2017
  4. Patrick

    Patrick Administrator
    Staff Member

    Interesting. I think you need more drives :)
     
    #4
  5. whitey

    whitey Moderator

    HATERS GONNA HATE HAHAH j/k bro, I knew that was coming...you know me and my JANKY gear :-D

    Pretty impressive read throughput though, right? That is a super real-world sVMotion of a 30GB VM with about 1/2 the vdisk used (15GB or so). sVMotioning off the CEPH cluster w/ those 3 OSDs and single disks FLIES, but write is LAAAAAAAME! :-D

    I kinda expected writes to be at least 1/2 as quick as reads; I bet if you look at the perf specs (reads vs. writes) of the S3610s they are not too far off from each other...the storage software layer is giving me the middle finger w/out a doubt.
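    One way to pin down whether it's the S3610s themselves or the storage software layer is a raw sequential-write baseline against a single drive, outside of Ceph. A minimal sketch, assuming fio is available and /dev/sdX is a blank spare S3610 (this wipes whatever is on that device):
    Code:
    # raw sequential-write baseline for one SSD, run directly on an OSD node (DESTRUCTIVE to /dev/sdX)
    fio --name=raw-seq-write --filename=/dev/sdX --rw=write --bs=1M \
        --ioengine=libaio --iodepth=32 --direct=1 --runtime=30 --time_based
    
    If the bare drive sustains several hundred MB/s while Ceph writes sit at 35-40MB/s, the drives are off the hook.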
     
    #5
  6. Rahvin9999

    Rahvin9999 Member

    Read is 250MB/s, write is 35-40MB/s.
    Sounds about right.

    I'm simplifying a bit, but:
    write is /3 for the standard triple redundancy,
    and then /2 for the FileStore journaling (every write goes to the journal first, then to the data partition),
    so 250MB/s / 3 ≈ 83MB/s, / 2 ≈ 42MB/s.
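    Worth noting the first divisor depends on the pool's actual replica count; the ceph.conf posted further down sets osd_pool_default_size = 2, which would predict a bit more headroom. A quick sketch of the same back-of-the-envelope math, assuming the FileStore journal shares the data SSD (hence the extra /2):
    Code:
    echo "scale=1; 250 / 3 / 2" | bc    # 41.6 MB/s with 3 replicas
    echo "scale=1; 250 / 2 / 2" | bc    # 62.5 MB/s with 2 replicas (the size=2 default below)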
     
    #6
  7. whitey

    whitey Moderator

    Predictions: if I add another CEPH pool comprised of two s3700 200GB devices on each OSD node, will this go up, down, or stay about the same? About to do this here in the next hour or so.
     
    #7
  8. whitey

    whitey Moderator

    Why are READs not penalized nearly as much as writes in CEPH? Maybe I am naive to a simple/fundamental core concept here. Replicas/PGs/CRUSH leaf type w/in ceph.conf maybe? If this is 'all she's got, Scotty' for my config and these are reasonable performance numbers, I can deal I guess; when I had 15 or so VMs on it, it seemed very snappy/responsive.

    My current ceph.conf
    Code:
    [cephuser@ceph-admin ceph-deploy]$ cat ceph.conf
    [global]
    fsid = 31485460-ffba-4b78-b3f8-3c5e4bc686b1
    mon_initial_members = osd01, osd02, osd03
    mon_host = 192.168.2.176,192.168.2.177,192.168.2.178
    auth_cluster_required = cephx
    auth_service_required = cephx
    auth_client_required = cephx
    public_network = 192.168.2.0/24
    cluster_network = 192.168.111.0/24
    osd_pool_default_size = 2 # Write an object 2 times
    osd_pool_default_min_size = 1 # Allow writing 1 copy in a degraded state
    osd_pool_default_pg_num = 256
    osd_pool_default_pgp_num = 256
    osd_crush_chooseleaf_type = 1
    
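    Keep in mind the osd_pool_default_* values only apply to pools created after they are set; to see what the existing rbd pool is actually running with, something like:
    Code:
    # defaults only affect new pools, so check the live settings on 'rbd'
    ceph osd pool get rbd size
    ceph osd pool get rbd min_size
    ceph osd pool get rbd pg_num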
     
    #8
  9. PigLover

    PigLover Moderator

    In order to read from ceph you need an answer from exactly one copy of the data.

    To do a write you need to complete the write to each copy's journal; the rest can proceed asynchronously. So writes should be ~1/3 the speed of your reads, but in practice they are slower than that.

    You can improve this somewhat by setting 'required' writes to a lower value, which allows your write to be acknowledged before all three copies have been journaled.

    You can speed this up further with more OSDs in the pool, which allows more workers to operate asynchronously after the required writes are complete.
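    The knobs being described are presumably the pool's size and min_size; a minimal sketch of adjusting them on the default rbd pool (lowering redundancy is a real trade-off, so treat it as an experiment only):
    Code:
    # number of copies kept of each object
    ceph osd pool set rbd size 3
    # minimum copies that must be up for the pool to keep accepting I/O
    ceph osd pool set rbd min_size 1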

     
    #9
  10. whitey

    whitey Moderator

    OK, so I just slapped two more s3700 200GB models into each OSD node. I really want to create a new pool; trying to RTFM but not bumping into a quick answer, it all looks like expanding an existing pool.

    Is it as simple as:

    ceph-deploy disk zap osd01:sdc osd01:sdd
    ceph-deploy osd create osd01:sdc osd01:sdd

    ceph-deploy disk zap osd02:sdc osd02:sdd
    ceph-deploy osd create osd02:sdc osd02:sdd

    ceph-deploy disk zap osd03:sdc osd03:sdd
    ceph-deploy osd create osd03:sdc osd03:sdd

    Or does that just grow (and merge the disks into) the existing pool shown in the output of 'ceph osd lspools'?
    Code:
    [cephuser@ceph-admin ceph-deploy]$ ceph osd lspools
    0 rbd,
    
    EDIT: Nope it looks like maybe this:

    ceph osd pool create 'otherswitchesIneedtoinvestigatehereLOL'
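    For reference, the basic form is ceph osd pool create <name> <pg_num> [<pgp_num>]; a sketch using the s3700pool name that comes up a couple of posts later:
    Code:
    # new replicated pool with 128 placement groups
    ceph osd pool create s3700pool 128 128
    ceph osd lspools    # should now list it alongside 'rbd'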
     
    #10
  11. PigLover

    PigLover Moderator

    You don't need to create a new pool. Just add the OSDs and data will be redistributed to use them...

     
    #11
  12. whitey

    whitey Moderator

    This gent covers it pretty well here, it looks like; need to digest more.

    Ceph Storage :: Next Big Thing: How Data Is Stored In CEPH Cluster

    Guess at a high level I am struggling to understand, when I create a new ceph pool (i.e. ceph osd pool create s3700pool 128), how to pin/ensure that the PGs that comprise the new pool ONLY land on those new s3700 devices. I am sure I am over-complicating this lol.
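    For what it's worth, the usual answer is a dedicated CRUSH rule so the pool only maps onto the s3700 OSDs. On Luminous and later that is easy with device classes; on the Jewel cluster shown here you would have to hand-edit the CRUSH map to get the same effect. A sketch of the device-class route, assuming the six new drives come up as osd.3 through osd.8 (which is how they appear later in the thread):
    Code:
    # tag the new s3700 OSDs with their own device class (clear the auto-assigned class first)
    ceph osd crush rm-device-class osd.3 osd.4 osd.5 osd.6 osd.7 osd.8
    ceph osd crush set-device-class s3700 osd.3 osd.4 osd.5 osd.6 osd.7 osd.8
    # CRUSH rule that only chooses OSDs of that class, one replica per host
    ceph osd crush rule create-replicated s3700_rule default host s3700
    # pool bound to that rule, so its PGs only land on the s3700 devices
    ceph osd pool create s3700pool 128 128 replicated s3700_rule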
     
    #12
  13. Patrick

    Patrick Administrator
    Staff Member

    What I have seen several folks do, especially on NVMe drives that can handle higher QD, is to make virtual disks on a drive and use them as Ceph disks. That gives you more OSDs with fewer physical devices. Not something you would want to run in production, but if you just wanted to test performance it may be useful.
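    A rough sketch of that idea with the Jewel-era ceph-deploy tooling, assuming it accepts a partition as the data device (device name and sizes are placeholders, and again this is for test benches only):
    Code:
    # wipe the device, then carve it into two data partitions (run sgdisk on the OSD node itself)
    ceph-deploy disk zap osd01:nvme0n1
    sgdisk -n 0:0:+350G /dev/nvme0n1
    sgdisk -n 0:0:+350G /dev/nvme0n1
    # hand each partition to ceph-deploy as if it were its own disk: two OSDs per device
    ceph-deploy osd create osd01:nvme0n1p1 osd01:nvme0n1p2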
     
    #13
  14. whitey

    whitey Moderator

    Yeah, I was hoping to create/isolate a new pool on just the new devices to see if the devices had anything to do w/ it, but I suppose we have debunked that and said the hell w/ my theory there. So my previous cmds will work, rock on.

    ceph-deploy disk zap
    ceph-deploy osd create

    Executing!!!BOOM
     
    #14
  15. whitey

    whitey Moderator

    Joined:
    Jun 30, 2014
    Messages:
    2,216
    Likes Received:
    676
    MUUUUUHAHAHAHAH!!!

    From this:
    Code:
    [cephuser@ceph-admin ceph-deploy]$ ceph osd tree
    ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
    -1 2.16747 root default
    -2 0.72249     host osd01
     0 0.72249         osd.0       up  1.00000          1.00000
    -3 0.72249     host osd02
     1 0.72249         osd.1       up  1.00000          1.00000
    -4 0.72249     host osd03
     2 0.72249         osd.2       up  1.00000          1.00000
    
    To this:
    Code:
    [cephuser@ceph-admin ceph-deploy]$ ceph osd tree
    ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
    -1 3.22939 root default
    -2 1.07646     host osd01
     0 0.72249         osd.0       up  1.00000          1.00000
     3 0.17699         osd.3       up  1.00000          1.00000
     4 0.17699         osd.4       up  1.00000          1.00000
    -3 1.07646     host osd02
     1 0.72249         osd.1       up  1.00000          1.00000
     5 0.17699         osd.5       up  1.00000          1.00000
     6 0.17699         osd.6       up  1.00000          1.00000
    -4 1.07646     host osd03
     2 0.72249         osd.2       up  1.00000          1.00000
     7 0.17699         osd.7       up  1.00000          1.00000
     8 0.17699         osd.8       up  1.00000          1.00000
    
     
    #15
  16. whitey

    whitey Moderator

    Joined:
    Jun 30, 2014
    Messages:
    2,216
    Likes Received:
    676
    Holy crap, that recovery remap/backfill is FLYING: 630MB/s.

    Re-benchmark time!

    Code:
    [cephuser@ceph-admin ceph-deploy]$ ceph -w
        cluster 31485460-ffba-4b78-b3f8-3c5e4bc686b1
         health HEALTH_WARN
                1 pgs backfill_wait
                1 pgs backfilling
                recovery 1243/51580 objects misplaced (2.410%)
                too few PGs per OSD (14 < min 30)
         monmap e1: 3 mons at {osd01=192.168.2.176:6789/0,osd02=192.168.2.177:6789/0,osd03=192.168.2.178:6789/0}
                election epoch 6, quorum 0,1,2 osd01,osd02,osd03
         osdmap e129: 9 osds: 9 up, 9 in; 1 remapped pgs
                flags sortbitwise,require_jewel_osds
          pgmap v42140: 64 pgs, 1 pools, 101382 MB data, 25390 objects
                199 GB used, 3107 GB / 3306 GB avail
                1243/51580 objects misplaced (2.410%)
                      62 active+clean
                       1 active+remapped+backfilling
                       1 active+remapped+wait_backfill
    recovery io 636 MB/s, 159 objects/s
      client io 999 B/s rd, 999 B/s wr, 0 op/s rd, 1 op/s wr
    
    2017-01-25 19:03:40.896002 mon.0 [INF] pgmap v42139: 64 pgs: 1 active+remapped+wait_backfill, 1 active+remapped+backfilling, 62 active+clean; 101382 MB data, 199 GB used, 3107 GB / 3306 GB avail; 991 B/s rd, 991 B/s wr, 2 op/s; 1243/51580 objects misplaced (2.410%); 631 MB/s, 157 objects/s recovering
    2017-01-25 19:03:41.917934 mon.0 [INF] osdmap e129: 9 osds: 9 up, 9 in
    2017-01-25 19:03:41.921966 mon.0 [INF] pgmap v42140: 64 pgs: 1 active+remapped+wait_backfill, 1 active+remapped+backfilling, 62 active+clean; 101382 MB data, 199 GB used, 3107 GB / 3306 GB avail; 999 B/s rd, 999 B/s wr, 2 op/s; 1243/51580 objects misplaced (2.410%); 636 MB/s, 159 objects/s recovering
    2017-01-25 19:03:42.935860 mon.0 [INF] osdmap e130: 9 osds: 9 up, 9 in
    2017-01-25 19:03:42.936920 mon.0 [INF] pgmap v42141: 64 pgs: 1 peering, 1 active+remapped+backfilling, 62 active+clean; 101382 MB data, 199 GB used, 3106 GB / 3306 GB avail; 782/51187 objects misplaced (1.528%); 266 MB/s, 66 objects/s recovering
    2017-01-25 19:03:42.944929 mon.0 [INF] pgmap v42142: 64 pgs: 1 peering, 1 active+remapped+backfilling, 62 active+clean; 101382 MB data, 199 GB used, 3106 GB / 3306 GB avail; 782/51187 objects misplaced (1.528%); 533 MB/s, 133 objects/s recovering
    2017-01-25 19:03:43.949058 mon.0 [INF] pgmap v42143: 64 pgs: 1 peering, 1 active+remapped+backfilling, 62 active+clean; 101382 MB data, 199 GB used, 3106 GB / 3306 GB avail; 782/51187 objects misplaced (1.528%)
    
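    As an aside, the "too few PGs per OSD (14 < min 30)" warning above is just the original 64-PG rbd pool now spread across 9 OSDs. Bumping pg_num/pgp_num on the pool should clear it; pg_num can only be raised, never lowered, so pick the target carefully. A sketch:
    Code:
    # raise the placement group count on the rbd pool, e.g. to the 256 used as the default in ceph.conf
    ceph osd pool set rbd pg_num 256
    ceph osd pool set rbd pgp_num 256    # pgp_num has to follow pg_num before data rebalances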
     
    #16
  17. whitey

    whitey Moderator

    Joined:
    Jun 30, 2014
    Messages:
    2,216
    Likes Received:
    676
    Boo, abt the same write perf. (SOLVED/UPDATED BELOW.) Perplexed; maybe I need higher-end devices, apparently these s3610s/s3700s ain't what they used to be :-D

    WRITES via sVMotion (FreeNAS NFS 8x hussl pool r10 config to CEPH pool over RBD/LIO iSCSI)
    Code:
    [root@cephgw ~]# vnstat -l -i ens160
    Monitoring ens160...    (press CTRL-C to stop)
    
       rx:   311.18 Mbit/s  3164 p/s          tx:   616.44 Mbit/s  7415 p/s
    
    READS are still EXCELLENT in my book via sVMotion (CEPH RBD/LIO iSCSI to FreeNAS iSCSI zvol)
    Code:
    [root@cephgw ~]# vnstat -l -i ens160
    Monitoring ens160...    (press CTRL-C to stop)
    
       rx:     2.27 Gbit/s 34827 p/s          tx:     2.26 Gbit/s 21284 p/s
    
    WTF moment: before, I was reading from a FreeNAS NFS pool and writing to CEPH...check this...same ZFS pool, but reading from an iSCSI zvol and writing to CEPH...MUCH BETTER

    WRITES via sVMotion (FreeNAS iSCSI zvol cut from 8x hussl r10 pool to CEPH RBD/LIO iSCSI)
    Code:
    [root@cephgw ~]# vnstat -l -i ens160
    Monitoring ens160...    (press CTRL-C to stop)
    
       rx:   932.57 Mbit/s  9508 p/s          tx:   933.93 Mbit/s 11017 p/s
    
    3x the perf and right where I would expect it to be. WIN!!!
     
    #17
    Last edited: Jan 25, 2017
  18. whitey

    whitey Moderator

    Joined:
    Jun 30, 2014
    Messages:
    2,216
    Likes Received:
    676
    So yeah, riddle me that...now this testing has me re-thinking my NFS vs. iSCSI strategy. HAHA
     
    #18
  19. Patriot

    Patriot Moderator

    Joined:
    Apr 18, 2011
    Messages:
    1,062
    Likes Received:
    541
    Ceph has many internal bottlenecks... you either get replication or performance, not both. You can use NVMe drives to boost performance, but they will not be used to their full capabilities without making multiple OSDs per NVMe device, which negates the duplication (do not do this outside of performance testing). Ceph is a massive ball of band-aids... it has promise, but it is also kinda shoddy right now. On the upside, they are very accepting of new devs who want to help make proper fixes to problems.
     
    #19
  20. whitey

    whitey Moderator

    Joined:
    Jun 30, 2014
    Messages:
    2,216
    Likes Received:
    676
    Hey, for a hodge-podge of disks (three 800GB S3610s and six 200GB S3700s) thrown at it, getting 2Gbps read / 1Gbps write is pretty sweet in my book. Taking down a node with VMs running on those iSCSI VMFS volumes (served up through the CEPH/RBD/LIO framework) and having them not even blink makes me smile when I think about the hassle of getting EVEN CLOSE to that on a FreeNAS box w/out significant jumping through hoops.
     
    #20