CEPH write performance pisses me off!


whitey

Moderator
To save myself the typing, I JUST posted this on the CEPH IRC channel on Freenode.

Any thoughts/takers?

CEPH-perf-irc.png
 

whitey

Moderator
3 OSD nodes, each with a single S3610 800GB SSD. Read throughput is great: 2Gbps on the network / 250MB/s to the array. Write... weak sauce, like 300Mbps on the network / 35-40MB/s to the array.

SURELY I should NOT see that type of performance dropoff. I know they were touting Bluestore as 'the next best thing since sliced bread' but hell, this POSIX (FileStore) backend is garbage I guess. I did read to check CPU usage since writes can use it a bit more liberally, but each OSD node's CPU is at 30-40% usage during active read/write operations. Read shows maybe a lil' lower CPU utilization while delivering higher throughput... what gives? hah

CEPH-perf-read.png
CEPH-perf-write.png

Funny how you can watch the replication traffic do its thing at a furious rate. Not sure how I'd correlate the READ case (2Gbps active to 2Gbps rep traffic) with the WRITE case, where the flow back in happens at 300Mbps active and 600Mbps rep traffic (double the active live traffic). The doubling almost makes sense, 'ok, two replica copies to two other OSDs,' but then why was READ replication traffic lockstepped at 2Gbps/2Gbps, if you follow what I am explaining in the screenshots?
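If anyone wants numbers outside of an sVMotion, I can also hammer the pool directly with rados bench; a minimal sketch, assuming the default 'rbd' pool and 30-second runs:

Code:
# write test; keep the objects around so the read tests have data to hit
rados bench -p rbd 30 write --no-cleanup
# sequential and random read tests against those objects
rados bench -p rbd 30 seq
rados bench -p rbd 30 rand
# remove the benchmark objects when done
rados -p rbd cleanup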
 

whitey

Moderator
Interesting. I think you need more drives :)
HATERS GONNA HATE HAHAH j/k bro, I knew that was coming...you know me and my JANKY gear :-D

Pretty impressive read throughput though, right? That is a super real-world sVMotion of a 30GB VM with about 1/2 the vdisk used (15GB or so). Sending off the CEPH cluster w/ those 3 OSDs and single disks FLIES, but write is LAAAAAAAME! :-D

I kinda expected writes to be at least 1/2 as quick as reads; bet if you look at the perf specs (read vs. write) of the S3610s, they are not too far off from each other... the storage software layer is giving me the middle finger w/out a doubt.
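If I want to rule the raw devices out, a quick fio run on one of the OSD filesystems should show what the SSD itself can do; a rough sketch, assuming the usual /var/lib/ceph/osd/ceph-0 mount point (and remembering to delete the scratch file it leaves behind):

Code:
# 1G sequential write then read-back, direct I/O, on the OSD's own filesystem
fio --name=ssdtest --directory=/var/lib/ceph/osd/ceph-0 --rw=write --bs=1M --size=1G --direct=1 --end_fsync=1
fio --name=ssdtest --directory=/var/lib/ceph/osd/ceph-0 --rw=read --bs=1M --size=1G --direct=1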
 

whitey

Moderator
Predictions: if I add another CEPH pool comprised of two S3700 200GB devices in each OSD node, will this go up/down/stay about the same? About to do this here in the next hour or so.
 

whitey

Moderator
read is 250MB a sec, write is 35/40MB sec.
Sounds about right.

I’m simplifying a bit but:
Write is /3 for the standard triple redundancy
And then /2 for the journaling
So 250MB/3 = 83MB /2 = 40MB
Why is READ not penalized nearly as much as write in CEPH? Maybe I am naive to a simple/fundamental core concept here. Replicas/PGs/CRUSH leaf type w/in ceph.conf maybe? If this is 'all she's got, Scotty' for my config and these are reasonable performance numbers, I can deal I guess; when I had 15 or so VMs on it, it seemed very snappy/responsive.

My current ceph.conf
Code:
[cephuser@ceph-admin ceph-deploy]$ cat ceph.conf
[global]
fsid = 31485460-ffba-4b78-b3f8-3c5e4bc686b1
mon_initial_members = osd01, osd02, osd03
mon_host = 192.168.2.176,192.168.2.177,192.168.2.178
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
public_network = 192.168.2.0/24
cluster_network = 192.168.111.0/24
osd_pool_default_size = 2 # Write an object 2 times
osd_pool_default_min_size = 1 # Allow writing 1 copy in a degraded state
osd_pool_default_pg_num = 256
osd_pool_default_pgp_num = 256
osd_crush_chooseleaf_type = 1
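Worth remembering those osd_pool_default_* values only take effect when a pool is created; a quick sanity-check sketch of what the existing rbd pool is actually running with:

Code:
ceph osd pool get rbd size
ceph osd pool get rbd min_size
ceph osd pool get rbd pg_num
If size comes back 3 rather than the 2 in the conf, the /3 math above lines up.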
 

PigLover

Moderator
In order to read from ceph you need an answer from exactly one copy of the data.

To do a write you need to complete the write to each copy's journal - the rest can proceed asynchronously. So writes should be ~1/3 the speed of your reads, but in practice they are slower than that.

You can improve this somewhat by setting 'required' writes to a lower value, which allows your write to be acknowledged before all three copies have been journaled.

You can speed this up further with more OSDs in the pool, which allows more workers to operate asynchronously after the required writes are complete.

 

whitey

Moderator
OK so I just slapped two more S3700 200GB models into each OSD node. I really want to create a new pool; trying to RTFM but not bumping into a quick answer, it all looks like expanding an existing pool.

Is it as simple as:

ceph-deploy disk zap osd01:sdc osd01:sdd
ceph-deploy osd create osd01:sdc osd01:sdd

ceph-deploy disk zap osd02:sdc osd02:sdd
ceph-deploy osd create osd02:sdc osd02:sdd

ceph-deploy disk zap osd03:sdc osd03:sdd
ceph-deploy osd create osd03:sdc osd03:sdd

Or does that just grow (and merge disks into) the existing pool shown by 'ceph osd lspools'?
Code:
[cephuser@ceph-admin ceph-deploy]$ ceph osd lspools
0 rbd,
EDIT: Nope it looks like maybe this:

ceph osd pool create 'otherswitchesIneedtoinvestigatehereLOL'
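For the record, the basic syntax looks to be just a pool name plus PG counts; a minimal sketch (the 128 is only my guess at a sane PG count for a small pool):

Code:
ceph osd pool create s3700pool 128 128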
 

PigLover

Moderator
You don't need to create a new pool. Just add the OSDs and data will be redistributed to use them...

 

whitey

Moderator
This gent covers it pretty well here, it looks like; need to digest more.

Ceph Storage :: Next Big Thing: How Data Is Stored In CEPH Cluster

Guess at a high level I am struggling to understand, when I create a new ceph pool (IE: ceph osd pool create s3700pool 128),
how to pin/ensure the new PGs that comprise the new pool ONLY come from those new S3700 devices. I am sure I am over-complicating this lol. See the sketch below.
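If I do end up going that route, the pre-Luminous recipe seems to be a second CRUSH root plus a rule that only chooses leaves under it; a rough sketch, where the bucket/rule names, weights, and the assumption that the new S3700s are osd.3-osd.8 are all mine:

Code:
# separate hierarchy just for the S3700 OSDs
ceph osd crush add-bucket s3700 root
ceph osd crush add-bucket osd01-s3700 host
ceph osd crush move osd01-s3700 root=s3700
ceph osd crush create-or-move osd.3 0.177 root=s3700 host=osd01-s3700
ceph osd crush create-or-move osd.4 0.177 root=s3700 host=osd01-s3700
# ...repeat the host bucket + moves for osd02 (osd.5/6) and osd03 (osd.7/8)...

# rule that replicates across hosts under the new root, then point the new pool at it
ceph osd crush rule create-simple s3700rule s3700 host
ceph osd pool create s3700pool 128 128
ceph osd pool set s3700pool crush_ruleset 1    # rule id comes from 'ceph osd crush rule dump'
Note that pulling OSDs out of the default root will kick off a rebalance of the existing rbd pool.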
 

Patrick

Administrator
Staff member
What I have seen several folks do, especially on NVMe drives that can handle higher QD, is to make virtual disks on a drive and use them as Ceph disks. That gives you more OSDs with fewer physical devices. Not something you would want to run in production, but if you just wanted to test performance it may be useful.
 

whitey

Moderator
You don't need to create a new pool. Just add the OSDs and data will be redistributed to use them...

Sent from my SM-G925V using Tapatalk
Yeah, I was hoping to create/isolate a new pool of the new devices just to see if the devices had anything to do w/ it, but I suppose we have debunked that and said the hell w/ my theory there. So my previous cmds will work, rock on.

ceph-deploy disk zap
ceph-deploy osd create

Executing!!!BOOM
 

whitey

Moderator
MUUUUUHAHAHAHAH!!!

From this:
Code:
[cephuser@ceph-admin ceph-deploy]$ ceph osd tree
ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 2.16747 root default
-2 0.72249     host osd01
 0 0.72249         osd.0       up  1.00000          1.00000
-3 0.72249     host osd02
 1 0.72249         osd.1       up  1.00000          1.00000
-4 0.72249     host osd03
 2 0.72249         osd.2       up  1.00000          1.00000
To this:
Code:
[cephuser@ceph-admin ceph-deploy]$ ceph osd tree
ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 3.22939 root default
-2 1.07646     host osd01
 0 0.72249         osd.0       up  1.00000          1.00000
 3 0.17699         osd.3       up  1.00000          1.00000
 4 0.17699         osd.4       up  1.00000          1.00000
-3 1.07646     host osd02
 1 0.72249         osd.1       up  1.00000          1.00000
 5 0.17699         osd.5       up  1.00000          1.00000
 6 0.17699         osd.6       up  1.00000          1.00000
-4 1.07646     host osd03
 2 0.72249         osd.2       up  1.00000          1.00000
 7 0.17699         osd.7       up  1.00000          1.00000
 8 0.17699         osd.8       up  1.00000          1.00000
 

whitey

Moderator
Holy crap, that recovery remap/backfill is FLYING: 630MB/s

Re-benchmark time!

Code:
[cephuser@ceph-admin ceph-deploy]$ ceph -w
    cluster 31485460-ffba-4b78-b3f8-3c5e4bc686b1
     health HEALTH_WARN
            1 pgs backfill_wait
            1 pgs backfilling
            recovery 1243/51580 objects misplaced (2.410%)
            too few PGs per OSD (14 < min 30)
     monmap e1: 3 mons at {osd01=192.168.2.176:6789/0,osd02=192.168.2.177:6789/0,osd03=192.168.2.178:6789/0}
            election epoch 6, quorum 0,1,2 osd01,osd02,osd03
     osdmap e129: 9 osds: 9 up, 9 in; 1 remapped pgs
            flags sortbitwise,require_jewel_osds
      pgmap v42140: 64 pgs, 1 pools, 101382 MB data, 25390 objects
            199 GB used, 3107 GB / 3306 GB avail
            1243/51580 objects misplaced (2.410%)
                  62 active+clean
                   1 active+remapped+backfilling
                   1 active+remapped+wait_backfill
recovery io 636 MB/s, 159 objects/s
  client io 999 B/s rd, 999 B/s wr, 0 op/s rd, 1 op/s wr

2017-01-25 19:03:40.896002 mon.0 [INF] pgmap v42139: 64 pgs: 1 active+remapped+wait_backfill, 1 active+remapped+backfilling, 62 active+clean; 101382 MB data, 199 GB used, 3107 GB / 3306 GB avail; 991 B/s rd, 991 B/s wr, 2 op/s; 1243/51580 objects misplaced (2.410%); 631 MB/s, 157 objects/s recovering
2017-01-25 19:03:41.917934 mon.0 [INF] osdmap e129: 9 osds: 9 up, 9 in
2017-01-25 19:03:41.921966 mon.0 [INF] pgmap v42140: 64 pgs: 1 active+remapped+wait_backfill, 1 active+remapped+backfilling, 62 active+clean; 101382 MB data, 199 GB used, 3107 GB / 3306 GB avail; 999 B/s rd, 999 B/s wr, 2 op/s; 1243/51580 objects misplaced (2.410%); 636 MB/s, 159 objects/s recovering
2017-01-25 19:03:42.935860 mon.0 [INF] osdmap e130: 9 osds: 9 up, 9 in
2017-01-25 19:03:42.936920 mon.0 [INF] pgmap v42141: 64 pgs: 1 peering, 1 active+remapped+backfilling, 62 active+clean; 101382 MB data, 199 GB used, 3106 GB / 3306 GB avail; 782/51187 objects misplaced (1.528%); 266 MB/s, 66 objects/s recovering
2017-01-25 19:03:42.944929 mon.0 [INF] pgmap v42142: 64 pgs: 1 peering, 1 active+remapped+backfilling, 62 active+clean; 101382 MB data, 199 GB used, 3106 GB / 3306 GB avail; 782/51187 objects misplaced (1.528%); 533 MB/s, 133 objects/s recovering
2017-01-25 19:03:43.949058 mon.0 [INF] pgmap v42143: 64 pgs: 1 peering, 1 active+remapped+backfilling, 62 active+clean; 101382 MB data, 199 GB used, 3106 GB / 3306 GB avail; 782/51187 objects misplaced (1.528%)
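Side note: that 'too few PGs per OSD (14 < min 30)' warning is the pool still sitting at 64 PGs now that there are 9 OSDs; bumping it would look roughly like this (256 is just my guess at a sane target, and pg_num can only ever be raised, never lowered):

Code:
ceph osd pool set rbd pg_num 256
ceph osd pool set rbd pgp_num 256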
 

whitey

Moderator
Boo, abt the same write perf. SOLVED/UPDATED BELOW. Perplexed; thought I needed higher-end devices, apparently these S3610s/S3700s ain't what they used to be :-D

WRITES via sVMotion (FreeNAS NFS 8x hussl pool r10 config to CEPH pool over RBD/LIO iSCSI)
Code:
[root@cephgw ~]# vnstat -l -i ens160
Monitoring ens160...    (press CTRL-C to stop)

   rx:   311.18 Mbit/s  3164 p/s          tx:   616.44 Mbit/s  7415 p/s
READS are still EXCELLENT in my book via sVMotion (CEPH RBD/LIO iSCSI to FreeNAS iSCSI zvol)
Code:
[root@cephgw ~]# vnstat -l -i ens160
Monitoring ens160...    (press CTRL-C to stop)

   rx:     2.27 Gbit/s 34827 p/s          tx:     2.26 Gbit/s 21284 p/s
WTF moment: before, I was reading from a FreeNAS NFS pool and writing to CEPH... check this... same ZFS pool, but reading from an iSCSI zvol and writing to CEPH... MUCH BETTER

WRITES via sVMotion (FreeNAS iSCSI zvol cut from 8x hussl r10 pool to CEPH RBD/LIO iSCSI)
Code:
[root@cephgw ~]# vnstat -l -i ens160
Monitoring ens160...    (press CTRL-C to stop)

   rx:   932.57 Mbit/s  9508 p/s          tx:   933.93 Mbit/s 11017 p/s
3x the perf and right where I would expect it to be. WIN!!!
 

whitey

Moderator
So yeah, riddle me that... now this testing has me re-thinking my NFS vs. iSCSI strategy. HAHA
 

Patriot

Moderator
Ceph has many internal bottlenecks... You either get replication or performance, not both. You can use NVMe drives to boost performance, but they will not be used to their capabilities without making multiple OSDs per NVMe device, which undermines the redundancy (do not do this outside of performance testing). Ceph is a massive ball of bandaids... It has promise but it is also kinda shoddy right now... On the upside, they are very accepting of new devs to help make proper fixes to problems.
 

whitey

Moderator
Hey, for a hodge-podge of disks (three 800GB S3610s and six 200GB S3700s) thrown at it, getting 2Gbps read / 1Gbps write is pretty sweet in my book. Taking down a node with VMs running on those iSCSI VMFS volumes (served up through the CEPH/RBD/LIO framework) and having them not even blink makes me smile when I think about the hassle of getting EVEN CLOSE to that on a FreeNAS box w/out significant jumping through hoops.