Ceph IOPS


Bjorn Smith

Well-Known Member
Sep 3, 2019
r00t.dk
Hi,

So I have gotten my new cluster up and running based on the nice Fujitsu TX1320 M3 - for now only 3 nodes, each running 2 OSDs.

This works nicely and I get an average 1 ms read/write latency according to the Ceph dashboard statistics, which is good enough for me - so I am wondering what benefit, besides more resilience, I would get from deploying another 2 nodes with 2 OSDs each.

The alternative is to put 2 more OSDs into each of the 3 existing nodes.

This should give me a little more write bandwidth and the same scalability - and, not to mention, the power consumption savings of not deploying another two nodes.

So - I guess it's a no-brainer?

For a homelab, 3 nodes is "safe" enough, right?

I do have backups of the important stuff, i.e. the VMs running on Ceph, and the rest of the data can be recreated from configuration.

As you can see, my cluster is not very busy:

[screenshot: Ceph dashboard showing the mostly idle cluster]
 

iGene

Member
Jun 15, 2014
Taiwan
3 nodes should be safe enough if you use replica 3 and min_size 2.

Adding more nodes would be beneficial, but from your load I would say you probably won't see a difference.
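
For reference, a quick way to check (and adjust) those settings on an existing pool would be something along these lines - the pool name here is just an example:

Bash:
# check the current replication settings on a pool
ceph osd pool get yourpool size
ceph osd pool get yourpool min_size
# and set them if needed
ceph osd pool set yourpool size 3
ceph osd pool set yourpool min_size 2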
 

Bjorn Smith

Well-Known Member
Sep 3, 2019
r00t.dk
I think I already have replica 3, but I am not sure about min_size 2

Code:
[global]
     auth_client_required = cephx
     auth_cluster_required = cephx
     auth_service_required = cephx
     cluster_network = 10.100.210.0/16
     fsid = da5cbdc2-5c9b-48ab-908a-f03d6b2e6024
     mon_allow_pool_delete = true
     mon_cluster_log_file_level = info
     mon_host = 192.168.210.10 192.168.210.11 192.168.210.12
     ms_bind_ipv4 = true
     ms_bind_ipv6 = false
     osd_pool_default_min_size = 2
     osd_pool_default_size = 3
     public_network = 192.168.210.0/16

[client]
     keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
     keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.pve10]
     host = pve10
     mds_standby_for_name = pve

[mds.pve11]
     host = pve11
     mds_standby_for_name = pve

[mds.pve12]
     host = pve12
     mds_standby_for_name = pve

[mon.pve10]
     cluster_addr = 10.100.210.10
     public_addr = 192.168.210.10

[mon.pve11]
     cluster_addr = 10.100.210.11
     public_addr = 192.168.210.11

[mon.pve12]
     cluster_addr = 10.100.210.12
     public_addr = 192.168.210.12
Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host pve10 {
    id -3        # do not change unnecessarily
    id -4 class ssd        # do not change unnecessarily
    # weight 1.74658
    alg straw2
    hash 0    # rjenkins1
    item osd.0 weight 0.87329
    item osd.1 weight 0.87329
}
host pve11 {
    id -5        # do not change unnecessarily
    id -6 class ssd        # do not change unnecessarily
    # weight 1.74658
    alg straw2
    hash 0    # rjenkins1
    item osd.2 weight 0.87329
    item osd.3 weight 0.87329
}
host pve12 {
    id -7        # do not change unnecessarily
    id -8 class ssd        # do not change unnecessarily
    # weight 1.74658
    alg straw2
    hash 0    # rjenkins1
    item osd.4 weight 0.87329
    item osd.5 weight 0.87329
}
root default {
    id -1        # do not change unnecessarily
    id -2 class ssd        # do not change unnecessarily
    # weight 5.23975
    alg straw2
    hash 0    # rjenkins1
    item pve10 weight 1.74658
    item pve11 weight 1.74658
    item pve12 weight 1.74658
}

# rules
rule replicated_rule {
    id 0
    type replicated
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map
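
For reference, in case anyone wants to pull the same dump from their own cluster, the decompiled crush map above can be obtained with something like:

Code:
# export the compiled crush map and decompile it to text
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt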
 

ano

Well-Known Member
Nov 7, 2022
I got my home ceph lab online today as well!

5 nodes so far; the plan is to swap drives/OSDs around so I can get up to 8 nodes and test how it affects performance and latency. Write amplification is real - seeing 30 Gbps max so far. I wanted enough devices to be able to do erasure coding for testing.
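
When I get to the erasure-coding tests, setting up a profile and pool will look roughly like this - the k/m values and the names are just examples:

Code:
# EC profile with 4 data + 2 coding chunks, host as failure domain
ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
# erasure-coded pool using that profile
ceph osd pool create ecpool 32 32 erasure ec42
# needed if RBD/CephFS should write to it
ceph osd pool set ecpool allow_ec_overwrites true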
 

Bjorn Smith

Well-Known Member
Sep 3, 2019
r00t.dk
I got my home ceph lab online today as well!

5 nodes so far; the plan is to swap drives/OSDs around so I can get up to 8 nodes and test how it affects performance and latency. Write amplification is real - seeing 30 Gbps max so far. I wanted enough devices to be able to do erasure coding for testing.
Nice - you are not worried about power consumption with 8 nodes?

Right now my lab uses around 200 W, which is a lot for me, considering the power cost here in Denmark.
 

ano

Well-Known Member
Nov 7, 2022
Nice - you are not worried about power consumption with 8 nodes?

Right now my lab uses around 200 W, which is a lot for me, considering the power cost here in Denmark.
Very! We have gone from free power to massive cost - a rack costs thousands per month :eek:

But it's just a lab for now, so it won't run 24/7, and for "homeprod" I'll probably go down to 5 nodes, maybe 3? We will see. I need to heat an entire floor as well, so the power won't be wasted for most of the year.
 

Bjorn Smith

Well-Known Member
Sep 3, 2019
r00t.dk
Bash:
root@pve10:~# ceph osd pool get vmdata size
size: 3
root@pve10:~# ceph osd pool get vmdata size|min_size
root@pve10:~# ceph osd pool get vmdata min_size
min_size: 2
So I guess I'm good.
 

ano

Well-Known Member
Nov 7, 2022
How are your fio/rados benchmarks looking?

I will introduce watts vs MB/s for rados: I'm at 4.2 MB/s per watt in rados now for sequential writes ;) and about 3.3 MB/s per watt with fio for sequential writes (spinners with NVMe cache).
 

Bjorn Smith

Well-Known Member
Sep 3, 2019
r00t.dk
Code:
root@pve10:~# rados bench -p testbench 10 write -t 4 --run-name client1
hints = 1
Maintaining 4 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_pve10_1355305
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1       4       102        98   391.938       392   0.0289584   0.0374568
    2       4       186       182   363.912       336   0.0273233   0.0365849
    3       4       274       270   359.887       352   0.0362456   0.0359277
    4       4       365       361     360.9       364   0.0406281   0.0440994
    5       4       471       467   373.506       424   0.0449998   0.0426409
    6       4       575       571   380.576       416   0.0459591   0.0417564
    7       4       668       664   379.343       372   0.0321925   0.0413815
    8       4       768       764   381.917       400   0.0493542   0.0416405
    9       4       860       856   380.363       368   0.0719468   0.0419258
   10       4       962       958    383.12       408    0.060831   0.0417053
Total time run:         10.0353
Total writes made:      962
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     383.445
Stddev Bandwidth:       29.1387
Max bandwidth (MB/sec): 424
Min bandwidth (MB/sec): 336
Average IOPS:           95
Stddev IOPS:            7.28469
Max IOPS:               106
Min IOPS:               84
Average Latency(s):     0.0416811
Stddev Latency(s):      0.100085
Max latency(s):         3.09271
Min latency(s):         0.0217051
I am okay with the MB/s - though it's much less than what I would expect considering I have 4 OSDs per node and 3 nodes - and it's all SATA SSDs.

But I don't understand the IOPS - why they are so low.

I haven't done a fio test from a VM, perhaps I should do that, since that is what really matters.

But it could also be that it's simply the lack of CPU cores on my machines that is causing this.

Testing from a machine with 12 cores that is not part of the ceph cluster gives a different result:

Code:
root@pve3:~# rados bench -p testbench 10 write -t 12 --run-name client1 -c /etc/pve/ceph.conf
hints = 1
Maintaining 12 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_pve3_4092321
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      12       172       160   639.927       640   0.0865981   0.0741933
    2      12       337       325     649.9       660   0.0454082   0.0726116
    3      12       506       494   658.556       676   0.0511371   0.0720991
    4      12       675       663   662.883       676    0.072972   0.0714694
    5      12       854       842   673.476       716   0.0350671   0.0708232
    6      12      1028      1016   677.207       696   0.0475928   0.0705947
    7      12      1180      1168   667.302       608   0.0777169    0.071243
    8      12      1340      1328   663.874       640   0.0651854     0.07184
    9      12      1506      1494   663.874       664   0.0441369   0.0718776
   10      12      1669      1657   662.673       652   0.0797938   0.0721067
Total time run:         10.0425
Total writes made:      1669
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     664.776
Stddev Bandwidth:       30.6406
Max bandwidth (MB/sec): 716
Min bandwidth (MB/sec): 608
Average IOPS:           166
Stddev IOPS:            7.66014
Max IOPS:               179
Min IOPS:               152
Average Latency(s):     0.0721159
Stddev Latency(s):      0.0242251
Max latency(s):         0.253595
Min latency(s):         0.0220415
Cleaning up (deleting benchmark objects)
Removed 1669 objects
Clean up completed and total clean up time :0.380883
Better bandwidth, but the IOPS are still shit considering it's an SSD-backed pool.

So I guess I will be severely constrained by the lack of cores on the ceph machines, since they are also running the VMs.

That is unfortunate.
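
As for the fio test from a VM, a 4k random-write run along these lines would probably be a reasonable starting point - the file path, size and runtime are just example values:

Bash:
# 4k random writes, direct I/O, queue depth 32, 60 seconds
fio --name=4k-randwrite --filename=/root/fio-test.bin --size=4G \
    --rw=randwrite --bs=4k --iodepth=32 --ioengine=libaio \
    --direct=1 --runtime=60 --time_based --group_reporting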
 

Bjorn Smith

Well-Known Member
Sep 3, 2019
r00t.dk
Yes they should all be enterprise disks.

Toshiba HK4R
and a single Intel S4610

So, hdparm -W 0 on all disks.

Just tested, and it makes no real difference unfortunately.

Of course I should have tested each disk individually before doing any pools, just to see what I could get from the raw disks.

Ceph has added this info:

[screenshot: per-OSD IOPS figures shown by Ceph]

How it got those numbers I don't know - either it has benchmarked the disks and stored a number, or they are some standard defaults - but if those are the IOPS of the individual disks, I cannot see why I get such low numbers.
200 IOPS is pretty low - but it might just be the nature of it all because I only have 3 nodes with 4 cores each.
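
If you want to see what Ceph itself thinks each OSD can do, there is a built-in per-OSD write benchmark; something like the following (osd.0 is just an example, and the arguments are total bytes and block size):

Bash:
# write ~100 MiB in 4 KiB blocks directly against one OSD
ceph tell osd.0 bench 104857600 4096
# if the dashboard numbers come from the mclock capacity estimates,
# they should show up here (option name assumes a recent Ceph release)
ceph config show osd.0 osd_mclock_max_capacity_iops_ssd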
 

ano

Well-Known Member
Nov 7, 2022
That test uses large blocks, so few IOPS, but yes, it should be better - guessing the network kills you?

To test IOPS, run something like rados bench -p yourpoolname 30 write -b 4K (and a rand run afterwards, if you keep the objects with --no-cleanup).

The number of PGs also affects it.

My cluster can do about 8 to 9 million IOPS depending on how I configure it, and my average load is like... 16k IOPS :p - needs some tuning.
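
For reference, checking the PG count and what the autoscaler thinks of it would look something like this (pool name is just an example):

Code:
# current placement group count for the pool
ceph osd pool get testbench pg_num
# autoscaler recommendations (if the pg_autoscaler module is enabled)
ceph osd pool autoscale-status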
 

Bjorn Smith

Well-Known Member
Sep 3, 2019
r00t.dk
Tested with 4K blocks and it's much better - just the write speed is low - but I guess that is normal :)

Small blocks, high IOPS; big blocks, low IOPS.

Code:
rados bench -p testbench 10 write -t 12 seq -b 4K  --run-name client1 -c /etc/pve/ceph.conf
hints = 1
Maintaining 12 concurrent writes of 4096 bytes to objects of size 4096 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_pve3_58903
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      12      4638      4626   18.0684   18.0703  0.00353701  0.00258628
    2      12      9921      9909   19.3511   20.6367  0.00206252  0.00241419
    3      12     15106     15094   19.6508   20.2539  0.00340544  0.00237885
    4      12     19671     19659   19.1953    17.832  0.00191694  0.00243463
    5      12     25035     25023   19.5461   20.9531  0.00182545  0.00239208
    6      12     30268     30256   19.6947   20.4414  0.00197988  0.00237406
    7      11     35638     35627   19.8779   20.9805  0.00128871  0.00235229
    8      12     40867     40855   19.9455   20.4219  0.00187551  0.00234416
    9      12     45867     45855    19.899   19.5312   0.0065553  0.00234933
   10       1     50479     50478   19.7147   18.0586  0.00568559  0.00237174
Total time run:         10.0018
Total writes made:      50479
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     19.7148
Stddev Bandwidth:       1.26182
Max bandwidth (MB/sec): 20.9805
Min bandwidth (MB/sec): 17.832
Average IOPS:           5046
Stddev IOPS:            323.025
Max IOPS:               5371
Min IOPS:               4565
Average Latency(s):     0.00237175
Stddev Latency(s):      0.00106038
Max latency(s):         0.0240364
Min latency(s):         0.000887746
Cleaning up (deleting benchmark objects)
Removed 50479 objects
Clean up completed and total clean up time :8.56148
Random gives similar numbers - so I guess everything is working fine :) I get about 1/3 of the IOPS of a single disk, which I guess is good enough for random 4k writes, which is what most applications do.
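
Doing the math, the runs are also consistent with each other: 19.7 MB/s at a 4 KiB block size works out to roughly 19.7 × 1024 / 4 ≈ 5,000 IOPS, while 664 MB/s at 4 MiB is 664 / 4 ≈ 166 IOPS - so the "low" IOPS in the 4M runs is purely a consequence of the block size.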
 

gb00s

Well-Known Member
Jul 25, 2018
Poland
I think it's time to become 'uncool' and to test GlusterFS on ZFS for a home storage cluster. These numbers are not worth the money.
 

Bjorn Smith

Well-Known Member
Sep 3, 2019
r00t.dk
I think it's time to become 'uncool' and to test GlusterFS on ZFS for a home storage cluster. These numbers are not worth the money.
I agree that with a 4M block size it's very low IOPS - but with 4k I think it's pretty good.

It's not NVMe, it's SATA SSDs.

And they are also tiny machines in terms of CPU/memory.
 

ano

Well-Known Member
Nov 7, 2022
Also remember to run rados bench from all your nodes and check latency, then aggregate the benchmarks - there is a reason why nobody publishes pure rados bench numbers, and why replica 2 gets run a lot... ;)

Ceph under load from multiple clients in real life, when acting as say an S3 object store etc., works out quite well though.

The only good thing about Ceph is that it makes you appreciate ZFS more for performance.

I've been able to push ZFS to 18 GB/s for 128k random writes :) and 100k+ IOPS at 4k (random as well).

If you want, I can probably give you access to my lab - I have 60x 18 TB (new) drives for OSDs and 16x 7.68 TB Gen4 NVMe Kioxia in my Supermicro lab gear now, spread across up to 8 chassis, all 7402, 8x 32 GB DDR4 and 2x 100 Gbps.
 

Bjorn Smith

Well-Known Member
Sep 3, 2019
r00t.dk
The only good thing about Ceph is that it makes you appreciate ZFS more for performance.
True, but ZFS is a SPOF :) That's why I stopped with ZFS - I want to be able to shut any node down without having to migrate or shut down VMs.

To be honest, I don't need 100k IOPS - it's a homelab/homeprod, not an enterprise lab :)
 

Bjorn Smith

Well-Known Member
Sep 3, 2019
r00t.dk
Then test GlusterFS with ZFS if you appreciate ZFS with HA.
If I had an identical cluster it would be nice to try - but I have only just gotten this one done, and I am not about to rip it all apart for potential "gains" - perhaps next time :)
 

Bjorn Smith

Well-Known Member
Sep 3, 2019
r00t.dk
I read about GlusterFS - it looks nice. From what I understand, it runs on top of normal filesystems, so it can basically replicate files across multiple nodes and present them as a single filesystem.

And at any time you can access one of the nodes and see the files, since it's just a normal filesystem.

Unless you use the feature where a file is chopped up (e.g. dispersed volumes).

I think I want to test this in a couple of VMs just to get a feel for how it works :)

It might be what I really want :)

And I also like this teaser:
[transport [tcp |rdma | tcp,rdma]]
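
A quick sketch of what that test could look like across three VMs - the hostnames, brick paths and volume name are just examples:

Code:
# on gluster1: add the other nodes to the trusted pool
gluster peer probe gluster2
gluster peer probe gluster3

# 3-way replicated volume, one brick per node
gluster volume create gv0 replica 3 \
    gluster1:/data/brick1/gv0 \
    gluster2:/data/brick1/gv0 \
    gluster3:/data/brick1/gv0
gluster volume start gv0

# mount it from any client (tcp is the default transport)
mount -t glusterfs gluster1:/gv0 /mnt/gv0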
 