Ceph Benchmark request /NVME

Discussion in 'Linux Admins, Storage and Virtualization' started by Rand__, Feb 22, 2019.

  1. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,592
    Likes Received:
    544
    Hi,

In my everlasting search for the holy grail (high performance in QD1T1 setups) I recently stumbled upon Ceph again.
I then saw that there is a new release around the corner, started to look a bit closer, and thought it might be promising.
But before I start investing countless hours in the next disappointing solution, I thought I'd tap the community's wisdom for some initial answers.

So could somebody (ideally with an all-NVMe setup, assuming that's the ultimate single-user performance option) be so kind as to provide a Q1T1 CDM screenshot (and briefly list the setup)?

    Thanks a mil.
     
    #1
  2. T_Minus

    T_Minus Moderator

    Joined:
    Feb 15, 2015
    Messages:
    6,838
    Likes Received:
    1,493
Did CDM actually release a new version that works for NVMe and Optane?

I just ran it yesterday and the results were all over the place, and wildly inaccurate.
     
    #2
  3. Rand__

    Rand__ Well-Known Member

    Never even knew there were issues ;)
    I attributed all the bad performance I saw to vSAN :p

    Of course I am also happy to take fio, dd, or anything else you can provide :D
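For anyone willing to run fio instead, a job file along these lines should approximate CDM's 4K Q1T1 random-read test. This is only a sketch: the `filename` path and `size` are placeholders to be pointed at the actual device or test file.

```ini
; approximates a CDM-style 4K Q1T1 random-read run (assumed parameters)
[qd1t1-randread]
ioengine=libaio
direct=1
rw=randread
bs=4k
iodepth=1
numjobs=1
size=4G
runtime=30
time_based
filename=/path/to/testfile
```

Swapping `rw=randread` for `rw=randwrite` gives the matching write-side number.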
     
    #3
    Last edited: Feb 22, 2019
  4. T_Minus

    T_Minus Moderator


    I don't know what version I'm using; I can check. It's on my test-bench system, so I rarely ever upgrade it.

    I've seen 300 MB/s+ on 4K Q1T1 from drives that don't come close to that.

    I've seen Optane bounce from 75 MB/s to 105 MB/s: re-test it over and over and always get 75, then test again and it jumps to 100+.

    It's not an accurate representation of these drives' performance.

    I use it only to make sure there are no drastic problems, to put the drives through their paces, and to re-check SMART info.

    I would not use CDM with Optane or NVMe as a performance-level indicator.

    The version of Anvil Pro I have also gives wildly varying IOPS for these drives... as in, from 90,000 to 300,000+ across tests. There's no way any of these drives would ever hit 300k+ without some sort of caching, etc., going on.


    I would use IOMETER to figure out what you can get out of your setup.
     
    #4
  5. Rand__

    Rand__ Well-Known Member

    I tend to use some 5.x version of CDM, and it's at least an indicator.
    I attributed the wild fluctuations to power-saving measures... have you ruled those out?
     
    #5
    T_Minus likes this.
  6. T_Minus

    T_Minus Moderator

    Hmmm... why would power saving cause fluctuations in excess of 200,000 IOPS for drives that cannot reach that performance at any queue depth or thread count, no matter what? (There's a 300 MB/s to 950 MB/s variance too.)

    As long as you know it's not accurate, that's all that matters, really :)
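As a sanity check on figures like these: throughput and IOPS are tied together by the block size (IOPS = bytes per second ÷ block size), so the quoted 4K throughput swing maps directly onto an IOPS swing.

```shell
# rough conversion of the 4K throughput figures quoted above into IOPS
# (IOPS = bytes per second / block size; MB taken as 10^6 bytes here)
for mbps in 300 950; do
  awk -v m="$mbps" 'BEGIN { printf "%d MB/s at 4 KiB ~ %d IOPS\n", m, m * 1000000 / 4096 }'
done
```

So a 300-to-950 MB/s swing at 4K is roughly a 160k IOPS swing, which no cache-free device explains.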
     
    #6
  7. T_Minus

    T_Minus Moderator

    Oh, and FWIW I'm testing on Win7. I had to install the NVMe drivers to get any of them to even work ;)

    What's really crazy is the performance variance between the 'original' hotfix NVMe driver for Win7 and the 'latest gen' drivers...
    In my tests the original drivers had much better random writes, whereas the new ones have better reads.
    Not scientific by any means, just an observation.
     
    #7
  8. Rand__

    Rand__ Well-Known Member

    Yeah, I agree that too-high performance shouldn't be caused by that (caching?), but the fluctuations might be...
     
    #8
  9. Rand__

    Rand__ Well-Known Member

    OK, I only run it on 2012 since that's the default VM I deploy (I've got plenty of licenses). The only major issue was the S&M patches (and of course the 1.2.15 ESXi NVMe drivers, which wreaked havoc on performance).
     
    #9
  10. T_Minus

    T_Minus Moderator

    Ah ok.

    I just got done testing what I thought was a NIB 400GB P3700... turns out it had 2.75 PB written.

    I think that's my most-used drive to date, except at this point I don't know if I did that over the last 3 years of misc. usage (doubtful) or if I actually purchased one NIB 400GB and one used :/ Seems odd.
     
    #10
  11. Rand__

    Rand__ Well-Known Member

    Well, we have seen many 'deals' on new drives that were not new... if you didn't test it when you got it, that might very well be the case...
     
    #11
  12. Rand__

    Rand__ Well-Known Member

    So I set up a test system: single host, 3 Optanes for 3 OSDs, a single monitor, plus admin and client VMs.
    All boxes have 4 cores (of a dual 5115, 10 x 2.4 GHz, box), 16 GB RAM, and ESX-internal networking only.
    The only 'slow' thing on that box is the VM datastore, which is an old Intel SSD. Not sure whether the mon sitting on a slower disk would impact performance?

    I used defaults everywhere (primarily for lack of knowing better), so no performance options, no multiple OSDs per NVMe as suggested, etc.
    Note I am looking to maximize performance for a single user / single thread, so most optimizations targeting parallelism won't help much for my use case.

    Rados benches also using defaults:
    Code:
    [cephuser@ceph1 ~]$ rados bench -p scbench 10 write --no-cleanup
    hints = 1
    Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
    Object prefix: benchmark_data_ceph1_19616
      sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
        0       0         0         0         0         0           -           0
        1      16       313       297   1187.94      1188   0.0901362   0.0521002
        2      16       617       601   1201.88      1216   0.0527464   0.0525541
        3      16       925       909   1211.87      1232   0.0417492   0.0523887
        4      16      1231      1215   1214.87      1224   0.0275083   0.0522769
        5      16      1543      1527   1221.47      1248   0.0525531   0.0521112
        6      16      1850      1834   1222.54      1228   0.0324431    0.052101
        7      16      2159      2143   1224.44      1236   0.0201907   0.0520023
        8      16      2462      2446   1222.87      1212   0.0449799   0.0521397
        9      16      2782      2766   1229.21      1280   0.0698403   0.0519154
       10      16      3087      3071   1228.27      1220   0.0542829   0.0519207
    Total time run:         10.0307
    Total writes made:      3088
    Write size:             4194304
    Object size:            4194304
    Bandwidth (MB/sec):     1231.42
    Stddev Bandwidth:       24.1808
    Max bandwidth (MB/sec): 1280
    Min bandwidth (MB/sec): 1188
    Average IOPS:           307
    Stddev IOPS:            6
    Max IOPS:               320
    Min IOPS:               297
    Average Latency(s):     0.051954
    Stddev Latency(s):      0.0175077
    Max latency(s):         0.129501
    Min latency(s):         0.0168499
    
    
    
    rados bench -p scbench 10 seq
    hints = 1
      sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
        0       0         0         0         0         0           -           0
        1      16       527       511   2035.62      2044   0.0285773   0.0290513
        2      15      1086      1071   2130.11      2240   0.0279285   0.0279961
        3      15      1645      1630   2163.69      2236   0.0283242   0.0276305
        4      15      2213      2198    2190.4      2272   0.0281926   0.0273209
        5      15      2778      2763   2202.07      2260   0.0269144   0.0271988
    Total time run:       5.5966
    Total reads made:     3088
    Read size:            4194304
    Object size:          4194304
    Bandwidth (MB/sec):   2207.05
    Average IOPS:         551
    Stddev IOPS:          23
    Max IOPS:             568
    Min IOPS:             511
    Average Latency(s):   0.027164
    Max latency(s):       0.0701902
    Min latency(s):       0.00903297
    [cephuser@ceph1 ~]$ rados bench -p scbench 10 rand
    hints = 1
      sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
        0       0         0         0         0         0           -           0
        1      15       545       530   2119.61      2120   0.0269049   0.0280255
        2      15      1108      1093   2185.39      2252   0.0170569   0.0272992
        3      15      1666      1651   2200.84      2232   0.0270038   0.0271961
        4      15      2200      2185   2184.57      2136   0.0263655   0.0274298
        5      15      2771      2756      2204      2284   0.0233137   0.0272046
        6      15      3336      3321   2213.29      2260    0.027059   0.0270973
        7      15      3911      3896   2225.64      2300   0.0274909   0.0269519
        8      15      4469      4454    2226.4      2232   0.0265596   0.0269469
        9      15      5047      5032   2235.87      2312   0.0226487   0.0268336
       10      15      5618      5603   2240.63      2284   0.0196013   0.0267812
    Total time run:       10.0306
    Total reads made:     5618
    Read size:            4194304
    Object size:          4194304
    Bandwidth (MB/sec):   2240.35
    Average IOPS:         560
    Stddev IOPS:          16
    Max IOPS:             578
    Min IOPS:             530
    Average Latency(s):   0.0267921
    Max latency(s):       0.0740841
    Min latency(s):       0.00561334
    
    Runs with lower thread counts / block sizes were significantly worse.
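For reference, the defaults above keep 16 concurrent 4 MB writes in flight; the latency-bound QD1 small-block case can be exercised explicitly with rados bench's `-t` (concurrent ops) and `-b` (op size) flags. A sketch, assuming the same `scbench` pool:

```shell
# single outstanding op, 4 KiB writes -- the latency-bound case
rados bench -p scbench 10 write -t 1 -b 4096 --no-cleanup
```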


    dd tests on a non-optimized CephFS (XFS, kernel-mounted):


    Code:
    [root@ceph6 cephfs]# dd if=/dev/zero of=ddd1 bs=10G count=10 oflag=direct
    dd: warning: partial read (2147479552 bytes); suggest iflag=fullblock
    0+10 records in
    0+10 records out
    21474795520 bytes (21 GB) copied, 31.3578 s, 685 MB/s
    [root@ceph6 cephfs]# dd if=/dev/zero of=ddd1 bs=10M count=1024 oflag=direct
    1024+0 records in
    1024+0 records out
    10737418240 bytes (11 GB) copied, 29.5715 s, 363 MB/s
    [root@ceph6 cephfs]# dd if=/dev/zero of=ddd1 bs=1M count=10240 oflag=direct
    10240+0 records in
    10240+0 records out
    10737418240 bytes (11 GB) copied, 33.9238 s, 317 MB/s
    [root@ceph6 cephfs]# dd if=/dev/zero of=ddd1 bs=256K count=40960 oflag=direct
    40960+0 records in
    40960+0 records out
    10737418240 bytes (11 GB) copied, 57.5057 s, 187 MB/s
    [root@ceph6 cephfs]# dd if=/dev/zero of=ddd1 bs=64K count=163840 oflag=direct
    163840+0 records in
    163840+0 records out
    10737418240 bytes (11 GB) copied, 164.541 s, 65.3 MB/s
    [root@ceph6 cephfs]# dd if=/dev/zero of=ddd1 bs=16K count=655360 oflag=direct
    655360+0 records in
    655360+0 records out
    10737418240 bytes (11 GB) copied, 522.349 s, 20.6 MB/s
    [root@ceph6 cephfs]# dd if=/dev/zero of=ddd1 bs=4K count=2621440 oflag=direct
    2621440+0 records in
    2621440+0 records out
    10737418240 bytes (11 GB) copied, 2021.14 s, 5.3 MB/s
    

    What do you think fiddling with optimizations would be able to accomplish?
    Double / triple at best?
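A back-of-the-envelope check on where those dd numbers bottom out: with `oflag=direct` each request waits for the previous one to complete, so throughput is just block size divided by per-op latency. Working backwards from the 4K run above:

```shell
# per-request latency implied by the 4K direct-write dd run above:
# 2,621,440 writes completed in 2021.14 s
awk 'BEGIN {
  ops = 2621440; secs = 2021.14
  printf "%.0f IOPS, %.2f ms per 4K write\n", ops / secs, secs / ops * 1000
}'
```

That ~0.77 ms per write is the full network + replication + OSD round-trip, and it is the figure any tuning would have to attack.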
     
    #12
    Last edited: May 3, 2019
  13. MikeWebb

    MikeWebb Member

    Joined:
    Jan 28, 2018
    Messages:
    87
    Likes Received:
    19
    Hope this thread continues. Watching.
     
    #13
  14. Rand__

    Rand__ Well-Known Member

    Sorry, not from my end.
    Ceph had the same issue as all the others I have tested: it is not able to carry raw drive performance over into a networked setup.
     
    #14
  15. MikeWebb

    MikeWebb Member

    Yep, too true. But a saturated network link with high IOPS is a saturated network link with high IOPS.

    It's very easy to saturate a 10 Gb/s link with COTS hardware and keep responsiveness under load. The same goes for those massive SPOF boxes the FreeNAS guys build that can saturate 40 Gb/s. The raft of technologies available to us mortals nowadays is impressive.

    Single raw-disk performance is great, and awesome benchmarks put a grin on my face, but I went all frowny-face when they didn't translate into equal cluster performance. Then I remembered: I'm not trying to play Star Citizen, I'm trying to balance the benefits of resilience, scalability, performance, and cost.

    My 25-28 Gb/s from a non-tweaked ZFS-over-GlusterFS two-node Proxmox cluster (with a Pi witness) is nowhere near the raw performance of the storage inside, and tweaking would see gains for sure. Money was wasted on bad purchases and mistakes were made.

    Dazzled by the benchmarks for Optane I got a few 900p's, but didn't see those numbers translate equally into network speed. But man, overall performance improved. In retrospect we should have spent it on another server and distributed our existing disks over three. This led to discussions about Ceph and what we want to achieve with our infrastructure going into the future.

    Wow, so off topic now; pulling it back in. Ceph is moving fast, and a lot of the good information and tweaking how-tos out there are now obsolete. I think things like adding Optane and other NVMe in a scaled-out manner with Ceph would give us better bang than a ZFS/GlusterFS solution. I'm biased towards the latter, but that's just proven familiarity (and it is feature-rich). I'm very open to what Ceph can offer, but it just seems so arcane.
     
    #15
  16. Rand__

    Rand__ Well-Known Member

    Well, my tests trying to satisfy my (low-QD, high-IOPS) requirements have failed miserably with everything I have thrown at them.
    If you have a recommendation for how to saturate 10G (for starters) at QD1/T1, 64K (ESXi NFS -> sync writes) while maintaining at least a two-node HA setup, then please share :)
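As a rough sanity check on that target (ignoring protocol and replication overhead entirely): filling a 10 Gb/s link with a single outstanding 64 KiB request means each request must complete within block size ÷ line rate.

```shell
# latency budget for 10 Gb/s at QD1 with 64 KiB requests
# (64 KiB / 1.25 GB/s; protocol overhead ignored)
awk 'BEGIN { printf "%.1f us per op\n", 65536 / (10e9 / 8) * 1e6 }'
```

Roughly 52 µs end-to-end per sync write is well below what most networked storage stacks deliver, which is a big part of why this goal is so hard.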
     
    #16
  17. MikeWebb

    MikeWebb Member

    Can I ask why this is your performance-test criterion? I know this is synthetic testing, but we are talking storage node clusters here, not a single spinner (r/w head) in a single computer. A single person accessing the cluster still means a lot of data moving around with reads and writes. That's one of the main reasons for a dedicated network for the storage data, another for the cluster (if HCI), another for consumption, etc.
     
    #17
  18. Rand__

    Rand__ Well-Known Member

    I am looking to replace vSAN as shared/HA storage.
    An NFS-based ZFS filer was the most likely replacement, and from what I could gather 64K might be the appropriate chunk size from the VMware side.
    QD1/T1 since it's primarily for personal use (i.e. myself), and I assume that if it meets my requirements it will easily scale to more users.

    If you prefer a more business-like scenario, think 'high-speed VM storage for the board of directors' - that's the example I usually give big companies when they ask me what I need it for ;)

    I know that is contrary to the normal enterprise-level optimization or evaluation path, and that is probably the reason why I have such a hard time getting this to work as hoped... :)
     
    #18
  19. MikeWebb

    MikeWebb Member

    OK thanks.

    Yeah, it's a juggling act to balance block, record, and (GlusterFS) shard sizes with the requirements of applications (databases or VM filesystems) etc., while keeping uniformity all through the stack. This all leads to rabbit-hole syndrome. What doesn't help is that when looking for suggestions and input from communities, the standard reply seems to be "it depends".

    I'm in the middle of moving all our data over to our backup NAS (yay, ZFS snapshots) and moving physical disks around the servers in preparation for Ceph and, eventually, another server: reducing disk (OSD) density per node and increasing the number of OSD nodes in the cluster, which benefits Ceph performance and resilience.

    So next week I should be able to give some input on a two-node Ceph cluster with mixed HDD, SSD, and NVMe OSDs per node... or more questions.

    I think I should really do a hardware build post so I can point to that when talking about my setup and what I do with it.
     
    #19
  20. Rand__

    Rand__ Well-Known Member

    :) Looking forward to it.
     
    #20