Ceph Benchmark request /NVME

Discussion in 'Linux Admins, Storage and Virtualization' started by Rand__, Feb 22, 2019.

  1. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,592
    Likes Received:
    544
    Hi,

In my everlasting search for the holy grail (high performance in QD1T1 setups) I recently stumbled upon Ceph again.
I then saw that there is a new release around the corner, started to look a bit closer, and thought it might be promising.
But before I start investing countless hours in the next disappointing solution, I thought I'd tap the community's wisdom for some initial answers.

So could somebody (ideally with an all-NVMe setup, assuming that's the ultimate single-user performance option) be so kind as to provide a Q1T1 CDM screenshot (and briefly list the setup)?

    Thanks a mil.
     
    #1
  2. T_Minus

    T_Minus Moderator

    Joined:
    Feb 15, 2015
    Messages:
    6,838
    Likes Received:
    1,493
Did CDM actually release a new version that works for NVMe and Optane?

I just ran it yesterday and the results were all over the place, and wildly inaccurate.
     
    #2
  3. Rand__

    Rand__ Well-Known Member

    Never even knew there were issues ;)
    I attributed all the bad performance I saw to vSAN :p

    Of course I am also happy to take fio, dd, or anything else you can provide :D
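For anyone willing to run fio instead, a job file along these lines should approximate CDM's 4K Q1T1 random-read test. This is only a sketch: the `filename` path and `size` are placeholders to be pointed at the actual device or test file.

```ini
; approximates a CDM-style 4K Q1T1 random-read run (assumed parameters)
[qd1t1-randread]
ioengine=libaio
direct=1
rw=randread
bs=4k
iodepth=1
numjobs=1
size=4G
runtime=30
time_based
filename=/path/to/testfile
```

Swapping `rw=randread` for `rw=randwrite` gives the matching write-side number.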
     
    #3
    Last edited: Feb 22, 2019
  4. T_Minus

    T_Minus Moderator


    I don't know what version I'm using; I can check. It's on my test-bench system, so I rarely ever upgrade it.

    I've seen 300 MB/s+ on 4K Q1T1 from drives that don't come close to that.

    I've seen Optane bounce from 75 MB/s to 105 MB/s: re-test it over and over and always get 75, then test again and it jumps to 100+.

    It's not an accurate representation of these drives' performance.

    I use it only to make sure there are no drastic problems, to put the drives through their paces, and to re-check SMART info.

    I would not use CDM with Optane or NVMe as a performance-level indicator.

    The version of Anvil Pro I have also gives wildly varying IOPS for these drives... as in, from 90,000 to 300,000+ across tests. There's no way any of these drives would ever hit 300k+ without some sort of caching, etc., going on.


    I would use IOMETER to figure out what you can get out of your setup.
     
    #4
  5. Rand__

    Rand__ Well-Known Member

    I tend to use some 5.x version of CDM, and it's at least an indicator.
    I attributed the wild fluctuations to power-saving measures... have you ruled those out?
     
    #5
    T_Minus likes this.
  6. T_Minus

    T_Minus Moderator

    Hmmm... why would power saving cause fluctuations in excess of 200,000 IOPS for drives that cannot reach that performance at any queue depth or thread count, no matter what? (There's a 300 MB/s to 950 MB/s variance too.)

    As long as you know it's not accurate, that's all that matters, really :)
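As a sanity check on figures like these: throughput and IOPS are tied together by the block size (IOPS = bytes per second ÷ block size), so the quoted 4K throughput swing maps directly onto an IOPS swing.

```shell
# rough conversion of the 4K throughput figures quoted above into IOPS
# (IOPS = bytes per second / block size; MB taken as 10^6 bytes here)
for mbps in 300 950; do
  awk -v m="$mbps" 'BEGIN { printf "%d MB/s at 4 KiB ~ %d IOPS\n", m, m * 1000000 / 4096 }'
done
```

So a 300-to-950 MB/s swing at 4K is roughly a 160k IOPS swing, which no cache-free device explains.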
     
    #6
  7. T_Minus

    T_Minus Moderator

    Oh, and FWIW I'm testing on Win7. I had to install the NVMe drivers to get any of them to even work ;)

    What's really crazy is the performance variance between the 'original' hotfix NVMe driver for Win7 and the 'latest gen' drivers...
    In my tests the original drivers had much better random writes, whereas the new ones have better reads.
    Not scientific by any means, just an observation.
     
    #7
  8. Rand__

    Rand__ Well-Known Member

    Yeah, I agree that too-high performance shouldn't be caused by that (caching?), but the fluctuations might be...
     
    #8
  9. Rand__

    Rand__ Well-Known Member

    OK, I only run it on 2012 since that's the default VM I deploy (I've got plenty of licenses). The only major issue was the S&M patches (and of course the 1.2.15 ESXi NVMe drivers, which wreaked havoc on performance).
     
    #9
  10. T_Minus

    T_Minus Moderator

    Ah ok.

    I just got done testing what I thought was a NIB 400GB P3700... turns out it had 2.75 PB written.

    I think that's my most-used drive to date, except at this point I don't know if I did that over the last 3 years of misc. usage (doubtful) or if I actually purchased one NIB 400GB and one used :/ Seems odd.
     
    #10
  11. Rand__

    Rand__ Well-Known Member

    Well, we have seen many 'deals' on new drives that were not new... if you didn't test it when you got it, that might very well be the case...
     
    #11
  12. Rand__

    Rand__ Well-Known Member

    So I set up a test system: single host, 3 Optanes for 3 OSDs, a single monitor, plus admin and client VMs.
    All boxes have 4 cores (of a dual 5115, 10 x 2.4 GHz, box), 16 GB RAM, and ESX-internal networking only.
    The only 'slow' thing on that box is the VM datastore, which is an old Intel SSD. Not sure whether the mon sitting on a slower disk would impact performance?

    I used defaults everywhere (primarily for lack of knowing better), so no performance options, no multiple OSDs per NVMe as suggested, etc.
    Note I am looking to maximize performance for a single user / single thread, so most optimizations targeting parallelism won't help much for my use case.

    Rados benches also using defaults:
    Code:
    [cephuser@ceph1 ~]$ rados bench -p scbench 10 write --no-cleanup
    hints = 1
    Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
    Object prefix: benchmark_data_ceph1_19616
      sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
        0       0         0         0         0         0           -           0
        1      16       313       297   1187.94      1188   0.0901362   0.0521002
        2      16       617       601   1201.88      1216   0.0527464   0.0525541
        3      16       925       909   1211.87      1232   0.0417492   0.0523887
        4      16      1231      1215   1214.87      1224   0.0275083   0.0522769
        5      16      1543      1527   1221.47      1248   0.0525531   0.0521112
        6      16      1850      1834   1222.54      1228   0.0324431    0.052101
        7      16      2159      2143   1224.44      1236   0.0201907   0.0520023
        8      16      2462      2446   1222.87      1212   0.0449799   0.0521397
        9      16      2782      2766   1229.21      1280   0.0698403   0.0519154
       10      16      3087      3071   1228.27      1220   0.0542829   0.0519207
    Total time run:         10.0307
    Total writes made:      3088
    Write size:             4194304
    Object size:            4194304
    Bandwidth (MB/sec):     1231.42
    Stddev Bandwidth:       24.1808
    Max bandwidth (MB/sec): 1280
    Min bandwidth (MB/sec): 1188
    Average IOPS:           307
    Stddev IOPS:            6
    Max IOPS:               320
    Min IOPS:               297
    Average Latency(s):     0.051954
    Stddev Latency(s):      0.0175077
    Max latency(s):         0.129501
    Min latency(s):         0.0168499
    
    
    
    rados bench -p scbench 10 seq
    hints = 1
      sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
        0       0         0         0         0         0           -           0
        1      16       527       511   2035.62      2044   0.0285773   0.0290513
        2      15      1086      1071   2130.11      2240   0.0279285   0.0279961
        3      15      1645      1630   2163.69      2236   0.0283242   0.0276305
        4      15      2213      2198    2190.4      2272   0.0281926   0.0273209
        5      15      2778      2763   2202.07      2260   0.0269144   0.0271988
    Total time run:       5.5966
    Total reads made:     3088
    Read size:            4194304
    Object size:          4194304
    Bandwidth (MB/sec):   2207.05
    Average IOPS:         551
    Stddev IOPS:          23
    Max IOPS:             568
    Min IOPS:             511
    Average Latency(s):   0.027164
    Max latency(s):       0.0701902
    Min latency(s):       0.00903297
    [cephuser@ceph1 ~]$ rados bench -p scbench 10 rand
    hints = 1
      sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
        0       0         0         0         0         0           -           0
        1      15       545       530   2119.61      2120   0.0269049   0.0280255
        2      15      1108      1093   2185.39      2252   0.0170569   0.0272992
        3      15      1666      1651   2200.84      2232   0.0270038   0.0271961
        4      15      2200      2185   2184.57      2136   0.0263655   0.0274298
        5      15      2771      2756      2204      2284   0.0233137   0.0272046
        6      15      3336      3321   2213.29      2260    0.027059   0.0270973
        7      15      3911      3896   2225.64      2300   0.0274909   0.0269519
        8      15      4469      4454    2226.4      2232   0.0265596   0.0269469
        9      15      5047      5032   2235.87      2312   0.0226487   0.0268336
       10      15      5618      5603   2240.63      2284   0.0196013   0.0267812
    Total time run:       10.0306
    Total reads made:     5618
    Read size:            4194304
    Object size:          4194304
    Bandwidth (MB/sec):   2240.35
    Average IOPS:         560
    Stddev IOPS:          16
    Max IOPS:             578
    Min IOPS:             530
    Average Latency(s):   0.0267921
    Max latency(s):       0.0740841
    Min latency(s):       0.00561334
    
    Runs with lower thread counts / block sizes were significantly worse.
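For reference, the defaults above keep 16 concurrent 4 MB writes in flight; the latency-bound QD1 small-block case can be exercised explicitly with rados bench's `-t` (concurrent ops) and `-b` (op size) flags. A sketch, assuming the same `scbench` pool:

```shell
# single outstanding op, 4 KiB writes -- the latency-bound case
rados bench -p scbench 10 write -t 1 -b 4096 --no-cleanup
```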


    dd tests on a non-optimized CephFS (XFS, kernel-mounted):


    Code:
    [root@ceph6 cephfs]# dd if=/dev/zero of=ddd1 bs=10G count=10 oflag=direct
    dd: warning: partial read (2147479552 bytes); suggest iflag=fullblock
    0+10 records in
    0+10 records out
    21474795520 bytes (21 GB) copied, 31.3578 s, 685 MB/s
    [root@ceph6 cephfs]# dd if=/dev/zero of=ddd1 bs=10M count=1024 oflag=direct
    1024+0 records in
    1024+0 records out
    10737418240 bytes (11 GB) copied, 29.5715 s, 363 MB/s
    [root@ceph6 cephfs]# dd if=/dev/zero of=ddd1 bs=1M count=10240 oflag=direct
    10240+0 records in
    10240+0 records out
    10737418240 bytes (11 GB) copied, 33.9238 s, 317 MB/s
    [root@ceph6 cephfs]# dd if=/dev/zero of=ddd1 bs=256K count=40960 oflag=direct
    40960+0 records in
    40960+0 records out
    10737418240 bytes (11 GB) copied, 57.5057 s, 187 MB/s
    [root@ceph6 cephfs]# dd if=/dev/zero of=ddd1 bs=64K count=163840 oflag=direct
    163840+0 records in
    163840+0 records out
    10737418240 bytes (11 GB) copied, 164.541 s, 65.3 MB/s
    [root@ceph6 cephfs]# dd if=/dev/zero of=ddd1 bs=16K count=655360 oflag=direct
    655360+0 records in
    655360+0 records out
    10737418240 bytes (11 GB) copied, 522.349 s, 20.6 MB/s
    [root@ceph6 cephfs]# dd if=/dev/zero of=ddd1 bs=4K count=2621440 oflag=direct
    2621440+0 records in
    2621440+0 records out
    10737418240 bytes (11 GB) copied, 2021.14 s, 5.3 MB/s
    

    What do you think fiddling with optimizations would be able to accomplish?
    Double / triple at best?
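A back-of-the-envelope check on where those dd numbers bottom out: with `oflag=direct` each request waits for the previous one to complete, so throughput is just block size divided by per-op latency. Working backwards from the 4K run above:

```shell
# per-request latency implied by the 4K direct-write dd run above:
# 2,621,440 writes completed in 2021.14 s
awk 'BEGIN {
  ops = 2621440; secs = 2021.14
  printf "%.0f IOPS, %.2f ms per 4K write\n", ops / secs, secs / ops * 1000
}'
```

That ~0.77 ms per write is the full network + replication + OSD round-trip, and it is the figure any tuning would have to attack.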
     
    #12
    Last edited: May 3, 2019
  13. MikeWebb

    MikeWebb Member

    Joined:
    Jan 28, 2018
    Messages:
    87
    Likes Received:
    19
    Hope this thread continues. Watching.
     
    #13
  14. Rand__

    Rand__ Well-Known Member

    Sorry, not from my end.
    Ceph had the same issue as all the others I have tested: it is not able to carry raw drive performance over into a networked setup.
     
    #14
  15. MikeWebb

    MikeWebb Member

    Yep, too true. But a saturated network link with high IOPS is a saturated network link with high IOPS.

    It's very easy to saturate a 10 Gb/s link with COTS hardware and keep responsiveness under load. The same goes for those massive SPOF boxes the FreeNAS guys build that can saturate 40 Gb/s. The raft of technologies available to us mortals nowadays is impressive.

    Single raw-disk performance is great, and awesome benchmarks put a grin on my face, but I went all frowny-face when they didn't translate into equal cluster performance. Then I remembered: I'm not trying to play Star Citizen, I'm trying to balance the benefits of resilience, scalability, performance, and cost.

    My 25-28 Gb/s from a non-tweaked ZFS-over-GlusterFS two-node Proxmox cluster (with a Pi witness) is nowhere near the raw performance of the storage inside, and tweaking would see gains for sure. Money was wasted on bad purchases and mistakes were made.

    Dazzled by the benchmarks for Optane I got a few 900p's, but didn't see those numbers translate equally into network speed. But man, overall performance improved. In retrospect we should have spent it on another server and distributed our existing disks over three. This led to discussions about Ceph and what we want to achieve with our infrastructure going into the future.

    Wow, so off topic now; pulling it back in. Ceph is moving fast, and a lot of the good information and tweaking how-tos out there are now obsolete. I think things like adding Optane and other NVMe in a scaled-out manner with Ceph would give us better bang than a ZFS/GlusterFS solution. I'm biased towards the latter, but that's just proven familiarity (and it is feature-rich). I'm very open to what Ceph can offer, but it just seems so arcane.
     
    #15
  16. Rand__

    Rand__ Well-Known Member

    Well, my tests trying to satisfy my (low-QD, high-IOPS) requirements have failed miserably with everything I have thrown at them.
    If you have a recommendation for how to saturate 10G (for starters) at QD1/T1, 64K (ESXi NFS -> sync writes) while maintaining at least a two-node HA setup, then please share :)
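As a rough sanity check on that target (ignoring protocol and replication overhead entirely): filling a 10 Gb/s link with a single outstanding 64 KiB request means each request must complete within block size ÷ line rate.

```shell
# latency budget for 10 Gb/s at QD1 with 64 KiB requests
# (64 KiB / 1.25 GB/s; protocol overhead ignored)
awk 'BEGIN { printf "%.1f us per op\n", 65536 / (10e9 / 8) * 1e6 }'
```

Roughly 52 µs end-to-end per sync write is well below what most networked storage stacks deliver, which is a big part of why this goal is so hard.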
     
    #16
  17. MikeWebb

    MikeWebb Member

    Can I ask why this is your performance-test criterion? I know this is synthetic testing, but we are talking storage node clusters here, not a single spinner (r/w head) in a single computer. A single person accessing the cluster still means a lot of data moving around with reads and writes. That's one of the main reasons for a dedicated network for the storage data, another for the cluster (if HCI), another for consumption, etc.
     
    #17
  18. Rand__

    Rand__ Well-Known Member

    I am looking to replace vSAN as shared/HA storage.
    An NFS-based ZFS filer was the most likely replacement, and from what I could gather 64K might be the appropriate chunk size from the VMware side.
    QD1/T1 since it's primarily for personal use (i.e. myself), and I assume that if it meets my requirements it will easily scale to more users.

    If you prefer a more business-like scenario, think 'high-speed VM storage for the board of directors' - that's the example I usually give big companies when they ask me what I need it for ;)

    I know that is contrary to the normal enterprise-level optimization or evaluation path, and that is probably the reason why I have such a hard time getting this to work as hoped... :)
     
    #18
  19. MikeWebb

    MikeWebb Member

    OK thanks.

    Yeah, it's a juggling act to balance block, record, and (GlusterFS) shard sizes with the requirements of applications (databases or VM filesystems) etc., while keeping uniformity all through the stack. This all leads to rabbit-hole syndrome. What doesn't help is that when looking for suggestions and input from communities, the standard reply seems to be "it depends".

    I'm in the middle of moving all our data over to our backup NAS (yay, ZFS snapshots) and moving physical disks around the servers in preparation for Ceph and, eventually, another server: reducing disk (OSD) density per node and increasing the number of OSD nodes in the cluster, which benefits Ceph performance and resilience.

    So next week I should be able to give some input on a two-node Ceph cluster with mixed HDD, SSD, and NVMe OSDs per node... or more questions.

    I think I should really do a hardware build post so I can point to that when talking about my setup and what I do with it.
     
    #19
  20. Rand__

    Rand__ Well-Known Member

    :) Looking forward to it.
     
    #20