Getting IOPS on a 12 drive RAID 0/10 array -> scripts + plots + questions


robc

New Member
Oct 16, 2023
I am trying to get the most 4k IOPS I can out of a RAID array and am experimenting with different configurations to get a sense of where the sweet spot is in terms of number of drives, performance, and RAID config. I was hoping to see more linear performance gains with the number of drives on RAID 10, but that is not the case, as you'll see below.

Initially this post was about trying to optimize the array, but I think it is clear that I'll need something other than mdadm to get the most IOPS out of RAID 10 configurations. So, to cut to the question first: what should I be looking at to get more IOPS out of RAID 10 configurations? I was looking at:

- SW RAID like Xinnor / RAIDIX - don't know pricing / whether I could even afford them
- HW RAID card like the Highpoint SSD7580B
- GRAID - not considering it for this setup, but maybe in the future

The server has a 64-core EPYC 7702P, tons of RAM, and up to 16 Samsung PM9A3 PCIe Gen4 NVMe drives. I am configuring it with Ansible using a script that runs mdadm to create the array, runs a bunch of fio tests, summarizes those tests in a CSV, and exports it. I have to run it manually for each RAID config, but at least it sweeps across a number of test settings. The applications are generally based on RocksDB / Postgres, which are heavy on 4k reads/writes, so I am only benchmarking that block size.
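For reference, the core of what the script does looks roughly like the sketch below. The device names, chunk size, and fio parameters are illustrative stand-ins for the Ansible variables, not the exact values used for the plots.

```bash
# Sketch of the array-creation step (device list and chunk size are illustrative).
# Note: md RAID 1/10 gets an internal write-intent bitmap by default, which can
# noticeably reduce random-write IOPS; --bitmap=none trades resync convenience for speed.
mdadm --create /dev/md0 --level=10 --raid-devices=12 --chunk=64K --bitmap=none \
    /dev/nvme{0..11}n1

# One of the fio runs the script sweeps over: 4k random read against the raw md device.
fio --name=randread-4k --filename=/dev/md0 --direct=1 --rw=randread --bs=4k \
    --ioengine=libaio --iodepth=64 --numjobs=16 --group_reporting \
    --runtime=60 --time_based
```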

So I ran it a bunch of times with different RAID configs, consolidated the data, fed it to GPT, et voilà:

[Attached plots: 4k random IOPS across the different RAID configurations and drive counts]

So it is pretty clear that software raid 10 doesn't scale well.

## Questions

- Is it normal for SW RAID to be a bottleneck in these high-IOPS situations? Is it normal to need huge CPUs to drive this many drives? It's not something I have observed in the past, but I was working with much smaller arrays.
- Any suggestions for the fio command to squeeze out more performance / consistently max out the drives, so that I can better gauge the performance impact of different array setups?
- Any config values I should be setting to maximize performance? I am running tuned right now, but I am sure there are other settings I am missing (see the sketch after this list).
- Why does RAID 10 not scale? I was expecting much better results.
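On the config-values question, these are a few block-layer knobs that often get checked for high-IOPS NVMe setups; this is only a sketch of things to verify against a baseline, not a claim that they fix the RAID 10 scaling, and the values are illustrative.

```bash
# Per-NVMe-device block-layer settings sometimes tuned for 4k random I/O.
for q in /sys/block/nvme*n1/queue; do
    cat "$q/scheduler"        # 'none' is the usual choice for NVMe
    echo 2 > "$q/nomerges"    # skip merge lookups; pure 4k random I/O doesn't merge anyway
    echo 2 > "$q/rq_affinity" # complete I/O on the submitting CPU
done

# Check whether the array carries a write-intent bitmap, which throttles
# random writes on md RAID 1/10:
mdadm --detail /dev/md0 | grep -i bitmap
```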

Thank you very much for any feedback. I'll be updating this thread with more data and iterations based on any suggestions.
 



i386

Well-Known Member
Mar 18, 2016
Germany
NVMe as a protocol relies heavily on massive queue depths to get maximum performance out of a device, by parallelizing and optimizing NAND reads/writes. With RAID you create shorter queue depths per individual NVMe device, leaving the SSD controller fewer chances to parallelize or optimize I/O.
For comparison: SATA has a max queue depth of 32 commands, SAS 256, and NVMe has 65,535 queues with 65,535 commands each (≈2^32 :D).
While NVMe allows this extreme queue depth, I don't think any SSD/NAND controller implements algorithms for such a depth; instead they use something that's a multiple of the NAND channel count and larger than the SAS queue depth.

In your benchmarks you use some extreme numbers (QD 256 × 128 threads = 32,768 total outstanding I/Os); is that what your actual workload does?
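For context on how those two fio knobs multiply, here is a minimal illustration, assuming the libaio engine and the md device as the target; the values just mirror the numbers quoted above.

```bash
# 128 jobs, each keeping 256 I/Os in flight => roughly 32,768 outstanding 4k requests
fio --name=qd-demo --filename=/dev/md0 --direct=1 --rw=randread --bs=4k \
    --ioengine=libaio --iodepth=256 --numjobs=128 --group_reporting \
    --runtime=30 --time_based
```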
 

NPS

Active Member
Jan 14, 2021
As I wrote in your other thread:
How much disk space do you need?
Did you think about (try out) Optane?

Additional thought: Will the DB writes be sync or async? (I guess sync, but I don't know.)
 

robc

New Member
Oct 16, 2023
@i386 - Thanks for the feedback. I am not sure what queue depth best aligns with my application; that is a very good question and the hardest to answer. Benchmarking the application on different drive configurations is very complicated since the data is large, so I am just assuming 4k random read/write is the best proxy, since both PG and RocksDB use that block size. My main question is RAID sizing, so I am trying to find fio settings that maximize IOPS so I can compare different arrays and clearly see the impacts. Any settings suggestions are welcome, and I include them in all the benchmarks for comparison. Right now I am just using some fairly arbitrary settings to see their impact.

But for now, the biggest worry is why I am not able to see any increase in performance with RAID 10. I was hoping to see some improvement, but that doesn't seem to be the case. Anyone know why?
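One way to make the comparison less dependent on a single guess at the settings is to sweep iodepth and numjobs and take the peak for each array layout. A rough sketch, where the ranges are arbitrary and /dev/md0 is whichever array is under test:

```bash
# Sweep iodepth x numjobs, recording one JSON result per combination;
# compare arrays by their peak IOPS rather than by a single fixed setting.
for jobs in 8 16 32 64 128; do
  for qd in 1 4 16 64 256; do
    fio --name="rr-${jobs}j-${qd}qd" --filename=/dev/md0 --direct=1 \
        --rw=randread --bs=4k --ioengine=libaio --iodepth="$qd" \
        --numjobs="$jobs" --group_reporting --runtime=30 --time_based \
        --output-format=json --output="rr-${jobs}j-${qd}qd.json"
  done
done
```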
 

nexox

Well-Known Member
May 3, 2023
Do you have CPU load measurements for these benchmarks? If you think that could be a bottleneck, then you kind of want to look at how much CPU fio itself is using versus what the kernel threads handling the RAID are using.

Ultimately I think you're going to have to benchmark your actual application, perhaps on a smaller dataset; in my experience it's very difficult to predict much about how an RDBMS will run from low-level filesystem benchmarks.
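A quick way to see that split, assuming the sysstat tools are available, is to watch per-thread CPU while a run is in progress; the md worker for the array should show up with a name along the lines of md0_raid10.

```bash
# Per-thread CPU during a run: fio workers vs. md kernel threads
pidstat -t 1 5 | grep -E 'fio|md0'

# Or interactively: top in thread mode, then sort by CPU
top -H
```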
 

robc

New Member
Oct 16, 2023
> As I wrote in your other thread:
> How much disk space do you need?
> Did you think about (try out) Optane?
>
> Additional thought: Will the DB writes be sync or async? (I guess sync, but I don't know.)
I replied in more detail in the other thread we were talking in, but yeah, Optane's price-to-performance isn't what I need. The DBs are in the 4 TB range, I have many of them, and I am aiming to have about 20 TB of space on each node. I'd need a lot of Optane drives to get to where I need to be, so normal NVMe drives are the better option.

The more interesting question for me is which RAID configuration to go with. I'd like some redundancy from RAID 10, but I am not that worried about losing data, so RAID 0 might be a better option. All my DBs are recoverable and will run in sets of three within k8s, so if one goes down there should be automatic failover, giving me time to recover the node; that makes RAID 0 very much an option. I was just hoping RAID 10 would show better performance so that I could run it and not have to worry about drive failures at all, which is basically the question this thread is trying to answer. Running 12 drives in RAID 0 is just asking for trouble, but 4 drives is, I think, within my risk tolerance; that is a whole other discussion, though. Hopefully the performance numbers will point to a sweet spot.
 

robc

New Member
Oct 16, 2023
> Do you have CPU load measurements for these benchmarks? If you think that could be a bottleneck, then you kind of want to look at how much CPU fio itself is using versus what the kernel threads handling the RAID are using.
>
> Ultimately I think you're going to have to benchmark your actual application, perhaps on a smaller dataset; in my experience it's very difficult to predict much about how an RDBMS will run from low-level filesystem benchmarks.
I was just looking at htop, but I can easily run Prometheus and drill into other metrics if that is helpful. From what I saw in htop, the cores were pinned with 128 jobs (the same number of threads I have) and got pretty high with 64 (might be worth getting some data on that).

More important, though, would be (a) figuring out why RAID 10 performance is so bad and (b) what settings best measure IOPS performance across the different arrays.

You are right, though, that I really should be benchmarking my application, but that is a little harder than a couple of Ansible scripts... I'll have to noodle on the easiest way to do that.
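For the Postgres side, one proxy that is still much closer to the real workload than raw fio is pgbench pointed at a database living on the array under test. A minimal sketch, where the scale factor and client/thread counts are placeholders rather than a tuned benchmark, and benchdb is a hypothetical database name:

```bash
# Initialize a pgbench database on the array under test (scale 1000 is roughly 15 GB)
pgbench -i -s 1000 benchdb

# Mixed read/write TPC-B-like workload: 64 clients over 16 threads for 5 minutes
pgbench -c 64 -j 16 -T 300 -P 10 benchdb
```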
 

nexox

Well-Known Member
May 3, 2023
The important question is which process was using all that CPU; if it was fio, then that would explain a lot about the benchmark results.