Linux mdadm scaling issues

Rand__

Well-Known Member
Mar 6, 2014
4,497
878
113
Hi,
I am playing around with a bunch of SSDs trying to get a feel for how mdadm scales and where the sweet spot is. This is still part of my everlasting search for the ultimate shared storage setup for my VMware boxes...

So I have a new, default Ubuntu installation with no optimizations yet. I just attached my drives to the onboard SATA controller of an X10SRA with an E5-2667 v4 ES (2.9 GHz core frequency).
The drives are Intel S3700 400GB units (some Intel-branded, some Dell-branded, fixed to 6Gb/s).

I have run a bunch of 4K-centric tests with various numbers of drives in the array and have to say it scales extremely badly: it basically hits a limit at 200K IOPS and that's it.

Scaling from 2 to 8 drives in RAID0 (4 threads, iodepth 16, t=60; I know steady-state performance will drop to about 30k per drive, so the total should still be more than 200K):

de52_s3700_2r0: write: io=16447MB, bw=280683KB/s, iops=70170, runt= 60002msec
de52_s3700_4r0: write: io=32251MB, bw=550390KB/s, iops=137597, runt= 60002msec
de52_s3700_6r0: write: io=42368MB, bw=723056KB/s, iops=180763, runt= 60002msec
de52_s3700_8r0: write: io=46302MB, bw=790214KB/s, iops=197553, runt= 60001msec
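
For anyone wanting to reproduce runs like these, the parameters mentioned above (4K blocks, 4 jobs, iodepth 16, 60 seconds) could be expressed as a fio command line roughly as follows. This is only a sketch: the random-write pattern, the libaio engine, direct I/O, the job name and the /dev/md0 target are assumptions, since the original job file is not shown. The helper only prints the command, because writing to the raw array destroys its contents.

```shell
# Sketch only: prints a fio command line, it does not run fio.
# Assumed (not stated in the thread): randwrite, libaio, direct I/O, /dev/md0.
fio_cmd() {
  dev="$1"          # target block device, e.g. /dev/md0 (placeholder)
  jobs="${2:-4}"    # "4 threads" from the post
  depth="${3:-16}"  # "iodepth 16" from the post
  printf '%s %s %s\n' \
    "fio --name=md_4k --filename=$dev --rw=randwrite --bs=4k" \
    "--ioengine=libaio --direct=1 --numjobs=$jobs --iodepth=$depth" \
    "--runtime=60 --time_based --group_reporting"
}

fio_cmd /dev/md0
```

Passing other values as the second and third argument (for example `fio_cmd /dev/md0 8 32`) gives the command for the "more jobs, different iodepths" variations.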


I also tried with more jobs (8) and different iodepths, no real improvement.

Just for fun I ran a 2-disk RAID0 on Intel 750s, with quite inconsistent results (but that might be steady state kicking in, since the runs were right after each other)...
de51_nvmer0: write: io=96582MB, bw=1609.7MB/s, iops=412060, runt= 60003msec
de51_nvmer0: write: io=53776MB, bw=917734KB/s, iops=229433, runt= 60003msec

So mdadm *is* able to reach more than 200K IOPS, so maybe that's more of a SATA controller issue?
I will need to run the tests on an HBA next...

What is your experience with SSDs with mdadm? Ever hit that limit? Did you fix it? :)
 


T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
7,009
1,569
113
CA
SAS2 HBA can't fully utilize 8x Enterprise SATA SSD let alone a SATA controller :(

I don't know IF that's your problem, but it likely could be.

Either way, running 2x HBA will yield even more performance.
 

PigLover

Moderator
Jan 26, 2011
2,964
1,271
113
Your bottleneck is very likely the C612 PCH that the SATA is connected through. As @T_Minus suggests - spread them across more IO controllers. If you have a single HBA available give it a try with 4 on the on-board SATA and 4 on the HBA.
 

Rand__

Got enough SAS2/3 HBAs to play around, so will give that a try:)
 

_alex

Active Member
Jan 28, 2016
874
94
28
Bavaria / Germany
I'd recommend:
get as recent a kernel as possible
enable blk-mq / scsi-mq
set the I/O scheduler to mq-deadline (you will need a recent kernel for this), or to deadline or noop without blk-mq
disable PTI (pti=off) for max performance / a good baseline

Not sure how much the C612 can do, but I got nice improvements in a VM with the settings above (scsi-mq instead of blk-mq in my case...).
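
To see which scheduler each drive is actually using before and after such changes, something like the following could help. It is a minimal sketch that reads the standard sysfs file `/sys/block/<dev>/queue/scheduler`, where the active scheduler is the bracketed entry; the function names are made up for illustration, and the sysfs root is a parameter only so the helper can be tried on a copy.

```shell
# Extract the active scheduler (the bracketed entry) from the text of
# /sys/block/<dev>/queue/scheduler, e.g. "noop deadline [cfq]".
active_scheduler() {
  printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Print "<device> <active scheduler>" for every sd* device under a sysfs root.
list_schedulers() {
  root="${1:-/sys/block}"
  for f in "$root"/sd*/queue/scheduler; do
    [ -r "$f" ] || continue          # skip if the glob matched nothing
    dev="${f#"$root"/}"
    dev="${dev%%/*}"
    printf '%s %s\n' "$dev" "$(active_scheduler "$(cat "$f")")"
  done
}

list_schedulers

# Switching one device (as root; mq-deadline needs a blk-mq kernel):
#   echo mq-deadline > /sys/block/sdX/queue/scheduler
# On kernels of that era scsi-mq was typically enabled with the boot
# parameter scsi_mod.use_blk_mq=1 (assumption: check your kernel's docs).
```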
 

Rand__

Better (4 drives on a SAS3 HBA, 4 on SATA):
de52_s3700_2r0: write: io=17770MB, bw=303272KB/s, iops=75818, runt= 60002msec
de52_s3700_4r0: write: io=32768MB, bw=559223KB/s, iops=139805, runt= 60001msec
de52_s3700_6r0: write: io=46329MB, bw=790660KB/s, iops=197665, runt= 60001msec
de52_s3700_8r0: write: io=59228MB, bw=987.13MB/s, iops=252702, runt= 60001msec
 

Rand__

So scaling works fine until I seem to hit the SAS3 limit (6 drives scale nicely, 8 seem to hit the next limit).
(That's 4 on SATA and 8 on the SAS3 HBA.)

de52_s3700_2r0: write: io=17770MB, bw=303272KB/s, iops=75818, runt= 60002msec
de52_s3700_4r0: write: io=32768MB, bw=559223KB/s, iops=139805, runt= 60001msec
de52_s3700_6r0: write: io=46329MB, bw=790660KB/s, iops=197665, runt= 60001msec
de52_s3700_8r0: write: io=59228MB, bw=987.13MB/s, iops=252702, runt= 60001msec
de52_s3700_10r0: write: io=74764MB, bw=1242.2MB/s, iops=318000, runt= 60187msec
de52_s3700_12r0: write: io=80170MB, bw=1336.2MB/s, iops=342053, runt= 60001msec
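
One way to make the scaling easier to read is to divide each result by its drive count. The sketch below does that for fio summary lines in the format shown above; it assumes the job name ends in `_<N>r0` (as these do), and the function name is made up for illustration.

```shell
# Read fio summary lines on stdin; print "<drives> <iops> <iops per drive>".
# Assumes the job name contains "_<N>r0", as in the results above.
per_drive_iops() {
  awk '
    match($0, /_[0-9]+r0/) {
      n = substr($0, RSTART + 1, RLENGTH - 3) + 0        # drive count from the name
      if (match($0, /iops=[0-9]+/)) {
        iops = substr($0, RSTART + 5, RLENGTH - 5) + 0   # value after "iops="
        printf "%d %d %d\n", n, iops, iops / n           # per-drive value is truncated
      }
    }'
}

per_drive_iops <<'EOF'
de52_s3700_2r0: write: io=17770MB, bw=303272KB/s, iops=75818, runt= 60002msec
de52_s3700_8r0: write: io=59228MB, bw=987.13MB/s, iops=252702, runt= 60001msec
de52_s3700_12r0: write: io=80170MB, bw=1336.2MB/s, iops=342053, runt= 60001msec
EOF
```

On the numbers above, per-drive IOPS goes from roughly 37.9k at 2 drives to about 31.6k at 8 and about 28.5k at 12, so each added drive still helps but contributes less.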


Edit:
Just for completeness' sake: 2->8 drives on the SAS3 HBA only (Dell H330)
s3700_2r0.fio: write: io=16501MB, bw=281610KB/s, iops=70402, runt= 60002msec
s3700_4r0.fio: write: io=32243MB, bw=550279KB/s, iops=137569, runt= 60001msec
s3700_6r0.fio: write: io=46510MB, bw=793757KB/s, iops=198439, runt= 60001msec
s3700_8r0.fio: write: io=62752MB, bw=1045.9MB/s, iops=267726, runt= 60003msec
 

_alex

Not sure this is a SAS limit. Do you cable one drive per port, without an expander?
What would 8 drives on the HBA alone give?
If that's more than the 252k from the 4 SATA/4 SAS setup, the bottleneck is definitely not SAS3.
 

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
1,218
412
83
Have you turned off the write intent bitmap for the RAID array [mdadm --grow --bitmap=none /dev/md123]? I've seen large flash arrays bottleneck on that before. Note I wouldn't recommend running a production array without a write bitmap.

Also, see if you can spot any discrepancies in the iostat loads on the drives in the array whilst you run your tests [using a command like `watch "iostat -kx|egrep 'sd|Device'"`]; as others have noted, there might be a dodgy SSD, cable or controller somewhere in the mix that's dragging down the performance of the whole array, and that sort of thing should show up as devices in the list that have markedly different perf stats to the others.
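
For the bitmap suggestion above, it may be worth checking first whether the array has a write-intent bitmap at all; md only shows a `bitmap:` line in /proc/mdstat for arrays that have one. A minimal sketch, with a made-up function name and the md123 device name taken from the example command:

```shell
# Succeeds if the named array shows a "bitmap:" line in mdstat-style text
# read from stdin (normally /proc/mdstat).
has_bitmap() {
  awk -v md="$1" '
    $2 == ":" { cur = $1 }                  # stanza header, e.g. "md123 : active raid10 ..."
    cur == md && /bitmap:/ { found = 1 }
    END { exit (found ? 0 : 1) }'
}

if [ -r /proc/mdstat ] && has_bitmap md123 < /proc/mdstat; then
  echo "md123 has a write-intent bitmap"
fi

# Removing it for a test run and restoring it afterwards (as root):
#   mdadm --grow --bitmap=none /dev/md123
#   mdadm --grow --bitmap=internal /dev/md123
```

Also note that iostat run without an interval reports averages since boot, so a variant such as `iostat -kx 2` may show the load during a test run more directly.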
 

Rand__

EffrafaxOfWug wrote: "as others have noted there might be a dodgy SSD, cable or controller somewhere in the mix that's dragging down the performance of the whole array and that sort of thing should show up as devices in the list that have markedly different perf stats to the others."
Thanks;
I had run single-disk tests beforehand to identify any significant differences, and they were within 10% or so, which I consider OK for now :)
 

EffrafaxOfWug

D'oh, missed you were using a RAID0, lizard brain read it as 10 like all good RAIDs should be ;)
 

_alex

Maybe the stripe size can help a bit.
AFAIK it's 512 KB by default, which is a bit high for random 4K.
But I wouldn't expect too much; consider this one of the last things to tune.
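
A rough way to see why the chunk size may matter little for aligned random 4K I/O: in a RAID0 with equal-sized members, the member a request lands on follows from its chunk index, and an aligned 4K request never straddles a chunk boundary as long as the chunk size is a multiple of 4K. The sketch below illustrates this; the function names are made up, and it assumes the simple single-zone layout.

```shell
# Which RAID0 member a byte offset lands on, assuming equal-sized members:
#   member = (offset / chunk) mod ndisks     (all sizes in bytes)
raid0_member() {
  offset="$1"; chunk="$2"; ndisks="$3"
  echo $(( (offset / chunk) % ndisks ))
}

# Succeeds if a request of <size> bytes starting at <offset> crosses a
# chunk boundary, i.e. would be split across two members.
crosses_chunk() {
  offset="$1"; size="$2"; chunk="$3"
  [ $(( offset / chunk )) -ne $(( (offset + size - 1) / chunk )) ]
}

# 4K request inside the fourth 512K chunk of an 8-disk array: member index 3
raid0_member $(( 3 * 524288 + 4096 )) 524288 8

# An aligned 4K request does not cross a 512K (or 64K) chunk boundary
crosses_chunk 4096 4096 524288 || echo "stays within one chunk"
```

With uniformly random offsets the requests spread evenly over the members for any of these chunk sizes, which fits the expectation that this is one of the last knobs to turn.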

and yes, Raid-10 with internal bitmaps is the way to go ;)
 

Celoxocis

New Member
Mar 28, 2017
3
0
1
37
just added SAS HBA only values above for completeness sake
Run this script (alternative link to his GitHub) to find the optimal input/output block sizes of your SSDs.
(Run it on a single /dev device, not on the mdX itself.)
Then apply that block size as the chunk-size value for your mdadm setup.
Run the benchmarks and see what you get.
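
Applying a chunk size happens when the array is created. As a sketch, the helper below prints (but does not run) an mdadm command for a RAID0 with a given chunk size; `--chunk` is given in KiB by default. The /dev/md0 target, the member device names and the value 64 are placeholders for illustration, not recommendations.

```shell
# Print (do not run) an mdadm command that creates a RAID0 with the given
# chunk size in KiB. /dev/md0 and the member names are placeholders.
mdadm_raid0_cmd() {
  chunk_kib="$1"; shift
  printf 'mdadm --create /dev/md0 --level=0 --raid-devices=%d --chunk=%s %s\n' \
    "$#" "$chunk_kib" "$*"
}

mdadm_raid0_cmd 64 /dev/sdb /dev/sdc /dev/sdd /dev/sde

# The chunk size of an existing array appears in the "Chunk Size" line of:
#   mdadm --detail /dev/md0
```

Creating an array over existing members destroys the data on them, so this belongs on a test setup like the one in this thread.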