Linux mdadm scaling issues

Discussion in 'Linux Admins, Storage and Virtualization' started by Rand__, Mar 6, 2018.

  1. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,692
    Likes Received:
    586
    Hi,
     I am playing around with a bunch of SSDs, trying to get a feel for how mdadm scales and where the sweet spot is. This is still part of my everlasting search for the ultimate shared storage setup for my VMware boxes...

     So I have a new Ubuntu installation, default, no optimizations yet. Just dropped in my drives on the onboard SATA controller of an X10SRA with an E5-2667 v4 ES (2.9GHz core frequency).
     The drives are Intel S3700 400GB (some Intel-branded, some Dell-branded, fixed to 6Gb/s).

     I have run a bunch of 4K-centric tests with various numbers of drives in the array and have to say it scales extremely badly: it basically hits a wall at 200K IOPS and that's it.

     Scaling from 2 to 8 drives in RAID0 (4 threads, iodepth 16, t=60s; I know steady-state performance will drop to about 30K per drive, so 8 drives should still manage more than 200K):

    de52_s3700_2r0: write: io=16447MB, bw=280683KB/s, iops=70170, runt= 60002msec
    de52_s3700_4r0: write: io=32251MB, bw=550390KB/s, iops=137597, runt= 60002msec
    de52_s3700_6r0: write: io=42368MB, bw=723056KB/s, iops=180763, runt= 60002msec
    de52_s3700_8r0: write: io=46302MB, bw=790214KB/s, iops=197553, runt= 60001msec


     I also tried with more jobs (8) and different iodepths; no real improvement.
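
     For reference, the runs above roughly correspond to an fio job like the sketch below; the actual job file isn't in the post, so the section names, the target device and the exact option spelling are assumptions:

     ; sketch of the 4K random-write job described above (numbers from the post, device name assumed)
     [global]
     ioengine=libaio
     direct=1
     rw=randwrite
     bs=4k
     numjobs=4
     iodepth=16
     runtime=60
     time_based
     group_reporting

     [md-4k-randwrite]
     filename=/dev/md0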

     Just for fun I ran a 2-disk RAID0 on Intel 750s; quite inconsistent results (but the runs were back to back, so the second one might be steady state)...
    de51_nvmer0: write: io=96582MB, bw=1609.7MB/s, iops=412060, runt= 60003msec
    de51_nvmer0: write: io=53776MB, bw=917734KB/s, iops=229433, runt= 60003msec

     So mdadm *is* able to reach more than 200K IOPS, so maybe that's more of a SATA controller issue?
     Will need to run the tests on an HBA next...

     What is your experience with SSDs under mdadm? Ever hit that limit? Did you fix it? :)
     

    #1
  2. T_Minus

    T_Minus Moderator

    Joined:
    Feb 15, 2015
    Messages:
    6,883
    Likes Received:
    1,509
     A SAS2 HBA can't fully utilize 8x enterprise SATA SSDs, let alone a SATA controller :(

     I don't know IF that's your problem, but it likely could be.

     Either way, running 2x HBAs will yield even greater performance.
     
    #2
    eva2000 likes this.
  3. PigLover

    PigLover Moderator

    Joined:
    Jan 26, 2011
    Messages:
    2,783
    Likes Received:
    1,122
     Your bottleneck is very likely the C612 PCH that the onboard SATA is connected through. As @T_Minus suggests, spread the drives across more I/O controllers. If you have a single HBA available, give it a try with 4 drives on the onboard SATA and 4 on the HBA.
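
     A quick way to confirm which controller each drive actually sits behind (just a sketch; sdb is a placeholder and the lsblk columns vary a bit by version):

     lsblk -S -o NAME,HCTL,TRAN,MODEL    # which SCSI host each sdX hangs off
     readlink -f /sys/block/sdb          # full sysfs/PCI path of the controller behind sdb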
     
    #3
  4. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,692
    Likes Received:
    586
     Got enough SAS2/3 HBAs to play around with, so I will give that a try :)
     
    #4
  5. _alex

    _alex Active Member

    Joined:
    Jan 28, 2016
    Messages:
    874
    Likes Received:
    94
     I'd recommend:
     - get an as-recent-as-possible kernel
     - enable blk-mq / scsi-mq
     - set the I/O scheduler to mq-deadline (you will need a recent kernel for this; deadline or noop without blk-mq)
     - disable PTI (pti=off) for max performance / a good baseline

     Not sure how much the C612 can do, but I got nice improvements in a VM with the above (scsi-mq instead of blk-mq in my case...).
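
     A rough sketch of how that could look on an Ubuntu box with a ~4.x kernel where blk-mq isn't the default yet (device names are placeholders, adjust to your drives):

     # /etc/default/grub: add to GRUB_CMDLINE_LINUX, then update-grub and reboot
     #   scsi_mod.use_blk_mq=1 dm_mod.use_blk_mq=1 pti=off

     # after reboot, set the scheduler on each member SSD
     for d in sdb sdc sdd sde sdf sdg sdh sdi; do
         echo mq-deadline > /sys/block/$d/queue/scheduler
     done
     cat /sys/block/sdb/queue/scheduler   # the active scheduler is shown in [brackets]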
     
    #5
  6. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,692
    Likes Received:
    586
     Better (4 drives on a SAS3 HBA, 4 on SATA):
    de52_s3700_2r0: write: io=17770MB, bw=303272KB/s, iops=75818, runt= 60002msec
    de52_s3700_4r0: write: io=32768MB, bw=559223KB/s, iops=139805, runt= 60001msec
    de52_s3700_6r0: write: io=46329MB, bw=790660KB/s, iops=197665, runt= 60001msec
    de52_s3700_8r0: write: io=59228MB, bw=987.13MB/s, iops=252702, runt= 60001msec
     
    #6
  7. _alex

    _alex Active Member

    Joined:
    Jan 28, 2016
    Messages:
    874
    Likes Received:
    94
     That's close to the 8x 36K = 288K write IOPS the S3700s should do...
     
    #7
  8. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,692
    Likes Received:
    586
     So scaling works fine until I seem to hit the SAS3 HBA's limit (6 drives on the HBA scale nicely, 8 seem to hit the next limit).
     (That's 4 drives on SATA plus up to 8 on the SAS3 HBA.)

    de52_s3700_2r0: write: io=17770MB, bw=303272KB/s, iops=75818, runt= 60002msec
    de52_s3700_4r0: write: io=32768MB, bw=559223KB/s, iops=139805, runt= 60001msec
    de52_s3700_6r0: write: io=46329MB, bw=790660KB/s, iops=197665, runt= 60001msec
    de52_s3700_8r0: write: io=59228MB, bw=987.13MB/s, iops=252702, runt= 60001msec
    de52_s3700_10r0: write: io=74764MB, bw=1242.2MB/s, iops=318000, runt= 60187msec
    de52_s3700_12r0: write: io=80170MB, bw=1336.2MB/s, iops=342053, runt= 60001msec


     Edit:
     Just for completeness' sake: 2 to 8 drives on the SAS3 HBA only (Dell H330)
    s3700_2r0.fio: write: io=16501MB, bw=281610KB/s, iops=70402, runt= 60002msec
    s3700_4r0.fio: write: io=32243MB, bw=550279KB/s, iops=137569, runt= 60001msec
    s3700_6r0.fio: write: io=46510MB, bw=793757KB/s, iops=198439, runt= 60001msec
    s3700_8r0.fio: write: io=62752MB, bw=1045.9MB/s, iops=267726, runt= 60003msec
     
    #8
    Last edited: Mar 8, 2018
    eva2000 likes this.
  9. _alex

    _alex Active Member

    Joined:
    Jan 28, 2016
    Messages:
    874
    Likes Received:
    94
     Not sure this is a SAS limit. Do you cable one drive per port, without an expander?
     What would 8 drives on the HBA alone give?
     If that's more than the 252K from 4 SATA / 4 SAS, the bottleneck is definitely not SAS3.
     
    #9
  10. EffrafaxOfWug

    EffrafaxOfWug Radioactive Member

    Joined:
    Feb 12, 2015
    Messages:
    1,090
    Likes Received:
    363
    Have you turned off the write intent bitmap for the RAID array [mdadm --grow --bitmap=none /dev/md123]? I've seen large flash arrays bottleneck on that before. Note I wouldn't recommend running a production array without a write bitmap.

     Also, see if you can spot any discrepancies in the iostat loads on the drives in the array whilst you run your tests (using a command like `watch "iostat -kx|egrep 'sd|Device'"`). As others have noted, there might be a dodgy SSD, cable or controller somewhere in the mix dragging down the performance of the whole array, and that sort of thing should show up as devices in the list with markedly different perf stats to the others.
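
     For reference, those checks might look like this (md123 is just the example device name from above):

     mdadm --detail /dev/md123 | grep -i bitmap   # is a write-intent bitmap configured?
     mdadm --grow --bitmap=none /dev/md123        # drop it (not recommended for production)
     watch "iostat -kx | egrep 'sd|Device'"       # look for one member with outlier util/await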
     
    #10
  11. _alex

    _alex Active Member

    Joined:
    Jan 28, 2016
    Messages:
    874
    Likes Received:
    94
    there are no bitmaps on raid-0 ;)
     
    #11
  12. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,692
    Likes Received:
    586
     Thanks;
     I had run single-disk tests before to identify any significant differences, and they were within 10% or so, which I consider OK for now :)
     
    #12
  13. EffrafaxOfWug

    EffrafaxOfWug Radioactive Member

    Joined:
    Feb 12, 2015
    Messages:
    1,090
    Likes Received:
    363
     D'oh, missed that you were using RAID0; lizard brain read it as 10, like all good RAIDs should be ;)
     
    #13
  14. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,692
    Likes Received:
    586
    Yeah, just for testers & fun atm :)
     
    #14
  15. _alex

    _alex Active Member

    Joined:
    Jan 28, 2016
    Messages:
    874
    Likes Received:
    94
     Maybe the chunk size (stripe size) can help a bit.
     AFAIK it's 512KB by default, which is a bit high for random 4K.
     But I wouldn't expect too much; consider this one of the last things to tune.

     And yes, RAID-10 with internal bitmaps is the way to go ;)
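
     If you want to play with the chunk size, recreating the array might look like the lines below; it destroys the array contents, and the device names plus the 64K value are only an example:

     mdadm --stop /dev/md0
     mdadm --create /dev/md0 --level=0 --raid-devices=8 --chunk=64 /dev/sd[b-i]
     cat /proc/mdstat   # shows the chunk size of the new array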
     
    #15
  16. pricklypunter

    pricklypunter Well-Known Member

    Joined:
    Nov 10, 2015
    Messages:
    1,546
    Likes Received:
    441
    I went and cleaned my glasses...thought that said striptease to begin with, I was going to agree, those always...ahem...help scaling :D:p
     
    #16
    dswartz and _alex like this.
  17. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,692
    Likes Received:
    586
     Just added the SAS-HBA-only values above for completeness' sake.
     
    #17
  18. Celoxocis

    Celoxocis New Member

    Joined:
    Mar 28, 2017
    Messages:
    3
    Likes Received:
    0
     Run this script (alternative link on his GitHub) to find the optimal input/output block sizes of your SSDs.
     (Run it on a single /dev/sdX, not on the mdX itself.)
     Then apply that block size as the chunk-size value for your mdadm setup.
     Run the benchmarks again and see what you get.
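
     The linked script isn't reproduced in the thread; a crude stand-in that probes a single drive's write throughput at different block sizes could look like this (destructive to the target, /dev/sdb is a placeholder):

     for bs in 4k 8k 16k 32k 64k 128k 256k 512k 1M; do
         echo -n "$bs: "
         dd if=/dev/zero of=/dev/sdb bs=$bs count=2000 oflag=direct 2>&1 | tail -n1
     done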
     
    #18