mdadm RAID10 geometry


EffrafaxOfWug

Radioactive Member
Feb 12, 2015
Long story short - does anyone know of a way to tell which drives are paired with one another in an MD RAID10?

I'm looking to upgrade my array by replacing each drive with a larger one, and it occurred to me that I've never looked for a way to see which drive is paired with which. I'll be able to see this when I replace and resync the drives (as the new drive and its source will both have their activity lights on constantly), but it'd be nice to know if there's a way to ascertain the disk topology a) beforehand and b) without having to physically look at the activity lights. Ideally I'd like to colour-code my drive sleds accordingly.

From mdstat:
Code:
md10 : active raid10 sda1[9] sdg1[7] sdf1[10] sdd1[11] sdb1[6] sde1[8]
      17581171200 blocks super 1.2 512K chunks 2 near-copies [6/6] [UUUUUU]
      bitmap: 0/66 pages [0KB], 131072KB chunk
mdadm --detail /dev/md10 shows sync-set information, but only shows two different sets as opposed to the three I'd expect, so seemingly these don't map to the components of each stripe but rather to the left- or right-hand drives:
Code:
...
    Number   Major   Minor   RaidDevice State
      11       8       49        0      active sync set-A   /dev/sdd1
       6       8       17        1      active sync set-B   /dev/sdb1
      10       8       81        2      active sync set-A   /dev/sdf1
       7       8       97        3      active sync set-B   /dev/sdg1
       9       8        1        4      active sync set-A   /dev/sda1
       8       8       65        5      active sync set-B   /dev/sde1
Thus it seems that the native mdadm tools don't actually report this information.

Short of actually comparing the blocks, does anyone know of any tricks to help ascertain which two drives make up each of the three stripes?
 

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
Following on from my previous comment, after a bit of experimentation I threw together a one-liner that compares the first 64MB of each partition in the array (it skips the first megabyte to try and avoid any metadata which might differ per drive, as I got some weird results without the skip) by doing a quick dd into md5sum; hopefully we'll get two of each hash:
Code:
root@wug:~# for drive in sd{a,b,d,e,f,g}; do echo ${drive}1 && dd if=/dev/${drive}1 skip=1M bs=1M count=64 2>/dev/null|md5sum; done
sda1
5r07939prq6727r468r1o9803s0079rn  -
sdb1
q806350soqr7roqps19noq5s10q87qp6  -
sdd1
q806350soqr7roqps19noq5s10q87qp6  -
sde1
5r07939prq6727r468r1o9803s0079rn  -
sdf1
rono6r3r3485nr4n17oqp331r054r304  -
sdg1
rono6r3r3485nr4n17oqp331r054r304  -
So it looks from the above that we have the following pairs:
sda1 with sde1
sdb1 with sdd1
sdf1 with sdg1

I'll do some more digging to check this is actually valid, but in the meantime if there's a nicer way to do this I'd love to hear it...!
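In case it's useful, here's a slightly tidier sketch of the same idea that groups the partitions by hash so the pairs fall out of the sort (note that dd's skip= counts blocks of the block size, so skip=1 with bs=1M skips exactly one megabyte):
Code:
#!/bin/bash
# Sketch: hash the same 64MB region of each member partition and group by hash;
# partitions that hash identically should be mirrors of each other.
for part in sd{a,b,d,e,f,g}1; do
    # skip the first 1MB block (per-drive metadata), then hash the next 64MB
    hash=$(dd if=/dev/${part} bs=1M skip=1 count=64 2>/dev/null | md5sum | cut -d' ' -f1)
    echo "${hash} ${part}"
done | sort    # identical hashes (i.e. mirror pairs) end up on adjacent lines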
 

Goose

New Member
Jan 16, 2019
Hey, this was an interesting one to think about.

I was going to caution you that md raid10 is potentially not a normal beast, but your near layout is in fact a normal one. The other possibilities are weirder, though.

I also didn't know of a way to see which disks are mirrors of each other, so your idea is a great one. That being said, the info is already in your mdadm --detail output, and your research is basically a proof.

i.e. your proof shows sdd and sdb to be mirrors, and sda and sde to be mirrors.
You can see this matches up well with the --detail output, as sdd and sdb sit next to each other as a set-A/set-B pair and sda and sde do likewise; i.e. you have a 3-way stripe of pairs of disks, hence the three set-A/set-B pairings in the listing.

The way it's displayed actually makes a fair amount of sense when you realise that md has the ability to do n-way mirrors, e.g. 3-way mirrors, which would presumably show up as a sync set with an A, B and C disk.
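As a rough sketch, for a plain 2-copy near layout like yours (where RaidDevice 2n and 2n+1 are mirrors of each other) you could pull the pairs straight out of --detail with something like:
Code:
# Group active members by stripe position: RaidDevice 0+1, 2+3, 4+5 are the mirror pairs
mdadm --detail /dev/md10 | awk '/active sync/ { printf "pair %d: %s\n", int($4/2), $NF }' | sort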

I have no idea how it would be presented if you had, e.g., an 8-disk far offset layout with replicas set to 3.

Then simply use hdparm -tT /dev/sdX to blink the relevant disk and you're good to go.
The bitmap will save your bacon should you do something dumb.

Finally, if you replace them one by one (failing the disk, then removing it, replacing it, etc.), then presumably you shouldn't have any grief.
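Roughly, each swap would look something like this (device names purely as an example, and the final --grow only once every member has been upsized):
Code:
mdadm /dev/md10 --fail /dev/sdd1      # mark the old drive as failed
mdadm /dev/md10 --remove /dev/sdd1    # pull it out of the array
# ...physically swap the drive and partition the new one, then...
mdadm /dev/md10 --add /dev/sdd1       # add the new partition and let it resync
cat /proc/mdstat                      # keep an eye on the rebuild
# once ALL members have been replaced with larger drives:
mdadm --grow /dev/md10 --size=max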
 

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
Glad it caught someone's eye :)

Yep, the mdadm --detail output basically lists the drives in "by pair" order, but it's non-obvious and I couldn't find it documented anywhere either, hence why I was resorting to comparing hashes of the drives. I'll see if I can make head or tail of the source code and maybe kick off a bug report to see if this can be clarified in the man pages (assuming I've not missed it).

I basically wrote a script to do a cleaner job of the hashing (and to turn on the SGPIO lights in pairs accordingly), but as you say that approach only works for the regular "near" layout with conventional mirrored pairs.
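(For the curious, the lights bit is little more than this, assuming your enclosure cooperates with ledctl from the ledmon package - the device names are just an example:)
Code:
# light the locate LEDs on both members of a pair at once
ledctl locate=/dev/sda,/dev/sde
# ...do the labelling/swapping, then turn them off again
ledctl locate_off=/dev/sda,/dev/sde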

And yup, now that I've mapped out the pairs, drive replacements are under way; my drives were bought in two batches, so for added (pointless) paranoia I wanted to make sure each pair was composed of drives from different batches, which is what eventually resulted in this thread.

I almost always stick to the near layout as it typically offers somewhat better random IO. I'll fire up a VM at some point and see what the other geometries might look like...
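If anyone fancies poking at the other geometries without spare hardware in the meantime, something like this on loop devices should do the trick (sizes and names are just examples, and it assumes /dev/loop0-5 are free):
Code:
# build a throwaway RAID10 out of sparse files to compare layouts
for i in 0 1 2 3 4 5; do
    truncate -s 1G /tmp/raid10-$i.img
    losetup /dev/loop$i /tmp/raid10-$i.img
done
# near (n2), far (f2) and offset (o2); change --layout to compare the geometries
mdadm --create /dev/md/test --level=10 --raid-devices=6 --layout=f2 /dev/loop{0..5}
mdadm --detail /dev/md/test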
 

Goose

New Member
Jan 16, 2019
Apparently the ordering was changed to by-replica order sometime in 2013.

I saw that someone suggested mdadm --examine /dev/sdX would be a good idea, and TBH it probably would tell you which drives were pairs in your case.

Near is actually meant to be the worst for random I/O. From the Wikipedia page Non-standard RAID levels - Wikipedia:

"Far" layout is designed for offering striping performance on a mirrored array; sequential reads can be striped, similar to as in RAID 0 configurations.[13] Random reads are somewhat faster, while sequential and random writes offer about equal speed to other mirrored RAID configurations. "Far" layout performs well for systems in which reads are more frequent than writes, which is a common case. For a comparison, regular RAID 1 as provided by Linux software RAID, does not stripe reads, but can perform reads in parallel.[14]

Yeah I couldn't see anything in the man pages either so if you did get them to describe it better then I'm sure a lot of people would owe you a beer someday.
 

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
The --examine option doesn't show any information like that as far as I can see.

Do you mind if I ask where you got the info re: the output being ordered by replica? Is it listed in one of the changelogs, or is this mailing-list voodoo? It's just that if I'm to submit a bug report it would be useful to mention, lest I come across as more of a crank than I really am ;)

Looking at Detail.c, I can see that mdadm basically lists the devices in array.raid_disks in subdevice order, but I don't think there's any clear indication that this is necessarily in pair order (but then I'm not a kernel hacker, nor even a C programmer).

Regarding your middle paragraph - I didn't test it extensively at the time (just half an hour or so dicking about with fio), but the near layout was marginally faster for me on random loads. The (old!) benches at PyCurious: RAID5,6 and 10 Benchmarks on 2.6.25.5 appear to be using sequential loads, so aren't particularly indicative. As I see it, the weakness of the far layout (at least as far as I understand the on-disc format) is that lots of random IOs require seeking from the start of one disc to the end of another, resulting in a lot more head movement for random data and thus correspondingly lower random IO. I don't have an adequate way of proving it, but this, combined with my fio fiddles and a desire to KISS, made the near layout preferable for me.
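For reference, a random-read fio job along these lines is the sort of thing to compare the layouts with (illustrative only - point it at a scratch array rather than anything holding data you care about):
Code:
# 4k random reads straight off the block device, bypassing the page cache
fio --name=randread --filename=/dev/md/test --direct=1 --rw=randread \
    --bs=4k --iodepth=32 --ioengine=libaio --runtime=30 --time_based \
    --group_reporting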

EDIT: I've just rediscovered the reason why I ruled out the far layout:
Linux Raid Wiki
Far mode raid-10 is unique in that it scatters blocks all over the array, so it is very difficult to reshape and currently any attempt to do so will be rejected. It could be done if someone cares to code it.
Being a stingy home user I tend to do a fair amount of reshaping, so this was a deal-breaker for me.
 

Goose

New Member
Jan 16, 2019
Shame about --examine

The man pages use the term replica rather than mirror because not only can you have n-way mirrors, but technically the replicas can also be more or less anywhere, i.e. they can be scattered around in a pseudo-random fashion in a far offset RAID 10. See the Wikipedia page for examples of the layout shenanigans.

In terms of it being ordered by replica, I can't find any doco explaining as much; it's just what it obviously does.

Unless there's a really good reason to use anything other than near, that's what I would stick with.