mdadm questions


Tinkerer

Member
Sep 5, 2020
83
28
18
I want to create (and test) an mdadm RAID10 with 4 enterprise NVMe SSDs: two mirrors, striped (RAID 1+0).

I intend to create:
Code:
mdadm --create /dev/md0 --level 1 --name mirror1 --raid-devices 2 /dev/nvme1n1 /dev/nvme2n1
mdadm --create /dev/md1 --level 1 --name mirror2 --raid-devices 2 /dev/nvme3n1 /dev/nvme4n1
mdadm --create /dev/md2 --chunk 64K --level 0 --name stripeset --raid-devices 2 /dev/md0 /dev/md1
I found the following might help improve performance:
Code:
# Tune md0
echo 16 > /sys/block/md0/md/group_thread_cnt
echo 16384 > /sys/block/md0/md/stripe_cache_size
echo 5000000 > /proc/sys/dev/raid/speed_limit_min;
echo 10000000 > /proc/sys/dev/raid/speed_limit_max
As for the questions about this:
1: Do I set the sysctl on md2 only (the stripe) or for each one?
2: I'm pretty sure I still need to align a partition right?
3: Chunk size: I see people pick large chunk sizes to improve throughput, but I want to tune for random small-block (4k - 16k) IOPS. I don't think a 1M chunk will help, but I'm not sure, so I'm open to suggestions.

Thanks!
 

nexox

Well-Known Member
May 3, 2023
678
282
63
You probably don't want to go about it this way with three arrays; md supports raid level 10 directly in one array, which is almost certainly going to be faster and more reliable. Partition alignment is important, but all the utilities tend to handle that fine now without any extra effort. You may also want to consider LVM rather than static partitions, or you can just format the whole volume and mount it if you want to. Chunk sizes in the 64k-256k range should give you more or less the same performance for random IO; you would need to do some specific benchmarks with your workload and hardware if you really need to get the last couple percent of performance out of the setup, otherwise probably just go with 256k.
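If you do get to the point of benchmarking chunk sizes, fio with a small-block random profile against the raw array is the usual way to do it. Something like this as a starting point (jobs/depth are just an example, tune for your hardware, and note it writes to the device, so run it before you put a filesystem on - /dev/md0 or whatever your array ends up being):
Code:
fio --name=randrw4k --filename=/dev/md0 --direct=1 --ioengine=libaio \
    --rw=randrw --rwmixread=70 --bs=4k --iodepth=32 --numjobs=4 \
    --runtime=60 --time_based --group_reporting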
 
  • Like
Reactions: Tinkerer

gb00s

Well-Known Member
Jul 25, 2018
1,191
602
113
Poland
... Chunk sizes in the 64k-256k range should give you more or less the same performance for random IO; you would need to do some specific benchmarks with your workload and hardware if you really need to get the last couple percent of performance out of the setup, otherwise probably just go with 256k.
I didn't see him mention which file system he's likely to choose.
 
  • Like
Reactions: Tinkerer

acquacow

Well-Known Member
Feb 15, 2017
787
439
63
42
You can just set the level to 10 when you create the array, no need to do nested md devices anymore.

I did this a lot while at Fusion-io and you don't even need to mess with chunk sizes anymore either...that was 5 years ago though.
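Something along these lines is all it takes these days (device names taken from your first post, chunk/layout defaults left alone):
Code:
mdadm --create /dev/md0 --level 10 --name raid10 --raid-devices 4 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1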
 
  • Like
Reactions: gb00s and Tinkerer

Tinkerer

Member
Sep 5, 2020
83
28
18
Thanks all!

So this should do it:
Code:
mdadm --create /dev/md0 --level 10 --name raid10 --raid-devices 4 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1

The reason for the nested arrays was that I read that mdadm level 10 isn't strictly RAID10 and works with 2 and 3 drives too. That makes me really frown.

I still frown, so I'll test it by removing drives and letting it resilver. Or rebuild, or resync, whatever mdadm calls that :cool:.
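Something like this is my plan for that test, I think (picking on nvme1n1 just as an example):
Code:
mdadm /dev/md0 --fail /dev/nvme1n1
mdadm /dev/md0 --remove /dev/nvme1n1
# then add it (or a spare) back and watch the rebuild
mdadm /dev/md0 --add /dev/nvme1n1
watch cat /proc/mdstat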

You might see if you can format the nvme drive for 4K block sizes before you build your array. See:
Yes, thanks. I had already done that. Also updated the firmware with the latest available.

I didn't see him mention which file system he's likely to choose.
Probably good old trusty EXT4. I think I'll dump LVM on there so I can easily create lv's.
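Roughly this, I'm thinking (VG/LV names and the size are just placeholders):
Code:
pvcreate /dev/md0
vgcreate vg_nvme /dev/md0
lvcreate -L 500G -n data vg_nvme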

So 1 question unanswered and 1 new one:
1. The sysctl stuff from the first post. The first 2 don't work (anymore), and I can't find info on why that is. Do I need to do something to make them work, or is that simply outdated?

2. The Arch Wiki mentions specific parameters for formatting EXT4, based on chunk and stripe size. See 3.10 here. The Arch Wiki is usually right ;) so, to confirm, I take it that's still accurate?
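If that section still holds, my back-of-the-envelope math for a 4-disk RAID10 with a 256K chunk and 4K blocks would be stride = 256 / 4 = 64 and stripe-width = stride * 2 data-bearing disks = 128, so something like this (using the placeholder LV from above, happy to be corrected):
Code:
mkfs.ext4 -b 4096 -E stride=64,stripe-width=128 /dev/vg_nvme/data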
 

nexox

Well-Known Member
May 3, 2023
678
282
63
I still frown, so I'll test it by removing drives and letting it resilver. Or rebuild, or resync, whatever mdadm calls that
As far as I understand it, though I have never tested it, md will leave you with a degraded array until you give it a new disk to rebuild onto; it shouldn't shuffle all the data around into a 3-disk raid 10 layout.


1. The sysctl stuff from the first post. The first 2 don't work
The second one, at least, is only for parity raid, so it probably doesn't exist for a mirror/stripe array; the first is probably the same deal, but I haven't looked it up.
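You can check what actually exists for your array; as far as I know stripe_cache_size and group_thread_cnt only show up under raid5/6:
Code:
ls /sys/block/md0/md/
cat /proc/mdstat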