I have a problem with mdadm RAID1 and RAID10 on Linux Debian.
I've performed tests on Debian 12.12 with kernel 6.1 and Debian 13.1 (there is no noticeable difference in results despite the newer kernel).
Maximally simplified hardware configuration:
2 NVMe Gen 5 WD BLACK SN8100 drives connected directly to the motherboard's M.2 slots (no add-in controllers, adapters, or converters)
Baseboard: Supermicro H13SAE-MF (tested with several BIOS firmware versions)
CPU: AMD EPYC 4564P
mdadm RAID commands (component devices omitted here; a full example follows below):
mdadm --create /dev/md0 --assume-clean --run --level=0 --raid-devices=2
mdadm --create /dev/md1 --assume-clean --run --level=1 --raid-devices=2
I've also tested options (if applicable to the RAID level):
--chunk=64k / 128k / 512k / 1m
--bitmap=none
--consistency-policy=resync
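For clarity, a full invocation with the component devices spelled out looks like this (the NVMe device names below are only an example):
# example component devices; the actual device/partition names on my system may differ
mdadm --create /dev/md0 --assume-clean --run --level=0 --raid-devices=2 --chunk=512k /dev/nvme0n1 /dev/nvme1n1
mdadm --create /dev/md1 --assume-clean --run --level=1 --raid-devices=2 --bitmap=none /dev/nvme0n1 /dev/nvme1n1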
ext4 filesystem commands:
mkfs.ext4 /dev/md0 -E lazy_journal_init=0,lazy_itable_init=0
mount /dev/md0 /mnt
I've also tested options:
stride= / stripe_width= derived from the RAID chunk size (worked example below)
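As a worked example: with a 512k chunk and the default 4k ext4 block size, stride = 512k / 4k = 128, and for RAID 0 with two data drives stripe_width = 128 * 2 = 256 (RAID 1 has no striping, so these options don't really apply there):
# assumes --chunk=512k and 4k filesystem blocks
mkfs.ext4 /dev/md0 -E lazy_journal_init=0,lazy_itable_init=0,stride=128,stripe_width=256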
The --assume-clean option in mdadm and the lazy_* options in mkfs ensure that no background initialization or resync runs on the disks during the tests.
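This can be double-checked before each run, for example:
# a resync/recovery in progress would show up as a progress line here
cat /proc/mdstat
mdadm --detail /dev/md1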
The tests are performed using fio 3.33.
I've tested different fio parameters (varying iodepth, numjobs, ioengine, and bs).
Below are example tests and results:
Sequential read
fio --name=seq-read --filename=/mnt/test --direct=1 --rw=read --bs=1M --iodepth=32 --numjobs=1 --size=10G --time_based --runtime=60s --group_reporting --ioengine=libaio
Sequential write
fio --name=seq-write --filename=/mnt/test --direct=1 --rw=write --bs=1M --iodepth=32 --numjobs=1 --size=10G --time_based --runtime=60s --group_reporting --ioengine=libaio
Random read
fio --name=rand-read --filename=/mnt/test --direct=1 --rw=randread --bs=4k --iodepth=32 --numjobs=64 --size=10G --time_based --runtime=60s --group_reporting --ioengine=libaio
Random write
fio --name=rand-write --filename=/mnt/test --direct=1 --rw=randwrite --bs=4k --iodepth=32 --numjobs=64 --size=10G --time_based --runtime=60s --group_reporting --ioengine=libaio
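For reference, the same random-write pattern can also be pointed at the raw md device instead of a file on ext4, to take the filesystem out of the equation (the numbers in the table below are from the file-on-ext4 runs only):
# warning: writing to /dev/md1 directly destroys any filesystem on it
fio --name=rand-write-raw --filename=/dev/md1 --direct=1 --rw=randwrite --bs=4k --iodepth=32 --numjobs=64 --size=10G --time_based --runtime=60s --group_reporting --ioengine=libaio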
Aggregated results:
Code:
+---------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
| | seq-read | seq-write | rand-read | rand-write |
+---------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
| single drive | IOPS=7129, BW=13.9GiB/s (15.0GB/s) | IOPS=6619, BW=12.9GiB/s (13.9GB/s) | IOPS=2345k, BW=9161MiB/s (9606MB/s) | IOPS=2207k, BW=8621MiB/s (9040MB/s) |
+---------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
| raid 0 | IOPS=14.3k, BW=27.9GiB/s (29.9GB/s) | IOPS=9951, BW=19.4GiB/s (20.9GB/s) | IOPS=4709k, BW=18.0GiB/s (19.3GB/s) | IOPS=4182k, BW=16.0GiB/s (17.1GB/s) |
+---------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
| raid 1 | IOPS=14.3k, BW=27.9GiB/s (29.9GB/s) | IOPS=6554, BW=12.8GiB/s (13.7GB/s) | IOPS=3495k, BW=13.3GiB/s (14.3GB/s) | IOPS=615k, BW=2403MiB/s (2520MB/s) |
+---------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
Single drive:
As you can see, the fio results for a single drive are quite close to the manufacturer's specs (seq read/write: 14900 MB/s / 14000 MB/s, random read/write: 2300k / 2400k IOPS).
RAID 0:
RAID 0 should read and write at roughly double the single-drive speed (striping spreads the I/O across both drives), and the tests show almost everything doubling; only sequential write falls short at about 1.5x and could probably be a bit higher.
RAID 1:
This is my biggest problem.
I expected roughly doubled reads (RAID 1 can balance reads across both mirrors) and writes on par with a single drive (every write has to go to both drives).
Everything looks fine for sequential operations.
But for random reads I get about 25% lower IOPS than RAID 0 (3495k vs 4709k), and for random writes about 70% lower than a single drive (615k vs 2207k).
Why? Is this normal? I understand that software RAID has some overhead, but this much?
Is there any way to improve these results?
Originally, the problem I was dealing with concerned mdadm RAID 10, but after further analysis I realized that something is wrong with mirroring on such fast NVMe drives (at least in my setup).
Has anyone noticed a similar problem? Can anything be done about it?
Thanks a lot for the help!