Performance issues with mdadm RAID1/RAID10 on NVMe Gen 5 drives


wnt2048

New Member
Dec 16, 2024
I have a problem with mdadm RAID1 and RAID10 on Debian Linux.
I've performed tests on Debian 12.12 with kernel 6.1 and on Debian 13.1; there is no noticeable difference in results despite the newer kernel.

Hardware configuration, simplified as much as possible:
2 × NVMe Gen 5 WD BLACK SN8100 drives connected directly to the motherboard's M.2 slots (no controllers, adapters, converters, etc.)
Baseboard: Supermicro H13SAE-MF (tested with different BIOS firmware versions)
CPU: AMD EPYC 4564P

mdadm raid commands:
mdadm --create /dev/md0 --assume-clean --run --level=0 --raid-devices=2
mdadm --create /dev/md1 --assume-clean --run --level=1 --raid-devices=2

I've also tested options (if applicable to the RAID level):
--chunk=64k / 128k / 512k / 1m
--bitmap=none
--consistency-policy=resync

ext4 filesystem command:
mkfs.ext4 /dev/md0 -E lazy_journal_init=0,lazy_itable_init=0
mount /dev/md0 /mnt

I've also tested options:
stride=,stripe-width= based on raid chunk size
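For illustration (a sketch only, not my exact command): with a 512k chunk on the two-drive RAID 0 and 4k filesystem blocks, the values work out as:
Code:
# stride = chunk size / block size = 512KiB / 4KiB = 128
# stripe-width = stride * number of data drives = 128 * 2 = 256 (RAID 0; for RAID 1 stripe-width isn't meaningful)
mkfs.ext4 -b 4096 -E stride=128,stripe-width=256 /dev/md0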

The --assume-clean option in mdadm and the lazy_* options in mkfs ensure that no background activity (initial resync, lazy inode-table/journal initialization) runs on the disks during the tests.
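As a quick sanity check that nothing is running in the background during a test, something like this can be used (the ext4lazyinit kernel thread only exists while lazy init is still in progress):
Code:
cat /proc/mdstat               # should show no resync/recovery in progress
ps ax | grep '[e]xt4lazyinit'  # should return nothing if lazy init is disabled or finished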

The tests are performed using fio v.3.33:

I've tested different fio parameters [changing iodepth, numjobs, ioengine, bs]
Below are example tests and results:

Sequential read
fio --name=seq-read --filename=/mnt/test --direct=1 --rw=read --bs=1M --iodepth=32 --numjobs=1 --size=10G --time_based --runtime=60s --group_reporting --ioengine=libaio

Sequential write
fio --name=seq-write --filename=/mnt/test --direct=1 --rw=write --bs=1M --iodepth=32 --numjobs=1 --size=10G --time_based --runtime=60s --group_reporting --ioengine=libaio

Random read
fio --name=rand-read --filename=/mnt/test --direct=1 --rw=randread --bs=4k --iodepth=32 --numjobs=64 --size=10G --time_based --runtime=60s --group_reporting --ioengine=libaio

Random write
fio --name=rand-write --filename=/mnt/test --direct=1 --rw=randwrite --bs=4k --iodepth=32 --numjobs=64 --size=10G --time_based --runtime=60s --group_reporting --ioengine=libaio

Aggregated results:
Code:
+---------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
|               | seq-read                            | seq-write                           | rand-read                           | rand-write                          |
+---------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
| single drive  | IOPS=7129, BW=13.9GiB/s (15.0GB/s)  | IOPS=6619, BW=12.9GiB/s (13.9GB/s)  | IOPS=2345k, BW=9161MiB/s (9606MB/s) | IOPS=2207k, BW=8621MiB/s (9040MB/s) |
+---------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
| raid 0        | IOPS=14.3k, BW=27.9GiB/s (29.9GB/s) | IOPS=9951, BW=19.4GiB/s (20.9GB/s)  | IOPS=4709k, BW=18.0GiB/s (19.3GB/s) | IOPS=4182k, BW=16.0GiB/s (17.1GB/s) |
+---------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
| raid 1        | IOPS=14.3k, BW=27.9GiB/s (29.9GB/s) | IOPS=6554, BW=12.8GiB/s (13.7GB/s)  | IOPS=3495k, BW=13.3GiB/s (14.3GB/s) | IOPS=615k, BW=2403MiB/s (2520MB/s)  |
+---------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
Single drive:
As you can see, the fio results for a single drive are quite close to the manufacturer's specs (seq read/write: 14,900 / 14,000 MB/s; random read/write: 2,300k / 2,400k IOPS).

Raid 0:
RAID 0 should read at roughly double the speed of a single drive (since it reads from two drives) and write at roughly double the speed (due to striping), and as the tests show, almost everything does double (except seq write, which could be a bit higher).

Raid 1:
This is my biggest problem.
I expected doubled reads and writes compared to a single drive.
Everything looks fine for sequential operations.
But random read is about 25% worse than RAID 0, and random write is about 70% worse than a single drive.
Why? Is this normal? I understand that software RAID has some overhead, but this much?
Is there any way to improve these results?

Originally, the problem I was dealing with concerned mdadm RAID10, but after digging in, I realized something is wrong with mirroring on such fast NVMe drives (at least in my case).
Has anyone noticed a similar problem? Can anything be done about it?

Thanks a lot for the help!
 

i386

Well-Known Member
Mar 18, 2016
Germany
With SATA consumer SSDs there was a drop in performance once the RAM on the controller was full and it started writing to the actual NAND.
The WD SN8100 (according to a quick Google search) is also a consumer SSD and could have the same problem...
 

nexox

Well-Known Member
May 3, 2023
I would test without a filesystem in the way, and if you really want to see maximum performance from SSDs they should be secure erased (or at least have every sector trimmed) to restore them to like-new performance. Then again, unless you're trying to validate performance against a specific requirement, I wouldn't bother benchmarking at all: with no target or goal you have no basis for choosing a benchmark method, so what do those numbers even tell you?
 

wnt2048

New Member
Dec 16, 2024
With SATA consumer SSDs there was a drop in performance once the RAM on the controller was full and it started writing to the actual NAND.
The WD SN8100 (according to a quick Google search) is also a consumer SSD and could have the same problem...
I know there have been such cases with SATA SSDs, but I think the problem here lies somewhere else. If the random write IOPS really dropped that much, the same effect would appear on a single drive, because I run the same fio tests on the single drive and on the two mirrored drives.
RAID 1 should write data in parallel to both drives, so it should perform almost the same as on a single disk.
But instead of the expected 2,200K IOPS, I’m getting only around 600K. It’s as if RAID 1 first writes to one drive, then the other, waits a bit, and ends up with 600K IOPS.

I would test without a filesystem in the way, and if you really want to see maximum performance from SSDs they should be secure erased (or at least have every sector trimmed) to restore them to like-new performance. Then again, unless you're trying to validate performance against a specific requirement, I wouldn't bother benchmarking at all: with no target or goal you have no basis for choosing a benchmark method, so what do those numbers even tell you?
I’m deliberately not testing theoretical performance on raw block devices.
What I’m trying to find is a stable and fast setup for databases; that’s why I’m testing ext4, which will most likely be the final filesystem.
I also test random write load in fio because I want to squeeze out as many random-write IOPS as possible from multiple threads, similar to how my databases behave under daily load.
These numbers tell me something’s wrong with the software RAID (mdadm) or with some hardware/Linux configuration, because RAID 1 doesn’t seem to write at full speed to both mirrored drives.
It’s absurd that a single drive gives me three times better random write performance than a RAID 1 array.
 

nexox

Well-Known Member
May 3, 2023
It’s absurd that a single drive gives me three times better random write performance than a RAID 1 array.
If that was the last test you ran, the SSD might have just run out of cache, pSLC, or erased blocks. This, incidentally, is not going to tell you anything about long-term database performance.
 

wnt2048

New Member
Dec 16, 2024
If that was the last test you ran, the SSD might have just run out of cache, pSLC, or erased blocks. This, incidentally, is not going to tell you anything about long-term database performance.
These results are consistent. Before posting on the forum, I ran multiple tests on different Debian versions with various configurations, and each time I observed high IOPS from a single drive but low IOPS for RAID 1 in random write tests.
If the issue were related to the SLC cache/erased blocks/etc. running out, then why does it always occur on RAID 1 and never on a single drive?
Moreover, tests of this drive show the SLC cache size is around 600 GB:
WD Black SN8100 2 TB Review
WD_Black SN8100 Gen5 SSD Review - SanDisk's Fastest M.2 SSD To Date
The FIO tests I perform use a 10 GB file, so I am far below that 600 GB limit.
In my opinion, this is not where the problem lies.
 

nexox

Well-Known Member
May 3, 2023
I just noticed you're using the --direct flag, which doesn't always work for software RAID arrays; the kernel may just ignore it, giving you benchmarks that aren't comparable.
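One rough way to check whether direct I/O is actually bypassing the page cache (a sketch, not definitive) is to watch the cache counters around a run; if O_DIRECT is being honoured they should stay roughly flat:
Code:
grep -E 'Cached|Dirty' /proc/meminfo   # snapshot before the fio run
# ...run the fio test...
grep -E 'Cached|Dirty' /proc/meminfo   # if these grew by roughly the test size, I/O went through the page cache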
 

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
I expected doubled reads and writes compared to a single drive.
Yes for RAID0, no for RAID1. Whilst RAID1 is still able to distribute reads across both drives under most circumstances, writes still have to land on both/all devices before the write can be confirmed - so write speed in a RAID1 array can never be faster than the write speed of the slowest drive in the array.

As others have pointed out, you're not using enterprise drives, so it's possible one or both of your drives might have had their pSLC cache exhausted - the TPU review showed the 2TB version of the SN8100 slowing down after ~600GB of writes - and you need to wait for the drive to do its thing (how long this actually takes varies considerably and I'm not even sure it's measurable) before performance will recover.

It could be one of your drives being slow, or it could be that there really is a problem with mdadm. It's easy enough to check; either mdadm --fail one of the discs out of the array, or recreate the RAID1 arrays with only one device each, e.g.
Code:
mdadm --create /dev/md0 --level 1 --raid-disks 2 --assume-clean /dev/nvme0n1p1 missing
mdadm --create /dev/md1 --level 1 --raid-disks 2 --assume-clean /dev/nvme1n1p1 missing
...then run the same tests against each not-really-a-RAID1 and see if there's any significant difference. This way you're effectively testing against a single drive, just with mdadm in the way, so if the "single drive RAID1" results are noticeably inferior to the "single bare drive" results then you've got a pointer towards mdadm being the problem. If the two arrays' results are markedly different from one another, then that's an indication that one of the drives is the problem.
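For the --fail route instead, a rough sketch (device names assumed; re-adding afterwards will trigger a resync):
Code:
mdadm /dev/md1 --fail /dev/nvme1n1p1 --remove /dev/nvme1n1p1
# ...run the fio tests against the now-degraded /dev/md1...
mdadm /dev/md1 --add /dev/nvme1n1p1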

I don't think the --assume-clean param actually does anything when you create a two-drive RAID1 with a missing drive, but I don't think it hurts having it there. It's probably irrelevant, but since it's just for testing you might want to add -E nodiscard to your mkfs.ext4 to make sure the murky world of TRIM never rears its head. I see you've already tried nuking write-intent bitmaps/using --consistency-policy=resync.
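i.e. something along these lines, combined with the lazy_* options already in use:
Code:
mkfs.ext4 -E nodiscard,lazy_itable_init=0,lazy_journal_init=0 /dev/md0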

One of the biggest reasons enterprise drives are (or can be) priced so high is that they're generally far more performant and consistent than consumer drives in heavy I/O scenarios that most consumer drives will never need to handle. Under most consumer use cases it's not an issue, but then something like a drive failure and a RAID rebuild happens, and resync performance drops off a cliff a third of the way through the job.

I feel your pain though; the selection of enterprise M.2 is pretty poor, and how consumer M.2 drives behave isn't terribly well tested. But if it's any consolation, here's a re-run of some of your benches on a pair of Crucial T500s ("only" PCIe4, mdadm RAID1 + LVM). The NVMe cooling in this thing isn't great, so I waited a good ~20 mins between runs for the drives to get back to ambient.
Code:
+---------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
|               | seq-read                            | seq-write                           | rand-read                           | rand-write                          |
+---------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
| raid 1        | IOPS=6824, BW=6825MiB/s (7156MB/s)  | IOPS=6409, BW=6409MiB/s (6721MB/s)  | IOPS=2195k, BW=8576MiB/s (8993MB/s) | IOPS=585k, BW=2284MiB/s (2395MB/s)  |
+---------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
Incidentally, pay attention to the full output of fio and you can see how much data was read and especially written during the tests. In the case of writes, it's likely to be very large - for me it was 134GiB for rand-write and 376GiB for seq-write, enough to put a heavy dent in any pSLC cache, and probably much larger for you. Similarly, look and see if there are any major variations in the latency figures; I've seen all manner of bad drives, loose connections and dodgy cables play havoc there in ways that aren't always immediately obvious.
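If you want to see the same thing from the drive's side, the NVMe SMART log keeps a running total of host writes (each data unit is 512,000 bytes); device name assumed:
Code:
nvme smart-log /dev/nvme0n1 | grep -i data_units_written
smartctl -a /dev/nvme0n1 | grep -i 'data units written'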
Maybe think about dropping the runtime down to 30s or lower; I think that's still plenty of time to give valid results without chewing your drives up quite so much. And normally I'd go for much lower values of iodepth and numjobs if I were aiming for a more realistic benchmark, as well as testing mixed read/write, depending on your use-case.
Running summat like iostat -kx while your benches are going can give you a semblance of an idea of what each drive's busy-ness looks like during the bench.
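Something like this in a second terminal (device/array names assumed):
Code:
iostat -kx 1 nvme0n1 nvme1n1 md1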
 