MDADM RAID6 - unstable behavior


Sergiu

New Member
Jul 11, 2019
12
2
1
Hello,

I have a big fat server with 22 x 15.36TB Micron 9300 Pro SSDs connected directly to the motherboard via PCIe lanes. I have set up a RAID6 configuration without any write caching, giving an array of about 307.2 TB. On top of the array sits a big fat MySQL instance, and what I observed is that out of nowhere, without any server restart or error, the array started a check that essentially froze all writes. At some point it ended up issuing no IO requests at all, with mdadm pinning one core at 100%. The array state was active, checking, stuck at 99.9%, even though all blocks had actually been checked.

The array was built with a chunk size of 256KB, mdadm is v4.1 - 2018-10-01 (Ubuntu Server 20.04 LTS), and the array was later formatted as ext4. The initial array sync also took 4 days.
Is there any way to debug mdadm, or are there known limits for configurations like mine? Or any special issue with rewriting the same chunk over and over (the MySQL write pattern ends up issuing bursts of write and flush commands for the same space when doing a lot of transactions)?
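For anyone reading along, the check state is exposed through /proc/mdstat and sysfs; a quick sketch, with md2 as the array name here:
Code:
# Overall array state and check progress
cat /proc/mdstat

# Current background operation: idle, check, resync, repair, ...
cat /sys/block/md2/md/sync_action

# Abort a stuck check (regular IO resumes)
echo idle > /sys/block/md2/md/sync_action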
 

MBastian

Active Member
Jul 17, 2016
205
59
28
Düsseldorf, Germany
Is there really nothing in `dmesg` or `journalctl`? Also check whether you have any cron or systemd jobs that periodically scrub the array.
On a side note: RAID6 with more than 12 drives and no hot spares is an invitation to disaster. Also: why not ZFS?
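A quick way to look for a scheduled scrub on Debian/Ubuntu (paths and timer names vary between mdadm package versions, so treat this as a sketch):
Code:
# Kernel and journal messages mentioning md/raid
dmesg | grep -i raid
journalctl -k | grep -i -e md: -e raid

# Classic Debian/Ubuntu cron-based check
cat /etc/cron.d/mdadm 2>/dev/null

# Newer packages schedule the check via systemd timers instead
systemctl list-timers | grep -i md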
 

Sergiu

New Member
Jul 11, 2019
12
2
1
Nothing useful in dmesg. No cron jobs for scrubbing that I am aware of; the OS is freshly installed. Leaving aside the invitation to disaster, is there any known issue with RAID6 in general, aside from needing a write cache?

I did an early test with ZFS (zfs-0.8.3-1ubuntu12.9). It ate 90 cores (I have 2 x AMD 7763) writing at 2.8GB/s, uncompressed. mdadm uses about one core per 500MB/s. I have a target of sustained write performance of 5GB/s with a budget of no more than 10% server load; only mdadm fits for now. Plus, for my stack, ZFS is less SSD-friendly.
 

MBastian

Active Member
Jul 17, 2016
205
59
28
Düsseldorf, Germany
I have a target of sustained write performance of 5GB/s with a budget of no more than 10% server load, only mdadm fits for now.
Did you test the worst case? I am unsure whether a degraded or rebuilding software RAID6 array would still meet the above criteria.
If reliability is not an issue you could test a RAID50 setup. And you can't beat the speed and rebuild times of a RAID10 array, if you can take the 50% available-space hit.

What is the output of `cat /sys/block/<mdX>/md/sync_action` and `mdadm -D /dev/<mdX>`?
 

Sergiu

New Member
Jul 11, 2019
12
2
1
I am in the process of testing the worst-case scenario, and all applications are now running on software RAID. Previously we had hardware RAID and it just worked, but with NVMe that's harder to build. Output of the commands is below. From what I can tell, it's scrubbing the data in the foreground, freezing all writes. Is there any way to do it in the background?

Code:
cat /sys/block/md2/md/sync_action
check
mdadm -D /dev/md2
/dev/md2:
Version : 1.2
Creation Time : Tue Jun 22 13:11:59 2021
Raid Level : raid6
Array Size : 300055782400 (286155.49 GiB 307257.12 GB)
Used Dev Size : 15002789120 (14307.77 GiB 15362.86 GB)
Raid Devices : 22
Total Devices : 22
Persistence : Superblock is persistent

Intent Bitmap : Internal

Update Time : Tue Jul 6 11:23:52 2021
State : active, checking
Active Devices : 22
Working Devices : 22
Failed Devices : 0
Spare Devices : 0

Layout : left-symmetric
Chunk Size : 256K

Consistency Policy : bitmap

Check Status : 10% complete

Name : srv2103:2 (local to host srv2103)
UUID : afffe981:f27116e2:e2ab5ea9:7011e1ab
Events : 57620

Number Major Minor RaidDevice State
0 259 20 0 active sync /dev/nvme0n1
1 259 3 1 active sync /dev/nvme10n1
2 259 4 2 active sync /dev/nvme11n1
3 259 0 3 active sync /dev/nvme12n1
4 259 9 4 active sync /dev/nvme13n1
5 259 15 5 active sync /dev/nvme14n1
6 259 8 6 active sync /dev/nvme15n1
7 259 16 7 active sync /dev/nvme16n1
8 259 10 8 active sync /dev/nvme17n1
9 259 14 9 active sync /dev/nvme18n1
10 259 12 10 active sync /dev/nvme19n1
11 259 19 11 active sync /dev/nvme1n1
12 259 13 12 active sync /dev/nvme20n1
13 259 7 13 active sync /dev/nvme21n1
14 259 21 14 active sync /dev/nvme2n1
15 259 18 15 active sync /dev/nvme3n1
16 259 2 16 active sync /dev/nvme4n1
17 259 1 17 active sync /dev/nvme5n1
18 259 6 18 active sync /dev/nvme6n1
19 259 17 19 active sync /dev/nvme7n1
20 259 5 20 active sync /dev/nvme8n1
21 259 11 21 active sync /dev/nvme9n1
 

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
1,394
511
113
What are your RAID rebuild min/max speeds set to? I don't fully understand why it would freeze all IO unless there was something wrong, but if your rebuild speeds are set too high it's common to see regular IO get hit.
Code:
effrafax@wug:~$ cat /proc/sys/dev/raid/speed_limit_max
5000000
Parity RAID usually benefits considerably from a larger stripe cache and potentially a larger thread count:
Code:
effrafax@wug:~$ cat /sys/block/md6/md/stripe_cache_size
16384
effrafax@wug:~$ cat /sys/block/md6/md/group_thread_cnt
4
However, with the amount of bandwidth and IOPS at your disposal I'd be very surprised if you ran into anything other than CPU bottlenecks. Have you tried giving iostat a run to see if any of your drives are noticeably busier or slower than the others? What's your PCIe topology - is it possible some of the drives are being throttled?
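For reference, those knobs can all be changed at runtime, and the PCIe link of each drive can be checked too; a sketch with example values only, md2 and the PCI address being placeholders:
Code:
# Resync/check throttling (KB/s)
echo 200000  > /proc/sys/dev/raid/speed_limit_min
echo 2000000 > /proc/sys/dev/raid/speed_limit_max

# Larger stripe cache and more parity threads for md2
echo 16384 > /sys/block/md2/md/stripe_cache_size
echo 4     > /sys/block/md2/md/group_thread_cnt

# Per-drive utilisation, throughput and request sizes
iostat -xm 2 /dev/nvme0n1 /dev/nvme1n1

# Negotiated PCIe link speed/width of one drive (fill in the address from lspci)
lspci -vv -s <pci_address> | grep -i lnksta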

Incidentally, the mdadm resync on Debian, at least, is usually triggered from /etc/cron.d/mdadm - I suspect Ubuntu is very similar.
Code:
effrafax@wug:~$ cat /etc/cron.d/mdadm
#
# cron.d/mdadm -- schedules periodic redundancy checks of MD devices
#
# Copyright © martin f. krafft <madduck@madduck.net>
# distributed under the terms of the Artistic Licence 2.0
#

# By default, run at 00:57 on every Sunday, but do nothing unless the day of
# the month is less than or equal to 7. Thus, only run on the first Sunday of
# each month. crontab(5) sucks, unfortunately, in this regard; therefore this
# hack (see #380425).
57 0 * * 0 root if [ -x /usr/share/mdadm/checkarray ] && [ $(date +\%d) -le 7 ]; then /usr/share/mdadm/checkarray --cron --all --idle --quiet; fi
 
Last edited:

Sergiu

New Member
Jul 11, 2019
12
2
1
speed_limit_max = 3000000
speed_limit_min = 200000
My issue was that writes would not go through while the check was running. I have just changed speed_limit_min to 10000 and now I finally see writes passing.

Setting group_thread_cnt to 6 made a huge difference in write consistency during the check; stripe_cache_size changes had no significant effect. Upon investigation, there apparently is a scheduled consistency check that runs for 6 hours continuously. So I think most of the issues are sorted now, as I can finally see writes.

Now another issue I observed: on my own machine, which has SATA SSDs, the check issues reads with a request size of exactly 672KB per drive. On the server, however, I see a request size of 4KB per drive during the check, which just does not make sense; it should be in the range of hundreds of KB. I cannot figure out what to configure to make mdadm read in larger chunks per drive.
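For what it's worth, per-drive request sizes can be watched with iostat (column names differ between sysstat versions: rareq-sz/wareq-sz in newer ones, avgrq-sz in older ones):
Code:
# Extended per-device stats, refreshed every 2 seconds
iostat -x 2 /dev/nvme0n1 /dev/nvme1n1

# Array chunk size (bytes) and current stripe cache
cat /sys/block/md2/md/chunk_size
cat /sys/block/md2/md/stripe_cache_size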
 

Stephan

Well-Known Member
Apr 21, 2017
920
698
93
Germany
Back when I was still using mdraid instead of ZFS, I ran into a similar issue with spinning rust, where any sync or check would make the array next to unusable. I tracked it down to the following change (introduced in 2015, see "[PATCH/RFC/RFT] md: allow resync to go faster when there is competing IO." — Linux RAID Storage, and commit "md: allow resync to go faster when there is competing IO." · torvalds/linux@ac8fa41) and used the patch below for a long time to revert md back to its original behaviour. No idea whether you can build and run a patched kernel to test this out.

Also make sure to run all array background operations in the idle I/O scheduling class (as checkarray --idle does, by running ionice on the resync process ID).
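A sketch of what that amounts to, assuming the array is md2 (its resync kernel thread shows up as md2_resync); note the idle class only helps with IO schedulers that honour it:
Code:
# Put the running check/resync thread into the idle IO class
ionice -c3 -p "$(pgrep -x md2_resync)"

# Optionally also lower its CPU priority
renice -n 19 -p "$(pgrep -x md2_resync)"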

Running databases on ZFS is tricky. See https://people.freebsd.org/~seanc/postgresql/scale15x-2017-postgresql_zfs_best_practices.pdf for a PostgreSQL primer, or "About ZFS recordsize" – JRS Systems: the blog for some MySQL info. From correct ashift to properly tuned dataset recordsize, compression and so on, plus a lot of parameters inside the database itself, there is much to consider. For my purposes I usually have "write once, rot for years" scenarios, so I prefer a filesystem that detects errors not just in metadata but also in data, hence ZFS.
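To make that concrete, a minimal sketch of a commonly recommended starting point for MySQL/InnoDB on ZFS; pool name, devices and values are assumptions to adapt, not a tuned recommendation:
Code:
# Pool with 4K-native sector alignment (example devices)
zpool create -o ashift=12 tank raidz2 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1

# Dataset matched to InnoDB's 16K page size
zfs create -o recordsize=16k -o compression=lz4 -o atime=off \
           -o logbias=throughput -o primarycache=metadata tank/mysql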

C:
--- linux-4.11.orig/drivers/md/md.c     2017-05-01 04:47:48.000000000 +0200
+++ linux-4.11/drivers/md/md.c  2017-05-07 04:49:09.105594510 +0200
@@ -8539,18 +8539,11 @@ void md_do_sync(struct md_thread *thread
                        /((jiffies-mddev->resync_mark)/HZ +1) +1;
 
                if (currspeed > speed_min(mddev)) {
-                       if (currspeed > speed_max(mddev)) {
+                       if ((currspeed > speed_max(mddev)) ||
+                                       !is_mddev_idle(mddev, 0)) {
                                msleep(500);
                                goto repeat;
                        }
-                       if (!is_mddev_idle(mddev, 0)) {
-                               /*
-                                * Give other IO more of a chance.
-                                * The faster the devices, the less we wait.
-                                */
-                               wait_event(mddev->recovery_wait,
-                                          !atomic_read(&mddev->recovery_active));
-                       }
                }
        }
        pr_info("md: %s: %s %s.\n",mdname(mddev), desc,
 

Sergiu

New Member
Jul 11, 2019
12
2
1
I think what I am seeing is not the effect of that patch but poor concurrency. For example, I expected the best results with a large group_thread_cnt, so I set it to 24, but performance during the check went down by a factor of 5. Setting it to 4 led to far better performance.
Each SSD is capable of 850K random 4K IOPS, so I have a theoretical throughput of about 17M IOPS, yet the system is not doing more than 100K right now.
Coming from hardware RAID solutions, mdadm so far looks like a bad joke... I have already spent over two days digging for optimization guidelines, with nothing to show for it. It could also be the hardware, which may be slightly too new (AMD Milan), but I doubt it's responsible for most of the issues.
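One way to separate an md bottleneck from a drive or PCIe bottleneck is to run the same pattern against a single member drive and against the array; a sketch with fio, parameters being only a starting point (random reads, non-destructive):
Code:
# Raw 4K random-read IOPS of one member drive
fio --name=dev --filename=/dev/nvme0n1 --direct=1 --ioengine=libaio \
    --rw=randread --bs=4k --iodepth=64 --numjobs=8 --runtime=60 \
    --time_based --group_reporting

# Same pattern against the array
fio --name=md --filename=/dev/md2 --direct=1 --ioengine=libaio \
    --rw=randread --bs=4k --iodepth=64 --numjobs=8 --runtime=60 \
    --time_based --group_reporting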

@Stephan, what performance numbers have you observed with ZFS after fine-tuning? Can it easily do 5GB/s reads/writes, or reach millions of IOPS if tuned?
 

Stephan

Well-Known Member
Apr 21, 2017
920
698
93
Germany
Sorry, can't share numbers without breaching NDAs. But truenas.com has some ballpark numbers with their TrueNAS Core product, which uses ZFS.

With this many SSDs and this much storage you will run into a bit flip every couple of months. mdadm RAID6 will not prevent this, aside from checkarray being able to detect a single error and fix it. But to my knowledge it can't detect such errors "live" when data is read back from disk and ingested by MySQL, only during patrol reads with checkarray. What follows is garbage in, garbage out. Hence ZFS, which checks data checksums on every read and takes corrective action right away if something smells funny.
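For completeness, a manual patrol read and the resulting mismatch counter look roughly like this, md2 being a placeholder:
Code:
# Start a check at idle IO priority
/usr/share/mdadm/checkarray --idle /dev/md2

# Mismatches found by the last completed check
cat /sys/block/md2/md/mismatch_cnt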

This will cost you some performance, and naive thinking might suggest just rolling with mdadm. Your first corrupt MySQL database, with the database offline and mysqlcheck running for hours and no idea what actually got corrupted, might make you rethink.
 

Sergiu

New Member
Jul 11, 2019
12
2
1
@Stephan
From my understanding, since those are enterprise SSDs with strong error-correction codes, bit flips are easily detected and corrected by the SSD itself. If there is an uncorrectable read error, the SSD should just return a read error to the upper layer, which would inform mdadm / ZFS / the RAID controller about the issue; the standard strategy would then be to read from the replica in RAID1-style schemes, or reconstruct from parity. I already have large HDDs in RAID10 and SSDs in RAID6 (46TB), and so far we have never observed corruption with hardware RAID. The only strange case was when one drive died in the RAID10 and the array was temporarily corrupted until we deactivated the bad drive. Are you talking about bit flips at other layers?
 

Stephan

Well-Known Member
Apr 21, 2017
920
698
93
Germany
The issue is when you write a 0 and the drive returns a 1, and the drive itself does not recognize the problem. The cause could be in the flash, a firmware bug, or a problem in the NVMe interface. The paranoid ZFS designers verify the data once it is back in the CPU, just to be on the safe side. From my understanding, on systems this large that pushes the error rate down to once every couple of decades instead of once per year.
 

lihp

Active Member
Jan 2, 2021
186
53
28
Hello,

I have a big fat server with 22 x 15.36TB Micron 9300 Pro SSDs connected directly to the motherboard via PCIe lanes. I have set up a RAID6 configuration without any write caching, giving an array of about 307.2 TB.
Just to be sure: one RAID6 array of 22 drives? With old rust that's a problem already; with NVMe you multiply those issues due to latency and RAID6 stripe calculation.

Suggestions:
  • option 1: Add two more drives and make it 3x RAID5 arrays in RAID0. Same size, much more performance. Maybe it's even wiser to go down to 4 arrays of 6 disks each.
  • option 2: Do option 1 and go with RAIDIX instead of mdadm - I can help you get a test account; drop me a PM if you like.
  • option 3: Go back to hardware RAID.
Conceptually, mdadm is not optimized to handle standard RAID arrays of 22 disks, and RAID6 is not made to handle more than 6-8 drives in a single array. So, imho, the setup is made to fail from the start. You actually want to make your arrays as small as possible so that the RAID load on the CPU stays negligible. Here you do the exact opposite: you provoke massive single-threaded loads. Also, for speed, RAID5 >> RAID6. Striped RAID5 arrays are imho your best bet for maximum speed.
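For illustration, a hedged sketch of such a striped-RAID5 (RAID50) layout with mdadm, assuming 24 drives split into three 8-drive legs; device names and chunk size are placeholders:
Code:
# Three RAID5 legs of 8 drives each
mdadm --create /dev/md10 --level=5 --raid-devices=8 --chunk=256 /dev/nvme[0-7]n1
mdadm --create /dev/md11 --level=5 --raid-devices=8 --chunk=256 /dev/nvme[8-9]n1 /dev/nvme1[0-5]n1
mdadm --create /dev/md12 --level=5 --raid-devices=8 --chunk=256 /dev/nvme1[6-9]n1 /dev/nvme2[0-3]n1

# RAID0 stripe over the three legs
mdadm --create /dev/md20 --level=0 --raid-devices=3 /dev/md10 /dev/md11 /dev/md12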

For max performance from my tests in the past, what worked best:
  • RAID50
  • 4-8 drives per array (depends on drives and latency)
  • XFS filesystem; ext4 is fine too but (slightly) slower than XFS. ZFS is the worst here, since it also produces single-threaded loads.
PS: Can you supply an uptime list for the drives as well as TBW so far?
 

lihp

Active Member
Jan 2, 2021
186
53
28
I am in the process of testing worst case scenario and now all applications running on software RAID...
From my understanding so far, this is what happens:
  1. Too many disks to stripe across - this matters below. The sweet spot for RAID6, depending on the drives, is somewhere around 3-8 drives per array. By 12-14 drives at the latest, the time needed to write a stripe set balloons due to latency and stripe computation.
  2. The storage is large, which drastically increases the time needed for syncs as well as for replacing a failed drive (rebuild/resync).
  3. You are writing to the RAID6 array in huge bursts.
  4. Stripe computation can't keep pace. At the latest when the cache is full, or when wait times run into timeouts (depending on your config), a write can't be completed.
  5. When a write can't be completed or runs into a timeout, your drives are considered out of sync.
  6. Once a drive is considered out of sync, a sync of the array is started (see your excerpt above).
  7. The sync now runs while you are trying to write, which puts further load on the array. Bandwidth drops even more and another out-of-sync event happens.
  8. Depending on the sync, the bitmap is not used, which further increases the time needed for a resync. I'd estimate the resync time for your array at 30+ hours minimum, worst case even 100+ hours.
  9. ... this goes on.
Some info: Mdadm checkarray function - Thomas-Krenn-Wiki

When it comes to maximum performance you are better off with RAID5 with 3, 5 or 9 drives per array, since it offers better write performance. Considering my points above, I would probably start with 9 drives per RAID5 array: the least space lost, and still fast, since stripe computation in RAID5 is a simple XOR operation. It should be quite performance- and cost-efficient. So in the end, RAID0 over 3x RAID5 (9 drives each). Once created, wait for the rebuild until the status is only "active" and all drives in each array show "U". By then your array should fly.

PS-edit: you can speed up the initial build by increasing the min/max sync speed and stripe cache size, disabling NCQ on all disks, and setting the bitmap to internal during the initial rebuild (and to none once done); see the sketch below.
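A sketch of the last two of those knobs, md2 and sdX being placeholders (the speed limits and stripe cache were shown earlier in the thread; the NCQ trick applies to SATA members only):
Code:
# Disable NCQ on SATA members (not applicable to NVMe)
echo 1 > /sys/block/sdX/device/queue_depth

# Switch the write-intent bitmap as suggested above
mdadm --grow --bitmap=internal /dev/md2   # during the initial rebuild
mdadm --grow --bitmap=none /dev/md2       # afterwards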
 
Last edited:

lihp

Active Member
Jan 2, 2021
186
53
28
Micron 9300 Pro SSDs connected directly to the motherboard via PCIe lines
And just to be sure, please elaborate on that. How are the drives connected - which cables? Are you sure each drive gets 4 PCIe lanes?
 

lihp

Active Member
Jan 2, 2021
186
53
28
Are you actually suggesting such a setup after posting that "raid 6 is not made for more than 6-8 devices"? :oops:
RAID0 over 3x RAID5 (9 drives each)

Comment: 9 drives in RAID5 is also a close call, but since the OP is looking for cost efficiency, that's imho the plausible maximum.
 
Last edited:

Sergiu

New Member
Jul 11, 2019
12
2
1
@lihp
What I am doing is a delicate balance between maximum possible storage per server, performance and reliability, all based on years of observed experience with SSDs in production. What I have seen is that, compared to HDDs, SSDs tend to develop bad sectors more often, so the likelihood of an uncorrectable read error during a rebuild grows sharply with age, which makes RAID5 a no-go. If I lose the primary server because 3 SSDs out of 22 died instead of 2, I can switch to the replica and rebuild the primary. If both primary and replica die, then I fall back to backups. The benefit of having 30-60 TB more storage outweighs the risks for our use case.

All SSDs are directly attached NVMe, so I have a theoretical read/write bandwidth of over 60GB/s and over 17M read IOPS / 3M write IOPS. I need about 5GB/s sustained read with a minimum of 1GB/s sustained write at mixed request sizes, usually 4 to 128 KB. I thought that would be easy with mdadm, but I am discovering that it is the purest piece of crap possible when it comes to NVMe usage.

Anyone having experience with RAIDIX ERA ?

Found Sudo Null - Latest IT News and https://builders.intel.com/docs/dat...rage-systems-with-intel-optane-dc-storage.pdf, which make mdadm and ZFS look like bad jokes.
 

lihp

Active Member
Jan 2, 2021
186
53
28
What I am doing is a delicate balance...
I got that from your setup and writing.

...observed years of experience with SSDs in production...
Whatever the reason, they tend to get out of sync easily.

I have a theoretical read/write bandwidth of over 60GB/s and over 17M read IOPS / 3M write IOPS. I need about 5GB sustained read with a minimum of 1GB sustained write at mixed request sizes, usually 4 to 128 KB.
mdadm is simply not made for that kind of efficiency. RAIDIX is. Depending on the rest of the setup, I consider 5GB/s sustained easily achievable with your NVMe army.

mdadm but I am discovering that it is the purest piece of crap possible
I disagree. It's just not optimized for this. In the best case you stay 50% below your hardware's potential.

which makes RAID 5 a no go
I have had different experiences with recent NVMe drives, but I get your point:
  1. Were I in your shoes, I'd go RAID50 with individual RAID5 arrays of 5 drives.
  2. Considering your comment, I'd advise RAID10.
  3. I doubt RAID6 makes sense at that size at all. If so, I'd go with arrays of 6 disks and create a RAID60 (4x) array.
With mdadm you should imho ditch RAID6 in any case - striping and parity calculation for 4-6 drives is negligible, but simultaneous stripe and parity calculation for 24 drives in one array is sick. Even striped across 4 RAID6 arrays of 6 disks each, it's still a lot.

PS: For anything RAIDIX you can holler at me by PM, and I can also connect you with a tech if you like.