Strange MDADM RAID 6 behaviour


tinfoil3d

QSFP28
May 11, 2020
880
404
63
Japan
Just to give an update on this, today kernel 5.13.0-25 was pushed out to this system and the issue is finally gone. :D
Are you 1 billion trillion percent sure it wasn't something else you've changed? When working on very complex systems, and in general on any system that has production value, I try as much as I can to change only one thing at a time, be it software or hardware, then re-test, so I can roll back easily in case there's a problem.
Had it been me, I'd definitely go ahead and test with another motherboard; if I didn't have one I'd borrow one from a friend, to limit the fault scope to the RAID controller and the disks themselves.
 

Mashie

Member
Jun 26, 2020
37
9
8
The only thing that changed in my system between today and the last reboot a week ago is that the kernel went from 5.11 to 5.13.

5.13 had work done on the md RAID implementation, so whatever they did fixed this by accident.
 

Stephan

Well-Known Member
Apr 21, 2017
934
710
93
Germany
Could have been mdraid, could also have been some driver in the storage stack mis-handling interrupts (after a while).

Anyhow glad you got this sorted.

That's why I still compile my own 5.4.172 from the Arch AUR. It's the longest-running LTS kernel, a bunch of Chinese whitebox storage companies seem to be using this version, and they contribute really obscure storage stack fixes from time to time. It's also the only kernel that doesn't reset the onboard Intel i219 Ethernet chips (e1000e driver) every couple of hours because of chip bugs, even with workarounds in place.
 

Mashie

Member
Jun 26, 2020
37
9
8
It was always the first write to the array after reboot that got stuck for up to 10 minutes. After that it would behave until next reboot.
 

UhClem

just another Bozo on the bus
Jun 26, 2012
438
252
63
NH, USA
Great news!!
It was always the first write to the array after reboot that got stuck for up to 10 minutes. After that it would behave until next reboot.
That was the key "symptom" that attracted me to this thread.

[After we had eliminated hardware, my gut told me that something was askew in md's (init/first-time) handling of its write-intent bitmap. (Thunar was merely the agent provocateur; 4KB reads sounded like a directory-walk.) Was hoping to find a way to provoke the one-timeness by adding a minimal dd action in a pre-Desktop rc script--i.e., at least a workaround for the "hang".]

All's well that ends well ...
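
The "minimal write before the Desktop starts" idea mentioned above could be sketched roughly like this in Python. This is only an illustration under assumptions, not something anyone in the thread actually ran: the `ARRAY_MOUNT` environment variable and the `first_write_probe` helper are made-up names, and whether such a write really absorbs the one-time stall was never confirmed here.

```python
import os
import tempfile
import time


def first_write_probe(directory, size_bytes=4096):
    """Write a small temp file into `directory`, force it to disk, return seconds taken."""
    payload = b"\0" * size_bytes
    start = time.monotonic()
    fd, path = tempfile.mkstemp(prefix="md-first-write-", dir=directory)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # block until the data has actually reached the device
    finally:
        os.close(fd)
        os.unlink(path)  # clean up so the probe leaves nothing behind
    return time.monotonic() - start


if __name__ == "__main__":
    # ARRAY_MOUNT is a hypothetical variable; point it at the md array's mount point.
    mount_point = os.environ.get("ARRAY_MOUNT", tempfile.gettempdir())
    elapsed = first_write_probe(mount_point)
    print(f"first write to {mount_point} took {elapsed:.2f} s")
```

Run from a boot-time script, the printed time would at least show whether the first write after a reboot is the slow one.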
 

PerryCS

New Member
Aug 27, 2022
9
4
3
Created an account to mention... I have the exact same problem with different equipment. I am running a 480TB array on an AMD 2700X on an Asus Prime X470 with 32GB of 2666MHz RAM (non-ECC), running Ubuntu 18.04 LTS. I have 2 large arrays. I also have the Thunar file manager installed. I am using an LSI SAS 12G 9300, I think it is. I have a 16-bay Raid Machine and a Supermicro 24-bay. I originally had the 16 x 8TB IronWolf drives in the Raid Machine array. The 24-bay wasn't used for much originally (I need more money) LOL!

I had the same problem as mentioned by the OP.

Eventually I purchased 9 x 16TB Exos drives. I zero-wipe all drives using dd. Ubuntu runs on a 1TB WD Blue SSD.

Eventually I purchased 10 more 10TB Exos drives. I swapped the arrays around so the Supermicro was the main array I used, and it only holds Exos drives (by the way, I LOVE Exos drives compared to IronWolf drives).

The problem persisted.

Upon bootup and SOMETIMES randomly throughout the week I would go to access the network. It would appear to freeze... Usually it would copy from Windows to the Server a few seconds and then the speeds would drop to 0 for a minute or so. During this time, ALL the lights of ALL the drives on the array being accessed would be blinking like crazy doing "something".

dmesg gives me the exact error structure as the OP...

I just upgraded the BIOS to the latest version 4 hours ago. Went in. Problem still occurring. BUT, after reading these posts, it seems it could be caused by an older MDADM/kernel.

So, I will try and figure out how to do that. I am running a Plex server and a massive backup for my decades of "collecting things" and running a computer business.

I'll try and figure out if I can see which kernel I am using and which mdadm I have. I did just clone the SSD today so I could try the OS upgrade, but I was scared to do it as I have Plex, Apache, NodeJS, and a WOTLK private server just for me (it's inside VMM).

At least if I butcher anything, I can just swap the SSDs back. I did a clone using dd, and I always put the backup "just done" in the system, pull the original out, label it "backup", and swap them in the future after a small multi-drive rotation of backups. :)

I'll update here when I get around to trying that and upgrading my system. It would be really great to finally fix this annoyance. I was always scared it was memory, motherboard, LSI card, or something causing the problem... I have tested this system to death with Prime and other tests and they all come out perfect! The ONLY issue I have with this home server is the one the OP had...

Also, sometimes when copying files the system would start off fast, drop to 0... then speed up, then drop to 0... but, usually, once it does its "blinking of the lights or whatever it's doing" the system screams. I also copy from array to array all the time doing backups to the 16-bay RAID 6.

I also have a mix in the SuperMicro 24 bay of Raid 5's and 6's and it only SEEMS to do it to the raid 6 arrays. So, who knows.
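
To check which arrays are RAID 5 versus RAID 6 (and which ones carry the write-intent bitmap discussed earlier in the thread), something like the following Python sketch could be used. It assumes the usual `/proc/mdstat` layout of an `mdX : active raidN ...` line followed by indented detail lines; the function name `summarize_mdstat` is made up for this example, and inactive or non-RAID-level arrays are simply skipped.

```python
import re

# Matches e.g. "md0 : active raid6 sdb[0] ..." or "md1 : active (auto-read-only) raid5 ..."
ARRAY_LINE = re.compile(r"^(md\w+)\s*:\s*active\s+(?:\([^)]*\)\s+)?(raid\d+)\b")


def summarize_mdstat(text):
    """Return {array: {"level": str, "bitmap": bool}} parsed from mdstat-style text."""
    arrays = {}
    current = None
    for line in text.splitlines():
        match = ARRAY_LINE.match(line)
        if match:
            current = match.group(1)
            arrays[current] = {"level": match.group(2), "bitmap": False}
        elif current and line.strip().startswith("bitmap:"):
            arrays[current]["bitmap"] = True
        elif not line.strip():
            current = None  # a blank line ends the block for one array
    return arrays


if __name__ == "__main__":
    try:
        with open("/proc/mdstat") as handle:
            report = summarize_mdstat(handle.read())
    except OSError:
        report = {}  # not a Linux md system, or the file is unreadable
    for name, info in sorted(report.items()):
        bitmap = "with bitmap" if info["bitmap"] else "no bitmap"
        print(f"{name}: {info['level']}, {bitmap}")
```

Comparing that list against the arrays that stall would show whether the problem really is limited to the RAID 6 ones.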

I can't wait to update my Ubuntu. I plan on buying an Epyc or Threadripper to play with in the future with ECC memory now that speeds are around 3200MHz these days for ECC memory.

THANK you for posting your problems. It really helps to see this thread and know that at least someone else found a solution to this. Also, I wonder if Thunar file manager is causing the problem by hooking into something in the background?!?!?!?! Who knows. Weird considering Thunar isn't even open when this happens. I'm copying from Windows 10 with a Ryzen 5950X on a 1GB Asus NIC across the network to my AMD server... and this exact same thing happened with many of my other machines... 3950X, 5600G, 2400G, 2700X, 1700X etc... and I have Ubiquiti network equipment (router, switch, wifi, etc.) with no errors showing network problems. I have replaced all my cables with Cat 7/8 and redid the rest in Cat 6. Overkill for me, but I like to learn and tinker. :)

Will update when I get around to doing this upgrade. And I HOPE my problem is resolved like the OP... :)

I wanted to comment on here because I am running a clean (non-upgraded, non-expanded) MDADM RAID 6 array on a non-Intel system, and I have the exact same problem. :)

Thank you for your time.
David Perry
Perry Computer Services
 

PerryCS

New Member
Aug 27, 2022
9
4
3
I ran uname -a and I get this...

Linux plextbserver 5.4.0-124-generic #140~18.04.1-Ubuntu SMP Fri Aug 5 11:43:34 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux...

The OP's version was 5.13.0-25. So, I can't wait to upgrade. :)

and mdadm -V

mdadm - v4.1-rc1 - 2018-03-22

I am still an intermediate user when it comes to Linux. Not an expert by any means. But, will update once I upgrade to Ubuntu 22.04 LTS or whatever the newest one is that's LTS.
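
For comparing a kernel release string like the ones above against the version the OP reported as working, a small Python sketch might look like this. The `(5, 13)` threshold is only an assumption taken from this thread (the OP's hang went away on 5.13.0-25), not a confirmed fix version, and the helper names are made up for the example.

```python
import platform
import re

# Assumption from this thread only: the OP's hang disappeared on kernel 5.13.
FIXED_IN = (5, 13)


def kernel_version(release):
    """Extract (major, minor) from a release string such as '5.4.0-124-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def predates_fix(release, fixed_in=FIXED_IN):
    """True if the release is older than the version assumed to contain the fix."""
    return kernel_version(release) < fixed_in


if __name__ == "__main__":
    release = platform.release()  # same value that `uname -r` prints on Linux
    try:
        status = "older than" if predates_fix(release) else "at or past"
        print(f"{release} is {status} {FIXED_IN[0]}.{FIXED_IN[1]}")
    except ValueError as error:
        print(error)
```

With the strings posted in this thread, `5.4.0-124-generic` comes out as older than 5.13 and `5.13.0-25` does not.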
 

Mashie

Member
Jun 26, 2020
37
9
8
That is indeed one of the kernel versions that caused this problem for me.

I'm now on 22.04.1 LTS and the stock kernel is working fine with MDADM.
 

Stephan

Well-Known Member
Apr 21, 2017
934
710
93
Germany
Ubuntu appears to be going Microsoft's way, with an increasing number of QA problems. At least you didn't have to pay for it. ;-)

If you suspect a kernel bug, try Arch Linux with packages "linux" or "linux-lts" first. Like boot a system-rescue dot org ISO, use Ventoy, or install Arch on a USB stick. Arch kernels have been kept very fresh for years, as Arch is a rolling release model distribution. See if problem persists.
 

PerryCS

New Member
Aug 27, 2022
9
4
3
Thank you for the comments. I might try Arch when I get my Threadripper/Epyc server running one day. I can't reinstall everything just now - lack of money, time, and hardware. BUT, in the future I will give that a try. Never heard of Arch Linux but will look it up. I'm quite familiar with CentOS 7, Ubuntu, Mint, and Zorin, but other than those... haven't really touched any other flavors. I do use a GUI desktop - I haven't figured out how to run the system without a GUI and still have GUIs in the VMs when remoting into it. Also, I prefer to use gedit as vim has the absolute worst control keys and shortcuts of any software I have ever used. lol

I might consider trying to update to 22.04.4 LTS tonight... since I have a 2 day old clone of the drive, if everything gets borked I can always slap that drive back in.

Will update when I try it and see what happens. Have to google how to update the OS to a newer version as safely as possible.
 

PerryCS

New Member
Aug 27, 2022
9
4
3
Just updated to 20.04 LTS - things don't seem to have changed much version wise... will see if everything works fine and then go up to v22. Maybe I have to do something else than update the whole OS... uname -a and mdadm -V yield very similar results..

Linux plextbserver 5.4.0-125-generic #141-Ubuntu SMP Wed Aug 10 13:42:03 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
mdadm - v4.1 - 2018-10-01

I'll test this out for a day and make sure all my samba shares work. My apache is working, so that's good. Have to test my Plex , VM's, and shares and then will go up 1 more upgrade..

Will update here once done - I'll also see about the raid problems but considering the versions barely changed I don't expect that to be solved especially considering the huge version gap between OP system and even me going from 18.04 LTS to 20 LTS.
 

PerryCS

New Member
Aug 27, 2022
9
4
3
OK, just finished the upgrade to 22.04 and the versions did change quite a bit... uname -a and mdadm -V give me the following...

Linux plextbserver 5.15.0-46-generic #49-Ubuntu SMP Thu Aug 4 18:03:25 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
mdadm - v4.2 - 2021-12-30

So, I'll update here as time goes on and see if my problem was solved by upgrading. I just have to fix my apache - seems mysql is all messed up right now but everything else works great (plex, samba, transcoding).

So, I'll follow up and let everyone know if this solved it for me now that I am on much later kernel versions. (Sorry for all the posts, I just figured my system was totally different than the OP's - AMD vs Intel, using full RAID JBOD storage solutions in racks (Raid Machine x16 and Supermicro x24), using different drives (Exos vs IronWolf), and finally my arrays have not been expanded like the OP's... so, this will be a good test.)

I'll update in a couple of weeks. Have a great day everyone. :)

UPDATE#1: A couple of instant observations. I rarely used to get speeds of 326MB/sec using rsync to backup the arrays to the other arrays... I am now getting on average 450-520MB/sec. My weird bug also used to happen doing my backups. So far, the backups not only complete faster due to the increased speeds... but, rsync instead of a black screen with no writing now has more stats saying...

building file list ...
434103 files to consider

which I LOVE! Before, there were no stats while it processed files. Also, no weird blinking lights and no slowdowns, but this new change has only been running for approx. 45 min. As mentioned, I will update as time goes on. I'll run my array sync which used to max out at 4.7GB/sec - I can't wait to see the new results.
 

PerryCS

New Member
Aug 27, 2022
9
4
3
Wow, the new speeds are incredible... hitting 6.3GB/sec peak, averaging 6.0 to 6.2GB/sec - that's a massive upgrade in speed from my previous max speed of 4.7GB/sec... also, that's while playing Plex. I am so happy to have stumbled onto this thread - it would have been years before I pushed the upgrade to the newest versions of Ubuntu. OK, I'll stop updating on things that are not related to the OP BUT I am posting these changes because whatever they did in the kernel or mdadm has vastly changed the speed of my system.
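
To put the before/after figures quoted in these posts into percentages, here is a tiny Python calculation. It only uses the numbers reported above (4.7 to 6.3 GB/sec for the array sync, 326 to 450-520 MB/sec for rsync); the function name is made up, and the old rsync figure was described as a rarely reached speed, so the real gain there may be larger.

```python
def percent_gain(before, after):
    """Relative increase from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    # Array sync peak: 4.7 GB/sec before, 6.3 GB/sec after the upgrade.
    print(f"array sync: +{percent_gain(4.7, 6.3):.0f}%")
    # rsync backups: 326 MB/sec before, 450 to 520 MB/sec after.
    print(f"rsync: +{percent_gain(326, 450):.0f}% to +{percent_gain(326, 520):.0f}%")
```

That works out to roughly a third faster for the array sync and somewhere between about 38% and 60% faster for the rsync backups.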


[Attachment: 1661749455989.png]
 

i386

Well-Known Member
Mar 18, 2016
4,245
1,546
113
34
Germany
it would have been years before I pushed the upgrade to the newest versions of Ubuntu
...
BUT I am posting these changes because whatever they did in the kernel or mdadm has vastly changed the speed of my system.
+1 for updates/upgrades :D
 

Mashie

Member
Jun 26, 2020
37
9
8
Glad to hear things have improved.

It was an update for MDADM to improve the performance in one of the recent kernel versions (which accidentally fixed the freezing bug).
 

Stephan

Well-Known Member
Apr 21, 2017
934
710
93
Germany
Never heard of Arch Linux but will look it up.
Arch is just like Steely Dan. You may have never heard of it, but your favorite band's favorite band is Steely Dan.

You may have heard of Manjaro or EndeavourOS, both based on Arch (see "Arch-based distributions" on the ArchWiki). Or how about the Steam Deck? ("Valve's upcoming Steam Deck will be based on Arch Linux, not Debian")

When it's unclear to me, like in the OP's case, whether there is a hardware issue or a software issue, I turn to Arch. Sometimes a bug is fixed in Torvalds' curated tree when Ubuntu maintainers simply didn't have the manpower to diagnose and fix it. Aside from being rolling and fresh, another advantage is that I've become quite good at producing quality custom packages for the kernel, systemd, ungoogled-chromium, bareos, hostapd and two dozen others. That is something I found harder and harder to do on .DEB and .RPM-based distributions, despite having 30 years of experience. Finally, there are no commercial interests or pressures steering the distribution itself - see my signature.

But I am glad you got your stuff working good.
 

UhClem

just another Bozo on the bus
Jun 26, 2012
438
252
63
NH, USA
... Arch kernels have been kept very fresh for years, as Arch is a rolling release model distribution.
Very enlightening--thank you. I think I'll be switching from Slackware.
[Yes, I like my Unix raw ... very raw. GUI (and I mean the acronym itself) didn't exist when I started hacking Unix (1973). Reeling In the Years ... :) ]
=====
(motd) "If you make something idiot-proof, only idiots will want to use it."
 

Stephan

Well-Known Member
Apr 21, 2017
934
710
93
Germany
"So, what's it like to have Arch as your daily driver? A: hxxps://pr0gramm.com/top/2166836"

Jokes aside, I just wish ZFS would become a first-class citizen on Arch, instead of coming from the AUR. The AUR is a user-supported repository where anyone can make an account and contribute a package. For UEFI boot nothing beats hxxps://github.com/zbm-dev/zfsbootmenu. Install any Linux in its own dataset and just boot it. Take filesystem snapshots every 10 minutes, never lose data again from user error. mdadm just can't compete.
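
The "snapshot every 10 minutes" part could be sketched in Python along these lines. This is an assumption-laden illustration, not the poster's actual setup: the dataset name `tank/home`, the `auto-` naming scheme and both helper names are hypothetical, and the script only builds and prints the `zfs snapshot` command unless `take_snapshot` is called explicitly (for example from a cron job or systemd timer firing every 10 minutes).

```python
import datetime
import subprocess


def snapshot_command(dataset, now=None):
    """Build a `zfs snapshot dataset@name` command with a timestamped name."""
    now = now or datetime.datetime.now()
    name = now.strftime("auto-%Y%m%d-%H%M")  # hypothetical naming scheme
    return ["zfs", "snapshot", f"{dataset}@{name}"]


def take_snapshot(dataset):
    """Run the snapshot command; needs ZFS installed and sufficient privileges."""
    subprocess.run(snapshot_command(dataset), check=True)


if __name__ == "__main__":
    # "tank/home" is a placeholder dataset; this only prints the command.
    print(" ".join(snapshot_command("tank/home")))
```

Old snapshots would still need pruning by some other means; that part is left out here.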
 

Mashie

Member
Jun 26, 2020
37
9
8
"So, what's it like to have Arch as your daily driver? A: hxxps://pr0gramm.com/top/2166836"

Jokes aside, I just wish ZFS would become a 1st rate citizen on Arch, instead of coming from AUR. AUR is a user-supported repository where anyone can make an account and contribute a package. For UEFI boot nothing beats hxxps://github.com/zbm-dev/zfsbootmenu. Install any Linux in its own dataset and just boot it. Take filesystem snapshots every 10 minutes, never lose data again from user error. mdadm just can't compete.
Can you expand volumes in ZFS with an arbitrary number of disks yet, or do you still need to add full vdevs?