Strange MDADM RAID 6 behaviour


Mashie

Member
Jun 26, 2020
37
9
8
!!! EDIT !!! I didn't see your defrag posting till after posting the below. (Had this thread opened, then switched to analog life, and neglected to refresh upon return. :)) Will be interesting to see the effect.
I may have to cancel the defrag before it is completed, though; at the current rate it will take around a month to complete.

Here is some of the output from e4defrag, showing how much the extents are tweaked as part of the defrag. I have no point of reference for what number of extents is normal/acceptable for 20-80 GB files.

Code:
[11542/24046]/mnt/storage/file01:    100%  extents: 29 -> 26    [ OK ]
[11543/24046]/mnt/storage/file02:    100%  extents: 18 -> 8    [ OK ]
[11544/24046]/mnt/storage/file03:    100%  extents: 36 -> 36    [ OK ]
[11545/24046]/mnt/storage/file04:    100%  extents: 30 -> 30    [ OK ]
[11546/24046]/mnt/storage/file05:    100%  extents: 34 -> 34    [ OK ]
[11547/24046]/mnt/storage/file06:    100%  extents: 12 -> 10    [ OK ]
[11548/24046]/mnt/storage/file07:    100%  extents: 268 -> 31    [ OK ]
[11549/24046]/mnt/storage/file08:    100%  extents: 47 -> 47    [ OK ]
[11550/24046]/mnt/storage/file09:    100%  extents: 367 -> 235    [ OK ]
[11551/24046]/mnt/storage/file10:    100%  extents: 302 -> 36    [ OK ]
[11552/24046]/mnt/storage/file11:    100%  extents: 177 -> 30    [ OK ]
[11553/24046]/mnt/storage/file12:    100%  extents: 9 -> 9    [ OK ]
[11554/24046]/mnt/storage/file13:    100%  extents: 185 -> 37    [ OK ]
[11555/24046]/mnt/storage/file14:    100%  extents: 9 -> 9    [ OK ]
[11556/24046]/mnt/storage/file15:    100%  extents: 14 -> 12    [ OK ]
[11557/24046]/mnt/storage/file16:    100%  extents: 256 -> 30    [ OK ]
[11558/24046]/mnt/storage/file17:    100%  extents: 37 -> 37    [ OK ]
[11559/24046]/mnt/storage/file18:    100%  extents: 28 -> 28    [ OK ]
[11560/24046]/mnt/storage/file19:    100%  extents: 42 -> 37    [ OK ]
[11561/24046]/mnt/storage/file20:    100%  extents: 13 -> 13    [ OK ]
[11562/24046]/mnt/storage/file21:    100%  extents: 96 -> 15    [ OK ]
[11563/24046]/mnt/storage/file22:    100%  extents: 324 -> 197    [ OK ]
[11564/24046]/mnt/storage/file23:    100%  extents: 43 -> 39    [ OK ]
[11565/24046]/mnt/storage/file24:    100%  extents: 17 -> 17    [ OK ]
[11566/24046]/mnt/storage/file25:    100%  extents: 10 -> 10    [ OK ]
[11567/24046]/mnt/storage/file26:    100%  extents: 371 -> 43    [ OK ]
[11568/24046]/mnt/storage/file27:    100%  extents: 265 -> 44    [ OK ]
[11569/24046]/mnt/storage/file28:    100%  extents: 61 -> 9    [ OK ]
[11570/24046]/mnt/storage/file29:    100%  extents: 23 -> 23    [ OK ]
[11571/24046]/mnt/storage/file30:    100%  extents: 13 -> 13    [ OK ]
[11572/24046]/mnt/storage/file31:    100%  extents: 34 -> 34    [ OK ]
[11573/24046]/mnt/storage/file32:    100%  extents: 424 -> 424    [ OK ]
[11574/24046]/mnt/storage/file33:    100%  extents: 39 -> 39    [ OK ]
[11575/24046]/mnt/storage/file34:    100%  extents: 58 -> 58    [ OK ]
[11576/24046]/mnt/storage/file35:    100%  extents: 42 -> 42    [ OK ]
[11577/24046]/mnt/storage/file36:    100%  extents: 162 -> 269    [ OK ]
[11578/24046]/mnt/storage/file37:    100%  extents: 67 -> 67    [ OK ]
[11579/24046]/mnt/storage/file38:    100%  extents: 29 -> 27    [ OK ]
[11580/24046]/mnt/storage/file39:    100%  extents: 109 -> 79    [ OK ]
[11581/24046]/mnt/storage/file40:    100%  extents: 167 -> 167    [ OK ]
[11582/24046]/mnt/storage/file41:    100%  extents: 247 -> 293    [ OK ]
[11583/24046]/mnt/storage/file42:    100%  extents: 78 -> 78    [ OK ]
[11584/24046]/mnt/storage/file43:    100%  extents: 12 -> 12    [ OK ]
[11585/24046]/mnt/storage/file44:    100%  extents: 9 -> 9    [ OK ]
[11586/24046]/mnt/storage/file45:    100%  extents: 195 -> 221    [ OK ]
[11587/24046]/mnt/storage/file46:    100%  extents: 358 -> 358    [ OK ]
[11588/24046]/mnt/storage/file47:    100%  extents: 93 -> 93    [ OK ]
[11589/24046]/mnt/storage/file48:    100%  extents: 1074 -> 232    [ OK ]
[11590/24046]/mnt/storage/file49:    100%  extents: 73 -> 7    [ OK ]
[11591/24046]/mnt/storage/file50:    100%  extents: 207 -> 26    [ OK ]
[11592/24046]/mnt/storage/file51:    100%  extents: 25 -> 25    [ OK ]
[11593/24046]/mnt/storage/file52:    100%  extents: 14 -> 14    [ OK ]
[11594/24046]/mnt/storage/file53:    100%  extents: 240 -> 147    [ OK ]
[11595/24046]/mnt/storage/file54:    100%  extents: 26 -> 26    [ OK ]
[11596/24046]/mnt/storage/file55:    100%  extents: 238 -> 61    [ OK ]
[11597/24046]/mnt/storage/file56:    100%  extents: 290 -> 152    [ OK ]
[11598/24046]/mnt/storage/file57:    100%  extents: 78 -> 14    [ OK ]
[11599/24046]/mnt/storage/file58:    100%  extents: 70 -> 41    [ OK ]
[11600/24046]/mnt/storage/file59:    100%  extents: 286 -> 47    [ OK ]
[11601/24046]/mnt/storage/file60:    100%  extents: 13 -> 11    [ OK ]
[11602/24046]/mnt/storage/file61:    100%  extents: 29 -> 29    [ OK ]
[11603/24046]/mnt/storage/file62:    100%  extents: 31 -> 9    [ OK ]
[11604/24046]/mnt/storage/file63:    100%  extents: 193 -> 36    [ OK ]
That is actually very good to hear.

I don't want to be pessimistic, but there's an ~even chance that the 10==>12 won't change anything (except capacity). In either case, just so you don't feel pressured into doing the --grow asap, it might be useful to devise a "coping plan" so that this wart has minimal impact on life-and-wife. E.g., you can minimize the frequency of shutdown/boot-up by seeing what procedures for Hibernate are available in Ubuntu.
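(A quick sketch of checking whether hibernate is even an option, assuming a systemd-based Ubuntu; hibernate generally also needs swap at least the size of RAM and a resume= setting, so treat this as a starting point, not a recipe:)
Code:
# "disk" in the output means the kernel thinks it can hibernate
cat /sys/power/state

# if supported (and swap/resume are set up), this suspends to disk
sudo systemctl hibernate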

And, we can force the stall-event to happen at the last stage of boot-up, avoiding the surprise annoyance we get currently. This might also avoid the rattling of those 5 disks, but that is just a "theory" I have. The 120-second timeout is probably unavoidable. Details on this forcing can wait till later--just something to ponder in the meantime.

Next time you reboot (normally--no need to force the event), try to cause the stall with the following command:
Code:
dd if=/dev/zero of=/mnt/storage/40MBz bs=8M count=5 oflag=direct
The system is on 24/7 with a reboot once every 1-4 weeks to apply security updates, so the issue is mainly an annoyance as long as it isn't a sign of horrible things to come.

If the system can be forced to stall as part of the boot-up that would be a good last resort if the other options fail. I will give that command a try after next reboot whenever that is.
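My first guess at how that forcing could be wired up (completely untested, with made-up names, so just a sketch) would be a systemd oneshot unit that runs the same dd once the array is mounted:
Code:
# /etc/systemd/system/md0-provoke.service  -- hypothetical name and path
[Unit]
Description=Provoke the md0 stall at the end of boot
RequiresMountsFor=/mnt/storage

[Service]
Type=oneshot
ExecStart=/usr/bin/dd if=/dev/zero of=/mnt/storage/.boot-provoke bs=8M count=5 oflag=direct
ExecStartPost=/usr/bin/rm -f /mnt/storage/.boot-provoke

[Install]
WantedBy=multi-user.target
Enabled with "sudo systemctl enable md0-provoke.service", it should hit the stall during boot instead of at the first write afterwards.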
 

UhClem

just another Bozo on the bus
Jun 26, 2012
435
249
43
NH, USA
I may have to cancel the defrag before it is completed, though; at the current rate it will take around a month to complete.
I think you should cancel. From the e2fsck you did, /dev/md0 was only 3.x% non-contiguous, so little to be gained de-frag-wise. As for affecting/improving the stall situation, I believe this is also tangerines-vs-tomatoes.
Here is some of the output from e4defrag, showing how much the extents are tweaked as part of the defrag. I have no point of reference for what number of extents is normal/acceptable for 20-80 GB files.
You can use
Code:
hdparm --fibmap pathname
to see the extent layout for a file.
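filefrag (from e2fsprogs) gives a similar per-extent view, and e4defrag -c only reports fragmentation without changing anything -- a quick sketch, assuming the same mount point as above:
Code:
# list the extents of a single file (logical -> physical mapping)
filefrag -v /mnt/storage/file01

# fragmentation report only, no defragging, for a file or a whole tree
e4defrag -c /mnt/storage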
The system is on 24/7 with a reboot once every 1-4 weeks to apply security updates, so the issue is mainly an annoyance as long as it isn't a sign of horrible things to come.
Understood, on the reboot freq. As for the "horrible" part, it doesn't feel like that, but that's just my hunch, based only on the simplistic reproducibility of the glitch and the consistently benign outcome (to date). ["Grains of salt": Long ago, when even top CS faculty had never heard of Unix, I knew the kernel, totally. But, that was then ... I retired 20+ yrs ago, and 20 yrs earlier, I stopped doing the kernel.]
 

Mashie

Member
Jun 26, 2020
37
9
8
e4defrag started to speed up as many files thankfully weren't fragmented, and it just finished:

Code:
    Success:            [ 22324/24054 ]
    Failure:            [ 1730/24054 ]
    Total extents:            256096->242202
    Fragmented percentage:         26%->22%
And it made no difference at all, as you expected; if anything it made things worse, as it now stalls for 4 min 30 sec:

Code:
Oct 18 13:58:06 IONE kernel: [  242.648073] INFO: task jbd2/md0-8:1163 blocked for more than 120 seconds.
Oct 18 13:58:06 IONE kernel: [  242.648086]       Tainted: P           OE     5.11.0-37-generic #41~20.04.2-Ubuntu
Oct 18 13:58:06 IONE kernel: [  242.648090] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 18 13:58:06 IONE kernel: [  242.648093] task:jbd2/md0-8      state:D stack:    0 pid: 1163 ppid:     2 flags:0x00004000
Oct 18 13:58:06 IONE kernel: [  242.648101] Call Trace:
Oct 18 13:58:06 IONE kernel: [  242.648107]  __schedule+0x44c/0x8a0
Oct 18 13:58:06 IONE kernel: [  242.648116]  schedule+0x4f/0xc0
Oct 18 13:58:06 IONE kernel: [  242.648121]  jbd2_journal_commit_transaction+0x300/0x18f0
Oct 18 13:58:06 IONE kernel: [  242.648129]  ? dequeue_entity+0xd8/0x410
Oct 18 13:58:06 IONE kernel: [  242.648139]  ? wait_woken+0x80/0x80
Oct 18 13:58:06 IONE kernel: [  242.648145]  ? try_to_del_timer_sync+0x54/0x80
Oct 18 13:58:06 IONE kernel: [  242.648154]  kjournald2+0xb6/0x280
Oct 18 13:58:06 IONE kernel: [  242.648161]  ? wait_woken+0x80/0x80
Oct 18 13:58:06 IONE kernel: [  242.648165]  ? commit_timeout+0x20/0x20
Oct 18 13:58:06 IONE kernel: [  242.648171]  kthread+0x12b/0x150
Oct 18 13:58:06 IONE kernel: [  242.648179]  ? set_kthread_struct+0x40/0x40
Oct 18 13:58:06 IONE kernel: [  242.648185]  ret_from_fork+0x22/0x30
Oct 18 13:58:06 IONE kernel: [  242.648218] INFO: task pool-Thunar:4737 blocked for more than 120 seconds.
Oct 18 13:58:06 IONE kernel: [  242.648223]       Tainted: P           OE     5.11.0-37-generic #41~20.04.2-Ubuntu
Oct 18 13:58:06 IONE kernel: [  242.648226] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 18 13:58:06 IONE kernel: [  242.648228] task:pool-Thunar     state:D stack:    0 pid: 4737 ppid:  2576 flags:0x00000000
Oct 18 13:58:06 IONE kernel: [  242.648234] Call Trace:
Oct 18 13:58:06 IONE kernel: [  242.648236]  __schedule+0x44c/0x8a0
Oct 18 13:58:06 IONE kernel: [  242.648240]  ? __mod_memcg_lruvec_state+0x25/0xe0
Oct 18 13:58:06 IONE kernel: [  242.648252]  schedule+0x4f/0xc0
Oct 18 13:58:06 IONE kernel: [  242.648256]  rwsem_down_read_slowpath+0x184/0x3c0
Oct 18 13:58:06 IONE kernel: [  242.648264]  down_read+0x43/0xa0
Oct 18 13:58:06 IONE kernel: [  242.648269]  ext4_da_map_blocks.constprop.0+0x2dc/0x380
Oct 18 13:58:06 IONE kernel: [  242.648276]  ext4_da_get_block_prep+0x55/0xe0
Oct 18 13:58:06 IONE kernel: [  242.648281]  ext4_block_write_begin+0x14a/0x530
Oct 18 13:58:06 IONE kernel: [  242.648285]  ? ext4_da_map_blocks.constprop.0+0x380/0x380
Oct 18 13:58:06 IONE kernel: [  242.648290]  ? __ext4_journal_start_sb+0x106/0x120
Oct 18 13:58:06 IONE kernel: [  242.648297]  ext4_da_write_begin+0x1de/0x460
Oct 18 13:58:06 IONE kernel: [  242.648303]  generic_perform_write+0xc2/0x1c0
Oct 18 13:58:06 IONE kernel: [  242.648314]  ext4_buffered_write_iter+0x98/0x150
Oct 18 13:58:06 IONE kernel: [  242.648321]  ext4_file_write_iter+0x53/0x220
Oct 18 13:58:06 IONE kernel: [  242.648326]  ? common_file_perm+0x72/0x170
Oct 18 13:58:06 IONE kernel: [  242.648335]  do_iter_readv_writev+0x152/0x1b0
Oct 18 13:58:06 IONE kernel: [  242.648343]  do_iter_write+0x88/0x1c0
Oct 18 13:58:06 IONE kernel: [  242.648350]  vfs_iter_write+0x19/0x30
Oct 18 13:58:06 IONE kernel: [  242.648356]  iter_file_splice_write+0x276/0x3c0
Oct 18 13:58:06 IONE kernel: [  242.648365]  do_splice_from+0x21/0x40
Oct 18 13:58:06 IONE kernel: [  242.648371]  do_splice+0x2e8/0x650
Oct 18 13:58:06 IONE kernel: [  242.648377]  __do_splice+0xde/0x160
Oct 18 13:58:06 IONE kernel: [  242.648383]  __x64_sys_splice+0x99/0x110
Oct 18 13:58:06 IONE kernel: [  242.648389]  do_syscall_64+0x38/0x90
Oct 18 13:58:06 IONE kernel: [  242.648394]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Oct 18 13:58:06 IONE kernel: [  242.648401] RIP: 0033:0x7faa74c4a7f3
Oct 18 13:58:06 IONE kernel: [  242.648406] RSP: 002b:00007faa71adc700 EFLAGS: 00000293 ORIG_RAX: 0000000000000113
Oct 18 13:58:06 IONE kernel: [  242.648411] RAX: ffffffffffffffda RBX: 0000000000100000 RCX: 00007faa74c4a7f3
Oct 18 13:58:06 IONE kernel: [  242.648414] RDX: 0000000000000016 RSI: 0000000000000000 RDI: 0000000000000017
Oct 18 13:58:06 IONE kernel: [  242.648417] RBP: 0000000000000000 R08: 0000000000100000 R09: 0000000000000004
Oct 18 13:58:06 IONE kernel: [  242.648420] R10: 00007faa71adc840 R11: 0000000000000293 R12: 0000000000000016
Oct 18 13:58:06 IONE kernel: [  242.648423] R13: 0000000000000000 R14: 0000000000000017 R15: 00007faa71adc850
Oct 18 14:00:07 IONE kernel: [  363.478661] INFO: task jbd2/md0-8:1163 blocked for more than 241 seconds.
Oct 18 14:00:07 IONE kernel: [  363.478673]       Tainted: P           OE     5.11.0-37-generic #41~20.04.2-Ubuntu
Oct 18 14:00:07 IONE kernel: [  363.478677] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 18 14:00:07 IONE kernel: [  363.478679] task:jbd2/md0-8      state:D stack:    0 pid: 1163 ppid:     2 flags:0x00004000
Oct 18 14:00:07 IONE kernel: [  363.478688] Call Trace:
Oct 18 14:00:07 IONE kernel: [  363.478694]  __schedule+0x44c/0x8a0
Oct 18 14:00:07 IONE kernel: [  363.478703]  schedule+0x4f/0xc0
Oct 18 14:00:07 IONE kernel: [  363.478707]  jbd2_journal_commit_transaction+0x300/0x18f0
Oct 18 14:00:07 IONE kernel: [  363.478715]  ? dequeue_entity+0xd8/0x410
Oct 18 14:00:07 IONE kernel: [  363.478725]  ? wait_woken+0x80/0x80
Oct 18 14:00:07 IONE kernel: [  363.478732]  ? try_to_del_timer_sync+0x54/0x80
Oct 18 14:00:07 IONE kernel: [  363.478741]  kjournald2+0xb6/0x280
Oct 18 14:00:07 IONE kernel: [  363.478748]  ? wait_woken+0x80/0x80
Oct 18 14:00:07 IONE kernel: [  363.478752]  ? commit_timeout+0x20/0x20
Oct 18 14:00:07 IONE kernel: [  363.478758]  kthread+0x12b/0x150
Oct 18 14:00:07 IONE kernel: [  363.478766]  ? set_kthread_struct+0x40/0x40
Oct 18 14:00:07 IONE kernel: [  363.478773]  ret_from_fork+0x22/0x30
Oct 18 14:00:07 IONE kernel: [  363.478804] INFO: task pool-Thunar:4737 blocked for more than 241 seconds.
Oct 18 14:00:07 IONE kernel: [  363.478809]       Tainted: P           OE     5.11.0-37-generic #41~20.04.2-Ubuntu
Oct 18 14:00:07 IONE kernel: [  363.478812] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 18 14:00:07 IONE kernel: [  363.478814] task:pool-Thunar     state:D stack:    0 pid: 4737 ppid:  2576 flags:0x00000000
Oct 18 14:00:07 IONE kernel: [  363.478820] Call Trace:
Oct 18 14:00:07 IONE kernel: [  363.478823]  __schedule+0x44c/0x8a0
Oct 18 14:00:07 IONE kernel: [  363.478827]  ? __mod_memcg_lruvec_state+0x25/0xe0
Oct 18 14:00:07 IONE kernel: [  363.478839]  schedule+0x4f/0xc0
Oct 18 14:00:07 IONE kernel: [  363.478842]  rwsem_down_read_slowpath+0x184/0x3c0
Oct 18 14:00:07 IONE kernel: [  363.478851]  down_read+0x43/0xa0
Oct 18 14:00:07 IONE kernel: [  363.478856]  ext4_da_map_blocks.constprop.0+0x2dc/0x380
Oct 18 14:00:07 IONE kernel: [  363.478863]  ext4_da_get_block_prep+0x55/0xe0
Oct 18 14:00:07 IONE kernel: [  363.478868]  ext4_block_write_begin+0x14a/0x530
Oct 18 14:00:07 IONE kernel: [  363.478872]  ? ext4_da_map_blocks.constprop.0+0x380/0x380
Oct 18 14:00:07 IONE kernel: [  363.478877]  ? __ext4_journal_start_sb+0x106/0x120
Oct 18 14:00:07 IONE kernel: [  363.478884]  ext4_da_write_begin+0x1de/0x460
Oct 18 14:00:07 IONE kernel: [  363.478890]  generic_perform_write+0xc2/0x1c0
Oct 18 14:00:07 IONE kernel: [  363.478901]  ext4_buffered_write_iter+0x98/0x150
Oct 18 14:00:07 IONE kernel: [  363.478908]  ext4_file_write_iter+0x53/0x220
Oct 18 14:00:07 IONE kernel: [  363.478914]  ? common_file_perm+0x72/0x170
Oct 18 14:00:07 IONE kernel: [  363.478923]  do_iter_readv_writev+0x152/0x1b0
Oct 18 14:00:07 IONE kernel: [  363.478932]  do_iter_write+0x88/0x1c0
Oct 18 14:00:07 IONE kernel: [  363.478938]  vfs_iter_write+0x19/0x30
Oct 18 14:00:07 IONE kernel: [  363.478944]  iter_file_splice_write+0x276/0x3c0
Oct 18 14:00:07 IONE kernel: [  363.478954]  do_splice_from+0x21/0x40
Oct 18 14:00:07 IONE kernel: [  363.478960]  do_splice+0x2e8/0x650
Oct 18 14:00:07 IONE kernel: [  363.478966]  __do_splice+0xde/0x160
Oct 18 14:00:07 IONE kernel: [  363.478972]  __x64_sys_splice+0x99/0x110
Oct 18 14:00:07 IONE kernel: [  363.478978]  do_syscall_64+0x38/0x90
Oct 18 14:00:07 IONE kernel: [  363.478983]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Oct 18 14:00:07 IONE kernel: [  363.478990] RIP: 0033:0x7faa74c4a7f3
Oct 18 14:00:07 IONE kernel: [  363.478995] RSP: 002b:00007faa71adc700 EFLAGS: 00000293 ORIG_RAX: 0000000000000113
Oct 18 14:00:07 IONE kernel: [  363.479000] RAX: ffffffffffffffda RBX: 0000000000100000 RCX: 00007faa74c4a7f3
Oct 18 14:00:07 IONE kernel: [  363.479004] RDX: 0000000000000016 RSI: 0000000000000000 RDI: 0000000000000017
Oct 18 14:00:07 IONE kernel: [  363.479006] RBP: 0000000000000000 R08: 0000000000100000 R09: 0000000000000004
Oct 18 14:00:07 IONE kernel: [  363.479009] R10: 00007faa71adc840 R11: 0000000000000293 R12: 0000000000000016
Oct 18 14:00:07 IONE kernel: [  363.479012] R13: 0000000000000000 R14: 0000000000000017 R15: 00007faa71adc850
 

Stephan

Well-Known Member
Apr 21, 2017
920
698
93
Germany
Can you transplant the drives with controller to a different mainboard? Even just for testing, without case, spread out on a table. These kernel warnings are concerning and should never happen like this.

Also, it is high time to back up whatever is on the raid, in case the ext4 on that raid blows up. Make sure you have MD5 checksums of all files, just in case something runs even more amok and starts spraying junk all over the array. If things are too big, maybe prioritize and copy the most important stuff off to a 14-16 TB USB3 drive like a WD My Book. Just to play it safe and prevent tears.
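A minimal sketch of what that could look like (the output path is just an example; a full pass over the array will take a long time):
Code:
# record checksums for everything on the array
find /mnt/storage -type f -print0 | xargs -0 md5sum > /root/storage.md5

# later: re-check and show only files that changed or failed
md5sum -c /root/storage.md5 | grep -v ': OK$'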
 
  • Like
Reactions: tinfoil3d

Mashie

Member
Jun 26, 2020
37
9
8
Can you transplant the drives with controller to a different mainboard? Even just for testing, without case, spread out on a table. These kernel warnings are concerning and should never happen like this.

Also, it is high time to back up whatever is on the raid, in case the ext4 on that raid blows up. Make sure you have MD5 checksums of all files, just in case something runs even more amok and starts spraying junk all over the array. If things are too big, maybe prioritize and copy the most important stuff off to a 14-16 TB USB3 drive like a WD My Book. Just to play it safe and prevent tears.
I originally had the array use the on-board SATA controllers and moving to the LSI 9305 was one attempt to rule the motherboard out. I don't have a spare motherboard/CPU to try with.

The most important bits I have on Google Drive already so if the rest is lost it is mainly a massive inconvenience.
 

Mashie

Member
Jun 26, 2020
37
9
8
Next time you reboot (normally--no need to force the event), try to cause the stall with the following command:
Code:
dd if=/dev/zero of=/mnt/storage/40MBz bs=8M count=5 oflag=direct
This worked perfectly fine to trigger the stall with.
 

UhClem

just another Bozo on the bus
Jun 26, 2012
435
249
43
NH, USA
This worked perfectly fine to trigger the stall with.
Good. [the intent was to have a minimal "provoker" not involving foreign actor (thunar) or device (nvme)]
I originally had the array use the on-board SATA controllers and moving to the LSI 9305 was one attempt to rule the motherboard out.
I think that does rule out the mobo--since the stall occurs with only the on-board SATAs (all on the C612 chipset), and (separately) with only the 9305 (on a CPU-PCIe slot).
[A bad memory location as culprit is effectively eliminated, since stall occurs with 2 different kernel versions.]

[ ... waiting on the --grow from 10==>N ... ]
 
  • Like
Reactions: Mashie

Mashie

Member
Jun 26, 2020
37
9
8
[ ... waiting on the --grow from 10==>N ... ]
And off we go, the reshaping should be done by Sunday evening at the current speed:

Code:
mashie@IONE:~$ sudo mdadm /dev/md0 --add /dev/sdh1
mdadm: added /dev/sdh1
mashie@IONE:~$ sudo mdadm /dev/md0 --add /dev/sdk1
mdadm: added /dev/sdk1
mashie@IONE:~$ sudo mdadm --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Sun Jun 30 22:27:54 2019
        Raid Level : raid6
        Array Size : 78129610752 (74510.20 GiB 80004.72 GB)
     Used Dev Size : 9766201344 (9313.78 GiB 10000.59 GB)
      Raid Devices : 10
     Total Devices : 12
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Thu Oct 21 23:46:53 2021
             State : clean 
    Active Devices : 10
   Working Devices : 12
    Failed Devices : 0
     Spare Devices : 2

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : IONE:0  (local to host IONE)
              UUID : 1f8e4385:3ef16ed6:20147617:818d417e
            Events : 144463

    Number   Major   Minor   RaidDevice State
       0       8       65        0      active sync   /dev/sde1
       1       8       81        1      active sync   /dev/sdf1
       2       8       97        2      active sync   /dev/sdg1
       3       8       17        3      active sync   /dev/sdb1
       4       8       33        4      active sync   /dev/sdc1
       6       8      177        5      active sync   /dev/sdl1
       5       8       49        6      active sync   /dev/sdd1
       7       8      193        7      active sync   /dev/sdm1
       9       8      129        8      active sync   /dev/sdi1
       8       8      145        9      active sync   /dev/sdj1

      10       8      113        -      spare   /dev/sdh1
      11       8      161        -      spare   /dev/sdk1
mashie@IONE:~$ sudo mdadm --grow --raid-devices=12 --backup-file=/root/md0_grow.bak /dev/md0
mdadm: Need to backup 20480K of critical section..
mashie@IONE:~$
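For reference, the reshape progress (and a rough ETA) shows up in /proc/mdstat, and once it finishes the ext4 on top still has to be grown -- a sketch, assuming the filesystem stays mounted (ext4 grows online):
Code:
# watch the reshape progress / estimated finish time
watch -n 60 cat /proc/mdstat

# or a one-off status line from mdadm
sudo mdadm --detail /dev/md0 | grep -i reshape

# after the reshape completes, grow the filesystem into the new space
sudo resize2fs /dev/md0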
 

lpallard

Member
Aug 17, 2013
276
11
18
I'm very late to the game, but a while back (about 8 years ago) I had to give up mdadm for a storage server because of constant kernel panics. If I remember correctly, this was due to a severe bug in mdadm for some kernel series... First thing you should try is to get a cheap IBM M1015 and flash it to IT mode. If you're using consumer grade hardware, never eliminate the possibility of firmware, BIOS, or quality issues. I'll try to get some details of what happened to me and post back if I find anything relevant.
 

Mashie

Member
Jun 26, 2020
37
9
8
I'm very late to the game, but a while back (about 8 years ago) I had to give up mdadm for a storage server because of constant kernel panics. If I remember correctly, this was due to a severe bug in mdadm for some kernel series... First thing you should try is to get a cheap IBM M1015 and flash it to IT mode. If you're using consumer grade hardware, never eliminate the possibility of firmware, BIOS, or quality issues. I'll try to get some details of what happened to me and post back if I find anything relevant.
Thanks, any info about MDADM issues is welcome.

I'm already on workstation hardware (E5-1650 v3 Xeon, ECC RAM and LSI 9305-24i controller).
 

Mashie

Member
Jun 26, 2020
37
9
8
[ ... waiting on the --grow from 10==>N ... ]
I successfully grew from 10 -> 12 and expanded the file system, which took quite a while.
At this stage the stall would still happen after reboot, but something had changed: it no longer did the heavy reading/seeking on just 5 specific drives, instead it was reading from all 12 drives and without much of the very noisy seeking. The stall, however, now lasts just over 6 minutes.

As no particular drive was standing out at this point, I expanded from 12 -> 14 and things neither improved nor degraded further.

This is the output from triggering the stalls now:

Code:
mashie@IONE:~$ dd if=/dev/zero of=/mnt/storage/40MBz bs=8M count=5 oflag=direct
5+0 records in
5+0 records out
41943040 bytes (42 MB, 40 MiB) copied, 553.211 s, 75.8 kB/s
mashie@IONE:~$
And the usual entries in syslog:

Code:
Nov  1 09:07:31 IONE kernel: [  363.766976] INFO: task jbd2/md0-8:1217 blocked for more than 120 seconds.
Nov  1 09:07:31 IONE kernel: [  363.766988]       Tainted: P           OE     5.11.0-38-generic #42~20.04.1-Ubuntu
Nov  1 09:07:31 IONE kernel: [  363.766992] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  1 09:07:31 IONE kernel: [  363.766995] task:jbd2/md0-8      state:D stack:    0 pid: 1217 ppid:     2 flags:0x00004000
Nov  1 09:07:31 IONE kernel: [  363.767003] Call Trace:
Nov  1 09:07:31 IONE kernel: [  363.767009]  __schedule+0x44c/0x8a0
Nov  1 09:07:31 IONE kernel: [  363.767021]  schedule+0x4f/0xc0
Nov  1 09:07:31 IONE kernel: [  363.767028]  jbd2_journal_commit_transaction+0x300/0x18f0
Nov  1 09:07:31 IONE kernel: [  363.767038]  ? dequeue_entity+0xd8/0x410
Nov  1 09:07:31 IONE kernel: [  363.767047]  ? wait_woken+0x80/0x80
Nov  1 09:07:31 IONE kernel: [  363.767053]  ? try_to_del_timer_sync+0x54/0x80
Nov  1 09:07:31 IONE kernel: [  363.767062]  kjournald2+0xb6/0x280
Nov  1 09:07:31 IONE kernel: [  363.767069]  ? wait_woken+0x80/0x80
Nov  1 09:07:31 IONE kernel: [  363.767073]  ? commit_timeout+0x20/0x20
Nov  1 09:07:31 IONE kernel: [  363.767078]  kthread+0x12b/0x150
Nov  1 09:07:31 IONE kernel: [  363.767086]  ? set_kthread_struct+0x40/0x40
Nov  1 09:07:31 IONE kernel: [  363.767093]  ret_from_fork+0x22/0x30
Nov  1 09:09:32 IONE kernel: [  484.597705] INFO: task jbd2/md0-8:1217 blocked for more than 241 seconds.
Nov  1 09:09:32 IONE kernel: [  484.597712]       Tainted: P           OE     5.11.0-38-generic #42~20.04.1-Ubuntu
Nov  1 09:09:32 IONE kernel: [  484.597713] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  1 09:09:32 IONE kernel: [  484.597714] task:jbd2/md0-8      state:D stack:    0 pid: 1217 ppid:     2 flags:0x00004000
Nov  1 09:09:32 IONE kernel: [  484.597717] Call Trace:
Nov  1 09:09:32 IONE kernel: [  484.597722]  __schedule+0x44c/0x8a0
Nov  1 09:09:32 IONE kernel: [  484.597727]  schedule+0x4f/0xc0
Nov  1 09:09:32 IONE kernel: [  484.597729]  jbd2_journal_commit_transaction+0x300/0x18f0
Nov  1 09:09:32 IONE kernel: [  484.597734]  ? dequeue_entity+0xd8/0x410
Nov  1 09:09:32 IONE kernel: [  484.597739]  ? wait_woken+0x80/0x80
Nov  1 09:09:32 IONE kernel: [  484.597742]  ? try_to_del_timer_sync+0x54/0x80
Nov  1 09:09:32 IONE kernel: [  484.597746]  kjournald2+0xb6/0x280
Nov  1 09:09:32 IONE kernel: [  484.597750]  ? wait_woken+0x80/0x80
Nov  1 09:09:32 IONE kernel: [  484.597752]  ? commit_timeout+0x20/0x20
Nov  1 09:09:32 IONE kernel: [  484.597754]  kthread+0x12b/0x150
Nov  1 09:09:32 IONE kernel: [  484.597758]  ? set_kthread_struct+0x40/0x40
Nov  1 09:09:32 IONE kernel: [  484.597760]  ret_from_fork+0x22/0x30
Nov  1 09:11:33 IONE kernel: [  605.427075] INFO: task jbd2/md0-8:1217 blocked for more than 362 seconds.
Nov  1 09:11:33 IONE kernel: [  605.427086]       Tainted: P           OE     5.11.0-38-generic #42~20.04.1-Ubuntu
Nov  1 09:11:33 IONE kernel: [  605.427090] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  1 09:11:33 IONE kernel: [  605.427093] task:jbd2/md0-8      state:D stack:    0 pid: 1217 ppid:     2 flags:0x00004000
Nov  1 09:11:33 IONE kernel: [  605.427101] Call Trace:
Nov  1 09:11:33 IONE kernel: [  605.427107]  __schedule+0x44c/0x8a0
Nov  1 09:11:33 IONE kernel: [  605.427118]  schedule+0x4f/0xc0
Nov  1 09:11:33 IONE kernel: [  605.427122]  jbd2_journal_commit_transaction+0x300/0x18f0
Nov  1 09:11:33 IONE kernel: [  605.427130]  ? dequeue_entity+0xd8/0x410
Nov  1 09:11:33 IONE kernel: [  605.427140]  ? wait_woken+0x80/0x80
Nov  1 09:11:33 IONE kernel: [  605.427147]  ? try_to_del_timer_sync+0x54/0x80
Nov  1 09:11:33 IONE kernel: [  605.427156]  kjournald2+0xb6/0x280
Nov  1 09:11:33 IONE kernel: [  605.427163]  ? wait_woken+0x80/0x80
Nov  1 09:11:33 IONE kernel: [  605.427167]  ? commit_timeout+0x20/0x20
Nov  1 09:11:33 IONE kernel: [  605.427173]  kthread+0x12b/0x150
Nov  1 09:11:33 IONE kernel: [  605.427181]  ? set_kthread_struct+0x40/0x40
Nov  1 09:11:33 IONE kernel: [  605.427188]  ret_from_fork+0x22/0x30
 

Goose

New Member
Jan 16, 2019
21
7
3
How about booting with a live CD and then mounting the array to see if the issue still occurs?

Just because your MD array is affected doesn't mean it's the cause... I think it's likely that another service is doing something that causes the drives to be busy.

EDIT:
If the issue still occurs with the live CD, then it's likely that one of your disks/disk paths is unhealthy. Try a long SMART self-test or a non-destructive badblocks run to see what's happening. I suppose you could also use WD's disk test tool; it will tell you if the drive(s) have reallocated sectors.
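A sketch of those checks (substitute each member disk for /dev/sdX; the non-destructive badblocks pass is slow and best done with the array stopped):
Code:
# start a long SMART self-test, then read the result from the self-test log later
sudo smartctl -t long /dev/sdX
sudo smartctl -l selftest /dev/sdX

# non-destructive read-write surface scan, with progress
sudo badblocks -nsv /dev/sdX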
 
  • Like
Reactions: UhClem

UhClem

just another Bozo on the bus
Jun 26, 2012
435
249
43
NH, USA
@Mashie , pardon my sloth -- kept getting bogged down trying to write a script that would have the best chance of exposing real/useful info.
... meanwhile ... Kudos to @Goose :
Just because your MD array is affected doesn't mean it's the cause... I think it's likely that another service is doing something that causes the drives to be busy.
I've had this suspicion ever since seeing:
Code:
Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
md0             105.00       420.00         0.00         0.00        420          0          0
nvme0n1           0.00         0.00         0.00         0.00          0          0          0
sda              21.00        84.00         0.00         0.00         84          0          0    (disk 7)
sdb              21.00        84.00         0.00         0.00         84          0          0    (disk 1)
sdc               0.00         0.00         0.00         0.00          0          0          0
sdd              21.00        84.00         0.00         0.00         84          0          0    (disk 3)
sde               0.00         0.00         0.00         0.00          0          0          0
sdf              21.00        84.00         0.00         0.00         84          0          0    (disk 9)
sdg              21.00        84.00         0.00         0.00         84          0          0    (disk 10)
sdh               0.00         0.00         0.00         0.00          0          0          0
sdi               0.00         0.00         0.00         0.00          0          0          0
sdj               0.00         0.00         0.00         0.00          0          0          0
sdk               0.00         0.00         0.00         0.00          0          0          0
4KB reads are the "tell".
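(84 kB/s at 21 tps works out to 4 kB per request -- single-block reads rather than streaming. For anyone following along, per-device numbers like these come from something along these lines, run in a second terminal while the stall is in progress; the 5-second interval is just an example:)
Code:
# from the sysstat package; prints per-device stats every 5 seconds
iostat -k 5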

I intended my new script to be run without Desktop/GUI cruft, or at single-user. A LiveCD might be better, but it possibly introduces a new "variable" of a different OS version/image, whereas just changing runlevel (on your existing boot) would only eliminate (all the) variables introduced with the Desktop.

With your newly-grown array, use:
Code:
dd if=/dev/zero of=/mnt/storage/60MBzero  bs=12M count=5 oflag=direct
to (try to) provoke.
 

Mashie

Member
Jun 26, 2020
37
9
8
Hi @UhClem I didn't see you had replied here, I guess the email notification was lost in the ether.

That command will happily trigger the stalling.
Code:
mashie@IONE:~$ dd if=/dev/zero of=/mnt/storage/60MBzero  bs=12M count=5 oflag=direct
5+0 records in
5+0 records out
62914560 bytes (63 MB, 60 MiB) copied, 532.051 s, 118 kB/s
mashie@IONE:~$
Whichever option you think is best to get rid of the cruft, I'm happy to give it a go. The stalling is now getting close to 10 minutes, which is starting to become quite annoying.
 

UhClem

just another Bozo on the bus
Jun 26, 2012
435
249
43
NH, USA
Whichever option you think is best to get rid of the cruft, I'm happy to give it a go. The stalling is now getting close to 10 minutes, which is starting to become quite annoying.
After more pondering, my bet is that Thunar is the agent provocateur[**], so, rather than mucking with runlevels, etc., (please humor me) try booting with Thunar disabled/eliminated, and try that dd command. (If I'm wrong, we can resort to mucking ...)

[**] This is/would-be not a fault of Thunar (it's just user-level code); but I believe Thunar might be exposing a (soft) bug in md.
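One way to test that theory during a stall (just a sketch) is to list which tasks are sitting in uninterruptible sleep (state D); if a Thunar worker shows up alongside jbd2/md0-8, that points the same way the traces above do:
Code:
# tasks currently in D state, with the kernel function they are waiting in
ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'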
 

Mashie

Member
Jun 26, 2020
37
9
8
After more pondering, my bet is that Thunar is the agent provocateur[**], so, rather than mucking with runlevels, etc., (please humor me) try booting with Thunar disabled/eliminated, and try that dd command. (If I'm wrong, we can resort to mucking ...)

[**] This is/would-be not a fault of Thunar (it's just user-level code); but I believe Thunar might be exposing a (soft) bug in md.
What is the easiest way to disable Thunar?
 

UhClem

just another Bozo on the bus
Jun 26, 2012
435
249
43
NH, USA
What is the easiest way to disable Thunar?
I don't know; I am (since 1973) strictly command-line, on Unix.
[Mea culpa ... "Do as I say, not as I do."]
(...Googling...)
... maybe there is something to toggle/comment in a startup file (for Xfce?).
Speaking of which, maybe there is an (easy?) way to suppress startup of Xfce (vs runlevels, systemctl, etc).
Now, it is I who is a stranger in a strange land.

Any help here, STHers??
 
  • Like
Reactions: tinfoil3d

Goose

New Member
Jan 16, 2019
21
7
3
If you don't want to try a liveCD, then try booting to a lower runlevel such as 3. See What Are “Runlevels” on Linux? for info on how to do it.

That way X won't be loaded, so you won't have issues with Thunar, but TBH I doubt that's the issue. It may be some form of indexing, but again probably not.
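On a systemd-based Ubuntu the runlevel-3 equivalent is the multi-user target, so a sketch of trying it for one session and then going back would be:
Code:
# drop to text-only (no Xfce, no Thunar) for the current session
sudo systemctl isolate multi-user.target

# ...log in on the console, run the dd test, then return to the desktop
sudo systemctl isolate graphical.target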
 
  • Like
Reactions: tinfoil3d

MrCalvin

IT consultant, Denmark
Aug 22, 2016
87
15
8
51
Denmark
www.wit.dk
A good example of how running "stable" (an older kernel) doesn't always mean you get a stable system!
RHEL comes to mind, where an old kernel is chosen for the highest stability, but is that true? I feel there is an exaggerated trust in old kernels.
(Sorry for going a little off topic.)