MDADM and LSI 2308IT drive drops on wake from sleep

chinesestunna

Active Member
Jan 23, 2015
Hi all,

I recently upgraded my home lab/storage server, mainly the CPU, motherboard, and storage controller:
  1. Motherboard: Supermicro X8DTE-F to Supermicro X9DRD-7LNF
  2. CPU: Dual L5630 Xeons to Single E5-2670
  3. Storage Controller (PCIe passthrough) - Dell H200 (LSI 2008) to LSI 2308 IT Mode (onboard on the Supermicro), still connected to Intel 24 port SAS expander which connects out to the drives
  4. ESXi 5.5 to ESXi 6.7
  5. Debian 6 based build to OpenMediaVault 4 (Debian 9 based)
The server was running 2 storage arrays (10 x 3TB and 8 x 2TB) using direct passthrough of the SAS controller and expander to the OpenMediaVault VM. These were moved over without a hitch. I have all the drives set to spin down after 1 hour of inactivity, and they would spin back up as soon as the array was accessed.

Issue:

I've been running this setup for the past month and a strange issue has been plaguing it: every few days, one of the drives, usually the same drive in the 10 x 3TB array, drops from the array when the array spins up. The dmesg log shows a series of drive wake-up commands as the array is accessed from sleep, along with various link resets, expander wake-ups, etc. This one drive seems to fail to connect and gets dropped from the array, even though its SMART data is perfectly healthy and it passes short/long self-tests, badblocks, etc. After confirming the drive was fine, I would re-add it and MDADM would resync without a hitch. The last couple of times I didn't even bother rebooting the server or VM; simply forcing a rescan of the SATA device ID brought it back and the rebuild was under way.
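For anyone who wants to catch the drop without reading dmesg by hand, here is a minimal sketch that picks out members flagged as faulty in /proc/mdstat. The sample text, array name and device names are made up for illustration and are not from my server; the `(F)` suffix is how the md driver marks a faulty member in that file.

```python
import re

# Illustrative /proc/mdstat content (hypothetical array and device names).
SAMPLE_MDSTAT = """\
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdk[9] sdj[8](F) sdi[7] sdh[6] sdg[5] sdf[4] sde[3] sdd[2] sdc[1] sdb[0]
      1000 blocks super 1.2 level 6, 512k chunk, algorithm 2 [10/9] [UUUUUUUU_U]

unused devices: <none>
"""

# A member looks like "sdj[8]" with an optional flag such as "(F)" or "(S)".
MEMBER = re.compile(r"(\w+)\[(\d+)\](\([A-Z]\))?")


def failed_members(mdstat_text):
    """Return {array_name: [devices marked (F)]} for arrays with faulty members."""
    result = {}
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\w+) : ", line)
        if not header:
            continue  # only array header lines list the members
        failed = [dev for dev, _slot, flag in MEMBER.findall(line) if flag == "(F)"]
        if failed:
            result[header.group(1)] = failed
    return result


if __name__ == "__main__":
    # On a real system, read the live file instead of the sample:
    #   with open("/proc/mdstat") as f: text = f.read()
    print(failed_members(SAMPLE_MDSTAT))
```

This only reports what the kernel already decided; it does not rescan or re-add anything.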

Diagnosis:

I spent a fair amount of time diagnosing this: I checked cabling, power delivery, various settings in the LSI card BIOS, etc., to no avail. One thing I did observe immediately after the upgrade was that array spin-up was slower than before. I didn't time it, but it was noticeably slower: it used to take under 15 sec, and now it was closer to 30-45 sec.
Even after disabling staggered spin-up the issue persisted. I also changed all the drive parameters that are recommended online for RAID usage, such as setting SCT ERC (scterc) to 70 (7.0 seconds), to no avail :(
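For reference, the error recovery timeout mentioned above is normally set with smartctl's `-l scterc,READ,WRITE` option, where both values are in tenths of a second. A small sketch that only builds the command line (it does not run it); the device path is a placeholder and the helper name is my own:

```python
def scterc_command(device, read_ds=70, write_ds=70):
    """Build the smartctl argv that sets SCT Error Recovery Control.

    read_ds / write_ds are in deciseconds, so 70 means 7.0 seconds.
    """
    return ["smartctl", "-l", f"scterc,{read_ds},{write_ds}", device]


def deciseconds_to_seconds(value_ds):
    """Convert an scterc value to seconds."""
    return value_ds / 10


if __name__ == "__main__":
    # /dev/sdX is a placeholder; substitute a real member disk.
    print(" ".join(scterc_command("/dev/sdX")))
    print(deciseconds_to_seconds(70), "seconds")
```

Many drives forget this setting after a power cycle, so it usually has to be reapplied at boot.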

Solution:
Eventually I think I found the solution: the LSI card's BIOS (the old Dell card didn't even load a BIOS, it was completely transparent) has a couple of delayed spin-up settings, namely how many seconds to wait between spin-up batches and how many drives to spin up per batch.
It would appear that MDADM is not aware, or not fully aware, of this delay and marks a drive as "failed" if it doesn't respond within a certain time frame. Since my PSU can handle the power draw (I never had a problem before without a spin-up delay), I disabled the delay by setting the time to 0 seconds and the batch size to 10 drives. For the past 2 weeks I have not had the drive drop issue.
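To see why a batch delay could matter, here is a rough back-of-the-envelope sketch. It assumes the drop is triggered by a fixed timeout while drives wake in batches; the 30-second figure is the Linux SCSI disk default (visible in /sys/block/sdX/device/timeout), but whether that particular timeout is what fires in my setup is an assumption, and the drive counts, delays and spin-up time below are hypothetical.

```python
import math


def last_batch_start(num_drives, drives_per_batch, delay_s):
    """Seconds after the first spin-up at which the final batch starts."""
    if drives_per_batch <= 0:
        raise ValueError("drives_per_batch must be positive")
    batches = math.ceil(num_drives / drives_per_batch)
    return (batches - 1) * delay_s


def exceeds_timeout(num_drives, drives_per_batch, delay_s, spinup_s, timeout_s=30):
    """True if the last drive would become ready only after the timeout."""
    ready_at = last_batch_start(num_drives, drives_per_batch, delay_s) + spinup_s
    return ready_at > timeout_s


if __name__ == "__main__":
    # Hypothetical numbers: 18 drives, each taking about 10 s to spin up.
    for per_batch, delay in [(1, 2), (2, 2), (10, 0)]:
        late = exceeds_timeout(18, per_batch, delay, spinup_s=10)
        print(f"{per_batch} per batch, {delay}s delay -> exceeds timeout: {late}")
```

With a delay of 0 the last drive is ready as soon as a single drive finishes spinning up, which is the configuration I ended up with.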

I did a good amount of googling and reading about others' experiences, but most RAID (hardware or software) drive drops were due to TLER timeout issues; nothing mentioned delayed spin-up causing a problem. This seems very strange to me, so I wanted to document it here for anyone else who may run into the issue.


See post 2 for more updates
 

chinesestunna

Not solved :( I spoke too soon. Further research turned up this GitHub issue for ZFS. I'm not running ZFS, but the symptoms are identical: the OS marks a drive as unresponsive on a wake-up event, the RAID manager flags the drive as bad, and it is dropped from the array:
ZFS io error when disks are in idle/standby/spindown mode · Issue #4713 · zfsonlinux/zfs

It seems that the issue was introduced in kernels after Linux 2.6. Before stumbling upon that thread, I hadn't even considered that 2 weeks before the hardware upgrade I had updated the NAS VM because Debian 7 was EOL. The new NAS VM is Debian 9 based, and that's when the issue started flaring up.
 