MDADM and LSI 2308IT drive drops on wake from sleep

chinesestunna · Jul 31, 2018

Hi all,

Recently I've upgraded my home lab/storage server, mainly the chip/board/storage controller:

Motherboard: Supermicro X8DTE-F to Supermicro X9DRD-7LNF
CPU: Dual L5630 Xeons to Single E5-2670
Storage Controller (PCIe passthrough) - Dell H200 (LSI 2008) to LSI 2308 IT Mode (onboard on the Supermicro), still connected to Intel 24 port SAS expander which connects out to the drives
ESXi 5.5 to ESXi 6.7
Debian 6 based build to OpenMediaVault 4 (Debian 9 based)

The server was running 2 storage arrays (10 x 3TB and 8 x 2TB) using direct passthrough of the SAS controller and expander to OpenMediaVault VM. These were moved over without a hitch. I have all the drives set to spin down after 1 hour of inactivity. They would spin back up as soon as the array was accessed.

Issue:
I've been running this setup for the past month, and a strange issue has been plaguing it: every few days, one of the drives, usually the same drive in the 10 x 3TB array, would drop from the array while the array was spinning up. Dmesg logs show a series of drive wake-up commands as the array was accessed from sleep, along with various link resets, expander wake-ups, etc. This one drive, whose SMART data was perfectly healthy and which passed short/long tests, badblocks, etc., would apparently fail to connect and be dropped from the array. After confirming the drive was fine, I would re-add it and MDADM would resync without a hitch. The last couple of times I didn't even bother rebooting the server or VM; simply forcing a rescan of the SATA device ID brought it back and we were off to rebuild.
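For anyone wanting to catch a dropped member sooner, here is a rough Python sketch that scans the text of /proc/mdstat for arrays running with fewer active members than expected. The function name `degraded_arrays` is my own, and the parsing assumes the usual `[expected/active] [UU_...]` status line that md prints for redundant arrays; treat it as an illustration rather than a tested monitoring tool.

```python
import re


def degraded_arrays(mdstat_text):
    """Return {array_name: (expected, active, status_map)} for degraded arrays.

    Assumes the common /proc/mdstat layout: a line starting with the array
    name ("md0 : active raid6 ...") followed by a line that contains
    something like "[10/9] [UUUUUUUUU_]".
    """
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\S+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            expected = int(status.group(1))
            active = int(status.group(2))
            if active < expected:
                result[current] = (expected, active, status.group(3))
    return result
```

On the NAS itself this could be fed with `open("/proc/mdstat").read()` from a cron job, so a drop shows up without having to notice it in dmesg first.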

Diagnosis:
I spent a fair amount of time diagnosing this and checked cabling, power delivery, various settings in the LSI card BIOS, etc., to no avail. One thing I did observe immediately after the upgrade was that array spin-up was slower than before. I didn't time it, but it was noticeably slower: it used to take under 15 sec or so, and now it was closer to 30-45 sec.
Even after disabling staggered spin-up the issue persisted. I've also changed all the drive parameters that are recommended online for RAID usage, such as setting SCTERC to 70, to no avail.

Solution:
Eventually I think I found the solution: the LSI card's BIOS (the old Dell card didn't even load a BIOS; it was completely transparent) has a couple of delayed spin-up settings: how many seconds of delay between spin-up batches and how many drives to spin up per batch.
It would appear that MDADM is not aware, or not fully aware, of this delay and marks a drive as "failed" if it doesn't respond within a certain time frame. Since my PSU can handle the power draw (I never had a problem before without spin-up delay), I disabled the delay by setting the time to 0 seconds and the batch size to 10 drives. So far, for the past 2 weeks, I have not had the drive drop issue.
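To put rough numbers on the batching idea, here is a small Python sketch of the arithmetic, assuming the controller starts one batch of drives every `delay_s` seconds and each drive needs `spinup_s` seconds to become ready once started. The function names and the example values are my own illustration, not settings read from the LSI BIOS, and the real controller/expander behaviour may differ.

```python
import math


def last_batch_start_delay(num_drives, batch_size, delay_s):
    """Seconds until the final batch of drives is told to spin up.

    Assumes batches are started back to back, delay_s seconds apart,
    with the first batch starting at time 0.
    """
    if num_drives <= 0:
        return 0
    batches = math.ceil(num_drives / batch_size)
    return (batches - 1) * delay_s


def worst_case_ready_time(num_drives, batch_size, delay_s, spinup_s):
    """Seconds until the last drive is ready under the same assumptions."""
    return last_batch_start_delay(num_drives, batch_size, delay_s) + spinup_s


# Hypothetical values: 10 drives, 2 per batch, 5 s between batches,
# 10 s for a single drive to spin up.
staggered = worst_case_ready_time(10, 2, 5, 10)   # 4 waits of 5 s + 10 s
all_at_once = worst_case_ready_time(10, 10, 0, 10)  # single batch, no delay
```

With those made-up numbers the last drive is ready after 30 s when staggered versus 10 s when everything spins up at once, so a drive in the final batch is the one most likely to miss whatever response deadline the OS applies.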

I did a good amount of googling and reading about others' experiences, but most RAID (hardware or software) drive drops were due to TLER timeout issues; nothing mentioned delayed spin-up causing a problem. This seems very strange to me, so I wanted to document it here for anyone else who may run into the issue.

See post 2 for more updates

chinesestunna · Aug 12, 2018

Not Solved

Spoke too soon. Further research turned up this thread on GitHub for ZFS. I'm not running ZFS, but the symptoms are identical: the OS marks a drive as unresponsive on a wake-up event, the RAID manager flags the drive as bad, and it gets dropped from the array:
ZFS io error when disks are in idle/standby/spindown mode · Issue #4713 · zfsonlinux/zfs

It seems that the issue was introduced in kernels after Linux 2.6. Before stumbling upon that thread I hadn't even considered that, 2 weeks before the hardware update, I had updated the NAS VM because Debian 7 was EOL. The new NAS VM is Debian 9 based, and that's when the issue started flaring up.
