Hi all,
Recently I upgraded my home lab/storage server, mainly the CPU/board/storage controller (the full list of changes is further down in this post).
Issue:
I've been running this setup for the past month and a strange issue has been plaguing it - every few days one of the drives, usually the same drive in the 10 x 3TB array, would drop from the array when the array was spinning up. The dmesg log shows a series of drive wake-up commands as the array was accessed from sleep, along with various link resets, expander wake-ups etc. This one drive, whose SMART data was perfectly healthy and which passed short/long self-tests, badblocks etc., would seemingly fail to connect and be dropped from the array. After confirming the drive was fine, I would re-add it and MDADM would resync without a hitch. The last couple of times I didn't even bother rebooting the server or VM; simply forcing a rescan of the SATA device ID brought it back and we were off to rebuild.
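For anyone who wants to script that recovery step, here is a minimal Python sketch that only builds the two commands and does not run anything. The array `/dev/md0`, the member `/dev/sdk` and SCSI host number `0` are placeholders, not values from this setup. Writing `- - -` to the host's sysfs `scan` file and `mdadm --manage ... --re-add` are the standard Linux mechanisms, but the exact rescan method used above is not stated, and whether `--re-add` is accepted (rather than a plain `--add`) depends on the state of the array.

```python
import shlex


def rescan_command(host: int) -> str:
    """Shell command asking SCSI host `host` to rescan for devices.

    Writing "- - -" to the scan file is a wildcard for
    all channels, targets and LUNs on that host.
    """
    return f"echo '- - -' > /sys/class/scsi_host/host{host}/scan"


def readd_command(array: str, member: str) -> str:
    """Shell command asking mdadm to put a dropped member back into the array."""
    return shlex.join(["mdadm", "--manage", array, "--re-add", member])


if __name__ == "__main__":
    # Placeholder names; substitute the real host number, array and drive.
    print(rescan_command(0))
    print(readd_command("/dev/md0", "/dev/sdk"))
```

Both commands need root when actually run, and it is worth checking `cat /proc/mdstat` before and after to confirm which member was dropped.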
Diagnosis:
I spent a fair amount of time diagnosing this and checked cabling, power delivery, various settings in the LSI card's BIOS etc., to no avail. One thing I did observe immediately after the upgrade was that array spin-up was slower than before. I didn't time it, but it was noticeably slower: it used to take under 15 sec or so, and now it was closer to 30-45 sec.
Even after disabling staggered spin-up the issue persisted. I also changed all the drive parameters that are referenced online for RAID usage, such as setting SCT ERC (scterc) to 70, also to no avail.
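The "70" above is assumed here to be the SCT Error Recovery Control value, which smartctl takes in tenths of a second (so 70 = 7.0 s). A minimal Python sketch of how that setting relates to the kernel's per-device command timeout follows. It only builds the command and the sysfs path and does not touch any hardware. `/dev/sdk` is a placeholder device, and 30 s is the Linux SCSI disk default, not a value read from this server.

```python
from typing import List

DEFAULT_KERNEL_TIMEOUT_S = 30  # Linux default for /sys/block/<dev>/device/timeout


def scterc_command(device: str, read_ds: int = 70, write_ds: int = 70) -> List[str]:
    """smartctl invocation that sets the read/write error recovery limits.

    Values are in deciseconds (tenths of a second).
    """
    return ["smartctl", "-l", f"scterc,{read_ds},{write_ds}", device]


def deciseconds_to_seconds(value_ds: int) -> float:
    """Convert a scterc value to seconds."""
    return value_ds / 10


def kernel_timeout_path(device: str) -> str:
    """sysfs file holding the kernel's command timeout (seconds) for a disk."""
    name = device.rsplit("/", 1)[-1]
    return f"/sys/block/{name}/device/timeout"


def erc_below_kernel_timeout(erc_ds: int, kernel_timeout_s: float = DEFAULT_KERNEL_TIMEOUT_S) -> bool:
    """True if the drive gives up on a bad sector before the kernel gives up on the drive."""
    return deciseconds_to_seconds(erc_ds) < kernel_timeout_s


if __name__ == "__main__":
    print(" ".join(scterc_command("/dev/sdk")))
    print(kernel_timeout_path("/dev/sdk"))
    print(erc_below_kernel_timeout(70))
```

This setting only bounds how long a drive spends retrying a bad sector. It says nothing about how long a sleeping drive takes to spin up, which would fit with the change making no difference here.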
Solution:
Eventually I think I found the solution. The LSI card's BIOS (the old Dell card didn't even load a BIOS, it was completely transparent) has a couple of delayed spin-up settings: how many seconds to wait between spin-up batches, and how many drives to spin up per batch.
It would appear that MDADM is not aware, or not fully aware, of this delay and marks the drive as "failed" if it doesn't respond within a certain time frame. Since my PSU can handle the power draw (I never had a problem before without a spin-up delay), I disabled the delay by setting the time to 0 seconds and the batch size to 10 drives. In the 2 weeks since, the drive drop issue has not come back.
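To see why batch size and delay matter, a rough back-of-the-envelope model helps. The Python sketch below assumes the controller starts one batch, waits the configured delay, then starts the next. The per-drive spin-up time of 10 s, the example setting of "2 drives every 5 s" and the 30 s limit (the Linux SCSI layer's default per-command timeout) are illustrative assumptions, not values read from this card and not confirmed as the actual cause.

```python
import math

SPINUP_S = 10   # assumed time for one drive to become ready
LIMIT_S = 30    # assumed limit: default Linux SCSI command timeout


def last_batch_start(n_drives: int, batch_size: int, delay_s: float) -> float:
    """Seconds after wake-up at which the final batch begins spinning up."""
    batches = math.ceil(n_drives / batch_size)
    return (batches - 1) * delay_s


def last_drive_ready(n_drives: int, batch_size: int, delay_s: float,
                     spinup_s: float = SPINUP_S) -> float:
    """Seconds after wake-up at which the slowest drive is ready."""
    return last_batch_start(n_drives, batch_size, delay_s) + spinup_s


if __name__ == "__main__":
    settings = [
        ("staggered (hypothetical 2 drives / 5 s)", 2, 5),
        ("delay disabled (10 drives / 0 s)", 10, 0),
    ]
    for label, batch, delay in settings:
        ready = last_drive_ready(10, batch, delay)
        verdict = "within" if ready < LIMIT_S else "at or over"
        print(f"{label}: last drive ready after {ready} s, {verdict} the {LIMIT_S} s limit")
```

Under these assumptions the drives in the last batch are the ones that come closest to the limit, which would at least be consistent with it usually being the same drive that drops.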
I did a good amount of googling and reading about other people's experiences, but most RAID (hardware or software) drive drops were put down to TLER timeout issues, with nothing mentioning delayed spin-up as a cause. This seems very strange to me, so I wanted to document it here for anyone else who may run into the issue.
Changes made in the upgrade (old to new):
- Motherboard: Supermicro X8DTE-F to Supermicro X9DRD-7LNF
- CPU: Dual L5630 Xeons to Single E5-2670
- Storage controller (PCIe passthrough): Dell H200 (LSI 2008) to LSI 2308 in IT mode (onboard on the Supermicro), still connected to an Intel 24-port SAS expander which connects out to the drives
- Hypervisor: ESXi 5.5 to ESXi 6.7
- OS: Debian 6 based build to OpenMediaVault 4 (Debian 9 based)
See post 2 for more updates