BSOD when pulling disk from redundant array

Hi. I'm busy testing a new system before I put it into production.
Specs:
Windows Server 2012 R2.
Supermicro SC846BE16-R920B Chassis (SAS2 24 Drive bays)
Supermicro X10SLM+-F motherboard, 32GB RAM
Adaptec ASR81605ZQ SAS3 RAID controller, connected to chassis SAS2 backplane
2 x Intel SSDSC2CW24 240GB SSD in RAID1, used for (UEFI) Boot drive
3 x Samsung SSD 840 1TB SSD in RAID5, 256GB used for MaxCache, and the rest for a GPT volume used for Hyper-V VHDs
11 + 1 hotspare x Hitachi HDS5C404 4TB Coolspin, in RAID6, GPT logical volume

I want to make sure the system can survive a disk failure, so I pull a disk while the system is running.
The OS BSODs and the screen goes black. IPMI reports the system is still running, but both the physical monitor and the IPMI KVM show a black screen with no signal.
The RAID controller alarm sounds, and I can see the hotspare firing up and a rebuild starting.

After a hard power cycle, the RAID6 volume is still rebuilding. On the first boot the OS reports automatic repair (I've never seen this before), then boots to the recovery console. On the second boot it goes to the normal OS.
The OS reports recovery from a driver fault, and a full memory dump was created.

The memory dump shows the OS shut down because a critical process, CSRSS, failed.

The system must be able to survive a drive pull, else I have no confidence it can survive a drive failure.

Any ideas?

P.
 

Jeggs101

We are heading out on vacation for the next month. From the airport I'm wondering if there is a log file on this. The whole idea of RAID is that this does not happen.
 
I tried again, this time with the server under load, running VMs and copying files over the network. I pulled two drives from the RAID6, one drive from the RAID5, and one drive from the RAID1. The rebuilds started automatically, there was no BSOD, and the system kept running.

Adaptec support says that based on the support logs, when the system died, the backplane reported a momentary state where all drives were offline, and then came online again.
Very suspicious.

The SM BPN-SAS2-846EL1 backplane is reported as an LSI SAS2X36 with firmware version 0e0b.
I can't find any details on what the latest firmware for this backplane is supposed to be.
Anybody know?
 

bwillcox

That is a case of the expander backplane barfing and knocking all or most of the drives offline. If any error happens on the bus, all the expander can do is reset itself, and that interrupts comms to the drives. RAID cards of any brand usually get angry about that.

That is why the hard core storage guys will tell you only use SAS drives on the expanders. This is also why I greatly prefer the setup with the passive backplane and the 24 port Adaptecs with SATA drives or SSDs in our big storage boxes at the day job.

SM support ought to be able to give you a hand with the firmware on the expander.
 
I got an updated firmware v "55.14.18.0" for the expander from SM.

From the "SMC ExpanderXtoolsLite v1.5_Window" tools package:
"xflash.exe -i get avail" finds the backplane.
"xflash.exe -i 500304800067B83F get ver" reports the version as "55.14.11.00"
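A minimal sketch of checking whether the expander still needs the update, using only the two version strings quoted above. The exact text that "get ver" prints is not shown in this thread, so the assumption here is simply that it contains a dotted version somewhere in the output. The function names are made up for illustration.

```python
import re


def parse_version(text):
    """Pull the first dotted version (e.g. '55.14.11.00') out of text as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError(f"no dotted version found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def needs_update(reported, target):
    """Return True when the reported firmware version is older than the target."""
    a, b = parse_version(reported), parse_version(target)
    # Pad the shorter tuple with zeros so '55.14.18' and '55.14.18.0' compare equal.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a < b


# Version reported by the expander vs. the version supplied by SM.
print(needs_update("55.14.11.00", "55.14.18.0"))
```

Comparing as integer tuples avoids the trap of comparing the strings directly, where "11.00" and "11.0" would count as different versions.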

SM docs say not to flash with xflash.exe, as the WWN can get lost, and to use the smc.exe GUI app instead, but smc.exe does not find the backplane.
It seems like smc.exe is looking for the backplane on the "Microsoft Storage Spaces" controller, and I can't find a way to tell it to look on the other controllers.

Anybody know if / how to tell smc.exe to keep looking on all controllers?
I asked SM support, but I don't expect an answer until tomorrow.
 
No word from SM yet on how to get smc.exe to use the correct HBA.
Maybe I should just try an old WinPE (pre-VHD pre-storage spaces)?

I went ahead and used xflash to update the firmware.
As the SM FAQ warned, my WWN has reset to 0x7F. With only one expander that is not a big deal.

I can't figure out how to use xflash to change the WWN. It looks like an XML template can be used, but I don't know the format.

It is now pretty easy to repro the crash: as Adaptec said, the cause is the expander reset.
So all I do is "xflash -i [WWN] reset exp", and boom, OS crash due to csrss terminating.
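For anyone wanting to repeat the test, here is a rough sketch of wrapping that reset command so each attempt is timestamped and can be matched against the controller's support logs afterwards. Only the "-i [WWN] reset exp" arguments come from this thread. The wrapper names, the "xflash.exe" tool name on the PATH, and the dry-run default are assumptions. With dry_run=False this will take the drives offline, so it belongs on a test box only.

```python
import subprocess
from datetime import datetime


def reset_command(wwn, tool="xflash.exe"):
    """Build the expander reset command line quoted in the post."""
    return [tool, "-i", wwn, "reset", "exp"]


def trigger_reset(wwn, dry_run=True):
    """Print a timestamped record of the reset, then run it unless dry_run is set."""
    cmd = reset_command(wwn)
    stamp = datetime.now().isoformat(timespec="seconds")
    print(f"{stamp} expander reset: {' '.join(cmd)}")
    if dry_run:
        return None  # nothing was executed
    # check=False: the caller inspects the exit code instead of getting an exception.
    return subprocess.run(cmd, check=False).returncode


# Dry run only; substitute the WWN that "get avail" reports on your system.
trigger_reset("500304800067B83F")
```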


That does mean that placing the OS boot drives, RAID1 or not, on the expander is pretty risky.
How do you guys run OS boot drives?
On expander in RAID1?
On motherboard SATA no RAID?
On motherboard SATA Intel RAID1?