Server keeps shutting down and dropping hard drives

smidley

New Member
Feb 7, 2011
17
0
1
First, the specs.
  • Case: Norco 4220
  • Power Supply: Corsair 750W
  • Motherboard: Tyan S7012
  • CPU: Intel Xeon E5540
  • RAM: 64 GB ECC DDR3
  • RAID Card: LSI MegaRAID SAS 9260-16i
  • Hard Drives: 1x 256 GB SSD, 19x 2 TB spindle drives. All spindle drives are in a RAID 5 array. Passthrough is used to present the array to the VM.
  • OS: ESXi 6 (Installed to USB thumb drive)
  • VM OS: Server 2012 R2
This server was rock stable until the last couple of days. No changes have been made and no patches have been applied lately. The server runs fine and then, without warning, the physical machine shuts off. It turns itself back on after a few minutes and boots up normally. Nothing in the error logs points to a cause. As of today I also have a second problem: one of the hard drives drops out of the array briefly and then starts working again. Array logs here: MegaRAID Storage Manager 15.08.01.02 Event Log - Generated on Wed Feb 10 20:41:0 - Pastebin.com

While I was troubleshooting the hard drive issue, the entire RAID controller seemed to shut down and then came back online, which makes me wonder whether the power supply is at fault. I also deliberately ran a job in my server VM that pushed the CPU to 100%, and after about a minute of that the server shut itself off again. That points to a power-related problem too, so I'm leaning towards the power supply or the motherboard. I'm on the latest version of the motherboard BIOS. Does anyone have any suggestions?
 

FMA1394

Active Member
Jan 11, 2013
624
186
43
I would agree with you on the power supply or motherboard. Check for bad caps. That's what that sounds like to me.
 

Quasduco

Active Member
Nov 16, 2015
126
46
28
109
Tennessee
I think most importantly, you should be arranging backups somewhere other than that server ASAP. Those power losses could be causing some nasty data corruption.

Also, not to give you too hard of a time, but you have all 19 spinners in a single RAID 5? That is a recipe for problems all by itself...
 

izx

Active Member
Jan 17, 2016
113
38
28
36
Did you notice any issues in the BMC/IPMI System Event Log?

The controller keeps warning about timeouts and resets on the last port (12-15) with reference to SAS address 0x50014380085EB6C7. Is that the problem disk (most probably), or the onboard expander?

What manufacturer/model is the problem disk?
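
If the full event log is saved as plain text, one quick way to see which device the controller complains about most is to tally the SAS addresses that appear in it. This is only a rough Python sketch: it assumes addresses are written as `0x` followed by 16 hex digits (like the one above), `count_sas_addresses` is a hypothetical helper name, and the sample lines are made up for illustration rather than taken from the actual log.

```python
import re
from collections import Counter

# Assumption: SAS addresses appear as "0x" followed by exactly 16 hex digits.
SAS_ADDR = re.compile(r"0x[0-9A-Fa-f]{16}\b")


def count_sas_addresses(log_text: str) -> Counter:
    """Count how often each SAS address appears in the log text."""
    return Counter(addr.lower() for addr in SAS_ADDR.findall(log_text))


# Made-up sample lines, NOT real MegaRAID Storage Manager output.
sample_log = """\
example event: timeout on device 0x50014380085EB6C7
example event: reset sent to 0x50014380085EB6C7
example event: device 0x5001438000000001 is fine
"""

for address, hits in count_sas_addresses(sample_log).most_common():
    print(address, hits)
```

An address that dominates the count while the others barely appear would suggest a single disk or port rather than the whole controller.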
 

smidley

Quasduco said:
I think most importantly, you should be arranging backups somewhere other than that server ASAP. Those power losses could be causing some nasty data corruption.

Also, not to give you too hard of a time, but you have all 19 spinners in a single RAID 5? That is a recipe for problems all by itself...

The data on the server is not critical; it's just my home server with mostly video content. I had the drives in a RAID 6 previously, but stepped down to a RAID 5 because I needed the extra space :)
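
For reference, the space trade-off works out roughly as below. This is a small Python sketch that assumes all 19 drives sit in one parity group and ignores formatting and controller overhead; `usable_tb` is a hypothetical helper name, not anything from the LSI tools.

```python
def usable_tb(drives: int, size_tb: float, parity_drives: int) -> float:
    """Usable capacity of one parity RAID group, ignoring overhead."""
    if drives <= parity_drives:
        raise ValueError("need more drives than parity drives")
    return (drives - parity_drives) * size_tb


raid5 = usable_tb(19, 2.0, parity_drives=1)  # RAID 5: one drive's worth of parity
raid6 = usable_tb(19, 2.0, parity_drives=2)  # RAID 6: two drives' worth of parity

print(raid5, raid6, raid5 - raid6)  # 36.0 34.0 2.0
```

So the step down from RAID 6 to RAID 5 buys one drive's worth of space (2 TB) and gives up the ability to survive a second drive failure.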
 

smidley

izx said:
Did you notice any issues in the BMC/IPMI System Event Log?

The controller keeps warning about timeouts and resets on the last port (12-15) with reference to SAS address 0x50014380085EB6C7. Is that the problem disk (most probably), or the onboard expander?

What manufacturer/model is the problem disk?

I found out that my power supply is probably the issue. Its fan isn't spinning, and I'm guessing it overheats when the server draws a lot of power.