A WD Red failure that could have ended very badly

slatfats

Member
Nov 3, 2014
42
15
8
Brisbane, Australia
Mods - I wasn't sure which section to put this in, so please move this if you feel it should be somewhere else.

Bear with me for a bit of a long story...

This morning started like any other. After having breakfast with my wife and son, I went downstairs to the garage to feed our dogs. The garage is also where my fileserver lives (in a Supermicro SC933 chassis), and when I got there, I noticed it was making a beeping sound, like an error beep. It was a constant beep, but then after a random length of time it would stop, and then after another random length of time it would start again. I got a bit closer and noticed that, when the beeping would stop, it sounded like a short circuit. The fans were all still spinning OK, the HDD status LEDs were flashing as expected (only 8 of my 18 drives were actually mounted; those eight were flashing and the others were all dark, as expected) and the NIC lights were all flashing normally, so I didn't know what was going on - but I knew something wasn't right.

I fed the dogs, came upstairs, told my wife that the server wasn't working right so I would need to shut it down before I went off to work, and there would be no Paw Patrol for my son today :-( The Intel BMC was reporting that everything was fine - voltages OK, fans spinning normally. ESXi shut all my VMs down fine, and the host shut down OK as well, so I really had no idea what was going on.

Fast forward a few hours: I got home from work and tried to fire her up again. This time I was met with a constant beep, fans spinning, NIC lights flashing but no HDD LEDs flashing.
A few weeks ago I had seen some transport errors, so I suspected that one of my HBAs might have died. I started there and pulled them. Fired her up, same beep, so I shut it down again. Had my motherboard died? But then a thought - had my backplane died? I pulled the power for the backplane, fired the server up, and the motherboard and CPU fans all powered up fine, with no beep. OK - must be the backplane.

So I start pulling the drives one-by-one, to be sure it's the backplane and not a drive. In hindsight, I should have pulled all the drives first, to prevent any data loss!
The SC933 has 15 drive bays, and I started at number 15. Pulled each drive one by one and swapped out old screws for new Supermicro 3.5" screws (I had been meaning to do this for a couple of months...).

I got all the way to the last drive, and then I was greeted by this:




Burnout. That was a 3TB WD Red drive. It had been in service pretty much constantly since June 2013... So in many ways this was a worse failure than I originally expected! Financially, this is at the low end of what I was expecting, but our losses could have potentially been huge.

In my 25 years of building computers, this is the first time I've seen a burnout.

Even with the drives removed, the backplane still beeps, so I assume it's a write-off. Until I have a replacement backplane I won't know whether I've lost any data, so it's off to eBay I go.

Morals of the story:
1. Keep a smoke detector near your labs. I had one near mine at our last house, but not since we moved. I'll always keep one nearby now.
2. Always have off-site backup. For us, all of our irreplaceable data is backed up to at least two cloud services, so if we have lost anything it'll only be media that we can always download again.
3. Clean your server regularly. I have no idea what caused this burnout, but it's pretty obvious from my photos that I'm not great at keeping my server dust-free. Perhaps some kind of build-up caused a short, I have no idea. Either way, it can't hurt to keep it clean.

For me, it's time to go backplane and HDD shopping... maybe it's time for a completely new chassis!
 

nthu9280

Well-Known Member
Feb 3, 2016
1,628
498
83
San Antonio, TX
Wow. Hope you have good backup and no data loss.
From the pics, the fine dust build-up through the chassis grille next to the failed drive could be the culprit.

Not sure you need a new chassis (want one, maybe?)

 

MiniKnight

Well-Known Member
Mar 30, 2012
3,072
973
113
NYC
It makes you wonder if there was an arc between the PCB and the side of the backplane.
 

Evan

Well-Known Member
Jan 6, 2016
3,346
598
113
I've seen plenty of burnt-out components, but for a drive that's a lot worse than anything I think I've seen before.
 

alex_stief

Well-Known Member
May 31, 2016
884
312
63
38
I can't figure it out from the posted images: could the bottom of the HDD be touching the case?
 

whitey

Moderator
Jun 30, 2014
2,766
868
113
41
I am a bi-annual 'blow the server/chassis/components out' type of guy; an air compressor at low psi (40-60) with a trigger/squeeze air chuck has never failed me yet.

Feel bad for ya, that looks nasty! Hope the data is intact and recoverable, or that you have sufficient replicas/backups offsite or in a different chassis/dataset.
 

mrkrad

Well-Known Member
Oct 13, 2012
1,244
52
48
Man, I had a PoE switch go up in smoke too! It might be worthwhile to invest in what the 3D printer folks keep around, a portable fire extinguisher, rather than have a house fire!
 

pricklypunter

Well-Known Member
Nov 10, 2015
1,709
517
113
Canada
Reminds me of critter damage I saw in a switch mode power supply one time :D
In this case, though, I suspect either the main servo controller developed a short or a cap shorted. Obviously it gobbled enough current to roast the tracks and burn the laminate :)
 

alex1002

Member
Apr 9, 2013
519
19
18
To me it looks like a power failure on the PCB. Perhaps the NAS overheated?
