Mods - I wasn't sure which section to put this in, so please move this if you feel it should be somewhere else.
Bear with me for a bit of a long story...
This morning started like any other. After having breakfast with my wife and son, I went downstairs to the garage to feed our dogs. The garage is also where my fileserver lives (in a Supermicro SC933 chassis), and when I got there, I noticed it was making a beeping sound, like an error beep. It was a constant beep, but then after a random length of time it would stop, and then after another random length of time it would start again. I got a bit closer and noticed that, when the beeping would stop, it sounded like a short circuit. The fans were all still spinning OK, the HDD status LEDs were flashing as expected (only 8 of my 18 drives were actually mounted; those eight were flashing and the others were all dark, as expected) and the NIC lights were all flashing normally, so I didn't know what was going on - but I knew something wasn't right.
I fed the dogs, came upstairs, told my wife that the server wasn't working right so I would need to shut it down before I went off to work, and there would be no Paw Patrol for my son today :-( The Intel BMC was reporting that everything was fine - voltages OK, fans spinning normally. ESXi shut all my VMs down fine, and the host shut down OK as well, so I really had no idea what was going on.
Fast forward a few hours: I got home from work and tried to fire her up again. This time I was met with a constant beep, fans spinning, NIC lights flashing but no HDD LEDs flashing.
A few weeks ago I had seen some transport errors, so I suspected that one of my HBAs might have died. I started there and pulled them. Fired her up, same beep, so I shut it down again. Had my motherboard died? But then a thought - had my backplane died? I pulled the power for the backplane, fired the server up, and the motherboard and CPU fans all powered up fine, with no beep. OK - must be the backplane.
So I started pulling the drives one-by-one, to be sure it was the backplane and not a drive. In hindsight, I should have pulled all the drives first, to prevent any data loss!
The SC933 has 15 drive bays, and I started at number 15. Pulled each drive one by one and swapped out old screws for new Supermicro 3.5" screws (I had been meaning to do this for a couple of months...).
I got all the way to the last drive, and then I was greeted by this:
Burnout. That was a 3TB WD Red drive. It had been in service pretty much constantly since June 2013... So in many ways this was a worse failure than I originally expected! Financially, this is at the low end of what I was expecting, but our losses could have potentially been huge.
In my 25 years of building computers, this is the first time I've seen a burnout.
Without the drives the backplane still beeps, so I assume it's a write-off. Until I have a replacement backplane I can't tell whether I've lost any data, so it's off to eBay I go.
Morals of the story:
1. Keep a smoke detector near your labs. I had one near mine at our last house, but not since we moved. I'll always keep one nearby now.
2. Always have off-site backup. For us, all of our irreplaceable data is backed up to at least two cloud services, so if we have lost anything it'll only be media that we can always download again.
3. Clean your server regularly. I have no idea what caused this burnout, but it's pretty obvious from my photos that I'm not great at keeping my server dust-free. Perhaps some kind of build-up caused a short. Either way, it can't hurt to keep it clean.
For me, it's time to go backplane and HDD shopping... maybe it's time for a completely new chassis!