Hardware Failures in 2020 - Post yours!


Patrick

Administrator
Staff member
Dec 21, 2010
12,511
5,792
113
Here is a new 2020 thread, since the last one we had was from 2017 (and it lived on for years).
Here is a link to the Hardware Failures in 2017 thread.

I still do not have a picture yet, but one of the Intel 480GB SSDs mirroring the rpool on an STH hosting server is not doing well at all!
[Attached image: Intel 480GB SSD Failing.JPG]
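For anyone watching a mirror member go sideways like this, the degradation usually shows up in zpool status well before ZFS kicks the drive out entirely. Here is a minimal polling sketch, an illustration only, assuming a host with the standard zpool CLI on the PATH; the pool name rpool matches this post:

```python
#!/usr/bin/env python3
# Sketch: flag an unhealthy ZFS pool via 'zpool status -x'.
# Assumes the standard zpool CLI is on PATH; 'rpool' is the pool from this post.
import subprocess
import sys

def pools_healthy() -> bool:
    # 'zpool status -x' prints "all pools are healthy" when nothing is degraded.
    out = subprocess.run(["zpool", "status", "-x"],
                         capture_output=True, text=True, check=True).stdout
    return "all pools are healthy" in out

if __name__ == "__main__":
    if pools_healthy():
        print("OK: all pools are healthy")
        sys.exit(0)
    # Print the full status so the failing mirror member is visible.
    subprocess.run(["zpool", "status", "rpool"])
    sys.exit(1)
```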
 

Cixelyn

Researcher
Nov 7, 2018
50
30
18
San Francisco
Had a 10GbE NIC die on a SYS-E300-9D-4CN8TP last week. First time I've ever actually seen a NIC die (without human assistance), I think...
 

pinkanese

New Member
Jun 19, 2014
27
10
3
33
I had a similar failure a month or so back. My home server has a mirrored rpool and one of the Intel drives went out, though mine are ancient 40GB Intel 320s :p Fortunately I had a spare.
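For anyone who hasn't done the swap before, getting a spare into a degraded mirror is a one-command resilver. A rough sketch of the sequence follows; the pool name and device paths below are hypothetical placeholders, not from this post:

```python
#!/usr/bin/env python3
# Sketch: replace a failed member of a ZFS mirror with a spare disk.
# All names are hypothetical placeholders: the pool 'rpool', the failed
# device 'old-disk-id', and the spare '/dev/disk/by-id/new-disk-id'.
import subprocess

POOL = "rpool"                             # hypothetical pool name
FAILED = "old-disk-id"                     # device/guid shown by 'zpool status'
SPARE = "/dev/disk/by-id/new-disk-id"      # the replacement drive

# 'zpool replace' attaches the spare, resilvers, then detaches the dead disk.
subprocess.run(["zpool", "replace", POOL, FAILED, SPARE], check=True)

# Check resilver progress; rerun until it reports completion.
subprocess.run(["zpool", "status", POOL], check=True)
```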
 

Dave Corder

Active Member
Dec 21, 2015
290
184
43
41
Had a Gigabyte GA-7PESH2 motherboard die on me a month or two ago. I just woke up one morning and it wouldn't power on. It was my whitebox Proxmox server. I found a Dell 720 on Craigslist locally and picked it up as a replacement - it came with more RAM but fewer CPU cores, so I moved my CPUs over and sold off my old RAM and a few other misc parts.
 
  • Like
Reactions: Pbrizzo

mbosma

Member
Dec 4, 2018
76
58
18
We've had 3 of these Supermicro X11DPi-NT boards fail now; all seem to have issues in the VRM area.
It seems to be tied to hardware revision 1.21; the replacement boards were revision 2.0.

I still have a bunch of these with hardware revision 1.21 in production.
Let's hope they last longer than these ones did.
[Attached image: IMG-20191212-WA0007.jpeg]
 

Dreece

Active Member
Jan 22, 2019
503
160
43
@mbosma - I love Supermicro chassis, but I was never a fan of their boards. Do you know what their SAS3 backplanes are like reliability-wise?
 

mbosma

Member
Dec 4, 2018
76
58
18
@mbosma - I love Supermicro chassis, but I was never a fan of their boards. Do you know what their SAS3 backplanes are like reliability-wise?
So far our failures with Supermicro have been pretty limited - no excessive number of failures apart from these cases.
I've never had motherboards just die like this except for this specific model.
I've deployed loads of Supermicro servers and never had any issues with the backplanes.
 
  • Like
Reactions: T_Minus and Dreece

Catalyze

New Member
Apr 5, 2020
11
11
3
I had two 860 Evos, a 120GB Inland SSD, a 3TB WD Blue, and a 3TB Barracuda all die in a single day, plus a 12TB WD Red out of an Easystore come DOA. Six drives total in one day. Around 4.5TB of lost data; about 250GB of that was irreplaceable.
 

Dreece

Active Member
Jan 22, 2019
503
160
43
I had two 860 Evos, a 120GB Inland SSD, a 3TB WD Blue, and a 3TB Barracuda all die in a single day, plus a 12TB WD Red out of an Easystore come DOA. Six drives total in one day. Around 4.5TB of lost data; about 250GB of that was irreplaceable.
Now that's terrible luck! :mad: Were these all in the same box?
 

Catalyze

New Member
Apr 5, 2020
11
11
3
Now that's terrible luck! :mad: Were these all in the same box?
Nope. The SSDs and the hard drives were in two different machines. I was actually able to rebuild the array after each 3TB drive died, but then Unraid decided it was going to wipe the rebuilt drives because they had a ZFS file system on them and it did not know what to do with that. Bear in mind that was after it had rebuilt the array onto those drives.
 

Indecided

Active Member
Sep 5, 2015
163
83
28
So far our failures with Supermicro have been pretty limited - no excessive number of failures apart from these cases.
I've never had motherboards just die like this except for this specific model.
I've deployed loads of Supermicro servers and never had any issues with the backplanes.
Agreed - until we deployed two 2U TwinPro2 servers. Out of 8 nodes, we've had 2 failures already. They certainly do not look like they lived a hard life previously, so I struggle to figure out why and how. The kicker is that replacement nodes have been tough to come by on eBay so far, so those slots sit empty until we can figure out whether we can stuff some upgraded nodes into the chassis.
 

edge

Active Member
Apr 22, 2013
203
71
28
This happened in 2014, but it was such a colossal disaster that I feel the need to share it here. At that time, I worked as a presales consultant for a large OEM and my clients were predominantly Fortune 50.

One of them had just changed their IT management in the Northeast, and the new top dog wanted to do a catastrophe recovery test. His idea was to completely power down one of the DCs out on Route 78 in NJ. I and every other OEM consultant advised him against it.

We argued: you are attached to two grids, you have enough battery for 12 hours and enough on-site diesel stored for your generators for a week, and your DC is specifically designed never to have to power down. His response was: "That is exactly why we need to see it come up from a complete power down!"

The Thursday before the Friday-night power down, I went through that DC with one of my storage SEs and a server SE. I wasn't worried about the servers. I asked the storage SE his thoughts; his reply was: we just did a complete refresh and our oldest cabinet is 2 years old, so we shouldn't lose too many spindles. I looked over to the mainframe corner and asked him: "those are the same cabinets I saw here in 2005, aren't they?" He replied: "yep, they just swap in new drives as they fail".

They went ahead with the power down. We lost two RAIDs across forty-plus storage systems (~6,000 HDDs) due to multiple spindle spin-up failures. 24 of 32 mainframe cabinets lost all their arrays due to spin-up failures - shut down a spinner that has been spinning for 10 years and the odds are against you.

Two weeks later I met the new Northeast IT director.

What prompted this memory is reading about members here spinning down disks at night to save energy. It just tweaks a nerve in me. I love solid state, almost.

Don't talk to me about tape, in any format.
 

Wasmachineman_NL

Wittgenstein the Supercomputer FTW!
Aug 7, 2019
1,872
617
113
i7-740QM I bought locally: DOA
Core 2 Duo T7600G I bought on eBay: chipped die
A Precision M6500: mobo and GPU dead because of a short circuit in the GPU. Twice.
Crosshair VII bottom M.2 slot: torn off M.2 mounting boss because lol aluminum
 

Wasmachineman_NL

Wittgenstein the Supercomputer FTW!
Aug 7, 2019
1,872
617
113
Oh, and add an ASUS Slot 1 board and a 250W PSU to that list - the PSU caught fire and took out my board. F
 

vangoose

Active Member
May 21, 2019
326
104
43
Canada
Just found out my HGST SN260 6.4TB NVMe drive died. I hadn't used it for a while; I put it in one server to test iSER and it's not recognized. Touching the surface, the card is cold.
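If anyone hits the same symptom, a quick way to confirm whether a card like this still enumerates at all is to scan the PCIe bus. A small sketch, assuming a Linux host with lspci (from pciutils) installed:

```python
#!/usr/bin/env python3
# Sketch: check whether any NVMe device is visible on the PCIe bus.
# Assumes a Linux host with lspci installed; a dead card like the SN260
# described above typically won't enumerate at all.
import subprocess

out = subprocess.run(["lspci"], capture_output=True, text=True, check=True).stdout
nvme_lines = [l for l in out.splitlines() if "Non-Volatile memory controller" in l]

if nvme_lines:
    print("NVMe controllers found:")
    for line in nvme_lines:
        print(" ", line)
else:
    print("No NVMe controller enumerated - card may be dead or unseated.")
```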
 
  • Like
Reactions: Patrick and Marsh