Hardware Failures in 2020 - Post yours!

Patrick

Administrator
Staff member
Here is a new 2020 thread, since the last one we had was from 2017 (and lived on for years).
Here is a link to the Hardware Failures in 2017 thread.

Still do not have a good picture yet, but one of the Intel 480GB SSDs mirroring the rpool on a STH hosting server is not doing well at all!
[Attachment: Intel 480GB SSD Failing.JPG]
 

Cixelyn

Researcher
Had a 10GbE NIC die on a SYS-E300-9D-4CN8TP last week. First time I think I've ever actually seen a NIC die (without human assistance)...
 

pinkanese

New Member
I had a similar failure a month or so back. My home server has a mirrored rpool and one of the Intel drives went out, though mine are ancient 40GB Intel 320s :p Fortunately I had a spare.
 

Dave Corder

Active Member
Had a Gigabyte GA-7PESH2 motherboard die on me a month or two ago. I just woke up one morning and it wouldn't power on. It was my whitebox Proxmox server. I found a Dell R720 on Craigslist locally and picked that up to replace it. It came with more RAM but lower-core-count CPUs, so I moved my CPUs over and sold off my old RAM and a few other misc parts.
 

mbosma

Member
We've had 3 of these Supermicro X11DPi-NT boards fail now; all seem to have issues in the VRM area. It appears to be tied to hardware revision 1.21; the replacement boards were revision 2.0.

I still have a bunch of the revision 1.21 boards in production. Let's hope they last longer than these ones did.
[Attachment: IMG-20191212-WA0007.jpeg]
 

mbosma

Member
> @mbosma - I love Supermicro chassis, was never a fan of their boards though. Do you know what their SAS3 backplanes are like reliability-wise?
So far the failures with Supermicro have been pretty limited: no excessive number of failures apart from these cases. I have never had motherboards just die like this except for this specific model, and I've deployed loads of Supermicro servers without ever having issues with backplanes.
 

Catalyze

New Member
I had two 860 EVOs, a 120GB Inland SSD, a 3TB WD Blue, and a 3TB Barracuda all die in a single day, plus a 12TB WD Red out of an EasyStore come DOA. Six drives total in one day, with around 4.5TB of lost data. About 250GB of that was irreplaceable.
 

Catalyze

New Member
> Now that's terrible luck! :mad: Were these all in the same box?
Nope. The SSDs and the hard drives were in two different machines. I was actually able to rebuild the array after each 3TB drive died, but then Unraid decided to wipe the rebuilt drives because they had a ZFS file system on them and it did not know what to do with that. Bear in mind that was after it had already rebuilt the array onto those drives.
 

Indecided

Active Member
> So far the failures with Supermicro have been pretty limited: no excessive number of failures apart from these cases. I have never had motherboards just die like this except for this specific model, and I've deployed loads of Supermicro servers without ever having issues with backplanes.
Agreed, until we deployed two 2U TwinPro2 servers: out of 8 nodes, we've had 2 failures already. They certainly do not look like they've lived a hard life previously, so I struggle to figure out why and how. The kicker is that replacement nodes are tough to come by on eBay so far, so those slots sit empty until we can work out whether we can stuff some upgraded nodes into the chassis.
 

edge

Member
This happened in 2014, but it was such a colossal disaster that I feel the need to share it here. At the time, I worked as a presales consultant for a large OEM and my clients were predominantly Fortune 50.

One of them had just changed their IT management in the Northeast, and the new top dog wanted to do a catastrophe-recovery test. His idea was to completely power down one of the DCs out on Route 78 in NJ. I and every other OEM consultant advised him against it.

We argued: you are attached to two grids, you have enough battery for 12 hours and enough on-site diesel stored for your generators for a week; your DC is specifically designed to never have to power down. His response was: "That is exactly why we need to see it come up from a complete power down!"

The Thursday before the Friday-night power down, I went through that DC with one of my storage SEs and a server SE. I wasn't worried about the servers. I asked the storage SE his thoughts; his reply was: we just did a complete refresh and our oldest cabinet is two years old, so we shouldn't lose too many spindles. I looked over to the mainframe corner and asked him: "Those are the same cabinets I saw here in 2005, aren't they?" He replied: "Yep, they just swap in new drives as they fail."

They went ahead with the power down. We lost two RAIDs across forty-plus storage systems (~6,000 HDDs) due to multiple spindle spin-up failures. 24 of 32 mainframe cabinets lost all their arrays to spin-up failure: shut down a spinner that has been spinning for 10 years and the odds are against you.

Two weeks later I met the new Northeast IT director.

What prompted this memory is reading about members here spinning down disks at night to save energy. It just touches a nerve in me. I love solid state, almost.

Don't talk to me about tape, in any format.
 

Wasmachineman_NL

Dell Precisions FTW!
i7-740QM I bought locally: DOA
Core 2 Duo T7600G I bought on eBay: chipped die
A Precision M6500: mobo and GPU dead because of a short circuit in the GPU. Twice.
Crosshair VII bottom M.2 slot: torn-off M.2 mounting boss, because lol aluminum
 

vangoose

Active Member
Just found that my HGST SN260 6.4TB NVMe drive has died. I hadn't used it for a while; I put it in one server to test iSER and it's not recognized, and the card is cold to the touch.
 