Big Hardware Failure

Patrick · May 18, 2020

Looks like we just suffered a double-disk failure on one of the Proxmox hosting nodes (dual E5-2698 V4 box.)

We had an OS SSD die but because of COVID-19, it never got replaced. Today, it looks like the second drive died.

I may actually head to the Fremont data center today.

ari2asem · May 18, 2020

fingers crossed it ends with those 2 disks.

last week during a windows update my 4 hot-spares all died together after restarting for the update.

4 seagate 2.5 inch hdd, 2tb each. no data lost. but 4 disks are gone. out of warranty

PigLover · May 18, 2020

I know Covid-19 creates a special case - but this is why I always insisted operations staff treat yellow alarms (loss of redundancy) with the same urgency as a service affecting outage.
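To make that policy concrete, here is a minimal Python sketch. It assumes a hypothetical `Mirror` record and made-up alarm labels ("green", "yellow", "red"); it is not taken from any real monitoring tool or from the setup discussed in this thread. The point is only that a degraded mirror and a dead mirror land in the same response bucket.

```python
from dataclasses import dataclass


@dataclass
class Mirror:
    """Hypothetical view of a mirrored boot pool, for illustration only."""
    name: str
    total_disks: int
    healthy_disks: int


def alarm_color(mirror: Mirror) -> str:
    """green = fully redundant, yellow = redundancy lost, red = no disks left."""
    if mirror.healthy_disks <= 0:
        return "red"
    if mirror.healthy_disks < mirror.total_disks:
        return "yellow"
    return "green"


def response_priority(mirror: Mirror) -> str:
    """Treat yellow exactly like red: anything that is not green is urgent."""
    return "routine" if alarm_color(mirror) == "green" else "urgent"


if __name__ == "__main__":
    # One of two mirrored OS drives has died: still serving, but no redundancy.
    rpool = Mirror(name="rpool", total_disks=2, healthy_disks=1)
    print(alarm_color(rpool), response_priority(rpool))
```

With one of two drives gone, the sketch reports a yellow alarm but still returns the urgent priority, which is the behavior the paragraph above argues for.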

Hope it's an easy recovery and not something that we'll read about later in the "mishaps" thread.

urbanracer34 · May 18, 2020

Hoping for the best!

Patrick · May 18, 2020

Forums likely going down in a bit as well as a few other bits. I think this may end up being an opportunity for a chassis swap.

Patrick · May 18, 2020

I think we are going to do a chassis swap this evening and a backup/restore or two. We are probably going to take down the forums just to be careful. Likely will start in 30-60 min.

Patrick · May 18, 2020

Ok. Forums should be back online now. That was scary/not fun. More on what happened later.

pricklypunter · May 18, 2020

I'm glad all went well

Patrick · May 18, 2020

pricklypunter said:
I'm glad all went well

I am not sure "well" is how I would describe it.

TBH - on the drive home I was reflecting on how different this was to the first time I had a Proxmox failure. Again, no fault of Proxmox here. Two mirrored SSDs failed. The difference this time was it became an exercise in trying to figure out what happened, how to remedy the hardware situation, and how to get the forums restored.

It gave me the idea that maybe I should do some content on this kind of stuff. There is a lot of content out there on how to burn the ISO and get started. There is less on fixing things when they go bad.

The good news is that the forums are serving pages and apparently writing to the database so that is a good sign.

Patrick · May 18, 2020

https://twitter.com/i/web/status/1262601498426404865

PigLover · May 18, 2020

Very interested in this write up. Looking forward to it. Hope it includes your thoughts about why two high quality drives would fail so near in time to each other.

nasi · May 19, 2020

Patrick said:
maybe I should do some content on this kind of stuff

That would be great! I'm new to this topic but I often read it's fine to install the Proxmox OS on only 1 disk, with VMs and stuff on a separate RAID. But what do I do when this single disk with the OS fails? What is the recommended backup solution?

i386 · May 19, 2020

nasi said:
But what do I do when this single disk with OS fails?

Replace it.

nasi said:
What is the recommended backup solution?

(In general, not limited to proxmox as the os)
At home (or other places where downtime is not a problem) I would just install a fresh version and try to run the services/vms.
For production deployments you should have a tested and evaluated strategy which you would use for such a case.

vl1969 · May 19, 2020

This is funny, but I just went through a similar, but not as dire, experience with my home server. And I am looking for a good step-by-step how-to on what to do when the whole setup fails.

Now, my setup did not fail completely.
I had one SSD from the mirror die.

But it was an adventure to get it all back to normal with the pandemic and all the other crazy stuff. But also because my MB is screwy when it comes to booting devices.

And now my primary concern is:

1. How to plan for the next time?
2. How to upgrade Proxmox to the latest version without losing the current setup.

Last time I had to upgrade, from 5.1 to 5.3, I lost all VMs and had to redo them by hand. Luckily all my data is on separate pools and disks so that stayed put.
Now how to move from 5.3 to 6 safely?

Patrick · May 19, 2020

@nasi - Having a double-disk failure on the mirrored rpool drives is effectively like a single OS disk failing in Proxmox. Great point.

@vl1969 - I have seen that too. It is much better in the newer versions of Proxmox than the initial ZFS rpool booting days. Interestingly enough, we upgraded probably 20+ nodes to Proxmox VE 6 and none had an issue with do-release-upgrades. Proxmox VE 5.4 is no longer supported I believe as of this summer so that is something to consider.

The harder upgrade from 5 to 6 was doing the clusters that use Ceph as well. That was a "follow the upgrade instructions exactly" case.

@PigLover I think this is going to be a main site piece. You are right. The S3610s are high-quality drives so I was not anticipating a failure like this.

The node itself is one of the next to get replaced as one of the remaining Xeon E5 V4 nodes (E5-2699 V4.) It was going to be retired making way for next-gen servers. I was actually hoping we could do Cascade + DCPMMs on the replacement node and run databases directly from DCPMM instead of PCIe Optane.

PigLover · May 19, 2020

On the upside, this outage was largely transparent to your users (at least it appeared so to me). Your need for a brief outage to replace the broken server could have just been announced as "routine maintenance" and nobody would have been the wiser. Having reliability at the service level (what the users see) even in the face of rather catastrophic faults in the infrastructure is the objective you want to strive for. There are some things to be proud of in this.

You've come a long way from when you lost those Micron SSDs in the early days

STH is rather unique in your willingness to share openly what is happening behind the scenes. Even the ugly. Which is a big part of why I find this place so interesting to hang around. TY.

Patrick · May 19, 2020

PigLover said:
You've come a long way from when you lost those Micron SSDs in the early days

STH is rather unique in your willingness to share openly what is happening behind the scenes. Even the ugly. Which is a big part of why I find this place so interesting to hang around. TY.

Much different days for sure!

In comparison, this was a bit of a pain, but most of that was due to how slow servers boot rather than anything else.

I asked about the last time I was servicing the hosting racks and it was ~18 months ago (mid-Dec 2018.) A big part of that is changing strategy for hosting.

Another difference is that instead of rushing to get a piece up on this today, I am trying to make a better piece with more useful thoughts. Certainly going in with a well-defined triage plan helps since it becomes just executing rather than the emotional void of not knowing how to remedy. It meant while I was working on this I was also taking notes of what could have been better.

We are big enough now that I did at least pause and think if Intel might be unhappy if I post pictures of their drives and point to them as a point of failure. At the same time, I think you are right. It was a pretty quick decision that we were going to share this. It would be weird to talk about storage, redundancy, and backups and never talk about failures. If these machines never failed, infrastructure would look a lot different.

PigLover · May 19, 2020

On a side note - just noticed the enhanced "likes" in the forum update. While I "loved" your post, the emoji that goes with it feels a bit creepy for a tech site.

Patrick · May 26, 2020

@PigLover and all, have a little writeup about this:

https://www.servethehome.com/9-step-calm-and-easy-proxmox-ve-boot-drive-failure-recovery/

Also, the worst-performing video we have had in a while in terms of views (but actually watch time is very good.)

pricklypunter · May 26, 2020

Great video bud
