Big Hardware Failure

Patrick

Administrator
Staff member
Dec 21, 2010
11,906
4,868
113
Looks like we just suffered a double-disk failure on one of the Proxmox hosting nodes (dual E5-2698 V4 box.)

We had an OS SSD die but because of COVID-19, it never got replaced. Today, it looks like the second drive died.

I may actually head to the Fremont data center today.
 

ari2asem

Active Member
Dec 26, 2018
504
81
28
The Netherlands, Groningen
Fingers crossed it ends with those 2 disks.

Last week, during a Windows update, my 4 hot spares all died together after the restart.

4 Seagate 2.5-inch HDDs, 2 TB each. No data lost, but 4 disks are gone, out of warranty :mad::mad::mad:
 

PigLover

Moderator
Jan 26, 2011
2,964
1,271
113
I know Covid-19 creates a special case - but this is why I always insisted operations staff treat yellow alarms (loss of redundancy) with the same urgency as a service-affecting outage.

Hope it's an easy recovery and not something that we'll read about later in the "mishaps" thread :)
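
For anyone who wants to automate the "yellow alarm" rule above, here is a minimal sketch, assuming the pools are ZFS and that `zpool list -H -o name,health` is available on the host. The function names are hypothetical and this is not how STH monitors its nodes. The idea is simply to treat any pool state other than ONLINE (DEGRADED included) as urgent.

```python
import subprocess


def parse_pool_health(listing):
    """Parse `zpool list -H -o name,health` output into {pool_name: health}."""
    pools = {}
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        pools[fields[0]] = fields[1]
    return pools


def pools_needing_attention(pools):
    """Return pools that are not ONLINE; a DEGRADED mirror counts as urgent."""
    return sorted(name for name, health in pools.items() if health != "ONLINE")


def read_pool_listing():
    """Run zpool on the local host (assumes ZFS tools are installed)."""
    result = subprocess.run(
        ["zpool", "list", "-H", "-o", "name,health"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

On a real host you would pass `read_pool_listing()` into `parse_pool_health()` from cron or a monitoring agent and page someone whenever `pools_needing_attention()` returns anything, rather than waiting for the second disk to go.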
 

Patrick

Administrator
Staff member
Dec 21, 2010
11,906
4,868
113
Forums likely going down in a bit as well as a few other bits. I think this may end up being an opportunity for a chassis swap.
 

Patrick

Administrator
Staff member
Dec 21, 2010
11,906
4,868
113
I think we are going to do a chassis swap this evening and a backup/restore or two. We are probably going to take down the forums just to be careful. Likely will start in 30-60 min.
 

Patrick

Administrator
Staff member
Dec 21, 2010
11,906
4,868
113
I'm glad all went well :)
I am not sure "well" is how I would describe it.

TBH - on the drive home I was reflecting on how different this was to the first time I had a Proxmox failure. Again, no fault of Proxmox here. Two mirrored SSDs failed. The difference this time was it became an exercise in trying to figure out what happened, how to remedy the hardware situation, and how to get the forums restored.

It gave me the idea that maybe I should do some content on this kind of stuff. There is a lot of content out there on how to burn the ISO and get started. There is less on fixing things when they go bad.

The good news is that the forums are serving pages and apparently writing to the database so that is a good sign.
 

PigLover

Moderator
Jan 26, 2011
2,964
1,271
113
Very interested in this write up. Looking forward to it. Hope it includes your thoughts about why two high quality drives would fail so near in time to each other.
 

nasi

New Member
Feb 25, 2020
20
4
3
maybe I should do some content on this kind of stuff
That would be great! I'm new to this topic, but I often read that it's fine to install the Proxmox OS on only 1 disk, with VMs and stuff on a separate RAID. But what do I do when this single disk with the OS fails? What is the recommended backup solution?
 

i386

Well-Known Member
Mar 18, 2016
1,935
504
113
31
Germany
But what do I do when this single disk with OS fails?
Replace it.
What is the recommended backup solution?
(In general, not limited to Proxmox as the OS)
At home (or other places where downtime is not a problem) I would just install a fresh version and try to run the services/VMs.
For production deployments you should have a tested and evaluated strategy which you would use for such a case :D
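
As one small piece of such a strategy, here is a rough sketch of archiving host configuration so a fresh install can be rebuilt faster. It assumes the interesting files live under paths like `/etc/pve` and `/etc/network/interfaces`; the function name and destination are hypothetical, and it does not cover VM disks, which need their own backups.

```python
import tarfile
import time
from pathlib import Path


def archive_paths(paths, dest_dir, label="host-config"):
    """Write a timestamped .tar.gz of the given paths into dest_dir.

    Paths that do not exist are skipped. Returns the archive path.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest_dir / f"{label}-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for path in map(Path, paths):
            if path.exists():
                # Store entries relative to the root, without a leading slash.
                tar.add(path, arcname=path.as_posix().lstrip("/"))
    return archive
```

A call such as `archive_paths(["/etc/pve", "/etc/network/interfaces"], "/mnt/backup")` (the destination is an assumption) only helps if the archive lands on a different disk than the OS, and if you have actually tried restoring from it.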
 

vl1969

Active Member
Feb 5, 2014
607
68
28
This is funny, but I just went through a similar, though not as dire, experience with my home server. And I am looking for a good step-by-step how-to on what to do when the whole setup fails.

Now, my setup did not fail completely.
I had one SSD from the mirror die.

But it was an adventure to get it all back to normal with the pandemic and all the other crazy stuff, and also because my MB is screwy when it comes to boot devices.

And now my primary concern is:

1. How to plan for the next time?
2. How to upgrade Proxmox to the latest version without losing the current setup?

Last time I had to upgrade, from 5.1 to 5.3, I lost all VMs and had to redo them by hand. Luckily all my data is on separate pools and disks, so that stayed put.
Now, how to move from 5.3 to 6 safely?
 

Patrick

Administrator
Staff member
Dec 21, 2010
11,906
4,868
113
@nasi - Having a double-disk failure on the mirrored rpool drives is effectively like a single OS disk failing in Proxmox. Great point.

@vl1969 - I have seen that too. It is much better in the newer versions of Proxmox than the initial ZFS rpool booting days. Interestingly enough, we upgraded probably 20+ nodes to Proxmox VE 6 and none had an issue with do-release-upgrades. Proxmox VE 5.4 is no longer supported I believe as of this summer so that is something to consider.

The harder upgrade from 5 to 6 was doing the clusters using Ceph as well. That was a case of following the upgrade instructions exactly.

@PigLover I think this is going to be a main site piece. You are right. The S3610's are high-quality drives so I was not anticipating a failure like this.

The node itself is one of the next to get replaced as one of the remaining Xeon E5 V4 nodes (E5-2699 V4.) It was going to be retired making way for next-gen servers. I was actually hoping we could do Cascade + DCPMMs on the replacement node and run databases directly from DCPMM instead of PCIe Optane.
 

PigLover

Moderator
Jan 26, 2011
2,964
1,271
113
On the upside, this outage was largely transparent to your users (at least it appeared so to me). Your need for a brief outage to replace the broken server could have just been announced as "routine maintenance" and nobody would have been the wiser. Having reliability at the service level (what the users see) even in the face of rather catastrophic faults in the infrastructure is the objective you want to strive for. There are some things to be proud of in this.

You've come a long way from when you lost those Micron SSDs in the early days :)

STH is rather unique in your willingness to share openly what is happening behind the scenes. Even the ugly. Which is a big part of why I find this place so interesting to hang around. TY.
 

Patrick

Administrator
Staff member
Dec 21, 2010
11,906
4,868
113
You've come a long way from when you lost those Micron SSDs in the early days :)

STH is rather unique in your willingness to share openly what is happening behind the scenes. Even the ugly. Which is a big part of why I find this place so interesting to hang around. TY.
Much different days for sure!

In comparison, this was a bit of a pain, but most of that was due to how slow servers boot rather than anything else.

I asked about the last time I was servicing the hosting racks and it was ~18 months ago (mid-Dec 2018.) A big part of that is changing strategy for hosting.

Another difference is that instead of rushing to get a piece up on this today, I am trying to make a better piece with more useful thoughts. Certainly going in with a well-defined triage plan helps since it becomes just executing rather than the emotional void of not knowing how to remedy. It meant while I was working on this I was also taking notes of what could have been better.

We are big enough now that I did at least pause and think if Intel might be unhappy if I post pictures of their drives and point to them as a point of failure. At the same time, I think you are right. It was a pretty quick decision that we were going to share this. It would be weird to talk about storage, redundancy, and backups and never talk about failures. If these machines never failed, infrastructure would look a lot different.