Cluster node resurrection - has anyone seen this?


Patrick

Administrator
Staff member
Dec 21, 2010
12,516
5,811
113
I was at the HE2 datacenter yesterday after getting a few alerts that made it sound like a node was failing.

The particular node had 2x 240GB consumer SSDs (ZFS-mirrored OS boot drives), 2x 800GB SanDisk CloudSpeed Ultras, and 2x 4TB hard drives. The latter four drives were being used in the Ceph cluster.

This is one of the nodes where I had tested pulling the consumer drives just to ensure it would still boot if one failed. Yesterday it appears one actually failed and stopped the machine from booting fully.

It is an E5 V3 system but has a very slow POST/boot, so I spent a bit of time trying to coax it back to life, waiting about 15 minutes each attempt for the system to come back online. That gave me time to pull a Dell 8132 switch out of the colo and do a bunch of re-wiring. After multiple reboots, moving drives to different slots, and so on, it was still not working. I decided I would do a formal removal from the cluster, install new boot drives, then rejoin it, but I had too many meetings yesterday to do so.

This morning I logged into the Proxmox cluster... and that node was up! It is showing that the SSD failed:
[Screenshot attachment: upload_2015-11-17_6-14-28.png]

However the node is back up and running a few test VMs. It was certainly down for over three hours before it came back online.

Has anyone ever seen a node resurrection like this?
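For anyone chasing something similar, a quick way to confirm whether a mirrored boot pool is limping along on one disk is to ask ZFS directly. A minimal sketch, assuming a Proxmox-style root pool named rpool (the pool name and device paths are assumptions, not details from this node):

```shell
# Show only unhealthy pools; "all pools are healthy" means the mirror is intact.
zpool status -x

# List any mirror members that have dropped out (DEGRADED/FAULTED/UNAVAIL).
zpool status rpool | grep -E 'DEGRADED|FAULTED|UNAVAIL'

# Once a replacement disk is installed, resilver onto it
# (device names below are placeholders):
# zpool replace rpool /dev/disk/by-id/old-ssd /dev/disk/by-id/new-ssd
```

A degraded mirror will normally still boot, which is exactly what the pull-a-drive test is meant to verify; the grep just makes the failed member obvious at a glance.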
 

Biren78

Active Member
Jan 16, 2013
550
94
28
I've never seen that. A few years ago we had a blade server from a big vendor that we thought was down. It turned out a dead 10K spinner had frozen one node and then all the blades in that chassis.
 

TuxDude

Well-Known Member
Sep 17, 2011
616
338
63
I haven't seen exactly that, but I have seen bad drives cause a server to take FOREVER to boot yet still eventually come up, which sounds pretty much like what you had here. In my case it just never happened to be on a clustered node.
 

Patrick

Administrator
Staff member
TuxDude said: "I haven't seen exactly that - but I have seen bad drives cause a server to take FOREVER but still eventually successfully boot, which sounds pretty much like what you had here. Just in my cases it never happened to be on a clustered node."
I am actually going to replace it with a drive from a different vendor since I now do not trust those drives.
 

Fritz

Well-Known Member
Apr 6, 2015
3,386
1,387
113
70
I had something similar happen to me. Had a server go down in the middle of the night. Heard it go silent. Got up and tried to restart it and it was deader than a doornail. Next morning it was back up and running. Turned out to be a short circuit in the meter box. It dropped one leg only, so it wasn't obvious; as far as I knew I still had power. It's a smart meter, and it called the power company and reported the problem. I was clueless until the sparky guys rang the doorbell. Went out and looked and it was pretty burnt. The connection heated up and opened during the night, then cooled and fused back together. Lucky I didn't suffer any serious damage.
 

Rain

Active Member
May 13, 2013
276
124
43
Proxmox is Linux based, right? So, they're using ZFSOnLinux, correct?

ZFSOnLinux had a nasty bug, fixed recently, where a drive failing in a specific way would cause a strange kernel lockup. It actually happened to me not long ago. A Seagate 3TB drive failed (getting ready to rid myself of these 3TB desktop drives, ugh) and the machine locked up. It didn't cause a kernel panic, but it started spewing "CPU Core# Stuck" error messages. I tried to reboot at the local console and it didn't want to do anything; issuing "reboot" and "poweroff" as root did nothing, so I had to hold the power button. I threw the drive into a test machine and Linux started throwing kernel errors non-stop that seemed to suggest the drive was spewing data at the machine despite not getting any requests for it.

Did you watch the node POST after restarting it, or did you issue an ACPI reboot over IPMI and assume it happily rebooted? It's possible it was just stuck powered on and not actually rebooting -- a hard reset would have been required, not an OS-level reset. Check what version of ZoL your Proxmox installations have. I think the bug was found in 0.6.3.x or 0.6.4.x but affected earlier versions; it's definitely fixed in 0.6.5.x.
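Checking the running ZoL version takes a few seconds. A sketch, assuming the zfs kernel module is loaded (these paths are standard for ZoL but worth verifying on your install):

```shell
# The loaded zfs module reports its version under sysfs.
cat /sys/module/zfs/version

# Or query the installed module's metadata and pull out just the version field.
modinfo zfs | awk '$1 == "version:" { print $2 }'
```

Anything below 0.6.5.x would be worth upgrading before trusting it through another drive failure.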
 

Patrick

Administrator
Staff member
I did actually watch it reboot @Rain and even did a hard reboot. That might have been the issue though.