Cluster node resurrection - has anyone seen this?


Patrick

Administrator
Staff member
Dec 21, 2010
12,516
5,811
113
I was at the HE2 datacenter yesterday after getting a few alerts that made it sound like a node was failing.

The particular node had 2x 240GB consumer SSDs (ZFS-mirrored OS boot drives), 2x 800GB SanDisk CloudSpeed Ultras, and 2x 4TB hard drives. The latter four drives were being used in the Ceph cluster.

This is one of the nodes where I had tested pulling the consumer drives just to ensure it would still boot if one failed. Yesterday it appears one actually failed and stopped the machine from booting fully.

It is an E5 V3 system but has a very slow POST/boot, so I spent a bit of time trying to coax it back to life, waiting about 15 minutes each attempt for the system to come back online. That gave me time to pull a Dell 8132 switch out of the colo and do a bunch of re-wiring. After multiple reboots, moving drives to different slots, and so on, it was still not working. I decided I would do a formal removal from the cluster, install new boot drives, then rejoin it, but I had too many meetings yesterday to do so.

This morning I logged into the Proxmox cluster... and that node was up! It is showing that the SSD failed:
[Screenshot attachment: upload_2015-11-17_6-14-28.png]

However the node is back up and running a few test VMs. It was certainly down for over three hours before it came back online.

Has anyone ever seen a node resurrection like this?
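For anyone chasing something similar, a quick way to confirm whether a mirrored boot pool is limping along on one disk is to ask ZFS directly. A minimal sketch, assuming a Proxmox-style root pool named rpool (the pool name and device paths are assumptions, not details from this node):

```shell
# Show only unhealthy pools; "all pools are healthy" means the mirror is intact.
zpool status -x

# List any mirror members that have dropped out (DEGRADED/FAULTED/UNAVAIL).
zpool status rpool | grep -E 'DEGRADED|FAULTED|UNAVAIL'

# Once a replacement disk is installed, resilver onto it
# (device names below are placeholders):
# zpool replace rpool /dev/disk/by-id/old-ssd /dev/disk/by-id/new-ssd
```

A degraded mirror will normally still boot, which is exactly what the pull-a-drive test is meant to verify; the grep just makes the failed member obvious at a glance.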
 

Biren78

Active Member
Jan 16, 2013
550
94
28
I've never seen that. A few years ago we had a blade server from a big vendor that we thought was down. It turned out a dead 10K spinner had frozen one node and then all the blades in that chassis.
 

TuxDude

Well-Known Member
Sep 17, 2011
616
338
63
I haven't seen exactly that, but I have seen bad drives cause a server to take FOREVER to boot yet still eventually come up, which sounds pretty much like what you had here. In my case it just never happened to be on a clustered node.
 

Patrick

Administrator
Staff member
TuxDude said: "I haven't seen exactly that - but I have seen bad drives cause a server to take FOREVER but still eventually successfully boot, which sounds pretty much like what you had here. Just in my cases it never happened to be on a clustered node."
I am actually going to replace it with a drive from a different vendor since I now do not trust those drives.
 

Fritz

Well-Known Member
Apr 6, 2015
3,386
1,387
113
70
I had something similar happen to me. Had a server go down in the middle of the night. Heard it go silent. Got up and tried to restart it and it was deader than a doornail. Next morning it was back up and running. Turned out to be a short circuit in the meter box. It dropped one leg only, so it wasn't obvious; as far as I knew I still had power. It's a smart meter, and it called the power company and reported the problem. I was clueless until the sparky guys rang the doorbell. Went out and looked and it was pretty burnt. The connection heated up and opened during the night, then cooled and fused back together. Lucky I didn't suffer any serious damage.
 

Rain

Active Member
May 13, 2013
276
124
43
Proxmox is Linux based, right? So, they're using ZFSOnLinux, correct?

ZFSOnLinux had a nasty bug, fixed recently, where a drive failing in a specific way would cause a strange kernel lockup. It actually happened to me not long ago. A Seagate 3TB drive failed (getting ready to rid myself of these 3TB desktop drives, ugh) and the machine locked up. It didn't cause a kernel panic, but it started spewing "CPU Core# Stuck" error messages. I tried to reboot at the local console and it didn't want to do anything; issuing "reboot" and "poweroff" as root did nothing, so I had to hold the power button. I threw the drive into a test machine and Linux started throwing kernel errors non-stop that seemed to suggest the drive was spewing data at the machine despite not getting any requests for it.

Did you watch the node POST after restarting it, or did you issue an ACPI reboot over IPMI and assume it happily rebooted? It's possible it was just stuck powered on and not actually rebooting -- a hard reset would have been required, not an OS-level reset. Check what version of ZoL your Proxmox installations have. I think the bug was found in 0.6.3.x or 0.6.4.x but affected earlier versions; it's definitely fixed in 0.6.5.x.
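Checking the running ZoL version takes a few seconds. A sketch, assuming the zfs kernel module is loaded (these paths are standard for ZoL but worth verifying on your install):

```shell
# The loaded zfs module reports its version under sysfs.
cat /sys/module/zfs/version

# Or query the installed module's metadata and pull out just the version field.
modinfo zfs | awk '$1 == "version:" { print $2 }'
```

Anything below 0.6.5.x would be worth upgrading before trusting it through another drive failure.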
 

Patrick

Administrator
Staff member
I did actually watch it reboot @Rain and even did a hard reboot. That might have been the issue though.