I was at the HE2 datacenter yesterday after getting a few alerts that made it sound like a node was failing.
The particular node had 2x 240GB consumer disks (OS boot drives ZFS mirrored), 2x 800GB SanDisk CloudSpeed Ultras and 2x 4TB hard drives. The latter four drives were being used in the Ceph cluster.
This is one of the nodes where I tested pulling the consumer drives, just to ensure that it would still boot if one failed. Yesterday it appears that one failed and stopped the machine from booting fully.
It is an E5 V3 system but has a very slow POST/boot. So I spent a bit of time trying to coax it back to life, waiting about 15 minutes after each attempt to allow the system to come back online. That gave me time to pull a Dell 8132 switch out of the colo and do a bunch of re-wiring. After multiple reboots, putting drives in different locations, etc., it was still not working. I decided I would do a formal removal from the cluster, install new boot drives, then rejoin it, but I had too many meetings yesterday to do so.
This morning I logged into the Proxmox cluster... and that node was up! It is showing that the SSD failed:
However, the node is back up and running a few test VMs. It was certainly down for over three hours before it came back online.
Has anyone ever seen a node resurrection like this?