Supermicro Server becoming unresponsive


lpallard

Hello,

I have just experienced something strange and quite worrisome with a Supermicro virtualization node (running Proxmox).

What happened:
  • Tried to access a website served by one of the VMs running on that node. The browser said "Page cannot be found". Tried another site from a different VM on the same server, with the same result. Checked the firewall: all OK. Internet access was fine and the local network was up and running.
  • Tried to SSH into a few VMs running on that node. None worked (connection timed out).
  • Tried to access the Proxmox web UI, same result (connection timed out).
  • Tried the IPMI web UI, which worked. Logged in, and the status was "Normal". The event log was empty (IPMI said "There are no event log entries present at this time."), and the sensors section showed no readings either (all sensors said either "NA" or ":Not Present!").
  • Connected a monitor and keyboard: no video signal. I noticed that the only LED lit on the chassis was the power one; the others (network, HDD, etc.) were not blinking at all.
  • Hard reset the server (which I hate to do). Went into the LSI MegaRAID firmware to look for status or errors; nothing specific came up (the error log was actually empty).
  • Exited the MegaRAID firmware and went into the machine's BIOS. All seemed OK.
  • POST detected all CPU cores and all RAM, with no errors.
  • I let the server boot, and Proxmox came back online. All is fine now.
What just happened? Motherboard failure?
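
For anyone wanting to repeat the reachability checks above in one go, here is a minimal Python sketch. It assumes the usual ports (80 for a VM web site, 22 for SSH, 8006 for the Proxmox web UI, 443 for the IPMI web UI); the addresses and service names are placeholders of my own, not values from this setup, and the hints it prints are rough guesses rather than a diagnosis.

```python
import socket

# Hypothetical addresses and service names; replace with your own.
CHECKS = {
    "vm_web": ("192.0.2.10", 80),
    "host_ssh": ("192.0.2.2", 22),
    "proxmox_webui": ("192.0.2.2", 8006),
    "ipmi_webui": ("192.0.2.3", 443),
}


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def summarize(results):
    """Turn per-service reachability into a rough, non-authoritative hint."""
    host_side = [name for name in results if name != "ipmi_webui"]
    host_down = not any(results[name] for name in host_side)
    if host_down and results.get("ipmi_webui"):
        return "host unreachable, BMC alive: host hang or host NIC/OS problem likely"
    if host_down:
        return "nothing reachable: check power, switch, or cabling first"
    if all(results.values()):
        return "everything reachable"
    return "partial outage: look at the individual services that failed"


def run_checks():
    """Probe every entry in CHECKS and print the results plus a summary."""
    results = {name: port_open(h, p) for name, (h, p) in CHECKS.items()}
    for name, ok in results.items():
        print(f"{name}: {'open' if ok else 'unreachable'}")
    print(summarize(results))
```

Calling `run_checks()` from a machine on the same network prints one line per service and then the hint. The situation described above (VMs, SSH and Proxmox web UI all timing out while IPMI answers) would fall into the first case.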

This reminded me that about a year ago, I did a soft shutdown from IPMI (power down) and the server wouldn't shut down. Is this related? How can this happen without leaving any traces or errors anywhere?
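
Since the BMC stayed reachable while the host was hung, one option is to save its state over the network before doing a hard reset. The sketch below wraps the `ipmitool` CLI for that. It assumes `ipmitool` is installed on another machine, that the BMC accepts IPMI 2.0 (`lanplus`) sessions, and that the password is exported in the `IPMI_PASSWORD` environment variable (which is what the `-E` flag reads). The BMC address and user name are placeholders, not values from this server.

```python
import subprocess

# Hypothetical BMC address and user name; replace with your own.
BMC_HOST = "192.0.2.3"
BMC_USER = "ADMIN"

# ipmitool subcommands worth capturing before resetting a hung host.
QUERIES = {
    "chassis_status": ["chassis", "status"],
    "event_log": ["sel", "elist"],
    "sensors": ["sdr", "elist"],
}


def ipmitool_argv(host, user, query):
    """Build the ipmitool command line for one out-of-band query."""
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-E", *query]


def collect(host=BMC_HOST, user=BMC_USER):
    """Run each query and return its raw text output, keyed by query name."""
    output = {}
    for name, query in QUERIES.items():
        proc = subprocess.run(
            ipmitool_argv(host, user, query),
            capture_output=True,
            text=True,
            timeout=30,
        )
        # Keep stderr on failure so the reason for the failure is not lost.
        output[name] = proc.stdout if proc.returncode == 0 else proc.stderr
    return output
```

Writing the result of `collect()` to a file at the time of the hang at least preserves whatever the BMC knew, even if, as in this case, the event log and sensor list turn out to be empty.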

Hopefully someone can shed some light on this.
Thanks!

Specs (main hardware only):
  • Mobo: Supermicro H8DCL-iF (BIOS 3.5 / IPMI 3.16)
  • CPU: 2x AMD Opteron 4334
  • RAM: 64GB Kingston KVR16R11D4K4-64
  • PSU: Corsair HX1060
  • RAID controller: IBM ServeRAID M5016 (See below for details)
  • Network: Intel Pro/1000 PT
  • Storage: 4x Hitachi Ultrastar 15k600 SAS 300GB

Software:
Proxmox 3.2 (see details below). I know, this is ANCIENT (I have been meaning to update to a newer release for two years now, but the upgrade procedure is not straightforward and I would basically need to start from scratch).


Code:
                    Versions
                ================
Product Name    : ServeRAID M5016
Serial No       : SV20718863
FW Package Build: 23.2.1-0021

                    Mfg. Data
                ================
Mfg. Date       : 02/18/12
Rework Date     : 00/00/00
Revision No     : 26A
Battery FRU     : N/A

                Image Versions in Flash:
                ================
BIOS Version       : 5.29.00_4.12.05.00_0x05090000
FW Version         : 3.150.05-1441
NVDATA Version     : 2.1108.03-0068
WebBIOS Version    : 6.1-23-e_23-Rel
Preboot CLI Version: 05.01-08:#%00001
Boot Block Version : 2.05.00.00-0004
BOOT Version       : 07.26.05.219

Code:
proxmox-ve-2.6.32: 3.2-126 (running kernel: 2.6.32-29-pve)
pve-manager: 3.2-4 (running version: 3.2-4/e24a91c1)
pve-kernel-2.6.32-29-pve: 2.6.32-126
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.5-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.5-1
pve-cluster: 3.0-12
qemu-server: 3.1-16
pve-firmware: 1.1-3
libpve-common-perl: 3.0-18
libpve-access-control: 3.0-11
libpve-storage-perl: 3.0-19
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-6
vzctl: 4.0-1pve5
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 1.7-8
ksm-control-daemon: 1.1-1
glusterfs-client: 3.4.2-1