PSOD - Spin Count Exceeded after new hardware

KlausK

New Member
Aug 14, 2020
Hi STH,


Help :)


For a few years I have been running a small ESXi box on an Intel Skull Canyon NUC (NUC6i7KYK, 32GB non-ECC RAM) with 3 active VM's. Nothing was under heavy load and everything was fine. After 3 years of 24/7 service, though, it gave up: the fan died and it kept acting up in different ways.

Due to that I upgraded to new hardware. On the new hardware I re-installed VMware on a new USB stick but moved the SSD's from the old hardware and imported the existing datastores.

But now nothing is stable. After about a day or two of uptime with all VM's running I get a PSOD with "Spin Count Exceeded - Possible Deadlock". See screenshots attached below. I tried searching the forums but the two posts about Spin Count don't appear to be related.

Hardware:
  1. Barebone: RS100-E10-Pi2
    1. BIOS: Version 3103 (latest)
    2. FW: Version 1.13.6 (latest)
  2. CPU: Xeon E-2236
  3. RAM: 64GB of ECC RAM - Kingston KTH-PL424E/16G
  4. ESXi: 6.7u3 build 16075168 (also tested 7.0 and multiple 6.7u3 builds)
  5. Datastore:
    1. SSD1: INTEL SSDPEKKW51
    2. SSD2: WD - WDS500G3X0C-00SJ
  6. USB: SanDisk UltraFit 32GB - SDCZ430-032G-G46
3 VM's:
  • VMware vCenter
    • 2 vCPU's
    • 10GB RAM
  • GrayLog
    • 2 vCPU's
    • 4GB RAM
  • Cisco Firepower Management Center (FMC)
    • 4 vCPU's
    • 16GB RAM
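
As a quick sanity check that the VM allocations above do not overcommit the host, here is a minimal sketch in Python. The 12 logical CPUs are an assumption taken from the Xeon E-2236's 6-core/12-thread spec, not from the post; the per-VM numbers are the ones listed above.

```python
# Host figures: 12 logical CPUs assumes the E-2236's 6 cores / 12 threads
# with hyper-threading enabled; 64 GB is the RAM listed in the hardware section.
HOST_LOGICAL_CPUS = 12
HOST_RAM_GB = 64

# (vCPUs, RAM in GB) per VM, as listed in the post.
vms = {
    "vCenter": (2, 10),
    "GrayLog": (2, 4),
    "FMC": (4, 16),
}

total_vcpus = sum(vcpus for vcpus, _ in vms.values())
total_ram_gb = sum(ram for _, ram in vms.values())

print(f"vCPUs allocated: {total_vcpus} of {HOST_LOGICAL_CPUS} logical CPUs")
print(f"RAM allocated:   {total_ram_gb} of {HOST_RAM_GB} GB")
```

Both totals come in well under the host's capacity, so plain overcommitment looks unlikely as the cause.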

However, if I run with only the vCenter and GrayLog VM's it appears stable. It is also stable with no VM's running. So it appears to be somehow related to the FMC VM, though I have to admit I cannot remember whether I tested with the FMC VM alone. I have just started that test.
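
To keep track of which VM combinations have actually been tested, a small matrix like the one below may help. This is only a sketch: the result labels ("stable", "PSOD", "running") are shorthand for what is described in this post, and every other combination is simply marked untested.

```python
from itertools import combinations

VMS = ("vCenter", "GrayLog", "FMC")

# Results reported so far in this post. "stable" means "appears stable";
# "running" marks the FMC-only test that has just been started.
reported = {
    frozenset(): "stable",
    frozenset({"vCenter", "GrayLog"}): "stable",
    frozenset(VMS): "PSOD",
    frozenset({"FMC"}): "running",
}


def all_combinations(vms):
    """Yield every subset of the VM list, from no VMs to all of them."""
    for size in range(len(vms) + 1):
        for combo in combinations(vms, size):
            yield frozenset(combo)


untested = [c for c in all_combinations(VMS) if c not in reported]

for combo in all_combinations(VMS):
    label = ", ".join(sorted(combo)) or "(no VMs)"
    print(f"{label:28} {reported.get(combo, 'untested')}")
```

The combinations that pair FMC with just one of the other VM's are the ones that would show whether FMC alone is enough to trigger the PSOD or whether total load matters.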

I have run MemTest86 v8.4 numerous times with a 100% pass. What else could I try to figure out what is going on? Is there some data I could capture and share to help?


Any help is greatly appreciated.


Errors seen:
From the iKVM: "ID: 1 CPU_CATERR sensor of type processor logged a IERR"

From the BIOS:
1597407079847.png

PSOD's:
1:
1597407148149.png
2:
1597407162391.png
3:
1597407174678.png

MemTest:
1597407119677.png
1597407768598.png

/Klaus
 

KlausK

New Member
Aug 14, 2020
For future reference to people reading this thread.

It appears I have it running stable now. It is currently at 8 days of uptime, which is 4x what I have had on this server before, as it was normally failing between 1 and 2 days of uptime. It has been running all VM's with normal usage without issues, but I will not celebrate until it is at 14+ days I guess.

What fixed it, though, I am not sure of. Because the server is in a small rack which is hard to service, I did a whole lot of things at once when I pulled it out.
  1. Changed the 2x 512 GB SSD's (1 Intel and 1 WD) to 2x 1TB Samsung 970 Evo Plus SSD's with newest firmware on them.
    1. VM's were moved using Veeam backup and restore.
  2. Kept the VM datastore on 1 SSD only. Before it was split between 2 SSD's. The 2nd SSD now houses a datastore for backups and other tasks.
  3. Reseated all the RAM in different locations (1<->4, 2<->3).
  4. Reseated the CPU and used a proper thermal compound this time instead of just using the pre-applied thermal pad on the heatsink.
  5. Changed LAN cables to make it easier to service later on (they were too short).
  6. Flashed a BIOS version that was released just the day before I pulled the server out.
    1. Update Microcode 0xD6 for IPU 2020.1 Security Issue.
    2. Update "SIO Configuration" setup item in BIOS.
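
Because six things changed at once, pinning down the actual fix would mean undoing them one at a time. The sketch below is a rough estimate of how long that could take. Reverting one change per trial is an assumption (it was not done here), and the 14-day soak per trial is taken from the "14+ days" threshold mentioned above.

```python
# The six changes from the list above, shortened.
changes = [
    "swap SSDs to Samsung 970 Evo Plus",
    "VM datastore on a single SSD",
    "reseat RAM in different slots",
    "reseat CPU with thermal compound",
    "replace LAN cables",
    "new BIOS with microcode 0xD6",
]

SOAK_DAYS = 14  # days without a PSOD before a trial counts as stable

# Upper bounds: a bad configuration used to fail within 1 to 2 days,
# so a trial that reintroduces the fault would end much sooner.
one_at_a_time_days = len(changes) * SOAK_DAYS
every_combination_days = (2 ** len(changes)) * SOAK_DAYS

print(f"Revert one change per trial: up to {one_at_a_time_days} days")
print(f"Test every combination:      up to {every_combination_days} days")
```

Even the one-at-a-time approach costs close to three months of soak time in the worst case, which is why changing everything in one go is tempting on a hard-to-reach server.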

So fingers crossed it remains stable... Any bets on what fixed it? :)

/Klaus
 

dvdwsn

New Member
May 16, 2021
Hi KlausK,
I'm having the same CPU_CATERR problem as you and I have the same server as you. My Asus RS100-E10-PI2 is running with an i3 9100, different Kingston ECC RAM and Debian.
Wondering if this problem ever returned? If not, did you ever narrow down what it was that fixed the issue for you?
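
For reference, on a Linux host such as Debian the loaded microcode revision can be read from /proc/cpuinfo and compared with the 0xD6 from the BIOS notes above. The sketch below assumes the usual x86 `microcode : 0x...` line format. Whether 0xD6 is the right target for a different CPU model is also an assumption, since revision numbers are specific to each CPU stepping.

```python
import re
from pathlib import Path

# Revision named in the BIOS release notes quoted earlier in the thread.
# Assumption: the same number is meaningful for the CPU being checked.
TARGET_REVISION = 0xD6


def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions found in /proc/cpuinfo text."""
    pattern = r"^microcode\s*:\s*(0x[0-9a-fA-F]+)\s*$"
    return {
        int(match.group(1), 16)
        for match in re.finditer(pattern, cpuinfo_text, re.MULTILINE)
    }


cpuinfo = Path("/proc/cpuinfo")
if cpuinfo.exists():
    # Prints nothing if the file has no microcode lines (e.g. non-x86 hosts).
    for rev in sorted(microcode_revisions(cpuinfo.read_text())):
        status = "at or above" if rev >= TARGET_REVISION else "below"
        print(f"microcode {rev:#x} is {status} {TARGET_REVISION:#x}")
```

A revision below the target would suggest the BIOS or the distribution's microcode package is worth updating before digging into other causes.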

Thanks!