My Xeon-D has begun underclocking under medium use and I can't figure out why.
What I did:
I replaced the stock heatsink on my Xeon-D 1537 with the Cooljag BUF-E as per this thread. Note that I used the stock screws, so I didn't even replace the backplate. The replacement was surgical and I never even pulled the motherboard out of the case.
While inside I also pulled out an ASMedia USB Host PCI card that was passed through to a VM. These are the ONLY two hardware changes I knowingly made to the server.
What I observed:
Once back in ESXi with my VMs booted up, I noticed extremely sluggish performance in my Windows VM. I looked at some other VMs and they were also sluggish. Everything seemed to peg the CPU to 100%.
My hardware specs:
- Xeon-D 1537 (SuperMicro X10SDV-7TP4F). This is an 8-core, hyperthreaded SoC
- Boot drive = Intel 900P. All VMs involved in the testing in this post are hosted here.
- 128GB DDR4 RAM running at 2133MHz
- CPU info:
Code:
    [root@esxi:~] vim-cmd hostsvc/hosthardware | grep cpu -A 10
       cpuPowerManagementInfo = (vim.host.CpuPowerManagementInfo) {
          currentPolicy = "High Performance",
          hardwareSupport = "ACPI P-states"
       },
       cpuInfo = (vim.host.CpuInfo) {
          numCpuPackages = 1,
          numCpuCores = 8,
          numCpuThreads = 16,
          hz = 1699998648
       },
       cpuPkg = (vim.host.CpuPackage) [
          (vim.host.CpuPackage) {
             index = 0,
             vendor = "intel",
             hz = 1699998648,
             busHz = 99999910,
             description = "Intel(R) Xeon(R) CPU D-1537 @ 1.70GHz",
             threadId = (short) [
                0,
                1,
                2,
    --
             cpuFeature = (vim.host.CpuIdInfo) [
                (vim.host.CpuIdInfo) {
                   level = 0,
                   vendor = <unset>,
                   eax = "0000:0000:0000:0000:0000:0000:0001:0100",
                   ebx = "0111:0101:0110:1110:0110:0101:0100:0111",
                   ecx = "0110:1100:0110:0101:0111:0100:0110:1110",
                   edx = "0100:1001:0110:0101:0110:1110:0110:1001"
                },
                (vim.host.CpuIdInfo) {
                   level = 1,
    --
       cpuID = (short) [
          15,
          14,
          13,
          12,
          11,
          10,
          9,
          8,
          7,
          6,
    --
       cpuFeature = (vim.host.CpuIdInfo) [
          (vim.host.CpuIdInfo) {
             level = 0,
             vendor = <unset>,
             eax = "0000:0000:0000:0000:0000:0000:0001:0100",
             ebx = "0111:0101:0110:1110:0110:0101:0100:0111",
             ecx = "0110:1100:0110:0101:0111:0100:0110:1110",
             edx = "0100:1001:0110:0101:0110:1110:0110:1001"
          },
          (vim.host.CpuIdInfo) {
             level = 1,
- IPMI
- I immediately thought the heatsink installation was bad so I checked my CPU temps in IPMI. The CPU temp was EXCELLENT and ~20 deg C lower than before, hovering around ~45 deg C.
- In ESXi
- I ran Prime95 in my Windows VM, which has 8 vCPUs dedicated to it, just as I did when taking temp readings BEFORE the heatsink replacement. Contrary to what you'd expect, and to the behavior I saw before the heatsink replacement, the CPU inside the VM hit 100% but the host reported the VM's CPU usage as <6%.
- I loaded up a stresslinux VM with 8 vCPUs and ran an 8-CPU stress test. The host reported the VM's CPU usage as 10-15%, not the 50% I'd expect (8 busy vCPUs out of 16 logical processors).
- ESXTOP
- With ONE stresslinux VM set to CPU=8, w/ 8 vCPUs:
    PCPU Util Avg = 56
    PCPU Used Avg = 11
    VM #1 (stresslinux1)
      %RUN = 818%
      %USED = 184%
      %RDY = <1
- With TWO stresslinux VMs set to CPU=8, w/ 8 vCPUs each:
    PCPU Util Avg = 99
    PCPU Used Avg = 12
    VM #1 (stresslinux1)
      %RUN = 735%
      %USED = 124%
      %RDY = 77
    VM #2 (stresslinux2)
      %RUN = 804%
      %USED = 112%
      %RDY = 14
- As you can see, ESXTOP is showing that:
- There is a huge discrepancy between PCPU Used and PCPU Util, indicating that the CPU frequency is dropping.
- There is a huge discrepancy between %RUN and %USED, even when %RDY is very low and we aren't oversubscribed in any way. %RUN is where we'd expect at ~800% (~8 vCPU x 100%) but %USED never leaves 100-200%.
- With 2 VMs hitting 16 vCPUs, %RDY skyrockets to ~14 and ~77, well above the recommended ~5, even though I'm only loading 16 vCPUs on a server with 16 logical processors (8 cores, hyperthreaded).
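To put a rough number on the %RUN vs. %USED gap, here is a minimal Python sketch. It assumes %USED is approximately %RUN scaled by (effective clock / nominal clock), which ignores %SYS, %OVRLP and hyperthreading accounting, so treat the result as a ballpark only. The function name and the sample table are just for illustration; the figures are the ones posted above.

```python
# Ballpark effective clock from the esxtop figures posted above.
# ASSUMPTION: %USED ~= %RUN * (effective MHz / nominal MHz).
# This ignores %SYS, %OVRLP and hyperthreading accounting.

NOMINAL_MHZ = 1700  # D-1537 nominal clock (hz = 1699998648 in the vim-cmd output)


def effective_mhz(run_pct, used_pct, nominal_mhz=NOMINAL_MHZ):
    """Scale the nominal clock by the USED/RUN ratio."""
    return nominal_mhz * used_pct / run_pct


# (label) -> (%RUN, %USED), copied from the esxtop readings
samples = {
    "one VM, stresslinux1": (818, 184),
    "two VMs, stresslinux1": (735, 124),
    "two VMs, stresslinux2": (804, 112),
}

for name, (run_pct, used_pct) in samples.items():
    print(f"{name}: ~{effective_mhz(run_pct, used_pct):.0f} MHz")
```

For the single-VM case this works out to roughly 380 MHz. The two-VM figures come out lower still, but with 16 vCPUs sharing 8 hyperthreaded cores the ratio is probably skewed by hyperthreading accounting as well, so I wouldn't read much into those.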
- Ubuntu Live CD
- I didn't know how to investigate CPU frequency problems in ESXi so I rebooted the host to an Ubuntu Live USB drive.
- Geekbench4: these scores are obviously garbage
- Single-core score: 936 (vs. 2067-3031 others are getting)
- Multi-core score: 1778 (vs. 10,030-16,631 others are getting)
- I ran sysbench (using sysbench cpu run --threads=16 --time=60) and then checked out the CPU frequencies:
- ~400MHz across all cores! This is obviously the issue.
- Note that the downclocking occurs at 5+ threads. At <=4 threads the CPU does not downclock and stays at ~800MHz.
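For anyone who wants to repeat the frequency check from a live Linux session, a minimal Python sketch follows. It assumes the kernel exposes cpufreq under /sys/devices/system/cpu (the usual location) and that scaling_cur_freq reports kHz. The function names are made up for illustration, and this is not necessarily how the readings above were taken.

```python
from pathlib import Path


def read_core_freqs_mhz(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_name: MHz} read from each core's cpufreq/scaling_cur_freq (kHz)."""
    freqs = {}
    for f in Path(sysfs_root).glob("cpu[0-9]*/cpufreq/scaling_cur_freq"):
        try:
            # f is .../cpuN/cpufreq/scaling_cur_freq, so parent.parent is cpuN
            freqs[f.parent.parent.name] = int(f.read_text().strip()) / 1000
        except (OSError, ValueError):
            continue  # core offline or file unreadable; skip it
    return freqs


def summarize(freqs):
    """Return (min, mean, max) in MHz, or None if nothing was read."""
    if not freqs:
        return None
    values = list(freqs.values())
    return min(values), sum(values) / len(values), max(values)


if __name__ == "__main__":
    stats = summarize(read_core_freqs_mhz())
    if stats is None:
        print("no cpufreq data found")
    else:
        print("min/avg/max MHz: %.0f / %.0f / %.0f" % stats)
```

Run it in a second terminal while sysbench is going, once with --threads=4 and once with --threads=5 or more, to see the drop described above.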
What I tried:
- I reseated the heatsink and paid more attention to good thermal paste application, although when I pulled it off the original application looked very evenly spread and I had no patches of missing paste. Also, again, temperatures look better than ever (including idle temps where the CPU is NOT underclocked).
- Put the USB PCI card back in, re-enabled passthrough, and added it back to the Windows VM it was originally attached to. I had no reason to think this would matter but I tried it.
- Disabled all VMs on the ESXi host other than the stresslinux VMs.
- I had the JPV1 8-pin power plugged in erroneously in addition to the 24-pin ATX connector. Whoops. I pulled this out and saw no changes.
- Disabled CPU power saving in the BIOS AND in Ubuntu: no matter what settings I tweaked, the CPU frequency dropped to 400MHz under load.
- Installed Windows 10 Pro x64, enabled High Performance power plan, and installed all SuperMicro drivers. Same behavior.
- Pulled the CMOS battery and shorted the JBT1 jumper as per the instructions from SuperMicro on how to clear the CMOS.
- BIOS and BMC firmware are both up to date.
Can anyone here think what on earth I might have done that would cause my CPU performance to tank? It surely must be the heatsink replacement but I don't understand HOW replacing the heatsink would have caused this behavior. If the CPU was overheating and downclocking it would surely ACTUALLY OVERHEAT. =\