Inexplicable CPU underclocking: help!

altano · Feb 23, 2019

My Xeon-D has begun underclocking under medium use and I can't figure out why.

What I did:
I replaced the stock heatsink on my Xeon-D 1537 with the Cooljag BUF-E as per this thread. Note that I used the stock screws so I didn't even replace the backplate. The replacement was surgical and I never even pulled the motherboard out of the case.

While inside I also pulled out an ASMedia USB Host PCI card that was passed-through to a VM. These are the ONLY two hardware changes I knowingly made to the server.

What I observed:
Once back in ESXi with my VMs booted up, I noticed extremely sluggish performance in my Windows VM. I looked at some other VMs and they were also sluggish. Everything seemed to peg the CPU to 100%.

My hardware specs:

Xeon-D 1537 (SuperMicro X10SDV-7TP4F). This is an 8-core, hyperthreaded SoC
Boot drive = Intel 900P. All VMs involved in the testing in this post are hosted here.
128GB DDR4 RAM running at 2133MHz

CPU info:

Code:

[root@esxi:~] vim-cmd hostsvc/hosthardware | grep cpu -A 10
       cpuPowerManagementInfo = (vim.host.CpuPowerManagementInfo) {
          currentPolicy = "High Performance",
          hardwareSupport = "ACPI P-states"
       },
       cpuInfo = (vim.host.CpuInfo) {
          numCpuPackages = 1,
          numCpuCores = 8,
          numCpuThreads = 16,
          hz = 1699998648
       },
       cpuPkg = (vim.host.CpuPackage) [
          (vim.host.CpuPackage) {
             index = 0,
             vendor = "intel",
             hz = 1699998648,
             busHz = 99999910,
             description = "Intel(R) Xeon(R) CPU D-1537 @ 1.70GHz",
             threadId = (short) [
                0,
                1,
                2,
    --
             cpuFeature = (vim.host.CpuIdInfo) [
                (vim.host.CpuIdInfo) {
                   level = 0,
                   vendor = <unset>,
                   eax = "0000:0000:0000:0000:0000:0000:0001:0100",
                   ebx = "0111:0101:0110:1110:0110:0101:0100:0111",
                   ecx = "0110:1100:0110:0101:0111:0100:0110:1110",
                   edx = "0100:1001:0110:0101:0110:1110:0110:1001"
                },
                (vim.host.CpuIdInfo) {
                   level = 1,
    --
                cpuID = (short) [
                   15,
                   14,
                   13,
                   12,
                   11,
                   10,
                   9,
                   8,
                   7,
                   6,
    --
       cpuFeature = (vim.host.CpuIdInfo) [
          (vim.host.CpuIdInfo) {
             level = 0,
             vendor = <unset>,
             eax = "0000:0000:0000:0000:0000:0000:0001:0100",
             ebx = "0111:0101:0110:1110:0110:0101:0100:0111",
             ecx = "0110:1100:0110:0101:0111:0100:0110:1110",
             edx = "0100:1001:0110:0101:0110:1110:0110:1001"
          },
          (vim.host.CpuIdInfo) {
             level = 1,

My Investigation:

IPMI
- I immediately thought the heatsink installation was bad so I checked my CPU temps in IPMI. The CPU temp was EXCELLENT and ~20 deg C lower than before, hovering around ~45 deg C.
In ESXi
- I ran Prime95 in my Windows VM which has 8 vCPUs dedicated to it, just as I did when taking temp readings BEFORE the heatsink replacement. Unlike what you'd expect, and the behavior I was seeing before the heatsink replacement, the CPU inside the VM hit 100% but the host saw the VM's CPU usage as <6%.
- I loaded up a stresslinux VM and ran it with 8 vCPUs and ran an 8-CPU stress test. The host saw the VM CPU usage as 10-15%, not the 50% I'd expect.
ESXTOP
- With ONE stresslinux VM set to CPU=8, w/ 8 vCPUs:
  - PCPU Util Avg = 56
    PCPU Used Avg = 11
    VM #1 (stresslinux1)
    %RUN = 818%
    %USED = 184%
    %RDY = <1
- With TWO stresslinux VMs set to CPU=8, w/ 8 vCPUs each:
  - PCPU Util Avg = 99
    PCPU Used Avg = 12
    VM #1 (stresslinux1)
    %RUN = 735%
    %USED = 124%
    %RDY = 77
    VM #2 (stresslinux2)
    %RUN = 804%
    %USED = 112%
    %RDY = 14
- As you can see, ESXTOP is showing that:
  - There is a huge discrepancy between PCPU Used and Util, indicating the CPU frequency is dropping.
  - There is a huge discrepancy between %RUN and %USED, even when %RDY is very low and we aren't oversubscribed in any way. %RUN is where we'd expect at ~800% (~8 vCPU x 100%) but %USED never leaves 100-200%.
  - With 2 VMs hitting 16 vCPUs, %RDY skyrockets to ~14 and ~77, well above the recommended ~5, despite the fact that I'm only hitting 16 vCPUs hard on a server with 16 logical processors (8-core hyperthreaded).
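For anyone who wants to put a number on the %RUN vs. %USED gap, here is a rough sketch. It assumes that %USED is credited at nominal clock speed while %RUN is plain scheduled time, so their ratio is a crude indicator of how far the clock has dropped. Hyperthreading also lowers %USED when both threads of a core are busy, so the two-VM rows are not a pure frequency signal. The function name and the sample labels are made up for illustration; the figures are the ones quoted above.

```python
# Rough estimate of effective clock scaling from the esxtop figures in this post.
# Assumption: USED/RUN approximates (actual clock / nominal clock) when the host
# is not oversubscribed. Hyperthreading contention also pulls this ratio down.

NOMINAL_MHZ = 1700  # D-1537 base clock, per the vim-cmd output (hz = 1699998648)


def used_to_run_ratio(used_pct: float, run_pct: float) -> float:
    """Return the share of scheduled time that was credited as nominal-speed work."""
    if run_pct <= 0:
        raise ValueError("run_pct must be positive")
    return used_pct / run_pct


# (label, %USED, %RUN) taken from the esxtop readings above
samples = [
    ("stresslinux1, one VM running", 184, 818),
    ("stresslinux1, two VMs running", 124, 735),
    ("stresslinux2, two VMs running", 112, 804),
]

for label, used, run in samples:
    ratio = used_to_run_ratio(used, run)
    print(f"{label}: USED/RUN = {ratio:.2f}, roughly {ratio * NOMINAL_MHZ:.0f} MHz equivalent")
```

The single-VM row is the cleanest one to read, since only 8 of the 16 logical processors are loaded there.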
Ubuntu Live CD
- I didn't know how to investigate CPU frequency problems in ESXi so I rebooted the host to an Ubuntu Live USB drive.
- Geekbench4: these scores are obviously garbage
  1. Single-core score: 936 (vs. 2067-3031 others are getting)
  2. Multi-core score: 1778 (vs. 10,030-16,631 others are getting)
- I ran sysbench (using sysbench cpu run --threads=16 --time=60) and then checked out the CPU frequencies:
  - ~400MHz across all cores! This is obviously the issue.
  - Note that the downclocking occurs at 5+ threads. At <=4 threads the CPU does not downclock and stays at ~800MHz.
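If you want to watch per-core clocks from a Linux live environment while the benchmark runs, something like the following sketch could work. It assumes the kernel exposes the standard cpufreq sysfs files (`scaling_cur_freq`, reported in kHz) under `/sys/devices/system/cpu`; the helper names `read_core_mhz` and `throttled_cores` and the 800 MHz floor are my own choices for illustration, not anything from the tools used above.

```python
from pathlib import Path


def read_core_mhz(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_name: MHz} read from each core's cpufreq/scaling_cur_freq (kHz)."""
    freqs = {}
    for freq_file in Path(sysfs_root).glob("cpu[0-9]*/cpufreq/scaling_cur_freq"):
        try:
            khz = int(freq_file.read_text().strip())
        except (OSError, ValueError):
            continue  # core offline or file unreadable; skip it
        # .../cpuN/cpufreq/scaling_cur_freq -> "cpuN"
        freqs[freq_file.parent.parent.name] = khz / 1000
    return freqs


def throttled_cores(freqs, floor_mhz):
    """Return the sorted names of cores running below floor_mhz."""
    return sorted(name for name, mhz in freqs.items() if mhz < floor_mhz)


if __name__ == "__main__":
    current = read_core_mhz()
    if not current:
        print("No cpufreq data found (is a cpufreq driver loaded?)")
    for name, mhz in sorted(current.items()):
        print(f"{name}: {mhz:.0f} MHz")
    # 800 MHz is the floor observed at <=4 threads in this post; adjust as needed.
    print("Below 800 MHz:", throttled_cores(current, 800))
```

Run it in a second terminal while the sysbench command is active and compare the output at 4 threads and at 5 or more.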

Other things I tried:

I reseated the heatsink and paid more attention to good thermal paste application, although when I pulled it off the original application looked very evenly spread and I had no patches of missing paste. Also, again, temperatures look better than ever (including idle temps where the CPU is NOT underclocked).
Putting the USB PCI card back in, re-enabling passthrough, and adding it back to the Windows VM it was originally attached to. I had no reason to think this would matter but I tried it.
Disabling all VMs on the ESXi host other than the stresslinux VMs.
I had the JPV1 8-pin power plugged in erroneously in addition to the 24-pin ATX connector. Whoops. I pulled this out and saw no changes.
Disabling CPU power saving in the bios AND in Ubuntu: no matter what settings I tweaked the CPU frequency dropped to 400MHz under load.
Installed Windows 10 Pro x64, enabled High Performance power plan, and installed all SuperMicro drivers. Same behavior.
Pulled the CMOS battery and shorted the JBT1 jumper as per the instructions from SuperMicro on how to clear the CMOS.
BIOS and BMC firmware are both up to date.

Can anyone here think what on earth I might have done that would cause my CPU performance to tank? It surely must be the heatsink replacement but I don't understand HOW replacing the heatsink would have caused this behavior. If the CPU was overheating and downclocking it would surely ACTUALLY OVERHEAT. =\

scline · Feb 24, 2019

That sounds incredibly odd indeed. Have you checked if there is a BIOS update available? Just shooting in the dark that something in the bios is set to keep things low performance.

Granted, that does not explain why changing a heatsink would cause such a downgrade in speed.

altano · Feb 24, 2019

scline said:
Just shooting in the dark that something in the bios is set to keep things low performance.

Shots in the dark are much appreciated! I updated to R2.0, the latest bios, the latest BMC firmware, and ESXi 6.7U1 about 3 weeks ago. So I'm on the latest of everything as far as I know.

Marsh · Feb 24, 2019

A few years ago, the same thing happened to me with a dual ASUS 2011 system.
I was pulling out what little hair I have.
After a few weeks of troubleshooting, I pulled the CMOS battery out for a few minutes and factory reset the BIOS.
It fixed the weird CPU clock problem.

altano · Feb 24, 2019

Marsh said:
A few years ago, the same thing happened to me with a dual ASUS 2011 system.
I was pulling out what little hair I have.
After a few weeks of troubleshooting, I pulled the CMOS battery out for a few minutes and factory reset the BIOS.
It fixed the weird CPU clock problem.

Thanks for the suggestion. I pulled the CMOS battery and shorted the JBT1 jumper as per the instructions from SuperMicro on how to clear the CMOS. Once back in Windows the behavior seems to be identical as far as I can tell.

Some more potentially interesting data:

HWMonitor and CPU-Z report the frequencies of all cores (under load) at 800MHz while Task Manager reports 400MHz.
According to HWMonitor, the CPU Package power is 20-25W when idle and 40-45W under Prime95 load.
Both Task Manager and HWMonitor agree that when at idle, with High Performance power plan enabled, the clock speeds of all cores return to 1.9-2.1GHz (there is some fluctuation back down to 0.8GHz as some very minor load hits the CPUs, but when truly idle the speeds stay up around 2GHz).
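To cross-check the package power readings from Linux rather than Windows, a sketch along these lines might help. It assumes the platform exposes Intel RAPL counters through the powercap sysfs interface at `/sys/class/powercap/intel-rapl:0` (package 0); that path, and the helper names, are assumptions for illustration, and reading `energy_uj` usually requires root. The idea is simply to sample the cumulative energy counter twice and divide by the elapsed time.

```python
import time
from pathlib import Path

# Assumed location of the package-0 RAPL domain; may differ or be absent.
RAPL_DIR = Path("/sys/class/powercap/intel-rapl:0")


def average_watts(start_uj, end_uj, seconds, max_range_uj):
    """Average power from two energy readings (microjoules), allowing one wraparound."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    delta = end_uj - start_uj
    if delta < 0:
        # The counter wrapped; max_energy_range_uj gives its approximate span.
        delta += max_range_uj
    return delta / 1_000_000 / seconds


def sample_package_watts(rapl_dir=RAPL_DIR, interval=1.0):
    """Read energy_uj twice, interval seconds apart, and return average watts."""
    energy_file = rapl_dir / "energy_uj"
    max_range = int((rapl_dir / "max_energy_range_uj").read_text())
    start = int(energy_file.read_text())
    t0 = time.monotonic()
    time.sleep(interval)
    end = int(energy_file.read_text())
    return average_watts(start, end, time.monotonic() - t0, max_range)


if __name__ == "__main__":
    try:
        print(f"Package power: {sample_package_watts():.1f} W")
    except OSError as exc:
        print(f"RAPL counters not readable: {exc}")
```

Sampling once at idle and once under load gives two numbers to set against the HWMonitor figures above.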

Danic · Feb 24, 2019

Does your motherboard come fanless, as shown on Supermicro's site? I wonder if your heatsink replacement is causing a "fan failure" and the BIOS is forcing it into a low clock speed?
Have you used a different power supply?

Lastly, in my overclocking adventures, I use a program called HWINFO to see if the CPU is thermal or power throttling (see the sensors section of the application). I have had CPUs report thermal throttling even though they had low temps. In most cases it was because something else was overheating. HWMonitor may have the same sensor package as HWINFO, as they look similar.

altano · Feb 25, 2019

Danic said:
Does your motherboard come fanless, as shown on Supermicro's site? I wonder if your heatsink replacement is causing a "fan failure" and the BIOS is forcing it into a low clock speed?

Yup it is fanless, so I don't think the new heatsink could be resulting in fan failures. Good question.

Danic said:
Have you used a different power supply?

I can't easily try this but it's on my list of things to try this week when I have some more free time to salvage a PSU from another machine.

Danic said:
... I use program called HWINFO ...

Neat program! With Prime95 running there is very clearly a single culprit for the downclocking: "IA: Turbo Attenuation (MCT)" / "Power Limit Exceeded" / "Package/Ring Power Limit Exceeded". It doesn't appear to be related to thermals.

Any idea what might be causing this?

altano · Feb 25, 2019

And FWIW the voltages at idle of all the cores are ~0.82V and under Prime95 load are ~0.64V

altano · Feb 25, 2019

@Patrick I think the image proxy has a bug. Sometimes the images don't load (I've both personally experienced this and I've had some people who I linked to this post ask me to resend them the images). I just get a little error image instead (with a red X through it).

It's happening more consistently in my other post: https://forums.servethehome.com/ind...gpu-for-vmware-passthrough.23589/#post-219754 (note that the image I'm linking to is legit: https://files.terriblefish.com/20190221_055403168_iOS_LI.jpg)

Nikolas · Mar 30, 2019

I had the exact same issue as you. I removed the stock heatsink on my X10SDV-7TP4F, just to replace the thermal compound. After I had the thermal grease replaced, I put everything back together. Stock heatsink and all.
But when I turned on the server, I only got 0.41GHz as the maximum speed on the processor.

I first thought that I had put too much of the thermal compound on the heatsink, so I reopened the chassis, removed the heatsink and removed SOME of the excess thermal compound from the edges. Mind you, there was not much excess, but I wanted to be on the safe side.
Still the results were the same, 0.41GHz on the processor.

I checked everything that you did as well. I also installed the HWInfo tool and got the exact same culprits as you had.

I reopened the server once more and checked everything a fourth or fifth time. And as I was about to screw on the heatsink again, I noticed that one of the Vitec PR72-221 power inductors was loose on one side. I must have accidentally banged it a little while I was trying to remove the heatsink the first time. (See the attached image for which component got damaged.)

Luckily I was able to re-solder the loose pin to the motherboard and thus solved the problem that I had.

So maybe it would be worth checking the power inductors around the processor, to see if you accidentally knocked one of them loose?

altano · Mar 30, 2019

Thanks for posting! Oh man I wonder if that happened to me. One of the power inductors definitely looked SLIGHTLY askew and I was worried I had hit it. But when I touched it gently it went back into place and doesn’t feel loose at all.

Can you give me more details about how you resoldered it? I’ve never soldered something that was surface mount before.

Nikolas · Mar 31, 2019

I think that if you were able to move the power inductor, then it has most probably come loose on one of its pins. Even if you say that it doesn't feel loose, if you are able to physically move the power inductor AT ALL, then it is loose.

It is not easy to give soldering lessons over a forum...

I removed the heatsink to get a better view of the power inductor. Then I took a fine-tipped soldering iron and, from the side (same side as the arrow on the attached picture from my earlier post), warmed up the solder on the motherboard and the pin, and at the same time added some (very little) solder to get the pin to attach to the solder pad on the motherboard.

I attached a (stolen) picture of the power inductor so you can see what the pins look like.

Good luck. I hope this solves your issue with your processor.

On a side note, replacing the factory-applied thermal compound reduced the CPU temperature by about 30° C (from around 90-95° C to 55-60° C), still using the same original heatsink. I guess that with a copper heatsink I could reduce the temperature by another 10-15 degrees?

altano · May 3, 2019

@Nikolas you're a lifesaver!

I sent the board in for RMA and they didn't immediately find the problem. I pointed out that the power inductors, although no longer askew, did get slightly bumped during my heatsink replacement. With this information Supermicro support was able to resolder the power inductors and fix the board! I just ran it through its paces and the thing is flying like new.

Kudos to both Nikolas for finding the problem and Supermicro for performing this repair for me (for free no less!).
