You bought a cheap off-roadmap Intel Xeon CPU from somewhere, but the machine crashes and reboots, even when idle. You realize the CPU might have been thrown out of the hyperscaler's datacenter for a reason. But what reason?
Luckily, your CPU has extensive diagnostics and your Linux distribution supports "pstore" crash saving. In the saved dmesg* and mce* files under /sys/fs/pstore/ you find something like this:
Ouch, an MCE (Machine Check Exception). MCEs in logs are never a good sign, and in the majority of cases they mean broken hardware. Here, one physical core is defective: it shows up as logical CPU 2 and its hyper-thread sibling, CPU 26.
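If you are not sure which logical CPU is the hyper-thread sibling of the one named in the MCE record, sysfs exposes it in the per-CPU topology directory. A minimal sketch, assuming the usual sysfs layout; the function name `thread_siblings` and its optional directory argument are made up here for illustration:

```shell
#!/bin/sh
# thread_siblings CPU [SYSFS_CPU_DIR]
# Print the logical CPUs that share a physical core with CPU.
# SYSFS_CPU_DIR defaults to the standard location; the override is
# a hypothetical convenience for dry runs on a copied directory tree.
thread_siblings() {
    cpu=$1
    root=${2:-/sys/devices/system/cpu}
    cat "$root/cpu${cpu}/topology/thread_siblings_list"
}

# Example: use the number from the "CPU 2:" field of the MCE record.
# thread_siblings 2
```

On a machine like the one in this post, the output should list both logical CPUs of the broken core, i.e. 2 and 26.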
Instead of throwing the CPU out, you can try to disable the specific core and see if the rest of the CPU is still good. Thanks to the flexibility of Linux, this is easy to try:
In your boot loader, add an isolation parameter to exclude the broken core from being used:
Then somewhere in your boot scripts, e.g. /etc/rc.local, set the core's two logical CPUs (the defective one and its sibling) offline:
A check using lscpu -e reveals the core is now offline:
To verify stability, run stress-ng using as much memory as possible, or compile a Linux kernel in a loop overnight in tmpfs (i.e. in RAM).
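For the overnight loop, a small wrapper that stops at the first failure saves you from scrolling through hours of output. This is only a sketch: `run_until_failure` is a made-up helper name, the path /dev/shm/linux is a placeholder for wherever your configured kernel tree lives, and the stress-ng options should be checked against the man page of your version.

```shell
#!/bin/sh
# run_until_failure MAX_ITERATIONS COMMAND [ARGS...]
# Run COMMAND repeatedly; stop and report at the first non-zero exit status.
run_until_failure() {
    max=$1
    shift
    i=0
    while [ "$i" -lt "$max" ]; do
        i=$((i + 1))
        "$@" || { echo "iteration $i FAILED"; return 1; }
    done
    echo "completed $i iterations without failure"
}

# Examples (assumptions, adjust to your system):
# run_until_failure 100 sh -c 'make -C /dev/shm/linux clean && make -C /dev/shm/linux -j"$(nproc)"'
# run_until_failure 10 stress-ng --vm 2 --vm-bytes 75% --verify --timeout 1h
```

If the machine survives the night and the wrapper reports no failed iteration, the remaining cores are probably fine. If it reboots, check /sys/fs/pstore/ again for a new MCE record.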
Code (MCE record saved by pstore):
mce: [Hardware Error]: CPU 2: Machine Check Exception: 5 Bank 0: f200004000070005
mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffffaa4015f0> {asm_sysvec_apic_timer_interrupt+0x0/0x20}
mce: [Hardware Error]: TSC 127d93bae02
mce: [Hardware Error]: PROCESSOR 0:50657 TIME 1707175411 SOCKET 0 APIC 4 microcode 5003604
mce: [Hardware Error]: Run the above through 'mcelog --ascii'
mce: [Hardware Error]: Machine check: Processor context corrupt
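The 64-bit value after "Bank 0:" in the first line of the record is the bank's status register, and its top bits already tell you how bad things are. The following sketch decodes a few of them by hand, assuming the architectural IA32_MCi_STATUS layout from the Intel SDM (VAL = bit 63, OVER = 62, UC = 61, EN = 60, PCC = 57, MCA error code in bits 15:0, model-specific code in bits 31:16). The function name `decode_mci_status` is made up for this example; it is no replacement for mcelog.

```shell
#!/bin/sh
# decode_mci_status STATUS
# STATUS is the 16-digit hex value from the MCE record, without "0x".
decode_mci_status() {
    status=$1
    # Split into 32-bit halves so shell arithmetic never overflows
    # a signed 64-bit integer.
    hi=$(printf '%s' "$status" | cut -c1-8)
    lo=$(printf '%s' "$status" | cut -c9-16)
    hi=$((0x$hi))
    lo=$((0x$lo))
    # Register bits 63..57 are bits 31..25 of the upper half.
    echo "VAL=$(( (hi >> 31) & 1 ))"    # record is valid
    echo "OVER=$(( (hi >> 30) & 1 ))"   # an earlier error was overwritten
    echo "UC=$(( (hi >> 29) & 1 ))"     # uncorrected error
    echo "EN=$(( (hi >> 28) & 1 ))"     # error reporting was enabled
    echo "PCC=$(( (hi >> 25) & 1 ))"    # processor context corrupt
    printf 'MCACOD=0x%04x\n' $(( lo & 0xffff ))
    printf 'MSCOD=0x%04x\n' $(( (lo >> 16) & 0xffff ))
}

decode_mci_status f200004000070005
```

For the value above this prints VAL=1, OVER=1, UC=1, EN=1, PCC=1, MCACOD=0x0005 and MSCOD=0x0007. PCC=1 is what the kernel reports as "Processor context corrupt", and UC=1 means the error was not corrected.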
Code (kernel command line):
isolcpus=domain,managed_irq,2,26
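After the next reboot it is worth confirming that the parameter really reached the kernel, since a typo in the boot loader config fails silently. A small sketch that pulls the value out of /proc/cmdline; the function name `isolcpus_value` and its optional file argument are assumptions made for this example:

```shell
#!/bin/sh
# isolcpus_value [CMDLINE_FILE]
# Print the value of the isolcpus= parameter, or nothing if it is absent.
# CMDLINE_FILE defaults to /proc/cmdline.
isolcpus_value() {
    # Put each parameter on its own line, then keep only isolcpus=...
    tr ' ' '\n' < "${1:-/proc/cmdline}" | sed -n 's/^isolcpus=//p'
}

# Example:
# isolcpus_value
```

Empty output means the kernel was booted without the parameter.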
Code (boot script, e.g. /etc/rc.local):
LOGGER="logger --id=$$ -t $(basename "$0")"
CPUS_DEFECTIVE="2 26"
# Only run on the one board with this specific MAC address
if ip link show | grep -qi "00:11:22:33:44:55"; then
    echo "Faulty CPU, disabling CPUs $CPUS_DEFECTIVE" | $LOGGER
    for c in $CPUS_DEFECTIVE; do
        # Writing 0 to the "online" file takes the logical CPU offline
        echo 0 > "/sys/devices/system/cpu/cpu${c}/online"
    done
fi
Code (output of lscpu -e):
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ MHZ
...
  2    -      -    - -                 no      -      -   -
...
 26    -      -    - -                 no      -      -   -
...
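If you would rather check this from a script than read the lscpu table by eye, the same state is available in each CPU's "online" file in sysfs. A minimal sketch; `cpu_is_offline` and its optional directory argument are made-up names for illustration:

```shell
#!/bin/sh
# cpu_is_offline CPU [SYSFS_CPU_DIR]
# Succeeds (exit status 0) only if the CPU's "online" file contains 0.
cpu_is_offline() {
    root=${2:-/sys/devices/system/cpu}
    [ "$(cat "$root/cpu$1/online" 2>/dev/null)" = "0" ]
}

# Example: complain if one of the defective CPUs is still online.
# for c in 2 26; do
#     cpu_is_offline "$c" || echo "CPU $c is NOT offline"
# done
```

A missing "online" file also counts as "not offline" here, which is the safe answer for this purpose.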