Stephan

How-to Guide: Machine Check Exception (MCE) workaround

You bought a cheap off-roadmap Intel Xeon CPU from somewhere, but the machine crashes and reboots, even when idle. You realize the CPU might have been thrown out of the hyperscaler's datacenter for a reason. But what reason?

Luckily, your CPU has extensive diagnostics and your Linux distribution supports "pstore" crash saving. In the saved dmesg* and mce* files in the directory /sys/fs/pstore/ you find something like this:
Code:
mce: [Hardware Error]: CPU 2: Machine Check Exception: 5 Bank 0: f200004000070005
mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffffaa4015f0> {asm_sysvec_apic_timer_interrupt+0x0/0x20}
mce: [Hardware Error]: TSC 127d93bae02
mce: [Hardware Error]: PROCESSOR 0:50657 TIME 1707175411 SOCKET 0 APIC 4 microcode 5003604
mce: [Hardware Error]: Run the above through 'mcelog --ascii'
mce: [Hardware Error]: Machine check: Processor context corrupt
Ouch, an MCE. MCEs in logs are never a good sign, and in the majority of cases they mean broken hardware. Here, one physical core is defective: the log names logical CPU 2, which shares that core with its hyper-thread sibling, logical CPU 26.
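The bank status value in the first log line can be picked apart by hand. Below is a minimal sketch that decodes only the top flag bits, assuming the bit layout Intel documents for the IA32_MCi_STATUS registers (bit 63 = VAL down to bit 57 = PCC). The function name mce_flags is made up for this example, and the proper tool for a full decode remains mcelog, as the log itself suggests.

```shell
#!/bin/sh
# Decode the top flag bits of a machine check bank status value,
# given as exactly 16 hex digits without a 0x prefix.
# Assumed layout: 63=VAL 62=OVER 61=UC 60=EN 59=MISCV 58=ADDRV 57=PCC.
mce_flags() {
    status=$1
    # Keep only the high 32 bits (the first 8 hex digits); this avoids
    # overflow in shells that use signed 64-bit arithmetic.
    high=$((0x${status%????????}))
    out=""
    # Bit N of the full register is bit N-32 of the high word.
    for pair in 31:VAL 30:OVER 29:UC 28:EN 27:MISCV 26:ADDRV 25:PCC; do
        bit=${pair%%:*}
        name=${pair#*:}
        if [ $(( (high >> bit) & 1 )) -eq 1 ]; then
            out="$out $name"
        fi
    done
    echo "${out# }"
}

# Status value from the log above
mce_flags f200004000070005
```

For the value from the log this prints VAL OVER UC EN PCC. The PCC flag is the "Processor context corrupt" from the last log line, and UC marks the error as uncorrected.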

Instead of throwing the CPU out, you can try to disable the specific core and see if the rest of the CPU is still good. Thanks to the flexibility of Linux, you can try the following:

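In your boot loader, add an isolation parameter to the kernel command line to keep the broken core from being used. Note that with only the managed_irq flag, isolcpus keeps managed interrupts away from these CPUs; put the domain flag in front as well (domain,managed_irq,2,26) if the scheduler should also avoid them until they are taken offline: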
Code:
isolcpus=managed_irq,2,26
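How the parameter gets onto the kernel command line depends on the boot loader. On a GRUB 2 system, for example, it is typically appended to GRUB_CMDLINE_LINUX in /etc/default/grub, followed by regenerating the configuration (update-grub on Debian-style systems); treat that as an assumption about your setup. After a reboot, a small check like the following sketch confirms the kernel actually saw it. The function name isolcpus_value is hypothetical, and the file argument exists only so the function can be tried against a sample file.

```shell
#!/bin/sh
# Print the value of isolcpus= from a kernel command line, or return 1
# if the parameter is absent. Reads /proc/cmdline unless a file is given.
isolcpus_value() {
    cmdline_file=${1:-/proc/cmdline}
    for word in $(cat "$cmdline_file"); do
        case $word in
            isolcpus=*)
                echo "${word#isolcpus=}"
                return 0
                ;;
        esac
    done
    return 1
}

# Example: report what the running kernel was booted with
if value=$(isolcpus_value 2>/dev/null); then
    echo "isolcpus is set to: $value"
else
    echo "isolcpus is not set on this boot"
fi
```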
Then somewhere in your boot-scripts, e.g. /etc/rc.local, set the core and its sibling to offline:
Code:
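# Log to syslog, tagged with this script's name and PID
LOGGER="logger --id=$$ -t $(basename "$0")"
# Logical CPUs of the defective core: CPU 2 and its hyper-thread sibling 26
CPUS_DEFECTIVE="2 26"
# Only run on one certain board with specific MAC address
if ip link show | grep -qi 00:11:22:33:44:55; then
    echo "Faulty CPU, disabling CPUs $CPUS_DEFECTIVE" | $LOGGER
    for c in $CPUS_DEFECTIVE; do
        # Writing 0 takes the logical CPU offline (requires root)
        echo 0 > "/sys/devices/system/cpu/cpu${c}/online"
    done
fi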
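The sibling number (26 here) does not have to be worked out by hand. As a sketch, the kernel exposes it in the topology/thread_siblings_list file of each CPU in sysfs, which holds a list such as 2,26 or a range such as 2-3. The function name core_siblings and the CPU_SYSFS override are invented for this example; the override is there only so the function can be tried against a fake directory.

```shell
#!/bin/sh
# Print all logical CPUs that share a physical core with logical CPU $1,
# separated by spaces. CPU_SYSFS may point at an alternative directory
# for dry runs; by default the real sysfs path is used.
core_siblings() {
    cpu_dir=${CPU_SYSFS:-/sys/devices/system/cpu}
    list=$(cat "$cpu_dir/cpu$1/topology/thread_siblings_list") || return 1
    out=""
    for part in $(echo "$list" | tr ',' ' '); do
        case $part in
            *-*)
                # Expand a range such as 2-3 into its members
                i=${part%-*}
                end=${part#*-}
                while [ "$i" -le "$end" ]; do
                    out="$out $i"
                    i=$((i + 1))
                done
                ;;
            *)
                out="$out $part"
                ;;
        esac
    done
    echo "${out# }"
}
```

With this, CPUS_DEFECTIVE="$(core_siblings 2)" would yield "2 26" on the machine described here. Read the list while the CPU is still online, since the topology entries may no longer be available once it has been taken offline.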
A check using lscpu -e reveals the core is now offline:
Code:
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE    MAXMHZ    MINMHZ       MHZ
...
  2    -      -    - -                 no         -         -         -
...
 26    -      -    - -                 no         -         -         -
...
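The same check can be scripted, for example to log a warning if the core ever comes back online. This is a minimal sketch that reads the same sysfs online files the boot script writes to; the function name all_offline and the CPU_SYSFS override (for dry runs against a fake directory) are assumptions of this example.

```shell
#!/bin/sh
# Succeed only if every listed logical CPU reports 0 in its sysfs
# online file. CPU_SYSFS may point at an alternative directory for
# dry runs; by default the real sysfs path is used.
all_offline() {
    cpu_dir=${CPU_SYSFS:-/sys/devices/system/cpu}
    for c in "$@"; do
        state=$(cat "$cpu_dir/cpu$c/online" 2>/dev/null) || {
            echo "cannot read state of CPU $c" >&2
            return 1
        }
        if [ "$state" != "0" ]; then
            echo "CPU $c is still online" >&2
            return 1
        fi
    done
    return 0
}
```

Called as all_offline 2 26, it returns success only when both logical CPUs of the defective core are offline.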
To verify stability, run stress-ng with as much memory as possible, or compile a Linux kernel in a loop overnight in tmpfs (i.e. in RAM).
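One way to run such an overnight loop is a small wrapper that repeats a command and stops at the first failure. This is only a sketch: the function name repeat_until_fail is made up, and the stress-ng options, pass counts and the /dev/shm/linux source path in the comments are example values to adapt to your machine.

```shell
#!/bin/sh
# Run a command repeatedly and stop at the first failure.
# Usage: repeat_until_fail <max_passes> <command> [args...]
repeat_until_fail() {
    max=$1
    shift
    pass=0
    while [ "$pass" -lt "$max" ]; do
        "$@" || {
            echo "failed after $pass clean passes" >&2
            return 1
        }
        pass=$((pass + 1))
    done
    echo "$pass passes completed"
}

# Example invocations (not run here; sizes and paths are assumptions):
#   repeat_until_fail 50 stress-ng --vm "$(nproc)" --vm-bytes 80% --verify --timeout 10m
#   cd /dev/shm/linux && repeat_until_fail 50 sh -c 'make clean && make -j"$(nproc)"'
```

If the machine survives the night, also look through the kernel log and /sys/fs/pstore/ for new "Hardware Error" lines before trusting the remaining cores.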
Author
Stephan