Jasper Lake Proxmox (KVM/QEMU) VM Guest Stability

AdriftAtlas · Jan 21, 2023

Problem Description:

Many people using Proxmox (and other KVM/QEMU-based hypervisors) on Jasper Lake platforms (N5105, N6005) are experiencing kernel panics and/or hangs of their guest VMs. Both Linux (OpenWRT, Ubuntu) and FreeBSD (pfSense, OPNsense) guest VMs are affected. The host itself remains up and does not experience issues. LXC containers running on the host are not affected either. This happens on many mini PCs, both official Intel NUCs and AliExpress units.

The issue seems to be related to CPU power management, as it tends to occur during idle. Disabling C-states in the host BIOS and/or via kernel flags, either completely or partially, seems to reduce how often it occurs. Switching the CPU idle mode from ACPI/MWAIT to HLT in the guest VMs seems to help too. Upgrading the host kernel from 5.15 to 5.19, 6.0, 6.1, or 6.2 seems to reduce incidence. Ultimately, though, the guests will still freeze or panic, possibly after a few weeks instead of a few days.

Working Fix (Updated 05/02/2023):

Option 1 (Load updated microcode at each boot):

Update CPU microcode to latest available in Debian non-free repo on Proxmox host:

Microcode - Debian Wiki

wiki.debian.org

Debian -- Details of source package intel-microcode in bookworm
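
A minimal sketch of Option 1, assuming a Proxmox VE 7.x host on Debian 11 (bullseye) whose Debian repositories live in /etc/apt/sources.list. The helper names `has_non_free` and `install_microcode` are made up for illustration and are not from the thread. On Debian 12 (bookworm) the package sits in the `non-free-firmware` component instead, so the check would need adjusting there.

```shell
#!/bin/sh
# Sketch only: run as root on the Proxmox host.

# Succeeds if the given APT sources file has an active "deb" line
# that enables the non-free component (commented-out lines do not count).
has_non_free() {
    grep -Eq '^deb[[:space:]].*[[:space:]]non-free([[:space:]]|$)' "$1"
}

# Installs the intel-microcode package from the Debian repo.
# A reboot is needed afterwards so the new microcode is loaded at early boot.
install_microcode() {
    sources="${1:-/etc/apt/sources.list}"
    if ! has_non_free "$sources"; then
        echo "non-free is not enabled in $sources; add it to the Debian 'deb' lines first" >&2
        return 1
    fi
    apt update && apt install -y intel-microcode
}

# Usage (not run automatically):
#   install_microcode && reboot
```

The function refuses to install when non-free is missing rather than editing the sources file itself, since Proxmox hosts often carry hand-maintained repo entries.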

Option 2 (Update BIOS if motherboard is Changwang N5105 v3, v4, v5):

Step 1: Download BIOS iso from Changwang's Website, ensure that you have a compatible Changwang motherboard:

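Changwang N5105-V3-V4-V5 microcode update, released 2023-04-18 ("chicken blood" version)

BIOS update contents: 1. Updated to the latest CPU microcode. 2. Fixed the crash/reboot problem when running virtual machines. 3. Still the full-power version with power limits unlocked (please make sure your cooling is adequate).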

www.changwang.com

Step 2: Use Rufus to write the ISO to a bootable USB stick:

Rufus - The Official Website (Download, New Releases)

Rufus is a small application that creates bootable USB drives, which can then be used to install or run Microsoft Windows, Linux or DOS. In just a few minutes, and with very few clicks, Rufus can help you run a new Operating System on your computer...

rufus.ie

Step 3: Boot from the USB stick (hit F11 at the AMI splash screen) and let it update the BIOS automatically.

Step 4: The update resets the BIOS settings, so reconfigure the BIOS as required by hitting Delete at the AMI splash screen.

Step 5: Verify that the BIOS update has loaded the new microcode:

Code:

grep 'stepping\|model\|microcode' /proc/cpuinfo

model           : 156
model name      : Intel(R) Celeron(R) N5105 @ 2.00GHz
stepping        : 0
microcode       : 0x24000024
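
The same check can be scripted. This is a rough sketch, not something from the original post: the function names are hypothetical, and the 0x24000024 threshold is simply the revision reported as stable in this thread for the N5105.

```shell
#!/bin/sh
# Prints the first microcode revision found in a cpuinfo-style file
# (defaults to /proc/cpuinfo).
microcode_rev() {
    awk -F': *' '/^microcode/ { print $2; exit }' "${1:-/proc/cpuinfo}"
}

# Succeeds if revision $1 is at least the minimum $2.
# Both are hex strings such as 0x24000024; shell arithmetic parses the 0x prefix.
microcode_at_least() {
    [ "$(( $1 ))" -ge "$(( $2 ))" ]
}

# Usage (not run automatically):
#   rev=$(microcode_rev)
#   microcode_at_least "$rev" 0x24000024 && echo "microcode $rev is new enough"
```

Comparing numerically rather than as strings means a later revision also passes the check.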

Old Potential Fixes:

Updating CPU microcode to latest available on Proxmox host:

Microcode - Debian Wiki

wiki.debian.org

Installing Opt-In Kernels on Proxmox:

Opt-in Linux 5.19 Kernel for Proxmox VE 7.x available

We recently uploaded a 5.19 kernel into our repositories. The 5.15 kernel will stay the default on the Proxmox VE 7.x series, 5.19 is an option. 5.19 may be useful for some (especially newer) setups, for example if there is improved hardware support that has not yet been backported to 5.15. How...

forum.proxmox.com

Opt-in Linux 6.1 Kernel for Proxmox VE 7.x available

We recently uploaded a 6.1 kernel into our repositories. The 5.15 kernel will stay the default on the Proxmox VE 7.x series, 6.1 is an option that replaces the previous 5.19 based opt-in kernel. The 6.1 based kernel may be useful for some (especially newer) setups, for example if there is...

forum.proxmox.com

Disabling ACPI/MWAIT idle in pfSense guest VM (FreeBSD):

sysctl machdep.idle_mwait=0

sysctl machdep.idle=hlt
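
Note that `sysctl` changes made this way only last until the guest reboots. A sketch of checking and persisting them, assuming a stock FreeBSD guest where /etc/sysctl.conf is read at boot. pfSense and OPNsense manage tunables through their web interface, so on those it is probably safer to add the two values there rather than editing the file by hand.

```shell
# Show which idle methods this kernel offers and which one is in use.
sysctl machdep.idle_available machdep.idle

# Persist the settings across reboots (assumption: stock FreeBSD,
# not a firewall distribution that regenerates its configuration).
cat >> /etc/sysctl.conf <<'EOF'
machdep.idle_mwait=0
machdep.idle=hlt
EOF
```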

The above can also be done on Linux based VM guests:
https://docs.kernel.org/admin-guide...el-command-line-options-and-module-parameters
According to this, setting idle=halt or intel_idle.max_cstate=0 as a kernel parameter will cause intel_idle initialization to fail.
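
For a Debian or Ubuntu guest booted by GRUB, the parameter could be added roughly like this. This is a sketch under those assumptions (OpenWRT and other distributions configure their boot command line differently), and `add_kernel_param` is a made-up helper name. It relies on GNU `sed -i`.

```shell
#!/bin/sh
# Appends a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT in a grub
# defaults file, unless that line already contains it.
add_kernel_param() {
    param="$1"
    file="${2:-/etc/default/grub}"
    # Already present as a whole word on the cmdline line: nothing to do.
    if grep -E '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qwF -- "$param"; then
        return 0
    fi
    # Insert the parameter just before the closing quote.
    sed -i "s|^GRUB_CMDLINE_LINUX_DEFAULT=\"\(.*\)\"|GRUB_CMDLINE_LINUX_DEFAULT=\"\1 ${param}\"|" "$file"
}

# Usage (as root in the guest, not run automatically):
#   add_kernel_param idle=halt
#   update-grub
#   reboot
```

After the reboot, `cat /proc/cmdline` should show the parameter, and `cat /sys/devices/system/cpu/cpuidle/current_driver` should no longer report `intel_idle`.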

Disabling C-States or Enhanced C-States in BIOS.

Using kvm64 as guest CPU instead of host and limiting CPU flags:
2 (1 sockets, 2 cores) [kvm64,flags=-pcid;-spec-ctrl;-ssbd;-ibpb;-virt-ssbd;-amd-ssbd;-amd-no-ssb;+aes]
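
The line above is how the Proxmox GUI displays the setting. From the host shell the equivalent would be a `qm set` call along these lines. This is a sketch: the VM ID 100 and the function name are hypothetical, and the function only prints the command so it can be reviewed before running it.

```shell
#!/bin/sh
# Prints the qm command that applies the CPU settings shown above.
# Nothing is changed until the printed command is actually run.
qm_cpu_command() {
    vmid="$1"   # hypothetical VM ID; substitute your guest's ID
    flags='-pcid;-spec-ctrl;-ssbd;-ibpb;-virt-ssbd;-amd-ssbd;-amd-no-ssb;+aes'
    printf "qm set %s --sockets 1 --cores 2 --cpu '%s'\n" "$vmid" "kvm64,flags=$flags"
}

# Usage (not run automatically):
#   qm_cpu_command 100
```

The single quotes around the `--cpu` value matter, because the shell would otherwise treat each `;` in the flag list as a command separator.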

Related Threads:

VM freezes irregularly

Hi everyone, I have rewritten the text based on the troubleshooting I have tried. I am at my wit's end here: Some weeks ago, I bought a pfsense box on AliExpress (4-core N5105, 8GB RAM and 250GB NVMe) and installed Proxmox on it. On the box I run two VMs: pfSense - runs excellent and no...

forum.proxmox.com

Recurring Kernel Panics - Fatal trap 12: page fault while in kernel mode

forum.opnsense.org

pfSense kernel panic

I have been installing pfSense on Proxmox for a week and almost every day I register a crash but I have no idea what caused it. Among the logs I read "panic:...

forum.netgate.com

Example pfSense Kernel Panic:

Code:

kernel trap 12 with interrupts disabled


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address    = 0x1008e
fault code        = supervisor write data, page not present
instruction pointer    = 0x20:0xffffffff80da2d71
stack pointer            = 0x28:0xfffffe0025782b00
frame pointer            = 0x28:0xfffffe0025782b60
code segment        = base 0x0, limit 0xfffff, type 0x1b
            = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags    = resume, IOPL = 0
current process        = 11 (idle: cpu0)
trap number        = 12
panic: page fault
cpuid = 0
time = 1672654637
KDB: enter: panic

db:0:kdb.enter.default>  bt

Tracing pid 11 tid 100003 td 0xfffff8000520d000
kdb_enter() at kdb_enter+0x37/frame 0xfffffe00257828c0
vpanic() at vpanic+0x194/frame 0xfffffe0025782910
panic() at panic+0x43/frame 0xfffffe0025782970
trap_fatal() at trap_fatal+0x38f/frame 0xfffffe00257829d0
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe0025782a30
calltrap() at calltrap+0x8/frame 0xfffffe0025782a30
--- trap 0xc, rip = 0xffffffff80da2d71, rsp = 0xfffffe0025782b00, rbp = 0xfffffe0025782b60 ---
callout_process() at callout_process+0x1b1/frame 0xfffffe0025782b60
handleevents() at handleevents+0x188/frame 0xfffffe0025782ba0
cpu_activeclock() at cpu_activeclock+0x70/frame 0xfffffe0025782bd0
cpu_idle() at cpu_idle+0xa8/frame 0xfffffe0025782bf0
sched_idletd() at sched_idletd+0x326/frame 0xfffffe0025782cb0
fork_exit() at fork_exit+0x7e/frame 0xfffffe0025782cf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0025782cf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

db:0:kdb.enter.default>  alltrace

Tracing command sleep pid 35878 tid 100632 td 0xfffff80057237740
sched_switch() at sched_switch+0x606/frame 0xfffffe003671b9c0
mi_switch() at mi_switch+0xdb/frame 0xfffffe003671b9f0
sleepq_catch_signals() at sleepq_catch_signals+0x3f3/frame 0xfffffe003671ba40
sleepq_timedwait_sig() at sleepq_timedwait_sig+0x14/frame 0xfffffe003671ba80
_sleep() at _sleep+0x1c6/frame 0xfffffe003671bb00
kern_clock_nanosleep() at kern_clock_nanosleep+0x1c1/frame 0xfffffe003671bb80
sys_nanosleep() at sys_nanosleep+0x3b/frame 0xfffffe003671bbc0
amd64_syscall() at amd64_syscall+0x387/frame 0xfffffe003671bcf0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe003671bcf0
--- syscall (240, FreeBSD ELF64, sys_nanosleep), rip = 0x80038c9fa, rsp = 0x7fffffffec18, rbp = 0x7fffffffec60 ---

Tracing command sh pid 15762 tid 100600 td 0xfffff80016b8e000
sched_switch() at sched_switch+0x606/frame 0xfffffe00366cb970
mi_switch() at mi_switch+0xdb/frame 0xfffffe00366cb9a0
sleepq_catch_signals() at sleepq_catch_signals+0x3f3/frame 0xfffffe00366cb9f0
sleepq_wait_sig() at sleepq_wait_sig+0xf/frame 0xfffffe00366cba20
_sleep() at _sleep+0x1f1/frame 0xfffffe00366cbaa0
pipe_read() at pipe_read+0x3fe/frame 0xfffffe00366cbb10
dofileread() at dofileread+0x95/frame 0xfffffe00366cbb50
sys_read() at sys_read+0xc0/frame 0xfffffe00366cbbc0
amd64_syscall() at amd64_syscall+0x387/frame 0xfffffe00366cbcf0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00366cbcf0
--- syscall (3, FreeBSD ELF64, sys_read), rip = 0x80044f03a, rsp = 0x7fffffffe3d8, rbp = 0x7fffffffe900 ---

Tracing command sh pid 15703 tid 100633 td 0xfffff80057237000
sched_switch() at sched_switch+0x606/frame 0xfffffe0036720800
mi_switch() at mi_switch+0xdb/frame 0xfffffe0036720830
sleepq_catch_signals() at sleepq_catch_signals+0x3f3/frame 0xfffffe0036720880
sleepq_wait_sig() at sleepq_wait_sig+0xf/frame 0xfffffe00367208b0
_sleep() at _sleep+0x1f1/frame 0xfffffe0036720930
kern_wait6() at kern_wait6+0x59e/frame 0xfffffe00367209c0
sys_wait4() at sys_wait4+0x7d/frame 0xfffffe0036720bc0
amd64_sy

Stephan · Jan 21, 2023

Machine gets stabler with power save disabled? Ten bucks this is a power regulation issue with the board. Insufficient PSU, load changes causing voltage drops leading to such faults. Manufacturer will fix it and tell no-one and you will never figure out what was wrong. ;-) I still maintain any cheap and relatively low-power ECC-capable Fujitsu C236/C246 board with a regular ATX/SFX PSU is a better deal. Stable.

AdriftAtlas · Jan 21, 2023

Stephan said:
Machine gets stabler with power save disabled? Ten bucks this is a power regulation issue with the board. Insufficient PSU, load changes causing voltage drops leading to such faults. Manufacturer will fix it and tell no-one and you will never figure out what was wrong. ;-) I still maintain any cheap and relatively low-power ECC-capable Fujitsu C236/C246 board with a regular ATX/SFX PSU is a better deal. Stable.

If it was a true hardware issue then it would bring down the host too but it does not. In this case the host doesn't even notice a problem. This also happens on boards from multiple vendors including Intel. I guess it's possible the CPU architecture itself is flawed.

It is more likely an issue of the microcode or Linux kernel in KVM/QEMU. Some people have reported stable VMs under XCP-ng 6.3 and ESXi 8.

edinatl · Mar 1, 2023

Thanks for your post, I have been having the exact issues described and tried a few of the suggestions, particularly the microcode thing, but still had the problems. I went from Proxmox to ESXi in hopes of things being stable with opnsense, but after about 30 or so hours of uptime I again experienced a VM crash while the hypervisor remained stable. Honestly I think I should do a memory test but I haven't had a chance yet. The failures have a sort of consistency despite the random times it crashes. I have switched to baremetal as a last resort and will try to update this thread if things change. In case anyone is interested, here is the hardware info report that details the hardware we're talking about here: HW probe of Techvision TVI7309X B0 Desktop Computer (TVI7309X) #5b6dd24e9a

Also, I was not able to get xcp-ng to install after waiting at a black screen for a while, maybe I needed to wait longer but I couldn't be that patient (I'm talking like 10 minutes on a blank screen).

fenio · Mar 8, 2023

I've been on microcode 24 for 11 days.
I'm on TrueNAS Scale with kernel 5.15.79.
Before the microcode update, I think the record for my VMs before they crashed was 7 days. Usually it was closer to 3-5 days, and often just a day or two.
After the update it's 11 days, so it looks promising.
The changelog for the microcode update says it's just security fixes, but nowadays that can mean anything, such as changes to branch prediction, which is heavily used in virtualization.
So fingers crossed.

fenio · Mar 27, 2023

Seems it was indeed an issue with the microcode. Just another report from me:

Code:

root@master:~# uptime
 14:48:25 up 30 days,  2:14,  1 user,  load average: 0.26, 0.45, 0.56

kliguin · Mar 27, 2023

fenio said:
Seems it was indeed an issue with the microcode. Just another report from me:

Code:

root@master:~# uptime
 14:48:25 up 30 days,  2:14,  1 user,  load average: 0.26, 0.45, 0.56

Is this from the host or the "crashing" VM?

fenio · Mar 27, 2023

kliguin said:
Is this from the host or the "crashing" VM?

I never had issues with the host. Only the VMs were completely unstable for me, freezing irregularly. That uptime is from a VM.
Before the microcode update the longest stable uptime was ~7 days, usually between 3-5 days, and sometimes it froze after just a few hours.
Since I moved to microcode "24" it has been super stable for 30 days.

Stephan · Mar 27, 2023

Just another anecdote from me: Not sure if it is connected with the retbleed fixes in the Linux kernel, but for months now I have been seeing more and more regressions in "lts" type kernels: versions 5.4.xxx, 5.15.xx, 6.1.xx. Kernel oopses or panics when starting VMs, unexplained crashes within VMs, that kind of thing. It forced me to go all the way up to 6.2, which is stable in this regard. I suspect regressions are introduced into such kernels by way of "backports", which nobody later figures out. The latest regression was moving public kernel symbols out of sight, so from 6.2.8 up ZFS wouldn't compile anymore. If you see "stable" or "lts", doubt it. Try to stay close to Torvalds' tree.

kliguin · Mar 27, 2023

fenio said:
I never had issues with the host. Only the VMs were completely unstable for me, freezing irregularly. That uptime is from a VM.
Before the microcode update the longest stable uptime was ~7 days, usually between 3-5 days, and sometimes it froze after just a few hours.
Since I moved to microcode "24" it has been super stable for 30 days.

Can you share the download link/version number and how you installed the microcode?

kliguin · Mar 30, 2023

Updating to microcode 24 seems to fix the crashing VMs for a lot of people.

For those who can't wait for the Debian repo to be updated and want to test the newest microcode, I've re-packaged the older version with the updated microcode data files (20230214). Install this over the top of the existing one and it should update for you. The version is intentionally kept old so that when the newer Debian package comes out it will take precedence over this hack job.

AdriftAtlas · Apr 2, 2023

My pfSense VM has been running for 25 days now with 0x24000024 microcode on the Proxmox host.

I would avoid installing packages from unknown sources. It can be updated from official sources:

Step 1: Update CPU microcode to latest available in Debian stable repo on Proxmox host:

Microcode - Debian Wiki

wiki.debian.org

Step 2: As the stable Debian repo currently has 0x24000023 you need to manually update the microcode to 0x24000024:

Code:

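# Work from /root so the paths used below match
cd /root
# Download and unpack Intel's official microcode data files into /root/MCU
wget https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files/archive/main.zip
unzip main.zip -d MCU
# Copy the microcode files over the ones installed by the intel-microcode package
cp -r /root/MCU/Intel-Linux-Processor-Microcode-Data-Files-main/intel-ucode/. /lib/firmware/intel-ucode/
# Rebuild the initramfs for all installed kernels so the new microcode loads at boot
update-initramfs -u -k all
reboot
# After the reboot, confirm the loaded microcode revision
dmesg -T | grep microcode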
The new microcode package should ship in the non-free stable repo with Debian 11.7 on 04/29/2023 making step two unnecessary.

freph91 · Apr 14, 2023

Been keeping an eye on these threads between this one and the one over on Proxmox forums. Thanks for the info, @AdriftAtlas . Unfortunately I ran into a crash after 6 days on 6.2.6-1-pve. I was trialing it for a bit and it was quite stable on 6.2.2, but they introduced something in 6.2.6 that's causing crashes again. Back to 6.1.14 in the meantime which was rock solid for me.

AdriftAtlas · Apr 14, 2023

freph91 said:
Been keeping an eye on these threads between this one and the one over on Proxmox forums. Thanks for the info, @AdriftAtlas . Unfortunately I ran into a crash after 6 days on 6.2.6-1-pve. I was trialing it for a bit and it was quite stable on 6.2.2, but they introduced something in 6.2.6 that's causing crashes again. Back to 6.1.14 in the meantime which was rock solid for me.

What is crashing for you? Is it the host or the VM? I don't believe the kernel was ever the issue. The issue that many of us were having is fixed by a microcode update.

I have been running kernel 6.2.9 for eight days now and it seems OK.

freph91 · Apr 15, 2023

AdriftAtlas said:
What is crashing for you? Is it the host or the VM? I don't believe the kernel was ever the issue. The issue that many of us were having is fixed by a microcode update.

I have been running kernel 6.2.9 for eight days now and it seems OK.

The pfSense VM kernel panicked (which spikes the CPU to 100% and makes the system unresponsive) after 6d20h. Microcode (0x24000024) upgrade with 6.1.14 kernel had many days of uptime (23 days before I upgraded to 6.2.6 kernel). No RAM/CPU creep indicating some other underlying issue, and only that VM was affected. 6.2.6 might have some issue that's fixed in 6.2.9. Looking forward to an update from you indicating stability or otherwise before I give it another go, but in the meantime I'm quite content with what I know gave me the longest uptime so far.

System specs:
proxmox-ve: 7.4-1 (running kernel: 6.1.14-1-pve)
N6005 w/ i226 NICs passed through to pfSense VM (22.05)
/proc/cmdline: quiet intel_iommu=on iommu=pt intel_pstate=disable intel_idle.max_cstate=1

AdriftAtlas · Apr 23, 2023

Updated original post with BIOS bundled microcode fix for Changwang boards.

freph91 · Apr 24, 2023

"Changwang N5105-V3-V4-V5 microcode update released on 2023-04-18 (chicken blood version)"

Gotta love Google translate. Hopefully they'll release one for the N6005 boards soon.

AdriftAtlas · Apr 24, 2023

freph91 said:
"Changwang N5105-V3-V4-V5 microcode update released on 2023-04-18 (chicken blood version)"

Gotta love Google translate. Hopefully they'll release one for the N6005 boards soon.

Chicken-blood therapy - Wikipedia

en.wikipedia.org

The "chicken blood version" has power limits unlocked. A bit perilous with questionable benefits. Hence the name...

DomFel · May 15, 2023

To people with N6005 boards: DO NOT flash the N5105 BIOS update. CWWK tech support just confirmed the BIOS is NOT universal.

BarTouZ · Jun 13, 2023

Hello,

I have just updated my N5105-based NUC; the CPU is recognized correctly and patched:

On the other hand, I don't know if you ran into the same thing, but I lost my SATA SSD. It is no longer recognized:

Before

After

Have you encountered this problem and, if so, how did you solve it?

Thanks for your help
