Jasper Lake Proxmox (KVM/QEMU) VM Guest Stability


AdriftAtlas

New Member
Jan 21, 2023
Problem Description:

Many people running Proxmox (and other KVM/QEMU-based hypervisors) on Jasper Lake platforms (N5105, N6005) are experiencing kernel panics and/or hangs in their guest VMs. Both Linux (OpenWrt, Ubuntu) and FreeBSD (pfSense, OPNsense) guests are affected. The host itself stays up and shows no issues, and LXC containers running on the host are not affected either. This happens on many mini PCs, including official Intel NUCs and AliExpress units.

The issue seems to be related to CPU power management, as it tends to occur during idle. Disabling C-states, either completely or partially, in the host BIOS and/or via kernel flags seems to reduce how often it occurs. Switching the CPU idle mode from ACPI/MWAIT to Halt in the guest VMs seems to help too, as does upgrading the host kernel from 5.15 to 5.19, 6.0, 6.1, or 6.2. Ultimately, though, the guests will still freeze or panic, possibly after a few weeks instead of a few days.

Working Fix (Updated 05/02/2023):

Option 1 (Load updated microcode at each boot):


Update CPU microcode to latest available in Debian non-free repo on Proxmox host:
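As a rough sketch of Option 1 (not the exact commands from the thread): on Debian 11, which Proxmox 7 is based on, the `intel-microcode` package lives in the `non-free` component, so that component has to be enabled on the host before the package can be installed. The helper name `nonfree_enabled` and the sources file path in the usage note are assumptions for illustration.

```shell
# Succeed if any active "deb" line in the given APT sources file(s)
# enables the non-free component (Debian 11 / bullseye layout assumed).
nonfree_enabled() {
    grep -Eq '^deb[[:space:]].*[[:space:]]non-free([[:space:]]|$)' "$@"
}
```

For example: `nonfree_enabled /etc/apt/sources.list && apt update && apt install intel-microcode`, followed by a reboot so the new microcode is loaded early in boot.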



Option 2 (Update BIOS if motherboard is Changwang N5105 v3, v4, v5):

Step 1: Download the BIOS ISO from Changwang's website. Make sure that you have a compatible Changwang motherboard:


Step 2: Use Rufus to write the ISO to a bootable USB stick:


Step 3: Boot from the USB stick (hit F11 at the AMI splash screen) and let it update the BIOS automatically.

Step 4: The BIOS settings are reset by the update, so reconfigure the BIOS as required by hitting Delete at the AMI splash screen.

Step 5: Verify that the BIOS update has also updated the microcode:

Code:
grep 'stepping\|model\|microcode' /proc/cpuinfo

model           : 156
model name      : Intel(R) Celeron(R) N5105 @ 2.00GHz
stepping        : 0
microcode       : 0x24000024
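To make the Step 5 check scriptable, the revision can be compared numerically against the fixed one. This is only a minimal sketch: the helper names are hypothetical, and the 0x24000024 threshold is simply the revision reported as stable in this thread.

```shell
# Minimum revision taken from the thread; adjust if a newer one is required.
MIN_REV=0x24000024

# Print the microcode revision of the first CPU in a cpuinfo-style file.
current_microcode() {
    awk -F': *' '/^microcode/ { print $2; exit }' "${1:-/proc/cpuinfo}"
}

# Succeed if revision $1 is at least $2 (defaults to MIN_REV).
microcode_ok() {
    [ "$(( $1 ))" -ge "$(( ${2:-$MIN_REV} ))" ]
}
```

On the host this could be used as `microcode_ok "$(current_microcode)" && echo "microcode is new enough"`.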

Old Potential Fixes:

Updating CPU microcode to latest available on Proxmox host:

Installing Opt-In Kernels on Proxmox:

Disabling ACPI/MWAIT idle in pfSense guest VM (FreeBSD):
sysctl machdep.idle_mwait=0
sysctl machdep.idle=hlt
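
Before forcing `hlt`, it may be worth confirming that the guest kernel actually offers it. A small sketch, assuming the FreeBSD `machdep.idle_available` sysctl is present in the guest; `idle_supported` is a hypothetical helper name. Note that sysctl values set at runtime do not survive a reboot, so they also need to be made persistent (for example via `/etc/sysctl.conf` on stock FreeBSD).

```shell
# Succeed if idle method $1 appears in list $2.
# The list may be separated by spaces, commas, or both.
idle_supported() {
    list=$(printf '%s' "$2" | tr ',' ' ')
    case " $list " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on the FreeBSD guest (as root):
#   idle_supported hlt "$(sysctl -n machdep.idle_available)" \
#       && sysctl machdep.idle_mwait=0 machdep.idle=hlt
```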


The same can be done on Linux-based VM guests:
https://docs.kernel.org/admin-guide...el-command-line-options-and-module-parameters
According to this, setting idle=halt or intel_idle.max_cstate=0 as a kernel parameter will cause intel_idle initialization to fail, so the guest does not use the intel_idle driver.
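
To check whether a Linux guest was actually booted with one of these parameters, its kernel command line can be inspected. A minimal sketch: `has_kernel_param` is a hypothetical helper, and on Debian or Ubuntu guests the parameter would typically be added through `GRUB_CMDLINE_LINUX_DEFAULT` in `/etc/default/grub` followed by `update-grub` (other distributions, OpenWrt included, differ).

```shell
# Succeed if kernel parameter $1 is present as a whole word in the
# command line $2 (defaults to the running kernel's /proc/cmdline).
has_kernel_param() {
    cmdline=${2:-$(cat /proc/cmdline)}
    case " $cmdline " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage inside the guest:
#   has_kernel_param idle=halt && echo "idle=halt is active"
```

Where the cpuidle sysfs interface is available, the driver in use can also be read from `/sys/devices/system/cpu/cpuidle/current_driver`.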

Disabling C-States or Enhanced C-States in BIOS.

Using kvm64 as guest CPU instead of host and limiting CPU flags:
2 (1 sockets, 2 cores) [kvm64,flags=-pcid;-spec-ctrl;-ssbd;-ibpb;-virt-ssbd;-amd-ssbd;-amd-no-ssb;+aes]
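
The line above is how the Proxmox GUI displays the setting. As a sketch, the same value could be assembled and applied from the host shell with `qm set`; `build_cpu_opt` is a hypothetical helper, and VMID 100 in the usage comment is a placeholder.

```shell
# Build a Proxmox "cpu" option value: model first, then any number of
# +flag / -flag arguments, joined with ';' as Proxmox expects.
build_cpu_opt() {
    model=$1
    shift
    if [ "$#" -eq 0 ]; then
        printf '%s\n' "$model"
        return 0
    fi
    flags=$(IFS=';'; printf '%s' "$*")
    printf '%s,flags=%s\n' "$model" "$flags"
}

# Usage on the Proxmox host (VMID 100 is a placeholder):
#   qm set 100 --cpu "$(build_cpu_opt kvm64 -pcid -spec-ctrl -ssbd -ibpb \
#       -virt-ssbd -amd-ssbd -amd-no-ssb +aes)"
```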

Related Threads:

Example pfSense Kernel Panic:
Code:
kernel trap 12 with interrupts disabled


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address    = 0x1008e
fault code        = supervisor write data, page not present
instruction pointer    = 0x20:0xffffffff80da2d71
stack pointer            = 0x28:0xfffffe0025782b00
frame pointer            = 0x28:0xfffffe0025782b60
code segment        = base 0x0, limit 0xfffff, type 0x1b
            = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags    = resume, IOPL = 0
current process        = 11 (idle: cpu0)
trap number        = 12
panic: page fault
cpuid = 0
time = 1672654637
KDB: enter: panic

db:0:kdb.enter.default>  bt

Tracing pid 11 tid 100003 td 0xfffff8000520d000
kdb_enter() at kdb_enter+0x37/frame 0xfffffe00257828c0
vpanic() at vpanic+0x194/frame 0xfffffe0025782910
panic() at panic+0x43/frame 0xfffffe0025782970
trap_fatal() at trap_fatal+0x38f/frame 0xfffffe00257829d0
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe0025782a30
calltrap() at calltrap+0x8/frame 0xfffffe0025782a30
--- trap 0xc, rip = 0xffffffff80da2d71, rsp = 0xfffffe0025782b00, rbp = 0xfffffe0025782b60 ---
callout_process() at callout_process+0x1b1/frame 0xfffffe0025782b60
handleevents() at handleevents+0x188/frame 0xfffffe0025782ba0
cpu_activeclock() at cpu_activeclock+0x70/frame 0xfffffe0025782bd0
cpu_idle() at cpu_idle+0xa8/frame 0xfffffe0025782bf0
sched_idletd() at sched_idletd+0x326/frame 0xfffffe0025782cb0
fork_exit() at fork_exit+0x7e/frame 0xfffffe0025782cf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0025782cf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

db:0:kdb.enter.default>  alltrace

Tracing command sleep pid 35878 tid 100632 td 0xfffff80057237740
sched_switch() at sched_switch+0x606/frame 0xfffffe003671b9c0
mi_switch() at mi_switch+0xdb/frame 0xfffffe003671b9f0
sleepq_catch_signals() at sleepq_catch_signals+0x3f3/frame 0xfffffe003671ba40
sleepq_timedwait_sig() at sleepq_timedwait_sig+0x14/frame 0xfffffe003671ba80
_sleep() at _sleep+0x1c6/frame 0xfffffe003671bb00
kern_clock_nanosleep() at kern_clock_nanosleep+0x1c1/frame 0xfffffe003671bb80
sys_nanosleep() at sys_nanosleep+0x3b/frame 0xfffffe003671bbc0
amd64_syscall() at amd64_syscall+0x387/frame 0xfffffe003671bcf0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe003671bcf0
--- syscall (240, FreeBSD ELF64, sys_nanosleep), rip = 0x80038c9fa, rsp = 0x7fffffffec18, rbp = 0x7fffffffec60 ---

Tracing command sh pid 15762 tid 100600 td 0xfffff80016b8e000
sched_switch() at sched_switch+0x606/frame 0xfffffe00366cb970
mi_switch() at mi_switch+0xdb/frame 0xfffffe00366cb9a0
sleepq_catch_signals() at sleepq_catch_signals+0x3f3/frame 0xfffffe00366cb9f0
sleepq_wait_sig() at sleepq_wait_sig+0xf/frame 0xfffffe00366cba20
_sleep() at _sleep+0x1f1/frame 0xfffffe00366cbaa0
pipe_read() at pipe_read+0x3fe/frame 0xfffffe00366cbb10
dofileread() at dofileread+0x95/frame 0xfffffe00366cbb50
sys_read() at sys_read+0xc0/frame 0xfffffe00366cbbc0
amd64_syscall() at amd64_syscall+0x387/frame 0xfffffe00366cbcf0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00366cbcf0
--- syscall (3, FreeBSD ELF64, sys_read), rip = 0x80044f03a, rsp = 0x7fffffffe3d8, rbp = 0x7fffffffe900 ---

Tracing command sh pid 15703 tid 100633 td 0xfffff80057237000
sched_switch() at sched_switch+0x606/frame 0xfffffe0036720800
mi_switch() at mi_switch+0xdb/frame 0xfffffe0036720830
sleepq_catch_signals() at sleepq_catch_signals+0x3f3/frame 0xfffffe0036720880
sleepq_wait_sig() at sleepq_wait_sig+0xf/frame 0xfffffe00367208b0
_sleep() at _sleep+0x1f1/frame 0xfffffe0036720930
kern_wait6() at kern_wait6+0x59e/frame 0xfffffe00367209c0
sys_wait4() at sys_wait4+0x7d/frame 0xfffffe0036720bc0
amd64_sy
 

Stephan

Well-Known Member
Apr 21, 2017
Germany
Machine gets stabler with power save disabled? Ten bucks this is a power regulation issue with the board. Insufficient PSU, load changes causing voltage drops leading to such faults. Manufacturer will fix it and tell no-one and you will never figure out what was wrong. ;-) I still maintain any cheap and relatively low-power ECC-capable Fujitsu C236/C246 board with a regular ATX/SFX PSU is a better deal. Stable.
 

AdriftAtlas

New Member
Jan 21, 2023
Stephan said:
> Machine gets stabler with power save disabled? Ten bucks this is a power regulation issue with the board. Insufficient PSU, load changes causing voltage drops leading to such faults. Manufacturer will fix it and tell no-one and you will never figure out what was wrong. ;-) I still maintain any cheap and relatively low-power ECC-capable Fujitsu C236/C246 board with a regular ATX/SFX PSU is a better deal. Stable.

If it were a true hardware issue, it would bring down the host too, but it does not. In this case the host doesn't even notice a problem. This also happens on boards from multiple vendors, including Intel. I guess it's possible the CPU architecture itself is flawed.

It is more likely an issue of the microcode or Linux kernel in KVM/QEMU. Some people have reported stable VMs under XCP-ng 6.3 and ESXi 8.
 

edinatl

New Member
Mar 1, 2023
Thanks for your post. I have been having the exact issues described and tried a few of the suggestions, particularly the microcode update, but still had the problems. I went from Proxmox to ESXi in the hope that OPNsense would be stable, but after about 30 hours of uptime I again experienced a VM crash while the hypervisor remained stable. Honestly, I think I should run a memory test, but I haven't had a chance yet. The failures have a sort of consistency despite the random times at which they occur. I have switched to bare metal as a last resort and will try to update this thread if things change. In case anyone is interested, here is the hardware info report for the hardware in question: HW probe of Techvision TVI7309X B0 Desktop Computer (TVI7309X) #5b6dd24e9a

Also, I was not able to get XCP-ng to install: after waiting at a black screen for about 10 minutes I gave up. Maybe I needed to wait longer, but I couldn't be that patient.
 

fenio

New Member
Feb 25, 2023
I've been on microcode 24 for 11 days.
I'm on TrueNAS Scale with kernel 5.15.79.
Before the microcode update, I think the record for my VMs before they crashed was 7 days. It was usually closer to 3-5 days and often just a day or two.
After the update it's 11 days, so it looks promising.
The changelog for the microcode update says it's just security fixes, but nowadays that can mean anything, for example changes to branch prediction, which is heavily used in virtualization.
So fingers crossed ;)
 

fenio

New Member
Feb 25, 2023
It seems it was indeed a microcode issue. Another report from me:
Code:
root@master:~# uptime
 14:48:25 up 30 days,  2:14,  1 user,  load average: 0.26, 0.45, 0.56
 

kliguin

Member
Nov 22, 2022
fenio said:
> Seems that indeed was issue with microcode. Just another report from me:
> Code:
> root@master:~# uptime
> 14:48:25 up 30 days,  2:14,  1 user,  load average: 0.26, 0.45, 0.56

Is this the host or the "crashing" VM?
 

fenio

New Member
Feb 25, 2023
kliguin said:
> This seems the host or the "crashing" VM?

I never had issues with the host. Only the VMs were completely unstable for me, freezing irregularly. That uptime is from a VM.
Before the microcode update the longest stable uptime was ~7 days, usually 3-5 days, and sometimes the VM froze after just a few hours.
Since I moved to microcode "24" it has been super stable for 30 days.
 

Stephan

Well-Known Member
Apr 21, 2017
Germany
Just another anecdote from me: I'm not sure whether it is connected with the Retbleed fixes in the Linux kernel, but for months now I have been seeing more and more regressions in "lts"-type kernels: versions 5.4.xxx, 5.15.xx, 6.1.xx. They show up as kernel oopses or panics when starting VMs, unexplained crashes within VMs, etc. This forced me to go all the way up to 6.2, which is stable in this regard. I suspect regressions are introduced into such kernels by way of "backports", which nobody later tracks down. The latest regression was moving public kernel symbols out of sight, so from 6.2.8 up ZFS wouldn't compile anymore. If you see "stable" or "lts", doubt it. Try to stay close to Torvalds' tree.
 

kliguin

Member
Nov 22, 2022
fenio said:
> I never had issues with host. Only VMs were completely unstable for me and freezing irregularly. That uptime is from VM.
> Before microcode update the longest stable uptime was ~7 days, usually between 3-5 days and sometimes it was freezing after just few hours.
> Since I moved to microcode "24" it's super stable for 30 days.

Can you share where you downloaded the microcode, which version it was, and how you installed it?
 

kliguin

Member
Nov 22, 2022
Updating to microcode 24 seems to fix the crashing VMs for a lot of people.


For those who can't wait for the Debian repo to be updated and want to test the newest microcode, I've re-packaged the older version with the updated microcode data files (20230214). Install this over the top of the existing one and it should update for you. The version is intentionally kept old so that when the newer Debian package comes out it will take precedence over this hack job :).
 

AdriftAtlas

New Member
Jan 21, 2023
My pfSense VM has been running for 25 days now with 0x24000024 microcode on the Proxmox host.

I would avoid installing packages from unknown sources. It can be updated from official sources:

Step 1: Update CPU microcode to latest available in Debian stable repo on Proxmox host:

Step 2: As the stable Debian repo currently has 0x24000023 you need to manually update the microcode to 0x24000024:

Code:
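# Work in /root, because the cp below uses an absolute /root/MCU path
cd /root
wget https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files/archive/main.zip
unzip main.zip -d MCU
# Copy the microcode files over the ones installed by the Debian package
cp -r /root/MCU/Intel-Linux-Processor-Microcode-Data-Files-main/intel-ucode/. /lib/firmware/intel-ucode/
# Rebuild the initramfs for all kernels so the new microcode is loaded early
update-initramfs -u -k all
reboot
# After the reboot, confirm the loaded revision:
dmesg -T | grep microcode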
The new microcode package should ship in the non-free stable repo with Debian 11.7 on 04/29/2023 making step two unnecessary.
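
After copying the files, one way to sanity-check the result is to confirm that a microcode file for this CPU exists. Intel names the files in `intel-ucode` as family-model-stepping in hex, so the N5105 output shown earlier in the thread (model 156, stepping 0) maps to `06-9c-00`, assuming CPU family 6 as reported by the `cpu family` field of `/proc/cpuinfo`. Both helper names below are hypothetical.

```shell
# Print the intel-ucode file name for decimal family, model, stepping.
ucode_filename() {
    printf '%02x-%02x-%02x\n' "$1" "$2" "$3"
}

# Derive the file name from a cpuinfo-style file (first CPU only).
ucode_filename_from_cpuinfo() {
    f=${1:-/proc/cpuinfo}
    fam=$(awk -F': *' '/^cpu family/ { print $2; exit }' "$f")
    mod=$(awk -F': *' '/^model[ \t]*:/ { print $2; exit }' "$f")
    step=$(awk -F': *' '/^stepping/ { print $2; exit }' "$f")
    ucode_filename "$fam" "$mod" "$step"
}

# Usage on the host:
#   ls -l "/lib/firmware/intel-ucode/$(ucode_filename_from_cpuinfo)"
```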
 

freph91

New Member
Oct 19, 2018
Been keeping an eye on these threads between this one and the one over on Proxmox forums. Thanks for the info, @AdriftAtlas . Unfortunately I ran into a crash after 6 days on 6.2.6-1-pve. I was trialing it for a bit and it was quite stable on 6.2.2, but they introduced something in 6.2.6 that's causing crashes again. Back to 6.1.14 in the meantime which was rock solid for me.
 

AdriftAtlas

New Member
Jan 21, 2023
freph91 said:
> Been keeping an eye on these threads between this one and the one over on Proxmox forums. Thanks for the info, @AdriftAtlas . Unfortunately I ran into a crash after 6 days on 6.2.6-1-pve. I was trialing it for a bit and it was quite stable on 6.2.2, but they introduced something in 6.2.6 that's causing crashes again. Back to 6.1.14 in the meantime which was rock solid for me.

What is crashing for you? Is it the host or the VM? I don't believe the kernel was ever the issue. The issue that many of us were having is fixed by a microcode update.

I have been running kernel 6.2.9 for eight days now and it seems OK.
 

freph91

New Member
Oct 19, 2018
AdriftAtlas said:
> What is crashing for you? Is it the host or the VM? I don't believe the kernel was ever the issue. The issue that many of us were having is fixed by a microcode update.
>
> I have been running kernel 6.2.9 for eight days now and it seems OK.

The pfSense VM kernel panicked (which spikes the CPU to 100% and makes the system unresponsive) after 6d20h. Microcode (0x24000024) upgrade with 6.1.14 kernel had many days of uptime (23 days before I upgraded to 6.2.6 kernel). No RAM/CPU creep indicating some other underlying issue, and only that VM was affected. 6.2.6 might have some issue that's fixed in 6.2.9. Looking forward to an update from you indicating stability or otherwise before I give it another go, but in the meantime I'm quite content with what I know gave me the longest uptime so far.

System specs:
proxmox-ve: 7.4-1 (running kernel: 6.1.14-1-pve)
N6005 w/ i226 NICs passed through to pfSense VM (22.05)
/proc/cmdline: quiet intel_iommu=on iommu=pt intel_pstate=disable intel_idle.max_cstate=1
 

freph91

New Member
Oct 19, 2018
"Changwang N5105-V3-V4-V5 microcode update released on 2023-04-18 (chicken blood version)"

Gotta love Google translate. Hopefully they'll release one for the N6005 boards soon.
 

DomFel

Member
Sep 5, 2022
To people with the N6005: DO NOT flash the N5105 BIOS update. CWWK tech support just confirmed the BIOS is NOT universal.
 

BarTouZ

New Member
Aug 14, 2022
Hello,

I have just updated my N5105-based NUC. The CPU is correctly recognized and patched:

[Screenshot: 1686724730266.png]

However, I don't know if anyone else has seen this, but I lost my SATA SSD; it is no longer recognized:

Before

[Screenshot: 1686724804621.png]

After

[Screenshot: 1686724821108.png]

Have you encountered this problem and, if so, how did you solve it?

Thanks for your help