Minisforum MS-01 issue with Proxmox - CPU hard lockup


pyb0

New Member
Mar 2, 2024
7
3
3
Hi everyone,

I'm trying to track down CPU hard and soft lockups on an MS-01 running Proxmox 8.1.4 with kernel 6.5.13-1-pve.


Code:
Mar 20 18:00:28 pve3 kernel: watchdog: Watchdog detected hard LOCKUP on cpu 9
Mar 20 18:00:28 pve3 kernel: Modules linked in: dm_snapshot iptable_nat tcp_diag inet_diag nf_conntrack_netlink xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE xfrm_user xfrm_algo xt_addrtype nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 overlay rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace veth cmac nls_utf8 cifs cifs_arc4 rdma_cm iw_cm ib_cm ib_core cifs_md4 fscache netfs ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter sctp ip6_udp_tunnel udp_tunnel nf_tables nvme_fabrics bonding tls softdog nfnetlink_log sunrpc binfmt_misc nfnetlink zfs(PO) spl(O) snd_hda_codec_hdmi vhost_net vhost vhost_iotlb snd_hda_codec_realtek tap snd_hda_codec_generic kvmgt ledtrig_audio mdev intel_rapl_msr intel_rapl_common intel_uncore_frequency snd_sof_pci_intel_tgl intel_uncore_frequency_common snd_sof_intel_hda_common soundwire_intel snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_hdac_hda
Mar 20 18:00:28 pve3 kernel:  snd_hda_ext_core x86_pkg_temp_thermal snd_soc_acpi_intel_match intel_powerclamp snd_soc_acpi coretemp soundwire_generic_allocation soundwire_bus snd_soc_core kvm_intel i915 snd_compress crct10dif_pclmul ac97_bus polyval_clmulni mt7921e snd_pcm_dmaengine polyval_generic mt7921_common snd_hda_intel btusb ghash_clmulni_intel mt76_connac_lib snd_intel_dspcfg btrtl sha256_ssse3 mt76 snd_intel_sdw_acpi sha1_ssse3 snd_hda_codec drm_buddy btbcm aesni_intel mac80211 snd_hda_core ttm btintel crypto_simd snd_hwdep drm_display_helper btmtk cryptd snd_pcm mei_hdcp mei_pxp snd_timer rapl cec cmdlinepart bluetooth cfg80211 spi_nor snd mei_me rc_core intel_cstate joydev ecdh_generic pcspkr wmi_bmof mtd libarc4 soundcore mei cdc_acm ecc input_leds drm_kms_helper acpi_tad acpi_pad mac_hid i2c_algo_bit kvm vfio_pci vfio_pci_core irqbypass vfio_iommu_type1 vfio drm iommufd efi_pstore dmi_sysfs ip_tables x_tables autofs4 hid_logitech_hidpp hid_logitech_dj btrfs hid_generic usbkbd usbmouse blake2b_generic usbhid xor hid
Mar 20 18:00:28 pve3 kernel:  raid6_pq simplefb uas usb_storage dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c nvme xhci_pci xhci_pci_renesas spi_intel_pci i2c_i801 nvme_core video crc32_pclmul thunderbolt i40e xhci_hcd igc spi_intel i2c_smbus nvme_common wmi pinctrl_tigerlake
Mar 20 18:00:28 pve3 kernel: CPU: 9 PID: 952015 Comm: atop Tainted: P     U     O       6.5.13-1-pve #1
Mar 20 18:00:28 pve3 kernel: Hardware name: Micro Computer (HK) Tech Limited Venus Series/AHWSA, BIOS AHWSA.1.17 12/14/2023
Mar 20 18:00:28 pve3 kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x284/0x2d0
Mar 20 18:00:28 pve3 kernel: Code: 12 83 e0 03 83 ea 01 48 c1 e0 05 48 63 d2 48 05 40 3f 03 00 48 03 04 d5 40 3b 65 95 4c 89 20 41 8b 44 24 08 85 c0 75 0b f3 90 <41> 8b 44 24 08 85 c0 74 f5 49 8b 14 24 48 85 d2 74 8b 0f 0d 0a eb
Mar 20 18:00:28 pve3 kernel: RSP: 0018:ffffabfee13b7af0 EFLAGS: 00000046
Mar 20 18:00:28 pve3 kernel: RAX: 0000000000000000 RBX: ffff8e9763f18ab0 RCX: 0000000000280000
Mar 20 18:00:28 pve3 kernel: RDX: 0000000000001637 RSI: 0000000058e058e1 RDI: ffff8e9763f18ab0
Mar 20 18:00:28 pve3 kernel: RBP: ffffabfee13b7b10 R08: fe555405c9f8db7b R09: ffff8e9763f18ab0
Mar 20 18:00:28 pve3 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8e9e4fa73f40
Mar 20 18:00:28 pve3 kernel: R13: 0000000000000000 R14: 0000000000000009 R15: 00000007ee3fb44a
Mar 20 18:00:28 pve3 kernel: FS:  00007ac2d2600740(0000) GS:ffff8e9e4fa40000(0000) knlGS:0000000000000000
Mar 20 18:00:28 pve3 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 20 18:00:28 pve3 kernel: CR2: 000073895346c040 CR3: 000000048977c000 CR4: 0000000000752ee0
Mar 20 18:00:28 pve3 kernel: PKRU: 55555554
Mar 20 18:00:28 pve3 kernel: Call Trace:
Mar 20 18:00:28 pve3 kernel:  <NMI>
Mar 20 18:00:28 pve3 kernel:  ? show_regs+0x6d/0x80
Mar 20 18:00:28 pve3 kernel:  ? watchdog_hardlockup_check+0x10c/0x1e0
Mar 20 18:00:28 pve3 kernel:  ? watchdog_overflow_callback+0x6b/0x80
Mar 20 18:00:28 pve3 kernel:  ? __perf_event_overflow+0x119/0x380
Mar 20 18:00:28 pve3 kernel:  ? perf_event_overflow+0x19/0x30
Mar 20 18:00:28 pve3 kernel:  ? handle_pmi_common+0x175/0x3f0
Mar 20 18:00:28 pve3 kernel:  ? intel_pmu_handle_irq+0x11f/0x480
Mar 20 18:00:28 pve3 kernel:  ? perf_event_nmi_handler+0x2b/0x50
Mar 20 18:00:28 pve3 kernel:  ? nmi_handle+0x5d/0x160
Mar 20 18:00:28 pve3 kernel:  ? default_do_nmi+0x47/0x130
Mar 20 18:00:28 pve3 kernel:  ? exc_nmi+0x1d5/0x2a0
Mar 20 18:00:28 pve3 kernel:  ? end_repeat_nmi+0x16/0x67
Mar 20 18:00:28 pve3 kernel:  ? native_queued_spin_lock_slowpath+0x284/0x2d0
Mar 20 18:00:28 pve3 kernel:  ? native_queued_spin_lock_slowpath+0x284/0x2d0
Mar 20 18:00:28 pve3 kernel:  ? native_queued_spin_lock_slowpath+0x284/0x2d0
Mar 20 18:00:28 pve3 kernel:  </NMI>
Mar 20 18:00:28 pve3 kernel:  <TASK>
Mar 20 18:00:28 pve3 kernel:  _raw_spin_lock_irqsave+0x5c/0x80
Mar 20 18:00:28 pve3 kernel:  task_cputime_adjusted+0x4b/0x100
Mar 20 18:00:28 pve3 kernel:  do_task_stat+0xb19/0xdf0
Mar 20 18:00:28 pve3 kernel:  proc_tid_stat+0x11/0x30
Mar 20 18:00:28 pve3 kernel:  proc_single_show+0x53/0xe0
Mar 20 18:00:28 pve3 kernel:  seq_read_iter+0x132/0x4a0
Mar 20 18:00:28 pve3 kernel:  seq_read+0xcd/0x110
Mar 20 18:00:28 pve3 kernel:  vfs_read+0xb1/0x360
Mar 20 18:00:28 pve3 kernel:  ksys_read+0x73/0x100
Mar 20 18:00:28 pve3 kernel:  __x64_sys_read+0x19/0x30
Mar 20 18:00:28 pve3 kernel:  do_syscall_64+0x58/0x90
Mar 20 18:00:28 pve3 kernel:  ? do_syscall_64+0x67/0x90
Mar 20 18:00:28 pve3 kernel:  ? syscall_exit_to_user_mode+0x37/0x60
Mar 20 18:00:28 pve3 kernel:  ? do_syscall_64+0x67/0x90
Mar 20 18:00:28 pve3 kernel:  ? do_syscall_64+0x67/0x90
Mar 20 18:00:28 pve3 kernel:  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Mar 20 18:00:28 pve3 kernel: RIP: 0033:0x7ac2d26fd19d
Mar 20 18:00:28 pve3 kernel: Code: 31 c0 e9 c6 fe ff ff 50 48 8d 3d 66 54 0a 00 e8 49 ff 01 00 66 0f 1f 84 00 00 00 00 00 80 3d 41 24 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec
Mar 20 18:00:28 pve3 kernel: RSP: 002b:00007ffc566b4e68 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
Mar 20 18:00:28 pve3 kernel: RAX: ffffffffffffffda RBX: 00005a8666ca1c20 RCX: 00007ac2d26fd19d
Mar 20 18:00:28 pve3 kernel: RDX: 0000000000000c00 RSI: 00007ffc566b4f20 RDI: 0000000000000020
Mar 20 18:00:28 pve3 kernel: RBP: 00007ac2d27d45e0 R08: 0000000000000c00 R09: 0000000000000001
Mar 20 18:00:28 pve3 kernel: R10: 0000000000001000 R11: 0000000000000246 R12: 00007ffc566b4f20
Mar 20 18:00:28 pve3 kernel: R13: 0000000000000fff R14: 0000000000000d68 R15: 00007ac2d27d39e0
Mar 20 18:00:28 pve3 kernel:  </TASK>
Mar 20 18:00:28 pve3 kernel: watchdog: BUG: soft lockup - CPU#7 stuck for 26s! [kworker/7:2:2287795]
Mar 20 18:00:28 pve3 kernel: Modules linked in: dm_snapshot iptable_nat tcp_diag inet_diag nf_conntrack_netlink xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE xfrm_user xfrm_algo xt_addrtype nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 overlay rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace veth cmac nls_utf8 cifs cifs_arc4 rdma_cm iw_cm ib_cm ib_core cifs_md4 fscache netfs ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter sctp ip6_udp_tunnel udp_tunnel nf_tables nvme_fabrics bonding tls softdog nfnetlink_log sunrpc binfmt_misc nfnetlink zfs(PO) spl(O) snd_hda_codec_hdmi vhost_net vhost vhost_iotlb snd_hda_codec_realtek tap snd_hda_codec_generic kvmgt ledtrig_audio mdev intel_rapl_msr intel_rapl_common intel_uncore_frequency snd_sof_pci_intel_tgl intel_uncore_frequency_common snd_sof_intel_hda_common soundwire_intel snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_hdac_hda
This happens after 24-36 hours of uptime, with or without the E-cores enabled, and with the latest microcode, 0x411c (2023-08-30).
I'm using the acpi-cpufreq driver with the schedutil governor. KSM is already disabled, and memtest86 shows no RAM errors.
Any thoughts?
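
In case it helps anyone compare notes, here is a minimal Python sketch that counts lockup reports per CPU in a saved kernel log. It only matches the two message formats visible in the log above ("Watchdog detected hard LOCKUP on cpu N" and "BUG: soft lockup - CPU#N stuck for Ns!"). The function name `summarize_lockups` is made up for illustration, and other kernel versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns taken from the two watchdog messages in the log above.
HARD_RE = re.compile(r"Watchdog detected hard LOCKUP on cpu (\d+)")
SOFT_RE = re.compile(r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s!")


def summarize_lockups(lines):
    """Count hard and soft lockup reports per CPU number (hypothetical helper)."""
    hard, soft = Counter(), Counter()
    for line in lines:
        match = HARD_RE.search(line)
        if match:
            hard[int(match.group(1))] += 1
            continue
        match = SOFT_RE.search(line)
        if match:
            soft[int(match.group(1))] += 1
    return hard, soft


# Small demo using the two watchdog lines from the log above.
sample = [
    "Mar 20 18:00:28 pve3 kernel: watchdog: Watchdog detected hard LOCKUP on cpu 9",
    "Mar 20 18:00:28 pve3 kernel: watchdog: BUG: soft lockup - CPU#7 stuck for 26s! [kworker/7:2:2287795]",
]
hard, soft = summarize_lockups(sample)
print("hard lockups per CPU:", dict(hard))
print("soft lockups per CPU:", dict(soft))
```

To run it on a real log, pass any iterable of lines, for example an open file containing saved `journalctl -k` output. If the same CPU number keeps turning up, that may point at a single core rather than the whole package.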
 

pyb0

I've moved back to the intel_pstate driver (instead of acpi) with the "performance" governor, and it looks much better now. I did not see any difference in power usage, though: still around 30 W with 4 NVMe SSDs and 6 containers/VMs running.
Let's see if it's stable enough over several days...
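
To confirm which driver and governor are actually active after a change like this, here is a minimal Python sketch that reads the standard Linux cpufreq sysfs files (`scaling_driver` and `scaling_governor` under `/sys/devices/system/cpu/cpuN/cpufreq/`). The function name `read_cpufreq_settings` is hypothetical, and the `base` parameter is there only so the function can be pointed at a test directory.

```python
from pathlib import Path


def read_cpufreq_settings(base="/sys/devices/system/cpu"):
    """Return {cpu_name: (scaling_driver, scaling_governor)} for CPUs exposing cpufreq."""
    settings = {}
    for cpufreq_dir in sorted(Path(base).glob("cpu[0-9]*/cpufreq")):
        try:
            driver = (cpufreq_dir / "scaling_driver").read_text().strip()
            governor = (cpufreq_dir / "scaling_governor").read_text().strip()
        except OSError:
            # Skip CPUs whose cpufreq files are missing or unreadable.
            continue
        settings[cpufreq_dir.parent.name] = (driver, governor)
    return settings


# Prints one line per CPU; prints nothing on systems without cpufreq sysfs entries.
for cpu, (driver, governor) in read_cpufreq_settings().items():
    print(f"{cpu}: driver={driver} governor={governor}")
```

Every CPU should report the same driver. If some cores still show a different governor than expected, the setting may not have been applied to all of them.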

pyb0

Switching back to the intel_pstate driver improved things a lot, but I still got a CPU error today. Minisforum (MF) has already offered an RMA, but I would rather go for a refund. At this stage the MS-01 still feels like a beta product.