Hi all, I also want to share my experience here (you can find the day-by-day saga in this
thread). Sorry for the long post, but it has been a week full of tests.
First of all, I really love the machine. It's small, super fast, and it's easy to add storage, memory and PCIe cards. I ordered it through Amazon DE (Germany) and it took 1 day to arrive.
The main reason I decided to buy it is to replace a really power-hungry dual EPYC 7551 that served as my main homelab hypervisor for the last 2 years without any issue (apart from the energy bill).
So, I bought the "barebone" 13900H configuration and added the following myself:
- Lexar NQ790 2 TB PCIe 4.0 SSD, M.2 2280 PCIe Gen4x4 NVMe 1.4
- Crucial RAM 96GB Kit (2x48GB) DDR5 5600MHz
- 1/2.5/5/10Gb SFP+ RJ45 Transceiver
I did a fresh install of the latest Proxmox 8.1.4 with kernel 6.5.11-8-pve and the latest microcode installed.
Everything was going well until I started running intensive jobs (e.g. starting a Windows 11 VM, but also CPU-intensive tasks such as transcoding), at which point the system rebooted itself.
The error messages I was getting in the syslog were not consistent; here are a few of them:
Code:
Feb 19 12:31:42 nicoska2 kernel: Memory failure: 0x10e1b37: unhandlable page.
Feb 19 12:32:49 nicoska2 kernel: mce_notify_irq: 1 callbacks suppressed
Feb 19 12:32:49 nicoska2 kernel: mce: [Hardware Error]: Machine check events logged
Feb 19 12:32:51 nicoska2 kernel: mce: [Hardware Error]: Machine check events logged
-- Reboot --
Code:
Feb 19 23:45:58 nicoska2 kernel: vmbr0: port 7(veth106i0) entered disabled state
Feb 19 23:45:58 nicoska2 kernel: vmbr0: port 7(veth106i0) entered disabled state
Feb 19 23:45:58 nicoska2 kernel: veth106i0 (unregistering): left allmulticast mode
Feb 19 23:45:58 nicoska2 kernel: veth106i0 (unregistering): left promiscuous mode
Feb 19 23:45:58 nicoska2 kernel: vmbr0: port 7(veth106i0) entered disabled state
Feb 19 23:45:58 nicoska2 audit[29954]: AVC apparmor="STATUS" operation="profile_remove" profile="/usr/bin/lxc-start" name="lxc-106_</var/lib/lxc>" pid=29954 comm="apparmor_parser"
Feb 19 23:45:58 nicoska2 kernel: audit: type=1400 audit(1708382758.447:74): apparmor="STATUS" operation="profile_remove" profile="/usr/bin/lxc-start" name="lxc-106_</var/lib/lxc>" pid=29954 comm="apparmor_parser"
Feb 19 23:45:58 nicoska2 kernel: EXT4-fs (dm-11): unmounting filesystem fd237a8e-f4aa-49a7-97b0-c7123fb0c218.
Feb 19 23:45:58 nicoska2 systemd[1]: pve-container@106.service: Deactivated successfully.
Feb 19 23:45:58 nicoska2 systemd[1]: Stopped pve-container@106.service - PVE LXC Container: 106.
Feb 19 23:45:58 nicoska2 systemd[1]: Started pve-container@106.service - PVE LXC Container: 106.
Feb 19 23:45:59 nicoska2 kernel: EXT4-fs (dm-11): mounted filesystem fd237a8e-f4aa-49a7-97b0-c7123fb0c218 r/w with ordered data mode. Quota mode: none.
Feb 19 23:45:59 nicoska2 audit[29981]: AVC apparmor="STATUS" operation="profile_load" profile="/usr/bin/lxc-start" name="lxc-106_</var/lib/lxc>" pid=29981 comm="apparmor_parser"
Feb 19 23:45:59 nicoska2 kernel: audit: type=1400 audit(1708382759.415:75): apparmor="STATUS" operation="profile_load" profile="/usr/bin/lxc-start" name="lxc-106_</var/lib/lxc>" pid=29981 comm="apparmor_parser"
Feb 19 23:45:59 nicoska2 kernel: vmbr0: port 7(veth106i0) entered blocking state
Feb 19 23:45:59 nicoska2 kernel: vmbr0: port 7(veth106i0) entered disabled state
Feb 19 23:45:59 nicoska2 kernel: veth106i0: entered allmulticast mode
Feb 19 23:45:59 nicoska2 kernel: veth106i0: entered promiscuous mode
Feb 19 23:45:59 nicoska2 kernel: eth0: renamed from vethiNJGQT
Feb 19 23:45:59 nicoska2 kernel: vmbr0: port 7(veth106i0) entered blocking state
Feb 19 23:45:59 nicoska2 kernel: vmbr0: port 7(veth106i0) entered forwarding state
Feb 19 23:46:01 nicoska2 kernel: nfs: Deprecated parameter 'intr'
Feb 19 23:46:03 nicoska2 pvedaemon[1951]: <root@pam> successful auth for user 'root@pam'
Feb 19 23:46:06 nicoska2 pvestatd[1921]: modified cpu set for lxc/106: 1-10,12-13,15-16,18-19
Feb 19 23:53:01 nicoska2 pvedaemon[1953]: <root@pam> successful auth for user 'root@pam'
-- Reboot --
Code:
Feb 20 09:17:01 nicoska2 CRON[281696]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Feb 20 09:17:01 nicoska2 CRON[281697]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Feb 20 09:17:01 nicoska2 CRON[281696]: pam_unix(cron:session): session closed for user root
Feb 20 09:18:58 nicoska2 kernel: mce: [Hardware Error]: Machine check events logged
Feb 20 09:21:37 nicoska2 pvedaemon[1935]: <root@pam> successful auth for user 'root@pam'
Feb 20 09:25:11 nicoska2 pveproxy[249943]: worker exit
Feb 20 09:25:11 nicoska2 pveproxy[1949]: worker 249943 finished
Feb 20 09:25:11 nicoska2 pveproxy[1949]: starting 1 worker(s)
Feb 20 09:25:11 nicoska2 pveproxy[1949]: worker 287436 started
Feb 20 09:31:54 nicoska2 pveproxy[266452]: worker exit
Feb 20 09:31:54 nicoska2 pveproxy[1949]: worker 266452 finished
Feb 20 09:31:54 nicoska2 pveproxy[1949]: starting 1 worker(s)
Feb 20 09:31:54 nicoska2 pveproxy[1949]: worker 291984 started
Feb 20 09:34:01 nicoska2 kernel: mce: [Hardware Error]: Machine check events logged
Feb 20 09:34:03 nicoska2 kernel: mce: [Hardware Error]: Machine check events logged
Feb 20 09:34:15 nicoska2 kernel: RAS: Soft-offlining pfn: 0x10b3e10
Feb 20 09:34:15 nicoska2 kernel: Memory failure: 0x10b3e10: unhandlable page.
-- Reboot --
Code:
Feb 20 17:44:16 nicoska2 pvedaemon[14727]: start VM 107: UPID:nicoska2:00003987:00007C2C:65D4D6E0:qmstart:107:root@pam:
Feb 20 17:44:16 nicoska2 pvedaemon[1948]: <root@pam> starting task UPID:nicoska2:00003987:00007C2C:65D4D6E0:qmstart:107:root@pam:
Feb 20 17:44:16 nicoska2 systemd[1]: Created slice qemu.slice - Slice /qemu.
Feb 20 17:44:16 nicoska2 systemd[1]: Started 107.scope.
Feb 20 17:44:17 nicoska2 kernel: tap107i0: entered promiscuous mode
Feb 20 17:44:17 nicoska2 kernel: vmbr0: port 17(fwpr107p0) entered blocking state
Feb 20 17:44:17 nicoska2 kernel: vmbr0: port 17(fwpr107p0) entered disabled state
Feb 20 17:44:17 nicoska2 kernel: fwpr107p0: entered allmulticast mode
Feb 20 17:44:17 nicoska2 kernel: fwpr107p0: entered promiscuous mode
Feb 20 17:44:17 nicoska2 kernel: vmbr0: port 17(fwpr107p0) entered blocking state
Feb 20 17:44:17 nicoska2 kernel: vmbr0: port 17(fwpr107p0) entered forwarding state
Feb 20 17:44:17 nicoska2 kernel: fwbr107i0: port 1(fwln107i0) entered blocking state
Feb 20 17:44:17 nicoska2 kernel: fwbr107i0: port 1(fwln107i0) entered disabled state
Feb 20 17:44:17 nicoska2 kernel: fwln107i0: entered allmulticast mode
Feb 20 17:44:17 nicoska2 kernel: fwln107i0: entered promiscuous mode
Feb 20 17:44:17 nicoska2 kernel: fwbr107i0: port 1(fwln107i0) entered blocking state
Feb 20 17:44:17 nicoska2 kernel: fwbr107i0: port 1(fwln107i0) entered forwarding state
Feb 20 17:44:17 nicoska2 kernel: fwbr107i0: port 2(tap107i0) entered blocking state
Feb 20 17:44:17 nicoska2 kernel: fwbr107i0: port 2(tap107i0) entered disabled state
Feb 20 17:44:17 nicoska2 kernel: tap107i0: entered allmulticast mode
Feb 20 17:44:17 nicoska2 kernel: fwbr107i0: port 2(tap107i0) entered blocking state
Feb 20 17:44:17 nicoska2 kernel: fwbr107i0: port 2(tap107i0) entered forwarding state
Feb 20 17:44:17 nicoska2 pvedaemon[1948]: <root@pam> end task UPID:nicoska2:00003987:00007C2C:65D4D6E0:qmstart:107:root@pam: OK
-- Reboot --
View attachment 34837
Code:
Feb 20 22:40:24 nicoska2 pvedaemon[1943]: <root@pam> end task UPID:nicoska2:00003AC3:00009039:65D51C47:qmstart:107:root@pam: OK
Feb 20 22:40:24 nicoska2 kernel: x86/split lock detection: #AC: CPU 2/KVM/15145 took a split_lock trap at address: 0x7eebd050
Feb 20 22:40:24 nicoska2 kernel: x86/split lock detection: #AC: CPU 7/KVM/15150 took a split_lock trap at address: 0x7eebd050
Feb 20 22:40:24 nicoska2 kernel: x86/split lock detection: #AC: CPU 5/KVM/15148 took a split_lock trap at address: 0x7eebd050
Feb 20 22:40:24 nicoska2 kernel: x86/split lock detection: #AC: CPU 1/KVM/15144 took a split_lock trap at address: 0x7eebd050
Feb 20 22:40:24 nicoska2 kernel: x86/split lock detection: #AC: CPU 4/KVM/15147 took a split_lock trap at address: 0x7eebd050
Feb 20 22:40:24 nicoska2 kernel: x86/split lock detection: #AC: CPU 6/KVM/15149 took a split_lock trap at address: 0x7eebd050
Feb 20 22:40:24 nicoska2 kernel: x86/split lock detection: #AC: CPU 3/KVM/15146 took a split_lock trap at address: 0x7eebd050
Feb 20 22:40:24 nicoska2 kernel: x86/split lock detection: #AC: CPU 9/KVM/15152 took a split_lock trap at address: 0x7eebd050
Feb 20 22:40:24 nicoska2 kernel: x86/split lock detection: #AC: CPU 8/KVM/15151 took a split_lock trap at address: 0x7eebd050
Feb 20 22:40:32 nicoska2 pvedaemon[15287]: starting vnc proxy UPID:nicoska2:00003BB7:000093BD:65D51C50:vncproxy:107:root@pam:
Feb 20 22:40:32 nicoska2 pvedaemon[1944]: <root@pam> starting task UPID:nicoska2:00003BB7:000093BD:65D51C50:vncproxy:107:root@pam:
Feb 20 22:40:33 nicoska2 pvedaemon[1945]: VM 107 qmp command failed - VM 107 qmp command 'guest-ping' failed - got timeout
Feb 20 22:40:37 nicoska2 pveproxy[1960]: detected empty handle
Feb 20 22:40:48 nicoska2 kernel: kvm: vcpu 4: requested 19791 ns lapic timer period limited to 200000 ns
Feb 20 22:40:48 nicoska2 kernel: kvm: vcpu 5: requested 19791 ns lapic timer period limited to 200000 ns
Feb 20 22:40:48 nicoska2 kernel: kvm: vcpu 9: requested 19791 ns lapic timer period limited to 200000 ns
Feb 20 22:40:48 nicoska2 kernel: kvm: vcpu 8: requested 19791 ns lapic timer period limited to 200000 ns
Feb 20 22:40:48 nicoska2 kernel: kvm: vcpu 2: requested 19791 ns lapic timer period limited to 200000 ns
Feb 20 22:40:48 nicoska2 kernel: kvm: vcpu 6: requested 19791 ns lapic timer period limited to 200000 ns
Feb 20 22:40:48 nicoska2 kernel: kvm: vcpu 1: requested 19791 ns lapic timer period limited to 200000 ns
Feb 20 22:40:48 nicoska2 kernel: kvm: vcpu 3: requested 19791 ns lapic timer period limited to 200000 ns
Feb 20 22:40:48 nicoska2 kernel: kvm: vcpu 7: requested 19791 ns lapic timer period limited to 200000 ns
-- Reboot --
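The excerpts above mix routine bridge, container and proxy messages with the actual hardware events, so a small filter can make the pattern easier to see. The following is a minimal Python sketch under stated assumptions: the category names and regular expressions are hypothetical, and they only cover message formats that appear in the logs quoted in this post.

```python
import re
from collections import Counter

# Hypothetical category names; the patterns only cover message formats
# that appear in the log excerpts quoted in this post.
PATTERNS = {
    "machine_check": re.compile(r"mce: \[Hardware Error\]"),
    "memory_failure": re.compile(r"Memory failure: 0x[0-9a-f]+"),
    "soft_offline": re.compile(r"RAS: Soft-offlining pfn"),
    "split_lock": re.compile(r"split_lock trap"),
}


def classify(line):
    """Return the first matching category name, or None for routine lines."""
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            return name
    return None


def summarize(lines):
    """Count how many lines fall into each hardware-related category."""
    counts = Counter()
    for line in lines:
        category = classify(line)
        if category is not None:
            counts[category] += 1
    return counts


# Lines copied from the excerpts above, plus one routine line for contrast.
SAMPLE = [
    "Feb 20 09:34:01 nicoska2 kernel: mce: [Hardware Error]: Machine check events logged",
    "Feb 20 09:34:03 nicoska2 kernel: mce: [Hardware Error]: Machine check events logged",
    "Feb 20 09:34:15 nicoska2 kernel: RAS: Soft-offlining pfn: 0x10b3e10",
    "Feb 20 09:34:15 nicoska2 kernel: Memory failure: 0x10b3e10: unhandlable page.",
    "Feb 20 09:25:11 nicoska2 pveproxy[249943]: worker exit",
]

if __name__ == "__main__":
    for name, count in summarize(SAMPLE).items():
        print(f"{name}: {count}")
```

To run it against the real journal, `SAMPLE` could be replaced with lines read from `sys.stdin` and the script fed from `journalctl -k -b -1` (kernel messages of the previous boot).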
What I tried so far:
- Switched the RAM modules: system crash on heavy load
- BIOS factory reset: system crash on heavy load
- Ran Memtest86+: no errors, test passed, but system crash on heavy load
- Removed the SFP+ transceiver: system crash on heavy load
- Disabled C-States and SpeedShift (in BIOS): system crash on heavy load
- Changed the TDP limits (in BIOS) from PL1: 60000 / PL2: 80000 to PL1: 40000 / PL2: 60000: system crash on heavy load
- Disabled the Efficiency cores (in BIOS), now running only the 6 P-cores (12 threads): system works perfectly, even under heavy load
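Since disabling the E-cores was the only change that stopped the crashes, one way to probe this further from the OS side (with the E-cores re-enabled) would be to pin a CPU-bound load to a chosen set of logical CPUs and see whether the reboot only happens on some of them. The sketch below is a rough Python illustration, not a validated stress test: `parse_cpulist`, `burn` and `stress` are hypothetical helper names, and which logical CPU numbers belong to P-cores or E-cores is not known from this post and would have to be checked on the machine itself (for example with `lscpu --extended`).

```python
import multiprocessing as mp
import os
import time


def parse_cpulist(text):
    """Expand a kernel-style CPU list such as '1-10,12-13' into a sorted list."""
    cpus = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def burn(cpu, seconds):
    """Pin this process to one logical CPU and spin on integer math."""
    os.sched_setaffinity(0, {cpu})  # Linux-only call
    deadline = time.monotonic() + seconds
    x = 1
    while time.monotonic() < deadline:
        x = (x * 1103515245 + 12345) % (2 ** 31)


def stress(cpulist, seconds):
    """Run one pinned busy-loop process per logical CPU in the given list."""
    workers = [
        mp.Process(target=burn, args=(cpu, seconds))
        for cpu in parse_cpulist(cpulist)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
```

Calling `stress("0-3", seconds=60)` would load logical CPUs 0 to 3 for one minute; the list is only an example and should be adjusted to the CPUs under test. A plain busy loop is a much lighter load than transcoding or a Windows VM, so a clean run would not prove the cores are healthy.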
So my question now is: do you think I received a bad unit with a faulty CPU? I still have time to send it back to Amazon.
What would you do in my situation? Are there any additional tests that I could perform?
Thanks for reading this long post.
Nico