Tracking Down Proxmox 8 Hardware Error Reporting


chereszabor

New Member
Jul 3, 2020
7
4
3
Hello,
I am running a home server with the following specs:
Mobo: Asrock Rack ROMED8-2T
CPU: AMD Epyc 7401p
RAM: 8x Samsung M393A4K40CB2-CTD (32GB@2666MT/s)
Other devices: (Nvidia 2070, SAS HBA, PCIe to 4x NVME card)

I recently upgraded from Proxmox 7 to 8, and around the same time I also upgraded the RAM in my system from 8x 32GB@2133 to 8x 32GB@2666. Since then, the system has intermittently logged the following syslog message:

mce: [Hardware Error]: Machine check events logged

There are no other details associated with the message in the syslog. I've tried using ras-mc-ctl and edac-util, but neither reports any errors.
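As a cross-check on those tools, the kernel's EDAC counters can also be read straight from sysfs. The following is only a minimal sketch: it assumes an EDAC driver is loaded and exposes the usual `ce_count` and `ue_count` files under `/sys/devices/system/edac/mc/`, and the function name `read_edac_counts` is made up for illustration. If the directory is missing, that itself is a hint that no EDAC driver is bound to the memory controller.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int|None, "ue": int|None}} per mc* directory.

    `root` is a parameter only so the function can be pointed at a test tree;
    the default is the standard EDAC sysfs location.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        # EDAC driver not loaded, or sysfs not available
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for key, fname in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc / fname).read_text().strip())
            except (OSError, ValueError):
                # File missing or unreadable: report as unknown, not zero
                entry[key] = None
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    result = read_edac_counts()
    if not result:
        print("No EDAC memory controllers found in sysfs")
    for name, c in result.items():
        print(f"{name}: correctable={c['ce']} uncorrectable={c['ue']}")
```

If every counter stays at zero while the mce message keeps appearing, the errors are probably being handled by firmware before the OS sees them, which would make the BMC event log the better place to look.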

Does anyone have any other suggestions on what I should try to get to the bottom of this?

Thanks!
 

mmk

Active Member
Oct 15, 2016
104
39
28
Czech Republic
This does not really sound like anything related to Proxmox. That is a generic Linux error, and it's very likely going to be a memory issue of some kind.

Does the remote management on the board give any useful hints?
 

chereszabor

You're correct, I was getting errors in IPMI. However, I didn't realize that the error buffer had run out of storage and wasn't being updated.

Here is the error logged in IPMI:
| Record ID | Type | TimeStamp | GenID (Low) | GenID (High) | EvMRev | Sensor Type | Sensor # | EvtDir Type | Event Data1 | Event Data2 | Event Data3 |
|-----------|------|-----------|-------------|--------------|--------|-------------|----------|-------------|-------------|-------------|-------------|
| 0e37h     | 02h  | 65031dd7h | 21h         | 00h          | 04h    | 0ch         | 00h      | 6fh         | 00h         | 0ah         | 00h         |
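For anyone reading along, the raw record above can be partly decoded by hand. The sketch below assumes the standard IPMI 2.0 System Event Log layout (record type 02h) and only maps the few codes I am sure of: sensor type 0Ch is Memory, event type 6Fh is sensor-specific, and for Memory sensors offset 00h means a correctable ECC error. The helper name `decode_sel_row` is hypothetical, and Event Data2/Data3 are left raw because their meaning (for example, which DIMM) is board-specific.

```python
from datetime import datetime, timezone

# Small subset of the IPMI sensor type and Memory offset tables
SENSOR_TYPES = {0x0C: "Memory"}
MEMORY_OFFSETS = {
    0x00: "Correctable ECC",
    0x01: "Uncorrectable ECC",
    0x05: "Correctable ECC logging limit reached",
}


def decode_sel_row(row):
    """Decode one pipe-separated SEL row of 12 hex fields like '0e37h'."""
    fields = [f.strip() for f in row.split("|") if f.strip()]
    if len(fields) != 12:
        raise ValueError(f"expected 12 fields, got {len(fields)}")
    vals = [int(f.rstrip("hH"), 16) for f in fields]
    (rec_id, rec_type, ts, gen_lo, gen_hi, evm_rev,
     sensor_type, sensor_num, evt, d1, d2, d3) = vals

    event_type = evt & 0x7F          # bits 6:0 are the event/reading type
    direction = "deassertion" if evt & 0x80 else "assertion"
    offset = d1 & 0x0F               # low nibble of Event Data1

    event = None
    if event_type == 0x6F and sensor_type == 0x0C:
        event = MEMORY_OFFSETS.get(offset)

    return {
        "record_id": rec_id,
        # SEL timestamps are seconds since 1970; UTC is assumed here,
        # although the BMC clock may actually be set to local time
        "timestamp": datetime.fromtimestamp(ts, tz=timezone.utc),
        "sensor_type": SENSOR_TYPES.get(sensor_type, hex(sensor_type)),
        "direction": direction,
        "event": event,
        "raw_data2": d2,  # not interpreted: OEM/board-specific
        "raw_data3": d3,
    }


if __name__ == "__main__":
    row = "0e37h| 02h| 65031dd7h | 21h| 00h| 04h| 0ch| 00h| 6fh| 00h| 0ah| 00h|"
    for key, value in decode_sel_row(row).items():
        print(f"{key}: {value}")
```

Read this way, the entry is a Memory sensor asserting a correctable ECC event, which fits the memory theory. Whether the 0ah in Event Data2 identifies a particular DIMM slot depends on the BMC firmware, so that part would need confirming with the board vendor.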

Also, I forgot to mention that I did reach out to Asrock Tech Support. They advised me to test the memory in pairs via memtest86. I am trying to avoid that approach, as each memory swap requires 30 minutes of disassembly and reassembly.