memory error? cpu error? cosmic ray?!


e97

Active Member
Jun 3, 2015
325
197
43
Anyone know what this is about:

Code:
Message from syslogd@RYZERVER at Nov 30 14:48:24 ...
kernel:[4212402.147302] [Hardware Error]: Deferred error, no action required.

Message from syslogd@RYZERVER at Nov 30 14:48:24 ...
kernel:[4212402.147307] [Hardware Error]: CPU:1 (19:21:2) MC25_STATUS[Over|-|-|-|-|-|UECC|Deferred|-|-]: 0xc10ff07fffffffb9

Message from syslogd@RYZERVER at Nov 30 14:48:24 ...
kernel:[4212402.147313] [Hardware Error]: IPID: 0x0000000000000000

Message from syslogd@RYZERVER at Nov 30 14:48:24 ...
kernel:[4212402.147315] [Hardware Error]: Bank 25 is reserved.

Message from syslogd@RYZERVER at Nov 30 14:48:24 ...
kernel:[4212402.147316] [Hardware Error]: cache level: L1, tx: GEN
Showed up on all terminals on a home server.

Code:
up 48 days, 21:43, 11 users,  load average: 0.70, 0.90, 0.99
Nothing jumps out at me in netdata at that time, or a second before or after, as far as I can tell.


memtest (4 passes) OK before production use.

Specs:
Ryzen 5950X
4 x 32GB ECC UDIMM 3200MHz
 

homeserver78

New Member
Nov 7, 2023
26
13
3
Sweden
I don't know what that's about, unfortunately. I had correctable ECC errors on a RAM stick and the messages looked different:
Code:
Message from syslogd@  at Tue Nov 16 13:16:26 2021 ...
: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: 9400004000910091

Message from syslogd@  at Tue Nov 16 13:16:26 2021 ...
: mce: [Hardware Error]: TSC 0 ADDR 3adbcabc0

Message from syslogd@  at Tue Nov 16 13:16:26 2021 ...
: mce: [Hardware Error]: PROCESSOR 0:406d8 TIME 1637065003 SOCKET 0 APIC 0 microcode 12a
(I managed to "fix" the issue by marking the memory range containing the addresses mentioned in the messages, plus some margin, as reserved by adding 'memmap=0x10000000$0x3a0000000' to the kernel command line. No more mce messages since.)
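As a quick sanity check on that parameter, here is a minimal sketch of the arithmetic, assuming the `memmap=SIZE$START` form marks the range from START up to START + SIZE as reserved. The helper name `parse_memmap` is made up for illustration, and it only handles plain numbers, not the K/M/G size suffixes the kernel also accepts.

```python
def parse_memmap(arg):
    """Parse 'memmap=SIZE$START' into (start, end), end exclusive.

    Hypothetical helper: only plain hex/decimal numbers are handled;
    size suffixes such as K, M and G are ignored in this sketch.
    """
    value = arg.split("=", 1)[1]
    size_text, start_text = value.split("$", 1)
    size = int(size_text, 0)    # base 0 honours the 0x prefix
    start = int(start_text, 0)
    return start, start + size


start, end = parse_memmap("memmap=0x10000000$0x3a0000000")
faulty = 0x3adbcabc0  # ADDR value from the mce message above

print(f"reserved: {start:#x} .. {end:#x} ({(end - start) >> 20} MiB)")
print(f"covers {faulty:#x}: {start <= faulty < end}")
print(f"margin below: {(faulty - start) >> 20} MiB, "
      f"above: {(end - faulty) >> 20} MiB")
```

Depending on the bootloader, the `$` may need escaping when the parameter is written into its config file, so it is worth checking `/proc/cmdline` after rebooting to confirm the option arrived intact.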

In general, I could not find any real documentation describing mce messages, so I was mostly guessing what was wrong and what to do about it.

Since your messages mention L1 cache, perhaps there was a bit flip in the CPU's L1 cache? But obviously guessing is not really good enough here, and I have no idea where to find actual info about these messages.
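For what it's worth, the raw status values in both logs can be picked apart by hand. The sketch below is only that: the bit positions and the `LL`/`TT` tables are my reading of the Linux kernel's x86 MCE definitions and should be treated as assumptions, and `decode_status` is a made-up helper name. It does not say what actually failed; it only shows which flags are set in the 64-bit value.

```python
# Assumed bit positions (from the Linux x86 MCE header as I read it).
FLAGS = {
    "Val": 63, "Over": 62, "UC": 61, "En": 60,
    "MiscV": 59, "AddrV": 58, "PCC": 57,
    "Deferred": 44, "Poison": 43,  # AMD-specific bits
}
# Assumed lookup tables for the low bits of the error code.
LL = ["RESV", "L1", "L2", "L3/GEN"]   # cache level
TT = ["INSN", "DATA", "GEN", "RESV"]  # transaction type


def decode_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error-code fields."""
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAGS.items()}
    code = status & 0xFFFF
    return {
        "flags": flags,
        "error_code": code,
        # Only meaningful when the code uses the cache-hierarchy layout.
        "cache_level": LL[code & 0x3],
        "tx": TT[(code >> 2) & 0x3],
    }


amd = decode_status(0xC10FF07FFFFFFFB9)    # value from the first post
intel = decode_status(0x9400004000910091)  # value from the ECC example

for name, result in (("first post", amd), ("ECC example", intel)):
    set_flags = [flag for flag, on in result["flags"].items() if on]
    print(name, set_flags, hex(result["error_code"]))
print("first post:", amd["cache_level"], amd["tx"])
```

Under these assumptions the first value has Over and Deferred set with UC clear, and its low bits give L1 and GEN, which lines up with what the kernel printed. The second value has AddrV set, which fits the ADDR line in that log.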
 

e97

Active Member
Jun 3, 2015
325
197
43
Yeah, the error doesn't look like the typical memory error. The bank and L1 made me think it was a CPU/memory controller issue. Will have to try more google-fu.

I updated the system a day or two before this error and it required a reboot so maybe something is funky. Will keep monitoring, if it pops up again I'll reboot and see what happens from there.
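For the monitoring part, a rough sketch of how to count recurrences from saved kernel log text. It assumes lines in the same shape as the ones above (an uptime timestamp in square brackets followed by `[Hardware Error]:`); the function names and the 1-second grouping window are arbitrary choices for illustration.

```python
import re

# Matches e.g. "kernel:[4212402.147302] [Hardware Error]: Deferred error, ..."
HW_ERR = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s*\[Hardware Error\]:\s*(?P<msg>.*)"
)


def hardware_errors(lines):
    """Yield (uptime_seconds, message) for each [Hardware Error] line."""
    for line in lines:
        match = HW_ERR.search(line)
        if match:
            yield float(match.group("ts")), match.group("msg").strip()


def count_events(lines, gap=1.0):
    """Count bursts: lines less than `gap` seconds apart are one event."""
    events = 0
    last = None
    for ts, _ in hardware_errors(lines):
        if last is None or ts - last > gap:
            events += 1
        last = ts
    return events


sample = """\
kernel:[4212402.147302] [Hardware Error]: Deferred error, no action required.
kernel:[4212402.147307] [Hardware Error]: CPU:1 (19:21:2) MC25_STATUS[Over|-|-|-|-|-|UECC|Deferred|-|-]: 0xc10ff07fffffffb9
kernel:[4212402.147313] [Hardware Error]: IPID: 0x0000000000000000
kernel:[4212402.147315] [Hardware Error]: Bank 25 is reserved.
kernel:[4212402.147316] [Hardware Error]: cache level: L1, tx: GEN
""".splitlines()

print("lines:", len(list(hardware_errors(sample))))
print("events:", count_events(sample))
```

The same functions can be pointed at the lines of a saved `dmesg` dump; if the count goes up between checks, the error has come back.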

Thanks for sharing how to skip memory regions! Handy to know, hope I don't have to use it.
 

e97

Active Member
Jun 3, 2015
325
197
43
2018-12-08 update: haven't seen any more of these messages or any system issues. Also rebooted the system with no issues.