Mapping around ECC errors

bbqdt

Member
Sep 15, 2019
93
64
18
I get the following ECC error on a Linux box several times a day:

May 24 18:21:04 staton-nas kernel: mce: [Hardware Error]: Machine check events logged
May 24 18:21:04 staton-nas kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
May 24 18:21:04 staton-nas kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 11: 8c000040000800c2
May 24 18:21:04 staton-nas kernel: EDAC sbridge MC0: TSC 1c35588953416
May 24 18:21:04 staton-nas kernel: EDAC sbridge MC0: ADDR 117d228000
May 24 18:21:04 staton-nas kernel: EDAC sbridge MC0: MISC 122100200020008c
May 24 18:21:04 staton-nas kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1590358864 SOCKET 0 APIC 0
May 24 18:21:04 staton-nas kernel: EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x117d228 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c2 socket:0 ha:0 channel_mask:1 rank:4)

The ADDR is always the same, so I’m trying to map around it with a ‘memmap=5M$0x117CFA8001’ kernel argument.
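As a rough illustration of the arithmetic behind that argument, the sketch below builds a `memmap=nn$ss` string that reserves a window centred on the reported ADDR, with the start rounded down to a page boundary. The function name `memmap_arg` and the 4 KiB page size are assumptions made for this example, not something stated in the post.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, the x86-64 default


def memmap_arg(bad_addr: int, span: int = 5 * 1024 * 1024) -> str:
    """Build a memmap=nn$ss argument reserving `span` bytes centred on bad_addr.

    Hypothetical helper for illustration only.
    """
    # Start half a window below the bad address, rounded down to a page boundary.
    start = (bad_addr - span // 2) & ~(PAGE_SIZE - 1)
    return f"memmap={span // (1024 * 1024)}M${start:#x}"


print(memmap_arg(0x117d228000))  # memmap=5M$0x117cfa8000
```

The computed start, 0x117cfa8000, is one byte below the 0x117CFA8001 used in the post, so the posted window covers the same address but does not begin on a page boundary.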

The argument seems to be applied, because I see the following in syslog:

May 24 16:03:09 staton-nas kernel: user: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
May 24 16:03:09 staton-nas kernel: user: [mem 0x0000000100000000-0x000000117cfa8000] usable
May 24 16:03:09 staton-nas kernel: user: [mem 0x000000117cfa8001-0x000000117d4a8000] reserved
May 24 16:03:09 staton-nas kernel: user: [mem 0x000000117d4a8001-0x000000407fffffff] usable

But I still get the ECC errors.
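One way to double-check that the reported address really falls inside the reserved window is to parse the `user:` memory-map lines and look the address up. This is a minimal sketch: the regular expression and the helper `region_type` are made up for this example, and the log text is copied from the lines above with the syslog prefix removed.

```python
import re

# The "user:" memory map lines from the syslog excerpt, prefix stripped.
LOG = """\
user: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
user: [mem 0x0000000100000000-0x000000117cfa8000] usable
user: [mem 0x000000117cfa8001-0x000000117d4a8000] reserved
user: [mem 0x000000117d4a8001-0x000000407fffffff] usable
"""

REGION = re.compile(r"\[mem (0x[0-9a-f]+)-(0x[0-9a-f]+)\] (\w+)")


def region_type(log: str, addr: int):
    """Return the type of the first listed region containing addr, or None."""
    for lo, hi, kind in REGION.findall(log):
        if int(lo, 16) <= addr <= int(hi, 16):
            return kind
    return None


print(region_type(LOG, 0x117d228000))  # reserved
```

So, going by the logged map, ADDR 117d228000 sits inside the reserved range, and the errors are being reported for memory the kernel has already been told not to use.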

Am I missing something?

Is the “ADDR 117d228000” in the EDAC syslog errors not the actual address I need to map around? Do I need to convert that to a physical address somehow?
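A quick consistency check on that question: the EDAC line also reports `page:0x117d228 offset:0x0`. Assuming `page` is a physical page frame number with 4 KiB pages (an assumption for this sketch, as is the helper name `edac_phys_addr`), recombining the two fields should give back the ADDR value.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages


def edac_phys_addr(page: int, offset: int) -> int:
    """Recombine an EDAC page/offset pair into a byte address (hypothetical helper)."""
    return (page << PAGE_SHIFT) | offset


addr = edac_phys_addr(0x117d228, 0x0)
print(hex(addr))             # 0x117d228000
print(addr == 0x117d228000)  # True
```

Under that assumption the page/offset pair and the ADDR field describe the same location, which suggests ADDR is already the physical address and needs no further conversion.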

I’m too cheap to replace a whole DIMM for a single bad bit.
 

bbqdt

The more research I do, the more convinced I become that the “memory scrubbing error” message means the error is being found by the memory scrubbing the hardware does on its own, not by anything the OS is accessing. If so, I can safely ignore it now that I have mapped around it, because the OS will never actually use the memory area I reserved.

Can anyone confirm that?
 

Stephan

Well-Known Member
Apr 21, 2017
942
711
93
Germany
How much time have you spent on this? Surely an hour? Compare that to a tech worker's hourly rate... then just get a working DIMM, please. RAM needs to work reliably, or else you are in a world of data-corruption hurt.
 

bbqdt

This is my home server, so it’s more fun to figure out a workaround ;-). Yeah, it’s only 16 GB of 256 GB, and DDR3, but I enjoy the challenge.