Correctable ECC Error (pic) - RMA?

IamSpartacus

Well-Known Member
Mar 14, 2016
2,478
628
113
I came across the following errors in my server syslogs:
Code:
kernel: mce: [Hardware Error]: Machine check events logged
kernel: [Hardware Error]: Corrected error, no action required.
kernel: [Hardware Error]: CPU:2 (17:31:0) MC17_STATUS[Over|CE|MiscV|-|AddrV|-|-|SyndV|-|CECC]: 0xdc2040000000011b
kernel: [Hardware Error]: Error Addr: 0x000000034c5beec0
kernel: [Hardware Error]: IPID: 0x0000009600450f00, Syndrome: 0x400040000a801201
kernel: [Hardware Error]: Unified Memory Controller Extended Error Code: 0
kernel: [Hardware Error]: Unified Memory Controller Error: DRAM ECC error.
kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
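
For anyone wanting to double-check what the status word in that log says, here is a minimal sketch that decodes the flag bits of `MC17_STATUS`. The bit positions are the ones I understand the Linux kernel to use for AMD Scalable MCA banks; the function names are made up for illustration, and it does not decode the syndrome or the error address.

```python
# Flag bits of an AMD Scalable MCA status register (MCx_STATUS).
# Assumption: bit positions follow the Linux kernel's mce.h definitions.
STATUS_FLAGS = {
    63: "Val",       # register holds a valid error
    62: "Over",      # overflow: an earlier error was not read in time
    61: "UC",        # uncorrected error
    60: "En",        # error reporting enabled
    59: "MiscV",     # MISC register is valid
    58: "AddrV",     # ADDR register is valid
    57: "PCC",       # processor context corrupt
    55: "TCC",       # task context corrupt
    53: "SyndV",     # syndrome register is valid
    46: "CECC",      # corrected by ECC
    45: "UECC",      # uncorrectable ECC error
    44: "Deferred",  # deferred error
    43: "Poison",    # poisoned data consumed
}


def decode_mca_status(status: int) -> list:
    """Return the names of the flag bits set in `status`, highest bit first."""
    return [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]


def is_corrected(status: int) -> bool:
    """True when the entry is valid and the UC (uncorrected) bit is clear."""
    return bool((status >> 63) & 1) and not (status >> 61) & 1


if __name__ == "__main__":
    status = 0xDC2040000000011B  # value from the syslog excerpt above
    print(decode_mca_status(status))
    # ['Val', 'Over', 'En', 'MiscV', 'AddrV', 'SyndV', 'CECC']
    print(is_corrected(status))
    # True
```

The result lines up with the kernel's own summary (`Over|CE|...|SyndV|...|CECC`). The `Over` flag is worth noting: it suggests more than one error hit this bank before the log entry was read.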
I then checked my IPMI event logs and saw this:



Time to RMA the module?
 

SRussell

Active Member
Oct 7, 2019
294
151
43
US
Would the vendor replace them based on those logs, or would they want something like memtest86 run first?
 

IamSpartacus

I just contacted them a few minutes ago, so we'll see what they say. After seeing these errors I ran a full-system memtest86 (the non-UEFI version, as it's included with Unraid), and after 3 full passes (about 17 hours) no errors were reported. At the time, though, I hadn't seen the IPMI logs, which told me exactly which module is throwing these errors. I've now pulled all system memory except that one module and have just started another round of memtest86 (the newest UEFI version), so we'll see if it reports anything.
 

AndreiL

New Member
Jun 30, 2019
23
9
3
The Pro (non-free) version of memtest86 has ECC-specific tests (V8, UEFI-only).
The V4 (legacy BIOS) version is pretty useless for ECC testing: I've had clean runs for 96 hours straight, only to have Ubuntu report thousands of soft (correctable) ECC errors. I generally replace a module as soon as it starts generating any ECC errors (dmesg reports which module it is).
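
Besides dmesg, the kernel's EDAC driver keeps per-module counters in sysfs, which makes it easy to see which DIMM is racking up correctable errors. Below is a rough sketch that reads them. It assumes the EDAC driver for your platform is loaded and exposes the per-DIMM layout (`mc*/dimm*/dimm_label`, `dimm_ce_count`, `dimm_ue_count`); older kernels use a csrow layout instead, which this does not handle. The function name is mine, not part of any tool.

```python
from pathlib import Path

# Assumption: standard EDAC sysfs location on Linux.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_dimm_error_counts(root=EDAC_ROOT):
    """Return {(controller, dimm_label): (corrected, uncorrected)}."""
    counts = {}
    for dimm in sorted(Path(root).glob("mc*/dimm*")):
        label_file = dimm / "dimm_label"
        ce_file = dimm / "dimm_ce_count"
        if not (label_file.is_file() and ce_file.is_file()):
            continue  # not a per-DIMM directory we understand
        # Fall back to the directory name if the label is empty.
        label = label_file.read_text().strip() or dimm.name
        ce = int(ce_file.read_text())
        ue_file = dimm / "dimm_ue_count"
        ue = int(ue_file.read_text()) if ue_file.is_file() else 0
        counts[(dimm.parent.name, label)] = (ce, ue)
    return counts


if __name__ == "__main__":
    for (mc, label), (ce, ue) in read_dimm_error_counts().items():
        note = "  <-- has logged errors" if ce or ue else ""
        print(f"{mc} {label}: corrected={ce} uncorrected={ue}{note}")
```

The counters reset on reboot, so a module that looks clean right after a restart may simply not have been exercised yet.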
 

IamSpartacus

Is the free V8 version of any use? I'm not really trying to drop $45 on the Pro version. But yeah, I'm leaning towards just submitting an RMA once I hear back from the vendor.
 

AndreiL

Yeah, I didn't go for the Pro V8 either. It looks like the free V8 has some ECC reporting, but in my opinion it doesn't tell me anything I don't already know: replace the module(s) and the problem is fixed.