This server (a Supermicro X9DRD-7LN4F system with 16x16GB DIMMs) was rebooted over the weekend. Today it started logging MCE messages:
[124519.723865] mce: [Hardware Error]: Machine check events logged
[124519.723881] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[124519.723883] EDAC sbridge MC1: CPU 10: Machine Check Event: 0 Bank 7: 8c00004000010091
[124519.723885] EDAC sbridge MC1: TSC 0
[124519.723886] EDAC sbridge MC1: ADDR 2fa25b5880
[124519.723887] EDAC sbridge MC1: MISC 140724686
[124519.723889] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1539121624 SOCKET 1 APIC 20
[124520.445994] EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2fa25b5 offset:0x880 grain:32 syndrome:0x0 - areaRAM err_code:0001:0091 socket:1 ha:0 channel_mask:4 rank:1)
[125043.238458] mce: [Hardware Error]: Machine check events logged
[125043.238479] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[125043.238482] EDAC sbridge MC1: CPU 10: Machine Check Event: 0 Bank 10: 8c000048000800c1
[125043.238483] EDAC sbridge MC1: TSC 0
[125043.238485] EDAC sbridge MC1: ADDR 2fa25b5000
[125043.238486] EDAC sbridge MC1: MISC 90840010001108c
[125043.238488] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1539122148 SOCKET 1 APIC 20
[125043.470434] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x2fa25b5 offset:0x0 grain:32 syndrome:0x0 - areaRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:1 rank:1)
[129516.092401] mce: [Hardware Error]: Machine check events logged
[129516.092441] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[129516.092443] EDAC sbridge MC1: CPU 10: Machine Check Event: 0 Bank 10: 8c000048000800c1
[129516.092444] EDAC sbridge MC1: TSC 0
[129516.092446] EDAC sbridge MC1: ADDR 2fa25b5000
[129516.092447] EDAC sbridge MC1: MISC 90840010001108c
[129516.092448] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1539126621 SOCKET 1 APIC 20
[129516.223389] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x2fa25b5 offset:0x0 grain:32 syndrome:0x0 - areaRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:1 rank:1)
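For anyone who wants to sanity-check the raw status words in the "Machine Check Event" lines, here is a minimal sketch of a decoder. It assumes the architectural IA32_MCi_STATUS layout from the Intel SDM (VAL/OVER/UC/MISCV/ADDRV/PCC in the top bits, corrected error count in bits 52:38, MCA error code in bits 15:0) and the memory-controller compound code format 0000 0000 1MMM CCCC. The function name and dictionary keys are my own, not from any tool, and the corrected count is only meaningful if the CPU reports it.

```python
# Memory transaction types (MMM field) for memory-controller error codes.
MEM_OPS = {
    0: "generic",
    1: "memory read",
    2: "memory write",
    3: "address/command",
    4: "memory scrubbing",
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into its architectural fields."""
    mca_code = status & 0xFFFF
    info = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "miscv": bool((status >> 59) & 1),
        "addrv": bool((status >> 58) & 1),
        "pcc": bool((status >> 57) & 1),
        # Bits 52:38, assumed to hold the corrected error count.
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_code": (status >> 16) & 0xFFFF,
        "mca_code": mca_code,
    }
    # Memory-controller errors have the form 0000 0000 1MMM CCCC.
    if mca_code & 0xFF80 == 0x0080:
        info["mem_op"] = MEM_OPS.get((mca_code >> 4) & 0x7, "reserved")
    return info


if __name__ == "__main__":
    for raw in (0x8C00004000010091, 0x8C000048000800C1):
        print(hex(raw), decode_mci_status(raw))
```

For the two status words in the log this gives "memory read" and "memory scrubbing" with the UC bit clear, which agrees with the CE lines EDAC printed and with the err_code values 0001:0091 and 0008:00c1.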
Looks like it's all on the 2nd memory controller (MC1).
# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 2 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: 1 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc1: csrow1: 0 Uncorrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors
The errors are showing on the 2nd socket, on Ch#0_DIMM#0 and also Ch#2_DIMM#0. If they were isolated to a single DIMM, I would be inclined to think DIMM failure, but seeing them on 2 DIMMs so close together has me wondering if there's something else going on.
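To compare the two DIMMs more systematically, here is a rough sketch that pulls the DIMM label, error kind, page and offset out of the "EDAC MCn:" summary lines and groups them by page. The regex is my own guess at the sb_edac message layout based only on the three lines above, so treat it as an assumption rather than a stable format.

```python
import re
from collections import defaultdict

# Pattern inferred from the dmesg lines in this post (an assumption).
EDAC_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<sev>CE|UE) (?P<kind>.+?) on "
    r"(?P<label>\S+) \(.*?page:(?P<page>0x[0-9a-f]+) "
    r"offset:(?P<offset>0x[0-9a-f]+)"
)


def group_by_page(lines):
    """Map page -> list of (DIMM label, error kind, offset) tuples."""
    pages = defaultdict(list)
    for line in lines:
        m = EDAC_RE.search(line)
        if m:
            pages[m["page"]].append((m["label"], m["kind"], m["offset"]))
    return dict(pages)


SAMPLE = [
    "[124520.445994] EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2fa25b5 offset:0x880 grain:32 syndrome:0x0 - areaRAM err_code:0001:0091 socket:1 ha:0 channel_mask:4 rank:1)",
    "[125043.470434] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x2fa25b5 offset:0x0 grain:32 syndrome:0x0 - areaRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:1 rank:1)",
    "[129516.223389] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x2fa25b5 offset:0x0 grain:32 syndrome:0x0 - areaRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:1 rank:1)",
]

if __name__ == "__main__":
    for page, hits in group_by_page(SAMPLE).items():
        labels = sorted({label for label, _, _ in hits})
        print(page, len(hits), labels)
```

Run against the three lines above, every hit lands in the same page (0x2fa25b5) even though two different DIMM labels are reported, which may be relevant to whether these are really two independent DIMM faults.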
Thoughts?