Getting MCE messages on a server...


BLinux

cat lover server enthusiast
Jul 7, 2016
2,672
1,081
113
artofserver.com
This server was rebooted over the weekend. Today, I started getting MCE messages on it (Supermicro X9DRD-7LN4F system with 16x 16GB DIMMs):

[124519.723865] mce: [Hardware Error]: Machine check events logged
[124519.723881] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[124519.723883] EDAC sbridge MC1: CPU 10: Machine Check Event: 0 Bank 7: 8c00004000010091
[124519.723885] EDAC sbridge MC1: TSC 0
[124519.723886] EDAC sbridge MC1: ADDR 2fa25b5880
[124519.723887] EDAC sbridge MC1: MISC 140724686
[124519.723889] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1539121624 SOCKET 1 APIC 20
[124520.445994] EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2fa25b5 offset:0x880 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0091 socket:1 ha:0 channel_mask:4 rank:1)

[125043.238458] mce: [Hardware Error]: Machine check events logged
[125043.238479] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[125043.238482] EDAC sbridge MC1: CPU 10: Machine Check Event: 0 Bank 10: 8c000048000800c1
[125043.238483] EDAC sbridge MC1: TSC 0
[125043.238485] EDAC sbridge MC1: ADDR 2fa25b5000
[125043.238486] EDAC sbridge MC1: MISC 90840010001108c
[125043.238488] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1539122148 SOCKET 1 APIC 20
[125043.470434] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x2fa25b5 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:1 rank:1)

[129516.092401] mce: [Hardware Error]: Machine check events logged
[129516.092441] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[129516.092443] EDAC sbridge MC1: CPU 10: Machine Check Event: 0 Bank 10: 8c000048000800c1
[129516.092444] EDAC sbridge MC1: TSC 0
[129516.092446] EDAC sbridge MC1: ADDR 2fa25b5000
[129516.092447] EDAC sbridge MC1: MISC 90840010001108c
[129516.092448] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1539126621 SOCKET 1 APIC 20
[129516.223389] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x2fa25b5 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:1 rank:1)

Looks like it's all on the 2nd memory controller.

# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 2 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: 1 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc1: csrow1: 0 Uncorrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors

Seems to be showing up on the 2nd socket, on Ch#0_DIMM#0 and also Ch#2_DIMM#0. If it were isolated to a single DIMM, I would be inclined to suspect a DIMM failure... but seeing it on 2 DIMMs so close together in time has me wondering if there's something else going on.

Thoughts?
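For keeping an eye on whether those counters climb between `edac-util` runs, the per-DIMM counts are also exposed directly in sysfs. A minimal sketch (path layout assumed from the kernel's EDAC sysfs interface; this uses the older csrow-based layout that matches the `edac-util` output above):

```shell
# Tally corrected-error counts straight from the EDAC sysfs tree (the same
# counters edac-util reads). On a box without EDAC loaded the glob matches
# nothing and the total stays 0.
total=0
for f in /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count; do
    [ -r "$f" ] || continue
    printf '%s: %s\n' "$f" "$(cat "$f")"
    total=$(( total + $(cat "$f") ))
done
echo "total corrected errors: $total"
```

Watching these with a cron job or `watch` makes it easy to see which counter is the one incrementing.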
 

BLinux

MC1/CH0/DIMM0 keeps incrementing.

Code:
# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 5 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: 1 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc1: csrow1: 0 Uncorrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors
I suspect it's that DIMM that's bad... or should I just try re-seating it?
 

BLinux

Ugh... this is getting ugly, and this time on MC1/CH2/DIMM0:


Code:
# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 5 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: 42 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc1: csrow1: 0 Uncorrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors
 

Blinky 42

Active Member
Aug 6, 2015
615
232
43
48
PA, USA
You can always swap the DIMMs showing errors into slots that appear to be good, run memtest86 for a while, and see if the errors move with the DIMMs or stay with the slot.
Just re-seating them might help, but it could also be something deeper, and you may need to swap out the board eventually. We had a whole set of X8 boards back in the day that would eventually develop errors on one of the DIMM slots only. Oddly, if you rebooted the box and the BIOS brought it up without that DIMM in the mix (from its failing the initial self-test), they would chug along just fine minus that memory.
 

nthu9280

Well-Known Member
Feb 3, 2016
1,628
498
83
San Antonio, TX
Did you change out the CPUs by chance? I once got v2 CPUs to upgrade a v1 set. The S2600CP2J would disable a couple of slots, and you could see the warning LEDs by those DIMMs. It had been working fine on the v1 CPUs. I used alcohol wipes to clean the CPU pads/contacts and the issue stopped.
 

BLinux

You can always swap the DIMMs showing errors into slots that appear to be good, run memtest86 for a while, and see if the errors move with the DIMMs or stay with the slot.
Just re-seating them might help, but it could also be something deeper, and you may need to swap out the board eventually. We had a whole set of X8 boards back in the day that would eventually develop errors on one of the DIMM slots only. Oddly, if you rebooted the box and the BIOS brought it up without that DIMM in the mix (from its failing the initial self-test), they would chug along just fine minus that memory.
I'm hoping it is not the board, but I do have a spare unit for this server. Along with spare DIMMs, etc.

Did you change out the CPUs by chance? I once got v2 CPUs to upgrade a v1 set. The S2600CP2J would disable a couple of slots, and you could see the warning LEDs by those DIMMs. It had been working fine on the v1 CPUs. I used alcohol wipes to clean the CPU pads/contacts and the issue stopped.
This system has 2x E5-2680v2.

The thing is, it's been running stable under the load of about 20~30 VMs for many months and only started showing these errors recently after a reboot. I have to plan some down time for it, as several of those VMs are in use. I do have spare DIMMs and other spare parts, just wondering how much "swapping" out to do when I finally get a chance to take down the server.
 

nthu9280

OK, that rules out my line of troubleshooting. Since this is your primary VM host, it makes the process a little tricky. Hope it's the memory sticks and not an issue with the MB.
 

BLinux

OK... I need help decoding the EDAC mcelog messages...

Code:
Oct 12 17:12:02 dongtan mcelog: Hardware event. This is not a software error.
Oct 12 17:12:02 dongtan mcelog: MCE 1
Oct 12 17:12:02 dongtan mcelog: CPU 10 BANK 10
Oct 12 17:12:02 dongtan mcelog: MISC 90840010001108c ADDR 2fa25b5000
Oct 12 17:12:02 dongtan mcelog: TIME 1539122148 Tue Oct  9 14:55:48 2018
Oct 12 17:12:02 dongtan mcelog: MCG status:
Oct 12 17:12:02 dongtan mcelog: MCi status:
Oct 12 17:12:02 dongtan mcelog: Corrected error
Oct 12 17:12:02 dongtan mcelog: MCi_MISC register valid
Oct 12 17:12:02 dongtan mcelog: MCi_ADDR register valid
Oct 12 17:12:02 dongtan mcelog: MCA: MEMORY CONTROLLER MS_CHANNEL1_ERR
Oct 12 17:12:02 dongtan mcelog: Transaction: Memory scrubbing error
Oct 12 17:12:02 dongtan mcelog: MemCtrl: Corrected patrol scrub error

Oct 12 17:12:02 dongtan mcelog: Hardware event. This is not a software error.
Oct 12 17:12:02 dongtan mcelog: MCE 31
Oct 12 17:12:02 dongtan mcelog: CPU 10 BANK 7
Oct 12 17:12:02 dongtan mcelog: MISC 42721286 ADDR 2fa25b5880
Oct 12 17:12:02 dongtan mcelog: TIME 1539365877 Fri Oct 12 10:37:57 2018
Oct 12 17:12:02 dongtan mcelog: MCG status:
Oct 12 17:12:02 dongtan mcelog: MCi status:
Oct 12 17:12:02 dongtan mcelog: Corrected error
Oct 12 17:12:02 dongtan mcelog: MCi_MISC register valid
Oct 12 17:12:02 dongtan mcelog: MCi_ADDR register valid
Oct 12 17:12:02 dongtan mcelog: MCA: MEMORY CONTROLLER RD_CHANNEL1_ERR
Oct 12 17:12:02 dongtan mcelog: Transaction: Memory read error
Oct 12 17:12:02 dongtan mcelog: STATUS 8c00004000010091 MCGSTATUS 0
Oct 12 17:12:02 dongtan mcelog: MCGCAP 1000c1b APICID 20 SOCKETID 1
Oct 12 17:12:02 dongtan mcelog: PPIN 5b8a27707138515f
Oct 12 17:12:02 dongtan mcelog: CPUID Vendor Intel Family 6 Model 62
The above are the types of messages I'm getting. There are two ADDRs that seem to be the problem:

ADDR 2fa25b5000
ADDR 2fa25b5880

Am I right so far? If so, digging through dmidecode -t 20, I find this entry that appears to match the addresses above:

Code:
Handle 0x0048, DMI type 20, 35 bytes
Memory Device Mapped Address
    Starting Address: 0x02C00000000
    Ending Address: 0x02FFFFFFFFF
    Range Size: 16 GB
    Physical Device Handle: 0x0047
    Memory Array Mapped Address Handle: 0x0040
    Partition Row Position: 1
I mean, 2,C00,000,000 < 2,FA2,5B5,000 < 2,FA2,5B5,880 < 2,FFF,FFF,FFF, right? (commas inserted to make it easier to read)
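The comparison itself is easy to verify mechanically, since shell arithmetic handles hex constants directly. A quick sanity check of the range test above:

```shell
# Confirm both faulting addresses fall inside the DMI-mapped range
# 0x02C00000000..0x02FFFFFFFFF that dmidecode reports for handle 0x0048.
start=$(( 0x02C00000000 ))
end=$((   0x02FFFFFFFFF ))
for addr in 0x2fa25b5000 0x2fa25b5880; do
    if [ $(( addr >= start && addr <= end )) -eq 1 ]; then
        printf '%s lies inside the mapped range\n' "$addr"
    else
        printf '%s is OUTSIDE the mapped range\n' "$addr"
    fi
done
# Both addresses print "lies inside the mapped range".
```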

So, if that is correct, it says "Physical Device Handle: 0x0047". And if I look at dmidecode -t memory, I find:

Code:
Handle 0x0047, DMI type 17, 34 bytes
Memory Device
        Array Handle: 0x003F
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 16384 MB
        Form Factor: DIMM
        Set: None
        Locator: P2-DIMMF2
        Bank Locator: P1_Node1_Channel1_Dimm1
        Type: DDR3
        Type Detail: Registered (Buffered)
        Speed: 1600 MHz
        Manufacturer: Hynix Semiconductor          
        Serial Number: 4F920CCC  
        Asset Tag: Dimm3_AssetTag
        Part Number: HMT42GR7MFR4C-PB
        Rank: 2
        Configured Clock Speed: 1600 MHz
So, if the process above is correct, then I'm only looking at CPU#2/Ch#1/DIMM#1, which is labeled P2-DIMMF2 on the motherboard. But this seems contrary to:

Code:
# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 7 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: 92 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc1: csrow1: 0 Uncorrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors
The output from edac-util seems to suggest

CPU#2/Ch#0/DIMM#0
CPU#2/Ch#2/DIMM#0

Which is NOT:

CPU#2/Ch#1/DIMM#1

The only thing consistent is the CPU#2 part. I must be misinterpreting the mcelog output?
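One thing worth noting: the `CHANNEL1` in both mcelog lines comes from the low bits of the MCi_STATUS error code. My reading of the Intel SDM's memory-controller error encoding (`000F 0000 1MMM CCCC`, treated as an assumption here) is that both events decode to the same channel number:

```shell
# Decode the low 16 bits of MCi_STATUS for the two events above.
# Assumed encoding (Intel SDM, compound memory-controller errors):
#   bits 0-3 = channel (CCCC), bits 4-6 = transaction type (MMM),
#   where 001 = memory read and 100 = memory scrub.
for code in 0x0091 0x00c1; do
    ch=$(( code & 0xF ))
    op=$(( (code >> 4) & 0x7 ))
    printf 'err_code %s -> channel %d, transaction type %d\n' "$code" "$ch" "$op"
done
# Both codes decode to channel 1, transaction types 1 (read) and 4 (scrub).
```

That both decode to channel 1 while EDAC attributes them to channels 0 and 2 suggests the channel number in the raw MCA code may be counted differently from EDAC's channel labels, which could be part of why the tools disagree.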
 

BLinux

I think I must be misinterpreting the output from edac-util. The chances of 2 DIMMs failing at the same time are pretty slim, so my manual analysis, which points to a single defective DIMM, P2-DIMMF2, seems more likely to be correct; at least it seems logical.

I'm thinking of going for P2-DIMMF2 replacement only. What do you guys think?

If I'm completely unsure, there is the brute-force option of replacing all 8x DIMMs on CPU#2; but I'd like to learn something from this experience, so if anyone here can confirm or refute my analysis pointing to P2-DIMMF2, please chime in....
 

BLinux

OK, I'm really confused now... so I thought maybe I could find errors with more specific information in the IPMI interface:

Code:
19    2018/10/12 16:44:20    OEM    Memory    Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
20    2018/10/12 17:01:23    OEM    Memory    Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
21    2018/10/12 17:31:56    OEM    Memory    Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
22    2018/10/12 17:53:25    OEM    Memory    Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
23    2018/10/12 18:16:43    OEM    Memory    Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
24    2018/10/12 18:32:25    OEM    Memory    Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
25    2018/10/12 18:53:08    OEM    Memory    Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
26    2018/10/12 19:19:09    OEM    Memory    Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
27    2018/10/12 19:33:15    OEM    Memory    Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
28    2018/10/13 01:27:19    OEM    Memory    Correctable Memory ECC @ DIMMF1(CPU2) - Asserted
So, that is claiming the defect is P2-DIMMF1, not P2-DIMMF2??? WTF?
 

Blinky 42

From a "test, but try to keep it quick" standpoint, I would say:
Since you are taking the system down at some point anyway, I would run memtest86+ on it as-is and see what it reports. Since you are getting that many errors that quickly, it should show up in the first few tests memtest runs. Then shut it down, pull the DIMMs it says are giving errors, spin up with less RAM, and see whether you are OK or the errors have migrated to other locations. If it runs error-free a few minutes past where you saw errors before, shut it down, swap the possibly-bad DIMMs into slots that were occupied, move those hopefully-good DIMMs into the other slots, and run again. If you get errors that move, replace the DIMMs and re-test. If the errors are all over with no rhyme or reason, you can try re-seating everything, but I would say pull DIMMs out until it is stable enough to run degraded until you can coordinate a full mobo swap.

Testing with just VM load will be much spottier because, depending on how your system is used, the problem areas may be infrequently accessed.
 

BLinux

From a "test, but try to keep it quick" standpoint, I would say:
Since you are taking the system down at some point anyway, I would run memtest86+ on it as-is and see what it reports. Since you are getting that many errors that quickly, it should show up in the first few tests memtest runs. Then shut it down, pull the DIMMs it says are giving errors, spin up with less RAM, and see whether you are OK or the errors have migrated to other locations. If it runs error-free a few minutes past where you saw errors before, shut it down, swap the possibly-bad DIMMs into slots that were occupied, move those hopefully-good DIMMs into the other slots, and run again. If you get errors that move, replace the DIMMs and re-test. If the errors are all over with no rhyme or reason, you can try re-seating everything, but I would say pull DIMMs out until it is stable enough to run degraded until you can coordinate a full mobo swap.

Testing with just VM load will be much spottier because, depending on how your system is used, the problem areas may be infrequently accessed.
At this point, since I'm not really sure where the error is, I'm actually thinking about testing a set of 8x DIMMs in the spare server; after confirming those DIMMs are good, I might just swap out all the DIMMs on CPU#2. I can then test the 8x DIMMs I pulled in the spare machine to figure out which one (or more) went bad.

I'm also looking at the EDAC label database; there are very few entries available. But if I can figure out the proper labels for this board by testing with the spare in the lab, I can then register those labels on the production machine, and that might help me identify the problem. However, at this point, before I've fully confirmed the labels, the edac-utils output would be pointing to P2-DIMME1 and P2-DIMMG1.
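For reference, the label database edac-utils reads is plain text, with entries of the form `label: mc.row.channel` (format as shipped in edac-utils' labels.db). A hypothetical stanza for this board; the slot-to-channel mapping below is a guess for illustration and would need to be confirmed on the spare before being trusted:

```shell
# Hypothetical, UNVERIFIED labels.db stanza for the X9DRD-7LN4F.
# Written to a scratch file here; the real file is /etc/edac/labels.db,
# after which `edac-ctl --register-labels` pushes the labels into sysfs.
cat > /tmp/labels.db.fragment <<'EOF'
Vendor: Supermicro
  Model: X9DRD-7LN4F
    P1-DIMMA1: 0.0.0;  P1-DIMMB1: 0.0.1;  P1-DIMMC1: 0.0.2;  P1-DIMMD1: 0.0.3;
    P1-DIMMA2: 0.1.0;  P1-DIMMB2: 0.1.1;  P1-DIMMC2: 0.1.2;  P1-DIMMD2: 0.1.3;
    P2-DIMME1: 1.0.0;  P2-DIMMF1: 1.0.1;  P2-DIMMG1: 1.0.2;  P2-DIMMH1: 1.0.3;
    P2-DIMME2: 1.1.0;  P2-DIMMF2: 1.1.1;  P2-DIMMG2: 1.1.2;  P2-DIMMH2: 1.1.3;
EOF
grep -c 'P2-' /tmp/labels.db.fragment
```

If this guessed mapping is right, mc1 Chan#0 would be P2-DIMME1 and Chan#2 would be P2-DIMMG1, matching what the unverified labels currently suggest.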

This shit really needs improvement; it shouldn't be so complicated to decode these messages. So far, I have:

edac-utils/(unverified labels): P2-DIMME1 + P2-DIMMG1

manual analysis of error message address+dmidecode info: P2-DIMMF2

Supermicro IPMI event logs: P2-DIMMF1

3 different methods, 3 different answers... this *should not* be this confusing.
 

BLinux

I haven't shut down this server yet, but the MCE errors have stopped for over 3 weeks now:

Code:
# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: P1-DIMMA1: 0 Corrected Errors
mc0: csrow0: P1-DIMMB1: 0 Corrected Errors
mc0: csrow0: P1-DIMMC1: 0 Corrected Errors
mc0: csrow0: P1-DIMMD1: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: P1-DIMMA2: 0 Corrected Errors
mc0: csrow1: P1-DIMMB2: 0 Corrected Errors
mc0: csrow1: P1-DIMMC2: 0 Corrected Errors
mc0: csrow1: P1-DIMMD2: 0 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: P2-DIMME1: 18 Corrected Errors
mc1: csrow0: P2-DIMMF1: 0 Corrected Errors
mc1: csrow0: P2-DIMMG1: 111 Corrected Errors
mc1: csrow0: P2-DIMMH1: 0 Corrected Errors
mc1: csrow1: 0 Uncorrected Errors
mc1: csrow1: P2-DIMME2: 0 Corrected Errors
mc1: csrow1: P2-DIMMF2: 0 Corrected Errors
mc1: csrow1: P2-DIMMG2: 0 Corrected Errors
mc1: csrow1: P2-DIMMH2: 0 Corrected Errors
Weird... cosmic rays?