I have a 2U Supermicro 825 system with an X8DT-6F motherboard, 2x L5640 CPUs, and 128GB of RAM (8x16GB PC3L-8500R ECC). It's a lab/testing system that's usually powered off. I recently powered it on and it was behaving oddly.
At first, only 112GB was recognized during POST. When the system began booting the OS, I encountered several odd symptoms:
1) the system boots extremely slowly and hangs at various points
2) the system spontaneously reboots
3) I sometimes see the message: "CMCI storm detected: switching to poll mode"
4) during POST, sometimes I would see "Uncorrectable ECC error CPU2: DIMM1A"
5) during POST, other times I would see "Uncorrectable ECC error CPU2: DIMM2A"
Because of #4 and #5, one of the first things I did was re-seat P2-DIMM1A and P2-DIMM2A; the system then recognized 128GB during POST, but the other symptoms remained. I then swapped DIMM1A with another DIMM in the system to see if the problem would follow the DIMM; it did not. I did the same with DIMM2A, but that error didn't occur frequently and I still saw the ECC error on DIMM1A. I swapped all 8 DIMMs in various permutations and the system still exhibited the same errors; I find it unlikely that all 8 DIMMs would go bad at the same time.
I then took out all 8 DIMMs and installed only 2 DIMMs (P1-DIMM1A + P2-DIMM1A), one for each CPU. POST saw 32GB, the system booted up perfectly, and I ran a few benchmarks for several minutes without any errors; the system seemed stable. Next I added a 2nd pair of DIMMs in P1-DIMM2A and P2-DIMM2A, for a total of 4 DIMMs and 64GB, but POST saw only 48GB and many of the symptoms above returned. I removed the P1/P2-DIMM2A pair, and the system was stable again. Just to confirm, I populated P1/P2-DIMM2A with a different pair of DIMMs, and the errors returned. So at this point I doubt my problem is a bad DIMM.
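As a sanity check on the POST totals, here is a minimal Python sketch of the arithmetic. The function name `missing_dimms` is made up for illustration, and it assumes every DIMM is 16GB, as in my configuration. In both the 8-DIMM and the 4-DIMM case, the shortfall works out to exactly one DIMM not being counted.

```python
def missing_dimms(installed_gb, detected_gb, dimm_gb=16):
    """Return how many equally sized DIMMs the POST total fails to account for.

    Assumes all DIMMs have the same capacity (16GB here).
    """
    shortfall = installed_gb - detected_gb
    if shortfall < 0 or shortfall % dimm_gb != 0:
        raise ValueError("shortfall is not a whole number of DIMMs")
    return shortfall // dimm_gb


# 8 x 16GB installed, POST reported 112GB
print(missing_dimms(128, 112))
# 4 x 16GB installed, POST reported 48GB
print(missing_dimms(64, 48))
```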
This had me suspecting either a motherboard problem or the integrated memory controller of one of the L5640 CPUs. So I swapped CPU1 with CPU2 to see if the problem would follow the CPU to socket 1 instead of staying with socket 2. I also inspected the LGA1366 sockets for bent pins and could not see any. The problem remained with socket 2 (same error as #4 above), so it is likely not a CPU issue.
Just to eliminate the possibility, I also swapped the PSUs with known-good spares. The PSUs all worked and the symptoms above remained.
So, should I conclude a motherboard issue? Are there any other possibilities? Suggestions?