system issues: need advice or confirmation of my diagnosis

BLinux · Oct 1, 2017

I have a 2U Supermicro 825 system with a X8DT-6F w/ 2x L5640, and 128GB of RAM (8x16GB PC3L-8500R ECC). It's a lab/testing system that's usually powered off. I recently powered it on and it was behaving oddly.

At first, during POST only 112GB were recognized. When the system began booting the OS, I encountered several odd symptoms:

1) system booting extremely slowly.... and would hang at various points
2) spontaneously reboots
3) I sometimes see the message: "CMCI storm detected: switching to poll mode"
4) during POST, sometimes I would see "Uncorrectable ECC error CPU2: DIMM1A"
5) during POST, other times I would see "Uncorrectable ECC error CPU2: DIMM2A"

Due to seeing #4 and #5, one of the first things I did was re-seat P2-DIMM1A and P2-DIMM2A; the system then recognized 128GB during POST, but the other symptoms remained. I then swapped DIMM1A with another DIMM in the system to see if the problem would follow the DIMM; it did not. I did the same with DIMM2A, but that error didn't occur frequently and I still saw the ECC error on DIMM1A. I swapped all 8 DIMMs in various permutations and the system still exhibited the same errors; I found it unlikely that all 8 DIMMs would go bad at the same time.

I then took out all 8 DIMMs, and installed only 2 DIMMs (P1-DIMM1A + P2-DIMM1A), one for each CPU. POST saw 32GB, the system booted up perfectly, and I ran a few benchmarks without any errors for several minutes; the system seems stable. So, next I added a 2nd pair of DIMMs, for a total of 4x DIMMs and 64GB, but POST only saw 48GB and many of the symptoms above returned once I populated P1-DIMM2A and P2-DIMM2A. I removed the P1/P2-DIMM2A pair, and the system was stable again. Just to confirm, I swapped the P1/P2-DIMM2A pair with another pair of DIMMs, and the error returned. So, I'm doubtful my problem is a bad DIMM at this point.
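As a quick sanity check on the capacities reported during POST, the shortfall can be converted into a count of DIMMs that went unrecognized. This is only a minimal sketch; the function name `missing_dimms` is made up for illustration, and it assumes every module is the same 16GB size as described above.

```python
def missing_dimms(installed_count, dimm_size_gb, detected_gb):
    """Return how many equally sized DIMMs POST failed to count."""
    expected_gb = installed_count * dimm_size_gb
    shortfall_gb = expected_gb - detected_gb
    # A negative or fractional result means the inputs do not describe
    # a set of identical modules, so refuse to guess.
    if shortfall_gb < 0 or shortfall_gb % dimm_size_gb != 0:
        raise ValueError("detected capacity is not a whole number of DIMMs short")
    return shortfall_gb // dimm_size_gb


# (installed DIMMs, GB seen at POST) for the three configurations in the post
for installed, detected in [(8, 112), (2, 32), (4, 48)]:
    print(installed, "DIMMs installed,", detected, "GB detected ->",
          missing_dimms(installed, 16, detected), "DIMM(s) not recognized")
```

In both failing configurations the shortfall works out to exactly one 16GB module, which fits a single slot or channel dropping out rather than several modules failing at once.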

This had me suspecting either a motherboard problem or the integrated memory controller of the L5640 CPUs. So, I swapped CPU1 with CPU2 to see if the problem might follow the CPU to socket 1 instead of socket 2. I did inspect the LGA1366 sockets to see if there were any bent pins; none that I could see. The problem remained with socket 2 (same error #4 above). So, likely not a CPU issue.
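The swap tests so far follow a simple elimination rule: if the fault moves with a component when it is relocated, that component is the suspect; if the fault stays put for every component swapped, what remains is whatever did not move. A minimal sketch of that reasoning is below. The function name and the wording of the result strings are hypothetical, and the rule is a simplification: a fault that stays with the socket could still be something like contamination or a bad contact rather than a dead board.

```python
def remaining_suspects(swap_results):
    """swap_results maps a component name to True if the fault moved with it."""
    followed = [part for part, moved in swap_results.items() if moved]
    if followed:
        # The fault travelled with at least one swapped part.
        return followed
    # Nothing that was swapped took the fault with it, so suspect
    # the parts that stayed in place.
    return ["motherboard (slot, socket, or traces)"]


# Results described in the post: neither the DIMMs nor the CPUs
# carried the error to a new location.
print(remaining_suspects({"DIMM": False, "CPU": False}))
```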

Just to eliminate the possibility, also swapped the PSUs with known good spares. The PSUs all worked and the symptoms above remained.

So, should I conclude a motherboard issue? Are there any other possibilities? Suggestions?

Terry Kennedy · Oct 1, 2017

BLinux said:
I have a 2U Supermicro 825 system with a X8DT-6F w/ 2x L5640, and 128GB of RAM (8x16GB PC3L-8500R ECC).

That MB part number seems short. X8DTH-6F, perhaps?

I then took out all 8 DIMMs, and installed only 2 DIMMs (P1-DIMM1A + P2-DIMM1A), one for each CPU. POST saw 32GB, the system booted up perfectly, and I ran a few benchmarks without any errors for several minutes; the system seems stable. So, next I added a 2nd pair of DIMMs, for a total of 4x DIMMs and 64GB, but POST only saw 48GB and many of the symptoms above returned once I populated P1-DIMM2A and P2-DIMM2A. I removed the P1/P2-DIMM2A pair, and the system was stable again. Just to confirm, I swapped the P1/P2-DIMM2A pair with another pair of DIMMs, and the error returned. So, I'm doubtful my problem is a bad DIMM at this point.

I was building some X8DTH-iF systems a couple years ago, with the recommended Hynix memory from the Supermicro compatibility list. It was only available used, and I ordered 40 or so sticks (12 per server, plus spares). I used to just throw 12 sticks in, close the lid, and run Memtest86+ for a few days, since that's what I did years ago when I bought the exact same memory new ($500 a piece).

I had a system that was just NOT happy - it would either throw errors in POST or Memtest86+ would report either errors or the wrong amount of memory in bizarre ways (like multiple modules all at address 0 on one CPU). It would repeatedly flag one or more sticks as bad, but when I swapped them, sometimes the problem would move and sometimes it wouldn't. What I wound up doing was starting with a single stick in CPU 1 and CPU 2, memtest, add a second stick and test, and so on. I found 3 of 12 sticks in that system were bad, including some that didn't play well with others - they'd work fine as the only stick, but when a second stick was added, the second one was reported as failed even though the first one was the actual culprit.

I finally got the system to behave consistently, if not correctly - the BIOS would report 96GB while Memtest86+ would report 8GB more than the BIOS, and FreeBSD would report 8GB less than the BIOS. The system ran solidly for a few months until I worked up the energy to tear back into it. What I found was that all of the memory was labeled 8GB DDR3 1066 (registered, ECC, 2Rx4) on the factory labels on the heat spreaders. However, one of the parts was actually a 16GB DDR3 800 (registered, ECC, forget if it was dual or quad rank). I had to find that out by looking at the dmidecode output. Depending on what was talking to it (BIOS, Memtest86+ or FreeBSD), each had their own ideas of what to do in this bizarre, unsupported situation. Apparently somewhere along the line before I got the part, somebody took the correct heat spreader off for some reason and then put the wrong one back on.

The moral of the story is that each stick of RAM needs to be tested individually as well as with all of the others, and look at the dmidecode or SPD data to make sure they are all really identical. If you have a mix of different chip revisions (A/B/etc.), try to populate one CPU with all of one kind and the other CPU with all of the others. If that isn't possible, at least try to get all 3 channels the same. In theory, A chip or B chip shouldn't matter - but since the module manufacturers often use that as a piece of the part number, it probably does matter in some cases.
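For anyone wanting to automate the "are they all really identical" check, here is a minimal sketch that groups modules by size, speed, and part number. It assumes text laid out like the output of `dmidecode -t memory` ("Memory Device" blocks with `Key: Value` lines); the sample text, part numbers, and function names are invented for illustration, and real output has more fields and may use different units.

```python
from collections import defaultdict

# Hypothetical sample mimicking the layout of `dmidecode -t memory`.
SAMPLE = """\
Memory Device
    Size: 8192 MB
    Locator: P1-DIMM1A
    Speed: 1066 MHz
    Part Number: EXAMPLE-PART-A

Memory Device
    Size: 16384 MB
    Locator: P1-DIMM2A
    Speed: 800 MHz
    Part Number: EXAMPLE-PART-B

Memory Device
    Size: No Module Installed
    Locator: P1-DIMM3A
"""

FIELDS = ("Size", "Speed", "Part Number")


def parse_memory_devices(text):
    """Split the text into one dict of key/value pairs per Memory Device block."""
    devices, current = [], None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "Memory Device":
            current = {}
            devices.append(current)
        elif current is not None and ":" in stripped:
            key, _, value = stripped.partition(":")
            current[key.strip()] = value.strip()
        elif not stripped:
            current = None  # a blank line ends the block
    return devices


def group_by_signature(devices):
    """Map (size, speed, part number) to the slots holding such a module."""
    groups = defaultdict(list)
    for dev in devices:
        if dev.get("Size", "").startswith("No Module"):
            continue  # skip empty slots
        signature = tuple(dev.get(field, "?") for field in FIELDS)
        groups[signature].append(dev.get("Locator", "?"))
    return dict(groups)


groups = group_by_signature(parse_memory_devices(SAMPLE))
for signature, locators in groups.items():
    print(signature, "->", locators)
if len(groups) > 1:
    print("Mixed modules detected; check the odd ones out")
```

More than one group means at least one stick differs from the rest, regardless of what the heat spreader label says.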

Since I do a lot of hacking on the Supermicro chassis (including cutting the MB power cables to length and crimping new pins on) I put the motherboard in (I have a carton of X8DTH-iF, so I'm not too concerned if I blow one up, though that has never happened so far) and a single sacrificial E5520 and 8GB stick in CPU1 and then give it the "smoke test" before I put any valuable parts in there.

[Lots more here.]

BLinux · Oct 2, 2017

Terry Kennedy said:
That MB part number seems short. X8DTH-6F, perhaps?

No, put the dash in the wrong place. It is a X8DT6-F, single IOH.

Terry Kennedy said:
I was building some X8DTH-iF systems a couple years ago, with the recommended Hynix memory from the Supermicro compatibility list. It was only available used, and I ordered 40 or so sticks (12 per server, plus spares). I used to just throw 12 sticks in, close the lid, and run Memtest86+ for a few days, since that's what I did years ago when I bought the exact same memory new ($500 a piece).

I had a system that was just NOT happy - it would either throw errors in POST or Memtest86+ would report either errors or the wrong amount of memory in bizarre ways (like multiple modules all at address 0 on one CPU). It would repeatedly flag one or more sticks as bad, but when I swapped them, sometimes the problem would move and sometimes it wouldn't. What I wound up doing was starting with a single stick in CPU 1 and CPU 2, memtest, add a second stick and test, and so on. I found 3 of 12 sticks in that system were bad, including some that didn't play well with others - they'd work fine as the only stick, but when a second stick was added, the second one was reported as failed even though the first one was the actual culprit.

I finally got the system to behave consistently, if not correctly - the BIOS would report 96GB while Memtest86+ would report 8GB more than the BIOS, and FreeBSD would report 8GB less than the BIOS. The system ran solidly for a few months until I worked up the energy to tear back into it. What I found was that all of the memory was labeled 8GB DDR3 1066 (registered, ECC, 2Rx4) on the factory labels on the heat spreaders. However, one of the parts was actually a 16GB DDR3 800 (registered, ECC, forget if it was dual or quad rank). I had to find that out by looking at the dmidecode output. Depending on what was talking to it (BIOS, Memtest86+ or FreeBSD), each had their own ideas of what to do in this bizarre, unsupported situation. Apparently somewhere along the line before I got the part, somebody took the correct heat spreader off for some reason and then put the wrong one back on.

The moral of the story is that each stick of RAM needs to be tested individually as well as with all of the others, and look at the dmidecode or SPD data to make sure they are all really identical. If you have a mix of different chip revisions (A/B/etc.), try to populate one CPU with all of one kind and the other CPU with all of the others. If that isn't possible, at least try to get all 3 channels the same. In theory, A chip or B chip shouldn't matter - but since the module manufacturers often use that as a piece of the part number, it probably does matter in some cases.

I know what you're saying, I've run into the memory quirkiness too. If this was a system I was assembling from a bunch of fresh off ebay parts, I would be more skeptical about the DIMMs. But I already tested all the components of this system near the beginning of this year and had everything running stable for a while; although this system does not run 24/7, I've powered it on to run various tests or to experiment with things many times since it was first put together without all the symptoms described in the OP. These symptoms just started.

Also, it's not just that the memory occasionally isn't recognized; the system is encountering a lot of errors, which is how I'm interpreting the CMCI storm messages. That would also explain why it runs as slowly as a 33MHz 486.
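To put a rough number on how often the storm message shows up, the kernel log text can simply be scanned for it. This is a minimal sketch: the marker string is the message quoted in the first post, while the sample log lines, timestamps, and the function name are made up for illustration.

```python
STORM_MARKER = "CMCI storm detected: switching to poll mode"

# Hypothetical kernel log excerpt; only the marker text comes from the thread.
SAMPLE_LOG = """\
[   12.345678] CMCI storm detected: switching to poll mode
[   13.000000] some unrelated kernel message
[   45.678901] CMCI storm detected: switching to poll mode
"""


def count_storms(log_text):
    """Count log lines that contain the CMCI storm message."""
    return sum(1 for line in log_text.splitlines() if STORM_MARKER in line)


print("storm messages:", count_storms(SAMPLE_LOG))
```

On a live system the text could come from saved `dmesg` output; comparing the count between a stable DIMM configuration and a failing one gives a crude measure of how much worse the corrected-error rate gets.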

Thinking more about this last night, I think I'm going to see if I can isolate it down to a particular physical DIMM slot. I seem to run into issues when I populate P1-DIMM2A and P2-DIMM2A. I suspect the problem is only with P2-DIMM2A, and I haven't yet tried populating just 3 DIMMs, with P1-DIMM2A filled and P2-DIMM2A left empty. I may also try populating every slot except P2-DIMM2A, even though the manual doesn't recommend it, to see if the stability issues go away. If it is isolated to P2-DIMM2A, I may remove the motherboard to see if I can trace it to a broken solder joint or something similar that I may be able to fix. Otherwise, I may try a replacement motherboard.

BLinux · Oct 3, 2017

Well, there is definitely something wrong with the motherboard I think. I can populate all the DIMMs on CPU1 and only P2-DIMM1A on CPU2 and it's pretty stable. As soon as I put a 2nd DIMM for CPU2, in any of the other slots, not just P2-DIMM2A, it starts to act up. Even memtest86+ got strange... it didn't report any errors, but came to a crawl and almost froze (the "running" baton that normally spins doesn't move for several minutes at a time).

I even tried an entirely new set of DIMMs, same issue specific to CPU socket2 DIMMs. Just to confirm, I took all the DIMMs that this motherboard didn't like, and put them in my 2nd test system with a X8DTH-6F motherboard (vs the X8DT6-F that's giving me problems) and everything was peachy. memtest86+ didn't complain about any of those DIMMs when they are in the X8DTH-6F and didn't come to a crawl like it did in the X8DT6-F.

So, it definitely seems like the motherboard has gone bad somehow while it was sitting in the rack. I haven't pulled the board out to examine it yet.

Terry Kennedy · Oct 3, 2017

BLinux said:
Well, there is definitely something wrong with the motherboard I think. I can populate all the DIMMs on CPU1 and only P2-DIMM1A on CPU2 and it's pretty stable. As soon as I put a 2nd DIMM for CPU2, in any of the other slots, not just P2-DIMM2A, it starts to act up. Even memtest86+ got strange... it didn't report any errors, but came to a crawl and almost froze (the "running" baton that normally spins doesn't move for several minutes at a time).

I even tried an entirely new set of DIMMs, same issue specific to CPU socket2 DIMMs. Just to confirm, I took all the DIMMs that this motherboard didn't like, and put them in my 2nd test system with a X8DTH-6F motherboard (vs the X8DT6-F that's giving me problems) and everything was peachy. memtest86+ didn't complain about any of those DIMMs when they are in the X8DTH-6F and didn't come to a crawl like it did in the X8DT6-F.

That sure sounds like a problem with a CPU socket pin. When you pull the board, look at the CPU2 socket more closely. I don't know how that would go bad with the system just sitting, though. Another thing to check would be if both EPS12V power cables are connected to the board (if it has 2 connectors). Still don't know why that would be a new problem.

nthu9280 · Oct 4, 2017

Terry Kennedy said:
That sure sounds like a problem with a CPU socket pin. When you pull the board, look at the CPU2 socket more closely. I don't know how that would go bad with the system just sitting, though. Another thing to check would be if both EPS12V power cables are connected to the board (if it has 2 connectors). Still don't know why that would be a new problem.

I recently ran into a somewhat similar issue. I have an S2600CP2J that worked fine with E5-2670 v1 chips. I replaced the CPUs with 2660v2 and started seeing a similar error on 2 DIMM slots on one of the CPUs. The error stayed with the slots when I swapped the memory. I took the processor out and noticed the contact pads were murky. I cleaned them with a surgical alcohol wipe and the problem went away. I ran Memtest86 & Prime95 for over a day each and didn't see any issues. I knew the MB / CPU socket was good.

BLinux · Oct 4, 2017

Terry Kennedy said:
That sure sounds like a problem with a CPU socket pin. When you pull the board, look at the CPU2 socket more closely. I don't know how that would go bad with the system just sitting, though. Another thing to check would be if both EPS12V power cables are connected to the board (if it has 2 connectors). Still don't know why that would be a new problem.

I did take a look at the socket when I swapped CPU1 and CPU2, but I'll take a closer look again. Maybe bring a flash light and a hi-res camera and take a photo so I can zoom in on the big screen.

nthu9280 said:
I recently ran into a somewhat similar issue. I have an S2600CP2J that worked fine with E5-2670 v1 chips. I replaced the CPUs with 2660v2 and started seeing a similar error on 2 DIMM slots on one of the CPUs. The error stayed with the slots when I swapped the memory. I took the processor out and noticed the contact pads were murky. I cleaned them with a surgical alcohol wipe and the problem went away. I ran Memtest86 & Prime95 for over a day each and didn't see any issues. I knew the MB / CPU socket was good.

That's interesting... so I guess similar symptoms can arise with bad contact between the CPU pads and the pins? In my case, I already swapped CPU1 and CPU2 and the symptoms stayed with socket 2; so I'm guessing it's not the CPU side (asking)?

If I can find a way to fix this without replacing the motherboard that would be great... I'd rather not throw $100 at a "test" system that only gets used once a month or so for a few hours.

BLinux · Oct 8, 2017

Well, so I double checked the two EPS power connections and they appear to be okay.

I then removed cpu2 (formerly cpu1) and took a photo of the socket. At first glance, it looks ok... But then I noticed something:

Looks like some sort of very thin thread touching the pins in the lower right section. When I looked up the LGA 1366 pin-out, that area is used for DRAM channels 0-2.

I haven't actually done anything other than take the picture. I may get a blow gun and just blow that stuff out and try again to see if the system behaves normally again.

Rand__ · Oct 8, 2017

Looks like a hair or strand of fiber. Weird that it got there but might very well be the culprit.
Else - have you cleaned out the DIMM slots with air? Had an issue once with memory where cleaning them out with air helped ...

traderjay · Oct 8, 2017

Did you check to ensure all the standoffs are in the right position? SM motherboards don't follow the standard mounting hole templates. In my first SM workstation build, I foolishly assumed that it used the default ATX mounting holes, only to realize that one standoff was shorting out one DIMM slot, causing memory errors and boot errors. Removing that standoff solved all the problems.

BLinux · Oct 8, 2017

traderjay said:
Did you check to ensure all the standoffs are in the right position? SM motherboards don't follow the standard mounting hole templates. In my first SM workstation build, I foolishly assumed that it used the default ATX mounting holes, only to realize that one standoff was shorting out one DIMM slot, causing memory errors and boot errors. Removing that standoff solved all the problems.

i don't think that's the case. the system was working for several months until this started. if I had such an issue, i would think I wouldn't have seen months of stability until now.

T_Minus · Oct 8, 2017

Rand__ said:
Looks like a hair or strand of fiber. Weird that it got there but might very well be the culprit.
Else - have you cleaned out the DIMM slots with air? Had an issue once with memory where cleaning them out with air helped ...

Are you 100% on this? Did you see something come out of them? If so do you know what it was?

If not, my thought is simply re-installing the RAM fixed your issue. I've had this occur a couple times.
I have also seen thermal paste sitting 'on top', luckily I used tweezers and removed prior to installing RAM.

Rand__ · Oct 9, 2017

No, didn't see anything, assumed it was just dust (though maybe a fine hair could have been in there too).
Had swapped 8 modules back and forth any way I could combine them before, no help. Canned air fixed it.
Not sure it was really it but it's done in 10 seconds and won't hurt (unless you buy the wrong kind of canned air).

BLinux · Oct 9, 2017

Rand__ said:
No, didn't see anything, assumed it was just dust (though maybe a fine hair could have been in there too).
Had swapped 8 modules back and forth any way I could combine them before, no help. Canned air fixed it.
Not sure it was really it but it's done in 10 seconds and won't hurt (unless you buy the wrong kind of canned air).

I have a very quiet air compressor I got a few years back that I use with an airgun at 75psi to clean computers; very effective. But, there are some cases where I use an old toothbrush to loosen up the dust build up before using the airgun.

BLinux · Oct 9, 2017

well, powered up the unit tonight and same exact symptoms. doesn't appear that removing that thread in the socket made any difference. took the motherboard out and examined the underside but couldn't see any obvious signs of what might be wrong.

at this point, i think i'm just going to try to find a replacement. it's been taking up too much of my time already. too bad though, hate throwing away stuff that might otherwise be repaired.
