Supermicro X8DT6-F thermal alarm with 2nd CPU in CPU1 socket?

BLinux

cat lover server enthusiast
Jul 7, 2016
2,694
1,096
113
artofserver.com
I recently picked up a dirt-cheap Supermicro 836 with an X8DT6-F motherboard in unknown condition. I've been testing it out and upgrading BIOS/firmware/IPMI/etc. All seemed well while I had a single CPU in the CPU2 socket. Today I started testing the other CPU socket (CPU1), but the system will not boot. Instead, I get a red thermal alarm light and all fans spin at full speed. If I remove the CPU from the CPU1 socket, everything works again. I thought it might be CPU related, so I swapped CPUs, but the behavior is the same: as long as the CPU1 socket is unoccupied everything works, but once any CPU is in the CPU1 socket, the system doesn't boot and the red thermal alarm light is on.

I examined the CPU1 socket and did notice a single pin that was a bit out of alignment; it wasn't touching another pin, but it was close. I used some tweezers to adjust it and tried again, but unfortunately the symptoms are the same.

Does anyone have any hints on how I might be able to fix this? So far, this is the main problem with the system. How does the thermal alarm work? Where is this sensor? Is there a particular pin in the LGA 1366 socket that pertains to this?
 

anlin

New Member
Dec 8, 2016
29
8
3
31
Have you tried a different memory configuration? I had an X8DTE-F (a model very close to yours; I believe they even share the same BIOS) that refused to boot with certain memory configurations when both sockets were populated. Does the system boot with 2 CPUs but no memory in the CPU1 slots?
 

BLinux

Right now I'm just testing minimal configurations, and the problem shows up even there. With only 1 DIMM on CPU2, it boots fine. Put a CPU in CPU1 and there's no boot, a thermal warning, and a blank screen. Put a single DIMM into the CPU1 slots as well, same thing.
 

anlin

Have I understood correctly that the system doesn't boot with just CPU1 installed? If so, it is starting to sound like socket 1 is toast :( I don't think the thermal alarm light actually means something is overheating. I would try examining the socket once more to make sure there are no bent or missing pins. I would also take a look at IPMI with the system powered up. If I recall correctly, the X8DTx-F boards have a POST code readout in the IPMI web interface. It might give a clue as to what the problem is.

On a somewhat unrelated note: how did your BIOS updates go? No problems, I assume? I'm asking because the official DOS tool comprehensively corrupted the BIOS on both of my X8DTE-F boards. I thought the first time was a fluke or that I had done something wrong, but the second time I made sure to follow the instructions to the letter and ended up with a dead board anyway. The BIOS (on both boards) got messed up to the point where no recovery method worked, and I ended up having to manually reprogram the SPI BIOS chip.
 

BLinux

Yeah, good point. I'm going to check the IPMI interface while powering up with the problem present, to see if it gives me any additional clues. I haven't tried it with a single CPU in socket 1; might be worth trying out.

Regarding the BIOS update: no problem whatsoever. The BIOS update, IPMI FW update, and onboard LSI SAS2008 IT mode firmware update all went perfectly. BIOS corruption sounds bad... would never want that. Maybe there was a corrupted BIOS image at some point? When was this? BTW, how did you manually reprogram the BIOS chip?
 

anlin

Have you also made sure both CPUs are the same model? The model of both CPUs should be exactly the same in a multi-socket setup.

Good to hear that your BIOS updates went smoother than mine did. I used a Bus Pirate and flashrom to manually flash the (thankfully removable) BIOS memory chip. I updated the first board about 6 months ago, the second board maybe a month after that. I know for a fact the image I used to update the BIOS was good, since manually flashing the exact same image fixed both boards. The two systems were updated completely independently; there was no common hardware that could have caused the issue. Furthermore, on both boards the DOS updater also successfully verified the contents of the BIOS chip after flashing (but resulted in a corrupt BIOS anyway). My best guess at this point is that the DOS tool updated only parts of the BIOS and not the entire 4M image, and that the issue I was seeing was a compatibility issue between (parts of) the old and the new BIOS.
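
One way to check the partial-update guess, sketched under assumptions: read the chip back after the DOS tool has run (flashrom can do this through a Bus Pirate, e.g. `flashrom -p buspirate_spi:dev=/dev/ttyUSB0 -r dump.bin`, where the device path is an assumption) and compare the dump with the vendor image block by block. The function name `differing_blocks` and the 64 KiB block size are illustrative choices, not anything from the Supermicro tooling.

```python
def differing_blocks(old: bytes, new: bytes, block_size: int = 64 * 1024):
    """Return (offset, length) for every block in which the two images differ.

    Hypothetical helper: `old` would be a dump read back from the chip,
    `new` the image that was supposed to be written.
    """
    if len(old) != len(new):
        raise ValueError("images differ in size; compare same-sized dumps")
    diffs = []
    for offset in range(0, len(old), block_size):
        chunk_old = old[offset:offset + block_size]
        chunk_new = new[offset:offset + block_size]
        if chunk_old != chunk_new:
            diffs.append((offset, len(chunk_old)))
    return diffs


def load_image(path: str) -> bytes:
    """Read a whole flash image from disk."""
    with open(path, "rb") as fh:
        return fh.read()


# Example use (file names are assumptions):
#   for offset, length in differing_blocks(load_image("dump.bin"),
#                                          load_image("vendor.bin")):
#       print(f"0x{offset:06x}-0x{offset + length - 1:06x} differs")
```

Some regions (stored settings, for example) may legitimately differ between a live chip and a vendor image, so a few differing blocks are not proof of a partial update by themselves.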
 

BLinux

Ok, so the problem persists and here's an update:

1) With CPU in CPU2 socket, empty CPU1 socket, IPMI interface shows sensor readings and everything is normal other than no readings from CPU1 socket.

2) With CPU in CPU1 *AND* CPU2 socket, the problem happens; no boot and thermal alarm LED is red. IPMI interface shows no sensor readings at all, in fact when I refresh the sensor page, it hangs. None of the webpages on the IPMI web gui work; not even the front page MAC address information shows up; everything is blank.

3) With the CPU that was in CPU2 socket, when I move it to CPU1 socket and leave CPU2 socket empty, I get the same problem; no boot and thermal alarm LED is red.

4) With the CPU that was originally in CPU1 socket, I moved it to CPU2 socket, leaving CPU1 socket empty, and the system boots perfectly. That confirms both CPUs (the same ones as before) work, and work fine as long as they are in the CPU2 socket with the CPU1 socket left empty.

So I think something is definitely amiss in the CPU1 socket. I've taken another look at the socket pins, but nothing stands out yet. Attached is a photo of the problematic socket.

IMG_1253.JPG
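
The reasoning behind the four tests above can be written down as a small elimination sketch: anything that took part in a successful boot is cleared, and whatever is left is suspect. The labels "A" and "B" for the two physical CPUs and the function name `suspects` are hypothetical, and "socket" here stands for everything that only comes into play when that socket is populated, not just the pins.

```python
# The four observations from the list above.
# "A" = CPU originally in the CPU2 socket, "B" = CPU originally in CPU1
# (labels are made up for illustration).
OBSERVATIONS = [
    {"cpu1": None, "cpu2": "A", "boots": True},   # test 1
    {"cpu1": "B", "cpu2": "A", "boots": False},   # test 2
    {"cpu1": "A", "cpu2": None, "boots": False},  # test 3
    {"cpu1": None, "cpu2": "B", "boots": True},   # test 4
]

SOCKETS = ("cpu1", "cpu2")


def suspects(observations):
    """Return (suspect_cpus, suspect_sockets) by simple elimination.

    A CPU or socket is cleared once it has been part of a configuration
    that booted; everything that was used but never cleared stays suspect.
    """
    used_cpus, used_sockets = set(), set()
    good_cpus, good_sockets = set(), set()
    for obs in observations:
        for socket in SOCKETS:
            cpu = obs[socket]
            if cpu is None:
                continue
            used_cpus.add(cpu)
            used_sockets.add(socket)
            if obs["boots"]:
                good_cpus.add(cpu)
                good_sockets.add(socket)
    return sorted(used_cpus - good_cpus), sorted(used_sockets - good_sockets)


print(suspects(OBSERVATIONS))
```

With these observations both CPUs are cleared and only `cpu1` remains suspect, which matches the conclusion drawn above.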
 

Terry Kennedy

Well-Known Member
Jun 25, 2015
1,143
597
113
New York City
www.glaver.org
It doesn't have to be a socket problem - it could be something that is only active when there is a CPU in the socket. For example, the VRM for that socket doesn't do anything if there is no CPU. Based on what you're seeing, it looks like it is either in the SMBus (since that's where the IPMI gets its sensor data) or the whole IPMI is getting knocked offline (if the heartbeat LED stops blinking).
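
Since the web GUI hangs in the failing configuration, it may be worth asking the BMC for sensor data directly, with a timeout. The following is a minimal sketch that assumes `ipmitool` is installed and the BMC is reachable over the LAN; the host, user, and password arguments are placeholders, and the sample table is made up for illustration (real sensor names and values on this board will differ).

```python
import subprocess


def parse_sensor_table(text):
    """Parse pipe-separated `ipmitool sensor` rows into {name: (reading, status)}."""
    sensors = {}
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 4 or not fields[0]:
            continue  # skip blank or malformed rows
        sensors[fields[0]] = (fields[1], fields[3])
    return sensors


def unreadable(sensors):
    """Names of sensors whose reading column is 'na'."""
    return sorted(name for name, (reading, _) in sensors.items() if reading == "na")


def read_sensors(host, user, password, timeout=20):
    """Query the BMC; raises subprocess.TimeoutExpired if it does not answer in time."""
    cmd = ["ipmitool", "-I", "lanplus", "-H", host,
           "-U", user, "-P", password, "sensor"]
    result = subprocess.run(cmd, capture_output=True, text=True,
                            timeout=timeout, check=True)
    return parse_sensor_table(result.stdout)


# Made-up sample in the usual column layout, NOT captured from this board.
SAMPLE = """\
System Temp      | 31.000     | degrees C  | ok    | na | na | na | 75.000 | 77.000 | 79.000
CPU1 Temp        | na         | discrete   | na    | na | na | na | na | na | na
FAN 1            | 4050.000   | RPM        | ok    | 400.000 | 585.000 | 770.000 | 29260.000 | 29815.000 | 30370.000
"""

print(unreadable(parse_sensor_table(SAMPLE)))
```

A timeout from `read_sensors` would suggest the whole BMC is unresponsive, while a table full of `na` readings would point more toward the sensor bus.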

I see 2 X8DT6-F boards on eBay for $89. I'd probably go with the one that has a pair of E5506 CPUs installed, just because CPUs make good "packing material" if the seller doesn't have the official Foxconn blank fillers for the CPU sockets.

On the other hand, this may give you an excuse to migrate to a generation (or two!) newer motherboard / CPU(s).
 

BLinux

@Terry Kennedy That's a good point. I have no need to replace this board, it came with a very cheap 836 chassis as a bonus. I was hoping it would work, but that's not the case it seems.
 

sfbayzfs

Active Member
May 6, 2015
259
145
43
SF Bay area
The issue could also be a bad RAM slot. Have you tried the RAM stick in other slots while a CPU is in the socket that appears to be bad?

Hmmm, I have been planning to start a bad ram slots thread...
 

BLinux

I've dealt with some bad RAM slot issues too, but usually they don't completely prevent the system from POSTing. In this case, the system will not even POST: blank screen for 15 minutes or more, thermal alarm on, etc.
 

njoyner

New Member
May 6, 2017
4
0
1
Los Angeles
Last year I had what sounds like the same problem on a similar Supermicro board. Like you, I went through quite a few combinations of memory, etc. What was really strange is that on this board the problem happened with the 56xx parts, but the 55xx parts worked fine. Eventually I inspected the sockets and did find a pin or two damaged on at least one socket, and couldn't repair it well enough to get the 56xx series to work on the board. I eventually gave up, as I needed the system to work with the hex-core parts, and without being able to debug into the BIOS or do some kind of open/short test on the socket it was looking difficult to root-cause. I think I found that there were some differences between the parts in the thermal-sensor-related pins and perhaps a couple of QPI pins or something.

I bought quite a few systems from this vintage, and after this and a couple of other flaky platforms I vowed not to spend too much time on any system that had strange behavior and bent pins on the socket. This vintage of CPU/socket seemed to have pretty fragile sockets that could be easily damaged, making the system board the failure point in quite a few cases.
 

BLinux

Thanks for sharing your experience... let me make sure I understand you:

1) with 55xx CPUs, the system worked with 2 CPUs?
2) with 56xx CPUs, the system only worked with 1 CPU and not both?

If that's right, I do have some 55xx CPUs I can try, to see whether that configuration works with dual CPUs.
 

sfbayzfs

I have definitely had the failure mode you describe (no POST, blank screen) with a bad RAM slot in some cases: any RAM in the bad slot makes the system never show video, and removing it lets it boot.