SM X9 mem issues with just 2 slots, UNLESS mem speed is locked to 1066


james23

Active Member
Nov 18, 2014
441
122
43
52
Hi, I've been troubleshooting a new build (everything but the RAM was bought used, as working) and am looking for others' input on these unusual issues. (I have built and managed 25+ SM setups over the years, so I have done all the troubleshooting I know of here.)

Quick question / issue:
(Throughout, I'm following the SM manual for RAM population order, unless I'm intentionally testing something.)

If I use CPU1's blue C1 and D1 slots in any way, I quickly get all kinds of memory-type issues (even with just 1x CPU installed).

If I only use CPU1's A1 and B1, and CPU2's E1, F1, G1, H1, E2, F2, G2, H2 = no issues (i.e. using 10x sticks): 28-hour memtest, 0 errors, multiple boots, no hint of a problem.

If I change the BIOS setting for mem speed from AUTO (which runs mem at 1333, as it should) to 1066, I have no issues and can use CPU1's blue C1 and D1 slots.

(When I say issues/problems, I mean the BIOS will lock up or loop (the IPMI event log shows CATERR at the freeze), or, if I can get to my UEFI memtest 7.2 pro, it quickly shows ECC errors, along with the IPMI event log showing ECC errors on CPU1's C1 and D1 slots. I have ONLY ever seen ECC errors on the C1 and D1 slots.)
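To back up the "only C1 and D1" observation across many test runs, the saved event log text can be tallied per slot. This is only a minimal sketch: the sample lines are hypothetical, and the assumption that each ECC line names the slot as "DIMM" followed by the channel letter and slot number (e.g. "DIMMC1") needs checking against what the BMC actually prints.

```python
import re
from collections import Counter

# Assumption: each log line names the slot as "DIMM" followed by a channel
# letter (A-H) and a slot number (1-3), e.g. "DIMMC1" or "DIMM C1".
SLOT_RE = re.compile(r"DIMM[\s_-]*([A-H][1-3])\b", re.IGNORECASE)


def tally_ecc_by_slot(lines):
    """Count lines that mention both ECC and a DIMM slot, keyed by slot."""
    counts = Counter()
    for line in lines:
        if "ecc" not in line.lower():
            continue  # skip CATERR and other non-ECC events
        match = SLOT_RE.search(line)
        if match:
            counts[match.group(1).upper()] += 1
    return counts


# Hypothetical sample lines, NOT real output from this board.
sample_log = [
    "Memory | Correctable ECC | DIMMC1",
    "Memory | Correctable ECC | DIMMD1",
    "Memory | Correctable ECC | DIMMC1",
    "Processor | CATERR | Asserted",
]

ecc_counts = tally_ecc_by_slot(sample_log)
for slot, count in sorted(ecc_counts.items()):
    print(f"{slot}: {count} ECC event(s)")
```

If a slot other than C1 or D1 ever shows up in the tally, that would point away from a problem specific to those two channels.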

(I have reseated the CPUs, found no bent pins on either socket, and inspected the whole board for visual issues. All testing is with no disks, no PCIe slots occupied, and just a USB stick with my memtest boot.) I have even swapped CPU1 with CPU2 (while running just 1 CPU) to verify the issue is not CPU1's fault. I also reset CMOS per the manual, and tried running on just 1 PSU (tried both PSUs, one at a time).

(I'm never mixing memory types/makes here; I only have 3x sets of memory so as to rule out memory issues.)

some specs:

CPU: 2x e5-2667 (2.9GHz, 3.5GHz , 6C +ht, L3:xM, 130W, MB microcode patch= 710)
MB: X9DR3-LN4F+ rev 1.10 (read from mb sticker) (BIOS = 3.2 latest as of jun2018)
PSU: 2x SM 720w
Case/Chassis: CSE-825 (2U, 8x 3.5" bays)

RAM options I'm using in the above:
8 x sticks of (from working server):
Elpida EBJ41HE4BAFA-DJ-E - 4GB PC3-10600 DDR3-1333H (9-9-9) ECC-Registered CL9 240-Pin DIMM Dual Rank
Datasheet: EBJ41HE4BAFA-DJ-E (ALLDATASHEET)

10 x sticks of (from working server):
Hynix 4GB PC3-10600 DDR3-1333MHz ECC Registered CL9 240-Pin DIMM Dual Rank Memory Module Mfr P/N HMT151R7BFR4C-H9-DB
https://www.skhynix.com/product/filedata/fileDownload.do?seq=2931

4x sticks of:
HMT151R7BFR8C-G7 DB AA-C (1066 DDR3 ECC, QUAD rank)
https://www.skhynix.com/product/filedata/fileDownload.do?seq=2937

A good link someone else pointed out, a Hynix memory P/N "decoder":
https://www.skhynix.com/static/filedata/fileDownload.do?seq=190

So a pretty basic SM setup, but with weird memory issues (only on the C1 and D1 slots). I realize the RAM sticks are not exact part numbers from the SM QVL for this MB, but they match up exactly with the specs of the QVL-listed part numbers (except for the 4x 1066 sticks), and I'm pretty confident all these sticks are good (you can never say 100% good with RAM LOL).

any input or ideas guys? thanks!
 

SavageWS6

Member
Feb 2, 2016
35
7
8
31
Pennsylvania
I recently had a RAM issue on my X9DRD-7LN4F-JBOD and I did everything the same as you, although I thought I had a dead RAM slot. I tested all my RAM and everything was good. I guess it wasn't a bad RAM slot.. or technically it was. tl;dr I sent it in to Supermicro and my chipset was defective. Per Supermicro:

"Where as your board has a defective chipset CHP-0465LH-00C1-INT - U3G1"

So possibly your chipset is defective, since you have already ruled out a ton of other possible causes.
 

james23

Interesting, thanks. That's definitely useful info at this point, as it seems very similar to my issue!
I will keep this updated with whatever I find, or with my resolution.
 

SavageWS6

No problem. In all my years of building and fixing, I've never seen a chipset fail, especially an Intel chipset like the Intel 602J I had. Hope you figure something out and the final answer isn't a chipset-related problem. I'm on the search for a new board now, possibly.

EDIT: The RAM I was using which was working fine and tested was

- SAMSUNG M393B1K70CH0-CH9Q5
- Hynix HMT351R7BFR4C-H9
 

zir_blazer

Active Member
Dec 5, 2016
356
128
43
The more memory modules you have and/or the more ranks on each module, the lower the maximum frequency you can clock them at. It makes sense that this is your problem, since you say they work fine at 1066 MHz but not at 1333, so it's possible that you're putting too much stress on those Sandy Bridge IMCs (Integrated Memory Controllers) if the board is fully or almost fully populated.
Check page 2-14 of your motherboard manual for Intel E5-2600 Series Processor RDIMM Memory Support. Depending on the number of DIMMs per channel and the number of ranks in each module, you can officially clock them at 1600, 1333, or 1066, and there are even some configurations that can't go over 800 MHz. Not only that, you are mixing Dual Rank and Quad Rank modules, and the Quad Rank ones are rated for 1066 MHz, so you are overclocking them. Maybe you can get things working if you rearrange the modules so that the Quad Rank ones are alone in their own channel, but I'm not sure how much work that would involve.

The fact that your Frankenstein works makes me think that certified memory lists are stupidly overrated.
 

james23

thanks for the reply zir,
I'm fine with running the system at 1066 full time versus 1333 (however, I do need to do longer memtest runs at 1066 to be 100% sure that it's fully stable there too; for now it just "seems" to be more stable, based on a 12-14h memtest run at 1066 vs. the failures at 1333).
The reason I'm hung up on it not working at 1333 (among other issues) is that all indications and specs say it should be fine at 1333, which to me indicates there are HW issues that just may not be making themselves clear right now (or with the relatively short and limited tests I can do before deployment). Once I can get past memtest successfully, I plan on moving on to other stability tests (like Windows or Linux test suites and Prime95 or SuperPi runs).
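For a rough sense of what running at 1066 instead of 1333 costs, here is a back-of-the-envelope sketch of theoretical peak bandwidth per channel. It assumes the nominal transfer rates (1333 and 1066 MT/s) and a 64-bit data bus per DDR3 channel; real-world throughput will be lower and depends on the workload.

```python
BUS_BYTES = 8  # one DDR3 channel carries 64 data bits (ECC bits not counted)


def peak_mb_per_s(mega_transfers_per_s):
    """Theoretical peak bandwidth of a single channel, in MB/s."""
    return mega_transfers_per_s * BUS_BYTES


fast = peak_mb_per_s(1333)  # nominal DDR3-1333 rate
slow = peak_mb_per_s(1066)  # nominal DDR3-1066 rate
loss = 1 - slow / fast

print(f"DDR3-1333: {fast} MB/s per channel")
print(f"DDR3-1066: {slow} MB/s per channel")
print(f"Peak bandwidth given up at 1066: {loss:.1%}")
```

So the downclock gives up roughly a fifth of peak per-channel bandwidth, which is why it works as a fallback but still leaves open why 1333 fails.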

2 points though:
1- I am certainly not mixing RAM types, nor dual and quad rank DIMMs. I was simply listing the various types of memory I have on hand to test with. In all of my testing, I've never tested with anything but consistent DIMM sets for that specific test run.

2- While your points definitely shed light if it turns out that 1066 is fully stable (i.e. for a 48+ hour memtest run), I also want to note that I'm far from fully populating the available RAM slots on this board. (The most I'm ever able to populate, in my tests, is eight slots, of a total of 24 RAM slots on the MB.)

Does this info change your opinion on possibly overloading my sandy bridge IMCs? thanks
 

zir_blazer

For some reason I thought that you tested with all 22 modules at once, even though you state that you never got it working @ 1333 MHz with only a single Processor and 4 modules.
What are you currently testing with? Did you test with both the Hynix and Elpida modules, 4 at a time @ 1333 MHz? I'm not exactly sure whether signal integrity could degrade in such a way that it is "good enough" to work @ 1066 MHz but not @ 1333 MHz anymore, be it the modules themselves, the CPU, or the motherboard wiring to the slots, but I would pay attention to that detail just in case. I'm mostly used to seeing RAM problems in enthusiast setups, where a processor may not like a module with certain DRAM ICs above a given frequency but can work with another type of DRAM IC. Since this is server-grade hardware, I would be surprised if these issues were also present here at conservative clock speeds.


According to the block diagram, each memory channel has a letter (A, B, C, D for CPU1, and E, F, G, H for CPU2), followed by the number of the DIMM slot on that channel (A1, A2, A3). It seems that the blue slots are all the first slot of a channel, like A1-B1-C1-D1. Did you try different combinations, like A2-B2-C2-D2 for CPU1? I don't know whether that would boot or whether using the first slot of a channel is enforced, but if you're out of ideas you may try that, or just populate C1 and D1 and check if it boots. Also pay attention to timings and voltage, just in case.
The way I suppose you may make it work is with 8 of the 1333 MHz Hynix on one CPU and the 8 Elpida on the other, at 2 DIMMs per channel. Theoretically, even your Sandy Bridge supports two 2Rx4 or 2Rx8 at 2 DPC at 1600 MHz, so there is plenty of IMC headroom...
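To keep track of which CPU and channel each test layout actually loads, the slot naming from the block diagram can be turned into a DIMMs-per-channel (DPC) count. A minimal sketch, assuming the naming described above (letter = channel, A-D on CPU1 and E-H on CPU2, digit = slot 1-3 on that channel); the function name is made up for illustration, and the example layout is the 10-stick configuration from the first post.

```python
from collections import defaultdict

CPU1_CHANNELS = "ABCD"
CPU2_CHANNELS = "EFGH"


def dimms_per_channel(slots):
    """Map (cpu, channel) -> number of populated slots on that channel."""
    counts = defaultdict(int)
    for slot in slots:
        channel, index = slot[0].upper(), int(slot[1:])
        if channel in CPU1_CHANNELS:
            cpu = 1
        elif channel in CPU2_CHANNELS:
            cpu = 2
        else:
            raise ValueError(f"unknown channel in slot name: {slot}")
        if not 1 <= index <= 3:
            raise ValueError(f"slot number out of range in: {slot}")
        counts[(cpu, channel)] += 1
    return dict(counts)


# The 10-stick layout reported as stable in the first post.
layout = ["A1", "B1", "E1", "F1", "G1", "H1", "E2", "F2", "G2", "H2"]
dpc = dimms_per_channel(layout)
for (cpu, channel), n in sorted(dpc.items()):
    print(f"CPU{cpu} channel {channel}: {n} DPC")
```

For that layout, CPU1 runs channels A and B at 1 DPC with C and D empty, while CPU2 runs all four channels at 2 DPC. The resulting DPC figure is what to look up in the manual's RDIMM support table.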
 

james23

In removing the MB today (to inspect underneath for shorts or other issues), I did notice something a bit weird: on the MB ATX connector, pin 7 and pin 8 look to have a fuse (or something) soldered between them. I've checked images online (and my own X8 and X7 SM boards I have here) and have not found this done anywhere else.
It does look like this was added "aftermarket", or maybe an SM RMA added it (doubtful, but possible)? The only thing I found online relating to pin 7 and pin 8 is a DIY article where the writer is converting an ATX PSU into a bench power supply, and he shows adding a resistor and LED between pin 7 and pin 8 to make a red light that shows when the PSU is ON.
pin 8 is PWR_OK (goes to 5 V when the PSU is running properly)
pin 7 is GND (ground).
article:
https://www.electronics-tutorials.ws/blog/convert-atx-psu-to-bench-supply.html

see my image below.

(BTW, I didn't find anything else unusual that would relate to my memory issues, i.e. no bent pins, no touching pins, and nothing to indicate the RAM pins were making contact with the chassis or anything else.) I've asked the seller to begin testing a replacement MB to ship out Monday (I've suggested he run memtest with 8x DIMMs, but I'm not sure how "thorough" his testing will be :/ )

bottomure.JPG
 

vrod

Active Member
Jan 18, 2015
241
43
28
31
The one thing that saved my ass with this board was to look at the DIMM Population guide. After that, I started dropping DIMMs into the server in pairs, plugging them in according to the Population Guide: boot the system with the DIMMs, shut it down again, then put in another pair.

It took a while until I had the 16 DIMMs in; it probably looked funny to the other guys at the DC. :D Since I did that, the system has been running completely without issues, almost 60 days at the time of writing.
 

james23

Well, looks like Supermicro is being a bit shady on this board model / revision. I found this link:

Bug 875194 – sbridge: HANDLING MCE MEMORY ERROR

which has several users with the exact same model board and the same board rev 1.10 as me, all having memory issues. For some, buying a replacement of the same board in rev 1.20 FIXED the mem issues. So I called up SM support, and the support rep (who wasn't very helpful) said that, according to his board rev change log, there were significant "memory performance improvements" from rev 1.10 to rev 1.20. The rep said I should get a rev 1.20 board, even though I'm trying 3x different sets of memory from the SM QVL / tested list on my rev 1.10 board.

I asked the rep if these rev change logs are published, and he said no. I asked how I can avoid this in the future, as I'm now stuck with 2x rev 1.10 boards that are essentially TOTALLY useless. He said to call support before buying a board and they will tell me about any rev changes/problems.

To update my testing: I did find one config that lasted 37h on memtest86 pro v7.2 with no ECC errors (the layout used 7x DIMMs, skipping A1, so B1, C1, D1, E1, F1, G1, H1). However, this same DIMM setup freezes on an Ubuntu live CD about 20 min after boot (with ECC errors in IPMI, of course).

This is really disheartening for an otherwise hardcore Supermicro fan. (Now I'm scared to buy any used SM equipment without checking with their hard-to-reach support and fishing around for board rev info/changes.)