SM X9 mem issues with just 2 slots, UNLESS lock to 1066

Discussion in 'Processors and Motherboards' started by james23, Jun 7, 2018.

  1. james23

    james23 Active Member

    Joined:
    Nov 18, 2014
    Messages:
    241
    Likes Received:
    41
    hi, been troubleshooting a new build (everything but the RAM was bought used, as working), and am looking for others' input on these unusual issues. (i have built and managed 25+ SM setups over the years, so i have done all the troubleshooting i know of here.)

    Quick question / issue:
    (throughout im following the SM manual for ram population order, unless im intentionally testing something)

    if i use cpu1's blue C1 and D1 slots in any way, i quickly get all kinds of memory-type issues (even w just 1x cpu connected)

    If i only use CPU1's A1 and B1 , and CPU2's E1,F1,G1,H1,E2,F2,G2,H2 = no issues (ie using 10x sticks). ( 28 hour memtest , 0 errors , multiple boots, no hint of a problem)

    If i change BIOS setting for mem speed from AUTO (which runs mem at 1333 as it should), TO 1066, i have no issues and can use cpu1 's blue C1 and D1 slots.

    (when i say issues/problems, i mean the BIOS will lock up or loop (the IPMI event log shows CATERR at the freeze), or, if i can get to my UEFI memtest 7.2 pro, it quickly shows ECC errors, along with the IPMI event log showing ECC errors on CPU1's C1 and D1 slots. i have ONLY ever seen ECC errors on the C1 and D1 slots.)
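    The pattern reported so far can be written down as a small consistency check. This is only a sketch: the `Run` record, the `SUSPECT` set, and the exact slot lists for the second and third runs are illustrative assumptions (the post only says that C1 and D1 were in use), and the outcomes are the ones described above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Run:
    slots: frozenset  # populated DIMM slots
    speed: int        # memory speed set in BIOS (1333 = AUTO here)
    stable: bool      # outcome as reported in the post

SUSPECT = frozenset({"C1", "D1"})

RUNS = [
    # 10 sticks, C1/D1 left empty, AUTO (1333): 28 h memtest, 0 errors
    Run(frozenset({"A1", "B1", "E1", "F1", "G1", "H1",
                   "E2", "F2", "G2", "H2"}), 1333, True),
    # C1/D1 in use at AUTO (1333): lockups and ECC errors (slot list assumed)
    Run(frozenset({"A1", "B1", "C1", "D1"}), 1333, False),
    # C1/D1 in use, speed locked to 1066: no issues (slot list assumed)
    Run(frozenset({"A1", "B1", "C1", "D1"}), 1066, True),
]

def predicted_stable(run):
    """Hypothesis: a run fails only when C1 or D1 is populated above 1066."""
    return not (run.slots & SUSPECT and run.speed > 1066)

def hypothesis_holds(runs):
    """True if the hypothesis matches every recorded outcome."""
    return all(predicted_stable(r) == r.stable for r in runs)

print(hypothesis_holds(RUNS))
```

    Adding each new test as another `Run` entry makes it obvious as soon as a result stops fitting the "C1/D1 above 1066" explanation.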

    (i have reseated the CPUs, no bent pins on either socket, inspected the whole board for visual issues; all testing is with no disks, no pcie slots occupied, and just a usb stick w my memtest boot.) I have even swapped CPU 1 with CPU 2 (while running just 1 cpu) to verify the issue is not CPU1's fault. I also reset CMOS per the manual, and tried running on just 1 PSU (tried each psu on its own).

    (im never mixing memory types / makes here; i have 3 separate sets of memory so as to rule out memory issues)

    some specs:

    CPU: 2x e5-2667 (2.9GHz, 3.5GHz , 6C +ht, L3:xM, 130W, MB microcode patch= 710)
    MB: X9DR3-LN4F+ rev 1.10 (read from mb sticker) (BIOS = 3.2 latest as of jun2018)
    PSU: 2x SM 720w
    Case/Chassis: CSE-825 (2U , 8x 3.5" bays)

    RAM options im using in above :
    8 x sticks of (from working server):
    elpida EBJ41HE4BAFA-DJ-E - 4GB PC3-10600 DDR3-1333H (9-9-9) ECC-Registered CL9 240-Pin DIMM Dual Rank
    Datasheet: EBJ41HE4BAFA-DJ-E (ALLDATASHEET)

    10 x sticks of (from working server):
    Hynix 4GB PC3-10600 DDR3-1333MHz ECC Registered CL9 240-Pin DIMM Dual Rank Memory Module Mfr P/N HMT151R7BFR4C-H9-DB
    https://www.skhynix.com/product/filedata/fileDownload.do?seq=2931

    4x sticks of:
    HMT151R7BFR8C-G7 DB AA-C (1066 DDR3 ECC, QUAD rank)
    https://www.skhynix.com/product/filedata/fileDownload.do?seq=2937

    A good link someone else pointed out, a Hynix memory P/N "decoder":
    https://www.skhynix.com/static/filedata/fileDownload.do?seq=190

    So a pretty basic SM setup, but with weird memory issues (only the C1 and D1 slots). I realize the RAM sticks are not the exact part numbers on the SM QVL for this MB, but their specs match the QVL-listed part numbers exactly (except for the 4x 1066 sticks), and i'm pretty confident all these sticks are good (can never say 100% good w ram LOL).

    any input or ideas guys? thanks!
     
    #1
  2. SavageWS6

    SavageWS6 Member

    Joined:
    Feb 2, 2016
    Messages:
    39
    Likes Received:
    7
    I recently just had a RAM issue on my X9DRD-7LN4F-JBOD and I did everything the same as you, although I thought I had a dead RAM slot. I tested all my RAM and everything was good. I guess it wasn't a bad RAM slot.. or technically was. tl;dr I sent it in to Supermicro and my chipset was defective. Per Supermicro

    "Where as your board has a defective chipset CHP-0465LH-00C1-INT - U3G1"

    So possibly somehow your chipset is defective if you already singled out a ton of possible issues.
     
    #2
  3. james23

    james23 Active Member

    interesting, thanks. thats def. useful info at this point, as it seems very similar to my issue!
    I will keep this updated with whatever i find, or whatever my resolution ends up being.
     
    #3
  4. SavageWS6

    SavageWS6 Member

    No problem. In all my years of building and fixing, I've never seen a chipset fail, especially an Intel chipset like the Intel 602J I had. Hope you figure something out and the final answer isn't a chipset-related problem. I'm possibly on the search for a new board now.

    EDIT: The RAM I was using which was working fine and tested was

    - SAMSUNG M393B1K70CH0-CH9Q5
    - Hynix HMT351R7BFR4C-H9
     
    #4
  5. zir_blazer

    zir_blazer Active Member

    Joined:
    Dec 5, 2016
    Messages:
    159
    Likes Received:
    47
    The more memory modules you have and/or the more ranks on each module, the lower the maximum frequency you can clock them at. It makes sense that this is your problem, since you say they work fine at 1066 MHz but not at 1333, so it's possible that you're putting too much stress on those Sandy Bridge IMCs (Integrated Memory Controllers) if they are fully or almost fully populated.
    Check page 2-14 of your motherboard manual for Intel E5-2600 Series Processor RDIMM Memory Support. Depending on the number of DIMMs per channel and the number of ranks in each module, you can officially clock them at 1600, 1333, or 1066, and there are even some configurations that can't go over 800 MHz. Not only that, you are mixing Dual Rank and Quad Rank modules, and the Quad Rank ones are rated for 1066 MHz, so you are overclocking them. Maybe you can get things working if you rearrange the memory modules so that the Quad Rank modules are alone in their own channel, but I'm not sure how much work that would involve.
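    The "rated for 1066, so you are overclocking them" point can be sketched as a simple check. The rated speeds below are the ones listed for the three memory sets in the first post; the dictionary keys and function names are made up for illustration. This only covers each module's own rating, not the platform limits per DIMMs-per-channel and rank count from the manual's table, which are not reproduced here.

```python
# Rated speeds as listed in the first post; the labels are illustrative.
RATED_SPEED = {
    "elpida_dual_rank": 1333,
    "hynix_dual_rank": 1333,
    "hynix_quad_rank": 1066,
}

def common_rated_speed(module_kinds):
    """Highest speed that does not exceed any installed module's rating."""
    return min(RATED_SPEED[kind] for kind in module_kinds)

def is_overclocked(module_kinds, configured_speed):
    """True if the configured speed is above the slowest module's rating."""
    return configured_speed > common_rated_speed(module_kinds)

# Quad-rank 1066 sticks with the BIOS at 1333 would be out of spec:
print(is_overclocked(["hynix_quad_rank"], 1333))
# A set of the 1333-rated dual-rank sticks at 1333 is within its rating:
print(is_overclocked(["elpida_dual_rank"], 1333))
```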

    The fact that your Frankenstein works makes me think that certified memory lists are stupidly overrated.
     
    #5
  6. james23

    james23 Active Member

    thanks for the reply zir,
    i'm fine with running the system at 1066 full time versus 1333 (however, I do need to do longer memtest runs at 1066 to be 100% sure it's fully stable there too; for now it just "seems" more stable, based on a 12-14h memtest run at 1066 vs the failures at 1333).
    The reason I'm hung up on it not working at 1333 (among other issues) is that all indications and specs say it should be fine at 1333, which to me indicates there are HW issues that just may not be making themselves clear right now (or in the relatively short and limited tests I can do before deployment). Once I can get past memtest successfully, I plan on moving on to other stability tests (like Windows or Linux test suites and Prime95 or SuperPi runs).

    2 points though:
    1- I am certainly not mixing RAM types, nor dual and quad rank DIMMs. I was simply listing the various types of memory I have on hand to test with. In all of my testing, I've only ever used a consistent DIMM set for any given test run.

    2- while your points would definitely shed light if it turns out that 1066 is fully stable (ie for a 48+ hour memtest run), I also want to note that I'm far from fully populating the available RAM slots on this board (the most I'm ever able to populate in my tests is eight slots, of a total of 24 RAM slots on the MB).

    Does this info change your opinion on possibly overloading my sandy bridge IMCs? thanks
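    For a rough sense of what running at 1066 instead of 1333 costs, the theoretical peak bandwidth of a DDR3 channel is the transfer rate times the 64-bit (8-byte) bus width. The sketch below uses the nominal DDR3 rates (1333⅓ and 1066⅔ MT/s); the function name is made up, and real-world throughput will be lower than these peak figures.

```python
BUS_BYTES = 8  # a DDR3 channel has a 64-bit data bus

def peak_mb_per_s(mega_transfers_per_s, channels=1):
    """Theoretical peak bandwidth in MB/s for the given number of channels."""
    return mega_transfers_per_s * BUS_BYTES * channels

ddr3_1333 = peak_mb_per_s(4000 / 3)  # nominal 1333.33 MT/s (PC3-10600)
ddr3_1066 = peak_mb_per_s(3200 / 3)  # nominal 1066.67 MT/s (PC3-8500)

print(round(ddr3_1333), round(ddr3_1066))
print(f"1066 keeps {ddr3_1066 / ddr3_1333:.0%} of the 1333 peak")
```

    So locking the speed to 1066 gives up about a fifth of the theoretical peak per channel, which is the price of the workaround if it does turn out to be stable.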
     
    #6
    Last edited: Jun 7, 2018
  7. zir_blazer

    zir_blazer Active Member

    For some reason I thought that you tested with all 22 modules at once, even though you state that you never got it working @ 1333 MHz even with only a single Processor and 4 modules.
    What are you currently testing with? Did you test with both the Hynix and Elpida modules, 4 at a time @ 1333 MHz? I'm not exactly sure whether signal integrity could degrade in such a way that it is "good enough" to work @ 1066 MHz but no longer @ 1333 MHz, be it in the modules themselves, the CPU, or the Motherboard wiring to the slots, but I would pay attention to that detail just in case. I'm mostly used to seeing RAM problems in enthusiast setups, where a Processor may not like a module with certain DRAM ICs above a certain Frequency but can work with another type of DRAM ICs; since this is Server-grade Hardware, I would be surprised if these issues were also present here at conservative clock speeds.


    According to the Block Diagram, each Memory Channel has a letter (A, B, C, D for CPU1, and E, F, G, H for CPU2), followed by the number of the DIMM Slot on that channel (A1, A2, A3). It seems that the blue slots are all the first slot of a channel, like A1-B1-C1-D1. Did you try different combinations, like A2-B2-C2-D2 for CPU1? I don't know whether that would boot or whether using the first slot of a channel is enforced, but if you're out of ideas you may try that, or just populate C1 and D1 and check if it boots. Also pay attention to Timings and Voltage, just in case.
    The way I suppose you may make it work is with 8 of the 1333 MHz Hynix in one CPU and the 8 Elpida in the other, at 2 DIMMs per Channel. Theoretically, even your Sandy Bridge supports two 2Rx4 or 2Rx8 modules at 2 DPC at 1600 MHz, so there is plenty of IMC headroom...
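    The slot naming described above (channel letter plus slot number, A-D on CPU1 and E-H on CPU2) is enough to work out DIMMs per channel (DPC) for any layout. A minimal sketch, with made-up helper names:

```python
from collections import Counter

CPU1_CHANNELS = set("ABCD")
CPU2_CHANNELS = set("EFGH")

def cpu_for_slot(slot):
    """Return 1 or 2 depending on which CPU owns the slot's channel."""
    channel = slot[0].upper()
    if channel in CPU1_CHANNELS:
        return 1
    if channel in CPU2_CHANNELS:
        return 2
    raise ValueError(f"unknown channel in slot name: {slot!r}")

def dimms_per_channel(slots):
    """Count populated DIMMs on each channel, keyed by channel letter."""
    return Counter(slot[0].upper() for slot in slots)

# The 10-stick layout from the first post that passed memtest:
layout = ["A1", "B1", "E1", "F1", "G1", "H1", "E2", "F2", "G2", "H2"]
dpc = dimms_per_channel(layout)
print(dict(dpc))          # DIMM count per channel
print(max(dpc.values()))  # highest DPC anywhere in the layout
```

    The highest DPC in a layout is the number to look up, together with the rank count, in the manual's RDIMM support table.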
     
    #7
    Last edited: Jun 7, 2018
  8. james23

    james23 Active Member

    In removing the MB today (to inspect underneath for shorts or other issues), i did notice something a bit weird: on the MB ATX connector, pin 7 and pin 8 look to have a fuse (or something) soldered between them. Ive checked images online (and my own X8 and X7 SM boards i have here) and have not found this done anywhere else.
    It does look like this was added "aftermarket", or maybe an SM RMA added it (doubtful, but possible)? The only thing i found online relating to pin 7 and pin 8 is a DIY article where the writer is converting an ATX PSU into a bench power supply, and he shows adding a resistor and LED via pin 7 and pin 8 to make a red LED that lights when the PSU is ON.
    pin 7 is GND (ground).
    pin 8 is PWR_OK (goes to 5v when the PSU is running properly).
    article:
    https://www.electronics-tutorials.ws/blog/convert-atx-psu-to-bench-supply.html

    see my image below.

    (btw, i didnt find anything else unusual that would relate to my memory issues, ie no bent pins, no touching pins, nor anything to indicate the ram pins were making contact with the chassis or anything else). Ive asked the seller to begin testing a replacement MB to ship out monday (ive suggested he run memtest with 8x dimms, but im not sure how "thorough" his testing will be :/ )

    [attached image: bottomure.JPG]
     
    #8
  9. vrod

    vrod Active Member

    Joined:
    Jan 18, 2015
    Messages:
    214
    Likes Received:
    30
    The one thing that saved my ass with this board was to look at the DIMM Population guide. After that I started dropping DIMMs into the server in pairs, plugging them in according to the Population Guide: boot the system with the DIMMs, shut it down again, then put in another pair.

    It took a while until I had all 16 DIMMs in; it probably looked funny to the other guys at the DC. :D Since I did that, the system has been running completely without issues, for almost 60 days at the time of writing.
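    The pair-by-pair procedure above can be sketched as a generator that lists which slots should be populated at each boot cycle. The function name is made up, and the slot order in the example is only a placeholder; the real order has to come from the board's DIMM population guide.

```python
def staged_population(order, step=2):
    """Yield the cumulative list of populated slots after each boot cycle."""
    for end in range(step, len(order) + step, step):
        yield order[:end]

# Placeholder order for illustration; take the real one from the manual.
example_order = ["A1", "B1", "C1", "D1"]

for cycle, slots in enumerate(staged_population(example_order), start=1):
    print(f"boot {cycle}: populate {', '.join(slots)}")
```

    If a boot cycle fails, the pair added in that step is the first place to look.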
     
    #9
  10. james23

    james23 Active Member

    Well, looks like Supermicro is being a bit shady on this board model / revision. I found this link:

    Bug 875194 – sbridge: HANDLING MCE MEMORY ERROR

    which has several users with the exact same model board, same board rev 1.10, as me, all having memory issues. For some, the solution was replacing the board with the same model at rev 1.20, which FIXED the mem issues. So i called up SM support, and the support rep (who wasn't very helpful) said that, per his board rev change log, there were significant "memory performance improvements" from rev 1.10 to rev 1.20 boards. The rep said i should get a rev 1.20 board, even though im trying 3x different sets of memory from the SM QVL / Tested list on my rev 1.10 board.

    I asked the rep if these rev change logs are published, and he said no. I asked how i can avoid this in the future, as im now stuck with 2x rev 1.10 boards that are essentially TOTALLY useless. He said to call support before buying a board and they will tell me about any rev changes/problems.

    To update my testing: i did find one config that lasted 37H on memtest86 pro v7.2 with no ecc errors (the layout used 7x dimms, skipping A1, so B1,C1,D1,E1,F1,G1,H1). However, this dimm setup freezes on an ubuntu live CD about 20m after boot (w ecc errors in ipmi, of course).

    This is really disheartening for an otherwise hardcore Supermicro fan. (now im scared to buy any used SM equipment without having to check with their hard-to-reach support and fish around for board rev info/changes).
     
    #10
  11. RedX1

    RedX1 New Member

    Joined:
    Aug 11, 2017
    Messages:
    9
    Likes Received:
    0
    #11