Highly unstable system (Supermicro X9DRi-LN4F)

marirs · Jan 8, 2019

I've recently acquired a used server with a Supermicro X9DRi-LN4F, 32 GB of RAM (8 x 4 GB), and 2x E5-2630L v2.

Unfortunately it's been very unstable, and usually crashes within ~10 minutes with something like this in the event log:

16 2019/01/08 16:14:55 OEM Memory Correctable Memory ECC @ DIMMC1(CPU1) - Asserted

I've tried running Memtest86. On some runs it finds no errors, but on others it reports thousands of errors like the above, all in different locations (C1, D1, etc.). I've tried moving the RAM sticks to different slots but cannot get a consistent fault on a specific stick.

Most of the crashes during regular use seem to occur in C1 (even after moving sticks around).

I have removed one CPU, so currently all 8 sticks are on the A-D RAM slots.

I tried swapping out the CPU with the other one and crashes still occur.

Any ideas on what could be the problems here or any other troubleshooting steps? At this point I think the motherboard must be faulty.

Is it a problem that I'm running with only one CPU?

edit: the most reliable way I've found to get it to crash is a large file transfer through WinSCP. I've tried CPU stress-test tools and various memory testers, but they don't reliably cause crashes.

WANg · Jan 8, 2019

Hm. Did you have a good look at the motherboard manual, especially the flowchart diagram on page 18 telling you how the machine is stitched together?

It's a typical dual-socket Sandy/Ivy setup where one socket is the master (Socket 1) and the other is the slave (Socket 2). The master CPU talks to the disks (the SATA bus), the USB and the BMC, and communicates with the slave via its dual QPI links. It's not as awkward as some of the X8/X9 blades that I've run across, where the slave socket has to reach the PCIe slots through the master over QPI; here it talks to its own 48 dedicated PCIe lanes instead. This is kinda important from a latency/timing standpoint. Anyways, that's not quite the point. So, several things you need to know:

a) You can run it with only 1 CPU populated, but it must always be on socket 1. If you leave socket 1 empty and socket 2 populated, the machine should not work. However, leaving only socket 1 populated will be fine.

b) Sandy/Ivys (and the Nehalems) use triple-channel RAM setups, and that's why you see 8 groups of memory ranging from A to H, with each group containing 3 slots (that's not a coincidence). If you look at the RAM slots, note that the 8 groups are divided into 4 groups for each CPU socket, where A/B are the first 2 groups for Socket 1, and E/F are the first 2 groups for Socket 2. Typically, if Socket 2 is not populated, the associated RAM groups are turned off. Let's say you have 4 GB RAM modules in all 24 slots but only a single CPU in Socket 1: you'll only see 48 GB in total, since the 12 slots associated with Socket 2 will not be turned on. Obviously, talking to the group closest to your socket gets you the lowest latency (in other words, if your code is running on Socket 1, don't shove your data on group G if you want good latency).
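The 48 GB figure above is plain slot arithmetic. A minimal sketch, assuming the slot count and module size from this post, and assuming slots are split evenly between the sockets (the function name and its parameters are made up for illustration):

```python
def visible_ram_gb(total_slots, total_sockets, populated_sockets, dimm_gb):
    """RAM the system can see when every slot holds a DIMM of size dimm_gb.

    Assumes slots are split evenly between sockets and that the slots
    belonging to an empty socket are switched off.
    """
    slots_per_socket = total_slots // total_sockets
    return slots_per_socket * populated_sockets * dimm_gb


print(visible_ram_gb(24, 2, 1, 4))  # one CPU installed  -> 48
print(visible_ram_gb(24, 2, 2, 4))  # both CPUs installed -> 96
```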

c) Note that on the board diagram, it's arranged E-F-S2-G-H and C-D-S1-A-B (so Socket 2's memory groups are rotated 180 degrees in relation to Socket 1). Note that you don't need to populate all RAM slots, but you do have to populate them in the correct order. Groups get filled out 3 at a time first.

So what does it all mean?

1) Physically examine the RAM and the RAM slots. Turn on the torch mode on your smartphone and shine a light into the slots for groups C and D. Do the "fingers" in the RAM slots look okay? Any potentially bent pins? Worn traces? Now look at the RAM modules themselves. Are they from the same batch? What type of RAM are they? Or do you have a hodgepodge of voltages, timings/wait states, speeds, vendors and resellers? How about the "fingers" on the RAM module contacts? Do they look a little funny to you?

2) If it's complaining about C1, take C1/C2/C3 out of service (for now). Test with Socket 1 only and populate RAM only in A1+A2+A3. See if that's stable. Then expand to A1+A2+A3 AND B1+B2+B3. See if that's stable. Then test with 2 sockets: populate with the above, plus a CPU in Socket 2, plus E1+E2+E3, then that and F1+F2+F3. If they look clean, put groups G and H into service. Then D. And then finally C. While you are at it, look at the power rails on the machine via IPMI/BMC and see if they look wobbly or noisy.
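The staged test order in step 2 can be written down as a checklist. A rough sketch, assuming the group naming used in this post (groups A to H with slots 1 to 3 each); `staged_plan` is a made-up helper, not anything from Supermicro:

```python
def staged_plan():
    """Yield (description, cumulative list of populated slots) per test stage."""
    stages = [
        ("Socket 1 only, group A", ["A"]),
        ("add group B", ["B"]),
        ("add CPU in Socket 2 and group E", ["E"]),
        ("add group F", ["F"]),
        ("add groups G and H", ["G", "H"]),
        ("add group D", ["D"]),
        ("add group C (the suspect)", ["C"]),
    ]
    populated = []
    for description, groups in stages:
        for group in groups:
            # Each group is filled three slots at a time, as in the post.
            populated.extend(f"{group}{n}" for n in (1, 2, 3))
        yield description, list(populated)


for step, (description, slots) in enumerate(staged_plan(), start=1):
    print(f"Step {step}: {description} -> {' '.join(slots)}")
```

Run a stability test after each step and only move on if it stays clean; the first step that starts throwing errors points at the group you just added.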

EffrafaxOfWug · Jan 8, 2019

Probably a long shot, but given that you've shuffled the DIMMs and the CPUs around, have you inspected the DIMM slots and the mobo traces for obvious damage and/or obstructions (you'll want to use a good torch or directional light for this)? I had the same behaviour once with intermittent crashes always at the same addresses; a dust bunny stuck in one of the slots was causing a marginal connection on some of the pins that was enough to cause ECC to throw a wobbly when those addresses were accessed (although not universally - some access patterns didn't trigger it, but eventually some pages in those addresses would, resulting in a correctable or uncorrectable error being logged).

To get a more accurate picture you might want to try repeated memtest runs and take a log of the memory address ranges that come back as failing.
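The same kind of log can be kept for the BMC event log. A small sketch that tallies ECC events per slot, assuming every entry looks like the single line quoted in the first post (the regular expression and the `tally_ecc_events` helper are illustrative; Memtest86's own report format is not assumed here):

```python
import re
from collections import Counter

# Pattern derived from the one event-log line quoted in the first post.
DIMM_RE = re.compile(r"DIMM([A-H]\d)\(CPU(\d)\)")


def tally_ecc_events(lines):
    """Count event-log entries per (CPU, DIMM slot)."""
    counts = Counter()
    for line in lines:
        match = DIMM_RE.search(line)
        if match:
            slot, cpu = match.groups()
            counts[(f"CPU{cpu}", slot)] += 1
    return counts


sample = [
    "16 2019/01/08 16:14:55 OEM Memory Correctable Memory ECC @ DIMMC1(CPU1) - Asserted",
]

for (cpu, slot), count in tally_ecc_events(sample).most_common():
    print(cpu, slot, count)
```

Feeding it a saved copy of the event log after each DIMM shuffle shows whether the errors follow a stick or stay with a slot.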

If you acquired this used, it's entirely possible it was unloaded precisely because it had stability problems, of course...

nthu9280 · Jan 8, 2019

In addition, check for any bent CPU socket pins and/or debris. Clean the CPU contacts with an alcohol wipe and let them dry for a couple of minutes before reseating the CPUs in their sockets.

marirs · Jan 8, 2019

Thanks for the detailed replies guys!

To eliminate some things:

- I did refer to the manual and nothing is plugged into anything connected to socket 2, so I think I should be fine there.
- All the RAM sticks are the same model and manufacturer, however I don't think they're the same batch number.
- I examined C1 with a flash light and nothing seemed immediately wrong physically with the pins.
- Memtest so far hasn't been a reliable cause of failure. There have been 2 runs where basically every other stick reported errors (and this starts pretty soon after the test starts), and other runs where it goes overnight with no errors.

So what I've done now is:

- As suggested, removed sticks from C1 & C2 and put them in A3 & B3. Previously I was following the recommended A1, B1, C1, D1, A2, B2... pattern. So A1-A3, B1-B3, D1-D2 are currently populated.
- I'm not sure of the best way to check voltage quality. IPMI only gives readings with a max refresh of 10 seconds, and I don't know if that is sufficient? However! The server came with redundant PSUs and I am only using one, so I swapped it for the other.

So far, it's been stable and running my ssh file transfer test for ~40 minutes, which is better than before!

I'm going to let this test run overnight and then try to repopulate C1 & C2 tomorrow. I would be so happy if this was just a PSU problem...

This is a used server, but fortunately the company is local so I don't have to worry about shipping a ~100 lb monster back!

RageBone · Jan 9, 2019

Actually, LGA 2011 is quad-channel, and in this case there are 3 slots per channel.

So @marirs A1, B1, ... D1 is a reasonable population.
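To spell that layout out: a quick sketch that lists which slots to fill for a given DIMM count, assuming 4 channels (A to D) with 3 slots each on one socket and the A1, B1, C1, D1, A2, B2... order marirs mentioned. The function names are made up, and the Supermicro documentation should be treated as the authority on the real order:

```python
from itertools import islice


def population_order(channels="ABCD", slots_per_channel=3):
    """Yield slot names one per channel first, then the next slot of each channel."""
    for slot in range(1, slots_per_channel + 1):
        for channel in channels:
            yield f"{channel}{slot}"


def slots_for(dimm_count, channels="ABCD", slots_per_channel=3):
    """Return the first dimm_count slots in population order."""
    return list(islice(population_order(channels, slots_per_channel), dimm_count))


print(slots_for(8))
# ['A1', 'B1', 'C1', 'D1', 'A2', 'B2', 'C2', 'D2']
```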

Edit: other stuff I remembered:
My x10 ... with 4 channels and 3 DIMMs each works flawlessly as long as there are fewer than 10 DIMMs installed per socket.
When I populate it fully, DIMMs randomly disappear or throw errors.
I haven't dug into what's causing that, though.

marirs · Jan 11, 2019

Ok, I repopulated C1 & C2, and.... crashes again. Went and exchanged the motherboard for another one, and..... crashes on C1 again!!!!

At this point I don't know.... maybe this motherboard is really not designed for 1 socket with this RAM configuration..... will try populating the second CPU socket and its RAM slots and hopefully that will be stable.

Worried about memory corruption but I hope ECC will take care of it.

RageBone · Jan 12, 2019

In terms of debugging and swapping things around, did you rule out a defective CPU?

RedX1 · Jan 12, 2019

I did post this yesterday in reply to another thread.

I too have had trouble debugging SM memory problems.

I found this Supermicro guide which explains in much better detail the Memory Configuration requirements of X9 Socket R Series DP Motherboards.

https://www.supermicro.com/support/resources/memory/X9_DP_memory_config_socket_R.pdf

It has much more detail than that provided in the various motherboard manuals.

I hope this is useful.

RedX1

Jordan · Jan 15, 2019

What is your DDR3 memory speed? I had a similar issue when I upgraded from 8x4GB to 16x4GB RAM on my X9DAI motherboard with 2xE5-2670. I ended up reading through the errata and noticed this entry:

BT181. Core Frequencies at or Below the DRAM DDR Frequency May Result in Unpredictable System Behavior

Looks like the v2 Xeons have the same erratum under code CA1.

I had DDR3-1333 RAM, and setting it to 1066 MHz in the BIOS fixed the issue for me.
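The condition in that erratum title is a simple comparison. A rough sketch, assuming the DDR3 speed rating (1333, 1066) is the number to compare against and assuming a lowest core clock of 1200 MHz, which is an illustrative guess and not a value from this thread; check your own CPU's lowest frequency:

```python
def core_at_or_below_dram(min_core_mhz, dram_mhz):
    """True when the lowest core clock is at or below the DRAM speed,
    which is the condition named in the erratum title."""
    return min_core_mhz <= dram_mhz


ASSUMED_MIN_CORE_MHZ = 1200  # assumption for illustration, not from the thread

for dram_mhz in (1333, 1066):
    at_risk = core_at_or_below_dram(ASSUMED_MIN_CORE_MHZ, dram_mhz)
    print(f"DDR3-{dram_mhz}: condition met = {at_risk}")
```

Under that assumption, DDR3-1333 meets the condition and DDR3-1066 does not, which would be consistent with the downclock helping.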
