X9DRI-LN4F+ Woes


Hop-Scotch

New Member
Mar 26, 2021
Need some expert help here.

TLDR:

Replaced almost every piece of the system trying to resolve correctable memory error storms and CATERRs. Still nothing works.

Let's begin.

I'm building a new storage server to reduce the footprint and energy use of my current setup (my old gaming setup from the early 2000s: a Q9550 in a 4U, paired with 2 x SE3016 expander chassis). The new setup:

Case: SC846 w/ BPN-SAS2-EL1 backplane and dual 920W Platinum SQ PSUs
Motherboard: X9DRI-LN4F+ rev 1.2
CPU: 2x E5-2660 v2
Cables: 8-pin to dual 8-pin splitter for the CPUs, so I can use the SM 8-pin to PCIe cable I have for when I add a video card
Memory: 96GB Hynix HMT31GR7BFR4C-H9 (from the Supermicro tested memory list)
HBA: Dell LSI 9207-8i (tape mod just in case)
OS: UnRAID

Now this gets long.

Set everything up. Updated the BIOS to the latest (3.4) and IPMI to 3.48 (I know there's a 3.61). Let MemTest86 off the UnRAID boot USB run for 24+ hours: no errors (none logged in IPMI either).

I start getting everything set up within UnRAID and start copying over data from the old box. Everything is fine for a while, then I get a storm of correctable memory errors (logged via mcelog as well as IPMI, though they don't agree on which DIMM after looking at the offending address and using dmidecode), and this ends in a CATERR. So I swap DIMM locations (and clean contacts) and boot everything back up. Everything is fine for a while, then another storm of errors ending in CATERR (so many it fills up the SEL log), different DIMM, different slot. I repeat, shuffling DIMMs and logging which DIMM went where; it's never the same DIMM or slot.
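For reference, this is roughly how I've been trying to tie the reported addresses back to a physical DIMM with dmidecode. The type 20 (Memory Device Mapped Address) records may not be populated, or may not line up cleanly with the EDAC ADDR values on this board, so treat it as a rough check:

Code:
# List the physical slots, what's installed in them, and the part numbers (run as root)
dmidecode -t 17 | grep -E "Locator|Size|Part Number"

# Type 20 maps physical address ranges back to the DIMM handles from above
dmidecode -t 20 | grep -E "^Handle|Starting Address|Ending Address|Physical Device Handle"

# Then compare those ranges against the ADDR that mcelog/EDAC reports (a hex physical address)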

So I reseat both CPUs (and clean them), and start everything over again. Same issue. I swap the CPUs, same issue. I start pulling DIMMs when they are reported until I'm down to just 1 in the system. Same issue.

Repeat everything with 1 CPU, from MemTest86 with all DIMMs installed to yanking each DIMM as it's reported. Same issue.

I've checked all the standoffs and inspected the motherboard and pins. Cleaned all the contacts.

I can't possibly have all bad hardware at once, right? My eBay luck cannot be that bad.

Ordered another motherboard, same make and model. Did all the same testing, cleaning, and reseating, except using MemTest86 v9. Same issue.
Ordered a pair of E5-2650 v2s. Repeated everything. Same issue.
Replaced the CMOS battery for giggles. Same issue.

I have looked at proper memory population and I've always followed it, except that right now, if I populate C1 or D1, it lasts just long enough to boot into UnRAID before the storm hits.

Currently CPU1 has A1-A3 and B1-B3 populated, CPU2 has E1-E2, F1-F2, G1, and H1 populated, and it's been up for an hour.

Longest it's run? 2, maybe 3 days.
Shortest? Skip the memory errors and just CATERR.

Transfer TBs of data: works fine for a while, then a random memory error storm and CATERR. Or no memory storm, just CATERR.
I can leave it transcoding a 4K movie down to 720p at 2 Mbit while streaming more 4K in the house; it works for hours, the movies end, then CATERR. Or a memory storm and CATERR.

I have tried everything without splitting the 8-pin. Same results.
I have tried just 1 PSU, and then the other PSU. Same results.

Let it sit in UnRAID without disks (removed the LSI card). Same results.

I've been at this for weeks.

Everything ends in a memory storm and CATERR or just straight up CATERR. I've searched this site (I'm a lurker) and the internet to the best of my abilities, which I thought were good until this moment. I've always built my own PCs, custom water cooling, modding, overclocking, and I thought I was adept, but I'm completely stumped.

What is left to try or swap? I can't point to any specific thing I can do to force it to crash. I'm at least back to getting memory errors; I spent the last 2 days with it going straight to CATERR and having no information to go on to hunt down the cause.

Attached is a screenshot from IPMIView; it contains some new errors that just showed up today, once. I have no idea why there's an odd time shift in there either. I have to continuously clear the event log because it fills up.

IPMI Errors (2).png
 

Hop-Scotch

New Member
Mar 26, 2021
Returned the setup to a single E5-2650 v2 and all 12 sticks of RAM. Got a storm of correctable memory errors (DIMMD1 according to IPMIView) ending in CATERR.

Removed DIMMD1 and moved DIMMD3 to its place. The server has been running for 18 hours at this point. However, I'm still logging:

Code:
Mar 29 04:58:32 Segrid kernel: mce: [Hardware Error]: Machine check events logged
Mar 29 04:58:32 Segrid kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Mar 29 04:58:32 Segrid kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: cc00010000010093
Mar 29 04:58:32 Segrid kernel: EDAC sbridge MC0: TSC 9612bfb01ab4
Mar 29 04:58:32 Segrid kernel: EDAC sbridge MC0: ADDR 13d324dc0
Mar 29 04:58:32 Segrid kernel: EDAC sbridge MC0: MISC 21503e3e86
Mar 29 04:58:32 Segrid kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1617019112 SOCKET 0 APIC 0
Mar 29 04:58:32 Segrid kernel: EDAC MC0: 4 CE memory read error on CPU_SrcID#0_Ha#0_Chan#3_DIMM#0 (channel:3 slot:0 page:0x13d324 offset:0xdc0 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0093 socket:0 ha:0 channel_mask:8 rank:0)
Mar 29 04:58:32 Segrid kernel: mce: [Hardware Error]: Machine check events logged
Mar 29 04:58:32 Segrid kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Mar 29 04:58:32 Segrid kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: 8c00004000010093
Mar 29 04:58:32 Segrid kernel: EDAC sbridge MC0: TSC 9612bfb0df48
Mar 29 04:58:32 Segrid kernel: EDAC sbridge MC0: ADDR 1000677c0
Mar 29 04:58:32 Segrid kernel: EDAC sbridge MC0: MISC 21505cdc86
Mar 29 04:58:32 Segrid kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1617019112 SOCKET 0 APIC 0
Mar 29 04:58:32 Segrid kernel: EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#3_DIMM#0 (channel:3 slot:0 page:0x100067 offset:0x7c0 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0093 socket:0 ha:0 channel_mask:8 rank:0)
The counters keep increasing, obviously:

Code:
root@Segrid:~# grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch2_ce_count:53
/sys/devices/system/edac/mc/mc0/csrow0/ch3_ce_count:37
/sys/devices/system/edac/mc/mc0/csrow1/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch2_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch3_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow2/ch2_ce_count:0
root@Segrid:~# grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch2_ce_count:55
/sys/devices/system/edac/mc/mc0/csrow0/ch3_ce_count:42
/sys/devices/system/edac/mc/mc0/csrow1/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch2_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch3_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow2/ch2_ce_count:0
However, IPMIView shows nothing (screenshot below). What am I missing? Bad slots? Bad RAM? Bad everything, again?
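In case it's useful, here's a quick loop that can be left running to watch those counters instead of re-running the grep by hand; it just re-reads the same sysfs files with a timestamp:

Code:
# Dump the per-channel CE counters with a timestamp every 30 seconds
while true; do
    date
    grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
    sleep 30
done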

2021-03-29 - 18hrs IPMI.png
 

i386

Well-Known Member
Mar 18, 2016
Germany
I'm not sure if it was that board or another, but some X9 boards required a rev 2.0 board to work properly with v2 CPUs.
 

Hop-Scotch

New Member
Mar 26, 2021
This board requires revision 1.20 to work with v2 CPUs, which is the revision I have, with a fully updated BIOS (3.4). It POSTs and runs.

Errors seem to be accumulating on D1 and C1. I can pull those and swap again. I'm considering buying another batch of approved RAM, Samsung this time, as a test.
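For what it's worth, that D1/C1 guess assumes socket 0 channels 0-3 map to A through D on this board, which I haven't confirmed. If the kernel fills in EDAC slot labels here, this should show the mapping directly (it may just come back empty):

Code:
# Legacy EDAC csrow layout: per-channel slot labels, if the driver/platform provides them
grep . /sys/devices/system/edac/mc/mc*/csrow*/ch*_dimm_label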

Is there a better way to test ECC RAM than MemTest86? It never finds anything when I let it run.
 

MBastian

Active Member
Jul 17, 2016
Düsseldorf, Germany
So the only things you did not swap were the memory stick model and the OS (UnRAID)?

Double-check your BIOS DIMM configuration settings. To start, set extra-conservative values below your DIMMs' specs ... don't trust Auto.

Edit: Disabling ASPM might be worth a shot.