Dual XEON 2696V4 + X10DAX Epic Disaster build - BSOD randomly - please help..

traderjay

Active Member
Mar 24, 2017
196
41
28
35
Thanks all for the response. Do I for sure have mismatched stepping? One CPU is on the table and all I can read is the prints on the heatspreader.

CPU1 - SR2J0 (reading this off the heatspreader)
CPU2 - Revision B00001B, Stepping 1, Model 4F CPU Family 6
 

vinceflynow

New Member
May 3, 2017
29
5
3
What program are you getting the Revision value from?

From my ESXi server, with dual E5-2699 V4 SR2J0:

Code:
[root@myesxi:~] esxcli hardware cpu list
CPU:0
   Id: 0
   Package Id: 0
   Family: 6
   Model: 79
   Type: 0
   Stepping: 1
   Brand: GenuineIntel
   Core Speed: 2199998056
   Bus Speed: 99999906
   APIC ID: 0x0
   Node: 0
   L2 Cache Size: 262144
   L2 Cache Associativity: 8
   L2 Cache Line Size: 64
   L2 Cache CPU Count: 2
   L3 Cache Size: 57671680
   L3 Cache Associativity: 20
   L3 Cache Line Size: 64
   L3 Cache CPU Count: 2
...
4F (hexadecimal) = 79 (decimal). Based on the Stepping, Model, and Family values, CPU2 looks most like an SR2J0 (but I could be wrong).
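For anyone who wants to check the conversion, here is a minimal Python sketch of how a CPUID leaf 1 signature splits into family, model and stepping. The value 0x406F1 is my own assumed example (nobody posted a raw signature in this thread); I picked it because it decodes to the Family 6 / Model 4F / Stepping 1 values quoted above.

```python
def decode_cpuid_signature(eax: int) -> dict:
    """Split a CPUID leaf 1 EAX value into display family, model and stepping."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    # The extended family is only added when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model is only prepended for base family 6 or 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return {"family": family, "model": model, "stepping": stepping}


# The hex/decimal equivalence from the post.
assert int("4F", 16) == 79

# Assumed example signature, not taken from the thread.
signature = decode_cpuid_signature(0x406F1)
print(signature)
```

This only decodes the numbers; it says nothing about what the S-spec printed on the heatspreader should map to.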
 

Marsh

Moderator
May 12, 2013
2,292
1,108
113
Run HWiNFO and CPU-Z to confirm.

CPU1 - SR2J0 is stepping 2
CPU2 - is stepping 1
 

lni

Member
Aug 20, 2017
34
10
8
39
My 2696v4 processors have SR2J0 on the heatspreaders. Below is the lscpu result:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 88
On-line CPU(s) list: 0-87
Thread(s) per core: 2
Core(s) per socket: 22
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2696 v4 @ 2.20GHz
Stepping: 1

CPU MHz: 1200.036
CPU max MHz: 3700.0000
CPU min MHz: 1200.0000
BogoMIPS: 4399.68
Virtualisation: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 56320K
NUMA node0 CPU(s): 0-21,44-65
NUMA node1 CPU(s): 22-43,66-87
 

traderjay

Active Member
Mar 24, 2017
196
41
28
35
So I guess the codes on the heatspreader don't confirm the steppings? The seller of the CPUs reaffirmed that the two chips worked perfectly in his X10DAX and will take them back if I can't get them to work. My new motherboard arrived today and I will be testing again.
 

vinceflynow

New Member
May 3, 2017
29
5
3
If it were me, I would still try to confirm, through software, the stepping of each CPU.

Both my 2696V4 have the SR2J0 S-spec etched on the heatspreader, and ESXi reports that these SR2J0-labeled CPUs have the following attributes:
Family 6, Model 79, Stepping 1.

lni's post also confirms the same attributes for his SR2J0-labeled CPUs.

These attributes are only from two data points. You never know ... someone could have replaced the heatspreader, or etched values on these CPUs after getting them from Intel.
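If you can boot a Linux live USB, a rough way to do the software check is to group the entries in /proc/cpuinfo by socket and compare what each one reports. This is only a sketch: the SAMPLE text below is made up for illustration (it is not output from anyone's machine in this thread), and the function names are my own.

```python
from collections import defaultdict

# Made-up sample in /proc/cpuinfo style, one logical CPU per socket.
SAMPLE = """\
processor   : 0
physical id : 0
cpu family  : 6
model       : 79
stepping    : 1

processor   : 1
physical id : 1
cpu family  : 6
model       : 79
stepping    : 2
"""


def steppings_by_socket(cpuinfo_text: str) -> dict:
    """Map each 'physical id' to the set of (family, model, stepping) seen."""
    sockets = defaultdict(set)
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            if ":" in line:
                key, _, value = line.partition(":")
                fields[key.strip()] = value.strip()
        if "physical id" in fields:
            sockets[int(fields["physical id"])].add(
                (fields.get("cpu family"), fields.get("model"), fields.get("stepping"))
            )
    return dict(sockets)


def is_mismatched(sockets: dict) -> bool:
    """True if the sockets do not all report one identical signature."""
    signatures = set()
    for seen in sockets.values():
        signatures |= seen
    return len(signatures) > 1


result = steppings_by_socket(SAMPLE)
print(result, "mismatched:", is_mismatched(result))
```

On a real system you would pass in `open("/proc/cpuinfo").read()` instead of SAMPLE. A match here only tells you the two chips identify themselves the same way, not that either one is healthy.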
 

Aluminum

Active Member
Sep 7, 2012
431
45
28
Read the CPU info on both from software; the IHS could be remarked pretty easily. Mixing steppings is almost always a bad idea in the SMP 101 book.

These things are even on Newegg now, from a third-party seller of course. I notice one listing mentions the X10DAX as not compatible, but they are wrong about it not working on ASRock, as mine has been chugging along for almost a year now.

I also noticed a lot of 'different' sellers on fleabay and elsewhere are using the same small set of pictures and screenshots...you gotta give some extra attention shopping in the system pull grey market.
 

AJXCR

Active Member
Jan 20, 2017
565
95
28
32
I've been running 2x SR2J0s on an X10DAC, which is highly similar to an X10DAX, for almost a year with no issues whatsoever. I'll try to pull some info on the CPUs and/or BIOS this evening.
 

traderjay

Active Member
Mar 24, 2017
196
41
28
35
Got a new motherboard and the BSOD continues :( I checked everything, including all the cables, and I now suspect either the CPU or the RAM. Below is a report of the dump file:

Code:
Crash Dump Analysis provided by OSR Open Systems Resources, Inc. (http://www.osr.com)
Online Crash Dump Analysis Service
See http://www.osronline.com for more information
Windows 8 Kernel Version 15063 MP (44 procs) Free x64
Product: WinNt, suite: TerminalServer SingleUserTS
Built by: 15063.0.amd64fre.rs2_release.170317-1834
Machine Name:
Kernel base = 0xfffff800`5ce9a000 PsLoadedModuleList = 0xfffff800`5d1e65a0
Debug session time: Tue Oct  3 00:07:52.606 2017 (UTC - 4:00)
System Uptime: 0 days 0:05:40.302
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

KMODE_EXCEPTION_NOT_HANDLED (1e)
This is a very common bugcheck.  Usually the exception address pinpoints
the driver/function that caused the problem.  Always note this address
as well as the link date of the driver/image that contains this address.
Arguments:
Arg1: ffffffffc0000005, The exception code that was not handled
Arg2: fffff8005ce20f04, The address that the exception occurred at
Arg3: ffffffffffffffff, Parameter 0 of the exception
Arg4: 0000000000000000, Parameter 1 of the exception

Debugging Details:
------------------

TRIAGER: Could not open triage file : e:\dump_analysis\program\triage\modclass.ini, error 2

EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at "0x%08lx" referenced memory at "0x%08lx". The memory could not be "%s".

FAULTING_IP:
hal!HalSendSoftwareInterrupt+b4
fffff800`5ce20f04 c3              ret

EXCEPTION_PARAMETER1:  ffffffffffffffff

EXCEPTION_PARAMETER2:  0000000000000000

WRITE_ADDRESS: unable to get nt!MmSpecialPoolStart
unable to get nt!MmSpecialPoolEnd
unable to get nt!MmPagedPoolEnd
unable to get nt!MmNonPagedPoolStart
unable to get nt!MmSizeOfNonPagedPoolInBytes
 0000000000000000

ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at "0x%08lx" referenced memory at "0x%08lx". The memory could not be "%s".

BUGCHECK_STR:  0x1e_c0000005

DEFAULT_BUCKET_ID:  CODE_CORRUPTION

PROCESS_NAME:  MemCompression

CURRENT_IRQL:  2

TRAP_FRAME:  ffffdc08c1200028 -- (.trap 0xffffdc08c1200028)
Unable to read trap frame at ffffdc08`c1200028

LAST_CONTROL_TRANSFER:  from fffff8005d01abaa to fffff8005d005fd0

CONTEXT:  244c8948cccccccc -- (.cxr 0x244c8948cccccccc)
Unable to read context, Win32 error 0n30

STACK_TEXT:
ffffb380`2aef1bf8 fffff800`5d01abaa : 00000000`0000001e ffffffff`c0000005 fffff800`5ce20f04 ffffffff`ffffffff : nt!KeBugCheckEx
ffffb380`2aef1c00 fffff800`5d011482 : 00000000`00000000 00000000`00000000 ffffdc08`c1200028 00000000`00ff7ad0 : nt!KiDispatchException+0x174c8a
ffffb380`2aef22c0 fffff800`5d00f605 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiExceptionDispatch+0xc2
ffffb380`2aef24a0 fffff800`5ce20f04 : ffffdc08`c0e80000 00001f80`00000000 00000000`00000000 00000000`00000000 : nt!KiStackFault+0x105
ffffb380`2aef2630 ffffdc08`c0e80000 : 00001f80`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : hal!HalSendSoftwareInterrupt+0xb4
ffffb380`2aef2638 00001f80`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 ffffdc08`c0e960e0 : 0xffffdc08`c0e80000
ffffb380`2aef2640 00000000`00000000 : 00000000`00000000 00000000`00000000 ffffdc08`c0e960e0 00000000`00000000 : 0x1f80`00000000


MODULE_NAME: memory_corruption

IMAGE_NAME:  memory_corruption

FOLLOWUP_NAME:  memory_corruption

DEBUG_FLR_IMAGE_TIMESTAMP:  0

MEMORY_CORRUPTOR:  LARGE

STACK_COMMAND:  .cxr 0x244c8948cccccccc ; kb

FAILURE_BUCKET_ID:  X64_MEMORY_CORRUPTION_LARGE

BUCKET_ID:  X64_MEMORY_CORRUPTION_LARGE

Followup: memory_corruption
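As a side note on reading Arg1 in the dump above: the value is an NTSTATUS code sign-extended to 64 bits. A small Python sketch (function name is mine) that masks it back to 32 bits and splits out the standard NTSTATUS fields:

```python
def decode_ntstatus(value: int) -> dict:
    """Split an NTSTATUS value (possibly sign-extended to 64 bits) into fields."""
    status = value & 0xFFFFFFFF          # drop the sign extension
    return {
        "status": status,
        "severity": (status >> 30) & 0x3,   # 3 means error
        "customer": (status >> 29) & 0x1,
        "facility": (status >> 16) & 0xFFF,
        "code": status & 0xFFFF,
    }


# Arg1 from the KMODE_EXCEPTION_NOT_HANDLED dump.
arg1 = decode_ntstatus(0xFFFFFFFFC0000005)
print(hex(arg1["status"]), arg1)
```

0xC0000005 is the access violation status, which matches the EXCEPTION_CODE line in the report. It says an invalid memory access happened in kernel mode, not which component is at fault.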
 

traderjay

Active Member
Mar 24, 2017
196
41
28
35
Here is the dump for the WHEA Uncorrectable error:

Code:
Crash Dump Analysis provided by OSR Open Systems Resources, Inc. (http://www.osr.com)
Online Crash Dump Analysis Service
See http://www.osronline.com for more information
Windows 8 Kernel Version 15063 MP (88 procs) Free x64
Product: WinNt, suite: TerminalServer SingleUserTS
Built by: 15063.0.amd64fre.rs2_release.170317-1834
Machine Name:
Kernel base = 0xfffff802`98a91000 PsLoadedModuleList = 0xfffff802`98ddd5a0
Debug session time: Thu Oct  5 14:10:44.628 2017 (UTC - 4:00)
System Uptime: 0 days 0:06:24.457
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

WHEA_UNCORRECTABLE_ERROR (124)
A fatal hardware error has occurred. Parameter 1 identifies the type of error
source that reported the error. Parameter 2 holds the address of the
WHEA_ERROR_RECORD structure that describes the error conditon.
Arguments:
Arg1: 0000000000000000, Machine Check Exception
Arg2: ffffcc0efe3a0028, Address of the WHEA_ERROR_RECORD structure.
Arg3: 00000000b2000000, High order 32-bits of the MCi_STATUS value.
Arg4: 0000000000010005, Low order 32-bits of the MCi_STATUS value.

Debugging Details:
------------------

*************************************************************************
***                                                                   ***
***                                                                   ***
***    Either you specified an unqualified symbol, or your debugger   ***
***    doesn't have full symbol information.  Unqualified symbol      ***
***    resolution is turned off by default. Please either specify a   ***
***    fully qualified symbol module!symbolname, or enable resolution ***
***    of unqualified symbols by typing ".symopt- 100". Note that   ***
***    enabling unqualified symbol resolution with network symbol     ***
***    server shares in the symbol path may cause the debugger to     ***
***    appear to hang for long periods of time when an incorrect      ***
***    symbol name is typed or the network symbol server is down.     ***
***                                                                   ***
***    For some commands to work properly, your symbol path           ***
***    must point to .pdb files that have full type information.      ***
***                                                                   ***
***    Certain .pdb files (such as the public OS symbols) do not      ***
***    contain the required information.  Contact the group that      ***
***    provided you with these symbols if you need this command to    ***
***    work.                                                          ***
***                                                                   ***
***    Type referenced: pshed!_WHEA_ERROR_RECORD_HEADER                ***
***                                                                   ***
*************************************************************************
*************************************************************************
***                                                                   ***
***                                                                   ***
***    Either you specified an unqualified symbol, or your debugger   ***
***    doesn't have full symbol information.  Unqualified symbol      ***
***    resolution is turned off by default. Please either specify a   ***
***    fully qualified symbol module!symbolname, or enable resolution ***
***    of unqualified symbols by typing ".symopt- 100". Note that   ***
***    enabling unqualified symbol resolution with network symbol     ***
***    server shares in the symbol path may cause the debugger to     ***
***    appear to hang for long periods of time when an incorrect      ***
***    symbol name is typed or the network symbol server is down.     ***
***                                                                   ***
***    For some commands to work properly, your symbol path           ***
***    must point to .pdb files that have full type information.      ***
***                                                                   ***
***    Certain .pdb files (such as the public OS symbols) do not      ***
***    contain the required information.  Contact the group that      ***
***    provided you with these symbols if you need this command to    ***
***    work.                                                          ***
***                                                                   ***
***    Type referenced: pshed!_WHEA_ERROR_RECORD_SECTION_DESCRIPTOR                ***
***                                                                   ***
*************************************************************************
Unable to read KTHREAD address ffffcc0f0316a138
TRIAGER: Could not open triage file : e:\dump_analysis\program\triage\modclass.ini, error 2

BUGCHECK_STR:  0x124_GenuineIntel

CUSTOMER_CRASH_COUNT:  1

DEFAULT_BUCKET_ID:  WIN8_DRIVER_FAULT

CURRENT_IRQL:  f

STACK_TEXT: 
ffff9c80`4a74fbb8 fffff802`98a4e5cf : 00000000`00000124 00000000`00000000 ffffcc0e`fe3a0028 00000000`b2000000 : nt!KeBugCheckEx
ffff9c80`4a74fbc0 fffff802`98cecfbd : ffffcc0e`fe3a0028 ffffcc0e`fbf0d220 ffffcc0e`fbf0d220 ffffcc0e`fbf0d220 : hal!HalBugCheckSystem+0xcf
ffff9c80`4a74fc00 fffff802`98a4eaf8 : 00000000`00000728 00000000`00000017 00000000`00000000 00000000`00000000 : nt!WheaReportHwError+0x25d
ffff9c80`4a74fc60 fffff802`98a4ee58 : 00000000`00000010 00000000`00000017 ffff9c80`4a74fe08 00000000`00000017 : hal!HalpMcaReportError+0x50
ffff9c80`4a74fdb0 fffff802`98a4ed46 : ffffcc0e`fbf14fd0 00000000`00000001 00000000`00000000 00000000`00000000 : hal!HalpMceHandlerCore+0xe0
ffff9c80`4a74fe00 fffff802`98a4ef8a : 00000000`00000058 00000000`00000001 00000000`00000000 00000000`00000000 : hal!HalpMceHandler+0xda
ffff9c80`4a74fe40 fffff802`98a4f120 : ffffcc0e`fbf14fd0 ffff9c80`4a750070 00000000`00000000 00000000`00000000 : hal!HalpMceHandlerWithRendezvous+0xce
ffff9c80`4a74fe70 fffff802`98c071bb : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : hal!HalHandleMcheck+0x40
ffff9c80`4a74fea0 fffff802`98c06f2c : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KxMcheckAbort+0x7b
ffff9c80`4a74ffe0 00007ff8`4b3b57ae : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiMcheckAbort+0x1ac
00000000`5af4ee08 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x7ff8`4b3b57ae


STACK_COMMAND:  kb

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: GenuineIntel

IMAGE_NAME:  GenuineIntel

DEBUG_FLR_IMAGE_TIMESTAMP:  0

FAILURE_BUCKET_ID:  X64_0x124_GenuineIntel__UNKNOWN

BUCKET_ID:  X64_0x124_GenuineIntel__UNKNOWN

Followup: MachineOwner
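Arg3 and Arg4 in the report above are the two halves of the machine check MCi_STATUS register, so they can be decoded by hand. Below is a minimal Python sketch, assuming the architectural flag positions from Intel's machine check documentation (VAL at bit 63 down to PCC at bit 57); the function and dictionary names are mine.

```python
# Architectural flag bits in the upper part of IA32_MCi_STATUS.
FLAG_BITS = {
    "VAL": 63,    # register contents are valid
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MCi_MISC holds valid data
    "ADDRV": 58,  # MCi_ADDR holds a valid address
    "PCC": 57,    # processor context corrupt
}


def decode_mci_status(high32: int, low32: int) -> dict:
    """Combine the two bugcheck arguments and pull out the common fields."""
    status = (high32 << 32) | low32
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    return {
        "status": status,
        "flags": flags,
        "mca_error_code": status & 0xFFFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
    }


# Arg3 (high) and Arg4 (low) from the WHEA_UNCORRECTABLE_ERROR dump.
mce = decode_mci_status(0xB2000000, 0x00010005)
print(hex(mce["status"]), mce)
```

For this dump the set flags are VAL, UC, EN and PCC, with no valid address recorded, and the MCA error code is 0x0005. If I am reading Intel's simple error code table correctly, that value is listed as an internal parity error, but treat that as a hint to look it up rather than a diagnosis.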
 

Terry Kennedy

Well-Known Member
Jun 25, 2015
1,067
506
113
New York City
www.glaver.org
Got a new motherboard and the BSOD continues :( I checked everything, including all the cables, and I now suspect either the CPU or the RAM.
Get a copy of Memtest86+ V5.01 and press F2 when prompted to test on all cores simultaneously. It may not work or may have display oddities due to the large number of cores and a chipset it doesn't understand.

This is one place where Dell motherboards really shine - going all the way back to Core 2 Duo, they have Dell's MPMEMORY diagnostic included in the BIOS. Too bad Supermicro doesn't do something similar.
 

traderjay

Active Member
Mar 24, 2017
196
41
28
35
I found out in the BIOS that my memory is running at 2400 MHz when all 8 DIMMs are populated. My suspicion is that this speed is out of spec for the 2696v4 memory controller, hence the BSODs. I changed the motherboard and the issue persists, so I am guessing RAM is the biggest suspect?
 

Nanotech

Active Member
Aug 1, 2016
595
99
28
40
I found out in the BIOS that my memory is running at 2400 MHz when all 8 DIMMs are populated. My suspicion is that this speed is out of spec for the 2696v4 memory controller, hence the BSODs. I changed the motherboard and the issue persists, so I am guessing RAM is the biggest suspect?
The 2696 V4 supports DDR4-2400, as do all V4 processors. Running all 8 DIMMs wouldn't cause an issue like that, even though it does place a bigger load on the processor's IMC.
 

traderjay

Active Member
Mar 24, 2017
196
41
28
35
The 2696 V4 supports DDR4-2400, as do all V4 processors. Running all 8 DIMMs wouldn't cause an issue like that, even though it does place a bigger load on the processor's IMC.
But based on Supermicro's own documentation, when all 4 channels are populated, the RAM speed goes down to 2133 MHz. Not sure why my board is defaulting to 2400 MHz, and there is no way for me to change the RAM speed.

Edit - I reread the documentation; the 2133 MHz speed limit is for Haswell chips. On the other hand, Kingston's documentation says that when you run 2 DIMMs per channel the speed drops to 2133 MHz:

General Intel Xeon E5-2600 v3 / v4 (Grantley) Information
 

vinceflynow

New Member
May 3, 2017
29
5
3
The E5-2696 V4 are Broadwell-EP.

The Supermicro documentation for the X10DA{X,C,i} motherboard shows that you can populate all four channels, per CPU, and still have RDIMMs run 2400MHz.

I'm assuming that your eight RDIMMs are evenly spread across the 4 channels * 2 CPUs (8 channels), which means you're only populating one slot per channel. According to the chart (taken from the X10DAC manual), the RDIMMs run at 2400 MHz when only one RDIMM slot per channel is populated.

X10DAC_mem.png
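The slot arithmetic above, as a tiny Python sketch. It assumes the DIMMs are spread as evenly as possible and defaults to the 4 channels per CPU discussed in this thread; the function name is mine. It does not encode the speed table from the manual, only the DIMMs-per-channel figure you would look up in it.

```python
def dimms_per_channel(total_dimms: int, cpus: int, channels_per_cpu: int = 4) -> int:
    """Highest number of DIMMs on any one channel, assuming an even spread."""
    channels = cpus * channels_per_cpu
    # Ceiling division: any leftover DIMMs force a second slot on some channels.
    return -(-total_dimms // channels)


# 8 RDIMMs across 2 CPUs with 4 channels each.
print(dimms_per_channel(8, cpus=2))
```

With 8 DIMMs on a dual-socket board this gives 1 DIMM per channel; 16 DIMMs would give 2.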
 

traderjay

Active Member
Mar 24, 2017
196
41
28
35
The E5-2696 V4 are Broadwell-EP.

The Supermicro documentation for the X10DA{X,C,i} motherboard shows that you can populate all four channels, per CPU, and still have RDIMMs run 2400MHz.

I'm assuming that your eight RDIMMs are evenly spread across the 4 channels * 2 CPUs (8 channels), which means you're only populating one slot per channel. According to the chart (taken from the X10DAC manual), the RDIMMs run at 2400 MHz when only one RDIMM slot per channel is populated.
I am putting all 8 DIMMs in the blue slots of the board - or did I screw up?
 

vinceflynow

New Member
May 3, 2017
29
5
3
The blue slots should be the first slot for each channel, which means you're populating one slot per channel, per CPU. This DIMM population maximizes memory performance on these motherboards when using dual Broadwell-EP CPUs.

I also populate all 8 blue DIMM slots on an X10DRG-Q motherboard with dual E5-2696V4.
 

traderjay

Active Member
Mar 24, 2017
196
41
28
35
The blue slots should be the first slot for each channel, which means you're populating one slot per channel, per CPU. This DIMM population maximizes memory performance on these motherboards when using dual Broadwell-EP CPUs.

I also populate all 8 blue DIMM slots on an X10DRG-Q motherboard with dual E5-2696V4.
Thanks - can you tell me which RAM model you are using for the system?