Intel W680 DDR5 and ECC Reporting


spring1993

New Member
Jul 12, 2023
Good day, everyone.

I am planning to update all my noisy and slow Xeons to W680 for many reasons.

To test, I bought just one, built it, and was able to get a "working" ECC but found that no ECC reporting is available yet!

The question is: since ECC reporting is not implemented yet, how will the OS know that there is corruption so that it can lock up or shut itself down to protect the data?

If a RAM stick is at fault, then it will corrupt the OS and anything else in the way without the OS even noticing it. Am I missing something here?

  • Mobo Asus w680
  • CPU 13900k stock
  • Ram 128gb Kingston ECC 4800
  • OS Debian 12 Kernel 6.5.13
 

spring1993

New Member
Jul 12, 2023
Could you please explain in more detail how the data will survive bad RAM if neither the OS nor the administrator ever learns that the ECC RAM is bad?
 

unwind-protect

Active Member
Mar 7, 2016
Boston
How do you figure that reporting does not work?

Which OS, anyway?
 

spring1993

New Member
Jul 12, 2023
I mentioned OS Debian 12 in the post

and here is how:

Code:
$ ras-mc-ctl --status
ras-mc-ctl: drivers not loaded.

$ edac-util -v
edac-util: Error: No memory controller data found.

$ dmesg | grep -i edac
[    0.412433]  EDAC MC: Ver: 3.0.0
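
For anyone who wants to script the same check, here is a minimal Python sketch, assuming the standard EDAC sysfs layout under /sys/devices/system/edac/mc. The function name `edac_status` and the `sysfs_root` parameter are made up for illustration; it only lists what the kernel exposes and is no substitute for `ras-mc-ctl` or `edac-util`.

```python
from pathlib import Path


def edac_status(sysfs_root="/sys"):
    """Report which EDAC memory controllers and *_edac modules the kernel exposes."""
    root = Path(sysfs_root)
    # Each registered memory controller appears as mc0, mc1, ... under this directory.
    mc_dir = root / "devices/system/edac/mc"
    controllers = sorted(p.name for p in mc_dir.glob("mc[0-9]*") if p.is_dir())
    # Loaded EDAC driver modules (for example ie31200_edac) appear under /sys/module.
    modules = sorted(p.name for p in (root / "module").glob("*_edac") if p.is_dir())
    return {"controllers": controllers, "modules": modules}


if __name__ == "__main__":
    status = edac_status()
    if status["controllers"]:
        print("EDAC memory controllers:", ", ".join(status["controllers"]))
    else:
        print("No EDAC memory controller registered.")
    print("EDAC modules present:", ", ".join(status["modules"]) or "none")
```

An empty controller list corresponds to the "No memory controller data found" message from `edac-util` above.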
 

JanR

Member
Nov 5, 2023
If your board supports ACPI error injection, you could try that and see what happens when you inject correctable or uncorrectable errors:
I tried that on an X13SAE-F with no success.

Currently, there is no EDAC driver and, in my experience, the "Output ACPI APEI/GHES BIOS detected errors via EDAC" driver option does not work either.

However, according to Intel documentation, it should be possible to write such a driver. Unfortunately, I have no time for that, and according to the EDAC mailing list, which I checked some weeks ago, nobody is working on it so far.

What I have figured out so far: the "igen6" driver is NOT what we want. It reports in-band ECC problems on some very specific Alder Lake SKUs. What we need is support for out-of-band (i.e. "real") ECC memory.

As for the usefulness of such a driver: I have been running ECC boards for many years and check the EDAC data quite often (so I miss it on W680 boards). So far I have had no uncorrectable errors, but a fair number of correctable ones. This shows that ECC is working, and it helps to find problematic DIMMs (e.g., ones that throw one correctable error per week).
 

Alterra

New Member
Feb 26, 2023
I tried that on an X13SAE-F with no success.
When you say no success, you mean you inject an error and nothing happens? I am getting this on X12SAE when I inject a correctable error (I don't have a W680 board to test on):

Code:
[309293.990507] EINJ: Error INJection is initialized.
[309413.023377] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[309413.023380] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[309413.023381] {1}[Hardware Error]: event severity: corrected
[309413.023382] {1}[Hardware Error]:  Error 0, type: corrected
[309413.023383] {1}[Hardware Error]:  fru_text: CorrectedErr
[309413.023384] {1}[Hardware Error]:   section_type: memory error
[309413.023385] {1}[Hardware Error]:   error_type: 8, parity error
It takes a few seconds to appear in the kernel log; I guess this is because the EDAC framework only polls the ACPI tables occasionally. I have no idea whether real memory errors are reported or not. Even if they are, without an EDAC driver the error location would not be reported anyway.
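
In case it helps others reproduce the injection: below is a rough Python sketch of the sequence described in the kernel's EINJ documentation (write `param1`, `param2` and `error_type`, then trigger via `error_inject`). Assumptions: debugfs is mounted, the `einj` module is loaded, the script runs as root, and the physical address passed in is a valid RAM address on your machine (none is suggested here). The function names are mine and not part of any tool.

```python
from pathlib import Path

# Default debugfs location of the EINJ interface (assumes debugfs is mounted here).
EINJ_DIR = Path("/sys/kernel/debug/apei/einj")
MEM_CORRECTABLE = 0x8  # "Memory Correctable" in the kernel's EINJ documentation


def available_error_types(einj_dir=EINJ_DIR):
    """Parse available_error_type into {type_code: description}."""
    types = {}
    text = (Path(einj_dir) / "available_error_type").read_text()
    for line in text.splitlines():
        parts = line.split(None, 1)  # e.g. "0x00000008<TAB>Memory Correctable"
        if len(parts) == 2:
            types[int(parts[0], 16)] = parts[1].strip()
    return types


def inject_memory_error(phys_addr, error_type=MEM_CORRECTABLE, einj_dir=EINJ_DIR):
    """Ask the firmware to inject one memory error within the page at phys_addr."""
    einj_dir = Path(einj_dir)
    if error_type not in available_error_types(einj_dir):
        raise RuntimeError(f"firmware does not offer error type {error_type:#x}")
    (einj_dir / "param1").write_text(f"{phys_addr:#x}\n")        # target address
    (einj_dir / "param2").write_text("0xfffffffffffff000\n")     # mask: anywhere in that page
    (einj_dir / "error_type").write_text(f"{error_type:#x}\n")
    (einj_dir / "error_inject").write_text("1\n")                # any write triggers the injection
```

If the firmware honours the request, a corrected-error record similar to the log above would be expected in dmesg shortly afterwards. If `available_error_type` is missing or does not list 0x8, the board probably does not expose memory error injection at all.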

Currently, there is no EDAC driver and, in my experience, the "Output ACPI APEI/GHES BIOS detected errors via EDAC" driver option does not work either.

However, according to Intel documentation, it should be possible to write such a driver. Unfortunately, I have no time for that, and according to the EDAC mailing list, which I checked some weeks ago, nobody is working on it so far.
Yeah, it would be good if Intel picked up the ball on this one. The hardware support is there, but in this regard the competition is hands down better.
 

JanR

Member
Nov 5, 2023
When you say no success, you mean you inject an error and nothing happens?
I tried this some months ago... nothing happened, but my guess was that I had not injected the error the right way. In particular, there was NOTHING in the log, not even the injection itself. Therefore, I guess I either made a mistake or the board does not support error injection (although I enabled the appropriate option in the BIOS).

Currently, testing (especially experimenting with kernel options) is not that simple because the board is in production use.
 

Styp

Member
Aug 1, 2018
The Asus Pro WS W680, 13900k, DDR5 ECC 4800 setup should work.
I built a machine a couple of months ago with the exact same setup, and it worked.

Just be careful: 13000 non-K does not work with ECC; you really need a K processor.
 

PigLover

Moderator
Jan 26, 2011
…Just be careful: 13000 non-K does not work with ECC; you really need a K processor.
Why do you say the non-K 13900 does not support ECC? ARK and the rest of Intel's docs say the 13900 does support ECC.
 

cromo

Active Member
Jun 6, 2019
Replying to myself, but there is indeed an update: EDAC support for Alder Lake-N, Raptor Lake-P and Meteor Lake-{P,PS} was added in kernel 6.8:

EDIT: actually, no, this was for the igen6 module, which provides In-Band ECC, not full-blown ECC with dedicated memory.
 

2bluesc

New Member
Dec 24, 2017
After installing routine software updates on my NAS and rebooting into the new kernel, my Arch system began crashing frequently throughout the day. The crashes persisted even after rolling back to the previous Linux LTS kernel version.

System Configuration
  • CPU: Intel i5-12600K (ECC supported here)
  • Motherboard: ASUS Pro WS W680-ACE (ECC supported here)
  • Memory: 2x32GB ECC UDIMMs (HMCG88MEBEA084N) @ 4800 MT/s and 1.1V
  • OS: Arch Linux
  • Previous Status: Stable operation for many months

Crash Symptoms
  • BTRFS and ZFS filesystem errors
  • NVMe device issues
  • Generic kernel linked list and atomic corruption
  • Various seemingly unrelated kernel oops/panics

Initial Troubleshooting Attempts
  1. Kernel rollback - v6.12.32 -> v6.12.25 - No improvement
  2. BIOS update - Originally version 4001 (10/04/2024) and now 4101 (12/03/2024)
  3. BIOS config - Tried both existing config and "Load Optimized Defaults"
  4. Memory testing with Memtest86 - 4 passes overnight with ECC polling enabled, ZERO errors reported

Memtester Results (50GB test)
Code:
$ memtester 50G
Loop 1:
  Stuck Address       : testing  14FAILURE: possible bad address line at offset 0x00000009d35a2000.
Skipping to next test...
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok
  Block Sequential    : testing  19FAILURE: 0x0000000000000000 != 0x1313131313131313 at offset 0x00000001116c3000.
FAILURE: 0x0000000000000000 != 0x1313131313131313 at offset 0x00000001116c3008.
FAILURE: 0x0000000000000000 != 0x1313131313131313 at offset 0x00000001116c3010.
FAILURE: 0x0000000000000000 != 0x1313131313131313 at offset 0x00000001116c3018.
FAILURE: 0x0000000000000000 != 0x1313131313131313 at offset 0x00000001116c3020.
FAILURE: 0x0000000000000000 != 0x1313131313131313 at offset 0x00000001116c3028.
FAILURE: 0x0000000000000000 != 0x1313131313131313 at offset 0x00000001116c3030.
FAILURE: 0x0000000000000000 != 0x1313131313131313 at offset 0x00000001116c3038.
FAILURE: 0x0000000000000000 != 0x1313131313131313 at offset 0x00000001116c3040.
FAILURE: 0x0000000000000000 != 0x1313131313131313 at offset 0x00000001116c3048.
FAILURE: 0x0000000000000000 != 0x1313131313131313 at offset 0x00000001116c3050.
FAILURE: 0x0000000000000000 != 0x1313131313131313 at offset 0x00000001116c3058.
FAILURE: 0x0000000000000000 != 0x1313131313131313 at offset 0x00000001116c3060.
FAILURE: 0x0000000000000000 != 0x1313131313131313 at offset 0x00000001116c3068.
FAILURE: 0x0000000000000000 != 0x1313131313131313 at offset 0x00000001116c3070.
FAILURE: 0x0000000000000000 != 0x1313131313131313 at offset 0x00000001116c3078.
  Checkerboard        : testing  43
I swapped the DIMM slots and repeated the test (mainly to reseat the memory), and it crashed again. At least this time the system didn't hang, so I could investigate until it inevitably crashed. No ECC errors in journalctl or dmesg.
Code:
$ memtester 50G
Jun 08 10:28:23 memtester version 4.6.0 (64-bit)
Jun 08 10:28:23 Copyright (C) 2001-2020 Charles Cazabon.
Jun 08 10:28:23 Licensed under the GNU General Public License version 2 (only).
Jun 08 10:28:23
Jun 08 10:28:23 pagesize is 4096
Jun 08 10:28:23 pagesizemask is 0xfffffffffffff000
Jun 08 10:28:23 want 51200MB (53687091200 bytes)
Jun 08 10:28:42 got  51200MB (53687091200 bytes), trying mlock ...locked.
Jun 08 10:28:42 Loop 1:
Jun 08 10:31:17   Stuck Address       : ok
Jun 08 10:34:48   Random Value        : FAILURE: 0xef9971e8eff6f8c4 != 0x0000000000000000 at offset 0x0000000156f39ff0.
Jun 08 10:34:48 FAILURE: 0xffec8efdfd9e0d09 != 0x0000000000000000 at offset 0x0000000156f39ff8.
Jun 08 10:34:48 FAILURE: 0x4dbf1c793ec6b04e != 0x0000000000000000 at offset 0x0000000156f3a000.
Jun 08 10:34:48 FAILURE: 0x73e86229afff71e8 != 0x0000000000000000 at offset 0x0000000156f3a008.
Jun 08 10:34:48 FAILURE: 0x67c362be527f9e81 != 0x0000000000000000 at offset 0x0000000156f3a010.
Jun 08 10:34:48 FAILURE: 0x4eb2ef31ebbf0f06 != 0x0000000000000000 at offset 0x0000000156f3a018.
Jun 08 10:34:48 FAILURE: 0x4f3f74785b0f68b5 != 0x0000000000000000 at offset 0x0000000156f3a020.
Jun 08 10:34:48 FAILURE: 0xfffff85a6b424eb0 != 0x0000000000000000 at offset 0x0000000156f3a028.
Jun 08 10:34:48 FAILURE: 0x5ff57023ff6f4e26 != 0x0000000000000000 at offset 0x0000000156f3a030.
Jun 08 10:34:48 FAILURE: 0x7b3925d97bfff8bd != 0x0000000000000000 at offset 0x0000000156f3a038.
Jun 08 10:34:48 FAILURE: 0xad3fa0e553af7c94 != 0x0000000000000000 at offset 0x0000000156f3a040.
Jun 08 10:34:48 FAILURE: 0xfefc0c736ff3ae4b != 0x0000000000000000 at offset 0x0000000156f3a048.
Jun 08 10:34:48 FAILURE: 0x6dde34bd73efb1a8 != 0x0000000000000000 at offset 0x0000000156f3a050.
Jun 08 10:34:48 FAILURE: 0xd7bf934c6ff707ac != 0x0000000000000000 at offset 0x0000000156f3a058.
Jun 08 10:34:48 FAILURE: 0x7c73c2e4ffff5ad3 != 0x0000000000000000 at offset 0x0000000156f3a060.
Jun 08 10:34:48 FAILURE: 0xfda3f5eb777f2581 != 0x0000000000000000 at offset 0x0000000156f3a068.
Jun 08 10:34:57   Compare XOR         : ok
Jun 08 10:35:04   Compare SUB         : ok
Jun 08 10:35:11   Compare MUL         : ok
Jun 08 10:35:40   Compare DIV         : ok
Jun 08 10:35:47   Compare OR          : ok
Jun 08 10:35:54   Compare AND         : ok
Jun 08 10:36:03   Sequential Increment: ok
...
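
To make logs like the two above easier to compare, here is a small Python sketch that groups the failing offsets into contiguous runs. It assumes the `FAILURE: <got> != <expected> at offset <hex>` line format seen in these logs and memtester's 8-byte test word; `failure_runs` is a hypothetical helper, not part of memtester.

```python
import re

# Matches the value-mismatch lines only; the "possible bad address line" message
# has no "!=" and is deliberately not matched.
FAILURE_RE = re.compile(
    r"FAILURE: (0x[0-9a-f]+) != (0x[0-9a-f]+) at offset (0x[0-9a-f]+)",
    re.IGNORECASE,
)


def failure_runs(log_text, word_size=8):
    """Group failing offsets into contiguous (start, end_exclusive) byte ranges."""
    offsets = sorted({int(m.group(3), 16) for m in FAILURE_RE.finditer(log_text)})
    runs = []
    for off in offsets:
        if runs and off == runs[-1][1]:
            runs[-1][1] = off + word_size   # extends the current run
        else:
            runs.append([off, off + word_size])
    return [(start, end) for start, end in runs]


if __name__ == "__main__":
    import sys
    for start, end in failure_runs(sys.stdin.read()):
        print(f"{start:#x}..{end:#x} ({end - start} bytes)")
```

Fed with the failure lines shown above, each run comes out as a single contiguous 128-byte span (16 words), starting at 0x1116c3000 in the first log and at 0x156f39ff0 in the second.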
EDAC Investigation
Checked /sys/devices/system/edac/ and found NO memory controller entries - only power management files.

Expected to see:
Code:
/sys/devices/system/edac/mc/mc0/ce_count
/sys/devices/system/edac/mc/mc0/ue_count
But these directories don't exist, so clearly no driver is finding the device and binding to it. :(

A recent patch was posted here that adds EDAC support. However, the device ID it adds isn't present on my i5-12600K. According to Intel's datasheet on PCI IDs, it belongs to a different chip. My system does have 8086:4648, which makes sense given the 6P+4E cores of the i5-12600K.

Code:
$ lspci -nn | rg 4648
00:00.0 Host bridge [0600]: Intel Corporation Device [8086:4648] (rev 02)
I'm going to dig deeper on this... might be as simple as building the latest kernel with the 180f091 commit and patching it to include my Device ID?
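
The same ID can be read without lspci. A small Python sketch, assuming the usual sysfs PCI layout (a `vendor` and a `device` file per device); `host_bridge_id` and `is_listed` are hypothetical names, and the set of IDs the driver supports has to come from the driver source, so it is passed in rather than hard-coded.

```python
from pathlib import Path


def host_bridge_id(bdf="0000:00:00.0", sysfs_root="/sys"):
    """Return the 'vendor:device' ID of a PCI device, e.g. '8086:4648'."""
    dev = Path(sysfs_root) / "bus/pci/devices" / bdf
    # sysfs stores both values as hex strings such as "0x8086".
    vendor = int((dev / "vendor").read_text().strip(), 16)
    device = int((dev / "device").read_text().strip(), 16)
    return f"{vendor:04x}:{device:04x}"


def is_listed(pci_id, supported_ids):
    """Check an ID against a caller-supplied collection copied from the driver's ID table."""
    return pci_id.lower() in {s.lower() for s in supported_ids}


if __name__ == "__main__":
    print("Host bridge:", host_bridge_id())
```

If `is_listed` returns False for the IDs in the driver you are building, that is the case where adding the ID to the driver's PCI table would be the thing to try.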
 

JanR

Member
Nov 5, 2023
Thank you for that pointer! I will try this next week. According to Phoronix, this patch will be added to 6.16: "Intel Hardware Support Expanded In EDAC Drivers For Linux 6.16" (Phoronix)

I'm going to dig deeper on this... might be as simple as building the latest kernel with the 180f091 commit and patching it to include my Device ID?
Give it a try. My guess is that your chances are really good, since ADL SKUs are already covered and it is very unlikely that the memory controller of your SKU differs from the others.
 

2bluesc

New Member
Dec 24, 2017
I'm going to dig deeper on this... might be as simple as building the latest kernel with the 180f091 commit and patching it to include my Device ID?
This patch was successful and now two memory controllers show up as expected with corresponding counters! :cool:

Code:
$ dmesg | rg -i -e edac -e ecc -e ie31200
[    0.372753] EDAC MC: Ver: 3.0.0
[    0.486674] usb usb1: Manufacturer: Linux 6.15.0-1-edac-git-13804-g939f15e640f1-dirty xhci-hcd
[    0.488489] usb usb2: Manufacturer: Linux 6.15.0-1-edac-git-13804-g939f15e640f1-dirty xhci-hcd
[   13.131016] caller ie31200_init_one+0x1b4/0x480 [ie31200_edac] mapping multiple BARs
[   13.131051] EDAC MC0: Giving out device to module ie31200_edac controller IE31200: DEV 0000:00:00.0 (INTERRUPT)
[   13.131082] EDAC MC1: Giving out device to module ie31200_edac controller IE31200_1: DEV 0000:00:00.0 (INTERRUPT)
Code:
$ grep . /sys/devices/system/edac/mc/mc*/*_count
/sys/devices/system/edac/mc/mc0/ce_count:0
/sys/devices/system/edac/mc/mc0/ce_noinfo_count:0
/sys/devices/system/edac/mc/mc0/ue_count:0
/sys/devices/system/edac/mc/mc0/ue_noinfo_count:0
/sys/devices/system/edac/mc/mc1/ce_count:0
/sys/devices/system/edac/mc/mc1/ce_noinfo_count:0
/sys/devices/system/edac/mc/mc1/ue_count:0
/sys/devices/system/edac/mc/mc1/ue_noinfo_count:0
And `ras-mc-ctl` works:
Code:
$ ras-mc-ctl --error-count
Label                   CE      UE
mc#1csrow#0channel#0    0       0
mc#0csrow#0channel#0    0       0
mc#1csrow#1channel#0    0       0
mc#0csrow#1channel#0    0       0
mc#1csrow#0channel#1    0       0
mc#0csrow#0channel#1    0       0
mc#1csrow#1channel#1    0       0
mc#0csrow#1channel#1    0       0
Netdata detects EDAC as well, see screenshot.

But of course the new 6.16 pre-release kernel breaks ZFS, so for now I am backporting the changes to the Arch LTS kernel (~6.12.32).
 


JanR

Member
Nov 5, 2023
This patch was successful and now two memory controllers show up as expected with corresponding counters! :cool:
Great!

I did some experimentation too: my machine at work is still running 6.6.85, so I "backported" the driver. "Backporting" in this case means I just copied the file "ie31200_edac.c" from linux-next into my 6.6.85 source tree. This seems to be possible because both the patchset that introduces RPL-S support in the first place and the patch adding some more SKUs touch *only* this file.

The driver compiles with no issues and can be loaded:

Code:
[1811426.074380] resource: resource sanity check: requesting [mem 0x00000000fedc0000-0x00000000fedcffff], which spans more than pnp 00:04 [mem 0xfedc0000-0xfedc7fff]
[1811426.074384] caller ie31200_init_one+0x1ae/0x4c0 [ie31200_edac] mapping multiple BARs
[1811426.074451] EDAC MC0: Giving out device to module ie31200_edac controller IE31200: DEV 0000:00:00.0 (INTERRUPT)
[1811426.074527] EDAC MC1: Giving out device to module ie31200_edac controller IE31200_1: DEV 0000:00:00.0 (INTERRUPT)
Furthermore, edac-util gives the information I expected:

Code:
# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: mc#0csrow#0channel#0: 0 Corrected Errors
mc0: csrow0: mc#0csrow#0channel#1: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: mc#0csrow#1channel#0: 0 Corrected Errors
mc0: csrow1: mc#0csrow#1channel#1: 0 Corrected Errors
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: mc#1csrow#0channel#0: 0 Corrected Errors
mc1: csrow0: mc#1csrow#0channel#1: 0 Corrected Errors
mc1: csrow1: 0 Uncorrected Errors
mc1: csrow1: mc#1csrow#1channel#0: 0 Corrected Errors
mc1: csrow1: mc#1csrow#1channel#1: 0 Corrected Errors
mc1: csrow2: 0 Uncorrected Errors
mc1: csrow2: mc#1csrow#2channel#0: 0 Corrected Errors
mc1: csrow2: mc#1csrow#2channel#1: 0 Corrected Errors
mc1: csrow3: 0 Uncorrected Errors
mc1: csrow3: mc#1csrow#3channel#0: 0 Corrected Errors
mc1: csrow3: mc#1csrow#3channel#1: 0 Corrected Errors
edac-util: No errors to report.
Therefore, this seems to be working. Now we have to wait for a correctable error. I guess that your machine, which has already experienced memory errors, is a good candidate for this.
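
While waiting, the counters can be watched from a script. A minimal Python sketch, assuming the `mc*/*_count` sysfs files shown earlier in the thread; the function names and the idea of diffing two snapshots are just one way to do it, and rasdaemon (which ships `ras-mc-ctl`) is the more complete tool for logging such events.

```python
from pathlib import Path


def read_counts(sysfs_root="/sys"):
    """Snapshot all EDAC counters as {'mc0/ce_count': 0, ...}."""
    mc_root = Path(sysfs_root) / "devices/system/edac/mc"
    counts = {}
    for path in sorted(mc_root.glob("mc*/*_count")):
        counts[f"{path.parent.name}/{path.name}"] = int(path.read_text().strip())
    return counts


def new_errors(before, after):
    """Return only the counters that increased between two snapshots, with their delta."""
    return {
        key: value - before.get(key, 0)
        for key, value in after.items()
        if value > before.get(key, 0)
    }
```

Calling `read_counts()` from a cron job or timer, storing the result, and passing the previous and current snapshots to `new_errors` would flag any counter that moved; a non-empty result for a `ue_count` entry is the case that deserves immediate attention.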