Intel W680 DDR5 and ECC Reporting

spring1993 · Dec 18, 2023

Good day, everyone.

I am planning to update all my noisy and slow Xeons to W680 for many reasons.

To test, I bought just one, built it, and was able to get a "working" ECC but found that no ECC reporting is available yet!

The question is, since ECC reporting is not implemented yet, how will the OS know that there is corruption and lock up or shut itself down to protect the data?

If a RAM stick is at fault, then it will corrupt the OS and anything else in the way without the OS even noticing it. Am I missing something here?

Mobo Asus w680
CPU 13900k stock
Ram 128gb Kingston ECC 4800
OS Debian 12 Kernel 6.5.13

sko · Dec 18, 2023

spring1993 said:
how will the OS know that there is corruption and lock up or shut itself down to protect the data?

that's not how ECC memory works...

spring1993 · Dec 18, 2023

Could you please elaborate? How will the data survive bad RAM if neither the OS nor the administrator ever finds out that the ECC RAM is bad?

unwind-protect · Dec 18, 2023

spring1993 said:
Good day, everyone.

I am planning to update all my noisy and slow Xeons to W680 for many reasons.

To test, I bought just one, built it, and was able to get a "working" ECC but found that no ECC reporting is available yet!

The question is, since ECC reporting is not implemented yet, how will the OS know that there is corruption and lock up or shut itself down to protect the data?

If a RAM stick is at fault, then it will corrupt the OS and anything else in the way without the OS even noticing it. Am I missing something here?

Mobo Asus w680

CPU 13900k stock

Ram 128gb Kingston ECC 4800

OS Debian 12 Kernel 6.5.13

How do you figure that reporting does not work?

Which OS, anyway?

spring1993 · Dec 18, 2023

I mentioned OS Debian 12 in the post

and here is how:

Code:

$ ras-mc-ctl --status
ras-mc-ctl: drivers not loaded.

$ edac-util -v
edac-util: Error: No memory controller data found.

$ dmesg | grep -i edac
[    0.412433]  EDAC MC: Ver: 3.0.0

Alterra · Dec 19, 2023

If your board supports ACPI error injection, you could try that and see what happens when you inject correctable or uncorrectable errors:

APEI Error INJection — The Linux Kernel documentation

You may need to modprobe einj first.
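A rough sketch of what the injection looks like, following the debugfs interface described in the kernel's EINJ documentation (`available_error_type`, `param1`, `param2`, `error_type`, `error_inject`). The function name `einj_inject_correctable` and its arguments are hypothetical, the address is something you have to choose yourself, and this needs root plus a mounted debugfs. Treat it as an illustration, not a tested recipe for any particular board.

```shell
#!/bin/sh
# Sketch: inject one correctable memory error through the EINJ debugfs files.
# $1 = physical address to target (caller's choice, must be valid RAM)
# $2 = EINJ directory (optional; defaults to the usual debugfs location)
einj_inject_correctable() {
    addr="$1"
    dir="${2:-/sys/kernel/debug/apei/einj}"
    if [ ! -r "$dir/available_error_type" ]; then
        echo "EINJ interface not found at $dir (modprobe einj? debugfs mounted?)" >&2
        return 1
    fi
    # 0x00000008 is the "Memory Correctable" type in the EINJ table.
    if ! grep -qi '^0x00000008' "$dir/available_error_type"; then
        echo "firmware does not advertise 'Memory Correctable' injection" >&2
        return 2
    fi
    echo "$addr" > "$dir/param1"              # target address
    echo 0xfffffffffffff000 > "$dir/param2"   # mask: anywhere in that page
    echo 0x8 > "$dir/error_type"              # correctable memory error
    echo 1 > "$dir/error_inject"              # trigger the injection
}
```

If the first check fails, the board or BIOS probably does not expose an EINJ table at all, and nothing further can be injected this way.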

Don't hold your breath while waiting for EDAC support; even Comet Lake is still not there...

JanR · Dec 26, 2023

Alterra said:
If your board supports ACPI error injection, you could try that and see what happens when you inject correctable or uncorrectable errors:

I tried that on an X13SAE-F with no success.

Currently there is no EDAC driver, and in my experience the "Output ACPI APEI/GHES BIOS detected errors via EDAC" driver option does not work either.

However, according to Intel documentation, it should be possible to write such a driver. Unfortunately, I have no time for that, and according to the EDAC mailing list, which I checked a few weeks ago, nobody is working on it so far.

What I have figured out so far: the "igen6" driver is NOT what we want. It reports in-band ECC problems on a few very specific Alder Lake SKUs. What we need is support for out-of-band (so "real") ECC memory.

With respect to the usefulness of such a driver: I have been running ECC boards for many years and check the EDAC data quite often (so I miss it on W680 boards). So far I have not had a single uncorrectable error, but a bunch of correctable ones. This shows that ECC is working, and it helps to find problematic DIMMs (e.g., the ones that throw one correctable error per week).
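On boards that do have an EDAC driver, this kind of periodic check can be sketched as below, reading the per-controller `ce_count` (corrected) and `ue_count` (uncorrected) counters from sysfs. The function name `edac_error_counts` is made up for illustration; on W680 it would currently print nothing, since no controller is registered.

```shell
#!/bin/sh
# Sketch: print corrected/uncorrected error counters for each EDAC controller.
# The optional first argument overrides the sysfs root (useful for dry runs).
edac_error_counts() {
    root="${1:-/sys/devices/system/edac/mc}"
    for mc in "$root"/mc[0-9]*; do
        # Skip unmatched globs and controllers without readable counters.
        [ -r "$mc/ce_count" ] && [ -r "$mc/ue_count" ] || continue
        echo "$(basename "$mc") ce=$(cat "$mc/ce_count") ue=$(cat "$mc/ue_count")"
    done
}
```

Run from cron and compared against the previous run, a slowly rising `ce` value is the sort of signal that points at a marginal DIMM.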

Alterra · Dec 26, 2023

JanR said:
I tried that on an X13SAE-F with no success.

When you say no success, you mean you inject an error and nothing happens? I am getting this on X12SAE when I inject a correctable error (I don't have a W680 board to test on):

Code:

[309293.990507] EINJ: Error INJection is initialized.
[309413.023377] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[309413.023380] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[309413.023381] {1}[Hardware Error]: event severity: corrected
[309413.023382] {1}[Hardware Error]:  Error 0, type: corrected
[309413.023383] {1}[Hardware Error]:  fru_text: CorrectedErr
[309413.023384] {1}[Hardware Error]:   section_type: memory error
[309413.023385] {1}[Hardware Error]:   error_type: 8, parity error

It takes a few seconds to appear in the kernel log. I guess this is because the EDAC framework only polls the ACPI tables occasionally. I have no idea whether actual, real memory errors are reported or not. Even if they are, the error location would not be reported anyway, since there is no EDAC driver.
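Since these GHES records only end up in the kernel log, one crude way to keep an eye on them without an EDAC driver is to count the "event severity: corrected" lines. This is just a sketch that assumes the log format shown above; `count_corrected_events` is a made-up helper name.

```shell
#!/bin/sh
# Sketch: count APEI/GHES corrected-error records in kernel log text on stdin.
count_corrected_events() {
    # -F: fixed string match; -c: print the number of matching lines.
    # grep exits non-zero when the count is 0, so mask that with "|| true".
    grep -cF 'Hardware Error]: event severity: corrected' || true
}
```

For example, `dmesg | count_corrected_events` would give the number of corrected events since boot, as long as the ring buffer has not wrapped.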

JanR said:
Currently there is no EDAC driver, and in my experience the "Output ACPI APEI/GHES BIOS detected errors via EDAC" driver option does not work either.

However, according to Intel documentation, it should be possible to write such a driver. Unfortunately, I have no time for that, and according to the EDAC mailing list, which I checked a few weeks ago, nobody is working on it so far.

Yeah, it would be good if Intel took the ball on this one. The hardware support is there, but in this regard the competition is hands down better.

JanR · Dec 26, 2023

Alterra said:
When you say no success, you mean you inject an error and nothing happens?

I tried this some months ago... nothing happened, but my guess was that I had not injected the error the right way. In particular, there was NOTHING in the log, not even the injection itself. So I guess I either made a mistake or the board does not support error injection (although I enabled the appropriate option in the BIOS).

Currently, testing (especially experimenting with kernel options) is not that simple because the board is in production use.

Styp · Dec 27, 2023

The Asus Pro WS W680, 13900k, DDR5 ECC 4800 setup should work.
I built a machine a couple of months ago with the exact same setup, and it worked.

~~Just be careful: 13000 non-K does not work with ECC; you really need a K processor.~~

PigLover · Dec 27, 2023

Styp said:
…Just be careful: 13000 non-K does not work with ECC; you really need a K processor.

Why do you say the 13900 non-K does not support ECC? Ark and the rest of Intel's docs say the 13900 does support ECC.

RolloZ170 · Dec 27, 2023

Styp said:
Just be careful: 13000 non-K does not work with ECC; you really need a K processor.

i5-13500 and up non 'F' have ECC support.

PigLover said:
Ark and the rest of Intel's docs say the 13900 does support ECC.

~~i think he means Asus Pro WS W680 supports ECC only with 'K' models.~~

Styp · Dec 27, 2023

My bad. 13900k vs 13900kf, I just found the screenshots.

13900kf:

13900k:

cromo · Mar 12, 2024

Is there any update on this matter?

cromo · Mar 13, 2024

Replying to myself, but indeed there is an update: support for EDAC in Alder Lake-N, Raptor Lake-P, Meteor Lake-{P,PS} was added in kernel 6.8:

[GIT PULL] EDAC updates for v6.8 - Borislav Petkov

EDIT: actually, no, this was for the igen6 module, which handles In-Band ECC, not full-blown ECC with dedicated memory.
