Intel W680 DDR5 and ECC Reporting


spring1993

New Member
Jul 12, 2023
4
0
1
Good day, everyone.

I am planning to replace all my noisy and slow Xeons with W680 systems for many reasons.

To test, I bought just one, built it, and was able to get a "working" ECC but found that no ECC reporting is available yet!

The question is: since ECC reporting is not implemented yet, how will the OS know that there is corruption and lock up or shut itself down to protect the data?

If a RAM stick is at fault, then it will corrupt the OS and anything else in the way without the OS even noticing it. Am I missing something here?

  • Motherboard: Asus W680
  • CPU: 13900K (stock)
  • RAM: 128 GB Kingston ECC DDR5-4800
  • OS: Debian 12, kernel 6.5.13
 

spring1993

New Member
Jul 12, 2023
4
0
1
Could you please explain in more detail how the data will survive bad RAM if neither the OS nor the administrator ever learns that the ECC RAM is bad?
 

unwind-protect

Active Member
Mar 7, 2016
418
156
43
Boston
How do you figure that reporting does not work?

Which OS, anyway?
 

spring1993

New Member
Jul 12, 2023
4
0
1
I mentioned OS Debian 12 in the post

and here is how:

Code:
$ ras-mc-ctl --status
ras-mc-ctl: drivers not loaded.

$ edac-util -v
edac-util: Error: No memory controller data found.

$ dmesg | grep -i edac
[    0.412433]  EDAC MC: Ver: 3.0.0
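For anyone checking their own board: the three commands above all come down to whether an EDAC driver has registered a memory controller in sysfs. Below is a minimal sketch of that check, assuming the standard /sys/devices/system/edac/mc layout. The function name edac_mc_summary and its optional path argument are made up for illustration and are not part of any tool.

```shell
#!/bin/sh
# Sketch: report whether any EDAC memory controller is registered.
# The optional first argument overrides the sysfs root (hypothetical
# convenience for testing); by default the real sysfs path is used.
edac_mc_summary() {
    root="${1:-/sys/devices/system/edac/mc}"
    found=0
    for mc in "$root"/mc[0-9]*; do
        [ -d "$mc" ] || continue      # glob matched nothing: no controller
        found=1
        # ce_count / ue_count are the per-controller corrected and
        # uncorrected error totals exposed by the EDAC core.
        ce=$(cat "$mc/ce_count" 2>/dev/null || echo "?")
        ue=$(cat "$mc/ue_count" 2>/dev/null || echo "?")
        echo "$(basename "$mc"): corrected=$ce uncorrected=$ue"
    done
    if [ "$found" -eq 0 ]; then
        echo "no EDAC memory controller registered"
        return 1
    fi
}
```

Calling edac_mc_summary with no arguments on a system in the state shown above would be expected to take the "no EDAC memory controller registered" branch, since only the EDAC core (and no chipset driver) is loaded.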
 

JanR

New Member
Nov 5, 2023
28
11
3
If your board supports ACPI error injection, you could try that and see what happens when you inject correctable or uncorrectable errors:
I tried that on an X13SAE-F with no success.

Currently, there is no EDAC driver and, in my experience, the "Output ACPI APEI/GHES BIOS detected errors via EDAC" driver option does not work either.

However, according to Intel documentation, it should be possible to write such a driver. Unfortunately, I have no time for that and, according to the EDAC mailing list (which I checked some weeks ago), nobody is working on it so far.

What I have figured out so far: the "igen6" driver is NOT what we want. It reports in-band ECC problems on some very specific Alder Lake SKUs. What we need is support for out-of-band (i.e., "real") ECC memory.

As for the usefulness of such a driver: I have been running ECC boards for many years and check the EDAC data quite often (so I miss it on W680 boards). So far I have not had a single uncorrectable error, but I have had a number of correctable ones. That shows ECC is working, and the data helps to find problematic DIMMs (e.g., ones that throw one correctable error per week).
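On boards that do have an EDAC driver, the per-DIMM check described here can be sketched roughly as follows. This assumes the driver exposes dimmN directories with dimm_ce_count and dimm_label files (some drivers use rankN instead, which this sketch does not handle). The function name edac_dimm_ce_report and the path argument are hypothetical.

```shell
#!/bin/sh
# Sketch: list DIMMs that have logged at least one corrected error.
# The optional first argument overrides the sysfs root (hypothetical
# convenience for testing).
edac_dimm_ce_report() {
    root="${1:-/sys/devices/system/edac/mc}"
    for f in "$root"/mc[0-9]*/dimm[0-9]*/dimm_ce_count; do
        [ -r "$f" ] || continue                   # glob matched nothing
        count=$(cat "$f")
        # Skip DIMMs with zero errors (or a non-numeric value).
        [ "$count" -gt 0 ] 2>/dev/null || continue
        dimm_dir=$(dirname "$f")
        mc=$(basename "$(dirname "$dimm_dir")")
        dimm=$(basename "$dimm_dir")
        label=$(cat "$dimm_dir/dimm_label" 2>/dev/null)
        echo "$mc/$dimm label=${label:-unknown} corrected=$count"
    done
}
```

Run periodically (from cron, for example), the output makes it easy to see which slot keeps accumulating corrected errors. None of this helps on W680 until a driver registers the memory controller.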
 

Alterra

New Member
Feb 26, 2023
7
2
3
I tried that on an X13SAE-F with no success.
When you say no success, do you mean you injected an error and nothing happened? This is what I get on an X12SAE when I inject a correctable error (I don't have a W680 board to test on):

Code:
[309293.990507] EINJ: Error INJection is initialized.
[309413.023377] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[309413.023380] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[309413.023381] {1}[Hardware Error]: event severity: corrected
[309413.023382] {1}[Hardware Error]:  Error 0, type: corrected
[309413.023383] {1}[Hardware Error]:  fru_text: CorrectedErr
[309413.023384] {1}[Hardware Error]:   section_type: memory error
[309413.023385] {1}[Hardware Error]:   error_type: 8, parity error
It takes a few seconds to appear in the kernel log. I guess this is because the EDAC framework only polls the ACPI tables occasionally. I have no idea whether actual real memory errors are reported or not. Even if they are, the error location would not be reported anyway, since there is no EDAC driver.
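For reference, an injection like the one that produced the log above can be sketched as follows, using the kernel's EINJ debugfs interface. The assumptions: debugfs is mounted at /sys/kernel/debug, the einj module is loaded, and you are root. The function name einj_inject_correctable and the optional directory argument are made up for illustration, and the address in the usage line is only a placeholder. It must be replaced with a physical address that is valid on your machine.

```shell
#!/bin/sh
# Sketch: inject one correctable memory error through the EINJ debugfs files.
# $1 = physical address to target, $2 = EINJ directory (hypothetical
# override for testing; defaults to the usual debugfs location).
einj_inject_correctable() {
    addr="$1"
    einj="${2:-/sys/kernel/debug/apei/einj}"
    if [ -z "$addr" ]; then
        echo "usage: einj_inject_correctable <phys-addr> [einj-dir]" >&2
        return 2
    fi
    # The firmware lists what it allows; bail out if correctable memory
    # errors are not among the offered types.
    if ! grep -q "Memory Correctable" "$einj/available_error_type" 2>/dev/null; then
        echo "correctable memory injection not offered by firmware" >&2
        return 1
    fi
    echo "$addr" > "$einj/param1"               # target physical address
    echo 0xfffffffffffff000 > "$einj/param2"    # mask: anywhere in that page
    echo 0x8 > "$einj/error_type"               # 0x8 = Memory Correctable
    echo 1 > "$einj/error_inject"               # trigger the injection
}
```

Usage would look like einj_inject_correctable 0x12345000, followed by watching dmesg for a "Hardware Error" record. Whether a W680 board's firmware honors the request is exactly the open question in this thread.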

Currently, there is no EDAC driver and, in my experience, the "Output ACPI APEI/GHES BIOS detected errors via EDAC" driver option does not work either.

However, according to Intel documentation, it should be possible to write such a driver. Unfortunately, I have no time for that and, according to the EDAC mailing list (which I checked some weeks ago), nobody is working on it so far.
Yeah, it would be good if Intel picked up the ball on this one. The hardware support is there, but in this regard the competition is hands down better.
 

JanR

New Member
Nov 5, 2023
28
11
3
When you say no success, you mean you inject an error and nothing happens?
I tried this some months ago. Nothing happened, but my guess was that I had not injected the error the right way. In particular, there was NOTHING in the log, not even the injection itself. So I guess I either made a mistake or the board does not support error injection (although I enabled the appropriate option in the BIOS).

Currently, testing (especially experimenting with kernel options) is not that simple because the board is in production use.
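One way to narrow down the "mistake or unsupported board" question before touching a production machine is a read-only precheck. The sketch below is only a rough idea. It assumes ACPI tables are exposed under /sys/firmware/acpi/tables and that the EINJ interface lives in debugfs under /sys/kernel/debug/apei/einj. The function name einj_precheck and both path arguments are hypothetical.

```shell
#!/bin/sh
# Sketch: read-only check of whether EINJ error injection looks usable.
# $1 = ACPI tables directory, $2 = EINJ debugfs directory (both are
# hypothetical overrides for testing; defaults are the usual locations).
einj_precheck() {
    tables="${1:-/sys/firmware/acpi/tables}"
    einj="${2:-/sys/kernel/debug/apei/einj}"
    if [ ! -e "$tables/EINJ" ]; then
        echo "firmware does not expose an ACPI EINJ table"
        return 1
    fi
    if [ ! -d "$einj" ]; then
        echo "EINJ table present, but the einj debugfs interface is missing"
        return 2
    fi
    # debugfs files report size 0, so read the content instead of testing size.
    types=$(cat "$einj/available_error_type" 2>/dev/null)
    if [ -z "$types" ]; then
        echo "einj interface present, but no injectable error types listed"
        return 3
    fi
    echo "injectable error types:"
    echo "$types"
}
```

A return code of 1 would point at the BIOS option not taking effect, while 2 would suggest the einj module is not loaded or debugfs is not mounted. Nothing here writes to the system, so it should be safe to run on a machine in production use.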
 

Styp

Member
Aug 1, 2018
69
21
8
The Asus Pro WS W680, 13900k, DDR5 ECC 4800 setup should work.
I built a machine a couple of months ago with the exact same setup, and it worked.

Just be careful: 13000 non-K does not work with ECC; you really need a K processor.
 

PigLover

Moderator
Jan 26, 2011
3,186
1,545
113
…Just be careful: 13000 non-K does not work with ECC; you really need a K processor.
Why do you say the 13900 non-K does not support ECC? ARK and the rest of Intel's docs say the 13900 does support ECC.