Has anyone gotten ECC logging (Rasdaemon, EDAC, WHEA, etc.) to work on Xeon W-1200 or W-1300 or Core 12ᵗʰ or 13ᵗʰ Gen processors?


BigBullion

Member
Jul 28, 2022
45
14
8
I have an 11th Gen Intel Xeon W-1370 Rocket Lake CPU which supports DDR4 ECC UDIMMs. The motherboard is Gigabyte W480M Vision W, with latest BIOS (version F21).

In order to enable the system to log ECC errors I have attempted to install Rasdaemon and EDAC. This is my first time installing and using Rasdaemon and EDAC.

I have followed these instructions and tried to install and run it on Fedora Workstation 37 but failed to get it working.

Here are the steps and commands that I have tried.

Code:
$ sudo dnf install -y rasdaemon edac-util edac-utils libedac edac-ctl

$ sudo systemctl enable rasdaemon
$ sudo systemctl start rasdaemon
$ sudo systemctl enable ras-mc-ctl
$ sudo systemctl start ras-mc-ctl

$ ras-mc-ctl --status
ras-mc-ctl: drivers not loaded.

$ edac-util -v
edac-util: Error: No memory controller data found.

$ dmesg | grep -i edac
[    1.228826]  EDAC MC: Ver: 3.0.0

$ sudo modprobe ie31200_edac
modprobe: ERROR: could not insert 'ie31200_edac': Operation not permitted

Maybe EDAC is not supported for 10th Gen Comet Lake or 11th Gen Rocket Lake CPUs? I have found a webpage in which a user mentions that there is no support for 10th Gen.

I don't know if this is the right place to look, but I have searched through the source code of the Linux EDAC drivers and found no mention of 10th Gen or 11th Gen support, although 8th and 9th Gen Coffee Lake CPUs are listed as supported.
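
One way to check this without reading the driver source is to compare the host bridge's PCI ID with the IDs the module declares it binds to. This is a minimal sketch, assuming `lspci` and `modinfo` are installed and that the host bridge sits at 00:00.0 (usual on these desktop platforms, but still an assumption). The helper names `aliases_to_ids` and `id_is_supported` are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: is this machine's host bridge in the ie31200_edac PCI ID table?

# Convert modinfo "alias: pci:v0000VVVVd0000DDDDsv*..." lines (stdin)
# into lowercase "vvvv:dddd" pairs, one per line.
aliases_to_ids() {
    sed -n 's/^.*pci:v0000\([0-9A-Fa-f]\{4\}\)d0000\([0-9A-Fa-f]\{4\}\).*$/\1:\2/p' \
        | tr 'A-F' 'a-f'
}

# Succeed if the ID given as $1 appears in the ID list read from stdin.
id_is_supported() {
    grep -qixF -- "$1"
}

# Live check, only attempted when both tools are available.
if command -v lspci >/dev/null 2>&1 && command -v modinfo >/dev/null 2>&1; then
    # "lspci -n" prints e.g. "00:00.0 0600: 8086:xxxx (rev ..)"; field 3 is the ID.
    host_id=$(lspci -n -s 00:00.0 | awk '{print $3}')
    if [ -z "$host_id" ]; then
        echo "no device found at 00:00.0"
    elif modinfo ie31200_edac 2>/dev/null | grep '^alias:' \
            | aliases_to_ids | id_is_supported "$host_id"; then
        echo "host bridge $host_id is in the ie31200_edac ID table"
    else
        echo "host bridge $host_id is not claimed by ie31200_edac"
    fi
fi
```

If the ID is not in the table, the driver has nothing to bind to on this CPU, whatever the distro.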

I might try a different Linux distro, Ubuntu, next.

I have also installed Windows 11 Pro for Workstations and looked at Event Viewer > Windows Logs > System > Source, but there were no WHEA-Logger events after stress testing, even though MemTest86 starts reporting ECC errors within a few seconds of starting the test:

xeonw1370passmarkresults.jpg

The lack of WHEA events in Windows 11 might be due to the motherboard not supporting WHEA, though. Supermicro's motherboards do seem to support WHEA reporting: the manuals for the X12SAE, X12SAE-5 and X13SAE all mention it.
 

Alterra

New Member
Feb 26, 2023
7
2
3
I think your first problem is the Gigabyte W480M board. Gigabyte disabled error reporting in the BIOS and locked the setting so it cannot be changed. I had this board. As I understand it, Comet Lake CPUs can report ECC errors via SERR, SMI or SCI, and I verified that all of those are disabled on the W480M. Admittedly, Gigabyte does not specifically advertise ECC error reporting, but I consider that a half-assed implementation. Further, the board was horribly unstable with a W-1290P. Support was not very useful, so I tossed the board and replaced it with a Supermicro X12SAE. It may not have all the fancy features, but it works. And it has a 32-bit PCI slot. :)

The X12SAE does support error reporting via SMI (the WHEA thingy, I presume). The BIOS also exposes the ACPI EINJ table, so it can be verified to some extent. The Linux kernel will log the error after a while. However, EDAC does not work, so there will not be much detail. It looks like the last CPU generation supported by the ie31200 driver is Coffee Lake. Some updates would be needed for Comet Lake / Rocket Lake.
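
For reference, verifying through EINJ follows the steps in the kernel's APEI error injection documentation. Below is a minimal sketch that only prints the commands instead of running them. The function name `einj_plan` and the target address are placeholders, and it assumes debugfs is mounted at /sys/kernel/debug and that the kernel provides the `einj` module.

```shell
#!/usr/bin/env bash
# Sketch: print (do not execute) the EINJ steps for one correctable memory error.
EINJ_DIR=/sys/kernel/debug/apei/einj

einj_plan() {
    local addr=$1                         # physical address to target (placeholder)
    local mask=${2:-0xfffffffffffff000}   # address mask, page granularity
    cat <<EOF
modprobe einj
cat $EINJ_DIR/available_error_type
echo $addr > $EINJ_DIR/param1
echo $mask > $EINJ_DIR/param2
echo 0x8 > $EINJ_DIR/error_type
echo 1 > $EINJ_DIR/error_inject
EOF
}

# 0x8 is the "Memory Correctable" error type; the address here is only an example.
einj_plan 0x12345000
```

The printed commands need root, and the error type should be one that `available_error_type` actually lists on the board in question.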

Not sure how memtest86 gets the results on the W480M; maybe it directly reads the syndrome registers periodically.
 

BigBullion

I decided not to spend any more time on this Gigabyte W480M motherboard because of the ECC errors in MemTest86.

I have spent a total of nine months troubleshooting this issue for this $200 board, going back and forth with memory manufacturers, Intel, distributors and Gigabyte. I probably spent more money buying spare parts attempting to diagnose this issue, and paying for shipping to return supposedly defective parts, than the cost of the motherboard itself.

Maybe the cause of this issue is that this board is based on the W480 chipset. The W480 chipset was designed for Comet Lake CPUs, not the newer Rocket Lake generation, and for 2933 MHz memory rather than 3200 MHz memory. I was using a newer Rocket Lake CPU and faster 3200 MHz memory. This combination might not have been adequately tested with this board. The screenshots below support this idea:
xeonw1370passmarkresults.jpg
memtest86-result-intel-xeon-w1350.JPG
memtest86-results-intel-core-i5-10600k.jpg
Also note that their QVL does not list any ECC memory faster than 2666 MHz.

Or maybe I damaged some of the memory traces on this board. I do remember accidentally dropping screws and/or screwdrivers when working with motherboards. However, when I sent this board to Gigabyte for RMA, they did not mention any damage.

To prevent screws and/or screwdrivers from being dropped again, I switched to using ratcheting screwdrivers since I believe that ratcheting screwdrivers are less likely to be dropped. Also I switched to using hex sockets instead of Phillips head whenever possible.

In the end, Gigabyte told me that the board was "fine" because there were no uncorrectable errors from MemTest86.

gigabyte-esupport.png

Just a tidbit: I mistakenly bought a TRAY CPU from Provantage.com. Intel does not offer warranty directly to customers for TRAY CPUs, only to resellers. However, even though I bought a TRAY CPU, I was still eligible for a warranty replacement after I contacted the reseller (Provantage.com) for a replacement CPU. I sent Provantage a screenshot of a message from Intel saying that I must contact the reseller.
 

Alterra

I was under the impression that the ECC circuitry has been fused off (i.e. permanently disabled in HW) on all Comet Lake / Rocket Lake Core i3/i5/i7/i9 CPUs. That means the circuitry will never report any ECC errors, which in turn means that memtest cannot detect any (corrected) errors; it can only notice when the actual memory contents differ from the expected value. You will need a Xeon W-1200/W-1300. On a 10600K, the error syndrome registers will be spotless.

Unless I am mistaken, for ECC correction to work you need:
1) A CPU which supports ECC
2) Memory with extra chips for redundancy bits
3) Motherboard with traces for the redundancy bits
4) Firmware which enables ECC functionality

If you need reporting, the firmware also needs to somehow signal the event. Alternatively, the syndrome registers can be polled but this is less than ideal for actual usage. W480M does not implement any reporting. You will only get (silent) correction. Better than no ECC at all.

A good solution would report corrected errors to the operating system via some mechanism. The Supermicro X12SAE can at least log the errors via ACPI. Linux can and does poll the firmware for errors, but EDAC does not support Comet Lake / Rocket Lake, so the only thing that can be seen is that there was an error, like this:
Code:
[107031.430776] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[107031.430786] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[107031.430789] {1}[Hardware Error]: event severity: corrected
[107031.430792] {1}[Hardware Error]:  Error 0, type: corrected
[107031.430796] {1}[Hardware Error]:  fru_text: CorrectedErr
[107031.430799] {1}[Hardware Error]:   section_type: memory error
[107031.430802] {1}[Hardware Error]:   error_type: 8, parity error

No idea about Windows.
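
On the Linux side, since the kernel log is the only place these events show up without EDAC, a rough count can be pulled from it. This is a minimal sketch, assuming the records look like the lines above. The function name `count_corrected_events` is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: count APEI/GHES corrected-error records in kernel log text on stdin.
# Each record carries exactly one "event severity: corrected" line.
count_corrected_events() {
    grep -c 'Hardware Error\]: event severity: corrected'
}

# Example usage on a real system (may need root if dmesg is restricted):
#   dmesg | count_corrected_events
#   journalctl -k | count_corrected_events
```

This says nothing about which DIMM or address was affected; it only tracks how often the firmware reported a corrected error.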

For $200, well I don't think you can get enterprise solutions for consumer prices. I bought the same board but it turned out to be just another Gigabyte gaming board which happens to support Xeon CPUs.
 

BigBullion

Just to clear up potential confusion, the screenshot of the W480M board with the i5-10600K was taken by the Gigabyte team while they were testing my board. :)

Based on your comments, I am under the impression that boards such as the ASUS Pro WS W480/W680-ACE and the ASRock Rack W480M WS do not report errors to the operating system since they do not mention ACPI/WHEA in their specs/manual.

I think the lesson here is that if something is not explicitly mentioned as supported then assume that it's unimplemented or broken.
 

Alterra

Ah, okay, so I suppose Gigabyte support was then also using different memory modules, probably a standard set they have tested to work earlier. Not your memory modules. I admit I started wondering myself "how come the ECC corrected errors do not show up as real memory read errors if he is using 10600K?" after I hit submit.

I have to say I have no idea about Asus or ASRock boards. It would be interesting to see some table of boards and features which clearly shows what features are actually working in which board. There are plenty of comparisons on HW review sites but they are primarily about benchmarks and peripherals. Meanwhile I guess we will just have to search, ask around or ultimately try it out ourselves. It would be cool to see STH reviews address this kind of stuff in more detail.

You are right, "ECC support" can mean just about anything if not specified in more detail. This is quite a sad state of affairs, as just about the only difference between Core and Xeon W is the ECC support, so you would assume it is at least done properly.
 

heromode

Active Member
May 25, 2020
381
204
43
BigBullion said:
I have an 11th Gen Intel Xeon W-1370 Rocket Lake CPU which supports DDR4 ECC UDIMMs. The motherboard is Gigabyte W480M Vision W, with latest BIOS (version F21).

In order to enable the system to log ECC errors I have attempted to install Rasdaemon and EDAC. This is my first time installing and using Rasdaemon and EDAC.

For reference, I tried this on my Asus Pro WS C246-ACE motherboard with a Xeon E-2186G and 2x 16GB Mushkin Proline DDR4-2666 ECC UDIMMs (MPL4E266KF16G28).

This is Debian Bullseye stable, fully updated (5.10.0-21-amd64)

Code:
# apt install rasdaemon edac-utils

# systemctl status rasdaemon.service
● rasdaemon.service - RAS daemon to log the RAS events
     Loaded: loaded (/lib/systemd/system/rasdaemon.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2023-03-08 02:59:51 EET; 10min ago
   Main PID: 24264 (rasdaemon)
      Tasks: 1 (limit: 37040)
     Memory: 9.4M
        CPU: 26ms
     CGroup: /system.slice/rasdaemon.service
             └─24264 /usr/sbin/rasdaemon -f -r

Mar 08 02:59:51 c246 rasdaemon[24264]: rasdaemon: Enabled event ras:extlog_mem_event
Mar 08 02:59:51 c246 rasdaemon[24264]: rasdaemon: Listening to events for cpus 0 to 11
Mar 08 02:59:51 c246 rasdaemon[24264]: Enabled event mce:mce_record
Mar 08 02:59:51 c246 rasdaemon[24264]: ras:extlog_mem_event event enabled
Mar 08 02:59:51 c246 rasdaemon[24264]: Enabled event ras:extlog_mem_event
Mar 08 02:59:51 c246 rasdaemon[24264]: rasdaemon: Recording mc_event events
Mar 08 02:59:51 c246 rasdaemon[24264]: rasdaemon: Recording aer_event events
Mar 08 02:59:51 c246 rasdaemon[24264]: rasdaemon: Recording extlog_event events
Mar 08 02:59:51 c246 rasdaemon[24264]: rasdaemon: Recording mce_record events
Mar 08 02:59:51 c246 rasdaemon[24264]: rasdaemon: Recording arm_event events

# ras-mc-ctl --error-count
Label                   CE      UE
mc#0csrow#0channel#1    0       0
mc#0csrow#1channel#1    0       0
mc#0csrow#1channel#0    0       0
mc#0csrow#0channel#0    0       0

# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: mc#0csrow#0channel#0: 0 Corrected Errors
mc0: csrow0: mc#0csrow#0channel#1: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: mc#0csrow#1channel#0: 0 Corrected Errors
mc0: csrow1: mc#0csrow#1channel#1: 0 Corrected Errors
edac-util: No errors to report.

# ras-mc-ctl --status
ras-mc-ctl: drivers are loaded.

# ras-mc-ctl --print-labels
ras-mc-ctl: Error: No dimm labels for ASUSTeK COMPUTER INC. model Pro WS C246-ACE

# lsmod | grep edac
ie31200_edac           16384  0

# ras-mc-ctl --summary
No Memory errors.

No PCIe AER errors.

No Extlog errors.

DBD::SQLite::db prepare failed: no such table: devlink_event at /usr/sbin/ras-mc-ctl line 1183.
Can't call method "execute" on an undefined value at /usr/sbin/ras-mc-ctl line 1184.

Just posted this for reference; please say if you need me to test something.
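
The counters that `edac-util` and `ras-mc-ctl` print come from the EDAC sysfs tree, so they can also be read directly. This is a minimal sketch, assuming the standard /sys/devices/system/edac/mc layout. The function name `edac_counts` and its optional directory argument are only there for illustration and testing.

```shell
#!/usr/bin/env bash
# Sketch: print corrected (CE) and uncorrected (UE) totals per memory controller.
edac_counts() {
    local base=${1:-/sys/devices/system/edac/mc}
    local mc found=0
    for mc in "$base"/mc[0-9]*; do
        # Skip when the glob matched nothing or the counter is unreadable.
        [ -r "$mc/ce_count" ] || continue
        found=1
        echo "$(basename "$mc"): CE=$(cat "$mc/ce_count") UE=$(cat "$mc/ue_count")"
    done
    [ "$found" -eq 1 ] || echo "no EDAC memory controllers registered under $base"
}

edac_counts
```

On a board where no EDAC driver binds, this prints the "no EDAC memory controllers" line, which matches the "drivers not loaded" case.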
 

heromode

Well, if you don't mind, it would be interesting to learn if / how that board can report correctable errors. Linux instructions for injecting errors here: APEI Error INJection — The Linux Kernel documentation
At a quick glance, dmesg does not show anything like
Code:
ACPI: EINJ 0x000000007370A000 000150 (v01 INTEL           00000001 INTL 00000001)

Also, /sys/firmware/acpi/tables does not contain an EINJ file.
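
The same check can cover the other APEI tables in one go. This is a minimal sketch, assuming the tables appear as files named after their ACPI signatures under /sys/firmware/acpi/tables. The function name `apei_tables_present` is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: report which APEI-related ACPI tables the firmware exposes.
# HEST = error sources, ERST = error record storage,
# BERT = boot error record, EINJ = error injection.
apei_tables_present() {
    local dir=${1:-/sys/firmware/acpi/tables}
    local t
    for t in HEST ERST BERT EINJ; do
        if [ -e "$dir/$t" ]; then
            echo "$t: present"
        else
            echo "$t: missing"
        fi
    done
}

apei_tables_present
```

A missing EINJ table only rules out injection testing; whether corrected errors get reported depends on the error sources the firmware describes, so HEST is the more telling one.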

From memory, I'm pretty sure there isn't any WHEA option in my BIOS. I'll look next time I boot for any WHEA, ACPI5 or error injection related settings.

Here is my mobo manual if interested: https://dlcdnets.asus.com/pub/ASUS/mb/LGA1151/Pro_WS_C246-ACE/E15411_Pro_WS_C246-ACE_UM_WEB.pdf
 

ritcgab

New Member
Mar 17, 2023
1
2
3
Registered to reply.

I think it's just that the kernel-level EDAC drivers for recent Intel Xeon CPUs are not in mainline.

First of all, to get the DRAM info reported by your motherboard, you can use dmidecode -t Memory (as root), and in the output you should find something like this (assuming DDR4 ECC UDIMMs):

Code:
Total Width: 72 bits
Data Width: 64 bits

If the total width is 72 bits, the DIMM and the CPU memory controller are wired correctly, which suggests ECC is enabled.
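
The width comparison can be scripted. This is a minimal sketch, assuming the usual `dmidecode -t memory` layout where each device lists "Total Width" before "Data Width". The function name `ecc_width_report` is made up, and the result only reflects what the SMBIOS tables claim, not whether errors are actually reported.

```shell
#!/usr/bin/env bash
# Sketch: read "dmidecode -t memory" output on stdin and flag each memory
# device whose total width exceeds its data width (e.g. 72 vs 64 bits).
ecc_width_report() {
    awk -F': *' '
        /^[[:space:]]*Total Width:/ { total = $2 + 0 }   # "72 bits" -> 72, "Unknown" -> 0
        /^[[:space:]]*Data Width:/  {
            data = $2 + 0
            if (total > data)
                print "ECC-width DIMM (" total "/" data " bits)"
            else
                print "non-ECC or empty slot (" total "/" data " bits)"
        }
    '
}

# Example usage on a real system:
#   sudo dmidecode -t memory | ecc_width_report
```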

To get proper error reporting, the kernel needs the EDAC driver that matches the CPU. However, I believe Intel never mainlined their recent drivers. Your CPU, the Xeon W-1370, is Rocket Lake, but it seems that even the Comet Lake driver is not mainlined, let alone Rocket Lake.


Not a single recent commit added support for newer CPUs. Looks like they just care about Xeon CPUs up to Coffee Lake, or their latest Sapphire Rapids (see the i10nm EDAC driver).

I owned a Dell mobile workstation with a Xeon W-11955M CPU (Tiger Lake). I put in ECC UDIMMs and confirmed that the extra ECC lanes are wired to the CPU, but there was no kernel-level support. Their driver for this generation only resides in their own kernel (linux-intel-lts/ieh_edac.c at 6.1/linux · intel/linux-intel-lts) and they never bothered to mainline it.

Now I use Ryzen desktop CPUs.