Correctable ECC@DIMM Errors & Sensor Data Displays

caveat lector

New Member
Jan 4, 2014
Oregon, USA
A couple of weeks ago I assembled two identical servers. Each has the following:

Supermicro X10SLM+-LN4F Motherboard

Intel Xeon E3 1230 V3 3.3G 4C 8T 8M CPU

4x Kingston KVR16LE11/8EF Memory

They are both "headless" and being run from a networked Windows workstation via IPMI View and the Java iKVM Viewer.

Ubuntu 13.10 Server was installed normally on both servers and has run without apparent errors during configuration and testing. Even so, one server has exhibited a couple of anomalies:

Anomaly 1:

Memtest86 was started on both servers at about midnight. By this morning, two passes had completed with ECC off, and Memtest86 found no errors. However, while those tests were running, the following error records appeared in the IPMI View "System Event Log" for one of the servers (times are UTC):

1,System Event,03/01/2014 09:18:21 Sat,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMA1(CPU1)

2,System Event,03/01/2014 09:18:23 Sat,Memory,,Assertion: Memory| Event = Correctable ECC@DIMMB1(CPU1)

Notice they occurred two seconds apart, but one was in DIMMA1 and the other was in DIMMB1.

Question 1: Why did Memtest86 not find those errors while testing with ECC off? With error correction disabled, the data read back should have differed from the data that was written.

Question 2: Why would two unassociated errors happen to occur two seconds apart in different DIMMs during a many-hour test? That is possible, of course, but it seems more likely that they are associated.

Question 3: Have you seen memory errors in the IPMI View "System Event Log" during memory testing that were not logged by Memtest86?
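
For reference: once Linux is booted, the kernel's EDAC subsystem can report correctable-error counts per memory controller through sysfs, which gives a data point independent of both Memtest86 and the BMC's event log. Here is a minimal sketch, assuming an EDAC driver for this memory controller is loaded and the standard /sys/devices/system/edac/mc layout; whether the stock Ubuntu 13.10 kernel ships a driver for this particular chipset is something I haven't verified.

```python
#!/usr/bin/env python
"""Print correctable/uncorrectable ECC counts from the Linux EDAC sysfs tree.

Assumes an EDAC driver for the memory controller is loaded and the
standard /sys/devices/system/edac/mc layout is present.
"""
import glob
import os

EDAC_ROOT = "/sys/devices/system/edac/mc"

def read_count(path):
    """Return an integer counter from sysfs, or None if it isn't exposed."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, IOError, ValueError):
        return None

for mc in sorted(glob.glob(os.path.join(EDAC_ROOT, "mc[0-9]*"))):
    ce = read_count(os.path.join(mc, "ce_count"))
    ue = read_count(os.path.join(mc, "ue_count"))
    print("{0}: correctable={1} uncorrectable={2}".format(
        os.path.basename(mc), ce, ue))
    # Per-csrow breakdown, where the driver provides it, to localize the DIMM
    for csrow in sorted(glob.glob(os.path.join(mc, "csrow[0-9]*"))):
        print("  {0}: correctable={1}".format(
            os.path.basename(csrow),
            read_count(os.path.join(csrow, "ce_count"))))
```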

Anomaly 2:

IPMI View always displays voltage, temperature, and fan data for one of the servers, regardless of whether the iKVM Console is running, but not for the other. That data is displayed for both servers immediately after logging into them via IPMI. However, the data display for one server (the one that logged the memory errors) disappears after the KVM Console is launched.

Question 4: Have you seen that happen or have any idea of cause?
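
In case it helps isolate it: the same sensor readings can be pulled straight from the BMC with ipmitool, bypassing IPMI View entirely, which would at least show whether the board or the viewer is at fault. A rough sketch (the host address and credentials are placeholders; it assumes ipmitool is installed and IPMI-over-LAN is enabled on the BMC):

```python
#!/usr/bin/env python
"""Poll BMC sensor readings via ipmitool, independent of IPMI View.

The host and credentials below are placeholders; assumes ipmitool is
installed and IPMI-over-LAN is enabled on the BMC.
"""
import subprocess

BMC_HOST = "192.168.1.100"  # placeholder BMC address
BMC_USER = "ADMIN"          # placeholder credentials
BMC_PASS = "ADMIN"

def ipmi(*args):
    """Run an ipmitool command against the BMC and return its output."""
    cmd = ["ipmitool", "-I", "lanplus", "-H", BMC_HOST,
           "-U", BMC_USER, "-P", BMC_PASS] + list(args)
    return subprocess.check_output(cmd, universal_newlines=True)

# Voltage, temperature, and fan readings straight from the BMC's SDR
print(ipmi("sdr", "elist"))

# The same System Event Log that IPMI View displays
print(ipmi("sel", "elist"))
```

If ipmitool keeps returning sensor data while IPMI View's display is blank, the problem would be in the viewer rather than the board.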
 
caveat lector

New Member
Jan 4, 2014
Oregon, USA
No more memory errors were logged during a couple of days of additional testing. It still seems strange that Memtest86 didn't log errors that appeared in IPMI View while Memtest86 was running. That doesn't inspire confidence in Memtest86.

The voltage, temperature, and fan data display issue turned out to be a software bug in IPMI View V2.9.28. I later found that the data also failed to display under various other conditions, and eventually that the real problem is a malfunctioning "Hide inactive item" option on the "Sensors" panel. If that option is selected at a time when there is no active data, the data will not be displayed when it becomes active. It is a trivial issue, except that making that selection (especially if the IPMI View configuration is subsequently saved) can make a motherboard seem to be malfunctioning.
 

OBasel

Active Member
Dec 28, 2010
That's really bizarre. Did those errors occur about 9 hours into Memtest86+, then?

Which version of Memtest86 are you using? I remember there was a bug in older versions that would not show ECC status.
 

Aluminum

Active Member
Sep 7, 2012
To be honest, a pair of time-correlated errors in different modules sounds like a random background-radiation event from a cosmic ray or alpha particle or whatever. It doesn't really fit the electrical-fault or defective-DIMM model.
Depending on how the board enumerates its slots, those were even in different memory channels. The only better detector would be a second computer, and usually they are too far apart to catch the same event.

They happen as a part of nature, and the ECC did its job. This is why, if you really care about your data and it runs through RAM all day long (e.g. ZFS), ECC is pretty much a requirement.
 

caveat lector

New Member
Jan 4, 2014
Oregon, USA
That's really bizarre. Did those errors occur about 9 hours into Memtest86+, then?
No; because of the time difference from UTC, the errors actually occurred about an hour into Memtest86+.

Which version of Memtest86 are you using? I remember there was a bug in older versions that would not show ECC status.
It was the Memtest86+ 4.20-1.1ubuntu5 package included with the Ubuntu 13.10 Server installation.
 

caveat lector

New Member
Jan 4, 2014
Oregon, USA
To be honest, a pair of time-correlated errors in different modules sounds like a random background-radiation event from a cosmic ray or alpha particle or whatever. It doesn't really fit the electrical-fault or defective-DIMM model.
Depending on how the board enumerates its slots, those were even in different memory channels. The only better detector would be a second computer, and usually they are too far apart to catch the same event.
A single particle would have had to have a trajectory that passed through two chips, but that of course could have happened. Damage to data in the two chips would have occurred almost instantly, but there easily could have been a two-second difference between the times when the software discovered the two errors.

They happen as a part of nature, and the ECC did its job. This is why, if you really care about your data and it runs through RAM all day long (e.g. ZFS), ECC is pretty much a requirement.
Yes, or btrfs after there has been a stable release. This also demonstrates why anyone who cares about the integrity of their data should be using ECC memory.
 

Aluminum

Active Member
Sep 7, 2012
A single particle would have had to have a trajectory that passed through two chips, but that of course could have happened. Damage to data in the two chips would have occurred almost instantly, but there easily could have been a two-second difference between the times when the software discovered the two errors.
It could also easily have been a cascade of particles with varying decays, or something that built up a local charge; cosmic-ray events and the like generate all sorts of interesting stuff (what makes it to ground level is rarely the original particle). The flux could have increased for a few seconds for whatever reason, with only two cells in the right conditions to flip.

Mainly, though, two RAM cells in different chips on different modules on separate memory channels have very little in common; their capacitance, voltage, and reference ground are only loosely linked. It just doesn't fit an internal hardware fault at all.

Oh yeah, it could have happened within the IMC itself or on the way to it, not necessarily on the DIMMs, if I'm not mistaken about where the ECC logic is actually executed. (A faulty IMC is also unlikely, or the computer would be a crash magnet.)