RAM error


Bjorn Smith

Well-Known Member
Sep 3, 2019
876
481
63
49
r00t.dk
Hi,

My ZFS scrub failed with checksum errors, so I decided to run a memtest. Not surprisingly, the test fails, but I am unsure whether the error addresses mean that more than one RAM module is dead.

It's a Supermicro X11SLL-F board with 4x16GB ECC UDIMM modules:

1657448596428.png

If the lowest and highest error addresses had stayed close together, I would have assumed it was just one module. But since they are 35GB apart, it seems like more than one module is bad. Is that conclusion flawed?

Could it also be the board that is at fault?

I think it's unlikely that two RAM modules go bad at the exact same time.

Are my conclusions flawed, and how do I best test whether it's the RAM or the board?
 

i386

Well-Known Member
Mar 18, 2016
4,218
1,540
113
34
Germany
I would have assumed it was just one module - but since its 35GB's apart
It could be the same module. The memory controller inside the CPU maps the physical DIMMs into the system's physical address space, so software doesn't have to do it (this makes low-level/OS programming easier). Because of that mapping, addresses that are far apart can still belong to one DIMM.

Did you try reseating the RAM/CPU?
 

RolloZ170

Well-Known Member
Apr 24, 2016
5,143
1,546
113
It can be one module because of interleaving and other memory mapping. You have to test the DIMMs individually, one at a time.
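To illustrate the interleaving point, here is a toy model, not the real mapping of any particular CPU. It assumes consecutive 64-byte blocks simply rotate across four modules; `DIMMS`, `GRANULE`, and `dimm_of` are made-up names for this sketch. Real memory controllers interleave across channels, ranks, and banks and may hash several address bits, so treat this only as a picture of why far-apart addresses say little about which module is affected.

```shell
#!/bin/sh
# Toy interleaving model (assumption, not a real controller mapping):
# consecutive GRANULE-sized blocks rotate across DIMMS modules.
DIMMS=4       # hypothetical: four modules, all interleaved together
GRANULE=64    # hypothetical: interleave step of one 64-byte cache line

# Print the index of the module that holds the given physical address.
dimm_of() {
    echo $(( ($1 / GRANULE) % DIMMS ))
}

# Two addresses 35 GiB apart, plus the block right after the first one.
low=$(( 1 * 1024 * 1024 * 1024 ))
high=$(( low + 35 * 1024 * 1024 * 1024 ))
echo "low        -> module $(dimm_of "$low")"
echo "low + 64 B -> module $(dimm_of $(( low + GRANULE )))"
echo "high       -> module $(dimm_of "$high")"
```

In this model the two far-apart addresses land on the same module, while the immediately neighbouring block lands on a different one, which is why testing the DIMMs one at a time is the reliable way to find the culprit.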
 

Bjorn Smith

It could be the same module. the memory controller inside the cpu maps the physical dimms to virtual addresses so that the cpu doesn't have to do it (makes low level/os programming easier)

Did you try to reseat the ram/cpu?
OK, it would be nice if it's just a single module, since that should rule out the board. And no, I haven't tried reseating the memory yet, but I can do that.

can be one module because of interleaving and other memory mapping stuff. yoou have to test the DIMMs one by one single.
Thanks. I will reseat all the memory and run a test, and if it still fails, test one module at a time.
I will report back with findings :)
 

Bjorn Smith

Strange: I reseated all the RAM and got the same error.
I have now tested all 4 RAM modules individually, and two modules show errors.
I will now test in a different RAM slot, just to rule out the board. But it's really strange; I have never experienced two RAM modules going bad at exactly the same time.

Edit:
Tested in different slots and a different memory channel: the two "bad" modules still show errors, whereas the two "good" ones do not. So, although unlikely, it seems I had two RAM modules go bad out of the blue. That is crazy.

Unfortunately I do not have another board that takes ECC UDIMM, so I cannot test the RAM in another machine. That would have been nice, just to be 100% sure that it's not a combination of factors that makes the RAM fail in that motherboard/CPU combo.

Any other suggestions? Could a power supply issue cause this? Unlikely, right?

Edit2:
I am baffled. I have literally _NEVER_ had any RAM go bad on me since the end of the 1980s. I have bought bad RAM that did not work from the beginning, but I have never experienced RAM going bad after a long time in use. But I guess I have been lucky :)
 

Stephan

Well-Known Member
Apr 21, 2017
920
697
93
Germany
I sent you something.

Also, try stress-ng with rasdaemon running on a recent Linux that has ~25-50% of RAM as swap. The swap is there to counter the kernel's OOM killer, and it can come from a partition, a swapfile, or zRAM. I personally like to use zRAM sized at 1/32 (1/one DIMM size) of total RAM, set up by the package systemd-swap, like so:

/etc/systemd/swap.conf.d/swap.conf
zswap_enabled=0
zram_enabled=1
zram_count=1
zram_streams=1
zram_size=$(( RAM_SIZE / 32 ))
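As a quick sanity check of what that last line works out to, the sketch below evaluates the same division for a given RAM size. It assumes `RAM_SIZE` is a byte count; `zram_size_for` is a hypothetical helper for illustration and not part of systemd-swap.

```shell
#!/bin/sh
# Evaluate the same expression as zram_size=$(( RAM_SIZE / 32 )).
# Assumption: the RAM size is given in bytes. zram_size_for is a
# hypothetical helper, not something systemd-swap provides.
zram_size_for() {
    echo $(( $1 / 32 ))
}

# Example: a machine with 4x16 GiB = 64 GiB of RAM.
ram=$(( 64 * 1024 * 1024 * 1024 ))
echo "zram size: $(zram_size_for "$ram") bytes"
```

For 64 GiB of RAM this comes out at 2 GiB of zRAM.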
The watcher:
Bash:
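#!/bin/sh
# Restart rasdaemon with an empty event database so earlier errors do not skew the counts.
systemctl stop rasdaemon
rm -f /var/lib/rasdaemon/ras-mc_event.db
systemctl start rasdaemon

# Every 5 seconds, show the rasdaemon summary, the error counts, memory usage,
# and the last 20 journal lines with routine service noise filtered out.
exec watch -n 5 \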
    "ras-mc-ctl --summary | \
    grep -v '^$'; echo \"\"; ras-mc-ctl --error-count; echo \"\"; free -h ; echo \"\"; \
    journalctl -b -n 500 | \
    grep -Ev \"( (systemd|systemd-logind|smbd|dbus-daemon|systemd-networkd|polkitd|sshd)\\[[0-9]+\\]: )|kernel: (cdc_ether|usb) \" | \
    tail -n20"
The stressor:
Bash:
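#!/bin/sh
# One VM worker per CPU, verifying what it wrote, for 4 hours at low priority.
# --tz and --perf add thermal-zone and perf-counter reports at the end.
exec nice stress-ng --vm "$(nproc)" --vm-bytes 86% --vm-keep --vm-populate --vm-madvise willneed --verify -v -t 4h --tz --perf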
Why 86%? I figured that one out empirically: it is the RAM usage up to which the kernel's out-of-memory killer remains inactive and does not kill stress-ng's processes due to low available memory. I wanted to let stress-ng use as much memory as possible.

When I bought the used Samsung RDIMMs, I found two slightly faulty DIMMs. So I tried every memtest version on the planet to see how they fare with ECC, meaning cases where ECC is still able to correct all errors but the DIMM is clearly troubled. Microsoft memdiag showed nothing: worthless. Freeware memtest86+ showed nothing: worthless. PassMark memtest86 reported DIMM ECC errors, so it was the best of the bunch, but it did not report other errors that were not clearly attributable to a certain DIMM. Only rasdaemon on a modern live Linux with stress-ng could give me such deep insight. This was on a Xeon Platinum with a C621 chipset; lower Xeons might not have enough instrumentation on the chip:

Untitled.png
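If rasdaemon is not available, the kernel's EDAC driver exposes raw counters in sysfs that can be read directly. The following is a minimal sketch, assuming an EDAC driver is loaded for the platform and that the per-controller `ce_count` and `ue_count` files are present under `/sys/devices/system/edac/mc`. `edac_totals` and the `EDAC_ROOT` override are names made up for this example.

```shell
#!/bin/sh
# Sum corrected (CE) and uncorrected (UE) error counts over all memory
# controllers exposed by the kernel's EDAC driver.
# EDAC_ROOT can be overridden so the function can be tried on a test directory.
EDAC_ROOT="${EDAC_ROOT:-/sys/devices/system/edac/mc}"

edac_totals() {
    ce=0
    ue=0
    for mc in "$EDAC_ROOT"/mc*; do
        # Skip anything that is not a controller directory with both counters.
        [ -r "$mc/ce_count" ] && [ -r "$mc/ue_count" ] || continue
        ce=$(( ce + $(cat "$mc/ce_count") ))
        ue=$(( ue + $(cat "$mc/ue_count") ))
    done
    echo "CE=$ce UE=$ue"
}

edac_totals
```

A CE total that keeps rising during a stress run points at a troubled DIMM even while ECC is still correcting everything. On a machine without a loaded EDAC driver the function simply reports zero for both.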
 

Bjorn Smith

I sent you something.
Thank you, I will try that version of memtest. But I think the two modules are just bad; even if your version of memtest said otherwise, I would not trust them.
But I can run your version of memtest on the remaining two modules and see if they exhibit any errors. The previous memtest returned errors within a couple of minutes on the bad modules, whereas the remaining two have now run for 40 minutes without errors. I will of course let it run longer :) before deciding whether or not to trust them.
 

i386

So although unlikely it seems like I out of the blue had two ram modules go bad. That is crazy.
I had two Samsung DIMMs that stopped working after a reboot. Before that everything worked fine in Windows: nothing crashed, no bluescreens...