Here is an update on testing NEMIX Supermicro-compatible RAM, which is available on Newegg and Amazon with a return policy and a lifetime warranty claim (16GB DDR4 RDIMM ECC 3200, MEM-DR46LD-ER32). For this review, I am using three H11DSi motherboards, two with 2x EPYC 7F52 and one with 2x EPYC 7302, and I plan on expanding this cluster of machines. The BIOS version is 2.1 on all boards except one of the 7F52 boards, which runs 2.4.
Rasdaemon and stress-ng
I ran rasdaemon with stress-ng, using the bash scripts provided by Stephen in the post referenced above, on 82% of memory for 24 hours on each of 3 compute nodes (each node has 256GB). I may be able to go higher than 82% but need more time to explore this. I set the “ncore” variable in Stephen’s bash script to 30 on these 32-core machines. By capacity, this approach tests the equivalent of 13.12 of the 16 sticks on each node. Rasdaemon reported no memory errors, and the CE (corrected errors) and UE (uncorrected errors) columns produced by Stephen’s watcher script showed all zeros (i.e. no errors).
Memtest86 (free)
I ran Memtest86 (free) on an H11DSi board with 2x 7F52 and BIOS version 2.4, using the default setup for a single core. This test ran for over 35 hours and made it past test 1 and 50% of the way through test 2 with no errors reported. It then became very slow, advancing only 2% over 5 hours, so I stopped it. Memtest86 (free) froze at ~1.5 hours on the other two boards, which run BIOS 2.1. These slowness and freezing issues are well documented for Supermicro boards (see link above). Resolving them will require investing time in the Pro version and working with PassMark tech support.
memtester
I ran a single pass of memtester (memtester 200G 1) on each node with no errors reported.
Results Summary
Given that the rasdaemon/stress-ng approach is considered the best immediately available tool by several posters for identifying memory problems under load, I think one could conclude that my 3 batches of NEMIX Supermicro compatible RAM show no evidence of poor quality.
It is hard to find this RAM in bulk at reasonable prices with a lifetime warranty claim. Supermicro is out of stock of tested name-brand 16GB sticks. Similar name-brand Hynix RAM is available on eBay in large quantities from tm_space for $63 per stick with no lifetime warranty claim. However, the price from NEMIX is currently $35 per stick on Amazon and Newegg, and large batches can be purchased. OWC also offers equivalent RAM on Amazon for $39 per stick, available in large batches. If OWC provides better customer service, this is a path to consider (I may try a couple of these sticks). Several other sites that offer name-brand RAM sticks do not have enough supply to satisfy my use case (if you find one, please share).
I see no evidence indicating I should pay double to more than triple for name-brand RAM for my use case. But by all means one should be ready to go with tools like memtest86, rasdaemon and stress-ng, and be prepared to exercise return and warranty policies. If I were an IT professional whose job it was to manage multiple machines for angry mobs of office workers, I would definitely go for name-brand RAM for peace of mind. However, for a small experimental cluster that requires lots of RAM sticks, has a narrow application focus and is used by a small team of scientists comfortable with testing and replacing hardware, I think the cheaper stuff should be considered with eyes wide open, as it may be a key factor in reaching objectives given time and budget constraints.
Moving forward, with any RAM, I plan on using Stephen's scripts with rasdaemon and stress-ng. This will make me aware of memory errors during real-life applications so I can take action when necessary. You can map the rasdaemon EDAC paths to physical DIMM names using the following link, which will make it easier to identify a bad stick:
Monitoring ECC memory on Linux with rasdaemon
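As a rough illustration of what the per-DIMM counters look like once labels are set up, here is a small sketch that reads the kernel's EDAC sysfs entries directly. It assumes the usual layout under /sys/devices/system/edac/mc (mcN/dimmN directories containing dimm_label, dimm_ce_count and dimm_ue_count), which may differ depending on kernel version and EDAC driver. The function names are hypothetical, and this is not Stephen's watcher script.

```python
from pathlib import Path

# Assumed location of the EDAC memory-controller entries in sysfs.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_int(path):
    """Read an integer counter file; return None if it is missing or unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def read_edac_counts(root=EDAC_ROOT):
    """Collect label, corrected (CE) and uncorrected (UE) counts for each DIMM entry."""
    rows = []
    for mc in sorted(root.glob("mc[0-9]*")):
        for dimm in sorted(mc.glob("dimm[0-9]*")):
            label_file = dimm / "dimm_label"
            label = label_file.read_text().strip() if label_file.exists() else ""
            rows.append({
                "mc": mc.name,
                "dimm": dimm.name,
                "label": label,
                "ce": read_int(dimm / "dimm_ce_count"),
                "ue": read_int(dimm / "dimm_ue_count"),
            })
    return rows


if __name__ == "__main__":
    for row in read_edac_counts():
        print(f"{row['mc']}/{row['dimm']} [{row['label']}] CE={row['ce']} UE={row['ue']}")
```

Until the labels have been registered for the board, the label field will likely be empty or generic, which is why the physical mapping step described in the link is worth doing.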
I plan on setting up these configuration files for all machines, which requires systematically removing and reinstalling DIMMs (annoying, but I think useful in the long run). I may get MemTest86 Pro but am not yet seeing its advantage over rasdaemon and stress-ng for my use case.
I will add additional updates to this post if I encounter any errors or performance issues. Thank you RolloZ170, Stephen and alex_stief for educating me on approaches for testing RAM.