CPU cache tester


gtech1

Member
May 27, 2019
79
7
8
After a very long app debugging process, the developers are 99.9% sure one of my E5-2620v3 CPUs has a faulty L1/L2/L3 cache. I only tested using memtest86+ Pro and it found no errors, but I understand that memtest disables all caches and just tests the RAM.

Is there an app to test the CPU caches?
 

MBastian

Active Member
Jul 17, 2016
205
59
28
Düsseldorf, Germany
At least with Linux you should see "[Hardware Error]" messages in journald/dmesg. I am not sure how such a thing could happen without other user or kernel processes barfing out once in a while. Windows should have something similar in its system messages.
With Windows, the cache subtest of the AIDA64 system stability test might help. With Linux you could try the stress-ng suite; it has some cache thrashing stuff.

Did you pin the app's process(es) to all CPU cores of one physical chip or just to one core? L1/L2 is per core, L3 is per CPU. I suspect you pinned them to all cores of one CPU. You could try to narrow it down: see whether it happens on only one half of the cores of one chip (be sure to either disable HT or select the right sibling threads). If it does, the L3 cache is not the culprit.
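
For what it's worth, the per-core narrowing (pinning, or clamping, to one core at a time) can be roughed out in a few lines of Python. This is only a sketch under assumptions, not a proper cache tester: the names `crc_mismatches` and `check_each_core`, the 128 KiB buffer size and the round count are made up for illustration, and `os.sched_setaffinity` may not be available on every platform (it is on Linux; on FreeBSD the pinning would probably have to be done from outside, e.g. with cpuset). Python also cannot guarantee that the buffer really stays in cache, so a clean run proves very little. Only a core that reports mismatches is interesting.

```python
import os
import zlib


def crc_mismatches(buf, rounds):
    """Recompute the CRC32 of buf `rounds` times and count how many
    results differ from a reference computed once up front."""
    reference = zlib.crc32(buf)
    return sum(1 for _ in range(rounds) if zlib.crc32(buf) != reference)


def check_each_core(buf_size=128 * 1024, rounds=2000):
    """Pin this process to one logical CPU at a time and run the CRC
    loop there. Returns {cpu_number: mismatch_count}.

    buf_size and rounds are arbitrary illustrative defaults.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise RuntimeError("os.sched_setaffinity is not available here")
    original = os.sched_getaffinity(0)  # CPUs we are allowed to run on
    buf = os.urandom(buf_size)          # random data, same for every core
    results = {}
    try:
        for cpu in sorted(original):
            os.sched_setaffinity(0, {cpu})  # run only on this logical CPU
            results[cpu] = crc_mismatches(buf, rounds)
    finally:
        os.sched_setaffinity(0, original)   # always restore the old mask
    return results


if __name__ == "__main__":
    for cpu, bad in check_each_core().items():
        print(f"cpu {cpu}: {bad} mismatching CRCs")
```

Note that the reference CRC is computed on whichever core is being tested, so this only catches results that are inconsistent from one run to the next. With a fault that shows up about once per day, it would likely need to loop for a very long time before catching anything.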
 

i386

Well-Known Member
Mar 18, 2016
4,245
1,546
113
34
Germany
I think a corrupt CPU cache would cause system crashes and corrupt data all the time...
 

gtech1

i386 said: "I think a corrupt CPU cache would cause system crashes and corrupt data all the time..."
This server runs only a single app, on FreeBSD 12.2. Six other servers with similar hardware specs, also on FreeBSD 12.2 and running the same app, run just fine.

The app basically stores files to disk and writes their CRC at the same time.

About once per day, an invalid CRC gets calculated. About once per week, the app coredumps.

Coredump analysis by the app developer shows a random memory jump that shouldn't happen, but the memory contents are fine. The app just jumps unexpectedly to a random location.

I am replacing the server entirely but I want to be able to test it after I take it off production and identify the issue better.

Before putting it in production I just tested with Memtest86+ pro which showed no errors after 4 passes.
 

cesmith9999

Well-Known Member
Mar 26, 2013
1,421
470
83
A long, long time ago, when I was at MS working on NT 3.51, we had reports of random crashes on an MVP's server. This was in the mid 90's, when cache chips were external to the CPU. As we progressed in the development of 3.51, the crashes became more frequent and more random...

We got the MVP to ship us his system. We had 3 days to figure out what was wrong. We swapped out EVERYTHING in his system: RAM, video card, SCSI adapter. I could get it to crash reliably at this point, so I set up the floppy boot disk for debug...

We got one of the memory manager devs to look at the issue... After about 1 hour he came back and said: "This 1 bit does not hold state..."

My boss was perplexed. What we did not replace was the cache ram. So he went and bought 1 cache ram chip. From that point forward the system worked flawlessly.

My boss hung the bad cache ram chip in the window to his office with this statement.

"This is the most expensive RAM chip that MS has ever purchased."

Chris

Chances are the CPU is bad...
 

gtech1

cesmith9999 said: "Chances are the CPU is bad..."
Yep, that's what I figure, even though there are no such warnings in the OS or IPMI logs. What stress test can I use to single this out?