Did I diagnose a hardware problem correctly on a Threadripper?

LenE

New Member
Jan 29, 2020
26
4
3
Wiser and more experienced gurus of the STH community, please help me to sanity check my method and diagnosis of a bad core, before I initiate a warranty claim. I come from a Mac background where I am used to having hardware diagnostic tools embedded into ROMs, but have not found similar tools for Linux for diagnosing potential hardware problems.

Back in April, I built a deep learning workstation around an AMD Threadripper 3960X and an NVIDIA RTX 2080 Ti. I used an ASRock TRX40 Creator motherboard (flashed with the latest BIOS), 128 GB of Corsair RAM (4 x 32 GB), and a 1 TB Sabrent PCIe 4 NVMe SSD. I am currently powering the system with a 1000W 80 Plus Titanium power supply from Seasonic. The CPU is cooled by a custom loop with two 360mm radiators. I have run everything at the default BIOS auto settings with absolutely no attempt at overclocking anything.

I have tried to run Ubuntu 18.04, 19.10, and 20.04 on this system, and all were somewhat unstable. On each of them I had trouble getting the NVIDIA drivers and the CUDA toolkit installed and usable. The big hangup was the near impossibility of getting the driver to build a kernel module that would work with whichever kernel was installed. I could get the NVIDIA stack to play nice with the HWE kernel (5.3.0), but I could not get it to work at all with a mainline kernel that should fully support the TRX40 features (5.4.7).

Eventually, I got the NVIDIA stack working (it is a complete mystery what fixed the problem), so I thought I’d do some Folding@Home to burn in the system. When I could get work units, the GPU jobs ran well enough, but the default CPU slot was a total failure. I tried splitting up the cores into several 6-core CPU slots, and this mostly worked, but I kept having failures on various slots. My initial thoughts were that I might have a RAM problem, or that the relative newness of the TRX40 chipset might be contributing to the issues (I had to pass mce=off to the kernel to install or boot).

I ran MemTest86 for a day and a half, and had zero issues reported. RAM didn’t appear to be the issue. My next course was to try to configure and compile a kernel that would have support for the hardware.

I started by trying to use all cores and threads (make -j48). That failed outright with various internal compiler errors and segmentation faults. I reduced the job count until I was down to a single-threaded compile. That still failed, but I noticed in the system monitor that the single thread was being moved around among all of the cores. That got me thinking that I may have a bad core, but which one?

After some googling, I found a method to take individual cores offline (echo 0 | sudo tee /sys/devices/system/cpu/cpuXX/online). To make this simpler, I disabled SMT and PBO in the BIOS, and went to work testing systematically by compiling the kernel with as many CPUs as I had enabled. I started at the last core, disabling core 23. I got compiler errors. I cleaned the kernel tree and tried to build again: same errors, but at different places.
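The sysfs toggle above can be wrapped in a small helper. This is only a minimal sketch: the function name `set_cpu_online` and its `sysfs_root` parameter are made up here for illustration (the parameter exists only so the function can be tried against a scratch directory), and writing to the real path needs root.

```python
from pathlib import Path


def set_cpu_online(cpu: int, online: bool,
                   sysfs_root: str = "/sys/devices/system/cpu") -> Path:
    """Write 1 or 0 to cpuN/online, the same file the echo | tee command targets."""
    if cpu < 0:
        raise ValueError("cpu index must be non-negative")
    path = Path(sysfs_root) / f"cpu{cpu}" / "online"
    if not path.exists():
        # Some CPUs (often cpu0 on x86) expose no 'online' file at all.
        raise FileNotFoundError(f"{path} not found; this CPU may not be hot-pluggable")
    path.write_text("1" if online else "0")
    return path
```

Run as root, `set_cpu_online(22, False)` would take core 22 offline and `set_cpu_online(22, True)` would bring it back.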

I re-enabled core 23, and disabled core 22. I compiled the kernel, and success! I cleaned the build, and did it again, also success. At this point, I noticed that only the first 22 cores (0-21) were used in the compile. Core 22, which was offline, wasn't used, and neither was core 23, which was online. I'm not sure why the last core was avoided, but at this point, I think I have found the defective core. I’ve been running this machine for the last 24 hours with that core turned off, and it has been rock solid, churning out Folding@Home work units with 22 of the CPU cores engaged.
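One way to double-check which cores the kernel considers online is to read /sys/devices/system/cpu/online, which uses a range format such as `0-21,23`. The sketch below only shows how that format expands; `parse_cpu_list` and `read_online_cpus` are names chosen here, not part of any existing tool.

```python
def parse_cpu_list(text: str) -> list:
    """Expand a kernel CPU list such as '0-21,23' into individual CPU numbers."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate an empty list
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def read_online_cpus(path: str = "/sys/devices/system/cpu/online") -> list:
    """Read and expand the kernel's list of online CPUs (Linux only)."""
    with open(path) as handle:
        return parse_cpu_list(handle.read())
```

With core 22 offline, `read_online_cpus()` should return 0 through 21 plus 23. That confirms core 23 is available to the scheduler, though it does not explain why a given build left it idle.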

Now that I have reached the point where I think I should contact AMD about replacing my CPU, I’m looking to the community to see if I have erred or misinterpreted my findings. I’m also curious whether I completely missed a software tool that could march through my CPU and test each core individually to find out if any are faulty. Any advice on additional tests would be greatly appreciated.
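For illustration, the "march through each core" idea could look roughly like the following on Linux: pin the process to one CPU at a time, run the same deterministic computation, and flag any CPU whose result differs from the majority. This is a hedged sketch, not an established diagnostic. The function names are invented here, and a light SHA-256 loop is unlikely to provoke a marginal fault that only shows up under a heavy compile or vector load, so a clean run proves little.

```python
import hashlib
import os


def checksum_on_cpu(cpu: int, rounds: int = 2000) -> str:
    """Pin this process to one CPU, run a fixed SHA-256 chain, return the digest."""
    original = os.sched_getaffinity(0)
    os.sched_setaffinity(0, {cpu})
    try:
        data = b"per-core sanity check"
        for _ in range(rounds):
            data = hashlib.sha256(data).digest()
        return data.hex()
    finally:
        # Always restore the original affinity mask.
        os.sched_setaffinity(0, original)


def find_disagreeing_cpus(rounds: int = 2000) -> list:
    """Return the CPUs whose digest differs from the majority result."""
    allowed = sorted(os.sched_getaffinity(0))
    results = {cpu: checksum_on_cpu(cpu, rounds) for cpu in allowed}
    digests = list(results.values())
    majority = max(set(digests), key=digests.count)
    return [cpu for cpu, digest in results.items() if digest != majority]
```

Calling `find_disagreeing_cpus()` returns an empty list when every core agrees. The same pinning can be done for an arbitrary command with `taskset -c N`, which would allow repeating a failing workload on one core at a time.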

Oh, last thing: this CPU was bought new in retail packaging. I installed it into its socket only this single time, and treated it like I was defusing a bomb so as not to damage any of the ~4000 pins it sits on.
 

ari2asem

Active Member
Dec 26, 2018
412
65
28
The Netherlands, Groningen
Maybe a stupid suggestion... try all your testing under Windows as well. Why? To rule out hardware issues.

I know you have (software-related) problems under Linux. But if you have the same problems under Windows, then you know it is your hardware.
 

LenE

I had considered doing that (test under Windows), but didn’t because I didn’t want to spend money on buying a seat of Windows that I had no intention of using on this machine beyond initial testing. Considering how much I already have tied up though, it would be a small price for a double check.
 

PigLover

Moderator
Jan 26, 2011
2,917
1,234
113
You don't have to pay for a "seat" to test on Windows. You can download Win10 using the Media Creation Tool and install it without activating it (select "I don't have a key" when it asks during the install). Per the terms of the license you have a 30 day eval period which you can use to do your test. The activation is never actually enforced; you just get increasing nags and some cosmetic features disabled (like no wallpaper). But for the first 30 days you are completely within the license terms to use it for evaluation.
 

Spartacus

Active Member
May 27, 2019
572
210
43
Austin, TX
Secondarily, have you tried reseating the CPU? They're very finicky about the pressure needed for contact; too much or too little can cause memory and/or core issues.
That would be my recommended starting point after the Windows test: inspect the pins in the socket and reseat with proper pressure/tightening.
 

LenE

I was really fearing reseating. I shouldn’t, but I am super paranoid of making stupid mistakes with the super-fragile pins. Still a really good idea to make sure to re-check this.

Any suggestions on what to run as a test on Windows? Coming from a macOS and Unix background, I am at a complete loss as to what tools exist on Windows. I asked a Windows-using coworker who has a relatively recent Intel HEDT build, and he just stared back blankly with no idea of what to use.
 

LenE

Will that identify a bad core, or will it just fail if a bad core is present?
 

ari2asem

Or you could run BOINC with World Community Grid. WCG runs only on the CPU.

Or run Folding@Home with only a CPU slot. F@H also runs on the GPU, but you can disable that in the F@H config.
 

LenE

OK, I got Windows up and running, and everything appears to be normal. Since I had immediate issues running Folding@Home on Linux with all of the cores enabled, I threw that at it first. Absolutely no problem running on all cores. Resource manager shows each core at about 90% utilization.

I have several immediate observations. The CPU is running much cooler (~15-20°C) than it was under Linux. At the same time, running all out, the points per day metric from FAH was about 30% lower than it was under Linux. When checking the particulars of the client, I saw that it claims to be 32-bit code. I don't recall what it said for the Linux client, but perhaps the problem is running 64-bit code on that suspect core? Maybe the Windows Folding@Home client doesn't use AVX2, and that may be related to the problem with that core?

I decided to use the built-in stress test in AMD's Ryzen Master software, thinking it might show something. No problems were reported, and again the CPU temperature was still 15-20°C cooler than it was under Linux (with two cores sitting on the sidelines). The only conclusion I was able to draw is that Windows appears not to let the CPU be taxed nearly as much as Linux does. Perhaps the more relaxed loading of the cores in Windows has prevented the problem from occurring?

I think I need to find some other tests.
 

LenE

OK, I tried this one and turned on all of the CPU instruction set tests. I got no errors through a 15 minute run (38 cycles, 27.379 T operations). This was far less taxing than Folding@Home.

I guess what I can't wrap my head around is this: if the CPU is perfectly functional under Windows, why does it fall apart under Linux until a certain core is shut down?
 

LenE

I went back into the BIOS and turned PBO and SMT back on, the default settings for the motherboard. I first hit the problems in Linux with all of the default settings, so I reset these to see if this would change the Windows experience. Boy did it!

Folding@Home is no longer stable. It is acting as it did in Linux, but with different errors in its log. The CPU temperature is up to where I saw it in Linux. While it is warmer, it is still 20°C below AMD's specified maximum for this CPU.

I guess my next steps will be to go in and check the pads and pins and re-seat the CPU. This is enough fun for one night.
 

Zerd

New Member
Jul 29, 2013
1
0
1
This looks eerily similar to the issues found over at level1:
 

LenE

Yes. Very similar. I had no issues running Prime95 with in-place 16k FFT’s, though.

I can say that on some workloads the failures would happen pretty quickly, while the fans were still ramping up and had not yet hit peak speed. Other times, the machine just chugs along for an extended period, and then there is a cascade of failures that takes out every program with a UI at once.

I hope it isn’t a VRM issue. I have no handy oscilloscopes around.
 

LenE

I pulled the CPU and re-seated it. Good news is that I didn’t have any bent pins, and the pads all looked immaculate. Unfortunately, I found out on reassembly that I am out of thermal paste. It looks like Newegg and Amazon are also completely out until June 1.


Here is the socket. The shadows from the illumination I had make it difficult to discern the pins, but I went over it with a magnifying glass to see if there were any bent pins. I didn’t see any.

0B11A1A8-A30B-405C-9FFE-A9647AFE168B.jpeg

The pads on the processor were all good.
CE63CECA-5027-42DB-800C-385CA08D5AB8.jpeg
 

bayleyw

Member
Jan 8, 2014
42
2
8
PBO is an overclock; my guess is that with it turned on, core 22 is not receiving enough voltage to be stable. You can almost certainly fix this by tweaking the voltages. The complaints about AVX2 online more or less boil down to the fact that, out of the box on some boards and firmware versions, the processor doesn't receive enough voltage to stay stable under heavy vector loads.

There are some teething issues surrounding such aggressively tuned high core count processors right now; I'd expect things to improve as more people buy these and the manufacturers get feedback.
 

LenE

I don’t know if PBO on my system is an overclock. When it is disabled, all cores are locked at the base frequency. When it is enabled, cores can boost within the advertised envelope or sleep, and I don’t get any of the scary overclocking warnings that I get when I visit the overclocking tabs in the BIOS. PBO seems to be the default operating mode of the processor.

I had suspected AVX2 as part of the problem, but the more I thought about it, the less it fit: I disabled core 22 based on finding stability under a multi-threaded compiler load, and I don’t think gcc uses AVX instructions to compile code. As it happened, heavy AVX loads stabilized as well.

Up to this point, I had not used any of the voltage or frequency tweaking abilities of the BIOS. I didn’t want to cause any inadvertent damage that would void the warranty. I am hopeful that re-seating the processor and/or an updated AGESA may fix the issue. I won’t know, though, until I can get more thermal compound from a friend.
 

LenE

I got some thermal paste from a friend and put it back together to test again last night. Unfortunately I am still seeing the same instability.

My friend’s thermal compound had been lying around for a few years, so it was a little thicker. I saw temperatures about 10°C higher than what I was seeing before. This was still well under the 95°C limit of the processor. I did see that the Vcore was boosted by about 0.2 volts, and under load the processor regulated itself to around 99% of the 280W socket limit. Most cores doing sustained work boosted to 4.079 GHz, which is just below the recorded peak frequency of 4.094 GHz. I didn’t see any peak anywhere near the 4.5 GHz burst limit.
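To put those numbers side by side, here is the arithmetic as a small Python sketch. It uses only the figures quoted above; the variable and function names are arbitrary.

```python
def percent_of(value: float, reference: float) -> float:
    """Return value expressed as a percentage of reference."""
    return 100.0 * value / reference


socket_limit_w = 280.0    # socket power limit quoted above
sustained_ghz = 4.079     # typical boost under sustained load
peak_ghz = 4.094          # highest recorded frequency
burst_limit_ghz = 4.5     # advertised burst limit

package_power_w = 0.99 * socket_limit_w
sustained_pct = percent_of(sustained_ghz, burst_limit_ghz)
headroom_mhz = (burst_limit_ghz - peak_ghz) * 1000

print(f"package power under load: about {package_power_w:.1f} W")
print(f"sustained boost: {sustained_pct:.1f}% of the burst limit")
print(f"gap between recorded peak and burst limit: {headroom_mhz:.0f} MHz")
```

In other words, the package sits within a few watts of the socket limit while the cores stay roughly 400 MHz short of the burst figure. That would be consistent with the boost being power-limited, though the numbers alone don't prove it.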

I am going to retest with SMT off to see if that is the differentiator. With that off, I can try to confirm what I found in Linux regarding core 22. I certainly get more information from Ryzen Master on Windows, but it still seems much harder to fully tax the CPU on Windows.