Wiser and more experienced gurus of the STH community, please help me sanity-check my method and diagnosis of a bad core before I initiate a warranty claim. I come from a Mac background, where I am used to having hardware diagnostic tools embedded in ROM, but I have not found similar tools on Linux for diagnosing potential hardware problems.
Back in April, I built a deep learning workstation around an AMD Threadripper 3960X and an Nvidia RTX 2080 Ti. I used an ASRock TRX40 Creator motherboard (flashed with the latest BIOS), 128 GB of Corsair RAM (4 x 32 GB), and a 1 TB Sabrent PCIe 4.0 NVMe SSD. The system is powered by a Seasonic 1000 W 80 Plus Titanium power supply, and the CPU is cooled by a custom loop with two 360 mm radiators. I have run everything at default BIOS auto settings with absolutely no attempt at overclocking anything.
I have tried to run Ubuntu 18.04, 19.10, and 20.04 on this system, and all were somewhat unstable. On each of them I had trouble getting the Nvidia drivers and the CUDA toolkit installed and usable. The big hangup was the near impossibility of getting the driver to build a working kernel module for whichever kernel was installed. I could get the Nvidia stack to play nice with the HWE kernel (5.3.0), but I could not get it to work at all with a mainline kernel that should fully support the TRX40 platform (5.4.7).
Eventually, I got the Nvidia stack working (it's a complete mystery what fixed the problem), so I thought I'd run some Folding@Home to burn in the system. When I could get work units, the GPU jobs ran well enough, but the default CPU slot was a total failure. I tried splitting the cores into several 6-core CPU slots, and this mostly worked, but I kept having failures on various slots. My initial thoughts were that I might have a RAM problem, or that the relative newness of the TRX40 chipset might be contributing to the issues (I had to pass mce=off to the kernel to install or boot).
I ran MemTest86 for a day and a half, and had zero issues reported. RAM didn’t appear to be the issue. My next course was to try to configure and compile a kernel that would have support for the hardware.
I started by trying to use all cores and threads (make -j48). That failed outright with various internal compiler errors and other segmentation faults. I kept reducing the thread count until I was down to a single-threaded compile. That still failed, but I did notice in the system monitor that the single compile thread was being bounced randomly among all of the cores. That got me thinking that I might have a bad core, but which one?
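In hindsight, pinning the single-threaded job to one core at a time would have answered that question directly. A rough sketch of what that could look like with taskset (the busy-loop here is just a placeholder for the real workload, e.g. a kernel compile, and the 0-23 range assumes SMT is off on a 24-core part):

```shell
#!/bin/sh
# Sketch: run a CPU-bound job pinned to one core at a time, so that a
# crash or segfault implicates a specific core rather than "somewhere".
for core in $(seq 0 23); do
    echo "testing core $core"
    # Placeholder busy-loop; substitute the real test (e.g. "make -j1").
    taskset -c "$core" sh -c 'i=0; while [ "$i" -lt 100000 ]; do i=$((i+1)); done' \
        || echo "core $core: workload failed"
done
```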
After some googling, I found a method to take individual cores offline (echo 0 | sudo tee /sys/devices/system/cpu/cpuXX/online). To make this simpler, I disabled SMT and PBO in the BIOS and went to work systematically testing by compiling the kernel with as many CPUs as I had enabled. I started at the last core, disabling core 23. I got compiler errors. I cleaned the tree and tried the build again: same errors, but in different places.
I re-enabled core 23 and disabled core 22. I compiled the kernel: success! I cleaned the build and did it again, also a success. At this point, I noticed that only the first 22 cores (0-21) were used in the compile. Core 22, which was offline, obviously wasn't used, but neither was core 23, which was online. I'm not sure why the last core was avoided, but at this point I think I have found the defective core. I've been running the machine for the last 24 hours with core 22 turned off, and it has been rock solid, churning out Folding@Home work units on the 22 engaged cores.
Now that I've reached the point where I think I should contact AMD about replacing my CPU, I'm looking to the community to see whether I have erred and misinterpreted my findings. I'm also curious whether I completely missed a software tool that could march through the CPU, testing each core individually to find any that are faulty. Any advice on additional tests would be greatly appreciated.
Oh, one last thing: this CPU was bought new in retail packaging. I have installed it into its socket only this once, and I treated it like I was defusing a bomb so as not to damage any of the ~4000 pins it sits on.