Did I diagnose a hardware problem correctly on a Threadripper?


LenE

New Member
Jan 29, 2020
28
8
3
So it turns out that you can't deactivate just one core in Windows 10. Using Ryzen Master, I could turn off the suspect core, but cores have to be turned off symmetrically across CCXs. This means I had to turn off seven additional cores to verify the suspect core. Turning off core 23 (core 22 on Linux, which indexes from zero) and seven others gave me stable operation, just like on Linux. This was stable with PBO and SMT on.
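For anyone following along, the arithmetic behind that symmetry requirement can be sketched roughly as below. This is only an illustration: it assumes a 3960X laid out as 8 CCXs with 3 active cores each, and it assumes cores are numbered consecutively within a CCX. The function names are made up for this example, and the real core-to-CCX mapping in Ryzen Master may differ.

```python
# Hypothetical sketch of the "disable symmetrically across CCXs" arithmetic.
CCX_COUNT = 8       # assumption: 3960X = 4 CCDs x 2 CCXs each
CORES_PER_CCX = 3   # assumption: 3 active cores per CCX -> 24 cores total


def one_based_to_linux_index(core_number: int) -> int:
    """Convert a 1-based core number (core 23 in the post) to Linux's 0-based index."""
    if core_number < 1:
        raise ValueError("1-based core numbers start at 1")
    return core_number - 1


def ccx_of(linux_index: int) -> int:
    """Return the CCX a core would sit in, assuming consecutive numbering per CCX."""
    return linux_index // CORES_PER_CCX


def symmetric_disable_cost(cores_per_ccx_disabled: int = 1) -> tuple[int, int]:
    """Return (total cores disabled, cores disabled beyond the one suspect core)."""
    total = CCX_COUNT * cores_per_ccx_disabled
    return total, total - 1


if __name__ == "__main__":
    suspect = one_based_to_linux_index(23)
    total, extra = symmetric_disable_cost()
    print(f"Linux index {suspect}, assumed CCX {ccx_of(suspect)}")
    print(f"{total} cores disabled in total, {extra} of them healthy")
```

Under those assumptions, disabling one core per CCX costs eight cores in total, which matches the seven extra cores that had to go dark alongside the suspect one.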

I am initiating a return with AMD.
 

LenE

I started the process on Sunday night, but I still haven’t gotten an RMA. For some reason, the AMD e-mail system stripped all of my attachments, and they can’t proceed without photos. They tell me that if they can’t verify my pictures within 10 days, the ticket will die. I’m going to try sending them a link to an obscure web server I have to see if that works.

Any AMD warranty veterans out there who can offer advice to make this go more smoothly?
 

gb00s

Well-Known Member
Jul 25, 2018
1,177
587
113
Poland
OK, I got Windows up and running, and everything appears to be normal. Since I had immediate issues running Folding@Home on Linux with all of the cores enabled, I threw that at it first. Absolutely no problem running on all cores. Resource manager shows every core using about 90% of available CPU power for each core.

I have several immediate observations. The CPU is running much cooler (by ~15-20°C) than it was under Linux. At the same time, running all out, the points-per-day metric from FAH was about 30% lower than it was under Linux. When checking the particulars of the client, I saw that it claims to be 32-bit code. I don't recall what it said for the Linux client, but perhaps the problem is running 64-bit code on that suspect core? ....
If I’m not mistaken, F@H on Windows only uses 32 cores at full load. That’s why you only see 90% usage. Maybe they changed it in the last update some weeks ago. On Linux I was always using all 48T at 100%. Additionally, and historically, F@H on Linux has always received higher PPD. But to take full advantage of your cores you have to set the CPU stack to advanced mode.

All in all, though, that should not explain your one specific core issue. Two previous clients of mine from the ML space reported the same issue, but with a 3990X. RMA’d twice.
 

LenE

F@H still has the 32 thread limit (actually 31) on Windows.

That’s interesting that one of your clients had the same issue twice on a 3990X. I had found a thread on another site where an owner of a 3970X had discovered a bad core, which is what pushed me to start parking cores. On the one hand, processors with 24/32/64 cores would be more likely to have a marginal core than a CPU with 4 or 8. What concerns me a bit, though, is that all of the Threadrippers use higher-binned CCXs, so I assume that all of the silicon used is well understood before packaging. That makes me wonder what kind of checks are done on the final product. In the case of my 3960X, the CCXs are binned to use three of the four cores each. One would think that a higher-binned CCX yielding three acceptable, higher-clocking cores might have a fourth, alternative core that could be switched in if a good core was found to be bad after packaging.
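For reference, one common way to park a core on Linux is the kernel's CPU hotplug interface in sysfs. The sketch below is not necessarily how it was done in this thread; it just assumes the standard `/sys/devices/system/cpu/cpuN/online` and `topology/thread_siblings_list` files, needs root on a real system, and uses made-up helper names. With SMT on, both hardware threads of the suspect core have to go offline, which is why it reads the sibling list first.

```python
from pathlib import Path

# Standard sysfs location for per-CPU controls on Linux.
SYSFS_CPU = Path("/sys/devices/system/cpu")


def parse_cpu_list(text: str) -> list[int]:
    """Parse a kernel CPU list such as '0-3,8' into a list of CPU numbers."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        elif part:
            cpus.append(int(part))
    return cpus


def park_core(cpu: int, root: Path = SYSFS_CPU) -> list[int]:
    """Take a logical CPU and its SMT siblings offline; return the CPUs parked.

    Hypothetical helper: `root` is a parameter only so the logic can be
    tried against a fake directory tree without touching the real system.
    """
    siblings_file = root / f"cpu{cpu}" / "topology" / "thread_siblings_list"
    siblings = parse_cpu_list(siblings_file.read_text())
    for sibling in siblings:
        # Writing "0" asks the kernel to take that logical CPU offline.
        (root / f"cpu{sibling}" / "online").write_text("0")
    return siblings
```

Calling `park_core(22)` as root would take logical CPU 22 and whatever sibling thread shares its physical core offline until the next reboot, or until "1" is written back to the same files.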

In any case, AMD approved my RMA last night. It will be going out on Monday.
 

LenE

I finally got my replacement processor last night. So far, it has been running very well and seems much more stable than the original. I’ve been pushing it through more Folding@Home for the last 22 hours, and no problems have turned up so far. According to Ryzen Master, this particular CPU is using less power than the one with a defective core, at the same load. My temperatures are slightly higher, but I chalk that up to adding my GPU into the cooling loop.

One thing that struck me about the replacement is that the serial number was wildly different from my original processor. Not knowing anything about the method AMD uses to generate serial numbers, this may be completely irrelevant.

I will give my system one more day on Windows before wiping that out and switching back to Linux.