Help identify faulty hardware


guilly

New Member
Feb 23, 2024
10
0
1
Wondering if I could get some guidance on identifying whether I have a faulty MB or CPU. I purchased a Supermicro H12SSL-i / Epyc 7302 from eBay, but it's been very unstable (random restarts). So far I've done the following:

1. Updated BIOS Firmware / BMC
2. Installed Proxmox 8

The host is stable if no VMs are running; however, as soon as I start a VM the instability starts. I can make it through an install of Windows 11, but the host randomly crashes whenever I attempt to start the VM or a few minutes after it has booted.

Troubleshooting steps
1. Memtest86 ran successfully (10+ hrs)
2. Confirmed cooling isn't an issue
3. Memory module is a supported type (DDR4 3200MHz, 1.2V)

Observed Errors
1. Event Logs in BIOS show 0x0B - CPU Failure

Just wondering what else I can do to rule out the MB / CPU as the potential issue before the 30-day return policy expires? My only thought was to install Windows on the host and attempt to run Prime95.
 

guilly

I ended up booting Ubuntu 22.04 from USB and running Prime95. The system reboots within seconds. I was watching the temps in IPMI, and when the CPU hits 30 degrees the system reboots.
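Since the reboot happens within seconds, it can help to have the last readings on disk rather than relying on what was on screen. Here is a minimal sketch, assuming a Linux live system that exposes CPU temperatures through the hwmon sysfs tree (on Epyc that is usually the `k10temp` driver, but that is an assumption about this particular setup). The function names and the log path are made up for illustration, and the values may not match what IPMI reports, since the BMC reads its own sensors.

```python
import glob
import os
import time


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"chip/sensor": degrees_C} for every temp*_input found under root."""
    temps = {}
    for path in sorted(glob.glob(os.path.join(root, "hwmon*", "temp*_input"))):
        chip_dir = os.path.dirname(path)
        try:
            with open(os.path.join(chip_dir, "name")) as f:
                chip = f.read().strip()
        except OSError:
            # Fall back to the directory name (e.g. hwmon0) if there is no name file.
            chip = os.path.basename(chip_dir)
        sensor = os.path.basename(path)[: -len("_input")]
        try:
            with open(path) as f:
                # hwmon reports millidegrees Celsius.
                temps[f"{chip}/{sensor}"] = int(f.read().strip()) / 1000.0
        except (OSError, ValueError):
            continue
    return temps


def log_temps(logfile, interval=1.0, samples=None, root="/sys/class/hwmon"):
    """Append one line per sample and fsync it so it survives a sudden reset."""
    n = 0
    with open(logfile, "a") as out:
        while samples is None or n < samples:
            readings = sorted(read_hwmon_temps(root).items())
            line = " ".join(f"{name}={value:.1f}" for name, value in readings)
            out.write(f"{time.time():.0f} {line}\n")
            out.flush()
            os.fsync(out.fileno())
            n += 1
            if samples is None or n < samples:
                time.sleep(interval)
```

Calling `log_temps("/path/on/persistent/storage/temps.log")` in one terminal while Prime95 runs in another leaves a timestamped trail up to the reset. On a live USB the log has to go somewhere that survives the reboot, such as a second USB stick.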

I've never dealt with a failed CPU before. Would you say this is a good indication that it's the CPU and not a compatibility issue? I have no other components connected.
 

RolloZ170

Well-Known Member
Apr 24, 2016
5,422
1,638
113
guilly said:
"I've never dealt with a failed CPU before, would you say this is a good indication it's CPU and not a compatibility issue ? I have no other components connected."

Could be an improper CPU install, a weak PSU, or cables that are too thin or too long.
The mobo could also have missing VRM phases.
 

Stephan

Well-Known Member
Apr 21, 2017
945
714
93
Germany
Disable each core plus its hyperthread sibling one by one and see if the problem goes away. I have written a howto, which should also work for AMD: https://forums.servethehome.com/index.php?resources/machine-check-exception-mce-workaround.55/

Memtest86 has a blacklist of boards whose UEFI BIOS is faulty and not multiprocessor capable, so on those it might only use one or a few cores. In your case those might be just the cores that work. If that doesn't help, return it for a refund. It's an old scheme: a seller keeps selling a CPU that is known to be broken until it "sticks" with a buyer.
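For anyone wanting to try the core-by-core approach from a running Linux system, here is a rough sketch of one way to do it through sysfs. This is not necessarily what the linked howto does. It assumes the standard `thread_siblings_list` and `online` files under `/sys/devices/system/cpu`, and the function names are just for illustration. Writing to `online` needs root.

```python
import glob
import os


def parse_cpu_list(text):
    """Expand a sysfs CPU list such as '0,32' or '0-3,8' into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def sibling_groups(root="/sys/devices/system/cpu"):
    """Return one tuple of logical CPU numbers per physical core."""
    groups = set()
    pattern = os.path.join(root, "cpu[0-9]*", "topology", "thread_siblings_list")
    for path in glob.glob(pattern):
        with open(path) as f:
            groups.add(tuple(parse_cpu_list(f.read())))
    return sorted(groups)


def set_online(cpus, online, root="/sys/devices/system/cpu"):
    """Write 1 or 0 to cpuN/online for each CPU that has such a file."""
    for cpu in cpus:
        path = os.path.join(root, f"cpu{cpu}", "online")
        # cpu0 often has no 'online' file and cannot be taken offline this way.
        if os.path.exists(path):
            with open(path, "w") as f:
                f.write("1" if online else "0")
```

Collect `sibling_groups()` once before taking anything offline, because an offline CPU may no longer expose its topology files. Then, for each group, call `set_online(group, False)`, run the stress test, and call `set_online(group, True)` before moving on to the next group.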
 

guilly

RolloZ170 said:
"Could be an improper CPU install, a weak PSU, or cables that are too thin or too long.
The mobo could also have missing VRM phases."

Thanks, I ordered another PSU and will retest. The PSU I'm using is brand new from Corsair, so I'm doubtful, but you never know.

It was a combo Epyc 7302 / Supermicro H12SSL-i from tugm4470 on eBay, so the CPU was pre-installed.
 

RolloZ170

BTW, I remember I had similar issues with a Corsair PSU. It turned out to be the mains cable (from the wall to the PSU).
There was a bad contact, and the symptom was reboots under load. Very crazy.
 

guilly

Alright, I think I may have a memory issue even though Memtest86 passed with no problem. I'm thinking it's a compatibility issue? If I run Prime95 Small FFT (option 1) or Medium FFT (option 2), the CPU handles it fine and there are no system reboots. When I run the Blend test (option 3), which brings memory into the mix, I get a system reboot right away.

The memory I'm using is 64GB 3200MHz. It's Dell branded, but the label says:

SK hynix - 64GB 2Rx4 PC4 - 3200AA - RB4 - 12

Is there any way to determine if this is a compatibility issue? The memory type is supported.

I've attempted the test with 1 slot (A1) and 4 slots (A1 - D1) filled, with the same results. I do not have 8 sticks to test with. Even though I know that's not the ideal layout, I don't think it should cause system reboots?
 

guilly

Unfortunately I have zero errors in the BMC. The only error is 0x0B - CPU Failure from the event logs in BIOS. I haven't been able to find any information on what this error means, though.
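One more place to look is the OS side. The Linux kernel tags machine check reports with `mce:` and `[Hardware Error]`, so a quick filter over the previous boot's kernel log can show whether anything was recorded before the reset. This is only a sketch: it assumes systemd with a persistent journal (a live USB usually won't have one), the function names are made up, and an instant reset may well leave nothing behind.

```python
import re
import subprocess

# Strings the kernel uses when it reports machine check events.
MCE_PATTERN = re.compile(r"mce:|machine check|\[hardware error\]", re.IGNORECASE)


def find_mce_lines(log_text):
    """Return the log lines that look like machine check / hardware error reports."""
    return [line for line in log_text.splitlines() if MCE_PATTERN.search(line)]


def previous_boot_kernel_log():
    """Fetch kernel messages from the previous boot via journalctl."""
    result = subprocess.run(
        ["journalctl", "-k", "-b", "-1", "--no-pager"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

Running `find_mce_lines(previous_boot_kernel_log())` on the installed Proxmox host after a crash gives a list to go through. An empty list does not clear the CPU, it only means nothing made it to disk.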
 

CyklonDX

Well-Known Member
Nov 8, 2022
857
283
63
If you have or can get a multimeter, you should check the power phases for the CPU and for the memory (on the motherboard).
The memory controller is on the CPU, so if the crash depends on stressing all components at once, it's more likely an issue on the mobo's power delivery side than with the CPU alone.
 

guilly

Thanks all, I sent a follow-up message to tugm4470 on eBay to request a replacement. I see no reason to troubleshoot any further, given that it's a new purchase.
 

erock

Member
Jul 19, 2023
84
17
8
I had very similar issues with an H11DSi mobo and two 7F52 CPUs purchased from tugm4470. However, after much trial and error I was able to resolve the issue. In a nutshell, I think it was a cooling problem. Disabling SMT was a quick fix that allowed me to improve cooling. I eventually got the system running with SMT enabled and all is well. Here is my post in case it is helpful: https://forums.servethehome.com/ind...ms-and-spontaneous-restarts.41442/post-392282
 