Supermicro H11DSi + 2x7f52 Boot Problems and Spontaneous Restarts

erock

Member
Jul 19, 2023
84
17
8
I encountered this issue (see images) when installing a Linux operating system after initial POST on a Supermicro H11DSi with 2x 7F52 Epyc processors. POST went fine and the OS loaded, but when I tried to reformat and install the OS (Pop!_OS) I got the error messages shown in the attached images (see "has mount_opt None" in the log file). The problem was associated with a bad Samsung 870 EVO SATA III 1TB drive from Amazon. I have used several of these in other H11DSi builds with no issue, so this must be a bad apple. I replaced it with a Western Digital drive and all was well. I thought I would share this in case you encounter the same problem. Also, the image next to the Pop!_OS error message is fun.
 


erock

Hi All,

The problem has re-emerged. The OS was installed and I ran a few stress tests. The system spontaneously shut down after the tests were over (not at full load), and I could no longer boot from the WD SSD. It was as if the OS had been completely lost. POST shows no issues. However, after resetting the CMOS and trying to boot again from the USB drive, the system resets at the start of reinstalling the OS.

Any ideas on how to effectively troubleshoot this?
 

erock

Here is what I see on screen after I try to boot the operating system. The machine restarts quickly, so this is difficult to capture. Not sure if this is useful, but I wanted to see if anyone can identify a clue.
 


alex_stief

Well-Known Member
May 31, 2016
884
312
63
39
Not that I know anything about the logs you posted. I just want to share that when dealing with "bad" SATA drives, never rule out a bad cable ;)
 

RolloZ170

Well-Known Member
Apr 24, 2016
5,436
1,644
113
Check the SMART data of the drive with another PC.
Check the UDMA CRC error count value. If it is not zero and increases over time, a bad cable or bad RAM can cause this.
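As a rough illustration of the check above, here is a minimal Python sketch that pulls the interface CRC error counter out of `smartctl -A` output. It assumes smartmontools is installed and that the drive reports the counter as SMART attribute ID 199 (the name varies by vendor, for example `UDMA_CRC_Error_Count` or `CRC_Error_Count`, so the sketch matches on the ID). The function names and the device path in the usage note are placeholders, not anything from this thread.

```python
import subprocess


def parse_crc_errors(smartctl_output: str):
    """Return the raw value of SMART attribute 199, or None if it is absent."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns and start with the numeric ID.
        if len(fields) >= 10 and fields[0] == "199":
            try:
                return int(fields[9])  # RAW_VALUE column
            except ValueError:
                return None
    return None


def read_crc_errors(device: str):
    """Run `smartctl -A <device>` (usually needs root) and parse the result."""
    result = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True, text=True, check=False,
    )
    return parse_crc_errors(result.stdout)
```

Calling `read_crc_errors("/dev/sda")` a few times over several days of use, with `/dev/sda` swapped for the actual device, would show whether the count is rising, which is the pattern described above.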
 

erock

Here is an update. I replaced the 7F52s with 7302s and all is well. So I don't think it is RAM or cables, and my original thought about a bad disk seems wrong. Another clue is that once I installed the OS to disk with the 7302s, I can reach the BIOS with the 7F52s back in. However, after logging in to the OS the system restarts. But again, if I go back to the 7302s this does not happen.
 

erock

Quote: "What RAM do you have? I hope not SK-Hynix LRDIMM. Run memtest86pro."
Here is my memory info:
  • MEM-DR416L-HL01-ER32 16GB
  • 16GB DDR4 3200 (PC4 25600)
  • ECC Registered RDIMM Server Memory
I will give memtest86pro a try. Thank you for the tip.
 

erock

Here is an update of the journey:

I switched out the 7F52s for 7302s and all of the issues went away (I could boot from my USB drive and install the OS to a formatted SSD). With the 7302s the spontaneous restarts also went away. I then got brave and decided to reinstall the 7F52s. Initially, POST got stuck on CPU initialization (similar to my other post). I then swapped the socket locations of the 7F52s and POST was able to move past CPU initialization (this step was weird and I don't understand it!). I could then boot into the OS. However, the random restarts occurred again.

I then went into the BIOS and disabled SMT (Advanced CPU Config) and IOMMU (NB Config), following a recommendation from another thread. This did the trick (at least so far)! I could boot into the OS, run a stress test with stress-ng, and run my scientific calculations, and the system never reset. To be extra sure of the cause, I went back into the BIOS and re-enabled SMT. The restart issue reappeared. I am now doing more testing with SMT disabled and so far so good (my scientific calculations are not impacted much by this, since cores matter more than threads for my use case).
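When toggling SMT in the BIOS like this, it can help to confirm from inside Linux what the kernel actually sees before each stress run. The sketch below reads the kernel's SMT files under `/sys/devices/system/cpu/smt`, which reasonably recent kernels expose. The function name is made up for illustration, and the exact `control` value reported when SMT is disabled in firmware (rather than by the kernel) may differ between systems.

```python
from pathlib import Path

SMT_DIR = Path("/sys/devices/system/cpu/smt")


def smt_status(smt_dir: Path = SMT_DIR) -> dict:
    """Return the kernel's view of SMT; values are None if the files are missing."""

    def read(name: str):
        try:
            return (smt_dir / name).read_text().strip()
        except OSError:
            return None

    # control: "on", "off", "forceoff", "notsupported" or "notimplemented"
    control = read("control")
    # active: "1" if sibling threads are currently online, otherwise "0"
    active = read("active")
    return {
        "control": control,
        "active": None if active is None else active == "1",
    }
```

Logging `smt_status()` next to each stress-ng run makes it easier to tie a restart to the SMT setting that was really in effect at the time.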

Interestingly, I have an H11DSi rev 2 with two 7F52s and SMT works fine. The power supplies are the same, but the board with working SMT has BIOS version 2.4 as opposed to 2.1. Another difference is that the board with working SMT is brand new out of the box, whereas the one with the problems is a used board with who knows how much stress. I have had lots of luck with used CPUs. Not sure if I am going to cheap out anymore with used motherboards.
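For comparing two boards like this, the BIOS version and date can be read from Linux without rebooting into setup. This is a small sketch that reads the standard DMI files under `/sys/class/dmi/id`; the function name and the default field list are illustrative choices, and some fields may be readable only by root on some systems.

```python
from pathlib import Path

DMI_DIR = Path("/sys/class/dmi/id")


def read_dmi(fields=("board_name", "board_version", "bios_version", "bios_date"),
             dmi_dir: Path = DMI_DIR) -> dict:
    """Return the requested DMI fields; unreadable or missing fields map to None."""
    info = {}
    for field in fields:
        try:
            info[field] = (dmi_dir / field).read_text().strip()
        except OSError:
            info[field] = None
    return info
```

Running `read_dmi()` on both machines and comparing the dictionaries gives a quick record of the firmware difference alongside the test results.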

I will update this thread if anything changes.
 

alex_stief

Have you tried keeping an eye on the VRM temperatures? The 7F52s have a much higher TDP, and the H11DSi definitely can run into trouble with CPU VRMs overheating in workstation cases. You can use Supermicro's "SuperDoctor" software to read the sensors.
 

erock

Quote: "Have you ever tried the CPUs in single config, one by one, to find the bad one?"
Well, they are both working now without SMT. I did write down the last three digits of an ID number, so I know which one was causing the CPU initialization to hang. Would this be useful info?
 

erock

alex_stief said: "Have you tried keeping an eye on the VRM temperatures? The 7F52 are much higher TDP, and the H11DSi definitely can run into trouble with CPU VRMs overheating in workstation cases. You can use Supermicros "SuperDoctor" software to read the sensors."
I was taking a few peeks at VRM temps via "ipmitool sensor" and don't recall anything past 60C, but I was not giving this enough focus. I was wondering about the VRMs too. I was actually bench testing (with coolers and fans working), so the parts were not in a case. With SMT disabled and running stress-ng at max CPU load, VRM temps range from 43-58C. What is too high?
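To watch VRM temperatures during a stress run rather than taking occasional peeks, the pipe-separated output of `ipmitool sensor` can be filtered in a few lines. The sketch below assumes the usual layout of that output (name, reading, unit, status, thresholds) and that the VRM sensors have "VRM" somewhere in their names. The sensor names in the test data are hypothetical, so it is worth listing all temperature sensors on the actual board first.

```python
import subprocess


def parse_temps(sensor_output: str, name_filter: str = "VRM") -> dict:
    """Return {sensor name: degrees C} for temperature rows matching name_filter."""
    temps = {}
    for line in sensor_output.splitlines():
        cols = [c.strip() for c in line.split("|")]
        # Keep only rows reported in degrees C.
        if len(cols) < 3 or cols[2] != "degrees C":
            continue
        if name_filter.lower() not in cols[0].lower():
            continue
        try:
            temps[cols[0]] = float(cols[1])
        except ValueError:
            continue  # "na" means the sensor has no reading
    return temps


def read_temps(name_filter: str = "VRM") -> dict:
    """Run `ipmitool sensor` locally (usually needs root) and parse the result."""
    result = subprocess.run(
        ["ipmitool", "sensor"], capture_output=True, text=True, check=False,
    )
    return parse_temps(result.stdout, name_filter)
```

Calling `read_temps("")` returns every temperature sensor, which helps confirm that the values being watched really are the CPU VRM sensors; polling `read_temps()` every few seconds during stress-ng then shows the peak instead of a snapshot.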
 

RolloZ170

erock said: "Well they are both working now without SMT. I did write down the last three digits of some ID number so I know which one was causing the CPU initialization to hang. Would this be useful info?"
Update the BIOS first. The AGESA has been updated in newer versions.
 

erock

erock said: "I was taking a few peeks at VRM temps via "ipmitool sensor" and don't recall anything past 60C but I was not giving this enough focus. I was wondering about VRMs too. I was actually bench testing (with coolers and fans working) so the parts were not in a case. With SMT disabled and running stress-ng at max CPU load VRM temps range from 43-58C. What is too high?"

Also, in my case with the brand new H11DSi and 7F52s, I have three 1600 rpm be quiet! fans in the front and two Noctua 1500 rpm fans for the tower coolers (plus a 1400 rpm be quiet! fan in the back). This seems to be doing a good job with airflow across the board, but I do wonder if there should be more. I would like to avoid using a server chassis but may be headed in that direction since I am building a small cluster. I am using two Noctua NH-U14S tower coolers because they keep 240-280W CPUs at reasonable temps while minimizing noise, but you need good top ventilation that limits dust accumulation when powered down (see the be quiet! Dark Base Pro 900). But this may not be enough to manage the VRM temps for all use cases. Any setup recommendations for better cooling are welcome.
 

erock

RolloZ170 said: "update the BIOS first. AGESAs updated."
Thank you for pointing this out. The BIOS update process for Supermicro boards is a bit opaque to me. For example, the BIOS update zip has some poorly documented scripts with crappy examples (at least to me). Can you point me to better guidance and examples?
 

alex_stief

erock said: "I was taking a few peeks at VRM temps via "ipmitool sensor" and don't recall anything past 60C but I was not giving this enough focus. I was wondering about VRMs too. I was actually bench testing (with coolers and fans working) so the parts were not in a case. With SMT disabled and running stress-ng at max CPU load VRM temps range from 43-58C. What is too high?"
That's very low for a full CPU load. Quite suspiciously so. Maybe it was not the right sensor values? The CPU VRMs can and will go above 100°C.
I get up to 60°C with two 7551 CPUs, SMT off, and a water block on the VRMs.

For an easy BIOS update via IPMI, see GitHub: bwachter/supermicro-ipmi-key (generate keys for Supermicro IPMI).
 

RolloZ170

erock said: "The BIOS update process is a bit opaque to me for Supermicro boards. For example, the bios update zip has some poorly documented scripts with crappy examples (at least to me). Can you point me to better guidances and examples?"
With IPMI: you need an OOB license, but you can make your own.

With the BIOS EFI shell: read the "Readme for H11 2.0 AMI BIOS-UEFI.txt" file.
 