Supermicro H11DSi rev 2.0: system freezes after 8 hours


RolloZ170

Well-Known Member
Apr 24, 2016
7,981
2,502
113
germany
"If that's the case, then why does the system work for 8 hours and not less? After all, if there is a problem with the contact, then the problem should appear faster and not after 8 hours of active work."
ECC corrected-error thresholds are the only idea I have, if you just added RAM that is throwing ECC errors.
"After all, if there is a problem with the contact, then the problem should appear faster and not after 8 hours of active work."
No, I mean the threshold for corrected ECC errors.
Search the BIOS for this setting; maybe you can increase the count or disable it, then check whether the 8-hour interval changes.
 

sko

Active Member
Jun 11, 2021
381
236
43
"Both processors are overclocked"
Oh dear...

"the case cover is removed."
So there's no controlled airflow through the case (and hence over the CPU coolers and memory modules)?

So my bet would also be temperatures...

As was already stated: check for ECC errors! Real operating systems show those in their logs (no idea about Windows); otherwise check the BMC. Supermicro IPMI usually also shows ECC/memory errors (and others that might lead to crashes/freezes) in the health event log.

Usually, with too many ECC errors (or another malfunction that can be detected), the module (or all modules on that channel, if necessary) gets shut down by the BMC, which should trigger a panic and reboot in the OS. The system then runs with the remaining modules (unless that falls below the minimum supported configuration). Not sure whether this behaviour is similar on AMD platforms, though.
 

sko

ipmitool. Or simply look at the sensor readings in the IPMI web interface (where you already checked the health event log, right?).
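For anyone who prefers a script over the web interface, here is a minimal Python sketch that shells out to ipmitool. It assumes ipmitool is installed and can reach the local BMC (that usually needs root). The keyword list in KEYWORDS is a guess, since the wording of event-log entries differs between boards and firmware versions, so treat the filter as a starting point only.

```python
import shutil
import subprocess

# Assumption: these substrings are enough to spot memory-related entries.
# Event-log wording varies by board and firmware, so adjust as needed.
KEYWORDS = ("ecc", "memory", "correctable")


def filter_events(lines, keywords=KEYWORDS):
    """Return the lines that mention any keyword (case-insensitive)."""
    return [ln for ln in lines if any(k in ln.lower() for k in keywords)]


def run_ipmitool(*args):
    """Run ipmitool against the local BMC and return stdout as a list of lines."""
    result = subprocess.run(
        ["ipmitool", *args],
        capture_output=True,
        text=True,
        check=True,
        timeout=30,
    )
    return result.stdout.splitlines()


def main():
    if shutil.which("ipmitool") is None:
        print("ipmitool not found in PATH")
        return
    try:
        # System event log with decoded entries, filtered for memory events.
        for line in filter_events(run_ipmitool("sel", "elist")):
            print(line)
        # Current temperature sensor readings.
        for line in run_ipmitool("sdr", "type", "Temperature"):
            print(line)
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as exc:
        print(f"ipmitool failed: {exc}")


if __name__ == "__main__":
    main()
```

Running it once right after boot and again shortly before the expected freeze gives two snapshots to compare.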
 

RolloZ170

I remember a freeze can also be caused by a BIOS bug: EFI uses memory that is not declared in the table.
Can you try another BIOS version to check?
 

RolloZ170

"If the program does overclocking on the bus, then why did the system work for more than 8 hours with 15 PCs of RAM installed?"
The 15 pcs can handle the (little?) overclock, the others can't.
Just do not overclock and see what happens.

It's hard to debug without a spare mobo, CPUs and RAM.
 

RolloZ170

"With 15 dies, it worked stably, I remember it, when I put 16 dies, then instability appeared, I returned 15 dies back, but stability did not return."
The BIOS doesn't like swapping/adding/removing RAM.
Between config changes you should clear the CMOS to prevent issues.
 

erock

Active Member
Jul 19, 2023
195
40
28
  • Does the freeze occur at exactly 8 hrs or approximately 8 hrs?
  • What CPUs are installed?
  • Are the CPUs new or used?
  • Are you checking for memory errors using rasdaemon (see STH posts for scripts)?
  • What is the load distribution prior to freezing?
  • What do your CPU and VRM temperatures look like prior to the freeze?
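
If rasdaemon is not set up, the kernel's EDAC counters give a rough view of corrected and uncorrected memory errors on Linux. The sketch below is not rasdaemon; it only reads ce_count and ue_count under /sys/devices/system/edac/mc, which assumes an EDAC driver is loaded for the memory controllers. The log file name and the 60-second interval are placeholders. Each record is synced to disk so the last reading before a freeze should survive.

```python
import os
import time
from pathlib import Path

# Standard sysfs location of the Linux EDAC memory-controller counters.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: (corrected, uncorrected)} for each mc* directory."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers without readable counters
        counts[mc.name] = (ce, ue)
    return counts


def append_record(log_path, counts):
    """Append one timestamped line and force it to disk before returning."""
    line = time.strftime("%Y-%m-%d %H:%M:%S") + " " + repr(counts) + "\n"
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(line)
        fh.flush()
        os.fsync(fh.fileno())
    return line


def main(log_path="edac_counts.log", interval_s=60):
    """Poll the counters forever; stop with Ctrl+C."""
    while True:
        append_record(log_path, read_edac_counts())
        time.sleep(interval_s)
```

Call main() in a terminal or from a service and read the tail of the log after the next freeze. A corrected-error count that climbs steadily before the hang would point at memory rather than temperature.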
 

RolloZ170

"If that's the case, then why does the system work for 8 hours and not less?"
I remember a similar issue with Intel ME: on any fatal error (e.g. incompatible PCH/ME versions), the unit turns off after exactly 30 minutes.
Maybe there is a similar mechanism in the EPYC SoC's PSP.
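
Whether a fixed firmware timer is even plausible depends on how consistent the 8 hours really are. Here is a minimal Python sketch for checking that from a few recorded runs. The 2-minute tolerance is an arbitrary assumption, the function names are made up for illustration, and the timestamps have to come from your own notes or logs.

```python
from statistics import mean


def run_lengths_hours(sessions):
    """Convert (boot_time, last_seen_time) datetime pairs to run lengths in hours."""
    return [(end - start).total_seconds() / 3600 for start, end in sessions]


def looks_like_fixed_timer(hours, tolerance_minutes=2.0):
    """True if every run length lies within the tolerance of the mean run length.

    With fewer than two runs there is nothing to compare, so return False.
    """
    if len(hours) < 2:
        return False
    avg = mean(hours)
    return all(abs(h - avg) * 60 <= tolerance_minutes for h in hours)
```

If the runs all end within a minute or two of each other, a timer-style shutdown is worth considering. If they scatter by half an hour or more, heat or marginal memory is the more likely direction.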