AMD EPYC Rome on H12SSL-C: random crashes


julianmh

New Member
Jul 10, 2023
23
2
3
Here's the story:
I've got a server with the H12SSL-C mainboard and an AMD EPYC 7262 CPU. It has 4x 8 GB of RAM in 4 channels.
It also has 2x 8 TB Seagate drives as RAID 1 on the SAS controller and 2x Samsung 990 Pro NVMe drives on board.
The NVMe drives shipped without any heatsink. The PSU is a Zippy redundant server PSU, dual 800 W.

It is mostly a file server and does not have much installed on it.
Windows Server 2022 Standard is installed and it runs:
- 1 Hyper-V Windows 10 VM
- Duplicati for encrypted backups
- the file server role
- Windows Server Backup (wbadmin)

That's about it.

After a week it crashed without any apparent reason. No log, nothing, just a Kernel-Power event saying the shutdown was unclean.
I figured the NVMe drives got too hot (69 °C at idle), so they were moved onto an ASUS PCIe 5.0 card; they now idle at around 37 °C and go up to 42 °C under high load.
When it crashes, it drops into the UEFI shell and stays there until a cold boot.

Now this server still crashes every 5-7 days with no apparent reason, and still only a Kernel-Power event is logged.
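
For anyone wanting to pull the Kernel-Power entries mentioned above out of the System log in one go, here is a minimal sketch. It assumes the events are the usual Event ID 41 from the `Microsoft-Windows-Kernel-Power` provider (the post only says "kernel power event") and uses the stock `wevtutil` tool; the function names are made up for illustration.

```python
import subprocess

# XPath filter for Kernel-Power Event ID 41 in the System log.
# Assumption: the "kernel power event" in the post is this event ID.
KERNEL_POWER_41 = (
    "*[System[Provider[@Name='Microsoft-Windows-Kernel-Power'] and (EventID=41)]]"
)


def build_query(max_events=20):
    """Return the wevtutil command as an argument list, newest events first."""
    return [
        "wevtutil", "qe", "System",
        f"/q:{KERNEL_POWER_41}",
        f"/c:{max_events}",
        "/rd:true",   # reverse direction: most recent event first
        "/f:text",    # human-readable text instead of XML
    ]


def run_query(max_events=20):
    """Run the query on a Windows host and return the raw text output."""
    result = subprocess.run(
        build_query(max_events), capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling `run_query()` in an elevated prompt on the server should list the timestamps of the unclean shutdowns, which can then be compared against the BMC log.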

What I have tried so far:
- BIOS upgrade from 2.4 to 2.5
- BMC update from 01.01.6 to 01.01.10
- memtest86 overnight: no issues when the CPUs run sequentially; CPU reset bugs when running multithreaded, but according to the memtest forums this happens often and seems to be unrelated (they state it is a UEFI issue)
- reinstalled Windows Server from scratch to rule out any system/file issues caused by the overheating crash
- swapped out the NVMe for another 990 Pro
- swapped out the ASUS PCIe 5.0 card
- used another slot on the mainboard for the NVMe
- used other SATA ports for the Seagates
- disabled Duplicati
- disabled wbadmin
- ran Windows Memory Diagnostic for a full night
- ran Prime95 with everything enabled for about 1 or 2 hours, without any errors
- reset the BIOS to optimized defaults (did not change a thing, but you never know)
- today (so I need to wait about 5 days for the next crash) I set SR-IOV and IOMMU in the BIOS to enabled instead of auto

It is still crashing and going into the UEFI shell.
The timing of the crashes does not seem to correlate with any scheduled task or Duplicati backup job.
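
A quick way to check the "every 5-7 days" and weekend pattern is to list the weekday of each crash and the gap to the previous one. The sketch below is only an illustration: the function name is made up, and apart from the 2023-07-16 14:25:39 entry (which appears in the BMC log later in this thread) any timestamps fed into it would have to come from the real event log.

```python
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def crash_summary(timestamps):
    """Return (weekday, days since previous crash) for each crash, oldest first.

    `timestamps` are strings in "YYYY-MM-DD HH:MM:SS" format; the gap for the
    first crash is None because there is nothing to compare it with.
    """
    times = sorted(datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in timestamps)
    summary = []
    previous = None
    for current in times:
        gap = None
        if previous is not None:
            gap = round((current - previous).total_seconds() / 86400, 1)
        summary.append((WEEKDAYS[current.weekday()], gap))
        previous = current
    return summary


# Only this one timestamp is from the thread; add the others from the event log.
print(crash_summary(["2023-07-16 14:25:39"]))
```

If most rows come out as Saturday or Sunday with gaps of 5 to 7 days, that supports the idle/weekend theory more than a link to any scheduled job.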

NOW the fun part.
I ordered a similar system with a slightly different setup:
- a different PSU (a regular 650 W PSU)
- NVMe 980 Pro 256 GB, fitted with heatsinks right away; idle and full-load temperatures are OK
- 2x 16 GB RAM modules
- a Firebird database server is running on it, but it mostly acts as a file server too
- no Windows VM

In common:
- Duplicati
- same RAID 1 with 2x 8 TB Seagate
- wbadmin
- H12SSL mainboard
- same EPYC CPU

This other server now crashes too, but it does not go into the UEFI shell; it just reboots without any log except the same Kernel-Power event.
It has now crashed after 3 days of uptime.

I am now absolutely clueless. I'm lost. Any ideas?

Thanks for any help or direction.
I still hope I am just not seeing the issue.
 

scline

Member
Apr 7, 2016
91
32
18
35
The memory errors when running multithreaded: are they always on the same channel? If you move the sticks around, does the error follow them?
 

julianmh

The error in memtest when run in parallel is "UEFI FIRMWARE Error, could not start CPU X". It does not include any memory address or anything else leading back to a bank.

Yesterday I also let Windows Memory Diagnostic run, without any issue.
It feels a bit unlikely that two brand-new builds both come with a memory issue, but of course it is possible.

I did not try removing or moving sticks while running the multithreaded memtest; I only switched to the "one CPU after another" test mode, and it passed 4 times without any further issue. Example of the memtest output, as I did not take a picture of it: Memtest show error: [UEFI Firmware Error] Could not start CPU3 - PassMark Support Forums
 

RolloZ170

Well-Known Member
Apr 24, 2016
4,394
1,243
113
56
julianmh said: "Example for memtest output, as i did not make a picture of it"
An EFI memory table bug can also affect the OS, not only memtest.
There is a blacklist in memtest for platforms with those EFI bugs.

Another idea I have: some platforms restart if the maximum threshold of (corrected) ECC errors is exceeded.
These errors only show up with memtest86 pro(free) and ECC polling enabled.
Windows Memory Diagnostic cannot find corrected ECC errors.
 

rtech

Active Member
Jun 2, 2021
274
100
43
I had a similar problem until I disabled the watchdog. Supermicro motherboard too.
 

rtech

Quote: "the watchdog restarts because some FW stopped normal working. I don't think it's a satisfactory solution."
I did not have that impression; there were far too many reboots. I am using IPFire Linux, a very minimal distro, without the watchdog, and I already have 60 days of uptime.
 

julianmh

RolloZ170 said:
"a EFI mem. table bug can also effect the OS, not only memtest.
there is a blacklist in memtest for those EFI bug platforms.

another idea i have: some platforms restart if the max. (corrected)ECC errors treshold is exceeded.
these Errors only show with memtest86 pro(free) and ECC-Polling enabled.
windows memory diagnostic can not find corrected ECC errors."

There is a huge thread in the PassMark forums about systems that do not work, so I'm not so sure the blacklist covers all the systems that have those bugs.

I had a call with Supermicro in the Netherlands today and they said I should definitely disable C-states.
It being an idle issue would explain why the first machine crashes on weekends and why the second system crashed after 2 1/2 days, as it is not really in use.

If that doesn't work I'll buy memtest86 Pro, as I was unaware that the normal version does not check ECC. Thanks for this info.
But as far as I understand, a whole bunch of ECC errors should be logged in the IPMI health log, shouldn't they?
 

RolloZ170

julianmh said:
"If that won't work i'll buy a mem86 pro, as i was unaware that the normal version does not check ECC. Thanks for this info.
But as far as i understand a whole bunch of ECC errors should be logged by the ipmi health log shouldnt it?"

Corrected ECC errors must be polled by an application. memtest does not see what the BMC sees.
I saw a setting to log one entry per 100 errors or similar, to prevent the log from filling up.
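
To illustrate why such a setting can leave the health log empty even though corrected errors occur, here is a toy model. The "one entry per 100 errors" figure is taken from the post above ("or similar"); the function name and the simple integer-division behaviour are assumptions, not the BMC's documented logic.

```python
def logged_entries(corrected_errors, errors_per_entry=100):
    """Entries written if one log entry is recorded per `errors_per_entry` errors.

    Toy model only: real firmware may also reset its counter over time.
    """
    if errors_per_entry < 1:
        raise ValueError("errors_per_entry must be at least 1")
    return corrected_errors // errors_per_entry


# Up to 99 corrected errors would leave no trace in the log under this model.
for count in (0, 99, 100, 250):
    print(count, "corrected errors ->", logged_entries(count), "log entries")
```

Under this model an empty health log does not rule out a low rate of corrected ECC errors; it only rules out bursts large enough to cross the threshold.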
 

julianmh

OK, the server crashed again, on a Sunday between 12:00 and 15:00; I need to check the BMC later to see the exact time.
So the global C-state setting is out of the equation.
 

RolloZ170

julianmh said: "- memtest86 a night, no issues when cpu run in serial, cpu reset bugs when running multithreaded but according to the memtest forums this happens often and seems to be unrelated (they state that is an uefi issue)"
Sorry, I overlooked this:
If multi-CPU mode triggers UEFI issues in memtest, the system usually hangs; a reset is not usual, but anyway.
If you have a UEFI issue (and you do), try another BIOS, e.g. BIOS 2.1, for testing.
 

julianmh

Two bad CPUs would be possible but unlikely.

The first system's BIOS is at 2.5, the other's at 2.4.
The first has 4x 8 GB of RAM, Samsung M393A1K43DB2-CWE.
The other has 2x 16 GB of RAM, M393A2K43DB3-CWE RDIMM.

Currently memtest86 Pro is running with the ECC check enabled and nothing pops up.
Again, the system is running the tests with the CPUs in sequential mode.
 

julianmh

RolloZ170 said:
"sorry i have overseem this:
if multi CPU makes memtest UEFI issues the system hangs, reset is not usual but anyway.
if you have a UEFI issue (and you have) try other BIOS, BIOS 2.1 i.e. for testing"

The issue with "testing" is that it always crashes on weekends, or at least 5-7 days after a clean boot.
I have no idea what could trigger this. All my candidates for "that could be a possible cause", like overheating or disk issues (from the disk itself or the way it is mounted/attached), are off the table.
 

julianmh

Quote:
"you wrote ' - same epyc cpu '

check 2.1"

I meant it is the same type, but I just looked at HWiNFO64 and they are slightly different: a 7252 here, versus the 7262 in the first build. I thought I had ordered the same one, but the second was a bit more budget-oriented.
 

julianmh

Quote: "checked the BIOS logs ?"
By this you mean the LSI SAS logs and the IPMI health logs, right?

The LSI SAS logs have nothing, and the IPMI health log has some fan issues with fan 5 and two lower-speed entries, but only after the crash; maybe that has something to do with the system then going into the UEFI shell. Nothing before the crash, or even close to it, points me in any direction. Anyway, there is this:
2023-07-16 14:25:39  OS Stop/Shutdown  [SYS-0067] Runtime critical stop (a.k.a. core dump, blue screen) - Assertion  Sensor-specific
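
When the health log is copied out of the BMC web interface, the columns tend to run together as in the entry above. The following is a rough sketch for splitting such a line back into fields so that entries can be sorted or compared with the Windows event times. The field layout (timestamp, sensor, bracketed code, description) is inferred from this single line, and the names are made up for illustration.

```python
import re
from datetime import datetime

# Assumed layout: timestamp, sensor name, [event code], free-text description.
# Whitespace between the fields is optional because copied logs often lose it.
HEALTH_LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s*"
    r"(?P<sensor>[^\[]+?)\s*"
    r"\[(?P<code>[A-Z]+-\d+)\]\s*"
    r"(?P<description>.+)$"
)

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def parse_health_log_line(line):
    """Split one health-log line into a dict, or return None if it does not match."""
    match = HEALTH_LOG_LINE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    when = datetime.strptime(entry["timestamp"], "%Y-%m-%d %H:%M:%S")
    entry["weekday"] = WEEKDAYS[when.weekday()]
    return entry


entry = parse_health_log_line(
    "2023-07-16 14:25:39OS Stop/Shutdown[SYS-0067] Runtime critical stop "
    "(a.k.a. core dump, blue screen) - AssertionSensor-specific"
)
print(entry["weekday"], entry["sensor"], entry["code"])
```

The weekday field makes it easy to confirm that this entry falls on the Sunday crash reported earlier in the thread.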