AMD EPYC Rome on H12SSL-C: random crashes

julianmh · Jul 10, 2023

Here's the story:
I've got a server with the H12SSL-C mainboard and an AMD EPYC 7262 CPU. It has 4x 8 GB of RAM in 4 channels.
It also has 2x 8 TB Seagate drives as RAID 1 on the SAS controller and 2x Samsung 990 Pro NVMe drives on board.
These shipped without any heatsink. The PSU is a Zippy redundant server PSU, dual 800 W.

It is mostly a file server and does not have much installed on it.
Windows Server 2022 Standard is installed and it runs:
- 1 Hyper-V Windows 10 VM
- Duplicati for encrypted backups
- the file server role
- the default Windows Server Backup (wbadmin)

That's about it.

After a week it crashed without any apparent reason. No log, nothing, just a Kernel-Power event saying that the shutdown was unclean.
I figured the NVMe drives got too hot (69 degrees at idle), so they were moved onto an ASUS PCIe 5.0 card. They now idle at around 37 degrees and go up to 42 under high load.
When it crashes it goes into the UEFI shell and stays there until a cold boot.

Now this server still crashes every 5-7 days with no apparent reason, and still only a Kernel-Power event is logged.

What I tried so far:
- BIOS upgrade from 2.4 to 2.5
- BMC update from 01.01.6 to 01.01.10
- MemTest86 for a night: no issues when the CPUs run in serial, CPU reset bugs when running multithreaded, but according to the MemTest forums this happens often and seems to be unrelated (they state that it is a UEFI issue)
- reinstalled Windows Server from scratch to rule out any system/file issues caused by the overheating crash
- swapped out the NVMe for another 990 Pro
- swapped out the ASUS PCIe 5.0 card
- used another slot on the mainboard for the NVMe
- used other SATA ports for the Seagates
- disabled Duplicati
- disabled wbadmin
- ran Windows Memory Diagnostic for a full night
- ran Prime95 with everything enabled for about 1 or 2 hours, without any issue
- reset the BIOS to optimized defaults (did not change a thing, but you never know)
- today I set SR-IOV and IOMMU to enabled in the BIOS instead of auto, so I need to wait about 5 days to see whether it still crashes

Still crashing and going into the UEFI shell.
The crash times do not seem to correlate with any scheduled task or Duplicati backup job.
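
For anyone who wants to check this kind of correlation more systematically than by eye, here is a minimal Python sketch. It assumes you have already exported the crash times and the job start times by hand; the timestamps, the function name `crashes_near_jobs`, and the 30-minute window are all made up for illustration and are not taken from the real logs.

```python
from datetime import datetime, timedelta

def crashes_near_jobs(crashes, job_starts, window=timedelta(minutes=30)):
    """Return (crash, job) pairs where the crash happened within
    `window` after a job started."""
    hits = []
    for crash in crashes:
        for job in job_starts:
            # Only count crashes at or after the job start, inside the window.
            if timedelta(0) <= crash - job <= window:
                hits.append((crash, job))
    return hits

# Hypothetical example data, NOT the real event log of this server.
crashes = [datetime(2023, 7, 2, 13, 40), datetime(2023, 7, 9, 2, 10)]
job_starts = [datetime(2023, 7, 2, 3, 0), datetime(2023, 7, 9, 2, 0)]

hits = crashes_near_jobs(crashes, job_starts)
for crash, job in hits:
    print(f"crash at {crash} is {crash - job} after job start {job}")
```

An empty result would support the "no correlation" impression; a hit only says the two events were close in time, not that the job caused the crash.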

NOW the fun part.
I ordered a similar system with a slightly different setup:
- different PSU (650 W standard PSU)
- NVMe 980 Pro 256 GB, with heatsinks added right away; idle and full-load temperatures are OK
- 2x 16 GB RAM modules
- there is a Firebird database server running, but it mostly acts as a file server too
- no Windows VM

In common:
- Duplicati
- same RAID 1 with 2x 8 TB Seagate
- wbadmin
- H12SSL mainboard
- same EPYC CPU

This other server now crashes too, but it does not go into the UEFI shell. It just reboots without any log apart from the same Kernel-Power event.
It crashed after 3 days of uptime.

I am now absolutely clueless. I'm lost. Any ideas?

Thanks for any help or direction.
I still hope I'm just not seeing the issue.

scline · Jul 10, 2023

The memory errors when running multithreaded: are they always on the same channel? If you move the sticks around, does the error follow them?

julianmh · Jul 10, 2023

The error in MemTest, when run in parallel, is "UEFI FIRMWARE Error, could not start CPU X". It does not include any memory address or anything else leading back to a bank.

Yesterday I also let Windows Memory Diagnostic run, without any issue.
It feels a bit unlikely that two brand-new builds both come with a memory issue, but of course it is possible.

I did not try removing or moving sticks while running multithreaded memory tests. I only switched to the "one CPU after another" test, and it passed 4 times without any further issue. Example of the MemTest output, as I did not take a picture of it: Memtest show error: [UEFI Firmware Error] Could not start CPU3 - PassMark Support Forums

RolloZ170 · Jul 11, 2023

julianmh said:
Example of the MemTest output, as I did not take a picture of it

An EFI memory table bug can also affect the OS, not only MemTest.
There is a blacklist in MemTest for platforms with those EFI bugs.

Another idea I have: some platforms restart if the maximum (corrected) ECC error threshold is exceeded.
These errors only show up with MemTest86 Pro (free) and ECC polling enabled.
Windows Memory Diagnostic cannot find corrected ECC errors.

rtech · Jul 11, 2023

I had a similar problem until I disabled the watchdog. Supermicro motherboard too.

RolloZ170 · Jul 11, 2023

rtech said:
I had a similar problem until I disabled the watchdog. Supermicro motherboard too.

The watchdog restarts the system because some firmware stopped working normally. I don't think disabling it is a satisfactory solution.

rtech · Jul 11, 2023

RolloZ170 said:
The watchdog restarts the system because some firmware stopped working normally. I don't think disabling it is a satisfactory solution.

I did not have that impression; there were far too many reboots. I use IPFire Linux, a very minimal distro, and without the watchdog I already have 60 days of uptime.

julianmh · Jul 11, 2023

RolloZ170 said:
An EFI memory table bug can also affect the OS, not only MemTest.
There is a blacklist in MemTest for platforms with those EFI bugs.

Another idea I have: some platforms restart if the maximum (corrected) ECC error threshold is exceeded.
These errors only show up with MemTest86 Pro (free) and ECC polling enabled.
Windows Memory Diagnostic cannot find corrected ECC errors.

There is a huge thread in the PassMark forums about systems that do not work, so I'm not so sure that the blacklist covers all the systems with those bugs.

I had a call with Supermicro in the Netherlands today and they said I should definitely disable C-states.
An idle-related issue would explain why the first machine crashes on weekends and why the second system crashed after 2 1/2 days, as it is not really in use.

If that doesn't work I'll buy MemTest86 Pro, as I was unaware that the normal version does not check ECC. Thanks for this info.
But as far as I understand, a whole bunch of ECC errors should be logged in the IPMI health log, shouldn't they?

RolloZ170 · Jul 11, 2023

julianmh said:
If that doesn't work I'll buy MemTest86 Pro, as I was unaware that the normal version does not check ECC. Thanks for this info.
But as far as I understand, a whole bunch of ECC errors should be logged in the IPMI health log, shouldn't they?

Corrected ECC errors must be polled by an application. MemTest does not see what the BMC sees.
I saw a setting to log one entry for every 100 errors or similar, to prevent the log from overfilling.
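
To illustrate what "one entry every 100 errors" means for what you see in the log, here is a minimal Python sketch. It is only a model of the idea; the class name `ThrottledErrorLog` and the message text are hypothetical, and how the BMC firmware really implements this is an assumption.

```python
class ThrottledErrorLog:
    """Toy model: count every corrected error, but write a log
    entry only on every `every`-th one."""

    def __init__(self, every=100):
        self.every = every
        self.count = 0
        self.entries = []

    def record(self):
        self.count += 1
        # A log entry appears only when the counter hits a multiple of `every`.
        if self.count % self.every == 0:
            self.entries.append(f"{self.count} corrected ECC errors so far")

log = ThrottledErrorLog(every=100)
for _ in range(250):  # simulate 250 corrected errors
    log.record()
print(log.entries)
```

With such a setting, 250 corrected errors would leave only two entries, and anything below 100 would leave none at all, so an empty health log would not prove that no corrected errors occurred.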

julianmh · Jul 16, 2023

OK, the server crashed again on a Sunday between 12:00 and 15:00. I need to check the BMC later to see the exact time.
So Global C-state is out of the equation.

julianmh · Jul 16, 2023

Just ordered MemTest86 Pro, will test with that.

RolloZ170 · Jul 16, 2023

julianmh said:
- MemTest86 for a night: no issues when the CPUs run in serial, CPU reset bugs when running multithreaded, but according to the MemTest forums this happens often and seems to be unrelated (they state that it is a UEFI issue)

Sorry, I overlooked this:
If multi-CPU mode triggers UEFI issues in MemTest, the system usually hangs; a reset is not usual, but anyway.
If you have a UEFI issue (and you do), try another BIOS for testing, e.g. BIOS 2.1.

RolloZ170 · Jul 16, 2023

julianmh said:
- same EPYC CPU

Bad CPU?
We don't know what RAM you have: 4x 8 GB or 2x 16 GB.
RDIMM, LRDIMM, vendor?

julianmh · Jul 16, 2023

Two bad CPUs would be possible but unlikely.

The BIOS of the first system is at 2.5, the other at 2.4.
The first has 4x 8 GB of RAM, Samsung M393A1K43DB2-CWE.
The other has 2x 16 GB of RAM, M393A2K43DB3-CWE RDIMM.

Currently MemTest86 Pro is running with the ECC check and nothing pops up.
Again, the system runs with sequential CPU tests.

RolloZ170 · Jul 16, 2023

julianmh said:
Two bad CPUs would be possible but unlikely.

You wrote "- same EPYC CPU".

julianmh said:
The BIOS of the first system is at 2.5, the other at 2.4.

Check 2.1.

julianmh · Jul 16, 2023

RolloZ170 said:
Sorry, I overlooked this:
If multi-CPU mode triggers UEFI issues in MemTest, the system usually hangs; a reset is not usual, but anyway.
If you have a UEFI issue (and you do), try another BIOS for testing, e.g. BIOS 2.1.

The issue with "testing" is that it always crashes on weekends, or at least 5-7 days after a clean boot.
I've no idea what could trigger this. All my candidates for "that could be it", like overheating or disk issues (by disk or by type of mounting/attaching), are off the table.
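
One way to separate "always on weekends" from "always after 5-7 days of uptime" would be to list, for every crash, the weekday and the uptime since the previous boot. A minimal Python sketch of that bookkeeping follows; the boot and crash times are hypothetical placeholders, not values from the real logs, and `summarize` is just an illustrative helper name.

```python
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def summarize(runs):
    """For each (boot, crash) pair return (weekday of crash, uptime in days)."""
    rows = []
    for boot, crash in runs:
        uptime_days = (crash - boot).total_seconds() / 86400
        rows.append((DAY_NAMES[crash.weekday()], round(uptime_days, 1)))
    return rows

# Hypothetical (boot time, crash time) pairs, NOT taken from the real logs.
runs = [
    (datetime(2023, 7, 3, 9, 0), datetime(2023, 7, 9, 13, 30)),
    (datetime(2023, 7, 10, 8, 0), datetime(2023, 7, 16, 14, 25)),
]

rows = summarize(runs)
for day, uptime in rows:
    print(f"crashed on {day} after {uptime} days of uptime")
```

If the machine were deliberately rebooted mid-week, the two explanations would predict different crash days, which this table would make visible.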

RolloZ170 · Jul 16, 2023

Checked the BIOS logs?

julianmh · Jul 16, 2023

RolloZ170 said:
You wrote "- same EPYC CPU".

Check 2.1.

I meant it's the same type, but I just looked at HWiNFO64 and they are slightly different: a 7252, and a 7262 in the first build. I thought I had ordered the same one, but the second build had a slightly tighter budget.

julianmh · Jul 16, 2023

RolloZ170 said:
Checked the BIOS logs?

By this you mean the LSI SAS logs and the IPMI health logs, right?

The LSI SAS logs have nothing, and the IPMI health log has some low-speed fan issues with fan 5 and fan 2, but only after the crash; maybe that has something to do with it then going into the UEFI shell.
Nothing before that, or even close to it, leads me in any direction. Anyway, there is this:

2023-07-16 14:25:39 | OS Stop/Shutdown | [SYS-0067] Runtime critical stop (a.k.a. core dump, blue screen) - Assertion | Sensor-specific

RolloZ170 · Jul 16, 2023

MemTest86 Pro must be able to work with all 8 cores.
