Here's the story:
I've got a server with the h122sl-c mainboard and an AMD EPYC 7262 as CPU. It has 4x8 GB of RAM in 4 lanes.
It also has 2x 8TB Seagate drives as RAID 1 on the SAS controller and 2x Samsung 990 Pro as NVMe on board.
The NVMe drives got shipped without any heatsink. The PSU is a Zippy redundant server PSU, dual 800 watts.
It is mostly a file server and does not have much installed on it.
Windows Server 2022 Standard is installed and it runs:
- 1 Hyper-V Windows 10 VM
- duplicati as encrypted backup
- acts as a file server
- Windows default server backup (wbadmin)
That's about it.
After a week it crashed without any apparent reason. No log, no nothing, just a kernel power event saying that it was shut down uncleanly.
I figured the NVMe drives got too hot, 69 degrees at idle, so they were moved onto an ASUS PCIe 5.0 card and now idle at around 37 degrees and go up to 42 under high load.
When it crashes, it goes into the UEFI shell and stays there until a cold boot.
Now this server still crashes every 5-7 days with no apparent reason, and still only a kernel power event is logged.
What I tried so far:
- BIOS upgrade from 2.4 to 2.5
- bcm update from 01.01.6 to 01.01.10
- memtest86 for a night: no issues when the CPUs run in serial; CPU reset bugs when running multithreaded, but according to the memtest forums this happens often and seems to be unrelated (they state that it is a UEFI issue)
- reinstalled Windows Server from scratch to rule out any system / file issues caused by the overheating crash
- swapped out the NVMe for another 990 Pro
- swapped out the ASUS PCIe 5.0 card
- used another slot on the mainboard for the NVMe
- used other SATA slots for the Seagates
- disabled duplicati
- disabled wbadmin
- Windows Memory Diagnostic ran for a full night
- Prime95 ran with everything enabled for about 1 or 2 hours, without any issues
- reset BIOS to optimized defaults (did not change a thing, but you never know)
- today (so I need to wait about 5 days for the next crash) I enabled SR-IOV and IOMMU in the BIOS instead of leaving them on auto
It is still crashing and going into the UEFI shell.
The timing of the crashes does not seem to correlate with any scheduled task or duplicati backup task.
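For anyone who wants to double-check that kind of correlation, here is a minimal Python sketch. It assumes the crash times (from the kernel power events) and the task start times have been copied out by hand; the function names and all timestamps below are made-up placeholders, not real data from these servers.

```python
from datetime import datetime, timedelta

def nearest_gap(crash, task_times):
    """Smallest absolute time difference between one crash and any task start."""
    return min(abs(crash - t) for t in task_times)

def crashes_near_tasks(crashes, task_times, window=timedelta(minutes=30)):
    """Return the crashes that happened within `window` of some task start."""
    return [c for c in crashes if nearest_gap(c, task_times) <= window]

# Placeholder timestamps for illustration only (hypothetical values).
crashes = [datetime(2024, 1, 7, 14, 12), datetime(2024, 1, 13, 3, 5)]
tasks = [datetime(2024, 1, 7, 2, 0), datetime(2024, 1, 13, 3, 0)]

for crash in crashes_near_tasks(crashes, tasks):
    print("crash close to a task start:", crash)
```

If most crashes land inside the window, a task is a suspect; if almost none do, as seems to be the case here, the schedule is probably not the trigger. The 30 minute window is an arbitrary choice.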
NOW the fun part.
I ordered a similar system with a slightly different setup:
- different PSU (650 watts, normal PSU)
- NVMe 980 Pro 256GB, with heatsinks added right away; idle and full load temperatures are OK
- 2x 16 GB RAM modules
- there is a Firebird database server running, but it mostly acts as a file server too
- no Windows VM
In common:
- duplicati
- same RAID 1 with 2x 8TB Seagate
- wbadmin
- h122ssl mainboard
- same EPYC CPU
This other server now crashes too, but it does not go into the UEFI shell; it just reboots without any log, but with the same kernel power event.
It has now crashed after 3 days of uptime.
I am now absolutely clueless. I'm lost. Any ideas?
Thanks for any help or direction.
I still hope I am just not seeing the issue.