It was the NVMe drives. One time I sat by the server watching it die and had the chance to open Event Viewer.
Both Samsung 990 drives had the same issue, so I guess it's a driver or firmware problem (yes, the newest firmware was installed).
After swapping the disks, there were no more issues.
Yes. I had read that it is enabled by default but was unsure, so I wrote the settings to disk and looked at the eccpoll setting in mt86.cfg, which is set to 1 by default. TL;DR: yes, it is.
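For anyone who wants to repeat that check without opening the file by hand, here is a minimal sketch. It assumes the saved mt86.cfg is plain `KEY=VALUE` text with `#` comment lines, and that a missing ECCPOLL key means the default of 1 applies. The helper names and the sample text are made up for illustration and are not part of MemTest86.

```python
def read_cfg(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that have no '=' at all
            settings[key.strip().upper()] = value.strip()
    return settings


def ecc_polling_enabled(settings):
    # Assumption: if ECCPOLL is absent, the default (1, as seen in the post) applies.
    return settings.get("ECCPOLL", "1") == "1"


# Hypothetical excerpt standing in for the real file contents.
sample = """# saved settings (illustrative only)
ECCPOLL=1
"""
print(ecc_polling_enabled(read_cfg(sample)))
```

To run it against the real file, read its contents with `open(path).read()` and pass the text to `read_cfg`.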
OK, I just fiddled around a bit to see if anything is extraordinarily hot. If you look at page 10 here https://www.supermicro.com/manuals/motherboard/EPYC7000/MNL-2314.pdf the heatsink next to LEDSAS is extremely hot.
It was the exhaust fan. With the optimal speed setting it went into "reporting" because it was not spinning fast enough; full speed just yanks up all the fans, and standard speed makes it green again. I don't think overheating is an issue, as the server is hardly under load and temps were always OK when I...
Just wanted to say: I appreciate you, thanks!
OK, that I could test with a BIOS downgrade. I'll let this run the last 40% and then see if it's still there after the downgrade, and test with multi-core again.
Did not check until now. The idea was to rule out ECC issues first and then do a second run with all cores/threads enabled. (Not sure if it's correct that memtest tells me it has found 16 CPUs, or whether it is supposed to count the cores and the threads; it's on my list to look that up.)
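On the "16 CPUs" question, a quick sanity check of the arithmetic, as a sketch only. It assumes a single-socket board with an 8-core CPU and SMT enabled (2 threads per core), and that the tool counts logical processors rather than physical cores. The function name is made up for illustration.

```python
def logical_cpus(sockets, cores_per_socket, threads_per_core):
    """Logical processors = sockets x cores per socket x threads per core."""
    return sockets * cores_per_socket * threads_per_core


# Assumed topology: 1 socket, 8 cores, SMT on (2 threads per core).
print(logical_cpus(1, 8, 2))  # 16
# Same CPU with SMT disabled would show up as 8.
print(logical_cpus(1, 8, 1))  # 8
```

If those assumptions hold, a count of 16 would just be cores plus SMT threads and not a sign that anything is detected wrongly.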
By this you mean the LSI SAS logs and the IPMI health logs, right?
The LSI SAS logs have nothing, and the IPMI health logs show some fan issues (fan 5, plus two reduced-speed entries), but only after the crash; maybe that has something to do with it then dropping to the UEFI shell. Anyway, there is this:
nothing which leads me into any...
I meant it's the same type, but I just looked at HWiNFO64 and they are slightly different: a 7252, and a 7262 in the first build. I thought I had ordered the same one, but the second was a bit more budget-oriented.
The issue with "testing" is that it always crashes on weekends, or at least 5-7 days after a clean boot.
I have no idea what could trigger this. All my "that could be something" candidates, like overheating or disk issues (from the disk itself or the type of mounting/attaching), are off the table.
Two bad CPUs would be possible but unlikely.
The first BIOS is at 2.5, the other at 2.4.
The first has 4x 8 GB of RAM, Samsung M393A1K43DB2-CWE.
The other has 2x 16 GB of RAM, M393A2K43DB3-CWE RDIMM.
Currently memtest86+ pro is running with the ECC check and nothing pops up.
Again the system runs with...