I think that the real situation is that 1/3 of the Raptor Lake CPUs manufactured are marginal right from the get-go, and the BIOS updates probably work for some less marginal examples. Others, when put under stress, even with the BIOS fixes, are unstable.
I agree with that.
At work, we have a 13900K that has been running stable 24/7 on an X13SAE-F since last summer. At home, I have a 14900K on an X13SAE that has been running since October 2023. In that time, it crashed four times without any trace in the Linux logs. Four crashes may not sound that bad, but this is a workstation/server running 24/7, and I bought the very expensive W680 board specifically to avoid such problems. Its predecessor, a Xeon X3470 on a server board, ran for 13 years (!) 24/7 without a single crash (obviously with some reboots for Linux kernel updates in between) - that is what I expect from this class of hardware.
To add more information: both machines are set to PL1=PL2=125 W, so there is no excessive thermal stress. Furthermore, once the issues appeared in the press, I reduced the allowed turbo frequencies on both machines. For the 14900K, I set 5.2 GHz for the P-cores and 4.2 GHz for the E-cores. This limits Vcore to just 1.2 V, even under single-core load.
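For anyone who wants to check from within Linux which frequency ceilings are actually in effect, here is a minimal sketch. It assumes the standard cpufreq sysfs layout (`/sys/devices/system/cpu/cpuN/cpufreq/scaling_max_freq`, values in kHz); the function name `max_freq_ghz` is made up for illustration. Whether a limit set in the BIOS shows up in this file depends on the cpufreq driver, so treat the output as a hint, not as proof.

```python
from pathlib import Path


def max_freq_ghz(sysfs_cpu_dir="/sys/devices/system/cpu"):
    """Map each logical CPU name (e.g. 'cpu0') to its cpufreq upper limit in GHz.

    Assumption: the kernel exposes cpufreq policies under
    <sysfs_cpu_dir>/cpuN/cpufreq/scaling_max_freq, with the value in kHz.
    """
    limits = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_max_freq"
    for freq_file in Path(sysfs_cpu_dir).glob(pattern):
        khz = int(freq_file.read_text().strip())
        cpu_name = freq_file.parent.parent.name  # .../cpuN/cpufreq/<file>
        limits[cpu_name] = khz / 1_000_000  # kHz -> GHz
    return limits
```

On a hybrid CPU, the P-cores and E-cores should show up as two groups of logical CPUs with different limits, for example by printing `sorted(max_freq_ghz().items())`.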
However, based on what I have read, the voltage spikes visible through the onboard monitoring are not the only problem; there are also very short spikes that these measurements usually do not catch. Therefore, the 1.2 V is only what I see - it does not mean that there are no short-term peaks.
More observations: the first crash occurred within the first two months, with no limits on frequencies. Then it ran more or less stable with microcode 0x123 (no BIOS update, only the Linux microcode update) and limited frequencies. With 0x125 (again via Linux) it crashed once after some months, and with 0x12b it crashed twice in three weeks (the first time with the Linux microcode, the second after the BIOS update to 4.1). The 13900K at work runs the same OS with the same configuration and got more or less the same microcode updates - no issues so far.
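To correlate crashes with microcode revisions, it helps to log the revision that is actually loaded at each boot. The following is only a minimal sketch: it assumes the x86 Linux `/proc/cpuinfo` format, which lists a `microcode` field for every logical CPU, and the function name `microcode_revisions` is made up for illustration.

```python
from typing import Set


def microcode_revisions(cpuinfo_text: str) -> Set[str]:
    """Return the distinct microcode revisions found in /proc/cpuinfo text.

    Assumption: x86 Linux, where each logical CPU block contains a line
    such as "microcode\t: 0x12b".
    """
    revisions = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "microcode":
            revisions.add(value.strip())
    return revisions
```

Feeding it the contents of `/proc/cpuinfo` (for example via `open("/proc/cpuinfo").read()`) should normally yield a single revision; writing that to a log at boot gives a timeline to compare against the crash dates.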
Therefore, I am starting to consider using Intel's warranty to swap the CPU. From what I have heard, they do not offer to send the replacement processor first, so I will have to find some LGA1700 CPU for the meantime, since this machine has to run. I just hope I get something better than @James C. Owens did with his 14900K, which crashed although it never saw the high voltages used by older microcode versions.
So on the 13900K vs. 14900K question, he and I observed the same behavior, but I wonder whether this really has something to do with 13th vs. 14th generation, since it is the same die in the same version. The only difference is clock speed, but three of my four crashes happened at turbo frequency settings below (!) those of the 13900K. The one this morning was at 5.2/4.2 GHz with PL1=PL2=110 W, and it most likely happened at idle (although this does not rule out some short activity that loaded one or more cores for a while).
Does anybody have a benchmark for Linux that reliably crashes an affected processor?
I already played with mprime, but it did not cause any problems.