Xeon Platinum P-8124 & ASROCK EPC621D8A

Aug 28, 2019
99
44
18
Just updated all packages, updated to the 5.19 kernel, and added the microcode package for it. Rebooted and the 8275CL is still working. Testing with the P-8124 next.
 
Aug 28, 2019
99
44
18
Just a few random ideas; you may have tried them all already. On your Proxmox install, try these things while it is stable with the 8275CLs, before testing the 8124s:

1. Update the Proxmox install to all newest packages.
2. Install the new 5.19 kernel.
3. Install the Intel microcode package for Debian.

If it keeps crashing, post the kernel dump log with backtraces for review.
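For step 3, it's worth confirming after the reboot that the updated microcode was actually applied. A minimal, read-only check (standard Linux procfs, nothing Proxmox-specific):

Bash:
```shell
#!/bin/sh
# Print the microcode revision the kernel reports as active. If the Debian
# intel-microcode package took effect, this value should change after reboot.
rev=$(awk '/^microcode/ { print $3; exit }' /proc/cpuinfo)
echo "active microcode revision: ${rev:-unknown}" | tee /tmp/ucode_rev.txt
```

`dmesg | grep microcode` will also show whether an early update was applied during boot.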
After updating the BIOS with custom microcode and putting the P-8124 back in, it seems to still be failing. The issues I get are the same as before: the SEL logs on IPMI report Critical Interrupts "Bus Correctable Error - Asserted" and "Processor - Configuration Error - Asserted", which results in a hard crash. I can try to upload the BIOS file I'm using.
 

jasonsansone

Member
Sep 15, 2020
32
10
8
After updating the BIOS with custom microcode and putting the P-8124 back in, it seems to still be failing. The issues I get are the same as before: the SEL logs on IPMI report Critical Interrupts "Bus Correctable Error - Asserted" and "Processor - Configuration Error - Asserted", which results in a hard crash. I can try to upload the BIOS file I'm using.
That isn’t the crash information from Debian but what is reported by the BMC. You should be able to view the dump with the remote console, but here is an alternate method.
 
Aug 28, 2019
99
44
18
So I'll load in the Bronze CPU I have to pull the crash logs from Debian. And yeah, I just pulled that in case it's useful. Just trying to find the Bronze chip I had.
 
Aug 28, 2019
99
44
18
That isn’t the crash information from Debian but what is reported by the BMC. You should be able to view the dump with the remote console, but here is an alternate method.
So it didn't get crash logs of any kind. I cleared the syslog in /var/log/ and then ran the P-8124. It crashed while loading the local pve partition, right before the login screen showed up. No logs were stored on crash, and syslog only shows the Bronze chip booting up. Any recommendations?

Going to try installing Debian on its own and then Proxmox on top, the older way. Not a fan, but going to see if that gives me any luck.
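One hedged suggestion for the missing logs: since the install is systemd-based (Debian/Proxmox are), forcing the journal to persistent storage sometimes preserves the last messages before a hard crash, readable afterwards with `journalctl -b -1`. No guarantee anything gets flushed before a hard lockup, but it costs nothing to try. A minimal fragment for /etc/systemd/journald.conf:

Bash:
```ini
# /etc/systemd/journald.conf -- keep logs on disk across reboots so the
# previous (crashed) boot can be inspected with: journalctl -b -1
[Journal]
Storage=persistent
```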
 
Aug 28, 2019
99
44
18
So after installing Debian alone, then installing Proxmox on top and fixing some mistakes during the install, it all works. So I guess the default Proxmox installer will not work with this chip; you need to install Proxmox's PVE packages after installing Debian standalone... damn. Now I need to learn how to automate this process.
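The Debian-first route can be scripted. A rough sketch following Proxmox's documented "Install Proxmox VE on Debian" procedure; the release name (bullseye here) and key filename are assumptions that must match your Debian version:

Bash:
```shell
#!/bin/sh
# Generate the Proxmox package repo entry (written to /tmp here for review;
# the real file belongs in /etc/apt/sources.list.d/).
cat <<'EOF' > /tmp/pve-install-repo.list
deb http://download.proxmox.com/debian/pve bullseye pve-no-subscription
EOF
cat /tmp/pve-install-repo.list

# Remaining steps, run as root on the target box (not executed here):
#   wget https://enterprise.proxmox.com/debian/proxmox-release-bullseye.gpg \
#     -O /etc/apt/trusted.gpg.d/proxmox-release-bullseye.gpg
#   apt update && apt full-upgrade
#   apt install proxmox-ve postfix open-iscsi
```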
 
Aug 28, 2019
99
44
18
The last error I get; it somehow shows up even when no PCIe slots are populated and all I have installed is the CPU, RAM, motherboard, and PSU. Ethernet is plugged into a single port plus the dedicated IPMI port...

 

Stephan

Well-Known Member
Apr 21, 2017
531
329
63
Germany
Just a cautionary note about all those cheap 81xx 82xx Xeons:

There is literally a ton of them out there on eBay for cheap, but I think 50% or so are defective in one way or another. The failure modes vary. I have a hunch they were ejected from the datacenter for a reason. Sometimes the board they sat on might have been defective and not the CPU, but the datacenter didn't bother to check, so the entire system gets dumped.

Chips from the west coast of the USA all seem to come from a recycler and are badly dinged on the corners from their own weight when thrown into a bucket. The LGA 3647 socket itself is delicate. Frankly, if you get PCIe errors like that, something is very, very wrong. If it doesn't show up with your Bronze CPU, that hints towards the CPU. Proxmox should not require anything special like installing Debian raw and then Proxmox on top instead of just using their ISO. Also, there shouldn't have been any bugs resolved in microcode for maybe two years now; if anything, the new workarounds only introduce MORE bugs.

If the CPU isn't the reason: I once had to RMA an ASRock Rack 3647 board because of a DIMM socket signal integrity problem, so the board can be an issue too. Prepare to swap around a lot of gear to figure it out.

What to do... When I test new hardware, the first thing I do is run MemTest86 Pro by PassMark for a few hours. It is very good at detecting RAM and board errors because the commercial version appears to know about the CPU's IMC and can flag fishy errors. Then I boot from a USB stick that also starts rasdaemon, and run "nice -n19 stress-ng --vm $(nproc) --vm-bytes 86% --vm-keep --vm-populate --vm-madvise willneed --verify -v -t 4h --tz --perf" while simultaneously tailing the system log to watch for rasdaemon errors. This will surface errors in the CPU/RAM/board complex that Memtest might not have been able to trigger. To test PCIe, I recommend installing a ConnectX-3 with two ports into a slot and configuring them for loopback in Linux using namespaces.

Server:
Bash:
#!/bin/sh

echo Setup
# Two namespaces so traffic between the card's own two ports leaves the
# network stack and actually crosses the PCIe slot and the cable.
ip netns add ns_server
ip netns add ns_client

# First port: server side, pinned to 56 Gbps FDR
ip link set enp1s0 netns ns_server
ip netns exec ns_server ip addr add dev enp1s0 192.168.1.1/24
ip netns exec ns_server ip link set dev enp1s0 up
ip netns exec ns_server ethtool -s enp1s0 speed 56000 autoneg off

# Second port: client side
ip link set enp1s0d1 netns ns_client
ip netns exec ns_client ip addr add dev enp1s0d1 192.168.1.2/24
ip netns exec ns_client ip link set dev enp1s0d1 up
ip netns exec ns_client ethtool -s enp1s0d1 speed 56000 autoneg off

# Blocks here until the iperf server is interrupted; teardown runs after.
ip netns exec ns_server iperf -s -B 192.168.1.1 -w 16M

echo ""
echo Teardown

killall iperf
killall bwm-ng
sleep 0.5

ip netns del ns_server
ip netns del ns_client

echo Done

exit 0
Client:
Bash:
#!/bin/sh

# Hammer the link with back-to-back 5-minute iperf runs.
while :; do
    ip netns exec ns_client iperf -c 192.168.1.1 -B 192.168.1.2 -P 2 -w 16M -t 300
    sleep 0.1
done
Here I am using a 50cm 56 Gbps-capable "FDR" Mellanox cable with a suitable card, an Oracle 7046442 rev A3/A4/A5, flashed to a generic MCX354A-FCBT with a custom Lenovo 2.42.5032 firmware mashup. The idea is to really stress the PCIe slot the card sits in by creating traffic between its two physical interfaces. The card has to be cooled with some airflow. The mlx4 driver reports "63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)", so if you max out the 56 Gbps link speed, PCIe 3.0 x8 will be close to its limit: in theory about 7880 MByte/s per direction for the slot vs. 7000 MByte/s per direction for a single 56 Gbps link. The traffic has to pass through PCIe both in and out, so both directions of the x8 link should be quite busy.
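A quick sanity check of those numbers; the only assumption is PCIe 3.0's 128b/130b line encoding:

Bash:
```shell
#!/bin/sh
# PCIe 3.0: 8.0 GT/s per lane, 128b/130b encoding, 8 lanes.
link_mbps=$(( 8 * 1000 * 128 / 130 * 8 ))      # usable Mbit/s across x8
pcie_mbytes=$(( link_mbps / 8 ))               # MB/s per direction
fdr_mbytes=$(( 56 * 1000 / 8 ))                # one 56 Gbps FDR link
{
  echo "PCIe 3.0 x8: ${link_mbps} Mbit/s usable (~${pcie_mbytes} MB/s per direction)"
  echo "FDR link:    ${fdr_mbytes} MB/s per direction"
} | tee /tmp/pcie_math.txt
```

The 63008 Mbit/s result matches the "63.008 Gb/s" the mlx4 driver reports exactly.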

See if one of the slots craps out...
 
  • Like
Reactions: jasonsansone

RolloZ170

Well-Known Member
Apr 24, 2016
1,938
502
113
55
Frankly if you get PCIe errors like that, something is very very wrong. If it doesn't show up with your Bronze CPU, it hints towards the CPU
This is specifically an issue with the OEM B1 stepping, not regular SKUs.
B1 is no longer B0 but still far from the H0 stepping, and there is no support from Intel because OEM parts run special code, as usual.
So unfortunately we should treat the P-8124/P-8136 as ES chips going forward.
My Chinese supplier told me that these B1 SKUs always have PCIe errors and many more issues if the board/OS doesn't have special support for this stepping.
 

jasonsansone

Member
Sep 15, 2020
32
10
8
UPS and USPS claim my motherboards and processors should deliver today. Hopefully I’ll soon be able to join in the testing and report.
 
  • Like
Reactions: RolloZ170

RolloZ170

Well-Known Member
Apr 24, 2016
1,938
502
113
55
Just a cautionary note about all those cheap 81xx 82xx Xeons:
There is literally a ton of them out there on ebay for cheap, but I think 50% or so are defective in one way or another.
Pay attention to the batch code of the processors. I got these 8124Ms for $150 USD each.
2x 8124M sth 75.jpg
L039 = manufacturing date: week 39 of 2020
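For reference, the batch prefix can be split mechanically, assuming the common Intel convention of site letter, last digit of the year, then two-digit work week. The decade has to come from context; a single digit cannot encode it, so the 0 here is read as 2020:

Bash:
```shell
#!/bin/sh
# Split an Intel batch code prefix like "L039" into its parts.
# The decade (202x) is an assumption -- only the last year digit is encoded.
batch="L039"
site=$(printf '%s' "$batch" | cut -c1)
year="202$(printf '%s' "$batch" | cut -c2)"
week=$(printf '%s' "$batch" | cut -c3-4)
echo "site=$site year=$year week=$week" | tee /tmp/batch_decode.txt
```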
 

jasonsansone

Member
Sep 15, 2020
32
10
8
Pay attention to the batch code of the processors. I got these 8124Ms for $150 USD each.
View attachment 24543
L039 = manufacturing date: week 39 of 2020
Did some Google and eBay image searching. I couldn't find any P-8136 manufactured outside of 2016; I did find P-8124s from 2016 and 2017. The P-8136 appears to have been manufactured exclusively in 2016. That makes sense given it was in production prior to the H0 stepping and the official release of SKX. Unlike the 8124M, there was no reason to manufacture the P-8124 or P-8136 after H0 was released.
 
  • Like
Reactions: RolloZ170

RolloZ170

Well-Known Member
Apr 24, 2016
1,938
502
113
55
I couldn't find any P-8136 manufactured outside of 2016. I did find P-8124 for 2016 and 2017. P-8136 appears to have been manufactured exclusively in 2016.
Got a Platinum P-8124 from iamnypz with temperature problems; some cores went up to 107C even without turbo active.
After delidding (special thanks to gb00s for providing the WS 3647), the issue was apparent: totally dried-out thermal paste.
The surface is too big for me to use liquid metal; it runs everywhere, and you can't see under the IHS.
Re-TIMed with MX-5, I got temps of 65-70C under CBR20 load.
 

jasonsansone

Member
Sep 15, 2020
32
10
8
Got a Platinum P-8124 from iamnypz with temperature problems; some cores went up to 107C even without turbo active.
After delidding (special thanks to gb00s for providing the WS 3647), the issue was apparent: totally dried-out thermal paste.
The surface is too big for me to use liquid metal; it runs everywhere, and you can't see under the IHS.
Re-TIMed with MX-5, I got temps of 65-70C under CBR20 load.
I will keep an eye out for temps if I can get everything else working... I'm less worried about crusty old thermal paste and more worried about general compatibility / stability.

The last error I get; it somehow shows up even when no PCIe slots are populated and all I have installed is the CPU, RAM, motherboard, and PSU. Ethernet is plugged into a single port plus the dedicated IPMI port...
Apparently I have too much free time today as I wait on my boards and chips...

The screenshot shows errors on two PCIe devices: 8086:37c0 (PCI bridge) and 8086:2031 (Sky Lake-E PCI Express Root Port B).

Those devices are present on lots of boards with the C621 chipset, including many Supermicro motherboards, but interestingly they are not present on the X11SPL-F. What does that mean for me? Who knows!
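On a running board, those two IDs can be picked out of `lspci -nn` directly. Below, a couple of sample lines stand in for real output (the bus addresses are made up); on real hardware, drop the here-doc and pipe lspci itself into the grep:

Bash:
```shell
#!/bin/sh
# Filter lspci -nn output for the two device IDs from the screenshot.
# Sample lines only -- addresses and wording stand in for a real listing.
cat <<'EOF' > /tmp/lspci_sample.txt
00:00.0 PCI bridge [0604]: Intel Corporation Device [8086:37c0]
64:00.0 PCI bridge [0604]: Intel Corporation Sky Lake-E PCI Express Root Port B [8086:2031]
EOF
grep -E '8086:(2031|37c0)' /tmp/lspci_sample.txt
```

`lspci -nn -d 8086:2031 -vv` (the -d filter takes vendor:device) then dumps the full capability list, including AER status registers, for just that device.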
 

RolloZ170

Well-Known Member
Apr 24, 2016
1,938
502
113
55
Those devices are present on lots of boards with the C621 chipset, including many Supermicro motherboards, but interestingly are not present on X11SPL-F. What does that mean for me? Who knows!
The Sky Lake-E PCI Express Root Port is located in the processor itself.
 

jasonsansone

Member
Sep 15, 2020
32
10
8
Good News: CPUs arrived. Manufacturing dates are week 1 of 2017 (2x are from the identical batch) and week 45 of 2016. Condition is grade A; they almost look like spares. There aren't any scratches or marks from a cooler, and zero old thermal paste anywhere, including in the two vent holes on the IHS that are impossible to fully clean out... they look like brand-new chips.

Bad News: Swapped the motherboard in the first chassis. The X10DRi-LN4+ is a massive EE-ATX board that extended closer to the PDU, while the new X11SPL-F is an ATX board, so the 24-pin ATX power cable isn't long enough. I won't be making any further progress today.
 
Aug 28, 2019
99
44
18
So I got the PCIe errors sorted; the system reported them, but after cleaning off the pads it's fine. I also had a bad RAM stick, which didn't help. Running the P-8124 under 100% load, it hit 77C, so that's fine. Sadly, installing using the Proxmox ISO always fails; something either missing from the ISO, or present in it, causes an instant crash, since the system will hard crash even during the install. So I'll just run them at 100% load on CPU and RAM for 7 days to see how it goes.
 
  • Like
Reactions: jasonsansone
Aug 28, 2019
99
44
18
I also found that of the 7x P-8124 chips I have with me at the moment, 2x of them instantly hit 107C under 100% load, so it looks like I'll be delidding some chips. Any recommendations or tips for the LGA 3647 platform are appreciated.
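For watching core temps during load without extra packages, the kernel's coretemp driver exposes everything under sysfs. A small poll sketch (prints nothing if no hwmon sensors are visible):

Bash:
```shell
#!/bin/sh
# Dump every hwmon temperature once; temp*_input values are millidegrees C.
count=0
for t in /sys/class/hwmon/hwmon*/temp*_input; do
    [ -r "$t" ] || continue
    name=$(cat "$(dirname "$t")/name" 2>/dev/null || echo hwmon)
    printf '%s %s: %d C\n' "$name" "$(basename "$t" _input)" \
        "$(( $(cat "$t") / 1000 ))"
    count=$((count + 1))
done
echo "$count" > /tmp/hwmon_sensor_count
```

Run it in a loop (e.g. under `watch -n2`) alongside the stress test; `sensors` from lm-sensors shows the same data with friendlier labels.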
 

jasonsansone

Member
Sep 15, 2020
32
10
8
So I got the PCIe errors sorted; the system reported them, but after cleaning off the pads it's fine. I also had a bad RAM stick, which didn't help. Running the P-8124 under 100% load, it hit 77C, so that's fine. Sadly, installing using the Proxmox ISO always fails; something either missing from the ISO, or present in it, causes an instant crash, since the system will hard crash even during the install. So I'll just run them at 100% load on CPU and RAM for 7 days to see how it goes.
Giving me hope! Great progress!
 
  • Like
Reactions: Randomer Naught