AMD Milan 7763 stuck at 400 MHz during heavy IO load


Sergiu

New Member
Jul 11, 2019
12
2
1
I have a Gigabyte server with 2 x AMD Milan 7763 CPUs and 22 x Micron 9300 Pro 15.36TB drives, directly attached, running Ubuntu 20.04 with Linux kernel 5.4, and I have been puzzled for some time by this strange behavior: under pure CPU load it goes up to 3.25 GHz on all cores, which is great. However, whenever I throw a large IO load at it using the fio tool, it throttles down to 400 MHz and stays there for as long as the test is running. The behavior is also inconsistent: sometimes it shows up on every run, and sometimes after a restart it does not appear at all. I have also upgraded to kernel version 5.11 and observed the same behavior there, though it seemed to happen more rarely.

The OS is running with following parameters: transparent_hugepage=never pcie_aspm=off nvme.io_poll=1 nvme.io_poll_delay=0 processor.max_cstate=1.
There is no sign of overheating: the CPUs stay at around 50 degrees, and during the event power usage as reported by IPMI is minimal. I have checked and can confirm that this is not some fluke in frequency reporting. The CPU is indeed running at 400 MHz: during this time, threaded load tests take about 8 times longer, and IPMI server power stays at around 550W instead of the ~1.1kW that is typical for full load. The CPU governor is set to performance, but that has no effect. Any suggestions on what I can try?
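For anyone wanting to catch this in the act, here is a minimal sketch (not something from the original post) that reads per-core frequencies from the Linux cpufreq sysfs interface and lists cores sitting at or below a threshold. It assumes a cpufreq driver is loaded so that `scaling_cur_freq` exists; the function names and the 500 MHz threshold are my own illustrative choices. Note that sysfs values can differ from the real clock, so a timing test like the one described above is still the stronger confirmation.

```python
from pathlib import Path


def read_core_freqs_mhz(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_name: MHz} from cpufreq's scaling_cur_freq (reported in kHz)."""
    freqs = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_cur_freq"
    for freq_file in sorted(Path(sysfs_root).glob(pattern)):
        try:
            khz = int(freq_file.read_text().strip())
        except (OSError, ValueError):
            continue  # core offline or file unreadable: skip it
        # freq_file is <root>/cpuN/cpufreq/scaling_cur_freq, so go up two levels
        freqs[freq_file.parent.parent.name] = khz / 1000.0
    return freqs


def stuck_cores(freqs, threshold_mhz=500.0):
    """Names of cores at or below threshold_mhz (hypothetical cutoff)."""
    return sorted(name for name, mhz in freqs.items() if mhz <= threshold_mhz)


if __name__ == "__main__":
    freqs = read_core_freqs_mhz()
    low = stuck_cores(freqs)
    print(f"{len(low)} of {len(freqs)} cores at or below 500 MHz")
```

Running this in a loop while fio is active, and again under a pure CPU load, would show whether all cores drop together or only some of them.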
 

gsrcrxsi

Active Member
Dec 12, 2018
293
96
28
Did you ever figure this out? Disheartening to see no replies.

My 7443P just started doing this under high load as well (BOINC, TN-Grid and Universe@home). It seems to happen intermittently after many hours of running. So far the only way to bring back the performance and normal clocks is to reboot the system.

AsRock Rack ROMED8-2T w/ 3.20 BIOS
128GB (8x16) DDR4-3200
Ubuntu 22.04 w/5.15 kernel

I’ve run this system for quite a long time with BOINC loads, and this was never a problem until recently. Wondering if I should explore an RMA on the CPU or look at something else, settings- or configuration-wise.

I have several other EPYC systems, but all the others are Rome based and don’t have this problem running the same software and OS.
 

Tristana

New Member
Jan 25, 2022
5
0
1
Nearly the same problem with a Supermicro server. One of the EPYC 7763s cannot boost and stays at 400MHz.
It was normal until the BMC firmware and BIOS were updated.
7763 400MHz.jpg
 

gsrcrxsi

Active Member
Dec 12, 2018
293
96
28
Actually, as an update on my issue specifically: I updated my BIOS from 3.2 to 3.4 and my issue went away. Maybe it was some BIOS corruption. I did see a bunch of CPU over-voltage alarms in the IPMI logs. The over-voltage condition probably threw the CPU into some kind of safe mode. I'm unsure if it was a real event or just transient errors in the readings, but for me upgrading the BIOS seems to have fixed it.
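If anyone else wants to check their IPMI event log for the same kind of voltage alarms, a rough sketch follows. It assumes the log was dumped as text with `ipmitool sel elist` and that each line is pipe-separated with the sensor description in the fourth field; the exact layout and sensor names vary by BMC vendor, so treat the parsing and the function name `voltage_events` as assumptions.

```python
def voltage_events(sel_text):
    """Return SEL entries (as lists of fields) whose sensor field starts with 'Voltage'.

    Assumes pipe-separated lines: id | date | time | sensor | event | direction.
    """
    events = []
    for line in sel_text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        # Skip blank or short lines that do not match the assumed layout
        if len(fields) >= 4 and fields[3].lower().startswith("voltage"):
            events.append(fields)
    return events
```

One way to use it would be to save the log with `ipmitool sel elist > sel.txt`, then call `voltage_events(open("sel.txt").read())` and look at the timestamps relative to when the clocks dropped.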