
GIGABYTE MS03-CE0 + Intel Xeon Emerald Rapids EMR-SP


custom90gt

Active Member
Nov 17, 2016
358
135
43
41
I'll be doing this same upgrade when my Q2SR gets here. I'm super excited for this build. What is the bios mod that you needed to do?
 

JosefHrib

Active Member
Jul 25, 2023
127
115
43
40
I'll be doing this same upgrade when my Q2SR gets here. I'm super excited for this build. What is the bios mod that you needed to do?
@RolloZ170 helped me. The mod disables ACM. A second variant exists with ACM enabled, but it must be an older version.
EMR is a nice upgrade from SPR. Too bad it can't run on the ASUS W790 SAGE/ACE motherboards. But if you don't have a problem with C741 boards, this is an opportunity to get a very good CPU for little money.
 
  • Like
Reactions: custom90gt

mkgai

New Member
Apr 26, 2025
1
1
3
@RolloZ170 helped me. The mod disables ACM. A second variant exists with ACM enabled, but it must be an older version.
EMR is a nice upgrade from SPR. Too bad it can't run on the ASUS W790 SAGE/ACE motherboards. But if you don't have a problem with C741 boards, this is an opportunity to get a very good CPU for little money.
Do you happen to have a copy of those instructions to disable acm?

I have a Gigabyte MS03-CE0 and a Q2SR A0. I saw your contributions to this community and realized I am in the right place - thanks for the great info.
 
  • Like
Reactions: RolloZ170

Andrix

New Member
Mar 15, 2025
16
11
3
What is the sustained all-core AVX512 frequency for Q2SR?

I am experimenting with QYFS + MS03-CE0, and it is running at 1.8 GHz in LINPACK benchmarks. The "Turbo Ratio Limits - AVX-512" value in HWiNFO screenshots is 23x for 56 cores, but apparently that is just the hard limit and not what is sustained. Anyway, I am trying to figure out how large the performance gain would be if I replaced QYFS with Q2SR. Has anyone looked at the core clocks during an AVX512 workload?
 

RolloZ170

Well-Known Member
Apr 24, 2016
10,079
3,221
113
germany
What is the sustained all-core AVX512 frequency for Q2SR?
2400 MHz
Code:
Turbo Ratio Limits - IA/SSE, Fused:    40x (1-32c), 34x (33-48c), 27x (49-58c), 26x (59-64c)
Turbo Ratio Limits - IA/SSE, Resolved:    40x (1-32c), 34x (33-48c), 27x (49-58c), 26x (59-64c)
Turbo Ratio Limits - AVX2, Fused:    38x (1-32c), 32x (33-48c), 26x (49-54c), 25x (55-64c)
Turbo Ratio Limits - AVX2, Resolved:    38x (1-32c), 32x (33-48c), 26x (49-54c), 25x (55-64c)
Turbo Ratio Limits - AVX-512, Fused:    35x (1-32c), 29x (33-48c), 25x (49-54c), 24x (55-64c)
Turbo Ratio Limits - AVX-512, Resolved:    35x (1-32c), 29x (33-48c), 25x (49-54c), 24x (55-64c)
Turbo Ratio Limits - TMUL, Fused:    35x (1-32c), 29x (33-48c), 23x (49-54c), 22x (55-64c)
Turbo Ratio Limits - TMUL, Resolved:    35x (1-32c), 29x (33-48c), 23x (49-54c), 22x (55-64c)
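For anyone wondering how the 2400 MHz figure falls out of that listing, here is a minimal Python sketch that reads one HWiNFO-style line and looks up the ceiling for a given number of active cores. It assumes a 100 MHz base clock (not stated in the listing itself), and the function names are made up for illustration. The result is only the turbo ceiling, not a sustained clock.

```python
import re

BCLK_MHZ = 100  # assumption: 100 MHz base clock, not stated in the HWiNFO listing

def parse_turbo_limits(line):
    """Parse one 'Turbo Ratio Limits' line into (ratio, first_core, last_core) tuples."""
    return [(int(ratio), int(first), int(last))
            for ratio, first, last in re.findall(r"(\d+)x \((\d+)-(\d+)c\)", line)]

def turbo_limit_mhz(line, active_cores):
    """Return the turbo ceiling in MHz for the given number of active cores."""
    for ratio, first, last in parse_turbo_limits(line):
        if first <= active_cores <= last:
            return ratio * BCLK_MHZ
    raise ValueError(f"no turbo bin covers {active_cores} active cores")

q2sr_avx512 = ("Turbo Ratio Limits - AVX-512, Resolved:    "
               "35x (1-32c), 29x (33-48c), 25x (49-54c), 24x (55-64c)")

print(turbo_limit_mhz(q2sr_avx512, 64))  # all 64 cores active
print(turbo_limit_mhz(q2sr_avx512, 32))  # half of the cores active
```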
 

Andrix

New Member
Mar 15, 2025
16
11
3
2400 MHz
Code:
Turbo Ratio Limits - AVX-512, Fused:    35x (1-32c), 29x (33-48c), 25x (49-54c), 24x (55-64c)
Turbo Ratio Limits - AVX-512, Resolved:    35x (1-32c), 29x (33-48c), 25x (49-54c), 24x (55-64c)
Thanks, but have you actually verified that the core clock reaches 2400 MHz in an AVX512 workload? It doesn't work this way for QYFS, at least not with the standard power limit settings. A HWiNFO screenshot shows
Code:
Turbo Ratio Limits - AVX-512, Fused:    35x (1-28c), 29x (29-42c), 24x (43-50c), 23x (51-56c)
Turbo Ratio Limits - AVX-512, Resolved:    35x (1-28c), 29x (29-42c), 24x (43-50c), 23x (51-56c)
The turbo limit should be 2.3 GHz, whereas I see the following during a linpack benchmark run:
Code:
$ grep MHz /proc/cpuinfo | sort -n -k 4
cpu MHz        : 1739.738
cpu MHz        : 1797.958
cpu MHz        : 1797.960
...
cpu MHz        : 1798.842
cpu MHz        : 1799.059
cpu MHz        : 1799.105
cpu MHz        : 1800.179
cpu MHz        : 2573.705
cpu MHz        : 2800.000
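Eyeballing a sorted list works, but the few idle or boosted cores at either end can be misleading. A small Python sketch, assuming /proc/cpuinfo-style text as input, that reports the minimum, median and maximum instead (the sample lines are taken from the output above; the function names are hypothetical):

```python
import statistics

def core_clocks_mhz(cpuinfo_text):
    """Extract every 'cpu MHz' value from /proc/cpuinfo-style text."""
    clocks = []
    for line in cpuinfo_text.splitlines():
        if line.startswith("cpu MHz"):
            clocks.append(float(line.split(":", 1)[1]))
    return clocks

def summarize(clocks):
    """Return (min, median, max); the median is robust against a few outlier cores."""
    return min(clocks), statistics.median(clocks), max(clocks)

# A few lines copied from the run above; on Linux you would pass
# open("/proc/cpuinfo").read() instead of this sample.
sample = """cpu MHz        : 1739.738
cpu MHz        : 1797.958
cpu MHz        : 1799.105
cpu MHz        : 1800.179
cpu MHz        : 2800.000
"""

lo, med, hi = summarize(core_clocks_mhz(sample))
print(f"min {lo:.0f} MHz, median {med:.0f} MHz, max {hi:.0f} MHz")
```

As noted elsewhere in the thread, clock readings taken under heavy AVX512 load may themselves be unreliable, so treat the numbers as indicative.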
 

RolloZ170

Well-Known Member
Apr 24, 2016
10,079
3,221
113
germany
The turbo limit should be 2.3 GHz, whereas I see the following during a linpack benchmark run:
That limit applies only when all cores run AVX512, which rarely happens. During a heavy AVX512 workload there is no headroom left for querying the core clock, though.
If I run AIDA benchmarks with heavy AVX512, I can't do anything else; HWiNFO doesn't respond. Querying the core clock goes through a mailbox, and if the code doesn't wait for completion (usually it doesn't), you get wrong values.
 

Andrix

New Member
Mar 15, 2025
16
11
3
As a follow-up to the above conversation, I got my hands on a Q2SR (many thanks to @RZSN for arranging access to his hardware) to run Intel optimized benchmarks. I used the linpack and mp_linpack benchmarks, which are the shared-memory and distributed Intel implementations of the High-Performance LINPACK (HPL) benchmark. For those unfamiliar with HPL, it gives you an idea of the FP64 peak performance in compute-bound tasks.

Anyways, here is a summary for all-cores runs.
CPU           linpack (GFLOPS)   mp_linpack (GFLOPS)   AVX512 freq. (GHz)   Base freq. (GHz)
Q2SR (64c)    3800               4000                  1.9-2.2              1.7
QYFS (56c)    2740               3070                  1.8                  1.9
8480+ (56c)   2920               3120                  1.9                  2.0

My main takeaway is that Q2SR offers an additional 30+% of peak performance compared to QYFS. Before this, I was trying to guess the gain by comparing the turbo frequencies (lack of information) and Cinebench data (abundant but not that informative for me, also suggesting a 13% gain).
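The "30+%" can be checked directly from the table. A short Python sketch that computes the relative gain for each benchmark; the dictionary layout and function name are only for illustration:

```python
# GFLOPS values copied from the summary table above.
results = {
    "Q2SR":  {"linpack": 3800, "mp_linpack": 4000},
    "QYFS":  {"linpack": 2740, "mp_linpack": 3070},
    "8480+": {"linpack": 2920, "mp_linpack": 3120},
}

def gain_percent(new, old, bench):
    """Relative gain of CPU `new` over CPU `old` on one benchmark, in percent."""
    return (results[new][bench] / results[old][bench] - 1) * 100

for old in ("QYFS", "8480+"):
    for bench in ("linpack", "mp_linpack"):
        print(f"Q2SR vs {old:5s} {bench:10s}: +{gain_percent('Q2SR', old, bench):.1f}%")
```

Against QYFS this comes out at roughly 39% for linpack and 30% for mp_linpack.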

Getting back to the question about all-core AVX512 frequencies, Q2SR runs at 2.1 GHz (occasionally 2.2 GHz spotted) at the beginning of a longer workload and drops to 2.0 GHz (occasionally 1.9 GHz spotted) under continuous load. The linpack numbers in the table were recorded towards the end of a 20-minute run. A linpack benchmark started after some short idle time would peak at 3940 GFLOPS. The performance degradation in longer runs could be related to power-limit settings (PL1 time set to 128 s as discussed in this thread) and temperatures, but I don't entirely understand how the power limits work. The CPU power consumption (the turbostat reading) initially reaches 380W, stays there for a while, and then drops to 350W. If anyone can explain this, your comment would be very welcome.

I can post more technical details (possibly in a separate thread?) if anyone is interested.
 

RolloZ170

Well-Known Member
Apr 24, 2016
10,079
3,221
113
germany
The CPU power consumption (the turbostat reading) initially reaches 380W staying like this for a while and then reduces to 350W. If anyone can explain this, your comment would be very welcome.
The power limit for Xeons is strictly TDP. TDP can be exceeded for a limited time (max. 448 s).
With PL1 Time Window = 1 s you will have 380W for one second.
 
  • Like
Reactions: DHamov

Andrix

New Member
Mar 15, 2025
16
11
3
Well, here is what I saw:
linpack-pl1.png
The workload consists of 4 smaller runs about 40s each. The PL1 time window is set to 128s, and I sketched what I believe was the first one. The average power was 350W (TDP). I assume the next window should start right after those 128s. But then those 380W should have lasted a bit longer in the second window. Does the temperature enter the conversation here or is it completely irrelevant?
 
  • Like
Reactions: DHamov

RolloZ170

Well-Known Member
Apr 24, 2016
10,079
3,221
113
germany
The workload consists of 4 smaller runs about 40s each. The PL1 time window is set to 128s, and I sketched what I believe was the first one. The average power was 350W (TDP). I assume the next window should start right after those 128s. But then those 380W should have lasted a bit longer in the second window. Does the temperature enter the conversation here or is it completely irrelevant?
The PL1 time window (e.g. 128 s) is not a one-shot budget that is spent forever. After some recovery (tracked by an internal calculator), the time can start again, or a fraction of it.
If temperature were the deciding factor, you could run 380W forever with very good cooling, but that is not the case.
Edit: the timer starts once the TDP limit is reached.
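One way to picture the "recovery" is a toy model in which the limiter tracks an exponentially weighted running average of package power and allows the 380W burst only while that average is below TDP. This is an assumption made for illustration, not something confirmed in this thread, and the real firmware logic may differ. The 350W, 380W and 128 s values come from the posts above; the starting averages are hypothetical.

```python
def seconds_above_tdp(start_avg_w, pl1_w=350.0, burst_w=380.0, tau_s=128.0, dt=0.1):
    """Toy model: how long the package can draw burst_w before the running
    average of power climbs from start_avg_w up to pl1_w (TDP)."""
    avg, t = start_avg_w, 0.0
    while avg < pl1_w:
        # Exponentially weighted running average with time constant tau_s
        # (assumed model of the "internal calculator").
        avg += (burst_w - avg) * dt / tau_s
        t += dt
    return t

# Hypothetical starting averages: long idle, short pause, no pause at all.
for start in (100.0, 300.0, 350.0):
    print(f"start avg {start:.0f} W -> {seconds_above_tdp(start):.0f} s at 380 W")
```

In this model a short pause between runs lowers the average only a little, so the next burst is shorter than the first, which is at least qualitatively in line with the behaviour described above.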
 
  • Like
Reactions: DHamov and Andrix

sam55todd

Active Member
May 11, 2023
217
68
28
That limit applies only when all cores run AVX512, which rarely happens.
I'm not sure if it even has AVX512 for all x64 cores, as far as I know normally Intel CPUs (most Xeons) have two AVX512 execution units per CPU (hardly per tile/chiplet)

edit: specification for 8592+ clearly states:
# of AVX-512 FMA Units = 2
 
  • Like
Reactions: DHamov

RolloZ170

Well-Known Member
Apr 24, 2016
10,079
3,221
113
germany

RolloZ170

Well-Known Member
Apr 24, 2016
10,079
3,221
113
germany
Indeed, thanks for the link.
"Intel Xeon Scalable processors have two FMA units per core to combine multiplication and addition into a single operation and accelerate computation speeds."
normally Intel CPUs (most Xeons) have two AVX512 execution units per CPU
Note that there may be two AVX512 FMA units per core, but all cores have the AVX512 instruction set.
 

RZSN

New Member
Mar 4, 2023
25
7
3
As I was informed, we did not see a drop in the GFLOPS rating when HT was disabled.
So even 1 thread on 1 core can fully utilize the vector unit(s) present there.
 

Andrix

New Member
Mar 15, 2025
16
11
3
As I was informed, we did not see a drop in the GFLOPS rating when HT was disabled.
So even 1 thread on 1 core can fully utilize the vector unit(s) present there.
Thanks for the reminder. I saw even a slight gain (~50 GFLOPS) after disabling hyperthreads.

Note that there may be two AVX512 FMA units per core, but all cores have the AVX512 instruction set.
Yeah, FMA is quite important: each fused multiply-add counts as two floating-point operations, so two AVX512 FMA units with 8 FP64 lanes each give 32 FLOPs per cycle per core.
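The 32 FLOPs per cycle per core figure makes it easy to estimate a theoretical FP64 peak and compare it with the linpack numbers posted earlier. A rough Python sketch follows; the sustained clocks are taken from the summary table, with 2.0 GHz picked for Q2SR as an assumption since the observed range was 1.9-2.2 GHz.

```python
FMA_UNITS = 2       # AVX512 FMA units per core
FP64_LANES = 8      # 512-bit register / 64-bit doubles
FLOPS_PER_FMA = 2   # one multiply plus one add

def peak_gflops(cores, ghz):
    """Theoretical FP64 peak in GFLOPS at a given all-core clock."""
    return cores * ghz * FMA_UNITS * FP64_LANES * FLOPS_PER_FMA

# (name, cores, assumed sustained AVX512 clock in GHz, measured linpack GFLOPS)
measured = [
    ("Q2SR", 64, 2.0, 3800),
    ("QYFS", 56, 1.8, 2740),
    ("8480+", 56, 1.9, 2920),
]

for name, cores, ghz, gflops in measured:
    peak = peak_gflops(cores, ghz)
    print(f"{name:6s} peak {peak:7.1f} GFLOPS, measured {gflops}, "
          f"efficiency {gflops / peak:.0%}")
```

Because the efficiency depends directly on the clock you plug in, the percentages are only as good as the frequency readings.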
 

GadflyII

New Member
Aug 14, 2025
7
3
3
Just wanted to say thank you for this thread, it has been really informative and helpful.

If AVX512 and AMX work on my two Q2SR's, I will be extremely happy.
 

Andrix

New Member
Mar 15, 2025
16
11
3
Unless you receive faulty CPUs, AVX512 and AMX should be working on your Q2SR. AMX wasn't my focus, but back in June I ran the HPL-AI version of the mp_linpack benchmark (see Overview of the Intel® Optimized HPL-AI* Benchmark for a reference). Instead of doing everything in FP64, HPL-AI performs a linear solve in bfloat16 and then iteratively refines the result in FP64. Bfloat16 is where the calculation benefits from AMX.

Here is the condensed output for Q2SR:
Code:
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WC00R2R2       80000  1536     2     1               7.18            4.75480e+04
WC00R2R2       80000  1536     2     1               7.17            4.75755e+04
WC00R2R2       80000  1536     2     1               7.20            4.74291e+04
WC00R2R2      120000  1536     2     1              22.20            5.18854e+04
WC00R2R2      160000  1536     2     1              49.83            5.48036e+04
WC00R2R2      200000  1536     2     1              96.42            5.53152e+04
WC00R2R2      200000  1536     2     1              95.54            5.58245e+04
The same for QYFS:
Code:
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WC00R2R2       80000  1536     1     1              10.47            3.26018e+04
WC00R2R2       80000  1536     1     1              10.30            3.31345e+04
WC00R2R2       80000  1536     1     1              10.67            3.19953e+04
WC00R2R2      120000  1536     1     1              36.01            3.19959e+04
WC00R2R2      120000  1536     1     1              33.63            3.42528e+04
WC00R2R2      120000  1536     1     1              33.93            3.39565e+04
WC00R2R2      160000  1536     1     1              72.98            3.74185e+04
WC00R2R2      160000  1536     1     1              74.39            3.67090e+04
WC00R2R2      160000  1536     1     1              76.39            3.57455e+04
And the same for 8480+:
Code:
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WC00R2R2       80000  1536     1     1               9.87            3.45798e+04
WC00R2R2       80000  1536     1     1               9.94            3.43259e+04
WC00R2R2      120000  1536     1     1              29.18            3.94775e+04
WC00R2R2      120000  1536     1     1              30.00            3.83982e+04
WC00R2R2      160000  1536     1     1              67.36            4.05389e+04
WC00R2R2      160000  1536     1     1              66.23            4.12296e+04
Intel shows 32-37 TFLOPS on 8480+ in smaller single-socket Bfloat16/AMX workloads (see fig. 6, Improve AI Efficiency, Scalability, and Performance with Intel AMX...). My numbers roughly match theirs, and EMR beats SPR once again. I guess this shows that the AMX units are present and usable in Q2SR.
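To compare runs like these without reading the tables by eye, a small Python sketch can pull the best Gflops per problem size N out of the condensed output. It assumes the seven-column layout shown above; the function name is made up, and the sample rows are copied from the Q2SR and QYFS listings.

```python
def best_gflops_by_n(hpl_output):
    """Return {N: best Gflops} from condensed HPL-style result rows."""
    best = {}
    for line in hpl_output.splitlines():
        fields = line.split()
        # Result rows have 7 columns and start with the T/V code, e.g. WC00R2R2.
        if len(fields) == 7 and fields[0].startswith("W"):
            n, gflops = int(fields[1]), float(fields[6])
            best[n] = max(best.get(n, 0.0), gflops)
    return best

q2sr = """WC00R2R2      160000  1536     2     1              49.83            5.48036e+04
WC00R2R2      200000  1536     2     1              96.42            5.53152e+04
WC00R2R2      200000  1536     2     1              95.54            5.58245e+04"""

qyfs = """WC00R2R2      160000  1536     1     1              72.98            3.74185e+04
WC00R2R2      160000  1536     1     1              74.39            3.67090e+04"""

n = 160000
ratio = best_gflops_by_n(q2sr)[n] / best_gflops_by_n(qyfs)[n]
print(f"N={n}: Q2SR is {ratio:.2f}x QYFS "
      f"({best_gflops_by_n(q2sr)[n] / 1000:.1f} vs {best_gflops_by_n(qyfs)[n] / 1000:.1f} TFLOPS)")
```

Keep in mind that the Q2SR runs used a 2x1 process grid while the others used 1x1, so the comparison is not perfectly like for like.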

Once your CPUs arrive, just run the benchmarks. Be it HPL-AI or BF16GEMM (as shown by Intel), you roughly know what numbers to expect.