Sapphire / Emerald Rapids - Memory bandwidth & PCIe Root complex Discussion


RolloZ170

Well-Known Member
Apr 24, 2016
6,144
1,880
113
Dedicated thread, because this is a very important discussion; it continues from here.

If anyone has run PassMark Memory Mark or the AIDA64 memory & cache benchmark, please report the results here,
with full details, i.e. motherboard and memory population.
AIDA64 reports only multi-threaded bandwidth.
PassMark Memory Mark also measures single-thread memory bandwidth.

also check this:
 

Civiloid

Member
Jan 15, 2024
68
47
18
Switzerland
I intend to run Linux on my machines (once I get everything and assemble them). I guess the best equivalent for memory bandwidth would be STREAM (with parameters), but the only memory latency tool I know of is the Chips & Cheese latency tool. Would it be better to collect those results as well (for those who run Linux), or to stick to the directly comparable PassMark and AIDA64 results on Windows only?
 

RolloZ170

Passmark Performance Test Memory Mark:
CPU: Intel Xeon Platinum 8461V Q16Z (EV-QS)
motherboard: Gigabyte MS33-AR0
RAM: 4x 16GB 1Rx8 RDIMM DDR5-4800
populated as mentioned in the manual.

memory Threaded: 131727
memory Read-cached: 21253
[Attached screenshot: MemMark_all sth.jpg]

Passmark Performance Test Memory Mark:
CPU: Q077 ES1 D0 ( like 6430 )
motherboard: Gigabyte MS33-AR0
RAM: 4x 16GB 1Rx8 RDIMM DDR5-4800
populated as mentioned in the manual.

This is the MCC monolithic die, not the 4-tile chiplet design.
memory Threaded: 117113
memory Read-cached: 24431
[Attached screenshots: Unbenannt006_MemMark sth.jpg, Unbenannt001 sth.jpg]
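
To put the "memory Threaded" figure in context, here is a minimal sketch that compares it with the theoretical DDR5 peak. It assumes the 8461V board runs one DIMM per channel (4 channels populated), that the DIMMs actually train at 4800 MT/s, that each channel moves 8 bytes per transfer, and that PassMark's MB equals 1,000,000 bytes. None of these are confirmed in the post, and the function name is just illustrative.

```python
def peak_bandwidth_mb_s(channels: int, mt_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak in MB/s (1 MB = 1,000,000 bytes).

    MT/s times bytes per transfer gives MB/s directly.
    """
    return float(channels * mt_per_s * bus_bytes)


# Assumption: 4 populated channels at DDR5-4800 (first result in the post above).
peak = peak_bandwidth_mb_s(channels=4, mt_per_s=4800)

# PassMark "memory Threaded" for the 8461V, assumed to be in MB/s.
measured = 131727.0

efficiency = measured / peak
print(f"peak {peak:.0f} MB/s, measured {measured:.0f} MB/s, "
      f"efficiency {efficiency:.1%}")
```

The same function can be reused for other populations by changing `channels` and `mt_per_s`; if the memory trains at a lower speed than the DIMM rating, the efficiency figure will look worse than it really is.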
 

RolloZ170

Would it be better to collect those results as well (for those who run Linux), or to stick to the directly comparable PassMark and AIDA64 results on Windows only?
Great, go on!
Linux tools are welcome here; they should not carry any Windows ballast.
 

twin_savage

Member
Jan 26, 2018
75
40
18
34
I intend to run Linux on my machines (once I get everything and assemble them). I guess the best equivalent for memory bandwidth would be STREAM (with parameters), but the only memory latency tool I know of is the Chips & Cheese latency tool. Would it be better to collect those results as well (for those who run Linux), or to stick to the directly comparable PassMark and AIDA64 results on Windows only?
The Intel MLC tool is also built for Linux and can measure latency, and its read test on Linux produces numbers identical to AIDA64's read results on Windows.
 

twin_savage

This thread's topic is very interesting to me because my primary compute need is solving very large systems of equations that scale very strongly with memory bandwidth and latency and take hundreds to thousands of hours to solve on current SPR systems.

WRT the original post that spawned this thread, I've also noticed very non-ideal scaling as more DDR channels, or even sockets, are added. This is because the cores themselves cannot sustain the theoretical speeds: for example, STREAM Triad on SPR-HBM reaches "only" ~700 GB/s, while the theoretical peak is 1638.4 GB/s. John McCalpin has an excellent write-up on this from ISC 2023 in which he explains how limited core concurrency causes this imperfect scaling.
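
The concurrency argument can be sketched with Little's Law (data in flight = bandwidth x latency). This is only a rough illustration, not McCalpin's analysis: the 100 ns latency and the 16 outstanding cache-line misses per core are placeholder assumptions chosen for the arithmetic, not measured or documented values for SPR-HBM.

```python
CACHE_LINE_BYTES = 64


def lines_in_flight(bandwidth_gb_s: float, latency_ns: float) -> float:
    """Cache lines that must be in flight to sustain a bandwidth (Little's Law).

    GB/s times ns gives bytes, since 1e9 * 1e-9 == 1.
    """
    return bandwidth_gb_s * latency_ns / CACHE_LINE_BYTES


def max_bandwidth_gb_s(outstanding_lines: int, latency_ns: float) -> float:
    """Bandwidth ceiling for a fixed number of outstanding cache-line misses."""
    return outstanding_lines * CACHE_LINE_BYTES / latency_ns


THEORETICAL_GB_S = 1638.4    # from the post
MEASURED_GB_S = 700.0        # STREAM Triad figure from the post (approximate)
ASSUMED_LATENCY_NS = 100.0   # assumption, not from the post
ASSUMED_MISSES_PER_CORE = 16 # assumption, hypothetical per-core limit

fraction = MEASURED_GB_S / THEORETICAL_GB_S
needed = lines_in_flight(THEORETICAL_GB_S, ASSUMED_LATENCY_NS)
per_core = max_bandwidth_gb_s(ASSUMED_MISSES_PER_CORE, ASSUMED_LATENCY_NS)

print(f"Triad reaches {fraction:.1%} of the theoretical peak")
print(f"lines in flight needed for the peak: {needed:.0f}")
print(f"per-core ceiling with {ASSUMED_MISSES_PER_CORE} misses: {per_core:.2f} GB/s")
```

Under these assumptions, the number of cores times the per-core miss limit has to reach the "lines in flight" figure before the memory itself becomes the bottleneck. Hardware prefetchers add concurrency beyond demand misses, so real per-core numbers can be higher than this simple ceiling.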
 

RolloZ170

Here's my AIDA64 results, I should be able to run Passmark Perf test on the system soonish:
Nice. Note that AIDA64 does not care about channel interleaving; the results are just added up.
I tested on the LGA3647 platform with 6 and 5 RDIMMs; the results were 120 (6) and 100 (5).
That's odd, because interleaving is not supported with 5 RDIMMs.
For that reason AIDA64 is less helpful; we need single-thread bandwidth.
 

bayleyw

Active Member
Jan 8, 2014
328
108
43
what system ? no single threaded bandwidth ?
1S 8461V on X13SEI, three vendors' worth of mixed DIMMs at 4800 MT/s.

MT benchmark:
Code:
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:          194959.4     0.035496     0.035296     0.035869
Scale:         195096.5     0.035439     0.035271     0.035944
Add:           202271.9     0.051210     0.051030     0.052332
Triad:         202911.8     0.051176     0.050869     0.055849
1T benchmark:
Code:
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           22199.0     0.310183     0.309982     0.310637
Scale:          15316.9     0.449589     0.449262     0.450031
Add:            15655.1     0.660191     0.659333     0.661318
Triad:          15630.4     0.661298     0.660376     0.662367
Edit: updated results with larger array size.
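
For anyone who wants to compare runs, here is a small sketch that parses the two STREAM tables above and prints the multi-thread to single-thread ratio per kernel. The numbers are copied from this post; the helper name `parse_stream` is made up for the example, and the parser assumes the standard five-column STREAM result lines.

```python
MT_OUTPUT = """\
Copy:          194959.4     0.035496     0.035296     0.035869
Scale:         195096.5     0.035439     0.035271     0.035944
Add:           202271.9     0.051210     0.051030     0.052332
Triad:         202911.8     0.051176     0.050869     0.055849
"""

ST_OUTPUT = """\
Copy:           22199.0     0.310183     0.309982     0.310637
Scale:          15316.9     0.449589     0.449262     0.450031
Add:            15655.1     0.660191     0.659333     0.661318
Triad:          15630.4     0.661298     0.660376     0.662367
"""


def parse_stream(text: str) -> dict:
    """Map kernel name -> best rate in MB/s from STREAM result lines."""
    rates = {}
    for line in text.splitlines():
        fields = line.split()
        # Result lines look like "Triad:  <rate> <avg> <min> <max>".
        if len(fields) == 5 and fields[0].endswith(":"):
            rates[fields[0].rstrip(":")] = float(fields[1])
    return rates


mt = parse_stream(MT_OUTPUT)
st = parse_stream(ST_OUTPUT)
scaling = {kernel: mt[kernel] / st[kernel] for kernel in mt}

for kernel, ratio in scaling.items():
    print(f"{kernel:<6} 1T {st[kernel]:>9.1f} MB/s  "
          f"MT {mt[kernel]:>9.1f} MB/s  ratio {ratio:.2f}x")
```

The ratio gives a rough idea of how many threads' worth of single-thread bandwidth the whole socket delivers, which is the kind of single-thread versus multi-thread comparison asked for above.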
 

twin_savage

Nice. Note that AIDA64 does not care about channel interleaving; the results are just added up.
I tested on the LGA3647 platform with 6 and 5 RDIMMs; the results were 120 (6) and 100 (5).
That's odd, because interleaving is not supported with 5 RDIMMs.
For that reason AIDA64 is less helpful; we need single-thread bandwidth.
I just ran PassMark and updated my post. The numbers seem kind of strange: it reports 57 ns latency, which is very optimistic.
I included the MLC benchmarks as well, which show true single-thread memory performance.
 

RolloZ170

I just ran PassMark and updated my post. The numbers seem kind of strange: it reports 57 ns latency, which is very optimistic.
That depends on the SNC (Sub-NUMA Clustering) configuration,
i.e. with an XCC (4 tiles) and SNC4, each tile owns a nearby dual-channel memory controller with low latency.
 

twin_savage

That depends on the SNC (Sub-NUMA Clustering) configuration,
i.e. with an XCC (4 tiles) and SNC4, each tile owns a nearby dual-channel memory controller with low latency.
The W790 BIOS settings for this are confusing: the help text makes it sound like the UMA modes cannot be enabled based on "the current system configuration", but when I enable them and try different clustering schemes I get reproducible differences in memory performance:

[Attached screenshot: 2.jpg]

It almost sounds like UMA-based clustering=SNC but the BIOS explicitly mentions that if SNC is enabled then UMA-based clustering will be disabled.
 

RolloZ170

It almost sounds like UMA-based clustering=SNC but the BIOS explicitly mentions that if SNC is enabled then UMA-based clustering will be disabled
SNC = Sub-NUMA Clustering. Both cores and memory can be clustered; you can have 4 core clusters but no memory clusters (all channels interleaved).
 

twin_savage

SNC = Sub-NUMA Clustering. Both cores and memory can be clustered; you can have 4 core clusters but no memory clusters (all channels interleaved).
I just realized there are no SNC settings in BIOS, SNC is always 1 for this motherboard and CPU combo, so of course virtual NUMA and UMA-based clustering work.


For those interested in how Virtual NUMA and UMA-Based Clustering affect memory latency via MLC:

Virtual NUMA disabled, UMA Hemispheres = 1 NUMA node exposed to OS, 91 ns ± 0.5 ns idle memory latency
Virtual NUMA disabled, UMA Quadrants = 1 NUMA node exposed to OS, 89.1 ns ± 0.6 ns? idle memory latency

Virtual NUMA enabled, UMA Hemispheres = 2 NUMA nodes exposed to OS, 91 ns ± 0.4 ns idle memory latency
Virtual NUMA enabled, UMA Quadrants = 2 NUMA nodes exposed to OS, 89.5 ns ± 0.4 ns idle memory latency
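
A quick way to judge whether the Hemispheres versus Quadrants gap is larger than the run-to-run noise is to divide the difference by the combined spread. The sketch below does that for the four results above. It treats the ± values as independent spreads that combine in quadrature, which is an assumption; the post does not say how they were obtained.

```python
import math

# (mean_ns, spread_ns) copied from the four MLC results above.
LATENCY = {
    ("vnuma off", "hemispheres"): (91.0, 0.5),
    ("vnuma off", "quadrants"): (89.1, 0.6),
    ("vnuma on", "hemispheres"): (91.0, 0.4),
    ("vnuma on", "quadrants"): (89.5, 0.4),
}


def separation(a, b):
    """Return (difference in ns, difference in units of the combined spread)."""
    diff = abs(a[0] - b[0])
    combined = math.hypot(a[1], b[1])  # quadrature sum of the two spreads
    return diff, diff / combined


for vnuma in ("vnuma off", "vnuma on"):
    diff, ratio = separation(LATENCY[(vnuma, "hemispheres")],
                             LATENCY[(vnuma, "quadrants")])
    print(f"{vnuma}: hemispheres vs quadrants differ by {diff:.1f} ns "
          f"({ratio:.2f}x the combined spread)")
```

A ratio well above 1 suggests the clustering mode really changes idle latency, while comparing Virtual NUMA on and off within the same clustering mode gives a ratio below 1, i.e. no difference distinguishable from noise under this assumption.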
 

Civiloid

So I finally got my Dual 8490H ES (E0) setup working.

CPU: 2x8490H ES E0
Motherboard: Gigabyte MS73-HB1, BIOS R02
RAM: 16x 16 GB Samsung 1Rx8 DDR5-4800

Intel Memory Latency Checker:
Code:
# ./mlc
Intel(R) Memory Latency Checker - v3.11
Measuring idle latencies for sequential access (in ns)...
        Numa node
Numa node         0         1         2         3         4         5         6         7
       0      97.7     109.0     110.3     121.9     180.9     181.8     185.3     187.9
       1     103.8     100.6     123.3     113.4     181.2     182.2     186.0     188.4
       2     110.6     120.2      97.8     108.6     182.0     183.2     186.2     189.4
       3     119.0     113.8     104.8      98.6     182.9     184.0     186.7     189.9
       4     184.1     187.8     186.7     187.4      96.7     108.4     113.7     128.7
       5     184.4     188.0     187.0     187.9     104.2      96.4     121.6     116.2
       6     185.7     187.9     187.1     188.5     111.7     120.9      96.1     107.9
       7     186.1     188.0     188.1     189.5     120.0     112.2     107.5      95.8

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :    488767.1
3:1 Reads-Writes :    424092.1
2:1 Reads-Writes :    414666.9
1:1 Reads-Writes :    397502.9
Stream-triad like:    415744.4

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
        Numa node
Numa node         0         1         2         3         4         5         6         7
       0    61508.4    60951.5    60935.1    60773.3    60647.3    60593.6    60580.0    60583.8
       1    60937.1    61641.0    60847.3    60826.5    60621.0    60164.3    60158.1    60150.2
       2    61007.9    60938.2    61385.7    60852.4    60219.6    60161.8    60129.9    60166.3
       3    60895.9    61048.3    60974.9    61044.8    60208.5    60156.0    60146.6    60152.2
       4    60246.7    60267.7    60187.7    60143.1    61144.7    60281.1    60721.8    60129.7
       5    60654.6    60638.2    60600.5    60506.8    60449.8    60820.3    60263.9    60707.1
       6    60652.7    60633.1    60600.1    60509.5    60834.0    60366.4    60743.6    60316.6
       7    60673.7    60646.9    60639.6    60542.3    60388.1    60795.0    60362.7    60599.4

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject    Latency    Bandwidth
Delay    (ns)    MB/sec
==========================
 00000    335.46     488651.0
 00002    336.15     488353.8
 00008    332.96     488381.2
 00015    324.94     488604.0
 00050    320.47     487437.9
 00100    327.85     487903.7
 00200    117.88     285766.2
 00300    109.74     196740.6
 00400    105.95     149506.9
 00500    105.43     120438.4
 00700    102.65      86946.0
 01000    101.25      61365.5
 01300    100.44      47526.2
 01700     99.45      36574.7
 02500     99.06      25156.3
 03500     98.72      18192.0
 05000     98.44      12936.0
 09000     98.16       7591.3
 20000     97.85       3790.0

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT  latency    60.0
Local Socket L2->L2 HITM latency    61.4
Remote Socket L2->L2 HITM latency (data address homed in writer socket)
            Reader Numa Node
Writer Numa Node     0         1         2         3         4         5         6         7
            0         -      67.0      74.0      80.1     152.7     153.7     155.0     156.0
            1      70.3         -      82.9      74.7     156.6     157.6     158.8     159.8
            2      76.1      82.4         -      66.4     158.4     159.4     160.6     161.6
            3      85.4      76.6      69.6         -     162.3     163.2     164.5     165.5
            4     152.7     153.5     154.6     155.6         -      67.0      74.4      80.5
            5     156.6     157.3     158.4     159.4      70.3         -      83.4      74.6
            6     158.3     159.1     160.2     161.2      76.3      83.0         -      67.0
            7     161.8     162.6     163.7     164.6      84.8      76.9      69.6         -
Remote Socket L2->L2 HITM latency (data address homed in reader socket)
            Reader Numa Node
Writer Numa Node     0         1         2         3         4         5         6         7
            0         -      71.6      79.2      90.0     166.8     169.7     172.6     175.8
            1      70.0         -      88.2      79.9     166.2     169.0     172.0     175.2
            2      75.1      85.6         -      71.0     164.4     167.3     170.2     173.4
            3      84.8      76.0      69.7         -     164.2     167.0     169.9     173.1
            4     166.8     169.4     172.3     175.2         -      71.6      78.9      90.0
            5     166.2     168.8     171.8     174.7      70.0         -      88.4      79.7
            6     164.8     167.4     170.3     173.2      75.1      86.3         -      71.8
            7     164.0     166.6     169.5     172.4      84.3      76.1      69.7         -
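
To make the idle latency matrix at the top of the MLC output easier to compare with other systems, here is a sketch that averages it by locality class. It assumes NUMA nodes 0-3 sit on socket 0 and nodes 4-7 on socket 1 (SNC4 on each socket), which matches the pattern in the numbers but is not stated in the post; the helper names are made up for the example.

```python
# Idle latency matrix (ns) copied from the MLC output above.
IDLE_LATENCY_NS = """\
0      97.7     109.0     110.3     121.9     180.9     181.8     185.3     187.9
1     103.8     100.6     123.3     113.4     181.2     182.2     186.0     188.4
2     110.6     120.2      97.8     108.6     182.0     183.2     186.2     189.4
3     119.0     113.8     104.8      98.6     182.9     184.0     186.7     189.9
4     184.1     187.8     186.7     187.4      96.7     108.4     113.7     128.7
5     184.4     188.0     187.0     187.9     104.2      96.4     121.6     116.2
6     185.7     187.9     187.1     188.5     111.7     120.9      96.1     107.9
7     186.1     188.0     188.1     189.5     120.0     112.2     107.5      95.8
"""

NODES_PER_SOCKET = 4  # assumption: SNC4, nodes 0-3 on socket 0, 4-7 on socket 1


def parse_matrix(text):
    """Turn the MLC table into a list of rows, dropping the row label."""
    return [[float(x) for x in line.split()[1:]]
            for line in text.strip().splitlines()]


def classify(src, dst):
    """Label a (source node, destination node) pair by locality."""
    if src == dst:
        return "local"
    if src // NODES_PER_SOCKET == dst // NODES_PER_SOCKET:
        return "same socket"
    return "remote socket"


def mean_by_class(matrix):
    """Average latency per locality class."""
    buckets = {}
    for src, row in enumerate(matrix):
        for dst, value in enumerate(row):
            buckets.setdefault(classify(src, dst), []).append(value)
    return {name: sum(vals) / len(vals) for name, vals in buckets.items()}


summary = mean_by_class(parse_matrix(IDLE_LATENCY_NS))
for name, value in summary.items():
    print(f"{name:<14} {value:6.1f} ns")
```

The same grouping could be applied to the bandwidth matrix or the cache-to-cache tables by swapping in the corresponding rows (the "-" entries on the diagonal would need to be skipped first).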

Passmark:
Code:
Genuine Intel CPU 0000%@ (x86_64)
120 cores @ 1601 MHz  |  251.5 GiB RAM
Number of Processes: 240  |  Test Iterations: 1  |  Test Duration: Medium
--------------------------------------------------------------------------------
CPU Mark:                          102852
  Integer Math                     835859 Million Operations/s
  Floating Point Math              595255 Million Operations/s
  Prime Numbers                    1124 Million Primes/s
  Sorting                          381430 Thousand Strings/s
  Encryption                       184905 MB/s
  Compression                      3034351 KB/s
  CPU Single Threaded              2157 Million Operations/s
  Physics                          14085 Frames/s
  Extended Instructions (SSE)      204267 Million Matrices/s

Memory Mark:                       2479
  Database Operations              31149 Thousand Operations/s
  Memory Read Cached               19292 MB/s
  Memory Read Uncached             11890 MB/s
  Memory Write                     10244 MB/s
  Available RAM                    237085 Megabytes
  Memory Latency                   62 Nanoseconds
  Memory Threaded                  513800 MB/s
--------------------------------------------------------------------------------
Link: PassMark Software - Display Baseline ID# 5058905
 

DHamov

Active Member
Jan 12, 2024
112
27
28
Great thread, I will also try to contribute later.
Memory-related question: on a dual-socket system, would it matter much to have one CPU with 2Rx8 modules and the other with 1Rx4 or 1Rx8 modules?
I understand that perfect symmetry would be best, but roughly how much of a performance hit should be expected? Or will it just not work?