Sapphire / Emerald Rapids - Memory bandwidth & PCIe Root complex Discussion


RolloZ170

Well-Known Member
Apr 24, 2016
9,980
3,204
113
germany
That's per individual HBM2E die within the stack, each MAX CPU's HBM2E bus is 4096 bits wide between all the chiplets.
Can't be true. With 3200 MHz and 4096 bits, the bandwidth should be higher.
Pseudo channel mode.jpg
Code:
Memory Device
   
Total Width:    128 bits
Data Width:    112 bits
Device Size:    16384 MBytes
Device Form Factor:    Unknown
Device Locator:    CPU0_HBMIO0
Bank Locator:    Bank 0
Device Type:    HBM2
Device Type Detail:    Synchronous
Memory Speed:    3200 MT/s
Configured Memory Speed:    3200 MT/s
Manufacturer:    Intel
Serial Number:    0000
Part Number:  
Asset Tag:  
Cache Size:    16 GBytes


Memory Device
   
Total Width:    128 bits
Data Width:    112 bits
Device Size:    16384 MBytes
Device Form Factor:    Unknown
Device Locator:    CPU0_HBMIO1
Bank Locator:    Bank 1
Device Type:    HBM2
Device Type Detail:    Synchronous
Memory Speed:    3200 MT/s
Configured Memory Speed:    3200 MT/s
Manufacturer:    Intel
Serial Number:    0000
Part Number:  
Asset Tag:  
Cache Size:    16 GBytes


Memory Device
   
Total Width:    128 bits
Data Width:    112 bits
Device Size:    16384 MBytes
Device Form Factor:    Unknown
Device Locator:    CPU0_HBMIO2
Bank Locator:    Bank 2
Device Type:    HBM2
Device Type Detail:    Synchronous
Memory Speed:    3200 MT/s
Configured Memory Speed:    3200 MT/s
Manufacturer:    Intel
Serial Number:    0000
Part Number:  
Asset Tag:  
Cache Size:    16 GBytes


Memory Device
   
Total Width:    128 bits
Data Width:    112 bits
Device Size:    16384 MBytes
Device Form Factor:    Unknown
Device Locator:    CPU0_HBMIO3
Bank Locator:    Bank 3
Device Type:    HBM2
Device Type Detail:    Synchronous
Memory Speed:    3200 MT/s
Configured Memory Speed:    3200 MT/s
Manufacturer:    Intel
Serial Number:    0000
Part Number:  
Asset Tag:  
Cache Size:    16 GBytes
 

twin_savage

Active Member
Jan 26, 2018
169
127
43
35
Can't be true. With 3200 MHz and 4096 bits, the bandwidth should be higher.
You're not exactly wrong; there are some serious memory contention/concurrency issues in the design. Theoretically each CPU should be able to get 1.6 TB/s out of the 4 HBM stacks, but it can't in real-world memory workloads/benchmarks.

If each stack were only 128 bits wide at 3200 MT/s, each would only run at 51.2 GB/s, which is slower than what we're experiencing.
 

DHamov

Active Member
Jan 12, 2024
121
31
28
Hi, what do you mean?
4 * 128 * 3.2 GB/s = 1638.4 GB/s (peak), right?
EDIT: WRONG! OK, I used to think this (the result seemed to come out right and it was simple, so I stopped thinking about it). But after reading your post: by the same logic, normal DDR5 would be 64 * 4.8 GB/s per channel, which is not true.
So there is this factor of 8.
The Micron doc mentions 3.2 Gb/s per pin and an I/O width of 1024 per device, so I guess 128 --> 1024 is that factor of 8.
And then 4 x 1024 = 4096 is the interface width you mentioned.
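To make the arithmetic in this exchange explicit, here is a minimal sketch of the usual peak-bandwidth formula (bus width in bits, divided by 8 to get bytes, times the transfer rate). The helper name `peak_bandwidth_gbs` is made up for illustration, and the numbers are theoretical peaks only, not what the hardware sustains.

```python
def peak_bandwidth_gbs(bus_width_bits, transfer_rate_mts):
    """Theoretical peak in GB/s: bytes per transfer times transfers per second."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfer_rate_mts / 1000  # MB/s -> GB/s


# Widths and speeds taken from the posts above.
print(peak_bandwidth_gbs(128, 3200))   # 128-bit width as reported by dmidecode
print(peak_bandwidth_gbs(1024, 3200))  # one stack at the 1024-bit I/O width
print(peak_bandwidth_gbs(4096, 3200))  # four stacks, 4096 bits total
print(peak_bandwidth_gbs(64, 4800))    # one 64-bit DDR5-4800 channel
```

The 4096-bit case gives 1638.4 GB/s, which matches the 1.6 TB/s figure. "4 * 128 * 3.2" lands on the same number only because 1024 bits / 8 happens to equal 128 bytes.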
 

twin_savage

Active Member
Jan 26, 2018
169
127
43
35
An interesting thing I've found out is that numactl's weighted interleave function on SPR-HBM does not increase memory performance as it was designed to (well, designed to for CXL memory devices at least).
The way it was supposed to work is that it interleaves HBM and DDR5 together for theoretically higher performance while in flat memory mode.
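For anyone unfamiliar with the idea: weighted interleave spreads pages across NUMA nodes in proportion to per-node weights, so the faster node is meant to receive more of the pages. A rough sketch of how such weights could be derived from bandwidth ratios follows. This illustrates the concept only, not numactl's internals; `interleave_weights` is a hypothetical helper, and the bandwidth figures are placeholders, not measurements.

```python
def interleave_weights(node_bandwidth):
    """Map {node_id: bandwidth} to integer weights in roughly the same ratio.

    The slowest node gets weight 1; every other node gets its bandwidth
    relative to the slowest, rounded to the nearest integer.
    """
    slowest = min(node_bandwidth.values())
    return {node: max(1, round(bw / slowest)) for node, bw in node_bandwidth.items()}


# Placeholder bandwidths (assumption): node 0 = DDR5, node 1 = HBM.
weights = interleave_weights({0: 100.0, 1: 300.0})
total = sum(weights.values())
for node, weight in weights.items():
    print(f"node {node}: weight {weight}, share of pages {weight / total:.0%}")
```

With these placeholder numbers, three of every four pages would go to the faster node. Whether that helps on SPR-HBM is exactly what the observation above calls into question.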
 

Laugh|nGMan

Member
Nov 27, 2012
60
16
8
If that helps: for the sake of completeness, here are the custom, homo, and hetero color profiles on the X13SEM-TF.

XCC
Intel® Xeon® Gold 6530 @ 2.1 GHz 32 Cores, 160M cache (Emerald Rapids, 5th Gen) SNC-2 Win11_24H2 Single socket latency Min=57.75ns Median=75.0ns Max=107.75ns
6530_c2c_snc-2_win11_24H2_custom_color.JPG

Win11 24H2 (OS install date 2024-04)
6530_c2c_snc-2_win11_24H2_custom_color.JPG
6530_c2c_snc-2_win11_24H2_homo_color.JPG
6530_c2c_snc-2_win11_24H2_hetero_color.JPG

Win10 21H2 (OS install date 2019-12)
6530_c2c_snc-2_win10_21H2_custom_color.JPG
6530_c2c_snc-2_win10_21H2_homo_color.JPG
6530_c2c_snc-2_win10_21H2_hetero_color.JPG
6530_aida_high_perf_mem.JPG
 

111alan

Active Member
Mar 11, 2019
343
135
43
Haerbing Institution of Technology
This is really bad. Nearly twice as slow as 3rd gen.
 

tigweld0101

Active Member
Apr 18, 2015
124
43
28
58
If that helps: for the sake of completeness, here are the custom, homo, and hetero color profiles on the X13SEM-TF.

XCC
Intel® Xeon® Gold 6530 @ 2.1 GHz 32 Cores, 160M cache (Emerald Rapids, 5th Gen) SNC-2 Win11_24H2 Single socket latency Min=57.75ns Median=75.0ns Max=107.75ns
View attachment 43946
How do you make that? I'd like to try it on some of our systems when we rotate them out for service.
 

Laugh|nGMan

Member
Nov 27, 2012
60
16
8
How do you make that?
httpx://www.capframex.com/assets/static/latency-heatmap.html
Tick Custom Colors and replace the colors with these:
Code:
                [
                    { "pct": 0.0, "color": { "r": 253, "g": 251, "b": 36 } },
                    { "pct": 0.1, "color": { "r": 100, "g": 203, "b": 93 } },
                    { "pct": 0.3, "color": { "r": 87, "g": 198, "b": 101 } },
                    { "pct": 0.5, "color": { "r": 61, "g": 74, "b": 137 } },
                    { "pct": 1.0, "color": { "r": 68, "g": 1, "b": 84 } }
                ]
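To show what a stop list like this encodes, here is a small sketch that maps a normalized value (0.0 to 1.0) to an RGB color by linear interpolation between neighboring stops. Linear interpolation is an assumption on my part; I have not checked how the heatmap page actually blends the stops. `STOPS` and `color_for` are illustrative names.

```python
# Color stops copied from the JSON above: (pct, (r, g, b)).
STOPS = [
    (0.0, (253, 251, 36)),
    (0.1, (100, 203, 93)),
    (0.3, (87, 198, 101)),
    (0.5, (61, 74, 137)),
    (1.0, (68, 1, 84)),
]


def color_for(pct):
    """Return an (r, g, b) tuple for pct, clamped to the range [0, 1]."""
    pct = min(max(pct, 0.0), 1.0)
    for (p0, c0), (p1, c1) in zip(STOPS, STOPS[1:]):
        if pct <= p1:
            t = (pct - p0) / (p1 - p0)  # position inside this segment
            return tuple(round(a + (b - a) * t) for a, b in zip(c0, c1))
    return STOPS[-1][1]


print(color_for(0.0))  # first stop (yellow end)
print(color_for(1.0))  # last stop (dark purple end)
```

The stops are bunched toward the low end (0.0, 0.1, 0.3), so most of the color change happens among the lowest values.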
Added...
MCC
Intel® Xeon® Gold 5416S @ 2.0 GHz 16 Cores, 30M cache (Sapphire Rapids, 4th Gen) Red Hat Enterprise Linux 10 6.12.0-55.22.1.el10_0.x86_64 Single socket latency Min=61.0ns Median=76.0ns Max=88.2ns
5416_c2c_linux_rhel10_custom_color.png

Intel® Xeon® Gold 5416S @ 2.0 GHz 16 Cores, 30M cache (Sapphire Rapids, 4th Gen) Red Hat Enterprise Linux 10 6.12.0-55.22.1.el10_0.x86_64 Single socket latency under load Min=81.0ns Median=126.0ns Max=267.0ns
5416_c2c_under_load_linux_rhel10_custom_color.png

Maximum latency under load tripled.
Code:
Metric            Idle (No Stress)    Under stress-ng Load    Change
Min Latency       61.0 ns             81.0 ns                 +32.8%
Median Latency    76.0 ns             126.0 ns                +65.8%
Max Latency       88.2 ns             267.0 ns                +202.7%

grep 'microcode' /proc/cpuinfo | uniq
microcode : 0x2b000639
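The Change column can be reproduced from the idle and loaded latencies in the table. A minimal sketch, with `pct_change` as an illustrative helper name:

```python
def pct_change(idle_ns, loaded_ns):
    """Relative increase from idle to loaded latency, in percent."""
    return (loaded_ns - idle_ns) / idle_ns * 100


# (idle, under stress-ng load) in nanoseconds, taken from the table.
latencies = {
    "Min": (61.0, 81.0),
    "Median": (76.0, 126.0),
    "Max": (88.2, 267.0),
}

for name, (idle, loaded) in latencies.items():
    print(f"{name}: {pct_change(idle, loaded):+.1f}% ({loaded / idle:.2f}x)")
```

The ratio for the maximum comes out at just over 3x, which is where "tripled" comes from.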
 