LGA4677 Intel Socket related discussion placeholder: CPU, Motherboards, RAM, Channels, Chipset, PCIe, BIOS, BMC, Heatsinks, Cooling


sam55todd

Active Member
May 11, 2023
115
28
28
Intel Socket LGA4677 ( Socket E / FCLGA4677 ) discussion placeholder

Xeon Scalable Processors 4th / 5th generation
CPUs: Sapphire Rapids / Emerald Rapids (AMX / QAT / DLB / DSA / IAA ) MCC, XCC, EE-LCC, 1EA, 1EB, 1EC
Motherboards: Asus, AsRock, Gigabyte, SuperMicro, Tyan
RAM: DDR5 DIMM/RDIMM / HBM
Channels: up to 8 (four dual-channel memory controllers)
Chipsets: W790 (Desktop/WS) / C741 (PCIe 3) / C741E (PCIe 2)
PCIe: Gen 2/3/4/5
BIOS:
BMC:
ASPEED AST2600
Heatsinks / Cooling:
Performance / Benchmarks:
Power Requirements:
MOBOs/Barebones/Servers:
UPI:

Another alternative clone topic: a consolidation space to keep LGA4677 discussion from contaminating other threads, because the contagion will keep spreading for a couple of years until the platform is EOL.

Related threads:
W790 Motherboard discussion thread
Sapphire / Emerald Rapids - Memory bandwidth & PCIe Root complex Discussion
ES Xeon Discussion
DDR5 / DIMM / RDIMM

P.S. I won't be updating this header much because I don't visit the platform very often.
 

sam55todd

Active Member
May 11, 2023
115
28
28
Is there a way to check the number of AMX units available on Sapphire Rapids CPUs?
Like with AVX-512 FMA units - most CPUs have only two
(even CPUs with higher core counts, e.g. 48 cores and still only 2 AVX-512 FMA units; some have fewer, some none)
and similarly with the other accelerators (QAT/DLB/DSA/IAA, normally 0, 1 or 4 units).
After looking at the chiplet layout I got the impression that there is only one AMX unit per whole CPU (not per core), or more likely one per multi-core chiplet as with AVX-512, e.g. the 8592V has two chiplets with 32 cores each = 64 cores and 2x AVX-512 FMA units.
Or is that not even a thing for AMX?

P.S. Yes, I do know there's practically no mainstream software compiled to support AMX instructions yet;
I doubt even Windows 11 has code that uses it for any gain.
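
On Linux, one rough way to at least see whether AMX is exposed is to look at the CPU feature flags in /proc/cpuinfo. The sketch below is only an illustration: the helper names are made up for this example, and the flags (amx_tile, amx_bf16, amx_int8) only say that the instructions are supported, not how many hardware units sit behind them.

```python
AMX_FLAGS = ("amx_tile", "amx_bf16", "amx_int8")


def cpu_feature_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def amx_support(cpuinfo_text):
    """Map each AMX-related flag name to True/False (hypothetical helper)."""
    flags = cpu_feature_flags(cpuinfo_text)
    return {name: name in flags for name in AMX_FLAGS}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not Linux, or /proc not available
    print(amx_support(text))
```

A count of execution units is not something these flags can report; they only answer the "is it there at all" part of the question.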
 

bayleyw

Active Member
Jan 8, 2014
306
102
43
AVX/AMX counts are per core, the other accelerators are per package.
 

RolloZ170

Well-Known Member
Apr 24, 2016
5,429
1,644
113
you can use AMX instructions in a normal way like AVX512, but some are sent to accelerators for execution (i.e. TMUL)
 

sam55todd

Active Member
May 11, 2023
115
28
28
AVX/AMX counts are per core, the other accelerators are per package.
Any pointers to sources for this? Not for AVX in general (which is indeed per core) but rather for AVX-512 FMA (and AMX)?
you can use AMX instructions in a normal way like AVX512, but some are sent to accelerators for execution (i.e. TMUL)
Yes, Intel says here {Section called "Intel® Advanced Matrix Extensions (Intel® AMX)"} that AMX is actually an accelerator (made of two components, the second being TMUL) and that the architectural flow is like AVX-512.

This source shows an abbreviated die with 2 special buffers for AVX-512 (presumably the FMA array) per chiplet and most likely only a single AMX unit (per chiplet too, not per core) - see the text right in the middle of the image.

1706053694888.png


So far everything points to the same conclusion - AMX is per tile, not per core.
Meanwhile the number of AVX-512 FMA units per CPU is explicitly stated in the processor specification - Advanced Technologies section (last two lines - AVX-512 and AMX).
 

sam55todd

Active Member
May 11, 2023
115
28
28
each thread can run independent AMX code, but TMUL is executed by TMUL unit accelerator.
Yes, they can send commands to the TMUL accelerator, and each has its own first stage of the AMX implementation (tmm0...tmm[n-1]), but only one TMUL, therefore only one AMX (the tmm commands simply wait in a queue for the single TMUL), right?
 

sam55todd

Active Member
May 11, 2023
115
28
28
do not confuse tiles(SPR silicon) with tiles(AMX tiles)
I'm actually getting confused, so are those TMUL(s?) per core? Or per CPU (like an accelerator, which would make more sense looking at the documents above)? Or per chiplet (a batch of cores - but this is less likely, as I think it is a separate accelerator tile mounted on the package assembly)?
P.S. To simplify things: if I understand correctly, the terms tile and chiplet are interchangeable (they mean the same thing); I'm not sure about die (most likely the same). Then we have the "package" (CPU) one level up (except for IPs, which are Individual Packages, like dies stacked on the same CPU).
 

sam55todd

Active Member
May 11, 2023
115
28
28
Another related question - there are various customer-specific (custom) Sapphire Rapids CPU models spreading on the market.
Do those have extra features? Or rather the opposite - are they the discounted result of failed binning, with reduced functionality (e.g. AMX fails but the CPUs can be used for networking servers where this type of compute is not needed, therefore 8xxxC is bulk-supplied to specific OEM builds)?
Or are those 8xxxC-coded CPUs a gamble that can go either way?

Also, from what I've seen, 8xxxB parts have full if not extended functionality (e.g. 8475B for Alibaba cloud), so the question is: does this need different microcode or special programming/drivers/BIOS to access these extras (although those are already sold with declared support from mainstream motherboard vendors, e.g. SM, GB, etc., with CPU-Z pics showing regular stepping)?
I understand that at this early stage it's unlikely anyone here has this knowledge (or is allowed to share it), but just in case someone has already done research or tests and is willing to share findings..
 

RolloZ170

Well-Known Member
Apr 24, 2016
5,429
1,644
113
but only one TMUL, therefore only one AMX (the tmm commands simply wait in a queue for the single TMUL), right?
yes there is one TMUL unit, but many AMX instructions don't need this accel.
I'm actually getting confused, so are those TMUL(s?) per core?
no, but from the coder's point of view it looks like it.
tiles: the AMX instructions use matrix variables called TILES. do not confuse these with the chiplets of the processor.
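
To make the AMX meaning of "tile" concrete: as far as the architecture documentation goes, there are eight tile registers (tmm0 to tmm7), each holding up to 16 rows of 64 bytes. A small sketch of that arithmetic, with constant and function names invented for illustration:

```python
TILE_REGISTERS = 8        # tmm0 .. tmm7
MAX_ROWS = 16             # maximum rows per tile
MAX_BYTES_PER_ROW = 64    # maximum bytes per row


def tile_shape(element_bytes):
    """Return (rows, columns) of a full-size tile for a given element size in bytes."""
    return MAX_ROWS, MAX_BYTES_PER_ROW // element_bytes


def tile_bytes():
    """Size of one full tile register in bytes."""
    return MAX_ROWS * MAX_BYTES_PER_ROW


print(tile_shape(1))                  # int8 elements: (16, 64)
print(tile_shape(2))                  # bf16 elements: (16, 32)
print(TILE_REGISTERS * tile_bytes())  # total tile state in bytes: 8192
```

These are register-file numbers only; they say nothing about chiplets or about how many TMUL units exist.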
 

RolloZ170

Well-Known Member
Apr 24, 2016
5,429
1,644
113
there are various customer-specific (custom) Sapphire Rapids CPU models spreading on the market.
Do those have extra features? Or rather the opposite - are they the discounted result of failed binning, with reduced functionality (e.g. AMX fails but the CPUs can be used for networking servers where this type of compute is not needed, therefore 8xxxC is bulk-supplied to specific OEM builds)?
Or are those 8xxxC-coded CPUs a gamble that can go either way?
Intel builds some 'C' (confidential) SKUs for select customers; what is missing or extra is known only to Intel and the validated customer (NDA).
 

RolloZ170

Well-Known Member
Apr 24, 2016
5,429
1,644
113
Also, from what I've seen, 8xxxB parts have full if not extended functionality (e.g. 8475B for Alibaba cloud), so the question is: does this need different microcode
even the mainstream SKUs have different functionality, suffix P,V,M,H,N,S,T,U,Q..
they share the same MCU (microcode update).
 

bayleyw

Active Member
Jan 8, 2014
306
102
43
1000002187.jpg
it took some digging but here's official verification that AMX is per core, not one big systolic array per quadrant
 

RolloZ170

Well-Known Member
Apr 24, 2016
5,429
1,644
113
it took some digging but here's official verification that AMX is per core, not one big systolic array per quadrant
this info is hard to understand/misleading.
the perf. gain from 3rd gen to 4th gen is 8x.
each core can use AMX, but there is only one TMUL unit per package, which can do 2048 TMULs per cycle;
on a cpu with 48 cores this makes 2048 / 48 ≈ 42.7 TMULs per cycle per core.
one TMUL unit per core would not fit on a die, e.g. check the size of the QAT area.
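
The division above, worked through as a quick sketch. It takes the premise of this post (a single shared unit doing 2048 operations per cycle, split evenly across all cores) as an assumption rather than an established fact; the function name is made up.

```python
def per_core_share(ops_per_cycle, cores):
    """Even split of a shared unit's throughput across cores (assumed model)."""
    return ops_per_cycle / cores


print(round(per_core_share(2048, 48), 2))  # 42.67
print(round(per_core_share(2048, 64), 2))  # 32.0
```

If the unit were per core instead, every core would see the full 2048 and no division would apply.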
 

sam55todd

Active Member
May 11, 2023
115
28
28
So we're ending up with the conclusion that a subset of AMX instructions is offloaded to TMUL, which is implemented as a separate accelerator (just like IAA/QAT/DLB/DSA), with an unknown number of those AMX TMUL accelerator units on each CPU model (not specified on Intel ARK).
I guess for now we can assume it's 1 unit for mainstream CPUs.
 

RolloZ170

Well-Known Member
Apr 24, 2016
5,429
1,644
113
So we're ending up with the conclusion that a subset of AMX instructions is offloaded to TMUL, which is implemented as a separate accelerator (just like IAA/QAT/DLB/DSA), with an unknown number of those AMX TMUL accelerator units on each CPU model (not specified on Intel ARK).
yes. but QAT etc. are PCI devices.
TMUL is more like the AVX512 FMA unit.
I guess for now we can assume it's 1 unit for mainstream CPUs.
i think it's easier that way, otherwise something has to decide which unit the work goes to. better to have one unit with more operations per cycle, and everything else remains the same.
you can see in the AMX code sample: after the init. of all variables, a call to "the" syscall function returns after the work is done.
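
For reference, on Linux the syscall near the top of Intel's AMX sample is, as far as I can tell, an arch_prctl request for permission to use the AMX tile data state, made once before any tile instruction runs. A hedged Python sketch of that request via ctypes follows; the constants are taken from the Linux x86 headers (ARCH_REQ_XCOMP_PERM = 0x1023, XFEATURE_XTILEDATA = 18, arch_prctl = syscall 158 on x86-64) and the function name is invented for this example.

```python
import ctypes
import platform
import sys

SYS_ARCH_PRCTL = 158          # x86-64 syscall number for arch_prctl
ARCH_REQ_XCOMP_PERM = 0x1023  # request permission for an extended state component
XFEATURE_XTILEDATA = 18       # AMX tile data state component


def request_amx_tile_permission():
    """Ask the kernel for AMX tile-data permission; True only if it was granted."""
    if not sys.platform.startswith("linux") or platform.machine() != "x86_64":
        return False  # the numbers above are only valid on x86-64 Linux
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    rc = libc.syscall(
        ctypes.c_long(SYS_ARCH_PRCTL),
        ctypes.c_long(ARCH_REQ_XCOMP_PERM),
        ctypes.c_long(XFEATURE_XTILEDATA),
    )
    return rc == 0  # non-zero: old kernel, or no AMX on this CPU


print(request_amx_tile_permission())
```

The result only tells whether the kernel granted the permission; it does not show where the tile multiply is executed.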