AMD EPYC - performance impact of running 4 memory channels instead of 8 channels


Sapphiron

New Member
Mar 2, 2018
Hi All

I have a client who is buying a 1U single-socket Supermicro server with an EPYC 7351P 16-core processor. It will be used as a small Proxmox host, running Debian VM instances for clients.

The intention is to run about 40 VMs with 128GB of RAM (32GB modules), while retaining the option of adding more disks and RAM later to scale that up to roughly 70 VMs with 256GB of RAM.

Is there any good data on the performance impact of running only 4 memory channels vs the full 8 channels?

I have seen data on Threadripper (which is "half" an EPYC processor) showing that some complex workloads are negatively affected by 20-30% when going from 4-channel to 2-channel memory. The speed of the memory also seems to play a significant role.

The local vendor is saying that the impact will likely only be single-digit percentages, but I suspect they are treating it like an Intel server, where memory bandwidth is not as critical for CPU performance.
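For rough context rather than a benchmark, here is a back-of-the-envelope sketch of theoretical peak DRAM bandwidth per channel count, assuming DDR4-2666 (the fastest speed the 7351P officially supports). Real workloads see only a fraction of these figures, but it shows what halving the channel count gives up in raw terms.

# Back-of-the-envelope peak DRAM bandwidth (assumes DDR4-2666, 64-bit channels)
MT_PER_S = 2666e6        # transfers per second for DDR4-2666
BYTES_PER_TRANSFER = 8   # 64-bit wide channel

per_channel = MT_PER_S * BYTES_PER_TRANSFER / 1e9    # GB/s per channel
for channels in (8, 4, 2):
    print(f"{channels} channels: ~{channels * per_channel:.0f} GB/s theoretical peak")
# 8 channels: ~171 GB/s, 4 channels: ~85 GB/s, 2 channels: ~43 GB/s

So a half-populated board tops out at roughly half the raw bandwidth; how much of that headroom the VM farm actually needs is the open question.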

Thanks in advance
 

Patrick

Administrator
Staff member
Dec 21, 2010
I see what you are saying. This is something for STH to investigate. I just have not had time to do so lately with so much travel.

It will depend quite a bit on the workload of the VMs. I do *highly* suggest running in 8-channel mode for EPYC, and 4 as an absolute minimum.
 

Sapphiron

New Member
Mar 2, 2018
Patrick said:
I see what you are saying. This is something for STH to investigate. I just have not had time to do so lately with so much travel.

It will depend quite a bit on the workload of the VMs. I do *highly* suggest running in 8-channel mode for EPYC, and 4 as an absolute minimum.
That would be great.

The VMs are individual stacks of a Java-based web app with a MySQL server on each VM.

I don't know how common it is for people to half-populate their servers and then "upgrade later/too-late/never".

Maybe an article on the implications of the practice might be in order.
 

101

Member
Apr 30, 2018
Not performance related, but here are a couple of old AIDA64 runs.

All 8 channels populated: [AIDA64 screenshot]

4 channels populated, one channel on each die: [AIDA64 screenshot]

4 channels populated, dual channel on 2 dies (a la TR32): [AIDA64 screenshot]

And yes, the BIOS versions differ between the runs, but the RAM issues on this board have been sorted out since F04.
 
Reactions: Sapphiron (Like)

alex_stief

Well-Known Member
May 31, 2016
Sapphiron said:
The local vendor is saying that the impact will likely only be single-digit percentages, but I suspect they are treating it like an Intel server, where memory bandwidth is not as critical for CPU performance.
It's not an Intel vs AMD thing. If the workload requires more memory bandwidth than the system can provide, performance will suffer.
How much depends on how bandwidth-limited the workload is, not on the brand of CPU.
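One rough way to see how much bandwidth a given DIMM population actually delivers is a simple multi-threaded array-add probe. The sketch below is only illustrative (nowhere near as careful as STREAM, and the array size and thread count are arbitrary), but running it on the same box with 4 and then 8 channels populated gives a crude before/after comparison.

# rough_membw.py - very rough memory bandwidth probe (not a STREAM replacement)
import time
import numpy as np
from concurrent.futures import ThreadPoolExecutor

N = 200_000_000        # ~1.6 GB per float64 array, far larger than any cache
THREADS = 16           # roughly one per physical core

a = np.ones(N)
b = np.ones(N)
c = np.zeros(N)

def add_chunk(lo, hi):
    # c[lo:hi] = a[lo:hi] + b[lo:hi]; numpy releases the GIL for the heavy work
    np.add(a[lo:hi], b[lo:hi], out=c[lo:hi])

chunks = [(i * N // THREADS, (i + 1) * N // THREADS) for i in range(THREADS)]
start = time.perf_counter()
with ThreadPoolExecutor(THREADS) as pool:
    list(pool.map(lambda ch: add_chunk(*ch), chunks))
elapsed = time.perf_counter() - start

moved = 3 * N * 8      # two reads plus one write, 8 bytes per element
print(f"~{moved / elapsed / 1e9:.0f} GB/s effective")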
 

Sapphiron

New Member
Mar 2, 2018
alex_stief said:
It's not an Intel vs AMD thing. If the workload requires more memory bandwidth than the system can provide, performance will suffer.
How much depends on how bandwidth-limited the workload is, not on the brand of CPU.
I would normally completely agree with you, if not for Infinity Fabric, which is tied to memory performance. I am just not certain whether it is the clock rate that matters, or the memory bandwidth.

Ryzen's Infinity Fabric Clock Speed is Linked to Memory Clock Speed... Might Explain Why Memory OCs Make a Noticeable Impact in Performance on Ryzen?

That's not even on a multi-die CPU like Threadripper or EPYC.
 

Sapphiron

New Member
Mar 2, 2018
101 said:
Not performance related, but here are a couple of old AIDA64 runs.

All 8 channels populated: [AIDA64 screenshot]

4 channels populated, one channel on each die: [AIDA64 screenshot]

4 channels populated, dual channel on 2 dies (a la TR32): [AIDA64 screenshot]

And yes, the BIOS versions differ between the runs, but the RAM issues on this board have been sorted out since F04.
Thanks,

It does show that populating the right slots, so that each CPU die gets a RAM module, makes a big difference: 63 GB/s vs 83 GB/s, and nearly half the latency (from 138 ns down to 87.8 ns).
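As a quick sanity check after populating (or re-populating) DIMMs, it is worth confirming that every NUMA node actually ended up with local memory. A minimal sketch that reads the standard Linux sysfs node entries (the per-node sizes shown will of course depend on the actual DIMM layout):

# Show how much memory the kernel sees on each NUMA node
import glob
import re

for path in sorted(glob.glob("/sys/devices/system/node/node*/meminfo")):
    node = re.search(r"node(\d+)", path).group(1)
    with open(path) as f:
        for line in f:
            if "MemTotal" in line:
                kb = int(line.split()[-2])   # line looks like "Node 0 MemTotal: ... kB"
                print(f"node {node}: {kb / 1024 / 1024:.1f} GiB")
# A node reporting 0.0 GiB means no DIMM ended up local to that die.

numactl --hardware reports the same per-node totals if you prefer a one-liner.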
 

alex_stief

Well-Known Member
May 31, 2016
Sapphiron said:
I am just not certain whether it is the clock rate that matters, or the memory bandwidth.
I am fairly certain about that. Memory frequency matters for Infinity Fabric; memory bandwidth cannot have an impact on it. That would imply that intra- or inter-die traffic would have to take a detour through RAM, which is definitely not going to happen.
 

Patrick

Administrator
Staff member
Dec 21, 2010
You do get less memory bandwidth if you are accessing RAM over Infinity Fabric versus RAM directly attached to the same NUMA node.

[Image: AMD-EPYC-7551P-Memory-Bandwidth-NUMA-Node-Matrix.jpg]

When IF speed drops, accessing RAM attached to a different NUMA node goes below the ~19GB/s you would see here, since the fabric is running slower.
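One way to see that penalty on a live box is to run the same memory test twice, once bound to a node's own memory and once forced onto a remote node's memory (assuming the system exposes more than one NUMA node, as Naples does per socket). A sketch using numactl; the rough_membw.py name refers to the illustrative probe earlier in this thread, so treat it as a placeholder for whatever benchmark you prefer.

# Compare local vs remote NUMA memory bandwidth by pinning CPUs and allocations
import subprocess

def run(cpu_node, mem_node):
    # --cpunodebind pins the threads, --membind forces allocations onto one node
    out = subprocess.run(
        ["numactl", f"--cpunodebind={cpu_node}", f"--membind={mem_node}",
         "python3", "rough_membw.py"],
        capture_output=True, text=True, check=True)
    print(f"cpu node {cpu_node} -> mem node {mem_node}: {out.stdout.strip()}")

run(0, 0)   # local access
run(0, 1)   # remote access over Infinity Fabric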
 

alex_stief

Well-Known Member
May 31, 2016
Of course, but in this case it is the IF that is limiting bandwidth and latency. It does not matter whether the remote node has 1 or 2 channels populated.
 

wwwean

New Member
Nov 21, 2020
Hello.

[Image: cachemem.png]

What could be the problem with the memory write result?
Only one of the two CPUs is installed, in the first socket.
All 8 channels are populated with identical memory modules.
With only 4 channels populated (C, D, H, G), read is also about 1.4x higher than write.
Also, a strange squeal is periodically heard from the motherboard (as if a choke were whining in a power supply). Is this normal?
 

alex_stief

Well-Known Member
May 31, 2016
IIRC that should be normal for an EPYC 7302. Memory traffic has to travel between the compute dies and the I/O die, and that link is twice as wide for reads as it is for writes. With only 4 enabled compute dies on a 7302 (the maximum for EPYC Rome is 8), there is just not enough link bandwidth to saturate all 8 memory channels with write operations.
Unless you have some veeeery specific application, I would not worry about it. This was a deliberate design choice by AMD, because writes are usually less important than reads. Working in a field where memory bandwidth really matters, I tend to agree.

Edit: the fact that you still see significantly lower writes than reads with only 4 channels populated could have other reasons. AIDA64 not being the most accurate tool for testing on this kind of hardware is probably one of them.
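To put rough numbers on that explanation: the sketch below uses the commonly reported per-CCD link widths for Rome (32 bytes per fabric clock for reads, 16 for writes) and assumes DDR4-3200 with the fabric clock coupled 1:1, so treat it as an estimate rather than an AMD spec sheet.

# Rough estimate of the CCD-to-IOD write ceiling on a 4-CCD Rome part vs DRAM peak
FCLK_GHZ = 1.6             # fabric clock when coupled 1:1 with DDR4-3200
READ_BYTES_PER_CLK = 32    # commonly reported GMI2 link width for reads
WRITE_BYTES_PER_CLK = 16   # half that width for writes
CCDS = 4                   # an EPYC 7302 has 4 compute dies

write_ceiling = CCDS * WRITE_BYTES_PER_CLK * FCLK_GHZ   # GB/s
read_ceiling = CCDS * READ_BYTES_PER_CLK * FCLK_GHZ     # GB/s
dram_peak = 8 * 3200e6 * 8 / 1e9                        # 8 channels of DDR4-3200

print(f"write ceiling ~{write_ceiling:.0f} GB/s, "
      f"read ceiling ~{read_ceiling:.0f} GB/s, DRAM peak ~{dram_peak:.0f} GB/s")

Under those assumptions writes top out around 100 GB/s while reads and the DRAM peak sit around 200 GB/s, which lines up with the "roughly half" behaviour described above.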
 
Reactions: TXAG26 (Like)