EPYC 7002 CCD Limitations


sirsquishy

New Member
Aug 6, 2018
So after a battery of tests this past week and weekend, before putting our new box into production, I found what I think is a per-CCD throughput limitation that brings us back to needing to define NUMA domains at the CCD level to ensure throughput performance. This is especially true for VMs that have NVMe drives attached via IOMMU passthrough.

It appears that each CCD has a throughput limit of 80GB/s, with reads limited to 65GB/s and writes limited to 45GB/s per CCD. If you limit a VM to a single CCX on the CCD, reads drop to 35GB/s while writes stay at 45GB/s. Span the VM across two full CCDs and throughput maxes out at 90GB/s, with reads at 85GB/s and writes at 51GB/s. Span it across four full CCDs and max throughput is 120GB/s, with reads at 115GB/s and writes at 85GB/s. What's equally interesting is how the memory latency drops during these tests: a single CCD sits at 129ns, two CCDs drop to 121ns, and four CCDs drop to 110ns.

In one of my tests, with 7 P4610s in a z2 pool, I was able to starve the memory/IO bandwidth to the drive cluster by limiting the VM to a single CCD: in ESXTOP I was seeing %RDY spikes on cores 4 and 5 of that CCD, and on all of their SMT threads, while maxing out throughput to the z2 pool.

In 90% of the configurations we would push on this platform, this is not an issue at all. But when talking about IOMMU-mapped NVMe drives attached to VMs and the like, where a single 7002 core is capable of pushing 9.8GB/s of throughput, we will saturate CCD throughput very quickly as we stack NVMe drives, especially if we were to peak some of the newer NVMe NAND that is out there, such as the PM1733. This is especially true for larger core counts per CCD when the hypervisor locks all cores to the L3 cache domain based on its current logic.
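To make the saturation point concrete, here is a quick back-of-the-envelope sketch in Python using only the numbers above (the 9.8GB/s per-core figure and the read ceilings I measured per CCX and per CCD); the 4-cores-per-CCX / 8-cores-per-CCD layout is just the standard Rome topology:

# Compare what the cores can push vs. the read ceilings observed above (all GB/s).
PER_CORE_GBPS = 9.8      # per-core throughput figure from my testing
CCX_READ_LIMIT = 35.0    # observed read ceiling inside one CCX
CCD_READ_LIMIT = 65.0    # observed read ceiling for one CCD

for label, cores, limit in [("one CCX (4 cores)", 4, CCX_READ_LIMIT),
                            ("one CCD (8 cores)", 8, CCD_READ_LIMIT)]:
    demand = cores * PER_CORE_GBPS
    status = "already past the ceiling" if demand > limit else "has headroom"
    print(f"{label}: ~{demand:.1f} GB/s potential vs ~{limit:.0f} GB/s limit ({status})")

Even a single fully loaded CCX (~39GB/s of potential demand) is past its 35GB/s read ceiling, and a full CCD (~78GB/s) is past its 65GB/s ceiling, which is why the placement matters.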

Mind you, this is under ESXi, and I am sure there is some tuning that VMware still needs to address in 6.7 Update 3 regarding CCD performance.

But I am wondering if anyone else has seen this behavior?
 

maes

Active Member
Nov 11, 2018
Unless I am mistaken, that sounds like the limitation might be in the interconnect architecture. From memory, with Zen 2 every connection 'external' to the cores goes through the I/O die, to which the CCDs are connected over Infinity Fabric. By limiting a VM to only one CCD you might be getting close to saturation on that specific IF link?

The Infinity Fabric in the Rome chips can do a 32 byte read and a 16 byte write per fabric clock, so that might explain the difference in read/write speeds.

Might be worth looking into whether there's any control over the IF link frequency, like what's offered on some consumer/prosumer Ryzen motherboards. It used to be that the IF ran at the same clock as the memory (e.g. 1333MHz for DDR4-2666), but on Zen 2 that might be different.
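For what it's worth, a quick sanity check in Python of what those per-fclk widths work out to, assuming the fabric clock still tracks the memory clock (an assumption, since Rome may decouple them):

# Naive per-CCD IF link bandwidth from the widths quoted above:
# 32 bytes/clk read, 16 bytes/clk write, at FCLK == memory clock.
READ_BYTES = 32
WRITE_BYTES = 16

for label, fclk_mhz in [("DDR4-2666", 1333), ("DDR4-2933", 1467), ("DDR4-3200", 1600)]:
    read_gbs = READ_BYTES * fclk_mhz / 1000    # bytes/clk * MHz = MB/s, /1000 -> GB/s
    write_gbs = WRITE_BYTES * fclk_mhz / 1000
    print(f"{label}: ~{read_gbs:.1f} GB/s read, ~{write_gbs:.1f} GB/s write per link")

Those land a bit under the per-CCD numbers in the first post, so either some of that measured traffic is being served out of cache or this naive model is missing something, but it is at least in the right ballpark for a single-link bottleneck.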
 

sirsquishy

New Member
Aug 6, 2018
This is on a Dell R7515; currently there aren't any IF-specific configuration options like there are on AM4/consumer platforms. If the 32-byte read and 16-byte write per fabric clock are hard-coded limits, that would explain the single-CCD bottleneck I am seeing. Like I said already, unless the system is pushing full NVMe performance this isn't a huge issue, but in the case where a VM is using NVMe DAS you would want to spread its cores across all CCDs to get maximum throughput without starving out the rest of the resources on those CCDs. I currently have SRs open with Dell and VMware to see if there is any way we can mask based on CCD for such a configuration.

The other issue I found is that inside the CCX boundary on the CCD, memory reads are limited to 35GB/s, and that will affect more than 80% of the VMs out there, since the max core/SMT count per CCX is 4c/8t. But since ESXi has no way of addressing a CCX or CCD directly without over-allocating cores, this requires further testing.
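In the meantime, the closest knob I can find is the documented NUMA advanced options in the .vmx, something roughly along these lines (just a sketch, and it only gets down to whatever NUMA domains the BIOS exposes, e.g. NPS4 or, if the R7515 BIOS offers it, the L3-cache-as-NUMA-domain option, not to individual CCDs; whether 6.7U3 honors this cleanly for this use case is exactly what the SRs are for):

numa.nodeAffinity = "0,1"
numa.vcpu.maxPerVirtualNode = "4"

numa.nodeAffinity restricts the VM's NUMA clients to the listed host nodes, and numa.vcpu.maxPerVirtualNode caps vCPUs per virtual NUMA node so the scheduler has to spread them across nodes, but neither setting is CCX/CCD-aware on its own.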