So after a battery of tests this past week and weekend before putting our new box into production, I found what I think is a per-CCD throughput limitation that brings us back to needing to define NUMA domains at the CCD level to ensure throughput performance. This is especially true for VMs that have IOMMU-mapped NVMe drives attached to them.
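For anyone who wants to experiment with CCD-level placement, here is a rough sketch of the .vmx advanced options involved. The core range and node number below are illustrative assumptions for one 7002 SKU, not something you can copy blindly; map your actual CCD-to-core layout first.

```
# Expose smaller NUMA clients so the scheduler keeps the VM within one CCD
numa.vcpu.maxPerVirtualNode = "8"

# Optionally pin the VM to the physical cores of a single CCD
# (the "0-7" range here is an assumption; verify your topology first)
sched.cpu.affinity = "0-7"

# Keep the VM's memory on the matching NUMA node
numa.nodeAffinity = "0"
```

Note that hard affinity defeats the scheduler's load balancing, so I'd treat this as a benchmarking tool, not a production default.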
It appears that each CCD has a throughput limit of 80 GB/s, where reads are limited to 65 GB/s and writes to 45 GB/s per CCD. If you limit a VM to a single CCX on the CCD, then reads drop to 35 GB/s while writes stay at 45 GB/s. Spanning the VM across more full CCDs, I measured:

- 1 CCD: 80 GB/s total (65 GB/s read / 45 GB/s write), 129 ns memory latency
- 2 CCDs: 90 GB/s total (85 GB/s read / 51 GB/s write), 121 ns
- 4 CCDs: 120 GB/s total (115 GB/s read / 85 GB/s write), 110 ns

What's equally interesting is how the memory latency drops as the VM spans more CCDs.
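Plugging the read figures above into a quick sanity check shows the scaling is far from linear: aggregate bandwidth grows, but per-CCD read bandwidth actually falls as the span widens. A throwaway sketch, numbers taken straight from my runs:

```python
# Aggregate read bandwidth (GB/s) measured at each CCD span
reads = {1: 65, 2: 85, 4: 115}

for ccds, total in reads.items():
    per_ccd = total / ccds            # bandwidth each CCD contributes
    scaling = total / reads[1]        # speedup vs a single CCD
    print(f"{ccds} CCD(s): {total} GB/s total, "
          f"{per_ccd:.1f} GB/s per CCD, {scaling:.2f}x vs 1 CCD")
```

So going from one CCD to four buys only ~1.77x the read bandwidth, while each CCD's share falls from 65 to under 29 GB/s.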
In one of my tests with 7 P4610s in a Z2 pool, I was able to starve the memory/IO bandwidth to the drive cluster by limiting the VM to a single CCD. Meaning, in esxtop I was seeing %RDY spikes on cores 4 and 5 of that CCD, and on all of their SMT threads, while maxing out throughput to the Z2 pool.
In 90% of the configurations we would push on this platform, this is not an issue at all. But when talking IOMMU-mapped NVMe drives passed through to VMs and the like, where a single 7002 core is capable of pushing 9.8 GB/s of throughput, we will saturate a CCD's bandwidth very quickly as we stack up NVMe drives, particularly if we were to peak some of the newer NVMe NAND out there, such as the PM1733. This is especially true for larger core counts per CCD, since the hypervisor currently locks all of a VM's cores to the L3 cache domain.
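Back-of-the-envelope, using the 65 GB/s per-CCD read ceiling from my tests: assuming ~9.8 GB/s per core and ~7 GB/s sequential read for a PM1733-class drive (the 7 GB/s figure is my assumption from published specs, not something I measured), the math on how quickly a CCD saturates looks like this:

```python
import math

CCD_READ_LIMIT = 65.0   # GB/s, per-CCD read ceiling observed in my tests
CORE_RATE = 9.8         # GB/s a single 7002 core can push
DRIVE_RATE = 7.0        # GB/s, assumed PM1733-class sequential read

# How many cores or drives running flat out hit the per-CCD read limit
cores_to_saturate = math.ceil(CCD_READ_LIMIT / CORE_RATE)
drives_to_saturate = math.ceil(CCD_READ_LIMIT / DRIVE_RATE)
print(f"~{cores_to_saturate} cores or ~{drives_to_saturate} drives "
      f"at full tilt hit the per-CCD read ceiling")
```

In other words, well under a dozen passed-through drives behind one CCD is enough to hit the wall, which is why the CCD-level NUMA placement matters here.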
Mind you, this is under ESXi, and I am sure there is some tuning that VMware still needs to address in 6.7 Update 3 regarding CCD performance.
But I am wondering if anyone else has seen this behavior?