Dual CPU Bottlenecks


latot

Member
May 7, 2023
36
0
6
Hi all, I was on the LTT forums checking out dual Xeon E5-2699 v4 vs Threadripper 2990 (buying used ones on a low budget) and they told me... forget about Threadripper and go for Epyc!

Well, in short, they recommended the Epyc 7742 or 7551 and similar.

So, what's the point? I was reading about a dual-socket mobo and found its block diagram... all the resources are connected to either one CPU or the other, which means there are cases where you can hit bottlenecks!

I want to build a PC for... math, calculations, physics simulations, databases, web servers, APIs, programming, well a looot of things. They are pretty heavy processes and I want to run them at the same time, probably split across VMs or similar.

So when I think about it... most likely the bottleneck can be a problem, but it depends! Here is what's unknown to me, so let me give some examples:

Say I have a mobo with 128GB of RAM (64GB per CPU) and 2TB of storage (1TB per CPU), and I run processes like:

1) 2 threads using 100GB: whichever CPU they are placed on, data will need to move from one CPU to the other, causing a bottleneck
2) The same as above, but reading data from the disks (both of them)
3) The same with any PCIe device...
4) The same as above in any combination

Handling and organizing a dual-CPU mobo seems hard. Optimizing processes is important, but we usually can't tell programs "please use this core"... so having a lot of things running can really impact performance.

Maybe the question of this topic would be... for Xeon and Epyc, are there things we need to know or do to use them well? Or is it better to go with just one CPU instead of two? Or are there only particular cases where a dual-CPU setup can shine?

On the LTT forum they told me that handling this was a problem with AMD Opteron.

Well, if the mobos are good at organizing processes and resources, everything should be fine :)

Thx!
 

heromode

Active Member
May 25, 2020
380
201
43
You're gonna be dealing with NUMA (non-uniform memory access), numactl, CPU pinning, etc.
Current Proxmox supports CPU pinning, but not numactl socket pinning.
The path between the CPUs is called QPI (QuickPath Interconnect).

Basically, any VM you run that uses a PCIe device in passthrough, you want to pin to the CPU hosting its PCIe lanes.
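For example, to check which socket a passthrough card actually hangs off, you can ask sysfs directly (quick sketch, the PCI address is made up - get yours from lspci):

Code:
# which NUMA node (socket) owns this PCIe device
cat /sys/bus/pci/devices/0000:41:00.0/numa_node

That node number is the socket whose cores you want to pin the VM to.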

It's a pretty big learning curve; in that sense a single-socket solution takes away lots of tinkering. There is lots of debate and lots of testing to be done.

here is a random proxmox thread about the subject: CPU pinning?
 

alex_stief

Well-Known Member
May 31, 2016
884
312
63
38
Full disclosure: you opened a can of worms here. You already got some pointers to start reading if you really want to get to the bottom of this.
Just a word of advice: single CPUs are not immune to any of this. Especially some Threadripper 2000 series had a pretty ridiculous NUMA topology. And Epyc CPUs like the 7742 are basically "NUMA light" on a single package.

If we oversimplify things a bit, there are two reasons why you would want to deal with the added complexity of 2 or more CPUs in a shared memory system:
1) You need more cores than a single CPU can provide
2) You need more memory bandwidth than a single CPU has
If you can answer no to both of these, don't bother with it.
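If you want to see what topology a given chip actually exposes before worrying about any of this, a quick check from a Linux shell (just a sketch, output obviously depends on the CPU and BIOS settings):

Code:
numactl --hardware     # nodes, CPUs and memory per node, inter-node distances
lscpu | grep -i numa   # quick summary of NUMA node count and CPU lists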
 

nexox

Well-Known Member
May 3, 2023
667
275
63
Multi-socket systems do have a more complicated topology, but the links between sockets (QPI for E5 v4 Xeons) are rather fast - around half of the total possible memory bandwidth for a single socket (and you'll pretty much always get full QPI bandwidth, while you'll almost never achieve peak rated memory bandwidth). This could theoretically limit memory IO, but for anything on the PCI-e bus you'll only get a bit of added latency; for either of these performance characteristics to become a bottleneck requires particular, and not terribly common, resource usage from the application. At work we use almost exclusively dual-socket machines, so we tested some NUMA-aware optimizations a while back and saw no performance difference - we're almost always limited by CPU speed or storage latency, so reading memory from the other socket wasn't an issue at all.

The placement of processes and memory allocations isn't handled by the motherboard, the OS makes those decisions, and modern Linux kernels are pretty good about handling this well enough for most applications. Additionally, you can tell a process which core(s) to use, and since you can look up which cores are associated with a given NUMA node, it's also relatively easy to set a process to run on one node. I don't use Proxmox, but KVM can do this, libvirt has specific functions for it, and Red Hat has a detailed tutorial: 33.8. Setting KVM processor affinities Red Hat Enterprise Linux 5 | Red Hat Customer Portal
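For a bare-metal process that boils down to something like this (a sketch - ./my_app, the guest name and the node/core numbers are placeholders):

Code:
# run a process with both CPU and memory bound to NUMA node 0
numactl --cpunodebind=0 --membind=0 ./my_app
# for a libvirt guest: pin vCPU 0 to host cores 0-7 and keep guest memory on node 0
virsh vcpupin myguest 0 0-7
virsh numatune myguest --nodeset 0 --mode strict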
 

heromode

Active Member
May 25, 2020
380
201
43
but for anything on the PCI-e bus you'll only get a bit of added latency; for either of these performance characteristics to become a bottleneck requires particular, and not terribly common, resource usage from the application. At work we use almost exclusively dual-socket machines, so we tested some NUMA-aware optimizations a while back and saw no performance difference
This. Linux kernel is really good at this. I run 2x Quadro P620s on a C612 mobo in PCIe passthrough for two desktop VMs, and I pin the vCPUs of those VMs to the socket hosting the PCIe lanes. Then I run other VMs with NUMA awareness enabled in KVM (Proxmox), not pinned to anything, and trust NUMA and the Linux kernel to take care of it. Been fine so far.
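In Proxmox terms that ends up as a few lines in the VM config, roughly like this (a sketch assuming a recent PVE release that has the CPU affinity option; the VMID, core list and PCI address are made up):

Code:
# /etc/pve/qemu-server/101.conf (excerpt)
cores: 8
numa: 1                     # enable NUMA awareness for the guest
affinity: 0-7               # keep the VM's threads on host cores 0-7 (the GPU's socket)
hostpci0: 03:00.0,pcie=1    # the passed-through GPU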

edit: oh yeah you want to be sure x2apic is enabled in mobo bios, and check dmesg for it.
 
  • Like
Reactions: T_Minus and nexox

RolloZ170

Well-Known Member
Apr 24, 2016
5,361
1,612
113
1) 2 threads using 100GB: whichever CPU they are placed on, data will need to move from one CPU to the other, causing a bottleneck
Note that you can have this within a single-socket CPU too, e.g. EPYC Rome:
data may need to move from one CCX to another CCX, traveling over the fabric (IO die).
CCX -> copper -> IO die -> copper -> CCX
 
  • Like
Reactions: T_Minus

heromode

Active Member
May 25, 2020
380
201
43
Code:
# dmesg | grep x2apic
[    0.000827] x2apic: enabled by BIOS, switching to x2apic ops
[    0.010999] Setting APIC routing to cluster x2apic.
[    0.794764] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[    0.795527] DMAR-IR: Enabled IRQ remapping in x2apic mode
edit: I remember in my BIOS there's a setting for enabling x2apic, and next to it a setting for enabling or disabling some other x2apic-related thing; if you enable that second option, it will cause the Linux kernel to DISABLE x2apic support at boot. That's why it's important to check dmesg after you play with the BIOS x2apic settings.
 
Last edited:

nexox

Well-Known Member
May 3, 2023
667
275
63
This. Linux kernel is really good at this.
This is true, but for various reasons our application doesn't even get the benefit of the kernel's optimizations - statistically, 50% of memory accesses happen on the wrong socket - yet it still doesn't matter, because we can't actually process data at anywhere near memory speed. Obviously other applications are going to be more sensitive than ours, but most of them will get all they need from the kernel with no extra effort.


If you can answer no to both of these, don't bother with it.
There's always point 3) Dual sockets are cool. I keep getting close to pulling the trigger on an X9DRH-7TF and another E5-2690v2 for my desktop, the only thing stopping me is the lack of a 32 bit PCI slot for my 22 year old Audigy, but on the other hand, I'd be that much closer to hitting 512GB of memory.
 

latot

Member
May 7, 2023
36
0
6
OK.....

There is a lot of info! Great!

A question not related to the comments, but prompted by them... if I have a dual-socket mobo but only use one CPU with all the RAM slots (16 of them), would that be similar to a single-socket mobo with all 16 slots full?

here is a random proxmox thread about the subject: CPU pinning?
Reading it now, it has some great options.

Especially some Threadripper 2000 series had a pretty ridiculous NUMA topology. And Epyc CPUs like the 7742 are basically "NUMA light" on a single package.
How is that? Even a single CPU needs to use NUMA?!! Is that how they are able to use so much RAM in the first place? Where can I find more documentation on this?

The placement of processes and memory allocations isn't handled by the motherboard, the OS makes those decisions, and modern Linux kernels are pretty good about handling this well enough for most applications. Additionally, you can tell a process which core(s) to use, and since you can look up which cores are associated with a given NUMA node, it's also relatively easy to set a process to run on one node. I don't use Proxmox, but KVM can do this, libvirt has specific functions for it, and Red Hat has a detailed tutorial: 33.8. Setting KVM processor affinities Red Hat Enterprise Linux 5 | Red Hat Customer Portal
Mmmm, that is great! Can we set, for example... the PCIe or something about the RAM? Or must that be managed at a lower level?

edit: oh yeah you want to be sure x2apic is enabled in mobo bios, and check dmesg for it.
I'll check it!

Note that you can have this within a single-socket CPU too, e.g. EPYC Rome:
data may need to move from one CCX to another CCX, traveling over the fabric (IO die).
CCX -> copper -> IO die -> copper -> CCX
How much impact does that have?


Thxx!!
 

latot

Member
May 7, 2023
36
0
6
There's always point 3) Dual sockets are cool. I keep getting close to pulling the trigger on an X9DRH-7TF and another E5-2690v2 for my desktop, the only thing stopping me is the lack of a 32 bit PCI slot for my 22 year old Audigy, but on the other hand, I'd be that much closer to hitting 512GB of memory.
What can I say... I agree with point 3 :D
 

nexox

Well-Known Member
May 3, 2023
667
275
63
Can we set, for example... the PCIe or something about the RAM? Or must that be managed at a lower level?
The memory and PCIe devices are permanently connected to a particular socket, but you can find out which cores are on the socket for a given PCIe device and set a process's CPU affinity to those cores; the kernel will then default to placing that process's memory allocations in the local memory for that socket when possible. You may need to think ahead with PCIe device placement if you need two devices for a single application/VM; Supermicro (and probably others) make it pretty easy by labeling the CPU for each slot right on the board.
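Something like this, for example (a sketch - the PCI address and PID are made up):

Code:
# cores that are local to a given PCIe device
cat /sys/bus/pci/devices/0000:81:00.0/local_cpulist
# pin an already-running process (PID 1234) to those cores, e.g. 8-15
taskset -cp 8-15 1234

Once the process is confined to those cores, the kernel will prefer that socket's memory for its new allocations.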
 

heromode

Active Member
May 25, 2020
380
201
43
NUMA = non-uniform memory access, i.e. the Linux kernel automatically tries to use the RAM closest to the process accessing it. Without NUMA awareness, on a dual-socket system with 2x32GB per socket (128GB total), every process would spread its RAM across both sockets. A process requiring 8GB would run with 4GB sitting across the QPI link on the second CPU. The performance hit would be relative to that: if QPI bandwidth and latency are half those of a local RAM operation, the overall efficiency of that process would drop by roughly 25%.

But with NUMA awareness, unless you have specific issues - like gamers with PCIe passthrough, or other extremely latency-sensitive applications - you can just let the modern Linux kernel handle it. If you run a GPU with PCIe passthrough in a VM, you might want to pin the vCPUs/threads of that VM to the CPU hosting the GPU. But then again, you might just forget about it and it will all work just as well.

If you have 2 sockets, just be sure to enable NUMA awareness in KVM. That's it.

edit: apt install numactl
man numactl
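A couple of other things in the numactl package worth knowing (sketch; ./big_job is a placeholder):

Code:
# spread a memory-hungry job's allocations evenly across both nodes
numactl --interleave=all ./big_job
# per-node numa_hit / numa_miss counters, to see how often allocations land remote
numastat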

Also, Intel has some great articles for highly specific workloads that require the absolute maximum performance.
 
Last edited: