Single vs Dual CPU performance differences?

ycp

Member
Jun 22, 2014
175
8
18
Hey,

I have 3 of the same model Intel 24-core ES CPUs.
One of them is in a single-socket Supermicro motherboard.
The other two are in a dual-socket Supermicro motherboard.

The system with two of the CPUs doesn't perform at double the speed of the single-CPU system.
I am rendering 3D animations, and I am getting a 1.6x improvement in render times with the dual-socket system compared to the single-socket system.

Is this normal behavior or could there be another problem?
 

Patrick

Administrator
Staff member
Dec 21, 2010
12,010
4,994
113
Rendering often scales well, so 1.6x seems a bit low. It depends on the software/workloads as much as the hardware.
 

PigLover

Moderator
Jan 26, 2011
2,976
1,283
113
Reasonably normal.

You would only see 2x performance if your application could keep twice as many CPUs busy at 100% utilization. This would be a very rare application. Since you are rendering, it is likely that the application is at least partially limited by IO (its ability to move data to/from the disks) and may perhaps have a limit on the number of threads it can use effectively.

A 1.6x performance gain on renders going to a 2-socket system with identical CPUs is probably a midrange result.
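The point above can be made concrete with Amdahl's law: if a fraction p of the total runtime actually benefits from the second socket, the best-case speedup from doubling resources is 1/((1-p) + p/2). A quick sketch (the 1.6x figure in this thread, taken at face value, would correspond to roughly p = 0.75 — an illustrative number, not a measurement of this particular workload):

```python
# Amdahl's law: speedup from n times the compute resources when only
# a fraction p of the runtime actually benefits from the extra capacity.
def speedup(p, n=2):
    """Best-case speedup with n times the resources, parallel fraction p."""
    return 1.0 / ((1.0 - p) + p / n)

# Solving 1.6 = 1 / ((1 - p) + p/2) for p gives p = 0.75,
# i.e. a 1.6x gain from a second socket is consistent with about
# three quarters of the runtime scaling across both CPUs.
print(speedup(0.75))  # -> 1.6
```

Note that even with p = 0.95 the same formula only gives about 1.9x, which is part of why a clean 2x from a second socket is so rare.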
 

ycp

Member
Jun 22, 2014
175
8
18
The CPU usage on all cores is 100% on both systems, so it seems the rendering software is able to use all the cores.
I have 64 GB of memory in each system, but only about 30 GB is used during my renders.
 

Patrick

Administrator
Staff member
Dec 21, 2010
12,010
4,994
113
What does the memory topology look like? How many DIMMs? Skylake or E5?
 

ycp

Member
Jun 22, 2014
175
8
18
The CPUs are Skylake, Socket 3647.
I have four 16 GB DIMMs in each system.
In the dual-CPU system I have two DIMMs per CPU, for a total of 64 GB.
In the single-CPU system I have four DIMMs, for a total of 64 GB.
The memory is Micron 2133 MHz in both systems.
 

Patrick

Administrator
Staff member
Dec 21, 2010
12,010
4,994
113
Sometimes scaling is impacted by the number of DIMMs per CPU. I am actually running a single socket AMD EPYC system through a few benchmarks at different DIMM counts (2, 4, 8) to see performance impacts right now.
 

mstone

Active Member
Mar 11, 2015
505
117
43
43
The CPUs are Skylake, Socket 3647.
I have four 16 GB DIMMs in each system.
In the dual-CPU system I have two DIMMs per CPU, for a total of 64 GB.
In the single-CPU system I have four DIMMs, for a total of 64 GB.
The memory is Micron 2133 MHz in both systems.
Sounds starved for memory bandwidth.
 

alex_stief

Active Member
May 31, 2016
673
213
43
35
Two DIMMs per CPU means less memory bandwidth, which can affect performance.
Then again, it matters where those DIMMs are populated. Make sure both CPUs are running in dual-channel mode. You could also borrow the memory from the single-socket system for a test with quad-channel memory; that could expose a bandwidth limit.
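If you want a quick relative check before reseating DIMMs, a crude copy benchmark can be run on both boxes. This is only a sketch: a single Python thread cannot saturate multi-channel DDR4, so the absolute numbers will be far below what a proper tool like STREAM or AIDA64 reports — but a large gap between the two systems on the same test can still hint at a population problem:

```python
import time

def copy_bandwidth(size_mb=256, repeats=5):
    """Time a large in-memory copy and report GB/s.

    Single-threaded, so this is a rough lower bound, not a real
    bandwidth measurement like STREAM.
    """
    src = bytearray(size_mb * 1024 * 1024)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        dst = bytes(src)          # one full read pass + one write pass
        best = min(best, time.perf_counter() - t0)
    # Factor of 2: the buffer is read once and written once per copy.
    return 2 * size_mb / 1024 / best

print(f"~{copy_bandwidth():.1f} GB/s (single-thread copy)")
```

Run it on the single-socket and the dual-socket machine; the comparison between the two matters more than the absolute number.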
 

ycp

Member
Jun 22, 2014
175
8
18
Sometimes scaling is impacted by the number of DIMMs per CPU. I am actually running a single socket AMD EPYC system through a few benchmarks at different DIMM counts (2, 4, 8) to see performance impacts right now.
Hey hope you write an article on your findings.
 

ycp

Member
Jun 22, 2014
175
8
18
Two DIMMs per CPU means less memory bandwidth, which can affect performance.
Then again, it matters where those DIMMs are populated. Make sure both CPUs are running in dual-channel mode. You could also borrow the memory from the single-socket system for a test with quad-channel memory; that could expose a bandwidth limit.
I have another 64 GB of memory; I will add it to the existing 64 GB, making four 16 GB DIMMs per CPU, and see if it performs better.
Will keep you guys posted.
 

ycp

Member
Jun 22, 2014
175
8
18
Two DIMMs per CPU means less memory bandwidth, which can affect performance.
Then again, it matters where those DIMMs are populated. Make sure both CPUs are running in dual-channel mode. You could also borrow the memory from the single-socket system for a test with quad-channel memory; that could expose a bandwidth limit.
Ok, so I added another four 16 GB DIMMs to the dual-CPU system, bringing it to 128 GB total, with four 16 GB DIMMs per CPU.
Even after adding more memory, the speed difference between the dual- and single-CPU systems hasn't changed. So I don't think there was a memory bandwidth issue with less memory.
So back to the drawing board I go, to find a way to get more performance out of a dual-CPU system.
 

alex_stief

Active Member
May 31, 2016
673
213
43
35
Is the memory actually populated correctly?
Did you disable NUMA?
Have you checked if scaling is even linear up to 24 cores on the single-socket system?
How is performance running on 24 cores of the dual-socket system? Check both configurations: 12 cores on each CPU, and all 24 cores of one CPU.
 

ycp

Member
Jun 22, 2014
175
8
18
Hey Alex,
Sorry, I have no idea what NUMA is or where I can disable it.
The scaling is linear on the single-socket system, because CPU usage is 100% on all cores throughout the render.
The same goes for the dual-socket system.
 

alex_stief

Active Member
May 31, 2016
673
213
43
35
Sorry, I have no idea what NUMA is or where I can disable it.
NUMA (non-uniform memory access) means each CPU in a multi-socket system has its own local memory; the setting is a BIOS option. You would need to test which setting works better for you.
The scaling is linear on the single-socket system, because CPU usage is 100% on all cores throughout the render.
Scaling is not determined by how much CPU usage the operating system shows. You would actually have to test strong scaling by running the application on 1, 2, 4, 8, 16 and 24 cores, ideally with turbo boost disabled so it does not skew the results. If 24 cores are not ~24 times faster than one core, you have an application/workload that simply does not scale linearly up to 24 cores, so you cannot expect it to scale linearly up to 48 cores. Edit: unless of course the bottleneck is a shared CPU resource like memory bandwidth, and going to 48 cores means having 2 CPUs.
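As an illustration of the strong-scaling test described above, here is a sketch with a toy CPU-bound task standing in for the renderer (the real test would use V-Ray's own thread-count setting and render the same scene at each core count):

```python
import time
from multiprocessing import Pool

def burn(n):
    # Toy CPU-bound task standing in for one render bucket.
    s = 0
    for i in range(n):
        s += i * i
    return s

def timed(workers, tasks=48, work=200_000):
    """Run a fixed total amount of work on a given number of workers
    and return the wall-clock time (strong scaling: same work, more cores)."""
    t0 = time.perf_counter()
    with Pool(workers) as pool:
        pool.map(burn, [work] * tasks)
    return time.perf_counter() - t0

if __name__ == "__main__":
    base = timed(1)
    for w in (1, 2, 4, 8):
        t = timed(w)
        print(f"{w:2d} workers: {t:.2f}s  speedup {base / t:.2f}x")
```

If the speedup curve flattens well before the core count even on a toy task, the machine itself is part of the story; if the toy task scales but the renderer does not, the limit is in the application.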

You keep ignoring my question about how the memory was populated. Are you 100% sure that you did a good job there?
 

ycp

Member
Jun 22, 2014
175
8
18
Hello Alex,

Sorry, I don't mean to ignore your question about the memory population. The motherboard I have is a Supermicro X11DPH-T, and I have installed the memory exactly the way the manual says to: basically, if you have 2 CPUs and 8 DIMMs, populate these specific slots. That's exactly what I have done, so I think the memory is installed correctly.

I will try the BIOS option for NUMA.

Now I understand what you mean by proper scaling. I will run tests with fewer cores and see if it scales.

Thanks for all your help.
 

ycp

Member
Jun 22, 2014
175
8
18
Is the memory actually populated correctly?
Did you disable NUMA?
Have you checked if scaling is even linear up to 24 cores on the single-socket system?
How is performance running on 24 cores of the dual-socket system? Check both configurations: 12 cores on each CPU, and all 24 cores of one CPU.

Hey, I tried NUMA; it made no difference in performance. I checked Windows Task Manager and changed the CPU view to NUMA nodes, but it shows 2 nodes both before and after disabling NUMA. I know for sure I disabled NUMA in the BIOS, but maybe Windows doesn't reflect the change.


As far as scaling goes, I wasn't able to test it by disabling cores in the BIOS, because that led to blue screens and Windows wouldn't boot.
So I ran a few tests using the affinity settings in Windows, but couldn't come to any clear conclusion.

Also, I use V-Ray for rendering, and they make their own benchmark. After running that benchmark and comparing my results to results from other people with similar hardware, it seems that these new Xeon Scalable CPUs don't scale that well at really high core counts.
Other people were seeing gains of around 1.7-1.8x when comparing single-socket to dual-socket systems with high core counts.
CPUs with fewer than 20 cores seem to scale well from one socket to two, so it seems the software is really not optimized yet for lots of CPU cores.

A searchable database of benchmark results for many different CPU models is linked below:
V-Ray Benchmark | Chaos Group

I might be wrong with my analysis, but I hope someone will let me know if I screwed up.
 

alex_stief

Active Member
May 31, 2016
673
213
43
35
Scaling analysis is usually done at the software level because it is much more convenient, i.e. leaving all cores active but restricting the number of cores used directly in the software.
Edit: Windows shouldn't show two NUMA nodes with NUMA disabled. No idea what went wrong there.
 
Last edited: