Application run time doubles with Xeon Gold 6154 over i9-7900X


Frank173

Member
Feb 14, 2018
75
9
8
Hello,

I am running into a problem with my new server that hosts 2 Xeon Gold 6154. I am comparing the time to run a process in an application I wrote between my new server that hosts 2 Xeon Gold 6154 vs my old workstation with an i9-7900X. The process I run is SINGLE THREADED and should only occupy one single core which it does.

The two CPUs' specs look as follows:

Xeon Gold 6154:
Base Clock: 3.00 GHz
Max Turbo: 3.70 GHz
All-Core Turbo: 3.70 GHz

i9-7900X:
Base Clock: 3.30 GHz
Max Turbo: 4.30 GHz

When I run the process it takes about 23 seconds to complete on the i9; on my new Xeon Gold 6154 it takes over 58 seconds, more than twice as long. I am stunned and hope I can tweak certain settings to improve performance. Please note that all other variables are virtually identical, including the same AVX/AVX2/AVX512 features, except memory: on the i9 I use 3200 MT/s non-ECC DDR4 modules, on the Xeon 2666 MT/s ECC modules, not a huge difference. Everything else matches, down to the SSDs and other storage.

Hence I am absolutely certain that the performance difference can only be explained by the different CPUs. A couple of things I noticed: on the i9-7900X I am able to designate specific cores to be preferred over others for heavy workloads, but I never changed the default settings. When I visualized core utilization, I saw that the core running at almost 100% constantly switches; the changing core IDs are all on the same node (same physical processor), but the constant core migration must still add some overhead. Could that be the reason for the performance delta?

My question: does anyone have experience in this area and can point me to a couple of things I could check or re-configure? Is there a way to lock specific cores on the Xeon Gold to be preferred over others?

There is no way that the slight clock-speed difference between the two CPUs alone explains this huge performance differential. Something else is going on here. I actually noticed the same slow speeds on another desktop machine with an AMD Threadripper and attributed it to the suboptimal fabric of the first-generation Threadripper, but I now suspect something else is at play.

Please, I am really at the point of frustration and would appreciate some advice. One big reason I bought the relatively fast Xeon 6154 (third-fastest in the line-up after the 6144 and 6146) was to run this specific application fast. If nothing else helps I might look into forcing a specific core to be used from within my code, but before I do that I want to consider other options, because the code ran very fast on my i9-7900X machine.

Thanks for your input!!!
 

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
1,394
511
113
First off, have you checked you're actually hitting your max turbo clocks on each CPU?

Memory speed can make a huge difference depending on the nature of the application. It's likely not the issue here, but it'd still be good to know what the memory timings are and what the application might be doing behind the scenes.

What OS are you running on? Have you done a timed bench logging process CPU, memory and IO stats? If you remove/disable one of the xeons (which will prevent any stalls if the process ever jumps from one CPU to another) do the results improve?
 

Frank173

Member
Feb 14, 2018
75
9
8
Yes, both Xeon CPUs reach 3.69 GHz turbo boost under stress tests, with all cores reportedly running at 3.69 GHz. My DDR4 ECC modules are CL19.

I am running Windows 10 for Workstations on both machines. I will remove one CPU tomorrow, see which core the process utilizes, and check whether full utilization still jumps from one core to another during the run. Any other thoughts? I am currently reading the following article and wonder whether the much poorer latency of the PAUSE instruction on Skylake CPUs might explain things, though I doubt it, because that issue is about thread contention, which is not the case here (Why Skylake CPUs Are Sometimes 50% Slower – How Intel Has Broken Existing Code).

First off, have you checked you're actually hitting your max turbo clocks on each CPU?

Memory speed can make a huge difference depending on the nature of the application. It's likely not the issue here, but it'd still be good to know what the memory timings are and what the application might be doing behind the scenes.

What OS are you running on? Have you done a timed bench logging process CPU, memory and IO stats? If you remove/disable one of the xeons (which will prevent any stalls if the process ever jumps from one CPU to another) do the results improve?
 

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
1,394
511
113
Depending on the motherboard you can sometimes disable one of the CPUs in the BIOS, but since you're running Windows you should be able to set affinity for the process either in Task Manager (if you're quick, or if it's a long-running process) or with the start command if you're launching from the command line. For example, the following will pin a process to CPU 2:
Code:
start /affinity 4 c:\path\to\my_benchmark.exe --benchmark-args go here
The affinity value is a hex bitmask, not a CPU ID: bit n enables logical processor n, so 0x4 means CPU 2 only, while 0x7 would allow CPUs 0-2. It gets more complicated when multiple physical CPUs are involved, so I'll have to go and re-read the docs, but that should be enough for a trial run at least.
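The bitmask arithmetic behind `/affinity` is easy to get wrong by hand; here is a minimal Python sketch (the helper name is mine, not part of any tool) that turns a list of logical CPU IDs into the mask the start command expects:

```python
def affinity_mask(cpu_ids):
    """Build the affinity bitmask used by `start /affinity`.

    Bit n of the mask enables logical processor n, so the mask for
    CPU 2 alone is 0b100 = 0x4, and CPUs 0-2 together give 0x7.
    """
    mask = 0
    for cpu in cpu_ids:
        mask |= 1 << cpu
    return mask


if __name__ == "__main__":
    print(hex(affinity_mask([2])))        # mask for CPU 2 only
    print(hex(affinity_mask([0, 1, 2])))  # mask allowing CPUs 0-2
```

On the command line, the first value corresponds to `start /affinity 4 my_benchmark.exe` to keep the process on logical CPU 2.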

Likewise, set up a perfmon trace for the process in question and compare the numbers across runs on the different CPUs. If they're both running the exact same process with the same data load, all of the numbers should be comparable; as a starter for ten, privileged and user CPU time should show roughly the same split, IO reads and writes should be the same, memory private working set and commit should be more or less the same, that sort of thing.

If your bench really is single-threaded then no, that issue with the PAUSE instruction probably shouldn't matter (depending on how it's written, of course; the developer might be using pauses for shigs'n'tiggles), but as it seems the bench isn't for public consumption, you'd need either a profiler run or a windbg trace to verify.

P.S. What's with this worrying trend of putting the quote you're replying to after the reply you've written...? Makes quoting very difficult to follow...

P.P.S. After re-reading my notes I remembered I ended up just using psexec instead of start, as its affinity syntax is vastly simpler: just use the -a switch with the processor number(s), no need to work out the bitwise hex value.
 
Last edited:

jahsoul

Active Member
Dec 13, 2013
262
34
28
War Eagle Country
Something else to look at: your i9 supports Turbo Boost Max 3.0, meaning that if you had the headroom, your CPU was running at 4.5 GHz. Also, what's the RAM configuration on both systems, and do you know whether the application is memory-bound?
 

Frank173

Member
Feb 14, 2018
75
9
8
Thanks for those pointers, will take a stab tomorrow and report back.

Re quoting others, I think it's accepted practice to reply above the quoted reference.

Depending on the motherboard you can sometimes disable one of the CPUs in the BIOS, but since you're running Windows you should be able to set affinity for the process either in Task Manager (if you're quick, or if it's a long-running process) or with the start command if you're launching from the command line. For example, the following will pin a process to CPU 2:
Code:
start /affinity 4 c:\path\to\my_benchmark.exe --benchmark-args go here
The affinity value is a hex bitmask, not a CPU ID: bit n enables logical processor n, so 0x4 means CPU 2 only, while 0x7 would allow CPUs 0-2. It gets more complicated when multiple physical CPUs are involved, so I'll have to go and re-read the docs, but that should be enough for a trial run at least.

Likewise, set up a perfmon trace for the process in question and compare the numbers across runs on the different CPUs. If they're both running the exact same process with the same data load, all of the numbers should be comparable; as a starter for ten, privileged and user CPU time should show roughly the same split, IO reads and writes should be the same, memory private working set and commit should be more or less the same, that sort of thing.

If your bench really is single-threaded then no, that issue with the PAUSE instruction probably shouldn't matter (depending on how it's written, of course; the developer might be using pauses for shigs'n'tiggles), but as it seems the bench isn't for public consumption, you'd need either a profiler run or a windbg trace to verify.

P.S. What's with this worrying trend of putting the quote you're replying to after the reply you've written...? Makes quoting very difficult to follow...

P.P.S. After re-reading my notes I remembered I ended up just using psexec instead of start, as its affinity syntax is vastly simpler: just use the -a switch with the processor number(s), no need to work out the bitwise hex value.
 

Frank173

Member
Feb 14, 2018
75
9
8
I get that, but my process now takes nearly three times as long, while the 4.5 GHz to 3.7 GHz frequency delta comes nowhere close to explaining that differential.

The process uses memory, but less than 10% of what's available; both machines have 128 GB of memory or more.

Something else to look at: your i9 supports Turbo Boost Max 3.0, meaning that if you had the headroom, your CPU was running at 4.5 GHz. Also, what's the RAM configuration on both systems, and do you know whether the application is memory-bound?
 

jahsoul

Active Member
Dec 13, 2013
262
34
28
War Eagle Country
In the grand scheme, that nearly 1 GHz increase is pretty significant for a single-threaded process (all things equal and in a perfect world), but the reason I asked about the RAM is that we didn't really know your Xeon setup. I ran 6 sticks and 8 sticks, and 6 ran better for me, but this is a tough one. All things being equal, I don't know what could account for such a huge jump, unless the clock speed really made that much of a difference. Honestly, the only thing I can think of is if you had some Optane in the mix. *shrugs*
I get that, but my process now takes nearly three times as long, while the 4.5 GHz to 3.7 GHz frequency delta comes nowhere close to explaining that differential.

The process uses memory, but less than 10% of what's available; both machines have 128 GB of memory or more.
 

Patrick

Administrator
Staff member
Dec 21, 2010
12,511
5,792
113
Maybe a good case to use the Intel performance profiling tools?
 

mstone

Active Member
Mar 11, 2015
505
118
43
46
Because English is a top-to-bottom language: a reader who hits the reply before the quote lacks the context to properly understand it, forcing them to read through the message to find the context and then reread the message (now with context) to understand it. And no, it is not generally accepted practice.
Re quoting others, I think it's accepted practice to reply above the quoted reference.
the reason I asked about the RAM was because we didn't really know your Xeon setup. I ran 6 sticks and 8 sticks and 6 ran better for me
It's a 6-channel chip: 8 sticks will give you either one third or two thirds of the total bandwidth of 6 sticks, depending on how badly unbalanced the memory is. And the memory in the Xeon system in question started out 20% slower... so depending on how it's provisioned, the Xeon may have as little as 25% of the memory bandwidth of the i9.
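As a back-of-the-envelope sketch of why channel population matters (nominal peak figures only; real throughput is lower, and an unbalanced configuration's exact penalty depends on how the controller interleaves), peak DDR4 bandwidth scales with interleaved channels times transfer rate:

```python
def peak_bw_gbs(channels, mt_per_s, bus_bytes=8):
    """Nominal peak DDR4 bandwidth in GB/s: channels x MT/s x 8 bytes."""
    return channels * mt_per_s * bus_bytes / 1000


i9 = peak_bw_gbs(4, 3200)      # i9-7900X: quad channel at 3200 MT/s
xeon_6 = peak_bw_gbs(6, 2666)  # one Xeon socket, all 6 channels at 2666 MT/s
xeon_2 = peak_bw_gbs(2, 2666)  # worst case: only 2 channels interleaving

print(f"i9: {i9:.1f} GB/s, Xeon balanced: {xeon_6:.1f} GB/s, "
      f"Xeon unbalanced: {xeon_2:.1f} GB/s")
```

Note that with all six channels populated per socket, the Xeon's nominal bandwidth actually exceeds the i9's; the penalty only appears when population is unbalanced.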

I am running into a problem with my new server that hosts 2 Xeon Gold 6154. I am comparing the time to run a process in an application I wrote between my new server that hosts 2 Xeon Gold 6154 vs my old workstation with an i9-7900X. The process I run is SINGLE THREADED
Honestly, it sounds like the system was spec'd badly from the beginning: if you need single-threaded performance for a dedicated app, a dual-socket configuration is almost certainly not going to provide the best result, and certainly not the best bang for the buck.
 

jahsoul

Active Member
Dec 13, 2013
262
34
28
War Eagle Country
It's a 6-channel chip: 8 sticks will give you either one third or two thirds of the total bandwidth of 6 sticks, depending on how badly unbalanced the memory is. And the memory in the Xeon system in question started out 20% slower... so depending on how it's provisioned, the Xeon may have as little as 25% of the memory bandwidth of the i9.
Yep... that was the first thing that popped into my mind when I read the OP. This reminds me that I need to stop being lazy and get going on my build.
 

Frank173

Member
Feb 14, 2018
75
9
8
Because English is a top-to-bottom language: a reader who hits the reply before the quote lacks the context to properly understand it, forcing them to read through the message to find the context and then reread the message (now with context) to understand it. And no, it is not generally accepted practice.

Happy to oblige to your preferences, after all it is you who is attempting to help me here.


It's a 6-channel chip: 8 sticks will give you either one third or two thirds of the total bandwidth of 6 sticks, depending on how badly unbalanced the memory is. And the memory in the Xeon system in question started out 20% slower... so depending on how it's provisioned, the Xeon may have as little as 25% of the memory bandwidth of the i9.

The memory is perfectly balanced: 12 sticks, 16 GB each, across 12 channels (6 per CPU). Please see the screenshots in my following posts where I ran general benchmarks. I'm not sure where you got the claim that this system runs 8 sticks.

Honestly, it sounds like the system was spec'd badly from the beginning: if you need single-threaded performance for a dedicated app, a dual-socket configuration is almost certainly not going to provide the best result, and certainly not the best bang for the buck.

It was perfectly spec'd from the beginning, with much consideration given to all future use cases. The server is supposed to:
a) run deep learning models -> 4 Titan RTX cards linked via 100 GB/sec NVLink;
b) load data stored on disk very fast -> 4 striped NVMe drives delivering sequential read and write speeds of over 8 GB/sec, a specific requirement for my use case;
c) run single-threaded back tests over vast amounts of data (hundreds of millions of data points), with occasional multi-threaded workloads for genetic-algorithm optimizations -> multi-threaded performance is superior, but I am currently stuck on single-threaded CPU performance, hence my post.

You can see that memory performance is in every regard vastly superior to any single-CPU solution currently on the market (ignoring unstable overclocked gaming rigs, perhaps). Multi-threaded performance also comes out on top, and single-threaded performance is on par in all the benchmark suites I ran (results posted in my next post), but single-threaded performance lags for currently unexplainable reasons in my specific application. I am targeting expected performance, not looking to magically beat single-CPU systems that run at higher frequencies.

Hence, I would ask you to just ask questions rather than making unsubstantiated claims that are factually false. If you need further information, I am happy to supply it. Just ask.
 
Last edited:

Frank173

Member
Feb 14, 2018
75
9
8
I ran some broad-based performance benchmarks, and all numbers point to a balanced and well-performing system across everything I ran so far: AIDA64 Engineer, SuperPi, Geekbench, and Cinebench, with the last being just a sanity check that multi-threaded performance runs at expected levels.

I am starting to suspect that my issues are very specifically related to how my C# process compiles under VS2017 and whether certain threading issues are involved. Even though the core process is single-threaded, because it is kicked off by a GUI and uses a message bus it does involve different threads, and I think @Patrick and @EffrafaxOfWug are spot on that I really need to use Intel's and Microsoft's performance profiling tools for further investigation...
 

Attachments

Last edited:

Frank173

Member
Feb 14, 2018
75
9
8
I'll be damned, but I just re-ran the profiler on my i9 machine and the process took about 50 seconds vs 60 seconds on my Xeon box. That is roughly 20% slower on the Xeon. The i9 with Turbo Boost Max 3.0 runs at 4.5 GHz vs 3.7 GHz on my Xeon box, which is also around a 20% differential.
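That first-order estimate can be checked quickly; a sketch assuming runtime scales inversely with clock frequency for a core-bound workload (a simplification that ignores IPC and memory effects):

```python
i9_ghz, xeon_ghz = 4.5, 3.7     # sustained single-core clocks
i9_secs, xeon_secs = 50.0, 60.0  # measured runtimes under the profiler

# If the workload is purely core-bound, runtime should scale with 1/frequency.
predicted_xeon = i9_secs * i9_ghz / xeon_ghz

print(f"predicted Xeon time: {predicted_xeon:.1f} s, measured: {xeon_secs:.1f} s")
print(f"clock deficit: {1 - xeon_ghz / i9_ghz:.0%}, "
      f"measured slowdown: {xeon_secs / i9_secs - 1:.0%}")
```

The predicted and measured Xeon runtimes land within a couple of seconds of each other, consistent with the clock delta explaining the remaining gap.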

That clearly points to either Visual Studio 2017 having changed certain compile procedures/settings recently (I just updated VS2017 to the latest version) or a recent Microsoft Windows update impacting performance. Just two weeks ago the same process took around 20-23 seconds to complete. I am zeroing in on the issue, though I cannot yet pinpoint the exact problem.
 
Last edited:

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
1,394
511
113
I'll be damned, but I just re-ran the profiler on my i9 machine and the process took about 50 seconds vs 60 seconds on my Xeon box. That is roughly 20% slower on the Xeon. The i9 with Turbo Boost Max 3.0 runs at 4.5 GHz vs 3.7 GHz on my Xeon box, which is also around a 20% differential.
Are you saying the same binary runs at the expected speeds on the xeon when run under the profiler? Or is it now running at expected speeds regardless? Or did you make some changes to the compilation of the binary (you mention updating VS and the OS but I'm not sure if you mean you've recompiled as well)?

Are you able to compile your code under a different compiler than MSVC (e.g. ICC, gcc, LLVM/clang) and see if the same discrepancies occurs there?
 
Last edited:

Frank173

Member
Feb 14, 2018
75
9
8
I was mistaken in believing the performance differential was explained by the different hardware. I measured performance a few weeks ago, when I only had my i9 workstation, and the process completed in 20-23 seconds. Then I ran the same process yesterday on my new Xeon box and was shocked that it took almost 60 seconds. Only later did I find out that the identical process, re-compiled, now also runs in 58 seconds on my i9 machine, pointing to a compilation issue or a Windows update that was perhaps automatically installed between the two runs. I have still not been able to identify the exact issue, because I verified that none of the compilation settings changed.

Are you saying the same binary runs at the expected speeds on the xeon when run under the profiler? Or is it now running at expected speeds regardless? Or did you make some changes to the compilation of the binary?
 

Frank173

Member
Feb 14, 2018
75
9
8
Happy to report that I solved the problem. Somehow the "Optimize code" flag, which was initially enabled, was disabled for the later build, which explains the slower run. It might have been a VS update glitch, as VS ran an update between the two benchmarks. So, nothing hardware-related; sorry to have rattled the cage here, but perhaps this will help someone in the future with a similar issue.
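For anyone hitting the same thing: in a classic VS2017 C# project, the "Optimize code" checkbox maps to the `<Optimize>` property in the .csproj, so the file is worth diffing across builds when performance changes unexpectedly. A sketch of the relevant fragment (the configuration name and other properties here are typical defaults, not taken from the thread):

```xml
<!-- Release PropertyGroup of a classic .csproj; "Optimize code" in the
     project's Build settings maps to the <Optimize> element below. -->
<PropertyGroup Condition=" '$(Configuration)|$(Platform)' == 'Release|AnyCPU' ">
  <Optimize>true</Optimize>  <!-- false reproduces the slow, unoptimized build -->
  <DebugType>pdbonly</DebugType>
  <OutputPath>bin\Release\</OutputPath>
</PropertyGroup>
```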
 