E5v4 Xeon vs EPYC AI acceleration? (CPU utilization differences between platforms)


gsrcrxsi

Active Member
So I do a lot of BOINC processing, and with a particular application on the GPUGRID project, I'm seeing wildly different CPU utilization that seems to be related to the Intel vs AMD platform. The same behavior has also been confirmed on lower-end consumer AMD parts (3000/5000 series).

The application is their Python GPU application, which diverges from conventional BOINC GPU tasks in that it's a multithreaded CPU app with CUDA support rather than a pure GPU application. Each task spawns 32 threads (simulations) plus the main application process. See here for more info about what the application is doing, if interested: https://gpugrid.net/forum_thread.php?id=5233

When running this application on my Xeon E5-2697Av4 (16c/32t) system (Linux, Ubuntu 22.04, 5.15 kernel), the total amount of CPU used is rather low: about 1-2 threads per process (100-200% process CPU). But I seem to start hitting OS scheduler bottlenecks, as attempting to scale this up hits diminishing returns with too many concurrent processes to be serviced by only 32 threads, even with low total utilization. I saw the same low-CPU-use behavior on an even older Xeon E5-1680v2.

So I moved the same GPU and software over to an AMD EPYC platform with a 7443P, to get more threads and, hopefully, better IPC. But what I found was puzzling: the AMD system used WAY more CPU to do the same work as the old Xeon, roughly 5x the CPU per process (500-700%), negating any improvement from having more threads, and the tasks did not really process any faster despite the higher CPU use.

So my question: is there some kind of lesser-known AI acceleration present on the Intel Xeon CPUs? I know the 3rd-gen Scalable Xeons tout their built-in AI acceleration, but I can't find any material that mentions anything like this on the older v2/v3/v4 Xeons.

Or are there any other kinds of hardware accelerators at play that are present even in these older Xeons but not in the more modern AMD EPYC and Ryzen CPUs? Or maybe even some kind of kernel/software thing?

I did try running the 6.0 Linux kernel when it first came out, since there was all that buzz about some 20-year-old bug limiting performance on AMD, but that seemed to have no effect here and I still saw high CPU use on AMD.
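(In case it's useful for comparing the two boxes, here's a minimal sketch that just dumps the relevant feature flags from /proc/cpuinfo; run it on each system and diff the output. Nothing in it is GPUGRID- or BOINC-specific, and the flag list is just the usual "AI acceleration" suspects.)

Code:
# Quick check of which instruction set extensions each host exposes.
# Reads the standard Linux /proc/cpuinfo "flags" line.
def cpu_flags(path="/proc/cpuinfo"):
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
for feat in ("avx", "avx2", "fma", "avx512f", "avx512_vnni", "avx512_bf16", "amx_tile"):
    print(f"{feat:12s}", "yes" if feat in flags else "no")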

CPU use on Intel Xeon E5-2697Av4: [screenshot]

CPU use on AMD EPYC 7443P: [screenshot]

Patriot

Moderator
Intel added VNNI in Cascade Lake and bfloat16 in Ice Lake.
You have AVX2 in v4, but no specific AI acceleration.
 

gsrcrxsi

Active Member
AVX2 is present in the EPYC as well.

Any other reason for the big difference in CPU use between the platforms? Favoring the much older Xeon feels very strange without any specific acceleration in play.
 

gsrcrxsi

Active Member
It's a Python script. These are its imports:

Code:
import time
import glob
import yaml
import torch
import wandb
import shutil
import random
import argparse
import numpy
import torch.nn
import pytorchrl
 

gsrcrxsi

Active Member
It's MKL, 2019.0.4. Edit: see my reply further down.

I already went down the path of trying the debug parameter with os.environ["MKL_DEBUG_CPU_TYPE"] = "5", and it had no effect; the CPU utilization was the same.

Early on I thought this had promise to fix the issue, but I never saw any difference when actually trying to implement it.
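(For completeness, this is roughly how I set it. The variable has to be in the environment before anything loads libmkl, so it sits above the imports, and it only does anything if MKL is actually the BLAS backend in use.)

Code:
import os

# Must be set before numpy/torch pull in libmkl, otherwise MKL has already
# chosen its dispatch path. Only relevant if MKL is actually the BLAS in use.
os.environ["MKL_DEBUG_CPU_TYPE"] = "5"

import numpy as np
import torch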
 
Last edited:

gsrcrxsi

Active Member
Actually, it's not MKL.

This script/application ships with its own Python 3 environment included, for compatibility; it doesn't use the local version.

Inspecting it, I find:

Code:
ian@Test-Bench:~/BOINC/slots/0/bin$ ./python3
Python 3.7.13 (default, Mar 29 2022, 02:18:16) 
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> np.__config__.show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'np' is not defined
>>> import numpy as np  
>>> np.__config__.show()
blas_info:
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/home/conda/feedstock_root/build_artifacts/numpy_1640083064494/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_place/lib']
    include_dirs = ['/home/conda/feedstock_root/build_artifacts/numpy_1640083064494/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_place/include']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/home/conda/feedstock_root/build_artifacts/numpy_1640083064494/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_place/lib']
    include_dirs = ['/home/conda/feedstock_root/build_artifacts/numpy_1640083064494/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_place/include']
    language = c
lapack_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas']
    library_dirs = ['/home/conda/feedstock_root/build_artifacts/numpy_1640083064494/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_place/lib']
    language = f77
lapack_opt_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas', 'cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/home/conda/feedstock_root/build_artifacts/numpy_1640083064494/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_place/lib']
    language = c
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/conda/feedstock_root/build_artifacts/numpy_1640083064494/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_place/include']
Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2
    not found = AVX512F,AVX512CD,AVX512_KNL,AVX512_KNM,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL
So I guess it's just using the default BLAS.
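(Side note for anyone digging into this: if you can pip-install threadpoolctl into that bundled environment -- it's not shipped with it, so treat this as a sketch -- it will report which BLAS/OpenMP runtimes the process has actually loaded and how many threads each of them will use.)

Code:
# pip install threadpoolctl   (into the bundled env; not included by default)
from pprint import pprint
import numpy as np
from threadpoolctl import threadpool_info

# One dict per loaded BLAS/OpenMP runtime: library path, version, num_threads.
pprint(threadpool_info())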
 

juma

Member
Interesting, it could be worth trying to switch to MKL w/ the debug override or OpenBLAS then.
 

gsrcrxsi

Active Member
What do you mean by "switch to MKL"?

It's my understanding that NumPy needs to be built against MKL in order to do that.

Is there a way without rebuilding the module?
 

juma

Member
Yeah, you'll have to rebuild with the specific library in mind no matter what. Though you should be able to just create a new environment in Conda that uses numpy with a different math library without affecting your current setup.
 

CyklonDX

Well-Known Member
In ref to BOINC (I also crunch a bit):
The task is configured to take a specific amount of CPU (in %).
(Thus the more cores you have, the more cores the job will eat and the more CPU it will use, negating your potential gains.)
You can override that by creating an app_config.xml file in your BOINC project location.

(Here's an example for Milkyway.)

XML:
<app_config>
  <app>
    <name>milkyway</name>
    <max_concurrent>10000</max_concurrent>
    <gpu_versions>
      <gpu_usage>.25</gpu_usage>
      <cpu_usage>.05</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
This will use 25% of the GPU and 5% of the CPU, and in this case allows you to run 4 instances per GPU.
With that understood, '5%' of CPU is going to be different on 8c, 16c, or 32c parts. In the case of GPUGRID, it's likely taking 15-20% of the CPU for each task by default, hence the load you see.

This is something you have to test to find what is optimal for performance for you. A bigger CPU slice often produces slower tasks, as the CPU ends up slowing down the GPU work.


Then, as the next part of optimization, you should turn off irqbalance and pin specific threads for the jobs.
(On the AMD side, ensure you stay within the same chiplet/cache zone while pinning threads; see the sketch below.)
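(Rough sketch of doing the pinning from Python with os.sched_setaffinity; the core IDs here are made up -- check `lscpu -e` or `numactl -H` for your actual CCD/cache layout.)

Code:
import os

# Pin a process (and any threads it spawns afterwards) to one CCD/cache zone.
# Core IDs below are hypothetical -- verify your topology first.
ccd0_cores = set(range(0, 6))

os.sched_setaffinity(0, ccd0_cores)      # 0 = this process
# or, for an already-running BOINC task, pass its PID instead:
# os.sched_setaffinity(task_pid, ccd0_cores)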


In terms of EPYC vs the v4s: EPYC has much better IPC and will always win.

(Editing the script's imports won't give you performance. You would need to rewrite the task to use a certain CPU feature set to get anything at all; just importing modules won't do anything unless they are used.)

There are custom Python builds that can potentially offer better performance.
There were also some shenanigans going on by Intel before (restricting/falling back to much slower feature sets on AMD).

There is a custom module called fask-mkl-amd, but it will require some changes to the script to actually use it. (It would be best to go onto the project boards and ask the devs to add feature sets for AMD - it does not hinder or stop the Intel code in any way, after all.)

(Before you can use the module, you need to install it.)
 
Last edited:

gsrcrxsi

Active Member
Thanks for the reply.

First, just to correct you: the cpu_usage flag in app_config does not control how much CPU the science application uses. It has never done that, and BOINC doesn't have the power to control science apps this way unless the app was specifically coded to accept arguments passed from BOINC (in which case you would use the cmdline options instead of cpu_usage). What this parameter actually does is tell BOINC how much CPU is being used, for bookkeeping/accounting purposes, so it knows how much of your resources remain available based on what's running. If I were to set this value very low, the science app wouldn't change its usage, but BOINC would think I have free CPU cores and attempt to run more work on them, essentially running more tasks than I have CPU cores and leading to overcommitment of the CPU and thrashing/poor performance. It's fine to set this value low for Milkyway, because that application actually has very low CPU use, which puts it in line with the BOINC setting. With these GPUGRID tasks I actually set this value very high (10-16) to prevent other CPU tasks from running and clogging things up.

Thanks for the links to the other resources. I've seen that libfakeintel thing before, and tried it before, but it had no effect, probably because I didn't realize at the time that MKL wasn't even being used. I'll try again with MKL.

I did rebuild the whole numpy module with MKL (the 2019 version, since the debug override was removed in 2021 or so) and shoehorned it into the existing GPUGRID Python package (in the running slot, so it's not permanent). It was a bit of a pain because I had to build the exact same versions of Python and numpy that GPUGRID ships, or else it wouldn't import properly. I ran it on my Intel system to avoid the whole debug issue, but nothing seemed different: everything behaved the same as before, and wasn't any faster. So this all might be a wild goose chase anyway, but I'll try the same thing on my AMD system with MKL_DEBUG_CPU_TYPE=5 or libfakeintel anyway.
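(For anyone replicating this, the sanity check is roughly the following, run from the bundled python3 in the slot: confirm the swapped-in build is the one actually imported, and that an MKL library really gets mapped into the process.)

Code:
import numpy as np

print(np.__file__)        # should point into the slot's bundled environment
np.__config__.show()      # should now list an MKL BLAS instead of plain cblas/blas

# Is an MKL shared object actually mapped into this process?
with open("/proc/self/maps") as maps:
    mkl_libs = {line.split()[-1] for line in maps if "mkl" in line}
print(mkl_libs if mkl_libs else "no MKL library mapped")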
 

CyklonDX

Well-Known Member
While the impact can be hard to see, since it's fractional (1 CPU isn't 1 core per se, unless the job is also compiled to run that way):

2 CPUs (original 1) causes more usage on the threads.
[screenshot]

1 CPU: you can see lower CPU thread usage.
[screenshot]

(2 CPUs) Runtime 141.31 sec; 69.81 sec CPU time
[screenshot]
(1 CPU) Runtime 85 sec; 44 sec CPU time
[screenshot]
 

gsrcrxsi

Active Member
I've done a lot of testing and work with various BOINC projects over the years, even building a custom BOINC client and making various changes to the code. cpu_usage is the most misunderstood flag among BOINC users. Trust me, it doesn't do what you think it does. It ONLY tells BOINC how much CPU is in use at any given time based on what's running, because BOINC doesn't have the ability to "sense" actual hardware utilization; it only knows from its own bookkeeping. I've tested this extensively.

Your comparison isn't controlled. The tasks were different sizes and had different runtimes, but the ratio of CPU time to runtime is essentially the same, ~50%, for both (69.81 / 141.31 ≈ 49%; 44 / 85 ≈ 52%), which supports my argument that you can't change CPU use with this flag. If the flag had any effect you would see a noticeable difference in the ratio, but you don't. Your 5% vs 7% screenshot is inconclusive at best, and more likely due to some other background process; if you look at the Einstein process specifically you will see they are the same.

But this is a bit off-topic at this point.
 
Last edited:

gsrcrxsi

Active Member
So I've switched the package to use the new numpy built against MKL, loaded it on the AMD system, confirmed that it's being used, and also set the MKL_DEBUG_CPU_TYPE=5 variable.

Everything runs, but nothing has changed with the CPU use behavior. And looking closer at the code, numpy is barely even used anyway, so I guess it's no surprise that it didn't change anything. The main "meat", where the code probably spends most of its time, is the learner using PyTorchRL. I can't seem to find any relevant results comparing AMD vs Intel CPUs with this module.
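(One thing I still want to rule out, sketched below with stock PyTorch calls: torch's intra-op thread pool defaults to however many cores it sees, so the same task could legitimately spin up more CPU work on a 24c/48t EPYC than on a 16c/32t Xeon. Worth checking and, if it differs, clamping.)

Code:
import torch

# How many CPU threads torch will use for intra-op / inter-op parallelism.
# By default this follows the core count the process sees, so it can differ
# between a 16c/32t Xeon and a 24c/48t EPYC running the identical task.
print(torch.get_num_threads(), torch.get_num_interop_threads())
print(torch.__config__.show())    # parallel backend (OpenMP/TBB), BLAS, etc.

# Clamp it (before any parallel work starts) and see if CPU use or runtime changes.
torch.set_num_threads(2)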

Any other ideas?
 

Skillz

New Member
As mentioned already, but just to back up the claim:

The <gpu_usage> and <cpu_usage> values do not tell BOINC to use only that much of the CPU per application.

BOINC sees that you have, say, 16 cores and 32 threads. If you tell BOINC a task uses 1 core, it just deducts 1 thread from the 32 available; this is only used to determine how many apps can run. If you know the app uses very little CPU, you can tell BOINC to budget even less CPU per task, which effectively makes it run more tasks simultaneously on the CPU.

The same goes for the GPU. It doesn't tell BOINC to use only 25% of the GPU (in your .25 example); it just means that for every 1 GPU, you can assign 4 tasks to run on it, which is why 4 tasks will run simultaneously on a single GPU this way.

It's very seldom used this way for CPUs, since it almost never has a benefit (except for some NCI projects) to run more tasks than you have cores/threads.

One good example of this usage, other than running multiple tasks on a single GPU, is ensuring BOINC assigns enough CPU tasks to run on the CPU.

So say you have a 16-core/32-thread CPU, 1 GPU, and you run one or more projects on this host.

If BOINC thinks the GPU task needs 1 full thread, then you'll only ever be able to run 31 CPU tasks at once.

But say you know the GPU task uses little to no CPU, so you set the CPU usage on that task to something really low. This way BOINC won't reserve 1 thread for the GPU task, allowing you to run 32 CPU tasks on the CPU.

It's not a very good way to control things though, IMHO, which is why I usually have 2 instances of BOINC installed on my hosts with GPUs: I can much more easily control the number of CPU tasks I want to run (regardless of project) on one instance, and the GPU tasks on the other.
 

CyklonDX

Well-Known Member
any other ideas?
Can you post the Python code? I can take a look at whether it's doable to rewrite it, and run a few tests in Jupyter.

(Potentially using ROCm libs with PyTorch may give it proper execution paths.)
 

gsrcrxsi

Active Member
I probably wouldn't feel comfortable posting it publicly since it's someone else's scientific research, but I can PM it to you.

Or, since you're familiar with BOINC, if you have a Linux/NVIDIA system you can attach to GPUGRID and download a task; you'll be sent the Python code (~500 lines) and a Python environment archive (2.7 GB) with the task.