Why is my Xeon Phi 7210 machine so slow?


bayleyw

Active Member
Jan 8, 2014
292
95
28
It's a 1.3GHz Atom with AVX-512; it's never going to be fast. You're not supposed to run Windows on these - you're supposed to run Linux and set them up as MPI nodes in your supercomputer.
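
(If "MPI nodes" is unfamiliar: each node just runs one or more ranks of a message-passing program. The classic hello world below is the whole model in miniature - an untested sketch, assuming an MPI stack like MPICH or Intel MPI and building with mpicc.)

#include <mpi.h>    /* any MPI implementation, e.g. MPICH */
#include <stdio.h>

/* Launch across nodes with something like: mpirun -np 64 ./hello */
int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);                 /* join the job          */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which rank am I?      */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many ranks total? */
    printf("rank %d of %d reporting in\n", rank, size);
    MPI_Finalize();
    return 0;
}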

The Phis have always had a half-assed software story: the first-gen PCIe ones literally *ran Linux* on the card and exposed the interface as a virtual network device. The second-gen ones were bootable, but that was probably a mistake, since it tempts people to try to run commodity code on them.

You can get a P100 (~5 DP TFLOPS) for $300, which makes the Phi nodes a really tenuous proposition - last I checked they were still going for $300+ a node, and the P100 has features like "runs software" and "is supported by its manufacturer".

The reason the CPU-Z performance is so bad is that you're supposed to use them like a GPU. Each core has dual 512-bit FMACs, which works out to 32 32-bit FMAs per cycle; in Nvidia parlance we'd call each SIMD lane a "compute core" and pretend the Phi is a "2048-core" processor (the devil is in the details - on GPUs you'd get something like 8x32 lanes in one macro-core, then 8-ish of those sharing load/store and miscellaneous frontend nonsense).
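
To make that concrete, keeping those two FMA pipes fed is the whole game. A minimal sketch with the Intel intrinsics (assumes gcc with -mavx512f, or icc with -xMIC-AVX512; the peak numbers in the comment are back-of-the-envelope):

#include <immintrin.h>  /* AVX-512 intrinsics */
#include <stdio.h>

/* Rough peak for a 64-core 7210 at 1.3GHz, assuming both FMA pipes
   stay busy: 2 FMACs x 16 fp32 lanes x 2 flops (mul+add)
   = 64 flops/cycle/core; 64 cores x 1.3e9 x 64 ~= 5.3 SP TFLOPS. */
int main(void) {
    __m512 a = _mm512_set1_ps(2.0f);        /* 16 lanes of 2.0          */
    __m512 b = _mm512_set1_ps(3.0f);
    __m512 c = _mm512_set1_ps(1.0f);
    __m512 d = _mm512_fmadd_ps(a, b, c);    /* one 16-wide FMA: a*b + c */
    float out[16];
    _mm512_storeu_ps(out, d);
    printf("lane 0 = %.1f (expect 7.0)\n", out[0]);
    return 0;
}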
 

arthur513

New Member
Oct 6, 2022
16
2
3
Thanks for the info. I am not a UNIX guy; I got this mainly for setting up a home lab. I wanted to install Windows Server 2019 on it, but the install failed.
 

111alan

Active Member
Mar 11, 2019
290
107
43
Harbin Institute of Technology
1. Low clock speed.
2. Only 2-wide instruction decode (mainstream Intel is 5-6 wide and AMD is 4 wide now) and bad branch prediction. Only 1 FPU and 2 ALUs per core. Overall very low throughput, unless you can use AVX-512.
3. No L3 cache, very passive prefetching.
4. High memory latency (except for the HBM/MCDRAM models).
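
You can see point 4 directly with a pointer-chase microbenchmark like the rough sketch below (untested, my own illustration; assumes a 64-bit Linux box and gcc). Every load depends on the previous one, so the prefetcher can't hide anything:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)   /* 16M pointers x 8 bytes = 128MB, far past any cache */

int main(void) {
    size_t *next = malloc(N * sizeof *next);
    for (size_t i = 0; i < N; i++) next[i] = i;
    /* Sattolo shuffle: one random cycle over all N slots, so each
       load depends on the previous one and defeats the prefetcher. */
    srand(1);
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;      /* j in [0, i) */
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (size_t i = 0; i < N; i++) p = next[p];   /* the chase */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("~%.1f ns per dependent load (p=%zu)\n", ns / N, p);
    return 0;
}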
 
  • Like
Reactions: Stephan

arthur513

New Member
Oct 6, 2022
16
2
3
1. Low clock speed.
2. Only 2-wide instruction decode (mainstream Intel is 5-6 wide and AMD is 4 wide now) and bad branch prediction. Only 1 FPU and 2 ALUs per core. Overall very low throughput, unless you can use AVX-512.
3. No L3 cache, very passive prefetching.
4. High memory latency (except for the HBM/MCDRAM models).
A lot of info, and good to know. I got it cheap and just wanted to test it out. Thanks.
 

shpitz461

Member
Sep 29, 2017
109
19
18
50
I couldn't help myself, but MCDRAM - I didn't know McDonald's was in the RAM business as well :p

Did you try booting a Linux USB just to test? You can run it off the USB without installing to HDD/SSD, just to see how it performs.

Reading the other replies shows that this CPU is not meant to be used as a general-purpose CPU, which is why its basic CPU performance is so poor.
 

arthur513

New Member
Oct 6, 2022
16
2
3
I couldn't help myself, but MCDRAM - I didn't know McDonald's was in the RAM business as well :p

Did you try booting a Linux USB just to test? You can run it off the USB without installing to HDD/SSD, just to see how it performs.

Reading the other replies shows that this CPU is not meant to be used as a general-purpose CPU, which is why its basic CPU performance is so poor.
I was able to install Windows Server 2019 on it. The performance was much better than with Windows 10 - CPU usage was low, around 3-5%. It seems the CPU works well with a Windows Server OS.
 

shpitz461

Member
Sep 29, 2017
109
19
18
50
I was able to install Windows Server 2019 on it. The performance was much better than with Windows 10 - CPU usage was low, around 3-5%. It seems the CPU works well with a Windows Server OS.
Excellent! Did you re-run the benchmarks you did previously? How does it perform?
 

Styp

Member
Aug 1, 2018
69
21
8
From a software engineering perspective:
- The good thing is that it's x86, not some proprietary OpenCL/CUDA 'nonsense' that demands specialist knowledge.
- It is very complicated to optimize for peak performance: 4 threads per core, a high core count, and AVX-512 don't make it easier.
- Suitable algorithms will see a big uplift: matrix multiplication, deep-learning inference (if optimized), anything that fits in the 16GB MCDRAM with high cache locality and AVX-512-friendly code (matrix operations, etc.) - a sketch of the idea is at the end of this post.
- No copy operations to and from the host system! This is a huge advantage compared to GPUs (manipulating gradients in deep learning, for example).

Windows, in general, is terrible on it, and Linux is way better. In my opinion, unless you do some software tinkering on heavy matrix calculations, there is no real benefit for you in the Phi.
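
To illustrate the "fits in 16GB with cache locality" point, a rough, untested sketch (assumes libmemkind and an OpenMP-capable compiler; hot_alloc is just my helper that falls back to DDR4 when MCDRAM isn't exposed in flat mode):

#include <hbwmalloc.h>   /* from libmemkind: link with -lmemkind */
#include <stdio.h>
#include <stdlib.h>      /* build with -fopenmp (gcc) or -qopenmp (icc) */

/* Put the hot arrays in MCDRAM when it's exposed, else plain DDR4. */
static float *hot_alloc(size_t bytes) {
    return hbw_check_available() == 0 ? hbw_malloc(bytes) : malloc(bytes);
}

int main(void) {
    const size_t n = 1024;
    float *a = hot_alloc(n * n * sizeof *a);
    float *b = hot_alloc(n * n * sizeof *b);
    float *c = hot_alloc(n * n * sizeof *c);
    for (size_t i = 0; i < n * n; i++) { a[i] = 1.0f; b[i] = 2.0f; c[i] = 0.0f; }

    /* i-k-j order keeps the inner loop streaming through b and c,
       which is what "high cache locality" buys you on this chip. */
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            float aik = a[i * n + k];
            #pragma omp simd   /* inner loop vectorizes to AVX-512 */
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }

    printf("c[0][0] = %.1f (expect %.1f)\n", c[0], 2.0f * n);
    return 0;
}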
 
  • Like
Reactions: hmw and Stephan

arthur513

New Member
Oct 6, 2022
16
2
3
From a software engineering perspective:
- The good thing is that it's x86, not some proprietary OpenCL/CUDA 'nonsense' that demands specialist knowledge.
- It is very complicated to optimize for peak performance: 4 threads per core, a high core count, and AVX-512 don't make it easier.
- Suitable algorithms will see a big uplift: matrix multiplication, deep-learning inference (if optimized), anything that fits in the 16GB MCDRAM with high cache locality and AVX-512-friendly code (matrix operations, etc.) - a sketch of the idea is at the end of this post.
- No copy operations to and from the host system! This is a huge advantage compared to GPUs (manipulating gradients in deep learning, for example).

Windows, in general, is terrible on it, and Linux is way better. In my opinion, unless you do some software tinkering on heavy matrix calculations, there is no real benefit for you in the Phi.
Thank you for the valuable info!
 
  • Like
Reactions: RolloZ170