It's a ~1.3GHz Atom-derived core with AVX-512; it's never going to be fast at single-threaded work. You're not supposed to run Windows on these things - you're supposed to run Linux and set them up as MPI nodes in your supercomputer.
The Phis have always had half-assed software support; the first-gen PCIe ones literally *ran Linux* on the card, exposing the host link as a virtual network interface. The second-gen ones were bootable, but that was probably a mistake, since it tempts people to try to run commodity code on their nodes.
You can get a P100 (~5 DP TFLOPS) for $300, which makes the Phi nodes a pretty tenuous proposition - last I checked they were still going for $300+ a node, and the P100 has features like "runs software" and "is supported by its manufacturer".
The reason the CPU-Z performance is so bad is that you're supposed to use them like a GPU: each core has dual 512-bit FMA units, good for 32 single-precision FMAs per cycle. In Nvidia parlance we'd call each SIMD lane a "compute core" and pretend the 64-core Phi is a "2048-core" processor. (The devil is in the details - on GPUs you'd get something like 8x32 lanes in one macro-core, then 8-ish of those sharing load/store and miscellaneous frontend nonsense.)
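For a back-of-the-envelope check, here's that lane-counting arithmetic written out, assuming a 64-core part clocked at 1.3GHz (core counts and clocks vary by SKU, so treat the numbers as illustrative, not a spec sheet):

```python
# Rough peak-throughput arithmetic for a hypothetical 64-core,
# 1.3 GHz Knights Landing-style part (illustrative; SKUs vary).
cores = 64
fma_units_per_core = 2          # dual 512-bit FMA units per core
fp32_lanes = 512 // 32          # 16 single-precision lanes per unit
fp64_lanes = 512 // 64          # 8 double-precision lanes per unit
clock_hz = 1.3e9
flops_per_fma = 2               # a fused multiply-add counts as 2 FLOPs

# "Nvidia-style" marketing core count: one "core" per SP SIMD lane.
nvidia_style_cores = cores * fma_units_per_core * fp32_lanes
print(nvidia_style_cores)       # 2048

peak_sp = cores * fma_units_per_core * fp32_lanes * flops_per_fma * clock_hz
peak_dp = cores * fma_units_per_core * fp64_lanes * flops_per_fma * clock_hz
print(f"{peak_sp / 1e12:.2f} SP TFLOPS, {peak_dp / 1e12:.2f} DP TFLOPS")
# 5.32 SP TFLOPS, 2.66 DP TFLOPS
```

That ~2.66 DP TFLOPS figure is why the P100 comparison stings: the GPU roughly doubles it at the same street price.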