Hello,
I found your result here during my research and I am very interested in how exactly you achieved it. I have similar hardware, but my Gflops don't even come close to yours. I would be very grateful if you could give me some additional information, or maybe even post your config file (HPL.dat) here.
Which CUDA-Linpack version are you using? The only one I could find (and the one I am using) seems rather old: hpl-2.0_FERMI_v15.
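In case it helps to compare notes: as far as I can tell, the FERMI build expects the standard HPL 2.0 layout for HPL.dat. The N, NB and P x Q values below are only placeholders for a single node with 4 ranks (one per GPU), and are exactly the parameters I am unsure about:

    HPLinpack benchmark input file
    Innovative Computing Laboratory, University of Tennessee
    HPL.out         output file name (if any)
    6               device out (6=stdout,7=stderr,file)
    1               # of problems sizes (N)
    80000           Ns
    1               # of NBs
    768             NBs
    0               PMAP process mapping (0=Row-,1=Column-major)
    1               # of process grids (P x Q)
    2               Ps
    2               Qs
    16.0            threshold
    1               # of panel fact
    1               PFACTs (0=left, 1=Crout, 2=Right)
    1               # of recursive stopping criterium
    2               NBMINs (>= 1)
    1               # of panels in recursion
    2               NDIVs
    1               # of recursive panel fact.
    1               RFACTs (0=left, 1=Crout, 2=Right)
    1               # of broadcast
    0               BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
    1               # of lookahead depth
    1               DEPTHs (>=0)
    2               SWAP (0=bin-exch,1=long,2=mix)
    64              swapping threshold
    0               L1 in (0=transposed,1=no-transposed) form
    0               U  in (0=transposed,1=no-transposed) form
    1               Equilibration (1=yes,0=no)
    8               memory alignment in double (> 0)

The N here would roughly fill 80% of one node's 64 GB (N ~ sqrt(0.8 * 64e9 / 8) ~ 80000), and NB=768 is just a value I have seen mentioned for CUDA builds; both are among the things I would like to compare against your settings.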
I work on a cluster with 7 GPU nodes; each node has the following hardware:
2x Intel Xeon E5-2640 v4
8x DDR4-2400 8 GB Memory
Intel X10DGQ Board
4x Tesla P100 16GB HBM2
CUDA 9.0
The best I have achieved so far is roughly 3500 Gflops, and that was with all 28 GPUs. I think the benchmark isn't using the GPUs at all: nvidia-smi shows barely any activity (~45 W out of 300 W, 0% GPU-Util, ~2400 MiB memory used), and as far as I know the 14 Xeons alone should get close to 3500 Gflops. There is no warning or error whatsoever, and every run ends with PASSED.
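To make the "the Xeons alone could do that" part concrete, here is the back-of-the-envelope estimate I am basing it on (assuming FP64 AVX2 + FMA at base clock for the CPUs and roughly 4.7 Tflops FP64 per P100; the exact figures are rough):

    # Rough FP64 peak estimate (assumptions: AVX2 + FMA at base clock for the
    # Xeons, ~4.7 TFLOPS FP64 per P100; numbers are approximate).
    cpus            = 14          # 2x E5-2640 v4 per node, 7 nodes
    cores_per_cpu   = 10
    ghz             = 2.4         # base clock
    flops_per_cycle = 16          # FP64 with AVX2 + FMA

    cpu_peak = cpus * cores_per_cpu * ghz * flops_per_cycle   # ~5400 Gflops
    gpu_peak = 28 * 4700                                      # ~132000 Gflops

    print(f"CPU-only peak: {cpu_peak:.0f} Gflops")  # HPL at ~65% of this is ~3500
    print(f"GPU peak:      {gpu_peak:.0f} Gflops")

So 3500 Gflops is about what I would expect from the CPUs alone, while even a mediocre GPU-accelerated run should report tens of Tflops.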
I would be grateful for any advice.