So here is the big update for today: nproc-1 works well on lower core count Xeon D, Xeon E3 and Xeon Phi x200 chips. On some of the dual socket systems, setting the number of threads equal to the number of physical cores seems to work best. This is a good illustration of why testing on more than 3 different types of platforms is a good idea.
I now have a second Docker image that uses nproc/2 for the number of threads. The results on the higher-end machines are dramatic (where two figures are shown, they are nproc-1 -> nproc/2):
4x Intel Xeon E7-8890 V4 = 2280H/s - no longer with us :-(
2x Intel Xeon E5-2699 V4 = 1220H/s -> 1723H/s
1x Intel Xeon Phi 7210 = 1117H/s -> 602H/s (case to use nproc-1)
2x Intel Xeon E5-2698 V4 = 1010H/s -> 1572H/s
2x Intel Xeon E5-2690 V3 = 840H/s -> 1100H/s
2x Intel Xeon E5-2683 V3 = 826H/s -> 969H/s
2x Intel Xeon E5-2670 V3 = 793H/s -> 989H/s
2x Intel Xeon E5-2650L V3 = 720H/s -> 809H/s
2x Intel Xeon E5-2620 V4 = 620H/s -> 824H/s
2x Intel Xeon E5-2670 V1 = 590H/s -> 785H/s
2x Intel Xeon E5-2620 V1 = 363H/s -> 462H/s
2x Intel Xeon X5675 = 340H/s (Fractal)
1x Intel Xeon D-1587 = 219H/s -> 318H/s
1x Intel Xeon E5-1515M V5 = 185H/s -> 175H/s (case to use nproc-1)
1x Intel Xeon D-1541 = 178H/s -> 178H/s (case to use either nproc/2 or nproc-1)
1x Intel Xeon D-1540 = 157H/s -> 157H/s (case to use either nproc/2 or nproc-1)
2x Intel Xeon E5620 = 150H/s (cafcwest)
1x Intel Xeon E3-1245 V5 = 140H/s -> 138H/s (case to use nproc-1)
1x Pentium D1508 = 50H/s -> 47H/s (case to use nproc-1)
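To put a number on the gains and losses, here is a quick sketch that computes the percentage change for a few entries from the list. It assumes the first figure on each line is the nproc-1 result and the second is nproc/2, which is what the "case to use nproc-1" notes suggest. The helper name `percent_change` is just for illustration.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

# (nproc-1 H/s, nproc/2 H/s), copied from the list above.
# Assumption: the first figure is nproc-1 and the second is nproc/2.
results = {
    "2x Intel Xeon E5-2699 V4": (1220, 1723),
    "1x Intel Xeon Phi 7210": (1117, 602),
    "1x Intel Xeon D-1541": (178, 178),
}

for name, (before, after) in results.items():
    print(f"{name}: {percent_change(before, after):+.1f}%")
```

A positive value means nproc/2 was faster on that machine, a negative value means nproc-1 was.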
I also tried nproc/2 - 1 (i.e. physical cores - 1 on an HT system). In some cases it did better and in others it did worse, but the margin was very small either way. On 2 core machines nproc/2-1 ravages performance.
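For reference, the three thread-count options discussed here can be sketched as below. This is only an illustration, not the logic in the actual images: the function name `thread_counts` is made up, nproc/2 only equals the physical core count if HT gives 2 threads per core, and the floor of 1 thread is my own addition so the result never hits zero.

```python
import os

def thread_counts(logical_cpus):
    """Candidate thread counts for a machine with `logical_cpus` logical CPUs."""
    # Assumption: 2 hardware threads per core, so logical_cpus // 2 is the
    # physical core count. Each value is floored at 1 (not in the original post).
    return {
        "nproc-1": max(1, logical_cpus - 1),
        "nproc/2": max(1, logical_cpus // 2),
        "nproc/2-1": max(1, logical_cpus // 2 - 1),
    }

# os.cpu_count() reports logical CPUs and may return None, hence the fallback.
print(thread_counts(os.cpu_count() or 1))
```

On a 2 core, 4 thread chip this gives 3, 2 and 1 threads respectively, so nproc/2-1 leaves half of the physical cores idle, which fits the big drop seen on small machines.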
When I mentioned adding over 3KH/s last evening, that is what I did. Also added E5-2683 V3 to this list with both figures.