OK, since that chip has an odd 55MB of L3 cache, I tried something different: 4x containers. 2x with 27 threads at av=2, and 2x with 1 thread at av=1 (to use up that extra MB of L3 cache).
6600 H/s. Ever so slight an improvement.
Also, I'm only running 4 containers on 2-cpu systems. 2 per cpu.
Anyway, I'd be interested to see how this works on other CPUs as well. Run av=2 on all the real cores, and av=1 on the hyperthreading instance to use up the remaining cache. On a dual E5-2660 v2 or E5-2680 v2 (25MB L3 per CPU) with all cores enabled, there would be 5MB of cache left over per CPU after using 10 threads of av=2 on the "real cores". So you could try up to 5 threads of av=1 on the second instance.
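To make the leftover-cache arithmetic above concrete, here is a rough sketch. It assumes each av=1 thread needs 1MB of L3 and each av=2 thread needs 2MB, which is what the numbers in this post imply (27 x 2MB + 1MB = 55MB). The function names are made up for illustration, not part of any miner.

```python
# Assumed per-thread L3 footprint in MB, inferred from the numbers in this post.
MB_PER_THREAD = {1: 1, 2: 2}  # av value -> MB of L3 per thread


def leftover_cache(l3_mb, av2_threads):
    """L3 (MB) left on one CPU after running av2_threads threads at av=2."""
    return l3_mb - av2_threads * MB_PER_THREAD[2]


def av1_fill(l3_mb, av2_threads):
    """How many av=1 threads fit into the cache left over by the av=2 threads."""
    return leftover_cache(l3_mb, av2_threads) // MB_PER_THREAD[1]


print(av1_fill(25, 10))  # 25MB chip, 10 real cores at av=2 -> 5
print(av1_fill(55, 27))  # 55MB chip, 27 threads at av=2 -> 1
```

If your miner or algorithm uses a different scratchpad size, change MB_PER_THREAD and the rest still holds.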
Equally, it may turn out you'd want to do the reverse -- 13-16 threads of av=1 and 4-6 threads of av=2 would use most/all of the cores (hyperthreaded and otherwise) and also all the cache. Though I'm not sure how you'd balance it -- 10x av=1 on all real cores, 6x av=1 on 6 of the 10 hyperthreads, and 4x av=2 on the other 4 hyperthreads?
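For anyone who wants to list the candidate mixes instead of guessing, a small sketch follows. It again assumes 1MB per av=1 thread and 2MB per av=2 thread, and it only counts threads and cache. It says nothing about which threads should land on real cores versus hyperthreads, and nothing about actual hashrate. full_cache_mixes is a hypothetical helper name.

```python
def full_cache_mixes(l3_mb, hw_threads, mb_av1=1, mb_av2=2):
    """List (av1_threads, av2_threads) pairs that use exactly l3_mb of cache
    without needing more than hw_threads hardware threads on one CPU."""
    mixes = []
    for n_av2 in range(l3_mb // mb_av2 + 1):
        rest = l3_mb - n_av2 * mb_av2
        if rest % mb_av1:
            continue  # leftover cache is not a whole number of av=1 threads
        n_av1 = rest // mb_av1
        if n_av1 + n_av2 <= hw_threads:
            mixes.append((n_av1, n_av2))
    return mixes


# One 10-core / 20-thread CPU with 25MB L3 (e.g. E5-2660 v2)
mixes = full_cache_mixes(25, 20)
print(mixes[0])   # (15, 5): the only mix that uses all 20 threads and all 25MB
print(mixes[-1])  # (1, 12): mostly av=2, only 13 threads busy
```

Under these assumptions 15x av=1 plus 5x av=2 is the one split that fills every thread and every MB, which sits inside the 13-16 / 4-6 range guessed at above. Whether it actually hashes faster is something you would have to benchmark.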
Gets complicated in a hurry, but may squeeze out a couple percent.