Andy... dba and I have been corresponding via e-mail over the last day or two. He has some really interesting results on the 2P system and almost hit 6 GB/sec off of a single controller.
Is the 2P system a SB/Romley platform? 6 GB/sec per HBA sounds very, very good. I assume this is off the LSI 9202-16e adapters with 16 SSDs connected. My 9202-16e adapters arrived last Friday (I got 4 of them), but I still have to wait for the cables to connect them to the SSDs. 4 x 6 GB/sec = 24 GB/sec seems to be the upper limit for a balanced machine with 2 SB CPUs. I/O at this rate consumes ca. 30% of the available memory bandwidth. That is less of an issue with systems tuned purely for storage, but if processing that 24 GB/sec data stream requires more than 2 bytes of memory access by the CPU for each byte read in from the SSDs, the application quickly becomes CPU bound. Unless the app is well optimized (a high % of memory accesses covered by the cache hierarchy), the 8 memory channels quickly become the next bottleneck. Funny times after so many years of I/O "starvation" ...
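To make the arithmetic explicit, here is a rough back-of-the-envelope sketch (Python; the ~40 GB/sec per socket and the 2-bytes-touched-per-byte-read factor are assumptions taken from the figures discussed here, not measurements):

    # back-of-the-envelope: how much memory bandwidth a 24 GB/sec SSD stream eats
    hbas = 4
    per_hba_gbs = 6.0                     # ~6 GB/sec per LSI 9202-16e (figure quoted above)
    io_gbs = hbas * per_hba_gbs           # 24 GB/sec aggregate read rate
    mem_bw_gbs = 2 * 40.0                 # 2 sockets x ~40 GB/sec (4 DDR3 channels each)
    print(io_gbs / mem_bw_gbs)            # ~0.30 -> the ~30% of memory bandwidth mentioned above
    n = 2                                 # assumed bytes touched by the CPU per byte read from SSD
    print((1 + n) * io_gbs / mem_bw_gbs)  # ~0.90 -> the 8 channels are nearly saturated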
For the type of app currently running on the 2-socket workstation, the CPUs are really maxed out by the computational workload (almost constantly at 150 watts over 24 hrs), including ca. 300 TB of I/O in one day. To preserve as much CPU capacity as possible for the compute part of the job, my current setup is a bit different from before. Reflecting on the "bottleneck" of the LSI 9207-8i beyond 5 drives, and the inevitably higher CPU load in a pure HBA setting with all 32 SSDs passed through the HBAs into a software RAID 0, I configured the 4 LSI HBAs with 2 RAID 0 sets each (4 SSDs per RAID 0), and 2 software RAID 0 volumes on top of that, each taking one 4-SSD RAID 0 from every LSI HBA. This arrangement is in my case faster than the "classic" arrangement of connecting all 16 SSDs of 2 LSI adapters into one RAID 0 (2 x LSI-based 8-SSD RAID 0, with one software RAID 0 on top).
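To make the nesting explicit, a little sketch of the layout (Python; the HBA and volume names are made up purely for illustration):

    # as configured: 4 HBAs, each exporting 2 hardware RAID 0 sets of 4 SSDs,
    # then 2 software RAID 0 volumes, each striping one 4-SSD set from every HBA
    hbas = ["HBA0", "HBA1", "HBA2", "HBA3"]
    hw_sets = {hba: [hba + "-R0a", hba + "-R0b"] for hba in hbas}  # 2 x (4-SSD RAID 0) per HBA
    sw_raid0_1 = [hw_sets[hba][0] for hba in hbas]  # stripes 4 x 4 = 16 SSDs
    sw_raid0_2 = [hw_sets[hba][1] for hba in hbas]  # the other 16 SSDs
    print(sw_raid0_1)   # ['HBA0-R0a', 'HBA1-R0a', 'HBA2-R0a', 'HBA3-R0a']
    print(sw_raid0_2)   # ['HBA0-R0b', 'HBA1-R0b', 'HBA2-R0b', 'HBA3-R0b']
    # "classic" alternative: one 8-SSD hardware RAID 0 per adapter,
    # with a single software RAID 0 striped across two adapters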
On power consumption: according to HWInfo, the 16 ECC DIMMs consume 50 watts per CPU under high load, i.e. 100 watts of power just for the memory. Add the energy consumption of the fans, the LSI controllers, the SSDs, etc., and the 2 high-powered CPUs with 150 watt TDP each are responsible for a tad above 50% of the total energy consumption under high load (ca. 550 watts peak measured at the wall outlet).
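In numbers (rough, using the figures above):

    # rough power budget under full load
    cpu_w  = 2 * 150              # 2 Xeons at ~150 W each
    ram_w  = 2 * 50               # 16 ECC DIMMs, ~50 W per socket (HWInfo reading)
    wall_w = 550                  # peak measured at the wall outlet
    print(cpu_w / wall_w)         # ~0.55 -> "a tad above 50%" for the CPUs
    print(wall_w - cpu_w - ram_w) # ~150 W left for fans, LSI controllers, SSDs, PSU losses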
On temperature: The 2 Corsair H100 coolers keep the Xeons below 60 °C, allowing Intel's Turbo Boost to clock all 8 cores up to a sustainable 3.4 GHz (nominal speed is 3.1 GHz), with single cores occasionally peaking at 3.8 GHz.
On NUMA: CoreInfo reports that the cost ratio between a thread accessing its local memory and a thread accessing process memory attached to the second CPU is about 1.6. Better make sure that CPU-intensive apps are NUMA aware to maintain high workloads. Databases usually are, other server apps less often. There is one good thing about the 2-socket SB boards vs. their larger siblings: Sandy Bridge CPUs for dual-socket motherboards have 2 QPI interconnects connecting the 2 CPUs, whereas in a 4-socket motherboard the CPUs are only connected via single QPI links. One QPI link can transfer ca. 16 GB/sec. So the 2-socket CPUs can utilize a higher share of the available memory bandwidth of the second CPU than a 4-socket machine can.
Let's take the 38-40 GB/sec memory bandwidth that the 4 channels deliver per CPU socket.
In a dual-CPU system the 2 QPI links can theoretically transfer 32 GB/sec; the single QPI link in a 4-socket machine, 16 GB/sec. As long as the OS can ensure that process/memory affinity stays local, the impact is negligible. If not, the 4-socket machine is more exposed in this domain than the 2-socket machine.
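The same in numbers (theoretical peaks, using the figures above):

    # why remote NUMA access hurts more on a 4-socket box
    local_bw  = 40.0             # ~38-40 GB/sec of local memory bandwidth per socket
    qpi_bw    = 16.0             # ~16 GB/sec per QPI link
    remote_2s = 2 * qpi_bw       # dual-socket: 2 QPI links between the CPUs -> ~32 GB/sec
    remote_4s = 1 * qpi_bw       # 4-socket: single QPI link per CPU pair    -> ~16 GB/sec
    print(remote_2s / local_bw)  # ~0.8 of local bandwidth still reachable remotely
    print(remote_4s / local_bw)  # ~0.4 -> a much bigger penalty if affinity goes wrong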
cheers,
Andy