Thanks for sharing this. It is quite an impressive achievement! I do wish that they had described the setup in more detail. Some of the details are easy to infer, but some seem impossible.
They didn't provide many details, so let's speculate. The press release touts >10GB/second throughput for SMB traffic from one server to some number of clients. From my perspective, the achievement can be broken down into four smaller achievements:
1) Achieving enormous SMB throughput over a Mellanox Infiniband network.
2) Achieving enormous throughput from a storage system based on OCZ SSD drives and LSI RAID controllers.
3) Achieving enormous throughput with very low CPU utilization.
4) Somehow cramming all of this into a 1U server (see photo at
http://www.storagereview.com/mellan...d_performance_with_windows_server_2012_hyperv)
Let's first speculate about the network connection between the server and the clients. When you think of a "network", you almost always think Ethernet. The "big win" described in the press release is that Infiniband can do much better than any currently available Ethernet. Standard Gigabit Ethernet (abbreviated 1GbE or just GbE) can push about 120MB/Second of real-world data. That was fast five years ago, but it just doesn't cut it now. Stepping up to 10GbE should provide ten times the performance (1200MB/Second), but it doesn't: you can actually expect somewhere between 500 and 900MB/Second of real-world throughput from a highly optimized 10GbE connection. Has anyone seen more than this?
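Those real-world numbers can be sanity-checked with a little arithmetic. A sketch - the efficiency figures here are my assumptions chosen to bracket the numbers above, not measurements:

```python
def real_world_mb_per_sec(link_gbps, efficiency):
    """Usable payload (MB/s) from a raw link speed (Gb/s) at a given efficiency."""
    return link_gbps * 1000 / 8 * efficiency

# 1GbE delivering ~120MB/s implies roughly 96% protocol efficiency
print(round(real_world_mb_per_sec(1, 0.96)))   # 120

# 10GbE rarely comes close to that; 40-72% efficiency brackets
# the 500-900MB/s range quoted above
print(round(real_world_mb_per_sec(10, 0.40)))  # 500
print(round(real_world_mb_per_sec(10, 0.72)))  # 900
```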
Ethernet would not be good enough, so they brought in Infiniband. The fastest version, used in the press release system, runs at 56Gb/second - 5.6 times the raw bit rate of 10Gb Ethernet. That said, Infiniband is not a drop-in replacement for Ethernet, and thus is not normally considered a general-purpose networking interconnect. You can run IP over Infiniband, but the performance is poor - I've seen results for 40Gb/s Infiniband with throughput lower than 10Gb Ethernet. The achievement in the press release required ditching standard IP in favor of a more efficient RDMA-based protocol. I know nothing about "SMB Direct" in Windows 2012, but I am currently getting some experience with something analogous - Sockets Direct Protocol (SDP). Both are more efficient ways to feed a very high bandwidth connection to existing applications - databases in my case, file serving in the press release. By combining a very fast interconnect (56Gb/s Infiniband) with a very efficient protocol (SMB Direct), the group behind the press release achieved extremely high throughput in a very real-world workload. They did it at just 4.6% CPU utilization, which is also very impressive.
That said, they could not have achieved their goal with just one Infiniband connection. Extrapolating from Infiniband + SDP testing on Linux, I would guess around 4.4GB/second per Infiniband connection. The setup in the press release might have done slightly better or somewhat worse, so let's guess that they used between two and four Infiniband connections in the server. Those two to four Infiniband ports would have required somewhere between one and four PCIe slots in the server - or perhaps zero, if the Infiniband were embedded on the motherboard.
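The two-to-four-port guess falls out of the division. The per-port figures are my assumptions: 4.4GB/s from the Linux SDP extrapolation above, with a "slightly better" and "somewhat worse" case on either side:

```python
import math

target_gb_s = 10.36  # throughput claimed in the press release

# slightly better / my extrapolation / somewhat worse (all assumed)
for per_port_gb_s in (5.2, 4.4, 3.0):
    ports = math.ceil(target_gb_s / per_port_gb_s)
    print(f"{per_port_gb_s} GB/s per port -> {ports} ports")
# prints 2, 3, and 4 ports respectively
```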
Second, let's talk disk I/O, again with the goal of figuring out how they configured a system to achieve >10GB/Second of throughput. A photo of the test setup shows five drive chassis below a server. The press release talks about Supermicro chassis, LSI 9285 RAID cards, and OCZ Talos 2R SSD drives. The chassis look to me like Supermicro SC216 parts, which are 24-bay SAS2/SATA3 devices with either a passive backplane, a single SAS expander, or dual expanders. Since the OCZ drives are dual-ported SAS2 drives, let's assume that the chassis are the dual SAS2 expander models, each with two SFF-8088 connectors out the back. Now let's assume that each of these JBOD chassis is connected to a single LSI 9285 RAID card in the server, each of which also has two SFF-8088 connectors. Each RAID card will have two connections to a JBOD, providing failover (not important in this benchmark) and additional throughput. Will this setup provide enough IO bandwidth? Yes. The theoretical IO bandwidth of 10 SAS2 x4 connections (two connections per card, five cards) is around 24GB/Second - more than twice what we need. In actuality, the LSI 9285 cards will be a bottleneck, throttling throughput to around 2.5GB/Second/card or 12.5GB/Second total. Fortunately that is still enough bandwidth to achieve the press release results of >10GB/Second… assuming that the drives are up to the task.
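The bandwidth arithmetic works out as follows. A sketch assuming SAS2's 6Gb/s lanes with 8b/10b encoding, plus my own observation of roughly 2.5GB/Second per LSI 9285:

```python
usable_mb_per_lane = 6000 / 10   # SAS2: 6Gb/s per lane, 8b/10b -> 600MB/s
lanes_per_connection = 4         # each SFF-8088 connector carries an x4 link
connections = 2 * 5              # two connectors per card, five cards

theoretical_gb = usable_mb_per_lane * lanes_per_connection * connections / 1000
print(theoretical_gb)            # 24.0 GB/s - the SAS fabric is not the bottleneck

realistic_gb = 2.5 * 5           # each LSI 9285 plateaus near 2.5GB/s
print(realistic_gb)              # 12.5 GB/s - still above the >10GB/s target
```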
That brings us to the SSD drives. The Talos drives are rated for "550MB/Second" reads. Real-world results will be significantly below this number, but still quite high - between 300 and 400MB/Second is likely, assuming that the SAS expanders are not a limiting factor (versions of the Supermicro expanders from several years ago did significantly limit throughput in my testing). Assuming 350MB/Second reads, getting to 10GB/Second requires just 30 SSD drives, six drives per controller. I know from experience that the current generation of LSI RAID cards can each handle about six fast SSD drives before starting to plateau, so this number seems just barely reasonable. Alternatively, 300MB/Second/drive would require 35 drives, or seven drives per controller. Of course the SAS expanders might slow things down somewhat, requiring more drives to compensate. But since OCZ participated in the test, the drives were essentially free, so the actual configuration could easily have been six, seven, eight, or even more drives per controller - up to the 24-disk capacity of each Supermicro JBOD.
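The drive-count arithmetic, using the assumed real-world read rates above:

```python
import math

target_mb_s = 10_360   # press release throughput in MB/s

for per_drive_mb_s in (400, 350, 300):   # optimistic / likely / cautious
    drives = math.ceil(target_mb_s / per_drive_mb_s)
    per_card = math.ceil(drives / 5)     # spread across the five RAID cards
    print(per_drive_mb_s, drives, per_card)
# 400 -> 26 drives (6/card), 350 -> 30 (6/card), 300 -> 35 (7/card)
```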
And the server hardware? Here is where I get stuck. The photo (see
http://www.storagereview.com/mellan...d_performance_with_windows_server_2012_hyperv) shows a 1U Supermicro server perched above those JBOD racks. The server has eight 2.5" drive slots, which should help identify it; it looks like a SuperServer 1027R. Presumably it's the most appropriate machine that Supermicro currently offers - or something even better that isn't quite available to buy yet. The press release gives a part number (SRS-14URKS-0604-01-VI011), but I can't find it on their web site. Whatever model it is, it needs five PCIe 2.0 x8 slots for the RAID cards, plus either one to four PCIe slots for the Infiniband cards or two built-in 56Gb Infiniband ports on the motherboard. That's somewhere between five and nine PCIe slots. In a 1U server? I know of no such machine from Supermicro; three PCIe slots looks like the maximum in any Supermicro 1U server that matches the provided photo. Here is the best explanation that I can come up with, and it seems flimsy:
1) They used a Supermicro 1027R SuperServer with three PCIe 3.0 slots.
2) They used a Mellanox PCIe 3.0 x16 dual-port card and somehow managed to achieve >5GB/Second of real-world throughput per Infiniband port - about 15% better than I would expect.
3) They used two LSI RAID cards instead of five (since they didn't have enough slots) and made sure that the test data fit into the RAID card cache and/or the OS cache. That would require either that the LSI 9285 cards somehow achieved far better results in a PCIe 3.0 motherboard than they do in a PCIe 2.0 motherboard - almost double, even though they are PCIe 2.0 cards - or that the OS cache supplied the throughput. In either case the five JBOD chassis were mostly for show: the OCZ SSD drives did pretty much nothing after loading their data into the caches.
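For what it's worth, here is the PCIe arithmetic behind my skepticism in point 3, using the standard encoding overheads. The point is that a PCIe 2.0 card negotiates PCIe 2.0 speeds no matter what slot it sits in:

```python
# Per-direction bandwidth of an x8 slot, in GB/s:
# transfer rate (GT/s) * lanes * encoding efficiency / 8 bits per byte
pcie2_x8 = 5.0 * 8 * (8 / 10) / 8       # PCIe 2.0 uses 8b/10b encoding
pcie3_x8 = 8.0 * 8 * (128 / 130) / 8    # PCIe 3.0 uses 128b/130b encoding

print(round(pcie2_x8, 1))   # 4.0 GB/s - the most a PCIe 2.0 card can ever see
print(round(pcie3_x8, 1))   # 7.9 GB/s - only for genuine PCIe 3.0 cards
```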
Just saw the Mellanox press release. Supermicro servers, Mellanox FDR, OCZ SSDs and 10.36GB/s for Hyper-V
Yikes!