Mellanox with Supermicro and OCZ = 10.36GB/s


cactus

Moderator
Jan 25, 2011
CA
How did they get 10.36GByte/s (82.88GBit/s) over a 56GBit/s line? Aggregated links?

Edit: Did some reading. IB is like PCIe in that you can have 1x, 4x, or 12x link widths. A 4x link seems to be the normal single port, carrying 56Gb/s for FDR.
 

dba

Moderator
Feb 20, 2012
San Francisco Bay Area, California, USA
Thanks for sharing this. It is quite an impressive achievement! I do wish that they had described the setup in more detail. Some of the details are easy to infer, but some seem impossible.

They didn't provide many details, so let's speculate. The press release touts >10GB/second throughput for SMB traffic from one server to some number of clients. From my perspective, the achievement can be broken down into four smaller achievements:
1) Achieving enormous SMB throughput through a Mellanox Infiniband network.
2) Achieving enormous throughput from a storage system based on OCZ SSD drives and LSI RAID controllers.
3) Achieving enormous throughput with very low CPU utilization.
4) Somehow cramming all of this into a 1U server (see photo at http://www.storagereview.com/mellan...d_performance_with_windows_server_2012_hyperv)

Let's first speculate about the network connection between the server and the clients. When you think of a "network", you almost always think Ethernet. The "big win" described in the press release is that Infiniband can do much better than any currently available Ethernet network. Your standard Gigabit Ethernet (abbreviated 1GbE or just GbE) can push about 120MB/Second worth of real-world data. That was fast five years ago, but just doesn't cut it now. Stepping up to 10GbE should provide ten times the performance (1200MB/Second), but it doesn't: you can actually expect somewhere between 500 and 900MB/Second of real-world throughput from a highly optimized 10GbE connection. Has anyone seen more than this?

Ethernet would not be good enough, so they brought in Infiniband. The fastest version, used in the press release system, runs at 56Gb/second - 5.6 times the raw bit rate of 10Gb Ethernet. That said, Infiniband is not a direct replacement for Ethernet, and thus is not normally considered a general-purpose networking interconnect. You can run IP over Infiniband, but the performance is poor - I've seen results for 40Gb/s Infiniband with throughput lower than 10Gb Ethernet. The achievement in the press release required ditching standard IP in favor of a more efficient RDMA-based protocol. I know nothing about "SMB Direct" in Windows Server 2012, but I am currently getting a bit of experience with something analogous - Sockets Direct Protocol (SDP). Both represent more efficient ways to utilize a very high bandwidth connection with existing applications - databases in my case and file serving in the press release. By deploying both a very fast interconnect (56Gb/s Infiniband) and a very efficient protocol (SMB Direct), the group behind the press release achieved extremely high throughput on a very real-world workload. They did it with just 4.6% CPU utilization, which is also very impressive.
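If it helps, here is a quick back-of-envelope conversion of those line rates into byte/second ceilings - a sketch with my own rough efficiency notes, not figures from the press release:

```python
# Back-of-envelope: convert raw line rates into byte/second ceilings.
# The real-world comments are my own rough notes, not measured figures.
LINE_RATES_GBPS = {
    "1GbE": 1,               # ~120MB/s real-world, per the numbers above
    "10GbE": 10,             # ~500-900MB/s real-world is what I have seen
    "FDR InfiniBand 4x": 56, # the port speed used in the press release
}

for name, gbps in LINE_RATES_GBPS.items():
    ceiling_mbs = gbps * 1000 / 8  # gigabits/s -> megabytes/s, before protocol overhead
    print(f"{name}: ~{ceiling_mbs:,.0f} MB/s raw ceiling")

# The gap between these ceilings and real-world numbers is exactly what
# RDMA-style protocols (SMB Direct, SDP) try to close.
```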

That said, they could not have achieved their goal with just one Infiniband connection. Extrapolating from Infiniband + SDP testing on Linux, I would have guessed around 4.4GB/second per Infiniband connection. The setup in the press release might have achieved slightly better or somewhat worse results, so let's guess that they used between two and four Infiniband connections in the server. Those two or four Infiniband ports would have required somewhere between one and four PCIe slots in the server - or perhaps zero slots if the Infiniband were embedded into the motherboard.
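To make that guess explicit, a small sketch dividing the press-release number by a few assumed per-link figures (4.4GB/s is my SDP extrapolation; the other values just bracket it):

```python
import math

# How many FDR ports would >10GB/s take? Per-link figures are assumptions.
TARGET_GBS = 10.36  # press-release throughput, GB/s

for per_link_gbs in (3.5, 4.4, 5.2):
    ports = math.ceil(TARGET_GBS / per_link_gbs)
    print(f"at {per_link_gbs} GB/s per port: {ports} FDR ports needed")

# Prints 3, 3 and 2 ports respectively - hence the guess of two to four links.
```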

Second, let's talk disk I/O, again with the goal of figuring out how they configured a system to achieve >10GB/Second of throughput. A photo of the test setup shows five drive chassis below a server. The press release talks about Supermicro chassis, LSI 9285 RAID cards, and OCZ Talos 2R SSD drives. The chassis look to me like Supermicro SC216 parts, which are 24-bay SAS2/SATA3 devices with either a passive backplane, a single SAS expander, or dual expanders. Since the OCZ drives are dual-ported SAS2 drives, let's assume that the chassis are the dual SAS2 expander models, each with two SFF-8088 connectors out the back. Now let's assume that each of these JBOD chassis is connected to a single LSI 9285 RAID card in the server, each of which also has two SFF-8088 connectors. Each RAID card will have two connections to a JBOD, providing failover (not important in this benchmark) and additional throughput. Will this setup provide enough IO bandwidth? Yes. The theoretical IO bandwidth of 10 SAS2 x4 connections (two connections per card, five cards) is around 24GB/Second - more than twice what we need. In actuality, the LSI 9285 cards will be a bottleneck, throttling throughput to around 2.5GB/Second/card or 12.5GB/Second total. Fortunately that is still enough bandwidth to achieve the press release results of >10GB/Second… assuming that the drives are up to the task.
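Here is the same bandwidth budget as a sketch; the topology is my assumption, and the 2.5GB/Second-per-card ceiling is my own estimate rather than an LSI specification:

```python
# Disk-side bandwidth budget under the assumed topology:
# five JBODs, one LSI 9285 per JBOD, two SAS2 x4 links per card.
SAS2_LANE_MBS = 600     # 6Gb/s per lane is ~600MB/s after 8b/10b encoding
LANES_PER_LINK = 4      # each SFF-8088 connector carries a x4 wide port
LINKS_PER_CARD = 2
CARDS = 5
CARD_CEILING_GBS = 2.5  # rough practical limit per 9285 (my estimate)

wiring_gbs = SAS2_LANE_MBS * LANES_PER_LINK * LINKS_PER_CARD * CARDS / 1000
cards_gbs = CARD_CEILING_GBS * CARDS

print(f"SAS wiring ceiling: ~{wiring_gbs:.0f} GB/s")  # ~24 GB/s
print(f"RAID card ceiling:  ~{cards_gbs:.1f} GB/s")   # ~12.5 GB/s - the real bottleneck
```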

That brings us to the SSD drives. The Talos drives are rated for "550MB/Second" reads. Real world results will be significantly below this number, but still quite high - between 300 and 400MB/Second is likely, assuming that the SAS expanders were not a limiting factor (versions of the Supermicro expanders from several years ago did significantly limit throughput in my testing). Assuming 350MB/Second reads, getting to 10GB/Second requires just 30 SSD drives, or six drives per controller. I know from experience that the current generation of LSI RAID cards can each handle about six fast SSD drives before starting to plateau, so this number seems just barely reasonable. Alternatively, 300MB/Second/drive would require 35 drives, or seven drives per controller. Of course the SAS expanders might slow things down somewhat, requiring the addition of more drives to compensate, but since OCZ participated in the test, the drives were essentially free, so the actual configuration could easily have been six, seven, eight, or even more drives per controller - up to the 24-disk capacity of each Supermicro JBOD.
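A quick sketch of the drive-count arithmetic, with the per-drive throughput treated as an assumption:

```python
import math

# Drive-count arithmetic. Per-drive throughput is an assumed real-world figure,
# well below the 550MB/s spec; five controllers as speculated above.
TARGET_MBS = 10360  # 10.36 GB/s
CONTROLLERS = 5

for per_drive_mbs in (400, 350, 300):
    drives = math.ceil(TARGET_MBS / per_drive_mbs)
    per_controller = math.ceil(drives / CONTROLLERS)
    print(f"{per_drive_mbs} MB/s per drive: {drives} drives (~{per_controller} per controller)")
```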

And the server hardware? Here is where I get stuck. The photo (see http://www.storagereview.com/mellan...d_performance_with_windows_server_2012_hyperv) shows a 1U Supermicro server perched above those JBOD racks. The server has eight 2.5" drive slots, which should help identify it. It looks like a SuperServer 1027R. Presumably, it's the most appropriate machine that Supermicro currently offers - or something even better that is not quite yet available to buy. The press release describes a part number (SRS-14URKS-0604-01-VI011), but I can't find it on their web site. Whatever model it is, it needs five PCIe 2.0 x8 slots for the RAID cards and either between one and four PCIe slots for the Infiniband cards or two built-in 56Gb Infiniband ports on the motherboard. That's somewhere between five and nine PCIe slots. In a 1U server? I know of no such machine from Supermicro. Three PCIe slots appears to be the maximum in any Supermicro 1U server that matches the provided photo. In fact, here is the best that I can come up with, and it seems flimsy:
1) They used a Supermicro 1027R SuperServer with three PCIe 3.0 slots.
2) They used a Mellanox PCIe 3.0 x16 dual-port card and somehow managed to achieve >5GB/Second of real-world throughput per Infiniband port, about 15% better than I would expect.
3) They used two LSI RAID cards instead of five (since they didn't have enough slots) and made sure that the test data fit into the RAID card cache and/or the OS cache. Further, either the LSI 9285 RAID cards were somehow able to achieve far better results in a PCIe 3.0 motherboard than they can in a PCIe 2.0 motherboard - almost double, despite being PCIe 2.0 cards - or the OS cache made the throughput possible. In either case the five JBOD chassis were mostly for show - the OCZ SSD drives did pretty much nothing after loading their data into the caches. (See the back-of-envelope PCIe numbers below.)
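For reference, the rough PCIe numbers behind points 2 and 3 (theoretical per-lane rates; real-world throughput lands somewhat lower):

```python
# PCIe budget behind points 2 and 3 above. Theoretical per-lane rates only.
PCIE2_LANE_GBS = 0.5    # PCIe 2.0: 5GT/s with 8b/10b -> ~500MB/s per lane
PCIE3_LANE_GBS = 0.985  # PCIe 3.0: 8GT/s with 128b/130b -> ~985MB/s per lane

dual_port_fdr_slot = 16 * PCIE3_LANE_GBS  # Mellanox dual-port card in a PCIe 3.0 x16 slot
two_raid_cards = 2 * 8 * PCIE2_LANE_GBS   # two LSI 9285s, PCIe 2.0 x8 each

print(f"PCIe 3.0 x16 slot for the FDR card: ~{dual_port_fdr_slot:.1f} GB/s")  # ~15.8 GB/s
print(f"Two PCIe 2.0 x8 RAID cards:         ~{two_raid_cards:.1f} GB/s")      # ~8.0 GB/s

# ~8GB/s on the storage side falls short of 10.36GB/s, which is why the
# two-card theory has to lean on RAID card and/or OS caching.
```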



Just saw the Mellanox press release. Supermicro servers, Mellanox FDR, OCZ SSDs and 10.36GB/s for Hyper-V

Yikes!
 

Patrick

Administrator
Staff member
Dec 21, 2010
Let me throw a crazy one out there. That SR picture was not in the press release, and you can tell from the photo composition and lighting that it was just someone snapping a shot at a trade show (the filename has TechEd in it, though...). It may be a stock photo instead of the actual system used.
 

dba

Moderator
Feb 20, 2012
San Francisco Bay Area, California, USA
You are probably correct. What are you thinking? How about 6037R-TXRF?


 

mobilenvidia

Moderator
Sep 25, 2011
New Zealand
I take it the motherboard they used, the X9DRT-HIBFF, is designed to go in a 2U case, so I doubt it's in a 1U.

10GB/s from a single LSI 9285 I doubt very much; with 2 controllers, possibly, yes, doable.

Still interesting stuff; not going into any home builds soon, I would think :)
 

cactus

Moderator
Jan 25, 2011
CA
Also, searching SRS-14URKS leads you to a racked cluster with 4+ nodes in it.

The press release never says it is half-duplex throughput - could it be ~5GB/s each way?
 

ehorn

Active Member
Jun 21, 2012
We saw Mellanox and MS demonstrate 5.6GB/s point-to-point with a single NIC at Interop, so add the Multichannel feature of SMB 3.0 and we can scale until: 1) we run out of available PCIe lanes, or 2) the components are saturated.

We have seen that the 9207s provide linear scaling with all current 6Gb SSDs.
We have heard that the latest ConnectX-3 cards are even better at offloading work from the CPU.
We have heard that Multichannel can provide near-linear scaling across NICs.

What we do not yet know is where the saturation point is.

But add these things up and we have some dreamy throughput possibilities with this current tech. I think these two (Mellanox and MS) have done their homework on this setup. I suppose more data will reveal what is what though.
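As a toy sanity check of that "add these things up" point, assuming the 5.6GB/s single-NIC Interop number scales near-linearly with Multichannel (an assumption, not a measurement):

```python
# Toy check: assume the 5.6GB/s single-NIC Interop demo scales near-linearly
# with SMB Multichannel until something else saturates.
SINGLE_NIC_GBS = 5.6

for nics in (1, 2, 3):
    print(f"{nics} NIC(s): up to ~{nics * SINGLE_NIC_GBS:.1f} GB/s")

# Two NICs already put ~11.2GB/s on the table, in line with the 10.36GB/s
# result once the rest of the system takes its cut.
```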

Which is why I am like a kid in a candy store waiting to see the data points from that EchoStreams FlacheSan2 box. It has the potential to transfer > 20GB/s through a single 2U chassis.

I am sure some guys playing with this new tech have seen some incredible numbers in their labs. It seems like records are being smashed on a daily basis with all this new tech.

And 12Gb SAS has not even hit the market yet... Exciting times for storage guys...

peace,
 

Patrick

Administrator
Staff member
Dec 21, 2010
I also think a big driver is that larger SSD capacities are more accessible than ever before. In November 2010 I built a $1,000 array of 8x 64GB 3.0Gbps SandForce drives and an LSI 9211-8i controller. Speed broke 2.0GB/s, but capacity, even using RAID 0, was under 0.5TB.

Some of the 240GB drives are in the $160 range now on special, so for about $1,500 you can add a setup that has almost 2TB (four times the capacity) and twice the throughput.
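Putting those two builds side by side (the 2012 line assumes an 8-drive setup to match the 2010 array; the exact drive count isn't stated):

```python
# Then-vs-now comparison using the numbers above. The 2012 row assumes an
# 8-drive build to mirror the 2010 array; the exact configuration isn't stated.
builds = {
    #            drives, GB each, approx $, approx GB/s
    "Nov 2010": (8, 64, 1000, 2.0),
    "2012":     (8, 240, 1500, 4.0),
}

for name, (drives, gb_each, cost, gbs) in builds.items():
    raw_tb = drives * gb_each / 1000
    print(f"{name}: {drives}x{gb_each}GB = {raw_tb:.2f}TB RAID 0, "
          f"~${cost}, ~{gbs:.1f} GB/s, ~${cost / (drives * gb_each):.2f}/GB")
```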

One other big advantage with larger drives is you have more space to play with overprovisioning which generally helps performance and longevity. These days, I really think the 240-512GB drives are the most attractive. The last six SSDs I have purchased were all in the 240-256GB range. The additional space really helps.
 

ehorn

Active Member
Jun 21, 2012
Agreed.

And your math is spot on too. Today, one can purchase top-of-the-line Toggle-Mode NAND MLC + cables + PCIe 3.0 HBAs for at or under $1/GB (usable) total.

For that, you get some incredible performance scaling AND decent capacity (even with a generous OP).

That is all good in my eyes.

peace,
 

ehorn

Active Member
Jun 21, 2012
...I do wish that they had described the setup in more detail.
It is a Romley platform (dual E5-2680s) with 4 LSI RAID controllers in RAID 10 and 2 IB-FDR NICs.

They also demo'd SMB Multichannel on that setup, where they pulled the plug on one of the IB NICs and still got over 6GB/s.

More info on the setup here: http://channel9.msdn.com/Events/TechEd/NorthAmerica/2012/WSV310

The Supermicro setup starts at ~ minute 39...

peace,
 