Infiniband Testing for High Speed Low Power AIO


33_viper_33

Member
Purpose / end state goal:

I intend to build 2 new ESXi servers on the Xeon E3-1200v3 platform in order to run a high-efficiency network with minimal “always on” physical network hardware. This means no always-on physical switch, instead favoring multiple NICs forming a soft switch to cover down. Both servers will run ESXi 5.5 connected with InfiniBand. One server will run as a router/NAS (ZFS) and possibly a domain box. The NAS server will have a single spinning disk for frequently accessed files and a 3-4 SSD ZFS array for VM storage and PXE boot targets. The second server will be a backup and mass storage server running Windows 2012 Essentials connecting to iSCSI targets on an OpenIndiana ZFS array. In the beginning, the entire setup will be consolidated onto one server to spread out cost. If it runs well, I may keep it to one server. If not, I may offload the large array and the Windows 2012 portion to a C6100 node until the second server is built. Separate servers are preferable to me to minimize always-on drives for power and wear considerations.

One of the primary goals is to keep power requirements to a minimum. Currently, my pfSense router and server are hosted on a Core2 Quad utilizing an APC SMX1500 UPS to power up and shut down the network on a schedule. This became important to me after I saw my power bills. This beast sucks 250w with the switch at idle and can easily pull upwards of 500w. With the new Xeon chips idling around 22 watts, I figure I can keep idle power requirements below 70w with 2 PCIe network cards, 4 SSDs and a spinning disk. The second server will contain 1 InfiniBand card, 1 HBA, and 4 growing to 8 x 4TB spinning disks in a RAID 50 equivalent (two raidz1 arrays in a stripe, added as space is needed). This may become 4 x 4TB and 4 x 6TB drives if the 6TB drives are released to consumers on time. All said and done, the first server running the router and NAS will idle low for always-on activity. The second server will be a bit less efficient with that number of drives, but will only run when I'm accessing movies or automatic backups are being conducted. I think 70w always on and 140w with the storage server at idle sounds much better than the 250w, or more likely 330w, of the old system.

Another primary goal is to minimize expense and maximize performance by consolidating SSDs on the network into a single ZFS array. This will prevent each machine from requiring its own SSD, thereby reducing the quantity of SSDs I must buy and keep powered on. For this to work well, network speed is a must, hence the choice of InfiniBand. The fat pipe InfiniBand provides will allow multiple operating systems to be remotely hosted without bogging down the network.

A secondary goal would be to achieve this package in the form of a cluster. This may be a bad idea as I’m utilizing different hardware (add-on cards) across the two servers. Achieving HA will be impossible for the storage array since it’s tied to one server by its hardware. The two ZFS servers must stay on their respective ESXi hosts. VMware Virtual SAN may be a solution, but I have done zero research to date to understand its capabilities and how it works.

I’ve been working on setting up InfiniBand in what little spare time I have, with varying degrees of success. Since I’ve posted in multiple other threads, I thought I would consolidate everything in a separate thread and document my trials and errors. If you have suggestions or advice, please send them my way; they are most welcome.

The Test Bed:
Dell C6100
-4 nodes, each w/ 24GB RAM, 2 x Xeon 5520 processors, SAS mezzanine adapters and 2 x 250GB HDDs (reconfigured to 2 nodes with 48GB RAM for testing)
-2 nodes have Mellanox ConnectX-2 VPI adapters (all testing has now been moved to these hosts in order to achieve consistency with the Mellanox adapters; HDD swaps are conducted to test differing configurations)
-2 nodes have HP ConnectX VPI adapters (the HP adapters work well, but driver support is poor for Windows 7 bare metal)

Operating systems in use:
ESXi 5.5
Windows 7
Windows 8
Windows 2012 Essentials
OpenIndiana
PFsense 2.1
Ubuntu


Testing:
Most of my testing to this point has been with 10GbE, which has been pretty good and very user friendly. The main issues I’ve had are saturation of 10GbE, high latency, and the price and power requirements of 10GbE switches and network cards. If my planned setup is going to work well, I believe InfiniBand is the answer.

Initially, all speed tests are performed with Primo Ramdisk operating in IO mode and a 14.5GB set of mixed-size files. Each test is performed a minimum of 10 times, with erroneous data points thrown out.

This is not the most scientific process; it is intended just to get an idea of how everything works. At some point, I need to learn how to use Iometer in order to get more consistent and descriptive results. For now, I’m looking at very simple setups to get a baseline quickly and determine what testing I want to conduct in the future. Once I determine the way forward, I will update each post to include Iometer results.
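For anyone who wants to replicate a run, a simple way to time a pass from the command line is robocopy between the two ramdisks. The drive letter, share name and host name below are just examples of my own layout:

rem copy the 14.5GB mixed test set from the local ramdisk to the remote host's ramdisk share
C:\> robocopy R:\testset \\node2\ramdisk\testset /E /MT:8 /NP /LOG:run1.log

The summary robocopy prints at the end includes the average throughput, which is the number I record for each pass.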
 

33_viper_33

Member
Test 1a: Bare Metal Windows 7 File Transfer

Purpose: The first test was performed to get a baseline for single file transfers between Windows 7 bare metal hosts. These values will be used to determine how well ESXi handles similar traffic between VMs.
Settings:
Jumbo frames to 9000
15GB Ramdisk

Results: I am capable of pushing 700-1,100MB/s with a simple copy/paste between hosts over IPoIB.

Analysis: Not too bad considering no tweaks other than jumbo frames and no RDMA support. These numbers are consistent with a 10GbE connection, suggesting that only one of the 4 InfiniBand lanes is being utilized. This is slightly better than I achieved over 10GbE. I now need to do some tweaking to maximize this performance. Based on advice from mrkrad, I checked IRQs, which don’t appear to be shared. They range from -3 to -8. The negative digits are new to me since I haven’t played with IRQs since the set-them-in-BIOS days. Any advice here would be appreciated. I intend to set the MTU to a higher value. I'm very curious what MTU settings worked well for others. Any other suggestions are welcome. One other interesting thing to note: these cards linked at 32Gb in Windows 7 instead of 40Gb as they did in ESXi. I’m guessing that has something to do with the PCIe 2.0 x8 limit of 32Gb/s.
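In case it helps anyone else poking at IPoIB MTUs, these are the quick checks I'm running from an elevated prompt. The interface name is whatever Windows calls your IPoIB adapter, and 4092 is just an example value, not a recommendation:

rem show the current MTU for each interface
C:\> netsh interface ipv4 show subinterfaces
rem raise the MTU on the IPoIB adapter
C:\> netsh interface ipv4 set subinterface "Local Area Connection 3" mtu=4092 store=persistent

If I have it right, vstat from the Mellanox WinOF package also reports the negotiated port rate, which is handy for the 32Gb vs 40Gb question:

C:\> vstat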
 

33_viper_33

Member
Test 2: ESXI based Windows 7 File Transfer

Purpose: The second test uses two Windows 7 VMs under ESXi performing the same simple file transfer as in test 1.

Setup: The InfiniBand cards are attached to vSwitches. Each VM has a single VMXNET3 adapter. Jumbo frames are enabled, with the MTU left at the default (1500).
I would like to check IRQs but am finding documentation hard to come by for ESXi. If anyone would like to advise me here, it would be appreciated.
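For reference, these are the commands I'm using from the ESXi shell to set and verify the vSwitch MTU, plus the closest thing I've found to an IRQ view so far. vSwitch1 is a placeholder name for whichever vSwitch the IB uplink hangs off:

~ # esxcfg-vswitch -m 9000 vSwitch1     # enable jumbo frames on the vSwitch
~ # esxcfg-vswitch -l                   # list vSwitches and confirm the MTU took
~ # esxtop                              # press 'i' for the interrupt panel to see which vectors and CPUs are busy

The VMXNET3 adapter inside each guest still needs its own jumbo/MTU setting in Device Manager, otherwise it stays at 1500.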

Results: The best performance I achieve in this test is 600MB/s, but it is extremely inconsistent. Normal speed is 200-400MB/s.

UPDATE: Consistency issues were fixed after finding a bad stick of RAM. I’m now getting consistent 600MB/s transfers.

Analysis: This test leads me to believe the vSwitch is not very efficient and the ESXi overhead is considerable. I’m disappointed after achieving just shy of a constant 500MB/s from OpenIndiana to a 2012 server on the same ESXi host over VMXNET3 adapters and the vSwitch. That was done using 2 spinning disks for OI and an SSD vHDD for Windows 2012 over iSCSI. iSCSI may be the key word here.
 

33_viper_33

Member
Test 3: ESXI Pass-Through and Windows 7 File Transfer

Setup: The third test will use the same settings found in test two, with the InfiniBand cards passed to the VMs using VT-d.

Side note: I initially attempted this with the HP cards but realized that driver support for Windows 7 wasn’t all that great with the older generation cards. While I was struggling to get these working properly, I purchased two ConnectX-2 cards from dba, which are much better. Although I finally managed to get the HPs working under Windows 7, I scrapped all the original data and started from scratch using the ConnectX-2 cards to achieve consistency.

Results: More information to follow.

Analysis: TBD
 

33_viper_33

Member
Test 4: ESXI SR-IOV and Windows 7 File Transfer

Setup: The fourth test is identical to test 3, but utilizes SR-IOV to provide the Windows 7 VMs with virtual function (VF) InfiniBand adapters instead of VT-d pass-through.

Testing: None yet, due to lack of SR-IOV support in the current ESXi drivers.

Analysis: Although VMware indicates support for ConnectX-2 SR-IOV, I’ve had zero luck with it. http://kb.vmware.com/selfservice/mi...nguage=en_US&cmd=displayKC&externalId=2058261

Doing some more research on the Mellanox forum, it is indicated that the OFED 1.8.1 driver doesn’t support SR-IOV but that the next driver would. I’m running OFED 1.8.2 and cannot find support. I attempted to set up SR-IOV by performing the following:

I enabled SR-IOV, VT-d, and VT-x in the BIOS. I then PuTTY'ed into ESXi and tried the following commands:
~ # esxcli system module parameters set --module mlx4_ib --parameter-string=max_vfs=20,20
Received the following: Unable to set module parameters the following params are invalid: max_vfs
~ # esxcfg-module mlx4_ib -s max_vfs=20,20
Received the following: Unable to set module parameters the following params are invalid: max_vfs

I also tried the ipoib, core, and ib_mlx4 portions of the driver, and every other module, with no luck. I then looked at the driver in the vSphere client under the profiles, and it didn’t contain any settings for virtual functions. Unless someone knows something I don’t, I don’t think SR-IOV is supported under ESXi just yet.
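For anyone retracing my steps, this is how I checked which parameters the driver modules actually expose. The module names are the ones the Mellanox bundle installed on my host; yours may differ:

~ # esxcli system module list | grep mlx                    # which Mellanox modules are loaded
~ # esxcli system module parameters list -m mlx4_core
~ # esxcli system module parameters list -m mlx4_ib

On Linux, max_vfs is an mlx4_core parameter rather than mlx4_ib, which is why I checked both. If max_vfs doesn't show up in either listing, the parameter simply isn't in that driver build, which matches the error above.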

UPDATE:

Found this after some digging: http://community.mellanox.com/thread/1078
Looks like I will have to wait a while before continuing these tests, as the current ESXi driver doesn't support SR-IOV.
 

33_viper_33

Member
Test 5: Windows 8 bare metal and RDMA

Purpose: Test 5 will consist of two Windows 8 bare metal installs in order to test RDMA.

Results: TBD

Analysis: TBD
 

33_viper_33

Member
Test 6: ESXI with Windows 8 and RDMA

Purpose: Test 6 will consist of two Windows 8 VMs utilizing RDMA.

Results: TBD

Analysis: TBD
 

33_viper_33

Member
Test 7: OpenIndiana iSCSI targets and Windows 2012 Essentials utilizing RDMA

Purpose: Test 7 will pair OpenIndiana and Windows 2012 utilizing iSCSI. After the performance I saw while playing around with OI and iSCSI, I’m hopeful this will work well over IB.

Results: TBD

Analysis: TBD
 

mrkrad

Well-Known Member
You need 5600 series Westmere for SR-IOV, I thought?

What are your power profiles? Hardware/BIOS should be max/max/max (no C/C1E), and inside the VM, maxed out.

If you would like to try StoreVirtual 11, I can help you out. It has Adaptive Optimization, which is less caching and more storage tiering; like VSAN it can cluster, but unlike VSAN it can serve iSCSI to any device. Easy to set up with version 11.

Are you seeing high vCPU usage for RSS transfers? You should see 4 cores of utilization inside the guest to do 10gb. The sign of misconfiguration would be only 1 or 2 cores being used. Likewise netqueue or equivalent - use ethtool in the ESXi command line to make sure you have 8 RX/TX queues to match 8 cores, and that they are actually being used. ESXi likes to make the queues and only use 1.

Remember ESXi doesn't try to make one VM fast. You need multiple subnets/vSwitches/ports to actually achieve 10gb or faster. See the ethtool note above for proof of this.
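Roughly something like this from the ESXi shell - vmnic2 is just an example, use whatever your IB/10gb uplink enumerates as, and the counter names vary by driver so you may need to eyeball the full output:

~ # esxcli network nic list      # find the vmnic name of the uplink
~ # ethtool -S vmnic2            # look at the per-queue rx/tx counters - if only one queue is moving traffic, that's your problem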
 

Biren78

Active Member
One of the primary goals is to keep power requirements to a minimum. Currently, my pfSense router and server are hosted on a Core2 Quad utilizing an APC SMX1500 UPS to power up and shut down the network on a schedule. This became important to me after I saw my power bills. This beast sucks 250w with the switch at idle and can easily pull upwards of 500w...
Yikes! You know - with that much power being sucked you could actually make a "business case" for replacing home gear. Crazy! I do hope you live somewhere cold so you can use the exhaust heat for something.
 

33_viper_33

Member
Yikes! You know - with that much power being sucked you could actually make a "business case" for replacing home gear. Crazy! I do hope you live somewhere cold so you can use the exhaust heat for something.
Yeah, I was shocked by those numbers. I didn’t mention that the old server is running 9 x 1.5TB drives, 1 x 150GB 15,000rpm drive, 2 x 1TB drives and an SSD, in addition to an Areca RAID controller, a 10GbE NIC and a 4-port 1GbE NIC. It’s in a Norco 3216 case with 6 fans. The whole thing runs very hot. I do live in the north, and the heat is not a bad thing in the winter. When I was living in the south, I hated working in the office due to the heat of the beast. That is why I use the UPS to cleanly start and stop the server when not in use. This is also why I want to separate the large array out from the rest of the all-in-one, since I don’t need access to it all the time.

A smaller NAS would do fine combined with the router and soft switch in a single box. I only need 2 IB ports for the second server and an ESXi node for testing purposes, 2 x 10GbE ports for the media room and office computer, and 4 ports (likely all onboard, depending on the motherboard purchased) for WAN, switch, UPS and laptop. A wireless adapter can handle the mobile devices, and the switch will provide the additional ports for things like my Avocent KVM, management interfaces, C6100, etc. that don’t always need to be running. Therefore, 60-70w is my target since that is what my Dell 5224 switch uses at idle. I can also drop 10-20w for the wireless AP.

As for making good BUSINESS sense, exactly. The server does some off-site backups. I’m speed limited by my WAN, and the server is going to run for as long as the backups take each night. The backups do not work the server hard but do spin up the disks. Idle power is very important here across the board. The backup server only needs to run for 2-3 hours a night and when someone wants to watch movies. The network server needs to run for the same duration of time, in addition to whenever someone is home. Two separate schedules that can easily be controlled by the UPS.

Mrkrad, thanks for the insight!

BIOS: C-states = disabled, C1E = disabled, performance = max
ESXi: Performance = max on one host, and the setting is unavailable on the other. Not sure why. Both were originally set to max prior to messing with the vSphere web client, and both BIOS configurations are identical.
Clients: power profile set to performance

Not entirely sure of processor support for SR-IOV; however, the 5500 series does support VT-d and VT-x. I'm not finding specific references to SR-IOV on Intel's site.

The clients have 12 cores assigned and 20GB of RAM. The RAMdisk takes up 15GB. I’m starting to think I have a RAM leak. My tests are getting worse and worse. One VM’s processor sits around idle, occasionally peaking at 10%. The other can get upwards of 90 percent, sometimes causing the system to become temporarily unresponsive. The same VM’s RAM utilization climbs until it’s near 100%, stays there until the transfer is over, and then starts giving back resources until it’s around 16GB. It doesn’t matter which machine initiates the transfer. However, it appears to be worse on writes to the suspected bad node, though reads are still affected. I may swap out the bad node and move memory around to test for problems.

Bottom line: after last night’s tests, I’m realizing I have some other issues going on here outside of just ESXi. I need to go back, test bare metal, and see if I notice any issues. I'm not sure I will, since the bad stick of RAM (if that’s the problem) may not be touched given the lower memory requirements; less RAM is used bare metal.
 

mrkrad

Well-Known Member
Sounds like power management is still happening! Go to advanced settings -> Power in vCenter and change the 10%/20% processor idle settings and the C/P state settings to disable them, then set power management to custom. What ESXi will do is park one CPU and move the primary one to the lowest P state if possible. Park as in shut off the 2nd socket if nothing important is happening. IIRC in ESXi 5.1 it was 10 or 20% (both CPUs' combined average sampled load), which is about where you claim the CPUs are hovering.
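From the CLI that's roughly the following - double check the option name on 5.5, this is from memory:

~ # esxcli system settings advanced list -o /Power/CpuPolicy                        # show the current policy
~ # esxcli system settings advanced set -o /Power/CpuPolicy -s "High Performance"   # or "Custom" to expose the idle knobs

If memory serves, the 10%/20% idle thresholds show up as additional options under the same /Power/ tree once the policy is set to Custom.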
 

33_viper_33

Member
OK, did some troubleshooting and found some bad RAM. After removing the bad RAM, I’m seeing consistent 600MB/s transfers. The processor runs 6 cores at 13% with occasional spikes up to 35%. RAM utilization now climbs on both VMs, but only by a GB, and it is quickly returned after the transfer. This still indicates VMware has a bottleneck somewhere. I’m going to move on to Windows 8 and see if I get similar speeds with RDMA. If so, I’ll take mrkrad’s advice and try adding a second virtual NIC and test with teaming.
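The rough plan for the multi-NIC test, per mrkrad's suggestion, is below. The port group and vSwitch names are placeholders for my own setup:

~ # esxcli network vswitch standard portgroup add -p IB-PG2 -v vSwitch1    # second port group on the IB vSwitch

Then add a second VMXNET3 adapter to each VM in its settings and put it on a separate subnet so traffic actually spreads across both.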

Test 2 post has been updated.

Thanks for the power state advice, mrkrad! I’ll check it this evening. Also, thank you for the StoreVirtual offer of assistance. I may take you up on that, but I am 3 months out from that phase of testing. I'll start doing research into the various options soon. I just want to get the network working well first.
 

33_viper_33

Member
Quick update. Found this after some digging: SR-IOV on ESXi 5.1 | Mellanox Interconnect Community
Looks like I will have to wait a while before continuing these tests, as the current ESXi driver doesn't support SR-IOV. I updated the SR-IOV post above. While waiting for the updated driver, I'm setting up SR-IOV with an Intel X540 10GbE card, just for the exercise and the understanding gained. I'll post more info sometime after the holidays.
 

33_viper_33

Member
I've just picked this project up again and finished setting up 2 x Windows 2012 servers for direct-connect speed tests over SMB3. After setting up both servers and updating, I installed the Mellanox driver and started OpenSM on one server, then on both. One server makes a connection; the other says it's disconnected. I've reinstalled the driver and tried OpenSM on both machines. I'm now lost! The card makes a good connection when one Windows 2012 server is connected to an ESXi server instead. Any ideas?
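These are the checks I'm running on each box while troubleshooting, in case someone spots something dumb. This assumes OpenSM got registered as a service named opensm - adjust to whatever your installer used:

rem is the subnet manager service actually running on the node that should own it?
C:\> sc query opensm
C:\> net start opensm
rem vstat (Mellanox WinOF) shows port state: DOWN usually means cable/physical, stuck in INIT usually means no subnet manager has swept the port
C:\> vstat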

Update:

Just a side note: iSCSI performance from OpenIndiana to Windows 2012 was pathetic, 450Mbps average. I could do better with simple shares and gigabit. This was transferred between two RAID arrays and not RAMdisks like all my other tests. I know one array can transfer in excess of 600MB/s. A single hard drive should be capable of doing better than 450Mbps. I'm not sure what's going on there either.
 

dba

Moderator
I've just picked this project up again and finished setting up 2 x Windows 2012 servers for direct-connect speed tests over SMB3. After setting up both servers and updating, I installed the Mellanox driver and started OpenSM on one server, then on both. One server makes a connection; the other says it's disconnected. I've reinstalled the driver and tried OpenSM on both machines. I'm now lost! The card makes a good connection when one Windows 2012 server is connected to an ESXi server instead. Any ideas?
It's probably a configuration issue, but just to make sure: Swap cables, and then swap cards between the machines.