InfiniBand, SRP, ESXi 5.5/6.0, Solaris 11.2, PCIe Passthrough


Nadster

New Member
Nov 24, 2014
So with an abundance of inexpensive InfiniBand adapters available on eBay, I purchased a couple of Mellanox ConnectX-2 VPI dual port 40 Gb/s HCAs (MHQH29B-XTR) to see if I could get them to work in my ESXi environment (SuperMicro X8DTH-6F, dual X5650s, 48 GB).

The twist here is that I want to maintain my all-in-one setup, so the ZFS array will be managed via a Solaris 11.2 VM running on the ESXi host itself.

Both HCAs were flashed with 2.9.1200 firmware. Testing was initially performed on ESXi 5.5 running Mellanox's OFED 1.8.2.4 package, with ESXi also hosting the subnet manager. Solaris 11.2 detected the HCA configured for PCIe passthrough, the LEDs on both cards illuminated as expected, and ibstat reported a 40 Gb/s link rate on both ends.

I went through the Solaris COMSTAR configuration to export a test 16 GB volume, which ESXi detected, and everything seemed to work fine until I tried to format the drive using ESXi's 'Add Storage' procedure. Formatting the 16 GB volume took roughly 30 minutes, with the Windows vSphere client reporting timeout errors and the UI becoming very unresponsive. The new 16 GB SRP-backed datastore eventually showed up under 'Configuration > Storage' as expected, but interacting with it was painfully slow.
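
For reference, the COMSTAR side was the usual zvol-backed LU export over SRP, roughly along these lines (a reconstructed sketch rather than a transcript; the pool/volume names are placeholders and the exact service/command invocations are from memory):

    # Enable the STMF framework and the SRP target service
    svcadm enable stmf
    svcadm enable -r ibsrp/target

    # Create the 16 GB test zvol and register it as a COMSTAR logical unit
    zfs create -V 16g rpool/srptest
    stmfadm create-lu /dev/zvol/rdsk/rpool/srptest

    # Expose the LU to all initiators (no host/target groups for this test)
    stmfadm add-view <GUID printed by create-lu>

    # Sanity checks
    stmfadm list-lu -v
    srptadm list-target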

As a second test, I used PCIe passthrough to direct both HCAs to separate Linux VMs, with each VM receiving one physical adapter. With a Linux-based subnet manager running on one of the VMs, I again noted both HCAs reported physical/logical links as expected. Running one of the Linux-supplied InfiniBand bandwidth test applications, I verified the throughput was comparable to what others have reported given the 40 Gb/s link rate.
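
In case it's useful, that test amounted to running opensm on one VM and one of the perftest tools (e.g. ib_write_bw) between the two; the address below is a placeholder:

    # VM 1 (first HCA passed through): start the subnet manager in the background
    opensm -B

    # Both VMs: confirm an Active 40 Gb/s (4X QDR) link
    ibstat

    # VM 1: start the bandwidth test server
    ib_write_bw

    # VM 2: run the client against an IP reachable on VM 1 (IPoIB or any other
    # interface; perftest only uses it for the out-of-band handshake)
    ib_write_bw 192.168.20.1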

As a final test, I flashed both HCAs with 2.10.720 firmware and tested with ESXi 6.0. ESXi 6.0, running the same InfiniBand software used with ESXi 5.5, appeared to work fine. Solaris 11.2, however, failed to boot, with the OS reporting 'interrupts/eq failed' and ESXi 6.0 automatically shutting the VM down with an error stating that PCIe passthrough failed to register an interrupt. When I downgraded the firmware on both HCAs back to 2.9.1200, ESXi 6.0 booted the Solaris 11.2 VM without incident, but I observed the same slow behavior as with ESXi 5.5.
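
(For anyone wanting to repeat the firmware juggling: I won't swear to the exact commands, but mstflint from a Linux environment handles it, along these lines; the PCI address and image file name are placeholders for whatever matches your card's PSID.)

    # Find the HCA and check the running firmware / PSID
    lspci | grep Mellanox
    mstflint -d 02:00.0 query

    # Burn the desired image (2.10.720 to upgrade, 2.9.1200 to go back)
    mstflint -d 02:00.0 -i fw-ConnectX2.bin burn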

So at this point I am left scratching my head. PCIe passthrough of these InfiniBand adapters does appear to work properly with two Linux VMs, but ESXi 5.5/6.0 combined with a Solaris 11.2 VM is a no-go for some as-yet-unknown reason.

I am still digging into this to see if I can get it to work, but if anyone has any ideas/suggestions, I would appreciate it!
 

markpower28

Active Member
Apr 9, 2013
What's your HBA? And what kind of disks?

For your all-in-one ESXi server, is this the only server in your environment? Do you plan to attach other servers to the ZFS on it?
 

Nadster

New Member
Nov 24, 2014
The ESXi host is the only server in my environment and I have no plans to add any additional servers at present.

I probably should have been a bit clearer about my test setup. While testing the InfiniBand configuration, the only storage I exported to ESXi was from the Solaris 11.2 VM. I created a ZFS volume off the default pool, which itself was backed by a VMDK residing on a separate SSD. For this test, I was only interested in getting all the InfiniBand configuration details worked out and expected performance comparable to a single SSD.
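
As a rough sanity check that the zvol itself wasn't the bottleneck, a local dd pass on the volume gives a ballpark baseline (illustrative only; the zvol path is a placeholder, and zeros will of course flatter any compression):

    # Local write/read baseline on the test zvol from within the Solaris VM
    dd if=/dev/zero of=/dev/zvol/rdsk/rpool/srptest bs=1024k count=4096
    dd if=/dev/zvol/rdsk/rpool/srptest of=/dev/null bs=1024k count=4096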

The actual storage array has 24 TB of disks connected to a couple of LSI SAS2008 controllers, which are passed through via PCIe to a FreeBSD 10.0 VM. This setup currently exports the ZFS datasets back to ESXi via NFS. The idea was to migrate this ZFS pool over to Solaris 11.2 and, with InfiniBand, move the performance bottleneck back to the storage array.
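
For context, the FreeBSD side is just the stock ZFS/NFS sharing arrangement, roughly like this (the dataset name and addresses are placeholders, not my actual values):

    # /etc/rc.conf on the FreeBSD 10.0 VM needs the NFS server bits enabled:
    #   zfs_enable="YES"
    #   nfs_server_enable="YES"
    #   mountd_enable="YES"
    #   rpcbind_enable="YES"

    # Then share the dataset to the internal storage network that ESXi mounts:
    zfs set sharenfs="-maproot=root -network 192.168.10.0 -mask 255.255.255.0" tank/vmstore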
 

markpower28

Active Member
Apr 9, 2013
If you only have one server, why don't you create an internal switch and share it between the VMs? It will be much faster than anything else.

And how does the Solaris VM's IB card communicate with the ESXi IB card?
 

Nadster

New Member
Nov 24, 2014
I am currently using an internal switch that links ESXi with the FreeBSD VM for (Ethernet-based) NFS storage traffic. When I benchmarked the network performance between those two endpoints, I was seeing a few Gb/s, as I recall.
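
(That figure came from a plain TCP benchmark across the internal vSwitch, something like the iperf3 run below, though I won't vouch for the exact tool/version, and the address is a placeholder.)

    # FreeBSD VM: server end
    iperf3 -s

    # Client end (a VM on the same internal vSwitch): 30 s, 4 parallel streams
    iperf3 -c 192.168.10.2 -t 30 -P 4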

The two InfiniBand HCAs are plugged into adjacent PCIe slots and are connected via a single external cable (looped back to the first port on each card).

The idea behind InfiniBand is not just for improved bandwidth/latency, but also the lower CPU utilization that RDMA-based SRP provides. And given how cheap InfiniBand hardware has become, it seemed like something that would be fun/educational to try out.
 

markpower28

Active Member
Apr 9, 2013
Very interesting scenario. RDMA works best when communicating directly from one PHYSICAL host to another PHYSICAL host. In a virtual environment, you can't really isolate the CPU, even with a reservation. I haven't seen how that would work within the same server.