So with an abundance of inexpensive InfiniBand adapters available on eBay, I purchased a couple of Mellanox ConnectX-2 VPI dual port 40 Gb/s HCAs (MHQH29B-XTR) to see if I could get them to work in my ESXi environment (SuperMicro X8DTH-6F, dual X5650s, 48 GB).
The twist here is that I want to maintain my all-in-one setup, so the ZFS array will be managed via a Solaris 11.2 VM running on the ESXi host itself.
Both HCAs were flashed with 2.9.1200 firmware. Testing was initially performed on ESXi 5.5 running Mellanox’s OFED 1.8.2.4 package, with ESXi also hosting the subnet manager. Solaris 11.2 detected the HCA configured for PCIe passthrough, the LEDs on both cards illuminated as expected, and ibstat reported a 40 Gb/s link rate on both ends. I went through the Solaris COMSTAR configuration to export a test 16 GB volume, which ESXi detected, and everything seemed to work fine until I tried to format the drive using ESXi’s ‘Add Storage’ procedure. Formatting the 16 GB volume took roughly 30 minutes, with the Windows vSphere client reporting timeout errors and the UI generally becoming very unresponsive. The new 16 GB SRP-backed datastore eventually showed up under ‘Configuration > Storage’ as expected, but interacting with the new volume was painfully slow.
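For anyone wanting to reproduce the setup, the outline below is roughly the sequence involved; the OFED depot path, ZFS pool/volume names, and the LU GUID are placeholders, so treat it as a sketch rather than a copy/paste recipe.

    # ESXi 5.5 host: install the Mellanox OFED 1.8.2.4 bundle (path is a placeholder), then reboot
    esxcli software vib install -d /vmfs/volumes/datastore1/MLNX-OFED-ESX-1.8.2.4.zip --no-sig-check

    # Solaris 11.2 VM: enable COMSTAR and the SRP target service,
    # then carve out a 16 GB zvol and export it as a logical unit
    svcadm enable stmf
    svcadm enable -r ibsrp/target
    zfs create -V 16g tank/ib-test
    stmfadm create-lu /dev/zvol/rdsk/tank/ib-test
    stmfadm add-view 600144f0xxxxxxxxxxxxxxxxxxxxxxxx   # GUID printed by create-lu
    stmfadm list-lu -v

    # Back on the ESXi host: rescan so the SRP LUN is visible before running 'Add Storage'
    esxcli storage core adapter rescan --all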
As a second test, I used PCIe passthrough to direct both HCAs to separate Linux VMs, with each VM receiving one physical adapter. With a Linux-based subnet manager running on one of the VMs, I again noted that both HCAs reported physical/logical links as expected. Running one of the Linux-supplied InfiniBand bandwidth test applications, I verified that the throughput was comparable to what others have reported for a 40 Gb/s link rate.
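In case it’s useful, this is approximately what the Linux side looked like; the package names assume a RHEL/CentOS-style install, and mlx4_0 is just whatever device name ibstat reports for the ConnectX-2, so adjust for your distro.

    # VM 1: subnet manager plus diagnostics and perftest tools (package names may differ per distro)
    yum install -y opensm infiniband-diags perftest
    service opensm start            # or 'systemctl start opensm' on systemd-based distros
    ibstat                          # look for State: Active and Rate: 40 on both VMs

    # Bandwidth test from the perftest package:
    ib_send_bw -d mlx4_0            # VM 1 acts as the server
    ib_send_bw -d mlx4_0 <vm1-addr> # VM 2 connects as the client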
As a final test, I flashed both HCAs with 2.10.720 firmware and tested with ESXi 6.0. ESXi 6.0, running the same InfiniBand software used with ESXi 5.5, appeared to work fine. Solaris 11.2, however, failed to boot, with the OS reporting ‘interrupts/eq failed’ and ESXi 6.0 automatically shutting the VM down with an error stating that PCIe passthrough had failed to register an interrupt. When I downgraded the firmware on both HCAs back to 2.9.1200, ESXi 6.0 booted the Solaris 11.2 VM without incident, but I observed the same slow behavior as with ESXi 5.5.
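For completeness, the firmware burns were done with Mellanox’s MFT tools from a Linux environment that could see each card; the /dev/mst device name and image filenames below are placeholders for whatever mst status and the Mellanox firmware download page actually give you.

    mst start
    mst status                                  # note the /dev/mst/... device name
    flint -d /dev/mst/mt26428_pci_cr0 query     # confirm the current firmware version
    flint -d /dev/mst/mt26428_pci_cr0 -i fw-ConnectX2-rel-2_10_0720.bin burn
    # Downgrading back to 2.9.1200 is the same burn command with the older image.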
So at this point I am left scratching my head. PCIe passthrough of these InfiniBand adapters does appear to work properly with two Linux VMs, but ESXi 5.5/6.0 combined with a Solaris 11.2 VM is a no-go for some unknown reason.
I am still digging into this to see if I can get it to work, but if anyone has any ideas/suggestions, I would appreciate it!