ConnectX-2 crashes ESXi host when VM is booted


Moff Tigriss

New Member
EDIT 1: see post #3, ESXi is not the culprit here. Running anything bare metal does exactly the same thing.

Hi,

I'm finally upgrading my homelab network to 10GbE, and I'm having some issues with my ConnectX-2 cards.

The server is a Dell R710 running ESXi 6.5 U1. I want to pass the card through to a Windows Server 2016 VM, or a Debian one.

The cards are the HP version of the ConnectX-2, the MNPA19-XTR. The firmware was old (2.9.1000) and needed an upgrade before going further. The process is common and well documented, and I flashed the generic 2.10.720 firmware. Along the way I backed up the original firmware and configuration, just in case.
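For reference, the flashing steps were roughly the following; the PCI address and file names are just placeholders for whatever your setup uses:

```
# Roughly what I ran with mstflint (find the real PCI address with: lspci | grep Mellanox)
mstflint -d 04:00.0 query                                   # current firmware version and PSID
mstflint -d 04:00.0 ri backup_hp_2.9.1000.bin               # back up the original firmware image
mstflint -d 04:00.0 dc backup_hp_2.9.1000.ini               # dump the current configuration (INI)
mstflint -d 04:00.0 -i fw-ConnectX2-rel-2_10_0720.bin burn  # burn the generic image
```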

After that, the Windows VM crashes the entire host when booting. In fact, even a fresh VM booted from the install ISO does the same before showing anything. The Debian VM boots fine... but the Mellanox driver complains that something is wrong (something about bad IRQ mapping; the last message is "BIOS or ACPI interrupt routing problem?"), and no adapter is available.
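In case it helps, this is more or less what I look at inside the Debian VM to check the driver state (standard tools only):

```
dmesg | grep -i mlx4              # mlx4_core probe messages, including the IRQ routing complaint
lspci -nnk | grep -iA3 mellanox   # is the card visible, and which driver (if any) is bound to it
ip link show                      # whether an Ethernet interface was created at all
```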

So I tried flashing the 2.9.8350 firmware. Now I can't even boot a Linux VM with the card passed through, and the host just hangs, without even rebooting.

Some details:
- Tested with two cards; exactly the same result on both.
- The VM had all the CPUs, had its memory reserved, and was the only one running.
- The latest driver was installed on Windows, with no improvement.
- With the 2.9.1000 and 2.9.1200 firmware, Windows boots correctly without crashing the host. Sadly, I didn't verify the result with mlx4_core on Linux.
- I will try to reflash the original firmware, using the recovery jumper.
- I did not replace the HP PXE boot ROM with the default one, because the HP one is more recent. The PXE boot works well, and I have seen examples of people keeping the HP one anyway.
- There is no difference between the HP INI file and the default one (see the comparison sketch below).
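For completeness, this is roughly how I compared the two; file names are just examples, and the generic image is the one from the Mellanox firmware package:

```
mstflint -d 04:00.0 dc hp_card.ini                     # INI currently on the card
mstflint -i fw-ConnectX2-rel-2_10_0720.bin dc gen.ini  # INI embedded in the generic image
diff hp_card.ini gen.ini                               # no output means the configs are identical
```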


My question is simple: what is going on? It's a common OEM card, a common procedure, a very common host, and vanilla OSes. There are even examples here of people with the same configuration.

Thank you !
 

Rand__

Well-Known Member
Interesting - I had a similar issue yesterday with passthrough as well (a PCIe-based SSD on a generic adapter).
Without the driver I could pass the drive through, but when I installed the drivers in Windows the VM froze, then the whole box. A reboot brings the box back, but then the VM freezes on start and takes down the ESXi box again. Physically removing the drive/passthrough device solved it; same issue in a different slot.
In the end I moved the card to a physical box, where it worked fine.
That was with the latest 2017 update; I have not tested with the recent 2018 update.

Just mentioning it to indicate it might not be a device-specific problem but an ESXi issue...

Edit: Also, wrong subforum; please ask for a move to VMware/ESXi.
 

Moff Tigriss

New Member
Well, I posted in the Network subforum because I was really uncertain about ESXi being the culprit here, and I was right: booting anything else directly on the R710 does the same thing.

I did notice one thing with ESXi: if the card is moved to another slot, the system crashes (PCIe device error on the front panel of the R710) before it is even fully loaded. Back in the "original" slot, ESXi can boot, but nothing changes with the VMs. The purple screen of death gives nothing useful.

So now the issue is probably somewhere in the BIOS configuration (not many choices on Dell servers, it will be quick to test), in the INI configuration of the Mellanox firmware (this is mainly why I asked here), or something really obscure and unsolvable (but, again, the R710 is a widely used server; if there were any incompatibility, it would be documented).
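For what it's worth, when I boot a Linux live environment on the host, this is roughly how I check what the slot actually negotiates (placeholder PCI address again, run as root):

```
lspci -s 04:00.0 -vv | grep -E 'LnkCap|LnkSta'   # advertised vs. negotiated PCIe link width/speed
lspci -s 04:00.0 -vv | grep -E 'UESta|CESta'     # AER error status, in case the slot is flagging errors
```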
 

Moff Tigriss

New Member
Hm ok, weird. Are the other slots x8 too?
One is x8, the other two are x4. But there are no complaints when I plug in a card with the original HP firmware.

ESXi 6.5 has a bug that PSODs with 10GbE cards.
Should be fixed in a new update.

VMware Knowledge Base

-J
That might be it, but I have the same issue when booting Linux directly on the host, or even the Windows installer. I wish I could change the title of the post.


Today, I have some news:
- With the latest firmware, ESXi can use the cards as network interfaces, and they work correctly with virtualized interfaces (around 7 Gb/s between two VMs). Not ideal, but still far better than 1 Gb/s.
- Directly on the host, Debian refuses to load the mlx4_core driver. Same thing with VyOS.
- Directly on the host, Windows crashes during boot.
- Tried playing with the PSID in the firmware, but nothing changed (commands in the sketch after this list).
- Updated the PXE ROM to the actual latest version; nothing new (not a surprise).
- Flashing the whole ROM (-nofs flag) did nothing.
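For the record, the PSID and full-flash attempts were roughly these commands (device address and image name are placeholders again):

```
# -allow_psid_change overrides the HP PSID with the generic one;
# -nofs burns in non-failsafe mode, i.e. rewrites the whole flash
mstflint -d 04:00.0 -i fw-ConnectX2-rel-2_10_0720.bin -allow_psid_change burn
mstflint -d 04:00.0 -i fw-ConnectX2-rel-2_10_0720.bin -nofs burn
```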

The BIOS is correctly configured (I/OAT, VT-d, non-interleaved memory mode...).

I also tried a dual-port card, the MNPH29D-XTR. The old firmware works in Debian, but not anymore after the upgrade to 2.10.0720.

Well, I think I have tried everything possible here.
I still hope someone has a glorious solution.