ConnectX-2 and ESXi 6.0


TeeJayHoward

Active Member
Feb 12, 2013
376
112
43
I'm getting to the point where I need some real help here. I thought all I'd need to do was install OFED on the ESXi host and have an OpenSM instance running somewhere on the InfiniBand network. Apparently not. I've got a host with an MHQH29B card in it, but I can't get a link light to show up on the IS5024Q switch. I tried the 1.9.10 OFED, and then noticed that it isn't for InfiniBand, just for Ethernet (source). So I switched to the 1.8.2.4 OFED, which still detects my card okay, but I can't get the damned thing to link up! I even tried running OpenSM on the host. Still nothing, not that it should matter: I've got OpenSM.exe running as a service on a Windows box. THAT machine has no issues lighting up a link light on the switch, even if I swap the card with the one currently in the troubled ESXi host!

Code:
[root@esxi-01:/opt/mellanox/bin] ./mst status -v
PCI devices:
------------
DEVICE_TYPE             MST                           PCI       RDMA    NET                       NUMA
ConnectX2(rev:b0)       mt26428_pciconf0              01:00.0
[root@esxi-01:/opt/mellanox/bin] ./mdevices_info
PCI devices:
------------
ConnectX2(rev:b0)       mt26428_pciconf0
[root@esxi-01:/opt/mellanox/bin] ./flint -d mt26428_pciconf0 q
Image type:      FS2
FW Version:      2.10.720
FW Release Date: 12.3.2012
Device ID:       26428
Description:     Node             Port1            Port2            Sys image
GUIDs:           0008f104039a5d6c 0008f104039a5d6d 0008f104039a5d6e 0008f104039a5d6f
MACs:                                 0008f19a5b82     0008f19a5b83
VSD:
PSID:            MT_0D81120009
[root@esxi-01:/opt/mellanox/bin] esxcli software vib list|grep Partner
nmst                           3.8.0.56-1OEM.600.0.0.2295424         MEL       PartnerSupported    2015-04-09
mft                            3.8.0.56-0                            Mellanox  PartnerSupported    2015-04-09
net-ib-addr                    1.9.10.0-1OEM.550.0.0.1331820         Mellanox  PartnerSupported    2015-04-10
net-ib-cm                      1.8.2.4-1OEM.500.0.0.472560           Mellanox  PartnerSupported    2015-04-10
net-ib-core                    1.8.2.4-1OEM.500.0.0.472560           Mellanox  PartnerSupported    2015-04-10
net-ib-ipoib                   1.8.2.4-1OEM.500.0.0.472560           Mellanox  PartnerSupported    2015-04-10
net-ib-mad                     1.8.2.4-1OEM.500.0.0.472560           Mellanox  PartnerSupported    2015-04-10
net-ib-sa                      1.8.2.4-1OEM.500.0.0.472560           Mellanox  PartnerSupported    2015-04-10
net-ib-umad                    1.8.2.4-1OEM.500.0.0.472560           Mellanox  PartnerSupported    2015-04-10
net-mlx4-core                  1.8.2.4-1OEM.500.0.0.472560           Mellanox  PartnerSupported    2015-04-10
net-mlx4-en                    1.9.10.0-1OEM.550.0.0.1331820         Mellanox  PartnerSupported    2015-04-10
net-mlx4-ib                    1.8.2.4-1OEM.500.0.0.472560           Mellanox  PartnerSupported    2015-04-10
net-rdma-cm                    1.9.10.0-1OEM.550.0.0.1331820         Mellanox  PartnerSupported    2015-04-10
scsi-ib-iser                   1.9.10.0-1OEM.550.0.0.1331820         Mellanox  PartnerSupported    2015-04-10
scsi-ib-srp                    1.8.2.4-1OEM.500.0.0.472560           Mellanox  PartnerSupported    2015-04-10
What step am I missing?
 

tjk

Active Member
Mar 3, 2013
481
199
43
I'd get rid of the 1.9 drivers that ship with ESX, especially since you added the 1.8 drivers.

I have these running on 50+ ESX nodes without issue. The only difference is that I use the 4036 switch, which has a subnet manager built in.

The only process I follow is:

Install ESX 5.5, patch it, then...

#Remove 1.9 driver that ships with ESX 5.5
esxcli software vib remove -n=net-mlx4-en -n=net-mlx4-core
reboot

#Install latest 1.8.2.4 driver for SRP or IPoIB
esxcli software vib install -d /tmp/MLNX-OFED-ESX-1.8.2.4-10EM-500.0.0.472560.zip --no-sig-check

#Optional - Install MFT Tools - allows you to query the card as well as apply firmware updates through ESX
esxcli software vib install -v /tmp/MLNX-MFT-ESX-3.7.1.3-10EM-550.0.0.1331820.zip --no-sig-check

reboot
esxcli software vib list|grep Mel

Code:
~ # esxcli software vib list|grep Mel
mft                            3.8.0.56-0                            Mellanox  PartnerSupported  2015-04-06 
net-ib-cm                      1.8.2.4-1OEM.500.0.0.472560           Mellanox  PartnerSupported  2015-04-04 
net-ib-core                    1.8.2.4-1OEM.500.0.0.472560           Mellanox  PartnerSupported  2015-04-04 
net-ib-ipoib                   1.8.2.4-1OEM.500.0.0.472560           Mellanox  PartnerSupported  2015-04-04 
net-ib-mad                     1.8.2.4-1OEM.500.0.0.472560           Mellanox  PartnerSupported  2015-04-04 
net-ib-sa                      1.8.2.4-1OEM.500.0.0.472560           Mellanox  PartnerSupported  2015-04-04 
net-ib-umad                    1.8.2.4-1OEM.500.0.0.472560           Mellanox  PartnerSupported  2015-04-04 
net-mlx4-core                  1.8.2.4-1OEM.500.0.0.472560           Mellanox  PartnerSupported  2015-04-04 
net-mlx4-ib                    1.8.2.4-1OEM.500.0.0.472560           Mellanox  PartnerSupported  2015-04-04 
net-mst                        3.8.0.56-1OEM.550.0.0.1331820         Mellanox  PartnerSupported  2015-04-06 
scsi-ib-srp                    1.8.2.4-1OEM.500.0.0.472560           Mellanox  PartnerSupported  2015-04-04
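
A quick way to spot the mix of 1.8 and 1.9 packages discussed above is to list the distinct driver versions among the Mellanox VIBs. The sketch below is a hypothetical helper (driver_versions is not part of esxcli or the Mellanox tools). It reads text in the format of "esxcli software vib list" from standard input, assumes the VIB name prefixes seen in this thread (net-ib-, net-mlx4-, net-rdma-, scsi-ib-), and ignores the mft/nmst tool packages.

```shell
# Sketch: report which OFED driver versions appear among the Mellanox VIBs.
# Input: text in the format of `esxcli software vib list` on stdin.
# driver_versions is a hypothetical helper name used only for illustration.
driver_versions() {
    # Keep only the driver VIBs, take the version column, strip the
    # "-1OEM..." build suffix, and de-duplicate the result.
    awk '$1 ~ /^(net-ib-|net-mlx4-|net-rdma-|scsi-ib-)/ { split($2, v, "-"); print v[1] }' | sort -u
}

# On a host this would be fed from esxcli, for example:
#   esxcli software vib list | driver_versions
```

If this prints more than one version, as it would for the listing in the first post (1.8.2.4 and 1.9.10.0), the host is carrying a mixed driver set.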
 

markpower28

Active Member
Apr 9, 2013
413
104
43
TeeJayHoward said: "IS5024Q switch"

The IS5024Q does not have a built-in SM. When you install OpenSM you also need to upload/burn a partitions.conf file to it so that it will start communicating with the rest of the HCAs.

The 4036 does have a partitions.conf on the switch. That's why the link lit up when you installed the 1.8.2.4 driver.
 

TeeJayHoward

markpower28 said: "Did you apply the partitions.conf file? The IS5024Q does not have a built-in SM. When you install OpenSM you also need to upload/burn a partitions.conf file to it so that it will start communicating with the rest of the HCAs."
This is confusing me a bit.

So, right now, I have two HCAs plugged into the IS5024Q. One is connected to a Windows workstation. The other is connected to an ESXi host. Same cards, same firmware. Only the Windows box has a link light. My understanding is that I only need OpenSM on one of these two machines in order for both of them to be able to talk to each other through the switch. Is this correct?

Additionally, partitions.conf is something for OpenSM, right? I shouldn't need it on the ESXi host at all, because OpenSM is running on the Windows box... Right?

Or is partitions.conf something that needs to be put on the IS5024Q somehow? And if so, how? Through the I2C port?
 

markpower28

Windows is kind of on its own. For testing, you can take two Windows servers and install Mellanox OFED for Windows on them; it will then work out of the box without any configuration.

For Linux/ESXi/Solaris, it does require an SM (on the switch or not) to be running, along with a configured partitions.conf. See: InfiniBand in the lab… | Erik Bussink
 

TeeJayHoward

markpower28 said: "For Linux/ESXi/Solaris, it does require an SM (on the switch or not) to be running, along with a configured partitions.conf."
Does it require OpenSM on every host?

Sadly, that link doesn't work for me. I mean, I can click on it and read the text, but it's not 100% my situation. He's installing on ESXi 5.x. I'm installing on 6.0. My /scratch/opensm/ directory is empty. Where am I supposed to put partitions.conf? Do I need to create the subdirectories for each port? (Am I to assume the subdirectories are titled after the GUIDs?) If I can't get this working by this weekend, I may just give up and downgrade to 5.5.
 

markpower28

TeeJayHoward said: "Does it require OpenSM on every host?"
Only one is required.

You are using the same 1.8.2.4 driver, so it should work with both 5.5 and 6. There is no partitions.conf file in that directory, which is why you need to copy one there. I use the switch's built-in SM, so I have never messed with OpenSM on ESXi, but it does look like you need to create a directory for each port. If you are using the default partitions.conf, then all the other ESXi hosts should communicate with it accordingly.
 

TeeJayHoward

markpower28 said: "Only one is required."
Woohoo!

markpower28 said: "You are using the same 1.8.2.4 driver, so it should work with both 5.5 and 6. There is no partitions.conf file in that directory, which is why you need to copy one there. I use the switch's built-in SM, so I have never messed with OpenSM on ESXi, but it does look like you need to create a directory for each port. If you are using the default partitions.conf, then all the other ESXi hosts should communicate with it accordingly."
Okay, to verify. I need to:

1) Create a partitions.conf file. (It doesn't exist anywhere on the system.)
2) Create a directory for each port under /scratch/opensm/.
3) Copy the partitions.conf file to each directory I just created.

That sound right?
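
The three steps above could be scripted roughly as follows. This is only a sketch under stated assumptions, and it only matters if OpenSM runs on the ESXi host itself: the per-port directories are assumed to be named after the port GUIDs with a 0x prefix, the file is assumed to be called partitions.conf, and the single line written is a commonly used OpenSM default-partition definition (IPoIB enabled, IB MTU code 5, which means 4096 bytes). write_partitions is a hypothetical helper name.

```shell
# Sketch: write one partitions.conf per port directory.
# write_partitions is a hypothetical helper; the base directory and the
# GUID-named subdirectories are assumptions about the expected layout.
write_partitions() {
    base="$1"
    shift
    for guid in "$@"; do
        dir="$base/0x$guid"
        mkdir -p "$dir" || return 1
        # Default partition (PKey 0x7fff), IPoIB enabled, IB MTU code 5
        # (4096 bytes), every port a full member.
        printf '%s\n' 'Default=0x7fff,ipoib,mtu=5:ALL=full;' > "$dir/partitions.conf" || return 1
    done
}

# Example with the two port GUIDs that flint reported in the first post:
#   write_partitions /scratch/opensm 0008f104039a5d6d 0008f104039a5d6e
```

Taking the base directory as an argument makes it easy to try the helper against a scratch directory before pointing it at /scratch/opensm.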
 

tjk

markpower28 said: "You are using the same driver 1.8.2.4. It should work with both 5.5 and 6."
He still has some of the 1.9 drivers installed; see my post earlier in the thread. I'm not sure whether that is going to conflict with the 1.8 drivers.
 

markpower28

Sounds right.

You need to have a valid partitions.conf file, so I would just copy that from your 4036.
 

markpower28

tjk said: "He still has some of the 1.9 drivers installed; see my post earlier in the thread. I'm not sure whether that is going to conflict with the 1.8 drivers."
You need to remove all the 1.9.7 drivers, otherwise it will not work.
 

TeeJayHoward

tjk said: "He still has some of the 1.9 drivers installed; see my post earlier in the thread. I'm not sure whether that is going to conflict with the 1.8 drivers."
Doing that now. Apparently 6.0 has "nmlx4" drivers as well as net-mlx4. I wonder if that was the issue?
 

TeeJayHoward

TeeJayHoward said: "Doing that now. Apparently 6.0 has 'nmlx4' drivers as well as net-mlx4. I wonder if that was the issue?"
This was the issue!

Remove EVERYTHING I've installed (and the inbox drivers):
Code:
esxcli software vib remove -f -n net-ib-addr -n net-ib-cm -n net-ib-core -n net-ib-mad -n net-ib-sa -n net-ib-umad -n net-mlx4-core -n net-mlx4-en -n net-mlx4-ib -n net-mst -n net-rdma-cm -n scsi-ib-iser -n ib-opensm -n nmlx4-core -n nmlx4-en -n nmlx4-rdma -n scsi-ib-srp -n net-ib-ipoib -n mft -n nmst
Install just the OFED:
Code:
esxcli software vib install -d /var/tmp/MLNX-OFED-ESX-1.8.2.4-10EM-500.0.0.472560.zip
Link lights up. No OpenSM required (since my Windows host is running it). Woohoo! Now to do it on the other couple hosts, and start testing speeds.
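
The removal list above was typed by hand. A sketch of generating it from the installed VIB list instead is shown below. It assumes the VIB name prefixes that appear in this thread, and build_remove_cmd is a hypothetical helper that only prints the command, so it can be reviewed before anything is actually removed.

```shell
# Sketch: build (but do not run) a removal command for every
# Mellanox-related VIB found in `esxcli software vib list` output on stdin.
# build_remove_cmd is a hypothetical helper name used only for illustration.
build_remove_cmd() {
    awk '
        # Match on VIB name prefixes so that both the net-mlx4/net-ib
        # packages and the nmlx4 inbox packages of ESXi 6.0 are caught.
        $1 ~ /^(net-ib-|net-mlx4-|net-rdma-|net-mst|scsi-ib-|ib-opensm|nmlx4-|nmst|mft)/ { args = args " -n " $1 }
        # Print nothing at all if no matching VIB was found.
        END { if (args != "") print "esxcli software vib remove -f" args }
    '
}

# On a host, review the generated command before running it:
#   esxcli software vib list | build_remove_cmd
```

Printing the command rather than executing it keeps the forced removal (-f) a deliberate, manual step.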
 

tjk

Welcome to hours ago, glad it worked for you! You should only need one SM, but you can run 2 for backup, esp if this is going to do anything production level.
 

Dravor

New Member
Aug 17, 2015
19
1
3
47
What kind of speeds did you end up getting? I am looking at ordering cards and cables today.

Also which opensm are you running on windows?

Thanks!!
 

TeeJayHoward

Dravor said: "What kind of speeds did you end up getting? I am looking at ordering cards and cables today. Also which opensm are you running on windows?"
I'm running the OpenSM included as an option in the Windows driver install package. According to Device Manager, it's driver version 4.95.10777.0. The file name I have stored on my NAS is WinMFT_x64_3_8_0_56.exe. I'm not sure that's the right file, but it's the only .exe I still have, so it might be?

As for speeds, ESXi is showing a 40Gbps link, but I'm running IPoIB which I think is limited to 10Gbps on these cards. The Windows box is showing a Link Speed of 32.0 Gbps/Full Duplex. Either way, it's more speed than I have disk for. Might be fun to put together a RAMdisk one of these days and do some tests. I may have even done that months ago when I set this up. Can't recall, and didn't mark down the results anywhere. Not even an iperf. Really doesn't seem like me...
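
The 40 Gbps and 32.0 Gbps figures are most likely the same QDR link reported two ways: a 4x QDR port signals at 40 Gb/s, and QDR uses 8b/10b encoding, so only 8 of every 10 bits on the wire carry data. A small check of that arithmetic (data_rate_gbps is just an illustrative name):

```shell
# 8b/10b encoding: 8 data bits are carried in every 10 bits on the wire.
data_rate_gbps() {
    signal_gbps="$1"
    echo $(( $signal_gbps * 8 / 10 ))
}

data_rate_gbps 40   # prints 32
```

Actual throughput over IPoIB will be lower again, so an iperf run between two hosts is still the only way to know what the link really delivers.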