Fighting New Mellanox ConnectX-3 Setup


ValuedCustomer

New Member
Dec 28, 2018
So, through much blood, sweat, and tears, I have my NAS and ESXi recognizing their new-ish ConnectX-3 cards in IB mode. I can send pings across the links and everyone looks happy. However, after adding my NFS storage to ESXi, I am now having another problem. Access to the NFS share from ESXi is sporadic at best and non-existent at worst. I am getting the following error messages on the NAS server:

Dec 28 15:36:54 san kernel: ib_srpt rejected SRP_LOGIN_REQ because target port mlx4_0_1 has not yet been enabled
Dec 28 15:36:54 san kernel: ib_srpt Rejecting login with reason 0x10001
Dec 28 15:36:54 san kernel: ib_srpt Received SRP_LOGIN_REQ with i_port_id 651e:2d03:00b7:5672:24be:05ff:ff9e:8e41, t_port_id e41d:2d03:00b7:5670:e41d:2d03:00b7:5670 and it_iu_len 580 on port 2 (guid=fe80:0000:0000:0000:e41d:2d03:00b7:5672); pkey 0xffff
Dec 28 15:36:54 san kernel: ib_srpt rejected SRP_LOGIN_REQ because target port mlx4_0_2 has not yet been enabled
Dec 28 15:36:54 san kernel: ib_srpt Rejecting login with reason 0x10001


I have exhausted my google-fu, and this error hasn't really led me anywhere helpful. Any thoughts are appreciated.

As far as the NAS server goes, it is a stock CentOS 6.7.
- yum update
- yum groupinstall "Infiniband Support"
- systemctl enable rdma
- configure the IB interfaces (roughly as sketched below) and verify pings to ESXi are working.
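
For completeness, the IPoIB interface on the CentOS side is just a normal ifcfg file; something along these lines (a sketch only - the ib0 name, addresses, and the connected-mode/MTU values are placeholders, adjust to your fabric):

Code:
# /etc/sysconfig/network-scripts/ifcfg-ib0   (placeholder addresses)
DEVICE=ib0
TYPE=InfiniBand
ONBOOT=yes
BOOTPROTO=static
IPADDR=10.10.10.1
PREFIX=24
CONNECTED_MODE=yes
MTU=65520

# bring the interface up and check that the ESXi side answers
ifup ib0
ping -c 3 10.10.10.2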

Thanks.
 

_alex

Active Member
Jan 28, 2016
Bavaria / Germany
What is your NAS?
Is it based on SCST, and if so, could you post the output of
scstadm --write_config > /dev/stdout
(maybe: scstadm --write-config ...)
or
cat /etc/scst.conf

At first glance it looks like your target is not (yet) enabled ...
 

ValuedCustomer

New Member
Dec 28, 2018
I'm not even sure what SCST is. I did go to the trouble of installing and configuring it after seeing that error message, but that didn't fix the problem.

# Automatically generated by SCST Configurator v3.3.0-pre1.
TARGET_DRIVER copy_manager {
    TARGET copy_manager_tgt
}

TARGET_DRIVER iscsi {
    enabled 1

    TARGET mlx4_0_1 {
        enabled 1
        rel_tgt_id 1

        GROUP lxc1
    }

    TARGET mlx4_0_2 {
        enabled 1
        rel_tgt_id 2

        GROUP lxc1
    }
}


Additionally:
# cat /sys/kernel/scst_tgt/targets/iscsi/mlx4_0_1/enabled
1
# cat /sys/kernel/scst_tgt/targets/iscsi/mlx4_0_2/enabled
1


Also I am using NFS, so I'm not sure why I would need to enable targets, but I went ahead and created what I thought the error message was complaining about.
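
(For reference: those log lines come from SCST's SRP target driver, ib_srpt, not the iSCSI one, so if that error were actually the thing to fix, the scst.conf section would presumably look more like the sketch below - assuming the target names really are the mlx4_0_1/mlx4_0_2 ports from the log. With NFS over IPoIB none of this should be needed.)

Code:
TARGET_DRIVER ib_srpt {
    TARGET mlx4_0_1 {
        enabled 1
    }
    TARGET mlx4_0_2 {
        enabled 1
    }
}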

Thanks.
 

Rand__

Well-Known Member
Mar 6, 2014
Hm, I think the driver model has changed since then, so I'm not sure that's a good combination. It would be great if it still worked, since IB support is kind of dead in newer ESXi versions, but your issue is not really inspiring confidence ;)

Alternatively, you could try Ethernet & RoCE?
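
(If you do go that route: on a VPI card the port protocol can usually be flipped with the Mellanox firmware tools - a sketch using mlxconfig from MFT; the /dev/mst device name below is just the typical ConnectX-3 one, check yours with mst status:)

Code:
mst start                                     # load the MST driver and enumerate devices
mst status                                    # note your device, e.g. /dev/mst/mt4099_pci_cr0
mlxconfig -d /dev/mst/mt4099_pci_cr0 query | grep LINK_TYPE
mlxconfig -d /dev/mst/mt4099_pci_cr0 set LINK_TYPE_P1=2 LINK_TYPE_P2=2   # 2 = ETH, 1 = IB, 3 = VPI/auto
# reboot (or reload the mlx4 driver) for the change to take effect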
 

ValuedCustomer

New Member
Dec 28, 2018
I had the cards initially in Ethernet mode, but IIRC my switch didn't support it (IS5030).

I'll roll back to ESXi 6.5 and re-try. I think I only did an initial ping test, then tried again with 6.7, did another ping test, concluded it worked fine, and went along with 6.7 from there.

Thanks.
 

Rand__

Well-Known Member
Mar 6, 2014
Yes, the 5030 might not do ETH.
ESXi 6.5 should work IIRC (never used it myself, but I've read reports)
 

_alex

Active Member
Jan 28, 2016
Bavaria / Germany
I think you are out of luck too - sorry.
You will need something with Ethernet as transport, not InfiniBand.

I was just a bit confused why you obviously had ib_srpt loaded and are seeing errors from it - but it doesn't matter in your case, as ESXi means no IB at all ...
 

ValuedCustomer

New Member
Dec 28, 2018
I finally got it working. I started moving cards around to see if the problem was specific to a certain card and it turned out to be. I'm now running with a ConnectX-2 card in the NAS and one of the working ConnectX-3 cards in ESXi.

I was just a bit confused why you obviously had ib_srpt loaded and are seeing errors from it - but it doesn't matter in your case, as ESXi means no IB at all ...
Alex, I'm not sure why you had problems with IB in ESXi, but it is working for me now. Additionally, I was able to use ESXi 6.7 without issues.

Thanks.
 

_alex

Active Member
Jan 28, 2016
Bavaria / Germany
OK, I just wonder how this can work, as there seems to be no support for IB in ESXi ... and I still don't understand what exactly the ib_srpt kernel module on your NAS does / why something tries to connect to an SRP target, but if it works, it works ...

I have no problems, I don't use ESXi - so no need to keep an eye on the HCL / what is currently supported and what fell out of support ;)
 

Rand__

Well-Known Member
Mar 6, 2014
Well, maybe the magic is in the driver. Surprising that it works, since the driver model is supposed to have changed, but if it works...

So what exactly have you now configured? Basic IB or any RDMA solution on top of it?
 

ValuedCustomer

New Member
Dec 28, 2018
8
0
1
Just basic IB. I had to disable the RDMA driver load to get it to stop complaining. Not that I imagine my spindle drives are fast enough for it to matter.
 

Rand__

Well-Known Member
Mar 6, 2014
Ah, ok.
Good to know that's still working in ESXi with the old driver. I wonder whether this will work with CX4s also...
 

ValuedCustomer

New Member
Dec 28, 2018
8
0
1
I have one that I bought by accident that has SFP28 instead of QSFP+ (which I thought it had), or I'd give it a try for you.
 

Rand__

Well-Known Member
Mar 6, 2014
No worries, I can test when I find the time ;) Still not decided whether I want to go down the IB route or not. It will depend on the power draw/noise of the other switches I am expecting, I guess.

Too bad you didn't get a QSFP28 one, that would have been compatible with QSFP(+).
 

svtkobra7

Active Member
Jan 2, 2017
ESXi 6.7 and driver 1.8.2.5 from Mellanox per Home Lab Gen IV – Part V Installing Mellanox HCAs with ESXi 6.5

Thanks.
This really worked for you with 6.7? I followed the referenced guide as well (additionally uninstalling mft / nmst) and either (a) I don't "get it" or (b) it didn't work (well, other than moving the vmnics from network NICs to storage adapters???). After that install I ran /opt/mellanox/bin/mlnx-srp-config and /opt/mellanox/bin/openibd.sh ...

Am I missing something?

@Rand__ at least I haven't lost another pool ... yet ...

Code:
[root@ESXi-02:~] esxcli storage core adapter list
HBA Name  Driver     Link State  UID                     Capabilities  Description
--------  ---------  ----------  ----------------------  ------------  ---------------------------------------------------------------------
vmnic2    mlx4_core  link-n/a    gsan.81000000000000000                (0000:84:00.0) Mellanox Technologies MT27500 Family [ConnectX-3]
vmhba0    vmw_ahci   link-n/a    sata.vmhba0                           (0000:00:1f.2) Intel Corporation Patsburg 6 Port SATA AHCI Controller
vmhba2    nvme       link-n/a    pscsi.vmhba2                          (0000:02:00.0) Intel Corporation 900P Series [Add-in Card]
vmhba3    nvme       link-n/a    pscsi.vmhba3                          (0000:82:00.0) Intel Corporation 900P Series [Add-in Card]
vmhba33   mlx4_core  link-n/a    gsan.81000000000000000                (0000:84:00.0) Mellanox Technologies MT27500 Family [ConnectX-3]
Code:
[root@ESXi-02:~] esxcli network nic list
Name    PCI Device    Driver  Admin Status  Link Status  Speed  Duplex  MAC Address         MTU  Description
------  ------------  ------  ------------  -----------  -----  ------  -----------------  ----  ---------------------------------------------------------
vmnic0  0000:05:00.0  ixgben  Up            Up            1000  Full    0c:c4:7a:5e:ad:98  1500  Intel Corporation Ethernet Controller 10 Gigabit X540-AT2
vmnic1  0000:05:00.1  ixgben  Up            Up            1000  Full    0c:c4:7a:5e:ad:99  1500  Intel Corporation Ethernet Controller 10 Gigabit X540-AT2
 

Rand__

Well-Known Member
Mar 6, 2014
6,626
1,767
113
In your case you'd also need to run OpenSM, I think, since you don't have a switch (where it might run) and only the two hosts.
There was a hacked binary for ESX somewhere IIRC.
But unfortunately this might not have been the best recommendation - this seems to be IB only (which I forgot), which has its own quirks (such as needing an SM) and might or might not increase your direct connection speed. Furthermore, you wouldn't be able to run one port in IB and the other in ETH, I think, with CX3s (or was that only true with VFs... hmm, been too long).
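
(If either end of the link is a plain Linux box, the usual no-switch answer is to run OpenSM there - a sketch for a RHEL/CentOS-style host, using the stock package and service names:)

Code:
yum install -y opensm            # subnet manager, part of the InfiniBand tooling
systemctl enable --now opensm    # start it now and at every boot
ibstat                           # port State should go from Initializing to Active once the SM sweeps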
 

ValuedCustomer

New Member
Dec 28, 2018
8
0
1
I am using a switch which is running the SM. And I have some recollection of not being able to split the modes on the ports as well, but I can't remember which card that was on now. I also couldn't get QDR working, and I'm not sure why the switch wouldn't do it. When I directly connect the cards to each other, I can get QDR, so it has to have something to do with the switch. I have the ports set to support it, but it just won't negotiate it for some reason. But, meh. I'm not terribly worried about it.

Code:
Name          PCI Device    Driver    Admin Status  Link Status  Speed  Duplex  MAC Address         MTU  Description
------------  ------------  --------  ------------  -----------  -----  ------  -----------------  ----  ---------------------------------------------------------
vmnic1000402  0000:0d:00.0  ib_ipoib  Up            Up           20000  Full    e4:1d:2d:b7:56:72  1500  Mellanox Technologies MT27500 Family [ConnectX-3]
vmnic4        0000:0d:00.0  ib_ipoib  Up            Up           20000  Full    e4:1d:2d:b7:56:71  1500  Mellanox Technologies MT27500 Family [ConnectX-3]
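(On the 20000 in that output: the negotiated IB rate is easy to double-check from the Linux side with the standard infiniband-diags tools that come with the "Infiniband Support" group - a sketch, and the LID/port arguments are placeholders:)

Code:
ibstat                                  # "Rate:" per port - 40 = QDR, 20 = DDR
iblinkinfo                              # width/speed of every link on the fabric, including the switch ports
ibportstate <switch-lid> <port> query   # placeholders; shows LinkSpeedActive/LinkSpeedEnabled on the switch side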
*Edit* Added installation notes, if it helps.
Code:
================================================================================
-                    ESXi 6.5/6.7 w/ ConnectX-3 card in IB mode                    -
================================================================================
  Original Source:
    https://vmexplorer.com/2018/06/08/home-lab-gen-iv-part-v-installing-mellanox-hcas-with-esxi-6-5/
--------------------------------------------------------------------------------
  - Install 6.5 or 6.7

  - Copy over files
    pscp MLNX-OFED-ESX-1.8.2.5-10EM-600.0.0.2494585.zip nmst-4.11.0.103-1OEM.650.0.0.4598673.x86_64.vib mft-4.11.0.103-10EM-650.0.0.4598673.x86_64.vib root@192.168.97.22:/tmp

  - Disable native driver for vRDMA:
    esxcli system module set -e false -m nrdma
    esxcli system module set -e false -m nrdma_vmkapi_shim
    esxcli system module set -e false -m nmlx4_rdma
    esxcli system module set -e false -m vmkapi_v2_3_0_0_rdma_shim
    esxcli system module set -e false -m vrdma

  - Uninstall default driver set
    esxcli software vib remove -n net-mlx4-en -n net-mlx4-core -n nmlx4-rdma -n nmlx4-en -n nmlx4-core -n nmlx5-core

  - Install Mellanox OFED 1.8.2.5 and tools
    esxcli software vib install -d /tmp/MLNX-OFED-ESX-1.8.2.5-10EM-600.0.0.2494585.zip
    esxcli software vib install -v /tmp/nmst-4.11.0.103-1OEM.650.0.0.4598673.x86_64.vib
    esxcli software vib install -v /tmp/mft-4.11.0.103-10EM-650.0.0.4598673.x86_64.vib

  - Remove scsi-ib-srp module because it caused lock-ups. Not sure if this will cause problems.
    esxcli software vib remove -n scsi-ib-srp

  - Reboot
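
  - (Optional sanity check after the reboot; exact module/vmnic names will differ)
    esxcli system module list | grep mlx
    esxcli network nic list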
 