Fighting New Mellanox ConnectX-3 Setup

Discussion in 'Linux Admins, Storage and Virtualization' started by ValuedCustomer, Dec 28, 2018.

  1. ValuedCustomer

    ValuedCustomer New Member

    Joined:
    Dec 28, 2018
    Messages:
    8
    Likes Received:
    0
    So, through much blood, sweat, and tears, I have my NAS and ESXi recognizing their new-ish ConnectX-3 cards in IB mode. I can send pings across the links and everyone looks happy. However, after adding my NFS storage to ESXi, I am now having another problem. Access to the NFS share from ESXi is sporadic at best and non-existent at worst. I am getting the following error messages on the NAS server:

    Dec 28 15:36:54 san kernel: ib_srpt rejected SRP_LOGIN_REQ because target port mlx4_0_1 has not yet been enabled
    Dec 28 15:36:54 san kernel: ib_srpt Rejecting login with reason 0x10001
    Dec 28 15:36:54 san kernel: ib_srpt Received SRP_LOGIN_REQ with i_port_id 651e:2d03:00b7:5672:24be:05ff:ff9e:8e41, t_port_id e41d:2d03:00b7:5670:e41d:2d03:00b7:5670 and it_iu_len 580 on port 2 (guid=fe80:0000:0000:0000:e41d:2d03:00b7:5672); pkey 0xffff
    Dec 28 15:36:54 san kernel: ib_srpt rejected SRP_LOGIN_REQ because target port mlx4_0_2 has not yet been enabled
    Dec 28 15:36:54 san kernel: ib_srpt Rejecting login with reason 0x10001


    I have exhausted my google-fu, and this error hasn't really led me anywhere helpful. Any thoughts are appreciated.

    As far as the NAS server goes, it is a stock CentOS 6.7.
    - yum update
    - yum groupinstall "Infiniband Support"
    - systemctl enable rdma
    - configure IB interfaces (roughly as sketched below) and verify pings to ESXi work.
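
    For reference, the IPoIB interface config itself is nothing fancy; roughly this (the address/MTU values below are placeholders, not my actual ones):

    Code:
    # /etc/sysconfig/network-scripts/ifcfg-ib0
    DEVICE=ib0
    TYPE=InfiniBand
    ONBOOT=yes
    BOOTPROTO=static
    IPADDR=192.168.100.10
    NETMASK=255.255.255.0
    MTU=2044
    CONNECTED_MODE=no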

    Thanks.
     
    #1
  2. _alex

    _alex Active Member

    Joined:
    Jan 28, 2016
    Messages:
    874
    Likes Received:
    94
    What is your NAS?
    Is it based on SCST, and if so, could you post the output of
    scstadmin -write_config /dev/stdout
    (the exact option spelling may differ between SCST versions)
    or
    cat /etc/scst.conf

    At first glance it looks like your target is not (yet) enabled ...
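
    And if you don't actually need an SRP target on that box at all (NFS doesn't use it), you could simply stop ib_srpt from loading - roughly this, assuming the module is only pulled in via modprobe by the rdma service (some distros also have an SRPT_LOAD switch in /etc/rdma/rdma.conf):

    Code:
    # prevent ib_srpt from being loaded at all (it is only needed for SRP targets)
    echo "install ib_srpt /bin/true" > /etc/modprobe.d/ib_srpt.conf
    modprobe -r ib_srpt
    systemctl restart rdma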
     
    #2
  3. ValuedCustomer

    ValuedCustomer New Member

    Joined:
    Dec 28, 2018
    Messages:
    8
    Likes Received:
    0
    I'm not even sure what SCST is. I did go to the pains of installing and configuring it after seeing that error message, but that didn't fix the problem.

    # Automatically generated by SCST Configurator v3.3.0-pre1.
    TARGET_DRIVER copy_manager {
            TARGET copy_manager_tgt
    }

    TARGET_DRIVER iscsi {
            enabled 1

            TARGET mlx4_0_1 {
                    enabled 1
                    rel_tgt_id 1

                    GROUP lxc1
            }

            TARGET mlx4_0_2 {
                    enabled 1
                    rel_tgt_id 2

                    GROUP lxc1
            }
    }


    Additionally:
    # cat /sys/kernel/scst_tgt/targets/iscsi/mlx4_0_1/enabled
    1
    # cat /sys/kernel/scst_tgt/targets/iscsi/mlx4_0_2/enabled
    1


    Also I am using NFS, so I'm not sure why I would need to enable targets, but I went ahead and created what I thought the error message was complaining about.
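
    Looking at the log again, the rejects actually come from ib_srpt rather than iscsi, so I'm guessing that if SRP were really what ESXi wanted, scst.conf would need an ib_srpt section along these lines (pure guess on my part, based on the target names in the log - not something I've tested):

    TARGET_DRIVER ib_srpt {
            enabled 1

            TARGET mlx4_0_1 {
                    enabled 1
                    rel_tgt_id 3
            }

            TARGET mlx4_0_2 {
                    enabled 1
                    rel_tgt_id 4
            }
    }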

    Thanks.
     
    #3
  4. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,426
    Likes Received:
    486
    What ESXi version & driver are you using?
     
    #4
  5. ValuedCustomer

    ValuedCustomer New Member

    Joined:
    Dec 28, 2018
    Messages:
    8
    Likes Received:
    0
    #5
  6. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,426
    Likes Received:
    486
    Hm, I think the driver model changed since then, so I'm not sure that's a good combination. It would be great if it still worked, since IB support is kind of dead in newer ESXi versions, but your issue is not really inspiring confidence ;)

    Alternatively you could try Ethernet & RoCE?
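
    On a ConnectX-3 that's just a port-mode flip with mlxconfig plus a reboot - roughly this with the Linux MFT tools (the mst device path below is an example; 1 = IB, 2 = ETH):

    Code:
    mst start
    mst status                                   # find the device, e.g. /dev/mst/mt4099_pci_cr0
    mlxconfig -d /dev/mst/mt4099_pci_cr0 set LINK_TYPE_P1=2 LINK_TYPE_P2=2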
     
    #6
  7. ValuedCustomer

    ValuedCustomer New Member

    Joined:
    Dec 28, 2018
    Messages:
    8
    Likes Received:
    0
    I had the cards initially in Ethernet mode, but IIRC my switch (an IS5030) didn't support it.

    I'll roll back to 6.5 and re-try. I think I only did an initial ping test and then tried again with 6.7, did another ping test, concluded it worked fine, and went along with 6.7 from there.

    Thanks.
     
    #7
  8. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,426
    Likes Received:
    486
    yes 5030 might not do ETH.
    6.5 should work iirc (never used it myself but read reports)
     
    #8
  9. markpower28

    markpower28 Active Member

    Joined:
    Apr 9, 2013
    Messages:
    393
    Likes Received:
    98
    #9
  10. _alex

    _alex Active Member

    Joined:
    Jan 28, 2016
    Messages:
    874
    Likes Received:
    94
    Think you are lost, too - sorry.
    You will need something with Ethernet as transport, not InfiniBand.

    I was just a bit confused why you obviously had ib_srpt loaded and were seeing errors from it - but it doesn't matter in your case, as ESXi means no IB at all ...
     
    #10
  11. ValuedCustomer

    ValuedCustomer New Member

    Joined:
    Dec 28, 2018
    Messages:
    8
    Likes Received:
    0
    I finally got it working. I started moving cards around to see if the problem was specific to a certain card and it turned out to be. I'm now running with a ConnectX-2 card in the NAS and one of the working ConnectX-3 cards in ESXi.

    Alex, I'm not sure why you had problems with IB in ESXi, but it is working for me now. Additionally, I was able to use ESXi 6.7 without issues.

    Thanks.
     
    #11
  12. _alex

    _alex Active Member

    Joined:
    Jan 28, 2016
    Messages:
    874
    Likes Received:
    94
    Ok, I just wonder how this can work, as there seems to be no support for IB in ESXi ... and I still don't understand what the ib_srpt kernel module on your NAS exactly does / why something tries to connect to an SRP target, but if it works, it works ...

    I have no problems - I don't use ESXi, so there's no need to keep an eye on the HCL / what is currently supported and what fell out of support ;)
     
    #12
  13. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,426
    Likes Received:
    486
    Well maybe the magic is in the driver. Surprising it works since the driver model is supposed to have changed, but if it works...

    So what exactly have you now configured? Basic IB or any RDMA solution on top of it?
     
    #13
  14. ValuedCustomer

    ValuedCustomer New Member

    Joined:
    Dec 28, 2018
    Messages:
    8
    Likes Received:
    0
    Just basic IB. I had to disable the RDMA driver load to get it to stop complaining. Not that I imagine my spindle drives are fast enough for it to matter.
     
    #14
  15. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,426
    Likes Received:
    486
    ah ok.
    good to know that's still working in esxi with the old driver. wonder whether this will work with cx4's also...
     
    #15
  16. ValuedCustomer

    ValuedCustomer New Member

    Joined:
    Dec 28, 2018
    Messages:
    8
    Likes Received:
    0
    I have one that I bought by accident that has SFP28 instead of QSFP+ (which I thought it had), or I'd give it a try for you.
     
    #16
  17. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,426
    Likes Received:
    486
    No worries, I can test when I find the time ;) Still not decided whether I want to go down the IB route or not. Will depend on the power draw/noise of the other switches I am expecting, I guess.

    And too bad you didn't get a QSFP28 one, that would have been compatible with QSFP(+).
     
    #17
  18. svtkobra7

    svtkobra7 Active Member

    Joined:
    Jan 2, 2017
    Messages:
    316
    Likes Received:
    64
    This really worked for you with 6.7? I followed the referenced guide as well (additionally uninstalling mft / nmst) and either (a) I don't "get it" or (b) it didn't work (well, other than moving the vmnics from network NICs to storage adapters???). After that install I ran /opt/mellanox/bin/mlnx-srp-config and /opt/mellanox/bin/openibd.sh ...

    Am I missing something?

    @Rand__ at least I haven't lost another pool ... yet ...

    Code:
    [root@ESXi-02:~] esxcli storage core adapter list
    HBA Name  Driver     Link State  UID                     Capabilities  Description
    --------  ---------  ----------  ----------------------  ------------  ---------------------------------------------------------------------
    vmnic2    mlx4_core  link-n/a    gsan.81000000000000000                (0000:84:00.0) Mellanox Technologies MT27500 Family [ConnectX-3]
    vmhba0    vmw_ahci   link-n/a    sata.vmhba0                           (0000:00:1f.2) Intel Corporation Patsburg 6 Port SATA AHCI Controller
    vmhba2    nvme       link-n/a    pscsi.vmhba2                          (0000:02:00.0) Intel Corporation 900P Series [Add-in Card]
    vmhba3    nvme       link-n/a    pscsi.vmhba3                          (0000:82:00.0) Intel Corporation 900P Series [Add-in Card]
    vmhba33   mlx4_core  link-n/a    gsan.81000000000000000                (0000:84:00.0) Mellanox Technologies MT27500 Family [ConnectX-3]
    
    Code:
    [root@ESXi-02:~] esxcli network nic list
    Name    PCI Device    Driver  Admin Status  Link Status  Speed  Duplex  MAC Address         MTU  Description
    ------  ------------  ------  ------------  -----------  -----  ------  -----------------  ----  ---------------------------------------------------------
    vmnic0  0000:05:00.0  ixgben  Up            Up            1000  Full    0c:c4:7a:5e:ad:98  1500  Intel Corporation Ethernet Controller 10 Gigabit X540-AT2
    vmnic1  0000:05:00.1  ixgben  Up            Up            1000  Full    0c:c4:7a:5e:ad:99  1500  Intel Corporation Ethernet Controller 10 Gigabit X540-AT2
    
     
    #18
  19. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,426
    Likes Received:
    486
    In your case you'd also need to run OpenSM, I think, since you don't have a switch (where it might otherwise run) and only the two hosts.
    There was a hacked binary for ESX somewhere IIRC.
    But unfortunately this might not have been the best recommendation - this seems to be IB only (which I forgot), which has its own quirks (like needing an SM) and might or might not increase your direct connection speed. Furthermore, you wouldn't be able to run one port in IB and the other in ETH with CX3s, I think (or was that only true with VFs... hmm, been too long).
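
    On a Linux box the SM side is just the opensm package - something like this on CentOS 7 (ESXi itself would need that hacked vib instead):

    Code:
    yum install opensm infiniband-diags
    systemctl enable --now opensm      # or run it by hand: opensm --daemon
    sminfo                             # should report an active SM on the fabric afterwards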
     
    #19
  20. ValuedCustomer

    ValuedCustomer New Member

    Joined:
    Dec 28, 2018
    Messages:
    8
    Likes Received:
    0
    I am using a switch which is running the SM. And I have some recollection of not being able to split the modes on the ports as well, but I can't remember which card that was on now. I also couldn't get QDR working, and I'm not sure why the switch wouldn't do it. When I directly connect the cards to each other, I can get QDR, so it has to have something to do with the switch. I have the ports set to support it, but it just won't negotiate it for some reason. But, meh. I'm not terribly worried about it.

    Code:
    Name          PCI Device    Driver    Admin Status  Link Status  Speed  Duplex  MAC Address         MTU  Description
    ------------  ------------  --------  ------------  -----------  -----  ------  -----------------  ----  ---------------------------------------------------------
    vmnic1000402  0000:0d:00.0  ib_ipoib  Up            Up           20000  Full    e4:1d:2d:b7:56:72  1500  Mellanox Technologies MT27500 Family [ConnectX-3]
    vmnic4        0000:0d:00.0  ib_ipoib  Up            Up           20000  Full    e4:1d:2d:b7:56:71  1500  Mellanox Technologies MT27500 Family [ConnectX-3]
    
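    If I ever feel like chasing the QDR thing further, the negotiated rate per link should show up with the usual infiniband-diags tools on the NAS, something like:

    Code:
    ibstat mlx4_0      # per-port "Rate:" (40 = QDR, 20 = DDR)
    iblinkinfo         # walks the fabric and shows each link's negotiated width/speed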
    *Edit* Added installation notes, if it helps.
    Code:
    ================================================================================
    -                    ESXi 6.5/6.7 w/ ConnectX-3 card in IB mode                    -
    ================================================================================
      Original Source:
        https://vmexplorer.com/2018/06/08/home-lab-gen-iv-part-v-installing-mellanox-hcas-with-esxi-6-5/
    --------------------------------------------------------------------------------
      - Install 6.5 or 6.7
    
      - Copy over files
        pscp MLNX-OFED-ESX-1.8.2.5-10EM-600.0.0.2494585.zip nmst-4.11.0.103-1OEM.650.0.0.4598673.x86_64.vib mft-4.11.0.103-10EM-650.0.0.4598673.x86_64.vib root@192.168.97.22:/tmp
    
      - Disable native driver for vRDMA:
    esxcli system module set -e false -m nrdma
    esxcli system module set -e false -m nrdma_vmkapi_shim
    esxcli system module set -e false -m nmlx4_rdma
    esxcli system module set -e false -m vmkapi_v2_3_0_0_rdma_shim
    esxcli system module set -e false -m vrdma
    
      - Uninstall default driver set
        esxcli software vib remove -n net-mlx4-en -n net-mlx4-core -n nmlx4-rdma -n nmlx4-en -n nmlx4-core -n nmlx5-core
    
      - Install Mellanox OFED 1.8.2.5 and tools
        esxcli software vib install -d /tmp/MLNX-OFED-ESX-1.8.2.5-10EM-600.0.0.2494585.zip
        esxcli software vib install -v /tmp/nmst-4.11.0.103-1OEM.650.0.0.4598673.x86_64.vib
        esxcli software vib install -v /tmp/mft-4.11.0.103-10EM-650.0.0.4598673.x86_64.vib
    
      - Remove scsi-ib-srp module because it caused lock-ups. Not sure if this will cause problems.
        esxcli software vib remove -n scsi-ib-srp
    
      - Reboot
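
  - Optional sanity check after the reboot (my addition, not from the original guide; output will differ per host):
    esxcli software vib list | grep -i mlx
    esxcli network nic list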
     
    #20
    Last edited: Jan 22, 2019