ConnectX-2 MNPA19 CentOS 4.17

Discussion in 'Linux Admins, Storage and Virtualization' started by netblues, Jul 27, 2018.

  1. netblues

    netblues New Member

    Joined:
    Jul 27, 2018
    Messages:
    2
    Likes Received:
    0
    Hi,
    I'm playing around with two Mellanox ConnectX-2 MNPA19 cards under CentOS.
    Managed to hook them up; iperf is splendid at 9.8 Gbit/s.
    However, I ran into a strange kernel version issue.
    One box is a dell610t, the other a dell710r.
    Both run the same CentOS platform.
    The dell710 runs the Mellanox card happily on any kernel version;
    the dell610 works up to 4.17.9.
    Booting into 4.17.10 throws nasty dmesg errors, and then the card isn't recognized
    by ip link. I can still see it with lspci.
    I would think it's the kernel, BUT the same kernel runs happily on the other machine.
    Initially the firmware on the 610 (the non-working one) was newer (2.9.1200). I managed to downgrade it to 2.9.1000 so the two match. That proved irrelevant.
    I followed this:
    Mellanox/mlxsw
    and noted "Note: MFT needs to be re-installed following every kernel update."
    Reinstalled. Nada. Going back to 4.17.9 always works, with no further quirks.

    Code:
    lspci | grep net
    01:00.0 Ethernet controller: Broadcom Limited NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
    01:00.1 Ethernet controller: Broadcom Limited NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
    03:00.0 Ethernet controller: Mellanox Technologies MT26448 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0)
    06:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
    06:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
    07:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
    07:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
    
     dmesg | grep mlx4
    [    1.414133] mlx4_core: Mellanox ConnectX core driver v4.0-0
    [    1.414147] mlx4_core: Initializing 0000:03:00.0
    [    1.575038] mlx4_core 0000:03:00.0: VPD access failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update
    [    1.701073] mlx4_core 0000:03:00.0: VPD access failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update
    [   62.463014] mlx4_core 0000:03:00.0: command 0x4 timed out (go bit not cleared)
    [   62.463016] mlx4_core 0000:03:00.0: device is going to be reset
    [   63.519440] mlx4_core 0000:03:00.0: device was reset successfully
    [   63.519445] mlx4_core 0000:03:00.0: QUERY_FW command failed, aborting
    [   63.519447] mlx4_core 0000:03:00.0: Failed to init fw, aborting.
    [   64.543397] mlx4_core: probe of 0000:03:00.0 failed with error -5
    
    
    Running the latest kernel is not that important, but in the long run this could become a source of trouble.
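    For now I'm just staying on the known-good kernel; a minimal sketch of the check I run to warn myself when a newer one sneaks in (the version strings are hard-coded here for illustration, in practice `running` would come from `uname -r`):

```shell
# Warn when the running kernel is newer than the last known-good one.
# Values hard-coded for illustration; normally: running=$(uname -r | cut -d- -f1)
good="4.17.9"
running="4.17.10"
# sort -V does a proper version comparison (so 4.17.10 > 4.17.9)
newest=$(printf '%s\n%s\n' "$good" "$running" | sort -V | tail -n1)
if [ "$newest" != "$good" ]; then
    echo "WARNING: kernel $running is newer than last known-good $good"
fi
```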
    Any ideas/pointers more than welcome :)

    Regards

    working setup follows

    Code:
    uname -a
    Linux dell710r.local 4.17.10-1.el7.elrepo.x86_64 #1 SMP Wed Jul 25 15:25:01 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
    mlxfwmanager
    Querying Mellanox devices firmware ...
    
    Device #1:
    ----------
    
      Device Type:      ConnectX2
      Part Number:      MNPA19_A1-A2
      Description:      ConnectX-2 Lx EN network interface card; single-port SFP+; PCIe2.0 5.0GT/s; mem-free; RoHS R6
      PSID:             MT_0F60110010
      PCI Device Name:  /dev/mst/mt26448_pci_cr0
      Port1 MAC:        0002c9506d38
      Port2 MAC:        0002c9506d39
      Versions:         Current        Available   
         FW             2.9.1000       N/A         
         PXE            3.3.0400       N/A         
    
      Status:           No matching image found
    
    ethtool p3p1
    Settings for p3p1:
            Supported ports: [ FIBRE ]
            Supported link modes:   10000baseT/Full
            Supported pause frame use: No
            Supports auto-negotiation: No
            Supported FEC modes: Not reported
            Advertised link modes:  10000baseT/Full
            Advertised pause frame use: No
            Advertised auto-negotiation: No
            Advertised FEC modes: Not reported
            Speed: 10000Mb/s
            Duplex: Full
            Port: FIBRE
            PHYAD: 0
            Transceiver: internal
            Auto-negotiation: off
            Supports Wake-on: d
            Wake-on: d
            Current message level: 0x00000014 (20)
                                   link ifdown
            Link detected: yes
    
    
    dell710r ~]# dmesg | grep mlx4
    [    1.842196] mlx4_core: Mellanox ConnectX core driver v4.0-0
    [    1.842211] mlx4_core: Initializing 0000:07:00.0
    [    4.272560] mlx4_core 0000:07:00.0: 32.000 Gb/s available PCIe bandwidth, limited by 5 GT/s x8 link at 0000:00:09.0 (capable of 63.008 Gb/s with 8 GT/s x8 link)
    [    4.417617] mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0
    [    4.417863] mlx4_en 0000:07:00.0: Activating port:1
    [    4.418618] mlx4_en: 0000:07:00.0: Port 1: enabling only PFC DCB ops
    [    4.420260] mlx4_en: 0000:07:00.0: Port 1: Using 24 TX rings
    [    4.420261] mlx4_en: 0000:07:00.0: Port 1: Using 16 RX rings
    [    4.420422] mlx4_en: 0000:07:00.0: Port 1: Initializing port
    [    4.490333] mlx4_core 0000:07:00.0 p3p1: renamed from eth0
    [   11.149545] <mlx4_ib> mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v4.0-0
    [   11.150333] <mlx4_ib> mlx4_ib_add: counter index 1 for port 1 allocated 1
    [  838.363892] mlx4_en: p3p1: Steering Mode 1
    [  993.278912] mlx4_en: p3p1: Close port called
    [  998.614753] mlx4_en: p3p1: Steering Mode 1
    [43059.287860] mlx4_en: p3p1: Link Up
    [43469.354768] mlx4_en: p3p1: Link Down
    [43477.635787] mlx4_en: p3p1: Link Up
    [44200.451559] mlx4_en: p3p1: Link Down
    [44449.026510] mlx4_en: p3p1: Link Up
    [44449.131504] mlx4_en: p3p1: Link Down
    [44449.186415] mlx4_en: p3p1: Link Up
    [44543.735715] mlx4_en: p3p1: Link Down
    [45739.637833] mlx4_en: p3p1: Link Up
    [45739.692798] mlx4_en: p3p1: Link Down
    [45739.792755] mlx4_en: p3p1: Link Up
    [90178.876818] mlx4_en: p3p1: Close port called
    [90178.911964] mlx4_en: p3p1: Link Down
    [90184.441127] mlx4_en: p3p1: Steering Mode 1
    [90186.788175] mlx4_en: p3p1: Link Up
    [90191.455819] mlx4_en: p3p1: Link Down
    [90199.811894] mlx4_en: p3p1: Link Up
    [105148.790364] mlx4_en: p3p1: Link Down
    [106626.351457] mlx4_en: p3p1: Link Up
    [131280.746956] mlx4_en: p3p1: Link Down
    [132468.233247] mlx4_en: p3p1: Link Up
    [132468.288192] mlx4_en: p3p1: Link Down
    [132468.388147] mlx4_en: p3p1: Link Up
    [134150.960306] mlx4_en: p3p1: Link Down
    
    
    
    

    and on dell610t

    Code:
    
    uname -a
    Linux dell610t.local 4.17.9-1.el7.elrepo.x86_64 #1 SMP Sun Jul 22 11:57:51 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
    
    mlxfwmanager
    Querying Mellanox devices firmware ...
    
    Device #1:
    ----------
    
      Device Type:      ConnectX2
      Part Number:      MNPA19_A1-A2
      Description:      ConnectX-2 Lx EN network interface card; single-port SFP+; PCIe2.0 5.0GT/s; mem-free; RoHS R6
      PSID:             MT_0F60110010
      PCI Device Name:  0000:03:00.0
      Port1 MAC:        0002c95254ca
      Port2 MAC:        0002c95254cb
      Versions:         Current        Available   
         FW             2.9.1000       2.9.1000   
         PXE            3.3.0400       3.3.0400   
    
      Status:           Up to date
    
    ethtool p5p1
    Settings for p5p1:
            Supported ports: [ FIBRE ]
            Supported link modes:   10000baseT/Full
            Supported pause frame use: No
            Supports auto-negotiation: No
            Supported FEC modes: Not reported
            Advertised link modes:  10000baseT/Full
            Advertised pause frame use: No
            Advertised auto-negotiation: No
            Advertised FEC modes: Not reported
            Speed: 10000Mb/s
            Duplex: Full
            Port: FIBRE
            PHYAD: 0
            Transceiver: internal
            Auto-negotiation: off
            Supports Wake-on: d
            Wake-on: d
            Current message level: 0x00000014 (20)
                                   link ifdown
            Link detected: yes
    
    [root@dell610t ~]# dmesg | grep mlx4
    [    5.714617] mlx4_core: Mellanox ConnectX core driver v4.0-0
    [    5.714628] mlx4_core: Initializing 0000:03:00.0
    [    8.171325] mlx4_core 0000:03:00.0: 16.000 Gb/s available PCIe bandwidth, limited by 5 GT/s x4 link at 0000:00:03.0 (capable of 63.008 Gb/s with 8 GT/s x8 link)
    [    8.252041] mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0
    [    8.252372] mlx4_en 0000:03:00.0: Activating port:1
    [    8.253435] mlx4_en: 0000:03:00.0: Port 1: enabling only PFC DCB ops
    [    8.254477] mlx4_en: 0000:03:00.0: Port 1: Using 12 TX rings
    [    8.254478] mlx4_en: 0000:03:00.0: Port 1: Using 8 RX rings
    [    8.254652] mlx4_en: 0000:03:00.0: Port 1: Initializing port
    [    8.277082] <mlx4_ib> mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v4.0-0
    [    8.277590] <mlx4_ib> mlx4_ib_add: counter index 1 for port 1 allocated 1
    [    8.385610] mlx4_core 0000:03:00.0 p5p1: renamed from eth0
    [   10.545695] mlx4_en: p5p1: Link Up
    [   17.261120] mlx4_en: p5p1: Steering Mode 1
    
    
    
     
    #1
    Last edited: Jul 27, 2018
  2. JustinClift

    JustinClift New Member

    Joined:
    Oct 5, 2014
    Messages:
    24
    Likes Received:
    11
    As a thought, this kind of sounds like a problem I've seen over the last few months with some other Mellanox cards (ConnectX-1 and ConnectX-2 series) on CentOS. After a certain kernel version, some cards would only come up in InfiniBand (IB) mode, and the setting to have them come up in Ethernet mode was completely ignored.

    The workaround I found was to edit the modprobe line used for adding the Mellanox kernel module to the system. I needed to add "port_type_array=2" in the appropriate spot (as shown below), then run "dracut --force" to rebuild the system initrd:

    Code:
    $ cat /usr/lib/modprobe.d/libmlx4.conf
    # WARNING! - This file is overwritten any time the rdma rpm package is                                                                                                                   
    # updated.  Please do not make any changes to this file.  Instead, make                                                                                                                 
    # changes to the mlx4.conf file.  It's contents are preserved if they                                                                                                                   
    # have been changed from the default values.                                                                                                                                             
    install mlx4_core /sbin/modprobe --ignore-install mlx4_core port_type_array=2 $CMDLINE_OPTS && (if [ -f /usr/libexec/mlx4-setup.sh -a -f /etc/rdma/mlx4.conf ]; then /usr/libexec/mlx4-setup.sh < /etc/rdma/mlx4.conf; fi; /sbin/modprobe mlx4_en; if /sbin/modinfo mlx4_ib > /dev/null 2>&1; then /sbin/modprobe mlx4_ib; fi)
    
    That's with the standard CentOS 7.x drivers, not the ones Mellanox wants people to download and install manually.
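    For what it's worth, instead of editing the rdma-owned libmlx4.conf (which its own warning says gets overwritten on package updates), the same module option can probably be persisted with a plain `options` line in a file modprobe reads; a sketch, assuming any `.conf` under /etc/modprobe.d/ works (and remembering `dracut --force` afterwards):

```
# /etc/modprobe.d/mlx4.conf -- sketch: an "options" line is applied by modprobe
# whenever mlx4_core is loaded; re-run "dracut --force" so the initrd picks it up
options mlx4_core port_type_array=2
```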

    It works for me, though my cards are all VPI ones instead of the EN models. Something similar *might* work for you too. The key to tracking down the problem is to (after booting) manually remove the Mellanox kernel modules:

    Code:
    $ sudo rmmod mlx4_ib mlx4_en mlx4_core
    Then load each of the modules in order, experimenting with the module options that seem to make sense:

    Code:
    $ sudo modprobe mlx4_core port_type_array=2
    To find out the options available for a kernel module, use the "modinfo" command and look for the "parm" lines:

    Code:
    $ modinfo mlx4_core | grep parm
    parm:           debug_level:Enable debug tracing if > 0 (int)
    parm:           msi_x:attempt to use MSI-X if nonzero (int)
    parm:           num_vfs:enable #num_vfs functions if num_vfs > 0
    parm:           probe_vf:number of vfs to probe by pf driver (num_vfs > 0)
    parm:           log_num_mgm_entry_size:log mgm size, that defines the num of qp per mcg, for example: 10 gives 248.range: 7 <= log_num_mgm_entry_size <= 12. To activate device managed flow steering when available, set to -1 (int)
    parm:           enable_64b_cqe_eqe:Enable 64 byte CQEs/EQEs when the FW supports this (default: True) (bool)
    parm:           enable_4k_uar:Enable using 4K UAR. Should not be enabled if have VFs which do not support 4K UARs (default: false) (bool)
    parm:           log_num_mac:Log2 max number of MACs per ETH port (1-7) (int)
    parm:           log_num_vlan:Log2 max number of VLANs per ETH port (0-7) (int)
    parm:           use_prio:Enable steering by VLAN priority on ETH ports (deprecated) (bool)
    parm:           log_mtts_per_seg:Log2 number of MTT entries per segment (1-7) (int)
    parm:           port_type_array:Array of port types: HW_DEFAULT (0) is default 1 for IB, 2 for Ethernet (array of int)
    parm:           enable_qos:Enable Enhanced QoS support (default: off) (bool)
    parm:           internal_err_reset:Reset device on internal errors if non-zero (default 1) (int)
    
    Hope that helps. :)
     
    #2
  3. netblues

    netblues New Member

    Joined:
    Jul 27, 2018
    Messages:
    2
    Likes Received:
    0
    I see what you mean; however, that doesn't quite add up, since the same CentOS version works happily on a similar box.

    By closely examining dmesg I noticed this

    not working card
    Code:
    mlx4_core 0000:03:00.0: 16.000 Gb/s available PCIe bandwidth, limited by 5 GT/s x4 link at 0000:00:03.0 (capable of 63.008 Gb/s with 8 GT/s x8 link)
    
    working card
    Code:
    mlx4_core 0000:07:00.0: 32.000 Gb/s available PCIe bandwidth, limited by 5 GT/s x8 link at 0000:00:09.0 (capable of 63.008 Gb/s with 8 GT/s x8 link
    
    In other words, the problematic card is seated in an x4 (Gen2) slot...
    I tried moving the card to an x8 slot (like the one that works fine).
    To my surprise, the system failed to complete the BIOS boot and halted with "PCIe training error slot 2 System Halted".
    I tried moving the card to other slots.
    The problem followed: PCIe training errors on all x8 slots.
    It wasn't practical to try the other card from the other system (seated in an x8 slot) here,
    but I suspect it would work, and that I'm facing a faulty card rather than a software issue...
    At some point the BIOS managed to boot with the card attached, but the card was nowhere to be found, even at the lspci level.
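    (Side note for anyone checking their own slots: the negotiated width can be pulled straight out of that dmesg line, and `lspci -vv` on the device's LnkSta line should agree. A small sed sketch, with the sample dmesg line hard-coded here for illustration; in practice pipe in `dmesg | grep 'available PCIe bandwidth'`:)

```shell
# Extract the negotiated PCIe link width ("x4", "x8", ...) from the
# mlx4_core bandwidth message. Sample line hard-coded for illustration.
line='mlx4_core 0000:03:00.0: 16.000 Gb/s available PCIe bandwidth, limited by 5 GT/s x4 link at 0000:00:03.0 (capable of 63.008 Gb/s with 8 GT/s x8 link)'
width=$(printf '%s\n' "$line" | sed -n 's/.*limited by [^ ]* GT\/s \(x[0-9]*\) link.*/\1/p')
echo "$width"
```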

    The next best thing is exchanging the cards.

    I'll keep you posted on the findings.

    Regards
     
    #3
