ConnectX-2 MNPA19, CentOS, kernel 4.17


netblues

New Member
Jul 27, 2018
Hi,
I'm playing around with two Mellanox ConnectX-2 MNPA19 cards under CentOS.
Managed to hook them up; iperf is splendid at 9.8 Gbit/s.
However, I ran into a strange kernel version issue.
One box is a Dell T610 (dell610t), the other a Dell R710 (dell710r).
Both run the same CentOS platform.
The R710 runs the Mellanox card happily on any kernel version.
The T610 runs it up to 4.17.9.
Booting into 4.17.10 throws nasty dmesg errors and then the card isn't recognized
by ip link. I can still see it with lspci.
I would think it's the kernel, BUT the same kernel runs happily on the other machine.
Initially the firmware on the 610 (the non-working one) was newer (2.9.1200). I managed to downgrade it to 2.9.1000 so both cards would match; that proved irrelevant.
I followed this:
Mellanox/mlxsw
and noted: "MFT needs to be re-installed following every kernel update."
Reinstalled. Nada. Going back to 4.17.9 always works with no further quirks.
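(A downgrade like this is typically done with the MFT flint tool; a rough sketch of the query/burn steps, where the firmware image filename is just a placeholder for whatever image matches the card's MT_0F60110010 PSID:)

Code:
# start the mst service so the /dev/mst device nodes exist
mst start
# check the currently burned firmware
flint -d /dev/mst/mt26448_pci_cr0 query
# burn the 2.9.1000 image (placeholder filename)
flint -d /dev/mst/mt26448_pci_cr0 -i fw-ConnectX2-rel-2_9_1000.bin burn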

Code:
lspci | grep net
01:00.0 Ethernet controller: Broadcom Limited NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
01:00.1 Ethernet controller: Broadcom Limited NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
03:00.0 Ethernet controller: Mellanox Technologies MT26448 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0)
06:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
06:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
07:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
07:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)

 dmesg | grep mlx4
[    1.414133] mlx4_core: Mellanox ConnectX core driver v4.0-0
[    1.414147] mlx4_core: Initializing 0000:03:00.0
[    1.575038] mlx4_core 0000:03:00.0: VPD access failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update
[    1.701073] mlx4_core 0000:03:00.0: VPD access failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update
[   62.463014] mlx4_core 0000:03:00.0: command 0x4 timed out (go bit not cleared)
[   62.463016] mlx4_core 0000:03:00.0: device is going to be reset
[   63.519440] mlx4_core 0000:03:00.0: device was reset successfully
[   63.519445] mlx4_core 0000:03:00.0: QUERY_FW command failed, aborting
[   63.519447] mlx4_core 0000:03:00.0: Failed to init fw, aborting.
[   64.543397] mlx4_core: probe of 0000:03:00.0 failed with error -5
Running the latest kernel is not that important, but in the long run this could become a source of trouble.
Any ideas/pointers are more than welcome :)
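(In the meantime, pinning the known-good kernel as the default boot entry is easy enough; a rough sketch with grubby on CentOS 7, assuming the elrepo kernel path matching the uname output below:)

Code:
# make the working 4.17.9 elrepo kernel the default boot entry
grubby --set-default=/boot/vmlinuz-4.17.9-1.el7.elrepo.x86_64
# confirm which kernel will boot next
grubby --default-kernel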

Regards

Working setup follows:

Code:
uname -a
Linux dell710r.local 4.17.10-1.el7.elrepo.x86_64 #1 SMP Wed Jul 25 15:25:01 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
mlxfwmanager
Querying Mellanox devices firmware ...

Device #1:
----------

  Device Type:      ConnectX2
  Part Number:      MNPA19_A1-A2
  Description:      ConnectX-2 Lx EN network interface card; single-port SFP+; PCIe2.0 5.0GT/s; mem-free; RoHS R6
  PSID:             MT_0F60110010
  PCI Device Name:  /dev/mst/mt26448_pci_cr0
  Port1 MAC:        0002c9506d38
  Port2 MAC:        0002c9506d39
  Versions:         Current        Available   
     FW             2.9.1000       N/A         
     PXE            3.3.0400       N/A         

  Status:           No matching image found

ethtool p3p1
Settings for p3p1:
        Supported ports: [ FIBRE ]
        Supported link modes:   10000baseT/Full
        Supported pause frame use: No
        Supports auto-negotiation: No
        Supported FEC modes: Not reported
        Advertised link modes:  10000baseT/Full
        Advertised pause frame use: No
        Advertised auto-negotiation: No
        Advertised FEC modes: Not reported
        Speed: 10000Mb/s
        Duplex: Full
        Port: FIBRE
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: off
        Supports Wake-on: d
        Wake-on: d
        Current message level: 0x00000014 (20)
                               link ifdown
        Link detected: yes


dell710r ~]# dmesg | grep mlx4
[    1.842196] mlx4_core: Mellanox ConnectX core driver v4.0-0
[    1.842211] mlx4_core: Initializing 0000:07:00.0
[    4.272560] mlx4_core 0000:07:00.0: 32.000 Gb/s available PCIe bandwidth, limited by 5 GT/s x8 link at 0000:00:09.0 (capable of 63.008 Gb/s with 8 GT/s x8 link)
[    4.417617] mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0
[    4.417863] mlx4_en 0000:07:00.0: Activating port:1
[    4.418618] mlx4_en: 0000:07:00.0: Port 1: enabling only PFC DCB ops
[    4.420260] mlx4_en: 0000:07:00.0: Port 1: Using 24 TX rings
[    4.420261] mlx4_en: 0000:07:00.0: Port 1: Using 16 RX rings
[    4.420422] mlx4_en: 0000:07:00.0: Port 1: Initializing port
[    4.490333] mlx4_core 0000:07:00.0 p3p1: renamed from eth0
[   11.149545] <mlx4_ib> mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v4.0-0
[   11.150333] <mlx4_ib> mlx4_ib_add: counter index 1 for port 1 allocated 1
[  838.363892] mlx4_en: p3p1: Steering Mode 1
[  993.278912] mlx4_en: p3p1: Close port called
[  998.614753] mlx4_en: p3p1: Steering Mode 1
[43059.287860] mlx4_en: p3p1: Link Up
[43469.354768] mlx4_en: p3p1: Link Down
[43477.635787] mlx4_en: p3p1: Link Up
[44200.451559] mlx4_en: p3p1: Link Down
[44449.026510] mlx4_en: p3p1: Link Up
[44449.131504] mlx4_en: p3p1: Link Down
[44449.186415] mlx4_en: p3p1: Link Up
[44543.735715] mlx4_en: p3p1: Link Down
[45739.637833] mlx4_en: p3p1: Link Up
[45739.692798] mlx4_en: p3p1: Link Down
[45739.792755] mlx4_en: p3p1: Link Up
[90178.876818] mlx4_en: p3p1: Close port called
[90178.911964] mlx4_en: p3p1: Link Down
[90184.441127] mlx4_en: p3p1: Steering Mode 1
[90186.788175] mlx4_en: p3p1: Link Up
[90191.455819] mlx4_en: p3p1: Link Down
[90199.811894] mlx4_en: p3p1: Link Up
[105148.790364] mlx4_en: p3p1: Link Down
[106626.351457] mlx4_en: p3p1: Link Up
[131280.746956] mlx4_en: p3p1: Link Down
[132468.233247] mlx4_en: p3p1: Link Up
[132468.288192] mlx4_en: p3p1: Link Down
[132468.388147] mlx4_en: p3p1: Link Up
[134150.960306] mlx4_en: p3p1: Link Down

And on dell610t:

Code:
uname -a
Linux dell610t.local 4.17.9-1.el7.elrepo.x86_64 #1 SMP Sun Jul 22 11:57:51 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux

mlxfwmanager
Querying Mellanox devices firmware ...

Device #1:
----------

  Device Type:      ConnectX2
  Part Number:      MNPA19_A1-A2
  Description:      ConnectX-2 Lx EN network interface card; single-port SFP+; PCIe2.0 5.0GT/s; mem-free; RoHS R6
  PSID:             MT_0F60110010
  PCI Device Name:  0000:03:00.0
  Port1 MAC:        0002c95254ca
  Port2 MAC:        0002c95254cb
  Versions:         Current        Available   
     FW             2.9.1000       2.9.1000   
     PXE            3.3.0400       3.3.0400   

  Status:           Up to date

ethtool p5p1
Settings for p5p1:
        Supported ports: [ FIBRE ]
        Supported link modes:   10000baseT/Full
        Supported pause frame use: No
        Supports auto-negotiation: No
        Supported FEC modes: Not reported
        Advertised link modes:  10000baseT/Full
        Advertised pause frame use: No
        Advertised auto-negotiation: No
        Advertised FEC modes: Not reported
        Speed: 10000Mb/s
        Duplex: Full
        Port: FIBRE
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: off
        Supports Wake-on: d
        Wake-on: d
        Current message level: 0x00000014 (20)
                               link ifdown
        Link detected: yes

[root@dell610t ~]# dmesg | grep mlx4
[    5.714617] mlx4_core: Mellanox ConnectX core driver v4.0-0
[    5.714628] mlx4_core: Initializing 0000:03:00.0
[    8.171325] mlx4_core 0000:03:00.0: 16.000 Gb/s available PCIe bandwidth, limited by 5 GT/s x4 link at 0000:00:03.0 (capable of 63.008 Gb/s with 8 GT/s x8 link)
[    8.252041] mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0
[    8.252372] mlx4_en 0000:03:00.0: Activating port:1
[    8.253435] mlx4_en: 0000:03:00.0: Port 1: enabling only PFC DCB ops
[    8.254477] mlx4_en: 0000:03:00.0: Port 1: Using 12 TX rings
[    8.254478] mlx4_en: 0000:03:00.0: Port 1: Using 8 RX rings
[    8.254652] mlx4_en: 0000:03:00.0: Port 1: Initializing port
[    8.277082] <mlx4_ib> mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v4.0-0
[    8.277590] <mlx4_ib> mlx4_ib_add: counter index 1 for port 1 allocated 1
[    8.385610] mlx4_core 0000:03:00.0 p5p1: renamed from eth0
[   10.545695] mlx4_en: p5p1: Link Up
[   17.261120] mlx4_en: p5p1: Steering Mode 1
 

JustinClift

Member
Oct 5, 2014
As a thought, this sounds like a problem I've seen over the last few months with some other Mellanox cards (ConnectX-1 and ConnectX-2 series) on CentOS. After a certain kernel version, some cards would only be seen in InfiniBand (IB) mode, and the setting to have them come up in Ethernet mode was completely ignored.

The workaround I found was to edit the modprobe line used for adding the Mellanox kernel module to the system. I needed to add "port_type_array=2" in the appropriate spot (as shown below), then run "dracut --force" to rebuild the system initrd:

Code:
$ cat /usr/lib/modprobe.d/libmlx4.conf
# WARNING! - This file is overwritten any time the rdma rpm package is
# updated.  Please do not make any changes to this file.  Instead, make
# changes to the mlx4.conf file.  It's contents are preserved if they
# have been changed from the default values.
install mlx4_core /sbin/modprobe --ignore-install mlx4_core port_type_array=2 $CMDLINE_OPTS && (if [ -f /usr/libexec/mlx4-setup.sh -a -f /etc/rdma/mlx4.conf ]; then /usr/libexec/mlx4-setup.sh < /etc/rdma/mlx4.conf; fi; /sbin/modprobe mlx4_en; if /sbin/modinfo mlx4_ib > /dev/null 2>&1; then /sbin/modprobe mlx4_ib; fi)
That's with the standard CentOS 7.x drivers, not the ones Mellanox wants people to download and install manually.
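After rebuilding the initrd and rebooting, you can check which mode the driver actually picked via sysfs. A rough sketch (the PCI address is the one from your lspci output; I'm assuming the mlx4_port1 attribute is exposed the same way it is on my boxes):

Code:
$ sudo dracut --force
$ cat /sys/bus/pci/devices/0000:03:00.0/mlx4_port1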

It works for me, though my cards are all VPI ones instead of the EN models. Something similar *might* work for you too. The key to tracking down the problem is to (after booting) manually remove the Mellanox kernel modules:

Code:
$ sudo rmmod mlx4_ib mlx4_en mlx4_core
Then load each of the modules in order, experimenting with the module options that seem to make sense:

Code:
$ sudo modprobe mlx4_core port_type_array=2
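Then the Ethernet part on top of it, plus the IB module if you want it, watching dmesg as you go (standard module names, nothing exotic):

Code:
$ sudo modprobe mlx4_en
$ sudo modprobe mlx4_ib
$ dmesg | tail -n 20    # look for the "Activating port" / "Link Up" lines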
To find out the options available for a kernel module, use the "modinfo" command and look for the "parm" lines:

Code:
$ modinfo mlx4_core | grep parm
parm:           debug_level:Enable debug tracing if > 0 (int)
parm:           msi_x:attempt to use MSI-X if nonzero (int)
parm:           num_vfs:enable #num_vfs functions if num_vfs > 0
parm:           probe_vf:number of vfs to probe by pf driver (num_vfs > 0)
parm:           log_num_mgm_entry_size:log mgm size, that defines the num of qp per mcg, for example: 10 gives 248.range: 7 <= log_num_mgm_entry_size <= 12. To activate device managed flow steering when available, set to -1 (int)
parm:           enable_64b_cqe_eqe:Enable 64 byte CQEs/EQEs when the FW supports this (default: True) (bool)
parm:           enable_4k_uar:Enable using 4K UAR. Should not be enabled if have VFs which do not support 4K UARs (default: false) (bool)
parm:           log_num_mac:Log2 max number of MACs per ETH port (1-7) (int)
parm:           log_num_vlan:Log2 max number of VLANs per ETH port (0-7) (int)
parm:           use_prio:Enable steering by VLAN priority on ETH ports (deprecated) (bool)
parm:           log_mtts_per_seg:Log2 number of MTT entries per segment (1-7) (int)
parm:           port_type_array:Array of port types: HW_DEFAULT (0) is default 1 for IB, 2 for Ethernet (array of int)
parm:           enable_qos:Enable Enhanced QoS support (default: off) (bool)
parm:           internal_err_reset:Reset device on internal errors if non-zero (default 1) (int)
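Once you've found option values that work, they can also be made persistent with a plain options line instead of editing the install line; a minimal sketch (the file name under /etc/modprobe.d/ is arbitrary):

Code:
$ cat /etc/modprobe.d/mlx4_core.conf
options mlx4_core port_type_array=2
$ sudo dracut --force    # so the setting is also applied from the initrd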
Hope that helps. :)
 

netblues

New Member
Jul 27, 2018
I see what you mean; however, this doesn't seem logical, since the same CentOS version works happily on a similar box.

By closely examining dmesg, I noticed this:

Non-working card:
Code:
mlx4_core 0000:03:00.0: 16.000 Gb/s available PCIe bandwidth, limited by 5 GT/s x4 link at 0000:00:03.0 (capable of 63.008 Gb/s with 8 GT/s x8 link)
Working card:
Code:
mlx4_core 0000:07:00.0: 32.000 Gb/s available PCIe bandwidth, limited by 5 GT/s x8 link at 0000:00:09.0 (capable of 63.008 Gb/s with 8 GT/s x8 link)
In other words, the card that fails on the newer kernel is seated in an x4 (Gen2) slot...
I tried moving the card to an x8 slot (like the one in the working machine).
To my surprise, the system failed to complete the BIOS boot and halted, complaining "PCIe training error slot 2, System Halted".
I tried moving the card to other slots.
The problem followed: a PCIe training error in every x8 slot.
It wasn't practical to try the card from the other system (seated in an x8 slot) here,
but I suspect it would work and that I'm facing a faulty card rather than a software issue...
At some point the BIOS managed to boot with the card attached, but the card was nowhere to be found, even at the lspci level.
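For what it's worth, the negotiated link speed/width can also be read straight from lspci, which is handy when juggling slots (just the standard verbose output for the card's PCI address):

Code:
# LnkCap is what the card supports, LnkSta is what the slot actually negotiated
lspci -vv -s 03:00.0 | grep -E 'LnkCap|LnkSta'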

The next best thing is exchanging the two cards.

I'll keep you posted on the findings.

Regards
 

JustinClift

Member
Oct 5, 2014
I see what you mean; however, this doesn't seem logical, since the same CentOS version works happily on a similar box.
Yeah, that's one of the weird things with Mellanox cards recently, at least on CentOS.

I have several systems here with Mellanox ConnectX cards, series 1, 2, and 3. Some cards will work fine in some boxes, yet in other boxes (same kernel version) the exact same cards will just not switch into eth mode, even when told to in the official spot (/etc/rdma/rdma.conf). Those need the manual edit above to get working.
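(For completeness, the per-card setting that the install line above feeds into mlx4-setup.sh lives in /etc/rdma/mlx4.conf; going from memory, the format is one line per device, something like the sketch below, so double-check against the comments in your own copy of the file:)

Code:
$ cat /etc/rdma/mlx4.conf
# <pci device>  <port1 type>  [port2 type]   (ib, eth or auto)
0000:07:00.0 eth eth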

It was frustrating to get this figured out at first, but now that I know, it's pretty simple to keep working for a small group of machines. It could be a problem with a large roll-out of boxes though. ;)
 

netblues

New Member
Jul 27, 2018
Well, I have played with the options mentioned above.
It doesn't work even in IB mode; it complains about a hardware semaphore.
I also replaced the card, and the issue remains. It only works in x4 slots, and even there it is erratic about when it works. It seems to be a motherboard issue.
The same card on another machine, with the same CentOS kernel, works fine...
On the other hand, everything else on the server (a Dell T610) works fine.
I don't have any means to investigate this any further. I will have an Intel card to test with soon.