6.7U1 vSAN doesn't seem to work with ConnectX-3


justinm001

New Member
Apr 11, 2018
I've been trying to upgrade 6.7 to U1 and am unable to get the vSAN service added to the VMkernel; I just get a generic vSAN error when trying. Even worse, I can get it all set up over 1G copper, but when I add the drivers for the InfiniBand NIC and bring it online, it removes the vSAN service from the 1G link.

Been pulling my hair out on this and have tried everything, including a new ESXi install and a new vSphere with a new cluster. If I run nested ESXi 6.7U1 it works fine, but I'm guessing that's because it sees a vmxnet3 NIC and not a ConnectX-3.

Now to try drivers other than the MLNX-OFED-ESX-1.8.2.5-10EM-600.0.0.2494585 package, which has been rock solid.
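
For reference, the VMkernel vSAN tagging can also be checked and set from the shell; a quick sketch, with vmk1 as a placeholder for whichever port should carry vSAN traffic:
Code:
# list VMkernel interfaces currently tagged for vSAN traffic
esxcli vsan network list
# tag a VMkernel port for vSAN traffic (vmk1 is a placeholder)
esxcli vsan network ip add -i vmk1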
 

Rand__

Well-Known Member
Mar 6, 2014
I have four vSAN boxes (all CX4), and one of the four has not been working since the upgrade, showing some RDMA errors / unresolved symbols.

In my case I can't get out of maintenance mode. It looked like an MLX error, but as far as I can tell it isn't: I removed all the MLX modules and it still didn't work; I still got the rdt error.

Code:
2018-11-06T12:34:59.326Z cpu15:2098312)Elf: 2101: module rdt has license VMware
2018-11-06T12:34:59.329Z cpu15:2098312)WARNING: Elf: 3144: Kernel based module load of rdt failed: Unresolved symbol <ElfRelocateFile failed>
2018-11-06T12:35:00.047Z cpu4:2098312)Loading module rdt ...


2018-11-06T12:34:25.396Z cpu4:2097620)<NMLX_ERR> nmlx5_core: nmlx5_core_DeviceGetParamMaxVfs - (vmkdrivers/native/BSD/Network/mlnx/nmlx5/nmlx5_core/nmlx5_core_main.c:317) vmk_DeviceGetParamMaxVfs failed: Not initialized, continuing with no VFs
2018-11-06T12:34:26.176Z cpu12:2097620)<NMLX_ERR> nmlx5_core: nmlx5_core_DeviceGetParamMaxVfs - (vmkdrivers/native/BSD/Network/mlnx/nmlx5/nmlx5_core/nmlx5_core_main.c:317) vmk_DeviceGetParamMaxVfs failed: Not initialized, continuing with no VFs
2018-11-06T12:34:26.663Z cpu10:2097770)WARNING: Elf: 1741: Relocation of symbol <vmk_RDMACapRegister> failed: Unresolved symbol
2018-11-06T12:34:26.663Z cpu10:2097770)WARNING: Elf: 1741: Relocation of symbol <vmk_RDMACapRegister> failed: Unresolved symbol
2018-11-06T12:34:26.663Z cpu10:2097770)WARNING: Elf: 1741: Relocation of symbol <vmk_RDMACapRegister> failed: Unresolved symbol
2018-11-06T12:34:26.663Z cpu10:2097770)WARNING: Elf: 1741: Relocation of symbol <vmk_RDMACapRegister> failed: Unresolved symbol
2018-11-06T12:34:26.663Z cpu10:2097770)WARNING: Elf: 1741: Relocation of symbol <vmk_RDMAModifyQPArgsValid> failed: Unresolved symbol
I've already spent a couple of hours on this (trying to downgrade, reinstall, clean up old modules, etc.) but no luck. Not sure at this point what the underlying reason is, but it's not the MLX card; I have similar cards in my other boxes.
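
In case it helps anyone comparing notes, the module state can be checked from the shell to see what is actually loaded versus what the log says failed:
Code:
# show which of the vSAN/Mellanox-related modules are registered and loaded
esxcli system module list | grep -E 'rdt|vsan|nmlx'
# list currently loaded vmkernel modules and look for rdt
vmkload_mod -l | grep rdt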
 

justinm001

New Member
Apr 11, 2018
Does your VMkernel show vSAN enabled? My bet is it doesn't, which is why you can't exit maintenance mode. I had the same problem. I think a couple of times it showed enabled, but once I removed it I couldn't add it back.
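
(From the shell, the check is just this; if the list comes back empty, the vSAN tagging is gone.)
Code:
esxcli vsan network list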
 

Rand__

Well-Known Member
Mar 6, 2014
Actually, I moved the box out of the vSAN cluster and still can't get out of maintenance mode; I can't even leave the vSAN cluster ;)

Code:
esxcli vsan cluster leave
Failed to leave the host from vSAN cluster. The command should be retried: Unable to load module /usr/lib/vmware/vmkmod/cmmds: Invalid or missing namespace

...
Code:
Load of <vsanutil> failed : missing required namespace <com.vmware.rdt#0.0.0.1>
2018-11-07T21:45:36.677Z cpu2:2165150)WARNING: Elf: 3144: Kernel based module load of vsanutil failed: Invalid or missing namespace <ElfSetNamespaceInfo failed>
2018-11-07T21:45:37.418Z cpu2:2165150)WARNING: Elf: 2277: Load of <cmmds> failed : missing required namespace <com.vmware.vsanutil#0.0.0.1>
2018-11-07T21:45:37.418Z cpu2:2165150)WARNING: Elf: 3144: Kernel based module load of cmmds failed: Invalid or missing namespace <ElfSetNamespaceInfo failed>
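
The namespace errors in that log imply a load chain of rdt -> vsanutil -> cmmds, so loading them by hand in that order should show which link breaks first; a rough sketch, with module names taken straight from the log:
Code:
# load the chain manually, bottom-up, to find the first failure
vmkload_mod rdt
vmkload_mod vsanutil
vmkload_mod cmmds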
 

justinm001

New Member
Apr 11, 2018
That's exactly what I was getting. Now, with a fresh install, I'm unable to add the host to the vSAN cluster. Everything else is good; it's just vSAN that doesn't work.

Also, I'm not even able to add the vSAN service to a 1G Ethernet NIC unless I uninstall the Mellanox InfiniBand drivers.
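
For reference, removing the driver package from the shell looks roughly like this; the exact VIB names depend on which OFED bundle is installed, so list first (net-mlx4-core below is only an example name):
Code:
# list installed Mellanox-related VIBs
esxcli software vib list | grep -iE 'mlx|mlnx'
# remove one by name (example name; use whatever the list shows), then reboot
esxcli software vib remove -n net-mlx4-core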
 

justinm001

New Member
Apr 11, 2018
I give up and am rolling back to 6.7. I'm unable to use MST in 6.7U1, as I get an "-E- nmlx core must be loaded before starting mst." error even though it is loaded.
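
(In case anyone else hits the same MST error: checking whether the native module is really loaded, and loading it by hand, goes roughly like this. The /opt/mellanox path is where the MFT bundle usually lands on ESXi, but that may differ per version.)
Code:
# confirm the native Mellanox core module is loaded
esxcli system module list | grep nmlx
# load it manually if it is not (nmlx4_core drives ConnectX-3)
esxcli system module load -m nmlx4_core
# then retry MST (path is an assumption; adjust to your MFT install)
/opt/mellanox/bin/mst status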
 

Rand__

Well-Known Member
Mar 6, 2014
Still not sure what might cause this.
As I said, only one of my four boxes has it; two are server boards and two are workstation boards (albeit from different vendors), and the affected one is an X10SRA.
What is your box based on?
 

justinm001

New Member
Apr 11, 2018
All are HP DL580 G7s with CX354A cards in them. It looks like when ESXi recognizes the network driver, it shuts down the vSAN services on the server, which prevents vSAN from doing anything and thus blocks the server from exiting maintenance mode. I'm sure it's just some setting, option, or oversight, since these cards aren't officially on the HCL.

Even with a fresh install it works fine until I add the Mellanox drivers; then it dies, even on a fresh vSAN cluster. Also, just a note: I can run 6.7U1 on the server and 6.7U1 nested, and the nested instance works fine with vSAN since it has no ConnectX drivers.
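
For what it's worth, which driver actually claimed the NIC is visible from the shell; vmnic4 below is just a placeholder for whichever uplink the CX354A shows up as:
Code:
# list all uplinks with the driver that claimed each one
esxcli network nic list
# driver/firmware details for a single uplink (placeholder name)
esxcli network nic get -n vmnic4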
 

Rand__

Well-Known Member
Mar 6, 2014
As I said, I don't have CX3s in my boxes, but that particular box had some old MLX drivers installed. Maybe that broke something in the upgrade, but removing them has not helped.
 

Rand__

Well-Known Member
Mar 6, 2014
So I tried rolling back to 6.7, but since I had reinstalled U1 several times there was no non-U1 environment left...

I then restored to defaults and the box is up and running fine. Of course I have to rebuild it, and of course Host Profiles is not working for me (it never does), but at least this shows it's a config glitch and not a hardware issue...
 

markpower28

Active Member
Apr 9, 2013
With 6.5/6.7, IB/OFED support is gone; Ethernet is the only way forward with Mellanox and VMware. iSER is the only option for RDMA with vSphere, and there's no more SRP...
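
(For anyone going the iSER route on 6.7: the initiator has to be enabled per host before the adapter shows up. A minimal sketch, assuming an RDMA-capable uplink is already present:)
Code:
# enable the software iSER initiator adapter
esxcli rdma iser add
# confirm RDMA-capable devices are visible
esxcli rdma device list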