I have this "simple" setup which is driving me crazy because it no longer works.
Node 1 "STORAGE"
OS: Ubuntu 14.04 (server)
IB: Mellanox ConnectX 2
Software installed: Mellanox OFED 2.3-2.0.0 (ubuntu14.04-x86_64)
Node 2 "HYPERV1"
OS: ESXi 5.5
IB: Mellanox ConnectX 2
Software installed: MLNX-OFED-ESX-1.8.2.0, mlx4_en-mlnx-1.6.1.2-471530, ib-opensm-3.3.16-64.x86_64
No IB Switch, the nodes are directly connected.
THE PROBLEM
The fisrt day everything worked well, I was able to export a NFS share from the storage node to the esxi and install a VM on it with a great performance.
The second day I had to reboot the storage node and after that I was no more able to make the connection work. No ibping, no ping.
Here are some details
STORAGE NODE
ESXi NODE
Additional info:
- opensm is not started on the storage node because it is running on the esxi one.
- the cards have both the last fw
- I was not able to use ibping both with -G and -L parameter
Node 1 "STORAGE"
OS: Ubuntu 14.04 (server)
IB: Mellanox ConnectX 2
Software installed: Mellanox OFED 2.3-2.0.0 (ubuntu14.04-x86_64)
Node 2 "HYPERV1"
OS: ESXi 5.5
IB: Mellanox ConnectX 2
Software installed: MLNX-OFED-ESX-1.8.2.0, mlx4_en-mlnx-1.6.1.2-471530, ib-opensm-3.3.16-64.x86_64
No IB Switch, the nodes are directly connected.
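(Back-to-back with no switch means one of the two hosts has to run the subnet manager, which is why opensm lives on the ESXi box. A quick sanity check from the storage side, assuming the infiniband-diags tools that ship with the OFED stack:)
Code:
root@storage:~# sminfo    # should report the SM on the ESXi node (lid 1) in state MASTER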
THE PROBLEM
The first day everything worked well: I was able to export an NFS share from the storage node to the ESXi host and install a VM on it with great performance.
The second day I had to reboot the storage node, and after that I could no longer make the connection work. No ibping, no ping.
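Since it broke right after a reboot, the first thing I can try is reloading the IB stack on the storage node and watching the kernel log; a sketch, assuming the stock MLNX_OFED init script:
Code:
root@storage:~# /etc/init.d/openibd restart   # reloads mlx4_core/mlx4_ib/ib_ipoib
root@storage:~# dmesg | tail -n 50            # look for mlx4/ipoib errors after the reload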
Here are some details:
STORAGE NODE
Code:
root@storage:~# ibv_devinfo
hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.9.1000
        node_guid:                      0002:c903:000d:1c08
        sys_image_guid:                 0002:c903:000d:1c0b
        vendor_id:                      0x02c9
        vendor_part_id:                 26428
        hw_ver:                         0xB0
        board_id:                       MT_0D81120009
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_DOWN (1)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             InfiniBand

                port:   2
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               2
                        port_lmc:               0x00
                        link_layer:             InfiniBand
Code:
root@storage:~# hca_self_test.ofed
---- Performing Adapter Device Self Test ----
Number of CAs Detected ................. 1
PCI Device Check ....................... PASS
Kernel Arch ............................ x86_64
Host Driver Version .................... MLNX_OFED_LINUX-2.3-2.0.0: 3.13.0-32-generic
Host Driver RPM Check .................. PASS
Firmware on CA #0 HCA .................. v2.9.1000
Firmware Check on CA #0 (HCA) .......... NA
REASON: NO required fw version
Host Driver Initialization ............. PASS
Number of CA Ports Active .............. 1
Port State of Port #1 on CA #0 (HCA)..... DOWN (InfiniBand)
Port State of Port #2 on CA #0 (HCA)..... UP 4X QDR (InfiniBand)
Error Counter Check on CA #0 (HCA)...... PASS
Kernel Syslog Check .................... PASS
Node GUID on CA #0 (HCA) ............... 00:02:c9:03:00:0d:1c:08
------------------ DONE ---------------------
Code:
root@storage:~# ibstat
CA 'mlx4_0'
        CA type: MT26428
        Number of ports: 2
        Firmware version: 2.9.1000
        Hardware version: b0
        Node GUID: 0x0002c903000d1c08
        System image GUID: 0x0002c903000d1c0b
        Port 1:
                State: Down
                Physical state: Polling
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x02510868
                Port GUID: 0x0002c903000d1c09
                Link layer: InfiniBand
        Port 2:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 2
                LMC: 0
                SM lid: 1
                Capability mask: 0x02510868
                Port GUID: 0x0002c903000d1c0a
                Link layer: InfiniBand
Code:
root@storage:~# ibstatus
Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c903:000d:1c09
        base lid:        0x0
        sm lid:          0x0
        state:           1: DOWN
        phys state:      2: Polling
        rate:            10 Gb/sec (4X)
        link_layer:      InfiniBand

Infiniband device 'mlx4_0' port 2 status:
        default gid:     fe80:0000:0000:0000:0002:c903:000d:1c0a
        base lid:        0x2
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)
        link_layer:      InfiniBand

root@storage:~#
Code:
root@storage:~# ibhosts
Ca : 0x0002c903000d246c ports 2 "hyperv1.home.lan HCA-1"
Ca : 0x0002c903000d1c08 ports 2 "storage HCA-1"
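ibhosts sees both HCAs, so the IB fabric itself looks healthy and the break is probably at the IPoIB layer. One mismatch worth checking: the ESXi IPoIB driver only does datagram mode (hence the 2044 MTU), so the Linux side has to match. A check sketch, using the standard ipoib sysfs entries:
Code:
root@storage:~# cat /sys/class/net/ib1/mode   # should print "datagram", not "connected"
root@storage:~# cat /sys/class/net/ib1/mtu    # should print 2044 to match the vSwitch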
Code:
root@storage:~# ifconfig ib1
ib1       Link encap:UNSPEC  HWaddr A0-00-02-20-FE-80-00-00-00-00-00-00-00-00-00-00
          inet addr:10.0.1.11  Bcast:10.0.1.255  Mask:255.255.255.0
          inet6 addr: fe80::202:c903:d:1c0a/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
          RX packets:48 errors:0 dropped:15 overruns:0 frame:0
          TX packets:41 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1024
          RX bytes:2954 (2.9 KB)  TX bytes:3288 (3.2 KB)
Code:
root@storage:~# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         ibox.home.lan   0.0.0.0         UG    0      0        0 eth0
10.0.1.0        *               255.255.255.0   U     0      0        0 ib0
10.0.1.0        *               255.255.255.0   U     0      0        0 ib1
192.168.1.0     *               255.255.255.0   U     0      0        0 eth0
Code:
root@storage:~# ping 10.0.1.21
PING 10.0.1.21 (10.0.1.21) 56(84) bytes of data.
From 10.0.1.10 icmp_seq=1 Destination Host Unreachable
From 10.0.1.10 icmp_seq=2 Destination Host Unreachable
From 10.0.1.10 icmp_seq=3 Destination Host Unreachable
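One thing that stands out: the routing table above has two identical routes for 10.0.1.0/24, and the ping errors come from 10.0.1.10, i.e. the kernel is sourcing traffic from ib0, which sits on the DOWN port 1. A test to rule that out (assuming ib0 is the unused port and 10.0.1.10 is its address):
Code:
root@storage:~# ping -I ib1 10.0.1.21              # force the ping out of the active ib1
root@storage:~# ip route del 10.0.1.0/24 dev ib0   # drop the duplicate route over the down port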
ESXi NODE
Code:
~ # /opt/opensm/bin/ibstat
CA 'mlx4_0'
        CA type: MT26428
        Number of ports: 2
        Firmware version: 2.9.1000
        Hardware version: b0
        Node GUID: 0x0002c903000d246c
        System image GUID: 0x0002c903000d246f
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 1
                LMC: 0
                SM lid: 1
                Capability mask: 0x0251086a
                Port GUID: 0x0002c903000d246d
                Link layer: InfiniBand
        Port 2:
                State: Down
                Physical state: Polling
                Rate: 68
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x0251086a
                Port GUID: 0x0002c903000d246e
                Link layer: InfiniBand
~ #
Code:
~ # esxcli network ip interface ipv4 get
Name  IPv4 Address  IPv4 Netmask   IPv4 Broadcast  Address Type  DHCP DNS
----  ------------  -------------  --------------  ------------  --------
vmk0  192.168.1.10  255.255.255.0  192.168.1.255   STATIC        false
vmk1  10.0.1.21     255.255.255.0  10.0.1.255      STATIC        false
~ #
Code:
~ # esxcli network nic list
Name       PCI Device     Driver  Link  Speed  Duplex  MAC Address        MTU   Description
---------  -------------  ------  ----  -----  ------  -----------------  ----  -----------
vmnic0     0000:003:00.0  e1000e  Up    1000   Full    00:25:90:06:ca:16  1500  Intel Corporation 82574L Gigabit Network Connection
vmnic1     0000:004:00.0  e1000e  Down  0      Half    00:25:90:06:ca:17  1500  Intel Corporation 82574L Gigabit Network Connection
vmnic2     0000:007:00.0  e1000e  Up    1000   Full    00:15:17:d6:c6:26  1500  Intel Corporation 82571EB Gigabit Ethernet Controller
vmnic3     0000:007:00.1  e1000e  Up    1000   Full    00:15:17:d6:c6:27  1500  Intel Corporation 82571EB Gigabit Ethernet Controller
vmnic_ib0  0000:005:00.0          Up    40000  Full    00:02:c9:0d:24:6d  2044  Mellanox Technologies MT26428 [ConnectX VPI - 10GigE / IB QDR, PCIe 2.0 5GT/s]
vmnic_ib1  0000:005:00.0          Down  0      Half    00:02:c9:0d:24:6e  1500  Mellanox Technologies MT26428 [ConnectX VPI - 10GigE / IB QDR, PCIe 2.0 5GT/s]
~ #
Code:
~ # esxcli network vswitch standard list
vSwitch0
   Name: vSwitch0
   Class: etherswitch
   Num Ports: 1792
   Used Ports: 6
   Configured Ports: 128
   MTU: 1500
   CDP Status: listen
   Beacon Enabled: false
   Beacon Interval: 1
   Beacon Threshold: 3
   Beacon Required By:
   Uplinks: vmnic3, vmnic2
   Portgroups: VM Network, Management Network

vSwitch1
   Name: vSwitch1
   Class: etherswitch
   Num Ports: 1792
   Used Ports: 4
   Configured Ports: 128
   MTU: 2044
   CDP Status: listen
   Beacon Enabled: false
   Beacon Interval: 1
   Beacon Threshold: 3
   Beacon Required By:
   Uplinks: vmnic_ib0
   Portgroups: ib
~ #
Code:
~ # ping 10.0.1.11
PING 10.0.1.11 (10.0.1.11): 56 data bytes
--- 10.0.1.11 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss
~ #
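Same test from the ESXi side, pinned to the IB vmkernel interface so the packet cannot leak out via vmk0:
Code:
~ # vmkping -I vmk1 10.0.1.11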
Additional info:
- opensm is not started on the storage node because it is already running on the ESXi one.
- both cards are on the latest firmware
- I was not able to use ibping with either the -G or the -L parameter (exact invocations below)
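For completeness, this is the ibping sequence I mean; ibping needs a responder on the far end, so if the ESXi bundle ships ibping (I am not sure it does), the test would look like this, using hyperv1's lid 1 and port GUID from the ibstat output above:
Code:
# on hyperv1 (responder) - path assumed, alongside the other /opt/opensm/bin tools:
/opt/opensm/bin/ibping -S
# on storage (client), by LID or by port GUID:
root@storage:~# ibping -L 1
root@storage:~# ibping -G 0x0002c903000d246d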