Infiniband problems


Oni.kage

New Member
Nov 8, 2013
Boston, MA
I'm attempting to set up an InfiniBand connection between two MHQH19B-XTR adapters, and I'm not having any luck getting the link to come up. I suspect that some piece of hardware is bad, but I'm not sure how to tell whether it's the cable or one or both of the adapters.

I have the Mellanox OFED package installed on both the server (Ubuntu 12.04) and on the client (Windows 8) and connected with a Mellanox QSFP/QSFP optical cable.

On both the server and client, ibstat just shows the state as being "Down" and the physical state as "Polling." There doesn't appear to be any evidence that the cards can see each other. I have the included subnet manager running on the Ubuntu side. The opensm logs say "SM port is down."

Can anyone provide me with some guidance on how to diagnose this? Unfortunately I don't have any way to test this cable.
 

dba

Moderator
Feb 20, 2012
San Francisco Bay Area, California, USA
My first hypothesis is that the subnet manager isn't really running, or is bound to the wrong port. Try using the other port on the Linux box, and then, if that doesn't work, try running the subnet manager on the Windows box instead. The subnet manager is installed on Windows as a service by default and just needs to be started.
 

RimBlock

Active Member
Sep 18, 2011
Singapore
I would also check the inside of the ports on the cards.

I have come across a card where the socket inside the port was damaged. Some cards / plugs seem very tight and can be forced too hard, which damages the socket. The plug should push in and then click at the very end. At worst you should be able to feel the last little connection as it locks in. The plugs can go in the socket upside down (they are not keyed); it is only the last little bit that makes the connection or not.

Other than that, I fully agree with dba that the subnet manager may not be running correctly or may be bound to the wrong port.

Usually the link will only come up (from polling to active) when a subnet manager is detected on the IB network. This is true, from what I have seen, even with a single node connected to a switch.

So I would check:
Sockets: for internal damage.
Cables: for that last 'click' to indicate a good lock in the socket.
Subnet Manager: To confirm it is on the right port.
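
For the subnet manager item, here is a minimal sketch of how the "right port" check could be scripted on the Linux side. It assumes /etc/default/opensm holds a single PORTS="<guid>" line and that ibstat prints a "Port GUID:" line for each port; `sm_guid_matches` is a hypothetical helper name, not something that ships with OFED.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the GUID configured for opensm
# matches one of the "Port GUID:" lines read from stdin.
#   $1 = path to the opensm defaults file (assumed: one PORTS="<guid>" line)
sm_guid_matches() {
    # Pull the value of the first PORTS= line and drop the quotes.
    conf_guid=$(grep '^PORTS=' "$1" | head -n 1 | cut -d= -f2 | tr -d '"')

    # No PORTS line (or an empty value): nothing to compare against.
    [ -n "$conf_guid" ] || return 2

    # Case-insensitive, fixed-string match against the Port GUID lines.
    grep -i 'Port GUID:' | grep -qiF -- "$conf_guid"
}
```

It would be used as `ibstat | sm_guid_matches /etc/default/opensm && echo "SM is bound to a local port"`. A match only shows the configuration points at a port on this machine; it says nothing about whether that port has a working physical link.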

From memory, there are commands such as ibdiagnet that you can use to get information on the state of the adapters and the connection. Have a search on the OFA or Mellanox sites. Windows has them as well as Linux. In Win 7 they are run from a CMD prompt; I'm not sure about Win 8, though.

RB
 

Oni.kage

New Member
Nov 8, 2013
Boston, MA
Thanks dba.

I'm pretty confident that the subnet manager is running properly on the server:

root@ubuntu:~# cat /etc/default/opensm | grep PORTS
PORTS="0x0002c903004aa20f"
root@ubuntu:~# service opensmd start
Already started
root@ubuntu:~# ibstat
CA 'mlx4_0'
    CA type: MT26428
    Number of ports: 1
    Firmware version: 2.9.1000
    Hardware version: b0
    Node GUID: 0x0002c903004aa20e
    System image GUID: 0x0002c903004aa211
    Port 1:
        State: Down
        Physical state: Polling
        Rate: 10
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x0251086a
        Port GUID: 0x0002c903004aa20f
        Link layer: InfiniBand
root@ubuntu:~#

Starting it on the Windows side gives me the same results. State stays down and openSM just repeats "SM port is down."
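
For anyone reading these states later, here is a rough sketch of how the State / Physical state pair could be turned into a hint. It assumes ibstat-style text on stdin, with "State:" printed before "Physical state:" as in the output above; `classify_ports` and the hint wording are made up for illustration. The idea is that a physical state of Polling means the port is not seeing a link partner at all, while LinkUp with State Initializing means the physical link is fine and the port is waiting for a subnet manager.

```shell
#!/bin/sh
# Hypothetical helper: print one summary line per port from ibstat-style
# text on stdin. Not part of OFED; the hint strings are illustrative only.
classify_ports() {
    awk -F': *' '
        # Strip indentation and trailing whitespace so indented and
        # flattened output are handled the same way.
        { sub(/^[ \t]+/, ""); sub(/[ \t\r]+$/, "") }

        # "Port 1:" starts a new port section.
        /^Port [0-9]+:$/ { port = $1 }

        # Logical state comes first in the output, so remember it.
        $1 == "State" { state = $2 }

        # The physical state completes the pair; emit the summary here.
        $1 == "Physical state" {
            phys = $2
            if (phys == "Polling")
                hint = "no link partner detected; check cable, seating and far-end port"
            else if (phys == "LinkUp" && state == "Initializing")
                hint = "physical link is up; waiting for a subnet manager"
            else if (phys == "LinkUp" && state == "Active")
                hint = "link is up and configured"
            else
                hint = "unclassified; inspect the full output"
            printf "%s: %s/%s -> %s\n", port, state, phys, hint
        }
    '
}
```

Run as `ibstat | classify_ports`. With both ends stuck in Down/Polling, as here, the sketch points at the cable or the ports rather than at the subnet manager.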
 

Oni.kage

New Member
Nov 8, 2013
Boston, MA
It was a bad cable after all. I sent it back and got a replacement; plugged it in and the link came right up.

So annoying, but it works now. :)