Infiniband 3-node Ring with CentOS 6.5

stephanellis

New Member
Nov 13, 2013
3
0
0
All,

I'm just getting into infiniband for use with gluster and I have to say, I'm impressed with the performance/price. I would like to know if someone (forum user dba maybe) can give me a high level overview of what it would take to build a 3 node ring. This topology is mentioned at this post:

http://forums.servethehome.com/great-deals/1750-mellanox-connectx-2-dual-port-qdr-adapter-lp-bracket-$125-shipped.html

I am sure that several pieces need to be configured (opensm, bonding, etc.), but I'm not totally sure which.

Ultimately I'd like to have three nodes on the same IP subnet without a switch. Can anyone guide me in the right direction?

Thanks!
 

dba

Moderator
Feb 20, 2012
1,478
183
63
San Francisco Bay Area, California, USA
It's fairly easy. You need a dual-port card in each node plus three cables. Connect each node to two others to form your ring. Wiring done.

On the software side, you need your drivers as usual, plus you need two copies of the subnet manager software on each node, one bound to each IB interface. The details on how to do this vary by operating system, of course. I use Solaris and Windows with IB, so I'm of no help with CentOS.

You mentioned an IP subnet, so I'll assume that you are going to use IPoIB. If so, your last step will be to configure IP addresses, gateways, etc. as if you were wiring up an isolated Ethernet network. You might, for example, assign IP addresses like those below. Note that I am assuming your other Ethernet ports are on a different subnet. Let's imagine that your Gigabit ports are 192.168.1.* and that your new IPoIB subnet looks like this:

Node1:
IB1: 10.10.10.1
IB2: 10.10.10.2
Gateway: none

Node2:
IB1: 10.10.10.3
IB2: 10.10.10.4
Gateway: none

Node3:
IB1: 10.10.10.5
IB2: 10.10.10.6
Gateway: none

To access any node, you just use its IP address, making sure that you use the one address (of the two) to which that node is directly connected.
 

PigLover

Moderator
Jan 26, 2011
3,012
1,315
113
Not sure that IP addressing scheme works unless you also declare the interfaces to be bridged. It violates IP subnet rules (which are different from IB subnets). On each node, every IP interface needs to be on a separate IP subnet, and both ends of each link need to be on the same IP subnet.

More likely need this:

You can do it with smaller subnets, but the rationale for the numbering is clearest using a full Class C (/24) per link.

Everything subnet mask 255.255.255.0

Node 1:
IB1: 10.10.1.1 (connects to Node2 IB1)
IB2: 10.10.2.1 (connects to Node 3 IB1)

Node 2:
IB1: 10.10.1.2 (connects to Node1 IB1)
IB2: 10.10.3.2 (connects to Node3 IB2)

Node 3:
IB1: 10.10.2.3 (connects to Node1 IB2)
IB2: 10.10.3.3 (connects to Node 2 IB2)
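
A plan like this can be sanity-checked with a few lines of Python before touching any interface config. This is only a sketch using the standard `ipaddress` module; the `check_plan` helper, the port labels, and the two rules it encodes are illustrative assumptions taken from the description above, not part of any IB or IPoIB tooling.

```python
import ipaddress
from itertools import combinations

# Addressing plan copied from the list above; every mask is 255.255.255.0 (/24).
PLAN = {
    ("node1", "IB1"): "10.10.1.1/24",
    ("node1", "IB2"): "10.10.2.1/24",
    ("node2", "IB1"): "10.10.1.2/24",
    ("node2", "IB2"): "10.10.3.2/24",
    ("node3", "IB1"): "10.10.2.3/24",
    ("node3", "IB2"): "10.10.3.3/24",
}

# Cabling as listed above: each cable joins two ports.
CABLES = [
    (("node1", "IB1"), ("node2", "IB1")),
    (("node1", "IB2"), ("node3", "IB1")),
    (("node2", "IB2"), ("node3", "IB2")),
]


def check_plan(plan, cables):
    """Return a list of human-readable problems; an empty list means OK."""
    def subnet(port):
        return ipaddress.ip_interface(plan[port]).network

    problems = []
    # Rule 1: both ends of a cable must share one IP subnet.
    for a, b in cables:
        if subnet(a) != subnet(b):
            problems.append(f"{a} and {b} are cabled but in different subnets")
    # Rule 2: no node may have two ports in the same IP subnet.
    for a, b in combinations(plan, 2):
        if a[0] == b[0] and subnet(a) == subnet(b):
            problems.append(f"{a[0]} has two ports in {subnet(a)}")
    return problems


print(check_plan(PLAN, CABLES) or "plan is consistent")
```

Feeding the same function a flat plan where all six ports sit in one /24 (10.10.10.1 through 10.10.10.6) trips rule 2 once per node, which is the objection raised above.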
 

MiniKnight

Well-Known Member
Mar 30, 2012
3,014
922
113
NYC
got my popcorn out... sitting back and learning something that I wanted to know also
 

dba

Moderator
Feb 20, 2012
1,478
183
63
San Francisco Bay Area, California, USA
It works as shown in Windows Server, but I can see how using different subnets would be required in Linux, and is probably better overall.

 

PigLover

Moderator
Jan 26, 2011
3,012
1,315
113
Yeah, Windows plays pretty loose with IP. I think your approach works there because Windows ARPs on every interface in the subnet in order to find the peer's MAC. You only get a response on the interface directly connected to the node with that address and all is well.

But technically even that shouldn't work reliably because, in the presence of an ARP cache, future datagrams can be sent to the cached MAC from any interface on the same subnet. And if those future datagrams are sent on the "wrong" interface, they won't reach their destination (unless the other node forwards). I think this gets hidden for Windows SMB traffic because of SMB multipathing - it tries to set up separate flows on both interfaces and one "works" and the other is "silent". But a system without SMB multipath - like Linux using SMB/CIFS/NFS - will see intermittent failures, packet loss and delays.

Worse, since an IP subnet is supposed to define a single broadcast domain, there is no need to send the ARP on multiple interfaces, and doing so just creates unnecessary load on the LAN segment. Linux systems (and most routers) that have multiple interfaces on the same subnet will pick an interface and only send the ARP on one of them. If it picks the right one then all is well. If it picks the wrong one then it won't get a response - and after a timeout it will try again, perhaps using the other interface. This will appear to work, but in fact you have "first packet latencies" that will dog you if you are trying to do anything at all high performance.
 

stephanellis

New Member
Nov 13, 2013
3
0
0
Thanks for the replies. I do understand what's being said here, but I really need one IP subnet across multiple IB subnets, which I thought I could accomplish by bonding the IB ports together. I recently read some howtos from IBM and HP that mentioned IB bonding only does Active/Backup, which means that when bonded, only one port is useable. (please correct me if I'm wrong).

I am trying to use this ring as the interconnect for a GlusterFS volume. GlusterFS maintains a list of nodes (bricks), each of which must be reachable at a single IP address by all of the other nodes and clients. In my case, the brick nodes are also clients. I'm pretty handy with IP on Linux, so I think I could turn on forwarding and maybe use a routing protocol (OSPF). I would advertise routes for /32s assigned to the loopback interfaces of each brick node, which would make my gluster setup possible and also allow traffic to flow around a failed link.

Address assignment would look like this:
Node1 lo1 10.0.111.1/32 ib0 10.0.1.1/30 ib1 10.0.1.5/30
Node2 lo1 10.0.111.2/32 ib0 10.0.1.6/30 ib1 10.0.1.9/30
Node3 lo1 10.0.111.3/32 ib0 10.0.1.10/30 ib1 10.0.1.2/30

Quagga would make sure that the most direct path is taken, which would be the direct connection to the other server.
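
To see how that plays out on the ring, here is a rough Python sketch that derives the three point-to-point links from the /30 addresses above and finds the path between two nodes, with and without a failed link. It is only a model: the `links` and `shortest_path` helpers are made up for illustration, and the breadth-first search merely stands in for OSPF's shortest-path calculation under the assumption that every link has the same cost. It does not talk to Quagga at all.

```python
import ipaddress
from collections import deque

# Interface addresses copied from the assignment above (loopbacks omitted).
ADDRS = {
    ("node1", "ib0"): "10.0.1.1/30",
    ("node1", "ib1"): "10.0.1.5/30",
    ("node2", "ib0"): "10.0.1.6/30",
    ("node2", "ib1"): "10.0.1.9/30",
    ("node3", "ib0"): "10.0.1.10/30",
    ("node3", "ib1"): "10.0.1.2/30",
}


def links(addrs):
    """Group ports by /30 network; two ports in the same /30 form one link."""
    by_net = {}
    for (node, _port), cidr in addrs.items():
        net = ipaddress.ip_interface(cidr).network
        by_net.setdefault(net, []).append(node)
    return {frozenset(nodes) for nodes in by_net.values() if len(nodes) == 2}


def shortest_path(link_set, src, dst):
    """Breadth-first search over the links (all links assumed equal cost)."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for link in link_set:
            if path[-1] in link:
                (nxt,) = link - {path[-1]}  # the node at the other end
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(path + [nxt])
    return None  # dst is unreachable


ring = links(ADDRS)
print(shortest_path(ring, "node1", "node2"))

# Simulate losing the node1-node2 cable: traffic goes around via node3.
degraded = ring - {frozenset({"node1", "node2"})}
print(shortest_path(degraded, "node1", "node2"))
```

With all three links up, every node reaches the other two in one hop; with one link removed, the remaining path is two hops through the third node, which is the failover behavior described above.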

But that's more complicated than what I had in mind. Since I don't want to buy a switch, I might try it. A major problem with this approach would be prioritizing OSPF over other traffic on the IB links; I'm not sure if that's possible. A major advantage (if you don't need the full 20Gbps between each node) is that the ring can grow beyond 3 nodes, within reason.

Just thinking out loud.
 

stephanellis

New Member
Nov 13, 2013
3
0
0
Posting back about my OSPF idea: it works well. Tuning OSPF and prioritizing it over other traffic on the IB links would be good, but for now my setup is pretty basic. The meat of it is in my ansible playbooks. Look at the roles named infiniband-ringmember:

https://github.com/stephanellis/playbooks

Take a peek if you're interested.