Bridging Ethernet and IB


RobertFontaine

Active Member
Dec 17, 2015
Winterpeg, Canuckistan
I have tried searching and the favored answer seems to be to buy an expensive piece of hardware.

This is all a work in progress...

I will be starting by running my filesystem over NFS on QDR, point to point, between my workstation, compute node, and file/data node; the router will likely be pfSense. Operating systems will be virtualized on Xen (Xen4CentOS). The number of physical boxes this represents will likely vary over time, but right now it is strictly a poverty basement build: 1 compute node (CentOS), 1 workstation node (any and all OSes), 1 file/data/everything server running Linux images, and pfSense as the router. I will likely break out the router at my first opportunity and buy the inexpensive QDR switch.
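For reference, a minimal sketch of what the NFS-over-RDMA client side might look like on one of the CentOS nodes; the address and export path are just placeholders, and with no managed switch one host also has to run opensm as the subnet manager:

```bash
# One node on the point-to-point link must run a subnet manager
systemctl enable --now opensm

# Load the NFS RDMA client transport
modprobe xprtrdma

# Mount the storage node's export over RDMA (20049 is the conventional NFS/RDMA port)
mount -t nfs -o rdma,port=20049 192.168.10.1:/tank /mnt/tank

# Confirm the mount negotiated proto=rdma rather than falling back to TCP
nfsstat -m
```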

Without buying the QDR switch with integrated Ethernet, how do you go about bridging gigabit Ethernet to IB?
Can I simply configure virtual routing tables in pfSense? Is there a driver available to map between the two protocols? I have been reading and I haven't found a clear answer on this one. Throttling down from 40Gb to 1Gb seems like it could potentially be very chatty. Do the protocols support it? Is there middleware? Am I doomed?

Thanks,
Robert
 

Scott Laird

Active Member
Aug 30, 2014
There are two different ways to do IP over InfiniBand. The most common (by far) is IPoIB, which encapsulates IP directly in InfiniBand and can't be bridged onto Ethernet, largely because the low-level addresses are different lengths.

It's also possible in some cases to use the Ethernet over InfiniBand (EoIB) protocol and do IP over that. It fakes an Ethernet header, so it'd be easy to bridge. The last time I checked, though, support for EoIB was very spotty.

You're going to end up wanting something to route between Ethernet and IB, and it usually ends up being more of a pain than you expected, especially if you dual-attach hosts to both networks. Not impossible or anything, but I kept finding one little annoying thing after another, over a year or so.
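The routing itself isn't the hard part; a rough sketch of the Linux side, with interface names and subnets as placeholders (ib0 on 192.168.10.0/24 for IPoIB, eth0 on 192.168.1.0/24 for the LAN):

```bash
# On the box that straddles both networks: address the IPoIB interface
ip addr add 192.168.10.254/24 dev ib0
ip link set ib0 up

# Turn that box into a router between the two subnets
sysctl -w net.ipv4.ip_forward=1

# On Ethernet-only hosts: reach the IPoIB subnet via that box's LAN address
ip route add 192.168.10.0/24 via 192.168.1.254

# On IB-only hosts: reach the LAN via that box's IPoIB address
ip route add 192.168.1.0/24 via 192.168.10.254
```

The annoyances tend to show up on the dual-attached hosts, where you have to keep the routing and source-address selection straight.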

I ended up deciding that paying for 10GbE was less painful than dealing with dual-path routing issues, and ripped IB out entirely, but YMMV.
 

RobertFontaine

Active Member
Dec 17, 2015
Winterpeg, Canuckistan
I will occasionally need the bandwidth from storage to the compute nodes, so QDR/RDMA/NFS is compelling. The 4036E with the integrated 10GbE Ethernet gateway is on my shopping list, but if I could use a 4036 instead and bridge with pfSense or some other firewall/router distribution, it could save me $500 USD or so on the switch plus the 10GbE adapter for the router. The path to the outside world is through a residential cable modem, so 1Gb is adequate. I've been wondering how much of a pain it might be to run NFS over InfiniBand point to point and IP over Ethernet at the same time, or if that even makes sense. Need more hardware to experiment and more money for hardware.
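My working assumption is that running both in parallel mostly comes down to keeping the IPoIB side on its own subnet with no default route, so only storage traffic uses it; something like this nmcli sketch for a dual-homed CentOS node (connection names and addresses are made up):

```bash
# Ethernet: normal LAN connection carrying the default route
nmcli con add type ethernet ifname eth0 con-name lan \
    ipv4.method manual ipv4.addresses 192.168.1.21/24 ipv4.gateway 192.168.1.1

# IPoIB: storage-only subnet, no gateway, never the default route
nmcli con add type infiniband ifname ib0 con-name storage \
    ipv4.method manual ipv4.addresses 192.168.10.21/24 ipv4.never-default yes

nmcli con up lan && nmcli con up storage
```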
 

cesmith9999

Well-Known Member
Mar 26, 2013
This is why I recommend RoCE. Pure InfiniBand is cheaper with older gear, but it is not that much more to just have a pure Ethernet environment.

Chris
 

Levi

Member
Mar 2, 2015
I'm trying to do the same thing, Robert. I have so many questions regarding IPoIB, EoIB, RoCE, and RDMA.

We are not the only ones ... Infiniband to Ethernet bridging question

All I want to do is connect the lab servers in my rack, as fast and as cheap as possible, while learning along the way.

I have a bunch of ConnectX-2 SFP+ 10GbE cards, but I can't afford an FC or 10GbE switch, so I started looking at this cheap 40Gb IB stuff, but I don't understand most of it.

I want to connect my NICs to the 4036 with this breakout cable: 1M 40GbE QSFP+ to 4x 10GbE SFP+ Copper Breakout Cable,

but while that gives me fast IB, it doesn't connect to the rest of my home Ethernet LAN.

The Topspin 90 gateway looks very promising and used to be cheap, but they are drying up online. It looks like it's the solution to what we are looking for, I think.

pfSense 2.2 or newer should do what we want as well, but I haven't found any proof yet; I'm still looking.
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
The 4036E is the cheapest option, right around 2x the price of the non-E version if you can find a seller to 'accept' that offer; most seem to want 800-900.

I have tons of questions too :) so expect more from me!!
 

Levi

Member
Mar 2, 2015
The 4036E is the cheapest option, right around 2x the price of the non-E version if you can find a seller to 'accept' that offer; most seem to want 800-900.
Yeah, but pfSense is worth a go: you can have your fast storage network and still be able to talk to the rest of your Ethernet at about 500 Mbps. At the price of the 4036E I would rather just grab a 24-port 10GbE Quanta for much less.

EDIT: Oh, and it might be easier to just dual-home your servers.
 

Levi

Member
Mar 2, 2015
This is why I recommend RoCE. Pure InfiniBand is cheaper with older gear, but it is not that much more to just have a pure Ethernet environment.

Chris

My ConnectX-2 says it has RoCE, but I don't know how to enable it. I googled and found this: HowTo Configure RoCE in Windows Environment | Mellanox Interconnect Community. Could it really be that easy? And does this mean better latency, less packet loss, and lower CPU usage? So if I enable this between my two 10GbE servers, will I see way less CPU usage during testing?
 

cesmith9999

Well-Known Member
Mar 26, 2013
For 2012 R2 there is almost no configuration necessary.

We have a script to change some affinity, and we max out the buffers on the cards. That is it.

Chris
 

Levi

Member
Mar 2, 2015
For 2012 R2 there is almost no configuration necessary.

We have a script to change some affinity, and we max out the buffers on the cards. That is it.

Chris
So how would I know I'm using it? I have a Windows 10 client with the same card and a Windows 2012 server. They connect through the LB4M switch. I get 10Gbps of bandwidth, but it's at around 20% CPU on both machines.
 

cesmith9999

Well-Known Member
Mar 26, 2013
RoCE is not supported on clients (unfortunately), and I do not know why.

It is a server-to-server tech, where the main use case is a Hyper-V cluster talking to an SOFS storage cluster, or in my case, 6000+ servers sending to 200+ servers... 1.6 PB of data in 4 hours...

There are perfmon counters for RDMA/RoCE; that is where you will see it being used.

Chris
 

Levi

Member
Mar 2, 2015
RoCE is not supported on clients (unfortunately), and I do not know why.

It is a server-to-server tech, where the main use case is a Hyper-V cluster talking to an SOFS storage cluster, or in my case, 6000+ servers sending to 200+ servers... 1.6 PB of data in 4 hours...

There are perfmon counters for RDMA/RoCE; that is where you will see it being used.

Chris
Not sure the LB4M supports DCB either.

RoCE (Mellanox)
"InfiniBand over Ethernet" > so you "NEED" (no, not a real hard requirement) DCB with PFC/ETS (DCBX can be handy) for it to work best. No need for Congestion Notification, as that's for TCP/IP, but it could be nice with iWARP (see above). Do note that you'll need to configure your switches for DCB, and that's highly dependent on the vendor and even the type of switch.
 

cesmith9999

Well-Known Member
Mar 26, 2013
DCB is really only necessary in the datacenter where you are really pushing bits; in home use it is not as critical.

We have many (1000s of) servers in a datacenter about 10 miles away where DCB is NOT turned on on the WAN link, and we still get RoCE transfers between the DC and campus.

Chris
 

RobertFontaine

Active Member
Dec 17, 2015
Winterpeg, Canuckistan
I'm going to be pushing bits from a database/file server to 1-n compute nodes with 3 Phis per node plus a QDR card (those 8-Phi backplane X9(10)DRG-OTF boards are looking better all the time, but I don't see any on the junk market, and the X9DRG-QFs are almost affordable if you build the chassis out of cardboard boxes or plywood). I may experiment with 4P 10-core E5s, but I believe/hope that building out on the MICs will force me to write code that has a good chance of scaling out to heavy metal in the cloud if needed in the next 3-5 years (assuming Knights Landing gets traction).

I was thinking of going the Xen route, but KVM seems to better support Solaris/ZFS/IB/Napp-it (a poor man's SAN) on the database node and CentOS/Xeon Phis on the compute nodes. The rest of the servers, clients, and VMs will be support systems for day-to-day use and for the database/compute nodes, and don't have much in the way of hardware requirements.

With a little luck I am feeding the number crunchers from local RAM on the compute node, which is grabbing from fast flash that is being back-filled by the main database/prepped files on the storage server over QDR. After the initial offloads I'm hoping I can keep the hoppers full while compute does its thing. Not exactly scientific or engineering, but low budget, bubble gum, and baling twine. I suspect I can run the Ethernet and IB networks in parallel and let IB/RDMA/NFS provide the fast data transfer. "Normal" traffic really doesn't have any special requirements, and gigabit Ethernet is adequate for outside guests to run experiments.
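If the storage side ends up on Linux rather than the Solaris/Napp-it route, the server half of NFS over RDMA looks like a fairly small amount of configuration; a sketch with a made-up export path and subnet:

```bash
# On the storage server: start NFS, then enable the RDMA listener
systemctl enable --now nfs-server
modprobe svcrdma
echo "rdma 20049" > /proc/fs/nfsd/portlist

# Export the dataset to the IPoIB subnet (path and subnet are placeholders)
echo "/tank 192.168.10.0/24(rw,async,no_root_squash)" >> /etc/exports
exportfs -ra
```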

This is all in the dungeon lab, so labor and "enterprise ready" are not critical decision factors. It is purely about cash, processing power, and throughput. Things like RAID 0 and RAM disks are going to be considered to fill the hose. If the hashes don't match, I can rerun a batch without worrying about whether I have corrupted an enterprise database, sent out an inappropriate check, or turned off a power grid on the West Coast.


This would be a lot easier with a couple of hundred grand and a data center but it wouldn't be nearly as interesting.
 