Windows 7 file transfer hangs between pair of MHES18-XTC Infiniband HBAs


danwood82

Member
Feb 23, 2013
66
0
6
Has anyone come across this issue?

I grabbed a cheap set of Mellanox MHES18-XTC InfiniHost III Lx adapters off ebay, and they appear to be working perfectly running point-to-point, between two Windows 7 Professional computers.
I installed the MLNX WinOF VPI 2.1.2 OFED package, as it appears to be the last version which supported these adapters.
I've run ib_send_bw.exe bandwidth tests between the cards, and the connection seems to be reliable, consistent, and running full-rate.

I've set up IPoIB on them both, and gotten it working... actually quite surprised how well - it'll manage up to 500MB/s SMB file transfers between ram disks, and only seems to eat up about 10-15% of my 3770k Ivy Bridge CPU. Quite pleased so far.

The big problem though, is with a test directory of 20x ~1.5GB files, copying from one Samsung 830 SSD to another, it'll sometimes work fine, but more often than not it'll get part way through a file copy before the connection dies. It really messes up Windows too, it's actually caused Task Manager to hang and become unrecoverable, seems to make the Network and Sharing Center take forever to even load, and various other issues until I do a full reboot. As far as I can tell, it appears to do this only on the receiving computer, not the sending one.

I'm just about to start picking through everything to try and work out what's going on, but thought I'd check online first to see if someone knows the problem and the fix from past experience.

There's a whole bunch of IPoIB settings for me to play with, wondering if Large Send Offload might help, and also wondering whether there's some buffering issue at play, and it can overrun when a transfer becomes disk-write-limited.


Anyone have any suggestions?

Edit:
The adapters were running the 1.2.000 firmware, which appeared to be the latest, but I spotted a much more recent 1.2.940 version in the custom firmware section. Flashed it onto both cards without a problem, even spotted a bug-fix regarding "HCA might get stuck under stress conditions" - but alas, after reboot, they did exactly the same thing and stalled the receiving machine after about 4GB transferred... it mostly seems to be around that point it goes.
 

badatSAS

Member
Nov 7, 2012
103
0
16
Boston, MA
I'd check and see if it's crapping out when you run out of available RAM to cache to at the OS level first, it sounds like buffering to me.

What do the temperatures on these things look like? The only picture I found of the card didn't show a heatsink on them, does it need one?
 

danwood82

Member
Feb 23, 2013
66
0
6
Both machines have 32GB of RAM, so that shouldn't be the problem. The one or two times it's worked without crashing, it'll seem to run at full bandwidth up until about 4GB, at which point the transfer slows up to around the write-speed of the SSD... and ordinarily that seems to be the most common point it's crapping out... so a buffering issue could very well be the problem. I would have thought that would be taken care of automatically, it must be a pretty common usage scenario that the disk system is slower than the connection.

Are there any adapter settings that control allocated buffer size or something like that? It's definitely not using up more than a fraction of my RAM before it goes.


I'll give the 3.1 OpenFabrics package a try. I thought OpenFabrics packages were one-and-the-same as the Mellanox supplied OFED drivers. I take it this isn't the case then?


Edit: Oh, and yep, they do come with heatsinks attached to the IC. No idea what temperature they get up to, but I presume the cooling is sufficient... there's plenty of airflow through both cases.
 

danwood82

Member
Feb 23, 2013
66
0
6
Ah, just looked through the documentation, and 3.1 supports all Mellanox HCAs. (Although they note that InfiniHost cards will be deprecated in the next release.)
Looks like a good place to start then.
 

danwood82

Member
Feb 23, 2013
66
0
6
Darnit... Installed OFED 3.1 on both, and it appears to all work fine, except now when I assign static IP addresses, I don't actually get a network connection between them. Subnet manager is running on one node, IB fabric reports everything is fine, and I can see both adapters, but IP refuses to discover the other end.
...except once, right near the start, when the computer not running the subnet manager managed to ping the one that was, but not the other way around... since then, neither has managed it. How odd... hopefully a little more gentle-prodding will get it going.
 

dba

Moderator
Feb 20, 2012
1,477
184
63
San Francisco Bay Area, California, USA
If this is a point-to-point install with no IB switch, then I have seen something similar and it had to do with the subnet manager going pear-shaped. Another possible cause is a firmware-driver incompatibility - I also ran into that problem.

After first making sure that your cards have the latest firmware and that the driver is compatible with your card using that firmware, try this:
With both servers running and everything connected, restart the subnet manager. I know that it's already running, but try a restart. This *may* fix the issue. If it does not, then you may have landed in the same situation that I did when experimenting with IB: A messed-up Windows install. For me, repeatedly removing and re-installing the Mellanox components did not fix the issue, but an OS re-install did. I do not know what got corrupted deep in the registry, but I do know that the reinstall fixed it. Actually, I didn't want to re-install everything only to find that it was a hardware problem, so I removed my drives and set them aside. I installed new drives and then installed Windows. Only after confirming that the new install worked fine was I confident that I had a software issue.

I rather suspect that the action that caused the corruption was this: Temporarily starting a second subnet manager on the second computer. I wanted to see if having two subnet managers on my very simple point-to-point network enabled me to start up the two computers in any order, with the first subnet manager to start taking control. It did not work as hoped, even though Mellanox says it's supported, and after that little experiment I no longer had a working IB network. I don't know that there is a causal relationship between the two events, but it's my working hypothesis.


danwood82

Member
Feb 23, 2013
66
0
6
The other odd thing with the OpenFabrics driver is that it negotiates a 2.5Gbps 1x IB connection, instead of a 10Gbps 4x, which the old Mellanox drivers did.
Could this be a sign that something is wrong, or is there a setting somewhere to instruct it to use a 4x link?
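For reference, the gap between a 1x and a 4x link can be worked out from the SDR InfiniBand line rate. This is only a rough sketch: it assumes 2.5 Gbit/s signalling per lane and 8b/10b encoding (standard for SDR), and ignores all protocol overhead above the link layer. The function name is made up for illustration.

```python
# SDR InfiniBand: 2.5 Gbit/s signalling per lane, 8b/10b line encoding,
# so only 8 of every 10 bits on the wire carry data.
SDR_LANE_GBPS = 2.5
ENCODING_EFFICIENCY = 8 / 10


def data_rate_mb_per_s(lanes: int) -> float:
    """Upper bound on usable data rate, in decimal MB/s, for an SDR link."""
    signalling_gbps = lanes * SDR_LANE_GBPS
    data_gbps = signalling_gbps * ENCODING_EFFICIENCY
    return data_gbps * 1000 / 8  # Gbit/s -> MB/s


for lanes in (1, 4):
    print(f"{lanes}x SDR: {lanes * SDR_LANE_GBPS:g} Gbit/s signalling, "
          f"at most {data_rate_mb_per_s(lanes):g} MB/s of data")
```

A 1x link tops out around 250 MB/s before any IPoIB or SMB overhead, which is well under the roughly 500 MB/s seen earlier with the Mellanox driver, so a link that only trains at 1x would be a real bottleneck once the IP side works.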

@dba - thanks for the help. I do get the vague impression something is up with the subnet manager, or just the installation in general. I've tried uninstalling and reinstalling a couple of times, and on one of the two machines it keeps installing incremental "OpenFabrics IPoIB Adapter #2", #3, etc... but not on the other. It feels like some stale bits of previous installations are getting left behind, maybe even something from the old Mellanox drivers I tried to start with.

Annoyingly, I'm working abroad for the next three weeks, and trying to do all this over TeamViewer, so I'll have to put off trying a full OS re-install for now... drat!
 

dba

Moderator
Feb 20, 2012
1,477
184
63
San Francisco Bay Area, California, USA
I also had the "incrementing connection number" problem (#3, #4, etc.) in the above scenario - another bit of evidence that the install has gone wonky.

 

danwood82

Member
Feb 23, 2013
66
0
6
Ah nuts, in that case, I'll either have to work out a way to reliably clean all references to the drivers and software out the machine, or wait a few weeks and try it with fresh OS installations.

Thanks again for the information. It's good to know someone's at least gone through the same process and got it working in the end :)
 

MiniKnight

Well-Known Member
Mar 30, 2012
3,073
974
113
NYC
Last time I tried the OF drivers, I got errors in the Windows 7 and Windows 2008 R2 installations. Just gave up.
 

danwood82

Member
Feb 23, 2013
66
0
6
Well, I dug through and tried to clean up each machine, followed by running the OpenFabrics IBcleanup.bat file, to get rid of any last trace of the drivers.

I then did a fresh install - the "OpenFabrics IPoIB Adapter #2" issue had gone, it named the adapter correctly, but the IP issues remain.

It'd be good to know some definitive way to test which element was going wrong. As I understand it, the subnet manager handles the IB connection setup, not the IPoIB layer, and the IB connection itself appears to be fine... any diagnostics tools that look at the IB fabric directly (eg. ibnetdiscover) see both adapters. It just refuses to pick up the static IP allocation.

There shouldn't be any more to allocating static IPs than just setting "something.something.something.57 subnet 255.255.255.0" and the same "somethings.58" or other random number at the other end. Right? I shouldn't need a default gateway, I shouldn't need any specific IP address besides avoiding the subnet that my ethernet adapter is on, and making sure both IPoIB devices are on the same subnet as each other?
 

dba

Moderator
Feb 20, 2012
1,477
184
63
San Francisco Bay Area, California, USA
You are on the right track. Configure an IP address for each, on a non-routable subnet that you are not currently using. If your home network is 192.168.1.something then use 10.10.0.something for the IB subnet or vice versa. The exact numbers are not magic, we just want to use a non-routable subnet not currently in use. Set the subnet mask to something reasonable - say 255.255.255.0 - and do not set a default gateway or a DNS address.
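The addressing rules above can be sanity-checked with a few lines of Python. This is just an illustrative sketch using the standard `ipaddress` module; the function name `check_ipoib_plan` and the example addresses (built from the .57/.58 and 10.10.0.x / 192.168.1.x values mentioned in the thread) are assumptions, not anything from the drivers.

```python
import ipaddress


def check_ipoib_plan(addr_a, addr_b, lan):
    """Return a list of problems with a two-node static IPoIB plan.

    addr_a / addr_b: "address/mask" for each IPoIB adapter.
    lan: the subnet already used by the Ethernet adapters.
    An empty list means the plan follows the rules described above.
    """
    a = ipaddress.ip_interface(addr_a)
    b = ipaddress.ip_interface(addr_b)
    lan_net = ipaddress.ip_network(lan, strict=False)

    problems = []
    if a.network != b.network:
        problems.append("the two IPoIB addresses are on different subnets")
    if a.ip == b.ip:
        problems.append("both ends use the same address")
    if not (a.ip.is_private and b.ip.is_private):
        problems.append("addresses are not in a private range")
    if a.network.overlaps(lan_net) or b.network.overlaps(lan_net):
        problems.append("IPoIB subnet overlaps the Ethernet subnet")
    return problems


# Hypothetical plan: IB on 10.10.0.x, home Ethernet on 192.168.1.x.
print(check_ipoib_plan("10.10.0.57/255.255.255.0",
                       "10.10.0.58/255.255.255.0",
                       "192.168.1.0/24"))
```

If this comes back empty and the two ends still cannot ping each other, the addressing itself is probably not the culprit and the problem is more likely lower down (driver, subnet manager, or a stale install).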

 

danwood82

Member
Feb 23, 2013
66
0
6
Yup, that's exactly what I've done...

On both sides, once I allocate a static IP, they'll momentarily read "Identifying Network" in the Network Connections window, before quickly switching to "Unidentified Network" (Access type: No network access).

If I disable the subnet manager entirely, then reboot, they'll both read "Network Cable Unplugged"... until I start the subnet manager service, at which point both ends will simultaneously leap through "Identifying" to "Unidentified Network" in less than a second.

There's no doubt they're seeing each other on some level (as I mentioned before - ibnetdiscover happily lists both adapters when run at either end)... but no matter what I do, neither end seems able to see the IP allocation of the other.

 

Toddh

Member
Jan 30, 2013
122
10
18
This is strange. So from one pc you cannot ping the 2nd?

FYI there is no problem running more than one Subnet manager, in fact it is recommended to run 2 for redundancy. They do not recommend running any more than that.

The Unidentified Network is normal and does not affect connectivity. It just means there is no Domain or Internet to identify that subnet. All my adapters read that way but connect and transfer fine.

One setting on the NIC that does affect performance is changing from "Datagram Mode" to "Connected Mode". I think in the windows drivers the setting is just Connected Mode = Enabled. It changes the way the cards flow traffic and also enables the equivalent of Jumbo Frames which increases the packets from 2k - 4k up to 64k. Try this setting but keep in mind it messes with the Windows firewall. They have removed it in the Win12k drivers. Your mileage may vary.
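To get a feel for why the larger packet size matters, here is a back-of-the-envelope count of how many packets one of the ~1.5GB test files needs at each size. It is only a sketch: it treats the 2k / 4k / 64k figures quoted above as the full payload per packet and ignores IP, TCP and IPoIB headers, so real counts would be somewhat higher.

```python
FILE_BYTES = 1_500_000_000  # one ~1.5 GB test file (decimal gigabytes)


def packets_needed(payload_bytes: int, file_bytes: int = FILE_BYTES) -> int:
    """Number of packets needed to carry file_bytes, rounding up."""
    return -(-file_bytes // payload_bytes)  # integer ceiling division


for label, size in (("2 KiB", 2 * 1024),
                    ("4 KiB", 4 * 1024),
                    ("64 KiB", 64 * 1024)):
    print(f"{label} packets: {packets_needed(size):,}")
```

Going from 2 KiB to 64 KiB cuts the packet count by a factor of about 32, which means far fewer per-packet operations for the CPU and the driver to handle.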




 

danwood82

Member
Feb 23, 2013
66
0
6
Nope, I can't ping from either end. Both IPoIB adapters' "Status" windows consistently read "Bytes sent: 0 Bytes received: 0"... which seems strange as even disconnected devices usually have a couple of bytes clocked up in the sent column, just from trying to establish a connection.

I know in theory two subnet managers can run, and the second will default to standby mode, but the impression I've got from various sources is that in practice it tends to cause issues, and to avoid doing it. Does seem odd, as the redundancy element sounds like an important part of the design. Anyway, I've stuck to just running the one for now.

Good to know the unidentified network thing isn't the issue, thanks.
I've tried switching both to Connected Mode, but it doesn't have any effect on the IP problem.

I think I'll try switching back to the old Mellanox WinOF 2.1.2 package tonight, to see whether I at least end up with a working IP connection, with the dropping-out issue, or whether that has stopped working too... if it has, then I'd guess a full OS reinstall is looking increasingly like a requirement :p
 

justgosh

New Member
Jun 6, 2013
1
0
1
I was looking to set up something similar and I found this site along with a Japanese site, InfiniBand - FreeStyleWiki (use Google Chrome to translate).
It lists MLNX_WinOF_VPI 2.1.2 as a working driver and fw-25204-1.2.000 as a tested and confirmed firmware on most OSes. fw-25204-1.2.940 is also listed as working on most, but I couldn't find it on Mellanox's website.
For 1.2.000, see the Mellanox Technologies page "Updating Firmware for Single Port InfiniHost(TM) III Lx MemFree PCI Express HCA Cards".

Hope someone finds this helpful.