Infiniband IPoIB performance problems?

tjk · Mar 14, 2013

Hey folks, long time lurker, first time poster!

I have a test lab where I am testing the performance of Infiniband, specifically IPoIB, and I think I am running into some performance problems. I'm looking for any feedback anyone may have.

Setup:

10 x Dell R610 nodes, 48GB RAM, dual quad-core 5600 CPUs, 6x600GB 10K SAS HDDs
10 x MHQH19B-XTR single port QDR Infiniband cards, one in each server (Firmware version: 2.9.1200)
1 x Mellanox 4036 switch with latest firmware that I can find (version: 3.9.1 / build Id:985), subnet manager running on the switch

Cards report up and running at 4xQDR:

Infiniband device 'mlx4_0' port 1 status:
default gid: fe80:0000:0000:0000:0002:c903:0007:f2e7
base lid: 0x3
sm lid: 0x1
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 40 Gb/sec (4X QDR)
link_layer: InfiniBand

The problem is that IP performance over the Infiniband fabric is not that great. Here are some iperf test results, taken with the cards in connected mode with a 65520 MTU. Datagram mode was worse. Kernel tweaking per the Mellanox docs didn't make things any faster or slower; performance was basically the same either way.

(Latest Centos 6.4 64 bit on each host node)

iperf -fg -c 10.10.10.11
------------------------------------------------------------
Client connecting to 10.10.10.11, TCP port 5001
TCP window size: 0.00 GByte (default)
------------------------------------------------------------
[ 3] local 10.10.10.12 port 42552 connected with 10.10.10.11 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 12.7 GBytes 10.9 Gbits/sec
[root@vz3 ~]# iperf -fg -c 10.10.10.11
------------------------------------------------------------
Client connecting to 10.10.10.11, TCP port 5001
TCP window size: 0.00 GByte (default)
------------------------------------------------------------
[ 3] local 10.10.10.12 port 42553 connected with 10.10.10.11 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 12.7 GBytes 10.9 Gbits/sec

(Not very scientific but a test nonetheless)

SCP of a large file directory over the IPoIB network:

file7 100% 4152MB 148.3MB/s 00:28
file10 100% 4152MB 148.3MB/s 00:28
file9 100% 4152MB 153.8MB/s 00:27

(Same test / same files over a 2xGE bonded interface, which is FASTER than over the IB network)

file7 100% 4152MB 138.4MB/s 00:30
file10 100% 4152MB 166.1MB/s 00:25
file9 100% 4152MB 159.7MB/s 00:26
file6 100% 4152MB 166.1MB/s 00:25

In summary: IPoIB on a QDR 40Gb/sec fabric seems to be performing horribly. I know IPoIB isn't the most efficient way to do this (would be doing SRP if I could, but can't), but I would expect it to perform better than 30% of line rate.
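
For reference, the "30% of line rate" figure can be checked with a little arithmetic. This is a rough sketch, assuming the 8b/10b line encoding that QDR InfiniBand uses (so a 40 Gb/sec signalling rate carries at most 32 Gb/sec of data); the function names are made up for illustration, and the measured value is the iperf result quoted above.

```python
# Rough efficiency arithmetic for a 4X QDR link.
# Assumption: QDR uses 8b/10b encoding, so only 8 of every 10 signalled
# bits carry data. Function names here are hypothetical.

SIGNALING_GBPS = 40.0         # "rate: 40 Gb/sec (4X QDR)" as reported by the card
ENCODING_EFFICIENCY = 8 / 10  # 8b/10b line encoding


def data_rate_gbps(signaling_gbps: float = SIGNALING_GBPS) -> float:
    """Maximum data rate after removing the line-encoding overhead."""
    return signaling_gbps * ENCODING_EFFICIENCY


def percent_of(measured_gbps: float, reference_gbps: float) -> float:
    """Measured throughput as a percentage of a reference rate."""
    return 100.0 * measured_gbps / reference_gbps


if __name__ == "__main__":
    measured = 10.9  # Gbits/sec from the single-stream iperf run
    print(f"data rate ceiling: {data_rate_gbps():.1f} Gbit/s")
    print(f"vs signalling rate: {percent_of(measured, SIGNALING_GBPS):.1f}%")
    print(f"vs data rate:       {percent_of(measured, data_rate_gbps()):.1f}%")
```

Measured against the data rate rather than the signalling rate, the iperf result looks slightly less bad, but it is still only about a third of what the link can carry.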

Am I missing anything? I can run any test if someone has suggestions, or test another OS/method/etc.

Thanks in advance for any suggestions!

Tom

dba · Mar 14, 2013

I remember that the IB MTU is fixed at 4k, or maybe 2k for older implementations. Have you tried setting your IP MTU to 2044 or 4092 to match (2048 or 4096 with four bytes reserved for the IB header)?
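
The MTU arithmetic in this post, plus a rough look at why packet size matters for the host, can be sketched as follows. The 4-byte header figure is taken from the post; the packets-per-second estimate ignores all protocol headers and is only an order-of-magnitude illustration, and the function names are hypothetical.

```python
# IPoIB MTU arithmetic and a rough packet-rate estimate.
# Assumption: 4 bytes of each IB payload are reserved for the IPoIB header,
# as stated in the post. Header overhead is otherwise ignored.

IPOIB_HEADER_BYTES = 4


def datagram_ip_mtu(ib_mtu_bytes: int) -> int:
    """IP MTU that fits in one IB packet of the given IB MTU."""
    return ib_mtu_bytes - IPOIB_HEADER_BYTES


def packets_per_second(throughput_gbps: float, mtu_bytes: int) -> float:
    """Approximate packets/sec needed to sustain a throughput with full-size packets."""
    bytes_per_second = throughput_gbps * 1e9 / 8
    return bytes_per_second / mtu_bytes


if __name__ == "__main__":
    for ib_mtu in (2048, 4096):
        print(f"IB MTU {ib_mtu} -> IP MTU {datagram_ip_mtu(ib_mtu)}")
    for mtu in (2044, 4092, 65520):
        pps = packets_per_second(10.9, mtu)
        print(f"10.9 Gbit/s at MTU {mtu}: ~{pps:,.0f} packets/sec")
```

The point of the second function is that a small MTU forces the TCP/IP stack to handle far more packets per second for the same throughput, which is one reason the MTU setting is worth checking.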

Also, with respect to your file copy test, have you tested the read and write speed of the disks to verify that they aren't a bottleneck?

I'm getting multi-thread throughput of just under 2,000MB/S using the same ConnectX-2 generation cards and the same IB switch, but with Windows and slower CPUs - my results work out to 15.3 Gbits/second. Windows says "jumbo packet" size is 4092.

Toddh · Mar 14, 2013

From my experience, tuning Infiniband can be hit or miss. It depends on the OS, drivers, firmware, protocol and which way the wind is blowing. That said, you are correct: your numbers seem low.

I don't know about CentOS; most of my testing is in Windows. I can tell you I have some similar hardware and I am currently running around 320mb reads and 550mb writes via iSCSI from Windows to a Linux based SAN. Those numbers can change with different Windows driver versions.

IPoIB has two modes, Datagram and Connected. Connected mode is preferred and is the equivalent of jumbo frames in Ethernet. It increases the packet size from 2k on Linux up to 64k.
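
On Linux, the current mode and MTU of an IPoIB interface can be read back from sysfs, which is a quick way to confirm what a node is actually using. A minimal sketch, assuming the interface is named ib0 and that the driver exposes the usual `mode` and `mtu` files under `/sys/class/net/<iface>/`; the function name and the `sysfs_root` parameter are hypothetical and exist only to keep the example testable.

```python
# Read the IPoIB mode ("datagram" or "connected") and MTU from sysfs.
# Assumptions: interface name "ib0", and the standard Linux sysfs layout.
from pathlib import Path
from typing import Optional


def read_ipoib_settings(iface: str = "ib0",
                        sysfs_root: str = "/sys/class/net") -> Optional[dict]:
    """Return {'mode': ..., 'mtu': ...} or None if the files are missing."""
    base = Path(sysfs_root) / iface
    mode_file = base / "mode"
    mtu_file = base / "mtu"
    if not mode_file.exists() or not mtu_file.exists():
        return None
    return {
        "mode": mode_file.read_text().strip(),
        "mtu": int(mtu_file.read_text().strip()),
    }


if __name__ == "__main__":
    settings = read_ipoib_settings()
    if settings is None:
        print("ib0 not found, or it is not an IPoIB interface")
    else:
        print(f"mode={settings['mode']} mtu={settings['mtu']}")
```

Running this on both ends of a test is worthwhile, since the two nodes need to agree before a large MTU does any good.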

I agree with dba, check your disk benchmarks to make sure the bottleneck is not there.

dba are you getting that 2000mb with SRP or IPoIB?

tjk · Mar 14, 2013

dba, thanks for your reply.

The mtu on ib0 is set to 2044 right now, and test results are the same for IP.

Testing using rdma_bw looks like I am getting awesome/impressive performance:

rdma_bw -b 172.16.1.10
214166: | port=18515 | ib_port=1 | size=65536 | tx_depth=100 | sl=0 | iters=1000 | duplex=1 | cma=0 |
214166: Local address: LID 0x04, QPN 0x340049, PSN 0x78ed95 RKey 0x38001800 VAddr 0x007f7b6be63000
214166: Remote address: LID 0x03, QPN 0x6c0049, PSN 0x6093c0, RKey 0x20001800 VAddr 0x007fcf0a91f000

214166: Bandwidth peak (#0 to #999): 6238.8 MB/sec
214166: Bandwidth average: 6238.73 MB/sec
214166: Service Demand peak (#0 to #999): 374 cycles/KB
214166: Service Demand Avg : 374 cycles/KB

As for the disks, I tested with an SSD in a couple of nodes and got the same results: SCP over a bonded GE link (2 links) is faster than the QDR IPoIB setup. So the disks are not the choke point as far as I can tell, since bonded GE was faster on the same nodes/tests.

Tom

tjk · Mar 14, 2013

Toddh,

Thanks for your reply.

I tested connected mode with 65520 MTU, and performance was slightly better, but never over 11 Gbits/sec.

I am using iperf and netperf, which don't hit the HDD subsystem.

My SCP tests showed faster using bonded GE over the IPoIB setup, so I assume this isn't a disk issue or the #'s would be the same for both.

What sort of tweaking have you done on your linux setup, can you share?

Thanks,
Tom

dba · Mar 14, 2013

Toddh said:
...IPoIB has two modes, Datagram and Connected modes. Connected Mode is preferred and is the equivalent to jumbo frames in Ethernet. It increases the packet size from 2k for linux up to 64k...

Sort of. My understanding is larger packets in IPoIB are "virtual" in that they are assembled out of multiple, smaller, native 2k/4k IB packets when using connected mode.

Toddh said:
...dba are you getting that 2000mb with SRP or IPoIB?

IP definitely. The RDMA-based protocols are even faster. Unfortunately, I now can't seem to find the detailed notes. I thought that I was testing Windows file sharing, but now I am thinking that it must have been StarWind iSCSI.
Update: I found the notes. I was testing the StarWind iSCSI target software on Windows to a Windows client using IOMeter to generate the load. 1,960MB/S with 1MB transfers and a queue depth of 32, maximum 4k IOPS was a fairly anemic 85K.

tjk, do you have a way to throw multiple threads worth of data at that connection? It is quite possible that the pipe is doing fine but that the protocol isn't able to utilize all of the bandwidth when performing a single copy.

tjk · Mar 14, 2013

dba said:
tjk, do you have a way to throw multiple threads worth of data at that connection? It is quite possible that the pipe is doing fine but that the protocol isn't able to utilize all of the bandwidth when performing a single copy.

I can test / reconfigure whatever I need to, do you know what tools I can use to test multiple streams? Be happy to run them for sure!

Appreciate the feedback!

Tom

dba · Mar 14, 2013

What upper level protocol are you using? iSCSI or NFS or ?

tjk said:
I can test / reconfigure whatever I need to, do you know what tools I can use to test multiple streams? Be happy to run them for sure!

Appreciate the feedback!

Tom

tjk · Mar 14, 2013

dba said:
What upper level protocol are you using? iSCSI or NFS or ?

None, right now I am just trying to get TCP/IP performance between nodes tuned and working as it should be, but the performance isn't what I expected with IPoIB.

When/if I figure this out, we'll be using a distributed storage system between nodes that runs over IP, which doesn't support SRP, hence me trying to get all the performance out of basic IPoIB that I can.

Tom

dba · Mar 14, 2013

I would use IOMeter. You'll need to run the Linux version of the IOMeter engine along with the control UI running on a Windows box. There is no unix-only version of IOMeter unfortunately. You can Google for how to use IOMeter, it's fairly easy. For throughput testing I recommend starting with something like the following: One worker, 8,000,000 sectors, queue depth = 32, 1MB transfers, 100% read, 100% sequential. Ramp the queue depth up and down to find your maximum throughput.

To run IOMeter you'll need some storage from one server mounted on another server over your IB network. a RAM disk on the storage server is best - avoid any disk bottlenecks. Testing won't require any disk on the test server. The storage can be shared via NFS or iSCSI, with iSCSI preferred.

tjk said:
None, right now I am just trying to get TCP/IP performance between nodes tuned and working as it should be, but the performance isn't what I expected with IPoIB.

When/if I figure this out, we'll be using a distributed storage system between nodes that runs over IP, which doesn't support SRP, hence me trying to get all the performance out of basic IPoIB that I can.

Tom

Now, having said all of this, I just noticed that you have run iperf and are getting only 10Gb/S. Ignore all of my advice until you get the raw TCP/IP throughput up to where it should be. I think that I'd try to rule out switch and fabric problems by testing a direct connection between two servers with a subnet manager running on one of them. I'd also test with netperf as well as iperf - I'm not sure how well either benchmarks QDR IB-level throughput.

Biren78 · Mar 15, 2013

dba said:
I would use IOMeter. You'll need to run the Linux version of the IOMeter engine along with the control UI running on a Windows box. There is no unix-only version of IOMeter unfortunately. You can Google for how to use IOMeter, it's fairly easy. For throughput testing I recommend starting with something like the following: One worker, 8,000,000 sectors, queue depth = 32, 1MB transfers, 100% read, 100% sequential. Ramp the queue depth up and down to find your maximum throughput.

Intellectually I have a block on IOMeter w/ workers. One server I can get working but tuning with multiple workers makes my head spin.

dba · Mar 15, 2013

Are you talking about "managers" or "workers"?

For those new to IOMeter: In IOMeter speak, the IOMeter app (IOMETER.exe on Windows) tells "workers" which run in "managers" what to do. A "manager" is an instance of dynamo (DYNAMO.exe on Windows) running on some machine that coordinates the actions of the "worker" processes that it spawns. The "worker" is the bit that actually runs the load. You run multiple workers within a manager in order to spread the load over multiple CPUs or to run different kinds of load at the same time.
When you do a simple install of IOMeter on Windows, the UI launches DYNAMO automatically and you don't need to know any of this. When you run it distributed, or on Linux, you launch one or more DYNAMO instances separately, after a bit of configuration to allow them to communicate with the Windows UI.

Biren78 said:
Intellectually I have a block on IOMeter w/ workers. One server I can get working but tuning with multiple workers makes my head spin.

jmarg · Mar 17, 2013

Couple of things (Mellanox guy here):

1. Check out our performance tuning guide, might help: http://www.mellanox.com/related-doc..._Guide_for_Mellanox_Network_Adapters_v1.6.pdf
2. Check with iPerf running a few threads, 1 thread might be not-as-good as 4,5,6 threads.
3. Join community.mellanox.com and have some of our smarter guys answer some of your questions (not replacing servethehome.com - but in addition to)
4. Mellanox OFED 2.0 coming soon should add some nice performance improvements
5. Stay away from 4k MTU, stick with 2k MTU

jmarg · Mar 17, 2013

Oh, and:

6. CM is almost always better than UD for IPoIB, but it should be getting closer in Mellanox OFED 2.0.

tjk · Mar 17, 2013

jmarg,

Thanks for the suggestions. I applied all the kernel tweaks per your link.

I am running in connected mode, mtu 65520.

single stream iperf -fg -c 172.16.1.10
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 13.2 GBytes 11.3 Gbits/sec

2 streams iperf -fg -c 172.16.1.10 -P 2
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.0 sec 5.11 GBytes 4.39 Gbits/sec
[ 3] 0.0-10.0 sec 13.6 GBytes 11.6 Gbits/sec
[SUM] 0.0-10.0 sec 18.7 GBytes 16.0 Gbits/sec

4 streams iperf -fg -c 172.16.1.10 -P 4
[ ID] Interval Transfer Bandwidth
[ 5] 0.0-10.0 sec 3.94 GBytes 3.38 Gbits/sec
[ 4] 0.0-10.0 sec 3.59 GBytes 3.08 Gbits/sec
[ 3] 0.0-10.0 sec 3.57 GBytes 3.07 Gbits/sec
[ 6] 0.0-10.0 sec 4.03 GBytes 3.45 Gbits/sec
[SUM] 0.0-10.0 sec 15.1 GBytes 13.0 Gbits/sec

8 streams iperf -fg -c 172.16.1.10 -P 8
[ ID] Interval Transfer Bandwidth
[ 6] 0.0-10.0 sec 2.21 GBytes 1.90 Gbits/sec
[ 3] 0.0-10.0 sec 1.99 GBytes 1.71 Gbits/sec
[ 8] 0.0-10.0 sec 2.68 GBytes 2.30 Gbits/sec
[ 9] 0.0-10.0 sec 1.64 GBytes 1.41 Gbits/sec
[ 7] 0.0-10.0 sec 2.45 GBytes 2.10 Gbits/sec
[ 4] 0.0-10.0 sec 1.89 GBytes 1.62 Gbits/sec
[ 5] 0.0-10.0 sec 1.76 GBytes 1.51 Gbits/sec
[ 10] 0.0-10.0 sec 2.37 GBytes 2.04 Gbits/sec
[SUM] 0.0-10.0 sec 17.0 GBytes 14.6 Gbits/sec
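
Comparing multi-stream runs by eye gets tedious, so here is a small sketch for pulling the per-stream and SUM figures out of iperf output like the above. It assumes the classic iperf text format shown in this thread with `-fg` (Gbits/sec); the regular expression and function name are made up for this example and are not part of iperf.

```python
# Parse per-stream and [SUM] bandwidth lines from iperf (-fg) text output.
# Assumption: line layout matches the output pasted in this thread.
import re

LINE_RE = re.compile(
    r"\[\s*(SUM|\d+)\]\s+[\d.]+-\s*[\d.]+ sec\s+[\d.]+ \w+\s+([\d.]+) Gbits/sec"
)

SAMPLE = """\
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.0 sec 5.11 GBytes 4.39 Gbits/sec
[ 3] 0.0-10.0 sec 13.6 GBytes 11.6 Gbits/sec
[SUM] 0.0-10.0 sec 18.7 GBytes 16.0 Gbits/sec
"""


def summarize(iperf_output: str):
    """Return (list of per-stream Gbits/sec, reported SUM or None)."""
    streams, reported_sum = [], None
    for line in iperf_output.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue  # header or unrelated line
        if match.group(1) == "SUM":
            reported_sum = float(match.group(2))
        else:
            streams.append(float(match.group(2)))
    return streams, reported_sum


if __name__ == "__main__":
    streams, reported = summarize(SAMPLE)
    print(f"streams: {streams}")
    print(f"computed total: {sum(streams):.2f} Gbit/s, reported SUM: {reported}")
    print(f"slowest/fastest stream ratio: {min(streams) / max(streams):.2f}")
```

The slowest/fastest ratio is a quick indicator of how unevenly the streams share the link; in the 2-stream run one stream gets well under half of what the other gets.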

Not sure what else I can do. I was talking to some other folks and they basically just said IPoIB is horrible and to try not to use it. However, my app will only run on IP; I can't use rsockets or rdma/srp/etc.

Still cheaper than 10g ethernet I guess.

Thanks,
Tom

dba · Mar 17, 2013

You are up to 16 Gbits/second, which is roughly 2,000MB/Second. That's close to, and even slightly better than, my 1,960MB/S iSCSI results using the same cards on Windows. It's still far short of the theoretical IB throughput, and short of the roughly 3,200MB/S results that I've seen from the RDMA protocols, but it's pretty good throughput nonetheless.
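
Converting between Gbits/sec and MB/s depends on which convention is in play. A minimal sketch, assuming iperf's Gbits are decimal (10^9 bits), which matches the output earlier in the thread: 12.7 GBytes in 10 seconds only works out to 10.9 Gbits/sec if GBytes are 2^30 bytes and Gbits are 10^9 bits. Whether a given storage tool reports decimal or binary megabytes is not known here, so both are shown; the function names are hypothetical.

```python
# Gbit/s <-> MB/s conversions under decimal and binary megabyte conventions.
# Assumption: Gbit means 10**9 bits (as iperf's -fg output appears to use).

def gbit_to_mbyte(gbit_per_s: float, binary: bool = False) -> float:
    """Convert Gbit/s to MB/s (10**6 bytes) or MiB/s (2**20 bytes)."""
    bytes_per_s = gbit_per_s * 1e9 / 8
    return bytes_per_s / (2**20 if binary else 1e6)


def mbyte_to_gbit(mbyte_per_s: float, binary: bool = False) -> float:
    """Convert MB/s (or MiB/s when binary=True) to Gbit/s."""
    bytes_per_s = mbyte_per_s * (2**20 if binary else 1e6)
    return bytes_per_s * 8 / 1e9


if __name__ == "__main__":
    print(f"16 Gbit/s = {gbit_to_mbyte(16):.0f} MB/s "
          f"or {gbit_to_mbyte(16, binary=True):.0f} MiB/s")
    print(f"1960 MB/s  = {mbyte_to_gbit(1960):.2f} Gbit/s")
    print(f"1960 MiB/s = {mbyte_to_gbit(1960, binary=True):.2f} Gbit/s")
```

Either way, the 2-stream iperf total and the 1,960MB/S iSCSI figure land within a few percent of each other, so the comparison holds regardless of the convention.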

tjk said:
jmarg,

Thanks for the suggestions. I applied all the kernel tweaks per your link.

I am running in connected mode, mtu 65520.

single stream iperf -fg -c 172.16.1.10
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 13.2 GBytes 11.3 Gbits/sec

2 streams iperf -fg -c 172.16.1.10 -P 2
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.0 sec 5.11 GBytes 4.39 Gbits/sec
[ 3] 0.0-10.0 sec 13.6 GBytes 11.6 Gbits/sec
[SUM] 0.0-10.0 sec 18.7 GBytes 16.0 Gbits/sec

...

Thanks,
Tom

tjk · Mar 17, 2013

Interesting update:

Loaded up Ubuntu 12.04.2 LTS and did some testing. The following #'s do not have OFED installed NOR do they have any kernel tweaks done.

Connected mode, 65K MTU

1 stream:
root@test-4:~# iperf -c 172.16.1.12 -fg -P1 -t 30
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-30.0 sec 60.1 GBytes 17.2 Gbits/sec

2 streams:
root@test-4:~# iperf -c 172.16.1.12 -fg -P2 -t 30
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-30.0 sec 38.3 GBytes 11.0 Gbits/sec
[ 4] 0.0-30.0 sec 38.3 GBytes 11.0 Gbits/sec
[SUM] 0.0-30.0 sec 76.6 GBytes 21.9 Gbits/sec

4 streams:
root@test-4:~# iperf -c 172.16.1.12 -fg -P4 -t 30
[ ID] Interval Transfer Bandwidth
[ 6] 0.0-30.0 sec 19.0 GBytes 5.43 Gbits/sec
[ 3] 0.0-30.0 sec 19.0 GBytes 5.44 Gbits/sec
[ 5] 0.0-30.0 sec 19.0 GBytes 5.44 Gbits/sec
[ 4] 0.0-30.0 sec 19.0 GBytes 5.43 Gbits/sec
[SUM] 0.0-30.0 sec 75.9 GBytes 21.7 Gbits/sec

Anything more than 4 streams, and the SUM BW went down below 20 Gbits/sec.

Ubuntu is using the latest 3.5 kernel, whereas Centos is using the latest 2.x kernel.
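
To put the two sets of results side by side, here is a small sketch that computes the speedup per stream count. The numbers are the iperf SUM (or single-stream) figures posted in this thread for CentOS 6.4 with the tuning applied and for untuned Ubuntu 12.04.2; the dictionary and function names are made up for illustration.

```python
# Compare iperf totals (Gbits/sec) between the two OS installs, keyed by
# number of parallel streams. Values are copied from the posts in this thread.

CENTOS_GBPS = {1: 11.3, 2: 16.0, 4: 13.0}
UBUNTU_GBPS = {1: 17.2, 2: 21.9, 4: 21.7}


def speedups(before: dict, after: dict) -> dict:
    """Ratio after/before for every stream count present in both runs."""
    return {n: after[n] / before[n] for n in sorted(before) if n in after}


if __name__ == "__main__":
    for n, ratio in speedups(CENTOS_GBPS, UBUNTU_GBPS).items():
        print(f"{n} stream(s): {CENTOS_GBPS[n]:5.1f} -> {UBUNTU_GBPS[n]:5.1f} "
              f"Gbit/s  ({ratio:.2f}x)")
```

Note that the CentOS runs were 10 seconds and the Ubuntu runs 30 seconds, so this is an indicative comparison rather than a controlled one.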

Wish I could use Ubuntu!

Tom

dba · Mar 17, 2013

Very interesting results! Nice find.
I wonder how Solaris would do? When I was testing SSD disk IO, I found Linux to be slower than Windows, but both looked pathetic compared to Solaris x86 11.1.

tjk said:
Interesting update:

Loaded up Ubuntu 12.04.2 LTS and did some testing. The following #'s do not have OFED installed NOR do they have any kernel tweaks done.

Connected mode, 65K MTU

1 stream:
root@test-4:~# iperf -c 172.16.1.12 -fg -P1 -t 30
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-30.0 sec 60.1 GBytes 17.2 Gbits/sec

2 streams:
root@test-4:~# iperf -c 172.16.1.12 -fg -P2 -t 30
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-30.0 sec 38.3 GBytes 11.0 Gbits/sec
[ 4] 0.0-30.0 sec 38.3 GBytes 11.0 Gbits/sec
[SUM] 0.0-30.0 sec 76.6 GBytes 21.9 Gbits/sec

4 streams:
root@test-4:~# iperf -c 172.16.1.12 -fg -P4 -t 30
[ ID] Interval Transfer Bandwidth
[ 6] 0.0-30.0 sec 19.0 GBytes 5.43 Gbits/sec
[ 3] 0.0-30.0 sec 19.0 GBytes 5.44 Gbits/sec
[ 5] 0.0-30.0 sec 19.0 GBytes 5.44 Gbits/sec
[ 4] 0.0-30.0 sec 19.0 GBytes 5.43 Gbits/sec
[SUM] 0.0-30.0 sec 75.9 GBytes 21.7 Gbits/sec

Anything more than 4 streams, and the SUM BW went down below 20 Gbits/sec.

Ubuntu is using the latest 3.5 kernel, whereas Centos is using the latest 2.x kernel.

Wish I could use Ubuntu!

Tom

gigatexal · Mar 17, 2013

Hmm. Best of luck with this, man. Interesting to see that the kernel made such a difference. For a 40 gigabit switch, getting 20 Gbits isn't that great. Then again, I'm not sure what the overhead is with 10GbE, but, as you probably know, on 1 Gbit I often get close to 80% of the pipe.

tjk · Mar 18, 2013

Yea, the slowness is due to the TCP/IP stack when using IPoIB. I've tested with just RDMA transfers and the performance was line rate.

I also did some testing with NFS/RDMA and the performance wasn't that great; it was slower than NFS on 10GbE.

At this point, if you require full IP communications (Non RDMA, SRP, SDP/rsockets) for your application, I would stick to regular 10Gb Ethernet.

Maybe the Mellanox guy can tell us why any sort of IP over Infiniband performs so badly? Is it lack of checksum/TCP offloads like 10Gbe cards have? Something else?

dba, Did you do any Windows SMB 3/RDMA performance testing with your equipment? If so, what were the results?

Tom

gigatexal said:
Hmm. Best of luck with this, man. Interesting to see that the kernel made such a difference. For a 40 gigabit switch, getting 20 Gbits isn't that great. Then again, I'm not sure what the overhead is with 10GbE, but, as you probably know, on 1 Gbit I often get close to 80% of the pipe.
