InfiniBand IPoIB performance problems?


dba

Moderator
Feb 20, 2012
San Francisco Bay Area, California, USA
...

dba, Did you do any Windows SMB 3/RDMA performance testing with your equipment? If so, what were the results?

Tom
Not much, at least not yet. I ran a quick sqlio test against an SMB3 shared ramdisk and saw only around 600MB/s for sequential transfers. Disappointing, but I can't come to any conclusions since I really didn't do anything but run five minutes' worth of testing with no tuning. I didn't even confirm that RDMA was being used.
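For next time, the quick sanity checks on the Windows side would be something like this in PowerShell (standard Server 2012 cmdlets; I haven't gone back and run them against this setup yet):

    # Does the NIC/driver advertise RDMA at all?
    Get-NetAdapterRdma

    # Client-side view of which interfaces SMB considers RDMA-capable
    Get-SmbClientNetworkInterface

    # With traffic running against the share, check the active SMB connections for RDMA capability
    Get-SmbMultichannelConnection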
 

Netwizard

New Member
May 8, 2013
Here's some hope

Hello from a first time poster. I stumbled across this thread while trying to tune IPoIB myself. I'm getting better numbers, so here's some hope for the original poster.

I bought a couple of FDR InfiniBand + 40GigE Mellanox cards, MCX354A-FCBT, on eBay and installed them in my home servers. We're talking consumer hardware: i7-3770 CPUs and Gigabyte GA-Z77X-D3H motherboards. I got Mellanox QDR cables, but ibstat says "Rate: 40 (FDR10)", so I'm guessing I'm running at FDR10 (40 Gbit/s) rather than full FDR. One of the boxes runs Ubuntu 12.10 and the other 13.04, as I'm in the process of upgrading. The following numbers are without any kernel tweaks, and all of them reproduce consistently. The drivers are whatever Ubuntu supplies with "apt-get install libmlx4-1 opensm". (How do you check which version?)
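In case it helps anyone, these are the commands I'd expect to show the driver and firmware versions; I'm not claiming they're the canonical way:

    # driver and firmware version via the Ethernet-mode interface (interface name will differ)
    ethtool -i eth2

    # firmware version and board ID as reported by the verbs stack
    ibv_devinfo | grep -E 'fw_ver|board_id'

    # version of the in-tree mlx4 kernel module
    modinfo mlx4_core | grep -i ^version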

  1. 40G Ethernet mode:

    root@linux2:~# iperf -c 10.1.5.11
    [ ID] Interval Transfer Bandwidth
    [ 3] 0.0-10.0 sec 24.4 GBytes 21.0 Gbits/sec

    root@linux2:~# iperf -c 10.1.5.11 -P2
    [ ID] Interval Transfer Bandwidth
    [ 3] 0.0-10.0 sec 21.6 GBytes 18.6 Gbits/sec
    [ 4] 0.0-10.0 sec 21.0 GBytes 18.1 Gbits/sec
    [SUM] 0.0-10.0 sec 42.6 GBytes 36.6 Gbits/sec

    root@linux2:~# iperf -c 10.1.5.11 -P4
    [ ID] Interval Transfer Bandwidth
    [ 5] 0.0-10.0 sec 8.10 GBytes 6.96 Gbits/sec
    [ 4] 0.0-10.0 sec 16.4 GBytes 14.1 Gbits/sec
    [ 3] 0.0-10.0 sec 9.37 GBytes 8.05 Gbits/sec
    [ 6] 0.0-10.0 sec 8.89 GBytes 7.64 Gbits/sec
    [SUM] 0.0-10.0 sec 42.8 GBytes 36.7 Gbits/sec
  2. QDR IPoIB connected mode with mtu of 65520:

    root@linux2:~# iperf -c 10.1.5.11
    [ 3] 0.0-10.0 sec 28.8 GBytes 24.8 Gbits/sec

    root@linux2:~# iperf -c 10.1.5.11 -P2
    [ ID] Interval Transfer Bandwidth
    [ 4] 0.0-10.0 sec 20.5 GBytes 17.6 Gbits/sec
    [ 3] 0.0-10.0 sec 21.0 GBytes 18.1 Gbits/sec
    [SUM] 0.0-10.0 sec 41.5 GBytes 35.7 Gbits/sec

    root@linux2:~# iperf -c 10.1.5.11 -P4
    [ ID] Interval Transfer Bandwidth
    [ 6] 0.0-10.0 sec 10.5 GBytes 9.05 Gbits/sec
    [ 4] 0.0-10.0 sec 10.9 GBytes 9.32 Gbits/sec
    [ 5] 0.0-10.0 sec 11.2 GBytes 9.65 Gbits/sec
    [ 3] 0.0-10.0 sec 11.8 GBytes 10.1 Gbits/sec
    [SUM] 0.0-10.0 sec 44.4 GBytes 38.1 Gbits/sec

As you can see, multithreaded performance is close to line speed. I'm getting slightly better performance out of IPoIB than 40G Ethernet, so there's hope :D
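For completeness, connected mode and the large MTU above were set with the standard knobs (ib0 being the IPoIB interface on my boxes):

    echo connected > /sys/class/net/ib0/mode
    ip link set ib0 mtu 65520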

Finally, thanks everyone for the help. Without this thread, I might have given up on IPoIB.
 

pyite

New Member
May 15, 2013
Nice thread you have going here. I am trying to get acceptable IPoIB results, but they aren't looking so good yet. I'll post some results, and I have a couple of questions.

All machines are Dell R620s with E5-2660 CPUs and 20Gb cards. I tried connecting the test machines directly with a cable, but I get the same results through our 9024 switch. I went through the Mellanox optimizations in the PDF linked above.


Test #1: Ubuntu 12.04, QLogic 7280 HCAs:

iperf results:
1 thread: [ 3] 0.0-10.0 sec 2.75 GBytes 2.37 Gbits/sec
2 threads: [SUM] 0.0-10.0 sec 2.57 GBytes 2.21 Gbits/sec
4 threads: [SUM] 0.0-10.0 sec 2.65 GBytes 2.28 Gbits/sec

Performance degrades rapidly past -P4. This speed is around 20% of what I would expect. Here is a quick RDMA test:

# ib_rdma_bw -n 20000 10.166.1.81
.....
18907: Bandwidth peak (#3751 to #13407): 1867.03 MB/sec
18907: Bandwidth average: 1773.97 MB/sec


Test #2: el6, Mellanox MT25208 HCAs:

iperf results:

1 thread: [ 3] 0.0-10.0 sec 2.95 GBytes 2.53 Gbits/sec
2 threads: [SUM] 0.0-10.2 sec 3.52 GBytes 2.97 Gbits/sec
4 threads: [SUM] 0.0-10.5 sec 3.53 GBytes 2.90 Gbits/sec
8 threads: [SUM] 0.0-10.2 sec 5.07 GBytes 4.28 Gbits/sec
12 threads: [SUM] 0.0-10.2 sec 4.81 GBytes 4.06 Gbits/sec
16 threads: [SUM] 0.0-10.2 sec 6.12 GBytes 5.17 Gbits/sec
24 threads: [SUM] 0.0-10.0 sec 5.98 GBytes 5.13 Gbits/sec

Looks like these are more scalable, but interestingly RDMA performance isn't as good:

14046: Bandwidth peak (#0 to #13202): 1492.46 MB/sec
14046: Bandwidth average: 1492.43 MB/sec


OK time for questions:

1) I will try Ubuntu on the Mellanox cards as the next test. Are there any other suggestions? I also just got a couple of 10GbE cards; those should be fun.

2) Is there anything like netcat or iperf for RDMA? If not, in my spare time maybe I'll try to hack an RDMA option into netcat.

3) NFS-RDMA on el6 causes a weird kernel panic on the server if clients disconnect. Any idea how to deal with this one?

Thanks,
Mark
 

pyite

New Member
May 15, 2013
OK right after I posted this, I figured out the main problem with my setup. I fixed it with these:

# echo connected > /sys/class/net/ib0/mode
# ip link set ib0 mtu 65520

Now I am getting over 10 gigabit with 4 threads, though the QLogic still degrades pretty quickly; at 32 threads it is down to around 5 gigabit.
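To make that survive a reboot on the Ubuntu box, something along these lines in /etc/network/interfaces should work; the address is just a placeholder and I haven't verified this exact stanza:

    auto ib0
    iface ib0 inet static
        address 10.166.1.80
        netmask 255.255.255.0
        pre-up echo connected > /sys/class/net/ib0/mode
        post-up ip link set ib0 mtu 65520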

Awesome, thanks!!
 

mrkrad

Well-Known Member
Oct 13, 2012
What's the CPU load? BE3/Solarflare cards only load one W3520 core to about 10-15% at full-blast send/receive (10Gb), while my RK375 or XR997 seems to be sucking 50% out of four Q6600 cores to do the same work. Clearly some card designs are doing better offload than others.
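If anyone wants to grab numbers on the Linux boxes, something like this alongside the iperf run should do it (assuming sysstat is installed):

    # per-core utilization, 1-second samples
    mpstat -P ALL 1

    # CPU usage of the iperf process itself
    pidstat -p $(pidof iperf) 1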

It seems Windows doesn't do LRO, the TOE/VMQ/other settings are mutually exclusive, and who knows what's best for efficiency?
 

gnarz

New Member
Jul 25, 2013
Hello,

Nice to see that others share the same problems with IPoIB. This may be of interest, though it's not a pretty hack: I also had bad IPoIB performance, and luckily I found this post: Re: IPoIB - GRO forces memcpy inside __pskb_pull_tail. Conclusion: if you stick to a ~2K MTU you can avoid unnecessary memory allocation in the IPoIB kernel module.

In ipoib_main.c, replace

    ...
    if (!ib_query_port(hca, port, &attr))
        priv->max_ib_mtu = ib_mtu_enum_to_int(attr.max_mtu);
    else {
    ...

with

    ...
    if (!ib_query_port(hca, port, &attr))
        priv->max_ib_mtu = 3072; /* hard-code instead of using the port's maximum MTU */
    else {
    ...

At least NFS throughput increased by roughly 12% for large file transfers on our older Xeon 5520 machines.
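If anyone wants to try the same hack, rebuilding just that module should look roughly like this; treat it as a sketch, since the exact paths depend on how your distro ships the kernel source:

    # from a configured kernel source tree matching the running kernel
    cd /usr/src/linux
    make M=drivers/infiniband/ulp/ipoib modules

    # install the rebuilt module and reload it (take the IPoIB interfaces down first)
    mkdir -p /lib/modules/$(uname -r)/updates
    cp drivers/infiniband/ulp/ipoib/ib_ipoib.ko /lib/modules/$(uname -r)/updates/
    depmod -a
    modprobe -r ib_ipoib && modprobe ib_ipoib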

Best regards.
 

davidzhuang

New Member
Aug 11, 2013
I hit this issue with Mellanox CX2/CX3 cards on RHEL 6.4 too.
Check the kernel boot arguments: change "intel_iommu=on" to "iommu=off", or if you don't need SR-IOV support, just remove the parameter entirely.
It may solve the low-performance issue.
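On RHEL 6 that means the kernel line in /boot/grub/grub.conf (GRUB legacy), roughly like this; the kernel version and root device below are only examples:

    title Red Hat Enterprise Linux (2.6.32-358.el6.x86_64)
            root (hd0,0)
            kernel /vmlinuz-2.6.32-358.el6.x86_64 ro root=/dev/sda3 iommu=off
            initrd /initramfs-2.6.32-358.el6.x86_64.img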
 

Krobar

Member
Aug 25, 2012
(quoting pyite's post and questions from above)

1) What about the Mellanox OFED distro?

2) qperf does what you need and is part of OFED (it does verbs and SDP tests); see the example below.

3) Sorry, I have not used NFS-RDMA.
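For example, with qperf running bare (no arguments) on the server, from the client you can test both TCP-over-IPoIB and native RDMA in one go; the IP below is just the server's IPoIB address from the earlier tests:

    # on the server
    qperf

    # on the client: TCP bandwidth/latency over IPoIB plus RDMA read/write bandwidth
    qperf 10.166.1.81 tcp_bw tcp_lat rc_rdma_write_bw rc_rdma_read_bw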
 

zane

Member
Aug 22, 2013
This helped me out with OmniOS IPoIB tuning.
ndd -set /dev/tcp tcp_recv_hiwat 262144
ndd -set /dev/tcp tcp_xmit_hiwat 262144
ndd -set /dev/tcp tcp_max_buf 4194304
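For what it's worth, ndd settings don't survive a reboot; I believe the persistent ipadm equivalents on OmniOS are roughly these, though I haven't verified them:

    ipadm set-prop -p recv_buf=262144 tcp
    ipadm set-prop -p send_buf=262144 tcp
    ipadm set-prop -p max_buf=4194304 tcp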

Any other tips for OmniOS and IPoIB?
 

archenroot

New Member
Dec 14, 2013
Hi @all,

Thanks for this informative thread. Could somebody also post the CPU load seen while running these tests, along with the CPU type? As soon as protocol conversion is involved, CPU performance probably plays an important role too.

I would like to use two of those cards (17721-B21 HP Dual Port PCI-e QDR 4X HCA), so four QDR ports, together with a quad-core Xeon E3-1265L V2 2.5GHz/8MB/LGA1155 CPU, and I am wondering whether I should instead pick up a dual-socket board and faster CPUs (3-4 GHz). This setup would connect a 24-disk array, configured as 3x RAID 6 (8 disks per array) striped together with RAID 0, to the processing nodes.

Thank you very much for any hints regarding the required CPU.

Ladislav
 
