InfiniBand IPoIB performance problems?


dba

Moderator
Feb 20, 2012
San Francisco Bay Area, California, USA
...

dba, Did you do any Windows SMB 3/RDMA performance testing with your equipment? If so, what were the results?

Tom
Not much, at least not yet. I ran a quick sqlio test against an SMB3 shared ramdisk and saw only around 600MB/s for sequential transfers. Disappointing, but I can't come to any conclusions since I really didn't do anything but run five minutes' worth of testing with no tuning. I didn't even confirm that RDMA was being used.
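For next time, the quick sanity checks on the Windows side would be something like this in PowerShell (standard Server 2012 cmdlets; I haven't gone back and run them against this setup yet):

    # Does the NIC/driver advertise RDMA at all?
    Get-NetAdapterRdma

    # Client-side view of which interfaces SMB considers RDMA-capable
    Get-SmbClientNetworkInterface

    # With traffic running against the share, check the active SMB connections for RDMA capability
    Get-SmbMultichannelConnection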
 

Netwizard

New Member
May 8, 2013
Here's some hope

Hello from a first time poster. I stumbled across this thread while trying to tune IPoIB myself. I'm getting better numbers, so here's some hope for the original poster.

I bought a couple of FDR InfiniBand + 40GigE Mellanox cards, MCX354A-FCBT, on eBay and installed them in my home servers. We're talking consumer hardware: i7-3770 CPUs and Gigabyte GA-Z77X-D3H motherboards. I got Mellanox QDR cables, but ibstat says "Rate: 40 (FDR10)", so I'm guessing I'm running at FDR10 (40 Gbit/s) rather than full FDR. One of the boxes runs Ubuntu 12.10 and the other 13.04, as I'm in the process of upgrading. The following numbers are without any kernel tweaks, and all of them reproduce consistently. The drivers are whatever Ubuntu supplies with "apt-get install libmlx4-1 opensm". (How do you check which version?)
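In case it helps anyone, these are the commands I'd expect to show the driver and firmware versions; I'm not claiming they're the canonical way:

    # driver and firmware version via the Ethernet-mode interface (interface name will differ)
    ethtool -i eth2

    # firmware version and board ID as reported by the verbs stack
    ibv_devinfo | grep -E 'fw_ver|board_id'

    # version of the in-tree mlx4 kernel module
    modinfo mlx4_core | grep -i ^version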

  1. 40G Ethernet mode:

    root@linux2:~# iperf -c 10.1.5.11
    [ ID] Interval Transfer Bandwidth
    [ 3] 0.0-10.0 sec 24.4 GBytes 21.0 Gbits/sec

    root@linux2:~# iperf -c 10.1.5.11 -P2
    [ ID] Interval Transfer Bandwidth
    [ 3] 0.0-10.0 sec 21.6 GBytes 18.6 Gbits/sec
    [ 4] 0.0-10.0 sec 21.0 GBytes 18.1 Gbits/sec
    [SUM] 0.0-10.0 sec 42.6 GBytes 36.6 Gbits/sec

    root@linux2:~# iperf -c 10.1.5.11 -P4
    [ ID] Interval Transfer Bandwidth
    [ 5] 0.0-10.0 sec 8.10 GBytes 6.96 Gbits/sec
    [ 4] 0.0-10.0 sec 16.4 GBytes 14.1 Gbits/sec
    [ 3] 0.0-10.0 sec 9.37 GBytes 8.05 Gbits/sec
    [ 6] 0.0-10.0 sec 8.89 GBytes 7.64 Gbits/sec
    [SUM] 0.0-10.0 sec 42.8 GBytes 36.7 Gbits/sec
  2. QDR IPoIB connected mode with mtu of 65520:

    root@linux2:~# iperf -c 10.1.5.11
    [ 3] 0.0-10.0 sec 28.8 GBytes 24.8 Gbits/sec

    root@linux2:~# iperf -c 10.1.5.11 -P2
    [ ID] Interval Transfer Bandwidth
    [ 4] 0.0-10.0 sec 20.5 GBytes 17.6 Gbits/sec
    [ 3] 0.0-10.0 sec 21.0 GBytes 18.1 Gbits/sec
    [SUM] 0.0-10.0 sec 41.5 GBytes 35.7 Gbits/sec

    root@linux2:~# iperf -c 10.1.5.11 -P4
    [ ID] Interval Transfer Bandwidth
    [ 6] 0.0-10.0 sec 10.5 GBytes 9.05 Gbits/sec
    [ 4] 0.0-10.0 sec 10.9 GBytes 9.32 Gbits/sec
    [ 5] 0.0-10.0 sec 11.2 GBytes 9.65 Gbits/sec
    [ 3] 0.0-10.0 sec 11.8 GBytes 10.1 Gbits/sec
    [SUM] 0.0-10.0 sec 44.4 GBytes 38.1 Gbits/sec

As you can see, multithreaded performance is close to line speed. I'm getting slightly better performance out of IPoIB than 40G Ethernet, so there's hope :D
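For completeness, connected mode and the large MTU above were set with the standard knobs (ib0 being the IPoIB interface on my boxes):

    echo connected > /sys/class/net/ib0/mode
    ip link set ib0 mtu 65520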

Finally, thanks everyone for the help. Without this thread, I might have given up on IPoIB.
 

pyite

New Member
May 15, 2013
Nice thread you have going here. I am trying to get acceptable IPoIB results, but they aren't looking so good yet. I'll post some results, and I have a couple of questions.

All machines are Dell R620s with E5-2660 CPUs and 20Gb cards. I tried connecting the test machines directly with a cable, but I get the same results through our 9024 switch. I went through the Mellanox optimizations in the PDF linked above.


Test #1: Ubuntu 12.04, QLogic 7280 HCAs:

iperf results:
1 thread: [ 3] 0.0-10.0 sec 2.75 GBytes 2.37 Gbits/sec
2 threads: [SUM] 0.0-10.0 sec 2.57 GBytes 2.21 Gbits/sec
4 threads: [SUM] 0.0-10.0 sec 2.65 GBytes 2.28 Gbits/sec

Performance degrades rapidly past -P4. This speed is around 20% of what I would expect. Here is a quick RDMA test:

# ib_rdma_bw -n 20000 10.166.1.81
.....
18907: Bandwidth peak (#3751 to #13407): 1867.03 MB/sec
18907: Bandwidth average: 1773.97 MB/sec


Test #2: el6, Mellanox MT25208 HCAs:

iperf results:

1 thread: [ 3] 0.0-10.0 sec 2.95 GBytes 2.53 Gbits/sec
2 threads: [SUM] 0.0-10.2 sec 3.52 GBytes 2.97 Gbits/sec
4 threads: [SUM] 0.0-10.5 sec 3.53 GBytes 2.90 Gbits/sec
8 threads: [SUM] 0.0-10.2 sec 5.07 GBytes 4.28 Gbits/sec
12 threads: [SUM] 0.0-10.2 sec 4.81 GBytes 4.06 Gbits/sec
16 threads: [SUM] 0.0-10.2 sec 6.12 GBytes 5.17 Gbits/sec
24 threads: [SUM] 0.0-10.0 sec 5.98 GBytes 5.13 Gbits/sec

Looks like these are more scalable, but interestingly RDMA performance isn't as good:

14046: Bandwidth peak (#0 to #13202): 1492.46 MB/sec
14046: Bandwidth average: 1492.43 MB/sec


OK time for questions:

1) I will try Ubuntu on the Mellanox cards as the next test. Are there any other suggestions? I also just got a couple of 10GbE cards; those should be fun.

2) Is there anything like netcat or iperf for RDMA? If not, in my spare time maybe I'll try to hack an RDMA option into netcat.

3) NFS-RDMA on el6 causes a weird kernel panic on the server if clients disconnect. Any idea how to deal with this one?

Thanks,
Mark
 

pyite

New Member
May 15, 2013
OK right after I posted this, I figured out the main problem with my setup. I fixed it with these:

# echo connected > /sys/class/net/ib0/mode
# ip link set ib0 mtu 65520

Now I am getting over 10 gigabit with 4 threads, though the QLogic still degrades pretty quickly; at 32 threads it is down to around 5 gigabit.
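To make that survive a reboot on the Ubuntu box, something along these lines in /etc/network/interfaces should work; the address is just a placeholder and I haven't verified this exact stanza:

    auto ib0
    iface ib0 inet static
        address 10.166.1.80
        netmask 255.255.255.0
        pre-up echo connected > /sys/class/net/ib0/mode
        post-up ip link set ib0 mtu 65520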

Awesome, thanks!!
 

mrkrad

Well-Known Member
Oct 13, 2012
What's the CPU load? BE3/Solarflare cards only load one W3520 core to about 10-15% at full-blast send/receive (10Gb), while my RK375 or XR997 seems to be sucking 50% out of four Q6600 cores to do the same work. Clearly some card designs are doing better offload than others.
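If anyone wants to grab numbers on the Linux boxes, something like this alongside the iperf run should do it (assuming sysstat is installed):

    # per-core utilization, 1-second samples
    mpstat -P ALL 1

    # CPU usage of the iperf process itself
    pidstat -p $(pidof iperf) 1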

It seems Windows doesn't do LRO, the TOE/VMQ/other settings are mutually exclusive, and who knows what's best for efficiency?
 

gnarz

New Member
Jul 25, 2013
Hello,

Nice to see that others share the same problems with IPoIB. This may be of interest, though it's not a pretty hack: I also had bad IPoIB performance, and luckily I found this post: Re: IPoIB - GRO forces memcpy inside __pskb_pull_tail. Conclusion: if you stick to a ~2K MTU you can avoid unnecessary memory allocation in the IPoIB kernel module.

In ipoib_main.c, replace

    ...
    if (!ib_query_port(hca, port, &attr))
        priv->max_ib_mtu = ib_mtu_enum_to_int(attr.max_mtu);
    else {
    ...

with

    ...
    if (!ib_query_port(hca, port, &attr))
        priv->max_ib_mtu = 3072; /* hard-code instead of using the port's maximum MTU */
    else {
    ...

At least NFS throughput increased by roughly 12% for large file transfers on our older Xeon 5520 machines.
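If anyone wants to try the same hack, rebuilding just that module should look roughly like this; treat it as a sketch, since the exact paths depend on how your distro ships the kernel source:

    # from a configured kernel source tree matching the running kernel
    cd /usr/src/linux
    make M=drivers/infiniband/ulp/ipoib modules

    # install the rebuilt module and reload it (take the IPoIB interfaces down first)
    mkdir -p /lib/modules/$(uname -r)/updates
    cp drivers/infiniband/ulp/ipoib/ib_ipoib.ko /lib/modules/$(uname -r)/updates/
    depmod -a
    modprobe -r ib_ipoib && modprobe ib_ipoib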

Best regards.
 

davidzhuang

New Member
Aug 11, 2013
I hit this issue with Mellanox CX2/CX3 cards on RHEL 6.4 too.
Check the kernel boot arguments: change "intel_iommu=on" to "iommu=off", or if you don't need SR-IOV support, just remove the parameter entirely.
It may solve the low-performance issue.
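On RHEL 6 that means the kernel line in /boot/grub/grub.conf (GRUB legacy), roughly like this; the kernel version and root device below are only examples:

    title Red Hat Enterprise Linux (2.6.32-358.el6.x86_64)
            root (hd0,0)
            kernel /vmlinuz-2.6.32-358.el6.x86_64 ro root=/dev/sda3 iommu=off
            initrd /initramfs-2.6.32-358.el6.x86_64.img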
 

Krobar

Member
Aug 25, 2012
(quoting pyite's post and questions from above)

1) What about the Mellanox OFED distro?

2) qperf does what you need and is part of OFED (it does verbs and SDP tests); see the example below.

3) Sorry, I have not used NFS-RDMA.
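For example, with qperf running bare (no arguments) on the server, from the client you can test both TCP-over-IPoIB and native RDMA in one go; the IP below is just the server's IPoIB address from the earlier tests:

    # on the server
    qperf

    # on the client: TCP bandwidth/latency over IPoIB plus RDMA read/write bandwidth
    qperf 10.166.1.81 tcp_bw tcp_lat rc_rdma_write_bw rc_rdma_read_bw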
 

zane

Member
Aug 22, 2013
This helped me out with OmniOS IPoIB tuning.
ndd -set /dev/tcp tcp_recv_hiwat 262144
ndd -set /dev/tcp tcp_xmit_hiwat 262144
ndd -set /dev/tcp tcp_max_buf 4194304
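For what it's worth, ndd settings don't survive a reboot; I believe the persistent ipadm equivalents on OmniOS are roughly these, though I haven't verified them:

    ipadm set-prop -p recv_buf=262144 tcp
    ipadm set-prop -p send_buf=262144 tcp
    ipadm set-prop -p max_buf=4194304 tcp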

Any other tips for OmniOS and IPoIB?
 

archenroot

New Member
Dec 14, 2013
Hi @all,

Thanks for this informative thread. Could somebody also post the CPU load seen while running these tests, along with the CPU type? As soon as protocol conversion is involved, CPU performance probably plays an important role too.

I would like to use two of those cards (17721-B21 HP Dual Port PCI-e QDR 4X HCA), so four QDR ports, together with a quad-core Xeon E3-1265L V2 2.5GHz/8MB/LGA1155 CPU, and I am wondering whether I should instead pick up a dual-socket board and faster CPUs (3-4 GHz). This setup would connect a 24-disk array, configured as 3x RAID 6 (8 disks per array) striped together with RAID 0, to the processing nodes.

Thank you very much for any hints regarding the required CPU.

Ladislav
 
