QDR Mellanox Performance Test Results with Varying BAR-space Size

alltheasimov · May 15, 2018

System:
Desktop (headnode): I7-5960X, GA-X99-SLI motherboard
Server (slave nodes): 6027TR-HTR (four X9DRT-HF). Only using one node for this testing. 2x E5-2667 V2
OS: CentOS 7.5 with Infiniband Support and various other packages. No custom OFED installs.
HCAs: One Sun/Oracle X4242A QDR Infiniband HCA (rebrand of Mellanox MHQH29B) in each.
Switch: Sun/Oracle QDR 36 port switch
Cables: Mellanox QDR cables connect Desktop to Switch and Server to Switch.

This is a follow-up to a question that came up in my previous thread. Essentially, Sun has two "latest" firmware versions for these cards: one that can only be used in motherboards that can handle BAR-space sizes of 128 MB or greater (2.11.2012), and one that can be used in most motherboards (2.11.2010). I want to know what performance impact (if any) the larger BAR space has.

I started with 2.11.2012 firmware cards.

It seems there are three main tools for testing InfiniBand performance: (1) the tools in the perftest package, e.g. ib_send_bw, (2) qperf, and (3) iperf. I pointed all of them at the server's IPoIB address, so IPoIB had to be up first. I did the following things beforehand, since most online guides recommend doing so:

Maximum performance and "above 4G decoding" turned on (see previous thread link) in BIOS settings
Turned off firewall
echo connected > /sys/class/net/ib0/mode
echo 65520 > /sys/class/net/ib0/mtu
ifconfig ib0 10.0.0.1 (on the headnode; 10.0.0.2 on the slave node)
Load various modules: ib_core, ib_mthca, ib_cm, ib_ipoib, ib_uverbs, ib_umad, rpcrdma, rdma_cm, rdma_ucm

I then did an ibping test.
On server: ibping -S -C mlx4_0 -P 1 -L 100 (where -L XXX is the LID and -P X is the port)
On client: ibping -c 10000 -f -C mlx4_0 -G 0x.................... (where the long entry after -G is the GUID of the server's port)

That worked. Now on to actual performance testing.

I first tried qperf:
On server: qperf
On client: qperf 10.0.0.2 ud_lat ud_bw rc_rdma_read_bw rc_rdma_write_bw uc_rdma_write_bw tcp_bw tcp_lat
That resulted in

ud_lat: 5.2 us
ud_bw: 3.12 GB/s both send and receive
rc_rdma_read_bw: 3.42 GB/s
rc_rdma_write_bw: 3.42 GB/s
uc_rdma_write_bw: send: 3.44 , receive: 3.4 GB/s
tcp_bw: 3.28 GB/s
tcp_lat: 10.5 us

I tried ib_read_bw next:
On server: ib_read_bw
On client: ib_read_bw -F -a 10.0.0.2
That resulted in somewhere around 3.28 GB/s for bytes > 8192. It threw an error about being unable to connect on the client, though...didn't really understand that. Anyways, that agrees with the previous tests.

Then ib_send_bw:
On server: ib_send_bw -a
On client: ib_send_bw -a 10.0.0.2
Similar result: 3.27 GB/s for the larger bytes. It throws conflicting cpu frequency errors without the -F flag. I doubt that's important, though.

Then I tried iperf:
On server: iperf -s -i 1
On client: iperf -c 10.0.0.2 -P X (where X is the number of threads: 1,2,4)

Single-thread performance was 24.9 Gbit/s, two threads gave 26.4 Gbit/s, and four threads gave 26.5 Gbit/s. 26.5 Gbit/s is 3.31 GB/s, so that also agrees with the previous tests. The theoretical throughput of QDR is 32 Gbit/s, or 4 GB/s, right? So that seems pretty good.
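For anyone who wants to check the unit conversion, here is a small Python sketch. The iperf figures are the ones quoted above; the breakdown of the 32 Gbit/s figure into 4 lanes at 10 Gbit/s with 8b/10b encoding is my assumption about where that number comes from, not something measured here.

```python
# Unit check for the iperf numbers above (decimal units, 8 bits per byte).

def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert Gbit/s to GB/s."""
    return gbit_per_s / 8.0

# Assumption (not stated in the post): a QDR 4x link signals at 10 Gbit/s per
# lane over 4 lanes with 8b/10b encoding, which yields a 32 Gbit/s data rate.
QDR_LANES = 4
QDR_LANE_RATE_GBIT = 10.0
ENCODING_8B10B = 0.8
qdr_data_rate_gbit = QDR_LANES * QDR_LANE_RATE_GBIT * ENCODING_8B10B

# iperf results from the 2.11.2012 firmware run, keyed by thread count.
iperf_gbit = {1: 24.9, 2: 26.4, 4: 26.5}

for threads, gbit in iperf_gbit.items():
    share = gbit / qdr_data_rate_gbit
    print(f"{threads} thread(s): {gbit} Gbit/s = {gbit_to_gbyte(gbit):.2f} GB/s "
          f"({share:.0%} of {qdr_data_rate_gbit:.0f} Gbit/s)")
```

The share printed at the end is relative to the link data rate only; it says nothing about PCIe or protocol overhead.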

Next, I switched to 2.11.2010 firmware cards, went through the setup steps and ibping again, and repeated some of the tests:
qperf:

ud_lat: 5.4 us
ud_bw: 3.13 GB/s both send and receive
rc_rdma_read_bw: 3.44 GB/s
rc_rdma_write_bw: 3.43 GB/s
uc_rdma_write_bw: send: 3.44 , receive: 3.41 GB/s
tcp_bw: 3.24 GB/s
tcp_lat: 10.3 us

iperf: Single thread performance was 24.7 Gbit/s, two threads 26.4 Gbit/s, four threads 26.4 Gbit/s.

Conclusion: The larger BAR-space had no effect on performance, at least for my system.
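To put a number on "no effect", the two qperf runs can be compared metric by metric. This is only a sketch: the values are copied from the results above, the dictionary keys are names I made up for illustration, and a single run per firmware says nothing about run-to-run noise.

```python
# qperf results copied from the two runs above.
# Bandwidths are in GB/s, latencies in microseconds.
fw_2012 = {
    "ud_lat_us": 5.2,
    "ud_bw": 3.12,
    "rc_rdma_read_bw": 3.42,
    "rc_rdma_write_bw": 3.42,
    "uc_rdma_write_send_bw": 3.44,
    "uc_rdma_write_recv_bw": 3.40,
    "tcp_bw": 3.28,
    "tcp_lat_us": 10.5,
}
fw_2010 = {
    "ud_lat_us": 5.4,
    "ud_bw": 3.13,
    "rc_rdma_read_bw": 3.44,
    "rc_rdma_write_bw": 3.43,
    "uc_rdma_write_send_bw": 3.44,
    "uc_rdma_write_recv_bw": 3.41,
    "tcp_bw": 3.24,
    "tcp_lat_us": 10.3,
}

def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0

# Change when going from the 2.11.2012 firmware to the 2.11.2010 firmware.
changes = {name: percent_change(fw_2012[name], fw_2010[name]) for name in fw_2012}

for name, pct in changes.items():
    print(f"{name:22s} {fw_2012[name]:6.2f} -> {fw_2010[name]:6.2f}  ({pct:+.1f}%)")

largest = max(changes, key=lambda name: abs(changes[name]))
print(f"largest change: {largest} ({changes[largest]:+.1f}%)")
```

Every metric moves by less than 4% between the two firmware versions, and the bandwidth figures by less than 1.5%.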

Questions:

Is ~3.3 GB/s for a QDR system "good" performance? If not, what is a reasonable target and what can I do to improve it?
Anyone know anything about this BAR-space stuff with relation to Mellanox HCAs?
What is the difference between "datagram" and "connected" in the ib0 mode?
What is the MTU value and what does it do?

Thanks!

alltheasimov · May 17, 2018

Answer to question 1 here. So the answer is yes. In fact, 25.6 Gbit/s is about the highest bandwidth to expect from QDR. The key is this: PCI_LANES(8) * PCI_SPEED(5) * PCI_ENCODING(0.8) = 32 Gbit/s (that's where that number comes from). 32 * PCI_HEADERS(128/152) * PCI_FLOW_CONT(0.95) = 25.6 Gbit/s. My system is right around that, so that's good. Interesting note: FDR bumps up the encoding ratio to about 0.97, and FDR10 uses the newer FDR encoding at roughly the old QDR speed.
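The arithmetic above can be reproduced with a few lines of Python. The factor names are the ones used in the formula; the comments on what each factor represents are my reading of it (in particular, I am assuming 128/152 means a 128-byte payload out of 152 bytes on the bus), so treat them as assumptions.

```python
# Factors from the formula above.
PCI_LANES = 8            # x8 slot
PCI_SPEED = 5.0          # GT/s per lane
PCI_ENCODING = 0.8       # 8b/10b encoding
PCI_HEADERS = 128 / 152  # assumed: 128 payload bytes per 152 bytes transferred
PCI_FLOW_CONT = 0.95     # flow-control overhead factor

# Data rate of the slot after encoding, in Gbit/s.
raw_gbit = PCI_LANES * PCI_SPEED * PCI_ENCODING

# Expected usable bandwidth after header and flow-control overhead.
expected_gbit = raw_gbit * PCI_HEADERS * PCI_FLOW_CONT

# Best iperf result from the first post (four threads).
measured_gbit = 26.5

print(f"raw:      {raw_gbit:.1f} Gbit/s")
print(f"expected: {expected_gbit:.1f} Gbit/s")
print(f"measured: {measured_gbit:.1f} Gbit/s "
      f"({measured_gbit / expected_gbit:.1%} of expected)")
```

The measured figure lands slightly above the estimate, which suggests the overhead factors are rough rather than exact limits.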

Question 2: I looked around a bit, but couldn't find much. In this link, I found the following quote:

If the “SriovSupport” field value shows “NoVfBarSpace”, SR-IOV cannot be used on this network adapter as there are not enough PCI Express BAR resources available. To use SR-IOV, you need to reduce the number of VFs to the number supported by the OS

SRIOV is "Single-Root I/O Virtualization".

SR-IOV enables network traffic to by-pass the software switch layer of the Hyper-V virtualization stack. As a result, the I/O overhead in the software emulation layer is diminished and can achieve network performance that is nearly the same performance as in non-virtualized environments

Maybe increasing the BAR-space size allows you to use more virtual functions (VFs), which are some sort of PCIe functions? That would explain why performance didn't change...doesn't affect bandwidth. I'm way out of my league here, though.

Answer to question 3 from here.

Datagram vs Connected modes

The IPoIB driver supports two modes of operation: datagram and
connected. The mode is set and read through an interface's
/sys/class/net/<intf name>/mode file.

In datagram mode, the IB UD (Unreliable Datagram) transport is used
and so the interface MTU is equal to the IB L2 MTU minus the
IPoIB encapsulation header (4 bytes). For example, in a typical IB
fabric with a 2K MTU, the IPoIB MTU will be 2048 - 4 = 2044 bytes.

In connected mode, the IB RC (Reliable Connected) transport is used.
Connected mode takes advantage of the connected nature of the IB
transport and allows an MTU up to the maximal IP packet size of 64K,
which reduces the number of IP packets needed for handling large UDP
datagrams, TCP segments, etc and increases the performance for large
messages.

In connected mode, the interface's UD QP is still used for multicast
and communication with peers that don't support connected mode. In
this case, RX emulation of ICMP PMTU packets is used to cause the
networking stack to use the smaller UD MTU for these neighbours.

So if you use IPoIB, it sounds like connected mode with a high MTU is the way to go.
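Here is a rough Python sketch of the MTU arithmetic from the quoted text. The 4-byte header and the 2K example come from the quote, and 65520 is the MTU I set earlier; the packet count is a simplification of my own that ignores the IP/TCP/UDP headers inside each packet.

```python
import math

IPOIB_HEADER_BYTES = 4  # IPoIB encapsulation header, per the quoted text

def datagram_mtu(ib_mtu_bytes: int) -> int:
    """IPoIB MTU in datagram mode: IB L2 MTU minus the encapsulation header."""
    return ib_mtu_bytes - IPOIB_HEADER_BYTES

def packets_needed(message_bytes: int, mtu_bytes: int) -> int:
    """Packets needed to carry a message (simplified: no IP/TCP/UDP headers)."""
    return math.ceil(message_bytes / mtu_bytes)

CONNECTED_MTU = 65520         # value written to /sys/class/net/ib0/mtu above
ud_mtu = datagram_mtu(2048)   # the 2K fabric example from the quote
message_bytes = 65520         # one large message, chosen for illustration

print(f"datagram-mode MTU: {ud_mtu} bytes")
print(f"datagram mode:  {packets_needed(message_bytes, ud_mtu)} packets")
print(f"connected mode: {packets_needed(message_bytes, CONNECTED_MTU)} packet(s)")
```

Fewer, larger packets mean less per-packet work for the IP stack, which is the benefit the quoted text attributes to connected mode.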

However, in an attempt to answer my question 4, I found these links: 1, 2 (page 19), 3, which seem to suggest lower MTU values. I'm honestly way out of my league with this network engineering stuff.

If anyone is an expert, please chime in. Since I'm not going to be using IPoIB, I think I'll just leave it at that.

rysti32 · May 17, 2018

alltheasimov said:
Question 2: I looked around a bit, but couldn't find much. In this link, I found the following quote: SRIOV is "Single-Root I/O Virtualization". Maybe increasing the BAR-space size allows you to use more virtual functions (VFs), which are some sort of PCIe functions? That would explain why performance didn't change...doesn't affect bandwidth. I'm way out of my league here, though.

SR-IOV allows you to create multiple virtual PCI devices (called VFs) that share the functionality of the physical device. It's typically used for server virtualization. In the case of a networking device, when a VM needs to communicate over the network, it typically has to use a virtual network implemented in the host software (e.g. VMware). The host software is actually acting as a little switch for all of its VMs. For VMs that do a lot of network communication, there can be a significant performance cost to doing switching in software.

SR-IOV fixes this by allowing you to create multiple virtual NICs that have direct access to the physical link on the real NIC. The switching between virtual NICs is performed in the NIC hardware, so performance is much better.

The thing is that each virtual NIC needs BAR space on the host system. If the NIC's BAR space is limited, you'd be pretty limited in the number of virtual NICs that you could create. Beyond that, it's not going to matter.

So in summary, unless you're doing some really hardcore optimization of a virtualization setup, you won't care.
