QDR Mellanox Performance Test Results with Varying BAR-space Size

Discussion in 'Networking' started by alltheasimov, May 15, 2018.

  1. alltheasimov

    alltheasimov Member

    Joined:
    Feb 17, 2018
    Messages:
    48
    Likes Received:
    7
    System:
    Desktop (headnode): i7-5960X, GA-X99-SLI motherboard
    Server (slave nodes): 6027TR-HTR (four X9DRT-HF); only using one node for this testing. 2x E5-2667 v2
    OS: CentOS 7.5 with Infiniband Support and various other packages. No custom OFED installs.
    HCAs: One Sun/Oracle X4242A QDR Infiniband HCA (rebrand of Mellanox MHQH29B) in each.
    Switch: Sun/Oracle QDR 36 port switch
    Cables: Mellanox QDR cables connect Desktop to Switch and Server to Switch.

    This is a follow-up to a question that came up in my previous thread. Essentially, Sun has two "latest" firmware versions for these cards: one that can only be used in motherboards that can handle BAR-space sizes of 128 MB or greater (2.11.2012), and one that can be used in most motherboards (2.11.2010). I want to know what the performance impact (if any) of the increased BAR space is.

    I started with 2.11.2012 firmware cards.

    It seems there are three main tools for testing InfiniBand performance: 1. the scripts in the PerfTest package, e.g. ib_send_bw; 2. qperf; 3. iperf. All of these need an IP address (here via IPoIB), at least for connection setup. I did the following things first, since most online guides recommend doing so:
    1. Maximum performance and "above 4G decoding" turned on (see previous thread link) in BIOS settings
    2. Turned off firewall
    3. echo connected > /sys/class/net/ib0/mode
    4. echo 65520 > /sys/class/net/ib0/mtu
    5. ifconfig ib0 10.0.0.1 (on headnode. 10.0.0.2 for slave node).
    6. Load various modules: ib_core, ib_mthca, ib_cm, ib_ipoib, ib_uverbs, ib_umad, rpcrdma, rdma_cm, rdma_ucm
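    For reference, the setup steps above can be collected into one script (a sketch matching my setup: ib0 as the interface name, 10.0.0.1 on the headnode; adjust for yours):

    ```shell
    #!/bin/sh
    # IPoIB setup sketch for the headnode (use 10.0.0.2 on the slave node).

    systemctl stop firewalld                   # step 2: firewall off

    # step 6: load the InfiniBand / RDMA modules
    for m in ib_core ib_mthca ib_cm ib_ipoib ib_uverbs ib_umad rpcrdma rdma_cm rdma_ucm; do
        modprobe "$m"
    done

    echo connected > /sys/class/net/ib0/mode   # step 3: connected mode
    echo 65520     > /sys/class/net/ib0/mtu    # step 4: large MTU
    ifconfig ib0 10.0.0.1                      # step 5: assign IPoIB address
    ```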
    I then did an ibping test.
    On server: ibping -S -C mlx4_0 -P 1 -L 100 (where -L XXX is the LID and -P X is the port)
    On client: ibping -c 10000 -f -C mlx4_0 -G 0x.................... (where the long entry after -G is the GUID of the server's port)

    That worked. Now on to actual performance testing.

    I first tried qperf:
    On server: qperf
    On client: qperf 10.0.0.2 ud_lat ud_bw rc_rdma_read_bw rc_rdma_write_bw uc_rdma_write_bw tcp_bw tcp_lat
    That resulted in
    • ud_lat: 5.2 us
    • ud_bw: 3.12 GB/s both send and receive
    • rc_rdma_read_bw: 3.42 GB/s
    • rc_rdma_write_bw: 3.42 GB/s
    • uc_rdma_write_bw: send: 3.44 GB/s, receive: 3.4 GB/s
    • tcp_bw: 3.28 GB/s
    • tcp_lat: 10.5 us
    I tried ib_read_bw next:
    On server: ib_read_bw
    On client: ib_read_bw -F -a 10.0.0.2
    That resulted in somewhere around 3.28 GB/s for message sizes > 8192 bytes. It threw an error on the client about being unable to connect, though; I didn't really understand that. Anyway, that agrees with the previous tests.

    Then ib_send_bw:
    On server: ib_send_bw -a
    On client: ib_send_bw -a 10.0.0.2
    Similar result: 3.27 GB/s for the larger message sizes. Without the -F flag it throws warnings about conflicting CPU frequencies (cpufreq scaling); I doubt that's important, though.

    Then I tried iperf:
    On server: iperf -s -i 1
    On client: iperf -c 10.0.0.2 -P X (where X is the number of threads: 1,2,4)

    Single-thread performance was 24.9 Gbit/s, two threads 26.4 Gbit/s, four threads 26.5 Gbit/s. 26.5 Gbit/s is 3.31 GB/s, so that also agrees with the previous tests. The theoretical throughput of QDR is 32 Gbit/s, or 4 GB/s, right? So that seems pretty good.


    Next, I switched to 2.11.2010 firmware cards, went through the setup steps and ibping again, and repeated some of the tests:
    qperf:
    • ud_lat: 5.4 us
    • ud_bw: 3.13 GB/s both send and receive
    • rc_rdma_read_bw: 3.44 GB/s
    • rc_rdma_write_bw: 3.43 GB/s
    • uc_rdma_write_bw: send: 3.44 GB/s, receive: 3.41 GB/s
    • tcp_bw: 3.24 GB/s
    • tcp_lat: 10.3 us
    iperf: Single thread performance was 24.7 Gbit/s, two threads 26.4 Gbit/s, four threads 26.4 Gbit/s.


    Conclusion: The larger BAR-space had no effect on performance, at least for my system.

    Questions:
    1. Is ~3.3 GB/s for a QDR system "good" performance? If not, what is a reasonable target and what can I do to improve it?
    2. Anyone know anything about this BAR-space stuff with relation to Mellanox HCAs?
    3. What is the difference between "datagram" and "connected" in the ib0 mode?
    4. What is the MTU value and what does it do?
    Thanks!
     
  2. alltheasimov

    Answer to question 1 here. So the answer is yes; in fact ~25.6 Gbit/s is the fastest expected bandwidth for QDR through a PCIe Gen2 x8 slot. The key is this: PCI_LANES(8) * PCI_SPEED(5 GT/s) * PCI_ENCODING(0.8, for 8b/10b) = 32 Gbit/s (that's where that number comes from). Then 32 * PCI_HEADERS(128/152) * PCI_FLOW_CONT(0.95) = 25.6 Gbit/s. My system is right around that, so that's good. Interesting note: FDR bumps the encoding ratio up to about 0.97, and FDR10 uses the old QDR encoding with the faster FDR speed.
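    The arithmetic above can be checked directly (numbers straight from the formula; 128/152 is the payload fraction for 128-byte PCIe TLPs):

    ```shell
    # Raw PCIe Gen2 x8 rate: 8 lanes x 5 GT/s x 0.8 (8b/10b encoding)
    raw=$(awk 'BEGIN { printf "%.0f", 8 * 5 * 0.8 }')
    echo "raw: $raw Gbit/s"                                      # 32

    # Effective rate after TLP header (128/152) and flow-control (0.95) overhead
    eff=$(awk 'BEGIN { printf "%.1f", 32 * (128/152) * 0.95 }')
    echo "effective: $eff Gbit/s"                                # 25.6
    ```

    25.6 Gbit/s is 3.2 GB/s, which lines up with the ~3.3 GB/s the RDMA tests measured.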

    Question 2: I looked around a bit, but couldn't find much. In this link, I found the following quote:
    SRIOV is "Single-Root I/O Virtualization".
    Maybe increasing the BAR-space size allows you to use more virtual functions (VFs), which are a kind of PCIe function? That would explain why performance didn't change: it doesn't affect bandwidth. I'm way out of my league here, though.

    Answer to question 3 from here.
    So if you use IPoIB, it sounds like connected mode with a high MTU is the way to go.
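    The mode and MTU can be checked and switched from sysfs (a sketch; note 65520 is only valid in connected mode, while datagram mode is capped near the IB link MTU, typically 2044 or 4092):

    ```shell
    cat /sys/class/net/ib0/mode                # prints "datagram" or "connected"
    cat /sys/class/net/ib0/mtu                 # current IPoIB MTU

    echo connected > /sys/class/net/ib0/mode   # connected mode allows the big MTU
    echo 65520     > /sys/class/net/ib0/mtu
    ```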

    However, in an attempt to answer my question 4, I found these links: 1, 2 (page 19), 3, which seem to suggest lower MTU values. I'm honestly way out of my league with this network engineering stuff.

    If anyone is an expert, please chime in. Since I'm not going to be using IPoIB, I think I'll just leave it at that.
     
  3. rysti32

    SR-IOV allows you to create multiple virtual PCI devices (called VFs) that share the functionality of the physical device. It's typically used for server virtualization. In the case of a networking device, when a VM needs to communicate over the network, it typically has to use a virtual network implemented in the host software (e.g. VMware); the host software acts as a little switch for all of its VMs. For VMs that do a lot of network communication, there can be a significant performance cost to doing that switching in software.

    SR-IOV fixes this by allowing you to create multiple virtual NICs that have direct access to the physical link on the real NIC. The switching between virtual NICs is performed in the NIC hardware, so performance is much better.

    The thing is that each virtual NIC needs BAR space on the host system. If the NIC's BAR space is limited, you'd be pretty limited in the number of virtual NICs that you could create. Beyond that, it's not going to matter.

    So in summary, unless you're doing some really hardcore optimization of a virtualization set up, you won't care.
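    For anyone curious, the BARs and the VF mechanism can be poked at from the shell. A sketch, assuming the HCA sits at PCI address 05:00.0 (an example; find yours with lspci) and using the mlx4-era module parameter for VFs:

    ```shell
    # Show the HCA's BARs; the firmware difference should appear as a
    # larger "[size=...]" memory region on the 2.11.2012 cards.
    lspci -s 05:00.0 -v | grep -i 'memory at'

    # On mlx4 (ConnectX-2/3) hardware, VFs are enabled via a module parameter:
    modprobe -r mlx4_core
    modprobe mlx4_core num_vfs=4 probe_vf=0

    lspci | grep -i mellanox   # VFs show up as extra "Virtual Function" devices
    ```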
     
