System:
Desktop (headnode): i7-5960X, GA-X99-SLI motherboard
Server (slave nodes): Supermicro 6027TR-HTR (four X9DRT-HF nodes); only one node used for this testing, with 2x E5-2667 v2
OS: CentOS 7.5 with InfiniBand support and various other packages. No custom OFED installs.
HCAs: One Sun/Oracle X4242A QDR InfiniBand HCA (a rebrand of the Mellanox MHQH29B) in each.
Switch: Sun/Oracle QDR 36 port switch
Cables: Mellanox QDR cables connect Desktop to Switch and Server to Switch.
This is a follow-up to a question that came up in my previous thread. Essentially, Sun has two "latest" firmware versions for these cards: one that can only be used on motherboards able to handle BAR-space sizes of 128 MB or greater (2.11.2012), and one that can be used on most motherboards (2.11.2010). I want to know what the performance impact (if any) of the increased BAR space is.
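For anyone who wants to check what their board actually assigned, the BAR sizes the HCA was given can be read out of lspci. A quick sketch (the 03:00.0 bus address is just a placeholder for wherever the card shows up on your system):

lspci | grep -i mellanox                  # find the HCA's PCI address
lspci -s 03:00.0 -vv | grep -i region     # list its memory regions, i.e. the BAR sizes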
I started with 2.11.2012 firmware cards.
It seems there are three main tools for testing InfiniBand performance: 1. the tools in the perftest package, e.g. ib_send_bw, 2. qperf, 3. iperf. All of these need IPoIB set up (they take an IP address to establish the test connection). I did the following things first, since most online guides recommend doing so (a sketch for making these settings persistent follows the list):
- Enabled "Maximum Performance" and "Above 4G Decoding" in the BIOS settings (see previous thread link)
- Turned off the firewall
- echo connected > /sys/class/net/ib0/mode
- echo 65520 > /sys/class/net/ib0/mtu
- ifconfig ib0 10.0.0.1 (on the headnode; 10.0.0.2 on the slave node)
- Loaded various modules: ib_core, ib_mthca, ib_cm, ib_ipoib, ib_uverbs, ib_umad, rpcrdma, rdma_cm, rdma_ucm
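Those echo/ifconfig settings don't survive a reboot. A rough sketch of making them persistent with the stock CentOS 7 ifcfg scripts (I haven't verified this exact file on my setup, so treat the keys as an assumption) would be an /etc/sysconfig/network-scripts/ifcfg-ib0 on the headnode like:

# Hypothetical ifcfg-ib0 for the headnode (use 10.0.0.2 on the slave node)
DEVICE=ib0
TYPE=InfiniBand
ONBOOT=yes
BOOTPROTO=none
IPADDR=10.0.0.1
PREFIX=24
CONNECTED_MODE=yes   # same as echoing "connected" into /sys/class/net/ib0/mode
MTU=65520            # same as echoing 65520 into /sys/class/net/ib0/mtu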
On server: ibping -S -C mlx4_0 -P 1 -L 100 (where -L XXX is the LID and -P X is the port)
On client: ibping -c 10000 -f -C mlx4_0 -G 0x.................... (where the long entry after -G is the GUID of the server's port)
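Side note for anyone repeating this: the LID and port GUID that ibping wants, plus the negotiated link rate, can be read from ibstat, e.g.

ibstat mlx4_0 1   # prints State, Base lid, Rate (40 for QDR) and Port GUID for port 1

That's also a quick way to confirm the link actually came up at QDR before benchmarking.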
The ibping test worked. Now on to actual performance testing.
I first tried qperf:
On server: qperf
On client: qperf 10.0.0.2 ud_lat ud_bw rc_rdma_read_bw rc_rdma_write_bw uc_rdma_write_bw tcp_bw tcp_lat
That resulted in:
- ud_lat: 5.2 us
- ud_bw: 3.12 GB/s both send and receive
- rc_rdma_read_bw: 3.42 GB/s
- rc_rdma_write_bw: 3.42 GB/s
- uc_rdma_write_bw: send: 3.44 GB/s, receive: 3.4 GB/s
- tcp_bw: 3.28 GB/s
- tcp_lat: 10.5 us
Then I tried ib_read_bw:
On server: ib_read_bw
On client: ib_read_bw -F -a 10.0.0.2
That gave somewhere around 3.28 GB/s for message sizes above 8192 bytes. The client did throw an error about being unable to connect, which I didn't really understand, but anyway, that agrees with the previous tests.
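With only one HCA per box it doesn't matter here, but if a machine had multiple devices or ports, perftest can pin them explicitly; something like this should work (a sketch using the flags my perftest build documents):

On server: ib_read_bw -d mlx4_0 -i 1 -F -a --report_gbits
On client: ib_read_bw -d mlx4_0 -i 1 -F -a --report_gbits 10.0.0.2

--report_gbits just prints the results in Gbit/s instead of MB/s, which makes comparing against iperf easier.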
Then ib_send_bw:
On server: ib_send_bw -a
On client: ib_send_bw -a 10.0.0.2
Similar result: 3.27 GB/s for the larger message sizes. Without the -F flag it throws errors about conflicting CPU frequencies; I doubt that's important, though.
Then I tried iperf:
On server: iperf -s -i 1
On client: iperf -c 10.0.0.2 -P X (where X is the number of threads: 1,2,4)
Single-thread performance was 24.9 Gbit/s, two threads 26.4 Gbit/s, four threads 26.5 Gbit/s. 26.5 Gbit/s is 3.31 GB/s, so that also agrees with the previous tests. The theoretical throughput of QDR is 32 Gbit/s, or 4 GB/s, right? So that seems pretty good.
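For reference, the 32 Gbit/s figure comes from the line encoding: QDR is 4 lanes x 10 Gbit/s = 40 Gbit/s of signalling, and 8b/10b encoding leaves 40 x 8/10 = 32 Gbit/s = 4 GB/s of actual data. So ~3.3 GB/s is roughly 83% of the theoretical maximum.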
Next, I switched to 2.11.2010 firmware cards, went through the setup steps and ibping again, and repeated some of the tests:
qperf:
- ud_lat: 5.4 us
- ud_bw: 3.13 GB/s both send and receive
- rc_rdma_read_bw: 3.44 GB/s
- rc_rdma_write_bw: 3.43 GB/s
- uc_rdma_write_bw: send: 3.44 GB/s, receive: 3.41 GB/s
- tcp_bw: 3.24 GB/s
- tcp_lat: 10.3 us
Conclusion: The larger BAR space had no measurable effect on performance, at least for my system.
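One sanity check worth doing when swapping firmware back and forth is confirming which version is actually running; as far as I know ibstat reports it, e.g.

ibstat mlx4_0 | grep -i firmware   # should print the active firmware version of the card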
Questions:
- Is ~3.3 GB/s for a QDR system "good" performance? If not, what is a reasonable target and what can I do to improve it?
- Does anyone know anything about this BAR-space business in relation to Mellanox HCAs?
- What is the difference between "datagram" and "connected" mode for ib0?
- What is the MTU value and what does it do?