Need help with making reflashed HP / MCX354A work at 40Gb/s

Crond

Member
Mar 25, 2019
43
10
8
Hi All,

After reading this forum I decided to switch my home lab to 40Gb/s, since the adapters are cheaper than 10Gb/s ones :) But at the moment I can't get full speed out of them.

Short story:
After updating the firmware and connecting the cards directly, I'm only getting ~20-23Gb/s with a single thread in iperf/iperf3, running Ubuntu 18.10 on bare metal:
t620:~$ iperf3 -s
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 2.63 GBytes 22.6 Gbits/sec
[ 5] 1.00-2.00 sec 2.69 GBytes 23.2 Gbits/sec

Update: multi-thread performance seems to be fine:
t620:~$ iperf3 -P2 -c 10.10.10.2
Connecting to host 10.10.10.2, port 5201
[ 5] local 10.10.10.1 port 54416 connected to 10.10.10.2 port 5201
[ 7] local 10.10.10.1 port 54418 connected to 10.10.10.2 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 2.24 GBytes 19.2 Gbits/sec 0 907 KBytes
[ 7] 0.00-1.00 sec 2.24 GBytes 19.2 Gbits/sec 0 1005 KBytes
[SUM] 0.00-1.00 sec 4.48 GBytes 38.4 Gbits/sec 0
- - - - - - - - - - - - - - - - - - - - - - - - -
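
A quick way to check that the per-stream rates really add up to the [SUM] line is to parse the iperf3 text output. This is only a sketch: the regular expression is written against the interval lines shown above, `per_stream_gbits` is a hypothetical helper name, and it assumes rates are reported in Gbits/sec and keeps the last interval seen per stream.

```python
import re

# Matches iperf3 interval lines such as:
# "[ 5] 0.00-1.00 sec 2.24 GBytes 19.2 Gbits/sec 0 907 KBytes"
LINE_RE = re.compile(
    r"^\[\s*(?P<id>\d+|SUM)\]\s+"
    r"(?P<start>[\d.]+)-(?P<end>[\d.]+)\s+sec\s+"
    r"[\d.]+\s+\w+\s+"
    r"(?P<rate>[\d.]+)\s+Gbits/sec"
)


def per_stream_gbits(lines):
    """Return {stream_id: Gbit/s}, skipping [SUM], header and 'connected' lines."""
    rates = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m.group("id") != "SUM":
            rates[int(m.group("id"))] = float(m.group("rate"))
    return rates


sample = [
    "[ 5] local 10.10.10.1 port 54416 connected to 10.10.10.2 port 5201",
    "[ ID] Interval Transfer Bitrate Retr Cwnd",
    "[ 5] 0.00-1.00 sec 2.24 GBytes 19.2 Gbits/sec 0 907 KBytes",
    "[ 7] 0.00-1.00 sec 2.24 GBytes 19.2 Gbits/sec 0 1005 KBytes",
    "[SUM] 0.00-1.00 sec 4.48 GBytes 38.4 Gbits/sec 0",
]
rates = per_stream_gbits(sample)
total = sum(rates.values())
print(rates, round(total, 1))
```

For anything longer than a spot check, iperf3's JSON output (`-J`) is probably a more robust thing to parse than the human-readable text.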



Long story:
I got a pair of what appears to be HP 649281-B21 Rev A5 (revision on sticker)
From lspci:
[PN] Part number: 649281-B21
[EC] Engineering changes: A5
[V0] Vendor specific: HP 2P 4X FDR VPI/2P 40GbE CX-3 HCA

I followed the instructions at https://forums.servethehome.com/ind...net-dual-port-qsfp-adapter.20525/#post-198015
and got both cards reflashed successfully with the latest firmware:
Device #1:
----------
Device Type: ConnectX3
Part Number: MCX354A-FCB_A2-A5
*****
Versions: Current Available
FW 2.42.5000 2.42.5000
PXE 3.4.0752 3.4.0752


The cards are installed in:
1) Dell T620 / dual E5-2670 v2 / 128GB RAM
2) Dell Precision 5820 / W-2155 / 64GB RAM
The systems are connected directly with a Mellanox active cable (MC2206310-015).
The latest Mellanox OFED, MLNX_OFED_LINUX-4.5-1.0.1.0-ubuntu18.10-x86_64.tgz, is installed in its default configuration.

The first port on both cards is set to Ethernet mode:

$connectx_port_config -s
--------------------------------
Port configuration for PCI device: 0000:04:00.0 is:
eth
auto (ib)

Both cards negotiated the correct PCIe speed/width on the host side:
$lspci -vvv
b3:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
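
As a sanity check on the LnkSta line, the arithmetic below estimates what a PCIe 3.0 link can carry per direction. It only accounts for the 128b/130b line encoding; TLP/DLLP protocol overhead is ignored, so real usable bandwidth is somewhat lower. `pcie_gen3_gbits` is a hypothetical helper name.

```python
def pcie_gen3_gbits(lanes, gt_per_s=8.0):
    """Per-direction PCIe 3.0 bandwidth in Gbit/s after 128b/130b encoding.

    Protocol (TLP/DLLP) overhead is not modelled, so this is an upper bound.
    """
    return gt_per_s * lanes * 128 / 130


x8 = pcie_gen3_gbits(8)
x4 = pcie_gen3_gbits(4)
print(f"PCIe 3.0 x8: {x8:.1f} Gbit/s per direction")
print(f"PCIe 3.0 x4: {x4:.1f} Gbit/s per direction")
```

An x8 link at 8GT/s leaves headroom above 40Gb/s, whereas a link that trained at x4 would not, which is why the negotiated width is worth checking first.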

The Ethernet link seems to be fine as well:
$ethtool enp67s0
Settings for enp67s0:
    Supported ports: [ FIBRE ]
    Supported link modes:   1000baseKX/Full
                            10000baseKX4/Full
                            10000baseKR/Full
                            40000baseCR4/Full
                            40000baseSR4/Full
                            56000baseCR4/Full
                            56000baseSR4/Full
    Supported pause frame use: Symmetric Receive-only
    Supports auto-negotiation: Yes
    Supported FEC modes: Not reported
    Advertised link modes:  1000baseKX/Full
                            10000baseKX4/Full
                            10000baseKR/Full
                            40000baseCR4/Full
                            40000baseSR4/Full
    Advertised pause frame use: Symmetric
    Advertised auto-negotiation: Yes
    Advertised FEC modes: Not reported
    Speed: 40000Mb/s
    Duplex: Full
    Port: FIBRE
    PHYAD: 0
    Transceiver: internal
    Auto-negotiation: off
    Supports Wake-on: d
    Wake-on: d
    Current message level: 0x00000014 (20)
                           link ifdown
    Link detected: yes


I tried some tweaks:
1) changed the ipv4 and core configuration to allocate more memory
2) changed the CPU governor to performance
3) set IRQ affinity so that interrupts go to the correct NUMA node on the T620 (set_irq_affinity_bynode.sh)
4) ran iperf under numactl so that it runs on the same NUMA node.

Altogether this pushed throughput from ~20-21Gb/s to 22-23Gb/s, but that's still nowhere close to ~40Gb/s.

Does anyone have any ideas about what else I can try, or what could be wrong?
Did I miss any magic switches for the OFED installation script?
Does anyone get close to 40Gb/s from a single thread/process?

Thanks!
 

herby

Active Member
Aug 18, 2013
178
51
28
Jumbo frames? I've decided it isn't worth the hassle myself, but you might eke out more with an MTU of 9000.
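
For a rough sense of why a larger MTU might help, the sketch below estimates how many frames per second a host must handle at a given line rate. It assumes every frame carries a full MTU of payload and ignores Ethernet/IP/TCP header overhead; `packets_per_second` is a hypothetical helper name.

```python
def packets_per_second(gbits_per_s, mtu_bytes):
    """Frames per second needed to sustain a bit rate with MTU-sized frames."""
    return gbits_per_s * 1e9 / (mtu_bytes * 8)


for mtu in (1500, 9000):
    pps = packets_per_second(40, mtu)
    budget_ns = 1e9 / pps  # time available per frame on a single core
    print(f"MTU {mtu}: {pps:,.0f} frames/s, {budget_ns:.0f} ns per frame")
```

Offloads such as TSO and GRO already batch packets inside the stack, so the real-world gain may be smaller than the raw 6x difference in frame count suggests.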
 

zxv

The more I C, the less I see.
Sep 10, 2017
153
51
28
I use most of the same tuning methods and I suspect the 22-25Gb/sec limit may be the kernel's TCP/IP stack. RoCE will do 38-39 Gb/sec.

I've stopped using the "mlnx_tune" tool because it produces lower bandwidth on my hosts. It's a script so it should be possible to find out why, but I haven't done that.
 

Crond

"Have you tried multiple iperf3 processes on different ports at the same time?"
I did try multiple threads; even 2 threads on a single port give good results (see my first post).
I'm starting to think that there are 2 problems:
1) It could be that even the performance governor is too slow to react at 40G, or doesn't collect/read the performance counters correctly. During the test I see the per-core frequency sometimes drop while core utilization (as reported by perf) goes to 100% on the T620. Interestingly enough, pushing the IRQs and the application to the NUMA node opposite the Mellanox card gives better single-thread performance, close to 30Gb/s.

2) Another problem could be overheating of the Mellanox card in the workstation (Precision 5820). Since it's an active cable, the transceiver will draw 20+ W per port.
Due to the way the PCIe lanes are allocated and the other cards installed, I currently have only one option for a PCIe 3.0 x8 slot, which is directly above the GPU, with the heatsink of the Mellanox card facing down.
In the T620 the motherboard is oriented differently and the heatsink faces up. Even though both hosts only have fans inside the CPU/memory shroud, the card and transceiver in the T620 stay nice and cool, while the transceiver inside the 5820 gets almost too hot to touch.
On a long test I see the multi-thread run peak at 39.5Gb/s for ~20-30 seconds, then drop to 20-25Gb/s for a couple of minutes, and then settle around 31Gb/s.

I plan to rearrange and move the other cards around to see if it helps, and then I guess I'll move on to the actual goal, which is to deploy SRP / IB-based storage and convert the other hosts from 1/10G Ethernet to 40Gb/s InfiniBand, since I finally got my new Sun IB switch out of the box.

P.S. Even though it's now more of an academic interest, I would still like to get to the bottom of the single-thread bottleneck. If somebody has any other idea of what else can be checked, or how it can be profiled, let me know. I must admit that I have no idea how to profile an execution flow with ~100 usec mean time for events without a hardware debugger and full access to source code and tools.
 

zxv

"Interestingly enough, pushing the IRQs and the application to the NUMA node opposite the Mellanox card gives better single-thread performance, close to 30Gb/s."
I've noticed this too. Spreading IRQ affinity across all sockets can perform better than keeping IRQs only on the socket attached to the NIC. It seems counterintuitive.
 

zxv

The Mellanox OFED bundle includes a utility that reads the temperature:
mget_temp -d /dev/mst/mt4103_pciconf0
57
(this is for a FLR ConnectX-3 Pro in a DL380)
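
To see whether the card heats up during a long iperf run, the reading above could be sampled in a loop. This is a minimal sketch, assuming `mget_temp` prints a bare integer in degrees C as in the output shown, and that the MST device path is adjusted for the host (the path used here is the one from the DL380 example, not necessarily yours). The helper names are hypothetical, and the tool typically needs root and a started MST service.

```python
import subprocess
import time


def parse_temp(output):
    """Convert mget_temp's output (e.g. '57\\n') to an integer in degrees C."""
    return int(output.strip())


def read_temp(device, run=subprocess.run):
    """Run mget_temp once for the given MST device and return the reading."""
    result = run(["mget_temp", "-d", device],
                 capture_output=True, text=True, check=True)
    return parse_temp(result.stdout)


def log_temps(device, samples, interval_s=1.0, reader=read_temp):
    """Collect `samples` (timestamp, temperature) pairs, interval_s apart."""
    readings = []
    for i in range(samples):
        readings.append((time.time(), reader(device)))
        if i + 1 < samples:
            time.sleep(interval_s)
    return readings


# Example (not executed here; requires the Mellanox tools and root):
# for ts, temp in log_temps("/dev/mst/mt4103_pciconf0", samples=60):
#     print(f"{ts:.0f} {temp}")
```

Logging this alongside the iperf3 interval output would show whether the throughput drops line up with temperature peaks.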
 

RageBone

Active Member
Jul 11, 2017
295
77
28
If you are on Intel and anything above Skylake, you will have the DCI interface built into the platform.
DCI, for Direct Connect Interface, is Intel's newest debug interface. It just needs some "BIOS settings", a USB 3.0 cable without the 5V pin, and the Eclipse-based debugging suite, which is free and available without an NDA directly from Intel.

So that might eliminate the need for a hardware debugger :D

I'd assume that one problem lies with iperf itself, depending on the hardware used and the iperf configuration.

Here is the source for iperf, if I'm not bamboozled this evening:
esnet/iperf
That could fix part of the source code problem.

The README mentions a zero-copy mode. Maybe that's similar to RDMA, where the main advantage, as far as I remember, is claimed to be "1 instead of 3 copy operations through the kernel".
Offloading protocol work onto the card is another factor. If iperf happens to do things its own way, it might not use the offload features, which would limit performance.

What about the OFED and Mellanox-supplied utils?

On the other hand, iperf is just a utility mainly for bandwidth measurement. I don't see a requirement for it to be optimized for a single core.
So you now at least know that the network hardware itself can deliver that throughput.
Is it really necessary to figure out that single-thread performance problem?
What will your actual use case be? How is that performing?
 

zxv

I hadn't seen till now that you've got full bandwidth using two threads. That's excellent for TCP/IP.

Pause frames will help if you can enable them. Enable ECN using sysctl.

If the driver is using poll mode, then look at the interrupt moderation settings.

If/when you have mixed traffic on that link, queue priorities would make some difference.

I use NFS 4.1 and/or iSCSI over four connections on a single 40Gb link.

mlnx_tune produces bad tunings for me, but it will show you the interface for lots of tunings.
 

Crond

Thanks, I'll give it a try.
 

Crond

Thanks for the reply.

I'll take a look at DCI; I've never had a chance to work with it before.

iperf's zero-copy mode has nothing to do with RDMA. It just instructs iperf to use the sendfile() call instead of a combination of read()/write() to the socket, thus eliminating a copy of the data between the userspace buffer and the kernel side. This reduces CPU load but has no effect on throughput (unless your CPU is maxed out). To avoid any doubt: yes, I did try it.
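
To illustrate the difference between the two send paths (this is not iperf's own code, just a minimal sketch), the snippet below pushes the same file through a local socket pair twice: once with Python's `socket.sendfile()`, which uses `os.sendfile()` where the platform supports it, and once with a plain read/send loop through a userspace buffer. The helper names are hypothetical.

```python
import os
import socket
import tempfile
import threading


def send_with_sendfile(sock, path):
    """Kernel-side copy: the file data does not pass through a user buffer."""
    with open(path, "rb") as f:
        return sock.sendfile(f)


def send_with_read_write(sock, path, chunk=64 * 1024):
    """Classic loop: read() into a userspace buffer, then send it."""
    sent = 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk)
            if not buf:
                break
            sock.sendall(buf)
            sent += len(buf)
    return sent


def transfer(sender, payload):
    """Send payload through a local socket pair and return what arrived."""
    a, b = socket.socketpair()
    received = bytearray()

    def drain():
        # Read until the sending side is closed.
        while True:
            data = b.recv(65536)
            if not data:
                break
            received.extend(data)

    t = threading.Thread(target=drain)
    t.start()
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    try:
        sender(a, tmp.name)
    finally:
        a.close()
        t.join()
        b.close()
        os.unlink(tmp.name)
    return bytes(received)


payload = os.urandom(1 << 20)
print(len(transfer(send_with_sendfile, payload)))
print(len(transfer(send_with_read_write, payload)))
```

Both paths deliver identical bytes; the difference is only in how much copying the CPU does, which matches the point that zero-copy helps when the CPU is the bottleneck.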

With respect to the OFED / MLNX benchmarks, I guess I overlooked them in the docs I read through. Can you give me a link to a document, community discussion, or some sort of QSG for benchmarking with the MLNX / OFED tools, or maybe post a trivial example here?

With respect to the use case/goal: you're correct, the primary target of this testing was to verify that the cards work correctly after reflashing, which is done.
The goal is to migrate my "lab" build cluster from very old first-gen 10G Intel cards (pre-X520, no HW offload, no SR-IOV, some still CX4 :) ) to 40G SRP or iSER over IB. It is essentially a 1+TB RAM-based RAID-0. Each individual node hits 1M IOPS on 60/40 random writes/reads with 4k blocks, so the limiting factor is the network. The reason for IB instead of RoCE is quite simple: since it's my pet project, I'm paying out of my own pocket. A 36-port managed IB switch with 40G wire speed is $50 shipped (I picked a Sun 36), while an Ethernet alternative is probably 2 orders of magnitude more expensive.
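
The "limiting factor is the network" point can be checked with quick arithmetic. This sketch assumes "4k blocks" means 4096 bytes and ignores protocol overhead; `iops_to_gbits` is a hypothetical helper name.

```python
def iops_to_gbits(iops, block_bytes):
    """Payload bandwidth in Gbit/s for a given IOPS rate and block size."""
    return iops * block_bytes * 8 / 1e9


per_node = iops_to_gbits(1_000_000, 4096)
print(f"1M IOPS x 4 KiB = {per_node:.1f} Gbit/s per node")
for link in (10, 40):
    print(f"{link}G link: {per_node / link:.0%} of line rate")
```

Under these assumptions a single node can generate more than three times what a 10G link carries, and most of a 40G link.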

I planned to start testing with IB instead of Ethernet, but it turned out that ESXi no longer supports IB (as of 5.5). As for bare metal, at the time I was running the tests most of the links in the Mellanox community and the guides/articles related to iSER target and SRP setup were dead (they referred to the old community website). So I submitted a ticket to Mellanox support (they fixed the links this week) and went with testing over Ethernet; the rest, I guess, is history.

So the remaining question of single-thread performance is more on the "academic" side. I see no reason why single-thread performance should be below ~36-38Gb/s, so most likely I'm still missing something that can be tuned, which at the end of the day would help reduce overhead or latency and thus improve performance for the final solution as well.


 

zxv

"With respect to the OFED / MLNX benchmarks, I guess I overlooked them in the docs I read through. Can you give me a link to a document, community discussion, or some sort of QSG for benchmarking with the MLNX / OFED tools, or maybe post a trivial example here?"
This one benchmarks NVMe-oF, but it will be a step toward the performance level of the RAM disk you mentioned:
https://community.mellanox.com/s/article/simple-nvme-of-target-offload-benchmark

This is benchmarking of iSER:
https://community.mellanox.com/s/article/iser-performance-tuning-and-benchmark

And for others who may be new to RDMA, links for setup are in "Recipes for RoCE fabrics":
https://community.mellanox.com/s/ar...rk-configuration-examples-for-roce-deployment