Need help with making reflashed HP / MCX354A work at 40Gb/s

Discussion in 'Networking' started by Crond, Apr 10, 2019.

  1. Crond

    Crond Member

    Hi All,

After reading this forum I decided to try switching my home lab to 40Gb/s, since the adapters are cheaper than 10Gb/s ones :) But at the moment I can't get full speed out of them.

    Short story:
After updating the firmware and connecting the cards directly, I'm only getting ~20-23Gb/s with a single iperf/iperf3 thread, running Ubuntu 18.10 bare metal:
    t620:~$ iperf3 -s
    [ ID] Interval Transfer Bitrate
    [ 5] 0.00-1.00 sec 2.63 GBytes 22.6 Gbits/sec
    [ 5] 1.00-2.00 sec 2.69 GBytes 23.2 Gbits/sec

Update: multi-threaded performance seems to be fine:
    t620:~$ iperf3 -P2 -c 10.10.10.2
    Connecting to host 10.10.10.2, port 5201
    [ 5] local 10.10.10.1 port 54416 connected to 10.10.10.2 port 5201
    [ 7] local 10.10.10.1 port 54418 connected to 10.10.10.2 port 5201
    [ ID] Interval Transfer Bitrate Retr Cwnd
    [ 5] 0.00-1.00 sec 2.24 GBytes 19.2 Gbits/sec 0 907 KBytes
    [ 7] 0.00-1.00 sec 2.24 GBytes 19.2 Gbits/sec 0 1005 KBytes
    [SUM] 0.00-1.00 sec 4.48 GBytes 38.4 Gbits/sec 0
    - - - - - - - - - - - - - - - - - - - - - - - - -



    Long story:
I got a pair of what appear to be HP 649281-B21 Rev A5 cards (revision taken from the sticker).
    From lspci:
    [PN] Part number: 649281-B21
    [EC] Engineering changes: A5
    [V0] Vendor specific: HP 2P 4X FDR VPI/2P 40GbE CX-3 HCA

I followed the instructions at https://forums.servethehome.com/ind...net-dual-port-qsfp-adapter.20525/#post-198015
    and reflashed both cards successfully with the latest firmware:
    Device #1:
    ----------
    Device Type: ConnectX3
    Part Number: MCX354A-FCB_A2-A5
    *****
    Versions: Current Available
    FW 2.42.5000 2.42.5000
    PXE 3.4.0752 3.4.0752
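    For anyone else following that thread, the crossflash step boils down to roughly this; a sketch from memory, and the mst device name and firmware image file name will differ on your system:

    sudo mst start && sudo mst status        # note the ..._pci_cr0 device name (mt4099 = ConnectX-3)
    # burn the stock Mellanox MCX354A-FCB image onto the HP-branded card;
    # -allow_psid_change is what permits replacing the HP PSID
    sudo flint -d /dev/mst/mt4099_pci_cr0 -i fw-ConnectX3-rel-2_42_5000-MCX354A-FCB_A2-A5.bin -allow_psid_change burn
    sudo flint -d /dev/mst/mt4099_pci_cr0 query    # verify after a reboot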


The cards are plugged into:
    1) Dell T620 / dual E5-2670 v2 / 128GB RAM
    2) Dell Precision 5820 / W-2155 / 64GB RAM
    The systems are connected directly with a Mellanox active cable (MC2206310-015).
    The latest Mellanox OFED, MLNX_OFED_LINUX-4.5-1.0.1.0-ubuntu18.10-x86_64.tgz, is installed in the default configuration.

The first port on both cards is set to Ethernet mode:

    $connectx_port_config -s
    --------------------------------
    Port configuration for PCI device: 0000:04:00.0 is:
    eth
    auto (ib)
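    (For what it's worth, the port types can also be set persistently with mlxconfig from MFT instead of the OFED script; a sketch, with the mst device name as an example:)

    sudo mst start
    # LINK_TYPE values: 1 = IB, 2 = Ethernet, 3 = VPI/auto-sense
    sudo mlxconfig -d /dev/mst/mt4099_pciconf0 set LINK_TYPE_P1=2 LINK_TYPE_P2=3
    sudo mlxconfig -d /dev/mst/mt4099_pciconf0 query | grep LINK_TYPE    # check after a driver reload or reboot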

Both cards negotiated the correct PCIe speed/width on the host side:
    $lspci -vvv
    b3:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
    LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

The Ethernet link seems to be fine as well:
    $ethtool enp67s0
    Settings for enp67s0:
    Supported ports: [ FIBRE ]
    Supported link modes: 1000baseKX/Full
    10000baseKX4/Full
    10000baseKR/Full
    40000baseCR4/Full
    40000baseSR4/Full
    56000baseCR4/Full
    56000baseSR4/Full
    Supported pause frame use: Symmetric Receive-only
    Supports auto-negotiation: Yes
    Supported FEC modes: Not reported
    Advertised link modes: 1000baseKX/Full
    10000baseKX4/Full
    10000baseKR/Full
    40000baseCR4/Full
    40000baseSR4/Full
    Advertised pause frame use: Symmetric
    Advertised auto-negotiation: Yes
    Advertised FEC modes: Not reported
    Speed: 40000Mb/s
    Duplex: Full
    Port: FIBRE
    PHYAD: 0
    Transceiver: internal
    Auto-negotiation: off
    Supports Wake-on: d
    Wake-on: d
    Current message level: 0x00000014 (20)
    link ifdown
    Link detected: yes


I tried some tweaks:
    1) changed the ipv4 and core sysctls to allow more socket buffer memory
    2) changed the CPU governor to performance
    3) set IRQ affinity so the interrupts land on the correct NUMA node on the T620 (set_irq_affinity_bynode.sh)
    4) ran iperf under numactl so it stays on the same NUMA node

    Altogether this pushed throughput from ~20-21Gb/s to 22-23Gb/s, but it's still nowhere close to ~40 (the commands are sketched just below).
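    For anyone following along, roughly the commands behind those four tweaks; the buffer sizes here are illustrative rather than my exact values, and enp67s0 is just this host's interface name:

    # 1) bigger socket buffers
    sudo sysctl -w net.core.rmem_max=268435456 net.core.wmem_max=268435456
    sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 268435456"
    sudo sysctl -w net.ipv4.tcp_wmem="4096 65536 268435456"

    # 2) performance governor on all cores
    sudo cpupower frequency-set -g performance

    # 3) NIC IRQs onto the card's NUMA node (script ships with MLNX OFED)
    cat /sys/class/net/enp67s0/device/numa_node
    sudo set_irq_affinity_bynode.sh 0 enp67s0

    # 4) keep iperf3 on the same node
    numactl --cpunodebind=0 --membind=0 iperf3 -c 10.10.10.2 -t 30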

Does anyone have ideas on what else I can try, or what could be wrong?
    Did I miss any magic switches for the OFED installation script?
    Does anyone get close to 40Gb/s from a single thread/process?

    Thanks!
     
    #1
    Last edited: Apr 10, 2019
  2. herby

    herby Active Member

Jumbo frames? I've decided it isn't worth the hassle myself, but you might eke out more with an MTU of 9000.
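    On both ends it's something like this (interface name is just an example):

    sudo ip link set dev enp67s0 mtu 9000
    ip link show enp67s0 | grep mtu    # confirm it took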
     
    #2
  3. Crond

    Crond Member

Yes, it's 9000.
     
    #3
  4. zxv

    zxv The more I C, the less I see.

I use most of the same tuning methods, and I suspect the 22-25Gb/s limit is the kernel's TCP/IP stack. RoCE will do 38-39Gb/s.

    I've stopped using the mlnx_tune tool because it produces lower bandwidth on my hosts. It's a script, so it should be possible to find out why, but I haven't done that.
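    If you want to see the RoCE number for yourself, the perftest tools that come with OFED take the kernel TCP stack out of the picture; roughly (mlx4_0 is the usual ConnectX-3 device name, and -R runs over rdma_cm, which is the easy path for RoCE):

    # server (10.10.10.2):
    ib_write_bw -d mlx4_0 -R

    # client:
    ib_write_bw -d mlx4_0 -R 10.10.10.2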
     
    #4
  5. iceisfun

    iceisfun New Member

    Have you tried multiple iperf3 processes on different ports at the same time?
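    I.e. separate processes on separate ports rather than -P threads, something like (ports and durations are arbitrary):

    # server side: one listener per port
    iperf3 -s -p 5201 &
    iperf3 -s -p 5202 &

    # client side: two independent processes, sum the results by hand
    iperf3 -c 10.10.10.2 -p 5201 -t 30 &
    iperf3 -c 10.10.10.2 -p 5202 -t 30 &
    wait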
     
    #5
  6. Crond

    Crond Member

I did multiple threads; even 2 threads on a single port give good results (see my first post).
    I'm starting to think there are two problems:
    1) It could be that even the performance governor is too slow to react at 40G, or doesn't read the performance counters correctly. I can see that during a test the per-core frequency sometimes drops while core utilization (according to perf) goes to 100% on the T620. Interestingly, pushing the IRQs and the application onto the NUMA node opposite the Mellanox card gives better single-thread results, close to 30Gb/s (sketch of how I watch the clocks below).

2) The other problem could be overheating of the Mellanox card in the workstation (Precision 5820). Since it's an active cable, the transceiver will draw 20+ W per port.
    Due to the way the PCIe lanes are allocated and the other cards installed, I currently have only one option for a PCIe 3.0 x8 slot, which is directly above the GPU, with the Mellanox card's heatsink facing down.
    In the T620 the motherboard is oriented differently and the heatsink faces up. Even though both hosts only have fans inside the CPU/memory shroud, the card and transceiver in the T620 stay nice and cool, while the transceiver in the P5820 gets almost too hot to touch.
    On a long run the multi-threaded test peaks at 39.5Gb/s for ~20-30 seconds, then drops to 20-25Gb/s for a couple of minutes, and then settles around 31Gb/s.
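    For reference, this is roughly how I watch the per-core clocks while iperf3 runs; nothing fancy, and turbostat is optional (it needs the linux-tools package for the running kernel):

    # per-core frequency, refreshed every second
    watch -n 1 "grep 'cpu MHz' /proc/cpuinfo | sort -rn -k4 | head"

    # or, with linux-tools installed:
    sudo turbostat --interval 1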

I plan to rearrange the cards and move some around to see if that helps, and then I'll move on to the actual goal, which is to deploy SRP/IB-based storage and convert the other hosts from 1/10G Ethernet to 40Gb/s InfiniBand, since I finally got my new Sun IB switch out of the box.

P.S. Even though it's now more of an academic interest, I would still like to get to the bottom of the single-thread bottleneck. If somebody has any other idea about what can be checked, or how it can be profiled, let me know. I must admit I have no idea how to profile an execution flow with a ~100 µs mean time between events without a hardware debugger and full access to source code and tools.
     
    #6
  7. zxv

    zxv The more I C, the less I see.

I've noticed this too. Spreading IRQ affinity across all sockets can perform better than keeping the IRQs only on the socket the NIC is attached to. Seems counterintuitive.
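    With the OFED-bundled scripts that's roughly the difference between the two lines below (interface name and core list are examples; the cpulist variant lets you spread across both sockets explicitly):

    # pin IRQs to the NIC's local node only
    sudo set_irq_affinity_bynode.sh 0 enp67s0

    # vs. spread them over an explicit core list covering both sockets
    sudo set_irq_affinity_cpulist.sh 0-19 enp67s0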
     
    #7
  8. zxv

    zxv The more I C, the less I see.

The Mellanox OFED bundle includes a utility that reads the temperature:
    mget_temp -d /dev/mst/mt4103_pciconf0
    57
    (this is for an FLR ConnectX-3 Pro in a DL380)
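    (If /dev/mst/* doesn't exist yet, start the tools service first; a plain ConnectX-3 shows up as mt4099 rather than mt4103:)

    sudo mst start
    sudo mst status                              # lists the device names
    sudo mget_temp -d /dev/mst/mt4099_pciconf0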
     
    #8
  9. RageBone

    RageBone Active Member

If you are on Intel and anything above Skylake, you will have the DCI interface built into the platform.
    DCI, for Direct Connect Interface, is Intel's newest debug interface; it needs only some BIOS settings, a USB 3.0 cable without the 5V pin, and the Eclipse-based debugging suite, available free and without an NDA directly from Intel.

    So that might eliminate the hardware debugger : D

I'd assume one problem lies with iperf itself, depending on the hardware used and the iperf configuration.

    Here is the source for iperf, if I'm not bamboozled this evening:
    esnet/iperf
    That could fix part of the source-code problem.

    The README mentions a zero-copy mode; maybe that's similar to RDMA, where the main advantage, as far as I remember, is claimed to be "1 instead of 3 copy operations through the kernel".
    Offloading protocol work onto the card is another factor. If iperf happens to do things its own way, it might not use the offload features, limiting performance.
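    If you want to try it, the zero-copy mode looks like just a client-side flag (sketch):

    iperf3 -c 10.10.10.2 -Z    # --zerocopy: sendfile() instead of write()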

What about the OFED- and Mellanox-supplied utilities?

    On the other hand, iperf is mainly a bandwidth-measurement utility; I don't see a requirement for it to be single-core optimized.
    So you now at least know that the network hardware itself can do that throughput.
    Is it really necessary to figure out that single-thread performance problem?
    What will be your actual use case? How is that performing?
     
    #9
  10. zxv

    zxv The more I C, the less I see.

I hadn't seen until now that you get full bandwidth with two threads. That's excellent for TCP/IP.

    Pause frames will help if you can enable them. Enable ECN using sysctl.

    If the driver is using poll mode, then look at the interrupt moderation settings.

    If/when you have mixed traffic on that link, queue priorities would make some difference.

    I use NFS 4.1 and/or iSCSI over four connections on a single 40Gb link.

    mlnx_tune produces bad tunings for me, but it will show you where a lot of the tunables live (the pause/ECN/coalescing knobs are sketched below).
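    Roughly, those knobs in command form (interface name is an example, and your defaults may already be sane):

    # pause frames / flow control on both ends
    sudo ethtool -A enp67s0 rx on tx on
    ethtool -a enp67s0                       # verify

    # ECN for TCP
    sudo sysctl -w net.ipv4.tcp_ecn=1

    # interrupt moderation (coalescing) settings
    ethtool -c enp67s0
    sudo ethtool -C enp67s0 adaptive-rx on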
     
    #10
  11. Crond

    Crond Member

Thanks, I'll give it a try.
     
    #11
  12. Crond

    Crond Member

Thanks for the reply.

    I'll take a look at DCI; I've never had a chance to work with it before.

    iperf's zero-copy mode has nothing to do with RDMA. It just instructs iperf to use the sendfile() call instead of a read()/write() combination on the socket, eliminating one copy of the data between the userspace buffer and the kernel side. This reduces CPU load but has no effect on throughput (unless your CPU is maxed out); but to remove any doubt: yes, I did try it.

    With respect to OFED/MLNX benchmarks, I guess I overlooked them in the docs I read through. Can you give me a link to a document, a community discussion, or some sort of quick-start guide for benchmarking with the MLNX/OFED tools, or maybe post a trivial example here?

    With respect to the use case/goal: you're correct, the primary target of this testing was to verify that the cards work correctly after reflashing, which is done.
    The actual goal is to migrate my "lab" build cluster from very old first-generation 10G Intel cards (pre-X520, no HW offload, no SR-IOV, some still CX4 :) ) to 40G SRP or iSER over IB. The storage is essentially a 1+TB RAM-based RAID-0; each individual node hits 1M IOPS at 60/40 random writes/reads with 4k blocks, so the limiting factor is the network. The reason for IB instead of RoCE is quite simple: since it's my pet project, I'm paying out of my own pocket. A 36-port managed IB switch at 40G wire speed is $50 shipped (I picked a Sun 36-port), while an Ethernet alternative is probably two orders of magnitude more expensive.

    I planned to start testing with IB instead of Ethernet, but it turned out that ESXi no longer supports IB (as of 5.5), and for bare metal, at the time I was running the tests most of the links on the Mellanox community and the guides/articles related to iSER target and SRP setup were dead (they referred to the old community website). So I submitted a ticket to MLX support (they fixed the links this week) and went with Ethernet-based testing; the rest, I guess, is history.

    So the remaining question of single-thread performance is more on the "academic" side. I see no reason why single-thread performance should be below ~36-38Gb/s, so most likely I'm still missing something that can be tuned, and finding it will, at the end of the day, help reduce overhead or latency and thus improve the final solution as well.


     
    #12
  13. i386

    i386 Well-Known Member

  14. zxv

    zxv The more I C, the less I see.

This one benchmarks NVMe-oF, but it will be a step toward the RAM-disk performance level you mentioned:
    https://community.mellanox.com/s/article/simple-nvme-of-target-offload-benchmark

    This one benchmarks iSER:
    https://community.mellanox.com/s/article/iser-performance-tuning-and-benchmark

    And for others who may be new to RDMA, setup links are in "Recipes for RoCE fabrics":
    https://community.mellanox.com/s/ar...rk-configuration-examples-for-roce-deployment
     
    #14