[SOLVED]Mellanox ConnectX 3 can't get 40G only 10G

Discussion in 'Networking' started by BackupProphet, Nov 16, 2018.

  1. svtkobra7

    svtkobra7 Active Member

    Joined:
    Jan 2, 2017
    Messages:
    219
    Likes Received:
    37
    YEAHHH BUDDDYYY ... that's the money shot right there! ESXi <=> ESXi via direct 40G = iperf shows 33G :):):)

    Curiously though I can't get above 14G FreeNAS <=> FreeNAS on those same hosts.


    I picked up ~+1G on the 10G port (FreeNAS to FreeNAS), so I'm going to assume that was the result of properly enabling jumbo frames. MTU was set to 9000 in ESXi and FreeNAS, and jumbo frames were globally enabled on the ICX-6450 (i.e. "jumbo"), BUT NOT on the VLAN interface. The following command did the trick and now a Plex jail works again (yay for not getting yelled at any longer).

    Code:
    interface ve200
    ip mtu 9000
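To put a rough number on what jumbo frames can buy on the wire, here is a minimal sketch of the best-case TCP payload efficiency per full-sized Ethernet frame. It assumes IPv4 and TCP headers without options (20 bytes each), no VLAN tag, and the usual 38 bytes of per-frame Ethernet overhead; the function name is made up for illustration.

```python
ETH_OVERHEAD = 8 + 14 + 4 + 12   # preamble+SFD, header, FCS, inter-frame gap (bytes)
IP_TCP_HEADERS = 20 + 20         # IPv4 + TCP without options (assumption)


def tcp_efficiency(mtu: int) -> float:
    """Fraction of wire time that carries TCP payload for full-sized frames."""
    payload = mtu - IP_TCP_HEADERS
    on_wire = mtu + ETH_OVERHEAD
    return payload / on_wire


for mtu in (1500, 9000):
    print(f"MTU {mtu}: {tcp_efficiency(mtu):.1%} of line rate")
```

Under these assumptions the difference on the wire is only a few percent, so a larger gain from jumbo frames would more likely come from the hosts handling fewer packets per second.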
    Prior testing didn't include host to host iperf, but since FreeNAS to FreeNAS nudged closer to 10G on the other port (which I would attribute to the config change), this much, much improved result doesn't touch the switch ... curious o_O

    Code:
    [root@ESXi-01:/opt/iperf/bin] ./iperf -c 10.2.0.42 -w 1M -P 8
    ------------------------------------------------------------
    Client connecting to 10.2.0.42, TCP port 5001
    TCP window size: 1.01 MByte (WARNING: requested 1.00 MByte)
    ------------------------------------------------------------
    [ 10] local 10.2.0.41 port 51426 connected with 10.2.0.42 port 5001
    [  8] local 10.2.0.41 port 21429 connected with 10.2.0.42 port 5001
    [  7] local 10.2.0.41 port 28275 connected with 10.2.0.42 port 5001
    [  6] local 10.2.0.41 port 10155 connected with 10.2.0.42 port 5001
    [ 11] local 10.2.0.41 port 48648 connected with 10.2.0.42 port 5001
    [  5] local 10.2.0.41 port 62808 connected with 10.2.0.42 port 5001
    [  4] local 10.2.0.41 port 19689 connected with 10.2.0.42 port 5001
    [  3] local 10.2.0.41 port 57074 connected with 10.2.0.42 port 5001
    [ ID] Interval       Transfer     Bandwidth
    [ 10]  0.0-10.0 sec  4.82 GBytes  4.14 Gbits/sec
    [  8]  0.0-10.0 sec  4.80 GBytes  4.13 Gbits/sec
    [  7]  0.0-10.0 sec  4.85 GBytes  4.17 Gbits/sec
    [  6]  0.0-10.0 sec  4.83 GBytes  4.15 Gbits/sec
    [ 11]  0.0-10.0 sec  4.81 GBytes  4.14 Gbits/sec
    [  5]  0.0-10.0 sec  4.81 GBytes  4.13 Gbits/sec
    [  4]  0.0-10.0 sec  4.87 GBytes  4.18 Gbits/sec
    [  3]  0.0-10.0 sec  4.85 GBytes  4.17 Gbits/sec
    [SUM]  0.0-10.0 sec  38.6 GBytes  33.2 Gbits/sec
     
    #21
    fohdeesha and RageBone like this.
  2. 40gorbust

    40gorbust New Member

    Joined:
    Saturday
    Messages:
    9
    Likes Received:
    0
    Gents, after playing for a whole day with a Mellanox CX354A-FCBT I learned a ton but got stuck at having iperf perform at exactly 10-11 Gbit.
    No matter how many threads (-P <number>), it keeps doing that number. This is between two dual Xeon E5s, not the fastest on the block, but not loaded at all. TCP/IP works, ping works, and iperf of course, but file transfers are stuck at EXACTLY 133 MB/sec even after many repeated SCP copies between the 2 machines.

    The card reports it has a 40 Gbit link. We tried to change to ETH, then changed back to VPI, obviously the cards are working, using a QSFP+ 3 meter DAC cable (sold as 40 Gbit QDR/FDR).

    So while the link is 40 Gbit, the speed is super stable (can run many minutes and always see 10-11Gbit/sec with iperf) we're stuck why it doesn't do more. We're just running plain Linux, no VMware or something, no hypervisors, nothing going on on the servers. Cards are in Gen.2 and Gen.3 x8 electrical slots (so nothing limited by sharing with a chipset or actually just a x4 slot) so that cannot explain why it can't go higher than 10-11 Gbit.

    We have several older Mellanox 10 Gbit SFP+ cards and those have been running fine for near 2 years now to a normal 1 Gbit + 10 Gbit SFP+ switch.

    Am I doing something wrong ? The cards have a FW 2.42.5000 out of the box, we didn't have to flash them or something. We turned servers off, connected the cable in various orders (first cable, then turned on servers, first turned on servers, then connected cable etc).

    One server is an Intel S2600CP with two E5-2650's (IIRC) and the other an Intel SC5520HC with two X5660's.
     
    #22
  3. i386

    i386 Well-Known Member

    Joined:
    Mar 18, 2016
    Messages:
    1,491
    Likes Received:
    338
    What OS do you use? My guess is that you use iperf3 and Windows. (iperf3 doesn't work that well on Windows platforms.)
    If you have Windows-only hosts you could try other benchmarks (I prefer ntttcp).

    Did you get 133 MB/s with the SFP+ cards too?
    How did you measure the 133 MB/s?
    How does your storage look on the sending and receiving host?
     
    #23
  4. 40gorbust

    40gorbust New Member

    Joined:
    Saturday
    Messages:
    9
    Likes Received:
    0
    OS: Linux to Linux
    SFP+ cards ; those get 100-120 MB/sec but they're not connected in any way to the new 40G cards
    Measure ; SCP gave a nice report after sending a file (single) as big as a few GB (2-4)
    Storage ; we created 10 GB ramdisks (tmpfs) on both servers to eliminate storage bottlenecks
    Creating a new file on the ramdisk takes literally a second for multiple GBs.

    I wonder if you can only get 40G with a minimum of 4 threads / parallel file transfers or if it is possible to get close to the 40Gbit speed with a single file transfer in a single thread ? Thanks for your time!
     
    #24
  5. svtkobra7

    svtkobra7 Active Member

    Joined:
    Jan 2, 2017
    Messages:
    219
    Likes Received:
    37
    For clarity (having just gone through something similar), "threads" and "parallel" are not interchangeable here.
    • iperf or iperf3? From my understanding using the -P switch does not multi-thread, at least with iperf3 and as confirmed here:
    • iperf3 at 40Gbps and above
    Code:
    iperf -h   -P, --parallel  #        number of parallel client threads to run
    iperf3 -h  -P, --parallel  #        number of parallel client streams to run
     
    #25
  6. 40gorbust

    40gorbust New Member

    Joined:
    Saturday
    Messages:
    9
    Likes Received:
    0
    Ok thanks, will keep an eye on that. I think we used iperf (servers in the office, now it's weekend ; time to relax... ehh search forums for answers), not iperf3.

    Should a 40G/56G card be able to transfer near its maximum of 40/56 Gbit between two servers over a DAC cable on one thread/connection? So nothing parallel?
     
    #26
  7. svtkobra7

    svtkobra7 Active Member

    Joined:
    Jan 2, 2017
    Messages:
    219
    Likes Received:
    37
    • A wise man once told me, "anything is possible" ;) but at the default window size and -P 1 (as is also the default), I'd be impressed if so.
      • You will likely want to increase that window size (-w 1M for example) and -P (-P 4 for example) to find the highest bitrate.
    • I can only speak from personal experience, and limited experience relative to others here, but:
      • PCIe v2 x 8 = 32 Gbps, so no you won't hit the theoretical max of that pipe using that.
      • Also the base clock on the E5-2650 v1 = 2.0 GHz, which may not be fast enough to max it out.
        • Nearly the best I was able to hit using an E5-2680 v2 & E5-2690 v2 was posted immediately before your reply (33 Gbps).
        • To be fair, my result may not be CPU bound as that test was before I did a fair bit of tuning / learning, but the fact that it was performed between two hosts running ESXi 6.7, where MLNX's OFED isn't supported, may have hurt the results. It's quite possible testing between two Linux clients (as you are) could result in 40 Gbps (I'd have to defer to other more knowledgeable members here). But similar to you, that test was via DAC and not over a switch.
    • For me personally, 40 Gbps is well beyond my actual bottleneck so I was happy to easily achieve near line rate for port 1 @ 10 Gbps (switched via SFP+) and port 2 @ 20 - 30 Gbps (direct via QSFP DAC, reserved for FreeNAS <=> FreeNAS replication).
    Code:
      PCIe Per lane (each direction):
         v1.x:  250 MB/s ( 2.5 GT/s)
         v2.x:  500 MB/s ( 5 GT/s)
         v3.0:  985 MB/s ( 8 GT/s)
         v4.0: 1969 MB/s (16 GT/s)
    Code:
    PCIe v2 x8  =  4   GB/s ( 40 GT/s)
    PCIe v2 x16 =  8   GB/s ( 80 GT/s) 
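The per-lane figures above follow from the signalling rate and the line encoding, so the same table can be rebuilt for any generation and lane count. A minimal sketch, assuming 8b/10b encoding for v1/v2 and 128b/130b for v3/v4 (the function and dictionary names are made up for illustration, and packet overhead is ignored):

```python
# Per-lane signalling rate (GT/s) and line-code efficiency per PCIe generation.
PCIE_GENERATIONS = {
    1: (2.5, 8 / 10),       # 8b/10b encoding
    2: (5.0, 8 / 10),       # 8b/10b encoding
    3: (8.0, 128 / 130),    # 128b/130b encoding
    4: (16.0, 128 / 130),   # 128b/130b encoding
}


def pcie_gbit_per_s(gen: int, lanes: int) -> float:
    """Usable link rate in Gbit/s per direction, before packet overhead."""
    gt_per_s, efficiency = PCIE_GENERATIONS[gen]
    return gt_per_s * efficiency * lanes


for gen, lanes in ((2, 4), (2, 8), (3, 8)):
    gbit = pcie_gbit_per_s(gen, lanes)
    print(f"PCIe v{gen} x{lanes}: {gbit:.1f} Gbit/s = {gbit / 8:.2f} GB/s")
```

Dividing by 8 converts Gbit/s to GB/s, which is why a v2 x8 slot shows up as both 32 Gbit/s and 4 GB/s.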
     
    #27
  8. 40gorbust

    40gorbust New Member

    Joined:
    Saturday
    Messages:
    9
    Likes Received:
    0
    Thanks, even x4 speed on gen.2 should be already 2GB/sec so enough to hit 20 GBit or more. I'm aware the CPUs aren't the fastest in the world and I'll play a bit with affinity to try to run the driver on the second CPU and now I think of it I am not sure if the card is in a slot tied to CPU 1 (out of 2) so maybe the driver is running on the other CPU and the card in a slot handled by the other CPU.
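One way to check which CPU socket the card hangs off is the `numa_node` file that Linux exposes in sysfs for PCI network devices. A minimal sketch, assuming a Linux host; the interface name `eth2` is a placeholder and the helper name is made up:

```python
from pathlib import Path


def nic_numa_node(iface: str, sysfs_root: str = "/sys/class/net") -> int:
    """Return the NUMA node a PCI NIC is attached to, or -1 if unknown."""
    node_file = Path(sysfs_root) / iface / "device" / "numa_node"
    try:
        return int(node_file.read_text().strip())
    except (OSError, ValueError):
        # Interface missing, not a PCI device, or unreadable content.
        return -1


if __name__ == "__main__":
    # "eth2" is a placeholder; substitute the ConnectX-3 interface name.
    print(nic_numa_node("eth2"))
```

With the node known, the benchmark can be pinned to that socket, for example with `numactl --cpunodebind=<node>`.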

    There are a lot of things we're still not sure about. Why VPI or ETH (with a DAC)? Does the speed of a DAC cable degrade over distance (because of errors/retransmissions or higher latency compared to fiber ; think xDSL lines that become less performant over distance)?

    I bought a switch on Ebay (Mellanox MIS5024 , unmanaged 36 ports 40Gbit for $200) and that will be the final setup so even if a server to server connection can't be optimal, I hope a connection through that switch will be 30+ Gbit. I'll know in a week or 3 or so.

    I found out later, after all the tests, that the 2 ports on the CX354A are actually not 100% the same regarding VPI and ETH and auto-sensing. Will pay attention to that as well. Maybe it helps.

    So the brief question is ; what is the main difference between VPI and ETH mode if you have a DAC cable and later if you have an Infiniband (only) switch ; does that require the card to be in VPI (IB) mode because the switch doesn't support ETH mode ?

    Sorry for the noob questions. I've been dealing with networks for 20 years but Infiniband is new to me :)

    I bought the cards because they were cheap for me ( $135 per piece ) and the DAC cable was just $21 new and the switch $200 which sounded like a very good deal at $5.50 per 40 Gbit port :)
     
    #28
  9. svtkobra7

    svtkobra7 Active Member

    Joined:
    Jan 2, 2017
    Messages:
    219
    Likes Received:
    37
    • 2 GB/s (gigabytes per second) ≠ 20 Gbps (gigabits per second), i.e. 2 GB/s ÷ 0.125 = 16 Gbps < 20 Gbps.
    • (One of us needs to check our math, I hope it's me).
    • Essentially you are asking why is fiber superior to copper over distance?
      • Frequency = (1) Light > copper generally, and (2) via copper, inductance and capacitance increase over distance, reducing frequency.
      • Fiber is not impacted by EMI.
      • But this question is out of scope (educational in nature), since you have a 3 m DAC.
    As to the rest I'll leave it to someone better educated (I'm a noob as well, no IT background here). I'm not educated on IB whatsoever (barely more on 40G ETH), but I want to say it only has advantages at bitrates much higher than you are seeing at the moment, and further want to say the consensus is such that the additional benefit isn't worth the additional complexity (but that may be confirmation bias at play). After burning the firmware, I set the cards to ETH (2) and haven't looked back =>
    Code:
    ./mlxconfig -d [mt4099_pci_cr0] set LINK_TYPE_P1=2 LINK_TYPE_P2=2
    , where [mt4099_pci_cr0] = the pci device name returned by
    Code:
    ./mlxfwmanager --query
    but I'm "playing" with a limited tool kit and can't even use OFED, etc. (limitations you aren't subject to).
     
    #29
  10. i386

    i386 Well-Known Member

    Joined:
    Mar 18, 2016
    Messages:
    1,491
    Likes Received:
    338
    I'm pretty sure that the cards can do that and that the question is if the operating system and the applications can handle a 40gbit link with a single thread.

    2 cores @ 2.2 GHz should be able to saturate a 40 Gbit/s link (Pentium D1508):
    [Attached image: Unbenannt.JPG]
     
    #30
    40gorbust likes this.
  11. 40gorbust

    40gorbust New Member

    Joined:
    Saturday
    Messages:
    9
    Likes Received:
    0
    Ok you got me on the 2 GB/s == 20 Gbit, which is off by a factor of 1.25. I was just lazy doing 1:10, thinking of a bit of overhead and such.

    About VPI (IB?) vs ETH I'm still a bit unclear ; which one is better for the card, besides the fact it seems the card can do 56 Gbit in VPI(IB?) mode vs 40G max in ETH mode.

    Also someone wrote that "ipoib" emulation would yield lower results. Basically I just want to be able to transfer files quickly from one server to another ; one will be a workhorse with plenty of GPUs for Machine Learning and the other would be a fileserver with a lot of HBAs and probably PCIe cards with NVME SSDs, a tape-drive for backups (LTO5) and a bunch of normal HDDs for medium-term cheap storage.

    I'm just used to using TCP/IP to connect servers together (ignoring IPX from back in the days and such) so I don't "need" eth(ernet), I'm fine if I can ping over VPI / IB no matter what is needed for that, with low CPU usage if possible and high throughput.

    If this experiment with the 40G cards works I might try to get my hands on a 100G Mellanox card but that one was around $300 so a bit expensive with 2 for just an experiment without knowing if it would work. Forums are great. I loved reading this thread and learning so much in just an hour.

    Are there Eth(ernet) 40Gb switches with QSFP+ connectors requiring the alternative ETH mode on the Mellanox card ? What does everyone think of the VPI/IB mode vs ETH mode ; which one is 'better' or 'faster' or 'more optimized' ? Sorry for the noob questions but I learned that it's better to ask questions than to pretend you're smart but aren't :)
     
    #31
  12. 40gorbust

    40gorbust New Member

    Joined:
    Saturday
    Messages:
    9
    Likes Received:
    0
    Impressive speeds for that CPU (usage). This also helps me to not give up if we cannot reach (near) 40 Gbit on these 2 servers. Danke!
     
    #32
  13. 40gorbust

    40gorbust New Member

    Joined:
    Saturday
    Messages:
    9
    Likes Received:
    0
    1. Ok so far we tried in 3 different mainboards ;

    1 x Intel S5500 (X58) PCIe Gen.2 X8 max with dual Intel Xeon E5645 cpus (2.4 Ghz)
    1 x Intel S5520HC PCIe Gen.2 X8 max with dual Intel Xeon E5645 cpus (2.4 Ghz)
    2 x Intel S2600CP PCIe Gen.3 X8 max with dual Intel Xeon E5-2470 cpus (2.3 Ghz)

    Linux, with kernels 3 and later during testing with 4.9.0.0.bpo.8 on Debian 9.6 (Stretch).

    First I had the two ConnectX 40Gb CX354A (rev A2) cards in the S5500 and S5520HC. The average speed was around 10 Gbit (fluctuating) with iperf -c 10.0.0.2 -w 416k -P 4 ; this gave the best result ; the best we saw was around 19 to 20 Gbit.

    In retrospect ; Gen.2 PCIe x8 can do max 4GByte/sec or 32 Gbit. So we'd never be able to see more than that.

    So we put the cards in the newer X79 mainboards ; two S2600CPs with each two E5-2470 v1 CPUs at 2.3 Ghz (turbo to 3.1 Ghz on 2 cores, up to 2.8 Ghz on all cores)

    Results ; even worse ; hardly reached more than 10 Gbit between the two cards.

    mlnx_tune said the cards are in x8 slots, at 8 GT/sec. The connection is IB, both cards are in VPI link mode, the speed is 40 Gbit (4 x QFP)

    Testing with a single connection (-P 1) gives between 1.5 to 5 Gbit maximum but adding more doesn't give more cumulative bandwidth. Tested with -P 4,8,10,32,64 ; the total just doesn't come even close to 40 Gbit.
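Sweeping window size and stream count by hand gets tedious, so here is a minimal sketch that builds the iperf (v2) client command lines and pulls the aggregate figure out of the `[SUM]` line. It only uses the `-c`, `-w` and `-P` flags already seen in this thread; the helper names and the sweep values are illustrative, and a run with `-P 1` has no `[SUM]` line, so the parser returns `None` there.

```python
import re
import shlex

SUM_RE = re.compile(r"\[SUM\].*?([\d.]+)\s+([KMG])bits/sec")
UNIT_TO_GBIT = {"K": 1e-6, "M": 1e-3, "G": 1.0}


def iperf_command(server: str, window: str, streams: int) -> str:
    """Build an iperf (v2) client command line."""
    return shlex.join(["iperf", "-c", server, "-w", window, "-P", str(streams)])


def parse_sum_gbits(output: str):
    """Return the [SUM] bandwidth in Gbit/s, or None if there is no [SUM] line."""
    match = SUM_RE.search(output)
    if match is None:
        return None
    return float(match.group(1)) * UNIT_TO_GBIT[match.group(2)]


if __name__ == "__main__":
    # Print the commands to run; the window sizes are example values only.
    for window in ("416k", "1M", "4M"):
        for streams in (2, 4, 8):
            print(iperf_command("10.0.0.2", window, streams))
```

Running each printed command and feeding its output to `parse_sum_gbits` gives one number per combination, which makes it easier to see whether the total really plateaus regardless of `-P`.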

    2. we set the bottom slot to LINK_TYPE 2 (ETH) and both cards reported to be in ETH mode but ... no connection between the two bottom slots.

    Are we unlucky with these 2 cards, can they be broken where just 1 port works and only at a lower speed ? Can the 3 meter DAC cable be faulty (it's brand new) in any way ?

    Any suggestions?
     
    #33
    Last edited: Jan 17, 2019 at 10:15 AM
  14. RageBone

    RageBone Member

    Joined:
    Jul 11, 2017
    Messages:
    92
    Likes Received:
    17
    I'm in no way a pro, or experienced enough that i can answer all of your questions.
    As far as I understand the Mellanox "VPI" ("Virtual Protocol Interconnect") feature, it means that the cards and switches that support VPI can translate and communicate between the two. It seems to mean that a VPI switch with some Eth ports and some IB ports can translate between the two, basically unifying the network.
    Other than that, VPI isn't bound to IB, and isn't necessarily what is actually in use in your case.

    On Infiniband, there is a software layer emulating IP over Infiniband, ipoib, which seems to be heavily single threaded and limited by the cpu.
    If your application does not fully support IB or RDMA over IB, it is very likely that it is using ipoib. You can tweak it to get ok-ish performance, i have heard of about 25Gbit/s on CX2 QDR cards.
    Most tools like Ping and iperf have their own ib equivalent like ibPing which leads me to believe that your tests were using ipoib, hence the speeds of around 10Gbit/s.
    Especially since 10Gbits on ipoib seems to be the common default, at least when i played around with it on my cx2 and some Qlogic cards, a while back.

    So why don't you at least try Eth mode on those cards?
    And please keep a close look at cpu usage.
    And by the way, X79 does not have dual socket support, so it has to be some other chipset / platform!
     
    #34
  15. 40gorbust

    40gorbust New Member

    Joined:
    Saturday
    Messages:
    9
    Likes Received:
    0
    Thanks I read that ipoib is slower, but I have no idea IF that's what I'm using :)
    We tried to set the cards to ETH mode on the bottom slots, but, after switching the cable, there was no connection. I'll try again Friday, to see if I can set the top slots to ETH. I left them alone because they worked in IB mode and I didn't want to break something.

    I remember using iperf with -P 4 and reading the 19/20 Gbit SUM of the 4 connections; 4 cores were at about 50% load.

    ibPing is new to me, so I'll try that next. Yes we're only trying to make IP (TCP/IP) work over the cable, that's the goal of the 2 cards ; making a basic TCP/IP network.

    I am aware the single core performance of both the Xeon E5645 and Xeon E5-2470 isn't great, but I was hoping for more considering that at the time the ConnectX-3 cards were released those CPUs were pretty common.

    The E5645 is on socket 1366, X58
    The E5-2470 is on socket 1356, C602

    I wrote by mistake X79 because that's in other boards we use (Supermicro X9DRi and X9DRX).

    [2 images]

    The reason I wrote X79 is that I've seen tons of these: strange Chinese mainboards for really low prices (998 RMB is around $145, new).

    But we're not using those and never bought those ; worried too much stuff won't work (after a while), fire or other interesting things.
     
    #35
  16. llowrey

    llowrey Member

    Joined:
    Feb 26, 2018
    Messages:
    32
    Likes Received:
    11
    Although an x8 PCIe2 slot is capable of 4GB/s, that's the raw rate.

    The packet overhead is between 24 and 28 bytes. The payload size is limited to the smallest supported packet size along the path to the device. In my case with AMD Opterons it's 128 bytes. So, worst case, I'm sending 156 bytes for every 128 bytes of payload. That means that user payload is only ~82% of the bandwidth. I use 80% as a good approximation and therefore rate an x8 slot as 3.2GB/s instead of 4.0GB/s.

    The overhead for PCIe 3 is a bit larger at 26-30 but implementations tend to use much larger packet sizes (eg 512) so efficiency ends up much higher.
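The efficiency argument above is just payload divided by payload plus overhead. A minimal sketch using the worst-case overhead figures quoted in this post (28 bytes for PCIe 2, 30 bytes for PCIe 3); the function name is made up:

```python
def tlp_efficiency(max_payload: int, overhead: int) -> float:
    """Fraction of PCIe link bandwidth left for payload at a given packet size."""
    return max_payload / (max_payload + overhead)


# Worst-case overhead figures quoted above: 28 bytes (PCIe 2), 30 bytes (PCIe 3).
for label, payload, overhead in (
    ("PCIe 2, MaxPayload 128", 128, 28),
    ("PCIe 2, MaxPayload 512", 512, 28),
    ("PCIe 3, MaxPayload 512", 512, 30),
):
    print(f"{label}: {tlp_efficiency(payload, overhead):.1%}")
```

The 128-byte case lands at roughly 82%, while 512-byte packets get to around 94-95% on either generation, so the negotiated payload size matters more than the slightly larger PCIe 3 header.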

    https://www.xilinx.com/support/documentation/white_papers/wp350.pdf

    You can see the PCIe packet size in linux by running 'lspci -vv' and looking for MaxPayload. The value in the DevCap section will tell you what the device is capable of and the DevCtl value is what was negotiated.

    PCIe2 ConnectX-4 (x16): Capable of 512, negotiated to only 128
    Code:
                    DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 9.000W
                    DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                            RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                            MaxPayload 128 bytes, MaxReadReq 512 bytes
    
    PCIe3 ConnectX-3 (x8): Capable of 512 and negotiated to 512
    Code:
                    DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
                            ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 116.000W
                    DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                            MaxPayload 512 bytes, MaxReadReq 512 bytes
    
    Given that, in my case with PCIe2 and only 128 byte packets the best I can do with an x8 device is 3.2GB/s or 25.6Gb/s.
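Picking the two MaxPayload values out of `lspci -vv` output by eye gets old quickly, so here is a minimal parsing sketch. It assumes the text passed in covers a single device (for example from `lspci -vv -s <bus address>`, which usually needs root to show the capability sections); the function name is made up.

```python
import re


def max_payload_sizes(lspci_text: str) -> dict:
    """Extract DevCap (supported) and DevCtl (negotiated) MaxPayload in bytes."""
    sizes = {}
    section = None
    for line in lspci_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("DevCap:"):
            section = "DevCap"
        elif stripped.startswith("DevCtl:"):
            section = "DevCtl"
        match = re.search(r"MaxPayload (\d+) bytes", stripped)
        # Keep only the first value seen in each section.
        if match and section and section not in sizes:
            sizes[section] = int(match.group(1))
    return sizes


if __name__ == "__main__":
    sample = """
        DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
        DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                MaxPayload 128 bytes, MaxReadReq 512 bytes
    """
    print(max_payload_sizes(sample))
```

If the DevCtl value is smaller than the DevCap value, something else on the path (root port, chipset) negotiated the packet size down.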

    To get full 40GbE on my PCIe2 host I had to buy an x16 ConnectX-4 card. That allows for ~51.2Gb/s so more than enough for 40GbE. To get there, though, I had to tweak the linux network buffers by adding the following to /etc/sysctl.conf.

    Code:
    # increase TCP max buffer size settable using setsockopt()
    net.core.rmem_max = 536870912
    net.core.wmem_max = 536870912
    # increase Linux autotuning TCP buffer limit
    net.ipv4.tcp_rmem = 4096 87380 268435456
    net.ipv4.tcp_wmem = 4096 65536 268435456
    
    Before adding these values, the max TCP window size was too small to handle 40GbE. Mellanox has their own recommendations.

    https://community.mellanox.com/s/article/linux-sysctl-tuning

    With those settings in place I can hit 39Gb/s with iperf and 4 parallel streams.
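Why the window size matters can be estimated with the bandwidth-delay product: the number of bytes that must be in flight to keep the link busy. A minimal sketch; the round-trip times below are illustrative guesses, not measurements from this setup, and the function name is made up.

```python
def bdp_bytes(gbit_per_s: float, rtt_ms: float) -> int:
    """Bandwidth-delay product: bytes in flight needed to keep the link full."""
    return round(gbit_per_s * 1e9 / 8 * rtt_ms / 1e3)


for rtt_ms in (0.1, 0.5, 1.0):   # illustrative round-trip times, not measured
    print(f"RTT {rtt_ms} ms: {bdp_bytes(40, rtt_ms) / 1e6:.1f} MB in flight")
```

That total is shared across parallel streams, so with 4 streams each connection needs a window of roughly a quarter of it. Measuring the real RTT under load (plain ping is a starting point) tells you which row applies.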
     
    #36
  17. 40gorbust

    40gorbust New Member

    Joined:
    Saturday
    Messages:
    9
    Likes Received:
    0
    Yeah in the new mainboards we have Gen.3 x8 speed so not hitting any limits (7.9 GB/sec raw), in practice I see GPUs transfer around 6 GB/sec over PCIE 3 x8.

    I'll apply those OS tuning tips. Are your cards direct connected with a DAC to another card, or via a switch?
    Are your cards in IB or ETH mode ?

    For example using ~ /opt/mellanox/bin/mlxconfig -d /dev/mt4121_pciconf0 set LINK_TYPE_P1=2 LINK_TYPE_P2=2 ?

    (I copied this line, I used similar commands on the servers at work)
     
    #37
  18. llowrey

    llowrey Member

    Joined:
    Feb 26, 2018
    Messages:
    32
    Likes Received:
    11
    The cards are connected via a switch (Mellanox SX6108) using DACs and are in ethernet mode.
     
    #38