Stumped by 10Gb speed limit on 40/56gb IB cards...

Sniper_X

Member
Mar 11, 2021
89
15
8
I'm using a pair of dual-port ConnectX-3 cards (Part# MCX354A-FCBT on FW 2.42.5000) and I seem to be stuck at 10Gb.
My ultimate goal is to get to pure Infiniband connectivity between the servers and the storage.

I have Dell PowerConnect 8132(N4032) switches AND a pair of Mellanox IS5022 switches.
Cards are running on Dell PowerEdge 710 & 715 servers, each one connected through the switches.

I have tried both and I'm getting 10Gb on each, but nothing more. :(
(Even though the link speed says 40)

I know that there has been quite a bit of discussion here on these cards and regarding being stuck at 10Gbe.

I would love some assistance on this and hope that someone here can assist me.

IB_CARD.png
 

Sniper_X

I have performed a few new tests using RAMdisks on each server and performing benchmarks.
I changed the maximum file size as well as changing the queue depth quite a bit.
I saw the most change by adjusting queue depth.
(Images attached.)

I have verified these cards are linked at 40Gbit and have tried many types of cables, including active optical links.

Is there anyone that can assist me in determining what the slowdown is?
 


Sniper_X

I'm still struggling with this issue.

Who here has faced this before?
I have read many similar posts and new FW resolved the problems, but I'm already running all the firmware mentioned.
 

i386

Well-Known Member
Mar 18, 2016
2,993
959
113
33
Germany
I looked at your screenshots again and I'm getting confused...
The screenshot in the op says ipoib adapter (infiniband) and shows a 40GBit/s link. How are the nics connected to the switches?
 

Sniper_X

i386 said:
I looked at your screenshots again and I'm getting confused...
The screenshot in the op says ipoib adapter (infiniband) and shows a 40GBit/s link. How are the nics connected to the switches?

They are connected to a single unmanaged switch (Mellanox IS5022) using the cables I mentioned above.
Yes, they are linked at 40Gb.
The screenshots are right.
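
As a rough sanity check on what a "40Gb" InfiniBand link can actually carry, here is a minimal sketch. It assumes the link through the IS5022 runs at 4x QDR (the usual meaning of a 40Gb InfiniBand link) and uses the standard per-lane signalling rates and line encodings. The function names and the 1250 MB/s figure are illustrative only, not measured values from this thread.

```python
# 4x InfiniBand link rates: per-lane signalling rate (Gbit/s) and line encoding.
IB_RATES = {
    "QDR": (10.0, 8 / 10),      # 8b/10b encoding
    "FDR": (14.0625, 64 / 66),  # 64b/66b encoding
}

def ib_data_rate_gbit(rate: str, lanes: int = 4) -> float:
    """Data rate of an InfiniBand link after line encoding, in Gbit/s."""
    signalling, efficiency = IB_RATES[rate]
    return signalling * lanes * efficiency

def mb_per_s_to_gbit(mb_per_s: float) -> float:
    """Convert a benchmark result in MB/s (10^6 bytes) to Gbit/s."""
    return mb_per_s * 8 / 1000

print(f"QDR 4x: {ib_data_rate_gbit('QDR'):.1f} Gbit/s")  # 32.0 Gbit/s
print(f"FDR 4x: {ib_data_rate_gbit('FDR'):.1f} Gbit/s")  # 54.5 Gbit/s

# Hypothetical benchmark figure, for unit conversion only.
print(f"1250 MB/s = {mb_per_s_to_gbit(1250):.1f} Gbit/s")  # 10.0 Gbit/s
```

So even before IPoIB, SMB, or CPU overhead, a QDR link reported as 40Gb tops out at 32 Gbit/s of data, which is about 4 GB/s.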

The latest images from ATTO show that with various adjustments to queue depth and payload size, I can get over 10Gb, but nowhere close to 40Gb. Very frustrating.

I'm going to try and directly connect the cards with a single cable and see what happens.

All tests are done to a Windows 2016 Server sharing a 20Gbyte RAM Disk.
 

MichalPL

Member
Feb 10, 2019
61
9
8
We have been testing things like this, and in our testing the limit was mostly single-core CPU speed.

On a very similar card, a Mellanox ConnectX-3 40GbE, our max was about 4.1 GBytes/s (so ~theoretical max), with both computers running Win10.
The storage was a RAID0 of 3x first-gen Samsung 970 Evo; tested locally, it reached about 10-10.5 GBytes/s.
The CPU was a Xeon 1650 v2 overclocked to ~4.2GHz all-core, with 64GB of 1866MHz DDR3 in quad channel.

4.1 GBytes/s was not possible with a single NVMe (~3.6 max) or even with two, and a RAM disk was also slower ;)

But the real-life scenario, Linux + ZFS, is not that optimistic: instead of 10 GBytes/s, the max is around 4.4 GBytes/s on the same CPU (single file; with more files ~6.5 GBytes/s). Back then the "v2" was one of the fastest CPUs in single-core performance (faster than the v4 after overclocking; there was no Ryzen 3xxx yet).

We also built one server based on an old and super cheap HP 580 Gen8 with 4x Xeon 8895 v2 and put 12 Samsung 970 Evo 1TB drives inside. Local speeds are around 20 GBytes/s in software RAID0, but we have problems sustaining max speed over 40GbE (NUMA and QPI links).

Today the problem solver is a Ryzen 5800X, which is super fast in single core, + 6x NVMe PCIe 4.0 :) but unfortunately it has just 24 PCIe lanes :/

1. In short, please see if one of the CPU cores is at 100% when testing (and if it's boosting to max speed)
2. check RDMA
3. see: https://community.mellanox.com/s/ar...-with-performance-tuning-of-mellanox-adapters
4. What is your CPU ?
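
For step 1, a minimal sketch of how per-core readings could be checked for a single-core bottleneck. The function name, the 90% threshold, and the sample readings are all assumptions for illustration; real readings could be collected with a tool such as `psutil.cpu_percent(percpu=True)` or read off a performance monitor.

```python
def saturated_cores(samples, threshold=90.0):
    """Return indices of cores whose average utilisation meets the threshold.

    samples: one inner list of per-core percentages per sampling interval,
             e.g. [[core0, core1, ...], [core0, core1, ...]].
    threshold: assumed cut-off; a core near 95% is treated as saturated.
    """
    if not samples:
        return []
    core_count = len(samples[0])
    averages = [
        sum(reading[core] for reading in samples) / len(samples)
        for core in range(core_count)
    ]
    return [core for core, avg in enumerate(averages) if avg >= threshold]

# Hypothetical readings taken during a transfer (not measured data).
readings = [
    [96.0, 12.0, 8.0, 15.0],
    [95.0, 10.0, 9.0, 14.0],
]
print(saturated_cores(readings))  # [0]
```

A core that sits in the mid-90s for the whole transfer is effectively saturated even if it never shows exactly 100%.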
 

Sniper_X

MichalPL said:
4. What is your CPU ?

Monitoring CPU usage, I have several cores in view on the target server that is serving the RAMdisk.

None of the cores go to 100%; most of the time, when one gets close, it only reaches about 95 or 96%.
 

MichalPL

I just ran a local test on a computer very similar to the one on which we achieved 4.1 GBytes/s (my old desktop ;) ); unfortunately here I only have a 10GbE Mellanox.
The drive configuration is also very similar: very, very old Samsung 981a drives (similar to the 1st-gen Evo). This is how it performed under Windows 10:

disk_samsung981a.png


From what I remember, this is OK or almost OK (on the edge of maxing out the Mellanox 40G) for achieving 3.8-4.1GB/s over the network (under Win10, not a server OS). A slower SSD or slower CPU can't do this, because then the CPU goes to 100% or the drives are somehow unable to fill the buffers (but we didn't test it deeply enough to fully understand it). Can you check what is the speed of your RAMdisk locally using Crystal Disk Mark ? I am guessing RAM disk is too slow (software is too slow not RAM itself).
 

MichalPL

And this is a sample of NUMA hell and not enough QPI links ;) 12 faster disks (the Evo 1TB is faster than the old 981a 256GB) give a result only ~2x faster than 3x 981a. But here the CPUs can boost to 3.6GHz, or 3.2GHz on all 60 cores (the 1650 v2 was set to 4.1GHz all-core). What is funny is that sometimes it shows 20 or 24 GBytes/s ;)
hdd_numa_hell.png
 

Sniper_X

Just a note…

I have spent the last four hours performing tests to these RAMdisks under the following conditions:

  1. IPoIB - Direct Cable - ATTO - 4Gb File - Queue Depth 32
  2. IPoIB - Direct Cable - Crystal Disk Mark - 4Gb File - Queue Depth 32 - Threads 24
  3. IPoIB - Switch - MLNX-IS5022 - ATTO - 4Gb File - Queue Depth 32
  4. IPoIB - Switch - MLNX-IS5022 - Crystal Disk Mark - 4Gb File - Queue Depth 32 - Threads 24
  5. ETH - Direct Cable - ATTO - 4Gb File - Queue Depth 32
  6. ETH - Direct Cable - Crystal Disk Mark - 4Gb File - Queue Depth 32 - Threads 24
  7. ETH - Switch - DELL PowerConnect 8132 - ATTO - 4Gb File - Queue Depth 32
  8. ETH - Switch - DELL PowerConnect 8132 - Crystal Disk Mark - 4Gb File - Queue Depth 32 - Threads 24
So, I have benchmarks using both an Ethernet switch and an IB switch, and then using just a direct cable, via IPoIB and Ethernet.

Images attached.
 


MichalPL

I think the limiting factor is CPU performance when doing the "RAM disk calculations", hmm... The best way is to use NVMe disks, but I see there is a lack of PCIe 3.0 support, which means needing many NVMe drives to build a fast enough RAID0, like with a Mac Pro based on Core 2 Quad 3.06GHz Xeons (PCIe 2.0 limit).

What exactly is the CPU model?
 

Sniper_X

MichalPL said:
Can you check what is the speed of your RAMdisk locally using Crystal Disk Mark ? I am guessing RAM disk is too slow (software is too slow not RAM itself).

You have a point here.

See the attached results (exact same tests) run against the local RAMdisk.
 


MichalPL

Hmmm, that should be good or almost good :) When we were testing, 2x Samsung Evo on the local machine gave ~6.5GB/s and that was too slow to reach 4.1GB/s over the network. I also realized the next thing (this is why I am asking about the CPU model): maybe you don't have PCIe 3.0 support (which is what the spec lists), and then even with an x8 PCIe connector you are not able to max out 40GbE.
 

MichalPL

You can check it here; it should say 2 or 3 for speed, and 4 or 8 for link width.

To achieve 40GbE you need PCIe 3.0 and an x8 link. PCIe 3.0 x4 is slightly slower, and PCIe 2.0 x8 is even slower, if I remember the encoding correctly.

pcie 3.0 2.0 .png
 

MichalPL

Just realized you already pasted it in the first picture: yes, it's PCIe 2.0 (5GT/s) on an x8 link, which is not enough to utilize the full 40GbE bandwidth. So this is the issue; you need to put the card in a PCIe 3.0 computer (the oldest ones are the i7 3770 or Xeon 1620).
 

MichalPL

The second issue is probably the CPU being at the limit of its processing power here. (BTW, I just checked: PCIe 2.0 x8 is slightly faster than 3.0 x4, but still slower than 40GbE.)
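
The PCIe comparison can be checked with a little arithmetic. This is a minimal sketch using the per-lane transfer rates and line encodings of PCIe 2.0 and 3.0; the function name is made up for illustration, and the figures are per direction before packet (TLP/DLLP) overhead, so real throughput will be somewhat lower.

```python
# Per-lane transfer rate (GT/s) and line-encoding efficiency per PCIe generation.
PCIE_GENERATIONS = {
    "2.0": (5.0, 8 / 10),     # 5 GT/s, 8b/10b encoding
    "3.0": (8.0, 128 / 130),  # 8 GT/s, 128b/130b encoding
}

def pcie_gbit_per_s(generation: str, lanes: int) -> float:
    """Usable bandwidth of a PCIe link in Gbit/s, one direction."""
    rate_gt_s, efficiency = PCIE_GENERATIONS[generation]
    return rate_gt_s * efficiency * lanes

for gen, lanes in [("2.0", 8), ("3.0", 4), ("3.0", 8)]:
    gbit = pcie_gbit_per_s(gen, lanes)
    print(f"PCIe {gen} x{lanes}: {gbit:.1f} Gbit/s ({gbit / 8:.2f} GB/s)")

# PCIe 2.0 x8: 32.0 Gbit/s (4.00 GB/s)
# PCIe 3.0 x4: 31.5 Gbit/s (3.94 GB/s)
# PCIe 3.0 x8: 63.0 Gbit/s (7.88 GB/s)
```

Only the PCIe 3.0 x8 link has headroom above 40 Gbit/s; a PCIe 2.0 x8 slot caps the card at 32 Gbit/s before any other overhead.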
 

Sniper_X

This is a Dell R715.
PCIe 2.0 using an x8 slot.

The card is PCIe 3.0.

RAM disk was strike one.
PCIe version is strike two.

Seems I need a different platform.