Mellanox 40GbE tuning on Windows? I have terrible performance.


aron

New Member
Jul 19, 2017
All,

I have connected two computers with Mellanox 40GbE.

Computer 1:
Threadripper 1950X with ConnectX-3 Pro EN MCX314A-BCCT
Running Windows 10

Computer 2:
i7-4790K on an Asus Z97-WS motherboard with ConnectX-3 Pro EN MCX314A-BCCT
Running Windows Server 2016.

Both have Samsung 970 SSDs. Transferring files over the network I get around 600 MB/sec. If I do 4 concurrent transfers I get around 400 MB/sec.

I really hoped 40GbE would be much quicker, particularly for single-stream transfers.

Do I have too high expectations? What speed do you get for NVMe-to-NVMe transfers over 40GbE? Any suggestions on how to tune things?

Regards,
Aron




With iperf3 I get:
E:\tools>iperf3.exe -c 192.168.8.4
Connecting to host 192.168.8.4, port 5201
[ 4] local 192.168.8.5 port 50233 connected to 192.168.8.4 port 5201
[ ID] Interval Transfer Bandwidth
[ 4] 0.00-1.00 sec 691 MBytes 5.79 Gbits/sec
[ 4] 1.00-2.00 sec 920 MBytes 7.72 Gbits/sec
[ 4] 2.00-3.00 sec 1.16 GBytes 9.93 Gbits/sec
[ 4] 3.00-4.00 sec 1.38 GBytes 11.8 Gbits/sec
[ 4] 4.00-5.00 sec 1.17 GBytes 10.1 Gbits/sec
[ 4] 5.00-6.00 sec 1.05 GBytes 9.01 Gbits/sec
[ 4] 6.00-7.00 sec 1.33 GBytes 11.4 Gbits/sec
[ 4] 7.00-8.00 sec 1.59 GBytes 13.6 Gbits/sec
[ 4] 8.00-9.00 sec 1.61 GBytes 13.8 Gbits/sec
[ 4] 9.00-10.00 sec 1.64 GBytes 14.1 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth
[ 4] 0.00-10.00 sec 12.5 GBytes 10.7 Gbits/sec sender
[ 4] 0.00-10.00 sec 12.5 GBytes 10.7 Gbits/sec receiver
iperf Done.


Both cards run in PCIe 3.0 x8 mode.

E:\tools>vstat

hca_idx=1
uplink={BUS=PCI_E Gen3, SPEED=8.0 Gbps, WIDTH=x8, CAPS=8.0*x8}
MSI-X={ENABLED=1, SUPPORTED=128, GRANTED=18, ALL_MASKED=N}
vendor_id=0x02c9
vendor_part_id=4103
hw_ver=0x0
fw_ver=2.42.5000
PSID=MT_1090111023
node_guid=ec0d:9a03:00ea:4390
num_phys_ports=2
port=1
port_guid=ee0d:9aff:feea:4390
port_state=PORT_ACTIVE (4)
link_speed=NA
link_width=NA
rate=40.00 Gbps
port_phys_state=LINK_UP (5)
active_speed=40.00 Gbps
sm_lid=0x0000
port_lid=0x0000
port_lmc=0x0
transport=RoCE v2.0
rroce_udp_port=0x12b7
max_mtu=2048 (4)
active_mtu=2048 (4)
GID[0]=0000:0000:0000:0000:0000:ffff:c0a8:0b05
port=2
port_guid=ee0d:9aff:feea:4391
port_state=PORT_ACTIVE (4)
link_speed=NA
link_width=NA
rate=40.00 Gbps
port_phys_state=LINK_UP (5)
active_speed=40.00 Gbps
sm_lid=0x0000
port_lid=0x0000
port_lmc=0x0
transport=RoCE v2.0
rroce_udp_port=0x12b7
max_mtu=2048 (4)
active_mtu=2048 (4)
GID[0]=0000:0000:0000:0000:0000:ffff:c0a8:0805
GID[1]=fe80:0000:0000:0000:6c62:cf66:0b6c:5eda
 

fossxplorer

Active Member
Mar 17, 2016
Oslo, Norway
Did you try to tune according to: https://www.mellanox.com/related-do...ide_for_Mellanox_Network_Adapters_Archive.pdf
NUMA tuning isn't relevant on that CPU, I assume. Also test with the perf test utility suggested there, ntttcp IIRC.

EDIT1: That CPU you've got has enormous single-thread performance, so it should be able to achieve significantly higher bandwidth with TCP.
Also check IRQ affinity; I'm not sure how that's set on Windows though, my experience with these cards is from Linux (both RoCE v1 and TCP).
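
On Windows the closest thing to checking IRQ affinity is the RSS settings via the NetAdapter cmdlets. A minimal sketch from an elevated PowerShell prompt, assuming "Ethernet 3" as a placeholder adapter name and guessing at processor numbers; the "Jumbo Packet" display name and value depend on the Mellanox/WinOF driver:

# Show which CPUs the adapter's RSS queues are allowed to use
Get-NetAdapterRss -Name "Ethernet 3"

# Pin RSS to the first 8 cores (adjust to the cores/NUMA node closest to the NIC)
Set-NetAdapterRss -Name "Ethernet 3" -BaseProcessorNumber 0 -MaxProcessors 8

# List driver-specific knobs (jumbo frames, interrupt moderation, receive/send buffers, ...)
Get-NetAdapterAdvancedProperty -Name "Ethernet 3"

# Enable jumbo frames on both ends if the whole path supports it
Set-NetAdapterAdvancedProperty -Name "Ethernet 3" -DisplayName "Jumbo Packet" -DisplayValue "9014"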
 

i386

Well-Known Member
Mar 18, 2016
Germany
I would say wrong expectations ._.

Instead of iperf, use ntttcp on Windows systems.
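
Something along these lines, reusing the address from your iperf run as a placeholder; the thread count, buffer size and duration are just starting points:

# Receiver side (192.168.8.4), 8 threads, run for 15 seconds
ntttcp.exe -r -m 8,*,192.168.8.4 -t 15

# Sender side, 8 threads, 128 KB buffers, 2 outstanding I/Os per thread
ntttcp.exe -s -m 8,*,192.168.8.4 -l 128k -a 2 -t 15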

What benchmark tool do you use to measure storage performance?
Have you tried it locally?
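
For the local storage side, a quick diskspd run on each 970 takes the network out of the picture entirely; the path and file size below are placeholders:

# Sequential 1 MB reads, 1 thread, queue depth 8, caching bypassed, 20 GB test file, 30 seconds
diskspd.exe -b1M -d30 -o8 -t1 -w0 -Sh -c20G E:\diskspd-test.dat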
 

iceisfun

Member
Jul 19, 2014
Yes, I think it's quite important to decouple storage performance from network performance.

Most of the testing I have done with these Mellanox CX3 cards gets about 22 Gbit/sec from host to switch to host.

For me, this is a very typical iperf3 output for a single thread from one Xeon 2697v2 to another 2697v2 across a Nexus N3K, both hosts having a LACP link:

[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-1.00 sec 1.68 GBytes 14.4 Gbits/sec 7 799 KBytes
[ 4] 1.00-2.00 sec 2.54 GBytes 21.9 Gbits/sec 10 870 KBytes
[ 4] 2.00-3.00 sec 2.53 GBytes 21.7 Gbits/sec 5 615 KBytes
[ 4] 3.00-4.00 sec 2.55 GBytes 21.9 Gbits/sec 3 679 KBytes
[ 4] 4.00-5.00 sec 2.55 GBytes 21.9 Gbits/sec 9 563 KBytes
[ 4] 5.00-6.00 sec 2.55 GBytes 21.9 Gbits/sec 14 1.23 MBytes
[ 4] 6.00-7.00 sec 2.54 GBytes 21.8 Gbits/sec 27 1.09 MBytes
[ 4] 7.00-8.00 sec 2.54 GBytes 21.8 Gbits/sec 9 601 KBytes
[ 4] 8.00-9.00 sec 2.54 GBytes 21.8 Gbits/sec 46 874 KBytes
[ 4] 9.00-10.00 sec 2.55 GBytes 21.9 Gbits/sec 6 619 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.00 sec 24.6 GBytes 21.1 Gbits/sec 136 sender
[ 4] 0.00-10.00 sec 24.5 GBytes 21.1 Gbits/sec receiver


* Edit: The best performance I have ever gotten from this kind of setup with tuning is 29-31 Gbit/sec. Also, all of my Windows hosts with this setup are on Windows Server 2016 using CX2 and LACP.
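
If it helps, a single TCP stream rarely fills a 40GbE (or LACP'd) link on its own. Something like this usually gets closer to the aggregate numbers above; the stream count and socket buffer are guesses, and the address is just the one from the original post:

# Four parallel streams, 1 MB socket buffer, 30 second run (add -R to test the reverse direction)
iperf3.exe -c 192.168.8.4 -P 4 -w 1M -t 30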