Fastest File Transfer over 40GbE


aero

Active Member
Apr 27, 2016
If you're nowhere near the maximum theoretical throughput of 40GbE, don't expect that 25GbE is going to magically result in better performance. I've been able to achieve 850MB/s on 10GbE and 3.4GB/s on 40GbE (multi-stream aggregate), all at the default 1500 MTU, with no tuning beyond optimizing block sizes. And that's real disk I/O, not raw network iperf, where I'm actually constrained by the disks, not the network.

I don't think the problem is the 40GbE technology itself... you need to do some tuning.
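Before any protocol tuning, it's worth checking what the link itself can do with iperf. A minimal sketch, assuming the older iperf2 syntax used elsewhere in this thread and a placeholder server address:

Code:
# On the receiving host: start an iperf2 server (TCP, port 5001 by default)
iperf -s

# On the sender: 192.168.1.10 is a placeholder for the server's address
iperf -c 192.168.1.10 -t 30          # single stream, 30 seconds
iperf -c 192.168.1.10 -t 30 -P 4     # 4 parallel streams (aggregate view)
# -w <size> can be added to enlarge the TCP window if a single stream
# stalls well short of line rate.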
 

fohdeesha

Kaini Industries
Nov 20, 2016
fohdeesha.com
Yeah, it's worth stating again: 40GbE is only 4x 10GbE lanes at the hardware level, and they are multiplexed by the NIC/switch ASIC into a single serial link. Your applications, OS, everything past the physical plug just sees a single 40GbE link. I have no issue getting 39Gbps with a single stream over a 40GbE connection; 25GbE will obviously do far less than that. If 40GbE behaved like 4x 10GbE links in a LACP group, as some people seem to think, it never would have been adopted as an Ethernet standard.
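As a quick sanity check that the OS really does see one logical link, a sketch for Linux (the interface name is just an example):

Code:
# Replace enp1s0 with the actual ConnectX-3 port name on your system
ethtool enp1s0 | grep -E 'Speed|Link detected'
# A healthy 40GbE port should report a single link at "Speed: 40000Mb/s",
# not four separate 10GbE interfaces.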
 

Myth

Member
Feb 27, 2018
Los Angeles
Our internal storage speeds are about 4400MB/s, so it's not the internal SSDs. Yeah, with Disk I/O meter I can get better performance, but with AJA it is only one thread. So I'm curious to know how you got 39Gbps; is that Gb or GB? What benchmark tool did you use, @fohdeesha?
 

fohdeesha

Kaini Industries
Nov 20, 2016
fohdeesha.com
The industry-standard way to test network performance: iperf. Disk speed tests and file-sharing protocols over Ethernet are not going to tell you much about how the Ethernet interconnect itself is performing. Your local disk speeds will always be faster; sending files over the network adds an entire extra layer of complexity and CPU strain, so what you're seeing is normal. You can get closer to 40Gbps disk transfers over the network by moving to a sharing protocol made for it; vanilla Samba will never get close to this. You need something with RDMA, RoCE, etc.
 

fohdeesha

Kaini Industries
Nov 20, 2016
fohdeesha.com
What you need to do is run iperf like others have recommended; this will rule out your network layer. You should have no problem getting 38 or 39Gbps with a default MTU of 1500 on modern systems. If you do, that rules out any issues with the Ethernet interconnect and means the bottleneck is in either the file-sharing protocol, its overhead, or the CPU doing that work.
 

Myth

Member
Feb 27, 2018
Los Angeles
Yes, I did run iperf and I was getting good results. I was just trying to improve the AJA results, since when people stream media they seem to do it across one thread, so the aggregate performance doesn't seem to help as much. At least that's my understanding.

We also use ATTO Fibre Channel and can get 3200MB/s over a 32Gb Fibre Channel link with AJA, but the most I've gotten on 40GbE is 1750MB/s.
 

fohdeesha

Kaini Industries
Nov 20, 2016
fohdeesha.com
iperf (with default settings) is one thread, and again it shouldn't have a problem maxing out 40GbE on a modern system. Fibre Channel was designed from the ground up for fast transfers with minimal overhead by moving raw block data, so it will be faster than a plain file-sharing protocol like SMB over Ethernet without some tuning. Your bottleneck isn't the 40GbE Ethernet; it's the file-sharing protocol and the CPU overhead of sharing files and block data over Ethernet instead of Fibre Channel. You could move the systems over to 100GbE cards, and I can promise your speeds will stay exactly the same.

You need to use a sharing protocol designed for fast transfers over Ethernet, something with RDMA.
 

zkrr01

Member
Jun 28, 2018
@Myth Have you tried using robocopy with one thread to copy a very large file or directory on your fastest SSD from one system to another system with the same type of SSD? Use the logging feature of robocopy to record the elapsed time.

In my tests I was transferring a 435GB directory between standard Samsung 970 Pro SSDs (3500MB/s read, 2700MB/s write) and I was getting 2866MB/s. I was limited by the read/write speeds of the SSDs, not by the 5000MB/s speed of the network. But it's still faster than the 1750MB/s you say you are getting, which I don't understand. It makes me think it may be related to the network NICs you are using, as several people suggested at the beginning of this thread. In any case, something other than the 40GbE network is slowing your tests down.

Since you are using Windows Server 2016, it would be interesting to know the speeds you get with RDMA (if your NICs support it).
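If the NICs do support RDMA, the inbox PowerShell cmdlets on Server 2016 will show whether SMB Direct is actually kicking in during a transfer; a minimal sketch (adapter and connection details will of course differ):

Code:
# Check whether RDMA is enabled on the 40GbE adapters (run on both ends)
Get-NetAdapterRdma

# On the file server: per-interface RSS/RDMA capability as SMB sees it
Get-SmbServerNetworkInterface

# On the client, during or right after a large copy: the
# "Client RDMA Capable" column should read True when SMB Direct is in use
Get-SmbMultichannelConnection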
 

fossxplorer

Active Member
Mar 17, 2016
Oslo, Norway
Interesting statement, @fohdeesha, and it contradicts all my research into getting close to line speed with a 40GbE adapter (Mellanox ConnectX-3) over TCP/IP. I wasn't able to get anywhere near line speed with iperf using the default settings, i.e. one thread as you point out. What I could get with iperf was around 25Gb/s IIRC, while with a RoCE-capable benchmark application I easily achieved around 39Gb/s.

Also, my findings at least indicate that an MTU of 9000 bytes is optimal for TCP/IP transfers, while 4200 bytes is optimal for RoCE applications. See the attached graphs.

Note that my findings are on 2 Mellanox ConnectX-3 VPI adapters connected back to back without any switch in between.
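For anyone wanting to reproduce the RoCE numbers, the usual tool is the perftest suite (ib_write_bw / ib_send_bw); a rough sketch, assuming mlx4_0 is the ConnectX-3 device name on your system and a placeholder server address:

Code:
# On the first host: start the bandwidth test server over RDMA-CM (RoCE)
ib_write_bw -d mlx4_0 -R --report_gbits

# On the second host: point it at the first host's IP (placeholder shown)
ib_write_bw -d mlx4_0 -R --report_gbits 192.168.1.10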

EDIT 1: there is a typo in the graph labels; it should read MTU 1500 RoCE (not MTU 1200 RoCE).

 

Attachments: throughput vs. MTU graphs for TCP/IP and RoCE (referenced above)

fohdeesha

Kaini Industries
Nov 20, 2016
fohdeesha.com
It completely depends on the processor, which is where you're mostly bound at these speeds due to the number of interrupts being generated. Between two servers with dual E5-2690 v2s and ConnectX-3s, with an ICX6610 in the middle, SR-IOV off (I see SR-IOV in your graph title; from experience, enabling SR-IOV takes away some performance on bare-metal systems), and the default 1500 MTU:


Code:
root@test4:~ # iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 64.0 KByte (default)
------------------------------------------------------------
[  4] local 192.168.1.169 port 5001 connected with 192.168.1.10 port 59967
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  43.9 GBytes  37.3 Gbits/sec
Note that this generated pretty much max CPU load, as 40Gbps at 1500-byte packets is a lot of packets to generate interrupts for. Increasing the MTU to 9000 will reduce the number of packets for a given flow, lowering CPU usage, but I usually don't bother for a myriad of reasons.
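If you do want to try jumbo frames, the interface-side change is trivial on Linux (a sketch; the interface name is an example, and every hop in the path, switch included, has to be configured for 9000 as well):

Code:
# Non-persistent change; replace enp1s0 with the actual interface name
ip link set dev enp1s0 mtu 9000
ip link show enp1s0 | grep mtu      # verify the new MTU took effect
# Re-run iperf afterwards; a device in the path still at 1500 will show up
# as drops and stalled transfers rather than a clean speedup.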
 

Myth

Member
Feb 27, 2018
Los Angeles
Yes, I did use robocopy. I bought these 40GbE adapters used off eBay for $50 each, so we thought it might be the Mellanox ConnectX-3 cards, but then we bought some new 40GbE ATTO cards (they also work in Mac systems) and got similar performance. But yes, robocopy does push the data much faster across the line and uses much more CPU, depending on the number of threads set in the options.

What have you found to be the best settings for the number of file-search, directory-search, and copy threads? I know it will depend on the CPU, but I'm curious; we usually use around 15/15/15 and get about 2.8GB/s as measured in Performance Monitor on Server 2016 under the disk tab.


P.S. We have 8 NVMe 970s striped into one RAID 0.
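One way to answer the thread-count question empirically with plain robocopy is to time a sweep of /MT values from PowerShell; a rough sketch with placeholder paths (the destination is wiped between runs so /MIR actually copies each time):

Code:
# Placeholder paths; point these at a representative test data set
$src = '\\server\share\testset'
$dst = 'D:\testset'

foreach ($mt in 1, 4, 8, 16, 32) {
    Remove-Item $dst -Recurse -Force -ErrorAction SilentlyContinue
    $t = Measure-Command {
        robocopy $src $dst /MIR /MT:$mt /R:0 /W:0 /NP /NDL | Out-Null
    }
    '{0,2} threads : {1:N1} s' -f $mt, $t.TotalSeconds
}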
 

zkrr01

Member
Jun 28, 2018
Once I get good results out of iperf and ntttcp, I move on to looking at the performance of production data. I use robocopy to copy data on my fastest SSDs from one workstation over the network to another and capture the elapsed time from the robocopy log. I have Samsung 970 Pro SSDs and I get 2866MB/s, which I consider good given the average read/write speeds of the SSDs. The robocopy parameters are (one copy thread, mirror mode, no retries, logging to capture the elapsed time):

ROBOCOPY /MT:1 /MIR /R:0 /W:0 /LOG:G:\files.log /NP /NDL \\Robin\g\HyperV-Exported-Systems \\Eagle\g\HyperV-Exported-Systems