Fastest File Transfer over 40GbE


Myth

Member
Feb 27, 2018
148
7
18
Los Angeles
Hi Guys,

I just bought a few 40GbE Mellanox ConnectX-3 cards and connected two storage servers over an active QSFP+ optical cable. We use RAID arrays which are very fast. In IOmeter we get around 5100mbps across the servers over the 40GbE connection.

As a side note, we had to run "opensm", which starts a service, to get the two servers to see each other, since we don't have a switch.

So the deal is, we want the fastest possible file transfer speeds. We work largely in the Media and Entertainment market, so people need to move data around fast!

We don't see much difference between 40GbE and 10GbE as far as file transfer is concerned. For example, I need to move 100 .mov files of 55 GB each across the servers in record time, but I max out at about 1.2 GB/s. That's using TeraCopy with 9 threads. With Windows Explorer file transfer it's around 800 MB/s.

These are the same speeds I get on my 10GbE connection. I believe it's because 40GbE uses 4x10GbE and somehow combines those four lanes. While a program like IOmeter can access them, the file transfer process cannot.

Does anyone have any ideas on how to get these file transfer speeds up, direct server to server? I've tried running TeraCopy, RichCopy, and File Explorer simultaneously, but they all seem to share a 1.2 GB/s pipe.

We use Windows Server 2016 and connect directly by manually assigning IP addresses and then launching the "opensm" executable that sits inside the driver files. So that's how we connect them over 40GbE. And we can get about 48 Gb/s on speed tests, but nowhere near that while doing file transfers.

I'm already aware of the different file types (.dpx, .mov, .mxf) and their transfer rate differences, so it's not the file type that is the problem; I'm comparing the same types of files over 10GbE vs 40GbE and the speeds are the same.

Any advice would be appreciated.
-Myth
 

Aluminum

Active Member
Sep 7, 2012
431
46
28
The 4x10 is at the interface, i.e. layer 1; the OS cannot "see" it. It can be provisioned/split with the right cable and switches etc., but it's still layer 1.

File transfer protocols generally suck. Try SMB Direct (RDMA) if you can.
 

SlickNetAaron

Member
Apr 30, 2016
50
13
8
43
I'm fairly certain that if you have to run opensm for the link to come up, it means you are running the cards in InfiniBand mode, not Ethernet. That means not having IPs on the interfaces at all, unless you're running IPoIB, which doesn't have the same performance. Past that observation, I'm not familiar with native InfiniBand transfers.
 

i386

Well-Known Member
Mar 18, 2016
4,220
1,540
113
34
Germany
I'm not familiar with native InfiniBand transfers
Fast if your underlying storage is fast too (ramdisk ftw :D)

In IOmeter we get around 5100mbps across the servers over the 40GbE connection
Is that megabit per second?

We use Windows Server 2016 and connect directly by manually assigning IP addresses and then launching the "opensm" executable that sits inside the driver files. So that's how we connect them over 40GbE. And we can get about 48 Gb/s on speed tests, but nowhere near that while doing file transfers.
I think you meant QSFP+ rather than 40GbE?
48 Gbit/s is the speed of your network, not of your storage. If you use HDDs in your storage, you will need 28+ HDDs in RAID 0 writing at ~150 MByte/s to saturate such a link.
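
As a rough sanity check on that estimate (my own back-of-the-envelope numbers, nothing measured on your hardware):

# How many HDDs does it take to saturate the measured link speed?
# Assumptions: ~48 Gbit/s usable link, ~150 MByte/s sustained per HDD.
link_gbit_s = 48
hdd_mbyte_s = 150

link_mbyte_s = link_gbit_s * 1000 / 8        # 48 Gbit/s ~= 6000 MByte/s
hdds_needed = link_mbyte_s / hdd_mbyte_s     # ~40 drives in RAID 0

print(f"{link_mbyte_s:.0f} MB/s link -> ~{hdds_needed:.0f} HDDs at {hdd_mbyte_s} MB/s each")
# So "28+" is a comfortable lower bound; with these assumptions the exact figure is ~40.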

Speaking of storage... what does it look like?
ZFS-based? Hardware RAID? :D
 

rysti32

New Member
Apr 14, 2018
10
3
3
39
I'm fairly certain that if you have to run opensm for the link to come up, it means you are running the cards in InfiniBand mode, not Ethernet. That means not having IPs on the interfaces at all, unless you're running IPoIB, which doesn't have the same performance. Past that observation, I'm not familiar with native InfiniBand transfers.
You are correct. The card must be in InfiniBand mode. I'd recommend contacting Mellanox support and asking how to force the cards into Ethernet mode, as that will be the first step toward getting better performance out of them.
 

Aestr

Well-Known Member
Oct 22, 2014
967
386
63
Seattle
I'm with @i386 on this. If you weren't maxing out 10GbE and moving to a much faster link gets roughly the same speeds, you need to start looking at the rest of the system, most likely the storage that's feeding the link. Unfortunately that usually ends up costing a lot more than just upgrading NICs.

As asked above, give us as much info as you can about the storage systems on both sides. Also, have you tried running iperf between the two systems to take the storage out of the equation and see what sort of network performance you're getting?
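
If iperf isn't already installed on the Windows boxes, a bare-bones single-stream test can be improvised in Python (just a sketch; the port number is arbitrary and this is no substitute for iperf's options):

# Minimal single-stream TCP throughput test (sketch only, not a replacement for iperf).
# Run "python tput.py server" on one box and "python tput.py client <server-ip>" on the other.
import socket, sys, time

PORT = 5001                 # arbitrary port, open it in the firewall on the receiving side
CHUNK = 4 * 1024 * 1024     # 4 MiB per send/recv call

def server():
    with socket.create_server(("", PORT)) as srv:
        conn, _ = srv.accept()
        total, start = 0, time.time()
        while data := conn.recv(CHUNK):
            total += len(data)
        secs = time.time() - start
        print(f"received {total / 1e9:.1f} GB in {secs:.1f} s = {total * 8 / secs / 1e9:.1f} Gbit/s")

def client(host, seconds=10):
    payload = b"\0" * CHUNK
    with socket.create_connection((host, PORT)) as conn:
        end = time.time() + seconds
        while time.time() < end:
            conn.sendall(payload)

if __name__ == "__main__":
    server() if sys.argv[1] == "server" else client(sys.argv[2])

If a single stream tops out around 10 Gb/s here too, the limit is the network path or TCP tuning rather than the disks.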
 

Myth

Member
Feb 27, 2018
148
7
18
Los Angeles
Hi guys,

Thanks so much for all the wonderful replies. We use Adaptec RAID controller cards with 16 HDDs. Each array is striped as RAID 6 and we get around 3,000 MB/s read and write. I've also attached 8x NVMe drives and striped them as RAID 0; I get around 5,000 MB/s read and write with those. However, with IOmeter I can get around 14,000 MB/s on the NVMe drives and around 6,000 MB/s on the HDDs.

After extensive testing I've discovered that IOmeter uses each "worker" as a thread on the CPU, so it's interesting how the speeds increase with the number of threads. We also use AJA System Test for benchmarking, but it only uses one thread.
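
A rough way to see the same worker/thread scaling outside IOmeter (just a sketch; the test directory is a placeholder, and Windows file caching will inflate repeat runs):

# Time parallel sequential reads with 1, 2, 4 and 8 threads (sketch; adjust path and block size).
import glob, time
from concurrent.futures import ThreadPoolExecutor

FILES = glob.glob(r"D:\iotest\*.mov")    # placeholder: a folder of large test files
BLOCK = 8 * 1024 * 1024                  # 8 MiB reads

def read_file(path):
    total = 0
    with open(path, "rb", buffering=0) as f:
        while chunk := f.read(BLOCK):
            total += len(chunk)
    return total

for workers in (1, 2, 4, 8):
    start = time.time()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        nbytes = sum(pool.map(read_file, FILES))
    print(f"{workers} threads: {nbytes / (time.time() - start) / 1e6:.0f} MB/s")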

I've noticed on the NVMe drives that if I use the Samsung driver I get much better read/write performance, but that Samsung NVMe driver only works on Windows 10.

Anyway, I did some internal testing yesterday after posting this and started moving files around internally between the NVMe array and the HDDs. I also opened Resource Monitor. When I transfer files through File Explorer (drag and drop), it shows transfer speeds of around 500 MB/s, while Resource Monitor shows 600-1600 MB/s.

When I do an internal copy and paste on the NVMe drives I get similar results for the File Explorer copy rate, but Resource Monitor shows 600-2100 MB/s. So it seems Resource Monitor is better able to track the speed. I'm just a bit perplexed by how complex this all is.

And to answer your questions from before: my co-worker was able to change some settings and get the Mellanox cards to work in Ethernet mode, but he said the speeds were not as good as InfiniBand. I thought InfiniBand was faster than Ethernet?

That's all I can think of for now. What I might do is break up the NVMe drives and do file transfers between the two servers, each with a 4x NVMe RAID 0 array, to see if that improves the speeds. My HDDs are not the best; we normally use 6TB Seagates, but the ones I'm testing with are 4TB Toshibas and they are much slower.

Thanks for reading.
 

Myth

Member
Feb 27, 2018
148
7
18
Los Angeles
Fast if your underlying storage is fast too (ramdisk ftw :D)


Is that megabit per second?


I think you meant QSFP+ rather than 40GbE?
48 Gbit/s is the speed of your network, not of your storage. If you use HDDs in your storage, you will need 28+ HDDs in RAID 0 writing at ~150 MByte/s to saturate such a link.

Speaking of storage... what does it look like?
ZFS-based? Hardware RAID? :D
I'm sorry to say I'm not quite following. QSFP+ is just the hardware connector, so it could be 40GbE or InfiniBand... ugh, I get so confused.

Our storage is NTFS; we are a SAN server manufacturer. We are using NVMe drives to try to saturate the link, still working out the kinks.
 

Rand__

Well-Known Member
Mar 6, 2014
6,626
1,767
113
OK, I'd go and create two RAID 0 arrays with 4 NVMe drives each. Try to get local copying (TeraCopy/Robocopy with manual concurrent copies, or any multithreaded tool) to acceptable levels (i.e. 3-4 GB/s). Watch CPU load, ensure max-performance BIOS settings, etc.
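
As a crude stand-in for a multithreaded copy tool (a sketch only; the source/destination paths are placeholders and there's no verification or error handling like TeraCopy has):

# Copy a folder with one file per worker thread (sketch; tune WORKERS until throughput stops scaling).
import shutil
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor

SRC = Path(r"E:\raid0_a\project")   # placeholder source array
DST = Path(r"F:\raid0_b\project")   # placeholder destination array
WORKERS = 8

def copy_one(src_file):
    target = DST / src_file.relative_to(SRC)
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src_file, target)          # data only, no metadata or checksum verify
    return src_file.stat().st_size

files = [p for p in SRC.rglob("*") if p.is_file()]
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    copied = sum(pool.map(copy_one, files))
print(f"copied {copied / 1e9:.1f} GB across {len(files)} files")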

If you get that done you can move on to network copying. RDMA might be a good option for you on Windows; otherwise check for InfiniBand optimization parameters. On Ethernet one would work with send/receive buffers, queue lengths, offloads, etc.; I'm not sure whether the same knobs are available for IB too.
 

Rand__

Well-Known Member
Mar 6, 2014
6,626
1,767
113
I assume you have seen the basic video on this if you are into media (video embed). Have you seen the follow-up too (video embed)? There he points out the need for RDMA to achieve higher speeds.

Unfortunately the third video never made it, it seems.
 

mrkrad

Well-Known Member
Oct 13, 2012
1,244
52
48
Make sure both the client and server are using SMB Multichannel, and that your cards are in a properly sized PCIe 3.0 x4, x8, or x16 slot depending on the performance desired.
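
For reference, the rough slot math (assuming ~985 MB/s of usable bandwidth per PCIe 3.0 lane after encoding overhead):

# Approximate usable PCIe 3.0 slot bandwidth vs. the ~5000 MB/s a 40GbE port can move.
PER_LANE_MB_S = 985             # ~8 GT/s with 128b/130b encoding, before protocol overhead
NIC_MB_S = 40_000 / 8           # 40 Gbit/s ~= 5000 MByte/s

for lanes in (4, 8, 16):
    slot = lanes * PER_LANE_MB_S
    verdict = "enough" if slot >= NIC_MB_S else "bottleneck"
    print(f"x{lanes}: ~{slot:.0f} MB/s -> {verdict} for a single 40GbE port")
# x4 (~3940 MB/s) would cap a 40GbE card; x8 or wider is needed for full line rate.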

I found my older 2013-ish cards limited to similar speeds in older servers (ESXi 5.5); upgrading to the newest DL360 series more than quadrupled network performance at 10GbE! Those bargain parts off eBay tend not to be as performant as the latest and greatest!
 

Myth

Member
Feb 27, 2018
148
7
18
Los Angeles
I assume you have seen the basic video on this if you are into media (video embed). Have you seen the follow-up too (video embed)? There he points out the need for RDMA to achieve higher speeds.

Unfortunately the third video never made it, it seems.
What's the link to the third video? I just watched these, very helpful!
 

Rand__

Well-Known Member
Mar 6, 2014
6,626
1,767
113
There doesn't seem to be one, unfortunately :/ At least I haven't found it on his channel.
 

zkrr01

Member
Jun 28, 2018
106
6
18
Myth, did you ever figure out why the file transfers were so slow? You commented that "I believe it's because the 40GigE uses 4x10GigE".
If that is true, why are you using 4x10 links and not one 40Gb link?
 

maze

Active Member
Apr 27, 2013
576
100
43
Myth, did you ever figure out why the file transfers were so slow? You commented that "I believe it's because the 40GigE uses 4x10GigE".
If that is true, why are you using 4x10 links and not one 40Gb link?
40 gig can be either a single 40G flow or 4x10G flows, depending on the hardware IIRC, which would be the reason it's not doing more than 10G per flow.

Correct me if I'm wrong.
 

zkrr01

Member
Jun 28, 2018
106
6
18
The title of the post is "Fastest File Transfer over 40GbE". If that is the goal and there is no business requirement for InfiniBand or a 4x10 setup, then I would have selected a 40GbE NIC such as the MCX4131A-BCAT or similar.

I transfer large Hyper-V files at speeds limited by fast SSDs, not by the ~5,000 MB/s speed of the 40GbE link. If your equipment can go faster than 5,000 MB/s, then select a 100GbE NIC, which can do 12,500 MB/s. The cost of these NICs pales next to the cost of the other equipment in use and makes the best use of that equipment.
 

aero

Active Member
Apr 27, 2016
346
86
28
54
40 gig can be either a single 40G flow or 4x10G flows, depending on the hardware IIRC, which would be the reason it's not doing more than 10G per flow.

Correct me if I'm wrong.
All 40GbE is technically 4x 10G lanes. However, it is either split out into individual 10G links, or it isn't.

If you're connecting 40GbE to 40GbE interfaces, then all traffic is split equally over these lanes, NOT like a LAG. It's not per flow; it's at the bit level. You should be able to achieve full throughput on a single traffic flow (at least the switch wouldn't be limiting you).

Edit: updated a couple words to be a bit more precise
 

fossxplorer

Active Member
Mar 17, 2016
554
97
28
Oslo, Norway
It's a relatively old thread, but have you figured it out, @Myth?

Hi Guys,
So that's how we connect them over 40GbE. And we can get about 48 Gb/s on speed tests, but nowhere near that while doing file transfers. -Myth
The above might indicate you are using IB mode and not Ethernet mode. If you are using the MLNX_OFED drivers, there are tools (at least on Linux, not sure about Windows) to put the card into Ethernet mode. Then you could use iperf(3) to do a quick test between the servers and also run some of the RDMA tools, e.g. ib_send_bw, to check the RoCE speeds :)

What I've experienced from writing a master's thesis on these cards and RoCE is that without applications supporting RDMA (i.e. plain TCP transfers), and using a single CPU thread, it's hard to achieve anything above ~22 Gb/s. Adding more threads to iperf(3) doesn't scale linearly; IIRC I got around ~35 Gb/s using 3 or 4 threads.
With RoCE, using a single thread, I got ~38 Gb/s.

But it was all on Linux!


EDIT: I was doing some research with different tuning options, and you should definitely follow the Mellanox tuning guide for these cards. The MTU is worth raising from the default 1500 to 9000/9500; also look at turning off CPU turbo/SpeedStep, hyperthreading, etc. NUMA node placement plays a part here as well.
 

Myth

Member
Feb 27, 2018
148
7
18
Los Angeles
It wasn't my setup, it was just the 40GbE technology. I've heard 25Gb Ethernet is faster; I'll try it next.
But jumbo frames are a must on 40GbE networks with media-rich files.

A standard 10GbE gives us about 650 MB/s and a 40GbE gives us about 800 MB/s with MTU 1500.
With MTU 9000, 10GbE gives 950 MB/s and 40GbE gives 1750 MB/s.
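
Part of why jumbo frames help so much is simply packet rate; a rough calculation (assuming ~40 bytes of IP+TCP headers per packet and ignoring offloads):

# Packets per second needed at a given throughput, MTU 1500 vs 9000 (rough, headers only).
for mtu in (1500, 9000):
    payload = mtu - 40                       # minus ~40 bytes of IP + TCP headers
    for mbyte_s in (800, 1750):              # the 40GbE figures above
        pps = mbyte_s * 1_000_000 / payload
        print(f"MTU {mtu}: {mbyte_s} MB/s needs ~{pps / 1000:.0f}k packets/s")
# ~1,200k packets/s at MTU 1500 vs ~195k at MTU 9000 for 1750 MB/s: far less per-packet CPU work.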

We do our benchmark tests with AJA System Test at 4K.