Windows Server: file transfers are severely throughput constrained, why?

Frank173

Member
Feb 14, 2018
I am running Windows Server 2016 (and also tried the below on Windows 10 Pro for Workstations). I set up 2 separate RAM disks on the same machine, each with over 30GB of space. I then copy a single file of around 12GB onto one RAM disk and clear the Windows file cache. Subsequently, I drag the file from the first RAM disk onto the window of the second RAM disk.

What I am seeing is an average transfer speed of 1.5-2GB/second.

Question: what is bottlenecking the transfer speed here? Everything happens on one machine: a single multi-core CPU running at around 4GHz and two identical RAM disks, each with its own memory region in fast DDR4-2666 memory. I do not understand what constrains the transfer speed. I understand that the standard Windows file copy is single threaded, and that other file copy tools use multiple streams, but I need to speed up this particular process.

Can anyone shed light on this? It clearly seems to be a Windows software issue.

Edit: I am experimenting with the above because I ultimately want to serve files from a file server to machines on the same network. I use 100 Gigabit Mellanox ConnectX-4 NICs and a striped 4x NVMe RAID 0 volume that can read and write at about 72 Gigabit/second. I tested network throughput with the Mellanox toolbox and reached 94 Gigabit/second. Hence, hardware-wise there should not be any constraints.
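
For scale, here is a quick sketch of the unit conversion behind those figures. It is plain arithmetic on the numbers quoted in this post; the 2 GB/s value is the upper end of the RAM disk result above, and the labels are just descriptive names chosen for this example.

```python
def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbit_per_s / 8.0


# Figures quoted in the post, in Gbit/s.
figures = {
    "NVMe stripe": 72.0,
    "Measured network throughput": 94.0,
    "NIC line rate": 100.0,
}

for label, gbit in figures.items():
    print(f"{label}: {gbit:.0f} Gbit/s = {gbit_to_gbyte(gbit):.2f} GB/s")

# Upper end of the observed RAM disk copy speed, in GB/s.
observed_gbyte_per_s = 2.0
share = observed_gbyte_per_s / gbit_to_gbyte(72.0)
print(f"Observed copy speed is {share:.0%} of what the NVMe stripe can deliver")
```

In other words, the Explorer copy would have to get several times faster before the storage or the network became the limit.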
 

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
If you're just doing a drag'n'drop in Explorer, it'll often run into single-threaded CPU limits (or at least it used to); you should be able to see this in Task Manager (and if not, what is using the most CPU?). Try using something like robocopy to test more accurately what the HW is capable of and see if it behaves any differently.
 

cesmith9999

Well-Known Member
Mar 26, 2013
And with robocopy, please use the /MT[:n] switch, where n is the number of threads (1-128, default 8).

Chris
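
A minimal sketch of scripting that test from Python, in case it helps with repeated runs. The paths and file name in the usage line are hypothetical placeholders for the two RAM disks; only the `robocopy <source> <destination> <file> /MT:n` form mentioned above is taken as given.

```python
import subprocess


def robocopy_command(src_dir: str, dst_dir: str, file_name: str, threads: int = 8) -> list:
    """Build a robocopy argument list using the /MT:n switch (n between 1 and 128)."""
    if not 1 <= threads <= 128:
        raise ValueError("robocopy /MT accepts 1 to 128 threads")
    return ["robocopy", src_dir, dst_dir, file_name, f"/MT:{threads}"]


def run_robocopy(src_dir: str, dst_dir: str, file_name: str, threads: int = 8) -> bool:
    """Run robocopy (Windows only) and report success.

    robocopy uses exit codes below 8 for success (for example, 1 means
    files were copied), so a nonzero code is not necessarily a failure.
    """
    cmd = robocopy_command(src_dir, dst_dir, file_name, threads)
    result = subprocess.run(cmd, check=False)
    return result.returncode < 8
```

Usage, with hypothetical RAM disk paths: `run_robocopy("R:\\data", "S:\\data", "bigfile.bin", threads=16)`.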
 

Frank173

Member
Feb 14, 2018
As I mentioned, yes, the file copy in Windows is single threaded. And I did check: not a single core is fully utilized during the file transfer. Tools like robocopy speed up the transfer several-fold, and that is exactly my question: why is single-threaded file transfer throughput in Windows so constrained? This is not just a question about file transfers; in general, any load or access will also be constrained for no apparent reason. Again, not a single core is fully utilized, and DDR4 memory, even single channel, delivers more than 10 times this throughput. Yes, any Windows API has overhead, but not to this degree. When you look at it, I am basically transferring binary data from memory to a single CPU and back to memory. No PCIe lane limitations, no PCH or UPI involved whatsoever.
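
To put numbers on that, here is a rough sketch of how one could check whether a single-threaded copy loop is sensitive to the size of each read/write call. It is not how Explorer copies files; it is a plain Python loop over temporary files, so it mostly exercises the OS file cache, and the file and chunk sizes are arbitrary assumptions. The DDR4 figure is the usual theoretical peak for one channel (transfers per second times 8 bytes per transfer).

```python
import os
import tempfile
import time


def ddr4_peak_bytes_per_s(mega_transfers_per_s: int = 2666, bus_bytes: int = 8) -> int:
    """Theoretical peak of one DDR4 channel: transfers/s times bytes per transfer."""
    return mega_transfers_per_s * 1_000_000 * bus_bytes


def copy_throughput(src: str, dst: str, chunk_size: int) -> float:
    """Copy src to dst in a single thread with fixed-size reads; return bytes/s."""
    total = 0
    start = time.perf_counter()
    with open(src, "rb", buffering=0) as fin, open(dst, "wb", buffering=0) as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            # Unbuffered writes may be partial, so loop until the chunk is written.
            view = memoryview(chunk)
            while view:
                written = fout.write(view)
                view = view[written:]
            total += len(chunk)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total / elapsed


def sweep(file_size: int = 64 * 1024 * 1024,
          chunk_sizes=(64 * 1024, 1024 * 1024, 16 * 1024 * 1024)) -> dict:
    """Measure copy throughput for each chunk size on a temporary file."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src.bin")
        dst = os.path.join(tmp, "dst.bin")
        with open(src, "wb") as f:
            f.write(os.urandom(file_size))
        for size in chunk_sizes:
            results[size] = copy_throughput(src, dst, size)
    return results


def report() -> None:
    """Print each result next to the single-channel DDR4-2666 peak."""
    peak = ddr4_peak_bytes_per_s()
    for size, rate in sweep().items():
        print(f"chunk {size // 1024:>6} KiB: {rate / 1e9:.2f} GB/s "
              f"({rate / peak:.0%} of one DDR4-2666 channel)")
```

Calling `report()` prints one line per chunk size. If throughput climbs noticeably with larger chunks, per-call overhead rather than memory bandwidth is the likelier limit, which would fit the observation that no core is pegged.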

 

Frank173

Member
Feb 14, 2018
iperf3 shows a lot less throughput than the "nd_send_bw" bandwidth tool that comes with the Mellanox ConnectX-4 drivers and toolbox. I believe Mellanox's tool is much more precise, as it measures pure line throughput without any other overhead. I have not delved into the iperf3 docs, so I am not sure what precisely it does under the hood, but I understand that there is overhead involved. Right now, though, I would like to focus on the Windows file transfer mechanism, because ultimately the file server will serve files over the network and needs to load the data from disk.

No single CPU core is fully utilized during the file transfer, and it is a memory-to-memory copy in DDR4 memory on the same machine. Nothing so far explains why such a transfer is limited to around 2GB/sec.

What do you get with iperf3 between nodes?