I'm interested in seeing all of these results and tests, though.
And I'm now curious to see how many TCP connections are opened by SMB MultiChannel when I don't use LACP. It must be at least two per server, which is more than I'm getting when LACP is on.
OK, to complete the story: I've tested again and Windows only ever opens one connection per remote IP.
The Microsoft doc clearly states that it will open more connections if RSS is available: "With SMB Multichannel, if the NIC is RSS-capable, SMB will create multiple TCP/IP connections for that single session, avoiding a potential bottleneck on a single CPU core when lots of small IOs are required". But it doesn't do so for me, because the SMB connection thinks the NICs are not RSS capable.
I have 2 x 10G links on both the server and the desktop. For the following test I am running a single instance of Samba's smbd on the server, but configured with two IPs: 192.168.100.54 (on 10G NIC 1) and 192.168.200.54 (NIC 2). The workstation NICs are configured across the same two subnets: 192.168.100.20 (NIC 1) and 192.168.200.20 (NIC 2).
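For reference, binding a single smbd to both addresses looks roughly like this in smb.conf (a sketch: the addresses are from my setup above, but the share path is an assumption, not my actual layout):
Code:
[global]
    # Listen only on the two 10G interfaces
    interfaces = 192.168.100.54 192.168.200.54
    bind interfaces only = yes

[tomj]
    # Path is illustrative only
    path = /tank/tomj
    read only = no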
I'm running benchmarks from the Windows client using the iozone disk benchmark, with two to four concurrent instances. On the server I have enabled ZFS filesystem compression, which almost completely removes the disk subsystem from the equation, as iozone's data is 100% compressible: I only get disk activity on metadata writes, and reads are all from cache. I have also tested with compression off, and the figures are basically the same, as I'm well below the max throughput of the PCIe SSD array I'm testing with.
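For anyone wanting to reproduce this, the setup is roughly as follows (a sketch: the dataset name, file size and file paths are assumptions, not my exact invocation):
Code:
# Server: compress the test dataset so iozone's repetitive data never hits disk
zfs set compression=lz4 tank/tomj

# Client: two concurrent iozone threads, 1MB records, write (-i 0) then read (-i 1) tests
iozone.exe -i 0 -i 1 -r 1m -s 4g -t 2 -F z:\f1 z:\f2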
I first establish an SMB connection to 192.168.100.54:
net use z: \\192.168.100.54\tomj\copytest
Through the SMB protocol handshaking, Windows finds out about the other IP address(es):
Code:
C:\WINDOWS\system32> Get-SmbMultichannelConnection
Server Name Selected Client IP Server IP Client Interface Index Server Interface Index Client RSS Capable
----------- -------- --------- --------- ---------------------- ---------------------- ------------------
192.168.100.54 True 192.168.200.20 192.168.200.54 7 3 False
192.168.100.54 True 192.168.100.20 192.168.100.54 6 2 False
192.168.100.54 True 192.168.1.20 192.168.100.54 2 2 False
It shows one row per workstation NIC, and two of these are able to connect to the server. As these are on separate NICs, I achieve one TCP connection per NIC on both server and client.
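You can confirm the TCP connection count directly from PowerShell; something like this (assuming the default SMB port 445) should show one established connection per NIC:
Code:
C:\WINDOWS\system32> Get-NetTCPConnection -RemotePort 445 -State Established | Select-Object LocalAddress, RemoteAddress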
Important to note is that it always shows "Client RSS Capable = False", even though I have RSS enabled on both Intel 10G NICs. This must be why I don't get more than one connection per IP: the extra connections rely on RSS being detected. RSS definitely is on, as confirmed by both Get-NetAdapterRss and Get-SmbClientNetworkInterface. The latter is interesting and odd, because it's part of the SMB software itself:
Code:
C:\WINDOWS\system32> Get-SmbClientNetworkInterface | findstr "Interface True"
Interface Index RSS Capable RDMA Capable Speed IpAddresses Friendly Name
6 True False 10 Gbps {fe80::65f0:5b9f:2376:8deb, 192.168.100.20} Ethernet 5
7 True False 10 Gbps {fe80::3907:8867:ca06:31c3, 192.168.200.20} Ethernet 6
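If you hit the same discrepancy, the RSS state can be checked and toggled per adapter; a quick sketch (the adapter name is taken from the output above, so adjust to suit):
Code:
C:\WINDOWS\system32> Get-NetAdapterRss -Name "Ethernet 5" | Select-Object Name, Enabled
C:\WINDOWS\system32> Enable-NetAdapterRss -Name "Ethernet 5"
C:\WINDOWS\system32> Get-SmbClientConfiguration | Select-Object EnableMultiChannel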
At first I thought the lack of RSS detection might be a deficiency of the consumer code versus Server, but the Microsoft doc specifically mentions support in Windows 8, and I've since seen several discussions from people having the same issue on Server 2012. Here's an example: an unsolved post on Microsoft TechNet, posted by a user on Windows Server 2012, with other users reporting the same issue as recently as October 2017.
It's possible it's NIC or driver related - maybe it works with some NICs and not others - but that's not something I can test anytime soon.
However, the main stated benefit of having more connections is to avoid a bottleneck on a single workstation CPU/thread, and luckily I don't seem to have that constraint; it's the server/Samba end where I hit a CPU bottleneck. My desktop CPU is the same six-core/twelve-thread Westmere Xeon X5670 that my server has two of, but in the workstation it's overclocked to 4.4GHz and has much faster RAM. I'm quite impressed with the client load: on write tests I see overall utilisation of 20-25%, with no single core above 50%. On reads, client utilisation totals 25%, with two cores at up to 60%. These totals include all the usual background desktop stuff, including dozens of open browser tabs, and usually YouTube or Netflix running at the same time as the tests.
This test gives me a write speed averaging 12Gb/s, with reads around 9.5Gb/s. The read test is where I most clearly hit the Samba CPU limit: I can see that the single smbd process is pegged at 100% of one core, as it's single-threaded. The write test doesn't push smbd as hard, but it must also be bottlenecked, because if I perform the same test across two Samba instances (on the same physical server), writes go up to 15Gb/s. I don't know why reads need more CPU than writes.
Reads spread across two (or more) Samba instances average around 17Gb/s, with 17.8Gb/s being my fastest ever result. There's a lot more workstation CPU load in this case: about 45% total utilisation, with two cores up at 90-95%. All those figures are with iozone configured with a record size of 1MB. With 8k records - probably more representative of real filesystem usage - write performance drops by 5% and reads by 20%, but that's not caused by increased CPU usage, as CPU load is also proportionally reduced. I'm not completely sure what bottleneck(s) keep me from getting closer to the maximum 20Gb/s. But again, it seems I'm not being disadvantaged by missing out on the extra threads and connections promised by SMB Multichannel when RSS is detected.
On the server side, which is definitely a bottleneck when using a single instance (as I mostly will in real usage), Samba's smbd only uses a single process even though I'm connecting twice or more across multiple IPs. I suppose that's reasonable: it's considered a single session, and SMB Multichannel has to synchronise the channels in order to write the data into a single file without corruption. I guess that is hard to do across processes, or even across threads - though Windows manages it OK.
But Samba still lists Multichannel support as experimental, so it's probably not had much (or any) optimisation work. Maybe it will eventually become multi-threaded.
I do plan to ask the Samba guys about it, and about whether I can do any further tuning. I've already done the obvious things like enabling jumbo frames and tuning the TCP SO_SNDBUF/SO_RCVBUF sizes.
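For reference, that buffer tuning lives in smb.conf's socket options; roughly like this (the 4MB buffer values are illustrative, not my exact settings):
Code:
[global]
    # Multichannel is still marked experimental in Samba
    server multi channel support = yes
    # Larger socket buffers for 10G links; sizes here are examples only
    socket options = TCP_NODELAY SO_SNDBUF=4194304 SO_RCVBUF=4194304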
Finally, I wondered what would happen if I assigned multiple IPs per NIC on both ends. So on the server I added 192.168.110.54 on NIC 1 and 192.168.210.54 on NIC 2, with 192.168.110.20 and 192.168.210.20 on the workstation. Sure enough, the SMB connection now opened four TCP connections to a single instance, all of which transferred data equally. Of course there was no speed improvement, as I was already bottlenecked on the server side - in fact speed dropped slightly, likely due to extra synchronisation work. But it is good to know it scales by IP and not by NIC, allowing one to work around the lack of RSS detection if that would be beneficial.
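Adding the secondary addresses is straightforward on both ends; roughly like this (the interface indices come from my Get-SmbClientNetworkInterface output above, and the server device names are assumptions):
Code:
# Windows client: second IP on each 10G NIC (indices 6 and 7 from earlier)
New-NetIPAddress -InterfaceIndex 6 -IPAddress 192.168.110.20 -PrefixLength 24
New-NetIPAddress -InterfaceIndex 7 -IPAddress 192.168.210.20 -PrefixLength 24

# Linux server: second IP on each 10G NIC (device names assumed)
ip addr add 192.168.110.54/24 dev eth0
ip addr add 192.168.210.54/24 dev eth1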
All this made me realise that when I first tested LACP, I didn't do so properly. I assumed it would open an extra connection because of the team, which of course it does not without RSS being detected. In my LACP test I configured a single IP per server instance. I did use two server instances, so that gave me two connections in total; but if I'd configured two IPs per instance, as in my non-LACP tests, I would have got four connections in total and LACP would have balanced a bit better than in my initial test.
But I did later test LACP with four connections in total (one each to four Samba instances), and it was still slower than non-LACP with two connections to one instance, and much slower than non-LACP with four connections to two instances. So I think LACP is a dead end for my use case.