Poor SMB Direct performance?


Rakkzi

New Member
Sep 14, 2024
Hi there! I've been working on deploying 40G networking in our house for a dedicated storage network, and because all of our PCs are Windows-based, I opted for a Windows Server VM that serves shares over SMB Direct (RDMA) through Infiniband ConnectX-3s. Here's the layout of how I have things set up:
Storage Topo.png
Unfortunately, my performance over the network has been quite poor. I ran the built-in Infiniband performance testing tools like ib_write_bw and it showed ~36Gb/s with 7us latency from my PC to the Windows VM, so my suspicion has turned to SMB possibly being the culprit. I'm seeing RDMA read/write in perfmon, so I know SMB Direct is enabled and working... are these numbers normal, or does SMB Direct just suck over 10g? If so, any advice on what I should replace it with?
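(For reference, besides the perfmon counters, these are roughly the PowerShell checks I used to confirm RDMA is actually in play; nothing beyond the stock SMB cmdlets:)

# client side: is the NIC seen as RDMA-capable by SMB?
Get-SmbClientNetworkInterface | Where-Object RdmaCapable

# while a transfer is running: are the active SMB connections using RDMA?
Get-SmbMultichannelConnection

# on the Windows Server VM: is RDMA enabled on the adapter?
Get-NetAdapterRdma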
CrystalDiskMark scores for the storage pools below, for example:

SSD Array:
MiniSSDsNonHost.png MiniSSDHost.png

HDD Array:
MiniHDDsNonHost.png MiniHDDsHost.png

Single Passthrough'd NVMe Drive:
SingleNVMENonHost.png SingleNVME.png

2x Passthrough'd NVMe Drives in a Windows Software RAID 0:
NVMERAIDNonHost.png NVMERAID.png

Any help is greatly appreciated!
 


kapone

Well-Known Member
May 23, 2015
random question, have you enabled jumbo frames?
Shouldn't need to. While that can bump things up a bit, it's not gonna be earth shattering.

OP: Some more details about the hardware on the server side, and the client PCs. What NICs are you using (I'm assuming the switch is a Mellanox SX variant?). What OS versions?

Edit: Also, on the Windows Server VM - What's the CPU type for the VM?
 

dsrhdev

New Member
May 28, 2024
random question, have you enabled jumbo frames?
It's Infiniband (and RDMA), there's no need to.

Any help is greatly appreciated!
1) You're comparing SMB Direct (i.e. a network filesystem) vs NVMe (block) storage.
2) I don't know if there are any tunables for SMB Direct (see the quick check after this list).
3) Surprisingly, I found a doc; please check slide 10: the native performance is about 1.5x faster than QDR (I guess due to the 8b/10b encoding plus something else).
4) Is your Proxmox host or client a multi-socket machine?
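For 2), a quick way to at least dump what the client and server expose (standard SMB PowerShell cmdlets; which of these actually matters for SMB Direct is a separate question):

# client-side settings that tend to matter for throughput
Get-SmbClientConfiguration | Select-Object EnableMultiChannel, EnableLargeMtu, EnableBandwidthThrottling, ConnectionCountPerRssNetworkInterface

# and the server side
Get-SmbServerConfiguration | Select-Object EnableMultiChannel, Smb2CreditsMin, Smb2CreditsMax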
 

Rakkzi

New Member
Sep 14, 2024
Some more details about the hardware on the server side, and the client PCs. What NICs are you using (I'm assuming the switch is a Mellanox SX variant?). What OS versions?

Edit: Also, on the Windows Server VM - What's the CPU type for the VM?
Server-side hardware is an AMD EPYC 7282 with 125GB of RAM.
The client I tested on is running Windows 11 Pro for Workstations (to enable RDMA, since plain Pro doesn't have it).

The switch is an SX6018 and the NICs are ConnectX-3s (MCX353A-FCBT). The VM's CPU type is x86-64-v2-AES. I previously had it set to "host", but changing it roughly tripled my performance; the numbers above are from after I switched to x86-64-v2-AES.
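(For anyone curious, the CPU type is just the VM setting in Proxmox; from the host shell it's something like the following, where 101 is a placeholder VMID:)

# check what the VM is currently set to
qm config 101 | grep ^cpu

# switch the vCPU model (needs a full VM stop/start to take effect)
qm set 101 --cpu x86-64-v2-AES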


1) You're comparing SMB Direct (i.e. a network filesystem) vs NVMe (block) storage.
3) Surprisingly, I found a doc; please check slide 10: the native performance is about 1.5x faster than QDR (I guess due to the 8b/10b encoding plus something else).
4) Is your Proxmox host or client a multi-socket machine?
1) This confuses me, because I don't think I was? I'm aware network file sharing isn't block storage; block would be something like iSER in this case, but I don't think (?) Windows supports it.
3) I'm not entirely sure what I'm supposed to be looking at; is slide 10 showing that QDR only has ~1/2 the performance of FDR in terms of IOPS?
4) Both client and server are single-socket.
 

Sprint8

New Member
Oct 15, 2023
Have you turned on PFC in your switch? Assuming it supports it. It won't work without it.
 

gea

Well-Known Member
Dec 31, 2010
To rule out RDMA switch problems, you can connect the NICs directly.
 

kapone

Well-Known Member
May 23, 2015
Have you turned on PFC in your switch? Assuming it supports it. It won't work without it.
RDMA will work with zero switch support. It'll start breaking down once the switch is fully loaded and if it doesn't have PFC/ECN/DCB etc.
 

kapone

Well-Known Member
May 23, 2015
Server-side hardware is an AMD EPYC 7282 with 125GB of RAM.
The client I tested on is running Windows 11 Pro for Workstations (to enable RDMA, since plain Pro doesn't have it).

The switch is an SX6018 and the NICs are ConnectX-3s (MCX353A-FCBT). The VM's CPU type is x86-64-v2-AES. I previously had it set to "host", but changing it roughly tripled my performance; the numbers above are from after I switched to x86-64-v2-AES.

1) This confuses me, because I don't think I was? I'm aware network file sharing isn't block storage; block would be something like iSER in this case, but I don't think (?) Windows supports it.
3) I'm not entirely sure what I'm supposed to be looking at; is slide 10 showing that QDR only has ~1/2 the performance of FDR in terms of IOPS?
4) Both client and server are single-socket.
So, let’s take this one by one.

Can you describe what your HDD array is composed of, and how are they connected to the motherboard etc?

P.s. ConnectX-3 only does RoCEv1, not RoCEv2. So your client and server must be on the same L2 network (which I think yours are, otherwise it shouldn't work).

The reason I brought up the RoCE versions is that v1 was always a bit flaky.
 

donedeal19

Member
Jul 10, 2013
In my lab I get a better RDMA connection when the client OS is Windows Server 2025. When you look at Task Manager there will be no network activity, which confirms that RDMA and SMB Direct are active.
 

dsrhdev

New Member
May 28, 2024
1) This confuses me, because I don't think I was? I'm aware network file sharing isn't block storage; block would be something like iSER in this case, but I don't think (?) Windows supports it.
3) I'm not entirely sure what I'm supposed to be looking at; is slide 10 showing that QDR only has ~1/2 the performance of FDR in terms of IOPS?
4) Both client and server are single-socket.
Hello,
1) You can use iSER on Windows. As I understand it, you started using the Windows platform (on the storage server) because all the other PCs are on Windows; you could instead run some kind of Linux distro on the server side, if that's suitable, since you may want shared storage vs block (exclusive) access.
3) In terms of IOPS and throughput: as shown in the picture, you are using an Infiniband QDR (40G) switch. The CX-3 is an 8-lane PCIe card, so at QDR you can achieve a theoretical maximum of ~4GB/s on sequential read/write (see the quick math after this list).
4) OK, so NUMA shouldn't be an issue.
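To spell out where that ~4GB/s comes from (rough numbers, standard encoding overheads):
PCIe 3.0 x8: 8 GT/s x 8 lanes x 128/130 encoding ≈ 7.9 GB/s, so the slot itself isn't the limit.
QDR IB: 40 Gb/s signalling x 8/10 encoding = 32 Gb/s ≈ 4 GB/s of actual data.
FDR IB: ~56 Gb/s signalling x 64/66 encoding ≈ 54 Gb/s ≈ 6.8 GB/s, so FDR raises that ceiling.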

upd:
Sorry, I was wrong; I read your answer inaccurately.

It might be a NUMA-awareness problem, since EPYC 7xx2 is a 4(?) NUMA-node topology even on a single socket.
First (and simplest), check which NUMA node your CX-3 card is bound to and bind the corresponding IRQs to it.
Second, check the same for your NVMe drives.
Next, if you're lucky, try to put both I/O paths on the same NUMA node (move the CX-3 card to a PCIe slot belonging to the same NUMA node as your NVMe drives) and bind your Windows VM to the cores of that node; a minimal sketch of the checks is just below.
For more info about IRQ binding, see the AMD Technical Information Portal + EnterpriseSupport (for NUMA and IRQ/core affinity).
If you can complete the first tip, or the first + second, it can provide roughly a 2x performance boost; with all three it might be close to the theoretical maximum.
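A minimal sketch of those checks from the Proxmox host shell (the PCI addresses below are placeholders; get the real ones from lspci):

# which NUMA node the ConnectX-3 is attached to (-1 means no NUMA locality reported)
cat /sys/bus/pci/devices/0000:41:00.0/numa_node

# same check for an NVMe drive
cat /sys/bus/pci/devices/0000:01:00.0/numa_node

# which cores belong to which node
numactl --hardware
lscpu | grep -i numa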

Last but not least:
What goal are you trying to achieve? If you just need fast remote storage and it doesn't require simultaneous access from different hosts, try block storage via iSER or NVMe-oF; otherwise the only real option is SMB Direct, which is not well supported on Linux at the moment (the ksmbd project).

If you have an SX6018, you can run your CX-3 cards at FDR.

upd2:
I ran the built-in Infiniband performance testing tools like ib_write_bw and it showed ~36Gb/s with 7us latency from my PC to the Windows VM
Also, can you please run ibstat on both Windows machines (server/client), or iblinkinfo on any node? It looks like you already have FDR, but 36Gbps is a little slower than it could be (45+).
 

dsrhdev

New Member
May 28, 2024
P.s. ConnectX-3 only does RoCEv1, not RoCEv2. So your client and server must be on the same L2 network (which I think yours are, otherwise it shouldn't work).
Hello,
He's using an Infiniband switch, as shown in the initial picture.
 

Rakkzi

New Member
Sep 14, 2024
Have you turned on PFC in your switch? Assuming it supports it. It won't work without it.
So, let's take this one by one.

Can you describe what your HDD array is composed of, and how are they connected to the motherboard etc?

P.s. ConnectX-3 only does RoCEv1, not RoCEv2. So your client and server must be on the same L2 network (which I think yours are, otherwise it shouldn't work).

The reason I brought up the RoCE versions is that v1 was always a bit flaky.
Sorry, I'm not using RoCE; this network is Infiniband, and PFC is also an Ethernet thing IIRC.
My HDD array is 20x 2.5" 10K HGST drives attached through an LSI SAS 9207-8e HBA, along with my SSD array. (I did notice that both my arrays have suspiciously close numbers in the benches; the HBA might (?) be a bottleneck there.)

In my lab I get a better RDMA connection when the client OS is Windows Server 2025. When you look at Task Manager there will be no network activity, which confirms that RDMA and SMB Direct are active.
Correct, I've noticed there's no activity in Task Manager. I was referring to perfmon's counters for SMB Direct specifically; when I access files over the network I can see SMB Direct's RDMA activity there:
1761850287477.png
1761850302440.png

Hello,
1) You can use iSER on Windows.

2) In terms of IOPS and throughput: as shown in the picture, you are using an Infiniband QDR (40G) switch. The CX-3 is an 8-lane PCIe card, so at QDR you can achieve a theoretical maximum of ~4GB/s on sequential read/write.

3) What goal are you trying to achieve? If you just need fast remote storage and it doesn't require simultaneous access from different hosts, try block storage via iSER or NVMe-oF.

4) Also, can you please run ibstat on both Windows machines (server/client), or iblinkinfo on any node? It looks like you already have FDR, but 36Gbps is a little slower than it could be (45+).
1) I wasn't able to find much of anything about iSER support on Windows (or even an initiator for it), so if you have any more info on it that would be great! I know NVMe-oF is pretty much obsoleting it, but when I looked into setting that up I also had trouble finding a suitable NVMe-oF target/host, since StarWind VSAN doesn't have support for it in the free version yet and Windows Server doesn't have the ability to do NVMe-oF yet either, AFAIK.

2) Currently, yes, the network is set up at QDR (40G), but the switch has the 56G firmware; I just need better cables to enable FDR. My drive benches over the network definitely aren't making it anywhere near 4GB/s, but also I thought PCIe 3.0 x8 was 8GB/s?

3) I had three goals in mind:
  1. Simply having a NAS to host + share files between clients
  2. Enabling a sort of "roaming profile" setup where our clients download their files/profiles/homedir/etc. from the server at login and sync changes when they log out
  3. Loading games over the network off the NAS

4) Here are the ibstat results:

Client:
1761854283030.png

Server:
1761854400581.png
 

dsrhdev

New Member
May 28, 2024
3) I had three goals in mind:
  1. Simply having a NAS to host + share files between clients
  2. Enabling a sort of "roaming profile" setup where our clients download their files/profiles/homedir/etc from the server at login and sync changes when they logout
  3. Loading games over the network off the NAS
3.1, 3.2) OK, then your best choice is SMB Direct.
3.3) Yeah, I knew that! But not all games perform well over a remote FS.

4) ibstat looks good; try to tune the NUMA-related things, like IRQ binding for your CX-3.
 

Exhaust8890

Member
Nov 29, 2023
Just chiming in with my experience yesterday, although I'm not using Infiniband or a switch (ConnectX-4, 25Gb peer-to-peer, Windows Pro for Workstations).

I noticed transfers from my NAS to my daily PC were only in the hundreds of MB/s, but the other way around I was getting over 1GB/s. Did some searching and found this website.


"On the SMB client, enable large MTU in SMB, and disable bandwidth throttling. To do this, run the following command (in Powershell):"

Set-SmbClientConfiguration -EnableBandwidthThrottling 0 -EnableLargeMtu 1
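To see the values before/after the change (same cmdlet family):

Get-SmbClientConfiguration | Select-Object EnableBandwidthThrottling, EnableLargeMtu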
 

donedeal19

Member
Jul 10, 2013
You can try searching Google for the PowerShell script that shows your PCI devices' current link speed (or see the cmdlet below).
A direct connection gives better isolation when troubleshooting.
Host-to-host (Windows Server, no VM) performance tests.
Max performance mode and C-states disabled, to rule out a bottleneck. With the limited amount of information given, suggestions will be limited.
Using a File Explorer transfer on an X58 (2008) machine I can get line rate at around 3200MB/s using CX-2 IPoIB.
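For the link speed check, if memory serves there's a built-in cmdlet that already shows this for network adapters, no script needed:

# default output includes PCIe link speed/width and the NUMA node per adapter
Get-NetAdapterHardwareInfo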
 


Rakkzi

New Member
Sep 14, 2024
3.1, 3.2) OK, then your best choice is SMB Direct.
3.3) Yeah, I knew that! But not all games perform well over a remote FS.

4) ibstat looks good; try to tune the NUMA-related things, like IRQ binding for your CX-3.
Double checked and IOMMU remapping is enabled for the passthrough:
1762122605035.png

Tried playing with the NUMA nodes per socket (NPS) option in the BIOS, but even when I set it to NPS4, Proxmox still only shows 1 node.
1762122642062.png 1762122623754.png
So I'm guessing

"On the SMB client, enable large MTU in SMB, and disable bandwidth throttling. To do this, run the following command (in Powershell):"
Set-SmbClientConfiguration -EnableBandwidthThrottling 0 -EnableLargeMtu 1
Enabling those SMB configs on both the client and server doesn't seem to have changed much:
1762115527997.png

1) You can try searching Google for the PowerShell script that shows your PCI devices' current link speed.
2) A direct connection gives better isolation when troubleshooting.
3) Host-to-host (Windows Server, no VM) performance tests.
4) Max performance mode and C-states disabled, to rule out a bottleneck. With the limited amount of information given, suggestions will be limited.
5) Using a File Explorer transfer on an X58 (2008) machine I can get line rate at around 3200MB/s using CX-2 IPoIB.
1) Adapter information for the ConnectX-3s shows they're both at PCIe 8Gbps x8:
1762115707918.png
1762115769969.png

2) Direct connection as in bare metal to bare metal? Once I get the other clients set up I can try copying files between them to see how that affects things; I should be able to test that later this week.

3) Running bare-metal Windows Server on the current host plus a client would be a big and disruptive undertaking that I would very much like to avoid if possible.

4) For "max performance mode", my Core Performance Boost is set to Auto, which is the only option other than Disabled:
1762117379022.png

I'll try messing around in the BIOS a bit. I did find an option for Preferred I/O and tried setting it to Manual to give the ConnectX-3 priority, but it hasn't shown me an input field to actually put the bus # in, so I might need to reboot the host node to set it.
1762122917243.png
 

kapone

Well-Known Member
May 23, 2015
@Rakkzi - I don't run IPoIB (for many different reasons), but I can tell you that SMB Direct performance over RoCE is top notch. This paper even discusses various aspects of both topologies. I can do ~4-5GB/s (i.e. wire speed on 56Gb links) on SMB Direct easily over RoCEv2 (this is with both all-flash and many, many HDD arrays).

I can only hazard a guess that the IP encapsulation layer over IB is adding overhead that may be minimized when using RoCE. However, even with that said, your local vs network disk performance is too far apart for these underlying things to make that much of a difference.

Shooting in the dark here... Tried different DAC cables? Different ports on the switch? Direct connect with no switch in the middle?

Edit: My recent RDMA tests with Debian, with OOTB settings and no tuning; I was seeing latencies of ~1usec.

https://forums.servethehome.com/ind...nectx-3-pro-en-mcx314a-bcct.41399/post-484305
 

dsrhdev

New Member
May 28, 2024
@Rakkzi - I don't run IPoIB (for many different reasons), but I can tell you that SMB Direct performance over RoCE is top notch. This paper even discusses various aspects of both topologies. I can do ~4-5GB/s (i.e. wire speed on 56Gb links) on SMB Direct easily over RoCEv2 (this is with both all-flash and many, many HDD arrays).
Hello,
IP (IPoIB) is used during connection establishment (i.e. discovering the IB port GUID/LID), but data transfer with SMB Direct then proceeds over IB/RDMA directly. iperf/iperf3 uses IPoIB by default (though it's possible to use IB via LD_PRELOAD); qperf uses IB directly as well.

@Rakkzi
1) Can you please check qperf between the server and the client for throughput and latency? (Something like the example below.)
2) You can also try running Ethernet/RoCE by configuring the appropriate ports on the switch and cards.
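For 1), something like this (qperf with no arguments on one side as the listener, and the tests named on the other; the IP below is a placeholder for the server's IPoIB address; the rc_* tests use the RDMA path, tcp_* go over IPoIB):

# on the server side
qperf

# on the client side
qperf 10.0.0.2 rc_rdma_write_bw rc_rdma_write_lat tcp_bw tcp_lat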
 

Rakkzi

New Member
Sep 14, 2024
I ran Mellanox's Fabric Performance Utilities, since they're available in WinOF and qperf is Linux-only I believe. I did it both ways: first with the storage host's Windows Server VM acting as the server for the test, and then the other way around with the client PC acting as the server.

Server Host results:

nd_write_lat:
nd_write_lat server.png
nd_read_lat:
nd_read_lat server.png
nd_read_bw:
nd_read_bw server.png
nd_write_bw:
nd_write_bw server.png

Client as host results:

nd_write_lat:
nd_write_lat client.png
nd_send_lat: (yeah I know I did send instead of read, I got them mixed up)
nd_send_lat client.png
nd_read_bw:
nd_read_bw client.png
nd_write_bw:
nd_write_bw client.png

I'm not sure where it's getting the CPU Util numbers for these tests, since Task Manager definitely doesn't show the CPUs being maxed out on either machine; it's only touching 4 cores and even then they're not pegged at all.
serverCPU.png

I'm not sure if these cards are VPI-capable; I don't see a port protocol option in their properties under Device Manager:
1762320919971.png
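If I end up trying the Ethernet/RoCE route dsrhdev mentioned, it sounds like the port protocol on these cards can be queried/changed with the Mellanox firmware tools (MFT) rather than Device Manager; something like this, where the device name is just an example of what mst status reports (1 = IB, 2 = ETH, as far as I can tell):

mst status
mlxconfig -d mt4099_pciconf0 query
mlxconfig -d mt4099_pciconf0 set LINK_TYPE_P1=2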

I did change the Preferred I/O setting in the host PC's BIOS, but it hasn't made a difference.