FreeNAS/TrueNAS RDMA Support (FR for voting)


Rand__

Well-Known Member
Mar 6, 2014
Nice... now if Scale was only as performant as TNC... or has this been improved?

Otherwise, do you happen to have any before/after values?
 

tsteine

Active Member
May 15, 2019
If you're using only 10G ports, have you set up DCB for lossless ethernet? Or ECN?
On 10GbE with ZFS-backed storage you will completely saturate the link, and RDMA does not handle packet loss well. With CX4 or CX5 adapters, Resilient RoCE is supported, but it requires, at a minimum, ECN and a traffic class set up on the switch; the Mellanox drivers should support this out of the box on the adapters, so only the switch needs configuration.
I would suspect that performance in such a scenario would tank pretty hard the second you start pushing a lot of throughput, though as long as you stay below 10GbE of aggregate throughput, latency and performance should be excellent.
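For reference, if you do want to check or tweak the NIC side, with the Mellanox tooling it might look something like this - the interface name and priority are placeholders, and the switch still needs its own DCB/ECN configuration:
Code:
# show the current QoS/PFC configuration of the port (mlnx_qos ships with MLNX_OFED)
mlnx_qos -i enp1s0f0
# enable PFC only on priority 3, which is commonly used for RoCE traffic
mlnx_qos -i enp1s0f0 --pfc 0,0,0,1,0,0,0,0
# enable RoCE ECN reaction/notification points for priority 3 (mlx5 sysfs knobs; paths can differ by driver version)
echo 1 > /sys/class/net/enp1s0f0/ecn/roce_rp/enable/3
echo 1 > /sys/class/net/enp1s0f0/ecn/roce_np/enable/3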
 

tsteine

Active Member
May 15, 2019
On Layer 2, GlobalPause is enough.

You need ECN only for Layer 3.
This only applies to lossless ethernet.

If you want to use Resilient RoCE over a lossy fabric to avoid setting up lossless ethernet, ECN is necessary (as well as Mellanox/Nvidia ConnectX-4/5/6/7 adapters).
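For the plain Layer 2 / GlobalPause case, this is just standard ethernet flow control, so on the Linux host something like the following should be enough (interface name is a placeholder; the switch ports need pause enabled as well):
Code:
# show current pause (flow control) settings
ethtool -a enp1s0f0
# enable global pause in both directions on the port
ethtool -A enp1s0f0 rx on tx on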
 

efschu3

Active Member
Mar 11, 2019
GlobalPause on Layer 2 for RoCE for iSER is fine.

As said, you only need ECN on Layer 3.

You can't run iSER over a lossy connection.

Or well, you can, but you will get data corruption.
 

efschu3

Active Member
Mar 11, 2019
HOW IS PACKET LOSS HANDLED?
Upon packet drop, a NACK (not ACK) control packet with the specific PSN (packet sequence number) is sent to the sender in order for it to re-transmit the packet.

I don't trust that this works reliably with iSER.

Or at least, I doubt I would benefit much over hardware-accelerated TCP traffic.
 

tsteine

Active Member
May 15, 2019
efschu3 said:
HOW IS PACKET LOSS HANDLED?
Upon packet drop, a NACK (not ACK) control packet with the specific PSN (packet sequence number) is sent to the sender in order for it to re-transmit the packet.

I don't trust that this works reliably with iSER.
While Nvidia/Mellanox have not been specific about how this is handled, I can only assume it is implemented using physical packet buffers on the adapters (which is why hardware support is necessary, i.e. ConnectX-4 and newer). In the case of a NACK, the adapters themselves would retransmit a copy of the packet from a buffer on the adapter, while ECN is used to slow down or stop transmission until the packet is properly delivered in the correct sequence.

I expect it's not a performance winner when it happens, but it does mitigate the need to set up a lossless fabric.
 

efschu3

Active Member
Mar 11, 2019
tsteine said:
While Nvidia/Mellanox have not been specific about how this is handled, I can only assume it is implemented using physical packet buffers on the adapters (which is why hardware support is necessary, i.e. ConnectX-4 and newer). In the case of a NACK, the adapters themselves would retransmit a copy of the packet from a buffer on the adapter, while ECN is used to slow down or stop transmission until the packet is properly delivered in the correct sequence.

I expect it's not a performance winner when it happens, but it does mitigate the need to set up a lossless fabric.
Yes.

Until I see the specification for this, I won't trust it.
 

tsteine

Active Member
May 15, 2019
efschu3 said:
Yes.

Until I see the specification for this, I won't trust it.
I found a research paper from Mellanox on this topic, though it's somewhat old. (2017)

Handling packet loss relies on the InfiniBand transport specification [5], as depicted in Figure 3. Packets are stamped with a packet sequence number (PSN). The responder in the operation accepts packets in order and sends out-of-sequence (OOS) NACK upon receipt of the first packet in a sequence that arrived out of order. OOS NACK includes the PSN of the expected packet. The requestor handles the OOS NACK by retransmitting all packets beginning from the expected PSN using the go-back-N style scheme. The lost packet is fetched again from the host memory. OOS NACK handling is a relatively complex flow in the NIC combining hardware and firmware. In order to minimize the impact of packet loss, retransmissions must be fast and efficient.
It seems to be handled transparently in firmware/hardware on the adapters by adapting the InfiniBand spec's congestion handling, since RoCE is, at the most basic level, InfiniBand traffic wrapped in an ethernet transport. Normally, in an InfiniBand network, if the receiver is unable to process a packet, the sender would simply wait before sending it. In this case, they seem to fall back on that mechanism whenever packet loss occurs, using a packet buffer and ECN to replicate it over ethernet, so the iSCSI software layer on either side would simply be waiting for packets, without ever being aware that packet loss occurred.

I don't see any reason not to trust it; the only drawback is added latency once the buffer overflows and packets are dropped.
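For what it's worth, you can watch this happening on ConnectX hardware: the mlx5 driver exposes per-port RDMA counters, so something like the following (device name is a placeholder, and counter names can vary a bit between driver versions) shows whether out-of-sequence NACKs and ACK timeouts are actually occurring:
Code:
# RDMA hardware counters for port 1 of the first mlx5 device
cd /sys/class/infiniband/mlx5_0/ports/1/hw_counters
# out-of-sequence packets, packet sequence errors and ACK timeouts seen by the NIC
grep . out_of_sequence packet_seq_err local_ack_timeout_err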
 
Apr 21, 2016
Rand__ said:
Nice... now if Scale was only as performant as TNC... or has this been improved?

Otherwise, do you happen to have any before/after values?
I've migrated two machines from Core 13 to SCALE, up to the latest version. I haven't noticed any difference in performance. IMHO, ZFS + zvol isn't where it should be performance-wise. In fact, I'm exploring options other than ZFS for NVMe drives, but that is for another thread.

I've also got an NVMe target up and running, but there seems to be a lapse in VMware's ability to use the target - HPP refuses to claim the path because of the lack of FUSE support - so I will have to use the NVMe target from SPDK.
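For anyone wanting to see the same thing on their ESXi host, the NVMe-oF controllers and what HPP has claimed can be listed from the host shell (assuming ESXi 7.0U2 or newer):
Code:
# list the NVMe-oF controllers and namespaces the host has discovered
esxcli nvme controller list
esxcli nvme namespace list
# show which devices the High Performance Plugin has actually claimed
esxcli storage hpp device list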
 

efschu3

Active Member
Mar 11, 2019
@tsteine
"The lost packet is fetched again from the host memory."

Now the question is whether that "host memory region" belongs to the kernel module or to the application.

Since it's DMA, it "should" be the application.

If they really are talking about application memory, then I'm worried about this technique, because my application would have to be aware of it. If the memory is freed or overwritten in the meantime, this will corrupt your data.

Need some more info on how this is implemented.
 

tsteine

Active Member
May 15, 2019
efschu3 said:
@tsteine
"The lost packet is fetched again from the host memory."

Now the question is whether that "host memory region" belongs to the kernel module or to the application.

Since it's DMA, it "should" be the application.

If they really are talking about application memory, then I'm worried about this technique, because my application would have to be aware of it. If the memory is freed or overwritten in the meantime, this will corrupt your data.

Need some more info on how this is implemented.
I checked around some more and found this: Revisiting Network Support for RDMA (berkeley.edu)

From what I understand, the packet loss mechanism here is actually handled at the InfiniBand transport layer, not in the ethernet layer that RoCE wraps around it. While InfiniBand is lossless, packet loss can still occur: if a particular packet fails a CRC check on InfiniBand, it is discarded and retransmitted, so if this doesn't cause data corruption on InfiniBand networks, it's not going to cause it over RoCE either.
The point of using ECN and hardware support is to force the ethernet fabric to slow down and allow the InfiniBand layer to recover and retransmit the lost packet.
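If you want to confirm that ECN is actually kicking in rather than just trusting it, the same mlx5 counter directory also carries congestion-notification counters - a rough check, with the device name again being a placeholder:
Code:
cd /sys/class/infiniband/mlx5_0/ports/1/hw_counters
# CNPs sent by the notification point, CNPs handled by the reaction point,
# and RoCE packets that arrived with the ECN bit set
grep . np_cnp_sent rp_cnp_handled np_ecn_marked_roce_packets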
 

Rand__

Well-Known Member
Mar 6, 2014
I just wanna say that as of the 22.12 release of TrueNAS SCALE, RDMA is doable in a few steps.

1. Add RDMA support to the system - I went with the OFED package from Mellanox Nvidia : Mellanox OFED (MLNX_OFED) Software: End-User Agreement - add kernel support, enable nvmf and (optional) NFS over RDMA (I'd like to further test Proxmox as an initiator to see if there are any benefits). The main initiator "target" for this is VMware - post 7.0.2
Some Q's ...
Is ESXi 7 not able to utilize NFS over RDMA?
Can you share out any device via nvmf?

I've been running NFS for ages and not really looking to move to iSCSI unless it can't be helped...;)
 

Rand__

Well-Known Member
Mar 6, 2014
Alright, trying to follow the guideline here...

1. Add RDMA support to the system - I went with the OFED package from Mellanox Nvidia : Mellanox OFED (MLNX_OFED) Software: End-User Agreement - add kernel support, enable nvmf and (optional) NFS over RDMA (I'd like to further test Proxmox as an initiator to see if there are any benefits). The main initiator "target" for this is VMware - post 7.0.2, which has all the bells and whistles to support RDMA with iSER and NVMe-oF.
The result should be a system that is RDMA capable, with an RDMA-capable NVMe target, the iSER target module, etc.
I ran
Code:
chmod +x /bin/apt*
then
Code:
./mlnxofedinstall --add-kernel-support --enable-mlnx_tune --with-nvmf --with-nfsrdma --skip-distro-check

2. Rebuild SCST - I did this with the latest git (3.7.0, the same as what's already in TrueNAS). Check the install script to make sure you are starting the services with the new binaries (/etc/init.d/scst is the script used to start the service, wrapped under the "goodness" of systemd in /etc/systemd/system/scst.service.d).
3. Reboot, and if everything is doing what it's supposed to, you should see your iSCSI target sessions with RDMAExtensions Yes in the logs.
I assume I only need this part if I want RDMA support for iSCSI?

Edit:

So I assume OFED installed OK, but I'm not sure yet about the next steps.
Is the built-in NFS daemon now automagically RDMA enabled? I assume not.
How can I mount a share on ESXi via the NVMe-oF initiator?
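(For reference, the --with-nvmf option above brings in the in-kernel nvmet target; exposing a zvol over NVMe-oF/RDMA with it is a configfs exercise, roughly like the sketch below - the NQN, zvol path and address are made-up placeholders.)
Code:
modprobe nvmet
modprobe nvmet-rdma
cd /sys/kernel/config/nvmet
# create a subsystem and (for a lab setup) allow any host to connect
mkdir -p subsystems/nqn.2023-01.lab:zvol1
echo 1 > subsystems/nqn.2023-01.lab:zvol1/attr_allow_any_host
# back namespace 1 with a zvol and enable it
mkdir -p subsystems/nqn.2023-01.lab:zvol1/namespaces/1
echo /dev/zvol/tank/vm-vol1 > subsystems/nqn.2023-01.lab:zvol1/namespaces/1/device_path
echo 1 > subsystems/nqn.2023-01.lab:zvol1/namespaces/1/enable
# create an RDMA listener on port 4420 and link the subsystem to it
mkdir -p ports/1
echo rdma > ports/1/addr_trtype
echo ipv4 > ports/1/addr_adrfam
echo 10.0.0.10 > ports/1/addr_traddr
echo 4420 > ports/1/addr_trsvcid
ln -s /sys/kernel/config/nvmet/subsystems/nqn.2023-01.lab:zvol1 ports/1/subsystems/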
 

tsteine

Active Member
May 15, 2019
Rand__ said:
Is the built-in NFS daemon now automagically RDMA enabled? I assume not.
ESXi doesn't support NFSoRDMA.
It does seem that TrueNAS SCALE has changed from nfs-ganesha to nfs-kernel-server, which does support RDMA, while Ganesha does not.

That being said, NFS over RDMA runs on a different port - the "default" RDMA port is 20049, vs the normal 2049 - so it's pretty far from being "plug and play".
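For the record, getting nfs-kernel-server to listen on RDMA and mounting it from a Linux client would look roughly like this (export path, address and mount point below are placeholders):
Code:
# on the server: load the NFS/RDMA transport and add an RDMA listener on port 20049
modprobe svcrdma
echo "rdma 20049" > /proc/fs/nfsd/portlist

# on a Linux client: mount over RDMA instead of TCP
mount -t nfs -o vers=4.1,proto=rdma,port=20049 10.0.0.10:/mnt/tank/share /mnt/share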
 

tsteine

Active Member
May 15, 2019
167
83
28
OK.
Any idea re exposing TNS SCALE shares via NVMe-oF?
At the moment, you would have to install SPDK on TrueNAS.

With SPDK, you would need to configure an NVMe-oF target exposing an AIO bdev pointing to a zvol (or possibly a file on a ZFS file system).

I have done this with Ubuntu, SPDK and ESXi with zvol backing; it's not an easy "snap your fingers" setup, and the gains vs iSCSI/iSER are basically nonexistent, since the bottleneck seems to be the ZFS file system/zvol rather than the network protocol running over RDMA.

I only noticed significant uplifts with NVMe-oF and SPDK when directly exposing an NVMe device, and at that point, running TrueNAS SCALE with ZFS is mostly pointless - just set up a Linux distro of your choice with SPDK.

The only real benefit of RDMA (iSER/NVMe-oF) with ZFS is that it allows for higher throughput and lower CPU consumption from IO traffic.
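To give a flavour of what that SPDK setup involves - assuming SPDK is already built and its nvmf_tgt app is running, and with the zvol path, NQN and address below being placeholders - the RPC calls look roughly like this:
Code:
# create an AIO bdev backed by a zvol, then export it over NVMe-oF/RDMA
scripts/rpc.py bdev_aio_create /dev/zvol/tank/vm-vol1 zvol_bdev 4096
scripts/rpc.py nvmf_create_transport -t RDMA
scripts/rpc.py nvmf_create_subsystem nqn.2016-06.io.spdk:cnode1 -a -s SPDK00000001
scripts/rpc.py nvmf_subsystem_add_ns nqn.2016-06.io.spdk:cnode1 zvol_bdev
scripts/rpc.py nvmf_subsystem_add_listener nqn.2016-06.io.spdk:cnode1 -t rdma -f ipv4 -a 10.0.0.10 -s 4420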