This only applies to lossless ethernet.

On Layer 2, Global Pause is enough. You need ECN only for Layer 3.
This is actually possible without a lossless fabric if you have Mellanox adapters which support resilient RoCE, that is, ConnectX-4 or newer.

Global Pause on Layer 2 for RoCE for iSER is fine. As said, you only need ECN on Layer 3.
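For what it's worth, the NIC side of both of those boils down to a couple of commands. A rough sketch, assuming a Mellanox/mlx5 adapter on interface enp1s0 and RoCE traffic on priority 3 (the interface name and priority are assumptions, and the ecn sysfs paths are the ones recent mlx5 drivers expose):

# Layer 2: enable Global Pause (802.3x link-level flow control) on the port
ethtool -A enp1s0 rx on tx on
ethtool -a enp1s0   # verify the pause settings took effect

# Layer 3: enable ECN-based congestion handling (DCQCN) for RoCE in the mlx5 driver,
# as notification point and reaction point for the chosen priority
echo 1 > /sys/class/net/enp1s0/ecn/roce_np/enable/3
echo 1 > /sys/class/net/enp1s0/ecn/roce_rp/enable/3

The switch ports still need matching flow control (or ECN marking for the Layer 3 case) configured; the NIC settings alone don't make the fabric behave.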
You can't run iSER on a lossy connection.
Or well, you can, but you will get data corruption.
HOW IS PACKET LOSS HANDLED?
Upon packet drop, a NACK (not ACK) control packet with the specific PSN (packet sequence number) is sent to the sender in order for it to re-transmit the packet.
I don't trust that this works fine with iSER.
Yes. While Nvidia/Mellanox have not been specific about how this is handled, I can only assume it is implemented using physical packet buffers on the adapters (which is why hardware support on ConnectX-4 and newer is necessary). In the case of a NACK, the adapters themselves would retransmit a copy of the packet from a buffer on the adapter, while ECN is used to slow down or stop transmission until the packet is properly delivered in the correct sequence.
I expect it's not a performance winner when it happens, but it does mitigate the need to set up a lossless fabric.
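If anyone wants to check whether their adapters qualify, the usual rdma-core tools will show the model and firmware. A quick sketch (device names will differ on your system):

lspci | grep -i mellanox                # adapter model - ConnectX-4 or newer is what matters here
ibv_devices                             # RDMA-capable devices seen by the verbs stack
ibv_devinfo | grep -E 'hca_id|fw_ver'   # device name and firmware version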
I found a research paper from Mellanox on this topic, though it's somewhat old (2017).
Until I see the specification for this, I won't trust it.
Seems to be handled transparently in firmware/hardware on the adapters by adapting the InfiniBand spec for handling congestion, since RoCE is, at the most basic level, InfiniBand traffic wrapped in Ethernet transport. Normally, in an InfiniBand network, if the receiver is unable to process a packet, the sender would simply wait before sending it. In this case, it seems they retroactively fall back on that mechanism whenever packet loss occurs, using a packet buffer and ECN to replicate it on Ethernet, so the iSCSI software layer on either side would simply be waiting for packets, without ever being aware that packet loss occurred.

From the paper:

Handling packet loss relies on the InfiniBand transport specification [5], as depicted in Figure 3. Packets are stamped with a packet sequence number (PSN). The responder in the operation accepts packets in order and sends an out-of-sequence (OOS) NACK upon receipt of the first packet in a sequence that arrived out of order. The OOS NACK includes the PSN of the expected packet. The requestor handles the OOS NACK by retransmitting all packets beginning from the expected PSN using the go-back-N style scheme. The lost packet is fetched again from the host memory. OOS NACK handling is a relatively complex flow in the NIC combining hardware and firmware. In order to minimize the impact of packet loss, retransmissions must be fast and efficient.
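One way to see whether this retransmission path is actually being exercised is to watch the per-port hardware counters the mlx5 driver exposes in sysfs. A sketch, assuming device mlx5_0 and port 1 (counter names can vary between driver versions):

# out-of-sequence / retransmission and ECN congestion-notification activity
for c in out_of_sequence packet_seq_err duplicate_request np_cnp_sent rp_cnp_handled; do
    echo -n "$c: "
    cat /sys/class/infiniband/mlx5_0/ports/1/hw_counters/$c
done

Roughly speaking, if these stay at zero under load nothing is being dropped or marked, and if they climb, the NACK/go-back-N recovery described above is what's keeping the transfer going.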
I've migrated 2 machines from Core 13 to SCALE, up to the latest version. I haven't noticed any difference in performance. IMHO, ZFS + zvol isn't where it should be performance-wise. In fact, I'm exploring options other than ZFS for NVMe drives, but that is for another thread.

Nice... now if SCALE were only as performant as TNC... or has this been improved? Else, do you happen to have any before/after values?
I checked around some more and found this: Revisiting Network Support for RDMA (berkeley.edu)

@tsteine
"The lost packet is fetched again from the host memory."
Now the question is whether that "host memory" belongs to the kernel module or to the application.
Since it's DMA, it "should" be application memory.
If they really mean application memory, then I'm worried about this technique, because the application would have to be aware of it. If that memory is freed or overwritten in the meantime, this will corrupt your data.
I need some more info on how this is implemented.
Some Q's ...

I just wanna say that as of the 22.12 release of TrueNAS SCALE, RDMA is doable in a few steps.
1. Add RDMA support to the system - I went with the OFED package from Mellanox/Nvidia (Mellanox OFED (MLNX_OFED) Software: End-User Agreement): add kernel support, enable nvmf and (optionally) NFS over RDMA (I'd like to further test Proxmox as an initiator to see if there are any benefits). The main initiator "target" for this is VMware - post 7.0.2, which has all the bells and whistles to support RDMA with iSER and NVMe-oF.

I ran the commands below; the results should render the system RDMA capable, with an RDMA-capable NVMe target, the iSER target module, etc.
# TrueNAS SCALE ships the apt binaries without the execute bit; the OFED installer needs apt working
chmod +x /bin/apt*
# Build and install MLNX_OFED against the running kernel, with NVMe-oF and NFSoRDMA support
./mlnxofedinstall --add-kernel-support --enable-mlnx_tune --with-nvmf --with-nfsrdma --skip-distro-check
I assume I only need this if I want iSCSI support for RDMA?

2. Rebuild scst - I did this with the latest git (3.7.0 - same as what's already in TrueNAS). Check the install script to make sure you are starting the services with the new binaries (/etc/init.d/scst is the script used to start the service, wrapped under the "goodness" of systemd in /etc/systemd/system/scst.service.d).
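For reference, a rebuild from the latest git roughly looks like the sketch below. The repository URL and make targets are assumptions pulled from SCST's public docs, so check them against the tree you actually clone, and keep in mind that a TrueNAS update will likely overwrite the custom binaries:

git clone https://github.com/SCST-project/scst.git
cd scst
make 2release            # assumed target: switch the tree to a release (non-debug) build
make all                 # build the scst core, iscsi-scst (including the iSER target) and dev handlers
make install             # install the kernel modules and userspace tools
systemctl restart scst   # restart through the systemd-wrapped init script mentioned above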
3. Reboot and, if everything is doing what it's supposed to, see your iSCSI target sessions with RDMAExtensions Yes in the logs.
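A few quick sanity checks after the reboot; the SCST sysfs paths are the standard ones, and the IQN below is just a placeholder:

lsmod | grep -E 'mlx5|scst|isert'        # RDMA driver plus SCST / iSER target modules loaded?
ls /sys/kernel/scst_tgt/targets/iscsi/   # targets registered with SCST
ls /sys/kernel/scst_tgt/targets/iscsi/iqn.2005-10.org.freenas.ctl:target0/sessions/   # placeholder IQN - lists active initiator sessions
dmesg | grep -iE 'rdma|iser'             # driver and target log lines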
Is the built-in NFS daemon now automagically RDMA enabled? I assume not.

ESXi doesn't support NFSoRDMA.
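For a Linux initiator (the Proxmox test mentioned earlier, for example) the in-kernel NFS server can be told to listen on RDMA by hand. A sketch, assuming the OFED build pulled in the NFSoRDMA modules and a share at /mnt/tank/share (both assumptions):

# on the TrueNAS side: load the RPC-over-RDMA transport and add an RDMA listener on the standard port
modprobe rpcrdma          # older kernels name the server-side module svcrdma
echo "rdma 20049" > /proc/fs/nfsd/portlist

# on the Linux client: mount the share over RDMA
mount -t nfs -o rdma,port=20049,vers=4.1 truenas.local:/mnt/tank/share /mnt/share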
Any idea re exposing TrueNAS SCALE via NVMe-oF?

At the moment, you would have to install SPDK on TrueNAS.

ok.
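That said, the --with-nvmf OFED install should already bring the in-kernel nvmet/nvmet-rdma target along, so for experimenting outside the UI a manual configfs sketch like the one below may be enough. Every name here (NQN, zvol path, address, port) is a placeholder, and nvmetcli wraps the same steps more comfortably:

modprobe nvmet
modprobe nvmet-rdma
cd /sys/kernel/config/nvmet
# define a subsystem backed by a zvol
mkdir subsystems/nqn.2023-01.local.truenas:testvol
echo 1 > subsystems/nqn.2023-01.local.truenas:testvol/attr_allow_any_host
mkdir subsystems/nqn.2023-01.local.truenas:testvol/namespaces/1
echo /dev/zvol/tank/testvol > subsystems/nqn.2023-01.local.truenas:testvol/namespaces/1/device_path
echo 1 > subsystems/nqn.2023-01.local.truenas:testvol/namespaces/1/enable
# expose it on an RDMA port
mkdir ports/1
echo rdma > ports/1/addr_trtype
echo ipv4 > ports/1/addr_adrfam
echo 192.168.10.10 > ports/1/addr_traddr
echo 4420 > ports/1/addr_trsvcid
ln -s /sys/kernel/config/nvmet/subsystems/nqn.2023-01.local.truenas:testvol ports/1/subsystems/

From a Linux initiator, nvme discover -t rdma -a 192.168.10.10 -s 4420 should then list the subsystem.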