This only applies to lossless ethernet.

On Layer 2, Global Pause is enough. You need ECN only for Layer 3.
This is actually possible without a lossless fabric if you have Mellanox adapters which support resilient RoCE, that is, ConnectX-4 or newer.

Global Pause on Layer 2 for RoCE for iSER is fine. As said, you only need ECN on Layer 3.
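For what it's worth, the NIC side of both of those boils down to a couple of commands. A rough sketch, assuming a Mellanox/mlx5 adapter on interface enp1s0 and RoCE traffic on priority 3 (the interface name and priority are assumptions, and the ecn sysfs paths are the ones recent mlx5 drivers expose):

# Layer 2: enable Global Pause (802.3x link-level flow control) on the port
ethtool -A enp1s0 rx on tx on
ethtool -a enp1s0   # verify the pause settings took effect

# Layer 3: enable ECN-based congestion handling (DCQCN) for RoCE in the mlx5 driver,
# as notification point and reaction point for the chosen priority
echo 1 > /sys/class/net/enp1s0/ecn/roce_np/enable/3
echo 1 > /sys/class/net/enp1s0/ecn/roce_rp/enable/3

The switch ports still need matching flow control (or ECN marking for the Layer 3 case) configured; the NIC settings alone don't make the fabric behave.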
You can't run iSER on a lossy connection.
Or well, you can, but you will get data corruption.
HOW IS PACKET LOSS HANDLED?
Upon packet drop, a NACK (not ACK) control packet with the specific PSN (packet sequence number) is sent to the sender in order for it to re-transmit the packet.
I don't trust that this works fine with iSER.
Yes. While Nvidia/Mellanox have not been specific about how this is handled, I can only assume it is implemented using physical packet buffers on the adapters (which is why hardware support on ConnectX-4 and newer is necessary). In the case of a NACK, the adapters themselves would retransmit a copy of the packet from a buffer on the adapter, while ECN is used to slow down or stop transmission until the packet is properly delivered in the correct sequence.
I expect it's not a performance winner when it happens, but it does mitigate the need to set up a lossless fabric.
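If anyone wants to check whether their adapters qualify, the usual rdma-core tools will show the model and firmware. A quick sketch (device names will differ on your system):

lspci | grep -i mellanox                # adapter model - ConnectX-4 or newer is what matters here
ibv_devices                             # RDMA-capable devices seen by the verbs stack
ibv_devinfo | grep -E 'hca_id|fw_ver'   # device name and firmware version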
I found a research paper from Mellanox on this topic, though it's somewhat old (2017).
Until I see the specification for this, I won't trust it.
Seems to be handled transparently in firmware/hardware on the adapters by adapting the InfiniBand spec for handling congestion, since RoCE is, at the most basic level, InfiniBand traffic wrapped in Ethernet transport. Normally, in an InfiniBand network, if the receiver is unable to process a packet, the sender would simply wait before sending it. In this case, it seems they retroactively fall back on that mechanism whenever packet loss occurs, using a packet buffer and ECN to replicate it on Ethernet, so the iSCSI software layer on either side would simply be waiting for packets, without ever being aware that packet loss occurred.

From the paper:

Handling packet loss relies on the InfiniBand transport specification [5], as depicted in Figure 3. Packets are stamped with a packet sequence number (PSN). The responder in the operation accepts packets in order and sends an out-of-sequence (OOS) NACK upon receipt of the first packet in a sequence that arrived out of order. The OOS NACK includes the PSN of the expected packet. The requestor handles the OOS NACK by retransmitting all packets beginning from the expected PSN using the go-back-N style scheme. The lost packet is fetched again from the host memory. OOS NACK handling is a relatively complex flow in the NIC combining hardware and firmware. In order to minimize the impact of packet loss, retransmissions must be fast and efficient.
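One way to see whether this retransmission path is actually being exercised is to watch the per-port hardware counters the mlx5 driver exposes in sysfs. A sketch, assuming device mlx5_0 and port 1 (counter names can vary between driver versions):

# out-of-sequence / retransmission and ECN congestion-notification activity
for c in out_of_sequence packet_seq_err duplicate_request np_cnp_sent rp_cnp_handled; do
    echo -n "$c: "
    cat /sys/class/infiniband/mlx5_0/ports/1/hw_counters/$c
done

Roughly speaking, if these stay at zero under load nothing is being dropped or marked, and if they climb, the NACK/go-back-N recovery described above is what's keeping the transfer going.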
I've migrated 2 machines from Core 13 to SCALE, up to the latest version. I haven't noticed any difference in performance. IMHO, ZFS + zvol isn't where it should be performance-wise. In fact, I'm exploring options other than ZFS for NVMe drives, but that is for another thread.

Nice... now if SCALE were only as performant as TNC... or has this been improved? Else, do you happen to have any before/after values?
I checked around some more and found this: Revisiting Network Support for RDMA (berkeley.edu)

@tsteine
"The lost packet is fetched again from the host memory."
Now the question is whether that "host memory" belongs to the kernel module or to the application.
Since it's DMA, it "should" be application memory.
If they really mean application memory, then I'm worried about this technique, because the application would have to be aware of it. If that memory is freed or overwritten in the meantime, this will corrupt your data.
I need some more info on how this is implemented.
Some Q's ...

I just wanna say that as of the 22.12 release of TrueNAS SCALE, RDMA is doable in a few steps.
1. Add RDMA support to the system - I went with the OFED package from Mellanox/Nvidia (Mellanox OFED (MLNX_OFED) Software: End-User Agreement): add kernel support, enable nvmf and (optionally) NFS over RDMA (I'd like to further test Proxmox as an initiator to see if there are any benefits). The main initiator "target" for this is VMware - post 7.0.2, which has all the bells and whistles to support RDMA with iSER and NVMe-oF.

I ran the commands below; the results should render the system RDMA capable, with an RDMA-capable NVMe target, the iSER target module, etc.
# TrueNAS SCALE ships the apt binaries without the execute bit; the OFED installer needs apt working
chmod +x /bin/apt*
# Build and install MLNX_OFED against the running kernel, with NVMe-oF and NFSoRDMA support
./mlnxofedinstall --add-kernel-support --enable-mlnx_tune --with-nvmf --with-nfsrdma --skip-distro-check
I assume I only need this if I want iSCSI support for RDMA?

2. Rebuild scst - I did this with the latest git (3.7.0 - same as what's already in TrueNAS). Check the install script to make sure you are starting the services with the new binaries (/etc/init.d/scst is the script used to start the service, wrapped under the "goodness" of systemd in /etc/systemd/system/scst.service.d).
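For reference, a rebuild from the latest git roughly looks like the sketch below. The repository URL and make targets are assumptions pulled from SCST's public docs, so check them against the tree you actually clone, and keep in mind that a TrueNAS update will likely overwrite the custom binaries:

git clone https://github.com/SCST-project/scst.git
cd scst
make 2release            # assumed target: switch the tree to a release (non-debug) build
make all                 # build the scst core, iscsi-scst (including the iSER target) and dev handlers
make install             # install the kernel modules and userspace tools
systemctl restart scst   # restart through the systemd-wrapped init script mentioned above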
3. Reboot and, if everything is doing what it's supposed to, see your iSCSI target sessions with RDMAExtensions Yes in the logs.
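A few quick sanity checks after the reboot; the SCST sysfs paths are the standard ones, and the IQN below is just a placeholder:

lsmod | grep -E 'mlx5|scst|isert'        # RDMA driver plus SCST / iSER target modules loaded?
ls /sys/kernel/scst_tgt/targets/iscsi/   # targets registered with SCST
ls /sys/kernel/scst_tgt/targets/iscsi/iqn.2005-10.org.freenas.ctl:target0/sessions/   # placeholder IQN - lists active initiator sessions
dmesg | grep -iE 'rdma|iser'             # driver and target log lines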
Is the built-in NFS daemon now automagically RDMA enabled? I assume not.

ESXi doesn't support NFSoRDMA.
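For a Linux initiator (the Proxmox test mentioned earlier, for example) the in-kernel NFS server can be told to listen on RDMA by hand. A sketch, assuming the OFED build pulled in the NFSoRDMA modules and a share at /mnt/tank/share (both assumptions):

# on the TrueNAS side: load the RPC-over-RDMA transport and add an RDMA listener on the standard port
modprobe rpcrdma          # older kernels name the server-side module svcrdma
echo "rdma 20049" > /proc/fs/nfsd/portlist

# on the Linux client: mount the share over RDMA
mount -t nfs -o rdma,port=20049,vers=4.1 truenas.local:/mnt/tank/share /mnt/share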
Any idea re exposing TrueNAS SCALE via NVMe-oF?

At the moment, you would have to install SPDK on TrueNAS.

ok.
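That said, the --with-nvmf OFED install should already bring the in-kernel nvmet/nvmet-rdma target along, so for experimenting outside the UI a manual configfs sketch like the one below may be enough. Every name here (NQN, zvol path, address, port) is a placeholder, and nvmetcli wraps the same steps more comfortably:

modprobe nvmet
modprobe nvmet-rdma
cd /sys/kernel/config/nvmet
# define a subsystem backed by a zvol
mkdir subsystems/nqn.2023-01.local.truenas:testvol
echo 1 > subsystems/nqn.2023-01.local.truenas:testvol/attr_allow_any_host
mkdir subsystems/nqn.2023-01.local.truenas:testvol/namespaces/1
echo /dev/zvol/tank/testvol > subsystems/nqn.2023-01.local.truenas:testvol/namespaces/1/device_path
echo 1 > subsystems/nqn.2023-01.local.truenas:testvol/namespaces/1/enable
# expose it on an RDMA port
mkdir ports/1
echo rdma > ports/1/addr_trtype
echo ipv4 > ports/1/addr_adrfam
echo 192.168.10.10 > ports/1/addr_traddr
echo 4420 > ports/1/addr_trsvcid
ln -s /sys/kernel/config/nvmet/subsystems/nqn.2023-01.local.truenas:testvol ports/1/subsystems/

From a Linux initiator, nvme discover -t rdma -a 192.168.10.10 -s 4420 should then list the subsystem.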