PVRDMA, RDMA, iSER, iSNS, RoCE v1/v2 - Has anyone else made sense of this?

AveryFreeman · Dec 25, 2021

Hey,

So I've finally taken the plunge into trying to configure some of my cards to run iSCSI on ESXi for some datastores and direct-shares of zvols. Eventually would like to do RDMA (I think?), but even ISCSI has been a total PITA compared to stuff like NFS and SMB.

I've had this stuff for at least a year now, but I find acronyms super annoying, so I think a certain avoidance has become second nature. I thought I'd be able to do 56Gb IB when I bought the cards (I have a handful of ConnectX-3 MCX354As flashed to FCBT ) but then I found out my upgrade to ESXi 7.0 made that out of the question. So now I'm trying to figure out the 40Gb ethernet + RDMA links to storage providers like LIO/targetcli, and it's been quite the wellspring of obnoxious minutiae.

For example, like, I was reading that to connect a VM to an RDMA endpoint in vSphere, the HCA (host controller adapter - I had to look that up, I thought CX3s were considered NICs, but that would waste a perfect opportunity to introduce another useless acronym) has to be connected to vSphere through a vDS (virtual distributed switch), but the vDS can only have ONE uplink connected (phys HCA port), even though my cards have two? e.g.: https://docs.vmware.com/en/VMware-v...tml#GUID-4A5EBD44-FB1E-4A83-BB47-BBC65181E1C2

Virtual machines that reside on different ESXi hosts require HCA to use RDMA . You must assign the HCA as an uplink for the vSphere Distributed Switch. PVRDMA does not support NIC teaming. The HCA must be the only uplink on the vSphere Distributed Switch.

Does that mean I can only have one PER HOST, or only one uplink connected to THE ENTIRE vDS?!

Is that just during setup, or like, all the time?! It's totally not clear. Here's another gem:

From The Basics of Remote Direct Memory Access (RDMA) in vSphere | VMware

RDMA over Converged Ethernet (RoCE) is supported with PVRDMA . . .fabric needs to support Priority Flow Control (PFC). PVRDMA supports both RoCE v1 and v2. The difference here is that RoCE v1 supports switched networks only, where RoCE v2 supports routed networks.

If my CX3 HCAs are only RoCE v1 compliant, does that mean that I have to make sure my little storage-only network VyOS VM I made basically just for running dnsmasq needs to be OFF unless the HCAs are using RoCE v2? (v1 = switched network only vs v2 = routing compliant) Is that what that means?

Or can I maybe change my CX3 HCAs to RoCE v2 with this setting: https://kb.vmware.com/s/article/79148

Code:
Changing RDMA NIC's RoCE Version in ISER Environments:

Code:

esxcli rdma iser params set -a vmhba69 -r 2

And PVRDMA supports RoCE v2, but IF my cards only support v1, does that mean I'm limited to the capabilities of my card (v1), or could I conceivably do some host-only (internal) networking with PVRDMA unattached to the HCA? Would that even make any difference if it were host-only, anyway...? And they all have to be set to the same thing? It's all just so unclear.

Oh, looks like I found an answer to that question here: https://blogs.vmware.com/vsphere/2020/10/para-virtual-rdma-support-for-native-endpoints.html

VMs running on the same ESXi host use memory copy for PVRDMA. This mode does not require ESXi hosts to have an HCA card connected

So that's cool, but still ... do I connect them to the host through the same networking endpoint? So, even the host-only PVRDMA needs to be connected to a vDS?

If I have to set them all ahead of time to be the same version throughout my stack, will I still be able to use them with iSER?

And seriously, though ... WTF is iSER? That's gotta be the lamest acronym ever (saying a lot). Is that still considered "networking", like SCSI over IP? Or does it render the HCA useless for other more networky stuff while it's running?

...I don't even know where to begin with this pile of flaming garbage:

Pre-requisites to enable PVRDMA support for native endpoints

ESXi host must have PVRDMA namespace support.

ESXi namespaces should not be confused with the vSphere with Tanzu/Tanzu Kubernetes Grid namespaces. In releases previous to vSphere 7.0, PVRDMA virtualized public resource identifiers in the underlying hardware to guarantee that a physical resource can be allocated with the same public identifier when a virtual machine resumed operation following the use of vMotion to move it from one physical host server to another. To do this, PVRDMA distributed virtual to physical resource identifier translations to peers when creating a resource. This resulted in additional overhead that can be significant when creating large numbers of resources. PVRDMA namespaces prevents these additional overheads by letting multiple VMs coexist without coordinating the assignment of identifiers. Each VM is assigned an isolated identifier namespace on the RDMA hardware, such that any VM can select its identifiers within the same range without conflicting with other virtual machines. The physical resource identifier no longer changes even after vMotion, so virtual to physical resource identifier translations are no longer necessary

There's just so much going on here, it's a lot to work with. If anyone has suggestions for someone trying to make this at least SORT OF simple, lemme know!

RageBone · Dec 25, 2021

iSer is plain iSCSi with RMDA.
that's it.
No mumbo jumbo, no RoceV1 or 2.
You turn it On on the server.
Your client supports it and you connect them to the server and it's done.

I think it works over Eth and IB. I'm not sure with which i used it, likely IB.
My main issue with it is that only few clients support it.
For instance Windows does not!
I'm also in doubt about freebsd at this point but i'm not sure.
I don't think iSER hogs your nic and prevents you from using it for other stuff, that'd be actually very unreasonable.
Where did you get that impression?

In regards to your vDS question about only one uplink.
I have 0 experience with esxi but i wager that this "limitation" is founded on the need to route between uplinks if there is more then one.
Be aware that link-aggregation and bridging makes multiple physical interfaces into one software interface.

Is there a limit on the amount of vDS?
if not, why not make one vDS per interface and VM?

iSCSI vs iSER vs SRP on Ethernet & InfiniBand

Benchmark comparison of iSCSI, iSER and SRP protocols.

www.zeta.systems

iSER
Stands for "iSCSI Extensions for RDMA" [link]. It basically extends the iSCSI protocol to include RDMA support.

With iSER based setups, you use the same software as IP based iSCSI. That is, you use the same initiator and target. You just configure the initiator side to use an iSER interface when performing iSCSI discovery. You don't need to learn anything new. As long as your hardware, initiator and target support iSER, it will work.

Unlike SRP, you can run it on Ethernet.

Target support is provided by SCST and LIO on Linux, COMSTAR on Solaris/illumos and StarWind on Windows.

The protocol specifications were first published in 2004. Like SRP, it is very mature and completely stable.

huh, who would have thunk

Rand__ · Dec 26, 2021

I think you are going about this with a slightly wrong approach.

In the end everything you have (hw and software, i.e switch, nic/hca, operating systems, storage provider [nfs, iscsi, smb]) need to share the same capabilities.

I had a basic post explaining all these at some point (inquiring really for details), but i can't find it and maybe it got lost.

But, you need to start from both ends - i.e. in your case - what capabilities has vmWare (lots added to 7), and what capabilities has your target operating system (which is)?

This is not only basic OS support (ie sure, linux can do roce), but whatever storage provider you want to use needs to support that protocol too - and not only as initiator (client), but also as a target (server) since you want ESXi to be the initiator that consumes storage from your storage box.

The actual amount of capable software targets is quite limited, i.e. the last time I checked there was no NFSoRDMA nor a NVMEoF target software for Linux [that was simple enough to use/working with ESXi] but that was a year+ ago.

There was/is Starwind o/c which has really good protocol support, but I never got warm with it (for various reasons), but ymmv.

So if you have found a workable combination then you can play around with the acronyms and try to get it working step by step (and then a lot of cross stack error searching most likely

).

AveryFreeman said:
Does that mean I can only have one PER HOST, or only one uplink connected to THE ENTIRE vDS?!

No it means that your HCA is the *only* uplink on that particular dVS, 1 or 4 interfaces on uplink ports does not matter, and o/c one per host

AveryFreeman said:
Is that just during setup, or like, all the time?! It's totally not clear. Here's another gem:

From The Basics of Remote Direct Memory Access (RDMA) in vSphere | VMware

If my CX3 HCAs are only RoCE v1 compliant, does that mean that I have to make sure my little storage-only network VyOS VM I made basically just for running dnsmasq needs to be OFF unless the HCAs are using RoCE v2? (v1 = switched network only vs v2 = routing compliant) Is that what that means?

Or can I maybe change my CX3 HCAs to RoCE v2 with this setting: https://kb.vmware.com/s/article/79148

No, the setting is esxi only, if the hw [firmware] does not support it it will not work

And if your VyOS VM is routing instead of switching then it will not workwith RoCE v1. And if your VyOS VM does not speak ECN and PFC it will not work either...

AveryFreeman said:
There's just so much going on here, it's a lot to work with. If anyone has suggestions for someone trying to make this at least SORT OF simple, lemme know!

It is not simple since it involves a lot on all ends of the chain. You can go with iWarp, that cuts out the switch config and is now officially in ESXi too

AveryFreeman · Dec 26, 2021

RageBone said:
iSer is plain iSCSi with RMDA.
that's it.
No mumbo jumbo, no RoceV1 or 2.
You turn it On on the server.
Your client supports it and you connect them to the server and it's done.

I think it works over Eth and IB. I'm not sure with which i used it, likely IB.
My main issue with it is that only few clients support it.
For instance Windows does not!
I'm also in doubt about freebsd at this point but i'm not sure.
I don't think iSER hogs your nic and prevents you from using it for other stuff, that'd be actually very unreasonable.
Where did you get that impression?

In regards to your vDS question about only one uplink.
I have 0 experience with esxi but i wager that this "limitation" is founded on the need to route between uplinks if there is more then one.
Be aware that link-aggregation and bridging makes multiple physical interfaces into one software interface.

Is there a limit on the amount of vDS?
if not, why not make one vDS per interface and VM?

iSCSI vs iSER vs SRP on Ethernet & InfiniBand

Benchmark comparison of iSCSI, iSER and SRP protocols.

www.zeta.systems

iSER
Stands for "iSCSI Extensions for RDMA" [link]. It basically extends the iSCSI protocol to include RDMA support.

With iSER based setups, you use the same software as IP based iSCSI. That is, you use the same initiator and target. You just configure the initiator side to use an iSER interface when performing iSCSI discovery. You don't need to learn anything new. As long as your hardware, initiator and target support iSER, it will work.

Unlike SRP, you can run it on Ethernet.

Target support is provided by SCST and LIO on Linux, COMSTAR on Solaris/illumos and StarWind on Windows.

The protocol specifications were first published in 2004. Like SRP, it is very mature and completely stable.

huh, who would have thunk

Thank you, yeah this is super helpful!

Maybe the issue I'm running into is I'm trying to do too much at once. Thankfully, LIO only allows me to do either iSCSI or iSER one at a time.

I am noticing that if I try to do iSER on the same vDS switch I'm using for iSCSI, they both take a colossal dump. I think I am going to need to make a separate one just for testing the iSER if I want to keep iSCSI functionality. But unfortunately, I don't know if the limitation is in LIO or vSphere, so using a separate vDS will basically just help me narrow down which one has the limitation.

vSphere's pretty picky about stuff. Either that, or it just doesn't like LIO with zvols.

I tried to present a 1M block size zvol to vSphere thinking it'd be a good match for a VMFS6 datastore, since VMFS6's smallest (only?) block size is 1MB.

It wouldn't mount it or format the partition because it was too weirded out by the underlying datastore not being 512n, 512e or 4kn. Jeez, never had that problem in COMSTAR, could make volblocksize whatever I wanted ...

If you know LIO, I also can't set my /backstore/block attribute block_size=512 to anything else, either, so I made my volblocksize=4096 hoping it'd think it was a 4kn drive. Touchy feely.

But wait, there's more. My linux distro won't load vmw_pvrdma drivers correctly, it keeps saying my pvrdma device is a vmxnet3 (regular 10GbE vNIC). This guy who works for VMware doing dev / testing for AI / ML workloads w/ GPUs over 100GbE said even HE has trouble getting the pvrdma drivers to load properly and includes a super rudimentary looking rc.d script to mitigate: RDMA | Earl C. Ruby III

So now that I write this, I'm guessing it's an RDMA driver in linux issue rather than the vDS, will try and get that working first. jtfc at least I guess this is progress.

i386 · Dec 26, 2021

RDMA = Remote Direct Memory Access
Not a protocol, but a concept for reading and writing directly to memory of a remote system bypassing the cpu (and the software stack)

RoCE = RDMA over Converged Ethernet
Concrete implementation of RDMA over ethernet. It was a proprietary/nonstandard technology by the Infiniband Trade Association (IBTA) and there are two version RoCE v1 which works on layer 2 and RoCE v2 which works on layer 3 which makes it routable. Mellanox used to be the only company implementing this protocol, but nowadays there are other companies implementing RoCE v2 (the biggest ones are intel and broadcom)

iWARP = Internet Wide-Area RDMA Protocol
This is another concrete implementation of RDMA over ethernet, but unlike RoCE this is an official standard. Compared to RoCE it was designed to be routable from the beginning and doesn't require any special configuration on a switch. For a long time Chelsio was the only manufacturer of network cards that supported iWARP, Intel which helped to create that standard started using it with their 100GBE nics (almost 15 years after creating that standard). Another Vendor who started to implement iWARP on their 100GBE+ nics is broadcom.

PVRDMA = Paravirtual RDMA
This is a virtual network card by vmware that supports RoCE v1 & v2. It can be used to give the advantages of RDMA to virtual machines.

iSER = iSCSI Extensions for RDMA
These extensions of the iSCSI protocol allow it to use RDMA over iWARP, RoCE or Infiniband

iSNS = Internet Storage Name Service
I have no idea about this protocol

AveryFreeman · Dec 26, 2021

i386 said:
RDMA = Remote Direct Memory Access
Not a protocol, but a concept for reading and writing directly to memory of a remote system bypassing the cpu (and the software stack)

RoCE = RDMA over Converged Ethernet
Concrete implementation of RDMA over ethernet. It was a proprietary/nonstandard technology by the Infiniband Trade Association (IBTA) and there are two version RoCE v1 which works on layer 2 and RoCE v2 which works on layer 3 which makes it routable. Mellanox used to be the only company implementing this protocol, but nowadays there are other companies implementing RoCE v2 (the biggest ones are intel and broadcom)

iWARP = Internet Wide-Area RDMA Protocol
This is another concrete implementation of RDMA over ethernet, but unlike RoCE this is an official standard. Compared to RoCE it was designed to be routable from the beginning and doesn't require any special configuration on a switch. For a long time Chelsio was the only manufacturer of network cards that supported iWARP, Intel which helped to create that standard started using it with their 100GBE nics (almost 15 years after creating that standard). Another Vendor who started to implement iWARP on their 100GBE+ nics is broadcom.

PVRDMA = Paravirtual RDMA
This is a virtual network card by vmware that supports RoCE v1 & v2. It can be used to give the advantages of RDMA to virtual machines.

iSER = iSCSI Extensions for RDMA
These extensions of the iSCSI protocol allow it to use RDMA over iWARP, RoCE or Infiniband

iSNS = Internet Storage Name Service
I have no idea about this protocol

Nice synopsis! Definitely keeps it all in one place. Now just gotta figure the nuts and bolts out.

I did figure out my pvrdma HCA / pretendaNIC doesn't work well if my storage-provider VM won't load its kernel extensions at boot - instead of loading up a PVRDMA vHCA (as specified in VM's settings), Linux thinks it is a VMXNET3 (10GbE vNIC) and uses those drivers for it.

Do you know anything about this? (maybe need some stubs in /etc/modprobe.d?). This guy is apparently engineer at VMW says he has the same issue and suggests a (shitty looking) workaround: RDMA | Earl C. Ruby III

One problem with solution he proposes is my distro OpenSUSE has no /etc/rc.d (or /etc/init.d or other sysvinit-style legacy leftovers) unfortunately.

Thank you ...

Search

PVRDMA, RDMA, iSER, iSNS, RoCE v1/v2 - Has anyone else made sense of this?

AveryFreeman

consummate homelabber

RageBone

Active Member

iSCSI vs iSER vs SRP on Ethernet & InfiniBand

Rand__

Well-Known Member

AveryFreeman

consummate homelabber

iSCSI vs iSER vs SRP on Ethernet & InfiniBand

i386

Well-Known Member

AveryFreeman

consummate homelabber