VM-to-VM network performance with SR-IOV and eswitch vs. vswitch?

joek

New Member
Mar 20, 2016
27
12
3
103
Hello, long-time lurker, first-time poster. I have been a user and fan of virtualization and VMware since circa 1999 and have been consolidating my home infrastructure ever since (in addition to that of every company I have worked with).

I am interested in any information with respect to VM-to-VM network performance on the same physical ESXi box; in particular:
a) VM-to-VM using ESXi vswitch.
b) VM-to-VM using an SR-IOV-capable NIC with an eswitch.

Specifically, does an SR-IOV-capable NIC with an eswitch (e.g., Intel, QLogic or Mellanox):
a) Significantly reduce CPU overhead for VM-to-VM network traffic on the same box?
b) Is the VM-to-VM data rate limited by the NIC PHY?

I'm asking because I'm working on a new home consolidation build, but don't want to spend $500-700 if it doesn't buy more than simply buying more cores.

Thanks in advance.
 

pricklypunter

Well-Known Member
Nov 10, 2015
1,675
498
83
Canada
I have very limited experience with SR-IOV, so I could be way off the mark here, however my understanding is that you can have multiple VF instances, each of 10Gbps (PHY), and that data passing between them will not exceed that rate. It will however be processed much faster as it is directly accessed across the PCIe bus hardware as opposed to being handled by the Hypervisor OS (Hypervisor bypass) as is the case with VM-VM using the vSwitch. Before spending money though, I would wait till someone here with much more experience of SR-IOV explains things better than I am able to :)
 

joek

@pricklypunter -- Tx, sorry if I wasn't clear... I have a pretty good understanding of SR-IOV. What I'm primarily interested in is VM-to-VM (intra-node/box) performance using current 10+ GbE NICs with an embedded switch (eswitch, VEB or VEPA). The eswitches on these NICs are (purportedly) capable of routing traffic directly between eswitches, between VFs, or between VFs and the PHY/XAUI.

Conceivably such a NIC could accelerate VM-to-VM intra-node traffic. That is, it could be beneficial on a single node even if not connected to an external switch: connect vm's through such a NIC (bypassing ESXi and ESXi vswitch) and avoid the ESXi overhead.

Consider three topologies for intra-node traffic.
1. VM-to-VM ESXi vswitch -- Incurs ESXi overhead for all traffic.
2. VM-to-VM SR-IOV NIC with external switch -- No ESXi vswitch overhead; limited by XAUI/PHY.
3. VM-to-VM SR-IOV NIC with embedded switch -- No ESXi vswitch overhead; not limited by XAUI/PHY (?).

Given the cost of additional cores for (1) to drive high network rates, and the cost of an external switch and network-rate PHY limits for (2), then (3) might be a worthwhile investment, if all (or most) of the traffic is intra-node--and assuming that the NIC eswitch is not PHY line-rate limited.

In short, there is no reason why the NIC eswitch would be limited to anything other than the PCIe bus rate (give or take), as it would essentially be VM-DMA-NIC-eswitch-NIC-DMA-VM without ever touching the XAUI/PHY. That could significantly improve VM-VM intra-node performance at much lower power consumption (and possibly much reduced latency).
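
To put rough numbers on "PCIe bus rate (give or take)", here is a back-of-the-envelope sketch in Python. The per-lane transfer rates and line encodings are the published PCIe 2.0 and 3.0 figures; the x8 link width is an assumption (check the card's datasheet), and real throughput will be lower once TLP headers and descriptor traffic are counted.

```python
# Rough per-direction PCIe ceiling compared with a 10 Gb/s PHY.
# Assumption: an x8 link; packet-level (TLP/DLLP) overhead is ignored.

PCIE_GENS = {
    # generation: (transfer rate per lane in GT/s, line-encoding efficiency)
    "2.0": (5.0, 8 / 10),      # 8b/10b encoding
    "3.0": (8.0, 128 / 130),   # 128b/130b encoding
}


def pcie_gbps(gen: str, lanes: int) -> float:
    """Usable bandwidth in Gb/s, one direction, after line encoding only."""
    rate_gt, efficiency = PCIE_GENS[gen]
    return rate_gt * efficiency * lanes


if __name__ == "__main__":
    phy_gbps = 10.0
    for gen in PCIE_GENS:
        bw = pcie_gbps(gen, lanes=8)
        print(f"PCIe {gen} x8: {bw:.1f} Gb/s per direction "
              f"({bw / phy_gbps:.1f}x a 10 Gb/s PHY)")
```

That works out to roughly 32 Gb/s (Gen2 x8) and 63 Gb/s (Gen3 x8) per direction. A one-way VM-to-VM stream crosses the link once in each direction (host to NIC on transmit, NIC to host on receive), so the per-direction figure is the relevant upper bound, provided the eswitch logic itself can keep up.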

The question is then: What are the NIC's eswitch rate limits? Are they tied to the XAUI/PHY rate? The PCIe bus rate? Internal NIC eswitch rate (no one appears to publish numbers)? Or something else?

Thanks again. Any and all information appreciated.
 

Keljian

Active Member
Sep 9, 2015
429
71
28
Melbourne Australia
Basically the vswitch is only limited by CPU power and will transfer at a minimum of 10gb/s.

The only reason to use sr-iov or similar is so you decrease overhead, but realistically unless you have enormous bandwidth or low CPU cycles this is not an issue.

TL;DR: use virtual NICs wherever possible and only use SR-IOV in latency-sensitive situations or where paravirtual drivers don't exist.
 

joek

@Keljian -- Got it. Assume high bandwidth and low CPU cycles. Sorry, SR-IOV was probably a distraction, but the only NICs with embedded switches have SR-IOV, and require SR-IOV to use the embedded switch.

I'm looking to quantify the tradeoffs; specifically: does an embedded switch buy me anything in terms of CPU cycles, bandwidth, latency, and power? That's the $500 question I'm looking to answer.

For example, the Intel XL710 looks interesting if it can pump ~40Gb/s intra-node VM-to-VM at < 4 watts. With SR-IOV and its embedded switch it could potentially do ~40Gb/s, with lower latency and less power than pushing ~40Gb/s through a software/CPU vswitch.

Thanks again.
 

Keljian

Is power that much of a concern? What CPU are you running? Is upgrading it an option? What is the use case? (Need more info)

Typically speaking hardware switches are for chain topologies. I suppose this use could work.

Even the lowly Chelsio t420-so-cr has a hardware switch in it, has plenty of virtual functions and can do line speeds (2x10gbps)- it is far cheaper than $500
 

joek

I have an old e5-2450 plus several other critters. They are power hogs (total system) and having them on 24x7 pains me. It takes a significant chunk of available CPU to drive fast (10Gbe-class) intra-node vm-to-vm network traffic.

Upgrade is definitely in the cards (thanks for asking!). The objective is to reduce the current sprawl down to two systems. I am looking at a Xeon D-1541 or maybe Broadwell-EP (depending on what Intel shows us next week). Those two systems would be: one desktop (~8x5); one everything else (24x7, SAN and other infrastructure). Each will act as backup for the other, although not hot backup.

No interest or need for 10Gbe switch/infrastructure as nothing I have now or in the foreseeable future requires it. In any case, I'd rather consolidate to fewer-larger virtualized nodes and put systems that require 10Gbe-class connects on the same node, or maybe direct attach between a couple nodes. If external/physical 10Gbe-anything, it will be SFP+ direct attach between a very small number of systems with no external switch. 10Gbase-T is out; still way too much of a power hog and expensive (especially with an external 10Gbase-T switch).

I could probably stuff everything in one system with nominally equivalent or lower power and maybe a similar price, but having everything dependent on a single CPU-mobo-HBA-etc., which may take a few days to replace, scares me. Cold backup/sparing with only one hot system, even with an advanced replacement contract, is not reasonable IMHO (another discussion).

Given the current steep premium for >8 cores, I'd rather reserve cores for work other than driving network traffic. Getting 10Gbe+ intra-node throughput at lower power than driving the CPU's hard--and using the available CPU or TDP/turbo-boost for work other than networking would be great. Thus the question.

And back to the original question.... There is no reason why the NIC eswitch should be limited to the XAUI/PHY speed as intra-node VM-to-VM traffic never touches (or should never touch) the XAUI/PHY. The only fundamental speed limitation is thus (or should be) PCIe bus speed--and of course the eswitch speed. However, none of the vendors appear to publish information on the speed of their eswitches (at least that I can find). Any information or pointers would be greatly appreciated.

Hope that helps clarify. Thanks again.
 

Keljian

I really think you are overstating how much inter-VM traffic taxes the CPU. I would measure this before attempting an eswitch solution. (And you have nothing to lose in terms of setting it up/testing, then getting a nic with an eswitch if you need lower consumption). That said, there is nothing stopping you getting said inexpensive Chelsio and seeing what results you get as a "proof of concept". Please share them if you do.

Also, Xeon-d is not necessarily the best way to go. Especially for a desktop.

Modern processors (post Sandy Bridge, particularly Haswell and Broadwell) use very little power at idle, and unless you are driving one hard, there is no point going with a low-power processor. (Even if you are, having the power on tap can be useful.)

I can attest to broadwell using very very little power at idle.

You would likely be better off investing in a power supply which has high efficiency at low watts (sub 100w efficiency) than a Xeon-d, and will likely save a similar amount of energy.
 

joek

Have measured. In previous tests with my old system, it takes a significant chunk of one ~2GHz core (basically 100% of it) to drive a single connection through a vswitch at 4-6Gb/s. Which AFAICT is reasonably consistent with other more recent benchmarks--which seem to converge on a general rule that a modern ~2GHz core is needed to drive a 10GbE NIC. Would love to see more recent/accurate benchmarks if you have them.
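
For what it's worth, the arithmetic behind that rule of thumb can be sketched in a few lines of Python. The 4-6 Gb/s per ~2 GHz core range is the measurement quoted above; the 1500-byte frame size and linear scaling across cores are assumptions (jumbo frames or TSO/LRO would change the per-packet budget considerably).

```python
import math


def cycles_per_packet(core_hz: float, gbps: float, frame_bytes: int = 1500) -> float:
    """CPU cycles available per frame when one core is saturated at `gbps`."""
    packets_per_sec = gbps * 1e9 / (frame_bytes * 8)
    return core_hz / packets_per_sec


def cores_needed(target_gbps: float, gbps_per_core: float) -> int:
    """Whole cores needed for `target_gbps`, assuming linear scaling."""
    return math.ceil(target_gbps / gbps_per_core)


if __name__ == "__main__":
    core_hz = 2e9  # one ~2 GHz core
    for per_core in (4.0, 6.0):
        print(f"{per_core:.0f} Gb/s per core: "
              f"{cycles_per_packet(core_hz, per_core):.0f} cycles/frame, "
              f"{cores_needed(10, per_core)} cores for 10 Gb/s, "
              f"{cores_needed(40, per_core)} cores for 40 Gb/s")
```

On those assumptions a vswitch path costs something like 4000-6000 cycles per 1500-byte frame, and 40 Gb/s would eat 7 to 10 cores, which is why offloading the switching looks attractive on paper.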

Yeah, xeon-d may not be the best way to go for a desktop. Which is why I'm looking at splitting between a xeon-d for 24x7 infrastructure and something else for desktop (cue broadwell-ep, maybe).
 

Keljian

Well, seeing as I actually have a t420 in my home server, I will look at what it would take to use the eswitch, then I can potentially give you first hand experience- any assistance would be appreciated, I am running esxi
 

Keljian

A very quick google shows that if you bridge the ports using iptables this should activate the eswitch in Linux. This being the case, kvm/qemu is an option, and it may work if you assign the bridge to multiple VMs.

On that note, if you are using Linux for virtualisation, DPDK may drop your consumption significantly.

Using Open vSwitch* with DPDK for Inter-VM NFV Applications | Intel® Developer Zone

Unfortunately it is not an option for me (as I need aes-ni, and avx2.0 - which I can't confirm work with kvm yet)
 

joek

Wonderful! I have a couple topologies/diagrams I will post that might help. (But need to get off mobile.) I think you may be the first person to actually quantify eswitch performance (at least for the t420) . Thanks again.
 

joek

+1 for DPDK. Inter-VM network performance is a primary design focus of DPDK. Here are some supported NICs you can test with:
DPDK doc
Unfortunately as far as I can tell, DPDK doesn't help much to reduce ESXi overhead; it reduces kernel overhead. Intel and VMware's position appears to be: if you want to reduce ESXi overhead, use SR-IOV or DirectPath; what you do within the VM to reduce kernel overhead is up to you (DPDK).

I have attached a diagram which shows three configurations. I'm particularly interested in configuration B, and specifically the part highlighted in red. I included DPDK only to illustrate where it fits. Although I'm interested in DPDK, it is orthogonal to this question and you can ignore it in the diagram (just consider all VM's to contain vanilla kernel-application stack).

Apologies in advance for my drawing skills. Thanks again everyone.
 

Attachments

Keljian

I understand what you are suggesting, I just don't know how to enable the eswitch in esxi, with or without sr-iov

One possible configuration would be to pass through two virtual functions to a Linux VM, virtualise another two Linux vms within the VM using KVM, then create a bridge using ip chains.

That would mean doing passthrough again for the second (nested) layer of VMs.

The thing is, this wouldn't give you much in the way of benefits for esxi.
 

joek

Thanks! That's interesting. Pushing iSCSI read >300K IO/s and >2000MB/s vm-to-vm on the same node with an E5-1660 v2 is impressive. (Wish they showed the same with no offload and an ESXi vswitch on the target VM for comparison.)

This makes me think 8 cores and a decent NIC/eswitch may be better than buying >8 cores given the steep premium for >8 cores.
 

joek

I understand what you are suggesting, I just don't know how to enable the eswitch in esxi, with or without sr-iov ...
Reading various docs suggests you should not have to do anything special to enable the eswitch in an SR-IOV-capable NIC.[1] The eswitch, and the number of ports on the eswitch, is essentially a function of SR-IOV and the number of VFs defined.

For example, take a physical NIC and divide it into multiple virtual NICS/VFs using SR-IOV. Each of those VFs implicitly defines a port on the NIC's eswitch (otherwise no way for them to communicate). That is, any NIC which supports SR-IOV must have some sort of eswitch.
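
To make the "each VF is a port on the eswitch" idea concrete, here is a toy Python model of a VEB-style embedded switch. It is only a sketch of the forwarding decision as I understand it: the class and method names are made up for illustration, the MAC addresses are placeholders, and real NICs do this in hardware with VLAN and anti-spoofing filters layered on top.

```python
from dataclasses import dataclass, field

UPLINK = "uplink"  # stands in for the physical port (PHY)


@dataclass
class ToyVeb:
    """Toy model of a NIC embedded switch in VEB mode (hypothetical names)."""
    mac_to_vf: dict = field(default_factory=dict)

    def add_vf(self, vf: str, mac: str) -> None:
        # Defining a VF implicitly creates a switch port keyed by its MAC.
        self.mac_to_vf[mac.lower()] = vf

    def forward(self, dst_mac: str) -> str:
        """Return the egress port for a unicast frame addressed to dst_mac."""
        # Known local MAC: switched inside the NIC, never reaches the PHY.
        # Unknown MAC: sent out the physical port.
        return self.mac_to_vf.get(dst_mac.lower(), UPLINK)


if __name__ == "__main__":
    veb = ToyVeb()
    veb.add_vf("vf0", "02:00:00:00:00:01")
    veb.add_vf("vf1", "02:00:00:00:00:02")
    print(veb.forward("02:00:00:00:00:02"))  # vf1
    print(veb.forward("02:00:00:00:00:99"))  # uplink
```

In VEPA mode the lookup would be skipped: every frame goes out the uplink and relies on the external switch to hairpin it back, which is exactly the case where the PHY rate becomes the limit.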

A simple test: define one VF for each VM; unplug the NIC from the external switch. Can the VMs communicate over those VFs? I would hope and expect they can. If yes, then you have a working eswitch in the NIC.

Assuming the above test is successful, the next question is: What is the performance of the eswitch? Some may be limited by the PHY (although no reason they should be, but who knows), while others may be limited only by PCIe or eswitch logic.


[1] Although, e.g., the Intel XL710 docs indicate there are many VF- and eswitch-specific attributes, those attributes should be of marginal relevance.
 

Keljian

The simple test seems obvious, but I access the server over the nic, so I can't simply disconnect the nic from my switch.

What I can do is assign virtual functions to two fresh VMs, and then see if the throughput is >10gbps? Would this help?
 

joek

The simple test seems obvious, but I access the server over the nic, so I can't simply disconnect the nic from my switch.

What I can do is assign virtual functions to two fresh VMs, and then see if the throughput is >10gbps? Would this help?
Yes, that would be great! If throughput is > 10Gbs it would tell us the eswitch is not PHY limited and would help shed light on where it is potentially limited. If throughput is < 10Gbs, it tells us the eswitch may be limited elsewhere (PHY, eswitch logic, something else)?
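
If it helps, the measurement could be sketched with nothing but the Python standard library. This is a rough stand-in for a proper tool such as iperf3, not a replacement: a single Python sender may itself top out below 10 Gb/s, and the byte count is what the kernel accepted for sending, so treat the number as approximate. The function names and the 10 Gb/s threshold default are my own choices for illustration.

```python
import socket
import threading
import time

CHUNK = 1 << 20  # 1 MiB per send/recv call


def sink(server: socket.socket) -> None:
    """Accept one connection and discard data until the peer closes."""
    conn, _ = server.accept()
    with conn:
        while conn.recv(CHUNK):
            pass


def measure_gbps(host: str, port: int, seconds: float = 5.0) -> float:
    """Send zeros to host:port for `seconds`; return the send rate in Gb/s."""
    payload = bytes(CHUNK)
    sent = 0
    with socket.create_connection((host, port)) as s:
        start = time.perf_counter()
        while time.perf_counter() - start < seconds:
            s.sendall(payload)
            sent += len(payload)
        elapsed = time.perf_counter() - start
    return sent * 8 / elapsed / 1e9


def verdict(gbps: float, phy_gbps: float = 10.0) -> str:
    """Interpret the result along the lines discussed in this thread."""
    if gbps > phy_gbps:
        return "eswitch is not PHY-limited"
    return "limit is elsewhere or undetermined (PHY, eswitch logic, CPU, tool)"


if __name__ == "__main__":
    # Loopback self-test only. For the real test, run sink() in one VM and
    # measure_gbps() in the other, targeting the IP bound to the receiver's VF.
    server = socket.create_server(("127.0.0.1", 0))
    port = server.getsockname()[1]
    threading.Thread(target=sink, args=(server,), daemon=True).start()
    rate = measure_gbps("127.0.0.1", port, seconds=1.0)
    print(f"{rate:.2f} Gb/s -> {verdict(rate)}")
```

The two endpoints need to be in separate VMs, each with its own VF; if both addresses live in the same guest, the kernel will short-circuit the traffic and the NIC never sees it.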

In any case, your tests would inform and help. I really think this is an unexplored area that deserves some attention. Thanks again.