Mellanox ConnectX-2 and ESXi 6.0 - Barely Working - Terrible Performance


dswartz

Active Member
Jul 14, 2011
610
79
28
ESXi forces sync writes on for all NFS writes. You need to get a decent SLOG SSD to fix this.
 

humbleThC

Member
Nov 7, 2016
99
9
8
48
Found this link, which is interesting; perhaps it's something NFS-security related.
malayter.com: Fixing slow NFS performance between VMware and Windows 2008 R2
- I know it's old and related to Windows 2008 R2, but it's oddly similar.

Also here's my current NFS config.

PS C:\Users\Administrator> Get-NfsServerConfiguration


State : Running
LogActivity :
CharacterTranslationFile : Not Configured
DirectoryCacheSize (KB) : 128
HideFilesBeginningInDot : Disabled
EnableNFSV2 : True
EnableNFSV3 : True
EnableNFSV4 : True
EnableAuthenticationRenewal : False
AuthenticationRenewalIntervalSec :
NlmGracePeriodSec : 45
MountProtocol : {TCP}
NfsProtocol : {TCP}
NisProtocol : {TCP}
NlmProtocol : {TCP}
NsmProtocol : {TCP}
PortmapProtocol : {TCP}
MapServerProtocol : {TCP}
PreserveInheritance : False
NetgroupCacheTimeoutSec : 30
UnmappedUserAccount :
WorldAccount : Everyone
AlwaysOpenByName : False
GracePeriodSec : 240
LeasePeriodSec : 120
OnlineTimeoutSec : 180
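
(For reference, and not part of the config dump above: mounting a Windows NFS export as an ESXi datastore looks roughly like the sketch below; the hostname, export path and datastore name are placeholders.)

esxcli storage nfs add -H nas01 -s /VMwareNFS -v nfs-ds01   # NFSv3 mount: -H host, -s export path, -v datastore name
esxcli storage nfs list                                     # confirm the datastore is mounted and accessible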
 

humbleThC

Member
Nov 7, 2016
99
9
8
48
ESXi forces sync writes on for all NFS writes. You need to get a decent SLOG SSD to fix this.
The NAS doesn't have ZFS, but it does have the equivalent of the SLOG, in that I have a dedicated pair of Samsung 850 EVO SSDs for write cache (on separate LSI 9211 IT-mode adapters). Reads/writes that hit pure cache generally perform around 1.0-1.2 GB/s (and since I have 1TB of cache, pretty much everything I do is cached all the time). Behind that I have 10x Hitachi 4TB NAS drives (64MB cache, 7.2K RPM), which easily sustain about 800 MB/s of bandwidth when destaging the SSD cache.

But I have read about ESXi forcing sync on all NFS writes, and about issues with thin provisioning over NFS.

Is there any way to use async on ESXi?
 

dswartz

Active Member
Jul 14, 2011
610
79
28
Not that I know of. If running ZFS, you can say 'zfs set sync=disabled xxx', where 'xxx' is the dataset name.
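
(For reference, a minimal sketch of toggling that setting, assuming a dataset named tank/vmstore:)

zfs get sync tank/vmstore            # default is 'standard', i.e. honour client sync requests
zfs set sync=disabled tank/vmstore   # ack writes immediately; fast, but risks losing in-flight data on power loss or crash
zfs set sync=standard tank/vmstore   # revert when done testing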
 

humbleThC

Member
Nov 7, 2016
99
9
8
48
I finally just finished upgrading the NAS box.
- Went from Windows 2012 R2 to Windows 2016
- Went from 2x Samsung 850 Evo's to 4x
- Went from 5x Hitachi 4TB NAS to 10x
- Installed a 3rd and 4th LSI 9211-8i, and split the SSDs up 2 per LSI channel
- Rest of the HDDs are split up on the remaining 2 adapters. (PCIe 2.0 x8 slots)
(Partly because my original 14.5TB was nearly full, partly because I wanted more disk IOPS & bandwidth.)

I've replicated the exact same 'meh' NFS performance, and iSCSI is still 'it works, but not at disk or interface speeds'.
I am a fan of the 1.8.2.5 OFED drivers over the 1.8.2.4 though, particularly for iSCSI performance (I noticed a minor improvement).

I really didn't see any improvement in raw disk performance by going from 2x SSDs as journal to 4x.
(Using a single vDisk across the entire pool, with Write-Back Cache set to 466GB, i.e. all 4x SSDs in RAID1.)

My current research into alternative options is pointing toward Ubuntu 16.04 - 16.10:
Supports ConnectX-2 in IPoIB mode
Supports ZFS
And that's about all I really care about :)
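
(Roughly what that Ubuntu setup might look like, as a sketch; the interface name, address, pool name and device names are all placeholders:)

apt install zfsutils-linux                      # ZFS is in the stock Ubuntu repos from 16.04 onward
modprobe ib_ipoib                               # IPoIB module for the ConnectX-2 port
ip addr add 10.0.0.10/24 dev ib0                # ib0 is the usual IPoIB interface name
ip link set ib0 up
zpool create tank raidz1 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde   # placeholder devices; layout discussed later in the thread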

It would be really nice to keep SMB 3.1.1 support with RDMA, which I have now.
It would also be really nice to get NFS 4.1 support, which I don't have now with Windows 2016.

(Oh also added another ESX server HP DL380 G7, 2x X5660 Hexacores @ 2.8GHz with 144GB RAM, ConnectX-2 Dual Port QDR)
 

mpogr

Active Member
Jul 14, 2016
115
95
28
53
If you really want things to fly with your setup, you pretty much have to switch to SRP. 1.8.2.5 is the latest (and final) VMware driver from Mellanox supporting an SRP initiator. It probably won't work with ESXi 6.5 (if you decide to upgrade in the future), but it works very well with ESXi 6.0.

Unfortunately, this severely limits your choice of targets. The only SRP target that is up to date with the latest Mellanox OFED packages is SCST on Linux. You could also try Solaris, but I've had issues with it in the past. So, for a decent NAS, that would mean ZFS on Linux, which I've been successfully using for the last 2 years.

Going forward, Mellanox declared they would no longer develop SRP, so, for decent RDMA storage, you're left with iSER. Unfortunately, Mellanox have persistently neglected releasing iSER initiators for ESXi. At the moment, the situation is very messy and unclear.

For starters, there is no official ESXi driver supporting both iSER and ConnectX-2. You can find an unofficial 1.8.3 build that includes iSER support over IPoIB working with ConnectX-2, but it's challenging to keep up to date with the newest ESXi releases.
The only official driver from Mellanox supporting iSER on ESXi 5.5 and 6.0 (not 6.5!) is 1.9.10.5. However, it supports only ConnectX-3 (and Pro) and ONLY over Ethernet, not over IB. This means you not only have to replace your NICs, but most likely the switch as well. The 2.x driver family does support IB but not iSER. The new adaptors (Connect-IB and ConnectX-4/5) don't even have an ESXi driver supporting IB in any shape or form (only ETH!), let alone iSER. Which kind of shows Mellanox's commitment (or lack thereof) to their ESXi customers.
 

humbleThC

Member
Nov 7, 2016
99
9
8
48
It's about that time to start digging into this again; I very much appreciate all the comments/feedback.
Minor upgrades to my setup include RAM upgrades across the board: 80GB on the NAS, 120GB/144GB on ESX01/02.

NET
Mellanox 4036-E with subnet manager configured
4x ConnectX-2 (mt26428) running FW 2.10.0720 across the board

NAS
SuperMicro 36x Bay Chassis w/ Dual Xeon QC X5560 @ 2.8GHz & 80GB ECC RAM
Windows Server 2016 - using Storage Spaces for a 10-disk pool with 4x SSDs as dedicated journal. (I'm definitely not getting 4x SSDs' worth of bandwidth on writes, however.)
- Driver 5.1.11548.0 (Mellanox provided)

ESX
A pair of ESX 6.0 U2 hosts (HP DL380s, one G6, one G7)
- Driver 1.8.2.5
Although it's "working" over iSCSI, I can't help but feel I'm leaving some performance on the table.

I'm doing some more R&D to validate whether Ubuntu 16.10 is my best choice for a Linux ZFS server serving CIFS/NFS/iSCSI and supporting ConnectX-2s in IPoIB mode.
 

mpogr

Active Member
Jul 14, 2016
115
95
28
53
Mate, you definitely shouldn't expect any sort of decent performance from something like Storage Spaces as a foundation of an ESXi datastore.

The best alternative to proper enterprise storage solutions like EMC or HP would be iSCSI or NFS over RDMA. Unfortunately, Mellanox have really bad support for ESXi in general, and, as of late, they seem to have abandoned releasing new drivers that include any sort of RDMA-accelerated transport. Maybe there is something both Mellanox and VMware are not telling us and stock VMware Software iSCSI initiator and/or NFS combined with the out-of-the-box Mellanox drivers (IB?ETH?) do utilise RDMA, but I highly doubt it. I'm pretty sure they would have been broadcasting it from every possible IT news outlet if that had been the case.

Therefore, you're left with the following choices:
  • SRP - use 1.8.2.4 (for ESXi 5.x) or 1.8.2.5 (for ESXi 6.0) drivers with one of the following targets: Solaris 11.x (any variant) or Linux with either inbox/LIO or Mellanox OFED/SCST (you can't use Mellanox OFED with LIO as they stripped SRP support out of it). In any case, you can use it only over Infiniband (not Ethernet), which means you need either a managed switch or OpenSM running on one of the computers connected to your fabric.
  • iSER - use either 1.8.3 (for ESXi 5.x or 6.0 with forced installation) over Infiniband or 1.9.x.x (for 5.x or 6.0) over Ethernet. You will have to use a Linux target, both LIO and SCST should work with both inbox and Mellanox OFED drivers. The second option requires an Ethernet switch though.
Both solutions assume an iSCSI datastore. I'm not aware of any way to utilise an NFS-based datastore with RDMA.
Also, as you might have noticed, none of the above support ESXi 6.5, and I can confirm forced installation doesn't help, as all Mellanox older ESXi drivers collide with 6.5 components one way or another.
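
(For anyone following along, installing one of those Mellanox OFED bundles on ESXi 6.0 goes roughly like the sketch below; the depot filename is an example, so use whatever bundle Mellanox actually shipped.)

esxcli software acceptance set --level=CommunitySupported
esxcli software vib install -d /tmp/MLNX-OFED-ESX-1.8.2.5.zip --no-sig-check   # add --force for the 'forced installation' cases mentioned above
reboot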
 

humbleThC

Member
Nov 7, 2016
99
9
8
48
Mate, you definitely shouldn't expect any sort of decent performance from something like Storage Spaces as a foundation of an ESXi datastore.

The best alternative to proper enterprise storage solutions like EMC or HP would be iSCSI or NFS over RDMA. Unfortunately, Mellanox have really bad support for ESXi in general, and, as of late, they seem to have abandoned releasing new drivers that include any sort of RDMA-accelerated transport. Maybe there is something both Mellanox and VMware are not telling us and stock VMware Software iSCSI initiator and/or NFS combined with the out-of-the-box Mellanox drivers (IB?ETH?) do utilise RDMA, but I highly doubt it. I'm pretty sure they would have been broadcasting it from every possible IT news outlet if that had been the case.

Therefore, you're left with the following choices:
  • SRP - use 1.8.2.4 (for ESXi 5.x) or 1.8.2.5 (for ESXi 6.0) drivers with one of the following targets: Solaris 11.x (any variant) or Linux with either inbox/LIO or Mellanox OFED/SCST (you can't use Mellanox OFED with LIO as they stripped SRP support out of it). In any case, you can use it only over Infiniband (not Ethernet), which means you need either a managed switch or OpenSM running on one of the computers connected to your fabric.
  • iSER - use either 1.8.3 (for ESXi 5.x or 6.0 with forced installation) over Infiniband or 1.9.x.x (for 5.x or 6.0) over Ethernet. You will have to use a Linux target, both LIO and SCST should work with both inbox and Mellanox OFED drivers. The second option requires an Ethernet switch though.
Both solutions assume an iSCSI datastore. I'm not aware of any way to utilise an NFS-based datastore with RDMA.
Also, as you might have noticed, none of the above support ESXi 6.5, and I can confirm forced installation doesn't help, as all Mellanox older ESXi drivers collide with 6.5 components one way or another.
Excellent feedback, much appreciated!

My next step is testing OmniOS (a branch of illumos, which is itself a branch of Solaris).
I got napp-it installed on top to manage ZFS, and I'm benchmarking various RAIDZ configurations, with and without SSDs configured in various ways for log and cache.

I'm trying to see what the theoretical max disk performance is, prior to testing across the network.

However, I did get a link light and full out-of-the-box driver support for the ConnectX-2 with OmniOS, and the very first iperf came back at 15.8 Gb/s with 4 threads (already faster than I was able to achieve between Windows>Windows or Windows>ESX in the past), so I'm getting optimistic.
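
(That test is just stock iperf over the IPoIB interfaces; the address is a placeholder:)

iperf -s                         # on the OmniOS box
iperf -c 10.0.0.10 -P 4 -t 30    # on the client: 4 parallel streams for 30 seconds, report the aggregate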
 

humbleThC

Member
Nov 7, 2016
99
9
8
48
So I'm pretty sure I missed something critical in testing NFS performance between Windows & ESX (sync writes = on).
My ZFS reading/testing basically spelled it out. For the same reasons ZFS struggles with small random writes when sync is enabled, an ESX host NFS-mounting a Windows share struggles with that specific workload.

i.e. In some cases powering on a VM from an NFS datastore was fast and the overall feel of the VM was swift, but any time I tried to vMotion, throughput would drop to 3-30 MB/s max and often time out and fail. Same for deploying an OVF to an NFS datastore.

I think the fix may be as simple as disabling sync writes and running in non-POSIX-compliant mode. Of course there are risks associated with this, and it's not recommended for critical production workloads. But if you're stuck on Windows for your NFS export, I'd very much look into disabling sync.

It's effectively the same in ZFS: sync is designed to protect you, but if you want, you can disable it and reap the performance benefits in any I/O workload.
 

mpogr

Active Member
Jul 14, 2016
115
95
28
53
So I'm pretty sure I missed something critical in testing NFS performance between Windows & ESX (sync writes = on).
My ZFS reading/testing basically spelled it out. For the same reasons ZFS struggles with small random writes when sync is enabled, an ESX host NFS-mounting a Windows share struggles with that specific workload.

i.e. In some cases powering on a VM from an NFS datastore was fast and the overall feel of the VM was swift, but any time I tried to vMotion, throughput would drop to 3-30 MB/s max and often time out and fail. Same for deploying an OVF to an NFS datastore.

I think the fix may be as simple as disabling sync writes and running in non-POSIX-compliant mode. Of course there are risks associated with this, and it's not recommended for critical production workloads. But if you're stuck on Windows for your NFS export, I'd very much look into disabling sync.

It's effectively the same in ZFS: sync is designed to protect you, but if you want, you can disable it and reap the performance benefits in any I/O workload.
I don't quite understand why you would even consider running an NFS server on Windows. It's basically forcing a technology onto a platform it hasn't been designed for. You'd be so much better off with any variant of Solaris, FreeNAS or Linux running NFS.

Now, while there is a long-standing dispute between iSCSI and NFS (on top of one of the well-established platforms above) as a foundation of an ESXi datastore, there is one crucial thing that makes a huge difference when your transport is Infiniband: RDMA support. And, in this department, you're left just with iSCSI (either in the form of SRP or iSER). Yes, you can enable NFS support for RDMA when both the client and the server are Linux, but, remember, your client is ESXi, and this limits your choices quite severely. I'd suggest you read my previous post again to understand what options are available at the moment.
 

humbleThC

Member
Nov 7, 2016
99
9
8
48
I don't quite understand why you would even consider running an NFS server on Windows. It's basically forcing a technology onto a platform it hasn't been designed for. You'd be so much better off with any variant of Solaris, FreeNAS or Linux running NFS.

Now, while there is a long-standing dispute between iSCSI and NFS (on top of one of the well-established platforms above) as a foundation of an ESXi datastore, there is one crucial thing that makes a huge difference when your transport is Infiniband: RDMA support. And, in this department, you're left just with iSCSI (either in the form of SRP or iSER). Yes, you can enable NFS support for RDMA when both the client and the server are Linux, but, remember, your client is ESXi, and this limits your choices quite severely. I'd suggest you read my previous post again to understand what options are available at the moment.
Well, a good majority of the time I'm hitting the NAS from my primary Windows 10 workstation, which also happens to have a 40Gb QDR IB HCA. The nice part of Windows 2012/2016 is SMB3 support with RDMA, and driver support is a win. It's just that ESX doesn't use SMB3, of course, so I also want to share the 'large pool of disks with SSD cache' with my ESX lab. That means NFS or iSCSI, but iSCSI requires that I carve out space and dedicate it. An NFS share would allow me to share the same volume/filesystem between CIFS and NFS. In a perfect world, it would work.
 

mpogr

Active Member
Jul 14, 2016
115
95
28
53
Of course it will work, just don't expect it to perform anywhere near its potential. Disk performance is a crucial factor in VM performance, and, in most cases, it's handicapped by small-block read/write latency. NFS on a Windows server (especially run in parallel with high-volume NAS-style transfers over SMB) is hardly going to give you low latency. That kind of kills the whole purpose of having Infiniband cards in your ESXi hosts. I doubt you'll see any performance improvement over 1 Gbps Ethernet this way...
 

mpogr

Active Member
Jul 14, 2016
115
95
28
53
I re-read the entire thread from the beginning and here are some thoughts: you never mentioned what kind of underlying storage you have. I started playing with my storage solutions for ESXi a while ago. Almost the only thing that I kept constant was using ZFS, but everything else has been changed/played with: OS (Nexenta/OmniOS/Solaris 11.3/CentOS), all-in-one vs. bare metal, HDD RAID configuration (RAIDZ2, RAID10), SLOG (HDD, hardware controller, consumer SSD, enterprise SSD), datastore protocol (NFS vs. iSCSI), network infrastructure (teamed Ethernet, Infiniband: host-to-host/via a switch) and so on. My key takeaways:
1. Don't use RAIDZ, it's too slow. For redundancy, use RAID10 with a hot spare.
2. Always have an SLOG, even if it's a slow one. One thing ZFS absolutely hates and punishes you badly for is sync writes without an SLOG.
3. There is a HUGE difference between enterprise and consumer grade SSDs used for SLOG. Enterprise SSDs have capacitors assuring write completion even in the event of power loss. As a result of this, they effectively operate as async devices even when asked to perform sync writes.
4. If you care for performance, don't enable dedupe. One way or another, it will punish you when it needs to be broken. Sometimes very badly.
5. Beyond the point of RAID10 + fast SLOG, you can expect only minor improvements from the following: moving to faster network (IB instead of 1 Gbps Ethernet), utilising RDMA (iSCSI + SRP/iSER) etc. These are nice, but not too noticeable until your workload increases significantly.
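
(A minimal sketch of the layout described in points 1-3, with placeholder device names:)

zpool create vmpool \
  mirror disk0 disk1  mirror disk2 disk3 \
  mirror disk4 disk5  mirror disk6 disk7 \
  spare disk8 \
  log slog-ssd0
# sync stays at the default 'standard'; the SLOG absorbs the sync writes instead of the data vdevs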

It is important to verify that your issues with vMotion etc. are actually related to the network and not something else. If everything is working fine, I would expect vMotion to work reasonably well even over 1 Gbps Ethernet. Give it a try. If it still times out, it's time to look at your underlying storage solution and think of alternatives. NFS on Windows Server would be my primary suspect...
 

mpogr

Active Member
Jul 14, 2016
115
95
28
53
Just adding on top of the above: SMB usage patterns tend to be low-concurrency, short-duration, high-throughput ones, which is the complete opposite of the ESXi->datastore pattern. So combining those on the same array is not ideal anyway.

That said, I do have a couple of additional virtualised storage appliances in my landscape (based on Solaris 11.3 via iSCSI over IPoIB/FDR) that I use to ensure non-interrupted operation of my most critical VMs (such as Domain Controllers and pfSense Internet Gateway) via in-OS drive mirroring (mirrored drives come from different Datastores). However, I predominantly use these appliances as SMB share providers. I can tell you I'm getting really good throughput (200-300 MB/s) when I copy large (10s of GB) files to and from them, which is likely limited by the HDDs themselves. And how often do you need that throughput anyway? On the other hand, your VMs are accessing their virtual HDDs all the time, and that access latency is critical for the VM responsiveness.

So, I'd suggest trying Solaris/FreeNAS/Linux with ZFS and adding a CIFS export to it. I'm pretty sure you'll get a much better overall experience than with your current server, and you're hardly going to notice the loss of RDMA for SMB.
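
(The CIFS-export part is a one-liner on any of those platforms; the dataset name is a placeholder:)

zfs create tank/share
zfs set sharesmb=on tank/share   # illumos/Solaris use the in-kernel SMB server; on Linux this hands the share off to Samba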
 

humbleThC

Member
Nov 7, 2016
99
9
8
48
Thanks much for your comments/notes!

I've spent most of my time so far playing with the hardware and architecture, trying to understand what my baseline expectations should be for the hardware's performance/capabilities. That means testing the individual HDDs and SSDs separately and together in various RAID configurations, under various OSes, and comparing against known vendor specs and reliable benchmark/review sites. (I've only really tested Windows 2012 R2, Windows 2016, and OmniOS w/ napp-it so far personally.)

I started off with 5x HDDs (Hitachi 4TB NAS 7.2k) and 2x SSDs (Samsung 250GB 850 Evo) on (3) LSI 9211-8i's in IT mode.
In Windows I was able to get about 1 GB/s on bursty large-block sequential transfers across the pipe.

I then added 5x more matching HDDs and 2x more matching SSDs.
The way I was using the SSDs in Windows as dedicated write cache worked 'well enough' to handle the bursts of incoming IOs and destage them to disk after the client acks. The only problem is it didn't scale: I added 100% more disks and got 100% more capacity, but only 0-1% more performance.

I then, on the advice of the ZFS gurus in my other thread, decided to purchase:
(4) Intel DC S3710 400GB SSDs and (2) more LSI 9211-8i's.

Although this SuperMicro server is two generations old, it does have 2x Intel X5660 QC @ 2.8GHz and 80GB DDR3 ECC.
It's also limited to PCIe 2.0 x8 bandwidth (7 total slots, physically x16, with x8 bandwidth available).
My thought here is that I'm never going to max out a proper/new PCIe NVMe card at 2-4 GB/s, due to PCIe bandwidth.
But having (5) separate LSI adapters on (5) separate PCIe buses should give me a solid 1-1.5 GB/s per controller (assuming there's enough disk behind each, and there should be).

[Attached image: upload_2017-1-12_8-38-13.png (design/layout diagram)]

The design might be confusing, but basically I'm splitting every SSD of each type onto its own controller.
I'm splitting the 10x HDDs evenly across the 5x controllers and using RAIDZ (4+1), so that no two members of any disk group ever touch the same controller - hopefully for both redundancy and performance.

For clarity, I'm not using any hardware RAID; all my LSIs are flashed to P20 IT mode. The RAID listed is the software RAID that ZFS will be handling.
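
(A sketch of that layout in zpool terms: two 5-disk RAIDZ1 vdevs, one disk per HBA in each vdev. Device names below are placeholders; in practice /dev/disk/by-path entries, or the controller/target names under OmniOS, would make the per-HBA mapping explicit.)

zpool create tank \
  raidz1 hba0-d0 hba1-d0 hba2-d0 hba3-d0 hba4-d0 \
  raidz1 hba0-d1 hba1-d1 hba2-d1 hba3-d1 hba4-d1
# one possible use of the four S3710s (see the next post for an alternative view):
zpool add tank log mirror s3710-a s3710-b    # mirrored SLOG
zpool add tank cache s3710-c s3710-d         # striped L2ARC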
 

mpogr

Active Member
Jul 14, 2016
115
95
28
53
Wow, that's a heck of a design!

I'm just trying to understand what your end goal is here. How much storage exactly do you need for VMware datastores, and how much for CIFS? Can you provide some input?

Intuitively, first and foremost I'd recommend splitting the duties between these two use cases. Because the usage patterns are so different, mixing workloads related to these on top of a single array is probably not a good idea.

For CIFS storage, RAIDZ1 is fine speed-wise and is probably optimal capacity-wise, and SLOG is not really required. When you create a ZFS file system optimised for throughput (which should be your goal for this scenario), the SLOG is bypassed anyway. L2ARC would help, but it doesn't have to be huge. And I would definitely not go out of my way to make it RAID-0 or something like that.
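
(The "optimised for throughput" part presumably maps to the logbias property; a sketch with placeholder names:)

zfs create tank/cifs
zfs set logbias=throughput tank/cifs   # large writes go straight to the data vdevs, bypassing the SLOG
zpool add tank cache l2arc-ssd0        # modest L2ARC; it doesn't need to be huge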

For VMware datastores, I'd rather consider using RAID-10 + a hot spare, because write latency is heavily penalised by RAIDZ. Also, do invest in a VM backup system (RAID is not a substitute for backup, remember that!). A fast SLOG is necessary, and it doesn't have to be huge. Your S3710s would have been an excellent choice for an SLOG if they hadn't been so big. It's just a pity to have 400GB of enterprise storage "wasted" on such an "unholy" purpose as SLOG. What I did myself was get 120GB DC S3610s for cheap and then overprovision them (using hdparm) to make them look like 16GB drives. This is more than enough for an SLOG.
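
(The hdparm trick sets a Host Protected Area so the drive only exposes the first N sectors; the device path is a placeholder, and a power cycle is needed before the new size takes effect:)

hdparm -N /dev/sdX                                          # show current max sectors / native max sectors
hdparm --yes-i-know-what-i-am-doing -N p33554432 /dev/sdX   # 33554432 x 512 B = 16 GiB visible; the 'p' prefix makes it persistent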

You'd probably be better off building a separate SSD-based volume off those S3710s and putting your most latency-demanding VMs there. You have only 4 drives, so you'd probably want to configure them as RAID-10 without a spare. Just make sure you back them up frequently. I'm still not 100% sure whether it's worthwhile to have an SLOG on top of an SSD-only array; you might want to experiment with that.

Last, having so much hardware in a single old system is risky. The board might not have been tested for the amount of current your controllers would draw for extended periods of time. If there is a system-wide failure (even a minor one, like a failed DIMM module), the entire system, with all the storage attached to it, will go down. For a VM datastore, this means all the VMs lose their underlying storage, possibly corrupting their in-OS filesystems in the process.

I would strongly recommend therefore to have at least two completely independent (running on two separate physical boxes) VMware Datastores and to mirror hard drives of the most mission-critical VMs across those stores, so they can survive a single Datastore failure.
 

Rand__

Well-Known Member
Mar 6, 2014
6,626
1,767
113
I'm still not 100% sure whether it's worthwhile to have an SLOG on top of an SSD-only array; you might want to experiment with that.
Looks like it depends on the speed of the SSD pool vs. the SLOG.
My (slow) 2-drive SSD pool profited from the 750 (NVMe) SLOG in all tests except sequential writes.

Will try to test S3700's tomorrow.
 

mpogr

Active Member
Jul 14, 2016
115
95
28
53
Looks like it depends on the speed of the SSD pool vs. the SLOG.
My (slow) 2-drive SSD pool profited from the 750 (NVMe) SLOG in all tests except sequential writes.

Will try to test S3700's tomorrow.
I think we won't know until actual measurements are performed. The idea of an SLOG is to make sure access to the main drives in the pool is not broken into small transactions when sync writes are requested by the client. So, potentially, an SLOG can have a positive effect even when it's not faster than the main pool drives. But this is just theory, and we don't really know how it's going to look in the real world...