NFSv3 vs NFSv4 vs iSCSI for ESXi datastores

zxv

The more I C, the less I see.
Sep 10, 2017
154
55
28
I thought that switching from NFS to iSCSI would improve datastore performance on ESXi.

I just started looking at migrating from NFS to iSCSI on a 40GbE network with jumbo frames.
I ran a very simple benchmark and, unexpectedly, NFSv4 was faster than NFSv3, which was in turn faster than iSCSI (see below).

So I'm reevaluating whether to pursue iSCSI.

The file server is an Ubuntu 18.04 DL380 Gen8 providing storage to ESXi 6.7 DL360 Gen9 hosts.
The network is IPv4 with jumbo frames, using 40GbE ConnectX-3 Pro cards connected to an Arista DCS-7050QX.
The datastore is a ZFS pool of 24 mirrored-pair vdevs (48 10K RPM 6Gb/s SAS drives) in two D2700 enclosures, each with a single connection to an HP H241 controller.
Send and receive buffers are tuned for 40GbE, but beyond that the tuning is pretty much default.
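To give an idea of what that buffer tuning looks like, it amounts to something like the following on the Ubuntu side (the exact values and interface name here are illustrative, not the production settings):

```shell
# Example 40GbE TCP buffer tuning, e.g. in /etc/sysctl.d/90-40gbe.conf (illustrative values)
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864

# Jumbo frames on the ConnectX-3 Pro port (interface name is a placeholder)
# ip link set dev enp4s0f0 mtu 9000
```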

To measure the performance, I ran an ATTO disk benchmark in a Windows 10 VM. The same VM and the same datastore were used for all three tests, with the datastore mounted via NFSv3, NFSv4, and iSCSI in turn. For iSCSI, the VM's disk was migrated to a zvol in the same ZFS pool.

What do you all think?

iSCSI:

NFSv3:

NFSv4.1:
 

Attachments

Rand__

Well-Known Member
Mar 6, 2014
5,658
1,249
113
That sounds unlikely ;)
What was the sync setting?
iSCSI on ESXi is usually faster, since it uses async writes while NFS writes sync.
 

zxv

The zfs dataset (for NFS) and zvol (for iSCSI) both had zfs sync=disabled.

The zfs dataset had compression=lz4 while the zvol had compression=off (per recommendations).
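For anyone reproducing this, those properties were set along these lines (pool and dataset names are placeholders):

```shell
# NFS-exported dataset: sync disabled, lz4 compression
zfs set sync=disabled tank/vmstore
zfs set compression=lz4 tank/vmstore

# iSCSI zvol: sync disabled, compression off
zfs set sync=disabled tank/vmvol
zfs set compression=off tank/vmvol
```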
 

Rand__

Ok, then it's quite possible, as NFSv4.1 is the newest of the three.
 

dswartz

Active Member
Jul 14, 2011
531
56
28
zvols suck, performance-wise. I'm betting that was your issue. Can you try a file-based LUN and retest?
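With LIO/targetcli, a file-based LUN would look roughly like this (the file path, size, and IQN are placeholders):

```shell
# fileio backstore instead of a zvol-backed block backstore
targetcli /backstores/fileio create name=vmlun file_or_dev=/tank/vmstore/vmlun.img size=100G
# export it as a LUN under an existing target portal group
targetcli /iscsi/iqn.2019-02.test:target1/tpg1/luns create /backstores/fileio/vmlun
```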
 

zxv

Below is iSCSI backed by a file on the same zfs datastore as above.

The relative change in performance from zvol-backed to file-backed was:
Read rate increased 10%
Write rate decreased 30%

 

dswartz

The write decrease is hard to believe. What record size did you specify for the dataset? And how did you create the file? Also, when you used a zvol, did you use the default 8KB volblocksize? Maybe try with 32KB? I assume sync=disabled in all cases?
 

zxv

I've tried zvols with a 32K volblocksize as well as the default.
The recordsize of the dataset that holds the backing files is 128K.
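For reference, a sparse zvol with a non-default volblocksize is created along these lines (name and size are placeholders):

```shell
# Sparse (-s) 100G zvol with 32K volblocksize (volblocksize must be set at creation time)
zfs create -s -V 100G -o volblocksize=32K tank/vmvol32k
```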

Creating a thick backing file on the dataset, which uses lz4 compression, results in an initial file size of 512 bytes. That's not ideal, but I ran the benchmark anyway:
Thick file backed iSCSI:

For the file-backed iSCSI test above, I created a sparse file and ran the benchmark three times.
The numbers were consistent between runs.
sync=disabled was used in all cases.

The zvol was sparse as well, and its benchmark was run three times, with all three runs consistent.

I've also tried increasing the TCP and IP buffer sizes, as well as the TX/RX ring sizes on the interfaces and Recv/XmitSegmentLength on the iSCSI target group.

So I tested a ramdisk as a sanity check, and ramdisk-backed iSCSI performs badly as well.
Ramdisk backed iSCSI:


This suggests that the LIO (Linux-IO) iSCSI target could be the cause of the performance issues.

LIO is in the Ubuntu repo, whereas SCST is not, so I'm not sure how stable SCST would be on Ubuntu 18.04.
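For completeness, a ramdisk-backed LIO target for this kind of sanity check can be set up roughly like this (size and IQN are placeholders):

```shell
# Kernel ramdisk backstore -- takes the disks out of the comparison entirely
targetcli /backstores/ramdisk create name=rd0 size=4G
targetcli /iscsi/iqn.2019-02.test:ramtest/tpg1/luns create /backstores/ramdisk/rd0
```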
 

dswartz

Ah, ok. I have never used LIO (I have used SCST, but that's a PITA to build; for ZFS at least, it's annoying...)
 

zxv

After further tuning, the results for the LIO iSCSI target were pretty much unchanged.

Switching to the STGT target (Linux SCSI target framework (tgt) project) improved both read and write performance slightly, but was still significantly less than NFSv3 and NFSv4.

vSphere best practices for iSCSI recommend making sure that the ESXi host and the iSCSI target have exactly the same maximum command queue depth (128) and maximum outstanding I/O requests limit (16).
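On the ESXi side, those two limits map to something like the following (the device identifier is a placeholder):

```shell
# Software iSCSI initiator max queue depth (takes effect after reboot)
esxcli system module parameters set -m iscsi_vmk -p iscsivmk_LunQDepth=128
# Per-device max outstanding I/O requests (DSNRO); device ID is a placeholder
esxcli storage core device set -d naa.xxxxxxxxxxxxxxxx -O 16
```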

LIO target backed by a ramdisk, with CMD and DSNRO tuned:


STGT target backed by a ramdisk, with CMD and DSNRO tuned:
 

Joqur

Member
Feb 22, 2019
44
21
8
Cheers for the effort of testing and reporting!

I'm struggling with similar issues with LIO.
I've got a physical server acting as the target: Ubuntu 18.04, kernel 4.15.0-45-generic, 2 x Xeon E5-2640 v2, 256GB DDR3, an X540-DA2 (one port dedicated to iSCSI) connected to a switch, 10 x 4TB HDDs in RAIDZ2, and an Optane 900p 280GB as SLOG.

The initiator is ESXi 6.7, which in turn runs on a host with 2 x 2695 v2, 128GB DDR3, 82599EN network controller (1 x 10G via DAC to a switch).

ESXi is configured with 4 VMK and port binding for multipathing.
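For reference, port binding for the four VMKs amounts to something like this (adapter and vmk names are placeholders):

```shell
# Bind each iSCSI vmkernel port to the software iSCSI adapter
esxcli iscsi networkportal add -A vmhba64 -n vmk1
esxcli iscsi networkportal add -A vmhba64 -n vmk2
esxcli iscsi networkportal add -A vmhba64 -n vmk3
esxcli iscsi networkportal add -A vmhba64 -n vmk4
# Verify the bindings
esxcli iscsi networkportal list -A vmhba64
```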

I'm testing from a Linux VM and getting at most 800 MB/s read and write, simplex. I seem to be getting around 180-200 MB/s per path.
 

zxv

I found it helpful to benchmark both iSCSI and NFS against a ramdisk. That made it easier to see the effects of network tuning.

With 40Gb links, I often see not much over 25Gbit/s via TCP, so benchmarking with two 10Gbit links serving from a ramdisk may be fairly close to pushing enough data to expose other bottlenecks in the stack.

I'm going to try two 40gbit/sec links soon, and I don't know whether it'll be any faster. I'll find out.
 

Joqur

That's exactly what I've done: I created a ramdisk via targetcli, and I'm seeing virtually the same numbers as before. I'm unsure whether the bottleneck is LIO in that case, but I don't know what else it would be, as I'm not familiar with tracing LIO issues yet.

When testing with Windows as the initiator (without MPIO, which doesn't seem to work in Win 10 Pro or Enterprise), I'm seeing around 1.2 GB/s for sequential reads and 1.1 GB/s for sequential writes, and around 70-75k IOPS at 4K, QD32.
 

zxv

So Windows can get close to the full bandwidth, 1.2 GB/s, with a single connection on a 10G link?

I also see better results between a Linux server and a Linux client with iperf3 than between a Linux server and an ESXi client. It looks like the TCP stack in ESXi is less performant above 20Gbit/s. In that regard, multipathing may mitigate some of the limitations by increasing the network buffers available.

There are some suggested tunings here:
How to configure VMware vSphere 6.x on Data ONTAP 8.x
.. such as
esxcfg-advcfg -s 32 /Net/TcpipHeapSize
esxcfg-advcfg -s 512 /Net/TcpipHeapMax
plus others for NFS.
 

Joqur

Yes.
I've tried using dd on Linux, and reading or writing from/to the SAN inside a VM, via ESXi, gives me the following results:
RAMDisk:
(write)
dd if=/dev/zero of=/home/test/test1 bs=512k count=20000
800-900 MB/s

(read)
dd if=test1 of=/dev/null bs=512k
1.1 - 1.2 GB/s

Disk array:
Same commands as above.
(write)
520 - 530 MB/s

(read)
1.1 - 1.2 GB/s

iperf3:
SAN Server, VM client
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.00 sec 10.6 GBytes 9.08 Gbits/sec 9 sender
[ 4] 0.00-10.00 sec 10.6 GBytes 9.07 Gbits/sec receiver

VM Server, SAN client
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.00 sec 10.9 GBytes 9.33 Gbits/sec 81 sender
[ 4] 0.00-10.00 sec 10.9 GBytes 9.32 Gbits/sec receiver
 

acquacow

Well-Known Member
Feb 15, 2017
624
330
63
40
dd is horrible as a disk benchmark because it's single threaded. It'll tell you the speed a single thread can read/write to your backing store, but not much more.

I'd recommend using fio for low-level disk benchmarking.
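A multi-job fio run along these lines avoids the single-thread limitation (path and parameters are illustrative):

```shell
# 4 parallel jobs of 4K random reads at QD32, direct I/O, aggregate reporting
fio --name=randread --filename=/mnt/test/fio.dat --size=4g \
    --rw=randread --bs=4k --ioengine=libaio --iodepth=32 \
    --numjobs=4 --runtime=30 --time_based --direct=1 --group_reporting
```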
 

Joqur

Not sure why, but I cannot seem to run fio on that particular VM. It loads up, reads in the options, then sits there at "fio-3.1", maxing out the CPU.

I ran it on another VM:
Code:
Jobs: 1 (f=1): [R(1)][100.0%][r=1003MiB/s,w=0KiB/s][r=4012,w=0 IOPS][eta 00m:00s]
file1: (groupid=0, jobs=1): err= 0: pid=5029: Mon Feb 25 21:02:15 2019
   read: IOPS=3987, BW=997MiB/s (1045MB/s)(9974MiB/10005msec)
    slat (usec): min=10, max=685, avg=31.83, stdev=11.67
    clat (usec): min=673, max=12884, avg=3973.50, stdev=1571.52
     lat (usec): min=719, max=12915, avg=4006.50, stdev=1571.21
    clat percentiles (usec):
     |  1.00th=[  963],  5.00th=[ 1270], 10.00th=[ 1598], 20.00th=[ 2442],
     | 30.00th=[ 3195], 40.00th=[ 3621], 50.00th=[ 4047], 60.00th=[ 4490],
     | 70.00th=[ 5014], 80.00th=[ 5538], 90.00th=[ 5997], 95.00th=[ 6259],
     | 99.00th=[ 6587], 99.50th=[ 6718], 99.90th=[ 8225], 99.95th=[10945],
     | 99.99th=[12256]
   bw (  KiB/s): min=963072, max=1036288, per=100.00%, avg=1021056.55, stdev=17189.65, samples=20
   iops        : min= 3762, max= 4048, avg=3988.50, stdev=67.15, samples=20
  lat (usec)   : 750=0.05%, 1000=1.10%
  lat (msec)   : 2=13.97%, 4=33.67%, 10=51.12%, 20=0.08%
  cpu          : usr=4.53%, sys=17.64%, ctx=24536, majf=0, minf=1034
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=39896,0,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: bw=997MiB/s (1045MB/s), 997MiB/s-997MiB/s (1045MB/s-1045MB/s), io=9974MiB (10.5GB), run=10005-10005msec

Disk stats (read/write):
  sda: ios=38773/16, merge=185/9, ticks=152924/216, in_queue=153132, util=99.10%
Large reads.

Small reads end up at around 50 MB/s, while local access is around 550-600 MB/s for 4K reads. Over iSCSI, the four processes (one per path) sit at around 25-30% utilisation, which leads me to suspect ESXi's iSCSI settings.
 

Joqur

Instead of iSCSI you may try iSER? Btw anyone know when ESXi will add NVMeOF initiator?
As far as I know, the X540 and X520 don't support RDMA/RoCE, so something like iSER is out the window. I'm looking into upgrading to ConnectX-4 or ConnectX-5 cards instead, which would allow me to do so.
 

dswartz

I tried iSER a couple of weeks ago on the latest 6.7, using the latest Mellanox drivers for a 50Gb card. Guests stopped responding. Looked at the ESXi console: purple screen of death :( I haven't had the time or motivation to take another look at it...