vmdk on lsi scsi controller limiting SLOG for sync write performance?

jp83 · Jan 8, 2018

Quick summary of my setup, I have an ESXi (currently 5.5) all-in-one on a Dell R510 w/ H200 and 8x WD REDs using pcie passthrough and 32GB RAM to FreeNAS. I understand vmware causes sync writes over NFS and it's really slow.

I have a Fusion IO Duo (which is really 2 x 320GB on a single PCIe card) that I'm using as a local datastore for my "critical" VMs to bootstrap the looped back NFS mount where initially I intended most everything else to be. I had experimented with a 4GB vmdk as a SLOG to improve writes and I go from ~9MB/s to 80MB/s. I've been on a quest to figure out why it's still so slow when I can get something like 300MB/s disabling sync. (I'm not at home where I took notes on the numbers, so may have to update but it's in the ballpark).

I was really hesitant to even try to do this at first after reading and understanding about direct disk access, however, I'm seeing more or this used for testing and thinking it's actually sort-of 'ok' for this purpose, if I can get some benefit. I enabled the vm setting so that it passes through the unique id/serial for tracking. I had somewhat given up on my quest until I got back into doing better analytics and more testing. In particular I was also messing with a purely virtualized omv and trying out the I/O Analyzer vm to benchmark my Fusion IO directly compared to a new Samsung PM953 NVMe. I got a second PCIe based storage so that I could configure another backup R510 similarly. I don't really have room for 2 cards in one machine, nor would I be able to RAID them, and I know the Fusion IO doesn't have drivers in FreeNAS if I passed it, or one side through.

Anyways, let me skip to my suspicions. I had noticed that the vmdk shows up in the boot messages as a 150MB/s interface, and ironically I seem to be limited to about what I'd get with a single HDD. Then I realized I'm using the LSI (Parallel) SCSI controller. The paravirtual interface (which I noticed when comparing to the omv and I/O analyzer vms) doesn't seem to be supported for FreeBSD, but I tried switching to the LSI SAS controller, and it shows up as 300MB/s interface during boot, but still not really any faster. I've also tried adding an additional 4GB vmdk as slog to slice between the 2, and I can get a whopping 90MB/s (10MB/s more).

I thought if I switched to omv I could use the paravirtual controller and get more performance, but I see other tests (like gea's napp-it optane performance tests) showing this works. Is there something else here that I'm missing?

I'm also considering something like an HGST SAS SSD in one of the internal bays directly on the passed through backplane. I'm just struggling with the need because I already have fast storage and/or could probably just live with sync disabled for my purposes.

Based on my latest I/O analyzer I know I can get great performance out of either PCIe storage directly. I even tried doubling up to stress it further, i.e. see how the two sides of the Fusion IO Duo would work if I striped across them and in general how much contention there would be if other VMs run on it directly while it's also being used for all writes as a SLOG.

I'm really trying to get this resolved so I can move on, but it's really fundamental to (re)architecting/optimizing my setup. Now I'm thinking about just using the PCIe based storage directly (I think I'd switch my primary to the NVMe because it's faster, bigger continuous space, and lower power) for VMs and the HD array for large bulk data. Beyond that I'm struggling with how/where to store the "persistant data" in my VMs and containers, like Plex metadata, influxdb for metrics/home automation, centralized logging, etc. because backing up the VMs (ghettovcb) are now full copies not just incremental so as it gets bigger will become more painful. I was kind of hoping to use ZFS snapshots and send/receive to optimally backup on and off site. Tell me for home use, this is fine and the single (high quality) SSDs are generally reliable enough themselves not to over complicate because I'm worried about redundancy.

nk215 · Jan 8, 2018

When you quote your speed, is it sequential or random 4K?

I have been doing some tests on the side with my 4x S3500 600GB mirror and various SLOG device. The one that's closest to your Fusion-io is my Virident and here' my speed

FreeNAS NFS served back to ESXi for datastore. 32G memory FreeNAS
With 15gb SLOG from Virident virtual disks

-----------------------------------------------------------------------
CrystalDiskMark 5.2.1 x64 (C) 2007-2017 hiyohiyo
Crystal Dew World : http://crystalmark.info/
-----------------------------------------------------------------------
* MB/s = 1,000,000 bytes/s [SATA/600 = 600,000,000 bytes/s]
* KB = 1000 bytes, KiB = 1024 bytes

Sequential Read (Q= 32,T= 1) : 1156.661 MB/s
Sequential Write (Q= 32,T= 1) : 352.210 MB/s
Random Read 4KiB (Q= 32,T= 1) : 263.343 MB/s [ 64292.7 IOPS]
Random Write 4KiB (Q= 32,T= 1) : 33.587 MB/s [ 8200.0 IOPS]
Sequential Read (T= 1) : 801.807 MB/s
Sequential Write (T= 1) : 256.919 MB/s
Random Read 4KiB (Q= 1,T= 1) : 19.909 MB/s [ 4860.6 IOPS]
Random Write 4KiB (Q= 1,T= 1) : 7.983 MB/s [ 1949.0 IOPS]

Test : 1024 MiB [F: 0.3% (0.1/32.0 GiB)] (x5) [Interval=5 sec]
Date : 2018/01/02 14:54:36
OS : Windows 7 Ultimate SP1 [6.1 Build 7601] (x64)

With sync=disable and no SLOG (which is not needed for sync=disable)
CrystalDiskMark 5.2.1 x64 (C) 2007-2017 hiyohiyo
Crystal Dew World : http://crystalmark.info/
-----------------------------------------------------------------------
* MB/s = 1,000,000 bytes/s [SATA/600 = 600,000,000 bytes/s]
* KB = 1000 bytes, KiB = 1024 bytes

Sequential Read (Q= 32,T= 1) : 1156.186 MB/s
Sequential Write (Q= 32,T= 1) : 552.242 MB/s
Random Read 4KiB (Q= 32,T= 1) : 258.470 MB/s [ 63103.0 IOPS]
Random Write 4KiB (Q= 32,T= 1) : 174.175 MB/s [ 42523.2 IOPS]
Sequential Read (T= 1) : 792.674 MB/s
Sequential Write (T= 1) : 435.592 MB/s
Random Read 4KiB (Q= 1,T= 1) : 20.034 MB/s [ 4891.1 IOPS]
Random Write 4KiB (Q= 1,T= 1) : 14.493 MB/s [ 3538.3 IOPS]

Test : 1024 MiB [F: 0.3% (0.1/32.0 GiB)] (x5) [Interval=5 sec]
Date : 2017/12/23 19:36:20
OS : Windows 7 Ultimate SP1 [6.1 Build 7601] (x64)

This is still very slow compared to a VM stored on local SSD datastore

jp83 · Jan 8, 2018

nk215 said:
When you quote your speed, is it sequential or random 4K?

What I'm roughly quoting is supposed to be sequential write. Originally when I was first setting up a lot time ago I was just doing a dd in a linux vm, but now I've moved to the I/O analyzer vm, and I run the max write throughput test, so I think this should be sequential to maximize the benchmark. I'll update when I get back to my notes at home.

jp83 · Jan 8, 2018

Ok, so I was pretty close on the numbers I mentioned above. I don't have a Win7 license for a VM to try to get apples to apples comparison.

With sync=disabled the 4 x 2 RED drives (4 mirrored pairs) can do 357 MB/s at 715 IOPs (this is what it reports while doing the max throughput test, not trying to maximize for IOPs).

With sync=standard I only get 8 MB/s at 84 IOPs.

Adding a 4GB vmdk for SLOG yields 80 MB/s at 160 IOPs.

The vmware I/O analysis on the Fusion IO drive by itself (not through NFS mount) reports 520 MB/s at 1040 IOPs. Running the max write IOPs reports 65,481. For the record, the PM953 shows a max of 680 MB/s write and 67527 max IOPs (2 diff tests).

Given that I can write to Fusion IO at 520 MB/s and directly to disks (w/o sync) at 357 MB/s, isn't 80 MB/s with them working together pretty poor? I understand why with just disks, if it's writing it's ZIL on disk first then writing the real data again there's a lot of overhead and it drops down to the 8 MB/s. I thought freeing it up with the SLOG (i.e. separate log) would relieve some of that constraint. In other words sequential throughput could basically stream to disk closer to the 357 MB/s sync disabled speed, while the synced writes are offloaded and done in parallel to the faster Fusion IO. Would there be some congestion on the PCIe bus? Is the vmdk (thick provisioned, lazily zeroed I think) somehow not sequential causing more random writes?

This is how the FreeNAS VM sees one of the vmdk disks on the Fusion IO drive with the LSI SAS controller.

da1 at mpt0 bus 0 scbus2 target 1 lun 0
da1: <VMware Virtual disk 1.0> Fixed Direct Access SCSI-2 device
da1: Serial Number 6000c292fb7a965535119762eb0005e1
da1: 320.000MB/s transfers (160.000MHz, offset 127, 16bit)
da1: Command Queueing enabled
da1: 4096MB (8388608 512 byte sectors)
da1: quirks=0x140<RETRY_BUSY,STRICT_UNMAP>

Before with the LSI (Parallel) Controller is was saying only 150MB/s and I thought that was a limiting factor.

Is this performing as expected? Do I need to reset my expectations? I realize I've got the slower NAS drives backing the storage, but my hope was to offset that with a hybrid approach. Even recently reading in @gea's performance reports noting that for single user operation you can achieve near SSDs speeds with bulk storage, revived my hope.

whitey · Jan 8, 2018

For reference my main VM pool that runs 40-50 VM's is a 4 disk sas3 ssd setup of husmm1680's w/ an additional husmm1620 ssd as SLOG and I push roughly 1GB read/10K iops and 200MB write/1K iops w/ 4k random tests using fio if that helps at all. sVMotions push steady 500-600MB read, 225-275MB write in/out of that pool of NFS stg.

Rand__ · Jan 9, 2018

Can you pass through the Fusion IO to get native slog speed?
That would tell you if it is suitable to be used as slog in the first place.

SLOG performance (and as such nfs write speeds) is more about device latency than raw iops or sequential speed.

jp83 · Jan 9, 2018

I can't pass through the Fusion IO directly to FreeNAS because the drivers aren't compiled in (although I read the paid version from iXsystems includes them). Maybe I can try with my new NMVe, but that wouldn't really work because I still need something as a local datastore.

So, I'm suspicious of some other virtualized limitation I thought I'd try a simple nfs export with just a vmdk on the Fusion IO. I have another test vm of FreeNAS and just created a simple single disk stripe, no compression, no SLOG. I mounted this back into esxi as I did with the primary and used this as the datastore for vm i/o analyzer.

Initial results look familiarly slow. 97 MB/s at 195 IOPs. Hm, interestingly near the 1GbE boundary, and then I realized I don't have this vm on my separate storage network, it was going out through a physical 1GbE NIC. Surprised that even worked, but it must have used the VMkernel management port, which has the 1 GbE NICs rather than storage VMkernel port. Otherwise, I have CX4 10GbE for my LAN network, and the storage network doesn't have any NICs assigned to the separate vSwitch. The storage network itself, vSwitch and interfaces, are also configured for mtu 9000.

Anyways, I fixed that problem and put it on the internal storage network. This time I see 192 MB/s writes at 385 IOPs. Looking at the reports in FreeNAS I see 224 MB/s max write to the virtual disk and my network interface rx topped out at 1.7G. Hm, this reminds me I've actually been fighting another long term issue with maxing out the network speeds as well. See this post for reference: SOLVED - 10GbE FreeNAS servers are constrained when receiving network data

So that's better, but still not what I think the Fusion IO should be able to support and not even what the HDD backed array does with sync disabled, so I tried disabling sync here as well. Now I see 383 MB/s at 766 IOPs. This and the above actually start to make sense, as the sync writes incur a double write penalty, so disabling sync I get 2x performance. Interestingly FreeNAS reports max disk write of 457 MB/s and network rx at 3.6G. At first I was thinking I was still being limited by the network (but iperf between these interfaces is showing 7-9 Gbits/sec, unfortunately still < 2G on the LAN interface with CX4 10GbE and mtu 1500). Maybe I am approaching what the Fusion IO can handle, I saw 520 MB/s directly , but I'm ok with these sorts of numbers if only they'll work out together.

This brings me back to my main question, if I can do async writes to my HDD backed array at 357 MB/s, and write independently (also async) to my Fusion IO at 383 MB/s, why when I combined them with the Fusion IO as a SLOG do I only see 80 MB/s? Am I still misunderstanding something fundamental about zfs performance?

Trying to answer my own question, it looks a lot like the 1GbE speed limitation, but I confirmed my main freenas is on the storage network, and it checks out because the hdd with sync disabled is able to get the 357 MB/s. What am I missing? @whitey's setup above shows there shouldn't be that big of a virtualization penalty.

jp83 · Jan 10, 2018

EDITED.

One more test, I temporarily put in an Intel 320 SSD on the H200 HBA and used it as a SLOG. I thought I got somewhere, but forgot I still had sync disabled. With sync=standard I still only get 6 MB/s write at 13 IOPs. Not sure what to try next.

azev · Jan 10, 2018

I think the next test would have to use some kind of enterprise nvme drives.

Rand__ · Jan 10, 2018

Is that 320 direct attached? 6MB/s is pretty low, wouldnt think that thats accurate. O/c its quite old, and I never tried as slog but I'd expect more.

Have you tried spinning up another FN instance to seee whether its your config?

nk215 · Jan 12, 2018

I tried all kind of setup and the best I got with SLOG was about half of sync=disable number in write especially random 4K write.

I'll come back to FreeNAS as NFS mounted datastore when I have my hand on an optane drive.

Right now, I put my VM on local datastore and keep my data on a NAS. My fast data is on a rockstore all SSD array (mostly because I can take out any SSD when I want to use that SSD somewhere else or expand, change raid types whever I want, offline dedupe etc).

dswartz · Jan 15, 2018

I'm convinced there is some kind of throttling or whatever going on with the ZIL. I have even set up an 8GB ramdisk to use as an SLOG, and still see abysmally low sync writes compared to async.

Search

vmdk on lsi scsi controller limiting SLOG for sync write performance?

jp83

New Member

nk215

Active Member

jp83

New Member

jp83

New Member

whitey

Moderator

Rand__

Well-Known Member

jp83

New Member

jp83

New Member

azev

Well-Known Member

Rand__

Well-Known Member

nk215

Active Member

dswartz

Active Member