Hi all experts! I would like to share my setup and ask for guidance/advice based on what I have. I've been running this configuration for many years, but I'm finally looking to address the slow I/O performance in my VMs. I've spent 10-15 hours running various permutations & tests in a spreadsheet trying to make heads or tails of the settings, but I'm at a bit of an "end of the rope" on my patience. I'm hoping someone in the community has experience here, might eyeball the situation, and offer me a "duh! you should do ...".
High Level
- metal: 2x SAS platter HDDs, SAS controller, IBM x3550 M3
- filesystem: zfs-on-linux + zvols
- zfs pool type: mirror
- host: Fedora (up until recently), now Ubuntu 18.04 server
- virtualization: kvm/qemu VMs (w/ disk on virtio & raw storage into zvol)
- Disk I/O access from VMs has always been slow (~10MB/s or less)
- The VMs function well enough (none of them have super critical I/O requirements)
- However in _SOME_ scenarios I found it annoying (e.g. Nextcloud, scp/rsync transfers, etc.)
- Recently I replaced my 2x aging SAS HDDs with new SAS HDDs -- and while I'm at it (fresh pool, etc.) I decided I want to solve this slow-access problem
- In the past I chalked up the slowness to old/aging drives ... but analysis now disproves this.
We'll work bottom up
- Bare Metal
- Relevant Server Specs: IBM x3550 M3, Intel(R) Xeon(R) CPU (E5620 @ 2.40GHz), 76GB RAM
- Storage Controller: LSI SAS1068E
- HDDs: SAS 2x HGST HUC109090CSS60 900GB, 512-byte sector size
- Host
- Linux Ubuntu 18.04
- Kernel: 4.15.0-70-generic
- ZFS: 0.7.5
- ZFS Pool
- "zstorage" is its name, mirror'd
- ashift=9 (aligned with the 512-byte physical sector size of these disks)
- NOTE: on my older pool I used ashift=12 even though those drives were also 512-byte sector; for this testing, when I created the NEW pool, I went with ashift=9 in an attempt to address the slow I/O (per my reading, 1:1 alignment gives the best performance, while 4K on top of 512-b is merely OK but more future proof). See the verification commands sketched right after this list.
- ZFS base dataset zstorage/vms
- Relevant Config (the things per reading that matter):
- recordsize=128K
- compression=off (normally would set to lz4)
- atime=off
- xattr=sa
- logbias=latency
- dedup=on (whoops, need to fix this; though shouldn't affect things as we get into ZVOL stuff)
- sync=standard
- acltype=posixacl
- ZVOL block storage for VMs (the base dataset config above is not very relevant AFAIK, since the VM disks live on ZVOLs)
- zstorage/vms/<VM_NAME>/disk0 (XXG varies)
- Relevant Config (the things per reading that matter):
- volblocksize=8K
- checksum=on
- compression=off
- logbias=latency
- dedup=on (double whoops)
- sync=standard
- copies=1
- ZVOL block storage for TESTING
- zstorage/test/disk0 (10G)
- Relevant Config (the things per reading that matter):
- volblocksize=8K (varied this for testing)
- checksum=on
- compression=off (also tried lz4)
- copies=1
- logbias=latency|throughput (tried both latency overall better)
- dedup=off
- Partition Table on RAW block device (on top of ZVOL)
- VMs: (from the original provisioning...)
- Units: sectors of 1 * 512 = 512 bytes
- Sector size (logical/physical): 512 bytes / 8192 bytes
- I/O size (minimum/optimal): 8192 bytes / 8192 bytes
- Start @ 2048
- TEST: (for these tests to try to find the BEST settings)
- Units: sectors of 1 * 512 = 512 bytes
- Sector size (logical/physical): 512 bytes / 8192 bytes
- I/O size (minimum/optimal): 8192 bytes / 8192 bytes
- Start @ 2048 (not aligned, but I also tried starting at 8192; it made no difference)
- Filesystem:
- All VMs and testing are done with basic ext4 for compatibility with guest OSes.
- VMs:
- Type: KVM, qemu, Linux guests with varying CPU/RAM/storage assignments
- Note: NONE are "heavy DB" (so we can for now ignore any special volume tuning for database pages)
- Storage:
- type=block
- driver=qemu, type=raw, cache=none|directsync|?, io=native
- bus=virtio
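For reference, double-checking the values above looks roughly like this (a sketch; the device names and paths are from my setup and purely illustrative):
Code:
# pool ashift (zdb -C dumps the cached pool config; grep out the ashift lines)
zdb -C zstorage | grep ashift

# dataset / zvol properties summarized above
zfs get recordsize,compression,atime,xattr,logbias,dedup,sync zstorage/vms
zfs get volblocksize,compression,logbias,dedup,sync zstorage/test/disk0

# logical/physical sector sizes as seen by the host, and by the zvol's block device
lsblk -o NAME,LOG-SEC,PHY-SEC,SIZE /dev/sda /dev/sdb
fdisk -l /dev/zvol/zstorage/test/disk0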
Performance Methodology
I am not as scientific as I would like to be, but I'm doing my best. The VMs are unfortunately still active, so they MAY impact results here and there, yielding slight inconsistencies ... but I've turned off as much as possible to reduce I/O load outside the tests. (Obviously, in a perfect world I would shut down all VMs for this.)
I've been using "fio" primarily for my testing, like several others have shown in forum posts here and elsewhere, this is the TYPICAL invocation:
Code:
# RANDOM-WRITE
fio --name=random-write --ioengine=sync --iodepth=4 --rw=randwrite --bs=512b|4K|8K|128K --direct=0 --size=128m --numjobs=16 --end_fsync=1
# SEQ WRITE
fio --name=seq-write --ioengine=sync --iodepth=4 --rw=write --bs=512b|4K|8K|128K --direct=0 --size=128m --numjobs=16 --end_fsync=1
- My understanding/comments of the settings:
- ioengine=sync|libaio (AFAIK, "sync" best represents what VMs will do)
- rw=randwrite|write (random writes the WORST case, or seq writes the BEST case; real world is somewhere in between for basic VMs)
- bs=... (vary the write block size to test theories about perf & underlying sizes; a tiny sweep loop is sketched after this list)
- size=128M (for testing/time reasons, seemed reasonable, others used 256MB)
- numjobs=16 (parallelize I/O write tasks, this is QUITE heavy, maybe VMs only have 1x IO thread? not sure)
- end_fsync=1 (WAIT for storage to confirm write before saying "done" -- this is the crux of our testing IMO)
- I don't fully understand these ones; various other users set them:
- iodepth=4
- direct=0 (I tried setting =1 in one or two cases, but it didn't work)
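For completeness, the block-size sweep itself was just a trivial loop along these lines (a sketch, not my exact script; the target directory is wherever the test filesystem is mounted):
Code:
#!/bin/bash
# Sweep block sizes for both random and sequential writes; one log file per run.
cd /mnt/temp_disk0/fio || exit 1
for bs in 512 4k 8k 128k; do
    for rw in randwrite write; do
        fio --name="${rw}-${bs}" --ioengine=sync --iodepth=4 --rw="$rw" \
            --bs="$bs" --direct=0 --size=128m --numjobs=16 --end_fsync=1 \
            > "fio-${rw}-${bs}.log"
    done
done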
Performance Measurements
Throughout you will see slightly incomplete results as I did not run the FULL suite of tests across the full matrix ...
Baseline: started with the base zfs dataset
As one can expect, the 128K results _usually_ won, because our zfs dataset has a 128K record size. There we have some baseline numbers, but it is worth noting that the zfs dataset is _NOT_ optimized for I/O performance (per configuration above), as I do not have any important/heavy load here -- only on the ZVOLs, moving on.
Testing: test ZVOL "test/disk0" w/ 10G volume, single partition, ext4
Moving on to some TEST results: I conducted MANY experiments to try to find the best settings. Much of this was based on reading + experimentation.
Given the TEST zvol, I was able to destroy and re-create at will, so playing with volblocksize was possible.
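(For each volblocksize experiment, the destroy/re-create cycle was roughly the following -- a sketch using the same names/mount point as the outputs below; the property values are whatever was under test.)
Code:
# tear down the previous test volume
umount /mnt/temp_disk0 2>/dev/null
zfs destroy zstorage/test/disk0

# re-create with the volblocksize under test, then partition + ext4 + mount
zfs create -V 10G -o volblocksize=8k -o compression=off -o dedup=off \
    -o logbias=latency zstorage/test/disk0
parted -s /dev/zvol/zstorage/test/disk0 mklabel gpt mkpart primary ext4 2048s 100%
mkfs.ext4 /dev/zvol/zstorage/test/disk0-part1
mount /dev/zvol/zstorage/test/disk0-part1 /mnt/temp_disk0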
My takeaways here:
- In the first tests I accidentally left dedup=on, but later turned that off (heck, compare A to E and it actually got worse)
- I concluded libaio was generally slower than sync (not sure why)
- I _THOUGHT_ creating a zvol with volblocksize=512-b would give the BEST results, due to ashift=9 and the 512-byte HDD sector size ... but that was not the result.
- I conducted a few experiments changing logbias to "throughput" (to get higher throughput), but this led to LOWER results (presumably due to the sync/fsync behaviour in the fio runs)
- One theory I had was that since the partition didn't necessarily start on a ZVOL block boundary, it might make a difference, but it did not (compare E to A)
Here's the output of TWO cases for scenario A
RAND-WRITE:
Code:
/mnt/temp_disk0/fio$ fio --name=random-write --ioengine=sync --iodepth=4 --rw=randwrite --bs=8k --direct=0 --size=128m --numjobs=16 --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=sync, iodepth=4
...
fio-3.1
Starting 16 processes
Jobs: 16 (f=16): [F(16)][-.-%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=21225: Mon Dec 9 20:06:15 2019
write: IOPS=582, BW=4659KiB/s (4770kB/s)(128MiB/28135msec)
clat (usec): min=7, max=478, avg=13.21, stdev= 7.44
lat (usec): min=8, max=479, avg=13.49, stdev= 7.45
clat percentiles (usec):
| 1.00th=[ 10], 5.00th=[ 11], 10.00th=[ 11], 20.00th=[ 12],
| 30.00th=[ 12], 40.00th=[ 13], 50.00th=[ 13], 60.00th=[ 13],
| 70.00th=[ 14], 80.00th=[ 15], 90.00th=[ 16], 95.00th=[ 17],
| 99.00th=[ 21], 99.50th=[ 24], 99.90th=[ 59], 99.95th=[ 115],
| 99.99th=[ 363]
lat (usec) : 10=1.35%, 20=97.50%, 50=1.01%, 100=0.07%, 250=0.01%
lat (usec) : 500=0.04%
cpu : usr=0.14%, sys=1.03%, ctx=333, majf=0, minf=8
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwt: total=0,16384,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=4
(SNIP the other 15x jobs)
Run status group 0 (all jobs):
WRITE: bw=72.8MiB/s (76.3MB/s), 4658KiB/s-4659KiB/s (4770kB/s-4770kB/s), io=2048MiB (2147MB), run=28135-28137msec
Disk stats (read/write):
zd256: ios=0/1028, merge=0/0, ticks=0/20968, in_queue=20968, util=2.59%
SEQ-WRITE:
Code:
$ fio --name=seq-write --ioengine=sync --iodepth=4 --rw=write --bs=8k --direct=0 --size=128m --numjobs=16 --end_fsync=1
seq-write: (g=0): rw=write, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=sync, iodepth=4
...
fio-3.1
Starting 16 processes
Jobs: 16 (f=16): [F(16)][-.-%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s]
seq-write: (groupid=0, jobs=1): err= 0: pid=22319: Mon Dec 9 20:08:12 2019
write: IOPS=619, BW=4952KiB/s (5071kB/s)(128MiB/26466msec)
clat (usec): min=7, max=2988, avg=14.18, stdev=28.99
lat (usec): min=7, max=2989, avg=14.48, stdev=28.99
clat percentiles (usec):
| 1.00th=[ 9], 5.00th=[ 11], 10.00th=[ 11], 20.00th=[ 12],
| 30.00th=[ 13], 40.00th=[ 13], 50.00th=[ 13], 60.00th=[ 14],
| 70.00th=[ 15], 80.00th=[ 15], 90.00th=[ 17], 95.00th=[ 18],
| 99.00th=[ 23], 99.50th=[ 36], 99.90th=[ 210], 99.95th=[ 318],
| 99.99th=[ 1647]
lat (usec) : 10=3.41%, 20=94.61%, 50=1.67%, 100=0.09%, 250=0.15%
lat (usec) : 500=0.06%, 750=0.01%
lat (msec) : 2=0.01%, 4=0.01%
cpu : usr=0.13%, sys=1.04%, ctx=565, majf=0, minf=14
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwt: total=0,16384,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=4
(SNIP the other 15x jobs)
Run status group 0 (all jobs):
WRITE: bw=77.4MiB/s (81.1MB/s), 4952KiB/s-4953KiB/s (5071kB/s-5072kB/s), io=2048MiB (2147MB), run=26465-26466msec
Disk stats (read/write):
zd256: ios=0/1028, merge=0/0, ticks=0/19124, in_queue=19124, util=2.63%
In both cases, we see the drive I/O on "zpool iostat" hit decent speeds, ~150-250MB/s (not sure why fio reports it so much lower). So there, some baselines.
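(The host-side speeds mentioned above came from watching the pool during the runs with something like the below, 1-second interval.)
Code:
zpool iostat -v zstorage 1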
The cruddy VM results
Finally we're here to see the problem:
Within the VM, we get _similar_ results on seq-write (not quite as fast, but perhaps given the virtualization overheads I could accept it):
Code:
/tmp$ fio --name=seq-write --ioengine=sync --iodepth=4 --rw=write --bs=4k --direct=0 --size=128m --numjobs=16 --end_fsync=1
seq-write: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=4
...
fio-3.1
Starting 16 processes
seq-write: Laying out IO file (1 file / 128MiB)
seq-write: Laying out IO file (1 file / 128MiB)
seq-write: Laying out IO file (1 file / 128MiB)
seq-write: Laying out IO file (1 file / 128MiB)
seq-write: Laying out IO file (1 file / 128MiB)
seq-write: Laying out IO file (1 file / 128MiB)
seq-write: Laying out IO file (1 file / 128MiB)
seq-write: Laying out IO file (1 file / 128MiB)
seq-write: Laying out IO file (1 file / 128MiB)
seq-write: Laying out IO file (1 file / 128MiB)
seq-write: Laying out IO file (1 file / 128MiB)
seq-write: Laying out IO file (1 file / 128MiB)
seq-write: Laying out IO file (1 file / 128MiB)
seq-write: Laying out IO file (1 file / 128MiB)
seq-write: Laying out IO file (1 file / 128MiB)
seq-write: Laying out IO file (1 file / 128MiB)
Jobs: 12 (f=12): [_(1),F(7),_(1),F(2),_(1),F(2),_(1),F(1)][100.0%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s]
seq-write: (groupid=0, jobs=1): err= 0: pid=4668: Mon Dec 9 20:11:22 2019
write: IOPS=1043, BW=4173KiB/s (4273kB/s)(128MiB/31410msec)
clat (usec): min=6, max=2025.8k, avg=782.68, stdev=26653.78
lat (usec): min=7, max=2025.8k, avg=784.51, stdev=26653.78
clat percentiles (usec):
| 1.00th=[ 7], 5.00th=[ 8], 10.00th=[ 9],
| 20.00th=[ 9], 30.00th=[ 9], 40.00th=[ 12],
| 50.00th=[ 14], 60.00th=[ 15], 70.00th=[ 16],
| 80.00th=[ 17], 90.00th=[ 24], 95.00th=[ 31],
| 99.00th=[ 78], 99.50th=[ 10421], 99.90th=[ 36439],
| 99.95th=[ 708838], 99.99th=[1317012]
bw ( KiB/s): min= 2, max=29466, per=5.98%, avg=3071.62, stdev=5269.70, samples=32
iops : min= 0, max= 7366, avg=767.56, stdev=1317.45, samples=32
lat (usec) : 10=35.12%, 20=50.44%, 50=12.78%, 100=0.74%, 250=0.10%
lat (usec) : 500=0.05%, 750=0.03%, 1000=0.01%
lat (msec) : 2=0.02%, 4=0.03%, 10=0.11%, 20=0.42%, 50=0.06%
lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.02%, 1000=0.02%
lat (msec) : 2000=0.02%, >=2000=0.01%
cpu : usr=0.76%, sys=1.99%, ctx=540, majf=1, minf=23
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwt: total=0,32768,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=4
(SNIP)
Run status group 0 (all jobs):
WRITE: bw=50.2MiB/s (52.6MB/s), 3210KiB/s-5407KiB/s (3287kB/s-5537kB/s), io=2048MiB (2147MB), run=24241-40831msec
Disk stats (read/write):
vda: ios=1663/5244, merge=187/70829, ticks=1312/3893732, in_queue=2317472, util=63.42%
But now see random writes -- WOW, so bad. At first it goes OK, about 20-40MB/s, but within a few seconds everything slows to a crawl, and we end up with nearly 0KB/s:
Code:
$ fio --name=rand-write --ioengine=sync --iodepth=4 --rw=randwrite --bs=4k --direct=0 --size=128m --numjobs=16 --end_fsync=1
rand-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=4
...
fio-3.1
Starting 16 processes
rand-write: Laying out IO file (1 file / 128MiB)
rand-write: Laying out IO file (1 file / 128MiB)
rand-write: Laying out IO file (1 file / 128MiB)
rand-write: Laying out IO file (1 file / 128MiB)
rand-write: Laying out IO file (1 file / 128MiB)
rand-write: Laying out IO file (1 file / 128MiB)
rand-write: Laying out IO file (1 file / 128MiB)
rand-write: Laying out IO file (1 file / 128MiB)
rand-write: Laying out IO file (1 file / 128MiB)
rand-write: Laying out IO file (1 file / 128MiB)
rand-write: Laying out IO file (1 file / 128MiB)
rand-write: Laying out IO file (1 file / 128MiB)
rand-write: Laying out IO file (1 file / 128MiB)
rand-write: Laying out IO file (1 file / 128MiB)
rand-write: Laying out IO file (1 file / 128MiB)
rand-write: Laying out IO file (1 file / 128MiB)
Jobs: 16 (f=16): [w(16)][14.9%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 04m:05s]
I tried changing the logical/physical sector size of the disk in the virsh XML (I thought maybe I could get it to align with the ZVOL block size), but this renders the VM unbootable. All the VMs have cache=none by default, but I also tried cache=directsync ... (results still cruddy, did not record).
Looked into "type=volume", but dont' think this is supported on zfs-on-linux (maybe it is, but either way in theory raw should be fine ...)
io=native, not sure what else to try, the other options don't feel relevant.
Assuming driver=qemu is fine as well, might not be?
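For reference, the disk section of the VM XML currently looks roughly like this (the zvol path is illustrative; the commented-out blockio line is the sector-size experiment that left the guest unbootable):
Code:
<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='none' io='native'/>
  <source dev='/dev/zvol/zstorage/vms/VM_NAME/disk0'/>
  <target dev='vda' bus='virtio'/>
  <!-- tried matching the zvol sector sizes; the guest then refused to boot:
  <blockio logical_block_size='512' physical_block_size='8192'/>
  -->
</disk>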
Final Questions & Notes
So that is a lot of data, and I am left with many questions.
- Is my new pool using ashift=9 the better performer? (I have not yet retested with ashift=12 as I'd have to redo all the setup -- a bit lazy -- but in theory I think I am correct here, so long as I accept that if I ever get newer drives with 4K sector sizes I'll have to rebuild the pool, and I'm OK with that.)
- What might a theory/explanation be for why random writes within the VM are SO BAD?
- What are the best ZVOL settings for VM usage?
- What are the best VM disk settings for underlying ZVOL block devices?
- I am aware that many recommend using `qcow2` on ZFS datasets as it is easier to maintain and _maybe_ only slightly less performant. For now I prefer to stick with the zvol solution (admittedly, switching might resolve all the woes, but I have ~10 VMs and flipping them all to the qcow2 format is a chunk of work in itself) -- so please, no "just use qcow2" replies.
The Ultimate Question
- Can anyone link/point me to a guide for "tuning ZVOLs & VMs for real-world general I/O"? I know this varies per I/O workload, but I just want the "best of averages": exclude large file writes, exclude database page sizes -- just normal Linux guest OS usage.
Appendix
For posterity I've included various txt files with outputs of full config/etc (as I've only summarized them above).
Attachments
- 4 txt attachments with the full config/command output (summarized above)