Hi all experts! I would like to share my setup and ask for guidance/advice based on what I have. I've been running this configuration for many years, but I'm finally looking to address the slow I/O performance in my VMs. I've spent 10-15 hours running various permutations & tests in a spreadsheet trying to make heads or tails of the settings, but I'm at a bit of an "end of the rope" on my patience. I'm hoping someone in the community has experience here, might eyeball the situation, and offer me a "duh! you should do ...".
High Level
- metal: 2x SAS platter HDDs, SAS controller, IBM x3550 M3
- filesystem: zfs-on-linux + zvols
- zfs pool type: mirror
- host: Fedora (up until recently), now Ubuntu 18.04 server
- virtualization: kvm/qemu VMs (w/ disk on virtio & raw storage into zvol)
- Disk I/O access from VMs has always been slow (~10MB/s or less)
- The VMs function well enough (none of them have super critical I/O requirements)
- However in _SOME_ scenarios I found it annoying (e.g. Nextcloud, scp/rsync transfers, etc.)
- Recently I replaced my 2x aging SAS HDDs with new SAS HDDs -- and while I'm at it (fresh pool, etc.) I decided I want to solve this slow-access problem
- In the past I chalked up the slowness to old/aging drives ... but analysis now disproves this.
We'll work bottom up
- Bare Metal
- Relevant Server Specs: IBM x3550 M3, Intel(R) Xeon(R) CPU (E5620 @ 2.40GHz), 76GB RAM
- Storage Controller: LSI SAS1068E
- HDDs: SAS 2x HGST HUC109090CSS60 900GB, 512-byte sector size
- Host
- Linux Ubuntu 18.04
- Kernel: 4.15.0-70-generic
- ZFS: 0.7.5
- ZFS Pool
- "zstorage" is its name, mirror'd
- ashift=9 (aligned with the 512-byte physical sector size of these disks)
- NOTE: on my older pool I used ashift=12 even though those drives were also 512-byte sector; for this testing, when I created the NEW pool, I went with ashift=9 in an attempt to address the slow I/O (per my reading, 1:1 alignment gives the best performance, while 4K on top of 512-b is merely OK but more future proof). See the verification commands sketched right after this list.
- ZFS base dataset zstorage/vms
- Relevant Config (the things per reading that matter):
- recordsize=128K
- compression=off (normally would set to lz4)
- atime=off
- xattr=sa
- logbias=latency
- dedup=on (whoops, need to fix this; though shouldn't affect things as we get into ZVOL stuff)
- sync=standard
- acltype=posixacl
- ZVOL block storage for VMs (the base dataset config above is not very relevant AFAIK, since the VM disks live on ZVOLs)
- zstorage/vms/<VM_NAME>/disk0 (XXG varies)
- Relevant Config (the things per reading that matter):
- volblocksize=8K
- checksum=on
- compression=off
- logbias=latency
- dedup=on (double whoops)
- sync=standard
- copies=1
- ZVOL block storage for TESTING
- zstorage/test/disk0 (10G)
- Relevant Config (the things per reading that matter):
- volblocksize=8K (varied this for testing)
- checksum=on
- compression=off (also tried lz4)
- copies=1
- logbias=latency|throughput (tried both latency overall better)
- dedup=off
- Partition Table on RAW block device (on top of ZVOL)
- VMs: (from the original provisioning...)
- Units: sectors of 1 * 512 = 512 bytes
- Sector size (logical/physical): 512 bytes / 8192 bytes
- I/O size (minimum/optimal): 8192 bytes / 8192 bytes
- Start @ 2048
- TEST: (for these tests to try to find the BEST settings)
- Units: sectors of 1 * 512 = 512 bytes
- Sector size (logical/physical): 512 bytes / 8192 bytes
- I/O size (minimum/optimal): 8192 bytes / 8192 bytes
- Start @ 2048 (not aligned, but I also tried starting at 8192; it made no difference)
- Filesystem:
- All VMs and testing are done with basic ext4 for compatibility with guest OSes.
- VMs:
- Type: KVM, qemu, Linux guests with varying CPU/RAM/storage assignments
- Note: NONE are "heavy DB" (so we can for now ignore any special volume tuning for database pages)
- Storage:
- type=block
- driver=qemu, type=raw, cache=none|directsync|?, io=native
- bus=virtio
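For reference, double-checking the values above looks roughly like this (a sketch; the device names and paths are from my setup and purely illustrative):
Code:
# pool ashift (zdb -C dumps the cached pool config; grep out the ashift lines)
zdb -C zstorage | grep ashift

# dataset / zvol properties summarized above
zfs get recordsize,compression,atime,xattr,logbias,dedup,sync zstorage/vms
zfs get volblocksize,compression,logbias,dedup,sync zstorage/test/disk0

# logical/physical sector sizes as seen by the host, and by the zvol's block device
lsblk -o NAME,LOG-SEC,PHY-SEC,SIZE /dev/sda /dev/sdb
fdisk -l /dev/zvol/zstorage/test/disk0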
Performance Methodology
I am not as scientific as I would like to be, but I'm doing my best. The VMs are unfortunately still active, so they MAY impact results here and there, yielding slight inconsistencies ... but I've turned off as much as possible to reduce I/O load outside the tests. (Obviously, in a perfect world I would shut down all VMs for this.)
I've been using "fio" primarily for my testing, like several others have shown in forum posts here and elsewhere, this is the TYPICAL invocation:
Code:
# RANDOM-WRITE
fio --name=random-write --ioengine=sync --iodepth=4 --rw=randwrite --bs=512b|4K|8K|128K --direct=0 --size=128m --numjobs=16 --end_fsync=1
# SEQ WRITE
fio --name=seq-write --ioengine=sync --iodepth=4 --rw=write --bs=512b|4K|8K|128K --direct=0 --size=128m --numjobs=16 --end_fsync=1
- My understanding/comments of the settings:
- ioengine=sync|libaio (AFAIK, "sync" best represents what VMs will do)
- rw=randwrite|write (random writes the WORST case, or seq writes the BEST case; real world is somewhere in between for basic VMs)
- bs=... (vary the write block size to test theories about perf & underlying sizes; a tiny sweep loop is sketched after this list)
- size=128M (for testing/time reasons, seemed reasonable, others used 256MB)
- numjobs=16 (parallelize I/O write tasks, this is QUITE heavy, maybe VMs only have 1x IO thread? not sure)
- end_fsync=1 (WAIT for storage to confirm write before saying "done" -- this is the crux of our testing IMO)
- I don't fully understand these ones; various other users set them:
- iodepth=4
- direct=0 (I tried setting =1 in one or two cases, but it didn't work)
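For completeness, the block-size sweep itself was just a trivial loop along these lines (a sketch, not my exact script; the target directory is wherever the test filesystem is mounted):
Code:
#!/bin/bash
# Sweep block sizes for both random and sequential writes; one log file per run.
cd /mnt/temp_disk0/fio || exit 1
for bs in 512 4k 8k 128k; do
    for rw in randwrite write; do
        fio --name="${rw}-${bs}" --ioengine=sync --iodepth=4 --rw="$rw" \
            --bs="$bs" --direct=0 --size=128m --numjobs=16 --end_fsync=1 \
            > "fio-${rw}-${bs}.log"
    done
done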
Performance Measurements
Throughout you will see slightly incomplete results as I did not run the FULL suite of tests across the full matrix ...
Baseline: started with the base zfs dataset
As one can expect, the 128K results _usually_ won, because our zfs dataset has a 128K record size. There we have some baseline numbers, but it is worth noting that the zfs dataset is _NOT_ optimized for I/O performance (per configuration above), as I do not have any important/heavy load here -- only on the ZVOLs, moving on.
Testing: test ZVOL "test/disk0" w/ 10G volume, single partition, ext4
Moving on to some TEST results: I conducted MANY experiments to try to find the best settings. Much of this was based on reading + experimentation.
Given the TEST zvol, I was able to destroy and re-create at will, so playing with volblocksize was possible.
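(For each volblocksize experiment, the destroy/re-create cycle was roughly the following -- a sketch using the same names/mount point as the outputs below; the property values are whatever was under test.)
Code:
# tear down the previous test volume
umount /mnt/temp_disk0 2>/dev/null
zfs destroy zstorage/test/disk0

# re-create with the volblocksize under test, then partition + ext4 + mount
zfs create -V 10G -o volblocksize=8k -o compression=off -o dedup=off \
    -o logbias=latency zstorage/test/disk0
parted -s /dev/zvol/zstorage/test/disk0 mklabel gpt mkpart primary ext4 2048s 100%
mkfs.ext4 /dev/zvol/zstorage/test/disk0-part1
mount /dev/zvol/zstorage/test/disk0-part1 /mnt/temp_disk0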
My takeaways here:
- In the first tests I accidentally left dedup=on, but later turned that off (heck, compare A to E and it actually got worse)
- I concluded libaio was generally slower than sync (not sure why)
- I _THOUGHT_ creating a zvol with volblocksize=512-b would give the BEST results, due to ashift=9 and the 512-byte HDD sector size ... but that was not the result.
- I conducted a few experiments changing logbias to "throughput" (to get higher throughput), but this led to LOWER results (presumably due to the sync/fsync behaviour in the fio runs)
- One theory I had was that since the partition didn't necessarily start on a ZVOL block boundary, it might make a difference, but it did not (compare E to A)
Here's the output of TWO cases for scenario A
RAND-WRITE:
Code:
/mnt/temp_disk0/fio$ fio --name=random-write --ioengine=sync --iodepth=4 --rw=randwrite --bs=8k --direct=0 --size=128m --numjobs=16 --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=sync, iodepth=4
...
fio-3.1
Starting 16 processes
Jobs: 16 (f=16): [F(16)][-.-%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=21225: Mon Dec 9 20:06:15 2019
write: IOPS=582, BW=4659KiB/s (4770kB/s)(128MiB/28135msec)
clat (usec): min=7, max=478, avg=13.21, stdev= 7.44
lat (usec): min=8, max=479, avg=13.49, stdev= 7.45
clat percentiles (usec):
| 1.00th=[ 10], 5.00th=[ 11], 10.00th=[ 11], 20.00th=[ 12],
| 30.00th=[ 12], 40.00th=[ 13], 50.00th=[ 13], 60.00th=[ 13],
| 70.00th=[ 14], 80.00th=[ 15], 90.00th=[ 16], 95.00th=[ 17],
| 99.00th=[ 21], 99.50th=[ 24], 99.90th=[ 59], 99.95th=[ 115],
| 99.99th=[ 363]
lat (usec) : 10=1.35%, 20=97.50%, 50=1.01%, 100=0.07%, 250=0.01%
lat (usec) : 500=0.04%
cpu : usr=0.14%, sys=1.03%, ctx=333, majf=0, minf=8
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwt: total=0,16384,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=4
(SNIP the other 15x jobs)
Run status group 0 (all jobs):
WRITE: bw=72.8MiB/s (76.3MB/s), 4658KiB/s-4659KiB/s (4770kB/s-4770kB/s), io=2048MiB (2147MB), run=28135-28137msec
Disk stats (read/write):
zd256: ios=0/1028, merge=0/0, ticks=0/20968, in_queue=20968, util=2.59%
SEQ-WRITE:
Code:
$ fio --name=seq-write --ioengine=sync --iodepth=4 --rw=write --bs=8k --direct=0 --size=128m --numjobs=16 --end_fsync=1
seq-write: (g=0): rw=write, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=sync, iodepth=4
...
fio-3.1
Starting 16 processes
Jobs: 16 (f=16): [F(16)][-.-%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s]
seq-write: (groupid=0, jobs=1): err= 0: pid=22319: Mon Dec 9 20:08:12 2019
write: IOPS=619, BW=4952KiB/s (5071kB/s)(128MiB/26466msec)
clat (usec): min=7, max=2988, avg=14.18, stdev=28.99
lat (usec): min=7, max=2989, avg=14.48, stdev=28.99
clat percentiles (usec):
| 1.00th=[ 9], 5.00th=[ 11], 10.00th=[ 11], 20.00th=[ 12],
| 30.00th=[ 13], 40.00th=[ 13], 50.00th=[ 13], 60.00th=[ 14],
| 70.00th=[ 15], 80.00th=[ 15], 90.00th=[ 17], 95.00th=[ 18],
| 99.00th=[ 23], 99.50th=[ 36], 99.90th=[ 210], 99.95th=[ 318],
| 99.99th=[ 1647]
lat (usec) : 10=3.41%, 20=94.61%, 50=1.67%, 100=0.09%, 250=0.15%
lat (usec) : 500=0.06%, 750=0.01%
lat (msec) : 2=0.01%, 4=0.01%
cpu : usr=0.13%, sys=1.04%, ctx=565, majf=0, minf=14
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwt: total=0,16384,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=4
(SNIP the other 15x jobs)
Run status group 0 (all jobs):
WRITE: bw=77.4MiB/s (81.1MB/s), 4952KiB/s-4953KiB/s (5071kB/s-5072kB/s), io=2048MiB (2147MB), run=26465-26466msec
Disk stats (read/write):
zd256: ios=0/1028, merge=0/0, ticks=0/19124, in_queue=19124, util=2.63%
In both cases, we see the drive I/O on "zpool iostat" hit decent speeds, ~150-250MB/s (not sure why fio reports it so much lower). So there, some baselines.
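(The host-side speeds mentioned above came from watching the pool during the runs with something like the below, 1-second interval.)
Code:
zpool iostat -v zstorage 1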
The cruddy VM results
Finally we're here to see the problem:
Within the VM, we get _similar_ results on seq-write (not quite as fast, but perhaps given the virtualization overheads I could accept it):
Code:
/tmp$ fio --name=seq-write --ioengine=sync --iodepth=4 --rw=write --bs=4k --direct=0 --size=128m --numjobs=16 --end_fsync=1
seq-write: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=4
...
fio-3.1
Starting 16 processes
seq-write: Laying out IO file (1 file / 128MiB)
seq-write: Laying out IO file (1 file / 128MiB)
seq-write: Laying out IO file (1 file / 128MiB)
seq-write: Laying out IO file (1 file / 128MiB)
seq-write: Laying out IO file (1 file / 128MiB)
seq-write: Laying out IO file (1 file / 128MiB)
seq-write: Laying out IO file (1 file / 128MiB)
seq-write: Laying out IO file (1 file / 128MiB)
seq-write: Laying out IO file (1 file / 128MiB)
seq-write: Laying out IO file (1 file / 128MiB)
seq-write: Laying out IO file (1 file / 128MiB)
seq-write: Laying out IO file (1 file / 128MiB)
seq-write: Laying out IO file (1 file / 128MiB)
seq-write: Laying out IO file (1 file / 128MiB)
seq-write: Laying out IO file (1 file / 128MiB)
seq-write: Laying out IO file (1 file / 128MiB)
Jobs: 12 (f=12): [_(1),F(7),_(1),F(2),_(1),F(2),_(1),F(1)][100.0%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s]
seq-write: (groupid=0, jobs=1): err= 0: pid=4668: Mon Dec 9 20:11:22 2019
write: IOPS=1043, BW=4173KiB/s (4273kB/s)(128MiB/31410msec)
clat (usec): min=6, max=2025.8k, avg=782.68, stdev=26653.78
lat (usec): min=7, max=2025.8k, avg=784.51, stdev=26653.78
clat percentiles (usec):
| 1.00th=[ 7], 5.00th=[ 8], 10.00th=[ 9],
| 20.00th=[ 9], 30.00th=[ 9], 40.00th=[ 12],
| 50.00th=[ 14], 60.00th=[ 15], 70.00th=[ 16],
| 80.00th=[ 17], 90.00th=[ 24], 95.00th=[ 31],
| 99.00th=[ 78], 99.50th=[ 10421], 99.90th=[ 36439],
| 99.95th=[ 708838], 99.99th=[1317012]
bw ( KiB/s): min= 2, max=29466, per=5.98%, avg=3071.62, stdev=5269.70, samples=32
iops : min= 0, max= 7366, avg=767.56, stdev=1317.45, samples=32
lat (usec) : 10=35.12%, 20=50.44%, 50=12.78%, 100=0.74%, 250=0.10%
lat (usec) : 500=0.05%, 750=0.03%, 1000=0.01%
lat (msec) : 2=0.02%, 4=0.03%, 10=0.11%, 20=0.42%, 50=0.06%
lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.02%, 1000=0.02%
lat (msec) : 2000=0.02%, >=2000=0.01%
cpu : usr=0.76%, sys=1.99%, ctx=540, majf=1, minf=23
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwt: total=0,32768,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=4
(SNIP)
Run status group 0 (all jobs):
WRITE: bw=50.2MiB/s (52.6MB/s), 3210KiB/s-5407KiB/s (3287kB/s-5537kB/s), io=2048MiB (2147MB), run=24241-40831msec
Disk stats (read/write):
vda: ios=1663/5244, merge=187/70829, ticks=1312/3893732, in_queue=2317472, util=63.42%
But now see random writes -- WOW, so bad. At first it goes OK, about 20-40MB/s, but within a few seconds everything slows to a crawl, and we end up with nearly 0KB/s:
Code:
$ fio --name=rand-write --ioengine=sync --iodepth=4 --rw=randwrite --bs=4k --direct=0 --size=128m --numjobs=16 --end_fsync=1
rand-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=4
...
fio-3.1
Starting 16 processes
rand-write: Laying out IO file (1 file / 128MiB)
rand-write: Laying out IO file (1 file / 128MiB)
rand-write: Laying out IO file (1 file / 128MiB)
rand-write: Laying out IO file (1 file / 128MiB)
rand-write: Laying out IO file (1 file / 128MiB)
rand-write: Laying out IO file (1 file / 128MiB)
rand-write: Laying out IO file (1 file / 128MiB)
rand-write: Laying out IO file (1 file / 128MiB)
rand-write: Laying out IO file (1 file / 128MiB)
rand-write: Laying out IO file (1 file / 128MiB)
rand-write: Laying out IO file (1 file / 128MiB)
rand-write: Laying out IO file (1 file / 128MiB)
rand-write: Laying out IO file (1 file / 128MiB)
rand-write: Laying out IO file (1 file / 128MiB)
rand-write: Laying out IO file (1 file / 128MiB)
rand-write: Laying out IO file (1 file / 128MiB)
Jobs: 16 (f=16): [w(16)][14.9%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 04m:05s]
I tried changing the logical/physical sector size of the disk in the virsh XML (I thought maybe I could get it to align with the ZVOL block size), but this renders the VM unbootable. All the VMs have cache=none by default, but I also tried cache=directsync ... (results still cruddy, did not record).
Looked into "type=volume", but dont' think this is supported on zfs-on-linux (maybe it is, but either way in theory raw should be fine ...)
io=native, not sure what else to try, the other options don't feel relevant.
Assuming driver=qemu is fine as well, might not be?
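For reference, the disk section of the VM XML currently looks roughly like this (the zvol path is illustrative; the commented-out blockio line is the sector-size experiment that left the guest unbootable):
Code:
<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='none' io='native'/>
  <source dev='/dev/zvol/zstorage/vms/VM_NAME/disk0'/>
  <target dev='vda' bus='virtio'/>
  <!-- tried matching the zvol sector sizes; the guest then refused to boot:
  <blockio logical_block_size='512' physical_block_size='8192'/>
  -->
</disk>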
Final Questions & Notes
So that is a lot of data, and I am left with many questions.
- Is my new pool using ashift=9 the better performer? (I have not yet retested with ashift=12 as I'd have to redo all the setup -- a bit lazy -- but in theory I think I am correct here, so long as I accept that if I ever get newer drives with 4K sector sizes I'll have to rebuild the pool, and I'm OK with that.)
- What might a theory/explanation be for why random writes within the VM are SO BAD?
- What are the best ZVOL settings for VM usage?
- What are the best VM disk settings for underlying ZVOL block devices?
- I am aware that many recommend using `qcow2` on ZFS datasets as it is easier to maintain and _maybe_ only slightly less performant. For now I prefer to stick with the zvol solution (admittedly, switching might resolve all the woes, but I have ~10 VMs and flipping them all to the qcow2 format is a chunk of work in itself) -- so please, no "just use qcow2" replies.
The Ultimate Question
- Can anyone link/point me to a guide for "tuning ZVOLs & VMs for real-world general I/O"? I know this varies per I/O workload, but I just want the "best of averages": exclude large file writes, exclude database page sizes -- just normal Linux guest OS usage.
Appendix
For posterity I've included various txt files with outputs of full config/etc (as I've only summarized them above).
Attachments
- 4 txt attachments with the full config/command output (summarized above)