Optimize Write Amplification (ZoL, Proxmox)

Dunuin

New Member
Sep 11, 2020
Hi,

I've set up a Proxmox hypervisor and I'm seeing a total write amplification of around 18x from the VMs down to the NAND flash. Can someone give me some hints on how to improve this?

Server setup:
Supermicro X10SRM-F, Xeon E5-2620v4, 64GB DDR4 2133 ECC

root drives:
2x Intel S3700 100GB (sda/sdb, LBA=4k) mirrored with mdraid, encrypted with LUKS, LVM and ext4 for root.

VM drives:
4x Intel S3710 200GB (sdc/sdd/sde/sdg, LBA=4k) + 1x Intel S3700 200GB (sdf, LBA=4k) as raidz1 pool (ashift=12, no LOG or cache device, atime=off, compression=LZ4, no deduplication, sync=standard). All VM disks (zvols with volblocksize=8k) are inside an encrypted dataset.

Backup drive:
1x Seagate ST4000DM004 4TB SMR HDD (sdh, LBA=4k) as a ZFS pool for VM backups and as storage for RAM dumps (used by snapshots).

Unused drives:
2x Samsung 970 Evo 500GB (nvme0n1/nvme1n1) on an M.2 add-on card, mirrored with ZFS, but not in use at the moment because of too high write amplification.

All my VMs use ext4 and access their zvols in "raw" format. In the KVM settings I use "VirtIO SCSI single" as controller, "none" as cache mode, and SSD emulation, IO thread and discard are enabled. Inside the VMs I mount the virtual drives with "noatime" and "nodiratime", and fstrim runs once a week. "/tmp" is mounted via ramfs.

All partitions inside the VMs and on the host are aligned to 1MB.

If I run iostat on the host and sum up the writes of all zvols, all the VMs combined write around 1 MB/s of data.
The zvols sit on the raidz1 pool, which writes 5x 1.9 MB/s, so from guest filesystem to host filesystem I get a write amplification of around 10.
I use smartctl to monitor the host writes and NAND writes of each drive, and for every 1 GB of data written to an SSD, the SSD writes around 1.8 GB to the NAND.
So in the end the 1 MB/s of real data from the guests multiplies up to 18 MB/s written to the NAND flash.
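As a sanity check, here is roughly how the iostat and smartctl numbers combine (the values below are just the rounded figures from my measurements above):

```shell
# Rounded figures from the iostat/smartctl measurements above:
guest_kbs=1000   # sum of all zd* kB_wrtn/s (what the VMs actually write)
pool_kbs=9500    # sum of sdc..sdg kB_wrtn/s (5 drives x ~1.9 MB/s)
nand_factor=1.8  # NAND bytes written per host byte, from smartctl counters

awk -v g="$guest_kbs" -v p="$pool_kbs" -v n="$nand_factor" 'BEGIN {
    fs_wa  = p / g        # guest filesystem -> pool
    tot_wa = fs_wa * n    # guest -> NAND flash
    printf "guest->pool WA: %.1fx\n", fs_wa
    printf "guest->NAND WA: %.1fx\n", tot_wa
}'
# guest->pool WA: 9.5x
# guest->NAND WA: 17.1x
```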

Code:
root@Hypervisor:/var/log# iostat 600 2
Linux 5.4.60-1-pve (Hypervisor)         09/06/2020      _x86_64_        (16 CPU)

...

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.66    0.00    5.90    0.02    0.00   89.42

Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
nvme1n1           0.00         0.00         0.00          0          0
nvme0n1           0.00         0.00         0.00          0          0
sdg             129.56         0.81      1918.89        484    1151332
sdb               4.94         0.00        25.70          0      15417
sdh               0.00         0.00         0.00          0          0
sdf             128.64         0.81      1918.76        488    1151256
sda               4.95         0.05        25.70         32      15417
sdd             129.78         0.83      1917.61        500    1150564
sde             129.89         0.81      1917.23        488    1150340
sdc             130.13         0.87      1916.58        520    1149948
md0               0.00         0.00         0.00          0          0
md1               4.06         0.05        25.13         32      15080
dm-0              4.06         0.05        25.13         32      15080
dm-1              4.06         0.05        29.87         32      17920
dm-2              0.00         0.00         0.00          0          0
zd0               0.69         0.00         8.03          0       4820
zd16              0.58         0.00         6.45          0       3868
zd32             13.13         0.89       278.59        536     167156
zd48              0.62         0.00         6.90          0       4140
zd64              0.58         0.00         6.53          0       3920
zd80              0.00         0.00         0.00          0          0
zd96              0.00         0.00         0.00          0          0
zd112             0.10         0.01         0.53          8        320
zd128             0.00         0.00         0.00          0          0
zd144             0.00         0.00         0.00          0          0
zd160             0.00         0.00         0.00          0          0
zd176             0.00         0.00         0.00          0          0
zd192             0.00         0.00         0.00          0          0
zd208             0.00         0.00         0.00          0          0
zd224             0.00         0.00         0.00          0          0
zd240             0.00         0.00         0.00          0          0
zd256             0.00         0.00         0.00          0          0
zd272             0.00         0.00         0.00          0          0
zd288             0.00         0.00         0.00          0          0
zd304             0.00         0.00         0.00          0          0
zd320             0.00         0.00         0.00          0          0
zd336             0.00         0.00         0.00          0          0
zd352             0.00         0.00         0.00          0          0
zd368             0.00         0.09         0.00         56          0
zd384             0.00         0.00         0.00          0          0
zd400            51.87         0.16       717.30         96     430380
zd416             0.58         0.00         6.32          0       3792
zd432             0.58         0.00         6.39          0       3832
zd448             0.67         0.00         8.11          0       4868
zd464             0.60         0.00         6.36          0       3816
Is there anything I did wrong?

18 MB/s is 568 TB per year, which is really a lot of data, because the VMs are mostly idling plain Debian installs without heavy use. Only 3 VMs run real applications (Zabbix, Graylog, Emby). I chose 5x Intel S37x0 SSDs because they have a combined TBW of 18,000 TB and should last some time, but saving some writes would be nice nonetheless.

I only found one optimization that made a big difference: changing the cache mode from "none" to "unsafe", so the VMs can't do sync writes, which cuts the write amplification in half. But that isn't really a good idea if something crashes and a DB gets corrupted or something like that.
I think it makes such a big impact because I don't have a dedicated SLOG device, so the ZIL sits on the same drives by default, and this doubles the writes for sync writes: every sync write is first stored in the ZIL area of the drive and later written again to the same drive for permanent storage.

And if I run "fdisk -l" in the guests, all the QEMU hard disks show an LBA of 512B, but the zvols are 8K and the real disks of the pool are 4K (ashift=12). Could that cause the write amplification? I wasn't able to find a way to change how KVM presents the LBA of the virtual drives. Or is it normal that 512B LBAs are shown while 8K is really used internally?
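For reference, QEMU itself can present a 4K sector size to the guest via properties on the scsi-hd device; I haven't found a Proxmox option for it, so this is only a sketch of the QEMU-level knob (the zvol path and device IDs are placeholders, not my actual config):

```shell
# Sketch only: scsi-hd exposes logical_block_size/physical_block_size.
# The zvol path and IDs below are placeholders for illustration.
qemu-system-x86_64 \
  -device virtio-scsi-pci,id=scsihw0 \
  -drive file=/dev/zvol/rpool/data/vm-100-disk-0,if=none,id=drive-scsi0,format=raw,cache=none,discard=unmap \
  -device scsi-hd,drive=drive-scsi0,logical_block_size=4096,physical_block_size=4096
```

Note that changing the logical sector size of a disk the guest has already partitioned with 512B sectors can break the guest, so this would only be safe for fresh installs.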

I've read somewhere that raidz1 isn't good for write amplification and that a ZFS raid10 might be better, but there were no numbers. Has someone tested both and can compare them? I only find ZFS performance benchmarks that don't mention write amplification. I initially thought raidz1 with 5 drives should result in much better write amplification, because only 25% more data is stored on the drives instead of 100% more.

Right now I'm using 5x 200GB SSDs as raidz1, but I'm only using 100GB of space. If raid10 is better for write amplification, I could use raid10 with 1 spare until I need more than 400GB.
I don't know what the internal SSD write amplification was with just two mirrored SSDs, but the translation from guest filesystem to host filesystem had a write amplification of about 7. With raidz1, the same amount of guest data causes a guest-to-host write amplification of 10.

Any idea what the best disk setup would be to minimize write amplification?

And I created all my zvols with an 8K volblocksize. Could that cause extra write amplification? Is it better to use a higher value like 16K or 32K if the SSDs use 4K LBAs?

It would be great if someone could give me some hints on how to reduce the write amplification without sacrificing the snapshot ability, failure safety and error correction of ZFS.
 

Dunuin

New Member
Sep 11, 2020
I tested 4 SSDs as a striped-mirror ZFS pool (like raid10) and the write amplification was even worse...

raidz1 of 5 SSDs: 808 kB/s writes from the VMs -> 6913 kB/s writes to the pool = 8.56x write amplification (without the additional write amplification inside the SSDs)

ZFS striped mirror of 4 SSDs (raid10) with standard sync writes: 883 kB/s writes from the VMs -> 10956 kB/s writes to the pool = 12.41x write amplification (without the additional write amplification inside the SSDs)

ZFS striped mirror of 4 SSDs (raid10) with most sync writes deactivated: 707 kB/s writes from the VMs -> 4990 kB/s writes to the pool = 7.06x write amplification (without the additional write amplification inside the SSDs)
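The factors above are simply the pool writes divided by the guest writes; for anyone who wants to reproduce them from the iostat dumps below:

```shell
# Pool kB/s divided by guest kB/s, using the measured values above.
awk 'BEGIN {
    printf "raidz1 (5 SSDs):           %.2fx\n",  6913 / 808
    printf "mirror+stripe (sync):      %.2fx\n", 10956 / 883
    printf "mirror+stripe (few sync):  %.2fx\n",  4990 / 707
}'
# raidz1 (5 SSDs):           8.56x
# mirror+stripe (sync):      12.41x
# mirror+stripe (few sync):  7.06x
```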


raidz1 pool (sdc/sdd/sde/sdf/sdg):
Code:
root@Hypervisor:~# iostat 3600 2
Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
nvme0n1           0.00         0.28         0.00       1024          0
nvme1n1           0.00         0.29         0.00       1032          0
sdf             119.93         2.72      1382.60       9780    4977356
sdb               4.35         0.57        24.69       2048      88898
sdc             121.43         2.70      1383.36       9716    4980080
sde             121.05         2.85      1381.88      10276    4974760
sdg             120.96         2.80      1382.17      10088    4975820
sdd             121.26         2.77      1383.42       9976    4980308
md0               0.00         0.14         0.00        512          0
md1               3.57         0.57        24.19       2052      87084
sda               4.36         0.71        24.69       2564      88898
sdh               0.00         0.43         0.00       1544          0
dm-0              3.57         0.43        24.19       1540      87084
dm-1              3.57         0.14        30.94        516     111372
dm-2              0.00         0.14         0.00        512          0
zd0               0.81         1.28         8.18       4624      29440
zd16              0.51         0.00         5.78          0      20812
zd32              0.00         0.00         0.00          0          0
zd48              0.00         0.00         0.00          0          0
zd64              0.00         0.00         0.00          0          0
zd80              0.56         0.00         6.42          0      23108
zd96              0.54         0.00         6.22          0      22408
zd112             0.56         0.00         6.45          0      23224
zd128             0.04         0.27         0.00        976          0
zd144             2.41         0.00        23.25          0      83708
zd160             0.00         0.00         0.00          0          0
zd176            26.24         0.00       326.52          0    1175480
zd192             0.00         0.00         0.00          0          0
zd208             0.00         0.00         0.00          0          0
zd224             0.00         0.00         0.00          0          0
zd240            13.21         2.57       328.63       9236    1183052
zd256             0.00         0.00         0.00         12          0
zd272             0.00         0.00         0.00          0          0
zd288             0.00         0.00         0.00          0          0
zd304             0.00         0.00         0.00          0          0
zd320             2.64         0.00        53.82          8     193752
zd336             0.62         0.00         6.88          4      24784
zd352             0.00         0.00         0.00          0          0
zd368             0.00         0.00         0.00          0          0
zd384             0.00         0.00         0.00          0          0
zd400             0.59         0.00         6.38          0      22952
zd416             1.04         0.00        12.84          0      46224
zd432             1.59         0.01        16.37         20      58948
zd448             0.00         0.00         0.00          0          0
zfs "Raid10" (sdc/sdd/sde/sdg):
Code:
root@Hypervisor:~# iostat 3600 2
Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
nvme0n1           0.00         0.00         0.00          0          0
nvme1n1           0.00         0.00         0.00          4          0
sdf               0.00         0.00         0.00          0          0
sdb               5.63         0.17        28.77        596     103573
sdc              77.40        13.24      2786.42      47680   10031128
sde              71.34        13.17      2691.64      47400    9689908
sdg              71.49        13.01      2691.64      46836    9689908
sdd              77.68        13.66      2786.42      49160   10031128
md0               0.00         0.00         0.00          0          0
md1               4.61         1.48        28.12       5336     101232
sda               5.69         1.32        28.77       4740     103573
sdh               0.00         0.00         0.00          4          0
dm-0              4.61         1.48        28.12       5336     101232
dm-1              4.60         1.46        37.60       5240     135344
dm-2              0.01         0.03         0.01         96         44
zd160             0.50         0.19         5.33        668      19180
zd192             0.00         0.00         0.00          0          0
zd16              0.59         0.00         6.31          0      22720
zd48              0.00         0.00         0.00          0          0
zd96              1.68         0.01        17.12         28      61640
zd208             0.00         0.00         0.00          0          0
zd224             0.57         0.00         6.02         12      21668
zd288             0.00         0.00         0.00          0          0
zd80              0.57         0.00         6.13          0      22056
zd304             0.00         0.00         0.00          0          0
zd352             2.39         0.02        23.32         72      83936
zd384             0.00         0.00         0.00          0          0
zd144             3.10         0.88        63.93       3176     230160
zd272             0.00         0.00         0.00          0          0
zd320             1.00         0.11        12.02        396      43264
zd32              0.00         0.00         0.00          0          0
zd400             0.58         0.04         6.29        128      22628
zd128             0.00         0.00         0.00          0          0
zd112             0.76         0.00         8.32          8      29956
zd368             0.00         0.00         0.00          0          0
zd0               0.61         0.28         6.44       1012      23180
zd64              0.00         0.00         0.00          0          0
zd336            26.63        10.51       329.33      37848    1185604
zd256             0.00         0.00         0.00          0          0
zd176            13.70         0.77       392.81       2756    1414112
zd416             0.00         0.00         0.00          0          0

zfs "Raid10" with "cache mode = unsafe" (ignoring sync writes) for the 2 VMs doing the most writes:
Code:
root@Hypervisor:~# iostat 1200 2
Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
nvme0n1           0.00         0.00         0.00          0          0
nvme1n1           0.00         0.00         0.00          0          0
sdf               0.00         0.00         0.00          0          0
sdb               7.01         0.17        32.46        208      38957
sdc              60.49        66.98      1271.80      80380    1526156
sde              55.57        68.84      1223.11      82612    1467732
sdg              55.53        68.84      1223.11      82608    1467732
sdd              60.40        66.17      1271.80      79404    1526156
md0               0.00         0.00         0.00          0          0
md1               5.66         0.66        31.66        792      37988
sda               7.04         0.49        32.46        584      38957
sdh               0.00         0.00         0.00          0          0
dm-0              5.66         0.66        31.66        792      37988
dm-1              5.65         0.63        42.51        760      51012
dm-2              0.01         0.03         0.01         32         12
zd160             0.51         0.00         5.38          0       6460
zd192             0.00         0.00         0.00          0          0
zd16              0.98        31.05         6.85      37256       8224
zd48              0.00         0.00         0.00          0          0
zd96              1.72         0.02        18.17         20      21804
zd208             0.00         0.00         0.00          0          0
zd224             0.56         0.00         5.93          0       7120
zd288             0.00         0.00         0.00          0          0
zd80              0.57         0.00         6.00          0       7204
zd304             0.00         0.00         0.00          0          0
zd352             2.28         0.00        22.20          0      26636
zd384             0.00         0.00         0.00          0          0
zd144             3.00         0.00        53.93          0      64716
zd272             0.00         0.00         0.00          0          0
zd320             1.01         0.00        11.91          0      14292
zd32              0.00         0.00         0.00          0          0
zd400             0.57         0.00         6.26          0       7512
zd128             0.00         0.00         0.00          0          0
zd112             0.74         0.00         8.14          0       9764
zd368             0.00         0.00         0.00          0          0
zd0               0.62         0.00         6.83          0       8192
zd64              0.00         0.00         0.00          0          0
zd336            60.93         5.44       243.25       6524     291900
zd256             0.00         0.00         0.00          0          0
zd176            79.80       124.27       312.94     149124     375524
zd416             0.00         0.00         0.00          0          0

Any ideas?
 

netswitch

New Member
Sep 24, 2018
Hi,

I have no solution to provide, but I have the same write amplification problem with ZFS and Proxmox.
I have been running a 12-drive raidz2 pool using Crucial MX500 1TB drives.

I computed a 28x write amplification, and the "estimated lifetime remaining" of the drives was losing one percent per week.
I ended up replacing the drives with Intel S3500s and now they are losing lifetime more slowly (but I haven't computed the write amplification yet).

I have read that the issue is related to the number of syncs sent to the drives; as the MX500 has no capacitors, syncs are more frequent and the NAND wears out faster.
But with the S3710 you shouldn't have this issue.
 

Vesalius

Member
Nov 25, 2019
Just today started using the suggestions here for the same issue.
  • set a small recordsize
  • set logbias=throughput
  • set compression=lz4
Code:
zfs set recordsize=16k rpool
zfs set logbias=throughput rpool
zfs set compression=lz4 rpool
To see what you have now and later confirm any changes you can use:
Code:
zfs get recordsize,logbias,compression rpool
I already had lz4 compression on, but recordsize was the default 128k and logbias was latency.


dswartz

Active Member
Jul 14, 2011
Maybe I misunderstand this, but I thought write amplification referred to a per-drive concern. Reading the numbers above, I'm seeing cases where the listed write amplification is N times bigger than per-drive (for an N-drive pool, roughly). This seems misleading, at best, no? Also, IIRC, unsafe (is this what proxmox refers to as 'sync=disabled'?) is no more likely to corrupt anything, than if you pull the power plug. Most modern filesystems (and properly configured DBs) are crash-tolerant.
 

Vesalius

Member
Nov 25, 2019
I may also try another recommendation from that same linked thread: increasing the zfs_txg_timeout option to greater than the default 5 secs. I will likely try 10 secs. The server has a UPS, so I'm not so worried about a power outage. You have to create an /etc/modprobe.d/zfs.conf file, as Proxmox doesn't ship one by default.
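For reference, setting it would look like this (10 seconds is just the value I plan to try):

```shell
# Make the change persistent across reboots:
echo "options zfs zfs_txg_timeout=10" > /etc/modprobe.d/zfs.conf
update-initramfs -u -k all

# Or apply it immediately without rebooting:
echo 10 > /sys/module/zfs/parameters/zfs_txg_timeout
```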

 

Dunuin

New Member
Sep 11, 2020
Meanwhile I recreated every zvol, because I changed the volblocksize on the raidz1 pool to 32K so I don't lose capacity to bad padding.

The write amplification of the SSDs themselves isn't that bad; it's around factor 2.4x. The bigger problem is the write amplification between guest OS and host, which is about factor 7x.
My VMs write around 60GB/day to the virtual hard disks. This results in ZFS writing 420GB/day to the pool, and around 1TB/day is written to the NAND. So in total that's a write amplification of around factor 17x.

I think all the different block sizes from the guest filesystems down to the physical drives are the biggest problem and cause some padding overhead. The chain looks like this:
SSDs (logical/physical sector size: 512B/4K) <-- ZFS pool (ashift=12, so 4K) <-- zvol (volblocksize: 32K) <-- virtio SCSI virtual drive (LBA: 512B) <-- ext4 partitions (block size 4K)
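The padding effect can be estimated with a simplified model of raidz allocation: data sectors plus one parity sector per stripe of up to (ndisks - parity) data sectors, with the total rounded up to a multiple of (parity + 1). This ignores some details of the real allocator, so treat the numbers as approximate:

```shell
# Simplified raidz1 allocation model: 5 disks, ashift=12 (4K sectors).
awk 'BEGIN {
    ndisks = 5; parity = 1; sector = 4096
    stripe = ndisks - parity            # data sectors per full stripe
    for (vbs = 8192; vbs <= 32768; vbs *= 2) {
        data = vbs / sector
        par  = int((data + stripe - 1) / stripe)   # ceil(data/stripe)
        tot  = data + par
        rem  = tot % (parity + 1)                  # pad to multiple of p+1
        if (rem) tot += (parity + 1) - rem
        printf "volblocksize=%2dK: %2d sectors for %d data -> %.2fx\n",
               vbs / 1024, tot, data, tot / data
    }
}'
# volblocksize= 8K:  4 sectors for 2 data -> 2.00x
# volblocksize=16K:  6 sectors for 4 data -> 1.50x
# volblocksize=32K: 10 sectors for 8 data -> 1.25x
```

With 32K blocks the allocation reaches the ideal 1.25x for a 4+1 raidz1, which is why recreating the zvols recovered the lost capacity.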


Just today started using the suggestions here for the same issue.
  • set a small recordsize
  • set logbias=throughput
  • set compression=lz4
LZ4 is enabled, logbias is latency and recordsize is 128K. But as far as I know, recordsize isn't used for zvols, so it shouldn't affect the virtual hard disks. And I've read that "logbias=throughput" causes horrible fragmentation and slowdowns if you have a lot of small writes and no SLOG.

Also may try another recommendation from that same linked thread to increase the zfs_txg_timeout option to greater than the default 5 secs.
I returned to the default "zfs_txg_timeout" value, because my Intel SSDs can cache even sync writes, so raising it wouldn't greatly lower the SSDs' write amplification and would only add more risk of losing data and slow down the pool.

Also, IIRC, unsafe (is this what proxmox refers to as 'sync=disabled'?) is no more likely to corrupt anything, than if you pull the power plug. Most modern filesystems (and properly configured DBs) are crash-tolerant.
I meant the ZFS option "sync=disabled", which ignores flushes and forces every sync write to be handled as an async write. "cache mode = unsafe" in the VirtIO SCSI settings does the same thing, but at the virtualization level.