Hi,
I've set up a Proxmox hypervisor and I'm seeing a total write amplification of around 18 from VM to NAND flash. Can someone give me a hint on how to improve this?
Server setup:
Supermicro X10SRM-F, Xeon E5-2620v4, 64GB DDR4 2133 ECC
Root drives:
2x Intel S3700 100GB (sda/sdb, LBA=4k) mirrored with mdraid, encrypted with LUKS, LVM and ext4 for root.
VM drives:
4x Intel S3710 200GB (sdc/sdd/sde/sdg, LBA=4k) + 1x Intel S3700 200GB (sdf, LBA=4k) as a raidz1 pool (ashift=12, no LOG or cache device, atime=off, compression=LZ4, no deduplication, sync=standard; verified with the commands below this list). All VM disks (zvols with volblocksize=8k) live inside an encrypted dataset.
Backup drive:
1x Seagate ST4000DM004 4TB SMR HDD (sdh, LBA=4k) as a ZFS pool for VM backups and as storage for RAM dumps (used by snapshots).
Unused drives:
2x Samsung 970 Evo 500GB (nvme0n1/nvme1n1) on an M.2 add-on card, mirrored with ZFS but not in use at the moment because of their too-high write amplification.
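For completeness, these are the commands I used to verify the pool settings listed above ("tank" stands in for my actual pool name):
Code:
zpool get ashift tank
zfs get atime,compression,dedup,sync tank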
All my VMs use ext4 and their disks are attached as "raw" format zvols. In the KVM settings I use "VirtIO SCSI single" as the controller, "none" as the cache mode, and SSD emulation, IO thread and discard are enabled. Inside the VMs I mount the virtual drives with "noatime" and "nodiratime", and fstrim runs once a week. "/tmp" is mounted via ramfs.
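For reference, the relevant lines of one VM's config look like this (VM ID and storage name are made up, the options are the ones just described):
Code:
# /etc/pve/qemu-server/100.conf (excerpt)
scsihw: virtio-scsi-single
scsi0: local-zfs:vm-100-disk-0,cache=none,discard=on,iothread=1,ssd=1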
All partitions inside the VMs and on the host are aligned to 1MB.
If I run iostat on the host and sum up the writes of all zvols (the zd* devices), all VMs combined are writing around 1 MB/s of data.
The zvols are on the raidz1 pool, where each of the five drives writes about 1.9 MB/s, so from guest filesystem to host filesystem I get a write amplification of around 10.
I use smartctl to monitor the host writes and NAND writes of each drive, and for every 1 GB of data written to an SSD, the SSD writes around 1.8 GB to the NAND.
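In case it matters, I read those counters roughly like this (the exact SMART attribute names differ between models and firmware versions, so the grep pattern is only an approximation):
Code:
smartctl -A /dev/sdc | grep -Ei 'host_writes|nand_writes'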
So in the end, the 1 MB/s of real data from the guests multiplies up to 18 MB/s written to the NAND flash.
Code:
root@Hypervisor:/var/log# iostat 600 2
Linux 5.4.60-1-pve (Hypervisor) 09/06/2020 _x86_64_ (16 CPU)
...
avg-cpu: %user %nice %system %iowait %steal %idle
4.66 0.00 5.90 0.02 0.00 89.42
Device tps kB_read/s kB_wrtn/s kB_read kB_wrtn
nvme1n1 0.00 0.00 0.00 0 0
nvme0n1 0.00 0.00 0.00 0 0
sdg 129.56 0.81 1918.89 484 1151332
sdb 4.94 0.00 25.70 0 15417
sdh 0.00 0.00 0.00 0 0
sdf 128.64 0.81 1918.76 488 1151256
sda 4.95 0.05 25.70 32 15417
sdd 129.78 0.83 1917.61 500 1150564
sde 129.89 0.81 1917.23 488 1150340
sdc 130.13 0.87 1916.58 520 1149948
md0 0.00 0.00 0.00 0 0
md1 4.06 0.05 25.13 32 15080
dm-0 4.06 0.05 25.13 32 15080
dm-1 4.06 0.05 29.87 32 17920
dm-2 0.00 0.00 0.00 0 0
zd0 0.69 0.00 8.03 0 4820
zd16 0.58 0.00 6.45 0 3868
zd32 13.13 0.89 278.59 536 167156
zd48 0.62 0.00 6.90 0 4140
zd64 0.58 0.00 6.53 0 3920
zd80 0.00 0.00 0.00 0 0
zd96 0.00 0.00 0.00 0 0
zd112 0.10 0.01 0.53 8 320
zd128 0.00 0.00 0.00 0 0
zd144 0.00 0.00 0.00 0 0
zd160 0.00 0.00 0.00 0 0
zd176 0.00 0.00 0.00 0 0
zd192 0.00 0.00 0.00 0 0
zd208 0.00 0.00 0.00 0 0
zd224 0.00 0.00 0.00 0 0
zd240 0.00 0.00 0.00 0 0
zd256 0.00 0.00 0.00 0 0
zd272 0.00 0.00 0.00 0 0
zd288 0.00 0.00 0.00 0 0
zd304 0.00 0.00 0.00 0 0
zd320 0.00 0.00 0.00 0 0
zd336 0.00 0.00 0.00 0 0
zd352 0.00 0.00 0.00 0 0
zd368 0.00 0.09 0.00 56 0
zd384 0.00 0.00 0.00 0 0
zd400 51.87 0.16 717.30 96 430380
zd416 0.58 0.00 6.32 0 3792
zd432 0.58 0.00 6.39 0 3832
zd448 0.67 0.00 8.11 0 4868
zd464 0.60 0.00 6.36 0 3816
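Summing up the relevant rows above confirms those numbers:
Code:
zd* (zvols):      ~1052 kB/s  (≈ 1 MB/s written by the guests)
sdc..sdg (pool):  ~9589 kB/s  (≈ 9.6 MB/s written to the SSDs)
guest -> host:    9589 / 1052 ≈ 9.1
host -> NAND:     9.1 * 1.8   ≈ 16, in line with my ~18 estimate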
Did I do something wrong? 18 MB/s is 568 TB per year, which is really a lot of data, because the VMs are mostly just idling plain Debian installs without heavy use. Only 3 VMs run real applications (Zabbix, Graylog, Emby). I chose 5x Intel S37x0 SSDs because they have a combined TBW of 18,000 TB and should last some time, but saving some writes would be nice nonetheless.
I only found one optimization that made a big difference: changing the cache mode from "no cache" to "unsafe", so the VMs can't force sync writes, which cuts the write amplification in half. But that isn't really a good option, because after a crash some databases could end up corrupted.
I think it makes such a big impact because I don't have a dedicated SLOG device, so the ZIL sits on the same drives by default, and that should double the writes for sync workloads: every sync write is stored first in the ZIL area of a drive and then written again to the same drive for permanent storage.
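One way to verify that theory would be to temporarily disable sync on the VM dataset and watch iostat again (only as a short test, of course, since sync=disabled carries the same crash risk as cache mode "unsafe"; "tank/encrypted" stands in for my dataset):
Code:
zfs get sync tank/encrypted
zfs set sync=disabled tank/encrypted
# ...rerun iostat 600 2 and compare...
zfs set sync=standard tank/encrypted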
Also, if I run "fdisk -l" in the guests, all the QEMU hard disks show an LBA size of 512B, while the zvols use 8K blocks and the real disks of the pool use 4k sectors (ashift=12). Could that be causing the write amplification? I wasn't able to find a way to change how KVM presents the LBA size of the virtual drives. Or is it normal that 512B LBAs are shown while 8K is really used internally?
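The only lead I found in that direction is QEMU's block size properties for the scsi-hd device; Proxmox doesn't seem to expose them in the GUI, but an "args:" line in the VM config should pass them through (untested sketch, VM ID made up; the guest would probably need repartitioning after changing the logical size):
Code:
# /etc/pve/qemu-server/100.conf (excerpt)
args: -global scsi-hd.logical_block_size=4096 -global scsi-hd.physical_block_size=4096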
I've read somewhere that raidz1 isn't good for write amplification and that a ZFS raid10 (striped mirrors) might be better, but there were no numbers. Has anyone tested both and can compare them? I only find ZFS performance benchmarks that don't mention write amplification at all. I initially thought that raidz1 with 5 drives should give much better write amplification, because only 25% more data is stored on the drives rather than 100% more.
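Doing the math for my exact setup, the 25% assumption seems to fall apart for small blocks. If I understand the raidz allocation rules (allocations padded to a multiple of parity + 1 sectors) correctly, it works out like this:
Code:
volblocksize = 8k, ashift = 12 (4k sectors), raidz1 (p = 1)
one 8k block = 2 data sectors + 1 parity sector = 3 sectors
raidz pads allocations to multiples of (p + 1) = 2  ->  4 sectors
allocated: 4 x 4k = 16k for 8k of data  ->  100% overhead, same as a mirror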
Right now I'm using 5x 200GB SSDs as raidz1, but I'm only using 100GB of space. If raid10 is better for write amplification, I could run raid10 with 1 spare until I need more than 400GB.
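Something like this is what I have in mind (sketch only, with my current device names; "tank" is a placeholder, and in practice /dev/disk/by-id paths would be safer than sdX names):
Code:
zpool create -o ashift=12 tank mirror sdc sdd mirror sde sdf spare sdg
zfs set compression=lz4 tank
zfs set atime=off tank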
I don't know what the internal SSD write amplification was back when I used just two mirrored SSDs, but the translation from guest filesystem to host filesystem had a write amplification of about 7. With raidz1, the same amount of guest data causes a guest-to-host write amplification of 10.
Any idea what disk setup would be best to minimize write amplification?
Also, I created all my zvols with an 8k volblocksize. Could that cause extra write amplification? Would a higher value like 16k or 32k be better if the SSDs use 4k LBAs?
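Since volblocksize can only be set at creation time, testing this would mean creating a new zvol and migrating a VM onto it, roughly like this (names made up; in Proxmox the value normally comes from the storage's block size setting):
Code:
zfs create -V 32G -o volblocksize=16k tank/vm-100-disk-1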
It would be great if someone could give me some hints on how to reduce the write amplification without sacrificing the snapshot ability, failure safety and error correction of ZFS.