can anyone explain 'write amplification'?


jcizzo

Member
Jan 31, 2023
88
10
8
And how do you prevent it, or at least minimize it?

I'm setting up a new TrueNAS NAS based on SCALE (Fangtooth). It's going to have six 870 EVO SSDs. They're brand new, and I read somewhere on one of the forums that due to write amplification, some users' drives (also on TN SCALE) were wearing out quickly. These were expensive and I can't have them getting trashed (obviously).

The way the user described it, it seems to be more of an issue in TN SCALE than in Core. Can anyone speak to that?

In one response, it was suggested that when the pool was set up, the sector size TrueNAS used was not aligned with that of the SSDs. In other words (I don't recall precisely, so these numbers are just for the example), the drive has 4k sectors but the user had the pool set up with 32k sectors. Again, that may not be accurate; I'm just trying to relay the idea.

It was also mentioned that the drives should be over-provisioned.

It's meant to be a simple NAS, but the media pool will hold a ton of movies, shows and music, and these are 4TB 870 EVOs.

I'm not running databases or anything. I just want something that uses low power and runs quietly, with drives that'll last at least several years before needing replacement, which should absolutely be more than possible considering 99% of its use will be read operations.

Thanks for your input!
 

louie1961

Active Member
May 15, 2023
439
197
43
It has to do with ZFS being a copy-on-write (COW) file system. With sync writes, all the data is written at least twice: once to the ZIL and then again to its final location in the pool. Check out this video

 

jcizzo

Member
Jan 31, 2023
88
10
8
I was wondering if I could prevent the double-writing by using an NVMe for the ZIL. I figure if the ZIL were on an NVMe, then the SATA drives would only be written to once, but I'm not sure if TrueNAS SCALE (or Core) works that way.

As I understand it, if the ZIL is on a separate device and that device croaks, you only lose the ZIL, which doesn't matter as long as all the writes to the main pool are complete. Do you know if that's possible?
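Something like this is what I have in mind, I think (just a sketch; the pool name "tank" and the device paths are made-up examples):

  zpool add tank log /dev/nvme0n1                        # dedicate one NVMe as a separate log (SLOG) device
  zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1    # or mirrored instead, if losing the log worries you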
 

jcizzo

Member
Jan 31, 2023
88
10
8
Also, I planned on using async writes; there's no need for synchronous writes in my use case.
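If I understand the docs right, that's just a per-dataset property, something like this (the dataset name is an example):

  zfs set sync=disabled tank/media    # async only; the ZIL shouldn't be written at all for this dataset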
 

i386

Well-Known Member
Mar 18, 2016
4,801
1,863
113
36
Germany
Write amplification is the additional data that has to be written/transferred because every layer in the stack has a specific size it can handle/use.
I'll take an example from Windows + ReFS (64KByte chunks) + hardware RAID (3 devices, RAID 5, 32KByte strips):

Application: writes n bytes
Filesystem: converts the bytes into multiples of the filesystem chunk size (in this example 64KByte chunks)
RAID: writes the filesystem data in multiples of the strip size (in this example 32KByte) across the data devices; every strip is itself a multiple of 512 or 4096 bytes
SSD: internally writes whole pages to the media; current SSDs use pages with sizes ranging from 8 to 64KByte
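A rough shell toy to make those round-ups concrete for a single small write (the sizes are the example numbers from above; the RAID strip layer rounds the same way):

  n=1000                                      # application writes 1000 bytes
  fs=$(( (n + 65535) / 65536 * 65536 ))       # filesystem rounds up to 64KByte chunks
  page=16384                                  # assume a 16KByte NAND page
  nand=$(( (fs + page - 1) / page * page ))   # the SSD writes whole pages
  echo "app: ${n}B -> filesystem: ${fs}B -> NAND: ${nand}B (~$(( nand / n ))x amplification)"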
 

nexox

Well-Known Member
May 3, 2023
1,823
881
113
There are several other things that cause write amplification, including several processes within the SSD, mostly to do with the way NAND flash requires an erase before a re-write.

Erase blocks are much larger than the pages used for writes (2+MB vs 4-64kB), so if you do a lot of small random writes, you'll eventually have all your erase blocks partly filled with pages you need to keep and partly filled with pages that have been over-written. The drive still needs to erase something in order to accept new writes, so it copies the remaining active pages to a new block in order to free up a block to erase. Those copies are NAND writes and count against the lifetime, and you can see how larger erase blocks (QLC is usually larger than TLC, which is larger than MLC) combined with certain workloads could lead a drive to write the same data several times; that's amplification.
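Back-of-envelope math for that garbage-collection effect (made-up numbers, just to illustrate):

  # say an erase block holds 512 pages and 25% of them are still valid when it gets collected
  valid=128; total=512
  # freeing the block relocates the valid pages but only makes room for (total - valid) host pages,
  # so NAND writes / host writes = total / (total - valid)
  echo "scale=2; $total / ($total - $valid)" | bc    # ~1.33x amplification from GC alone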

The SSD also needs to store the metadata that maps a host-side block to a particular page of NAND, since those move around. That metadata is itself stored on the NAND, which can only be efficiently written in 4kB (or larger) pages, but the individual bits of metadata are smaller than that, so for performance the SSD controller tends to keep that data in volatile memory for a while. That means on power loss the data is gone, so when software sends one of the commands that instructs a drive to durably store data, the drive has to commit that metadata to non-volatile storage, which may end up writing partial pages. If you write 2kB to a 4kB page, that's a 2x write amplification. SSDs with power loss protection don't suffer from this particular issue; that's one reason why enterprise SSDs can run high-throughput databases and similar applications, which require durable storage of random writes, without taking a huge hit on write endurance.
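You can provoke that durable-write pattern from the host side and watch what it does to the drive's write counters (a toy sketch; the paths are examples and GNU dd is assumed):

  dd if=/dev/zero of=/mnt/test bs=2k count=1000 oflag=dsync   # 1000 small writes, each one forced durable
  dd if=/dev/zero of=/mnt/test2 bs=2M count=1                 # roughly the same amount of data as one buffered write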
 

jcizzo

Member
Jan 31, 2023
88
10
8
OK, so if I were to install a separate NVMe drive and direct TrueNAS to use it as the ZIL, would that stop the double-writes?

Can I adjust the page size when the pool is created to further reduce the amplification?

Again, the drives will be storing movies, TV shows, and music. I'll be adding and deleting from time to time, but 99% of the time the dataset will only be read from.
 

jcizzo

Member
Jan 31, 2023
88
10
8
Also (again), I don't plan on using synchronous writes; there's really no need for it in my use case. I've read, however, that in TN SCALE, even with sync writes disabled, you can still enable the ZIL on all writes. From what I gather, the ZIL exists regardless, but one can move it to another drive to take ZIL-related write amplification off the main pool in question.

Correct me if I'm wrong.

And yes, I know the ZIL is NOT a write-through cache, as many have come to assume.
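One thing I figure I can do once it's set up is watch whether the log device is actually taking the writes (pool name is just an example):

  zpool iostat -v tank 5    # per-vdev stats every 5 seconds; the logs section shows SLOG traffic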
 

unwind-protect

Active Member
Mar 7, 2016
605
246
43
Boston
I tested write amplification in FreeBSD with UFS, ZFS, GELI, and ZFS native encryption. I measured how much the counters in SMART went up compared to the amount I was writing. The numbers showed no amplification, which surprised me.
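For reference, this is roughly how one can read those counters (a sketch; it assumes a drive that reports Total_LBAs_Written in 512-byte units, and attribute names vary by vendor):

  smartctl -A /dev/ada0 | awk '/Total_LBAs_Written/ {printf "%.1f GB written by host\n", $10 * 512 / 1e9}'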
 

nexox

Well-Known Member
May 3, 2023
1,823
881
113
I don't know the details of ZFS or TN, but a consumer-grade NVMe drive is probably going to have similar write amplification issues, possibly even worse than SATA drives: the target benchmark numbers are higher, so NVMe drives tend to rely a lot more on keeping data in volatile RAM, potentially increasing the impact of small durable writes.

Given your intended use, you would probably be alright getting a pair of smaller enterprise-grade MLC SATA SSDs with power loss protection (I believe the ZIL is essential, so it should be mirrored). Enterprise NVMe is great, but M.2 drives with PLP are sort of rare, and U.2 or AIC drives often use a bunch of power and require extra cooling; plus the improved latency doesn't really matter for bulk media storage.
 

jcizzo

Member
Jan 31, 2023
88
10
8
nexox said:
I don't know the details of ZFS or TN, but a consumer-grade NVMe drive is probably going to have similar write amplification issues [...]
From what I've read, the ZIL is only important for recovery if the server loses power during a transfer with sync writes. So if this were backing up VMs, databases and whatnot, then yeah, mirroring the ZIL would be a good move, and that's if you're writing to mechanical drives. If I'm writing to several SSDs in a RAIDZ2 config and the write completes, all is fine. If the write completes and I then get a message that the ZIL drive (an NVMe) is about to die, I can replace it and direct TrueNAS to use the new NVMe for the ZIL.

The point of the exercise is to stop the SSDs from being written to twice. If I can move the ZIL to an NVMe, they'll only be written to once, which brings it all in line with any other drive.
 

joerambo

New Member
Aug 30, 2023
28
9
3
jcizzo said:
It's meant to be a simple NAS, but the media pool will hold a ton of movies, shows and music [...] 99% of its use will be read operations.
I would not worry about write amplification at all. Set up a classic 6-drive RAIDZ2 and that's it. For bulk storage, the fact that ZFS writes data twice (in fact, with bulk writes, much less than that) does not matter at all. The users that killed SSDs were abusing them with DBMSes or some crazy VM storage, where I/Os get amplified across multiple layers.

The main worry is choosing reliable drives with up-to-date firmware, plus a well-thought-out backup strategy to work around things like "oops, all my drives reached 32k hours at the same time".
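Roughly what I mean, nothing fancy (the pool/dataset names and device paths are just examples):

  zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
  zfs create -o recordsize=1M -o compression=lz4 tank/media   # big records suit large media files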
 

DarkServant

Active Member
Apr 5, 2022
101
91
28
@jcizzo

I take it you have read the prerequisites for a reliable ZFS storage system?
You should not go with a base system without a good amount of ECC DRAM. Storage-wise, you have heard about enterprise SSDs, which have some extras like power-loss protection (for data in flight, not only data at rest!). There is an "870 EVO" equivalent called the PM893, whose sub-models go up to 7.68TB. They are physically different and designed for such workloads, with different firmware too. No consumer SSDs in a ZFS NAS.
In any case, here is what you can do: give them the latest firmware, do a clean (secure) erase so all cells are reset to 1, and then cap the visible capacity via the Host Protected Area at about 3 TiB ( hdparm -Np6442450944 --yes-i-know-what-i-am-doing /dev/sd* ), i.e. 24 Tbit exposed out of 32 Tbit physical. That way your SSDs will always have enough spare space that the write amplification doesn't skyrocket at some point, even without TRIM and when filled to >=90%.
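Spelled out, the whole procedure looks roughly like this (the device name is an example, and the drive must not be security-frozen for the erase to work):

  hdparm --user-master u --security-set-pass p /dev/sdX        # temporary password, required for the erase
  hdparm --user-master u --security-erase p /dev/sdX           # ATA secure erase: all cells back to the erased state
  hdparm -Np6442450944 --yes-i-know-what-i-am-doing /dev/sdX   # HPA: 6442450944 x 512-byte sectors = 3 TiB visible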
The 870 EVO has a page size of 16KiB, but it is divided into 2x 8KiB, so if you come across an ashift setting, the value should be 13 (2^13 = 8KiB) in this case, I believe.
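If you create the pool by hand, that would be something like (names and devices are examples):

  zpool create -o ashift=13 tank raidz2 /dev/sd[b-g]    # 2^13 = 8KiB allocation size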

Prices have not dropped on enterprise/datacenter SSDs the way they have on consumer SSDs; the difference is quite high.

There are M.2 SSDs which have PLP, but mostly at a length of 110mm instead of 80mm (the Micron 7450 MAX 400/800GB are 80mm). Cooling them is an issue in itself. Two Optane P1600X would be a nice choice if you want this SLOG config.

But with the SSDs you already have, you should probably consider another filesystem and/or NAS solution.
 

TonyArrr

Active Member
Sep 22, 2021
176
88
28
Straylia
I was doing something similar, OP. About 4 years ago I started my server off with 12x 2TB WD Blue SATA drives in a pair of 6-drive RAIDZ2 vdevs for storage of larger media.

I kept track, via an InfluxDB exporter, of how the write amplification went for the first 6 months (moving the initial 6TB of media in, and adding a couple seasons of TV shows later), and my write amplification worked out to 1.4x.
I stopped detailed logging at that point; I just record basic SMART stats now to alert me if anything looks fishy.

Never had a ZIL drive, but I did have a separate pool of NVMes for my VMs, databases and logs.

None of the drives have failed yet, and I've done a bit over 8 full drive writes on average.

I am currently replacing them bit by bit, but just so I can expand the pool size. PLP and better performance are also a consideration, but mostly I'm playing the $/GB game (though I'm not considering consumer drives anymore).

I’m picking drives like the earlier mentioned PM893s when I see a good price and just exchanging them into the pool a couple at a time. Means no immediate storage gains on purchase, but since I have multiple vdevs it helps keep free space vs used space balanced (as available space in a vdev does get considered what zfs decides what to write where)

Long story short, you'll probably be fine. Keep track of your drives' stats somewhere so you can follow real-world usage and plan for failures/wear-out before they happen, and don't put anything on there that isn't backed up somewhere else (unless you don't care about losing it).
 

DarkServant

Active Member
Apr 5, 2022
101
91
28
I have to admit that the overprovisioned SM883 SSDs I mentioned saw no real heavy writes/load at all: TrueNAS Core, a ZFS striped mirror, and about a terabyte of data on it, just for SMB/CIFS shares.
They will probably never wear out; only the power-on hours keep climbing...

@TonyArrr: Just keep the firmwares up to date (JXTC604Q). And yeah, the price difference is quite high now.