ZFS Elephant In The Room (all NVMe array)

CyklonDX · Dec 16, 2023

Will take about a week to chew threw another set of 6's ~ i used 6 with raidz2 (still have 6)
They aren't doing anything, and its winter. - temps are fine.

// atm i don't have any other than hynix gold's ssd's -> for test i think we should have another set of different brand too, not sure when i get another chunk lightly used ssd. (likely crucial mx500's would be next)

CyklonDX · Dec 25, 2023

finished 2nd run, got exact same results.
i'll try it on different brand next time i have some.

pimposh · Dec 25, 2023

You mean you just killed another batch of drives ?

oneplane · Dec 25, 2023

My two cents (well, this wall of text is probably more like 2 euros):

I think a lot of the configuration really depends on your use case in terms of complexity or how advanced it needs to be. The more advanced/complex your needs, the more involved the configuration ends up being.

That said, if your usage isn't extreme, then any 'bad' configuration will still work fine. I've had redundant storage on Consumer SSDs for years and out of ~40 4TB Samsung and WD SSDs none have failed in ~6 years of use (I keep having to re-calculate the number as time seems to be moving faster each year...

), which is with small office locations that mostly just don't want to wait for file listings and search has to be fast. They might write a few TB each day in each location, and file-level sync causes additional writes all around. This is both on MDADM+LVM and ZFS.

This doesn't mean it works for everyone (we're talking about SMB setups for SMBs with maybe 4 VMs per storage pools on top of the NAS usage, used by maybe 25 users concurrently), but modern hardware usually outlasts the average user pretty decently.

If you're doing things like hosting many database servers, then you'd probably get into some trouble, but since a ton of data is just read over and over (especially operating system and application binaries) it's not as heavy as you'd think it is. For NAS use we generally see about 17TB written vs. 40TB read per drive per year on that SMB setup. Since writes are spread in almost all scenarios it tends to be a little less weary.

Back to the topic at hand: for VM storage, ZFS zvols are pretty neat, but since you'd be having big chunks of data (from ZFS's perspective) with filesystems on it (from the VM perspective), having to do with file-backed disks and just plain MD RAID10 and LVM isn't all that crazy. You'd lose integrity features, but the in-VM filesystem really should already be taking care of that. If you think about it, anything important should probably be on ZFS, but not on ZFS-in-ZFS (i.e. a host doing ZFS and a guest doing ZFS as well). Anything else shouldn't be on ZFS since it's a waste of space and compute. I do run ext4 on zvols, compression on but dedup off. It doesn't kill the disks as fast as you'd think.

I have a template for single-node appliance compute (using Proxmox and ZFS), there is almost no local data persistence (all done on a NAS elsewhere), only OS, application and hot/cached data. It's a 1TB pool (just a simple mirror) with 2 devices (Samsung SM863a basic enterprise SATA SSD):

power on hours: 47217
wear-out: 3%
LBA written: 282097715172
LBA read: 25842077196
NAND writes: 857246527488

Those nodes host 5 VMs, all Linux, all ext4, 3 have a 4GB Swap disk, there is some constant-activity stuff like a K3S admin node & orchestrator, a network VM, a hardware management thing (mostly RS485 management and data transceiver), a local data pre-processing spark VM, and a general K3S worker node.

If I calculated it correctly, that means (282097715172*512)/(1024^4)=131 TB written, which isn't a lot. But it's only 12TB read! So not-optimal configurations like those are really making a tenfold difference in 'eating disks'. On the other hand, it would need to be an order of magnitude higher before I'd spend time and money on doing anything about this. The hardware will age out before it's worn out... as long as you don't host a NAS-on-ext4-on-zvol.

Personally, on single-node setups I still use ZFS, and just accept that you're not going to get all the endurance and performance an NVMe drive has to offer. I'd easily throw 50% of the capacity, performance and endurance in the trash if it means that I can detect and fix data integrity with ease. Especially when you consider the extreme performance and capacity we get with current-gen hardware. Not too long ago, DDR2 FB-DIMMS and RAID10 HDDs were considered 'good enough', even for 2 or 3 VMs on the same host.

The only difference I make is disk-per-vdev count, for NVMe I tend to make larger stacks (up to 12) than HDDs (up to 8) because the increased bandwidth makes resilvers not as scary, and for HDDs (usually 6 disk raidz2 per vdev) having more vdevs give you more performance, which for non-SSD storage is still important.

For dual-nodes I used to use a cluster FS on top of zvols, and for anything bigger, Ceph is pretty much the next step. (but I don't do dual-nodes anymore, either the work is important enough for real redundancy, or it's not and you just get the one active node at a time with complete disk VM migration if required instead of shared storage)

Do suboptimal configurations sometimes make me sad because we leave performance on the table (and life/usage)? Yes. But when it's about making money it matters a whole lot less than I'd like.

CyklonDX · Dec 26, 2023

pimposh said:
You mean you just killed another batch of drives ?

yes.

janek202 · Dec 26, 2023

CyklonDX said:
finished 2nd run, got exact same results.
i'll try it on different brand next time i have some.

Could you also try with a different ashift value, like 13?

CyklonDX · Dec 26, 2023

yes i can, but i cannot give anyone a day/date when i'm able to get new lot of ssd's for free.
*Unless someone wills to send 4 exact same units ~ it may take few months, a year, idk until i get some.

gea · Dec 27, 2023

There are some things to consider with ZFS vs older filesystems

Checksums on data and metadata
This increases the amount of io. You can disable for less io=better performance but checksums is propably the main reason to prefer ZFS

Copy on Write
This means that no data is overwritten but always written newly in ZFS blocks. This gives you snaps and an always valid filesystem even on a crash during write so you propably won't miss. Main problem of a Copy on Write filesystem: When you want to change a "house" to "mouse" in an older filesystem it was possible to change a single byte for this. With ZFS you must at least read/write a whole ZFS datablock. With a larger file the blocksize is recsize. In the extreme this can mean that you must read/write 1M to modify one byte. For smaller files blocksize is reduced dynamically ex in a 4k file you must process 4k to modify 1 byte. As the minimal size is physical blocksize of a disk (4k) this is the same amount on older filesystems. Copy on Write also increases fragmentation especially when pool becomes full.

ZFS recsize
When you write data, it is splittet in ZFS blocks in recsize, ex a 2M file is splittet in 4 blocks with recsize=512M. Larger recsizes like 512k or 1M are good for performance but increases amount of io when you edit files. Lower values like 16k-64k are better with small files or use cases like databases or VM storage. Default 128k is a good compromise. If use case is clear you can optimize this value. With SSD and volatile data I would avoid high values due the negative effect on write amplification.

ZFS sync write
This protects the ZFS rambased writecache in case of a crash. Use only when needed ex VM storage as this is bad for write performance. ZFS encryption with sync enable is quite slow as small datablocks are bad for encryption

Other performance sensitive settings
Amount io RAM. ZFS use RAM as read/writecache to limit negative performance aspects of checksums and Copy on Write. It does not cache files but last/most read ZFS blocks and metadata. If RAM is lower than say 32GB, you may check if performance is better with more RAM

ZFS raid
In a raid Z, iops capability is equal to a single disk. With mirrors write iops scale with number of mirrors, read iops is twice that number. If iops is a limiting factor, prefer mirrors. Also overall performance does not scale linear with number of disks. If 2 disks in a raid-0 gives 1,5x the value of a single disk, this is ok.

Physical disk blocksize.
Mechanical disks are now always 4k (ashift 12). With flash it may be (or not) that larger values like 8k (ashift 13) may be better. If you don't define ashift during vdev creation, ZFS reads the physical blocksize from disks. Normally this is the best as the disk manufacturer should know what value is best suited for the SSD.

CPU
ZFS is software Raid. With a smaller CPU max achievable perfrormance is limited. If you want to go beyond say 2 GB/s throughput you need a fast or very fast CPU. Mostly clock is more important than number of cores.

Flash in general (beside Intel Optane)
With concurrent read/write or steady write after some time of writing SSD performance may become really bad especially with consumer models or those without ram cache not so with enterprise SSDs. With desktop models you can increase overprovisioning (new or secure erased SSD) to limit this effect. Also avoid near full pools (up from 50%fillrate, performance goes down).

If you want to know whats possible with ZFS on a certain hardware, do tests with Intel Optane (I know expensive and hard to find now)

homeserver78 · Dec 27, 2023

gea said:
When you want to change a "house" to "mouse" in an older filesystem it was possible to change a single byte for this.

Rather a single sector?

gea said:
ZFS recsize
When you write data, it is splittet in ZFS blocks in recsize, ex a 2M file is splittet in 4 blocks with recsize=512M. Larger recsizes like 512k or 1M are good for performance but increases amount of io when you edit files. Lower values like 16k-64k are better with small files or use cases like databases or VM storage.

As I understand it, since the recsize is a max value and smaller ZFS blocks are used for storing smaller files regardless of the recsize setting, a lower recsize setting shouldn't make a difference for small files. The use case for smaller recsize is when you have large files that you need to make smaller modifications to - like databases and VM storage, as you point out.

Also, larger recsize is good for thoughput with large, unchanging files, since it reduces the required IO (more data transferred per IO) and the on-disk fragmentation. It will waste bandwidth (not IO) when you edit files.

gea said:
ZFS sync write
This protects the ZFS rambased writecache in case of a crash. Use only when needed ex VM storage as this is bad for write performance. ZFS encryption with sync enable is quite slow as small datablocks are bad for encryption

Did you mean compression (rather than encryption)? Compression works better with larger blocks, but I'm not sure I understand how small blocks would slow down encryption. Am I missing something?

gea · Dec 27, 2023

Ok, smallest possible io is not one byte but a 512B or 4k physical disk blocksize.

Correct, recsize is the maximal allowed size.
For a smaller file than recsize, smaller ZFS datablocks are used.
If you edit a larger file with large block/recsizes you must always read/write large datablocks
what means with 1M block/recsize a 1M io for a "house" to "mouse" instead a 512B or 4K.

With Flash this can reduce lifespan due increased writes. With Flash on nonempty blocks any write
may additionally need to do a SSD block read SSD block erase, page update and SSD block write.
This affects not only bandwith but also iops. Enterprise SSDs are much better, desktop SSD can be quite bad.

No I mean encryption with sync enabled. In this case very small write commits must be encrypted.
This is not efficient. Even with the fastest Optane and a fast system, performance is only a fraction of nonsync writes.
See my tests, https://www.napp-it.de/doc/downloads/epyc_performance.pdf

homeserver78 · Dec 29, 2023

gea said:
No I mean encryption with sync enabled. In this case very small write commits must be encrypted.
This is not efficient. Even with the fastest Optane and a fast system, performance is only a fraction of nonsync writes.
See my tests, https://www.napp-it.de/doc/downloads/epyc_performance.pdf

I didn't realise encryption would slow down IO operations like that; I thought it would mostly affect throughput. Always nice to learn something new. Thanks!

SnJ9MX · Dec 29, 2023

CyklonDX said:
So... I ran some tests on older Hynix Gold 1TB SSD's with zfs ashift 12
(had 12 of those, with around 15-20TB written on each)

The result the disks want to die. While in reality i only wrote around 300-400TB on each - they reported 600+. While at NAND writes exploded.
View attachment 33255

// So in my opinion zfs at very least isn't treating ssd's with as if they were with blocks.

You are using consumer NVMe drives rated for 0.4 DWPD (750 TB). And you (whether intentionally or not) wrote 3.8 PB to them. What did you expect? Get some old enterprise drives off eBay with minimum of 3 DWPD and you won't need to worry about this.

i386 · Dec 29, 2023

SnJ9MX said:
DWPD

dwpd it's a "bad" metric without context: 3dwpd with a 100GByte ssd is 300GByte, but 0.3 dwpd with a 30TByte ssd is 10TByte

SnJ9MX · Dec 29, 2023

i386 said:
dwpd it's a "bad" metric without context: 3dwpd with a 100GByte ssd is 300GByte, but 0.3 dwpd with a 30TByte ssd is 10TByte

Indeed it is. I prefer the full endurance number, which is why I included it as well. That said, easy to tell what kind of endurance to expect based on DWPD:

1 or less = read intensive
3 = mixed use
>5 = write intensive

I have quite the collection of 10 DWPD drives in my homelab with no chance of ever using the entire rated endurance. Even the "mixed use" D3-S4610 960GBs I have are 6PB endurance...

CyklonDX · Dec 29, 2023

SnJ9MX said:
You are using consumer NVMe drives rated for 0.4 DWPD (750 TB). And you (whether intentionally or not) wrote 3.8 PB to them. What did you expect? Get some old enterprise drives off eBay with minimum of 3 DWPD and you won't need to worry about this.

No the actual amount of written data was 500TB. I expected them to die. With aim to verify if zfs treats them as blocks, and not as 4kn disks. (ashift) i have another set of ssd's running with standard 512 ashift, and they have less writes to nand.

Same brand/model i'm using in some servers hynix golds in zfs 512 it has 50T of writes, but only 94TB nand writes (around 2x extra writes);
vs with killing 4kn on avg 500-600TB writes resulted in 3.4PB nand writes (around 5-6x extra writes).

Different brands will behave differently - thats given, thats how hynix'es behaved.
In terms of PBW top of the line would be hgst 800G models having around ~24PBW, while not costing all that much.

Search

ZFS Elephant In The Room (all NVMe array)

CyklonDX

Well-Known Member

CyklonDX

Well-Known Member

pimposh

hardware pimp

oneplane

Well-Known Member

CyklonDX

Well-Known Member

janek202

New Member

CyklonDX

Well-Known Member

gea

Well-Known Member

homeserver78

New Member

gea

Well-Known Member

homeserver78

New Member

SnJ9MX

Active Member

i386

Well-Known Member

SnJ9MX

Active Member

CyklonDX

Well-Known Member