Is ZFS really needed on SSD / NVMe devices?


BackupProphet

Well-Known Member
Jul 2, 2014
1,256
832
113
Stavanger, Norway
intellistream.ai
I am a big fan of ZFS and have been using it for over 15 years now. My favorite features are checksums with self-healing, snapshots and transparent compression.

However, over this time I have had hundreds of TB stored on SSDs, and not even once have I seen a checksum error. It happens occasionally on hard drives. The numbers are so good that I have yet to see a dead SSD too; the only dead SSDs I've seen have been DOA. This is across multiple vendors: HGST, Intel, SanDisk, Samsung, Toshiba.

I do know that SSD controllers are quite complex and have their own checksum/ECC implementations.

I am wondering, have any of you seen checksum errors? The reason I ask is that I am considering just going with LVM+XFS for my storage setup from now on. A lot of software like ClickHouse comes with transparent compression, and XFS is way faster than ZFS.
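For reference, the kind of check behind that claim is a periodic scrub plus a look at the per-vdev error counters. A minimal sketch, with a hypothetical pool name:

# "tank" is an example pool name; a scrub re-reads all data and verifies it against the stored checksums
zpool scrub tank
# the READ/WRITE/CKSUM columns list detected errors; a non-zero CKSUM count means a checksum mismatch was found
zpool status -v tank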
 

gea

Well-Known Member
Dec 31, 2010
3,476
1,363
113
DE
The main reasons ZFS cannot be as fast are:

- Copy on Write
This means that ZFS can guarantee atomic writes even if there is a crash during a write. Without CoW you are in danger of a corrupted filesystem or RAID. I would not want to miss it, as it offers transparent snapshots too. The main disadvantage is write amplification.

- Checksums (more data to process)
I would not want to miss this realtime end-to-end datablock verification at recordsize granularity; it covers even bad cables/trays, not only small single sectors/cells like on-disk ECC.

- RAM caching
Every write or read is additionally copied to the RAM ARC. This massively improves performance on hard disk pools but becomes a limit with fast SSDs.
You can disable this (direct I/O feature) in current ZFS.
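A minimal sketch of what I mean, assuming OpenZFS 2.3 or newer for the direct I/O property; pool and dataset names are just examples:

# bypass the ARC for reads and writes on this dataset (OpenZFS 2.3+ "direct" property)
zfs set direct=always tank/nvme
# on older releases: limit ARC use to metadata only for this dataset instead
zfs set primarycache=metadata tank/nvme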
 

MountainBofh

Beating my users into submission
Mar 9, 2024
390
283
63
I've seen SSDs fail and lose data. Not very often, but often enough that for anything super critical I still use ZFS. My two VM farms at work keep all the VMs stored on a server with two Micron 7450s set up in a ZFS mirror.
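A sketch of that kind of layout, with hypothetical device and pool names; tuning (ashift, recordsize, compression) depends on the drives and the hypervisor:

# two-way mirror of the two NVMe drives (device names are examples)
zpool create -o ashift=12 vmpool mirror /dev/nvme0n1 /dev/nvme1n1
# dataset for the VM images with lightweight compression
zfs create -o compression=lz4 -o recordsize=64K vmpool/vmstore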
 

CyklonDX

Well-Known Member
Nov 8, 2022
1,567
530
113
There are a few tricks to make ZFS work harder in terms of IOs, but since it's not really well documented, it's more a matter of changing, testing, and feeling your way rather than guided optimization; hardware can be very different, and the effects of settings can be too.
/sys/module/zfs/parameters

*By default it's optimized for spinning rust, but if you adjust some settings (more threads, bigger grabs, more CPU) you will get better results, though it will eat more CPU too. (But again, it depends on the whole configuration of your system, from hardware to software.)
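A hedged example of the kind of knobs meant here; the values are illustrative, not recommendations, and parameter names can differ between OpenZFS versions:

# cap the ARC at 16 GiB (value in bytes)
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max
# allow more concurrent sync I/Os per vdev (the defaults favor spinning disks)
echo 32 > /sys/module/zfs/parameters/zfs_vdev_sync_read_max_active
echo 32 > /sys/module/zfs/parameters/zfs_vdev_sync_write_max_active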
 
  • Like
Reactions: T_Minus and pimposh

ano

Well-Known Member
Nov 7, 2022
717
316
63
ZFS is quite slow, but ohhh so nice as well, that's my 5 cents.

The iXsystems devs have helped a lot though; now you can push past 20 GB/s.

We are running multi-petabyte ZFS setups. ZFS has its places, Ceph for the rest, and some XFS.
 

mattventura

Well-Known Member
Nov 9, 2022
602
317
63
Consider: what level of performance do you actually need for your use case? If your hardware is theoretically capable of 15GB/s, but ZFS kneecaps that to 5GB/s, and you need that whole 15, then that's a problem. But if you only needed 3GB/s in the first place, then it's not really an issue.
 

ca3y6

Active Member
Apr 3, 2021
312
212
43
I am salivating all over my keyboard reading these numbers. It seems that no matter what medium or setup I use, Storage Spaces or raw disk, I never get past 1.5-2 GB/s on Windows in normal usage (like a Windows copy, not talking about CrystalDiskMark), even with standalone drives benchmarked at 4-5 GB/s.
 
  • Like
Reactions: zunder1990

i386

Well-Known Member
Mar 18, 2016
4,597
1,744
113
35
Germany
I am a big fan of ZFS and have been using it for over 15 years now. My favorite features are checksums with self-healing, snapshots and transparent compression.

However, over this time I have had hundreds of TB stored on SSDs, and not even once have I seen a checksum error. It happens occasionally on hard drives. The numbers are so good that I have yet to see a dead SSD too; the only dead SSDs I've seen have been DOA. This is across multiple vendors: HGST, Intel, SanDisk, Samsung, Toshiba.

I do know that SSD controllers are quite complex and have their own checksum/ECC implementations.

I am wondering, have any of you seen checksum errors? The reason I ask is that I am considering just going with LVM+XFS for my storage setup from now on. A lot of software like ClickHouse comes with transparent compression, and XFS is way faster than ZFS.
First some complaining:
According to the ZFS gurus all my data & storage should be dead (I'm using Windows Server, NTFS & ReFS on hardware RAID, 200+ TByte of data). And yet everything works and the applications do not detect any (checksum) errors... Same for the customers of the company I'm working for. So far data has only been lost in "catastrophic" events (power surge, ransomware).

ZFS (or any other solution based on filesystems or RAID) won't protect you against catastrophic events; only verified backups (your name ;D) stored off-site will.

Evaluate LVM+XFS or any other solution (mdadm, Ceph, Windows Storage Spaces?) and see how it works for you (performance, maintenance, does it let you sleep at night) and how it compares to what you already know (ZFS). A lot of stuff might sound great when you read about it, but once you have to run it, it might not be as straightforward or as great as you thought.
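If it helps the evaluation, a minimal LVM+XFS setup looks roughly like this; device, volume group and mount point names are hypothetical:

# pool two NVMe drives into one volume group, carve out a logical volume, format it with XFS
pvcreate /dev/nvme0n1 /dev/nvme1n1
vgcreate data /dev/nvme0n1 /dev/nvme1n1
lvcreate -l 100%FREE -n fast data
mkfs.xfs /dev/data/fast
mkdir -p /mnt/fast
mount /dev/data/fast /mnt/fast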
 

TRACKER

Active Member
Jan 14, 2019
291
125
43
Opposite of the previous opinion: I had a data corruption issue about 15 years ago, when a buggy ICH driver caused silent data corruption on files larger than ~1 GB. I found it by accident when I began to get unzip/unrar errors on some large archive files. After many, many hours of testing, the way I figured it out was by creating multiple VMs (VMware Workstation) with Windows, Linux and (back then the new kid on the block) OpenSolaris with ZFS. Well, guess what :) Only OpenSolaris/ZFS was able to detect data corruption on the VM disk; the other OSes were running happily, no crashes, nothing... so yeah, don't underestimate the possibility of silent data corruption.
 
  • Like
Reactions: Stephan and pimposh

gea

Well-Known Member
Dec 31, 2010
3,476
1,363
113
DE
In the end it's all about the probability of problems and how to handle them. If you crash or pull the power plug during a write, for example, you have a high chance of a corrupt filesystem or RAID due to incomplete atomic writes, on the first occurrence or after a few, when using hardware RAID, NTFS or ext4. With a Copy on Write filesystem (btrfs, ReFS, ZFS) and software RAID you may need many thousands of such crashes to see a problem. The same goes for bitrot or a RAM bitflip: it happens at a low statistical rate. If you wait long enough you have a 100% chance of a problem (which you only know about with checksum verification or ECC); in the short term, or with smaller amounts of RAM or storage, you have a good chance of not seeing one. You can ignore this, or use ECC and a state-of-the-art filesystem with checksums on metadata and data (btrfs, ZFS, optional on ReFS) that was developed to detect and fix such problems.

Years ago, my mailserver was a Windows Server with NTFS and one of the best hardware RAID controllers with BBU at the time. One day I suddenly got read errors and was asked to run a chkdsk. It ran for days (offline) with the result of "scrambled egg". Another few days later the service was online again after a restore from backup.

This was the moment I switched to ZFS (OpenSolaris, Nexenta), and I have never had data loss since.
 
Last edited:
  • Like
Reactions: TRACKER

BackupProphet

Well-Known Member
Jul 2, 2014
1,256
832
113
Stavanger, Norway
intellistream.ai
Consider: what level of performance do you actually need for your use case? If your hardware is theoretically capable of 15GB/s, but ZFS kneecaps that to 5GB/s, and you need that whole 15, then that's a problem. But if you only needed 3GB/s in the first place, then it's not really an issue.
It's not bandwidth that is the issue, but latency. I get 2-10x more IOPS with XFS, which is significant. For sequential scans, ZFS can do 25-30 GB/s, which is good enough for ClickHouse.
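For what it's worth, the comparison behind that number is a small random-read fio run against a file on each filesystem; paths and sizes are just examples:

# 4k random reads, queue depth 32; drop --direct=1 if the filesystem rejects O_DIRECT
fio --name=randread --filename=/zfspool/dataset/testfile --size=10G \
    --rw=randread --bs=4k --iodepth=32 --ioengine=libaio --direct=1 \
    --runtime=60 --time_based --group_reporting
# repeat with --filename pointing at the XFS mount and compare the reported IOPS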
 

OP_Reinfold

Member
Sep 8, 2023
99
44
18
I am wondering, have any of you seen checksum errors? The reason I ask is that I am considering just going with LVM+XFS for my storage setup from now on. A lot of software like ClickHouse comes with transparent compression, and XFS is way faster than ZFS.
some background...

I use a huge hardware RAID pool for backups (all 30 TB NVMe drives), with roughly 7 GBytes per second capability at peak (a limit defined by the network pipe; I just haven't got around to investigating the wonderful 400 Gbit+ switches and cards, which frankly aren't needed for my setup). It automatically runs only a nightly incremental in the background, and it is beautifully efficient and quick when restoring data to shares and even spinning up new machines based off backed-up physical/virtual machines.

I no longer use any form of RAID or software equivalent, be it ZFS/btrfs etc., in ANY of my servers for 'redundancy', production or not. Recovery from any kind of failure is so quick these days that it isn't worth adding any more complications/overhead.

Get the backup pool perfect and all the headaches go away... well, the ones of the past do anyway.

BUT... if you are maintaining petabytes of 'always-online' data, that is when you start looking at the big boys' methods of redundancy/dedup etc. Anything less than, say, 60-odd TB, forget it, backup pool all the way... but no one way is the only way; it all depends on budget, time and, most important of all, the critical nature of the data. I wouldn't do my setup for a settlements system, for example, lol...

For me, time is precious, so I tend to invest where I can reduce my effort in ongoing support/maintenance. But it always comes down to the nature of the data and how critical it is: if you want to protect recently modified data, then an overnight backup isn't going to cut it, and in that case any form of redundancy is better than none; backups alone won't do.

Going back on point...

I personally haven't had any NVMe data corruption yet; the backup system does patrols and has never found any whatsoever, but as they say, YMMV.

Absolutely nothing wrong with your suggestion of going LVM+XFS (PS, I prefer XFS as a Linux FS myself), as long as you've got backups and you're happy with the age of the backup for restore. But if you're in the boat of "ahhh, if it goes I'll just rebuild it" (i.e. a homelab staging arena), then yeah, sure, wave bye-bye to backups too ;) Ansible scripts come in very handy for spinning up new servers with pre-defined configs, a worthy investment of time to learn if you haven't already touched them.
 
Last edited:
  • Like
Reactions: onose

gea

Well-Known Member
Dec 31, 2010
3,476
1,363
113
DE
Going back on point...
I personally haven't had any NVMe data corruption yet; the backup system does patrols and has never found any whatsoever, but as they say, YMMV.
How do you know?
Only realtime end-to-end data checksums can guarantee that the data on storage or backup is valid.
Without them it is no more than a hope that whenever something happens short of a complete NVMe failure (and it does happen, with a certain probability), it only affects unimportant bytes.
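Concretely: without filesystem checksums the best you can do is check what the drive itself reports, which says nothing about data mangled above the device; with ZFS a scrub verifies every block end to end. Device and pool names are examples, and nvme-cli is assumed to be installed:

# drive-level view only: media/integrity errors the NVMe controller itself noticed
nvme smart-log /dev/nvme0
# end-to-end view: re-read every block and compare it against its stored checksum
zpool scrub backup     # runs in the background
zpool status backup    # "scrub repaired 0B ... with 0 errors" in the scan line means everything verified clean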
 

pimposh

hardware pimp
Nov 19, 2022
390
225
43
Some alternative. Some interesting links in here.
ZFS is nice. ZFS is good. Like an F350 or a Hilux. But a Hilux comes with its own tradeoffs and doesn't fit most housewives well.
 
Last edited:
  • Haha
Reactions: TRACKER