Bit rot


twin_savage

Member
Jan 26, 2018
55
30
18
33
???
Any serious reference for that opinion?
Read the summary and para above in this link:

Copy on Write filesystems suffer more fragmentation than older filesystems, but the advanced ZFS RAM-based read/write cache reduces small read I/O to a fraction. It is not unusual to see an 80% cache hit rate with enough RAM. The variable blocksize is a nice feature when you write data blocks < recsize. If the data to be written is >= recsize, all pieces are written in recsize (up to 1M).
My gripe with ZFS in this instance isn't with the rate at which the file system tends to fragment files; it's that there is no in-place recourse to defragment. I'd prefer to get stressful defragmentation done before a disk failure occurs rather than effectively dealing with disk fragmentation afterwards, when the data is most at risk.
 
  • Like
Reactions: jdnz

Whaaat

Active Member
Jan 31, 2020
305
159
43
Even spinning rust for example already uses ECC. But it simply isn't enough, both because it often still results in data loss (what do you do with double bit flips? you lose your block, that's what).
Hold your horses, a double bit flip is a mere trifle for an HDD's ECC:
The size of the ECC data is based upon two factors -- the size of the data to be protected and the basic ECC algorithm being used. Over the years, ECC algorithms have improved. In today’s ECC technology, an HDD is typically able to correct errors of up to 50 bits out of 4,096 bits per sector (512 bytes times 8 bits-per-byte). As the ECC algorithms have improved, their size has also increased.
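To make the single-flip vs. double-flip distinction concrete, here is a toy SECDED (single-error-correct, double-error-detect) sketch in Python using an extended Hamming(8,4) code. Real HDDs use far stronger Reed-Solomon/LDPC codes that correct dozens of bits per 4K sector, so treat this purely as an illustration of the failure modes, not of what a drive actually runs:

```python
# Toy SECDED demo with extended Hamming(8,4): one flipped bit is correctable,
# two flips are only detectable. Real drive ECC is far stronger, but the
# failure-mode boundary (correct vs. merely detect) is the same in spirit.

def encode(d):                       # d = list of 4 data bits
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4                # covers codeword positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4                # covers positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4                # covers positions 4,5,6,7
    code = [p1, p2, d1, p3, d2, d3, d4]
    p0 = 0
    for b in code:
        p0 ^= b                      # overall parity over the 7-bit codeword
    return [p0] + code               # layout: p0, then positions 1..7

def decode(c):
    p0, code = c[0], c[1:]
    syndrome = 0
    for pos in (1, 2, 4):            # recompute each parity check
        s = 0
        for i in range(1, 8):
            if i & pos:
                s ^= code[i - 1]
        if s:
            syndrome |= pos
    overall = p0
    for b in code:
        overall ^= b                 # 0 if total parity is still consistent
    if syndrome and overall:         # single error: locate and fix it
        code[syndrome - 1] ^= 1
        return "corrected", [code[2], code[4], code[5], code[6]]
    if syndrome and not overall:     # two errors: detected, not correctable
        return "uncorrectable", None
    return "ok", [code[2], code[4], code[5], code[6]]

word = encode([1, 0, 1, 1])
one = word[:];  one[3] ^= 1                  # flip one codeword bit
two = word[:];  two[3] ^= 1; two[6] ^= 1     # flip two codeword bits
print(decode(one))   # ('corrected', [1, 0, 1, 1])
print(decode(two))   # ('uncorrectable', None)
```

One flipped bit is silently corrected; two flips are flagged but cannot be corrected, which is exactly the point where a higher layer (RAID parity or filesystem checksums) has to step in.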
 

twin_savage

Member
Jan 26, 2018
55
30
18
33
It shouldn't be done at one level either way, and luckily it isn't. Even spinning rust for example already uses ECC. But it simply isn't enough, both because it often still results in data loss (what do you do with double bit flips? you lose your block, that's what).
It is true that the HDD itself applies ECC per 4K block, which is all to the good. When I mentioned the block level earlier, I was referring to parity across entire blocks within a RAID set, as opposed to the ECC that the HDD's controller stores within a block.

My earlier comments about block level vs. file system level were about the mutually exclusive choice of doing parity at the block level or at the file system level; obviously, the more places ECC can be applied, the better (within reason).
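As a minimal sketch of what "parity between entire blocks within a RAID set" means, here is a toy RAID-5-style XOR over blocks (not any vendor's actual implementation):

```python
# Toy RAID-5-style block parity: the parity block is the XOR of the data
# blocks in a stripe. It can rebuild one *missing* block, but on a silently
# corrupted block it only says "the stripe is inconsistent", not which
# member is wrong, unless checksums are layered on top.

def xor_blocks(*blocks: bytes) -> bytes:
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

d0 = b"AAAAAAAA"
d1 = b"BBBBBBBB"
d2 = b"CCCCCCCC"
parity = xor_blocks(d0, d1, d2)

# Rebuild d1 after a known disk failure: XOR the survivors with the parity.
rebuilt = xor_blocks(d0, d2, parity)
assert rebuilt == d1

# Silent corruption: the stripe no longer XORs to zero, but parity alone
# can't say whether d0, d1, d2 or the parity block itself went bad.
d1_rotten = b"BBBBBBBA"
print(xor_blocks(d0, d1_rotten, d2, parity) != bytes(8))   # True
```

The XOR parity can rebuild a block that is known to be missing, but on silent corruption it only flags the stripe; that is the gap per-block checksums are meant to close.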



This is where ZFS shines; it doesn't care how fancy the drive pretends to be, it assumes it is crap, and assumes that the more the interconnect controller tries to do, the less sure it can be about the data.
In making that assumption, ZFS is assuming its data path (which is very often long and complex) is superior to a more tightly knit, closed-loop control of data integrity such as a hardware RAID controller's. This may sometimes be true, but I've had computers running next to RF magnetrons, and we basically found that the more conductor length was exposed to the radiation, the more errors we had, both before and after chassis grounding.



To be clear, I'm not saying don't use ZFS; I'm just saying its pros and cons (yes, despite the popular consensus on the internet, there are cons beyond being a resource hog) should be considered.
 
  • Like
Reactions: ecosse

gea

Well-Known Member
Dec 31, 2010
3,157
1,195
113
DE
Read the summary and para above in this link:


My gripe with ZFS in this instance isn't with the rate at which the file system tends to fragment files; it's that there is no in-place recourse to defragment. I'd prefer to get stressful defragmentation done before a disk failure occurs rather than effectively dealing with disk fragmentation afterwards, when the data is most at risk.
This is the old story that ZFS is bad because there are no recovery tools like chkdsk or fsck.
This ignores that on a RAID-5, data, metadata and RAID stripes are written sequentially, disk by disk. A crash during a write carries a serious risk of a corrupt filesystem or RAID. Without checksums on every data block there is no way to prove consistency or repair data errors. A read error on a degraded RAID-5 carries a high risk of a lost array.

Sun designed ZFS to remain stable and keep data valid under nearly every thinkable problem besides bad hardware or bugs. A crash during a write is uncritical thanks to Copy on Write and checksum-protected data on every disk. Only a badly damaged pool is beyond the easy repair options; a RAID-5 with less damage would already be lost.

The last time (15 years ago) my mailserver, with one of the best RAID-5 setups of that time, crashed, I tried a chkdsk. It ran for 3 days with a result like scrambled eggs. I then switched to Solaris ZFS and have had no data loss since.
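A rough userspace sketch of that self-healing idea (not ZFS's actual on-disk format): store a checksum per block, and when one mirror copy no longer verifies, serve and rewrite it from a copy that does:

```python
# Sketch of ZFS-style self-healing on a mirror: keep a checksum alongside the
# block pointer, and when one copy fails verification, read from and repair
# with the copy that still matches. This is only an illustration of the idea.
import hashlib

BLOCK = b"important data" * 100
CHECKSUM = hashlib.sha256(BLOCK).hexdigest()   # stored with the block pointer

copy_a = bytearray(BLOCK)
copy_b = bytearray(BLOCK)
copy_a[42] ^= 0x01                             # simulated bit rot on one side

def read_self_healing(copies, checksum):
    for i, copy in enumerate(copies):
        if hashlib.sha256(copy).hexdigest() == checksum:
            # repair any sibling copies that no longer verify
            for j, other in enumerate(copies):
                if j != i and hashlib.sha256(other).hexdigest() != checksum:
                    copies[j][:] = copy
            return bytes(copy)
    raise IOError("all copies fail checksum: unrecoverable block")

data = read_self_healing([copy_a, copy_b], CHECKSUM)
assert data == BLOCK and copy_a == copy_b      # the bad side was rewritten
```

Without the stored checksum, the two copies would simply disagree and there would be no way to tell which one is correct.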
 
  • Like
Reactions: TRACKER

twin_savage

Member
Jan 26, 2018
55
30
18
33
This is the old story that ZFS is bad because there are no recovery tools like chkdsk or fsck.
This ignores that on a RAID-5, data, metadata and RAID stripes are written sequentially, disk by disk. A crash during a write carries a serious risk of a corrupt filesystem or RAID. Without checksums on every data block there is no way to prove consistency or repair data errors. A read error on a degraded RAID-5 carries a high risk of a lost array.

Sun designed ZFS to remain stable and keep data valid under nearly every thinkable problem besides bad hardware or bugs. A crash during a write is uncritical thanks to Copy on Write and checksum-protected data on every disk. Only a badly damaged pool is beyond the easy repair options; a RAID-5 with less damage would already be lost.

The last time (15 years ago) my mailserver, with one of the best RAID-5 setups of that time, crashed, I tried a chkdsk. It ran for 3 days with a result like scrambled eggs. I then switched to Solaris ZFS and have had no data loss since.
The blog thoroughly goes through the write hole phenomenon on hardware/software RAID, how BBUs were adopted for the former and how they only protect against power loss but not an OS or firmware crash, and furthermore what can be expected after such an event.

The guy seems very knowledgeable, as I'd imagine he'd have to be, since his job is ZFS & RAID data recovery.

He goes on to explain that for ZFS "once you add transactions, copy-on-write, and checksums on top of RAIDZ, the write hole goes away."
but elaborates that ZFS isn't immune to power loss problems:
"Of course, ZFS fans will say that you never lose a ZFS pool to a simple power failure, but empirical evidence to the contrary is abundant."


The blog is a wealth of knowledge and is one of the few agnostic places to get information about storage and file systems that isn't colored by the fanboyism that has permeated seemingly all the other forums and opinion pieces.
 

gea

Well-Known Member
Dec 31, 2010
3,157
1,195
113
DE
In the end every IT process is sequential, step by step. Even in the Copy on Write process that ensures a write is either completely done or discarded, there is a point in time when the last bit has been written successfully and all the pointers must switch to make the new block valid. A crash at exactly that moment can corrupt even a Copy on Write filesystem.

This comes down to the statistical probability of a failure. While with RAID-5 a crash at any point of a write operation has a high chance of data corruption (which can be reduced a little with a BBU), the probability of corruption is much, much lower with ZFS. There is no 100% technology, only a 95% or 99.9999% survival rate of a crash at best. What ZFS does is limit the risk to a state-of-the-art level.
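The file-level analogue of that "write everything new, then switch the pointer" step looks roughly like this (a sketch only; ZFS does the real thing with checksummed uberblock updates, and full durability would also require syncing the containing directory):

```python
# Userspace analogue of the copy-on-write "write new data out of place, then
# atomically switch the pointer" step: a crash leaves either the old or the
# new version visible, never a half-written one.
import os, tempfile

def cow_update(path: str, new_data: bytes) -> None:
    dir_ = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dir_)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(new_data)          # new version written out of place
            f.flush()
            os.fsync(f.fileno())       # make the new data durable first
        os.replace(tmp, path)          # atomic "pointer switch" via rename
    except BaseException:
        os.unlink(tmp)                 # discard the incomplete new version
        raise

cow_update("state.db", b"version 2")   # a crash before os.replace() leaves the old file intact
```

The narrow window gea describes is the os.replace() step itself; the rename is atomic, but the surrounding bookkeeping is why even copy-on-write systems still quote a survival probability rather than a guarantee.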
 
  • Like
Reactions: Fritz

tinfoil3d

QSFP28
May 11, 2020
876
403
63
Japan
Can anyone confirm that "enterprise SSDs" (of the past) would start to drop bits if unpowered for a long time?
I'm specifically referring to the S3710. I have one in a laptop that I rarely power up (sometimes once in half a year?), and I have hashsums of most of the files on it; just for fun I re-run these to compare, and so far nothing has been incorrect. The files are huge blobs of data, hundreds of gigabytes. I'd expect something to become incorrect at some point, but so far no issues. It's been used this way for over 3 years now and is de-energized most of the time.
Share your experiences?
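For what it's worth, that manual check can be scripted; a minimal sketch along these lines (paths and the manifest filename are just placeholders):

```python
# Build a SHA-256 manifest once, then re-run later to see whether any file's
# content changed while the drive sat unpowered. A mismatch means either a
# legitimate edit or candidate bit rot.
import hashlib, json, os, sys

def sha256_file(path, bufsize=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root):
    return {
        os.path.relpath(os.path.join(dirpath, name), root):
            sha256_file(os.path.join(dirpath, name))
        for dirpath, _, names in os.walk(root)
        for name in names
    }

if __name__ == "__main__":
    root, manifest_path = sys.argv[1], sys.argv[2]   # e.g. /mnt/ssd hashes.json
    if not os.path.exists(manifest_path):
        with open(manifest_path, "w") as f:
            json.dump(build_manifest(root), f, indent=1)
    else:
        with open(manifest_path) as f:
            old = json.load(f)
        for relpath, digest in old.items():
            if sha256_file(os.path.join(root, relpath)) != digest:
                print("MISMATCH:", relpath)
```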
 

gea

Well-Known Member
Dec 31, 2010
3,157
1,195
113
DE
Enterprise SSDs like the 3710 are very reliable and durable, but do not expect a bit rot or failure rate over time of 0.0000%. If data is important you need a verify method (manually created checksums, or filesystem-based checksums as with btrfs, ReFS or ZFS) and redundancy/backup to repair when needed.

The question is not if, but when and how often and how you are prepared to handle it.
 

tinfoil3d

QSFP28
May 11, 2020
876
403
63
Japan
@gea Yep, I just shared my experience because I read somewhere many years ago that MLC and the "older" enterprise SSDs start to rot MUCH faster than consumer SSDs, and I've been trying to reproduce/confirm that claim but so far haven't been able to.
I've had the pleasure of opening up the S3710 and a consumer MX500, and the difference in construction really blew my mind: the SMD components and chips on the Intel are huge. Its PCB takes up the entire 2.5in space, while the MX500's PCB is more like the size of the PCB inside a USB drive. Because the components are so large, those Intel SSDs are actually heavy if you lift just the PCB out of the enclosure. Legendary stuff, in some sense.
Otherwise I care about backups for important stuff, sure. Multiple copies, mostly full, offline and off-site as well.
 

oneplane

Well-Known Member
Jul 23, 2021
845
484
63
Keep in mind that hardware RAID doesn't really exist. The ASIC just runs code, which is software. It's specialised and accelerated, but in the end it's just software running on a chip, handling your data.

Why does that technicality matter? Because if we were to make a real comparison, we'd have to end up comparing algorithms implemented in software, sometimes accelerated with hardware assistance, and while we can do that for ZFS and UFS and EXT4 and LVM and BTRFS, we can't do it for ReFS or any of the other ones. That said, not many people will be able to read and mentally parse the ZFS codebase in its entirety, so it's all just educated guesses... :p (but guesses about open and inspectable systems tend to be easier to check; OTOH: from the recovery and repair person perspective, they are looking from the outside in, which doesn't tell you anything about the design process so you're still just guessing)
 
Last edited:

Whaaat

Active Member
Jan 31, 2020
305
159
43
Keep in mind that hardware RAID doesn't really exist. The ASIC just runs code, which is software.
The only true hardware solution is an analog device; any digital device will use software layers on top of the physical one. Hardware RAID is a solution independent of the OS: if the OS hangs, the hardware implementation continues its routine operation.
You can try to avoid hardware RAID at all costs, but with even a single SSD you become the user of an even more advanced hardware RAID-like solution inside it. It will not ask you to replace faulty chips; it will recover and move data to healthy blocks in the background.
 

oneplane

Well-Known Member
Jul 23, 2021
845
484
63
The only true hardware solution is an analog device; any digital device will use software layers on top of the physical one. Hardware RAID is a solution independent of the OS: if the OS hangs, the hardware implementation continues its routine operation.
You can try to avoid hardware RAID at all costs, but with even a single SSD you become the user of an even more advanced hardware RAID-like solution inside it. It will not ask you to replace faulty chips; it will recover and move data to healthy blocks in the background.
Yep, and it gets even more interesting when you have multi-controller SSDs, especially old large-capacity models that were essentially RAID00 with ECC and a bunch of hope and crossed fingers.

I think the true thing to avoid is a device that is a black box but pretends to be something else, or better yet, a device that effectively MITMs your storage. That's a benefit of SSDs, generally the OS doesn't care about the NAND layer, it stops at the block layer. But as soon as something else tries to be your block layer (while it really isn't), that's where the problems start to multiply.
 

twin_savage

Member
Jan 26, 2018
55
30
18
33
Can anyone confirm that "enterprise SSDs" (of the past) would start to drop bits if unpowered for a long time?
I'm specifically referring to the S3710. I have one in a laptop that I rarely power up (sometimes once in half a year?), and I have hashsums of most of the files on it; just for fun I re-run these to compare, and so far nothing has been incorrect. The files are huge blobs of data, hundreds of gigabytes. I'd expect something to become incorrect at some point, but so far no issues. It's been used this way for over 3 years now and is de-energized most of the time.
Share your experiences?
The symptom of SSD NAND cell charge decay is typically severely reduced read speed for the affected cells long before any data loss occurs, assuming we're talking about MLC/TLC/QLC/PLC (basically any SSD whose controller samples NAND voltages to discriminate charge levels; some very early SLC controllers didn't do this).
Contrary to common belief, most SSDs don't refresh their weak NAND cells simply by being powered (the exception being some enterprise SSDs). Once a NAND cell is written, it stays untouched unless GC or TRIM rewrites/consolidates it; there's a big long thread over on the L1 forums going through the problems with a Seagate FireCuda 520 and stale data.
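A crude way to look for that "stale data reads slowly" symptom from userspace is to time sequential reads in fixed-size chunks and flag regions far slower than the median; heavy ECC retries on decayed cells tend to show up as localized slowdowns. This is only a heuristic sketch (chunk size and threshold are arbitrary, and the page cache should be cold):

```python
# Time sequential reads of a large, long-untouched file in fixed chunks and
# report chunks that are much slower than the median throughput.
import sys, time

CHUNK = 16 * 1024 * 1024                       # 16 MiB per sample

def scan(path):
    rates = []
    with open(path, "rb", buffering=0) as f:   # unbuffered read of the file
        while True:
            t0 = time.perf_counter()
            data = f.read(CHUNK)
            if not data:
                break
            rates.append(len(data) / (time.perf_counter() - t0) / 1e6)  # MB/s
    median = sorted(rates)[len(rates) // 2]
    for i, r in enumerate(rates):
        if r < median / 4:                     # arbitrary "much slower" cutoff
            print(f"chunk {i} at offset {i * CHUNK}: {r:.0f} MB/s (median {median:.0f})")

scan(sys.argv[1])                              # e.g. python probe.py /path/to/big_old_file
```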



I think the true thing to avoid is a device that is a black box but pretends to be something else, or better yet, a device that effectively MITMs your storage.
But doesn't that describe every SSD on the market, because of the FTL?
 
Last edited:
  • Like
Reactions: tinfoil3d

oneplane

Well-Known Member
Jul 23, 2021
845
484
63
But doesn't that describe every SSD on the market, because of the FTL?
Well, not exactly. The FTL is essentially the block mapping on an HDD or the transport mechanism on a tape drive (to a degree). The devices themselves are just persistence devices which contain the required parts to translate block interaction into whatever the media needs to fulfil that, and the edge of the responsibilities lies at the interface (IDE, SATA, SAS, NVMe etc.).

What RAID controllers try to do is become the 'end' of the chain, but in reality they are not the end of the chain. They also aren't aware of the media, and need special knowledge (a.k.a. 'certified' drives or a 'compatibility matrix') to patch their logical emulation of what multiple drives are doing. Part of this is because legacy topologies and software didn't allow for multi-device endpoints (there was a master and a slave and that's about it for your port), and the earlier implementations either hooked into the BIOS (so you had some sort of slow but universal interface) or had a custom driver (F6 floppy, anyone?).
 

twin_savage

Member
Jan 26, 2018
55
30
18
33
Well, not exactly. The FTL is essentially the block mapping on an HDD or the transport mechanism on a tape drive (to a degree). The devices themselves are just persistence devices which contain the required parts to translate block interaction into whatever the media needs to fulfil that, and the edge of the responsibilities lies at the interface (IDE, SATA, SAS, NVMe etc.).
FTL as in the flash translation layer in SSDs; it is an ever-changing lookup table that maps logical blocks to NAND cells. It was the cause of complete SSD failures during power loss in many SSDs a while back, because it would get into some undetermined state after a power loss and all your data would be gone without it, since it decides which NAND cells belong to which logical blocks and is constantly remapping them due to cell-fill consolidation (to reduce write amplification) and wear-leveling algorithms. In this respect the FTL is a black box: no company will share the code that makes up the FTL because they think it gives them a competitive edge.

I suppose a weak analog to this in HDDs would be the sector reallocation mechanism, but this is not nearly as complex as the FTL and is mostly transparent in how it works.
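A toy model of that mapping job (real controllers add caching, wear leveling, power-loss journaling and much more, so this is only a sketch of the logical-to-physical remapping idea):

```python
# Toy FTL: writes always go to a fresh physical page and the
# logical->physical map is updated afterwards. Lose that map before it is
# made durable and the raw NAND is just a pile of unordered pages.

class ToyFTL:
    def __init__(self, num_pages):
        self.nand = [None] * num_pages        # physical pages
        self.free = list(range(num_pages))    # simple free list
        self.l2p = {}                         # logical block -> physical page

    def write(self, lba, data):
        page = self.free.pop(0)               # always write out of place
        self.nand[page] = data
        old = self.l2p.get(lba)
        self.l2p[lba] = page                  # update the logical->physical map
        if old is not None:
            self.free.append(old)             # old copy becomes garbage to collect

    def read(self, lba):
        return self.nand[self.l2p[lba]]

ftl = ToyFTL(num_pages=8)
ftl.write(0, b"v1")
ftl.write(0, b"v2")                           # overwrite: new page, remap, free the old page
print(ftl.read(0))                            # b'v2'
# If power dies while l2p lives only in controller DRAM, the NAND still holds
# both b'v1' and b'v2' somewhere, but nothing says which page belongs to LBA 0.
```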


What RAID controllers try to do is become the 'end' of the chain, but in reality they are not the end of the chain. They also aren't aware of the media, and need special knowledge (a.k.a. 'certified' drives or 'compatibility matrix') to patch their logical emulation of what multiple drives are doing. Part of this is because legacy topologies and software didn't allow for multi-device endpoints (there was a master and a slave and that's about it for your port), and the earlier implementations either hooked into the BIOS (So you had some sort of slow but universal interface) or had a custom driver (F6 Floppy anyone?).
True, the RAID controllers become the end of the chain as far as the OS is concerned, but they still have bidirectional communication with the disks beyond data transfers; the SCSI command set that SAS drives use will verbosely report what the drive is doing back up to the RAID controller. It doesn't, or rather shouldn't, take special firmware on HDDs to "play nice" with RAID controllers. If vendors are doing such things, that is an attempt at milking their customers **cough** Synology **cough**.

RAID controllers can even tell if an HDD is dual-actuator and will construct volumes based on this so that bad topologies can't be created.
 

oneplane

Well-Known Member
Jul 23, 2021
845
484
63
FTL as in the flash translation layer in SSDs; it is an ever-changing lookup table that maps logical blocks to NAND cells. It was the cause of complete SSD failures during power loss in many SSDs a while back, because it would get into some undetermined state after a power loss and all your data would be gone without it, since it decides which NAND cells belong to which logical blocks and is constantly remapping them due to cell-fill consolidation (to reduce write amplification) and wear-leveling algorithms. In this respect the FTL is a black box: no company will share the code that makes up the FTL because they think it gives them a competitive edge.

I suppose a weak analog to this in HDDs would be the sector reallocation mechanism, but this is not nearly as complex as the FTL and is mostly transparent in how it works.
The FTL is more than just a LUT or a map, but as I wrote before, the job of the controller on the SSD (including the FTL) is the same as on other storage media controllers: turn media-specific things into blocks that are used for block storage. The OS uses a block-centric protocol to read and write, and the media controller translates blocks into whatever is actually happening on-disk (be it spinning rust, NAND chips with one or more levels, or something else like tape or optical media). On-disk, there are no blocks; that's just a logical concept. For NAND flash cells an FTL would be used, while on NOR flash that same FTL wouldn't work due to the very different nature of NOR flash. The same would apply to a CHS map or SA+Block Map on a magnetic disk. But none of this matters to the OS; it just wants blocks, and all the parts of the system (IDE, SATA, SAS, NVMe on the interface, the drivers, the filesystems) assume that the drives and the OS can be in agreement on the state of the blocks. As soon as a controller starts MITMing stuff in the middle, it needs to emulate those characteristics perfectly (which would only work on a 1-to-1 mapping), or you're essentially not creating a reliable path between the blocks as the OS sees them and the blocks as the disk sees them.

In the context of Bit Rot: if you have to cross your fingers and hope your raid controller will deal with it, you're already in trouble. And that's why it is useful for the OS and future software improvements to be able to interact with and interrogate the drives directly. This also applies to security (looking at you, OPAL) and performance (fake sector sizes anyone?).
 
Last edited:

twin_savage

Member
Jan 26, 2018
55
30
18
33
The FTL is more than just a LUT or a map, but as I wrote before, the job of the controller on the SSD (including the FTL) is the same as on other storage media controllers: turn media-specific things into blocks that are used for block storage. The OS uses a block-centric protocol to read and write, and the media controller translates blocks into whatever is actually happening on-disk (be it spinning rust, NAND chips with one or more levels, or something else like tape or optical media). On-disk, there are no blocks; that's just a logical concept. For NAND flash cells an FTL would be used, while on NOR flash that same FTL wouldn't work due to the very different nature of NOR flash.
The FTL can be thought of as a dynamic LUT that changes based on inputs (many of which are unknown to consumers); this is a useful way to understand what the FTL is doing. It is also how the FTL is treated during raw NAND data recovery after a controller failure: the FTL is "frozen" into a static LUT and the raw NAND contents are reconstructed based on that LUT state.

The distinction I'm making between the FTL and other storage media's controllers is that the other controllers are more transparent and less "active", the exception being SEDs. The FTL is truly a black box, while other storage media's controllers are mostly, or at least more, transparent.


The same would apply to a CHS map or SA+Block Map on a magnetic disk. But none of this matters to the OS; it just wants blocks, and all the parts of the system (IDE, SATA, SAS, NVMe on the interface, the drivers, the filesystems) assume that the drives and the OS can be in agreement on the state of the blocks. As soon as a controller starts MITMing stuff in the middle, it needs to emulate those characteristics perfectly (which would only work on a 1-to-1 mapping), or you're essentially not creating a reliable path between the blocks as the OS sees them and the blocks as the disk sees them.
Just as CHS maps don't belong in kernel space, neither does responsibility for the integrity of the contents of the blocks.
We should have the hardware in place and organized so that it presents truthful data to the OS. If we can't trust the block data coming in, then why are we trusting the data in main memory? Why trust the CPU cache?

This is my opinion, obviously; some people will choose to use ZFS, but it goes against the system design principles I favor for most storage needs I encounter. I subscribe to keeping systems and code as modular as possible and in small, auditable chunks. I don't like systems that span many domains and have a large surface for bugs; just as we reduce our security attack surface, we should strive for system architectures that reduce bug surface.


Regarding controllers MITM'ing data the OS is trying to access: both ZFS and hardware RAID controllers are technically MITM'ing the data presented to applications, but the number of lines of code a hardware RAID controller runs to translate data for the application is significantly smaller than the number of lines of code ZFS uses to do the same.



In the context of Bit Rot: if you have to cross your fingers and hope your raid controller will deal with it, you're already in trouble. And that's why it is useful for the OS and future software improvements to be able to interact with and interrogate the drives directly. This also applies to security (looking at you, OPAL) and performance (fake sector sizes anyone?).
Hardware RAID controllers perform scrubs to detect and fix bit rot, just as ZFS has to for a given volume; if you have to cross your fingers for a hardware RAID scrub to succeed, you're going to need to sacrifice several goats to ensure a ZFS scrub succeeds, because ZFS's data is fragmented across the surface of the HDD while hardware RAID's isn't.
 

oneplane

Well-Known Member
Jul 23, 2021
845
484
63
The FTL can be thought of as a dynamic LUT that changes based on inputs (many of which are unknown to consumers); this is a useful way to understand what the FTL is doing. It is also how the FTL is treated during raw NAND data recovery after a controller failure: the FTL is "frozen" into a static LUT and the raw NAND contents are reconstructed based on that LUT state.

The distinction I'm making between the FTL and other storage media's controllers is that the other controllers are more transparent and less "active", the exception being SEDs. The FTL is truly a black box, while other storage media's controllers are mostly, or at least more, transparent.
An FTL isn't a dynamic LUT or an LBA-to-NAND map (well, it might be if we're talking about very old implementations, but even the most recent FTL firmware I've worked on wouldn't run in such a mode at all); it's much more. But none of that really matters; what matters is that you ask the device for blocks, and it has to sort out how to make that happen. So the media (NAND chips) is "not-blocks" and what we want is "blocks". This applies to other media as well. We don't care about the analog signals on the drive platters in an HDD, we care about blocks. We also don't care about how it does signal-to-bit translation, just the resulting blocks.

Just as CHS maps don't belong in kernel space, neither does responsibility for the integrity of the contents of the blocks.
We should have the hardware in place and organized so that it presents truthful data to the OS. If we can't trust the block data coming in, then why are we trusting the data in main memory? Why trust the CPU cache?
I don't think anyone here has suggested CHS maps in a kernel or something like that.

This is my opinion, obviously; some people will choose to use ZFS, but it goes against the system design principles I favor for most storage needs I encounter. I subscribe to keeping systems and code as modular as possible and in small, auditable chunks. I don't like systems that span many domains and have a large surface for bugs; just as we reduce our security attack surface, we should strive for system architectures that reduce bug surface.
Well, everyone can have plenty of opinions. We tend to use case-by-case requirements and historical data, and RAID controllers pretty much always lose. As for the track record of ZFS specifically: so far we've had one case of a bad pool, where we restored from a snapshot. That's once every 10 years, which is about a tenth of what we see from RAID controllers in the same pod of racks.

Regarding controllers MITM'ing data the OS is trying to access: both ZFS and hardware RAID controllers are technically MITM'ing the data presented to applications, but the number of lines of code a hardware RAID controller runs to translate data for the application is significantly smaller than the number of lines of code ZFS uses to do the same.
You're missing the point: the entire host side of things assumes it's talking to a disk, but it's not, and every element of the RAID controller stack is doing its best to keep pretending it's a disk regardless. This is also why practically all of them have to resort to out-of-band and non-standard utilities to circumvent normal operating system operation. And that is also why RAID5 and the like are so dangerous, because the OS will never know it failed (just like the controller doesn't know for things like write holes - but at least the OS could have asked the disks; the controller can't do that because it doesn't have access to the host). This whole thing is a nasty stack of hacks, and there is no incentive to rewrite the firmware because it's making plenty of money as-is. End result: more hacks added on a case-by-case basis, the codebase gets ported to whatever new controller comes out, and the cycle repeats.


Hardware RAID controllers perform scrubs to detect and fix bit rot, just as ZFS has to for a given volume; if you have to cross your fingers for a hardware RAID scrub to succeed, you're going to need to sacrifice several goats to ensure a ZFS scrub succeeds, because ZFS's data is fragmented across the surface of the HDD while hardware RAID's isn't.
This makes no sense, try again.
 

gea

Well-Known Member
Dec 31, 2010
3,157
1,195
113
DE
In the end it's all about statistics and the quality of error correction. Data density on modern disks is so high that every read is effectively a guess whether it delivers a 0 or a 1, or whether a 4K block is valid or not. Without disk-based ECC correction, high-capacity disks would not be doable.

The question is how likely a remaining read error is, one that is not detected or is wrongly repaired by disk ECC, when the number of ECC repairs is relatively high, say one per 10^n reads. On a 20TB drive even a very low remaining error rate becomes a real incident over time. Related to bit rot, this does not even include problems outside the disk, like cabling or connectors, that cannot be fixed with disk ECC.
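As a back-of-the-envelope illustration of that point, using commonly quoted spec-sheet unrecoverable error rates (10^-15 per bit read for HDDs, 10^-18 for enterprise SAS SSDs; vendors actually specify this per bits read, and real-world behaviour differs):

```python
# Probability of at least one unrecoverable read error when reading a full
# 20 TB drive, given a spec-sheet unrecoverable bit error rate (UBER).
import math

def p_at_least_one_ure(capacity_bytes, uber):
    bits = capacity_bytes * 8
    return 1 - math.exp(bits * math.log1p(-uber))   # 1 - (1 - uber)^bits, numerically stable

tb20 = 20 * 10**12
print(f"HDD (UBER 1e-15): {p_at_least_one_ure(tb20, 1e-15):.1%}")   # ~14.8%
print(f"SSD (UBER 1e-18): {p_at_least_one_ure(tb20, 1e-18):.3%}")   # ~0.016%
```

Even at the quoted rates, a full read of a 20 TB HDD is far from a "never happens" event, which is why a layer that can detect and repair the residue matters.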

This is where modern filesystems add a real improvement. They use data/metadata end-to-end checksums (OS-disk-OS) on data blocks of up to 1M to detect and repair any remaining error. This would not be a needed addition on top of disk ECC if the number of ZFS checksum errors were zero, which is definitely not the case. From time to time you see ZFS checksum errors, and you are happy that ZFS has detected and repaired them during a read or a pool scrub of all data.

Without that, you have undetected data corruption.
 

Whaaat

Active Member
Jan 31, 2020
305
159
43
The question is how likely a remaining read error is, one that is not detected or is wrongly repaired by disk ECC, when the number of ECC repairs is relatively high, say one per 10^n reads.
The answer: a BER of 10^-15 is typical for HDDs and 10^-18 for SAS SSDs.
(if you ask a drive: are my blocks flushed to disk? it lies 9 out of 10 times AFAICT)
In today's HDDs you can't even disable the write cache; it will continue to operate regardless of your choice ))

[Attachment: WDC.PNG]
 
  • Like
Reactions: oneplane and Fritz