I have fallen in love with MooseFS

tjk

Active Member
Mar 3, 2013
www.servercentral.com
I've tested the Quobyte distributed FS too, which is based on XtreemFS, distributed metadata servers, tiering, etc.

A bit buggy on the setup, but awesome SDS for sure, but not cheap.
 

i386

Well-Known Member
Mar 18, 2016
Germany
Anything that does the traditional RAID 0, 1, 5, 6 (or any of the less common ones), in a way that is more or less the same.
This would mean that ZFS is also "traditional RAID", since the math behind RAID 5/raidz1 (https://blogs.oracle.com/solaris/post/understanding-raid-5-recovery-with-elementary-school-math) and RAID 6/raidz2 (https://blogs.oracle.com/solaris/post/understanding-raid-6-with-junior-high-math) is the same. (Both blog posts are from a ZFS developer at Oracle.)
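
To make the "elementary school math" from the first blog post concrete, here's a toy sketch in Python (made-up block contents): RAID 5 parity is just XOR across the stripe, so any single lost block is the XOR of everything that's left.

Code:
from functools import reduce

# Three data blocks in one stripe (toy 3-byte "blocks")
d0 = bytes([0x10, 0x20, 0x30])
d1 = bytes([0x0a, 0x0b, 0x0c])
d2 = bytes([0xff, 0x00, 0x55])

def xor_blocks(*blocks):
    # Bytewise XOR across all blocks in the stripe
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

parity = xor_blocks(d0, d1, d2)      # what gets written to the parity disk

# The disk holding d1 dies: XOR of the survivors plus parity recovers it
rebuilt = xor_blocks(d0, d2, parity)
assert rebuilt == d1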
 

Mithril

Active Member
Sep 13, 2019
This would mean that ZFS is also "traditional RAID", since the math behind RAID 5/raidz1 (https://blogs.oracle.com/solaris/post/understanding-raid-5-recovery-with-elementary-school-math) and RAID 6/raidz2 (https://blogs.oracle.com/solaris/post/understanding-raid-6-with-junior-high-math) is the same. (Both blog posts are from a ZFS developer at Oracle.)
"In a way that's more or less the same": using the same parity math isn't where an issue would be (unless that parity math is flawed). And I will note that those blog posts don't go over the *actual* math, but I think that's sort of a tangent here anyway.

RaidZ doesn't have the same write-hole that most hardware/software raids do.

RaidZ isn't logically separated from the file system. You *can* create a virtual block device (a zvol) to use any file system you want, but that's an abstraction and ZFS is still "under" it.

RaidZ rebuilds (resilvers) only care about actual data, not free space (since it's linked to the file system, that awareness exists).

RaidZ doesn't lay out blocks on the disk in the same way (the exact differences are something I'd need to refresh my memory on).

But, IMHO, the biggest difference is that RaidZ/ZFS has a lower level of "trust" in the hardware. This checking is at the ZFS layer, so even single-disk setups have some protection. RAID with parity is *supposed* to have that, but many solutions silently *don't*.

Granted, these differences come with large tradeoffs in performance, hardware considerations, limitation to a single "native" file system, etc.
There *are* RAID solutions out there that still work the way people believe they do, but you need the right disks and the right hardware/software solution, or you could very easily have little or no additional protection vs. a single disk. And in all cases, RAID is not a backup (nor is RaidZ). It's at least a *little* easier to test RAID than it is to test ECC memory: write directly to a member disk (from another computer if you need to) and test what happens when you read. A rough sketch of that test is below.
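
Here's that test as a destructive Python sketch. The device path, offset, and file name are hypothetical; only run something like this on a scratch array you are willing to destroy.

Code:
import hashlib, os

MEMBER = "/dev/sdX"          # hypothetical: one member disk of a TEST array
OFFSET = 64 * 1024 * 1024    # hypothetical: somewhere in the array's data region

def file_hash(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

before = file_hash("/mnt/test-array/testfile")

# Corrupt 4 KiB directly on the member disk, bypassing the RAID layer
with open(MEMBER, "r+b", buffering=0) as disk:
    disk.seek(OFFSET)
    disk.write(os.urandom(4096))

os.system("sync; echo 3 > /proc/sys/vm/drop_caches")  # force real re-reads

after = file_hash("/mnt/test-array/testfile")
# If the hashes match, either the RAID caught it or you hit parity/unused
# space; repeat at a few offsets before drawing any conclusion.
print("detected/repaired" if after == before else "silent corruption passed through!")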

Relating to this thread, MooseFS is actually acting a bit like ZFS here (broadly speaking): it maintains checksums on files and claims to do re-reads from other copies if needed. Assuming MooseFS does what it says well, it would (IMHO) make RAID 5/6 more useful, since it would solve at least some data integrity issues. What I don't know is if MooseFS can "rebuild" a file (say, from two partially correct sources) or if it's just a checksum/hash.
 

UhClem

Active Member
Jun 26, 2012
NH, USA
... Since the error correction needs to be fast, it can't be too sophisticated so it's entirely possible to reconstruct an incorrect value on a read retry and most drives will pass along the first "correct" value for a sector. ...
Do you really believe this?
[You read it on the Web ? ...(so it must be true) :)]
 

Mithril

Active Member
Sep 13, 2019
Do you really believe this?
[You read it on the Web ? ...(so it must be true) :)]
It's possible spinning rust no longer uses Hamming code, sure. But it's likely they do, as it's proven, fast, and reliable within its limits. Which is 1-bit correction, 2-bit detection, and it *can* fail (pass bad data) at 3 or more bitflips.

Seems safe, sure. Except in modern HDDs corrected ECC errors are just a way of life, a necessity of the density and cost demands. I've got a HDD under test right now passing (so far) with flying colors: no reallocated sectors, but a handful of delayed ECC and *millions* of silent ECC corrections, fairly typical for a drive with ~30k hours. Most modern SMART utils don't even show corrected/silent ECC, only failures and retries. AFAIK many SSDs are doing internal ECC, and sometimes parity, that is never even reported to the host. So it's rather obvious to me that single-bit errors are a constant (so to speak), and double-bit errors happen sometimes. I'm not willing to bet 3+ bit errors are impossible, especially when they would potentially be the biggest problem, such as when another drive in an array is failing.

Is this a "sky is falling" type thing? Obviously not. Can it happen? Absolutely.

I very, very, very much doubt any consumer (and likely most enterprise) drives do anything more than "check ECC, attempt re-read, report error on uncorrected double failure". The gap there is a 3+ bit failure that results in passing ECC, which can happen.
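
To make that gap concrete, here's a toy Hamming(7,4) decoder in Python. A single flipped bit is corrected as designed, but three flipped bits can slip through with a clean syndrome and decode to silently wrong data. (Toy code for the classic textbook code only; whatever a given drive's firmware actually does is its own story.)

Code:
def encode(d1, d2, d3, d4):
    # Hamming(7,4): parity bits sit at positions 1, 2, 4
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p4 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p4, d2, d3, d4]   # codeword positions 1..7

def decode(codeword):
    c = list(codeword)
    syndrome = 0
    for pos, bit in enumerate(c, start=1):
        if bit:
            syndrome ^= pos          # syndrome = XOR of set-bit positions
    if syndrome:                     # nonzero -> "correct" that position
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]  # the four data bits

data = [1, 0, 1, 1]
cw = encode(*data)

flip1 = list(cw); flip1[4] ^= 1      # one bad bit
assert decode(flip1) == data         # corrected, as designed

flip3 = list(cw)
for i in (0, 1, 2):                  # positions 1^2^3 == 0: syndrome stays clean
    flip3[i] ^= 1
print(decode(flip3), "!=", data)     # wrong data, and no error was flagged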

Of course, it all comes down to how important any given bit of data is. I've run software RAID 0 before because it wasn't the primary storage, and affordable SSDs were not even something people imagined at the time, so the performance was a win.

Food for thought: is the DRAM cache on your HDD or SSD ECC? I doubt it...
 

nabsltd

Active Member
Jan 26, 2022
Long before I'd ever heard of ZFS, I started keeping checksums of files. Whenever I copy that data to another disk, I verify the checksum matches. If it does not, I test the checksum against the original file.

In 15 years, I have never had a file where the "source" failed the checksum and the disk did not report an uncorrectable error. I've caught silent copy errors (where there were no problems with either disk, but somewhere in between some bits flipped), and had disks die on me, but not once has a disk that has said "yep, everything's OK" lied to me.

Note, too, that unless you use ZFS for everything, and only use ZFS replication for copying, ZFS does not protect from the sort of errors that an external checksum does. Sure, odds are that no bits get flipped in RAM, or on the network, or over a bus, but you can never be 100% sure without testing.
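
That workflow fits in a few lines of Python; the paths here are hypothetical, this just sketches the logic:

Code:
import hashlib, shutil

def sha256_of(path, bufsize=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

stored = sha256_of("/data/photos/img001.raw")    # checksum kept from archive time

shutil.copy2("/data/photos/img001.raw", "/backup/img001.raw")

if sha256_of("/backup/img001.raw") != stored:
    # Copy is bad: was it the source, or something in between?
    if sha256_of("/data/photos/img001.raw") == stored:
        print("source intact: bits flipped in RAM/bus/network during the copy")
    else:
        print("source itself no longer matches its stored checksum")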
 

UhClem

Active Member
Jun 26, 2012
NH, USA
It's possible spinning rust no longer uses Hamming code, sure. But it's likely they do, as it's proven, fast, and reliable within its limits. Which is 1-bit correction, 2-bit detection, and it *can* fail (pass bad data) at 3 or more bitflips. ...
"possible" ??? ... "no longer" ??? (Did it ever?)

Why would you make these assumptions?

Have you, maybe, conjoined the ECC implementation used for (current/modern) RAM [an 8-bit datum] with (any of) the ECC implementations used for (current/modern) HDDs [a 512/4096-byte datum]?
 

Mithril

Active Member
Sep 13, 2019
"possible" ??? ... "no longer" ??? (Did it ever?)

Why would you make these assumptions?

Have you, maybe, conjoined the ECC implementation used for (current/modern) RAM [an 8-bit datum] with (any of) the ECC implementations used for (current/modern) HDDs [a 512/4096-byte datum]?
When you don't remember the name, google it, and fail to notice they are actually talking about SSDs, lol. Yeah, totally the wrong name/method. Looks like it's a mix of Reed-Solomon, which is much better (in general), and low-density parity-check (LDPC) codes. Both are single-pass, so unless the drive is comparing multiple (raw) reads it's still possible to have an uncorrected error; how often depends on drive parameters. I'm actually less worried about it now with the refresher on the ECC used, so thanks!
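
The difference is easy to see with the third-party reedsolo package (pip install reedsolo; this assumes a recent version, whose decode() returns a 3-tuple): a handful of parity bytes corrects several whole corrupted bytes, not just one bit.

Code:
from reedsolo import RSCodec

rsc = RSCodec(10)                  # 10 parity bytes -> corrects up to 5 bad bytes
msg = b"spinning rust sector"
codeword = rsc.encode(msg)

bad = bytearray(codeword)
for i in (0, 7, 12):               # corrupt three whole bytes
    bad[i] ^= 0xFF

decoded, _, _ = rsc.decode(bytes(bad))
assert bytes(decoded) == msg       # all three fixed, far beyond Hamming's 1 bit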

I tried to refresh my memory and actually ended up the worse for it, doh!

I will say, tho, that it's still good to have *some* extra checksum, either in the filesystem (such as ZFS) or external (as nabsltd mentioned), as that reasonably protects both from silent bad reads and from silent failed/corrupted writes (such as non-sync writes, or consumer drives saying "yes, I wrote that" before they actually finish).

Honestly, when you look at the raw ECC rates of modern spinning rust, it's quite impressive. The whole "we're going to constantly be getting errors, so we just have to live with it and correct them" philosophy. Sort of like how with NAND storage, quantum tunneling is just... a consideration that has to be made as a fact of life.

Edit: No sarcasm here, I genuinely confused myself trying to refresh my memory so I wouldn't be confused. I still think a hash/checksum besides what the drive does is a good idea.
 

zunder1990

Active Member
Nov 15, 2012
Relating to this thread, MooseFS is actually acting a bit like ZFS here (broadly speaking): it maintains checksums on files and claims to do re-reads from other copies if needed. Assuming MooseFS does what it says well, it would (IMHO) make RAID 5/6 more useful, since it would solve at least some data integrity issues. What I don't know is if MooseFS can "rebuild" a file (say, from two partially correct sources) or if it's just a checksum/hash.
Here is how it would work in my setup using the open source version. For this example the file is 60MB, so it will fit in a single chunk (64MB), and I have the min goal set to 2. There are two full copies of the chunk sitting on different hard drives in different servers. If during a read of the file or a normal checksum scan there is an error, MooseFS will flag the error, read the file from the remaining good chunk, and serve that data to the client. It will also start copying that good chunk to another server to bring the system back up to two good working copies of the chunk. For my really important folder (less than 1TB) of stuff like desktop backups and photos, I have the min goal set to 3, so there are 3 good copies of those files in the MooseFS system.
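
For reference, setting and checking those goals is done with the MooseFS client tools; here's a quick sketch via Python's subprocess (hypothetical mount point, and assuming the standard mfssetgoal/mfsgetgoal tools are on PATH):

Code:
import subprocess

important = "/mnt/mfs/important"   # hypothetical MooseFS mount + folder

# Recursively request 3 copies of every chunk under the folder
subprocess.run(["mfssetgoal", "-r", "3", important], check=True)

# Confirm the goal now set on the path
subprocess.run(["mfsgetgoal", important], check=True)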