I have fallen in love with MooseFS

tjk

Active Member
Mar 3, 2013
www.servercentral.com
I've tested the Quobyte distributed FS too, which is based on XtreemFS, distributed metadata servers, tiering, etc.

A bit buggy on the setup, but awesome SDS for sure, but not cheap.
 

i386

Well-Known Member
Mar 18, 2016
Germany
Anything that does the traditional RAID 0, 1, 5, 6 (or any of the less common ones), in a way that is more or less the same.
This would mean that ZFS is also "traditional RAID", since the math behind RAID 5/raidz1 (https://blogs.oracle.com/solaris/post/understanding-raid-5-recovery-with-elementary-school-math) and RAID 6/raidz2 (https://blogs.oracle.com/solaris/post/understanding-raid-6-with-junior-high-math) is the same. (Both blog posts are from a ZFS developer at Oracle.)
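
To make the "elementary school math" from the first blog post concrete, here's a toy sketch in Python (made-up block contents): RAID 5 parity is just XOR across the stripe, so any single lost block is the XOR of everything that's left.

Code:
from functools import reduce

# Three data blocks in one stripe (toy 3-byte "blocks")
d0 = bytes([0x10, 0x20, 0x30])
d1 = bytes([0x0a, 0x0b, 0x0c])
d2 = bytes([0xff, 0x00, 0x55])

def xor_blocks(*blocks):
    # Bytewise XOR across all blocks in the stripe
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

parity = xor_blocks(d0, d1, d2)      # what gets written to the parity disk

# The disk holding d1 dies: XOR of the survivors plus parity recovers it
rebuilt = xor_blocks(d0, d2, parity)
assert rebuilt == d1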
 

Mithril

Active Member
Sep 13, 2019
This would mean that ZFS is also "traditional RAID", since the math behind RAID 5/raidz1 (https://blogs.oracle.com/solaris/post/understanding-raid-5-recovery-with-elementary-school-math) and RAID 6/raidz2 (https://blogs.oracle.com/solaris/post/understanding-raid-6-with-junior-high-math) is the same. (Both blog posts are from a ZFS developer at Oracle.)
"In a way that's more or less the same": using the same parity math isn't where an issue would be (unless that parity math is flawed). And I will note that those blog posts don't go over the *actual* math, but I think that's sort of a tangent here anyway.

RaidZ doesn't have the same write-hole that most hardware/software raids do.

RaidZ isn't logically separated from the file system. You *can* create a virtual block device (a zvol) to use any file system you want, but that's an abstraction and ZFS is still "under" it.

RaidZ rebuilds (resilvers) only care about actual data, not free space (since it's linked to the file system, that awareness exists).

RaidZ doesn't lay out blocks on the disk in the same way (the exact differences are something I'd need to refresh my memory on).

But, IMHO, the biggest difference is that RaidZ/ZFS has a lower level of "trust" in the hardware. This checking is at the ZFS layer, so even single-disk setups have some protection. RAID with parity is *supposed* to have that, but many solutions silently *don't*.

Granted, these differences come with large tradeoffs in performance, hardware considerations, limitation to a single "native" file system, etc.
There *are* RAID solutions out there that still work the way people believe they do, but you need the right disks and the right hardware/software solution, or you could very easily have little or no additional protection vs. a single disk. And in all cases, RAID is not a backup (nor is RaidZ). It's at least a *little* easier to test RAID than it is to test ECC memory: write directly to a member disk (from another computer if you need to) and test what happens when you read. A rough sketch of that test is below.
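
Here's that test as a destructive Python sketch. The device path, offset, and file name are hypothetical; only run something like this on a scratch array you are willing to destroy.

Code:
import hashlib, os

MEMBER = "/dev/sdX"          # hypothetical: one member disk of a TEST array
OFFSET = 64 * 1024 * 1024    # hypothetical: somewhere in the array's data region

def file_hash(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

before = file_hash("/mnt/test-array/testfile")

# Corrupt 4 KiB directly on the member disk, bypassing the RAID layer
with open(MEMBER, "r+b", buffering=0) as disk:
    disk.seek(OFFSET)
    disk.write(os.urandom(4096))

os.system("sync; echo 3 > /proc/sys/vm/drop_caches")  # force real re-reads

after = file_hash("/mnt/test-array/testfile")
# If the hashes match, either the RAID caught it or you hit parity/unused
# space; repeat at a few offsets before drawing any conclusion.
print("detected/repaired" if after == before else "silent corruption passed through!")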

Relating to this thread, MooseFS is actually acting a bit like ZFS here (broadly speaking): it maintains checksums on files and claims to do re-reads from other copies if needed. Assuming MooseFS does what it says well, it would (IMHO) make RAID 5/6 more useful, since it would solve at least some data integrity issues. What I don't know is if MooseFS can "rebuild" a file (say, from two partially correct sources) or if it's just a checksum/hash.
 

UhClem

Active Member
Jun 26, 2012
NH, USA
... Since the error correction needs to be fast, it can't be too sophisticated so it's entirely possible to reconstruct an incorrect value on a read retry and most drives will pass along the first "correct" value for a sector. ...
Do you really believe this?
[You read it on the Web ? ...(so it must be true) :)]
 

Mithril

Active Member
Sep 13, 2019
Do you really believe this?
[You read it on the Web ? ...(so it must be true) :)]
It's possible spinning rust no longer uses Hamming code, sure. But it's likely they do, as it's proven, fast, and reliable within its limits. Which is 1-bit correction, 2-bit detection, and it *can* fail (pass bad data) at 3 or more bitflips.

Seems safe, sure. Except in modern HDDs corrected ECC errors are just a way of life, a necessity of the density and cost demands. I've got a HDD under test right now passing (so far) with flying colors: no reallocated sectors, but a handful of delayed ECC and *millions* of silent ECC corrections, fairly typical for a drive with ~30k hours. Most modern SMART utils don't even show corrected/silent ECC, only failures and retries. AFAIK many SSDs are doing internal ECC, and sometimes parity, that is never even reported to the host. So it's rather obvious to me that single-bit errors are a constant (so to speak), and double-bit errors happen sometimes. I'm not willing to bet 3+ bit errors are impossible, especially when they would potentially be the biggest problem, such as when another drive in an array is failing.

Is this a "sky is falling" type thing? Obviously not. Can it happen? Absolutely.

I very, very, very much doubt any consumer (and likely most enterprise) drives do anything more than "check ECC, attempt re-read, report error on uncorrected double failure". The gap there is a 3+ bit failure that results in passing ECC, which can happen.
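
To make that gap concrete, here's a toy Hamming(7,4) decoder in Python. A single flipped bit is corrected as designed, but three flipped bits can slip through with a clean syndrome and decode to silently wrong data. (Toy code for the classic textbook code only; whatever a given drive's firmware actually does is its own story.)

Code:
def encode(d1, d2, d3, d4):
    # Hamming(7,4): parity bits sit at positions 1, 2, 4
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p4 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p4, d2, d3, d4]   # codeword positions 1..7

def decode(codeword):
    c = list(codeword)
    syndrome = 0
    for pos, bit in enumerate(c, start=1):
        if bit:
            syndrome ^= pos          # syndrome = XOR of set-bit positions
    if syndrome:                     # nonzero -> "correct" that position
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]  # the four data bits

data = [1, 0, 1, 1]
cw = encode(*data)

flip1 = list(cw); flip1[4] ^= 1      # one bad bit
assert decode(flip1) == data         # corrected, as designed

flip3 = list(cw)
for i in (0, 1, 2):                  # positions 1^2^3 == 0: syndrome stays clean
    flip3[i] ^= 1
print(decode(flip3), "!=", data)     # wrong data, and no error was flagged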

Of course, it all comes down to how important any given bit of data is. I've run software RAID 0 before because it wasn't the primary storage, and affordable SSDs were not even something people imagined at the time, so the performance was a win.

Food for thought: is the DRAM cache on your HDD or SSD ECC? I doubt it...
 

nabsltd

Active Member
Jan 26, 2022
Long before I'd ever heard of ZFS, I started keeping checksums of files. Whenever I copy that data to another disk, I verify the checksum matches. If it does not, I test the checksum against the original file.

In 15 years, I have never had a file where the "source" failed the checksum and the disk did not report an uncorrectable error. I've caught silent copy errors (where there were no problems with either disk, but somewhere in between some bits flipped), and had disks die on me, but not once has a disk that has said "yep, everything's OK" lied to me.

Note, too, that unless you use ZFS for everything, and only use ZFS replication for copying, ZFS does not protect from the sort of errors that an external checksum does. Sure, odds are that no bits get flipped in RAM, or on the network, or over a bus, but you can never be 100% sure without testing.
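
That workflow fits in a few lines of Python; the paths here are hypothetical, this just sketches the logic:

Code:
import hashlib, shutil

def sha256_of(path, bufsize=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

stored = sha256_of("/data/photos/img001.raw")    # checksum kept from archive time

shutil.copy2("/data/photos/img001.raw", "/backup/img001.raw")

if sha256_of("/backup/img001.raw") != stored:
    # Copy is bad: was it the source, or something in between?
    if sha256_of("/data/photos/img001.raw") == stored:
        print("source intact: bits flipped in RAM/bus/network during the copy")
    else:
        print("source itself no longer matches its stored checksum")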
 

UhClem

Active Member
Jun 26, 2012
NH, USA
It's possible spinning rust no longer uses Hamming code, sure. But it's likely they do, as it's proven, fast, and reliable within its limits. Which is 1-bit correction, 2-bit detection, and it *can* fail (pass bad data) at 3 or more bitflips. ...
"possible" ??? ... "no longer" ??? (Did it ever?)

Why would you make these assumptions?

Have you, maybe, conjoined the ECC implementation used for (current/modern) RAM [an 8-bit datum] with (any of) the ECC implementations used for (current/modern) HDDs [a 512/4096-byte datum]?
 

Mithril

Active Member
Sep 13, 2019
"possible" ??? ... "no longer" ??? (Did it ever?)

Why would you make these assumptions?

Have you, maybe, conjoined the ECC implementation used for (current/modern) RAM [an 8-bit datum] with (any of) the ECC implementations used for (current/modern) HDDs [a 512/4096-byte datum]?
When you don't remember the name, google it, and fail to notice they are actually talking about SSDs, lol. Yeah, totally the wrong name/method. Looks like it's a mix of Reed-Solomon, which is much better (in general), and low-density parity-check (LDPC) codes. Both are single-pass, so unless the drive is comparing multiple (raw) reads it's still possible to have an uncorrected error; how often depends on drive parameters. I'm actually less worried about it now with the refresher on the ECC used, so thanks!
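
The difference is easy to see with the third-party reedsolo package (pip install reedsolo; this assumes a recent version, whose decode() returns a 3-tuple): a handful of parity bytes corrects several whole corrupted bytes, not just one bit.

Code:
from reedsolo import RSCodec

rsc = RSCodec(10)                  # 10 parity bytes -> corrects up to 5 bad bytes
msg = b"spinning rust sector"
codeword = rsc.encode(msg)

bad = bytearray(codeword)
for i in (0, 7, 12):               # corrupt three whole bytes
    bad[i] ^= 0xFF

decoded, _, _ = rsc.decode(bytes(bad))
assert bytes(decoded) == msg       # all three fixed, far beyond Hamming's 1 bit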

I tried to refresh my memory and actually ended up the worse for it, doh!

I will say, tho, that it's still good to have *some* extra checksum, either in the filesystem (such as ZFS) or external (as nabsltd mentioned), as that reasonably protects both from silent bad reads and from silent failed/corrupted writes (such as non-sync writes, or consumer drives saying "yes, I wrote that" before they actually finish).

Honestly, when you look at the raw ECC rates of modern spinning rust, it's quite impressive. The whole "we're going to constantly be getting errors, so we just have to live with it and correct them" philosophy. Sort of like how with NAND storage, quantum tunneling is just... a consideration that has to be made as a fact of life.

Edit: No sarcasm here, I genuinely confused myself trying to refresh my memory so I wouldn't be confused. I still think a hash/checksum besides what the drive does is a good idea.
 

zunder1990

Active Member
Nov 15, 2012
Relating to this thread, MooseFS is actually acting a bit like ZFS here (broadly speaking): it maintains checksums on files and claims to do re-reads from other copies if needed. Assuming MooseFS does what it says well, it would (IMHO) make RAID 5/6 more useful, since it would solve at least some data integrity issues. What I don't know is if MooseFS can "rebuild" a file (say, from two partially correct sources) or if it's just a checksum/hash.
Here is how it would work in my setup using the open source version. For this example the file is 60MB, so it will fit in a single chunk (64MB), and I have the min goal set to 2. There are two full copies of the chunk sitting on different hard drives in different servers. If during a read of the file or a normal checksum scan there is an error, MooseFS will flag the error, read the file from the remaining good chunk, and serve that data to the client. It will also start copying that good chunk to another server to bring the system back up to two good working copies of the chunk. For my really important folder (less than 1TB) of stuff like desktop backups and photos, I have the min goal set to 3, so there are 3 good copies of those files in the MooseFS system.
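
For reference, setting and checking those goals is done with the MooseFS client tools; here's a quick sketch via Python's subprocess (hypothetical mount point, and assuming the standard mfssetgoal/mfsgetgoal tools are on PATH):

Code:
import subprocess

important = "/mnt/mfs/important"   # hypothetical MooseFS mount + folder

# Recursively request 3 copies of every chunk under the folder
subprocess.run(["mfssetgoal", "-r", "3", important], check=True)

# Confirm the goal now set on the path
subprocess.run(["mfsgetgoal", important], check=True)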