Has anyone tried bcachefs yet?


voxadam

Member
Apr 21, 2016
107
14
18
Portland, Oregon
Kent Overstreet, the original developer of the bcache block caching layer for Linux, has been developing bcachefs, which he describes as "The COW filesystem for Linux that won't eat your data," for a while now, and I was curious whether anyone has had a chance to give it a spin.

I've been running Btrfs on my primary workstation for a number of years without incident, though the fact that I've avoided RAID5/6 like the plague probably has a lot to do with my success. I've been meaning to upgrade my primary SSD and do a clean install for a while now, and it's tempting to give bcachefs a shot. Since bcachefs is based on the widely deployed and tested bcache code, I'm reasonably confident in its stability, and in the event it does eat my data I have backups. The main thing holding me back is the lack of support for snapshots, which Kent freely admits are "by far the most complex of the remaining features to implement"; there's nary a mention of them on the TODO, which worries me a bit.

Anyway, I was just wondering if anyone had experimented with bcachefs yet.

Feature status:

  • Full data checksumming

    Fully supported and enabled by default. We do need to implement scrubbing, once we've got replication and can take advantage of it.

  • Compression

    Not quite finished - it's safe to enable, but there's some work left related to copy GC before we can enable free space accounting based on compressed size: right now, enabling compression won't actually let you store any more data in your filesystem than if the data was uncompressed.

  • Tiering/writeback caching:

    Bcachefs allows you to specify disks (or groups thereof) to be used for three categories of I/O: foreground, background, and promote. Foreground devices accept writes, whose data is copied to background devices asynchronously, and the hot subset of which is copied to the promote devices for performance.

    Basic caching functionality works, but it's not (yet) as configurable as bcache's caching (e.g. you can't specify writethrough caching).

  • Replication

    All the core functionality is complete, and it's getting close to usable: you can create a multi device filesystem with replication, and then while the filesystem is in use take one device offline without any loss of availability.

  • Encryption

    Whole filesystem AEAD style encryption (with ChaCha20 and Poly1305) is done and merged. I would suggest not relying on it for anything critical until the code has seen more outside review, though.

  • Snapshots

    Snapshot implementation has been started, but snapshots are by far the most complex of the remaining features to implement - it's going to be quite a while before I can dedicate enough time to finishing them, but I'm very much looking forward to showing off what it'll be able to do.

Known issues/caveats
  • Mount time

    We currently walk all metadata at mount time (multiple times, in fact) - on flash this shouldn't even be noticeable unless your filesystem is very large, but on rotating disk expect mount times to be slow. This will be addressed in the future - mount times will likely be the next big push after the next big batch of on disk format changes.

homepage: bcachefs
git: evilpiepirate.org/git/bcachefs.git
Patreon: Kent Overstreet is creating bcachefs - a next generation Linux filesystem | Patreon
mailing list: Majordomo Lists at VGER.KERNEL.ORG
irc: irc://irc.oftc.net/#bcache
The bcachefs filesystem [LWN.net] - 25 August 2015
Bcachefs - encryption, fsck, and more - 15 March 2017
A new bcachefs release [LWN.net] - 16 March 2017
 
  • Like
Reactions: MiniKnight

MiniKnight

Well-Known Member
Mar 30, 2012
3,072
973
113
NYC
I'm not ready to use it yet. I strongly prefer upstream kernel support. This also gives me pause
Scalable - has been tested to 50+ TB, will eventually scale far higher
That isn't giving me confidence.

As a project, the concept has promise.
 
  • Like
Reactions: voxadam

dandanio

Active Member
Oct 10, 2017
182
70
28
I dabbled with Btrfs when I was looking for alternatives to ZFS, but since Red Hat abandoned it in their distro, I've scrapped all thoughts of playing with it again anytime in the future. ZFS is it; Btrfs has nothing that would make it superior to ZFS.
 
  • Like
Reactions: JustinClift

Joel

Active Member
Jan 30, 2015
855
194
43
42
From your description it sounds like it's basically trying to copy ZFS, so why not just use the original, which is at least 15 years old and mostly mature?

Of course for ultimate stability you'd want to stick with BSD OSes.
 

dswartz

Active Member
Jul 14, 2011
610
79
28
It's not really like ZFS at all. That said, it seems to share the weakness of a number of one-man projects I've seen in the past. The developer disappears for weeks at a time, incommunicado...
 

SlickNetAaron

Member
Apr 30, 2016
50
13
8
43
I’m hoping this takes off! The design, feature roadmap and performance look impressive on a cursory review
 

dswartz

Active Member
Jul 14, 2011
610
79
28
Yeah, that would be nice. I'm not holding my breath though. I've seen this WAY too many times before. One-man project. Too much stuff to do. Doesn't want to take others on board to help. Gets burned out and either disappears for weeks at a time, or just abandons the project (not saying he's done all of this, but enough warning signs to make me leery...)
 

Joel

Active Member
Jan 30, 2015
855
194
43
42
Definitely not something I’d want in a file system, especially compared to a mature product like ZFS...
 

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
1,394
511
113
According to the bits I've read, compression is based on zstd/Zstandard, written by Facebook (also used in Btrfs amongst other things), but I don't think it's actually functional yet.
 

_alex

Active Member
Jan 28, 2016
866
97
28
Bavaria / Germany
As I understand it, compression works but is currently useless because the saved space isn't reported back as usable. I guess you can choose the algorithm.
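If that's right, the algorithm is presumably picked when the filesystem is created; going by the bcachefs-tools help, something like this (untested on my side, device name is just an example):
Code:
# sketch only - compression algorithm chosen at format time (lz4/gzip/zstd)
bcachefs format --compression=zstd /dev/sdX
# per the status above, df won't show extra free space yet even when data compresses well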

Anyway, the key point is that some Linux kernel devs involved in I/O, filesystems and such are apparently considering bcachefs for upstream. I guess there's still a huge amount of work to be done before that happens, but it looks promising for the future.
In terms of stability and architectural flaws, I'd guess putting effort into it may be more efficient than with Btrfs.

Btw, dm-cache also seems to be on its way into 4.18 :)
 

BackupProphet

Well-Known Member
Jul 2, 2014
1,092
650
113
Stavanger, Norway
olavgg.com
I just tested it now; there is a nice PPA for Ubuntu that makes installation a breeze: bcachefs testing archive : Chris Halse Rogers
First impression: sync write performance is only 30-50% of ext4 or XFS, about the same as ZFS on Linux, but still over twice as fast as Btrfs, which has REALLY slow sync write performance.

Transparent compression with zstd is awesome; if it worked I could use this at once. It looks like impressive work so far, and if I read the manual correctly you can assemble pools like you can with ZFS. That's really cool!

bcachefs:
Code:
olav@sola:~$ sudo /usr/lib/postgresql/10/bin/pg_test_fsync -f /mnt/testfile
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                      2571,695 ops/sec     389 usecs/op
        fdatasync                          2589,633 ops/sec     386 usecs/op
        fsync                              2666,934 ops/sec     375 usecs/op
        fsync_writethrough                              n/a
        open_sync                          2568,984 ops/sec     389 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                      1270,695 ops/sec     787 usecs/op
        fdatasync                          2064,706 ops/sec     484 usecs/op
        fsync                              2055,642 ops/sec     486 usecs/op
        fsync_writethrough                              n/a
        open_sync                          1214,583 ops/sec     823 usecs/op

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB in different write
open_sync sizes.)
         1 * 16kB open_sync write          2343,705 ops/sec     427 usecs/op
         2 *  8kB open_sync writes         1360,672 ops/sec     735 usecs/op
         4 *  4kB open_sync writes          799,095 ops/sec    1251 usecs/op
         8 *  2kB open_sync writes          541,918 ops/sec    1845 usecs/op
        16 *  1kB open_sync writes          272,126 ops/sec    3675 usecs/op

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different
descriptor.)
        write, fsync, close                2823,150 ops/sec     354 usecs/op
        write, close, fsync                2958,514 ops/sec     338 usecs/op

Non-sync'ed 8kB writes:
        write                            394881,852 ops/sec       3 usecs/op
ext4
Code:
olav@sola:~$ sudo /usr/lib/postgresql/10/bin/pg_test_fsync -f /mnt/testfile
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                      9489,370 ops/sec     105 usecs/op
        fdatasync                          9619,069 ops/sec     104 usecs/op
        fsync                              9581,304 ops/sec     104 usecs/op
        fsync_writethrough                              n/a
        open_sync                         10228,169 ops/sec      98 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                      5601,623 ops/sec     179 usecs/op
        fdatasync                          6110,553 ops/sec     164 usecs/op
        fsync                              5459,554 ops/sec     183 usecs/op
        fsync_writethrough                              n/a
        open_sync                          5023,706 ops/sec     199 usecs/op

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB in different write
open_sync sizes.)
         1 * 16kB open_sync write          5402,382 ops/sec     185 usecs/op
         2 *  8kB open_sync writes         4896,125 ops/sec     204 usecs/op
         4 *  4kB open_sync writes         2988,940 ops/sec     335 usecs/op
         8 *  2kB open_sync writes         1755,473 ops/sec     570 usecs/op
        16 *  1kB open_sync writes          970,653 ops/sec    1030 usecs/op

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different
descriptor.)
        write, fsync, close                8060,057 ops/sec     124 usecs/op
        write, close, fsync                8937,642 ops/sec     112 usecs/op

Non-sync'ed 8kB writes:
        write                            315500,895 ops/sec       3 usecs/op
Done on an 80GB Intel 320.
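For anyone who wants to run the same comparison, it should amount to little more than a plain format and mount before pointing pg_test_fsync at the mountpoint - roughly like this (a sketch; the device name is just an example):
Code:
# rough outline of the test setup (sketch)
sudo bcachefs format /dev/sdb                  # e.g. the 80GB Intel 320
sudo mount -t bcachefs /dev/sdb /mnt
sudo /usr/lib/postgresql/10/bin/pg_test_fsync -f /mnt/testfile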
 

_alex

Active Member
Jan 28, 2016
866
97
28
Bavaria / Germany
So, just a single SSD with no spinner behind it?
Here are the 'tunables'; you should be able to choose between lz4, gzip and zstd per disk group:

IoTunables

Until it works, or as an alternative, you could layer it with dm-vdo and also enable dedup ...
I had only minor problems building vdo on Debian, mainly fixing some paths for Python.
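With the classic python vdo manager that layering would look roughly like this (a sketch only - the name and sizes are made up, and I believe dedup is on by default anyway):
Code:
# create a dedup/compression layer with dm-vdo, then put bcachefs on top (untested sketch)
sudo vdo create --name=vdo0 --device=/dev/sdb --vdoLogicalSize=200G --deduplication=enabled
sudo bcachefs format /dev/mapper/vdo0
sudo mount -t bcachefs /dev/mapper/vdo0 /mnt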
 

homeserver78

New Member
Nov 7, 2023
26
13
3
Sweden
Perhaps it's time to wake this thread from the dead now that bcachefs has been accepted into the mainline kernel (since Linux 6.7, it seems), now including working snapshots, compression (gzip/lz4/zstd), writethrough caching, and more!

Anyone tried it? It looks really nice on paper, with its ability to combine slow + fast storage, configurable redundancy, checksumming, encryption and so on.
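Snapshots, for example, look like they're handled through subvolumes now; if I'm reading the bcachefs-tools help right, it's something along these lines (untested, paths are just examples):
Code:
# sketch based on the subvolume subcommands as I understand them
bcachefs subvolume create /mnt/data
bcachefs subvolume snapshot /mnt/data /mnt/data.snap-2023-11-07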

From the manual:
Code:
bcachefs format --compression=lz4 \
    --replicas=2 \
    --label=ssd.ssd1 /dev/sda \
    --label=ssd.ssd2 /dev/sdb \
    --label=hdd.hdd1 /dev/sdc \
    --label=hdd.hdd2 /dev/sdd \
    --label=hdd.hdd3 /dev/sde \
    --label=hdd.hdd4 /dev/sdf \
    --foreground_target=ssd \
    --promote_target=ssd \
    --background_target=hdd
... would create a file system where all data is redundant (2 copies), with writes going first to the "foreground" devices (as long as they have space) and then replicated in the background to the "background" devices. bcachefs keeps track of device I/O latency and directs reads to the fastest devices. If data is read that isn't already on a "promote" device, it gets promoted to those devices. Devices can be added and removed after the fact, and their "target" can be changed.

Devices also have a "durability" property, so one can set e.g. durability=0 so that data stored on a device doesn't count towards the number of replicas (useful for a pure cache device), or durability=2 if the underlying device is a mirrored HW RAID. So it all seems very flexible.
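If I read the manual right, durability is given as a per-device option at format time, with the option placed before the device it applies to, e.g. (untested sketch, devices are just examples):
Code:
# sketch: per-device options precede the device they apply to; durability=0 acts as pure cache
bcachefs format \
    --durability=0 --label=ssd.cache1 /dev/sdb \
    --durability=1 --label=hdd.hdd1 /dev/sdc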

But I haven't seen much about how it actually handles missing or broken devices... and it's unclear to me how removal and off-lining of devices interact with redundancy and degradation. Any experiences to share?