Has anyone tried bcachefs yet?


voxadam

Member
Apr 21, 2016
107
14
18
Portland, Oregon
Kent Overstreet, the original developer of the bcache block caching layer for Linux, has been developing bcachefs, which he describes as "The COW filesystem for Linux that won't eat your data," for a while now, and I was curious whether anyone has had a chance to give it a spin.

I've been running Btrfs on my primary workstation for a number of years without incident, though the fact that I've avoided RAID5/6 like the plague probably has a lot to do with my success. I've been meaning to upgrade my primary SSD and do a clean install for a while now, and it's tempting to give bcachefs a shot. Since bcachefs is based on the widely deployed and tested bcache code, I'm reasonably confident in its stability, and in the event it does eat my data I have backups. The main thing holding me back is the lack of support for snapshots, which Kent freely admits are "by far the most complex of the remaining features to implement"; there's nary a mention of them on the TODO, which worries me a bit.

Anyway, I was just wondering if anyone had experimented with bcachefs yet.

Feature status:

  • Full data checksumming

    Fully supported and enabled by default. We do need to implement scrubbing, once we've got replication and can take advantage of it.

  • Compression

    Not quite finished - it's safe to enable, but there's some work left related to copy GC before we can enable free space accounting based on compressed size: right now, enabling compression won't actually let you store any more data in your filesystem than if the data was uncompressed.

  • Tiering/writeback caching:

    Bcachefs allows you to specify disks (or groups thereof) to be used for three categories of I/O: foreground, background, and promote. Foreground devices accept writes, whose data is copied to background devices asynchronously, and the hot subset of which is copied to the promote devices for performance.

    Basic caching functionality works, but it's not (yet) as configurable as bcache's caching (e.g. you can't specify writethrough caching).

  • Replication

    All the core functionality is complete, and it's getting close to usable: you can create a multi device filesystem with replication, and then while the filesystem is in use take one device offline without any loss of availability.

  • Encryption

    Whole filesystem AEAD style encryption (with ChaCha20 and Poly1305) is done and merged. I would suggest not relying on it for anything critical until the code has seen more outside review, though.

  • Snapshots

    Snapshot implementation has been started, but snapshots are by far the most complex of the remaining features to implement - it's going to be quite a while before I can dedicate enough time to finishing them, but I'm very much looking forward to showing off what it'll be able to do.

Known issues/caveats
  • Mount time

    We currently walk all metadata at mount time (multiple times, in fact) - on flash this shouldn't even be noticeable unless your filesystem is very large, but on rotating disk expect mount times to be slow. This will be addressed in the future - mount times will likely be the next big push after the next big batch of on disk format changes.

homepage: bcachefs
git: evilpiepirate.org/git/bcachefs.git
Patreon: Kent Overstreet is creating bcachefs - a next generation Linux filesystem | Patreon
mailing list: Majordomo Lists at VGER.KERNEL.ORG
irc: irc://irc.oftc.net/#bcache
The bcachefs filesystem [LWN.net] - 25 August 2015
Bcachefs - encryption, fsck, and more - 15 March 2017
A new bcachefs release [LWN.net] - 16 March 2017
 
  • Like
Reactions: MiniKnight

MiniKnight

Well-Known Member
Mar 30, 2012
3,072
973
113
NYC
I'm not ready to use it yet. I strongly prefer upstream kernel support. This also gives me pause
Scalable - has been tested to 50+ TB, will eventually scale far higher
That isn't giving me confidence.

As a project, the concept has promise.
 
  • Like
Reactions: voxadam

dandanio

Active Member
Oct 10, 2017
182
70
28
I dabbled with Btrfs when I was looking for alternatives to ZFS, but since Red Hat abandoned it in their distro, I've scrapped all thoughts of playing with it again anytime in the future. ZFS is it; Btrfs has nothing that would make it superior to ZFS.
 
  • Like
Reactions: JustinClift

Joel

Active Member
Jan 30, 2015
855
194
43
42
From your description it sounds like it's basically trying to copy ZFS, so why not just use the original, which is at least 15 years old and mostly mature?

Of course for ultimate stability you'd want to stick with BSD OSes.
 

dswartz

Active Member
Jul 14, 2011
610
79
28
It's not really like ZFS at all. That said, it seems to share the weakness of a number of one-man projects I've seen in the past. The developer disappears for weeks at a time, incommunicado...
 

SlickNetAaron

Member
Apr 30, 2016
50
13
8
43
I’m hoping this takes off! The design, feature roadmap and performance look impressive on a cursory review
 

dswartz

Active Member
Jul 14, 2011
610
79
28
Yeah, that would be nice. I'm not holding my breath though. I've seen this WAY too many times before. One-man project. Too much stuff to do. Doesn't want to take others on board to help. Gets burned out and either disappears for weeks at a time, or just abandons the project (not saying he's done all of this, but enough warning signs to make me leery...)
 

Joel

Active Member
Jan 30, 2015
855
194
43
42
Definitely not something I’d want in a file system, especially compared to a mature product like ZFS...
 

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
1,394
511
113
According to the bits I've read, compression is based on zstd/Zstandard, written by Facebook (also used in Btrfs amongst other things), but I don't think it's actually functional yet.
 

_alex

Active Member
Jan 28, 2016
866
97
28
Bavaria / Germany
As I understand it, compression works but is currently useless because the saved space isn't reported back as usable. I guess you can choose the algorithm.
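If that's right, the algorithm is presumably picked when the filesystem is created; going by the bcachefs-tools help, something like this (untested on my side, device name is just an example):
Code:
# sketch only - compression algorithm chosen at format time (lz4/gzip/zstd)
bcachefs format --compression=zstd /dev/sdX
# per the status above, df won't show extra free space yet even when data compresses well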

Anyway, the key point is that some Linux kernel devs involved in I/O, filesystems and such are apparently considering bcachefs for upstream. I guess there's still a huge amount of work to be done before that happens, but it looks promising for the future.
In terms of stability and architectural flaws, I'd guess putting effort into it may be more efficient than with Btrfs.

Btw, dm-cache also seems to be on its way into 4.18 :)
 

BackupProphet

Well-Known Member
Jul 2, 2014
1,092
650
113
Stavanger, Norway
olavgg.com
I just tested it now; there is a nice PPA for Ubuntu that makes installation a breeze: bcachefs testing archive : Chris Halse Rogers
First impression: sync write performance is only 30-50% of ext4 or XFS, about the same as ZFS on Linux, but still over twice as fast as Btrfs, which has REALLY slow sync write performance.

Transparent compression with zstd is awesome; if it worked I could use this at once. It looks like impressive work so far, and if I read the manual correctly you can assemble pools like you can with ZFS. That's really cool!

bcachefs:
Code:
olav@sola:~$ sudo /usr/lib/postgresql/10/bin/pg_test_fsync -f /mnt/testfile
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                      2571,695 ops/sec     389 usecs/op
        fdatasync                          2589,633 ops/sec     386 usecs/op
        fsync                              2666,934 ops/sec     375 usecs/op
        fsync_writethrough                              n/a
        open_sync                          2568,984 ops/sec     389 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                      1270,695 ops/sec     787 usecs/op
        fdatasync                          2064,706 ops/sec     484 usecs/op
        fsync                              2055,642 ops/sec     486 usecs/op
        fsync_writethrough                              n/a
        open_sync                          1214,583 ops/sec     823 usecs/op

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB in different write
open_sync sizes.)
         1 * 16kB open_sync write          2343,705 ops/sec     427 usecs/op
         2 *  8kB open_sync writes         1360,672 ops/sec     735 usecs/op
         4 *  4kB open_sync writes          799,095 ops/sec    1251 usecs/op
         8 *  2kB open_sync writes          541,918 ops/sec    1845 usecs/op
        16 *  1kB open_sync writes          272,126 ops/sec    3675 usecs/op

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different
descriptor.)
        write, fsync, close                2823,150 ops/sec     354 usecs/op
        write, close, fsync                2958,514 ops/sec     338 usecs/op

Non-sync'ed 8kB writes:
        write                            394881,852 ops/sec       3 usecs/op
ext4
Code:
olav@sola:~$ sudo /usr/lib/postgresql/10/bin/pg_test_fsync -f /mnt/testfile
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                      9489,370 ops/sec     105 usecs/op
        fdatasync                          9619,069 ops/sec     104 usecs/op
        fsync                              9581,304 ops/sec     104 usecs/op
        fsync_writethrough                              n/a
        open_sync                         10228,169 ops/sec      98 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                      5601,623 ops/sec     179 usecs/op
        fdatasync                          6110,553 ops/sec     164 usecs/op
        fsync                              5459,554 ops/sec     183 usecs/op
        fsync_writethrough                              n/a
        open_sync                          5023,706 ops/sec     199 usecs/op

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB in different write
open_sync sizes.)
         1 * 16kB open_sync write          5402,382 ops/sec     185 usecs/op
         2 *  8kB open_sync writes         4896,125 ops/sec     204 usecs/op
         4 *  4kB open_sync writes         2988,940 ops/sec     335 usecs/op
         8 *  2kB open_sync writes         1755,473 ops/sec     570 usecs/op
        16 *  1kB open_sync writes          970,653 ops/sec    1030 usecs/op

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different
descriptor.)
        write, fsync, close                8060,057 ops/sec     124 usecs/op
        write, close, fsync                8937,642 ops/sec     112 usecs/op

Non-sync'ed 8kB writes:
        write                            315500,895 ops/sec       3 usecs/op
Done on an 80GB Intel 320.
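For anyone who wants to run the same comparison, it should amount to little more than a plain format and mount before pointing pg_test_fsync at the mountpoint - roughly like this (a sketch; the device name is just an example):
Code:
# rough outline of the test setup (sketch)
sudo bcachefs format /dev/sdb                  # e.g. the 80GB Intel 320
sudo mount -t bcachefs /dev/sdb /mnt
sudo /usr/lib/postgresql/10/bin/pg_test_fsync -f /mnt/testfile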
 

_alex

Active Member
Jan 28, 2016
866
97
28
Bavaria / Germany
So, just a single SSD with no spinner behind it?
Here are the 'tunables'; you should be able to choose between lz4, gzip and zstd per disk group:

IoTunables

Until it works, or as an alternative, you could layer it with dm-vdo and also enable dedup ...
I had only minor problems building vdo on Debian, mainly fixing some paths for Python.
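With the classic python vdo manager that layering would look roughly like this (a sketch only - the name and sizes are made up, and I believe dedup is on by default anyway):
Code:
# create a dedup/compression layer with dm-vdo, then put bcachefs on top (untested sketch)
sudo vdo create --name=vdo0 --device=/dev/sdb --vdoLogicalSize=200G --deduplication=enabled
sudo bcachefs format /dev/mapper/vdo0
sudo mount -t bcachefs /dev/mapper/vdo0 /mnt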
 

homeserver78

New Member
Nov 7, 2023
26
13
3
Sweden
Perhaps it's time to wake this thread from the dead now that bcachefs has been accepted into the mainline kernel (since Linux 6.7, it seems), now including working snapshots, compression (gzip/lz4/zstd), writethrough caching, and more!

Anyone tried it? It looks really nice on paper, with its ability to combine slow + fast storage, configurable redundancy, checksumming, encryption and so on.
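Snapshots, for example, look like they're handled through subvolumes now; if I'm reading the bcachefs-tools help right, it's something along these lines (untested, paths are just examples):
Code:
# sketch based on the subvolume subcommands as I understand them
bcachefs subvolume create /mnt/data
bcachefs subvolume snapshot /mnt/data /mnt/data.snap-2023-11-07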

From the manual:
Code:
bcachefs format --compression=lz4 \
    --replicas=2 \
    --label=ssd.ssd1 /dev/sda \
    --label=ssd.ssd2 /dev/sdb \
    --label=hdd.hdd1 /dev/sdc \
    --label=hdd.hdd2 /dev/sdd \
    --label=hdd.hdd3 /dev/sde \
    --label=hdd.hdd4 /dev/sdf \
    --foreground_target=ssd \
    --promote_target=ssd \
    --background_target=hdd
... would create a file system where all data is redundant (2 copies), with writes going first to the "foreground" devices (as long as they have space) and then replicated in the background to the "background" devices. bcachefs keeps track of device I/O latency and directs reads to the fastest devices. If data is read that isn't already on a "promote" device, it gets promoted to those devices. Devices can be added and removed after the fact, and their "target" can be changed.

Devices also have a "durability" property, so one can set e.g. durability=0 so that data stored on a device doesn't count towards the number of replicas (useful for a pure cache device), or durability=2 if the underlying device is a mirrored HW RAID. So it all seems very flexible.
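If I read the manual right, durability is given as a per-device option at format time, with the option placed before the device it applies to, e.g. (untested sketch, devices are just examples):
Code:
# sketch: per-device options precede the device they apply to; durability=0 acts as pure cache
bcachefs format \
    --durability=0 --label=ssd.cache1 /dev/sdb \
    --durability=1 --label=hdd.hdd1 /dev/sdc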

But I haven't seen much about how it actually handles missing or broken devices... and it's unclear to me how removal and off-lining of devices interact with redundancy and degradation. Any experiences to share?