Muxfs – a mirroring, checksumming, and self-healing filesystem layer for OpenBSD (sdadams.org)
171 points by ciprian_craciun on Aug 14, 2022 | 59 comments


I have only two issues with the `muxfs` implementation as it stands:

(1) (And the largest problem:) it requires stable inodes in order to tie the checksums to the actual files. This means (and it's already stated in the article) that you can't copy / move / overwrite any of the underlying files without losing the checksums. (Basically it also removes the possibility of accessing one of the mirrors via NFS, FUSE, or anything that doesn't have stable inodes.)

(2) (Based on my reading of the article) it doesn't seem to keep a "log" or "sequence" to identify which of the two mirrors is ahead or whether they are in sync. In case of a disconnect / reconnect you need to manually tell `muxfs` which is the "newer" one (by using a `sync` before being able to mount it).

(I haven't tested it though, as I'm running Linux, but I'm quite interested, because just last week I thought "why doesn't someone implement a FUSE file-system to add checksums and thus prevent bitrot". `muxfs` also adds mirroring.)


I am also using checksums to detect bit-rot, but in order to tie them to the files, they are stored in the extended attributes of the files. Thus they do not depend on the inode numbers.
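
Roughly, a minimal sketch of this scheme on Linux, assuming setxattr(2)/getxattr(2) (FreeBSD spells these extattr_set_file(2)/extattr_get_file(2)); the attribute name user.checksum is just an example, not a standard one:

    /* Minimal sketch: tie a checksum to a file via a user xattr rather
     * than its inode number.  Assumes Linux's setxattr(2)/getxattr(2);
     * "user.checksum" is an example name, not a standard one. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/xattr.h>

    int store_checksum(const char *path, const char *hex_digest)
    {
        /* flags = 0: create the attribute or replace an existing one */
        if (setxattr(path, "user.checksum", hex_digest,
                     strlen(hex_digest), 0) == -1) {
            perror("setxattr");
            return -1;
        }
        return 0;
    }

    int load_checksum(const char *path, char *buf, size_t buflen)
    {
        ssize_t n = getxattr(path, "user.checksum", buf, buflen - 1);
        if (n == -1) {
            perror("getxattr");  /* ENODATA means no checksum stored yet */
            return -1;
        }
        buf[n] = '\0';
        return 0;
    }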

OpenBSD also supports extended file attributes, so using them should be possible.

Using extended attributes on Linux or FreeBSD requires a few precautions, because there are still various copying/moving/archiving CLI commands and GUI applications that ignore extended attributes, and also some file systems that do not implement them, e.g. tmpfs on Linux (which supports only certain kinds of system extended attributes, not those defined by users) or all but the newest versions of NFS (only NFSv4 in Linux 5.9 or newer supports xattrs, unlike Samba, which has supported them for decades, mapping them correctly between different file systems, e.g. XFS on Linux to UFS on FreeBSD). Copying a file via those file systems would silently lose its extended attributes.

Extended file attributes were introduced in 1989, in HPFS for OS/2 version 1.2, and they were brought to UNIX by XFS, in 1993.

30 years later, it is annoying to see that there are still some programs which pretend to make file copies or file archives, but which can lose the extended file attributes, without any warning or error.


OpenBSD removed support for extended attributes.


I suppose this was done to reduce the workload of the OpenBSD maintainers, but I do not consider it a wise decision.

Extended file attributes can be used to implement a large number of useful things, many of them enhancing security, e.g. access-control lists. However, IIRC, OpenBSD has also chosen not to implement ACLs.

In any case, regardless of how useful or not extended file attributes are considered to be, deciding not to implement them in the main file system of an operating system has the immediate consequence of disqualifying that operating system for use on a file server (a.k.a. NAS), an application domain where the *BSD operating systems have traditionally been very good.

The reason is that whenever such a NAS would have Windows, Linux or FreeBSD clients, transiting any file through that NAS would potentially lose data.

In general, the documentation of any file system should display the lack of support for features like extended attributes or access-control lists very prominently, to warn potential users about the risk of data loss during copy operations (because the file copy commands are usually stupid enough not to inform the users when they are stripping file metadata, so such a loss may be discovered only when it is too late).


In another project of mine I considered using extended attributes to tag files into categories. Since extended attributes are not universally supported and are easily overlooked, the conclusion I came to was that I should store that information in a database file. I didn't want to one day lose my (manually assigned) tags to an erroneous move command.

Anything non-native to OpenBSD can still be stored in muxfs as an archive file. In that case you would need to make sure to pass the right arguments to the archiver so that it preserves the attributes.


Among the programs with good support for extended file attributes are any archivers based on the FreeBSD libarchive (e.g. bsdtar), rsync, samba and the Linux coreutils.

However, when precompiled binaries are used instead of compiling from source, one must verify that support for extended attributes has not been disabled, as happens in some misguided Linux distributions.

Besides such precompiled binaries with xattr support disabled, I have also seen various GUI file managers that lacked support for xattrs and could strip them during copy or move operations.

The older tar and cpio archive formats do not support extended attributes; many tar programs do support xattrs, but via tar or pax format extensions that may be incompatible with other tar implementations.

It is important to be aware of these caveats, otherwise you may have unpleasant surprises, like I had many years ago when copying files between different users via /tmp: I could not understand where the files were losing metadata (and why their timestamps were truncated), until I realized that /tmp was on tmpfs, and copying to /tmp was silently stripping the extended attributes and truncating the timestamps (the latter might no longer be true today). Still, with a few precautions it is possible to use extended attributes without problems on Linux and FreeBSD.

Before using extended attributes, I had also used a database file, but that had the disadvantage of being updated all the time, even for file operations that did not change the file content, e.g. renaming or moving files.


Not a 1:1 equivalent of the "classic" extended attributes, but there are special flags on files:

https://man.openbsd.org/chflags


(1) I believe that stable inode numbers are possible with FUSE, but they must be implemented by the FUSE driver. Mirroring over a network is not the intended use case; the mirrors exist to provide data redundancy for use by the muxfs driver. If you need to copy the data to a new location there is the muxfs sync command.
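
For illustration (not muxfs's code): with libfuse 3's high-level API, a driver can report stable inode numbers by filling st_ino itself and being mounted with -o use_ino. OpenBSD's in-base FUSE API differs, and the /var/backing path here is made up:

    /* Sketch: a FUSE driver reporting stable inode numbers by passing
     * through the backing filesystem's st_ino.  Assumes libfuse 3;
     * OpenBSD's in-base FUSE differs.  BACKING is hypothetical. */
    #define FUSE_USE_VERSION 31
    #include <fuse.h>
    #include <errno.h>
    #include <stdio.h>
    #include <sys/stat.h>

    #define BACKING "/var/backing"

    static int myfs_getattr(const char *path, struct stat *st,
                            struct fuse_file_info *fi)
    {
        char real[4096];
        (void)fi;
        snprintf(real, sizeof(real), "%s%s", BACKING, path);
        if (lstat(real, st) == -1)
            return -errno;
        /* st->st_ino now carries the backing inode number; mounted with
         * -o use_ino, the kernel exposes it unchanged to stat(2) callers
         * instead of synthesizing its own. */
        return 0;
    }

    static const struct fuse_operations myfs_ops = {
        .getattr = myfs_getattr,
    };

    int main(int argc, char *argv[])
    {
        return fuse_main(argc, argv, &myfs_ops, NULL);
    }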

(2) muxfs uses sequence numbers to count the write operations performed on each mirror. Upon failure to mount due to the mirrors being out of sync a report is printed comparing the first mirror with the first non-matching mirror, and this includes their sequence numbers.
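
(As a toy illustration of the idea only - not muxfs's actual on-disk format - one counter per mirror, compared at mount time:)

    /* Toy illustration only, not muxfs's real format.  Each mirror
     * keeps a counter incremented once per completed write operation;
     * after a disconnect, the mirror with the highest count is ahead. */
    #include <stdint.h>
    #include <stdio.h>

    struct mirror_state {
        const char *root;  /* backing filesystem mount point */
        uint64_t    seq;   /* write-operation sequence number */
    };

    /* Index of the mirror with the highest sequence number. */
    static int newest_mirror(const struct mirror_state *m, int n)
    {
        int best = 0;
        for (int i = 1; i < n; i++)
            if (m[i].seq > m[best].seq)
                best = i;
        return best;
    }

    int main(void)
    {
        struct mirror_state mirrors[2] = {
            { "/mnt/a", 1042 },  /* hypothetical values */
            { "/mnt/b",  998 },  /* fell behind after a disconnect */
        };
        printf("mirror %d is ahead\n", newest_mirror(mirrors, 2));
        return 0;
    }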


Nice concept, and having skimmed it, it's worth noting:

> muxfs needs you!

> No filesystem can be considered stable without thorough testing and muxfs is no exception.

> Even if I had tested muxfs enough to call it stable it still would not be responsible to expect you to simply take my word for it. It is for this reason that I do not intend to release a version 1.0 until there are sufficient citations that I can make to positive, third-party evaluations of muxfs.

> This is where you can help.

> I need volunteers to test muxfs, provide feedback, and periodically publish test results.


This requirement for thorough testing of inherently multithreaded code, with a lot riding on not destroying user data, suggests that a model checker would be an essential piece; perhaps someone more knowledgeable can opine.


OpenBSD's FUSE does not implement multithreading.


Author here. A big thank you to you all for your interest in muxfs! I will try to answer all of your questions as best I can.


The concept sounds good, but you also need people to review the FUSE implementation as well as muxfs. And test, and so on.

One question: do you plan to implement "concatenation" of filesystems, so you can build one very large muxfs system (10s of terabytes)?


The OpenBSD FUSE implementation is in base so it should already be well audited.

I don't plan to add "concatenation" as you have described, however this can, in theory, be approximated by layering muxfs on top of multiple RAID0s.


> a filesystem should automatically check and repair data as it is accessed rather than processing the entire filesystem tree upon every check or repair job.

Except this is not sufficient. Flash storage for example is especially susceptible to random bitrot of data over time regardless of whether or not it is ever accessed or even powered on. Ever tried to plug in an old USB stick or SD card only to find out it was totally busted or unreadable? Scanning the entire filesystem and re-checksumming everything is therefore completely necessary.


There is the muxfs audit subcommand that does that, though. I guess the author was trying to say that it shouldn't be the only way to do it, as that opens the door to silently returning corrupted data in between runs, and you start pondering how low you should set the interval between audit runs, etc. I guess with automatic checks on every access you can feel safe running the audit every other month or so.
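
For intuition, a toy scrub loop of the kind such an audit pass performs - walk the tree, re-hash every file, compare against the stored checksum. This assumes the user.checksum xattr scheme sketched upthread, and FNV-1a stands in for a real digest:

    /* Toy scrub pass: walk a tree, re-hash every regular file, and
     * compare against the stored checksum.  Assumes the user.checksum
     * xattr scheme sketched upthread; FNV-1a is a stand-in for a real
     * digest like SHA-1. */
    #define _XOPEN_SOURCE 700
    #include <ftw.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/xattr.h>

    static uint32_t hash_file(const char *path)
    {
        uint32_t h = 2166136261u;  /* FNV-1a offset basis */
        FILE *f = fopen(path, "rb");
        int c;
        if (f == NULL)
            return 0;
        while ((c = getc(f)) != EOF)
            h = (h ^ (uint8_t)c) * 16777619u;  /* FNV-1a prime */
        fclose(f);
        return h;
    }

    static int visit(const char *path, const struct stat *st,
                     int type, struct FTW *ftw)
    {
        char stored[16], fresh[16];
        ssize_t n;
        (void)st; (void)ftw;
        if (type != FTW_F)
            return 0;  /* skip directories, symlinks, etc. */
        n = getxattr(path, "user.checksum", stored, sizeof(stored) - 1);
        if (n < 0) {
            printf("%s: no checksum recorded\n", path);
            return 0;
        }
        stored[n] = '\0';
        snprintf(fresh, sizeof(fresh), "%08x", hash_file(path));
        if (strcmp(stored, fresh) != 0)
            printf("%s: MISMATCH (bit-rot?)\n", path);
        return 0;
    }

    int main(int argc, char *argv[])
    {
        return nftw(argc > 1 ? argv[1] : ".", visit, 16, FTW_PHYS);
    }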


I can't find "zfs" mentioned once in this guy's doc, so my first question is... why not?


Same reason he doesn’t mention BtrFS and a bunch of other file systems: Because he’s running OpenBSD which doesn’t support ZFS.


From the page:

> I decided it was finally time to build a file server to centralize my files and guard them against bit-rot. Although I would have preferred to use OpenBSD due to its straightforward configuration and sane defaults, I was surprised to find that none of the typical NAS filesystems were supported.

OpenBSD does not support ZFS.


Theo is, perhaps rightfully so, against importing what is effectively a paravirtualized Solaris kernel into the OpenBSD source code in order to run a file system.


Too sad, because the partitioning is why I don't use OpenBSD. With ZFS you could just create a dataset, set nosuid etc. on it, and give it a quota; with FFS you can bet that you will run out of space, sooner or later, in one of the partitions (workstation, NOT server).


You can use your own partitioning though? One for /, one for swap. Done.


Yes, you can, but you can't set stuff like nosuid etc. on /.


I doubt it. Even for ports you can still symlink /usr/ports to $HOME/ports, e.g. for ScummVM with --enable-all-engines or EDuke32 (Build/GPLv2 license clash, can't be shared as a binary).

/usr/local is not small at all by default.


[flagged]


I think what they're trying to say is that you can just link stuff to $HOME if some filesystem runs out of space (not an endorsement of that view, just an explanation).


Because ZFS is not supported on OpenBSD. In fact, he does mention at the beginning of the article that he was surprised that none of the typical NAS file-systems are supported by OpenBSD.

On the other side, ZFS is an overly complicated behemoth that wants direct access to the block device. Meanwhile `muxfs` works with any already existing file-system (local or remote) and just provides the checksums. So the two serve different use-cases.


Non-ZFS filesystems are overly simplified, ignoring the problems they ought to be solving.


ZFS is not the simplest solution to the problems that "ought to be solved", and the implementation can be rather annoying - lacking support for hardware configuration changes, using its own cache system that sidesteps the one in the kernel, and generally not fitting in with normal filesystem paradigms. And that's not even addressing that incremental sends - a huge feature - were (are?) broken due to the hole_birth bug, making them unreliable.

btrfs kinda blew up, but it would be nice to have a good and simple reliable filesystem that actually fits in with the others. ZFS is what we're stuck with till then.


Some of them are grounded in Linux politics. I have hope in bcachefs but best not rush, we've all seen btrfs.


ZFS was never any prettier on BSD either, so I wouldn't blame Linux for that.

But yes, bcachefs is somewhat interesting. Or maybe btrfs manages to clean up their act one day.


ZFS on FreeBSD was very pretty, till the Linux guys added ABD and other Linux-specific features.


In what way? Back when I used it it seemed just as misplaced. Its own mount semantics, its own caches, all that.


Blazingly fast, robust and rock-solid.

Miles better than all those geom_xxx RAIDs (I've been the maintainer of geom_raid5, mind you), chipset RAIDs and the FFS2 SU+J stuff, which is not completely stable even now - I got a ton of errors from a forced foreground fsck after a "normal" background fsck completion as recently as 2019 (I don't have a single R/W FFS2 filesystem after that, and I'm happy!)

With all due respect to McKusick, all this modernization of FFS2 (SU, snapshots, SU+J) has always been very fragile, and software RAIDs implemented as GEOMs are much worse than the ZFS VDEV layer.

Linux-induced changes downgrade ZFS performance a lot, though :-( The extra level of indirection in the ARC is a really big deal.


What's wrong with btrfs?


The parity-based RAID levels are still officially not safe for production, and overall many people don't quite trust it in more complex setups due to past bugs.


It depends on your use case. BTRFS still has deficiencies in how it does quotas (enabling quotas significantly slows down the fs). BTRFS RAID 5 has a write hole like traditional RAID 5. And there is a problem that sometimes occurs with individual extents when you use dedup; one of the dedup tools predicts when the issue will occur and skips dedupe on such extents.

There were RFCs proposing to address both the write hole and the quota issues on LWN this year, with the write hole fix already having draft patches; see "raid tree".

BTRFS is fine if you are not doing things that can hit those edges.


Here’s an excellent rundown that’s somewhat recent: https://arstechnica.com/gadgets/2021/09/examining-btrfs-linu...


As with all things GPL, a victim of FUD campaigns by corporate America.


excuse me what


> ZFS is not the simplest solution to the problems that "ought to be solved"

What are simpler solutions to the problems that ought to be solved?


A CoW filesystem itself is not much more complicated than a plain filesystem. Or maybe a CoW softraid, with the filesystem existing at a different layer.
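
A toy sketch of the core CoW step, grossly simplified (every name here is invented for illustration): never overwrite a live block, write the new version elsewhere, then flip one pointer as the atomic commit:

    /* Toy sketch of the core CoW update (grossly simplified; all names
     * invented for illustration).  A live block is never overwritten:
     * the new version is written out of place, and a single pointer
     * flip is the atomic commit point. */
    #include <stdint.h>
    #include <string.h>

    #define BLKSZ   4096
    #define NBLOCKS 1024

    static uint8_t  disk[NBLOCKS][BLKSZ];  /* stand-in for the block device */
    static uint32_t next_free = 2;          /* naive bump allocator */
    static uint32_t root_ptr  = 1;          /* "superblock" pointer to the
                                               current root block */

    static void cow_write_root(const uint8_t *newdata)
    {
        uint32_t newblk = next_free++;        /* 1. allocate a fresh block  */
        memcpy(disk[newblk], newdata, BLKSZ); /* 2. write data out of place */
                                              /* 3. a real fs flushes here  */
        root_ptr = newblk;                    /* 4. atomic pointer switch;
                                                 the old block stays intact
                                                 until garbage-collected */
    }

    int main(void)
    {
        uint8_t newdata[BLKSZ] = { 42 };
        cow_write_root(newdata);
        return (int)root_ptr;  /* the root now points at the new copy */
    }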

ZFS has countless bonus tunables, several types of caching distinct from the kernel VFS cache, its own write logs and special devices, multiple levels of topology (datasets in a pool consisting of vdevs consisting of drives), deduplication, compression, etc.

It is also not at all user friendly. When set up right (and no changing your mind on the setup), and when fed enough resources, it does its job well, but "simple" or "elegant" cannot describe it.


You didn't answer the question. You can't claim that there are simpler solutions and then not actually provide one, and handwaving is not a solution.

> A CoW filesystem itself is not much more complicated than a plain filesystem.

And yet there isn't one out there; there's pretty much only ZFS and BTRFS, the latter having been in a state of almost-but-not-actually-working for over a decade now.

bcachefs is the only contender and it remains a single-developer effort with little mainlining progress in the last few years.


> You didn't answer the question. You can't claim that there are simpler solutions then not actually provide a simpler solution.

And what would be a reasonable response to you? Pasting a novel filesystem implementation as proof that the existing ones are overcomplicated? Opinions do not have a burden of evidence.

There is a handful of CoW filesystems out there, showing that writing one is certainly not an insurmountable issue. Rather, the problem is stopping people from doing more at this point, keeping the design simple instead.

That we don't have something better yet is likely a result of filesystems in general being rather laborious to write correctly, CoW or not, and incredibly unrewarding - few people care about filesystems until one breaks.


There’s also hammerfs. I’ve not used that though.

I think the complaints levelled against ZFS are a little unfair though. I agree that there are more elegant ways ZFS could have been implemented, but what we already have works really damn well. And the comments about the CLI being hard to use are weird, because having used a hell of a lot of different file systems over the years (including BtrFS), I’ve found ZFS to be remarkably easy.

ZFS has saved me from a number of hardware failures. If it really were as bad as the comments on here have made out, I’d have lost data several times over.


I completely agree with that assessment. The core ideas behind how the CoW works are elegant, but the implementation is anything but elegant.


Depends what problems you're looking to solve; you don't need a CoW filesystem if you just want data integrity features for example, and you don't need data integrity features if you're just looking for something with quick and efficient snapshots.

ZFS tries to solve every filesystem problem and actually doesn't even do a terrible job at it, but it can be a bit of a beast due to its high complexity and that it doesn't integrate well with the rest of the system.


>On the other side, ZFS is an overly complicated behemoth, that wants direct access to the block device. Meanwhile `muxfs` works with any already existing file-system (local or remote) and just provides the checksums. So both serve different use-cases.

Nah... it's not overly complicated for what it is, but yes, it is a behemoth.

>that wants direct access to the block device.

Yes, for high-performance "enterprise" setups it is preferable, but absolutely not needed.

> Meanwhile `muxfs` works with any already existing file-system (local or remote) and just provides the checksums.

That, I think, is the winning point here: just add bit-rot protection to FFS.


> that wants direct access to the block device

You can set up a ZFS pool backed by files[1]. Probably not something you should do with data you really care about, but it's possible.

[1]: https://linux.die.net/man/8/zpool (Virtual Devices)


Other than lack of support for NAS file systems in OpenBSD, is there a reason not to use ZFS (in favor of another file system providing similar features)?


The muxfs source code is a lot smaller than that of ZFS so if security is a concern to you then you might find muxfs easier to audit than ZFS. It also compiles quickly so could be a good match for a source-based system. This said I never aimed to "beat" ZFS.


I appreciate how clear you are about the chosen trade-offs, and I am also quite impressed by the extent to which the install instructions include "and now here's how you check that worked" commands. I'm sure somebody reading this will think "well, yes, obviously you should include those", but it's not as common as I might like, and your version thereof is notably thorough.


Thanks! I'll try to remember this for any future blog entries.


Would muxfs be easily portable to other operating systems, or does it rely on non-standard APIs?


Do you use forward error correction? If so, which algorithm?


No error correcting codes currently. Just checksums and redundant copies. Currently the supported checksum algorithms are crc32, md5, and sha1.


Would porting ZFS have been worthwhile? I know the CDDL is likely considered toxic to have in the kernel, but even a FUSE approach might be worthwhile here, no?


Porting ZFS to OpenBSD has been attempted before. See this blog post from 2013: https://flak.tedunangst.com/post/ZFS-on-OpenBSD


Ahh, OK. Makes sense. It seems this was to include it in the kernel though. I wonder if a fully user-space solution would have been doable? (Unlikely, I guess.)


There is a FUSE implementation of ZFS for Linux but it is still in development. Porting it at this stage could be tricky.



