A data-point of one, and I only have a few TB spread over a handful of disks, but I lost an entire 1TB btrfs volume recently because of a single disk sector gone bad. (Of course I have backups, but it's a pain to restore 1TB, not to mention nerve-wracking: will the backup drive fail under the sudden heavy use?)
It's the only FS I've ever been unable to recover data from at all (and I've had quite a few disks develop errors over the past 25 years), so it's no longer on my list of filesystems-I-trust.
I've experienced data loss twice in 15 years of using Linux and trying different file systems. Both were on btrfs. I didn't use any advanced features, though I had plans to before this happened. But at that point, I cut my losses and converted all my disks back to ext4.
The features are great, but data availability/integrity is the fundamental thing I want from a filesystem. If I write some bytes, they should be there tomorrow. Everything else is secondary.
If you're interested in a more featureful filesystem that isn't btrfs and has a very solid track record, I would recommend looking at XFS. It's a very old filesystem but has a lot of quite modern features (and it performs better than ext4 for several workloads).
Funnily enough, XFS is the only file system on Linux where I've lost data. Back in the day, the wisdom was to stay away from it if you were in an iffy power situation, because it would serve zeroes if a file was being written near a power loss (i.e. you wouldn't get the old file or the new file, but something else entirely).
Having had that happen to me, I always used some extN variant and never lost any existing files.
Of course that was a decade or more ago and I may be misremembering, but a cursory Google search shows other people encountered something like it too.
AFAIK the problem is that XFS trusts the hardware to do the right thing during a power loss, i.e. to stop in-flight requests before the brownout turns your disks and their controllers into heavily biased random number generators. Lots of x86(-64) hardware lacks a proper power-loss interrupt triggered early enough to stop all I/O in time. ext3's journaling hides that problem to some extent by journaling more than just the minimal metadata required.
For maybe 3-6 months a decade ago I was running XFS on my laptop. The laptop had some sort of flakiness, ISTR it was graphics related, and it required frequent reboots for a while. I remember one particular instance where, after a power cycle, a file I had been working on an hour earlier was trashed. That was a real "Oh FFS!" moment and I stopped using XFS.
But I will say I've had a lot of success using XFS to serve images, particularly tiles: map tiles that are generated once and are then basically static. There it has a lot less overhead when formatted with size=2048, so lots of small files are handled better. Of course, reiserfs was even better, but part of that was a job I worked on where they just blindly rendered map tiles for the whole earth, so there were a lot of 50-byte solid blue tiles. Reiser did really well with that, though some deduplication or linking would have been a huge benefit.
> I would recommend looking at XFS. It's a very old filesystem but has a lot of quite modern features (and it performs better than ext4 for several workloads).
I would second this. XFS is great. A few years ago I moved from JFS to XFS for bulk data storage because JFS is pretty abandoned these days. No issues at all with XFS, and never lost any data.
Just make sure it's right for you before you use it (e.g. you cannot shrink XFS volumes the way you can with ext4).
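A rough illustration of that difference, with made-up device and mount point names (XFS can only grow, online; ext4 can shrink, but only offline):
$ sudo xfs_growfs /data                     # XFS: grow to fill the enlarged device; there is no shrink
$ sudo umount /srv && sudo e2fsck -f /dev/sdb1
$ sudo resize2fs /dev/sdb1 500G             # ext4: shrink to 500G while unmounted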
XFS has the same problem as btrfs: extreme sensitivity to metadata corruption. I'm not sure XFS has a RAID1-for-metadata (DUP) feature like the one btrfs recently added.
Which RAID level were you using over those multiple disks? I've been using btrfs and never even experienced a hiccup, despite continuous sector failures, continuous drive failures and even disk controllers disconnecting on the fly. And I've been running the whole thing on a single, fairly weak server with second-hand disks attached over the cheapest USB3 SATA adapters I could find. By all accounts I've been putting btrfs in a situation that is well outside the norm and I have zero complaints. I also use it on my desktop (no RAID) with a similar experience (minus the shitty hardware).
I am continuously amazed at the amount of people who have issues with btrfs. It's been absolutely rock solid over the time I've used it and I have 0 complaints (apart from the ext4 -> btrfs converter producing a corrupted btrfs filesystem, but the actual kernel btrfs code itself has been flawless).
I too have lost an entire btrfs volume (single disk) because of a bad sector. It seems btrfs is very sensitive to bad sectors in metadata, compared to ext4.
Since then, I've reformatted my btrfs partition to use duplicated (DUP) metadata and haven't had any problems.
$ sudo btrfs fi df /
Data, single
System, DUP
Metadata, DUP
Perhaps this is the secret - and should be made the default.
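If anyone wants to do the same without reformatting, here's a hedged sketch of converting an existing single-device filesystem in place (mount point is illustrative):
$ sudo btrfs balance start -mconvert=dup /
$ sudo btrfs balance start -sconvert=dup -f /    # system chunks additionally need the force flag
$ sudo btrfs fi df /                             # should now show Metadata, DUP and System, DUP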
I don't use raid, precisely because I'm worried that failures will take down the entire set, and all of it will be unrecoverable.
I'm not at all suggesting that BTRFS is "buggy" or unreliable; just that, as you say, with certain kinds of disk errors (possibly rare, but it got me), the entire volume becomes unreadable, whereas I've always been able to recover files from extX or XFS volumes.
The default for single devices is DUP metadata, and for raid1 devices the default is raid1 metadata. Those are the defaults; somehow they got bypassed. I think you're right that you need to think a little more and know a little more about btrfs to use it properly, but I hope that will change in the future.
From the article:
'And against the demand of some partners, we are still refusing to support “Automatic Defragmentation”, “In-band Deduplication” and higher RAID levels, because the quality of these options is not where it ought to be.'
It's a relatively new feature (Jan-2016) and it's disabled by default for SSD devices (increased wear). Personally I prefer increased wear over unmountable volumes.
I have the same experience. btrfs has been great to me. I've disconnected disks on the fly as well on raid 1 and the mount is happily serving files off the remaining disks.
Some of us are very concerned about licensing at all times. I'd wager such steadfast concern is ultimately what helped Linux in SCO v. IBM over IBM's JFS file system.
ZFS is very much in a grey area, and in turn, a huge turn off for many.
And if so, what about all those proprietary GPU-driver kernel-modules from AMD and Nvidia which definitely are not compatible with the GPL? Are they a gray area too?
And if they're not a gray area... Why is it suddenly a problem that the ZFS kernel-module is not licensed in a GPL-compatible manner?
I'm not trying to sound facetious, but I honestly can't see the distinction here. And Linux distros left and right have been distributing closed-source kernel-modules for GPUs for a long time now. So what's the problem? What am I missing?
Yes, the proprietary drivers have always been an ethical and legal grey area. However, the way a proprietary driver is distributed is different to how ZFS is distributed. Proprietary drivers are distributed as object code that is then built to be a kernel module on the user's machine. This means that at no point does Nvidia or AMD distribute a Linux kernel module with proprietary code. There are arguments however that distributions which distribute this auto-build scheme by default may also be in violation of the GPL. ZFS is distributed by Canonical as a fully functional kernel module.
There's a lot of gray areas because there have not been many legal cases on derived works and the GPL. The GPL itself has held up in court on copyleft grounds, of course.
Also, you've got the fact that in the case of proprietary graphics drivers, the threat is that the Linux kernel community would sue Nvidia or AMD. The threat with ZFS is that Oracle (who is a member of the Linux kernel community) would sue people using ZFS (they could also then sue for patent infringement).
I know in the past NixOS did the same for the ZFS kernel module: prior to NixOS installation, it was able to download the ZFS source code, build it and then load it from inside the installer Live CD/USB while it was running, and this would require just a couple of commands. I don't know if this is still true nowadays or if they just distribute the ZFS kernel module directly in the Live installer.
If combining GPL and CDDL code in a redistributed file is a violation of the GPL but not the CDDL, then what would the holder of the copyright on the CDDL code be able to sue about? I think the concern is that a contributor to Linux might sue, just like in the Nvidia/AMD case.
Oracle is a contributor to Linux, so they could sue from the GPL side. While Nvidia and AMD also are Linux contributors, the fact they ship proprietary modules would make it hard to argue that they aren't implicitly permitting users to redistribute it (and thus they would be forced to license their drivers under GPLv2).
Not to mention that ZFS is covered by Oracle patents. CDDL provides a patent license, but it might be possible for Oracle to sue you for patent infringement if you're distributing code in a way that complicates the licensing. Not that I'm saying that's likely, but Oracle has enough money to ruin you if they want to.
The crucial difference is in distribution, not in usage.
AMD and Nvidia distribute a binary module with a thin shim. It is the user who builds this shim and inserts it into their kernel. Neither AMD, Nvidia nor any Linux vendor[1] distributes a binary that links into the GPL'd kernel. The combination is done by the user.
And the ZoL project does the same. It has the same trade-offs as AMD and Nvidia. It is just much more difficult to install Linux on a filesystem not supported by the installer than to go without working 3D acceleration at setup time, though.
[1] The distributions take care to keep them in separate, third-party repos, and if you look more closely, you will find they are built on the user's machine using dkms, akmod or a similar mechanism. That also throws a wrench into things if you want to use Secure Boot.
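For example, roughly what that looks like on a dkms-based setup (module version and kernel release below are made up, not from any particular distro):
$ dkms status
zfs/2.1.5, 5.15.0-76-generic, x86_64: installed
$ sudo dkms install zfs/2.1.5 -k 5.15.0-76-generic   # rebuilt locally for each new kernel, never shipped pre-linked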
Publish the source code files only under GPL and Linux kernel developers will be happy.
You might get into trouble with the ZFS side however, since the CDDL license says: "must also be made available in Source Code form and that Source Code form must be distributed only under the terms of this License".
If the Source Code form is only under the terms of the GPL, then the above condition is not met. The law suit would thus come from the ZFS side.
This is resolvable if you think of a driver "shim" that's a derivative work of both a GPL'd codebase (e.g. Linux) and a non-GPL'd codebase (GPU drivers). It's possible for the GPU driver's proprietary license to permit linking without license restrictions, meaning that a "shim" kernel module which loads the binary blob can be under the kernel's copyleft license without problem (since while it's a derivative work of both the module and the blob, only one license is copyleft).
(I don't know if this is what AMD/Nvidia actually do ToS-wise, but it's at least feasible).
Doing this for ZFS is harder, since the shim would have to be a derivative work of both a GPL'd codebase and a CDDL'd codebase, and neither license is compatible with the other. Dual-licensing is probably illegal, as is just picking one over the other.
IANAL, though, and I might be grossly misunderstanding the problem.
Using a shim as a legal workaround seems to me like a trick that is unlikely to ever get past a judge. In every other form in which people try legal workarounds to turn something illegal into something legal, the law generally doesn't take kindly to it and either explicitly forbids the trick or just treats it as inconsequential to the end result.
For example, we have shims in the form of dummy organizations, straw owners, money launderers and so on. None of those kinds of shims work to turn something illegal into something legal. Why would some dummy code that sits between two incompatible pieces of software work any better in the legal system?
"Neither of those kind of shims work to turn something illegal to legal."
There's nothing illegal here to try to turn legal, though. If the blob's license permits other software to interface with it without restriction, then the shim is free to be under the GPL per the terms of the other work from which it's derived (Linux).
The only way there'd be anything illegal here would be if the blob itself is a derivative work of Linux, and if that's the case, then the blob itself is illegal (since it would be subject to the GPL), regardless of the shim and regardless of whether or not it's distributed as part of a Linux distribution.
Take a dummy organization that is used to avoid taxes. There is nothing illegal about paying someone to form a company located in a different country. A company can also sell assets to another for whatever price it wants, and there is nothing illegal about declaring that, as a result, one of the companies now has zero profits and zero assets.
But whether something is legal or illegal depends on more than just each piece in isolation. If combining a blob with the kernel creates a derivative, then just adding a shim to the mix won't turn it legal. Judges are trained to look at the big picture rather than the individual parts, and this is especially true in civil-law systems. Courts are generally interested in:
What the kernel authors' intention is behind their copyright license.
What the blob author's intention is behind their copyright license.
What the distributor's intention was when combining the two works, and what understanding the distributor had about the other copyright holders and their wishes with regard to derivatives.
The existence of a shim that has no other purpose is a sign that something shady is going on, similar to dummy organizations. The question a judge is likely to ask is why such obfuscation was used, since that can help establish intention. If the end result is a de-facto derivative, and the intention was to create a single work out of two separate works, and the creator knows that they are not allowed to create a derivative work, then a shim is not going to save the day.
"Take a dummy organization that is used to avoid taxes."
See, that right there is where the analogy falls apart. Tax evasion is (usually) illegal. Writing software to translate between two independently-developed programs is not, last I checked.
"If the end result is a de-facto derivative"
You'd need to prove the blob is derived from Linux. If it's not, then there's literally nothing illegal happening here. If it is, then - again - the shim is not the illegal thing; the blob is, and the shim is entirely irrelevant in that illegality.
The usual reason for the shims, by the way, has very little to do with licensing terms (at least not directly) and very much to do with the fact that Linux does not provide a stable API (let alone ABI) for kernel modules. The intent of the shim is therefore almost always technical rather than legal in nature.
>ZFS is very much in a grey area, and in turn, a huge turn off for many.
There is a grey area in distribution, there isn't one about using ZFS. Even the FSF, the group who believe the CDDL and GPL can't be distributed together, say "Privately, You Can Do As You Like."
Who deploys production systems on a large scale with a "private" build of the filesystem code? I want my production systems to run on code that is being used by as many people as possible; I don't want patched kernels, I don't want privately built kernel packages, I don't want a unique system that only I've ever seen. I want a system that is as boring as possible (while still providing the functionality I need to effectively do my job). I want a system with a bunch of people complaining about it and asking questions about it online, so that when problems arise, I can find answers.
Now that ZFS is available on Ubuntu and seems to have some adoption, I guess it's a reasonable choice for some. I'm still a bit iffy on it. I don't really want to add license concerns to my list of worries.
The CDDL also has patent clauses and so it's conceivable that a user of OpenZFS which received it in a way that violates the OpenZFS license could be liable for patent infringement of an Oracle patent. And there have been many cases of companies suing users of software over patents.
Another issue is that you should always get software like your filesystem from your distribution. We do a lot of work making sure that your systems can be safely updated, and making sure that upstream bugs are fixed for our distribution. Even community distributions put a lot of effort into that work. As someone who works on maintaining a distribution (I work for SUSE and contribute to openSUSE), I would guess that most people underestimate how much work you need to devote to maintaining the software in a distribution.
>The CDDL also has patent clauses and so it's conceivable that a user of OpenZFS which received it in a way that violates the OpenZFS license could be liable for patent infringement of an Oracle patent. And there have been many cases of companies suing users of software over patents.
Again, this isn't a problem with usage; the license could only possibly be broken by distribution alongside the GPL. Even then, I believe it is the GPL that is broken, so the patent clause would remain.
As for the rest of your argument, the OpenZFS team does a lot of work maintaining the filesystem. Why does that work need to come from you?
> As for the rest of your argument, the OpenZFS team does a lot of work maintaining the filesystem. Why does that work need to come from you?
Integration into our tools, backporting fixes, doing release engineering, tracking upstream changes, triaging and resolving distribution bug reports, documenting usage and troubleshooting, configuring defaults and best practices, a whole lot of testing, etc.
As I said, there's a lot of work that goes into a distribution (I probably haven't covered most of it) that most people don't think about. And that's assuming that a distribution is going to be passive about something as core as a filesystem -- which we wouldn't be. So we'd be working with upstream on development as well, which is more work. So saying something like "it's supported on distribution X" when that distribution doesn't even provide official packages for it is a massive stretch. It might work on distribution X, and you might provide independent ISV-style support for it, but it's not supported by us.
I appreciate that the sort of work distributions do isn't well-publicised (mostly because stability is hardly a sexy thing to blog about, and we don't rewrite things in JavaScript every weekend). But there is an incredible amount of work that goes into making distributions work well for users, and there's a reason that many distributions have lasted for so many years (there's a need for someone to do the boring work of packaging for you).
If I know that Canonical can't legally distribute ZFS in whatever format to me, and yet I use Canonical's distribution of ZFS, isn't there a legal risk there? After all, it would turn out that I have no license to use said distribution of ZFS as such a license was never conferred to me by someone with the legal right to do so.
Generally speaking, courts would probably give me the benefit of the doubt if I had no reason to believe that they couldn't distribute it to me - but as I knew they couldn't (the issues with ZFS and the Linux kernel are well-documented), and I knew I'm using it, they'd probably hold me in violation of copyright.
>If I know that Canonical can't legally distribute ZFS in whatever format to me, and yet I use Canonical's distribution of ZFS, isn't there a legal risk there?
If you somehow knew that any form of distribution was illegal, that would be the case. I haven't heard anyone saying that's the case; distributing it bundled with GPL software potentially breaks the GPL.
Well, two things. He didn't say that; he said it's been PORTED to Linux for quite some time. Which is accurate: the first "stable" release was in 2013, and 4 years qualifies as "quite some time" in pretty much any tech circle that isn't VMS.
As for it not being widely available... you're saying the only way for something to be considered widely available is to be included in the distro directly? I'd argue an easy installation/setup and solid documentation are FAR more important than being included in a distro. If the setup is arcane or the documentation is horrible, it doesn't matter if a tool is in every distro on the planet, nobody is going to use it.
I've been using ZFS on Linux for 5+ years. I know a lot of people wrote off zfs+fuse, but I used it very successfully for years to store tens of TB of backup data, with no data-loss events and performance that, as far as I could tell, was no worse than ZFSonLinux. And I've been using ZoL for years at another job.
It has "official support" from the ZoL folks. And yes, openSUSE has ZFS packages in OBS. But we sure as hell don't ship them by default, or in our official repos. The same applies for Arch, Debian, Fedora, Gentoo, RHEL and CentOS.
Ubuntu is the only distribution that officially supports ZoL and actually ships it in its official repositories (and by default). What that means is that Canonical is effectively saying "we trust there's no legal reason why we cannot do this." No other distribution has made that claim.
EDIT: Actually NixOS also supports it, but the point stands.
They're referring to ZoL providing support for distributions (which actually just means "it works, and if you send a bug we'll work on it"). Only Canonical provides support from the distribution side. See https://news.ycombinator.com/item?id=15088761.
BTRFS has much more flexible snapshots and clones than ZFS. You can create rewritable snapshots and create new snapshots based on those. In addition you can create COW copies of files with "cp --reflink" which ZFS doesn't support.
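E.g., a quick sketch (file names are made up):
$ cp --reflink=always disk.img disk-clone.img   # instant CoW copy; extents are shared until modified
$ cp --reflink=auto large.tar backup/large.tar  # falls back to a normal copy on filesystems without reflink support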
BTRFS feels much more like a native filesystem on Linux too. It doesn't have that big ARC cache.
We've been using BTRFS for 5+ yrs on a Linux MD RAID setup, with no problems at all.
> You can create rewritable snapshots and create new snapshots based on those.
You can do this with ZFS as well.
First, there is no such thing as a "rewritable snapshot"; the term is an oxymoron. Please do not use it and please do not promulgate it. The term "snapshot" should be reserved for read-only states of a (file) system at a particular point in time.
As long as the clone exists, though, the snapshot cannot be deleted. However, you can "promote" the clone, after which the snapshot can be removed. This feature has been around since at least 2010.
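For reference, a rough sketch of that workflow (pool and dataset names are made up):
$ zfs snapshot tank/data@monday
$ zfs clone tank/data@monday tank/data-work   # writable clone backed by the snapshot
$ zfs promote tank/data-work                  # the clone becomes the origin, so the old chain can be cleaned up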
I was aware of the promotion feature. I was creating new clones/snapshots in a chain hierarchy in zfs, copying old backup sets progressively onto the back of the chain, but keeping the head as the current copy. This was a breeze in btrfs, but basically impossible in zfs as it refused to promote the old clones/snapshots.
As to the nomenclature, it doesn't seem to make sense to differentiate snapshots and clones. With the flexibility of btrfs they're just the same thing with a R/O flag.
> As to the nomenclature, it doesn't seem to make sense to differentiate snapshots and clones. With the flexibility of btrfs they're just the same thing with a R/O flag.
It does make sense: one is writable and the other is not. When someone is talking about (e.g.) mitigation mechanisms against ransomware, saying you have "snapshots" is meaningless if they're R/W, as the ransomware can go in and overwrite files. But if you use the term "snapshot" correctly--meaning R/O--everyone involved knows you have mitigated the risk, since the data is safe from being altered and reverting is possible.
It's not the "same thing" if there is a difference between the two--which there is: the R/O flag setting. If two things are different then they are not the "same"; this may seem tautological but it's not. Call different things by different names.
The btrfs CLI is really retarded in this regard, where creating a "snapshot" is not R/O by default; it violates decades of expectations and POLA.
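For anyone following along, the read-only behaviour has to be asked for explicitly (paths are illustrative):
$ sudo btrfs subvolume snapshot /home /snapshots/home-rw      # default: writable, i.e. what ZFS would call a clone
$ sudo btrfs subvolume snapshot -r /home /snapshots/home-ro   # -r gives an actual read-only snapshot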
The term "clone" has been used in the storage industry for several decades now. It has a very specific meaning, as does the term "snapshot". I understand it might not make sense to you personally, but you're just going to confuse anyone you're talking to referring to it by another name. You might as well start calling containers "light virtual machines". You'll get an equal number of blank stares and confusion from whomever you're talking to.
OpenSolaris is effectively dead, but its successor is the illumos project -- which has multiple distributions such as SmartOS. So you could use SmartOS+ZFS.
I love ZFS, but I'll also say that I feel like ZFS is fairly slow. Of course, it probably doesn't help that most of my ZFS machines hold 10+TB of data, with lots of snapshots.
I don't feel like I'm going to be using btrfs anytime soon, I've given up on it. But there are days I wish I had an alternative for high reliability, snapshotting, and ideally deduplication (which usually tends to make my ZFS machines fall over).
Synology DSM is based on their own custom Linux. Nothing's changed there recently, except that with 6.x released last year, btrfs became the default file system (used to be ext4).
Same here. I lost a disk to btrfs and did not manage to recover any data. No longer on my list of filesystems-I-trust. I guess Suse is the biggest contributor since the others stopped? It doesn't give me any confidence that it can be trusted.
I've had similar experience around 5-6 years ago. One of the notebooks with ubuntu was force rebooted and btrfs got corrupted, no way to boot the system again.
If Brazil, one of the world's biggest producers of beef, announced it would stop producing fish, would you wonder whether Peru, one of the world's biggest producers of fish, was also going to stop producing fish?
I find this quite weak argumentation coming from SUSE. Novell/SUSE was also (by far) the largest contributor to AppArmor at some point, and SUSE used it as the MAC framework in their consumer and enterprise distributions. Then, seemingly out of nowhere, they fired the AppArmor team.
That's an important fact about the sustainability of any corporate contributor, but it's not relevant to the thesis of the article (which is that each corporate contributor makes decisions independently of the others).
As has been discussed in other threads, RHEL's kernels are ancient compared to mainline and so they probably decided it's just too much work to continually backport btrfs changes onto increasingly dated kernels. The sky is not (necessarily) falling.
What put it in perspective for me was that here's a company with a product, and they make their money by giving those products the utmost attention to things like compatibility and stability, and that means helping when these things break. This is not them making a political move or drawing a line in the sand; this is them saying they cannot afford to put the same 110% into this project, so they're not going to carry it forward. It's the circle of life, and I appreciate them for the rest of their work even if this is mildly inconvenient in some cases.
But if they thought that btrfs was the future, they would definitely have put in the manpower to maintain and backport btrfs. Red Hat probably believes that XFS + LVM et al. can provide the necessary functionality for enterprise setups on a more reliable basis.
They were never heavily invested in btrfs and they simply already employ a lot of XFS developers. I also don't think it's that easy to put more manpower into it. People with expertise in such a rather complex niche field don't grow on trees.
Uh, the entire value proposition of RHEL is that Red Hat will backport such patches, supporting customers who don't want to change anything until they move up to the next major release.
That just feels tragic and silly. Probably many of the same types of shops that won't build the automated dev/ops infrastructures and end-to-end integration tests that let people move up stacks and updates rapidly.
I haven't worked at a place that used RHEL in a long time. Unless you use one of their enterprise products like their directory or identity management system, there's no real reason not to use CentOS... or really anything else for that matter.
Well... a very large portion of the value proposition of CentOS is that Red Hat will backport patches, and that CentOS users will get those patches for free, in case they don't want to change anything until they move up to the next major release.
Some of us don't need -- or want -- the latest and greatest. I want servers that (other than installing updates) just work and don't have to be constantly maintained.
It's not 3.10 from kernel.org. There are many backported patches adding support for selected new features, new hardware, etc. It doesn't make much sense to just compare numbers.
Comparing the version numbers gives a good impression of how much of a maintenance burden RedHat is taking upon themselves. You can't simply presume that RedHat's 3.10 fork is missing any particular feature or fix from future upstream versions, but it's quite obvious that RedHat's fork has to be missing something by now, and probably a lot of somethings.
Enterprise customers care much more about stability than features. Backporting features is the easy part, once you consider the amount of work needed to provide enterprise-level guarantees on their quality. Weeding out bugs in those features is much, much harder.
I'm really confused by the choice of development process for BTRFS. First they write huge amounts of experimental/buggy code and then spend years trying to fix it. Reliability was obviously not the primary goal and I doubt they will be able to retrofit it.
This is a bad move. More than ever we need Linux standards and this goes in the opposite direction. It would be OK for SUSE to support btrfs as a first class choice of filesystem but the extfs family should be the standard one.
And ZFS works just fine on Linux. If people want to use it, then the distros should not put roadblocks in their way and that means, ZFS should also be supported as a first class choice for servers.
Note that "support" is different from licensing and from "included in our install repo". It is OK to have different licensing for things like this and to install it from a non-distro repo. Look at PostgreSQL for an example of how you can install a mission critical tool from a distro-compatible repo that is run by the upstream project, in this case, PostgreSQL.
But even though it comes from a different repo, it should be "supported" by the distro to the extent that they make a best effort to help people resolve problems. It doesn't mean that you need to be experts in every nuance of the tool and the best way to do that is to maintain a good working relationship with these upstream projects.
Something like ZFS or PostgreSQL is a mission-critical tool that uses Linux as the interface to the hardware. My comments do not apply to any random app or utility that someone wrote for Linux. Perhaps btrfs belongs in this class, but I personally don't know since I have not used it.
Most of the things by freedesktop.org in the last, I don't know, 5-10 years: ConsoleKit/logind, AppData, D-Bus being used for the lower parts of user space, and probably many more that I don't recall now. Oh, and PulseAudio.
The recent Red Hat decision may also be related to their investments in storage products Ceph (~$175 million in 2014) and GlusterFS (~$125 million in 2011), not just to the stability of btrfs.
As someone who has previously worked with distributed storage like EMC Isilon and also FUSE-based GlusterFS, I think btrfs has huge potential for enterprise storage. It does also need more effort on the testing front at this moment. I'm hoping that btrfs will soon become the default FS for most Linux distros.
I read somewhere (probably on HN) that it could be related to their acquisition of Permabit, a company which seems to produce Linux software for deduplication, compression and thin provisioning. This seems more in line with what btrfs has to offer.
What do distributed software-defined storage/distributed filesystem projects have to do with a CoW filesystem? Alternatives to btrfs are things like XFS and ext4.
I've taken a long look at CoW filesystems over the past few days as part of setting up a new workstation, and it appears that while btrfs is not a viable long-term solution because of its quality issues and lack of development, ZFS does not pretend to be a filesystem for the general use case, as exemplified by the lack of offline deduplication, of defragmentation, of the ability to easily change disk configurations, and more.
Maybe we should consider accelerating the development of bcachefs as the future of reliable and feature-rich filesystems on Linux, which appears both more modern and holistic, but still has quite some ways to go.
A new filesystem is necessary because many things we should demand from them are not modularly composable without massive disadvantages. Implementing compression as a layer, for example, demands basically creating another filesystem to manage the space, with great overhead in all dimensions. Similar things go for the consistency guarantees provided by COW or the deduplication and snapshotting that depends on it.
Two days ago I shut down my laptop and everything was working great at that point. Yesterday I turned it back on to get some work done and couldn't get past maintenance mode because my home partition had somehow become corrupted. It literally took me all day to get it working. I have since changed that partition back to ext4, leaving only / as btrfs.
I use BTRFS for non-critical data, like scratch data and secondary backup archives. This is because I can stretch it out over the many old hard drives I have lying around. The memory footprint is also very low. And it comes with compression!
But for everything else, I have RAMed up, manned up, and use ZFS.
Oracle sells Oracle Linux, a Red Hat clone. ZFS on Linux (ZoL) isn't controlled by Oracle; it is a fork of the OpenSolaris version of ZFS. The ZFS sold by Oracle on Solaris isn't the same product and doesn't support Linux. Oracle doesn't want to support it on Linux because it would fragment one of the remaining cash cows they got from Sun. And anyway, they probably couldn't support ZoL even if they wanted to without putting themselves into self-induced legal gray water. Apparently, they intend to move Solaris to a rolling model like Windows 10, but focused on legacy. They can sell ZFS there without affecting their Linux operations. There is also a rumor of a deal with NetApp not to support ZFS on Linux. All combined, BTRFS is a better-suited technology for them to support.
illumos is the repo of record for OpenZFS[1], which is the community fork of OpenSolaris' ZFS (which is now proprietary). Most of the really cool new features are in OpenZFS because it has far more developer involvement (from FreeBSD, illumos, etc).
Basically ZFS is Solaris (Oracle), everything else is OpenZFS. And every Solaris-like OS is at best just going to be an OpenSolaris (or derivative) fork.
No, they split when Oracle bought Sun. The only part of OpenZFS, which everything but Solaris uses, that involves Oracle is a number of patents that the CDDL grants permission to use.
This is a bit of a tangent, but I think it's important to remember that a huge corporation like Oracle has quite a different decision making process than individuals.
Yes, one division of Oracle owns, maintains and develops ZFS. But one of the many other divisions (maybe for historical reasons, maybe because it was acquired and never migrated, ...) might use btrfs, and it makes sense for them to pour manpower into it, even if that could be perceived as somewhat of a competition to ZFS.
My main point is that big organizations naturally tend to do things that look conflicting from the outside, just because they are too large to be efficiently standardized.
> This is a bit of a tangent, but I think it's important to remember that a huge corporation like Oracle has quite a different decision making process than individuals.
Oracle started BTRFS as a "me too" project to show Sun they could do something like ZFS (well, they couldn't). So after the acquisition of Sun they ended up with both. Sun's strategy was to use ZFS as one of the main selling points of Solaris, and it refused to port it to Linux; somebody eventually ported it anyway, so Oracle ended up with ZFS under an awkward license plus a buggy BTRFS that they slowly phased out to Red Hat and SUSE.
They started btrfs several years before they acquired Sun and ZFS with it. Why they continued with it is unclear; presumably a combination of inertia and not wanting to relicense ZFS so that other Linux distributions could ship it.
That's not accurate. Upstream btrfs doesn't support RAID56[1]. However, neither their multiple devices page[2] nor their gotchas page[3] mention that raid0, raid1 or raid10 are not recommended for production. Do you have a citation for your claim?
Call me conservative but "Can get stuck in irreversible read-only mode if only one device is present" doesn't equate to "okay for production use" when I'm considering deploying a filesystem; likewise "device replace" being "mostly OK".
As I've explained elsewhere, that "irreversible read-only mode" is well documented and completely avoidable, and even if you do trigger it a one-line kernel patch will bypass the overzealous safety check and allow you to complete the recovery process. If you're actually using something in production, you should probably RTFM so you can avoid shooting yourself in the foot.
"As I've explained elsewhere, that "irreversible read-only mode" is well documented and completely avoidable"
And yet it hasn't been fixed despite being well-known? That tells me to stay away from it more than anything else.
"even if you do trigger it a one-line kernel patch will bypass the overzealous safety check"
I shouldn't need to hackjob my kernel to make a single individual lone drive work. If you can't work with a single hard drive, you have no business trying to work with multiple hard drives.
> I shouldn't need to hackjob my kernel to make a single individual lone drive work.
You don't. You only need the hack if you do a bad job of cleaning up after the loss of the other half of your two-drive mirror. ZFS won't let you transition in-place from RAID to non-RAID at all. Btrfs just requires that you not reboot in the middle of that migration.
"Insane" is pretty strong for a temporary limitation that is just as severe with traditional RAID arrays. A sudden power failure or any other hardware problem cropping up during an array rebuild is a nightmare scenario.
It should be noted that md-raid does handle that scenario. I agree with you that the characterisation of this present limitation in btrfs is quite unfair, but not all RAID systems are susceptible to that problem.
> "Here's a fact: the upstream's official statement is that RAID 1 isn't production-ready."
That doesn't seem to match anything I've seen. The btrfs status page classifies RAID1 and RAID10 features as "mostly OK". The one and only documented caveat is that when a disk in a RAID1 fails, you should mount the filesystem as read-only until you are ready to fix the problem by either replacing the failed disk or converting to a non-RAID profile. There's a real bug underlying this limitation, but the only way to encounter this bug is to be quite cavalier about how you handle a degraded array, by doing something that you shouldn't expect to be safe.
A degraded array should have no substantially greater risk of failure than the single disks the vast majority of people rely on daily. I can think of one obvious reason one might mount the degraded array simply to copy data from it to a different arrangement. Another reason is that in a non-enterprise use case, a desktop user might not instantly have a disk available and might mistakenly believe that this is no more risky than having only one disk in the first place.
This is ridiculously flaky for something that is supposed to be years in development with backing from major companies.
This confirms my feeling that btrfs belongs in the waste basket.
Don't nitpick about what a two-word summary implies when there's only one underlying bug at issue. Just address the bug itself. "Production ready" never means 100% bug-free.
> A degraded array should have no substantially greater risk of failure than the single disks the vast majority of people rely on daily.
If you are comfortable dropping from the RAID1 profile to a non-RAID profile, btrfs requires you to explicitly make this conversion; there's no safe way to make it automatic. Forcing the filesystem to accept writes while it's still in RAID1 mode but is incapable of providing RAID1 data integrity is something that you should expect to cause problems.
> I can think of one obvious reason one might mount the degraded array simply to copy data from it to a different arrangement.
You can mount read-only as many times as you want if you're going to copy the remaining data to a different filesystem. If you want to do an in-place recovery, the current limitation is only that you shouldn't mount the degraded array as writable until you're ready to make it not degraded anymore.
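To make that concrete, a hedged sketch of the two in-place recovery paths (device names, devid and mount point are made up):
$ sudo mount -o degraded,ro /dev/sda2 /mnt        # read-only: safe to inspect or copy off as many times as you like
$ sudo mount -o degraded /dev/sda2 /mnt           # writable: do this once, then immediately either...
$ sudo btrfs replace start 2 /dev/sdc /mnt        # ...rebuild onto a new disk (2 = devid of the missing drive)
$ sudo btrfs balance start -dconvert=single -mconvert=dup /mnt   # ...or convert down to non-RAID profiles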
> I wish that if a drive fail, the btrfs filesystem still mounts rw and leave the OS running, but warns the user of a failing disk and easily allow the addition of a new drive to reintroduce redundancy.
This might make sense for you, but is insane as a default policy. Manual intervention should be required before the FS will accept writes with less than the configured degree of redundancy. Silently mounting and hoping the user notices something in their logs is too dangerous.
> I created a raid1 btrfs filesystem by converting an existing single btrfs instance into a degraded raid1, then added the other drive
This seems backwards. Why not add the second drive, then convert to RAID1?
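I.e., something along these lines (device and mount point names are made up):
$ sudo btrfs device add /dev/sdb /mnt
$ sudo btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt   # now both data and metadata are mirrored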
-
Note that the patch to work around the refusal to mount is extremely simple, and that the patch is quite safe if used properly. But it's not really an acceptable solution for upstreaming, because it will lead to bigger problems if used in more complicated situations. There are several potential solutions that would be safe and widely deployable, but all involve changing far more than two lines of code.
In a large cluster, you'd probably plan on replacing every drive that failed rather than reconfigure to use less redundancy. So in that case, you'd probably want the hot spare feature to be stabilized and upstreamed. Then the FS could automatically copy over (or reconstruct from parity) data for a missing disk, without modifying data on the surviving disks.
In environments with a smaller budget that want to get the system back up and running before a replacement drive is available, it could be valuable to be able to pre-specify that the system should rebalance with less redundancy when a drive goes missing. I'm not aware of any work to implement this kind of feature. No enterprise customer would want or use this feature, and even a home user on a shoestring budget wouldn't necessarily want this rebalancing to happen automatically. (What if the drive was only temporarily missing, such as from a failed or loose SATA cable? You wouldn't want to do a ton of re-writing of data only to have to reverse it on the next boot.)
A hot spare is expensive. I want pure redundancy. That is, a one-drive failure should still leave a perfectly operational server/node/box.
That's what RAID1 usually means.
And sure, you can't survive the next failure without replacing the failed drive and resyncing.
But the remount-read-only thing is something different. It's a useful failure mode, but it doesn't help with operational simplicity.
(If the SATA cable is loose, then it'll cause intermittent failures, you'll see it in the log, and there will be a lot of resync events. And probably degraded performance, a lot of ATA (or, in the case of SAS, SCSI) errors, and other bus/command errors that go away on retry. And with SMART it's possible to at least guess that it's not the drive. It'd be great to have an error-notification interface from the kernel, so a tool could dig into the relevant subsystem's perf and health data and try to guess exactly which component is faulty.)
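As a hedged example of that kind of guessing with smartctl (device name and attribute values below are invented):
$ sudo smartctl -A /dev/sda | grep -iE 'reallocated|pending|crc'
  5 Reallocated_Sector_Ct   ...   0
199 UDMA_CRC_Error_Count    ...   1432
# CRC errors climbing while the sector counts stay at 0 usually points at the cable, not the drive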
> but the only way to encounter this bug is to be quite cavalier about how you handle a degraded array, by doing something that you shouldn't expect to be safe.
The way to encounter it is to mount the array a second time. That's hardly "cavalier handling".
A decade in, and RAID 1 doesn't work right. Saying this clearly offends the feels here at HN, but it's a fact.
> The way to encounter it is to mount the array a second time. That's hardly "cavalier handling".
`mount` is not the same as `mount -o degraded,rw`. The latter should raise the eyebrows of anyone paying attention to what they're typing. Odds are that you'll have to consult the docs to even find these mount options, because it doesn't happen automatically. This is where any careful, sane user who's concerned about their data would spend a few more minutes thinking through their entire recovery procedure.
There's one corner case to RAID1 recovery where the current tooling does not fully prevent a careless user from putting the FS in a bad state. This is not the same as "RAID1 doesn't work right".
Consider that most users will be careless, because running a filesystem to them is as exciting as any of the other dozen OS services. Having a corner case like this in RAID1 is, to me, like having a corner case in encryption...
That I immediately ran into this case with my first btrfs RAID1 problem makes me think that there is a steep slope down into this corner.
If, when a drive in your RAID fails, your first response is anything but checking the docs for the recovery procedure, you're going to end up disappointed sooner or later. When your fault-tolerant system tells you something broke, leaving it in a fragile state, and your response is to tell the system to ignore that and pretend everything is normal, expect trouble. You're working in a domain where there is no one right answer, so the system cannot magically anticipate how to handle the exceptional situation. None of the above is the fault of btrfs. None of this is avoidable. This problem has to be faced by ZFS, too.
What is avoidable is that btrfs could make it harder to do the equivalent of `mount -o degraded,rw`. But ultimately, there will be some mechanism for modifying a degraded array, and it'll get documented and then excerpted in blog posts and StackOverflow answers without all the context, and users will find a way to work themselves into a corner. There are all kinds of ways to do this with ZFS, too. ZFS tends to default to the approach of requiring you to copy all your data elsewhere and rebuild your array from scratch. What btrfs is doing here is no worse, except that it's a bit less up-front about the limitation because it's actually completely avoidable and this is a fixable UI bug, not a deep-seated architectural limitation.
No, probably not. But this particular bug isn't the reason why users who need very high-level assurances should avoid btrfs for now. The nature of this bug does not lend itself to being used as an argument that btrfs is unsafe in general.
There are plenty of instances of this bug being brought up on the mailing list. One of them is already linked elsewhere in this discussion, and the btrfs status page (also linked from this discussion) has further mailing list links.
Basically, btrfs doesn't want to allow a writeable mount when it might be missing some data. If there's some data on the FS that isn't stored with the RAID1 profile, then the kernel can't safely assume that the missing drive didn't have more chunks like that, holding data that wasn't mirrored on one of the surviving drives. But it's currently not possible to convert from RAID1 to non-RAID or to rebuild the array with a replacement without mounting the degraded array as writeable, which leads to non-RAID data being written. That puts the FS in a state that cannot be automatically judged safe at mount time, and the FS remains in that state until the recovery is complete (either converting from RAID1 to non-RAID, or replacing the failed drive).
There's no easy way to require the user to specify at the time of the `mount -o degraded,rw` whether they intend to resolve the situation by ceasing to use RAID1 or by replacing the failed drive. That leaves users with the opportunity to do neither and instead make the situation worse.
Thanks for the explanation. I was hoping for a GitHub issue number (or Bugzilla or whatever) to easily track this bug, but perhaps the btrfs dev team doesn't work with issue numbers?
At least for RAID1, it seems that implementing RAID1 N-way mirroring would ease the process to recover from a failed drive.
In case of drive failure, we could use the remaining drive in read-only mode to copy the data to a new drive, hence creating a RAID1 array with two working drives and one failed drive.
The OS should then allow to boot in rw mode, and from there it is easy to remove the failed drive from the RAID1 array.
However it seems that RAID1 N-way mirroring (with N > 2) is not even on the roadmap at this moment.
Have I misunderstood something, or does this approach make sense?
You can do RAID1 with more than two drives, but you'll only get two copies of each chunk of data. In this scenario, when one drive dies you can still write new data in RAID1 to the remaining space on the surviving drives, so mounting the FS writeable in degraded mode doesn't risk leaving the FS in a state where the safety is hard to determine on the next mount. If space permits, you can also rebalance before even shutting down to remove the failed drive, also avoiding the corner case.
Being able to do N-way mirroring with three or more copies of the data would be nice, but it's not necessary; 2-way mirroring across 3 or more drives is sufficient, and the hot spare feature will be more widely useful.
I was referring to this sequence of events:
1) 2-way mirroring across 2 drives
2) one drive fails
3) buy and plug a new drive
4) rebalance to have 3-way mirroring across 3 drives (with one being out): this is currently not possible
5) remove the failed drive, ending with 2-way mirroring across 2 drives
But it seems that you are referring to:
1) 2-way mirroring across 3 drives
2) one drive fails
3) rebalance to have 2-way mirroring across the 2 working drives
4) remove the failed drive, ending with 2-way mirroring across 2 drives
I assume that people don't/won't start the initial RAID1 with 3 drives.
Anyway, I would find 3-way mirroring across 3 drives very useful, as it gives a simple, identical, foolproof process for replacing a faulty hard drive, whether it has just a bit of corrupted (but still readable) data or has completely failed: just plug in a new drive, rebalance, reboot and remove the defective drive.
> rebalance to have 3-way mirroring across 3 drives (with one being out): this is currently not possible
I'm not sure this even has meaning. But anyways, it's probably pointless to try to kick off a rebalance when the FS is still trying to use a dead drive. Either use the device replace command (which isn't stable yet), or tell btrfs to delete the dead drive then add the replacement drive. If the problem drive is failing but not completely dead yet, then the device replace command is supposed to move data over with a minimum of excess changes to drives other than the ones being removed and added. But the device replace command doesn't properly handle drives with bad sectors yet, so the separate remove and add actions are more reliable albeit slower and put more work on the other drives in the array.
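Roughly, the two routes being compared look like this (devid and device names are made up):
$ sudo btrfs replace start 3 /dev/sdd /mnt    # copy devid 3's data straight onto the new disk, minimal churn elsewhere
# versus the slower route that re-writes data via the surviving drives:
$ sudo btrfs device add /dev/sdd /mnt
$ sudo btrfs device delete missing /mnt       # migrates the dead drive's chunks onto the remaining devices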