A data-point of one, and I only have a few TB spread over a handful of disks, but I lost an entire 1TB btrfs volume recently because of a single disk sector gone bad. (Of course I have backups, but it's a pain to restore 1TB, not to mention nerve-wracking: will the backup drive fail under the sudden heavy use?)
It's the only FS I've ever been unable to recover data from at all (and I've had quite a few disks develop errors over the past 25 years), so it's no longer on my list of filesystems-I-trust.
I've experienced data loss twice in 15 years of using Linux and trying different file systems. Both were on btrfs. I didn't use any advanced features, though I had plans to before this happened. But at that point, I cut my losses and converted all my disks back to ext4.
The features are great, but data availability/integrity is the fundamental thing I want from a filesystem. If I write some bytes, they should be there tomorrow. Everything else is secondary.
If you're interested in a more featureful filesystem that isn't btrfs and has a very solid track record, I would recommend looking at XFS. It's a very old filesystem but has a lot of quite modern features (and it performs better than ext4 for several workloads).
Funnily enough, XFS is the only file system on Linux where I've lost data. Back in the day, the wisdom was to stay away from it if you were in an iffy power situation, because it would serve zeroes if a file was being written near a power loss (i.e. you wouldn't get the old file or the new file, but something else entirely).
Having had that happen to me, I always used some extN variant and never lost any existing files.
Of course that was a decade or more ago and I may be misremembering, but a cursory Google search shows other people encountered something like it too.
AFAIK the problem is that XFS trusts the hardware to do the right thing during a power loss, i.e. to stop in-flight requests before the brownout turns your disks and their controllers into heavily biased random number generators. Lots of x86(-64) hardware lacks a proper power-loss interrupt triggered early enough to stop all I/O in time. ext3's journaling hides that problem to some extent by journaling more than just the minimal metadata required.
For maybe 3-6 months a decade ago I was running XFS on my laptop. The laptop had some sort of flakiness, ISTR it was graphics related, and it required frequent reboots for a while. I remember one particular instance where, after a power cycle, a file I had been working on an hour earlier was trashed. That was a real "Oh FFS!" moment and I stopped using XFS.
But I will say I've had a lot of success using XFS to serve images, particularly tiles: map tiles that are generated once and are then basically static. There it has a lot less overhead when formatted with size=2048, so lots of small files are handled better. Of course, reiserfs was even better, but part of that was a job I worked on where they just blindly rendered map tiles for the whole earth, so there were a lot of 50-byte solid blue tiles. Reiser did really well with that, though some deduplication or linking would have been a huge benefit.
> I would recommend looking at XFS. It's a very old filesystem but has a lot of quite modern features (and it performs better than ext4 for several workloads).
I would second this. XFS is great. A few years ago I moved from JFS to XFS for bulk data storage because JFS is pretty abandoned these days. No issues at all with XFS, and never lost any data.
Just make sure it's right for you before you use it (e.g. you cannot shrink XFS volumes the way you can with ext4).
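A rough illustration of that difference, with made-up device and mount point names (XFS can only grow, online; ext4 can shrink, but only offline):
$ sudo xfs_growfs /data                     # XFS: grow to fill the enlarged device; there is no shrink
$ sudo umount /srv && sudo e2fsck -f /dev/sdb1
$ sudo resize2fs /dev/sdb1 500G             # ext4: shrink to 500G while unmounted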
XFS has the same problem as btrfs: extreme sensitivity to metadata corruption. I'm not sure XFS has a RAID1-for-metadata (DUP) feature like the one btrfs recently added.
Which RAID level were you using over those multiple disks? I've been using btrfs and never even experienced a hiccup, despite continuous sector failures, continuous drive failures and even disk controllers disconnecting on the fly. And I've been running the whole thing on a single, fairly weak server with second-hand disks attached over the cheapest USB3 SATA adapters I could find. By all accounts I've been putting btrfs in a situation that is well outside the norm and I have zero complaints. I also use it on my desktop (no RAID) with a similar experience (minus the shitty hardware).
I am continuously amazed at the amount of people who have issues with btrfs. It's been absolutely rock solid over the time I've used it and I have 0 complaints (apart from the ext4 -> btrfs converter producing a corrupted btrfs filesystem, but the actual kernel btrfs code itself has been flawless).
I too have lost an entire btrfs volume (single disk) because of a bad sector. It seems btrfs is very sensitive to bad sectors in metadata, compared to ext4.
Since then, I've reformatted my btrfs partition to use duplicated (DUP) metadata and haven't had any problems.
$ sudo btrfs fi df /
Data, single
System, DUP
Metadata, DUP
Perhaps this is the secret - and should be made the default.
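If anyone wants to do the same without reformatting, here's a hedged sketch of converting an existing single-device filesystem in place (mount point is illustrative):
$ sudo btrfs balance start -mconvert=dup /
$ sudo btrfs balance start -sconvert=dup -f /    # system chunks additionally need the force flag
$ sudo btrfs fi df /                             # should now show Metadata, DUP and System, DUP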
I don't use raid, precisely because I'm worried that failures will take down the entire set, and all of it will be unrecoverable.
I'm not at all suggesting that BTRFS is "buggy" or unreliable; just that, as you say, with certain kinds of disk errors (possibly rare, but it got me), the entire volume becomes unreadable, whereas I've always been able to recover files from extX or XFS volumes.
The default for single devices is DUP metadata, and for raid1 devices the default is raid1 metadata. Those are the defaults; somehow they got bypassed. I think you're right that you need to think a little more and know a little more about btrfs to use it properly, but I hope that will change in the future.
From the article:
'And against the demand of some partners, we are still refusing to support “Automatic Defragmentation”, “In-band Deduplication” and higher RAID levels, because the quality of these options is not where it ought to be.'
It's a relatively new feature (Jan-2016) and it's disabled by default for SSD devices (increased wear). Personally I prefer increased wear over unmountable volumes.
I have the same experience. btrfs has been great to me. I've disconnected disks on the fly as well on raid 1 and the mount is happily serving files off the remaining disks.
Some of us are very concerned about licensing at all times. I'd wager such steadfast concern is ultimately what helped Linux in SCO v. IBM over IBM's JFS file system.
ZFS is very much in a grey area, and in turn, a huge turn off for many.
And if so, what about all those proprietary GPU-driver kernel-modules from AMD and Nvidia which definitely are not compatible with the GPL? Are they a gray area too?
And if they're not a gray area... Why is it suddenly a problem that the ZFS kernel-module is not licensed in a GPL-compatible manner?
I'm not trying to sound facetious, but I honestly can't see the distinction here. And Linux distros left and right have been distributing closed-source kernel-modules for GPUs for a long time now. So what's the problem? What am I missing?
Yes, the proprietary drivers have always been an ethical and legal grey area. However, the way a proprietary driver is distributed is different to how ZFS is distributed. Proprietary drivers are distributed as object code that is then built to be a kernel module on the user's machine. This means that at no point does Nvidia or AMD distribute a Linux kernel module with proprietary code. There are arguments however that distributions which distribute this auto-build scheme by default may also be in violation of the GPL. ZFS is distributed by Canonical as a fully functional kernel module.
There's a lot of gray areas because there have not been many legal cases on derived works and the GPL. The GPL itself has held up in court on copyleft grounds, of course.
Also, you've got the fact that in the case of proprietary graphics drivers, the threat is that the Linux kernel community would sue Nvidia or AMD. The threat with ZFS is that Oracle (who is a member of the Linux kernel community) would sue people using ZFS (they could also then sue for patent infringement).
I know in the past NixOS did the same for the ZFS kernel module: prior to NixOS installation, it was able to download the ZFS source code, build it and then load it from inside the installer Live CD/USB while it was running, and this would require just a couple of commands. I don't know if this is still true nowadays or if they just distribute the ZFS kernel module directly in the Live installer.
If combining GPL and CDDL code in a redistributed file is a violation of the GPL but not the CDDL, then what would the holder of the copyright on the CDDL code be able to sue about? I think the concern is that a contributor to Linux might sue, just like in the Nvidia/AMD case.
Oracle is a contributor to Linux, so they could sue from the GPL side. While Nvidia and AMD also are Linux contributors, the fact they ship proprietary modules would make it hard to argue that they aren't implicitly permitting users to redistribute it (and thus they would be forced to license their drivers under GPLv2).
Not to mention that ZFS is covered by Oracle patents. CDDL provides a patent license, but it might be possible for Oracle to sue you for patent infringement if you're distributing code in a way that complicates the licensing. Not that I'm saying that's likely, but Oracle has enough money to ruin you if they want to.
The crucial difference is in distribution, not in usage.
AMD and Nvidia distribute a binary module with a thin shim. It is the user who builds this shim and inserts it into their kernel. Neither AMD, Nvidia nor any Linux vendor[1] distributes a binary that links into the GPL'd kernel. The combination is done by the user.
And the ZoL project does the same. It has the same trade-offs as AMD and Nvidia. It is just much more difficult to install Linux on a filesystem not supported by the installer than to go without working 3D acceleration at setup time, though.
[1] The distributions take care to keep them in separate, third-party repos, and if you look more closely, you will find they are built on the user's machine using dkms, akmod or a similar mechanism. That also throws a wrench into things if you want to use Secure Boot.
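For example, roughly what that looks like on a dkms-based setup (module version and kernel release below are made up, not from any particular distro):
$ dkms status
zfs/2.1.5, 5.15.0-76-generic, x86_64: installed
$ sudo dkms install zfs/2.1.5 -k 5.15.0-76-generic   # rebuilt locally for each new kernel, never shipped pre-linked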
Publish the source code files only under GPL and Linux kernel developers will be happy.
You might get into trouble with the ZFS side however, since the CDDL license says: "must also be made available in Source Code form and that Source Code form must be distributed only under the terms of this License".
If the Source Code form is only under the terms of the GPL, then the above condition is not met. The law suit would thus come from the ZFS side.
This is resolvable if you think of a driver "shim" that's a derivative work of both a GPL'd codebase (e.g. Linux) and a non-GPL'd codebase (GPU drivers). It's possible for the GPU driver's proprietary license to permit linking without license restrictions, meaning that a "shim" kernel module which loads the binary blob can be under the kernel's copyleft license without problem (since while it's a derivative work of both the module and the blob, only one license is copyleft).
(I don't know if this is what AMD/Nvidia actually do ToS-wise, but it's at least feasible).
Doing this for ZFS is harder, since the shim would have to be a derivative work of both a GPL'd codebase and a CDDL'd codebase, and neither license is compatible with the other. Dual-licensing is probably illegal, as is just picking one over the other.
IANAL, though, and I might be grossly misunderstanding the problem.
Using a shim as a legal workaround seems to me like a trick that is unlikely to ever get past a judge. In every other form in which people try legal workarounds to turn something illegal into something legal, the law generally doesn't take kindly to it and either explicitly forbids the trick or just treats it as inconsequential to the end result.
For example, we have shims in the form of dummy organizations, straw owners, money launderers and so on. None of those kinds of shims work to turn something illegal into something legal. Why would some dummy code that sits between two incompatible pieces of software work any better in the legal system?
"Neither of those kind of shims work to turn something illegal to legal."
There's nothing illegal here to try to turn legal, though. If the blob's license permits other software to interface with it without restriction, then the shim is free to be under the GPL per the terms of the other work from which it's derived (Linux).
The only way there'd be anything illegal here would be if the blob itself is a derivative work of Linux, and if that's the case, then the blob itself is illegal (since it would be subject to the GPL), regardless of the shim and regardless of whether or not it's distributed as part of a Linux distribution.
Take a dummy organization that is used to avoid taxes. There is nothing illegal about paying someone to form a company located in a different country. A company can also sell assets to another for whatever price it wants, and there is nothing illegal about declaring that, as a result, one of the companies now has zero profits and zero assets.
But whether something is legal or illegal depends on more than just each piece in isolation. If combining a blob with the kernel creates a derivative, then just adding a shim to the mix won't turn it legal. Judges are trained to look at the big picture rather than the individual parts, and this is especially true in civil-law systems. Courts are generally interested in:
What the kernel authors' intention is behind their copyright license.
What the blob author's intention is behind their copyright license.
What the distributor's intention was when combining the two works, and what understanding the distributor had about the other copyright holders and their wishes with regard to derivatives.
The existence of a shim that has no other purpose is a sign that something shady is going on, similar to dummy organizations. The question a judge is likely to ask is why such obfuscation was used, since that can help establish intention. If the end result is a de-facto derivative, and the intention was to create a single work out of two separate works, and the creator knows that they are not allowed to create a derivative work, then a shim is not going to save the day.
"Take a dummy organization that is used to avoid taxes."
See, that right there is where the analogy falls apart. Tax evasion is (usually) illegal. Writing software to translate between two independently-developed programs is not, last I checked.
"If the end result is a de-facto derivative"
You'd need to prove the blob is derived from Linux. If it's not, then there's literally nothing illegal happening here. If it is, then - again - the shim is not the illegal thing; the blob is, and the shim is entirely irrelevant in that illegality.
The usual reason for the shims, by the way, has very little to do with licensing terms (at least not directly) and very much to do with the fact that Linux does not provide a stable API (let alone ABI) for kernel modules. The intent of the shim is therefore almost always technical rather than legal in nature.
>ZFS is very much in a grey area, and in turn, a huge turn off for many.
There is a grey area in distribution, there isn't one about using ZFS. Even the FSF, the group who believe the CDDL and GPL can't be distributed together, say "Privately, You Can Do As You Like."
Who deploys production systems on a large scale with a "private" build of the filesystem code? I want my production systems to run on code that is being used by as many people as possible; I don't want patched kernels, I don't want privately built kernel packages, I don't want a unique system that only I've ever seen. I want a system that is as boring as possible (while still providing the functionality I need to effectively do my job). I want a system with a bunch of people complaining about it and asking questions about it online, so that when problems arise, I can find answers.
Now that ZFS is available on Ubuntu and seems to have some adoption, I guess it's a reasonable choice for some. I'm still a bit iffy on it. I don't really want to add license concerns to my list of worries.
The CDDL also has patent clauses and so it's conceivable that a user of OpenZFS which received it in a way that violates the OpenZFS license could be liable for patent infringement of an Oracle patent. And there have been many cases of companies suing users of software over patents.
Another issue is that you should always get software like your filesystem from your distribution. We do a lot of work making sure that your systems can be safely updated, and making sure that upstream bugs are fixed for our distribution. Even community distributions put a lot of effort into that work. As someone who works on maintaining a distribution (I work for SUSE and contribute to openSUSE), I would guess that most people underestimate how much work you need to devote to maintaining the software in a distribution.
>The CDDL also has patent clauses and so it's conceivable that a user of OpenZFS which received it in a way that violates the OpenZFS license could be liable for patent infringement of an Oracle patent. And there have been many cases of companies suing users of software over patents.
Again, this isn't a problem with usage; the license could only possibly be broken by distribution alongside the GPL. Even then, I believe it is the GPL that is broken, so the patent clause would remain.
As for the rest of your argument, the OpenZFS team does a lot of work maintaining the filesystem. Why does that work need to come from you?
> As for the rest of your argument, the OpenZFS team does a lot of work maintaining the filesystem. Why does that work need to come from you?
Integration into our tools, backporting fixes, doing release engineering, tracking upstream changes, triaging and resolving distribution bug reports, documenting usage and troubleshooting, configuring defaults and best practices, a whole lot of testing, etc.
As I said, there's a lot of work that goes into a distribution (I probably haven't covered most of it) that most people don't think about. And that's assuming that a distribution is going to be passive about something as core as a filesystem -- which we wouldn't be. So we'd be working with upstream on development as well, which is more work. So saying something like "it's supported on distribution X" when that distribution doesn't even provide official packages for it is a massive stretch. It might work on distribution X, and you might provide independent ISV-style support for it, but it's not supported by us.
I appreciate that the sort of work distributions do isn't well-publicised (mostly because stability is hardly a sexy thing to blog about, and we don't rewrite things in JavaScript every weekend). But there is an incredible amount of work that goes into making distributions work well for users, and there's a reason that many distributions have lasted for so many years (there's a need for someone to do the boring work of packaging for you).
If I know that Canonical can't legally distribute ZFS in whatever format to me, and yet I use Canonical's distribution of ZFS, isn't there a legal risk there? After all, it would turn out that I have no license to use said distribution of ZFS as such a license was never conferred to me by someone with the legal right to do so.
Generally speaking, courts would probably give me the benefit of the doubt if I had no reason to believe that they couldn't distribute it to me - but as I knew they couldn't (the issues with ZFS and the Linux kernel are well-documented), and I knew I'm using it, they'd probably hold me in violation of copyright.
>If I know that Canonical can't legally distribute ZFS in whatever format to me, and yet I use Canonical's distribution of ZFS, isn't there a legal risk there?
If you somehow knew that any form of distribution was illegal, that would be the case. I haven't heard anyone saying that's the case; distributing it bundled with GPL software potentially breaks the GPL.
Well, two things. He didn't say that; he said it's been PORTED to Linux for quite some time. Which is accurate: the first "stable" release was in 2013, and 4 years qualifies as "quite some time" in pretty much any tech circle that isn't VMS.
As for it not being widely available... you're saying the only way for something to be considered widely available is to be included in the distro directly? I'd argue an easy installation/setup and solid documentation are FAR more important than being included in a distro. If the setup is arcane or the documentation is horrible, it doesn't matter if a tool is in every distro on the planet, nobody is going to use it.
I've been using ZFS on Linux for 5+ years. I know a lot of people wrote off zfs+fuse, but I used it very successfully for years to store tens of TB of backup data, with no data-loss events and performance that, as far as I could tell, was no worse than ZFSonLinux. And I've been using ZoL for years at another job.
It has "official support" from the ZoL folks. And yes, openSUSE has ZFS packages in OBS. But we sure as hell don't ship them by default, or in our official repos. The same applies for Arch, Debian, Fedora, Gentoo, RHEL and CentOS.
Ubuntu is the only distribution that officially supports ZoL and actually ships it in its official repositories (and by default). What that means is that Canonical is effectively saying "we trust there's no legal reason why we cannot do this." No other distribution has made that claim.
EDIT: Actually NixOS also supports it, but the point stands.
They're referring to ZoL providing support for distributions (which actually just means "it works, and if you send a bug we'll work on it"). Only Canonical provides support from the distribution side. See https://news.ycombinator.com/item?id=15088761.
BTRFS has much more flexible snapshots and clones than ZFS. You can create rewritable snapshots and create new snapshots based on those. In addition you can create COW copies of files with "cp --reflink" which ZFS doesn't support.
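E.g., a quick sketch (file names are made up):
$ cp --reflink=always disk.img disk-clone.img   # instant CoW copy; extents are shared until modified
$ cp --reflink=auto large.tar backup/large.tar  # falls back to a normal copy on filesystems without reflink support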
BTRFS feels much more like a native filesystem on Linux too. It doesn't have that big ARC cache.
We've been using BTRFS for 5+ yrs on a Linux MD RAID setup, with no problems at all.
> You can create rewritable snapshots and create new snapshots based on those.
You can do this with ZFS as well.
First, there is no such thing as a "rewritable snapshot"; the term is an oxymoron. Please do not use it and please do not promulgate it. The term "snapshot" should be reserved for read-only states of a (file) system at a particular point in time.
As long as the clone exists, though, the snapshot cannot be deleted. However, you can "promote" the clone, after which the snapshot can be removed. This feature has been around since at least 2010.
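For reference, a rough sketch of that workflow (pool and dataset names are made up):
$ zfs snapshot tank/data@monday
$ zfs clone tank/data@monday tank/data-work   # writable clone backed by the snapshot
$ zfs promote tank/data-work                  # the clone becomes the origin, so the old chain can be cleaned up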
I was aware of the promotion feature. I was creating new clones/snapshots in a chain hierarchy in zfs, copying old backup sets progressively onto the back of the chain, but keeping the head as the current copy. This was a breeze in btrfs, but basically impossible in zfs as it refused to promote the old clones/snapshots.
As to the nomenclature, it doesn't seem to make sense to differentiate snapshots and clones. With the flexibility of btrfs they're just the same thing with a R/O flag.
> As to the nomenclature, it doesn't seem to make sense to differentiate snapshots and clones. With the flexibility of btrfs they're just the same thing with a R/O flag.
It does make sense: one is writable and the other is not. When someone is talking about (e.g.) mitigation mechanisms against ransomware, saying you have "snapshots" is meaningless if they're R/W, as the ransomware can go in and overwrite files. But if you use the term "snapshot" correctly--meaning R/O--everyone involved knows you have mitigated the risk, since the data is safe from being altered and reverting is possible.
It's not the "same thing" if there is a difference between the two--which there is: the R/O flag setting. If two things are different then they are not the "same"; this may seem tautological but it's not. Call different things by different names.
The btrfs CLI is really retarded in this regard, where creating a "snapshot" is not R/O by default; it violates decades of expectations and POLA.
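For anyone following along, the read-only behaviour has to be asked for explicitly (paths are illustrative):
$ sudo btrfs subvolume snapshot /home /snapshots/home-rw      # default: writable, i.e. what ZFS would call a clone
$ sudo btrfs subvolume snapshot -r /home /snapshots/home-ro   # -r gives an actual read-only snapshot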
The term "clone" has been used in the storage industry for several decades now. It has a very specific meaning, as does the term "snapshot". I understand it might not make sense to you personally, but you're just going to confuse anyone you're talking to referring to it by another name. You might as well start calling containers "light virtual machines". You'll get an equal number of blank stares and confusion from whomever you're talking to.
OpenSolaris is effectively dead, but its successor is the illumos project -- which has multiple distributions such as SmartOS. So you could use SmartOS+ZFS.
I love ZFS, but I'll also say that I feel like ZFS is fairly slow. Of course, it probably doesn't help that most of my ZFS machines hold 10+TB of data, with lots of snapshots.
I don't feel like I'm going to be using btrfs anytime soon, I've given up on it. But there are days I wish I had an alternative for high reliability, snapshotting, and ideally deduplication (which usually tends to make my ZFS machines fall over).
Synology DSM is based on their own custom Linux. Nothing's changed there recently, except that with 6.x released last year, btrfs became the default file system (used to be ext4).
Same here. I lost a disk to btrfs and did not manage to recover any data. No longer on my list of filesystems-I-trust. I guess Suse is the biggest contributor since the others stopped? It doesn't give me any confidence that it can be trusted.
I've had similar experience around 5-6 years ago. One of the notebooks with ubuntu was force rebooted and btrfs got corrupted, no way to boot the system again.
If Brazil, one of the world's biggest producers of beef, announced it would stop producing fish, would you wonder whether Peru, one of the world's biggest producers of fish, was also going to stop producing fish?
I find this quite weak argumentation coming from SUSE. Novell/SUSE was also (by far) the largest contributor to AppArmor at some point, and SUSE used it as the MAC framework in their consumer and enterprise distributions. Then, seemingly out of nowhere, they fired the AppArmor team.
That's an important fact about the sustainability of any corporate contributor, but it's not relevant to the thesis of the article (which is that each corporate contributor makes decisions independently of the others).
As has been discussed in other threads, RHEL's kernels are ancient compared to mainline and so they probably decided it's just too much work to continually backport btrfs changes onto increasingly dated kernels. The sky is not (necessarily) falling.
What put it in perspective for me was that here's a company with a product, and they make their money by giving those products the utmost attention to things like compatibility and stability, and that means helping when these things break. This is not them making a political move or drawing a line in the sand; this is them saying they cannot afford to put the same 110% into this project, so they're not going to carry it forward. It's the circle of life, and I appreciate them for the rest of their work even if this is mildly inconvenient in some cases.
But if they thought that btrfs was the future, they would definitely have put in the manpower to maintain and backport btrfs. Red Hat probably believes that XFS + LVM et al. can provide the necessary functionality for enterprise setups on a more reliable basis.
They were never heavily invested in btrfs and they simply already employ a lot of XFS developers. I also don't think it's that easy to put more manpower into it. People with expertise in such a rather complex niche field don't grow on trees.
Uh, the entire value proposition of RHEL is that Red Hat will backport such patches, supporting customers who don't want to change anything until they move up to the next major release.
That just feels tragic and silly. Probably many of the same types of shops that won't build the automated dev/ops infrastructures and end-to-end integration tests that let people move up stacks and updates rapidly.
I haven't worked at a place that used RHEL in a long time. Unless you use one of their enterprise products like their directory or identity management system, there's no real reason not to use CentOS... or really anything else for that matter.
Well... a very large portion of the value proposition of CentOS is that Red Hat will backport patches, and that CentOS users will get those patches for free, in case they don't want to change anything until they move up to the next major release.
Some of us don't need -- or want -- the latest and greatest. I want servers that (other than installing updates) just work and don't have to be constantly maintained.
It's not 3.10 from kernel.org. There are many backported patches adding support for selected new features, new hardware, etc. It doesn't make much sense to just compare numbers.
Comparing the version numbers gives a good impression of how much of a maintenance burden RedHat is taking upon themselves. You can't simply presume that RedHat's 3.10 fork is missing any particular feature or fix from future upstream versions, but it's quite obvious that RedHat's fork has to be missing something by now, and probably a lot of somethings.
Enterprise customers care much more about stability than features. Backporting features is the easy part, once you consider the amount of work needed to provide enterprise-level guarantees on their quality. Weeding out bugs in those features is much, much harder.
I'm really confused by the choice of development process for BTRFS. First they write huge amounts of experimental/buggy code and then spend years trying to fix it. Reliability was obviously not the primary goal and I doubt they will be able to retrofit it.
This is a bad move. More than ever we need Linux standards and this goes in the opposite direction. It would be OK for SUSE to support btrfs as a first class choice of filesystem but the extfs family should be the standard one.
And ZFS works just fine on Linux. If people want to use it, then the distros should not put roadblocks in their way and that means, ZFS should also be supported as a first class choice for servers.
Note that "support" is different from licensing and from "included in our install repo". It is OK to have different licensing for things like this and to install it from a non-distro repo. Look at PostgreSQL for an example of how you can install a mission critical tool from a distro-compatible repo that is run by the upstream project, in this case, PostgreSQL.
But even though it comes from a different repo, it should be "supported" by the distro to the extent that they make a best effort to help people resolve problems. It doesn't mean that you need to be experts in every nuance of the tool and the best way to do that is to maintain a good working relationship with these upstream projects.
Something like ZFS or PostgreSQL is a mission-critical tool that uses Linux as the interface to the hardware. My comments do not apply to any random app or utility that someone wrote for Linux. Perhaps btrfs belongs in this class, but I personally don't know since I have not used it.
Most of the things by freedesktop.org in the last, I don't know, 5-10 years: ConsoleKit/logind, AppData, D-Bus being used for the lower parts of user space, and probably many more that I don't recall now. Oh, and PulseAudio.
The recent Red Hat decision may also be related to their investments in storage products Ceph (~$175 million in 2014) and GlusterFS (~$125 million in 2011), not just to the stability of btrfs.
As someone who has previously worked with distributed storage like EMC Isilon and also FUSE-based GlusterFS, I think btrfs has huge potential for enterprise storage. It does also need more effort on the testing front at this moment. I'm hoping that btrfs will soon become the default FS for most Linux distros.
I read somewhere (probably on HN) that it could be related to their acquisition of Permabit, a company which seems to produce Linux software for deduplication, compression and thin provisioning. This seems more in line with what btrfs has to offer.
What do distributed software-defined storage/distributed filesystem projects have to do with a CoW filesystem? Alternatives to btrfs are things like XFS and ext4.
I've taken a long look at CoW filesystems over the past few days as part of setting up a new workstation, and it appears that while btrfs is not a viable long-term solution because of its quality issues and lack of development, ZFS does not pretend to be a filesystem for the general use case, as exemplified by the lack of offline deduplication, of defragmentation, of the ability to easily change disk configurations, and more.
Maybe we should consider accelerating the development of bcachefs as the future of reliable and feature-rich filesystems on Linux, which appears both more modern and holistic, but still has quite some ways to go.
A new filesystem is necessary because many things we should demand from them are not modularly composable without massive disadvantages. Implementing compression as a layer, for example, demands basically creating another filesystem to manage the space, with great overhead in all dimensions. Similar things go for the consistency guarantees provided by COW or the deduplication and snapshotting that depends on it.
Two days ago I shut down my laptop and everything was working great at that point. Yesterday I turned it back on to get some work done and couldn't get past maintenance mode because my home partition had somehow become corrupted. It literally took me all day to get it working. I have since changed that partition back to ext4, leaving only / as btrfs.
I use BTRFS for non-critical data, like scratch data and secondary backup archives. This is because I can stretch it out over the many old hard drives I have lying around. The memory footprint is also very low. And it comes with compression!
But for everything else, I have RAMed up, manned up, and use ZFS.
Oracle sells Oracle Linux, a Red Hat clone. ZFS on Linux (ZoL) isn't controlled by Oracle; it is a fork of the OpenSolaris version of ZFS. The ZFS sold by Oracle on Solaris isn't the same product and doesn't support Linux. Oracle doesn't want to support it on Linux because it would fragment one of the remaining cash cows they got from Sun. And anyway, they probably couldn't support ZoL even if they wanted to without putting themselves into self-induced legal gray water. Apparently, they intend to move Solaris to a rolling model like Windows 10, but focused on legacy. They can sell ZFS there without affecting their Linux operations. There is also a rumor of a deal with NetApp not to support ZFS on Linux. All combined, BTRFS is a better-suited technology for them to support.
illumos is the repo of record for OpenZFS[1], which is the community fork of OpenSolaris' ZFS (which is now proprietary). Most of the really cool new features are in OpenZFS because it has far more developer involvement (from FreeBSD, illumos, etc).
Basically ZFS is Solaris (Oracle), everything else is OpenZFS. And every Solaris-like OS is at best just going to be an OpenSolaris (or derivative) fork.
No, they split when Oracle bought Sun. The only part of OpenZFS, which everything but Solaris uses, that involves Oracle is a number of patents that the CDDL grants permission to use.
This is a bit of a tangent, but I think it's important to remember that a huge corporation like Oracle has quite a different decision making process than individuals.
Yes, one division of Oracle owns, maintains and develops ZFS. But one of the many other divisions (maybe for historical reasons, maybe because it was acquired and never migrated, ...) might use btrfs, and it makes sense for them to pour manpower into it, even if that could be perceived as somewhat of a competition to ZFS.
My main point is that big organizations naturally tend to do things that look conflicting from the outside, just because they are too large to be efficiently standardized.
> This is a bit of a tangent, but I think it's important to remember that a huge corporation like Oracle has quite a different decision making process than individuals.
Oracle started BTRFS as a "me too" project to show Sun they could do something like ZFS (well, they couldn't). So after the acquisition of Sun they ended up with both. Sun's strategy was to use ZFS as one of the main selling points of Solaris, and it refused to port it to Linux; somebody eventually ported it anyway, so Oracle ended up with ZFS under an awkward license plus a buggy BTRFS that they slowly phased out to Red Hat and SUSE.
They started btrfs several years before they acquired Sun and ZFS with it. Why they continued with it is unclear; presumably a combination of inertia and not wanting to relicense ZFS so that other Linux distributions could ship it.
That's not accurate. Upstream btrfs doesn't support RAID56[1]. However, neither their multiple devices page[2] nor their gotchas page[3] mention that raid0, raid1 or raid10 are not recommended for production. Do you have a citation for your claim?
Call me conservative but "Can get stuck in irreversible read-only mode if only one device is present" doesn't equate to "okay for production use" when I'm considering deploying a filesystem; likewise "device replace" being "mostly OK".
As I've explained elsewhere, that "irreversible read-only mode" is well documented and completely avoidable, and even if you do trigger it a one-line kernel patch will bypass the overzealous safety check and allow you to complete the recovery process. If you're actually using something in production, you should probably RTFM so you can avoid shooting yourself in the foot.
"As I've explained elsewhere, that "irreversible read-only mode" is well documented and completely avoidable"
And yet it hasn't been fixed despite being well-known? That tells me to stay away from it more than anything else.
"even if you do trigger it a one-line kernel patch will bypass the overzealous safety check"
I shouldn't need to hackjob my kernel to make a single individual lone drive work. If you can't work with a single hard drive, you have no business trying to work with multiple hard drives.
> I shouldn't need to hackjob my kernel to make a single individual lone drive work.
You don't. You only need the hack if you do a bad job of cleaning up after the loss of the other half of your two-drive mirror. ZFS won't let you transition in-place from RAID to non-RAID at all. Btrfs just requires that you not reboot in the middle of that migration.
"Insane" is pretty strong for a temporary limitation that is just as severe with traditional RAID arrays. A sudden power failure or any other hardware problem cropping up during an array rebuild is a nightmare scenario.
It should be noted that md-raid does handle that scenario. I agree with you that the characterisation of this present limitation in btrfs is quite unfair, but not all RAID systems are susceptible to that problem.
> "Here's a fact: the upstream's official statement is that RAID 1 isn't production-ready."
That doesn't seem to match anything I've seen. The btrfs status page classifies RAID1 and RAID10 features as "mostly OK". The one and only documented caveat is that when a disk in a RAID1 fails, you should mount the filesystem as read-only until you are ready to fix the problem by either replacing the failed disk or converting to a non-RAID profile. There's a real bug underlying this limitation, but the only way to encounter this bug is to be quite cavalier about how you handle a degraded array, by doing something that you shouldn't expect to be safe.
A degraded array should have no substantially greater risk of failure than the single disks the vast majority of people rely on daily. I can think of one obvious reason one might mount the degraded array simply to copy data from it to a different arrangement. Another reason is that in a non-enterprise use case, a desktop user might not instantly have a disk available and might mistakenly believe that this is no more risky than having only one disk in the first place.
This is ridiculously flaky for something that is supposed to be years in development with backing from major companies.
This confirms my feeling that btrfs belongs in the waste basket.
Don't nitpick about what a two-word summary implies when there's only one underlying bug at issue. Just address the bug itself. "Production ready" never means 100% bug-free.
> A degraded array should have no substantially greater risk of failure than the single disks the vast majority of people rely on daily.
If you are comfortable dropping from the RAID1 profile to a non-RAID profile, btrfs requires you to explicitly make this conversion; there's no safe way to make it automatic. Forcing the filesystem to accept writes while it's still in RAID1 mode but is incapable of providing RAID1 data integrity is something that you should expect to cause problems.
> I can think of one obvious reason one might mount the degraded array simply to copy data from it to a different arrangement.
You can mount read-only as many times as you want if you're going to copy the remaining data to a different filesystem. If you want to do an in-place recovery, the current limitation is only that you shouldn't mount the degraded array as writable until you're ready to make it not degraded anymore.
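To make that concrete, a hedged sketch of the two in-place recovery paths (device names, devid and mount point are made up):
$ sudo mount -o degraded,ro /dev/sda2 /mnt        # read-only: safe to inspect or copy off as many times as you like
$ sudo mount -o degraded /dev/sda2 /mnt           # writable: do this once, then immediately either...
$ sudo btrfs replace start 2 /dev/sdc /mnt        # ...rebuild onto a new disk (2 = devid of the missing drive)
$ sudo btrfs balance start -dconvert=single -mconvert=dup /mnt   # ...or convert down to non-RAID profiles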
> I wish that if a drive fail, the btrfs filesystem still mounts rw and leave the OS running, but warns the user of a failing disk and easily allow the addition of a new drive to reintroduce redundancy.
This might make sense for you, but is insane as a default policy. Manual intervention should be required before the FS will accept writes with less than the configured degree of redundancy. Silently mounting and hoping the user notices something in their logs is too dangerous.
> I created a raid1 btrfs filesystem by converting an existing single btrfs instance into a degraded raid1, then added the other drive
This seems backwards. Why not add the second drive, then convert to RAID1?
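I.e., something along these lines (device and mount point names are made up):
$ sudo btrfs device add /dev/sdb /mnt
$ sudo btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt   # now both data and metadata are mirrored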
-
Note that the patch to work around the refusal to mount is extremely simple, and that the patch is quite safe if used properly. But it's not really an acceptable solution for upstreaming, because it will lead to bigger problems if used in more complicated situations. There are several potential solutions that would be safe and widely deployable, but all involve changing far more than two lines of code.
In a large cluster, you'd probably plan on replacing every drive that failed rather than reconfigure to use less redundancy. So in that case, you'd probably want the hot spare feature to be stabilized and upstreamed. Then the FS could automatically copy over (or reconstruct from parity) data for a missing disk, without modifying data on the surviving disks.
In environments with a smaller budget that want to get the system back up and running before a replacement drive is available, it could be valuable to be able to pre-specify that the system should rebalance with less redundancy when a drive goes missing. I'm not aware of any work to implement this kind of feature. No enterprise customer would want or use this feature, and even a home user on a shoestring budget wouldn't necessarily want this rebalancing to happen automatically. (What if the drive was only temporarily missing, such as from a failed or loose SATA cable? You wouldn't want to do a ton of re-writing of data only to have to reverse it on the next boot.)
A hot spare is expensive. I want pure redundancy. That is, a one-drive failure should still leave a perfectly operational server/node/box.
That's what RAID1 usually means.
And sure, you can't survive the next failure without replacing the failed drive and resyncing.
But the remount-read-only thing is something different. It's a useful failure mode, but it doesn't help with operational simplicity.
(If the SATA cable is loose, then it'll cause intermittent failures, you'll see it in the log, and there will be a lot of resync events. And probably degraded performance, a lot of ATA (or, in the case of SAS, SCSI) errors, and other bus/command errors that go away on retry. And with SMART it's possible to at least guess that it's not the drive. It'd be great to have an error-notification interface from the kernel, so a tool could dig into the relevant subsystem's perf and health data and try to guess exactly which component is faulty.)
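As a hedged example of that kind of guessing with smartctl (device name and attribute values below are invented):
$ sudo smartctl -A /dev/sda | grep -iE 'reallocated|pending|crc'
  5 Reallocated_Sector_Ct   ...   0
199 UDMA_CRC_Error_Count    ...   1432
# CRC errors climbing while the sector counts stay at 0 usually points at the cable, not the drive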
> but the only way to encounter this bug is to be quite cavalier about how you handle a degraded array, by doing something that you shouldn't expect to be safe.
The way to encounter it is to mount the array a second time. That's hardly "cavalier handling".
A decade in, and RAID 1 doesn't work right. Saying this clearly offends the feels here at HN, but it's a fact.
> The way to encounter it is to mount the array a second time. That's hardly "cavalier handling".
`mount` is not the same as `mount -o degraded,rw`. The latter should raise the eyebrows of anyone paying attention to what they're typing. Odds are that you'll have to consult the docs to even find these mount options, because it doesn't happen automatically. This is where any careful, sane user who's concerned about their data would spend a few more minutes thinking through their entire recovery procedure.
There's one corner case to RAID1 recovery where the current tooling does not fully prevent a careless user from putting the FS in a bad state. This is not the same as "RAID1 doesn't work right".
Consider that most users will be careless, because running a filesystem to them is as exciting as any of the other dozen OS services. Having a corner case like this in RAID1 is, to me, like having a corner case in encryption...
That I immediately ran into this case with my first btrfs RAID1 problem makes me think that there is a steep slope down into this corner.
If, when a drive in your RAID fails, your first response is anything but checking the docs for the recovery procedure, you're going to end up disappointed sooner or later. When your fault-tolerant system tells you something broke, leaving it in a fragile state, and your response is to tell the system to ignore that and pretend everything is normal, expect trouble. You're working in a domain where there is no one right answer, so the system cannot magically anticipate how to handle the exceptional situation. None of the above is the fault of btrfs. None of this is avoidable. This problem has to be faced by ZFS, too.
What is avoidable is that btrfs could make it harder to do the equivalent of `mount -o degraded,rw`. But ultimately, there will be some mechanism for modifying a degraded array, and it'll get documented and then excerpted in blog posts and StackOverflow answers without all the context, and users will find a way to work themselves into a corner. There are all kinds of ways to do this with ZFS, too. ZFS tends to default to the approach of requiring you to copy all your data elsewhere and rebuild your array from scratch. What btrfs is doing here is no worse, except that it's a bit less up-front about the limitation because it's actually completely avoidable and this is a fixable UI bug, not a deep-seated architectural limitation.
No, probably not. But this particular bug isn't the reason why users who need very high-level assurances should avoid btrfs for now. The nature of this bug does not lend itself to being used as an argument that btrfs is unsafe in general.
There are plenty of instances of this bug being brought up on the mailing list. One of them is already linked elsewhere in this discussion, and the btrfs status page (also linked from this discussion) has further mailing list links.
Basically, btrfs doesn't want to allow a writeable mount when it might be missing some data. If there's some data on the FS that isn't stored with the RAID1 profile, then the kernel can't safely assume that the missing drive didn't have more chunks like that, holding data that wasn't mirrored on one of the surviving drives. But it's currently not possible to convert from RAID1 to non-RAID or to rebuild the array with a replacement without mounting the degraded array as writeable, which leads to non-RAID data being written. That puts the FS in a state that cannot be automatically judged safe at mount time, and the FS remains in that state until the recovery is complete (either converting from RAID1 to non-RAID, or replacing the failed drive).
There's no easy way to require the user to specify at the time of the `mount -o degraded,rw` whether they intend to resolve the situation by ceasing to use RAID1 or by replacing the failed drive. That leaves users with the opportunity to do neither and instead make the situation worse.
Thanks for the explanation. I was hoping for a GitHub issue number (or Bugzilla or whatever) to easily track this bug, but perhaps the btrfs dev team doesn't work with issue numbers?
At least for RAID1, it seems that implementing RAID1 N-way mirroring would ease the process to recover from a failed drive.
In case of drive failure, we could use the remaining drive in read-only mode to copy the data to a new drive, hence creating a RAID1 array with two working drives and one failed drive.
The OS should then allow to boot in rw mode, and from there it is easy to remove the failed drive from the RAID1 array.
However it seems that RAID1 N-way mirroring (with N > 2) is not even on the roadmap at this moment.
Have I misunderstood something, or does this approach make sense?
You can do RAID1 with more than two drives, but you'll only get two copies of each chunk of data. In this scenario, when one drive dies you can still write new data in RAID1 to the remaining space on the surviving drives, so mounting the FS writeable in degraded mode doesn't risk leaving the FS in a state where the safety is hard to determine on the next mount. If space permits, you can also rebalance before even shutting down to remove the failed drive, also avoiding the corner case.
Being able to do N-way mirroring with three or more copies of the data would be nice, but it's not necessary; 2-way mirroring across 3 or more drives is sufficient, and the hot spare feature will be more widely useful.
I was referring to this sequence of events:
1) 2-way mirroring across 2 drives
2) one drive fails
3) buy and plug a new drive
4) rebalance to have 3-way mirroring across 3 drives (with one being out): this is currently not possible
5) remove the failed drive, ending with 2-way mirroring across 2 drives
But it seems that you are referring to:
1) 2-way mirroring across 3 drives
2) one drive fails
3) rebalance to have 2-way mirroring across the 2 working drives
4) remove the failed drive, ending with 2-way mirroring across 2 drives
I assume that people don't/won't start the initial RAID1 with 3 drives.
Anyway, I would find 3-way mirroring across 3 drives very useful, as it gives a simple, identical, foolproof process for replacing a faulty hard drive, whether it has just a bit of corrupted (but still readable) data or has completely failed: just plug in a new drive, rebalance, reboot and remove the defective drive.
> rebalance to have 3-way mirroring across 3 drives (with one being out): this is currently not possible
I'm not sure this even has meaning. But anyways, it's probably pointless to try to kick off a rebalance when the FS is still trying to use a dead drive. Either use the device replace command (which isn't stable yet), or tell btrfs to delete the dead drive then add the replacement drive. If the problem drive is failing but not completely dead yet, then the device replace command is supposed to move data over with a minimum of excess changes to drives other than the ones being removed and added. But the device replace command doesn't properly handle drives with bad sectors yet, so the separate remove and add actions are more reliable albeit slower and put more work on the other drives in the array.
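Roughly, the two routes being compared look like this (devid and device names are made up):
$ sudo btrfs replace start 3 /dev/sdd /mnt    # copy devid 3's data straight onto the new disk, minimal churn elsewhere
# versus the slower route that re-writes data via the surviving drives:
$ sudo btrfs device add /dev/sdd /mnt
$ sudo btrfs device delete missing /mnt       # migrates the dead drive's chunks onto the remaining devices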