Five Years of Btrfs

InTheArena · on Jan 27, 2020

I went on a quest a few years ago, thinking it would be good for the industry to standardize on a single next generation filesystem for UNIX. I started with ZFS on linux since that seemed to have the most vocal advocates. That lasted about a half year, until a bug in the code resulted in a completely corrupt disk, and I had to restore 4TB of data over a month from offside backups. That plus the licensing confusion around ZFS has made it impossible for ZFS to be the defacto choice.

I went down the BTRFS path, despite it's dodgy reputation when netgear announced their little embedded NASes, and switched my server over to it. The experience was solid enough that I bought high-end synology and have had zero problems with it.

starfallg · on Jan 27, 2020

Btrfs is the only FS I used that resulted in complete FS corruption losing nearly all data on disk, not once, but 3 times.

After that, none of the features like compression, snapshots, COW or checksums meant anything to me. I'm much happier with ext4 and xfs on lvm.

RX14 · on Jan 27, 2020

It seems a lot of people have these stories, and then people like me and OP who have had btrfs survive the most fucked up situations (I've had a btrfs nas built on "random drives I've had lying around" and abused it for 5 years and had 0 bugs at all).

I'm not sure what causes it, but there seems to be an effect where btrfs loves you or hates you and few people with mixed experiences regarding data loss. One possible cause is distro choice tends to be per person and how up to date said distro keeps it's kernel. But, I'm not sure.

scottlamb · on Jan 28, 2020

> It seems a lot of people have these stories, and then people like me and OP who have had btrfs survive the most fucked up situations (I've had a btrfs nas built on "random drives I've had lying around" and abused it for 5 years and had 0 bugs at all).

Why wouldn't you expect it to survive that? Is there a particular reason to believe those drives are broken? I.e., are they older consumer drives known to lie about cache flushes? do they have bad sectors? How have you abused it? What kind of load? Did you fill the filesystem (which another commenter mentioned seems to be a common element of most sad btrfs stories)? did your system frequently lose power while under write load?

Lacking more details, I'd just say one user experiencing 0 bugs in 5 years should be completely unremarkable. I expect filesystems to be very reliable, so a lot of people having stories of corruption means stay away from btrfs. Having some people with stories of no corruption doesn't really move the needle. Together, these stories still mean stay away from btrfs!

cmurf · on Jan 28, 2020

That's hyperbole, it can't be taken seriously. OpenSUSE uses Btrfs by default, if there were more problems outside what's expected by md+LVM+ext4 (or XFS), which is the feature comprised by Btrfs and then some, they wouldn't have made the on-going investments they have. Facebook has been using it in production with thousands of installations for years.

You want details from people experiencing zero problems, but you don't ask for details from people who are? That's a weird way to go about conducting the necessary autopsies, to discover and fix bugs.

Anyway, I monitor the upstream filesystems lists, and they all have bugs. They're all fixing bugs. They're all adding new features. And that introduces bugs that need fixing. It's not that remarkable, until of course someone suggests only one file system is to be avoided, while also providing no details, but depends on conjecture.

scottlamb · on Jan 28, 2020

I asked RX14 why they called out their lack of problems as remarkable ("survive the most fucked up situations"). It sounds strange, as I mentioned.

I don't need to ask people who've had problems because I've had them myself, in unremarkable circumstances, a while back. I'm sure I could find reports on the mailing list as well, in which others have already asked for details.

packetlost · on Jan 27, 2020

In my experience, btrfs is very fragile in power loss or kernel crash/panic scenarios. It very consistently causes soft lockups on file read/writes after power loss until you run a `brtfs check --repair` on it. My experience is mostly on Arch, so it's not a case where it's out of date and missing patches.

cmurf · on Jan 28, 2020

Sounds like hardware problems in the storage stack. Btrfs developers contributed the dm-log-writes target to the kernel, expressly for conducting power loss tests on file systems. All the file systems benefit from this work. https://www.kernel.org/doc/Documentation/device-mapper/log-w... And Btrfs is doing the right thing these days.

I recently conducted system resource starvation tests where a compile process spun off enough threads to soak the system to the point it becomes unresponsive. I did over 100 forced power off tests while the Btrfs file system was being written to as part of that compile. Zero complaints: not on mount, not on scrubs, not with btrfs check, and not any while in normal operation following those power offs.

If you want to complain about Btrfs, complain about the man page warning to not use --repair without advice from a developer. You did know about that warning, right?

packetlost · on Jan 28, 2020

100% was not a hardware problem. Works fine on other filesystems ️

cmurf · on Jan 28, 2020

That's an inadequate answer because it rests on other file systems assuming the hardware is working reliably. Btrfs and ZFS don't make such assumptions, that's why everything is checksummed. They are canaries for hardware, firmware, and software problems in the storage stack that other filesystems ignore.

danudey · on Jan 27, 2020

This was my experience. We had a brief power outage at work and my btrfs (root) partition was toast. Spent a whole day rebuilding my system afterwards. Will definitely not go that route again.

The only difference is that none of the repair tools were able to recover the filesystem, but I was able to dump the files themselves to a new disk to recover them. Really not sure why, it was very strange.

shanemhansen · on Jan 28, 2020

I ran btrfs on a laptop. 2 things.

Once I ended up with a bunch of zero length files (presumably metadata was written before content?).

I also, multiple times ended up with errors related to full drives despite by drive not being full. Deleting snapshots seemed to help.

Then I went to a zfs fs on root and never had another problem.

rdslw · on Jan 28, 2020

Since a year I literally daily turn of my machine by pulling a plug (home automation turns off all plugs at midnight to make me go bed ;).

My quite large 1tb multivolume, multisnapshot BTRFS fs never had any problems.

And it's quite aggressive cfg (big fs commit).

P.S. I do have backups though.

jraph · on Jan 28, 2020

Ugh. You are testing your home the Netflix way [1] :-)

Why not putting poweroff in a cron task a bit before midnight so you don't uselessly risk hosing your file system? You can always restore your backup but it takes time!

[1] https://arstechnica.com/information-technology/2012/07/netfl...

thfuran · on Jan 27, 2020

I think the probable cause is that it's not common bugs that cause the corruption but uncommon ones. Most of the time, they work fine. But you really want a stronger guarantee than that out of your filesystem.

SEJeff · on Jan 27, 2020

Historically, the biggest bugs in btrfs were when you came close to filling up the filesystem. For the longest time, you'd get -ENOSPC (no space left) even when you had many Gb of space left due to really bad metadata and block level space usage.

danudey · on Jan 27, 2020

I'm a huge Mac fanboy, but APFS really kicks me in the teeth sometimes. Aside from things like snapshots, clones, etc. not being accessible to users (well, not really), or being able to create subvolumes at specific mount points which forget those mount points next reboot, it had an extremely strange behavior (possibly relating to snapshots/CoW?) where once it was full, it stayed full forever until you rebooted.

Basically, any time a runaway process filled my disk, I just had to hard-reboot and hope I didn't have any unsaved work or state that I needed to preserve.

Really makes me hope that Apple is going to further extend APFS to not just be baby's first CoW volume-management filesystem.

scottlamb · on Jan 28, 2020

> it had an extremely strange behavior (possibly relating to snapshots/CoW?) where once it was full, it stayed full forever until you rebooted.

Do you have Time Machine enabled? I think it uses snapshots, which explains why the filesystem stays full. I've hit this myself and was initially surprised to see rm not improving matters (possibly even making it worse) but it makes sense with snapshots. The working on reboot was a surprise. I'd put off fixing the machine for at least a week, and when I went to actually fix it, it was quite anticlimatic to just reboot and have it work. Maybe it checks for this condition on reboot and dumps Time Machine snapshots if so.

That was the less scary part of my macOS filesystem integrity worries. My full disk started when it was staging a full Time Machine backup after I got a dialog saying:

> Time Machine completed a verification of your backups on "my.nas.address". To improve reliability, Time Machine must create a new backup for you.

...for the Nth time. I don't know for certain if the problem is with Apple's software or with my NAS's (Synology) but these backups are clearly not as reliable as one would hope...

blattimwind · on Jan 27, 2020

Let's not forget about various performance issues which were exacerbated by "low free space" conditions (i.e. after you filled the volume beyond 80 % these started to pop up). A file system that will sometimes go down well into the fractional IOPS range is not very useful.

Some of these are fixed by now, though.

SEJeff · on Jan 27, 2020

"the biggest bugs in btrfs were when you came close to filling up the filesystem" :)

I used to read every email on btrfs-devel for a year or so.

machawinka · on Jan 27, 2020

This is my experinence too. Works great with lots of free space, as soon as space gets tight, performance deteriorates really fast. Nevertheless, for me it has been worthy.

kbenson · on Jan 28, 2020

There's a good Bryan Cantrill talk about that.[1] The gist is that eventually, when you throw enough resources at a problem, all that's left are the really uncommon problem and bugs, and this is specifically what you get in the data path (including drive firmware) where things get harder and harder to figure out as the code gets more hidden and obscure.

As with all his talks, you can expect it to be quite entertaining as well as informative and historical (if from his POV).

1: https://www.youtube.com/watch?v=fE2KDzZaxvE

zepearl · on Jan 27, 2020

Personally I think that in the case of a CoW filesystem, bugs which cause corruption should be very uncommon because of the very nature of the CoW mechanism, especially if coupled together with data checksums as publicized in the case of BTRFS.

If things still get trashed then I tend to think that the very foundation of the FS is bad.

But maybe I'm just naive :)

josteink · on Jan 27, 2020

> I'm not sure what causes it, but there seems to be an effect where btrfs loves you or hates you and few people with mixed experiences regarding data loss.

I tried, I really tried to like btrfs.

On the servers/workstations I’ve had few serious issues, but a few “gotchas” you need to know to keep things running smoothly.

On every laptop I’ve had, I’ve had btrfs fail on me. Repeatedly.

So I gave up on it. ZFS for me these days.

dmos62 · on Jan 27, 2020

> btrfs loves you or hates you

This is how superstitious traditions start, and ritualistic sacrifice in particular, I'd think.

rnd0 · on Jan 28, 2020

>and ritualistic sacrifice in particular, I'd think.

Does data loss count as a sacrifice in this instance?

wtallis · on Jan 28, 2020

If it does, I think the ZFS "rebuild the pool from scratch" should as well, since that seems far more ritualistic.

awill · on Jan 28, 2020

>but there seems to be an effect where btrfs loves you or hates you

Surely it depends on the btrfs implementation. e.g. Arch Linux getting daily kernel updates vs an enterprise distro

packetlost · on Jan 28, 2020

Just as unstable on Arch as of a month or two ago.

unixhero · on Jan 27, 2020

Same and same. Never saw any problems with btrfs. Really like the memory consumption of btrfs!

paulddraper · on Jan 27, 2020

> an effect where btrfs loves you or hates you

Same thing happens with operating systems.

BubRoss · on Jan 27, 2020

One anecdote of a filesystem working fine and one anecdote of it becoming a disaster don't cancel each other out.

I wouldn't buy a $5 USB thumb drive if half the people said it lost their data and half said it worked fine.

rnd0 · on Jan 28, 2020

I'd buy it -but only for short-term use to sneakernet shit I already had backed up reliably somewhere else.

of course, where we run into problems is that btrfs is meant to be the reliable backup. Oops.

BubRoss · on Jan 28, 2020

You realize that there are $5 thumb drives that work, just like there are filesystems that actually work right? There isn't any benefit to using something broken, these problems have been solved.

rdslw · on Jan 28, 2020

Sorry, but this is an anecdata.

Down there, 2/3 of this hackersnews discussion (if you are patient to get there) you can see questions about production deployment of btrfs, with some VERY interesting answers of BIG deployments of btrfs. Read success confirmed with data. My takeaway from reading whole discussion:

* lot of people (individuals) praise of btrfs

* lot of people (ind.) tell about problems

* quite nice features/btrfs usage patterns, not matched even by zfs mentioned

* still for VM/DB you shall consider different approach (thin LVM + xfs or ext4) and slave machine WITH btrfs and snapshots on it

* quite many problems/deficienses of ZFS mentioned (apart fomr typical license/kernel inclusion)

* lot of new features on the way in recent kernels for btrfs

* btrfs is not dead

p.s. worth to comment that kernel 5.6 just received another huge new features batch for btrfs (async discard!)

aidenn0 · on Jan 28, 2020

ZFS is the only FS I used that resulted in complete FS corruption, losing nearly all data on disk (only once though).

tw04 · on Jan 27, 2020

Legitimately curious what the ZFS bug was. I’ve not heard of a TFDL bug in zfs for a Loooong time.

The reason Synology btrfs is mostly solid is because they refused to ever use the btrfs raid layer. But the second you move to btrfs on LVM you lose a large portion of the supposed benefits.

Having used both, never lost data on zfs and I’ve been using it since it was released and have had it save me from silent data corruption. BTRFS hasn’t ever lost me an entire file system, but I’ve definitely lost files.

clSTophEjUdRanu · on Jan 27, 2020

I really don't understand the insane hype around ZFS. You can't read any thread that touches on filesystems without the ZFS zealots coming out.

yjftsjthsd-h · on Jan 27, 2020

ZFS is mature/stable, its feature set is basically unmatched (data checksums, compression, atomic snapshots, RAID(0,1,10,5,6), send/receive) by any other option on Linux, and what competition it does have is unstable in some configurations (BTRFS), essentially dead in the water (reiserfs), in early development (bcachefs), or far more complex to manage (gluster, ceph, LVM+XFS). Other than the licensing issue, ZFS is basically a silver bullet.

zepearl · on Jan 27, 2020

I agree and I'd like to add to the list of feature-set the adaptive cache (which does not only take into account the last time a block was used but as well how frequently it was used) and the SSD-cache ("ARC" respectively "L2ARC" in ZFS jargon).

bbatha · on Jan 27, 2020

Also don't underrate good documentation and easy to use tooling.

somehnguy · on Jan 27, 2020

This was the killer feature for me.

I had been wanting to try ZFS on my home NAS for a while (for snapshotting/redundancy/data integrity) and finally got enough disks that it made sense. I wasn't looking forward to learning what I presumed to be a very complicated system though. About 15 minutes into my research for setting up and maintaining a ZFS filesystem and I just went - wait thats it? So incredibly simple and well documented, it has been a joy to use. It is very rarely that complicated operations on complicated systems use such simple and easy to understand commands. It just does what I expect!

danudey · on Jan 27, 2020

ZFS is incredibly easy to learn to use, whereas btrfs is quite complicated to learn/use, and even more so if you've used ZFS since a lot of things are either just different enough to be weird, or so different that it makes no sense.

Examples: ZFS snapshots can be recursive (-r) or not, whereas on btrfs they cannot be recursive; in discussions I've seen, this is mentioned as "a feature", since you can create a subvolume for data that you don't want to be part of the snapshot, but it also prevents you from dividing up a logical heirarchy into multiple behaviours (compression vs. not, block size, etc.).

wtallis · on Jan 28, 2020

> but it also prevents you from dividing up a logical heirarchy into multiple behaviours (compression vs. not, block size, etc.).

Bind mounts can get around most of the limitations here, at the cost of polluting one directory with the canonical locations of all your special-purpose subvolumes. I think it's still awkward to simultaneously snapshot every subvolume that is mounted under a particular tree for incremental backup purposes.

tjoff · on Jan 27, 2020

The hype is quite easy to understand. Snapshots and checksums are two complete game-changers. ZFS has them both. And there are no real alternatives in many cases.

I've personally waited for BTRFS longer than a decade but my use-cases are yet to be considered stable (not something you really mess with in regard to filesystems).

Honestly, as sure as I have been on the success of BTRFS I now consider BTRFS dead on arrival - if it will ever even arrive. The pace of development is slower than the universe around it, that might be too harsh but really - no RAID6 yet? A decade ago the impression I got was "soon". And now 2-drive parity is becoming obsolete.

ZFS has tons of warts for home-use, I agree. So, for a home-user with high demands I don't see anything exciting in the future.

TurningCanadian · on Jan 28, 2020

There were a bunch of btrfs raid56 patches last year. I think the known bugs have been addressed and is just that the wiki page hasn't been updated.

Re obsolete, are you referring to RAID1C3?

tjoff · on Jan 28, 2020

I'm thinking of this:

https://www.zdnet.com/article/why-raid-5-stops-working-in-20...

I'd much prefer something like raidz3 compared to the authors setup.

RAID1C3 is nice but very expensive for use in bulk storage at home.

mycall · on Jan 28, 2020

What warts do you speak of?

tjoff · on Jan 28, 2020

No defragmentation, and as far as I'm aware all copy-on-write filesystems suffer greatly from fragmentation once utilization goes too high. ZFS will never recover unless you restart from scratch.

No way to rebalance a pool. Also increasing a pool always results in less reliability (in terms of drive losses that results in the whole pool going down).

No proper recovery tools if something goes wrong.

Then the lack of flexibility talked about in the article. This means the up-front cost and total cost is vastly more than a more typical setup where you can buy drives spread out over many years and take advantage of falling prices, less power consumption and noise (in part because you typically start such an array with higher density drives, since the low cost and longevity allows you to).

Probably forgot some other reasons.

That said I still use zfs (freenas) at home. But because of the above it is quite hard to blindly recommend it.

fulafel · on Jan 28, 2020

lvm and hence ext4 etc have had snapshots for ages.

tjoff · on Jan 28, 2020

As do NTFS. But they are not really comparable to "real" filesystem snapshotting, at least not in my opinion.

asveikau · on Jan 27, 2020

I don't think I am a zealot, nor a heavy user, but I use it on 1 machine at home (an NFS server running FreeBSD, which I have clients for elsewhere in my house). I came to this idea when I saw some data loss on some magnetic disks in my house, and repairing or even assessing the level of damage was difficult.

My experience is that it's pretty good. The tooling does what it says without a lot of drama. I can scrub while the system is in use and don't notice it mostly. I have seen some small corruptions that it was able to flag for me with specific filenames and fix. Snapshotting and send/receive is also very handy.

I heard some people say they don't like to use it under heavy load. That seems reasonable to me. You're paying costs to get the integrity piece. So it's not for every use or every user. It is very good at what it does, however.

stiray · on Jan 27, 2020

Same with me. I just figured out at some point, 10 years ago, that it is nice to have snapshots on root disk. And figured out FreeBSD is supporting ZFS. Tryed it, loved it, used it. The ZFS on linux was destabilized in latest versions (`ls /.zfs/snapshots`) and they blew it considerably by adding it to systemd (I need to reboot fedora multiple times before it boots ever since), but at least I know that my data are not lost (unlike btrfs, had two major crashes in two years). Quite frankly I'll rather wait for Raisser to get out of jail than use btrfs again. Anyway, I bet on Hammer2.

seized · on Jan 28, 2020

ZFS is like really good snow tires in the winter. You can tell people with other tires how great it is to have really good tires, but they dont believe you until they experience the benefits for themselves. Or put another way, no one "needs" ZFS until they really need it, then they wont live without it ever again.

I switched to it after the 7200.11 firmware mess, where the drives reported successful writes but didnt write anything. ZFS would have caught that, my Adaptec card certainly couldnt have and didnt.

ZFS to the rescue again a while later when those (now firmware updated) 7200.11 drives started dropping after 15k hours of service. ZFS saved my data when two drives started failing in my RAID5 set at the same time.

All the weird minor problems that would cause random issues or performance issues for other file systems like flaky SATA cables, intermittent HBA/backplane ports, etc. ZFS catches them all and informs you.

Having been hit by bit rot, corrupted files, corrupted file systems, etc etc before switching, ZFS is fantastic. And there is something great about watching it scrub at >1GB/sec, verifying every single bit of your data.

twic · on Jan 27, 2020

Eh, i'm waiting for them to rewrite it in Rust.

paulddraper · on Jan 27, 2020

Poe's law

doctor_eval · on Jan 28, 2020

This is such a meta-comment that I actually LOLd!

ksec · on Jan 28, 2020

I had to check the dictionary for the meaning of "hype"

a situation in which something is advertised and discussed in newspapers, on television, etc. a lot in order to attract everyone's interest:

May be its just me because Morden day usage of "hype" seems to involve and implies a negative meaning, especially in tech. Similar to false advertising. And no one was actively promoting ZFS, they were only very "responsive".

And then zealots, I had to reread 226 comments, ran to Cambridge dictionary

a person who has very strong opinions about something, and tries to make other people have them too

I dont see anyone having strong opinions and force others to have the same. If anything a lot of people are showing not because the love ZFS, but they have been burnt by btrfs.

kqr · on Jan 27, 2020

ZFS is the worst filesystem/volume manager, except all others.

somehnguy · on Jan 27, 2020

Have you tried it? I went from having never used ZFS to loving it (and I guess being one of those zealots) very quickly after setting it up. So simple yet so powerful!

rnd0 · on Jan 28, 2020

>You can't read any thread that touches on filesystems without the ZFS zealots coming out.

Agreed 100%. That's particularly annoying to us desktop users. It took me years to figure out that no, FreeBSD aside, it doesn't bring anything to the table outside of enterprise storage use cases. At least it doesn't bring anything that's worth the hassles (I don't have to export ntfs filesystems before using them on another computer; same for ext4 -and then there's performance).

zionic · on Jan 27, 2020

I have a synology NAS on btrfs. One of the best computer purchases I've ever made.

lostlogin · on Jan 27, 2020

I’ll second this, it’s fantastic. The time it takes to expand when adding a second 16TB is deeply average (8 days) but that’s about it for downsides. It’s the best computer I’ve owned.

m4rtink · on Jan 27, 2020

Hard to standardize on something that can't be maintained in the same place all your other filesystems are in (in the Kernel) for licensing reasons.

rbanffy · on Jan 27, 2020

Only the boot file system drivers need to be in the kernel. As long as there is a stable ABI, it's fine for everything else to be someplace else.

skissane · on Jan 27, 2020

> As long as there is a stable ABI, it's fine for everything else to be someplace else.

Mainline Linux has a policy against in-kernel ABI stability guarantees. User-space is given ABI stability guarantees, in-kernel code by intention is not. That includes filesystems.

pizza234 · on Jan 28, 2020

Do you have a link to the bug issue? ZFS purpotedly never had any corruption issues on release versions, so that makes it a really interesting case.

derefr · on Jan 27, 2020

A question for HN: what filesystem and/or block-device abstraction layer would you use on a database server, if you wanted to perform scheduled incremental backups using filesystem-level consistent snapshotting and differential snapshot shipping to object storage, instead of using the DBMS’s own replication layer to achieve this effect? (I.e. you want disaster recovery, not high availability.)

Or, to put that another way: what are AWS and GCP using in their SANs (EBS; GCE PD) that allows them to take on-demand incremental snapshots of SAN volumes, and then ship those snapshots away from the origin node into safer out-of-cluster replicated storage (e.g. object storage)? It it proprietary, or is it just several FOSS technologies glued together?

My naive guess would be that the cloud hosts are either using ZFS volumes, or LVM LVs (which do have incremental snapshot capability, if the disk is created in a thin pool) under iSCSI. (Or they’re relying on whatever point-solution VMware et al sold them.)

If you control the filesystem layer (i.e. you don’t need to be filesystem-agnostic), would Btrfs snapshots be better for this same use-case?

chousuke · on Jan 27, 2020

AWS and GCP most likely use their own proprietary stuff. Various storage systems (such as Netapp) are able to provide snapshots at the storage system level, and if you're interested in something open source, a Ceph cluster can also provide you snapshottable block devices; whether it's a good idea for a database is another question.

Filesystem snapshots are a legitimate way of backing up databases, but it's not quite as simple as just taking a snapshot. For PostgreSQL for example you will still need to call pg_start_backup() and ensure your WAL archives are properly stored in your object storage system for point-in-time recovery. Without the database-specific precautions, your snapshots will still be crash-consistent and most likely usable in some manner, but not quite proper backups.

Using BTRFS or ZFS as the database filesystem has its own footguns. For example, the default record size of ZFS datasets doesn't match the block size of most databases, so if you forget to take that into account, you'll very likely see rather terrible performance.

takeda · on Jan 27, 2020

If the database is PostgreSQL, I would strongly advise about forgetting about filesystem snapshots and instead using streamed backups (if on premises use barman, if in cloud WAL-E or WAL-G (never used it but looks like improvement over WAL-E).

This gives you backup with a replay value, so you can restore at any point in time. You can also use such backup for setting up replication. There's still a daily backup which is there to speed up recovery and increase resiliency. Those backups don't really put much load on the database, but if that's a concern you can back up the replica (which is what cloud providers or at least AWS is doing).

As for ZFS, out of the box ZFS is not a good file system for databases, although you can get a good performance after tuning. You for example want to configure it to have block sizes alizened with database blocks, configure ZIL, perhaps changing block hashing algorithms (although I think current default should be fast).

As for your question how are cloud providers are doing it, most of us can speculate. To me it looks like standard RDS instances are simply on EBS (which are utilizing S3). In Aurora they skipped EBS and implemented another database storage directly.

It seems like the backups are performed in traditional way though.

muxator · on Jan 27, 2020

I do not think it would be a good idea to use file system level snapshotting for backing up a database. The database "knows better" about its internals, and can give more guarantees about the consistency of its data. I would trust a filesystem-levdl backup only as a last resort.

iracic · on Jan 27, 2020

It is possible to put database in state that is "ready" for snapshot, pushing changes to disk and sort of freezing I/O during snapshot.

jstrong · on Jan 27, 2020

this is generally not a matter of concern for a copy on write filesystem like zfs, since it's not possible for the file to be in an "in between" state. If a write were in progress, the filesystem would still be pointing to the previous state. Only when the data is written to disk is the pointer moved to the new location.

tjoff · on Jan 27, 2020

It very much is a concern. ZFS has no knowledge about the internals of a database, which parts of a file are related to each other etc.

RX14 · on Jan 27, 2020

DBMSes always keep their database in the file system in a consistent state to be able to recover from system crashes. Taking a file system snapshot is equivalent to pulling the power on the database server in terms of data recovery, but databases are designed to support this.

tjoff · on Jan 27, 2020

As do filesystems. Yet I've seen anyone argue that cutting the power is the recommended way of doing backups.

In fact the opposite, make sure to use an UPS just so that you can shutdown cleanly in the unfortunate event.

For example: https://blogs.oracle.com/paulie/backing-up-mysql-using-zfs-s...

wmf · on Jan 27, 2020

Some people came up with the idea of "crash-only software", arguing that it's better to maintain one code path (recovering from a crash) than two (clean start and recovery), but it hasn't caught on that much. https://www.usenix.org/legacy/events/hotos03/tech/full_paper...

paulddraper · on Jan 27, 2020

Google follows this, I believe

rocqua · on Jan 27, 2020

It matters when various writes to files need to be ordered, and the DB processes is keeping some of the consistency part in memory.

Teknoman117 · on Jan 27, 2020

It is a matter of concern if said database systems leaves its filesystem contents in an inconsistent state at any point. ZFS, BTRFS, and others can only keep consistent what they have control over.

mekster · on Jan 29, 2020

I also think database specific backup makes more sense.

Some people recommend filesystem snapshotting but wouldn't that make recovery a slow process because you have to load up the entire database even if you just wanted to look up on data of a small table?

Maybe backing up only small tables as SQL dumps while keeping a file system snapshot would be a good compromise.

m4rtink · on Jan 27, 2020

Create a LVM thinpool, create a thin LV in that, format with XFS, put the database on top.

Each time you want to a backup, create a CoW snapshot of the thin LV, then mount it somewhere and run the backup.

The "main" thin LV should be happily chugging along independently when you are doing that.

And all this is stable proven technology available about anywhere (eq. RHEL 7+).

danudey · on Jan 28, 2020

My understanding is that performance on LVM drops dramatically after the first snapshot due to the way it handles CoW (synchronous writes on top of your async write). Is that no longer true, or only in certain circumstances?

It seems as though the way to go would be to take a 'snapshot', back it up, and then delete it immediately; is that right?

https://www.nikhef.nl/~dennisvd/lvmcrap.html

mappu · on Jan 28, 2020

I think this complaint applies to the original LVM2 snapshots, not the new thin ones.

StreamBright · on Jan 27, 2020

>> Or, to put that another way: what are AWS and GCP using in their SANs (EBS; GCE PD) that allows them to take on-demand incremental snapshots of SAN volumes, and then ship those snapshots away from the origin node into safer out-of-cluster replicated storage (e.g. object storage)?

As far as I know AWS does not use SANs because they consider it as anti-pattern. Most backups land on S3 because of reliability and price.

KaiserPro · on Jan 27, 2020

SAN and s3 are different beasts.

EBS is very much a SAN, if you read the docs, the Nitro HBA Controllers have dedicated bandwidth allocation for doing just EBS.

As there is a dedicated network for just servicing block storage, that sounds suspiciously like a Storage Area Network to me.

S3 for backup makes lots of sense, its ubiqutous, reliable and smeared over lots of regions. It also works well with large files. Its also orders of magnitude cheaper than EBS to run.

StreamBright · on Jan 27, 2020

Sure thing. I was referring to the lack of SAN in the context of backups. Yes, EBS is a SAN in that sense.

polskibus · on Jan 27, 2020

So how is S3 implemented? Does it reuse any publicly available open source component?

derefr · on Jan 27, 2020

I don’t think they’ve published anything specifically on S3’s architecture (someone please correct me if I’m wrong, I last looked into this a long time ago), but

1. they came out with S3 soon after coming out with their Dynamo paper (before releasing DynamoDB, even); and

2. there’s a good constructive proof, as a studyable FOSS system, for how to build object storage on top of a Dynamo architecture, in the form of Riak CS (object storage) which is built atop Riak KV (a Dynamo impl.) Riak CS seems to make pretty much the same set of guarantees (in terms of time/space complexity of operations, possible durability numbers per scaled number of copies, etc.) that S3 does, so it’s a fair guess that they’re similarly-architected systems.

StreamBright · on Jan 27, 2020

It is a closed source project that has many components. I am not aware if any of those are opensource.

Aldipower · on Jan 27, 2020

Assuming you can afford 2 machines this setup works pretty well for me.

Primary DB-Server -> XFS on LVM with a LVM caching SSD

Secondary (write-only) mirror DB-Server -> ZFS

The DB is replicating automatically to the secondary server by the database internal replication features. At the secondary I am then able to lock the DB temporarily, doing a ZFS snapshot and maybe could do a ZFS send/receive afterwards without affecting the primary server.

ZFS is great for doing snapshots and archiving of huge amount of data, but it's very very bad for production databases in terms of performance. Most database aren't designed to deal with the CoW feature of ZFS, which leads to a very bad write performance and database fragmentation in the end.

unixhero · on Jan 27, 2020

Zfs

Btrfs

See various "Private Cloud" Linux distributions for implementation examples of this. Such as Proxmox which does it out of the box on ZFS and soon on btrfs too.

SEJeff · on Jan 27, 2020

xfs + dm-snapshot / lvm snapshots. Very fast and very reliable.

KaiserPro · on Jan 27, 2020

AWS, and possibly GCP only allow 1:1 mapping of volumes (publicly, I know AWS allow it under the hood. )

Which makes synchronising snaphots a lot easier (and caching too, but thats another thing entirely.)

They are treated as block storage, so on the outside don't have to worry about what filesystem is running on it. (in practice they have to be a bit aware, so that they don't snapshot unbootable or dirty images, but I assume thats mostly handled by an OS plugin)

TL;DR:

AWS et al snapshots are at the block level. Linux has poorly documented primitives for this.

If you put your VM images on a Filesystem provided by ZFS or BTRFS then you can snapshot your images, without having to buy a SAN, or expensive controller.

ZFS has by far the best documentation. BTRFS's documentation has improved, but the tools are still difficult to use.

gravypod · on Jan 27, 2020

I've seen a lot of the hacker community focusing on btrfs and zfs but very little focusing on ceph. I think ceph has a lot of the features that we want in a file system and some things that aren't even possible on traditional file systems (per-file redundancy settings) with very little downsides. The setup is a little more complex involving a few daemons to manage disks, balance, monitor, etc. I wish there was something similar to FreeNAS for ceph that only focused on making the experience seemless because I think if it became more popular in the home lab space we'd see lots of cool tools pop up for it.

louwrentius · on Jan 27, 2020

I love Ceph, I even wrote an intro about it for those who are not familiar with it.

https://louwrentius.com/understanding-ceph-open-source-scala...

But Ceph is not designed to be a competitor to BTRFS or ZFS. The core vision of Ceph is scalability. If you need petabytes of storage and the performance to scale with it, take a look at Ceph.

I may be totally wrong here, but from what I understand about Ceph, it's not meant as a file system for a single computer. I don't understand the idea of running Ceph on your laptop/desktop. It's possible to run it that way but it defeats it's purpose.

I've build a small lab setup with Ceph:

https://louwrentius.com/my-ceph-test-cluster-based-on-raspbe...

Also, there's the issue of performance, in particular latency. That's a bit of a weak spot of Ceph, from what I can tell. Again, may be wrong. But I found these notes interesting.

https://yourcmc.ru/wiki/Ceph_performance

seabrookmx · on Jan 27, 2020

This.

In fact, it's really common to use a ZFS array on single nodes, and then create a SAN using multiple such machines by layering Ceph on top.

louwrentius · on Jan 27, 2020

That's interesting, but it's layers upon layers... (RIP latency), I think. Unless it's about just bandwidth and volume, then latency is not that big of a deal.

seabrookmx · on Jan 28, 2020

You don't have to use ZFS snapshots. I haven't run a system like this in production but presumably you choose ZFS because it's flexible in how you configure the arrays (as is say, LVM) and because it supports checksumming.

gravypod · on Jan 27, 2020

I never really store data on my local machines anymore. All of my data is either hosted only on, or backed up to, my storage server. I think the selling point of ceph is that every server in my apartment can be part of my storage cluster and data I really want to avoid loosing can be persisted across all of them.

For me latency isn't really a large issue. I read and write everything locally on my SSD-backed desktop/laptop and then sync my files to my storage node via git or rsync or something. For me data integrity and availability are important.

tezzer · on Jan 27, 2020

I've had one issue with btrfs that took it off my radar completely. A customer had a runaway issue that filled a btrfs device with unimportant things. We found the errant process and killed it, but apparently if a btrfs device is completely full, you can't delete anything to free up space. File removal requires some amount of free space. Bricked the device, annoyed a customer, back to ext4.

takeda · on Jan 27, 2020

ZFS had this issue (I believe fixed) workaround was to pick up one large file that you wanted to delete and do `echo -n > /the/unimportant/file` once the file was reduced in size to 0, rm started to work again.

Not sure if that workaround would work in btrfs, but it worked on ZFS.

chungy · on Jan 27, 2020

ZFS reserves 1/64 of every disk precisely so it can't be truly fully allocated. It leaves enough room to delete snapshots, truncate files, and so forth.

Mind that everything is copy-on-write, you can't do anything, even metadata changes, without allocating new blocks. It needs the reserve space.

tobias3 · on Jan 28, 2020

I had a ZFS bug once where they increased the amount reserved in a new release which caused my file system to be 100% and me unable to delete anything until I went back to the previous release.

Btrfs uses the the disk completely. This is harder to do (also compared to e.g. ext4 reserving a fixed amount of inode space which may be unused when the disk is full). At some point they added an in-memory "global reserve" metadata space which allows you to delete stuff even if the file system is full.

rcthompson · on Jan 27, 2020

What happens if the file has already found its way into a snapshot? Then presumably that command will not free any space.

takeda · on Jan 27, 2020

Well, rm wouldn't free the space either so you either would remove the snapshot or chose a different file.

pletnes · on Jan 27, 2020

True. For me, I freed up space by nuking old snapshots, when I ran into this on btrfs.

loeg · on Jan 27, 2020

See also: 'truncate -s 0 /the/file'

pfranz · on Jan 27, 2020

Yep. I had this happen a few weeks ago (I'm not sure how much maintenance the server has had since it was set up 2-3 years previous). Thankfully, after seeing whatever the error was ("No space left on device" or something) and furrowing my brow it seemed obvious enough to try without having to search for a solution. It seemed just dumb enough to work.

proaralyst · on Jan 27, 2020

The trick was to insert a USB drive, tell BTRFS it's block storage and delete away. Once you're done you tell it not to use the disk and you're good.

This is why I'm back on ext4 now, too.

zlynx · on Jan 27, 2020

I do a similar thing with my laptop's swap partition. swapoff, add it to btrfs, then remove it and mkswap again. Always seemed safer than a potentially dodgy USB drive.

adamzegelin · on Jan 27, 2020

Or, if you don't care about your data, add a RAM disk :P

tezzer · on Jan 27, 2020

Good trick, I'll save that one. Thanks!

paulddraper · on Jan 27, 2020

I'm surprised they wouldn't reserved space.

IIRC ext4 reserves a (configurable) portion of the disk for system management; it seems like btrfs could easily do the same.

takeda · on Jan 27, 2020

Ext4 reserves space that can be used only by root, it is so system services can continue to work when users take all the space. It doesn't have issues like this if you exhaust all of that space.

In ZFS and I'm sure in btrfs you can set up quotas and reserved space, globally and or user, but by default it is set to 0. I actually set my quota to 80% because apparently if you fill ZFS more it causes heavy fragmentation.

paulddraper · on Jan 27, 2020

You're 100% correct; ext4 reserves this for a different purpose. I'm just saying it's not a entirely novel idea (albeit for a different reason).

kasabali · on Jan 28, 2020

Ext4 reserved space also helps with fragmentation.

To be more specific, reserved space on ext3 helps the fs so that it can be more flexible during allocation and avoid fragmentation.

Ext4 has delayed allocation mount option for that purpose so reserved space is not as much important for that but it'd still help if you turn off delayed allocations.

rbanffy · on Jan 27, 2020

Won't that only work for non-root processes?

cmurf · on Jan 28, 2020

A copy on write file system has this potential problem because nothing is overwritten. To delete anything requires space to write the metadata change reflecting the deletion, and before the data extends can be freed the change must be committed to stable media.

It's been years since Btrfs introduced "global reserve" which reserves enough metadata space to ensure it's possible to delete files on full file systems. But an old work around for this is to add a small device to the Btrfs volume, making is a 2 device volume. It could be a USB stick, a zram device (ramdisk), partition, or even a loop mounted file on some other file system. Delete the files, and now you can remove the temporary 2nd device.

Zardoz84 · on Jan 27, 2020

this wasn't fixed years ago ? I had a BTRFs partition filled by a rogue process and it not bricked. it allow me to remove the junk files without any issue. And I'm talking of a Ubuntu 14.04 Ltd server

tezzer · on Jan 27, 2020

This was 14.04 as well, on a Tegra (armhf) system.

eta: looked up the ticket, customer reported "when trying to delete any files, even as root, btrfs says "cannot remove"", field engineering observed the same.

pojntfx · on Jan 27, 2020

Love using Btrfs; the is no better filesystem than it nowadays that it's reliability issues have been fixed.

pmarreck · on Jan 27, 2020

I have to beg to differ here as I had a different experience that I literally just posted about to Reddit yesterday

https://www.reddit.com/r/zfs/comments/eu1qsj/a_tale_of_two_f...

tl;dr Unbeknownst to me I had a bad drive cable for an external NVMe enclosure that was causing intermittent I/O errors (only during high drive utilization) that went undetected by BTRFS and slowly corrupted my drive, eventually leading to an unbootable and unrepairable system (and to be fair, I should have scrubbed instead of attempting btrfsck --repair from another booted drive, but I don't care what you say, a --repair function should NOT potentially cause FURTHER corruption if it is at all available in the tooling! Like, just fucking rip it out if it can potentially make things worse, or recode the damn thing to just act defensively... jeez)

Wiped the drive and started over with Ubuntu 19.10 and its new integrated ZFS on Root support... ZFS detected the IO issue pretty much instantly and prevented further errors by freezing I/O. Swapped the cable out during my troubleshooting and the issue went away. Also, drive is plenty fast, read test at 800MB/s

xzcat · on Jan 27, 2020

I'll throw in my own anecdote. ZFS on root caused me a significant amount of headache when the proxmox node I was using it on just randomly decided it wasn't going to boot anymore. The ZFS pools were fine, no data was lost, but no amount of messing with it fixed the zfsonroot and it was quite difficult to find quality search results for.

And of course it was a weekend where my parents and siblings and in-laws were visiting, so I had the joy of going around messing with DNS settings wherever someone had a device that only paid attention to the first two DNS servers in the DHCP settings.

(I've since changed my DNS setup- now I only have a primary self-hosted one that's on an RPi in my networking cabinet, and the second entry is Google. I figure if I only get two servers that are respected for real, I'm making sure one of them is google.)

lostlogin · on Jan 27, 2020

> I only have a primary self-hosted one that's on an RPi in my networking cabinet, and the second entry is Google.

I was under the impression that there was no such thing as primary and secondary for DNS, just ‘here is one’ and ‘here is another’, with someone going for a terrible naming system of ‘primary’ and secondary’. I’m no expert and my knowledge come from messing about with Pihole and reading their documentation.

jmcnulty · on Jan 28, 2020

The first nameserver listed in resolv.conf is kind of a primary as it will always be consulted first, unless you add "options rotate". The next nameserver only come into play if the first doesn't respond (default 5 seconds, also tunable with options). They're not named primary/secondary in the file but could be considered that way.

nwmcsween · on Jan 28, 2020

Don't rely on this behaviour, many DNS libraries, will query all or n to save on latency.

saalweachter · on Jan 27, 2020

I suspect that both BTFS and ZFS are currently good enough under most configurations that most users don't have a problem with whichever they choose, and it's only a tiny fraction that has a really good or bad experience and becomes a rabid advocate based on their anecdotes.

rleigh · on Jan 27, 2020

This is an obvious truism. Of course they appear to work correctly under ideal conditions.

The real question is how they behave under less than ideal conditions. It is these conditions where Btrfs has performed poorly, and where ZFS has performed very well. I lost several Btrfs filesystems due to its poorly-tested and broken error handling trashing the filesystem beyond recovery.

The selling point of both of these filesystems is their robustness, fault-tolerance and ability to self-heal. Only one of them actually delivers.

Filligree · on Jan 27, 2020

That's really a packaging issue, not a ZFS issue, but I feel your pain.

The best suggestion I can offer is to use a distribution that treats it like a first-class citizen, such as... well, the Ubuntu support is still beta level, so only NixOS for now.

pmarreck · on Jan 28, 2020

> when the proxmox node I was using it on just randomly decided it wasn't going to boot anymore

could this possibly be proxmox's fault more than ZFS's fault? You even said the pools were fine

nickik · on Jan 27, 2020

That's why FS integration into the kernel would have been so important for the whole software ecosystem.

mdip · on Jan 27, 2020

I tend to agree with you here -- reliability has been a non-issue for me, though I've never configured `btrfs` in its RAID configuration.

Performance becomes an issue in certain cases, but in every one that I've encountered, adjusting configuration has resolved the problems to my satisfaction.

Would my Windows 10 VM run better under a different filesystem, rather than `btrfs` with various tweaks applied? Reading relatively recent articles on the subject would suggest that it would, however, I'd rather work with a single filesystem type and understand its strengths/weaknesses than manage two different filesystems as long as I can get performance to a usable state.

ansible · on Jan 27, 2020

We have run btrfs in RAID configuration, but that has usability issues, even just doing RAID-1.

We've switched back to using MD (mdadm) for RAID-1 setup, and then using btrfs on top of that for the snapshots, send / receive, block-level CRC and such.

Dealing with failed drives isn't as easy with btrfs as it is with Linux MD.

Dylan16807 · on Jan 27, 2020

Does that include performance reliability?

It wasn't very long ago that I had BTRFS drives on two separate systems develop crippling performance issues, with random delays increasing up to seconds, and the filesystem going unresponsive for even longer when I deleted snapshots. I think something about the performance was degrading every time an hourly snapshot was made, even though the system only kept a couple dozen of them at a time.

pantalaimon · on Jan 27, 2020

> nowadays that it's reliability issues have been fixed

Is this also true for RAID5/6?

jxcl · on Jan 27, 2020

This issue has its own wiki page on the BTRFS wiki:

https://btrfs.wiki.kernel.org/index.php/RAID56

So, no, that particular issue hasn't been fixed.

3fe9a03ccd14ca5 · on Jan 27, 2020

> * For data, it should be safe as long as a scrub is run immediately after any unclean shutdown.*

That’s unfortunate. Does the scrub run automatically in those situations? Consumer hardware will be the most prone to intermittent power failure.

sp332 · on Jan 27, 2020

There are some big caveats there. https://btrfs.wiki.kernel.org/index.php/RAID56

buttersbrian · on Jan 27, 2020

is it considered better than ZFS?

kiney · on Jan 27, 2020

I use BTRFS on several devices for years. The tooling is a bit rough, but no major problems. Just recently data checksumming saved me: In December I replace an old 2TB drive in my RAID1 (2+4+4+4) with an 8TB drive. The new drive had checksum errors after a few weeks which BTRFS handled gracefully. With "classical" RAID i might only have noticed when it's to late. (I RMAed the bad drive)

  [/dev/mapper/h4_crypt].write_io_errs    0
  [/dev/mapper/h4_crypt].read_io_errs     0
  [/dev/mapper/h4_crypt].flush_io_errs    0
  [/dev/mapper/h4_crypt].corruption_errs  0
  [/dev/mapper/h4_crypt].generation_errs  0
  [/dev/mapper/h2_crypt].write_io_errs    0
  [/dev/mapper/h2_crypt].read_io_errs     30
  [/dev/mapper/h2_crypt].flush_io_errs    0
  [/dev/mapper/h2_crypt].corruption_errs  0
  [/dev/mapper/h2_crypt].generation_errs  0
  [/dev/mapper/h1_crypt].write_io_errs    0
  [/dev/mapper/h1_crypt].read_io_errs     0
  [/dev/mapper/h1_crypt].flush_io_errs    0
  [/dev/mapper/h1_crypt].corruption_errs  0
  [/dev/mapper/h1_crypt].generation_errs  0
  [/dev/mapper/h3_crypt].write_io_errs    0
  [/dev/mapper/h3_crypt].read_io_errs     0
  [/dev/mapper/h3_crypt].flush_io_errs    0
  [/dev/mapper/h3_crypt].corruption_errs  0
  [/dev/mapper/h3_crypt].generation_errs  0
  [/dev/mapper/luks-e120f41e-9c8a-4808-876f-fa6665ee8bb8].write_io_errs    0
  [/dev/mapper/luks-e120f41e-9c8a-4808-876f-fa6665ee8bb8].read_io_errs     16
  [/dev/mapper/luks-e120f41e-9c8a-4808-876f-fa6665ee8bb8].flush_io_errs    0
  [/dev/mapper/luks-e120f41e-9c8a-4808-876f-fa6665ee8bb8].corruption_errs  20619
  [/dev/mapper/luks-e120f41e-9c8a-4808-876f-fa6665ee8bb8].generation_errs  0

edit: formatting

epx · on Jan 27, 2020

I have been using btrfs in my "NAS"/personal server for 3 years, changed disk configuration a couple times, I do snapshots every hour and prune them using a Fibonacci-like timeline, no problems yet.

Teknoman117 · on Jan 27, 2020

My experience has been the same. Admittedly, I've not tried native BTRFS parity raid (I'm sitting the volume on top of mdraid). But, I ran the "mkfs.btrfs" 5 years ago at this point for my desktop and no data loss yet. I back things up religiously, so I'm not too worried about the volume failing, but it'll be nice if btrfs parity raid gets stabilized, because I could replace my current NAS storage config.

I used to use ZFS on my NAS, but after running it for a year and fiddling with it, I wasn't able to tune it in a way I liked. I always had random performance problems and zvols were super slow. It's now dm-integrity on all disks, an mdraid raid6 volume over those, with LVM2 on top of that and mirrored NVMe disks as a read and write cache.

I also wish BTRFS would add extents at some point so you could run virtual machine images from it without weird performance issues from time to time (although I imagine this is less of an issue on SSDs because they're "fragmented" inside anyways).

alyandon · on Jan 27, 2020

I use btrfs in raid1 mode and the ability to shrink/grow/add/remove devices at will without data loss or extended downtime led me to choose btrfs over zfs on my home servers.

cyphar · on Jan 27, 2020

You can grow and add/remove raid1 devices (mirror vdevs) in ZFS without any significant work or downtime. Shrinking does require a bit more work, but depending on your setup it can be done fairly painlessly with send/recv (and shrinking is usually not something which is a very common administrative operation).

xzcat · on Jan 27, 2020

"fairly painlessly" and "without significant work or downtime" doesn't sound like it lines up with btrfs's, which I would describe as "one command and zero downtime (just some io load if you rebalance immediately)" for both operations. btrfs is also mainline, which increases how painless it is to use.

BTRFS does have some scary stories from earlier in its development, and true raid5 seems like it's unlikely to be safe for quite a while, but raid1 and "normal" fs usage has been rock solid in my experience. The only time I've ever had an issue was probably 4 years ago at this point, and it was solved by just booting an Arch live iso and running a btrfs command that was basically "fix exactly the bug that your error message indicates". I don't remember exactly what it is, something about two sizes not matching, but googling the text it showed at boot led me directly to the command to fix it. Certainly dramatically less trouble than I've ever had when hardware RAID goes south.

I do agree that modern lvm does probably compete with btrfs, but again you're trading how dang simple btrfs raid1 is to manage for monkeying with partitions in lvm in exchange for ~some? performance.

IMO ZFS is in a weird spot where I don't know where I'd use it. It's too complicated/annoying to admin for me to want to run it in my basement for myself/my family, and for anything bigger or more professional I'd use ceph or a problem-domain-specific storage system (HDFS, clickhouse, aws, etc).

cyphar · on Jan 28, 2020

The first operations I mentioned (adding or removing a device from a vdev, or adding a new vdev) are one command with no downtime:

  % zpool attach <pool> <vdev> <device> // add to a vdev
  % zpool detach <pool> <vdev> <device> // remove from a vdev
  % zpool add <pool> <vdev> <devices...> // add a new vdev

In the newest ZFS versions, you can also remove mirror and singleton vdevs (this does require some time -- because the data needs to be copied from the drives) but it's all done in the background:

  % zpool remove <pool> <vdev>

Shrinking a pool "the old way" (which is still sometimes necessary depending on what you're doing) is definitely more involved -- you have to create a new pool with the layout you need and then do a zfs send/recv from your old pool to the new one. This does only take a handful of commands but I would definitely consider it to be a much more complicated affair than the operations I mentioned above.

I would not (nor did I) compare LVM (or md-raid) to btrfs or ZFS -- those technologies have fundamental limitations regarding the integrity of your data that ZFS (and btrfs) don't have. And don't get me wrong -- I don't have a problem with btrfs (I run btrfs on all of my machines except my home server -- which runs ZFS), I just disagree with GP's point that ease of use is an argument for btrfs over ZFS. There are many arguments for either technology.

> btrfs is also mainline, which increases how painless it is to use.

I agree that this is one argument to pick btrfs over ZFS (though on most distributions it isn't really that hard to install ZFS, the fact that btrfs requires zero extra work to use on Linux is a benefit).

3fe9a03ccd14ca5 · on Jan 27, 2020

How? My understanding is that you create a new vdev and add the old vdev as a device, basically recursively creating volumes with each new device you add.

cyphar · on Jan 28, 2020

Which operation are you asking about? [1] is a sister comment which I posted that outlines how to do most of the operations I mentioned.

[1]: https://news.ycombinator.com/item?id=22168494

Shalle135 · on Jan 27, 2020

Is there any specific reasons to run btrfs over for example ext4? You can create/shrink/grow pools, create encrypted volumes etc by using LVM.

It all depends on the application but in the majority of cases the io performance of btrfs is worse than the alternatives.

Redhat for example choose to deprecate btrfs for unknown reasons while SUSE made it it’s default. The future of it seems uncertain which may cause a lot of headache’s in major environments if implemented there.

derefr · on Jan 27, 2020

Redhat and SUSE (SLES) are both enterprise environments, so at every level, they have to choose one tech stack to go all-in on (i.e. to train their support staffs on), and then discourage their customers from using the others. (“Deprecating” a component, for such orgs, means that some of their customers are now stuck with it, and they’ll continue to support those customers in their use of it, but they certainly won’t support new customers using it.)

The fact that one enterprise-support provider went all-in on Btrfs, while another didn’t, basically tells you that the choice is pretty arbitrary. If no enterprise-support provider used Btrfs, then I’d be concerned.

Arnavion · on Jan 27, 2020

The enterprise provider that actually develops btrfs continues to support btrfs, and one enterprise provider that doesn't stopped supporting it.

People treat RH stopping support of btrfs as some sort of death knell for it. Meanwhile all the btrfs users are confused why RH's opinion should matter at all when they weren't that involved with developing it in the first place.

As an opensuse user, btrfs has saved multiple machines from botched updates by letting me revert to the snapshot from right before the update was applied (opensuse's update tool automatically takes snapshots before and after updates).

Conan_Kudo · on Jan 27, 2020

Red Hat used to be heavily involved in Btrfs development. In fact, they are present in a huge chunk of its development in the first few years. But their developers were hired away by Facebook, leaving Red Hat with nobody who work on Btrfs regularly. That's the underlying cause for why they stopped supporting it. Hiring someone to work on Btrfs takes time and effort that they don't have a reason to spend right now.

protanopia · on Jan 27, 2020

> Redhat for example choose to deprecate btrfs for unknown reason

According to Josef Bacik, RH deprecated btrfs because he was the engineer in charge for Btrfs and had left the company.

https://news.ycombinator.com/item?id=14907771

rrauenza · on Jan 27, 2020

I've also been running btrfs (on CentOS7) for about five years on my home NAS.

One advantage is it detects bit rot -- and you can scrub the disks once a week looking for the bad blocks.

I also like the inline compression.

I run at RAID1 and the only issue I had was several years ago there was a bug about freeing allocations so occasionally the filesystem would be full but not full.

jhoechtl · on Jan 27, 2020

Docker on btrfs can benefit from fs supported layers. It's fantastic!

baroomba · on Jan 27, 2020

I avoid LVM because trying to install with it always seems to break things, either making the install fail or breaking later during an upgrade. And I mean on normal-ass distros like Debian and Ubuntu, not anything odd, and not even with "fancy" features like disk encryption involved (I can only imagine the mess that'd introduce).

Then again I avoided Grub for years because I found it fiddlier and more breakage-prone than LILO, so possibly I'm just an idiot and/or jinxed when it comes to new things in Linux.

riku_iki · on Jan 27, 2020

For me, killer feature is transparent compression, I work with a lot of numerical data in Postgres, and running it over btrfs is the only viable way to compress it.

mekster · on Jan 29, 2020

Also, who uses btrfs in production? I only heard Facebook is using it somewhere but never read about others. Why is Facebook using btrfs yet there seems to be no publicity to make it more popular for external contributions.

zielmicha · on Jan 27, 2020

fsync is still a bit slow on BTRFS (on ZFS too, but to a smaller degree). For example, I just did a quick benchmark on Linux 5.3.0 - installing Emacs on fresh Ubuntu 18.04 chroot (dpkg calls fsync after every installed package).

ext4 - 33s, ZFS - 50s, btfrs - 74s

(test was ran on Vultr.com 2GB virtual machine, backing disk was allocated using "fallocate --length 10G" on ext4 filesystem, the results are very consistent)

nopurpose · on Jan 27, 2020

is ext4 also from fallocated file on underneath ext4?

zielmicha · on Jan 27, 2020

lousken · on Jan 27, 2020

Did anyone had the courage to use btrfs in production? Any stories to share?

jhalstead · on Jan 27, 2020

Seems like Facebook uses it:

"Btrfs has played a role in increasing efficiency and resource utilization in Facebook’s data centers in a number of different applications. Recently, Btrfs helped eliminate priority inversions caused by the journaling behavior of the previous filesystem, when used for I/O control with cgroup2 (described below). Btrfs is the only filesystem implementation that currently works with resource isolation, and it’s now deployed on millions of servers, driving significant efficiency gains."

https://engineering.fb.com/open-source/linux/

alexgartrell · on Jan 27, 2020

Yeah there are a remarkable set of container runtime tasks (package downloads, rootfs creation and management, etc) that are way easier with btrfs. It wasn’t always smooth sailing but luckily Chris, Josef, Omar and others are awesome and now (and for the last while) we are asking for features rather than fixes.

ak217 · on Jan 27, 2020

At a previous job, I deployed btrfs to production in a system that continuously spins up and shuts down thousands of VMs. A key feature that I was able to leverage to make this easy is seed devices. This btrfs feature works similarly to overlay filesystems.

If I were doing that today, I would do a bake-off of OverlayFS vs. btrfs for this feature. Btrfs has many other compelling features that may make it worth using, although it's always been slower than ext4/xfs so I'd also need to check how it does with modern ultra high performance NVMe drives.

Btrfs never lost our data, although there was a kernel panic in the journal writing code in the Linux 3.2/Ubuntu 12.04 timeframe. The panic would not cause data loss but it did wedge VMs. Since that was fixed, it's had a 100% reliable run in that system, to my knowledge.

mekster · on Jan 29, 2020

I heard people get stable btrfs when certain features are turned off, so it may be helpful to say what you have turned on or off with its features when saying how it has been stable or not.

agravier · on Jan 27, 2020

It's the default on recent Synology NAS, in my experience. No particular issue in my limited experience. Mostly transparent for the user.

barclay · on Jan 27, 2020

(also a happy syno user here, been using it on several NAS's quite happily).

My rough understanding is synology did some pretty heavy modifications to btrfs in their implementation though... (a quick google finds me nothing to back this up, but i remember reading about it somewhere...)

InTheArena · on Jan 27, 2020

Not modifications per se, but it doesn't quite do the "normal" setup. Encryption is a mess (you can't export encrypted volumes via NFS), and the caching layer on top of it seems prone to corruption on the SSD (I've had my NVMe mirror cache drop twice over the last year and a half).

I'd like to see them move to full disk encryption rather then their current approach.

vetinari · on Jan 27, 2020

They do encryption/compression on subvolume level; each share you create is a separate subvolume.

For RAID5, they are using it on top of LVM, but with some modification - the synology implementation hooks LVM and btrfs together, so it gets ZFS-like properties.

There's a guy on the internet, who was playing around with it: https://daltondur.st/syno_btrfs_1/

tobias3 · on Jan 28, 2020

So they have fixed the last big hurdle to btrfs adoption in the small (single node) NAS space and are just sitting on it (violating the GPL). I urge any Synology user to write them to send you the Linux kernel source then upload it somewhere... though, their last Linux kernel drop seems to have been in 2017, so not much hope there...

vetinari · on Jan 27, 2020

CZ.NIC (the .cz tld administrator) is shipping OpenWRT-based wifi routers Turris Omnia and Turris Mox. Both of them use btrfs as their filesystem.

boris · on Jan 27, 2020

We (the build2 project) use it in our CI infrastructure for VM storage. For every build we make a snapshot of a VM, boot it, build, drop the snapshot, repeat. So we are talking about making/dropping snapshots every couple of minutes 24x7 for months without a reboot. We haven't had a single issue.

rleigh · on Jan 27, 2020

No problems with unbalancing?

When I was doing whole rebuilds of Debian, using e.g. 8 parallel builds of >18000 packages, it was creating and destroying a snapshot once every few seconds to minutes, at most 8 snapshots in existence at once. It got unbalanced and went write only every 36 hours. A clean brand new filesystem which never had more than 10% space utilisation and was typically around 1%.

archi42 · on Jan 27, 2020

At home: I'm running a RAID1 btrfs on my 12 disk cold storage (rackmount, SAS backplane, JBOD SAS controller). It has two new 4TB 24/7-rated SATA disks I got for that NAS, the rest is mostly salvaged from work (old drives, 500GB to 1TB). I had exactly the same selling point on btrfs as the author - I see a "huge" 7.8TB RAID1, and once it fills up I just swap an old disk (or two) for another 24/7 disk with decent TB/$.

At work: I was told our OpenSUSEs had some failures/data-loss, so we're not using the default btrfs on these. Though I don't know with what version that was (we migrated to OpenSUSE about 3 years ago).

mongol · on Jan 27, 2020

I think some of SUSE's customers certainly must? Otherwise it would not make any sense for them to support it and stand behind it by now...?

pQd · on Jan 28, 2020

i've been using BTRFS since 2014 to store backups. there is a noticeable performance penalty when rsync'ing hundreds of thousands of files to a spinning-rust disk connected to USB-SATA dock when BTRFS is used instead of EXT4. i'm accepting it in exchange for ability to run scheduled scrub of the data to detect potential bitrot.

since 2017 i'm also using BTRFS to host mysql replication slaves. every 15 min, 1h, 12h crash-consistent snapshots of the running database files are taken and kept for couple of days. there's consensus that - due to its COW nature - BTRFS is not well suited for hosting vms, databases or any other type of files that change frequently. performance is significantly worse compared to EXT4 - this can lead to slave lag. but slave-lag can be mitigated by: using NVMe drives and relaxing durability of MySQL innodb engine. i've used those snapshots few times each year - it worked fine so far. snapshots should never be the main backup strategy, independently of them there's a full database backup done daily from masters using mysqldump. snapshots are useful whenever you need to very quickly access state of the production data from few minutes or hours ago - for instance after fat fingering some live data.

during those years i've seen kernel crashes most likely due to BTRFS but i did not lose data as long as the underlying drives were healthy.

izacus · on Jan 27, 2020

It's also worth noting that Synology uses btrfs as an option to do checksumming and snapshots on their NAS devices.

They're still using their own RAID layer though.

ValentineC · on Jan 27, 2020

> They're still using their own RAID layer though.

Synology's RAID implementation is largely mdadm + LVM.

cmurf · on Jan 28, 2020

kernel 5.5 released Sunday. Btrfs now has raid1c3, raid1c4 profiles for 3 and 4 copy raid1. Adds new checksum algorithms: xxhash, blake2b, sha256.

Async discards coming in 5.6. https://lore.kernel.org/linux-btrfs/cover.1580142284.git.dst...

abotsis · on Jan 27, 2020

It’s worth noting that much of the premise of the article (wanting flexibility) is outdated. Zfs has support for removing top-level raid 0/1 vdevs now. So you can take a raid10 pool, and remove a top level mirror vdev completely. Note that this doesn’t work for raid5/6 vdevs, but as the author points out, those are becoming less and less used because of rebuild time and performance.

In addition to the slew of other features Btrfs is missing (send/recv, dedup, etc) zfs allows you to dedicate something like an Intel optane (or other similar high write endurance, low latency ssd) to act as stable storage for sync writes, and a different device (typically mlc or tlc flash) to extend the read cache.

kstrauser · on Jan 27, 2020

I think there's a selection bias here: people using RAID 5/6 may not be using ZFS as much because it's not well supported. I'd bet money that those levels are much more common in SOHO settings than RAID 10 is, because it's still the sweet spot for "I need lots of storage" vs "...and am willing to spend drive's worth of storage on availability". For instance, anyone using a NAS primarily as a backup target for desktops and small servers may love RAID 5, but be unwilling to throw money at a "better" RAID 10 setup.

the8472 · on Jan 27, 2020

btrfs has send/recv. And dedup, which is more efficient that ZFS' since it can be performed offline, on select parts of the filesystem and doesn't have to keep gigabytes of dedup tables in memory.

herf · on Jan 27, 2020

zfs remove is not a very good implementation - it keeps the old blocks around (as a virtual device) and redirects them to new locations. This is fine for "oops I accidentally added a device" but not great otherwise.

abotsis · on Jan 27, 2020

*Just kidding on send/recv, looks like it’s there now. Substitute with encryption if you need another example.

geophertz · on Jan 27, 2020

Is using btrfs on a personal machine something to do? It seems that all the comments as well as articles about it, just assume you're running it on a server.

The ability to add and remove disks on a desktop machine is very tempting.

wtfrmyinitials · on Jan 27, 2020

I've been running it on my desktop for a while and it's been wonderful. I have a cron job set to take a snapshot of the filesystem hourly so if I ever blow a file away or a package upgrade goes wonky I'm back up and running in minutes.

mdip · on Jan 27, 2020

I've been a `btrfs` user for the better part of 4 years despite, at the time, a very vocal group providing advice against it[0].

I'll be the first to say that it isn't a silver bullet for everything. But then, what filesystem really is? Filesystems are such a critical part of a running OS that we expect perfection for every use case; filesystem bugs or quirks[1] result in data loss which is usually Really Bad(tm).

That said, for the last two years, I've been running Linux on a Thinkpad with a Windows 10 VM in KVM/qemu -- both are running all the time. When I first configured my Windows 10 VM, performance was brutal; there were times when writes would stall the mouse cursor and the issue was directly related to `btrfs`. I didn't ditch the file-system, I switched to a raw volume for my VM and adjusted some settings that affected how `btrfs` interacted with it. I discovered similar things happened when running a `balance` on the filesystem and after a bit of research, found that changing the IO scheduler to one more commonly used on spindle HDDs made everything more stable.

So why use something that requires so much grief to get it working? Because those settings changes are a minor inconvenience compared against the things "I don't have to mess with" to cover a bigger problem that I frequently encountered: OS recovery. An out-of-the-box OpenSUSE Tumbleweed installation uses `btrfs` on root. Every time software is added/modified, or `yast` (the user-friendly administrative tool) is run, a snapshot is taken automatically. When I or my OS screws something up, I have a boot menu that lets me "go back" to prior to the modification. It Just Works(tm). In the last two years, I've had around 4-5 cases where my OS was wrecked by keeping things up to date, or tweaking configuration. In the past, I'd be re-installing. Now, I reboot after applying updates and if things are messed up, I reboot again, restore from a read-only snapshot and I'm back. I have no use for RAID or much else[2] which is one of the oft-repeated "issues" people identify with `btrfs`.

It fits for my use-case, along with many of the other use-cases I encounter frequently. It's not perfect, but neither is any filesystem. I won't even argue that other people with the same use case will come to the same conclusion. But as far as I'm concerned, damn it works well.

[0] I want to say that an installation of openSUSE ended up causing me to switch to `btrfs`, but I can't remember for sure -- that's all I run, personally, and it is a default for a new installation's root drive.

[1] Bug: a specific feature (i.e. RAID) just doesn't work. Quirk: the filesystem has multiple concepts of "free space" that don't necessarily line up with what running applications understand.

[2] My servers all have LSI or other hardware RAID controllers and present the array as a single disk to the OS; I'm not relying on my filesystem to manage that. My laptop has a single SSD.

nickik · on Jan 27, 2020

Being 'The Dude' of file system is literally the opposite of what I want. When looking at ZFS talks and the incredible complexity of some of those operations that Btrfs seems to think are 'no big deal', I will simply not trust that. Specially because it has been proven over and over again that Btrfs claims its 'stable' and then a new series of issues show up. Or its 'stable' but not if you use 'XY feature', or if the disk is 'to full' or whatever.

I remember using it after I had heard it was 'stable' and it eat my data not long after (not using crazy features or anything). I certainty will not use it again. A FS should be stable from the beginning, as stable core that you can then build features around, rather then a system with lots of feature that promises to be stable in a couple years (and then wasn't years after being in the kernel already).

Using ZFS for me has been nothing but joy in comparison. Growing the ZFS pool for me has been no issue at all, I never saw a reason why I would want to reconfigure my pool. I went from 4TB to 16TB+ so far in multiple iterations.

Overall not having ZFS in Linux is a huge failure of the Linux world. I think its much more NIMBY then a license issue.

loudmax · on Jan 27, 2020

> I think its much more NIMBY then a license issue

How do you propose that ZFS be brought into Linux? When Sun released ZFS as open source, they made a deliberate decision to use a license that prevented it from being integrated into the Linux kernel. This was no accident. At the time, Sun was still pushing OpenSolaris which was losing ground to Linux. The ZFS on Linux project gets around this restriction by running ZFS in user space, but this is not optimal.

You can make a legitimate argument that Linux should have been released under a BSD style license (I think that would be wrong, but it's plausible). I don't see how you can argue that ZFS's license is somehow the fault of the Linux world.

SignalsFromBob · on Jan 29, 2020

> The ZFS on Linux project gets around this restriction by running ZFS in user space

ZFS on Linux is a kernel module. You may be thinking of ZFS-FUSE which runs in user space using FUSE, but I'm not sure if it's being maintained any more.

trasz · on Jan 27, 2020

On the other hand, choosing to use GPL would prevent it from being integrated anywhere else. You'd also lose the patent protection granted by CDDL.

vetinari · on Jan 27, 2020

It's not like they couldn't use dual licenses, like Mozilla did at the time, for example.

aidenn0 · on Jan 28, 2020

Any license more liberal than the GPL would also be fine. For example: MIT/X11, 2-clause BSD, or Ruby's license.

nickik · on Jan 27, 2020

> When Sun released ZFS as open source, they made a deliberate decision to use a license that prevented it from being integrated into the Linux kernel

This is simply totally false no matter how many times people repeat it. It pure FUD.

Sun picked the licence because they had to allow linking with closed code for their products, going with the GPL was simply not viable given the situation with drivers on their platforms. Their licence is actually build on the Mozilla licence without forcing resolution in California. Sun actually spend quite a bit of time and resources to develop a really good licence and made it as open they could given their constraints.

Also, Sun very agressivily pushed their technologies to other systems and Linux would have been no exeption. Sun helped Apple integrate D-Trace, and at the same time the hedge an evil plan to not give it to Linux? They helped upstream things to the BSDs as well.

That simply conspiricy nonsense that was typical with the 'its actually GNU/Linux' crowd that was pushed in the 2000s. Sun was seen as evil coorprate trying to stamp on the 'real open source' community, looking back on this now, the absurdity of that sentement should be clear. Sun made mistakes, but their overall track record was staller.

The idea that the function of the GPL is to block other Open Source code from integrating into an Open Source project is an abolute insane concept and a total perversion of the Idea of Open Source. Literally using the supposed 'most free' GPL to activly block and exclude other Open Source code from people.

loudmax · on Jan 27, 2020

Sun's motivation for choosing the CDDL is beside the point. Unless ZFS is released under a license that allows it to be redistributed under the GPL, ZFS cannot be legally built into Linux as a filesystem.

If you have reason to believe that Linux developers can go ahead and simply integrate ZFS into Linux without worrying about the license, I'm sure lawyers from the FSF, IBM, Cannonical, etc would love to hear your explanation.

floatboth · on Jan 27, 2020

Uh, isn't Canonical already shipping ZFS in the latest Ubuntu? https://www.techrepublic.com/article/something-exciting-is-c...

Their argument (as I understand it) is that loadable kernel modules are separate discrete pieces of software that do not become "part of" the kernel and do not have to care about kernel licensing. They can be any license, including proprietary, like nvidia drivers.

There is about zero need for "integration" as in static linking / inclusion in the linux repo btw. Nothing wrong with dkms.

nickik · on Jan 28, 2020

This is not a new thought and the lawyers actually understand this. GPL was not designed to protect user from open source. And its an idiotic missapplication. Oracle themself diliver ZFS with Linux, and so do many, many others.

The only 'argument' is that we can't do it because 'big bad oracle' will sue you but that really doesn't hold up.

aidenn0 · on Jan 28, 2020

> Sun's motivation for choosing the CDDL is beside the point.

GP made a claim to Sun's motivation, rebutting that claim seams reasonable.