
I don't see the technical reasoning there, but I agree that there's a thematic parallel: they're both reasonably straightforward ideas with very clean user-facing stories. They can both be implemented very simply (literally as part of an undergraduate course -- I remember writing both).

And they both have undesirable performance problems in certain regimes, the solution to which has been the unending quest of generations of incredibly smart programmers and academics. They have led to staggering complexity, weird bugs, huge near-unmaintainable codebases, and uncounted PhD theses. Whole careers have been aimed at this stuff.

And frankly it's been, from many perspectives, mostly a waste. ZFS/btrfs are, for 95% of problems, indistinguishable from FFS/ext2 implementations that are a tiny fraction of their size. Modern Java and .NET VMs still have latency issues under GC pressure that were equally visible in Emacs Lisp a quarter century ago.

Applications which have hard requirements that fly in the face of these systems don't use them. Serious storage servers don't use the filesystem except vestigially (i.e. they do manual sync on very large files, or they use a block device directly). Serious realtime apps don't do GC; they manage memory themselves. And that's not going to change no matter how many PhDs we throw at the problem.




I am afraid that is not even close to accurate. If you define "95% of problems" to be "reading and writing data such that data is read and written" sure, great.

However, there are two little, minor things that file systems care about: performance and safety. FFS/ext2 have neither of these.

Neither ext2 nor FFS contains a journal, nor do they do copy-on-write metadata. Heck, if you "upgrade" to ext3, you get the journal but nothing that protects you from bit-rot.

If you look at most drives on the market, you will see devices capable of corrupting data for certain after three years. Your journal'd filesystem does jack in this case; all it can do is ensure proper ordering of writes to metadata, with no guarantee that the data will be any good once written.
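
To make the bit-rot point concrete, here's a tiny self-contained C sketch (made-up structures, nothing like actual ZFS/btrfs internals) of a checksum stored with the block pointer and verified on read, which is the thing a bare journal never gives you: the journal orders writes, it never re-verifies data.

    /* Toy sketch: checksum recorded at write time, checked at read time.
     * A mismatch means bit-rot or a torn write; a mirrored copy can then
     * be tried.  Not real ZFS/btrfs code. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define BLOCK_SIZE 4096
    static uint8_t disk[BLOCK_SIZE];          /* stand-in for one on-disk block */

    struct block_ptr { uint64_t checksum; };  /* checksum lives with the pointer */

    /* FNV-1a: trivially simple, good enough to illustrate the idea. */
    static uint64_t fnv1a(const uint8_t *buf, size_t len)
    {
        uint64_t h = 1469598103934665603ULL;
        for (size_t i = 0; i < len; i++) { h ^= buf[i]; h *= 1099511628211ULL; }
        return h;
    }

    static void write_block(struct block_ptr *bp, const uint8_t *data)
    {
        memcpy(disk, data, BLOCK_SIZE);
        bp->checksum = fnv1a(data, BLOCK_SIZE);   /* recorded at write time */
    }

    static int read_block(const struct block_ptr *bp, uint8_t *data)
    {
        memcpy(data, disk, BLOCK_SIZE);
        if (fnv1a(data, BLOCK_SIZE) != bp->checksum)
            return -1;                            /* bit-rot detected: try a mirror */
        return 0;
    }

    int main(void)
    {
        uint8_t buf[BLOCK_SIZE] = "important data";
        struct block_ptr bp;

        write_block(&bp, buf);
        disk[100] ^= 0x01;                        /* simulate a flipped bit on media */

        if (read_block(&bp, buf) != 0)
            puts("corruption caught on read");    /* ext2/ext3 would return bad data */
        return 0;
    }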

How about performance? Well, if you look at FFS/ext2, they are essentially terrible: block-at-a-time allocators with no extent support. Good luck getting the most out of your storage media when you have to walk a per-block tree data structure. Granted, ZFS suffers from the same issue, but btrfs's extent tree design certainly does not. IIRC, state-of-the-art ext[23] implementations use read-ahead to ameliorate the problem but do not fundamentally cure it. If you look at ext4, it has adopted extent trees for file mapping (alongside the HTree structure for directory indexing).
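
Rough illustration of the metadata difference (a toy sketch, not any real on-disk format): a block-mapped file needs one pointer per 4K block, while an extent-mapped file needs one (start, length) entry per contiguous run.

    /* Toy sketch of the mapping-metadata difference; not any real on-disk
     * format.  ext2/FFS style: one pointer per block (indirect blocks
     * omitted).  Extent style: one entry per contiguous run. */
    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_SIZE 4096ULL

    struct extent {
        uint64_t logical_start;   /* file offset, in blocks   */
        uint64_t physical_start;  /* disk location, in blocks */
        uint64_t length;          /* run length, in blocks    */
    };

    int main(void)
    {
        uint64_t file_size = 1ULL << 30;              /* 1 GiB, contiguous on disk */
        uint64_t blocks    = file_size / BLOCK_SIZE;  /* 262144 four-KiB blocks    */

        /* Block-at-a-time mapping: one 32-bit pointer per block to allocate,
         * store, and walk on every sequential read. */
        printf("block-mapped: %llu pointers (%llu KiB of map)\n",
               (unsigned long long)blocks,
               (unsigned long long)(blocks * sizeof(uint32_t) / 1024));

        /* Extent mapping: the same contiguous file is one (start, length) entry. */
        printf("extent-mapped: 1 entry (%zu bytes)\n", sizeof(struct extent));
        return 0;
    }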

A filesystem like ZFS/btrfs is pretty immune to bit-rot: they can easily mirror their metadata, and they avoid overwriting metadata in place the way FFS/ext[234] do, making torn writes non-issues. They avoid the many pathologies of your "simpler" filesystem and trade the complexity for not needing an fsck mechanism in the face of data corruption; one should only be needed in the face of implementation bugs (you should note that ZFS has no fsck, and btrfs only recently obtained one).
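
The never-overwrite-in-place discipline looks roughly like this (toy sketch, not btrfs/ZFS code): the new metadata goes to fresh space, and only then does a single root pointer flip, so a crash or torn write mid-update leaves the old consistent tree untouched.

    /* Illustrative copy-on-write update.  The new version of a metadata
     * block is written to free space; only then is the root pointer
     * switched.  A crash or torn write before the switch leaves the old
     * tree fully intact -- which is why no fsck is needed for ordinary
     * crashes. */
    #include <stdio.h>
    #include <string.h>

    #define NBLOCKS 16

    struct meta { char payload[32]; };

    static struct meta space[NBLOCKS];  /* stand-in for free space on disk   */
    static int next_free = 1;           /* trivial allocator for the sketch  */
    static int root      = 0;           /* "superblock" pointer to live tree */

    /* Update never touches the live copy. */
    static void cow_update(const char *new_payload)
    {
        int slot = next_free++;                       /* 1. allocate new space   */
        strncpy(space[slot].payload, new_payload,
                sizeof(space[slot].payload) - 1);     /* 2. write the new copy   */
        /* 3. (a real filesystem issues a flush/barrier here)                */
        root = slot;                                  /* 4. atomically flip root */
    }

    int main(void)
    {
        strcpy(space[0].payload, "old consistent metadata");
        cow_update("new consistent metadata");
        printf("live tree: %s\n", space[root].payload);
        return 0;
    }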

Oh, and if any of you think soft updates work, they don't. It would be great if they really did, but in a world where drives actively reorder writes and do not respect SCSI/SATA/misc transport commands to flush their internal caches, you do not get safety. That set of drives is considerably large.
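
For reference, the write ordering that journaling (and a CoW root switch) depends on looks like this, assuming POSIX and a drive that actually honors fsync(), which is exactly the assumption that lying drives break.

    /* Rough sketch of the ordering a journal commit relies on.  Assumes a
     * drive that honors fsync(); if it acks the flush while the data sits
     * in its volatile cache, the ordering promise -- and soft updates'
     * promise -- evaporates.  "journal.demo" is just an illustrative name. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char entry[]  = "txn 42: update inode 7, block map ...\n";
        const char commit[] = "txn 42: COMMIT\n";

        int fd = open("journal.demo", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* 1. journal the intended changes */
        if (write(fd, entry, strlen(entry)) < 0) { perror("write"); return 1; }
        /* 2. flush: the entry must reach stable media BEFORE the commit record */
        if (fsync(fd) < 0) { perror("fsync"); return 1; }
        /* 3. only now write the commit record */
        if (write(fd, commit, strlen(commit)) < 0) { perror("write"); return 1; }
        /* 4. and flush again before acknowledging the transaction */
        if (fsync(fd) < 0) { perror("fsync"); return 1; }

        /* If the drive reorders these behind your back, the commit record
         * can hit the platter before the entry does, and the journal buys
         * you nothing on a power failure. */
        close(fd);
        return 0;
    }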

tl;dr You are oversimplifying the complex.


Clearly I hit a nerve.

>If you define "95% of problems" to be "reading and writing data such that data is read and written"

Pretty much. You have a competing definition? With the added point that "95% of problems" are mostly I/O-bound reads of unfragmented, write-once files and won't see benefit from extent allocation. And of the remaining 5%, most are database systems which are doing their own journaling and block allocation.

Does btrfs have nice features (I like snapshots myself)? Sure. Do I use it in preference to ext4? Yes. But be honest with yourself: it's only incrementally better than ext2 for pretty much everything you use a computer for. And yet it sits at the pinnacle of 40 years of concerted effort.

And garbage collection is much the same: a staggeringly huge amount of effort for comparatively little (but not zero, thus "worth it" by some metric) payout.

Edit: just to throw some gas on the^H^H^H^H point out the narrowness of vision that I think is endemic in this kind of thought:

> If you look at most drives on the market, you will see devices capable of corrupting data for certain after three years.

If you look at most actual filesystems on the market, you'll find they're embedded in devices which will be thrown out in two years when their contract expires. They'll also be lost or stolen with vastly higher frequency than that at which they will experience NAND failure. If you look at most high-value storage deployments in the real world, they have redundancy and backup regimes in place which make that filesystem feature merely a downtime/latency improvement.

Basically, if someone waved a magic wand and erased all fancy filesystems and GC implementations from the world... how much would really change? Apple has deployed a pretty good mobile OS without GC, after all. Oracle made a business out of shipping reliable databases over raw block devices 20 years ago. Try that same trick with other "difficult" software technologies (video compression, say) and things look much more grim.


My competing definition includes safety and performance. Not just how the system behaves a single epsilon after an operation, but whether we can read back data that we wrote last month, or last year, and do it while using the hardware's resources efficiently.

Clearly either you or the ext[234] developers are mistaken, as they have gone ahead and implemented extent-based allocation and file management.

btrfs's design is inherently safe while ext2's design is inherently unsafe. To make the analogy work, I would say that ext2 does not fit the design spec for a file system; it would be like a garbage collector that does not actually collect garbage.

The payout for data integrity, which no filesystem really handled until the WAFL era, is huge. If you cannot see that, this discussion has no point.

The only thing that I must be honest about is that I would not trust ext2 to store my data any more than I would trust you to implement a scheme to store my data.


It's really not worth continuing the argument. But I'll just point out that again you're arguing about features in isolation (e.g. "safety" instead of the value of what you want to keep safe, or dangers from things other than filesystem failure). It's easy to justify one filesystem as better than another. Our disconnect is that I'm talking about whether or not that filesystem is so much better as to justify the effort spent to develop it.

To make the point again: without btrfs (or really "the last 30 years of filesystem research") we'd all be OK. Likewise if V8 or LuaJIT had to do mark & sweep, or if everyone had to write in a language with manual storage allocation, things would look mostly the same.

Without video compression probably two thirds of the market for semiconductor devices and half the internet would simply evaporate (I pulled those numbers out of you-know-where, but I frankly doubt they're all that wrong). That tells me something about the relative value of these three applications of software engineering talent.


The question is, would we have known any of this if that research had not taken place? It is easy to have 20/20 hindsight. Even with it, though, I am pretty sure we still would have expended the effort to get where we are today.


I disagree with you on both points.

You can't solve all (FS) problems for all people simultaneously. Custom FS solutions by Apple (recently), Oracle (20 years ago), and countless other vendors in the last 3 decades are about choosing different trade-offs, learning from mistakes, and growing the technology with the hardware (fat-12 -> fat-16 -> fat-32 was all about growing with increasing drive size). But if you truly believe that nothing substantial has been gained, why did you bother upgrading to btrfs?

Older filesystems suffered from a variety of problems that could and did ruin data for a lot of people. Modern filesystems like NTFS5, ZFS, btrfs, etc., make their best effort to ensure data can actually be consistently read and written, which, while outside the most common path of reading non-fragmented files sequentially (I'd even say 99.99% of actual IO in non-server configurations), is still a use case that is important to the vast majority of computer users. You know, people who do business, homework, take pictures, ...

On the GC side of things, the Azul VM has proven quite effective at addressing many of the performance issues with garbage collection in Java (which gets even better when you stop doing stupid crap in Java). And there are garbage-collected languages other than Java that don't suffer from the same stuttering problem as the Java GC (and that have been around for 20+ years, steadily getting better over time).


FreeBSD FFS with softupdates has been doing metadata copy on write since about 1998. Recent versions have a journal too, to deal with the cases not covered by softupdates.

ZFS also relies on drive write barriers for safety. There is no hope if your disk lies.


> Modern Java and .NET VMs still have latency issues under GC pressure that were equally visible in Emacs Lisp a quarter century ago.

Aren't you jumping to conclusions a little bit too fast? That there are still issues in some applications means all the work on Garbage Collection was wasted? I would like to remind you that modern GC-ed Java programs are often close in performance to C++ code with hand-written memory management, this clearly was not "visible in Emacs Lisp a quarter century ago".


I believe that ZFS clocks in at fewer lines of code than competing filesystem + volume manager combos. That certainly was touted as a benefit of ZFS's "rampant layering violation" when first introduced.

End point: ZFS packs a huge quantity of functionality into 80K lines of code (probably closer to 90 or 100 now), which is quite a bit smaller than the combined size of the (less featureful) file system + volume manager implementations that it replaces.

See [1] for a comparison of ZFS vs. UFS+SVM. See [2] for Jeff Bonwick's discussion of the complexity win in ZFS.

[1] http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg0...

[2] https://blogs.oracle.com/bonwick/entry/rampant_layering_viol...



