
The more I read about modern garbage collection design, the more it seems reminiscent of filesystem design. It makes me wonder if there are large parallels between them that could be used advantageously to benefit both. Things that people have learned work well in filesystems might apply to garbage collection, and vice versa.



I don't see the technical reasoning there, but I agree that there's a thematic parallel: they're both reasonably straightforward ideas with very clean user-facing stories. They can both be implemented very simply (literally as part of an undergraduate course -- I remember writing both).

And they both have undesirable performance problems in certain regimes, the solution to which has been the unending quest of generations of incredibly smart programmers and academics. They have led to staggering complexity, weird bugs, huge near-unmaintainable codebases, and uncounted PhD theses. Whole careers have been aimed at this stuff.

And frankly it's been, from many perspectives, mostly a waste. ZFS/btrfs are, for 95% of problems, indistinguishable from FFS/ext2 implementations that are a tiny fraction of their size. Modern Java and .NET VMs still have latency issues under GC pressure that were equally visible in Emacs Lisp a quarter century ago.

Applications which have hard requirements that fly in the face of these systems don't use them. Serious storage servers don't use the filesystem except vestigially (i.e., they do manual sync on very large files, or they use a block device directly). Serious realtime apps don't do GC and manage memory themselves. And that's not going to change no matter how many PhDs we throw at the problem.


I am afraid that is not even close to accurate. If you define "95% of problems" to be "reading and writing data such that data is read and written," sure, great.

However, there are two little, minor things that file systems care about: performance and safety. FFS/ext2 have neither of these.

Neither ext2 nor FFS contains a journal, nor do they copy-on-write their metadata. Heck, if you "upgrade" to ext3, you get the journal but nothing that protects you from bit rot.

If you look at most drives on the market, you will see devices capable of corrupting data for certain after three years. Your journaled filesystem does jack in this case; all it can do is ensure proper ordering of metadata writes, with no guarantee that the data will be any good once written.

How about performance? Well, if you look at FFS/ext2, they are essentially terrible: block-at-a-time allocators with no extent support. Good luck getting the most out of your storage media with a block-tree data structure. Granted, ZFS suffers from the same issue, but btrfs's extent tree design certainly does not. IIRC, the state-of-the-art ext[23] implementations use read-ahead to ameliorate the problem but do not fundamentally cure it. If you look at ext4, it has adopted extent trees.
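To make the allocation difference concrete, here is a rough sketch (illustrative numbers only, not any real on-disk format) of the same contiguous file described block-at-a-time versus as extents:

  # Rough sketch, not a real on-disk format: a 1 GiB contiguous file
  # described block-at-a-time vs. as extents, assuming 4 KiB blocks.
  BLOCK_SIZE = 4096
  file_blocks = (1 << 30) // BLOCK_SIZE       # 262144 blocks in a 1 GiB file

  # ext2/FFS-style: one pointer per block (indirect blocks omitted).
  block_map = list(range(1000, 1000 + file_blocks))
  print(len(block_map), "block pointers to store and walk")

  # extent-style (ext4/btrfs): contiguous runs stored as (start, length).
  extents = [(1000, file_blocks)]
  print(len(extents), "extent for the whole file")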

Filesystems like ZFS/btrfs are pretty immune to bit rot: they can easily mirror their metadata, and they avoid overwriting metadata in place the way FFS/ext[234] do, making torn writes non-issues. They avoid the many pathologies of your "simpler" filesystems and trade the complexity for not needing an fsck mechanism in the face of data corruption; one should only be needed in the face of implementation bugs (you should note that ZFS has no fsck, and btrfs only recently obtained one).
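The checksumming side of this is easy to sketch (a toy illustration, nothing like the real ZFS/btrfs formats): keep a checksum next to the block pointer and verify it on every read, so corrupted media is detected instead of silently returned as data.

  # Toy illustration of checksummed block pointers (not ZFS/btrfs's format).
  import zlib

  def write_block(storage, addr, data):
      storage[addr] = bytearray(data)
      return (addr, zlib.crc32(data))       # "pointer" = address + checksum

  def read_block(storage, ptr):
      addr, expected = ptr
      data = bytes(storage[addr])
      if zlib.crc32(data) != expected:
          raise IOError("checksum mismatch: bitrot or torn write detected")
      return data

  disk = {}
  ptr = write_block(disk, 7, b"important metadata")
  disk[7][0] ^= 0xFF                        # simulate a flipped bit on the media
  read_block(disk, ptr)                     # raises instead of returning garbage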

Oh, and if any of you think soft updates work, they don't. It would be great if they really did, but in a world where drives actively reorder writes and do not respect the SCSI/SATA/misc transport commands to flush their internal caches, you do not get safety. That set of drives is huge.

tl;dr You are oversimplifying the complex.


Clearly I hit a nerve.

>If you define "95% of problems" to be "reading and writing data such that data is read and written"

Pretty much. You have a competing definition? With the added point that "95% of problems" are mostly I/O bound reads of unfragmented write-once files and won't see benefit from extent allocation. And of the remaining 5% most of them are database systems which are doing their own journaling and block allocation.

Does btrfs have nice features (I like snapshots myself)? Sure. Do I use it in preference to ext4? Yes. But be honest with yourself: it's only incrementally better than ext2 for pretty much everything you use a computer for. And yet it sits at the pinnacle of 40 years of concerted effort.

And garbage collection is much the same: a staggeringly huge amount of effort for comparatively little (but not zero, thus "worth it" by some metric) payout.

Edit: just to throw some gas on the^H^H^H^H point out the narrowness of vision that I think is endemic in this kind of thought:

> If you look at most drives on the market, you will see devices capable of corrupting data for certain after three years.

If you look at most actual filesystems on the market, you'll find they're embedded in devices which will be thrown out in two years when their contract expires. They'll also be lost or stolen with vastly higher frequency than that at which they will experience NAND failure. If you look at most high-value storage deployments in the real world, they have redundancy and backup regimes in place which make that filesystem feature merely a downtime/latency improvement.

Basically, if someone waved a magic wand and erased all fancy filesystems and GC implementations from the world... how much would really change? Apple has deployed a pretty good mobile OS without GC, after all. Oracle made a business out of shipping reliable databases over raw block devices 20 years ago. Try that same trick with other "difficult" software technologies (video compression, say) and things look much more grim.


My competing definition includes safety and performance. Not just how the system behaves a single epsilon after an operation, but whether we can read data that we wrote last month, or last year, and do it using the hardware's resources efficiently.

Clearly either you or the ext[234] developers are mistaken, as they have gone ahead and implemented extent-based allocation and file management.

btrfs's design is inherently safe while ext2's design is inherently unsafe. To make the analogy work, I would say that ext2 does not fit the design spec for a file system; it would be like a garbage collector that does not actually collect garbage.

The payout for data integrity, which no filesystems really handled until the WAFL era, is huge. If you cannot see that, this discussion has no point.

The only thing that I must be honest about is that I would not trust ext2 to store my data any more than I would trust you to implement a scheme to store my data.


It's really not worth continuing the argument. But I'll just point out that again you're arguing about features in isolation (e.g. "safety" instead of the value of what you want to keep safe, or dangers from things other than filesystem failure). It's easy to justify one filesystem as better than another. Our disconnect is that I'm talking about whether or not that filesystem is so much better as to justify the effort spent to develop it.

To make the point again: without btrfs (or really "the last 30 years of filesystem research") we'd all be OK. Likewise if V8 or LuaJIT had to do mark & sweep, or if everyone had to write in a language with manual storage allocation, things would look mostly the same.

Without video compression probably two thirds of the market for semiconductor devices and half the internet would simply evaporate (I pulled those numbers out of you-know-where, but I frankly doubt they're all that wrong). That tells me something about the relative value of these three applications of software engineering talent.


The question is, would we have known any of this if that research had not taken place? It is easy to have 20/20 hindsight. Even with it, though, I am pretty sure we still would have expended the effort to get where we are today.


I disagree with you on both points.

You can't solve all (FS) problems for all people simultaneously. Custom FS solutions by Apple (recently), Oracle (20 years ago), and countless other vendors in the last 3 decades are about choosing different trade-offs, learning from mistakes, and growing the technology with the hardware (fat-12 -> fat-16 -> fat-32 was all about growing with increasing drive size). But if you truly believe that nothing substantial has been gained, why did you bother upgrading to btrfs?

Older filesystems suffered from a variety of problems that could and did ruin data for a lot of people. Modern filesystems like NTFS5, ZFS, btrfs, etc., make their best effort to ensure data can actually be consistently read and written. While that lies outside the most common path of reading non-fragmented files sequentially (I'd even say 99.99% of actual IO in non-server configurations), it is still a use case that is important for the vast majority of computer users. You know, people who do business, homework, take pictures, ...

On the GC side of things, the Azul VM has proven quite effective at addressing many of the performance issues with garbage collection in Java (which gets even better when you stop doing stupid crap in Java). And there are garbage-collected languages other than Java that don't suffer the same stuttering problem as the Java GC (languages that have also been around for 20+ years and have steadily gotten better over time).


FreeBSD FFS with softupdates has been doing metadata copy on write since about 1998. Recent versions have a journal too, to deal with the cases not covered by softupdates.

ZFS also relies on drive write barriers for safety. There is no hope if your disk lies.


> Modern Java and .NET VMs still have latency issues under GC pressure that were equally visible in Emacs Lisp a quarter century ago.

Aren't you jumping to conclusions a little too fast? The fact that there are still issues in some applications means all the work on garbage collection was wasted? I would like to remind you that modern GC-ed Java programs are often close in performance to C++ code with hand-written memory management; this clearly was not "visible in Emacs Lisp a quarter century ago".


I believe that ZFS clocks in at fewer lines of code than competing filesystem + volume manager combos. That certainly was touted as a benefit of ZFS's "rampant layering violation" when first introduced.

End point: ZFS packs a huge quantity of functionality into 80K lines of code (probably closer to 90 or 100 now), which is quite a bit smaller than the combined size of the (less featureful) file system + volume manager implementations that it replaces.

See [1] for a comparison of ZFS vs. UFS+SVM. See [2] for Jeff Bonwick's discussion of the complexity win in ZFS.

[1] http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg0...

[2] https://blogs.oracle.com/bonwick/entry/rampant_layering_viol...


Convergence.

When we all have persistent, solid state drives, your filesystem will be your heap and GC...


To what solid state drive technology are you referring that would make good use as a heap? I wouldn't think that NAND flash would be suitable due to the write/erase constraints.


NAND flash is going away. Right now it loses half its write endurance, and its eraseblocks double in size, with every shrink. Eraseblocks are already at 4 MB and write endurance is below 1,000 writes; pretty soon additional shrinks will bring no value. This happens because of fundamental physical limits, and no one believes they can be overcome -- even the biggest proponents are talking about delaying the inevitable for a few years instead of avoiding it.
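Extrapolating those figures (the claims above, not measurements), a couple more shrinks would look like this:

  # Extrapolating the figures above: each shrink halves endurance
  # and doubles the eraseblock size.
  eraseblock_mb, endurance = 4, 1000
  for shrink in (1, 2, 3):
      eraseblock_mb *= 2
      endurance //= 2
      print(f"{shrink} more shrink(s): {eraseblock_mb} MB eraseblocks, ~{endurance} writes")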

Right now, there is a hundred-billion-dollar stampede going on to find its successor. Players include:

MRAM (Toshiba, Hitachi, Hynix, IBM, Everspin(Freescale spinoff), Samsung, NEC)

FeRAM (Ramtron, IBM, TI, Fujitsu, Samsung, Matsushita, Oki, Toshiba, Infineon, Hynix, Symetrix)

ReRAM (HP, ITRI, IMEC, Panasonic, Rambus)

CBRAM (NEC, Sony, Axon, Micron)

PRAM (Intel, IBM, ST Micro, Samsung, Numoxys)

I might have missed a few backers. Also, the names listed after a technology are not all working together; especially in FeRAM there are multiple competing approaches.

An interesting commonality about these technologies is that they all aim to be universal memories. That is, they intend to eventually replace both DRAM and Flash, and some are even aiming for the last cache levels in cpus. This will probably lead to some changes in system design. Although I think the people who are calling for the death of filesystems are going about it all wrong -- I expect the role of filesystems to expand, not shrink. All they need is new block layers.

Note that while everyone says memristors (ReRAM), they aren't even the most likely candidate (if I had to pick, I'd say PRAM); HP just has the best marketing.


Leon Chua also considers PRAM to be a memristor technology [1].

  All 2-terminal non-volatile memory devices based on
  resistance switching are memristors, regardless of the
  device material and physical operating mechanisms.
[1] http://www.springerlink.com/content/f41r8m054x550430/


IMFT got NAND down to 20nm with 3K-5K write endurance. Hynix plans to produce 15nm soon (unclear on endurance). Many thought getting NAND to its current level would be impossible; the same could be said about hard drives.

Write amplification isn't much of an issue with smart firmware.
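For reference, write amplification is just the ratio of what the flash actually programs to what the host asked to write; a rough worked example with made-up numbers:

  # Write amplification = bytes the flash programs / bytes the host wrote.
  host_write = 4 * 1024                 # host updates one 4 KiB sector
  page_size  = 16 * 1024                # flash programs whole pages
  naive_wa   = page_size / host_write   # naive read-modify-write of a page
  print("naive write amplification:", naive_wa)   # 4.0
  # A log-structured FTL instead appends small writes into open pages and
  # remaps them, pushing WA back toward 1 at the cost of background cleanup.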

The new memory techs aren't market-viable as we speak, and it's questionable whether first-generation commercial devices featuring them will even be available within a few years, and at what cost. Many of them also have drawbacks which hamper their viability or parallel the drawbacks of NAND.

I still think a real replacement for NAND is probably a decade out at least.



Was thinking of this article earlier this week: http://lwn.net/Articles/498289/ (which I'm presuming is some MRAM thing)


1.) As capacity increases, for a given workload, overwriting updates in place saves less capacity. So even without new technology, there's a trend towards fewer overwrites of a given block for a fixed workload.

2.) Memristor Memory


Memristors?


Multics already implemented "single-level storage" in the 1960s. Effectively, the disk was all swap space for a huge RAM disk.

https://en.wikipedia.org/wiki/Multics#Novel_ideas


The research group I was part of worked on single address space OS's from the early 90s to the late 00's, http://www.ertos.nicta.com.au/research/old/mungi//manifesto.... , and yep, these ideas have been around for a long time.


Looks like IBM i is still/currently shipping with a single address space.

Wouldn't surprise me if Oracle and HP also have something similar.


'Filesystems' in the sense of arbitrary data serialized to an arbitrary backing are absurdly stupid when said data has structure, as all data does. A semantic system that understands the data needs to be implemented at some point - why have n parsing implementations for a single data format? With a semantic 'object system', data becomes a part of programming.


How so? Could you explain in more detail?


A lot of what I've seen in the structures they use to keep track of things efficiently seems very similar. Filesystems add a bit of extra data compared to a GC (filenames, locations, etc.), but a number of the design constraints seem very, very similar. For example, keeping track of what's free and doing the allocation seem to have very similar solutions even given the different constraints. I think part of this might also be because some of the more modern filesystems are being designed with SSDs in mind, which end up even closer to normal memory. The way that (at least to a layman's understanding) btrfs handles COW looks very similar to how the new garbage collector is being designed for Lua. And the arenas look very much like the extents in ext4 (not sure about that in btrfs).
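As a minimal sketch of the shared idea (names and sizes invented for illustration), both a GC heap and a filesystem block allocator boil down to a free map plus a policy for picking runs out of it:

  # Minimal sketch: a free bitmap with first-fit run allocation, the kind
  # of bookkeeping both allocators and filesystems end up maintaining.
  free = [True] * 64                        # 64 blocks/cells, all free

  def alloc_run(n):
      run = 0
      for i, is_free in enumerate(free):
          run = run + 1 if is_free else 0
          if run == n:                      # found n contiguous free units
              start = i - n + 1
              for j in range(start, i + 1):
                  free[j] = False
              return start
      return None                           # caller must compact/GC or report ENOSPC

  def release(start, n):                    # freeing is just flipping bits back
      for j in range(start, start + n):
          free[j] = True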

This also makes me wonder if a design that takes caching of the drive into account could make some small improvements on things.

EDIT: added small snippet about the arenas.


Filesystems and memory allocation algorithms do have something in common. The problem of allocating memory and managing free space, etc., exists even without GC.

When hierarchical filesystems behave as pure trees, it's always possible to simply deallocate the storage of a deleted file, because the file cannot be pointed to by more than one entry in the tree.

Unix filesystems, however, do have a way of creating several references to the same file, namely hard links. In order to handle these multiple references (like having a program with multiple variables pointing to the same object) a kind of GC algorithm is used, which is simply a "reference counting GC".
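A stripped-down sketch of that reference counting, loosely modeled on an inode's link count (toy code, not how any real kernel does it):

  # link() bumps the count, unlink() drops it, and the storage is only
  # reclaimed when the last name goes away -- reference-counting GC.
  class Inode:
      def __init__(self, data):
          self.data = data
          self.nlink = 1                    # created with one directory entry

  def link(inode):
      inode.nlink += 1

  def unlink(inode):
      inode.nlink -= 1
      if inode.nlink == 0:
          inode.data = None                 # free the blocks

  f = Inode(b"contents")
  link(f)      # second hard link: nlink == 2
  unlink(f)    # remove one name: data still reachable
  unlink(f)    # remove the last name: storage reclaimed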

Reference counting is a simple technique which is appropriate in several cases but has its drawbacks. One of these drawbacks can be illustrated by this quote:

"One day a student came to Moon and said: “I understand how to make a better garbage collector. We must keep a reference count of the pointers to each cons.”

Moon patiently told the student the following story:

“One day a student came to Moon and said: ‘I understand how to make a better garbage collector... ”"


This drawback is why sensible UNIX filesystems don't allow you to create hardlinks to directories.
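To see the cycle problem concretely, here is a toy illustration in the same vein as the link-count sketch above: two directories hard-linking each other keep each other's counts above zero even when nothing else points at them, so a pure reference count never reclaims them.

  # Toy illustration of why directory hard links are forbidden.
  class DirEntry:
      def __init__(self):
          self.entries = {}
          self.nlink = 1

  a, b = DirEntry(), DirEntry()
  a.entries["b"] = b; b.nlink += 1          # a/b -> b
  b.entries["a"] = a; a.nlink += 1          # b/a -> a  (the forbidden link)
  # Dropping the outside references to a and b leaves both nlink at 1,
  # so the cycle is unreachable yet never freed.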



