The Garbage Collection Handbook, 2nd Edition (routledge.com)
387 points by zzxcvb on April 8, 2023 | 166 comments



On a related note, I found ART's implementation of concurrent copying and compaction GC to be pretty novel. Here's a nice write-up detailing how handling page faults in userspace comes in handy to accomplish that: https://www.tdcommons.org/cgi/viewcontent.cgi?article=4751&c... (pdf) / https://web.archive.org/web/20230216233459/https://www.tdcom...

For context, here's a brief overview of the evolution of the Android Runtime Garbage Collector: https://archive.is/Ue6Pj


Certainly interesting, but there are no performance numbers in the white paper comparing userfaultfd to read barriers. So the actual benefit of switching to userfaultfd is unknown (at least in publicly accessible documents -- I'm sure Google has done internal performance evaluations).

Using page faults (and/or page protection) to perform compaction instead of barriers is a pretty old technique (see the 1988 paper by Appel, Ellis, and Li [1]; see also the Compressor by Kermany and Petrank [2]). But handling page faults was very expensive (at least on Linux) until the recent addition of userfaultfd.

[1]: https://dl.acm.org/doi/10.1145/960116.53992

[2]: https://dl.acm.org/doi/10.1145/1133255.1134023


What is the actual perf benefit of userfaultfd (e.g. for write protection)? It sounds interesting but it's unclear to me why it would be any faster than a signal handler. Is it just a simpler code path within the kernel? Or is it that the hardware is configured to jump directly to a userspace handler without any kernel intervention?


Well, page protection is expensive, which is why the predominant way to implement concurrent copying/compaction until recently was using read barriers (and still is -- ART seems to be an exception). I will mention that concurrent compacting GCs wouldn't use userfaultfd for write protection, but for read protection (it effectively takes over the job of a read barrier).

I personally don't know too much about userfaultfd and how it works internally, but my best guess is that it bypasses heavyweight kernel data structures and lets the user application handle the page fault. It is obviously better than simply using mprotect, but it is not immediately clear why it would be better than a read barrier (other than code size considerations, which honestly don't sound like much of a big deal, as the code handling userfaultfd also needs to be brought into the instruction cache).

I did find this kernel doc about userfaultfd [1] which might be worth a read if you're interested (it does mention that userfaultfd doesn't lock some internal kernel data structures, which gives it better performance than simply using mprotect, but the implementation details are a bit sparse).

[1]: https://docs.kernel.org/admin-guide/mm/userfaultfd.html
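For the curious, the basic API usage is straightforward even if the internals aren't. Here's a minimal sketch following the kernel doc above and the userfaultfd(2) man page, not ART's actual code; a compacting GC would resolve the fault by copying live objects over from from-space instead of supplying a canned page:

    // Minimal userfaultfd round trip (Linux >= 4.3). Compile with -lpthread.
    #include <fcntl.h>
    #include <linux/userfaultfd.h>
    #include <poll.h>
    #include <pthread.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstring>

    static long page_size;

    static void *handler(void *arg) {
        int uffd = (int)(long)arg;
        void *patch = mmap(nullptr, page_size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        std::memset(patch, 'G', page_size);     // stand-in for evacuated objects
        for (;;) {
            pollfd pfd{uffd, POLLIN, 0};
            poll(&pfd, 1, -1);
            uffd_msg msg;
            if (read(uffd, &msg, sizeof msg) <= 0) continue;
            if (msg.event != UFFD_EVENT_PAGEFAULT) continue;
            uffdio_copy copy{};
            copy.dst = msg.arg.pagefault.address & ~(unsigned long)(page_size - 1);
            copy.src = (unsigned long)patch;
            copy.len = page_size;
            ioctl(uffd, UFFDIO_COPY, &copy);    // resolve the fault from userspace
        }
    }

    int main() {
        page_size = sysconf(_SC_PAGESIZE);
        int uffd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
        uffdio_api api{};
        api.api = UFFD_API;
        ioctl(uffd, UFFDIO_API, &api);          // feature handshake with the kernel

        char *region = (char *)mmap(nullptr, page_size, PROT_READ | PROT_WRITE,
                                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        uffdio_register reg{};
        reg.range.start = (unsigned long)region;
        reg.range.len = page_size;
        reg.mode = UFFDIO_REGISTER_MODE_MISSING;
        ioctl(uffd, UFFDIO_REGISTER, &reg);     // missing-page faults now go to uffd

        pthread_t t;
        pthread_create(&t, nullptr, handler, (void *)(long)uffd);
        std::printf("%c\n", region[42]);        // faults; handler supplies the page
        return 0;
    }

The point for a GC is that only the first access to a not-yet-populated page pays anything; once UFFDIO_COPY has filled it in, subsequent accesses are plain loads.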


20 years ago, in our university programming languages course, we started learning about OOP using Java and Ada as examples. When the professor started describing Java, a fellow student interrupted him to inform the class that "Java has garbage collection" (and to boast of his special knowledge).

After that incident his nickname was "the garbage collector"!


Coincidentally, Ada optionally supports garbage collection in its specifications but it's up to the runtime to implement it.


And since no implementation ever supported it, it was deprecated in ISO Ada 95 and then removed entirely in ISO Ada 2012.

https://en.wikibooks.org/wiki/Ada_Programming/Pragmas/Contro...


There was this project though.

https://github.com/Roldak/AGC

The best part is that it's faster than manual management. People will tell you they need to do malloc and free manually for performance, but when you actually run the numbers, GC wins for a majority of use cases.


Tracing garbage collectors don’t generally win against reference counting collectors, especially when those reference counts are automatically elided via ARC (eg Swift and Objective-C) or rarely used by means of composition (C++ and Rust). Additionally, different allocation strategies are better depending on the use case (eg a pool allocator that you bulk-drop at the end of some computation).

What papers are you referencing showing tracing GCs outperforming things? If it’s just the website, I think it’s an artifact of a microbenchmark rather than something that holds true for non-trivial programs.


I've often heard the claim that "Swift elides reference counts" but I've never seen it substantiated. I don't claim to be a Swift GC expert by any means, but the impression I get from the two Swift GC papers I've read [1, 2] is that Swift has a very simple implementation of RC. The RC optimization document (albeit incomplete) [3] also doesn't give me the impression that Swift is doing much eliding of reference counts (I'm sure it is doing it for simple cases).

Do you have any links which might explain what kind of eliding Swift is doing?

EDIT: The major RC optimizations I have seen which elide reference-count operations are deferral and coalescing, and I'm fairly certain that Swift is doing neither.

[1]: https://dl.acm.org/doi/abs/10.1145/3170472.3133843

[2]: https://doi.org/10.1145/3243176.3243195

[3]: https://github.com/apple/swift/blob/main/docs/ARCOptimizatio...


Swift's compiler elides obviously useless RC operations at compile time. It doesn't do anything at runtime though.


That is correct. Like Objective-C, the compiler statically removes ARC operations when it can prove they aren’t needed (or when you annotate it, because whatever you’re getting the object from is managing memory manually and giving you ownership of the reference). So it works within a function, or across functions with LTO (although it looks like Swift doesn’t yet have LTO [1], so I’m not sure about cross-module optimizations).

> When receiving a return result from such a function or method, ARC releases the value at the end of the full-expression it is contained within, subject to the usual optimizations for local values.

Quote from [2] (section 6 has the full details about optimizations). I believe [3] might be the compiler pass.

There actually is some elision that happens at runtime if you install an autoreleasepool if I recall correctly.

I did actually work at Apple so that’s where my recollection comes from although it’s been 8 years since then and I didn’t work on the compiler side of things so my memory could be faulty.

[1] https://github.com/apple/swift/pull/32233

[2] https://opensource.apple.com/source/lldb/lldb-112/llvm/tools...

[3] https://llvm.org/doxygen/group__ARCOpt.html


> So it works within a function, or across functions with LTO (although it looks like Swift doesn’t yet have LTO [1], so I’m not sure about cross-module optimizations).

I'm not sure Swift supports "static library modules" so in practice any linked object consists of a single module.

> There actually is some elision that happens at runtime if you install an autoreleasepool if I recall correctly.

There is a trick in the ObjC ABI that elides autorelease-returns, but it's deterministic after compilation time so I wouldn't call it a runtime optimization.


Correct. But it’ll do it within functions in the same module.

> but it's deterministic after compilation time so I wouldn't call it a runtime optimization.

What do you mean? My understanding is that autoreleasepool is 100% at runtime. The compiler is not involved afaik except to know to register autorelease with the currently installed pool.


> What do you mean? My understanding is that autoreleasepool is 100% at runtime. The compiler is not involved afaik except to know to register autorelease with the currently installed pool.

The compiler emits calls that always put something in the autorelease pool or always don't; there are no smart decisions at runtime that skip it or make the ordering of releases nondeterministic. A garbage collector runs whenever it feels like it, and so the ordering of finalizations changes.


Oh sure. You have to explicitly indicate which resources should go to the autoreleasepool. It’s not as magic as ARC. It’s also generally fallen out of favor at Apple as far as I could tell.


Thanks for the links!


The conventional wisdom is that evacuating (copying) GCs win over malloc/free since 1) the GC touches only the live data and not the garbage, and 2) it compacts the live memory periodically, which improves its cache and (when relevant) paging hit rates.

Obviously though, this will be situation dependent.


Then why does every performant managed language opt for a tracing GC when it can?

RC is used in lower level languages because it doesn’t require runtime support, and can be implemented as a library.

As I wrote in another comment, even with elisions, you are still trading off constant writes on the working thread for parallel work, and you even have to pay for synchronization in parallel contexts.


Because tracing GCs can collect reference cycles, which RC can’t. So at the language level, where you have to handle all sorts of programs written by programmers of varying quality (+ mistakes), a tracing GC gives more predictable memory usage and performance across a broader range of programs.

Seriously. A single-threaded reference counter is super cheap. Cross-thread reference counts shouldn’t be used and I think are an anti-pattern - it’s better to have the owning thread be responsible for maintaining the reference count and to pass a borrow via IPC that the borrower has to hand back. There is also hybrid RC, where you Arc across threads but use RC within a thread. This gives you the best of both worlds with minimal cost. Which model you prefer is probably a matter of taste.

CPUs are stupid fast at incrementing and decrementing a counter. Additionally most allocations should be done on the stack with a small amount done on the heap that is larger / needs to outlive the current scope. I’ve written all sorts of performance-critical programs (including games) and never once has shared_ptr in C++ (which is atomic) popped up in the profiler because the vast majority of allocations are on stack, value composition, or unique_ptr (ie no GC of any kind needed).

The fastest kind of GC is the one you don’t need at all (ie Box / unique_ptr). The second fastest is an increment of a counter that’s likely already in your CPU cache. I don’t think anyone can claim that pointer chasing is “fast”, and certainly not faster than ARC. Again, that's assuming you’re not being careless and throwing ARC around everywhere when it’s not needed in the first place. Value composition is much more powerful; leave RC / Arc for when you have a more complicated object graph with shared ownership (and even then, try to give ownership to the root uniquely or through RC, and hand out references only to children and RC to peers).
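To make the cost model concrete, the hybrid split is tiny to sketch (hypothetical names; real implementations also handle assignment, moves, and thread-affinity checks):

    #include <atomic>

    template <typename Count>
    class Handle {
    public:
        explicit Handle(int v) : ctrl_(new Ctrl{1, v}) {}
        Handle(const Handle &o) : ctrl_(o.ctrl_) { ++ctrl_->refs; }  // the RC op
        Handle &operator=(const Handle &) = delete;  // assignment elided for brevity
        ~Handle() { if (--ctrl_->refs == 0) delete ctrl_; }
        int value() const { return ctrl_->value; }
    private:
        struct Ctrl { Count refs; int value; };
        Ctrl *ctrl_;
    };

    using Rc  = Handle<int>;               // in-thread: a plain integer bump
    using Arc = Handle<std::atomic<int>>;  // cross-thread: pays for the atomics

Rust encodes the same split in its type system (Rc<T> vs Arc<T>) and refuses at compile time to let an Rc cross a thread boundary.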


Obviously the 3 objects you allocate with shared_ptr in C++ won’t be a performance bottleneck, but then we are not comparing apples to apples.

Your single-threaded RC will still have to write back to memory; no one thinks that incrementing an integer is the slow part — destroying cache is.


Even when I had this in a hot path 10 years ago, owning hundreds of objects in a particle filter, handing out ownership copies and creating new ones ended up taking ~5% (i.e. that's what was saved by making it a contiguous vector without any shared_ptr). It can be expensive, but in those cases you probably shouldn’t be using shared_ptr.

Oh, and the cost of incrementing an integer by itself (non atomically) is stupid fast. Like you can do a billion of them per second. The CPU doesn’t actually write that immediately to RAM and you’re not putting a huge amount of extra cache pressure vs all the other things your program is doing normally.


> Your single threaded RC will still have to write back to memory

I think you mean memory or cache, and there's a good chance it will remain in cache and not be flushed to RAM for short-lived objects.

> no one thinks that incrementing an integer is the slow part — destroying cache is.

agreed


If you write to cache, then depending on architecture that change has to be made visible to every other thread as well. Reading is not subject to such a constraint.


MESI cache coherency (and its derivatives) [1] means that you can have exclusive writes to cache if and only if no other core tries to access that data. I would think most if not all microarchitectures have moved to MESI (or equivalent) cache coherency protocols as they avoid unnecessary writes to memory.

[1]: https://en.wikipedia.org/wiki/MESI


Sure, but for shared objects you can’t allow data races.


Objects shared across threads. Most don’t need to be.


I don't understand the relevance of this.


Only for atomics. Single-threaded RC counts are fine and WebKit uses hybrid RC where you use an atomic shared ptr to share across threads and then you downgrade it to a non atomic version in-thread (and can hand out another atomic copy at any time).

Atomics are rarely needed as you should really try to avoid sharing ownership across threads and instead change your design to avoid that if you can.


>> If you write to cache, then depending on architecture that change has to be made visible to every other thread as well

> Only for atomics

I don't think cache coherency is aware of threads; they live at a level above it (IANAExpert though)

> where you use an atomic shared ptr to share across threads and then you downgrade it to a non atomic version in-thread (and can hand out another atomic copy at any time).

your idea of GC is very different from others', you are happy to do a ton of manual stuff. GC is generally about not doing a ton of manual stuff.


So fast that Apple Silicon introduced specific memory instructions to handle ARC counters.


It did not do this, standard atomics are relatively fast because it's a unified memory system.


ARM also has an option for weaker consistency atomics which help with speed (and ARC does take advantage of this but not something Apple sped up specifically extra afaik in silicon)


Because they don’t care as much about working set size as Apple does.


Sure, it is a tradeoff, as is basically every other technical choice.

But we were talking about performance here, and especially in throughput, tracing GCs are much better.


Proof about throughput? Stating something as fact isn’t quite as strong as showing evidence. Also, performance has multiple axes: throughput, latency, memory usage, etc. Java eating all your memory and spending a non-trivial amount of overall CPU on a tracing collector, which can negatively impact latency, isn’t necessarily what people want efficiency-wise; smearing that over total execution time tends to give better results.

The main advantage from a tracing GC is that you have an easier programming model (which isn’t a trivial trade off) but the downside is that you don’t have deterministic destruction which can make certain programming paradigms difficult (yes Java introduced alternate mechanisms to get you there if you care but it feels like a 100% bolt on rather than something integrated more neatly and elegantly and most developers don’t actually care).


I will say that you seem to be stating "smearing that over total execution time tends to give better results" without proof as well. I certainly don't think using atomic operations everywhere and converting every pointer read operation into a write operation is efficient. Yes, RC has smaller minimum heap sizes than tracing GC, but garbage collection is fundamentally a time-space tradeoff, and RC is highly optimized for memory efficiency. Also, most naive RC implementations don't copy objects, so you get heap fragmentation.

Comparing naive RC to tracing GC is non-trivial. In order to have a fair comparison, you'd have to implement naive RC and tracing GC in the same system and then compare them across a set of benchmarks. I personally have never come across a great performance study which compared naive RC to tracing GC.


As I’ve stated repeatedly, RC is super cheap. Seriously. You can do several billion of them per second. Your malloc/free call is going to be more expensive.

Sure. Atomic counters are relatively expensive. But in good designs there are very few of them. And they easily show up in hotspots if they’re a problem, and then you fix your object model. The problem with tracing GC is that you have no way to fix it. Most languages that use RC actually avoid most memory allocations/frees by leveraging value composition instead of referential ownership.

I did actually provide proof, by the way. Apple’s phones use half the RAM of Android phones and are at least as fast, even discounting better HW. Similarly, any big hyperscaler is unlikely to be using Java for their core performance-critical infrastructure. To me those are pretty clear performance advantages.

> I certainly don't think using atomic operations everywhere and converting every pointer read operation into a write operation is efficient.

I’m unaware of anyone using RC properly doing this. You only do this when you need to share ownership, and you should be doing it exceedingly sparingly. Unique ownership and referential sharing are by far the most common. If you’re passing RC into a function that doesn’t retain ownership beyond its call scope, you’re not using RC properly.

> Also most naive RC implementations don't copy objects so you get heap fragmentation.

That’s only kind of true. Good allocators seem to mitigate this problem quite effectively (glibc’s is notably not good at this compared with mimalloc and the new tcmalloc). But sure, that is kind of a problem. It can be mitigated, though, by optimizing your allocation patterns once you know that’s the problem (noticing it is by far the hardest bit). And because it’s more manual, you can customize your allocator.

Look. I’m not disagreeing that the developer benefits are significant. I’m just saying that good memory management (made fairly easy in Rust) is always going to outperform tracing GC the same way optimized assembly will outperform the compiler. It’s possible that a tracing GC can provide better performance out of the gate with minimal optimization, versus needing to spend more time optimizing your allocation strategies if you make design mistakes. But remember: a tracing garbage collector still needs to do atomic reads of data structures, which potentially requires cross-CPU shootdowns. And you still generally need to stop the world (I think that may not be true in some advanced Java GC algorithms, but that’s the exception rather than the rule, and you trade off even lower throughput). And I can’t belabor this point enough - languages with RC use it rarely, as shared ownership is rarely needed and you can typically minimize where you use it.

Let me give you a real world example. When I was working on indoor positioning in iOS, I ported our original Java codebase verbatim, using shared_ptr for almost every Java allocation where I might even potentially be sharing ownership, as I wanted to start with safety and optimize later. Not only was the initial version faster than the equivalent Java code (not surprising, since C++ will outperform due to no auto-boxing, at least at the time); when I got rid of shared_ptr in the particle filter, which is a core hot path, it only showed a 5-10% improvement in perf (using Accelerate for the core linear algebra code was way more impactful). The vast majority of that actually came from the fact that all the particles were now living contiguously within the vector, rather than from the overhead of the atomic counting. Just saying. People really overestimate the cost of RC because GC is rarely needed in the first place / when it is, ownership shouldn’t be being modified in your hot path. When I worked on Link at Oculus, we used shared_ptr liberally in places because, again, ownership is actually rarely shared - most allocations are unique_ptr.

Edit: note that I’m explicitly distinguishing RC (single-threaded) from ARC (atomic multi-threaded RC). Confusingly, ARC stands for Automatic RC in Apple land, although it’s atomic there too. Automatic RC is trickier, but again, as Swift and ObjC demonstrate, it’s generally good enough without any serious performance implications (throughput or otherwise).


> As I’ve stated repeatedly, RC is super cheap. Seriously. You can do several billion of them per second. Your malloc/free call is going to be more expensive.

But that's the whole point. Tracing GC doesn't do malloc/free, and that's where the performance advantages come from. Instead of a complex allocator that has to do lots of work on every free() as well, you get a bump-pointer allocator and move some complexity over to the tracing thread.

And this is especially true if you tend to have large linked data structures.
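To be concrete about what the fast path buys you: the entire allocator is a pointer bump and a bounds check. A sketch with made-up names; a real GC would trigger a collection on the slow path and reset cur_ after evacuating survivors:

    #include <cstddef>

    class BumpAllocator {
    public:
        BumpAllocator(char *heap, std::size_t size)
            : cur_(heap), end_(heap + size) {}

        void *alloc(std::size_t bytes) {
            bytes = (bytes + 15) & ~std::size_t{15};  // keep 16-byte alignment
            if (cur_ + bytes > end_) return nullptr;  // slow path: collect here
            void *p = cur_;
            cur_ += bytes;                            // that's the whole "malloc"
            return p;
        }

    private:
        char *cur_;
        char *end_;
    };

No free(), no free lists, no size classes; the deallocation work moves into the collector's evacuation phase.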


With all due respect, you are talking about everything except an objective comparison where the choice between RC and tracing GC was the deciding factor.

> RC is super cheap. Seriously. You can do several billion of them per second. Your malloc/free call is going to be more expensive.

Yes, and it is completely irrelevant. Good tracing GCs can use a thread-local bump allocator, and can even defragment the heap automatically later.

> Sure. Atomic counters are relatively expensive. But in good designs there are very few of them. And they easily show up in hotspots

That’s just false, unless you are using something like cachegrind: any sane compiler will inline the 2-3 instructions of a counter increment/decrement. It is the stereotypical “death by a thousand cuts”, never showing up in profiling.

> Most languages that use RC actually avoid most memory allocations/frees by leveraging value composition instead of referential ownership.

I’m sorry, but this has absolutely nothing to do with the topic here; there are plenty of tracing-GCd languages that can do that as well: D, Go, C#, Nim, just off the top of my head. That’s a completely different axis.

> I did actually provide proof, by the way. Apple’s phones use half the RAM of Android phones and are at least as fast, even discounting better HW

That only proves that RC uses less memory, which, as has been pointed out in the thread many times, is one of its few positives — it does make sense to use on a mobile phone. But then you finish off with “big hyperscaler is unlikely to be using Java for their core performance-critical infrastructure”, which is just bad logic, and straight up false. Half of the internet literally runs on Java, with heap sizes going up to the terabyte range. Alibaba, many parts of Google, literally all of Apple’s backend systems, Twitter, whole cloud providers(!) all run on the JVM.

> You only do this when you need to share ownership

That’s like.. the point of garbage collection algorithms? Otherwise why don’t you just randomly increment a counter here and there?!

> Not only was the initial version faster than the equivalent Java code (not surprising since c++ will outperform due to no auto boxing, at least at the time

You literally give the reason why it was faster — you are comparing a sequentially laid out data structure to one of pointers. The effect size of that will trump any concern about which GC algorithm is used, so your point doesn’t apply here at all.


> RC is super cheap. Seriously. You can do several billion of them per second.

Right. You bump-allocate faster, however. RC is an additional operation on top of the allocation request. Given that naive RC can't move objects, it necessarily needs a free-list allocator. Free-list allocators can allocate pretty fast (for the most common object sizes), but can't reach the speeds of bump allocators. Furthermore, bump allocators have better locality of reference than free-list allocators.

Also, I've never questioned that you can do non-atomic increments/decrements efficiently. They are exceptionally fast. However, you are still converting every pointer read operation into a pointer write operation. The rule of thumb is that there are generally 10x more pointer reads in a program than pointer writes. This is why read barriers are also considered more costly than write barriers.

> I did actually provide proof by the way. Apple’s phones use half the RAM as Android and are at least as equally fast even if you discount better HW.

I don't really agree with this statement. There are way too many unknown/unaccounted-for variables. It could be better hardware, maybe Android has a terrible architecture, or it could, just as you said, be that Swift RC is genuinely better than the ART GC, or it can be whatever. The point is that it's not a scientific comparison. We don't know if the benefits in iOS come from RC. And we won't be able to know unless we have two systems where all parameters are exactly the same _except_ one is using RC and the other a tracing GC, with both systems run on the same hardware and on the same set of benchmarks.

> That’s only kind of true. Good allocators seem to mitigate this problem quite effectively

Yes. Note that mimalloc literally has a concept of "deferred frees" effectively emulating GC as otherwise freeing an object can result in an unbounded recursive free call (for example, dropping a large linked list).

> I’m just saying that good memory management (made fairly easy in Rust) is always going to outperform tracing GC the same way optimized assembly will outperform the compiler.

Sure. The perfect memory manager is omniscient, knows exactly when an object is no longer required, and frees it then. But unfortunately we don't have perfect memory managers yet. So yes, I agree in principle, but we have to be pragmatic.

> And I can’t belabor this point enough - languages with RC use it rarely as shared ownership is rarely needed and you can typically minimize where you use it.

I feel like that entirely depends on the problem domain? I don't know where you're getting the "shared ownership is rarely needed" from? Maybe this is true for your problem domain but may not be for others. And if it's true for yours, then great! You can use RC and optimize your programs that way.

> A tracing garbage collector still needs to do atomic reads of data structures which potentially requires cross cpu shoot downs.

Sure. GC metadata needs to be accessed and updated atomically. 100% agree with you. The order of magnitude of those operations is likely much less than what you would get with naive RC though.

> as Swift and ObjC demonstrate, it’s generally good enough without any serious performance implications (throughput or otherwise).

In one of my comments about Swift in this thread, the two papers I linked see up to 80% of the benchmark execution time being dominated by ARC operations! I will note that the papers are around 6-7 years old so the situation might have drastically changed since then, but I haven't personally found many new/contemporary evaluations of Swift RC.


This isn't fully true. Java is in the process of getting LXR, which uses both tracing and reference counting.


Yes and no. LXR is a highly optimized deferred and coalesced reference counting (amongst many other optimizations) GC and it looks nothing like the RC implementations you see in other languages like Python, Nim, Swift, etc. So yes, reference counting can be performant, _but_ only if you are using a deferred and coalesced implementation as otherwise the simple implementation requires atomic operations for every pointer read/write operation (though compilers could optimize it to use non-atomic operations when it's sure semantics are preserved).

EDIT: The post you're responding to is referring to the simple standard implementation of RC.


Rust doesn’t use multi threaded Rc unless you actually need to send it across threads.


I'm aware. Rust isn't a conventional "managed language" which is why I consciously omitted it from the above list. In an increasingly multi-threaded world, you just have to use the synchronized version of RC (aka Arc<T> in Rust).


Swift and Objective-C ARC performance is quite poor.


Compared to what, though? And is that still the case if all OS components use whatever it is, as opposed to a few applications? Memory efficiency is crucial for overall system performance, and ARC is highly memory-efficient compared to every production GC I’m aware of.


Lower peak memory is not the only way ARC is memory-efficient, though; the main win is better memory-access behavior, because it doesn't have to scan for pointers. Server GCs assume all memory containing pointers is cheap to access, which is not true if some of it is swapped out.



iOS ships phones of the same speed as Android's with half the RAM. So I’d say there’s a real comparison of ARC vs tracing GC.


And Android rules the mobile world with 80% market share.

Also iOS applications tend to crash due to memory leaks or not enough memory being available.

So yeah a real comparison.


iPhones are 3 generations ahead in terms of single-core CPU performance, so that’s just a biased take.


How is that possible? Sounds counter-intuitive.

Do you have a recommendation for reading?


> and further removed from it on ISO Ada 2012.

and in that precise moment Ada proved it had garbage collection all along


By that measure same applies to C++, as C++20 removed the GC API introduced in C++11.


There was an Ada implementation on the Lisp Machine, that might have been using GC.


Same as with .NET and JVM implementations, but that is a consequence of the underlying platform, just like C++/CLI and C++/CX, and none of them have been that big in the Ada compiler market.


Which means no library can depend on it, pushing memory management responsibilities into calling code.


Ah, same as C++ it sounds like.


Plot Twist:

20 years later, he wrote "The Garbage Collection Handbook" and is the leading authority on garbage collection.


Stop-the-world indeed


Met Richard Jones, one of the authors of the book, at its original launch. Very nice guy. Bought the original book and did some heavy reading on it – if you get this book, be prepared to put time into it, it's readable but it's not one you can do a quick skim of – and I really profited from it. It should be required reading for all programmers.

(To be clear, I'm referring to the very first version, I haven't read the subsequent versions but given the quality of the first I'd be very surprised if they were any less good).

Edit: https://www.cs.kent.ac.uk/people/staff/rej/ - a jumpoff page for his stuff.

Edit2: https://www.cs.kent.ac.uk/people/staff/rej/gcbib/ - "[This] online bibliographic database includes nearly 3,000 garbage collection-related publications. It contains abstracts for some entries and URLs or DOIs for most of the electronically available ones, and is continually being updated. The database can be searched online or downloaded as BibTeX, PostScript, or PDF." Welcome to the ultimate rabbit hole I guess.


I don't know if it is included in the new edition of the book, but in case anyone is interested in a modern, highly efficient RC implementation that does not rely on deferring the reference count updates (which kills one of the advantages of RC in the first place), check out the Perseus paper. Koka (which uses Perseus) is quite competitive with OCaml and Haskell.

Just search for "perseus reference counting", you'll find it. It uses linear logic to insert explicit "dup/drop" operations and then merges and coalesces them.


Is there any particular reason you didn't link to the paper itself (https://www.microsoft.com/en-us/research/uploads/prod/2020/1...) or use Microsoft's scuffed version of DOI (MSR-TR-2020-42)? The paper appears to be freely available, so there shouldn't be copyright issues with linking to it...


Just that I'm on mobile and am lazy. Thanks for posting it!


Apart from the paper are there independent verifications of the technique's performance?


The Roc language uses it too and shows similar performance. It's not magic though, it just takes advantage of RC for various optimizations (like "Functional But In Place") which add up and compensate for the added burden of ref count manipulation.


I found "A Unified Theory of Garbage Collection" helpful: https://dl.acm.org/doi/10.1145/1035292.1028982


If you need code to understand garbage collection, the walkthrough of a garbage collector in C at http://maplant.com/gc.html is really helpful.

I tweaked it to work on amd64 and started adding register scanning based on what eatonphil's discord people told me to do.

https://github.com/samsquire/garbage-collector

It's not fit for any purpose but more of a learning exercise.


Bob Nystrom (of Game Programming Patterns, Crafting Interpreters, and dartfmt fame) also wrote a tutorial implementation[1], of a precise tracing GC as opposed to a conservative one.

Regarding register scanning in a conservative GC, Andreas Kling has made (or at least quoted) the amusing observation[2] that your C runtime already has a primitive to dump all callee-save registers to memory: setjmp(). So all you have to do to scan both registers and stack is to put a jmp_buf onto the stack, setjmp() to it, then scan the stack normally starting from its address.

[1] https://journal.stuffwithstuff.com/2013/12/08/babys-first-ga...

[2] https://youtu.be/IzB6iTeo8kk
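Spelled out, the whole trick is a handful of lines (a sketch; stack_top and scan_word are hypothetical hooks the GC would provide, and it assumes the usual downward-growing stack):

    #include <setjmp.h>
    #include <stdint.h>

    extern char *stack_top;              // highest stack address, recorded at startup
    void scan_word(uintptr_t word);      // conservative "might this be a pointer?" hook

    void scan_stack_and_registers(void) {
        jmp_buf regs;                    // lives in this frame, i.e. on the stack
        setjmp(regs);                    // spills callee-saved registers into it
        for (uintptr_t *p = (uintptr_t *)&regs; (char *)p < stack_top; ++p)
            scan_word(*p);               // treat every word as a potential pointer
    }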


Glibc mangles some pointer registers in setjmp(). It XORs a per-process value with the stack pointer, frame pointer, and instruction pointer stored in the jmp_buf, on all (or nearly all) architectures.

Although losing the stack and instruction pointers is unlikely to be a problem for the GC context, the frame pointer register need not contain a frame pointer value. It can be an arbitrary program value depending on compile options. That's something to watch out for with this GC technique.


You’re right, and I shouldn’t have dismissed the PTR_MANGLE business so easily when I looked at the source[1]. In hindsight, the __ILP32__ (i.e. x32) special case for the high part of %rbp on x86-64 looks awfully suspicious even if you don’t know the details.

Given that __attribute__((optimize("no-omit-frame-pointer"))) doesn’t seem to get GCC to save the parent frame pointer on the stack reliably, while Clang doesn’t understand that attribute (or #pragma GCC optimize(...)) at all, this now looks less slick than it initially seemed.

... Have I mentioned that I dislike hardening techniques?

[1] https://elixir.bootlin.com/glibc/glibc-2.37/source/sysdeps/x...


Implementations are unfortunately allowed to do whatever they want to that jmp_buf, they could xor the contents for all you know. Hopefully no implementation does something silly like that.


This seems like a reasonable environmental assumption if you’re already scanning the stack conservatively. I’d be more worried about pointer authentication (AArch64), pointer encryption (Glibc) or perhaps register windows (SPARC, Itanium). Still, as a cheap trick for avoiding assembly it seems to work well enough in non-exotic situations.


Seems like a reasonable thing to do to avoid forging jmp_bufs.


Side topic, but the maplant website design is great.


Besides the code blocks, it has 0 CSS. So what you're actually complimenting is your browser's default styles for HTML.


What I really want out of a garbage collector is a "Collect" function with a deadline. Pick a max time it's allowed to run before stopping and returning to the program.


Real time GCs exist such as the IBM Metronome GC. Though I'll be honest and say I haven't heard of many real-time GCs other than the Metronome one. Certainly many new GCs have reduced pause times dramatically but that's orthogonal to real-time GC (as you can make pause times infinitesimally small but not let the mutator actually progress).


Here’s a fairly extensive review that mentions some recent research on “real-time” GCs, including one with a lower bound for mutator utilization: https://eschew.wordpress.com/2016/09/02/summarizing-gc/.


See PTC and Aicas.


I've achieved GC timing that is good enough for real-time competitive game hosting using .NET6+. Worst case over 4 hours of load testing was an 8ms spike. Average 1-2ms.

The magic trick is to intentionally collect as often as reasonably possible (i.e. at batch/frame/tick processing boundaries) and avoid using sophisticated GC schemes that involve multiple threads or asynchrony.

Oh, and obviously you need to minimize allocations throughout or it won't matter.


Nim has exactly that with the GC_step proc.

https://nim-lang.org/1.4.0/gc.html

However recent and future versions (2.0) are moving towards a different approach that is also applicable for deterministic real time systems: ARC, which is basically alloc and free calls inserted automatically by the compiler using static analysis (no "runtime").


That is necessary but often not sufficient. It still leaves the problem of moving time from where garbage is created to the bulk 'collect' step. I've run collect every frame to keep stalls down, but the time always creeps up and up and you end up doing forensic analysis of which code is making the garbage in the first place.


Not really possible unless you have a real-time OS. Virtual memory and concurrency means there is really no way to bound how long anything will take.


You may have absolutely zero pauses if you don't have heap compaction.


Isn’t that just incremental garbage collection?


Great that there's a new edition of this coming. This is the definitive reference if you work with garbage collectors, it's also very well written.


Note that (the first edition of) this book considers reference counting, including variations like deferred reference counting, to be garbage collection (though not tracing garbage collection), and reviews it as well.


Just like most relevant computer science sources do; only Joe and Jane on the street without a CS background think otherwise.


Anyone know what's new in this edition? Both the publisher's site and the author's site are light on details.


Great that it is getting an update; it is one of the most relevant sources of information on GC algorithms.


Wow, this really is a new edition that will be available in 2023. The old (2012) edition is excellent. I don't have the impression that a whole lot has changed since then, but an update with all the latest refinements has to be worthwhile to implementers.


Who would usually need this book?


The first edition is a definitive reference for anyone writing a garbage collector (and by extension, exploring writing a programming language). Also people who are interested in how GCs work.

Probably less so for people trying to optimize a program to work with an existing GC (eg. tweaking the JVM), but I suppose knowing the basic principles can help.


I’ve written two garbage collectors in my time, neither of which were in a programming language.

One was for an in memory cache of data relationships. Another was to clean up soft references with an RDF graph. Neither were, nor needed to be, particularly sophisticated.

The cache was a compacting collector, the RDF one was mostly a “connectedness” test, pruning those nodes fallen from the graph.

Recall that malloc has a simple garbage collector for its free space, and arguably the level of sophistication that ranks a modern malloc implementation is how it manages its free space.

In the end detritus must be identified and resources reclaimed. So you see how GC like systems can occur in divergent areas of work.


This book (the 1st edition) gave me exactly what I needed when writing the garbage collector for my programming language.

Besides the great technical content, I found it to be a very enjoyable, readable book.


I read the mark-and-sweep GC parts to gain some deeper understanding about Go's garbage collector. Probably the best resource for that kind of stuff other than reading actual code.


I feel the need for garbage collection is a language design mis-feature. That is to say, producing garbage is a language design-mis-feature. To quote Bjarne Stroustrup:

> I don't like garbage. I don't like littering. My ideal is to eliminate the need for a garbage collector by not producing any garbage. That is now possible.

and it's indeed possible. For example, it's become pretty much a non-issue in modern C++: https://stackoverflow.com/a/48046118/1593077 (and C++ is not the only example, it's just a prominent example of a language which almost standardized garbage collection but eventually did not go that way).
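Concretely, "not producing garbage" means scope-bound ownership like this (my own sketch, not taken from the linked answer):

    #include <memory>
    #include <string>
    #include <vector>

    struct Node {
        std::string name;
        std::vector<std::unique_ptr<Node>> children;  // parent uniquely owns kids
    };

    void build_and_discard() {
        auto root = std::make_unique<Node>();
        root->name = "root";
        root->children.push_back(std::make_unique<Node>());
    }   // the whole tree is freed right here, depth-first, with no collector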


C++ can actually produce quite a bit of garbage unintentionally, it's why linters will remind you often to call std::move.

That said I much prefer deterministic resource cleanup even in a janky language like C++ over a tracing GC.
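An example of what those linters flag, for the curious (a minimal sketch):

    #include <string>
    #include <utility>
    #include <vector>

    void keep(std::vector<std::string> &out, std::string s) {
        // out.push_back(s);          // s is an lvalue here, so this copies
        out.push_back(std::move(s));  // what the linter wants: move, don't copy
    }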


It's true that C++ can trigger a lot of unnecessary copying if one writes carelessly; but that's not the same as garbage, in that both copies are used, and none of them continue living indefinitely without special intervention. But point taken about pass-by-value deficiencies.


Modern C++ has reference counting (“smart pointers”) and memory leaks instead, so YMMV.


“And memory leaks instead” I choked on my morning coffee hahahahah


I wouldn't say it's a non issue. I frequently have to tune allocators to fix heap fragmentation in databases...


That's acknowledged in the last line of the post I linked to, actually.


Oh sorry:)


You can’t avoid garbage if you deal with generic programs — Rust and C++ can only implement a subset of all programs without RC (which is the most basic GC algorithm).

This is the same as with many other interesting CS properties — most are undecidable at compile time, so you have to handle them at runtime.


More and more people nowadays are programming at high levels of abstraction. If you're designing the frontend of a website, or making a mobile game, or developing a stock trading algorithm, or whatever else, then you probably don't want to constantly worry about details of memory management...

On top of this, GC is necessary for some algorithms. Any data structure with partial sharing (e.g. binary search tree with versioning via path-copying) needs GC to be space-efficient. You could either rely on a built-in GC, or write your own. If you write your own, I think you'll find that it is tedious and error-prone due to memory safety issues.
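To illustrate with the path-copying example: each insert copies only the root-to-leaf path and shares everything else, so old and new versions alias arbitrary subtrees, and something has to notice when a node's last version dies. A sketch (here shared_ptr, i.e. RC, suffices as the "GC" because trees are acyclic):

    #include <memory>

    struct Tree {
        int key;
        std::shared_ptr<const Tree> left, right;
    };

    using P = std::shared_ptr<const Tree>;

    P insert(const P &t, int k) {
        if (!t) return std::make_shared<const Tree>(Tree{k, nullptr, nullptr});
        if (k < t->key)   // copy this node, share the untouched subtree
            return std::make_shared<const Tree>(Tree{t->key, insert(t->left, k), t->right});
        if (k > t->key)
            return std::make_shared<const Tree>(Tree{t->key, t->left, insert(t->right, k)});
        return t;         // key already present: share the whole tree
    }

    // v1 and v2 coexist, sharing all untouched subtrees:
    //   P v1 = insert(nullptr, 2);
    //   P v2 = insert(insert(v1, 1), 3);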


We could already be doing that during the 1980's had it not been for UNIX getting out of Bell Labs with source tapes, and bad management decisions.

https://interlisp.org/

http://toastytech.com/guis/cedar.html

"Eric Bier Demonstrates Cedar"

https://www.youtube.com/watch?v=z_dt7NG38V4&t=2s

"Making Smalltalk"

https://www.youtube.com/watch?v=PaOMiNku1_M

http://www.edm2.com/index.php/VisualAge_Smalltalk

It is also worth noting that most BASIC implementations had support for automatic memory management, at least for strings and arrays, and the structured compiled dialects did even better.

Also database programming with languages like Clipper and FoxPro.

And even if we stay in the UNIX world, that is exactly using stuff like Perl also allowed for, C like programming without the headaches of manual memory management.

Or the brief fad of 4GL languages.


> then you probably don't want to constantly worry about details of memory management...

Oh, you actually don't have to, that's the whole point... in the past, you (effectively) had a choice between careful manual management of memory and garbage collection with its overheads. These days, you can use constructs which take care of that management for you, with very little or no overhead, which don't involve garbage.

It's true that sometimes GC is algorithmically necessary; but then the high-level-of-abstraction argument is irrelevant. And in those cases, a GC library does indeed come in useful. You don't need to write your own, others have likely done it already.


Cool stuff! The last 10 years have seen lots of great languages pushing the field so I'm excited to see what's in the book =)


Bizarre that you can't preorder it yet; you have to wait until June just to preorder.


As a guess, they might want the accounting for preorders to fall into Q3? Is that a thing people do?


Maybe there is gamesmanship around best seller lists?


I pre-ordered it from the linked website. Maybe it's only available for certain regions?


It's available on Amazon to pre-order: https://www.amazon.com/dp/1032218037


I managed to pre-order it on Amazon UK.


Will there be a PDF / epub option?


If I'm not mistaken, Routledge uses that VitalSource dreck, so you can only read their ebooks via a dedicated app. I don't buy dead-tree books any more, but I'll make an exception for this one.


The previous edition of this book is available on e.g. Apple's bookstore and Kindle. Unfortunately, the new one does not show an eBook option yet. I very much hope that one will be available soon, instead of them holding back the eBook to goose hardcover sales or some other reason.


Looks like their ebooks are available through a number of sources:

https://www.routledge.com/our-products/ebooks


[flagged]


Who is Robert Sewell and why are his preferences interesting?


1. It’s a joke

2. Many great programmers hold the opinion that Java is horrible. Linus for example. So if Sewell doesn’t seem credible, try the creator of Linux and git.


Linus's thoughts on programming languages aren't particularly enlightening either ;)


I bet Linus does not get much outside C.


> Many great programmers hold the opinion that Java is horrible.

Could you name a few?


Linus, Dijkstra, Stroustrup, to name a few.

Time has shown OOP is not good.


> Linus, Dijkstra

I wouldn't take Linus' hyperbole at face value, given that he is a guy who has consistently refused to use anything but C or assembly.

Dijkstra, on the other hand, argued that Haskell should be taught in college instead of Java. Yeah, let's go with that.

> Stroustrup

> Time has shown OOP is not good.

So, the creator of C++ argues that OOP is "not good"?

OOP is ubiquitous, and a fine tool. Plainly saying that it is "not good" is like saying nothing.


Then it’s great that Java is a general purpose programming language with plenty of ways to do FP as well.

Also, being good at some subset of CS doesn’t mean they are good at language design.


I think it is a joke and at least to me it is funny, without knowing who that guy is or was.


Someone who is right regarding this.


darn i thought this was gonna be a book about waste management


Rust has left the building


The book also has a chapter on reference counting ;-)


Rust also uses reference counting, probably the worst sort of garbage collection.


Only when used in a naïve way, which Rust doesn't do. For example, the increments/decrements happen only when clone() is called and at scope exit, respectively, and thanks to Rust's ownership/borrow checking this is rarely done, combining the best of both worlds (but yes, implementations with aggressive increments/decrements in loops and on every function call can be very slow). Rust also separates Arc (atomic refs) and Rc (non-atomic refs) and enforces usage scenarios in the type checker, giving you cheap Rc in single-threaded scenarios. Reference counting done in a smart way works pretty well, but you obviously have to be a little careful about cycles (which in my experience are pretty rare and fairly obvious when you have such a data type).


The increment/decrement calls only occur on an explicit call to .clone(). No .clone(), no increment/decrement.

You won't see many clones in rust code.


It's how often reference counts are adjusted on hot paths that matters (including in libraries), and back to the original point, reference counting doesn't let you free groups of objects in one go (unlike a tracing GC).

Also it'd be nice if the reference counts were stored separately from the objects. Storing them alongside the object being tracked is a classic mistake made by reference count implementations (it spreads the writes over a large number of cache lines). I was actually surprised that Rust doesn't get this right.

Another issue with manual memory management is that you can't compact the heap.


The amount of reference-counted pointers in most Rust code is a tiny fraction compared to boxes or compiler-tracked-lifetime references.

Yes in theory it would be more efficient to store all the reference counts together, but that's in theory. In practice most Rust apps will not call clone on a shared pointer on a hot path and if they do it's usually 1 such pointer and they do something with the data as well (so it's all 1 cache line anyway)

You can't compare Rust/C++ with Swift/Nim when it comes to RC; there just aren't enough reference count operations for it to matter much (unless you're in a shitty OO C++ codebase like mine that pretends it is Java, with std::shared_ptr everywhere)

Apps where heap compaction would be relevant in a low-level language like Rust or C++ will typically use a bump allocator which will trounce any kind of GC.


Tracing is the worst in terms of performance


Anyone claiming something like this obviously hasn’t dug into GCs. Do you honestly think that writing to memory on each access, especially atomically, is anywhere near the performance of a GC that can do most of its work in parallel and just flip a bit to, in effect, delete everything that is no longer accessible?


a bit flip is not writing?

Also, do traces not have to work atomically? The program needs to stop; you can’t have it check roots as it runs.

I’ll admit I am no GC researcher with PhD experience, but your comment makes it seem you aren’t either.


Tracing is batched up in GC pauses, rather than on every access as with naive RC. It is necessary to stop the world, but the work done in the pause does not need to use atomic operations.

Atomics are handy in a parallel/multi-core tracing collector, but IME pointer chasing in tracing somehow manages to cover the time it takes to do atomic operations.


That depends. Deallocating a zillion little objects one at a time can be slower than doing them all in a batch.


Not really, here it is winning hands down over Swift's ARC implementation.

https://github.com/ixy-languages/ixy-languages


Wasn't this the comparison that decided to use reference types for everything for no real reason?


The reason being to compare how various schemes of automatic memory management perform.

Naturally, if the purpose had been to compare stack allocation performance, another approach would have been taken.


The goal was to write a network driver in several languages. Nobody said anything about comparing memory management techniques, nor would the Swift implementation use a stack allocator anyways.


They would appreciate your contribution to fix the benchmark.

Apparently the ARC performance improvements announced at WWDC 2022 weren't needed.


I don’t think they would, considering they’ve already finished their study.


You could have your own repo to counterpost every time someone like me refers to that old study with tainted results.


I don't really have the skills to do an accurate comparison across many languages, and if I pick just three or four it's going to get nitpicked to death for being cherry-picked. Honestly, I generally think studies of this kind are mostly doomed to failure, or invariably converge to someone trying to encode x86 intrinsics in Rust.


I don’t know anything about the benchmark, but how would you test GC implementations without reference types?


Swift’s value types have reference counts because they may have members that need their lifetimes to be managed appropriately. (For example, if they’re reference types.)


This is incorrect: value types are not reference counted in Swift. If a value type contains a reference type member (usually an anti-pattern!), then that member’s reference count is indeed incremented when the value type is copied. But it is not accurate to claim that value types themselves have reference counts.


Yes, fair enough. I was thinking of this more from the perspective of types that have internal reference types that are opaque so it’s effectively like the value type itself having a reference count on it, but yes, really it’s the internal storage that is getting the count.


This isn't the kind of program Swift was designed to perform well for.

Nor is wallclock speed even what the system should be optimizing for, since you buy phones to run apps not to run the system. You should be measuring how well it gets out of the way of the important work.


Ah the excuses when facts are put on the table.

"Fast. Swift is intended as a replacement for C-based languages (C, C++, and Objective-C). "

-- https://www.swift.org/about/

"From its earliest conception, Swift was built to be fast. Using the incredibly high-performance LLVM compiler technology, Swift code is transformed into optimized machine code that gets the most out of modern hardware. The syntax and standard library have also been tuned to make the most obvious way to write your code also perform the best whether it runs in the watch on your wrist or across a cluster of servers.

Swift is a successor to both the C and Objective-C"

-- https://developer.apple.com/swift/#fast


[flagged]


[flagged]


It usually isn't. Reference counting is good when you need to rather tightly limit memory use, are wedded to RAII, or need something simple enough that a programmer can knock it out in an afternoon.


Do we need this book now that we have Rust ??


No. Rust has completely consigned all of computer science to the dustbin of irrelevance.


Perhaps someone will rewrite this book (the whole book, not just the code) in Rust.


Rust removed its garbage collector and memory safety 9 years ago. Stack overflows and unsafety have plagued it since. Reading such a book would be a necessity, especially for Rustaceans.


Well, we may use it to re-implement a GC in Rust!


That's like asking "Do we need Go now that we have Rust?"


Exactly that.

BTW My comment was ironic.



