Hacker News
Biased Reference Counting: Minimizing atomic operations in GC (2018) [pdf] (uiuc.edu)
59 points by luu on Nov 28, 2019 | 6 comments



This is pretty cool! As they mention in the paper, this is similar to biased locking in Java, where "synchronized" blocks only get promoted to a full mutex when a second thread tries to lock them. It's the same idea here, except with a counter.
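For concreteness, here's a rough C++ sketch of that fast/slow split (the names and layout are my own, not the paper's exact scheme): the owner thread bumps a plain counter with no atomic operation, and every other thread falls back to an atomic shared counter.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical sketch of a biased reference count. The owner thread
// updates `biased` non-atomically; other threads use `shared`.
struct BiasedRC {
    std::thread::id owner;          // thread the count is biased toward
    int biased = 1;                 // owner-only, non-atomic fast path
    std::atomic<int> shared{0};     // everyone else, atomic slow path

    BiasedRC() : owner(std::this_thread::get_id()) {}

    void retain() {
        if (std::this_thread::get_id() == owner)
            ++biased;               // fast path: no atomic op at all
        else
            shared.fetch_add(1, std::memory_order_relaxed);
    }

    // Returns true when the combined count reaches zero. The naive
    // combined read below is racy; the paper merges the two counters
    // carefully when the biased half drops to zero.
    bool release() {
        if (std::this_thread::get_id() == owner)
            --biased;
        else
            shared.fetch_sub(1, std::memory_order_relaxed);
        return biased + shared.load(std::memory_order_relaxed) == 0;
    }
};
```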

This made me think of two other approaches I wonder if anyone has tried. The first is using something like per-CPU counters. The number of threads is usually large, but maybe the number of CPUs is bounded enough to contain the memory pressure? Linux implements restartable sequences [https://blog.linuxplumbersconf.org/2016/ocw/system/presentat...] for this, but I don't know if that's any faster than the BRC described here. The other thought was - in a similar light to the static analysis the paper mentions - is there any benefit to somehow marking the objects that are "about to escape" to switch the owner thread id? One of the common patterns in Swift is bouncing a sequence of tasks off different dispatch queues to get on/off the UI thread, but it's really a sequence of operations on the same object.
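A cheap way to approximate the per-CPU counter idea without rseq is to stripe one logical refcount across a small array of padded slots, so concurrent threads usually hit different cache lines. A sketch (all names hypothetical; rseq would give the exact CPU instead of a thread-id hash):

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>

constexpr std::size_t kSlots = 8;   // assumption: bounded "CPU" count

struct alignas(64) PaddedCount {    // 64-byte alignment avoids false sharing
    std::atomic<long> value{0};
};

struct StripedRC {
    PaddedCount slots[kSlots];

    static std::size_t my_slot() {
        // Hash the thread id into a slot; real per-CPU code would use
        // the current CPU index via restartable sequences.
        return std::hash<std::thread::id>{}(std::this_thread::get_id()) % kSlots;
    }

    void retain()  { slots[my_slot()].value.fetch_add(1, std::memory_order_relaxed); }
    void release() { slots[my_slot()].value.fetch_sub(1, std::memory_order_relaxed); }

    // The zero check must sum every slot - the cost of this scheme.
    long total() const {
        long sum = 0;
        for (const auto& s : slots)
            sum += s.value.load(std::memory_order_relaxed);
        return sum;
    }
};
```

The updates are still atomic here, so this only reduces contention rather than eliminating atomics the way BRC does.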

Great stuff in either case - I’m a fan of no-effort memory management with minimal performance implications and this is one step closer.


I played with per-thread counters. The problem is you don't want them sharing a cache line, so having them inside the object is problematic. Having external per-thread counter arrays involves bookkeeping and indirection I could not make fast enough. YMMV, and maybe someone has done a better job.

A known variant is deferred counting, where a single thread mutates refcounts and you have per-thread lightweight queues from other threads to the mutator.
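A minimal sketch of that deferred scheme (my own naming; a real implementation would use lock-free per-thread queues, a mutex just keeps the sketch short): only the mutator thread touches the count directly, while other threads post +1/-1 deltas that get drained later.

```cpp
#include <mutex>
#include <vector>

struct DeferredRC {
    long count = 1;              // touched only by the mutator thread
    std::mutex qlock;
    std::vector<int> queue;      // deltas posted by other threads

    // Called from any non-mutator thread: enqueue a delta, no shared
    // counter is touched.
    void post(int delta) {
        std::lock_guard<std::mutex> g(qlock);
        queue.push_back(delta);
    }

    // Called periodically by the mutator; applies pending deltas and
    // returns the up-to-date count.
    long drain() {
        std::vector<int> pending;
        {
            std::lock_guard<std::mutex> g(qlock);
            pending.swap(queue);
        }
        for (int d : pending) count += d;
        return count;
    }
};
```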


One problem with this type of approach is that many modern UI programs allocate objects on side threads and then migrate those objects to the main thread (which is the only thread allowed to touch UI state). So one really needs an approach that can recognize that steady state - the steady state being when the object has landed in its final destination thread.

Once Swift has some sort of concurrency model, it would be interesting to see if we could use per-thread local ref counts that manage the lifetime of a single atomic ref count bump. One would be able to convert between the two types of ref counts upon escape if needed.

This type of approach would eliminate most of the ref counts and allow all threads to use these non-atomic, thread-local ref counts in hot code.
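The escape idea above could be sketched like this (hypothetical API, not anything Swift actually does): while the object is thread-confined it carries a plain count, and at the escape point the local count is folded into one atomic count that all threads share from then on.

```cpp
#include <atomic>

struct EscapableRC {
    bool escaped = false;           // has the object become shared?
    int local = 1;                  // non-atomic while thread-confined
    std::atomic<int> global{0};     // used only after escape

    void retain() {
        if (!escaped)
            ++local;                // hot path: no atomic op
        else
            global.fetch_add(1, std::memory_order_relaxed);
    }

    // One-time conversion when the object becomes visible to other
    // threads: merge the local count into the atomic one.
    void escape() {
        global.fetch_add(local, std::memory_order_relaxed);
        local = 0;
        escaped = true;
    }
};
```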


I only skimmed the paper, but it seems like the proposed method could support handing over ownership from the worker thread to the GUI thread via an explicit statement in the code.


Yeah. Even at runtime it seems possible: when the owner thread gives up ownership (by setting owner_tid = 0), it can set an extra flag that lets other threads compete for the ownership. As long as the flag is in the 2nd half-word, there is (it seems) no atomicity requirement for the 1st half-word.
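That handoff might look something like this (the packing and names are my own, not the paper's): owner_tid lives in the low half-word, an "up for grabs" flag in the high half-word, and claimants race with a single compare-and-swap on the combined word.

```cpp
#include <atomic>
#include <cstdint>

constexpr uint32_t kUnowned  = 0;          // owner_tid = 0 means no owner
constexpr uint32_t kGrabFlag = 1u << 16;   // flag sits in the 2nd half-word

struct OwnerWord {
    std::atomic<uint32_t> word{kUnowned};

    // Owner relinquishes: clear the tid and raise the flag in one store.
    void give_up() { word.store(kGrabFlag, std::memory_order_release); }

    // Any thread may try to claim ownership once the flag is up. The CAS
    // succeeds only if the flag is set and no tid has been written yet.
    bool try_claim(uint16_t my_tid) {
        uint32_t expected = kGrabFlag;
        return word.compare_exchange_strong(
            expected, static_cast<uint32_t>(my_tid),
            std::memory_order_acq_rel);
    }
};
```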


This is a nice result. It follows a similar pattern to generational garbage collection: divide everything into a fast, common path and a slow, fully general path.



