This is pretty cool! As they mention in the paper, this is similar to biased locking in Java, where “synchronized” blocks only ever get promoted to a full mutex when a second thread tries to lock them. It’s the same idea here, except applied to a reference count instead of a lock.
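Concretely, the shape I have in mind is something like this (made-up C++ names, not the paper’s actual field layout, and the owner/non-owner race at the end needs a real handshake that I’m skipping):

    #include <atomic>
    #include <thread>

    // Toy sketch of a biased refcount: the owning thread bumps a plain
    // counter with no atomics, every other thread goes through a shared
    // atomic counter.
    struct BiasedRC {
        std::thread::id owner = std::this_thread::get_id();
        long biased = 1;                // only ever touched by the owner
        std::atomic<long> shared{0};    // touched by everyone else

        void retain() {
            if (std::this_thread::get_id() == owner)
                ++biased;                                        // fast path, no lock prefix
            else
                shared.fetch_add(1, std::memory_order_relaxed);  // slow path
        }

        // Returns true when the caller should free the object.
        bool release() {
            if (std::this_thread::get_id() == owner) {
                // Owner's last release also checks the shared count.
                return --biased == 0 &&
                       shared.load(std::memory_order_acquire) == 0;
            }
            // Non-owner release: in this toy the final free is always left
            // to the owner; the real scheme has a handshake/merge step so
            // the last decrement on either side gets noticed.
            shared.fetch_sub(1, std::memory_order_acq_rel);
            return false;
        }
    };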

This made me think of two other approaches, and I wonder if anyone has tried them. The first is using something like per-CPU counters. The number of threads is usually large, but maybe the number of CPUs is bounded enough to keep the memory overhead manageable? Linux implements restartable sequences [https://blog.linuxplumbersconf.org/2016/ocw/system/presentat...] for this kind of thing, but I don’t know whether that’s any faster than the BRC described here. The other thought was - in a similar vein to the static analysis the paper mentions - is there any benefit to somehow marking objects that are “about to escape” so the owner thread ID can be switched? One of the common patterns in Swift is bouncing a sequence of tasks across different dispatch queues to get on/off the UI thread, but it’s really a sequence of operations on the same object.
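For the per-CPU idea, this is roughly what I was picturing (all names made up). It still uses atomics, because without rseq a thread can migrate CPUs between reading its CPU id and doing the update, and the per-object slot array is exactly the memory-pressure question:

    #include <atomic>
    #include <sched.h>   // sched_getcpu(), Linux-specific

    // Hypothetical sharded ("per-CPU-ish") refcount: one padded slot per
    // CPU so cores don't fight over a single cache line.
    constexpr int kMaxCpus = 128;

    struct alignas(64) Slot {            // 64 ~ typical cache line size
        std::atomic<long> count{0};
    };

    struct ShardedRC {
        Slot slots[kMaxCpus];            // 8 KB per object -- the memory
                                         // pressure question above

        void retain() {
            int cpu = sched_getcpu();
            if (cpu < 0 || cpu >= kMaxCpus) cpu = 0;
            slots[cpu].count.fetch_add(1, std::memory_order_relaxed);
        }

        void release() {
            int cpu = sched_getcpu();
            if (cpu < 0 || cpu >= kMaxCpus) cpu = 0;
            slots[cpu].count.fetch_sub(1, std::memory_order_relaxed);
            // Detecting "total hit zero" needs a separate protocol; naively
            // summing 128 padded slots on every release would defeat the point.
        }

        long total() const {             // not linearizable; illustration only
            long sum = 0;
            for (const Slot& s : slots)
                sum += s.count.load(std::memory_order_relaxed);
            return sum;
        }
    };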

Great stuff in either case - I’m a fan of no-effort memory management with minimal performance overhead, and this is one step closer.




I played with per-thread counters. The problem is you don't want them sharing a cache line, so having them inside the object is problematic. Having external per-thread counter arrays involves bookkeeping and indirection I could not make fast enough. YMMV, and maybe someone has done a better job.
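Concretely, the external-table version I mean is roughly this (toy C++, names made up):

    #include <unordered_map>

    struct Obj;   // whatever is being refcounted

    // Each thread keeps its own object -> local count, so counts for the
    // same object never share a cache line. The catch is the map lookup on
    // every retain/release, plus the fact that a thread's local count can
    // go negative (retain here, release there), so the deltas have to be
    // reconciled somewhere to learn when the true total reaches zero --
    // that reconciliation is the bookkeeping that was hard to make fast.
    thread_local std::unordered_map<const Obj*, long> local_rc;

    inline void retain(const Obj* o)  { ++local_rc[o]; }
    inline void release(const Obj* o) { --local_rc[o]; }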

A known variant is deferred counting, where a single thread mutates the refcounts and the other threads post their increments and decrements to it through lightweight per-thread queues.
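Roughly this shape (toy C++, with a mutex-guarded vector standing in for whatever lightweight queue you’d actually use):

    #include <mutex>
    #include <vector>

    // Only one thread ever touches the real refcount; everyone else
    // enqueues +1/-1 deltas that the owning thread applies when it
    // drains its queue.
    struct DeferredRC {
        long count = 1;            // touched only by the mutator thread

        std::mutex qmu;
        std::vector<int> pending;  // deltas posted by other threads

        void enqueue(int delta) {  // called from any non-mutator thread
            std::lock_guard<std::mutex> g(qmu);
            pending.push_back(delta);
        }

        bool drain_and_check() {   // called by the mutator; true == freeable
            std::vector<int> batch;
            {
                std::lock_guard<std::mutex> g(qmu);
                batch.swap(pending);
            }
            for (int d : batch) count += d;
            return count == 0;
        }
    };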



