My understanding was that allocators nowadays have cache pools per cpu or thread and only rarely need to lock on global state (when imbalances between pool sizes
necessitate shifting resources around).
They do and so does jemalloc in Firefox, however that lowers contention, it doesn't let you do away with synchronization. Consider this simple scenario: a thread allocates a chunk of memory from its per-thread pool, but the object is later released by another thread. You can't do that w/o synchronization around the pool. Per-CPU pools are even trickier because a thread might be moved around between different CPUs by the scheduler while it's allocating memory. So you need special facilities to implement those in user-space like restartable sequences: https://lwn.net/Articles/650333/