It's great to be able to make memory allocators faster, but it's also wise to consider the possibility that the real inefficiency lies in the applications that use them; ironically, the applications that do the most unnecessary allocation and deallocation are the ones that would benefit the most from a faster memory allocator.
To paraphrase a common adage, the most efficient way to allocate memory is to not do it at all. It probably helps reduce the chance of use-after-free bugs too.
There's a moment when you realise you should avoid allocations, because you discover that allocations are expensive. Then at some point you also realise that you cannot avoid some allocations (e.g. you need to perform some non-blocking networking), and because of the first point you get a bit frustrated: you're forced to do something that is expensive.
That's exactly when you realise you can get rid of that frustration by introducing some complex caching, object pools, or more elaborate structures. But the more complex it gets, the more frustrated you become (again). Finally, you end up writing something simple and efficient: a memory allocator that matches the exact allocation pattern of your use case... and you want that allocator to be as efficient as possible.
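As a rough illustration of that "matches the exact allocation pattern" idea, here is a minimal fixed-size object pool sketch in C++. The class and its behaviour are made up for the example: a real pool would handle exhaustion, alignment, and non-default-constructible types more carefully.

```cpp
#include <cstddef>
#include <vector>

// Minimal fixed-size object pool: all objects are allocated once up front,
// and acquire()/release() just push and pop pointers on a free list.
// It serves exactly one object type, which is the whole point: it matches
// a single, known allocation pattern. T must be default-constructible here.
template <typename T>
class Pool {
public:
    explicit Pool(std::size_t capacity) : storage_(capacity) {
        free_.reserve(capacity);
        for (auto& slot : storage_) free_.push_back(&slot);
    }

    T* acquire() {
        if (free_.empty()) return nullptr;  // caller decides what to do when exhausted
        T* p = free_.back();
        free_.pop_back();
        return p;
    }

    void release(T* p) { free_.push_back(p); }

private:
    std::vector<T> storage_;  // the one real allocation
    std::vector<T*> free_;    // slots currently available
};
```

Nothing clever is going on: the speed comes from the pool never touching the general-purpose allocator after construction.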
What would be an unnecessary allocation? A program generally does something with the allocated memory and at that point it becomes a necessary allocation.
Specifically, it makes a lot of sense to optimize system allocators.
If you can't rely on fast system allocators, you are inclined to recreate your own pools and arenas that allocate from malloc() and write all the management boilerplate yourself, with your own bugs on top. This also eats up time you would otherwise spend writing your application. Oh yeah, and you're still likely to end up slower than any of the allocators published since the 2000s or so.
On the other hand, if the system has a modern slab-style allocator that is cache-aware and does automatic pooling of similarly sized objects, you get all that basically for free by calling malloc() and free() in a naive and "unnecessary" way, over and over. Well, applications need to manage their memory somehow, hence the allocations and deallocations.
Optimizing the system memory allocator pays off as long as you never see the allocator hogging too many cycles in your profiler. If you can get away with lots of malloc(), free(), and whatnot because a smart allocator ideally turns those into little more than pointer bumps, you win.
Custom memory management is generally useful in highly optimized loops where you just can't pay the cost of a random book-keeping round when calling malloc() or free(), or in cases where you can spend some memory to save some time. Then you might want to manage your own pool so that you can guarantee every operation stays O(1). Alternatively, your program might benefit from a pattern where all the memory is allocated sequentially and never freed until the very end of the operation. Processing one request or running one cycle of an operation are examples; a sketch of that pattern follows below.
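A minimal sketch of that allocate-sequentially, free-everything-at-once pattern is a bump arena. This version assumes a single fixed-size buffer and power-of-two alignment; real arenas chain additional blocks when the first one runs out.

```cpp
#include <cstddef>
#include <cstdlib>

// Bump arena: allocation is just advancing an offset (O(1), no per-object
// book-keeping), and everything is released at once when the arena is
// reset or destroyed -- e.g. at the end of one request or one cycle.
class Arena {
public:
    explicit Arena(std::size_t size)
        : base_(static_cast<char*>(std::malloc(size))), size_(size), used_(0) {}
    ~Arena() { std::free(base_); }

    void* allocate(std::size_t n, std::size_t align = alignof(std::max_align_t)) {
        std::size_t p = (used_ + align - 1) & ~(align - 1);  // round offset up
        if (p + n > size_) return nullptr;                   // arena exhausted
        used_ = p + n;
        return base_ + p;
    }

    void reset() { used_ = 0; }  // "free" everything in one shot

private:
    char* base_;
    std::size_t size_;
    std::size_t used_;
};
```

Individual frees simply don't exist; the lifetime of every object is the lifetime of the arena, which is exactly what the one-request or one-cycle pattern wants.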
A C++ program that copies around lots of temporary std::strings is an example of unnecessary allocations.
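A contrived example of what that looks like: the first function below creates two temporary std::string copies on every call (each one an allocation and a free), while the std::string_view variant does the same check with no allocations at all.

```cpp
#include <string>
#include <string_view>

// Every call copies both arguments into fresh std::strings: allocations
// made purely to look at characters that already exist elsewhere.
bool has_prefix_copy(std::string s, std::string prefix) {
    return s.compare(0, prefix.size(), prefix) == 0;
}

// Same logic over string_view: no copies, no allocations.
bool has_prefix_view(std::string_view s, std::string_view prefix) {
    return s.substr(0, prefix.size()) == prefix;
}
```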
The thing about optimizing allocators is that you very quickly run out of ways to make them faster without a space tradeoff. Notice that the first optimization detailed in the article was to retain a 'free page' instead of returning it to the kernel: this speeds up allocation and deallocation, but increases memory usage. And if one page makes it faster, why not two, or three, or fifty?
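A rough sketch of that single-page cache (not the article's actual code, and assuming POSIX mmap()/munmap() with 4 KiB pages): freeing stashes one page for reuse instead of handing it back to the OS, which is exactly one page of extra idle memory.

```cpp
#include <cstddef>
#include <sys/mman.h>

constexpr std::size_t kPageSize = 4096;
static void* cached_page = nullptr;  // the one retained 'free page'

void* alloc_page() {
    if (cached_page) {               // fast path: reuse the cached page
        void* p = cached_page;
        cached_page = nullptr;
        return p;
    }
    return mmap(nullptr, kPageSize, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}

void free_page(void* p) {
    if (!cached_page) {              // keep it around instead of munmap()
        cached_page = p;
        return;
    }
    munmap(p, kPageSize);            // cache already full: actually return it
}
```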
The same is true for that slab-style allocator. The basic idea is to split memory into separate pools for different sizes, and only allocate from the pool for the given size. If that pool is full, enlarge it, even if other pools have enough space to satisfy the allocation. That's wasted memory!
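A toy illustration of that idea (the size classes and names are arbitrary): requests are rounded up to a size class and served only from that class's free list, never borrowed from another class, so free chunks in the other classes sit idle.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Toy size-class allocator: each request is rounded up to a class and
// served only from that class's free list. Other classes may have plenty
// of free chunks, but we never borrow from them -- that unused space is
// the memory cost of the scheme.
constexpr std::size_t kClasses[] = {16, 32, 64, 128, 256};
static std::vector<void*> free_lists[5];

std::size_t class_index(std::size_t n) {
    for (std::size_t i = 0; i < 5; ++i)
        if (n <= kClasses[i]) return i;
    return 5;  // too big for any class: fall through to plain malloc
}

void* slab_alloc(std::size_t n) {
    std::size_t c = class_index(n);
    if (c == 5) return std::malloc(n);
    if (!free_lists[c].empty()) {        // reuse a chunk of this class
        void* p = free_lists[c].back();
        free_lists[c].pop_back();
        return p;
    }
    return std::malloc(kClasses[c]);     // "enlarge" this pool only
}

void slab_free(void* p, std::size_t n) {
    std::size_t c = class_index(n);
    if (c == 5) { std::free(p); return; }
    free_lists[c].push_back(p);          // chunk stays in its own class
}
```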
So we have to balance speed against memory usage. The right balance is domain-specific: that Java app running on a monster server can afford to waste lots of memory to speed up allocations, while the allocator on my iPhone needs to be much more mindful of its space overhead.
It's all good, but if there's a sizable std::map with std::strings somewhere in the code, it only makes sense to use a faster allocator when one is available. It's basically an optimization freebie.