The objective of the article wasn't to stress the specific case of the 8-byte allocation pattern; it was to show that malloc behavior depends on the context. The 8-byte size was chosen because it is small enough to let a large number of allocations be performed, so the timings in the results are fairly accurate. The main goal, however, was to show the difference between what are called the "contended" and the "uncontended" cases: your program may perform properly under a single-threaded workload but poorly in a multi-threaded environment, not because of explicit locking in your code but because some resource sharing is hidden behind the allocator.
Also, the lots-of-allocations + lots-of-deallocations pattern was chosen because it is an issue we ran into quite recently: we allocated a huge tree structure progressively, and from time to time we flushed it to disk. The flush was quite efficient, but the deallocation blocked the program for approximately 30s. As a quick fix, we moved the deallocation to a background thread, but this slowed the tree construction down by approximately 50% because of contention on the allocator. Even if the chunk sizes in the article were not realistic, the results were close enough to real-world behavior to be worth publishing.
I provided (in a separate comment) the results of the benchmarks with a 32-byte payload: in that benchmark, all the allocators had the same 12% overhead in terms of memory, but we clearly see the same performance pattern as with the 8-byte payload.