Right, ok... what I don't get is that it mentions profiling the advantage of TCMalloc, which is thread-caching malloc. Is that going to be realistic when the replayer is a single thread? (I could be missing something.)
Long answer: if the allocator is poorly designed, a lot of time will be spent traversing its free list/tree/whatever looking for a block that fits your size requirements. This lookup time can be exacerbated if your heap is badly fragmented, or if the allocator does a poor job coalescing freed blocks. You could end up spending lots of time in malloc looking for a nicely sized block.
Also, with regard to heap fragmentation: long-running processes that do lots of allocations/frees can cause fragmentation, again depending on the design of the allocator. If there is a lot of heap fragmentation, you could see some substantial bloating.
So profiling your process for those two items can be valuable.
What you say is true; the major gain for TCMalloc is in multi-(native)-threaded apps.
Perhaps the next version of malloc_wrap will support multiple threads.
In either case, we have not yet finished collecting data about the different allocators, so I am not currently in a position to say which is better for our use case.
I just wanted a tool that would let me replay a constant set of allocation patterns against different allocators, to find out whether swapping out libc's malloc made a difference for us, and that is precisely what malloc_wrap is.
"What you say is true; the major gain for TCMalloc is in multi-(native)-threaded apps."
I think this is pretty key, because otherwise TCMalloc is somewhat of an overhead. Depending on your platform, a standard malloc will pull ahead (it depends on how favourable locking is, but that is the case on OS X, anyway).
A multi-threaded instance sounds interesting, but I'm guessing it would be a challenge to get a representative sample.
You might be reading the article too literally -- you can test more than just tcmalloc, of course (ned, ptmalloc*, libumem, etc.). It is -very- possible that one of these allocators will handle our memory footprint more gracefully than, say, libc's. There is only one way to find out: A/B testing.
I think the important thing to keep in mind is that assertions like:
"I think this is pretty key, because otherwise TCMalloc is somewhat of an overhead."
are a bit subjective, IMHO. Allocators differ from one another, and of course they react to a given series of allocations/deallocations differently. We're trying to find out whether the way we use our heap is better suited to another allocator like tcmalloc, or nedmalloc, or whatever.
And re: multi-threaded, I don't believe it will be particularly difficult to get a representative sample, but working on that isn't very high on my list right now.