The system allocator needs to be thread-safe in a large variety of situations.
Your custom allocator doesn't have to be: maybe you know only one thread is running. Heck, even knowing that only 32 or 64 threads will ever run can grossly simplify the synchronization code.
The system allocator needs to be thread-safe with an unbounded number of threads (because you may call pthread_create at any time).
Your custom allocator may know more specifics about when and where pthreads are created, as well as whether or not they interact with each other.
---------
Case in point: let's say you malloc a 256MB region per thread. You know that only one thread will ever access this 256MB region, so you write the allocator for it without any synchronization primitives whatsoever.
But you still want a general-purpose allocator that supports multiple sizes to split up the data inside the 256MB region.
You call void* fooptr = malloc(256MB), because grabbing from the system allocator should remain thread-safe (who knows how many other threads are calling malloc?). But afterwards, you can make very specific assumptions about the access patterns of that fooptr.
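A minimal sketch of that idea, assuming a plain bump allocator stands in for the "general purpose" one (a real multi-size allocator would also need free lists). The names arena_t, arena_init, arena_alloc and arena_destroy are made up for illustration. The only call that touches shared state is the initial malloc; everything after that is unsynchronized because a single thread owns the region.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical single-owner arena: no locks or atomics, because only the
   owning thread ever touches it. */
typedef struct {
    unsigned char *base;  /* start of the region obtained from malloc */
    size_t         size;  /* total bytes in the region */
    size_t         used;  /* bytes handed out so far */
} arena_t;

/* Grab the backing region from the (thread-safe) system allocator.
   Returns nonzero on success. */
static int arena_init(arena_t *a, size_t size) {
    a->base = malloc(size);
    a->size = size;
    a->used = 0;
    return a->base != NULL;
}

/* Bump-allocate n bytes, rounded up to a multiple of 16.
   Returns NULL when the region is exhausted. */
static void *arena_alloc(arena_t *a, size_t n) {
    assert(a->base != NULL);
    size_t rounded = (n + 15u) & ~(size_t)15u;
    if (rounded < n || rounded > a->size - a->used) {
        return NULL;  /* size overflowed or not enough room left */
    }
    void *p = a->base + a->used;
    a->used += rounded;
    return p;
}

/* Hand the whole region back to the system allocator in one call. */
static void arena_destroy(arena_t *a) {
    free(a->base);
    a->base = NULL;
    a->size = 0;
    a->used = 0;
}
```

Each thread would call arena_init(&a, 256u << 20) once and then use arena_alloc for everything inside its region. Whether this beats the system allocator's own per-thread cache is something you'd have to measure.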
Given that the system allocator likely has a thread-local or CPU-local cache, does that constraint really make things noticeably more efficient? It simplifies your implementation, sure, but the system allocator already exists, and one implementation is simpler than two...
Hmm... my multithreaded strategy typically uses task-based parallelism with a thread-pool.
In particular: #pragma omp task
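Those omp-tasks can "float" between threads, depending on your implementation. If some task enters a blocking situation (usually a task-barrier), it could resume on a different pthread later in its execution. (In OpenMP this applies to untied tasks; tied tasks, the default, resume on the thread that started them.)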
Ex: Task A is running on Hardware-Thread#10. Task A mallocs something from Thread#10's local malloc. Task A calls barrier, which means Thread#10 "gives up Task A" back to the work queue.
Later, all tasks hit the barrier, and Task A can run again. But which thread runs Task A is left up to the runtime. Thread#25 might be running the task now. At this point, Thread#25 calls free(fooptr), but it now has to be a global-synchronization, since the data came from Thread#10's pool.
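The hand-off can be reduced to two plain pthreads, as a rough sketch (the function names here are made up, and this sidesteps OpenMP entirely). One thread mallocs, a different thread frees. That is always legal with the system allocator; what it costs internally (which pool the block lands in, how much synchronization happens) depends on the allocator and isn't visible from this code.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Runs on the second thread: frees a block the first thread allocated. */
static void *free_on_other_thread(void *p) {
    free(p);  /* malloc/free are not tied to the allocating thread */
    return NULL;
}

/* Allocates on the calling thread, frees on a new thread.
   Returns 0 on success, nonzero on failure. */
static int cross_thread_free_demo(void) {
    char *fooptr = malloc(64);
    if (fooptr == NULL) {
        return -1;
    }
    memset(fooptr, 0xAB, 64);  /* touch the memory on the allocating thread */

    pthread_t t;
    if (pthread_create(&t, NULL, free_on_other_thread, fooptr) != 0) {
        free(fooptr);
        return -1;
    }
    return pthread_join(t, NULL);
}
```

With a single-owner custom allocator, the equivalent free from the second thread would be a bug unless ownership of the whole region moved along with the task.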
----------
It's probably not safe to assume thread-local storage is sufficient for task-based parallelism.
> At this point, Thread#25 calls free(fooptr), but it now has to be a global-synchronization, since the data came from Thread#10's pool.
I don't think this is true. As I understand it, it just gets put into Thread#25's pool rather than returned to Thread#10's. If there's a long-running imbalance (like a producer-consumer pattern in which all the mallocs are from Thread#10 and all the frees are from Thread#25), that will lead to more global synchronization, because the allocator will have to repeatedly refill Thread#10's pool and empty Thread#25's pool. But if there's just some shuffling of threads between the allocations and frees and things are generally balanced, there's little if any additional cost.
I think you're suggesting using a set of "task-local" custom allocators, and each of those can ignore threads? I suppose that would work, as each task is only on one thread at once, and there has to be a barrier anyway when it hops from thread to thread. But I'm skeptical it's faster than the system allocator. I'd love