Thread safety, or the lack of it. The system-allocator needs to be thread-safe i...

scottlamb · on Oct 14, 2020

Given that the system allocator likely has a thread-local or CPU-local cache, does that constraint really make things noticeably more efficient? It simplifies your implementation, sure, but the system allocator already exists, and one implementation is simpler than two...

dragontamer · on Oct 14, 2020

Hmm... my multithreaded strategy typically uses task-based parallelism with a thread-pool.

In particular: #pragma omp task

Those OpenMP tasks can "float" between threads, depending on your implementation. If a task blocks (usually at a task barrier), it can resume on a different pthread.

Ex: Task A is running on Hardware-Thread#10. Task A mallocs something from Thread#10's local malloc. Task A calls barrier, which means Thread#10 "gives up Task A" back to the work queue.

Later, all tasks hit the barrier, and Task A can run again. But which thread runs Task A is left up to the runtime. Thread#25 might be running the task now. At this point, Thread#25 calls free(fooptr), but it now has to be a global-synchronization, since the data came from Thread#10's pool.
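A rough sketch of the scenario above, with some assumptions spelled out: it uses an `untied` task and `taskwait` as the scheduling point (the function name `task_alloc_then_free` and the `current_thread` helper are made up for illustration). Whether the task actually moves is entirely up to the runtime, and some runtimes may never migrate it, so the function only reports what happened on this run. Without OpenMP enabled the pragmas are ignored and it runs serially.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#ifdef _OPENMP
#include <omp.h>
static int current_thread(void) { return omp_get_thread_num(); }
#else
static int current_thread(void) { return 0; } /* serial fallback */
#endif

/* Hypothetical demo: returns 1 if malloc and free ran on different
 * threads during this run, 0 otherwise. */
int task_alloc_then_free(void)
{
    int migrated = 0;

    #pragma omp parallel
    #pragma omp single
    {
        /* "Task A": untied, so the runtime is allowed to resume it
         * on another thread after a task scheduling point. */
        #pragma omp task untied shared(migrated)
        {
            int alloc_thread = current_thread();
            char *fooptr = malloc(64);
            assert(fooptr != NULL);
            memset(fooptr, 0, 64);

            #pragma omp task
            { /* child work that Task A waits for */ }

            #pragma omp taskwait /* scheduling point: Task A may move */

            int free_thread = current_thread();
            free(fooptr); /* possibly a different thread than malloc */
            migrated = (alloc_thread != free_thread);
        }
    }
    return migrated;
}
```

Either way the free() is legal; the only open question is what it costs inside the allocator.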

----------

It's probably not safe to assume thread-local storage is sufficient for task-based parallelism.

scottlamb · on Oct 14, 2020

> At this point, Thread#25 calls free(fooptr), but it now has to be a global-synchronization, since the data came from Thread#10's pool.

I don't think this is true. As I understand it, it just gets put into Thread#25's pool rather than returned to Thread#10's. If there's a long-running imbalance (like a producer-consumer pattern in which all the mallocs are from Thread#10 and all the frees are from Thread#25), that will lead to more global synchronization because the allocator will have to repeatedly refill Thread#10's pool and empty Thread#25's pool. But if it's just that there's some shuffling of threads between the allocations and frees but stuff is generally balanced, there's little if any additional cost.
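To make that concrete, here's a toy single-threaded model of per-thread caches. It is not how any real allocator is implemented, just a counter exercise. Everything in it (`toy_alloc`, `CACHE_CAP`, `BATCH`, the two scenario functions) is invented for illustration. A free always lands in the freeing thread's cache, and a trip to the shared pool is counted only when a cache runs empty or overflows.

```c
#include <assert.h>

enum { NTHREADS = 2, CACHE_CAP = 8, BATCH = 4 };

/* Toy model: each "thread" is just an index with a count of cached
 * free blocks. The shared pool is assumed to be unlimited. */
typedef struct {
    int  cached[NTHREADS]; /* free blocks held in each thread's cache */
    long global_syncs;     /* trips to the shared pool */
} toy_alloc;

static void toy_init(toy_alloc *a)
{
    for (int t = 0; t < NTHREADS; t++)
        a->cached[t] = 0;
    a->global_syncs = 0;
}

/* Allocate on thread t: refill a batch from the shared pool if empty. */
static void toy_malloc(toy_alloc *a, int t)
{
    assert(t >= 0 && t < NTHREADS);
    if (a->cached[t] == 0) {
        a->cached[t] = BATCH;
        a->global_syncs++;
    }
    a->cached[t]--;
}

/* Free on thread t: the block joins t's cache no matter who allocated
 * it; flush a batch to the shared pool only when the cache is full. */
static void toy_free(toy_alloc *a, int t)
{
    assert(t >= 0 && t < NTHREADS);
    if (a->cached[t] == CACHE_CAP) {
        a->cached[t] -= BATCH;
        a->global_syncs++;
    }
    a->cached[t]++;
}

/* All mallocs on thread 0, all frees on thread 1. */
long syncs_producer_consumer(int n)
{
    toy_alloc a;
    toy_init(&a);
    for (int i = 0; i < n; i++) {
        toy_malloc(&a, 0);
        toy_free(&a, 1);
    }
    return a.global_syncs;
}

/* Every block is freed on the "wrong" thread, but the two threads
 * alternate roles, so allocations and frees balance out. */
long syncs_balanced_shuffle(int n)
{
    toy_alloc a;
    toy_init(&a);
    for (int i = 0; i < n; i++) {
        int t = i % 2;
        toy_malloc(&a, t);
        toy_free(&a, 1 - t);
    }
    return a.global_syncs;
}
```

In this model the balanced shuffle needs only the initial refill, while the producer-consumer pattern hits the shared pool roughly once every couple of operations, which is the distinction I'm drawing.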

I think you're suggesting using a set of "task-local" custom allocators, and each of those can ignore threads? I suppose that would work, as each task is only on one thread at once, and there has to be a barrier anyway when it hops from thread to thread. But I'm skeptical it's faster than the system allocator. I'd love