I'm messing around with implementing a parallel runtime system right now^, and I also had to implement a shared-nothing allocator. I ended up making each thread have a Chase-Lev Deque of 2MB pages. If you want to allocate a page, you do the usual pop-or-try-to-steal thing, and if that doesn't work, THEN you do the mmap to get another 2MB page. I was going to benchmark it and write a post about it, but I never got around to it. Glad shared-nothing allocation seems to be a problem worth solving.
^ actually it's basically implemented. I model-checked it with TLA+ and found a bug that I haven't fixed yet. Next up I'm writing a query engine on it.
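For concreteness, here's a minimal sketch of that pop-or-steal-then-mmap path. The deque API is invented for illustration, and the stand-in deque is single-threaded; a real Chase-Lev deque uses atomics with careful memory ordering on pop() and steal().

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <vector>

constexpr std::size_t kPageSize = 2 * 1024 * 1024; // 2MB pages

// Single-threaded stand-in so the sketch compiles; the real structure is a
// Chase-Lev deque (owner pops its own bottom, thieves steal from the top).
struct PageDeque {
    std::vector<void*> pages;
    void* pop() {                       // owner end
        if (pages.empty()) return nullptr;
        void* p = pages.back(); pages.pop_back(); return p;
    }
    bool steal(void*& out) {            // thief end
        if (pages.empty()) return false;
        out = pages.front(); pages.erase(pages.begin()); return true;
    }
};

void* allocate_page(PageDeque& mine, std::vector<PageDeque*>& peers) {
    // 1. Fast path: pop a cached 2MB page from our own deque.
    if (void* page = mine.pop()) return page;

    // 2. Try to steal a page from another thread's deque.
    for (PageDeque* peer : peers) {
        void* page;
        if (peer->steal(page)) return page;
    }

    // 3. Only then go to the OS for a fresh page.
    void* page = mmap(nullptr, kPageSize, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return page == MAP_FAILED ? nullptr : page;
}
```

The ordering is the point: mmap (and the kernel traffic it implies) is the last resort, not the common case.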
If you're backing your memory with pages stored in N work-stealing queues (Chase-Lev deques in this case), doesn't that contradict the shared-nothing architecture? When a queue is empty, I understand your allocator will try to steal a page from another queue before it asks the OS for a new 2MB page, and that essentially means synchronization and thus sharing. Also, I think this approach is going to result in bad memory locality, since your data will now be spread all over the place, i.e. across different queues.
Sort of? The stealing happens very rarely, only when the thread doesn’t have a page. It’s done using relaxed atomics, so synchronization is pretty mild in this case.
> bad data locality
This will be relevant when I make it NUMA-aware, but generally speaking it shouldn't matter, as each page is much bigger than a core's L2 cache anyway. Also, the memory living in the deque is freed memory; I don't think there's generally an expectation of locality for memory the allocator returns to you.
I am especially interested in how you share memory between threads in your parallel runtime - what thread safety approach do you use?
I've been thinking about static memory as permanent regions: fixtures that the program flows through, with backpressure when buffers get full.
I just wrote a nonblocking multithreaded barrier with a lock-free algorithm.
My threads generally don't share memory inside the runtime. The only time they touch each other's stuff is when they perform work stealing or memory stealing, and that's all provided by the CL-Deque.
Unfortunately, I cannot find any significant code samples for Verona on the website or in the GitHub repo. There are only a few types defined in a pretty low-level way.
Hi, I am one of the authors of snmalloc. There is a pool of allocators. When a thread exits, its allocator is returned to the pool. When a thread is created, it first checks the pool for an allocator and uses that in preference to creating a new one. If your application has a generally uniform number of threads over time, this works well. If you have a spike of threads at startup and are purely single-threaded after that, then this heuristic does not work well.
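As a rough sketch of that heuristic (the names and the mutex-guarded stack are my own simplification, not snmalloc's actual internals):

```cpp
#include <mutex>
#include <vector>

struct Allocator { /* per-thread allocator state */ };

std::mutex pool_lock;
std::vector<Allocator*> pool;   // allocators left behind by exited threads

// Thread start: reuse a pooled allocator in preference to creating one.
Allocator* acquire_allocator() {
    std::lock_guard<std::mutex> g(pool_lock);
    if (!pool.empty()) {
        Allocator* a = pool.back();
        pool.pop_back();
        return a;
    }
    return new Allocator();
}

// Thread exit: the allocator goes back to the pool, not away.
void release_allocator(Allocator* a) {
    std::lock_guard<std::mutex> g(pool_lock);
    pool.push_back(a);
}
```

Under the startup-spike workload described above, the pool ends up holding allocators (and whatever memory they cache) that are never reacquired, which is exactly the weakness mentioned.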
We have not had complaints about this heuristic, yet. We have a solution that involves marking the remote free queue as sleeping, and a thread that sends to it and observes "sleeping" would then have to do some work for that "sleeping" allocator. This is fairly easy, but I haven't had time to implement it.
> Allocations on one thread are freed by a different thread
I can imagine one use case for this: a task that is scheduled from and executed by a work-stealing thread pool can allocate memory on one thread, but by design there's no guarantee that the memory will be freed from that exact thread. Would that be a good use case for snmalloc?
> Deallocations occur in large batches
This sounds much like a bump-allocator use case, where you can do this exact thing by calling a single munmap(addr, len) to unmap multiple allocations all at once.
> I can imagine one use case for this: a task that is scheduled from and executed by a work-stealing thread pool can allocate memory on one thread, but by design there's no guarantee that the memory will be freed from that exact thread. Would that be a good use case for snmalloc?
Yes. A work-stealing runtime is a perfect use case for this. There are also cases where you might have a dedicated IO ingress thread that allocates work items, which get picked up by worker threads and then sent to a dedicated IO egress thread. The ingress thread does a lot of allocation, the egress thread does a lot of deallocation, and the workers do a bit of both. The deallocations typically flow in the opposite direction to the work, and this works extremely well with snmalloc.
> This sounds much like a bump-allocator use case, where you can do this exact thing by calling a single munmap(addr, len) to unmap multiple allocations all at once.
Where suitable, bump-allocated arena management will be fast, but sometimes the lifetimes are more complex. Certain memory-management algorithms can release a lot of memory in a batch that was not necessarily allocated in a batch: lock-free epoch-based memory reclamation and Linux RCU both batch the decision that memory is no longer required. There are also application-level things, like periodic flushes of a cache/log, that can cause batching of deallocations but are not well suited to bump allocators.
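To make the batching point concrete, here's a deliberately single-threaded toy of epoch-style retirement (real EBR tracks reader epochs and is considerably more involved): nodes are retired one at a time, with no relation to how they were allocated, yet freed in one batch.

```cpp
#include <cstdlib>
#include <vector>

std::vector<void*> retired[2];  // retire lists for the current and older epoch
int current = 0;

// Individual operations retire nodes one at a time, as they unlink them.
void retire(void* p) { retired[current].push_back(p); }

// Called once no reader can still hold pointers from the older epoch:
// everything retired during it is freed as one batch.
void advance_epoch() {
    current ^= 1;
    for (void* p : retired[current]) std::free(p);
    retired[current].clear();
}
```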
I'm not a contributor to snmalloc or affiliated in any way.
I'm going to reimplement snmalloc in C for my own code.
I'm going to avoid that problem by using structured concurrency. Under my system, threads only get data from threads that are ancestors. (There will be a way of getting data from a non-ancestor, but it will effectively keep the allocating thread alive until the using thread is done.)
The reason I'm going to use snmalloc is structured concurrency; it's like it was designed for it.
I did, but I don't want to have a dependency on Microsoft code.
Plus, there are some changes I will make to make my snmalloc/mimalloc clone more friendly to structured concurrency. Nothing major, mostly stuff to deal with realloc().
> The real question is: how does this deal with freeing allocations from a thread that exited?
I assume that they don't release the allocator assigned to a thread when the thread exits, unless that allocator has no outstanding allocations.
That sounds like the easiest part of all of this, TBH: a single extra line of code so that, after performing an actual deallocation (as opposed to sending a message to perform the deallocation), the allocator checks whether it still has outstanding allocations and whether its thread still exists, and otherwise deletes itself.
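Something like this, with invented names, assuming the allocator keeps a live-allocation count and is told when its thread exits:

```cpp
#include <cstdlib>

struct Allocator {
    long live_allocations = 0;   // outstanding blocks owned by this allocator
    bool owner_exited = false;   // set when the owning thread goes away

    void local_free(void* p) {
        std::free(p);            // the actual deallocation
        --live_allocations;
        // The single extra check: nothing left to free and no thread to
        // hand the allocator back to, so it tears itself down.
        if (live_allocations == 0 && owner_exited) delete this;
    }
};
```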
Don't all thread implementations leak unfreed memory when a thread exits?
I mean, how would the thread know that the memory needs to be freed? It cannot know that the application hasn't ROT13 encoded the pointer, passed it to another thread, and that thread plans to use the memory.
> Don't all thread implementations leak unfreed memory when a thread exits?
I'm a bit slow this morning - I don't understand what this is supposed to mean.
To me, it's like saying "don't all functions leak unfreed memory when a function returns" - it's true, but so what? It's always been true.
> I mean, how would the thread know that the memory needs to be freed? It cannot know that the application hasn't ROT13 encoded the pointer, passed it to another thread, and that thread plans to use the memory.
As I understand it, they are using multiple allocators (maybe 1 allocator per thread?), and they store metadata (such as ownership information of the allocated block) during the allocation.
With 1 allocator per thread, each allocation by an allocator maps to a single thread. When a block is to be freed, the allocator for the thread checks the ownership information, and if it is not the owner of that block, sends a request to the thread that is the owner of that block.
This means that even if the thread allocated a block, and let it get used by multiple other threads, they can all call `free` on that block, and the original thread will simply get all the free requests (I assume it will ignore requests for freeing blocks that are already freed).
It all depends on extra metadata stored during allocation, that describes the ownership information of that block.
(Happy to be corrected about any/all of the above)
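A hedged sketch of that flow. The per-block header is my simplification to keep it short (my understanding is that snmalloc derives ownership from the block's address rather than a header), and the mutex-guarded list stands in for a real message queue:

```cpp
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <thread>
#include <vector>

struct ThreadAllocator {
    std::thread::id owner = std::this_thread::get_id();
    std::mutex lock;                    // stand-in for a lock-free MPSC queue
    std::vector<void*> remote_frees;    // free requests sent by other threads

    void local_free(void* block) { std::free(block); }
    void remote_free(void* block) {
        std::lock_guard<std::mutex> g(lock);
        remote_frees.push_back(block);  // the owner drains this list later
    }
};

// Metadata written at allocation time: ownership of the block.
struct Header { ThreadAllocator* owner; };

void* my_alloc(ThreadAllocator& a, std::size_t n) {
    Header* h = static_cast<Header*>(std::malloc(sizeof(Header) + n));
    if (!h) return nullptr;
    h->owner = &a;
    return h + 1;                       // caller sees the bytes after the header
}

// Any thread may call this; the ownership check routes the free.
void my_free(void* p) {
    Header* h = static_cast<Header*>(p) - 1;
    if (h->owner->owner == std::this_thread::get_id())
        h->owner->local_free(h);        // our own block: free it directly
    else
        h->owner->remote_free(h);       // someone else's: send it to the owner
}
```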
With a classical allocator the memory could be freed from another thread, but this one passes it back to the allocating thread, which leaves the question: what happens to it?
Allocators may send messages to any other allocator. In a naïve implementation each allocator would keep a queue of batched messages for each other allocator. The number of queues would then either be the dynamically known number of existing threads, or the statically known maximal number of possible threads. The former would require allocation of a dynamically sized structure when handling remote deallocation and slow enqueueing, while the latter would lead to significant wasted space and hard-coded limits. Instead, we adapted ideas from radix trees. Allocators keep a fixed array of 2^k buckets of pending messages, where k in our implementation is 6. Batched messages are inserted into the bucket which corresponds to their destination's address, modulo the number of buckets. Dispatch of messages takes place by sending all the messages from the same bucket to one allocator, which then responds to those messages for which it is the final destination, and forwards the rest, again according to their destination allocator address, but now shifted right by k bits.
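In code, the quoted scheme looks roughly like this (the Msg type, the hop counter, and all names are mine; it's a paraphrase of the paragraph, not the snmalloc implementation):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

constexpr unsigned k        = 6;
constexpr unsigned kBuckets = 1u << k;    // 2^k = 64 buckets

struct Msg {
    std::uintptr_t dest;   // address of the destination allocator
    unsigned hops = 0;     // forwarding steps taken so far
};

struct Allocator {
    std::uintptr_t addr;                  // identifies this allocator
    std::vector<Msg> buckets[kBuckets];

    void handle(const Msg&) { /* perform the deallocation */ }

    // Transport stub: hand the whole batch to the allocator at `to`,
    // which runs enqueue() on each message (handling or re-forwarding).
    void deliver(std::uintptr_t to, std::vector<Msg> batch) {
        (void)to; (void)batch;
    }

    // Insert: the bucket index comes from the destination address,
    // shifted right by k bits for every hop already taken.
    void enqueue(Msg m) {
        if (m.dest == addr) { handle(m); return; }   // final destination
        unsigned idx =
            static_cast<unsigned>(m.dest >> (k * m.hops)) & (kBuckets - 1);
        ++m.hops;
        buckets[idx].push_back(m);
    }

    // Dispatch: each bucket is sent, whole, to one allocator named in it.
    void dispatch() {
        for (auto& b : buckets) {
            if (b.empty()) continue;
            std::uintptr_t to = b.front().dest;
            deliver(to, std::move(b));
            b.clear();
        }
    }
};
```

The fixed 64 buckets bound the per-allocator space regardless of how many threads exist, which is what the scheme buys over per-destination queues.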
I would think that the overhead of message passing would call for minimizing the number of messages sent. I wonder how well this technique scales.
While reviewing that doc, I also came across this. Seems very interesting -- I did not know this was possible:
> Some architectures, such as CHERI (including Arm's Morello), explicitly consider pointer provenance and bounds in addition to their target addresses. Adding these considerations to the architecture enables software to constrain uses of particular pointers in ways that are not available with traditional protection mechanisms. For example, while code may have a pointer that spans its entire C stack, it may construct a pointer that authorizes access only to a particular stack allocation (e.g., a buffer) and use this latter pointer while copying data. Even if an attacker is able to control the length of the copy, the bounds imposed upon pointers involved can ensure that an overflow is impossible. (On the other hand, if the attacker can influence both the bounds and the copy length, an overflow may still be possible; in practice, however, the two concerns are often sufficiently separated.) For malloc() in particular, it is enormously beneficial to be able to impose bounds on returned pointers: it becomes impossible for allocator clients to use a pointer from malloc() to access adjacent allocations!
> I wonder to what extent moving bounds checks into hardware provides the potential for efficient memory safety.
It's great! The CHERI team at U. Cambridge has recently released their initial performance characterization of Morello, Arm's experimental ARMv8 w/ CHERI: https://ctsrd-cheri.github.io/morello-early-performance-resu... . The major take-away there is a little buried, but is:
> The above 1.8% to 3.0% is our current best estimate of the geometric mean overhead that would be incurred for a future optimized design
That seems to be well within people's tolerance for security features, especially as we think having CHERI would also allow us to turn off, and so stop paying for, some existing mitigations.