I've written some memory allocators that use lock-free allocation maps. In this case, a 4 KiB granularity allocator that used compressed free bitmaps, for use in both userspace and kernelspace.
The entire thing was lock-free, so allocations would never stall on a lock, even when the allocator had to decompress a bitmap. (Compression worked by finding free memory ranges and replacing them with a range expression, which saved a few MB of metadata per gigabyte of managed memory; largely-free memory blocks were also ordered to the start of the bitmap list to reduce seek time.) Despite the complexity in the allocator, it could serve any number of threads without ever locking up. The worst case was a long latency when a thread had to scan the entire free list for a free block (it switched to a different allocation behaviour once a thread had more than a dozen failed allocation attempts).
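Roughly, the compression idea can be sketched like this (my own illustration with made-up names, not the actual allocator's types): a region is stored either as a raw bitmap, or as a range expression when the whole span is free.

```rust
// Hypothetical sketch of a "compressed bitmap" region: either one bit per
// 4 KiB page, or a range expression saying the whole run of pages is free.
enum Region {
    /// One bit per page; 1 = free. 8 × u64 = 512 pages (2 MiB) per map.
    Bitmap(Box<[u64; 8]>),
    /// The entire run of `pages` pages starting at `first_page` is free,
    /// so no per-page bits need to be stored at all.
    FreeRange { first_page: usize, pages: usize },
}

impl Region {
    fn free_pages(&self) -> usize {
        match self {
            Region::Bitmap(words) => words.iter().map(|w| w.count_ones() as usize).sum(),
            Region::FreeRange { pages, .. } => *pages,
        }
    }
}
```

The saving comes from the `FreeRange` arm: a mostly-free gigabyte collapses into a handful of range records instead of megabytes of bitmap words.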
A bunch of ring buffers I've worked with are atomic (i.e., lock-free); that's probably the most common application of lock-free techniques.
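For reference, the classic single-producer/single-consumer variant needs nothing beyond two atomic indices (a minimal sketch, not taken from any of the buffers mentioned above):

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};

// Minimal lock-free SPSC ring buffer: the producer owns `tail`, the
// consumer owns `head`; each only ever stores to its own index.
struct SpscRing<T, const N: usize> {
    buf: [UnsafeCell<Option<T>>; N],
    head: AtomicUsize, // next slot to pop
    tail: AtomicUsize, // next slot to push
}

// Safe only under the SPSC discipline: one pusher, one popper.
unsafe impl<T: Send, const N: usize> Sync for SpscRing<T, N> {}

impl<T, const N: usize> SpscRing<T, N> {
    fn new() -> Self {
        Self {
            buf: std::array::from_fn(|_| UnsafeCell::new(None)),
            head: AtomicUsize::new(0),
            tail: AtomicUsize::new(0),
        }
    }

    /// Producer side: returns false if the ring is full.
    fn push(&self, v: T) -> bool {
        let tail = self.tail.load(Ordering::Relaxed);
        let head = self.head.load(Ordering::Acquire);
        if tail.wrapping_sub(head) == N {
            return false; // full
        }
        unsafe { *self.buf[tail % N].get() = Some(v) };
        self.tail.store(tail.wrapping_add(1), Ordering::Release);
        true
    }

    /// Consumer side: returns None if the ring is empty.
    fn pop(&self) -> Option<T> {
        let head = self.head.load(Ordering::Relaxed);
        let tail = self.tail.load(Ordering::Acquire);
        if head == tail {
            return None; // empty
        }
        let v = unsafe { (*self.buf[head % N].get()).take() };
        self.head.store(head.wrapping_add(1), Ordering::Release);
        v
    }
}
```

The release/acquire pairing on the indices is what makes the slot contents visible to the other side without any lock.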
Interesting! Is this work public somewhere perhaps?
I'm currently working on a similar technique (though with smaller blocks, currently 64–1024 bytes, i.e. five consecutive power-of-two sizes) for a lock-free allocator intended for real-time embedded systems. I use a 32-bit word for each 1024-byte "page" and keep a tree structure inside the word (it takes 31 bits to cover five power-of-two levels: 1 + 2 + 4 + 8 + 16).
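Roughly, the in-word tree can be sketched like this (a simplified illustration, not my real code: it checks the node and its ancestors but skips the descendant check a full buddy scheme needs):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Bit index within the 31-bit in-word tree for node `idx` at `level`
// (level 0 = one 1024 B node, ..., level 4 = sixteen 64 B nodes).
fn node_bit(level: u32, idx: u32) -> u32 {
    (1 << level) - 1 + idx
}

// Try to claim the block at (level, idx) with a single CAS on the word.
// Simplification: only the node and its ancestors are checked for
// conflicts; a real buddy allocator must also check descendants.
fn try_alloc(word: &AtomicU32, level: u32, idx: u32) -> bool {
    loop {
        let cur = word.load(Ordering::Relaxed);
        // Walk up the tree: a set bit on the node or any ancestor
        // means that span is already claimed.
        let (mut l, mut i) = (level, idx);
        loop {
            if cur & (1 << node_bit(l, i)) != 0 {
                return false;
            }
            if l == 0 {
                break;
            }
            l -= 1;
            i /= 2; // parent index
        }
        let new = cur | (1 << node_bit(level, idx));
        if word
            .compare_exchange(cur, new, Ordering::AcqRel, Ordering::Relaxed)
            .is_ok()
        {
            return true;
        }
        // CAS lost a race: the word changed under us, re-read and retry.
    }
}
```

The nice property of packing the whole tree into one word is that any allocation or free is a single CAS, so there is never a partially-updated tree for other threads to observe.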
My code (which only allocates and never frees at the moment) is here:
Only a much older version, which sadly features neither the compressed bitmaps nor the reordering of the bitmaps, and has one or two rather critical bugs. It's part of my toy kernel project. It's largely written in Rust with a few unsafe parts, since it's intended to bootstrap itself given only a pointer to and size of a heap memory block, using only 8 kilobytes of stack, and to support adding or removing memory from the list.
There is only one section that behaves like a lock: the free-range decompression. It has to be somewhat exclusive, since it must be able to decompress a range without allocating memory; instead it uses the decompressed block itself to allocate the memory needed to accommodate the overhang of the range (i.e., when a range is larger than a single block). This simply means that, for the duration, available memory shrinks by the amount the range represents.
IIRC it's about 128 bytes of metadata; the rest is the bitmap bits that track page allocation. Each block can in theory handle about 124 MiB, though I do have some logic so that sections can be split into multiple blocks to reduce contention, as well as to handle large pages (which largely reside in a single block, since even 16 MB pages let one block manage terabytes of memory).
Allocation is simply a CAS on the byte containing the free bit for the specific page (address = base address + 4 KiB × bit index), followed by an atomic update of the free-page counter. Deallocation is the reverse. A failed CAS means the thread continues with the next free bit, starting at least at the next byte (i.e., bit ((bit index >> 3) + 1) << 3). The happy allocation path is fewer than a hundred instructions; the unhappy path is about a thousand. CAS latency can be quite unpredictable though, so it's probably not suitable for realtime systems.
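The happy path can be sketched roughly like this (illustrative names and layout, not the actual kernel code; here a set bit means "free", and a lost CAS skips to the next byte as described):

```rust
use std::sync::atomic::{AtomicU8, AtomicUsize, Ordering};

// Sketch of the allocation fast path: scan the free bitmap byte by byte,
// CAS the lowest set (= free) bit to 0, then update the page counter.
fn alloc_page(bitmap: &[AtomicU8], used: &AtomicUsize, base: usize) -> Option<usize> {
    for (byte_idx, byte) in bitmap.iter().enumerate() {
        let cur = byte.load(Ordering::Relaxed);
        if cur == 0 {
            continue; // no free pages in this byte
        }
        let bit = cur.trailing_zeros(); // lowest free bit
        let new = cur & !(1u8 << bit);
        if byte
            .compare_exchange(cur, new, Ordering::AcqRel, Ordering::Relaxed)
            .is_ok()
        {
            used.fetch_add(1, Ordering::Relaxed);
            // address = base + 4 KiB * bit index
            return Some(base + 4096 * (byte_idx * 8 + bit as usize));
        }
        // CAS lost a race on this byte: move on to the next byte rather
        // than retrying, which spreads contending threads apart.
    }
    None // bitmap exhausted (the unhappy path would widen the search here)
}
```

Deallocation is the mirror image: a CAS that sets the bit back to 1 and the counter update in the opposite direction.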