ARM and Lock-Free Programming (randomascii.wordpress.com)
229 points by markdog12 on Dec 1, 2020 | 107 comments



I appreciate learning more about what, exactly, ARM's "weaker memory model" consists of. It's clearer to me after reading this article.

I wonder how much the gain in performance of e.g. Apple's M1 chip, compared to an x86 CPU, can be attributed to this weaker constraint. Given that the M1 can outperform an x86 CPU even when emulating x86 code, perhaps it's not much.

Also, I suspect programming languages that are immutable by default will gain a larger advantage from ARM's weaker memory model, as the compiler can more often safely let the CPU reorder operations (since it doesn't have to wait for a mutable variable to be updated before executing a subsequent line of code that depends on it).


The reason it doesn't help them much in performance is that the difference in the way x86 and ARM manage memory is not that ARM reorders more and x86 reorders less, but that ARM reorders openly while x86 reorders things behind your back and makes sure everything is where you think it should be if/when you look.

This is the reason the relatively primitive memory model of x86 has turned out to be so successful, despite the consensus in the field back in the 90s being that it was way too restrictive for performance. It turns out that the x86 memory model is an easy target for a complex reordering backend to provide as a "facade" between what is actually happening in the CPU's own L1 and buffers and what everything else thinks it's doing.

The only benefit of weaker memory orderings is that they don't need buffers and queues as large on the backend, as they can retire memory operations earlier. Except that if you just implement the ARM memory model as specified and without a similar facade, it will actually lose to a modern x86, because the x86 chip can make use of the freedom provided by its machinery to reorder even more without any of it ever being visible to the software. So you end up implementing a similar system to the x86's, with only very minor gains from the fact that you can sometimes properly retire accesses earlier and thus get a little more use out of your buffers.

ARM, especially 64-bit ARM, has genuine advantages over x86. (Notably, decode.) The weak memory model is not one of them, outside the very weakest and tiniest cores (where it allows some reordering with a tiny load-store backend).


Yes. In particular, for a long time x86 had much cheaper memory barriers than the competition. Not only were store-release and load-acquire free, but sequentially consistent barriers were also fairly cheap. Intel had to add optimizations to make them fast much earlier, while weaker RISCs could get away with expensive barriers for much longer.

Now apparently the M1 has very cheap atomics and barriers, so it is doing as much tracking and speculation as an x86 CPU. But probably they can get away with tracking fewer memory operations, which might improve both performance and power usage.


And on Arm cores, you aren't obligated to provide a weak memory model. Some Arm vendors provide cores with stronger memory models, mainly TSO in Arm server land and sequential consistency for NVIDIA in-house CPUs.


Retiring memory operations earlier must have some benefit though? If data is being passed between threads, earlier retirement of a store means the data is available earlier to other CPUs. For a mutex, it means the mutex is released sooner, no?


Given all of this, what approach does RISC-V take?

RISC-V is presumably very 'greenfield' regarding its memory model; I imagine they made their decision based on all this knowledge.


A weak memory model. See https://riscv.org/wp-content/uploads/2018/05/14.25-15.00-RIS... (by Dan Lustig at Nvidia; NV is heavily involved in both Arm and RISC-V).


It has a weak memory model by default, but you can opt in to TSO (like x86) as an optional extension.


And for Arm too, you can ship stronger memory models up to SC.

There's also the RCpc extension, which adds memory accesses with yet another memory model.


Is RCpc still multi-copy atomic? (Which AFAIK is a guaranteed behavior of the Aarch64 memory model)


No difference on that front, yep.


Thank you for your insightful comment. Do you mind going a bit more into the decode advantage?


32-bit fixed-size instructions that are aligned. That allows you to have 8-wide decoders, for example.

On x86 you can't really do this (or at least it hasn't happened), because instructions are 1 to 15 bytes long, and you need to at least partially decode the prior instruction to determine where the current one starts. (Currently, 4-wide decoders are what's available in x86 uarches.)


Fixed width instructions are much easier to decode at the same time as their neighbours.

The density is a tradeoff you have to think about, but x86 is not a clean ISA.

For a more alien example, the Mill CPU has very long, variable-length instructions for density reasons, but because it's specifically designed around that (and around not needing any implicit parallelism), it can use tricks to find instruction boundaries much more easily.


The biggest thing for quickly finding instruction boundaries is probably not having prefixes, which is pretty straightforward.


Not having suffixes helps too. AFAIK, on x86 to find the length of an instruction, first you have to decode the instruction itself (the decoding can vary depending on the prefixes) to know if it is followed by a modRM byte, then decode the modRM byte to know whether it is followed by an immediate or not and how big that immediate is, and only then can you know where the next instruction is. The modRM byte (and the SIB byte) can be thought of as a "suffix" of sorts to the instruction.

Contrast for instance with RISC-V, where the first byte of every instruction has everything necessary to know the instruction length (and you only need 2 bits of that byte unless your core supports instructions longer than 4 bytes).
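
For illustration, here is a minimal sketch of that length check (mine, not from the RISC-V spec text itself), assuming a core that only supports 16- and 32-bit instructions:

    #include <cstdint>

    // RISC-V: if the low two bits of the first halfword are 0b11, the
    // instruction is 32 bits (or longer, on cores with longer encodings);
    // anything else marks a 16-bit compressed instruction. So two bits are
    // enough to find the next instruction boundary.
    int riscv_insn_length(uint8_t first_byte) {
        return (first_byte & 0x3) == 0x3 ? 4 : 2;
    }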


If you want to learn even more, take a look at Documentation/memory-barriers.txt in the Linux kernel source tree:

https://www.kernel.org/doc/html/latest/staging/index.html#me...


The M1 is outperforming an x86 CPU when emulating x86 code by translating that code into ARMv8 instructions. So if the x86 code was designed using x86's strong memory model and is then dynamically translated into ARMv8 and run with a weak memory model, there will be problems, no? Maybe Rosetta 2 handles this; that would be impressive.


No, Apple doesn't run the translated code with a weak memory model; Rosetta 2 toggles on total store ordering (which the M1 supports) when running its code.

Microsoft's translator (and QEMU?) does insert extra barriers to make it run on a weak memory model, because they are designed to run on HW without TSO.


Yes, Apple added a CPU mode bit. From the article:

  Note: writing an x86 emulator on ARM forces you to deal with this in a particularly messy way because you never know when reordering will be a problem so you typically have to insert a lot of memory barriers. Or you can follow Apple’s strategy and add a CPU mode that enforces x86/x64 memory ordering, turning that on when emulating.
I'm a little surprised+happy the ARM architectural license allowed this.


Why wouldn't it? A stronger memory model doesn't break anything in the ISA. You can always go stronger, it's not a problem even for native applications, it would just make them slightly slower.

What is surprising with Apple is that they were allowed to add extra instructions for machine learning related math (yeah, CPU instructions in addition to a dedicated separate "neural engine" on the chip: https://news.ycombinator.com/item?id=24471699)


After doing a little web research, I'll conclude that it's impossible to know from the outside what the restrictions are on ARM architectural licensees, other than that they have to pass the ARM compatibility test suite. New instructions (probably not), new modes (looks like), dunno (definitely).

However, Apple is one of the founding ARM partners and it is rumored they have a special license (not all licenses are created equal). Clearly they are doing this mode thing. Whether Samsung could get away with the exact same thing is a good question. I'm going to guess that NVidia can.


How does the toggle work? Is it per thread, CPU affinity, per process, CPU flag or something totally different?


It seems to be a CPU flag. This is managed by the kernel so that it is on for Rosetta 2 processes and off for everything else. See https://github.com/saagarjha/TSOEnabler


The M1 chip can do uncontended atomics almost "for free", and offers a mode where it firms up its memory ordering without much performance loss, so I am not actually sure how much this gets them.


Well, it gets them enough that they bothered implementing all of the circuitry for both memory models and an extra bit of thread state to switch between them. I'm sure they did some simulations before going down that road and found that it was worth it vs. just going with a more strict x86-like memory model all the time.


> I'm sure they did some simulations before going down that road and found that it was worth it vs. just going with a more strict x86-like memory model all the time.

Or they didn't want developers to rely on it so they can remove it in the future when they deprecate x64 (they deprecated x86 so it is only a matter of time).


> Or they didn't want developers to rely on it

If simulations show that it's not currently worth the effort, but they want to keep the option open, that implies that they believe some technological change in the future may make the relaxed memory model bear fruit.

Actually, now that I think about it, another option is that the weaker memory model is more advantageous in lower-power implementations, like the Apple Watch, and preventing developers from relying on the stronger memory model available in the M1 reduces the number of bugs in Apple Watch applications that share code with iPhone apps or macOS apps.


If you want to dig deeper, watch these videos: https://herbsutter.com/2013/02/11/atomic-weapons-the-c-memor...


Even after watching these videos and reading lots of articles on the topic, I still find the full C++ memory model extremely hard to understand. However, on x86 there are actually only a couple of things that one needs to understand to write correct lock free code. This is laid out in a blog post: https://databasearchitects.blogspot.com/2020/10/c-concurrenc...


C++'s "seq_cst" model is simple. If you're having any issues understanding anything at all, just stick with seq_cst.

If you want slightly better performance on some processors, you need to dip down into acquire-release. This 2nd memory model is faster because of the concept of half-barriers.

Let's say you have:

    a();
    b();
    acquire_barrier(); // Half barrier
    c();
    d();
    e();
    release_barrier(); // Half barrier
    f(); 
    g();
The compiler, CPU, and cache are ALLOWED to rearrange the code into the following:

    acquire_barrier(); // Optimizer moved a() and b() from outside the barrier to inside the barrier
    a();
    b();
    d();
    c();
    e();
    g();
    f();
    release_barrier(); // Optimizer moved g() and f() from outside the barrier to inside the barrier
You're allowed to move operations "inside" towards the barrier, but you are not allowed to move code "outside" of the half-barrier region. Because more optimizations are available (for the compiler, the CPU, or the caches), half-barriers execute slightly faster than full sequential consistency.

----------

Now that we've talked about things in the abstract, let's think about "actual" code. Let's say we have:

    int i = 0; // a();
    i++; // b();

    full_barrier(); // seq_cst barrier

    i+=2; // c();
    i+=3; // d();
    i+=4; // e();

    full_barrier(); // seq_cst barrier

    i+=5; // f();
    i+=6; // g();
As the optimizer, you're only allowed to optimize to...

    int i = 1; // a() and b() rearranged to the same line
    full_barrier(); // Not allowed to optimize past this line
    i+= 9; // c, d, and e rearranged
    full_barrier();
    i+= 11; // f, g rearranged.
Now let's do the same with half barriers:

    int i = 0; // a();
    i++; // b();

    acquire_barrier(); // acquire

    i+=2; // c();
    i+=3; // d();
    i+=4; // e();

    release_barrier(); // release

    i+=5; // f();
    i+=6; // g();
Because all code can be rearranged to the "inside" of the barrier, you can simply write:

   i = 21;
Therefore, half-barriers are faster.

----------

Now instead of the compiler rearranging code: imagine the L1 cache is rearranging writes to memory. With full barriers, the L1 cache has to write:

    i = 1;
    full_barrier(); // Ensure all other cores see that i is now = 1;

    i = 10; // L1 cache allows CPU to do +2, +3, and +4 operations, but L1 "merges them together" and other cores do NOT see the +2, +3, or +4 operations

    full_barrier(); // L1 cache communicates to other cores that i = 10 now;

    i = 21; // L1 cache allows CPU to do +5 and +6 operations

   // Without a barrier, L1 cache doesn't need to tell anyone that i is 21 now. No communication is guaranteed.
----------

Similarly, with half-barriers instead, the L1 cache's communication to other cores only has to be:

    i = 21; // L1 cache can "lazily" inform other cores, allowing the CPU to perform i+=1, i+=2... i+=6.
So for CPUs that implement half-barriers (like ARM), the L1 cache can communicate ever so slightly more efficiently, if the programmer specifies these barriers.

----------

Finally, you have "weak ordered atomics", which have no barriers involved at all. While the atomics are guaranteed to execute atomically, their order is completely unspecified.
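
A minimal sketch (mine, not part of the discussion above) of where that is enough: a simple event counter, where you only care that increments are atomic, not when other threads observe them:

    #include <atomic>
    #include <cstdint>

    std::atomic<uint64_t> events{0};

    void record_event() {
        // Atomic, but imposes no ordering constraints on surrounding code.
        events.fetch_add(1, std::memory_order_relaxed);
    }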

There are also consume/release barriers, which no one understands and no compiler implements. So ignore those. :-) They're trying to make consume/release easier to understand in a future standard... and I don't think they've gotten all the "bugs" out of the consume/release specification yet.

-------

EDIT: Now that I think of it, acquire/release barriers are often baked into a load/store operation and are "relative" to a variable. So the above discussion is still inaccurate. Nonetheless, I think it's a simplified discussion that kinda explains why these barriers exist and why programmers were driven to make a "more efficient barrier" mechanism.


To the edit: right. I like the description using half barriers, but I have trouble reconciling that with the Linux Kernel's READ_ONCE/WRITE_ONCE macros, which guarantee no tearing/alignment issues, but boil down to reads/writes through casts to volatile qualified pointer dereferences. I guess those don't have the same notion of memory ordering that the C++11 API has... Maybe rmb()/wmb()...
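
Roughly speaking (a simplified sketch of mine; the real kernel macros are more elaborate), they amount to forcing a single access through a volatile-qualified pointer:

    // Simplified idea only: the volatile qualifier stops the compiler from
    // eliding, duplicating, or (for machine-word sizes) tearing the access,
    // but it adds no CPU memory barrier.
    template <typename T>
    T read_once(const T& x) { return *static_cast<const volatile T*>(&x); }

    template <typename T>
    void write_once(T& x, T v) { *static_cast<volatile T*>(&x) = v; }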


Yeah, the Linux kernel memory model is different from the C/C++11 one.


Well... I didn't describe the C++11 memory model above. It was a gross simplification, because I didn't account for how the C++11 memory model acts "relative to a variable". (And this "variable" is typically the mutex itself.)

I don't know much about the Linux Kernel, but I gave the document a brief read-over (https://www.kernel.org/doc/Documentation/memory-barriers.txt).

My understanding is that WRITE_ONCE / READ_ONCE are meant for this "relative to a variable" issue. It's _precisely_ the issue I ignored in my post above.

All C++11 atomics are "relative to a variable". There are typically no memory-barriers floating around by themselves (there can be, but, you probably don't need the free-floating memory barriers to get the job done).

So you wouldn't write "acquire_barrier()" in C++11. You'd write "atomic_var.store(value, memory_order_release)", saying that the half-barrier is relative to atomic_var itself.

----------

    a();
    b();
    while(val = atomic_swap(spinlock, 1, acquire_consistency), val!= 0) hyperthread_yield(); // half-barrier, write 1 into the spinlock while atomically reading its previous value
    c();
    d();
    e();
    atomic_store(spinlock, 0, release_consistency); // Half barrier, 0 means we're done with the lock
    f(); 
    g();
So the C++ acquire/release model is always relative to a variable, commonly the spinlock.

This means that "c, d, and e" are protected by the spinlock (or whatever synchronization variable you're working with). Moving a or b "inside the lock" is fine, because that's the "unlocked region", and the higher-level programmer is fine with "any order" outside of the locked region.

Note: this means that c(), d(), and e() are free to be rearranged as necessary. For example:

    while(val = atomic_swap(spinlock, 1, acquire_consistency), val!= 0) hyperthread_yield(); // half-barrier, write 1 into the spinlock while atomically reading its previous value
    for(int i=0; i<100; i++){
      value+=i;
    }
    atomic_store(spinlock, 0, release_consistency); // Half barrier, 0 means we're done with the lock
The optimizer is allowed to reorder the values inside into:

    while(val = atomic_swap(spinlock, 1, acquire_consistency), val!= 0) hyperthread_yield(); // half-barrier, write 1 into the spinlock while atomically reading its previous value

    for(int i=99; i>=0; i--){ // decrement-and-test form is faster on many processors
      value+=i;
    }

    atomic_store(spinlock, 0, release_consistency); // Half barrier, 0 means we're done with the lock
It's the ordering "relative" to the spinlock that needs to be kept, not the order of any of the other loads or stores that happen. As long as all value+=i stores are done "before" the atomic_store(spinlock) command, and "after" the atomic_swap(spinlock) command, all reorderings are valid.

So reordering from "value+=0, value+=1, ... value+=99" into "value+=99, value+=98... value+=0" is an allowable optimization.
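
In actual C++11 syntax, that spinlock sketch looks something like this (a minimal illustration of mine, not production code; a real implementation would add proper backoff):

    #include <atomic>

    std::atomic<int> spinlock{0};
    int value = 0; // data protected by the spinlock

    void add_hundred() {
        // Acquire: reads/writes after this cannot be moved above it.
        while (spinlock.exchange(1, std::memory_order_acquire) != 0) {
            // spin (a real implementation would pause/yield here)
        }

        for (int i = 0; i < 100; i++)
            value += i;

        // Release: reads/writes before this cannot be moved below it.
        spinlock.store(0, std::memory_order_release);
    }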

----------

It seems like WRITE_ONCE / READ_ONCE were written for the DEC Alpha, which is far weaker (fewer guarantees about order) than even ARM. The DEC Alpha was one of the first popular multiprocessor systems, but its memory model allowed a huge number of reorderings.

WRITE_ONCE / READ_ONCE probably compile down to plain loads and stores (with no extra barriers) on ARM or x86. I'm not 100% sure, but that'd be my guess. I think the last 20 years of CPU design have overall said that the DEC Alpha's reorderings were just too confusing to handle in the general case, so CPU designers / low-level programmers just avoid that situation entirely.

"dependent memory accesses" is very similar to the confusing language of memory_order_consume. Which is again: a model almost no one understands, and almost no C++ compiler implements. :-) So we can probably ignore that.


> I don't know much about the Linux Kernel, but I gave the document a brief read-over (https://www.kernel.org/doc/Documentation/memory-barriers.txt).

See also

http://www0.cs.ucl.ac.uk/staff/j.alglave/papers/asplos18.pdf

Wrt Alpha, I think a large part of the weirdness was that some early variants had split caches which weren't coherent with each other, or something like that. So if a pointer and the value it pointed to were in different cache banks, you could get funny effects.


I’m reluctant to post this comment but does it feel weird to anyone else when Herb Sutter refers to threads and CPUs as “he”?


The most annoying aspect of lock free programming is the amount of 'roll your own' that goes on, and having to inspect this stuff to ensure it's got a chance of working on the platforms you care about.

I totally get that this technique is inappropriate for many situations where a lock works well enough, but I work in a field (audio) where accidental use of locks and system calls is in general a larger problem than incorrect lock free code.


Hm, in C++ there are a fair number of battle-tested, permissively-licensed lock-free containers though.

I use https://github.com/cameron314/readerwriterqueue and https://github.com/cameron314/concurrentqueue ; they deliver.


Yes, in addition to the atomics and memory ordering primitives added in the C++11 standard [1], I find the Boost lock-free structures quite reasonable [2]. (Yes, I'm aware some have an automatic aversion to Boost. But it is well-designed and I believe is a header-only library.)

[1] https://en.cppreference.com/w/cpp/atomic

[2] https://www.boost.org/doc/libs/1_74_0/doc/html/lockfree.html


This is common knowledge in embedded circles, where weak memory models are the norm.

This article doesn't address a different kind of memory weakness in some ARM implementations (and I presume other architectures) that you can run into when sharing memory between processes.

The data cache is virtually indexed, virtually tagged, which means that the cache is keyed with the domain (essentially the process number) + the virtual address. This means that the MMU doesn't have to be consulted for cache lookups; however, if two different processes map the same physical memory, the writing process must execute a flush instruction, and the reading process must execute an invalidate instruction.


Ohh, Thanks! I had never heard about ARM's cache being tagged like that. Do you have any recommended reading?


> ... std::atomic<bool>. This tells the compiler not to elide reads and writes of this variable, ...

If I'm not mistaken, this is not true. The compiler is still allowed to elide reads and writes on atomic variables. (For example merge two consecutive writes, or remove unused reads)


Correct. See "N4455 No Sane Compiler Would Optimize Atomics" [1]. The first sentence of that paper is (spoiler alert) a laconic "false".

[1] http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n445...
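
A hypothetical illustration (mine) of the kind of transformation that paper discusses: the compiler is allowed to collapse these two stores into a single x.store(2), so other threads may never observe the value 1.

    #include <atomic>

    std::atomic<int> x{0};

    void f() {
        x.store(1, std::memory_order_relaxed); // may be merged away...
        x.store(2, std::memory_order_relaxed); // ...into just this store
    }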


Updated. Thanks for the correction.


Aside - calling it now - but the M1's x86 memory model mode is a temporary thing that'll go away in a few years along with Rosetta 2 once Apple is happy with the level of universal binary support.


I understand the M1 has a special mode that implements the x86 memory model, which aids in emulation. Apparently Qualcomm doesn't have that, so x86 on Arm Windows is slow, being littered with memory barriers.

Couldn't one just use user threads in an emulator and ignore all these memory issues? Sure, that would mean every emulated process is running on one core, but as long as blocking I/O is implemented properly, how bad can that be? (It would be like having a single core for the emulated process.)


With sufficient memory barriers, the x86 memory model can be emulated entirely in software without having to sacrifice multithreading. I expect a year from now we will be able to run x86 Windows on our AS Macs at reasonable-ish speeds this way.


It would be pretty bad. Right now even relatively recent 3D games run fine under Rosetta. I don’t think that would be the case with a single thread limit.


> in a few years

Pure speculation: I suspect that Apple will support x86 longer than it supported PPC.

Why? The macOS platform is "older" now and more popular than when PPC was around. There are more old applications lying around, unmaintained. Also, don't forget that "Windows compatibility" was a huge Mac selling point when they switched to x86.

So, I suspect a minimum of 8 years support, but possibly indefinitely if Windows can't make the jump to ARM. Again, pure speculation, but what will be the deciding factor is if x86 virtualization on ARM Mac ever becomes a thing. (And I hope it does!)


> I suspect that Apple will support x86 longer than it supported PPC.

Totally agree - I was surprised how quickly they dropped PPC support. The install base is so much larger now, and hardware refresh cycles so much longer that it'll be necessary. Hell, they've said this transition will take 2 years - the Intel transition took them a year to go from no Intel macs to all Intel macs.

> So, I suspect a minimum of 8 years support,

Not gonna happen. The Apple of today does not care about old unmaintained applications and definitely does not want to make computers for people to run Windows on.


> The Apple of today does not care about old unmaintained applications

The Apple of today is not the Apple of tomorrow. Considering how awesome the M1 Macs sound; if Mac reaches a high marketshare, it might be required.


Very educational.

I take exception to the common assertion that 'code is broken' if it doesn't work everywhere for every architecture. Code deployed for a particular purpose on a particular machine with a particular toolchain can work great and be 'correct' for that environment. I work in such environments all the time. And taking the time/performance hit seeking ill-conceived perfection wastes the client's time and money.

But for the vast majority of cloud-deployed or open-source projects, it's certainly wise to err on the side of 'perfect'.


When I said "that code was probably broken already" I meant that while the code might currently be working it could suddenly stop working if you upgrade your compiler. Yes, you are free to take that risk, but relying on things not guaranteed by the language is dangerous.


What I'm afraid of is that with the introduction of ARM desktops, there will be subtle bugs which did not manifest on x86. Those bugs are hard to find in old legacy multi-million-LoC codebases with lots of spaghetti code nobody bothered to refactor, so they will haunt users for years. That's why I'm reluctant to move to this architecture for at least a few years, even if it's superior in theory.

Thankfully, ARM phones have been a thing for a long time, so many libraries should work well.


Piggybacking this question for people interested in the topic:

Which everyday high-level data structures that you use are built on lock-free implementations?

Clojure's maps/vectors (which are persistent data structures) come to mind for me, for one.


SPSC (single producer single consumer) queues and ring buffers are used quite extensively in low-latency software engineering.


Yup! In Rust specifically, I've written bbqueue, which is a bipbuffer-inspired lock-free ring buffer, intended for embedded systems and DMA usage.

https://github.com/jamesmunns/bbqueue

It uses atomics (and Rust's borrow checker) heavily to safely guarantee lock free behavior, and works pretty darn quick.


You reference a blog post on your Github (regarding the algo description). I believe the author of that blog (someone at Ferrous Systems) has a pretty significant misunderstanding of DMA and serial ports, which raises a whole host of other questions on down the chain <additional unnecessary comment removed>.

edit: I know it's not cool to complain about downvotes, but this is a legitimate concern to raise for this poster. I am not attempting to denigrate anyone here.


I assume you mean this blog post:

https://ferrous-systems.com/blog/lock-free-ring-buffer/

Could you explain what you think the misunderstanding is, and what the consequences of that misunderstanding are?


edit3: resolved below.


The comparison that I was aiming to make was comparing a "busy wait blocking" implementation, e.g. sending one byte at a time, until the byte has been sent, which is typical for bare metal embedded systems (that don't use some kind of DMA or concurrency like threads or tasks).

In this case, your CPU would be left spinning for an equivalent amount of time to the line rate of the serial port, if you were busy waiting for data.


As a note, you can click on the time stamp of the child post, and respond directly there.

> since I can't reply to the child I'll just say this: I don't like the math you used, I feel like it's making a point that shouldn't be made in a place where it doesn't need to be made. Of course it's going to be dog slow if you're busy waiting. Wtf? I guess I will just leave it at that.

The entire post is about reducing the CPU time necessary to receive data from an external interface, like a serial port. The following paragraph states:

> Instead of making our CPU waste all of this time waiting around, we allow the hardware to manage the simple sending and receiving process, allowing our CPU to either process other important tasks, or go into sleep mode, saving power or battery life.

The intent for this example was to illustrate the difference between busy waiting and DMA. It is not uncommon on microcontrollers to only have a FIFO depth of one byte, which means if you aren't using interrupts, DMA, or threading, you will lose that data.


Thanks for the heads up on the reply.

And this is why I shouldn't critique year old blog posts before 6am, lol.

I still don't care for the way that section was phrased, but I have a rather pedantic background in telephony and serial ports.

My apologies.


Hello, I am also the author of that post and implementor of bbqueue. I agree, I'd be interested in what you think I have misunderstood.

Feel free to reply here (preferred), or send me an email at james.munns [at] ferrous-systems.com.


I've written some memory allocators that use lock-free allocation maps. In this case, a 4 KiB-granularity allocator that used compressed free-bitmaps, for use in both userspace and kernelspace.

The entire thing was lock-free, so allocations would never stall on a lock, even if the allocator had to decompress a bitmap (the way compression worked was by looking for free memory ranges and then replacing them by a range expression, which saved a few MB of memory per Gigabyte of memory available, as well as ordering largely free memory blocks to the start of the bitmap list to reduce seek time). Despite the complexity in the allocator, it could serve any number of threads without ever locking up. The worst case was a long latency if a thread had to scan the entire free list for a free block (It would switch to a different allocation behaviour if a thread had more than a dozen failed allocation attempts).

A bunch of ring buffers I've worked with are atomic (ie, lock free), probably the most common application of lock-free stuff.


Interesting! Is this work public somewhere perhaps?

I'm currently working on a similar technique (though with smaller blocks, currently ranging from 64 to 1024 bytes - basically any 5 powers of two) for a lock-free allocator intended for real-time embedded systems. I use a 32-bit word for each 1024-byte "page", and have a tree structure inside of the word (it takes 31 bits to cover 5 power-of-two ranges).

My code (which only allocs and never frees at the moment) is here:

https://github.com/jamesmunns/bitpool/


Only a much older version, which sadly does not feature compressed bitmaps or reordering of the bitmaps, plus it has one or two rather critical bugs. It's part of my toy project kernel. It's largely written in Rust with a few unsafe behaviours, since it's intended to be able to bootstrap given only a pointer to and size of a heap memory block, using only 8 kilobytes of stack, as well as being able to add memory to or remove memory from the entire list.

There is only one section that behaves like a lock, which handles the free-range decompression. It has to be somewhat exclusive in behaviour since it must be able to decompress the block without allocating memory; instead it simply uses the decompressed block to allocate the memory necessary to accommodate overhang memory in the range (i.e., if a range is larger than a block). It simply means that during this time, the available memory shrinks by the amount represented by the range.

IIRC it's about 128 bytes of metadata, the rest is bits in the bitmaps to handle page allocation. Each block can in theory handle about 124 MiB, though I do have some logic so that sections can be split into multiple blocks to reduce contention as well as handling large pages (which largely reside in a single block as even 16MB pages enable managing Terabytes of memory).

Allocation is simply a CAS on the byte containing the free bit for the specific page (Base Address + 4KiB * Bit Index), then an atomic increment on the free page counter. Deallocation is the reverse. Failed CAS attempts mean it continues to the next free bit, at least on the next byte (= (Bit Index >> 3 + 1) << 3). The happy allocation path is fewer than a hundred instructions; the unhappy path is about a thousand instructions. CAS latency can be quite unpredictable though, so it's probably not suitable for realtime systems.
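
The happy path is roughly this kind of loop (a minimal sketch of mine, simplified to a single 64-bit bitmap word where a set bit means the page is free):

    #include <atomic>
    #include <bit>
    #include <cstdint>

    // Try to claim one free page tracked by `word`; returns the claimed bit
    // index, or -1 if no page in this word is free.
    int try_alloc(std::atomic<uint64_t>& word) {
        uint64_t bits = word.load(std::memory_order_relaxed);
        while (bits != 0) {
            int idx = std::countr_zero(bits);                 // lowest free bit (C++20 <bit>)
            uint64_t desired = bits & ~(uint64_t{1} << idx);  // mark it used
            if (word.compare_exchange_weak(bits, desired,
                                           std::memory_order_acquire,
                                           std::memory_order_relaxed))
                return idx;                                   // we own this page
            // A failed CAS reloads `bits`, so just retry with the fresh value.
        }
        return -1;
    }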


You may find my work from grad school useful, which was a lock-free memory allocator. The paper [1] and source code [2] are easily available.

[1] Scalable Locality-Conscious Multithreaded Memory Allocation, https://www.scott-a-s.com/files/ismm06.pdf

[2] https://github.com/scotts/streamflow/


Microsoft's 'mimalloc' MIT-licensed allocator is lock-free. Perhaps there's something in there that'd be interesting to read.

  https://github.com/microsoft/mimalloc
  https://github.com/microsoft/mimalloc/blob/master/include/mimalloc-atomic.h
It has gone through testing with Clang's TSAN and (at least some with) the GenMC model checker.

  https://plv.mpi-sws.org/genmc/


Almost zero.

Queues or streams may have lockless properties. Or maybe not!

For the most part concurrent, mutable access to a shared data structure is undesirable and to be avoided.

If you find yourself wanting a lockless hashmap you’re going down the wrong path 99.9% of the time. There are always exceptions. But you better have a REALLY good reason why.


A good rule to follow in software engineering in general is "never write your own lock free data structures". I'm the one writing them so that my coworkers don't need to, though. I've used atomics to implement various flavors of queues in low level C for use on microcontrollers. A good single-producer single-consumer queue abstraction is helpful for communicating with interrupt service routines in peripheral drivers. Bespoke multiple-producer single or multiple-consumer queues are often useful for implementing logging abstractions that work from any context, including in interrupt handlers and critical sections.
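
For reference, the core of such an SPSC queue is quite small; here is a minimal sketch (mine, not any particular production implementation) of a power-of-two ring buffer using acquire/release on the head and tail indices:

    #include <atomic>
    #include <cstddef>

    template <typename T, std::size_t N> // N must be a power of two
    class SpscQueue {
        T buf_[N];
        std::atomic<std::size_t> head_{0}; // written only by the consumer
        std::atomic<std::size_t> tail_{0}; // written only by the producer

    public:
        bool push(const T& v) { // producer side (e.g. an ISR)
            std::size_t t = tail_.load(std::memory_order_relaxed);
            if (t - head_.load(std::memory_order_acquire) == N)
                return false;                              // full
            buf_[t % N] = v;
            tail_.store(t + 1, std::memory_order_release); // publish the slot
            return true;
        }

        bool pop(T& out) { // consumer side
            std::size_t h = head_.load(std::memory_order_relaxed);
            if (tail_.load(std::memory_order_acquire) == h)
                return false;                              // empty
            out = buf_[h % N];
            head_.store(h + 1, std::memory_order_release); // free the slot
            return true;
        }
    };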


This is more about the way of doing lockless data operations than about specific data structures - look up the RCU pattern; there are good C++ talks by Fedor Pikus on it.


I used the Michael & Scott lock-free queue and bounded SPMC & MPMC queues in my fibers library. Queues with locks were faster, though. I've also written some QSBR to reclaim sockets registered in epoll/kqueue without a need to wrap concurrent epoll/kqueue calls in a lock.



log4j2 AsyncLogger uses RingBuffer (Disruptor) which is lock-free https://github.com/LMAX-Exchange/disruptor/wiki/Blogs-And-Ar...


I thought barriers were required on x86 also. There are a few cases where ordering is not guaranteed. [Also the consumer should have a barrier between the test and the data read.]

https://bartoszmilewski.com/2008/11/05/who-ordered-memory-fe...

The thing that's nice about x86 is that DMA is cache-coherent - this is true for backwards compatibility with very old hardware.


At the assembly level, barriers are not required in many cases because x86 has total store ordering. There are a few cases where memory barriers are still needed... but not nearly as many as in ARM-world.

At the C / C++ level, barriers are required because almost all compilers rearrange code in the name of optimization (especially as you go from -O0 to -O1, to -O2, to -O3... more aggressive rearrangements happen).

The optimizer needs to know "not" to rearrange some code, so a compiler barrier is needed.

---------

It turns out that a compiler barrier (hey compiler: do not rearrange this read/write I'm specifying) is identical in concept to a CPU-memory barrier on a weak-ordered system (hey CPU: don't reorder this read/write I'm specifying).
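
In C++11 terms, a small sketch of the two flavors (the exact instructions emitted depend on the target):

    #include <atomic>

    void barrier_examples() {
        // Compiler-only barrier: emits no instruction, but the compiler may
        // not reorder memory accesses across it.
        std::atomic_signal_fence(std::memory_order_seq_cst);

        // Full barrier: constrains the compiler *and* emits a CPU fence where
        // the target needs one (e.g. MFENCE on x86, DMB on ARM).
        std::atomic_thread_fence(std::memory_order_seq_cst);
    }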


Lock-free programming is essential in high-performance networking (10G+ network connections) and data communications. Most of the implementations use ring buffers. There can be multiple-producer/multiple-consumer implementations as well.

But it's not a real issue for the new M1 minis as its network is down to 1Gbit (from the previous mini 10G).


> But it's not a real issue for the new M1 minis as its network is down to 1Gbit (from the previous mini 10G).

FWIW, that's just the built-in port. 10G adapters will do 900MB/s. https://www.owcdigital.com/products/thunderbolt-3-10g-ethern...


Do Rust and its borrow checker make this easier/implicit?

If compiled code will have to work on both RISC and CISC for the near future, does the burden of doing lock-free C++ by identifying all the memory conditions outweigh the learning curve of Rust?

Don't know Rust beyond dead simple projects but just curious.


Rust’s ownership and concurrency model prevent you from writing such broken code by accident. The early code examples in the article wouldn’t compile in Rust: you have to use atomic types explicitly, or pass values around explicitly (which will most commonly match the two paradigms of memory sharing and message passing). Rust makes doing things the way shown here quite difficult, which is a good thing, because the idiomatic functionally-equivalent code in Rust will be much easier to follow.


Writing to an ordinary integer in Rust only compiles if it is guaranteed that the integer is not shared with any other code, so the write to that boolean would not compile if it were an ordinary (non-atomic) variable.

Rust instead has various special types, such as AtomicBool, that work in the same way as the C++ atomic bool from the example.

That said, to actually write the release/acquire thing they did, you will need unsafe code because, to the compiler, the other variables are also shared, and would normally be prevented even if you used the atomic. That said, it would be possible to encapsulate this unsafe in a library without infecting other parts of the code.


Compiler checking of multi-threaded code is a good step forward (D has it too, albeit no one's entirely sure how to move forward with the feature), but to access the actual gadget (fence, etc.) you still need some level of either programmer guidance or library code to manage that access.


You can do everything you might do in Rust equally as easily in C++.

The compiler won't hold your hand, so much, but if you wrap the semantics in a library, and use the library, you get the same reliability from it.

(This is an unpopular observation in certain places.)


The power of Rust is not in what you can do, it's in what you can't (without the `unsafe` keyword).

C++ doesn't have the prevention. If you're not an expert on concurrency, you might just share mutable access to a regular bool variable between threads and use it in that way (as in the article) and nothing would tell you it's wrong. You wouldn't know you need a specific library (or just "to use std::atomic" in this case) until you encounter the weird runtime behavior, try to debug it, try to formulate a google query that describes the bugs/code well enough, and find the answers.

"Hand holding" (i.e. rejecting programs that don't conform to more advanced correctness rules) is great. It's what computers are good at. It's foolish not to use computers for this!


> C++ doesn't have the prevention

Obviously. But that was not the topic.

A discipline of single-writer is as easily composed in C++, and, encapsulated in a library, as easily maintained. Nothing about Rust makes this unavailable in C++. The library itself, then, becomes responsible for maintaining the discipline, not the compiler. But once encoded there, it is done.

Rust's compiler provides safety advantages, but there is no need to exaggerate them. (Exaggeration is a liability to your language community.)


> [...] if you wrap the semantics in a library, and use the library, you get the same reliability from it.

Wrong. Rust offers more tools for checking that your library is used correctly than C++ does. First and foremost, the borrow checker – there is nothing comparable in C++, and it's a lot harder/impossible to statically prevent code from modifying shared data after a mutex has been unlocked, for example.

> (This is an unpopular observation in certain places.)

It is unpopular because it is factually incorrect.


(Knee-jerk reactions do not improve the reputation of your language or of your language community.)

A library interface boundary can enforce behavioral restrictions on its clients equally as strong as a compiler's. Clients not handed pointers cannot abuse them. It does demand correctness of the library implementation similar to demands on the compiler. But such correctness in a library formalism is certainly as possible to establish and maintain as in a compiler.

Use of wild pointers in other parts of the client could overwrite library data, but that is a different risk than the data races that are the topic here.


This reminds me of some lock-free code I fixed recently (by introducing a very peculiar kind of lock):

https://github.com/Ardour/ardour/blob/master/libs/pbd/pbd/rc...

The goal of the code is to allow a producer thread to update a pointer to an object in memory, which may be used lazily by consumer threads. Consumers are allowed to use stale versions of the object, but must not see inconsistent state. Obsolete objects must eventually be freed, but only after all consumers are done using them. And here's the catch: consumers are real-time code, so they must never block on a lock, nor are they allowed to free nor allocate memory on their own.

The design is based around Boost's shared_ptr, so that had to stay. But the original code was subtly broken: the object is passed around by passing a pointer to a shared_ptr (so double indirection there) which gets cloned, but there was a race condition where the original shared_ptr might be freed by the producer while the consumer is in the process of cloning it.

My solution ended up being to introduce a "lopsided spinlock". Consumers aren't allowed to block, but producers are. So consumers can locklessly signal, via an atomic variable, that they are currently inside their critical section. Then the producer can atomically (locklessly) swap the pointer at any time, but must spin (is this a half-lock?) until no consumers are in the critical section before freeing the old object. This ensures that any users of the old pointer have completed their clone of the shared_ptr, and therefore freeing the old one will no longer cause a use-after-free. Finally, the producer holds on to a reference to the object until it can prove that no consumers remain, to guarantee that the memory deallocation always happens in producer context. shared_ptr then ensures that the underlying object lives on until it is no longer needed.

It's kind of weird to see a half-spinlock like this, where one side has a "critical section" with no actual locking, and the other side waits for a "lock" to be released but then doesn't actually lock anything itself. But if you follow the sequence of operations, it all works out and is correct (as far as I can tell).

For various reasons this code has to build with older compilers, hence can't use C++ atomics. That's why it's using glib ones.

(Note: I'm not 100% sure there are no hidden allocations breaking the realtime requirement on the reader side; I went into this codebase to fix the race condition, but I haven't attempted to prove that the realtime requirement was properly respected in the original code; you'd have to carefully look at what shared_ptr does behind the scenes)
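
A minimal sketch of the shape of that scheme (mine, for illustration; it is not the Ardour code, it uses default seq_cst operations for simplicity, assumes a single producer, omits the destructor, and only covers the use-after-free fix, not the guarantee that the final deallocation happens in producer context):

    #include <atomic>
    #include <memory>
    #include <thread>

    template <typename T>
    class LopsidedBox {
        std::atomic<std::shared_ptr<T>*> current_;
        std::atomic<int> readers_in_cs_{0};

    public:
        explicit LopsidedBox(std::shared_ptr<T> v)
            : current_(new std::shared_ptr<T>(std::move(v))) {}

        // Consumer (real-time): never blocks, never allocates or frees.
        std::shared_ptr<T> read() {
            readers_in_cs_.fetch_add(1);        // "I'm cloning right now"
            std::shared_ptr<T> copy = *current_.load();
            readers_in_cs_.fetch_sub(1);
            return copy;
        }

        // Producer: may spin. Swap in the new pointer, then wait until no
        // consumer can still be touching the old shared_ptr before deleting it.
        void write(std::shared_ptr<T> v) {
            auto* old = current_.exchange(new std::shared_ptr<T>(std::move(v)));
            while (readers_in_cs_.load() != 0)
                std::this_thread::yield();
            delete old;                         // safe: no clone in progress
        }
    };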


Assuming that shared_ptrs themselves are thread safe is a common error :(.

The standard (and boost) does provide additional operations to manipulate shared pointers themselves in a thread safe way, for example void atomic_store( std::shared_ptr<T>* p, std::shared_ptr<T> r ) but they are not guaranteed to be lock free (and they are not in practice) and have been deprecated in C++20.

It seems that your scenario would be best be handled via hazard pointers, but if you have an atomic variable per consumer I guess in practice it is the same thing (if not and it is just a shared counter, then consumers will be conflicting with each other).
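
For reference, a small illustration (mine) of that now-deprecated free-function API; C++20 replaces it with std::atomic<std::shared_ptr<T>>:

    #include <memory>

    std::shared_ptr<int> ptr = std::make_shared<int>(1);

    void producer() {
        // Atomically replaces the shared_ptr object itself (typically via an
        // internal lock, so not lock-free in practice).
        std::atomic_store(&ptr, std::make_shared<int>(2));
    }

    std::shared_ptr<int> consumer() {
        return std::atomic_load(&ptr); // atomically clones the current value
    }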


As long as this code runs fine on ARM we're going to be ok:

http://move.rupy.se/file/atomic.txt





This has no comments though.


The importance and value of lock-free programming is massively overrated.

I routinely improve performance and also reliability of systems by deleting lock-free queues. The secret to both is reduced coupling, which lock-free methods do nothing to help with. Atomic operations invoke hardware mechanisms that, at base, are hardly different from locks. While they won't deadlock by themselves, they are so hard to get right that failures equally as bad as deadlocking are hard to avoid.

So, replace threads with separate processes that communicate via ring buffers, and batch work to make interaction less frequent. With interaction infrequent enough, time spent synchronizing, when you actually do it, becomes negligible.


> So, replace threads with separate processes that communicate via ring buffers

... but those ringbuffers you're using presumably have a lockfree fast path with a futex or pthread mutex slow path to prevent the consumer from busy-waiting, so you're still taking advantage of lockfree data structures.


You may presume so, but you would be wrong.

They are technically (also) lock-free, so require atomic operations when announcing additions, but being written by only one process, ownership of cache lines never changes, so hardware locks are not engaged.

Readers do poll, and sleep as long as the tolerated latency when nothing new is found. Writers batch additions, and announce new entries only just so frequently.


Interesting, though the overhead of nanosleep vs. futex wait with non-NULL timeout should be dwarfed by the context switch. I can see where if you have extreme latency and/or throughput constraints on the writer side of the ring buffer, you couldn't afford the extra atomic read and occasional futex wake on the writer side to wake the consumer.

> but being written by only one process, ownership of cache lines never changes, so hardware locks are not engaged.

Which cache coherency protocol are you referring to where cache lines are "locked"? I'm aware that early multiprocessor systems locked the whole memory bus for atomic operations, but my understanding is that most modern processors use advanced variants on the MESI protocol[0]. In the MOESI variant, an atomic write (or any write, for that matter) would put the cache line in the O state in the writing core's cache. If we had to label cache line states as either "locked" or "unlocked", the M, O, and E states would be "locked", and S and I would be "unlocked".

Your knowledge of cache coherency protocols is probably much better than mine, so I'm probably just misinterpreting your shorthand jargon for "locked" cache lines. By "hardware locks are not engaged" do you just mean that the O state doesn't ping-pong between cores?

[0] http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2...


> ... you just mean that the O state doesn't ping-pong between cores?

Correct. My terminology was sloppy, sorry. I have no experience of this "context switch" you speak of; my processes are pinned and don't do system calls or get interrupted. :-)

The writing core retains ownership of the cache lines, so no expensive negotiation with other caches occurs. When the writer writes, corresponding cache lines for other cores get invalidated, and when a reader reads, that core's cache requests a current copy of the writer's cache line.

The reader can poll its own cache line as frequently as it likes without generating bus congestion, because no invalidation comes in until a write happens.

There is a newish Intel instruction that just sleeps waiting for a cache line to be invalidated, that doesn't compete for execution resources with the other hyperthread, or burn watts. Of course compilers don't know it. I don't know if AMD has adopted it, but would like to know; I don't recall its name.


I agree with regards to lock-free not being faster than lock-based, but that:

> replace threads with separate processes that communicate via ring buffers

does not follow at all. Interprocess communication is always more work to set up and less efficient [1] than interthread communication. The only thing separate processes gain you is enforced memory protection.

[1] Different processes need different TLB entries, while different threads in the same process do not.


TLB is not necessarily an issue as presumably you would be running the processes on different cores.

The primary problem is that separate address spaces requires messages to be serialized (although of course serialization can be trivial). With shared memory you can share (or transfer ownership of) data by simply sharing a pointer.


TLB usage is an advantage as each process, pinned to a core, gets its own. TLB shootdowns vanish. Cache line ownership, similarly, does not pingpong. Sharing pointers causes fragility, as failure in one thread takes down the whole system.

Transferring ownership would add coupling. But such objects can live in a shared-memory area. Pointers into shared regions do require some care. Ideally, only one process ever owns objects in each, most commonly just the ring buffer.

With separate processes, they may come and go as needed. Yes, setting up shared memory for ring buffers is more work, but interactions establishing them are not fast-path.

Decoupling via separate processes is a different architecture, but has real benefits to motivate it.


Oh, I agree, it is all tradeoffs.

Losing the ability to cheaply share objects, including closures, is a big tradeoff though.

BTW lock free queues and data structures are pretty much required for interprocess sharing, unless you really want to deal with robust mutexes (and you don't!)


I do it. Without mutexes.

I don't count single-writer ring buffers among the ordinary lock-free queues, because there is no competition for access, and cache lines never change ownership. So, none of the overheads show up. And, they are very easy to get right, so pose none of the risk of subtle heisenbugs.

Being able to kill and add downstream clients at random is worth a lot of architectural rejiggery.


It follows in a weird way: processes require more effort to communicate, so you are less likely to send things across processes, and when you do it is because it is worth it. You are also likely to batch the communication and then go off and compute.

All of the above improve performance. Nothing about it requires processes, though; you can get the same results with threads if you take just as much care.

However processes do add one more thing: you can scale to multiple physical computers in ways threads can't.


> However processes do add one more thing: you can scale to multiple physical computers in ways threads can't.

Although this is technically true, communication with another system has orders of magnitude higher latency and lower bandwidth compared to other cores in the same CPU. So in practice, you'll typically have to redesign your approach anyway.


Separate processes means no TLB shoot-downs.

If you put ring buffers in hugepages, the number of TLB entries involved is small, typically 1. And, pinning the processes to cores, each gets its own set.



