It seems the answer is “Not available right now, let’s work on an API”? I think that matches my perspective on this, in that it isn’t worth a “mortal” developer’s time right now. Rarely have I hit a scenario where a lock is the primary source of inefficiency when crossing thread boundaries.
The Go approach of “Do not communicate by sharing memory” holds up well, because CPU and memory do not always compose as one would expect (as the author of the post notes). Passing messages simplifies a lot of things because you learn not to rely on a specific thread or ordering, or to program defensively against that possibility.
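For illustration, here's a minimal, hypothetical Go sketch of "share memory by communicating": one goroutine owns the state outright, and everything else touches it only by sending messages, so there's no lock and no reasoning about which thread mutates the counter.

```go
// Hypothetical sketch: a single goroutine owns the counter state; other
// goroutines "share" it only by sending requests over a channel.
package main

import "fmt"

type request struct {
	delta int
	reply chan int // receives the value after applying delta
}

func counter(requests <-chan request) {
	total := 0 // owned exclusively by this goroutine, no lock needed
	for req := range requests {
		total += req.delta
		req.reply <- total
	}
}

func main() {
	requests := make(chan request)
	go counter(requests)

	reply := make(chan int)
	requests <- request{delta: 1, reply: reply}
	fmt.Println(<-reply) // 1
}
```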
My experience is that I've often seen locks being an unseen source of performance woes in enterprise systems. So when you say that "rarely have I hit a scenario", I'd say that there's an even chance that you have, it's just that you didn't know it.
Here's the acid test: Imagine if you got given a two-socket AMD EPYC server with 128 cores to run your software on. No virtualisation, local NVMe SSD for storage, 400 Gbps Ethernet for networking. No hardware bottlenecks of any type! Could your software utilise all 256 hardware threads of this computer? If not, why not?
Maybe you got lucky and you really haven't had issues with locks, but certainly other developers have.
The most common issue I see is with implicit locks in things like logging frameworks. E.g.: Sending output to a text file can be a bottleneck for larger servers like in the example above. Similarly, console output in multi-threaded software also requires locks, and also often has contention issues.
Even innocent-looking code that simply uses dynamic memory allocation ("new", "malloc", etc...) can be bottlenecked by cross-thread contention on the heap data structures and the associated locks. Some modern allocators have thread-local pools, but this doesn't always work. E.g.: often large allocations go to a shared pool. Code that over-allocates large buffers and rapidly frees them can hit this issue all too easily.
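As a rough Go-flavored illustration of the pooling mitigation (a hypothetical sketch, not a claim about how any particular allocator behaves): explicitly reusing large buffers keeps the rapid allocate/free churn off the shared heap entirely.

```go
// Hypothetical sketch: reuse large scratch buffers via sync.Pool instead of
// allocating and freeing them on every request, which can contend on the
// shared allocator under many threads.
package main

import "sync"

var bufPool = sync.Pool{
	New: func() any { return make([]byte, 1<<20) }, // 1 MiB scratch buffer
}

func handle() {
	buf := bufPool.Get().([]byte)
	defer bufPool.Put(buf) // hand the buffer back instead of freeing it
	_ = buf                // ... use buf as scratch space ...
}

func main() {
	for i := 0; i < 4; i++ {
		handle()
	}
}
```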
Web application servers will often hit the wall on something like a shared cache or the session-state store. Unless 100% of the shared data uses efficient lock-free algorithms, then given enough threads eventually some mutex somewhere will be the limit.
There was a study done recently that showed that no modern database engine can scale past 64 cores properly, let alone 128. Not Oracle, not SAP, not SQL Server.
At the rate TSMC is advancing with chip technology, they'll hit 300 million transistors per square millimeter in just a couple of years. I fully expect AMD to release 128-core CPUs once they're on that process. A quad-socket server with those will have 512 cores or 1,024 threads. Completely lock-free algorithms will be the only way to scale up even a single server of that size!
PS: I remember reading the content on http://www.1024cores.net/ a few years back and thinking to myself that this guy has the right idea, but he's thinking too far ahead. Now... not so much. Now I think the author of that site is a visionary that more people should have paid attention to.
> Web application servers will often hit the wall on something like a shared cache or the session-state store. Unless 100% of the shared data uses efficient lock-free algorithms, then given enough threads eventually some mutex somewhere will be the limit.
Without strong LL/SC (which exists nowhere for useful block sizes), it's difficult if not impossible to implement complex data structures, particularly compound data structures (lockless doesn't compose well), that can't block somewhere. Many sophisticated lock-free and even nominally wait-free data structures implicitly rely on dynamic memory and complex garbage collection strategies, which just hides the ball.
If you have shared resource contention, usually something is going to block somewhere. And the more you scale up, the more your optimization hacks to minimize real-world blocking break down, thus the long history of multithreaded software hitting limits at 2, 4, 8, 16, 32, 64, 128, etc. CPUs.
The way around these is to architect your application at a higher level to avoid or mitigate resource contention. And how you do that is extremely context dependent, though "share memory by communicating" is a rule of thumb intended to steer you in a less failure-prone direction, keeping your options open. If you go in planning to just add lockless implementations of basic data structures everywhere, you're going to fail hard. Often because the hit to pipelining, etc., from using atomic operations in your critical path imposes huge costs upfront.
Are you familiar with any resources on this? I've noticed this when interviewing candidates and asking follow-ups related to how to make an LRU cache thread-safe. Many, many candidates reach for converting their HashMap into a ConcurrentHashMap, which effectively buys them nothing due to the point you raised.
Seems like a missing abstraction that makes it hard to compose these structures, and therefore to construct complex structures. Another thing for the research list, irrespective. Or maybe it is naturally the case that implementations of composition for lockless structures don't scale well?
I still need to read GP's article from 1024cores, which seems to potentially get into this as well.
I'm not an expert in this area, I've just read enough literature on software transactional memory to understand some of the fundamental limitations of modern hardware architectures.
This is one of the few areas where a subscription to the ACM Digital Archive is priceless. It's been several years since I went down the rabbit hole, and many new data structures have since been published. If there's a decent self-contained solution, there's a good chance it's been published there. Any particular paper is likely floating around on the Internet, but efficiently sifting through the literature benefits from friction-free access to all the citations.
Can you elaborate on the HashMap? I have used java.util.concurrent.ConcurrentHashMap for caches on machines with up to 16 cores. I am actually very impressed by the implementation every time I use it. Easy to use, and the performance is excellent. I've never had a chance to use it on a 128-core machine, and maybe my experience is limited, but I think that for the majority of use cases using ConcurrentHashMap for caches is a solid choice.
Unless I'm misunderstanding terribly, the issue is that an LRU cache is composed of both a list of ordered cache entries and the map of the data itself. Converting just the map to a thread-safe implementation doesn't actually fix the thread-safety issues.
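A minimal Go sketch (hypothetical, with made-up names) makes the point: the map and the recency list have to be updated under one lock, because even a Get mutates the list, so swapping in a concurrent map alone leaves the invariant between the two structures unprotected.

```go
// Hypothetical sketch: a thread-safe LRU needs the map *and* the recency
// list updated under one lock; making only the map concurrent still leaves
// the list (and the map/list invariant) racy.
package lru

import (
	"container/list"
	"sync"
)

type entry struct {
	key, value string
}

type LRU struct {
	mu    sync.Mutex
	cap   int
	items map[string]*list.Element
	order *list.List // front = most recently used
}

func New(cap int) *LRU {
	return &LRU{cap: cap, items: make(map[string]*list.Element), order: list.New()}
}

func (c *LRU) Get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	el, ok := c.items[key]
	if !ok {
		return "", false
	}
	c.order.MoveToFront(el) // even a read mutates the recency list
	return el.Value.(*entry).value, true
}

func (c *LRU) Put(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).value = value
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key, value})
	if c.order.Len() > c.cap {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}
```

Scaling this further usually means sharding the whole cache (lock plus map plus list per shard), not swapping in a fancier map.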
> Seems like a missing abstraction that makes it hard to compose these structures, and therefore to construct complex structures. Another thing for the research list, irrespective. Or maybe it is naturally the case that implementations of composition for lockless structures don't scale well?
Would transactional memory make lock-free algorithms compose better, or just race/conflict more?
The vast majority of STMs use lock-based transaction management. There are lock-free designs, but these tend to perform worse due to creating a lot of memory bus traffic. Unfortunately, lock-free algorithms can easily perform worse than their lock-based counterparts. There are often simpler alternatives that reduce lock contention on hot paths (like map reads) while still taking advantage of locking (like map writes).
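As a small example of that last point (a hedged sketch in Go rather than an STM, with invented names): a read-write lock lets the hot read path proceed in parallel while writes still take the exclusive lock.

```go
// Hypothetical sketch: a read-mostly cache where reads take the shared lock
// and only writes take the exclusive lock, reducing contention on the hot
// read path without any lock-free machinery.
package cache

import "sync"

type Cache struct {
	mu sync.RWMutex
	m  map[string]string
}

func New() *Cache { return &Cache{m: make(map[string]string)} }

func (c *Cache) Get(key string) (string, bool) {
	c.mu.RLock() // many readers may hold this at once
	defer c.mu.RUnlock()
	v, ok := c.m[key]
	return v, ok
}

func (c *Cache) Set(key, value string) {
	c.mu.Lock() // writers are exclusive and (hopefully) rare
	defer c.mu.Unlock()
	c.m[key] = value
}
```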
> I fully expect AMD to release 128-core CPUs once they're on that process. A quad-socket server with those will have 512 cores or 1,024 threads. Completely lock-free algorithms will be the only way to scale up even a single server of that size!
How often will we need that scale in a single address space?
It's certainly desirable for hypervisors/kernels, database servers, perhaps L4 load balancers and reverse HTTP proxies. Those aren't nothing, but far more people work on web application servers than on all of them put together.
Web application servers are often written to be stateless (with the possible exception of caches) so they can scale to multiple machines. That's important for high-reliability sites even if they aren't large enough to fully saturate a single huge machine like that. As long as you need to load-balance between machines, it's not a big problem to also run multiple instances per machine. If the application scales well to 32 cores, run 16 of them per 512-core machine. Seems a lot easier than going to extraordinary efforts to make one address space scale...
Even if you have 1024 separate processes not sharing much of anything, there are still locks in the kernel running them.
For example, a pair of threads (inside one of those 1024 processes) synchronising with each other will often go through the kernel to do so. In Linux this uses the futex syscall; Windows etc have similar. If to do that the kernel takes a lock that is shared with other processes, even if just for a moment, even if it's hashed on address and memory space, even if it's a spinlock and there's little contention, that lock causes memory traffic between multiple cores and separate processes.
Same for processes that are reading the same files as other processes, or (for example) running in the same directory when doing path lookups. There's a lot of work done in Linux to keep this scalable (RCU), but it's easy to hit scaling barriers that nobody has tested or designed for yet. Once 1024 core CPUs are common, of course the kernel will be optimised for that.
Yes, I included the kernel in my list of things that are desirable to scale well for that reason.
That said, in some cases I don't think it's strictly necessary for even the kernel to scale well as long as you have a hypervisor that does. It's not unusual to deploy software in VMs on a cluster. Having more, smaller VMs per machine is a way to handle poor kernel scalability, just as I suggested for the web application server. VMs are higher-overhead than multiple containers on a single kernel, so this wouldn't be my first choice, but many people use VMs anyway.
I feel it would be more productive to let databases handle any shared state and only write my own code that uses local state. If the databases are falling short I am sure they will catch up in a hurry, and will do a better job than I will have time for.
Many languages/standard libraries offer some kind of concurrent map that might be worth considering depending on context, sync.Map in Golang for example.
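A quick usage sketch (hypothetical keys and values), since sync.Map's sweet spot is exactly the read-mostly cache case:

```go
// Hypothetical sketch: sync.Map targets read-mostly maps where goroutines
// touch mostly disjoint keys; it avoids a single global mutex for those
// workloads.
package main

import (
	"fmt"
	"sync"
)

func main() {
	var m sync.Map

	m.Store("answer", 42)
	if v, ok := m.Load("answer"); ok {
		fmt.Println(v) // 42
	}

	// LoadOrStore gives "insert if absent" without an external lock.
	actual, loaded := m.LoadOrStore("answer", 0)
	fmt.Println(actual, loaded) // 42 true
}
```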
I think it is some mixture of luck and working primarily on distributed systems, where the "default" is in principle to avoid locks or to hide them behind something like a transaction, with optimistic locking happening under the hood.
I think this is the "right" level of locking for "mortals", though complex transactions can quickly become hard to understand as well.
Irrespective, thank you for the link and comment, I haven't interfaced with many of the scenarios you shared!
It depends on what you're doing. For real-time applications you simply cannot use locks in a real-time thread, because there is no guarantee that you will unlock within your deadline. This is common in audio programming, where missing your deadline causes glitches in the audio stream.
It's definitely niche, but when you need them, you really need them.
That's not strictly true. There are dozens of real-time sleep-based locking protocols that support bounds on worst-case blocking time. Now, even if you have a suitable protocol available to you, your deadlines may be so tight that you can't afford the overhead of context switching. Then yes, you may need a lock-free or wait-free synchronization mechanism.
I'm not sure if we're disagreeing here, but in the context of this article "a lock" is a kernel lock. What that implies will obviously depend on your kernel, but in Linux it means your thread may wait for an unbounded amount of time.
> The Go approach of “Do not communicate by sharing memory”
Calling it an approach is hardly true given that Go programs are generally full of shared mutable memory, even more so when performance is of any concern, because channels are rather slow (multiple locks are involved). And that's not the only reason: any time you're sending a pointer through a channel (including sending pointer-like structures, e.g. a map), you're sharing memory.
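A tiny sketch of that last point (hypothetical, nothing from the parent comments): the map travels through the channel, but the underlying storage is still shared between sender and receiver.

```go
// Hypothetical sketch: a map sent over a channel is still shared memory;
// the receiver sees (and could race with) the sender's later mutations,
// so the channel alone doesn't give you isolation.
package main

import "fmt"

func main() {
	ch := make(chan map[string]int, 1)
	m := map[string]int{"hits": 0}

	ch <- m          // "message passing", but only the map header is copied
	received := <-ch // receiver now holds the same underlying map

	m["hits"]++                   // sender mutates after the send...
	fmt.Println(received["hits"]) // ...and the receiver observes it: 1
}
```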
It would seem to be a failing of the C language that ultimately prevents us from coming up with reusable recipes for this. Macros or generics of sufficient power, in a language with an advanced type system, should be able to accomplish this.
I wonder, if language is the primary barrier here, whether there is a way to integrate another language for specific small pieces like this.
You can do all this in C but it's super hard to get your brain around and do it right. It's probably better to use an abstraction.
For example, the Pony language is written in C and guarantees no races or deadlocks without using locks or mutexes so it's faster than C with locking. The user defines intentions, similar to Rust's borrow checking, and strict compile time analysis makes it happen for you.
> For example, the Pony language is written in C and guarantees no races or deadlocks without using locks or mutexes so it's faster than C with locking.
This claim is misleading. The reason why Pony is able to "guarantee" no races or deadlocks is because it doesn't support shared memory concurrency. If you are willing to give up shared memory concurrency in C you can achieve exactly the same performance.
In fact, it's often the opposite: shared-memory concurrency (as implemented in languages like C) can offer performance advantages that actor models like Pony can't provide.
This is wrong. The biggest advantage of Pony is to support pointers into shared memory and still guarantee all 3 safeties. It just doesn't support blocking IO if you need that. And pointers into shared memory have severe limitations, just as in the real world. But the advantage is that the compiler already guarantees safety.
One good example would be writing a multi-threaded dictionary compression program. Dictionary compression is quite simple: you need to encode a bunch of strings as a series of integers and construct a dictionary that can perform the inverse mapping. The main trick is that you need a shared mutable dictionary to store the current list of string-to-integer mappings.
Doing this with an actor model would be tricky to the point of impossible. Sending a message to the common "dictionary" actor is much, much, much less efficient than just holding the corresponding sharded lock.
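To make the sharded-lock idea concrete, here's a hedged Go sketch (invented names, inverse mapping omitted): each shard has its own mutex, so encoders working on different strings rarely touch the same lock.

```go
// Hypothetical sketch: a sharded string-to-ID dictionary, as you might use
// for dictionary compression. Each shard has its own mutex, so threads
// encoding different strings mostly don't contend.
package dict

import (
	"hash/fnv"
	"sync"
	"sync/atomic"
)

const numShards = 64

type shard struct {
	mu  sync.Mutex
	ids map[string]uint32
}

// Dict maps strings to small integer codes; the inverse mapping needed for
// decompression is omitted to keep the sketch short.
type Dict struct {
	shards [numShards]shard
	next   uint32 // next ID to hand out, claimed atomically
}

func New() *Dict {
	d := &Dict{}
	for i := range d.shards {
		d.shards[i].ids = make(map[string]uint32)
	}
	return d
}

// ID returns the integer code for s, assigning a fresh one if needed.
// Only threads hashing to the same shard ever contend on the same mutex.
func (d *Dict) ID(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	sh := &d.shards[h.Sum32()%numShards]

	sh.mu.Lock()
	defer sh.mu.Unlock()
	if id, ok := sh.ids[s]; ok {
		return id
	}
	id := atomic.AddUint32(&d.next, 1) - 1
	sh.ids[s] = id
	return id
}
```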
Part of the problem is the question "Who waits?" or, less colloquially, "Are you guaranteeing fairness, progress, or performance?"
For example, in a network stack, I may want fairness. In that case, all of my packets should get through with equal probability so I want my threads to all make progress even if I lose some maximum bandwidth.
However, in an audio stack, my thread feeding the audio hardware had better NEVER block. I don't care how long the threads feeding into it have to wait, but the thread feeding the audio hardware gets priority under ALL conditions and I never want it to have to wait more than a single instruction.
Those are VERY different locking scenarios, and you can't automatically differentiate.
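For the audio case specifically, a common shape is a single-producer/single-consumer ring buffer where the consumer never blocks. A hedged sketch in Go follows (Go isn't a real-time language, but the structure is the same, and all names here are invented):

```go
// Hypothetical sketch: a single-producer/single-consumer ring buffer using
// only atomic loads/stores, so the consumer (e.g. the thread feeding audio
// hardware) never blocks on a lock; it just fails fast when there's no data.
package spsc

import "sync/atomic"

const size = 1024 // capacity; must be a power of two

type Ring struct {
	buf  [size]float32
	head atomic.Uint64 // next slot to read, advanced only by the consumer
	tail atomic.Uint64 // next slot to write, advanced only by the producer
}

// Push is called only by the producer goroutine.
func (r *Ring) Push(v float32) bool {
	tail := r.tail.Load()
	if tail-r.head.Load() == size {
		return false // full; producer decides whether to drop or retry
	}
	r.buf[tail%size] = v
	r.tail.Store(tail + 1) // publish the new element
	return true
}

// Pop is called only by the consumer goroutine and never waits.
func (r *Ring) Pop() (float32, bool) {
	head := r.head.Load()
	if head == r.tail.Load() {
		return 0, false // empty; caller can output silence instead of blocking
	}
	v := r.buf[head%size]
	r.head.Store(head + 1)
	return v, true
}
```

The key property is that Pop never waits: when the buffer is empty the audio thread can emit silence rather than miss its deadline.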
There seems to be confusion about what I'm saying. I'm not saying C can't have recipes for concurrency. I'm saying it can't abstract those recipes well.
False. The C standard had no consideration for thread semantics prior to C11, and C11's memory model is literally and explicitly derived from C++11's memory model (which itself is adapted from Java 5's memory model). All of the discussions about how the memory model semantics for C11 and C++11 actually work take place entirely within the C++ standards committee, as is true for most new C language features these days (with the C committee largely designing merely the spellings of the new features).
Huh? Both C and C++ only added support for atomics and double-checked locking in the same year: 2011 (C11 for C and C++11 for C++). How did C support double checked locking before C++ when the standards were released contemporaneously?