Probably the most useful observation in the article is hidden right before the Results / Conclusion...
Any optimization should be measured. Otherwise how do you know if all your hard work is worth it? In the author's example, the fastest implementation was actually _not_ lock-free, it merely used a better spinlock than std::mutex.
It's a good idea to use existing implementations mostly so that you can measure and (hopefully without too much trouble) change out the implementation whenever you need to. Different hardware likely has different performance, so measure often!
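For context, "a better spinlock" can be as small as this. A minimal C++11 sketch, not the author's actual implementation (which may differ in details such as its backoff strategy):

```cpp
#include <atomic>
#include <thread>

// Minimal test-and-set spinlock. Exposes lock()/unlock() so it can be
// used with std::lock_guard, just like std::mutex.
class Spinlock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // Spin until we transition the flag from clear to set.
        while (flag_.test_and_set(std::memory_order_acquire))
            std::this_thread::yield(); // don't burn the whole time slice
    }
    void unlock() {
        flag_.clear(std::memory_order_release);
    }
};
```

Since it models BasicLockable, `std::lock_guard<Spinlock>` works unchanged, which makes A/B benchmarking against std::mutex a one-line swap.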
We recently bought a laser cutter and realised that the old "measure twice, cut once" adage is based on the assumption that the cutting takes the longest time.
Now if we want a really good fit we do a binary search on the cut dimensions.
Our new laser cutter adage is "cut loads and don't bother measuring", because the old one doesn't take into account the size of your cut.
The adage is based on the fact that you can't "undo" a cut that's too short. If you need a 6' board and cut it to 5'10", you've probably just wasted a perfectly good piece of lumber.
> In the author's example, the fastest implementation was actually _not_ lock-free, it merely used a better spinlock than std::mutex.
The author is also using gcc 4.6; perhaps different results would be obtained with gcc 4.8 or recent Clang. If you care enough to think about LFDS, measure measure measure =).
If you found this interesting, you may also like the really clean implementation of a single-producer single-consumer lock-free queue in Facebook's folly C++ library.
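For a sense of the core trick, here's a minimal sketch of the same idea. This is not folly's code (folly::ProducerConsumerQueue adds refinements such as cache-line padding); it's just the invariant that makes SPSC easy: only the producer writes the head index and only the consumer writes the tail index.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Minimal single-producer single-consumer ring buffer, in the spirit
// of folly::ProducerConsumerQueue. Capacity must be >= 2: one slot is
// deliberately left empty so "full" and "empty" are distinguishable.
template <typename T>
class SpscQueue {
    std::vector<T> buf_;
    const size_t capacity_;
    std::atomic<size_t> head_{0}; // written only by the producer
    std::atomic<size_t> tail_{0}; // written only by the consumer
public:
    explicit SpscQueue(size_t capacity) : buf_(capacity), capacity_(capacity) {}

    bool push(const T& value) { // producer thread only
        size_t head = head_.load(std::memory_order_relaxed);
        size_t next = (head + 1) % capacity_;
        if (next == tail_.load(std::memory_order_acquire))
            return false; // full
        buf_[head] = value;
        head_.store(next, std::memory_order_release);
        return true;
    }

    bool pop(T& value) { // consumer thread only
        size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return false; // empty
        value = buf_[tail];
        tail_.store((tail + 1) % capacity_, std::memory_order_release);
        return true;
    }
};
```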
First off, I'd call that barrier-less, which can be more important than lock-less, with the important caveat that it will break on ARM and other "weakly consistent" memory-model architectures.
The author of this article (and probably most readers) will be concerned with x86, which has a "processor-consistent" memory model. What isn't mentioned is that ARM and other architectures have an even weaker ("weakly consistent") memory model. Android's SMP guide describes these models in detail, including how the resulting races can be observed: http://developer.android.com/training/articles/smp.html#data...
If you are dealing with low-level problems like this (whether in libraries, operating systems, or systems tools and utilities), design for "weakly consistent" memory models, or please leave a huge warning for future developers that the code may have subtle and difficult race conditions on those platforms.
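To make the hazard concrete, here's the classic message-passing example in standard C++11 atomics. The names are illustrative; the point is that the correctly-ordered version below is what a weak memory model forces you to write, while sloppier orderings often appear to work on x86:

```cpp
#include <atomic>

int payload = 0;                  // ordinary, non-atomic data
std::atomic<bool> ready{false};

void producer() {
    payload = 42;
    // Release: all writes above become visible to any thread that
    // reads `ready` with acquire semantics and sees true.
    ready.store(true, std::memory_order_release);
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    // With acquire/release this is guaranteed to read 42. With
    // relaxed ordering it usually still "works" on x86 but can
    // legitimately read 0 on ARM and other weak architectures.
    int v = payload;
    (void)v;
}
```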
The article demos a list where threads are contending to append to the top of it. Unlike a list, there is not much contention on a FIFO unless it's typically empty.
Another problem with lock free algorithms: sometimes locks are cheap.
This is certainly the case on NVIDIA's CUDA platform, where atomic operations incur roughly a 50% penalty. That can sound like a lot, but if the locked region is only about 20% of the runtime, it means the lock-free version runs in 100 seconds while the one with primitive locks runs in 110 seconds (and the latter was easy to write, debug, and maintain).
Hm, maybe I am missing a lot here, so I would be happy for people to correct me. But in my book a linked list is close to the worst case for making something "lock free". For me, the first step in making something lock-free is ensuring that all the frequent operations on the data structure only need one item to change; a fixed-size stack, in this case. Once you have multiple things to keep consistent, the old-style lock is a necessity! You might invent one that works slightly better under some benchmarks (like his sleep(250)). But to me, lock-free means an algorithm where both producers and consumers only ever access one single atomic value.
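That's what the head pointer of a Treiber-style stack gives you: every push contends on exactly one atomic. A minimal sketch (push only; pop is where the ABA and reclamation problems the article wrestles with show up):

```cpp
#include <atomic>

// Sketch of a Treiber-style stack push: the only contended value is
// the single atomic head pointer.
struct Node {
    int value;
    Node* next;
};

std::atomic<Node*> head{nullptr};

void push(Node* n) {
    n->next = head.load(std::memory_order_relaxed);
    // Retry until no other thread has changed head in between.
    // compare_exchange_weak reloads head into n->next on failure.
    while (!head.compare_exchange_weak(n->next, n,
                                       std::memory_order_release,
                                       std::memory_order_relaxed)) {
    }
}
```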
As I put in the first sentence, I might be missing a lot here. Could you please point me to the definition you are referring to? I probably did not express myself very well, and I have already learned a bit by thinking about this differently. You (and everybody else, so you are right!) define lock-free as deadlock-free, which can be achieved with a carefully chosen lock. I was always thinking of lock-free as a pure performance optimization, which admittedly is different from the algorithmic property. Am I on the right track? If yes, it should answer your question about why I got involved in the discussion: to learn.
This is classic test-then-act, isn't it? What happens if another thread bumps obj->rc after the comparison to 0, but before the deletion? That other thread could find itself referring to a suddenly-deleted object, or am I missing something?
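The usual way to close that window is to fuse the test and the act into a single atomic operation. A sketch, where Obj and rc are hypothetical stand-ins for whatever the article's code uses:

```cpp
#include <atomic>

// Hypothetical object with an atomic reference count.
struct Obj {
    std::atomic<int> rc{1};
};

// Racy test-then-act: between the load and the delete, another
// thread can change rc, exactly as described above.
void release_racy(Obj* obj) {
    obj->rc.fetch_sub(1);
    if (obj->rc.load() == 0)
        delete obj;               // another thread may intervene here
}

// The fix: fetch_sub returns the value *before* the decrement, so the
// test and the act collapse into one atomic step and exactly one
// thread observes 1 and deletes. (Bumping a count that may already be
// 0 is a separate bug that this alone can't fix.)
void release(Obj* obj) {
    if (obj->rc.fetch_sub(1, std::memory_order_acq_rel) == 1)
        delete obj;
}
```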
We can get high performance coupled with safer ways to deal with locks. You don't need to be afraid of locks so much as find something that handles them well along with whatever else you need. In Ada there are "protected" objects and types, a protected object being either a singleton or an instantiation of a protected type. Protected types are one way to synchronize threads and provide safe, controlled access to data: there are protected entries with queues, protected functions which can only read the protected data, and protected procedures which can write to it but are guaranteed exclusive access, one caller at a time. There are plenty of great built-in tasking features to go along with protected objects, like selective entries, time-delayed aborts, and a ton of other stuff. Ada really hits the mark for locking. If you don't want locks but still want synchronization, there's also the Ada rendezvous. Good stuff, and even more robust with the new 2012 standard.
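For readers coming from C++, a rough analogue of the protected function/procedure split is a class wrapped around a reader-writer lock. The sketch below assumes C++17's std::shared_mutex; unlike Ada, nothing in the language enforces the discipline, and there's no equivalent of entries with queues:

```cpp
#include <shared_mutex>

// Rough C++17 analogue of an Ada protected object: readers (like
// protected functions) may run concurrently, while writers (like
// protected procedures) get exclusive access. In Ada the language
// enforces this; here it is only a convention the class must follow.
class ProtectedCounter {
    mutable std::shared_mutex mtx_;
    int count_ = 0;
public:
    int read() const {                 // "protected function"
        std::shared_lock lock(mtx_);
        return count_;
    }
    void increment() {                 // "protected procedure"
        std::unique_lock lock(mtx_);
        ++count_;
    }
};
```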
I found the observation that new/delete are not necessarily safe interesting. I've been writing Java too long: Java has no delete (garbage collection handles reclamation), and presumably new would have been creating a linked-list entry object, which is safe.
This is actually a really good point. Garbage collection has significant advantages for concurrency because it solves this problem so nicely; avoiding these issues in C++ often isn't just difficult for the programmer but also costs you performance. (Of course, if you generate so much garbage that the GC is running all the time, it will become a bottleneck, but the same is true of excessive allocation and deallocation even in unmanaged languages.)
That is what I meant. And having a garbage collector run could potentially freeze the JVM, causing a performance hit. (I mean, it is basically a global lock on the entire system, right?)
Basically, my reading of "safe" meant "would not block." A garbage collection cycle stops everything, so is effectively a block, right?
I read "safe" differently than you did; now I'm on the same page. With or without a GC, if you want to guarantee safety in that sense no matter what, you're right - you just can't allocate memory. So in the worst case running in a managed environment doesn't buy you much, but often the guarantees you need are a bit looser than that, and in that case managed memory can be really beneficial.
Well, lots of good insights, but I don't get this:
"We currently use boost::atomic for this.
Our compiler is gcc 4.6, which currently supports std::atomic, but its implementation is much less efficient than boost’s. "
Really, never replace trust in a compiler with trust in a library.
They're getting down to all the nitty-gritty details and then saying 'a memory barrier is expensive'. It seems contradictory to me.
GCC will get better, other compilers will get better as well, but using another level of indirection (a library) is not usually a good idea.
Oh, and maybe replace the sleep with a sched_yield.
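Concretely, assuming the sleep in question is a fixed pause inside a contention loop (like the sleep(250) mentioned elsewhere in the thread):

```cpp
#include <atomic>
#include <chrono>
#include <thread>

std::atomic_flag busy = ATOMIC_FLAG_INIT;

// Fixed nap on contention (the pattern being criticized): the waiter
// sleeps the full interval even if the lock frees up immediately.
void wait_with_sleep() {
    while (busy.test_and_set(std::memory_order_acquire))
        std::this_thread::sleep_for(std::chrono::microseconds(250));
}

// The suggestion: hand the CPU back to the scheduler instead.
// std::this_thread::yield() typically maps to sched_yield() on POSIX.
void wait_with_yield() {
    while (busy.test_and_set(std::memory_order_acquire))
        std::this_thread::yield();
}
```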
And even better, don't have hundreds of threads hammering a single data structure, especially a constrained one like a stack; that's probably the easiest way to get speed gains.