Common Pitfalls in Writing Lock-Free Algorithms (memsql.com)
180 points by nikita on March 27, 2013 | 33 comments



Probably the most useful observation in the article is hidden right before the Results / Conclusion...

Any optimization should be measured. Otherwise how do you know if all your hard work is worth it? In the author's example, the fastest implementation was actually _not_ lock-free, it merely used a better spinlock than std::mutex.
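
For reference, that kind of spinlock is only a few lines of C++11. A generic test-and-test-and-set sketch (not the article's actual implementation):

    #include <atomic>

    // Minimal test-and-test-and-set spinlock. Contending threads spin on a
    // plain load (cheap, stays in the local cache line) and only retry the
    // atomic exchange once the lock looks free.
    class Spinlock {
        std::atomic<bool> locked{false};
    public:
        void lock() {
            for (;;) {
                if (!locked.exchange(true, std::memory_order_acquire))
                    return;                        // got it
                while (locked.load(std::memory_order_relaxed)) { /* spin */ }
            }
        }
        void unlock() { locked.store(false, std::memory_order_release); }
    };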

It's a good idea to use existing implementations mostly so that you can measure and (hopefully without too much trouble) change out the implementation whenever you need to. Different hardware likely has different performance, so measure often!


Like the old carpenters' adage: "Measure twice, cut once."


Like the old programmers' adage: "Days of programming save hours of planning."


We recently bought a laser cutter and realised that adage is based on the assumption that the cutting takes the longest time. Now if we want a really good fit we do a binary search on the cut dimensions.

Our new laser cutter adage is "cut loads and don't bother measuring", since measuring doesn't take into account the width of the cut itself.


The adage is based on the fact that you can't "undo" a cut that's too short. If you need a 6' board and cut it to 5'10", you've probably just wasted a perfectly good piece of lumber.


It's also based on resource cost: you're probably not laser-cutting a highly figured maple slab...


> In the author's example, the fastest implementation was actually _not_ lock-free, it merely used a better spinlock than std::mutex.

The author is also using gcc 4.6; perhaps different results would be obtained with gcc 4.8 or a recent Clang. If you care enough to think about LFDS, measure, measure, measure =).


Or glibc's pthread_mutex_t implementation. Or hand-written assembly on top of futex(2). There's no lack of synchronization primitives.
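
Hand-rolling one on top of futex(2) is surprisingly little code. A sketch following the pattern in Drepper's "Futexes Are Tricky" (the real glibc implementation handles many more cases):

    #include <atomic>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    // 0 = unlocked, 1 = locked, 2 = locked with possible waiters.
    struct FutexMutex {
        std::atomic<int> state{0};

        void lock() {
            int c = 0;
            if (state.compare_exchange_strong(c, 1)) return;  // uncontended fast path
            if (c != 2) c = state.exchange(2);                // announce that we wait
            while (c != 0) {                                  // sleep until woken
                syscall(SYS_futex, &state, FUTEX_WAIT, 2, nullptr, nullptr, 0);
                c = state.exchange(2);
            }
        }

        void unlock() {
            if (state.exchange(0) == 2)                       // wake one waiter, if any
                syscall(SYS_futex, &state, FUTEX_WAKE, 1, nullptr, nullptr, 0);
        }
    };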


If you found this interesting, you may also like the really clean implementation of a single-producer single-consumer lock-free queue in Facebook's folly C++ library.

https://github.com/facebook/folly/blob/master/folly/Producer... (code)

https://github.com/facebook/folly/blob/master/folly/docs/Pro... (docs)


First off, I'd call that barrier-less, which can be more important than lock-less, with the important caveat that it will break on ARM and other "weakly consistent" memory-model architectures.

Android's SMP Primer does a really great job of covering this as well, with a lot of race examples, so I recommend checking it out as a complement: http://developer.android.com/training/articles/smp.html

The author of this article (and probably most readers) will be concerned with x86, which has a "processor-consistent" memory model. What isn't mentioned is that ARM and other architectures have an even weaker ("weakly consistent") memory model. The SMP guide above describes these in detail and shows how the races can be observed here: http://developer.android.com/training/articles/smp.html#data...

If you are dealing with low-level problems like this--either in libraries, operating systems, or systems tools and utilities--design for "weakly consistent" memory models, or please leave a huge warning to future developers that it may have subtle and difficult race conditions on those platforms.
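
The canonical example is message passing. A sketch in C++11 atomics (mine, not the guide's code):

    #include <atomic>

    int payload = 0;
    std::atomic<bool> ready{false};

    // On x86 these two stores are observed in program order anyway; on ARM
    // they may be reordered unless release ordering (or stronger) is used.
    void writer() {
        payload = 42;
        ready.store(true, std::memory_order_release);
    }

    // The acquire load pairs with the release store, so after seeing
    // ready == true, payload == 42 holds on every architecture. With
    // relaxed ordering this would merely happen to work on x86.
    void reader() {
        while (!ready.load(std::memory_order_acquire)) { /* spin */ }
        // payload is now guaranteed to be 42
    }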


CompareAndSwap implementations use a memory barrier on ARM. That should be sufficient to guarantee correctness.


It may be lock free, but it's not very wait free.

Maybe I'm just tired, but OS X and Windows already have atomic LIFOs that are much better: http://www.opensource.apple.com/source/Libc/Libc-763.13/i386...

(scroll to bottom, see declaration in /usr/include/libkern/OSAtomic.h)

http://msdn.microsoft.com/en-us/library/windows/desktop/ms68...

For Linux, there's llist.h in the kernel, but I don't know of a userland version, and I haven't used it.
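
For what it's worth, the Windows one is easy to use. A rough usage sketch from memory (check the MSDN page above for the alignment rules):

    #include <windows.h>
    #include <malloc.h>

    struct Item {
        SLIST_ENTRY entry;   // must be the first member; needs 16-byte alignment on x64
        int payload;
    };

    int main() {
        SLIST_HEADER head;
        InitializeSListHead(&head);

        Item* item = (Item*)_aligned_malloc(sizeof(Item), MEMORY_ALLOCATION_ALIGNMENT);
        item->payload = 42;
        InterlockedPushEntrySList(&head, &item->entry);   // lock-free push

        // Pop returns the most recently pushed entry (LIFO), or NULL if empty.
        PSLIST_ENTRY e = InterlockedPopEntrySList(&head);
        if (e) _aligned_free((Item*)e);   // cast is valid: entry is the first member
        return 0;
    }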


The article demos a list where threads contend to append at the top. Unlike a list, there is not much contention on a FIFO unless it's typically empty.


You are correct: I meant LIFO but wrote FIFO.

OSAtomicEnqueue/OSAtomicDequeue and the Interlocked SList functions, by virtue of only being able to push and pop the head, are effectively LIFOs.
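
They're essentially Treiber stacks: push and pop both revolve around a single atomic head pointer. A minimal sketch, which deliberately keeps the memory-reclamation pitfall the article discusses:

    #include <atomic>
    #include <utility>

    template <typename T>
    class TreiberStack {
        struct Node { T value; Node* next; };
        std::atomic<Node*> head{nullptr};
    public:
        void push(T v) {
            Node* n = new Node{std::move(v), head.load(std::memory_order_relaxed)};
            // On failure, compare_exchange_weak reloads head into n->next.
            while (!head.compare_exchange_weak(n->next, n,
                                               std::memory_order_release,
                                               std::memory_order_relaxed)) {}
        }

        bool pop(T& out) {
            Node* n = head.load(std::memory_order_acquire);
            while (n && !head.compare_exchange_weak(n, n->next,
                                                    std::memory_order_acquire,
                                                    std::memory_order_relaxed)) {}
            if (!n) return false;
            out = std::move(n->value);
            // Unsafe without hazard pointers, epochs, or GC: a concurrent pop
            // may still be reading n->next (the ABA / use-after-free problem
            // the article warns about).
            delete n;
            return true;
        }
    };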


There is one, of sorts (an rbtree): http://github.com/johnj/llds


Another problem with lock free algorithms: sometimes locks are cheap.

This is certainly the case on NVIDIA's CUDA platform, where atomic operations incur a 50% penalty. That can sound like a lot, but consider that the locked region might only be 20% of runtime: in practice this means the lock-free version runs in 100 seconds while the one with primitive locks runs in 110 seconds (and was easy to write, debug, and maintain).


Hm, maybe I am missing a lot here, so I would be happy for people to correct me. But in my book a linked list is like a worst case to make "lock free". For me, the first step in making something lock-free is ensuring that all the frequent operations on a data structure only need one item to change (a fixed-size stack, in this case). Once you have multiple things to keep consistent, the old-style lock is a necessity! You might invent one that works slightly better under some benchmarks (like his sleep(250)...). But to me lock-free means an algorithm where both producers and consumers only ever have to access one single atomic value.


"to me lock free means an algorithm where both producers and consumers have to only ever access one single atomic value"

Well it may mean that to you, but unfortunately it doesn't mean that to anyone else.

Lock free has a precise definition. If you don't know what it is, why get involved in this discussion?


As I said in my first sentence, I might be missing a lot here. Could you please point me to the definition you are referring to? I probably did not express myself very well, and I have already learned a bit by thinking about this differently. You (and everybody else, so you are right!) define lock-free as deadlock-free, which can be achieved with a carefully chosen lock. I was always thinking of lock-free as a pure performance optimization, which admittedly is different from the algorithmic property. Am I on the right track? If so, it should answer your question about why I got involved in the discussion: to learn.


Dmitry Vyukov keeps a great resource for learning about lock-free algorithms (and other interesting things) here: http://www.1024cores.net/home/lock-free-algorithms/introduct...


Isn't the very first example in that article flawed?

    void decrement_reference_counter(rc_base* obj)
    {
        if (0 == atomic_decrement(obj->rc))
            delete obj;
    }
This is classic test-then-act, isn't it? What happens if another thread bumps obj->rc after the comparison to 0, but before the deletion? That other thread could find itself referring to a suddenly-deleted object, or am I missing something?


> If you don't know what it is, why get involved in this discussion?

To understand?


We can get high performance coupled with safer ways to deal with locks. You don't need to be afraid of locks so much as find something that handles them well along with whatever else you need.

In Ada there are "protected" objects and types, a protected object being either a singleton or an instantiation of a protected type. Protected types are one way to synchronize threads and provide safe access control to data: there are protected entries with queues, protected functions which can only read the protected data, and protected procedures which can write to it but are guaranteed exclusive access, one caller at a time.

There are plenty of other great features in the built-in tasking to go along with protected objects, like selective entries, time-delayed aborts, and a ton of other stuff. Ada really hits the mark for locking. If you don't want locks but still want synchronization, there's also the Ada rendezvous. Good stuff, and even more robust with the new 2012 standard.
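
For those who don't read Ada, a very rough C++ analogue of a protected object (a sketch only; it loses the compiler-enforced read/write discipline, which is the whole point of the Ada feature, and shared_mutex is C++17, Boost at the time):

    #include <condition_variable>
    #include <mutex>
    #include <shared_mutex>   // C++17

    class ProtectedCounter {
        mutable std::shared_mutex m;
        std::condition_variable_any nonzero;
        int count = 0;
    public:
        // "protected function": read-only, so concurrent readers are allowed
        int value() const {
            std::shared_lock<std::shared_mutex> lock(m);
            return count;
        }

        // "protected procedure": exclusive read/write access, one caller at a time
        void increment() {
            std::unique_lock<std::shared_mutex> lock(m);
            ++count;
            nonzero.notify_one();
        }

        // "protected entry": callers queue until the guard condition holds
        void decrement_when_positive() {
            std::unique_lock<std::shared_mutex> lock(m);
            nonzero.wait(lock, [this] { return count > 0; });
            --count;
        }
    };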


I cannot upvote this enough; thank you for the very well-written article.


I found the observation about new/delete not necessarily being safe interesting. I've been writing Java too long: Java has no delete (garbage collection handles it), and presumably new would have been creating a linked-list entry object, which is safe.


This is actually a really good point. Garbage collection has significant advantages for concurrency because it solves this problem so nicely; avoiding these issues in C++ often isn't just difficult for the programmer but also costs you performance. (Of course, if you generate so much garbage that the GC is running all the time, it will become a bottleneck, but the same is true of excessive allocation and deallocation even in unmanaged languages.)


But it could cause a garbage collection cycle, right? If so, I think this still counts in the "not safe" category.


Generally garbage collection algorithms more sophisticated than reference counting can handle cycles just fine. Certainly this is the case for Java.

Edit: Or perhaps you mean, could cause the garbage collector to run? I don't see why that'd be an issue either.


That is what I meant. And having a garbage collector run could potentially freeze the JVM, causing a performance hit. (I mean, it is basically a global lock on the entire system, right?)

Basically, my reading of "safe" meant "would not block." A garbage collection cycle stops everything, so is effectively a block, right?


I read "safe" differently than you did; now I'm on the same page. With or without a GC, if you want to guarantee safety in that sense no matter what, you're right - you just can't allocate memory. So in the worst case running in a managed environment doesn't buy you much, but often the guarantees you need are a bit looser than that, and in that case managed memory can be really beneficial.


[(Thanks!) You should probably make sure your main site is readable when Adobe Typekit is blocked; it is blocked by default with Ghostery.]


Multithreaded algorithms have always been on my list of things to master in programming. This is an excellent article.


Well, lots of good insights, but I don't get this:

"We currently use boost::atomic for this. Our compiler is gcc 4.6, which currently supports std::atomic, but its implementation is much less efficient than boost’s. "

Really, never replace trust in a compiler with trust in a library.

They're getting down to all the nitty-gritty details and then saying 'a memory barrier is expensive'. It seems contradictory to me.

GCC will get better, other compilers will get better as well, but using another level of indirection (a library) is not usually a good idea.

Oh, and maybe replace the sleep with a sched_yield.
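
Something along these lines (retry_with_yield and the spin bound of 100 are made up for illustration):

    #include <sched.h>

    // Retry 'op' (any bool() operation, standing in for the article's CAS
    // retry loop) with bounded spinning, then give up the timeslice with
    // sched_yield() instead of sleeping for a fixed interval.
    template <typename Op>
    void retry_with_yield(Op op) {
        int spins = 0;
        while (!op()) {
            if (++spins > 100) { sched_yield(); spins = 0; }
        }
    }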

And even better, don't have hundreds of threads on a single data structure, especially a constrained one like a stack; that's probably the easiest way to get speed gains.



