Probably the most useful observation in the article is hidden right before the Results / Conclusion...
Any optimization should be measured. Otherwise how do you know if all your hard work is worth it? In the author's example, the fastest implementation was actually _not_ lock-free, it merely used a better spinlock than std::mutex.
It's a good idea to use existing implementations mostly so that you can measure and (hopefully without too much trouble) change out the implementation whenever you need to. Different hardware likely has different performance, so measure often!
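For context, "a better spinlock" can be as small as this. A minimal C++11 sketch, not the author's actual implementation (which may differ in details such as its backoff strategy):

```cpp
#include <atomic>
#include <thread>

// Minimal test-and-set spinlock. Exposes lock()/unlock() so it can be
// used with std::lock_guard, just like std::mutex.
class Spinlock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // Spin until we transition the flag from clear to set.
        while (flag_.test_and_set(std::memory_order_acquire))
            std::this_thread::yield(); // don't burn the whole time slice
    }
    void unlock() {
        flag_.clear(std::memory_order_release);
    }
};
```

Since it models BasicLockable, `std::lock_guard<Spinlock>` works unchanged, which makes A/B benchmarking against std::mutex a one-line swap.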
We recently bought a laser cutter and realised that the old "measure twice, cut once" adage is based on the assumption that the cutting takes the longest time.
Now if we want a really good fit we do a binary search on the cut dimensions.
Our new laser cutter adage is "cut loads and don't bother measuring", because the old one doesn't take into account the size of your cut.
The adage is based on the fact that you can't "undo" a cut that's too short. If you need a 6' board and cut it to 5'10", you've probably just wasted a perfectly good piece of lumber.
> In the author's example, the fastest implementation was actually _not_ lock-free, it merely used a better spinlock than std::mutex.
The author is also using gcc 4.6; perhaps different results would be obtained with gcc 4.8 or recent Clang. If you care enough to think about LFDS, measure measure measure =).
If you found this interesting, you may also like the really clean implementation of a single-producer single-consumer lock-free queue in Facebook's folly C++ library.
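For a sense of the core trick, here's a minimal sketch of the same idea. This is not folly's code (folly::ProducerConsumerQueue adds refinements such as cache-line padding); it's just the invariant that makes SPSC easy: only the producer writes the head index and only the consumer writes the tail index.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Minimal single-producer single-consumer ring buffer, in the spirit
// of folly::ProducerConsumerQueue. Capacity must be >= 2: one slot is
// deliberately left empty so "full" and "empty" are distinguishable.
template <typename T>
class SpscQueue {
    std::vector<T> buf_;
    const size_t capacity_;
    std::atomic<size_t> head_{0}; // written only by the producer
    std::atomic<size_t> tail_{0}; // written only by the consumer
public:
    explicit SpscQueue(size_t capacity) : buf_(capacity), capacity_(capacity) {}

    bool push(const T& value) { // producer thread only
        size_t head = head_.load(std::memory_order_relaxed);
        size_t next = (head + 1) % capacity_;
        if (next == tail_.load(std::memory_order_acquire))
            return false; // full
        buf_[head] = value;
        head_.store(next, std::memory_order_release);
        return true;
    }

    bool pop(T& value) { // consumer thread only
        size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return false; // empty
        value = buf_[tail];
        tail_.store((tail + 1) % capacity_, std::memory_order_release);
        return true;
    }
};
```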
First off, I'd call that barrier-less, which can be more important than lock-less, with the important caveat that it will break on ARM and other "weakly consistent" memory-model architectures.
The author of this article (and probably most readers) will be concerned with x86, which has a "processor-consistent" memory model. What isn't mentioned is that ARM and other architectures have an even weaker ("weakly consistent") memory model. Android's SMP guide describes these models in detail, including how the resulting races can be observed: http://developer.android.com/training/articles/smp.html#data...
If you are dealing with low-level problems like this (whether in libraries, operating systems, or systems tools and utilities), design for "weakly consistent" memory models, or please leave a huge warning for future developers that the code may have subtle and difficult race conditions on those platforms.
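To make the hazard concrete, here's the classic message-passing example in standard C++11 atomics. The names are illustrative; the point is that the correctly-ordered version below is what a weak memory model forces you to write, while sloppier orderings often appear to work on x86:

```cpp
#include <atomic>

int payload = 0;                  // ordinary, non-atomic data
std::atomic<bool> ready{false};

void producer() {
    payload = 42;
    // Release: all writes above become visible to any thread that
    // reads `ready` with acquire semantics and sees true.
    ready.store(true, std::memory_order_release);
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    // With acquire/release this is guaranteed to read 42. With
    // relaxed ordering it usually still "works" on x86 but can
    // legitimately read 0 on ARM and other weak architectures.
    int v = payload;
    (void)v;
}
```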
The article demos a list where threads are contending to append to the top of it. Unlike a list, there is not much contention on a FIFO unless it's typically empty.
Another problem with lock free algorithms: sometimes locks are cheap.
This is certainly the case on NVIDIA's CUDA platform, where atomic operations incur roughly a 50% penalty. That can sound like a lot, but if the locked region is only about 20% of the runtime, it means the lock-free version runs in 100 seconds while the one with primitive locks runs in 110 seconds (and the latter was easy to write, debug, and maintain).
Hm, maybe I am missing a lot here, so I would be happy for people to correct me. But in my book a linked list is close to the worst case for making something "lock free". For me, the first step in making something lock-free is ensuring that all the frequent operations on the data structure only need one item to change; a fixed-size stack, in this case. Once you have multiple things to keep consistent, the old-style lock is a necessity! You might invent one that works slightly better under some benchmarks (like his sleep(250)). But to me, lock-free means an algorithm where both producers and consumers only ever access one single atomic value.
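That's what the head pointer of a Treiber-style stack gives you: every push contends on exactly one atomic. A minimal sketch (push only; pop is where the ABA and reclamation problems the article wrestles with show up):

```cpp
#include <atomic>

// Sketch of a Treiber-style stack push: the only contended value is
// the single atomic head pointer.
struct Node {
    int value;
    Node* next;
};

std::atomic<Node*> head{nullptr};

void push(Node* n) {
    n->next = head.load(std::memory_order_relaxed);
    // Retry until no other thread has changed head in between.
    // compare_exchange_weak reloads head into n->next on failure.
    while (!head.compare_exchange_weak(n->next, n,
                                       std::memory_order_release,
                                       std::memory_order_relaxed)) {
    }
}
```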
As I put in the first sentence, I might be missing a lot here. Could you please point me to the definition you are referring to? I probably did not express myself very well, and I have already learned a bit by thinking about this differently. You (and everybody else, so you are right!) define lock-free as deadlock-free, which can be achieved with a carefully chosen lock. I was always thinking of lock-free as a pure performance optimization, which admittedly is different from the algorithmic property. Am I on the right track? If yes, it should answer your question about why I got involved in the discussion: to learn.
This is classic test-then-act, isn't it? What happens if another thread bumps obj->rc after the comparison to 0, but before the deletion? That other thread could find itself referring to a suddenly-deleted object, or am I missing something?
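The usual way to close that window is to fuse the test and the act into a single atomic operation. A sketch, where Obj and rc are hypothetical stand-ins for whatever the article's code uses:

```cpp
#include <atomic>

// Hypothetical object with an atomic reference count.
struct Obj {
    std::atomic<int> rc{1};
};

// Racy test-then-act: between the load and the delete, another
// thread can change rc, exactly as described above.
void release_racy(Obj* obj) {
    obj->rc.fetch_sub(1);
    if (obj->rc.load() == 0)
        delete obj;               // another thread may intervene here
}

// The fix: fetch_sub returns the value *before* the decrement, so the
// test and the act collapse into one atomic step and exactly one
// thread observes 1 and deletes. (Bumping a count that may already be
// 0 is a separate bug that this alone can't fix.)
void release(Obj* obj) {
    if (obj->rc.fetch_sub(1, std::memory_order_acq_rel) == 1)
        delete obj;
}
```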
We can get high performance coupled with safer ways to deal with locks. You don't need to be afraid of locks so much as find something that handles them well along with whatever else you need. In Ada there are "protected" objects and types, a protected object being either a singleton or an instantiation of a protected type. Protected types are one way to synchronize threads and provide safe, controlled access to data: there are protected entries with queues, protected functions which can only read the protected data, and protected procedures which can write to it but are guaranteed exclusive access, one caller at a time. There are plenty of great built-in tasking features to go along with protected objects, like selective entries, time-delayed aborts, and a ton of other stuff. Ada really hits the mark for locking. If you don't want locks but still want synchronization, there's also the Ada rendezvous. Good stuff, and even more robust with the new 2012 standard.
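For readers coming from C++, a rough analogue of the protected function/procedure split is a class wrapped around a reader-writer lock. The sketch below assumes C++17's std::shared_mutex; unlike Ada, nothing in the language enforces the discipline, and there's no equivalent of entries with queues:

```cpp
#include <shared_mutex>

// Rough C++17 analogue of an Ada protected object: readers (like
// protected functions) may run concurrently, while writers (like
// protected procedures) get exclusive access. In Ada the language
// enforces this; here it is only a convention the class must follow.
class ProtectedCounter {
    mutable std::shared_mutex mtx_;
    int count_ = 0;
public:
    int read() const {                 // "protected function"
        std::shared_lock lock(mtx_);
        return count_;
    }
    void increment() {                 // "protected procedure"
        std::unique_lock lock(mtx_);
        ++count_;
    }
};
```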
I found the observation that new/delete are not necessarily safe interesting. I've been writing Java too long: Java has no delete (garbage collection handles reclamation), and presumably new would have been creating a linked-list entry object, which is safe.
This is actually a really good point. Garbage collection has significant advantages for concurrency because it solves this problem so nicely; avoiding these issues in C++ often isn't just difficult for the programmer but also costs you performance. (Of course, if you generate so much garbage that the GC is running all the time, it will become a bottleneck, but the same is true of excessive allocation and deallocation even in unmanaged languages.)
That is what I meant. And having a garbage collector run could potentially freeze the JVM, causing a performance hit. (I mean, it is basically a global lock on the entire system, right?)
Basically, my reading of "safe" meant "would not block." A garbage collection cycle stops everything, so is effectively a block, right?
I read "safe" differently than you did; now I'm on the same page. With or without a GC, if you want to guarantee safety in that sense no matter what, you're right - you just can't allocate memory. So in the worst case running in a managed environment doesn't buy you much, but often the guarantees you need are a bit looser than that, and in that case managed memory can be really beneficial.
Well, lots of good insights, but I don't get this:
"We currently use boost::atomic for this.
Our compiler is gcc 4.6, which currently supports std::atomic, but its implementation is much less efficient than boost’s. "
Really, never replace trust in a compiler with trust in a library.
They're getting down to all the nitty-gritty details and then saying 'a memory barrier is expensive'. It seems contradictory to me.
GCC will get better, other compilers will get better as well, but using another level of indirection (a library) is not usually a good idea.
Oh, and maybe replace the sleep with a sched_yield.
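Concretely, assuming the sleep in question is a fixed pause inside a contention loop (like the sleep(250) mentioned elsewhere in the thread):

```cpp
#include <atomic>
#include <chrono>
#include <thread>

std::atomic_flag busy = ATOMIC_FLAG_INIT;

// Fixed nap on contention (the pattern being criticized): the waiter
// sleeps the full interval even if the lock frees up immediately.
void wait_with_sleep() {
    while (busy.test_and_set(std::memory_order_acquire))
        std::this_thread::sleep_for(std::chrono::microseconds(250));
}

// The suggestion: hand the CPU back to the scheduler instead.
// std::this_thread::yield() typically maps to sched_yield() on POSIX.
void wait_with_yield() {
    while (busy.test_and_set(std::memory_order_acquire))
        std::this_thread::yield();
}
```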
And even better, don't have hundreds of threads hammering a single data structure, especially a constrained one like a stack; that's probably the easiest way to get speed gains.