Please note that the code in this article is broken. It lacks proper use of atomics and memory barriers, let alone compiler barriers. You can find a similar and correct queue at http://concurrencykit.org (ck_fifo.h). The liblfds queue is very heavy-weight for a simple SPSC workload and is also lacking in correctness.
Care to point us to a link explaining why the code in the article is incorrect? And I am sure it would help many people if you would post the link in a comment under the source article as well. I would love to hear what the author has to say too.
You asked for a link to an explanation; I do not have one. I'm not sure what you found funny about referring you to a 30-to-50-line piece of source code whose implementation you can compare for yourself. Also, people make mistakes; what does your respect for the author have to do with anything? Regardless, I will elaborate.
General issues:
"If you don't yet have ordered atomic variables yet on your language and platform, you can emulate them by using ordinary but aligned variables whose reads and writes are guaranteed to be naturally atomic, and enforce ordering by using either platform-specific ordered API calls (such as Win32's InterlockedCompareExchange for compare-and-swap) or platform-specific explicit memory fences/barriers (for example, Linux mb)."
This is incorrect. Many language memory models provide no such guarantees, even for aligned variables (this includes C, even with volatile variables). This applies not only to regular loads and stores but also to read-modify-write (RMW) operations.
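To make that concrete, here is a minimal sketch (mine, not the article's code, assuming C++11 atomics are available) of what a portable "ordered atomic variable" actually looks like: the guarantee comes from std::atomic and an explicit memory_order, not from alignment or volatile.

    #include <atomic>

    struct Node { int value; Node* next; };

    // Alignment and volatile are NOT enough: the C and C++ memory models
    // give such a variable neither atomicity nor ordering guarantees.
    Node* volatile tail_unsafe = nullptr;

    // Portable: std::atomic supplies both, with the ordering made explicit.
    std::atomic<Node*> tail{nullptr};

    void publish(Node* n) {
        // Release: writes to *n made before this store become visible to
        // any thread whose acquire load observes the new pointer.
        tail.store(n, std::memory_order_release);
    }

    Node* observe() {
        return tail.load(std::memory_order_acquire);
    }

On x86 the release store and acquire load typically compile to plain moves, which is why code that omits them can appear to work there and then break on more weakly ordered hardware.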
Produce/Consume function:
We store to last->next and then to last. If we assume total store ordering, the insertion itself is fine. However, it should be made clear that an explicit store fence is needed after updating the next pointer. The article gives the impression (due to the statement quoted above) that this algorithm is suitable as-is for other languages and platforms.
In the consumer, we also need an explicit load fence. If the atomic operations are fully serializing we are fine, but in other languages, even with aligned variables, these semantics are incorrect.
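For illustration, here is roughly where those two fences would sit in an SPSC linked-list queue in the spirit of the article's (a sketch with my own names, not the article's code; node reclamation is deliberately left out):

    #include <atomic>

    struct Node {
        int   value;
        Node* next;
    };

    // One producer, one consumer. Illustrative names, not the article's code.
    std::atomic<Node*> divider;   // last node the consumer has finished with
    std::atomic<Node*> last;      // last node the producer has published

    void init() {
        Node* dummy = new Node{0, nullptr};
        divider.store(dummy, std::memory_order_relaxed);
        last.store(dummy, std::memory_order_relaxed);
    }

    // Producer.
    void produce(int v) {
        Node* n = new Node{v, nullptr};
        last.load(std::memory_order_relaxed)->next = n;   // link behind the tail
        // Store fence: the writes to *n and to ->next above must become
        // visible before the consumer can observe the new value of 'last'.
        std::atomic_thread_fence(std::memory_order_release);
        last.store(n, std::memory_order_relaxed);
    }

    // Consumer.
    bool consume(int* out) {
        if (divider.load(std::memory_order_relaxed) ==
            last.load(std::memory_order_relaxed))
            return false;                                  // nothing published yet
        // Load fence: pairs with the store fence in produce(), so the node
        // fields written before 'last' was advanced are visible below.
        std::atomic_thread_fence(std::memory_order_acquire);
        Node* next = divider.load(std::memory_order_relaxed)->next;
        *out = next->value;
        divider.store(next, std::memory_order_relaxed);   // advance past the consumed node
        return true;
    }

The release fence before the store to last pairs with the acquire fence after the consumer observes the new last, so the node's contents and the next link are guaranteed visible before the consumer dereferences them.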
The "trim" functionality is broken as it makes no use of safe memory reclamation. It is fine to re-use nodes (assuming you don't have a concurrent lock-free traversal going on) but you need to be careful with destruction. Assume consumer is at some time T where divider != last (but divider = first). Let us say consumer is at result = divider->next->value. Pause execution of consumer after it has acquired divider->next. Producer enqueues another node and sees that first != divider (yet consumer has an earlier snapshot of divider), so it trims first. Consumer is now active again and dereferences the earlier snapshot it had of divider->next to get to value, but now this yields undefined behavior (crash or worse).
sbahra is correct; at the very least, in release 6, the order of pointer and counter reading is incorrect. This was fixed a long time ago, but release 7 hasn't yet come out; I've added an entry to the known issues page in the wiki. The counter needs to be read first and not doing so permits ABA. The queue is MPMC and as such is much heavier than the memory-barrier-only SPSC circular buffer.
I should qualify this. The pointer/counter bug was fixed a long, long time ago, but at the time it slipped through the net with regard to being published on the known issues page. That was a very unfortunate oversight. Sbahra emailed a few weeks ago to inform me of the issue (and the others he raises), and I checked the existing code base to find the pointer/counter issue already fixed and the other issues unclear (specifics were not given). I replied but received no response - but sbahra is a busy guy and also very competent, so I consider it likely his comments have merit; I will look further into them once I've finished my immediate release 7 work. With regard to adding the pointer/counter issue to the known issues page, I did that today. I thought about it earlier, but release 7 is so close to coming out and the bug has been present for so long that it seemed a pretty moot point (actually, of course, that's completely untrue; people using release 6 need to know, so they can know to update their code).
I've spent an hour or two looking more closely at the pointer/counter ordering behaviour in the queue. I'm going to go out a bit on a limb (this is mental code inspection only) and say it's not in fact a problem. It IS a problem in a stack which increments the counter only on push, but it's NOT a problem in the M&S queue because of the checking logic which follows the pointer/counter read. If there has been a torn read, where the pointer and counter don't match, the main loop will simply iterate and retry, because the counter doesn't match and it is incremented on both enqueue and dequeue.
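To make the ordering question concrete, here is a generic sketch (my own illustration, not liblfds source) of snapshotting a pointer/counter pair that cannot be loaded in a single atomic operation: reading the counter on both sides of the pointer read detects a torn read, and the M&S-style loop described above gets a similar effect by re-validating the counter before its CAS.

    #include <atomic>
    #include <cstdint>

    struct Node;

    // Generic illustration only, not liblfds source: a pointer/counter pair
    // that cannot be read in one atomic operation on the target in question.
    std::atomic<Node*>         head_ptr;
    std::atomic<std::uint64_t> head_count;   // ABA counter, bumped by every CAS

    struct Snapshot { Node* ptr; std::uint64_t count; };

    Snapshot read_head() {
        for (;;) {
            // Counter first, then pointer, then counter again: if the two
            // counter reads differ, an update slipped in between and the
            // snapshot may be torn, so retry.
            std::uint64_t before = head_count.load(std::memory_order_acquire);
            Node*         p      = head_ptr.load(std::memory_order_acquire);
            std::uint64_t after  = head_count.load(std::memory_order_acquire);
            if (before == after)
                return Snapshot{p, before};
        }
    }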
I have used this library http://www.liblfds.org/ for lock-free data structures successfully. I've only used the queue, but it also has a linked list (logical deletion only), a stack, and a ring buffer.
There is another library others are using, but I can't remember its name at the moment.
OK, now I guess an event queue can also be lock-free: use one lock-free queue, as the article described, for incoming events from the producer, and a second queue in which the consumer keeps the events sorted.
Sure it is. Priority queues are a common example of this behavior, and they're still queues.
chj's question is interesting. It's not obvious how to build lock-free code in the original article's framework if you aren't representing your queue as a list, but many queues are not represented this way. If you need this 'insert in the middle of the queue' operation, you probably aren't representing your queue with a linked list, since insertion would be inefficient - chances are you're using a heap.
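Here is a sketch of that two-queue arrangement (my own illustration, assuming C++ and the standard library; HandoffQueue is a hypothetical stand-in for whatever lock-free SPSC queue carries the producer-to-consumer handoff): the consumer drains the shared queue into a private std::priority_queue, where 'insert in the middle' becomes an ordinary heap insertion.

    #include <deque>
    #include <queue>

    struct Event {
        int priority;
        int payload;
        bool operator<(const Event& other) const {
            return priority < other.priority;   // max-heap: highest priority first
        }
    };

    // Stand-in for whichever lock-free SPSC queue carries the producer-to-
    // consumer handoff (e.g. the article's linked-list queue). This version
    // is NOT thread-safe; it only sketches the interface the consumer needs.
    template <typename T>
    struct HandoffQueue {
        std::deque<T> q;
        void push(const T& v) { q.push_back(v); }
        bool pop(T& out) {
            if (q.empty()) return false;
            out = q.front();
            q.pop_front();
            return true;
        }
    };

    // Consumer side: drain the shared handoff queue into a private heap,
    // then process events in priority order with no sharing involved at all.
    void consumer_step(HandoffQueue<Event>& incoming,
                       std::priority_queue<Event>& pending) {
        Event e{};
        while (incoming.pop(e))   // take everything the producer has published
            pending.push(e);      // ordinary heap insert, consumer-local

        if (!pending.empty()) {
            Event next = pending.top();
            pending.pop();
            // ... handle 'next' ...
            (void)next;
        }
    }

The sorted structure never needs to be lock-free, because only the consumer touches it; the handoff queue is the only shared part.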
If you're looking for a learning resource, I suggest looking at http://www.rdrop.com/users/paulmck/perfbook/ and http://1024cores.net