While lock-free code gets you part of the way to efficient parallelism (by removing suspension-induced dead time, as he mentions), the performance impact of repeated cache-line invalidation can be a significantly bigger problem.
In many modern processor architectures (x86, for example), a cache-coherence protocol is used to ensure that cache lines provide data according to the memory discipline. On x86, that discipline is Total Store Ordering. (See http://en.wikipedia.org/wiki/Memory_ordering).
This means that if two processors are contending for the same value (e.g., trying to increment it, set it to true, or even read it), they will force the CPU to invalidate the cache line containing that value on all the other cores, leading to massive scalability bottlenecks. If multiple cores are contending for the same location in memory, lock-free or not, performance will suffer.
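To make the cost concrete, here's a minimal C++ sketch (my own illustration, not from the original comment; the counts and thread numbers are arbitrary): the contended version bounces one cache line between all the cores on every increment, while the per-thread version keeps each counter core-local and only combines them at the end.

```cpp
#include <atomic>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

constexpr int kThreads = 8;
constexpr long kItersPerThread = 1'000'000;

// Every increment needs the cache line holding `shared` in exclusive state,
// so the line ping-pongs between cores on every fetch_add.
std::atomic<long> shared{0};

void contended() {
    for (long i = 0; i < kItersPerThread; ++i)
        shared.fetch_add(1, std::memory_order_relaxed);
}

// Each thread accumulates locally and writes out once; no shared line
// is touched inside the hot loop.
void uncontended(long& out) {
    long local = 0;
    for (long i = 0; i < kItersPerThread; ++i) ++local;
    out = local;
}

int main() {
    std::vector<std::thread> ts;
    for (int i = 0; i < kThreads; ++i) ts.emplace_back(contended);
    for (auto& t : ts) t.join();

    std::vector<long> locals(kThreads, 0);
    ts.clear();
    for (int i = 0; i < kThreads; ++i) ts.emplace_back(uncontended, std::ref(locals[i]));
    for (auto& t : ts) t.join();

    long sum = 0;
    for (long v : locals) sum += v;
    std::printf("contended=%ld  per-thread sum=%ld\n", shared.load(), sum);
}
```

Timing either loop (e.g. with std::chrono) typically shows the contended version getting slower as you add threads, while the per-thread version scales.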
More devious is the case of false sharing, where two different values just happen to fall on the same cache line. Even though they don't logically conflict, the processor must still invalidate the line on every core. Modern compilers do their best to prevent this, but sometimes they need a little help.
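A rough sketch of the usual manual fix, assuming 64-byte cache lines (that alignment value is my assumption, not something from the original comment): pad or align the two values so they can't land on the same line.

```cpp
#include <atomic>
#include <thread>

struct Unpadded {                       // a and b very likely share one 64-byte line
    std::atomic<long> a{0}, b{0};
};

struct Padded {                         // each counter gets its own line (assumes 64-byte lines)
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

// Thread 1 hammers a, thread 2 hammers b. They never touch the same
// variable, yet with Unpadded the line still ping-pongs between cores.
template <typename Counters>
void hammer(Counters& c) {
    std::thread t1([&] { for (long i = 0; i < 10'000'000; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (long i = 0; i < 10'000'000; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join(); t2.join();
}

int main() {
    Unpadded u; hammer(u);   // false sharing: one line bounces between two cores
    Padded   p; hammer(p);   // typically measurably faster: independent lines
}
```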
The takeaway is this: don't try to implement your own locks using CAS -- even something as simple as a lock is very hard to get right (performance-wise) once you scale to dozens or hundreds of cores / threads. People have solved this problem (people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf). Writing fast concurrent code (especially lock-free code) is a minefield of weird architectural gotchas. Watch your step.
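For the curious, here's a hedged sketch of the first-order effect behind that advice: spinning with a raw atomic exchange writes to the line on every attempt, so all waiters keep invalidating each other, while a test-and-test-and-set lock at least spins on a plain load. The scalable locks in the linked paper go much further (per-waiter queue nodes); this is only an illustration.

```cpp
#include <atomic>

class NaiveSpinLock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        // Every failed exchange is a write to the line -> constant invalidation
        // across all waiting cores.
        while (locked_.exchange(true, std::memory_order_acquire)) { /* spin */ }
    }
    void unlock() { locked_.store(false, std::memory_order_release); }
};

class TTASSpinLock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        for (;;) {
            // Spin on a read: waiters share the line instead of fighting over it.
            while (locked_.load(std::memory_order_relaxed)) { /* spin */ }
            // Only attempt the write once the lock looks free.
            if (!locked_.exchange(true, std::memory_order_acquire))
                return;
        }
    }
    void unlock() { locked_.store(false, std::memory_order_release); }
};
```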
Actually I didn't claim that lock-free is more efficient - I explicitly said that it isn't necessarily more efficient, though I didn't discuss the reasons that you do discuss; I only said that "lock-based" is defined by the need to deal with blocking upon suspension - not that it's necessarily a bad property in any way, in particular in its efficiency impacts.
I think it's a little too general to recommend not using CAS, as at a low enough level it's useful, but cache lines, as you say, are probably the most important functional issue.
It's possible to implement algorithms that don't depend on the order of data on the memory bus, relying instead on execution order, and such algorithms can therefore be barrier-free -- but cache lines still get synced. Entire lines can be dedicated to single values, a case where performance increases at the cost of effective cache size.
The different SPARC memory-ordering models were always useful for explaining the performance effects of cache coherency and memory barriers.
No, they wouldn't trivialize the problems. Both message passing and immutability pay a performance price in ways that matter for real-world problems.
Example 1: Algorithms that operate on large matrices. Immutability means (1) that you can't have destructive updates of small parts, but need to copy the matrix in its entirety, and (2) that you can't have in-place operations, increasing your memory usage (and thus decreasing L2 cache performance).
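A small illustration of Example 1 (my own sketch, not the commenter's code): the destructive update touches only the affected row and allocates nothing, while the "immutable" update has to copy the whole matrix to change one row.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Mutable version: touches n doubles, no allocation.
void scale_row_in_place(Matrix& m, std::size_t row, double k) {
    for (double& x : m[row]) x *= k;
}

// "Immutable" version: copies all n*n doubles to change one row,
// blowing through memory bandwidth and cache for no algorithmic gain.
Matrix scale_row_copy(const Matrix& m, std::size_t row, double k) {
    Matrix out = m;                 // full copy of the matrix
    for (double& x : out[row]) x *= k;
    return out;
}
```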
Example 2: Caching expensive operations. You want to share these among as many threads as possible to maximize hit rates, but you also need to avoid serialization bottlenecks. This is incompatible with both pure message passing and, say, the standard implementations for immutable maps.
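One common compromise for Example 2, sketched below with made-up names: a cache shared by all threads for hit rate, but split into independently locked shards so there's no single serialization point.

```cpp
#include <functional>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>

class ShardedCache {
    static constexpr std::size_t kShards = 16;   // tune to core count / contention
    struct Shard {
        std::mutex mu;
        std::unordered_map<std::string, std::string> map;
    };
    Shard shards_[kShards];

    Shard& shard_for(const std::string& key) {
        return shards_[std::hash<std::string>{}(key) % kShards];
    }

public:
    std::optional<std::string> get(const std::string& key) {
        Shard& s = shard_for(key);
        std::lock_guard<std::mutex> lk(s.mu);
        auto it = s.map.find(key);
        if (it == s.map.end()) return std::nullopt;
        return it->second;
    }

    void put(const std::string& key, const std::string& value) {
        Shard& s = shard_for(key);
        std::lock_guard<std::mutex> lk(s.mu);
        s.map[key] = value;
    }
};
```

All threads see the same cache (good hit rate), but two threads only contend when their keys hash to the same shard.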
How does one implement message passing without atomic instructions and/or locks? Without some sort of lock I fail to see how one can have mutual exclusion on a message queue.
EDIT: I also fail to see how this is academic; atomic instructions are needed to implement any basic concurrency construct.
The actor model and data immutability are good for programmer productivity and reducing errors, but not for writing highly performant or low-level code. Even then, someone has to write the message queue, and how do you plan to do that without any mutable shared state?
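To underline the point, here's a minimal sketch (mine, illustrative only) of a single-producer/single-consumer channel: even this "message passing" primitive is, underneath, mutable shared state plus atomic loads and stores on the indices.

```cpp
#include <atomic>
#include <cstddef>
#include <optional>

template <typename T, std::size_t N>
class SpscQueue {
    T buf_[N];
    std::atomic<std::size_t> head_{0};  // advanced by the consumer
    std::atomic<std::size_t> tail_{0};  // advanced by the producer
public:
    bool push(const T& v) {             // producer thread only
        std::size_t t = tail_.load(std::memory_order_relaxed);
        std::size_t next = (t + 1) % N;
        if (next == head_.load(std::memory_order_acquire)) return false;  // full
        buf_[t] = v;
        tail_.store(next, std::memory_order_release);  // publish the slot
        return true;
    }
    std::optional<T> pop() {            // consumer thread only
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire)) return std::nullopt;  // empty
        T v = buf_[h];
        head_.store((h + 1) % N, std::memory_order_release);  // release the slot
        return v;
    }
};
```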
There's an enormous assumption here -- the assumption that we are on a shared memory computer to begin with. At the lowest level, memory devices are implemented using "messages" that go over wires, and locks are implemented on top of that.
But it's a feature of small systems that you THINK you have a flat address space and not a message passing system. CPU cores send messages to L1 caches to fetch memory, and the messages go outward to L2 and RAM, all in hardware. With a modern OS kernel this message will often be intercepted and fulfilled with disk IO.
So if you pretend you have a flat address space you won't be miserable or anything, because the entire system (hardware + software) is designed to make it look that way.
Once you get interested in performance details the flat memory space abstraction goes out the window. L1, L2, RAM, and disk latencies are really message passing latencies for messages passed between physically separate devices. Designing your algorithm to minimize the number of messages passed between devices (a.k.a. the number of "cache misses") will improve its performance. Optimizations meant to minimize message passing can have unintended consequences. For example, Linux once allocated memory on the region of RAM closest to the core running the process which requested the memory -- I'm sure it made some benchmarks run faster, but it turned into an absolute disaster for MySQL, because it meant that MySQL would only use one region of RAM. (Google "MySQL NUMA" for the full story. You can probably imagine most of it if you just think about the consequences of mixing on-die memory controllers with multi-socket systems.)
The degree to which the flat memory abstraction applies is related to the size of the system. A single-core embedded microcontroller with 256 bytes of RAM really does have a flat address space, and you can often count processor cycles just by looking at the assembly; the abstraction is basically perfect. A modern desktop with a quad-core processor is going to act less like an ideal shared memory machine, because the message passing overhead of using contested cache lines can affect performance. And a world-class supercomputer might be split into N units of M cards of K chips with L cores on each chip -- the shared memory abstraction will only extend so far before it's all explicit message passing in software.
So the real question is not "how do you plan to write a queue without mutable shared state?" The real question is, "How do you implement mutable shared state using messages?"
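Here's a sketch of that inverse construction (purely illustrative; the channel below is a mutex-and-condvar queue only for brevity, where in hardware the "channel" is wires): one owner thread holds the map, and every other thread reads or writes it only by sending request messages, much like cores issuing loads and stores to a memory controller.

```cpp
#include <condition_variable>
#include <future>
#include <map>
#include <mutex>
#include <queue>
#include <thread>
#include <variant>

struct Put { int key; int value; };
struct Get { int key; std::promise<int> reply; };
struct Stop {};
using Msg = std::variant<Put, Get, Stop>;

class Mailbox {
    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<Msg> q_;
public:
    void send(Msg m) {
        { std::lock_guard<std::mutex> lk(mu_); q_.push(std::move(m)); }
        cv_.notify_one();
    }
    Msg receive() {
        std::unique_lock<std::mutex> lk(mu_);
        cv_.wait(lk, [&] { return !q_.empty(); });
        Msg m = std::move(q_.front());
        q_.pop();
        return m;
    }
};

// The "memory device": the only thread that ever touches the state.
void memory_owner(Mailbox& inbox) {
    std::map<int, int> state;
    for (;;) {
        Msg m = inbox.receive();
        if (std::holds_alternative<Stop>(m)) return;
        if (auto* p = std::get_if<Put>(&m)) state[p->key] = p->value;      // "store"
        else if (auto* g = std::get_if<Get>(&m)) g->reply.set_value(state[g->key]);  // "load"
    }
}

int main() {
    Mailbox box;
    std::thread owner(memory_owner, std::ref(box));

    box.send(Put{1, 42});                 // a store is just a message
    std::promise<int> p;
    auto f = p.get_future();
    box.send(Get{1, std::move(p)});
    int v = f.get();                      // the load completes when the reply arrives
    (void)v;

    box.send(Stop{});
    owner.join();
}
```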
Footnote: I think it's very telling... the shared memory abstraction is so convenient, the work behind the scenes is so good, we almost want to argue and say "Look, it's shared state, and everybody knows that it's shared state." Buddy, it's one fine illusion.
I have struggled to find the right ways to think about the phenomena underlying the shared memory abstractions. The limited visibility makes this harder than understanding other aspects of software. You can run instrumented simulations like cachegrind; you can query arcane CPU performance counters (Intel's Nehalem optimization guide was eye-opening regarding how much goes on inside vs. what you might learn in an undergrad CPU design class) -- but at the end of the day you have to be guided by experimentation and measurement. And so many times those leave us with no good theories to explain the observed behavior.
(War story, feel free to skip: One time we were trying to speed up a datafile parser by any means possible -- which was already split into a producer-consumer thread pair, one thread running the unfortunately complex grammar and producing a stream of mutation commands, the other thread consuming that and building up in-memory state. The engineer working on this found that adding NOPs could speed this up, and he measured and charted a range of # of NOPs and chose the best one. Our best guess was "something to do with memory layout?" The outcome of the story was that we tore out a bunch of abstraction layers and ended up with a simpler, single-threaded parser that didn't need such heroic and bizarre efforts, but it also left us feeling a bit of vertigo with regard to memory hierarchy behavior.)
Your pointing out that the shared memory abstraction is backed by message-passing between hardware components (which each represent concurrent processes) is really interesting - thanks!
Obviously there's no avoiding this, but the author writes in the context of high-level code where OS-provided locking mechanisms are available. Why are we discussing this in a low-level (or embedded, where every ounce of performance matters) context?
You don't need to be embedded to have performance matter - even if your OS gives you concurrency primitives, there are many situations where jumping into kernel code is still "too expensive."
There are many situations in which OS-level primitives such as blocking semaphores are available but userland primitives are not. When implementing libc, for example. Or when implementing a language runtime, or TBB, or boost::thread_pool, or what have you.
We are writing such code because it is an established way of doing things that also happens to be more convenient to a lot of programmers in the field.
I don't know where you heard that, and you should stop repeating it because it's almost gibberish. It's kind of like asking, "Which is faster? A car or a typewriter?" Well, a decent typist can put out 100 WPM on a typewriter, and a decent car can go 60 m/s. So it depends on where you are going. If your goal is an essay, then the typewriter is faster. If your goal is the other side of town, the car is faster.
The big disadvantage of locks is that performance decreases with contention, and performance of a shared-memory system in general degrades as the number of nodes increases. So every supercomputer in recent history uses a hierarchical approach: a network of multi-core units. Shared memory and locks for sharing data with cores in the same unit, message passing for sharing data with other units.
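A rough sketch of that hierarchical pattern (illustrative only; the work loop is a stand-in): plain threads over shared memory within a node, MPI messages between nodes.

```cpp
#include <mpi.h>
#include <algorithm>
#include <numeric>
#include <thread>
#include <vector>

int main(int argc, char** argv) {
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank = 0, nranks = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    // Within the node: shared memory, one thread per core, each accumulating
    // locally and writing out once to avoid contended cache lines.
    const unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<long> partial(nthreads, 0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t)
        workers.emplace_back([&, t] {
            long local = 0;
            for (long i = t; i < 1'000'000; i += nthreads) local += i;  // stand-in work
            partial[t] = local;   // one write per thread
        });
    for (auto& w : workers) w.join();
    long node_sum = std::accumulate(partial.begin(), partial.end(), 0L);

    // Between nodes: a message-passing reduction across the whole machine.
    long global_sum = 0;
    MPI_Allreduce(&node_sum, &global_sum, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```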
Just imagine trying to use system-wide locks on the IBM Sequoia. It has something like a million cores.
That doesn't negate the original point. Message passing is generally slower, even though shared memory mechanisms cannot scale indefinitely. But there are many problems where you cannot afford to pass on the performance gain from shared memory up to that point, even if you have to tie your nodes together using a message passing architecture beyond it.
Also, advances have been made in manycore shared memory systems. The Cray XE6 (the hardware behind, e.g. HECToR [1]) has a hardware accelerated global address space with remote direct memory access that allows PGAS [2] to outperform MPI [3].
By the way, system wide locks are a red herring. At these scales, you avoid global data as much as you can, regardless of what your programming model is.
Part of my point was that if you say "message passing is generally slower" and "shared memory mechanisms cannot scale indefinitely", then the logical conclusion is that problems are generally small. You can't meaningfully compare the speed of these two techniques in abstract any more than you can meaningfully ask if 100 is a "big number".
The comment about global locks was intended to be silly, because comparing locks to message passing without talking about what you're doing with them is also silly.
The linked paper compares MPI message passing to an alternative hardware-accelerated message passing, which is interesting, but the choice of micro-benchmarks is not very exciting. To be clear, while the GP was really comparing the actor model (private memory + message passing) against the shared memory + locks model, I was only responding to the parent comment, and when I think "message passing" I don't automatically think "private memory".
PGAS is not an "alternative hardware-accelerated message passing mechanism", unless you use a definition of message passing that is so expansive that the statement becomes vacuously true. It's distributed shared memory, integrated in the memory hierarchy, which you can manipulate at the same granularity as other memory, which you can have pointers to, etc.
You can have, say, a 100,000 x 100,000 matrix represented as an array over thousands of processors, where each processor can read and write each array element individually.
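This isn't PGAS proper (no UPC or coarrays here), but for readers without a PGAS toolchain, MPI's one-sided window operations give a rough feel for the idea: each rank exposes its block of a distributed array, and any rank can read individual remote elements without the owner participating in the transfer. The block size and access pattern below are made up for illustration.

```cpp
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nranks = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    // Each rank owns one contiguous block of the "global" array.
    const int block = 1024;
    std::vector<double> mine(block, static_cast<double>(rank));

    // Expose the local block so other ranks can address it element by element.
    MPI_Win win;
    MPI_Win_create(mine.data(), block * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    // Element-granularity remote read: fetch element 7 of the next rank's block.
    double remote = 0.0;
    MPI_Win_fence(0, win);
    MPI_Get(&remote, 1, MPI_DOUBLE, (rank + 1) % nranks, /*target_disp=*/7,
            1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```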
Well, my comment was a response to the OP's comment stating that message passing obviates the need for locks completely. It can only be viewed within that context. I think you took my comment out of context and went in a different direction.
I assume you meant to say that locks and message passing address different problems, and I agree, but the GP asserted that message passing would solve the problems locks solve, generally with better results, which is just not true, at least in the performance department.