I have struggled to find the right ways to think about the phenomena underlying the shared memory abstractions. The limited visibility makes this harder than understanding other aspects of software. You can run instrumented simulations like cachegrind; you can query arcane CPU performance counters (Intel's Nehalem optimization guide was eye-opening about how much goes on inside compared with what you might learn in an undergrad CPU design class) -- but at the end of the day you have to be guided by experimentation and measurement. And so often those measurements leave us with no good theory to explain the observed behavior.
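For anyone who hasn't poked at those tools, here's the kind of experiment I mean -- a minimal sketch in C (mine, nothing to do with the system above): walk a big array at a configurable stride, then compare what cachegrind or perf report for different strides. Assuming a 64-byte line, strides of 16 ints and up turn almost every read into a miss even though the instruction count barely changes.

    /* stride.c -- a toy for poking at the cache with cachegrind/perf.
     * Build:   cc -O2 -o stride stride.c
     * Measure: valgrind --tool=cachegrind ./stride 1
     *          perf stat -e cache-references,cache-misses ./stride 16
     * The stride is in ints, so 16 ints = one 64-byte line on most x86 parts.
     */
    #include <stdio.h>
    #include <stdlib.h>

    #define N (1 << 24)                    /* 16M ints, ~64 MB: larger than the caches */

    int main(int argc, char **argv)
    {
        int stride = (argc > 1) ? atoi(argv[1]) : 1;
        int *a = malloc((size_t)N * sizeof *a);
        if (!a) return 1;

        for (size_t i = 0; i < N; i++)
            a[i] = (int)i;                 /* fault the pages in and fill the array */

        long long sum = 0;
        /* Same number of element reads for every stride, so the instruction
         * count stays comparable and only the miss rate moves. */
        for (int rep = 0; rep < stride; rep++)
            for (size_t i = rep; i < N; i += stride)
                sum += a[i];

        printf("stride=%d sum=%lld\n", stride, sum);
        free(a);
        return 0;
    }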
(War story, feel free to skip: One time we were trying to speed up a datafile parser by any means possible. It was already split into a producer-consumer thread pair: one thread ran the unfortunately complex grammar and produced a stream of mutation commands, the other consumed that stream and built up in-memory state. The engineer working on this found that adding NOPs could speed it up, so he measured and charted a range of NOP counts and chose the best one. Our best guess was "something to do with memory layout?" The outcome was that we tore out a bunch of abstraction layers and ended up with a simpler, single-threaded parser that didn't need such heroic and bizarre efforts -- but it left us with a bit of vertigo about memory hierarchy behavior.)
Your pointing out that the shared memory abstraction is backed by message-passing between hardware components (which each represent concurrent processes) is really interesting -- thanks!
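That framing is also something you can see from user space: false sharing is basically those coherence messages showing through. A minimal sketch (mine, assuming 64-byte cache lines) -- two threads bump logically independent counters, and the only thing that changes between the two runs is whether the counters sit on the same line:

    /* false_sharing.c -- coherence traffic made visible.
     * Two threads increment independent counters. When the counters share a
     * 64-byte line, the line ping-pongs between cores via invalidation
     * messages; padding them onto separate lines removes that traffic.
     * Build: cc -std=c11 -O2 -pthread -o false_sharing false_sharing.c
     * Run:   time ./false_sharing 0    # counters on the same line
     *        time ./false_sharing 1    # counters padded apart
     */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define ITERS 200000000L
    #define LINE  64                       /* assumed cache-line size */

    /* Two layouts for the same pair of counters. */
    static _Alignas(LINE) struct { volatile long a, b; } same_line;
    static _Alignas(LINE) struct { volatile long a; char pad[LINE]; volatile long b; } spread;

    static void *bump(void *p)
    {
        volatile long *c = p;
        for (long i = 0; i < ITERS; i++)
            (*c)++;                        /* volatile keeps the stores hitting memory */
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int pad = (argc > 1) && atoi(argv[1]);
        volatile long *x = pad ? &spread.a : &same_line.a;
        volatile long *y = pad ? &spread.b : &same_line.b;

        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump, (void *)x);
        pthread_create(&t2, NULL, bump, (void *)y);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        printf("%s: a=%ld b=%ld\n", pad ? "padded" : "same line", *x, *y);
        return 0;
    }

In my experience the same-line run is noticeably slower, and the difference shows up in the counters as extra cache misses rather than extra instructions -- the "messages" are invisible in the source but very visible in the measurement.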