Good article, but I think it falls into the same trap that all 'smart people' (read: experts) do. I write multi-threaded embedded code in C for a living. It's hard, but it's not that hard... but that's because I do it for a living ALL THE TIME. If someone has only a passing need for this kind of thing, I don't think the arguments in the article hold water.
For people who don't do "This thing I'm an expert at and claim is not so hard" as often as the expert in question, it really is that hard. Things get easier, and people better, with practice. Now, if this is something you plan on using a lot in your job or particular programming field, then you shouldn't be afraid of multi-threaded programming; but if it's going to be a small passing item that you work with rarely, if ever, you should probably use existing libraries / abstractions for this type of thing.
Has the author really written lots of multithreaded code?
Lots of people think they understand multithreading because they understand the synchronization primitives. "I'll just share this synchronized HashMap among my threads and everything will be fine." Not so, because your code makes implicit assumptions about that HashMap all the time: e.g. that you can check whether an element is in the map and, if not, add it (but another thread added it after you checked); or that the map won't change and throw an exception while you iterate over it, because your loop doesn't change it; etc.
I don't think granularity is the hardest thing about multithreaded programming; it's these implicit assumptions. In an imperative language we tend to think about our code as if it were single-threaded and nothing changes until we change it. Then you get rare and irreproducible bugs in your program.
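A minimal sketch of the kind of thing I mean (hypothetical names, using a Collections.synchronizedMap wrapper): every individual call is thread-safe, but the compound operations are not.

    import java.util.*;

    class CheckThenActRace {
        // Each individual call on this map is thread-safe...
        static final Map<String, Integer> counts =
                Collections.synchronizedMap(new HashMap<>());

        static void recordFirstSeen(String key, int value) {
            // ...but this compound "check, then add" is not: another thread
            // can insert the key between containsKey() and put(), and this
            // put() then silently overwrites it.
            if (!counts.containsKey(key)) {
                counts.put(key, value);
            }
        }

        static int sum() {
            int total = 0;
            // Iteration is also unsafe without an explicit lock: if another
            // thread modifies the map mid-loop, this can throw
            // ConcurrentModificationException even though this loop changes nothing.
            for (int v : counts.values()) {
                total += v;
            }
            return total;
        }
    }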
I think I agree with both. The impression I got from the article was that even after you've carefully identified all shared data structures, made sure they are as few as possible, and locked them properly, you still have to decide how to structure the parallelization. That will affect how it scales to a higher # of cores and larger problems, and is, in some sense, a less constrained problem because you don't know where to look. At least wrt synchronization, you can make a list of the shared data and go through it.
The article says: "You can almost always use a simple locked data-structure for synchronization fairly painlessly (and with a bit of practice, you can usually write one fairly painlessly, too)."
If we focus on the first part, in my experience it is wrong: you can't rely on the data structure to do synchronization for you, because these structures only guarantee that they will work consistently internally, not that using them lets you forget about synchronization and threading. Quite often you need to do your own synchronization on top of them, and it is easy to miss some cases. That's what my examples are about.
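To be concrete about "your own synchronization on top of them": the Javadoc for Collections.synchronizedMap already tells you to lock on the map while iterating, and compound actions need the same treatment. A rough sketch, reusing the hypothetical map from above:

    import java.util.*;

    class ExternalLocking {
        static final Map<String, Integer> counts =
                Collections.synchronizedMap(new HashMap<>());

        static void recordFirstSeen(String key, int value) {
            synchronized (counts) {              // guard the whole check-then-act
                if (!counts.containsKey(key)) {
                    counts.put(key, value);
                }
            }
        }

        static int sum() {
            int total = 0;
            synchronized (counts) {              // hold the lock for the whole traversal
                for (int v : counts.values()) {
                    total += v;
                }
            }
            return total;
        }
    }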
I think this is where a pure functional language like Haskell really shines.
Haskell might be a turn off to some people because it doesn't allow the quick and dirty solution for people who know what they're doing, and I'm sure there are those who have been permanently turned off to "protecting the programmer from himself" by such limp-wristed languages as Java, but you've got to admit there's merit to a language where instead of spending 5 years going through the school of hard knocks with production bugs, the programmer is baffled from the get-go and has to go ask an expert right away, resulting in much more solid code from the first version.
> (note: this post is mostly about data-parallel programming, or other parallel optimizations for computational performance purposes. There are other reasons to use concurrency, but I’m not sure my argument will hold in most of those cases.)
Respectfully, I think the usual claim is that concurrent programming is hard, not that parallel programming is hard. So, while the article is interesting in its discussions of the performance trade-offs involved in implementing data-parallel algorithms, it's arguing against a strawman, I think.
Using high-level, no-brainer parallel constructs is kosher (like a pre-built concurrent data structure library written by odd people like Doug Lea), similar to the advice about cryptography. Playing with the atoms yourself is dangerous. You can understand the concepts well, be conscientious in your coding, and still build something the right person can drive a truck through.
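For instance (a hypothetical sketch, not something from the article): with Doug Lea's java.util.concurrent you get both atomic compound operations and thread management pre-built, so you never touch the atoms at all.

    import java.util.concurrent.*;

    class PrebuiltConstructs {
        // The pre-built concurrent map makes check-then-act atomic for you.
        static final ConcurrentMap<String, Integer> counts = new ConcurrentHashMap<>();

        static void recordFirstSeen(String key, int value) {
            counts.putIfAbsent(key, value);      // atomic: no external lock needed
        }

        public static void main(String[] args) throws InterruptedException {
            // Likewise, the executor handles thread lifecycle and work hand-off.
            ExecutorService pool = Executors.newFixedThreadPool(4);
            for (int i = 0; i < 100; i++) {
                final int n = i;
                pool.submit(() -> recordFirstSeen("key-" + (n % 10), n));
            }
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.SECONDS);
            System.out.println(counts.size());   // 10 distinct keys, no races
        }
    }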
How do you verify correctness? How do you test? A bug might only show up once in ten billion loops, perhaps when you push the cores really hard. Or fail on new processors with slightly different barrier/cache/coherency semantics. Or fail in ways that send you down a million heisenbug gullies. Without any proper tools or diagnostics (yet).
This article mostly covers how _performant_ (as in fast / resource-savvy) parallel programming is hard. Robust parallel programming can be tricky too, though, and the way you "[Avoid] race conditions" will have an impact on performance. So the trade-off the article evokes between speed, maintenance, and memory could also take robustness into account.
Exactly. When I see people getting themselves tangled up with parallel programming, it's because they can't accept the performance trade-offs of simple, tractable solutions. 90% of the time, the inability to accept the trade-offs has nothing to do with actual performance measurements and everything to do with groundless assumptions that certain techniques are always way too slow.
The raw performance increase that you can get from parallel processing is, at the absolute maximum, proportional to the number of processors you can add to your program. Usually it is much less.
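As a rough back-of-the-envelope illustration (the usual Amdahl's-law model, not a figure from the article): if p is the fraction of the work that parallelizes and n is the number of processors,

    speedup(n) = 1 / ((1 - p) + p/n)

so even with p = 0.9 and n = 16 you only get 1 / (0.1 + 0.9/16) ≈ 6.4x, nowhere near 16x.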
A raw performance increase is not the most common good reason for using multiple threads or processes in an application, though it's a common bad reason. Achieving a decrease in latency is one common good reason for concurrency on a single machine.
It seems like the author has very muddled notions of the separation between task parallelism and data parallelism. He talks about examples that could be data parallelism, but then uses terminology as though it were task parallelism. His section on race conditions is likewise muddied at best.
His basic argument at the top seems correct if you limit it to data parallelism. But if you read the rest of the article the author doesn't seem to follow the thread of his own argument.
At its core, the notion that data parallelism is easier to debug than control parallelism is obvious. It is, and it's often hardware-supported. But unfortunately data parallelism is harder to write and often requires skill in math, so it's not used nearly as much.
I'm not sure I understand this. Even if you buy the idea that "[y]ou can almost always use a simple locked data-structure for synchronization fairly painlessly," how is the choice of granularity anything more than the choice of what and where to lock? The choice of granularity is a response to the nature of the synchronization problem.
I think that he is correct in general, but his argument could be better constructed. A simpler way to illustrate the issue: any shared-memory model can be mapped to a message-passing model (in parallel computing) with equal efficiency. Since the message-passing model is fairly easy and intuitive, the concern all boils down to finding an optimal parallel solution to a given problem. That step is really hard.
Message passing is easier than using shared-memory correctly, but ensuring that a non-trivial message passing implementation doesn't have any deadlocks is not what I would consider "fairly easy".
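A toy illustration of how you can deadlock with nothing but message passing (a made-up example, using SynchronousQueue as a stand-in for rendezvous-style channels): both sides send before they receive, and with no lock anywhere, both block forever.

    import java.util.concurrent.SynchronousQueue;

    class MessagePassingDeadlock {
        public static void main(String[] args) throws InterruptedException {
            // SynchronousQueue.put() blocks until a partner calls take(),
            // i.e. a rendezvous-style synchronous channel.
            SynchronousQueue<String> toA = new SynchronousQueue<>();
            SynchronousQueue<String> toB = new SynchronousQueue<>();

            Thread a = new Thread(() -> {
                try {
                    toB.put("hello from A");     // blocks: B never reaches its take()
                    System.out.println("A got: " + toA.take());
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });

            Thread b = new Thread(() -> {
                try {
                    toA.put("hello from B");     // blocks too: both send before receiving
                    System.out.println("B got: " + toB.take());
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });

            a.start();
            b.start();
            a.join();   // never returns: a classic message-passing deadlock
            b.join();
        }
    }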
Message passing circumvents locking, but not necessarily synchronization, and I would not describe using it as easy. The difficulty in concurrency is maintaining an understanding of all possible states at any given point, and message passing doesn't fix that.
tl;dr you can deadlock in occam
I haven't yet seen a case where parallelism made the program easier to understand, write, or prove properties about; hence, if programming in general is hard, then parallelism makes a given program as hard or harder. So, yes.
Why don't articles like this ever discuss the need for education on patterns and abstractions around threading such as messaging (via actors), the use of immutability, etc.?
If you were talking about concurrency in general, I would expect to see it a lot. If you were talking about the subset that is data-parallel concurrency, I would not expect to see it much at all.
The tone is rude, yes. It's OK to say that point X was hard to follow for reason Y, but calling the article incomprehensible trash brings little to the discussion.
The list of things in the article that are 'actually hard' are all difficult problems, but I think the article glosses over the most important part: expected behavior.
If you're just trying to do some number crunching across sixteen cores, it might be pretty easy to figure out what the expected result is, since you can compare it with a single-core version of the algorithm in question to make sure your results are right. But what if you look at some real-world applications that are highly parallel?
For example, both World of Warcraft and Aion have suffered from multiple highly visible race conditions in their game mechanics. This might sound simple, but in practice players are able to actually identify those races and find ways to exploit them - sometimes it is possible to exploit them in such a way that the player is able to earn gold or other in-game rewards more rapidly than other players. Since you can often convert in-game rewards into real-world ones, I think you can see where that goes...
The most likely explanation in the above cases is that the people writing the code just didn't have a good understanding of concurrency, but that can't be all there is to it, because if you watch closely, the same concurrency-related issues crop up in games and related software all the time.
A significant issue here is that in complex systems, it can be impossible to distinguish between a race condition and expected behavior. If your game has a button labelled 'win match', and two players press the button, which one should win the match? Do you compare the timestamps on the TCP packets that say they pressed the button, ignoring the impact of network latency? Do you try and compare local time data from their machines, ignoring the impact of clock drift? What if, by some fluke of nature, they both send in a request with the exact same timestamp? In a simple case, you'd never see this, but when you have 11+ million customers playing on a daily/weekly basis, such things can happen.