
While lock-free code gets you part of the way to efficient parallelism (by removing suspension-induced dead time), as he mentions, the performance impact of repeated cache-line invalidation can be a significantly bigger problem.

In many modern processor architectures (x86, for example), a cache-coherence protocol ensures that cache lines serve data according to the architecture's memory-ordering discipline. On x86, that discipline is Total Store Ordering. (See http://en.wikipedia.org/wiki/Memory_ordering.)
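
For the curious, here's a minimal sketch (mine, not from the article) of the classic "store buffering" litmus test that makes TSO visible: each core's store can sit in its store buffer while the following load executes, so r1 == 0 && r2 == 0 is an allowed outcome on x86. Relaxed atomics are used so the hardware behavior shows through; a serious harness would reuse threads rather than spawning a pair per iteration.

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    int main() {
        for (int i = 0; i < 100000; ++i) {
            x.store(0, std::memory_order_relaxed);
            y.store(0, std::memory_order_relaxed);

            std::thread t1([] {
                x.store(1, std::memory_order_relaxed);
                r1 = y.load(std::memory_order_relaxed);
            });
            std::thread t2([] {
                y.store(1, std::memory_order_relaxed);
                r2 = x.load(std::memory_order_relaxed);
            });
            t1.join();
            t2.join();

            if (r1 == 0 && r2 == 0) {  // both loads ran before either store drained
                std::printf("store reordering observed at iteration %d\n", i);
                return 0;
            }
        }
        std::printf("not observed in this run (timing-dependent)\n");
    }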

This means that if two processors are contending for the same value (e.g., trying to increment it, set it to true, or even read it), the CPU is forced to invalidate the cache line containing that value on all the other cores, leading to massive scalability bottlenecks. If multiple cores are contending for the same location in memory, lock-free or not, performance will suffer.
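
As a concrete illustration (the thread and iteration counts are arbitrary): every fetch_add below forces the counter's cache line to bounce between cores in exclusive mode, so adding threads can make this loop slower rather than faster, even though it's lock-free.

    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {
        std::atomic<long> counter{0};
        std::vector<std::thread> threads;

        for (int t = 0; t < 8; ++t) {
            threads.emplace_back([&counter] {
                for (int i = 0; i < 1000000; ++i)
                    counter.fetch_add(1, std::memory_order_relaxed);  // line ping-pongs between cores here
            });
        }
        for (auto& th : threads) th.join();

        std::printf("%ld\n", counter.load());
    }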

More devious is the case of false sharing, where two different values just happen to fall on the same cache line. Even though they don't conflict, the processor must still invalidate the line on every core. Modern compilers do their best to prevent this, but sometimes they need a little help.
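
A sketch of what that looks like, assuming the usual 64-byte lines (the names are illustrative): a and b are logically independent, yet two threads updating them still invalidate each other's caches on every write, because the fields share a line.

    #include <atomic>
    #include <thread>

    struct Counters {
        std::atomic<long> a{0};  // lands on the same cache line as b
        std::atomic<long> b{0};
    };

    int main() {
        Counters c;
        std::thread t1([&] { for (int i = 0; i < 10000000; ++i) c.a.fetch_add(1); });
        std::thread t2([&] { for (int i = 0; i < 10000000; ++i) c.b.fetch_add(1); });
        t1.join();
        t2.join();
    }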

The takeaway is this: don't try to implement your own locks using CAS -- even something as simple as a lock is very hard to get right (performance-wise) when scaling to dozens or hundreds of cores/threads. People have solved this problem (people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf). Writing fast concurrent code (especially lock-free code) is a minefield of weird architecture gotchas. Watch your step.
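
To make the warning concrete, here's the kind of hand-rolled CAS lock being warned against (a sketch of the anti-pattern, not anyone's real code): every failed compare_exchange demands the lock's cache line in exclusive mode, so under contention dozens of cores thrash that one line. Spinning on a plain load first (test-and-test-and-set) helps somewhat; the scalable locks in the linked paper go much further.

    #include <atomic>

    class NaiveSpinLock {
        std::atomic<bool> locked{false};
    public:
        void lock() {
            bool expected = false;
            // Naive: hammer CAS directly; each failed attempt still pulls
            // the cache line to this core in exclusive mode.
            while (!locked.compare_exchange_weak(expected, true,
                                                 std::memory_order_acquire)) {
                expected = false;
                // Less bad: spin read-only until the lock looks free, so
                // waiters share the line instead of fighting over it:
                //   while (locked.load(std::memory_order_relaxed)) { }
            }
        }
        void unlock() { locked.store(false, std::memory_order_release); }
    };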




Actually, I didn't claim that lock-free is more efficient -- I explicitly said that it isn't necessarily more efficient, though I didn't discuss the reasons you do. I only said that "lock-based" is defined by the need to deal with blocking upon suspension -- not that this is necessarily a bad property in any way, in particular in its efficiency impact.


I think recommending against CAS altogether is a little too general -- at a low enough level it's useful -- but cache lines, as you say, are probably the most important issue in practice.

It's possible to implement algorithms that don't depend on the order data appears on the memory bus, relying on execution order instead, and such algorithms can therefore be barrier-free -- but cache lines still get synced. Complete lines can be dedicated to single values, a case where performance increases at the cost of effective cache size (see the sketch below).
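
A sketch of that trade, assuming 64-byte lines (C++17's std::hardware_destructive_interference_size can replace the constant where available): each value is padded out to its own line, so writes to one never invalidate the other -- at the price of the struct occupying two full lines.

    #include <atomic>

    struct PaddedCounters {
        alignas(64) std::atomic<long> a{0};  // gets a whole cache line
        alignas(64) std::atomic<long> b{0};  // gets a whole cache line
    };

    static_assert(sizeof(PaddedCounters) >= 128,
                  "each counter occupies its own 64-byte line");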

The different SPARC memory models (TSO, PSO, and RMO) were always useful for explaining the performance effects of cache coherency and memory barriers.


All this and more at http://www.1024cores.net/



