In order to do work stealing in general you just need to organize your computations into tasks. Having a language that supports closures (or something similar) makes it easier, but it is not necessary: you can just use function pointers to define tasks (plus some kind of struct to pass the arguments).
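For example, a task in that scheme can be as small as this (a minimal sketch; Task, SumArgs, and sum_task are made-up names for illustration):

    #include <cstddef>

    // A "task" is just a function pointer plus an opaque args pointer.
    struct Task {
        void (*fn)(void*);  // the computation
        void* args;         // points at an argument struct like SumArgs
    };

    struct SumArgs { const int* data; std::size_t n; long result; };

    void sum_task(void* p) {
        SumArgs* a = static_cast<SumArgs*>(p);
        a->result = 0;
        for (std::size_t i = 0; i < a->n; ++i) a->result += a->data[i];
    }

    // A worker that steals a Task t just calls t.fn(t.args).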
I've been using g++-4.6 lately, and with the new lambdas I'm migrating more and more to this async-task application model.
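With -std=c++0x the argument struct disappears into the closure. A sketch (the tasks vector is a made-up stand-in for whatever queue the scheduler actually uses):

    #include <cstddef>
    #include <functional>
    #include <vector>

    // Hypothetical task queue; std::function type-erases the closure.
    std::vector<std::function<void()>> tasks;

    void submit_sum(const int* data, std::size_t n, long* out) {
        tasks.push_back([=] {  // the capture replaces the args struct
            long r = 0;
            for (std::size_t i = 0; i < n; ++i) r += data[i];
            *out = r;
        });
    }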
One point that seems frequently understated amid the lamentations about the absence of a true double-pointer-width DCAS is that on machines with a 64-bit CAS (e.g. x86 and x64) you can still implement something useful: CAS on pairs of 32-bit handles.
2^32 concurrent work items "ought to be enough for anybody", right? :-) Well, it might be useful to those of us with less than 100 GB of RAM.
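Something like this, assuming your work items live in a pool indexed by 32-bit handles and you have C++11 <atomic> (Pair32, pack, and cas_pair are made-up names):

    #include <atomic>
    #include <cstdint>

    // Two 32-bit handles packed into one 64-bit word (e.g. {head index,
    // ABA tag}), so an ordinary 64-bit CAS behaves like a DCAS on the pair.
    struct Pair32 { std::uint32_t lo, hi; };

    inline std::uint64_t pack(Pair32 p) {
        return (std::uint64_t(p.hi) << 32) | p.lo;
    }

    std::atomic<std::uint64_t> word(0);

    bool cas_pair(Pair32 expected, Pair32 desired) {
        std::uint64_t e = pack(expected);
        return word.compare_exchange_strong(e, pack(desired));
    }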
In the paper you linked they don't seem to be able to perform a priority-queue op in under about 1 µs. Even when there are only 15 threads running on an exotic 29-core machine, they can only handle 750e3 ops/second (with no workload attached to each op). (This is consistent with a benchmark I did on Boost.ASIO a while back.)
So tasks should be sized to take at least (thread count) × 9 µs to complete to keep the priority-queue overhead under about 10%: one queue op costs roughly 1 µs of serialized queue time, with N threads contending that is effectively N µs per task, and a task needs to run about 9× its overhead for that overhead to stay at 10%. The best cases for the lock-free algorithm might let you bring that down to about 3 µs in theory. I'm not sure that justifies the additional complexity.
This survey paper http://www.zurich.ibm.com/pdf/sys/adv_messaging/shortConcurr... covers a lot of ground, and then at the end Figure 4 shows a simple coarse-locked sequential structure delivering double the performance once there is actual load involved and significant preemption. That is the opposite of what I'd expected, which was that occasional preemption of the thread holding the coarse lock would cause a slowdown.
In my investigations I did find the overhead to be fairly high, and in most of my benchmarks I found that using simple fixed-size FIFO queues was significantly faster. However, my benchmarks were generally simple CPU-bound tasks; you would definitely want to explore more varied workloads. Also, since you are using a priority-based scheduling algorithm, presumably the priorities are significant, and therefore sacrificing some throughput may be worth it.
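For reference, the kind of "simple fixed-size FIFO" I mean is roughly this shape (a sketch, one coarse lock around a ring buffer; FixedFifo is a made-up name):

    #include <cstddef>
    #include <mutex>

    // Fixed-capacity ring buffer guarded by a single mutex.
    // No dynamic allocation; push/pop fail instead of blocking.
    template <typename T, std::size_t N>
    class FixedFifo {
        T buf_[N];
        std::size_t head_, tail_, count_;
        std::mutex m_;
    public:
        FixedFifo() : head_(0), tail_(0), count_(0) {}

        bool push(const T& v) {
            std::lock_guard<std::mutex> g(m_);
            if (count_ == N) return false;   // full
            buf_[tail_] = v;
            tail_ = (tail_ + 1) % N;
            ++count_;
            return true;
        }

        bool pop(T& out) {
            std::lock_guard<std::mutex> g(m_);
            if (count_ == 0) return false;   // empty
            out = buf_[head_];
            head_ = (head_ + 1) % N;
            --count_;
            return true;
        }
    };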