In order to do work stealing in general you just need to organize your computations into tasks. Having a language that supports closures (or something similar) makes it easier, but it is not necessary: you can just use function pointers to define tasks (plus some kind of struct to pass the arguments).
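For example, a task in that scheme can be as small as this (a minimal sketch; Task, SumArgs, and sum_task are made-up names for illustration):

    #include <cstddef>

    // A "task" is just a function pointer plus an opaque args pointer.
    struct Task {
        void (*fn)(void*);  // the computation
        void* args;         // points at an argument struct like SumArgs
    };

    struct SumArgs { const int* data; std::size_t n; long result; };

    void sum_task(void* p) {
        SumArgs* a = static_cast<SumArgs*>(p);
        a->result = 0;
        for (std::size_t i = 0; i < a->n; ++i) a->result += a->data[i];
    }

    // A worker that steals a Task t just calls t.fn(t.args).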
I've been using g++-4.6 lately, and with the new lambdas I'm migrating more and more to this async-task application model.
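With -std=c++0x the argument struct disappears into the closure. A sketch (the tasks vector is a made-up stand-in for whatever queue the scheduler actually uses):

    #include <cstddef>
    #include <functional>
    #include <vector>

    // Hypothetical task queue; std::function type-erases the closure.
    std::vector<std::function<void()>> tasks;

    void submit_sum(const int* data, std::size_t n, long* out) {
        tasks.push_back([=] {  // the capture replaces the args struct
            long r = 0;
            for (std::size_t i = 0; i < n; ++i) r += data[i];
            *out = r;
        });
    }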
One point that seems frequently understated amid the lamentations about the absence of a true double-pointer-width DCAS is that on machines with a 64-bit CAS (e.g. x86 and x64) you can still implement something useful: CAS on pairs of 32-bit handles.
2^32 concurrent work items "ought to be enough for anybody", right? :-) Well, it might be useful to those of us with less than 100 GB of RAM.
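Something like this, assuming your work items live in a pool indexed by 32-bit handles and you have C++11 <atomic> (Pair32, pack, and cas_pair are made-up names):

    #include <atomic>
    #include <cstdint>

    // Two 32-bit handles packed into one 64-bit word (e.g. {head index,
    // ABA tag}), so an ordinary 64-bit CAS behaves like a DCAS on the pair.
    struct Pair32 { std::uint32_t lo, hi; };

    inline std::uint64_t pack(Pair32 p) {
        return (std::uint64_t(p.hi) << 32) | p.lo;
    }

    std::atomic<std::uint64_t> word(0);

    bool cas_pair(Pair32 expected, Pair32 desired) {
        std::uint64_t e = pack(expected);
        return word.compare_exchange_strong(e, pack(desired));
    }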
In the paper you linked they don't seem to be able to perform a priority-queue op in under about 1 µs. Even when there are only 15 threads running on an exotic 29-core machine, they can only handle 750e3 ops/second (with no workload attached to each op). (This is consistent with a benchmark I did on Boost.ASIO a while back.)
So tasks should be sized to take at least (thread count) × 9 µs to complete to keep the priority-queue overhead under about 10%: one queue op costs roughly 1 µs of serialized queue time, with N threads contending that is effectively N µs per task, and a task needs to run about 9× its overhead for that overhead to stay at 10%. The best cases for the lock-free algorithm might let you bring that down to about 3 µs in theory. I'm not sure that justifies the additional complexity.
This survey paper http://www.zurich.ibm.com/pdf/sys/adv_messaging/shortConcurr... covers a lot of ground, and then at the end Figure 4 shows a simple coarse-locked sequential structure delivering double the performance once there is actual load involved and significant preemption. That is the opposite of what I'd expected, which was that occasional preemption of the thread holding the coarse lock would cause a slowdown.
In my investigations I did find the overhead to be fairly high, and in most of my benchmarks I found that using simple fixed-size FIFO queues was significantly faster. However, my benchmarks were generally simple CPU-bound tasks; you would definitely want to explore more varied workloads. Also, since you are using a priority-based scheduling algorithm, presumably the priorities are significant, and therefore sacrificing some throughput may be worth it.
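For reference, the kind of "simple fixed-size FIFO" I mean is roughly this shape (a sketch, one coarse lock around a ring buffer; FixedFifo is a made-up name):

    #include <cstddef>
    #include <mutex>

    // Fixed-capacity ring buffer guarded by a single mutex.
    // No dynamic allocation; push/pop fail instead of blocking.
    template <typename T, std::size_t N>
    class FixedFifo {
        T buf_[N];
        std::size_t head_, tail_, count_;
        std::mutex m_;
    public:
        FixedFifo() : head_(0), tail_(0), count_(0) {}

        bool push(const T& v) {
            std::lock_guard<std::mutex> g(m_);
            if (count_ == N) return false;   // full
            buf_[tail_] = v;
            tail_ = (tail_ + 1) % N;
            ++count_;
            return true;
        }

        bool pop(T& out) {
            std::lock_guard<std::mutex> g(m_);
            if (count_ == 0) return false;   // empty
            out = buf_[head_];
            head_ = (head_ + 1) % N;
            --count_;
            return true;
        }
    };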