It's all about better L3 utilization. Cilk's work-stealing scheduler makes good use of L1 and L2, i.e. the core-local caches, but is suboptimal for L3, i.e. the cache that is shared between cores.
The presentation posted here cites a paper that compares the cache behavior of the model it uses (parallel depth-first scheduling) against work stealing, but I haven't had time to read it yet.
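
To make the intuition concrete, here is a toy sketch of the difference. It is not taken from the presentation or the cited paper, and it is heavily idealized: tasks are just leaf indices in sequential (depth-first) order, work stealing is assumed to have settled into each core owning one contiguous chunk, and parallel depth-first scheduling is assumed to always run the earliest ready tasks. The function names (`work_stealing_schedule`, `pdf_schedule`, `max_spread`) are made up for illustration.

```python
# Toy model only: tasks are leaf indices 0..n_tasks-1 in sequential order.
# Assumption: n_tasks is a multiple of n_cores, and tasks have no dependencies.

def work_stealing_schedule(n_tasks, n_cores):
    """Idealized steady state of work stealing: each core works through
    its own contiguous chunk, so concurrent tasks are far apart."""
    chunk = n_tasks // n_cores
    return [[core * chunk + step for core in range(n_cores)]
            for step in range(chunk)]

def pdf_schedule(n_tasks, n_cores):
    """Idealized parallel depth-first: each step runs the n_cores
    earliest tasks in sequential order, so concurrent tasks are adjacent."""
    return [list(range(start, start + n_cores))
            for start in range(0, n_tasks, n_cores)]

def max_spread(schedule):
    """Largest distance between task indices that run at the same time;
    a rough proxy for how much the cores' working sets can overlap."""
    return max(max(step) - min(step) for step in schedule)

if __name__ == "__main__":
    print("work stealing:", max_spread(work_stealing_schedule(16, 4)))
    print("depth first:  ", max_spread(pdf_schedule(16, 4)))
```

In this model each work-stealing core stays inside its own chunk, which suits private L1/L2, while the depth-first cores run neighbouring tasks at the same time, which is the case where a shared L3 could plausibly hold one common working set. Real schedulers and real workloads are far messier than this.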