This is cool. For anyone who has cascading experience: what tuning can you do for the Hadoop jobs/does it autotune? How does performance compare to running multiple MapReduce jobs in sequence?
It would be awesome to see this compared to Microsoft's Dryad (http://research.microsoft.com/research/sv/Dryad/) which also supports DAG-like large scale computing. I don't think Dryad is publicly available though...
I'd be interested in learning what computations they do that "require up to TEN MapReduce jobs to execute in sequence". As a point of comparison, the Google MapReduce paper from 2004 says that their production web search indexing system "runs as a sequence of five to ten MapReduce operations." [labs.google.com/papers/mapreduce-osdi04.pdf]
I routinely run composite jobs of thousands of map-reduces. Imagine an iterative machine learning algorithm where each epoch is a map-reduce job. Imagine a meta-job that runs dozens of these. It's not too bad with Hadoop's job control system; a rough sketch of what that loop looks like is below. Where Cascading really would shine (haven't tried it) is when these jobs require data joins.
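For the curious, here's roughly what I mean. This is a minimal, hypothetical driver sketch (names like `IterativeDriver`, `EpochMapper`, and `EpochReducer` are placeholders, and it uses the newer `org.apache.hadoop.mapreduce` API rather than the old `mapred` one): each epoch is a separate MapReduce job, and the output path of one epoch becomes the input path of the next.

    // Hypothetical sketch: one MapReduce job per training epoch,
    // chained by feeding each epoch's output back in as the next input.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class IterativeDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path input = new Path(args[0]);
            int epochs = Integer.parseInt(args[1]);

            for (int i = 0; i < epochs; i++) {
                Path output = new Path(args[0] + "-epoch-" + (i + 1));

                Job job = Job.getInstance(conf, "epoch-" + (i + 1));
                job.setJarByClass(IterativeDriver.class);
                job.setMapperClass(EpochMapper.class);   // placeholder mapper
                job.setReducerClass(EpochReducer.class); // placeholder reducer
                FileInputFormat.addInputPath(job, input);
                FileOutputFormat.setOutputPath(job, output);

                if (!job.waitForCompletion(true)) {
                    System.exit(1); // abort the whole chain if one epoch fails
                }
                input = output; // this epoch's output is the next epoch's input
            }
        }
    }

The meta-job that runs dozens of these is just an outer loop (or a scheduler) around drivers like this one.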
Edit: This could be a case of map-reduce fail(tm). But I don't think so. I imagine Google's PageRank computation takes more than ten iterations to converge, and each iteration is a map-reduce job.
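To make the PageRank point concrete, here's a hedged sketch of what a single iteration might look like as a map-reduce job (the `rank<TAB>outlink1,outlink2,...` value format and class names are my own illustration, not Google's actual implementation): the mapper spreads each node's rank across its outlinks, the reducer sums the incoming contributions and applies the damping factor, and the driver loop above would run this until the ranks stop changing.

    // Hypothetical single PageRank iteration. Assumes KeyValueTextInputFormat
    // with values of the form "rank<TAB>outlink1,outlink2,...".
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class PageRankIteration {
        public static class PRMapper extends Mapper<Text, Text, Text, Text> {
            @Override
            protected void map(Text node, Text value, Context ctx)
                    throws IOException, InterruptedException {
                String[] parts = value.toString().split("\t");
                double rank = Double.parseDouble(parts[0]);
                String outlinks = parts.length > 1 ? parts[1] : "";
                String[] links = outlinks.isEmpty() ? new String[0] : outlinks.split(",");

                // Pass the link structure through so the reducer can re-emit it.
                ctx.write(node, new Text("LINKS\t" + outlinks));
                for (String link : links) {
                    // Spread this node's rank evenly across its outlinks.
                    ctx.write(new Text(link), new Text(Double.toString(rank / links.length)));
                }
            }
        }

        public static class PRReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text node, Iterable<Text> values, Context ctx)
                    throws IOException, InterruptedException {
                double sum = 0.0;
                String outlinks = "";
                for (Text v : values) {
                    String s = v.toString();
                    if (s.startsWith("LINKS\t")) {
                        outlinks = s.substring(6);      // recovered link structure
                    } else {
                        sum += Double.parseDouble(s);   // rank contribution
                    }
                }
                double newRank = 0.15 + 0.85 * sum;     // standard damping factor
                ctx.write(node, new Text(newRank + "\t" + outlinks));
            }
        }
    }

Ten or more of these back to back, plus the bookkeeping around dangling nodes and convergence checks, and you're well past "ten MapReduce jobs in sequence".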