I'd like to get in a plug for my personal favourite Scala / Hadoop productivity framework, Scoobi.[0] If you're already familiar with Scala, Scoobi has the more familiar and intuitive API, and it leverages the type system to provide stronger guarantees before the code is even run. (For example, if Cascading / Scalding don't know how to serialize your data you'll get a runtime error; in Scoobi, it complains at compile-time. This is really useful when your compile / deploy / run / error loop may take several minutes...) I've also run into a few bugs in Scalding, while I've found Scoobi to be much more solid.
OTOH, Scalding has the edge in terms of community size and ecosystem. I haven't found this to be a big issue -- shimming an existing Hadoop input / output to work with either project is quite simple -- but YMMV.
Yeah, for those same reasons I much prefer using Scalding's typed API [1], which feels very similar to Scoobi. The tuple API shown in these slides is great for places like Etsy that already have a large investment in Cascading, but otherwise you're better off getting the added type safety and similarity to the standard Scala API.
I'm familiar with the typed API, but it still doesn't quite bring me as far as Scoobi does. I recognize your name from the Scalding code, so I'll say that this is meant as helpful criticism and not a complaint.*
- There are a couple of types for datasets in the Scalding API: TypedPipe, plus KeyedList and its subclasses. Scoobi subsumes both under DList; thanks to the usual Scala wizardry, it has all the methods for operating on key-value pairs without losing type safety. This isn't a huge deal, but it removes the small pain of constantly converting back and forth between the two.
- Scoobi's other abstraction, DObject, represents a single value. These are usually created by aggregations or as a way to expose the distributed cache, and have all the operations you'd expect when joining them together or with full datasets. You can emulate this in Cascading / Scalding, but it's a bit less explicit and more error-prone.
- There's no equivalent to the compile-time check for serialization in Scalding, AFAICT.
- Scoobi has fewer opinions about the job runner itself... there are some helpers for setting up the job, but all features are available as a library. For whatever reason, I found the two harder to separate in Scalding?
- IIRC, Scalding did job setup by mutating a Cascading object that was available implicitly in the Job. In Scoobi, you build up an immutable data structure describing the computation and hand that to the compiler. This suits my sense of aesthetics better, I suppose...
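To make the DList point concrete, here's a rough word-count sketch in both styles (from memory of the two APIs, so treat the exact method names as approximations):

```scala
// Scalding typed API (sketch): key-value operations live on Grouped,
// so you hop between TypedPipe and Grouped explicitly.
import com.twitter.scalding._

val words: TypedPipe[String] = TypedPipe.from(TextLine("input.txt"))
val counts = words
  .map(w => (w, 1L))
  .group        // TypedPipe[(String, Long)] -> Grouped[String, Long]
  .sum          // aggregate per key
  .toTypedPipe  // back to TypedPipe[(String, Long)] for further maps

// Scoobi (sketch): everything stays a DList; the pair methods are
// added implicitly when the element type is a tuple.
// import com.nicta.scoobi.Scoobi._
// val words: DList[String] = fromTextFile("input.txt")
// val counts: DList[(String, Long)] =
//   words.map(w => (w, 1L)).groupByKey.map { case (w, vs) => (w, vs.sum) }
```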
* Also, thanks to you guys for Algebird! That's a really fantastic little project, and I use it all the time.
1) Scalding has a DObject-like type: ValuePipe[+T].
2) The reason you must explicitly call .group to go to a keyed type is that a shuffle is costly; this makes it clear to people when they're triggering one. If you don't like that, make an implicit def from TypedPipe[(K, V)] to Grouped[K, V].
3) You can easily use Scalding as a library, but most examples use our default runner. We use it as a library in Summingbird. But you're right, a doc showing people what to do might help (hint: set up an implicit FlowDef and Mode, do your Scalding code, then call a method to run the FlowDef).
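The implicit def suggested in 2) might look something like this sketch (Grouped needs an Ordering on the key in Scalding's typed API):

```scala
import com.twitter.scalding.typed.{ Grouped, TypedPipe }

// Sketch: auto-convert a pipe of pairs into its grouped form on demand.
// Note this hides the shuffle boundary, which is exactly what the
// explicit .group is trying to make visible -- use with care.
object AutoGroup {
  implicit def pairsToGrouped[K: Ordering, V](
      pipe: TypedPipe[(K, V)]): Grouped[K, V] =
    pipe.group
}
```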
1) Ah, the ValuePipe is (relatively) new; thanks for the pointer.
2) You have to explicitly `.group` in Scoobi as well; it transforms a DList[(K, V)] into a DList[(K, Iterable[V])] or similar. You don't have to call `.toTypedPipe` to get map and friends back, though, since it's still just a DList.
3) I've actually written this exact integration, so I'm glad it's the approved method! The global, mutable Mode made me nervous, IIRC.
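For reference, the library-style integration hinted at above (implicit FlowDef and Mode, then run the FlowDef) looks roughly like this; it's a sketch from memory, and the exact run call varies across Scalding versions:

```scala
import cascading.flow.FlowDef
import com.twitter.scalding._

// Sketch: drive Scalding without the default Job runner. Typed-API
// writes record themselves onto the implicit FlowDef; the Mode says
// where to execute (local vs. Hadoop).
implicit val flowDef: FlowDef = FlowDef.flowDef()
implicit val mode: Mode = Local(true) // strict local mode

TypedPipe.from(TextLine("input.txt"))
  .map(_.length)
  .write(TypedTsv[Int]("output"))

// Then hand the FlowDef to Cascading to actually run it, e.g. via
// mode.newFlowConnector(...).connect(flowDef).complete()
```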
The global Mode is gone in 0.9.0. And there is an implicit from Grouped to TypedPipe, so you don't need to call .toTypedPipe (that direction seems less likely to cause problems, especially given we have mapValues and filter on Grouped, so we can avoid needlessly leaving the Grouped representation).
I'm really excited to see scoobi listed here as it's something I have really grown to love over the last couple of years. I find that it's very flexible and end up using it for jobs that do more unconventional mapred things.
[0] https://github.com/NICTA/scoobi