My experience with Hadoop tells me that it's great for all counting tasks. It makes a lot of sense that the MapReduce model it implements was designed at Google to construct posting lists for their search index. Beyond this sweet spot, it gets really tricky to map your solution to a map-reduce task. The programmer has to rely more on the fact that map/reduce are Java black boxes to express everything he needs. Hadoop's big victory is the scale it operates on.
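To make the "counting sweet spot" concrete, here is a minimal single-process sketch of building posting lists in the map/shuffle/reduce style. It is plain Python, not Hadoop code; the function names (`map_phase`, `shuffle`, `reduce_phase`, `build_index`) and the sample documents are made up for illustration.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair per distinct word in a document."""
    for term in set(text.lower().split()):
        yield term, doc_id

def shuffle(pairs):
    """Group emitted values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(term, doc_ids):
    """Turn the grouped document ids into a sorted posting list."""
    return term, sorted(doc_ids)

def build_index(docs):
    """Run map, shuffle and reduce in sequence over an in-memory corpus."""
    pairs = (pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text))
    return dict(reduce_phase(term, ids) for term, ids in shuffle(pairs).items())

# Hypothetical two-document corpus.
docs = {1: "hadoop counts words", 2: "storm streams words"}
index = build_index(docs)
```

Each map call looks at one document and each reduce call at one term, so there is no cross-talk between tasks. That is what makes this kind of job easy to express.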
Are people here waiting for a "SQL-like-declarative-query-language + Hadoop" hybrid, which lets one write declarative processes to run on large amounts of data? Academia seems to be very motivated to produce something like this.
Yes, it is a similar programming model. Some differences (please put "to the best of my knowledge" in front of all of these):
- Storm does not allow arbitrary state in operators (what Nathan Marz calls "bolts"). This makes implementing the runtime easier, such as being able to replay tuple sends for fault tolerance, but it limits what kinds of applications one can make. Yes, I'm on board with the idea that we should avoid mutable state as much as possible, but people who build real applications want it. Yet, fault tolerance in our system requires more work, so it's a trade-off.
- Storm programs are implemented in Java. Streams applications are implemented in our programming language, which has the rather pedestrian name Streams Programming Language, but usually just SPL. This may seem minor, but it's a big deal. Marz is working on a higher level language in Clojure. Implementing programs in a higher level language enables developers to abstract away many issues related to high performance, distributed systems. I compare it to the difference between writing assembly code and writing C code. (Or the difference between writing Python code and writing C code.) The code that we generate is similar in principle to how one writes a Storm application. Which brings me to...
- Storm runs on the JVM; we generate C++ code which gets compiled.
Neither Storm nor Streams is the first or only system in this area. Stream programming is also popular for hardware, but that is usually synchronous and if there's state, it's shared-memory. Storm and Streams are distributed and asynchronous. There are academic distributed streaming systems such as Borealis. The research name for Streams is System S, and there are many academic papers about it, or that use it as a platform for other research: http://dl.acm.org/results.cfm?h=1&cfid=66087472&cfto...
You can have any state you want in bolts. What Storm does not provide is a persistence mechanism for that state. For that, you can just use an external database that knows how to handle distributed state and the associated tradeoffs (such as Riak or Cassandra).
The "state spout" abstraction, a future feature for Storm, will alleviate the performance problems with using an external database. In the meantime, though, smart use of batching/checkpointing is sufficient for most applications.
Also, Storm topologies can be written in any language. Storm has great multi-language support.
I do agree though that higher level abstractions are important. That will come later, once we're confident that we've mastered the primitives for doing fault-tolerant realtime computation.
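As a rough illustration of the batching/checkpointing idea mentioned above, the sketch below buffers counts in memory and writes them to an external store only every few tuples. This is not Storm code: `BatchingCounter` is a hypothetical class, and a plain dict stands in for a database such as Riak or Cassandra.

```python
from collections import Counter

class BatchingCounter:
    """Hypothetical helper: accumulate counts locally, flush them in batches."""

    def __init__(self, store, batch_size):
        self.store = store            # stand-in for an external database (a dict here)
        self.batch_size = batch_size  # number of tuples to buffer before writing
        self.pending = Counter()      # counts not yet written to the store
        self.buffered = 0

    def add(self, key, n=1):
        self.pending[key] += n
        self.buffered += 1
        if self.buffered >= self.batch_size:
            self.flush()

    def flush(self):
        """Checkpoint: merge all pending counts into the store in one pass."""
        for key, n in self.pending.items():
            self.store[key] = self.store.get(key, 0) + n
        self.pending.clear()
        self.buffered = 0

store = {}
counter = BatchingCounter(store, batch_size=3)
for user in ["alice", "bob", "alice", "bob"]:
    counter.add(user)
```

After the four tuples above, the store has seen one flush and the fourth tuple is still pending. Anything buffered between flushes is lost if the process dies, which is the trade-off batching makes against per-tuple writes.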
Storm does not allow arbitrary state in operators (what Nathan Marz calls "bolts").
Maybe I misunderstand your point, but Storm does allow state in Bolts - a Bolt is just a Java object, so it can have member variables. That's how aggregation (e.g. counting events per user) is done. Of course, if you want a Bolt that scales horizontally, you need to account for the state being split across several instances of the Bolt class; and if you need the state to survive a restart, you need to keep it in an external database instead of in the object's memory.
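A minimal sketch of the "bolt is just an object with member variables" point, in plain Python rather than the real Storm API. `CountingBolt` and its `execute` method are hypothetical names, and tuples are assumed to be `(user, num_events)` pairs.

```python
from collections import Counter

class CountingBolt:
    """Hypothetical stand-in for a bolt; not the Storm API."""

    def __init__(self):
        # In-memory state: lives only as long as this object does.
        self.counts = Counter()

    def execute(self, tup):
        """Fold one (user, num_events) tuple into the running per-user count."""
        user, num_events = tup
        self.counts[user] += num_events
        return user, self.counts[user]

bolt = CountingBolt()
for tup in [("alice", 1), ("bob", 2), ("alice", 3)]:
    bolt.execute(tup)
```

The counts disappear on restart, and with several instances each one holds only part of the picture. Those are the two caveats raised above.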
Then I stand corrected. Is there any means to declare partitioned state that the runtime then handles for the user, or does the user always have to manage it themselves? This is one of those things that is, I think, easier to do declaratively one level up in abstraction. (You can see examples of this by looking at invocations of our Aggregate operator in the above documents.)
I had assumed there was no arbitrary state because of the replay semantics. Let's say bolt A sends tuples to bolt B. B has internal state. A sends tuples t1, t2, t3 and t4. A receives acknowledgements that t1, t3 and t4 were processed. So t2 needs to be replayed. But the semantics of what that means is undefined - B has internal state that already incorporates, for certain, t3 and t4, and maybe t2. (While it's unlikely, you never know where a tuple got lost.) So replaying t2 is problematic - do you just blindly replay it, and allow potentially broken semantics? The alternative is to do rollback, which is quite hairy.
partitioned state that the runtime then handles for the user
Yes, when "wiring up" a bolt to a spout or another bolt, you can say that it should receive tuples grouped by one or more of the tuple fields. (e.g. tuples [user, num_events] grouped by "user".) Then Storm takes care of the consistent hashing needed to ensure that each instance of the bolt receives all of the tuples in a given group (e.g. instance 1 gets all events for user 42, instance 2 gets all events for user 53).
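The grouping described above can be sketched as simple hash partitioning: hash the grouping field, take it modulo the number of bolt instances, and deliver the tuple there. This is an illustrative assumption, not a claim about how Storm implements it; `task_for` and `route` are hypothetical names.

```python
import zlib

def task_for(key, num_tasks):
    """Pick a bolt instance from a stable hash of the grouping field."""
    # crc32 is used because Python's built-in hash() of strings varies per process.
    return zlib.crc32(str(key).encode("utf-8")) % num_tasks

def route(tuples, num_tasks):
    """Deliver each (user, num_events) tuple to the inbox chosen by its user field."""
    inboxes = [[] for _ in range(num_tasks)]
    for user, num_events in tuples:
        inboxes[task_for(user, num_tasks)].append((user, num_events))
    return inboxes

# Hypothetical stream of (user, num_events) tuples.
stream = [(42, 1), (53, 2), (42, 5), (53, 1), (42, 3)]
inboxes = route(stream, num_tasks=2)
```

Whichever instance a user hashes to, every tuple for that user lands in the same inbox, so per-user state in that instance stays complete.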
I had assumed there was no arbitrary state because of the replay semantics.
Storm itself doesn't implement replay; it relies on the ability of an external event source (e.g. a message queue) to replay tuples if needed. It just provides the ability to notify the source of whether replay is needed (by tracking the tuples as they flow through the topology). So whether or not you replay tuples is an application choice.
But yes, if you have stateful bolts and replay then you do need to make sure that processing a given tuple is idempotent.
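One common way to make a stateful bolt safe under replay is to remember which tuple ids it has already applied. The sketch below replays t2 from the earlier example. It assumes every tuple carries a unique id, which is an assumption of this sketch; `IdempotentSumBolt` is a hypothetical class, not part of Storm.

```python
class IdempotentSumBolt:
    """Hypothetical bolt that ignores tuples it has already applied."""

    def __init__(self):
        self.total = 0
        self.seen = set()  # ids of tuples already folded into self.total

    def execute(self, tuple_id, value):
        """Apply the tuple once; return False if it was a duplicate."""
        if tuple_id in self.seen:
            return False
        self.seen.add(tuple_id)
        self.total += value
        return True

bolt = IdempotentSumBolt()
for tuple_id, value in [("t1", 10), ("t2", 20), ("t3", 30), ("t4", 40)]:
    bolt.execute(tuple_id, value)

# The ack for t2 was lost, so the source replays it; the bolt skips it.
applied = bolt.execute("t2", 20)
```

The `seen` set grows without bound here. A real application would need to expire old ids or keep them in the external store alongside the state.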
Very cool. Is it possible to add operators to Streams? I commonly run into the problem of batch resizing a lot of images. If there were an easy way to integrate imagemagick as an operator into a system which can push this task to different cores, that would be a big big win.
Yes, user-defined operators are a big part of the design of the system. But, as noted below, this is software that IBM sells, and right now we're targeting large companies.
It already exists, is free, and is widely deployed. It's also actively maintained, mainly by a lot of Facebook folks, and I've heard that they know a thing or two about scale:
Check out Hive - https://cwiki.apache.org/confluence/display/Hive/Home - it uses a "SQL like" syntax to let you run ad-hoc queries across data in a Hadoop+HDFS cluster. We're using it to run nightly reporting on ~10 GB of data and are pretty happy with it.
Obviously the ease of programming any given task in MR corresponds, to invent a word, with how cross-talky the task is. If it is no-communication parallelizable, then it's very easy to do. The more communication or the more data striping you do, the more interesting it becomes.
In any case, tons of people would like a simple SQL-like language for Hadoop. There already exist some examples: Pig, Sawzall, etc. Unfortunately, the inefficiencies hurt. Say Pig takes 2x as much data processing as hand-coded Java. At large scale, that can eat you alive. To get some intuition about scale, review the slides by Ron Bodkin, former VP Eng at Quantcast: [1], page 9 and on. Obviously if a 2x penalty means going from 4 to 8 machines in your cluster, it's not such a big deal. But if you buy clusters a datacenter's cage worth at a time or more, it's painful. We haven't escaped the tradeoff between programmer time and computer cost.
People would also love something as easy to program as R or MATLAB that magically scales to large data. Nobody has written such a thing despite quite a lot of demand, which makes me think it's even harder than I thought it was, and I believe it to be quite a hard task.
For the tools: I'm not a qc spokesperson and none of this represents the opinion of my employer and wasn't endorsed by them. If you want qc's position on anything, ask our spokesperson.
As someone who would like to start experimenting with Hadoop in near future, I would appreciate it a lot if you could elaborate on that or point me in the direction of a better comparison.
Personally, I dislike LISP odd syntax, the widespread use of side effects in
a functional language and the poor abstraction that lists represent over RAM,
from a performance point of view — indeed LISP variants often add additional
data structures, somehow negating the “LIS” part of the language. In the
specific case of Clojure, the fact that a compiled language is compiled into
an interpreted one, JVM bytecode, combining a slow dev cycle with suboptimal
performance, makes me think Clojure users must be glutton for punishment.
Clojure has sophisticated state management. So much for widespread use of side effects.
Clojure has high performance data structure implementations tuned for modern hardware. So much for performance.
Clojure is compiled on the fly at the REPL and the JVM is one of the fastest runtimes out there. So much for slow dev cycle.
You may not like the syntax, but boy is that Hadoop query short and sweet.
Yeah, it's almost hilarious how every criticism he aims at LISP is fixed in Clojure, where immutability is the default and the vector is the data structure of choice. Also, he clearly doesn't understand the subtleties of, and differences between, compilation and interpretation, and how the two can interlace.
Are people here waiting for a "SQL-like-declarative-query-language + Hadoop" hybrid, which lets one write declarative processes to run on large amounts of data? Academia seems to be very motivated to produce something like this.