I'm quite disturbed that the big G has hijacked the term "dataflow" to suddenly mean their specific (rather involved, I must say) dataflow-based programming model. The real dataflow [1] is a much broader term that doesn't prescribe specifics like programming semantics.
Seems as if they're trying to ride the wave of the recent upsurge of interest in dataflow in general (with sub-fields such as Flow-based programming, and implementations like Akka Streams, etc.). That's OK, but hijacking a term for the whole field is not.
I haven't studied the code very carefully, but I think this is mostly a PR piece. Not only is using Spark from Java not a very good idea, but the "canonical" way to do transformations like these is to use DataFrames or Datasets (which are new in Spark 1.6 and provide some improvements over DataFrames).
You're not going to get clean out-of-order processing semantics with any mode of Spark transformations. If you actually take the time to read the article, there's a section discussing the Java/Scala angle. The difference in code size is really secondary (though it is a difference). The main point is the difficulty of maintaining and evolving your pipeline over time with Spark, given the way important concepts become conflated with its API (any version of it). This all comes across much more clearly for those who actually take the time to read the words.
> If you actually take the time to read the article, there's a section discussing the Java/Scala angle.
They claim this isn't about the length of code, yet they select the most verbose way to use Spark and proudly display how long it is.
I mean, sure, the lack of event-time-based processing is a known limitation of Spark (and a pretty annoying one, though it is supposedly being worked on), but there are ways to write about it without code made to look bad on purpose.
EDIT: come to think of it, this whole article is "Spark Streaming can't do event time" written in thousands of words with contrived examples attached.
I think the length of the code is a side effect of the primary argument, and "can't do event time" is one of the symptoms. Neither is the primary argument of the blog post.
The primary argument is demonstrated by color-coding the different logical bits, which end up clearly portable and elegantly distinct in Dataflow.
This is demonstrated in two ways:
1. The "juicy value add" code that does the aggregation is labeled yellow, and doesn't change across all the samples with Dataflow. With Spark, it needs to be rewritten for every use case. Similarly, for all colors.
2. In Dataflow, all the colors stay separate. This makes expressing your logic easier. In Spark, the colors mix in dramatic ways with every demonstrated use case.
As Tyler said, all this is described in the blog post itself, but I don't blame you for missing it, since it's a really long post :)
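To make point 1 concrete, here's a rough sketch of what I mean, in Dataflow Java SDK (1.x) style. This isn't the article's exact code; the transform name and the KV shape are my own illustration:

    import com.google.cloud.dataflow.sdk.transforms.PTransform;
    import com.google.cloud.dataflow.sdk.transforms.Sum;
    import com.google.cloud.dataflow.sdk.values.KV;
    import com.google.cloud.dataflow.sdk.values.PCollection;

    // The "yellow" aggregation logic lives in one reusable transform; batch
    // vs. streaming vs. sessions is decided by the windowing applied around
    // it, not by rewriting this code.
    class SumUserScores extends PTransform<PCollection<KV<String, Integer>>,
                                           PCollection<KV<String, Integer>>> {
      @Override
      public PCollection<KV<String, Integer>> apply(
          PCollection<KV<String, Integer>> scores) {
        return scores.apply(Sum.integersPerKey());
      }
    }

You drop the same SumUserScores into every pipeline; only the windowing around it changes from use case to use case.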
Hi, I'm also an engineer on the Dataflow team.
This is a very exciting project! However it is still constrained by Spark's limitations outlined in this post in terms of what Dataflow pipelines it can run. For example, currently it doesn't support session windows, it windows based on processing time instead of event time, and it doesn't seem to support triggers (the word "trigger" doesn't appear in the code base). AFAIK Spark has plans to add support for event time, and then it will be possible to make this runner run more pipelines consistently with the semantics of the Dataflow model. It is also likely that the capabilities of this runner will expand greatly as https://wiki.apache.org/incubator/BeamProposal progresses.
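For reference, this is roughly the kind of pipeline I mean, in Dataflow SDK 1.x style (the input collection and the durations are illustrative): event-time sessions plus triggers and allowed lateness, which the Spark runner can't yet execute.

    import org.joda.time.Duration;
    import com.google.cloud.dataflow.sdk.transforms.windowing.AfterPane;
    import com.google.cloud.dataflow.sdk.transforms.windowing.AfterWatermark;
    import com.google.cloud.dataflow.sdk.transforms.windowing.Sessions;
    import com.google.cloud.dataflow.sdk.transforms.windowing.Window;
    import com.google.cloud.dataflow.sdk.values.KV;
    import com.google.cloud.dataflow.sdk.values.PCollection;

    // Event-time sessions with a watermark trigger and late-data firings.
    PCollection<KV<String, Integer>> sessioned = input.apply(
        Window.<KV<String, Integer>>into(
                Sessions.withGapDuration(Duration.standardMinutes(10)))
            .triggering(AfterWatermark.pastEndOfWindow()
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.standardDays(1))
            .accumulatingFiredPanes());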
I think using Java as a comparison is fine. It's an officially supported platform, and while Scala is better, we do a lot of work in Java without problems.
The Datasets point is more interesting. There is no doubt that the way they unify the DataFrame/RDD programming models is better, but it is so new (1.6 only) that we certainly haven't migrated to it yet. The documentation isn't huge, either: http://spark.apache.org/docs/latest/sql-programming-guide.ht...
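For anyone who hasn't tried it yet, a tiny sketch of what the 1.6 Dataset API buys you from Java (the file name is made up, and sqlContext is assumed in scope): typed, RDD-style operations on top of the DataFrame machinery.

    import org.apache.spark.api.java.function.FilterFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;

    // Read text as a typed Dataset<String> instead of an untyped DataFrame.
    Dataset<String> lines =
        sqlContext.read().text("events.log").as(Encoders.STRING());

    // Typed filter: the element type is checked at compile time, unlike
    // Column-expression filters on DataFrames.
    Dataset<String> errors = lines.filter(new FilterFunction<String>() {
      @Override
      public boolean call(String line) {
        return line.contains("ERROR");
      }
    });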
It's less Google and more Google engineers, and from that standpoint it is totally normal for them to kick the living daylights out of any code, particularly Google code.
Spark streaming really feels operationally immature compared to a lot of other stream processing frameworks, even Dataflow. The criticism is both unsurprising and warranted.
What other stream processing frameworks do you prefer in place of Spark Streaming? Storm is pretty mature, but did not play very well with YARN last I tried. Flink looks pretty good, but is fairly new. Samza is another one. I'm curious to know if you have any specific issues with Spark Streaming's operational immaturity. Some things I don't like in general are:
# Backpressure algorithm is fairly new, but pluggable (see the sketch at the end of this comment)
# Does not handle stragglers very well, in spite of backpressure - this is due to treating everything as batch
# Events from the system are not very rich and cannot be customized.
# Error handling is very unclear, and does not offer a lot of flexibility
In spite of these shortcomings, it has pretty good Kafka integration, mostly uses the same paradigm as batch, and plays well with Hadoop infrastructure. That makes it a decent choice for many use cases.
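On the backpressure point: it's opt-in config as of Spark 1.5, and the rate estimator behind it is the pluggable part. A minimal sketch (the app name and batch interval are arbitrary):

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    SparkConf conf = new SparkConf()
        .setAppName("stream-job")
        // Let Spark throttle ingestion rates based on observed batch delays.
        .set("spark.streaming.backpressure.enabled", "true");
    JavaStreamingContext jssc =
        new JavaStreamingContext(conf, Durations.seconds(2));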
Storm has the community around it, though it shows signs of decline (perhaps the release of YARN-friendly Heron will help). There are a lot of suitors for developer affection. Flink is relatively green. Samza looks good, but is still picking up momentum in the community. DataTorrent seems to have some momentum with its Apache Apex.
Regardless, most of these stream processing frameworks are still very much in their early days and lack a lot of the sophistication you find in custom in-house systems (such as found at... Google ;-). The open source world will no doubt catch up and overtake those systems, but right now there is still enough of a gap that it is rather painful.
Spark streaming pains:
1. Backpressure & stragglers. Duh.
2. Setup & tear down is still rough, even compared to Storm.
3. The whole context singleton thing means you need a new JVM for each job, which annoys the #@$@#$ out of me (sketched after this list).
4. Error handling isn't just unclear, it's kind of disastrous.
5. You can feel its "batch" heritage in lots of places, not just the stragglers. For some that is a feature, for me, a bug, even though with Storm I use Trident.
6. When a job runs amok, it's a pain to recover from it. Storm is no picnic either, but it is indeed better.
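To illustrate #3, a quick sketch of what I mean (Spark 1.x behavior; job names are arbitrary):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    SparkConf conf = new SparkConf().setAppName("job-a").setMaster("local[2]");
    JavaSparkContext first = new JavaSparkContext(conf);

    // A second active context in the same JVM throws a SparkException unless
    // you set spark.driver.allowMultipleContexts=true, which is discouraged.
    JavaSparkContext second = new JavaSparkContext(conf);

So every job effectively gets its own JVM, with all the startup and memory overhead that implies.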
I really like the Dataflow programming model, but this feels to me like an apples-to-oranges comparison.
Spark Streaming is a fully open-source project; although the Dataflow SDK is also OSS, my understanding is that the released version can only handle bounded datasets. Support for streaming (which is the major innovation, IMO) is only available in the form of stubs that call out to Google's paid, proprietary Dataflow service.
It's totally fine to compare an open-source project with a proprietary alternative, but I think it's odd that this article opens by talking about how the Dataflow SDK is being opened, and then spends all its time talking about proprietary features.
According to the proposal (https://wiki.apache.org/incubator/BeamProposal), an OSS implementation of streaming is on the way, via Apache Flink, etc.
The blog post is just suggesting that the model itself is superior, regardless of whether the implementation is OSS or not.
Huh, I can't find anything to that effect in the proposal. It does mention the existence of runners for Spark and Flink, but doesn't say they'll be getting streaming support. I had assumed it was unlikely for the same reason that this article talks about (lack of support for first-class processing by event time).
But assuming it's true, it's very welcome news! I'll be keeping a close eye on future releases.
Flink's event-time support is coming along nicely. Their first round of true event-time support came in November (https://flink.apache.org/news/2015/11/16/release-0.10.0.html), and much more is on the way. Flink will be an excellent platform for Beam, both batch and streaming.
As I understand it, Spark has event-time support coming soon as well. I think basic stuff is landing in 1.7. Not sure precisely what they have planned, but I can only imagine that Spark will also become an excellent platform for executing streaming Beam pipelines in due time. In the meantime, the streaming runner for Spark can either target those features which Spark does support well (i.e., processing-time windowing, in this case), or try to emulate those it doesn't (such as how it was done in the article).
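As a concrete example of "targeting what Spark does support well", this is roughly what processing-time windowing looks like in Spark Streaming today (the scores stream and the durations are illustrative):

    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaPairDStream;

    // Windows are defined in processing time: when records arrive at the
    // system, not when the underlying events actually happened.
    JavaPairDStream<String, Integer> windowedSums =
        scores.reduceByKeyAndWindow(
            (a, b) -> a + b,         // sum per key
            Durations.minutes(10),   // window length
            Durations.minutes(5));   // slide interval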
Did anyone read this? There were so many buzzwords and so much bullshit I couldn't slog past the first page.
Who do they write these things for, anyway? It's not like we're college admissions officers or professors; just give us the straight deal. Unless you're pitching to schools and naive undergrads, of course.
Hey, this is kind of offtopic, but figured still appropriate to ask;
How come Google Cloud Storage can be used instead of HDFS? I'm comparing Google/Amazon/Azure right now. Both Amazon and Azure have two types of storage options: regular object storage (S3a and Blobs) and block storage (S3 and Data Lake Store). S3 and DLS can act as the file system for Hadoop themselves (meaning you can let the data sit there and fire up clusters just for processing when needed), but they cannot interface with other tools the way the regular storage can.
Meanwhile, Google's storage is like regular object storage, but you can run MapReduce (Dataproc) and Spark on it.
Since Cloud Dataproc clusters are ephemeral, it's (probably) a better idea to use GCS over HDFS so there is no data loss.
Technically, Cloud Dataproc clusters have both HDFS (on PD) for write/read-intensive operations (and scratch space) along with the GCS connector. GCS is not the default file system, however.
Most people (in my experience) don't use the old S3 block filesystem protocol anymore, and just use object storage instead (with either S3n, S3a or the proprietary EMRFS).
The Hadoop FileSystem interface doesn't really force a specific underlying implementation, and you can even use "local" filesystems without any issue; in fact, IIRC, MapR-FS is just an optimized NFS mount, i.e. a shared network drive.
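Which is exactly what the GCS connector leans on. A rough sketch, with a made-up bucket name:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // With the GCS connector on the classpath, a gs:// URI resolves through
    // the same FileSystem interface that hdfs:// does.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("gs://my-bucket/"), conf);
    for (FileStatus status : fs.listStatus(new Path("gs://my-bucket/data/"))) {
      System.out.println(status.getPath());
    }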
[1] https://en.wikipedia.org/wiki/Dataflow