Google itself moved on to "Flume" and later created "Dataflow", the precursor to Apache Beam. While Dataflow/Beam aren't execution engines for data processing themselves, they abstract the language for expressing data computations away from the engines that run them. At Google, for example, a data processing job might be expressed using Beam and run on top of Flume.
Outside of Google, most organizations with large distributed data processing problems moved on to Hadoop 2 (YARN/MapReduce2) and later, in the present day, to Apache Spark. When organizations say they are using "Databricks", they are using Apache Spark provided as a service by a company started by the creators of Apache Spark.
Apache Beam is also used outside of Google on top of other data processing "engines" or runners for these jobs, such as Google's Cloud Dataflow service, Apache Flink, Apache Spark, etc.
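To make that runner-agnosticism concrete, here is a minimal sketch of a Beam pipeline in Python. The word-count logic and the input strings are just placeholders; the point is that the same pipeline code can be pointed at the DirectRunner, Flink, Spark, or Cloud Dataflow purely via pipeline options.

    # Minimal Apache Beam sketch: the pipeline is expressed once, and the
    # execution engine is chosen via PipelineOptions (DirectRunner here;
    # swap in e.g. --runner=FlinkRunner, SparkRunner, or DataflowRunner).
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(["--runner=DirectRunner"])  # the engine choice lives here

    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.Create(["to be or not to be", "that is the question"])
            | "Split" >> beam.FlatMap(str.split)
            | "Pair" >> beam.Map(lambda w: (w, 1))
            | "Count" >> beam.CombinePerKey(sum)
            | "Print" >> beam.Map(print)
        )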
To quote from there: "MapReduce and similar systems significantly ease the task of writing data-parallel code. However, many real-world computations require a pipeline of MapReduces, and programming and managing such pipelines can be difficult."
Aren't hash joins in an RDBMS essentially a general application of map-reduce? In a left join, the big table is hashed on the join key and sent to N machines, while the little table is simply replicated everywhere. IIUC this is how any OLAP/big-data framework thinks when doing massive joins, or when partitioning to reduce data later; they just have to deal with additional issues like the locality of a partition to its computation target.
So map-reduce is in the DNA of many data computation flows rather than a thing in and of itself.
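As a rough sketch of that idea (a toy in-memory setup, not any particular RDBMS's implementation): hash-partition the big table on the join key across N workers, replicate the small table to every worker, and join locally.

    # Toy sketch of a partitioned hash join (illustrative only). The big table
    # is hash-partitioned on the join key across N "machines"; the little
    # table is replicated to every worker.
    from collections import defaultdict

    N = 4  # number of workers

    big = [(1, "order-a"), (2, "order-b"), (5, "order-c"), (2, "order-d")]  # (key, payload)
    little = {1: "alice", 2: "bob", 3: "carol"}                             # key -> payload

    # "Map" phase: route each big-table row to a worker by hashing the join key.
    partitions = defaultdict(list)
    for key, payload in big:
        partitions[hash(key) % N].append((key, payload))

    # Join phase: each worker has the whole little table locally, so it can
    # emit joined rows on its own (left join: unmatched big rows are kept).
    joined = []
    for worker, rows in partitions.items():
        for key, payload in rows:
            joined.append((key, payload, little.get(key)))

    print(joined)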
Also, the differences between the second generation (Flume/Spark) and MapReduce/Hadoop have to be understood in the context of what other assumptions changed at the same time. At Google, GFS was replaced with Colossus (can't share specifics, but this was also accompanied by a change in "data/machine topology" and associated networking changes away from uniform, less specialized servers), which made "move code to data" less important. Similarly, Spark was originally meant to run on HDFS but became a lot more popular once it could use things like S3 as its storage layer and public cloud VMs for compute (a transition similar to GFS -> Colossus).
In terms of usability, the other two main innovations were making it easier to program a workflow that chained MapReduce operations (without an intermediate, expensive, blocks-until-all-nodes-are-done disk write step, and without a jankass orchestration engine), and subsequently letting the user declaratively specify the desired output (e.g. SQL) without having to specify the implementation.
They’ve since added more stuff like streaming, ML, whatever, but the biggest change from 1st to 2nd gen is really in the data topology.
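A minimal PySpark sketch of both usability points mentioned above (the column names and data are made up): transformations are chained lazily into one plan with no forced write between stages, and the same result can be requested declaratively in SQL and left to the optimizer.

    # Sketch of the two usability wins, using PySpark (names/data are made up).
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.master("local[*]").appName("chained").getOrCreate()

    df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])

    # 1) Chained map/reduce-like steps: built lazily into a single plan, with
    #    no mandatory disk write between stages as chained MapReduce jobs had.
    result = (
        df.filter(F.col("value") > 1)
          .groupBy("key")
          .agg(F.sum("value").alias("total"))
    )
    result.show()

    # 2) Declarative version: describe the output, let the engine pick the plan.
    df.createOrReplaceTempView("t")
    spark.sql("SELECT key, SUM(value) AS total FROM t WHERE value > 1 GROUP BY key").show()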
Yep. Regarding workflows with chains of Map and Reduce operations, the Hadoop ecosystem had a similar improvement with the introduction of Hadoop 2, where YARN (as a container/resource manager) and MapReduce2 were introduced, separating resource management from the workflow constraints of the original Hadoop/MapReduce. This led to Hadoop projects such as Tez, an alternative execution engine on YARN that replaced MapReduce2 and applied the same kinds of flow optimizations for chained operations, reducing the number of shuffles/writes to disk (i.e. much better overall pipeline performance for typical jobs). This was particularly relevant for things like Hive, where Tez could be plugged in as the execution engine when running on a Hadoop 2 cluster.
In addition to Flume/Dataflow, there's been a significant push toward SQL engines. In general, SQL (or similar query engines with more declarative languages/APIs) has some performance benefits over typical Flume code thanks to vectorized execution and other optimizations.
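One way to see where that benefit comes from, as a sketch (PySpark, made-up data and column names): a hand-written per-row function forces the engine to call back into user code for every record, while the equivalent declarative expression stays inside the engine's optimized, columnar execution path.

    # Sketch: per-row user code vs. a declarative expression the engine can
    # optimize and vectorize. Data and column names are made up.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.master("local[*]").appName("sql-vs-udf").getOrCreate()
    df = spark.range(1_000_000).withColumnRenamed("id", "x")

    # "Flume-style" per-record user code: opaque to the optimizer, row-at-a-time.
    plus_tax = F.udf(lambda x: x * 1.07, DoubleType())
    df.select(plus_tax(F.col("x")).alias("y")).count()

    # Declarative equivalent: a plain expression the engine can optimize and
    # execute over columns without calling back into Python for each row.
    df.select((F.col("x") * 1.07).alias("y")).count()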
Isn't the Rama framework (Nathan Marz's new thing from Red Planet Labs) the latest iteration of "let's completely abstract computation latency/complexity away from the framework"? In my mind it tries to do different things depending on who you are. In the words of a colleague, "I am excited someone is trying to remove SQL".
Rama seems like: if you are a fullstack or backend dev, it can provide an easy way to have a (low-latency) view of your data to build upon. If you are a data scientist, you can use it to pull the necessary data for analysis and slice and dice it.
So if I read this right: if you're not a big company (perhaps just a standard dev with a tiny cluster of computers, or just one beefy one), you just make a Docker container with pyspark, put your scripts in there, and everyone can reproduce your work easily on any type of machine or cluster?
It seems like a reasonable approach, though it would be nice not to need the OS dependencies/Docker for Spark.
If you are running jobs inside plain Docker containers (i.e. just one node, without the need for k8s, Compose, Rancher or whatever), it may be the case that you don't even need pyspark.
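If you do stick with pyspark for the single-node case above, here's a minimal sketch (file paths and the "category" column are placeholders) of the kind of script that runs identically on a laptop, one beefy box, or inside a container, using local[*] mode so no cluster manager is involved.

    # Minimal local-mode PySpark script: no YARN/k8s required, so the same
    # file runs the same way on a laptop, a single beefy server, or inside a
    # Docker container (assuming pyspark is installed; paths are placeholders).
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")          # use all local cores, in-process
        .appName("reproducible-job")
        .getOrCreate()
    )

    df = spark.read.csv("data/input.csv", header=True, inferSchema=True)
    df.groupBy("category").count().write.mode("overwrite").parquet("out/counts")

    spark.stop()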