Google itself moved on to "Flume" and later created "Dataflow", the precursor to Apache Beam. While Dataflow/Beam aren't execution engines for data processing themselves, they abstract the language for expressing data computations away from the engines that run them. At Google, for example, a data processing job might be expressed using Beam and run on top of Flume.
Outside of Google, most organizations with large distributed data processing problems moved on to Hadoop 2 (YARN/MapReduce2) and later, in the present day, to Apache Spark. When organizations say they are using "Databricks", they are using Apache Spark provided as a service by a company started by the creators of Apache Spark.
Apache Beam is also used outside of Google on top of other data processing "engines" or runners for these jobs, such as Google's Cloud Dataflow service, Apache Flink, Apache Spark, etc.
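To make that runner-agnosticism concrete, here is a minimal sketch of a Beam pipeline in Python. The word-count logic and the input strings are just placeholders; the point is that the same pipeline code can be pointed at the DirectRunner, Flink, Spark, or Cloud Dataflow purely via pipeline options.

    # Minimal Apache Beam sketch: the pipeline is expressed once, and the
    # execution engine is chosen via PipelineOptions (DirectRunner here;
    # swap in e.g. --runner=FlinkRunner, SparkRunner, or DataflowRunner).
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(["--runner=DirectRunner"])  # the engine choice lives here

    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.Create(["to be or not to be", "that is the question"])
            | "Split" >> beam.FlatMap(str.split)
            | "Pair" >> beam.Map(lambda w: (w, 1))
            | "Count" >> beam.CombinePerKey(sum)
            | "Print" >> beam.Map(print)
        )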
To quote from there: "MapReduce and similar systems significantly ease the task of writing data-parallel code. However, many real-world computations require a pipeline of MapReduces, and programming and managing such pipelines can be difficult."
Aren't hash joins in an RDBMS essentially a general application of map-reduce? In a left join, the big table is hashed on the join key and sent to N machines, while the little table is simply replicated everywhere. IIUC this is how any OLAP/big-data framework thinks when doing massive joins, or when partitioning to reduce data later; they just have to deal with additional issues like the locality of a partition to its computation target.
So map-reduce is in the DNA of many data computation flows rather than a thing in and of itself.
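As a rough sketch of that idea (a toy in-memory setup, not any particular RDBMS's implementation): hash-partition the big table on the join key across N workers, replicate the small table to every worker, and join locally.

    # Toy sketch of a partitioned hash join (illustrative only). The big table
    # is hash-partitioned on the join key across N "machines"; the little
    # table is replicated to every worker.
    from collections import defaultdict

    N = 4  # number of workers

    big = [(1, "order-a"), (2, "order-b"), (5, "order-c"), (2, "order-d")]  # (key, payload)
    little = {1: "alice", 2: "bob", 3: "carol"}                             # key -> payload

    # "Map" phase: route each big-table row to a worker by hashing the join key.
    partitions = defaultdict(list)
    for key, payload in big:
        partitions[hash(key) % N].append((key, payload))

    # Join phase: each worker has the whole little table locally, so it can
    # emit joined rows on its own (left join: unmatched big rows are kept).
    joined = []
    for worker, rows in partitions.items():
        for key, payload in rows:
            joined.append((key, payload, little.get(key)))

    print(joined)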
Also, the differences between the second generation (Flume/Spark) and MapReduce/Hadoop have to be understood in the context of what other assumptions changed at the same time. At Google, GFS was replaced with Colossus (can't share specifics, but this was also accompanied by a change in "data/machine topology" and associated networking changes away from uniform, less specialized servers), which made "move code to data" less important. Similarly, Spark was originally meant to run on HDFS but became a lot more popular once it could use things like S3 as its storage layer and public cloud VMs for compute (a transition similar to GFS -> Colossus).
In terms of usability, the other two main innovations were making it easier to program a workflow that chained MapReduce operations (without an intermediate, expensive, blocks-until-all-nodes-are-done disk write step, and without a jankass orchestration engine), and subsequently letting the user declaratively specify the desired output (e.g. SQL) without having to specify the implementation.
They’ve since added more stuff like streaming, ML, whatever, but the biggest change from 1st to 2nd gen is really in the data topology.
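A minimal PySpark sketch of both usability points mentioned above (the column names and data are made up): transformations are chained lazily into one plan with no forced write between stages, and the same result can be requested declaratively in SQL and left to the optimizer.

    # Sketch of the two usability wins, using PySpark (names/data are made up).
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.master("local[*]").appName("chained").getOrCreate()

    df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])

    # 1) Chained map/reduce-like steps: built lazily into a single plan, with
    #    no mandatory disk write between stages as chained MapReduce jobs had.
    result = (
        df.filter(F.col("value") > 1)
          .groupBy("key")
          .agg(F.sum("value").alias("total"))
    )
    result.show()

    # 2) Declarative version: describe the output, let the engine pick the plan.
    df.createOrReplaceTempView("t")
    spark.sql("SELECT key, SUM(value) AS total FROM t WHERE value > 1 GROUP BY key").show()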
Yep. Regarding workflows with chains of Map and Reduce operations, the Hadoop ecosystem had a similar improvement with the introduction of Hadoop 2, where YARN (as a container/resource manager) and MapReduce2 were introduced, separating resource management from the workflow constraints of the original Hadoop/MapReduce. This led to Hadoop projects such as Tez, an alternative execution engine on YARN that replaced MapReduce2 and applied the same kinds of flow optimizations for chained operations, reducing the number of shuffles/writes to disk (i.e. much better overall pipeline performance for typical jobs). This was particularly relevant for things like Hive, where Tez could be plugged in as the execution engine when running on a Hadoop 2 cluster.
In addition to Flume/Dataflow, there's been a significant push toward SQL engines. In general, SQL (or similar query engines with more declarative languages/APIs) has some performance benefits over typical Flume code thanks to vectorized execution and other optimizations.
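One way to see where that benefit comes from, as a sketch (PySpark, made-up data and column names): a hand-written per-row function forces the engine to call back into user code for every record, while the equivalent declarative expression stays inside the engine's optimized, columnar execution path.

    # Sketch: per-row user code vs. a declarative expression the engine can
    # optimize and vectorize. Data and column names are made up.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.master("local[*]").appName("sql-vs-udf").getOrCreate()
    df = spark.range(1_000_000).withColumnRenamed("id", "x")

    # "Flume-style" per-record user code: opaque to the optimizer, row-at-a-time.
    plus_tax = F.udf(lambda x: x * 1.07, DoubleType())
    df.select(plus_tax(F.col("x")).alias("y")).count()

    # Declarative equivalent: a plain expression the engine can optimize and
    # execute over columns without calling back into Python for each row.
    df.select((F.col("x") * 1.07).alias("y")).count()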
Isn't the Rama framework (Nathan Marz's new thing from Red Planet Labs) the latest iteration of "let's completely abstract computation latency/complexity away from the framework"? In my mind it tries to do different things depending on who you are. In the words of a colleague, "I am excited someone is trying to remove SQL".
Rama seems like: if you are a fullstack or backend dev, it can provide an easy way to have a (low-latency) view of your data to build upon. If you are a data scientist, you can use it to pull the necessary data for analysis and slice and dice it.
So if I read this right: if you're not a big company (perhaps just a standard dev with a tiny cluster of computers, or just one beefy one), you just make a Docker container with pyspark, put your scripts in there, and everyone can reproduce your work easily on any type of machine or cluster?
It seems like a reasonable approach, though it would be nice not to need the OS dependencies/Docker for Spark.
If you are running jobs inside plain Docker containers (i.e. just one node, without the need for k8s, Compose, Rancher or whatever), it may be the case that you don't even need pyspark.
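If you do stick with pyspark for the single-node case above, here's a minimal sketch (file paths and the "category" column are placeholders) of the kind of script that runs identically on a laptop, one beefy box, or inside a container, using local[*] mode so no cluster manager is involved.

    # Minimal local-mode PySpark script: no YARN/k8s required, so the same
    # file runs the same way on a laptop, a single beefy server, or inside a
    # Docker container (assuming pyspark is installed; paths are placeholders).
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")          # use all local cores, in-process
        .appName("reproducible-job")
        .getOrCreate()
    )

    df = spark.read.csv("data/input.csv", header=True, inferSchema=True)
    df.groupBy("category").count().write.mode("overwrite").parquet("out/counts")

    spark.stop()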