This analysis just feels totally off to me. Google didn't miss anything with MapReduce and TensorFlow. So what that Hadoop and PyTorch came out and became open source favorites? It's not like that had any significant negative business impact to Google. On the flip side, Google's open sourcing of Kubernetes did win the container orchestration "war" (RIP Mesos), and it's not like that brought them any great benefits.
I agree with another commenter, where Google missed the boat was cloud. Ironically, AppEngine came out pretty early in 2008, but its original incarnation was much too "Google-specific". Google just didn't have the corporate DNA to understand that, sure, Datastore can allow infinite horizontal scalability, but most people don't want to deal with eventual consistency, and they just want a plain-old SQL DB. It took Google a long time to come around to understand how other businesses use their infrastructure.
AppEngine also hit the classic Google problem of missing attention to detail. I used it for a couple of projects in 2008/2009 and it just wasn’t competitive - cold starts were painful, you’d hit all of these gaps in features or random bugs in the tooling, etc. and while Datastore might scale it was consistently much slower than any mainstream SQL database so you really had to need that scaling.
That was the first project which trained me not to use Google products. I remember reporting a ton of bugs, never hearing back, and things never improving until much (multiple years?) later, when I’d stopped using it. I was profoundly unsurprised when AWS ate their lunch.
> AppEngine also hit the classic Google problem of missing attention to detail.
Amen, amen, amen. This is so frustrating to me because in general I really like GCP as a platform, and I think they're doing great things with stuff like Cloud Run.
Case in point, I like Firebase a lot, and I think Firebase Auth (aka Google Identity Platform) is a great product. But it took them about 2.5 years between releasing their initial support for a second factor, which was originally SMS only, and allowing a TOTP (e.g. Google Authenticator/Auth) 2nd factor. And they still don't support "remember this device" functionality, so every time someone logs in they always need to enter the 2nd factor. I literally can't think of any major site (especially Google's own) that doesn't support "remember this device", and yet it just must be a detail that's not "important enough" to get someone at Google a promotion.
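For context on how small the missing feature is: TOTP is just RFC 6238, computable from the Python standard library. This is not Firebase's implementation, just a minimal sketch of the algorithm itself:

```python
import hashlib
import hmac
import struct
import time

def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    # RFC 4226: HMAC-SHA1 the counter, then "dynamic truncation" to N digits
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                        # low nibble picks a 4-byte window
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

def totp(secret: bytes, at=None, step: int = 30, digits: int = 6) -> str:
    # RFC 6238: HOTP with the counter set to the current 30-second window
    t = int(time.time() if at is None else at)
    return hotp(secret, t // step, digits)

# RFC 6238 test vector: secret "12345678901234567890", t=59s, 8 digits
print(totp(b"12345678901234567890", at=59, digits=8))  # 94287082
```

That's the whole protocol; the product work is in enrollment, recovery, and "remember this device", which is exactly the detail work in question.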
Yeah, the updates I’m still getting on parity-with-AWS tickets from before the pandemic just reminds me that nobody at Google gets promoted for removing warts.
They did eventually close my textbook example of that, but it took half a decade to implement HTTP-to-HTTPS redirects, which basically every large customer wants in order to make compliance easier:
Wow, that issue thread basically shows everything that's wrong with Google when it comes to dealing with business customers, though I do give them credit that it seems to have slightly improved near the end there: at least a Google PM was giving status updates and engaging with the commenters - before that for years it was just crickets.
Here is my similar (though not as long) example - point-in-time-recovery on postgres: https://issuetracker.google.com/issues/78448400. Interestingly, it had the same problem with 0 communication from a Google PM until early 2020, so maybe they hired someone at that time who convinced them you can't treat business customers the same way you do consumers.
MapReduce helped solve a huge internal problem for Google - how to update the index dynamically. Originally, index updates were batch jobs that took days to run. But MapReduce isn't the solution to that problem. It's just the orchestration for the algorithm. It was figuring out how to partition that matrix inversion that was the hard part.
Most of Google's problems in B2B reflect that the organization has little concept of customer service, in the sense of serving customers. It's not the technology.
The current problem with AI for Google is that the good stuff costs a bit too much to give away with search. That problem may be solved, or evaded. Most searches or questions don't need a large language model. I expect to see systems where you ask a question, and if it's hard, you get something like "That's a hard question. Give me a minute to work on that. Meanwhile, here's a word from our sponsor."
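A routing setup like the one described might look something like this sketch; the difficulty heuristic, labels, and tiers are all hypothetical, purely for illustration:

```python
# Hypothetical cost-aware query router: serve cheap queries from classic
# search or a small model, and only escalate hard ones to an expensive LLM.
# Every heuristic here is invented for illustration.

def looks_hard(query: str) -> bool:
    # Crude difficulty guess: long queries or open-ended question words
    q = query.lower()
    return len(q.split()) > 12 or any(w in q for w in ("why", "how", "compare", "explain"))

def route(query: str) -> str:
    if looks_hard(query):
        return "large_model"   # slow and costly; maybe behind "give me a minute"
    return "retrieval"         # classic search / small model, near-free

print(route("weather in paris"))                                          # retrieval
print(route("explain why eventual consistency makes app design harder"))  # large_model
```

In practice the classifier would itself be a small model, but the economics are the same: most traffic never touches the expensive path.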
> Most of Google's problems in B2B reflect that the organization has little concept of customer service, in the sense of serving customers. It's not the technology.
This, exactly. Google is fantastic at research, great at development. It's terrible at product. Or, specifically, all the bits of product that aren't R&D. MapReduce, Tensorflow and Kubernetes are all examples of technical successes that Google has spawned. None is a product. You can't buy them, buy support, or, you know, actually talk to someone if something goes wrong.
That's why Google cloud has failed so far. It's not the tech; big G can go toe to toe with Amazon and MS on that. It's that enterprises want contracts and support and relationship managers and all that stuff. That just doesn't seem to be in Google's DNA. Unless they can address that, they're not going to win with big org customers.
They might win the startups - where the customers are themselves techies who are OK with self supporting, looking up Stackoverflow, asking ChatGPT, etc. But unless they can address the people angle, I don't see Vertex gaining meaningful traction against Amazon or OpenAI/MS.
Make no mistake, Google has a legion of Account Managers, Customer Engineers, TAMs – you name it – working with large enterprises in most/all major cities... Thomas Kurian's Google Cloud absolutely has sales and contracts in their DNA.
> The current problem with AI for Google is that the good stuff costs a bit too much to give away with search.
Perhaps, but they seem to be in a better position than many to get that cost to a reasonable level.
They've got the scale, the research side, lots of experience with models of different sizes, they own TensorFlow, they produce and deploy their own Tensor chips (so I would guess are not dependent on Nvidia's tech at 10x markup), and have lots of experience in datacenter power efficiency.
This is all just based on what I've read in tech news. I happen to work at Google but don't have any insider info here. I'm biased, but they seem to be in a good place.
I believe the issue goes much deeper. Google itself did not have a scalable SQL DB when App Engine first came out. Spanner was still being designed. Even if Google understood that customers wanted a SQL DB, it could not give them that.
Which actually goes to the heart of the issue. Who cares what Google had? MySQL and Postgres existed - Amazon RDS first came out in 2009, and you could run DBs on EC2 before that.
Google famously uses lots of custom versions of infrastructure software while the rest of the world uses something else. Obviously that has served them extremely well, but the rest of the world wants stuff they are familiar with. It really took Google years to fully grasp that.
Google was running MySQL in different flavours and configurations before Spanner. I suspect strategy tax (knowing Spanner was coming along eventually) might have been a reason for not offering that as a managed option though.
The point of Kubernetes was to make "run this software on a cloud" trivially compatible across clouds. No specialized code that is needed for load balancers, storage, networking... you should be able to take your containerized application and the set of YAML files that describes what you want to do, and they work on "any" cloud.
As long as you don't rely on implementation details like a specific Ingress Controller that needs special annotations, this is exactly what Kubernetes does, and the reason why you have Helm Charts that don't need to take into consideration what the underlying cloud is.
Going back to what we had before, you had to be very aware of exactly which cloud provider you'd use. And since AWS had about 80+% of the market then, that meant that instructions for how to run things on Google's cloud were not very good, if they existed at all.
GKE in general seems to be doing better than EKS though. They got a lot of things right, and EKS seems to be playing catch-up. For example, if you look at any multi-cluster Kubernetes setup by AWS, it's a giant duct-tape job rather than a ground-up solution. GKE worked on fundamentals first, like multi-cluster endpoints, multi-cluster services, multi-cluster ingress, and multi-cluster config-sync, and is now bringing it all together under GKE Enterprise.
I did a migration to GKE at a previous company and it was basically great to work with. At the time we had a few teething issues with the most edge-casey bits of our system, to do with getting ingress and CDNs to work well together I think. The end result was a little more manual Terraforming than would have been ideal, but it all worked, and I believe the improvement needed from GCP was on their roadmap and came a few months later.
GKE (with Autopilot) is firmly set in my "day 1 toolbox" if I were to start a new company, assuming K8S was the right tech choice.
EKS came later and for a long time was basically barely working. Better now but still has odd things like no free tier.
The problem is that GCP for some reason just doesn't align with many corporate customer requirements (i.e. support, stability, never kill anything, available resources, etc.), and an engineer-friendly product (GKE) does nothing to alleviate that. It makes GCP better in an area it was already better in and doesn't at all fix the tangential reason companies actually avoid it (or rather, from what I hear, either migrate off of GCP eventually or use it as just a secondary cloud provider).
The part about Vertex might be right but the establishing story about mapreduce is totally wrong. By the time Hadoop took off, mapreduce at Google already had one foot in the grave. If you are using Hadoop today you have adopted a technology stack that Google recognized as obsolete 15 years ago. It is difficult to see how Google lost that battle. They effectively disabled an entire industry by suggesting an obsolete stack, while simultaneously moving on to cheaper, better sequels.
It wasn't obsolete 15 years ago. There were production MapReduces making double-digit improvements in key metrics (watch time, apps purchased) much more recently than that. The system I worked on, Sibyl, isn't well known outside of Google, but it used MR as an engine to do extremely large-scale machine learning. We added a number of features and used MR in ways that would have been extremely challenging to reimplement while maintaining performance targets.
I'm not even sure the mapreduce code has been deleted from google3 yet.
To be fair, MR was definitely dated by the time I joined- 2007- and I'm surprised it lasted as long as it did. But it was really well-tuned and reliable.
Also the MR paper was never intended to stake Google's position as a data processing provider (that came far, far later). The MR, Bigtable, and GFS papers were written to attract software engineers to work on infra at google, to share some useful ideas with the world (specifically, the distributed shuffle in mapreduce, the bloom filter in bigtable, and the single-master-index-in-ram of GFS), and finally, to "show off".
CL is changelist, the perforce equivalent of e.g. a git pull request (sorta).
Even though the Google codebase broke Perforce scaling long ago and doesn't use it anymore, the replacement still borrows a lot of Perforce names and a sort-of-Perforce API.
Yes, I'm not thinking of Flume. When I worked on Sibyl we were in the middle of converting most of the data pipelines (which are now externalized in TFX) to Flume, but the core learner, along with many other prod jobs at Google, still used MapReduce. But by that time, the majority of Google developers were using Flume instead of MR.
Certainly even in 2013 MR was definitely being used; I launched a product at that time that ran an MR because we couldn't get similar performance out of Flume yet.
You’re right and wrong. MapReduce is two things: a pattern for massively parallel computation, and the name of the initial implementation of the pattern at Google.
While the initial implementation at Google quickly got replaced with better things, the MapReduce pattern is everywhere in the data space, and almost taken for granted now. Hadoop is basically the same: a shitty (I think HDFS is still pretty good, just not the compute part) initial implementation of the pattern that was quickly iterated and improved upon.
Also, a big reason people stopped having to think about eg rack-local operations is that most people operating on huge amounts of data now aren’t doing it on traditional generic servers, they’re using something like s3 on VMs in Public Cloud datacenters if they’re doing something relatively “low level” or more likely just using something like Snowflake, Spark/Databricks (pretty close to OG mapreduce…), etc.
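To make the pattern-vs-implementation point concrete, here's a toy single-process word count showing the map, shuffle, and reduce phases in plain Python. All the hard parts of a real system (distribution, fault tolerance, the disk-backed shuffle) are deliberately missing:

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc: str):
    # map: emit (key, value) pairs -- here (word, 1)
    for word in doc.split():
        yield word, 1

def shuffle(pairs):
    # shuffle: group all values by key (the distributed sort in real systems)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: fold each key's values down to a single result
    return {key: sum(values) for key, values in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(d) for d in docs)))
print(counts["the"], counts["fox"])  # 3 2
```

The pattern fits in twenty lines; the decade of engineering went into making the shuffle survive thousands of flaky machines.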
Hadoop and then Yarn+Mapreduce gave us quite a lot of value even until 2017. I honestly don't think it was a bad technology choice. Should have moved off it faster but we had cheap commodity hardware running cheap software and early Spark was massively memory-finicky (cliff of performance). I wouldn't use that tech today but back in 2008-2014 it let us run things over a few petabytes relatively cheaply with relatively slow interconnect.
We had an impl of the Pregel paper on top of the Yarn manager.
The API was painful and easy to make mistakes with, but it did provide quite a bit of functionality.
Now, of course, that stuff is all out of date. Where I am now we have custom job engine and it's way better. I imagine others have something like this too.
Things have just changed. Interconnect is now cheap and fast: 40 Gbps is commodity.
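Back-of-envelope on that point, assuming a full 40 Gbps per link and ignoring protocol overhead:

```python
# Why data locality matters less than it used to: at commodity 40 Gbps,
# pulling a terabyte over the network is a few minutes, not hours.
link_gbps = 40
bytes_per_sec = link_gbps * 1e9 / 8     # 40 Gbit/s = 5 GB/s
terabyte = 1e12

seconds_per_tb = terabyte / bytes_per_sec
print(seconds_per_tb)  # 200.0 -- about 3.3 minutes per TB per link
```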
Google itself moved on to "Flume" and later created "Dataflow", the precursor to Apache Beam. While Dataflow/Beam aren't execution engines for data processing themselves, they abstract the language for expressing data computation away from the engines. At Google, for example, a data processing job might be expressed using Beam on top of Flume.
Outside of Google, most organizations with large distributed data processing problems moved on to Hadoop2 (YARN/MapReduce2) and later in present day to Apache Spark. When organizations say they are using "Databricks" they are using Apache Spark provided as a service, from a company started by the creators of Apache Spark, which happens to be Databricks.
Apache Beam is also used outside of Google on top of other data processing "engines" or runners for these jobs, such as Google's Cloud Dataflow service, Apache Flink, Apache Spark, etc.
To quote from there: "MapReduce and similar systems significantly ease the task of writing data-parallel code. However, many real-world computations require a pipeline of MapReduces, and programming and managing such pipelines can be difficult."
Aren't hash joins in an RDBMS just a general application of map-reduce? In left joins, the big table is hashed on the join-key value and sent to N machines, and the little table is just replicated everywhere. IIUC this is how any OLAP/big-data framework thinks while doing massive joins, or when partitioning to reduce data later; they just have to deal with additional issues like locality of partition to computation target.
So map-reduce is in the DNA of many data computation flows rather than a thing in and of itself.
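That intuition can be sketched in a few lines; here's a toy partitioned hash join, with the "machines" simulated as lists (the partition count and data are made up):

```python
# Toy partitioned hash join: shard the big table by hash of the join key
# across N "machines", replicate the small table to all of them.
N = 4  # pretend worker count

big = [(1, "click"), (2, "view"), (1, "buy"), (3, "view")]   # (user_id, event)
small = {1: "alice", 2: "bob", 3: "carol"}                   # user_id -> name

# partition the big table: each row goes to hash(key) % N
partitions = [[] for _ in range(N)]
for key, event in big:
    partitions[hash(key) % N].append((key, event))

# every "machine" holds the whole small table, so the join is purely local
joined = []
for part in partitions:
    for key, event in part:
        if key in small:            # hash lookup -- the "hash" in hash join
            joined.append((small[key], event))

print(sorted(joined))
```

Strip away the per-machine partitioning and what's left is the same build/probe structure a single-node RDBMS uses, which is the commenter's point.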
Also, the second-generation Flume/Spark difference vs MapReduce/Hadoop has to be understood in the context of what other assumptions changed at the same time. At Google, GFS was replaced with Colossus (can’t share specifics, but this was also accompanied by a change in “data/machine topology” and associated networking changes, away from uniform, less-specialized servers), which made it so “move code to data” became less important. Similarly, Spark was originally meant to run on HDFS but became a lot more popular once it could use things like S3 as its storage layer and public cloud VMs for compute (a similar transition to GFS->Colossus).
In terms of usability the other two main innovations were to make it easier to program a workflow that chained MapReduce operations (without an intermediate, expensive, blocks-until-all-nodes-done disk write step, nor a jankass orchestration engine) and subsequently to declaratively specify the desired output (eg SQL) without requiring the user to specify the implementation.
They’ve since added more stuff like streaming, ML, whatever, but the biggest change from 1st to 2nd gen is really in the data topology.
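The "chained stages without an intermediate, blocks-until-all-nodes-done disk write" improvement mentioned above can be mimicked in miniature with eager lists versus a fused lazy pipeline:

```python
# First-gen MapReduce materializes every intermediate result (with a barrier);
# Flume/Spark-style pipelines fuse the stages and stream records through.
data = range(1_000_000)

# "1st gen": each stage fully materializes before the next starts
stage1 = [x * 2 for x in data]        # full intermediate result
stage2 = [x + 1 for x in stage1]      # another full intermediate result
eager = sum(x for x in stage2 if x % 3 == 0)

# "2nd gen": one fused, lazy pipeline -- no intermediate ever materializes
lazy = sum(x for x in ((x * 2) + 1 for x in data) if x % 3 == 0)

print(eager == lazy)  # True -- same answer, very different memory profile
```

In the real systems the intermediates were disk writes across a cluster, so fusing stages bought far more than it does here.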
Yep. Regarding workflows with chains of Map and Reduce operations, the Hadoop ecosystem had a similar improvement with Hadoop 2, where YARN (as a container resource manager) and MapReduce2 were introduced, separating the workflow constraints of the original Hadoop/MapReduce. This led to Hadoop projects such as Tez, an alternate execution engine replacing MapReduce2 on YARN, with the same types of flow optimizations for chained operations and fewer shuffles/writes to disk (i.e. much better pipeline performance for typical jobs). This was particularly relevant for things like Hive, where Tez could be plugged in as the execution engine when running on a Hadoop 2 cluster.
In addition to Flume/Dataflow, there's a significant push toward SQL engines. In general, SQL (or similar query engines written in more declarative languages/APIs) has some performance benefits over typical Flume code thanks to vectorized execution and other optimizations.
Isn't the Rama framework (Nathan Marz's new thing from Red Planet Labs) the latest iteration of "let's completely abstract computation latency/complexity from the framework"? In my mind it tries to do different things depending on who you are. In the words of a colleague, "I am excited someone is trying to remove SQL".
Rama seems like if you are a fullstack or backend dev then it can provide you an easy way to have a(low latency) view of your data to build upon. If you are a Data Scientist you can use the thing to pull necessary data for analysis and slice and dice it.
So if I read this right, if you're not a big company (perhaps just a standard dev with maybe a tiny cluster of computers or just one beefy one), you just make a Docker container with pyspark and put your scripts in there, and everyone can reproduce your work easily on any type of machine or cluster?
It seems like a reasonable approach, though it would be nice to not need the OS dependencies/docker for spark.
If you are running jobs inside pure Docker containers (i.e. just one node without need for k8s, compose, rancher or whatever), it may be the case you don't even need pyspark.
Regarding "obsolete" and missed opportunities: this idea of missed opportunities for Google has been touched upon several times in the last decade, including the launch of Google Cloud Platform itself.
Urs Hölzle, the former head of Google TI (Technical Infrastructure), discussed in public some of the challenges and reasons for creating Google Cloud Platform as a platform, and backing projects like Kubernetes.
Over time, Google has become a proprietary tech "island" in several ways and arguably more fragmented than other large tech companies, such as Microsoft and Amazon, which happen to both have commercial cloud offerings, and Meta/Facebook. While all of these companies certainly have challenges with not-invented-here ("NIH") syndrome, and lots of internal, proprietary tools, as a software engineer at one of these three, odds are you will use and touch more commercial and open-source technologies than you would at Google. Google itself still struggles with having teams and projects use GCP for internal work versus Borg/etc; and there are plenty of valid reasons why Google teams don't use GCP.
The proprietary tech "island" issue is a non-trivial concern when you need to hire new software engineers from industry/outside and ramp-up time with some of these systems may be 6 months or even greater; today Alphabet/Google is at around 200k+ FTE, and you aren't going to be able to find many engineers outside that have experience with Borg/Flume/Spanner/Monarch/etc. Likewise, when you are an experienced Google software engineer looking to work elsewhere, you need a translation map to figure out what tools outside are similar to the ones from inside.
Google's proprietary tech island has its legitimate reasons for existing, and when people say 'xyz' commercial/open-source thing is "better," they often mean it is better for their problem at hand.
At Google, a decade-plus ago many of the problems it had to solve were problems that few other organizations had, such as large-scale data processing (to be made cost-efficient on commodity hardware), and it needed to create a number of tools/platforms as solutions, such as MapReduce/GFS.
Many of these tools and platforms were discussed via papers, and inspired open-source work. In the MapReduce case, it changed how Apache Hadoop itself took shape, and the lessons from all of these later led to things like Apache Spark.
The idea of losing a battle can only be applied with the benefit of hindsight. Many of the Google examples given were created at a time when there were no peers, and when Google wasn't interested in selling these things as commercial products (i.e. GCP vs AWS vs Azure); it built them according to its unique internal needs, which few other organizations could relate to. (I acknowledge that I am intentionally leaving out organizational politics and culture, e.g. PERF, as non-trivial contributors to this result.)
> The proprietary tech "island" issue is a non-trivial concern when you need to hire new software engineers from industry/outside and ramp-up time with some of these systems may be 6 months or even greater
Went to a GCP event once, expected it to be like the aws one… it was 100% marketing and the Wi-Fi didn’t work. So yea, they drop the ball a LOT when trying to interface with the developer community.
It's really far off; I'm not being pedantic. We do know the 181K Q2 2023 headcount is public, and from there we might consider factors, starting with the fact that not every Alphabet-er is a SWE (and it's not close). That alone accounts for an incredible number.
This is a valuable comment and I didn't mean to nerdsnipe you.
> They effectively disabled an entire industry by suggesting an obsolete stack
Interesting opinion, but not supported at all by evidence. Most non-Google datasets are small and stored on off-the-shelf heterogeneous hardware, so HDFS / MapReduce for streaming OLAP is a great fit. Cassandra (BigTable) and Parquet (Dremel) plus Cloudera’s Impala had much quicker time-to-market when large-scale BI became more relevant.
“Obsolete” for Google problems sure, but Google problems largely only happen at Google. Stuff like ad targeting and ML look a lot different for products outside the Chocolate Factory.
The problem with using technology X “because Google does it” or “this is the best open source version of what Google uses, so let’s use it because Google does it” is that companies neglect that Google does not just use the technology out of the box. They:
1. created the software for their own needs
2. maintain a developer team to improve it and address requirements/pain points/integration
3. have an internal pool of experts in the form of the developer team and “customers”/early adopters
4. most likely have other proprietary systems like Borg or Colossus which integrate with the software very well, which OSS like Hadoop may not (another example: OSS Bazel vs Blaze+Forge+Piper+Monorepo structure).
Something like HDFS was hugely painful for many teams because they had no idea how it worked or how to debug it, had no idea how to fix it or extend it, and didn’t have any good tooling to understand why something was slow. All they could do was try to configure it, integrate with it, and find answers for their problems online. That’s because HDFS was “free” but a team capable of properly maintaining, supporting/operations, and developing HDFS was extremely expensive.
About 10 years ago, I worked at a place where we had a "big data" project. Some big wig wanted to use Hadoop. Turns out, the "big data" isn't big. It's not even a gigabyte, spread over less than a dozen files. Everything worked great, but the job orchestration took longer than the actual processing.
Despite this misuse of Hadoop, another guy really loves it, and decides to start a project rewriting everything to use MapReduce. A new guy started, got assigned to the "MapReduce project"... he worked on this for over a year. It never made it to production.
I recommend reading the GFS paper and consider that there were/are use cases for horizontally scalable, fault-tolerant object storage, with the nuance of understanding that some of your storage/data nodes may also be compute nodes and there can be a benefit or preference for assigning applications to run where underlying data is stored.
In the HDFS case with Hadoop's ecosystem, consider Hive, BigTable, Drill, and even Spark when running on YARN.
In the peak days of Hadoop, many organizations were primarily on-prem, and S3 or S3 compatible object stores were mostly reserved for people using AWS.
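A toy sketch of the "run compute where the data lives" scheduling idea from the GFS/HDFS design (the block layout and node names here are invented for illustration):

```python
# Data-local task placement: prefer scheduling a task on a node that already
# holds a replica of the block it reads, falling back to a remote read.
replicas = {                     # block id -> nodes holding a replica
    "blk1": {"nodeA", "nodeB"},
    "blk2": {"nodeB", "nodeC"},
    "blk3": {"nodeA", "nodeC"},
}

def place(block: str, free_nodes: set) -> str:
    local = replicas[block] & free_nodes
    if local:                         # data-local: read from local disk
        return sorted(local)[0]
    return sorted(free_nodes)[0]      # no local option: pay for the transfer

print(place("blk1", {"nodeA", "nodeC"}))  # nodeA (holds a replica)
print(place("blk2", {"nodeA"}))           # nodeA (no replica free; remote read)
```

On slow on-prem interconnects this preference was a big deal; as noted elsewhere in the thread, fast networks and S3-style disaggregated storage made it matter much less.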
My first job after college was at Google. I had read the MapReduce paper and was so excited to run one, and luckily they did have a tutorial and I ended up running my first MapReduce within I think a couple weeks of joining, over Google's index of the internet at the time, which was really mind-blowing / exciting.
I remember vaguely one task where I tried to join the index of the internet to some other dataset, which required joining across datacenters. The job ran with some warning. Then I got a friendly message from an SRE saying something like "it can cost thousands of dollars to run a join like that, so, no worries, just only do it if it's something important." Of course, I wasn't doing anything important. I was just enjoying running MapReduce.
> The final piece of Google’s strategy today came in the form of a subtle, and very vague, announcement from Nvidia CEO Jensen Huang on stage in a brief appearance of only a few minutes. Huang announced that Google and Nvidia had been collaborating on a new language model development framework, PaxML, built on top of Google’s cutting-edge machine learning framework JAX and its Accelerated Linear Algebra framework (or XLA).
My only thought is, I wonder how well the Nvidia/Google partnership will do against Azure/Intel (I believe Azure invested heavily in FPGAs for their ML use cases).
Wow, XLA has historically been absolute crap outside of TPUs and even for TPUs the error messages are incredibly poor. If nvidia actually wants to support XLA now perhaps that means the TPU 5 is the last TPU, and/or future TPUs might be targeted at just inference and efficiency (like TPU 5) and then nvidia owns the training game.
After all if you compare Nvidia’s success with H100 sales versus GCloud TPU sales, it would be easy for Sundar to say “if you can’t beat em join em” and just maintain TPU team for inference which is more closely tied to wall street margins.
Google didn't miss on MapReduce; it missed on Cloud. Amazon was light years behind in datacenter technology, but made it all available via AWS, while Google kept everything to themselves. It was a colossal failure.
LLMs are shaping up to be the second such failure.
In both cases Google regarded the technology as a competitive advantage in the business they were in (web search), so naturally wanted to keep it internal. Maybe almost as important, they were so far ahead on those technologies that making a viable product out of them would have been a huge effort with no benefit to search. Google tech has always been an "island". Even when they did release GCP, the offerings like AppEngine and transparent networking were incomprehensible to customers who just wanted to lift-and-shift their existing datacenter, not adopt Google practices.
Amazon, on the other hand, has no qualms about converting their internal expertise into products ("turn every major cost into a source of revenue" [0]) and giving customers what they ask for.
It was a big struggle to get Google to commit to cloud. When I worked there and advocated that Google needed to dive into cloud headfirst, the responses I got were a mix of "we already have AppEngine" (which totally misses the point) and "it's not as profitable as ads" (not many things are).
I never expected Google to end up in the innovator's dilemma, but here they are.
Who invented Kubernetes? Ok, water it down if you must. Which of the FAANG was in at the birth? Yep.
What Google missed on, was taking cloud outside of itself as a visible customer product. AWS leapt into the breach, but the irony is that we want to use AWS to run a technology platform which Google has significant DNA in.
No. It actually compounds the problem in some ways: great at designing technology, poor at capitalising on it. And of course, much of what we attribute as great in Google was acquisition, not invention. Maps? Originally, I believe, outside. Android? Outside. Picasa? Outside. It's a long list of amazing things Google acquired.
Go? Inside. They hired Pike and Thompson, among others. Kubernetes? Inside. Pike and Thompson had been working on Plan 9, which in many respects foreshadows Kubernetes. The language couldn't really be proprietary (Google maintains very strong control over it), and Kubernetes went out as open source quite quickly. QUIC also underwent this transformation from in-house to shared.
Seriously though, the AI race has just started. On a horizon that matters for large scale and ongoing adoption (and thus persistent corporate profits and valuations as opposed to manic hypes) nothing has been decided.
The hardware and software mix that will deliver this large scale adoption is not clear yet. Past experience (and reason) suggests it should be relatively cheap (commoditized) and easy to use.
Achieving X but at 0.1% of the cost will be the game that people will thrive at, betting on gargantuan volumes instead of gargantuan prices.
Given "AI" is more or less linear algebra, the world is crying for commoditized vectorized compute. It's a solved problem. The world will get what it wants.
This happens at every big company. The reason we don't hear about it more often isn't that big companies don't know this is a problem. It's just that the only solution they seem to be able to come up with is to enforce keeping unused breakthroughs secret.
We can be happy that at Google, at least, these things can seep out and be of some benefit to the rest of us.
Google didn't release a MapReduce implementation? It took off with Hadoop. If anything, it should have been open source from the start; then Hadoop wouldn't have been needed.