This analysis just feels totally off to me. Google didn't miss anything with MapReduce and TensorFlow. So what that Hadoop and PyTorch came out and became open source favorites? It's not like that had any significant negative business impact to Google. On the flip side, Google's open sourcing of Kubernetes did win the container orchestration "war" (RIP Mesos), and it's not like that brought them any great benefits.
I agree with another commenter, where Google missed the boat was cloud. Ironically, AppEngine came out pretty early in 2008, but its original incarnation was much too "Google-specific". Google just didn't have the corporate DNA to understand that, sure, Datastore can allow infinite horizontal scalability, but most people don't want to deal with eventual consistency, and they just want a plain-old SQL DB. It took Google a long time to come around to understand how other businesses use their infrastructure.
AppEngine also hit the classic Google problem of missing attention to detail. I used it for a couple of projects in 2008/2009 and it just wasn’t competitive - cold starts were painful, you’d hit all of these gaps in features or random bugs in the tooling, etc. and while Datastore might scale it was consistently much slower than any mainstream SQL database so you really had to need that scaling.
That was the first project which trained me not to use Google products. I remember reporting a ton of bugs, never hearing back, and things never improving until much (multiple years?) later, when I’d stopped using it. I was profoundly unsurprised when AWS ate their lunch.
> AppEngine also hit the classic Google problem of missing attention to detail.
Amen, amen, amen. This is so frustrating to me because in general I really like GCP as a platform, and I think they're doing great things with stuff like Cloud Run.
Case in point, I like Firebase a lot, and I think Firebase Auth (aka Google Identity Platform) is a great product. But it took them about 2.5 years between releasing their initial support for a second factor, which was originally SMS only, and allowing a TOTP (e.g. Google Authenticator/Auth) 2nd factor. And they still don't support "remember this device" functionality, so every time someone logs in they always need to enter the 2nd factor. I literally can't think of any major site (especially Google's own) that doesn't support "remember this device", and yet it just must be a detail that's not "important enough" to get someone at Google a promotion.
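For context on how small the missing feature is: TOTP is just RFC 6238, computable from the Python standard library. This is not Firebase's implementation, just a minimal sketch of the algorithm itself:

```python
import hashlib
import hmac
import struct
import time

def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    # RFC 4226: HMAC-SHA1 the counter, then "dynamic truncation" to N digits
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                        # low nibble picks a 4-byte window
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

def totp(secret: bytes, at=None, step: int = 30, digits: int = 6) -> str:
    # RFC 6238: HOTP with the counter set to the current 30-second window
    t = int(time.time() if at is None else at)
    return hotp(secret, t // step, digits)

# RFC 6238 test vector: secret "12345678901234567890", t=59s, 8 digits
print(totp(b"12345678901234567890", at=59, digits=8))  # 94287082
```

That's the whole protocol; the product work is in enrollment, recovery, and "remember this device", which is exactly the detail work in question.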
Yeah, the updates I’m still getting on parity-with-AWS tickets from before the pandemic just reminds me that nobody at Google gets promoted for removing warts.
They did eventually close my textbook example of that, but it took half a decade to implement HTTP-to-HTTPS redirects, which basically every large customer wants in order to make compliance easier:
Wow, that issue thread basically shows everything that's wrong with Google when it comes to dealing with business customers, though I do give them credit that it seems to have slightly improved near the end there: at least a Google PM was giving status updates and engaging with the commenters - before that for years it was just crickets.
Here is my similar (though not as long) example - point-in-time-recovery on postgres: https://issuetracker.google.com/issues/78448400. Interestingly, it had the same problem with 0 communication from a Google PM until early 2020, so maybe they hired someone at that time who convinced them you can't treat business customers the same way you do consumers.
MapReduce helped solve a huge internal problem for Google - how to update the index dynamically. Originally, index updates were batch jobs that took days to run. But MapReduce isn't the solution to that problem. It's just the orchestration for the algorithm. It was figuring out how to partition that matrix inversion that was the hard part.
Most of Google's problems in B2B reflect that the organization has little concept of customer service, in the sense of serving customers. It's not the technology.
The current problem with AI for Google is that the good stuff costs a bit too much to give away with search. That problem may be solved, or evaded. Most searches or questions don't need a large language model. I expect to see systems where you ask a question, and if it's hard, you get something like "That's a hard question. Give me a minute to work on that. Meanwhile, here's a word from our sponsor."
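A routing setup like the one described might look something like this sketch; the difficulty heuristic, labels, and tiers are all hypothetical, purely for illustration:

```python
# Hypothetical cost-aware query router: serve cheap queries from classic
# search or a small model, and only escalate hard ones to an expensive LLM.
# Every heuristic here is invented for illustration.

def looks_hard(query: str) -> bool:
    # Crude difficulty guess: long queries or open-ended question words
    q = query.lower()
    return len(q.split()) > 12 or any(w in q for w in ("why", "how", "compare", "explain"))

def route(query: str) -> str:
    if looks_hard(query):
        return "large_model"   # slow and costly; maybe behind "give me a minute"
    return "retrieval"         # classic search / small model, near-free

print(route("weather in paris"))                                          # retrieval
print(route("explain why eventual consistency makes app design harder"))  # large_model
```

In practice the classifier would itself be a small model, but the economics are the same: most traffic never touches the expensive path.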
> Most of Google's problems in B2B reflect that the organization has little concept of customer service, in the sense of serving customers. It's not the technology.
This, exactly. Google is fantastic at research, great at development. It's terrible at product. Or, specifically, all the bits of product that aren't R&D. MapReduce, Tensorflow and Kubernetes are all examples of technical successes that Google has spawned. None is a product. You can't buy them, buy support, or, you know, actually talk to someone if something goes wrong.
That's why Google cloud has failed so far. It's not the tech; big G can go toe to toe with Amazon and MS on that. It's that enterprises want contracts and support and relationship managers and all that stuff. That just doesn't seem to be in Google's DNA. Unless they can address that, they're not going to win with big org customers.
They might win the startups - where the customers are themselves techies who are OK with self supporting, looking up Stackoverflow, asking ChatGPT, etc. But unless they can address the people angle, I don't see Vertex gaining meaningful traction against Amazon or OpenAI/MS.
Make no mistake, Google has a legion of Account Managers, Customer Engineers, TAMs – you name it – working with large enterprises in most/all major cities... Thomas Kurian's Google Cloud absolutely has sales and contracts in their DNA.
> The current problem with AI for Google is that the good stuff costs a bit too much to give away with search.
Perhaps, but they seem to be in a better position than many to get that cost to a reasonable level.
They've got the scale, the research side, lots of experience with models of different sizes, they own TensorFlow, they produce and deploy their own Tensor chips (so I would guess are not dependent on Nvidia's tech at 10x markup), and have lots of experience in datacenter power efficiency.
This is all just based on what I've read in tech news. I happen to work at Google but don't have any insider info here. I'm biased, but they seem to be in a good place.
I believe the issue goes much deeper. Google itself did not have a scalable SQL DB when App Engine first came out. Spanner was still being designed. Even if Google understood that customers wanted a SQL DB, it could not give them that.
Which actually goes to the heart of the issue. Who cares what Google had? MySQL and Postgres existed - Amazon RDS first came out in 2009, and you could run DBs on EC2 before that.
Google famously uses lots of custom versions of infrastructure software while the rest of the world uses something else. Obviously that has served them extremely well, but the rest of the world wants stuff they are familiar with. It really took Google years to fully grasp that.
Google was running MySQL in different flavours and configurations before Spanner. I suspect strategy tax (knowing Spanner was coming along eventually) might have been a reason for not offering that as a managed option though.
The point of Kubernetes was to make "run this software on a cloud" trivially compatible across clouds. No specialized code that is needed for load balancers, storage, networking... you should be able to take your containerized application and the set of YAML files that describes what you want to do, and they work on "any" cloud.
As long as you don't rely on implementation details like a specific Ingress Controller that needs special annotations, this is exactly what Kubernetes does, and the reason why you have Helm Charts that don't need to take into consideration what the underlying cloud is.
Going back to what we had before, you had to be very aware of exactly which cloud provider you'd use. And since AWS had about 80+% of the market then, that meant that instructions for how to run things on Google's cloud were not very good, if they existed at all.
GKE in general seems to be doing better than EKS though. They got a lot of things right, and EKS seems to be playing catch-up. For example, if you look at any multi-cluster Kubernetes setup by AWS, it's a giant duct-tape job rather than a ground-up solution. GKE worked on fundamentals first, like multi-cluster endpoints, multi-cluster services, multi-cluster ingress, and multi-cluster config-sync, and is now bringing it all together under GKE Enterprise.
I did a migration to GKE at a previous company and it was basically great to work with. At the time we had a few teething issues with the most edge-casey bits of our system, to do with getting ingress and CDNs to work well together I think. The end result was a little more manual Terraforming than would have been ideal, but it all worked, and I believe the improvement needed from GCP was on their roadmap and came a few months later.
GKE (with Autopilot) is firmly set in my "day 1 toolbox" if I were to start a new company, assuming K8S was the right tech choice.
EKS came later and for a long time was basically barely working. Better now but still has odd things like no free tier.
The problem is that GCP for some reason just doesn't align with many corporate customer requirements (i.e. support, stability, never kill anything, available resources, etc.), and an engineer-friendly product (GKE) does nothing to alleviate that. It makes GCP better in an area it was already better in and doesn't at all fix the tangential reason companies actually avoid it (or rather, from what I hear, either migrate off of GCP eventually or use it as just a secondary cloud provider).
The part about Vertex might be right but the establishing story about mapreduce is totally wrong. By the time Hadoop took off, mapreduce at Google already had one foot in the grave. If you are using Hadoop today you have adopted a technology stack that Google recognized as obsolete 15 years ago. It is difficult to see how Google lost that battle. They effectively disabled an entire industry by suggesting an obsolete stack, while simultaneously moving on to cheaper, better sequels.
It wasn't obsolete 15 years ago. There were production MapReduces making double-digit improvements in key metrics (watch time, apps purchased) much more recently than that. The system I worked on, Sibyl, isn't well known outside of Google, but it used MR as an engine to do extremely large-scale machine learning. We added a number of features and used MR in ways that would have been extremely challenging to reimplement while maintaining performance targets.
I'm not even sure the mapreduce code has been deleted from google3 yet.
To be fair, MR was definitely dated by the time I joined- 2007- and I'm surprised it lasted as long as it did. But it was really well-tuned and reliable.
Also the MR paper was never intended to stake Google's position as a data processing provider (that came far, far later). The MR, Bigtable, and GFS papers were written to attract software engineers to work on infra at google, to share some useful ideas with the world (specifically, the distributed shuffle in mapreduce, the bloom filter in bigtable, and the single-master-index-in-ram of GFS), and finally, to "show off".
CL is changelist, the perforce equivalent of e.g. a git pull request (sorta).
Even though the Google codebase broke Perforce scaling long ago and doesn't use it anymore, the replacement still borrows a lot of Perforce names and a sort-of-Perforce API.
Yes, I'm not thinking of Flume. When I worked on Sibyl we were in the middle of converting most of the data pipelines (which are now externalized in TFX) to Flume, but the core learner, along with many other prod jobs at Google, still used MapReduce. But by that time, the majority of Google developers were using Flume instead of MR.
Certainly even in 2013 MR was definitely being used; I launched a product at that time that ran an MR because we couldn't get similar performance out of Flume yet.
You’re right and wrong. MapReduce is two things: a pattern for massively parallel computation, and the name of the initial implementation of the pattern at Google.
While the initial implementation at Google quickly got replaced with better things, the MapReduce pattern is everywhere in the data space, and almost taken for granted now. Hadoop is basically the same: a shitty (I think HDFS is still pretty good, just not the compute part) initial implementation of the pattern that was quickly iterated and improved upon.
Also, a big reason people stopped having to think about eg rack-local operations is that most people operating on huge amounts of data now aren’t doing it on traditional generic servers, they’re using something like s3 on VMs in Public Cloud datacenters if they’re doing something relatively “low level” or more likely just using something like Snowflake, Spark/Databricks (pretty close to OG mapreduce…), etc.
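To make the pattern-vs-implementation point concrete, here's a toy single-process word count showing the map, shuffle, and reduce phases in plain Python. All the hard parts of a real system (distribution, fault tolerance, the disk-backed shuffle) are deliberately missing:

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc: str):
    # map: emit (key, value) pairs -- here (word, 1)
    for word in doc.split():
        yield word, 1

def shuffle(pairs):
    # shuffle: group all values by key (the distributed sort in real systems)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: fold each key's values down to a single result
    return {key: sum(values) for key, values in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(d) for d in docs)))
print(counts["the"], counts["fox"])  # 3 2
```

The pattern fits in twenty lines; the decade of engineering went into making the shuffle survive thousands of flaky machines.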
Hadoop and then Yarn+Mapreduce gave us quite a lot of value even until 2017. I honestly don't think it was a bad technology choice. Should have moved off it faster but we had cheap commodity hardware running cheap software and early Spark was massively memory-finicky (cliff of performance). I wouldn't use that tech today but back in 2008-2014 it let us run things over a few petabytes relatively cheaply with relatively slow interconnect.
We had an impl of the Pregel paper on top of the Yarn manager.
The API was painful and easy to make mistakes with, but it did provide quite a bit of functionality.
Now, of course, that stuff is all out of date. Where I am now we have custom job engine and it's way better. I imagine others have something like this too.
Things have just changed. Interconnect is now cheap and fast: 40 Gbps is commodity.
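Back-of-envelope on that point, assuming a full 40 Gbps per link and ignoring protocol overhead:

```python
# Why data locality matters less than it used to: at commodity 40 Gbps,
# pulling a terabyte over the network is a few minutes, not hours.
link_gbps = 40
bytes_per_sec = link_gbps * 1e9 / 8     # 40 Gbit/s = 5 GB/s
terabyte = 1e12

seconds_per_tb = terabyte / bytes_per_sec
print(seconds_per_tb)  # 200.0 -- about 3.3 minutes per TB per link
```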
Google itself moved on to "Flume" and later created "Dataflow", the precursor to Apache Beam. While Dataflow/Beam aren't execution engines for data processing themselves, they abstract the language for expressing data computation away from the engines. At Google, for example, a data processing job might be expressed using Beam on top of Flume.
Outside of Google, most organizations with large distributed data processing problems moved on to Hadoop2 (YARN/MapReduce2) and later in present day to Apache Spark. When organizations say they are using "Databricks" they are using Apache Spark provided as a service, from a company started by the creators of Apache Spark, which happens to be Databricks.
Apache Beam is also used outside of Google on top of other data processing "engines" or runners for these jobs, such as Google's Cloud Dataflow service, Apache Flink, Apache Spark, etc.
To quote from there: "MapReduce and similar systems significantly ease the task of writing data-parallel code. However, many real-world computations require a pipeline of MapReduces, and programming and managing such pipelines can be difficult."
Aren't hash joins in an RDBMS just a general application of map-reduce? In left joins, the big table is hashed on the join-key value and sent to N machines, and the little table is just replicated everywhere. IIUC this is how any OLAP/big-data framework thinks while doing massive joins, or when partitioning to reduce data later; they just have to deal with additional issues like locality of partition to computation target.
So map-reduce is in the DNA of many data computation flows rather than a thing in and of itself.
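That intuition can be sketched in a few lines; here's a toy partitioned hash join, with the "machines" simulated as lists (the partition count and data are made up):

```python
# Toy partitioned hash join: shard the big table by hash of the join key
# across N "machines", replicate the small table to all of them.
N = 4  # pretend worker count

big = [(1, "click"), (2, "view"), (1, "buy"), (3, "view")]   # (user_id, event)
small = {1: "alice", 2: "bob", 3: "carol"}                   # user_id -> name

# partition the big table: each row goes to hash(key) % N
partitions = [[] for _ in range(N)]
for key, event in big:
    partitions[hash(key) % N].append((key, event))

# every "machine" holds the whole small table, so the join is purely local
joined = []
for part in partitions:
    for key, event in part:
        if key in small:            # hash lookup -- the "hash" in hash join
            joined.append((small[key], event))

print(sorted(joined))
```

Strip away the per-machine partitioning and what's left is the same build/probe structure a single-node RDBMS uses, which is the commenter's point.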
Also, the second-generation Flume/Spark difference vs MapReduce/Hadoop has to be understood in the context of what other assumptions changed at the same time. At Google, GFS was replaced with Colossus (can’t share specifics, but this was also accompanied by a change in “data/machine topology” and associated networking changes, away from uniform, less-specialized servers), which made it so “move code to data” became less important. Similarly, Spark was originally meant to run on HDFS but became a lot more popular once it could use things like S3 as its storage layer and public cloud VMs for compute (a similar transition to GFS->Colossus).
In terms of usability the other two main innovations were to make it easier to program a workflow that chained MapReduce operations (without an intermediate, expensive, blocks-until-all-nodes-done disk write step, nor a jankass orchestration engine) and subsequently to declaratively specify the desired output (eg SQL) without requiring the user to specify the implementation.
They’ve since added more stuff like streaming, ML, whatever, but the biggest change from 1st to 2nd gen is really in the data topology.
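The "chained stages without an intermediate, blocks-until-all-nodes-done disk write" improvement mentioned above can be mimicked in miniature with eager lists versus a fused lazy pipeline:

```python
# First-gen MapReduce materializes every intermediate result (with a barrier);
# Flume/Spark-style pipelines fuse the stages and stream records through.
data = range(1_000_000)

# "1st gen": each stage fully materializes before the next starts
stage1 = [x * 2 for x in data]        # full intermediate result
stage2 = [x + 1 for x in stage1]      # another full intermediate result
eager = sum(x for x in stage2 if x % 3 == 0)

# "2nd gen": one fused, lazy pipeline -- no intermediate ever materializes
lazy = sum(x for x in ((x * 2) + 1 for x in data) if x % 3 == 0)

print(eager == lazy)  # True -- same answer, very different memory profile
```

In the real systems the intermediates were disk writes across a cluster, so fusing stages bought far more than it does here.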
Yep. Regarding workflows with chains of Map and Reduce operations, the Hadoop ecosystem had a similar improvement with Hadoop 2, where YARN (as a container resource manager) and MapReduce2 were introduced, separating the workflow constraints of the original Hadoop/MapReduce. This led to Hadoop projects such as Tez, an alternate execution engine replacing MapReduce2 on YARN, with the same types of flow optimizations for chained operations and fewer shuffles/writes to disk (i.e. much better pipeline performance for typical jobs). This was particularly relevant for things like Hive, where Tez could be plugged in as the execution engine when running on a Hadoop 2 cluster.
In addition to Flume/Dataflow, there's a significant push toward SQL engines. In general, SQL (or similar query engines written in more declarative languages/APIs) has some performance benefits over typical Flume code thanks to vectorized execution and other optimizations.
Isn't the Rama framework (Nathan Marz's new thing from Red Planet Labs) the latest iteration of "let's completely abstract computation latency/complexity from the framework"? In my mind it tries to do different things depending on who you are. In the words of a colleague, "I am excited someone is trying to remove SQL".
Rama seems like if you are a fullstack or backend dev then it can provide you an easy way to have a(low latency) view of your data to build upon. If you are a Data Scientist you can use the thing to pull necessary data for analysis and slice and dice it.
So if I read this right, if you're not a big company (perhaps just a standard dev with maybe a tiny cluster of computers or just one beefy one), you just make a Docker container with pyspark and put your scripts in there, and everyone can reproduce your work easily on any type of machine or cluster?
It seems like a reasonable approach, though it would be nice to not need the OS dependencies/docker for spark.
If you are running jobs inside pure Docker containers (i.e. just one node without need for k8s, compose, rancher or whatever), it may be the case you don't even need pyspark.
Regarding "obsolete" and missed opportunities: this idea of missed opportunities for Google has been touched upon several times in the last decade, including the launch of Google Cloud Platform itself.
Urs Hölzle, the former head of Google TI (Technical Infrastructure), discussed in public some of the challenges and reasons for creating Google Cloud Platform as a platform, and backing projects like Kubernetes.
Over time, Google has become a proprietary tech "island" in several ways and arguably more fragmented than other large tech companies, such as Microsoft and Amazon, which happen to both have commercial cloud offerings, and Meta/Facebook. While all of these companies certainly have challenges with not-invented-here ("NIH") syndrome, and lots of internal, proprietary tools, as a software engineer at one of these three, odds are you will use and touch more commercial and open-source technologies than you would at Google. Google itself still struggles with having teams and projects use GCP for internal work versus Borg/etc; and there are plenty of valid reasons why Google teams don't use GCP.
The proprietary tech "island" issue is a non-trivial concern when you need to hire new software engineers from industry/outside and ramp-up time with some of these systems may be 6 months or even greater; today Alphabet/Google is at around 200k+ FTE, and you aren't going to be able to find many engineers outside that have experience with Borg/Flume/Spanner/Monarch/etc. Likewise, when you are an experienced Google software engineer looking to work elsewhere, you need a translation map to figure out what tools outside are similar to the ones from inside.
Google's proprietary tech island has its legitimate reasons for existing, and when people say 'xyz' commercial/open-source thing is "better," they often mean it is better for their problem at hand.
At Google, a decade-plus ago many of the problems it had to solve were problems that few other organizations had, such as large-scale data processing (to be made cost-efficient on commodity hardware), and it needed to create a number of tools/platforms as solutions, such as MapReduce/GFS.
Many of these tools and platforms were discussed via papers, and inspired open-source work. In the MapReduce case, it changed how Apache Hadoop itself took shape, and the lessons from all of these later led to things like Apache Spark.
The idea of losing a battle can only be applied with the benefit of hindsight. Many of the Google examples given were created at a time when there were no peers, and when Google wasn't interested in selling these things as commercial products (i.e. GCP vs AWS vs Azure); it built them according to its unique internal needs, which few other organizations could relate to. (I acknowledge that I am intentionally leaving out organizational politics and culture, e.g. PERF, as non-trivial contributors to this result.)
> The proprietary tech "island" issue is a non-trivial concern when you need to hire new software engineers from industry/outside and ramp-up time with some of these systems may be 6 months or even greater
Went to a GCP event once, expected it to be like the aws one… it was 100% marketing and the Wi-Fi didn’t work. So yea, they drop the ball a LOT when trying to interface with the developer community.
It's really far off; I'm not being pedantic. We do know the 181K Q2 2023 headcount is public, and from there we might consider factors, starting with the fact that not every Alphabet-er is a SWE (and it's not close). That alone accounts for an incredible number.
This is a valuable comment and I didn't mean to nerdsnipe you.
> They effectively disabled an entire industry by suggesting an obsolete stack
Interesting opinion, but not supported at all by evidence. Most non-Google datasets are small and stored on off-the-shelf heterogeneous hardware, so HDFS / MapReduce for streaming OLAP is a great fit. Cassandra (BigTable) and Parquet (Dremel) plus Cloudera’s Impala had much quicker time-to-market when large-scale BI became more relevant.
“Obsolete” for Google problems sure, but Google problems largely only happen at Google. Stuff like ad targeting and ML look a lot different for products outside the Chocolate Factory.
The problem with using technology X “because Google does it” or “this is the best open source version of what Google uses, so let’s use it because Google does it” is that companies neglect that Google does not just use the technology out of the box. They:
1. created the software for their own needs
2. maintain a developer team to improve it and address requirements/pain points/integration
3. have an internal pool of experts in the form of the developer team and “customers”/early adopters
4. most likely have other proprietary systems like Borg or Colossus which integrate with the software very well, which OSS like Hadoop may not (another example: OSS Bazel vs Blaze+Forge+Piper+Monorepo structure).
Something like HDFS was hugely painful for many teams because they had no idea how it worked or how to debug it, had no idea how to fix it or extend it, and didn’t have any good tooling to understand why something was slow. All they could do was try to configure it, integrate with it, and find answers for their problems online. That’s because HDFS was “free” but a team capable of properly maintaining, supporting/operations, and developing HDFS was extremely expensive.
About 10 years ago, I worked at a place where we had a "big data" project. Some big wig wanted to use Hadoop. Turns out, the "big data" isn't big. It's not even a gigabyte, spread over less than a dozen files. Everything worked great, but the job orchestration took longer than the actual processing.
Despite this misuse of Hadoop, another guy really loves it, and decides to start a project rewriting everything to use MapReduce. A new guy started, got assigned to the "MapReduce project"... he worked on this for over a year. It never made it to production.
I recommend reading the GFS paper and consider that there were/are use cases for horizontally scalable, fault-tolerant object storage, with the nuance of understanding that some of your storage/data nodes may also be compute nodes and there can be a benefit or preference for assigning applications to run where underlying data is stored.
In the HDFS case with Hadoop's ecosystem, consider Hive, BigTable, Drill, and even Spark when running on YARN.
In the peak days of Hadoop, many organizations were primarily on-prem, and S3 or S3 compatible object stores were mostly reserved for people using AWS.
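A toy sketch of the "run compute where the data lives" scheduling idea from the GFS/HDFS design (the block layout and node names here are invented for illustration):

```python
# Data-local task placement: prefer scheduling a task on a node that already
# holds a replica of the block it reads, falling back to a remote read.
replicas = {                     # block id -> nodes holding a replica
    "blk1": {"nodeA", "nodeB"},
    "blk2": {"nodeB", "nodeC"},
    "blk3": {"nodeA", "nodeC"},
}

def place(block: str, free_nodes: set) -> str:
    local = replicas[block] & free_nodes
    if local:                         # data-local: read from local disk
        return sorted(local)[0]
    return sorted(free_nodes)[0]      # no local option: pay for the transfer

print(place("blk1", {"nodeA", "nodeC"}))  # nodeA (holds a replica)
print(place("blk2", {"nodeA"}))           # nodeA (no replica free; remote read)
```

On slow on-prem interconnects this preference was a big deal; as noted elsewhere in the thread, fast networks and S3-style disaggregated storage made it matter much less.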
My first job after college was at Google. I had read the MapReduce paper and was so excited to run one, and luckily they did have a tutorial and I ended up running my first MapReduce within I think a couple weeks of joining, over Google's index of the internet at the time, which was really mind-blowing / exciting.
I remember vaguely one task where I tried to join the index of the internet to some other dataset, which required joining across datacenters. The job ran with some warning. Then I got a friendly message from an SRE saying something like "it can cost thousands of dollars to run a join like that, so, no worries, just only do it if it's something important." Of course, I wasn't doing anything important. I was just enjoying running MapReduce.
> The final piece of Google’s strategy today came in the form of a subtle, and very vague, announcement from Nvidia CEO Jensen Huang on stage in a brief appearance of only a few minutes. Huang announced that Google and Nvidia had been collaborating on a new language model development framework, PaxML, built on top of Google’s cutting-edge machine learning framework JAX and its Accelerated Linear Algebra framework (or XLA).
My only thought is, I wonder how well the Nvidia/Google partnership will do against Azure/Intel (I believe Azure invested heavily in FPGAs for their ML use cases).
Wow, XLA has historically been absolute crap outside of TPUs and even for TPUs the error messages are incredibly poor. If nvidia actually wants to support XLA now perhaps that means the TPU 5 is the last TPU, and/or future TPUs might be targeted at just inference and efficiency (like TPU 5) and then nvidia owns the training game.
After all if you compare Nvidia’s success with H100 sales versus GCloud TPU sales, it would be easy for Sundar to say “if you can’t beat em join em” and just maintain TPU team for inference which is more closely tied to wall street margins.
Google didn't miss on MapReduce; it missed on Cloud. Amazon was light years behind in datacenter technology, but made it all available via AWS, while Google kept everything to themselves. It was a colossal failure.
LLMs are shaping up to be the second such failure.
In both cases Google regarded the technology as a competitive advantage in the business they were in (web search), so naturally wanted to keep it internal. Maybe almost as important, they were so far ahead on those technologies that making a viable product out of them would have been a huge effort with no benefit to search. Google tech has always been an "island". Even when they did release GCP, the offerings like AppEngine and transparent networking were incomprehensible to customers who just wanted to lift-and-shift their existing datacenter, not adopt Google practices.
Amazon, on the other hand, has no qualms about converting their internal expertise into products ("turn every major cost into a source of revenue" [0]) and giving customers what they ask for.
It was a big struggle to get Google to commit to cloud. When I worked there and advocated that Google needed to dive into cloud headfirst, the responses I got were a mix of "we already have AppEngine" (which totally misses the point) and "it's not as profitable as ads" (not many things are).
I never expected Google to end up in the innovator's dilemma, but here they are.
Who invented Kubernetes? Ok, water it down if you must. Which of the FAANG was in at the birth? Yep.
What Google missed on, was taking cloud outside of itself as a visible customer product. AWS leapt into the breach, but the irony is that we want to use AWS to run a technology platform which Google has significant DNA in.
No. It actually compounds the problem in some ways: great at designing technology, poor at capitalising on it. And of course, much of what we attribute as great in Google was acquisition, not invention. Maps? Originally, I believe, outside. Android? Outside. Picasa? Outside. It's a long list of amazing things Google acquired.
Go? Inside. They hired Pike and Thompson, among others. Kubernetes? Inside. Pike and Thompson had been working on Plan 9, which in many respects foreshadows Kubernetes. The language couldn't really be proprietary (Google maintains very strong control over it), and Kubernetes went out as open source quite quickly. QUIC also underwent this transformation from in-house to shared.
Seriously though, the AI race has just started. On a horizon that matters for large scale and ongoing adoption (and thus persistent corporate profits and valuations as opposed to manic hypes) nothing has been decided.
The hardware and software mix that will deliver this large scale adoption is not clear yet. Past experience (and reason) suggests it should be relatively cheap (commoditized) and easy to use.
Achieving X but at 0.1% of the cost will be the game that people will thrive at, betting on gargantuan volumes instead of gargantuan prices.
Given "AI" is more or less linear algebra, the world is crying for commoditized vectorized compute. It's a solved problem. The world will get what it wants.
This happens at every big company. The reason we don't hear about it more often isn't that big companies don't know this is a problem. It's just that the only solution they seem to be able to come up with is to enforce keeping unused breakthroughs secret.
We can be happy that at Google, at least, these things can seep out and be of some benefit to the rest of us.
Google didn't release a MapReduce implementation? It took off with Hadoop. If anything, it should have been open source from the start; then Hadoop wouldn't have been needed.