What does it take to make Google work at scale? (docs.google.com)
247 points by mrry on Aug 17, 2015 | 130 comments



Keep in mind this is really "How does Google (and other big companies) do it", not "what does it take", because what it actually takes is much simpler than custom hardware and custom software. This was all pretty accessible 15 years ago using available software; the trick was putting it together right.

In general, the maxim of "cache everything, everywhere" is the cheapest way to gain both speed and availability. Of course you need to have results already to cache them, which is where all that big data processing comes into play, but there's still way more optimization going on here than is necessary to get results like this. You could make a tradeoff and give people newly-processed results more slowly and still give them mostly what they want, and not have to maintain so much custom software+hardware infrastructure.
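To make the "cache everything" point concrete, here's a minimal cache-aside sketch (Python, with a hypothetical fetch_from_backend standing in for the expensive recomputation):

    import time

    _cache = {}          # key -> (value, expires_at)
    TTL_SECONDS = 300    # happily serve results up to 5 minutes stale

    def fetch_from_backend(query):
        # Hypothetical expensive path: database, index, big-data job output.
        return "results for " + query

    def get_results(query):
        now = time.time()
        hit = _cache.get(query)
        if hit is not None and hit[1] > now:
            return hit[0]                          # fast path: cached answer
        value = fetch_from_backend(query)          # slow path: recompute
        _cache[query] = (value, now + TTL_SECONDS)
        return value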

Sometimes I feel like Batty from Blade Runner when I think about how well old crappy software used to work. "I've seen things you people wouldn't believe. Attack ships on fire off the shoulder of Orion. I watched mod_perl apps serve 50 thousand dynamic requests per second with nothing but Apache and MySQL. All those moments will be lost in time... like tears in rain..."


Nice overview. I particularly like how they visualized BigTable and MapReduce. Will make it easier for some to understand. For those liking Spanner, Google has exceeded that with their awesome F1 RDBMS (built on Spanner/GFS) below:

https://static.googleusercontent.com/media/research.google.c...

Can't wait for someone to clone this. Come to think of it, has anyone cloned Spanner? I recall some calling GPS a hack but I thought it was engineering brilliance. Spanner was some good work, too.


The people who are working on it are the Cockroach Labs folks: http://www.cockroachlabs.com/


The Wired article says they're not fully replicating Spanner, especially its time tricks. However, they are replicating a lot of its other capabilities. So, that's still a great tech in development. Thanks for the link. :)

Still gotta wait a while for a full Spanner replacement with GPS tricks and all. Or something even better hopefully.


You don't _NEED_ the GPS tricks and all, and realistically you can replace some of the HLC code in Cockroach with a network time server implementation that speaks to GPS / atomic clocks, if you want.


Good to hear :)


Note that Spanner always waits out the clock uncertainty on commit, while Cockroach doesn't need to do that - by design there are fewer situations in which the offset really matters. We can still benefit from a good bound in some situations, in particular you get global linearizability between causally unrelated transactions by waiting out the maximum clock offset (so just what Spanner does). BTW, I'm not sure when you would ever need the above guarantee, but it's cool! (disclaimer: actually working on the project).
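Roughly, the commit-wait idea looks like this (just an illustrative sketch, not Spanner's or Cockroach's actual code; MAX_CLOCK_OFFSET is an assumed uncertainty bound):

    import time

    MAX_CLOCK_OFFSET = 0.007   # assumed worst-case clock uncertainty, in seconds

    def commit(apply_writes):
        # Assign a commit timestamp, make the writes durable, then wait out
        # the uncertainty window so no node with a slow clock can later pick
        # an earlier timestamp for a causally later transaction.
        commit_ts = time.time()
        apply_writes(commit_ts)
        remaining = (commit_ts + MAX_CLOCK_OFFSET) - time.time()
        if remaining > 0:
            time.sleep(remaining)
        return commit_ts           # only now acknowledge the commit to clients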


Again, appreciate the insight.



That is interesting. Thanks for that one too.


"NTP isn't accurate enough so we use a hardware clock assist that uses GPS and atomic clocks" - Google

I think it is crazy that something as simple as keeping time can have such a complicated solution at that sort of scale.

Every review I hear about working at Google makes me want to stay away, just like the recent conversations about working at Amazon, but crazy stuff like this always piques my interest.


> Every review I hear about working at Google makes me want to stay away

Google was Fortune Magazine's #1 best place to work for six years in a row. The reason you occasionally hear a story about how Google isn't perfect to work at is because those are news. In general, it's a fantastic workplace.


Reading it back, my comment was harsher than I intended, especially with the reference to Amazon, but most of my aversion to working at a place like Google is just the sheer size as well as where most of the offices are located.

I grew up in a tiny midwest town and I love it here. I would not enjoy living on either one of the coasts.

Many of my classmates in college couldn't wait to get out of the midwest. I have friends at Google, Amazon, Microsoft, and other large names in the tech industry, but more often than not, when I hear them talk about their jobs, even when they are talking about them positively, I am glad I stayed in the midwest because that is what fits me the best.


You could try Google Pittsburgh; it's almost the midwest :)

(I work for Google Seattle, but I grew up in Pittsburgh -- Seattle is definitely more my speed, but I do enjoy a trip home now and then)


Google also does some dev work in Chicago.


I've worked with teams that are based in the Google office in Boulder, CO and visited them a few times - it's a nice office.


> I would not enjoy living on either one of the coasts.

Have you left the tiny midwest town and know that for a fact? Or do you base that on just word of mouth, articles, or..?

I have lived in a lot of different places (New Jersey, North Carolina, Texas, Southern California, Northern California (don't laugh re: me splitting those!), and briefly Illinois, Connecticut, and the Dominican Republic) and find value in their differentness.


I live in the Midwest, and have traveled frequently to Silicon Valley for work. It's a nice enough area in most ways, but I get a really nice view of it through travel-expensed meals and hotel accommodations where the commute to the office is 10 minutes of pleasant traffic. The experience the locals have involves more traffic and rather fewer trips to modestly nice restaurants. I don't hate the area by any means, but it certainly isn't so much better than where I live that I would want the commute, or to pay what would buy 5 or 6 fairly nice houses here just to live there... which is a rough approximation of the housing costs vs. where I live now.

To be clear, I'll say again it's not like I hate the Valley, but the reality is that day-to-day life for my Valley coworkers and for me just isn't that different, while theirs sure is more expensive. If you find a Silicon Valley job from an SV company in a remote office... and there are rather a lot of them, just not all in one place... there's not that much advantage left, unless you really love something about SV specifically, which is of course a totally reasonable and sensible thing.


It's a mecca for programmers. I can go to talks on various technical topics every day of the week if I want to. I have spoken to the creators of Scala, the Symfony PHP framework, and Optimizely. It's just really exciting (to me) to have that level of access.

Whereas my friends in Houston get to go to one conference a year (if that). But I totally agree: if you want a house with land to raise kids, I personally think it sucks to live here.


there's a chicago office, fwiw...


Chicago is midwest in geography and climate only. Someone who grew up in a typical small midwest town would feel as out of place in Chicago as in San Francisco, New York, or Paris.


Chicago is a large city, and with a large city some of the small-town-isms just don't fly.

However, I do enjoy living in Chicago, and I do consider it a huge difference from the coasts.


Probably the closest to the vibe the person is looking for would be the Google Waterloo office where I work, or the Pittsburgh office the other poster mentioned.


Google Chicago is almost all salespeople. Not much technology going on there.


It's true the salespeople do outnumber us here in Chicago, but we do have O(100) engineers in the office. We are working on lots of cool stuff: Ads, Search, and Privacy, to name a few.


You guys need to help the CJUG get some more talks! It's been a while since Google has had a role in that.


And you guys have a super-cool and mildly famous engineering site lead!


Everyone is different - while I don't doubt that Google has fantastic perks, I can't think of many things that I would want to work on that aren't the moonshot crazy experimental projects (and, let's face it, most people won't be working on them). And working on a giant campus that's isolated from the outside world... no thank you.

But like I said, we're all very different.


"I can't think of many things that I would want to work on that aren't the moonshot crazy experimental projects (and, let's face it, most people won't be working on them)."

This is true of almost everyone, of course. Every year I read the incoming intern abstracts, and they all literally say the same thing: "I really would like to work on <whatever really popular crazy project was in the news lately>". Literally all of them.

That said, often you can work on them if you are good enough at what you do.

(But yes, often you have to prove that first, either internally or externally)


Are there ever new, fast-moving projects at Google? That's the core of my perception - I don't want to work on Google Docs, Gmail, etc. etc. or other large, established projects. I want to work on something small, iterate quickly, etc. - but to the outsider, I don't see any Google products doing that.


There are lots of them, but they tend to get canceled early. As an outsider, you only see projects once they've reached some baseline threshold of viability - a large number of projects never reach that, not because Google says they can't but just because they're bad ideas to begin with.

I spent about half my time at Google working on mundane improvements to search - visual redesigns, feature unification, infrastructure improvements - and half working on crazy green-field projects. Most of the crazy stuff was eventually canceled, and the stuff that did launch (eg. Google Authorship) ended up being a lot more toned-down than we initially envisioned. Ultimately I think I learned more from the crazy projects, but it's a very different kind of learning, much more experiential than factual.

The other thing you learn when you actually succeed at a crazy new idea is that people build up a tolerance to them really quickly. The first time we did an interactive doodle on the home page (PacMan...actually technically that was the second, but it was the first people noticed), everybody went wild, it was in all the newspapers, and we calculated people spent 4.82 million hours playing it. Now when an interactive doodle comes out, most people don't even notice. Remember that Google Docs, GMail, etc. were revolutionary in their day; it's only because they've become successful that you don't want to work on them.


There are, but as you might expect when a project like that gets started, even the smell of it attracts a lot of people. I can't say that anything I've seen at Google iterates quickly compared to where I've worked before. It's just not a company built for quick iteration -- between the code reviews and style guide and readability restrictions, a lot of discussion around design docs, interacting with a lot of other teams, etc.

But it has other strengths.


    working on a giant campus that's isolated from
    the outside world
You don't have to work in Mountain View! The remote offices are really nice, and are generally well integrated into their cities.


Right, even the one in San Francisco.


> But like I said, we're all very different.

Actually, I am indistinguishable from everybody else.


yeah those are fun projects


   I think it is crazy that something as simple as keeping time can have such a complicated solution at that sort of scale.
Time keeping is one of the things that is most often underestimated and screwed up at any scale, in my experience. Many a subtle bug turns out to be a poor assumption or misunderstanding about how clocks work. Or calendars.


i.e. A lot of distributed systems problems can be simplified if you can trust your clocks.


Or guarantee that two separate coordinates on space-time can have identical values... which would mean breaking some laws of physics.


If the two clocks are stationary with respect to each other, that isn't a problem. Most of Google's servers are on the Earth's surface, so...

(Edit: Yes, different elevations cause a gravitational time dilation difference. For Earth's gravitational field and the elevation difference between different Google servers, I doubt it's an issue at the time resolution that Google needs to maintain.)


> If the two clocks are stationary with respect to each other, that isn't a problem. Most of Google's servers are on the Earth's surface, so...

...you can't generally guarantee that they are (even approximately) stationary with respect to each other, because points on the earth's surface (in general) are not stationary with respect to each other in an inertial frame of reference.


> you can't generally guarantee that they are (even approximately) stationary with respect to each other...

False.

> ... because points on the earth's surface (in general) are not stationary with respect to each other in an inertial frame of reference.

True. There is both the earth's rotation, and the relativistic difference due to differing elevations. But given earth's angular velocity and gravitational gradient, points on the surface are still approximately stationary with respect to each other, where "approximately" is defined by the amount of difference it will make compared to the time precision that Google cares about.


Even if the clocks are stationary with respect to each other (within some tolerance), it's impossible to guarantee completely synchronized clocks among the systems. This follows trivially from the impossibility of instantaneous communication due to the second postulate of special relativity.


If two clocks are stationary with respect to each other, then there is no ambiguity about where the midpoint between them is. Then you just do something like, when your clock hits noon, fire a projectile at a fixed speed toward the other clock. If the projectiles meet at the halfway point, then the clocks are synchronized.

And there is no relativistic funny business involved, because the clocks are stationary with respect to each other. There's no difference of viewpoint as to whether the projectiles met at the halfway point, or where the halfway point was, or even how far off from the halfway point they met (and therefore how far off the clocks are from each other).

This is the argument used in my relativity class to show that you can synchronize clocks that are stationary with respect to each other. (You have to be able to do that to construct an inertial frame of reference, that is, to be able to determine what time coordinate some event occurs at, no matter what spatial location it occurred at.)


"If the projectiles meet at the halfway point, then the clocks are synchronized."

And how would either end-point know this exactly?


That's the point of using physical projectiles, not light beams. You watch. That's how you know.

"Watch" may mean using something like a phased-array radar to measure it more precisely, if you wish...


If you move the clocks together physically and synchronize them, this removes a large degree of the concern about instantaneous communication (the error introduced is probably well within your tolerances).


But you simply can't. Not even Spanner (Google's globally distributed DB that's using that fancy GPS-based clock) pretends that the clocks are actually synchronized.

https://queue.acm.org/detail.cfm?id=2745385


what time is it anyway?


Can you elaborate on the need? What possible reason, short of a very contrived one, is there for having to keep a large number of machines' clocks in-sync?

And for that matter, why would anyone build any process/system/software that requires a distributed system's machines to all have their clocks in-sync. I am baffled.


I believe Google Spanner obsesses over timestamps quite a lot to deliver distributed transactions with something called TrueTime [1].

[1] http://research.google.com/archive/spanner.html


Well, it's a reasonable solution to some distributed problems. I imagine that many distributed algorithms can be simplified a lot if you have a reliable and accurate time source. If you can build a reliable clock at less cost (development and/or overhead) than a time-insensitive algorithm would cost, then why not do it?


Many security protocols use credentials that are only valid for a brief time. Here is a simple example of a problem caused by my server's time being more than 10 minutes off from the time on S3:

http://illuminatedcomputing.com/posts/2015/04/paperclip_expi...
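A toy version of that failure mode (made-up names; the 10-minute window mirrors the S3 example):

    import time

    MAX_SKEW_SECONDS = 600   # e.g. a service rejecting requests >10 minutes off

    def accept_request(request_timestamp, now=None):
        # If the client's clock has drifted past the window, perfectly valid
        # credentials start failing for no apparent reason.
        now = time.time() if now is None else now
        return abs(now - request_timestamp) <= MAX_SKEW_SECONDS

    print(accept_request(time.time() - 660))   # client 11 minutes slow -> False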


Yep, security is a big one. Authentication & encryption can behave very unpredictably if your times are not in sync. So, you might not be able to login to a server, or connect via HTTPS etc


There is a simple reason why it's needed - in distributed transaction you need to establish the causality of changes. That's why truetime is needed.


Sorry - missed this somehow.

People have mentioned distributed transactions and security, another area is synchronizing modeling with (hard or soft) real time inputs from separate hardware. There are a bunch of ways to get yourself tied up in knots once at least 2 physical bits of hardware are involved.


Hybrid Logical Clocks offer a simple and feasible alternative to Google's custom hardware based TrueTime.

http://muratbuffalo.blogspot.com/2014/07/hybrid-logical-cloc...
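For anyone curious what an HLC actually does, here's a rough sketch following that paper (illustrative only):

    import time

    class HLC:
        # Hybrid logical clock: timestamps stay close to physical time but
        # still capture causality, with no special clock hardware.
        def __init__(self):
            self.l = 0.0   # max physical time observed so far
            self.c = 0     # counter to order events sharing the same l

        def now(self):     # local or send event
            pt = time.time()
            if pt > self.l:
                self.l, self.c = pt, 0
            else:
                self.c += 1
            return (self.l, self.c)

        def update(self, l_msg, c_msg):   # on receiving a timestamped message
            pt = time.time()
            l_new = max(self.l, l_msg, pt)
            if l_new == self.l == l_msg:
                self.c = max(self.c, c_msg) + 1
            elif l_new == self.l:
                self.c += 1
            elif l_new == l_msg:
                self.c = c_msg + 1
            else:
                self.c = 0
            self.l = l_new
            return (self.l, self.c)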


It would be nice if the authors would evaluate how the hardware clocks compare to their software method under maintenance regimes. I understand that their method is "good enough" in the "stable" state across datacenters spread out through the world. The question remains whether the existence of the hardware clocks simplifies the "unstable" states (which occur not when everything runs according to your software design, but when you maintain the infrastructure at that scale), and I believe it still does. I don't have proof, but my belief is that using GPS and atomic clocks at that scale (GPS clocks in different physical locations) isn't an example of doing unnecessary stuff.


"Every review I hear about working at Google makes me want to stay away"

I loved working there. Sure, who your boss is has a huge impact on your happiness. I ended up with a boss I enjoyed working for.

Does that review make you want to stay away, too?


It does for me, at least. I'd absolutely hate having my job satisfaction determined so much by a manager and team that I was assigned. (Ideally I wouldn't want it impacting much at all, but truly flat companies are still hard to find, despite the rhetoric.) It works for some people, but it won't work for me.


I spent about 5 years at Google starting in 2006. When I arrived I was assigned to one manager, but I wasn't _super_ excited about the project. A week in, another manager offered me another opportunity and got me permission to transfer. Problem solved.

A couple years later, a new hire got assigned to a team I was on. He was a little bummed because he'd really had his eye on another project. So we talked to our manager about it and he was allowed to transfer to the team he'd been hoping for.

I have more anecdotes like this, but the long and short of it is that in my experience Google is a lot less capricious and uncaring an organization than you imagine.


>I have more anecdotes like this, but the long and short of it is that in my experience Google is a lot less capricious and uncaring an organization than you imagine.

I think the problem is that their interviewing process is. That's the first thing people encounter and it's the only thing rejected people encounter. That harms people's views of the company, even if they understand the explanation about false positives being so expensive.


In my hiring process, I interviewed with multiple managers. I picked my favorite, and I ended up working for him. It didn't work out. Within a few weeks I was working for a new manager.

And frankly, I've never once worked for a company where I was immune from re-orgs.


> It does for me, at least. I'd absolutely hate having my job satisfaction determined so much by a manager and team that I was assigned.

Well put. There were a lot of reasons I turned down my Google offer out of college and have been unenthusiastic to re-apply, but the process of saying "We'll find something for you" was a huge part of it. (The recruiter wouldn't even listen to something as simple as wanting SWE over SRE.) I ended up at a much smaller company where I could know my job and product and meet my boss before I signed.


I think the best analogy for Google is that it's like being accepted at a University. And they think it's going to take you a while to declare your major. You may not like your weeder classes. You may not like your first few professors. But pretty soon, you'll find a home for yourself, if you want to make the best of it.


Transferring within Google is easy. Yes, you would be expected to stay on your first project for a reasonable chunk of time, but transfers after that are not only possible but expected.


How do you reconcile that with your claim elsewhere that "staff turnover is very very low?"

I worked in the Mountain View office. Transferring is indeed easy, and people did transfer frequently. As a result turnover was high, and it was hard to build friendships, or gel as a team.


I thought turnover in this context was the usual use of turnover as in, people leaving the company and new ones being hired.


Ok. I'm not sure the distinction is that important when it comes to team camaraderie though.


Team turnover vs. company turnover. It's one thing to learn a new project, another to learn the entire ecosystem.


Two months after I was hired, I realized I hated my assigned project and talked to my manager. He talked to the site director, and passed back his response to me: "We don't care." Policy is to stay on your first team for 18 months. Exceptions to that are exceptions.


Your first team after hiring can be a bit arbitrary - you aren't cut out of the decision loop, but as an outsider it is impossible for you to know enough to make an informed decision.

Every team after that is your choice with full access to information. Google has a thriving internal job market.


What makes me want to stay away, apart from the ridiculous recruitment process, is the insane staff turnover. No amount of positive reviews will overcome that for me.


Staff turnover at the Google office I work in is very very low.


Working at Google is pretty cool.

Personally, I get paid pretty good money and have tons of resources to build models that serve 100s of millions of people.


After 9 different interview rounds I've all but given up on trying to get in there. Any advice on trying to get in?


Master Cracking the Coding Interview? Have a positive attitude?


Right, spend all of your free time cramming on algo and data structure questions that you weren't using before and probably won't use at Google.


For 20 years of 6 figure salary at one of the most prestigious engineering jobs in the world? Maybe worth cramming a bit?


Google is not one of the most prestigious anymore. It's sought after because it has nice employee perks, but it's fallen a long way from the early days in terms of prestige. They now have thousands of engineers working on mundane jobs moving around advertising data.

I've interviewed a couple of ex-Googlers (not fired, still employed there) at a startup and we had to turn them down. They could handle a simple coding exercise fine, but then fell over in the component where they had to debug code and extend it with a feature.

That was when I realized a big chunk of Google's employees are going to be a reflection of their interview process (puzzle solvers but not software engineers). This was even reflected in the creation of the Go language. It's stupid simple to deal with Google employees that don't know how to use more advanced features in a language without creating something unmaintainable.

You can get 20 years 6 figure salary in most of the locations Google operates offices working for other companies. Especially in the bay.


If you write any sort of library code, you might actually start using those algos. You might have to start looking at CS papers too if you're assigned a large enough scale problem!

He wanted the solution, and Google interviews are more algo-heavy than most. There is also this:

https://leetcode.com/problemset/algorithms/


Using them and understanding them are still a lot different from memorizing them to the point of working with them on a whiteboard in a Google gang-bang interview.

I reviewed lots of CS papers while in uni and it only took a quick search to confirm the algorithm was what I had in mind. Completely unnecessary to memorize each one.


A good article about this is "There is No Now" on ACM:

https://queue.acm.org/detail.cfm?id=2745385


This isn't super surprising tbh, all of the finance industry has PTP timesources in most every datacenter. Different needs for slightly different industries.


> "something as simple as keeping time"

Keeping time has been a difficult problem for... well, as long as we've tried keeping time. For an interesting historical account on past difficulties, you might check out the book "Longitude" [1].

[1] http://www.amazon.com/Longitude-Genius-Greatest-Scientific-P...


> I think it is crazy that something as simple as keeping time can have such a complicated solution at that sort of scale.

Check this talk: https://archive.fosdem.org/2015/schedule/event/ntimed_ntpd_r... . It's more difficult than it appears.


> I think it is crazy that something as simple as keeping time can have such a complicated solution at that sort of scale.

Timekeeping is such a basic problem that Albert Einstein invented the theory of relativity because of it.


Are these slides attached to a video presentation or something similar? I'd really like to see the talk this accompanies.


> Wow, this file is really popular! Some tools might be unavailable until the crowd clears.

Hopefully the irony isn't lost on the readers of this document :)


What irony?

To me, this isn't irony at all. It shows that google can't defeat the algorithmic complexity of things like consensus and shared state editing, and so gracefully degrades functionality.

That shows a very good understanding of scaling - realizing any solution you choose has limits and tradeoffs, and thinking about and handling those limit cases sanely ahead-of-time, rather than waiting for it to fall over and hoping for the best.


I got an error message saying the document wasn't available at all and to come back after the "crowd clears" which is much more ironic... after I refreshed a bunch it became available...


I think it has more to do with preventing concurrent writes stepping on each other. Google Docs must be the number one example of a real-world system doing Operational Transformation [1]. It's not like HN is bringing down Google's servers.

Incidentally OT is another example of the value of in-sync clocks, for those asking about that elsewhere in this thread.

[1] https://en.wikipedia.org/wiki/Operational_transformation
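For anyone who hasn't seen OT before, the core trick is tiny: transform one edit's position against a concurrent edit so every replica converges regardless of apply order (toy sketch, tie-breaking at equal positions omitted):

    def transform_insert(pos, other_pos, other_len):
        # Shift this insert right if the concurrent insert landed at or before it.
        return pos + other_len if other_pos <= pos else pos

    doc = "scale"
    pos_a, text_a = 0, "web "      # user A inserts "web " at 0
    pos_b, text_b = 5, " matters"  # user B concurrently inserts " matters" at 5

    # Replica 1 applies A first, then B transformed against A.
    r1 = doc[:pos_a] + text_a + doc[pos_a:]
    pb = transform_insert(pos_b, pos_a, len(text_a))
    r1 = r1[:pb] + text_b + r1[pb:]

    # Replica 2 applies B first, then A transformed against B.
    r2 = doc[:pos_b] + text_b + doc[pos_b:]
    pa = transform_insert(pos_a, pos_b, len(text_b))
    r2 = r2[:pa] + text_a + r2[pa:]

    assert r1 == r2 == "web scale matters"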


It's presentations like this that give me that spark to go out and build crazy shit. Sometimes you HAVE to reinvent the wheel, because wheels just don't work to move objects the size of the planet the wheel is bound to. I'm not a fan of re-inventing the wheel by any means, but it's this kind of stuff that makes me love going to work every day and basing ideas off presentations like this!


Well, it isn't every day you get to redefine π :D


previous discussion on work from the same author: https://news.ycombinator.com/item?id=9652528


Is it possible to view the video of the presentation, in which the speaker talks in more detail and explains the slides?


Hi, author here.

The talk for which I put the slide deck together was given at a summer school and unfortunately not recorded, but if there is sufficient interest, I might tape a re-run and upload it. (Though unlikely to have time in the next month, so it might be a while.)

In the meantime, the original papers (listed in the bibliographies at http://malteschwarzkopf.de/research/assets/google-stack.pdf and http://malteschwarzkopf.de/research/assets/facebook-stack.pd...) have a lot more detail than my (very condensed) slides.


Please DO re-run the talk and share it with us. It would be quite valuable for a variety of audiences. Thanks in advance!


Thank you for sharing this, and the consolidated papers list! Would be great if you can record and upload your next re-run!


Wow, that would be awesome!

I hope you do it!


Great! thanks for the share


I understood some of these terms.


Wow. That's amazing. Does anyone know how they do a backup/restore of a GFS distributed file system? (in case they do backups of some sort)


Apart from other people mentioning that distributed file systems have replication as a part of the architecture, Google also uses tape[1][2].

[1] - http://www.tested.com/tech/1926-why-google-uses-tape-to-back... [2] - http://www.theregister.co.uk/2013/12/29/a_year_of_tape_tittl...


I don't have any special knowledge of GFS, but one purpose of distributed file systems of that sort is generally to avoid the need for additional backup by replicating the data many times as part of its routine operation. The data storage layer itself replicates data many times, across multiple data centers, and continually verifies the replicas' integrity and re-replicates as needed.

The GFS paper describes replication in a bit more detail:

> Users can specify different replication levels for different parts of the file namespace. The default is three. The master clones existing replicas as needed to keep each chunk fully replicated as chunkservers go offline or detect corrupted replicas through checksum verification

Some other reasons you wouldn't backup a distributed file system (not saying there are no reasons ever [1]):

(1) It's difficult to add another layer of backup without impacting performance unpredictably at backup time. It's more predictable to implement much of the replication synchronously within the request (while optionally letting some replicas catch up out-of-band).

(2) Files are differently important - some may warrant a greater degree of redundancy than others. The file system can understand this and take advantage of it; a separate backup system on top of the file system probably can't.

(3) A standard backup/restore process often implies downtime during recovery. One goal of distributed systems is to avoid downtime by handling faults transparently. They continuously repair themselves. See: recovery-oriented computing.

(4) A backup and restore process that's in any way intrusive on the operation of the system will not be easy to test on an ongoing basis the way that failure recovery will be tested constantly within the distributed file system. (In a big server fleet, drives will fail all the time, giving you no end of opportunities to exercise your recovery process.)

[1] One reason might be a defense against "unknown unknown" faults in the file system itself that cause it to irrecoverably lose track of data.
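To picture the "continuously repair themselves" part, here's a toy sketch of a master's re-replication loop (structures and names invented for illustration, nothing GFS-specific):

    def rereplicate(chunks, live_servers, clone):
        # chunks: chunk_id -> {"replicas": [server, ...], "target_copies": n}
        # clone(chunk_id, src, dest) copies one chunk between servers.
        for chunk_id, info in chunks.items():
            replicas = [s for s in info["replicas"] if s in live_servers]
            if not replicas:
                continue              # no healthy copy left to clone from
            for _ in range(info["target_copies"] - len(replicas)):
                candidates = [s for s in live_servers if s not in replicas]
                if not candidates:
                    break
                dest = candidates[0]
                clone(chunk_id, replicas[0], dest)   # copy from a healthy replica
                replicas.append(dest)
            info["replicas"] = replicas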


My guess is that they just copy the blobs and then send it to other machines/datacenters.


This may seem irrelevant, but as a GPU computing researcher, I'm still disappointed that GPUs haven't made their way to being first-class citizens in data centers. I know the world of HPC looks much different, but I wonder when GPUs can be widely used in data centers, and whether NVIDIA's Pascal architecture will make a huge impact on this?


I don't think GPUs will be big with "web services" datacenters anytime soon.

As the slides mentioned, the difference between this web-services stuff and HPC is that HPC has a much higher compute/data ratio. Said another way, this web-services stuff is all about moving data, collating it a bit, and less about intensive mathematical processing of it.

Most of the boxes in the diagrams are datastores. Block stores, object stores, columnar stores, caches. I hate to say it but... big data. Not (as) big compute.


They are. With as much neural networks stuff as Google does, it's pretty obvious tons of GPU-hours must be in use.

http://www.wired.com/2013/05/gpus-in-the-data-center/


The thing that might be scary about working at Google is that you work with tech that no one else uses.


The one thing that worries me about leaving Google is how much time I'd have to spend building hacked-together imitations of the infrastructure I'm used to using. It's not because I don't know how other things work, it's because I know how these things work and I like them better.

At least the build system was recently open-sourced, so I won't have to build that from scratch. But things like Borg and D are both elegant and easy to use, and I would hate to have to go back and care about deploying software or configuring RAID arrays again. Unfun and uninteresting. Totally solved at Google for the kinds of problems I work on.


For me - from the outside - the build system and trunk-based development are the most important factors for an organization that scales. Yet I wonder why it is so hard to convince my co-workers... Any anecdotes about how it was started at Google and later at Twitter?


A lot of times that tech no one uses today becomes the industry standard tomorrow.

It's scary being at the bleeding edge.


I'm not sure that Google thinks of that as a problem. In fact, I'd argue a lot of the tech that many companies use nowadays exists because Google pioneered it.


I think that he's talking about the employee perspective, not the company perspective. Google's tech stack is mostly proprietary, which means you can't take your tech skills with you to a new employer.

...And I haven't found it to be a major problem, being an ex-Googler a little over a year out. Tech skills are easy to pick up on your own. My current startup is based on Node.js, a native Android client, and AWS; the one before it was Django & Heroku. Haven't tried looking for jobs yet - I made enough at Google to not have to worry about that for a while - but I occasionally get in-bound interest from big-name, fast-growing startups. Most clueful hiring managers look for experience with problems, not with solutions, and Google lets you face problems that the rest of the industry won't deal with for a while.


As a Googler I find it to be the most fun and most attractive part of the job. I'm not too worried about learning specific technologies. Learning the core fundamentals of how these technologies work is however critical. But since Google is (still nowadays) pioneering a lot of that, esp. in the distributed world, that's not really a problem either.


My fear would be related to not having things to put on my resume or being able to answer specific questions during interviews, but of course that's not remotely an issue for the specific case of ex-Google employees.


A lot of technologies? No, not even close. More like three.

GFS, MapReduce, BigTable were the key Google inventions that became the sparks that ignited the modern day big data/analytics revolution.


There are plenty of other companies that also have tech that Google doesn't have or use. Google has all this amazing tech, but I can assure you working at Google doesn't mean you will ever get to use it directly.


Exactly. Facebook and Yahoo have contributed just as much to the technology landscape as Google has.

And unlike Google they actually make it available to the public and then follow it up with supporting the projects.


A lot of these things were originally developed at Google then spread to other companies (when employees moved etc.). So Google is contributing probably more, but not directly.


Do young (new to google) googlers get to work on these projects?


There is no mention of Spark-like in-memory compute engine(s). Did I miss it in my read? Thanks in advance.


Wish there was audio / text that went along with these slides :(


Kind of ironic: "Wow, this file is really popular! Some tools might be unavailable until the crowd clears"


Incidentally, that's part of scalability: knowing when you can drop functionality and the user won't care. For Google Docs, it's highly unlikely that a file with 1000s of viewers is actively being edited by 1000s of users, and more likely the link just got shared on the Internet. So you can drop edit functionality for everyone but the original author and they won't care.

Another example - one that I've heard second-hand - is how Facebook gets consistency on the news feed. Apparently your own writes to Facebook are sent separately to a write-aside cache, and then the webserver merges them back in whenever you view a page. As a result, Facebook is always strongly-consistent when it comes to your comments (you'll never post something and then fail to see it show up), but it's only eventually-consistent with respect to other peoples' comments. But then, you won't know about or care about the latter, because how would you know that you're not seeing something they posted?
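In toy form, that read-your-own-writes trick looks something like this (made-up names, not Facebook's actual design):

    recent_writes = {}   # user_id -> posts not yet visible in the (lagging) feed store

    def post(user_id, item, feed_store):
        recent_writes.setdefault(user_id, []).append(item)
        feed_store.enqueue_write(user_id, item)      # propagates eventually

    def render_feed(viewer_id, feed_store):
        feed = feed_store.read(viewer_id)            # possibly stale for others' posts
        own = [p for p in recent_writes.get(viewer_id, []) if p not in feed]
        return own + feed                            # the viewer always sees their own posts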


Same at LinkedIn. (Source: presentation by their then-CEO at JavaOne, circa 2009.)


That's really smart. Never noticed a thing. Awesome :)


No. It would be ironic if it stopped working entirely with no notification.

As it is they are just demonstrating competence at scaling.


This is actually a really good example of the CAP theorem. Because of the way Docs merges updates to a single document they had to either tie an editable document to a single machine or risk inconsistent updates. The choice was made to ensure consistency. If a doc gets too popular to be served off a single machine then it goes into read-only mode and gets distributed to multiple boxes. When the traffic dies down it can fall back to a single box and re-enter read/write mode.
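A minimal sketch of that choice (threshold and names made up):

    VIEWER_LIMIT = 200   # made-up cutoff for "too popular for one machine"

    def open_document(doc_id, viewer_count, home_server):
        # Consistency is preserved by funnelling edits through one home server;
        # past the limit the doc degrades to read-only and fans out to replicas.
        if viewer_count <= VIEWER_LIMIT:
            return {"mode": "read-write", "server": home_server}
        return {"mode": "read-only", "server": "any-replica"}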



