Google’s Reliability Team Sat Down for an AMA Right Before Gmail Exploded

SavvyGuard · on Jan 24, 2014

Today's breakdown was a huge warning flag for me. I'm heavily integrated with Google services, and use Hangouts as my main messenger and SMS app on my Nexus phone. While Gmail was down, I couldn't respond to anyone who was using Hangouts to contact me, couldn't share any documents on Google Drive, etc.

I'm going to have to seriously think about the risks of being so heavily reliant on Google services.

markdown · on Jan 24, 2014

Yeah me too. Some of the greatest programmers and sysadmins in the world couldn't deliver 100% uptime. I think I'll have to take over and host my own services in my basement.

I'll show them how it's done!

randallsquared · on Jan 24, 2014

Easy problems can become nearly impossible at scale.

jmathai · on Jan 24, 2014

I'll take GMail with 99% uptime over running Horde in my basement with 100% uptime.

Groxx · on Jan 25, 2014

You get 100% uptime from your ISP? Dang. Where can I sign up?

packetslave · on Jan 25, 2014

Yep. If something has a "one in a million" chance of occurring, you'll run into it 17 times per hour (based on Comscore's Sept 2013 estimates of Google search volume).
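
As a rough check of that arithmetic, here is a small sketch. The hourly query volume is an assumption, back-derived from the "17 times per hour" figure rather than quoted from the Comscore report itself.

```python
# Expected number of "one in a million" events per hour at a given query rate.
# ASSUMPTION: queries_per_hour is inferred from the comment's "17 times per hour";
# it is not a number taken from Comscore.

def expected_events_per_hour(queries_per_hour: float, probability: float) -> float:
    """Expected count of an event with the given per-query probability."""
    return queries_per_hour * probability

ONE_IN_A_MILLION = 1e-6
queries_per_hour = 17_000_000  # assumed value, see note above

events = expected_events_per_hour(queries_per_hour, ONE_IN_A_MILLION)
queries_per_month = queries_per_hour * 24 * 30  # 30-day month, for scale only

print(f"{events:.0f} events/hour at {queries_per_month / 1e9:.1f} billion queries/month")
```

In other words, the claim holds if the service handles on the order of 17 million queries per hour, which works out to roughly 12 billion per month.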

jaimeyap · on Jan 25, 2014

Are the alternatives any better?

If you run a service yourself, you will have downtime, only you won't have Google's amazing SREs and wealth of experience to draw upon to resolve your outage. You will have to fix it yourself.

If you depend on some other service, you are subject to their uptime and disaster recovery capabilities. Nobody is better than Google at either of these things.

But to your point, it probably is a good idea anyways to have offline backups of your critical data, and to have out-of-band fallbacks for critical communication in the event your primary channels have an interruption.

taopao · on Jan 24, 2014

Good luck in your holy grail quest for that 100.000%.

eitally · on Jan 25, 2014

No kidding. We're a Google Apps customer and, since all our employees already have a Google account, decided to use Google Auth for some internal business apps we've developed. While it sucked that Gmail/Hangouts were down, it royally sucked that our other apps were inaccessible. Even after Gmail/Hangouts came back online, Google Auth was throwing 503s for about another half hour. Boooooo!

slowmover · on Jan 24, 2014

We should all take this opportunity to think about the risks of being so heavily reliant on X, Y, or Z.

code_duck · on Jan 25, 2014

So you think that Google isn't going to be taking the effort to maintain as much uptime for its services as possible in the future?

Steko · on Jan 25, 2014

Apparently Google+ was down too.

crickets

_euac · on Jan 25, 2014

apparently. Nobody could confirm though.

throwaway_yy2Di · on Jan 25, 2014

Tangential: why does MapReduce use /dev/random as its entropy source?

"After a long and tricky debugging process, we found that a big MapReduce job was firing up every few hours and, as a part of its normal functioning, it was reading from /dev/random. When too many of the MapReduce workers landed on a machine, they were able read enough to deplete the randomness available on the entire machine. It was on these machines that our serving binaries were becoming unresponsive: they were blocking on reads of /dev/random!"

http://www.reddit.com/r/IAmA/comments/1w1y5m/we_are_the_goog...
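
For context on the blocking behaviour described in that quote: on Linux kernels of that era, reads from `/dev/random` could block when the kernel's entropy estimate ran low, while `/dev/urandom` did not. Below is a minimal sketch, assuming Python on Linux, of drawing random bytes without reading `/dev/random` directly. The helper name `session_token` is made up for illustration and says nothing about how Google's code works.

```python
import os
import secrets

def session_token(num_bytes: int = 16) -> str:
    """Return a hex token from the OS CSPRNG (hypothetical helper name)."""
    # secrets draws from the same non-blocking OS source as os.urandom.
    return secrets.token_hex(num_bytes)

# os.urandom does not wait on the entropy estimate; on Linux it can block
# only until the kernel's pool has been initialized once, early in boot.
raw = os.urandom(32)
token = session_token()

print(len(raw), len(token))
```

A job that pulls its randomness this way should not starve other processes on the same machine of `/dev/random` reads.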

packetslave · on Jan 25, 2014

It sounds like it wasn't MapReduce itself, but rather the specific MR user job that was being run.

hcarvalhoalves · on Jan 24, 2014

Which makes me wonder... did they just pull a Murphy's, or are the services so unstable that they go down if no one's watching over them? Maybe the services already go down multiple times a day, but the outages are short?

timdorr · on Jan 24, 2014

It's unlikely that any Gmail outage would go unnoticed, considering how much activity it gets 24/7.

Also, these guys are in engineering. They are very likely not even directly involved when there are outages. They build the systems and protocols to avoid and recover from outages, but don't actually perform the work themselves. It's developers vs. IT.

menage · on Jan 24, 2014

[I used to be a GMail SRE]

Correct, it's pretty much impossible for an outage to go unnoticed without the GMail on-call being automatically paged.

SREs at GMail are engineers, yes, but they're very much directly involved with fixing outages - not so much at the 'try turning it off and then turning it on again' level, more the 'redirect all traffic away from this cluster into a different one, while we roll back the broken update'.

SRE is a combination of problem-solving when there are outages, and building tools to automate away the manual jobs involved in massive-scale system administration so that outages are less likely to occur.

minwcnt5 · on Jan 24, 2014

Actually, one of the things they said in the AMA is that they don't have any concept of "level one" triage. Rather, they try as much as possible to direct pages to the engineers who built the software because that way it's more likely to get fixed properly and permanently.

shasa · on Jan 24, 2014

I don't think there are only 5 people on the team. Could be a coincidence, or maybe some hacker timed it to perfection. To check downtime: http://downrightnow.com/gmail

noir_lord · on Jan 25, 2014

http://queue.acm.org/detail.cfm?id=2371516#sidebar

Doesn't sound like an organisation that would miss services going down even briefly.

werid · on Jan 25, 2014

Technically incorrect: the AMA was announced right before the downtime occurred, but answers weren't scheduled until a while later. That's a common tactic to let the community post questions and vote on them when there's potential for quite a few of them.

Four SREs showed up. Two answered 4 questions each, another 8, last one 12.

Pretty poor to be honest.

wavefunction · on Jan 25, 2014

Sounds like the standard quality of public interaction with Google, to be honest. I'm not trying to slag Google off, but I don't know of any company of its size and services with as poor customer support as they have.

Maybe Oracle?

pavs · on Jan 25, 2014

Different Google teams have done quite a few AMAs on Reddit. To my knowledge, most of them were semi-live or live and very effective.

dspeyer · on Jan 25, 2014

It's worth remembering that Google doesn't have a single SRE team. Each major service has a separate SRE team. There's a lot of specialized knowledge in each, and redirecting the Search and Storage SREs doing the AMA to help with Gmail (or login, which was probably the problem) would only have resulted in them getting in the way.

vezzy-fnord · on Jan 24, 2014

Damn, TechCrunch sure is having a field day today.

That said, this was already posted in their original article about the Gmail downtime.

shasa · on Jan 24, 2014

I didn't have any problem with Gmail today, though my roommate's mail was down. However, my Yahoo mail is still down.

IBM · on Jan 24, 2014

Looks like it's not just design that Google needs to get better at.

davorak · on Jan 24, 2014

I was under the impression that Google was above the industry standard for uptime. If not, I would like to know who offers similar services with better uptime.

danrockwelljr · on Jan 25, 2014

He's just being a troll:

https://news.ycombinator.com/submitted?id=IBM

https://news.ycombinator.com/threads?id=IBM

IBM · on Jan 25, 2014

I wouldn't have as high of a karma as I do if I were a troll.

lelandbatey · on Jan 25, 2014

I'm really not sure where these comments are coming from. I use dozens, maybe hundreds of different web services, and I have had every single one go down on me at one point or another except for Google services. This is the first Gmail outage I've ever seen, and if they stay this infrequent (10 - 30 minutes per ~5 years, or roughly 0.0004% to 0.0011% downtime) at this price, I'm totally fine with that.

Now, Google could be better about this. It's the implication that they're somehow not good in the first place that's so off-putting.
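
The downtime estimate above can be checked with a few lines of arithmetic. This sketch takes the comment's own inputs at face value (10 to 30 minutes of outage over roughly 5 years); the 365.25-day year is an assumption. Expressed as a fraction, 30 minutes in 5 years is about 0.000011, which is about 0.0011% when written as a percentage.

```python
# Downtime as a fraction and as a percentage of total time.
# ASSUMPTION: a year is taken as 365.25 days.
MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_fraction(outage_minutes: float, years: float) -> float:
    """Fraction of the period spent down."""
    return outage_minutes / (years * MINUTES_PER_YEAR)

for minutes in (10, 30):
    frac = downtime_fraction(minutes, 5)
    print(
        f"{minutes} min over 5 years: fraction {frac:.6f}, "
        f"{frac * 100:.4f}% downtime, {(1 - frac) * 100:.4f}% uptime"
    )
```

Either way, the result is better than "five nines" (99.999%) only at the low end of that range, and just under it at the high end.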

elwell · on Jan 24, 2014

I keep getting a lot of emails at my address (dsp559 [at] hotmail.com).

shasa · on Jan 24, 2014

TechCrunch has provided a link to the person's resume... so much for privacy.