Google’s Reliability Team Sat Down for an AMA Right Before Gmail Exploded (techcrunch.com)
100 points by shasa on Jan 24, 2014 | 34 comments



Today's breakdown was a huge warning flag for me. I'm heavily integrated with Google Services, and use Hangouts as my main messenger and SMS app on my Nexus phone. While Gmail was down, I couldn't respond to anyone who was using hangouts to contact me, couldn't share any documents on Google Drive, etc.

I'm going to have to seriously think about the risks of being so heavily reliant on Google services.


Yeah me too. Some of the greatest programmers and sysadmins in the world couldn't deliver 100% uptime. I think I'll have to take over and host my own services in my basement.

I'll show them how it's done!


Easy problems can become nearly impossible at scale.


I'll take GMail with 99% uptime over running Horde in my basement with 100% uptime.


You get 100% uptime from your ISP? Dang. Where can I sign up?


Yep. If something has a "one in a million" chance of occurring, you'll run into it 17 times per hour (based on Comscore's Sept 2013 estimates of Google search volume)
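The arithmetic behind that figure can be run in reverse: a one-in-a-million event showing up 17 times per hour implies about 17 million searches per hour. A minimal sketch, assuming a 30-day month for the monthly total (the variable names are illustrative, and only the numbers quoted in the comment are used):

```python
# Back out the search volume implied by a "one in a million" event
# occurring 17 times per hour, as stated in the comment above.
ONE_IN = 1_000_000        # "one in a million"
EVENTS_PER_HOUR = 17      # figure quoted in the comment

# If 1 search in ONE_IN triggers the event, the hourly volume is:
searches_per_hour = EVENTS_PER_HOUR * ONE_IN

# Assumption: a 30-day month, used only to get a rough monthly figure.
searches_per_month = searches_per_hour * 24 * 30

print(f"{searches_per_hour:,} searches per hour")
print(f"{searches_per_month / 1e9:.1f} billion searches per month")
```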


Are the alternatives any better?

If you run a service yourself, you will have downtime, only you won't have Google's amazing SREs and wealth of experience to draw upon to resolve your outage. You will have to fix it yourself.

If you depend on some other service, you are subject to their uptime and disaster recovery capabilities. Nobody is better than Google at either of these things.

But to your point, it probably is a good idea anyways to have offline backups of your critical data, and to have out-of-band fallbacks for critical communication in the event your primary channels have an interruption.


Good luck in your holy grail quest for that 100.000%.


No kidding. We're a Google Apps customer and, since all our employees already have a Google account, decided to use Google Auth for some internal business apps we've developed. While it sucked that Gmail/Hangouts were down, it royally sucked that our other apps were inaccessible. Even after Gmail/Hangouts came back online, Google Auth was throwing 503s for about another half hour. Boooooo!


We should all take this opportunity to think about the risks of being so heavily reliant on X, Y, or Z.


So you think that Google isn't going to be taking the effort to maintain as much uptime for its services as possible in the future?


Apparently Google+ was down too.

crickets


apparently. Nobody could confirm though.


Tangential: why does MapReduce use /dev/random as its entropy source?

"After a long and tricky debugging process, we found that a big MapReduce job was firing up every few hours and, as a part of its normal functioning, it was reading from /dev/random. When too many of the MapReduce workers landed on a machine, they were able to read enough to deplete the randomness available on the entire machine. It was on these machines that our serving binaries were becoming unresponsive: they were blocking on reads of /dev/random!"

http://www.reddit.com/r/IAmA/comments/1w1y5m/we_are_the_goog...
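For context on the quoted failure: on Linux kernels of that era, a read from /dev/random could block when the kernel's entropy estimate ran low, while /dev/urandom did not block in that situation. A minimal Python sketch of drawing random bytes through `os.urandom` instead of reading /dev/random directly. This is an illustration only, not what the MapReduce job or Google's serving binaries actually did, and `random_bytes` is a hypothetical helper name:

```python
import os


def random_bytes(n: int) -> bytes:
    """Return n random bytes from the operating system's CSPRNG.

    os.urandom does not wait on the /dev/random entropy estimate,
    so heavy readers elsewhere on the machine do not stall this call.
    """
    return os.urandom(n)


# Example: a 16-byte token, printed as hex.
token = random_bytes(16)
print(token.hex())
```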


It sounds like it wasn't MapReduce itself, but rather the specific MR user job that was being run.


Which makes me wonder... did they just pull a Murphy's, or are the services so unstable that they go down if no one's watching over them? Maybe the services already go down multiple times a day, but the outages are short?


It's unlikely that any gmail outage would go unnoticed, considering how much activity it gets 24/7.

Also, these guys are in engineering. They are very likely not even directly involved when there are outages. They build the systems and protocols to avoid and recover from outages, but don't actually perform the work themselves. It's developers vs. IT.


[I used to be a GMail SRE]

Correct, it's pretty much impossible for an outage to go unnoticed; the GMail on-call gets paged automatically.

SREs at GMail are engineers, yes, but they're very much directly involved with fixing outages - not so much at the 'try turning it off and then turning it on again' level, more the 'redirect all traffic away from this cluster into a different one, while we roll back the broken update'.

SRE is a combination of problem-solving when there are outages and building tools to automate away the manual jobs involved in massive-scale system administration, so that outages are less likely to occur.


Actually, one of the things they said in the AMA is that they don't have any concept of "level one" triage. Rather, they try as much as possible to direct pages to the engineers who built the software because that way it's more likely to get fixed properly and permanently.


I don't think there are only 5 people on the team. Could be a coincidence, or maybe some hacker timed it to perfection. To check downtime: http://downrightnow.com/gmail


http://queue.acm.org/detail.cfm?id=2371516#sidebar

Doesn't sound like an organisation that would miss services going down even briefly.


Technically incorrect: the AMA was announced right before the downtime occurred, but answers weren't scheduled until a while later. That's a common tactic to let the community post questions and vote on them when there's potential for quite a few of them.

Four SREs showed up. Two answered 4 questions each, another 8, last one 12.

Pretty poor to be honest.


Sounds like the standard quality of public interaction with Google, to be honest. I'm not trying to slag Google off, but I don't know of any company of its size and services with as poor customer support as they have.

Maybe Oracle?


Different Google teams have done quite a few AMAs on Reddit. To my knowledge, most of them were semi-live/live and very effective.


It's worth remembering that Google doesn't have a single SRE team. Each major service has a separate SRE team. There's a lot of specialized knowledge in each, and redirecting the Search and Storage SREs doing the AMA to help with gmail (or login, which was probably the problem) would have only resulted in being in the way.


Damn, TechCrunch sure is having a field day today.

That said, this was already posted in their original article about the Gmail downtime.


I didn't have any problem with gmail today though my roommate's mail was down. However, my yahoo mail is still down.


Looks like it's not just design that Google needs to get better at.


I was under the impression that Google was above the industry standard for uptime. If not, I would like to know who offers similar services with better uptime.



I wouldn't have as high of a karma as I do if I were a troll.


I'm really not sure where these comments are coming from. I use dozens, maybe hundreds of different web services, and I have had every single one go down on me at one point or another except for Google services. This is the first Gmail outage I've ever seen, and if they stay this infrequent (10 - 30 minutes per ~5 years, or roughly 0.0004% to 0.0011% downtime) at this price, I'm totally fine with that.

Now, Google could be better about this. It's the implication that they're somehow not good in the first place that's so off-putting.
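The downtime estimate above (10 to 30 minutes per roughly 5 years) is easy to check. A minimal sketch, assuming an average year of 365.25 days; `downtime_percent` is a hypothetical helper name:

```python
# Assumption: average year length of 365.25 days.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_percent(outage_minutes: float, years: float) -> float:
    """Return the percentage of the period that the service was down."""
    return 100 * outage_minutes / (years * MINUTES_PER_YEAR)


# The two ends of the range given in the comment.
for minutes in (10, 30):
    pct = downtime_percent(minutes, 5)
    print(f"{minutes} min over 5 years: {pct:.5f}% downtime")
```

As a fraction rather than a percentage, the 30-minute case works out to about 0.000011, which is about 0.0011% downtime.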


I keep getting a lot of emails at my address (dsp559 [at] hotmail.com).


TechCrunch has provided a link to the person's resume... so much for privacy.



