Today's breakdown was a huge warning flag for me. I'm heavily integrated with Google services, and use Hangouts as my main messenger and SMS app on my Nexus phone. While Gmail was down, I couldn't respond to anyone who was using Hangouts to contact me, couldn't share any documents on Google Drive, etc.
I'm going to have to seriously think about the risks of being so heavily reliant on Google services.
Yeah me too. Some of the greatest programmers and sysadmins in the world couldn't deliver 100% uptime. I think I'll have to take over and host my own services in my basement.
Yep. If something has a "one in a million" chance of occurring, you'll run into it 17 times per hour (based on Comscore's Sept 2013 estimates of Google search volume)
If you run a service yourself, you will have downtime, only you won't have Google's amazing SREs and wealth of experience to draw upon to resolve your outage. You will have to fix it yourself.
If you depend on some other service, you are subject to their uptime and disaster recovery capabilities. Nobody is better than Google at either of these things.
But to your point, it probably is a good idea anyways to have offline backups of your critical data, and to have out-of-band fallbacks for critical communication in the event your primary channels have an interruption.
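For what it's worth, the "17 times per hour" figure above can be sanity-checked by working backwards. This is only a sketch: the search volume below is back-computed from the comment's own numbers rather than taken from Comscore, and the 30-day month is an assumption.

```python
# Work backwards from "one in a million" and "17 times per hour".
# Assumption: these volumes are implied by the comment, not quoted from Comscore.
ONE_IN = 1_000_000          # a "one in a million" event
EVENTS_PER_HOUR = 17        # figure quoted in the comment
HOURS_PER_MONTH = 24 * 30   # assumed 30-day month

# Searches per hour needed to hit a one-in-a-million event 17 times an hour.
implied_searches_per_hour = EVENTS_PER_HOUR * ONE_IN
implied_searches_per_month = implied_searches_per_hour * HOURS_PER_MONTH


def expected_events_per_hour(searches_per_hour: int, one_in: int = ONE_IN) -> float:
    """Expected number of rare events per hour at a given request volume."""
    return searches_per_hour / one_in


print(implied_searches_per_hour)
print(implied_searches_per_month)
print(expected_events_per_hour(implied_searches_per_hour))
```

That works out to about 17 million searches an hour, or roughly 12 billion a month, which is the volume the estimate implies.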
No kidding. We're a Google Apps customer and, since all our employees already have a Google account, decided to use Google Auth for some internal business apps we've developed. While it sucked that Gmail/Hangouts were down, it royally sucked that our other apps were inaccessible. Even after Gmail/Hangouts came back online, Google Auth was throwing 503s for about another half hour. Boooooo!
Tangential: why does MapReduce use /dev/random as its entropy source?
"After a long and tricky debugging process, we found that a big MapReduce job was firing up every few hours and, as a part of its normal functioning, it was reading from /dev/random. When too many of the MapReduce workers landed on a machine, they were able to read enough to deplete the randomness available on the entire machine. It was on these machines that our serving binaries were becoming unresponsive: they were blocking on reads of /dev/random!"
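On the Linux kernels of that era, reads from /dev/random could block whenever the kernel's entropy estimate ran low, whereas /dev/urandom does not block once the pool has been initialised. As a minimal sketch (not Google's actual fix, which the quote doesn't describe), this is how a Python service might get random bytes without opening /dev/random itself. The function names are made up for illustration.

```python
import os
import secrets


def random_bytes(n: int) -> bytes:
    """Return n random bytes from the OS CSPRNG.

    os.urandom draws from the kernel's urandom source, so it does not
    stall when other processes on the machine are reading a lot of
    randomness (it can only wait at boot, before the pool is initialised).
    """
    return os.urandom(n)


def session_token(n_bytes: int = 16) -> str:
    """Hypothetical helper: a hex token built on the same OS source."""
    return secrets.token_hex(n_bytes)


print(len(random_bytes(32)))
print(len(session_token()))
```

The point is only that a serving binary which needs randomness on its hot path probably shouldn't be doing blocking reads of /dev/random on a shared machine.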
Which makes me wonder... did they just pull a Murphy's, or are the services so unstable that they go down if no one's watching over them? Maybe the services already go down multiple times a day, but the outages are short?
It's unlikely that any Gmail outage would go unnoticed, considering how much activity it gets 24/7.
Also, these guys are in engineering. They are very likely not even directly involved when there are outages. They build the systems and protocols to avoid and recover from outages, but don't actually perform the work themselves. It's developers vs. IT.
Correct, it's pretty much impossible for an outage to go unnoticed without the Gmail on-call being automatically paged.
SREs at Gmail are engineers, yes, but they're very much directly involved with fixing outages - not so much at the 'try turning it off and then turning it on again' level, more the 'redirect all traffic away from this cluster into a different one, while we roll back the broken update'.
SRE is a combination of problem-solving when there are outages and building tools to automate away the manual jobs involved in massive-scale system administration, so that outages are less likely to occur.
Actually, one of the things they said in the AMA is that they don't have any concept of "level one" triage. Rather, they try as much as possible to direct pages to the engineers who built the software because that way it's more likely to get fixed properly and permanently.
I don't think there are only 5 people on the team. Could be a coincidence, or maybe some hacker timed it to perfection.
To check downtime:
http://downrightnow.com/gmail
Technically incorrect: the AMA was announced right before the downtime occurred, but answers weren't scheduled until a while later. That's a common tactic to let the community post questions and vote on them when there are likely to be quite a few.
Four SREs showed up. Two answered 4 questions each, another answered 8, and the last one 12.
Sounds like the standard quality of public interaction with Google, to be honest. I'm not trying to slag Google off, but I don't know of any company of its size and services with as poor customer support as they have.
It's worth remembering that Google doesn't have a single SRE team. Each major service has a separate SRE team. There's a lot of specialized knowledge in each, and redirecting the Search and Storage SREs doing the AMA to help with Gmail (or login, which was probably the problem) would only have resulted in them being in the way.
I was under the impression that Google was above the industry standard for uptime. If not, I would like to know who offers similar services with better uptime.
I'm really not sure where these comments are coming from. I use dozens, maybe hundreds of different web services, and I have had every single one go down on me at one point or another except for Google services. This is the first Gmail outage I've ever seen, and if they stay this infrequent (10 to 30 minutes per ~5 years, or roughly 0.0004% to 0.0011% downtime) at this price, I'm totally fine with that.
Now, Google could be better about this. It's the implication that they're somehow not good in the first place that's so off-putting.
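A quick check of the downtime arithmetic above, assuming a 365.25-day year: 30 minutes over five years is a fraction of about 0.000011 of the total time, which is about 0.0011% when written as a percentage. The function name here is just for illustration.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumed average year length


def downtime_percent(outage_minutes: float, years: float) -> float:
    """Downtime as a percentage of the total elapsed time."""
    return 100.0 * outage_minutes / (years * MINUTES_PER_YEAR)


low = downtime_percent(10, 5)    # 10 minutes of outage in ~5 years
high = downtime_percent(30, 5)   # 30 minutes of outage in ~5 years

print(f"{low:.5f}% to {high:.5f}% downtime")
print(f"worst-case availability: {100 - high:.4f}%")
```

Either way that is comfortably better than "five nines" (99.999%) at the low end and only slightly short of it at the high end.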