Google's Down

randomifcpfan · on March 26, 2020

Authoritative explanation

Very sorry about that! We had a router failure in Atlanta, which affected traffic routed through that region. Things should be back to normal now. Just to make sure: this wasn't related to traffic levels or any kind of overload, our network is not stressed by Covid-19.

https://twitter.com/uhoelzle/status/1243217659690278912

Kialo · on March 26, 2020

Thank you. Upvote this!

bwb · on March 26, 2020

Spain went down for 5 min while i was in a google doc + analytics.

I run downforeveryoneorjustme.com and we went up to 1000 people a minute checking too. It only lasted a few min though.

jedberg · on March 26, 2020

I just want to say thank you for your website. It’s saved my sanity many times.

I’m curious, what’s your tech stack? Your site always seems to be up when everyone else is down. :)

bwb · on March 26, 2020

Thanks :)

My friend spent the last year rewriting it! She is amazing! We went from Ruby on Rails, AWS, and a $800+ a month server bill to paying almost nothing to run it (and with better checking).

We are working on launching a blog for it and she is going to write up a full breakdown of the project. I'll post it to Show HN then.

This is a write up that covers it a bit from how we got featured on CloudFlare last week: " As for how we're using Cloudflare workers...

The actual frontend website is a fairly tiny Nuxt/VueJS app and that operates in AWS Lambda right now, but we have about an 80% cache rate normally, so Cloudflare is serving most of that anyway. It's on AWS primarily because we have credits with them, plus that is where I started working on it and I had used Lambda in previous projects. At the time we relaunched it early last year, Cloudflare workers was still in beta and not as refined as it is now, especially now that it has better CLI tooling. The only reason the site itself doesn't live in Workers is because deploying JS apps was a bit...rough last year due to lack of tooling but it looks like that has improved significantly and I will probably eventually move the frontend site, too.

The backend portion of the site that checks if a site is down is just a simple HTTP API, which also was first written for Lambda/Node, but I later rewrote it in Cloudflare Workers primarily because I liked that it executed the worker in the edge closest to the visitor automatically, instead of us having to deploy our app in numerous AWS regions. It's a perfect use case for how our site works.

The other way we use Workers is in conjunction with the key/val store service you have. Our site gets a lot of attacks and I made a rate-limiting system that will rate limit and then automatically block abusers that do not relent in the cloudflare firewall via your API. It uses ELK + Elastalert to accomplish that, which gives us a bit more control over how and when we want to rate limit / block than the one in built-in to Cloudflare. I struggled a bit getting these under control with AWS (even with their WAF) and had a fairly frequent events that paged me requiring manual intervention, but now I barely ever get paged for issues on the site, it's been 100% uptime for months.

I'm a huge proponent of serverless if the project fit is right and I've definitely enjoyed working with Cloudflare a lot more than Lambda. "

ksec · on March 26, 2020

>We went from Ruby on Rails, AWS, and a $800+ a month server bill to paying almost nothing to run it

Would love to read a write up on it. I know the focus may be on Cloudflare worker and Lambda, but would love to see how did RoR perform as well.

clairegraham · on March 27, 2020

I'm the other person Ben mentioned :)

The site used RoR for a long time, so I don't want to trash RoR. I still like Rails a lot (and Ruby is my favorite language by far!), but it didn't make sense for where the site was at the time. Also, my background is primarily in ops/sysadmin, and I hadn't use Rails since about version 3.x, and I know a lot had changed since then. I didn't want to spend time re-learning a large framework like Rails for what was a very simple website. So that is a big reason why the idea of using something JS-based that could operate in a serverless container was appealing to me and I lean more on the cloud provider handling resource allocation for spikes...

I had used serverless (Lambda specifically) on a few smaller work projects prior to taking over the downfor site, mostly building simple backend APIs, but nothing on the frontend. I am not a frontend developer at all. As someone that has spent a large portion of their career managing servers, even bare-metal (and being on-call for them), the idea of not worrying about even a misconfigured auto-scaling group was very appealing.

When I took over the site, there were several problems:

1. The site is very bursty, by its nature. We often 4x our traffic or more when a popular site goes down, within a minute or two. The problem the Rails site faced, even with an AWS auto-scaling group, was that it could not respond to those bursts fast enough to bring up more instances before our Passenger slots were saturated and the site would go down. A lot of outages are really quick, so people swarm in, and then they're gone, and our site would be down as a result, which was embarrassing for a site about downtime :)

2. This is related to the bursty nature of our site -- we get a lot of attacks. Because our site makes remote HTTP requests to other sites, people try to use us as a proxy to attack other sites. The Rails version of the site did not cache remote HTTP lookups at all, not even for five seconds, so one request to our site meant one request to a remote site. We were regularly, and understandably, being blocked by a lot of sites which further reduced the efficacy of our website to communicate if a site was actually down or not.

3. The site operated in EC2 instances, so the cost was very high for what it was doing. It varied between $600-$800 because of the attacks and how many times the auto-scaling group would bring up more instances. As most everyone knows, too, the bandwidth costs at AWS are high and we would regularly face multi-Gbps attacks. State was stored in a Postgres database, which was also expensive, but not really needed given the simplicity of the site.

4. Page load speed. Google regularly complained about site speed because Passenger requests were being saturated intermittently, and there were too many dynamic requests being made on every page load, and the cache rate was awful.

5. One of the glaring problems with the previous site was that the remote HTTP request happened inside of the frontend Rails application in the foreground, so every page load, even by bots, triggered that request. The request now happens client-side, which greatly reduces the amount of remote HTTP requests we perform (and subsequently have to wait on finishing). This was a big reason why Passenger requests would be saturated frequently, because our site would be waiting for a remote website to be down in the foreground, on websites that are, as you an can imagine, likely going to be down or hanging for many seconds. I think this is why you see a lot of our competitor websites regularly go down during large outages -- their foreground processes are saturated waiting for down websites to timeout.

The previous developer made most of the Rails version of the site, but they did not have much of an operations background. It ran in Elastic Beanstalk, too, and I didn't like that dependency if I ever wanted to move it somewhere else. I maintained that version of the site at first and tried to keep it online and stable, until I had enough (especially being paged in the middle of the night) and decided to rewrite it from scratch and dramatically simplify it. I did not want to ever be woken up in the middle of the night for a side project like this, and I also hate seeing a site I run have downtime or perform poorly :) I am normally against complete software rewrites, having experienced that pain before in the past, but I didn't see much to salvage and the site was very simple, so the rewrite did not take very long.

The first serverless iteration (still NuxtJS) was entirely on AWS, which was definitely an improvement over what we had in page load speed, cache rate, management time, and cost. But we still faced regular distributed multi-Gbps attacks that AWS was obviously happy to charge us for. I tried to manage some of these attacks with the AWS WAF service, but it felt very much like a beta service, barely ready for production usage. It did help prevent a lot of nasty stuff getting to our backend, which was more expensive than the WAF, but it was not pleasant to use and not flexible.

I had considered setting up a pool of Nginx or HAproxy machines myself, possibly with mod_security, to help mitigate some of the smaller frontend attacks and prevent our backend from being saturated, but I decided to move the site to Cloudflare instead because the cost was dramatically better. That also fixed the problem of stopping both high-bandwidth attacks and layer-7 attacks that would saturate our backends. Not having to pay per-request and the high bandwidth cost in AWS saved us a lot on infra cost instantly; it also gave us more flexibility to perform JS/captcha challenges to obvious attackers and prevent them from getting to our backend. We still get attacks today, but they are much easier to manage and modify the rules as needed, and I can do that from my phone easily instead of having to modify a config file. Cloudflare also gives us access to a great CDN, without having to pay AWS Cloudfront pricing.

A lot of the problems with the Rails site could likely have been fixed by moving the frontend to Cloudflare and working on the cache rate and attack problem to better protect the backend, but I knew I was going to be the only person working on the site, so I wanted a stack that I was comfortable with and could build upon. I still believe Rails was overkill for a site of this simplicity. I also hadn't worked with JS much at that point, as I'm not a frontend developer, so it was a relevant learning experience and it also was a nice transition to using JS more for serverless applications. I had also considered keeping all of the backend operations to a Rails app, but move the entire frontend to a static site.

This is not to say serverless is without problems, but I do find that people denigrate serverless a lot on this site, or dismiss it as some fad. I'll definitely be writing more about that on our blog when we launch that, but I always like to use the appropriate tool for the job, and I still think serverless was a good move for us based on how our site works and me wanting a "hands off" approach to operating it. And I think the results speak for themselves in increased performance, much lower cost, almost no active management time, and we regularly hit 100% uptime.

If you have any other questions, let me know! My email is also in my profile if you want to contact me directly.

ksec · on March 30, 2020

Thank You. I was very late seeing this, but thank you so much for the write up. It all make sense once you started to describe your problem and Rails wasn't going to be ( and possibly never will be ) a great fit, and on the surface it seems Serverless model was a match make in heaven for isDown.

gazab · on March 28, 2020

Thanks for this! Very interesting!

bwb · on March 27, 2020

will do!

zymhan · on March 26, 2020

It took me forEVER to memorize that URL lol, but I love your site.

vile_wretch · on March 26, 2020

You can also use http://isup.me

bwb · on March 26, 2020

Thanks :)

bithavoc · on March 26, 2020

Meanwhile their status page[0] is all green. The less they publish on their status page the more nines they preserve.

[0] https://status.cloud.google.com/

Edit: status page not entirely green now, took them 25 minutes to acknowledge.

ricardobeat · on March 26, 2020

It has a 'We are experiencing an issue with Google Cloud infrastructure components.' notice at the top though.

Gmail / GSuite / Youtube are not directly listed, though Firestore is and it's still green.

supdatecron · on March 26, 2020

They added "We are experiencing an issue with Google Cloud infrastructure components" at the top, but everything is still green.

Are they unaware of which components have issues?

bradfitz · on March 26, 2020

They know internally from the real status dashboards & real monitoring.

I think the issue is that this status page is maintained by hand, by humans, and updating it likely requires lots of communication and sign-off approvals, and trying to get numbers to decide whether an outage is big or only impacting a tiny fraction of users. Otherwise it'd probably always be red.

discreditable · on March 26, 2020

G Suite Status is all green with no warnings. https://www.google.com/appsstatus

_ytji · on March 26, 2020

FYI - these charts are more than likely updated by hand, by a person, with a bunch of deliberation and very high level approval (your actual SLA is probably based off of this person's call). At least that's how it was at another FAANG.

pedrorijo91 · on March 26, 2020

Good to know that Google+ is healthy

jvolkman · on March 26, 2020

There are many warnings now.

discreditable · on March 26, 2020

Seeing that now. I like how the outage times are back-timed (11:57am est) to well before they posted anything about it.

jvolkman · on March 26, 2020

That seems reasonable to me. You could argue that the posting should be quicker, but it's never going to be instantaneous. Back-timing is at least somewhat transparent.

jvolkman · on March 26, 2020

For posterity, the final status update:

The issue with Google Cloud infrastructure components has been resolved for all affected users as of 09:21. Total time of impact was 08:18 to 09:21 US/Pacific, with the most severe impact at the start of the issue, tapering off as services routed traffic away from Atlanta.

The impact of this incident was concentrated in a region that is not a main GCP region and therefore there was no impact to services based on Google Compute Engine. Services that may have been impacted include External HTTP/S Load balancing requests and API requests that may have been served near the Atlanta metro.

The root cause was a set of router failures in Atlanta, which affected traffic routed through that region.

rafaelgarrido · on March 26, 2020

Who searched "google" on google.com ??

PopeDotNinja · on March 26, 2020

raises hand

js2 · on March 26, 2020

Terri Gross: Now I'll tell you, in preparing for this, I decided, let me Google Google, so I typed in "Google" into the Google search, and I came up with a lot of Google things in the regular search, but in the "Are you feeling lucky?" search, I got nothing.

Larry Page: Well you just got Google itself.

TG: Yeah, I just got Google itself. Oh, I see, Google was giving me itself.

LP: Yeah.

TG: Oh.

LP: In computer science, we call that recursion. [laugh].

TG: Oh, you even have a name for it. [laugh]. I didn't quite get that. I kept thinking it was just repeating itself. I didn't realize it was giving me itself. [laugh].

LP: [laugh]

TG: And what's the name for it?

LP: Uh, recursion. It's... kind of... Sergey is giving me a dirty look.

TG: Why?

LP: It's a loose definition. [laugh]

TG: Lighten up Sergey. [laugh]

LP: It's a loose interpretation of... [laugh]... recursion.

TG: Sergey, what's the more literal interpretation?

...

Starts at 13:45:

https://www.npr.org/2003/10/14/167643282/google-founders-lar...

dumbfounder · on March 26, 2020

https://pasteboard.co/J0TXBAW.png

cdubzzz · on March 26, 2020

https://www.youtube.com/watch?v=qdjRwpYM-Kw

quickthrower2 · on March 27, 2020

Google do a lot of seo to stay #1 for “google”

DonHopkins · on March 27, 2020

Who targets ads at searches for "google" on google.com? People who do that are a highly sought after demographic.

pilom · on March 26, 2020

Looks like issues are primarily on the east coast: https://downdetector.com/status/google/map/

No issues with any google services thus far for me in Denver.

jedberg · on March 26, 2020

I think that graph is biased towards the fact that the east coast is awake while the west coast is not. Also almost 50% of the US population lives in a state that touches the Atlantic.

It was down for me here in California.

dillondoyle · on March 26, 2020

Im in Denver too. consumer products are up.

Cloud: we have large appEngine (2000 req/s at peak - it's <i>very</i> up and down spiking), functions, and heavy pubsub/biquery. Didn't notice any downtime or large errors.

Last time there was cloud downtime Google wouldn't give any credit because it was still under the SLA. Though we're really small fish a tiny gesture of good will would have felt nice.

tinix · on March 26, 2020

Denver here, I got a few 500 errors earlier this morning on search.

husamia · on March 26, 2020

Not only east coast, it looks global.

albertzeyer · on March 26, 2020

No, it does not. No problems at all here (Germany).

jldugger · on March 26, 2020

Hard to say. US populations are eastern centralized; west coast is only now getting to work, checking their mail, and searching bing for why google is down. You'd need a population + timezone forecast of that map to really measure regional outages.

jve · on March 26, 2020

Me from Europe. Youtube has been open for last ~20mins and is UP.

sevenf0ur · on March 26, 2020

We're a gsuite customer and just noticed some interruptions here as well. GMail is flaky, chat is down, and the app status page is unavailable as well. https://www.google.com/appsstatus Not very useful :P

6gvONxR4sf7o · on March 26, 2020

Can we get a title change? "Google's Down" tells me something totally different.

judge2020 · on March 26, 2020

Discord as well, looks like it's hitting GCP https://twitter.com/hopeseekr/status/1243201607854108673?s=2...

murillians · on March 26, 2020

Search or another google service? G Suite is reporting some issues, but i'm connecting to search fine https://twitter.com/gsuite/status/1243203000195256320

heartbreak · on March 26, 2020

I can't connect to search. I get a 500 error page.

cpach · on March 26, 2020

So far Gmail and Youtube works fine here (Stockholm area, Sweden)

larrik · on March 26, 2020

I got a 503 from Google Classroom, but it seems intermittent.

geephroh · on March 26, 2020

Been able to log into Analytics web dashboard, but get "try again later" errors for all our properties. Been that way since about 8am PDT here in Seattle.

sixhobbits · on March 26, 2020

Haven't noticed issues here in Switzerland

bob33212 · on March 26, 2020

Back up for me. I have Google fiber as well

timvisee · on March 26, 2020

It seems to be pulling a LOT with it. Even services like Microsoft Office 356

tyingq · on March 26, 2020

Apt typo. If it's 356, they have 9 days where they can be down a year.

kabdib · on March 26, 2020

"Five nines of reliability? Pshaw! We have nine FIVES!"

olyjohn · on March 26, 2020

The running joke at my office was that we'd call it Microsoft 364 or 363 every time we had an outage.

sumoboy · on March 26, 2020

Google ads is insanely slow.

bsilvereagle · on March 26, 2020

On this note, a public postmortem from the 16 Oct 2018 YouTube outage does not seem to exist. If enterprise customers are impacted, there will likely be a public postmortem for this incident, but it is still publicly unknown what caused YouTube to go down.

nathanaldensr · on March 26, 2020

Discord was down for several minutes earlier, as well. Related?

ghastmaster · on March 26, 2020

Gmail down, search working, youtube working in Cincinnati ohio.

Aaronstotle · on March 26, 2020

G-suite has been intermittent for me this morning.

jeffrogers · on March 26, 2020

Google Classroom getting 504s on the west coast

FactolSarin · on March 26, 2020

Hangouts seems to be down altogether

sarathyweb · on March 26, 2020

Search is working fine in India.

mercurial · on March 26, 2020

Works fine here (Denmark)

superbm · on March 26, 2020

Now you see that it's not only Microsoft problem...

perlpimp · on March 26, 2020

works fine in Toronto, Canada

MasonBario · on March 26, 2020

Haven't noticed :/