My friend spent the last year rewriting it! She is amazing! We went from Ruby on Rails, AWS, and an $800+/month server bill to paying almost nothing to run it (and with better checking).
We are working on launching a blog for it and she is going to write up a full breakdown of the project. I'll post it to Show HN then.
Here's a write-up that covers it a bit, from when we got featured by Cloudflare last week:
"
As for how we're using Cloudflare workers...
The actual frontend website is a fairly tiny Nuxt/VueJS app that runs in AWS Lambda right now, but we normally have about an 80% cache rate, so Cloudflare is serving most of that anyway. It's on AWS primarily because we have credits with them, plus that is where I started working on it and I had used Lambda in previous projects. At the time we relaunched the site early last year, Cloudflare Workers was still in beta and not as refined as it is today, especially with the improved CLI tooling it has now. The only reason the site itself doesn't live in Workers is that deploying JS apps was a bit...rough last year due to the lack of tooling, but that looks to have improved significantly and I will probably move the frontend site eventually, too.
The backend portion of the site, the part that checks whether a site is down, is just a simple HTTP API. It was also first written for Lambda/Node, but I later rewrote it for Cloudflare Workers, primarily because I liked that the worker automatically executes at the edge location closest to the visitor, instead of us having to deploy our app in numerous AWS regions. It's a perfect use case for how our site works.
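To give a rough idea of the shape of that check, here is a minimal sketch of that kind of edge Worker (not our actual production code; the ?host= parameter name, the 5-second timeout, and the "5xx means down" rule are illustrative assumptions):

    // Minimal Cloudflare Worker sketch: check whether a remote site responds,
    // with a short timeout so a hanging site can't tie up the check for long.
    addEventListener('fetch', event => {
      event.respondWith(handleCheck(event.request))
    })

    async function handleCheck(request) {
      const host = new URL(request.url).searchParams.get('host')
      if (!host) {
        return new Response(JSON.stringify({ error: 'missing host' }), { status: 400 })
      }
      // Real code would also validate/normalize the host so the API can't be
      // abused as an open proxy (more on the attack problem below).

      // Race the remote request against a timeout instead of waiting forever.
      const timeout = new Promise(resolve => setTimeout(() => resolve(null), 5000))
      let verdict
      try {
        const res = await Promise.race([
          fetch(`https://${host}/`, { redirect: 'follow' }),
          timeout
        ])
        verdict = res
          ? { host, up: res.status < 500, status: res.status }
          : { host, up: false, reason: 'timeout' }
      } catch (err) {
        verdict = { host, up: false, reason: 'unreachable' }
      }

      return new Response(JSON.stringify(verdict), {
        headers: {
          'content-type': 'application/json',
          // Short max-age so repeated checks for the same host within a few
          // seconds can be answered from cache instead of hitting the site again.
          'cache-control': 'public, max-age=5'
        }
      })
    }

Because it runs at the edge, a visitor in Tokyo and a visitor in Berlin each get a nearby point of presence doing the lookup, with no per-region deployment on our side.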
The other way we use Workers is in conjunction with your key/value store service (Workers KV). Our site gets a lot of attacks, so I built a rate-limiting system that rate limits abusers and then, if they do not relent, automatically blocks them in the Cloudflare firewall via your API. It uses ELK + Elastalert to accomplish that, which gives us a bit more control over how and when we want to rate limit / block than the one built into Cloudflare. I struggled to get these attacks under control on AWS (even with their WAF) and had fairly frequent events that paged me and required manual intervention, but now I barely ever get paged for issues on the site; it's been at 100% uptime for months.
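If it helps to picture it, the KV side of that is roughly the following pattern (the binding name, window size, and threshold are made up for the example; the real rules live in Elastalert and are more involved). KV is eventually consistent, so this is a soft limit; persistent abusers get pushed into the Cloudflare firewall separately:

    // Sketch of a KV-backed soft rate limit inside a Worker.
    // RATE_LIMIT_KV is a hypothetical KV namespace binding; 120 requests per
    // minute is an illustrative threshold, not our real one.
    async function isRateLimited(request) {
      const ip = request.headers.get('cf-connecting-ip') || 'unknown'
      const windowStart = Math.floor(Date.now() / 60000)   // 1-minute buckets
      const key = `rl:${ip}:${windowStart}`

      const current = parseInt(await RATE_LIMIT_KV.get(key), 10) || 0
      if (current >= 120) {
        return true   // caller responds with a 429 instead of doing the check
      }

      // expirationTtl lets KV clean up old counters on its own
      await RATE_LIMIT_KV.put(key, String(current + 1), { expirationTtl: 120 })
      return false
    }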
I'm a huge proponent of serverless if the project fit is right and I've definitely enjoyed working with Cloudflare a lot more than Lambda.
"
The site used RoR for a long time, so I don't want to trash RoR. I still like Rails a lot (and Ruby is my favorite language by far!), but it didn't make sense for where the site was at the time. Also, my background is primarily in ops/sysadmin, I hadn't used Rails since about version 3.x, and I knew a lot had changed since then. I didn't want to spend time re-learning a large framework like Rails for what was a very simple website. That is a big reason why something JS-based that could run in a serverless environment appealed to me, and why I'd rather lean on the cloud provider to handle resource allocation for spikes...
I had used serverless (Lambda specifically) on a few smaller work projects prior to taking over the downfor site, mostly building simple backend APIs, but nothing on the frontend. I am not a frontend developer at all. As someone that has spent a large portion of their career managing servers, even bare-metal (and being on-call for them), the idea of not worrying about even a misconfigured auto-scaling group was very appealing.
When I took over the site, there were several problems:
1. The site is very bursty by its nature. Our traffic often quadruples, or more, within a minute or two when a popular site goes down. The problem the Rails site faced, even with an AWS auto-scaling group, was that it could not respond to those bursts fast enough to bring up more instances before our Passenger slots were saturated and the site went down. A lot of outages are really quick: people swarm in and then they're gone, and our site would be down as a result, which was embarrassing for a site about downtime :)
2. This is related to the bursty nature of our site -- we get a lot of attacks. Because our site makes remote HTTP requests to other sites, people try to use us as a proxy to attack other sites. The Rails version of the site did not cache remote HTTP lookups at all, not even for five seconds, so one request to our site meant one request to a remote site. We were regularly, and understandably, being blocked by a lot of sites, which further reduced our ability to report whether a site was actually down or not.
3. The site ran on EC2 instances, so the cost was very high for what it was doing: it varied between $600 and $800 a month because of the attacks and how many times the auto-scaling group would bring up more instances. As most everyone knows, bandwidth costs at AWS are high, too, and we would regularly face multi-Gbps attacks. State was stored in a Postgres database, which was also expensive but not really needed given the simplicity of the site.
4. Page load speed. Google regularly complained about site speed: Passenger was intermittently saturated, too many dynamic requests were being made on every page load, and the cache rate was awful.
5. One of the glaring problems with the previous site was that the remote HTTP request happened inside the frontend Rails application, in the foreground, so every page load, even by bots, triggered that request. The request now happens client-side (sketched below), which greatly reduces the number of remote HTTP requests we perform (and subsequently have to wait on). This was a big reason Passenger requests would be saturated so frequently: our site would sit in the foreground waiting on remote websites that are, as you can imagine, likely to be down or hang for many seconds. I think this is why you see a lot of our competitor websites regularly go down during large outages -- their foreground processes are saturated waiting for down websites to time out.
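For anyone curious what "client-side" means in practice for #5, it's essentially this pattern (the /api/check endpoint and renderVerdict helper are hypothetical names, not our exact code): the page renders immediately, and the visitor's browser then asks the check API for a verdict, so a bot that never runs JS never costs us a remote request.

    // Rough sketch of the client-side lookup the page kicks off after it loads.
    async function checkSite(host) {
      const controller = new AbortController()
      const timer = setTimeout(() => controller.abort(), 10000)  // give up after 10s
      try {
        const res = await fetch(`/api/check?host=${encodeURIComponent(host)}`, {
          signal: controller.signal
        })
        renderVerdict(await res.json())
      } catch (err) {
        renderVerdict({ host, up: false, reason: 'check failed or timed out' })
      } finally {
        clearTimeout(timer)
      }
    }

    function renderVerdict(verdict) {
      // Placeholder UI hook; the real site updates Vue component state instead.
      console.log(verdict.up ? `${verdict.host} looks up` : `${verdict.host} looks down`)
    }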
The previous developer built most of the Rails version of the site, but they did not have much of an operations background. It also ran in Elastic Beanstalk, and I didn't like that dependency if I ever wanted to move it somewhere else. I maintained that version of the site at first and tried to keep it online and stable, until I'd had enough (especially being paged in the middle of the night) and decided to rewrite it from scratch and dramatically simplify it. I never want to be woken up in the middle of the night for a side project like this, and I also hate seeing a site I run have downtime or perform poorly :) I am normally against complete software rewrites, having experienced that pain in the past, but I didn't see much to salvage, and the site was very simple, so the rewrite did not take long.
The first serverless iteration (still NuxtJS) was entirely on AWS, which was definitely an improvement over what we had in page load speed, cache rate, management time, and cost. But we still faced regular multi-Gbps distributed attacks that AWS was obviously happy to charge us for. I tried to manage some of these attacks with the AWS WAF service, but it felt very much like a beta product, barely ready for production use. It did help keep a lot of nasty stuff from reaching our backend, which was more expensive to serve than the WAF was to run, but it was not pleasant to use and not flexible.
I had considered setting up a pool of Nginx or HAProxy machines myself, possibly with ModSecurity, to help mitigate some of the smaller frontend attacks and keep our backend from being saturated, but I decided to move the site to Cloudflare instead because the cost was dramatically better. That also took care of both the high-bandwidth attacks and the layer-7 attacks that would saturate our backends. Not having to pay per-request, or AWS's high bandwidth costs, instantly saved us a lot on infrastructure; it also gave us more flexibility to serve JS/captcha challenges to obvious attackers and keep them away from our backend. We still get attacks today, but they are much easier to manage: I can modify the rules as needed from my phone instead of having to edit a config file. Cloudflare also gives us access to a great CDN without having to pay AWS CloudFront pricing.
A lot of the problems with the Rails site could likely have been fixed by moving the frontend to Cloudflare and working on the cache rate and the attack problem to better protect the backend, but I knew I was going to be the only person working on the site, so I wanted a stack that I was comfortable with and could build upon. I still believe Rails was overkill for a site of this simplicity. I also hadn't worked with JS much at that point, as I'm not a frontend developer, so it was a relevant learning experience and a nice transition to using JS more for serverless applications. I had also considered keeping all of the backend operations in a Rails app but moving the entire frontend to a static site.
This is not to say serverless is without problems, but I do find that people denigrate serverless a lot on this site, or dismiss it as some fad. I'll definitely be writing more about that on our blog when we launch it, but I always like to use the appropriate tool for the job, and I still think serverless was a good move for us, given how our site works and my wanting a "hands off" approach to operating it. And I think the results speak for themselves: better performance, much lower cost, almost no active management time, and regularly 100% uptime.
If you have any other questions, let me know! My email is also in my profile if you want to contact me directly.
Thank you. I was very late seeing this, but thank you so much for the write-up. It all makes sense once you describe your problem: Rails wasn't going to be (and possibly never will be) a great fit, and on the surface it seems the serverless model was a match made in heaven for isDown.
They know internally from the real status dashboards & real monitoring.
I think the issue is that this status page is maintained by hand, by humans, and updating it likely requires lots of communication and sign-off approvals, and trying to get numbers to decide whether an outage is big or only impacting a tiny fraction of users. Otherwise it'd probably always be red.
FYI - these charts are more than likely updated by hand, by a person, with a bunch of deliberation and very high level approval (your actual SLA is probably based off of this person's call). At least that's how it was at another FAANG.
That seems reasonable to me. You could argue that the posting should be quicker, but it's never going to be instantaneous. Back-timing is at least somewhat transparent.
The issue with Google Cloud infrastructure components has been resolved for all affected users as of 09:21. Total time of impact was 08:18 to 09:21 US/Pacific, with the most severe impact at the start of the issue, tapering off as services routed traffic away from Atlanta.
The impact of this incident was concentrated in a region that is not a main GCP region and therefore there was no impact to services based on Google Compute Engine. Services that may have been impacted include External HTTP/S Load balancing requests and API requests that may have been served near the Atlanta metro.
The root cause was a set of router failures in Atlanta, which affected traffic routed through that region.
Terri Gross: Now I'll tell you, in preparing for this, I decided, let me Google Google, so I typed in "Google" into the Google search, and I came up with a lot of Google things in the regular search, but in the "Are you feeling lucky?" search, I got nothing.
Larry Page: Well you just got Google itself.
TG: Yeah, I just got Google itself. Oh, I see, Google was giving me itself.
LP: Yeah.
TG: Oh.
LP: In computer science, we call that recursion. [laugh].
TG: Oh, you even have a name for it. [laugh]. I didn't quite get that. I kept thinking it was just repeating itself. I didn't realize it was giving me itself. [laugh].
LP: [laugh]
TG: And what's the name for it?
LP: Uh, recursion. It's... kind of... Sergey is giving me a dirty look.
TG: Why?
LP: It's a loose definition. [laugh]
TG: Lighten up Sergey. [laugh]
LP: It's a loose interpretation of... [laugh]... recursion.
TG: Sergey, what's the more literal interpretation?
I think that graph is biased towards the fact that the east coast is awake while the west coast is not. Also almost 50% of the US population lives in a state that touches the Atlantic.
Cloud: we have a large App Engine workload (2000 req/s at peak - it's *very* spiky, up and down), functions, and heavy Pub/Sub/BigQuery usage. Didn't notice any downtime or large errors.
Last time there was cloud downtime, Google wouldn't give any credit because it was still under the SLA. Though we're really small fish, a tiny gesture of goodwill would have felt nice.
Hard to say. The US population is concentrated in the east, and the west coast is only now getting to work, checking their mail, and searching Bing for why Google is down. You'd need to adjust that map for population and timezone to really measure regional outages.
We're a G Suite customer and just noticed some interruptions here as well. Gmail is flaky, chat is down, and the Apps status page is unavailable too: https://www.google.com/appsstatus Not very useful :P
Been able to log into Analytics web dashboard, but get "try again later" errors for all our properties. Been that way since about 8am PDT here in Seattle.
On this note, a public postmortem for the 16 Oct 2018 YouTube outage does not seem to exist. If enterprise customers are impacted, there will likely be a public postmortem for this incident, but it is still publicly unknown what caused YouTube to go down.
Very sorry about that! We had a router failure in Atlanta, which affected traffic routed through that region. Things should be back to normal now. Just to make sure: this wasn't related to traffic levels or any kind of overload, our network is not stressed by Covid-19.
https://twitter.com/uhoelzle/status/1243217659690278912