Fastly Outage (fastly.com)
1255 points by pcr0 on June 8, 2021 | 694 comments



This seems to be impacting a number of huge sites, including the UK government website[0].

[0] https://www.gov.uk/

https://m.media-amazon.com/

https://pages.github.com/

https://www.paypal.com/

https://stackoverflow.com/

https://nytimes.com/

Edit:

Fastly's incident report status page: https://status.fastly.com/incidents/vpk0ssybt3bj


Fastly Engineer 1: Seems like a common error message. Can you check stackoverflow to see if there's an easy fix?

Fastly Engineer 2: I have some very bad news...


Well, with SO, at least you can search on Google and view the version cached by Google just fine.

With Reddit however, these days almost all comments are locked behind “view entire discussion” or “continue this thread”. In fact, just now I searched for something for which the most relevant discussion was on Reddit; Reddit was down so I opened the cached version, and was literally greeted by five “continue this thread”s and nothing else. What a joke.


Reddit's attempts at dark patterns are embarrassing from all perspectives. If you use dark patterns it's a laughably abysmal implementation. If you abhor dark patterns, it's a frustration.


It's just enough to annoy you but not enough to make everyone leave the platform


They've actually done a masterful job of finding this balance. I've been on reddit for 15 years and would have quit if they didn't leave the old interface available.


On the same day that old.reddit.com stops working I'll leave.


That and third party access to their API.

Sync is so much better than the official app it's not even funny.


Preach!


The mobile version is literally unusable. Half the subs show an error and you can't load most comments.


I think it's because there haven't been any interesting alternatives. I know if I ever see one I'll probably switch in a femtosecond.


Not _yet_, the same was said about Digg once.


I honestly thought Reddit would die when they introduced Reddit awards, it seemed like such an obvious cash grab. You can't underestimate the amount of community momentum that the site has though.


Eh, as far as funding methods go, letting people throw away money is not the worst one.


It's one of the three major sites I use. Yeah...


Yeah, it's crazy how user-hostile reddit.com has become. Fortunately old.reddit.com is still available, but for how long? If only Javascript did not exist, it would be impossible for UX people to come up with something that bad.


> If only Javascript did not exist, it would be impossible for UX people to come up with something that bad.

Arrange the html so that the list of comments is at the end (via css). Keep the http connection open, have the show more button send some sort of request, and when you receive that request, send the rest of the page over the original http connection.

As usual, solve people problems via people, not tech.
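
A rough sketch of that held-connection idea, assuming Python's standard http.server; the paths and page content are made up for illustration, and real browsers may buffer partial HTML, so treat it as a thought experiment rather than production code:

    import threading
    from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

    show_more = threading.Event()  # set when the "show more" request arrives

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/":
                self.send_response(200)
                self.send_header("Content-Type", "text/html")
                self.send_header("Connection", "close")  # body ends when we close the socket
                self.end_headers()
                # send everything except the comment list, then hold the connection open
                self.wfile.write(b"<p>article...</p><p><a href='/more'>show more</a></p>")
                self.wfile.flush()
                show_more.wait()  # block until /more is requested on another connection
                self.wfile.write(b"<ol><li>comment 1</li><li>comment 2</li></ol>")
            elif self.path == "/more":
                show_more.set()          # release the original, still-open response
                self.send_response(204)  # No Content, so the browser stays on the page
                self.end_headers()

    if __name__ == "__main__":
        ThreadingHTTPServer(("127.0.0.1", 8000), Handler).serve_forever()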


How would you make the button send a request without js and without navigating to another page?

Maybe css to load an image on :active or is there some better way?


Here are two robust techniques that I haven’t seen actually employed in production for maybe fifteen years:

① A submit button or link targeting an iframe which is visually hidden. (Or even don’t hide it. If only seamless iframes had happened, or any other way of auto-resizing an iframe: relevant spec issues are https://github.com/whatwg/html/issues/555 and https://github.com/w3c/csswg-drafts/issues/1771.)

② A submit button or link to a URL that returns status 204 No Content.

(CSS image loading in any form is not as robust because some clients will have images disabled. background-image is probably (unverified claim!) less robust than pseudoelement content as accessibility modes (like high contrast) are more likely to strip background images, though I’m not sure if they are skipped outright or load and aren’t shown. :active is neither robust nor correct: it doesn’t respond to keyboard activation, and it’s triggered on mouse down rather than mouse up. Little tip here for a thing that people often get wrong: mouse things activate on mouseup, keyboard things on keydown.)
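
For the curious, a minimal sketch of ① and ② with Python's standard http.server; the /vote endpoint and the markup are invented for illustration, not anything Reddit actually serves:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    PAGE = b"""<form action="/vote" method="post" target="sink">
      <button>upvote (response lands in the hidden iframe)</button>
    </form>
    <iframe name="sink" hidden></iframe>
    <p><a href="/vote">upvote via 204 (browser stays on this page)</a></p>
    """

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/vote":
                self.send_response(204)  # No Content: the current page is not replaced
                self.end_headers()
            else:
                self.send_response(200)
                self.send_header("Content-Type", "text/html")
                self.end_headers()
                self.wfile.write(PAGE)

        def do_POST(self):
            # the form posts into the hidden iframe, so whatever we return stays out of view
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(b"recorded")

    if __name__ == "__main__":
        HTTPServer(("127.0.0.1", 8000), Handler).serve_forever()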


Mhhh, iframes all the way down. Could make a nice experiment.


Yep:

.button:active { background-image: url('/some-reference-thats-actually-a-tracker'); }


Well technically everything is possible. But Javascript was precisely designed to encourage this kind of pattern.

> As usual, solve people problems via people, not tech.

So true..


“Continue this thread” links don’t depend on JavaScript at all.

“View entire discussion” couldn’t be implemented perfectly with <details> in its present form, but you can get quite close to it with a couple of different approaches.

I think the infinite scrolling of subreddits is about the only thing that would really be lost by shedding JavaScript. Even inline replies can be implemented quite successfully with <details> if you really want.


Yeah, I'm going to stop using the platform when they get rid of this. Not interested.


Why wait? You are wasting your life away.


And commenting on HN is any more productive?


When it goes away you can try teddit.net


Why wait? Teddit has been a great substitute for reading in a mobile browser, and making an iOS shortcut for transforming Reddit links was pretty straightforward.


Impossible? Man, it's crazy how fast people forget things like good old fashioned <form> GETs and POSTs. It would obviously be a full page refresh, but other than that the same awful UX could still be implemented.


I wanted to suggest site:old.reddit.com since I use that version with automatic redirect, but this:

https://old.reddit.com/robots.txt

is very different from this:

https://reddit.com/robots.txt

I guess there is a market for a search engine (maybe accessed through Tor) which does not care about robots.txt, DMCAs, the right to be forgotten, etc. Bootstrapping it should not be that hard, since it can also provide better results for some queries: nobody is fighting over its rankings until it's widely known.

I'm not sure how far we are from being able to do full-text internet search. Or rather even quote search, preferably with some fuzziness options. That would be cool; Google's quotation marks were really neat back when they worked.


Wonder what the story is behind these two...

    User-Agent: bender
    Disallow: /my_shiny_metal_ass
    
    User-Agent: Gort
    Disallow: /earth


Those are good old Easter eggs, perhaps a memory from when Reddit was a nice place. They stop appearing and get replaced by dark patterns once sites jump the shark.


I read that some people use fake slugs in robots.txt as a honeypot of sorts. IPs that actually read the robots.txt, ignore the disallow, and still access the URI are outright banned.
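
A rough sketch of how that can work, assuming a standard combined-format access log; the honeypot path and file names are illustrative:

    import re

    HONEYPOT = "/my_shiny_metal_ass"  # listed only in robots.txt, never linked anywhere
    banned = set()

    # combined log format begins: client_ip - - [timestamp] "METHOD path HTTP/x.x" ...
    line_re = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "\S+ (\S+)')

    with open("access.log") as log:
        for line in log:
            m = line_re.match(line)
            if m and m.group(2).startswith(HONEYPOT):
                banned.add(m.group(1))  # this client read robots.txt and ignored the Disallow

    for ip in sorted(banned):
        print(ip)  # feed these into a firewall rule or blocklist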


Then when a flamewar breaks out you just have to get your adversary to click a link to get them IP banned.


Ha, that would have been a really smart idea! Sadly we didn't think of that at the time. But we had other honey pot URLs.


It might be related to the time a few years ago when Google added exclusions for user agent t1300 in regard to its founders. Gort seems to be a robot from old sci-fi, and Bender might be something similar.


Bender is from Futurama, Gort is old classic scifi: https://en.wikipedia.org/wiki/Gort_(The_Day_the_Earth_Stood_...


Just some fun humor we added for other nerds who read robots.txt files.


Easter eggs


It's neckbeard humor.


I'll have you know our beards were neatly trimmed when we added those.


I know you, and I find that hard to believe ;)


> I guess there is a market for a search engine (maybe accessed through Tor) which does not care about robots.txt, DMCAs, the right to be forgotten, etc. Bootstrapping it should not be that hard, since it can also provide better results for some queries: nobody is fighting over its rankings until it's widely known.

That’s not going to happen before Cloudflare is dethroned. See this recent thread for some perspective: https://news.ycombinator.com/item?id=27153603

And even if there’s no Cloudflare, large sites that people want to search will always find ways to block bad bots.

The only thing I can think of that might work is using crowd-sourced data, with all the problems that come with crowdsourcing.


Sadness.

There is a solution for all this mess, and I'm blocking HN and a few other domains until I implement at least the first step, after which I can share it here.


try editing your hosts file to redirect reddit to old.reddit

/etc/hosts:

    reddit.com old.reddit.com
    www.reddit.com old.reddit.com
    np.reddit.com old.reddit.com


I am archiving subreddits on Github in plain-text org-mode. If you have some subreddit in mind, open an issue, and I'll create an archive repo for it.

- https://github.com/NightMachinary/r_HPfanfiction

- https://github.com/NightMachinary/r_rational


Try “site:old.Reddit.com”


That's not going to work.

  $ curl https://old.reddit.com/robots.txt
  User-Agent: *
  Disallow: /

Also, even if search engines are allowed, old.reddit.com pages are not canonical (<link rel="canonical"> points to the www.reddit.com version, which is actually reasonable behavior), so pages there would not be crawled as often or at all.


Stack Overflow is down, can someone tell me how to declare a static multidimensional array in C++?


Google and DDG surface SO results cached within their own page. Here’s the copied answer:

  int main() {
      int arr[100][200][100]; // allocate on the stack
      return 0;
  }


Haha! That sounds highly plausible!


Haha! That explains why the internet was down for a while!


Oh man, how do we keep a pocket copy of SO? All of our jobs depend on it.


You can use kiwix (https://www.kiwix.org/en/).

It is an open-source software that allows you to keep and read offline static versions of websites in a specialized archive format (zim-files)

It was originally designed to allow you to read wikipedia offline, but there are also dumps of stackoverflow available on the relevant page : https://wiki.kiwix.org/wiki/Content_in_all_languages


Here, just pin the underlying IPFS object, or use this one hosted by cloudflare: https://ipfs-sec.stackexchange.cloudflare-ipfs.com/


https://kapeli.com/dash also has the ability to download offline archives of SO. Its interface is very good.


You can download the database dump from https://archive.org/details/stackexchange.


no they don't, but if yours does you can download a complete datadump of SO from them.


But https://news.ycombinator.com/ is UP! :) Prepare those HN servers for massive influx in 3...2..1..


While we're here.. I am a bit surprised to see how many sites use Fastly. As a dev I've always been happy with Cloudflare.


Me too, but in a way I'm even happier knowing that not everyone does and something else popular exists too.


Google's Firebase platform uses Fastly so that's a significant chunk of the web.


Now imagine how many sites would go down if it was CF


No need to imagine! Just search HN for "cloudflare outage" and you'll see that it happened several times over the last few years


Is this a call for competition? I regard Cloudflare as state-of-the-art in terms of security and ease-of-use. I certainly hope their knowledge replicates across other organizations. As of now they're still building highly impactful tools that are easy to use and that no one else quite provides. I don't really expect another organization to match them given the strength of their current leadership. I think they've built in a head start for a while.


> Cloudflare as state-of-the-art in terms of security and ease-of-use

Depends on whose security. I value my security dearly and that's why I use the Tor Browser. Cloudflare has decided I cannot browse any of their websites if I care about my security (they filter out Tor users and archiving bots aggressively), so I'm not using any Cloudflare-powered website. Is it good for security that we prevent people from using security-oriented tooling, and let a single multinational corporation decide who gets to enter a website or not? In my book creating a SPOF is already bad practice, but having them filter out entrances is even worse.

Also, are all of these CDNs and other cloud providers solving the right problems?

If you want your service to be resilient against DDOS attacks, you don't need such huge infrastructure. I've seen WP site operators move to Cloudflare because they had no caching in place, let alone a static site.

If you want better connectivity in remote places where our optic fiber overlords haven't invested yet, P2P technology has much better guarantees than a CDN (content-addressing, no SPOF). IPFS/dat/Freenet/Bittorrent... even multicast can be used for spreading content far and wide.

Why do sysadmins want/use CDNs? Can't we find better solutions? Solutions that are more respectful to spiders and privacy-minding folks with NoScript and/or Tor Browser?


Speaking for myself here, I don't see how people can use the web without javascript. As for Tor, you're routing other people's traffic while they route yours, so I can understand how such connections would be blocked given that blocking IPs is still a method for mitigating security issues, and you can't determine the IP of a Tor browser.


> I don't see how people can use the web without javascript.

It's pretty easy: browse marked-up documents, not applications. If some developer mistakes the first for the second, move on.


> As for Tor, you're routing other people's traffic while they route yours

Using Tor doesn't imply that your machine is also a Tor exit node.


They have also been responsible for one of the worst security incidents ever:

https://news.ycombinator.com/item?id=13718752

Only discovered, we should not forget, due to the good graces of Google Project Zero.

A certain dose of skepticism towards any technical offering out there would be advised.


I like Cloudflare's post mortems, and I like how they fight back against patent trolls. For me as a dev they are #1.


Do you have experience with the competitors?


I prefer tech that I can use both at work and on hobby projects at home.

To that end I've only used cloudflare and netlify. The others have too much friction to try out. I expect I would get experience on the job if necessary.


Do more rely on Cloudflare? Because this felt like it was more than half the internet, certainly more than half the biggest sites.


I think so, Fastly seems to have a few huge enterprise clients while Cloudflare seems more balanced (and larger)


I think that Fastly starts at $50/month, no free tier. So that would preclude small or not-profit-motivated sites from using it.


Interesting thought ... a new type of 'too big to fail'?


No more than a transatlantic cable...


Where is Akamai in this comparison?


Fair point. Maybe Fastly is more akin to Akamai given it seems to be more enterprise-y. By market cap, Cloudflare is 26 billion, Akamai is 18, and Fastly is 6.

Fastly's free offering gives you "$50 worth of traffic" whereas Cloudflare has a perpetually free option. And for Akamai you have to apply for a free trial.


That is market cap, but if you look at the amount of traffic, Akamai is estimated at 15-30% and CF at 10%.

So if it went down, it would cripple a vast amount of the internet.


Akamai is balls deep in video streaming, which is probably the most bandwidth/traffic intense thing for a CDN to dabble with. My guess is that CF has much more diverse traffic. Hence the fallout from an interruption would be quite different.


Not quite; Akamai is more large-corp centric (they don't serve the average Joe), and besides that they also do security. If it went down, all of a sudden you would see e.g. a lot of DDoS become possible.


New error now, hopefully fix in progress.

Fastly error: unknown domain: www.fastly.com.

Details: cache-syd10161-SYD


>The issue has been identified and a fix is being implemented.

According to the status page.


That doesn't take away their embarrassment. It's crazy how many websites rely on Fastly. Twitter hasn't been loading emojis for a while, and I believe it's for the same reason.


Might not be the case anymore, but a few years back, Hacker News was just running on a single server.


fairly sure it still is.


I am already here


Amusingly, the Stackoverflow 503 page has a typo:

  Error 503 Service Unavailable
  Service Unavailable
  
  Guru *Mediation*:
  Details: cache-lon4236-LON 1623146049 854282175
  
  Varnish cache server


We use Fastly (and our site is down too), but I asked them about this a couple of years ago. It is deliberate. They said it was so they can tell whether it is their Varnish service or the customer's Varnish service that is down.


This comment is correct. I made that change ages ago. Amused it's still there.


Fastly modified the Varnish error message so that it is clear whether the error is returned by Fastly's Varnish or by the origin's Varnish, should the customer run their own Varnish on the origin.


Looks like an error (typo) with Varnish .. it's the same on multiple sites.

Maybe a good way to work out which versions are being used.


It's interesting because Varnish gets it write in their docs: https://varnish-cache.org/docs/trunk/users-guide/troubleshoo...


> It's interesting because Varnish gets it write in their docs

Sorry I couldn't help myself...it's too funny to misspell "right" in a thread about spelling mistakes


Mruphy's Law in action


Isn't it Muphry's Law? Or was that another example :)

https://en.wikipedia.org/wiki/Muphry%27s_law


this is devolving quickly into an r/something


;)



Same 503 error message on the actual http://fastly.com/ website


Mediation is a real word though?


That is true, but the typo is because they presumably meant to reference this: https://en.wikipedia.org/wiki/Guru_Meditation


...and naturally someone has already updated the page today to include (and highlight) its use and misspelling on Varnish.


Someone (I can't unfortunately due to IP block) needs to change that. The part about the spelling is false, apparently [1] it's an intentional change by Fastly so that they can tell if it's their own Varnish or a customer's Varnish that is throwing an error.

[1] https://news.ycombinator.com/item?id=27433139


That wiki references the Varnish version.

I think it's an intentional alteration of the original, given the context.


That edit was added after today's fastly outage began.


Even so, mediation makes sense in this context given fastly's business model.

It could be a typo or an attempt to be clever.


This seems like it's intentional given the context.


I don't see the typo? Definitely a Fastly problem though...

Fastly error: unknown domain: numpy.org.

Details: cache-pdk17841-PDK


I assume "Guru Mediation" is supposed to be "Guru Meditation", a reference to the way AmigaOS used to describe system failures.


Mediation vs Meditation I think.

I think Varnish uses mediation intentionally though, it was this way 7 years ago when I last used Varnish.



According to the Wikipedia page, it is intentional:

https://en.wikipedia.org/wiki/Guru_Meditation

Or did one of you already edit the Wikipedia page to reflect this discussion on hn?


It appears the mention of "Mediation" was made very recently, likely in response to what's currently going on.

https://en.wikipedia.org/w/index.php?title=Guru_Meditation&d...

And it seems to be incorrect, since this "spelling variation" is only used by Fastly and not part of Varnish?...


The typo is "Mediation" instead of "Meditation", I think.


I assumed it was an intentional pun.


also https://www.reddit.com (at least in Netherlands)

edit: 12:05 up again for me, no images or custom fonts loading though ... and down again 1 minute later

edit: 13:01 reliably up again for me


Down in SA


Down in UK


Up for me but no images mate.


Down in US. Also Imgur, which is closely related


Down in india


Same here in Germany: imgur and reddit are down, plus a bunch of other sites.


Same in France


> potential impact to performance

So it is a "performance" issue when all pages give a 503.


Does the 503 page load fast(ly) or slowly?


I wonder why Amazon is not using Cloudfront for their own website.


Cloudfront, by Amazon's own admission, specialises in high bandwidth delivery (ie huge videos). Fastly has consistently better performance as a small object cache, which makes it the choice for web assets

https://www.streamingmediablog.com/2020/05/fastly-amazon-hom...


Fastly gives them the edge performance they need without having to build it themselves. They have been a customer for a while I think.


But they have competing products through AWS.


I imagine it works well for the whole business that they allow product teams to use the best cloud tools for the job rather than requiring them to use AWS for everything. If AWS is forced to compete even for Amazon.com's custom, that should make the whole company more resilient to long term technical stagnation.


AWS Route53 and Cloudfront are direct competitors to Fastly.


That's how good Fastly is. Outside of this issue it's a great service.


Yeah, this is what makes me feel this is more an AWS thing


The m.media-amazon.com domain (and a few other CDN'd domains that they use) are running through Fastly:

    nslookup m.media-amazon.com
    
    Name:  media.amazon.map.fastly.net

It is very interesting that they are not using CloudFront!


Really, m.media-amazon.com seems to have a very short TTL (showing 37 seconds right now) and has been weighted to CloudFront now.

Amazon is also known to use Akamai. Sure, Amazon relies heavily on AWS, but why should it surprise anyone that a retail website obsessed with instant page loads decides to use non-AWS CDNs if the performance is better?

Even if CloudFront became the default, I'm certain amazon.com would keep contracts with fastly and akamai just so they can weight traffic away from CloudFront in an outage.


Good to have 3rd party redundancy, time to fail over to something else now I'd think though.


They already have:

  $ host m.media-amazon.com
  m.media-amazon.com is an alias for c.media-amazon.com.
  c.media-amazon.com has address 99.86.119.84
(which is a Cloudfront IP)


Yep, they did exactly this and are now running on CloudFront.


Why?


looks like amazon.com started using fastly in May 2020 (https://www.streamingmediablog.com/2020/05/fastly-amazon-hom...) so it's not an AWS thing


AWS is reporting no issues across the regions:

https://status.aws.amazon.com


AWS almost never reports issues on this page.


AWS don't report outages until it's undoubtedly them.


Fastly deploy their own hardware (that's one of their selling points); I don't think they rely much on AWS, maybe just for network interconnection?


Fastly doesn't run on AWS.


It sure looks like an AWS error; even Amazon.com is mostly down.


I wonder why amazon.co.uk uses Fastly and not CloudFront?


I imagine they use a few different CDNs for things like this.


That doesn't appear to be the case does it? Amazon sites are all working fine, at least for me


Their CSS and JS were down for a few minutes. I was able to log in to Amazon, but the entire site was in Times New Roman; it was fixed a few minutes later.


Must be more than fastly. Heroku is also down.


That's also because of fastly, I've got this response from the Heroku dashboard:

Fastly error: unknown domain: dashboard.heroku.com.


Good thing we use Cloudfront and Cloudflare where I work.

> Statuspage Automation updated third-party component Spreedly Core from Operational to Major Outage.

> Statuspage Automation updated third-party component Filestack API from Operational to Degraded Performance.

Oh, right. :-D

Don't get me wrong, I love the proliferation of APIs and easily-integrated services over the past 20 years. We're all one interdependent family, for better and for worse.


CSS/Javascript at https://github.com/ appears to be down as well, making GitHub quite unusable.


GitHub Pages appears to be down too, taking an awful lot of sites offline


Github is working fine for me in Canada but others aren't. Tried without browser cache too and it works okay.

EDIT: Most sites seem fixed now here in Canada. Tested stackoverflow, reddit, GitHub, PayPal, gov.UK and all worked fine.


Yeah, things are mostly back now. Including my personal GitHub Pages site :)


Yikes seeing just a "connection failure" on Paypal is something else.

edit: PayPal looks to be back up, at least in US East, but when I turn off my VPN and access from Asia I get "Fastly error: unknown domain: www.paypal.com."

Now I'm seeing a 503


> Monitoring The issue has been identified and a fix has been applied. Customers may experience increased origin load as global services return. Posted 4 minutes ago. Jun 08, 2021 - 10:57 UTC

Looks to be working again my end.


Interestingly, Twitter only has its emoji SVGs down.


And this is (one reason) why using images instead of actual emojis is such a stupid idea. Why, Twitter, WHY?


err, to make representations platform independent?


That sounds antithetical to the purpose of emojis.


Vendors don't even agree on whether the :gun: is a revolver, an automatic, a space ray gun, or even a water gun; btw, it's a 1911 in the original DoCoMo emoji set [1]

1: https://blog.emojipedia.org/content/images/2018/04/microsoft...


Sure, that's a benefit of emojis being semantic. If you want 'SFW' emojis, you can get them. Converting them to images makes that impossible. And uses vastly more bandwidth, makes them impossible to copy+paste, probably has accessibility issues, etc.


Same reason why Gmail uses their own emojis rather than the system ones — (as said above) branding. When you send a tweet, Twitter wants it to look identical across all devices. The classic native UI vs cross-platform UI debate in a nutshell.


Cool, so instead of actually serving text, they could also just serve up little SVGs for each letter. Because god forbid the recipient chooses a different font than Gmail!


Which is why Slack style :tofu_on_fire: emoji notation is genius


You're pretty much defining webfonts!


Indeed. Another abomination.


That's not a minor UI issue.

Twitter is a medium between people. Removing emoji representation differences across user devices is a way to hopefully reduce misunderstandings between users.


Branding! (Fun fact: Hacker News strips emoji.)


How does it strip it? Test:

Edit: You are right. It got rid of the emoji after Test.


https://deb.debian.org is down too which borked my installation.



The mirrors still work though, and cabal will just fall back to those


https://www.bbc.com/news/technology-57399628

"A number of leading media websites are currently not working, including the Guardian, Financial Times, Independent and the New York Times."


Not that the BBC are gloating that they're still up


The BBC.com site was down for about 10-15 minutes.



What's far worse than half of the internet being down was that Hacker News also had problems. If I waited long enough on a comments page I got an error message. I don't quite understand what happened there. The communication between my system and HN must have been working otherwise I would never have gotten an error message, so it must have been some internal HN problem. But since HN should only need its own internal "database" to generate comment pages, I don't understand why it should be impacted by the Fastly problems.


I could not tell from the Fastly status page. What caused the fault? Could anyone point to any past stories of a similar nature, other than DDoS?



Bitwarden is also down (the Web Vault, not the website).


I will never understand the meaning of putting CDN behind CDN.


What makes sense in the world is what puts bacon on the table, not what actually makes sense.


Yo dawg, I heard you like CDNs...


My self hosted bitwarden is fine, as are all my self hosted sites.


Seems to affect Target ( https://www.target.com/ ) and Reddit ( https://www.reddit.com/ ) as well.



PayPal seems to be working for me at the moment. Rest of the sites are 503s.


Centralising everything™ and the whole internet goes down because of that.


and yet you're able to leave this comment.


Because HN and those who rely on less, or who use backup services, are smart; those who got caught now have to panic and wait.

Probably going to short the hell out of $FSLY.


Over one issue that highlights they have an abundance of top level customers? Interesting strategy when it's already at a low.


One issue that should have been mitigated at least by Fastly; worse if the client has to do it.

They proudly stated this from their own website to their customers:

> "Fastly’s network has built-in redundancies and automatic failover routing to ensure optimal performance and uptime."

If that isn't one huge lie, I don't know what is.


Please don't call it a lie. It means that they knowingly presented something they knew to be false as the truth. So far I have seen no evidence to support that.


It is definitely a lie, but it's the same lie sold by all cloud offerings. Can you name a single cloud/CDN operator without downtimes?

It's normal to have downtimes but they are usually scheduled and quick (think <10 minutes per month for rebooting and/or hardware parts replacement). I'm pretty sure most non-profit hosts like disroot.org or globenet.org have similar or better 9's than all these fancy cloud services.


It can have all these things and still fail, suggesting otherwise would be fairly naive.


if by "everything" you mean one thing, and by "centralize" you mean not centralized, then sure.


How is having a large chunk of the internet using the same CDN provider not "centralizing"? It's not a hard monopoly obviously but still it meets the definition of centralization.


how is private companies choosing to use a common supplier in a competitive market centralization? monopolies are not centralization either. you need to read a better book.


How is a market competitive when there's a quasi-monopoly on infrastructure? When public money is used to irrigate the same corporations with huge $$$, while non-profit network operators are left to rot?


It's centralization because they all use the same provider. Why do you care about incentives here? The result is the same, just like capitalism and free markets tend toward monopolies in the long run.



For what it's worth, I'm also having these problems with cnn.com, reddit, and many others; however, when I switch away from WiFi to my cell provider's network, they work fine.


Paypal back, off fastly


Why don't other sites bypass the CDN and go direct?


If you aren't prepared to do CDN changes on a whim when something like this happens, it's often better to wait for the problem to be resolved instead of making things worse for yourself due to misconfigurations, revealing your origin IPs, etc.

Can always improve the process for the next outage.


For sure; as in other industries, changes come after big troubles like this. But it would be interesting to hear about how they (PayPal) dealt with it.


Also, it takes time for DNS changes to propagate (some people hate this word, but it actually applies here).


You need big infra and crack teams of ops people, which PayPal can't afford not to have.


Is there anything these big sites could do in this situation, or must they choose between running and maintaining all of their own infra and relying on a single CDN?


If you have absolutely vanilla CDN requirements, you can run multiple CDNs and fail-over or load balance between them using DNS.

Quite a few Fastly customers have more than vanilla requirements though, and may have a lot of business logic performed within the CDN itself. That Fastly is "just Varnish" and you can perform powerful traffic manipulation is one of its main selling points.
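
As a rough sketch of the DNS-based approach (the hostnames and health path are placeholders, and the final CNAME update is left to whatever API your DNS provider offers; many providers can also run the health checks for you):

    import urllib.request

    # CDN-specific hostnames that both serve the same site (placeholders)
    CDNS = ["site.cdn-a.example.net", "site.cdn-b.example.net"]

    def healthy(host, timeout=3):
        try:
            resp = urllib.request.urlopen(f"https://{host}/healthz", timeout=timeout)
            return resp.status < 500
        except OSError:
            return False  # connection error, timeout, or HTTP error status

    def pick_cdn():
        for host in CDNS:  # prefer the first healthy CDN, in priority order
            if healthy(host):
                return host
        return None  # both down: DNS can't save you now

    target = pick_cdn()
    if target:
        print(f"point the www CNAME at {target}")  # here you'd call your DNS provider's API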


I suppose it’s still a bad experience for the user if some % of attempts to connect fail or if some % of scripts/styles/images fail to load. So I think that means dns information about failures needs to somehow be propagated quickly. Not sure how well that works in practice.


Use two CDNs and DNS providers for redundancy. Gets expensive, but at scale, probably doesn't make a huge difference. More complexity for the site operators to manage, however.


Spotify is behaving strangely as well https://www.spotify.com/


Quora and reddit too


All of these work from here in Grenoble, France...


That's the problem with these black-box cloud offerings, that you can never know what will work (or not) and from where. You get semi-random, pseudo-localized outages that are not accounted for in all the 9's of availability.

With a standard TCP/UDP session, it mostly just works or doesn't and you can get a proper traceroute to know what's up. With these fancy CDNs, there's a whole new can of worms to deal with and from a client's perspective you have no clue what's happening because it's all taking place in their private network space where we have no "looking glass".

Fuck the cloud, i want real Internet.


"Gets blasted by a DDOS and is no longer on the internet"


Same here in central Poland (Łódź area), no problem with any of linked websites.

edit: My whole Twitter timeline is full of posts saying "Twitter outage? what outage?". Same on Reddit and Twitch chat, feels like for a short time I was invited into some exclusive circle lmao. StackOverflow and other StackExchange sites also work so I can look stuff up for you.


Interesting. Here in the Netherlands they don't.


Germany here (n=1), everything works except reddit and ft.com


Same from Germany, all of these seem up except for reddit and ft. Maybe we got lucky with our edge node...


Not in East Germany :D


What about banana.ch?


depends where in France, most people i know here are affected as well


yeah grenoble, updated.


Not from Paris, France.


Not for me



https://www.theverge.com/ seems to be down too



Is looking at those links like staring at a road accident instead of just passing by?




Terraform having issues and rubygems down too


That explains the spotty container build failures over the last half hour. Good thing I decided to procrastinate instead of debugging the issue!



Seems to be every site that runs varnish...


Fastly largely runs on Varnish, it seems: https://www.fastly.com/blog/benefits-using-varnish

>At the core of Fastly is Varnish, an open source web accelerator that’s designed for high-performance content delivery. Varnish is the key to being able to accelerate dynamic content, APIs, and logic at the edge.


I think Fastly is the one having problems (they happen to use varnish but I haven't seen anything which says varnish is the root cause) - so all sites using it are down.


Firebase hosting has been affected as well



SSO and github are back online now


nature.com


You would think that the UK GOVERNMENT would have their own private CDN or something...


Why?


twitch also, lots of other minor ish websites


Searchable offline backup of stack anyone?


www.gov.uk & bbc are back


elastic.co down as well


developer.spotify.com


reddit down aswell


It's OK though, because large swathes of this discussion seem to have turned HN into reddit, at least temporarily. Normal service will no doubt resume in due course.


twitch.tv Too.


etsy.com too


> [0] https://www.gov.uk/

Just checked, thank god the NHS vaccine site is still available - vaccines just got rolled out for under 30s today.


Edit: I didn’t mean anything negative here! Just slightly shocked that as the UK is opening up under 30 vaccinations, the US is struggling to find any more willing takers. It’s really probably a sign that there’s fewer anti-vaxxers in the UK more than anything. And that universal healthcare is more efficient at distribution than an inherently for profit system. I don’t know, but I just didn’t realize it was so different in the UK


I think this may be because we've had much higher uptake as far as I know, so getting down the age ranges has been slower (by which I mean, yes, maybe the US has made it available to all adults, but how many (as a proportion) have taken it up)


This is awesome. And also exposes how broken the US is with its anti-vaccination trend


I have seen the argument made that one of the reasons for high vaccine confidence in the UK is as a result of Andrew Wakefield's MMR fraud, which was perhaps debunked more effectively in the UK than the US.

https://www.youtube.com/watch?v=8BIcAZxFfrc


US and UK have very similar vaccination rates despite the US being open to more age ranges. This indicates that a higher percentage of eligible people have gotten the vaccine in the UK, and the US has somewhat hit a wall in terms of vaccinations (though there is the concern that the rates will slow down in the UK also).

I must admit, it has been strange seeing my US peers getting the vaccine months before I can in the UK, but I guess I take comfort knowing that both countries are still doing pretty well!


You know which one’s worst? Japan... still reservation based and for 65 and up only!


Both the UK and US are doing well.

https://ig.ft.com/coronavirus-vaccine-tracker for reference.

What's important is to share vaccines with all nations, and non-nations.



Fascinating. So those rates are including only ages 30+, which means that once it’s unrestricted the UK should have a very high vaccination rate while ~15-25% of the US will still remain unvaccinated entirely by choice. Wow. So you’re absolutely right, the UK is in reality far far ahead and the US is completely broken as far as public health is concerned because of willing ignorance.


For one dose. For full vaccination, the US is (slightly) ahead according to that same site.


I think we can agree it's certainly not "far behind"


This is by design though, the gap between the two doses is higher here.


and Imgix


Click the new tab. Lots of posts about sites being down. All flagged.


Yes, because they're all just repeats of this one.


Yeah so it's been mentioned in the comments already, but to everyone in Fastly right now: I feel for you. Something like this must be insanely stressful, and not just during the outage. There will be (should be) a massive post-mortem. People will be losing sleep over this for days, weeks, months.

:(

Edit: There seems to be a major empathy outage in this thread. Disgusted but not surprised, unfortunately.


Meh. Losing sleep sounds like an over-reaction. No system is foolproof. Of course Fastly should do what they can to prevent downtime, but it's still expected that they will go down.

I would blame anyone who claimed otherwise or couldn't deal with it while not having a fallback.


I hear that you're suggesting that those involved shouldn't feel bad because it's systemic / just a job / etc. But the reality is that incidents like this can be very traumatic for those involved, and that's not something they can control. If it were that simple to manage, depression and anxiety would not be a thing.

I think it's best to show a large amount of support and empathy for the individuals having a really bad day today, and for how awful they may feel. Some will probably end up reading this thread (I know I would).

And of course, still hold Fastly the business accountable for their response (but objectively, once we understand what the root cause was, and the long term solution).


I don't see how it's so traumatic for the engineers involved, unless the company culture at Fastly is really awful and there are punitive repercussions, or attempts to pin responsibility on individuals rather than systems, which I doubt.

Many here have been responsible for web service outages albeit on much smaller scales, and in my experience it feels awful while it's happening but you quickly forget about it because so does everyone else.


I guess it very much depends on your personality. I screwed up a not very important project for a client 4 years ago while working at a different company, and I still feel bad when I think about it, despite the fact that my company had my back through the entire process and literally everybody involved has moved on and probably forgotten about it.


When CNN is reporting on the bug you deployed it might have some psychological impact


> on much smaller scales

> you quickly forget about it because so does everyone else

This is definitely not the case here, and the experiences are bound to be very different.


I wanted to show support to the engineers in the sense that I don't think you should encourage a working culture where you have "massive post-mortems" and expect people to feel bad for extended periods of time over simple mistakes. By not making a big deal out of it, you can also support your staff.

But I think our disagreement mainly stems from how we interpreted the parent comment. I thought it was rather double-edged: on one hand claiming to show support, on the other hand emphasizing how big of a catastrophe this was.

I just wanted to say that I think it most likely was a completely natural mistake, only exacerbated by the scale of the company, and that while you should take some action to prevent it in the future, you should not spend so much time dwelling on it. Shit happens, it's fine.


I agree, and I think I picked on your comment a bit because it was the top one.


I think the government websites being down (UK ones for example) are the bigger issue. Reddit/Stackoverflow etc being down isn't that big of a deal imo.


Imagine losing sleep over a corporate problem where you're just the next Joe Engineer, to be fired the second you're not needed. Have some perspective people.


I'm confused, why isn't being fired something to lose sleep over in your eyes?

I get that you're implying that the job itself is not worth that much concern, but it seems you're ignoring that jobs bring in income, pay your mortgage, etc.

If I lost my job tomorrow I'd be terrified.


People rarely get fired for outages. The comment you are replying to is saying that engineers shouldn’t stress out over an outage that only impacts a corporation.

It’s a commentary on work / life balance and the all-too-common phenomenon of employees sacrificing for a company (in this case, feeling such personal stress that they would lose sleep) and contrasting it with the fact that most employers will fire you without a second thought if it’s what’s best for the business (they won’t lose any sleep).

It’s a critique of the asymmetry that often exists and is frequently exploited by companies. This is often seen in statements like, “we are one big family so put in a few more hours for this launch” coupled with announcements like, “profit projections didn’t meet expectations so we are downsizing 5% of the work force.” You are family when they need you to work hard, and an expendable free market agent when your continued employment might risk hitting the quarterly goal.

It is, of course, reasonable to lose sleep if you think your employment is in jeopardy. Very few companies, especially in the competitive SV market are firing engineers because of a single outage, even a bad one, because you just paid a bunch of money to train those engineers how to see this coming and fix it.


Yup, exactly, couldn't have written it better myself :)


I have worked for one of their competitors (I'm not saying which) for quite a while. I've indirectly caused multiple outages that were maybe 1% this bad before, that didn't make the news only due to luck. Code that I owned (but did not write) was once a key cause of a severe outage that did make the news, and it would have been worse if I weren't coincidentally halfway through replacing that code with something more modern. I also had to do some very rapid work on internal failsafes around the time of the infamous Mirai botnet, to minimize service degradation in case it was pointed at us.

It sucks. Working on CDN reliability is like working on wastewater management: the public forgets you exist until something breaks, when they start asking why you weren't doing your job. Fortunately, internal people at least seem to get it -- I hope this is the same as Fastly.


They shouldn't lose sleep over it, though.


Everyone's got responsibilities and aspirations. To be fair, I was thinking more of the jobbing engineer who's going to face anxiety about losing their job over this, but it extends to all levels. Having a fat bank balance helps get through periods without employment, but it's not just about money. There's anxiety, shame, embarrassment, the whole gamut. Going through a major incident at work is a shitty experience.


Well, not much, I mean all our competitors are also using Fastly. I would be more worried if we were the only one using Fastly and everybody else was fine. But as we are all in the same boat, we lose the same :-)


He's talking about Fastly themselves, not their customers


Arf, thanks for pointing at it, I misread. Sorry.


Empathy is hard to find around here; maybe someone needs to study it. Is it a feature of people in tech? I don't remember much of it back on Slashdot either.


Just wait till they figure out how to make money off it.


#HugOps


I feel for the Fastly workers, whom managers are probably currently harassing to get things back online. I certainly don't feel any sympathy for the Fastly administrators/managers who make a business out of exploiting other people.


Call me old fashioned but the latest trend of showing "empathy" for a serious incident, then proceeding to dance around the aftermath of it, whilst people give themselves a pat on back in a retro/post-mortem, isn't the way to do it.

People need to be blamed, and responsibility for actions taken (without covering asses)


The point isn't to dance around the incident, but to not blame people. You can blame systems, design, engineering culture, processes, but don't blame people. Even if someone accidentally pressed the 'destroy prod' button, that's not the fault of that person, it's the fault of that button existing and being accessible in the first place.

I have no empathy for Fastly-the-company. I hate the fact that the Internet is centralized around CDNs. I wish this idea of 'but we _must_ run a CDN for our 1QPM blog!' would die in a fire. But I can still empathize with the Fastly engineers handling this shitstorm right now.


I disagree. People implemented those systems, so if you are correct that it is the systems fault, then it is also a persons fault.

People must be held accountable to have good incentives to reduce such outages in the future.

I do agree though that we should always be compassionate and realistic with other humans.


> I disagree. People implemented those systems, so if you are correct that it is the systems fault, then it is also a persons fault.

How do you make sure that mistakes don't happen, then? Do you blame and fire people who make mistakes, and hope that the next person put in the same spot doesn't make a mistake? Or do you figure out what caused that person to make the mistake and ensure there are processes in place so that next time this is less likely to happen?

Extrinsic motivators like 'we will give you a bonus' or 'we will fire you' are surprisingly bad at getting people to not fuck things up.


I see: When I said we need to hold people accountable, you may have heard that we need to fire people. That was honestly never on my mind.

Maybe it's a cultural thing. I hear about a lot of firing in the US. I am from Europe.


This sort of culture worked at Netflix. Did they go down today?


Let's hope you don't ever go into management. You clearly have no idea how to motivate and retain people or have any insight on how hard it is to hire good people to begin with. And no, I'm pretty certain this is not how Netflix's culture is.


> pretty certain this is not how Netflix's culture is.

> pReTtY CeRtAiN

This wording, in and of itself, shows you have absolutely no clue whatsoever about Netflix's culture.


Riiiight... Anyways, you kept complaining of being downvoted, here's a clue: you're being an ass and no one likes you or what you have to say because you're wrong. So go scurry back to reddit where you belong troll...


> you're being an ass and no one likes you or what you have to say because you're wrong. So go scurry back to reddit where you belong troll...

Okay? Some proof, please? This is not far off from a baseless character attack, which isn't really effective when trying to convince me of your point about knowing Netflix's culture.

If you really want a proper answer, the truth is, unfortunately for you I am in management (previously was an engineer) and have always known Netflix to have a stellar performance oriented (and fear driven) culture, their playbook operates like a sports team. Not for everyone, but that's the point and it works for them.

Maybe you should look inward if you're so vexed with me as to call me silly names; perhaps you can't handle the truth about why some companies like Netflix adopt this kind of culture.

Peace.


Proof? All the downvotes you got and why your comments are barely visible and all the crying you did in your comments about getting downvoted.


You think downvotes and character attacks present as a good argument? Doesn't count as proof IMO if there isn't a valid argument presented, you're going to have to do a lot better than that.

And back to the main point: so I assume you agree that Netflix did go completely down the other day then, right? It seems, according to you, that you know Netflix's management culture better.

> I'm pretty certain this is not how Netflix's culture is.

Would you be willing to share your expert insight of this if you know better then?


I'm not arguing Netflix, its mostly your attitude towards management and engineering culture. Basically your reply to the user "q3k". "Extrinsic motivators like 'we will give you a bonus' or 'we will fire you' are surprisingly bad at getting people to not fuck things up". You don't fire people just because they made a mistake. You find out what caused it, how to prevent it in the future, and you move on. That's what blameless post-mortems are about. No one is perfect and if you really are a manager that expects perfection, you really just suck as a person.

But now getting back to Netflix, they have post-mortems and they don't fire people willy-nilly over mistakes. Sure it's not hugops (a term I don't care for either), but they don't just up and fire people over a mistake. I never said anything about netflix going up or down on that day, but they also have problems just like everyone else. Their SLA is not 100% uptime and neither is Fastly.

In closing, you are being a pedantic little bitch who wants to argue minutia and I'm done with your trolling. I'm done responding to you, feel free to have the last reply as I really don't care anymore.


That's a sure fire way to get a CYA culture, and it's a reason why the most successful tech firms don't do it.


v1. "It's Bob's fault and so we fired Bob."

v2. "The issue was caused by a previously unidentified pathway that caused a feedback loop and overloaded our servers in a cascading fashion (or whatever). We have implemented a fix for this and updated our testing and deployment processes to stop similar cascades."

Which solves the problem long term?

As an architect making product choices, v2 wins every time.

(With the caveat that if the cause was something that reveals a fundamental problem with the larger processes/professionalism/culture of the company, especially to do with security concerns, then I'm not buying that product and migrating away if we already use it.)


If an employee does something actively malicious, you should absolutely remove them. This is very rare though - incompetence / broken systems is much more likely.

Otherwise you develop internal process that's entirely scar tissue, and only stops your teams doing their jobs.


I feel it is somewhat obvious and goes without saying that malicious action results in personal responsibility & repercussions. However I don't have any evidence or past experience that malicious action by an internal employee is a likely scenario for most outages. It may well occur but most examples I've heard of seem apocryphal.

The scar tissue: this is where good choices come in because it's certainly not a rule that a change as a result of an incident review is an impediment to work. These definitely occur, and sometimes linger after the root cause is phased out. But best practices often reduce cognitive & process overheads.

A rough example is that there are still people out there FTPing code to servers, having to manually select which files from a directory to upload. Replacing this error prone process with a deployment pipeline leads to a massive reduction in the likelihood of errors and will actually speed up the deployment process. It's all about making the right choices, not knee-jerk protections, and sometimes the choice is to leave things as they are.


As I replied to a sibling comment, I never thought about firing Bob. I think we can assign responsibilities without being mean or denigrating someone.

I am critizing myself all the time for stuff. No hurt feelings there.


> People must be held accountable to have good incentives to reduce such outtages in the future.

Holding specific people "accountable" for outages doesn't incentivize reducing outages; it incentivizes not getting caught for having caused the outage.

As a result, post-mortems turn into finger-pointing games instead of finding and resolving the root cause of the issue, which costs the company more money in the long run when a political scapegoat is found but the actual bug in the code is not.


Loss of trust in a service provider and the afterwards loss of business is quite an incentive. Having someone drawn and quartered just provides an incentive to scapegoat.


> don't blame people

I feel like this requires some nuance.

Don't blame an IC for introducing a bug or misconfiguration that led to the outage.

Do consider blaming (and firing!) management if, during the postmortem, it turns out that it was in the way of fixing systemic problems.

Ultimately, rule #1 should be: don't blame somebody unless malice or gross negligence is proven. Rule #2 should be the assumption that ICs will not have done either. Rule #3 is that sometimes, individual responsibility is required.


Blame culture isn't the way forward here.

Do a post-mortem, work out root causes, work as a unit to ensure this doesn't happen again.

Obviously if there are levels of gross negligence or misconduct discovered during post-mortem, that will need to be dealt with accordingly, but coming into this with an attitude of "we must find someone to blame and incur repercussions" isn't healthy at all.

We are humans - don't forget that.

edit: forgot some words.


> Do a post-mortem, work out root causes, work as a unit to ensure this doesn't happen again.

And if this happens again? They advertised they had failover and mitigations for this in the RAREST of cases:

> Notices will be posted here when we re-route traffic, upgrade hardware, or in the extremely rare case our network isn’t serving traffic. - status.fastly.com

The extremely rare case happened for an hour, which is a very long time in internet time.


Edit: So the truth is also getting flagged here. Unbelievable.


I think what you said is exactly why people have different opinions on this topic: what counts as "gross negligence" and what doesn't? Different people draw lines at different places.


There's, to me, no obvious clear cut line. But here are some indicators that make me consider someone was being grossly negligent and/or even malicious:

- ignoring warnings

- acting against known-to-them best practices

- repeating a previous mistake

But, again, these are just indicators, not a checklist.

Interestingly, any of these can happen also due to stress, burnout and generally broken company/team culture. Including something like a CYA culture where if they don't do something fast, they will be blamed for it, and thus they need to move fast and break things.


The problem is a blame culture ensures the near-misses are never reported. Air safety discovered this many years back - a no-blame culture ensures anything safety-related can be reported without fear of repercussions. This allows you to discover near misses due to human error and ensure that the overall system gains resilience over time. If you blame people for mistakes, they cover the non-obvious ones up, and so you cannot protect against similar ones in future, so your reliability/safety ends up much lower in the long run. It's all about evolving a system that is resilient to human error - we will make mistakes, but the system overall should catch them before they become catastrophies. In air travel now, the remaining errors almost never have a single simple cause, except in airlines/countries that don't have an effective safety reporting culture.


I recommend reading about "blameless postmortems" [1]. Our natural tendency is to look for who is responsible for an incident and point the finger of blame. Over time this leads to a cover-your-ass culture, whether you like it or not. Therefore such a tendency needs to be actively fought against to keep the focus on quality engineering and not politics.

"An atmosphere of blame risks creating a culture in which incidents and issues are swept under the rug, leading to greater risk for the organization."

[1] https://sre.google/sre-book/postmortem-culture/


I'm sure you've never made a mistake.

The best way (in a team) to tackle mistakes is to ensure the process in place corrects those mistakes. The only way to do that is a post-mortem/learning from the mistake. If you blame it on some engineer who did it, that guy will eventually be replaced by some other guy, who may make the same mistake.


You also need to be proactive about other possible failure modes. Avoiding a culture of blame may or may not help. There needs to be a strong incentive for the organization to expend the resources to do so, and a mere "oops my bad" doesn't provide that without SLAs with teeth.


We need to learn from our own mistakes, and from others', or else we keep repeating them. Nothing "old fashioned" about that.

And we, especially companies, typically only learn if there is something at stake. Stock-price, a job, customers, liability etc.

(Call me old fashioned, but what I learned from it, having no stake in the game, is we are truly demolishing the resilient, decentralised nature of the internet; or already have done so)


I don't agree about the blame, but I do also find the empathy cringeworthy. Something's broken; someone's job is to fix it; they'll fix it; it will work again. /shrug/

Post-mortems make far more interesting submissions IMO, but I suppose people up-vote 'yes down for me too'.


the attitude that "people need to be blamed" will never improve reliability in the long run. people come and go; systems and processes endure. blaming people is the best way to avoid making durable improvements to systems and processes.


Doctors that make too many mistakes resulting in too high of payouts can't get individual malpractice insurance. Doctors that can't get individual malpractice insurance go to hospitals. Hospitals that hire too many doctors that make too many mistakes can't get hospital level policy. Hospital has to fire those doctors. That's how the system adjusts.

We do not have a system that adjusts to "oops"


I hear you, but I just want to point out that this rarely happens anywhere else. It's great if tech (and people in general) hold themselves to progressively higher standards than what is out there already, but I don't think tech needs to be that much better, I'd settle for just doing a good honest retro (without throwing anyone under the bus, and without covering their asses)

A good leader will take the hit (and the repercussions) for their underlings, compensate customers where compensation can make it better (and offer to make it easy to use fallbacks if this happens again) -- and internally fix the problem so it can't happen again, without throwing anyone to the dogs.


> People need to be blamed, and responsibility for actions taken (without covering asses)

What I think this syntactically invalid sentence is trying to say is:

People need to be blamed, and held responsible for actions taken.

Why do people need to be blamed? Why do we need to make someone the scapegoat? What does being held responsible look like?

Let say we find some sacrificial engineer to pin this on:

* does the downtime magically disappear?

* does the engineer suffering (say losing his job or whatever) make your downtime meaningful? You'll recoup your revenue somehow from it?

* does the fact that there's a scapegoat mean that everyone else at fastly is perfect and it's ok to keep using them?


Scapegoating in those situations happens more often than not. In an operations team all problems are systemic - having to do with decision makers throughout the process, sometimes acting on perverse incentives set up by others. Blame then gets diluted but still tends to fall upon the organization responsible rather than an individual, which is where it should be. Gross negligence is not so cut and dry.


"Call me old-fashioned but..." is a dog-whistle harking back to "better days" that never existed.

Empathy and responsibility are not mutually exclusive.


> People need to be blamed, and responsibility for actions taken (without covering asses)

This. People talk about "HugOps", "empathy" and all that, but a worldwide incident affecting a huge number of time-critical customers (e.g. trading, HFT, cargo, food delivery, etc.) for a whole hour has catastrophic consequences.

I hope the engineers also understand the other side and why we are paying huge sums of cash for their service.


It's empathy towards people managing the incident, not towards the company. It's a sign of solidarity from SRE to SRE, not a sign of solidarity with a company.


Our fathers and mothers put man on the moon… we build shitty software that helps the technocrats sell more junk to the masses.


Well, while engineers are getting paid $100K/yr to post #HugOps, I know someone in HFT and their dashboard uses the Fastly service, so this has had a huge impact on them for sure.

Flag and downvote all you want, you know this is true.


I suspect you'll have trouble convincing a forum of primarily engineers that a high frequency trader is more worthy of sympathy than an engineer. They're both pretty privileged jobs and HFT is not known for having tons of benefits to society


> I suspect you'll have trouble convincing a forum of primarily engineers that a high frequency trader is more worthy of sympathy than an engineer.

Engineers are paid because their companies have customers. It is pure madness that #hugops is the thing. I sincerely hope that Fastly's customers whack it so hard $$-wise that it actually affects the #hugops engineering culture.


> I suspect you'll have trouble convincing a forum of primarily engineers that a high frequency trader is more worthy of sympathy than an engineer.

At least HFT traders don't get paid to spy on their own customers with trackers littered everywhere. I find it very unethical that engineers get paid to even do that sort of thing, and every damn website has these trackers because engineers put them there.

> They're both pretty privileged jobs and HFT is not known for having tons of benefits to society

So HFT firms don't have their own foundations and grants to give to charities and organisations then?


And ignore the pre-agreed SLA targets and compensation for not meeting those targets that's in the contract they signed right? If you're going to say you're losing $X/minute of downtime, then either deal with it, architect around it, or negotiate the necessary SLA and compensation.


It's not me you should be telling this to, though. If you know someone at Fastly, perhaps you should be reminding them of that.

I expect huge clients to be knocking on Fastly's door lining up for answers because of this.


Not my problem. Fastly should work as intended.

The fault is theirs and they have said that they have failover, this worldwide outage caused by them just goes to show you that Fastly does not actually have a failover system in place.

> "Fastly’s network has built-in redundancies and automatic failover routing to ensure optimal performance and uptime." - status.fastly.com

Even their status page was down. Very embarrassing; Fastly did not work as advertised and misled its customers.

Edit: Offended flaggers circling around silencing misled Fastly customers. How pathetic.


> this worldwide outage caused by them just goes to show you that Fastly does not actually have a failover system in place.

I don’t know Fastly at all, but in my experience there’s no such thing as a foolproof failover system that covers all possible scenarios.


Even when they said this was a rare [0] case, they knew this case should be handled, but didn't handle it.

> or in the extremely rare case our network isn’t serving traffic.

Reports also came in that this was a service configuration[1] issue, so not only is there no failover system, there wasn't even any validation automation in place that could have prevented this.

[0] https://status.fastly.com [1] https://twitter.com/fastly/status/1402221348659814411


Systems failing is not evidence of systems not existing.


So why didn't the 'automatic failover' kick in during the outage? Where was it then? I don't see anything about 're-routing traffic' anywhere in the status page [0]

[0] https://status.fastly.com/incidents/vpk0ssybt3bj


We don't know, but the usual scenarios would be "issue impacts failover mechanism too", "failover mechanism overloads other system components leading to cascading failure" or "something causes failover mechanism to think all is fine".


> We don't know...

So, the rarest of cases (our network isn’t serving traffic) just happened right now, and their failover system just took a snooze then, but 'it exists apparently' according to you.

Tell that the huge clients that lost sales because of this, and all you have to say is: "wE DoN'T kNoW..."


> Tell that the huge clients that lost sales because of this, and all you have to say is: "wE DoN'T kNoW..."

Tell these clients that they should've carefully read their contract with Fastly, especially the 'Service Level Agreement' part.


Not the point. They were also told that a failover system would kick in and re-route traffic had there been any issues, but this was nowhere to be seen.

A worldwide outage happened that affected almost all locations and everybody, so the SLA is actually meaningless in this case. Where was the extra redundancy? Where was the failover system? Why were other companies indirectly affected?

As far as I know Fastly's status page was even down during the outage, and the fact that the best answer to this is 'we don't know' tells you everything you need to know. Maybe stop victim blaming this situation and focus on the main culprit.


> Not my problem. Fastly should work as intended.

What's your SLA with them?

Just assuming things will always work because the marketing copy said so is a recipe for disaster. It's hoping that things never go wrong, and when they inevitably do, being caught with your pants down.

Everything fails sometimes. You must know how much your SaaS provider contractually promises, ensure that any SLA breach is something financially acceptable for you, and ensure that you can handle failure time within SLA.


> What's your SLA with them?

Sorry what?

You've just witnessed almost the entire internet break because of a catastrophic cascading outage that affected lots of huge companies, since third party services used and trusted Fastly.

Shopify stores couldn't accept payments on their websites, Coinbase Retail/Pro transactions and trading apps failed to load, and delivery apps stopped loading all of a sudden. These are just a few that this outage has caused, and now you are trying to blame this onto me for not checking their SLA when millions were indirectly affected by this?

Fastly offered a product, their main product which is a CDN which took down lots of websites. I don't care if everything fails sometimes. There are sites that should NOT go down because of this configuration issue which they messed up.


> I don't care if everything fails sometimes

You can say you don't care for reality, but it's not going to help you have better systems.

> There are sites that should NOT go down

Then they surely either engineered their system to not 100% rely on Fastly or negotiated appropriate terms with Fastly (Or decided Fastly going down was an acceptable business risk, which it is for nearly everybody). Everything else would be negligent, and surely nobody would be negligent when operating a site that "should NOT go down"?


> You can say you don't care for reality, but it's not going to help you have better systems.

Nowhere in my sentence did I say this, so quit the strawman argument.

I know a client using a service that has had 100% uptime for the year and also serves huge clients. I don't understand why Fastly can't guarantee, at the very least, a failover system to counteract this, but it clearly didn't work (or even exist).

> (Or decided Fastly going down was an acceptable business risk, which it is for nearly everybody).

Then why did this cascade to almost everybody, even indirectly? Surely their advertised failover system should have prevented this from dragging on, yet it lasted longer than it should have.

I don't think a store, exchange or trading desk not accepting payments from people for an hour is acceptable at all.


> You've just witnessed almost the entire internet break because of a catastrophic cascading outage that affected lots of huge companies, since third party services used and trusted Fastly.

Blame the companies that relied on Fastly being up 100% of the time, even though Fastly explicitly states that they might be down any number of hours, and they will even give you money back for that [1]. If they did offer a 100% SLA, it would probably be out of budget for most users, as that kind of system is prohibitively expensive to run.

Depending on a single CDN like Fastly is building an SPOF into your product. It is no less of a design blunder than whatever Fastly did internally to have an outage. If Shopify lost millions because of a short, simple third-party outage, they have at least as much of a high-priority postmortem to write and issues to address as Fastly.

[1] - https://docs.fastly.com/products/service-availability-sla


The main problem is that they claimed to have a failover system; the mystery is where it was during this outage.

Why didn't it trigger? Where was this system that was supposed to prevent further cascading failures?

> Blame the companies that relied on Fastly

So it's everybody's fault Fastly went down now? That is a new one.


If companyA got affected by this, then either: 1) it's companyA's fault for not having a contingency plan, or 2) it's companyA's accepted risk that this might happen.

We understand you're upset and passionate about this, perhaps now when more information has been published you understand better the circumstances that caused this problem.


https://easydns.com/blog/2020/07/20/turns-out-half-the-inter...

The whole idea of the internet was a distributed network impervious to most attacks.

The reality is that a single failure can knock out 90% of the services people use.


The internet still works, only the websites are returning the wrong response


yeah, the internet is working perfectly. if you want to view 503 errors.


Believe it or not but "the internet" and "the world wide web" are not synonyms.


True. But the vast majority of use goes via "WWW".

For example email - the other big "internet user" - is technically not part of the WWW, but most (? I don't have any stats, just a guess) of our mail clients run on the WWW nonetheless.


I think that's the point the other person was making: The Internet is still fine, regardless of whether or not the content gets delivered.

There are roads (or shall I say tubes?). There are cars and busses on the road. Over time, almost everyone has migrated to just a few bus companies. One of them suffers a complete collapse for a few hours. Yes, this means chaos when it comes to transporting people. But the roads are just fine.

This doesn't mean that the situation is fine and that people aren't affected. But it would be entirely different if the roads had been washed away or something.


BitTorrent was half of all Internet traffic for a while, though it has decreased with the rise of legal and convenient streaming services.


Most of which (unfortunately) run on the WWW.

I'm not sure what the native clients for Netflix and Spotify actually run, but I use their WWW clients mostly. Making most of my internet bits&bytes go over the WWW.


Thank god network people haven't drunk the centralization kool-aid.


It’s the equivalent to JIT manufacturing. Cheaper when everything is going fine, and devastating when it’s not. And then when everything goes down at once there’s not enough advantage to being the only one still up.


about that...


Interestingly, server side rendered pages worked well during the outage. Most of the issues were caused by sites that are relying too much on Javascript.


Yes, my personal project was working fine all the time. Only I couldn't access the Stripe payment system dashboard


And only those websites on some networks. If I connect my phone to my cell network instead of wifi, the problem sites work for me.


There are ten websites left on the internet and they're all hosted by four or so megacorps. Isn't it great?


"I'm old enough to remember when the Internet wasn't a group of five websites, each consisting of screenshots of text from the other four."

-- https://twitter.com/tveastman/status/1069674780826071040

:-(


Most people don't need any of them to continue with their life though!


The Web (World Wide Web) build atop of the Internet, is not impervious.

ps. "The Internet was build to survive attacks" is not true. It's a myth made popular by Robert Cringely in the early 1990s. The Arpanet was simply a protocol for mainframes used by computer scientists to connect. The Internet is relatively resilient against attacks, but that was not the "whole idea". It was not in the design at all.

Bob Taylor: “In February of 1966 I initiated the ARPAnet project. I was Director of ARPA‘s Information Processing Techniques Office (IPTO) from late ‚65 to late ‚69. There were only two people involved in the decision to launch the ARPAnet: my boss, the Director of ARPA Charles Herzfeld, and me. The creation of the ARPAnet was not motivated by considerations of war. The ARPAnet was created to enable folks with common interests to connect with one another through interactive computing even when widely separated by geography”.

Vint Cerf says the same about invention if TCP/IP transport protocol.


BGP (the protocol underpinning the internet) is built entirely for avoiding outages of any size.

Even email has a method baked into the protocol for handling failure.

Fallbacks are good, baking in resiliency is better.


BGP has its problems (that time centurylink blackholed traffic but wouldn't drop their connections, bgp hijacks etc), but it's not centralised in single (or very few) points of failure


User iso1631 talked about attacks, not just outages.

The basic design of BGP is very vulnerable to malicious attacks. Email security is nonexistent.


Why is this a link to the Fastly homepage, where absolutely no information is provided?

This is the page that should be linked:

https://status.fastly.com


Oddly their homepage rendering an error was a more accurate description of the problem than "investigating potential impact to performance with our CDN"


Stuff is down across the web, but the most it says is “degraded performance” and in my area it’s all green even though the sites are still down.


All looks orange now, but "degraded performance" is a cheeky way to describe "everything is on fire".


> Fastly’s network has built-in redundancies and automatic failover routing to ensure optimal performance and uptime. But when a network issue does arise, we think our customers deserve clear, transparent communication so they can maintain trust in our service and our team.

What a joke!


I didn’t see any error whatsoever on their homepage, while now I see “Global CDN Disruption” on their status page.


This is the link you want I think https://status.fastly.com/incidents/vpk0ssybt3bj


Because even their homepage is down intermittently/for some people.


To save everyone else hitting the site as well:

As of 10:44UTC, this status page has just updated to say the issue has been identified and a fix is being implemented.


it is starting to show several Degraded Performance tags


I didn't know so many sites were depending on Fastly. Stack Overflow, GitHub, reddit, .... Even pip is unavailable. My development workflow is completely janked up. It is a bit scary that we are putting too many eggs in one basket.


fastly gives free service to things like pip. It's actually very nice.


Bit pedantic, but it's PyPI that Fastly gives services to, not pip (and PyPI that's down, not pip). The two are only loosely related – pip is a piece of software.


You would think sites like GitHub and key government sites would at least have a fallback at the ready. It's reasonable to use a CDN like Fastly, but having a single point of failure seems silly if you're the BBC or Gov UK. Although it does seem the BBC managed to get back up and running pretty quickly, so perhaps they were prepared for this.


Gov.UK is back up too. They have a mandate from government to be able to provide emergency communications so I expect they did have a backup and have managed to switch over, but just took 30 mins to do so.

Gov.UK is supposed to be a bit like BBC 1 or Radio 1 – in a national emergency they can be taken over to disseminate critical information, like if there was a nuclear attack launched on the UK.


no, it's just that the incident has just been fixed by fastly https://status.fastly.com/incidents/vpk0ssybt3bj


Hackage (Haskell) is down as well: http://hackage.haskell.org


The mirrors still work though, and cabal will just fall back to those


Must... decentralize... internet...


Blame site operators that are single homing and not loadbalancing CDNs


For sites of any complexity with any dynamic content having CDN redundancy is akin to being multi-cloud -- it is not worth the effort.

A lot of dynamic sites use Fastly for its programmatic edge control and near-immediate (~1-4s, typically around 2s) global cache invalidation of any tagged objects with a single call to the tag. That feature alone simplifies backend logic significantly. Making it portable to CDNs that only support regular cache invalidation requires a complicated workflow which significantly increases the cache-bust time, which in turn removes all the advantages of the "treat dynamic content as static and cache-bust on write" approach.
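For anyone who hasn't used it, the tag-based flow looks roughly like this. A minimal sketch in Python; the service ID and token are made up, and the endpoint/header names are written from memory of Fastly's surrogate-key purge docs, so treat them as assumptions rather than gospel:

    import requests

    FASTLY_API = "https://api.fastly.com"
    SERVICE_ID = "SU1Z0isxPaozGVKXdv0eY"  # hypothetical service ID
    API_TOKEN = "your-fastly-api-token"   # hypothetical token with purge scope

    def purge_tag(tag: str) -> None:
        # One call invalidates every edge-cached object whose response carried
        # a matching Surrogate-Key header, e.g. "Surrogate-Key: product-123 listings".
        resp = requests.post(
            f"{FASTLY_API}/service/{SERVICE_ID}/purge/{tag}",
            headers={"Fastly-Key": API_TOKEN},
            timeout=5,
        )
        resp.raise_for_status()

    # After a write to product 123, bust every page/fragment tagged with it:
    purge_tag("product-123")

That single-call-per-tag model is what makes the cache-bust-on-write approach described above practical.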


>> For sites of any complexity with any dynamic content having CDN redundancy is akin to being multi-cloud — it is not worth the effort.

I proposed and led our multi-CDN project at Pinterest for both static and dynamic content and I can tell you, many many times over, it has been well worth the effort. Everybody should do this, if only for contract negotiating leverage.

Cache invalidation is fast enough on all CDNs now for most use cases (yes, including Akamai). But realistically, most sites (Pinterest included) are not using clever cache invalidation for dynamic content because it’s not worth the integration effort (and it’s very difficult to abstract for large 1k+ engineering teams). Most customers are just using DSAs for the L4/L5 benefits (both security and perf). In that case, it’s not complicated to implement multi-cdn.


Here's the status page incident for this.

https://status.fastly.com/incidents/vpk0ssybt3bj


> We're currently investigating potential impact to performance with our CDN services.

Guys, you are offline with a 503 error, this is a little more than "potential impact to performance".


Lowkey status reports are the norm now :)

"some users may experience degraded service" => site completely down for all locations


I fully expect that if I find a "major outage" on Slack's status page that it could only mean the outbreak of nuclear war.


"Some users may experience brief service disruption."


"By the account of them and us being completely vaporized"


I was going to link the appropriate XKCD where organised attackers are panicking as they realise they're dealing with a sysadmin muttering about uptime..

.. but of course XKCD is down too.

e: https://xkcd.com/705/


"Only the last couple seconds of their lives."


Or AWS's typical status of ‘seeing increased error rates on the API’ = us-east-1 is dead


At least that's accurate. "Degraded performance" would imply to me that things are functional, but slow. increased error rates can be anything from "try again" to ":shrug:"


"We're investigating reports of intermittent connectivity issues" => transatlantic cables cut, WWIII imminent


Well to be fair some users were not accessing the site at that time


Yeah, that's my experience as well. I thought it meant "we have no idea what's going on" though.


Yeah, I also wrote a bot that creates a status incident with the lowest-key neutral message when it detects continued healthcheck failures (outside of maintenance); it steps in if an operator hasn't already created an incident. Maybe they're too busy fixing.


Yes, I also thought the header was hilarious:

> CDN Performance Impact


No issues reported for Perth Australia. Strange because reddit, zip pay, fastly itself, and probably a bunch of other sites are down.

It doesn't seem like the status page is automatically updated, or perhaps whatever event or polling it uses is also broken.


I'm experiencing the outage here in Brissy, it's not looking good.


In Adelaide, experiencing the outage as well.


> This incident affects: North America (Ashburn (BWI), Ashburn (DCA)).

How come we are affected by this in the Netherlands?


They've updated it to

>North America (Ashburn (BWI), Ashburn (DCA), Ashburn (IAD)), Europe (Amsterdam (AMS)), and Asia/Pacific (Hong Kong (HKG), Tokyo (TYO), Singapore (QPG)).


MAD affected, not on the list. I assume it is all locations.


I've seen errors return to me referencing a LON (London I assume) server - Details: cache-lon4238-LON for example


Auckland (AKL) is affected but not on the list.


Seems like they are still taking stock of exactly what is broken.

It has now been updated to a pretty sizable list.

edit: And now it looks like it includes every location.


Currently only listing a small issue in NA


Amazon.com was completely broken here (Europe) and now they're back. I was watching where the assets were loaded from, and they switched from EU to NA as a failover. Homework well done.


I was surprised to learn Amazon don't use their own CDN


They used to use AWS CloudFront and switched to Fastly, someone shared this in another comment:

https://www.streamingmediablog.com/2020/05/fastly-amazon-hom... (CDN Fastly Wins Content Delivery Business For Amazon.com and IMDB Websites)

Quoting:

> "But with small object delivery, like images loading fast on Amazon’s home page, it’s the opposite. Customers will pay for a better level of performance and in this case, Fastly clearly outperformed Amazon’s own CDN CloudFront. This isn’t too surprising since CloudFront’s strength isn’t web performance, or even live streaming, but rather on-demand delivery of video and downloads."


Amazon (like a lot of others) use several CDNs for redundancy. You can see from dig that it resolves to combinations of cloudfront, akamai, and (presumably, based on your reported experience) fastly.

  dig +short www.amazon.com
  tp.47cf2c8c9-frontier.amazon.com.
  d3ag4hukkh62yn.cloudfront.net.
  65.8.70.16

  dig +short www.amazon.co.uk
  tp.bfbdc3ca1-frontier.amazon.co.uk.
  dmv2chczz9u6u.cloudfront.net.
  13.224.0.89

  dig +short www.amazon.in
  tp.c95e7e602-frontier.amazon.in.
  d1elgm1ww0d6wo.cloudfront.net.
  13.224.9.30

  dig +short www.amazon.co.jp
  tp.4d5ad1d2b-frontier.amazon.co.jp.
  www.amazon.co.jp.edgekey.net.
  e15312.a.akamaiedge.net.
  104.71.134.162


Still getting broken assets from the UK.


You're right, I should've said *partially* back. At least the CSS now loads, but a few product images are still gone. However it was completely broken here before (literally loading just the main HTML).


basically the internet is down

reddit, stackoverflow, github, paypal, pypi, twitter, twitch, NYT, CNN, BBC, the Guardian...

edit: wow, even Amazon.com relies on Fastly for some of its edge caches!


https://www.washingtonpost.com/technology/2020/04/06/your-in...

“This basic architecture is 50 years old, and everyone is online,” Cerf noted in a video interview over Google Hangouts, with a mix of triumph and wonder in his voice. “And the thing is not collapsing.”

The Internet, born as a Pentagon project during the chillier years of the Cold War, has taken such a central role in 21st Century civilian society, culture and business that few pause any longer to appreciate its wonders — except perhaps, as in the past few weeks, when it becomes even more central to our lives.


Opened my browser, and my three major web pages: github, gitlab.gnome.org and old.reddit.com... They're all down.


Unless you're browsing reddit without logging in, you can just set the old reddit theme from your account settings so you don't need to use the old. prefix :)


They reset the setting regularly, just to piss off people who only want the old frames.


And if you're browsing on mobile, you need to request a desktop website, otherwise it switches to the new version anyway. Took me so long to figure out, so many annoying attempts to replace www with old in safari, and losing the selection after misclicking.


> stackoverflow

How will they troubleshoot the error messages now?


gasp you're absolutely right...!


BBC is still up at least in the UK


Seems to be mixed for me, BBC News and Sport works but stuff like Weather, iPlayer (video streaming) and Sounds (audio streaming) have died. I guess the BBC is big enough that different bits of the site run off different solutions (perhaps news and sport are still in spirit running off "news.bbc.co.uk" instead of the main servers?).


Not here (although won't be long)

dig bbc.co.uk

  bbc.co.uk.  193 IN A 151.101.64.81
  bbc.co.uk.  193 IN A 151.101.128.81
  bbc.co.uk.  193 IN A 151.101.192.81
  bbc.co.uk.  193 IN A 151.101.0.81


it's down


debian's main apt repo mirror affected as well


This has got to be even bigger than when cloudflare went offline, in terms of big companies affected. Clearly they have way more F500 customers than CF.

Good luck to the on call engineers!


The funny part is that it isn't uncommon for sites to depend on both cloudflare and fastly in one way or another, due to buying services from saas companies that also depend on them.


This outage made me realize that github is served over a single IP address (A record) for my point of origin (India). Stackoverflow has 4 A records listed, but all of these belong to fastly.

The internet is designed for redundancy, so I wonder why these companies don't have a failover network. Makes me wonder if cost is a factor, considering their already massive infra. But a single point of failure ... <confused>


> The internet is designed for redundancy. Wonder why these companies don't have a fail over network. Makes me wonder if cost is factor considering their already massive infra. But a single point of failure ..

Well, the Internet was indeed designed for redundancy, and it worked as intended. At no point in time did it fail to make you reach the server it was supposed to make you talk to.

What are failing are all the application protocols that are running on top of the network.


Github's DNS likely will serve up a different IP for github when there is an outage. I can't talk about the details but GitHub and the rest of Microsoft use a global load balancing system that works through DNS.


Would be interesting to know what these failover patterns are. As DNS takes a while to propagate, I thought DNS records already indicated failover addresses.


I think only MX records indicate any priority for each additional record returned. For A records there's no indication of which records have priority over others, and the usual behavior of authoritative DNS servers is to rotate the order in which records for the same name are returned, so effectively returning more than one record for the same question results in a distribution of requests across the returned IPs rather than any sort of failover behavior.

In the case of the software Microsoft uses, it monitors endpoints for the websites in question and then changes which IP(s) are returned based on the availability of those endpoints, the geographic region and other factors.
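As a toy illustration of that last part (made-up IPs, and nothing like the real implementation of any commercial traffic manager), the authoritative side boils down to health-checking the candidate endpoints and only answering with the ones that look alive:

    import socket

    # Candidate IPs for the hostname, in no particular order
    # (plain A records carry no priority, unlike MX records).
    CANDIDATES = ["192.0.2.10", "192.0.2.20", "198.51.100.30"]

    def healthy(ip: str, port: int = 443, timeout: float = 1.0) -> bool:
        # Crude health check: can we open a TCP connection at all?
        try:
            with socket.create_connection((ip, port), timeout=timeout):
                return True
        except OSError:
            return False

    def answers_for_query() -> list:
        # The authoritative server answers with only the healthy IPs, using a
        # short TTL, so resolvers stop handing out dead endpoints quickly.
        up = [ip for ip in CANDIDATES if healthy(ip)]
        return up or CANDIDATES  # if everything looks down, fail open

    print(answers_for_query())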


Some reliability systems change the routing for the IPs instead of updating the DNS, since BGP changes can propagate faster than DNS caches expire.

Priority for A records would be a nice feature.


Update: The issue has been identified and a fix is being implemented. Posted Jun 08, 2021 - 10:44 UTC

Seems like this is being resolved; curious to see the details afterwards

(from https://status.fastly.com/incidents/vpk0ssybt3bj)


Reddit, Stack Overflow, Spotify, all back for me. Good job Fastly engineers!


Made my alpine linux docker builds fail as well (varnish) - but shouldn’t it use a mirror when the primary download site is gone?

    fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/main/x86_64/APKIN...
    fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/community/x86_64/...
    ERROR: http://dl-cdn.alpinelinux.org/alpine/v3.12/main: temporary error (try again later)


What conclusions can we draw about concentrating web content in a few CDNs?


In HTML/CSS you should be able to specify a fallback source if the first returns a non-200.

Or that companies need to have better DNS strategies.


> In HTML/CSS you should be able to specify a fallback source if the first returns a non-200.

Except if the HTML/CSS is hosted on that CDN?


DNS didn't fail, and there's nothing you can do in HTML/CSS/JS if your CDN fails to serve those things.


Content-centric networking has been a central research topic for many years, and many potentially useful systems have been proposed and implemented.

At some point some of them will start to become popular.


Web Browsers should probably retry a different server in DNS if they get a 503 - but they don't.
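Something like the sketch below is roughly what that would mean on the client side. It's only illustrative: pinning the request to one resolved IP while keeping the right Host/SNI is hand-waved with verify=False here, which a real client would not do.

    import socket
    import requests

    def fetch_with_failover(host: str, path: str = "/"):
        # Resolve all A records and try each one instead of giving up on the first 5xx.
        infos = socket.getaddrinfo(host, 443, family=socket.AF_INET,
                                   proto=socket.IPPROTO_TCP)
        last = None
        for *_, sockaddr in infos:
            ip = sockaddr[0]
            try:
                # Hit this specific IP but keep the original Host header.
                # (Cert validation against a bare IP fails, hence verify=False;
                # a real client would pin the IP at the socket layer instead.)
                last = requests.get(f"https://{ip}{path}",
                                    headers={"Host": host},
                                    timeout=5, verify=False)
                if last.status_code < 500:
                    return last
            except requests.RequestException:
                continue
        return last

    # e.g. fetch_with_failover("www.example.com")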


That sometimes they fail but the world goes on.


We had that experience when Cloudflare was down for some time last year. We have now set up a small static server of our own as a backup, in case this happens again. Although so far we haven't had to use it.


Good marketing for Fastly! I had no idea so much of the internet relied on it...


Shopify's CDN is down.

Which is causing $15+ million in lost product sales for every hour of outage.

Not to mention the loss of any new customers.


StackOverflow and all the StackExchange family of sites are down. I suspect the lost productivity from that will be more costly across the whole economy than potential lost sales via Shopify. People can go back to Shopify, so those transactions aren't definitely blocked forever; any time "lost" due to reference resources being unavailable can't so easily be claimed back.


I don't think you understand how ecommerce works.

A very significant amount of people won't go back. It's why the most effective marketing campaign by far is retargeting those people to convince them to come back. Unfortunately that's not possible in this case since you can't track the users as the site is unusable.


> A very significant amount of people won't go back

So they didn't need what they were about to purchase and saved their money. Doesn't sound like a net loss to me.


> I don't think you understand how ecommerce works ... people won't go back

I was talking about the economy in general, not specific e-commerce sites. People that actually need what they were looking for but don't go back will buy it elsewhere. The money still flows, just somewhere else. And if they don't need the item(s), they'll perhaps use the money for something more useful.


Some sites are on Cloudflare, right? Looks like we have a natural experiment to test this belief!


Makes me wonder how the engineers will fix this if they can’t visit Stack Overflow :)


Believe it or not, but there are developers out there that read the docs.


[citation needed]


Here is a lesson for Shopify's talented staff: don't put all your eggs in the same nest. I'm sure they can build something better than that. Hopefully they will learn from this outage.


Does Shopify do that much when the US is asleep?


Such a huge number of sites. It seems like it's mostly US based sites and Australians are okay. Sending good vibes to whatever poor person is on support right now.


I'm in Australia and there are heaps of sites down for me.


As per report above - most (or all?) of Asia/Pac servers are down.

This incident affects: North America (Ashburn (BWI), Ashburn (DCA), Ashburn (IAD)), Europe (Amsterdam (AMS)), and Asia/Pacific (Hong Kong (HKG), Tokyo (TYO), Singapore (QPG)).


Affects far more than that


Ah, I meant more sites like ABC, 9NOW, SBS, AFL, Foxtel etc rather than accessing US sites from AU.


In Perth, reddit is down. So is Blackboard files for uni


Would be fascinating if Fastly were not able to use GitHub, Travis, Terraform, pip, etc. to deploy their fix.


Interesting thought. I had not thought about this before. If there is a cyclic dependency (not saying there is at the moment) how would things play out? Do you just ssh into your own servers to deploy the fix?


So I'm wondering where exactly, in the "hundreds of servers around the world", things went wrong.

This happened with Cloudflare before too. I think we are a little too dependent on these services.


It is a meaningless premise when you actually have SPoFs baked deep inside the system.


I’d love to see a breakdown of what single point of failure causes these worldwide network outages. They even brag about redundancy in their marketing materials. I hope we see a post mortem on this


In Software Engineering we call it "coupling"

/s


Yeah seriously. Time to rebuild the architecture from the ground up.


Stupid question: why didn't sites "just" fail over to their actual servers to handle the traffic, albeit slowly? I guess they won't be sized to handle the load in a lot of cases, and Fastly was responding, so DNS fail over didn't work?


Probably a different answer for each site. I'm not a DNS expert but I think you're right on both counts. Having failover also requires a duplicate CDN architecture at the fallback location, which is an increase of costs in time, money & maintenance for relatively little benefit. Often there's a fair amount of background integration with a CDN, and each function slightly differently, so it's not simply plug & play.


Yeah, the DNS was up; the problem was the servers weren't able to proxy the traffic. Also, as you say, you'd probably end up bringing down the upstream servers if you just fail open (and I'm not even sure that would be possible with Fastly in the "down" state that we saw).


Perhaps Fastly is simply taking their commitment to reducing CO2 seriously? Three hurrays for the climate!


I gave it about 10 tries, and it seems a very small percentage of transactions do go through.

A decent number of tries is rejected right at the Varnish front door:

    < HTTP/2 503
    < server: Varnish
    < retry-after: 0
    < date: Tue, 08 Jun 2021 10:11:41 GMT
    < x-varnish: 271470009
    < via: 1.1 varnish
    < fastly-debug-path: (D cache-bma1666-BMA 1623147101)
    < fastly-debug-ttl: (M cache-bma1666-BMA - - -)
    < content-length: 450
    <
    Service Unavailable
    Guru Mediation: Details: cache-bma1666-BMA 1623147101 271470009

Many more reach some backend system that just dumps "connection failure":

    < HTTP/2 502
    < content-type: text/plain; charset=utf-8
    < content-length: 18
    <
    connection failure

And a tiny few do get through:

    < HTTP/2 200
    < content-type: text/html; charset=UTF-8
    < cache-control: max-age=0, must-revalidate
    < date: Tue, 08 Jun 2021 10:11:43 GMT
    < via: 1.1 varnish
    < vary: accept-encoding
    < set-cookie: ...snip...
    < server: snooserv
    < content-length: 275036
    <
    <!doctype html><html>...snip...


This is one of the things that excites me about IPFS: in a world of decentralized data storage, yes self-hosting and control over your data is nice and all, but serious resilience to most random infrastructure outages is a much bigger deal.

It's still early days, but I'm hopeful that it can provide a real solution to today's CDN centralization.


Agree, but currently, ipfs would serve as a fallback, since it's about files. Decentralized/distributed generally has slower network performance.

Unless most nodes are high performance, I guess?

Personally I think a distributed database system, where entries are being made redundant in something like a blockchain+dht, would be a good start?

Decentralizing the internet works if it financially makes sense for platforms to build such tools.


> Agree, but currently, ipfs would serve as a fallback, since it's about files.

Isn't a CDN fundamentally all about files too?

> Decentralized/distributed generally has slower network performance. Unless most nodes are high performance, I guess?

There is definitely more work to do here before this is really useful, but it's well within the realm of things that IPFS should be able to do at reasonable performance for production sites in future. Good performance still requires a serious CDN node network similar to traditional CDNs today (to seed your content for day to day use) but with IPFS if that CDN goes down then existing users on your site can _also_ serve the site to other nearby users directly, or other CDNs can serve your site too, etc etc. Your DNS wouldn't be linked to any specific CDN in any way, just to the hash of the content itself, so anybody could serve it.

> Decentralizing the internet works if it financially makes sense for platforms to build such tools.

There's a platform company called Fleek who already do this today: https://fleek.co/hosting/ (no affiliation, and I've never even used the product, just looks cool). Seems to be designed as a Netlify competitor: push code with git and it builds it into static content and then deploys to IPFS.

The benefits don't exist today of course, because no browsers natively support IPFS, so most users can only access the content via an IPFS gateway, which means you're back to fully centralized server infrastructure again... If we can get IPFS support into browsers though then fully decentralized CDN infrastructure for the web is totally possible.
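Even with today's HTTP gateways, the content-addressing part is easy to play with. A rough sketch (the CID below is a placeholder and the gateway list is just two well-known public gateways): because the hash, not the host, identifies the data, any reachable gateway or local node can serve the same bytes.

    import requests

    CID = "bafybe...placeholder"            # placeholder content hash
    GATEWAYS = ["https://ipfs.io", "https://dweb.link"]

    def fetch_by_cid(cid: str) -> bytes:
        # Same CID, any gateway: if one is down, try the next.
        for gw in GATEWAYS:
            try:
                resp = requests.get(f"{gw}/ipfs/{cid}", timeout=10)
                if resp.ok:
                    return resp.content
            except requests.RequestException:
                continue
        raise RuntimeError("no gateway reachable")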


I'm pretty sure you can serve hundreds if not thousands of users from a single Raspberry Pi


I mean, yes, absolutely, and that works to start with, but I'm willing to bet the overall uptime and performance of a Raspberry Pi in your living room is quite a bit worse than Fastly's :-).


isitdownrightnow.com is down


Thanks for the best laughs in a while friend - that's pure irony right there!


I'm having intermittent Reddit issues, as one more data point.

I'm grateful for HN. I rebooted my computer. I thought it was my device and then saw this on my phone while rebooting.


Just occurring to me how CDNs are a major point of failure now for the internet


Amazon being down surely points to something other than Fastly being the cause?


I just had a look at amazon.co.uk and most assets fail to load, the browser debug console is full of 503 errors. Picking one at random, it's fastly:

    $ nslookup images-eu.ssl-images-amazon.com

    Server:  127.0.0.53
    Address: 127.0.0.53#53

    Non-authoritative answer:
    images-eu.ssl-images-amazon.com canonical name = m.media-amazon.com.
    m.media-amazon.com canonical name = media.amazon.map.fastly.net.
    Name: media.amazon.map.fastly.net
    Address: 199.232.177.16
    Name: media.amazon.map.fastly.net
    Address: 2a04:4e42:1d::272



[deleted]


They will use S3, but they need a CDN in front. Surprised they don't use CloudFront - maybe that's what they've failed over to.


Apparently they switched from CloudFront after determining Fastly was faster for this use case. CloudFront is focused on large streaming services, not small HTTP resources.


Yep, seems like:

Reddit, BBC News, Twitch.tv, the Twitter emoji CDN(?)

are all down with 503 service errors


Ah didn't cop that Twitter emoji issue was related! Thought an ad-blocker was stepping up its filters aggressively :)

Stack Overflow, The Guardian, Gov.uk too as some other biggish names getting hit.


Various bits of GitHub on the Web (committing edits, editing releases) were broken for the same reason. Failure modes of JS-heavy GUIs are interesting.


Some people are claiming online that this is a cyber attack. I contract for the UK Gov and I'm hearing reports that traffic is going through the roof right now.

Anyone know if there is any legitimacy to this?


The fastly monitoring/status page says: "Customers may experience increased origin load as global services return". Which sounds like the increased traffic is to be expected.

[1] status.fastly.com


I did not realise Fastly adoption was so widespread. Can anyone more enlightened tell me why, or share some resource on the use cases where Fastly is superior to other CDNs such as Cloudflare?


how will their devs fix it if stackoverflow has gone down?!


This incident affects: Europe (Amsterdam (AMS), Dublin (DUB), Frankfurt (FRA), Frankfurt (HHN), London (LCY)), North America (Ashburn (BWI), Ashburn (DCA), Ashburn (IAD), Ashburn (WDC), Atlanta (FTY), Atlanta (PDK), Boston (BOS), Chicago (ORD), Dallas (DAL), Los Angeles (LAX)), and Asia/Pacific (Hong Kong (HKG), Tokyo (HND), Tokyo (TYO), Singapore (QPG)).


Their status page is now saying every location has degraded performance.


Affecting Auckland (AKL) which is not on the list so I can only assume it's affecting more locations than they're letting on.


+= North Africa (Egypt, Cairo)

Stackoverflow.com, reddit, Quora down. (and probably more, those are just the ones I tested)


This post is suspiciously ranked much lower than it should be (1216 points, 9 hours ago), lower than posts with < 100 points.


Looks like this has taken out Reddit at least.


Is it also hitting Github? I'm not getting any css when loading Github.


Looks like it is. If you're still able to see much of the UI, don't force-reload the page as it'll invalidate the CSS in the cache.

I did that moments ago, and I regret it.


And a large part of GitLab


FWIW, Fastly ~8 hours ago (3am UTC) reported another incident: https://status.fastly.com/incidents/1glxxb8sf2zv and deployed a fix—either the fix made it worse or wasn't sufficient to mitigate the problem.


I think the honorable thing would be for them to have a statement easily findable.

So many companies sweep this sort of thing under the rug if it's only customer data that's been breached. If they can't sweep it, they have a high-priced PR agency do the communicating.

I do not trust companies who handle things this way.


The outage has already been added to the Fastly Wikipedia page


Holy smokes these Wikipedia writers are quick! I'm sometimes impressed by how fast a page on a super recent happening gets populated with all of the currently known details.


My money is on an expired internal certificate or CA.


Fastly has scheduled maintenance to retire some TLS certs next week.


Before the "Error 503 Service Unavailable" messages appeared, there were a few minutes where the error was a single line:

    connection failure
Not sure if that provides anyone here with more insight into what might have caused this!


I got that, then a 'Fastly unknown domain' error (on Reddit), then the 503s on multiple sites (I also had an API I use return a 502 then a 500 error, but I don't know what the full response was as it was just a quickly thrown together script I was using).

Edit: and now "I/O error" on Reddit.


I also saw a glimpse of 'I/O error'. That sounds fun.


It was `connection failure` for me.


Hands up if you're also here after being woken up by downtime alerts on the west coast


Anyone want to talk about half the internet going out because one provider couldn't keep their service up, instead of SO jokes and feels for the engineers? The entire internet is like a house of cards, from the protocols to the economic model.


Wouldn't websites have alternate CDNs managing their traffic? Why should they have a single point of failure?

I was assuming there are a couple of services like Fastly, and that companies might have architected with the alternatives in mind too, I guess.


Normally you configure your A record to point at the CDN, as the CDN is the thing that gives you redundancy (caches all over the world). Hard to have a fallback for that. Running multiple CDNs would be extremely expensive. CDN caches are kept useful by traffic running through them, so it's hard to have a backup for that too.


Because integrating and switching between CDNs can be very complicated and/or costly.

It should be planned for, especially by major tech organizations like reddit, Amazon, etc.

But I won't fault news organizations, who already don't have boatloads of money, for not having failover CDNs.



No mention of outage on https://status.cloud.google.com/, and I wonder why, because apparently this is a GCP problem.


Ah yes, the wonders of centralized internet infrastructure.

Let's use a handful of providers for everything, they said. It will be cheaper, they said. It will be easier to manage, they said.

And it was cheaper, until downtimes began to affect more and more sites when central SPOFs got hit.

And I wonder how much of that need for these centralized SPOFs actually comes from the sheer absurd amount of bloat, ads, code and assets that sites these days "have" to deliver to the customer. I 'member times when pages had 100kb total size, loaded in an instant and were perfectly usable.


Since Fastly’s own website is currently down:

What is fastly? Why are a huge number of web sites dependent on them? They are some kind of web host for companies that don’t want to run their own servers/data centers?


Fastly is a Content Distribution Network (CDN).

Basically the closer the server serving the webpage is to the end user the faster it is for the end user to see and interact with.

But running servers all over the world 1) isn't efficient 2) costs a lot of money.

So a few companies (Fastly, Cloudflare, Akamai) figured, hey, why don't we build a bunch of small data centers all over the world and then provide a distributed way to serve web traffic from them.

It originally was brought about for services like Netflix, but has expanded greatly.

You still host your servers, but a copy of the webpage/media is given to the CDN to serve to customers.


Thanks. That makes sense.

Wouldn’t you build in a failsafe that bypasses Fastly and sends traffic to your own servers in the case of this kind of outage? Or outages are so rare that it’s not worth the trouble?


The number of serious CDN outages in the world are incredibly rare.

In fact, you can probably remember most of them if you were given dates.

Plus, going around the CDN can be very complex (depending on the type of content), very expensive (all of a sudden you have a massive amount of outbound traffic that didn't exist previously), and not guaranteed to work (DNS updates can take longer to reach everyone than the actual CDN outage lasts).

There are places where it is worth it and useful, but for a lot of the sites listed it's not useful.


That's the fallback, but the original stack is not designed with the volume of traffic in mind. So it gets overwhelmed very quickly and makes the website practically unavailable.


> Or outages are so rare that it’s not worth the trouble?

This. I can't remember the last Fastly outage of this magnitude, so the time spent on setting up a secondary server serving your assets is probably not really worth it for small-medium companies. Although I'd think otherwise for a company like Shopify.


Many sites do this; Amazon's failed over to their own servers for images for me, it appears. It typically just takes some human intervention, I suspect.


I'm particularly intrigued as to why Amazon.com uses them.

They literally have their own directly competing CDN product. You'd think they'd be dogfooding it.


Amazon doesn't enforce dogfooding; their retail site is its own stack and has only recently been migrating to EC2.


[deleted]


S3 is fine.


BTC/USD is down too.


Perfect time for the crypto whales to dump massively and cause an absolute panic.


Tangential question, but with services like these, is there a known way to handle failure gracefully? Some way to automatically bypass these services if they are known to be down?


You have to have two separate CDNs and use DNS to fail over. The problem is that means paying for a CDN that just sits dormant for the 99.999% of the time that your primary is up.

Alternatively you could use DNS to fail over to the content you host yourself, instead of another CDN. But in many cases that would be the same as an outage, since the CDN exists to reduce the impact of all those requests on your infra.
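As a sketch of the "use DNS to fail over" part, under the assumption of a hypothetical DNS provider client (update_cname below is made up; the real call depends on whoever hosts your zone):

    import requests

    PRIMARY_CDN = "example.primary-cdn.example.net"    # made-up hostnames
    SECONDARY_CDN = "example.backup-cdn.example.net"
    HEALTH_URL = f"https://{PRIMARY_CDN}/__health"      # hypothetical health endpoint

    def primary_is_healthy() -> bool:
        try:
            return requests.get(HEALTH_URL, timeout=3).status_code == 200
        except requests.RequestException:
            return False

    def update_cname(name: str, target: str, ttl: int = 60) -> None:
        # Hypothetical: call your DNS provider's API here.
        print(f"point {name} CNAME -> {target} (ttl={ttl})")

    def failover_check() -> None:
        # Run every minute or so; keep the TTL short so the flip takes effect quickly.
        target = PRIMARY_CDN if primary_is_healthy() else SECONDARY_CDN
        update_cname("www.example.com", target)

    failover_check()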


Have two different CDN partners, own your own DNS, and then withdraw one of the CDNs if they are down. Suspect that's what Amazon have done.


Yikes, seems like a massive outage.

EDIT: Hexdocs is down, elixir-lang.org is down


None of the ES/NQ/RTY/YM futures contracts took kindly to the outage! This could have had a much wider financial impact. Most seem to have recovered now.



Looks like fastly.com uses fastly…


Do they have an official status page? Googling gets https://docs.fastly.com/en/guides/fastlys-network-status which is 503

Edit: Elsewhere in the comments: https://status.fastly.com/incidents/vpk0ssybt3bj


Hacker News is the only one UP!


It should be resolved soon. From the Fastly status page:

The issue has been identified and a fix is being implemented. Posted 1 minute ago. Jun 08, 2021 - 10:44 UTC


Wonder if all the caches will have been wiped, causing knock on issues


You might be right. Here is another update from fastly:

The issue has been identified and a fix has been applied. Customers may experience increased origin load as global services return.

Let's see


Phew!

That time to find the issue is always the stressful part. < 1 hour is pretty good for weird stuff, and fortunately the east coast of the US is barely online this early (sorry Europe!).


https://www.bbc.com/news/technology-57399628 is rendering and reporting on the story, but BBC itself was down at the start of the outage, with the same 503 varnish error message.

Presumably the BBC has some kind of fallback in place.

The journalists ought to interview their own techies :)




Anything hosted on Firebase seems to be down


I will NEVER understand why people put so much trust in single provider solutions for anything critical.


What happens when there is excessive centralization.

I thought that one of the principles behind the Internet is to be able to reroute around failures, but neither these service providers nor their clients ever seem to learn.

I guess in their mind that only applies to packet routing not services. SMH


Interestingly, https://www.fastly.com/ works for me, whereas https://fastly.com/ doesn't.


Funnily enough, it's the opposite for me...


I was wondering why my Tidal app just stopped mid-song and wouldn't connect. After much googling, and absolutely no help or even a notification from Tidal explaining there's an issue, it seems this outage is the culprit. Bugger.


Time to develop a CDN for CDNs.

It seems like a pattern that CDNs have overly centralized the web and led to issues like this.

Maybe it's time to build a CDN that distributes your static assets to multiple CDNs and has a set of fallback states for service outages.



I got a push notification from the CNN app telling me a bunch of the internet was down due to a cloud provider. I clicked the link only for the app to open to a 503. In hindsight not surprising, but quite amusing.


pypi.org, but not https://status.python.org/ - I'm impressed that they actually hosted the status page differently!


That's fairly standard practice.

Fastly itself has its status page up as well: https://status.fastly.com/


Their status page keeps claiming that my region, Chicago (ORD), is either Degraded Performance, or Operational. But clearly it's down. Is fuzzing metrics like this how they hit their SLA targets?


Looks like they're currently applying a fix.

https://status.fastly.com/incidents/vpk0ssybt3bj


It's funny, I searched Twitter for "Ebay down" and the top result was an Ebay tweet with some not coincidentally broken Twitter emoji SVGs (as another person mentioned)...


GitHub? I had some issues; I checked the service status page and it said no issues, but images were returning a 503. Maybe they host their service status page elsewhere, without using Fastly.


GitHub now showing partial outage (images on status pages are fixed)


Pretty bad that www.gov.uk is down as more services move to digital.


I don't think moving to digital is the issue here. The issue is relying on third parties, which can have an issue at any moment, taking down whoever relies on them with them.

A government should not rely on CDNs like that. In fact, government websites should not have any traffic going over third parties. When I want to use/view a government website, I should not be subjected to sharing any data with unwanted third parties, and the government should not be affected when some private company makes mistakes or has outages. It is an unacceptable situation.

They can set up their own state-owned CDN, using the same underlying technology. Compared to where they spend all that tax money, some servers and some engineers would be a very cheap investment, in relation to the independence achieved.


They seem to have migrated across to CloudFront - working now.


I briefly saw an error about "domain not found" when hitting fastly.com; I wonder if some list of domains has hit a limit/been flushed/etc.


I get this now on reddit:

    Fastly error: unknown domain: www.reddit.com.


How does one design a system that has redundancy for when the CDN goes down? Paying for more than one CDN is probably too expensive, isn't it?


Good job Fastly for getting the issue identified and resolved so quickly. < 1 hour to identify, <13 minutes to fix (assuming status is accurate).


NumPy docs, too. I think it's Cloudflare-related as well; at least, I keep seeing some Cloudflare errors interspersed with the 503 Varnish error.


Well they thought that using a CDN over a CDN would be a good idea


We've got Cloudflare sitting in front of our Firebase/GCP instance (which I've just found out is Fastly-cached :/). Getting 503s at the origin but we're up on our URL with an always online notice thanks to CF. Double dip isn't all that bad.


Pytorch and Python docs, all down. No stackoverflow. I guess this is a forced bank holiday for developers around the world.


Quick question: if the CDNs are down, why can't traffic be routed to the central web servers the company owns?

I thought CDNs had fallback configured?


Those of you who work in DevOps or SRE, or are CTOs:

What kind of things do you put in place to manage these kinds of centralised issues that are beyond your control?


These issues are in your control - not the centralised service itself, but your use of it. You can build appropriate redundancy for the components/providers in your stack, within the budget you have.
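For the CDN piece specifically, even a dumb external health check that compares the CDN-fronted hostname against a standby and the origin gives you something to act on. Rough sketch only: the hostnames are hypothetical, and the actual DNS flip (manual or via your DNS provider's API) is deliberately left out.

    # Rough failover check: poll the same health endpoint through the primary
    # CDN, a standby CDN and the origin directly. Hostnames are made up.
    import time
    import urllib.request

    ENDPOINTS = {
        "primary-cdn": "https://www.example.com/healthz",      # CNAME -> CDN A
        "standby-cdn": "https://standby.example.com/healthz",  # CNAME -> CDN B
        "origin":      "https://origin.example.com/healthz",   # bypasses any CDN
    }

    def is_healthy(url, timeout=3.0):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except OSError:  # covers URLError, HTTPError (e.g. a 503) and timeouts
            return False

    while True:
        status = {name: is_healthy(url) for name, url in ENDPOINTS.items()}
        print(status)
        if not status["primary-cdn"] and status["standby-cdn"]:
            print("primary CDN looks down; flip DNS to the standby")
        time.sleep(30)

Keeping the TTL low on the record you'd flip is the part people tend to forget until a day like today.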



Only their main website though. My Heroku apps work pretty well.


>The issue has been identified and a fix has been applied. Customers may experience increased origin load as global services return.

It's fixed


Ironically, even this outage page is down for me


Wow, talk about a brutal SPOF, most of the things I had planned to work with today are broken: reddit, github, stack overflow.


I̶n̶ ̶r̶o̶m̶a̶n̶i̶a̶ ̶e̶v̶e̶r̶y̶t̶h̶i̶n̶g̶ ̶s̶e̶e̶m̶s̶ ̶b̶a̶c̶k̶ ̶t̶o̶ ̶n̶o̶r̶m̶a̶l̶.̶.̶.̶?̶

Edit: nope, just worked for 2-3 requests (10 secs)



Worrying that this is impacting so many dev toolchains and services, which will hinder the ability to respond to the issue.


This seems to be a bigger issue. BGP failure?


If they can serve me a garbage Varnish error (shoutout to "software that actually runs your business that none of your devs work on"), it's not BGP.


Things seem to have come back online in Australia, although not sure if that's just sites switching over their DNS?


"The internet will just route around a local / centralised problem ... like water around an object"

Obligatory LOL ...


Really not the "internet's" fault.


Nobody said how quickly, to be fair


Firebase Dynamic Links is affected too. Checking the IP, it looks like they are using Fastly, which is quite surprising.


I’ve noticed lots of social media content is tied to this - Reddit and Twitter images and some videos, for example.


The issue has been identified and a fix is being implemented. Posted 3 minutes ago. Jun 08, 2021 - 10:44 UTC


Let's make all of the main internet sites dependent upon one central private service. Great idea guys.


Seems like another single point of failure. What is a solution to not be affected by such an outage?


It is time to remove that "100% uptime guarantee" claim from the website :grimacing:


My work's website is down too, and so are the regular sites I use to escape work boredom.


Fastly is back now. (The issue has been identified and a fix is being implemented.)


It would be interesting to see estimates of the man-hour cost of this outage.


Got the same here (Australia)


rubygems.org affected too


Well I know where to go next time if I were to be a Russian hacker


Twitch isn’t working or responding, and neither is the web dashboard


When this happens to cloudflare, it will be even more impactful.


Looks like Fastly did not work as advertised, very misleading.


I'm sure it's just a coincidence that today is Patch Tuesday.

:-|


Spotify is also hit, though it still works without images


Someone must have 51% attacked the Pied Piper blockchain!


Damn, I thought I could blame myself or the provider...


Ten Percent Happier is down, and now my day is ruined.


When viewing a meditation session you can see a download button in the upper right (at least on iOS).

I always have a small stash of my favorites saved locally in case of an internet outage, or in case I’m caught in a situation where I don’t have internet but need a few minutes.

On top of that, I’ve been really trying to rely less on the app. So at least every couple of days I throw in a lightly guided or unguided session where I focus on going solo, so I don’t need an app, just a timer.


Just had my own site go down because of this. Glad to see it wasn't my fault lol, but good luck to the Fastly people on fixing the issue.


Twitch isn’t responding, and neither is the web dashboard


That explains why I couldn't access reddit


No wonder, The Verge and NYT are down too.


www.python.org down as well, with the shortest of messages: 'connection failure'. Probably related?


...and now back up, with reddit et al still down. Hmm.


Even amazon.com's styling is borked for me


I think reddit in India is down as well.


Extremely long shot, but what are the chances this turns out to be connected to the raids on organised crime using the An0m app that started today?


It's probably a DDoS attack.


And all Webflow sites it seems...


Indeed, part of GitHub (.io) too.


Looks like HN is working ;-)


Do companies really not run test suites / do manual testing before deploying to production?


Seems to be back online


Basically everything is broken. "Centralising Everything" huh


All Webflow sites?


StackOverflow too.


Parts of Shopify


Looks like an SRE team rolled out buggy software.


Let's start getting our guesses in.

I think it's some dodgy VCL rolled out to all machines at once. For some reason it worked in staging.


It's always DNS...


github is back online. SSO too.


Whew, DevOps fire alarms are going off!


github.com is pretty broken


SMH.com.au


the problem has been fixed


reddit.com is affected too


cnn.com is down as well.


A real-world Chaos experiment!


it seems to be up now


reddit down as well


I first noticed that xkcd was down. Then I went to post about it on reddit . . . also down! Good thing HN is up.


Taken out xkcd as well.


Today's comic is titled "Product Launch", so the joke still works if you assume it's about a disastrous launch. ;)


Isn't there an xkcd comic about CDN failures?



xkcd is down too :(


Maybe this one, titled "The Cloud". https://xkcd.com/908/


We have no way to know. https://xkcd.com/908/


Might be xkcd.com/503.


Are these sites on the same cloud or CDN?


They are all on Fastly CDN...?


Also, why has this been allowed to happen? Billions of dollars lost because of this one company?

I don't understand this.


For a moment I thought all of the Western internet was cut off from India. Shows how siloed my browsing habits are!


Couldn't be happier I moved https://noisycamp.com to BunnyCDN.com.


Every other comment about what's down in this thread -- as if we needed dozens of site-by-site accountings of this outage in the first place -- is a bitch about reddit. Why is reddit so important to this crowd? The specific topics I used to read the site for (half a dozen years ago) have all been overrun by "bucket people," there is literally never an answer to any question I find a google link to there, and the site's design is actively user-hostile. Seriously: what's keeping that place afloat? Porn, I suppose.


Of course, the Enlightened Folk of this site can no longer use their leisure time on lowly activities such as the "Reddit".

Teach me your ways, master! /s

Jokes aside, people can do whatever they please. Reddit has a bunch of niche communities around many hobbies and fun things. No need to be bitter about it.


You have put your finger on it. I AM bitter about it. It used to be really cool, and really nice to use, before the Taylor/Pao dustup, and the redesign.


old.reddit is still a thing, and there are plenty of educational subreddits with really nice communities around them. It's just like the rest of the internet: pick the things that suit you.


Reddit taught me to never trust a mod, so it does have some purpose still. I think without glaringly bad examples of how (not) to run a community-based site, we would be doomed to repeat its mistakes.



