Summary: On August 30, 2020 10:04 GMT, CenturyLink identified an issue
to be affecting users across multiple markets. The IP Network Operations
Center (NOC) was engaged, and initial research identified that an
offending flowspec announcement prevented Border Gateway Protocol (BGP)
from establishing across multiple elements throughout the CenturyLink
Network. The IP NOC deployed a global configuration change to block the
offending flowspec announcement, which allowed BGP to begin to correctly
establish. As the change propagated through the network, the IP NOC
observed all associated service affecting alarms clearing and services
returning to a stable state.
It's a super useful tool if you want to blast out an ACL across your network in seconds (using BGP), but it has a number of sharp edges. Several networks, including Cloudflare, have learned what it can do. I've seen a few networks basically blackhole traffic or even lock themselves out of routers due to a poorly made Flowspec rule or a bug in the implementation.
Is "doing what you ask" considered a sharp edge? Network-related tools don't really have safeties, ever (your linux host will happily "ip rule add 0 blackhole" without confirmation). Every case of flowspec shenanigans in the news has been operator error.
Massive reconvergence event in their network, causing edge router BGP sessions to bounce (due to CPU). Right now all their big peers are shutting down sessions with them to give Level3's network the ability to reconverge. Prefixes announced to 3356 are frozen on their route reflectors and not getting withdrawn.
Edit: if you are a Level3 customer shut your sessions down to them.
There was a huge AT&T outage in 1990 that cut off most US long distance telephony (which was, at the time, mostly "everything not within the same area code").
It was a bug. It wasn't a reconvergence event, but it was a distant cousin: Something would cause a crash; exchanges would offload that something to other exchanges, causing them to crash -- but with enough time for the original exchange to come back up, receive the crashy event back, and crash again.
The whole network was full of nodes crashing, causing their peers to crash, ad infinitum. In order to bring the network back up, they would have needed to take everything down at the same time (and make sure all the queues were emptied), but even that wouldn't have made it stable, because a similar "patient 0" event would have brought the whole network down again.
Once the problem was understood, they reverted to an earlier version which didn't have the bug, and the network re-stabilized.
The lore I grew up on is that this specific event was very significant in pushing and funding research into robust distributed systems, of which the best known result is Erlang and its ecosystem - originally built, and still mostly used, to make sure that phone exchanges don't break.
Contrary to what that link says, the software was not thoroughly tested. Normal testing was bypassed - per management request after a small code change.
This was covered in a book (perhaps Safeware, but maybe another one I don't recall) along with the Therac-25, the Ariane V, and several others. Unfortunately these lessons need to be relearned by each generation. See the 737-Max...
That's why the most reliable way to instil this lesson is to instil it into our tools. Automate as much testing as possible, so that bypassing the tests becomes more work than running them.
I disagree; it's in part a people problem - more draconian test suites just make developers more inclined to cheat, and they tend to write tests which are not valid or that just get the tool passing...
It's more important to visually model and test than to enforce some arbitrary set of rules that don't apply universally - then you have at least the visual impetus of 'this is wrong' or 'I need to test this right'.
A lot of time is spent visually testing UIs and yet these same people struggle with testing the code that matters...
Probably not the book you are thinking of, since it’s just about the AT&T incident, but “The Day the Phones Stopped Ringing” by Leonard Lee is a detailed description of the event.
It’s been many years since I read it, but I recall it being a very interesting read.
For some reason in my university almost every CS class would start with an anecdote about the Therac 25, Ariane V, and/or a couple others as a motivation on why we the class existed. It was sort of a meme.
The lessons are definitely still taught, I don't know if they're actually learned of course.. And who knows who actually taught the 737-Max software devs, I don't suppose they're fresh out of uni.
Unfortunately most people become a manager by being a stellar individual contributor. People management and engineering are very different skills; I'm always impressed when I see someone make that jump smoothly.
I always wanted companies to hire people managers as its own career path. An engineer can be an excellent technical lead or architect, but it can feel like you started over once you're responsible for the employees, their growth, and their career path.
Yeah, it just sucks that you eventually have someone making significant people management decisions without the technical knowledge of what the consequences could end up being. This would be even worse if you had people manager hiring be completely decoupled. The US military works this way and I have to say it's not the best mode.
Typically yes actually, the director of engineering should always be an engineer. Of course, these are hardware companies so it would probably be some kind of hardware engineer.
As a former AT&T contractor, albeit from years later, this checks out. Sat in a "red jeopardy" meeting once because a certain higher-up couldn't access the AT&T branded security system at one of his many houses.
The build that broke it was rushed out and never fully tested, adding a fairly useless feature for said higher-up that improved the UX for users with multiple houses on their account.
This reminds me of an incident on the early internet (perhaps ARPANET at that point) where a routing table got corrupted so it had a negative-length route which routers then propagated to each other, even after the original corrupt router was rebooted. As with AT&T, they had to reboot all the routers at once to get rid of the corruption.
I can't remember where i read about this, but i recall the problem was called "The Creeping Crud from California". Sadly, this phrase apparently does not appear anywhere on the internet. Did i imagine this?
Interesting, thanks! That is different to the story i remember, but it's possible that i remember incorrectly, or read an incorrect explanation.
I believe that i read about this episode in Hans Moravec's book 'Mind Children'. I can see in Google Books that chapter 5 is on 'Wildlife', and there is a section 'Spontaneous Generation', which promises to talk about a "software parasite" which emerged naturally in the ARPAnet - but of which the bulk is not available:
I have spent hours and hours banging my head against Erlang distributed system bugs in production. I am absolutely mystified why anyone thought just using a particular programming language would prevent these scenarios. If it's Turing-complete, expect the unexpected.
The idea isn't that Erlang is infallible in the design of distributed systems.
The idea is it takes away enough foot-guns that if you're banging your head against systems written in it, you'd be banging your head even harder and more often if the same implementor had used another language.
Are you referring to CenturyLink’s 37-hour, nationwide outage?
> In this instance, the malformed packets [Ethernet frames?] included fragments of valid network management packets that are typically generated. Each malformed packet shared four attributes that contributed to the outage: 1) a broadcast destination address, meaning that the packet was directed to be sent to all connected devices; 2) a valid header and valid checksum; 3) no expiration time, meaning that the packet would not be dropped for being created too long ago; and 4) a size larger than 64 bytes.
I think we used to call that a poison pill message (still bring it up routinely when we talk about load balancing and why infinite retries are a very, very bad idea).
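To make the point concrete, here's a minimal sketch (plain Python, hypothetical queue and handler names) of the usual mitigation: cap the retries per message and park repeat offenders in a dead-letter queue instead of retrying forever.

    import queue

    MAX_ATTEMPTS = 3  # hypothetical retry budget

    def consume(work_q: queue.Queue, dead_letter_q: queue.Queue, handler) -> None:
        # Drain work_q, retrying each message a bounded number of times.
        # A message that keeps failing (a "poison pill") is parked in
        # dead_letter_q so it can't starve the rest of the queue.
        while not work_q.empty():
            msg, attempts = work_q.get()
            try:
                handler(msg)
            except Exception:
                if attempts + 1 >= MAX_ATTEMPTS:
                    dead_letter_q.put(msg)           # park it for human inspection
                else:
                    work_q.put((msg, attempts + 1))  # bounded retry

    # Usage sketch: work_q.put((message, 0)) for each message, then
    # consume(work_q, dead_q, handler); whatever lands in dead_q gets triaged by a human.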
But your queue will grow and grow and the fraction of time you spend servicing old messages grows and grows.
Not a terribly big fan of these queueing systems. People always seem to bung things up in ways they are not quite equipped to fix (in the “you are not smart enough to debug the code you wrote” sense).
Last time I had to help someone with such a situation, we discovered that the duplicate processing problem had existed for >3 months prior to the crisis event, and had been consuming 10% of the system capacity, which was just low enough that nobody noticed.
The thing with feature group D trunks to the long distance network is you could (and still can on non-IP/mobile networks) manually route to another long distance carrier like Verizon, and sidestep the outage from the subscriber end, full stop. That's certainly not possible with any of the contemporary internet outages.
You can inject changes in routing, but if the other carrier doesn't route around the affected network, you're back to square one. That's part of why Level3/CenturyLink was depeered and why several prefixes that are normally announced through it were quickly rerouted by owners.
That's my point; as a subscriber, you can prefix a long distance call with a routing code to avoid, for example, a shut down long distance network without any administrator changes. Routing to the long distance networks is done independently through the local network, so if AT&T's long distance network was having issues, it'd have no impact on your ability to access Verizon's long distance network.
There's actually no technical reason why you couldn't do that with IP (4 or 6), although you'd need an appropriately located host to be running a relay daemon[0].
0: ie something that takes, say, a UDP packet on port NNNN containing a whole raw IPv4 packet, throws away the wrapping, and drops the IPv4 packet onto its own network interface. This is safe - the packet must shrink by a dozen or two bytes with each retransmission - but usually not actually set up anywhere.
Edit: It probably wouldn't work for TCP though - maybe try TOR?
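For what it's worth, the relay daemon in [0] is only a few lines. Here's a rough sketch in Python (LISTEN_PORT stands in for the "port NNNN" above, the raw socket needs root, and it skips the shrink/sanity checks a real one would want):

    import socket

    LISTEN_PORT = 4444  # stand-in for the "port NNNN" above

    def relay() -> None:
        # Receive UDP datagrams whose payload is a complete raw IPv4 packet
        # and re-emit that packet onto the local network.
        udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        udp.bind(("0.0.0.0", LISTEN_PORT))

        # With IPPROTO_RAW the payload we send must already carry its own
        # IPv4 header - which is exactly what arrived inside the UDP wrapper.
        raw = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_RAW)

        while True:
            packet, _src = udp.recvfrom(65535)
            if len(packet) < 20:                      # too short to hold an IPv4 header
                continue
            dst = socket.inet_ntoa(packet[16:20])     # destination from the inner header
            raw.sendto(packet, (dst, 0))              # kernel routes it like any other packet

    if __name__ == "__main__":
        relay()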
There are plenty of ways to do what you're describing, and they all work with TCP. Some of them only work if the encapsulated traffic is IPv6 (and are designed to give IPv6 access on ISPs that only support IPv4). Some of them may end up buffering the TCP stream and potentially generating packet boundaries at different locations than in the original TCP stream.
This sounds like the event that is described in the book Masters of Deception: The gang that ruled cyberspace. The way I remember it the book attributes the incident to MoD, while of course still being the result of a bug/faulty design.
BGP is a path-vector routing protocol, every router on the internet is constantly updating its routing tables based on paths provided by its peers to get the shortest distance to an advertised prefix. When a new route is announced it takes time to propagate through the network and for all routers in the chain to “converge” into a single coherent view.
If this is indeed a reconvergence event, that would imply there’s been a cascade of route table updates that have been making their way through CTL/L3’s network - meaning many routers are missing the “correct” paths to prefixes and traffic is not going where it is supposed to, either getting stuck in a routing loop or just going to /dev/null because the next hop isn’t available.
This wouldn’t be such a huge issue if downstream systems could shut down their BGP sessions with CTL and have traffic come in via other routes, but doing so is not resulting in the announcements being pulled from the Level 3 AS - something usually reflective of the CPU on the routers being overloaded processing route table updates or an issue with the BGP communication between them.
BGP operates as a rumor mill. Convergence is the process of all of the rumors settling into a steady state. The rumors are of the form "I can reach this range of IP addresses by going through this path of networks." Networks will refuse to listen to rumors that have themselves in the path, as that would cause traffic to loop.
For each IP range described in the rumor table, each network is free to choose whichever rumor they like best among all they have heard, and send traffic for that range along the described path. Typically this is the shortest, but it doesn't have to be.
ISPs will pass on their favorite rumor for each range, adding themselves to the path of networks. (They must also withdraw the rumors if they become disconnected from their upstream source, or their upstream withdraws it.) Businesses like hosting providers won't pass on any rumors other than those they started, as no one involved wants them to be a path between the ISPs. (Most ISPs will generally restrict the kinds of rumors their non-ISP peers can spread, usually in terms of what IP ranges the peer owns.)
Convergence in BGP is easy in the "good news" direction, and a clusterfuck in the "bad news" direction. When a new range is advertised, or the path is getting shorter, it is smooth sailing, as each network more or less just takes the new route as is and passes it on without hesitation. In the bad news direction, where either something is getting retracted entirely, or the path is going to get much longer, we get something called "path hunting."
As an example of path hunting: Let's say the old paths for a rumor were A-B-C and A-B-D, but C is also connected to D. (C and D spread rumors to each other, but the extended paths A-B-C-D and A-B-D-C are longer, thus not used yet.) A-B gets cut. B tells both C and D that it is withdrawing the rumor. Simultaneously D looks at the rumor A-B-C-D and C looks at the rumor A-B-D-C, and each says "well I've got this slightly worse path lying around, might as well use it." Then they spread that rumor to their downstreams, not realizing that it is vulnerable to the same event that cost them the more direct route. (They have no idea why B withdrew the rumor from them.) The paths, especially when removing an IP range entirely, can get really crazy. (A lot of core internet infrastructure uses delays to prevent the same IP range from updating too often, which tamps down on the crazy path exploration and can actually speed things up in these cases.)
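If it helps to see the "rumors" settling written down, here's a toy convergence loop in Python over the A/B/C/D topology above. It only computes the steady state each AS ends up with; the transient path-hunting churn comes from real routers updating asynchronously without knowing why a rumor was withdrawn.

    from typing import Dict, List

    # Toy topology from the example above: A-B, B-C, B-D, C-D.
    LINKS = {frozenset(p) for p in [("A", "B"), ("B", "C"), ("B", "D"), ("C", "D")]}
    NODES = {n for link in LINKS for n in link}

    def neighbors(asn: str, down: set) -> List[str]:
        return [next(iter(l - {asn})) for l in LINKS - down if asn in l]

    def converge(origin: str, down: set = frozenset()) -> Dict[str, List[str]]:
        # Each AS keeps its shortest loop-free AS-path towards `origin`,
        # learned only from direct neighbors (a path-vector "rumor").
        best: Dict[str, List[str]] = {origin: [origin]}
        changed = True
        while changed:
            changed = False
            for asn in NODES:
                for nb in neighbors(asn, down):
                    if nb not in best or asn in best[nb]:   # loop prevention
                        continue
                    candidate = best[nb] + [asn]
                    if asn not in best or len(candidate) < len(best[asn]):
                        best[asn] = candidate
                        changed = True
        return best

    print(converge("A"))                                    # C and D reach A via B
    print(converge("A", down={frozenset(("A", "B"))}))      # cut A-B: only A remains reachable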
IP network routing is distributed systems within distributed systems. For whatever reason the distributed system that is the CenturyLink network isn't "converging" - or, as we could also say, becoming consistent, or settling - in a timely manner.
CenturyLink/Level3 on Twitter:
"We are able to confirm that all services impacted by today’s IP outage have been restored. We understand how important these services are to our customers, and we sincerely apologize for the impact this outage caused."
India just lost to Russia in the final of the first-ever online chess olympiad, probably due to connection issues of two of its players. I wonder if it's related to this incident and if the organizers are aware.
Edit: the organizers are aware, and Russia and India have now been declared joint winners.
I had this problem two years ago while I was taking Go lessons online from a South Korean professional Go Master. For my last job we were renting a home well outside city limits in Illinois and our Internet failed often. I lost one game in an internal teaching tournament because of a failed connection, and jumped through hoops to avoid that problem.
Wasn't able to access HN from India earlier, but other cloudflare enabled services were accessible. I assume several Network Engineers were woken up from their Sunday morning sleep to fix the issue; if any of them is reading this, I appreciate your effort.
Related: World champion Magnus Carlsen recently resigned a match after 4 moves as an act of honor because in his previous match with the same opponent, Magnus won solely due to his opponent having been disconnected.
His opponent, Ding Liren, is from China, and has been especially plagued by unreliable internet since all the high-level chess tournaments have moved online. He is currently ranked #3, behind Magnus Carlsen and Fabiano Caruana.
All professional chess games have a time limit for each player (if you've ever heard of "chess clocks" -- that's what they're used for). In "slow chess" each player has a 2-hour limit and all of the other time control schemes (such as rapid and blitz) are much shorter.
There’s an interesting protocol for splitting a Go or chess game over multiple days so that neither party has the entire time to think about their response to the last move: at the end of the day the final move is made by one player but is sealed, not to be revealed until the start of the next session.
For this to work on an internet competition, the judges would need a backup, possibly very low bandwidth communication mechanism that survives a network outage.
This wouldn’t save any real-time esports, but would be serviceable for turn based systems.
The games are timed and this pause gives a lot of thinking time. If they're allowed to talk with others during the pause, then also consulting time.
> why don't they start over
That would be unfair to the player who was ahead.
That said, both players might still be fine with a clean rematch, because being the undisputed winner feels better. I wonder if they were asked (anonymously to prevent public hate) whether they would be fine with a rematch.
Seems like one of those cases where solving a “little” issue would actually require rearchitecting the entire system.
Namely, in this case, it seems like the “right thing” is for games to not derive their ELO contributions from pure win/loss/draw scorings at all, but rather for games to be converted into ELO contributions by how far ahead one player was over the other at the point when both players stopped playing for whatever reason (where checkmate, forfeit, and game disruption are all valid reasons.) Perhaps with some Best-rank (https://www.evanmiller.org/how-not-to-sort-by-average-rating...) applied, so that games that go on longer are “more proof” of the competitive edge of the player that was ahead at the time.
Of course, in most central cases (of chess matches that run to checkmate or a “deep” forfeit), such a scoring method would be irrelevant, and would just reduce to the same data as win/loss/draw inputs to ELO would. So it’d be a bunch of effort only to solve these weird edge cases like “how does a half-game that neither player forfeited contribute to ELO.”
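For reference, the standard Elo update only consumes a single score per game, so the proposal above amounts to feeding it a fractional score instead of 1/0.5/0. A sketch in Python (the 0.7 "advantage" input is purely illustrative):

    def expected_score(r_a: float, r_b: float) -> float:
        # Standard Elo win expectancy of A against B.
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

    def elo_update(r_a: float, r_b: float, s_a: float, k: float = 20.0):
        # Classically s_a is 1 (win), 0.5 (draw) or 0 (loss); a fractional s_a
        # would encode "how far ahead A was when the game stopped".
        e_a = expected_score(r_a, r_b)
        return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

    # A disrupted game where A was judged ~70% likely to win:
    print(elo_update(2700, 2750, s_a=0.7))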
> but rather for games to be converted into ELO contributions by how far ahead one player was over the other at the point when both players stopped playing for whatever reason
Except for the obvious positions that no one serious would even play, there is no agreed-upon way of calculating who has an advantage in chess like that. One man's terrible mobility and probable blunder is another's brilliant stratagem.
Hm, you’re right; guess I was thinking in terms of how this would apply to Go, where it’d be as simple as counting territory.
Still, just to spitball: one “obvious” approach, at least in our modern world where technology is an inextricable part of the game, would be to ask a chess-computer: “given that both players play optimally from now on, what would be the likelihood of each player winning from this starting board position?” The situations where this answer is hard/impossible to calculate (i.e. estimations close to the beginning of a match) are exactly the situations where the ELO contribution should be minuscule anyway, because the match didn’t contribute much to tightening the confidence interval of the skill gap between the players.
Of course, players don’t play optimally. I suspect that, given GPT-3 and the like, we’ll soon be able to train chess-computers to mimic specific players’ play-styles and seeming limits of knowledge (insofar as those are subsets of the chess-computer’s own capabilities, that it’s constraining its play to.) At that point, we might actually be able to ask the more interesting question: “given these two player-models and this board position, in what percentage of evolutions from this position does player-model A win?”
Interestingly, you could ask that question with the board position being the initial one, and thus end up with automatically-computed betting odds based on the players’ last-known skill (which would be strictly better than ELO as a prediction on how well an individual pair of players would do when facing off; and therefore could, in theory, be used as a replacement for ELO in determining who “should” be playing whom. You’d need an HPC cluster to generate that ladder, but it’d be theoretically possible, and that’s interesting.)
I was doing development work which uses a server I've got hosted on Digital Ocean. I started getting intermittent responses, which I thought weird as I hadn't changed anything on the server. I spent a good ten minutes trying to debug the issue before searching for something on DuckDuckGo, which also didn't respond. Cloudflare shouldn't be involved at all with my little site, so I don't think it's limited to just them.
Cogent and Cox are also having problems, but we are seeing a lot more successful traffic on Cogent than CenturyLink. It appears that CL is also not withdrawing stale routes. It seems CL's issues are causing issues on/with everything connected to it.
Same here. I actually opened a support ticket with them because I was worried my ISP had started blocking their IP addresses for some unknown reason. Luckily it seems to clear up, and in the ticket they mentioned routing traffic away from the problematic infrastructure. Seems to have worked for now for my things.
Yup, definitely noticed earlier outages to both EU sites and also to HN. Looked far upstream because many sites/lots of things worked fine. Good to see it's at least largely fixed
M5 Hosting here, where this site is hosted. We just shut down 2 sessions with Level3/CenturyLink because the sessions were flapping and we were not getting complete full route table from either session. There are definitely other issues going on on the Internet right now.
Great write up. It is embarrassing that most of America has no competition in the market.
>To use the old Internet as a “superhighway” analogy, that’s like only having a single offramp to a town. If the offramp is blocked, then there’s no way to reach the town. This was exacerbated in some cases because CenturyLink/Level(3)’s network was not honoring route withdrawals and continued to advertise routes to networks like Cloudflare’s even after they’d been withdrawn. In the case of customers whose only connectivity to the Internet is via CenturyLink/Level(3), or if CenturyLink/Level(3) continued to announce bad routes after they'd been withdrawn, there was no way for us to reach their applications and they continued to see 522 errors until CenturyLink/Level(3) resolved their issue around 14:30 UTC.
The same was a problem on the other (“eyeball”) side of the network. Individuals need to have an onramp onto the Internet’s superhighway. An onramp to the Internet is essentially what your ISP provides. CenturyLink is one of the largest ISPs in the United States.
Because this outage appeared to take all of the CenturyLink/Level(3) network offline, individuals who are CenturyLink customers would not have been able to reach Cloudflare or any other Internet provider until the issue was resolved. Globally, we saw a 3.5% drop in global traffic during the outage, nearly all of which was due to a nearly complete outage of CenturyLink’s ISP service across the United States.
I remember working the support queue _before_ this automatic re-routing mitigation system went in and it was a lifesaver. Having to run over to SRE and yell "look! look at grafana showing this big jump in 522s across the board for everything originating in ORD-XX where the next hop is ASYYYY! WHY ARE WE STILL SENDING TRAFFIC OVER THAT ARRRGHH please re-route and make the 522 tickets stop"
it's cool to see something large enough that the auto-healing mechanisms weren't able to handle it on their own, though shoutout to whoever was on the weekend support/SRE shift; that stuff was never fun to deal with when you were one of a few reduced staff on the weekend shifts
I had this earlier! A bunch of sites were down for me, I couldn't even connect to this site.
The problem is I don't know where to find out what was going on (I tried looking up live DDoS-tracking websites, "is it down or is it just me" websites, etc.). I couldn't find a single place talking about this.
Is there a source where you can get instant information on Level3 / global DNS / major outages?
I believe these are mostly non public channels where backbone and network infrastructure engineers from different companies congregate to discuss outages like this.
please don't call yourself that, it's more like i [and others] are hyper paranoid and marginal in behavior due to the nature of pastimes [i myself can promise you that i'm not malicious but i can't speak for others, i would leave it up to them to speak for themselves]
it isn't so much the channels that you want, it's the current IP of a non-indexed IRC server[s] that you need; of course you could create and maintain your own dynamic IRC server and invite people that you trust or feel kinship toward.
here are a couple of "for instance" breadcrumbs for you to start from:
I'm definitely an amateur when it comes to networking stuff. At the time, the _only_ issue I had was with all of my Digital Ocean droplets. It was confusing because I was able to get to them through my LTE connection and not able to through my home ISP. I opened a ticket with DO worried that it was my ISP blocking IP addresses suddenly. It turned out to be this outage, but it was very specific. Traceroute gave some clues, but again I'm amateur and I couldn't tell what was happening after a certain point.
So yeah, I too would love a really easy to use page that could show outages like this. It would be really great to be able to specify vendors used to really piece the puzzle together.
I found places talking about this earlier. A friend of mine who has CenturyLink as their ISP complained to me that Twitch and Reddit weren't working. But they worked for me, so I suspected a CDN issue. I did some digging to figure out what CDNs they had in common. I expected Twitch to be on CloudFront, but their CDN doesn't serve CloudFront headers; instead they are "Via: 1.1 varnish". Reddit is exactly the same. I did some googling and found out that they both apparently used Fastly, at least to some extent. Fastly has a status page and it was talking about "widespread disruption".
So I guess my takeaway from this is that if the Internet seems to be down, usually the CDN providers notice. I don't know if either of the sites actually still use Fastly (I kind of forgot they existed), but I did end up reading about the Internet being broken at some scale larger than "your friend's cable modem is broken", so that was helpful.
It would be nice if we had a map of popular sites and which CDN they use, so we can collect a sampling of what's up and what's down and figure out which CDN is broken. Though in this case, it wasn't really the CDN's fault. Just collateral damage.
Unfortunately, this infrastructure is at an uncanny intersection of technology, business and politics.
To learn the technical aspect of it, you can follow any network engineering certification materials or resources that delve into dynamic routing protocols, notably BGP. Inter-ISP networking is nothing but setting up BGP sessions and filters at the technical level. Why you set these up, and under what conditions is a whole different can of worms, though.
The business and political aspect is a bit more difficult to learn without practice, but a good simulacrum can be taking part in a project like dn42, or even just getting an ASN and some IPv6 PA space and trying to announce it somewhere. However, this is no substitute for actual experience running an ISP, negotiating percentile billing rates with salespeople, getting into IXes, answering peering requests, getting rejected from peering requests, etc. :)
Disclaimer: I helped start a non-profit ISP in part to learn about these things in practice.
in certain instances, yes depending on the nature or subject matter of the channel
ok, let's go for broke: there are a LOT of clandestine IRC servers and exclusive gatekeeping of channels.
you won't know about them unless you have an IRL reference
A RIPE ASN (as an end-user through a LIR) and PA v6 will cost you around $100 per year and some mild paperwork; there's plenty of companies/organizations that will help you with that (shameless plug: bgp.wtf, that's us).
Afterwards, announcing this space is probably the cheapest with vultr (but their BGP connectivity to VMs is uh, erratic at times) or with ix-vm.cloud, or with packet.net (more expensive). You can also try to colo something in he.net FMT2 and reach FCIX, or something in kleyrex in Germany. All in all, you should be able to do something like run a toy v6 anycast CDN at not more than $100 per month.
If you try to become an LIR in your own right, RIPE fees are much higher.
If you're looking for PI (provider independent) resources from RIPE, the costs to the LIR (on top of their annual membership fees) is around 50€/year. An ASN and a /48 of IPv6 PI space would therefore clock in around 100€/year (which is in line with the GP's pricing).
Membership fees are around 1400€/year, with a 2000€ signup fee. The number of PA (provider assigned) resources you have has no bearing on your membership fee. If you only have a single /22 of IPv4 PA space (the maximum you can get as a new LIR today) or you have several /16s, it makes no difference to your membership fees (this wasn't always the case, the fee structure changes regularly).
(EDIT: Source: the RIPE website, and the invoices they've sent me for my LIR membership fees)
> All subscribers to Internet access provided by the provider must be members of the provider and must have the corresponding rights, including the right to vote, within reasonable time and procedure.
Not all of our subscribers are members of our association. The association is primarily a hackerspace with hackerspace members being members of the association. We just happen to also be an ISP selling services commercially (eg. to people colocating equipment with us, or buying FTTH connectivity in the building we're located in).
Ah, that's interesting, thank you for pointing this out to me, I didn't know about it. I take it that this isn't the first time you are asked about this, then?
Well, on one hand I perfectly understand you not wanting to change your structure, especially if it works fine. On the other, I can see a few ways around that restriction, and don't really see how having the ISP be a separate association with its customers as members (maybe with their votes having less weight than hackerspace members) would have a downside (except if funds are primarily collected for funding hackerspace activities?).
> I take it that this isn't the first time you are asked about this, then?
First time, but I read the rules carefully :).
> [I] don't really see how having the ISP a separate association with its customers as members [...] would have a downside [...]
Paperwork, in time and actual accounting fees. If/when we grow, this might happen - but for now it's just not worth the extra effort. We're not even breaking even on this, just using whatever extra income from customers to offset the costs of our own Internet infrastructure for the hackerspace. We don't even have enough customers to legally start a separate association with them as members, as far as I understand. I also don't think our customers would necessarily be interested in becoming members of an association, they just want good and cheap Internet access and/or mediocre and cheap colocation space.
What resources can I follow to start a non-profit ISP? I want to start one in my hometown for students who couldn't afford internet to join online classes.
Hmm, I actually didn't think about that at all. I guess I got too fascinated by this video[0] and wanted to apply something similar to our current scenario.
Because often enough there is only one dominant service in the region who has no pressure to compete from anyone due to regulatory capture (esp. regarding right of way on utility poles) and so has no incentive to upgrade their offers to the customers.
If you intend to start a facilities-based last mile access ISP, what last-mile tech do you intend to use? There's a number of resources out there for people who want to be a small hyper local WISP. But I would not recommend it unless you have 10+ years of real world network engineering experience at other, larger ISPs.
I actually tried, but all I got was some consultancy services that would help you get an ISP with estimated cost of 10k USD (a middle class household earns half of that in a year here).
NANOG also has a lot of good videos on their channel from their conferences, including one on optical fibre if you want to get into the low-level ISO Layer 1 stuff:
You want to learn about BGP in order to understand how routing on the internet works. The book "BGP" by Iljitsch van Beijnum is a great place to start. Don't be put off by the publication date, as almost everything in there is still relevant.[1]
Once you understand BGP and Autonomous Systems(AS), you can then understand peering as well as some of the politics that surround it.[2]
Then you can learn more about how specific networks are connected via public route servers and looking glass servers.[3][4][5]
Probably one of the best resource though still is to work for an ISP or other network provider for a stint.
It likely has some inaccurate info as I'm not a network engineer, but I gave a talk about BGP (with a history, protocol overview, and information on how it fails using real world examples) at Radical Networks last year. https://livestream.com/internetsociety/radnets19/videos/1980...
I tried to make it accessible to those who have only a basic understanding of home networking. Assuming you know what a router is and what an ISP is, you should be able to to ingest it without needing to know crazy jargon.
It's important to recognize that there is a "layer 8" in Internet routing-- the political / business layer-- that's not necessarily expressed in technical discussion of protocols and practices. The BGP routing protocol is a place where you'll see "layer 8" decisions reflected very starkly in configuration. You may have networks that have working physical connectivity, but logically be unable to route traffic across each other because of business or political arrangements expressed in BGP configuration.
The business structures, ISP ownership and national telecoms have changed quite a lot in the past 25 years. But in terms of the physical OSI layer 1 challenges of laying cable across an ocean, that remains the most difficult and costly part of the process.
Geoff Huston paper "Interconnection, Peering, and Settlements" is older, but still interesting and several ways relevant.
I suggest "Where Wizards Stay Up Late: The Origins Of The Internet" - generic and talks about Internet history, but mentions several common misconceptions.
Level3 was acquired/merged/changed to CenturyLink a year or so back; I think they closed their old Twitter account then.
When someone says level3, read century link. L3 have been a major player for decades though (including providing the infamous 4.2.2.2 dns server), so people still refer to them as level3.
Note that L3 is a separate company from Level 3 Communications, which was the ISP that was acquired by CenturyLink. L3 is an American aerospace and C4ISR contractor.
CenturyLink's current CEO, Jeff Storey, was actually the pre-acquisition Level 3 CEO.
Read Internet Routing Architectures by Sam Halabi. It’s almost 20 years old now but BGP hasn’t changed and the book is still called The Bible by routing architects.
It's dated and not particularly useful if you want to learn how things are really done on the internet in a practical sense. So if you read it, be prepared to unlearn a bunch of stuff.
No particular resource to recommend, though I first learned about it in a book by Radia Perlman, but BGP is a protocol you don't hear much about unless you work in networking, and is one of the key pieces in a lot of wide-scale outages. I'd start with that.
Odd, I'm trying to reach a host in Germany (AS34432) from Sweden but get rerouted Stockholm-Hamburg-Amsterdam-London-Paris-London-Atlanta-São Paulo after which the packets disappear down a black hole. All routing problems occur within Cogentco.
What seems to have happened is that CenturyLink's internal routing has collapsed in some way. But they're still announcing all routes, and they don't stop announcing routes when other ISPs tag their routes not to be exported by CenturyLink.
So as other providers shut down their links to CenturyLink to save themselves, the outgoing packets towards CenturyLink travel to some part of the world where links are not shut down yet.
I'm having issues reaching IP addresses unrelated to Cloudflare. Based on some traceroutes, it seems AS174 (Cogent) and AS3356 (Level 3) are experiencing major outages.
Is there any one place that would be a good first place to go to check on outages like this?
It would be really cool and useful to have a "public Internet health monitoring center"... this could be a foundation that gets some financing from industry, maintains a global internet health monitoring infrastructure, and runs a central site at which all the major players announce outages. It would be pretty cheap and have a high return on investment for everybody involved.
Indeed, if we're to have a public Internet health meter, it must be distributed and hosted/served from "outside" somehow, to be resilient to all or parts of the network being down.
This is an excellent idea and simple but moderately expensive for anyone to set up.
Just have a site fetch resources from every single hosting provider everywhere. A 1x1 image would be enough, but 1K/100K/1M sized files might also be useful (they could also be crafted images)
The first step would be making the HTML page itself redundant. Strict round robin DNS might work well for that.
But yeah, moderately expensive - and... thinking about it... it'll honestly come in handy once every ten years? :/
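As a rough sketch of the probing side (the URLs are placeholders; a real deployment would fetch a tiny static object hosted at each provider, and the page serving the results would itself need the round-robin redundancy mentioned above):

    import urllib.request

    # Hypothetical probe targets, one small object per hosting provider/CDN.
    PROBES = {
        "provider-a": "https://provider-a.example/px.gif",
        "provider-b": "https://provider-b.example/px.gif",
    }

    def is_up(url: str, timeout: float = 5.0) -> bool:
        # Reachable means we got any 2xx back within the timeout.
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return 200 <= resp.status < 300
        except Exception:
            return False

    if __name__ == "__main__":
        for name, url in PROBES.items():
            print(f"{name:12s} {'up' if is_up(url) else 'DOWN'}")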
Reddit, HN, etc. are inaccessible to me over my Spectrum fiber connection, but working on AT&T 4G. It’s not DNS, so a tier 1 ISP routing issue seems to be the most likely cause.
> Fastly is observing increased errors and latency across multiple regions due to a common IP transit provider experiencing a widespread event. Fastly is actively working on re-routing traffic in affected regions.
This explains a lot. Initially thought my mobile phone Internet connectivity was flakey because I couldn't access HN here in Australia, whilst it's fine over wi-fi (wired Internet).
Because networks are connected to others via different paths, it's not unusual that one method of connectivity would work and one doesn't.
Also the Internet has lots of asymmetric traffic, just because a forward path towards a destination may look the same from different networks, it doesn't mean the reverse path will be similar.
I first thought I had broken my DNS filter again through regular maintenance updates, then I suspected my ISP/modem because it regularly goes out. I have never seen the behavior I saw this morning: some sites failing to resolve.
I thought Cloudflare was having issues again, since I use their DNS servers, so I started by changing that. Then I tried restarting everything, modem/router/computer. Wasn't until I connected to a VM that a friend hosts that I was finally able to access HN, and thus saw this thread.
Hopefully this will get fixed within a reasonable timespan.
I was so pissed at Waze earlier for giving up on me in a critical moment. Then I found out I'm also unable to send iMessages, but I was curious, since I could browse the web just fine.
When something doesn't work I always assume it's a problem with my device/configuration/connection.
Who would have thought it's a global event such as the repeated Facebook SDK issues.
Yep, I had a similar experience. Sites that didn't work from my home connection worked fine on mobile. After rebooting and it persisted, I assumed it was just a DNS or routing issue since they were both connecting to different networks.
I would love to hear the inside scoop from folks working at CenturyLink. I've used their DSL for years and the network is a mess. I don't know if it's them here or legacy Level3, but I have a guess.
Edit: Looks like i would have guessed wrong :P. Still want that inside scoop!
Used Level3 IP for a long time professionally with limited issues; certainly not on the list of worst ISPs.
Also used a company that over the years has gone from Genesis, GlobalCrossing, Vyvx, Level3 and now of course Level 3 is CenturyLink, which has been fine.
We had this once with one of our former ISPs configuring static routes towards us and announcing them to a couple of IXPs.
I have no idea why they did it, but it caused a major downtime once for us and basically signed the termination.
Misread the headline as "Level 3 Global Outrage" and thought "someone had defined outrage levels?" and "it doesn't matter, he'll just attribute it to the Deep State".
In some ways I'm a little bit disappointed it's only a glitch in the internet.
Here is a fantastic, though somewhat outdated overview [1]. Section 5 is most relevant to your question. The network topology today is a little different. Think of Level3 as an NSP, which is now called a "Tier 1 network" [2]. The diagram should show links among the Tier 1 networks ("peering"), but does not.
tl;dr One of the large Internet backbone providers (formerly known as Level3, but now known as CenturyLink usually) that many ISPs use is down. Expect issues connecting to portions of the Internet.
Usually the Internet is a bit more resilient to these kinds of things, but there are complicating factors with this outage making it worse.
Expect it to mostly be resolved today. These things have happened a bit more frequently, but generally average up to a couple times a year historically.
I could not get on HN as a logged in person (logged out was OK) during this. I wondered how big the cloudflare thread would be if people could get on to comment on it :-)
CNN is absolutely right. Every day I read news that something goes down at CloudFlare. CloudFlare do much more harm than they "fix" with their services.
HN working for me from the UK on BT, but traceroute showing lots of different bouncing around and a lot of different hops in the US
7 166-49-209-132.gia.bt.net (166.49.209.132) 9.877 ms 8.929 ms
166-49-209-131.gia.bt.net (166.49.209.131) 8.975 ms
8 166-49-209-131.gia.bt.net (166.49.209.131) 8.645 ms 10.323 ms 10.434 ms
9 be12497.ccr41.par01.atlas.cogentco.com (154.54.56.130) 95.018 ms
be3487.ccr41.lon13.atlas.cogentco.com (154.54.60.5) 7.627 ms
be12497.ccr41.par01.atlas.cogentco.com (154.54.56.130) 102.570 ms
10 be3627.ccr41.jfk02.atlas.cogentco.com (66.28.4.197) 89.867 ms
be12497.ccr41.par01.atlas.cogentco.com (154.54.56.130) 101.469 ms 101.655 ms
11 be2806.ccr41.dca01.atlas.cogentco.com (154.54.40.106) 103.990 ms 93.885 ms
be3627.ccr41.jfk02.atlas.cogentco.com (66.28.4.197) 97.525 ms
12 be2112.ccr41.atl01.atlas.cogentco.com (154.54.7.158) 106.027 ms
be2806.ccr41.dca01.atlas.cogentco.com (154.54.40.106) 98.149 ms 97.866 ms
13 be2687.ccr41.iah01.atlas.cogentco.com (154.54.28.70) 120.558 ms 122.330 ms 120.071 ms
14 be2687.ccr41.iah01.atlas.cogentco.com (154.54.28.70) 123.662 ms
be2927.ccr21.elp01.atlas.cogentco.com (154.54.29.222) 128.351 ms
be2687.ccr41.iah01.atlas.cogentco.com (154.54.28.70) 120.746 ms
15 be2929.ccr31.phx01.atlas.cogentco.com (154.54.42.65) 145.939 ms 137.652 ms
be2927.ccr21.elp01.atlas.cogentco.com (154.54.29.222) 128.043 ms
16 be2930.ccr32.phx01.atlas.cogentco.com (154.54.42.77) 150.015 ms
be2940.rcr51.san01.atlas.cogentco.com (154.54.6.121) 152.793 ms 152.720 ms
17 be2941.rcr52.san01.atlas.cogentco.com (154.54.41.33) 152.881 ms
te0-0-2-0.rcr11.san03.atlas.cogentco.com (154.54.82.66) 153.452 ms
be2941.rcr52.san01.atlas.cogentco.com (154.54.41.33) 152.054 ms
18 te0-0-2-0.rcr12.san03.atlas.cogentco.com (154.54.82.70) 162.835 ms
te0-0-2-0.nr11.b006590-1.san03.atlas.cogentco.com (154.24.18.190) 146.643 ms
te0-0-2-0.rcr12.san03.atlas.cogentco.com (154.54.82.70) 153.714 ms
19 te0-0-2-0.nr11.b006590-1.san03.atlas.cogentco.com (154.24.18.190) 151.212 ms 145.735 ms
38.96.10.250 (38.96.10.250) 147.092 ms
20 38.96.10.250 (38.96.10.250) 149.413 ms * *
I don't normally see multi paths for a given IP, but that aside it's bouncing through far more than I'd expect. That said, it's rare I look at traceroutes across the continental U.S, maybe that many layer 3 hops are normal, maybe routes change constantly.
HN has dropped off completely from work - I see the route advertised from Level 3 (3356 21581 21581) and from Telia and onto Cogent (1299 174 21581 21581). Telia is longer, so traffic goes into to Level3 at Docklands via our 20G peer to London15, but seems to get no further.
Heading to Tata in India, the route out is via the same peer to Level3, then on to London, Marseille, and then peers with Tata in Marseille; working fine.
My gut feeling is a core problem in Level3's continental US network rather than something more global.
https://downdetector.com/ client perspective is best perspective ;) Problem in this outage is that site X works ok but transit provider for clients in US works badly and generates "false positives"
For a situation like this, the various tools hosted by RIPE are likely your best bet. You won't get a pretty green/red picture, but you'll get a more than enough data to work with.
It’s a mailing list for network operations/engineering folks. The emails are the status updates. You’ll have to look to each network’s own site if you want connectivity, peering, and IXP red/green/ up/down status.
Currently working on a project[1] to monitor all the 3rd party stack you use for your services. Hit me up if you want access; I'll give free access for a year+ to some folks to get feedback.
Since hacker news was down yesterday I couldn't reply here, so I tried to send you an email, but that failed to deliver, as there are no MX records for monitory.io...
This had me really confused until I saw it was a global outage. I have been getting delayed iOS push notifications (from prowl) now for the last few hours, from a device I was fairly sure I had disconnected 3 hours ago (a pump)
Got questioning if I really disconnected it before I left.
I'm wondering if we're at the point where internet outages should have some kind of (emergency) notification/sms sent to _everyone_.
1. Is there syntax correctness checking available, so you don't push a config that breaks machines? Yes.
2. Is there a DWIM check available, so you can see the effect of the change before committing? No. That would require a complete model of, at a minimum, your entire network plus all directly connected networks -- that still wouldn't be complete, but it could catch some errors.
There is a major internet outage going on. I am using Scaleway they are also affected. According to Twitter, Vodafone, CityLink and many more are also affected.
Fyi I'm not having any problems right now with hetzner.com nor hetzner.de - my own dedicated server hosted at Hetzner datacenter in Germany seems to be reachable/working as well.
I see a lot of ads for NordVPN, but you should know they're not necessarily reliable. Just look for NordVPN on hacker news search: https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que... (see e.g. the second hit: https://news.ycombinator.com/item?id=21664692 covering up security issues, using your connection to proxy other people's traffic, a related company does data mining...). The only VPN that seemed to fit the bill when I looked for one about a year ago was ProtonVPN, but I certainly didn't manage to look at every VPN on the planet and I'm just a random internet stranger so... take that with a grain of salt.
Yeah, I agree that their marketing is aggressive, but a lot else I think is just speculation. The VPN market is very cut-throat; competitors create all kinds of crackpot conspiracy theories to sow doubt amongst potential customers. As far as I know there was one incident, and it wasn't very serious; no customer data was stolen. Also, the company now does 3rd-party audits. I think NordVPN is pretty decent, but that's just me.
Works OK in Paris via SFR (home fiber) and Sipartech (Business fiber).
Doesn't work via Bouygues 4G.
My SFR fiber doesn't seem affected all that much. I've been following this for a while on the other HN post [0] and all services people have noted seem to work here.
Both SFR and Sipartech seem to have direct peerings with Cogentco.
A service I run on Digital Ocean was affected by this early this morning. Looks like it was mitigated by DO - so I'm very grateful for that. Although, the service I run is time sensitive so failures like this are pretty unfortunate for me. Where would I get started with building in redundancy against these sort of outages?
My opinion is that this, like many issues we're seeing today, is largely an issue with ongoing consolidation trends. The less diversity of systems/solutions we have for a given problem or set of problems, the less chance we're protected from unknown unknowns that creep up. The more diversity you have in systems, the more likely you have some option that is hardened against unknown unknowns when they arrive, and the quicker we can work around them.
Modern society is all about consolidating systems into a few efficient solutions typically dictated by market forces which I argue, don't concern themselves much with these sorts of problems. As a result, when we run into problems, we're left with fewer options to resort to and instead have to identify problems and develop new solutions on-the-fly. Consolidation leads to complacency and stagnation.
Sometimes this is reasonable (and even desirable) for certain non-critical systems; it just doesn't make financial sense to pour resources into system diversity for systems we could do without--find the one that works best/most efficiently and use it. If it breaks, it's not critical and the workaround can wait.
On the other hand, if a system is critical, then I think it behooves us to continue looking at improvements of existing systems and alloting resources to investigating new approaches.
BGP has always had this issue. It depends on trustworthy information being available. Any trusted source who starts lying (or just screws up) is going to cause routing problems.
Note, trustworthiness jumps from being a technical problem to being a human/people problem. Layer 8, as someone mentioned, or GIGO (Garbage-In-Garbage-Out) as others may know it.
To safely use a system, your operator needs to be 10% smarter than the system being operated. It is clear that we have problems in that department with certain AS's. This is, what, about the third major outage attributed to CenturyLink in the last handful of years? I have no idea what exactly their process must look like, but good heavens, a better look needs to be taken, as this is becoming a bit regular for my tastes.
No, because the issue you're commenting on doesn't suggest that. It looks like the nature of this particular outage is such that a previous iteration of the Internet wouldn't have been any better equipped to solve this faster.
Why do a few companies control the backbone of the internet? Shouldn’t there be a fallback or disaster recovery plan if one or more of these companies become unavailable?
The problem is the provider having problems is still sending misconfigured routes after the other providers have tried to pull them in response to the outage. So it’s as if CenturyLink was doing a massive BGP attack against their peers, pointing at a black hole.
Things mostly routed around the problem. Issues arose because a) some people are single-homed to Level3/CenturyLink b) apparently Level3/CenturyLink continued announcing unreachable prefixes, which breaks the Internet BGP trust model.
Chess.com was down due to the outage and some of the Indian players got disconnected and lost on time, so FIDE declared India-Russia joint winner of the Online Chess Olympiad 2020.
I've lost too much precious time when github/npm/cloudflare were going down before figuring out it was them.
So I'm currently working on a project[1] to monitor all the 3rd party stack you use for your services. Hit me up if you want access; I'll give free access for a year+ to some folks to get feedback.
There is at least one big tool that does exactly the same you wrote. It is called StatusGator https://statusgator.com
There are at least 3 much smaller ones.
Have you tried any of them?
If yes, what's your point of difference?
And how do you plan to market it? As I see the plans are cheap, means your LTV is low.
Cloudflare status page:
Update - Major transit providers are taking action to work around the network that is experiencing issues and affecting global traffic.
We are applying corrective action in our data centers as the situation changes in order to improve reachability
Aug 30, 14:26 UTC
I just experienced HN down for several minutes before it loaded and I saw this story at the top.
I'm doing something with the HN API as I type this, so for a moment I was trying to decide if I'd been IP blocked, even though the API is hosted by Firebase.
I haven't noticed any obvious issues elsewhere yet.
(Just got a delay while trying to submit this comment.)
Can anyone help me understand why I can't access HN from my iPhone, but I can from my computer? both are on the same network. I'm getting "Safari cannot open the page because the server cannot be found", and many apps won't work at all either.
It wasn't a total outage for the site I was trying to reach. It took about 20 minutes to make an order, but after multiple retries (errors were reported as a 522 with the problem being somewhere between Manchester, UK and the host), it did go through.
> Not a network engineer, but based on the comments there it looks like it's a BGP blackhole incident, possibly reminiscent of the https://en.wikipedia.org/wiki/AS_7007_incident in 1997.
As you aren’t a network engineer, I can understand making that leap based on the context, but no, this is nothing like the AS7007 event.
The “black hole” in this case is due to networks pulling their routes via AS3356 to try and avoid their outage, but when they do, CenturyLink is still announcing those routes and as such those networks blackhole.
What I take from this is that you’re offering input to a thread which you don’t have experience in or even actually understand, thus are spreading misinformation. You then are continually doubling down further showing your maturity.
Heh, I knew I was setting myself up for that from networking people - i know the attitude. I was of course merely repeating the sentiment in that thread. What more disclaimers do you need to avoid displaying your superiority in networking? Sheesh.
You were repeating a suspicion as if it was yours, as if it was a shared view in that thread; it wasn't. I'm not a network engineer, but I read the thread too. Nobody needs people spreading misinformation in a crisis just to sound smart; it's not useful and usually harmful.
> You were repeating a suspicion as if it was yours
That is a misrepresentation.
I wrote:
"Not a network engineer, but based on the comments there". I prefaced the insecure speculation part with "possibly". It was obvious to any reader this was a summary of the emails there.
This knocked out the Starbucks app and some of their systems this morning. A bunch of people in line couldn't log in and they were saying parts of their whole internal system were down, too.
I'm confused about why Cloudflare had problems but other CDN providers/sites with private CDNs like Google did not. Is there something different about how Cloudflare operates?
I experienced this issue while reading docs at "Read the Docs" (and ironically had connection issues while trying to read this very exact page right here, too.)
They probably peer with Google at the local IX/data centre. Google traffic will therefore take a different path, which isn't suffering the current outage.
I was doing a big release over the evening. Everything was working fine up until about 6 hours ago, when I signed off. Our network monitors show an outage started about half an hour later (at about 4:05am CST). Service was restored a few minutes ago, at about 9:44am CST. I don't know if our problem is the same as this problem, but we are on CenturyLink.
What was the intent of posting this? This is an article on a global network outage - some folks want the technical nitty-gritty and others don't. You seem adversarial or pretentious when you unexpectedly post things like this even if your intentions are well-meaning.
I agree it was needlessly adversarial (sorry about that!) - but it had the desired effect: an excellent explanation of the concept with lots of relevant background information (thanks, kitteh, upvoted). I think this helps the discussion a lot, since a lot more people are now able to join. Less gatekeeping.
BGP is a routing protocol mostly used for propagating routing/reachability information; its updates can also carry additional data (communities as tags, etc).
A few years ago folks wanted to bake in additional functionality. For example, packet filters (aka ACLs) are normally deployed to router configuration files using each operator's own tooling. Deploying them against hundreds or thousands of routers rapidly was a challenge for some (not good at swdev, etc.). So the idea was: we already have a protocol that rapidly propagates state to every router in the network, let's find a way to bake ACLs into the BGP updates.
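As a rough, hedged illustration (plain Python modeling the concept only - this is not the real RFC 5575 wire encoding or any router vendor's API), a flowspec announcement is essentially a match-plus-action tuple that every BGP speaker installs as a packet filter the moment the update arrives:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class FlowspecRule:
        """One flowspec 'route': match criteria plus a traffic action."""
        dst_prefix: str                 # e.g. "203.0.113.7/32", the target being protected
        protocol: Optional[int] = None  # e.g. 17 for UDP
        dst_port: Optional[int] = None  # e.g. 53
        rate_limit_bps: int = 0         # 0 means drop all matching traffic

    def announce(rule: FlowspecRule) -> None:
        # In reality this would go out as a BGP UPDATE; every router that
        # accepts it programs the filter within seconds, with no per-router
        # review or staged rollout.
        print(f"installing filter network-wide: {rule}")

    # Drop all UDP/53 traffic toward one host, everywhere, instantly.
    announce(FlowspecRule("203.0.113.7/32", protocol=17, dst_port=53))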
The result wasn't that good, for a few reasons:
1) BGP state isn't sticky. If a router goes offline or BGP sessions reset, the ACLs go away. That means if you are using flowspec for a critical need, like always-on packet filters, you've got the wrong tool.
2) The implementation had various bugs.
3) Most importantly, it gave people a really easy way to hurt themselves globally. There was no phased deployment with pre- and post-checks. What you deployed led to packet filters being installed across the network in seconds. In most cases (depending on your config), the only way to remove one is to withdraw the specific flowspec route or have the BGP session carrying it reset.
I've seen bad flowspec routes core-dump the daemon responsible for programming ACLs on a router, leaving it unable to withdraw the programmed entry. I've seen bugs where TCP/UDP port matches went wrong and ate a lot more traffic than intended. I've seen so many flowspec rules installed on a network that it exhausted the routers' ability to inspect and process packets, and you'd see packet drops flat-lining.
In my opinion, it's a hack around not having a good ACL deployment tool that has led to many outages in its wake.
Edit: another flowspec gotcha. Some folks like to integrate their DDoS tooling with flowspec. For example: I run a network, some IP address behind me gets lit up, so I auto-deploy a rule for that specific IP and rate-limit traffic to it. Unfortunately, folks sometimes don't put a lot of care into making sure this can't touch internal IPs that should be off limits - route reflectors, router loopback IPs, etc. I've seen networks have a bad day because a DDoS (or traffic misclassified as a DDoS) caused rules to be auto-installed to protect something, but those rules actually impaired legitimate communication with network infrastructure, which then caused the outage.
Also, flowspec doesn't work like regular ACLs where you have input and output on a per interface basis - it applies to all traffic traversing a router, which makes it difficult to say which interfaces should be exempt (think internal vs external).
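For what it's worth, the guard argued for above can be tiny; here's a minimal sketch using Python's ipaddress module (the prefixes and function name are made up for illustration, not taken from any real tooling):

    import ipaddress

    # Hypothetical infrastructure space that auto-mitigation must never touch:
    # route reflectors, router loopbacks, management networks, etc.
    PROTECTED_PREFIXES = [
        ipaddress.ip_network("192.0.2.0/24"),     # example: router loopbacks
        ipaddress.ip_network("198.51.100.0/24"),  # example: route reflectors / mgmt
    ]

    def safe_to_mitigate(target_ip: str) -> bool:
        """Refuse to auto-install a flowspec rule against protected infrastructure."""
        addr = ipaddress.ip_address(target_ip)
        return not any(addr in net for net in PROTECTED_PREFIXES)

    # A detector flags 198.51.100.10 as "under attack" (maybe it is just a spike
    # toward a route reflector); without this check, the auto-installed rule
    # would cut the network off from its own control plane.
    print(safe_to_mitigate("198.51.100.10"))  # False -> do not install the rule
    print(safe_to_mitigate("203.0.113.7"))    # True  -> ok to rate-limit or drop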
This doesn't have anything to do with IPv4 vs IPv6. It is a routing issue with BGP. To give an analogy:
if every website were a house,
and every house had a house number (an IP address, either IPv4 or IPv6),
and groups of houses formed cities and towns identified by a number (an AS, or Autonomous System, number),
and the highways between cities were the BGP routes,
and half of the world's internet traffic went through the city of CenturyLink (AS3356),
and the city of CenturyLink shut down its traffic, either on purpose or by accident,
...then it wouldn't matter whether your house number / IP address is a 32-bit number or a 128-bit number, because traffic needs to take a different route.
This is why everyone is worried about BGP routes, not IP addresses.
Sadly, in my experience, IPv4 is still generally more reliable than IPv6.
Set up two hosts, host A and host B, in two different data centers. Make them send HTTP requests to each other over IPv4 and over IPv6. You'll see that latency spikes and packet loss are more frequent over IPv6.
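A crude version of that comparison can be done with nothing but the Python standard library. This sketch only times TCP connects to a placeholder hostname (swap in your own "host B"), so treat it as a starting point rather than a proper measurement:

    import socket
    import statistics
    import time

    def connect_ms(host: str, port: int, family: int) -> float:
        """Time one TCP connect to host:port over the given address family."""
        addr = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)[0][4]
        start = time.monotonic()
        with socket.socket(family, socket.SOCK_STREAM) as s:
            s.settimeout(5)
            s.connect(addr)
        return (time.monotonic() - start) * 1000

    HOST = "example.com"  # placeholder: point this at host B in the other DC
    for label, family in (("IPv4", socket.AF_INET), ("IPv6", socket.AF_INET6)):
        samples = [connect_ms(HOST, 443, family) for _ in range(20)]
        print(f"{label}: median {statistics.median(samples):.1f} ms, "
              f"max {max(samples):.1f} ms")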
We’ve observed this in end-user devices, especially on some ISPs.
It makes sense if the overall adoption and resource allocation are comparatively smaller, making individual or small-group coincident spikes more impactful against the amortized whole.
It’s a lot like a market with low volume/liquidity. Someone wanders in with a big transaction and blows everything up.
It would appear, from the limited info so far, to be an issue in the v4 routing configuration - I haven't seen anything that says it couldn't have been the other way around.
How the xxxx did it take CenturyLink/Level3 like 3-4 hours to fix this problem?
Again (https://news.ycombinator.com/item?id=24322988) not a network engineer, but it seemed like their routers actively stopped other networks from working around the problem since L3 would still keep pushing other networks' old routes, even after those networks tried to stop that.
Also: BGP probably needs to be redesigned from the ground up by software engineers with experience designing systems that keep working in the presence of hostile actors.
> Also: BGP probably needs to be redesigned from the ground up by software engineers with experience designing systems that keep working in the presence of hostile actors.
This has been attempted a number of times, but this is a political problem, not a technical problem: there's no single agreed source of truth for routing policy.
A lot of US Internet providers won't even sign up for ARIN IRR, or even move their legacy space to a RIR - so there isn't even any technical way of figuring out address space ownership and cryptographic trust (ie. via RPKI). Hell, some non-RIR IRRs (like irr.net) are pretty much the fanfiction.net equivalent of IRRs, with anyone being able to write any record about ownership, without any practical verification (just have to pay a fee for write access). And for some address space, these IRRs are the only information about ownership and policy that exists.
Without even knowing for sure who a given block belongs to, or who's allowed to announce it, or where, how do you want to fix any issues with a new dynamic routing protocol?
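To see how thin that source of truth is, you can query an IRR yourself over the plain-text WHOIS protocol (port 43); whois.radb.net is one widely mirrored IRR. A small sketch - and, as above, the existence of a route object proves very little about who actually controls the space:

    import socket

    def irr_whois(query: str, server: str = "whois.radb.net") -> str:
        """Send one query over the plain-text WHOIS protocol and return the reply."""
        with socket.create_connection((server, 43), timeout=10) as s:
            s.sendall((query + "\r\n").encode())
            chunks = []
            while True:
                data = s.recv(4096)
                if not data:
                    break
                chunks.append(data)
        return b"".join(chunks).decode(errors="replace")

    # Prints whatever route/route6 objects the IRR holds for the prefix; stale or
    # conflicting objects are common, which is exactly the problem described above.
    print(irr_whois("8.8.8.0/24"))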
Build an industry coalition. Put pressure on those who don't join. Randomly throw away 1 out of 10000 packets from the providers that fail to get with the times. Increase that frequency according to some published time function.
Having a single, cryptographically assured source of truth for routing data is a turnkey censorship nightmare waiting to happen.
All it takes is a national military to care enough to put pressure on the database operator, legal or otherwise, and suddenly your legitimate routes are no longer accepted.
If you think this wouldn't be used to shut down things like future Snowden-style leaks or Wikileaks or The Shadow Brokers, you may not have been paying attention to the news.
> Build an industry coalition. Put pressure on those who don't join. Randomly throw away 1 out of 10000 packets from the providers that fail to get with the times. Increase that frequency according to some published time function.
What sort of incentive would anyone have to join such a coalition? Why would anyone work with providers from such a coalition, when they can work with an alternative ISP outside it and not have to deal with packet drops?
I think you're underestimating how many people have been attempting to solve this. The Internet community has some quite clever people in it, but it's also very, very large, and sweeping changes are difficult to pull off (see: IPv6 adoption).
And who should spearhead this coalition?
Let's not forget that this is mainly a political problem and not a technical one. Would countries be willing to join a coalition with heavy influence from China, for example (or vice versa with the US)?
Based on what I've seen: They essentially "shut down the Internet" for probably a quarter of the global population for about 3-4 hours.
That response time is atrocious. It wasn't that they needed to fix broken hardware, rather they needed to stop running hardware from actively sabotaging the global routing via the inherently insecure BGP protocol. That took 3-4 hours to happen.
That is a fantastic euphemism. Personally I'm disappointed Telia didn't de-peer two hours earlier, after diagnosing the issue for 30 minutes, since that whole lack of functioning routing to very large parts of the internet forced me to use a VPN in North America to access many web services, including HN.
I realize I'm going to get insanely downvoted by the elite internetworking crowd again but I think this needs to be said.
From an outsider's POV: there seems to be a very strange and almost incestuous relationship between the networking companies. Or maybe it's just their hangers-on? I dunno.
Source https://puck.nether.net/pipermail/outages/2020-August/013229...