Summary: On August 30, 2020 10:04 GMT, CenturyLink identified an issue
to be affecting users across multiple markets. The IP Network Operations
Center (NOC) was engaged, and initial research identified that an
offending flowspec announcement prevented Border Gateway Protocol (BGP)
from establishing across multiple elements throughout the CenturyLink
Network. The IP NOC deployed a global configuration change to block the
offending flowspec announcement, which allowed BGP to begin to correctly
establish. As the change propagated through the network, the IP NOC
observed all associated service affecting alarms clearing and services
returning to a stable state.
It's a super useful tool if you want to blast out an ACL across your network in seconds (using BGP), but it has a number of sharp edges. Several networks, including Cloudflare, have learned what it can do. I've seen a few networks basically blackhole traffic or even lock themselves out of routers due to a poorly made Flowspec rule or a bug in the implementation.
Is "doing what you ask" considered a sharp edge? Network-related tools don't really have safeties, ever (your linux host will happily "ip rule add 0 blackhole" without confirmation). Every case of flowspec shenanigans in the news has been operator error.
Massive reconvergence event in their network, causing edge router BGP sessions to bounce (due to CPU). Right now all their big peers are shutting down sessions with them to give Level3's network the ability to reconverge. Prefixes announced to 3356 are frozen on their route reflectors and not getting withdrawn.
Edit: if you are a Level3 customer shut your sessions down to them.
There was a huge AT&T outage in 1990 that cut off most US long distance telephony (which was, at the time, mostly "everything not within the same area code").
It was a bug. It wasn't a reconvergence event, but it was a distant cousin: Something would cause a crash; exchanges would offload that something to other exchanges, causing them to crash -- but with enough time for the original exchange to come back up, receive the crashy event back, and crash again.
The whole network was full of nodes crashing, causing their peers to crash, ad infinitum. In order to bring the network back up, they would have needed to take everything down at the same time (and make sure all the queues were emptied), but even that wouldn't have made it stable, because a similar "patient 0" event would have brought the whole network down again.
Once the problem was understood, they reverted to an earlier version which didn't have the bug, and the network re-stabilized.
The lore I grew up on is that this specific event was very significant in pushing and funding research into robust distributed systems, of which the best known result is Erlang and its ecosystem - originally built, and still mostly used, to make sure that phone exchanges don't break.
Contrary to what that link says, the software was not thoroughly tested. Normal testing was bypassed - per management request after a small code change.
This was covered in a book (perhaps Safeware, but maybe another one I don't recall) along with the Therac-25, the Ariane V, and several others. Unfortunately these lessons need to be relearned by each generation. See the 737-Max...
That's why the most reliable way to instil this lesson is to instil it into our tools. Automate as much testing as possible, so that bypassing the tests becomes more work than running them.
I disagree; it's in part a people problem - more draconian test suites just make developers more inclined to cheat, and they tend to write tests which are not valid or that just get the tool passing...
It's more important to visually model and test than to enforce some arbitrary set of rules that don't apply universally - then you have at least the visual impetus of 'this is wrong' or 'I need to test this right'.
A lot of time is spent visually testing UIs and yet these same people struggle with testing the code that matters...
Probably not the book you are thinking of, since it’s just about the AT&T incident, but “The Day the Phones Stopped Ringing” by Leonard Lee is a detailed description of the event.
It’s been many years since I read it, but I recall it being a very interesting read.
For some reason in my university almost every CS class would start with an anecdote about the Therac 25, Ariane V, and/or a couple others as a motivation on why we the class existed. It was sort of a meme.
The lessons are definitely still taught, I don't know if they're actually learned of course.. And who knows who actually taught the 737-Max software devs, I don't suppose they're fresh out of uni.
Unfortunately most people become a manager by being a stellar individual contributor. People management and engineering are very different skills; I'm always impressed when I see someone make that jump smoothly.
I always wanted companies to hire people managers as its own career path. An engineer can be an excellent technical lead or architect, but it can feel like you started over once you're responsible for the employees, their growth, and their career path.
Yeah, it just sucks that you eventually have someone making significant people management decisions without the technical knowledge of what the consequences could end up being. This would be even worse if you had people manager hiring be completely decoupled. The US military works this way and I have to say it's not the best mode.
Typically yes actually, the director of engineering should always be an engineer. Of course, these are hardware companies so it would probably be some kind of hardware engineer.
As a former AT&T contractor, albeit from years later, this checks out. Sat in a "red jeopardy" meeting once because a certain higher-up couldn't access the AT&T branded security system at one of his many houses.
The build that broke it was rushed out and never fully tested, adding a fairly useless feature for said higher-up that improved the UX for users with multiple houses on their account.
This reminds me of an incident on the early internet (perhaps ARPANET at that point) where a routing table got corrupted so it had a negative-length route which routers then propagated to each other, even after the original corrupt router was rebooted. As with AT&T, they had to reboot all the routers at once to get rid of the corruption.
I can't remember where i read about this, but i recall the problem was called "The Creeping Crud from California". Sadly, this phrase apparently does not appear anywhere on the internet. Did i imagine this?
Interesting, thanks! That is different to the story i remember, but it's possible that i remember incorrectly, or read an incorrect explanation.
I believe that i read about this episode in Hans Moravec's book 'Mind Children'. I can see in Google Books that chapter 5 is on 'Wildlife', and there is a section 'Spontaneous Generation', which promises to talk about a "software parasite" which emerged naturally in the ARPAnet - but of which the bulk is not available:
I have spent hours and hours banging my head against Erlang distributed system bugs in production. I am absolutely mystified why anyone thought just using a particular programming language would prevent these scenarios. If it's Turing-complete, expect the unexpected.
The idea isn't that Erlang is infallible in the design of distributed systems.
The idea is it takes away enough foot-guns that if you're banging your head against systems written in it, you'd be banging your head even harder and more often if the same implementor had used another language.
Are you referring to CenturyLink’s 37-hour, nationwide outage?
> In this instance, the malformed packets [Ethernet frames?] included fragments of valid network management packets that are typically generated. Each malformed packet shared four attributes that contributed to the outage: 1) a broadcast destination address, meaning that the packet was directed to be sent to all connected devices; 2) a valid header and valid checksum; 3) no expiration time, meaning that the packet would not be dropped for being created too long ago; and 4) a size larger than 64 bytes.
I think we used to call that a poison pill message (still bring it up routinely when we talk about load balancing and why infinite retries are a very, very bad idea).
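To make the point concrete, here's a minimal sketch (plain Python, hypothetical queue and handler names) of the usual mitigation: cap the retries per message and park repeat offenders in a dead-letter queue instead of retrying forever.

    import queue

    MAX_ATTEMPTS = 3  # hypothetical retry budget

    def consume(work_q: queue.Queue, dead_letter_q: queue.Queue, handler) -> None:
        # Drain work_q, retrying each message a bounded number of times.
        # A message that keeps failing (a "poison pill") is parked in
        # dead_letter_q so it can't starve the rest of the queue.
        while not work_q.empty():
            msg, attempts = work_q.get()
            try:
                handler(msg)
            except Exception:
                if attempts + 1 >= MAX_ATTEMPTS:
                    dead_letter_q.put(msg)           # park it for human inspection
                else:
                    work_q.put((msg, attempts + 1))  # bounded retry

    # Usage sketch: work_q.put((message, 0)) for each message, then
    # consume(work_q, dead_q, handler); whatever lands in dead_q gets triaged by a human.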
But your queue will grow and grow and the fraction of time you spend servicing old messages grows and grows.
Not a terribly big fan of these queueing systems. People always seem to bung things up in ways they are not quite equipped to fix (in the “you are not smart enough to debug the code you wrote” sense).
Last time I had to help someone with such a situation, we discovered that the duplicate processing problem had existed for >3 months prior to the crisis event, and had been consuming 10% of the system capacity, which was just low enough that nobody noticed.
The thing with feature group D trunks to the long distance network is you could (and still can on non-IP/mobile networks) manually route to another long distance carrier like Verizon, and sidestep the outage from the subscriber end, full stop. That's certainly not possible with any of the contemporary internet outages.
You can inject changes in routing, but if the other carrier doesn't route around the affected network, you're back to square one. That's part of why Level3/CenturyLink was depeered and why several prefixes that are normally announced through it were quickly rerouted by owners.
That's my point; as a subscriber, you can prefix a long distance call with a routing code to avoid, for example, a shut down long distance network without any administrator changes. Routing to the long distance networks is done independently through the local network, so if AT&T's long distance network was having issues, it'd have no impact on your ability to access Verizon's long distance network.
There's actually no technical reason why you couldn't do that with IP (4 or 6), although you'd need an appropriately located host to be running a relay daemon[0].
0: ie something that takes, say, a UDP packet on port NNNN containing a whole raw IPv4 packet, throws away the wrapping, and drops the IPv4 packet onto its own network interface. This is safe - the packet must shrink by a dozen or two bytes with each retransmission - but usually not actually set up anywhere.
Edit: It probably wouldn't work for TCP though - maybe try TOR?
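For what it's worth, the relay daemon in [0] is only a few lines. Here's a rough sketch in Python (LISTEN_PORT stands in for the "port NNNN" above, the raw socket needs root, and it skips the shrink/sanity checks a real one would want):

    import socket

    LISTEN_PORT = 4444  # stand-in for the "port NNNN" above

    def relay() -> None:
        # Receive UDP datagrams whose payload is a complete raw IPv4 packet
        # and re-emit that packet onto the local network.
        udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        udp.bind(("0.0.0.0", LISTEN_PORT))

        # With IPPROTO_RAW the payload we send must already carry its own
        # IPv4 header - which is exactly what arrived inside the UDP wrapper.
        raw = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_RAW)

        while True:
            packet, _src = udp.recvfrom(65535)
            if len(packet) < 20:                      # too short to hold an IPv4 header
                continue
            dst = socket.inet_ntoa(packet[16:20])     # destination from the inner header
            raw.sendto(packet, (dst, 0))              # kernel routes it like any other packet

    if __name__ == "__main__":
        relay()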
There are plenty of ways to do what you're describing, and they all work with TCP. Some of them only work if the encapsulated traffic is IPv6 (and are designed to give IPv6 access on ISPs that only support IPv4). Some of them may end up buffering the TCP stream and potentially generating packet boundaries at different locations than in the original TCP stream.
This sounds like the event that is described in the book Masters of Deception: The gang that ruled cyberspace. The way I remember it the book attributes the incident to MoD, while of course still being the result of a bug/faulty design.
BGP is a path-vector routing protocol, every router on the internet is constantly updating its routing tables based on paths provided by its peers to get the shortest distance to an advertised prefix. When a new route is announced it takes time to propagate through the network and for all routers in the chain to “converge” into a single coherent view.
If this is indeed a reconvergence event, that would imply there’s been a cascade of route table updates that have been making their way through CTL/L3’s network - meaning many routers are missing the “correct” paths to prefixes and traffic is not going where it is supposed to, either getting stuck in a routing loop or just going to /dev/null because the next hop isn’t available.
This wouldn’t be such a huge issue if downstream systems could shut down their BGP sessions with CTL and have traffic come in via other routes, but doing so is not resulting in the announcements being pulled from the Level 3 AS - something usually reflective of the CPU on the routers being overloaded processing route table updates or an issue with the BGP communication between them.
BGP operates as a rumor mill. Convergence is the process of all of the rumors settling into a steady state. The rumors are of the form "I can reach this range of IP addresses by going through this path of networks." Networks will refuse to listen to rumors that have themselves in the path, as that would cause traffic to loop.
For each IP range described in the rumor table, each network is free to choose whichever rumor they like best among all they have heard, and send traffic for that range along the described path. Typically this is the shortest, but it doesn't have to be.
ISPs will pass on their favorite rumor for each range, adding themselves to the path of networks. (They must also withdraw the rumors if they become disconnected from their upstream source, or their upstream withdraws it.) Businesses like hosting providers won't pass on any rumors other than those they started, as no one involved wants them to be a path between the ISPs. (Most ISPs will generally restrict the kinds of rumors their non-ISP peers can spread, usually in terms of what IP ranges the peer owns.)
Convergence in BGP is easy in the "good news" direction, and a clusterfuck in the "bad news" direction. When a new range is advertised, or the path is getting shorter, it is smooth sailing, as each network more or less just takes the new route as is and passes it on without hesitation. In the bad news direction, where either something is getting retracted entirely, or the path is going to get much longer, we get something called "path hunting."
As an example of path hunting: Let's say the old paths for a rumor were A-B-C and A-B-D, but C is also connected to D. (C and D spread rumors to each other, but the extended paths A-B-C-D and A-B-D-C are longer, thus not used yet.) A-B gets cut. B tells both C and D that it is withdrawing the rumor. Simultaneously D looks at the rumor A-B-C-D and C looks at the rumor A-B-D-C, and each says "well I've got this slightly worse path lying around, might as well use it." Then they spread that rumor to their downstreams, not realizing that it is vulnerable to the same event that cost them the more direct route. (They have no idea why B withdrew the rumor from them.) The paths, especially when removing an IP range entirely, can get really crazy. (A lot of core internet infrastructure uses delays to prevent the same IP range from updating too often, which tamps down on the crazy path exploration and can actually speed things up in these cases.)
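If it helps to see the "rumors" settling written down, here's a toy convergence loop in Python over the A/B/C/D topology above. It only computes the steady state each AS ends up with; the transient path-hunting churn comes from real routers updating asynchronously without knowing why a rumor was withdrawn.

    from typing import Dict, List

    # Toy topology from the example above: A-B, B-C, B-D, C-D.
    LINKS = {frozenset(p) for p in [("A", "B"), ("B", "C"), ("B", "D"), ("C", "D")]}
    NODES = {n for link in LINKS for n in link}

    def neighbors(asn: str, down: set) -> List[str]:
        return [next(iter(l - {asn})) for l in LINKS - down if asn in l]

    def converge(origin: str, down: set = frozenset()) -> Dict[str, List[str]]:
        # Each AS keeps its shortest loop-free AS-path towards `origin`,
        # learned only from direct neighbors (a path-vector "rumor").
        best: Dict[str, List[str]] = {origin: [origin]}
        changed = True
        while changed:
            changed = False
            for asn in NODES:
                for nb in neighbors(asn, down):
                    if nb not in best or asn in best[nb]:   # loop prevention
                        continue
                    candidate = best[nb] + [asn]
                    if asn not in best or len(candidate) < len(best[asn]):
                        best[asn] = candidate
                        changed = True
        return best

    print(converge("A"))                                    # C and D reach A via B
    print(converge("A", down={frozenset(("A", "B"))}))      # cut A-B: only A remains reachable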
IP network routing is distributed systems within distributed systems. For whatever reason the distributed system that is the CenturyLink network isn't "converging" - or, as we could also say, becoming consistent, or settling - in a timely manner.
CenturyLink/Level3 on Twitter:
"We are able to confirm that all services impacted by today’s IP outage have been restored. We understand how important these services are to our customers, and we sincerely apologize for the impact this outage caused."
India just lost to Russia in the final of the first-ever online chess olympiad, probably due to connection issues of two of its players. I wonder if it's related to this incident and if the organizers are aware.
Edit: the organizers are aware, and Russia and India have now been declared joint winners.
I had this problem two years ago while I was taking Go lessons online from a South Korean professional Go Master. For my last job we were renting a home well outside city limits in Illinois and our Internet failed often. I lost one game in an internal teaching tournament because of a failed connection, and jumped through hoops to avoid that problem.
Wasn't able to access HN from India earlier, but other cloudflare enabled services were accessible. I assume several Network Engineers were woken up from their Sunday morning sleep to fix the issue; if any of them is reading this, I appreciate your effort.
Related: World champion Magnus Carlsen recently resigned a match after 4 moves as an act of honor because in his previous match with the same opponent, Magnus won solely due to his opponent having been disconnected.
His opponent, Ding Liren, is from China, and has been especially plagued by unreliable internet since all the high-level chess tournaments have moved online. He is currently ranked #3, behind Magnus Carlsen and Fabiano Caruana.
All professional chess games have a time limit for each player (if you've ever heard of "chess clocks" -- that's what they're used for). In "slow chess" each player has a 2-hour limit and all of the other time control schemes (such as rapid and blitz) are much shorter.
There’s an interesting protocol for splitting a Go or chess game over multiple days so that neither party has the entire time to think about their response to the last move: at the end of the day the final move is made by one player but is sealed, not to be revealed until the start of the next session.
For this to work on an internet competition, the judges would need a backup, possibly very low bandwidth communication mechanism that survives a network outage.
This wouldn’t save any real-time esports, but would be serviceable for turn based systems.
The games are timed and this pause gives a lot of thinking time. If they're allowed to talk with others during the pause, then also consulting time.
> why don't they start over
That would be unfair to the player who was ahead.
That said, both players might still be fine with a clean rematch, because being the undisputed winner feels better. I wonder if they were asked (anonymously to prevent public hate) whether they would be fine with a rematch.
Seems like one of those cases where solving a “little” issue would actually require rearchitecting the entire system.
Namely, in this case, it seems like the “right thing” is for games to not derive their ELO contributions from pure win/loss/draw scorings at all, but rather for games to be converted into ELO contributions by how far ahead one player was over the other at the point when both players stopped playing for whatever reason (where checkmate, forfeit, and game disruption are all valid reasons.) Perhaps with some Best-rank (https://www.evanmiller.org/how-not-to-sort-by-average-rating...) applied, so that games that go on longer are “more proof” of the competitive edge of the player that was ahead at the time.
Of course, in most central cases (of chess matches that run to checkmate or a “deep” forfeit), such a scoring method would be irrelevant, and would just reduce to the same data as win/loss/draw inputs to ELO would. So it’d be a bunch of effort only to solve these weird edge cases like “how does a half-game that neither player forfeited contribute to ELO.”
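For reference, the standard Elo update only consumes a single score per game, so the proposal above amounts to feeding it a fractional score instead of 1/0.5/0. A sketch in Python (the 0.7 "advantage" input is purely illustrative):

    def expected_score(r_a: float, r_b: float) -> float:
        # Standard Elo win expectancy of A against B.
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

    def elo_update(r_a: float, r_b: float, s_a: float, k: float = 20.0):
        # Classically s_a is 1 (win), 0.5 (draw) or 0 (loss); a fractional s_a
        # would encode "how far ahead A was when the game stopped".
        e_a = expected_score(r_a, r_b)
        return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

    # A disrupted game where A was judged ~70% likely to win:
    print(elo_update(2700, 2750, s_a=0.7))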
> but rather for games to be converted into ELO contributions by how far ahead one player was over the other at the point when both players stopped playing for whatever reason
Except for the obvious positions that no one serious would even play, there is no agreed-upon way of calculating who has an advantage in chess like that. One man's terrible mobility and probable blunder is another's brilliant stratagem.
Hm, you’re right; guess I was thinking in terms of how this would apply to Go, where it’d be as simple as counting territory.
Still, just to spitball: one “obvious” approach, at least in our modern world where technology is an inextricable part of the game, would be to ask a chess-computer: “given that both players play optimally from now on, what would be the likelihood of each player winning from this starting board position?” The situations where this answer is hard/impossible to calculate (i.e. estimations close to the beginning of a match) are exactly the situations where the ELO contribution should be minuscule anyway, because the match didn’t contribute much to tightening the confidence interval of the skill gap between the players.
Of course, players don’t play optimally. I suspect that, given GPT-3 and the like, we’ll soon be able to train chess-computers to mimic specific players’ play-styles and seeming limits of knowledge (insofar as those are subsets of the chess-computer’s own capabilities, that it’s constraining its play to.) At that point, we might actually be able to ask the more interesting question: “given these two player-models and this board position, in what percentage of evolutions from this position does player-model A win?”
Interestingly, you could ask that question with the board position being the initial one, and thus end up with automatically-computed betting odds based on the players’ last-known skill (which would be strictly better than ELO as a prediction on how well an individual pair of players would do when facing off; and therefore could, in theory, be used as a replacement for ELO in determining who “should” be playing whom. You’d need an HPC cluster to generate that ladder, but it’d be theoretically possible, and that’s interesting.)
I was doing development work which uses a server I've got hosted on Digital Ocean. I started getting intermittent responses, which I thought weird as I hadn't changed anything on the server. I spent a good ten minutes trying to debug the issue before searching for something on DuckDuckGo, which also didn't respond. Cloudflare shouldn't be involved at all with my little site, so I don't think it's limited to just them.
Cogent and Cox are also having problems, but we are seeing a lot more successful traffic on Cogent than CenturyLink. It appears that CL is also not withdrawing stale routes. It seems CL's issues are causing issues on/with everything connected to it.
Same here. I actually opened a support ticket with them because I was worried my ISP had started blocking their IP addresses for some unknown reason. Luckily it seems to clear up, and in the ticket they mentioned routing traffic away from the problematic infrastructure. Seems to have worked for now for my things.
Yup, definitely noticed earlier outages to both EU sites and also to HN. Looked far upstream because many sites/lots of things worked fine. Good to see it's at least largely fixed
M5 Hosting here, where this site is hosted. We just shut down 2 sessions with Level3/CenturyLink because the sessions were flapping and we were not getting complete full route table from either session. There are definitely other issues going on on the Internet right now.
Great write up. It is embarrassing that most of America has no competition in the market.
>To use the old Internet as a “superhighway” analogy, that’s like only having a single offramp to a town. If the offramp is blocked, then there’s no way to reach the town. This was exacerbated in some cases because CenturyLink/Level(3)’s network was not honoring route withdrawals and continued to advertise routes to networks like Cloudflare’s even after they’d been withdrawn. In the case of customers whose only connectivity to the Internet is via CenturyLink/Level(3), or if CenturyLink/Level(3) continued to announce bad routes after they'd been withdrawn, there was no way for us to reach their applications and they continued to see 522 errors until CenturyLink/Level(3) resolved their issue around 14:30 UTC.
The same was a problem on the other (“eyeball”) side of the network. Individuals need to have an onramp onto the Internet’s superhighway. An onramp to the Internet is essentially what your ISP provides. CenturyLink is one of the largest ISPs in the United States.
Because this outage appeared to take all of the CenturyLink/Level(3) network offline, individuals who are CenturyLink customers would not have been able to reach Cloudflare or any other Internet provider until the issue was resolved. Globally, we saw a 3.5% drop in global traffic during the outage, nearly all of which was due to a nearly complete outage of CenturyLink’s ISP service across the United States.
I remember working the support queue _before_ this automatic re-routing mitigation system went in and it was a lifesaver. Having to run over to SRE and yell "look! look at grafana showing this big jump in 522s across the board for everything originating in ORD-XX where the next hop is ASYYYY! WHY ARE WE STILL SENDING TRAFFIC OVER THAT ARRRGHH please re-route and make the 522 tickets stop"
it's cool to see something large enough that the auto-healing mechanisms weren't able to handle it on their own, though shoutout to whoever was on the weekend support/SRE shift; that stuff was never fun to deal with when you were one of a few reduced staff on the weekend shifts
I had this earlier! A bunch of sites were down for me, I couldn't even connect to this site.
The problem is I don't know where to find out what was going on (I tried looking up live DDoS-tracking websites, "is it down or is it just me" websites, etc.). I couldn't find a single place talking about this.
Is there a source where you can get instant information on Level3 / global DNS / major outages?
I believe these are mostly non public channels where backbone and network infrastructure engineers from different companies congregate to discuss outages like this.
please don't call yourself that, it's more like i [and others] are hyper paranoid and marginal in behavior due to the nature of pastimes [i myself can promise you that i'm not malicious but i can't speak for others, i would leave it up to them to speak for themselves]
it isn't so much the channels that you want, it's the current IP of a non-indexed IRC server[s] that you need; of course you could create and maintain your own dynamic IRC server and invite people that you trust or feel kinship toward.
here are a couple of "for instance" breadcrumbs for you to start from:
I'm definitely an amateur when it comes to networking stuff. At the time, the _only_ issue I had was with all of my Digital Ocean droplets. It was confusing because I was able to get to them through my LTE connection and not able to through my home ISP. I opened a ticket with DO worried that it was my ISP blocking IP addresses suddenly. It turned out to be this outage, but it was very specific. Traceroute gave some clues, but again I'm amateur and I couldn't tell what was happening after a certain point.
So yeah, I too would love a really easy to use page that could show outages like this. It would be really great to be able to specify vendors used to really piece the puzzle together.
I found places talking about this earlier. A friend of mine who has CenturyLink as their ISP complained to me that Twitch and Reddit weren't working. But they worked for me, so I suspected a CDN issue. I did some digging to figure out what CDNs they had in common. I expected Twitch to be on CloudFront, but their CDN doesn't serve CloudFront headers; instead they are "Via: 1.1 varnish". Reddit is exactly the same. I did some googling and found out that they both apparently used Fastly, at least to some extent. Fastly has a status page and it was talking about "widespread disruption".
So I guess my takeaway from this is that if the Internet seems to be down, usually the CDN providers notice. I don't know if either of the sites actually still use Fastly (I kind of forgot they existed), but I did end up reading about the Internet being broken at some scale larger than "your friend's cable modem is broken", so that was helpful.
It would be nice if we had a map of popular sites and which CDN they use, so we can collect a sampling of what's up and what's down and figure out which CDN is broken. Though in this case, it wasn't really the CDN's fault. Just collateral damage.
Unfortunately, this infrastructure is at an uncanny intersection of technology, business and politics.
To learn the technical aspect of it, you can follow any network engineering certification materials or resources that delve into dynamic routing protocols, notably BGP. Inter-ISP networking is nothing but setting up BGP sessions and filters at the technical level. Why you set these up, and under what conditions is a whole different can of worms, though.
The business and political aspect is a bit more difficult to learn without practice, but a good simulacrum can be taking part in a project like dn42, or even just getting an ASN and some IPv6 PA space and trying to announce it somewhere. However, this is no substitute for actual experience running an ISP, negotiating percentile billing rates with salespeople, getting into IXes, answering peering requests, getting rejected from peering requests, etc. :)
Disclaimer: I helped start a non-profit ISP in part to learn about these things in practice.
in certain instances, yes depending on the nature or subject matter of the channel
ok, let's go for broke: there are a LOT of clandestine IRC servers and exclusive gatekeeping of channels.
you won't know about them unless you have an IRL reference
A RIPE ASN (as an end-user through a LIR) and PA v6 will cost you around $100 per year and some mild paperwork; there's plenty of companies/organizations that will help you with that (shameless plug: bgp.wtf, that's us).
Afterwards, announcing this space is probably the cheapest with vultr (but their BGP connectivity to VMs is uh, erratic at times) or with ix-vm.cloud, or with packet.net (more expensive). You can also try to colo something in he.net FMT2 and reach FCIX, or something in kleyrex in Germany. All in all, you should be able to do something like run a toy v6 anycast CDN at not more than $100 per month.
If you try to become an LIR in your own right, RIPE fees are much higher.
If you're looking for PI (provider independent) resources from RIPE, the costs to the LIR (on top of their annual membership fees) is around 50€/year. An ASN and a /48 of IPv6 PI space would therefore clock in around 100€/year (which is in line with the GP's pricing).
Membership fees are around 1400€/year, with a 2000€ signup fee. The number of PA (provider assigned) resources you have has no bearing on your membership fee. If you only have a single /22 of IPv4 PA space (the maximum you can get as a new LIR today) or you have several /16s, it makes no difference to your membership fees (this wasn't always the case, the fee structure changes regularly).
(EDIT: Source: the RIPE website, and the invoices they've sent me for my LIR membership fees)
> All subscribers to Internet access provided by the provider must be members of the provider and must have the corresponding rights, including the right to vote, within reasonable time and procedure.
Not all of our subscribers are members of our association. The association is primarily a hackerspace with hackerspace members being members of the association. We just happen to also be an ISP selling services commercially (eg. to people colocating equipment with us, or buying FTTH connectivity in the building we're located in).
Ah, that's interesting, thank you for pointing this out to me, I didn't know about it. I take it that this isn't the first time you are asked about this, then?
Well, on one hand I perfectly understand you not wanting to change your structure, especially if it works fine. On the other, I can see a few ways around that restriction, and don't really see how having the ISP be a separate association with its customers as members (maybe with their votes having less weight than hackerspace members) would have a downside (except if funds are primarily collected for funding hackerspace activities?).
> I take it that this isn't the first time you are asked about this, then?
First time, but I read the rules carefully :).
> [I] don't really see how having the ISP a separate association with its customers as members [...] would have a downside [...]
Paperwork, in time and actual accounting fees. If/when we grow, this might happen - but for now it's just not worth the extra effort. We're not even breaking even on this, just using whatever extra income from customers to offset the costs of our own Internet infrastructure for the hackerspace. We don't even have enough customers to legally start a separate association with them as members, as far as I understand. I also don't think our customers would necessarily be interested in becoming members of an association, they just want good and cheap Internet access and/or mediocre and cheap colocation space.
What resources can I follow to start a non-profit ISP? I want to start one in my hometown for students who couldn't afford internet to join online classes.
Hmm, I actually didn't think about that at all. I guess I got too fascinated by this video[0] and wanted to apply something similar to our current scenario.
Because often enough there is only one dominant service in the region who has no pressure to compete from anyone due to regulatory capture (esp. regarding right of way on utility poles) and so has no incentive to upgrade their offers to the customers.
If you intend to start a facilities-based last mile access ISP, what last-mile tech do you intend to use? There's a number of resources out there for people who want to be a small hyper local WISP. But I would not recommend it unless you have 10+ years of real world network engineering experience at other, larger ISPs.
I actually tried, but all I got was some consultancy services that would help you get an ISP with estimated cost of 10k USD (a middle class household earns half of that in a year here).
NANOG also has a lot of good videos on their channel from their conferences, including one on optical fibre if you want to get into the low-level ISO Layer 1 stuff:
You want to learn about BGP in order to understand how routing on the internet works. The book "BGP" by Iljitsch van Beijnum is a great place to start. Don't be put off by the publication date, as almost everything in there is still relevant.[1]
Once you understand BGP and Autonomous Systems(AS), you can then understand peering as well as some of the politics that surround it.[2]
Then you can learn more about how specific networks are connected via public route servers and looking glass servers.[3][4][5]
Probably one of the best resource though still is to work for an ISP or other network provider for a stint.
It likely has some inaccurate info as I'm not a network engineer, but I gave a talk about BGP (with a history, protocol overview, and information on how it fails using real world examples) at Radical Networks last year. https://livestream.com/internetsociety/radnets19/videos/1980...
I tried to make it accessible to those who have only a basic understanding of home networking. Assuming you know what a router is and what an ISP is, you should be able to to ingest it without needing to know crazy jargon.
It's important to recognize that there is a "layer 8" in Internet routing-- the political / business layer-- that's not necessarily expressed in technical discussion of protocols and practices. The BGP routing protocol is a place where you'll see "layer 8" decisions reflected very starkly in configuration. You may have networks that have working physical connectivity, but logically be unable to route traffic across each other because of business or political arrangements expressed in BGP configuration.
The business structures, ISP ownership and national telecoms have changed quite a lot in the past 25 years. But in terms of the physical OSI layer 1 challenges of laying cable across an ocean, that remains the most difficult and costly part of the process.
Geoff Huston paper "Interconnection, Peering, and Settlements" is older, but still interesting and several ways relevant.
I suggest "Where Wizards Stay Up Late: The Origins Of The Internet" - generic and talks about Internet history, but mentions several common misconceptions.
Level3 was acquired/merged/changed to CenturyLink a year or so back; I think they closed their old Twitter account then.
When someone says level3, read century link. L3 have been a major player for decades though (including providing the infamous 4.2.2.2 dns server), so people still refer to them as level3.
Note that L3 is a separate company from Level 3 Communications, which was the ISP that was acquired by CenturyLink. L3 is an American aerospace and C4ISR contractor.
CenturyLink's current CEO, Jeff Storey, was actually the pre-acquisition Level 3 CEO.
Read Internet Routing Architectures by Sam Halabi. It’s almost 20 years old now but BGP hasn’t changed and the book is still called The Bible by routing architects.
It's dated and not particularly useful if you want to learn how things are really done on the internet in a practical sense. So if you read it, be prepared to unlearn a bunch of stuff.
No particular resource to recommend, though I first learned about it in a book by Radia Perlman, but BGP is a protocol you don't hear much about unless you work in networking, and is one of the key pieces in a lot of wide-scale outages. I'd start with that.
Odd, I'm trying to reach a host in Germany (AS34432) from Sweden but get rerouted Stockholm-Hamburg-Amsterdam-London-Paris-London-Atlanta-São Paulo after which the packets disappear down a black hole. All routing problems occur within Cogentco.
What seems to have happened is that CenturyLink's internal routing has collapsed in some way. But they're still announcing all routes, and they don't stop announcing routes when other ISPs tag their routes not to be exported by CenturyLink.
So as other providers shut down their links to CenturyLink to save themselves, the outgoing packets towards CenturyLink travel to some part of the world where links are not shut down yet.
I'm having issues reaching IP addresses unrelated to Cloudflare. Based on some traceroutes, it seems AS174 (Cogent) and AS3356 (Level 3) are experiencing major outages.
Is there any one place that would be a good first place to go to check on outages like this?
It would be really cool and useful to have a "public Internet health monitoring center"... this could be a foundation that gets some financing from industry, maintains a global internet health monitoring infrastructure, and runs a central site at which all the major players announce outages. It would be pretty cheap and have a high return on investment for everybody involved.
Indeed, if we're to have a public Internet health meter, it must be distributed and hosted/served from "outside" somehow, to be resilient to all or parts of the network being down.
This is an excellent idea and simple but moderately expensive for anyone to set up.
Just have a site fetch resources from every single hosting provider everywhere. A 1x1 image would be enough, but 1K/100K/1M sized files might also be useful (they could also be crafted images)
The first step would be making the HTML page itself redundant. Strict round robin DNS might work well for that.
But yeah, moderately expensive - and... thinking about it... it'll honestly come in handy once every ten years? :/
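As a rough sketch of the probing side (the URLs are placeholders; a real deployment would fetch a tiny static object hosted at each provider, and the page serving the results would itself need the round-robin redundancy mentioned above):

    import urllib.request

    # Hypothetical probe targets, one small object per hosting provider/CDN.
    PROBES = {
        "provider-a": "https://provider-a.example/px.gif",
        "provider-b": "https://provider-b.example/px.gif",
    }

    def is_up(url: str, timeout: float = 5.0) -> bool:
        # Reachable means we got any 2xx back within the timeout.
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return 200 <= resp.status < 300
        except Exception:
            return False

    if __name__ == "__main__":
        for name, url in PROBES.items():
            print(f"{name:12s} {'up' if is_up(url) else 'DOWN'}")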
Reddit, HN, etc. are inaccessible to me over my Spectrum fiber connection, but working on AT&T 4G. It’s not DNS, so a tier 1 ISP routing issue seems to be the most likely cause.
> Fastly is observing increased errors and latency across multiple regions due to a common IP transit provider experiencing a widespread event. Fastly is actively working on re-routing traffic in affected regions.
This explains a lot. Initially thought my mobile phone Internet connectivity was flakey because I couldn't access HN here in Australia, whilst it's fine over wi-fi (wired Internet).
Because networks are connected to others via different paths, it's not unusual that one method of connectivity would work and one doesn't.
Also the Internet has lots of asymmetric traffic, just because a forward path towards a destination may look the same from different networks, it doesn't mean the reverse path will be similar.
I first thought I had broken my DNS filter again through regular maintenance updates, then I suspected my ISP/modem because it regularly goes out. I have never seen the behavior I saw this morning: some sites failing to resolve.
I thought Cloudflare was having issues again, since I use their DNS servers, so I started by changing that. Then I tried restarting everything, modem/router/computer. Wasn't until I connected to a VM that a friend hosts that I was finally able to access HN, and thus saw this thread.
Hopefully this will get fixed within a reasonable timespan.
I was so pissed at Waze earlier for giving up on me in a critical moment. Then I found out I'm also unable to send iMessages, but I was curious, since I could browse the web just fine.
When something doesn't work I always assume it's a problem with my device/configuration/connection.
Who would have thought it's a global event such as the repeated Facebook SDK issues.
Yep, I had a similar experience. Sites that didn't work from my home connection worked fine on mobile. After rebooting and it persisted, I assumed it was just a DNS or routing issue since they were both connecting to different networks.
I would love to hear the inside scoop from folks working at CenturyLink. I've used their DSL for years and the network is a mess. I don't know if it's them here or legacy Level3, but I have a guess.
Edit: Looks like i would have guessed wrong :P. Still want that inside scoop!
Used Level3 IP for a long time professionally with limited issues; certainly not on the list of worst ISPs.
Also used a company that over the years has gone from Genesis, GlobalCrossing, Vyvx, Level3 and now of course Level 3 is CenturyLink, which has been fine.
We had this once with one of our former ISPs configuring static routes towards us and announcing them to a couple of IXPs.
I have no idea why they did it, but it caused a major downtime once for us and basically signed the termination.
Misread the headline as "Level 3 Global Outrage" and thought "someone had defined outrage levels?" and "it doesn't matter, he'll just attribute it to the Deep State".
In some ways I'm a little bit disappointed it's only a glitch in the internet.
Here is a fantastic, though somewhat outdated overview [1]. Section 5 is most relevant to your question. The network topology today is a little different. Think of Level3 as an NSP, which is now called a "Tier 1 network" [2]. The diagram should show links among the Tier 1 networks ("peering"), but does not.
tl;dr One of the large Internet backbone providers (formerly known as Level3, but now known as CenturyLink usually) that many ISPs use is down. Expect issues connecting to portions of the Internet.
Usually the Internet is a bit more resilient to these kinds of things, but there are complicating factors with this outage making it worse.
Expect it to mostly be resolved today. These things have happened a bit more frequently, but generally average up to a couple times a year historically.
I could not get on HN as a logged in person (logged out was OK) during this. I wondered how big the cloudflare thread would be if people could get on to comment on it :-)
CNN is absolutely right. Every day I read news that something goes down at CloudFlare. CloudFlare do much more harm than they "fix" with their services.
HN working for me from the UK on BT, but traceroute showing lots of different bouncing around and a lot of different hops in the US
7 166-49-209-132.gia.bt.net (166.49.209.132) 9.877 ms 8.929 ms
166-49-209-131.gia.bt.net (166.49.209.131) 8.975 ms
8 166-49-209-131.gia.bt.net (166.49.209.131) 8.645 ms 10.323 ms 10.434 ms
9 be12497.ccr41.par01.atlas.cogentco.com (154.54.56.130) 95.018 ms
be3487.ccr41.lon13.atlas.cogentco.com (154.54.60.5) 7.627 ms
be12497.ccr41.par01.atlas.cogentco.com (154.54.56.130) 102.570 ms
10 be3627.ccr41.jfk02.atlas.cogentco.com (66.28.4.197) 89.867 ms
be12497.ccr41.par01.atlas.cogentco.com (154.54.56.130) 101.469 ms 101.655 ms
11 be2806.ccr41.dca01.atlas.cogentco.com (154.54.40.106) 103.990 ms 93.885 ms
be3627.ccr41.jfk02.atlas.cogentco.com (66.28.4.197) 97.525 ms
12 be2112.ccr41.atl01.atlas.cogentco.com (154.54.7.158) 106.027 ms
be2806.ccr41.dca01.atlas.cogentco.com (154.54.40.106) 98.149 ms 97.866 ms
13 be2687.ccr41.iah01.atlas.cogentco.com (154.54.28.70) 120.558 ms 122.330 ms 120.071 ms
14 be2687.ccr41.iah01.atlas.cogentco.com (154.54.28.70) 123.662 ms
be2927.ccr21.elp01.atlas.cogentco.com (154.54.29.222) 128.351 ms
be2687.ccr41.iah01.atlas.cogentco.com (154.54.28.70) 120.746 ms
15 be2929.ccr31.phx01.atlas.cogentco.com (154.54.42.65) 145.939 ms 137.652 ms
be2927.ccr21.elp01.atlas.cogentco.com (154.54.29.222) 128.043 ms
16 be2930.ccr32.phx01.atlas.cogentco.com (154.54.42.77) 150.015 ms
be2940.rcr51.san01.atlas.cogentco.com (154.54.6.121) 152.793 ms 152.720 ms
17 be2941.rcr52.san01.atlas.cogentco.com (154.54.41.33) 152.881 ms
te0-0-2-0.rcr11.san03.atlas.cogentco.com (154.54.82.66) 153.452 ms
be2941.rcr52.san01.atlas.cogentco.com (154.54.41.33) 152.054 ms
18 te0-0-2-0.rcr12.san03.atlas.cogentco.com (154.54.82.70) 162.835 ms
te0-0-2-0.nr11.b006590-1.san03.atlas.cogentco.com (154.24.18.190) 146.643 ms
te0-0-2-0.rcr12.san03.atlas.cogentco.com (154.54.82.70) 153.714 ms
19 te0-0-2-0.nr11.b006590-1.san03.atlas.cogentco.com (154.24.18.190) 151.212 ms 145.735 ms
38.96.10.250 (38.96.10.250) 147.092 ms
20 38.96.10.250 (38.96.10.250) 149.413 ms * *
I don't normally see multi paths for a given IP, but that aside it's bouncing through far more than I'd expect. That said, it's rare I look at traceroutes across the continental U.S, maybe that many layer 3 hops are normal, maybe routes change constantly.
HN has dropped off completely from work - I see the route advertised from Level 3 (3356 21581 21581) and from Telia and onto Cogent (1299 174 21581 21581). Telia is longer, so traffic goes into to Level3 at Docklands via our 20G peer to London15, but seems to get no further.
Heading to Tata in India, the route out is via the same peer to Level3, then on to London, Marseille, and then peers with Tata in Marseille; working fine.
My gut feeling is a core problem in Level3's continental US network rather than something more global.
https://downdetector.com/ client perspective is best perspective ;) Problem in this outage is that site X works ok but transit provider for clients in US works badly and generates "false positives"
For a situation like this, the various tools hosted by RIPE are likely your best bet. You won't get a pretty green/red picture, but you'll get a more than enough data to work with.
It’s a mailing list for network operations/engineering folks. The emails are the status updates. You’ll have to look to each network’s own site if you want connectivity, peering, and IXP red/green/ up/down status.
Currently working on a project[1] to monitor all the 3rd party stack you use for your services. Hit me up if you want access; I'll give free access for a year+ to some folks to get feedback.
Since hacker news was down yesterday I couldn't reply here, so I tried to send you an email, but that failed to deliver, as there are no MX records for monitory.io...
This had me really confused until I saw it was a global outage. I have been getting delayed iOS push notifications (from prowl) now for the last few hours, from a device I was fairly sure I had disconnected 3 hours ago (a pump)
Got questioning if I really disconnected it before I left.
I'm wondering if we're at the point where internet outages should have some kind of (emergency) notification/sms sent to _everyone_.
1. Is there syntax correctness checking available, so you don't push a config that breaks machines? Yes.
2. Is there a DWIM check available, so you can see the effect of the change before committing? No. That would require a complete model of, at a minimum, your entire network plus all directly connected networks -- that still wouldn't be complete, but it could catch some errors.
There is a major internet outage going on. I am using Scaleway they are also affected. According to Twitter, Vodafone, CityLink and many more are also affected.
Fyi I'm not having any problems right now with hetzner.com nor hetzner.de - my own dedicated server hosted at Hetzner datacenter in Germany seems to be reachable/working as well.
I see a lot of ads for NordVPN, but you should know they're not necessarily reliable. Just look for NordVPN on hacker news search: https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que... (see e.g. the second hit: https://news.ycombinator.com/item?id=21664692 covering up security issues, using your connection to proxy other people's traffic, a related company does data mining...). The only VPN that seemed to fit the bill when I looked for one about a year ago was ProtonVPN, but I certainly didn't manage to look at every VPN on the planet and I'm just a random internet stranger so... take that with a grain of salt.
Yeah, I agree that their marketing is aggressive, but a lot else I think is just speculation. The VPN market is very cut-throat; competitors create all kinds of crackpot conspiracy theories to sow doubt amongst potential customers. As far as I know there was one incident, and it wasn't very serious; no customer data was stolen. Also, the company now does 3rd-party audits. I think NordVPN is pretty decent, but that's just me.
Works OK in Paris via SFR (home fiber) and Sipartech (Business fiber).
Doesn't work via Bouygues 4G.
My SFR fiber doesn't seem affected all that much. I've been following this for a while on the other HN post [0] and all services people have noted seem to work here.
Both SFR and Sipartech seem to have direct peerings with Cogentco.
A service I run on Digital Ocean was affected by this early this morning. Looks like it was mitigated by DO - so I'm very grateful for that. Although, the service I run is time sensitive so failures like this are pretty unfortunate for me. Where would I get started with building in redundancy against these sort of outages?
My opinion is that this, like many issues we're seeing today, is largely an issue with ongoing consolidation trends. The less diversity of systems/solutions we have for a given problem or set of problems, the less chance we're protected from unknown unknowns that creep up. The more diversity you have in systems, the more likely you have some option that is hardened against unknown unknowns when they arrive, and the quicker we can work around them.
Modern society is all about consolidating systems into a few efficient solutions typically dictated by market forces which I argue, don't concern themselves much with these sorts of problems. As a result, when we run into problems, we're left with fewer options to resort to and instead have to identify problems and develop new solutions on-the-fly. Consolidation leads to complacency and stagnation.
Sometimes this is reasonable (and even desirable) for certain non-critical systems; it just doesn't make financial sense to pour resources into system diversity for systems we could do without--find the one that works best/most efficiently and use it. If it breaks, it's not critical and the workaround can wait.
On the other hand, if a system is critical, then I think it behooves us to continue looking at improvements of existing systems and alloting resources to investigating new approaches.
BGP has always had this issue. It depends on trustworthy information being available. Any trusted source who starts lying (or just screws up) is going to cause routing problems.
Note, trustworthiness jumps from being a technical problem to being a human/people problem. Layer 8, as someone mentioned, or GIGO (Garbage-In-Garbage-Out) as others may know it.
To safely use a system, your operator needs to be 10% smarter than the system being operated. It is clear that we have problems in that department with certain AS's. This is, what, about the third major outage attributed to CenturyLink in the last handful of years? I have no idea what exactly their process must look like, but good heavens, a better look needs to be taken, as this is becoming a bit regular for my tastes.
No, because the issue you're commenting on doesn't suggest that. It looks like the nature of this particular outage is such that a previous iteration of the Internet wouldn't have been any better equipped to solve this faster.
Why do a few companies control the backbone of the internet? Shouldn’t there be a fallback or disaster recovery plan if one or more of these companies become unavailable?
The problem is the provider having problems is still sending misconfigured routes after the other providers have tried to pull them in response to the outage. So it’s as if CenturyLink was doing a massive BGP attack against their peers, pointing at a black hole.
Things mostly routed around the problem. Issues arose because a) some people are single-homed to Level3/CenturyLink b) apparently Level3/CenturyLink continued announcing unreachable prefixes, which breaks the Internet BGP trust model.
Chess.com was down due to the outage and some of the Indian players got disconnected and lost on time, so FIDE declared India-Russia joint winner of the Online Chess Olympiad 2020.
I've lost too much precious time when github/npm/cloudflare were going down before figuring out it was them.
So I'm currently working on a project[1] to monitor all the 3rd party stack you use for your services. Hit me up if you want access; I'll give free access for a year+ to some folks to get feedback.
There is at least one big tool that does exactly the same you wrote. It is called StatusGator https://statusgator.com
There are at least 3 much smaller ones.
Have you tried any of them?
If yes, what's your point of difference?
And how do you plan to market it? As I see the plans are cheap, means your LTV is low.
Cloudflare status page:
Update - Major transit providers are taking action to work around the network that is experiencing issues and affecting global traffic.
We are applying corrective action in our data centers as the situation changes in order to improve reachability
Aug 30, 14:26 UTC
I just experienced HN down for several minutes before it loaded and I saw this story at the top.
I'm doing something with the HN API as I type this, so for a moment I was trying to decide if I'd been IP blocked, even though the API is hosted by Firebase.
I haven't noticed any obvious issues elsewhere yet.
(Just got a delay while trying to submit this comment.)
Can anyone help me understand why I can't access HN from my iPhone, but I can from my computer? both are on the same network. I'm getting "Safari cannot open the page because the server cannot be found", and many apps won't work at all either.
It wasn't a total outage for the site I was trying to reach. It took about 20 minutes to make an order, but after multiple retries (errors were reported as a 522 with the problem being somewhere between Manchester, UK and the host), it did go through.
> Not a network engineer, but based on the comments there it looks like it's a BGP blackhole incident, possibly reminiscent of the https://en.wikipedia.org/wiki/AS_7007_incident in 1997.
As you aren’t a network engineer, I can understand making that leap based on the context, but no, this is nothing like the AS7007 event.
The “black hole” in this case is due to networks pulling their routes via AS3356 to try and avoid their outage, but when they do, CenturyLink is still announcing those routes and as such those networks blackhole.
What I take from this is that you’re offering input to a thread which you don’t have experience in or even actually understand, thus are spreading misinformation. You then are continually doubling down further showing your maturity.
Heh, I knew I was setting myself up for that from networking people - i know the attitude. I was of course merely repeating the sentiment in that thread. What more disclaimers do you need to avoid displaying your superiority in networking? Sheesh.
You were repeating a suspicion as if it was yours, as if it was a shared view in that thread; it wasn't. I'm not a network engineer, but I read the thread too. Nobody needs people spreading misinformation in a crisis just to sound smart; it's not useful and usually harmful.
> You were repeating a suspicion as if it was yours
That is a misrepresentation.
I wrote:
"Not a network engineer, but based on the comments there". I prefaced the insecure speculation part with "possibly". It was obvious to any reader this was a summary of the emails there.
This knocked out the Starbucks app and some of their systems this morning. A bunch of people in line couldn't log in and they were saying parts of their whole internal system were down, too.
I'm confused about why Cloudflare had problems but other CDN providers/sites with private CDNs like Google did not. Is there something different about how Cloudflare operates?
I experienced this issue while reading docs at "Read the Docs" (and ironically had connection issues while trying to read this very exact page right here, too.)
They probably peer with Google at the local IX/data centre. Google traffic will therefore take a different path, which isn't suffering the current outage.
I was doing a big release over the evening. Everything was working fine up until about 6 hours ago, when I signed off. Our network monitors show an outage started about half an hour later (at about 4:05am CST). Service was restored a few minutes ago, at about 9:44am CST. I don't know if our problem is the same as this problem, but we are on CenturyLink.
What was the intent of posting this? This is an article on a global network outage - some folks want the technical nitty-gritty and others don't. You seem adversarial or pretentious when you unexpectedly post things like this even if your intentions are well-meaning.
I agree it was needlessly adversarial (sorry about that!) - but it had the desired effect: an excellent explanation of the concept with lots of relevant background information (thanks, kitteh, upvoted). I think this helps the discussion a lot, since a lot more people are now able to join. Less gatekeeping.
BGP is a routing protocol mostly used for propagating routing/reachability information; its updates can also carry additional data (communities as tags, etc).
A few years ago folks wanted to bake in additional functionality. For example, packet filters (aka ACLs) are normally deployed to router configuration files using each operator's own tooling. Deploying them against hundreds or thousands of routers rapidly was a challenge for some (not good at swdev, etc.). So the idea was: we already have a protocol that rapidly propagates state to every router in the network, let's find a way to bake ACLs into the BGP updates.
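As a rough, hedged illustration (plain Python modeling the concept only - this is not the real RFC 5575 wire encoding or any router vendor's API), a flowspec announcement is essentially a match-plus-action tuple that every BGP speaker installs as a packet filter the moment the update arrives:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class FlowspecRule:
        """One flowspec 'route': match criteria plus a traffic action."""
        dst_prefix: str                 # e.g. "203.0.113.7/32", the target being protected
        protocol: Optional[int] = None  # e.g. 17 for UDP
        dst_port: Optional[int] = None  # e.g. 53
        rate_limit_bps: int = 0         # 0 means drop all matching traffic

    def announce(rule: FlowspecRule) -> None:
        # In reality this would go out as a BGP UPDATE; every router that
        # accepts it programs the filter within seconds, with no per-router
        # review or staged rollout.
        print(f"installing filter network-wide: {rule}")

    # Drop all UDP/53 traffic toward one host, everywhere, instantly.
    announce(FlowspecRule("203.0.113.7/32", protocol=17, dst_port=53))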
The result wasn't that good, for a few reasons:
1) BGP state isn't sticky. If a router goes offline or BGP sessions reset, the ACLs go away. That means if you are using flowspec for a critical need, like always-on packet filters, you've got the wrong tool.
2) The implementation had various bugs.
3) Most importantly, it gave people a really easy way to hurt themselves globally. There was no phased deployment with pre- and post-checks. What you deployed led to packet filters being installed across the network in seconds. In most cases (depending on your config), the only way to remove one is to withdraw the specific flowspec route or have the BGP session carrying it reset.
I've seen bad flowspec routes core-dump the daemon responsible for programming ACLs on a router, leaving it unable to withdraw the programmed entry. I've seen bugs where TCP/UDP port matches went wrong and ate a lot more traffic than intended. I've seen so many flowspec rules installed on a network that it exhausted the routers' ability to inspect and process packets, and you'd see packet drops flat-lining.
In my opinion, it's a hack around not having a good ACL deployment tool that has led to many outages in its wake.
Edit: another flowspec gotcha. Some folks like to integrate their DDoS tooling with flowspec. For example: I run a network, some IP address behind me gets lit up, so I auto-deploy a rule for that specific IP and rate-limit traffic to it. Unfortunately, folks sometimes don't put a lot of care into making sure this can't touch internal IPs that should be off limits - route reflectors, router loopback IPs, etc. I've seen networks have a bad day because a DDoS (or traffic misclassified as a DDoS) caused rules to be auto-installed to protect something, but those rules actually impaired legitimate communication with network infrastructure, which then caused the outage.
Also, flowspec doesn't work like regular ACLs where you have input and output on a per interface basis - it applies to all traffic traversing a router, which makes it difficult to say which interfaces should be exempt (think internal vs external).
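For what it's worth, the guard argued for above can be tiny; here's a minimal sketch using Python's ipaddress module (the prefixes and function name are made up for illustration, not taken from any real tooling):

    import ipaddress

    # Hypothetical infrastructure space that auto-mitigation must never touch:
    # route reflectors, router loopbacks, management networks, etc.
    PROTECTED_PREFIXES = [
        ipaddress.ip_network("192.0.2.0/24"),     # example: router loopbacks
        ipaddress.ip_network("198.51.100.0/24"),  # example: route reflectors / mgmt
    ]

    def safe_to_mitigate(target_ip: str) -> bool:
        """Refuse to auto-install a flowspec rule against protected infrastructure."""
        addr = ipaddress.ip_address(target_ip)
        return not any(addr in net for net in PROTECTED_PREFIXES)

    # A detector flags 198.51.100.10 as "under attack" (maybe it is just a spike
    # toward a route reflector); without this check, the auto-installed rule
    # would cut the network off from its own control plane.
    print(safe_to_mitigate("198.51.100.10"))  # False -> do not install the rule
    print(safe_to_mitigate("203.0.113.7"))    # True  -> ok to rate-limit or drop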
This doesn't have anything to do with IPv4 vs IPv6. It is a routing issue with BGP. To give an analogy:
if every website were a house,
and every house had a house number (an IP address, either IPv4 or IPv6),
and groups of houses formed cities and towns identified by a number (an AS, or Autonomous System, number),
and the highways between cities were the BGP routes,
and half of the world's internet traffic went through the city of CenturyLink (AS3356),
and the city of CenturyLink shut down its traffic, either on purpose or by accident,
...then it wouldn't matter whether your house number / IP address is a 32-bit number or a 128-bit number, because traffic needs to take a different route.
This is why everyone is worried about BGP routes, not IP addresses.
Sadly, in my experience, IPv4 is still generally more reliable than IPv6.
Set up two hosts, host A and host B, in two different data centers. Make them send HTTP requests to each other over IPv4 and over IPv6. You'll see that latency spikes and packet loss are more frequent over IPv6.
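A crude version of that comparison can be done with nothing but the Python standard library. This sketch only times TCP connects to a placeholder hostname (swap in your own "host B"), so treat it as a starting point rather than a proper measurement:

    import socket
    import statistics
    import time

    def connect_ms(host: str, port: int, family: int) -> float:
        """Time one TCP connect to host:port over the given address family."""
        addr = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)[0][4]
        start = time.monotonic()
        with socket.socket(family, socket.SOCK_STREAM) as s:
            s.settimeout(5)
            s.connect(addr)
        return (time.monotonic() - start) * 1000

    HOST = "example.com"  # placeholder: point this at host B in the other DC
    for label, family in (("IPv4", socket.AF_INET), ("IPv6", socket.AF_INET6)):
        samples = [connect_ms(HOST, 443, family) for _ in range(20)]
        print(f"{label}: median {statistics.median(samples):.1f} ms, "
              f"max {max(samples):.1f} ms")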
We’ve observed this in end-user devices, especially on some ISPs.
It makes sense if the overall adoption and resource allocation are comparatively smaller, making individual or small-group coincident spikes more impactful against the amortized whole.
It’s a lot like a market with low volume/liquidity. Someone wanders in with a big transaction and blows everything up.
It would appear, from the limited info so far, to be an issue in the v4 routing configuration - I haven't seen anything that says it couldn't have been the other way around.
How the xxxx did it take CenturyLink/Level3 like 3-4 hours to fix this problem?
Again (https://news.ycombinator.com/item?id=24322988) not a network engineer, but it seemed like their routers actively stopped other networks from working around the problem since L3 would still keep pushing other networks' old routes, even after those networks tried to stop that.
Also: BGP probably needs to be redesigned from the ground up by software engineers with experience designing systems that keep working in the presence of hostile actors.
> Also: BGP probably needs to be redesigned from the ground up by software engineers with experience designing systems that keep working in the presence of hostile actors.
This has been attempted a number of times, but this is a political problem, not a technical problem: there's no single agreed source of truth for routing policy.
A lot of US Internet providers won't even sign up for ARIN IRR, or even move their legacy space to a RIR - so there isn't even any technical way of figuring out address space ownership and cryptographic trust (ie. via RPKI). Hell, some non-RIR IRRs (like irr.net) are pretty much the fanfiction.net equivalent of IRRs, with anyone being able to write any record about ownership, without any practical verification (just have to pay a fee for write access). And for some address space, these IRRs are the only information about ownership and policy that exists.
Without even knowing for sure who a given block belongs to, or who's allowed to announce it, or where, how do you want to fix any issues with a new dynamic routing protocol?
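To see how thin that source of truth is, you can query an IRR yourself over the plain-text WHOIS protocol (port 43); whois.radb.net is one widely mirrored IRR. A small sketch - and, as above, the existence of a route object proves very little about who actually controls the space:

    import socket

    def irr_whois(query: str, server: str = "whois.radb.net") -> str:
        """Send one query over the plain-text WHOIS protocol and return the reply."""
        with socket.create_connection((server, 43), timeout=10) as s:
            s.sendall((query + "\r\n").encode())
            chunks = []
            while True:
                data = s.recv(4096)
                if not data:
                    break
                chunks.append(data)
        return b"".join(chunks).decode(errors="replace")

    # Prints whatever route/route6 objects the IRR holds for the prefix; stale or
    # conflicting objects are common, which is exactly the problem described above.
    print(irr_whois("8.8.8.0/24"))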
Build an industry coalition. Put pressure on those who don't join. Randomly throw away 1 out of 10000 packets from the providers that fail to get with the times. Increase that frequency according to some published time function.
Having a single, cryptographically assured source of truth for routing data is a turnkey censorship nightmare waiting to happen.
All it takes is a national military to care enough to put pressure on the database operator, legal or otherwise, and suddenly your legitimate routes are no longer accepted.
If you think this wouldn't be used to shut down things like future Snowden-style leaks or Wikileaks or The Shadow Brokers, you may not have been paying attention to the news.
> Build an industry coalition. Put pressure on those who don't join. Randomly throw away 1 out of 10000 packets from the providers that fail to get with the times. Increase that frequency according to some published time function.
What sort of incentive would anyone have to join such a coalition? Why would anyone work with providers from such a coalition, when they can work with an alternative ISP outside it and not have to deal with packet drops?
I think you're underestimating how many people have been attempting to solve this. The Internet community has some quite clever people in it, but it's also very, very large, and sweeping changes are difficult to pull off (see: IPv6 adoption).
And who should spearhead this coalition?
Let's not forget that this is mainly a political problem and not a technical one. Would countries be willing to join a coalition with heavy influence from China, for example (or vice versa with the US)?
Based on what I've seen: They essentially "shut down the Internet" for probably a quarter of the global population for about 3-4 hours.
That response time is atrocious. It wasn't that they needed to fix broken hardware, rather they needed to stop running hardware from actively sabotaging the global routing via the inherently insecure BGP protocol. That took 3-4 hours to happen.
That is a fantastic euphemism. Personally I'm disappointed Telia didn't de-peer two hours earlier, after diagnosing the issue for 30 minutes, since that whole lack of functioning routing to very large parts of the internet forced me to use a VPN in North America to access many web services, including HN.
I realize I'm going to get insanely downvoted by the elite internetworking crowd again but I think this needs to be said.
From an outsider's POV: there seems to be a very strange and almost incestuous relationship between the networking companies. Or maybe it's just their hangers-on? I dunno.
Source https://puck.nether.net/pipermail/outages/2020-August/013229...