> Root Cause: A configuration issue impacted IP services in various markets across the United States.
> Fix Action: The IP NOC reverted a policy change to restore services to a stable state.
> Summary: The IP NOC was informed of a significant client impact which seemed to originate on the east coast. The IP NOC began investigating, and soon discovered that the service impact was occurring in various markets across the United States. The issue was isolated to a policy change that was implemented to a single router in error while trying to configure an individual customer BGP. This policy change affected a major public peering session. The IP NOC reverted the policy change to restore services to a stable state.
> Corrective Actions: An extensive post analysis review will be conducted to evaluate preventative measures and corrective actions that can be implemented to prevent network impact of this magnitude. The individual responsible for this policy change has been identified.
[snip]
Sounds like "the individual responsible" forgot to set some communities on the peering session. Oops.
Well, there’s the problem right there. There was an individual responsible, meaning one set of hands unreviewed on a keyboard, for a change that could cause a global outage.
If I’m that person responsible, I’m going to hire two staff, ask them each to write the command scripts, justify any differences, produce a consensus script for my review, and then implement it. That seems like the minimum level of responsible engineering. Lint tools, notices of novel commands, and other ornaments have a place too, but the core idea is this:
The person with hands on keyboard is not the individual responsible for this error.
It's implied by their claim that there is a single responsible individual. If someone else reviewed it and said "yep, looks good, deploy the change" how is there one "individual responsible"?
The idea of an isolated root cause or single human error in the failure of complex systems is bogus anyway. I'm a huge fan of the work in this area championed by John Allspaw [0].
Indeed. Someone runs the department that individual works in and allowed this kind of uncontrolled process. Someone is that person's boss and should have been asking how well controlled our processes are, and so on. Turtles all the way up...
This could have been as simple as the individual accidentally "missing" a single line of a large, multi-line configuration when copy/pasting it into the router's console -- after all of the config review, etc., already occurred.
I'm not sure what vendor's gear was in use in this particular case, but the configs for a BGP peering session are typically (as mentioned above) large, multi-line configurations. For example, here's the (slightly redacted) configuration for one single BGP session on one of my routers:
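(The pasted listing is omitted here; a hypothetical equivalent, written in IOS-style syntax with documentation-range addresses and made-up ASNs, looks roughly like this:)

    router bgp 64511
     neighbor 192.0.2.1 remote-as 64496
     neighbor 192.0.2.1 description EXAMPLE-PUBLIC-PEER
     neighbor 192.0.2.1 password <redacted>
     !
     address-family ipv4
      neighbor 192.0.2.1 activate
      neighbor 192.0.2.1 send-community both
      neighbor 192.0.2.1 soft-reconfiguration inbound
      neighbor 192.0.2.1 remove-private-as
      neighbor 192.0.2.1 prefix-list PEER-IN in
      neighbor 192.0.2.1 prefix-list PEER-OUT out
      neighbor 192.0.2.1 route-map PEER-POLICY-IN in
      neighbor 192.0.2.1 route-map PEER-POLICY-OUT out
      neighbor 192.0.2.1 maximum-prefix 50000 90
     exit-address-family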
That doesn't even include the associated prefix lists (or filter lists or route-maps or ...). All it would take is fat-fingering/typo'ing one of these lines or missing one to cause some very unintended effects.
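A couple of those associated pieces, in the same hypothetical IOS-style sketch:

    ip prefix-list PEER-OUT seq 5 permit 198.51.100.0/24
    ip prefix-list PEER-OUT seq 10 deny 0.0.0.0/0 le 32
    !
    route-map PEER-POLICY-OUT permit 10
     match ip address prefix-list PEER-OUT
     set community 64511:100 additive
    route-map PEER-POLICY-OUT deny 20

Drop the "match" line while pasting and clause 10 matches every route in the table, and the session announces far more than intended.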
Since we don't know exactly what happened, it's easy to say "they should've done this" or "they didn't do that". In reality, however, we simply don't know what they did or didn't do. You've shown no evidence that they didn't do any of the things you mention and, in some cases, you can do all of that and still have things go wrong.
Here's my paraphrasing in more straightforward fashion, since I know at least some of us would like such notices to be more to-the-point and less business-speak-ish:
Root Cause: Incorrect router configuration.
Fix Action: Revert the configuration.
Summary: Someone made the wrong settings on a router and made packets go the wrong way in parts of the US. We changed the settings back to what they were before.
Corrective Actions: We'll try to find ways to avoid doing this again. We know who did it.
I don't, sorry, it's in my inbox. The part I snipped is completely unrelated to the incident.
As far as BGP goes, Halabi's _Internet Routing Architectures_ [0] is pretty much considered the "bible". It's really old nowadays but it covers BGP4 (the current version in use) and not much has really changed.
I'm sure some of the newer BGP books are excellent as well but I can't personally recommend them as IRA and the (old) CCNP BGP book are all I've ever read/used (while preparing for the CCNP certification and in my day job).
Of course, pretty much everything is covered in RFC 4271 [1] (and updates) although the RFCs can be a bit "dry".
I found "BGP4: Inter-domain Routing in the Internet" by John Stewart III [0] to be very approachable. It's likewise ancient by tech world standards (published in 1998!) but that's because, amazingly, BGP has not fundamentally changed in all that time. It's cheap and less than 200 pages; definitely recommend it as a primer.
These large companies need much more SDN in their operations, the way NTT GIN does. I think GIN only has 50 people in their entire organization to run that whole business. They do it by investing heavily in software (and not building out their own fiber all over the place).
Yeah, I saw all of that (and more). Doesn't rule out what I said.
I presume Comcast was advertising those (longer) prefixes to Level 3 to manage traffic flow, and they shouldn't have been propagated to other customers. To do that, you'd typically apply BGP communities (no-export or other, Level3-specific ones) to those prefixes. A lack of those communities on the prefixes would result in them getting propagated to other Level3 peers. When that happened, it would look exactly like, and result in, this route leak.
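As a rough sketch of that mechanism (hypothetical IOS-style config with documentation addresses; the real Level3-specific community values are different and this is not anyone's actual policy), the traffic-engineering more-specifics get tagged on the way out so the upstream won't re-advertise them:

    ! Outbound policy toward the upstream: tag the more-specifics
    ! so they are not re-advertised to the upstream's peers.
    route-map TO-UPSTREAM permit 10
     match ip address prefix-list TE-MORE-SPECIFICS
     set community no-export additive
    route-map TO-UPSTREAM permit 20
    !
    router bgp 64496
     address-family ipv4
      neighbor 192.0.2.1 route-map TO-UPSTREAM out

Leave that "set community" off (or apply the wrong policy to the session) and the more-specifics go out to every peer, which from the outside looks exactly like this kind of route leak.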
Every few months/years I'm reminded that a few dozen to a few hundred people are responsible for routing packets across the entire internet. One typo could affect millions.
Early in the history of the commercial internet in Brazil, we had a couple of issues like this one with the recently privatized telco that operated the big backbone connecting us to other countries. At the time we more or less seriously mused about whether all commercial ISPs should pool their resources and pay a top-tier consultancy to properly configure everything for the telco, provided they never ever touched those routers again.
I don't get it... I thought the Internet was supposed to be decentralized... How is it possible that one or two companies can cause such widespread issues?
BGP is such a great foot gun precisely because the internet is decentralized.
There's no central repository of how to route traffic for an IP [1]. If there was, it would probably mess things up from time to time, but not to such a large extent.
Instead, we just have to kind of trust BGP announcements -- especially if they come from ISPs that credibly could route anything (Level 3, other "tier 1" ISPs).
[1] Actually there are some efforts to develop this. After all, IP allocations are essentially centralized under the five regional internet registries. There are some registries of routing information (RADB is the most well known, I think); but not all ASNs participate, and filtering routes from large transit ISPs is still a major problem.
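To make that concrete: the kind of per-peer inbound filter an operator can generate from IRR data looks something like the sketch below (hypothetical, documentation-range prefixes and ASNs). For a large transit ISP the generated list would be enormous and constantly changing, which is part of why filtering routes from them is still a problem, as mentioned.

    ip prefix-list AS64496-IN seq 5 permit 192.0.2.0/24
    ip prefix-list AS64496-IN seq 10 permit 198.51.100.0/24
    ip prefix-list AS64496-IN seq 15 permit 203.0.113.0/24 le 25
    ! ...one line per route object registered for the peer's AS-SET...
    ip prefix-list AS64496-IN seq 99999 deny 0.0.0.0/0 le 32
    !
    router bgp 64511
     address-family ipv4
      neighbor 192.0.2.1 prefix-list AS64496-IN in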
Any one error only affects a portion of the network. For example, a problem might take Australia or the American East Coast offline, but that doesn't significantly affect the rest of the network (unless they try to reach the affected regions). That was the design goal and it works perfectly.
What you want is entirely different. The European power network, for example, is designed for N+1 redundancy: any single equipment failure doesn't have significant effects. If you take that a bit further, you could include misconfigurations or even allow entire companies to fall out of the network. But each level of assurance requires more overprovisioning to compensate for failed equipment or lost capacity. And overprovisioning is expensive.
It is... but BGP is a sort of privileged system in that once you're peering with another provider, you're trusting them not to put garbage routes out. When they do, those garbage routes propagate. The decentralized-ness is a strength as well as a weakness.
Much like IRC, also a decentralized system, where a rogue server (a server has privileged access) or services package (same) can cause widespread issues across the entire network.
Decentralized does not mean guaranteed availability; however, this failure shows that the internet is somewhat fault tolerant since most packets on the net were routed to the correct destination during that time.