> Root Cause: A configuration issue impacted IP services in various markets across the United States.
> Fix Action: The IP NOC reverted a policy change to restore services to a stable state.
> Summary: The IP NOC was informed of a significant client impact which seemed to originate on the east coast. The IP NOC began investigating, and soon discovered that the service impact was occurring in various markets across the United States. The issue was isolated to a policy change that was implemented to a single router in error while trying to configure an individual customer BGP. This policy change affected a major public peering session. The IP NOC reverted the policy change to restore services to a stable state.
> Corrective Actions: An extensive post analysis review will be conducted to evaluate preventative measures and corrective actions that can be implemented to prevent network impact of this magnitude. The individual responsible for this policy change has been identified.
[snip]
Sounds like "the individual responsible" forgot to set some communities on the peering session. Oops.
Well, there’s the problem right there. There was an individual responsible, meaning one set of hands unreviewed on a keyboard, for a change that could cause a global outage.
If I’m that person responsible, I’m going to hire two staff, ask them each to write the command scripts, justify any differences, produce a consensus script for my review, and then implement it. That seems like the minimum level of responsible engineering. Lint tools, notices of novel commands, and other ornaments have a place too, but the core idea is this:
The person with hands on keyboard is not the individual responsible for this error.
It's implied by their claim that there is a single responsible individual. If someone else reviewed it and said "yep, looks good, deploy the change" how is there one "individual responsible"?
The idea of an isolated root cause or single human error in the failure of complex systems is bogus anyway. I'm a huge fan of the work in this area championed by John Allspaw [0].
Indeed. Someone runs the department that individual works in and allowed this kind of uncontrolled process. Someone is that person's boss and should have been asking how well controlled our processes are, and so on. Turtles all the way up...
This could have been as simple as the individual accidentally "missing" a single line of a large, multi-line configuration when copy/pasting it into the router's console -- after all of the config review, etc., already occurred.
I'm not sure what vendor's gear was in use in this particular case, but the configs for a BGP peering session are typically (as mentioned above) large, multi-line configurations. For example, here's the (slightly redacted) configuration for one single BGP session on one of my routers:
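(The pasted listing is omitted here; a hypothetical equivalent, written in IOS-style syntax with documentation-range addresses and made-up ASNs, looks roughly like this:)

    router bgp 64511
     neighbor 192.0.2.1 remote-as 64496
     neighbor 192.0.2.1 description EXAMPLE-PUBLIC-PEER
     neighbor 192.0.2.1 password <redacted>
     !
     address-family ipv4
      neighbor 192.0.2.1 activate
      neighbor 192.0.2.1 send-community both
      neighbor 192.0.2.1 soft-reconfiguration inbound
      neighbor 192.0.2.1 remove-private-as
      neighbor 192.0.2.1 prefix-list PEER-IN in
      neighbor 192.0.2.1 prefix-list PEER-OUT out
      neighbor 192.0.2.1 route-map PEER-POLICY-IN in
      neighbor 192.0.2.1 route-map PEER-POLICY-OUT out
      neighbor 192.0.2.1 maximum-prefix 50000 90
     exit-address-family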
That doesn't even include the associated prefix lists (or filter lists or route-maps or ...). All it would take is fat-fingering/typo'ing one of these lines or missing one to cause some very unintended effects.
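A couple of those associated pieces, in the same hypothetical IOS-style sketch:

    ip prefix-list PEER-OUT seq 5 permit 198.51.100.0/24
    ip prefix-list PEER-OUT seq 10 deny 0.0.0.0/0 le 32
    !
    route-map PEER-POLICY-OUT permit 10
     match ip address prefix-list PEER-OUT
     set community 64511:100 additive
    route-map PEER-POLICY-OUT deny 20

Drop the "match" line while pasting and clause 10 matches every route in the table, and the session announces far more than intended.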
Since we don't know exactly what happened, it's easy to say "they should've done this" or "they didn't do that". In reality, however, we simply don't know what they did or didn't do. You've shown no evidence that they didn't do any of the things you mention and, in some cases, you can do all of that and still have things go wrong.
Here's my paraphrasing in more straightforward fashion, since I know at least some of us would like such notices to be more to-the-point and less business-speak-ish:
Root Cause: Incorrect router configuration.
Fix Action: Revert the configuration.
Summary: Someone made the wrong settings on a router and made packets go the wrong way in parts of the US. We changed the settings back to what they were before.
Corrective Actions: We'll try to find ways to avoid doing this again. We know who did it.
I don't, sorry, it's in my inbox. The part I snipped is completely unrelated to the incident.
As far as BGP goes, Halabi's _Internet Routing Architectures_ [0] is pretty much considered the "bible". It's really old nowadays but it covers BGP4 (the current version in use) and not much has really changed.
I'm sure some of the newer BGP books are excellent as well but I can't personally recommend them as IRA and the (old) CCNP BGP book are all I've ever read/used (while preparing for the CCNP certification and in my day job).
Of course, pretty much everything is covered in RFC 4271 [1] (and updates) although the RFCs can be a bit "dry".
I found "BGP4: Inter-domain Routing in the Internet" by John Stewart III [0] to be very approachable. It's likewise ancient by tech world standards (published in 1998!) but that's because, amazingly, BGP has not fundamentally changed in all that time. It's cheap and less than 200 pages; definitely recommend it as a primer.
These large companies need much more SDN in their operations, the way NTT GIN does. I think GIN only has 50 people in their entire organization to run that whole business. They do it by investing heavily in software (and not building out their own fiber all over the place).
Yeah, I saw all of that (and more). Doesn't rule out what I said.
I presume Comcast was advertising those (longer) prefixes to Level 3 to manage traffic flow, and they shouldn't have been propagated to other customers. To do that, you'd typically apply BGP communities (no-export or other, Level3-specific ones) to those prefixes. A lack of those communities on the prefixes would result in them getting propagated to other Level3 peers. When that happened, it would look exactly like, and result in, this route leak.
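As a rough sketch of that mechanism (hypothetical IOS-style config with documentation addresses; the real Level3-specific community values are different and this is not anyone's actual policy), the traffic-engineering more-specifics get tagged on the way out so the upstream won't re-advertise them:

    ! Outbound policy toward the upstream: tag the more-specifics
    ! so they are not re-advertised to the upstream's peers.
    route-map TO-UPSTREAM permit 10
     match ip address prefix-list TE-MORE-SPECIFICS
     set community no-export additive
    route-map TO-UPSTREAM permit 20
    !
    router bgp 64496
     address-family ipv4
      neighbor 192.0.2.1 route-map TO-UPSTREAM out

Leave that "set community" off (or apply the wrong policy to the session) and the more-specifics go out to every peer, which from the outside looks exactly like this kind of route leak.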
Every few months/years I'm reminded that a few dozen to a few hundred people are responsible for routing packets across the entire internet. One typo could affect millions.
Early in the history of the commercial internet in Brazil, we had a couple of issues like this one with the recently privatized telco that operated the big backbone connecting us to other countries. At the time we more or less seriously mused about whether all commercial ISPs should pool their resources and pay a top-tier consultancy to properly configure everything for the telco, provided they never ever touched those routers again.
I don't get it... I thought the Internet was supposed to be decentralized... How is it possible that one or two companies can cause such widespread issues?
BGP is such a great foot gun precisely because the internet is decentralized.
There's no central repository of how to route traffic for an IP [1]. If there was, it would probably mess things up from time to time, but not to such a large extent.
Instead, we just have to kind of trust BGP announcements -- especially if they come from ISPs that credibly could route anything (Level 3, other "tier 1" ISPs).
[1] Actually there are some efforts to develop this. After all, IP allocations are essentially centralized under the five regional internet registries. There are some registries of routing information (RADB is the most well known, I think); but not all ASNs participate, and filtering routes from large transit ISPs is still a major problem.
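To make that concrete: the kind of per-peer inbound filter an operator can generate from IRR data looks something like the sketch below (hypothetical, documentation-range prefixes and ASNs). For a large transit ISP the generated list would be enormous and constantly changing, which is part of why filtering routes from them is still a problem, as mentioned.

    ip prefix-list AS64496-IN seq 5 permit 192.0.2.0/24
    ip prefix-list AS64496-IN seq 10 permit 198.51.100.0/24
    ip prefix-list AS64496-IN seq 15 permit 203.0.113.0/24 le 25
    ! ...one line per route object registered for the peer's AS-SET...
    ip prefix-list AS64496-IN seq 99999 deny 0.0.0.0/0 le 32
    !
    router bgp 64511
     address-family ipv4
      neighbor 192.0.2.1 prefix-list AS64496-IN in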
Any one error only affects a portion of the network. For example, a problem might take Australia or the American East Coast offline, but that doesn't significantly affect the rest of the network (unless they try to reach the affected regions). That was the design goal and it works perfectly.
What you want is entirely different. The European power network, for example, is designed for N+1 redundancy: any single equipment failure doesn't have significant effects. If you take that a bit further, you could include misconfigurations or even allow entire companies to fall out of the network. But each level of assurance requires more overprovisioning to compensate for failed equipment or lost capacity. And overprovisioning is expensive.
It is... but BGP is a sort of privileged system in that once you're peering with another provider, you're trusting them not to put garbage routes out. When they do, those garbage routes propagate. The decentralized-ness is a strength as well as a weakness.
Much like IRC, also a decentralized system, where a rogue server (a server has privileged access) or services package (same) can cause widespread issues across the entire network.
Decentralized does not mean guaranteed availability; however, this failure shows that the internet is somewhat fault tolerant since most packets on the net were routed to the correct destination during that time.