Widespread impact caused by Level 3 BGP route leak (dyn.com)
148 points by pdcerb on Nov 13, 2017 | 34 comments



From elsewhere:

> Root Cause: A configuration issue impacted IP services in various markets across the United States.

> Fix Action: The IP NOC reverted a policy change to restore services to a stable state.

> Summary: The IP NOC was informed of a significant client impact which seemed to originate on the east coast. The IP NOC began investigating, and soon discovered that the service impact was occurring in various markets across the United States. The issue was isolated to a policy change that was implemented to a single router in error while trying to configure an individual customer BGP. This policy change affected a major public peering session. The IP NOC reverted the policy change to restore services to a stable state.

> Corrective Actions: An extensive post analysis review will be conducted to evaluate preventative measures and corrective actions that can be implemented to prevent network impact of this magnitude. The individual responsible for this policy change has been identified.

[snip]

Sounds like "the individual responsible" forgot to set some communities on the peering session. Oops.


Well, there’s the problem right there. There was an individual responsible, meaning one set of hands unreviewed on a keyboard, for a change that could cause a global outage.

If I’m that person responsible, I’m going to hire two staff, ask them each to write the command scripts, justify any differences, produce a consensus script for my review, and then implement it. That seems like the minimum level of responsible engineering. Lint tools, notices of novel commands, and other ornaments have a place too, but the core idea is this:

The person with hands on keyboard is not the individual responsible for this error.


Where did they say the change was not reviewed?


It's implied by their claim that there is a single responsible individual. If someone else reviewed it and said "yep, looks good, deploy the change" how is there one "individual responsible"?

The idea of an isolated root cause or single human error in the failure of complex systems is bogus anyway. I'm a huge fan of the work in this area championed by John Allspaw [0].

[0]: https://www.kitchensoap.com/2012/02/10/each-necessary-but-on...


Indeed. Someone runs the department that individual works in and allowed this kind of uncontrolled process. Someone is that person’s boss and should have been asking how well controlled our processes are, and so on. Turtles all the way up.


This could have been as simple as the individual accidentally "missing" a single line of a large, multi-line configuration when copy/pasting it into the router's console -- after all of the config review, etc., had already occurred.

I'm not sure what vendor's gear was in use in this particular case, but the configs for a BGP peering session are typically (as mentioned above) large, multi-line configurations. For example, here's the (slightly redacted) configuration for one single BGP session on one of my routers:

  neighbor 10.10.10.10 remote-as 65432
  neighbor 10.10.10.10 transport connection-mode passive
  neighbor 10.10.10.10 description TO CUST FOO BAR INC ...
  neighbor 10.10.10.10 ebgp-multihop 3
  neighbor 10.10.10.10 update-source Loopback0
  neighbor 10.10.10.10 send-community
  neighbor 10.10.10.10 soft-reconfiguration inbound
  neighbor 10.10.10.10 prefix-list ACCEPTED-PREFIXES-AS65432 in
  neighbor 10.10.10.10 prefix-list ADVERTISED-PREFIXES-AS65432 out
  neighbor 10.10.10.10 password 7 0123456789ABCDEF0123456789ABCDEF
  neighbor 10.10.10.10 maximum-prefix 200
  neighbor 10.10.10.10 default-originate
That doesn't even include the associated prefix lists (or filter lists or route-maps or ...). All it would take is fat-fingering/typo'ing one of these lines, or missing one, to cause some very unintended effects.
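For a sense of what those look like, the prefix lists referenced above would be something along these lines (with documentation prefixes standing in for the real ones):

  ip prefix-list ACCEPTED-PREFIXES-AS65432 seq 5 permit 192.0.2.0/24
  ip prefix-list ACCEPTED-PREFIXES-AS65432 seq 10 permit 198.51.100.0/23 le 24
  ip prefix-list ADVERTISED-PREFIXES-AS65432 seq 5 permit 203.0.113.0/24
Miss a line in the accept list and you black-hole a customer's prefix; forget to apply the list to the neighbor at all and you'll happily accept (and re-advertise) whatever they send you.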

Since we don't know exactly what happened, it's easy to say "they should've done this" or "they didn't do that". In reality, however, we simply don't know what they did or didn't do. You've shown no evidence that they didn't do any of the things you mention and, in some cases, you can do all of that and still have things go wrong.


Here's my paraphrasing in a more straightforward fashion, since I know at least some of us would like such notices to be more to-the-point and less business-speak-ish:

Root Cause: Incorrect router configuration.

Fix Action: Revert the configuration.

Summary: Someone made the wrong settings on a router and made packets go the wrong way in parts of the US. We changed the settings back to what they were before.

Corrective Actions: We'll try to find ways to avoid doing this again. We know who did it.


Do you have a link for this? Interested in the corrective action.

Also, any suggestions on reading to learn about BGP generally, to the level of detail one might learn TCP/IP from a networking book?


I don't, sorry, it's in my inbox. The part I snipped is completely unrelated to the incident.

As far as BGP goes, Halabi's _Internet Routing Architectures_ [0] is pretty much considered the "bible". It's really old nowadays but it covers BGP4 (the current version in use) and not much has really changed.

I'm sure some of the newer BGP books are excellent as well but I can't personally recommend them as IRA and the (old) CCNP BGP book are all I've ever read/used (while preparing for the CCNP certification and in my day job).

Of course, pretty much everything is covered in RFC 4271 [1] (and updates) although the RFCs can be a bit "dry".

[0]: https://www.amazon.com/dp/157870233X

[1]: https://tools.ietf.org/html/rfc4271


I wrote a routing protocols tutorial a while ago that tries not to be overly complex or "dry".

https://github.com/knorrie/network-examples/blob/master/READ...

It uses the bird routing daemon on Linux to build up some networks as you go and watch OSPF and BGP in action.

Maybe it can help you a bit. :-)


I know this is late, but I’ve also been going through this, learned a ton. Thanks a lot!


Thank you for writing and posting this, I'm really enjoying it (and really enjoying the writing style) :-)


I found "BGP4: Inter-domain Routing in the Internet" by John Stewart III [0] to be very approachable. It's likewise ancient by tech world standards (published in 1998!) but that's because, amazingly, BGP has not fundamentally changed in all that time. It's cheap and less than 200 pages; definitely recommend it as a primer.

[0]: https://www.amazon.com/dp/0201379511/


These large companies need much more SDN in their operations, like NTT GIN. I think GIN has only 50 people in their entire organization to run that whole business. They do it by investing heavily in software (and not building out their own fiber all over the place).


> Sounds like "the individual responsible" forgot to set some communities on the peering session.

Not exactly: https://bgpstream.com/event/112734


Yeah, I saw all of that (and more). Doesn't rule out what I said.

I presume Comcast was advertising those (longer) prefixes to Level 3 to manage traffic flow, and they shouldn't have been propagated to other customers. To do that, you'd typically apply BGP communities (no-export or other, Level3-specific ones) to those prefixes. A lack of those communities on the prefixes would result in them getting propagated to other Level3 peers. When that happened, it would look exactly like, and result in, this route leak.
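As a rough sketch of what that tagging looks like (IOS-style, with a hypothetical prefix-list name and documentation prefixes -- Level3's actual community scheme would differ):

  ! tag the customer's traffic-engineering more-specifics on ingress so they
  ! are used inside the AS but never re-advertised to any eBGP peer
  ip prefix-list COMCAST-TE-PREFIXES seq 5 permit 198.51.100.0/22 ge 23
  !
  route-map FROM-CUSTOMER permit 10
   match ip address prefix-list COMCAST-TE-PREFIXES
   set community no-export additive
  route-map FROM-CUSTOMER permit 20
  !
  neighbor 10.10.10.10 route-map FROM-CUSTOMER in
Routes carrying the well-known no-export community are never advertised to eBGP neighbors, so dropping that one "set community" line (or the route-map that applies it) is exactly the kind of single-line slip that turns into a leak like this.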


Ah, thank you very much for the explanation :)


Every few months/years I'm reminded that dozens to a few hundred people are responsible for routing packets across the entire internet. One typo could affect millions.


It took me a while to realize that "Level 3" is a company name.


I know it's a company because they regularly seem to mess something up and there is a post about it on /r/networking: https://www.reddit.com/r/networking/search?q=level+3&restric...

It's so common that someone even made a website about it: http://fuckinglevel3.com/


Remember when they nearly broke the internet back in (IIRC) '06, when they had that peering spat with Cogent?


Telia de-peering with Cogent was bigger. And as others have pointed out, Cogent is frequently a bad partner in these disputes.


Given the number of peering spats that Cogent has been in, is it fair to blame the other party when one comes up?



It's not anymore; it's CenturyLink now. They closed the acquisition on Nov 1st.


Level 3 v Layer 3 ;)


Early in the history of the commercial internet in Brazil we had a couple of issues like this one with the recently privatized telco that operated the big backbone connecting us to other countries. At the time we more or less seriously mused about whether all commercial ISPs should pool their resources and pay a top-tier consultancy to properly configure everything for the telco, provided the telco never ever touched those routers again.


"Machine learning classifier predicts which route announcements are legitimate and which ones are erroneous" <- headline I'd like to see


I don't get it... I thought the Internet was supposed to be decentralized... How is it possible that one or two companies can cause such widespread issues?


BGP is such a great foot gun precisely because the internet is decentralized.

There's no central repository of how to route traffic for an IP [1]. If there was, it would probably mess things up from time to time, but not to such a large extent.

Instead, we just have to kind of trust BGP announcements -- especially if they come from ISPs that credibly could route anything (Level 3, other "tier 1" ISPs).

[1] Actually there are some efforts to develop this. After all, IP allocations are essentially centralized under the five regional internet registries. There are some registries of routing information (RADB is the most well known, I think); but not all ASNs participate, and filtering routes from large transit ISPs is still a major problem.


Any one error only affects a portion of the network. For example a problem might take Australia or the American East Coast offline, but that doesn't significantly affect the rest of the network (unless they try to reach affected regions). That was the design goal and it works perfectly.

What you want is entirely different. The European power network for example is designed for (n+1) redundancy: any one equipment failure doesn't have significant effects. If you take that a bit further you could include misconfigurations or even allow entire companies to fall out of the network. But each level of assurance requires more overprovisioning to compensate for failed equipment or lost capacity. And overprovisioning is expensive.


It is... but BGP is a sort of privileged system: once you're peering with another provider, you're trusting them not to put out garbage routes. When they do, those garbage routes propagate. The decentralized-ness is a strength as well as a weakness.

Much like IRC, also a decentralized system, where a rogue server (a server has privileged access) or services package (same) can cause widespread issues across the entire network.


Decentralized does not mean guaranteed availability; however, this failure shows that the internet is somewhat fault tolerant since most packets on the net were routed to the correct destination during that time.


Because it's not decentralized.



