I don't know much about hardware at all, but aren't routers fairly simple, time-tested pieces of hardware? Can they really corrupt en masse in this way?
Routers at this level aren't just scale-ups of your home wifi/NAT box. They aren't even scale-ups of the simple IP routers in a basic IT data closet that manage subnets and the like (already much more complex, dealing with VLAN, subnet, DMZ and VPN issues). At the level of a big networking company they are a truly complex beast.
Just at the IP level they have to deal with (at the edges and across substantial WANS) BGP - a notoriously ugly and fragile protocol. Internal routing protocols such as OSPF are equally ugly and prone to breakage. Many are the tales of some small company misconfiguring their edge routers slightly (say a 1 char typo) and having the entire internet route through their T1, across their lan, out their backup T1. Other issues are BGP flapping, resulting in scary percentages of lost traffic. This doesn't even cover trickier stuff like routing loops...
Other considerations in big routers are things like ASN identifiers and peering points. Factors like traffic cost, SLAs and QoS all go into traffic balancing on such routers. MPLS clouds complicate (and oddly enough simplify) these things as well.
There are also important issues like Anycast, CDN and NAT that largely rely on router tricks and add to the complication.
Finally, on top of all this, there is the security concern - you can't just throw a firewall in front of it, because many firewall issues are routing issues and therefore must also be handled in the router.
All these layers interact and affect each other. Any given machine can only handle so much traffic and so many decisions, so something that is drawn as a single router on a networking chart may actually be several boxes cascaded to handle the complexity.
Oh yeah, and switches are getting progressively smarter, with extra rules and weirdnesses that provide horribly leaky abstractions - things that shouldn't matter to the upstream router, but turn out to add configuration issues and overall complexity on top of it all.
> Many are the tales of some small company misconfiguring their edge routers slightly (say a 1 char typo) and having the entire internet route through their T1, across their lan, out their backup T1.
This is what route filters are for. If you peer with someone and they advertise 0.0.0.0/0 or something equally ridiculous, and you accept this as a valid route, then you deserve to fail (and deserve a firm stare if you then proceed to advertise it to other peers).
A similar fail on the part of Telstra (http://bgpmon.net/blog/?p=554) was to blame for much of Australia dropping off the map earlier this year.
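To make the route-filter idea concrete, here is a minimal sketch in Python of the kind of sanity check an inbound filter performs before a peer's advertisement is accepted. The bogon list and the prefix-length limits are illustrative defaults I've chosen, not a complete real-world policy:

```python
import ipaddress

# Illustrative bogon list: prefixes a peer should never legitimately advertise.
BOGONS = [
    ipaddress.ip_network("0.0.0.0/8"),    # "this" network
    ipaddress.ip_network("10.0.0.0/8"),   # RFC 1918 private space
    ipaddress.ip_network("127.0.0.0/8"),  # loopback
    ipaddress.ip_network("224.0.0.0/4"),  # multicast
]

def accept_route(prefix: str) -> bool:
    """Return True if an advertised prefix passes basic sanity checks."""
    net = ipaddress.ip_network(prefix)
    if net.prefixlen == 0:
        return False  # a peer advertising 0.0.0.0/0: reject outright
    if net.prefixlen > 24:
        return False  # overly specific prefixes are often leaks
    # reject anything inside a bogon range
    return not any(net.subnet_of(bogon) for bogon in BOGONS)
```

With a filter like this in place, the 0.0.0.0/0 advertisement from the comment above simply never enters the routing table, let alone gets re-advertised to other peers.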
Also:
> Just at the IP level they have to deal with (at the edges and across substantial WANS) BGP - a notoriously ugly and fragile protocol.
I take offence at this. 80% of BGP-related issues are due to misconfiguration by a given party, 19% are due to bad or missing route filters, and the other 1% is due to bugs in router software. The actual implementation of BGP v4, originally designed back in the early '90s, isn't completely without its issues and behavioural quirks (I'm looking at you, route flaps), but the theory/algorithm behind it is a work of art, and has coped amazingly with explosive growth - growth that's only going to increase with IPv6.
Without it, there would be no HN.
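For the route flaps mentioned above: routers commonly counter them with flap dampening, where each flap adds a penalty that decays exponentially and a route is suppressed while the penalty stays high. A rough sketch of that mechanism (in the spirit of RFC 2439; the constants here are typical textbook defaults, not anything from this thread):

```python
import math

PENALTY_PER_FLAP = 1000   # penalty added per flap (illustrative)
SUPPRESS_LIMIT = 2000     # suppress the route above this penalty
HALF_LIFE = 900.0         # seconds for the penalty to decay by half

def penalty_after(flap_times, now):
    """Total penalty at time `now`, each flap decaying exponentially."""
    decay = math.log(2) / HALF_LIFE
    return sum(PENALTY_PER_FLAP * math.exp(-(now - t) * decay)
               for t in flap_times)

def suppressed(flap_times, now):
    """A route is withheld from use while its penalty is too high."""
    return penalty_after(flap_times, now) >= SUPPRESS_LIMIT
```

So a route that flaps three times in quick succession gets suppressed, while one old flap decays back below the threshold and the route is reused.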
Thanks for your views on BGP - a lot of my knowledge of it comes from post-incident beers with our network grey-beards when I worked at an ISP, so my views are probably somewhat biased. There is nothing like a network explosion at an ISP to get people ranting about BGP, but I'll admit that the ranting is from a place of anger and frustration, and largely venting rather than a fair technical assessment.
router tables are stationary woodworking machines in which a vertically oriented spindle of a woodworking router protrudes from the machine table and can be spun at speeds typically between 3,000 and 24,000 rpm
excluding static routes (which are then usually advertised to other peers), routING tables are dynamically built and only exist in non-persistent memory.
having up to date backups of router configuration is another matter entirely
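As a rough sketch of what "dynamically built, in non-persistent memory" means in practice (not any vendor's implementation), a routing table learned at runtime with longest-prefix-match lookup might look like this; the route names and prefixes are made up for illustration:

```python
import ipaddress

class RoutingTable:
    """In-memory routing table: entries are learned at runtime and lost on reboot."""

    def __init__(self):
        self.routes = []  # list of (network, next_hop) pairs

    def learn(self, prefix: str, next_hop: str):
        # e.g. a route received via OSPF or BGP
        self.routes.append((ipaddress.ip_network(prefix), next_hop))

    def lookup(self, addr: str):
        ip = ipaddress.ip_address(addr)
        matches = [(net, hop) for net, hop in self.routes if ip in net]
        if not matches:
            return None
        # longest-prefix match: the most specific route wins
        return max(matches, key=lambda m: m[0].prefixlen)[1]
```

A default route (0.0.0.0/0) then naturally loses to any more specific prefix, which is exactly the longest-prefix-match behaviour real forwarding tables implement (in hardware, for speed).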
At a first level, routers will cache which machines are on which port, so they don't have to forward a packet out every port (reducing network load). This is normally done using MAC addresses.
On more expensive routers, the router can understand IGMP, i.e. multicast, and route based on multicast joins. This is an optimization over simple broadcast: not every machine on the network needs to see these packets, but multiple machines might want to.
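The join-based forwarding described above can be sketched the same way (again a toy model; group addresses and port numbers are illustrative): the device tracks which ports have joined each multicast group, and a multicast packet goes only to those ports instead of everywhere.

```python
class IgmpSnooper:
    """Toy model of IGMP-aware forwarding: track joins per multicast group."""

    def __init__(self):
        self.groups = {}  # multicast group address -> set of joined ports

    def join(self, group, port):
        # a host on `port` sent an IGMP membership report for `group`
        self.groups.setdefault(group, set()).add(port)

    def leave(self, group, port):
        # a host on `port` left the group
        self.groups.get(group, set()).discard(port)

    def forward_ports(self, group):
        # only ports with interested receivers see the packet
        return sorted(self.groups.get(group, set()))
```

A group nobody has joined forwards to no ports at all, which is the whole win over broadcasting to everyone.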
Never mind when you start adding fault-tolerance and failover setups/configs to the mix.
It gets really complex really fast, and even the slightest lapse in attention to extreme detail can be disastrous and a nightmare to troubleshoot.
For instance, I once had a router reboot for some unknown reason, and the firmware/config wasn't properly flashed, so it reverted to the last point release of the firmware. That was enough to cause the failover heartbeat to be constantly triggered, and the master/slave routers just kept failing over to each other and corrupted all the ARP caches. Makes for a lot of fun when things work and then stop working in an inconsistent manner.
They're not that simple due to aggressive speed optimizations like caching frequently used routing information in hardware registers. If the cache gets corrupted due to electronic bogon flux, bad things happen. These events are rare (like once every billion operating hours) so you can imagine it's hard for the router vendors to find & fix them all.
It is a little sad that they didn't try rebooting the routers for so many hours.
> It is a little sad that they didn't try rebooting the routers for so many hours.
This assumes they're being literal and "corrupt" means "a bit randomly flipped" instead of using "corrupt" in the figurative sense of "operator error".