Does anyone know details on the throughput or packets per second?
To me, this smells so much of gross negligence on the part of Dyn executives and all the unicorn web executive teams for single-sourcing.
I present a counterexample and encourage people to go research the architecture of Verisign. I attended a talk by Verisign, which runs .com and .net as well as root servers. They are a constant DDOS target. They are necessarily a single point of failure, appointed by ICANN to perform these services for the most common TLDs in the world. If they mess up, they will probably lose that status. They've had over 16 years of uptime.
Every layer of the stack is dual or triple sourced. Two server makers, two generations, two router vendors, two switch vendors, two network architectures, POP diversity, peering diversity. Services and capacity always added in pairs. Two separate NOC and ops teams. FreeBSD, Linux, and Solaris. NSD, BIND, and an internally developed userspace stack on Netmap. Code upgrades deployed in halves. Everything is structured to survive at least a full halving for The Big One zero day that I don't think we've really seen yet.
They were doing 10gbps of DNS on a single commodity server 3 years ago. This makes it easy to absorb any DDOS and gradually clamp it at the public peering points.
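To put that figure in packets-per-second terms (the question at the top of the thread), here is a back-of-envelope sketch; the packet sizes are my own assumptions, not numbers from the talk:

```python
# Rough pps math for 10 Gbps of DNS on one box (packet sizes are illustrative guesses).
LINE_RATE_BPS = 10e9                       # 10 Gbps per server, per the talk

ASSUMED_SIZES = {
    "small query":     90,    # bytes on the wire: Ethernet + IP + UDP + DNS
    "modest response": 250,   # a response carrying a few records
}

for label, size in ASSUMED_SIZES.items():
    pps = LINE_RATE_BPS / (size * 8)
    print(f"{label:16s} ~{size:3d} B  ->  ~{pps/1e6:4.1f} Mpps at 10 Gbps")
# Roughly 5-14 million packets per second per server, depending on the packet mix.
```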
The above should be standard operations structure for any breakaway success web business. It's not that hard, but you have to claw the charlatans out of the management chain and put in professionals that take the career seriously. Professionals.
What's really appalling is hearing that some unicorn web biz uses one cloud vendor and one service provider like Dyn. This is absolutely trivial stuff to multi-source, and it has almost no effect on OpEx.
Last I heard from a Dyn insider was they were eliminating FreeBSD and had a brain drain a few years ago. Overall pretty unsurprising outcome.
Internet reliability is no more complex, and certainly much cheaper, than analogous critical infrastructure like electrical generation for a region. I am constantly disappointed in this field for its failure to recognize and install professionals into management. We've been doing Internet architecture in modern form for 30 years. Get your shit together.
What kind of single-sourcing do you think was an issue on Friday, and how does double-sourcing help against pure DDOS attacks? I haven't read anything about a specific bug being exploited.
It's hard to know until/unless they are fully transparent. It was almost certainly inside plant resource exhaustion, otherwise you wouldn't need an IoT botnet to cause it.
Multisourcing gives you at least path diversity in this case. Hopefully application software performance as well.
This is handled in the Verisign example by basic mechanical sympathy; when your machines can push 10gbps (probably 4x that in the years since the talk), it is economically silly not to simply eat the DDOS.
Eating a DDOS is always the first mark of maturity. You can then improve with additional network intelligence. The most common way is sample-based packet analysis. You match an attack signature and feed it into your gear at the peering points using BGP with a blackhole destination. You are now filtering at ASIC speed, terabits per second on a modern chassis switch or router.
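For concreteness, here is a minimal sketch of one common flavor of that loop (source-based RTBH, per RFC 5635/7999), assuming sFlow-style samples arrive as (src, dst, packets) tuples and that the announcement strings are fed to something like ExaBGP; the threshold, the /24 aggregation, and the next-hop are illustrative assumptions, not anyone's production configuration:

```python
from collections import Counter
from ipaddress import ip_network

BLACKHOLE_COMMUNITY = "65535:666"   # the well-known BLACKHOLE community (RFC 7999)
PPS_THRESHOLD = 100_000             # illustrative per-/24 trigger, not a tuned value

def noisy_source_prefixes(samples, sampling_rate=1000):
    """Estimate per-source-/24 packet rates from one second of sampled flow records.

    samples: iterable of (src_ip, dst_ip, packets) tuples, sFlow-style.
    """
    estimated = Counter()
    for src, _dst, pkts in samples:
        prefix = ip_network(f"{src}/24", strict=False)
        estimated[prefix] += pkts * sampling_rate        # scale the samples back up
    return [p for p, pps in estimated.items() if pps > PPS_THRESHOLD]

def blackhole_announcement(prefix):
    # One announcement per offending prefix, roughly in the shape ExaBGP accepts on
    # its API pipe; routers at the peering edge that honor the BLACKHOLE community
    # then drop the matching traffic in hardware.
    return f"announce route {prefix} next-hop 192.0.2.1 community [{BLACKHOLE_COMMUNITY}]"

# for prefix in noisy_source_prefixes(flow_samples):
#     print(blackhole_announcement(prefix))
```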
I re-read your initial comment and apparently I misinterpreted what you wrote, so thanks for the high-quality answer despite this. (You didn't explicitly call out Dyn for not double-sourcing, only their customers, and you only used it as an example of good practices in the Verisign case. Since you mentioned it a lot, I gave it too much weight and read it as "Dyn failed because they didn't double-source", for which I see no evidence.)
So much this - but even worse - a particular unicorn whose primary business is identity management. The DNS glue records ALL point to AWS/Route53, and the actual zone file NS records ALL point to Dyn. I looked at that and went WTF were you thinking?!?! It's one thing to give yourself a SPOF, but why would you give yourself 2 SPOFs?
Top management doesn't care because they have no downside. By optimizing costs they can get additional bonuses. Even if they sink the company, they still walk away rich and ready to screw someone else.
Middle managers are too busy saving their own asses to ask questions. If one poor guy musters the courage to speak up, he is immediately shut down in the name of falling in line. That's the usual response from top management when they don't have an answer.
This mirrors my experience. And for public companies, the Board of Directors is not attuned to these outcomes. For the most part B2B customers are in the same boat, so everybody feigns an uncomfortable laugh or some moderate outrage or displeasure, puts out a silly postmortem with no intent to rectify the culture that enabled it, and continues collecting an outsized salary. I adore computers, but I really wish I could lateral into some career like construction management, where the layman is much more aware of success and failure, and of who is good and who is bad at their profession.
Where are you getting all this? I have a thought - perhaps it's not that everyone but you and your favourite organization is automatically incompetent! Perhaps there are reasons people do things! Perhaps there is more to something than what you can immediately glean! You could wait to learn more before passing judgement, but it seems you don't feel like doing that.
> they were eliminating FreeBSD and had a brain drain a few years ago. Overall pretty unsurprising outcome.
Thus they must be incompetent, because in their narrow minded foolishness, they don't understand that it is the Superior Operating System. There could be no other reason to do what they did. It couldn't have made sense for some other reason. Right? I also appreciate the BSDs for what they are, but this is so condescending. This whole post is so condescending.
> Internet reliability is no more complex, and certainly much cheaper, than analogous critical infrastructure like electrical generation for a region
Not if you're trying to also make it cost effective? Electrical generation has well understood usage patterns, hasn't grown that much in the last few decades and doesn't have the kind of demand spikes that happen on the internet. Plus, power companies are mostly monopolies and charge whatever they want. It's an apples and oranges comparison. Also, electrical grids fail from time to time too! See the northeast blackout of 2003.
Verisign has a monopoly similar to the power companies' and charges what amounts to a hefty tax on domains for all the services it provides. Without arguing about whether this is good or bad, we can say that Verisign is in effect pre-subsidized to provide however complex / redundant / reliable a service it wants to provide.
Where am I getting what? My argument is that weak leadership causes the problems we see. You sound personally affronted because I am pointing out a view of the industry that isn't rose-tinted, where we are all geniuses and Silicon Valley-inspired culture solves all the hard problems with slam dunks. These problems are preventable, and the tech industry is actually a wasteland of poor understanding and repeated organizational and managerial failure. These failures are ignored because everyone else is almost as bad, and most people can't judge the good from the bad. This isn't cutting-edge research: you carbon-copy a competent company's design, and they will probably explain it publicly if you ask nicely. There are forums like the IETF and NANOG where you can exchange these practices, and professionals do that to build the profession.
Swap out BSD for anything you please; just don't single-source when you've hit breakout success and revenue loss from failure is material. I'm not a zealot - I even think WinNT is a pretty nice kernel and could fit into this picture with the new Microsoft. The point is that you shouldn't be holding up whole percentage points of worldwide GDP on one kernel with a very poor security track record, and that describes every modern, widely used kernel. This is insane; I hope you can take that away from this thread at least.
The cost of designing against common-mode failure is inconsequential when properly budgeted against the cost of actual failure. Any company in the Fortune 500, non-tech included, could afford to create a DDOS-proof DNS and web platform. Luckily they shouldn't have to, if the service providers acted more like Verisign. And it scales down too: it would cost at most a few hundred dollars a year for a one-person company to multisource cloud providers, and it will likely be a wash since IaaS is most commonly usage-billed.
Your sense of mechanical sympathy is not aligned with the advances in hardware and internet infrastructure. It would be possible to handle a 1.2 Tbps DDOS in a single IX like LON, AMS, FRA, LGA, IAD, ORD, ATL, MIA, SJC, LAX, SIN, HKG, ICN, or KIX, let alone worldwide. It would take a lot more active management and work across companies if someone were clogging regional interconnect to a spoke, but that would not cause a worldwide outage.
1.2 Tbps is 120 of Verisign's 2013 servers running that code, maybe a million dollars of CapEx, plus 12 100G peering ports, about $150k monthly at bulk transit pricing if for some reason you can't do any settlement-free peering. Yes, there are other costs, but it is easy to scale and well within the means of a midsize business, let alone unicorns.
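For what it's worth, the arithmetic behind those numbers, using the 10 Gbps-per-server figure from the talk; the per-server and per-port prices are my own ballpark assumptions, not quotes:

```python
# Back-of-envelope sizing for absorbing a 1.2 Tbps attack with 2013-era servers.
ATTACK_BPS = 1.2e12          # reported attack size
SERVER_BPS = 10e9            # one commodity DNS server, per the Verisign talk
PORT_BPS   = 100e9           # one 100GbE peering port

servers = ATTACK_BPS / SERVER_BPS        # 120 servers
ports   = ATTACK_BPS / PORT_BPS          # 12 x 100G ports

SERVER_COST_USD  = 8_000                 # assumed CapEx per server
PORT_MONTHLY_USD = 12_500                # assumed bulk transit per 100G port, monthly

print(f"{servers:.0f} servers     -> ~${servers * SERVER_COST_USD:,.0f} CapEx")
print(f"{ports:.0f} x 100G ports -> ~${ports * PORT_MONTHLY_USD:,.0f}/month if none is settlement-free")
```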
The thing about DDOS: it's a lot harder for the attacker to get aggregate bandwidth; those eyeball networks will bottleneck a lot sooner than 1000-meter runs of fiber between peers at an IX. And there are some naturally imbalanced ratios, like CDNs sitting on dozens of Tbps of unused ingress, that could jump into offering services for other companies that don't want to figure it all out.
The style of this post-mortem is quite different from a typical post-mortem of an attack on a large tech company such as AWS, Heroku, or GitHub.
There is very little technical detail on the investigation and mitigation process beyond phrases like "the NOC team was able to mitigate the attack and restore service to customers" and "...but was mitigated in just over an hour".
Why was this written by a Chief Strategy Officer rather than someone with more technical knowledge and insight?
> We observed 10s of millions of discrete IP addresses associated with the Mirai botnet that were part of the attack. (linked article)
> ... the Mirai botnet was at about 550,000 nodes, and that approximately 10 percent were involved in the attack on Dyn (from Level 3 CISO) [0]
Something really doesn't add up there - even if it turned out 100% of infected hosts in the Mirai botnet were targeting Dyn (i.e. 550,000 nodes), that is still a fraction of the number Dyn is claiming.
I couldn't use twitter or github from noon till 4. If they mitigated the third attack and no customers were affected, why were github and twitter still down for me?
Can confirm as well. GitHub was inaccessible until about 5PM, and I'm in the southwest. My provider peers with Level3, so it looks like nearly everything gets routed through Los Angeles.
> We observed 10s of millions of discrete IP addresses associated with the Mirai botnet that were part of the attack.
OK, so why don't Dyn staff identify owners of all IP subnets represented, and provide each with a list of participating IP addresses? That's a trivial exercise, right? And maybe they could publicly shame ISPs that didn't act on the information.
Why would they, though? With ISPs charging by the GB these days, they have a financial incentive not to tell you your refrigerator (or any other device) is part of a botnet; that's free money for them.
At some point, if the Internet is going to continue to function, one of the likely outcomes is that these customers wake up some morning and the only thing they can get is a web page telling them they have a choice: disable their fridge's Internet of Shit device (or its connection), or do without their full connection to the greater Internet.
Compare to, for example, our stopping our use of lead in paint and gasoline/petrol.
That's a good analogy. In many places, owners of lead-contaminated buildings must remediate. One issue, of course, is poisoning of resident children. But another is contamination of the surrounding environment. Paint manufacturers are also liable, but that's a separate issue.
In this case the IPs probably aren't spoofed, but there's always the chance that you're blaming IPs that didn't engage in the attack because somebody just put them in randomly as the source address of their packets.
Probably not in cases where the device is behind a NAT/router that happens to forward remote access ports that lead to the initial compromise. Then you'd be nmapping the customer's endpoint. Plus, it'd be too time consuming. Tens of millions of participating IPs spread across countless ISPs...
Good point. But still, even with risk of false positives, I believe that ISPs should be informed. And ISPs could acknowledge uncertainty when informing customers.
ISPs already dun users about "illegal" file sharing because of pressure from the MPAA etc. These IoT botnets are just testing their capabilities, and they're already formidable. Going after operators is good, but it's so damn easy to hide. It seems that IoT botnets are mainly using direct attacks, so they're easy to identify. And why would ISPs want to support botnets? Don't ISP ToS typically say that attacks, threats, and spam are grounds for service termination?
Maybe this isn't Dyn's job. Maybe some neutral NGO should handle it. But seriously, this is something I could do at home, on an old gaming box in MySQL - if I had the list of IP addresses, anyway. As for why: how about acting in the public interest? Or good PR?
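For what it's worth, the aggregation step really is that simple; a minimal sketch, assuming you already have the raw list of attacking IPs, with the owner/abuse-contact lookup deliberately left out because that depends on external WHOIS/routing-registry data:

```python
from collections import defaultdict
from ipaddress import ip_address, ip_network

def group_by_prefix(attacker_ips, prefix_len=24):
    """Bucket attacking IPv4 addresses into /24s so each network owner can be
    handed one concise list instead of millions of raw addresses."""
    buckets = defaultdict(list)
    for raw in attacker_ips:
        ip = ip_address(raw)
        net = ip_network(f"{ip}/{prefix_len}", strict=False)
        buckets[net].append(ip)
    return buckets

def report(buckets, min_hosts=5):
    # The owner/abuse-contact lookup (WHOIS, routing registry, etc.) is not
    # sketched here; this just prints the per-prefix counts to hand over.
    for net, hosts in sorted(buckets.items(), key=lambda kv: -len(kv[1])):
        if len(hosts) >= min_hosts:
            print(f"{net}: {len(hosts)} attacking hosts")

# report(group_by_prefix(open("attack_ips.txt").read().split()))
```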
Dyn states their problems were fixed at 17:00 UTC, but I and most other people in the Netherlands were seeing issues the whole evening. Something about their story does not add up...
I would really like it if Dyn could substantiate the 1.2 Tbps number, or say whether it was actually higher.
As these attacks grow in scale, it becomes more and more important to know if this was an attack at record capacity or if it was just that Dyn had lower capacity hardware/links in place. If we're already facing, for example, 2Tbps attacks, a lot has to be done to make mitigating these attacks easier, either through hardware or strategic upgrades.
"Again, at no time was there a network-wide outage, though some customers would have seen extended latency delays during that time."
That can't be true. Here on the west coast, I know of over 10 "name-brand" sites that were absolutely down for over an hour. The east coast was apparently hit even harder and for longer.
Are you sure that it was Dyn that was down, or if there was a break somewhere in the nameserver chain?
I know that the digs I was running periodically were getting SERVFAIL responses from Google DNS even though my local nameservers were actually succeeding.
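A quick way to reproduce that kind of check is to ask the same name through several recursive resolvers and see which path fails. This sketch assumes the dnspython (2.x) package; the resolver addresses and the hostname are just examples:

```python
import dns.resolver          # pip install dnspython (2.x); an assumption, not from the thread
import dns.exception

RESOLVERS = {
    "google":     "8.8.8.8",
    "cloudflare": "1.1.1.1",
    "local":      "192.168.1.1",   # replace with your ISP/LAN resolver
}

def check(name="github.com"):
    for label, server in RESOLVERS.items():
        r = dns.resolver.Resolver(configure=False)
        r.nameservers = [server]
        r.lifetime = 3                      # seconds before giving up
        try:
            answer = r.resolve(name, "A")
            print(f"{label:10s} OK      {[a.to_text() for a in answer]}")
        except dns.exception.DNSException as e:
            print(f"{label:10s} FAILED  {type(e).__name__}: {e}")

if __name__ == "__main__":
    check()
```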
This should be a warning to everyone that was affected by this. Even the dependencies that you don't normally think about go down at some point. For DNS, make sure you are using multiple DNS providers.
> For DNS, make sure you are using multiple DNS providers.
In this case, that would have made it worse.
If your domain CNAMEs off to another provider (as nearly all SaaS solutions do, as do mappings to AWS servers, etc.), then you would have been affected by the attack on Dyn regardless of whether you could change your nameservers in time or had multiple providers at that level.
If you don't have any CNAMEs, then I think the better choice is to go same-origin same-provider for everything.
Which may sound bizarre... my personal sites (a lot of forums) would either be up or down. I would argue that availability having a binary nature is a lot better for end users than a constantly broken half-state that is frustrating or impossible to use, even though a graph might show partial availability. A hard, sudden failure shows up better in monitoring, triggers end-user feedback faster, and is in many ways easier to debug and solve.
With so much CNAME'd, and SaaS still growing, replicating and failing over DNS that sits above those wouldn't buy you anything.
A lot of sites lost Zendesk (their support. subdomain), StatusPage (their status. subdomain), and PagerDuty (which stayed quiet because no one could reach it to report anything). They couldn't even help the end users who were having a bad time.
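To see how deep those hidden dependencies run, you can walk a hostname's CNAME chain and list the nameservers each hop's zone relies on; a rough sketch, again assuming dnspython (2.x), with the hostname a placeholder:

```python
import dns.resolver   # pip install dnspython (2.x)

def dns_dependencies(name, max_hops=10):
    """Follow CNAMEs from `name` and report the NS set serving each hop's zone,
    which is where a 'multi-provider' setup can still funnel into one provider."""
    seen = []
    for _ in range(max_hops):
        zone = dns.resolver.zone_for_name(name)
        ns = sorted(r.target.to_text() for r in dns.resolver.resolve(zone, "NS"))
        seen.append((str(name), zone.to_text(), ns))
        try:
            cname = dns.resolver.resolve(name, "CNAME")
            name = cname[0].target.to_text()     # follow the chain one hop further
        except dns.resolver.NoAnswer:
            break                                # end of the chain: the final A/AAAA owner
    return seen

# for host, zone, ns in dns_dependencies("support.example.com"):
#     print(f"{host} -> zone {zone} served by {ns}")
```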
I haven't seen a comment from any IP-transit/peering company that provides service to them. In fact, I haven't seen any of their infrastructure providers comment at all. This is starting to seem more like a marketing campaign, or a human error (or action?) being blamed on a DDoS.
But it can also be substantially more costly than it's worth, and a lot of the cost isn't very visible (people finding it marginally harder to get their job done, being unable to learn from informed discussion on HN, &c).