Does anyone know details on the throughput or packets per second?
To me, this smells so much of gross negligence on the part of Dyn executives and all the unicorn web executive teams for single-sourcing.
I present a counterexample and encourage people to go research the architecture of Verisign. I attended a talk by Verisign, which runs .com and .net as well as root servers. They are a constant DDOS target. They are necessarily a single point of failure, appointed by ICANN to perform these services for the most common TLDs in the world. If they mess up, they will probably lose that status. They've had over 16 years of uptime.
Every layer of the stack is dual or triple sourced. Two server makers, two generations, two router vendors, two switch vendors, two network architectures, POP diversity, peering diversity. Services and capacity always added in pairs. Two separate NOC and ops teams. FreeBSD, Linux, and Solaris. NSD, BIND, and an internally developed userspace stack on Netmap. Code upgrades deployed in halves. Everything is structured to survive at least a full halving for The Big One zero day that I don't think we've really seen yet.
They were doing 10gbps of DNS on a single commodity server 3 years ago. This makes it easy to absorb any DDOS and gradually clamp it at the public peering points.
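To put that figure in packets-per-second terms (the question at the top of the thread), here is a back-of-envelope sketch; the packet sizes are my own assumptions, not numbers from the talk:

```python
# Rough pps math for 10 Gbps of DNS on one box (packet sizes are illustrative guesses).
LINE_RATE_BPS = 10e9                       # 10 Gbps per server, per the talk

ASSUMED_SIZES = {
    "small query":     90,    # bytes on the wire: Ethernet + IP + UDP + DNS
    "modest response": 250,   # a response carrying a few records
}

for label, size in ASSUMED_SIZES.items():
    pps = LINE_RATE_BPS / (size * 8)
    print(f"{label:16s} ~{size:3d} B  ->  ~{pps/1e6:4.1f} Mpps at 10 Gbps")
# Roughly 5-14 million packets per second per server, depending on the packet mix.
```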
The above should be standard operations structure for any breakaway success web business. It's not that hard, but you have to claw the charlatans out of the management chain and put in professionals that take the career seriously. Professionals.
What's really appalling is hearing that some unicorn web biz uses one cloud vendor and one service provider like Dyn. This is absolutely trivial stuff to multi-source, and it has almost no effect on OpEx.
Last I heard from a Dyn insider was they were eliminating FreeBSD and had a brain drain a few years ago. Overall pretty unsurprising outcome.
Internet reliability is no more complex, and certainly much cheaper, than analogous critical infrastructure like electrical generation for a region. I am constantly disappointed in this field for its failure to recognize and install professionals into management. We've been doing Internet architecture in modern form for 30 years. Get your shit together.
What kind of single-sourcing do you think was an issue on Friday, and how does double-sourcing help against pure DDOS attacks? I haven't read anything about a specific bug being exploited.
It's hard to know until/unless they are fully transparent. It was almost certainly inside plant resource exhaustion, otherwise you wouldn't need an IoT botnet to cause it.
Multisourcing gives you at least path diversity in this case. Hopefully application software performance as well.
This is handled in the Verisign example by basic mechanical sympathy; when your machines can push 10gbps (probably 4x that in the years since the talk), it is economically silly not to simply eat the DDOS.
Eating a DDOS is always the first mark of maturity. You can then improve with additional network intelligence. The most common way is sample-based packet analysis. You match an attack signature and feed it into your gear at the peering points using BGP with a blackhole destination. You are now filtering at ASIC speed, terabits per second on a modern chassis switch or router.
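For concreteness, here is a minimal sketch of one common flavor of that loop (source-based RTBH, per RFC 5635/7999), assuming sFlow-style samples arrive as (src, dst, packets) tuples and that the announcement strings are fed to something like ExaBGP; the threshold, the /24 aggregation, and the next-hop are illustrative assumptions, not anyone's production configuration:

```python
from collections import Counter
from ipaddress import ip_network

BLACKHOLE_COMMUNITY = "65535:666"   # the well-known BLACKHOLE community (RFC 7999)
PPS_THRESHOLD = 100_000             # illustrative per-/24 trigger, not a tuned value

def noisy_source_prefixes(samples, sampling_rate=1000):
    """Estimate per-source-/24 packet rates from one second of sampled flow records.

    samples: iterable of (src_ip, dst_ip, packets) tuples, sFlow-style.
    """
    estimated = Counter()
    for src, _dst, pkts in samples:
        prefix = ip_network(f"{src}/24", strict=False)
        estimated[prefix] += pkts * sampling_rate        # scale the samples back up
    return [p for p, pps in estimated.items() if pps > PPS_THRESHOLD]

def blackhole_announcement(prefix):
    # One announcement per offending prefix, roughly in the shape ExaBGP accepts on
    # its API pipe; routers at the peering edge that honor the BLACKHOLE community
    # then drop the matching traffic in hardware.
    return f"announce route {prefix} next-hop 192.0.2.1 community [{BLACKHOLE_COMMUNITY}]"

# for prefix in noisy_source_prefixes(flow_samples):
#     print(blackhole_announcement(prefix))
```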
I re-read your initial comment and apparently I misinterpreted what you wrote, so thanks for the high-quality answer despite this. (You didn't explicitly call out Dyn for not double-sourcing, only their customers, and you only used it as an example of good practices in the Verisign case. Since you mentioned it a lot, I gave it too much weight and read it as "Dyn failed because they didn't double-source", for which I see no evidence.)
So much this - but even worse - a particular unicorn whose primary business is identity management. The DNS glue records ALL point to AWS/Route53, and the actual zone file NS records ALL point to Dyn. I looked at that and went WTF were you thinking?!?! It's one thing to give yourself a SPOF, but why would you give yourself 2 SPOFs?
Top management doesn't care because they have no downside. By optimizing costs they can get additional bonuses. Even if they sink the company, they still walk away rich and ready to screw someone else.
Middle managers are too busy saving their own asses to ask questions. If one poor guy musters the courage to speak up, he is immediately shut down in the name of falling in line. That's the usual response from top management when they don't have an answer.
This mirrors my experience. And for public companies, the Board of Directors is not attuned to these outcomes. For the most part B2B customers are in the same boat, so everybody feigns an uncomfortable laugh or some moderate outrage or displeasure, puts out a silly postmortem with no intent to rectify the culture that enabled it, and continues collecting an outsized salary. I adore computers, but I really wish I could lateral into some career like construction management, where the layman is much more aware of success and failure, and of who is good and who is bad at their profession.
Where are you getting all this? I have a thought - perhaps it's not that everyone but you and your favourite organization is automatically incompetent! Perhaps there are reasons people do things! Perhaps there is more to something than what you can immediately glean! You could wait to learn more before passing judgement, but it seems you don't feel like doing that.
> they were eliminating FreeBSD and had a brain drain a few years ago. Overall pretty unsurprising outcome.
Thus they must be incompetent, because in their narrow minded foolishness, they don't understand that it is the Superior Operating System. There could be no other reason to do what they did. It couldn't have made sense for some other reason. Right? I also appreciate the BSDs for what they are, but this is so condescending. This whole post is so condescending.
> Internet reliability is no more complex, and certainly much cheaper, than analogous critical infrastructure like electrical generation for a region
Not if you're trying to also make it cost effective? Electrical generation has well understood usage patterns, hasn't grown that much in the last few decades and doesn't have the kind of demand spikes that happen on the internet. Plus, power companies are mostly monopolies and charge whatever they want. It's an apples and oranges comparison. Also, electrical grids fail from time to time too! See the northeast blackout of 2003.
Verisign has a monopoly similar to the power companies' and charges what amounts to a hefty tax on domains for all the services it provides. Without arguing about whether this is good or bad, we can say that Verisign is in effect pre-subsidized to provide however complex / redundant / reliable a service it wants to provide.
Where am I getting what? My argument is that weak leadership causes the problems we see. You sound personally affronted because I am pointing out a view of the industry that isn't rose-tinted, where we are all geniuses and Silicon Valley-inspired culture solves all the hard problems with slam dunks. These problems are preventable, and the tech industry is actually a wasteland of poor understanding and repeated organizational and managerial failure. These failures are ignored because everyone else is almost as bad, and most people can't judge the good from the bad. This isn't cutting-edge research: you carbon-copy a competent company's design, and they will probably explain it publicly if you ask nicely. There are forums like the IETF and NANOG where you can exchange these practices, and professionals do that to build the profession.
Swap out BSD for anything you please; just don't single-source when you've hit breakout success and revenue loss from failure is material. I'm not a zealot - I even think WinNT is a pretty nice kernel and could fit into this picture with the new Microsoft. The point is that you shouldn't be holding up whole percentage points of worldwide GDP on one kernel with a very poor security track record, and that describes every modern, widely used kernel. This is insane; I hope you can take that away from this thread at least.
The cost of designing against common-mode failure is inconsequential when properly budgeted against the cost of actual failure. Any company in the Fortune 500, non-tech included, could afford to create a DDOS-proof DNS and web platform. Luckily they shouldn't have to, if the service providers acted more like Verisign. And it scales down too: it would cost at most a few hundred dollars a year for a one-person company to multisource cloud providers, and it will likely be a wash since IaaS is most commonly usage-billed.
Your sense of mechanical sympathy is not aligned with the advances in hardware and internet infrastructure. It would be possible to handle a 1.2 Tbps DDOS in a single IX like LON, AMS, FRA, LGA, IAD, ORD, ATL, MIA, SJC, LAX, SIN, HKG, ICN, or KIX, let alone worldwide. It would take a lot more active management and work across companies if someone were clogging regional interconnect to a spoke, but that would not cause a worldwide outage.
1.2 Tbps is 120 of Verisign's 2013 servers running that code, maybe a million dollars of CapEx, plus 12 100G peering ports, about $150k monthly at bulk transit pricing if for some reason you can't do any settlement-free peering. Yes, there are other costs, but it is easy to scale and well within the means of a midsize business, let alone unicorns.
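For what it's worth, the arithmetic behind those numbers, using the 10 Gbps-per-server figure from the talk; the per-server and per-port prices are my own ballpark assumptions, not quotes:

```python
# Back-of-envelope sizing for absorbing a 1.2 Tbps attack with 2013-era servers.
ATTACK_BPS = 1.2e12          # reported attack size
SERVER_BPS = 10e9            # one commodity DNS server, per the Verisign talk
PORT_BPS   = 100e9           # one 100GbE peering port

servers = ATTACK_BPS / SERVER_BPS        # 120 servers
ports   = ATTACK_BPS / PORT_BPS          # 12 x 100G ports

SERVER_COST_USD  = 8_000                 # assumed CapEx per server
PORT_MONTHLY_USD = 12_500                # assumed bulk transit per 100G port, monthly

print(f"{servers:.0f} servers     -> ~${servers * SERVER_COST_USD:,.0f} CapEx")
print(f"{ports:.0f} x 100G ports -> ~${ports * PORT_MONTHLY_USD:,.0f}/month if none is settlement-free")
```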
The thing about DDOS: it's a lot harder for the attacker to get aggregate bandwidth; those eyeball networks will bottleneck a lot sooner than 1000-meter runs of fiber between peers at an IX. And there are some naturally imbalanced ratios, like CDNs sitting on dozens of Tbps of unused ingress, that could jump into offering services for other companies that don't want to figure it all out.
The style of this post-mortem is quite different from a typical post-mortem of an attack on a large tech company such as AWS, Heroku, or GitHub.
There is very little technical detail on the investigation and mitigation process beyond phrases like "the NOC team was able to mitigate the attack and restore service to customers" and "...but was mitigated in just over an hour".
Why was this written by a Chief Strategy Officer rather than someone with more technical knowledge and insight?
> We observed 10s of millions of discrete IP addresses associated with the Mirai botnet that were part of the attack. (linked article)
> ... the Mirai botnet was at about 550,000 nodes, and that approximately 10 percent were involved in the attack on Dyn (from Level 3 CISO) [0]
Something really doesn't add up there - even if it turned out 100% of infected hosts in the Mirai botnet were targeting Dyn (i.e. 550,000 nodes), that is still a fraction of the number Dyn is claiming.
I couldn't use twitter or github from noon till 4. If they mitigated the third attack and no customers were affected, why were github and twitter still down for me?
Can confirm as well. GitHub was inaccessible until about 5PM, and I'm in the southwest. My provider peers with Level3, so it looks like nearly everything gets routed through Los Angeles.
> We observed 10s of millions of discrete IP addresses associated with the Mirai botnet that were part of the attack.
OK, so why don't Dyn staff identify owners of all IP subnets represented, and provide each with a list of participating IP addresses? That's a trivial exercise, right? And maybe they could publicly shame ISPs that didn't act on the information.
Why would they, though? With ISPs charging by the GB these days, they have a financial incentive not to tell you your refrigerator (or any other device) is part of a botnet; that's free money for them.
At some point, if the Internet is going to continue to function, one of the likely outcomes is that these customers wake up some morning and the only thing they can get is a web page telling them they have a choice: disable their fridge's Internet of Shit device (or its connection), or do without their full connection to the greater Internet.
Compare to, for example, our stopping our use of lead in paint and gasoline/petrol.
That's a good analogy. In many places, owners of lead-contaminated buildings must remediate. One issue, of course, is poisoning of resident children. But another is contamination of the surrounding environment. Paint manufacturers are also liable, but that's a separate issue.
In this case the IPs probably aren't spoofed, but there's always the chance that you're blaming IPs that didn't engage in the attack because somebody just put them in randomly as the source address of their packets.
Probably not in cases where the device is behind a NAT/router that happens to forward remote access ports that lead to the initial compromise. Then you'd be nmapping the customer's endpoint. Plus, it'd be too time consuming. Tens of millions of participating IPs spread across countless ISPs...
Good point. But still, even with risk of false positives, I believe that ISPs should be informed. And ISPs could acknowledge uncertainty when informing customers.
ISPs already dun users about "illegal" file sharing because of pressure from the MPAA etc. These IoT botnets are just testing their capabilities, and they're already formidable. Going after operators is good, but it's so damn easy to hide. It seems that IoT botnets are mainly using direct attacks, so they're easy to identify. And why would ISPs want to support botnets? Don't ISP ToS typically say that attacks, threats, and spam are grounds for service termination?
Maybe this isn't Dyn's job. Maybe some neutral NGO should handle it. But seriously, this is something I could do at home, on an old gaming box in MySQL - if I had the list of IP addresses, anyway. As for why: how about acting in the public interest? Or good PR?
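For what it's worth, the aggregation step really is that simple; a minimal sketch, assuming you already have the raw list of attacking IPs, with the owner/abuse-contact lookup deliberately left out because that depends on external WHOIS/routing-registry data:

```python
from collections import defaultdict
from ipaddress import ip_address, ip_network

def group_by_prefix(attacker_ips, prefix_len=24):
    """Bucket attacking IPv4 addresses into /24s so each network owner can be
    handed one concise list instead of millions of raw addresses."""
    buckets = defaultdict(list)
    for raw in attacker_ips:
        ip = ip_address(raw)
        net = ip_network(f"{ip}/{prefix_len}", strict=False)
        buckets[net].append(ip)
    return buckets

def report(buckets, min_hosts=5):
    # The owner/abuse-contact lookup (WHOIS, routing registry, etc.) is not
    # sketched here; this just prints the per-prefix counts to hand over.
    for net, hosts in sorted(buckets.items(), key=lambda kv: -len(kv[1])):
        if len(hosts) >= min_hosts:
            print(f"{net}: {len(hosts)} attacking hosts")

# report(group_by_prefix(open("attack_ips.txt").read().split()))
```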
Dyn states their problems were fixed at 17:00 UTC, but I and most other people in the Netherlands were seeing issues the whole evening. Something about their story does not add up...
I would really like it if Dyn could substantiate the 1.2 Tbps number, or say whether it was actually higher.
As these attacks grow in scale, it becomes more and more important to know if this was an attack at record capacity or if it was just that Dyn had lower capacity hardware/links in place. If we're already facing, for example, 2Tbps attacks, a lot has to be done to make mitigating these attacks easier, either through hardware or strategic upgrades.
"Again, at no time was there a network-wide outage, though some customers would have seen extended latency delays during that time."
That can't be true. Here on the west coast, I know of over 10 "name-brand" sites that were absolutely down for over an hour. The east coast was apparently hit even harder and for longer.
Are you sure that it was Dyn that was down, or if there was a break somewhere in the nameserver chain?
I know that the digs I was running periodically were getting SERVFAIL responses from Google DNS even though my local nameservers were actually succeeding.
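A quick way to reproduce that kind of check is to ask the same name through several recursive resolvers and see which path fails. This sketch assumes the dnspython (2.x) package; the resolver addresses and the hostname are just examples:

```python
import dns.resolver          # pip install dnspython (2.x); an assumption, not from the thread
import dns.exception

RESOLVERS = {
    "google":     "8.8.8.8",
    "cloudflare": "1.1.1.1",
    "local":      "192.168.1.1",   # replace with your ISP/LAN resolver
}

def check(name="github.com"):
    for label, server in RESOLVERS.items():
        r = dns.resolver.Resolver(configure=False)
        r.nameservers = [server]
        r.lifetime = 3                      # seconds before giving up
        try:
            answer = r.resolve(name, "A")
            print(f"{label:10s} OK      {[a.to_text() for a in answer]}")
        except dns.exception.DNSException as e:
            print(f"{label:10s} FAILED  {type(e).__name__}: {e}")

if __name__ == "__main__":
    check()
```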
This should be a warning to everyone that was affected by this. Even the dependencies that you don't normally think about go down at some point. For DNS, make sure you are using multiple DNS providers.
> For DNS, make sure you are using multiple DNS providers.
In this case, that would have made it worse.
If your domain CNAMEs off to another provider (as nearly all SaaS solutions do, as do mappings to AWS servers, etc.), then you would have been affected by the attack on Dyn regardless of whether you could change your nameservers in time or had multiple providers at that level.
If you don't have any CNAMEs, then I think the better choice is to go same-origin same-provider for everything.
Which may sound bizarre... my personal sites (a lot of forums) would either be up or down. I would argue that availability having a binary nature is a lot better for end users than a constantly broken half-state that is frustrating or impossible to use, even though a graph might show partial availability. A hard, sudden failure shows up better in monitoring, triggers end-user feedback faster, and is in many ways easier to debug and solve.
With so much CNAME'd, and SaaS still growing, replicating and failing over DNS that sits above those wouldn't buy you anything.
A lot of sites lost Zendesk (their support. subdomain), StatusPage (their status. subdomain), and PagerDuty (which stayed quiet because no one could reach it to report anything). They couldn't even help the end users who were having a bad time.
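To see how deep those hidden dependencies run, you can walk a hostname's CNAME chain and list the nameservers each hop's zone relies on; a rough sketch, again assuming dnspython (2.x), with the hostname a placeholder:

```python
import dns.resolver   # pip install dnspython (2.x)

def dns_dependencies(name, max_hops=10):
    """Follow CNAMEs from `name` and report the NS set serving each hop's zone,
    which is where a 'multi-provider' setup can still funnel into one provider."""
    seen = []
    for _ in range(max_hops):
        zone = dns.resolver.zone_for_name(name)
        ns = sorted(r.target.to_text() for r in dns.resolver.resolve(zone, "NS"))
        seen.append((str(name), zone.to_text(), ns))
        try:
            cname = dns.resolver.resolve(name, "CNAME")
            name = cname[0].target.to_text()     # follow the chain one hop further
        except dns.resolver.NoAnswer:
            break                                # end of the chain: the final A/AAAA owner
    return seen

# for host, zone, ns in dns_dependencies("support.example.com"):
#     print(f"{host} -> zone {zone} served by {ns}")
```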
I haven't seen a comment from any IP-transit/peering company that provides service to them. In fact, I haven't seen any of their infrastructure providers comment at all. This is starting to seem more like a marketing campaign, or a human error (or action?) being blamed on a DDoS.
But it can also be substantially more costly than it's worth, and a lot of the cost isn't very visible (people finding it marginally harder to get their job done, being unable to learn from informed discussion on HN, &c).