
I find this extremely suspicious (i.e., knowing routers, I call bullshit). The change to the Verisign anycast DNS service, which I noted yesterday in another thread, brought godaddy.com back up, yet did not bring other DNS services back up.

Someone is lying here in my opinion. I hope I'm proven wrong because this is a terrible excuse for the company to make.

EDIT: And as someone else pointed out, their IP addresses could be pinged, which further undercuts the routing-issue explanation. More than likely, high traffic crashed one or more routers (THIS I have seen happen) and the live and saved configs didn't match. I'd put more money on something like that happening if it was router-related.




Unless you have intimate knowledge of their network topology, know the specifics of where those pinged IPs live in that topology, and know what routes were used to provide DNS results, you can't say that it wasn't a routing issue.

"Routing" is a rather generic term when it comes to large networks, and everything from border routers, firewalls, load balancers, and switches actually perform routing.

Especially (as I've mentioned in another post) when you add fault tolerance / failover configurations to the mix.

Routing failure doesn't have to be an all-or-nothing thing. There are a number of ways in which I can see ICMP echo packets working while other traffic doesn't, especially once you include the complexities of source routing, load balancing, failover, etc.

Even something as "simple" as a poisoned ARP cache in a single box could screw up the entire internal network and cause the problems they've had, and still be considered a "routing issue".
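
For what it's worth, a quick sanity check for that last scenario is to look for one MAC address answering for many IPs in the local ARP cache. Rough, generic sketch below (nothing GoDaddy-specific; it assumes Linux-style "arp -an" output, and things like proxy ARP or VRRP can trip it legitimately):

  # Sketch only: flag IPs in the local ARP cache that share a MAC address,
  # one common symptom of a poisoned ARP cache. Assumes Linux-style
  # "arp -an" output like: "? (10.0.0.1) at aa:bb:cc:dd:ee:ff [ether] on eth0"
  import re
  import subprocess
  from collections import defaultdict

  def arp_entries():
      out = subprocess.run(["arp", "-an"], capture_output=True, text=True).stdout
      pattern = re.compile(r"\((\d+\.\d+\.\d+\.\d+)\) at ([0-9a-fA-F:]{17})")
      return pattern.findall(out)

  def suspicious_macs():
      by_mac = defaultdict(set)
      for ip, mac in arp_entries():
          by_mac[mac.lower()].add(ip)
      # One MAC claiming several IPs is worth a second look (proxy ARP and
      # first-hop redundancy protocols can also cause this legitimately).
      return {mac: ips for mac, ips in by_mac.items() if len(ips) > 1}

  if __name__ == "__main__":
      for mac, ips in sorted(suspicious_macs().items()):
          print(mac, "is answering for", sorted(ips))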

$0.02


None of that is necessarily incorrect... but per their news release, 'corrupted router data tables' (their words) were the issue. I can't read too much into that, but it still doesn't change the fact that DNS wasn't resolving for clients for a while after they made their Verisign change, yet their own website did resolve once that change was made.

You are correct that I don't know the details of their internal network, and I never said otherwise, just that the chain of events and their claims don't necessarily match up!


I can imagine that they'd understandably work to get their own site/etc up and running first as the priority, as a manual "hack". After all, it's the main page everyone would be going to for information on what's going on.

After that, coming up with an automated process for migrating what must be a shit-ton of zone information to another system must have taken some time. I have no idea what their specific solution was, but I'm fairly confident it wasn't just a matter of copying over a few zone files. They'd probably have to do SOME sort of ETL (extract / transform / load) process that would take time to develop and test, never mind run.
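
Just to make that concrete, even a stripped-down version of that kind of pass would look roughly like the sketch below (everything here is hypothetical: the "zones/*.db" path, the CSV output, and the assumption of plain BIND-style zone files are all made up for illustration):

  # Hypothetical sketch of a minimal extract/transform/load pass: read
  # BIND-style zone files, keep only common record types, and dump them
  # into a neutral CSV for bulk loading into some other DNS system.
  import csv
  import glob

  KEEP = {"A", "AAAA", "CNAME", "MX", "NS", "TXT"}

  def extract(path):
      with open(path) as f:
          for line in f:
              line = line.split(";", 1)[0].strip()   # drop comments and blanks
              if not line or line.startswith("$"):   # skip $TTL, $ORIGIN, etc.
                  continue
              fields = line.split()
              # Very naive: assumes a "name ttl IN type rdata..." layout and
              # ignores multi-line records entirely.
              if len(fields) >= 5 and fields[2] == "IN" and fields[3] in KEEP:
                  yield fields[0], fields[1], fields[3], " ".join(fields[4:])

  def load(records, out_path="records.csv"):
      with open(out_path, "w", newline="") as f:
          writer = csv.writer(f)
          writer.writerow(["name", "ttl", "type", "rdata"])
          writer.writerows(records)

  if __name__ == "__main__":
      load(r for zone in glob.glob("zones/*.db") for r in extract(zone))

And that's the easy part; the real time sink would be the weird records, the validation, and the testing before cutting traffic over.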

And I can't remember the last time I gave technical information to a PR person who actually got it 100% technically correct. ;)

My intention wasn't to shit on your point or anything, or to defend Go-Daddy and their screwup in any way; I just think it's a bit unrealistic to try to infer detailed information from a PR release.

In the end, it was technical, they screwed up, and I doubt they'd ever release a proper, detailed post-mortem of what happened.


Heh, yeah. It is a bit difficult to interpret PR speak (and I have had to correct our guy before).

I think perhaps the takeaway here is to not trust what is being said, go with your gut... and move any services off GoDaddy ;). It would be nice if, like Google or Amazon, they released a real post-mortem. Even if it's an internal 'uh-oh', I trust companies that are willing to admit to mistakes.


"Would be nice if like Google or Amazon they would release a real post-mortem post."

Possible, but highly unlikely. Godaddy is "old school", which means they will release as little info as necessary and move on. They aren't interested in the hacker community. Their primary market is SMBs.


I don't see it as defending GoDaddy at all, quite the opposite. I would be more reassured if it had been an unexpected, massive DDoS that they weren't prepared for but could prepare for in the future.

The way it's described now, it's a weakness in their infrastructure, and I wonder whether it's possible to prevent this from happening again.


"The way it's described now is a weakness in their infrastructure"

Godaddy has plenty to lose by f-ing up. And to my knowledge (as a somewhat small competitor; I'm just pointing that out so my thoughts are taken in context), they have a fairly robust system (anecdotally) for the amount of data they manage. My issues with Godaddy (as a competitor) were always on the sell side: the constant selling of things you don't need, etc. Technically, I really didn't have any issues with them.


While it could indeed be a routing issue, who's to say that it wasn't caused intentionally by the guy in the tweets? It would be in GoDaddy's interests to cover that up and fix whatever exploit he used to get in, instead of admitting a security breach.


What we're being told internally matches the public response. Reading between the lines, it sounds like there may have been human error involved, but that's just speculation.

Our interim CEO confirmed that affected users are receiving a full month's refund.


And if it was human error, that's fine! Stuff happens, and I can certainly say that I've made my share of human errors.

I want to mention a few things, though. The first is that the blame is being placed (at least reading slightly into the PR release) on a 'technology' failure. That is fairly distinct from human error.

The second is that, if it was human error, how did the chain of events unfold without a second pair of eyes or similar catching it, such that the outage lasted as long as it did?

Third, why was the DNS changed to Verisign? That is still, I think, the biggest outstanding question about their claimed outage reports. I should also mention that I do have skin in this game, as plenty of customers were running at least SOMETHING through Godaddy and were yelling in this direction when stuff broke...


Router bugs aren't unheard of. There was the Juniper MX bug that caused multiple outages for Level3 & Time Warner Cable. That was supposedly just a bad pattern of route injections and withdrawals.

That said, I agree that Godaddy's handling and RFO don't smell right.


Do you know how BGP works? There are easily 50 different ways routing problems can cause outages like this. More than likely there was a compound failure, which can cause all kinds of bizarre behavior, including different networks getting different kinds of traffic, to say nothing of a plain old network service on a single net being down.

Routers can "crash" for different reasons, but rarely due to high traffic. If you really wanted to fuck with someone, you'd make one BGP change. Only newbs use DDoSes. (Which, Anonymous being newbs, would be their MO, but it's unlikely they could DDoS a connectionless resource-record database.)


*shrug* I know how BGP works, yes. From the symptoms, though, I think it is going to end up being (if Godaddy is telling the truth about not being hacked), as you say, a compound failure. The exact cause of such a failure will have to wait for a truthful, full account of the outage to be released. Further into this thread someone reported that an engineer from their side is going to release more information, so we'll see soon who gets the prize :P.


I'm familiar with BGP. I'm unfamiliar with how BGP has anything to do with me being able to ping their IP, but not get a response on UDP/53 or TCP/53 with any data in it.
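
Roughly the kind of check I mean, as a sketch (the hostname is just a placeholder, and the hand-rolled UDP query is only there to avoid pulling in a DNS library; this isn't literally what I ran):

  # Sketch of the split: ICMP echo succeeds while nothing useful comes back
  # on port 53. The host below is a placeholder, not a real target.
  import socket
  import subprocess

  HOST = "ns.example.com"   # placeholder nameserver

  def ping_ok(host):
      # -c 1: send a single echo request (Linux/BSD ping syntax)
      return subprocess.run(["ping", "-c", "1", host],
                            capture_output=True).returncode == 0

  def tcp53_ok(host, timeout=3):
      try:
          with socket.create_connection((host, 53), timeout=timeout):
              return True
      except OSError:
          return False

  def udp53_ok(host, timeout=3):
      # Minimal hand-built DNS query: A record for example.com, recursion desired.
      query = (b"\x12\x34\x01\x00\x00\x01\x00\x00\x00\x00\x00\x00"
               b"\x07example\x03com\x00\x00\x01\x00\x01")
      s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
      s.settimeout(timeout)
      try:
          s.sendto(query, (host, 53))
          data, _ = s.recvfrom(512)
          return len(data) > 0
      except OSError:
          return False
      finally:
          s.close()

  if __name__ == "__main__":
      print("icmp:", ping_ok(HOST),
            "tcp/53:", tcp53_ok(HOST),
            "udp/53:", udp53_ok(HOST))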


Off the top of my head? One network they multihome to had weird packet loss experienced only by DNS and other services, so they tried to cut the routes over to the second network, but packets were still getting sent to the first network (which had DNS disabled but ICMP enabled on the hosts), and further router fuckage prevented them from switching back easily. Hell, they probably just couldn't get their BGP changes to propagate once they made the first change.

If you go with 'router tables' being the culprit, they probably had a core router that maxed out its RAM when they put another router in place, but they had already moved the part of the network that housed DNS by the time the routers synced and the RAM filled up from too many BGP routes to sort. You can still ping 'hosts' (which I am almost certain are hardware load balancers and not actual DNS hosts) while the DNS traffic goes nowhere, because the backend DNS services were moved. It would take a couple of hours to unfuck all of that.


It would certainly be in GoDaddy's interests to claim a technical fault, rather than admit to a hack, which implies lax security.



