Tumblr just dropped out of DNS

sbierwagen · on Dec 12, 2012

  @tumblr
    Tumblr has taken the site down in order to resolve a 
    network issue. We will update as we know more.

https://twitter.com/tumblr/status/279000741706878976

emeraldd · on Dec 12, 2012

Wow! Killing DNS is a nasty way to take down the site. That could take hours to come back.

adrr · on Dec 13, 2012

Recovery is going to spread over a few hours which could be a benefit.

takeda64 · on Dec 12, 2012

Not really. Looks like their domain is configured so negative responses are cached only for 300 seconds (5 minutes).

insaneirish · on Dec 13, 2012

TTLs are a recommendation, at best. Not worth relying on.

Florin_Andrei · on Dec 13, 2012

Technically true. In practice it does/could work pretty well.

Source: I've seen pure DNS-based site failover and it never had a problem.

bigiain · on Dec 13, 2012

True. We did a fair bit of experimenting with this (for both load balancing and failover) a few years back, and so long as you stay at 300sec or greater it works fine for well over 99% of all the traffic we tested with. Once you dropped below 300 seconds problems started appearing - from memory some older versions of Windows would default to 900secs, 3600secs, or 86400secs.

I haven't revisited that research for 3 or 4 years, but our findings then led to a policy of "if you're prepared to accept ~15mins of mixed availability (after you've identified a problem and hit the panic button) then DNS based failover works well enough. If you need significantly better response times than that, you need proper hardware/network based failover"

Florin_Andrei · on Dec 13, 2012

Using 10sec TTL currently for site failover. No complaints.

bigiain · on Dec 13, 2012

Do you have much Windows XP traffic? From memory, we saw evidence of XP using 86400sec (1 day) TTLs if you tried publishing TTLs lower than 300. (This was back in '08, so perhaps things have changed since. I'm still seeing ~30% WinXP in some of my Google Analytics accounts...)

TallGuyShort · on Dec 13, 2012

Exactly - my biggest complaint about ELB on AWS is that they relied on people honouring a TTL that was ridiculously low, and as such now honoured by many clients.

sveiss · on Dec 13, 2012

Yup.

The day I got a bunch of Netflix traffic sent to servers I was running thanks to ELB and caching resolvers not honoring TTL values was great fun.

(That was also the day we learnt that our 500 page was actually really expensive to render, which added to the excitement...)

Silhouette · on Dec 12, 2012

Given that they've been having can't-see-blogs-at-all scale problems for something like 24 hours already (I was looking for one last night, and it's currently nearly midnight UK time) I suspect a few more hours to clear up whatever fundamental screw-up is behind this could be a good investment.

nwh · on Dec 13, 2012

Seeing as they have engineers deploying code directly into production with `vim`, a fundamental screwup was inevitable.

IgorPartola · on Dec 13, 2012

No build process? In 2012?

Do you have a reference for deploying code with vim?

nwh · on Dec 13, 2012

The source of their index.php, and the subsequent discussion on HN. Just imagine someone franticly trying to get into insert mode, and saving on a production server. It stayed like that for a reasonable period before someone noticed.

http://pastebin.com/raw.php?i=aPQJUh1Q

http://news.ycombinator.com/item?id=2343330

fryguy · on Dec 13, 2012

That doesn't necessarily mean that they don't use source control or build control. The index file linked has tons of production-specific configuration information (database ip/keys), which wouldn't be checked into source control under best practices. That would require manual editing on the server to set up correctly.

bashtoni · on Dec 13, 2012

s/manual editing/mangement from configuration management like Puppet or Chef/

dan1234 · on Dec 13, 2012

It would have more polite to point the A record at another server with a friendly message about the outage and any estimated resolution time.

I think the majority of tumblers will be pretty confused atm.

nwh · on Dec 13, 2012

https://twitter.com/search/realtime?q=tumblr+die

sbierwagen · on Dec 13, 2012

Tumblr is back up, after 4 hours of downtime.

Official apology post: http://staff.tumblr.com/post/37811176241/a-message-from-the-...

state_machine · on Dec 12, 2012

Incidentally, this means every startup (and some of the bigger kids) that hosts its status page on Tumblr is missing a status page at the moment... eg: http://status.twitter.com

munificent · on Dec 13, 2012

There's something beautiful about:

    * Tumbler reports its outage status using twitter.
    * Twitter reports its outage status using tumbler.

TallboyOne · on Dec 13, 2012

Beautiful

citricsquid · on Dec 13, 2012

The Minecraft launcher also relies on tumblr to serve the landing page, a couple of million players will be seeing this: http://i.imgur.com/Zft71.png

Foomandoonian · on Dec 13, 2012

We've seen that before. It's better than the skin servers being down. Which they probably are.

rtkwe · on Dec 13, 2012

I'd rather no skins than the constant auth.minecraft outages.

emeraldd · on Dec 12, 2012

  dig +trace www.tumblr.com

  *snip*

  tumblr.com.             300     IN      SOA     pdns1.ultradns.net. hostmaster.tumblr.com. 2012121602 86400 7200 604800 300
  ;; Received 108 bytes from 204.74.108.1#53(204.74.108.1) in 21 ms

Looks like their dns is down alright. You could try in your hosts file:

  72.32.231.8 www.tumblr.com tumblr.com

The ip is from their whois info and appears to be giving the We're sorry error message.

rtkwe · on Dec 12, 2012

Huh the only way you can tell 72.32.231.8 redirects to tumblr's crash page is the link under find out why. Completely unbranded.

Pr0 · on Dec 12, 2012

Seems it started with just a normal outage: http://thenextweb.com/insider/2012/12/12/tumblr-confirms-use...

brokentone · on Dec 12, 2012

Remember when we all decided Tumblr wasn't stable to host a professional site? Well... these people didn't.

http://theatlantic.tumblr.com/ http://fox411.blogs.foxnews.com/ http://motherjones.tumblr.com/ http://gq.tumblr.com/ http://tumblr.elle.com/

blake8086 · on Dec 12, 2012

If the cost of hosting elsewhere exceeds the cost of the downtime, they made the right choice.

brokentone · on Dec 12, 2012

Used to work at Fox, helped write the theme

Lot lot lot more costs involved, including an inane templating system that took way longer to work with than to build normally. Hosting would have happened on the existing platform adding little to no cost. Also, the cost (lost revenue, people running around pointing fingers) of downtime for a site that large (yes, even the entertainment blog) quickly DWARFS any possible hosting costs.

edit:formatting

OverlordXenu · on Dec 13, 2012

Aren't those essentially just advertising for the main sites?

on Dec 12, 2012

[deleted]

Maxious · on Dec 12, 2012

-1

These notifications and the technical analysis that only HN provides are valuable for all us devop plebs trying to avoid the same thing.

skyebook · on Dec 12, 2012

+1 for your -1, these notifications and the postmortems are always excellent sources of information

zachinglis · on Dec 12, 2012

Compare it to the stuff that could be up here. People's projects.

I'm not sure what's to learn from this other than: * Tumblr is down * Killing the DNS is bad

lhnn · on Dec 13, 2012

There is no valuable postmortem on that instrusive-ad ridden site.

zachinglis · on Dec 12, 2012

+1 It's infuriating and seems like karma bait.

bbuffone · on Dec 13, 2012

Visit this page to see the length of downtime -> http://www.websitetest.com/ui/tests/50c922b17a6c8757bb000005.

The test will run every 10 minutes for the next 10 hours. Testing is only good for diagnosing issues like downtime and performance issues.

el_cuadrado · on Dec 12, 2012

Looks like they have MX and TXT, but missing A. Weird.

rdl · on Dec 12, 2012

They get hundreds of extra points for hosting their status page on their own service.

That's basically the main (already widely accepted) lesson people should take from this -- people want twitter updates as well as an outside-hosted blog and monitor for service availability.

Achshar · on Dec 13, 2012

Doesn't serving status page form the site itself defeat the purpose of status page? In what case would that make any sense? I am now intrigued, this is not a small thing that can get over looked.

rdl · on Dec 13, 2012

Yes, exactly.

For status other than "site is down", it's probably nice to have information on-site since I guess tumblr is about sharing easily within tumblr. But maintenance and server status needs to be both on-site (tumblr.com/status ?) and an off-site page at status.tumblr.com.

oh_sigh · on Dec 13, 2012

No, it does not defeat the purpose. Instead, it just lessens the effectiveness of the status page. It really depends on what tumblrs most common error scenario is. 95% of errors may manifest themselves in manners which do not affect the status of the status page.

moe · on Dec 13, 2012

The whole notion of "hosting" a status page is ridiculous.

You just generate it as static HTML and dump it on a bunch of free webhosters (or clouds if you want to be fancy). It's not rocket science to keep a HTML page online.

rdl · on Dec 13, 2012

It would be reasonable to have something more sophisticated, like heroku's status page, which has both text/timestamp information and graphical representations of status. If you have an API, reporting deeper info about each part of the API is also a good idea.

It still should probably be a static page, but updated frequently and by automatic tools.

vhost- · on Dec 13, 2012

Speculation: Gmail, Facebook, then Tumblr? Kill switch?

Half kidding...

beagle3 · on Dec 13, 2012

I was having the same thought.

But until proven otherwise, I'll assume strange coincidence. Although, when microsoft.com falls of the face of the earth tomorrow, I'm calling it a conspiracy.

_8ea7 · on Dec 13, 2012

The main site/webapp is still down (3 hours later) but individual sites are up. http://theparisreview.tumblr.com/

They're returning 66.6.36.7 for DNS.

kondro · on Dec 12, 2012

Could someone refresh my memory on how Tumblr makes its revenue?

citricsquid · on Dec 12, 2012

Sponsored posts, premium themes, promoted posts. Pretty sure they're far from profitable, but they're bringing some revenue in.

http://www.quora.com/Tumblr/How-does-Tumblr-make-money

http://www.businessinsider.com/tumblr-revenues-2012-9

cft · on Dec 13, 2012

If they are not profitable, it means that the downtime is actually improving their financials...;)

ChuckMcM · on Dec 13, 2012

Chuckled at that, given their scale I'm sure they are already at fixed rates for colo space and bandwidth.

jonathanmoore · on Dec 12, 2012

- Brand spotlights

- Sponsored radar posts (ads)

- Promoted posts (for all users)

- Premium theme shared revenue

- Selling access to data

sjtgraham · on Dec 12, 2012

If you really want to know why don't you find out and then come back to tell us instead of making fatuous comments

kondro · on Dec 12, 2012

Love to, but Tumblr's site is down… so I can't check.

tmx · on Dec 12, 2012

That's pretty bad.

Admittedly, they update their zone very frequently (every time a user signs up/changes name/deletes themselves), but you'd think they would have an independent secondary DNS provider somewhere.

cmelbye · on Dec 12, 2012

Do they really? It seems more likely that they'd simply use a wildcard.

tmx · on Dec 12, 2012

That'll teach me to post late at night. I forgot wildcards existed.

I was thinking that (if I built tumblr) they would do subdomain searching at the DNS level to avoid hitting their database.

Makes it even worse if they are using wildcards.

X-Istence · on Dec 13, 2012

But if you don't publish an A record you get an ugly browser error message, whereas publishing the wildcard gives the user a notice saying that no such blog exists.

rdl · on Dec 12, 2012

I believe they use ultradns (a nice, but very expensive, dns service).

kordless · on Dec 12, 2012

CloudFlare. Seriously guys.

onetwothreefour · on Dec 12, 2012

Looks like UltraDNS disabled their zones. Someone probably forgot to pay invoices or something.

You're pretty much screwed in this case.

emeraldd · on Dec 12, 2012

What are you basing the "disabled zone" statement on? A 'dig ANY tumblr.com' shows MX, TXT, and NS records . . .