Tumblr just dropped out of DNS (tumblr.com)
133 points by state_machine on Dec 12, 2012 | 64 comments



  @tumblr
    Tumblr has taken the site down in order to resolve a 
    network issue. We will update as we know more. 
https://twitter.com/tumblr/status/279000741706878976


Wow! Killing DNS is a nasty way to take down the site. That could take hours to come back.


Recovery is going to spread over a few hours which could be a benefit.


Not really. Looks like their domain is configured so negative responses are cached only for 300 seconds (5 minutes).
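For reference, the 300-second figure can be read off the SOA record that `dig +trace` returns (quoted elsewhere in this thread). Per RFC 2308, a resolver caches a negative answer for the smaller of the SOA record's own TTL and its MINIMUM field. A minimal Python sketch, assuming the single-line format that `dig` prints; the function name `negative_cache_ttl` is made up for illustration:

```python
def negative_cache_ttl(soa_line: str) -> int:
    """Return the negative-caching TTL (seconds) for a dig-style SOA line.

    Per RFC 2308 this is min(TTL of the SOA record, SOA MINIMUM field).
    Assumed layout: name ttl class SOA mname rname
                    serial refresh retry expire minimum
    """
    fields = soa_line.split()
    if len(fields) != 11 or fields[3].upper() != "SOA":
        raise ValueError("not a single-line SOA record")
    record_ttl = int(fields[1])    # TTL of the SOA record itself
    soa_minimum = int(fields[10])  # last field: MINIMUM
    return min(record_ttl, soa_minimum)


# SOA record as quoted in this thread.
soa = ("tumblr.com. 300 IN SOA pdns1.ultradns.net. hostmaster.tumblr.com. "
       "2012121602 86400 7200 604800 300")
print(negative_cache_ttl(soa))
```

Both values happen to be 300 here, so a well-behaved resolver should forget the "no such record" answer after five minutes.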


TTLs are a recommendation, at best. Not worth relying on.


Technically true. In practice it does/could work pretty well.

Source: I've seen pure DNS-based site failover and it never had a problem.


True. We did a fair bit of experimenting with this (for both load balancing and failover) a few years back, and so long as you stay at 300sec or greater it works fine for well over 99% of all the traffic we tested with. Once you dropped below 300 seconds problems started appearing - from memory some older versions of Windows would default to 900secs, 3600secs, or 86400secs.

I haven't revisited that research for 3 or 4 years, but our findings then led to a policy of "if you're prepared to accept ~15mins of mixed availability (after you've identified a problem and hit the panic button) then DNS based failover works well enough. If you need significantly better response times than that, you need proper hardware/network based failover"
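A back-of-the-envelope way to see where a figure like ~15 minutes could come from: add the time it takes to notice the problem to the longest TTL a client might apply. The Python sketch below only models the recollection above (clients substituting their own default when the published TTL is under 300 seconds). It is not measured resolver behaviour, and the function name, the 900-second fallback and the 600-second detection time are all illustrative assumptions:

```python
def worst_case_failover_seconds(published_ttl: int,
                                detection_seconds: int,
                                client_floor: int = 300,
                                client_fallback: int = 900) -> int:
    """Rough upper bound on how long some clients keep using the old IP.

    Assumption: a client honours the published TTL if it is at least
    client_floor, and otherwise falls back to client_fallback.
    """
    if published_ttl >= client_floor:
        effective_ttl = published_ttl
    else:
        effective_ttl = client_fallback
    return detection_seconds + effective_ttl


for ttl in (300, 10):
    total = worst_case_failover_seconds(ttl, detection_seconds=600)
    print(f"published TTL {ttl:>3}s -> up to {total / 60:.0f} min")
```

Under these assumptions a shorter published TTL does not help the misbehaving clients at all, which is the point the comment is making.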


Using 10sec TTL currently for site failover. No complaints.


Do you have much Windows XP traffic? From memory, we saw evidence of XP using 86400sec (1 day) TTLs if you tried publishing TTLs lower than 300. (This was back in '08, so perhaps things have changed since. I'm still seeing ~30% WinXP in some of my Google Analytics accounts...)


Exactly - my biggest complaint about ELB on AWS is that they relied on people honouring a TTL that was ridiculously low, and as such not honoured by many clients.


Yup.

The day I got a bunch of Netflix traffic sent to servers I was running thanks to ELB and caching resolvers not honoring TTL values was great fun.

(That was also the day we learnt that our 500 page was actually really expensive to render, which added to the excitement...)


Given that they've been having can't-see-blogs-at-all scale problems for something like 24 hours already (I was looking for one last night, and it's currently nearly midnight UK time) I suspect a few more hours to clear up whatever fundamental screw-up is behind this could be a good investment.


Seeing as they have engineers deploying code directly into production with `vim`, a fundamental screwup was inevitable.


No build process? In 2012?

Do you have a reference for deploying code with vim?


The source of their index.php, and the subsequent discussion on HN. Just imagine someone frantically trying to get into insert mode, and saving on a production server. It stayed like that for a reasonable period before someone noticed.

http://pastebin.com/raw.php?i=aPQJUh1Q

http://news.ycombinator.com/item?id=2343330


That doesn't necessarily mean that they don't use source control or build control. The index file linked has tons of production-specific configuration information (database ip/keys), which wouldn't be checked into source control under best practices. That would require manual editing on the server to set up correctly.


s/manual editing/management from configuration management like Puppet or Chef/


It would have been more polite to point the A record at another server with a friendly message about the outage and any estimated resolution time.

I think the majority of tumblers will be pretty confused atm.



Tumblr is back up, after 4 hours of downtime.

Official apology post: http://staff.tumblr.com/post/37811176241/a-message-from-the-...


Incidentally, this means every startup (and some of the bigger kids) that hosts its status page on Tumblr is missing a status page at the moment... eg: http://status.twitter.com


There's something beautiful about:

    * Tumblr reports its outage status using Twitter.
    * Twitter reports its outage status using Tumblr.


Beautiful


The Minecraft launcher also relies on tumblr to serve the landing page, a couple of million players will be seeing this: http://i.imgur.com/Zft71.png


We've seen that before. It's better than the skin servers being down. Which they probably are.


I'd rather no skins than the constant auth.minecraft outages.


  dig +trace www.tumblr.com

  *snip*

  tumblr.com.             300     IN      SOA     pdns1.ultradns.net. hostmaster.tumblr.com. 2012121602 86400 7200 604800 300
  ;; Received 108 bytes from 204.74.108.1#53(204.74.108.1) in 21 ms

Looks like their DNS is down all right. You could try this in your hosts file:

  72.32.231.8 www.tumblr.com tumblr.com

The IP is from their whois info and appears to be serving the "We're sorry" error message.
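If you'd rather not edit the hosts file, the same check can be done by connecting to the IP directly and sending the site's name in the Host header. A small Python sketch using only the standard library; `request_via_ip` is a made-up helper name, and there is no guarantee that 72.32.231.8 answers at all:

```python
import urllib.request


def request_via_ip(ip: str, hostname: str, path: str = "/") -> urllib.request.Request:
    """Build an HTTP request aimed at `ip` that asks for `hostname`.

    This bypasses DNS entirely: the URL carries the IP address, while the
    Host header tells the web server which site we want.
    """
    return urllib.request.Request(f"http://{ip}{path}",
                                  headers={"Host": hostname})


req = request_via_ip("72.32.231.8", "www.tumblr.com")
print(req.full_url, "Host:", req.get_header("Host"))

# To actually send it (needs network access):
# with urllib.request.urlopen(req, timeout=10) as resp:
#     print(resp.status, resp.read(200))
```

Unlike the hosts-file trick, this only affects the one request, so there is nothing to clean up once DNS comes back.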


Huh, the only way you can tell that 72.32.231.8 redirects to Tumblr's crash page is the link under "find out why". Completely unbranded.


Seems it started with just a normal outage: http://thenextweb.com/insider/2012/12/12/tumblr-confirms-use...


Remember when we all decided Tumblr wasn't stable to host a professional site? Well... these people didn't.

http://theatlantic.tumblr.com/ http://fox411.blogs.foxnews.com/ http://motherjones.tumblr.com/ http://gq.tumblr.com/ http://tumblr.elle.com/


If the cost of hosting elsewhere exceeds the cost of the downtime, they made the right choice.


Used to work at Fox, helped write the theme

Lot lot lot more costs involved, including an inane templating system that took way longer to work with than to build normally. Hosting would have happened on the existing platform adding little to no cost. Also, the cost (lost revenue, people running around pointing fingers) of downtime for a site that large (yes, even the entertainment blog) quickly DWARFS any possible hosting costs.

edit:formatting


Aren't those essentially just advertising for the main sites?


[deleted]


-1

These notifications and the technical analysis that only HN provides are valuable for all us devop plebs trying to avoid the same thing.


+1 for your -1, these notifications and the postmortems are always excellent sources of information


Compare it to the stuff that could be up here. People's projects.

I'm not sure what's to learn from this other than:

    * Tumblr is down
    * Killing the DNS is bad


There is no valuable postmortem on that intrusive-ad-ridden site.


+1 It's infuriating and seems like karma bait.


Visit this page to see the length of downtime -> http://www.websitetest.com/ui/tests/50c922b17a6c8757bb000005.

The test will run every 10 minutes for the next 10 hours. Testing is only good for diagnosing issues like downtime and performance issues.


Looks like they have MX and TXT, but missing A. Weird.


They get hundreds of extra points for hosting their status page on their own service.

That's basically the main (already widely accepted) lesson people should take from this -- people want twitter updates as well as an outside-hosted blog and monitor for service availability.


Doesn't serving the status page from the site itself defeat the purpose of a status page? In what case would that make any sense? I am now intrigued; this is not a small thing that can get overlooked.


Yes, exactly.

For status other than "site is down", it's probably nice to have information on-site since I guess tumblr is about sharing easily within tumblr. But maintenance and server status needs to be both on-site (tumblr.com/status ?) and an off-site page at status.tumblr.com.


No, it does not defeat the purpose. Instead, it just lessens the effectiveness of the status page. It really depends on what Tumblr's most common error scenario is. 95% of errors may manifest themselves in manners which do not affect the status of the status page.


The whole notion of "hosting" a status page is ridiculous.

You just generate it as static HTML and dump it on a bunch of free webhosters (or clouds if you want to be fancy). It's not rocket science to keep an HTML page online.


It would be reasonable to have something more sophisticated, like heroku's status page, which has both text/timestamp information and graphical representations of status. If you have an API, reporting deeper info about each part of the API is also a good idea.

It still should probably be a static page, but updated frequently and by automatic tools.
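To make the "static page, updated by automatic tools" idea concrete, here is a minimal Python sketch that renders a status page from a dictionary of component states. The function name, the component names and the page layout are all invented for illustration; nothing here reflects how Tumblr or Heroku actually build theirs:

```python
from datetime import datetime, timezone
from html import escape
from typing import Dict


def render_status_page(service: str,
                       components: Dict[str, str],
                       generated_at: datetime) -> str:
    """Render a self-contained static HTML status page as a string."""
    rows = "\n".join(
        f"    <li>{escape(name)}: {escape(state)}</li>"
        for name, state in components.items()
    )
    stamp = generated_at.strftime("%Y-%m-%d %H:%M UTC")
    return (
        "<!doctype html>\n"
        f"<title>{escape(service)} status</title>\n"
        f"<h1>{escape(service)} status</h1>\n"
        f"<p>Last updated {stamp}</p>\n"
        "<ul>\n"
        f"{rows}\n"
        "</ul>\n"
    )


page = render_status_page(
    "Example",
    {"Dashboard": "down", "Blogs": "operational"},
    datetime(2012, 12, 12, 23, 0, tzinfo=timezone.utc),
)
print(page)
```

A monitoring job could write the returned string to a file and push it to a few unrelated hosts, so the page has no dependency on the service it reports on.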


Speculation: Gmail, Facebook, then Tumblr? Kill switch?

Half kidding...


I was having the same thought.

But until proven otherwise, I'll assume strange coincidence. Although, when microsoft.com falls off the face of the earth tomorrow, I'm calling it a conspiracy.


The main site/webapp is still down (3 hours later) but individual sites are up. http://theparisreview.tumblr.com/

They're returning 66.6.36.7 for DNS.


Could someone refresh my memory on how Tumblr makes its revenue?


Sponsored posts, premium themes, promoted posts. Pretty sure they're far from profitable, but they're bringing some revenue in.

http://www.quora.com/Tumblr/How-does-Tumblr-make-money

http://www.businessinsider.com/tumblr-revenues-2012-9


If they are not profitable, it means that the downtime is actually improving their financials...;)


Chuckled at that, given their scale I'm sure they are already at fixed rates for colo space and bandwidth.


- Brand spotlights

- Sponsored radar posts (ads)

- Promoted posts (for all users)

- Premium theme shared revenue

- Selling access to data


If you really want to know, why don't you find out and then come back to tell us instead of making fatuous comments?


Love to, but Tumblr's site is down… so I can't check.


That's pretty bad.

Admittedly, they update their zone very frequently (every time a user signs up/changes name/deletes themselves), but you'd think they would have an independent secondary DNS provider somewhere.


Do they really? It seems more likely that they'd simply use a wildcard.


That'll teach me to post late at night. I forgot wildcards existed.

I was thinking that (if I built tumblr) they would do subdomain searching at the DNS level to avoid hitting their database.

Makes it even worse if they are using wildcards.


But if you don't publish an A record you get an ugly browser error message, whereas publishing the wildcard gives the user a notice saying that no such blog exists.
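A rough sketch of the wildcard idea in Python, for anyone who hasn't run into it: exact names win, and anything else under the domain falls through to the `*` record, so the web server (not DNS) gets to decide whether the blog exists. This is a deliberate simplification of the real matching rules in RFC 4592, and the zone contents, addresses (documentation range 192.0.2.0/24) and function name are hypothetical:

```python
from typing import Dict, Optional


def lookup_a(zone: Dict[str, str], name: str) -> Optional[str]:
    """Return the A record for `name`, falling back to wildcard entries.

    Simplified: an exact match wins; otherwise try "*.<parent>" for each
    parent domain, from most specific to least specific.
    """
    name = name.lower().rstrip(".")
    if name in zone:
        return zone[name]
    labels = name.split(".")
    for i in range(1, len(labels)):
        wildcard = "*." + ".".join(labels[i:])
        if wildcard in zone:
            return zone[wildcard]
    return None  # comparable to NXDOMAIN: the browser shows its own error


zone = {
    "example.com": "192.0.2.1",
    "www.example.com": "192.0.2.1",
    "*.example.com": "192.0.2.2",   # every blog subdomain, existing or not
}
print(lookup_a(zone, "www.example.com"))
print(lookup_a(zone, "no-such-blog.example.com"))
print(lookup_a(zone, "example.org"))
```

With a single wildcard entry the zone never needs to change when users sign up or rename themselves.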


I believe they use ultradns (a nice, but very expensive, dns service).


CloudFlare. Seriously guys.


Looks like UltraDNS disabled their zones. Someone probably forgot to pay invoices or something.

You're pretty much screwed in this case.


What are you basing the "disabled zone" statement on? A 'dig ANY tumblr.com' shows MX, TXT, and NS records . . .



