True. We did a fair bit of experimenting with this (for both load balancing and failover) a few years back, and so long as you stay at 300 seconds or greater it works fine for well over 99% of all the traffic we tested with. Once you drop below 300 seconds, problems start appearing - from memory, some older versions of Windows would ignore the published TTL and substitute defaults of 900, 3600, or 86400 seconds.
I haven't revisited that research for 3 or 4 years, but our findings then led to a policy of "if you're prepared to accept ~15mins of mixed availability (after you've identified a problem and hit the panic button) then DNS based failover works well enough. If you need significantly better response times than that, you need proper hardware/network based failover"
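As a rough sketch of where that ~15 min figure comes from (the detection time below is an assumed number for illustration, not from our testing):

```python
# Back-of-envelope: after you repoint a record, clients that honour the TTL
# can keep using the old address for up to one full TTL, on top of however
# long it took you to notice the problem and hit the panic button.
ttl = 300                 # seconds - the floor we found safe in practice
detect_and_react = 600    # assumed: ~10 minutes to spot the fault and decide
worst_case_minutes = (detect_and_react + ttl) / 60
print(worst_case_minutes)  # -> 15.0
```

And that's the optimistic case, where every resolver actually honours the 300s TTL.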
Do you have much Windows XP traffic? From memory, we saw evidence of XP using 86400sec (1 day) TTLs if you tried publishing TTLs lower than 300. (This was back in '08, so perhaps things have changed since. I'm still seeing ~30% WinXP in some of my Google Analytics accounts...)
Exactly - my biggest complaint about ELB on AWS is that it relies on clients honouring a TTL that is ridiculously low, and as such not honoured by many clients.
Given that they've been having can't-see-blogs-at-all scale problems for something like 24 hours already (I was looking for one last night, and it's currently nearly midnight UK time) I suspect a few more hours to clear up whatever fundamental screw-up is behind this could be a good investment.
The source of their index.php, and the subsequent discussion on HN. Just imagine someone frantically trying to get into insert mode, then saving, on a production server. It stayed like that for a reasonable period before anyone noticed.
That doesn't necessarily mean that they don't use source control or build control. The index file linked has tons of production-specific configuration information (database ip/keys), which wouldn't be checked into source control under best practices. That would require manual editing on the server to set up correctly.
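For what it's worth, the usual pattern is to read those production-specific values from the environment (or an uncommitted config file) rather than hard-coding them in the served file - a minimal sketch, with entirely made-up names:

```python
import os

def load_db_config(env=os.environ):
    # Production-specific values come from the environment, set once on the
    # server, so they never land in source control. DB_HOST/DB_KEY are
    # illustrative names, not anyone's actual config.
    return {
        "host": env.get("DB_HOST", "127.0.0.1"),  # dev fallback
        "key": env.get("DB_KEY"),                  # secret, set per-server
    }

print(load_db_config({"DB_HOST": "db.internal", "DB_KEY": "s3cret"}))
```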
Incidentally, this means every startup (and some of the bigger kids) that hosts its status page on Tumblr is missing a status page at the moment... eg:
http://status.twitter.com
dig +trace www.tumblr.com
*snip*
tumblr.com. 300 IN SOA pdns1.ultradns.net. hostmaster.tumblr.com. 2012121602 86400 7200 604800 300
;; Received 108 bytes from 204.74.108.1#53(204.74.108.1) in 21 ms
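For anyone squinting at that SOA line, the fields after the two names are serial, refresh, retry, expire, and minimum (negative-caching TTL) - a quick sketch to label them:

```python
# Label the fields of the SOA record from the dig output above.
soa = "pdns1.ultradns.net. hostmaster.tumblr.com. 2012121602 86400 7200 604800 300"
names = ["mname", "rname", "serial", "refresh", "retry", "expire", "minimum"]
fields = dict(zip(names, soa.split()))
print(fields["serial"])   # -> 2012121602 (a YYYYMMDDnn-style serial)
print(fields["minimum"])  # -> 300 (negative-caching TTL, in seconds)
```

That serial suggests the zone was last bumped on 2012-12-16, consistent with very frequent updates.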
Looks like their DNS is down alright. You could try adding this to your hosts file:
72.32.231.8 www.tumblr.com tumblr.com
The IP is from their whois info and appears to be giving the "We're sorry" error message.
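In miniature, that hosts-file line just gives the resolver a static table to consult before DNS - roughly this (the 72.32.231.8 address is the one from whois above; no guarantee it stays valid):

```python
# Toy model of what /etc/hosts does: a static name->IP table checked
# before any DNS query goes out.
hosts_override = {
    "www.tumblr.com": "72.32.231.8",
    "tumblr.com": "72.32.231.8",
}

def resolve(name):
    if name in hosts_override:
        return hosts_override[name]
    raise LookupError(f"no static entry; would fall through to DNS for {name}")

print(resolve("www.tumblr.com"))  # -> 72.32.231.8
```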
Lots more costs involved, including an inane templating system that took far longer to work with than building the pages normally would have. Hosting could have happened on the existing platform, adding little to no cost. Also, the cost of downtime (lost revenue, people running around pointing fingers) for a site that large (yes, even the entertainment blog) quickly DWARFS any possible hosting costs.
They get hundreds of extra points for hosting their status page on their own service.
That's basically the main (already widely accepted) lesson people should take from this -- people want Twitter updates as well as an outside-hosted blog and monitor for service availability.
Doesn't serving the status page from the site itself defeat the purpose of a status page? In what case would that make any sense? I am now intrigued - this is not a small thing that can get overlooked.
For status other than "site is down", it's probably nice to have information on-site since I guess tumblr is about sharing easily within tumblr. But maintenance and server status needs to be both on-site (tumblr.com/status ?) and an off-site page at status.tumblr.com.
No, it does not defeat the purpose; it just lessens the effectiveness of the status page. It really depends on what Tumblr's most common error scenario is. 95% of errors may manifest themselves in ways which do not affect the status page itself.
The whole notion of "hosting" a status page is ridiculous.
You just generate it as static HTML and dump it on a bunch of free web hosts (or clouds if you want to be fancy). It's not rocket science to keep an HTML page online.
It would be reasonable to have something more sophisticated, like heroku's status page, which has both text/timestamp information and graphical representations of status. If you have an API, reporting deeper info about each part of the API is also a good idea.
It still should probably be a static page, but updated frequently and by automatic tools.
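A minimal sketch of that approach - a script (run from cron, say) that renders current check results to a single static page you can push anywhere; all the names here are made up:

```python
import datetime

def render_status(checks):
    # Render check results to one self-contained HTML page: no database,
    # no templating system, nothing to break when the main site is down.
    ts = datetime.datetime.now(datetime.timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    rows = "\n".join(
        f"<li>{name}: <strong>{state}</strong></li>"
        for name, state in sorted(checks.items())
    )
    return f"<html><body><h1>Status ({ts})</h1><ul>\n{rows}\n</ul></body></html>"

# Example input: whatever your monitoring reports right now.
html = render_status({"web": "up", "api": "up", "dashboard": "degraded"})
# A cron job would write `html` to disk and rsync/upload it to the
# independent hosts; here we just check a fragment.
print("dashboard: <strong>degraded</strong>" in html)  # -> True
```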
But until proven otherwise, I'll assume strange coincidence. Although, when microsoft.com falls off the face of the earth tomorrow, I'm calling it a conspiracy.
Admittedly, they update their zone very frequently (every time a user signs up/changes name/deletes themselves), but you'd think they would have an independent secondary DNS provider somewhere.
But if you don't publish an A record you get an ugly browser error message, whereas publishing the wildcard gives the user a notice saying that no such blog exists.
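The trade-off described above, as an illustrative zone snippet (the address is just the one quoted earlier in the thread):

```
; Without a wildcard, no-such-blog.tumblr.com is NXDOMAIN -> ugly browser error.
; With it, the request reaches the web tier, which can serve a friendly
; "no such blog exists" notice instead.
*.tumblr.com.   300   IN   A   72.32.231.8
```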