How we handle deploys and failover without disrupting user experience (mixpanel.com)
72 points by notknifescience on Sept 28, 2012 | 14 comments



By far the easiest way to get similar behavior is to run your Django app via Gunicorn (proxied from Nginx). Gunicorn supports hot code reloading via SIGHUP, and it does so by forking and gracefully killing old processes.
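
For concreteness, here's a minimal sketch of triggering that reload from a deploy script, assuming Gunicorn was started with a pidfile at /tmp/gunicorn.pid (the path is my assumption, not part of Gunicorn's defaults):

    # Send SIGHUP to the running Gunicorn master; it forks new workers with
    # the reloaded code and gracefully retires the old ones.
    import os
    import signal

    PIDFILE = "/tmp/gunicorn.pid"  # assumed path; match your --pid setting

    def reload_gunicorn(pidfile=PIDFILE):
        with open(pidfile) as f:
            master_pid = int(f.read().strip())
        os.kill(master_pid, signal.SIGHUP)

    if __name__ == "__main__":
        reload_gunicorn()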

If your requirements don't match Gunicorn (not Django, not Python, etc.), you can use https://github.com/TimothyFitz/zdd, a project I wrote to automate rewriting nginx config files to deal with changing proxied ports. To integrate any existing server, all you have to do is make it bind to port 0 (let the OS choose a port) and then write a foo.port file that contains the port number (like a pidfile). That's it.
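
As a rough illustration of that contract (not zdd's actual code; the wsgiref server here is just a stand-in for whatever server you run), a server integrates by binding to port 0 and writing the chosen port to a .port file:

    # Bind to port 0 so the OS picks a free port, then write it to foo.port
    # (like a pidfile) so zdd can discover where this instance is listening.
    from wsgiref.simple_server import make_server

    def app(environ, start_response):
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"hello\n"]

    server = make_server("127.0.0.1", 0, app)   # port 0: let the OS choose
    port = server.server_address[1]

    with open("foo.port", "w") as f:
        f.write(str(port))

    server.serve_forever()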


I mentioned this in a thread the other day; I've yet to see a good use case for hot code reloading. Can you really not drain requests away from that host via HAProxy (or similar), and then actually restart the service? The nice thing about that approach is that your choice of service runtime doesn't matter.


Your method will stall responses for server shutdown + server startup time, which for Ruby/Python apps is usually measured in tens of seconds, and for other web servers can be much worse. Hot code reloading lets you avoid any downtime at all, and since it's usually built into the framework- or language-specific server, you get the functionality "for free".

Zdd (the project I linked to) is all about spawning a new process in parallel. All the advantages of your approach (switch to an entirely different language? Who cares) but without the stalls.

Zdd also lets you keep the old process alive through the duration of the deploy (and after), and with a little work could let you switch back in the event of a bad deploy without having to start the old version up again.


The method he's describing requires no downtime or stalled requests. Connections are drained from one pool of servers in the load-balanced set, the service is restarted with the new code, and the hosts are given traffic again once they are initialized and healthy.
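
For example, here's a sketch of that drain/restart/re-enable cycle driven through HAProxy's admin socket (the socket path and backend/server names are placeholders, and the stats socket has to be configured with "level admin"):

    # Take one host out of rotation, deploy, then put it back.
    import socket

    HAPROXY_SOCKET = "/var/run/haproxy.sock"   # assumed path

    def haproxy_cmd(cmd):
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.connect(HAPROXY_SOCKET)
        s.sendall((cmd + "\n").encode())
        response = s.recv(4096).decode()
        s.close()
        return response

    haproxy_cmd("disable server app_backend/web1")
    # ... wait for in-flight requests to finish, restart the service,
    # run smoke tests against the new code ...
    haproxy_cmd("enable server app_backend/web1")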

The advantages of this method include being completely platform agnostic, as well as giving you a window to verify a successful update without worrying about production traffic.


Thanks for the clarification. The downsides to that approach are that you need multiple machines, and the duration of your deploys is much longer. Not to mention, you'd have to script a deploy process across multiple machines (which is not easy, in the way that "SIGHUP Gunicorn" is easy).

Personally I've found the "put new instances into a load balancer" method to make more sense for system changes (packages, kernels, OS versions) where deploying the change is inherently slow or expensive, but the method doesn't make sense for code deploys where deploy time is important.


Thanks for the Gunicorn hot-reloading tip. I've been using Gunicorn with Supervisor, and I wish I'd known about the SIGHUP trick earlier.


I thought the commonly accepted way to do this was to

* Have more than one application server

* Deploy new code to a non-active application server

* Send some traffic to the new app server (perhaps based on cookie; see the sketch after this list)

* When confident, switch traffic (using Varnish/other load balancer) to the new application server
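
A toy sketch of the cookie-based step, purely to illustrate the idea (in practice this lives in Varnish/HAProxy config rather than application code; the backend addresses and cookie name are made up):

    # Route requests carrying a "canary" cookie to the newly deployed app
    # server; everything else stays on the current one.
    from http.cookies import SimpleCookie

    OLD_BACKEND = "http://10.0.0.10:8000"
    NEW_BACKEND = "http://10.0.0.11:8000"

    def choose_backend(cookie_header):
        cookies = SimpleCookie(cookie_header or "")
        if "canary" in cookies and cookies["canary"].value == "1":
            return NEW_BACKEND
        return OLD_BACKEND

    assert choose_backend("canary=1") == NEW_BACKEND
    assert choose_backend("") == OLD_BACKEND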


I'll share how wavii does this. We use Amazon ELB and Chef. The first Chef recipe that runs on our frontend is "touch /tmp/website_should_drain". The website then returns 404 from the /status health check URL. We wait 30s, update the site, and then rm the draining sentinel file. If the deployment fails, the host stays out of the ELB. If more than 1/3 of hosts fail, we abort the whole deployment. This works well for us and was very simple to implement.
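
A minimal sketch of that health check, written as a bare WSGI app for illustration (the real view would live in whatever framework the site uses; only the sentinel path and /status URL come from the comment above):

    # If the drain sentinel exists, /status returns 404 so the ELB marks the
    # host unhealthy and stops sending it new traffic.
    import os

    DRAIN_SENTINEL = "/tmp/website_should_drain"

    def status_app(environ, start_response):
        if environ.get("PATH_INFO") == "/status" and not os.path.exists(DRAIN_SENTINEL):
            start_response("200 OK", [("Content-Type", "text/plain")])
            return [b"ok\n"]
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"unavailable\n"]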


We've found that when the ELB health check returns non-ok, the ELB will immediately kill any requests in flight to that box. So it's kind of impossible to do a real "drain" with ELB. Are you seeing different behavior?


We've seen the same behavior, so we do the draining via HAProxy, which does the right thing.


Mixpanel had some downtime 2 days ago, from 22:59 until 23:23 UTC according to our error logs. It might have only been for API users, but downtime nonetheless.

It's certainly possible it was scheduled, but it's not clear where that downtime is announced if so.


Another method for doing this with haproxy+iptables: http://www.igvita.com/2008/12/02/zero-downtime-restarts-with...


I hope this is all automated :)

This seems very tightly coupled to the front end webapp? Do you guys have a completely different system for backend services? Or is there a level of abstraction that isn't being revealed?


I like the way they implemented the custom connection timeout for queries. Great example of using exceptions intelligently.
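
I don't know what their code actually looks like, but one common Python pattern for an exception-based query timeout is a SIGALRM handler that raises a custom exception the caller can catch and fail over on (Unix-only, single-threaded caller assumed; the names here are made up):

    import signal

    class QueryTimeout(Exception):
        pass

    def _on_alarm(signum, frame):
        raise QueryTimeout("query exceeded time budget")

    def run_with_timeout(query_fn, seconds=2):
        old_handler = signal.signal(signal.SIGALRM, _on_alarm)
        signal.alarm(seconds)
        try:
            return query_fn()
        finally:
            signal.alarm(0)                      # cancel the pending alarm
            signal.signal(signal.SIGALRM, old_handler)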



