Continuous deployment sounds right to me in principle, but I have yet to see a system that handles rollback when a database is involved (one slide mentioned this briefly). When you have a bug (say you delete the wrong data), how do you roll that back?
The only solution I can see is one where databases are handled differently (they generally are anyway at the deployment stage), and IMO, this is the challenging issue. Continuously deploying app servers is not trivial, but I would consider it a mostly solved problem.
Soft-delete (+ action logs) for data recovery. Also, code review and lots and lots of tests.
For schema changes, there are a bunch of approaches. Just a few:
* Treat (non-purely additive) schema changes as a special case and take extra care deploying them. (This sounds like a cop-out, but it might be the right effort tradeoff for many startups.)
* Write forwards and reverse migrations for each change, and avoid data-losing schema changes.
* Implement all (or all non-purely additive) schema changes by shadowing data operations onto a new version of the schema (asynchronously migrate all previous data onto the new schema in some back-end process.) Write tests to verify the migration/shadowed data is correct. Switch data reads to be on the new schema. Eventually delete the old schema in a subsequent deploy after an appropriate stabilization/verification period.
I find #3 to be the most general, but there's plenty of overhead (computational, storage and man-hour) that may not be appropriate for your case.
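To make #3 concrete, here's a rough sketch of the dual-write idea using SQLite (table and column names are invented for illustration, and the error handling is deliberately simplistic):

    import sqlite3

    # Approach #3 in miniature: writes go to both the old and the new schema,
    # reads stay on the old schema until the backfill is verified, then
    # READ_FROM_NEW is flipped in a later deploy (and the old table is dropped
    # later still).
    READ_FROM_NEW = False

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users_v1 (id INTEGER PRIMARY KEY, name TEXT)")
    db.execute("CREATE TABLE users_v2 (id INTEGER PRIMARY KEY,"
               " first_name TEXT, last_name TEXT)")

    def save_user(user_id, name):
        db.execute("INSERT OR REPLACE INTO users_v1 (id, name) VALUES (?, ?)",
                   (user_id, name))
        first, _, last = name.partition(" ")
        try:
            # Shadow write onto the new schema; a failure here is logged,
            # never allowed to break the request.
            db.execute("INSERT OR REPLACE INTO users_v2 (id, first_name, last_name)"
                       " VALUES (?, ?, ?)", (user_id, first, last))
        except sqlite3.Error as exc:
            print("shadow write failed:", exc)

    def load_user(user_id):
        table = "users_v2" if READ_FROM_NEW else "users_v1"
        return db.execute("SELECT * FROM " + table + " WHERE id = ?",
                          (user_id,)).fetchone()

    save_user(1, "Ada Lovelace")
    print(load_user(1))

The asynchronous backfill of historical rows is then just a batch job that reads users_v1 and calls the same v2 write path, and the verification tests compare row counts and spot-check values between the two tables before reads are switched.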
A number of techniques help here. These are useful whether or not you're using continuous deployment.
A simple approach is not to delete data, but rather simply set a flag which marks the data as deleted. The application storage layer acts as if this data were deleted. A separate process (changed separately) is responsible for actually purging data which has remained deleted for a certain period of time.
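Roughly like this (a sketch, with an assumed table name and retention window):

    import sqlite3

    PURGE_AFTER_DAYS = 30  # assumed retention window

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT,"
               " deleted_at INTEGER)")

    # The application never issues DELETE; it only sets the flag.
    def delete_post(post_id):
        db.execute("UPDATE posts SET deleted_at = strftime('%s','now')"
                   " WHERE id = ?", (post_id,))

    # The storage layer filters flagged rows out everywhere.
    def visible_posts():
        return db.execute("SELECT id, body FROM posts"
                          " WHERE deleted_at IS NULL").fetchall()

    # Run from a separate, independently deployed purge job, not from the app.
    def purge_old_deletes():
        db.execute("DELETE FROM posts WHERE deleted_at IS NOT NULL AND"
                   " deleted_at < strftime('%s','now') - ?",
                   (PURGE_AFTER_DAYS * 86400,))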
In any case, do you really wish to delete the data? Keep it around -- it might be useful. The only reason to delete data, outside legal concerns, is to control the cost of retaining it. Given that you mostly never really want to delete data, you probably want to orient your schema toward hiding it instead. In that case, if you need to roll back application changes, you can do so by backfilling the data to restore it to a visible status.
Keeping detailed programmatic logs is also helpful for making the backfill easy to execute. For example, in addition to your main textual log file, you might keep something like a log consisting of JSON objects, one per line, each of which represents the essential details of some application action.
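Something like this, say (field names invented):

    import json, time

    # One JSON object per line, capturing enough of each action to drive a
    # backfill later. The fields below are invented for illustration.
    def log_action(log_file, action, **details):
        record = {"ts": time.time(), "action": action, **details}
        log_file.write(json.dumps(record) + "\n")
        log_file.flush()

    with open("actions.log", "a") as f:
        log_action(f, "listing_deleted", listing_id=12345, user_id=67,
                   reason="spam")

Replaying that log, filtered by action type and time range, is what makes an "undo everything the bad deploy touched" backfill tractable.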
You are correct: at Etsy we deploy schema changes once per week. The code always works against both versions of the schema. We never take downtime for schema changes.
We avoid data loss by doing soft deletes as much as we can. Sometimes we do want to do real deletes, especially for data that is heavily denormalized, but in those cases we keep an audit trail so that the data can be reconstituted in the event of a mistake.
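Not Etsy's actual code, but the audit-trail idea can be as simple as copying the full row somewhere durable before the real delete (names invented):

    import json, sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE listings (id INTEGER PRIMARY KEY, data TEXT)")
    db.execute("CREATE TABLE deleted_rows_audit (tbl TEXT, row_id INTEGER,"
               " row_json TEXT, deleted_at INTEGER)")

    def hard_delete_listing(listing_id):
        row = db.execute("SELECT id, data FROM listings WHERE id = ?",
                         (listing_id,)).fetchone()
        if row is None:
            return
        # Preserve the full row so a mistaken delete can be reconstituted.
        db.execute("INSERT INTO deleted_rows_audit VALUES"
                   " (?, ?, ?, strftime('%s','now'))",
                   ("listings", row[0],
                    json.dumps({"id": row[0], "data": row[1]})))
        db.execute("DELETE FROM listings WHERE id = ?", (listing_id,))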
1. Use some sort of schema migration tool to keep track of and version all changes you've made to the database since your very first commit.
2. Never do a backwards-incompatible schema change. Never. Time and time again these are the ones that will completely bite you in the ass when the day comes that you broke production with a deployment and desperately need the old version back up and running. You're going to end up losing data and shit will hit the fan. It doesn't take much code-wise to support the old schema, and it will ensure that you have zero downtime while rolling out your updates.
3. Never delete data. The only exception being if some user-submitted content violates a law, like child pornography. In which case, it shouldn't be part of your migration/update process to begin with. Just use deleted flags / boolean fields. Note in your code when certain fields are no longer used anymore because they're only there to support legacy versions.
I am pretty much a beginner in database handling; what would you suggest for #1?
Concerning 2: how do you handle schema fixes (example: at my previous job, the tables were very badly designed, and the application+schema combination was prone to frequent race conditions)? What do you do in that case?
As for 3, I already came to this conclusion on my own, so I was not completely on the wrong track :)
I totally get all these points; is there any literature I can point our devs towards to convince them that (a) #2 + #3 don't leave any corner cases uncovered and (b) they really should take the time to do it (ultimately so I can put an automation solution around the DB)?
While it doesn't handle every case, using a database migration framework with Up/Down style methods as part of your deploy process can work pretty effectively.
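A typical migration in that style is just a versioned file with a pair of methods, roughly like this (illustrative, not any specific framework's API):

    # migrations/0007_add_last_login.py -- the deploy script applies pending
    # up() functions in order; a rollback runs down() in reverse order.
    # Purely additive, so old application code keeps working against it.

    def up(db):
        db.execute("ALTER TABLE users ADD COLUMN last_login INTEGER")

    def down(db):
        db.execute("ALTER TABLE users DROP COLUMN last_login")

Tools like Alembic, Flyway, or Rails migrations handle the bookkeeping (which versions have been applied where) for you, which also answers the "what would you suggest for #1" question above.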
Another option is to use a VM for your database and do a snapshot before deploying.
Chances are good that the important tables that handle transactions don't change as often as other things in the system. This is why I support things like CouchDB for handling the non-transactional stuff.
Based on a very brief conversation I had with a Paypal engineer a few years back, it's SOP in modern financial transaction handling to store/log details of every transaction in many places and various forms, and there's plenty of (probably excessive) paranoia involved.
The context of the discussion was performance, scalability, and engineering resources, so we didn't get into how the logs are used, but my guess is that if the "primary" database had some sort of failure that led to a rollback, affected accounts would be locked out until the transactions could be properly recovered from the other storage/logging mechanisms.
I'm curious about how HN feels about doing feature management all in trunk with if/then blocks (Flickr, and I assume Etsy), versus a more branch-based workflow. By having config flags you can be very particular about what features or bugfixes are enabled/disabled at any time, but it seems like what you save in merge time, you pay for by having a bunch of dead code hanging around until someone gets rid of it.
Even in svn, and much better in git, if you can keep a sane release and merging strategy then it's not going to introduce much overhead to getting new features or bugfixes out. Our 20+ person team deploys a few times a day using a branch system of (sprint + trunk/QA + release), but always looking for different ways to do things.
They're not mutually exclusive: you want basic correctness/design review of course (my favored approach is now pre-integration code review + authoritative master/master-always-deployable), but a proper feature-flagging system will let you beta or A/B test potentially incorrect (in a code-correctness, performance, or UX sense) changes gradually in the real world. Binary (on/off for everybody) flags are more the exception than the rule.
Server affinity + staged rollouts are another approach, but for long timeframes (e.g. A/B UX tests, feature prototypes) I'd think you'd want to go with flagging anyway from a codebase management point of view.
Some things do take more effort to make dynamically flaggable than it would take cost to delay integration, to be fair. Judgement call.
Config flags and how we commit code are two completely separate issues. Very simply, you need config flags to operate a site that degrades gracefully when something is going wrong. They cannot be avoided. Is there a query in the forums suddenly hitting the database too hard? Turn off the forums while we fix it so the rest of the site isn't affected. Etc.
When a feature is in development, its config flag is off or enabled for admins. We deploy it as we develop it, in pieces that are generally not more than a few dozen lines of code long. This is another critical idea: we don't push out all of the code for an entire sizable feature all at once, ever. Doing that is a recipe for having a problem and having to review thousands of lines of code to try to figure out what it is.
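In code it doesn't need to be anything fancier than a config dict checked at the top of the feature (names invented here, not Etsy's actual code):

    # A config that can be changed and pushed quickly, independently of app code.
    FEATURES = {
        "forums":       {"enabled": True},                       # kill switch
        "new_checkout": {"enabled": False, "admin_only": True},  # in development
    }

    def feature_enabled(name, is_admin=False):
        flag = FEATURES.get(name, {})
        if flag.get("admin_only") and is_admin:
            return True
        return bool(flag.get("enabled"))

    # e.g. in the forums controller: degrade gracefully instead of erroring out
    def forums_page(is_admin=False):
        if not feature_enabled("forums", is_admin):
            return "Forums are temporarily unavailable."
        return "normal forums page"

    print(forums_page())                   # normal forums page
    FEATURES["forums"]["enabled"] = False  # ops flips the flag during an incident
    print(forums_page())                   # Forums are temporarily unavailable.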
For new features, we generally do not remove the config flag once a feature is live because we want to be able to disable it if anything goes wrong.
If we are replacing a feature with another one, we generally want to keep the old one around for a little while (we generally do A/B testing or ramp up new versions whenever we do this). After that, we do delete the old code and reduce the config flag to an on/off switch which was probably there in the first place.
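The ramp-up part can be as simple as deterministic bucketing on user id (a sketch; the hashing scheme here is an assumption):

    import hashlib

    def in_rollout(feature, user_id, percent):
        # Deterministic: a given user's bucket never changes, so they stay in
        # the new version as the percentage is dialed up from 1 to 100, and
        # every server agrees without any shared state.
        digest = hashlib.md5(f"{feature}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100 < percent

    print(in_rollout("new_checkout", 42, 10))  # same answer on every call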
Thanks for the replies. I didn't mean to frame it in a mutually exclusive way; I'm up for ini or yaml configs any day. Am looking at this from a rel-eng perspective.
If you are specifically interested in this kind of continuous deploy process then "Continuous Delivery[1]" is a decent book on the subject. The IMVU post "Doing The Impossible 50 Times A Day[2]" is another good place to start reading.
Anyone have these slides in a non-Flash format? I try to advance the slide and it just corrupts the display. Using Flash to swap out static images. Dumbfounding.
http://codeascraft.etsy.com/2010/05/20/quantum-of-deployment