How Balanced Does Database Migrations With Zero Downtime (balancedpayments.com)
131 points by mahmoudimus on March 5, 2013 | 60 comments



You mean you didn't have to "build a custom HTTP server and application dispatching infrastructure around Python's Tornado and Redis" in order to suspend traffic? [1]

I like the simplicity of your approach, it's literally 2 lines of code. Bravo guys.

[1] https://www.braintreepayments.com/blog/how-we-built-the-soft...
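
If I'm reading the post right, the two lines are just runtime commands sent to HAProxy's stats socket (the 1.5-dev "set maxconn frontend" support). Wrapped in Fabric, it might look roughly like this -- the frontend name, socket path, and connection limit are my guesses, not Balanced's actual code:

    # Queue new connections by dropping the frontend's maxconn to 0,
    # then restore the limit once the migration is done.
    from fabric.api import run

    HAPROXY_SOCK = "/var/run/haproxy.sock"  # assumed socket path

    def haproxy_cmd(cmd):
        # socat relays the command to HAProxy's stats socket.
        run("echo '%s' | socat stdio %s" % (cmd, HAPROXY_SOCK))

    def suspend_traffic():
        haproxy_cmd("set maxconn frontend balanced_api 0")

    def resume_traffic():
        haproxy_cmd("set maxconn frontend balanced_api 4000")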


To be fair, they had to upgrade to a development release (1.5) of HAProxy to configure it in this way. When we introduced the broxy at Braintree that feature did not exist.

(Disclosure - I work at Braintree)


The broxy also adds more functionality, such as intelligent rate limiting by merchant (no one merchant can consume all of our backend app processes).

(Disclosure - I also work at Braintree)


We actually have much the same functionality. We can dynamically route/rate-limit requests on a per-marketplace (or anything else, really) basis, using Nginx's Lua integration capabilities. We may end up writing another post on this, if there's any interest from the community.
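
To give a flavor of the idea (this isn't our actual Lua -- just the shape of the logic, sketched in Python): a token bucket per marketplace, consulted before each request is proxied. All the numbers and names here are invented:

    import time

    RATE = 50.0    # refill rate, requests/sec per marketplace (made up)
    BURST = 100.0  # bucket capacity (made up)
    buckets = {}   # marketplace_id -> (tokens, last_refill_time)

    def allow(marketplace_id):
        # Refill the bucket based on elapsed time, then try to take a token.
        now = time.time()
        tokens, last = buckets.get(marketplace_id, (BURST, now))
        tokens = min(BURST, tokens + (now - last) * RATE)
        allowed = tokens >= 1.0
        buckets[marketplace_id] = (tokens - 1.0 if allowed else tokens, now)
        return allowed  # False -> have Nginx return a 429/503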


I think you can safely assume that any post along the lines of

"How we handle something <TECHNICAL> at balanced"

or

"How we handled <SITUATION> using <TECHNOLOGY> at balanced"

is of interest to this community.

No need to ask, get writing, hop to it! :)


Consider this an expression of interest from the community. :)

(We're planning on doing something similar in the future.)


Nginx + Lua = YES, YES, YES, give us more... :-)


This is quite interesting. Please do elaborate.


This strikes me as a very risky way to do a migration. If there are bugs in the new application code you basically cannot roll back, since the old version of the code ran against an old database schema. The slower approach of making the schema change backwards compatible, deploying new code, and then dropping the old columns seems a lot safer.
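
Concretely, the safer sequence looks something like this (a minimal sketch assuming Postgres/psycopg2; the table and column names are invented):

    import psycopg2

    conn = psycopg2.connect("dbname=app")
    cur = conn.cursor()

    # 1. Expand: add the new column as nullable so old code keeps working.
    cur.execute("ALTER TABLE accounts ADD COLUMN status TEXT")
    conn.commit()

    # 2. Deploy code that reads the old schema but writes both.

    # 3. Backfill old rows, then tighten the constraint.
    cur.execute("UPDATE accounts SET status = 'active' WHERE status IS NULL")
    cur.execute("ALTER TABLE accounts ALTER COLUMN status SET NOT NULL")
    conn.commit()

    # 4. Deploy code that only expects the new schema; drop old columns last.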


You're definitely correct that bugs in the new application code would be a showstopper. Another bit of infrastructure we plan to write about if there's interest is our testing setup.

Basically, before any deploy (not just drastic ones like this), not only do all existing unit tests for the new code have to pass (with a certain minimum code-coverage threshold), but so does a full acceptance suite, which tests the new code and how it connects to all of our other bits of infrastructure.

We simulate an extensive set of operations that a client might perform against a test instance of the server, and we also run a number of tests with all of our services loaded in-memory, which allows us to mock/patch arbitrary points in the code and assert that what we expect to happen is actually happening. We also run each of our various clients' test suites against the new code, to make sure that each client sees the behavior it expects.

This testing suite has dramatically increased our confidence any time we have to do a deploy, and best of all, it's all done automatically.
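
To give a flavor of the in-memory part, here's a toy version (every name in it is invented for illustration):

    import unittest
    from unittest import mock

    class Processor(object):
        # Stand-in for one of our internal services, loaded in-process.
        def charge(self, card, amount):
            return {'status': 'succeeded'}

    processor = Processor()

    def charge_card(card, amount):
        # Application code under test, calling through to the service.
        return processor.charge(card, amount)

    class AcceptanceTest(unittest.TestCase):
        def test_charge_reaches_processor(self):
            # Because the service runs in-memory, we can patch/spy on
            # arbitrary points and assert what actually happened.
            with mock.patch.object(processor, 'charge',
                                   wraps=processor.charge) as spy:
                result = charge_card('CC123', 1000)
            spy.assert_called_once_with('CC123', 1000)
            self.assertEqual(result['status'], 'succeeded')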


If HN is interested in how this works: I've demonstrated it to a few people, and it allows us to move FAST and confidently.

Since we use services internally, being able to confidently test the interactions between all our services (6+ at this point) is a HUGE win for us.

Open up an issue here: https://github.com/balanced/balanced.github.com if you're interested.


Are you seriously questioning that there's interest? I come here for the few links to tech articles with high signal:noise ratios. There are perhaps one or two a day.

Judging by what just got posted, please assume you can post anything about your tech stack and we will love it.


I would love to hear more about this!


I gave a presentation on zero downtime database migrations at a devops conference in Boston in November 2012. Slides are here: http://www.completefusion.com/zero-downtime-deployments-with...

I also wrote a more detailed post about it for SysAdvent 2012 - http://sysadvent.blogspot.com/2012/12/day-3-zero-downtime-my...


What's the point of ELB, Nginx, and then HAProxy?


Nginx is primarily for SSL termination and static assets. At the time we set up our infrastructure, I don't believe HAProxy supported SSL termination. According to Willy Tarreau's comment on the first answer to this question (http://serverfault.com/questions/426919/should-i-use-an-ssl-...), it was added in the same release as the patch I mentioned, coincidentally.


Why not let ELB handle SSL termination and load balancing (ignoring the fact that HAproxy can delay connections by 15 seconds)?


Because we process credit card payments and thus fall under PCI scope, we have to adhere to the PCI DSS (data security standard). There's a "quick" summary of it here: https://www.pcisecuritystandards.org/documents/pci_ssc_quick... and section 4.1 in particular specifies that we have to secure cardholder data all the way to our servers -- Amazon's ELB doesn't quite count.


I believe Amazon is PCI compliant now? Would that change things?


Amazon being PCI-compliant was a requirement for us using them in the first place :) We could have possibly made a case for their PCI-compliance obviating the need for us to do our own SSL termination, but that could have gone either way, depending on our PCI audits.

Using Nginx also lets us do fun stuff with routing using Nginx's Lua integration, which we may end up writing about in the future as well.


OK then why HAProxy? Why not just let nginx do the load balancing? (Obviously you have a reason now if you plan to use the method in the blog post again, but what about before?)


I use nginx + haproxy as well, with haproxy doing the load balancing. haproxy simply has much more visibility into the queue; I'm not aware of anything built into nginx that is as robust as the logging and stats page from haproxy. This makes horizontal scaling decisions infinitely easier.


I see... Do you run nginx and haproxy on the same box?


Yes.


The 1.5 branch of Haproxy supports SSL termination, and it works quite well.


I was also wondering this.


"- perform schema changes in a way that won’t break existing code (e.g., temporarily allowing new non-nullable columns to be NULL). - deploy code that works with both the old and new schema simultaneously, populating any new rows according to the new schema. - perform a data migration that backfills any old data correctly, and updates all constraints properly. - deploy code that only expects to see the new schema."

That sounds a lot like transactional schema updates in Firebird to me. Plus being careful about how the app handles the data. Schema updates in Firebird are essentially instantaneous, with the row updates performed lazily (although if you need it, a SELECT COUNT(*) will force an update of all rows immediately).


Oracle is the same - DDL is automatically committed. I think in SQL Server DDL can be transactional. Not sure about Postgres.


PostgreSQL can have almost all DDL changes inside a transaction.
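
For example, with psycopg2 (table and column names made up):

    # Postgres DDL participates in transactions, so a failed migration
    # can simply roll back.
    import psycopg2

    conn = psycopg2.connect("dbname=app")
    cur = conn.cursor()
    cur.execute("ALTER TABLE accounts ADD COLUMN nickname TEXT")
    conn.rollback()  # the ALTER never happened, as far as anyone can tell

    cur.execute("SELECT column_name FROM information_schema.columns"
                " WHERE table_name = 'accounts'")
    assert ('nickname',) not in cur.fetchall()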


That's a great approach for the cases where the migration takes only a few seconds. When you start running into situations where the migration takes hours or days, it's back to the "normal" way.


Sure, this definitely isn't the right approach for all migrations everywhere. When we have migrations that we know will take a long time, we try to make them non-invasive enough that we can have code that works with old and new schemas simultaneously. That way, the migration can take as long as it needs to, and everything Just Works™.


This was a really interesting article. I feel silly for asking this but why didn't you set up a script which did all of the maintenance, migration and deploy tasks automatically?

You obviously thought about this a lot but I don't see why you'd want two humans doing it instead of a script.


Thanks! We had been practicing on test servers and so we had the commands ready to be repeated, but you're right -- for reproducibility, a script would have been the way to go.


Not just for reproducibility, but to avoid human errors. I always just get nervous when humans make changes to production environments. I'd much rather have a tested script do it.


This isn't zero downtime. It's still 15s of downtime (which is really a trivial amount of time for a migration). As a user, I would rather see a maintenance page up than have my connections stall out and have me staring at a blank page.


(Posting for msherry since he can't seem to respond at the moment)

The whole point of this was so API requests wouldn't fail, they would just take slightly longer. API requests don't get the option to see a maintenance page -- they would just return errors to the client, which potentially means lost business for them.


This sounds incredibly rudimentary compared to http://www.percona.com/doc/percona-toolkit/2.1/pt-online-sch...


Hi there. I'm the author of this post.

If I'm reading this correctly, your suggestion would alter a single table online, and at the end I would have a table with the new schema (assuming I had no foreign keys referencing the table being modified, which seems to introduce additional complications). Presumably, this change happens while my application is running, which means that during the migration I would have to use the old table format, and then cut over to the new one instantly once the migration has completed.

Our migration at the time involved multiple table changes, many of which had foreign keys referencing each other. It doesn't sound like this tool would atomically switch all tables to the new schemas, which would have led to broken data for us. Does that make sense?

EDIT: grammar


You're right about needing to switch your application over instantly. Where I've used it, it has mostly been to add columns to a table and thus the old application code continues to run perfectly fine.

While you can't use pt-osc to do multi-table updates directly, you can use the same strategy. All it does is create a new table with the new schema, add triggers to the old table that duplicate row modifications to the new table, and then copy the old rows across. Then, when all the copies are done, it atomically renames the new table into place and deletes the old table.

There is nothing to stop you from delaying the rename until all new tables are ready, except that it is more hassle than just using pt-osc as it comes.
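
For what it's worth, MySQL's multi-table RENAME TABLE is atomic, so the delayed cut-over could be a single statement. A sketch (table names invented):

    import MySQLdb

    conn = MySQLdb.connect(db="app")
    cur = conn.cursor()
    # One atomic swap for every table pair, after all the copies are done.
    cur.execute(
        "RENAME TABLE accounts TO accounts_old, accounts_new TO accounts,"
        " charges TO charges_old, charges_new TO charges"
    )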

But, point taken: your case is more complicated than the one I was thinking of. And thanks for your thoughtful response to my somewhat dismissive comment :)


I would have used sharding, and then normal failure handling and resynchronisation between nodes. If you have the requirement that two adjacent versions have to be able to resynchronise after failure upgrades (and rollback) become equivalent to normal node failure.

As it stands, during your upgrade you've lost all of your fault tolerance and can't meet your performance requirements - you've gone from 5 nodes able to process the traffic to 1!


You make a good point re: fault tolerance. As it happens, I've simplified the diagram quite a bit to make it easier to visualise. We have more than 2 Nginx instances, and more than one shard of our app was running the new code.


It sounds like a shard collapsed down to a single database instance? Otherwise, you wouldn't have had to turn off requests to the shard, you could have quiesced one of the nodes, taken it out of service and brought it back up with the new release?


I'm sounding really negative, I apologize. :) I love that you've got your upgrades to 15s, that's a powerful place to be!


It seems like the interaction required by the two engineers could have been scripted to remove the human element. Was there a reason you chose not to do this?


Perhaps I'm missing something, but why not down a segment (shard I guess), upgrade the application and database, then fail over to that shard?

This would provide a fail back mechanism assuming you could resolve data continuity.

I'm sure there is a reason this wasn't viable (possibly the data issue), but I was curious.


This is basically what we did, except that during the "upgrade the application and database" step, we suspended all traffic to our app servers. The schema change was an incompatible one (I think this is what you mean by "resolve data continuity"?), so our old code and new code could not run simultaneously -- they were designed against incompatible schemas.


So if this was the case, why couldn't the other shards handle application requests while you casually upgraded the application and database tier in say... 30 seconds (or even minutes) as opposed to sub 15s?


As stated in the post, the db migration was a large enough change that having both codebases working on the migrating db would have been a high cost.


I've been eyeing Soundcloud's Large Hadron Migrator (https://github.com/soundcloud/large-hadron-migrator). I would love to see a Django/South specific implementation.


We're a Django shop, and pt-online-schema-change is an amazing tool. We've run it on tables with 10-50M rows in production with minimal downtime, <1 second. I can't speak highly enough of it if you are a MySQL shop.
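
For reference, the invocation looks roughly like this (DSN and column are made up; --dry-run first is cheap insurance before --execute):

    import subprocess

    subprocess.check_call([
        "pt-online-schema-change",
        "--alter", "ADD COLUMN flagged TINYINT NOT NULL DEFAULT 0",
        "--execute",         # swap for --dry-run to sanity-check first
        "D=app,t=accounts",  # database and table DSN
    ])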


Great writeup! Loved reading it and adding this to my stash of known HA strategies. Did you folks also chart out historical traffic and carry out this migration when it was the most sparse? Like early morning on a weekend or something.


Absolutely.

Being a payments-processing company, we have a variety of users (marketplaces), each of whom has users who are widely geographically distributed, so our usage doesn't drop off as much as some other types of sites might on weekends. That being said, on weekends we see slightly lower traffic than during the weekday, so we performed this migration on a Saturday evening.


What happens if it takes more than 15 seconds to migrate the database?


Then some of our clients would have been disappointed by the timeout errors they had started seeing, if they happened to make a request at the very start of the migration ;)

We ran our migrations multiple times on test instances of our database, because we were worried about this exact issue. We optimized the migration to remove extraneous changes a few times in order to cut down the time taken. Also, 13 seconds was actually the upper bound of what we saw -- many times we ran it, it took closer to 9-10 seconds.


The time to migrate your database will only get longer as the data grows. Looking forward to your follow-up post. :)


Not sure about the structure of your database, but do you have any strategies identified for when maybe you will have twice the rows in your biggest table? Would it then take more than 15 seconds?


"Zero Downtime" migration but if the migration will take less than 15 seconds. Honestly, this is not very interesting. The same effect could be achieved by increasing connection timeout on the clients and the server and then just letting the clients to wait while the DB schema changes takes place. This works great for small tables/data sets. When you get to a bigger scale and you migration takes longer (minutes, hours, days) then you might start to look at more advanced tools including percona tools or custom migration code in your app.


Are you saying it would work without code changes by increasing the timeout limit? We had to deploy new code to work with the new schemas, since the underlying models went through drastic changes. Having our code be compatible with old and new schemas simultaneously, as it is during most of our migrations, would have been extremely difficult in this specific case, which is what prompted this solution. I don't think this is a problem that could have been solved by simple timeout changes, but I'd love to hear your thoughts on it.


For, say, adding a new column or changing indexes, one can probably just run the update w/o any code changes. For more complex cases, you might need to modify the code to work with both old/new schemas. It's all about the details :)


Is the socat required in the Fabric bits at the end?



