
In my experience, my pain hasn't been from managing the migration files themselves.

My pain comes from needing to modify the schema in such a way that it requires a data migration: reprocessing data to some degree, and managing the migration of that data in a sane way.

You can't just run a simple `INSERT INTO table SELECT * FROM old_table`, or anything like that, because if the data is large it takes forever, and a failure in the middle can be difficult to recover from.

So what I do is split the migration into time-based chunks, since nearly every table has a time component that is immutable after writing. What I really want is a migration tool that can figure out what that column is, work out what those chunks are, and incrementally apply a data migration in batches, so that if one batch fails I can go in to investigate and know exactly how much progress the data migration has made.
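A minimal sketch of that chunking idea, assuming a table with an immutable `created_at` timestamp column (the column name, table names, and SQL shape here are placeholders, not any particular tool's API):

```python
from datetime import datetime, timedelta

def migration_chunks(start, end, step=timedelta(days=1)):
    """Yield (chunk_start, chunk_end) windows covering [start, end)."""
    cur = start
    while cur < end:
        nxt = min(cur + step, end)
        yield cur, nxt
        cur = nxt

def chunk_statements(start, end, step=timedelta(days=1)):
    """Build one INSERT ... SELECT per time window, so a failure only
    affects a single chunk and progress is easy to see."""
    for lo, hi in migration_chunks(start, end, step):
        yield (
            "INSERT INTO new_table SELECT * FROM old_table "
            f"WHERE created_at >= '{lo:%Y-%m-%d %H:%M:%S}' "
            f"AND created_at < '{hi:%Y-%m-%d %H:%M:%S}'"
        )
```

Running the statements one at a time and recording the last completed window gives you a natural resume point: on failure, you know exactly which chunk to investigate and where to pick the migration back up.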



This makes a ton of sense! We've faced the exact same problem: whenever we need to add a new materialized table or column, we need to backfill data into it.

For materialized columns, it's "easy" (not really, because you still need to monitor the backfill): we can run something like `ALTER TABLE events UPDATE materialized_column = materialized_column WHERE 1`. Depending on the materialization, that can drive up the load on ClickHouse, since it creates a lot of background mutations that we need to monitor; they can fail for all sorts of reasons (memory-limit or disk-space errors), in which case we need to jump in and fix things by hand.
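One way to watch those background mutations is to poll ClickHouse's `system.mutations` table and pick out the ones that have recorded a failure. A small sketch of that check (the `events` table name is a placeholder, and the polling/alerting around it is left out):

```python
# Query for mutations on the table that haven't finished yet;
# latest_fail_reason is populated when a mutation has failed at least once.
PENDING_MUTATIONS_SQL = """
    SELECT mutation_id, command, parts_to_do, latest_fail_reason
    FROM system.mutations
    WHERE table = 'events' AND NOT is_done
"""

def failing_mutations(rows):
    """Given rows from system.mutations (as dicts), return the ones
    that have recorded a failure, e.g. a memory-limit or disk error."""
    return [r for r in rows if r.get("latest_fail_reason")]
```

A failed mutation stays stuck until you fix the underlying cause (or `KILL MUTATION` it), so surfacing `latest_fail_reason` early is what lets you jump in before the backfill silently stalls.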

For materialized tables, it's a bit harder: we need to write custom scripts that load data in day-wise (or month-wise, depending on the data size) chunks, like you mentioned, because a plain `INSERT INTO table SELECT * FROM another_table` will almost certainly run into memory-limit errors on a large table.
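The day-wise variant of those scripts boils down to generating one bounded `INSERT ... SELECT` per day. A sketch, assuming the source table has a `timestamp` column (the column and table names are placeholders):

```python
from datetime import date, timedelta

def daywise_backfill(start: date, end: date,
                     src="another_table", dst="table"):
    """Yield one ClickHouse INSERT ... SELECT per day in [start, end),
    so each statement stays small enough to avoid memory limits and
    can be retried on its own if it fails."""
    day = start
    while day < end:
        yield (
            f"INSERT INTO {dst} SELECT * FROM {src} "
            f"WHERE toDate(timestamp) = '{day.isoformat()}'"
        )
        day += timedelta(days=1)
```

Switching the step to a month (or narrowing to hours) is just a matter of how the window is advanced; the important part is that each statement is independently retryable.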

I would love to think more about this to see if it's something that would make sense to handle within Houseplant.


I've opened an issue to track this: https://github.com/juneHQ/houseplant/issues/30

Would love to chat more here or there if you're keen!



