This doesn't scale well. It is a perfectly fine approach for smaller systems wit...

saltcured · on Aug 24, 2023

Yeah, there's a lot of hidden magic/assumptions in having a "writable snapshot of a specific version" of production data. For a complex system that has more than one stateful store, this is no small feat.

The dev/staging sandboxes are essentially the pragmatic hack to create these snapshots. Ugly sacrifices are made to construct the writable snapshot across disparate pieces. It becomes a headache when there is too much contention to use these sandboxes, or too much manual effort to reset them to a desired testing state. Also, if the sandbox copy-on-write mechanism differs too much, you end up changing the test environment so much that you are no longer emulating how it will behave in production. So the old-school approach is a replica of the full environment on redundant hardware matching the same characteristics as production.

But before I read the linked article, I was expecting a different anti-pattern to be discussed: where people forget that the dev/staging processes are for software testing, to prepare for when you deploy high-quality software to production. They are not for data preparation. Your deployment eventually needs to combine new software with the existing production data, and not depend on accumulated state of the sandbox data. I've seen people twist themselves into pretzels conflating software and data, and trying to somehow move data from the sandbox into production in a misguided "upgrade".

Software flows from the developer, through the sandbox(es), to eventually be in production use by users. Data flows in the opposite direction, from production users into snapshots loaded into sandboxes, and eventually into developers' hands with their experimental code. Ignoring, of course, situations where developers are not authorized to see real user data...

ryan_green · on Aug 25, 2023

saltcured, I find these comments super insightful!

> Yeah, there's a lot of hidden magic/assumptions in having a "writable snapshot of a specific version" of production data.

That's absolutely a huge assumption. This technology has been a game changer for us: https://lakefs.io/

> It becomes a headache when there is too much contention to use these sandboxes, or too much manual effort to reset them to a desired testing state.

This is exactly the situation we were encountering.

ryan_green · on Aug 25, 2023

NBJack, your point about the difficulty of managing external dependencies is well taken. That said, our data pipeline uses cloud storage and multiple external services, and the scenarios you're describing haven't materialized so far. We have found that we need to take extreme care in managing the logical state of the data pipeline (e.g. ensuring that we use explicit versions of external services). And we can certainly end up in trouble if an external service provider violates its API contract. So I don't think this is a replacement for a strong data testing regimen, which is what would hopefully help us if this occurred. I also think you can encounter these same issues if you go the dev/stage/prod route. Curious to get your thoughts.
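
As a rough illustration of the "explicit versions" point, a pipeline could refuse to run when any external service is referenced without a pinned version. This is only a sketch: the service names, version strings, and the set of "floating" tags below are all hypothetical, not taken from our actual pipeline.

```python
# Hypothetical dependency map: names and versions are illustrative only.
PIPELINE_DEPENDENCIES = {
    "geocoder": "v2.3.1",
    "currency-rates": "2023-08-01",
    "enrichment-api": "latest",  # floating reference, not reproducible
}

# Assumed set of tags that do not identify one specific version.
FLOATING_TAGS = {"", "latest", "stable", "current"}


def unpinned_services(dependencies):
    """Return the sorted names of services whose version is missing or floating."""
    return sorted(
        name
        for name, version in dependencies.items()
        if version is None or version.strip().lower() in FLOATING_TAGS
    )


if __name__ == "__main__":
    offenders = unpinned_services(PIPELINE_DEPENDENCIES)
    if offenders:
        print("Unpinned external services:", ", ".join(offenders))
```

A check like this only guards the version reference itself; it does nothing if a provider changes behavior behind a pinned version, which is where the data tests come in.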

NBJack · on Aug 26, 2023

Believe me, I'm not claiming that the dev/stage/prod pattern is any kind of cure-all. It has its own problems, which are probably too numerous for a late-night post.

From what you've described, you're doing the right thing for you and your team. Keep it simple as long as you possibly can. I can only advise you to just keep the goal of balancing the time needed to maintain your approach vs. the return you get from it.

The key advantage of the dev/stage/prod approach only shows up at sufficient scale and with proper discipline among teams, each maintaining their own version of their product at the dev and stage points. This has plenty of headaches, but you're at least getting a chance to exercise your work in something that will be as close to production as possible without actually being there. It tends to work 'best' when you only start holding other teams accountable at the stage point.

Cloud dependencies are where I've seen things get the weirdest and most volatile. There are all kinds of limitations that can crop up even if you try to maintain the highest level of separation and discipline.

For example, did you know that AWS by default limits an account to 5 Elastic IP addresses per region, and that there's an upper limit on how many Elastic Network Interfaces can be held in a region? [1] It sounds stupid, but I've actually seen these limits hit even after politely asking AWS to make them as large as possible; keeping developers empowered to deploy their own compartmentalized version of the product became a real pain.

[1] https://docs.aws.amazon.com/vpc/latest/userguide/amazon-vpc-...
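
To make the quota problem concrete, here is a back-of-the-envelope sketch of how many per-developer environments fit under a set of regional limits. Only the figure of 5 Elastic IPs comes from the comment above; the network interface quota, the per-environment usage, and the reserved amounts are made-up numbers for illustration, and the function names are hypothetical.

```python
def max_environments(quota, per_env_usage, reserved_for_shared=0):
    """How many isolated environments fit under one regional quota."""
    if per_env_usage <= 0:
        raise ValueError("per_env_usage must be positive")
    available = max(quota - reserved_for_shared, 0)
    return available // per_env_usage


def binding_limit(quotas, per_env_usage, reserved=None):
    """Return (resource, environment_count) for the quota that runs out first."""
    reserved = reserved or {}
    counts = {
        resource: max_environments(quotas[resource], usage, reserved.get(resource, 0))
        for resource, usage in per_env_usage.items()
    }
    resource = min(counts, key=counts.get)
    return resource, counts[resource]


if __name__ == "__main__":
    # Illustrative numbers only; check the real quotas for your account.
    quotas = {"elastic_ips": 5, "network_interfaces": 5000}
    per_env = {"elastic_ips": 2, "network_interfaces": 40}
    reserved = {"elastic_ips": 1}  # e.g. one address held by shared tooling
    print(binding_limit(quotas, per_env, reserved))
```

With these assumed numbers the Elastic IP quota is the binding one, capping the account at two full environments per region long before network interfaces become a concern.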