Rules of schema growth (2017)

d3ckard · on Oct 31, 2023

Going to take the risk and politely say I do not agree with this article at all.

Alternative advice: never allow more than one app to share the db and expose data through APIs, not queries. Then you can actually remove cruft and solve compatibility through API versioning that you probably need to do anyway. Also, never maintain more than two versions at the time.

ako · on Nov 1, 2023

There are databases (and data warehouses) that need to provide random query api’s, e.g, for reporting, analytics, etc. Databases can be considered to provide an api layer themselves, way more flexible than most standards APIs provide. Graphql and odata go a long way to solve this. Also, database views can provide a data api layer for a database. Views provide a stable datamodel, that allow the underlying tables to be changed.

contravariant · on Nov 1, 2023

I'm inclined to agree, but just to provide a counterpoint: if the data is modelled properly then there are very few reporting queries that actually make sense. Except of course if you want people to be able to change definitions on the fly, but I'd argue that is an anti-feature when it comes to reporting.

hosh · on Nov 1, 2023

Data gets messy, because the real world is messy. Once you start ingesting from other sources, it gets very messy. There is a reason ideas like the "data mesh" or semantic web (distributed schemas) were attempted.

contravariant · on Nov 1, 2023

Sure, but that complexity is something you do not want to deal with in your reporting API layer. A reporting API should report correct figures, you don't want to put the responsibility for ensuring the numbers add up on the person requesting the report.

hosh · on Nov 1, 2023

Beware the fallacy of legibility.

https://www.ribbonfarm.com/2010/07/26/a-big-little-idea-call...

contravariant · on Nov 1, 2023

Surely communication is the one area where it's fairly safe to apply the concept of legibility?

hosh · on Nov 1, 2023

It's already bad enough that people, by default, are not aware of the frame of reference in which they see the world, let alone communicate with.

As just a very simple example, unless I am explicitly clear about speaking from a Living Systems world view and a regenerative paradigm, when I talk about permaculture design here in this forum, I end up talking past the vast majority of people here on HN.

But even staying with the same shared technological world views and paradigms -- the people across the Three Tribes of Programmers (https://josephg.com/blog/3-tribes/) will get into flamewars because the frame differs in distinct ways.

Because a schema makes distinctions on information in a particular way, it will always encode a frame in which to view and understand that information.

contravariant · on Nov 1, 2023

This is turning into a rather interesting discussion, but surely it would be better to decide on a schema with each other (or perhaps multiple schemas if absolutely necessary), as opposed to just giving people the ability to query whatever they want and present the results?

If the query itself cannot be explained simply then what does its output even communicate? The more freedom you put in the API the more room you leave for confusion, this might be appropriate for research but not when you're reporting on something.

hosh · on Nov 2, 2023

I agree that you have to make decisions somehow. My points were directed at the phrase “properly modeled”, rather than “very few queries that make sense”.

I see the purpose of reporting within an organization is to allow someone, somewhere to evaluate things and make decisions.

You can have an agreed upon report, but it doesn’t mean that the report itself will always lead to wise decisions. It can also become worse when those very decisions lead to forcing things around you to make things easier to report — that is the core thesis described in that essay, and the book, “Seeing Like a State”

Another example — in the realm of strategic decision making, decisions will always be made with imperfect information. A core way of strategy is deliberately manipulating how the opposing force gathers and interprets information, this influencing their actions. (Example, OODA). So what is already messy gets incredibly messy. This is the kind of stuff that falls way beyond reporting, and not something I think can ever be adequately modeled.

refset · on Nov 1, 2023

Views seem like the ideal solution in theory, but SQL implementations of views are often problematic in practice due to planning complexity & overhead. In theory they should also be a good mechanism for handling writes (see "updateable views" / "writable views") but the list of caveats is long and many developers are understandably nervous about pushing lots of logic into TRIGGERs.

oconnore · on Nov 1, 2023

This reminds me of the microservices trend where the main justification is modularity —- which is of course perfectly possible to implement in a programming language using standard language constructs.

Similarly: any separation of concerns you can implement with APIs and multiple databases you can also implement with a schema. The difference being you have to reimplement a bunch of capabilities that are baked into an rdbms (and will probably never correctly implement something like a hash join).

mpweiher · on Nov 1, 2023

It should be possible to implement modularity in a programming language using standard language constructs.

In practice, the programming language and software engineering communities have largely failed to provide usable modularity. IMNSHO due to our clinging to call/return (so procedural/functional/method-oriented) as our modularity mechanism.

It ain't working.

So we the OS/systems guys and gals need to bail us out. Process boundaries are pretty hard, though of course we then manage to build distributed monoliths.

One thing that's interesting is that µservices, if actually REST-based, use data as the modularity mechanism, rather than procedures.

"Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowchart; it'll be obvious." -- Fred Brooks, The Mythical Man Month (1975)

javcasas · on Nov 1, 2023

So because we have bad programmers that don't use the stuff provided by the programming language, we are going to throw away the baby with the bath water.

And, somehow, this is going to fix the fact that this happened because we have bad programmers. Bad programmers that now have to also deal with network complexity on top of the basic complexity they already can't deal with.

mpweiher · on Nov 1, 2023

> bad programmers that don't use the stuff provided by the programming language

Er, no. We actually don't have the right stuff provided by the programming language.

And no, µservices are not a good solution. They are a bad solution to a real problem that programming languages do not solve.

javcasas · on Nov 1, 2023

Ok, in your opinion, what would it look like the right stuff as provided in a programming language?

Microservices are a good solution for scaling a company into multiple sub-sections, I/E a solution for an org chart problem. If your org chart has less than 4 levels, most likely you are doing microservices too soon. At that point, you can have people in charge of dealing with network interactions exclusively, and that allows your bad programmers to still be productive.

mpweiher · on Nov 3, 2023

It is not necessary to be able to solve a problem in order to identify it.

That said, what I think what is needed is general support for architectural connectors in programming languages. Our so-called "general purpose programming languages" (actually: domain specific languages for the domain of algorithms, see ALGOL) effectively support one: the procedure call.

https://insights.sei.cmu.edu/library/procedure-calls-are-the...

See also:

https://blog.metaobject.com/2019/02/why-architecture-oriente...

and:

https://2020.programming-conference.org/details/salon-2020-p...

candiddevmike · on Oct 31, 2023

It's so simple! If you never delete or change things in your schema, you never have to worry about changing it.

The article is pretty devoid of actionable advice.

plandis · on Oct 31, 2023

The article lists fairly sensible rules for backwards compatibility and growth in my opinion.

candiddevmike · on Oct 31, 2023

The central thesis revolves around continuous growth with no advice given for removal/cleanup. This is not a sound strategy for a database schema, at least for the SQL side. Column bloat, trigger bloat, index bloat... Schemas cannot continuously grow, there needs to be DROPs along the way.

slowmovintarget · on Oct 31, 2023

Datomic is not a SQL database. You can simulate tabular data, but it is a very different beast. Datomic schemas are always EAVT. If you add an attribute, or deprecate one, there may still be values tagged to that attribute. That should be fine.

That something as simple as change--a universal condition of all systems--defeats so many schema designs in SQL databases should suggest that there's something wrong with the SQL databases.

cakoose · on Nov 1, 2023

But the article claims it applies to SQL databases as well.

> Are these rules specific to a particular database? > > No. These rules apply to almost any SQL or NoSQL database. The rules even apply to the so-called "schemaless" databases.

plandis · on Oct 31, 2023

Yes you eventually need to do the things you mention but probably less frequently than a normal application needs to add new columns or the like to support new use cases.

The article thesis is essentially make breaking changes as infrequently as possible. The easiest way to do that is never change your data but that’s a sure way to have your competitors crush you as you stagnate. The next best thing you can do is make sure existing producers and consumers are not impacted when you make changes. For most changes being made the advice in the article gives a set of things you can do to achieve this goal.

For times where your database itself is not scaling which are the types of things you’re mentioning, I think there are other things you can do to, if not eliminate backwards incompatibility, at least make the transition easier. For example fronting your DB via an API and gate all producers/consumers through that. If you’re frequently having to handle scaling issues perhaps it’s time to reevaluate your system design all together.

hyperpape · on Oct 31, 2023

From the article:

    never break it

    Never remove a name

    Never reuse a name

Your point is a very reasonable statement, but you are really disrespecting the author by putting a reasonable statement in their mouth. They had every chance to say the reasonable thing, and they clearly made a choice to say the unreasonable thing. Respect that decision (and tell them that they're wrong).

butlerm · on Nov 1, 2023

Never removing a name is relatively straightforward with views and aliases. Never breaking it is somewhat harder, but is still possible for well designed tables - especially if you use updateable views. Never reusing a name is also straightforward. It does require planning your schema evolution years if not decades in advance of course.

slowmovintarget · on Oct 31, 2023

This advice is rock solid... for Datomic. Not for Postgres or other traditional relational DBs.

ako · on Nov 1, 2023

Isn’t this the same versioning approach that graphql proposes?

n0w · on Oct 31, 2023

You've got more than one app sharing a db when you deploy a new version. Unless you're happy with downtime during deploys as the cost of not having to manage how your schema evolves.

These kinds of best practices make sense regardless of how many apps access a db.

Following the advice doesn't also prevent you from enforcing a strict contract for external access and modification of the data.

cogman10 · on Oct 31, 2023

> You've got more than one app sharing a db when you deploy a new version. Unless you're happy with downtime during deploys as the cost of not having to manage how your schema evolves.

2 deploys is all it takes to solve this problem.

    * 1 to deploy the new schema for the new version.
    * 1 to remove the old schema.

This sort of "tick tock" pattern for removing stuff is common sense. Be it a database or a rest API, the first step is to grow with a new one and the second is to kill the old one which allows destructive schema actions without downtime.

koreth1 · on Oct 31, 2023

2 deploys isn't enough for robustness. It depends on what the change is, but the full sequence is often more like

* Add the new schema

* Write to both the new and old schemas, keep reading from the old one (can be combined with the previous step if you're using something like Flyway)

* Backfill the new schema; if there are conflicts, prefer the data from the old schema

* Keep writing to both schemas, but switch to reading from the new one (can often be combined with the previous step)

* Stop writing to the old schema

* Remove the old schema

Leave out any one of those steps and you can hit situations where it's possible to lose data that's written while the new code is rolling out. Though again, it depends on the change; if you're, say, dropping a column that no client ever reads or writes, obviously it gets simpler.

reissbaker · on Oct 31, 2023

Yup, it depends on the change. Sometimes two deploys is enough — e.g. making a non-nullable column nullable — and sometimes you need a more involved process (e.g. backfilling).

Nonetheless, I agree with the OP that the article's advice is pretty bad. If you ensure that multiple apps/services aren't sharing the same DB tables, refactoring your schema to better support business needs or reduce tech debt is

a. tractable, and

b. good.

The rules from the article make sense if you have a bunch of different apps and services sharing a database + schema, especially if the apps/services are maintained by different teams. But... you really just shouldn't put yourself in that situation in the first place. Share data via APIs, not by direct access to the same tables.

sokoloff · on Nov 1, 2023

Why wouldn't one deploy be enough to convert a non-nullable column to nullable? Going the other way takes two deploys I can see, but this way seems like is entirely backwards compatible.

sfn42 · on Nov 1, 2023

If apps are using the column they might need code changes to handle the column suddenly having nulls. So you would have to change and redeploy the code, then make the column nullable.

sokoloff · on Nov 1, 2023

Thank you! (I should have been able to get there on my own but, for whatever reason, I obviously didn't.)

skeeter2020 · on Nov 1, 2023

I agree that the article is pretty bad, but it's not like "make this the API team's problem" is really an answer. API versioning is probably tougher than database schema versioning IME.

sheepscreek · on Nov 1, 2023

Aside from DBs, there are many other communication tools that use a schema. I can think of at least two: Kafka and serialization libraries like Protobuf and Thrift.

kahirsch · on Nov 1, 2023

> never allow more than one app to share the db

I'm a little stunned by this suggestion. I've worked in quite a few different context for application systems, e.g. retail, manufacturing of fiber optic cable, manufacturing of telecommunications equipment, laboratory information management, etc.

I wouldn't even know what you mean by "app" in this context. There may a dozen or more classes of users who collectively have hundreds, even thousands, of different types of interaction with the system.

Sometimes there were natural divisions where you could separate things into a separate database. For example, the keep/dispose system for laboratory specimens, which tracked which specimens needed to be kept for possible further testing, where, and for how long. But most problem domains were not like that.

And sometimes we had to interact with other systems because they were for a separate division (because of mergers and acquisitions). But those kinds of separations made for more limited functionality and more difficulty in managing change, not less.

IceDane · on Nov 1, 2023

Most of this advice is this bad because it's entirely based on the idea that you interact with your database using an extremely dynamic language: clojure. Like most of Hickey's advice, it's terrible and founded in dogma.

throdor23 · on Nov 1, 2023

What makes it easier to follow the advice in the article in a dynamic language?

Swizec · on Oct 31, 2023

> 5. Never remove a name. > Removing a named schema component at any level is a breaking change for programs that depend on that name. Never remove a name.

I agree with this in theory and have seen it go oh so very wrong in practice. Tables with dozens of columns, some of which may be unusued, invalid, actively deceiving, or at the very least confusing. Then a new developer joins and goes "A-ha! This is the way to get my data." ... except it's not and now their query is lying to users, analysts, leadership, anyone who thinks they're looking at the right data but isn't.

You absolutely have to make time to deprecate and remove parts of the schema that are no longer valid. Even if it means breaking a few eggs (hopefully during a thorough test run or phased rollout)

BWStearns · on Oct 31, 2023

This x100. The most miserable and frustrating periods of my career have been in places that never deprecated anything. You could spend hours doing something that looked quite sensible, get a working draft that seemed to work, and then be told "oh yeah, that's deprecated, that data isn't even populated anymore, those rows just _happened_ to have data in dev." Then you either start sanity checking everything before doing anything and your velocity sucks, or just keep stepping on landmines and losing whole afternoons.

Edited to add: docs can help, but only so much. Environments that cluttered also tend to have layers of docs that are equally misleading.

butlerm · on Nov 1, 2023

There are few things more important than comprehensive and up to date database documentation. Otherwise you don't even know what your data means. An organization that cannot produce documentation like that is somewhere between amateurish and waiting for a disaster to happen, unfortunately.

BWStearns · on Nov 1, 2023

I don’t really know how to screen for that before joining a company but I’d say 20% of companies seem to be at that point.

__jem · on Oct 31, 2023

Reclaiming the physical storage of an unused column is often a costly and sometimes impossible operation, which is why many legacy applications end up with the equivalent of my_column_final_final_v2. Database administration requires compromises like this sometimes in the name of uptime and data integrity. Big migrations are always inherently a little risky, and from the view of many DBAs, why even risk it just for a bit of clean up? Your schema shouldn't be totally transparent to your application's business logic anyway, so there are better places to enforce naming hygiene.

butlerm · on Nov 1, 2023

I believe in most relational databases you can just alter a column to allow null values and run a series of transactions in the background to set that column value to null, and that will quite effectively free up most of the physical overhead of the column in question. I would be reluctant to delete, rename, or even clear all the data out of a column without providing an alias though.

tootie · on Nov 1, 2023

Yeah this how you grow to the point of destruction. Your schema is half noise and nobody understands it. Then someone says you need to start from scratch.

xarope · on Nov 1, 2023

worse, I've seen supposedly unused columns be used for some other purpose, and then existing analytics fall apart.

dekhn · on Oct 31, 2023

Hyrum's law in action.

tkiolp4 · on Oct 31, 2023

But then why not addressing the real problem? If a table has a few columns which are not used or invalid or deceiving, why did we let developers introduce them? Lack of planning? Lack of peer review? Lack of talent?

I understand these “ten rules” as: as long as you have a decent codebase and decent engineers, these ten rules will make your life easier.

These rules are nothing if you are dealing with crap codebases (they can help, sure, but they will be just patches)

striking · on Oct 31, 2023

Because sometimes you make assumptions that are seemingly correct but eventually found to be wrong or based on flawed inputs from sources beyond your control.

Any system that ultimately relies on "engineers need to always do the right thing" is a flawed, brittle, ineffectual system. Because even the best engineers will make a mistake somewhere, and because you can't exclusively hire "the best" engineers.

Let's spend our time figuring out how to recover from mistakes rather than trying to pretend they'll never happen.

__jem · on Oct 31, 2023

I've worked with some databases that are 20+ years old and have outlived multiple application iterations. There's always going to be cruft in this kind of situation, it just comes with territory of supporting applications with real production users for a long time.

phtrivier · on Oct 31, 2023

Even the best team make design decision that turns out suboptimal when the requirement changes.

Also, even the best team will sometimes make mistake.

Db schemas are unforgiving.

yxhuvud · on Oct 31, 2023

Requirements change over time. Domain understanding change over tim. Business change over time. Environments change over time. Unless you are a seer with perfect precognition, most of what you have done will be invalidated over time.

Hence: make your code and data easy to change, but simple, as you cannot predict in what way it will change.

Swizec · on Oct 31, 2023

> Unless you are a seer with perfect precognition

Even then ain't nobody in a 10 person seed-stage startup got time, resources, or need to build the database you'll want to have when you're a 600 person Series C monster.

butlerm · on Nov 1, 2023

Developers without special training should generally not do database design for the sort of databases that are intended to last decades. It is a similar task to developing a complex file format that is usable twenty years later - not something to be done off the cuff, and if you want schema stability database design requires more care than most file formats.

sgarland · on Nov 1, 2023

While I agree with you, unfortunately this is unrealistic. Unless a startup happens to have someone skilled with schema design, they’re going to make do with what they can, and it’s very unlikely that they’d waste headcount on a dedicated DBA / DBRE at a young stage.

The immediate effect of that, of course, is that they also won’t try to hire any such person until the DB is a problem they can’t scale via throwing money at it.

pphysch · on Oct 31, 2023

> Data outlives code, and a valuable database supports many applications over time.

Probably an unpopular opinion, but I think having a central database that directly interfaces with multiple applications is an enormous source of technical debt and other risks, and unnecessary for most organizations. Read-only users are fine for exploratory/analytical stuff, but multiple independent writers/cooks is a recipe for disaster.

I prefer an architecture where the central "database" is a central, monolithic Django/Rails/NodeJS/Spring app that totally owns the actual database, and if someone needs access to the data, you whip up an HTTPS API for them.

Yes, it is a tiny bit of effort to "whip up an API" but it deals with so many of the footguns implied by this article. "I need X+Y tables formatted as Z JSON" is a 5 minute dev task in a modern framework.

gwn7 · on Oct 31, 2023

> I think having a central database that directly interfaces with multiple applications is an enormous source of technical debt and other risks, and unnecessary for most organizations.

I think that the operative word here is "over time". So what is meant is not necessarily supporting many applications at the same time, but rather serially.

So the message is supposed to be: Apps come and go as they can be rewritten for so many reasons, but there will be a lot less reasons to redesign / replace a "valuable" database.

timeagain · on Oct 31, 2023

Agreed. When a company outgrows its monolith this component you describe is normally the hardest part to figure out. All those sneaky little joins and stored procedures and undocumented use cases come pouring out. If your data layer has an API from the start it makes everyone act honestly about data access and provides more accountability. An API layer also allows for more options when dealing with caching, noisy neighbors, retention, consistency, and security.

iamwil · on Nov 1, 2023

what is noisy neighbors, and what's the issue with consistency?

bigger_cheese · on Nov 1, 2023

>Probably an unpopular opinion, but I think having a central database that directly interfaces with multiple applications is an enormous source of technical debt and other risks, and unnecessary for most organizations. Read-only users are fine for exploratory/analytical stuff, but multiple independent writers/cooks is a recipe for disaster.

In my org I've felt the pain of having centralized DBs (with many writers and many readers) a lot of our woes come because of legacy debt some of these databases are quite old - a number date back to the mid 90's so over time they've ballooned considerably.

The Architecture I've found which makes things less painful is to transition the the centralized database into two databases.

On Database A you keep the legacy schemas etc and restrict access only to the DB writers (in our case we have A2A messaging queues as well as some compiled binaries which directly write to the DB). Then you have data replicated from database A into database B. Database B is where the data consumers (BI tools, reporting, etc) interface with the data.

You can exercise greater control over the schema on B which is exposed to the data consumers without needing to mass recompile binaries which can continue writing to Database A.

I'm not sure how "proper" DBAs feel about this split but it works for my usecase and has helped control ballooning legacy databases somewhat.

rawgabbit · on Nov 1, 2023

What you did is CQRS architect ure pattern and is the recommended path. It separates command writes from query reads.

bigger_cheese · on Nov 1, 2023

Thankyou - good to know there is a name for the pattern I'm reading more about it now.

wrs · on Oct 31, 2023

In that architecture, the central API effectively becomes “the database”, because apps depend on it just as much. Most of the same rules would apply to the central API’s schema.

cturner · on Nov 1, 2023

You equate the central API to the database but they are not equivalent.

The issue is not how much apps depend on them, but maintenance options: in a central-database model, refactoring is high-friction. Whereas an API model can be built so that refactoring is low-friction.

When we talk about composition, we distinguish between data structures that are private to their codebase, and data structures that have a contract between codebases.

In the shared database model, everything is shared, so any change can affect all stakeholders.

The API model respects composition. This allows you to make changes behind the perimeter without the permission of stakeholders. If you want to make a major change to internal data structures, you can retain the old API, offer a new API, and then grandfather apps from the old endpoints to the new endpoints.

wrs · on Nov 1, 2023

I didn’t mean they’re exactly equivalent, but they share the fundamental concern that client code relies on the data structures being passed. In the case of an API, it’s the request parameters and response body — though in a GraphQL API it would look even closer to a DB query.

So, many of these rules still apply: You should only grow an existing API, not shrink it. You shouldn’t rename things in an API. You shouldn’t use the same name with different meanings except in different namespaces/endpoints. Etc.

pphysch · on Oct 31, 2023

Conceptually, yes they are the same unit. Which is part of the beauty of it; it's not a radical paradigm shift like monolith->microservices or systemd->k8s. But the developer experience/productivity of bending Django or Rails in different ways is far superior to bending e.g. MySQL or PostgreSQL*. HTTPS is plainly a much better, and more widely available, integration protocol than N different SQL dialects.

*much respect to what the Crunchy Data folks have accomplished

sgarland · on Nov 1, 2023

There are pros and cons to the abstraction that an ORM brings. Assuming it’s decent (like Django), you can be guaranteed that the queries it generates are good enough. They’re not always perfect, and sometimes need help depending on how convoluted your schema and desired query are, but they’re decent.

The downside is this can cause devs to never have to think in terms of what the DB is capable of, so they may not consider writing the code such that the ORM decides to write the query differently. I’ve specifically seen a dearth of semijoins in output, even though they’re a common construct in RDBMS, and often far faster for the desired end goal. Sometimes the SQL planner itself will produce them, of course, but that’s not a guarantee either.

plandis · on Oct 31, 2023

In my experience, it’s less risk to take the advice in the article because even if you only have a monolith as the sole producer/consumer you could still mess up and make a backwards incompatible change. Humans are fallible, so try to minimize human error.

One way to minimize the human error is by only extending the schema rather than changing it and forcing your monolith to correctly make changes to existing queries.

I’m not saying adding an API is bad, because it’s not. I just think it’s solving a different set of problems.

vb-8448 · on Oct 31, 2023

> is an enormous source of technical debt

I totally agree with you, but I think in the real world (mostly in monolithic apps, microservices shouldn't be affected) at some point someone will try to access directly the database. There are several reasons for doing this: API are too slow, it's simpler and more immediate writing some SQL vs a http client, the team responsible for APIs it's no more around and similar.

butlerm · on Nov 1, 2023

Every non-trivial API has a data model, and API stability requires that it be as carefully designed as any database, unless you are going to start rolling incompatible API versions off the photocopier, which is no small task.

The entire idea of an unanticipated JOIN is beyond the ken of most APIs as well, unfortunately. For an external API that may not be much of a problem, but for an internal one you might end up creating a new de facto schema with a new query language.

koito17 · on Oct 31, 2023

Interesting aside on rules 7 and 8: many Clojure users (myself included) did not often use namespaced keywords until clojure.spec was released, and in clojure.spec, you absolutely need namespacing since specs reside in a global registry. Though in the case of clojure.spec, we typically use the same namespace as our current ns, and there is syntax sugar for this. e.g.

  ::foo

resolves to

  (keyword *ns* "foo")

In Datomic, namespaces tend to represent your application's models, like :person/date-of-birth.

I find it very useful mostly for human readability, it offers a way to distinguish what exactly :name refers to in your application's model. It also helps with editor autocomplete since you can type a namespace and see all keys up front, no need to consult a keys spec itself (or the schema of your database in Datomic). And when in doubt, in Datomic, you can always pull, and it is not too hard to run a query that extracts all attributes that exist in your database (this is actually an exercise in Learn Datalog Today[1], highly recommend going through this tutorial yourself if you want to play with databases like Datomic or XTDB).

[1] Exercise 2 in https://www.learndatalogtoday.org/chapter/4

dekhn · on Oct 31, 2023

Many of these rules are things that only became obvious after I worked on an extremely large production database where downtime was counted in millions of dollars an hour, we didn't know who all our users were, some unknown users were load-bearing, and we had a bunch of postmortems describing historical incidents.

I don't think the article mentions it, but one other technique we used- which still is a big question mark to me- is that we "trickled" changes to the database. Instead of changing all the rows in a single big transaction, the change was broken into thousands of little changes that were rolled out over a series of days. The reason for this is that if there is an unexpected problem, you have more time to stop the change and mitigate the damage.

cdchn · on Nov 1, 2023

>Instead of changing all the rows in a single big transaction, the change was broken into thousands of little changes that were rolled out over a series of days.

Having been in a situation where a large schema update required migrating data in a transaction, which then broke for some data, but the data was so big that rolling back the transaction caused the database to become totally hosed, I can see this technique being very attractive.

smfjaw · on Nov 1, 2023

I really really like partitioned data formats like Parquet/Delta etc that allow this for this reason. Often do stuff like this in our data warehouse but hadn't thought of 'psuedo partitioning' the data like this on an operational DB

codethief · on Oct 31, 2023

> the change was broken into thousands of little changes

Could you elaborate on how this worked?

dekhn · on Nov 1, 2023

Literally, there were many thousands of "UPDATE WHERE" each targeting a different row, and the file was sharded, and then some process would read through a shard and apply statements one at a time with a delay. It would export stats (# of updates, # of errors) and could be easily paused/resumed.

We called it trickling, see https://www.vertica.com/docs/9.3.x/HTML/Content/Authoring/Ad...

djbusby · on Nov 1, 2023

Not OP but here what we have is some routine to get a fact from the DB (eg Customer->getOrders()) and the routine is schema aware. On v0 it queries a table with one rules, on v1 it's a different slightly different query. So, one small schema mod to eg add a column. Roll out. Another schema mod to populate, possibly a few days later, then roll out queries over a few days, then, later slowly remove the dead parts of the schema and then later drop the legacy code using the old columns. The process can take weeks. And some throwaway code (scaffolding I say) has to pass through the system. I recently was able to clean up and old view that took months to process through (we're a small team)

One recent one was moving from bitflags in an int column to flags in a jsonb column. Tedious.

What makes it work? Testing, testing, testing and management that gives time for that process.

exabrial · on Nov 1, 2023

Postulate #1: your most valuable resource is strong types, well defined business vocabulary, and the follow on generations of developers understanding these concepts

Postulate #2: once in production, a schema is likely to outlive you. Spend an enormous amount of time minimizing its scope and making sure it’s correct.

Postulate #3: you data has a schema no matter what the mongodb users try to say otherwise.

klysm · on Nov 1, 2023

I fight until and past exhaustion for a high quality and normalized schema. A lot of devs I work with don’t seem to like it but I still believe it’s worth it

bob1029 · on Oct 31, 2023

In cloud native arrangements, we are looking at the database as the ultimate integration tool. These schema are not owned by any specific service. There is a totally separate universe responsible for how all of that works outside code. Allowing code to drive schema (and permitting the implication that this is the best way) is the biggest frustration I have with ORMs & their higher-order consequences.

In my view, if the schema must change so radically that traditional migrations and other 'grow-only' techniques fall apart, you are probably looking at a properly-dead canary and in need of evacuating the entire coal mine.

The Quote regarding flowcharts and tables applies here - if you radically alter the foundation, everything built upon it absolutely must adapt. Every flowchart into the trashcan instantly. Don't even think about it. They're as good as a ball & chain now. Allowing parts of the structure to dictate parts of the foundation is where we find ourselves with circular firing squads.

Take things to the extreme - There is a reason you will start to find roles like "Schema Owner" in large, legacy org charts. These people cannot see the code or they will become tainted. They only have one class of allegiance - LOB owners. These are who they engage to develop & refine schema over time. The schema owner themselves has a full time job that is entirely dedicated to minimizing the impact of change over time to the org. This person is ideally the most ancient wizard in the org chart and has the prior Fred Brooks quote framed on their wall.

You can make schema change a top-down event that touches the entire organization. This happens quite often in banking when the central system is completely swapped for a different vendor & tech stack. Most of a bank is just a SQL database, but every vendor has a different schema that has to be adapted to. This is known as "core conversion" in the industry and is one of the more hellish experiences I have ever seen. If a bank with 4 decades of digital records can pull something like this off with regularity, there aren't many excuses that remain for a hole-in-the-wall SaaS app with 6 months of customer data.

skybrian · on Nov 1, 2023

Compare with the usual rules for protobuf evolution:

All fields are optional. This is because fields have a lifecycle. Your code might be reading data generated by some other code written before the field was defined, or generated by some other code written after the field became obsolete.

With protobufs, you can never reuse a field number. However, you can remove a field (keeping its number reserved), which effectively means ignoring it in new applications. It's neither read nor written, which is fine because see above.

You can also rename the field itself. This doesn't affect the wire protocol.

It seems more sensible than having misleading names around for people to trip over. Maybe we'd be better off if databases used field numbers to refer to columns in the wire protocol?

sundbry · on Nov 1, 2023

This is one nice property of Apache Iceberg (a data lake format, built on top of Parquet). The schemas in the data are evolvable, and require both a name and a field ID to identify the field. The names are changeable, but the ID's are immutable, so you can remove a field, and add a new field by the same name, having a different type.

mvdtnz · on Nov 1, 2023

Wow these are shockingly bad rules. This is how we end up with DBs with hundreds of tables and duplicate columns like `name` and `name2`.

My number one rule of growing schemas is to design your schemas and applications with a good custom field system. Some kind of flexible way of being able to add fields to items as data.

AtlasBarfed · on Nov 1, 2023

Ahhh the entity-attribute-value pattern.

When your sql database needs another sql engine to translate to actual EAV queries.

But they are fun schemas!

The old joke that there are two actually hard problems in CS: caching and naming things. The older I get, naming things IMO is far more difficult.

cdchn · on Nov 1, 2023

When you start looking at SQL tables that are just id,key,val then you know you've teleported into this forbidden realm.

There are 2 hard problems in computer science: caching, naming things, and off-by-one errors.

sgarland · on Nov 1, 2023

It’s even better when you find out that key is a text field, and stores ints, uuids, custom key formats…

cdchn · on Nov 1, 2023

And there are a different 'value' column for every possibly data type.

yxhuvud · on Nov 1, 2023

The thing about naming is that it is so closely related to how the problem domain (not business domain, though they are of course related) is understood, and how well it is conceptualized.

JackMorgan · on Nov 1, 2023

I've seen some horror stories at companies with that sort of custom column schema. Today I think just adding a new column is a lot better in the long term. Custom fields systems will quickly contain several different "versions" of the records, but there is no easy way to see which is what. At some point it feels like re-inventing a datalog or key-value database, but those have query systems better-suited to the task.

klysm · on Nov 1, 2023

Custom fields system is generally a bad idea imo. Just do JSON if you want that

hosh · on Nov 1, 2023

Another way I understand the idea of schema growth that, you should transform schemas with monoids.

packetlost · on Oct 31, 2023

I wish there were open source, less Clojure-centric implementations of Datomic, particularly as a plugin over something like Postgres. It has many good ideas

sweetsocks21 · on Oct 31, 2023

I've been keeping an eye on https://github.com/cozodb/cozo which is pretty close to something I've wanted, a sqlite version of datalog/datomic.

stcg · on Nov 1, 2023

Datahike [0] provides similar functionality to datomic and is open source. It lacks some features however that Datomic does have [1].

[0]: https://github.com/replikativ/datahike

[1]: https://github.com/replikativ/datahike#when-to-choose-datahi...

yayitswei · on Oct 31, 2023

There's XTDB which is still Clojure-centric but is at least open-source. It supports SQL queries.

lysecret · on Nov 1, 2023

Summary, reading the article and the comments. There are no rules, it is all one big DEPENDS.

ewams · on Nov 1, 2023

Anyone have good examples of rule 8, namespaces? Either in pg or SQL server?

Aeolun · on Oct 31, 2023

They completely lost me when they said the database is the definite source of the schema.

Sounds like 90s DB admins making a case for the bad old way of doing things.

butlerm · on Nov 1, 2023

It has never been standard practice to store the database schema and its history, notes, and documentation in the database, although data dictionaries come close. There are database catalogs that can carry a subset of that information, but a more traditional "90s" approach is to use a data modeling tool.

If you want detailed change history it is hard to avoid using DDL files with source code control, although notes go a long way. Most databases are not good at change history - not even good at representing it, unfortunately.

It is also quite easy to do a mediocre job of anything, and I would be careful of counting anything as the "bad old way" with out a careful evaluation of the pros and cons of the "new good way" as well. A casual observer might wonder why so many modern web apps have response times that are ten to twenty times longer than they were two decades ago, for example. Perhaps - on occasion - the new good way isn't so good.

JackMorgan · on Nov 1, 2023

Could you elaborate on this?

Aeolun · on Nov 1, 2023

As another commenter on this posted, you should keep your schema in version control just like your code.

iamwil · on Nov 1, 2023

Same. Why is having the schema in the db a bad idea?

riffic · on Nov 1, 2023

why is no one talking about modeling or ERD or anything of that sort?

butlerm · on Nov 1, 2023

I wouldn't design a non-trivial API without doing a data model of some sort, personally, because every non-trivial API exposes one.