Going to take the risk and politely say I do not agree with this article at all.
Alternative advice: never allow more than one app to share the db; expose data through APIs, not queries. Then you can actually remove cruft and solve compatibility through API versioning, which you probably need to do anyway. Also, never maintain more than two versions at a time.
There are databases (and data warehouses) that need to provide ad-hoc query APIs, e.g., for reporting, analytics, etc. Databases can be considered to provide an API layer themselves, one far more flexible than most standard APIs. GraphQL and OData go a long way toward solving this. Also, database views can provide a data API layer for a database. Views provide a stable data model that allows the underlying tables to be changed.
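As a minimal sketch of that last point (SQLite via Python; the `customers` table and `customer_api` view are invented for illustration), a view can keep presenting the same data model while the underlying table is restructured:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Consumers query the customer_api view, never the table directly.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE VIEW customer_api AS SELECT id, name FROM customers")
conn.execute("INSERT INTO customers (name) VALUES ('Ada Lovelace')")

# Later the table is restructured (name split into two columns); the
# view is redefined so consumers of customer_api see no change at all.
conn.executescript("""
    DROP VIEW customer_api;
    ALTER TABLE customers RENAME TO customers_old;
    CREATE TABLE customers (id INTEGER PRIMARY KEY, first TEXT, last TEXT);
    INSERT INTO customers (id, first, last)
        SELECT id,
               substr(name, 1, instr(name, ' ') - 1),
               substr(name, instr(name, ' ') + 1)
        FROM customers_old;
    DROP TABLE customers_old;
    CREATE VIEW customer_api AS
        SELECT id, first || ' ' || last AS name FROM customers;
""")

rows = conn.execute("SELECT id, name FROM customer_api").fetchall()
print(rows)  # [(1, 'Ada Lovelace')]
```

Anything that only ever touched `customer_api` keeps working across the table rewrite.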
I'm inclined to agree, but just to provide a counterpoint: if the data is modelled properly then there are very few reporting queries that actually make sense. Except of course if you want people to be able to change definitions on the fly, but I'd argue that is an anti-feature when it comes to reporting.
Data gets messy, because the real world is messy. Once you start ingesting from other sources, it gets very messy. There is a reason ideas like the "data mesh" or semantic web (distributed schemas) were attempted.
Sure, but that complexity is something you do not want to deal with in your reporting API layer. A reporting API should report correct figures, you don't want to put the responsibility for ensuring the numbers add up on the person requesting the report.
It's already bad enough that people, by default, are not aware of the frame of reference in which they see the world, let alone the frames of the people they communicate with.
As just a very simple example, unless I am explicitly clear about speaking from a Living Systems world view and a regenerative paradigm, when I talk about permaculture design here in this forum, I end up talking past the vast majority of people here on HN.
But even staying with the same shared technological world views and paradigms -- the people across the Three Tribes of Programmers (https://josephg.com/blog/3-tribes/) will get into flamewars because the frame differs in distinct ways.
Because a schema makes distinctions on information in a particular way, it will always encode a frame in which to view and understand that information.
This is turning into a rather interesting discussion, but surely it would be better to decide on a schema with each other (or perhaps multiple schemas if absolutely necessary), as opposed to just giving people the ability to query whatever they want and present the results?
If the query itself cannot be explained simply, then what does its output even communicate? The more freedom you put in the API, the more room you leave for confusion; this might be appropriate for research, but not when you're reporting on something.
I agree that you have to make decisions somehow. My points were directed at the phrase “properly modeled”, rather than “very few queries that make sense”.
I see the purpose of reporting within an organization is to allow someone, somewhere to evaluate things and make decisions.
You can have an agreed upon report, but it doesn’t mean that the report itself will always lead to wise decisions. It can also become worse when those very decisions lead to forcing things around you to make things easier to report — that is the core thesis described in that essay, and the book, “Seeing Like a State”.
Another example — in the realm of strategic decision making, decisions will always be made with imperfect information. A core part of strategy is deliberately manipulating how the opposing force gathers and interprets information, thus influencing their actions (for example, the OODA loop). So what is already messy gets incredibly messy. This is the kind of stuff that falls way beyond reporting, and not something I think can ever be adequately modeled.
Views seem like the ideal solution in theory, but SQL implementations of views are often problematic in practice due to planning complexity & overhead. In theory they should also be a good mechanism for handling writes (see "updateable views" / "writable views") but the list of caveats is long and many developers are understandably nervous about pushing lots of logic into TRIGGERs.
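To make the "updateable views" point concrete — a hedged sketch, using SQLite via Python with invented names, since this is exactly the kind of trigger logic the comment warns about: a view is read-only until an INSTEAD OF trigger translates writes on the view into writes on the base table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, first TEXT, last TEXT);
    CREATE VIEW people_v AS
        SELECT id, first || ' ' || last AS full_name FROM people;

    -- The view is read-only unless an INSTEAD OF trigger translates
    -- writes against the view into writes against the base table.
    CREATE TRIGGER people_v_insert INSTEAD OF INSERT ON people_v
    BEGIN
        INSERT INTO people (first, last)
        VALUES (substr(NEW.full_name, 1, instr(NEW.full_name, ' ') - 1),
                substr(NEW.full_name, instr(NEW.full_name, ' ') + 1));
    END;
""")

conn.execute("INSERT INTO people_v (full_name) VALUES ('Grace Hopper')")
row = conn.execute("SELECT first, last FROM people").fetchone()
print(row)  # ('Grace', 'Hopper')
```

Even in this toy case the trigger silently encodes a parsing rule (split on first space), which is the sort of hidden logic that makes developers nervous.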
This reminds me of the microservices trend, where the main justification is modularity -- which is of course perfectly possible to implement in a programming language using standard language constructs.
Similarly: any separation of concerns you can implement with APIs and multiple databases you can also implement with a schema. The difference being you have to reimplement a bunch of capabilities that are baked into an RDBMS (and will probably never correctly implement something like a hash join).
It should be possible to implement modularity in a programming language using standard language constructs.
In practice, the programming language and software engineering communities have largely failed to provide usable modularity. IMNSHO due to our clinging to call/return (so procedural/functional/method-oriented) as our modularity mechanism.
It ain't working.
So the OS/systems guys and gals need to bail us out. Process boundaries are pretty hard, though of course we then manage to build distributed monoliths.
One thing that's interesting is that µservices, if actually REST-based, use data as the modularity mechanism, rather than procedures.
"Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowchart; it'll be obvious." -- Fred Brooks, The Mythical Man Month (1975)
So because we have bad programmers that don't use the stuff provided by the programming language, we are going to throw the baby out with the bathwater.
And, somehow, this is going to fix the fact that this happened because we have bad programmers. Bad programmers that now have to also deal with network complexity on top of the basic complexity they already can't deal with.
OK, in your opinion, what would the right stuff, as provided by a programming language, look like?
Microservices are a good solution for scaling a company into multiple sub-sections, i.e., a solution for an org-chart problem. If your org chart has fewer than 4 levels, most likely you are doing microservices too soon. At that point, you can have people in charge of dealing with network interactions exclusively, which allows your bad programmers to still be productive.
It is not necessary to be able to solve a problem in order to identify it.
That said, what I think is needed is general support for architectural connectors in programming languages. Our so-called "general purpose programming languages" (actually: domain-specific languages for the domain of algorithms, see ALGOL) effectively support one: the procedure call.
The central thesis revolves around continuous growth with no advice given for removal/cleanup. This is not a sound strategy for a database schema, at least on the SQL side. Column bloat, trigger bloat, index bloat... Schemas cannot continuously grow; there need to be DROPs along the way.
Datomic is not a SQL database. You can simulate tabular data, but it is a very different beast. Datomic schemas are always EAVT. If you add an attribute, or deprecate one, there may still be values tagged to that attribute. That should be fine.
That something as simple as change--a universal condition of all systems--defeats so many schema designs in SQL databases should suggest that there's something wrong with the SQL databases.
But the article claims it applies to SQL databases as well.
> Are these rules specific to a particular database?
>
> No. These rules apply to almost any SQL or NoSQL database. The rules even apply to the so-called "schemaless" databases.
Yes, you eventually need to do the things you mention, but probably less frequently than a normal application needs to add new columns or the like to support new use cases.
The article's thesis is essentially: make breaking changes as infrequently as possible. The easiest way to do that is to never change your data, but that’s a sure way to have your competitors crush you as you stagnate. The next best thing you can do is make sure existing producers and consumers are not impacted when you make changes. For most changes, the article gives a set of things you can do to achieve this goal.
For times when your database itself is not scaling, which are the types of things you’re mentioning, I think there are other things you can do to, if not eliminate backwards incompatibility, at least make the transition easier. For example, fronting your DB with an API and gating all producers/consumers through that. If you’re frequently having to handle scaling issues, perhaps it’s time to reevaluate your system design altogether.
Never break it
Never remove a name
Never reuse a name
Your point is a very reasonable statement, but you are really disrespecting the author by putting a reasonable statement in their mouth. They had every chance to say the reasonable thing, and they clearly made a choice to say the unreasonable thing. Respect that decision (and tell them that they're wrong).
Never removing a name is relatively straightforward with views and aliases. Never breaking it is somewhat harder, but still possible for well-designed tables -- especially if you use updateable views. Never reusing a name is also straightforward. It does require planning your schema evolution years, if not decades, in advance, of course.
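A small sketch of the views-and-aliases approach (SQLite via Python; the `orders` table and its columns are hypothetical, and column renames need a reasonably recent SQLite): rename freely, but leave the old names alive behind a view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total_cents INTEGER);
    INSERT INTO orders (total_cents) VALUES (1999);

    -- Rename both the table and the column, but never remove the old
    -- names: a view called 'orders' keeps every existing query working.
    ALTER TABLE orders RENAME TO orders_base;
    ALTER TABLE orders_base RENAME COLUMN total_cents TO amount_cents;
    CREATE VIEW orders AS
        SELECT id, amount_cents, amount_cents AS total_cents
        FROM orders_base;
""")

old = conn.execute("SELECT total_cents FROM orders").fetchone()[0]
new = conn.execute("SELECT amount_cents FROM orders").fetchone()[0]
print(old, new)  # 1999 1999
```

Old queries keep using `orders.total_cents`, new ones use `amount_cents`, and neither name is ever broken or reused for something else.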
You've got more than one app sharing a db when you deploy a new version. Unless you're happy with downtime during deploys as the cost of not having to manage how your schema evolves.
These kinds of best practices make sense regardless of how many apps access a db.
Following the advice doesn't also prevent you from enforcing a strict contract for external access and modification of the data.
> You've got more than one app sharing a db when you deploy a new version. Unless you're happy with downtime during deploys as the cost of not having to manage how your schema evolves.
2 deploys is all it takes to solve this problem.
* 1 to deploy the new schema for the new version.
* 1 to remove the old schema.
This sort of "tick tock" pattern for removing stuff is common sense. Be it a database or a rest API, the first step is to grow with a new one and the second is to kill the old one which allows destructive schema actions without downtime.
2 deploys isn't enough for robustness. It depends on what the change is, but the full sequence is often more like
* Add the new schema
* Write to both the new and old schemas, keep reading from the old one (can be combined with the previous step if you're using something like Flyway)
* Backfill the new schema; if there are conflicts, prefer the data from the old schema
* Keep writing to both schemas, but switch to reading from the new one (can often be combined with the previous step)
* Stop writing to the old schema
* Remove the old schema
Leave out any one of those steps and you can hit situations where it's possible to lose data that's written while the new code is rolling out. Though again, it depends on the change; if you're, say, dropping a column that no client ever reads or writes, obviously it gets simpler.
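The sequence above can be sketched at the table level (SQLite via Python; the `users`/`users_v2` names are invented, and a real migration would spread these steps across separate deploys rather than run them in one script):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Old schema, with pre-existing data.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'Old Row')")

# Step 1: add the new schema alongside the old one.
conn.execute("CREATE TABLE users_v2 (id INTEGER PRIMARY KEY, name TEXT)")

# Step 2: writes go to both schemas; reads stay on the old one.
conn.execute("INSERT INTO users (id, name) VALUES (2, 'New Row')")
conn.execute("INSERT INTO users_v2 (id, name) VALUES (2, 'New Row')")

# Step 3: backfill; on conflict, prefer the old schema's data.
conn.execute("""
    INSERT INTO users_v2 (id, name)
        SELECT id, name FROM users WHERE true
        ON CONFLICT (id) DO UPDATE SET name = excluded.name
""")

# Step 4: reads switch to the new schema (writes still go to both).
rows = conn.execute("SELECT name FROM users_v2 ORDER BY id").fetchall()

# Steps 5-6: stop writing to the old schema, then remove it.
conn.execute("DROP TABLE users")
print(rows)  # [('Old Row',), ('New Row',)]
```

The point of the dual-write window plus backfill is that rows written by not-yet-upgraded instances during the rollout still end up in the new schema.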
Yup, it depends on the change. Sometimes two deploys is enough — e.g. making a non-nullable column nullable — and sometimes you need a more involved process (e.g. backfilling).
Nonetheless, I agree with the OP that the article's advice is pretty bad. If you ensure that multiple apps/services aren't sharing the same DB tables, refactoring your schema to better support business needs or reduce tech debt is
a. tractable, and
b. good.
The rules from the article make sense if you have a bunch of different apps and services sharing a database + schema, especially if the apps/services are maintained by different teams. But... you really just shouldn't put yourself in that situation in the first place. Share data via APIs, not by direct access to the same tables.
Why wouldn't one deploy be enough to convert a non-nullable column to nullable? Going the other way takes two deploys, I can see, but this way seems entirely backwards compatible.
If apps are using the column, they might need code changes to handle the column suddenly having nulls. So you would have to change and redeploy the code, then make the column nullable.
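In other words, the first deploy is just making the app null-tolerant before the constraint is relaxed. A toy sketch (SQLite via Python; the `accounts` table and its `nickname` column are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, nickname TEXT)")
conn.execute("INSERT INTO accounts (nickname) VALUES (NULL)")

# Deploy 1: app code learns to tolerate NULL, even though the column is
# still NOT NULL in production; deploy 2 then relaxes the constraint.
row = conn.execute("SELECT nickname FROM accounts").fetchone()
nickname = row[0] if row[0] is not None else "(no nickname)"
print(nickname)  # (no nickname)
```

If the constraint were relaxed first, an old app instance reading a NULL mid-rollout could crash, which is why the order matters.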
I agree that the article is pretty bad, but it's not like "make this the API team's problem" is really an answer. API versioning is probably tougher than database schema versioning IME.
Aside from DBs, there are many other communication tools that use a schema. I can think of at least two: Kafka and serialization libraries like Protobuf and Thrift.
I'm a little stunned by this suggestion. I've worked in quite a few different contexts for application systems, e.g. retail, manufacturing of fiber optic cable, manufacturing of telecommunications equipment, laboratory information management, etc.
I wouldn't even know what you mean by "app" in this context. There may be a dozen or more classes of users who collectively have hundreds, even thousands, of different types of interaction with the system.
Sometimes there were natural divisions where you could separate things into a separate database. For example, the keep/dispose system for laboratory specimens, which tracked which specimens needed to be kept for possible further testing, where, and for how long. But most problem domains were not like that.
And sometimes we had to interact with other systems because they were for a separate division (because of mergers and acquisitions). But those kinds of separations made for more limited functionality and more difficulty in managing change, not less.
Most of this advice is this bad because it's entirely based on the idea that you interact with your database using an extremely dynamic language: Clojure. Like most of Hickey's advice, it's terrible and founded in dogma.