Why is storing all data normalized in a database important to you? Whenever I see an exceedingly complicated database schema, I know I'm most likely in for unfixable legacy DB warts and incessant schema updates for every little feature.

Hyperbole aside, the best option is often somewhere in between. I find that a relational database with columns for primary keys/relations/anything used in WHERE statements in the normal application path, and a json blob for everything else that's just attributes for the current row/object/document, makes for a very flexible and successful DB schema. You get all the strong relational consistency benefits of traditional schemas, plus the flexibility to add and modify features without touching the schema (as long as they don't require relation changes), and the ability to represent complex structures while still allowing ad-hoc and admin-path queries to peek into them where necessary.
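To make that concrete, here's a minimal sketch of what such a hybrid table might look like in PostgreSQL (table and column names are purely illustrative):

  CREATE TABLE teams (
      id    bigint PRIMARY KEY,
      name  text NOT NULL
  );

  CREATE TABLE users (
      id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
      email       text NOT NULL UNIQUE,           -- used in WHERE clauses and joins
      team_id     bigint REFERENCES teams (id),   -- proper relational link
      attributes  jsonb NOT NULL DEFAULT '{}'     -- everything else: flags, prefs, misc details
  );

  -- Relational lookups stay plain SQL...
  SELECT id FROM users WHERE email = 'someone@example.com';

  -- ...while incidental attributes are reached through jsonb operators.
  SELECT attributes->>'ui_language' FROM users WHERE id = 42;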

In fact, most "fully normalized" databases end up reimplementing a key/value store or twelve in there anyway (e.g. config settings, user attributes, and the like). Might as well just use JSON at that point.



Are you saying fully normalized RDBs end up reimplementing key/value stores, and then suggesting to just use a JSON blob instead? The reason RDBs are normalized is to avoid update anomalies, and that isn't really difficult to understand. Technically, adding a column to a DB table is a matter of "ALTER TABLE T ADD (mycol NUMBER)", but it sounds like the real problem to be worked on is the organizational overhead around it. At least the way you're describing your use of JSON blobs in a relational DB really sounds like a guerilla tactic to sidestep DBAs gatekeeping your DB.

JSON is just an accidental data serialization format that browsers happen to support OOTB. If you're using any other language in your backend, like Java, then you need additional serialization libs to read and write JSON anyway (just like with every other format, such as XML), and the convenience of JSON is lost. Though with browser apps lately becoming so bloated with MVw frameworks and whatnot, I wonder if the original point of using JSON - that it can be handled as a JavaScript object literal OOTB - is really all that important anymore. And if you've worked with MongoDB, you'll know that Mongo's JSON is not quite JSON, but uses additional ad-hoc constructs to re-introduce some typing (such as for dates and IDs).

So I have to agree with GP in wondering why JSON is so important to people, and why it is even portrayed as a relief or saviour to developers. In my experience, JSON in a relational DB is always a sign of organizational failure, developers using the wrong tool for the job, or developers not knowing what they want or what they're doing.


I'm not sidestepping the DBA, because I'm the DBA :-)

Fully normalized databases are a nice academic idea, but the supposed benefits of going all the way don't materialize in the real world. That kind of approach is just old school cargo cult - just like full NoSQL is new school nonsense. Good developers know that the answer isn't to use whatever the fad was when they were in school, be it fully relational databases or NoSQL or anything else, but rather to look at the available technologies and take the most appropriate bits and pieces in order to make a project successful.

After all, if JSON were nonsense, why would a serious relational database like PostgreSQL be adding full support for it? They know it has good use cases.

I know it has good use cases, so I use it, along with proper primary and foreign keys and a properly relational base data model. Yes, all my primary objects are tables with proper relations and foreign keys (and a few constraints for critical parts; not for everything, because 100% database-side consistency is an impossible pipe dream, as not every business rule can be encoded sanely in SQL). Just don't expect me to add a user_attribute table to build a poor man's KVS just because people in the 90s thought that was the way to go and databases didn't support anything better. I'll have an attributes JSONB column instead.
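For contrast, here's roughly what that poor man's KVS looks like next to the jsonb approach. This is only a sketch, assuming a users table with a bigint id and a jsonb attributes column:

  -- The 90s-style user_attribute table I'd rather not build:
  CREATE TABLE user_attribute (
      user_id  bigint NOT NULL REFERENCES users (id),
      key      text   NOT NULL,
      value    text   NOT NULL,    -- everything squeezed into strings
      PRIMARY KEY (user_id, key)
  );

  -- Reading one setting means an extra table and a key lookup:
  SELECT value FROM user_attribute WHERE user_id = 42 AND key = 'ui_language';

  -- With an attributes jsonb column, the same read is a plain projection:
  SELECT attributes->>'ui_language' FROM users WHERE id = 42;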

And yes, JSON is just a trivial data serialization format that happens to have become the de facto standard. It has an intuitive model and the concept isn't new; it just happens to be what became popular, and there is no reason not to use it. Keep in mind that PostgreSQL internally stores it in a more compact binary form anyway, and if clients for some programming languages don't yet support skipping the ASCII representation step, that is an obvious feature that could be added later. At that point it ceases to be JSON and just becomes a generic data structure format following the JSON rules.


It really depends on the application. In most databases I've seen, if information is unimportant it is simply omitted, and if it's important it's worth putting in a proper column so it can participate in relational operations. And when important information is big, it's usually unstructured or ad-hoc (e.g. log messages, PDF documents), not a "document" and particularly not JSON.


I disagree. Most of the information in a database is not something I'd want to query by (otherwise I'd be indexing every column, which I'm not), certainly not as part of the normal application.

What's the point of putting, say, every single user management field into columns in a "users" table? The regular application is never going to have to relate users by their CSRF token, or by their UI language, or by any of the other dozens of incidental details associated with a user, some visible, some implementation details.

What matters are things like the username, email, name - things the app needs to actually run relational operations on.

If you look at any real application, pretty much everyone has given up on trying to keep everything relational. That would be a massive pain in the ass and you'd end up with hundreds of columns in your users table. You'll find some form of key/value store attached to the user instead. And if you're going to do that, you might as well use a json field.

And if you do use a json field with a database engine that supports it well, like PostgreSQL, you'll find that it can be indexed if you need it anyway, and querying it is easier than joining a pile of tables implementing a KVS or two. Because yes, I might want to make a report on what users' UI language is some day, but that doesn't mean it has to be a column when Postgres is perfectly happy peeking into jsonb. And I don't need an index that will just cause unnecessary overhead during common update operations that don't use it, when I can just run reports on a DB secondary and not care about performance.
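As a sketch of what I mean (again assuming an attributes jsonb column on a users table), the kind of ad-hoc report and after-the-fact index this enables:

  -- Ad-hoc report on UI languages, no dedicated column required:
  SELECT attributes->>'ui_language' AS ui_language, count(*)
  FROM users
  GROUP BY 1
  ORDER BY 2 DESC;

  -- If a jsonb lookup ever becomes hot, an index can be added later:
  CREATE INDEX users_ui_language_idx ON users ((attributes->>'ui_language'));
  CREATE INDEX users_attributes_idx  ON users USING gin (attributes jsonb_path_ops);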

I designed an application in this manner and it has turned out exceedingly well for me. We have only had about a dozen schema changes total across the lifetime of the app. One of them involved some JSON querying to refactor a field out of JSON and into its own table (because requirements changed) and that was no problem to run just like any other database update. If we need to move things to columns we will, but starting off with an educated guess of what will need to be a column and dumping everything else into JSON has undoubtedly saved us a lot of complexity and pain.
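For what it's worth, that kind of refactor is mostly just SQL over the blob. A simplified sketch, using a hypothetical "notifications" sub-document rather than the actual field we moved:

  -- Give the data its own table...
  CREATE TABLE user_notification (
      user_id  bigint  NOT NULL REFERENCES users (id),
      channel  text    NOT NULL,
      enabled  boolean NOT NULL,
      PRIMARY KEY (user_id, channel)
  );

  -- ...backfill it from the jsonb column...
  INSERT INTO user_notification (user_id, channel, enabled)
  SELECT u.id, n.key, n.value::boolean
  FROM users u,
       jsonb_each_text(u.attributes->'notifications') AS n(key, value);

  -- ...and drop the sub-document from the blob.
  UPDATE users SET attributes = attributes - 'notifications';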

One of our tables is just a single primary key and then a json blob. It stores event configuration. Why? Because it's always loaded entirely and there is never any reason to run relational ops on it. It's a huge json blob with lots of little details, only edited administratively, with nested fields and sub-fields (which is way more readable than columns, which are a flat namespace), including a sub-document that is actually a structure driving the generation of a pile of HTML forms. If I'd tried to normalize this I would have ended up with a dozen tables, probably more complexity than the entire rest of the DB, and a huge pile of code to drive it all, and every single application change that added an admin knob would've had to involve a database schema change... All for what? Zero benefit.
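If it helps, the shape of that table is about as simple as it gets (names are illustrative, not the real schema):

  CREATE TABLE event_config (
      event_id  bigint PRIMARY KEY,
      config    jsonb NOT NULL DEFAULT '{}'
  );

  -- The app always loads the whole thing:
  SELECT config FROM event_config WHERE event_id = 7;

  -- Admin edits can still touch a single nested knob in place:
  UPDATE event_config
  SET config = jsonb_set(config, '{registration,form,fields,0,label}', '"Full name"')
  WHERE event_id = 7;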



