Is 'Not performance bound, and not knowing the future shape of your data' a valid reason? Less overhead on initial rollout to just toss it up there.

> choosing something else is rarely the right decision

I think this is a little bit of a 'We always did it this way' statement.




There are circumstances where you really don't know the shape of the data, especially when prototyping for proof of concept purposes, but usually not understanding the shape of your data is something that you should fix up-front as it indicates you don't actually understand the problem you are trying to solve.

More often than not it is worth spending some time thinking and planning to work out at least the core requirements in that area, to save yourself a lot of refactoring (or throwing away and restarting) later, and potentially hitting bugs in production that a relational DB with well-defined constraints could have saved you from while still in dev.

Programming is brilliant. Many weeks of it sometimes save you whole hours of up-front design work.


> usually not understanding the shape of your data is something that you should fix up-front as it indicates you don't actually understand the problem you are trying to solve.

This is a good point and probably correct often enough, but I also think not understanding the entire problem you are solving is not only common, but in fact necessary to most early-stage velocity. There is a need to iterate and adapt frequently, sometimes as part of your go-to-market strategy, in order to fully understand the problem space.

> a relational DB with well-defined constraints could have saved you from while still in dev

This presumes that systems built on top of non-RDBMS are incapable of enforcing similar constraints. This has not been my experience personally. But it's possible I don't understand your meaning of constraints in this context. I assumed it to mean, for instance, something like schemas, which are fairly common now in the nosql world. Curious what other constraints you were referencing?


> There is a need to iterate and adapt frequently, sometimes as part of your go-to-market strategy, in order to fully understand the problem space.

If you're pivoting so hard that your SQL schema breaks, how is a schemaless system going to help you? You'll still have to either throw out your old data (easy in both cases) or figure out a way to map old records onto new semantics (hard in both cases).

I agree with GP that this is a central problem to solve, not something to figure out _after_ you write software. Build your house on stone.


> If you're pivoting so hard that your SQL schema breaks, how is a schemaless system going to help you? You'll still have to either throw out your old data (easy in both cases) or figure out a way to map old records onto new semantics (hard in both cases).

I agree with your comment that it's a central problem to solve and that both options, throwing out data or mapping old records onto new semantics, present a choice both stacks need to make. I don't agree that it's always possible to solve entirely up front, though.

In my experience, it has been less about whether the storage engine is schemaless or not; even many modern nosql stacks now ship with schemas (e.g. MongoDB). I think the differentiation I make between these platforms is mostly around APIs: expressive, flexible semantics that (in theory) let you move quickly.

As an aside, I also think the differentiation between all these systems is largely inconsequential for most software engineers, and the choice is often one of qualitative/subjective analysis of dev ergonomics, etc. At scale there are certainly implementation details that begin to disproportionately impact the way you write software, sometimes prohibitively so, but most folks aren't in that territory.


Admittedly, my experience with MongoDB and Cassandra has gained some rust over the last decade, but what makes you say such nosql databases have expressive APIs? Compared to PostgreSQL they have minuscule query languages, and it is very hard, if at all possible, to express constraints. And constraints, sometimes self-imposed, sometimes not, are what make projects successful, even startups. Many startups try to find this one little niche they can dominate. That is a self-imposed constraint. People tend to think freedom makes them creative, productive, and inventive, while in fact the opposite is true. By the opposite of freedom I mean carefully selected constraints, not oppression.


> schemas

The confusion there is due to the fact that non-RDBMS (particularly when referred to as noSQL) can mean several different things.

In this context I was replying to a comment about not knowing the shape of your data which implies that person was thinking about solutions that are specifically described as schemaless, which is what a lot of people assume (in my experience) if you say non-relational or noSQL.

Those are the sorts of constraints I meant: primary & unique keys, foreign keys for enforcing referential integrity, and other validity rules enforced at the storage level. There are times when you can't enforce these things immediately with good performance (significantly distributed data stores that need concurrent distributed writes, for instance - but the need for those is less common for most developers than the big data hype salespeople might have you believe), so then you have to consider letting go of them (I would advise considering that very carefully).
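
To make that concrete, a minimal sketch of the kind of storage-level rules I mean (table and column names made up):

    CREATE TABLE customers (
        id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        email text NOT NULL UNIQUE                              -- uniqueness enforced by the store
    );

    CREATE TABLE orders (
        id          bigint  GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        customer_id bigint  NOT NULL REFERENCES customers (id), -- referential integrity
        total_cents integer NOT NULL CHECK (total_cents >= 0)   -- validity rule at the storage level
    );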


> This presumes that systems built on top of non-RDBMS are incapable of enforcing similar constraints.

Are you kidding? They never can.

The entire point of ditching the relational model is discarding data constraints and normalization.


Never? Many NoSQL stores now offer parity in many of the feature verticals that were historically the sole domain of RDBMSs.

Mongo has always had semantics to support normalized data modeling, has schema support, and has had distributed multi-document ACID transactions since 2019 [1]. You don't have to use those features as they're opt-in, but they're there.

I know that full parity between the two isn't feasible, but to say they never can is a mischaracterization.

[1] Small addendum on this: Jepsen highlighted some issues with their implementation of snapshot isolation and some rightful gripes about poor default config settings and a wonky API (you need to specify snapshot read concern on all queries in conjunction with majority write concern, which isn't highlighted in some docs). But with the right config, their only throat-clearing was whether snapshot isolation was "full ACID", which would apply to Postgres as well given they use the same model.


What is the point of using MongoDB with multi-document ACID transactions? Enabling durability in MongoDB is usually costly enough that you can't find a performance benefit compared to Postgres. With JSONB support in PostgreSQL, I dare say, it can express anything that MongoDB can express with its data model and query language. That leaves scalability as the only possible advantage of MongoDB compared to PostgreSQL. And the scalability of MongoDB is rather restrictive compared to, e.g., Cassandra.

And I would never trust a database that has such a bad track record regarding durability as MongoDB, although I admit that PostgreSQL had theoretical problems there as well in the past.


I actually agree with you on the point about multi-document tx's; I wouldn't choose mongo solely for that feature. It's nice to have for the niche use case in your nosql workload where it's beneficial. But the point I was originally making was that nosql stacks are not fundamentally incompatible with most of the features or safety constraints offered by an RDBMS.

> And I would never trust a database that has such a bad track record regarding durability as MongoDB

I can't comment on your own experience obviously, but I've been using mongo since 2011 in high throughput distributed systems and it's been mostly great (one of my current systems averages ~38 million docs per minute, operating currently at 5 9s of uptime).

Definitely some pain points initially in the transition to WiredTiger, but that largely was a positive move for the stack as a whole. Durability fires have not plagued my experience thus far, not to say they won't in the future of course.

As you noted, Postgres has had its own headaches as well. Finding out that their literature claimed their transactions were serializable when they were in fact not could be considered a blemish on their record. But much like mongo, they have been quick to address implementation bugs as they are found.


> I can't comment on your own experience obviously, but I've been using mongo since 2011 in high throughput distributed systems and it's been mostly great (one of my current systems averages ~38 million docs per minute, operating currently at 5 9s of uptime).

> Definitely some pain points initially in the transition to WiredTiger, but that largely was a positive move for the stack as a whole. Durability fires have not plagued my experience thus far, not to say they won't in the future of course.

Good to read that some are actually using MongoDB to their benefit. Indeed I have encountered problems with durability in the wild. Nothing that I would like to repeat. But as always, for a specific use case the answer is: it depends. As general advice on what to start a new project with, I would select PostgreSQL in 10 out of 10 cases, if a database server is actually required.


> future shape of your data

Contrary to what people seem to assume, you actually can change the schema of a database and migrate the existing data to the new schema. There's a learning curve, but it's doable.
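
A rough sketch of what such a migration can look like (hypothetical table and columns, and a deliberately naive split):

    -- split a single "name" column into two, migrating the existing rows
    ALTER TABLE users ADD COLUMN first_name text, ADD COLUMN last_name text;

    UPDATE users
    SET first_name = split_part(name, ' ', 1),
        last_name  = split_part(name, ' ', 2);

    ALTER TABLE users DROP COLUMN name;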

If you go schema-less, you run into another problem: not knowing the past shape of your data. When you try to load old records (from previous years), you may find that they don't look like the ones you wrote recently. And, if your code was changed, it may fail to handle them.

This makes it hard to safely change code that handles stored data. You can avoid changing that code, you can accept breakage, or you can do a deep-dive research project before making a change.

If you have a schema, you have a contract about what the data looks like, and you can have guarantees that it follows that contract.


Maintaining backwards compatibility for reading old records in code is not hard. You can always rewrite all rows to the newer format if you want to remove the compat code, or if the structure changes in an incompatible way. It's pretty comparable to what you have to do to evolve the code/schema safely together.

Having a schema is much better for ad-hoc queries though, doubly so if your schemaless types aren't JSON (e.g. protobufs).


With Postgres, you can always just have a JSONB column for data whose shape you're unsure of. Personally, I'd rather start with Postgres, dump data in there, and retain the powers of an RDBMS for the future, rather than go the other way around and end up finding out that I really would like to have features that come out of the box with relational databases.
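
For instance (a minimal sketch; the table and keys are made up):

    CREATE TABLE events (
        id      bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        payload jsonb  NOT NULL   -- the part whose shape you're unsure of
    );

    -- you can still index and query inside the document
    CREATE INDEX ON events USING gin (payload);
    SELECT * FROM events WHERE payload @> '{"type": "signup"}';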

I think a valid reason for not choosing a relational database is if your business plan requires that you grow to be a $100B+ company with hundreds of millions of users. Otherwise, you will probably be fine with an RDBMS, even if it requires some optimizing in the future.


Most $100B+ companies (e.g. Google, Meta, Amazon) were built primarily using relational databases for the first 10-15 years.


Postgres' JSON implementation perfectly adheres to the JSON spec, which actually sucks if you need to support things like NaNs, Inf, etc. It's a good option, but it doesn't work for all datasets.
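
For example (hedged, but this matches the spec-strict behaviour described):

    SELECT 'NaN'::float8;  -- fine: Postgres floats allow NaN
    SELECT 'NaN'::jsonb;   -- error: NaN is not valid JSON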


Wow, TIL that NaN isn't valid JSON.


No longer an issue with things like Spanner, CockroachDB, etc.


as long as you're willing to write a whole lot of data validation and constraint-checking scripts by hand in the future, plus ETL scripts for non-trivial analytical queries (depending on what NoSQL you chose, but if you chose it for perf this one is usually a price you have to pay), and to keep very rigorous track of the conceptual model of your data somewhere else, or simply not care about its consistency when different parts of your not-schema have contradicting data (at that point, why are you even storing it?)

and that you've ruled out using a JSON string column (or columns) as a dump for the uncertain parts, de-normalization and indexing, and the EAV schema as potential solutions to your problems.
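
(for reference, the EAV pattern mentioned is roughly this, with made-up names:)

    CREATE TABLE entity_attributes (
        entity_id bigint NOT NULL,
        attribute text   NOT NULL,
        value     text,
        PRIMARY KEY (entity_id, attribute)   -- one row per (entity, attribute) pair
    );

    INSERT INTO entity_attributes VALUES (42, 'color', 'red'), (42, 'size', 'xl');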

the point is nothing is free, and you have to be sure it's a price you're willing to pay.

are you ready to give up joins? to have your data be modeled after the exact queries you're going to make? for your data to be duplicated across many places? etc ...


I think the tradeoff is similar to using a weakly typed vs. strongly typed language. Strong typing is more up-front effort, but it will save you down the line because it's more predictable. Similarly, an RDBMS will require more up-front planning, design, and regular maintenance, but that extra planning will save you more time down the line.


> Is 'Not performance bound, and not knowing the future shape of your data' a valid reason?

That's a very good reason for going with an RDBMS even if it looks like it's not the clearest winner for your use case.

If you invert any of those conditions, it may become interesting to study alternatives.


I find myself forced to model access patterns when choosing non-relational DBs. This often results in a much less flexible model if you didn't put a lot of thought into it. YMMV


No, absolutely not. An hour spent making a schema and a few hours of migrations are a far smaller cost than the headaches you'll have by going nosql first.



