The problem with specialized databases is that people reach for them as a starting point instead of recognizing that doing so is a premature optimization they will likely come to regret later. Initially you are much better off choosing a standard relational database (Postgres, MySQL, MSSQL, pick your favorite flavor) and then, when you really have to, adding something specialized. So unless there are hard technical constraints right off the bat (which is almost never the case), stick to simple.
Relational databases are an extremely specialized kind of datastore that people only treat as "standard" because they've been around a long time. They're not better than the alternatives, and in many ways they're worse: by picking a relational database you're committing to difficulty deploying schema changes, difficulty sharding, an awkward square-table model and a terrible query language, and if you make the mistake of trying to use the transactional functionality that's the one actual selling point of those datastores then you're practically guaranteed to deadlock yourself in production at some point during your growth process.
Kinda with you until "terrible query language" as, in my mind, it's the best query language.
You can also opt out of a rigid schema by using JSON columns, which I generally promote as a new best practice when the database is being used as a dumb store for a smart application. It comes down to who should be the source of truth for the schema.
> Kinda with you until "terrible query language" as, in my mind, it's the best query language.
It's a decent language for ad-hoc querying by humans; the problem is it's the only interface to the database that you get. It was never designed for machine use; in a modern RDBMS, 3/4 of the time to execute a pkey lookup is spent parsing the SQL string. Yes, prepared statements can help in some cases, but they come with their own overheads that make them difficult to use safely in a large system.
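As a concrete illustration of the prepared-statement point: PostgreSQL's server-side prepared statements let you parse and plan a query once and then execute it by name. A minimal sketch via psycopg2, with a hypothetical users table and connection string:

```python
# Minimal sketch: server-side prepared statements in PostgreSQL via psycopg2.
# The SQL string is parsed/planned once, then executed by name.
# Table "users" and the connection string are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=app")
cur = conn.cursor()

# Parse and plan once...
cur.execute("PREPARE user_by_id (bigint) AS SELECT id, name FROM users WHERE id = $1")

# ...then each lookup only sends EXECUTE, skipping the parse step.
cur.execute("EXECUTE user_by_id (%s)", (42,))
row = cur.fetchone()

# Caveat (the "overheads" mentioned above): prepared statements are
# per-connection, so pooling layers such as transaction-mode PgBouncer
# make them awkward to use safely in a large system.
cur.execute("DEALLOCATE user_by_id")
conn.close()
```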
> You can also opt out of a rigid schema by using JSON columns.
You can, but usually in a database-specific way, and support for that in drivers and especially at the ORM level is pretty spotty.
All the databases seem to have database-specific ways. SQL is the least vulnerable to this. Obviously if you leave SQL for e.g. Mongo, then you could not have picked a more bespoke and unique interface. Notionally SQL is standardized, but who migrates databases anyway.
Network latency is the real performance drag, not string parsing.
I think JSON support will improve, but SQLAlchemy, for example, is OK with a common JSON interface over MySQL and Postgres. I am sure this will resolve itself in time; it's just a bit new for SQL.
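For example, SQLAlchemy's generic JSON type maps to the backend's native JSON type on both MySQL and Postgres. A minimal sketch; the model, column and connection string below are made up:

```python
# Rough sketch: SQLAlchemy's generic JSON type works over both MySQL and
# PostgreSQL, so the application owns the schema of the blob.
# Model, column names and connection string are invented for illustration.
from sqlalchemy import Column, Integer, JSON, create_engine
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Event(Base):
    __tablename__ = "events"
    id = Column(Integer, primary_key=True)
    payload = Column(JSON)  # backend-native JSON type on Postgres or MySQL

engine = create_engine("postgresql://localhost/app")  # or mysql+pymysql://...
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Event(payload={"kind": "signup", "plan": "free"}))
    session.commit()
```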
> Obviously if you leave SQL for e.g. Mongo, then you could not have picked a more bespoke and unique interface.
In principle yes; in practice you can expect to find full-featured drivers in all major languages, and anything higher-level that claims support for Mongo will also have support. Certainly even the most basic Mongo drivers will let you have things like collection columns.
> Network latency is the real performance drag, not string parsing.
Depends what kind of network, if any, is between the two endpoints. But the performance aspect is just illustrative of what a poor format for expressing machine-level queries it is.
> In a modern RDBMS, 3/4 of the time to execute a pkey lookup is spent parsing the SQL string
These times are on the order of sub-milliseconds; about 0.1ms with PostgreSQL on my modest laptop with ~30 million row table.
You make it sound like it's some sort of horrible performance hog, but 0.1ms for parsing a dynamic language really isn't that bad. Actually fetching the row takes about 1ms (without anything cached, faster otherwise), so that's hardly "3/4th" either. With cache it's about 0.1ms, which is about half.
But in reality most queries are more complex, and the parsing time becomes negligible; even for a fairly complex query it's about 0.6ms, which is hardly worth thinking about if you consider that the query execution takes about 60ms.
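For anyone wanting to reproduce numbers like these: PostgreSQL's EXPLAIN ANALYZE reports planning and execution time separately (parsing proper isn't broken out, but it's part of the pre-execution cost). A minimal sketch with psycopg2 and a hypothetical table:

```python
# One way to measure this yourself: EXPLAIN ANALYZE in PostgreSQL reports
# "Planning Time" and "Execution Time" separately for a query.
# Table name and connection string are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=app")
cur = conn.cursor()
cur.execute("EXPLAIN ANALYZE SELECT * FROM big_table WHERE id = 12345")
for (line,) in cur.fetchall():
    print(line)  # the last lines include Planning Time and Execution Time in ms
conn.close()
```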
> But in reality most queries are more complex, and the parsing time becomes negligible
Depends on your usage pattern. If you're actually doing a bunch of different ad-hoc aggregations (which is what SQL was designed for) then yes, query parse time is irrelevant. If you're using an RDBMS as a glorified key-value store (which is what most web applications tend to do) then it's very possible for pkey lookups to be the majority of your queries. (My point isn't really about performance, it's about SQL not being designed for that use style in general).
I have never seen SQL being used as a key-value store. That most web apps use it like that is an incredible claim to me. Do you have any source for that?
I've seen database profiling numbers from some of my clients, but obviously I can't publish those. To be clear all I'm claiming is that for most webapps the overwhelming majority of database queries are simple indexed column lookups, not that most webapps are putting document blobs in the database or using EAV style or anything like that.
Well, "key-value store" means different things to different people. If most queries are simple indexed column lookups then that carries my original point: query parse time is actually a significant proportion of overall datastore request processing time.
The performance aspect is just illustrative of what a poor representation SQL is for machine-level use. And I'm by no means demanding a "full-fledged" ORM; even basic querying requires a layer or two above SQL to do cleanly.
All of which is still much better than rebuilding that stuff in your application layer, which is what you'll inevitably end up doing otherwise. Besides the fact that you'll likely do so in an incomplete and buggy way.
It's what you end up doing even with a relational database, because the database's built-in implementations aren't controllable enough. There's definitely space for a library that offers standard implementations of these things rather than everyone implementing it themselves from scratch, but applications need a lot more control over them than a traditional database gives them.
> You end up rebuilding transactions, foreign keys, MVCC and similar features on relational db's?
Yes, because the database-level builtins are too restricted; pretty much the only behaviour you can get out of them is "reject the change and throw away the data" which is almost never what you want, and requires you to keep the data around on the client side and hope that all clients implement recovery consistently. Think about the user-facing behaviour that you want when e.g. one user tries to join a group at the same time as another user deletes the group. It ends up being easier to implement the right thing without using database-level transactions and foreign keys.
The audacity to say that when your very own blog includes an article titled "People who disagree with you aren't trying to make things complex. Maybe you're wrong."
There is a reason relational databases are the standard. It's one of the rare things in computer science actually based on mathematics.
> The audacity to say that when your very own blog includes an article titled "People who disagree with you aren't trying to make things complex. Maybe you're wrong."
What point are you trying to make? The point of that article was that the rhetoric about "simplicity" is unproductive, no-one makes a deliberate choice to use something they think is complex. But I didn't make that kind of argument.
> There is a reason relational databases are the standard. It's one of the rare things in computer science actually based on mathematics.
Relational databases are a very finely engineered hammer. But not everything is a nail.
Without knowing the details of your use case, Cassandra. (Well, in reality Kafka, but that's a bigger leap). Consistent driver availability, easy horizontal scaling (not for performance to start with but for reliability), basic things like collection-valued columns can just work in a common-sense way, harder to shoot yourself in the foot with the consistency model, separate interfaces for fast key lookups and slow table scans so you don't mix them up.
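As a rough illustration of those last points with the DataStax Python driver (keyspace, table and the id below are invented, and the keyspace is assumed to already exist):

```python
# Hedged illustration of the Cassandra features mentioned above:
# collection-valued columns and key-based lookups as the fast path.
import uuid
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("app")  # assumes keyspace "app" already exists

# A set-valued column is a first-class part of the data model.
session.execute("""
    CREATE TABLE IF NOT EXISTS users (
        user_id uuid PRIMARY KEY,
        name text,
        tags set<text>
    )
""")

# Fast path: lookups go by partition key.
some_id = uuid.uuid4()  # hypothetical key
session.execute("SELECT name, tags FROM users WHERE user_id = %s", (some_id,))

# Slow path: a scan not restricted by the key is rejected unless you opt in
# explicitly, e.g. "SELECT name FROM users WHERE name = 'x' ALLOW FILTERING",
# which is roughly the "separate interface" for slow queries mentioned above.
```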
This makes me think about how other people here feel about the arguments presented in this post. Personally, I actually agree with some of these points!
> difficulty deploying schema changes
Definitely agreed. Even with tools like Flyway, Liquibase, dbmate or most of the framework-provided options (such as Active Record Migrations for Rails and Doctrine for Symfony), most migrations still end up feeling brittle, because you oftentimes do things like renaming a column, processing some data into a new format, or cleaning up old data. Well, you want to do that anyway, but then you realize that instead of simply renaming a column, you'll probably do a rolling migration for the apps that use the DB: you need to create a new column that the app will write data into, migrate all of the app instances to the new version and then clean up the old column, god forbid validations use the wrong column while this is going on. I don't think it's possible to work around problems like this with technologies like MongoDB either, since dealing with missing data in an "old" version of a document would still be annoying. I don't know of any good solutions to address how data evolves over time, regardless of technology.
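For concreteness, a rough sketch of that rolling rename (sometimes called the expand/contract pattern); the table and column names and the psycopg2 usage are invented, and in practice each step is a separate deploy rather than one script:

```python
# Sketch of a rolling column rename (expand/contract). Names are hypothetical;
# each step below happens in its own deploy, not in a single run.
import psycopg2

conn = psycopg2.connect("dbname=app")
cur = conn.cursor()

# Step 1 (expand): add the new column; old app versions keep writing the old one.
cur.execute("ALTER TABLE accounts ADD COLUMN IF NOT EXISTS display_name text")

# Step 2: deploy app versions that write both columns and read the new one.

# Step 3 (backfill): copy historical data over (in batches for large tables).
cur.execute("""
    UPDATE accounts
    SET display_name = legacy_name
    WHERE display_name IS NULL AND legacy_name IS NOT NULL
""")

# Step 4 (contract): once no running app version touches the old column, drop it.
cur.execute("ALTER TABLE accounts DROP COLUMN legacy_name")
conn.commit()
conn.close()
```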
> difficulty sharding
Definitely agreed; in general it seems like most DBMSes scale vertically better than they do horizontally. For example, master-slave replication seems doable, but once you want master-master replication, you run into problems with latency and data consistency. There are some solutions like TiDB which attempt to give you a distributed database in a transparent way, without making you worry about its inner workings, but that only works until suddenly it doesn't. It seems like this problem affects most distributed systems and I'm not sure how to address it, short of making each new data entry reference the previous state, like CouchDB does with revisions ( https://docs.couchdb.org/en/stable/intro/api.html#revisions ) and even that won't always help.
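For reference, that revision mechanism looks roughly like this over CouchDB's HTTP API; the server URL, database and document id below are made up:

```python
# Sketch of CouchDB revisions: every update must name the revision it's based
# on, so concurrent writers can't silently clobber each other.
# Server URL, database name and document id are invented; the db must exist.
import requests

base = "http://localhost:5984/app"

# Create a document; the response includes its first revision id.
resp = requests.put(f"{base}/group-42", json={"name": "cyclists", "members": 3})
rev = resp.json()["rev"]

# An update must reference that revision...
resp = requests.put(f"{base}/group-42",
                    json={"_rev": rev, "name": "cyclists", "members": 4})

# ...and if someone else updated the document in between, CouchDB answers
# 409 Conflict instead of overwriting, leaving conflict handling to the client.
if resp.status_code == 409:
    pass  # re-fetch, merge, retry
```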
> an awkward square-table model and a terrible query language
Partially agreed. SQL is pretty reasonable for what it does, despite its dialects being somewhat inconsistent, many of the procedural extensions being clunky and most of the in-database-processing-heavy systems that I've encountered being a nightmare from a debugging and logging perspective, though I guess that's mostly the fault of the tooling surrounding them. Discoverability can be a big problem if OTLT and EAV are heavily used ( https://tonyandrews.blogspot.com/2004/10/otlt-and-eav-two-bi... ) and foreign keys are not used. Window functions, analytical functions, partitioning and other functionality feel like they're implemented in unintuitive ways in some systems, but that could also be a question of familiarity and a steep learning curve.
> if you make the mistake of trying to use the transactional functionality that's the one actual selling point of those datastores then you're practically guaranteed to deadlock yourself in production at some point during your growth process
Partially agreed, it can definitely happen, but being able to revert bad changes to the data and even test them in the first place sometimes feels like a godsend. Well, there should always be a local instance that's safe to break, but in practice that doesn't really come true often.
> Definitely agreed; in general it seems like most DBMSes scale vertically better than they do horizontally.
NoSQL databases do nothing special to achieve horizontal scaling. They simply don't support transactions or atomic operations across documents. If that's what you want you can just choose an RDBMS with that behavior.
Sounds like vaporware to me. If there are any RDBMSes that support practical autoscaling, they're certainly less mature/established than e.g. Cassandra.
> Well, you want to do that anyway, but then you realize that instead of simply renaming a column, you'll probably do a rolling migration for the apps that use the DB: you need to create a new column that the app will write data into, migrate all of the app instances to the new version and then clean up the old column, god forbid validations use the wrong column while this is going on. I don't think it's possible to work around problems like this with technologies like MongoDB either, since dealing with missing data in an "old" version of a document would still be annoying. I don't know of any good solutions to address how data evolves over time, regardless of technology.
IME the best way to do it is to build your system on stream transformation (i.e. Kafka) and then you can just produce the new representation in parallel, wait for it to catch up, migrate the readers over gradually and then eventually stop producing the old representation. That tends to be what you end up doing with a traditional RDBMS too, but if you're using something like Kafka then the pieces that you use are more normal parts of your workflow so it's less error-prone.
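A minimal sketch of that shape with the confluent-kafka Python client; topic names and the reshaping step are invented for illustration:

```python
# Sketch: consume the existing topic and produce the new representation to a
# new topic in parallel, then move readers over at their own pace.
# Topic names, group id and the transform are hypothetical.
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "accounts-v2-migrator",
    "auto.offset.reset": "earliest",  # start from the beginning of the old topic
})
consumer.subscribe(["accounts-v1"])
producer = Producer({"bootstrap.servers": "localhost:9092"})

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    old = json.loads(msg.value())
    new = {**old, "display_name": old.get("legacy_name")}  # hypothetical reshaping
    producer.produce("accounts-v2", key=msg.key(), value=json.dumps(new))
    producer.poll(0)  # serve delivery callbacks, keep the send queue drained
```

Once the new topic has caught up, readers can be switched over gradually; when nothing reads the old representation any more, the old producer path is retired.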
> It seems like this problem affects most distributed systems and i'm not sure how to address it, short of making each new data entry reference the previous state, like CouchDB does with revisions ( https://docs.couchdb.org/en/stable/intro/api.html#revisions ) and even that won't always help.
There are two approaches that I've known to work: (1) actual multiple concurrent versions as you say, with vector clocks or equivalent, forcing the clients to resolve conflicts if you're not using CRDTs (Riak was the best version of this approach); (2) having a clear shard key and allowing each partition to have its own "owner", making it clear what you do and don't guarantee across partitions, e.g. Kafka.
> Partially agreed. SQL is pretty reasonable for what it does, despite its dialects being somewhat inconsistent, many of the procedural extensions being clunky and most of the in-database-processing-heavy systems that I've encountered being a nightmare from a debugging and logging perspective, though I guess that's mostly the fault of the tooling surrounding them.
I wasn't talking about the fancy analytics so much as just the basic data model - e.g. having a collection-valued column is just way harder than it should be. Everything being nullable everywhere is also a significant pain.
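To make the collection-column point concrete, a hedged SQLAlchemy sketch: a Postgres-specific array column next to the portable join-table workaround (all names invented):

```python
# Illustration of the collection-column complaint: PostgreSQL can store an
# array directly, but the portable relational answer is an extra join table
# and an extra join (or query) for every read. All names are hypothetical.
from sqlalchemy import Column, Integer, String, ForeignKey
from sqlalchemy.dialects.postgresql import ARRAY
from sqlalchemy.orm import declarative_base

Base = declarative_base()

# Postgres-specific: the collection lives in the row itself.
class ArticlePg(Base):
    __tablename__ = "articles_pg"
    id = Column(Integer, primary_key=True)
    tags = Column(ARRAY(String))

# Portable SQL: the same data needs a second table and a foreign key.
class Article(Base):
    __tablename__ = "articles"
    id = Column(Integer, primary_key=True)

class ArticleTag(Base):
    __tablename__ = "article_tags"
    id = Column(Integer, primary_key=True)
    article_id = Column(Integer, ForeignKey("articles.id"), nullable=False)
    tag = Column(String, nullable=False)
```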