This article is informative. I have found that databases in general tend to be less sexy than the front-end apps...especially with the recent cohort of devs. As an old bastard, I would pass on one thing: Realize that any reasonably used database will likely outlast the applications leveraging it. This is especially true the bigger it gets, and the longer it stays in production.

That said, if you are influencing the design of a database, imagine what someone looking at it years later might want to know when they have to rip all the data out into some other store. Having migrated many legacy systems, I tend to sleep better when I know the data is well-structured and easy to normalize. In those cases, I really don't care so much about the apps. If I can sort out (haha) the data, I worry less about the new apps I need to design. I have been known to bury documentation in for-purpose tables...that way I know that info won't be lost. Export the schema regularly, version it, check it in somewhere.

And, if you can, please, limit the use of anything that can hold a NULL. Not every RDBMS handles NULL the same way. Big old databases live a looooong time.
"Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowchart; it'll be obvious."
-- Fred Brooks, The Mythical Man Month (1975)
I've taken out duplicated combinations where different teams wanted the same things over time, but never looked up what others had done before and decided to spin their own...
I would be having words with the DBA about this. If there is a DBA. No DBA I would hire would allow this silliness. At least not in the actual application schema.
> Realize that any reasonably used database will likely outlast the applications leveraging it.
I love this statement. It's true too, having seen a decades-old database that needed to be converted to Postgres. The old application was going to be thrown away, but the data was still relevant :).
About a decade ago I worked for an insurance company. It was an offshoot that was spun out of another insurance company from another state, which itself was decades old. As best as I could infer from my vantage point, my expertise at the time, and the spare time I was willing to invest in the matter, the database schema and a good chunk of the core data tables were first created in the late-80s on a mainframe and had outlived 4 or 5 application rewrites and (at least) two SQL variant migrations. I'm hand-waving exact details because nobody from the original company or that time period was still around even prior to the corporate split, and so there was nobody who could answer history questions in detail, but that's also a testament to how persistent data can be. There was one developer from the parent company they slapped with golden handcuffs who knew where most of the bodies were hidden in that software stack, which enabled decent productivity, but even she was lacking a solid 15 years of first-hand experience of its inception. To the best of my knowledge that database is still in use today.
Databases in heavy use will not just outlast your application, they have a strong chance of outlasting your career and they very well may outlast you as a person.
I think this is and will continue to be a common use case. For these applications, I'm very thankful the data was still stuck in a crusty old relational database for me to build a new application on top of.
It's going to be interesting when this same problem occurs years from now, when people are trying to reverse-engineer schemas from NoSQL databases, or when the data becomes difficult to extract.
The only sticking point is when business logic is put into stored procedures. On one hand if you're building an app on top of it, there's a temptation to extract and optimize that logic in your new back-end. On the other hand, it is kind of nice to even have it at all should the legacy app go poof.
This is a double-edged sword. I have seen massive business logic baked into stored procedures...so much so, that the applications themselves are rather slim. If this stored procedure code is properly versioned and otherwise managed, this is not entirely bad. If the data model is sound, I don't worry that much...stored procs vs 100KLOC of Java code? I can tell you what is easier to migrate. The other side of it is that stored procedures can also serve as an API into the actual database. I built a system once (90's) where the developers never accessed the actual database tables directly. They always called stored procedures. The advantage here was that we could tune the snot out of the database without a code change/build/deploy. It also allowed the DBA some control over poorly written, runaway queries. YMMV. I think today I probably would try to keep the database as vanilla as possible.
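To make the stored-proc-as-API idea concrete, a minimal sketch (Postgres syntax; the table, procedure and column names are all made up):

    -- Callers never touch the accounts table directly; they CALL this.
    CREATE PROCEDURE transfer_funds(src bigint, dst bigint, amount numeric)
    LANGUAGE sql
    AS $$
        UPDATE accounts SET balance = balance - amount WHERE account_id = src;
        UPDATE accounts SET balance = balance + amount WHERE account_id = dst;
    $$;

    CALL transfer_funds(42, 43, 100.00);

The body can be re-tuned or rewritten by the DBA without any application code change/build/deploy, which is exactly the advantage described above.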
> YMMV. I think today I probably would try to keep the database as vanilla as possible.
Sure, YMMV.
In any non-trivial dataset a lot of fields are effectively computed. For example, merged entries: in order to get correct and whole data, one needs to consult some merge mapping, which can be easy to forget and tricky to get right - you have more than one relational identifier. This is not strictly business logic, but rather data assembly logic.
Similarly, a value can easily be spread over multiple fields (and tables), and it is crucial from a data integrity standpoint to always update them in tandem where applicable. Again, this is very easy to screw up in client code, because the hidden relationship can be non-obvious. On one hand this is business logic; on the other hand, violating this implicit relationship will result in non-agreeing data. This occurs, for example, when data represents parallel states.
Effectively, stored code acts as a kind of gateway API stored alongside the data. Sure, some peculiarities can be implemented with functions and triggers, but IMO those are just different sides of the same coin. In the end it really depends on the dataset and what it represents.
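As one hedged sketch of the "update in tandem" case: a trigger keeping a derived field in agreement with its source rows, so client code can't forget the hidden relationship (Postgres; the orders/order_lines tables are hypothetical, and a real version would handle DELETE too):

    CREATE FUNCTION refresh_order_total() RETURNS trigger
    LANGUAGE plpgsql AS $$
    BEGIN
        -- Recompute the denormalized total whenever a line item changes.
        UPDATE orders
        SET total = (SELECT COALESCE(SUM(amount), 0)
                     FROM order_lines
                     WHERE order_id = NEW.order_id)
        WHERE order_id = NEW.order_id;
        RETURN NEW;
    END $$;

    CREATE TRIGGER order_lines_total
    AFTER INSERT OR UPDATE ON order_lines
    FOR EACH ROW EXECUTE FUNCTION refresh_order_total();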
Database stored code is a tool. Used appropriately it solves problems; used inappropriately it causes problems. If I were to design a database today, I too would try to make do without stored code, but I would not try to twist the data model so that it fits the relational model of RDBMSes.
Also an old timer. I’ve gone from mostly complex app and no DB logic to mostly heavily optimised DB and lots of procs. They protect the DB, make carefully crafted interfaces available and allow changes to the implementation.
Except for, e.g., infinitely scalable cloud data stores like Google's, or for ML where it's just massive data and you need a dumb-ish store of many GB of Parquet.
I share this sentiment. The apps will come and go, the real value is in the data. If the database can cover its ass, I am less concerned about ham-fisted developers randomly messing it up with ill-conceived DML. It's not that they are malicious...it just happens. I have seen devs that code in Erlang, Haskell and even Assembly...run in terror at doing some SQL. It's weird. Trust but verify. And hire persnickety, passionate DBAs.
If I publish a stored proc as an API, it leaves me free (within reason) to alter the actual SQL supporting it. Like modern API design, as long as you don't remove functionality, then change it as required.
A colleague of mine has a very strong opinion that any candidate who mentions using stored procedure in an interview without immediately disparaging it is an immediate no-hire. I sometimes wonder what kind of experience formed that opinion. (I personally never worked at a place with significant use of stored procedures.)
Some asshole DBA probably hurt his feelings way back.
If you have DBAs with organizational power and their shit together, they tend to ask awkward questions and tell you the baby is ugly. It’s easy for developers to do stupid shit and blame solar flares or whatever, but stored procedures are the DBAs realm, and the DBA knows who gets blamed when it blows up.
Like anything, there’s ups and downs. But if you can commit to a DBMS platform for the life of the app, there are compelling reasons to use it, even if it pisses off the devs.
Your colleague is dumb. Stored procedures have advantages and disadvantages. For certain data processing operations it is much better to perform everything inside the database than to transfer large amounts of data back and forth to some external application.
But I have also seen organizations with the policy that any operation touching base tables should be encapsulated in a stored procedure. This makes development extremely cumbersome, especially if some DBA is the gatekeeper for the stored procedures. Something like this might have burned your colleague.
> Stored procedures have the downside that they often work only in one vendor's database.
Doesn't really matter, because no one is writing database-agnostic SQL (unless that's part of your product). Any non-trivial implementation is going to require the use of proprietary SQL.
It has always amused me when developers try to make their application's database layer vendor-agnostic. Not only does that restrict you to the most vanilla dialect of SQL imaginable, it's also a pointless exercise generally because it is orders of magnitude more likely that your app will be replaced and the database kept than the other way round.
It was much more important when software products were sold to companies that ran their own databases. You didn't have the luxury of choosing which database your app would run on, so half the job of a software dev was making it run on many different DBs so you didn't lose the sale. It's one of the many reasons SaaS became the dominant way to sell software.
Once you control the deployment environment it largely makes very little sense to spend much time on db agnosticism. But that’s also true of many other things that have left vestigial forms in our software today such as IOC containers, pluggable log libraries, configuration files instead of configuration code, etc.
Even if you think you use database-agnostic SQL, unless you actually run tests against multiple DBMSes you cannot be sure. One can unintentionally write code which depends on some DB-specific behavior; Hyrum's Law [1] works for databases too.
It's not only about standards, but also about tools to check that you are actually following them. It is easy to write C code which will compile with one compiler but not with another. That's why, if compatibility is desired, it is better to compile code with different compilers in CI.
In theory we can use a tool which tells you if your SQL is ANSI compliant, but this would not tell you whether your code expects only standard-compliant behavior from the database. E.g., some RDBMSes are flexible in which datetime string formats they accept and some are not. SQL can be standards-compliant but code can still fail with one DB and work with another.
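A made-up illustration of the datetime point: both statements look harmless, but only the first is reliably portable:

    -- Unambiguous ISO 8601: accepted by effectively every RDBMS.
    INSERT INTO events (occurred_at) VALUES ('2024-01-15 09:30:00');

    -- Depends on server locale/session settings: January 2nd on one
    -- system, February 1st on another, an error on a third.
    INSERT INTO events (occurred_at) VALUES ('01/02/2024');

Both are "standard-looking" SQL text; the behavioral difference lives in the database, which is exactly the Hyrum's Law problem.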
This is not to say that standards are useless. It is much easier to port an application from one DB to another if the developer was trying to use only ANSI SQL. Or you can write an application which works with multiple RDBMSes without modification. But as with almost everything else - if you haven't tested it, you cannot be sure that it works.
The NULL issue is so true. We migrated a large database from Oracle to Postgres. It took 2 years. By far and away the biggest issue was rewriting queries to account for the (correct) way Postgres handles NULLs versus how Oracle does it.
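For anyone who hasn't hit this one, the classic case: Oracle historically treats the empty string as NULL, while Postgres follows the standard and keeps them distinct. Against a hypothetical table:

    INSERT INTO customers (name, note) VALUES ('Acme', '');

    -- Postgres: returns the row (note is an empty string, not NULL).
    -- Oracle:   returns nothing ('' was stored as NULL).
    SELECT * FROM customers WHERE note = '';

    -- On Oracle, this is the query that finds the row instead.
    SELECT * FROM customers WHERE note IS NULL;

Multiply that by a few thousand queries and two years sounds about right.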
Also, in my experience, the database is almost always the main cause of any performance issues. I would much rather hire someone who is very good at making the database perform well than making the front end perform well. If you are seeking to be a full stack developer, devote much more time to the database layer than anything else.
>the database is almost always the main cause of any performance issues
I would be careful with the term "cause". There is a symbiotic relationship between the application and the database. Or, if talking to a DBA...a database and its applications. Most databases can store any set of arbitrary information...but how the data is stored (read: structure) must take into account how it is to be used. When the database designer is told up-front (by the app dev team), considerations can be made to optimize performance along whatever vector is most desired (e.g. read speed, write speed, consistency, concurrency, etc). Most database performance issues result when these considerations are left out. Related: Just because a query works (i.e., returns the right data) does not mean it's the best query.
Ok, story time. Worked on a near-real-time system that processed millions of rows an hour. Devs routinely did a "select *" and pulled 100K rows into an array on a local client. And they sorted it there. The DBAs never looked into these because a) no resource issue, b) no perf issues and (most relevant) c) no one complained. One day this silliness was identified, and a quick meeting with the devs and DBA resulted. SQL 101. As if by magic, the client app was suddenly 3X more performant. The devs were applauded and there was much happiness in user-land. Only the SVP knew the truth...we left the sun shining on the developers. It was a rough week.
> As an old bastard, I would pass on one thing: Realize that any reasonably used database will likely outlast the applications leveraging it.
I’ve been working with and on databases for a long, long time, and I’ve even written about things I think people should know about if they want to do this, yet I never came up with such great insight. This is so true it should be engraved somewhere. Hats off!
Thanks. Scar tissue sometimes breeds insight. In further conversations on this phenomenon, I would argue that "long lived databases" are not so as a result of brilliant design. Rather, it happens because the database itself is neglected and largely misunderstood, and gets less investment. And they live on and on...managers come and go...no investment. And then, years later, some poor bastard is stuck with a hideous mess that can't go anywhere. Don't let this happen to you.
The article left out one of the most fundamental topics of databases--clustering of data in storage is everything. Examples:
1. If you store data in rows, it's quite fast to insert/update/delete individual rows. Moreover, it's easy to do it concurrently. However, reads can be very slow, because you read the entire table even if you scan a single column. That's why OLAP databases use column storage.
2. If you insert data into the table in sorted order, reading ranges based on the sort key(s) is very fast. On the other hand, inserts may spray data over the entire table, (eventually) forcing writes to all blocks, which is very slow. That's why many OLTP databases use heap (unsorted) row organization. (A sketch of both tradeoffs follows below.)
In small databases you don't notice the differences, but they become dominant as volume increases. I believe this fact alone explains a lot of the proliferation of DBMS types as enterprise datasets have grown larger.
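A rough T-SQL sketch of the two points above (table and index names are made up; other engines spell these differently):

    -- 1. Heap: no imposed order, cheap concurrent inserts, but scanning
    --    one column still drags every row's full width off disk.
    CREATE TABLE sales_heap (id INT, sold_at DATETIME2, region VARCHAR(20), amount DECIMAL(12,2));

    -- 2. Clustered on sold_at: date-range reads are fast, but inserts
    --    must land in key order and can split pages all over the table.
    CREATE TABLE sales_sorted (id INT, sold_at DATETIME2, region VARCHAR(20), amount DECIMAL(12,2));
    CREATE CLUSTERED INDEX cix_sales_sold_at ON sales_sorted (sold_at);

    -- Column storage for OLAP: a scan of amount alone touches a fraction
    -- of the I/O the row-organized tables need.
    CREATE TABLE sales_olap (id INT, sold_at DATETIME2, region VARCHAR(20), amount DECIMAL(12,2));
    CREATE CLUSTERED COLUMNSTORE INDEX ccix_sales ON sales_olap;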
At multiple companies I worked for, they concluded that backward schema migrations (rollbacks) are not worth the risks and testing overhead. It's usually quick enough to issue a hotfix.
With a rollback, you always risk losing data - remember, you're doing it when something didn't go as you expected. What are the odds the rollback will break something further?
These were all Postgres shops, so schema changes run within a transaction - one that can be rolled back safely if it fails in the middle.
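The pattern, for anyone who hasn't seen it (illustrative table and column names):

    BEGIN;
    ALTER TABLE users ADD COLUMN last_seen timestamptz;
    CREATE INDEX users_last_seen_idx ON users (last_seen);
    -- If anything above errored out, nothing is half-applied:
    COMMIT;  -- or ROLLBACK; to abandon the whole change

Many Postgres migration tools wrap each migration in exactly this kind of transaction.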
This is one reason that ORMs which wish to own the database schema make me uncomfortable. How much fun is that schema going to be years down the road when that ORM is out of fashion, but you still need an app working with that data? Some are better than others at doing things in a sane way, of course.
Or even worse, when the ORM is written in a programming language your organisation no longer uses and is part of a codebase that is no longer under development.
I agree. I have some kind of design hierarchy. Database -> Architecture -> Services for outside consumers -> Backend -> Frontend. Things coming first must be designed more thoroughly as they're likely to live longer. Proper database design is paramount. Spend as much time as necessary. Iterate before going live as long as necessary to ensure that design is sound. Because it's so much harder to change database later. Trivial changes often require huge efforts.
Rule 5. Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.
> And, if you can, please, limit the use of anything that can hold a NULL.
I'm curious: what's the alternative to NULL? I'm struggling to think of a database where NULL wouldn't be super useful. It feels like NULL as a concept is almost required, but I think you're suggesting that's a faulty assumption.
The article probably means: define as non-nullable anything which can be non-nullable. Unfortunately SQL defaults to nullable, so there is a tendency to define too many columns as nullable. Normalization can also reduce the need for nullable columns in base tables (but you will get them back if you perform an outer join, so it is not a panacea).
But if a column truly has unknown values, NULLs are the best way to represent it. It is sometimes suggested to use "sentinel values" like the empty string or -1 to represent missing values, but IMHO this is much worse than NULLs, since these will be treated as regular values by operators. When you have missing values, you want three-valued logic.
Notice I said "limit", and not "eliminate". The concept of NULLs in an RDBMS has been discussed and argued for decades. Three-valued logic is generally not well understood, and as such, it's usually skipped over. Binary logic is easy...0 or 1. It's there, or not. OFF/ON. Three-valued logic introduces a third state: "unknown". It really means that there is no meaningful answer...not yet, anyway. In simple terms, NULL in an RDBMS is dangerously often equated to 0 (zero) in numeric fields. Or "space" in character fields. Neither is true. On occasion, these might behave as such...but you are playing with fire here.

What's worse, you can't compare a NULL to a NULL. NULL != NULL. Which is why you often see the "IS NULL" operator used in DML for such things. What it boils down to is that your applications need to pay careful attention when digging around (read: joining) tables with NULLs. Additional code logic is often required to ensure that things work the way you expect them to when NULLs are involved. Formal primary keys cannot be NULL (this is enforced by the RDBMS), but that does not stop ad-hoc clever queries from including NULL columns as part of the "where..." clause.

So what to do? You can tell your DBA to ensure that all columns are NOT NULL. This really tightens things down, and makes some operations a bit more sane. However, if a column value is actually not known (yet!) then one is forced to populate it with data that may not be correct/relevant. These are often called "sentinel" values and can cause a mess of their own. There are use cases where an RDBMS schema with everything as NOT NULL can make sense. In my experience, databases whose data is never (directly) seen/input by actual people can work. When a human sees a field with "placeholder value" instead of just blank space...it is uncomfortable.

My advice is to really understand why something might be NULL, and don't blindly add a mess of columns to a table as NULL because it's easy. Remember, that shit will live forever. Google around for "three valued logic" and start down the rabbit hole. Long-term (think: migrating from one RDBMS implementation to another) you will absolutely find that NULLs don't behave the same. Various operations may or may not be consistent from one to another...and this will break your apps. The key (haha) relationships modeled in your schema...if you strip all the unimportant stuff away...should avoid NULL.

The flip side of this is to do a code scan (app side) and search for "IS NULL" and "IS NOT NULL" in the embedded SQL. Especially when there are a lot of "and ____ IS NOT NULL and ____ IS NOT NULL" and so forth. This will indicate those parts of the database that are "hot spots" for NULL issues. I have seen SQL where 80% of the DML is taken up with NULL handling of some kind.
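To make the traps concrete, a few Postgres-flavored examples (the orders/excluded_regions tables are hypothetical):

    SELECT NULL = NULL;    -- yields NULL (unknown), not TRUE
    SELECT NULL <> NULL;   -- also NULL, not TRUE

    -- Matches nothing, silently: a comparison with NULL is never TRUE.
    SELECT * FROM orders WHERE region = NULL;

    -- What was actually meant:
    SELECT * FROM orders WHERE region IS NULL;

    -- Classic trap: NOT IN over a set containing a NULL matches no rows at all.
    SELECT * FROM orders
    WHERE region NOT IN (SELECT region FROM excluded_regions);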
Great post. Also highly recommend Designing Data-Intensive Applications by Martin Kleppmann (https://www.amazon.com/Designing-Data-Intensive-Applications...). The sections on "Storage and Retrieval", "Replication", "Partitioning" and "Transactions" really opened up my eyes!
I really like how he (Martin Kleppmann) starts the book with a primitive data structure for constructing a database design, and then evolves the system slowly and describes the various trade-offs in building a database from the ground up.
I have not read it personally, but I've seen 'How Query Engines Work' highly recommended several times before. I have a procrastinatory tab open to check it out some day.
Seconding Database Internals - it's not just about "Internals of a database", as part 2 gets nitty gritty with the general problems of distributed systems, consensus, consistency, availability, etc. etc.
> RDBMS can solve pretty much every data storage/retrieval problem you have.
Except the most important problem: A pleasant API. Which is, no doubt, why 95% of those considering something other than an RDBMS are making such considerations.
RDBMS can have pleasant APIs. It is not a fundamental limitation. We have built layers upon layers upon layers of abstraction over popular RDBMSes to provide nice APIs and they work well enough. But those additional layers come with a lot of added complexity and undesirable dependencies that most would prefer to see live in the DBMS itself instead.
At least among the RDBMSes we've heard of, there does not seem to be much interest in improving the APIs at the service level to make them more compelling to use natively like alternative offerings outside of the relational space have done.
I've honestly never understood why people have such a distaste for SQL. SQL and Linux/Unix have been the biggest constants of my entire programming career to this point (20ish years). I always know I can count on them.
I love the data model of RDBMS / SQL. I hate SQL _the language_. Verbose, irregular, attempting to mimic English and thus making it unintuitive and next to impossible to remember. 80% of my lookups into reference documentation is about syntax for things I don't use that often.
Examples: GRANT TO, REVOKE FROM, DENY TO (oh yes, what's the difference between REVOKE and DENY? and did you know that you can REVOKE a DENY?), arbitrary requirements for punctuation (e.g., = in BACKUP DATABASE AdventureWorks2012 TO DISK = 'X:\SQLServerBackups\AdventureWorks1.bak'), dubious context-sensitive "magical identifiers" (e.g., SELECT DATEPART(year, @t)), non-composability (how do you dynamically give a part specification to DATEPART?), etc, etc, etc.
It's possible to write a novel about just how unstructured and irregular "structured" query language is.
All of your examples are vendor-specific extensions to SQL. None of that is part of the standard. So you are effectively hating a specific implementation, which is not that logical.
Honestly, I think it's because a lot of folks go their entire careers avoiding set-based thinking. Many of these folks are talented programmers so I would assume they'd become quite decent at SQL if they devoted time to it. I'm speaking more on the DML side than the DDL side here.
In my experience <20% of developers are good enough to be dangerous with SQL, and maybe 5% what I'd consider adept.
The rest range from "SELECT * is as far as I'll go; where's the ORM" to "why the hell are you using cursors to aggregate"
SQL is powerful, elegant, and reliable. With modern DB support of JSON, ARRAY, statistical functions, and much more, SQL is probably the #1 most underutilized/improperly leveraged tool today. SQL-killers have been coming out for 40 years now and (for its domain!) SQL's lead is pulling farther away if anything.
*yes there are some questionable implementations, so please replace "SQL" with "PostgreSQL" if nonstandard SQL implementations have caused nightmares for you in the past.
The reason isn’t set based thinking avoidance. I think it is because fundamentally we want to work with smart objects and not rows of data in most cases.
More often than not that's a good example of set-based thinking avoidance. The application graveyard is full of projects that felt their transactional data was too sophisticated for "dumb rows" and reinvented an inefficient relational model. I'd love to hear more about these "smart objects" - they're usually not that smart and typically can be represented relationally.
Even for use-cases like graph based models you still find Twitter and Facebook using MySQL to build the graph on top of. It’s simply heavily abstracted for the majority of engineers at those companies (where I'd wager the <20% proficiency in set-based thinking holds true) but it still fundamentally relies on SQL.
I'm not too well-versed in the academic/mathematical background of SQL, other than to say it's closer to relational algebra than set theory, and the resource to read is probably the 1970 Codd paper: https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf
SQL does implement the set operations of UNION, INTERSECT, and EXCEPT, but I meant "thinking in sets" more colloquially.
When it comes to prototyping, I'm not going to fuck with something like Java-- I'm going to reach for Python. If I don't know what I'm doing to begin with, I don't need something telling me all the ways in which I'm doing it wrong.
Same goes for SQL/NoSQL. I loosely know what I need to model and may revise it arbitrarily. SQL does not lend itself to this. NoSQL was designed for it.
NoSQL is the "fuck you dad you can't tell me what to do" of the database world. SQL is your annoying dad that tries to institute curfews and won't let you go to underage drinking parties. In the end, it's the latter you're going to be calling from a holding cell, but there's a lot more fun you can have with the former.
In the case of SQL/NoSQL I think it is the other way around. In order to use a NoSQL database to its strengths, you have to know all access paths up front [1], because you are designing your table structure with all the queries in mind. In SQL, on the other hand, you can always go for a normalized structure, and you are pretty safe and prepared for most requirements that are coming in the future.
Additionally, I think drafting a DDL schema is a great way to learn about a new application domain. It forces me to ask hard questions. And that improves my understanding of the domain.
I guess that is similar to some people who like to prototype using Haskell type signatures.
[1] Unless you mean specifically a document store without schemas.
> If I don't know what I'm doing to begin with, I don't need something telling me all the ways in which I'm doing it wrong.
This is unrelated to this conversation, but this is my main beef with Rust. I love Rust (like a lot), but it's just not good for prototyping. End of non-pertinent rant.
SQL is great at what it is designed to do, and absolutely horrible for anything else. Sometimes I have had to use tools that only allow (a subset of) SQL for querying data (especially BI tools). Doing regex date validation on JSON derived from a string, in an SQL dialect without function definitions, is horrendous. These kinds of "why the f do I have to use SQL for this" moments happened surprisingly often to me working as a data analyst.
But then I'm trying to do relatively simple things in Pandas or R, like changing column values based on multiple field value conditions, and it is a struggle requiring at least 17 brackets, while in SQL it would be a simple statement that anyone can understand after reading it once.
>Doing regex date validation on JSON derived from a string, in an SQL dialect without function definitions, is horrendous.
Doing regex is horrendous, but doing it in SQL on a modern database is no more difficult than in a full-fledged programming language. Most modern DBs have strong JSON support and built-in regex functions.
The problem you are describing is probably rooted in the specific SQL dialect you had to use. Selecting from inside JSON strings and matching regular expressions should be a no-brainer. And it is straightforward in PostgreSQL, e.g.:
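Something along these lines, assuming a hypothetical events table with a raw JSON text column named payload:

    SELECT payload::jsonb ->> 'order_date' AS order_date
    FROM events
    WHERE payload::jsonb ->> 'order_date' ~ '^\d{4}-\d{2}-\d{2}$';

->> pulls the field out as text, and ~ is the built-in regex match.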
> simple things in Pandas or R, like changing column values based on multiple field value conditions, and it is a struggle requiring at least 17 brackets
With dplyr, wouldn’t this be mutate(x = case_when(… ~ y, TRUE ~ x)) or the same but with ifelse?
I haven't used much R, but how would you do something like:
    CASE
        WHEN SUM(daily_revenue) OVER (PARTITION BY department, TRIM(SUBSTR(region, 5)) IN ('North','West','Misc')) >
             AVG(revenue) OVER (ORDER BY sale_time ASC ROWS BETWEEN 28 PRECEDING AND CURRENT ROW)
             AND NOT COALESCE(had_prev_month_party, FALSE)
        THEN pizza_party_points + 5
        WHEN <above> AND had_prev_month_party THEN pizza_party_points + 3
        WHEN MIN(sale_time) OVER (PARTITION BY department) = DATE_TRUNC('month', current_date) THEN 5
        ELSE GREATEST(pizza_party_points - 1, 0)
    END AS pizza_party_performance_points_current
This example may be a bit esoteric, but it actually draws from lots of real-world cases (comparing a partitioned sum with an overall average, checking conditions on other columns, messy data parsing and manipulation, the implicit waterfall nature of CASE WHEN vs. explicit elseif, etc.).
SQL is also a big constant of my programming career. I know it can do everything I need, but sometimes it is frustrating how its limitations make some simple things complicated.
For example, one of the biggest gripes I have with SQL is the fact that the table is the only output format; it doesn't allow outputting data with different cardinalities in a single query. The internal relational model of an RDBMS is extremely rich and powerful, but when extracting that data using SQL you have to leave all the richness behind and flatten the data into a simple dumb table. So as a result we have to do much back and forth with separate queries based on the results of previous ones (or even ugly hacks like concatenating higher-cardinality data into a single field to then re-separate it in the application). A suitable successor for SQL should have a way to output a subset of the relational model with the relational part intact.
I'm not sure a successor to SQL is even necessary; rather, there would be benefit in some higher-level interfaces provided by the RDBMS that complement SQL, to cut down on the egregious application boilerplate that becomes necessary to achieve common tasks using SQL directly.
SQL is essentially the assembly language of the database. That makes it very powerful, but sometimes you just want a language that gives you a "garbage collector" for free, while still being able to drop down to assembly when you need additional power.
There is no technical reason why an RDBMS can't support different modes for accessing data. We are just so used to putting that work into the application that it has become a hard sell to want to move it to the right place.
> For example, one of the biggest gripes I have with SQL is the fact that the table is the only output format, it doesn't allow outputting data with different cardinalities in a single query.
With MSSQL at least you can return multiple result sets. Not sure about other database vendors.
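The T-SQL shape of that, as a hedged sketch (procedure, tables and columns are invented): one round trip, two result sets of different cardinality:

    CREATE PROCEDURE dbo.GetOrderWithLines @OrderId INT AS
    BEGIN
        SELECT OrderId, CustomerId, PlacedAt
        FROM Orders WHERE OrderId = @OrderId;       -- one row

        SELECT LineNumber, Sku, Qty, Amount
        FROM OrderLines WHERE OrderId = @OrderId;   -- many rows
    END;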
SQL has some annoying quirks but on balance I am not sure there is the distaste for SQL that you claim. It's fine, even good in many circumstances, but also too low level for some tasks which pushes common problems onto application developers to solve over and over again or build up layers of dependencies to utilize someone else's solution. That's not the fault of SQL but the fault of the RDBMS.
Our layers of abstraction atop SQL, along with approaches taken by DBMSes outside of the common relational names, have shown that there is room for improvement within those tasks, improvement that could be done natively by the DBMS, and it does not have to come at the expense of losing SQL when you need to ask lower-level questions. Different tools for different jobs, but little will within the RDBMS space to explore those other tools.
I don’t think that people have a distaste for SQL. I think that large companies have a vested interest in trying to sell NoSQL solutions because they more tightly integrate into a “cloud” ecosystem. A database solves that problem domain too well; it’s not marketable.
The dealbreaker here is having to do this dance for every nested field on every entity, and having to write a separate backend query for each separate front-end use-case for a single entity.
It just isn’t feasible to write everything out when the schema has >30 entities, some with N:M relations, and when queries routinely go several levels deep to fetch dozens of joined fields at every level. The boilerplate overhead is too much.
A natively GraphQL database makes such queries an order of magnitude less verbose to write out, and all the queries can stay in the frontend (or can become persisted queries on a GraphQL ”middle-end” server).
I blame ORMs. ORMs are promoted by scare mongering novice developers away from learning SQL in the first place. I'm ashamed to say I fell for it for a few years. When I eventually learned SQL it was like a fog being lifted from my mind that I hadn't even noticed before.
I tried not to use an ORM for my last project, but I ended up rolling my own ORM for the application anyway, because it was easier to keep the structures in the web application tied to the database. How else do you keep the two consistent?
The irony is, to effectively use an ORM, you need to be able to debug the SQL it generates when it's not performing correctly (e.g. operating on a collection row by row). I like to use an ORM for row-based, transactional operations in an OLTP system, but usually look for a way to write straight SQL when doing OLAP-style reporting stuff. Both of which happen in almost every business.
With SQL you kind of have two options/extremes that are unpleasant in their own way.
You either model things in a very domain specific and classic fashion. Here you get the benefit of being quite declarative and ad-hoc queries are natural. Also your schema is stronger, as in it can catch more misuse by default. But this kind of schema tends to have _logical_ repetition and is so specific that change/evolution is quite painful, because every new addition or use-case needs a migration.
Or you model things very generically, more data driven than schema driven. You lose schema strength and you definitely lose sensible ad-hoc queries. But you gain flexibility and generality and can cover much more ground.
You can kind of get around this dichotomy with views, perhaps triggers and such. In an ideal world you'd want the former to be your views and the latter to be your foundational schema.
But now you get into another problem, which is that homogeneous tables are just _super_ rigid as result sets. There are plenty of very common types you cannot cover: for example tagged unions, any kind of even shallowly nested result (an extremely common use case), or multiple result groups in one query. All of these things usually mean you want multiple queries (in a read transaction), or you use non-SQL stuff like building JSON (super awkward).
If you can afford to use something like SQLite, then some of the concerns go away. The DB is right there so it's fine to query it repeatedly in small chunks.
I wonder if we're generally doing it wrong though, especially in web development. Shouldn't the backend code quite literally live on the database? I wish my backend language were a database query language first and a general-purpose language second, so to speak. Clojure and its Datalog flavors come close. But I'm thinking of something even more integrated and purpose-built.
I have, and I would recommend it, if it generated a Postgres schema from a GraphQL schema and not the other way around.
The advantages of GraphQL are its integration of backend and frontend workflows, and that won’t happen with PostGraphile: a frontend needing a schema change needs to go through the backend instead of the backend extending to meet the frontend in the middle.
That's also my reading. And I agree that SQL is a nice DSL yet way behind general programming languages we enjoy today.
I'm dealing with some pretty involved ETL at work these days, and it's really hard and ugly to do it in SQL to "stay in the DB" and keep things fast. It's tempting to read, transform in a high-level language (think pandas) then write back to the DB, but then you give up so much perf.
I would love another API than SQL, like what Spark is doing. You can keep a query optimizer, an engine to execute queries efficiently physically, etc., but it could be controlled with a nice general language API.
Another way to put it is that SQL became the standard and alternatives haven't been developed much in comparison. Imagine the ML world being stuck with Prolog or some DSL. It would be annoying to say the least. Instead people have a variety of APIs, and the popular and most developed ones actually use high-level languages to drive optimized ML engines. These convert function calls and chaining into efficient transformations that use memory and computing units (CPU, GPU, etc) efficiently. The same for DBs would be wonderful.
Yeah, a lot of RDBMS are adding JSON support, but the support is often a bit clunky to use and may be missing important features. If you're dealing with a bunch of semistructured APIs that return JSON natively, Mongo makes it really easy to just dump all that into a collection and then just add indices as needed.
I've found it's not just scale, but also down to query patterns across the data being stored.
I'm with you on using an RDBMS for almost everything, but worked on quite a few projects where alternatives were needed.
One involved a lot of analytics queries (aggregations, filters, grouping etc.) on ~100-200GB of data. No matter what we tried, we couldn't get enough performance from Postgres (column-based DBs / Parquet alternatives gave us 100x speedups for many queries).
Another was for storing ~100M rows of data in a table with ~70 columns or so of largely text based data. Workload was predominantly random reads of subsets of 1M rows and ~20 columns at a time. Performance was also very poor in Postgres/MySQL. We ended up using a key/value store, heavily compressing everything before storing, and got a 30x speedup compared to using an RDBMS using a far smaller instance host size.
I wouldn't call either of them massive scale, more just data with very specific query needs.
> Another was for storing ~100M rows of data in a table with ~70 columns or so of largely text based data. Workload was predominantly random reads of subsets of 1M rows and ~20 columns at a time.
Kimball's dimensional modelling helps a lot in cases like this, since probably there is a lot of repeated data in these columns.
Yeah, I think some of the problems are when both are needed on the same data, or when the use case changes over time.
E.g. our customers are stored in Postgres, so let's also log their actions there linked to the user table.
5 years on someone decides we need to run analytical queries across years of 200M logged actions, joined with other data in the DB.
So now we either have to live with horrible performance, migrate the logs to something suitable for OLAP (and lose all the benefits of a solid RDBMS), or have some syncing/export process to duplicate somewhere suitable for querying.
The other day I said to a junior dev: when you start planning a locking scheme to handle concurrency in your file-based system, it is time to swap to a DB.
Similarly, people don't use object modeling / entity-relationship diagrams anymore.
Every day, I see people struggling with problems that would be easy to understand if you had one. You don't even need to have an RDBMS; they are good just for modeling how things are related to each other.
There are circumstances where you really don't know the shape of the data, especially when prototyping for proof of concept purposes, but usually not understanding the shape of your data is something that you should fix up-front as it indicates you don't actually understand the problem you are trying to solve.
More often than not it is worth some time thinking and planning to work out at least the core requirements in that area, to save yourself a lot of refactoring (or throwing away and restarting) later, and potentially hitting bugs in production that a relational DB with well-defined constraints could have saved you from while still in dev.
Programming is brilliant. Many weeks of it sometimes save you whole hours of up-front design work.
> usually not understanding the shape of your data is something that you should fix up-front as it indicates you don't actually understand the problem you are trying to solve.
This is a good point and probably correct often enough, but I also think not understanding the entire problem you are solving is not only common, but in fact necessary to most early-stage velocity. There is need to iterate and adapt frequently, sometimes as part of your go-to-market strategy, in order to fully understand the problem space.
> a relational DB with well-defined constraints could have saved you from while still in dev
This presumes that systems built on top of non-RDBMSes are incapable of enforcing similar constraints. This has not been my experience personally. But it's possible I don't understand your meaning of constraints in this context. I assumed it to mean, for instance, something like schemas, which are fairly common now in the NoSQL world. Curious what other constraints you were referencing?
> There is need to iterate and adapt frequently, sometimes as part of your go-to-market strategy, in order to fully understand the problem space.
If you're pivoting so hard that your SQL schema breaks, how is a schemaless system going to help you? You'll still have to either throw out your old data (easy in both cases) or figure out a way to map old records onto new semantics (hard in both cases).
I agree with GP that this is a central problem to solve, not something to figure out _after_ you write software. Build your house on stone.
>If you're pivoting so hard that your SQL schema breaks, how is a schemaless system going to help you? You'll still have to either throw out your old data (easy in both cases) or figure out a way to map old records onto new semantics (hard in both cases).
I agree with your comment that it's a central problem to solve, and that both options - throwing out data or mapping old records onto new semantics - are a choice both stacks need to make. I don't agree that it's always possible to solve entirely up front though.
In my experience, it has been less so about whether the storage engine is schemaless or not, even many modern nosql stacks now ship with schemas (e.g. MongoDB). I think the differentiation I make between these platforms is mostly around APIs. Expressive, flexible semantics that (in theory) let you move quickly.
As an aside, I also think the differentiation between all these systems is largely unimpactful for most software engineers. And the choice often made is one of qualitative/subjective analysis of dev ergonomics etc. At scale there are certainly implementation details that begin to disproportionately impact the way you write software, sometimes prohibitively so, but most folks aren't in that territory.
Admittedly, my experience with MongoDB and Cassandra has gained some rust over the last decade, but what makes you say such NoSQL databases have expressive APIs? Compared to PostgreSQL they have minuscule query languages, and it is very hard, if at all possible, to express constraints. And constraints, sometimes self-imposed and sometimes not, are what make projects successful, even startups. Many startups try to find this one little niche they can dominate. That is a self-imposed constraint. People tend to think freedom makes them creative, productive, and inventive, while in fact the opposite is the truth. By the opposite of freedom I mean carefully selected constraints, not oppression.
The confusion there is due to the fact that non-relational DBMS (particularly when referred to as NoSQL) can mean several different things.
In this context I was replying to a comment about not knowing the shape of your data which implies that person was thinking about solutions that are specifically described as schemaless, which is what a lot of people assume (in my experience) if you say non-relational or noSQL.
That is the sort of constraints I was meaning: primary & unique keys and foreign keys for enforcing referential integrity and other validity rules enforced at the storage level. There are times when you can't enforce these things immediately with good performance (significantly distributed data stores that need concurrent distributed writes for instance - but the need for those is less common for most developers than the big data hype salespeople might have you believe) so then you have to consider letting go of them (I would advise considering it very carefully).
Never? Many NoSQL stores are offering parity in many of the feature verticals that were historically the sole domain of RDBMS.
Mongo has always had semantics to support normalized data modeling, has schema support, and has had distributed multi-document ACID transactions since 2019 [1]. You don't have to use those features as they're opt-in, but they're there.
I know that full parity between the two isn't feasible, but to say they never can is a mischaracterization.
[1] Small addendum on this: Jepsen highlighted some issues with their implementation of snapshot isolation and some rightful gripes about poor default config settings and a wonky API (you need to specify snapshot read concern on all queries in conjunction with majority write concern, which isn't highlighted in some docs). But with the right config, their only throat-clearing was whether snapshot isolation was "full ACID", which would apply to Postgres as well given they use the same model.
What is the point of using MongoDB with multi-document ACID transactions? Enabling durability in MongoDB is usually costly enough that you can't find a performance benefit compared to Postgres. With JSONB support in PostgreSQL, I dare say, it can express anything that MongoDB can express with its data model and query language. That leaves scalability as the only possible advantage of MongoDB compared to PostgreSQL. And the scalability of MongoDB is rather restrictive, compared to e.g. Cassandra.
And I would never trust a database that has such a bad track record regarding durability as MongoDB, although I admit that PostgreSQL had theoretical problems there as well in the past.
I actually agree with you on the point about multi-document tx's; I wouldn't choose Mongo solely for that feature. It's nice to have, maybe, for the niche use case in your NoSQL workload where it's beneficial. But the point I was originally making was that NoSQL stacks are not fundamentally incompatible with most of the features or safety constraints offered by RDBMSes.
> And I would never trust a database that has such a bad track record, regarding durability as MongoDB
I can't comment on your own experience obviously, but I've been using mongo since 2011 in high throughput distributed systems and it's been mostly great (one of my current systems averages ~38 million docs per minute, operating currently at 5 9s of uptime).
Definitely some pain points initially in the transition to WiredTiger, but that largely was a positive move for the stack as a whole. Durability fires have not plagued my experience thus far, not to say they won't in the future of course.
As you noted, Postgres has had its own headaches as well. Finding out that all their literature claimed their transactions were serializable when they were in fact not serializable could be considered a mark against their record. But much like Mongo, they have been quick to address implementation bugs as they are found.
> I can't comment on your own experience obviously, but I've been using mongo since 2011 in high throughput distributed systems and it's been mostly great (one of my current systems averages ~38 million docs per minute, operating currently at 5 9s of uptime).
> Definitely some pain points initially in the transition to WiredTiger, but that largely was a positive move for the stack as a whole. Durability fires have not plagued my experience thus far, not to say they won't in the future of course.
Good to read that some are actually using MongoDB to their benefit. Indeed I have encountered problems with durability in the wild. Nothing that I would like to repeat. But as always, for a specific use case the answer is: it depends. As general advice for what to start a new project with, I would select PostgreSQL in 10 out of 10 cases, if a database server is actually required.
Contrary to what people seem to assume, you actually can change the schema of a database and migrate the existing data to the new schema. There's a learning curve, but it's doable.
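The usual shape of such a migration, sketched in Postgres (the column name and backfill value are invented): add the column loose, backfill the existing rows, then tighten the contract:

    ALTER TABLE customers ADD COLUMN country_code char(2);
    UPDATE customers SET country_code = 'US' WHERE country_code IS NULL;
    ALTER TABLE customers ALTER COLUMN country_code SET NOT NULL;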
If you go schema-less, you run into another problem: not knowing the past shape of your data. When you try to load old records (from previous years), you may find that they don't look like the ones you wrote recently. And, if your code was changed, it may fail to handle them.
This makes it hard to safely change code that handles stored data. You can avoid changing that code, you can accept breakage, or you can do a deep-dive research project before making a change.
If you have a schema, you have a contract about what the data looks like, and you can have guarantees that it follows that contract.
Maintaining backwards compatibility for reading old records in code is not hard. You can always rewrite all rows to the newer format if you want to remove the compat code, or if the structure changes in an incompatible way. It's pretty comparable to what you have to do to evolve the code/schema safely together.
Having a schema is much better for ad-hoc queries though, doubly so if your schemaless types aren't JSON (e.g. protobufs).
With Postgres, you can always just have a JSONB column for data whose shape you're unsure of. Personally, I'd rather start with Postgres and dump data into there and retain the powers of RDBMS for the future, rather than the other way around and end up finding out that I really would like to have features that come out of the box with relational databases.
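A minimal sketch of that starting point (names are illustrative):

    CREATE TABLE events (
        id          bigserial PRIMARY KEY,
        occurred_at timestamptz NOT NULL DEFAULT now(),
        payload     jsonb NOT NULL   -- the shape you're unsure of lives here
    );

    -- Index the document column once query patterns emerge.
    CREATE INDEX events_payload_idx ON events USING gin (payload);

    SELECT payload ->> 'user_id'
    FROM events
    WHERE payload @> '{"type": "signup"}';

You keep transactions, constraints, and joins against your structured tables, and you can promote JSON fields to real columns later.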
I think a valid reason for not choosing a relational database is if your business plan requires that you grow to be a $100B+ company with hundreds of millions of users. Otherwise, you will probably be fine with RDBMS, even if it will require some optimizing in the future.
Postgres' JSON implementation perfectly adheres to the JSON spec, which actually sucks if you need to support things like NaNs, Inf, etc. It's a good option, but it doesn't work for all datasets.
As long as you're willing to write a whole lot of data validation and constraint-checking scripts by hand in the future, plus ETL scripts for non-trivial analytical queries (depending on which NoSQL you chose, but if you chose it for perf this is usually a price you have to pay). And keep very rigorous track of the conceptual model of your data somewhere else, or simply don't care about its consistency when different parts of your not-schema have contradicting data (at which point, why are you even storing it?).
And that assumes you ruled out using JSON string column(s) as a dump for the uncertain parts, de-normalization and indexing, and the EAV schema as potential solutions to your problems.
The point is nothing is free, and you have to be sure it's a price you're willing to pay.
Are you ready to give up joins? To have your data modeled after the exact queries you're going to make? For your data to be duplicated across many places? Etc...
I think the tradeoff is similar to using a weakly typed vs strongly typed language. Strong typing is more up front effort but it will save you down the line because it's more predictable. Similarly, an RDBMS will require more up front planning and design and regular maintenance but that extra planning will save you more time down the line.
I find myself forced to model access patterns when choosing non-relational DBs. This often results in a much less flexible model if you didn't put a lot of thought into it. YMMV.
It is very frustrating to work with engineers who don't understand the nuances of RDBMSes and assume they can solve all the things. The small company I work for has 3B rows. We have a high write volume. Can you use an RDBMS to solve this? Sure, but it would be a terrible waste of engineering effort.
A "high write volume" requires fast disks, and billions of rows require large disks. Two simple requirements that are the same for any relational or less relational database.
What's interesting is query performance, and a RDBMS supports explicit control over indexing (usually including analyzing execution plans to find out which queries are going to work well). Where do you see "a terrible waste of engineering effort"?
Isn't this like saying you can solve every programming problem you have with <insert your favorite Turing-complete language here>? Of course you can, but aren't there any cases where the tradeoffs outweigh the benefits, even if it's about something selfish like ergonomics or, dare I say, fun?
I agree with you. However, conversely I don't see anything that proves him wrong. Databases are not like programming languages. There is a reason why we don't use punch cards anymore.
The premise here (I think, correct me if I'm mistaken) is that there are net-negative tradeoffs to using nosql/non-rdbms.
If that assumption is true, then it follows that the same argument used in the last statement also applies - that if you're not at massive scale, the aforementioned tradeoff of not using an RDBMS is likely de minimis.
(This assumes that the tradeoffs are of the magnitude that they only manifest impact at scale, hard to address that without concrete examples though)
> (This assumes that the tradeoffs are of the magnitude that they only manifest impact at scale, hard to address that without concrete examples though)
The tradeoff is usually flexibility. You run into flexibility problems anytime requirements change. Scale doesn't factor in.
This rings true in my experience. SQL knowledge has consistently helped me over my career. A simple exercise in designing the relational data model vastly improves the system architecture.
Good point. It's often the problem space and other constraints that drive these decisions. It's important that you deal with problems when you have them.
Pretty easy. RDBMSes have a shit API (SQL is terrible) and the largest (PostgreSQL) has a shit HA story. IMO you should think about why you are using an RDBMS.
I have used both and have never regretted NOT using an RDBMS. Maybe it's a taste thing, but I'd rather use a simple K/V database than a relational database any day.
Hard disagree. The operational overhead of RDBMS and specifically their inherent reliance on a single primary node makes them, in my opinion, a bad place to start your architecture.
I want to be able to treat the servers in my database tier as cattle instead of pets and RDBMSs don't fit this paradigm well. Either NoSQL or NewSQL databases are, in my opinion, a much better place to start.
I feel like RDBMSs being the "default" option is because most people have worked with them in the past and already understand them. It doesn't mean they are the best tool for the job or even the tool most likely to solve the unknown problems you'll encounter in the future.
Only once have I worked on a project where a document database did not completely gimp our ability to deliver the data that was required of us, and that was only because that data was regularly cloned to a relational database we could use for asynchronous tasks. As a project grows, I have, without fail, come to find that you need relations to efficiently deliver the data that new requirements demand.
You can have multi-TB postgree databases that are fast and usable with a limited number of cache layers for speed, but you probably don't need it. Medium migrated from a single postress in 2020.
I don't understand why this comment is downvoted. I've been part of a project that uses a dozen different PostgreSQL databases from different services; one of those databases is multi-TB with an OLAP usage pattern, and it beats all the previous attempts to solve that specific problem performance-wise.
Postgres is the Post-Ingres database. Originally it used QUEL (as in PostQUEL as in libpq) as a query language. When SQL support was added the product name was changed to PostgreSQL
Postgree isn’t a thing.
Postgres or PostgreSQL are acceptable. Postgre or postgree are not.
> a dirty read occurs when you perform a read, and another transaction updates the same row but doesn't commit the work, you perform another read, and you can access the uncommitted (dirty) value
It's even worse than this with MS SQL Server. When using the READ UNCOMMITTED isolation level it's actually possible to read corrupted data, e.g. you might read a string while it's being updated, so the result row you get contains a mix of the old value and new value of the column. SQL Server essentially does the "we got a badass over here" Neil deGrasse Tyson meme and throws data at you as fast as it can. Unfortunately I've worked on several projects where someone apparently thought that READ UNCOMMITTED was a magic "go fast" button for SQL and used it all throughout the app.
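For anyone who hasn't run into it, a rough sketch of the pattern in question (T-SQL; the invoices table is invented):
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT total FROM invoices WHERE id = 42;
-- Or the per-table hint that shows up all over codebases treating it
-- as a "go fast" button:
SELECT total FROM invoices WITH (NOLOCK) WHERE id = 42;
-- Both can return uncommitted (and, as noted above, even torn) data.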
SERIALIZABLE is ridiculously slow if you have any level of concurrency in your app. READ COMMITTED is a reasonable default in general. The behavior GP is describing sounds like an out and out bug.
Dirty reads incidentally weren't supported for quite some time in the Sybase architecture (which forked to MS SQL Server in 1992). There was a Sybase effort to add dirty read support around 1995 or so. The project name was "Lolita."
Not sure how to use these recommendations in practice, though, even if the info is somewhat correct. SQL is a beast of a technology; it is used because it is battle-tested and because there is simply no other viable tech to replace it when it comes to transactions and aggregated queries.
Indexes are a nightmare to get right. Often performance optimizations of SQL databases include removing indexes as much as adding indexes.
Too many indexes can cause significant performance problems if RAM is short. If the indexes are actually used (rather than sitting idle on disk because other indexes are better choices for all your applications' typical queries) then they will “compete” for memory potentially causing a cache thrashing situation.
But yes, the issue with too many indexes is more often that they harm write performance.
A related issue is indexes that are too wide, either covering many columns or "including" them. As well as eating disk space, they also eat extra memory (and potentially cause extra IO load) when used (fewer rows per page, so more pages loaded into RAM for the same query).
Both problems together (too many indexes, many of which are too wide) usually come from blindly accepting recommendations from automated tools, particularly when the tools are right that there is a problem and that a given index may solve it, but fixing the queries so existing indexes become useful would have a much greater effect than adding new ones.
Mostly because of overlapping indexes. Then, if there are INCLUDE columns, it can get out of hand. Not too difficult to achieve: just blindly follow recommendations from a tool or a cloud service.
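A minimal sketch of the overlap problem (Postgres syntax; the orders table is invented for illustration):
CREATE INDEX ix_orders_customer      ON orders (customer_id);
CREATE INDEX ix_orders_customer_date ON orders (customer_id, created_at);
-- The first index is largely redundant: any query that can use it can
-- also use the composite one, yet both compete for cache and slow
-- down every write.
DROP INDEX ix_orders_customer;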
It's not that SQL is all that beastly, it's that most tutorials fail to explain the internals and basics and so you just see all these features and interfaces of the system and can't build a mental model of how the system works.
Well, SQL does grant you a lot of liberties. I've worked with expensive commercial software that destroys database performance by doing everything from complicated ad hoc queries to massive numbers of point reads.
All useful tools can be used incorrectly - and I agree SQL is one of the more frequently misused ones. I think a lot of that is that it's one of the more powerful tools.
I still think about my first job out of college. Shopping cart application, we would add indexes exclusively when there was a problem rather than proactively based on expected usage patterns. It's genuinely a testament to MySQL that we got as far as we did without knowing anything about what we were doing.
One of my most popular StackOverflow questions to this day is about how to handle one million rows in a single MySQL table (shudder).
The product I work on now collects more rows than that a day in a number of tables.
Introductory material is always welcome, but I suspect this isn't going to hit the target for most people. For example:
> Therefore, if the price isn’t an issue, SSDs are a better option — especially since modern SSDs are just about as reliable as HDDs
This needs a tiny extra bit of detail: if you're buying random IO (IOPS) or throughput (MB/s), SSDs are significantly (orders of magnitude!) cheaper than HDDs. HDDs are only cheaper on space, and only if your need for throughput or IO doesn't cause you to "strand" space.
> Consistency can be understood after a successful write, update, or delete of a row. Any read request immediately receives the latest value of the row.
This isn't the ACID definition of C, and is closer to the distributed systems (CAP) one. I can't fault the article for getting this wrong, though - it's super confusing!
From the SERIALIZABLE explanation: “The database runs the queries one by one … It is essential to have some retry mechanism since queries can fail.”
I know they’re trying to simplify, but this is confusing. If the first part is true, the second part can’t be. In reality the database does execute the queries concurrently, but will try to make it seem like they were done one by one. If it can’t manage that, a query will fail and have to be retried by the application.
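Roughly what that looks like in practice, as a sketch (Postgres syntax; the accounts table is invented):
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT balance FROM accounts WHERE id = 1;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
COMMIT;
-- If a concurrent transaction conflicts, one of the two aborts with
-- SQLSTATE 40001 ("could not serialize access ..."); the application
-- is expected to catch that error and simply re-run the transaction.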
I believe there was a caveat around this exact point later in the post. It was really tough striking a balance between people learning this for the first time and a more knowledgeable audience without confusing either group further.
I do appreciate the feedback and will look to add some more color here! Thank you!
Some of the explanations are questionable: I think they were overly simplified, and while I applaud the goal, some things just aren't that simple.
I highly recommend reading https://jepsen.io/consistency and clicking on each model on the map. This is the best resource I found so far for understanding databases, especially distributed ones.
> Some of the explanations are questionable: I think they were overly simplified, and while I applaud the goal, some things just aren't that simple.
I am an expert on the subject matter, and I don't think that the overall approach is questionable. The approach that the author took seems fine to me.
The definition of certain basic concepts like 'consistency' is even confusing to experts at times. This is made all the more confusing by introducing concepts from the distributed systems world, where consistency is often understood to mean something else.
Here's an example of this that I'm familiar with, where an expert admits to confusion about the basic definition of consistency in the sense that it appears in ACID:
This is a person that is a longtime peer of the people that invented the concepts!
Not trying to rigorously define these things makes a great deal of sense in the context of a high level overview. Getting the general idea across is far more important.
Everyone can disagree on precisely where to slice "this is beginner content" from "this is almost-beginner content". I could stick my own oar in here, but I won't.
I think your level of abstraction is quite good for the absolute "what on earth are people talking about when they use that 'database' word?". With an extremely high level understanding, when they encounter more detail they'll have a "place to put it".
Repeatable read isolation creates read locks so that other transactions cannot write to those records. Of course our own transaction has to first wait for outstanding writes to those records to commit before starting.
As best as I know, the goal is not to prevent one's own transaction from updating the records we read; the read locks will just get upgraded to write locks.
> Repeatable read isolation creates read locks so that other transactions cannot write to those records.
No it doesn't, that's just one possible implementation strategy. Postgres for example does not do this.
> As best as I know, the goal is not to prevent one's own transaction from updating the records we read;
I'm talking about updates from other transactions. In Postgres, the following transaction can be executed concurrently by two clients:
BEGIN;
SELECT bar FROM foo WHERE id = 1; -- Returns 0
UPDATE foo SET bar = bar + 1 WHERE id = 1;
COMMIT;
Both clients can see a value of "0" from the first SELECT. Under the default READ COMMITTED level both can commit, and the value of "bar" ends up at "2", i.e. the "read" of "bar" in "bar = bar + 1" does not use snapshot isolation. Under REPEATABLE READ, Postgres instead aborts the second UPDATE with a serialization error; in neither case does it take read locks.
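A sketch of the REPEATABLE READ variant of the same experiment (same made-up foo table):
-- Both clients run:
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT bar FROM foo WHERE id = 1;          -- both see 0
UPDATE foo SET bar = bar + 1 WHERE id = 1; -- the second client blocks here
COMMIT;
-- Once the first client commits, the second client's UPDATE fails with
-- ERROR: could not serialize access due to concurrent update
-- (SQLSTATE 40001), and its transaction must be retried.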
" Unlike SQL, it forms a logical pipeline of transformations, and supports abstractions such as variables and functions. It can be used with any database that uses SQL, since it transpiles to SQL. "
> "Scale of data often works against you, and balanced trees are the first tool in your arsenal against it."
An ironic caveat to this is that balanced trees don't scale well, only offering good performance across a relatively narrow range of data size. This is a side-effect of being "balanced", which necessarily limits both compactness and concurrency.
That said, concurrent B+trees are an absolute classic and provide important historical context for the tradeoffs inherent in indexing. Modern hardware has evolved to the point where B+trees will often offer disappointing results, so their use in indexing has dwindled with time.
If your database engine is an old design or your data is small by modern standards, then a B+tree will be one of the few indexing algorithms available and if the data is small it will probably work. Modern database kernels targeting modern hardware and storage densities typically aren't using B+trees and the reasons why are well-understood. No one with any sense is using a B+tree to index e.g. a trillion records, which is a pretty ordinary thing to do on a single server in 2022.
You can't just swap out indexing architectures due to their dependency on storage engine and scheduling behavior, so older databases like PostgreSQL will be using B+trees for the indefinite future even if suboptimal.
The transition away from B+tree based architectures in new database engines started about 10-15 years ago. Back then I used them ubiquitously but I honestly don't remember the last time I've seen one in a new design.
You said that B-Trees "use in indexing has dwindled with time". This is demonstrably false.
> Back then I used them ubiquitously but I honestly don't remember the last time I've seen one in a new design.
Even if that was true (which it definitely isn't), why would anybody judge the commercial or scientific relevance of B-Trees by looking at what new systems do? There are very few new systems that are intended to be competitive as general purpose systems, which is where most of the market is.
You still haven't actually named a single example of a "modern database kernel" that exemplifies what you're talking about.
You are understating the limitations of B+trees for real workloads. A common and growing problem is the lack of online indexing that scales; the particular data model doesn't matter that much. Index construction throughput and scaling has been a serious problem at some pretty boring companies I've done work for.
Use of B+trees in new database kernels has definitely diminished. I'm not counting the installed base of SQLite etc. Ubiquity doesn't make something the pinnacle of technology -- just as often it means "legacy installed base". I still use PostgreSQL a lot and mod it when I need to but I am not under any illusions about its limitations.
A "modern" database kernel that can efficiently use modern hardware is going to be a thread-per-core architecture with all I/O and execution scheduling done in user space, and the ability to operate on modern storage densities found on database servers, which can exceed a petabyte of direct-attached storage. The implications of storage density and its interaction with indexing drive most of the real changes in the way database kernels are designed. You can find elements of this in open source, but mostly in big data platforms rather than proper database engines.
That said, no one builds new high-end databases for retail anymore, the economics don't make sense. All the money moved to more specialized implementations that cater to smaller audiences where you don't need to advertise. The kernels are fully general, and widely reused, but the interfaces and surrounding bits are purpose-built for particular workloads. Hell, my old storage engines are still used under license by that lot. The days of database billboards on the 101 are an anachronism.
You keep talking about how B-Trees are rarely used but I’ve seen relatively new systems deployed that use them (or some minor variation). FoundationDB, FASTER, and a few others.
Other than in-memory hash indexing as used by SAP HANA, I’m not aware of any other data structures anywhere near as popular for database engines.
Can you name the data structure(s) that have superseded these?
> You are understating the limitations of B+trees for real workloads.
I never said anything about workloads. All I said was that your statements about B+Trees having dwindling usage are clearly false.
If you make a claim that is self-evidently bogus, then you shouldn't expect anything else that you may have said at the same time to be taken seriously.
"
Like many modern analytical engines [18, 20], Procella
does not use the conventional BTree style secondary indexes,
opting instead for light weight secondary structures such as
zone maps, bitmaps, bloom filters, partition and sort keys [1].
The metadata server serves this information during query
planning time. These secondary structures are collected
partly from the file headers during file registration, by the
registration server, and partly lazily at query evaluation time
by the data server. Schemas, table to file mapping, stats,
zone maps and other metadata are mostly stored in the
metadata store (in Bigtable [11] and Spanner [12])."
You never post anything constructive. Just that everything in open source sucks while you do fancy algorithms in your basement with spatial data at the boring facility. Anytime anybody asks for more, you never reply with anything constructive or a link to a paper/algorithm or something.
The most you've ever said is "yeah, do what ScyllaDB does, but you will still suck".
It just feels like advertising and doesn't really add anything to the discussion, I believe. All your comments are the same in every database thread.
What kinds of indexing structures are used instead, and how do they differ from B+trees? Do you have examples of which relational databases have replaced B+tree indexes?
The property being optimized for, relative to B+trees, is extreme compactness of representation. In the pantheon of possible indexing algorithms, B+trees are pretty far on the bloated end of the spectrum in terms of the ratio between data space and index space. All indexes have a scaling performance cliff due to the index structure filling up and eventually overflowing available cache, crowding out the data and forcing page faults for almost every index write. In B+tree indexes this happens relatively early and often.
Radically improving index compactness is achieved by loosening design constraints on B+trees: the indexes represent a partial order which only converges on a total order at the limit and the search structure is unbalanced. In the abstract these appear slightly less efficient but it enables the use of selectivity-maximizing succinct representations of the key space that can get pretty close to the information theoretic limits. Scalability gains result from the radical reduction in cache footprint when represented this way.
Optimal compressive indexes are not computable (being equivalent to AI), so the efficient approximation strategies people come up with tend to be diverse, colorful, and sometimes impractical. Tangentially, some flavors have excellent write performance. It is not a trivial algorithm problem, but there are a few design families that generalize well to real database engines. I wouldn't describe this as a fully solved problem, but many ordinary cases are covered.
There isn't much incentive to design a relational database engine that can use these types of indexes, since the types of workloads and data models that recommend them usually aren't relational. Someone could, there just isn't much incentive. It is more de rigueur for graph, spatiotemporal, and some types of analytical databases, where there is no other practical option if scalability matters at all.
It's definitely possible, and can make a lot of sense -- MyRocks/RocksDB for MySQL seems like an interesting and well designed system to me. It is fairly natural to compare MyRocks to InnoDB, since they're both MySQL storage engines. That kind of comparison is usually far more useful than an abstract comparison that ignores the practicalities of concurrency control and recovery.
The fact that MyRocks doesn't use B+Trees seems like half the story. Less than half, even. The really important difference between MyRocks and InnoDB is that MyRocks uses log-structured storage (one LSM tree for everything), while InnoDB uses a traditional write-ahead log with checkpoints, and with logical UNDO. There are multiple dimensions to optimize here, not just a single dimension. Focusing only on time/speed is much too reductive. In fact, Facebook themselves have said that they didn't set out to improve performance as such by adopting MyRocks. The actual goal was price/performance, particularly better write amplification and space amplification.
Databases designed to query large amounts of data tend to rely on a partition key to allow parallel workloads (i.e. map-reduce). And this partition data is stored like a hash index.
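A sketch of the relational flavor of that idea, using Postgres's declarative hash partitioning (the metrics table is invented; distributed engines do the same routing across machines rather than across child tables):
CREATE TABLE metrics (
    device_id   bigint,
    recorded_at timestamptz,
    value       double precision
) PARTITION BY HASH (device_id);

-- Rows are routed to a partition by hashing the partition key.
CREATE TABLE metrics_p0 PARTITION OF metrics
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);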
To go big picture: I'm kind of glad databases are largely like cars in this respect, in ways that other software tooling isn't.
Which is to say they're frequently good enough such that the human working with them on whatever level can safely not know a lot of these details and get a LOT done. Kudos to whoever deserves them here.
Isn't that true for almost all software? You only need to know the implementation of a small subset of parts. I would say databases are worse, since you need to know how they are implemented or else you will start making O(rows) queries or doing other inefficient stuff.
Going broadly (which is all I can do because I teach this stuff and don't build in depth) -- "the database" is the part I can most easily "abstract" away as if it were walled off?
As opposed to aspirationally discrete classifications that end up being porous, e.g. MVC, "Object Oriented" etc.
If possible, it would be great to get Marc Brooker (Principal at AWS) to provide some notes on bridging the gap between the CAP theorem and how AWS relaxes constraints for building out Elastic Block Storage (EBS).