At what scale does all this stuff start to actually matter? I have a database with ~100 tables and ~500M rows driving a medium-traffic web app and various back-end systems. We use auto-incrementing integers as primary keys and try not to expose them externally. Indexes are added as necessary to enable specific queries. We don't enforce any other constraints (e.g. not-null or foreign keys) at the database level.
... and it all works and performs just fine? The considerations the author mentions all make some sense to me in theory, but when do they actually matter in practice in a modern system?
You might not have any observable effects until the day you do, at which point you may be faced with an arbitrarily large problem.
What you are doing is passing up the opportunity to catch various errors (you are also passing up some optimizations, but that is probably a secondary issue). In particular, you are giving up the chance to catch inconsistencies in how different applications (or different parts of the same application) create and use data.
One argument made against putting this sort of rule in effect is that it constrains what application developers can do, but that is the wrong way to look at it: any such conflict is an indication of a misunderstanding (not necessarily on the part of the application developers) that has been caught before it can lead to bigger problems, such as a database full of irretrievably inconsistent or incomplete data.
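As a concrete sketch of what "catching it early" looks like (PostgreSQL-flavored, with made-up table and column names), these are the kinds of errors the database hands you the moment some code path gets it wrong:

    -- Hypothetical schema: NOT NULL, CHECK and FK constraints as guard rails.
    CREATE TABLE customers (
        id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        email text NOT NULL
    );

    CREATE TABLE orders (
        id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        customer_id bigint NOT NULL REFERENCES customers (id),
        total_cents bigint NOT NULL CHECK (total_cents >= 0)
    );

    -- A buggy code path that forgets the customer is rejected immediately...
    INSERT INTO orders (customer_id, total_cents) VALUES (NULL, 995);
    -- ERROR: null value in column "customer_id" violates not-null constraint

    -- ...as is one that references a customer that was never created.
    INSERT INTO orders (customer_id, total_cents) VALUES (123456, 995);
    -- ERROR: insert or update on table "orders" violates foreign key constraint

Either failure shows up at the write, pointing at the bug, instead of months later as mystery rows.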
When the problems finally do arise, it is often the case that some sort of workaround is the only practical solution. This, in my experience, is one of the common ways by which systems accrue gratuitous complexity, which in turn has at least two real-world consequences: an increased time to make changes, upgrades and extensions, and an increased frequency of errors, especially WTF-type errors (and probably also efficiency/performance hits.)
My point is that you cannot know when problems will show up. If you do everything right, they will never happen, but why pass up an opportunity to reduce the risk?
I disagree that UUIDs are generally preferable over integers. For one, they take up more space (on disk and in memory). And for something like a key, there will likely be multiple copies of that value stored, since it will exist in the table itself, in at least one index (possibly more), and in any foreign keys that reference it. More space means fewer records per page on disk, more I/O and more memory usage (potentially leading to more I/O). I would wager that for most users (including this case), the cost of this additional I/O is far greater than some theoretical scalability limitation on generating IDs.
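To put rough numbers on the space claim (assuming PostgreSQL; gen_random_uuid() is built in from version 13, via pgcrypto before that):

    -- A bigint key is 8 bytes, a uuid key is 16, before any index overhead.
    SELECT pg_column_size(1::bigint)         AS bigint_bytes,  -- 8
           pg_column_size(gen_random_uuid()) AS uuid_bytes;    -- 16

That factor of two is then paid again in every index and foreign key column that carries the value.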
Most database vendors make sequence generation (whether through explicit SEQUENCE objects or auto-incrementing columns) performant by making a few compromises:
* Numbers may not always be sequential (you might get 1, 3, 2 - in that order)
* There may be gaps (you might get 1, 2, 5, 6)
Since these are supposed to be opaque identifiers, neither compromise should be a concern for most users. In exchange, the sequence can live outside of any transaction (you might grab an ID, roll back, and that ID is gone), and systems with multiple nodes can be allocated a "block" of numbers from which they quickly pull new values without needing to coordinate with a master node.
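A minimal PostgreSQL illustration of both compromises (the sequence name is made up):

    -- Each session pre-allocates a block of 50 values, so two sessions
    -- inserting concurrently can hand out ids out of order.
    CREATE SEQUENCE order_id_seq CACHE 50;

    BEGIN;
    SELECT nextval('order_id_seq');  -- e.g. returns 1
    ROLLBACK;                        -- the rollback does NOT hand 1 back

    SELECT nextval('order_id_seq');  -- returns 2: a permanent gap at 1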
Client-side ID generation (e.g. UUIDs) can be very useful. For one, it lets the client send an entire batch payload of one or more inserts, plus updates that reference the new inserts, in a single round trip. It doesn't need to wait to get the inserted IDs back to make updates, or send more requests to other services. All it needs is an OK back. This is very convenient in systems where, say, you ingest a bunch of data that needs to be dumped into a final database. The entire batch can be ready, including IDs, at the sender side.
This is particularly useful in some other scenarios such as distributed, eventually-consistent systems that are able to resolve concurrency conflicts. Send out changes optimistically (with those pregenerated IDs), resolve conflicts by untangling the ones that didn't work.
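A sketch of the single-round-trip case (PostgreSQL syntax; the invoices/invoice_lines tables are hypothetical): the client mints the ids itself, so parent rows, child rows and the references between them can all travel in one request:

    BEGIN;
    -- Both ids were generated on the client before the request was sent.
    INSERT INTO invoices (id, created_at)
    VALUES ('0c3f3f0e-9d9e-4f10-9a57-7f8b2a1c9d11', now());

    INSERT INTO invoice_lines (id, invoice_id, amount_cents)
    VALUES ('5b1d6c2a-2e44-4a0b-8c3e-1f2a3b4c5d6e',
            '0c3f3f0e-9d9e-4f10-9a57-7f8b2a1c9d11', 1995);
    COMMIT;
    -- No RETURNING / lastval() round trip is needed to wire up the foreign key.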
UUIDs are the best choice when developer time is more important than space usage. Also, they can be generated and used by the client when the connection to the db is frequently down (e.g. clients store/query data locally in SQLite and replicate to the master).
"The random nature of standard version 3, 4, and 5 UUIDs and the ordering of the fields within standard version 1 and 2 UUIDs may create problems with database locality or performance when UUIDs are used as primary keys. For example, in 2002 Jimmy Nilsson reported a significant improvement in performance with Microsoft SQL Server when the version 4 UUIDs being used as keys were modified to include a non-random suffix based on system time. This so-called "COMB" (combined time-GUID) approach made the UUIDs non-standard and significantly more likely to be duplicated, as Nilsson acknowledged, but Nilsson only required uniqueness within the application."
The article touches on this question. Database locality is a complex scaling question that may not be a problem depending on how read/write heavy your operations are. On the one hand, randomly distributed UUIDs may be a bad idea on a single SQL server trying to balance a single B-Tree index (and thrashing that index with a lot of incoming data); on the other hand, they can be close to ideal for partition sharding across multiple servers. As with any trade-off in database design, your mileage will vary with your application needs and resource availability.
Also, there are other options for time-ordered GUID/UUID alternatives. ULID is the one I've been heavily using in projects lately: https://github.com/ulid
(ULID uses a timestamp prefix and random suffix for reasonable time-ordered database locality; an interesting compromise between V1 and V4 UUIDs, though not directly compatible with either. The L stands for "lexicographic", in that it's also meant to sort correctly as a string, which can be important for database locality in many document/NoSQL databases.)
I'm also not a DB expert, but I definitely think locality is a concern with UUIDs (at least for the B-tree indexes that PostgreSQL keeps in sorted order; the table heap itself isn't ordered, but random keys still scatter writes across index pages).
But, storage locality has always struck me as a use-case-specific optimization, in the sense that it's hard to take locality into account without knowing what traffic patterns the table is expected to experience. So, although UUIDs provide locality for exactly no types of queries, that may not be a problem for you.
For instance, you could use an autoincrementing sequence, but that only provides good locality for queries over sequential records. The COMB method would provide good locality for queries over time periods. But if you aren't performing those types of queries, you might be better off with a different type of PK.
UUIDs are a great option for creating IDs offline, but I don't understand your first comment: How do UUIDs optimize for developer time? In my experience, autoincrementing integers are the easiest choice for primary keys.
Sometimes yes, sometimes no, but mostly I agree that uuids are better for dev time. The first time something is written, they're roughly the same (or a very small bias towards autoincrement), but UUIDs have fewer dev-problems in the long run.
When you're starting out and don't have (m)any simultaneous writes to your DB, autoincrements are often perfectly sequential and sorted identically to a created_at column... which is convenient, but too often I see systems implicitly relying on this and having subtle problems when they start growing. E.g. pagination that doesn't account for gaps, or a missing "order by" that leads to the 1, 3, 2 issue mentioned above. These bugs are also rarely caught by tests, since tests usually aren't run in parallel or at high enough volume, and the lack of any warning, of any time-coincidence with when the buggy code was introduced, or of easy reproducibility can make them hard to track down and fix.
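For what it's worth, the usual defense is to never lean on ids being dense or insertion-ordered; a sketch of keyset pagination (PostgreSQL-flavored, column names invented) that behaves the same with gaps, out-of-order commits or UUIDs:

    -- Page 1: always state the ordering explicitly.
    SELECT id, title
    FROM   posts
    ORDER  BY created_at DESC, id DESC
    LIMIT  20;

    -- Next page: seek past the last row already shown (values are placeholders),
    -- instead of OFFSET, which silently assumes nothing moved or went missing.
    SELECT id, title
    FROM   posts
    WHERE  (created_at, id) < ('2024-01-01 12:34:56', 41872)
    ORDER  BY created_at DESC, id DESC
    LIMIT  20;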
There are a bunch more fairly minor things that others have mentioned (e.g. UUIDs make it much harder to write joins incorrectly, as a wrong join just comes back empty until you do it right), but generally I'd just call it "death by a thousand papercuts". Autoincrements are great and I love them, but generally I lean towards UUIDs since they're a bit more bug-resistant. And performance-wise, if you're having problems I generally doubt you'd be able to fix them by switching to autoincrements (assuming it was just flipping a switch). For a short time, possibly, but generally at that point you'll be beyond the "pick whatever is easiest" stage and can make an informed decision that'll have a far larger impact.
---
tl;dr: I disagree with primary keys being "opaque identifiers", because they do have an effect on behavior in buggy code. Autoincrements mask more problems than UUIDs, and fixing even one of those costs more time than autoincrement saves.
As you can see leaking through in a few of the examples, the scenarios where it matters are the ones where non-experts are writing to and reading from the database. If a data entry clerk is just typing values into a form thinly wrapping an INSERT, you'd better make sure any input that would break implicit assumptions is rejected. If you have a legion of Bobs from marketing who "know some SQL", you can't hand-optimize every single query they try to run.
When the database is just serving a webapp you control end to end, the only part that'll really have a huge impact is making sure you can partition by the primary key effectively. Which is good, because complex indexing schemes and foreign key constraints actually scale very badly.
I would certainly look to add not-null and uniqueness constraints, although you might find that by now your data is actually violating some of those constraints!
A lot of people take the view that the application code is sufficient proof against bad data getting into your database, but this ignores (a) bugs and (b) back-door data loading, which in most applications will almost certainly be happening at some point.
The data in your database is the crown jewels (the application that fronts it is by comparison a piece of crap that you can replace at any time), so put guards on that gate ...
No constraints will often be faster, but your data will not be 100% consistent.
Over time you _will_ have child records pointing to nonexistent parents, and the same for other foreign-key relationships. These often don't really hurt anything, until they do. Since most testing is done with clean data, most issues due to bad data show up in production and are reported by end users.
I've worked successfully on systems without any real constraints, but it is usually much better to start with constraints and relax them later as needed. Adding them later is a pain: you first have to clean up the data and fix the code that created the bad data in the first place.
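If you do end up retrofitting a constraint, the cleanup usually starts with queries like this (the parent/child table names are hypothetical):

    -- Find child rows whose parent has disappeared.
    SELECT c.id, c.parent_id
    FROM   child_table c
    LEFT   JOIN parent_table p ON p.id = c.parent_id
    WHERE  c.parent_id IS NOT NULL
      AND  p.id IS NULL;

    -- Only once those rows are fixed (or deleted) will this succeed:
    ALTER TABLE child_table
      ADD CONSTRAINT child_parent_fk
      FOREIGN KEY (parent_id) REFERENCES parent_table (id);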
I've seen auto_increment fail to scale on MySQL. Large, long-running insert statements (e.g. using insert into ... select from, or load data infile) can lock the table resource and block concurrent inserts. Pretty easy to work around: have a table containing the next available key for every table, and use a separate small transaction to bump it before bulk inserts.
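A rough sketch of that workaround (MySQL-flavored; table and column names invented), kept in its own short transaction so the bulk load never holds the counter:

    -- One row per table, holding the next free id.
    CREATE TABLE id_allocator (
        table_name VARCHAR(64) PRIMARY KEY,
        next_id    BIGINT NOT NULL
    );

    -- Reserve a block of 100000 ids in a tiny transaction, then release the lock.
    START TRANSACTION;
    SELECT next_id FROM id_allocator WHERE table_name = 'events' FOR UPDATE;
    UPDATE id_allocator SET next_id = next_id + 100000 WHERE table_name = 'events';
    COMMIT;

    -- The bulk load then supplies ids from the reserved range explicitly,
    -- so the target table's auto-increment machinery never comes into play.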
Another reason not to use auto-inc is if you need to insert lots of data into several separate tables that have foreign key relationships. If you know what you're inserting up front, and you need to insert millions of rows quickly, you're better off allocating the primary keys ahead of time and supplying them directly with the right keys for the relationship at insert time.
Separately, another argument against compound keys: if you're in a situation where you're trying to optimize joins, sorts and filters over moderately large tables, you want to minimize the number of columns you touch. Every column measurably increases query time; more data to shuffle, sort, compare, bigger indexes, etc. You won't see this if you're doing simple key / row lookups, but you will see it if you're presenting user-interactive slices over million+ row data sets.
auto_increment doesn't lock the table. When you use an auto_increment column, MySQL will grab the next int as it creates the insert write-ahead log message. Two concurrent transactions with T1 beginning first and T2 beginning second but actually committing out of order can thus have out of order ids. e.g. T2(id=10) T1(id=9)
Also, note that this means auto_increment IDs are not gap-free: a reader looking at the table after T2 commits but before T1 does will see a gap, and if T1 rolls back that gap is permanent!
"While initializing a previously specified AUTO_INCREMENT column on a table, InnoDB sets an exclusive lock on the end of the index associated with the AUTO_INCREMENT column. In accessing the auto-increment counter, InnoDB uses a specific AUTO-INC table lock mode where the lock lasts only to the end of the current SQL statement, not to the end of the entire transaction. Other sessions cannot insert into the table while the AUTO-INC table lock is held; see Section 14.5.2, “InnoDB Transaction Model”. "
If you have a long-running statement, it can block concurrent transactions. I've seen it specifically with 'load data infile', IIRC. We had to go through some painful migrations to remove auto-increment on some fairly large tables when we started seeing this.
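If memory serves, this behavior is governed by innodb_autoinc_lock_mode (a real InnoDB setting; check the docs for your version): modes 0 and 1 take the table-level AUTO-INC lock for the whole statement on bulk inserts like load data infile, while mode 2 interleaves and only guarantees uniqueness, not consecutive values within a statement.

    -- Check the current setting (0 = traditional, 1 = consecutive, 2 = interleaved).
    SELECT @@innodb_autoinc_lock_mode;

    -- It can only be changed at server startup, e.g. in my.cnf:
    -- [mysqld]
    -- innodb_autoinc_lock_mode = 2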
That is an odd design. PostgreSQL only holds the lock long enough to increment a counter in memory, and only every 32nd value does it also write a write-ahead log record. I can't see why one would need to lock the counter for the duration of the query.
You can get some real headaches with auto-incrementing integers if you have to go to a multi-master or some other distributed system for data inserts.
How do you know which integer to insert next for a table that is replicated across systems? One system could do even numbers and one could do odd. Or each could take alternating blocks of 100 integers. Or you could use UUIDs and not worry about it.
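(For what it's worth, MySQL has built-in knobs for exactly the odd/even scheme; these are real settings, though each node has to be configured consistently, usually in my.cnf:

    -- Node 1 hands out 1, 3, 5, ...
    SET GLOBAL auto_increment_increment = 2;
    SET GLOBAL auto_increment_offset    = 1;

    -- Node 2 hands out 2, 4, 6, ...
    SET GLOBAL auto_increment_increment = 2;
    SET GLOBAL auto_increment_offset    = 2;

This only stretches to a fixed number of writers before you have to reshuffle increments and offsets, which is the "switch key systems" problem below.)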
My argument is that auto-incrementing integers are inherently not scalable. They can scale to a certain extent, and then you have to switch key systems.
They don't necessarily matter for performance until you hit Twitter scale, but take care to pay close attention to the idea of natural vs. artificial keys, and heed his warning about turning artificial keys into natural keys.
One day you'll be integrating database systems together and you'll be glad you followed his advice. I've seen some really ugly bizdata schemas.
My background is in financial applications, where an inconsistent state can literally be a multimillion dollar problem, so maybe my perspective is skewed a little. But FKs and other constraints are simply used to outright enforce consistency, in my experience at least. If you rely on your custom business logic to maintain consistency, then you leave yourself open to the possibility of breaking that in some way, and if you do, who says you'll notice right away? What if you don't notice for 6 months, and then realise that you have a billion transactions that have some form of integrity compromise?
Complexity obviously makes this problem more serious, but something like this can easily occur at any scale of complexity or transaction volume. If you try to commit some code that breaks the integrity of an important relationship in some way, or even if you just want to run an arbitrary statement directly against the database, then you want the DB to throw an error.