For the vast, vast majority of people, if you don't have an obvious primary key, choosing UUIDv7 is a no-brainer: it's the option that causes the least grief.
Which of these is an amateur most likely to hit: crash caused by having too small a primary key and hitting the limit, slowdowns caused by having a primary key that is effectively unsortable (totally random), contention slowdowns caused by having a primary key that needs a lock (incrementing key), or slowdowns caused by having a key that is 16 bytes instead of 8?
Of all those issues, the slowdown from a 16 byte key is by far the least likely to be an issue. If you reach the point where that is an issue in your business, you've moved off of being a startup and you need to cough up real money and do real engineering on your database schemas.
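UUIDv7's sortability comes from its layout: the top 48 bits are a millisecond Unix timestamp, so values generated later compare larger. A minimal sketch of that layout per RFC 9562, built by hand since Python's stdlib `uuid` module only gained a native `uuid7()` very recently (illustrative only, not production code):

```python
import os
import time
import uuid


def uuid7() -> uuid.UUID:
    """Hand-rolled UUIDv7: 48-bit ms timestamp, version nibble,
    12 random bits, variant bits, 62 random bits (RFC 9562 layout)."""
    ts_ms = (time.time_ns() // 1_000_000) & ((1 << 48) - 1)
    rand_a = int.from_bytes(os.urandom(2), "big") & 0x0FFF
    rand_b = int.from_bytes(os.urandom(8), "big") & ((1 << 62) - 1)
    value = (ts_ms << 80) | (0x7 << 76) | (rand_a << 64) | (0b10 << 62) | rand_b
    return uuid.UUID(int=value)


a = uuid7()
time.sleep(0.002)  # guarantee a later millisecond
b = uuid7()
assert a < b  # later timestamp => larger value => index-friendly ordering
```

Because the timestamp leads, inserts land at the right-hand edge of a B-tree index instead of at random pages, which is the whole advantage over v4.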
The problem is that companies tend to only hire DB expertise when things are dire, and then, the dev teams inevitably are resistant to change.
You can monitor and predict the growth rate of a table; if you don’t know you’re going to hit the limit of an INT well in advance, you have no one to blame but yourself.
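The projection really is back-of-the-envelope arithmetic. A sketch, with made-up numbers for illustration:

```python
INT_MAX = 2**31 - 1  # signed 32-bit max: 2,147,483,647


def days_until_exhaustion(current_max_id: int, rows_per_day: int) -> float:
    """Rough projection of when an auto-increment INT PK runs out,
    assuming a steady insert rate."""
    return (INT_MAX - current_max_id) / rows_per_day


# hypothetical: 1.5B ids already consumed, 2M inserts/day
print(round(days_until_exhaustion(1_500_000_000, 2_000_000)))  # 324 days
```

Run that from a monitoring job and you get months or years of warning, which is plenty of time for an online conversion to BIGINT.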
Re: auto-incrementing locks, I have never once observed that to be a source of contention. Most DBs are around 98/2% read/write. If you happen to have an extremely INSERT-heavy workload, then by all means, consider alternatives, like interleaved batches or whatever. It does not matter for most places.
I agree that UUIDv7 is miles better than v4, but you’re still storing far more data than is probably necessary. And re: 16 bytes, MySQL annoyingly doesn’t natively have a UUID type, and most people don’t seem to know about casting it to binary and storing it as BINARY(16), so instead you get a 36-byte PK. The worst.
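The size difference is easy to demonstrate: a UUID's canonical text form is 36 characters, but the underlying value is only 16 bytes. This is essentially what MySQL's `UUID_TO_BIN()`/`BIN_TO_UUID()` do (minus their optional timestamp byte-swapping), sketched in Python:

```python
import uuid

u = uuid.uuid4()
text_form = str(u)     # what a CHAR(36) column ends up storing
binary_form = u.bytes  # what a BINARY(16) column stores

assert len(text_form) == 36
assert len(binary_form) == 16
# the compact form round-trips losslessly
assert uuid.UUID(bytes=binary_form) == u
```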
> contention slowdowns caused by having a primary key that needs a lock (incrementing key)
This kind of problem only exists in unsophisticated databases like SQLite. Postgres reserves whole ranges of IDs at once so there is never any contention for the next ID in a serial sequence.
I think you’re thinking of the CACHE property of a sequence, but it defaults to 1 (no pre-allocated ranges). That said, Postgres only needs a lightweight lock on the sequence object, since it’s separate from the table itself.
MySQL does need a special kind of table-level lock for its auto-incrementing values, but it has fairly sophisticated logic as of 8.0 as to when and how that lock is taken. IME, you’ll probably hit some other bottleneck before you experience auto-inc lock contention.
I was in the never-UUID camp, but have been converted. Of course it depends on how much you rely on your PKs for speed, but UUIDs have a great benefit: you can create a unique key without a visit to the DB, and that can enormously simplify your app logic.
I’ve never understood this argument. In every RDBMS I’m aware of, you can get the full row you just inserted sent back (RETURNING clause in Postgres, MariaDB, and newer versions of SQLite), and even in MySQL you can read the last auto-increment id generated from the cursor used to run the query.
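Both patterns, sketched with Python's stdlib sqlite3 (the RETURNING clause needs SQLite 3.35+, which ships with recent Python versions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# Pattern 1: RETURNING clause (Postgres, MariaDB, SQLite 3.35+)
row = conn.execute(
    "INSERT INTO users (name) VALUES (?) RETURNING id", ("alice",)
).fetchone()
print(row[0])  # 1

# Pattern 2: last insert id from the cursor (the MySQL-style approach)
cur = conn.execute("INSERT INTO users (name) VALUES (?)", ("bob",))
print(cur.lastrowid)  # 2
```

Either way, one round trip gets you both the insert and the generated key.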
Now imagine that storing the complete model is the last thing you do in a business transaction. The workflow is something like: the user enters some data, then over the course of the next minutes adds more; the system contacts various remote services that can also take a long time to respond; the user can even park the whole transaction for the day and restore it later. But you still want a unique ID identifying this dataset for logging etc. There is nothing you can insert at the start (it won't satisfy the constraints and is also completely useless). So you can either create a synthetic ID at the start, which won't be the real ID when you finally store the dataset, or you can just generate a UUID anywhere, anytime, and it will be the real ID of the dataset forever.
So have a pending table with id, user_id, created_at, and index the latter two as a composite key. SELECT id FROM pending WHERE user_id = ? ORDER BY created_at DESC LIMIT 1.
Preferably delete the row once it's been permanently stored.
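A sketch of that pending-table pattern (sqlite3 here; the table and column names follow the comment above and are just illustrative):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE pending (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    created_at REAL NOT NULL)""")
# composite index so the latest-per-user lookup is an index-only scan
conn.execute("CREATE INDEX ix_pending_user ON pending (user_id, created_at)")


def start_business_transaction(user_id: int) -> int:
    """Reserve a durable id at the start of a long-running workflow."""
    cur = conn.execute(
        "INSERT INTO pending (user_id, created_at) VALUES (?, ?)",
        (user_id, time.time()))
    return cur.lastrowid


def latest_pending(user_id: int) -> int:
    row = conn.execute(
        "SELECT id FROM pending WHERE user_id = ? "
        "ORDER BY created_at DESC LIMIT 1", (user_id,)).fetchone()
    return row[0]


tx_id = start_business_transaction(42)
assert latest_pending(42) == tx_id
# once the dataset is permanently stored, drop the placeholder row
conn.execute("DELETE FROM pending WHERE id = ?", (tx_id,))
```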
Keeping an actual transaction open for that long is asking for contention, and the idea of having data hanging around ephemerally in memory also seems like a terrible idea – what happens if the server fails, the pod dies, etc.?
> Keeping an actual transaction open for that long is asking for contention
Yes, which is why this is not an actual db transaction, it's a business transaction as mentioned.
> and the idea of having data hanging around ephemerally
The data is not ephemeral of course. But also the mapping between business transaction and model is not 1:1, so while there is some use for the transaction ID, it can't be used to identify a particular model.
> With UUIDs there is absolutely no need for that.
Except now (assuming it’s the PK) you’ve added the additional overhead of a UUID PK, which depending on the version and your RDBMS vendor, can be massive. If it’s a non-prime column, I have far fewer issues with them.
> The data is not ephemeral of course.
I may be misunderstanding, but if it isn’t persisted to disk (which generally means a DB), then it should not be seen as durable.
The whole content of the business transaction is persisted as a blob until it's "done" from the business perspective. After that, entities are saved (or updated/deleted) to their respective tables, and the blob is deleted. This gives the users great flexibility during their workflows.
Yes, the overhead of UUIDs was something I mentioned already. For us it absolutely makes sense to use them; we don't anticipate having hundreds of millions of records in our tables.
I also do that for convenience. It helps a lot in many cases. In other cases I might have tables that may grow into the millions of rows (or hundreds of millions), then I'd absolutely not use UUID PK's for those particular tables. And I'd also shard them across schemas or multiple DBs.
If you don't have a natural primary key (the usual use case for UUIDs in distributed systems such that you can have a unique value) how do you handle that with bigints? Do you just use a random value and hope for no collisions?
Wouldn't you just have an autoincrementing bigint as a surrogate key in your dimension table?
Or you could preload a table of autoincremented bigints and then atomically grab the next value from there where you need a surrogate key like in a distributed system with no natural pk.
Yes, if you have one database. For a distributed system though with many databases sharing data, I don't see a way around a UUID unless collisions (the random approach) are not costly.
Yes. This is the way, so long as you can guarantee you won't grow past the integer's range.
Otherwise you can still use the pregenerated autoincrements. You just need to check out blocks of values for each node in your distributed system from the central source before you would need them:
N1 requests 100k values, N2 requests 100k values, etc. Then when you've allocated some amount, say 66%, request another chunk. That way you have time to recover from a central manager going offline before it's critical.
I have no problem with using uuids but there are ways around it if you want to stick with integers.
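The block check-out scheme described above is essentially a hi/lo allocator. A sketch, where `fetch_block` stands in for the call to the central sequence service and the 66% refill threshold comes from the comment above:

```python
class BlockAllocator:
    """Hands out ids from a locally checked-out block, requesting the
    next block from a central source before the current one runs dry."""

    def __init__(self, fetch_block, block_size=100_000, refill_at=0.66):
        self._fetch_block = fetch_block  # returns the start of a fresh block
        self._size = block_size
        self._refill_at = refill_at
        self._start = fetch_block(block_size)
        self._next = self._start
        self._reserve = None             # pre-fetched next block, if any

    def next_id(self) -> int:
        used = (self._next - self._start) / self._size
        if used >= self._refill_at and self._reserve is None:
            self._reserve = self._fetch_block(self._size)  # fetch early
        if self._next >= self._start + self._size:  # current block exhausted
            self._start = self._reserve
            self._next = self._start
            self._reserve = None
        nid = self._next
        self._next += 1
        return nid


# stand-in for the central sequence service
counter = {"v": 0}


def fetch_block(size):
    start = counter["v"]
    counter["v"] += size
    return start


alloc = BlockAllocator(fetch_block, block_size=10)
ids = [alloc.next_id() for _ in range(25)]
assert ids == list(range(25))  # unique; gap-free when only one node allocates
```

With multiple nodes the ids stay unique but interleave across blocks, which is exactly the trade being made.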
But if you use sequential integers as primary key, you are leaking the cardinality of your table to your users / competitors / public, which can be problematic.
Is it really enormous? Debating bigint vs UUID is like debating self-hosting vs cloud with stakeholders. Which one has the bigger risk of collision? Is the size difference material to operations? Then go with the less risky one.
You shouldn't be using BIGINT for random identifiers so collision isn't a concern - this is just to future proof against hitting the 2^31 limit on a regular INT primary key.
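The collision math behind that: for n random k-bit values, the birthday bound puts the probability of at least one collision at roughly n²/2^(k+1). A quick sketch:

```python
def collision_probability(n: int, bits: int) -> float:
    """Birthday-bound approximation: p ~= n^2 / 2^(bits + 1).
    Accurate while p is small."""
    return n * n / 2 ** (bits + 1)


# a billion random 64-bit ids: ~2.7% chance of a collision - far too risky
print(collision_probability(1_000_000_000, 64))   # ~0.027
# the same billion as UUIDv4 (122 random bits): negligible
print(collision_probability(1_000_000_000, 122))  # ~9.4e-20
```

Which is the quantitative version of the point: random bigints are a collision hazard, sequential bigints are not, and UUIDs buy their safety with width.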
Twice now I've been called in to fix UUIDs grinding systems to a halt.
People underestimate how important efficient indexes are on relational databases because replacing autoincrement INTs with UUIDs works well enough for small databases, until it doesn't.
My gripe against UUIDs is not even performance. It's debugging.
Much easier to memorize and type user_id = 234111 than user_id = '019686ea-a139-76a5-9074-28de2c8d486d'
I’ll go further; don’t automatically default to a BIGINT. Put some thought into your tables. Is it a table of users, where each has one row? You almost certainly won’t even need an INT, but you definitely won’t need a BIGINT. Is it a table of customer orders? You might need an INT, and you can monitor and even predict the growth rate. Did you hit 1 billion? Great, you have plenty of time for an online conversion to BIGINT, with a tool like gh-ost.
I work for a company that deals with very large numbers of users. We recently had a major project because our users table ran out of ints and had to be upgraded to bigint. At this scale that's harder than it sounds.
So your advice that you DEFINITELY won't need a BIGINT, well, that decision can come back to bite you if you're successful enough.
(You're probably thinking there's no way we have over 2 billion users and that's true, but it's also a bad assumption that one user row perfectly corresponds to one registered user. Assumptions like that can and do change.)
I’m not saying it’s not an undertaking if you’ve never done it, but there are plenty of tools for MySQL and Postgres (I assume others as well) to do zero-downtime online schema changes like that. If you’ve backed yourself into a corner by also nearly running out of disk space, then yes, you’ll have a large headache on your hands.
Also, protip for anyone using MySQL, you should take advantage of its UNSIGNED INT types. 2^32-1 is quite a bit; it’s also very handy for smaller lookup tables where you need a bit more than 2^7 (TINYINT).
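For reference, the signed vs UNSIGNED maxima for MySQL's integer types (standard storage sizes assumed):

```python
# max values for MySQL integer types, signed vs UNSIGNED
for name, nbytes in [("TINYINT", 1), ("SMALLINT", 2), ("MEDIUMINT", 3),
                     ("INT", 4), ("BIGINT", 8)]:
    bits = nbytes * 8
    signed_max = 2 ** (bits - 1) - 1
    unsigned_max = 2 ** bits - 1
    print(f"{name:9} signed {signed_max:>26,}  unsigned {unsigned_max:>26,}")
```

So UNSIGNED TINYINT gets you to 255 instead of 127, and UNSIGNED INT to 4,294,967,295 instead of 2,147,483,647, all at the same storage cost.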
> but it's also a bad assumption that one user row perfectly corresponds to one registered user. Assumptions like that can and do change.
There can be dupes and the like, yes, but if at some point the Customer table morphed into a Customer+CustomerAttribute table, for example, I’d argue you have a data modeling problem.
Use bigint, never UUID. UUIDs are massive (2x a bigint) and now your DBMS has to copy that enormous value to every side of a relation.
It will bloat your table and indexes 2x for no good reason whatsoever.
Never use UUIDs as your primary keys.
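The "2x" is about key width specifically: in a clustered storage engine like InnoDB, every secondary index stores the PK value as its row pointer, so the PK's size is paid once per index. A back-of-the-envelope for key storage alone (ignoring page headers, fill factor, etc.; the row and index counts are made up):

```python
def pk_key_bytes(rows: int, secondary_indexes: int, key_bytes: int) -> int:
    """Bytes spent on PK values: once in the clustered key, plus once
    per secondary index that carries the PK as a row pointer."""
    return rows * key_bytes * (1 + secondary_indexes)


rows, idx = 100_000_000, 4
print(pk_key_bytes(rows, idx, 8) / 1e9)   # BIGINT: 4.0 GB of key data
print(pk_key_bytes(rows, idx, 16) / 1e9)  # UUID:   8.0 GB of key data
```

Whether that doubling matters is exactly the startup-scale question debated upthread.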