The article compares UUIDv7 vs v4, but doesn't say why you'd do either instead of just serial/bigserial, which has always been my goto. Did I miss something?
Good question. There's a few reasons to pick UUID over serial keys:
- Serial keys leak information about the total number of records and the rate at which records are added. Users/attackers may be able to guess how many records you have in your system (counting the number of users/customers/invoices/etc). This is a subtle issue that needs consideration on a case by case basis. It can be harmless or disastrous depending on your application.
- Serial keys are required to be created by the database. UUIDs can be created anywhere (including your backend or frontend application), which can sometimes simplify logic.
- Because UUIDs can be generated anywhere, sharding is easier.
The obvious downside to UUIDs is that they are slightly slower than serial keys. UUIDv7 improves insert performance at the cost of leaking creation time.
I've found that the data leaked by serial keys is problematic often enough; whereas UUIDs (v4) are almost always fast enough. And migrating a table to UUIDv7 is relatively straightforward if needed.
This advice assumes /customers/:id/bills is public. Protected routes shouldn't expose sensitive information such as bills anyway, so this is more of an authorization issue (who can access which resource) more than privacy concerns. So this means, if you can access customes/4000/bills, then that's an application logic issue more than the type of ID itself.
In a well designed application, you shouldn't be able to guess whether a record exists or not simply by accessing a protected URL.
As a counter argument - normal BIGINT or serial PKs are performant and are more than enough for most applications.
You describe a world where human skill is required to prevent these class of bugs, time and time again we've proven that people are people and bugs happen.
Systems must be _structurally architected_ with security in mind.
Security is layered, using a random key with 128-bit space makes guessing UUIDs infeasible. But _also_ you should be doing AuthZ on the records, and also you should be doing rate limiting on API so they can't be brute forced, either.
One of my previous orgs used incrementing number for Github profiles. While there was no privacy concern about accessing a Github profile of another user, i don't think the org expected someone to calculate tech attrition and layoff count using this URL pattern ;)
That advice was born primarily _because_ of the bigint/serial problem. If the PK is UUIDv4 then exposing the PK is less significant.
In some use cases it can be possible to exclude, or anonymize the PK, but in other cases a PK is necessary. Once you start building APIs to allow others to access your system, a UUIDv4 is the best ID.
There are some performance issues with very large tables though. If you have very large tables (think billions of rows) then UUIDv7 offers some performance benefits at a small security cost.
Personally I use v4 for almost all my tables because only a very small number of them will get large enough to matter. But YMMV.
It's not about table size so much as number of joins. You don't need to trade off between security and performance if you simply expose a uuid4 secondary col on a serial PK'd table.
So the client side can create the ID before insert - that's the case that (mostly) drives it for me. The other is where you have distributed systems and then later want to merge the data and not have any ID conflicts.
Client = backend here, right? So you could make a bunch of rows that relate to each other then insert, without having to ping the DB each time to assign a serial ID. Normally the latter is what I do, but I can imagine a scenario where it'd be slow.
The usual flow would be INSERT ... RETURNING id, which gives you the db-generated id for the record you just inserted with no performance penalty. That doesn't work for circular dependencies and it limits the amount of batching you can do. But typically those are smaller penalties than the penalty from having a 128 bit primary key vs a 64 bit key
It can be quite elegant. You can avoid the whole temporary or external ID mess when the client generates the ID, this is particularly useful for offline-first clients.
Of course you need to be sure the server will accept the ID, but that is practically guaranteed by the uniqueness property of UUIDs.
“client” here may refer to a backend app server.
So you can have 10-100s of backend servers inserting into a same table without having a single authority coordinating IDs.
Except if you're using a sharding or clustering database system, where the record itself may be stored to separate servers as well as the key generation itself.
In those cases yes. There's still a case for sequential there depending on the use pattern, but write-heavy benefits from not waiting on one server for IDs.
Client-generated IDs are necessary for distributed or offline-first systems. My company, Vori, builds a POS for grocery stores. The POS generates UUIDv7 IDs for all data it creates and that data is eventually synced to our backend. The sync time can range from less than 1 second for a store with fast Internet to hours if a store is offline.
Is a collision possible? Yes, but the likelihood of a collision is so low that it's not worth agonizing over (although I did when I was designing the system).
For when you inevitably need to expose the ids to the public the uuids prevent a number of attacks that sequential numbers are vulnerable to. In theory they can also be faster/convenient in a certain view as you can generate a UUID without needing something like a central index to coordinate how they are created. They can also be treated as globally unique which can be useful in certain contexts. I don't think anyone would argue that their performance overall is better than serial/bigserial though as they take up more space in indexes.
People really overthink this. You can safely expose internal IDs by doing a symmetric cipher, like a Feistel cipher. Even sequential IDs will appear random.
But these are internal IDs only, and public ones should be a separate col. Being able to generate uuid7 without a central index is useful in distributed systems, but this is a Postgres DB already.
Now, the index on the public IDs would be faster with a uuid7 than a uuid4, but you have a similar info leak risk that the article mentions.
"Distributed systems" doesn't have to mean some fancy, purpose-built thing. Just correlating between two Postgres databases might be a thing you need to do. Or a database and a flat text file.
I usually just have a uuid4 secondary for those correlations, with a serial primary. I've done straight uuid4 PK before, things got slow on not very large data because it affected every single join.
If you need an opaque ID like a uuid because, for example, you need the capability to generate non-colliding IDs generated by disparate systems, the best way I've found is to separate these two concerns. Use a UUIDv4 for public purposes and a bigint internally. You don't need to worry about exposing creation time, and you can still manage your data in the home system with all the properties that a total ordering affords.
That's the point. Those are only system-unique, not universally. It's a lower-level attribute that is an implementation detail, like for referential integrity in an rdbms. At that point, if you need it, you have atomic increment.
I believe the concern is if your primary key in the database is a serial number it might be exposed to users unless you do extra work to hide that ID from any external APIs and if there are any flaws in your authorization checks it can allow enumeration attacks exposing private or semi-private info. With UUIDs being virtually unguessable that makes it less of a concern.
There is a big difference though. Serial keys allow attackers to guess the rate at which data is being added.
UUID7 allows anyone to know the time of creation, but not how many records have been created (approximately) in a particular time frame. It leaks data about the record itself, but not about other records.
Far, far less than sequential Ids, and the random part is some pretty big values numerically... I mean there's billions of possible values for every MS on the generating server... you aren't going to practically "guess" at them.
>why you'd do either instead of just serial/bigserial, which has always been my goto. Did I miss something?
So the common response is sequential ID crawling by bad actors. UUIDs are generally un-guessable and you can throw them into slop DBs like Mongo or storage like S3 as primary identifiers without worrying about permissions or having a clever interested party pwn your whole database. A common case of security through obscurity.
Your comment heavily implied (to me) scaling databases horizontally. Yes, it's not necessarily "Google scale" either, but it's a ton of extra complexity that I'm happy to avoid. But a Google employee is probably going to approach every public-facing project with the assumption of scaling everything horizontally.
With multiple servers talking to a single database, I'd still prefer to let the database handle generating IDs.
Yeah, there's too much advice jumping straight to uuid4 or 7 PKs for no particular reason. If you're doing a sharded DB, maybe, and even then it depends.
Speaking of Google, Spanner recommends uuid4, and specifically not any uuid that includes a timestamp at the start like uuid7.