UUIDs are way overused. There is almost always a better key to use, usually a bigint for databases. If you're making some kind of leaderless distributed data store, then maybe, but even then there are other ID sharding strategies I'd go for first, depending on the constraints.
For a single database, bigints are smaller and faster, with fewer footguns.
UUIDs can be nice for an opaque public ID, however I'd still prefer something like a Sqid for space and usability.
Using a Feistel cipher and base 32 encoding at the boundaries of the system can help catch vibe-coded edge code that attempts to decode identifiers in JavaScript. It also somewhat obfuscates the cardinalities and fill rates of the tables.
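For anyone wondering what that looks like, here is a minimal sketch, assuming 64-bit non-negative IDs, a 4-round balanced Feistel network with truncated HMAC-SHA256 as the round function, and unpadded lowercase base 32 on the outside. The key, round count, and function names are all made up for illustration, and this is obfuscation rather than vetted cryptography.

```python
import base64
import hashlib
import hmac

KEY = b"demo-key-not-for-production"  # hypothetical secret, for illustration only
ROUNDS = 4                            # assumed round count
MASK32 = 0xFFFFFFFF


def _round(half: int, i: int) -> int:
    """Round function: 32 bits in, 32 bits out (truncated HMAC-SHA256)."""
    msg = bytes([i]) + half.to_bytes(4, "big")
    return int.from_bytes(hmac.new(KEY, msg, hashlib.sha256).digest()[:4], "big")


def encrypt(n: int) -> int:
    """Permute a 64-bit integer: (L, R) -> (R, L xor F(R)) per round."""
    left, right = (n >> 32) & MASK32, n & MASK32
    for i in range(ROUNDS):
        left, right = right, left ^ _round(right, i)
    return (left << 32) | right


def decrypt(n: int) -> int:
    """Undo encrypt() by running the rounds in reverse order."""
    left, right = (n >> 32) & MASK32, n & MASK32
    for i in reversed(range(ROUNDS)):
        left, right = right ^ _round(left, i), left
    return (left << 32) | right


def to_public(n: int) -> str:
    """Internal integer ID -> 13-character opaque string."""
    raw = encrypt(n).to_bytes(8, "big")
    return base64.b32encode(raw).decode().rstrip("=").lower()


def from_public(s: str) -> int:
    """Opaque string -> internal integer ID."""
    padded = s.upper() + "=" * (-len(s) % 8)
    return decrypt(int.from_bytes(base64.b32decode(padded), "big"))
```

Because a Feistel network is a bijection on the 64-bit space, distinct internal IDs always map to distinct public IDs, and the database itself only ever sees plain integers.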
I hate this so much because you can’t nicely serialise a BigInt as JSON. Using a string is nicer, but it only makes sense where int64 is used as an ID, not where it’s used as a number; and you don’t wanna have to configure this per field per query.
I'm leaning toward the view that having types in your readable serialization format is a mistake, and that they should always be an input to the (de)serializer instead.
Oh, that's much worse! The JSON string `{"a":9007199254740993}` decodes to the object `{"a":9007199254740992}` with typical JSON parsers like JavaScript's `JSON.parse`.
The browser!! The browser reads it as a Number. If your REST API returns {"id": 1324535222364012585}, for example, JavaScript will parse that as a Number from the response!!!
You can, of course, change the API so that it returns {"id": "1324535222364012585"} instead, and voilà, it will no longer be parsed as a number. Or use one of the many other workarounds people have recommended above (like adding a prefix, or using a different encoding). But why does it parse a number that's too big and, instead of throwing, silently round it without telling you????!
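The rounding is plain IEEE 754 double behaviour, so it can be reproduced outside the browser. A small sketch in Python: its `json` module keeps big integers exact by default, so `parse_int=float` is used here to imitate a parser that only has doubles. The helper names are hypothetical, and stringifying every out-of-range integer is just one possible policy.

```python
import json

MAX_SAFE_INTEGER = 2**53 - 1  # same value as JavaScript's Number.MAX_SAFE_INTEGER


def parse_like_doubles(text: str):
    """Parse JSON the way a doubles-only parser would (an imitation, not a browser)."""
    return json.loads(text, parse_int=float)


def stringify_unsafe_ints(value):
    """Recursively turn integers that a double cannot hold exactly into strings."""
    if isinstance(value, bool):
        return value
    if isinstance(value, int) and abs(value) > MAX_SAFE_INTEGER:
        return str(value)
    if isinstance(value, dict):
        return {k: stringify_unsafe_ints(v) for k, v in value.items()}
    if isinstance(value, list):
        return [stringify_unsafe_ints(v) for v in value]
    return value


# 2**53 + 1 silently becomes 2**53 once it passes through a double.
lossy = parse_like_doubles('{"a":9007199254740993}')
assert lossy["a"] == 9007199254740992

# Sending the ID as a string survives the round trip untouched.
payload = json.dumps(stringify_unsafe_ints({"id": 1324535222364012585, "n": 5}))
assert parse_like_doubles(payload)["id"] == "1324535222364012585"
```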
You can achieve this with numeric sequences too, by having a consistent step and unique offset in all your sequences. For example, if you will never exceed 16 types, reserve four bits as the type discriminant. (You don’t have to use powers of two, but it may be convenient.)
All sequences use step 16.
Type A has discriminant/offset 0, yielding IDs {0, 16, 32, 48, 64, …}.
Type B has discriminant/offset 1, mapping to IDs {1, 17, 33, 49, 65, …}.
All the way up to Type P with discriminant/offset 15 and IDs {15, 31, 47, 63, 79, …}.
This is also trivially invertible so that you can determine the type from the ID.
A more common approach is to make IDs opaque strings and put a type prefix—A0, B12, P34, that kind of thing. But this way you can keep it as a number, if you wish.
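The step-and-offset scheme above can be sketched in a few lines. The type names and helper functions are hypothetical; in a real database the same effect would come from each sequence's start and increment settings.

```python
STEP = 16  # room for 16 types, i.e. four bits of discriminant

# Hypothetical type -> offset table (only three of the sixteen shown).
OFFSETS = {"A": 0, "B": 1, "P": 15}
TYPES = {offset: name for name, offset in OFFSETS.items()}


def make_id(type_name: str, n: int) -> int:
    """n-th ID (counting from 0) in the sequence for the given type."""
    return n * STEP + OFFSETS[type_name]


def type_of(id_: int) -> str:
    """Recover the type from an ID: the offset is the remainder mod STEP."""
    return TYPES[id_ % STEP]


def position_of(id_: int) -> int:
    """Recover the position within the type's own sequence."""
    return id_ // STEP
```

For example, `make_id("B", 2)` gives 33, and `type_of(33)` gives back `"B"`.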
They just mean you catch incorrect joins more easily because there is usually no overlap in keys between unrelated tables. Using int, you’re usually going to have some shared values between two unrelated tables.
It's statistically impossible to inadvertently generate a collision using UUID keys. UUIDs are designed to be unique when generated across any computer system. Practically speaking, if you have an exactly matching pair of UUIDs from disparate systems, you have found the exact record match. The name gives a hint: "Universally unique identifier". - Not a cryptographer.
UUIDs make client code so much simpler. Just create a UUID, use it client side to create your object graph and commit or not as appropriate. No need to retrieve an incremented integer.
Yes, this matters even more if you are doing a lot of joins. Naive string UUIDs are 32 bytes as hex text (36 with hyphens), though I use binary UUIDs in the post, which are 16 bytes, compared to 8 bytes for a 64-bit int. This matters even more with SQLite, as it uses varint encoding. The upshot of all this is that your indexes take up a lot less space in memory.
Providing an ID from the client is a big advantage that's missing, though. Especially if you want a UI with optimistic rendering that's dealing with something async.
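For reference, the per-key sizes being compared can be checked directly. This only measures the in-memory representations; it does not model SQLite's on-disk varint encoding, which can make small integers smaller still.

```python
import struct
import uuid

u = uuid.uuid4()

sizes = {
    "text_with_hyphens": len(str(u)),  # canonical 8-4-4-4-12 form
    "hex_text": len(u.hex),            # same digits, hyphens stripped
    "binary": len(u.bytes),            # raw 128 bits
    "int64": struct.calcsize("<q"),    # fixed-width 64-bit integer
}
```

So a text UUID key is four or more times the size of a bigint key, and the binary form is twice the size.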
I am finding UUIDs help a lot if your primary schema consumer is an LLM.
Inappropriate aliasing of integer keys allows for silent errors in queries because it will actually return some result a lot of the time. A UUID is immune to this problem. The model recognizes its mistake a lot more reliably when previously non-empty tables start showing up empty after attempting a join.
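A toy reproduction of that failure mode, using Python's built-in `sqlite3`. The two-table schema and the deliberately wrong join are invented for illustration: with per-table integer counters the bad join still returns rows, while with UUID keys it comes back empty.

```python
import itertools
import sqlite3
import uuid

# Deliberate bug: joins on orders.id instead of orders.user_id.
BUGGY_JOIN = "SELECT count(*) FROM orders o JOIN users u ON o.id = u.id"


def int_ids():
    """Per-table counter starting at 1, like a typical integer key."""
    return itertools.count(1)


def uuid_ids():
    """Endless stream of random UUID strings."""
    while True:
        yield str(uuid.uuid4())


def buggy_join_count(new_ids) -> int:
    """Build a tiny users/orders database and run the buggy join against it."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (id PRIMARY KEY, name TEXT)")
    db.execute("CREATE TABLE orders (id PRIMARY KEY, user_id, item TEXT)")
    user_gen, order_gen = new_ids(), new_ids()  # one independent generator per table
    users = [(next(user_gen), f"user{i}") for i in range(3)]
    orders = [(next(order_gen), users[i % 3][0], f"item{i}") for i in range(5)]
    db.executemany("INSERT INTO users VALUES (?, ?)", users)
    db.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
    return db.execute(BUGGY_JOIN).fetchone()[0]
```

Here `buggy_join_count(int_ids)` returns 3 plausible-looking but meaningless matches, whereas `buggy_join_count(uuid_ids)` returns 0, which is a much louder signal that the join is wrong.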
Thanks for the benching, Anders! So grateful for the stuff you've shared over the years. Invariably, every single post has been useful and/or educational to me.
I read this post more as an illustration of the *value* of UUIDv7 as primary key over integer primary keys, in exchange for a minimal loss of read/write performance and marginally more on-disk bloat.
SQLite's automatic integer rowid primary key is a no-brainer when the SQLite application is local-only, such as an application storage format (mobile and desktop), or is never intended to grow beyond a single server instance. Basically, where each SQLite file is private to a single instance of the application.
However, if there is even an outside chance of needing to cooperate across application instances, e.g. the minimal limit case of a personal knowledge base that should seamlessly sync across a person's devices, as well as a hosted service, then a high-quality sequential random ID starts to make a lot more sense. (No-brainer arbitrary table merges / splits / remerges, de-duplication, etc.)
A random ID primary key is a bad idea, period, whether it be the UU kind or the SQ kind, or any other kind. As far as my DB knowledge goes, this class of ID destroys all tree algorithms, and we are stuck with the fact that there is no practically better way than an appropriate tree structure to group and organise a meaningful amount of data efficiently and effectively.
I've updated the article with the correct rowid alias (integer not int) so the rowid version is now 715ms. I've also added an example of rowid and a secondary index UUID4, and that also seems to be bad for performance (as although it's not a clustered index it's still random inserts into a b-tree).
Well, I expect to never need WITHOUT ROWID. And even if such an arcane situation hits my system, WITHOUT ROWID has so many ifs and buts that I'll probably elect to eat the $$$ cost of running an un-optimised normie SQLite as far as possible.
> The WITHOUT ROWID syntax is an optimization. It provides no new capabilities. Anything that can be done using a WITHOUT ROWID table can also be done in exactly the same way, and exactly the same syntax, using an ordinary rowid table. The only advantage of a WITHOUT ROWID table is that it can sometimes use less disk space and/or perform a little faster than an ordinary rowid table.
As of now, I am doing the following in my (Bitemporal data system) experiment (When will it see the light of day? Nobody knows.).
All data are globally uniquely identified by a UUIDv7. However, all tables have a `rowid` integer primary key asc (which is just an alias for SQLite's automatic integer rowid). The `rowid` is the basis for joins, and is the foreign key reference. This lets me offload some useful disambiguation work to the DB as well as have it enforce global (across data systems) record uniqueness guarantees, while retaining local (within process) query efficiency by retaining the ability to use integer rowids.
While the idealised insert performance in your bench is indeed mind-boggling, the DB Schema isn't doing anything CPU-intensive during inserts (checks, constraints, triggers etc.). My schema / query pattern yields comparatively meagre throughput, but I am happy with the ballpark it has landed in, given all the work I'm making SQLite do for me on each `assert!` and `redact!`.
cf. my dirty-but-useful-enough bench, with production-like record content:
A poor man's napkin-mathy, append-only SQLite write/read benchmark
;; Okay, I think I can live with this...
;; - "facts" table: 12M+ records
;; - single process writes to it
;; - ~ 400 transactions/second
;; - append-only table, enforced via SQLite "before" triggers
;; - "now" table:
;; - updates on every assert/redact on "facts" table, via triggers
;; - currently at "limit case": for each read it is empty, or very small, because writes do back-to-back assert/redact of the same fact
;; - gets reads from two reader threads (evenly split)
;; - ~41,000 reads/second
;; - all reads are concurrent with writes (poor man's futures)
Thanks to its oh-so-convenient automatic integer rowids, I believe one can amortise some of the other overheads of UUIDv7s for "in-between" queries, viz. indices, joins, CTEs, virtual tables etc., with appropriate schema / query design.
UUIDv7 and sequential integers are quite similar. Sequential integers disclose count and neighboring IDs while UUIDv7 discloses timestamp. Either can be a security issue in certain cases.
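To make the timestamp disclosure concrete, here is a sketch that hand-builds a UUIDv7-shaped value following the RFC 9562 field layout (48-bit Unix milliseconds, 4-bit version, 12 random bits, 2-bit variant, 62 random bits) and then reads the creation time back out. The function names are made up, and the generator skips the monotonicity tricks a production implementation might use.

```python
import os
import time
import uuid
from datetime import datetime, timezone


def uuid7_sketch(unix_ms=None) -> uuid.UUID:
    """Build a UUIDv7-shaped value; unix_ms defaults to the current time."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits, 74 of them used
    value = (unix_ms & 0xFFFFFFFFFFFF) << 80      # 48-bit timestamp in the top bits
    value |= 0x7 << 76                            # version 7
    value |= (rand >> 68) << 64                   # rand_a: 12 bits
    value |= 0b10 << 62                           # RFC variant bits
    value |= rand & ((1 << 62) - 1)               # rand_b: 62 bits
    return uuid.UUID(int=value)


def uuid7_timestamp_ms(u: uuid.UUID) -> int:
    """Anyone holding the ID can recover the creation time in milliseconds."""
    return u.int >> 80


def uuid7_datetime(u: uuid.UUID) -> datetime:
    """Same, as a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(uuid7_timestamp_ms(u) / 1000, tz=timezone.utc)
```

The same layout is what makes these IDs insert-friendly: values created in later milliseconds sort after earlier ones.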
So, UUIDv4 as a PK on a clustered index can be perfectly feasible for cases where you want to avoid disclosing stuff and row insertion performance isn’t that important.
Updated the article: there's now a section for UUID4 with rowid. It's less bad than UUID4 without rowid, but it's still about 4-6x slower than UUID7 without rowid.
Why would you use UUIDs as primary keys? Let SQLite use rowids internally (which is automatic and invisible), and have a different (indexed) column with the UUID if you need that for publishing the ID somewhere.
My rule for primary keys and IDs is simple: a sequential integer (or bigint) as the PK, and if I need to make it public, I have a GUID (or UUID) in the row too, e.g. tbl_person would have Id (int|bigint) and person_guid (UUID).
The integer ID is used for joins and lookups and such, but that's it. If I need to send anything to the frontend or outside of the app/DB, then that's the UUID.
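A minimal SQLite sketch of that rule, using Python's built-in `sqlite3`. The `tbl_person` and `person_guid` names come from the comment above; the `tbl_pet` table, the helper functions, and the choice of a 16-byte BLOB for the UUID are assumptions for illustration.

```python
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE tbl_person (
    id          INTEGER PRIMARY KEY,   -- internal key, alias for SQLite's rowid
    person_guid BLOB NOT NULL UNIQUE,  -- public identifier, stored as 16 bytes
    name        TEXT NOT NULL
);
CREATE TABLE tbl_pet (
    id       INTEGER PRIMARY KEY,
    owner_id INTEGER NOT NULL REFERENCES tbl_person(id),  -- joins use the integer
    name     TEXT NOT NULL
);
""")


def create_person(name: str) -> uuid.UUID:
    """Insert a person and hand back only the public UUID."""
    guid = uuid.uuid4()
    db.execute(
        "INSERT INTO tbl_person (person_guid, name) VALUES (?, ?)",
        (guid.bytes, name),
    )
    return guid


def add_pet(owner_guid: uuid.UUID, pet_name: str) -> None:
    """Resolve the public UUID to the internal id once, at the boundary."""
    (owner_id,) = db.execute(
        "SELECT id FROM tbl_person WHERE person_guid = ?", (owner_guid.bytes,)
    ).fetchone()
    db.execute(
        "INSERT INTO tbl_pet (owner_id, name) VALUES (?, ?)", (owner_id, pet_name)
    )


def pets_of(owner_guid: uuid.UUID) -> list:
    """Look up by public UUID; the join itself runs on integer keys."""
    rows = db.execute(
        """SELECT p.name FROM tbl_pet p
           JOIN tbl_person o ON p.owner_id = o.id
           WHERE o.person_guid = ? ORDER BY p.id""",
        (owner_guid.bytes,),
    ).fetchall()
    return [r[0] for r in rows]
```

The UNIQUE constraint gives the UUID column its own index, so the lookup at the boundary stays cheap while every foreign key remains a small integer.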
I enjoy these carefully worded posts from Anders Murphy: illustrative and informative, not opinionated and preachy. Very useful, it’s great to see the process, and ofc bookmarkable material for sharing with others.
Yes, it's writing to disk (on an M1 Mac, which has terribly slow fsync). But because of the transaction, the fsync dance is done once per batch. Each row is the id + a 50-byte data blob.
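Roughly what "once per batch" means, sketched with Python's `sqlite3` rather than the JVM setup used for the actual benchmark. The table name `t`, the helper names, and the batch size are placeholders; only the row shape (an id plus a 50-byte blob) comes from the description above.

```python
import os
import sqlite3


def make_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database with a single table and no secondary indexes."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY, data BLOB)")
    return db


def make_rows(start: int, count: int) -> list:
    """Rows of (id, 50-byte random blob)."""
    return [(i, os.urandom(50)) for i in range(start, start + count)]


def insert_batch(db: sqlite3.Connection, rows: list) -> None:
    """Insert all rows in one transaction, so there is one commit per batch."""
    with db:  # commits on success, rolls back on an exception
        db.executemany("INSERT INTO t (id, data) VALUES (?, ?)", rows)
```

Calling `insert_batch` once with 1,000 rows pays the commit cost once, instead of 1,000 times as it would with one transaction per row.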
There's only one index so there's no real write amplification. The numbers will go down as you add more data and indexes.
Small nit: uuid7 is 128 bits (16 bytes) by definition. So there’s no need to convert it to binary. It already is. Unless you’re working with a stringified version of the uuid7.
How do I know the time zone of an integer? Sure there are plenty of cases where one doesn't care, but there are also many cases where the original time zone is important.
> and you should always convert that to binary to optimize everything
I disagree. I tried this once. Now you need a client access layer to touch the DB in any context. All your console tools no longer work well or at all. If they show up in URLs you need to deoptimize them for transport.
You give up a lot of convenience for this optimization. You should be absolutely sure your design requires it before using it.
The script to create the benchmark numbers appears to be inserting 100 batches, not 10. (The benchmark numbers in the table appear to be consistent with the text, so I guess the actual script used to create them was correct.)
Is this relevant for other databases? For postgres for example, which supports concurrent writers, wouldn't sequential keys lead to contention on the page at the frontier?
That's a good question. I don't know the answer. I will say, generally you can get higher write throughput with a single writer. Even more so if you're prepared to shard along boundaries where you don't need atomic transactions.
Contention and coordination are real killers; concurrent writes that require coordination (as in postgres) often underdeliver.
An insignificant amount for the comparison (why I didn't mention it), it's a fast implementation and the JVM C2 JIT has kicked in by the time the first batch has completed.
Although the effect is not as prominent as with INSERT, SELECT and UPDATE both benefit from page cache locality, assuming rows that are stored near each other are often selected/updated together.