Last time this discussion came up, someone shared this nice article[0] about all the different ID options. It mentions Stripe IDs, which have pretty prefixes like "prod_foo" and "user_bar". These are great for users of their API (no wondering what an ID refers to) and for internal tooling and debugging, but have to be stored as inefficient varchars, which is not great for indexing and whatnot.
So I spent the weekend creating "UPID"[1], which is like ULID but prefixable! Up to four characters for the prefix, 40 bits of time (256ms precision, but could change in a revision) and still 64 bits of randomness. So you can do pretty Stripe-style IDs without losing the benefits of a neat 128-bit UUID column. Have implemented it for Python, Rust and Postgres, maybe worth a kick around for someone!
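The bit budget works out neatly: 4 prefix chars at 5 bits each of base32 is 20 bits, plus 40 bits of time, 64 bits of randomness and 4 version bits comes to 128. Roughly like this, heavily simplified (the field order and packing here are just for illustration, the repo has the real encoding):

    import os, time

    def upid_as_int_sketch(prefix_bits: int, version: int = 0) -> int:
        # prefix_bits: 20 bits (4 base32 chars x 5 bits each)
        ts = int(time.time() * 1000) >> 8                # milliseconds -> ~256 ms units
        rand = int.from_bytes(os.urandom(8), "big")      # 64 bits of randomness
        # 20 + 40 + 64 + 4 == 128, so it still fits a 128-bit UUID column
        return (prefix_bits << 108) | ((ts & (2**40 - 1)) << 68) | (rand << 4) | version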
My understanding of these prefixed IDs is that translation happens at the API boundary - what you store in the database is the unprefixed ULID/UUID in an efficient binary column.
Then, whenever the API includes an id in a response, it adds the appropriate prefix. When receiving an id in a request, validate that the prefix is correct and then strip the prefix.
That gets you all the advantages of prefixed IDs while still keeping all 128 bits (or however many bits; you don't have to stick to UUIDs) for the actual id.
Or, to put it another way, there's no need to store the prefix in the column because it will be identical for all rows.
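Concretely, something like this at the serialisation layer (a rough sketch, the names are made up):

    import uuid

    PREFIX = "user"  # one fixed prefix per resource type / table

    def to_public_id(raw: uuid.UUID) -> str:
        # outgoing: attach the prefix when rendering the API response
        return f"{PREFIX}_{raw.hex}"

    def to_db_id(public_id: str) -> uuid.UUID:
        # incoming: validate the prefix, then strip it before touching the DB
        prefix, _, rest = public_id.partition("_")
        if prefix != PREFIX:
            raise ValueError(f"expected a {PREFIX} id, got {public_id!r}")
        return uuid.UUID(hex=rest)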
EDIT: this is not to knock your work - quite the opposite. If you do have a use case where you need rows in the same table to have a dynamic prefix, or the client takes the IDs and needs to put them in their own database, then your solution has a lot of advantages. I think what I'm getting at is that if you're using prefixes, there's a worthwhile discussion to be had about where you apply them.
My understanding from Stripe was that they're stored with the prefix as text in the DB, but I might be wrong.
As for doing the translation at the API boundary, my only gripe is that it's likely to be error-prone: every dev needs to remember to add/strip the correct prefix in every route. Of course you can add some middleware that is context-aware, but there will still be cases (eg live querying while talking to the non-tech team) where not having to translate back and forth would be great!
Anyway, appreciate the comment and definitely agree that for most teams just using a UUID and adding a bit of code is a more obvious route than using a new ID format someone just made up!
As someone who has done that exact sort of stripping at a previous job, it's not that bad. If you mess it up, your tests fail immediately, and it's easy enough to have a library that does it for you.
You are awesome! I work with a former engineering manager from Stripe, and one initiative they had me on was migrating our UUIDs to the Stripe-style prefixed strategy you mentioned.
Most of the effort, honestly, went into the compatibility layer for bridging which version of the ID was accepted by a given API, and into ensuring that this major change was rolled out universally across services (i.e., is service B going to fail if I send it a prefixed ID?).
All of this is to say: I'm hesitant to recommend migrating from UUID -> prefixed UUID in general, but I will consider starting future projects of my own with UPID.
Thanks for the comment! Yeah I really love Stripe IDs and if I was building an API-first product would definitely make that a priority, but can imagine a bit of headache if there isn't global buy-in from day one...
Would be great to hear about it if you do give it a spin. On projects with simpler domain models I can't imagine a huge benefit, but as soon as you have piles of tables it's quite nice that every row tells you up-front exactly what it is.
Oh thanks for sharing, never found that somehow. It seems that's mostly formalising the general idea of adding a prefix to a UUID, but you still end up storing them as variable length strings.
I set aside 4 bits in UPID as a version specifier, so I could theoretically add a version with longer prefixes or more/less timestamp precision if that seemed useful to anyone.
I'm a little late to the conversation, but I'm using the Elixir version of TypeID[0] in my project. While the repo linked to above doesn't provide any guidance on storing the TypeIDs, the Elixir version does this really well. You store the prefix in your schema (model) code, but the underlying database column is simply a UUIDv7.
So when you call the new function, it will generate a UUIDv7, base32 encode it, and then prepend the prefix. Then at the database layer it will translate that to a UUIDv7 for storing and translate back to the string version when loading.
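The round-trip is roughly this shape (a Python sketch of the idea, not the Elixir API, and not the exact base32 alphabet TypeID uses):

    import base64, uuid

    PREFIX = "user"  # lives in the schema/model code, not in the column

    def to_string(raw: uuid.UUID) -> str:
        # DB -> app: the column holds a plain UUIDv7; prepend the prefix on load
        return PREFIX + "_" + base64.b32encode(raw.bytes).decode().rstrip("=").lower()

    def to_uuid(typeid: str) -> uuid.UUID:
        # app -> DB: strip the prefix and decode back to a bare UUID for storage
        prefix, _, encoded = typeid.partition("_")
        assert prefix == PREFIX
        padded = encoded.upper() + "=" * (-len(encoded) % 8)
        return uuid.UUID(bytes=base64.b32decode(padded))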
Oh yeah, that looks ideal. My initial idea was along those lines, but I realised it wouldn't be possible with many (most?) languages/ORMs/adapters. Eg for Prisma (the TypeScript ORM) it would need to be built into Prisma itself.
Another small benefit of UPID is that it works even for raw SQL sessions, but this obviously requires that it's installed into Postgres, which unfortunately for most is much harder than installing an Elixir/whatever lib.
But even if you just store them as u128/UUID, a nice thing is the IDs always know what their prefix is, so eg if you dump data into a warehouse, the IDs don't lose their context and an analytics person can still find what they're looking for.
I don't think that prefixing IDs with some semantic "type" is such a great idea, because it makes IDs unnecessarily long (and inefficient) and because it's basically a mixture of the actual data and a type, but we have type systems for that. In general it's also not really an advantage. In the case of URLs like http://example.com/items/0190b12d-5ed1-72f5-964c-2252afd2da7... "items" is already the "type". To avoid confusion on the server side, the value extracted from the path could be put into some object like an ItemId (basically a typed wrapper around a UUID).
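Something like this; the wrapper gives you the same "is this the right kind of ID?" check without paying for it in every row:

    import uuid
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ItemId:
        value: uuid.UUID

    @dataclass(frozen=True)
    class UserId:
        value: uuid.UUID

    def get_item(item_id: ItemId): ...

    get_item(ItemId(uuid.uuid4()))    # fine
    # get_item(UserId(uuid.uuid4()))  # the type checker rejects this, no prefix needed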
Main reason would be that every dev must then remember to add/strip a prefix at some boundary around the DB, which I think would be pretty error prone.
A more minor benefit, maybe, is that they're natively presented with the prefix in the DB, so eg communicating with a non-technical team while running SQL queries doesn't involve constantly adding/removing prefixes...
There's no mention of UUIDv7 in the README. Based on my basic understanding, v7 uses the first 48 bits to encode the Unix timestamp in milliseconds, which makes it more useful for sort/indexing scenarios. Might be useful to compare ULID and v7 in the README to make the benefits of one over the other clearer.
You know, I didn't even look at the last commit! I assumed it was a new thing, hence its being shared now. Here's an idea for HN: when processing a repo link, auto-add (<time since last commit>) to the end of the title.
Good idea, but in my opinion that information really only needs to be added if it is significantly long ago, such as when appending a parenthesized year to a title.
Agreed. If the thing hasn't seen a commit in say 3 years, it'd be useful for idiots like me to know before they assume it's new because it's being shared now. Still, my bad :)
I think it's meant to be monotonic within one node but not cross-node (cross-node, the sort order is random within the same millisecond). The random portion of the ID is sequentially incremented within a single ms on a given node.
Afaik it depends on the generate function you use. You can let it keep track of the increment when generating, at a performance cost (atomic operations).
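The usual approach looks something like this (hand-rolled sketch; any real ULID library hides this behind its generator):

    import os, time, threading

    _lock = threading.Lock()
    _last_ms = -1
    _last_rand = 0

    def monotonic_ulid_parts() -> tuple[int, int]:
        # returns (timestamp_ms, 80-bit randomness); within the same millisecond
        # on this node the previous random value is incremented instead of redrawn
        global _last_ms, _last_rand
        with _lock:                      # the atomic-operations cost mentioned above
            now = int(time.time() * 1000)
            if now == _last_ms:
                _last_rand += 1          # the spec says to fail if this overflows 80 bits
            else:
                _last_ms = now
                _last_rand = int.from_bytes(os.urandom(10), "big")
            return _last_ms, _last_rand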
Using ULID has probably the best ROI among all the architectural decisions I've made. Generating IDs on the server before insert is helpful, and not just for distributed system issues.
Issues to think about before you switch:
- The ID includes the timestamp of its creation (trivially readable; see the snippet below). If you don't want to leak it, you should use another identifier along with it. (I've never needed this in my use cases)
- You won't be able to see the ID in its common representation format if you are using a database GUI. (I'm using Postico 2 for postgresql and the developer was kind enough to add "display as Crockford Base 32" to bytea columns making this a non-issue for me. UUIDv7 will also fix this.)
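On the timestamp point: anyone holding a ULID can read the creation time straight off it, since the first 10 Crockford Base32 characters are the 48-bit Unix timestamp in milliseconds:

    from datetime import datetime, timezone

    CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

    def ulid_created_at(ulid: str) -> datetime:
        ms = 0
        for ch in ulid[:10].upper():           # first 10 chars = 48-bit ms timestamp
            ms = ms * 32 + CROCKFORD.index(ch)
        return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

    print(ulid_created_at("01HV9ZQZ0G2VJ4X8K3M9T7R5WD"))   # some example ULID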
While looking into a replacement for auto-increment primary keys, I also encountered TSIDs / Snowflakes. At least with TSIDs you can store them easily as 64-bit integers; ULIDs are not as compact. Haven't made a decision yet though.
Keep ints for primary keys and keep all the DB magic for your internal references, then use ULIDs or TSIDs for your external keys. You keep the ability to have well-behaved indexes: win-win with minimal disk impact.
The whole point of ULIDs and newer things like UUIDv7 is precisely so you don't have to do that.
I would definitely just use a ULID or UUIDv7 as my sole primary DB key for new projects going forward. You get the benefit of ascending ordered primary keys, but the key is still globally unique so can be used as public keys.
What's the point of ULID again? People say one use case of UUIDs is that clients are able to generate the (random) primary key. Fair enough, I guess... Then they say auto-incrementing IDs are bad because they could give an indication of business activity (like how many orders were placed in a time frame). But with ULID or UUIDv7, this "benefit" is gone to some degree because the timestamp is included.
If client-side generation is not needed, then UUID is stupid to begin with, because any incremental id can be transformed ("encrypted") into an arbitrary different format (which hides the real id). Also, UUID is usually much larger: a waste of db/index space and degraded performance. But I guess if people push everything into the cloud, they don't even realize they are paying more than needed...
Conclusion: I don't understand what you people are trying to "solve". All those different UUID versions lead me to believe the issue is not the id, but developer confusion.
> But with ULID or UUIDv7, this "benefit" is gone to some degree because timestamp is included
Including the timestamp only tells someone the time the UUID was generated. Unlike incrementing numeric IDs (which can simply be subtracted from one another to measure a change in database records), there’s no way to meaningfully count change over time.
> any incremental id can be transformed ("encrypted") to arbitary different format (which hides the real id)
Easier said than done… You then have the overhead of encrypting/decrypting every client-facing ID before querying the database. You’ll also need to code workarounds in many frameworks, to bypass conventions where they expect keys in URL paths (for example).
> Also UUID is usually much larger: Waste of db/index space, degraded performance
With older UUIDs, sure, but sorted ones (like ULIDs) have consistent prefixes, allowing for very efficient B-tree indexing and querying.
> there’s no way to meaningfully count change over time.
You are right, should have finished my morning coffee first =) I guess for client-side generation it makes sense then. But how often is that really needed? I don't know...
As for the other points, I still don't buy them. For example, implementing UUIDs vs implementing the transformation: the CPU cost is negligible (you don't need a cryptographically secure cipher). UUIDs might be a little simpler to implement, but given the space and performance savings, the transformation is well worth it: 16 bytes vs 4/8 bytes (which might be stored multiple times in related tables) - it adds up and fills resources. Again, I understand most people don't seem to care about that, because they were born into cloud culture and have no clue what they are doing in terms of efficiency money/resource-wise.
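To make "the transformation" concrete, here's the sort of thing I mean; a toy keyed Feistel permutation over 64-bit IDs (round keys and round function are illustrative, not a vetted cipher):

    import hashlib

    ROUND_KEYS = [0xA5A5A5A5, 0x3C3C3C3C, 0x0F0F0F0F, 0x96969696]   # keep these secret

    def _round(half: int, key: int) -> int:
        digest = hashlib.blake2b((half ^ key).to_bytes(4, "big"), digest_size=4).digest()
        return int.from_bytes(digest, "big")

    def obfuscate(n: int) -> int:
        left, right = n >> 32, n & 0xFFFFFFFF
        for key in ROUND_KEYS:
            left, right = right, left ^ _round(right, key)
        return (left << 32) | right

    def deobfuscate(n: int) -> int:
        left, right = n >> 32, n & 0xFFFFFFFF
        for key in reversed(ROUND_KEYS):
            left, right = right ^ _round(left, key), left
        return (left << 32) | right

    assert deobfuscate(obfuscate(123456)) == 123456   # serial id in, opaque id out, and back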
You mean combining logically (or even physically) separated databases, collapsing their tuples into one? Why would you do that, and how often does that occur?
> first hand support across various databases/systems
UUIDs? Of the RDBMSs most likely to be used (MySQL, Postgres, SQLite), only Postgres has a UUID type. The others store them as strings (please no) or binary types. MariaDB and Oracle have UUID types, and SQL Server has a GUID (essentially the same thing) type, but those are all less commonly seen.
What does have universal support is integers. They scale just fine (PlanetScale uses them [0] internally), and you can use them in a distributed system – if you even need one in the first place – via a variety of methods: interleaved ranges or a central server allocating chunks are two popular methods that come to mind.
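The chunk approach in particular is only a few lines (a sketch, names made up; in practice the counter would live in a table and a claim would be one UPDATE ... RETURNING):

    import threading

    class IdChunkAllocator:
        """Central service hands out ranges; each app node allocates locally within its range."""
        CHUNK = 10_000

        def __init__(self) -> None:
            self._next = 1
            self._lock = threading.Lock()

        def claim_chunk(self) -> range:
            with self._lock:
                start = self._next
                self._next += self.CHUNK
                return range(start, start + self.CHUNK)

    allocator = IdChunkAllocator()
    node_a_ids = iter(allocator.claim_chunk())   # each node burns through its own range
    node_b_ids = iter(allocator.claim_chunk())
    print(next(node_a_ids), next(node_b_ids))    # 1 10001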
Sometimes various but similar databases get smashed together when companies merge, or when putting all the things in a "data lake". Like inventory lot IDs or transaction IDs from hundreds of stores' local databases uploaded to the corporate global DB (eg Albertsons, Kroger, Walmart).
> Again I understand, most people don't seem to care about that, because they were born into cloud culture and have no clue what they are doing in terms of efficiency money/resource-wise.
They have no clue about how computers work, full stop. Sure, they know programming languages, but generally speaking, if you ask them about IOPS, disk or network latency, NUMA, cache lines, etc. they’ll tell you it doesn’t matter, and has been abstracted away for them. Or worse, they’ll say sub-optimal code is fine because shipping is all that matters.
There is certainly a difference between sub-optimal and grossly un-optimized code. Agonizing over a few msec outside of hot loops is probably not worthwhile from an efficiency standpoint, but if it's trivial to do correctly, why not do it correctly? One recent example I found was `libuuid` in its various forms: util-linux's implementation [0] at its most recent tag is shockingly slow in larger loops. I'm fairly certain it's due to entropy exhaustion, but I haven't looked into it enough yet.
macOS uses arc4random [1] (which, for Linux, is in glibc as of v2.36, but you can get it from libbsd-dev otherwise), and it's much, much faster (again, in large loops).
I made some small C programs and a shell runner to demonstrate this [2].
> You then have the overhead of encrypting/decrypting every client-facing ID before querying the database.
As opposed to the overhead of generating a UUID? It may be fast, but it’s still overhead.
> code workarounds in many frameworks … they expect keys in URL paths
I personally despise this practice, not least because your URL then has a UUID in it (because it's always UUIDs), which is ugly; it also gives ammunition to the argument for using v4 so as to not expose time-based information. I still doubt that the latter matters, and think that it's a hypothetical dreamt up by the same kind of people who shard their database and use Kafka at tiny startups because "we might need it."
> very efficient indexing
You are of course correct that the sorted prefix helps the B+tree immensely, but you can’t get around the 16 bytes (or worse, if stored as a string) vs. 8 bytes or smaller for other types. Despite what people seem to think, this does add up, it does impact your buffer pool, and it does slow down queries.
One advantage is being able to use a more compact data type for performance-critical paths. A table's ID datatype is used not just in the given table, but also in all tables referencing it. The table ID is always indexed, and foreign keys are very often indexed as well, so this datatype "leaks" into many database indexes. If your index is compact, it allows for better RAM caching and generally faster queries. And, of course, faster writes.
Another advantage is that you don't need to "think" about using a sortable id; a numeric serial id is sortable by default. It also helps with index size and insertion speed for the primary key index, and often for foreign key indices.
Using a completely random UUID leaks no data about record creation time. It might not look like sensitive information, but generally it's better to be on the cautious side about leaking information.
Generating sortable IDs is far from a generally accepted solution. You need to find obscure libraries or write non-trivial code yourself for every language you're using. It'll be solved in time, now that UUIDv7 has become a standard, but we're not there yet. UUIDv4 is available in any language (and is generally trivial to generate).
Do ULIDs have a defined endianness for their binary representation? Because none was defined for UUIDs, and there ended up being two versions of it: big-endian and Microsoft's. That mess has created many problems.
An auto incrementing integer (for internal use) and a string (for external) will be enough for the vast majority of use cases, is simple to understand, and works well.
You’re not wrong. This is my preferred method (from a DBRE perspective) if there are concerns about time-based solutions like UUIDv7, and keys must be publicly exposed.
Hot take: I wish people just used UUIDs as UUIDs, instead of seeing the plethora of bits and going "it's free real estate". Treat it like a random identifier and move on with your life, rather than having all sorts of weird emergent properties that kinda sorta provide something valuable but not if you squint at it too hard.
You’ve maybe not had a use case for sortable uuids, so I can provide one:
Columnar databases like Clickhouse don’t have the kind of indices that you’d be used to if you only used OLTP databases like eg Postgres. Instead, you essentially get one index per table, which is the order that things are laid out on disk*
The problem is that you might want 2 access patterns, one for aggregating data within a time range, and another for looking up / joining data by its ID. So which do you use as the index? Timestamp or ID? If the ID is a uuidv4 and you index by ID, your time range queries need to scan the entire table, and vice versa.
If the ID is time-sortable (eg uuidv7), this stops being a problem, and you don’t need some kind of workaround*
* it's a little more complicated than this, but not in a way that affects the example
* eg a projection of your data with a different index
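To make the sortable-ID point concrete, a hand-rolled sketch (not a spec-complete UUIDv7, just the 48-bit-timestamp-first idea):

    import os, time, uuid

    def uuid7ish() -> uuid.UUID:
        ms = int(time.time() * 1000)
        return uuid.UUID(bytes=ms.to_bytes(6, "big") + os.urandom(10))   # timestamp-major

    ids = []
    for _ in range(3):
        ids.append(uuid7ish())
        time.sleep(0.002)

    assert ids == sorted(ids)   # insertion order == sort order
    # so ordering the table by id also clusters rows by time, and a time-range
    # predicate maps onto a contiguous range of ids - one sort key, both access patterns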
[0] https://brandur.org/nanoglyphs/026-ids
[1] https://github.com/carderne/upid