While looking into a replacement for auto increment primary keys, I also encount...

leetrout · 2024-07-14T06:23:15 1720938195

Keep ints for primary keys and keep all the db magic for your internal references then use ULIDs or TSIDs for your external keys and keep the ability to have well behaved indexes. Win-win with minimal disk impact

hn_throwaway_99 · 2024-07-14T07:00:01 1720940401

The whole point of ULIDs and newer things like UUIDv7 is precisely so you don't have to do that.

I would definitely just use a ULID or UUIDv7 as my sole primary DB key for new projects going forward. You get the benefit of ascending ordered primary keys, but the key is still globally unique so can be used as public keys.

Arch-TK · 2024-07-14T07:15:52 1720941352

You don't have to do that if you don't mind leaking metadata about the time when a record was created.

aabhay · 2024-07-14T07:41:52 1720942912

Fair tradeoff if you ask me

janto · 2024-07-14T10:17:26 1720952246

Sounds like premature optimization with security implications. I'm hesitant to use that.

Phil_Latio · 2024-07-14T07:58:13 1720943893

What's the point of ULID again? People say one use case of UUIDs is that clients are able to generate the (random) primary key. Fair enough, I guess... Then they say autoincremental ids are bad because it could give indication about business activity (like how many orders have been done in a time frame). But with ULID or UUIDv7, this "benefit" is gone to some degree because timestamp is included.

If client-generation is not needed, then UUID is stupid to begin with, because any incremental id can be transformed ("encrypted") to arbitary different format (which hides the real id). Also UUID is usually much larger: Waste of db/index space, degraded performance. But I guess if people push everything in the cloud, they don't even realize they are paying more than needed...

Conclusion: I don't understand what you people are trying to "solve". All those different UUID versions leads me to believe the issue is not the id, but developer confusion.

jw1224 · 2024-07-14T08:56:41 1720947401

> But with ULID or UUIDv7, this "benefit" is gone to some degree because timestamp is included

Including the timestamp only tells someone the time the UUID was generated. Unlike incrementing numeric IDs (which can simply be subtracted from one another to measure a change in database records), there’s no way to meaningfully count change over time.

> any incremental id can be transformed ("encrypted") to arbitary different format (which hides the real id)

Easier said than done… You then have the overhead of encrypting/decrypting every client-facing ID before querying the database. You’ll also need to code workarounds in many frameworks, to bypass conventions where they expect keys in URL paths (for example).

> Also UUID is usually much larger: Waste of db/index space, degraded performance

With older UUIDs, sure, but sorted ones (like ULIDs) have consistent prefixes, allowing for very efficient indexing and querying with a binary tree search.

Phil_Latio · 2024-07-14T10:47:20 1720954040

> there’s no way to meaningfully count change over time.

You are right, should have finished my morning coffee first =) I guess for client-side generation it makes sense then. But how often is that really needed? I don't know...

As for the other points. I still don't buy that. For example implementing UUIDs vs implementing the transformation: CPU cost is negligible (you don't need to use cryptographic secure cypher). UUID might be a little simpler to implement, but combined with space and performance savings it's well worth it: 16 byte vs 4/8 byte (which might be stored multiple times in related tables) - it adds up and fills resources. Again I understand, most people don't seem to care about that, because they were born into cloud culture and have no clue what they are doing in terms of efficiency money/resource-wise.

Maybe I'm just too old fir this!

zepolen · 2024-07-14T11:24:52 1720956292

On the contrary, I'd say perhaps you're not old enough, otherwise you'd have come across the advantages of:

- you can merge databases while guaranteeing ids won't conflict

- first hand support across various databases/systems

- you use it and never have to deal with ids again

sgarland · 2024-07-14T13:21:04 1720963264

> merge databases

You mean combining logically (or even physically) separated databases, collapsing their tuples into one? Why would you do that, and how often does that occur?

> first hand support across various databases/systems

UUIDs? Of the RDBMS most likely to be used (MySQL, Postgres, SQLite) only Postgres has a UUID type. The others store them as strings (please no) or binary types. MariaDB and Oracle have UUID types, and SQL Server has a GUID (essentially the same thing) type, but those are all less commonly seen.

What does have universal support is integers. They scale just fine (PlanetScale uses them [0] internally), and you can use them in a distributed system – if you even need one in the first place – via a variety of methods: interleaved ranges or a central server allocating chunks are two popular methods that come to mind.

[0]: https://github.com/planetscale/discussion/discussions/366

djbusby · 2024-07-14T18:50:00 1720983000

Sometimes various but similar database get smashed together when companies merge. Or when putting all the things in a "data-lake". Like inventory lot IDs or transaction IDs from 100s of stores local database uploaded to the corporate global DB (eg: Albertsons, Kroger, Walmart)

sgarland · 2024-07-14T15:02:25 1720969345

> Again I understand, most people don't seem to care about that, because they were born into cloud culture and have no clue what they are doing in terms of efficiency money/resource-wise.

They have no clue about how computers work, full stop. Sure, they know programming languages, but generally speaking, if you ask them about IOPS, disk or network latency, NUMA, cache lines, etc. they’ll tell you it doesn’t matter, and has been abstracted away for them. Or worse, they’ll say sub-optimal code is fine because shipping is all that matters.

There is certainly a difference between sub-optimal and grossly un-optimized code. Agonizing over a few msec outside of hot loops is probably not worthwhile from an efficiency standpoint, but if it's trivial to do correctly, why not do it correctly? One recent shocking example I found was `libuuid` in its various forms. util-linux's implementation [0] at its most recent tag is shockingly slow in larger loops. I'm fairly certain it's due to entropy exhaustion, but I haven't looked into it enough yet.

MacOS uses arc4random [1] (which for Linux, is in glibc as of v2.36, but you can get it from libbsd-dev otherwise), and it's much, much faster (again, on large loops).

I made some small C programs and a shell runner to demonstrate this [2].

[0]: https://github.com/util-linux/util-linux/blob/stable/v2.40/l...

[1]: https://man7.org/linux/man-pages/man3/arc4random.3.html

[2]: https://gist.github.com/stephanGarland/f6b7a13585c0caf9eb64b...

sgarland · 2024-07-14T13:29:01 1720963741

> You then have the overhead of encrypting/decrypting every client-facing ID before querying the database.

As opposed to the overhead of generating a UUID? It may be fast, but it’s still overhead.

> code workarounds in many frameworks … they expect keys in URL paths

I personally despise this practice, not least of which because then your URL has a UUID (because it’s always UUIDs) in it, which is ugly; it also gives ammunition to the argument of using v4 so as to not expose time-based information. I still doubt that the latter matters, and think that it’s a hypothetical dreamt up by the same kind of people who shard their database and use Kafka at tiny startups because “we might need it.”

> very efficient indexing

You are of course correct that the sorted prefix helps the B+tree immensely, but you can’t get around the 16 bytes (or worse, if stored as a string) vs. 8 bytes or smaller for other types. Despite what people seem to think, this does add up, it does impact your buffer pool, and it does slow down queries.

dagss · 2024-07-14T07:22:36 1720941756

You say win-win but what is the win of having a seperate internal ID specifically?

It is possible to do as you say ofc but what is the advantage?

vbezhenar · 2024-07-14T07:40:25 1720942825

One advantage is to use more compact data type for performance-critical paths. Table ID datatype will be used not just in the given table, but also in all tables referencing given table. Table ID is always indexed and foreign keys very often are indexed as well, so this datatype "leaks" to the many database indexes. If your index is compact, it allows for better RAM caching and generally faster queries. And, of course, faster writes.

Another advantage is that you don't need to "think" about using sortable id, numeric serial id is sortable by default. It also helps with index size and insertion speed for primary key index and often foreign key indices.

Using completely random UUID leaks no data about record creation time. It might not look like sensitive information, but generally it's better to be on a cautious side about leaking information.

Generating sortable ID is far from generally accepted solution. You need to find obscure libraries or write non-trivial code yourself for all languages you're using. It'll be solved in time, as UUIDv7 became standard, but we're not there yet. UUID v4 is available in any language (and generally trivial to generate).

drowsspa · 2024-07-14T08:35:53 1720946153

Sure, if your only goal with them is hiding meta info and you're generating your IDs into a RDBMS...

luismedel · 2024-07-14T11:32:16 1720956736

I maintain the Python version[0] of this TSID spec[1]. Using it in several projects and I'm very happy with the result.

[0] https://github.com/luismedel/tsid-python

[1] https://github.com/f4b6a3/tsid-creator