Understanding UUIDs, ULIDs and string representations (sudhir.io)
213 points by sudhirj on Jan 5, 2022 | 100 comments



I really liked this article but I feel it misses one somewhat important point about using incremental numbers: They are trivially guessable and one needs to be very cautious when exposing them to the outside world.

If you encounter some URL like https://fancy.page/users/15 chances are that the 15 is a numeric ID and 1 to 14 also exist. And the lower numbers tend to be admin accounts, as they are usually created first. This might be used by an attacker to extract data or maybe gain access to something internal. One could argue that using UUIDs only hides a security hole in this case, but that's better than nothing, I guess.


Beyond the obvious and important security implications of incremental numbers, there is one other major problem with them.

They make life hell for database clustering, merges and migrations.

In addition, on a more minor level, in a client-centric (apps, browser JS etc.) world, the use of incremental numbers is an unnecessary pain point. If you use UUIDs, the client can generate its own without the need for a call back to the API (unless necessary in context, obviously).

Frankly, IMHO, in the 21st century the use of incremental numbers for IDs in databases thoroughly deserves to be consigned to the history books. The desperate clutching-at-straws arguments that went before (storage space, indexing etc.) are no longer applicable in the modern database and computing environment.


This greatly overstates the benefits of UUIDs, and ignores the myriad ways in which they demonstrably have poor properties in real systems. I've worked on several large systems designed around UUIDs at scale that we had to later modify to not use UUIDs due to their manifest issues in practice. And then there is the reality that v3/4/5 UUIDs are expressly prohibited in some application contexts due to their defects and weaknesses.

Also, sequence generators are a non-problem in competent architectures, since you can trivially permute the number such that it is both collision-free and non-reversible (or only reversible with a key).

It is still common to use structured 128-bit keys in very large scale databases, and it is good practice in some cases, but these are not UUIDs because the standardized versions all have problems that can't be ignored. This leads to the case where there are myriad non-standard UUID-like identifiers (in that they are 128-bit) but not interoperable.


> I've worked on several large systems designed around UUIDs at scale that we had to later modify to not use UUIDs due to their manifest issues in practice.

As I said in my later comment, let's put the "don't use UUID because high scale" argument to one side, shall we?

Because, per that comment:

    1) Vast majority of database implementations are not remotely "high scale". Most database implementations could use UUIDs and nobody would ever notice.
    2) "High scale" brings specific environment concerns, not only related to databases
Seeking to bring "high scale" mentality lower down the food chain only results in people implementing premature optimisation (and vastly overspending on over-complex cloud implementations).


Is it really true that concerns around UUIDs as primary keys are wholly irrelevant? Maybe I'm working off outdated information but in high scale environments there are a lot of downsides primarily related to the random write patterns into B-trees causing page splitting and things like that.


> Is it really true that concerns around UUIDs as primary keys are wholly irrelevant?

I would say yes, with the options we have today with modern compute.

We live in a world where compute is powerful enough to enable Let's Encrypt to issue SSL certificates for 235 million websites every 90 days off the back of a single MySQL server[1].

For high scale environments there are also other options such as async queues and Redis middleware.

Database technology itself is also evolving, and the degree of measurable downside is less than it might have been 10 years ago.

I would still argue that for the vast majority of people, UUIDs are the way to go. I would certainly urge caution against the premature optimisation involved in the "but high scale" argument. Sure, things MIGHT be noticeable at high scale, but I think it's fair to say most people are not operating at anywhere near enough scale for it to matter, and should probably just use UUIDs and cross the "high scale" bridge if/when they ever come to it.

Finally, it's also worth pointing out that all the hyperscalers use UUIDs or other unique identifiers widely in their infrastructure and APIs, all of which must inevitably be tied into a database backend.

[1] https://letsencrypt.org/2021/01/21/next-gen-database-servers...


You're right that random unordered writes are worst case for an indexed (ordered) key. ULID and ordered UUIDs (v6+) help solve this.

For dimensions, UUIDs are usually fine since writes are infrequent. For facts or timeseries data, ordered IDs are more efficient.


Yes, ordered UUIDs help solve this. The unfortunate deep dive rabbit hole here is how UUIDs are sorted in different databases and making sure your UUID generation matches that.

One fun instance I worked with directly: Microsoft's SQL Server made some interesting assumptions based on UUIDv1 and sorts the last six bytes first. In UUIDv1 those would have been the MAC address, and clustering by originating machine first makes some sort of sense in terms of ordered writes. The ULID timestamp is coincidentally also six bytes (48 bits), so (ignoring the endian issues of the other "fields" in the UUID) you can get Microsoft's SQL Server to order UUIDs in mostly the same way as their ULID representation by just transposing the first six bytes to be the last six bytes, as sketched below.
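
A minimal sketch of that transposition (a hypothetical helper; as noted, it ignores the endian handling of the other UUID fields):

    def ulid_bytes_for_sql_server(ulid: bytes) -> bytes:
        # SQL Server's uniqueidentifier comparison looks at the last six
        # bytes first, so move the 6-byte ULID timestamp from front to back.
        assert len(ulid) == 16
        return ulid[6:] + ulid[:6]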

Unfortunately UUID v6+ won't sort well in Microsoft SQL Server's sort order today.

Other databases will vary on what you need to do to get sortable UUIDs.

A good reference for all of this deep rabbit hole was Raymond Chen's blog post on the many GUID/UUID sort orders just in Microsoft products: https://devblogs.microsoft.com/oldnewthing/20190426-00/?p=10...

(With the fun punchline at the bottom being the link to the Java sort order. My sympathies to anyone trying to sort UUIDs in an Oracle database.)


I speak from bitter experience - UUID for PKs is not a good idea out of the box. Both writing and reading took very significant penalties. I did not design that particular system, but I had to figure out why relatively simple queries took minutes to return.


> I did not design that particular system ... why relatively simple queries took minutes to return

The road of databases is paved with many such bodies.

Whether it is developers treating databases like some blackbox dumping ground, or designing generic "portable" schemas, or people who don't know SQL writing weird long convoluted queries.

Many people are quick to blame "the database", but 99% of the time it's the fault of those who designed the schema and/or the queries that run on it.

I think your statement "UUID for PKs is not a good idea out of the box" is unfair and too broad a brush. Without knowing the exact details of every bit of your environment (from database hardware upwards), it's not possible to accept such a generic statement as fact.


There have been / still are tons of attacks where you can see other people's data by just incrementing and decrementing the ID in the URL.

Will see if I can add a section about security implications; there's a similar time-based argument to be made for ULIDs as well — you don't want to inadvertently expose a timestamp in some cases.


I once discovered by accident that a big hospital in my big city used incremental IDs for loading exams results (one of my exams wasn't loading while the others were, so I just opened dev tools and 2 minutes later I noticed I could access 500_000 exams of random people just by changing something like /exam/ID).

UUIDs could have prevented the leak even if they still managed to completely disregard any authentication logic on the backend.


It would be similarly feasible for a competent person to do the same for UUIDs, at least the RFC4122 timestamp-derived UUIDs which many people and libraries use. Of the 128 bit field, several sections are constant over variables like (i) the identity of the machine and (ii) the process, and the RFC describes exactly what those fields are. For the timestamp field, you would then guess at timestamps near to the original UUID's timestamp.

It's not as easy as incremental IDs, without doubt, but it's worth correcting the idea that (most) UUIDs are designed to provide security in this situation, beyond maybe a quarter-layer of defence in depth. In fact, the RFC explicitly says:

> Do not assume that UUIDs are hard to guess; they should not be used as security capabilities (identifiers whose mere possession grants access), for example.

https://datatracker.ietf.org/doc/html/rfc4122#section-6


Very true and important to state: UUIDs on their own at most provide obscurity, not security. Can the MAC address of the host that is used for some versions be extracted/read from the UUID, or maybe inferred by observing a number of UUIDs?


Yup, it can absolutely be extracted. It's not hashed or anything like that, it's just a sequence of fields in the order that the spec gives. It's really not even 'extract', it's more just 'read'.

I think people may be misled by the fact that UUIDs are frequently hex-encoded prior to being sent over the wire (or even, stupidly, in the database). It looks like a hash, but it's very much not one.

Edit: This is all referring to RFC 4122, to be clear. It's entirely possible that there are some other UUID schemes out there which do hash their contents.
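
For example, with Python's standard library (a minimal sketch; note that on hosts without a usable MAC, uuid1 substitutes a random node):

    import uuid

    u = uuid.uuid1()         # v1: timestamp + clock sequence + node (often the MAC)
    print(f"{u.node:012x}")  # the node is just the last 48 bits, read directly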


Definitely didn't know that, thanks for that insight, really appreciate it! I always just assumed they were hashed but never really bothered to check. V4 shouldn't have this problem, right?


That can't be achieved if UUIDv4 is used.


This is the problem with incremental numbers:

https://en.wikipedia.org/wiki/German_tank_problem
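
For intuition, the classic minimum-variance unbiased estimator: if an attacker samples k IDs and the largest seen is m, the estimated total is m + m/k - 1. A quick sketch with made-up leaked IDs:

    def estimate_total(ids):
        # German tank estimator: m + m/k - 1
        m, k = max(ids), len(ids)
        return m + m / k - 1

    print(estimate_total([19, 40, 42, 60, 78]))  # ~92.6 items issued so far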


More trivially, it also gives insights into your business if I can determine the upper bound of your resources by trial-and-error guessing. If the highest user ID is 94, that may be a (hopefully unwarranted!) red flag to potential customers or investors.


Using https://hashids.org is sensible for mapping to external IDs.
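
A minimal sketch with the Python hashids package (the salt is a hypothetical app secret; without it the mapping isn't practically reversible):

    from hashids import Hashids

    hashids = Hashids(salt="app-specific secret")  # hypothetical salt
    public_id = hashids.encode(15)   # opaque short string instead of a bare 15
    (internal_id,) = hashids.decode(public_id)
    assert internal_id == 15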


I recently learned about this. We're thinking of using them with our project really soon. Are there any gotchas to be aware of with it?


The OP proposes using `ULID`s, which are the same number of bytes as UUIDs, but have an initial timestamp component (ms since epoch) plus a subsequent random component. While these are sequential (not exactly "incremental"), so given two of them you can know which came first, they aren't really "guessable": you'd need to guess not only an exact timestamp (not infeasible, if more of a challenge than with incremental integers), but also a large subsequent random component (infeasible).

Apparently there are some proposals to make official UUID variants with this sort of composition too, which some threads in this discussion go into more detail on.


They aren't guessable, except for ULIDs generated by the same process in the same millisecond. To keep chronological order even within the same timestamp, ULIDs generated within the same millisecond become incremental. This can become relevant, for example, when an attacker requests a password reset for himself and the victim simultaneously.


Interesting, I learned about ULIDs for the first time from this article, which says: "The remaining 80 bits [in a ULID] are available for randomness", which I read as saying those last 80 (non-timestamp) bits were random, not incremental. But this was misleading/I got the wrong idea?

Going to the spec [1]... Yeah, that's weird. The spec calls those 80 bits "randomness", and apparently you are meant to generate a random number for the first use within a particular ms... but on second and subsequent uses you need to increment that random number instead of generating another random number?

Very odd. I don't entirely understand the design constraints that led to a section still called "randomness" in the spec even though its contents are not always random.

[1]: https://github.com/ulid/spec


From what I gather this is done to preserve the sort order. All calls within the same millisecond will get the same timestamp component, so that can't be used to sort the ULIDs. So the "random" part is incremented, and the resulting ULIDs can still be sorted by the order of the function calls. This wouldn't be possible if the random part were truly random. I'm not sure this is a good idea, but that is what I understood from the spec.
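
A minimal sketch of that monotonic behavior (single-threaded, and ignoring the overflow case the spec calls out):

    import os, time

    _last_ts, _last_rand = 0, 0

    def ulid_parts():
        # Within one millisecond, increment the "random" part instead of
        # redrawing it, so IDs still sort in generation order.
        global _last_ts, _last_rand
        ts = time.time_ns() // 1_000_000
        if ts == _last_ts:
            _last_rand += 1
        else:
            _last_ts = ts
            _last_rand = int.from_bytes(os.urandom(10), "big")  # 80 bits
        return _last_ts, _last_rand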


It’s not clear why they don’t just use a finer-grained timestamp component (e.g. nanoseconds) and increase that number if two events are within the same clock interval, and keep the random component random.


The reference implementation is written in Javascript and I don't think that provides a reliable way to get timestamps this fine grained.


The underlying clock doesn’t need to be fine-grained; only the target representation needs to be fine-grained enough for the additional increments inserted by the code that generates the IDs. Effectively, the precision of the timestamp part needs to be large enough to support the number of IDs per second you want to be able to generate sustainably.


Also, the spec notes there is an entropy trade-off in using more bits for the timestamp. More timestamp bits means fewer random bits, because ULIDs constrain themselves to a combined total of 128 bits (to match GUID/UUID width).


This is often dealt with trivially using collision-free hashing. It exports a number in the same domain as the sequence (i.e. 64-bit sequence -> 64-bit id) but is not reversible and is guaranteed to be unique.


This is true with identifiers which are already random, but unless you're doing something like keyed hashing, a naive implementation of, say, SHA256(predictable_id) isn't going to solve this problem against a determined attacker. I'd like to learn a bit more about what you're discussing here.


For example, running the sequence generator through AES rounds. This is keyed, very fast if the hardware supports it, and permutations of types smaller than the block size are collision-free and non-reversible[0] if the AES block is setup suitably.

In other cases, a 128-bit key is simply encrypted (conveniently being the same block size as AES), which allows you to put arbitrary structure inside the exported key.

[0] http://www.jandrewrogers.com/2019/02/12/fast-perfect-hashing...
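
A minimal sketch of the encrypted-sequence variant, assuming the pyca/cryptography package: because a block cipher is a keyed permutation, distinct sequence values always map to distinct 128-bit outputs:

    import os
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    KEY = os.urandom(16)  # in practice: a fixed secret, stable across restarts

    def external_id(seq: int) -> str:
        # Collision-free over the full 128-bit domain, and not reversible
        # without the key.
        enc = Cipher(algorithms.AES(KEY), modes.ECB()).encryptor()
        return (enc.update(seq.to_bytes(16, "big")) + enc.finalize()).hex()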


I was not aware of this!

Do you have a favorite link for more information on this?


There's also a proposal for UUIDv6-8, lexicographically sortable variants.

https://datatracker.ietf.org/doc/html/draft-peabody-dispatch...


To summarise the differences:

* UUIDv6 - sortable, with a layout matching UUIDv1 for backward compatibility, except the time chunks have been reordered so the UUID sorts chronologically

* UUIDv7 - sortable, based on nanoseconds since the Unix epoch. Simpler layout than UUIDv6, and more flexibility about the number of bits allocated to the time part versus sequence and randomness. The nice aspect here is the UUIDs sort chronologically even when created by systems using different numbers of time bits.

* UUIDv8 - more flexibility for layout. Should only be used if UUIDv6/7 aren't suitable. Which of course makes them specific to that one application which knows how to encode/decode them.

UUIDv7 is thus the better choice in general.

(I recently wrote Python and C# implementations - https://github.com/stevesimmons/uuid7 and https://github.com/stevesimmons/uuid7-csharp)
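
For a feel of the general shape, a minimal sketch of a millisecond-precision v7-style ID (48-bit Unix-ms timestamp up front, then version and variant bits, the rest random); the drafts allow other time precisions, so this isn't the exact -02 field layout:

    import os, time, uuid

    def uuid7_like() -> uuid.UUID:
        ms = time.time_ns() // 1_000_000
        b = bytearray(ms.to_bytes(6, "big") + os.urandom(10))
        b[6] = (b[6] & 0x0F) | 0x70  # version 7 in the top nibble of byte 6
        b[8] = (b[8] & 0x3F) | 0x80  # RFC 4122 variant bits
        return uuid.UUID(bytes=bytes(b))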


I see this pop up from time to time and it looks interesting. Does anyone know if there's actual progress on seeing this get adopted? I don't have any background on how to evaluate or how seriously to take such a draft... Is this draft under serious debate by those who could choose to adopt it, or is it just written by someone with high hopes of throwing a draft out there and getting some attention for their idea?


Brad Peabody did the original -00 draft, which was discussed as an FYI at an IETF meeting in March 2020. See [1], around 50 lines from the bottom.

Kyzer Davis has since submitted two further revisions -01 and -02 in April and October 2021. See history in [2].

The current -02 draft is due to expire in April 2022. Presumably Kyzer Davis will try to get it discussed before then.

The GitHub repo tracking these drafts is https://github.com/uuid6/uuid6-ietf-draft/.

[1] https://datatracker.ietf.org/meeting/107/materials/minutes-1...

[2] https://datatracker.ietf.org/doc/draft-peabody-dispatch-new-...


The whole section on serial number IDs is a bit FUDy in my opinion, especially this:

> If you suddenly have a million people who want to buy things on your store, you can't ask them to wait because your sequence generator can't number their order line items fast enough. And because a sequence must store each number to disk before giving it out, your entire system is bottle-necked by the speed of rewriting a number on one SSD or hard disk — no matter how many servers you have.

There’s maybe a handful of apps in the world that see so much traffic that this would be a problem. Unless you expect to reach Amazon-scale anytime soon, or need distributed ID generation (like generating them in mobile apps or SPAs), just starting with a simple BIGSERIAL (or rather, BIGINT GENERATED BY DEFAULT AS IDENTITY as the state of the art is) will be good enough to get started.

You can always add complexity to your app later. Taking it away once added is much more difficult.


UUIDs aren’t particularly complex. In a lot of ways, they’re more convenient than using an auto-incrementing identifier. For instance, if you’re using DynamoDB, getting an incrementing integer is way more complicated than a random UUID.


Well, they do have opportunities for screw-ups. We recently had some bad data loss issues because part of an application was using a buggy UUID generator that produced lots of collisions. If you google for UUID collision bugs, it's not an uncommon occurrence.

Besides, UUIDs have fragmentation issues. I’d use ULID if I had need to generate IDs in a distributed fashion.


Think you’re right about the FUD part, will tone it down. Still think numbers are a horrible idea, but this shouldn’t be a consideration for most people.


Yes, this was hyperbole. Durably generating tens of millions per second is entirely within the realm of possibility. However, events generated by intentional human action, like buying something or sending a text message, historically maxes out at hundreds of thousands per second.


Sequence numbers don't scale to distributed databases and distributed data creation though.

"Handful" is wrong. Any major system will start to run into this as soon as you start saying the word "scale" in design meetings.

UUIDs can also provide room for encoding other information, like the type of the object, where it was created, etc., since the MAC address is often integrated into the UUID.


“Handful” is referring to cases where “your sequence generator can't number their order line items fast enough” is true. And I stand by that.

And if you want “other information”, good database design would put that other information in a column of its own.

Distributed databases have their place. But the tradeoffs they bring are often not worth it for your 1.0/MVP app.


I'm liking ULIDs more and more recently: as a UUIDv4 is random, insert performance is going to be subpar compared to bigserial, but going to a ULID, which includes the time, allows slightly quicker insert performance. It also allows for some tiered storage architectures, where if you know the ULID, you know where to look (approximately).


Depends on your storage system. For the one I work with most, a common prefix on your primary keys will hurt performance because it causes hotspotting. A UUID primary key would be the best case because it optimally shards writes. A ULID would be the worst case -- I would need to store it with the bits reversed.


And conversely, if storage is chunked (e.g. parquet files on S3), having time-ordered uuids may turn it into essentially an append-only log-structured store.

Here hotspotting is the aim, since it lets you efficiently prune query plans from index scans to direct reads of the right chunk.


Why not UUIDv1?


This source provides some insight on the ‘risks’ of collisions:

http://www.h2database.com/html/advanced.html#uuid

If you generate 70 trillion UUIDs, the odds of two of these being a duplicate are approximately the same as the chance of one person being hit by a meteorite this year.
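
As a rough check of that claim, a birthday-bound approximation over the 122 random bits of a v4 UUID:

    n = 70e12                # 70 trillion UUIDs
    p = n**2 / (2 * 2**122)  # P(collision) ~ n^2 / (2 * number of possible values)
    print(p)                 # ~4.6e-10, i.e. about one in two billion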

To put it bluntly: adding structure like timestamps, MAC addresses or domain IDs to UUIDs to avoid collisions is really not useful, and considering the downside that it leaks that information, it's a bad idea.


For anyone else wondering: ULID = Universally Unique Lexicographically Sortable Identifier


Thanks. I read the article to find out what this meant. ULID was mentioned 16 times but never spelled out.

UUID wasn’t either but I at least knew that.


We use K-Sortable Globally Unique IDs: https://github.com/segmentio/ksuid

Some differences:

128-bit randomly generated payload (instead of 80 bits for ULIDs).

Only 32-bit time precision, but that's wall-clock time anyway.

Base62 encoded.


Author here, self-posted. AMA.


I think Base85 [0] warrants a mention for those needing to minimize the string representation length of their UUIDs.

A year ago or so we had to store a reference to one of our entities into a legacy third-party system, which used char(20) as the column size and of course couldn't be changed.

Since Base85 encodes a UUID as exactly 20 ASCII characters, it saved me from having to add an extra indirection. (Also, and to be honest mainly, from giving ammo to our CEO who had never liked UUIDs)

Of the various 85-character encodings, I thought Z85 [1] was the best one. It's not URL-safe, but it's safe for copy-pasting into queries, source code, XML, JSON, CSV, etc.

[0] https://en.wikipedia.org/wiki/Ascii85

[1] https://rfc.zeromq.org/spec/32/
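
For illustration, Python's stdlib b85 codec (a different 85-character alphabet than Z85, but the same 16-bytes-to-20-characters expansion):

    import base64, uuid

    u = uuid.uuid4()
    s = base64.b85encode(u.bytes)  # 16 bytes -> exactly 20 ASCII characters
    assert len(s) == 20
    assert uuid.UUID(bytes=base64.b85decode(s)) == u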


Thank you! I was familiar with Ascii85 but not Z85. I just added it to the "Other projects" section of my base converter. https://convert.zamicol.com/


Nice, thanks, haven’t seen this before. Can’t say I like the symbols in there, but will add it as a reference.

Should probably add base58 as well; Bitcoin uses it.


I just want to say that I really enjoyed reading this article. It's among the clearest, most accessible writing about a technical subject that I've encountered in a while.


Great article! I hadn't heard of this, but it sounds like something I might suddenly be using in the near future! Also, I loved this line: "Given the way we're going, humanity in its present form isn't likely to exist them, so when this becomes an issue it'll be somebody else's problem. More likely something else's problem."


Great write up!

Do you know if lex62 has any performance disadvantage versus bases that are powers of 2 (32, 64 etc.)?

I always assumed the conversion to 2^n bases could be done more efficiently.


Hmm. I haven’t actually gone low level enough to answer that, but I suppose it’s possible - it doesn’t seem likely to be significant in a web application server scenario, though. The random jitter on the LAN line to the DB is probably going to be more significant.
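
For intuition on why power-of-two bases can be cheaper: their digits are plain bit slices, while base62 needs repeated arbitrary-precision division. A rough sketch:

    B62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    B32 = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"  # Crockford's base32 alphabet

    def encode62(n: int) -> str:
        out = []
        while n:                  # divmod on the full 128-bit value each step
            n, r = divmod(n, 62)
            out.append(B62[r])
        return "".join(reversed(out)) or "0"

    def encode32(n: int) -> str:
        out = []
        while n:                  # each digit is just a 5-bit mask and shift
            out.append(B32[n & 31])
            n >>= 5
        return "".join(reversed(out)) or "0"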


Was just looking for a more human-readable representation of my Postgres UUIDv4 primary keys today - the Douglas Crockford alphabet is perfect for that. Thanks!


I was excited to see that as well. It still is quite long when used in a URL


First time hearing about ULIDs. The locality is interesting, but they leak information about when they were generated down to the millisecond, which could lead to problems if combined with other issues. I'd be wary of using them client-side.


Yeah, I find they’re best used for data where the creation time is public information, like chat messages or logs


Can someone expand on the practical consequences of leaking when things were created, down to the millisecond?


Consider a timing attack: https://en.wikipedia.org/wiki/Timing_attack

Let's say a UUID comes back with an error message. This could be used to figure out how long it took to generate the error. That could tell you if a particular resource is cached, even if you don't have access to that resource.

Timing attacks are usually pretty creative. It's hard to predict how extra timing information could be misused.


You might not want to expose joining dates, for example, or exactly when something happened. That kind of info can leak unintentionally if someone looks at the ID and you didn’t want it exposed.


"48 bits is enough to represent a millisecond-precision Unix timestamp (the number of milliseconds since an epoch at the beginning of Jan 1, 1970) till the year 10889 AD. Given the way we're going, humanity in its present form isn't likely to exist them, so when this becomes an issue it'll be somebody else's problem. More likely something else's problem."

Lol this is an odd bit of conjecture to interject.


I liked the article, but I think you really need to be conscious of the terrible performance of UUIDv4. ULID is better, but not perfect either. My go-to article on the subject is https://www.2ndquadrant.com/en/blog/sequential-uuid-generato...


In my experience, sequential numeric IDs are usually absolutely fine. The problems identified in the article are either phantoms, or easily overcome. Let's go through them.

"When using a numeric primary key, you need to be sure the size of key you're using is big enough" - as the article itself notes, 64 bits should be enough for anyone.

"That number that you first pulled out and didn't use is lost forever. This is a common error with sequences — you can never assume that the highest number you see as an ID is implicitly the count of the number of items or rows." - true, so always treat numeric IDs as opaque, like UUIDs. The fact that they are actually sequential is an implementation detail.

"You can copy a table over to a new database and forget to copy over the sequence generator object and its state, setting yourself up for an unexpected blowup." - doing weird manual operations on your database offers a wide range of ways to screw up, far beyond this. Just don't do this? When you copy a database around, you need to copy the whole thing, to preserve its integrity. If you're creating a frankenbase, then of course you need to exercise caution. If you're really worried, on app startup, check that the sequence's next value is higher than any existing ID, and crash if it isn't.

"Having a single place where identifiers are generated means that you can add data only as fast as your sequence generator can reliably generate IDs." - this is a real problem, but it's easily overcome by batching. Rather than using hitting the database for a new ID every time you need one, the application can occasionally hit the database to acquire a range of IDs, keep that range in memory, and use them as needed. You might be able to build batching on top of the database's built-in sequence machinery, or you might not, or you might prefer not to even though you can. At worst, it means adding a table to the database to track sequence values. Scaling is then accommodated by tuning the batch size and scope (per instance, per thread, etc).

"On a scaling-related note, numeric IDs limit your sharding options" - the approach i have seen is to use batched sequences, and move the sequence machinery out of the database that is being sharded, and into a separate service, or its own database. Application instances can all pull batches of IDs from the shared service or database, which ensures that they are non-overlapping.

The nice thing about numeric IDs is that you can start with the simple and easy approach, a standard database sequence, and then migrate to more scalable generation strategies as your database grows, without having to change your data model. The problem of generation is nicely encapsulated.


Those are ways to make sequential IDs work. But there are, in the complex case, a lot of moving parts you don't need at all with UUIDs, no?

> The nice thing about numeric IDs is that you can start with the simple and easy approach, a standard database sequence, and then migrate to more scalable generation strategies as your database grows, without having to change your data model.

But with UUIDs you don't have to "migrate to a more scalable generation strategy" as you grow, you've started with a simple and easy approach that just keeps working as you grow, no? It would be odd to suggest that's an advantage to sequential IDs.

Or is the suggestion that UUIDs aren't as simple and easy as sequential IDs? I'd say they are just as simple and easy to implement (most DBs will do it for you with no more trouble than a sequence); but they are, it's true, a bit more inconvenient to use as a human-friendly ID, whether in developer debugging or URLs. That is, I'd agree, their main downside.


Cockroach DB supports ULID since v21.2.3 using gen_random_ulid().


Lack of native support in DBs and in libraries is currently the only disadvantage I have found after using ULIDs for a while now. But adding a single library to generate ULIDs (front and back end) seems to be good enough for now.


I haven't used SQL Server directly in a while, but IIRC, we can use the identity() function, which does 2 things: 1. tells the system to start counting at a certain number, and 2. tells it to increment the next number by a certain step (e.g. +2).

I do agree that serially incrementing numbers may someday be relegated to the history books, although this is such a baked-in function that, lacking any other prebuilt functions, old-time SQL developers simply reach for it.


> This entire idea of using random IDs assumes that your computers can generate random numbers that are random and unpredictable — this isn't an easy problem, and there's a lot of research being done in the field.

Can anyone here provide some hints as to how you can verify that the randomness is actually random?

If I was to create some blackbox and I claimed it generated 100% random numbers, is it possible to disprove my claims?


If trickery is the goal, then you could, for example, hash a seed using a strong hash function, output the hash bytes, and feed the hash back in. You have random-looking data that should pass statistical tests (dieharder et al.).
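
A minimal sketch of that trick:

    import hashlib

    def backdoored_stream(seed: bytes):
        # Passes statistical tests, yet fully reproducible by anyone who
        # knows the seed.
        state = seed
        while True:
            state = hashlib.sha256(state).digest()
            yield state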

This question arises with hardware RNGs - could some of them just output random-looking data that the dark powers know the seed of?


Sure, nothing is 100% random so that is not possible. At best it's difficult to predict.


YouTube seems to use 10 characters for their video id, does anyone know what is the tech behind that?


> YouTube seems to use 10 characters

FYI no commitment is made to that by YouTube.

The API merely defines it as a "string" with no further commitment as to length, format or characters [1]

As for the "how?" there is an unsubstantiated answer based on reverse-engineering posted on SO[2]

    [1] https://developers.google.com/youtube/v3/docs/videos#id
    [2] https://webapps.stackexchange.com/a/101153


There’s a way of using some number theory to generate a random-looking sequence of short codes without repeating them; I wonder if they do something like that. I implemented the technique in Rust here: https://github.com/paulgb/tiny_id
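
One classic construction along those lines (not necessarily what tiny_id uses) is a full-period LCG: by the Hull-Dobell theorem, x -> (a*x + c) mod m visits every value in [0, m) exactly once per cycle when m is a power of two, a % 4 == 1, and c is odd:

    def id_stream(m=2**30, a=1664525, c=1013904223, x=0):
        # Emits each of the m possible IDs exactly once before repeating.
        while True:
            x = (a * x + c) % m
            yield x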


I remembered it being 11 characters, and indeed: https://www.youtube.com/watch?v=dQw4w9WgXcQ.


Most people are familiar with time-based and random UUIDs, and I had sort of known about v5 UUIDs, but recently used them to get consistent identifiers from an input value… super useful, because you can do `uuidv5($namespace_uuid, data)` and get the same UUID every time.
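
Python's standard library exposes exactly this; a quick sketch (the namespace UUID here is a hypothetical app-level one):

    import uuid

    ns = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")  # hypothetical namespace
    assert uuid.uuid5(ns, "user:42") == uuid.uuid5(ns, "user:42")  # deterministic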


Isn’t that just a hash?


If you have data-driven tests that compute UUIDs along the way to their results, plopping in a UUID generator that is a predictable hash algo is a good thing.

This is similar to C rand(), which you wouldn't want to use in production but is useful when generating the same test data for the same seed every time.


It is a hash, but in the format of a UUID. This is useful if you are storing it in the database, since there is usually a dedicated data type for UUIDs. Also, you can mix UUID v4 and v5 IDs freely.


Think an MD5 hash is the exact same length as a UUID, so you can just put hyphens in. Same thing for truncated SHA-N :-P
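
Indeed; that's essentially what v3 UUIDs are, minus the version/variant bits. A quick sketch:

    import hashlib, uuid

    digest = hashlib.md5(b"some data").digest()  # 16 bytes, same size as a UUID
    print(uuid.UUID(bytes=digest))               # hyphenated, hash-derived "UUID"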


One of my beefs with UUIDs is indeed with handling them in URLs. Very unwieldy...


Yeah, I wrote the shortuuid libraries to help make nicer URL strings in base62 (or any other alphabet of your choice) for that. Added links at the bottom of the article.


All this stuff about collisions and avoiding them, even though they will never happen, feels like a PHB compliance issue.

“Great work Geoff! One question: what’s the probability of two transactions having the same ID?”

“It is very low”

“Hmmm. But it’s not zero?”

“It’s so low that it practically is zero.”

“But it’s not technically zero? This company wasn’t built on taking chances, son! Come back when your product complies with our corporate zero-risk policy.”


This is when any tech person worth their salt should lean over and say:

"Okay boss, we _could_ do that, but so you know what that would mean?"

And then you tell them about cosmic rays, bitflips and redundant computing and what that would mean for the cost of IT at your company.

"... or, we could just use UUIDs like nearly everybody else. I will spend a few days thinking about what would happen in case of a UUID collision and create a mechanism that adverts the worst consequences if you want. That should be enough in my judgement, we could also ask $collegue what if they agree with that conclusion"

On a side note: bosses who think they just need to be convincing enough in order to change physics are the worst. Some bosses expect NASA-level solutions for no or minor resource cost at all.


In my experience, saying stuff like "the chance of a UUID collision is about the same as your car being struck by lightning and hit by a meteorite every day for a year" works on some non-STEM people.

But there are a lot more for whom "look, Azure uses UUIDs for their VMs, and it's good enough for them" is somehow more convincing.


Or be helpful and do the rounding for your boss instead. The difference between "zero" and "practically zero" is only interesting to academia in this case.


During the Boeing 737 Max software fix, the FAA required Boeing engineers to make the system handle 5 simultaneous cosmic bit flips.

https://www.seattletimes.com/business/boeing-aerospace/newly...


As Dostoevsky once said, "All functional people are the same; all dysfunctional people are dysfunctional in their own way." Don't get me wrong: you can always invent an idiot who will go out of their way to make something unbearable, but it doesn't mean we should base our judgment on that.


Didn't Tolstoy also say "All happy families are the same, and all unhappy families unhappy in their own way"? (I'm paraphrasing)


Why not CUID?


Snowflake ID[1] is another system related to those in the article. It uses 64 bits only, and has good sharding support, so can be more useful in some contexts. However, it naturally has worse independent random-collision chances than a 128 bit system. [1] https://en.wikipedia.org/wiki/Snowflake_ID


First time hearing about CUID too, but have used ULID before. ULID is basically just 48 bits of timestamp + 80 bits of randomness, while CUID adds a counter for monotonicity and a client fingerprint for collision resistance, presumably at the cost of fewer bits of randomness?

Would love to hear experts chime in on the tradeoffs here:

In terms of collision resistance, how much does adding a client fingerprint component really help? 80 bits of randomness in ULID already sounds pretty collision resistant to me, since that's a 50% chance of collision after generating 2^40 IDs. It kind of feels like the risk of collision in the fingerprinting mechanism itself (here it's described as 2 chars from the PID and 2 chars from the hostname for the node, which honestly sounds a little bit shaky to me), combined with the reduced bits of randomness, could undermine any potential gains in collision resistance through client fingerprinting.

Do folks know of examples where collision resistance through 80 bits or more randomness has failed in practice and generated collisions? Would love to see more reading material on this kind of stuff.


Some details on how CUIDs work: https://github.com/ericelliott/cuid


Haven’t looked into it, will check it out.



