Be Careful with UUID or GUID as Primary Keys

bdarnell · on June 9, 2017

One of the post's points is that UUIDs will scatter your writes across the database, and that for this reason you want a (more or less) sequential key as your primary key. This crucially depends on both your database technology and your query patterns.

In a single-node database or even a manually-sharded one, this post's advice is good (For Friendfeed, we used a variation of the "Integers Internal, UUIDs External" strategy on sharded mysql: https://backchannel.org/blog/friendfeed-schemaless-mysql).

But in a distributed database like CockroachDB (Disclosure: I'm the co-founder and CTO of Cockroach Labs) or Google Cloud Spanner, it's usually better to get the random scattering of a UUID primary key, because that spreads the workload across all the nodes in the cluster. Sometimes query patterns benefit enough from an ordered PK to overcome this advantage, but usually it's better to use randomly-distributed PKs by default.

For CockroachDB, my general recommendation for schema design would be to use UUIDs as the primary keys of tables that make up the top level of an interleaved table hierarchy, and SERIAL keys for tables that are interleaved into another. (Google's recommendations for Spanner are similar: https://cloud.google.com/spanner/docs/schema-design#choosing...)

hodgesrm · on June 10, 2017

You make a good point about knowing your database technology. For me the takeaway from the article is "know your database technology and your data" especially if you plan to get big.

The original post refers to RDBMS and seems to be largely based on MySQL/InnoDB experience. InnoDB clusters on the primary key and copies primary key values into secondary indexes, so scattered writes and exploding secondary indexes are as well-known problem if you use UUIDs as the primary key on MySQL. Another well-known issue is that UUIDs can lead to random scattering of reads which chew up space in the buffer pool.

That said, other RDBMS types don't necessary have the same problems. SQL server can cluster on non-primary key columns or not cluster at all (heap organization). In the latter case secondary indexes on the table use row IDs (RIDs) to refer to rows from the index. Using UUID as a primary key does not inexorably entail performance problems though it might if you made bad choices about indexing.

It seems to me people sometimes make a bigger deal of UUIDs and DBMS than is really called for. Badly chosen surrogate keys or time-series clustering can lead to performance problems that are just as severe depending on your access patterns. There are perfectly reasonable workarounds for UUID issues like using integer surrogate keys, which RDBMS support quite well. To use them effectively though you need to spend time to understand your DBMS technology and implementation choices thoroughly. A lot of people skip this step.

Twirrim · on June 9, 2017

With DynamoDB you also want to be scattering your reads. As you scale up the table (either data or read/writes) it will be partitioned and your total read/write capacity is split evenly amongst the partitions.

If you don't have even distribution of requests across the table, you're going to keep running in to throttle limits.

manigandham · on June 11, 2017

dynamo and cassandra use consistent hashing of the primary key, ordered numbers are already hashed randomly.

tkahnoski · on June 9, 2017

I've been on the wrong-side of this unfortunately.

Rather than implementing a sane archive delete strategy, data was left there to collect so there was pretty much no hope for a particular join not to generate page miss.

Eventually we settled on a NoSQL DB with denormalized data in partitions. A few years later we realized we made a poor choice in having too few and too big of partitions, and basically just repeated the mistake with a different database technology (made even worse because now the things on disk were even bigger that we needed to fetch!)

malkia · on June 9, 2017

Also Cloud Bigtable - https://cloud.google.com/bigtable/docs/schema-design#types_o...

emmelaich · on June 10, 2017

Could you retain simple ints as the primary key and use a hash of it for sharding instead? Then you'd get the best of both worlds.

jrs235 · on June 9, 2017

>One of the post's points is that UUIDs will scatter your writes across the database

And will wreak havoc on any indexes it is included in [if you do a lot of writes and don't have sufficient padding]... page split city!

UK-AL · on June 9, 2017

Some UUID and Guids can be made sequential.

braveo · on June 10, 2017

I can't speak for other databases, but SQL Server's NEWSEQUENTIALID is only guaranteed until the server reboots. At which point the next UUID integer value can be less than the previous, at which point you're back to splitting pages.

UK-AL · on June 10, 2017

Normally your application would generate them not the dB

braveo · on June 10, 2017

I didn't provide enough context.

There's nothing wrong with using UUID's, but I would avoid them for clustered indexes. And I would say typically you wouldn't care about whether or not UUID's were sequential unless you WERE putting them into a clustered index.

I'm sure there are use cases where you would worry about the sequential nature for reasons not related to clustered indexes, but in general what I said is true.

UK-AL · on June 10, 2017

You application generates sequential guids before placing in a database with a clustered index.

Thus solving the issue with NEWSEQUENTIALID and clustered index issue.

lazulicurio · on June 10, 2017

Just for other people reading this thread, as I've posted before[1], a word of caution if you're trying to generate sequential UUIDs client-side for SQL server: SQL server does not sort UUIDs in the order you'd expect. The bytes are ordered 10-11-12-13-14-15-8-9-7-6-5-4-3-2-1-0.

[1] https://news.ycombinator.com/item?id=13120522

braveo · on June 10, 2017

I'm not sure what that buys you that you can't get by just using an identity column.

UK-AL · on June 10, 2017

Places where the user request, is detached from the actual insert. For example if you are using a queing system for user requests, but you still need to return an Id to the user.

Fairly common when building a distributed application.

braveo · on June 10, 2017

why do you need it to be sequential then?

UK-AL · on June 10, 2017

Because it will eventually go into a clustered index once it's been processed.

braveo · on June 11, 2017

I'm not trying to be hardheaded, and I understand I don't see all of the nuance because I'm not familiar with the problem you were solving.

But having said that, I have to go back to asking why you didn't just use an identity of some sort.

I cannot see any reason outside of convenience why you would insist on sequential UUID's for that, and I would argue against the convenience seeing as how UUID's are not naturally sequential and any work you did to guarantee them being sequential would also have been useful for generating sequential integers in the software as well (even 128 bit integers).

You could have literally created a 128-bit sequential integer, waved your hands, and then announced to the world that it's a UUID (you wouldn't even need to set the bit to determine which version of UUID unless you were using it outside of your environment for some reason).

Unless there is some other requirement that I'm not aware of, I don't really understand what it buys you.

UK-AL · on June 11, 2017

Combed Guids are normally only partly sequential.

It means they go be generated on any machine, returned to the user, and the request placed on a queue.

Without completely messing up a clustered index.

braveo · on June 12, 2017

I don't think you can guarantee the generation of anything sequential across multiple machines without some sort of authority. A single machine, sure as long as the resolution you're using is small enough and/or you have a single point of authority on that machine.

But across multiple machines?

Can you explain to me how you're doing that across multiple machines without any sort of authority? If there's a way to do that I'm genuinely curious.

UK-AL · on June 14, 2017

Most simply use the system clock. Now if I would was building a machine critical distributed system I would advise against using the system clock and instead use some sort of logical clock. But in this case the clock doesn't need to be super accurate, it just needs accurate enough it get placed near the end of the table.

But the worst case in this instance is your insert is slow, and fragments your index. I trust infrastructure to make sure the clocks are accurate.

sroussey · on June 10, 2017

Sometimes, you don't want a workload spread across all the nodes of a cluster.

lugg · on June 10, 2017

Thanks for this reclarification, lead me to do a little searching.

https://mariadb.com/kb/en/mariadb/guiduuid-performance/

Has some helpful tidbits, like how pks get implicit copies to all other indexes making uuids quite expensive memory wise if you have a lot of referencing tables and a lot of indexes.

One of the recommendations I'm not sure on is to compress and reorder the uuid for time sequence.

I get the compression part (remove dashes, convert to bin with unhex), but I'm concerned reordering the uuid so the time sections are together (to order your writes better) might actually decrease entropy? and make collisions likely.

Does anyone here know much about uuid generation and whether the time sections reordered would make a difference across a table of around 100m~?

I think I might just go with internal incremental and external uuid. Solves the knowing key before insert problem with little downsides.

The time swap thing is now implemented in mysql8 (broken link in the mariadb post)

https://dev.mysql.com/doc/refman/8.0/en/miscellaneous-functi...

caleblloyd · on June 10, 2017

The probability of a collision can be estimated using the "Birthday Paradox" equation: http://planetmath.org/approximatingthebirthdayproblem

Say you can tolerate a 1 in 1 Billion probability of collision:

With completely random 128-bit UUIDs, you would need slightly over 800 Trillion Rows to have a 1 in 1 Billion chance of a collision. Equation: sqrt(2(2^128)ln(1/(1-1/1000000000)))

Adapting the UUID to contain the first 8 bytes as the number of milliseconds since the 1970 epoch would mean 64 bits of random data. You would need to insert over 192,000 rows IN A GIVEN MILLISECOND to have a 1 in 1 Billion chance of collision. Equation: sqrt(2(2^64)ln(1/(1-1/1000000000)))

meritt · on June 10, 2017

> might actually decrease entropy? and make collisions likely.

It's rearranging in a deterministic manner, there's no loss of information, it's just a permutation. There is no impact to entropy nor the likelihood of a collision.

hodgesrm · on June 10, 2017

The UUID time-swapping trick seems like a bad idea outside of special cases. Overloading the UUID in this way makes you dependent on generating UUIDs correctly so that the ordering works. You can get insertion ordering by using an auto-increment integer primary key which also results in a relatively short key for secondary indexes. This is much easier to understand and cannot be messed up by application errors.

braveo · on June 10, 2017

The one point I went looking for and didn't see is that UUID's make for very poor clustered indexes. SQL Server tried to mitigate this with the 'NEWSEQUENTIALID', but it essentially 'resets' when the server is rebooted.

If your data is small enough that it doesn't matter then fine, but in general UUID's are better as a surrogate key than as a primary key.

UK-AL · on June 10, 2017

Very few people generate the guids database side. They normally generate them before they even insert it into the database.

jrochkind1 · on June 10, 2017

Yes, that's in the OP, exactly what you said.

braveo · on June 10, 2017

It was, I meant I went looking for it in the article and didn't see it :)

platz · on June 9, 2017

> secondary primary key

This is called a "candidate key" in existing literature. much has been written about such things.

Both UUID's and auto ID's are "surrogate keys" because they are arbitrary with respect to the data.

lastly, "natural keys" are combinations of columns that consist of the business data.

dghf · on June 9, 2017

> > secondary primary key

> This is called a "candidate key" in existing literature. much has been written about such things.

To be picky, a "primary primary" key is also a candidate key. A superkey is any set of columns the values of which are guaranteed to be unique in any row in a given table; a candidate key is a minimal superkey (i.e., if you remove any column from the key, you destroy its uniqueness). (Relational purists will want to read "attribute" for "column", "tuple" for "row", and "relation" for "table".)

From a logical perspective, the designation of a candidate key as the primary key is more or less arbitrary; as per this article, that may well not be the case from a physical perspective.

tomcam · on June 9, 2017

This is super helpful for a mediocre devops guy like me. I had thought through all the issues in the article but didn't know if it was prudent to follow these principles. Now I know what to look for in the literature.

squeaky-clean · on June 9, 2017

I've been using natural keys in a project recently without even knowing they were called that, or even that it was a pattern.

Is there a book or any other reading you would recommend to be more familiar with these sorts of details? The longest text I could find about this was the c2 wiki. I'd like to know what else I don't know.

Delphiza · on June 9, 2017

Anything more recent by CJ Date. He was at IBM with EF Codd at the beginning of the development of the relational model. Try "Database in Depth: Relational Theory for Practitioners". It's good, when reading his stuff, to remember that SQL is a compromise to fit with a pure relational model, so you see things that make sense, but are not practical. Like choices for candidate keys, I suppose.

matwood · on June 9, 2017

Very rarely have I found natural keys that end up working long term. I always default to a single column surrogate and use natural keys as exceptions.

As to your question, any introduction to RDBMSs book will teach all these concepts.

awj · on June 9, 2017

The ultimate problem with natural keys is that their uniqueness is a property of business rules. Unless you live in some magical fairytale land where business rules are stable, sensible, or both, depending on them for uniqueness is asking to have problems.

dghf · on June 9, 2017

There is an argument that if your data does not have natural keys -- that if it depends on the addition of arbitrary system IDs to prevent duplication -- then it is not truly relational.

awj · on June 9, 2017

Sure, and I can see that, but my argument is that natural keys which depend on business logic are subject to definitions that can change given a long enough time frame.

ScottBurson · on June 9, 2017

Interesting that you would say that. It does seem to me that if your rows have these arbitrary IDs, they are more like objects than relational purists would like to admit. (I seem to recall that the relational people used to beat up on OODBs for lacking this relational purity. But OODBs are pretty much dead anyway, I guess.)

TeMPOraL · on June 9, 2017

Interesting indeed. I'd say that a primary thing that makes an object is its identity - and the identity is never a part of object data, nor it could ever be. In programming languages, objects' basic identities are determined by however the language implements references; in databases, a natural solution would be a surrogate unique key...

int_19h · on June 10, 2017

That's exactly it.

And furthermore, objects in real world also have identities, which are distinct from their attributes. You aren't you because you have a certain name and date of birth, or a certain SSN. You're you because you're you. And we all intuitively know and understand this concept of identity. So any software that deals with real world entities has to reflect that concept, which is exactly what synthetic primary keys do.

And it might not be purely relational; but if purely relational model cannot handle this, then it's simply not a particularly useful model for most practical scenarios.

BTW, a pedantic argument could be made here that, ultimately, identity of physical objects is also data - you just need to go low-level enough (ultimately, the identity of any entity is its wavefunction) to see that. Of course, that ignores the practical aspects.

Koshkin · on June 10, 2017

> You're you because you're you.

Are you sure that 'you' today is the same as 'you' yesterday? The notion of identity, however intuitive it may seem to some, is far from being universally obvious.

TeMPOraL · on June 10, 2017

If you get all the way down to where philosophy mixes with quantum magic, then sure, I can close my eyes, open them again and not be sure if I haven't died in the meantime.

But there is an intuitive concept of identity that every human after some age has. A lot of philosophy goes to point out that this concept of identity doesn't really work well if you stress it enough - as evidenced e.g. by this identity being constant under ship of Theseus transformation. But under this level of scrutiny, a lot of other things break too, including the meaning of words themselves ("what do you mean by 'is'?").

As it is, this concept of identity is natural to us, it's the way we see the world and organize our societies, and it works fine for all practical purposes.

Koshkin · on June 10, 2017

Actually, I was far from being philosophical. Even physics has nothing to do with this. Thinking otherwise would be reductionist. In fact, the notion of (a person's) identity depends on the social context (including law). There is even a process by which a person can have their identity changed.

int_19h · on June 10, 2017

It's universally obvious in a sense that every human culture and every human language has it (or at least I'm not aware of examples of the contrary).

And it may well be fuzzy, but it's rather telling that this fuzziness nevertheless remains confined to philosophy and metaphysics. Asharites, for example, believed that the world is recreated by God "at every moment" (and that this is what makes the notions of time, motion etc possible), so in that sense, every person is also re-created. But their societies did not derive any peculiar social customs from that belief - they still treated any person at any given moment as if they possessed stable identity over time in practice.

jankins · on June 10, 2017

At odds with Reactocese, who suggested that the world at each moment is diffed against a 'shadow world' to apply the minimal set of changes that render the new reality

kazinator · on June 11, 2017

Identity means that:

1. We have some EQ function such that if EQ(X, Y) reports true, then X and Y are the same object; and

2. There doesn't exist any function which is more discerning than EQ; if EQ concludes that X and Y are the same, then no other function can find a difference.

In programming, the currrent object may be different from that object moments ago because of mutations. However, inside the virtual machine, we are forced conclude it's the same object. The reason is that within the machine, we do not have a function which can distinguish the two objects. And the reason for that is that under mutation, we we do not simultaneously have the previous object and the new object. Therefore, we cannot apply both of them to a function.

We can discuss differences between yesterday's you and today's you, and find differences. But the discussions of you are not you; they are ideas of you.

Likewise in the machine, we can snapshot the state of an object into some serialized format or whatever, and then two snapshots at different times are different, informing us how the object changed. The two snapshots are not EQ, but they are not the object.

gregmac · on June 10, 2017

This actually only underscores the usefulness of surrogate keys.

Let's say you are building a cloud-based HR software of some type. You have an employees table, and store employees.

If someone gets married and changes their name, or gets a promotion and changes title, they're still the same person (and the same database entry).

However, if that someone leaves that job, and then gets a new job at a totally separate company -- that also happens to use your service -- they are a new database entry.

dghf · on June 10, 2017

I'm not sure I get your point? In the scenario you describe, the natural key for the employees table could be, for example, legal company number + employee's payroll number.

Or are you saying that by using a surrogate key, we only ever have to create one record for an employee, and just update their company details etc. when they change jobs?

gregmac · on June 10, 2017

I suppose you could derive a key like that, but then is it still a "natural" key? What's the benefit over just having a completely arbitrary key?

I'd argue for a surrogate key in this case because I wouldn't want someone to be parsing my id, and having to worry about changes and backwards compatibility.

E.g., the key might be 2360108 where 236 is company number. Then as we grow eventually you get an id like 10910249, which is ambiguous (company 109 grew to over 10,000 or company 1091?)

I deliberately choose a short prefix for illustration, but this could be an issue with any scheme (you'll never perfectly predict all the future requirements), which using surrogate keys completely avoids.

dghf · on June 10, 2017

To be clear, I'm not proposing that the company and employee numbers be concatenated into a single column: I'm saying you could have a composite key, comprising the company number column and the employee payroll number column. This would still be a natural key: each component has a meaning outside the system (company numbers appear on company filings and other formal documentation, employee payroll numbers appear on payslips and are communicated to the tax authorities), and the two in combination have a clear business meaning.

Of course, you may still want to use a surrogate key for performance or security or other technical reasons, as the article describes.

I just wasn't clear on how the specific scenario you laid out made any additional case for their use.

Too · on June 10, 2017

Another good food for thought regarding identity: If you take a complex object like a jumbo jet and replace every single component in it one after another, at the end, is it still the same jumbo jet?

Koshkin · on June 10, 2017

Less hypothetically, most of the atoms in a complex living organism are replaced every few months. (One objection could be that atoms of the same sort are all identical, but that makes the issue of 'identity' even more complex.)

Koshkin · on June 10, 2017

> nor it could ever be

Well, never say never... In a world of immutable objects, an object's identity (reference) can be derived from its content. Look, for example, at how objects are stored in Git.

seanp2k2 · on June 10, 2017

And of course, by now everyone has seen https://stackoverflow.com/questions/3475648/sha1-collision-d... . It's an interesting problem space, trying to generate a unique ID for the contents of a thing where the size of that unique ID is less than the size of the thing. If you consider that = the thing (a deep representation of the object) is the most obvious way to uniquely identify the thing, but then figure you can do better, it would seem that it would always be technically possible, outside of lossless compression algorithms (like gzip), to generate something with the same hash. I'm sure that there is tons of theory on this and that my terminology isn't correct here. Can someone steer me in the right direction regarding what this type of thing is called?

EDIT: important detail I'd forgotten: hash functions take arbitrary-length data in and spit out fixed-length hashes. I learned about https://en.wikipedia.org/wiki/Perfect_hash_function and https://crypto.stackexchange.com/questions/29264/why-cant-we... is relevant. Relevant quote: """ Mathematically speaking, there is no such thing as a collision-free hash. Practically speaking, there is. """

TeMPOraL · on June 10, 2017

A hash function that takes arbitrary data and generates fixed-length hashes will always have collisions, because there is simply more possible inputs than possible outputs. Also known as Pigeonhole principle[0] - if you have 8 pigeons and 5 holes for them, and you want to put all pigeons in the holes, then at least 3 of them will have to go to an already occupied one.

The practical aspect of making hash function "collision-free" is by using trickery and dark magic to reduce the probability of collision down to somewhere below "Earth just got hit by an asteroid, we all have other problems". You want to go really, really low because hashes are mostly used to quickly verify data, and malicious actors will be happy to throw compute at trying to generate collisions so they can alter the data.

As for how the trickery and dark magic is performed, I'm a wizard of too low level to explain these details to you; all that I recall is that you try and make the hash maximally sensitive to every bit of the input data, so that a random bitflip anywhere has a huge probability of greatly changing the hash.

--

[0] - https://en.wikipedia.org/wiki/Pigeonhole_principle

theoh · on June 9, 2017

But, to state the obvious, anonymous items in computing (e.g. anonymous functions, or untitled/unaddressed draft emails etc.) generally need to be assigned an id for the software to keep track. The practice of avoiding natural keys and assigning unique ids is analogous. No paradigm is implied.

TeMPOraL · on June 9, 2017

In programs, anonymous objects always have an unique ID automatically assigned - by the very virtue of being separate entities. Usually those IDs are what we call "references" in programming.

theoh · on June 9, 2017

I'm talking about the level of e.g. programming language implementation. A lambda needs to be assigned an id, for example, but it's not a reference that the user can manipulate. It's an implementation detail, as are surrogate keys.

TeMPOraL · on June 10, 2017

You don't manipulate the reference directly (i.e. you shouldn't learn its numeric value (though you can - useful for debugging), and you can't change it) - but you do use it indirectly all the times with ==, eq, or however the referential equality operator is called in a given programming language.

This is what, given (excuse the Java):

    MyObject a = new String("choo choo");
    MyObject b = a;
    MyObject c = new String("choo choo");

allows you to observe that:

    a == b;
    a != c;
    b != c;

even though:

    a.equals(b);
    a.equals(c);
    b.equals(c);

under string comparison.

This is exactly like surrogate keys. An implementation detail that's crucial to the meaning of the whole thing.

squeaky-clean · on June 9, 2017

> Unless you live in some magical fairytale land where business rules are stable, sensible, or both, depending on them for uniqueness is asking to have problems.

Thankfully mine are legally enforced, so it works well :)

TeMPOraL · on June 9, 2017

It works until it doesn't :). A lot depends on how well they're enforced. SSNs are a go-to example, but in other countries which have actual ID numbers for citizens[0], duplicates do happen every now and then because of e.g. clerical errors. Real world is messy, and something can be trivially fucked up, it will be.

--

[0] - as opposed to the way US handles the SSN, where the number was explicitly forbidden from being used for identification, but people use it for it anyway.

codedokode · on June 10, 2017

There are not only duplicates, but sometimes the documents can be reissued with another number or ID, sometimes there are people without the document that your system expects everybody to have.

nvivo · on June 9, 2017

Here in Brazil, we have the CPF (somewhat equivalent to US SSN), a "unique" number assigned by the government for identification. It's used by almost every business I have ever know as a key. Not always primary key, but certainly people treat it as unique.

Except for the fact they are not unique, and there have been cases where people from different states have the same number because each state generated this "random" number, but didn't actually check with each other if it's really unique. It is legally enforced, no two people should ever have the same, but they do.

The only way to have uniqueness is if you know the algorithm AND can control that it behaves as expected. Anything else is just kidding yourself.

ksenzee · on June 9, 2017

Even that is not a forever guarantee. Laws can and do change.

dbenhur · on June 9, 2017

And people act outside the bounds of the law. SSN seems like a good natural key candidate, until you discover some of the people your business interacts with are illegal aliens.

int_19h · on June 10, 2017

Or legal aliens, like tourists. Or foreigners, who aren't even in the country. Or really anyone else who doesn't work in the country, and doesn't have to report taxes.

Ironically, many illegal immigrants actually do report taxes. In which case they use ITIN, which is compatible to SSN format-wise (###-##-####)... but it is not unique over time.

squeaky-clean · on June 9, 2017

Definitely true, but at a certain point you need to draw a line. If these things were to change, the entire product would have to change, not just what I use as my keys, so I think natural keys are a safe bet in my case.

jrochkind1 · on June 10, 2017

> Thankfully mine are legally enforced, so it works well

What, you think laws never change?

jholman · on June 10, 2017

I strongly agree. Natural keys are an ivory-tower pattern, and a software-development anti-pattern.

holri · on June 10, 2017

Why not change the definition of natural keys then?

timClicks · on June 9, 2017

Pretty much any undergraduate database course will discuss this early on. One of the first MOOCs #dbclass covered this in depth.

squeaky-clean · on June 9, 2017

My database courses were pretty terrible, everything besides basic SQL syntax and some implementation specific stuff I've had to teach myself (some classes were pretty much an ad for MS SQL). And judging by many people I've worked with, I think this may be true more often than not.

I think I've found the #dbclass series on YouTube, is it the one from Stanford? And thanks for the help.

Dangeranger · on June 9, 2017

Yes, you want the course taught by Jennifer Widom of Stanford offered on Lagunita.

Links below:

https://lagunita.stanford.edu/courses/Engineering/db/2014_1/...

https://lagunita.stanford.edu/courses/DB/2014/SelfPaced/abou...

brid · on June 9, 2017

Check out the Data Warehouse Toolkit. You might want to look into natural vs surrogate keys.

https://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimen...

squeaky-clean · on June 9, 2017

Found a pdf copy (just to browse, I promise!) and it looks great. Ordered a used copy, thanks.

Edit: no instances of candidate key when I ctrl+f though, still looks pretty good.

nkozyra · on June 9, 2017

Fundamentals of Database Systems (which is an academic textbook, check Abebooks) covers a lot of this along with underlying design. I'm going through it now for a DB class.

sitkack · on June 10, 2017

All of the Joe Celko books and https://www.amazon.com/SQL-Design-Patterns-Programming-Focus...

problems · on June 9, 2017

> Botnets will just keep guessing until they find one.

Why does your security rely on primary key obscurity? This seems like you're doing something horribly wrong, put some authentication on that or something.

And no, no they won't. Hitting a collision is very hard if you're using cryptographic strength random UUIDs, you wouldn't even be able to bruteforce 64 bits over the internet in a reasonable timeframe.

Go ahead, try the math on that, the only reason small keys are vulnerable to local attack is because you can perform an enormous number of attempts per second, often in thousands of millions of attempts per second and they can keep at it for as long as they want. The database server won't let you query anywhere near that fast. You will never get anything like that for network based attacks as you're limited by bandwidth, latency and of course, the other side who will notice if you even try to do this for any significant period of time and likely block your attempts or limit them greatly.

dsacco · on June 9, 2017

> Why does your security rely on primary key obscurity? This seems like you're doing something horribly wrong, put some authentication on that or something.

You can't put authentication on everything. Some things are inherently vulnerable to guessing because you can't toss an access control on it (the thing is the access control), which what the high bit-lengths and rate-limiting are for. I wouldn't do this with a customer's UUID because it's static, but sooner or later you have to rely on security by obscurity somewhere (for example, password reset tokens).

As a theoretical matter a UUID shouldn't be the sole distinguishing data for authorized actions, but in practice it's probably best not to expose them in case you have an errant insecure direct object reference (which approximately all modern applications do, somewhere or other).

I'm just nitpicking your use of the phrase, "security by obscurity" here. That terminology is levied as a criticism against a lot of modern security design but it's really fine (and sometimes necessary) to use it for defense in depth (unlike, say, a cryptographic algorithm, which does not generally benefit from security by obscurity, though particulars of the peripheral infrastructure can).

vertex-four · on June 9, 2017

You can't put authentication on everything... but you can put an HMAC on almost everything, e.g. /receipt?id=1&hmac=hmacsha256(1, somesecretkey), which decouples your enumeration prevention from your ID system.

dsacco · on June 9, 2017

True, but you've just shifted the goalpost a little. Instead of brute-forcing a random token I'm now brute-forcing a MAC, or a secret key. You don't really need to use a MAC if you're already implementing rate-limiting and time-based expiry (in addition to use-based). That's just a performance hit. You might as well send a token generated by a good PRNG and skip the integrity checks.

It's unrealistic to brute-force one well designed token in the absence of authentication (if it's long enough you could go without any other security controls, including expiry or rate-limiting, but that's also a performance hit at scale). Adding one with an integrity check is effectively rate-limiting anyway.

Funny story though - I was once on an assessment in which a company used a scheme just like that. The UUID was HMAC'd for a password reset request, and UUIDs could be found by messaging another user. Unfortunately, they used the same secret key for password reset HMACs as the one embedded in their mobile app for HMAC signing requests. Worse still, password resets didn't need to be activated, so you could effectively takeover any account by retrieving the HMAC secret key from the app and running the victim's UUID through it.

vertex-four · on June 10, 2017

A performance hit on your web servers might very well be worth it if it means your database can cope with your data better, though obviously, test that. It tends to be a little more difficult to scale up your database server than your web server.

The other side of it is that safety from enumeration attempts is the sort of thing that gets added on well after the software's actually in use, unfortunately, and changing existing keys might not be practical.

A social network I use is in the position where they don't really want people enumerating their users by ID... but all their users are identified in the URL by sequential ID. An HMAC would help them, and isn't difficult to implement.

And yeah, bad implementations are bad, but I wouldn't trust someone who can't use an HMAC securely to generate and use unguessable IDs properly either.

greenleafjacob · on June 10, 2017

The "performance hit" is the basis of most modern security [1].

[1] https://en.wikipedia.org/wiki/Computational_hardness_assumpt...

dsacco · on June 10, 2017

I'm well aware of the CHA (this is my primary research focus :). But that doesn't mean you should strive to make things computationally difficult if you don't need to (for example, Rijndael vs Serpent for AES). After a certain point there's no need for further computational cost.

problems · on June 9, 2017

I'm not saying a long random number cannot be secured or that using one is security by obscurity in the bad way, just that it's far too easy to leak a primary key by accident.

Jabbles · on June 9, 2017

That quote refers to sequential IDs:

The original issue with simple auto-incrementing values is that they are easily guessable as I noted above. Botnets will just keep guessing until they find one.

Pxtl · on June 10, 2017

This is why I'm starting to loathe SQL. The theory is great, but when the theory meets the practice and everything falls apart, the perfect kernel of relational beauty turns into a trash fire and I just want to get my freaking graph of objects out of the database. If I use numbers for keys, I deal with disaster when I try to merge from disparate sources. If I use guids as keys, I get terrible performance. Or I can just use a goddamned document store of Json or XML and have related objects get stored right next to their parents and tell the beautiful mathematics of relational algebra to shove it.

I'm tired of hearing "you don't have to say how to get the data, you have to tell the database what you want and it will get that in the most efficient manner" and then deal with an encyclopedia of byzantine rules to get it to do the aforementioned "efficient manner" with anything approaching decent performance. I can see the art, but the practicality mars it beyond recognition. It's like Venus de Milo sculpted out of duct-tape and bubble gum.

Sorry for the rant, I'm just getting frustrated with performance problems in small data sets. I've taken the courses, I've read Date and Darwen, and I'm just starting to get terribly disillusioned.

3pt14159 · on June 10, 2017

I get this feeling from developers from time to time, and occasionally I'll get to see them on a project where they use a document store for long enough and they start to appreciate why we started using RDMSes in the first place.

You trade complex retrieval of subgraphs for complex retrieval of aggregate measures. Sometimes it is worth it, but I usually find that if you need both (and that is a big if) it's easy to scale relational databases than it is to scale document stores.

This is for many reasons, but an underappreciated one is that it is easier to cache individual objects than it is to cache aggregate measures. So if you need a subgraph just store it as JSON in a table full of simple key value mappings with some tables that handle how to invalidate the cache. But because aggregate measures span so many business objects it it's harder to achieve this. Furthermore, it is more likely to be possible to enumerate subgraphs than it is to enumerate aggregations.

For example, imagine a tool that estimates the number of companies that meet certain criteria. "Count the companies by country that have at least one employee that earns at least $100k a year and who has had at least 2 previous positions." The number of permutations are endless, and while these types of queries aren't trivially fast on RDMSes, it is at least possible to handle them.

I'm not saying any of this is really easy, but humans don't live in an easy world. I used to be a structural engineer (EIT), and what amazed me was that despite the fact that we've built thousands of skyscrapers, each one was its own hell. Concrete trucks would go bad, water would gush out of places in the ground it wasn't supposed to, people would die. But we continue to push forward because we want to succeed.

johnnyhillbilly · on June 10, 2017

SQL is the best when it comes to reasoning about information as tables. If you want to reason about tables as graphs, SQL becomes impractical for many applications.

Performance characteristics of different RDBMSs varies by orders of magnitude dependent on workloads.

Keys are also a flexible concept. They can be formalized to ensure referential integrity (domain), or you can join on whatever you want.

Recursive queries can handle some reasoning on graphs, but they quickly become impractical.

Even so: The flexibility SQL offers when you want to change which data you think should be related is tremendous - on HPC database machines :)

evadne · on June 9, 2017

I recall reading something about this in the PostgreSQL mailing list, message written in 2016 but may still be relevant

https://www.postgresql.org/message-id/20151222124018.bee10b6...

“There's no substance to these claims. Chasing the links around we finally find this article: http://www.sqlskills.com/blogs/kimberly/guids-as-primary-key... which makes the reasonable argument that random primary keys can cause performance robbing fragmentation on clustered indexes.

But Postgres doesn't _have_ clustered indexes, so that article doesn't apply at all. The other authors appear to have missed this important point.

One could make the argument that the index itself becomming fragmented could cause some performance degredation, but I've yet to see any convincing evidence that index fragmentation produces any measurable performance issues (my own experiments have been inconclusive).”

sroussey · on June 10, 2017

MySQL's innodb has a clustered index. On DB2, SqlServer, Oracle, etc, they are configurable. PostgreSQL is more like MySQL myisam in that the data and index are separate. In this case, the random nature only screws up the index (assuming btree), not the data.

It can be a bigger issue if you proxy sql queries and shard to 1000 DB servers and you thus rely on part of the query to indicate which shard it is on. You can do this with a type of uuid that embeds the shard, or you have to include additional parts in the query beyond the PK to get the row (it could be a column that the sharing is based on, or a comment). And yes, careful with joins from an a root object's table that joins to other tables that use uuids unless you can guarantee that they are in the same node.

sp332 · on June 9, 2017

"Things got really bad in one company where they had decided to use Latin-1 character set. When we converted to UTF-8 several of the compound-key indexes were not big enough to contain the larger strings."

This shouldn't be right. UTF-8 encoding uses the same 8 bits for each valid UUID character that Latin-1 would. Unless someone put invalid characters in the UUID field, I would guess that the new encoding was actually UTF-16 or something.

treve · on June 9, 2017

In MySQL for example, using UTF-8 charset will cause it to use 3 or 4 bytes per character.

All the bytes transmitted on the wire will be UTF-8 strings, but this is disconnected from the underlying storage engine.

I don't know if this is the case for other databases, but when I read this my first thought was "This must have been MySQL"

sillysaurus3 · on June 9, 2017

W... What? Why?

The whole point of UTF-8 is that it's identical to Latin-1 (edit: ASCII) except in cases where it can't be. In those cases, it uses a very efficient encoding.

I understand that people (for whatever reasons) don't want to store text in UTF-8, but that's incredible that such a popular piece of software would take that notion to the extreme. "Let's make UTF-8 almost equivalent to UTF-32 when the data is at rest! Yes!"

Are you sure that info is correct? And if it's correct, are there any valid engineering reasons for that decision? It's probably best to be charitable and assume there must be a reason.

mitchty · on June 9, 2017

It is beyond correct. Mysql's "utf8" only stores stuff from 1-3 bytes.

https://mathiasbynens.be/notes/mysql-utf8mb4

You have to use utf8mb4 to have proper utf8 support. Yes its stupid, this is the main reason I prefer not using mysql for anything. Too many gotchas lurking about.

nvivo · on June 9, 2017

Exactly, but in any case even mysql version of UTF-8 should store an UUID the same way as ASCII.

seorphates · on June 9, 2017

In dealing with conversions or characterset handling the UUID pitfall is the hyphen. This is an extremely vulnerable character when dealing with multiple data handling and feed sources (to say nothing of data conversions).

The hyphen has a handful of imposters and one of the more troublesome ones is in extended ASCII as "en dash" (there's another as "em dash"). These do not have a direct mapping to unicode as they are not "hyphens" but they sure do look like them to humans in a hurry. It will throw your meticulously thought out and implemented keys (or any other well-measured column) into a tailspin. Dealing with extended ASCII implementations has been the most hazardous area of characterset handling, at least for me.

Although probably unlikely the UUID is not a guaranteed direct ASCII to UNICODE conversion. I'd advise awareness and perhaps even some level of caution if you have a complex data flow (or, more apt, a seemingly ridiculously simple and bulletproof one).

Extended ASCII sucks so if you use it and don't have one or both feet nailed to the floor then get off of it, sooner.

sroussey · on June 10, 2017

Store, yes. But indexes don't do variable length (think about the index being a btree of binary values), so it gets expanded. UTF8 (utf8mb4) will thus expand to 4 bytes per character. And MySQL (pre 5.7) will stop you from making such stupid indices.

TazeTSchnitzel · on June 9, 2017

> The whole point of UTF-8 is that it's identical to Latin-1 except in cases where it can't be.

It's only compatible with ASCII, not Latin-1. You are maybe confusing UTF-8 with the Unicode codepoint space, of which the first 256 codepoints are the same as Latin-1.

dlubarov · on June 9, 2017

Just in case there's any confusion:

- 7-bit ASCII encodes 128 characters; latin-1 encodes 256 characters; UTF-8 encodes all unicode characters

- latin-1 and UTF-8 both extend 7-bit ASCII; the three encodings are equivalent for ordinals 0-127

- UUID characters are part of 7-bit ASCII, so they have the same encoding in ASCII, latin-1 and UTF-8

seorphates · on June 9, 2017

100% correct - UUID strings are ASCII < 128 .. but there are imposters for character code 45 ("-"). Assumption based approaches to conversions or field handling should not be assumed to be safe ones just because UUID is "technically" ASCII. Only the keeper of the field can truly enforce that constraint.

sillysaurus3 · on June 9, 2017

Aha, thanks for calling me out. I was indeed thinking of ASCII.

geofft · on June 9, 2017

I think this is correct for CHAR and incorrect for VARCHAR.

A CHAR column always allocates the maximum size, so a utf8 (MySQL's weird "UTF-8, but only for the BMP" encoding) CHAR(100) needs to reserve 300 bytes per row, and a utf8mb4 (MySQL's actual UTF-8 encoding) CHAR(100) needs to reserve 400 bytes per row.

But a VARCHAR is dynamically allocated, so, yes, a VARCHAR(100) that stores 100 lower-ASCII characters is only going to use 102 bytes of storage (using 2 bytes for the string length).

See, e.g., the "tip" section at the end of https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8...

ivanhoe · on June 9, 2017

Author mentions compound keys, so issue was probably with the index key prefix length, which is by default limited at 767 bytes. That limits you to a maximum of having just first 255 characters of string data indexed with utf8, but in latin1 (one byte per character) you can have 767 characters long prefix. So with latin1 it's perfectly ok to have compound index on 2 varchar(255) columns, but you can't do that with utf8. Converting from latin1 to utf8 will break all those indices.

morgo · on June 9, 2017

The 767 limit is with the default row format. In 5.7 the row format changes, and allows 3000+

morgo · on June 9, 2017

With innodb it is variable storage regardless of varchar or char when UTF8.

sbov · on June 9, 2017

It wouldn't be the only utf8 quirk on mysql: utf8 isn't even utf8. For that you need utf8mb4.

devrandomguy · on June 9, 2017

(In David Attenboroughs voice): And here we have the gentle Sillysaurus in it's natural environment, grazing on a field of failweed in the tidal shallows. Notice however, that his eyes remain just above the surface of the water, ever vigilant for the dreaded Snarkasaurus Rex.

A flock of Correctosaurs glide by overhead, scanning the muck with uncanny vision for the telltale motion of their Suttlebug prey. It is their rare, mid-air collisions that give the Suttlebug away, despite their otherwise excellent camouflage.

morgo · on June 9, 2017

Hi! I am the product manager for the mysql server. I can confirm that for innodb, this is not the case - storage is the variable length.

You might be thinking about the memory storage engine (and temp tables).

brainopener · on June 9, 2017

I believe everyone is referencing the index key size length in MySQL. In that case you can get "ERROR 1071 (42000): Specified key was too long; max key length is xxxx bytes" errors.

This happens even in InnoDB -- it's taking the maximum byte size of all the key fields. For a single VARCHAR(255) in MySQL utf8 that ends up being 3*255=765 for just that one field. If you add other fields, it's really easy to hit the max key length and have the create or alter fail.

morgo · on June 9, 2017

See my reply to parent on why the max must be calculated (no false positives on duplicate key in edge case).

The default row format changes in 5.7, so key size is max is now much larger.

sp332 · on June 9, 2017

Oh yeah, that makes sense. At least for CHAR(36), but I'm not sure how VARCHAR(36) as given in the article would behave. (Never mind I just looked it up and it depends on the storage engine - e.g. MEMORY will always allocate the max size.)

bitJericho · on June 9, 2017

You'd be insane to put the primary key in a varchar. Mysql is much faster with static width tables. Your main user table should be static width for all the fields.

cwyers · on June 9, 2017

As opposed to the insanity of storing UUIDs as strings at all? Once you start going into the rabbit hole, you better be prepared to go all the way down.

bitJericho · on June 9, 2017

No doubt. You can store them as INTs! Tho it does get a little hairy dealing with the conversion with MySQL.

paulryanrogers · on June 9, 2017

Is this true for all storage engines?

morgo · on June 9, 2017

It is not true for innodb (default). See my other comments in this thread.

sroussey · on June 10, 2017

Storage, and index are different. You expand the UTF8 chars into bytes that are then part of a binary tree. It actually gets more interesting when you consider the collations.

lilyball · on June 9, 2017

Really? That sounds like a tremendously stupid design.

But even with that design, you'd expect VARCHAR(36) to have space for 36 ASCII characters, not 9.

tveita · on June 9, 2017

It does - it allows for 36 code points using your current charset, which with UTF-8 can mean up to 144 bytes.

Of course, that is the worst case, but for index size limits, MySQL has to assume the worst case, which means you quickly bump into the default index key limit of 767 bytes with large string keys.

lilyball · on June 9, 2017

That can't be true. The article's anecdote is converting from LATIN-1 to UTF-8 causing the indexes to not be able to contain the "larger strings", but UUIDs are ASCII (when represented as strings), so the index size shouldn't change unless the storage engine is storing it internally as something other than UTF-8.

And the parent comment says "In MySQL for example, using UTF-8 charset will cause it to use 3 or 4 bytes per character.", and if it's really using 3 or 4 bytes per character regardless of what that character is, then it's probably storing it internally as a sequence of unicode scalars instead of UTF-8.

Granted, upon re-reading this, neither source explicitly says that a CHAR(36) column won't actually be able to hold 36 characters, but if it can then I don't know why the indexes would stop working.

EDIT: It occurs to me that a CHAR(36) column would require 4 bytes per character to be allocated, since it's a fixed-width column, though it should still only actually use 1 byte for an ASCII character. But 144 bytes is still well under the 767 byte limit you mentioned.

zerocrates · on June 9, 2017

It's true.

The problem isn't so much the content itself, but the column and index definitions: keeping the same length in characters in the definition will use more bytes in utf8 or utf8mb4, at least against the storage engine index limits. I'm not sure whether this is because MySQL is just lazily/smartly computing the worst case, or if it's actually storing using multiple bytes per character regardless. I assume it's the former.

The documentation's example of the problem with InnoDB goes like this:

This definition is legal, a 255-character index at 3 bytes per char coming in under the 767-byte limit:

    col1 VARCHAR(500) CHARACTER SET utf8, INDEX (col1(255))

But to use utf8mb4 instead, which allows for characters beyond the BMP and can use up to 4 bytes per character, the index must be smaller, only covering the first 191 characters:

    col1 VARCHAR(500) CHARACTER SET utf8mb4, INDEX (col1(191))

It comes up often when people have used a VARCHAR(255) as a default "kinda big" string column, are indexing the whole thing, and want to switch to utf8mb4.

In the case of the UUIDs in the article, I believe it mentioned it was a multiple-column index, so the increase of 3x (or 4x) from latin1 could easily push them over the limit if there's another big column or columns in the index.

nvivo · on June 9, 2017

You must be confusing UTF-8 with 4-byte Unicode. UTF-8 by design will use variable length encoding, otherwise it's not UTF-8.

I can confirm both these commands return 36 in my mysql installation:

SELECT LENGTH('15c915fc-4d50-11e7-9bc7-50b7c307c620' COLLATE utf8_general_ci);

SELECT LENGTH('15c915fc-4d50-11e7-9bc7-50b7c307c620' COLLATE utf8mb4_general_ci);

pornel · on June 9, 2017

Nope, tveita is right.

MySQL's storage has some space limits defined in terms of bytes, and to express that in terms of characters it has to allow for the worst case.

If it allowed 255-byte space to be used as VARCHAR(255) in UTF-8, then an insert of 255-character long string with emojis would fail.

kpcyrd · on June 10, 2017

The name VARCHAR is confusing since it's not properly defined what a character is. 1 byte? Something on the unicode table? If it'd be named VARBYTE(255) it would be pretty obvious that a 255-emoji-string insert would fail.

https://eev.ee/blog/2015/09/12/dark-corners-of-unicode/

nvivo · on June 9, 2017

That's not a real issue. I can understand this may bite some people that don't read the docs or don't understand encoding, but it's more misinformation than a technical problem.

UTF-8 is variable length anywhere, not just inside mysql. If you don't want this behavior, you can either use UCS-2 or UTF-32. UTF-16 and UTF-8 are variable length encodings, period.

Most databases that have use char length for unicode actually work with UCS-2, and use 2 bytes per char, like MS SQL Server.

nvivo · on June 10, 2017

Not sure what deserved a downvote here... Is there anything incorrect?

pornel · on June 10, 2017

It's not incorrect, but a strawman (I didn't downvote you btw).

The question was why UTF-8 in MySQL has weird limits. It wasn't about what MySQL should have done in an alternative universe.

nvivo · on June 10, 2017

Yeah, but I didn't point out what mysql should have done in an alternative universe. Mysql DOES support fixed lenght unicode, just use the correct encoding: ucs2 or utf32. What it does with utf-8 is what any system that supports utf-8 must do.

flatline · on June 9, 2017

You can specify encodings on a per-column basis, at least with ISAM tables, so you can have a UTF-8 database with Latin-1 key values. MySQL 5.7 has some functions to aid string wrangling of guids, so you can store and index them as bin(8) and address them as string values.

More fun MySQL Unicode facts...utf-16 uses 32 bits per character to ensure it can handle supplementary characters. UTF-8 only handles characters up to 3 bytes; gotta use UTF-8mb4 if you expect to handle 4-byte values. You can create prefix indexes for Unicode string values that exceed the maximum index size.

lilyball · on June 9, 2017

That's remarkably bizarre. Who implements "UTF-8" but restricts it to only the BMP?

morgo · on June 9, 2017

It is historical. See:

http://mysqlserverteam.com/mysql-8-0-when-to-use-utf8mb3-ove...

http://mysqlserverteam.com/sushi-beer-an-introduction-of-utf... (Problem #1)

(Author is the first link here.)

flatline · on June 9, 2017

Remember this is for in-table storage, so it makes a certain amount of sense - this saves a byte over UTF-16 with support beyond the BMP. You have a hard limit on the byte size of the table - how do you determine a priori how much storage a 20 character UTF8 field will consume? The alternatives are to store the value in a clob field or set a hard byte count on the field and let the application or user be surprised when 20 print characters are rejected. I actually don't know how other providers handle in-table Unicode fields, MySQL made some poor choices on naming things at the least.

lilyball · on June 9, 2017

At a bare minimum you'd expect it to be something like utf-8bmp for 3-byte storage and utf-8 for 4-byte.

ars · on June 9, 2017

It's historical. When it was implemented 4 byte unicode did not exist.

Now that it does they can not change the names anymore.

It's the same as why windows is stuck with utf-16, because when they implemented it unicode was 2 bytes.

lilyball · on June 9, 2017

> When it was implemented 4 byte unicode did not exist.

Incorrect. When UTF-8 was invented, it was actually variable up to 6 bytes in length, being capable of representing code points up to U+7FFFFFFF. It was only shortened to 4 bytes in 2003. There is no point in history where UTF-8 was only limited to 3 bytes.

Dylan16807 · on June 10, 2017

That doesn't seem right.

The 1998 version of MySQL didn't support unicode at all yet.

Unicode 2.0 introduced UTF-16 in 1996, making the need for non-BMP characters very explicit.

And UTF-8 at the time supported 31-bit code points.

sp332 · on June 9, 2017

Sure, the VARCHAR(36) column would have enough room, but if you have a 36-byte index on that column, then suddenly it won't be indexing the whole column.

morgo · on June 9, 2017

The reason is not that the indexes became larger, its that mysql has to protect against all 4 byte characters, since the standard defines the n in varchar(n) as chars not bytes. This is obviously exceedingly rare, but if if it instead truncated somewhere in the index you could get a duplicate key error for something not duplicate.

In mysql 5.7+ the default row format allows 3000+ potential bytes (up from 767 bytes), so this becomes a non issue for most cases.

foolfoolz · on June 9, 2017

sounds like the author thinks "uuids are a pain" and wants the benefits of them but with a smaller representation. but doesn't provide any reasonings why uuids are a pain other than not being able to remember them or say them out loud. these are not things anyone does with primary keys!

you'll never say this out loud : 7383929. you may be able to remember it, maybe. in a uuid you'll match the last few and first few letters just as fast in your head

uuids are fine. sorting is an issue but at scale (the entire point of this article) how often do you need to sort your entire space of objects by primary key? you'll have another column to sort on

hiding primary keys and having 2 keys seems like a great way to make all queries and debugging 2x as complicated

jdcarter · on June 9, 2017

I've started using ULIDs when I need IDs that are time-sortable while still retaining the advantages of a UUID. Read more here:

    https://github.com/alizain/ulid

The ultra-quick summary: 48 bits of timestamp followed by 80 bits of randomness. Also, when representing as strings, they use Crockford Base32 encoding, meaning 26 characters instead of a UUID's 36. (I'd argue that UUIDs should use this encoding as well.)

Scirra_Tom · on June 10, 2017

I would think it would be better to keep the time stamp in a separate column. In essence you're duplicating the time stamp across FKs which is space wasteful - and a little abstract as one field is holding multiple data.

brightball · on June 9, 2017

Isn't that the point of Snowflakes anyway? Sortable UUID style keys?

https://blog.twitter.com/engineering/en_us/a/2010/announcing...

nvivo · on June 9, 2017

Completely agree. I have been using UUIDs for a lot of stuff for a while, and you get used to it just as you get used to any other thing. Performance is usually indistinguishable. Most of the issues in the article are related to practices, some are related to specific platforms. Every piece of code may bite you if you don't know what you're doing.

Now, if the developer don't know when it makes sense to use a UUID, there are certainly more problems in the application to worry about. Which is 99% of the cases in most business applications.

VintageCool · on June 9, 2017

In MySQL/InnoDB, using UUIDs as your primary key will eventually lead to pretty substantial fragmentation and use way more disk space than necessary.

You can see this visualized with the innodb_ruby project here:

https://www.percona.com/blog/2015/04/03/illustrating-primary...

nvivo · on June 9, 2017

This is true, but I wasn't able to find a real issue with this. This issue is likely to appear in benchmarks, but unless you're trying to squeeze that last bit from the database, not using any caching anywhere in your app, not using any index, not having enough free memory, etc you're likely to never notice the difference.

problems · on June 9, 2017

> you'll match the last few and first few letters just as fast in your head

Definitely true, in fact I tend to prefer hex for this purpose. But one minor note - assuming the whole thing is cryptographic-strength random, there's no need to compare first and last, just do first only or last only (last is usually preferred for UUIDs due to them sometimes containing a timestamp at the start), the odds of an accidental match is the same and the visual processing time of switching from one end to the other is lower.

trothamel · on June 9, 2017

You'll never say a phone number like 738-3929 out loud? Short identifiers are useful in a lot of places. I work in IOT, and it's nice to have a short device id when we need to troubleshoot something.

edejong · on June 9, 2017

Where I come from, our tables go towards a billion rows, so there's copying of keys regardless of whether it's uuid or sequential id.

eric_the_read · on June 9, 2017

for event store databases, you'll need to have some sortable sequence ID-- you can't always even use timestamps, since on occasion you'll get multiple events happening at exactly the same time, and you still need to sequence based on received order.

nvivo · on June 9, 2017

Well, event stores are a very specific type of database that exists solely for the purpose of knowing exactly the order of events. It's not fair to say UUIDs are a problem because one very specific case shouldn't use it.

eric_the_read · on June 9, 2017

Definitely not! In fact, I use UUIDs everywhere else but the event store. I was just pointing out one case where having an int (well, bigint) primary key is actively useful, and UUIDs aren't.

(edit: spelling)

drawkbox · on June 9, 2017

Maintaining UUIDs is much easier than maintaining id/int lookups that may be autonumbered (mssql, mysql, pg) or sequenced (oracle), even if using them internally and UUIDs externally. This especially comes into play when syncing across dev, staging and production environments and when clustering and servicing out parts of your app.

The moment any db starts to grow to these areas, UUIDs lead to far less issues than incrementing ids everytime.

Most RDBMS now have optimizations and native types (uniqueid) for UUIDs/GUIDs and this is really a moot point at this point, most UUIDs are no longer strings in DBs unless legacy from the time before native UUID types.

UUIDs are right for most projects but not all and as typical in any system, the environment and needs of your project will dictate whether it makes sense to use them.

UUIDs eliminating the round trip and negating dealing with autonumbering/sequencing is a massive benefit, the only real con of UUIDs is the extra 8 bytes but make up for it in less need to lookup during runtime when creating new or associating data with them.

bborud · on June 9, 2017

It is like that line from jurassic park 2...

(The problem with UUID is that the two first letters are a cruel joke. But hey: people need to make their own mistakes because some things you can't just tell people: some rakes they need to step on until they learn or are unable to reproduce)

drawkbox · on June 9, 2017

Universally Unique, if there is ever a conflict you can easily use a constraint or check upon creation on the app/server/service side, no different than checking if an int id or slug/public code is already used. No roundtrip is needed for every new item and if there ever was a conflict you just have an api error stating as much. You might want to also have it email you to play the lotto that week because it is virtually impossible.

Try syncing autonumbering across different database types where some use autonumbering and others use sequences. That is a nightmare scenario but there would be zero issues with UUID keys.

Primary keys as ints/longs will be constant rake stepping as you grow and maintain an app.

UUIDs emerged from a need that int keys weren't filling especially in a distributed/serviced world. I have used them in most DBs for over a decade and no rake steppage.

Side note off topic: If you were a piece of data would you want to be a universally unique piece of data or an int and merely a clone? All my data is a unique piece of data, I am a good Bender.

bborud · on June 10, 2017

I've worked on quite a few distributed systems where you need to generate unique IDs and I can tell you that UUID is not a solution in quite a few of them because you actually need to understand what problem you are solving rather than taking the promises of others for granted or shrugging and gambling on thins sorting themselves out.

In many cases you have two problems: network partitions and/or high ID assignment rates. So statements "if there ever was a conflict you'd have an API stating as much" is simply passing the buck to some piece of magic and actively not knowing what goes on. It is a bullshit statement.

In my experience, this is the kind of problem where people tend to fail because they don't grasp that it is actually kinda hard. Especially if you have additional constraints (like the size of IDs). People tend to make all manner of assumptions and then not test those assumptions.

The arguments are familiar though. I hear a re-run of them about every 2 years, when someone has a system in production and the thing starts to generate duplicate IDs. Although both code and notes are available on previous solutions within the company, people tend to ignore that and just plow ahead.

MithrilTuxedo · on June 9, 2017

Can confirm: using MySQL and for reasons... everything in the DB gets a primary key set by taking a random UUID, stripping the dashes, and then doing an `UNHEX(id)` in the stored procedures. Those IDs are both the primary keys and the keys used in the service's APIs.

One of our Ops guys did an experiment where they put a uniqueness constraint on the ID column and added an auto-incrementing primary key column that's never exposed to the code driving the thing. It apparently sped up our DB performance by orders of magnitude.

It also turns out that MySQL would perform faster just by leaving those values as strings instead of converting them to binary values. We've got some outside pressure to use Oracle instead of MySQL, and apparently it performs much better than MySQL with our current schema so we apparently aren't going to do anything to improve the MySQL performance or change any of this behaviour.

sudhirj · on June 10, 2017

Shameless plug: Anyone bothered about the wasted space in UUID string representation (and using Ruby) can check out https://github.com/sudhirj/shortuuid - it re-encodes your UUID into any alphabet you choose, with a Base62 default (I find that to be a sweet spot that gives both URL safety and efficiency).

Let me know if you want ports in any other languages - the the algorithm is to really just treat the UUID as a hexadecimal number (that's actually what it is) and re-encode it into any other alphabet of choice.

That said, always use native UUID types in datastores - they'll convert to bytes / numbers internally and will always be the most efficient. For other situations, remember that they're just numbers, so you can write them in binary, ternary, octal, decimal, hexadecimal, vowels, baseXX or really any other alphabet you want. The bigger your alphabet (as long encoding remains efficient, like ASCII under UTF-8), the better your gains will be.

makmanalp · on June 9, 2017

Yep, glad to see this posted. In the python world, this is why we have UUID.int (https://docs.python.org/3/library/uuid.html#uuid.UUID.int), though the native postgres UUID type with uuid-ossp works well too if you need them auto-generated in the DB rather than in application code.

aphextron · on June 9, 2017

Is there an auto-incrementing version of this? I'm trying to implement a time-series database and can't find a suitable method of generating incremental UUIDs.

imhoguy · on June 9, 2017

You may try ULIDs - "Universally Unique Lexicographically Sortable Identifiers" posted here last year. https://news.ycombinator.com/item?id=12205158

nvivo · on June 9, 2017

You can generate v1 UUIDs. v4 are random by design. You can also create your own version of UUID, as long as you set the version bits correctly, although I wouldn't recommend it.

tomphoolery · on June 9, 2017

doesn't uuid-ossp store the data as binary?

mark242 · on June 9, 2017

The reason I don't like the internal-int-external-UUID strategy is that all of your queries now require an extra join. It's no longer "select microblog.* where userid = ?" now it's "select microblog.*,user.id from microblog,user where microblog.userid = user.id and user.uuid = ?".

This may be practical from a storage standpoint but string-based indexes on an SSD are pretty damned efficient.

_kyran · on June 9, 2017

The benefit of an external guid is that it obfuscates counts.

I've seen cases where it's possible for a business to track a competitor's growth by watching a public facing user id count increase.

user5994461 · on June 10, 2017

I do that too.

Ask an invoice from a HN startup. See the reference number at the top "2017-0000438".

They had 438 potential customers this year :D

tdicola · on June 9, 2017

That's why you typically add caching (like cookies) that store the UUID with some user context so it can be provided directly in queries.

dimgl · on June 9, 2017

This article is so poorly written it's hard to take it serious. The entire paragraph about the size of a UUID takes reading it three or four times before you can actually understand what the author means...

In what context would a primary key change, even when sharding? In my entire career I have yet to see it. Also any sane person would never sort random values. If you need sorting in your table, provide some kind of indexed timestamp.

bjt · on June 9, 2017

> In what context would a primary key change, even when sharding?

The only time I've seen that happen is when a DBA evangelized for just using 'natural' keys, which in this case were slugs based on the item name. The problem came when users decided they wanted to rename things and then wondered why the IDs they were seeing in URLs didn't match the new names.

I'm not a believer in natural keys anymore.

> If you need sorting in your table, provide some kind of indexed timestamp.

A v4 UUID contains a timestamp. So it seems a bit wasteful to add the separate timestamp column. Why not just use the timestamp built into the ID? This is what MongoDB's internally-generated IDs do. (Though they're not quite UUIDs.)

There's a catch with v4 UUIDs though: the timestamp portion of the ID is not the most significant digits, so a simple ORDER BY doesn't do what you want.

I like the solution developed by Instagram, which Rob Conery discusses at the link below. It uses a pretty simple plpgsql function to generate a 64 bit integer based on the current time, a shard ID, and a per-shard sequence. This lets you generate 1024 IDs per second per shard. It also puts the timestamp portion of the ID first, so if all the shards' data ends up in Hadoop or something you can order consistently across the whole set.

http://rob.conery.io/2014/05/29/a-better-id-generator-for-po...

kpxxx3 · on June 9, 2017

Item name was never suitable as a key... It's not stable. There are natural keys, they are just very rare. And the thing is even when you think you've got one it is often better to err conservatively. A good anti-example is SSN, often used in text books. They do change. Other than DNA sequences I can't think of a good person key. Just use an internal surrogate and be done with it.

spb · on June 13, 2017

> Other than DNA sequences I can't think of a good person key.

In theory, one could use 3D geoposition at time of birth, time of birth, and sibling order (for twins / triplets / etc delivered surgically), where said order is dictated by the parent(s) or ob/gyn present. Of course, the main problem with this in practice is that not everybody has this information.

I would have said you could do something via retinal imagery, but not everybody has eyes. If we had non-invasive neural imagery, would it maybe be possible to derive a key from a simplification of a person's physical brain topology?

thinkloop · on June 9, 2017

Even DNA changes!

http://www.sciencemuseum.org.uk/whoami/findoutmore/yourgenes...

Turing_Machine · on June 10, 2017

Not to mention all the identical twins out there.

azernik · on June 9, 2017

In a some countries national ID numbers are good primary keys, but that's only when they're designed (and used internally) as such by governments. e.g. Finnish, French, or Israeli ID numbers. Even then, though, using them limits you to people who are in the system - no dealing with international customers, for example.

buzer · on June 10, 2017

At least in Finland, national ID can change in some cases, e.g. when you change your legal gender or if it's repeatedly misused by someone else (like in identify theft scenario).

bjt · on June 9, 2017

I mixed up uuid versions above. v1 are the timestamp ones, and v4 purely random.

Spooky23 · on June 9, 2017

It's a DBA rant.

Unfortunately, the author doesn't really get UUIDs -- the whole point of generating them and using them as a primary key is that you get a unique entry that can be used between systems. That is a magical thing that gives you superpowers.

In a past life, we would keep the customers basic data in the production database forever, but denormalize and export most historical records out of the transaction system to a data warehouse after 18 months of account activity. We also dumped login/audit data to a big flat file for compliance purposes.

The beauty of using the UUID as the key in that scenario is that the data became fungible between the systems. We didn't care about the audit logs, but had to keep them -- if we had to process them, we could do so with text tools (i.e. Perl/Awk) and do whatever querying we needed to do with a subset of data.

Relational databases are shitty and expensive when used as file systems. If you don't need the relational magic, you don't need the cost and operational overhead of a RDBMS. Literally billions of dollars are wasted keeping unused data in databases.

Terr_ · on June 10, 2017

Another wonderful reason to use UUIDs across systems: Consider what happens when one system's database has an accident, gets restored from a backup, and then starts generating numeric IDs which -- from the perspective of the other systems -- are duplicates.

Best case scenario, things break cleanly and you have to fix up your auto-incrementing counters. Worse is when you start getting data corruption because some remote system has its information about "Customer 11492" (who registered yesterday) overwritten by details for "Customer 11492" who registered today.

Spooky23 · on June 10, 2017

Exactly.

Also consider that you commonly have systems that you cannot control. What's the internal representatation of s user or action in a SaaS app?

Don't know, don't care. I slurp in data, with UUID as an attribute!

mrtbld · on June 9, 2017

> Also any sane person would never sort random values.

I think sorting by a arbitrary, unique value as the last sorting dimension is useful to ensure you always get the same ordering when first dimensions are equal. For instance sorting users by (last_name, first_name, user_id) ensures homonymous users don't move around in the list from request to request.

This is quite a sane thing to do, don't you think?

anotherevan · on June 9, 2017

Yes, I do that all the time. Even for tables that have a sequence column to indicate the desired display order, I'll add the primary key on the end of the order by to ensure consistency just on the off chance of two rows having the same seq value.

devrandomguy · on June 9, 2017

Have you never needed to merge datasets, that grew separately, in parallel? It is a royal pain when the rows have consecutive sequential primary IDs. Something has to make way for the other.

IMO, this is one thing Mongo got right: keys, by default, are a timestamp + some random string. This way, they still sort nicely, but don't collide with their cousins in a different instance of the DB.

kemitche · on June 9, 2017

Even if you come across a situation where your primary key has to change, it seems like you could "simply" migrate your initial primary key to the "additional column" concept referred to in the article, and add your new primary key as needed.

I can't imagine that a primary key migration would be easy even if you'd followed the author's suggestions around having an internal key and an external key, and I feel like if you're optimizing for a future primary key migration you might be focusing on the wrong part of your product.

chiph · on June 9, 2017

> In what context would a primary key change, even when sharding?

If you're using natural keys, and one of them is displayable text, and then the business says "We decided to rename the 8th sales district. Can you change it for us?"

Was this a good design? Maybe not. But legacy happens.

We're dealing with a vendor right now that has references to look-up tables based on display text. They should have used surrogate keys of some kind. Or at the least, code columns so that the database is mostly unaffected when values shown to the users get renamed. (Also useful for I18N work)

deathanatos · on June 10, 2017

> In what context would a primary key change, even when sharding?

When the model underlying the data is significantly incorrect, and needs to change. I.e., the relation you modeled does not map to the real-world concept like you thought it did. (Or, it did, but no longer does.)

I feel like this most often happens when two concepts end up being modeled by a singular table. You find out later that the table really represents separate, distinct things, and needs splitting. The split off things can — depending on how you do PKs / your migration — get new PKs.

zmoazeni · on June 9, 2017

I recently wrote about changing the Primary Key as a way to organize/cluster the data on disk for better query performance (MySQL, unsharded environment). http://techtime.getharvest.com/blog/improving-report-perform...

twoodfin · on June 9, 2017

Some databases will pre-sort a stream of primary keys in order to read a table sequentially.

rikkus · on June 10, 2017

Lots of talk about performance, but no numbers cited. I did my own benchmarks before using sequential ("COMB") GUIDs as 'PRIMARY KEY' (yes, they're surrogate keys) and found no material performance difference. I didn't keep the results, but someone else has made their numbers public here: https://blogs.msdn.microsoft.com/sqlserverfaq/2010/05/27/gui...

zimbatm · on June 10, 2017

Little rant on UUIDs:

Notice how the author assumes UUID v4[1] in the conversation. There are very few reasons to use the other versions but we are still paying for their price in code complexity all the time.

Look at this UUID parsing code: https://github.com/sporkmonger/uuidtools/blob/master/lib/uui...

What it really should be is `[uuid_string.gsub('-', '')].pack('H*')` (for non-rubyists: remove the dashes, decode the hex back to binary).

Their representation is also not that good since hex encoding is not very compact.

I guess what I'm trying to say is that UUIDs are often used as a default unique identifiers but they are actually not that good.

[1]: https://en.wikipedia.org/wiki/Universally_unique_identifier#...

wvh · on June 10, 2017

Postgresql has a UUID type which should store them as a 16-byte number. If you use time-based UUIDs – for instance based on the unix time stamp in hex, like CouchDB – then you also get sortable primary keys, which conceptually might or might not be useful to your application, but it probably speeds up indexes. I've done exactly this for two different projects, and it works well.

On top of that you get IDs that are impractical to guess, which while wouldn't replace other security measures, would still give you some collision resistance and probably avoid some bugs because of the unlikeliness of accidentally picking the same key for two different entities.

I'm sure there are pathological cases for UUIDs as primary keys in certain scenarios, like perhaps a very high number of small records, but I've not come across them myself. You obviously have to know your own data and database if you have some very specific requirements.

caleblloyd · on June 9, 2017

I work on an Entity Framework Core Implementation for MySQL and we recently added sequential GUID generation for primary keys that are of type Guid. The first 8 bytes of the GUID are the current UTC timestamp in ticks and the last 8 bytes are cryptographically random.

One interesting thing we ran into when implementing is that C#'s binary format and string format must be different to be sequential. So we have to detect whether the GUID is stored as a string or binary and put the timestamp in the correct place to ensure it is actually sequential.

Here's the PR for the feature for anyone interested: https://github.com/PomeloFoundation/Pomelo.EntityFrameworkCo...

vkrm · on June 10, 2017

Datomic [0] uses SQUUIDs [1] (Semi sequential UUIDs) to work around this:

    Many UUID generators produce data that is particularly
    difficult to index, which can cause performance issues 
    creating indexes. To address this, Datomic includes a 
    semi-sequential UUID generator, Peer.squuid. Squuids are 
    valid UUIDs, but unlike purely random UUIDs, they 
    include both a random component and a time component.

[0] http://www.datomic.com/

[1] http://docs.datomic.com/identity.html#sec-6

edit: formatting

michaelcampbell · on June 9, 2017

> Aside from the 9x cost in size, strings don’t sort as fast as numbers because they rely on collation rules.

Why would you sort these to begin with; what ordering of essentially randomness (part of the point) makes sense?

mi100hael · on June 9, 2017

I believe UUIDv1 is time-based, so you can sort the keys and get the sequential order the UUIDs were generated.

fooey · on June 9, 2017

for when everything else is equal, you need something to let you get consistent sorting in your output

For example, if your page size is N and you have N+1 things with the same timestamp, you need a backup sort key or your pagination breaks down

ahachete · on June 9, 2017

You should really read this: https://www.slideshare.net/MarkusWinand/p2d2-pagination-done...

whateveracct · on June 9, 2017

Yep the Seek Method for pagination as suggested there would cause you to be sorting by UUID.

    SELECT uuid FROM t where filterable_field > $whatever and uuid > $lastSeen limit $pageSize

with a covering index on (filterable_field, uuid)