On Primary Keys (orchestrate.io)
19 points by dstroyallmodels on May 22, 2014 | hide | past | favorite | 36 comments



"In fact, even in RDBMS’s developers should be using a natural key as their primary key instead of an auto-incremented ID. "

No, they shouldn't. The fact that your dataset may contain a piece of information that is usually unique is completely unrelated to the need of RDBMSs to have identifiers to implement table relationships and to identify records in a table.

Not to mention that it can bring lots of headaches down the line. The classic example is using social security numbers (or another country's equivalent, such as CPF in Brazil) to identify individuals. Down the line you get a requirement to support international customers. That is now a major undertaking.

You are also offloading the responsibility of maintaining the uniqueness property to a third party, in effect trusting their database.


The noob never saw a person get married and change both their last name and email address. There is no such thing as a "natural key" that can never change, that's why auto-incrementing integers are used. Also, users should not rely on the keys being monotonic.


Forget down the line, the US gov't makes mistakes -- anybody who has tried using SSN as a key can tell you there are multiple people with the same valid SSN. (And, of course, people who don't have one or who have a stolen or fake one)


DO NOT USE USER-SUPPLIED DATA AS KEYS

Database keys are more-or-less equivalent to pointers. In the case of natural keys, they're pointers that (a) do not have ASLR, and (b) are exposed to the user.

As the article mostly ignores, you will guess wrong about what can be reasonably considered permanent or unique. Email addresses change (and aren't always unique), names change (and aren't unique), etc. This is much harder to fix if you used that guess as the basis for your entire data model.

The problem with proper keys was given as "auto_increment doesn't work on distributed systems". This is rather silly, as there are plenty of alternatives (GUID, UUID, the mentioned snowflake keys).
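The coordination-free property of those alternatives is easy to see with Python's stdlib uuid module. A minimal sketch (not a benchmark):

```python
# Each "server" can mint version-4 UUIDs independently; no central counter
# or coordination is required, and collisions are astronomically unlikely.
import uuid

ids = [uuid.uuid4() for _ in range(10_000)]

assert len(set(ids)) == len(ids)             # no duplicates in practice
assert all(len(u.bytes) == 16 for u in ids)  # 16 bytes when stored as binary
print(ids[0])
```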

It's also interesting that proper keys are a "crutch" for designing data models and are best replaced with cleverness, and apparently are only useful to dictate record order (I know that Oracle at least doesn't work that way). This brings to mind "debugging code is twice as hard as writing code; if you write code as cleverly as possible, you are by definition not clever enough to debug it". DON'T BE (OVERLY) CLEVER. Do the least-fragile thing that doesn't add complexity (that is, complexity visible to you). Only do clever things that add fragility if you have an actual, measured reason to do so.


"In fact, even in RDBMS’s developers should be using a natural key as their primary key instead of an auto-incremented ID. This will lead to better performance when the natural key is the most commonly used identifier, but is often not considered out of a habit."

As a database architect I always use an auto-incrementing ID as the primary key. This ID is actually a surrogate for the real (often composite) natural key, which I enforce with a unique constraint.

So, with the unique constraint in place, performance will be nearly as good as you'd expect for direct lookups by the natural key attributes. However, the performance gain from using the surrogate ID in join tables and as foreign keys is much greater.
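A minimal sketch of that surrogate-plus-unique-key setup, using Python's stdlib sqlite3 (table and column names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Surrogate integer primary key; the natural key (email) stays unique
# via a separate constraint rather than being the PK.
cur.execute("""
    CREATE TABLE users (
        id    INTEGER PRIMARY KEY AUTOINCREMENT,
        email TEXT NOT NULL UNIQUE,
        name  TEXT NOT NULL
    )
""")

# Foreign keys reference the small, immutable surrogate id, not the email.
cur.execute("""
    CREATE TABLE orders (
        id      INTEGER PRIMARY KEY AUTOINCREMENT,
        user_id INTEGER NOT NULL REFERENCES users(id),
        total   REAL NOT NULL
    )
""")

cur.execute("INSERT INTO users (email, name) VALUES (?, ?)",
            ("jane@example.com", "Jane Doe"))
user_id = cur.lastrowid
cur.execute("INSERT INTO orders (user_id, total) VALUES (?, ?)",
            (user_id, 19.99))

# The natural key can change without touching any referencing rows.
cur.execute("UPDATE users SET email = ? WHERE id = ?",
            ("jane.smith@example.com", user_id))

row = cur.execute("""
    SELECT u.email, o.total FROM orders o JOIN users u ON u.id = o.user_id
""").fetchone()
print(row)  # ('jane.smith@example.com', 19.99)
```

The update to the email is the point: nothing cascades, because no foreign key ever held the email.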

"It’s also worth noting that over their years of use in SQL databases they’ve become somewhat of a crutch when designing the data model for a system."

They are not a crutch. They are a wonderful feature. One could model with GUIDs if necessary or something similar for your surrogate key but nothing beats the simplicity of an auto incrementing ID.

Even a NoSQL database like MongoDB creates a unique _id for you for each document.


An article promising to be quite technical demonstrates early on the author's inability to get their head around the spelling of the possessive form of "its". (Perhaps it's just me that finds this an unsettling juxtaposition. The About page wasn't useful in identifying native writer status.)

Then we dive into some mistruths about emails being unique per user (I know plenty of people who still share email addresses, both personal and in shared commercial accounts). I worked at a place a few years ago that had designed their entire accounts system around the idea of email addresses being immutable. They'd subsequently had to introduce a secondary column for actual email addresses -- customers would log in with their historical email address as their ID. The idea that a uid is a 'natural key' just seems perverse (or at least demonstrates a profound misunderstanding of what a natural key is).

A sideways reference - without actually using the phrase - to CAP theorem. Composite keys are suggested as a solution to a vaguely phrased problem without acknowledging the performance issues, and subsequent design constraints, they introduce. Ultimately there seems to be a misunderstanding about the differences between a (primary) key and an index.


The biggest problem with "natural" keys, as referenced in the article: they have a nasty habit of changing. In the article's example, making "username" or "email" the PK is going to be messy if you have foreign keys referencing the value when it updates.


They are often also difficult to get quite right. If you use an artificial key, you're pretty much good to go unless you want to make a distributed system. With natural keys, you run the risk of not having analyzed your data properly and picking the wrong thing.


Using email I can see being a problem, but username? If this is for identifying accounts, then that shouldn't be one that ever changes.


Jane Doe gets married and changes her name to Jane Smith. She no longer wishes to log into your system as janedoe and would prefer to log in as janesmith. Certainly one could make the argument that you should just tell her that the name can't be changed and she'll have to live with it. Depending on your business, that may or may not be acceptable.


Even worse, Jane Smith decides she hates that bleep John and they get divorced. She really won't want to see Smith as part of her username anymore.


If identity is a concern, maintain a list of old usernames. But in the real world, there are always extenuating circumstances in which allowing a username to change is good policy. Admins could blindly enforce a "never change usernames" policy, but why not just use a meaningless key and avoid the risk?


I am glad that twitter, github and stackoverflow allowed me to change my username. You could call it 'display name'. I wish hn would allow it.


You're ubinator. Unfortunately, you come home one day to find Mr. Binator in bed with some 20-something blonde hussy.

The divorce paperwork sails through. Do you still want to be ubinator, reminded of that bastard every time you have to log in, or would you prefer to go back to umaidenname?

We both know where HR would stand on that one :V


really depends on the site, several places let you change your username.


Suppose my name is "Paul", my username is "paul". Now I've just transitioned and I'm not "Paul" anymore. What then?


Someone else wants to take over my account. Forever and ever they have to log in using my name?


If it's liable to change, it's not a natural key. Yes, requirements change and flexibility is a good thing to build in, but any system needs to be designed with a set of invariants in mind. And those invariants need to be effectively communicated to stakeholders. These invariants are useful, as they help people understand the system.

Usernames are a great example. If users want to refer to one another (or to themselves) they're going to need a consistent and transparent way to do so. Showing them a username which is behind-the-scenes mapped to a synthetic key is just leaving the door open for bugs to manifest down the track.

The benefits flow on when you are presenting related information live or through an API -- a human-readable foreign key means you can provide useful information without having to perform a join. This does wonders for debugging and discoverability.

Useless use of opaque keys (integers or shudder GUIDs) when a perfectly good natural key is available is in my experience a tremendous code smell. It leaves the opportunity for lazy developers to forget about invariants -- in schema design you'll see such things missing unique constraints, and then foreign systems to which the true keys aren't exposed doing horrible, fragile things whenever they have to traverse a relation.


Natural keys are also a nightmare to deal with when you have to do complex joins. They are not guaranteed to be a single column value and can lead to joining across a half dozen columns when you need data from multiple tables.

Usernames are actually a very good example of something that looks like an invariant when in actuality it is anything but. On top of that, they are notoriously inconsistent between systems, so you can't rely on usera in one system being usera in another, even when you control both systems.

Even if you do have a nice single-column natural key, what happens when a business requirement changes and there's suddenly a completely new coding system? Or perhaps your coding system needs a new distinction: do you rework the existing values by hand or create a new column? How many tables do you have to touch to accommodate the new data? How much code is relying on your tables being structured just so?

Natural keys look appealing at face value, but they can all too easily lead to a big mess down the road. The larger and more complex the system, and the more external systems you have to integrate, the bigger the mess becomes.


There's a big difference between "things that need to be unique" and "things that are unique and also never, ever change"


In a recent large project, I moved to using GUIDs. Following advice from this article: http://www.codeproject.com/Articles/388157/GUIDs-as-fast-pri... it was fairly straightforward and performant.

Obviously YMMV (as may your definition of 'large' and 'performant').


UUIDs are wonderful: they're fairly small (16 bytes), and can be generated on multiple servers independently with no coordination. If you use version 4 or 5 UUIDs, the first couple of bytes are random, so short-circuiting comparisons should in theory bail quickly… but I've never had a need to measure.

Perhaps the only gotcha I've run into is that JS can't reliably generate them. (You need a decent RNG, and while JS has an RNG, no guarantees are made about it.) (Trivially work-aroundable with an AJAX endpoint on the server, of course.)


Strictly speaking they are 16 bytes when stored as binary, which sounds great, but when represented as strings (as they usually are) they are 36 characters, which is 36 bytes in ASCII but 108(!) bytes in a 3-byte Unicode encoding.

So, in practice they are terrible primary keys in systems that do not have a native UUID type. In that case, an 8-byte bigint generated externally by a ticket service like Snowflake is often far superior.
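The Snowflake idea is to pack a millisecond timestamp, a worker id, and a per-millisecond sequence into one 64-bit integer. A toy sketch (the field widths follow Twitter's published layout; class and variable names are made up):

```python
import threading
import time

class SnowflakeLite:
    """Toy Snowflake-style generator: 41-bit ms timestamp | 10-bit worker | 12-bit sequence."""
    EPOCH = 1288834974657  # Twitter's custom epoch, in milliseconds

    def __init__(self, worker_id: int):
        assert 0 <= worker_id < 1024  # must fit in 10 bits
        self.worker_id = worker_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit rollover
                if self.sequence == 0:          # exhausted this millisecond
                    while now <= self.last_ms:  # spin to the next one
                        now = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now
            return ((now - self.EPOCH) << 22) | (self.worker_id << 12) | self.sequence

gen = SnowflakeLite(worker_id=7)
ids = [gen.next_id() for _ in range(1000)]
assert ids == sorted(ids) and len(set(ids)) == 1000  # increasing and unique
assert all(i < 2**63 for i in ids)                   # fits a signed bigint
```

Each host gets its own worker_id, so no coordination is needed at generation time, and the ids still sort roughly by creation time.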


> they are 36 characters which is 36-bytes in ASCII, but 108(!) bytes in 3-byte unicode.

I'm going to argue that that's not the norm: "3-byte unicode" is kinda WTF, since it doesn't really exist. If you're in UTF-8, it's a 36 byte comparison.

That sounds, however, a lot like MySQL, which has perhaps one of the more braindead "UTF-8" implementations. That said, I'm not sure that it uses 3 bytes for code points that don't require it. (At least, that it spaces them that way: there may be nulls past the data, sure, but those won't count in a comparison.)


"In fact, even in RDBMS’s developers should be using a natural key as their primary key instead of an auto-incremented ID. This will lead to better performance when the natural key is the most commonly used identifier, but is often not considered out of a habit."

Primary/unique keys are indexed. Creating an index of strings instead of integers is surely going to result in a huge index and slow performance?


Well, like everything, it depends on your database, your data and how it's accessed. I've had a lot of varchar indexes in my time.

The article makes it seem like your primary key and main index have to be the same, but that's not necessarily the case. Most DBs have some support for physically ordered indexes.

So take username, for example. Let's say username has to be unique, but it can change. (It doesn't even matter too much if it's not totally unique, just mostly unique.) Let's say you're going to be doing most of your lookups in the user table by username.

If performance is an issue, keep your auto-incrementing guaranteed-unique primary key and add a clustered index on user name. The data is now physically stored by its natural key, but you still have a nice unique, unchanging primary key. Lookups are very fast, though depending on the implementation, inserts/updates might not be so much.

Postgres supports this with the CLUSTER command, although it has to be run manually - it doesn't keep up as updates and inserts come in. SQL Server calls it a clustered index and enforces the physical ordering on inserts and updates. Oracle has index-organized tables which can achieve a similar effect, though IIRC the index has to be the PK (nothing to stop you from having a unique constraint on a sequence-generated column, though.)
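sqlite3 (in Python's stdlib) has no clustered indexes, but the keep-the-surrogate-PK-and-index-the-natural-key half of the advice can be sketched with it (names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE users (
        id       INTEGER PRIMARY KEY AUTOINCREMENT,
        username TEXT NOT NULL
    )
""")
# Unique secondary index on the natural key: lookups by username hit the
# index, while foreign keys elsewhere keep referencing the stable integer id.
cur.execute("CREATE UNIQUE INDEX idx_users_username ON users(username)")
cur.executemany("INSERT INTO users (username) VALUES (?)",
                [("janedoe",), ("jsmith",), ("paul",)])

plan = cur.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM users WHERE username = ?", ("jsmith",)
).fetchall()
print(plan)  # the plan mentions idx_users_username, i.e. an index search
```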


In Oracle, the index on an index-organized table only needs to be unique. So, for example (with illustrative index and table names):

      1234 "John Doe" null null null
      5824 "John Doe" "Sale 1" "item 1" $5
      4321 "John Doe" "Sale 1" "item 2" $50
      4382 "John Doe" "Sale 2" "item 1" $4


      CREATE UNIQUE INDEX sales_idx ON sales (name, sale, item);


We've been having this argument about primary keys for three decades at least. If the natural-key vs. auto-generated-ID theories can't be reduced to practice, is there any hope for software development as a profession?


> An email address is a piece of information that is guaranteed to be unique to each user because everybody has their own email address.

I once let myself in for a lot of headaches by designing a system around this unwarranted assumption.


It looks like the purpose of the article is to explain why it's not a big deal that the company's database solution doesn't support auto-incrementing primary keys.


Instagram had a nice post on how they did their PK generation/sharding a couple years back. Still a good read, http://instagram-engineering.tumblr.com/post/10853187575/sha....

Previously discussed here, http://news.ycombinator.com/item?id=3058327


There's nothing that requires most auto-ID keys to be monotonically sequential, which is the bottleneck when distributed. If uniqueness is all that's required, blocks of IDs can be handed to different hosts and replenished on demand.

Gap-free, monotonically increasing IDs should only be used for things that have regulatory compliance issues. For example, there shall be no gaps in invoice IDs, and voided ones shall be marked.
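A sketch of that block-handout scheme (a hi/lo-style allocator; class names and block size are made up):

```python
import threading

class BlockAllocator:
    """Central ticket server: hands out non-overlapping blocks of IDs on demand."""
    def __init__(self, block_size: int = 1000):
        self.block_size = block_size
        self.next_start = 0
        self.lock = threading.Lock()

    def get_block(self) -> range:
        with self.lock:
            start = self.next_start
            self.next_start += self.block_size
            return range(start, start + self.block_size)

class Host:
    """A host draws a fresh block whenever its current one runs dry."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block = iter(self.allocator.get_block())

    def next_id(self) -> int:
        try:
            return next(self.block)
        except StopIteration:
            self.block = iter(self.allocator.get_block())
            return next(self.block)

alloc = BlockAllocator(block_size=10)
hosts = [Host(alloc) for _ in range(3)]
ids = [h.next_id() for h in hosts for _ in range(25)]
assert len(set(ids)) == len(ids)  # unique across hosts, though not monotonic
```

Hosts only talk to the allocator once per block, so the central service is off the hot path; the trade-off is that ids from different hosts interleave non-monotonically.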


Small project that may be of interest to those involved in this discussion: https://github.com/brandtg/alicia

Kind of "eventually monotonic" distributed auto-increment keys. Your mileage may vary.


Could someone point out to me how filtering and selecting records over some parts of the primary key (ex {sensorID} over {deviceID}_{sensorID}_{timestamp}) is better than just having regular indexed fields (ex {sensorID}) containing those pieces of information?


I believe that technique is intended for nosql document storage systems that otherwise don't have indexed fields.
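Right, and it only pays off when you filter on a leading prefix of the key. In a store that offers nothing but ordered keys, a composite key like {deviceID}_{sensorID}_{timestamp} turns "all readings for this device+sensor" into a range scan; a toy sketch over sorted string keys (key format is illustrative):

```python
import bisect

# Composite keys as they might sit, sorted, in an ordered KV store.
keys = sorted([
    "dev1_temp_2014-05-22T10:00",
    "dev1_temp_2014-05-22T11:00",
    "dev1_hum_2014-05-22T10:00",
    "dev2_temp_2014-05-22T10:00",
])

def prefix_scan(keys, prefix):
    """Return all keys starting with `prefix`, via binary search on the sorted list."""
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_left(keys, prefix + "\xff")
    return keys[lo:hi]

result = prefix_scan(keys, "dev1_temp_")
print(result)
# ['dev1_temp_2014-05-22T10:00', 'dev1_temp_2014-05-22T11:00']
```

Filtering on a middle component like {sensorID} alone gets no such help, which is exactly where a real secondary index wins.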


Slightly off-topic:

With JS disabled, the page header is white text on a transparent background.

With JS enabled, I see no way to remove the header. I have little enough vertical screen real estate, adding an extra chunk of stuff on the top doesn't help. I ended up removing it entirely.



