After a decade of building large systems on RDBMSs, we now use 64-bit integers for all primary keys, with a global Hi/Lo ID generation scheme (the app reserves a range of numbers on startup and assigns them to records automatically).
This gives plenty of ID space, maintains rough numeric ordering, allows ID creation without a round trip for every insert, is easily portable across different databases, and produces unique IDs for every row in the database, which greatly simplifies everything from caching to replication.
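For readers who haven't seen Hi/Lo before, a minimal Postgres-flavored sketch of the idea (the block size and names are illustrative, not the poster's actual setup):

    -- One "hi" value per reserved block; each nextval() reserves 1000 IDs.
    CREATE SEQUENCE hilo_hi_seq;

    -- On startup (or when the current block runs out), the app runs:
    --   SELECT nextval('hilo_hi_seq') AS hi;
    -- and then assigns IDs hi * 1000 + lo locally, for lo = 0 .. 999,
    -- with no database round trip per insert.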
The trouble with "natural keys" is that they're rarely actually unique. The barcode is a typical example. A naive developer might use a barcode as a primary key, but will soon be in for a world of pain when he realizes that products often use the same barcodes for different configurations (packaging etc), which usually need different SKUs. The same product from different origins may have the same barcode, which often matters.
Not a critique, just some additional perspective for those less well versed in database design.
It's also important to realise that almost every time a duplicate value on an assumed-unique key would trigger a constraint violation, the absence of that constraint would instead lead to nonsensical effects elsewhere in the system if the duplicate went unnoticed.
Barcodes are a typical example here too: unless the entire data flow from warehouse to cash register is equipped to handle duplicate EAN codes, all kinds of nonsense can result, potentially with significant economic risk if duplicates are allowed to enter the system unchecked.
Every time a constraint is removed from something that 'obviously' is, or needs to be, unique, the next step should be to add validation, duplicate handling and/or alias handling everywhere that column is touched. This is especially important at input and output.
Good database design is really hard; if at all possible, have other people try to come up with ways to break your schema.
It's funny that while I fully agree with your whole post, I've still lost count of the times a unique constraint that needed to be removed made for much more of a headache than if it hadn't been there from the beginning - because it turned out never to have been necessary, but you still had to cater for the uniqueness assumption in all associated code.
That's why I wouldn't recommend the author's rule of thumb:
> The rule of thumb is to add a key constraint when a column is unique for the values at hand and will remain so in reasonable scenarios.
I'd consider that a premature optimization. Add a uniqueness constraint if and only if there is a clearly defined need for it, because assessing reasonable scenarios quickly descends into a lesson about hubris.
> I've still lost count of the times a unique constraint that needed to be removed made for much more of a headache than if it hadn't been there from the beginning - because it turned out never to have been necessary, but you still had to cater for the uniqueness assumption in all associated code.
Where you work do people never make uniqueness assumptions in application code unless there's an explicit unique constraint in the database? That's impressive. Are you hiring?
It's just an observation based on experience (two decades in companies of all sizes, by the way) of the most blatant cases of this kind: the constraint was set purely on that rule of thumb, and since it was there in the schema it was taken as gospel - code throughout all the layers was written with it in mind and shortcuts were taken accordingly - all while there never was any requirement for it, nor was the uniqueness actually assured (cf. those posts about names), so the breakage was only a matter of time.
Maybe I've become jaded, but I've come to see these as being among the most annoying and unnecessarily time-consuming classes of issues.
ISO country codes, language codes and currency codes are so common (and change so rarely) that they should really be part of the database's standard distribution as types (e.g. enums), so that everyone is singing from the same hymn book and not every developer needs to implement the same tables in their DB.
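No database ships these out of the box, but as a rough sketch of what that would look like today in Postgres (a hand-maintained, abbreviated enum; table and type names are illustrative):

    -- Abbreviated list; in practice you'd load the full ISO 4217 set.
    CREATE TYPE iso_currency AS ENUM ('EUR', 'GBP', 'JPY', 'USD');

    CREATE TABLE invoice (
        id       bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        amount   numeric(12,2) NOT NULL,
        currency iso_currency NOT NULL
    );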
An ISO country code isn't a surrogate key; it's not DB-specific, and it has domain meaning (assuming you are using it for its intended purpose). It's usually a representation chosen for natural domain information, so if it's used as a key, it's a natural key.
The naturalness or artificiality of unique properties in a database is relative to the outside world. A key which was artificial at birth in some standards body or government agency becomes natural to us because it’s generally agreed upon in the world at large, and/or imprinted on objects.
It does seem to me that the author is a hobbyist or a junior developer with little real-world experience.
Advocates of natural keys are aware that many people spend their careers creating auto-increment primary keys everywhere. When combining disparate systems within a single organization, natural keys help. The biggest headache is difficult-to-identify duplicate data, and that headache can be avoided if you have a natural candidate key. So just be sure to add unique constraints where appropriate, and the auto-increment primary key (e.g. to make ActiveRecord happy) won't cause problems.
Can you elaborate a bit? Are you using an ORM, e.g. Hibernate? If so, have you considered other strategies like pooled-lo, IDENTITY or recursive CTEs that offer the same benefits but make the sequence values match the database values and reduce "ID loss"?
I'm also interested in some elaboration of this scheme. While I'm less interested in ORMs etc, the advantages/disadvantages would be useful to compare to some recent thoughts I've had regarding the use of databases.
Professionally I have little use for RDBMSs; however, recently I've been working on some side projects with a focus on flexibility and modularity that have changed my perspective on how databases should be used. Formerly, I (and I assume many newcomers) had a database-driven approach where the database structure came first and code grew around it. This reduces flexibility because your data is all coupled together, which will force refactoring of data when your storage configuration changes. Data refactoring sounds like a thing of nightmares, but the ability to easily migrate data from an RDBMS-backed module to a Redis-backed module with minimal fuss is the kind of flexibility I'd like to have.
The alternative I've settled on is to begin development with the notion of a generic 'data store' and fit a database around that later. Though I have yet to see the scheme come to fruition, I believe this will reduce data coupling, increase testability and debuggability for both the code AND the data associated with a given module, and make the data easier to migrate. This has also pushed me toward the exclusive use of integers for IDs, as they can be considered universally supported with minimal effort, which has me curious about further benefits of different ID generation schemes.
Anything with atomic increments can do this; we use a simple table called idsequence with a single column and row. When an application starts up, it bumps the counter by 100000 and gets the new value. That gives both the min and max of a range the app can then use as IDs for any entities, with no overlap with anything else.
It's not actually Hi/Lo, but it's similar. You can also add another column to that table if you want multiple "sequences", but we like to keep it simple - and with bigints you'll never realistically run out.
If you're using an ORM it might take some extra work, but most of the time they are smart enough to use an existing ID if the object already has one rather than trying to get one from the database.
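A minimal sketch of the counter-table scheme described above, assuming Postgres and illustrative names (the original setup may differ in detail):

    -- Single row, single counter.
    CREATE TABLE idsequence (current_value bigint NOT NULL);
    INSERT INTO idsequence VALUES (0);

    -- On application startup: atomically reserve a block of 100000 IDs.
    -- The app may then hand out (new_max - 99999) .. new_max itself.
    UPDATE idsequence
       SET current_value = current_value + 100000
    RETURNING current_value AS new_max;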
"For instance, a database of hobbyist club members could include uniqueness on the two columns first_name, last_name of members. Duplicates are likely unintentional at this scale, and if necessary the constraint can be dropped. Until an actual conflict happens the key is a reasonable precaution."
Absolutely do not do this.
People have names that are duplicates. A situation where someone is unable to join a club because their name clashes with an existing member is not OK.
Expecting a club administrator to be able to drop a uniqueness key from their database in order to resolve this situation is not a reasonable solution!
That's not the only point of the article I really disagree with:
> Here are some values that often work as natural keys:
> login names
This makes user names static, which would annoy people if they change their name (e.g. marriage, gender change, or a nationality-related change such as a Chinese name vs an English name) or just want to update their online handle.
> email addresses
Same problem as above. What happens if someone wants to change email address (e.g. they used a work one and changed jobs, they got a new email to combat spam on the old one, or they got a new email address to reflect a name change (see above))?
> mac address on a network
This makes more sense than the above, but MAC addresses can be spoofed or even just proxied, so you would need to be careful that the information you're uniquely storing against the MAC address really is intended to be unique. Most use cases that immediately spring to my mind wouldn't follow his uniqueness rule, but I'm sure there will be some examples that do.
> (lat,lon) for points on the earth
That might work if you also included elevation, but office and apartment blocks (for example) will often have different tenants at the same latitude and longitude coordinates.
> Some people seem to struggle with the choice of “natural” key attributes because they hypothesize situations where a particular key might not be unique in some given population. This misses the point.
Actually, no, it doesn't. That's entirely the point, as it demonstrates good planning and future-proofing, rather than setting arbitrary business rules that are hard to rectify if and when you do run into those particular edge-case scenarios.
RFC 2392 Message-ID values are supposed to be unique, but they may not be if the software generating them is buggy. I remember reading somewhere about some email program that assumed they were unique (in effect using them as a natural key). If you were unlucky enough to get messages with identical Message-ID values, all but one would disappear.
> This makes more sense than the above, but MAC addresses can be spoofed or even just proxied, so you would need to be careful that the information you're uniquely storing against the MAC address really is intended to be unique. Most use cases that immediately spring to my mind wouldn't follow his uniqueness rule, but I'm sure there will be some examples that do.
As one example, there was a bug in one version of CyanogenMOD that caused many phones to broadcast the same MAC, even when connected to Wifi. In some cases, this would cause different devices to kick each other off of a connection. See e.g. https://forum.xda-developers.com/tmobile-lg-g3/help/2x-lg-g3...
Does it? I can understand why making it a foreign key would make it static, but why would making it a normal key make it static? It seems to me that making login names unique would be preferable, same with emails.
If I understand the article correctly, it's discussing using natural keys as primary keys. E.g. if you're using login_name as a natural key, then it's presumably replacing a more traditional user_id, so it would be hard not to use login_name as a foreign key in that scenario.
Whereas if you still had user_id as your primary key, you could have login_name as a unique key (most RDBMSs I've used support unique keys) to enable the business logic discussed, but without encouraging its usage as a foreign key.
I think the article did touch on the fact that you can have multiple keys, but given its heavy emphasis on finding natural keys over the more traditional logic of incremental integers, I can only assume the point of it wasn't about additional unique keys but rather alternatives for primary keys.
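To make the distinction concrete, a small sketch (illustrative schema, not taken from the article) of a surrogate primary key alongside login_name as a separate unique key:

    CREATE TABLE users (
        user_id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        -- the unique key enforces the business rule but isn't the FK target
        login_name text NOT NULL UNIQUE
    );

    CREATE TABLE posts (
        post_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        user_id bigint NOT NULL REFERENCES users (user_id)
    );

    -- Renaming a user touches exactly one row; no foreign keys to chase.
    UPDATE users SET login_name = 'new_handle' WHERE user_id = 42;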
> If I understand the article correctly, it's discussing using natural keys as primary keys.
I'm not sure about the author's intent here. In the prior section they dismiss the necessity of primary keys. However, the author rails on about unique keys. I'm not familiar with PostgreSQL, but I assume every reference to "key" is effectively an index with a unique constraint. If that is true then all of the discussed issues with unique names are the same even if the author doesn't care about primary key in particular.
From how I understand it, the author does not make any suggestion to use natural keys as primary keys. He just defines what a natural key is. And he does mention that they may change and that this has to be considered.
Actually, in the summary he suggests creating a UUID column in each table and using it as the primary key.
I agree, I can't think of any system off the top of my head that lets you change your handle, or why you would really want to. If it's a forum or has commenting capabilities, your handle is your absolute identity. If it's not a forum, no one ever really sees your username, so there's little motivation to change it.
> I can't think of any system off the top of my head that lets you change your handle, or why you would really want to.
Pretty much any system with a real name policy both does and needs to let you change your handle, because preferred real (and legal) names change. Now, these systems often use email addresses as login names, but they sometimes also support changing the primary email address as well. In effect, identity is most likely managed by a surrogate key that isn't exposed.
Twitter does (though I don't know of anyone who has). As does Google and Facebook. Github too.
Away from web services, gaming platforms like Nintendo's Switch allow for user name changes, as do more enterprisey systems like Active Directory (Windows logins) and LDAP; the Linux/Unix passwd database also allows login name changes.
To be honest, I've encountered more systems that do let you change your user name than systems that don't.
For extra fun, there was talk today on Twitter about several people who have to file Japanese tax returns but have originally English names. There are many ways to convert such a name, and different systems actually force you to convert it in different ways, with various limitations.
This results in the system believing they are two people, one of whom hasn't filed returns, with no easy way to fix it (apparently the easiest fix may be to create a company?).
This is despite some kind of unique ID being submitted in both systems - but that ID is not currently reconciled.
I'm not 100% sure of the details myself, but I guess the moral of the story is that duplicate checks on the name itself may not even guarantee uniqueness - even assuming there was only one person with that name!
Completely agree. Adding a proper primary key, e.g. from a sequence, involves very little extra overhead and protects against this. Asking them to drop the constraint in the future sounds a bit silly, as that may mean two members get deleted when you intended only one. Plus your foreign keys now have to be maintained: if a member changes their name, you have to update all the foreign keys. Also, if the hobbyist club then builds a website to view members, the URLs can use the primary key, e.g. a long/bigint, rather than names, which would otherwise introduce further complexity with having to escape the URL etc. I am all for keeping it simple, but all relations should have a guaranteed unique primary key.
I know the author is talking about a small scale project here, but shortcuts like this (in my experience) just lead to a mess later on; irrespective of the complexity of the problem being solved.
To expand a little on your last point: one of the guidelines I try to follow is to expect your system to be successful. If your club is popular then it will eventually have two people with the same name. Don't waste people's time by putting in artificial constraints that will eventually become invalid.
> Expecting a club administrator to be able to drop a uniqueness key from their database in order to resolve this situation is not a reasonable solution!
Especially since, if you have any FK relationships and that was the only candidate key for the members relation (and why have such a bad key if it's not the only one), you then need to add new attributes to create a new key, populate them for all existing rows, and change all existing FK relationships and associated queries - on top of the easy part, which is dropping the old unique constraint.
Yes, that's the problem. I understand what he wants to say (basically YAGNI), but imagine the reception desk employee having this problem, escalating it until it reaches a developer just to register John Doe.
I agree with you, but I think you did overthink his metaphorical example a little.
I think the point was that for a 40-person club the odds are very very poor that two people would have the same name. And even if they did, a club manager could still differentiate by adding a middle name or a nickname (in a 40-person scenario).
Of course, IF your neighborhood club expands and you need to manage a lot of people, you'll have to switch to a better technique. His point was that you must fit the approach to your business case. (But hey, some people change their name when they marry, so... problems can happen fast - although that's a mutability issue, not a uniqueness one.)
Overall a very good article, but keeping a critical mind, as you did, is needed.
I'm not convinced that the odds are "very very poor". Some names (like Robert Smith or Maria Garcia) are quite common. And a hobby club is not a random selection of people: members typically tend towards a specific geographic area, social class or culture, depending on the subject. Some cultures have a very limited pool of names, depending on tradition.
Just from looking at a 200-member club I'm part of, I can see at least two name collisions in our current roster. So I absolutely agree with your assessment.
I don't have exactly the same name as my father, and we are not members of the same squash club. So, in the sample so far, there's only a 50% chance of that happening. I suspect that if we enlarged the sample, it would fall much further.
A single counterexample does not falsify the theory that the odds are very very poor!
Ah, this brings me back to the time when I was trying to convince a product manager that it was a bad idea to "validate" email addresses with regular expressions. I failed, and the product was rolled out with the following regex: ^[a-z0-9.-]+@[a-z0-9.-]+\.[a-z]{2,4}$
I used to hold this opinion, but this Stack Exchange post https://dba.stackexchange.com/a/165923/34006 changed my mind. In short, HTML5 defines its own specification of an email address here [2], and notes:
> This requirement is a willful violation of RFC 5322, which defines a syntax for e-mail addresses that is simultaneously too strict (before the "@" character), too vague (after the "@" character), and too lax (allowing comments, whitespace characters, and quoted strings in manners unfamiliar to most users) to be of practical use here.
> The following JavaScript- and Perl-compatible regular expression is an implementation of the above definition.
To quote the DBA Stack Exchange post, "if it's good enough for HTML5, it's probably good enough for you". And if you're using inputs with `type="email"` in HTML5, you already are constraining your emails to that format.
(That said, it sounds like the regular expression you were asked to use was not a good one)
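If you do want the database to back up whatever regex the team settles on (the HTML5 one or otherwise), a Postgres-style sketch - the pattern below is a deliberately loose illustration, not the HTML5 regex itself, and the table name is hypothetical:

    -- '~*' is Postgres's case-insensitive regex match.
    -- Swap in whichever pattern you actually standardise on.
    ALTER TABLE users
      ADD CONSTRAINT email_format
      CHECK (email ~* '^[^@]+@[^@]+\.[^@]+$');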
That's an interesting choice of rules. When cutting down the complexity so drastically, do you really need to double the length of your regex just to enforce the rule that dashes only appear in the middle of a domain segment? This version is so much more legible:
I'm guessing he put that in just to annoy people that are easily annoyed. Kind of like how every woodworking video on YouTube is invariably slammed with hundreds of commenters accusing the maker of being literally Hitler over some very small safety detail. After awhile they stop trying and just poke fun.
It does make me appreciate the one guy who can rise above all that, Paul Sellers.
While technically accurate, this is non-standard to such a degree, and represents such a minority percentage of the population, that "don't do this" really is an appropriate response to the user.
I'm curious how often people share e-mail addresses. I used to see it a lot for people who used their ISP's mail system. Most people who are even slightly tech oriented have gmail, hotmail, etc these days, but now I can't help but wonder how many people just stick with the default ISP e-mail.
Mailing lists and shared mailboxes might be used by several people (support@foo.com). No one uses them to sign up to online services, I guess, but the address might be stored in a database and be associated with several people somehow.
True, people do share. But you have to decide whether to treat the address as shared or not. If your system already assumes email must be unique, then your system will put a unique constraint on the email field. Otherwise, what will you consider as unique?
Phone numbers, however, should be avoided as IDs. Realistically, phone number ownership changes far more often than Google reclaiming your email address and giving it to someone else.
If your system already assumes email must be unique...
...then you're not building a system that works with the way users work. Some (very small) percentage of your users will have problems.
I first encountered this problem when I asked my parents to test my first startup. They shared an email account. When they signed up to things they just used the shared account. It was fine until they wanted separate accounts on the same service and it failed. It's not OK to assume everyone has a unique email address and that works as a reference to an individual. The assumption is incorrect. Email addresses are not unique to individuals.
A lot of services fail at this. If you're targeting older people then you shouldn't use a user's email address as a username, nor should you use it as a security factor.
> It was fine until they wanted separate accounts on the same service and it failed.
When it comes to designing a system, I try my best to prioritize this way: security, privacy, and finally usability. A product with the best user experience may have the most vulnerable design. I say this because we need to balance security and UX. Pick constraints which make sense for your system.
If we have to accommodate every corner case, the system will become extremely insecure and unreasonable. In your case, perhaps solve it with sub-accounts, if that makes sense for your system. To Netflix, sub-accounts seem reasonable, so it offers them as an option for users to share an account with family or friends.
Actually, sorry if this is a little out of bounds, but you should probably help your parents set up unique email addresses. They want a separate account for some specific reasons, I assume, so why not help them?
Email was created TO be unique. It is we the users who decide how to use an email address, BUT it is not on the system/service designer TO accommodate every use case.
They want a separate account for some specific reasons, I assume, so why not help them?
That's the point - they didn't want different email accounts. They wanted to share one. There will always be edge cases that don't conform to your expectations of how users should be doing things, and your applications probably need to work even in those less-than-ideal circumstances. If your app is going to be used by older people who might share an email address then I wouldn't make the account process require a unique address for every user.
I actually go a step further these days and try to make things that don't even require any email address. The less personal information required to create an account the better. So long as users are aware that they won't be able to recover an account if they forget their password there's no real need to ask for an email in the first place.
How did they want to identify themselves upon using the service? Email plus name? Is it worth making login more complex for everyone else?
Or do you expect old people that share an email account to gracefully handle unique usernames? I'm pretty skeptical of that one.
And "So long as users are aware that they won't be able to recover an account" sounds like a joke. Maybe one in twenty people will truly internalize that, if you're lucky.
The bit about duplicates likely being "unintentional at this scale" is an important part of the use context too though. There are many cases where allowing duplicate names increases errors and confusion for users -- not only because of unintentional duplication, but because users themselves often think of names as unique.
To me, surrogate keys are just about always preferable over natural keys, but it's also important to think about the "naturalness" of natural keys from a user's pov. In some contexts, using a surrogate key while also enforcing unique names (with the rare possibility of a case where an admin has to go in and do something weird like add a genuinely identical name with a "2" after it, say) can be a better trade-off.
At what scale does all this stuff start to actually matter? I have a database with ~100 tables and ~500M rows driving a medium-traffic web app and various back-end systems. We use auto-incrementing integers as primary keys and try not to expose them externally. Indexes are added as necessary to enable specific queries. We don't enforce any other constraints (e.g. not-null or foreign keys) at the database level.
... and it all works and performs just fine? The considerations the author mentions all make some sense to me in theory, but when do they actually matter in practice in a modern system?
You might not have any observable effects until the day you do, at which point you may be faced with an arbitrarily large problem.
What you are doing is passing up on the opportunity of catching various errors (you are also passing up the opportunity for some optimizations, but that is probably a secondary issue.) In particular, you are passing up on some opportunities to catch inconsistencies in how different applications (or different parts of the same application) create and use data.
One argument made against putting these sort of rules in effect is that they constrain what application developers can do, but that is the wrong way to look at it: any such conflict is an indication of a misunderstanding (not necessarily on the part of the application developers) that has been caught before it can lead to bigger problems, such as a database full of irretrievably inconsistent or incomplete data.
When the problems finally do arise, it is often the case that some sort of workaround is the only practical solution. This, in my experience, is one of the common ways by which systems accrue gratuitous complexity, which in turn has at least two real-world consequences: an increased time to make changes, upgrades and extensions, and an increased frequency of errors, especially WTF-type errors (and probably also efficiency/performance hits.)
My point is that you cannot know when problems will show up. If you do everything right, they will never happen, but why pass up an opportunity to reduce the risk?
I disagree that UUIDs are generally preferable over integers. For one, they take up more space (on disk and in memory). And for something like a key, it is likely that there will be multiple copies of that value stored, since it will exist in the table itself, at least one index (possibly more) and foreign keys. More space means fewer records per page on disk, more I/O and more memory usage (potentially leading to more I/O). I would wager that for most users (including this case), the cost of this additional I/O is far greater than some theoretical scalability limitation on generating IDs.
Most database vendors make sequence generation (whether through explicit SEQUENCE objects or auto-incrementing columns) performant by making a few compromises:
* Numbers may not always be sequential (you might get 1, 3, 2 - in that order)
* There may be gaps (you might get 1, 2, 5, 6)
But since these are supposed to be opaque identifiers, neither of these compromises should be a concern for most users. In exchange, these sequences can live outside of a transaction (you might grab an ID, roll back, and that ID is gone), and systems with multiple nodes can be allocated a "block" of numbers from which they can quickly pull new values without needing to coordinate with a master node.
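In Postgres, for example, that block-allocation trade-off is exposed directly on the sequence (a sketch; the cache size is arbitrary):

    -- Each session pre-allocates 50 values in memory: fewer trips to shared
    -- state, but values can interleave across sessions and gaps appear after
    -- a rollback or crash.
    CREATE SEQUENCE order_id_seq CACHE 50;

    SELECT nextval('order_id_seq');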
Client-side ID generation (e.g. UUIDs) can be very useful. For one, it lets the client send an entire batch payload of one or more inserts, plus updates that reference the new inserts, in a single round trip. It doesn't need to wait to get the inserted IDs back to make updates, or send more requests to other services. All it needs is an OK back. This is very convenient in systems where, say, you ingest a bunch of data that needs to be dumped into a final database. The entire batch can be ready, including IDs, at the sender side.
This is particularly useful in some other scenarios such as distributed, eventually-consistent systems that are able to resolve concurrency conflicts. Send out changes optimistically (with those pregenerated IDs), resolve conflicts by untangling the ones that didn't work.
UUIDs are the best choice when developer time is more important than space usage. Also, they can be generated and used by the client when the connection to the DB is frequently down (e.g. clients store/query data locally in SQLite and replicate to a master).
"The random nature of standard version 3, 4, and 5 UUIDs and the ordering of the fields within standard version 1 and 2 UUIDs may create problems with database locality or performance when UUIDs are used as primary keys. For example, in 2002 Jimmy Nilsson reported a significant improvement in performance with Microsoft SQL Server when the version 4 UUIDs being used as keys were modified to include a non-random suffix based on system time. This so-called "COMB" (combined time-GUID) approach made the UUIDs non-standard and significantly more likely to be duplicated, as Nilsson acknowledged, but Nilsson only required uniqueness within the application."
This article mentions this question. Database locality is a complex scaling question that may not be a problem depending on how read/write heavy your operations are. On the one hand, randomly distributed UUIDs may be a bad idea on a single SQL server trying to balance a single B-Tree index (and thrashing that index with a lot of incoming data); on the other hand it can be somewhat ideal for partition sharding across multiple servers. As with any trade-offs in database design, your mileage will vary with your application needs and resource availability.
Also, there are other options for time-ordered GUID/UUID alternatives. ULID is the one I've been heavily using in projects lately: https://github.com/ulid
(ULID uses a timestamp prefix and random suffix for reasonable time-ordered database locality; an interesting compromise between V1 and V4 UUIDs, though not directly compatible with either. The L stands for "lexicographic", in that it's also meant to be sortable in string indices, which can be important for database locality in many document/NoSQL databases.)
I'm also not a DB expert, but I definitely think locality is a concern with UUIDs (at least in Postgresql, which stores records in sorted order).
But, storage locality has always struck me as a use-case-specific optimization, in the sense that it's hard to take locality into account without knowing what traffic patterns the table is expected to experience. So, although UUIDs provide locality for exactly no types of queries, that may not be a problem for you.
For instance, you could use an autoincrementing sequence, but that only provides good locality for queries over sequential records. The COMB method would provide good locality for queries over time periods. But if you aren't performing those types of queries, you might be better off with a different type of PK.
UUIDs are a great option for creating IDs offline, but I don't understand your first comment: How do UUIDs optimize for developer time? In my experience, autoincrementing integers are the easiest choice for primary keys.
Sometimes yes, sometimes no, but mostly I agree that uuids are better for dev time. The first time something is written, they're roughly the same (or a very small bias towards autoincrement), but UUIDs have fewer dev-problems in the long run.
When you're starting out and don't have (m)any simultaneous writes to your DB, autoincrements are often perfectly sequential and identically sorted as a created_at column... which is convenient, but too often I see systems implicitly relying on this and having subtle problems when they start growing. E.g. pagination that doesn't account for gaps, or a missing "order by" that leads to the 1, 3, 2 issue mentioned above. It's also not often caught with tests, since they're frequently not run in parallel or at high enough velocity, and that lack of warning, time-coincidence with when the buggy code was introduced, or easily-testable reproducibility can make them hard to track down and fix.
There are a bunch more fairly minor things that others have mentioned (e.g. UUIDs make it much harder to write joins incorrectly, as they'll just always be empty until you do it right), but generally I'd just call it "death by a thousand papercuts". Autoincrements are great and I love them, but generally I lean towards UUIDs since they're a bit more bug-resistant. And performance-wise, if you're having problems I generally doubt you'd be able to fix it by switching to autoincrements (assuming it was just flipping a switch). For a short time, possibly, but generally at that point you'll be beyond the "pick whatever is easiest" stage and can make an informed decision that'll have a far larger impact.
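As a sketch of the kind of implicit-ordering bug described above (hypothetical table, simplified queries):

    -- Fragile: silently assumes ids are dense and insertion-ordered.
    SELECT * FROM events WHERE id BETWEEN 101 AND 200;

    -- Keyset pagination: explicit ordering, tolerant of gaps and reordering.
    SELECT * FROM events
     WHERE id > 100            -- last id seen on the previous page
     ORDER BY id
     LIMIT 100;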
---
tl;dr: I disagree with primary keys being "opaque identifiers" because they do have an effect on behavior in buggy code. Autoincrements mask more problems than UUIDs, and fixing even one of those costs more time than autoincrement saves.
As you can see leaking through in a few of the examples, the scenarios where it matters are the ones where non-experts are reading to and writing from the database. If a data entry clerk is just typing values into a form thinly wrapping an INSERT, you'd better make sure any input that would break implicit assumptions is rejected. If you have a legion of Bobs from marketing who "know some SQL", you can't hand-optimize every single query they try to run.
When the database is just serving a webapp you control end to end, the only part that'll really have a huge impact is making sure you can partition by the primary key effectively. Which is good, because complex indexing schemes and foreign key constraints actually scale very badly.
I would certainly look to add not-null and uniqueness constraints, although you might find that by now your data is actually violating some of those constraints!
A lot of people take the view that the application code is sufficient proof against bad data getting into your database, but this ignores (a) bugs and (b) back-door data loading, which in most applications will almost certainly be happening at some point.
The data in your database is the crown jewels (the application that fronts it is by comparison a piece of crap that you can replace at any time), so put guards on that gate ...
No constraints will often be faster, but your data will not be 100% consistent.
Over time you _will_ have child records pointing to non-existent parents, and the same with other foreign keys. These often don't really hurt anything, until they do. Since most testing is done with clean data, most issues caused by bad data show up in production and are reported by end users.
I've worked successfully on systems without any real constraints, but it is usually much better to start with constraints and remove them later as needed. Adding them later is a pain, as you first have to clean up the data and fix the code issues that created the bad data in the first place.
I've seen auto_increment fail to scale on MySQL. Large, long-running insert statements (e.g. using INSERT INTO ... SELECT FROM, or LOAD DATA INFILE) can lock the table resource and block concurrent inserts. Pretty easy to work around: have a table containing the next available key for every table, and use a separate small transaction to bump it before bulk inserts.
Another reason not to use auto-inc is if you need to insert lots of data into several separate tables that have foreign key relationships. If you know what you're inserting up front, and you need to insert millions of rows quickly, you're better off allocating the primary keys ahead of time and supplying them directly with the right keys for the relationship at insert time.
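A rough sketch of that pre-allocate-then-bulk-insert approach, written here in Postgres-flavored SQL for illustration (the original context was MySQL, and the table and column names are invented):

    -- Small, separate transaction: reserve a block of keys up front.
    BEGIN;
    UPDATE next_keys
       SET next_id = next_id + 1000000
     WHERE table_name = 'orders'
    RETURNING next_id;   -- upper bound of the reserved block
    COMMIT;

    -- The bulk INSERT ... SELECT then supplies explicit ids from that block,
    -- avoiding contention on auto-increment during the long-running statement.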
Separately, another argument against compound keys: if you're in a situation where you're trying to optimize joins, sorts and filters over moderately large tables, you want to minimize the number of columns you touch. Every column measurably increases query time; more data to shuffle, sort, compare, bigger indexes, etc. You won't see this if you're doing simple key / row lookups, but you will see it if you're presenting user-interactive slices over million+ row data sets.
auto_increment doesn't lock the table. When you use an auto_increment column, MySQL will grab the next int as it creates the insert write-ahead log message. Two concurrent transactions with T1 beginning first and T2 beginning second but actually committing out of order can thus have out of order ids. e.g. T2(id=10) T1(id=9)
Also, note that this means auto_increment IDs are not continuous (read: a reader looking after T2 commits but before T1 will see a gap, and if T1 fails that gap is permanent!)
"While initializing a previously specified AUTO_INCREMENT column on a table, InnoDB sets an exclusive lock on the end of the index associated with the AUTO_INCREMENT column. In accessing the auto-increment counter, InnoDB uses a specific AUTO-INC table lock mode where the lock lasts only to the end of the current SQL statement, not to the end of the entire transaction. Other sessions cannot insert into the table while the AUTO-INC table lock is held; see Section 14.5.2, “InnoDB Transaction Model”. "
If you have a long-running statement, it can block concurrent transactions. I've seen it specifically with 'load data infile', IIRC. We had to go through some painful migrations to remove auto-increment on some fairly large tables when we started seeing this.
That is an odd design. PostgreSQL only holds the lock long enough to increment a counter in memory, and only every 32nd value does it also write a WAL record. I can't see why one would need to lock the counter for the duration of the query.
You can get some real headaches with auto-incrementing integers if you have to go to a multi-master or some other distributed system for data inserts.
How do you know which integer to insert next for a table that is replicated across systems? One system could do even numbers and one could do odd. Or each could take alternating blocks of 100 integers. Or you could use UUIDs and not worry about it.
My argument is that auto-incrementing integers are inherently not scalable. They can scale to a certain extent, and then you have to switch key systems.
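For illustration, the interleaving workaround for two nodes looks like this with ordinary sequences (names are invented):

    -- Node A hands out 1, 3, 5, ...
    CREATE SEQUENCE ids_node_a INCREMENT BY 2 START WITH 1;

    -- Node B hands out 2, 4, 6, ...
    CREATE SEQUENCE ids_node_b INCREMENT BY 2 START WITH 2;

Adding a third node means re-carving every range, which is exactly the switch-of-key-systems pain described above; random UUIDs sidestep the coordination entirely.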
They don't necessarily matter until you hit Twitter-size for performance purposes, but take care to pay close attention to the idea of natural vs. artificial keys, and heed his warning about turning artificial keys into natural keys.
One day you'll be integrating database systems together and you'll be glad you followed his advice. I've seen some really ugly bizdata schemas.
My background is in financial applications, where an inconsistent state can literally be a multimillion dollar problem, so maybe my perspective is skewed a little. But FKs and other constraints are simply used to outright enforce consistency, in my experience at least. If you rely on your custom business logic to maintain consistency, then you leave yourself open to the possibility of breaking that in some way, and if you do, who says you'll notice right away? What if you don't notice for 6 months, and then realise that you have a billion transactions that have some form of integrity compromise?
Complexity obviously makes this problem more serious, but something like this can easily occur at any scale of complexity or transaction volume. If you try to commit some code that breaks the integrity of an important relationship in some way, or even if you just want to run an arbitrary statement directly against the database, then you want the DB to throw an error.
The most important property of surrogates, which I never see mentioned, is that two tuples, anywhere in the database, have the same surrogate key if and only if they are believed to refer to the same real-world entity. This is a crucial difference from auto-incrementing or randomly generated values that are independent in different relations (which is a common practice). Among other things, with surrogates defined this way you may freely join relations on surrogate keys, sure that you are joining related data.
In his landmark 1979 paper, Codd defined surrogates pretty clearly, and that part of his work was based on an earlier paper entitled On Entities and Relations (I don't recall the authors right now), so this is not exactly news.
Unfortunately, current DBMSs provide little to no support for properly implemented surrogates.
I would argue that the support is there. You can either use random UUIDs or a global shared sequence for this. Nothing requires you to have a sequence per table other than some syntactic sugar in the SQL standard. PostgreSQL also has its own OIDs, but I think that is more of a legacy from before they implemented sequences.
> Unfortunately, current DBMSs provide little to no support for properly implemented surrogates.
UUIDs instead of per-table autoincrement keys seem to be a solution which covers the part of the problem that involves DB support as opposed to data model design, and several DBs have adequate support for UUIDs.
UUIDs (and other mostly random IDs) also have the important advantage that they can be generated offline - in mobile apps, client-side javascript or elsewhere.
With autoincrement keys, you need to either contact the database server for each data object you create, or implement complex logic to replace temporary IDs with database IDs later.
You can't blindly overwrite existing data based on the UUID, sure.
But I find that the security checks against "stealing UUIDs" are much simpler to implement than a system of composite primary keys (deviceid+autoincrement or userid+autoincrement) or some sort of temporary primary keys. If you see suspicious UUIDs, you can just discard the data.
Having the client generate the "final" IDs for objects is particularly convenient when the client can produce a complex graph of objects offline, for example an order + order items + comments + more. If using temporary IDs until first contact with server, all these links would have to be patched up when the objects get real server-side-assigned IDs.
On Postgres I prefer to use a single sequence to generate ids across all tables. This reduces the chance of accidents (eg accidentally deleting the wrong thing with id #123) and reduces information leakage ("oh, I see I'm customer #5, you must only have 4 other customers").
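A minimal sketch of that setup (table names are illustrative):

    CREATE SEQUENCE global_id_seq;

    CREATE TABLE customers (
        id bigint PRIMARY KEY DEFAULT nextval('global_id_seq')
    );

    CREATE TABLE orders (
        id          bigint PRIMARY KEY DEFAULT nextval('global_id_seq'),
        customer_id bigint NOT NULL REFERENCES customers (id)
    );

    -- An id value now matches at most one row across both tables, so a
    -- DELETE with the right id but the wrong table simply affects nothing.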
Another approach I've seen used to preserve uniqueness, where ordering leakage doesn't matter, is prefixing IDs uniquely per table: something like uid147... for user IDs, pid12... for post IDs, and so on. It is certainly helpful when debugging, and it isn't hard to write a test to verify prefixes aren't duplicated across tables.
That's not a bad option, but I think I'd find locally-unique integer IDs spread across tables to be a little confusing. In that case I think I'd lean towards UUIDs, which may be a little easier overall, or use separate sequences but expose IDs via Hashids in my API/frontend: http://hashids.org
UUIDs are easier because they immediately scan as random values when you look at them. They are a clear indicator of surrogacy in a way that a spread-out sequence is not.
It doesn't matter how difficult picking a sequence is (and your post is odd because I don't think I gave any indication of unfamiliarity with Postgres?). It's a cognitive concern with keys that look simple but are not obviously unrelated (as a normal sequence is by convention or UUIDs are by definition) to the rest of the table.
Sure, but Postgres doesn't, to my knowledge, support a cross-table UNIQUE index to ensure your sequences are followed in all cases. I imagine that all it takes is one confused junior developer adding a bunch of manually IDed data by hand to create some weird and confusing corner cases in this scheme.
You can define the field as GENERATED ALWAYS, though, in which case specifying the value explicitly in INSERT won't work (there's an escape hatch, but it has to be explicitly and very verbosely invoked).
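A quick sketch of that guard in Postgres 10+ (the table is illustrative):

    CREATE TABLE widgets (
        id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        name text NOT NULL
    );

    -- Rejected: "id" is an identity column defined as GENERATED ALWAYS.
    INSERT INTO widgets (id, name) VALUES (42, 'hand-picked');

    -- The deliberately verbose escape hatch:
    INSERT INTO widgets (id, name) OVERRIDING SYSTEM VALUE VALUES (42, 'hand-picked');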
What is the advantage over UUIDs? The author says that the surrogate keys should never be exposed outside of the database anyway, so no problems with data leakage. If an artificial key is to be exposed then it can be obfuscated to prevent information leakage.
I think it should be obvious that it is obfuscated, though, like Youtube video IDs, for example. If you generate integers then your users might assume they are not obfuscated. Eventually a customer will get some low value integer and might assume that they are "customer #5" even if they are not.
Using a global sequence means 8 bytes less storage per row (including index entries and all foreign keys which refer to the PK), and more importantly typically mostly sequential inserts into B-tree indexes which means less bloat and faster insert and access.
Disadvantages are data leakage and, theoretically, minor contention on accessing the sequence.
It's an interesting article with interesting ideas.
I'm squarely in the camp of using natural PKs until there is good reason to use surrogate PKs.
I generally disagree with the notion of using a combination of surrogate and natural keys. In SQL, a PK isn't just another unique key: a PK is an important piece of communication.
As a rule of thumb, when a PK is attached to a semantic value, it is saying that this is the identifier of the table. If a PK is on a surrogate key, it is saying that there is no good unique identifying value in the table. When you are dealing with larger data sets, this distinction isn't minor, as it helps to understand the intent of the data when working with it. PKs serve as guideposts in your design, and as guideposts for the person who has to maintain (or fix) your database later on.
I know some disagree with me on that, but there are many undeniably good reasons to use a natural key. A good place is a check-constraint table, where you, say, have a list of US states and want to ensure that "New York" and not "New Yoerk" gets inserted into the state column of an address table. Put a PK on valid_state_names and an FK to that PK from addresses.
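A sketch of that pattern (simplified; column names are illustrative):

    CREATE TABLE valid_state_names (
        state_name text PRIMARY KEY   -- natural key: the value is the identity
    );

    CREATE TABLE addresses (
        id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        line1 text NOT NULL,
        state text NOT NULL REFERENCES valid_state_names (state_name)
    );

    -- 'New Yoerk' now fails the foreign-key check instead of quietly persisting.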
> If a PK is on a surrogate key, it is saying that there is no good unique identifying value in the table.
Almost always there is no good unique unchanging identifying value on a table. Take something as simple as "Person" -- there is literally no unique unchangeable value for such an entity. And that's the rule more than the exception.
I take exactly the opposite approach; nearly all entities should be identified with a surrogate key. You pretty much cannot go wrong with this approach.
In the US, state names are pretty stable, but in other countries state/province names have changed, so even that value is a poor choice for a primary key.
Yes, each set of data should reflect the data you want to use. That's why I said I default to natural keys unless there is good reason to use a surrogate key.
If I have to compare databases I've worked on over the years, the databases that are all surrogate will have many more data errors than a database that is predominantly natural keys. I could write a book on why that happens, but if you don't know relational theory cold (not saying you don't), you should just use natural keys. The short answer is that using all surrogate keys is equivalent to not using unique constraints at all.
Data is bound to change at some point in the future. The entire point of having constraints is to ensure that your data keeps its integrity, now and later on, no matter how the data needs to be updated.
I'm working with a 3rd-party database right now that used more than a few natural keys, and it's awful. The primary key for "users" is username, so we can't rename our users to improve our company's username security policy. And that's just one of the problems. We'll be offline for a full 24 hours in April to handle changing the value (for necessary business reasons) of the natural primary key in the client table.
I'm all for constraints, especially uniqueness constraints, where appropriate. You can use unique indexes; you don't have to use natural keys as primary keys. Data is bound to change, so make that change possible.
A badly built database isn't the reason to toss out good practices.
It's actually quite obnoxious to see how bad many databases are, and this often causes a blow-back of tossing the baby out with the bathwater. It's a trap.
I'm not really sure what all the problems are, but if changing a username to something else is causing a 24-hour shutdown, there are many deeper issues than a PK on username.
> A badly built database isn't the reason to toss out good practices.
It's a good reason to toss out bad practices like, for example, using natural keys as primary keys. Which is, almost universally, considered a bad practice. It is, in fact, the bad practice that causes the most pain in this bad database.
> if changing a username to something else is causing a 24-hour shutdown, there are many deeper issues than a PK on username.
This is true; the software is so old it pre-dates most modern RDBMS features, and that is a factor in the database design. And it contains a few decades' worth of data. And the process of going through all the records and re-doing every foreign and primary key value is expensive (it goes through the app server) but is a well-tested path. None of this would be necessary, however, if they had used surrogate keys for these tables. They did use surrogate keys for other tables, so it wasn't a technical limitation.
> One thing to avoid in publicly exposed numbers is sequential order. It allows people to probe for resources (/videos/1.mpeg, /videos/2.mpeg etc) and also leaks cardinality information. Add a Feistel cipher on top of a sequence. This hides the ordering while maintaining uniqueness.
> The PostgreSQL pseudo encrypt wiki gives an example cipher function:
> [...]
I get that they want to leave the key value in the schema, but practically this kind of thing feels worth pulling out into the storage engine. Just generate a random string as the ID upon row insertion and force uniqueness on that column. If insertion fails, generate another random string. (And it should pretty much never fail, because otherwise you're still susceptible to the enumeration attack you're trying to prevent.)
Although given that Postgres can read from a file, could the above be done in the schema by reading from /dev/urandom? If so that seems like the better approach.
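Rather than reading /dev/urandom directly, the pgcrypto extension gets much the same effect; a sketch of the "random identifier, retry on the rare collision" idea (sizes and names are illustrative):

    CREATE EXTENSION IF NOT EXISTS pgcrypto;

    CREATE TABLE videos (
        -- 8 random bytes -> 16 hex characters; collisions are vanishingly rare.
        public_id text PRIMARY KEY DEFAULT encode(gen_random_bytes(8), 'hex'),
        title     text NOT NULL
    );

    -- If a collision ever does occur, the INSERT fails on the primary key
    -- and the application simply retries.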
>Modern SQL is supposed to abstract from the physical representation. Tables model relations, and should not expose an implicit order in their rows. However even today SQL Server creates a clustered index by default for primary keys, physically ordering rows in the old tradition.
Which is why the UUID (or GUID in SQL Server speak) can have other drawbacks there in comparison to auto incrementing bigints, namely delays due to the data being ordered randomly on disk. There are obviously ways to counter that like adding extra non-clustered indexes or changing the clustered index to something besides the primary key, but at that point the extra time and overhead might defeat the benefits you gained from going with UUIDs in the first place.
In theory yes, it solves some of the problems and introduces others. It isn't usable everywhere NEWID is usable, so if you assign UUIDS as part of a query you are out of luck. It also assumes you are generating all your UUIDs on the same machine without restarting which once again negates some of the benefits of UUIDs.
The solution to your concern - "UUID generation algorithms (like Twitter's "snowflake" or uuid_generate_v1() in the uuid-ossp extension for PostgreSQL) produce monotonically increasing values per machine" - is discussed in the article.
Which is what your sibling comment mentioned. The author does touch on that approach but never touches on the drawbacks, some of which I mentioned in my other comment and some which are directly in conflict with the benefits of UUIDs. My point is merely that there are other details that a MSSQL user should take into consideration before simply following the author's suggestion. A novice MSSQL user might be better off with auto incrementing bigints in a way a novice user of a different RDBMS might not.
I think the article was pretty focused on PG. :) Unlike MSSQL, it does not cluster data on an index (you can run the CLUSTER command, but the ordering will not be preserved as data is inserted and updated), so that concern is fairly specific to MSSQL.
This is well done. And yet many people get this stuff wrong. As a language person, I think about how we could make these choices more natural by rewriting SQL; make them the path of least resistance rather than requiring much pondering and wisdom.
Let's be honest: If you could make these choices more natural then you would have done so by now. If you could make SQL better than it is, you already would have done that as well. In fact, if most people who are interested in SQL could make SQL better than SQL is, we wouldn't be using SQL anymore.
We're not using it because we have some weird tradition we enforce. We're not using it because we like the way it looks, the way it writes, or the way it handles. We're not using it because we've tried to use ORMs and those ORMs have proved conclusively better across use cases than SQL.
The reason we use it is because we don't yet have a better solution. If you can write a better solution than SQL I would love to hear about it: I absolutely loathe SQL, but I use it because out of everything I've seen, it's the simplest and most elegant solution to the problem that I've ever come across.
So yeah, while we could possibly and should definitely make SQL better if we can, I don't think that suggesting that we just re-do it from scratch is the way to go. That way lies madness, and possibly javascript frameworks.
I think you're discounting network-effects and historical inertia.
Even things which are functionally very simple to implement won't necessarily catch on, ex: "If you could make keyboard layouts more practical than QWERTY, then you would have done so by now."
I mean, maybe I am, maybe I'm not. How many other languages from the '70s haven't been replaced in a domain where they should have been?
And, I mean, maybe your example isn't really the best example you could have given. Name a layout quote-unquote more practical than QWERTY that should have caught on but didn't. Dvorak? No faster. Colemak? Nah. Any, y'know, studied benefits? None I can find, at least not with a P-value worth mentioning.
I'm not saying SQL is the definitive end of database query languages, of course, because that would be silly. There is, almost certainly, a more objective truth. What I'm saying is good luck finding it; It's only semi-cynically that I say I wish you luck finding a better language than SQL. I honestly, with all my heart, wish that SQL could be replaced with something better, because I hate SQL.
When some guy comes along and is like "we should do a better SQL!", the answer is yes, we should, but I'm not exactly going to hold my breath.
> And, I mean, maybe your example isn't really the best example you could have given. Name a layout quote-unquote more practical than QWERTY that should have caught on but didn't. Dvorak? No faster. Colemak? Nah. Any, y'know, studied benefits? None I can find, at least not with a P-value worth mentioning.
Do you believe that all keyboard layouts are of indistinguishable practicality? That seems to be the inescapable conclusion of your claim.
I'm happy to take hug's claim that there is no such evidence at face value. What I'm really interested in, however, is whether the absence of such evidence causes him/her to believe that there is no difference.
Sorry for taking so long to get back to you: What I believe is that there are currently no keyboard layouts worth the hassle of switching, and to be honest I have a bit of a problem believing that there's any particular layout of 12 inch by 3 inch bit of space covered in linearly activated keys that is going to ever change that fact.
Most changes in keyboard layouts are obviated by the fact that peoples fingers are more flexible than the language they type in.
Do I believe there's better possible input methods than a 104-key keyboard? Most certainly. Have we got any yet? I don't think so.
Thanks for the response. For most people I'd agree with you. I found my RSI symptoms were significantly improved by switching to Dvorak though. I concede that is not a scientific test :)
Sure SQL is pretty good as far as entrenched 70s techs go, but that's a pretty low bar to clear for design (separate from making something which people actually use).
The more I learn about databases, the more I realize just how hoary that world is, teeming with verdant beasts just itching to snatch you away from your comfortable application into Cthulian madness.
SQL abstracts all of that away. Honestly we should thank the computing gods and make daily offerings that such a thing is even possible.
I think there is some network-effect and inertia, but I also think it is hard to design a query language which is significantly better than SQL. All attempts I have used have instead been clearly worse.
I share your opinion that SQL is a very flawed language but it is still the best option I know of.
I am absolutely sure someone could make a better SQL from scratch. (But you are correct that most people won't be able to.) I am not a language designer but I've been working on an SQL generator (almost DSL) in Racket. Although it's still very rough, I like it better than SQL for writing views and stored procedures. Imagine if a world-class language designer developed a new query language for your RDBMS of choice. It would probably fix most of the things you loathe.
1. Rename the constraints to mimic the classification of keys described here, e.g. require UUIDs for artificial keys.
2. Make joins on foreign artificial keys much lighter syntactically than free-form joins on natural keys. Make joins involving non-unique columns in the WITH (today's syntax) harder still. People should be able to think of foreign keys as references to be dereferenced, separate from full joins.
3. Going further, break apart "tables" to represent the different things people actually use them for, e.g. key-value maps vs "join tables" (a term I had never heard before but could immediately define from experience) vs canonical data. In math terms, distinguish sets of aggregate data from relations. Having full algebraic data types for the sets of aggregate data, but not for relations, could help. (Maps are a special case of relations.) A final related distinction is tables-as-types vs tables-as-collections. I suppose that relates (no pun :)) back to what joins are allowed.
4. Keep the references pointing in the right direction. For example, à la Rust and capability theory, the owner should point to its uniquely owned children. But in SQL today we instead invert the back reference, having the children refer to their unique parent. Supporting lists (of children) and unaliased foreign keys directly would make the canonical back-reference an implementation detail.
The relatively new "INSERT ... ON CONFLICT" mechanism is a great way to circumvent this issue. Using that feature, you can tell it to ignore conflicting dupes, or you can have it transform the insert into an update.
"There’s no need to manually create indexes on columns already declared unique; doing so would just duplicate the automatically-created index."
Is the last part real? It makes me feel like forking PostgreSQL just to save the world from accidental duplicate expensive indexes.
An index is usually the most efficient way to check for uniqueness, so most databases do in fact handle this constraint with an index. Some of them will also automatically reuse this index if you manually try to create one for the same column.
I think the question is more about whether a duplicate index would be created, instead of the unique constraint and the primary key designation on the same column sharing an index.
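In Postgres, at least, nothing deduplicates for you; a quick sketch of how the redundancy arises (illustrative table):

    CREATE TABLE t (code text);

    -- Creates a unique index automatically to enforce the constraint.
    ALTER TABLE t ADD CONSTRAINT t_code_unique UNIQUE (code);

    -- This second index is redundant: same column, extra write cost per insert.
    CREATE INDEX t_code_idx ON t (code);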
Arguably Joker is the number, and like Ace is 1 you might use 0 or 14 depending on math preferences as a surrogate number (or stick to the character designations like A, K, Q, J).
Decks with four jokers typically have suited jokers. Decks with two are typically (though not always) red/black and could be given arbitrary suits (say hearts/spades).
Decks with four jokers are often suited (just as the most common decks with two jokers are red/black), so a deck with three jokers you could treat as suited jokers that are missing a suit.
programmers who are intrigued by the recommended pg_hashids extension have a wide variety of alternatives to using a pg extension for the same key obfuscation logic: http://hashids.org/
OK. But how does that relate to this article? If you're using keepalived, what you're doing is moving the connection from one instance to another if the current one is detected as being dead. Even if only the MAC were used to generate the UUID (which it is not - the timestamp is factored in too), I struggle to find a scenario in which this would be a problem. Even if you time it down to the nanosecond, if keepalived is used as intended you'd never generate a UUID at the same time on a number of machines sharing the MAC.
And from a privacy standpoint, if the MAC cannot be considered unique, isn't that only a bonus?