
Interesting indeed. I'd say that a primary thing that makes an object is its identity - and the identity is never a part of the object's data, nor could it ever be. In programming languages, objects' basic identities are determined by however the language implements references; in databases, a natural solution would be a surrogate unique key...
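To make the distinction concrete, here's a quick Python sketch (the variable names are throwaway):

    # Two lists with identical contents: equal as data, distinct as objects.
    a = [1, 2, 3]
    b = [1, 2, 3]

    print(a == b)        # True  - same data
    print(a is b)        # False - different identities (references)
    print(id(a), id(b))  # two distinct identity values, stored nowhere in the data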



That's exactly it.

And furthermore, objects in the real world also have identities, which are distinct from their attributes. You aren't you because you have a certain name and date of birth, or a certain SSN. You're you because you're you. And we all intuitively know and understand this concept of identity. So any software that deals with real-world entities has to reflect that concept, which is exactly what synthetic primary keys do.

And it might not be purely relational; but if the purely relational model cannot handle this, then it's simply not a particularly useful model for most practical scenarios.

BTW, a pedantic argument could be made here that, ultimately, the identity of physical objects is also data - you just need to go low-level enough (in the end, the identity of any entity is its wavefunction) to see it. Of course, that ignores the practical aspects.


> You're you because you're you.

Are you sure that 'you' today is the same as 'you' yesterday? The notion of identity, however intuitive it may seem to some, is far from being universally obvious.


If you get all the way down to where philosophy mixes with quantum magic, then sure, I can close my eyes, open them again and not be sure if I haven't died in the meantime.

But there is an intuitive concept of identity that every human has after some age. A lot of philosophy goes into pointing out that this concept of identity doesn't really work well if you stress it enough - as evidenced e.g. by this identity staying constant under a ship-of-Theseus transformation. But under this level of scrutiny, a lot of other things break too, including the meaning of words themselves ("what do you mean by 'is'?").

As it is, this concept of identity is natural to us, it's the way we see the world and organize our societies, and it works fine for all practical purposes.


Actually, I was far from being philosophical. Even physics has nothing to do with this. Thinking otherwise would be reductionist. In fact, the notion of (a person's) identity depends on the social context (including law). There is even a process by which a person can have their identity changed.


It's universally obvious in the sense that every human culture and every human language has it (or at least I'm not aware of examples to the contrary).

And it may well be fuzzy, but it's rather telling that this fuzziness nevertheless remains confined to philosophy and metaphysics. The Asharites, for example, believed that the world is recreated by God "at every moment" (and that this is what makes the notions of time, motion etc. possible), so in that sense, every person is also re-created. But their societies did not derive any peculiar social customs from that belief - in practice, they still treated any person at any given moment as if they possessed a stable identity over time.


At odds with Reactocese, who suggested that the world at each moment is diffed against a 'shadow world' to apply the minimal set of changes that render the new reality.


Identity means that:

1. We have some EQ function such that if EQ(X, Y) reports true, then X and Y are the same object; and

2. There doesn't exist any function which is more discerning than EQ; if EQ concludes that X and Y are the same, then no other function can find a difference.

In programming, the current object may be different from that object moments ago because of mutations. However, inside the virtual machine, we are forced to conclude it's the same object. The reason is that within the machine, we do not have a function which can distinguish the two objects. And the reason for that is that under mutation, we do not simultaneously have the previous object and the new object. Therefore, we cannot apply both of them to a function.

We can discuss yesterday's you and today's you, and find differences. But the discussions of you are not you; they are ideas of you.

Likewise in the machine, we can snapshot the state of an object into some serialized format or whatever, and then two snapshots at different times are different, informing us how the object changed. The two snapshots are not EQ, but they are not the object.
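To put the same point in code - a minimal Python sketch, taking `is` as the EQ function (the object and names are made up):

    import copy

    obj = {"name": "Alice", "title": "Engineer"}
    ref = obj                             # a second reference, not a second object

    snapshot_before = copy.deepcopy(obj)  # an idea of the object, not the object
    obj["title"] = "Senior Engineer"      # mutate in place
    snapshot_after = copy.deepcopy(obj)

    print(obj is ref)                         # True: mutation didn't change identity
    print(snapshot_before == snapshot_after)  # False: the snapshots record the change
    print(snapshot_before is obj)             # False: neither snapshot IS the object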


This actually only underscores the usefulness of surrogate keys.

Let's say you are building cloud-based HR software of some type. You have an employees table where you store employees.

If someone gets married and changes their name, or gets a promotion and changes title, they're still the same person (and the same database entry).

However, if that someone leaves that job, and then gets a new job at a totally separate company -- that also happens to use your service -- they are a new database entry.
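A sketch of what that might look like, using SQLite from Python (the table and column names are invented for the example):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE employees (
            id      INTEGER PRIMARY KEY,  -- surrogate key: pure identity, no meaning
            name    TEXT NOT NULL,
            title   TEXT NOT NULL,
            company TEXT NOT NULL
        )
    """)

    # A marriage or a promotion changes attributes, never the identity.
    cur = conn.execute(
        "INSERT INTO employees (name, title, company) VALUES (?, ?, ?)",
        ("Jane Smith", "Engineer", "Acme"))
    emp_id = cur.lastrowid
    conn.execute("UPDATE employees SET name = ?, title = ? WHERE id = ?",
                 ("Jane Jones", "Senior Engineer", emp_id))

    # The same person at a new company is a new row - a new identity.
    conn.execute(
        "INSERT INTO employees (name, title, company) VALUES (?, ?, ?)",
        ("Jane Jones", "Principal Engineer", "Globex"))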


I'm not sure I get your point? In the scenario you describe, the natural key for the employees table could be, for example, legal company number + employee's payroll number.

Or are you saying that by using a surrogate key, we only ever have to create one record for an employee, and just update their company details etc. when they change jobs?


I suppose you could derive a key like that, but then is it still a "natural" key? What's the benefit over just having a completely arbitrary key?

I'd argue for a surrogate key in this case because I wouldn't want someone to be parsing my id, and having to worry about changes and backwards compatibility.

E.g., the key might be 2360108, where 236 is the company number. Then as we grow, eventually you get an id like 10910249, which is ambiguous (did company 109 grow to over 10,000, or is it company 1091?).

I deliberately chose a short prefix for illustration, but this could be an issue with any scheme (you'll never perfectly predict all the future requirements), which using surrogate keys completely avoids.


To be clear, I'm not proposing that the company and employee numbers be concatenated into a single column: I'm saying you could have a composite key, comprising the company number column and the employee payroll number column. This would still be a natural key: each component has a meaning outside the system (company numbers appear on company filings and other formal documentation, employee payroll numbers appear on payslips and are communicated to the tax authorities), and the two in combination have a clear business meaning.

Of course, you may still want to use a surrogate key for performance or security or other technical reasons, as the article describes.

I just wasn't clear on how the specific scenario you laid out made any additional case for their use.
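For concreteness, the composite natural key I have in mind would look something like this (SQLite via Python; column names are hypothetical):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE employees (
            company_number TEXT NOT NULL,  -- appears on company filings
            payroll_number TEXT NOT NULL,  -- appears on payslips
            name           TEXT NOT NULL,
            title          TEXT NOT NULL,
            PRIMARY KEY (company_number, payroll_number)  -- composite natural key
        )
    """)

No concatenation and no parsing: each component stays its own column, and the pair identifies the row.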


Another good piece of food for thought regarding identity: if you take a complex object like a jumbo jet and replace every single component in it, one after another, is it still the same jumbo jet at the end?


Less hypothetically, most of the atoms in a complex living organism are replaced every few months. (One objection could be that atoms of the same sort are all identical, but that makes the issue of 'identity' even more complex.)


> nor it could ever be

Well, never say never... In a world of immutable objects, an object's identity (reference) can be derived from its content. Look, for example, at how objects are stored in Git.
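Git's blob identifiers are exactly that: a SHA-1 over a short header plus the content. A quick sketch in Python:

    import hashlib

    def git_blob_id(content: bytes) -> str:
        # Compute the object id Git assigns to a blob with this content:
        # SHA-1 over the header "blob <size>\0" followed by the bytes.
        header = b"blob %d\x00" % len(content)
        return hashlib.sha1(header + content).hexdigest()

    # Same content -> same identity; change one byte -> a brand new identity.
    print(git_blob_id(b"hello world\n"))  # matches `git hash-object` on the same bytes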


And of course, by now everyone has seen https://stackoverflow.com/questions/3475648/sha1-collision-d... . It's an interesting problem space: trying to generate a unique ID for the contents of a thing, where the size of that unique ID is less than the size of the thing. If you consider that the thing itself (a deep representation of the object) is the most obvious way to uniquely identify the thing, but then figure you can do better, it would seem that it would always be technically possible, outside of lossless compression algorithms (like gzip), to generate something with the same hash. I'm sure that there is tons of theory on this and that my terminology isn't correct here. Can someone steer me in the right direction regarding what this type of thing is called?

EDIT: an important detail I'd forgotten: hash functions take arbitrary-length data in and spit out fixed-length hashes. I learned about https://en.wikipedia.org/wiki/Perfect_hash_function, and https://crypto.stackexchange.com/questions/29264/why-cant-we... is relevant. Relevant quote: """ Mathematically speaking, there is no such thing as a collision-free hash. Practically speaking, there is. """


A hash function that takes arbitrary data and generates fixed-length hashes will always have collisions, because there are simply more possible inputs than possible outputs. This is also known as the pigeonhole principle[0]: if you have 8 pigeons and 5 holes for them, and you want to put all the pigeons in the holes, then at least 3 of them will have to go into an already occupied one.
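You can watch the pigeonhole principle in action with a deliberately tiny hash - truncate SHA-256 to one byte, so there are only 256 possible outputs, and a collision is guaranteed within 257 distinct inputs (this toy function is mine, not anything standard):

    import hashlib

    def tiny_hash(data: bytes) -> int:
        # A deliberately terrible hash: first byte of SHA-256 (256 possible outputs).
        return hashlib.sha256(data).digest()[0]

    seen = {}
    for i in range(257):  # 257 pigeons, 256 holes
        data = str(i).encode()
        h = tiny_hash(data)
        if h in seen:
            print(f"collision: {seen[h]!r} and {data!r} both hash to {h}")
            break
        seen[h] = data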

The practical aspect of making a hash function "collision-free" is using trickery and dark magic to reduce the probability of a collision down to somewhere below "Earth just got hit by an asteroid, we all have other problems" territory. You want to go really, really low because hashes are mostly used to quickly verify data, and malicious actors will be happy to throw compute at trying to generate collisions so they can alter the data.

As for how the trickery and dark magic is performed, I'm a wizard of too low a level to explain the details; all I recall is that you try to make the hash maximally sensitive to every bit of the input data, so that a random bitflip anywhere has a huge probability of greatly changing the hash.
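That property (the "avalanche effect") is easy to observe, though - flip one input bit and roughly half the output bits of SHA-256 change:

    import hashlib

    msg = bytearray(b"the quick brown fox")
    h1 = hashlib.sha256(bytes(msg)).hexdigest()

    msg[0] ^= 0x01  # flip a single bit of the input
    h2 = hashlib.sha256(bytes(msg)).hexdigest()

    # Count how many of the 256 output bits changed (~128 on average).
    diff = bin(int(h1, 16) ^ int(h2, 16)).count("1")
    print(f"{diff} of 256 output bits differ")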

--

[0] - https://en.wikipedia.org/wiki/Pigeonhole_principle



