That's entirely a schema/data policy question. Null could indicate no name, or it just as easily could indicate tha data was not collected, and an empty string would indicate it was collected but is empty.
Both are entirely valid methods of encoding the data, and require knowing how the data is intended to be collected, entered and used to determine the best choice.
If that's the data policy of the group collecting the data, then an empty date (which some systems support, 0000-00-00 is valid in MySQL for certain modes of strictness IIRC), or a sentinel value could be used (1111-11-11, depending on expected values).
If the data needs to differentiate between those cases and a separate boolean bit of data to track whether it was collected is no feasible (as it so often isn't in the real world), then you do what you must. How that works out in the real world would be the policy for how to interpret the data in that field. It's really no different than any time you've seem a field that is named something counter-intuitive to what it holds in some schema. Something tracks that difference, whether it be institutional knowledge, come conversion code, or a label prior to display. That's what I mean by it being a "schema/data policy question".
Yes absolutely a data schema question. And yes, you can set whatever arbitrary policy you want. eg "nmn" for no-middle-name. Or require that a horse's birthday is recorded as 1st August.
But SQL and the relational model is meant to be a logical system, and I'm interested in preserving some conceptual integrity in the domain modelling.
I would say there's a spectrum. My guess is that whatever schema you could come up with could have an extra layer of metadata applied to describe the data to some benefit, but at some point you have to make a decision about how something is best represented.
To keep with the example of asking questions, if a sruvey is asked and each question is represented by field in a table for that survey, but they also want to track whether any particular question is answered, there are multiple ways to track this information. It can be done through a separate table (or multiple), it can be done with an extra boolean field per original question field, or it could be done by making each question field nullable, and ensuring that no question answered will ever be left null.
Which of those cases breaks the conceptual integrity of the domain modeling? I would argue that as long as they are applied consistently and accurately, none of them break it.
If your SQL query tells you the most popular name is "nmn" or that that a whole lot of people died before they were born (DOD = 1111-11-11) then yes, I think there has been a loss of integrity.
That only matters if your SQL data is ever meant to be used in isolation. If the only appropriate way to access the data and retain integrity is either through the same set of applications that insert it, or through fucntions written to mediate access within the SQL instance, than how the data is stored on disk is mostly irrelevant.
If all your access is mediated by a layer that takes care of this, then complaining about how it's stored in SQL is no different than complaining about how your DBMS is storing the data on disk. It doesn't matter, you aren't supposed to be doing anything with it there anyway.
Data integrity is not something a database can ensure, for any non-trivial use. ANSI SQL provides tools to help ensure data integrity, through data types, referential integrity checks if you define them and they are supported, but ultimately, those tools can only account for what the database supports. Will they prevent the favorite_color field from receiving the data that should have been entered into the pet_name field? Unlikely?
It's no different in programming languages and bugs (as a data integrity problem is a bug in data domain). Using C, Java, Rust and Haskell will not prevent all your bugs, and Perl, Python, JavaScript and Bash do not mean you will have bugs. A bug free Perl script is possible, and so is a bug riddled Rust script, as not all bugs are (or even can be) caught by the assurances the more strict languages provide.
Unless all your SQL in a schema use is through a directly connected SQL client that loads data directly from native SQL formats, runs queries, and exports data as the output of those queries, the applications that have ingress and egress to that database are what really matters for data integrity, and worrying that a date field might contain 1111-11-11 as a special sentinel value that means something other than what null value in the same field means is being unable to see the forest for the trees.
> No, what you suggest is counter intuitive and should never pass a code review.
You're assertion is that it's counter-intuitive to use the database's ability to encode a lack of information to denote a lack of knowledge about the information, and should only be used to denote the information does not universally exist? And additionally that this is so well accepted that to do otherwise would not be accepted by any peers?
If so, that's a fairly bold assertion to make when you haven't provided any any evidence or reasoning to back it up.
The problem is a bit more subtle. NULL is often used with two distinct meanings:
- I know that Theodore Roosevelt has no middle name.
- I don't know what Theodore Roosevelt's middle name is
The semantics of each case are different in a subtle but important way. Consider the following two predicates
'Jim' = [no value]
vs
'Jim' = [I don't know]
The former predicate is obviously false. 'Jim' is not equivalent to a non-existent value. The latter predicate however, can't be assigned a truth value. It's entirely possible that the unknown value could later turn out to be 'Jim'.
The key issue is that SQL treats NULL values according to the latter ('I don't know') rule and ternary (Kleene) logic, as TFA discusses. That, in turn means that using NULL to represent [no value] will lead to painful and hard to diagnose logic bugs (e.g., WHERE x = 1 OR x != 1 not returning all records).
I do recognize (1) there is a logical difference between "missing because absent" and "missing because unknown", (2) empty string is not an element from the domain of personal names.
In a sense, empty string is available as a marker, just as NULL is available as a marker, to satisfy either scenario. I think the possibility of three-valued logic applies in either scenario.
I'm not at all convinced empty (non-null) string is appropriate for many real world applications, just as empty (non-null) date is appropriate for similar scenarios.
Eg. Date of death: unknown - NULL; not dead yet - NULL
Take the middle name example: "What was U.S President Theodore Roosevelt’s middle name?" when you know he didn't have a middle name.
Are you suggesting that a blank is the correct choice here?
I don't think it's accurate to say they have a blank middle name. I think it's better to say they don't have a middle name.