Most of this advice is stuff they tell you in school that doesn’t matter at all in reality.
Real advice:
- Normalize until it hurts, denormalize until it works. (I.e. prefer normal forms but denormal will have better performance.)
- You need nullable dates and nullable numbers. You never need a nullable string. If you must distinguish blank string from unset, add a column called “unset”. Using null for this will just result in bad data entry.
- Names should be stored as display_name, first_name, and last_name. A display name is not the same as just first plus last. They have different purposes, so don’t shove them together.
- When in doubt, just have a column full of JSON and figure it out later.
- Offset pagination is bad for performance. Avoid it at all costs unless you want some spider indexing page 10,432 of your archives to eat all your CPU.
> Normalize until it hurts, denormalize until it works. (I.e. prefer normal forms but denormal will have better performance.)
No it won't; it depends entirely on the situation. Denormalised data (besides allowing update discrepancies) can expand the amount of data, thus pushing otherwise memory-resident working sets onto disk. Big slowdown. You are wrong. The correct answer is "it depends" - and denormalising really is a last resort that has to be done judiciously and with care.
> You need nullable dates and nullable numbers. You never need a nullable string.
Utter unjustified garbage. Correct answer (from a practical, not super-normalised form) is: "it depends"
> When in doubt, just have a column full of JSON and figure it out later.
OMG, please never let me come near any fool who does this. I was called in to consult on a database problem where they had done the same but with XML. It was really bad, both for performance and for comprehension.
>I was called in to consult on a database problem where they had done the same but with XML. It was really bad, both for performance and for comprehension.
I can see XML being really bad for performance, but given that json column data is handled like native data in modern databases (and you can index/query on fields in the json, which pretty much negates the performance impact), using a json/jsonb column is actually a decent guide.
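To make the indexing point concrete, here is a sketch using SQLite's JSON1 functions as a stand-in for Postgres jsonb (the `users` table, its columns, and the sample rows are my own invention): an expression index on a field inside the JSON means lookups on that field no longer scan and re-parse every document.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, doc TEXT)")  # doc holds JSON
con.execute("""INSERT INTO users (doc) VALUES
    ('{"name": "Ada", "country": "UK"}'),
    ('{"name": "Linus", "country": "FI"}')""")
# Expression index on a JSON field: queries on doc's "name" can use it.
con.execute("CREATE INDEX users_name ON users (json_extract(doc, '$.name'))")
row = con.execute(
    "SELECT id FROM users WHERE json_extract(doc, '$.name') = ?", ("Linus",)
).fetchone()
print(row[0])
```

Postgres goes further with GIN indexes over whole jsonb documents, but the principle is the same.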
You are spot on in picking me up on this, thanks. It needs clarification. xml/json is neither right nor wrong, it depends. Performance, well, it depends. What is unquestionably wrong is the poster's advice "When in doubt, just have a column full of JSON and figure it out later".
Pouring structurable data into an unstructured heap is just laziness and will hurt you later. I've seen the cost of deciding to 'leave it for later' - it's high! I've been involved in cleaning it up. It's horrible.
Years ago I picked up a book written in the 1960s (so pre-IT) for secretaries, and it had a beautiful line in it: "the filing cabinet is a place where you find things, not lose them". Still applies. Always will do.
> "the filing cabinet is a place where you find things, not lose them"
"Those who cannot remember the past are condemned to repeat it." - Santayana
“Explanations exist; they have existed for all time; there is always a well-known solution to every human problem—neat, plausible, and wrong.” - Mencken
Software project management is full of thin and runny rationalizations...I (HHOS) include this form under the heading of "Data Abnormalization"...
I've had a lot of experience. You can learn from it or not. If not, feel free to prove Santayana right again, and at great personal cost :) In fact, I encourage you to!
Is there a reason then not to use a single column with JSON? The columns would then be "virtual" (the column jsondata would spawn a column jsondata.firstname etc.)
This question is out of pure curiosity; I do not deal with databases, so I hopefully understood your comment correctly.
I believe the right answer is somewhere in the middle.
I know a couple of successful products where the DB schema is completely abused. Yet commercially a success for over 8 years.
A good DB specialist also knows the implications of a 'bad' design choice. It's all about the impact at a later moment. Just like software design in general. Finding the balance between complexity and simplicity. Between slow and elegant versus fast and ugly.
Balance is key
I did say clearly it depends. And that's because it does.
> Between slow and elegant versus fast and ugly
No! Unqualified, this is a false dichotomy! Because often you can have elegant and fast, and that's because elegance of design often gives you that speed. Not universally true, but much more often than not.
>>I did say clearly it depends. And that's because it does.
You've arrived a tad hot to the discussion to now start to backtrack on how it depends.
After all, if it depends for your defense, then shouldn't it also depend for the OP's point? Because if you know that it depends but still accuse others of trolling, you are being neither helpful nor honest.
Wasting your design time allotment in rationalizing futile compromise is the flip side of seeing everything as a dichotomy. The reality modeled by human discourse may be only consensual, not actual...
You could, conceivably, have a "display_name" column, an "informal_name" column, a "formal_name" column, etc. depending on your needs. But for absolutely most people and purposes, a single "name" column will suffice, and will steer programmers away from making unfounded assumptions about people’s names.
But then you don't have my informal name. When I receive an informal update email, it would be clunky to have it begin Dear Max Barraclough.
You can't automatically deduce that the way to address me informally is Max, without making assumptions that will break outside the English speaking world. See Points 18, 24, and 30.
The W3C suggest that one way to handle names is to have a Full name field and a What should we call you? field.
You'll regret this decision as soon as you have to integrate with another system that uses separate first and last names. Or as soon as you want to build a sorted list of five thousand names or output a bibliography in a specific format or anything.
Technically the right approach to the problem of storing international names would be to store markup: a string with an associated structure and a discriminant.
You can then have precise automated ways to transform this into whatever derived form you need. Using a single name field is just avoiding the problem. Using multiple fields is a maintenance burden, because in most cases fields are not fully independent and you have to update them all to keep the data coherent.
Using markup does not really change anything, that is just a way of serializing a data structure. You can represent the same information in normalized form.
Of course, I don't mean storing literal XML, but I think it's important to support generic markup in databases. It's not that hard, after all; something like these five tables:
Markup (markupId, string)
Ns (nsId, uri)
QName (qNameId, nsId, localName)
Elt (markupId, eltId, qNameId,
start, length /* string indices */,
level, index /* or other way to represent hierarchy */ )
Attr (eltId, qNameId, value)
would be enough to store and fully index any XML out there.
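For concreteness, here is a sketch of those five tables as SQLite DDL driven from Python. The column types are my additions, and `index` is renamed `idx` because it's an SQL keyword; the `<name>` example stored and reconstructed at the end is mine too.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# The five tables from the comment, with types filled in.
con.executescript("""
    CREATE TABLE Markup (markupId INTEGER PRIMARY KEY, string TEXT);
    CREATE TABLE Ns     (nsId INTEGER PRIMARY KEY, uri TEXT);
    CREATE TABLE QName  (qNameId INTEGER PRIMARY KEY, nsId INTEGER, localName TEXT);
    CREATE TABLE Elt    (markupId INTEGER, eltId INTEGER PRIMARY KEY, qNameId INTEGER,
                         start INTEGER, length INTEGER,  -- string indices
                         level INTEGER, idx INTEGER);    -- hierarchy
    CREATE TABLE Attr   (eltId INTEGER, qNameId INTEGER, value TEXT);
""")
# Store <name>Alan Turing</name>: the raw string, plus one element record over it.
con.execute("INSERT INTO Markup VALUES (1, 'Alan Turing')")
con.execute("INSERT INTO Ns VALUES (1, '')")
con.execute("INSERT INTO QName VALUES (1, 1, 'name')")
con.execute("INSERT INTO Elt VALUES (1, 1, 1, 0, 11, 0, 0)")
# Reconstruct the tagged text from the records.
s, start, length, tag = con.execute("""
    SELECT m.string, e.start, e.length, q.localName
    FROM Elt e JOIN Markup m USING (markupId) JOIN QName q USING (qNameId)
""").fetchone()
print(f"<{tag}>{s[start:start+length]}</{tag}>")
```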
I think markup is an important separate data type. We have records, we have scalars, now we have a hybrid type, markup: a string (or maybe a byte/bit stream) + records linked to indexes in this string/stream and maybe forming a hierarchy.
A database should store the actual data, not some serialized format. You don't store the comma tokens from a CSV file in the databases either - you store data which can be serialized to CSV. The same information could be serialized as CSV, XML or JSON depending on the context.
Storing a serialized format as an opaque string or blob might make sense in particular cases.
A) the proposal is crazy. You probably don’t need to know if someone has a patronym or a kunya or whatever. Base what you store on what you need.
B) Storing this as normal data would be an intense pain in the ass to use, for no gain in performance or correctness. Use JSONB, this is what it’s for.
If you actually need to know such details then it should be stored in normalized form.
Say you have a visa database which must be compatible with multiple naming systems from different countries and where no information can be discarded for security or legal reasons. Then you store the data you need in normalized form. Why would you use weird hacks for essential information?
And if you don't need all that detail - then you just save the name as a string and discard the metadata.
I can't see it would ever make sense to store a name in JSON format inside a relational database. Either the information is important or it isn't.
A) In many cases you don't, but if you store data about authors in a bibliographic database then you probably do. And it's just a part of the package: for example, Donald Knuth went as far as to print people names in their native language in addition to English in his books' indices :)
Matters for what? A database is a tool to solve a problem, not a timeless repository of deep cultural meaning.
If I make a webapp that is only localized in English, display, first, last will let me solve my common problems: I need to say "My account (John Smith)" in the corner (display name), and I want to be able to write emails that say "John, we know you're wondering how company that sold you a pair of pants three years ago is feeling about COVID-19…" (first name) and list articles by author sorted by last name. Those are all common use cases for apps used by English speakers.
Now, if I were making an app to do voting in Myanmar, I would need to deal with there not being last names for many Burmese. If I was trying to track Arab speaking terrorists around the world, I'd want a long list of their aliases and kunya in multiple romanizations (I think the CIA used to prefer "UBL" for Osama bin Laden because they called him "Usama"). If I was making a library app, I might want to have the English romanized name plus the Unicode native language name. Tons of possible problem spaces with different solutions.
I just think tagging the different name parts for "Gabriel José de la Concordia García Márquez" is going to be overkill for most English language apps because when are you going to look someone up by maternal family name versus of just doing a full text search for "Garcia Marquez".
I'm not arguing about storing serialized format; instead I suggest to parse it and end up with something like that:
NAME ( 1, "Fyodor Mikhailovich Dostoyevsky" )
TAG ( FIRSTNAME, 1, 0, 5 )
TAG ( LASTNAME, 1, 20, 30 )
TAG ( PATRONYMIC, 1, 7, 18 )
I'm not saying it's efficient (UPD: speedy), but it's sufficient to get the required details and can be mechanically transformed into a more efficient form. UPD: And it's very flexible and easy to extend. I believe it can handle just about any kind of naming from "Falsehoods about names".
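The NAME/TAG rows above can be read back mechanically; a minimal sketch in Python, assuming the two numbers in each TAG row are inclusive character indices into the string:

```python
# The comment's example, as plain Python data.
name = "Fyodor Mikhailovich Dostoyevsky"
tags = {
    "FIRSTNAME": (0, 5),
    "PATRONYMIC": (7, 18),
    "LASTNAME": (20, 30),
}

def part(tag):
    """Return the substring a tag covers (inclusive end index)."""
    start, end = tags[tag]
    return name[start:end + 1]

print(part("FIRSTNAME"), part("LASTNAME"))  # Fyodor Dostoyevsky
```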
This one lacks ordering. It's not a problem for Russian names as the rules are fixed, but may be a problem for other ones.
And how would it scale? E.g. John Ronald Reuel Tolkien and George Herbert Walker Bush have two middle names, while the full name of García Márquez is Gabriel José de la Concordia García Márquez. Markup handles this uniformly, creating only as few metadata records as necessary, but a field-based model will need fields for all of this that will stay empty most of the time, and likely some meta fields to set additional quirky flags.
Regardless of how complex naming schemes you need to support, you just make it more complex and harder to query by storing tag names and character indexes and whatnot.
Isn’t that what RDF is all about? Each field being a named property of a resource, the properties themselves being properties of other resources describing the structure of the first properties. And so on, going on as meta as necessary
There is no need for an arbitrary "markup" that you then restrict based on your use-case (in your case it's some sort international-aware storage of different data).
Databases are designed to store data, and storing markup like your example above is basically a misuse on some level. Your example above has a clear structure, and that structure can be encoded as raw data in database tables. So instead of having a generic set of tables (as per your other comment) to "store markup" data, you would rather come up with tables that represent your actual requirement and what that markup is meant to indicate.
Some database also have inheritance models which allow you to store the data above quite trivially. You basically have a "base" and then extend it with specific fields for each type (en, ru, es, etc).
So please, don't put markup in your database like that. It's a big no-no, and I'm sure your DBAs would tell you that very quickly. Else, just throw that crap into a document-store and make it someone else's problem as is unfortunately all too-common lately.
What you want is not markup, but tagged unions. It's a real shame SQL and many other languages don't support them.
Also, your example about Spanish names doesn't take into account that nowadays parents can choose the order of family names, and that gay marriage has been legal for 15 years. Which reflects a deeper truth: don't make too strong assumptions about reality, for it may change.
Tagged unions, if I understand what they are, can be done in SQL:
create table Foo(
    fooId integer primary key autoincrement,
    type integer /* tag/discriminant */ );
create table FooA(
    fooId integer primary key references Foo(fooId)
    /* data specific to FooA */ );
create table FooB(
    fooId integer primary key references Foo(fooId)
    /* data specific to FooB */ );
They're not quite markup though. Markup has a string/stream with associated records. The number of these records, their kind, and their placement varies, so it can handle ordering issues or multiple similar names or missing name (e.g. missing middle name), or ignore parts of name quite naturally.
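The tagged-union sketch above can be exercised end to end (SQLite via Python; the payload columns `a`/`b` and the tag values are my own, not from the comment):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Foo  (fooId INTEGER PRIMARY KEY AUTOINCREMENT,
                       type  INTEGER);              -- tag/discriminant
    CREATE TABLE FooA (fooId INTEGER PRIMARY KEY REFERENCES Foo(fooId),
                       a     TEXT);                 -- data specific to FooA
    CREATE TABLE FooB (fooId INTEGER PRIMARY KEY REFERENCES Foo(fooId),
                       b     INTEGER);              -- data specific to FooB
""")
con.execute("INSERT INTO Foo (type) VALUES (0)")    # 0 = the FooA variant
con.execute("INSERT INTO FooA VALUES (1, 'hello')")
# "Pattern matching" by hand: dispatch on the tag, then read the variant table.
tag, = con.execute("SELECT type FROM Foo WHERE fooId = 1").fetchone()
if tag == 0:
    value, = con.execute("SELECT a FROM FooA WHERE fooId = 1").fetchone()
print(value)
```

This illustrates the reply's point too: nothing stops a Foo row from having both (or neither) variant rows, because SQL gives you no exhaustiveness checking.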
That could be an implementation of tagged unions. Without language level support (pattern matching, exhaustiveness checking, etc.) you gain very little from this.
It's not a simulation, it's a data model for reducing the pain for what's really a hard problem - names we call people are not unique, they are colloquial identifiers (HHOS).
Hum... I'm not from the US and do not live there if that's what you are asking.
People usually have enough trouble separating the first name from whatever you call the rest of it and deciding on what part goes where, and you expect them to be able to tell what the cultural roots of the name are?
Keep in mind that not all names on databases are filled by their owners. Besides, I really doubt I would be able to enter my name into whatever you are designing without some amount of lying.
I spent 13 years maintaining a large-ish system that separated first- and last names for customers.
Separating makes searching/indexing on last name easier.
The alternative would be to put last names first by convention, which renders them unusable for anything else; or go insane trying to find a universal way of splitting and recombining for display.
Keeping them separate inevitably results in a convention to put company names in the last name field for corporations, and failed searches when the convention is broken.
It's one of those questions that I suspect will keep bugging me until I die.
Yeah, but if you only have one last name field, you can only index on people's first last name, not their second last name? How are you going to solve that?
> Separating makes searching/indexing on last name easier.
The point is that such a search or index is a mistake. It's simply not meaningful based on the real properties of names, it just happens to work 99% of the time if you're only dealing with a mostly homogenous culture.
> Keeping them separate inevitably results in a convention to put company names in the last name field for corporations
In most (IMO sane) designs, a company name would usually be a separate field (or better yet: companies would be represented in an entirely different table).
But in many cases the customer is either an individual or a corporation. Keeping them in separate tables turns simple tasks such as generating a list of customers into joins. Which brings us to normalization: too much of that kills performance.
I feel like if a join is killing performance, either it's a really weird join (e.g. joining on an OR) or you ain't indexing.
Regardless, you could always just assume everyone's a "corporation" (and for individual customers, that "corporation" would just be a reflection of the individual customer's data). That, or only deal with contacts within a "corporation" (and an individual customer would just be a contact with a null "corporation"). I've seen both approaches in practice, and while they have their tradeoffs, they work reasonably well.
"Corporation" is a pretty wonky term for this, anyway (not all non-individual customers are corporations); "organization" would be better. But I digress.
Also, for international users browsing in English, labeling the fields "given name" and "family name" is way clearer than using first and last names.
Exactly what I was thinking - plus a field to state the order they should be displayed in seeing as Asian cultures usually state the Family Name first.
What do you mean "the legal surname"? I'm Spanish and my legal name is my full name with both family names. You can't just ignore one. Also, Spanish people have two family names, but they can be composite, so they may look as three or four family names.
However, most people are usually known by one family name, but which one is used depends on which one is more easily recognizable.
> This list is by no means exhaustive. If you need examples of real names which disprove any of the above commonly held misconceptions, I will happily introduce you to several.
Prince, famously, didn't have what we'd recognise as a name for years. All the "Unknown Soldier"s don't have names. "John/Jane Doe" doesn't technically have a name.
These examples seem somewhat contrived, but of course it comes down to the use case of the database. I'm sure Prince remained registered under his legal name on contracts, checks, etc., even if he used the symbol in marketing. But a database of bodies in a morgue obviously needs to be able to support unidentified bodies.
Interesting, I think you're right. According to Wikipedia, he only changed his stage name, not his legal name. The BBC didn't seem to make the distinction.
If I had a dime for address forms that have "state" as a mandatory field, while having a country dropdown that contains countries that do not have states, I'd be rich, RICH I TELL YOU!
> Names should be stored as display_name, first_name, and last_name. A display name is not the same as just first plus last. They have different purposes, so don’t shove them together.
It all comes down to the purpose of the database. There is no one-size-fits-all solution. For a hospital which needs to register newborns, you can't even count on everyone having a name. For a library database, you need both the first name and last name of authors, since books are sorted by the author's last name - but having a "display_name" would be redundant and violate normalization.
If you can get away with having a single "name" field, this will save you lots of trouble. Only add additional fields if you need them for some use case.
> “ Offset pagination is bad for performance. Avoid it at all costs unless you want some spider indexing page 10,432 of your archives to eat all your CPU.”
I love this! It’s so true: only bots dive that deep, and it’s usually a total waste. But what do you recommend instead? Should you just deny any calls past page 10? Or do away with pagination altogether?
If there's somehow no need to show anything past page 10, then by all means, just do that. But most applications can't get away with that.
There are plenty of resources for pagination online.
For example, your [Next Page] button could use a `where date > {date of last item on prev page}` since you have an index on date. e.g. <a href="/things?after={prevDate}">Next</a>.
A forum topic could have a consecutive and contiguous "index" on each of its posts that starts at 0 and increments. Now going to page 1,000 with ten posts per page is simply a matter of grabbing posts with index 9990 to 9999, which is instant.
Don’t use offset for paging. Presumably you’re sorting the results by something, add a condition on that value in the where clause based on the last row in the previous page.
Super fun until you have a column with more duplicates than the page size. Now you'll have to add a second column to your order clause, your serialized cursor, and your where clause. Oh, and don't forget to check whether your database driver uses the same date precision as your database does, because you might be in for another surprise.
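Keyset pagination with that tiebreaker can be sketched like this (SQLite via Python; the `things` table, its columns, and the sample data are my own):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE things (id INTEGER PRIMARY KEY, created TEXT)")
# Six rows sharing one date: the duplicate case that breaks a single-column cursor.
con.executemany("INSERT INTO things VALUES (?, ?)",
                [(i, "2020-01-01") for i in range(1, 7)])
con.execute("CREATE INDEX things_created ON things (created, id)")

PAGE_SIZE = 2

def next_page(last_created, last_id):
    # Seek past the last row seen, using id as the tiebreaker for duplicate
    # dates. No OFFSET, so the (created, id) index can satisfy the query.
    return con.execute(
        """SELECT id, created FROM things
           WHERE created > ? OR (created = ? AND id > ?)
           ORDER BY created, id LIMIT ?""",
        (last_created, last_created, last_id, PAGE_SIZE)).fetchall()

page1 = next_page("", 0)  # "" sorts before any date string: the first page
last_id, last_created = page1[-1]
page2 = next_page(last_created, last_id)
print([r[0] for r in page1], [r[0] for r in page2])
```

The `(last_created, last_id)` pair is exactly the "serialized cursor" the comment describes.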
No it won't! In general it will have worse performance. Deliberate denormalization can be used to optimize for particular queries, but it will make other operations more expensive and/or error-prone. So it has to be done very carefully and with a full understanding of cost vs benefit.
And many database systems support materialized views, which give the same optimization without having to compromise the schema.
Have you ever used a complex database in a formal, by-the-book, third normal form?
The number of tables explodes, the number of joins become a pain and every single data change requires careful planning.
As for the "many database systems support materialized views", I only know of a few that truly do: Oracle and DB2.
PostgreSQL supports them, but you need to refresh them manually. SQL Server kind of supports them (indexed views), but they are limited.
In other RDBMSs you can work around that using triggers (or similar concepts), but that IMHO is definitely a bad practice, and the times I've used it (usually because there was no other option) have only brought me grief.
Denormalization means you (by definition) have duplicate and redundant data in the tables. This certainly requires careful planning around updates to avoid data corruption.
> The number of tables explodes
This is IMHO a backwards way to understand normalization. Normalization tells you how many "logical relations" you have in the data - whether you like it or not. If you decide to squash multiple logical relations into the same physical table then you get redundant data (by definition). This might be a fine optimization, but it doesn't reduce the inherent complexity of the data model. It just saves some joins at the cost of more complex update logic.
I think there's a decent amount of hyperbole, but even taken at face value: Every single data change should require careful planning. Especially if it's a schema change, and not just a query.
You can do a lot with generated code, refactoring, etc. Data is much less agile, so I'm a lot more conservative when it comes to that end. Never mind that with decent naming, table comments etc. it's a good documentation in itself.
> No it wont! In general it will have worse performance.
Since the advice is to only denormalize to improve performance, this would be a contradiction. Yes it makes everything more error-prone, but if you're profiling to check performance, then you only keep the denormalizations that improve the performance you care about.
Yep, exactly, materialised views are the ideal way to create query-specific schemas - but there's always a cost, namely, refreshing them needs to be done thoughtfully.
> You never need a nullable string. If you must distinguish blank string from unset, add a column called “unset”. Using null for this will just result in bad data entry.
What's an example of a data entry error that results from this scenario? Wouldn't it entirely depend on the data entry interface?
Someone is filling out the form. They skip MI. Do I mark it as null or blank? The only way to trust this is to have a big check mark that says “does not have a middle initial” and stores a boolean. But realistically, who cares? Unless you’re dealing with legal names, you don’t need to distinguish the cases.
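That "does not have a middle initial" boolean can at least be kept honest by a CHECK constraint, so the flag and the (never-null) string can't contradict each other. A sketch in SQLite via Python; the table and column names are my own:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# The string is NOT NULL: blank plus the flag encodes "unset" explicitly.
con.execute("""CREATE TABLE person (
    id INTEGER PRIMARY KEY,
    middle_initial TEXT NOT NULL DEFAULT '',
    has_middle_initial INTEGER NOT NULL DEFAULT 0,
    CHECK (NOT (has_middle_initial = 1 AND middle_initial = '')))""")
con.execute("INSERT INTO person (middle_initial, has_middle_initial) VALUES ('Q', 1)")
con.execute("INSERT INTO person DEFAULT VALUES")  # skipped the field: blank and unset
try:
    # Claiming to have a middle initial while leaving it blank is rejected.
    con.execute("INSERT INTO person (middle_initial, has_middle_initial) VALUES ('', 1)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```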
> Never store firstname and lastname in the same field
I disagree. Why would you want to force the "firstname lastname" pattern on people? What if someone has three names? Or only one? Just use a "name" field!
Scandinavian here, our names don't really fit this pattern very well either.
My last name has an "ö" in it. That doesn't work in 99% of so-called global websites, so I have to transcribe it as "oe" instead, which is just wrong.
My husband has three first names and one last name. However, his actual name is his second first name. If his full name is "A B C D", if you ask him what his name is, he'd say "B D". If you know him you can call him "B", if you don't, you can call him "Mr D" or "Mr B D". Most American companies call him "Mr A" or "Mr A D", which is just wrong.
I have a friend with two first names and two last names, but his second first name is his name, and he prefers his first last name. If his full name is "A B C D", he'd say his name is "B C" if you ask him. Calling him "Mr A" is just wrong, and calling him "Mr A D" is super duper wrong.
(In government systems in our home country, this is of course handled. You can "mark" which first name(s) are preferred, so mail to my husband from either government or private companies will always be correctly addressed to "B D" and nothing else.)
My examples aren't about nicknames, they're about the difference between someone's full name, and their name.
And even in American name culture it's not obvious how to transform someone's legal name to their name. How would you encode that the name of "William Henry Gates III" is actually "Bill Gates" ?
That was my point: trying to account for every preference a user might have or every particular scenario is too complex and takes too much time and human resources. The right thing is to address the majority of your user base.
To take your example, why should my app encode that "William Henry Gates III" is actually "Bill Gates"? I will provide a form with first name and last name and Mr Gates can register as "Bill" "Gates" or "William Henry" "Gates".
I have three given names and I find it a non-issue completing forms. And that's because a string can contain white spaces or even be empty, so you can encode any arbitrary combination of first name + last name.
Also, people get too much into politics, trying to be "revolutionary" instead of trying to solve the technical issue at hand.
I can solve a technical problem in 2 minutes or get into political philosophy and argue for 2 years without solving anything and without helping anyone. I'd rather solve the problem.
Absolutely! Having 'name_given', 'name_middle', 'name_family' or some nonsense like this is one of the most common mistakes in database (and application) design, usually made by culturally narrow-minded Western developers.
Take names in Myanmar [1] as an example. From Wikipedia:
> Burmese names lack the serial structure of most modern names. The Bamars have no customary patronymic or matronymic system and thus there is no surname at all.
That said, it is hard to push back on a business requirement to sort users A-Z based on family name, just because.
It’s not really a mistake to localize your application design and UX to the first major market that you are targeting.
To immediately target the entire world will leave you burning cash as you figure out all 200+ countries' customs, or with a hopelessly generic app that is hard to use for everyone, pleases no one, and is hard to extend in culturally relevant ways.
This kind of mentality led to the huge fuckup that was ASCII code pages, and we are still suffering fallout from that.
I still cannot input one of the letters in my name into most systems, for example.
There is a decent middle ground: you can store first name (or equivalent) and last name (or equivalent) for sorting purposes... and a display name which is culture-sensitive.
Non-American here. I believe US citizens often have a middle name. We Spanish-speaking people have 2 last names. Japanese reverse the "firstname" and "lastname". I had a lot of trouble when I tried to buy a ticket on a Chinese website a few years back. So no, 99% of the users don't follow the "firstname" and "lastname" pattern. We might put up with it, but it doesn't suit 99% of the users as stated.
To clarify the comment you are replying to (not the author): 99% of their users. Clearly 99% of people globally aren't going to follow that custom, but many (most?) applications written in the US have a user base where 99% of their users will follow the custom. Switching to something more globally applicable isn't free either - using a single name field has downsides like making sorting by last name more difficult (a ubiquitous thing in the US) and not being able to display first vs. last name where culturally expected.
But I am replying exactly to that. I don't believe that 99% of the users of any application with a non-trivial number of users have exactly a "firstname" + "lastname".
In the US it's not rare to have a middle name. And there are many Latinos, Japanese and Chinese in the US as well, none of whom follow the convention.
A lot of US-based startups target the global market early on, although they behave as if they only had to cater to US users.
As a side note, HN commenters seem to fall into two categories: the first group seems to think HN is a mostly-US forum, whereas the second group seems to think HN is an international forum. I don't know what the truth is, or whether there is one at all. But I know that it sometimes irritates me when commenters or posters assume that they are talking only to US citizens.
I suspect it's usually accidental but I feel very little rancor towards them. You solve things the way you think about them (until yesterday it wouldn't have ever struck me that a currency symbol could be in the middle of the value). And that's how, I suppose, you end up with a bunch of people with US Visas that say FNU (first name unknown) in front because they only have one name.
A Burmese here, and thanks for pointing that out! I have to explain to other people that we don't follow the father's name most of the time. Some names include the father's name at the end, e.g. if my father's name is Myint Swe, my name could be Ko Ko Myint Swe.
The same also applies for the structured address format. I've seen some stupid decisions in the name of this.
Post codes: 5 years ago, my country brought in post codes. Prior to this there were none. It was not for the benefit of our postal system, who stated they didn't really care, but for badly designed web forms (and the occasional rural house with an unhelpful address).
They also may not fit your convention of what addresses in your country are. The obvious examples are stuff like Japan and one or two central European cities that are addressed "Building number, Block Number, City", but the aforementioned unhelpful rural addresses include a former friend of mine who had to give his address as "House name, near village, County". Actually the village was not near but it got the mail to the right post office which knew what to do with it.
As a counterpoint, I've had places that have very strict fields like "Street Name" (20 char limit), City, County, Country. For a few years I lived on a street whose name appeared three times in the same city. The actual form to get it delivered was "Street name, Locality, City", but because of the length limits I had to sneak the locality into the City field, which luckily worked as the City and County shared a name.
Hopping on the topic of addresses and not completely related to your point - but if anyone's having challenges with handling address fields I'd highly recommend checking out Shopify's address package.
>> For a few years I lived on a street who's[sic] name appeared three times in the same city
> This is why you have postcodes ;p
I guess, but it was foreign vendors which expected "Street Name, Postcode, County". Locals and the postal service were working fine with "Street Name, Locality, County"
Very cool to see mention of the "one name only" case. My wife had only one name before we married. She's Indonesian. When we got married in Australia, our forms pretty much crashed the Government's Births, Deaths and Marriages system. We got bounced around between all different people trying to compute the reality that in some parts of the world, people have only one name.
I once worked on a project, which used the Dow Jones list of people of interest, a list of politicians and terrorists, mostly. One day the software began to fail catastrophically, when parsing the latest version of that list. That day, a Mr. Null was added to that list ;-)
As in no name at all? Now that's an entirely new level. Pretty forward-thinking parents, considering today's surveillance... Maybe they were phreakers who tried to name him ' OR '1'='1, but the Government didn't accept special characters, so they resorted to naming him Null.
because governments prefer not respecting people's names and customs, just so they can push their own customs (and in that case their own names too, because "FNU" is now the 30,000th most popular "name" in the US)
A family friend had this problem when he immigrated to the UK from India. He just repeated his single name as it was both his forename and his surname.
Yeah, a "humble guide..." that uses never/always in three of its points. It seems like the author thinks he's seen everything, but he's in for an unpleasant surprise in the future :)
TFA specifies neither of these requirements. You could have someone's first name (which I would recommend calling given name) be "Pablo Emilio" and their surname be "Escobar Gaviria". Or alternatively you could have someone with a given name of "Sukarno" and an empty string for the family name. Patronyms and matronyms are sort of ambiguous, but would probably be categorized as surnames.
I use a single name field, but I don't think a given name and surname split is crazy.
My favorite way to do things is to just ask for a "full name"; and then, separately, a "calling name" (i.e. "what do we call you?")
The "full name" is for putting in the c/o part of a mailing address; while the "calling name" is for showing beside your profile picture or for saying "Dear [name]" in a mail-merge.
Neither of these is a nickname. They're both your "real" name. But neither of them are your legal name, either. You're free to make either of them whatever you want, and to make them both entirely unrelated to one-another. They're just two forms of "your name" that you might expect to see in one place or another. And we can ask for them, so that we can then spit them back out at you in the situations that call for them.
Doing things this way skips the entire minefield of "given names", "family names", titles/appellations, etc. You've got use-cases for someone's name? Ask the user directly, for each use-case, what text they'd like to appear. There's not too many use-cases, really; you only need to ask a couple such questions, in the end.
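A minimal sketch of that two-field approach, using SQLite for illustration (the table and sample names are my own invention, not from the parent comment):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE users (
    id           INTEGER PRIMARY KEY,
    full_name    TEXT NOT NULL,   -- for the c/o line of a mailing address
    calling_name TEXT NOT NULL    -- for "Dear <name>" and the profile page
)
""")
conn.execute(
    "INSERT INTO users (full_name, calling_name) "
    "VALUES ('Dr. Maria del Carmen Garcia', 'Carmen')"
)

# Each use-case reads back exactly the text the user asked for.
full, calling = conn.execute(
    "SELECT full_name, calling_name FROM users"
).fetchone()
print(f"Dear {calling},")  # Dear Carmen,
```

No parsing, no splitting, no guessing which token is the "family" name; the user told you both strings directly.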
This is the correct answer. The question to ask is, what are you doing with the names? Splitting first and last only makes sense if the thing you’re doing is listing people by last name. Sometimes you do that! But not always and certainly not in all countries.
No, I think GP is assuming that names are very complex, and even the assumption that people have a first and last name isn't always correct. You can just use a name field, and let people supply the name they want to use.
What do you really need a first and last name separated for anyways? Sorting? A salutation on correspondence? Sorting by last name isn't usually important[1], and you can generally use a full name in a salutation.
Depending on the domain needs, you choose the right representation. If you expect US citizens and want their name as entered on their tax returns or social security card, then you match what those can hold and note to the user that you are expecting those values. If you're just collecting a name to have a name, why put an extra constraint in there that doesn't matter?
1: Not to mention you then get to mostly ignore the craziness that is utf-8 and combining characters and cases and sorting.
> What do you really need a first and last name separated for anyways? Sorting?... Sorting by last name isn't usually important...
When paying customers / clients want to sort by last name, because “that’s how we do things”, then you sort by last name. So it’s only important when you want to make money.
Edit to add: your customers also don’t care that someone somewhere has 6 first names and no last name. “This wasn’t a problem with our old software.”
If customers need it, then it's a requirement. Nothing I said indicates you should leave out customer requirements. In fact, I specifically alluded to including separate names in cases where it's needed.
Sometimes the customer is you or your org, and in those cases it may not be a requirement.
Exactly as you say: if customers (that is, the people paying for the software, not necessarily the people filling in forms) need it, then it's a requirement.
The point is that while sorting by last name (say) might not seem important to an engineer, and requiring a last name might seem outright stupid to a person filling in a form, nevertheless it is often an important requirement. Frequently folks will observe some "stupid" form, and link to some "falsehoods engineers believe about X" document, suggesting that engineering "got it wrong", when in fact they have simply misunderstood who the software is actually for.
Wait till you get into Dutch surnames like "van der Kleij", "van den Bosch", or "de Jager", which are not patronyms. They would still fit the "first name, last name" pattern, except that what you assume is the last name can contain a tussenvoegsel: https://en.wikipedia.org/wiki/Tussenvoegsel
Though a lot of pages in NL have separate field for it.
Or German ("von Trips") or Italian ("De Sica") and possibly other countries.
Anyway, people who move to other countries have to adapt. For the database of a local business it is often a good idea to stick with the format of official documents, because everybody has to fill out forms on government websites regardless of where they were born: IDs, taxes, healthcare, schools, etc.
Yes, Arabs use "Ibn" and Jews use "Ben". Slavs use the father's name with an "-ich" suffix. People in Iceland use a different kind of patronymic.
You can either use just one name field if your logic doesn't need a name or surname or provide two fields for name and surname and let the user handle the splitting.
I have three first names. When I fill a form, if it's an official form or if it's for business purposes I fill all my names as first name, if not, I only write the name I use.
Much of the official paperwork in my life has "[First Initial] [Middle name]" in the first name field, which works (usually). But why should I have to do that? It's a pain in the ass.
The usual pattern in this case is to put the first name and all the rest of the names in the first_name field, and the last name in the last_name field. Users can be made aware of this in the UI.
Another option is to create a middle_names or other_names field; it's first_name, then other_names, then last_name, in that order.
One way to look at it, if you have more than one name, you'll usually have a preferred order of writing them, regardless of culture. The first-name used will have different protocols and meanings in different cultures, but it is still the first-name, literally.
It gives an example of something wrong. Not once does it give an example of something right. It says "change this" but doesn't show what to change it to.
> Then the “areas” column has values that aren’t atomic. ... So this table is not in first normal form. ... To be in first normal form you should store a single value per field
Second normal form:
> In this design employee name can be directly inferred from employee_id because the idea is that the name of an employee is uniquely defined by it’s id. ... So we have two examples of columns where the columns can be inferred from only part of the primary key.
Don't include those fields. It's not explicitly said, but I think the author assumes that if you're told don't include data that can be inferred otherwise, these fields can be inferred otherwise, so they cause a problem, the solution is to remove those fields. You aren't losing data, it's already been covered that it can be inferred.
Third normal form:
> For a table to be in third normal form then it must be in second normal form and there should not be a non prime attribute* that depends transitively from the primary key. ... In this table the department_number can be inferred from the employee_id, but the department_name can be inferred from the department_number!
Same as second normal form. These fields cause problems, you already have the data elsewhere, get rid of them.
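For what it's worth, the decomposition the blog is hinting at can be sketched concretely. This is a hypothetical pair of tables modelled on the article's employee/department example (SQLite just for demonstration): the department name moves into its own table, keyed by department_number, and a JOIN recovers it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Third-normal-form fix: department_name depends on department_number,
# not on employee_id, so it lives in its own table.
cur.executescript("""
CREATE TABLE departments (
    department_number INTEGER PRIMARY KEY,
    department_name   TEXT NOT NULL
);
CREATE TABLE employees (
    employee_id       INTEGER PRIMARY KEY,
    employee_name     TEXT NOT NULL,
    department_number INTEGER NOT NULL
        REFERENCES departments(department_number)
);
""")

cur.execute("INSERT INTO departments VALUES (10, 'Engineering')")
cur.execute("INSERT INTO employees VALUES (1, 'Ada', 10)")
cur.execute("INSERT INTO employees VALUES (2, 'Grace', 10)")

# The department name is stored exactly once; nothing was lost.
row = cur.execute("""
    SELECT e.employee_name, d.department_name
    FROM employees e JOIN departments d USING (department_number)
    WHERE e.employee_id = 2
""").fetchone()
print(row)  # ('Grace', 'Engineering')
```

So "get rid of the field" really means "move it to the table whose key it actually depends on".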
Perhaps you expected an article that said which commands to run to alter the tables. It's pretty obvious that wasn't the goal of this article. That said, it had everything you asked for, as long as you read it all and paid attention.
> To be in first normal form you should store a single value per field
Where? How? How instead should you store groups of things? Blog doesn't say.
> Don't include those fields. You aren't losing data, it's already been covered that it can be inferred.
If you just remove fields, then you absolutely are losing data. Your software can't magically infer it unless you put the values and the relationship somewhere else. So where should the data go instead? What should that look like? Blog doesn't say. Blog doesn't even actually say that you should remove those fields. That's just one possible course of action and you've now put words in blog's mouth.
> These fields cause problems, you already have the data elsewhere, get rid of them.
Again, just deleting fields is 100% losing data unless you put those values somewhere else. How should you handle that? Where should they go? Blog doesn't say.
> Where? How? How instead should you store groups of things? Blog doesn't say.
By normalizing. This is obviously a guide of what to do and not to do, not a 101-level course on database design and normalization. Your first step should be to google database normal forms, specifically second normal form and third normal form. I know from experience that Wikipedia has articles covering these; I've read portions of them.
> If you just remove fields, then you absolutely are losing data. _YOU_ might be able to infer it, but your software sure as hell can't unless you put the data somewhere else. So where should the data go instead? What should that look like? Blog doesn't say. Blog doesn't even actually say that you should remove those fields. That's just one possible course of action.
> Again, just deleting fields is 100% losing data unless you put those values somewhere else. Where should they go? Blog doesn't say
The point is that if you're working towards normalizing that data, then you have the data in a different location.
The blog entry also doesn't explain what SQL is, or what a primary key or foreign key is. When you encounter terms that are not defined, you should make sure you understand them. If you don't, then the information is likely not aimed at you. If you don't understand normalization, you aren't going to understand what's being recommended. This obviously isn't a guide on how to normalize perfectly, it's a recommendation to do certain things, some of them to do with normalization, and a justification for why.
Consider it the equivalent to telling someone that they should eat a balanced diet for their health and notes some problems too much of certain types of food can cause. It's still valid advice even if it doesn't come with an explanation of exactly what to eat, and references medical phenomena without explaining what they are. That doesn't make it a bad article, it just makes it not terribly useful for people that it isn't aimed at.
The poster said "I find that...", which is equivalent to "In my experience...". How is someone posting their experience the same as making a "sweeping over-generalization"? They never flat-out said "ALL textbooks".
Also, surely if it were not accurate you could easily drop the name of a good textbook without any trouble, since you apparently have enough experience with them to make the distinction.
If you're dealing with international addresses, point 3 becomes very challenging. The 'tokens' of an address are called different things everywhere, take different forms, and sometimes don't make much sense to compare. Figuring out a good balance of usability and generality can be really tricky.
I think programmers over complicate things too much when it comes to addresses. When you're dealing with a global system that is able to deliver a package to "the 3rd house on the right, past the pond, with the red fence," or whatever, in rural anywhere, you probably should just treat it like we all do in the real world, as just a string to be interpreted in context.
This fails as soon as you talk to an API that requires normalization, need to aggregate your data, or, really, do anything other than try to deliver it.
Lmao, you're really going to hate me when I tell you I don't think we should be using addresses for anything other than delivering physical things. I know it's a cop out answer. But I don't have a better one.
> Lmao, you're really going to hate me when I tell you I don't think we should be using addresses for anything other than delivering physical things
This is not always an option. For example, in some jurisdictions, you need to calculate, charge, and remit sales taxes based on client location, even if you are selling digital goods.
However, these locational jurisdiction details will be unique (i.e. a state, a country, a city, a trading block, a tax code, etc.).
These can be different from the address details, which are nothing more than a postal address to which the goods/letters need to be posted.
In other words, the former details should be explicitly provided in a fault-tolerant manner (i.e. drop-down lists) and not parsed from the details found in the address.
USA mailing addresses are strictly that -- an identifier for the convenience of the Post Office which may or may not correspond to the underlying jurisdictions for things like political boundaries, tax authorities, etc.
I learned this when setting up voter databases. A house that is in a given city/precinct/etc (the atomic identity is parcel number) may have a mailing address in a completely different city. There are many such cases for Los Gatos vs Saratoga, for example. Also, the USPS considers every address to be in some "city" even on unincorporated parcels.
Back when phone area codes and prefixes had precise geographical meanings, many edge cases ended up on the "wrong" area code (e.g. Sunnyvale numbers being in 415/650 instead of 408 between the 85 freeway and the Sunnyvale border).
Local taxes are rarely based on the street address and typically coincide with a larger municipal division.
The problem in that case is trying to use the “address” for two purposes which may not be aligned, getting goods to the buyer and trying to find their tax domicile.
> local taxes are rarely based on the street address
No, they're based on the underlying geographical boundaries based on the parcel number. This usually corresponds to the city in the mailing address (in contiguously Incorporated lots) but not always.
On the e-commerce platform I work on we need to send customer data including addresses to multiple vendor API's. You can't just design your db schema in isolation usually.
Some counter-examples include Route Planning or determining where to locate a Consolidating Freight Service. If your addresses are not normalized, you lose the ability to see two addresses listed as "from city center, three lefts" and "from city center, one right" take you to neighboring locations.
These are not counter examples, they are examples of the types of problems that I think turn out very poorly when they rely on normalized addresses. What I'm saying is we shouldn't be using addresses for things like this. I realize I'm talking about, basically, upending several entire industries or something, so I know it's unreasonable, pragmatically.
I just think that the hard problem of 100% accurate address normalization suffers from an extremely fat tail[0] of edge case issues, and becomes economically unviable to solve, very quickly.
95% of the problem is having an up to date record of zip/postcodes. As long as you treat the postcode as a separate field, the rest of the address can be handled by the courier. Most couriers will go off the postcode to determine whether or not they can deliver a package and cost the delivery - even if the rest of the address is garbage.
IMO, a database schema for addresses should consist of: (Country, Postal code, Address). I would split Address into 4 strings and let the user fill out whatever they want in them, in whatever order. I'd also suggest that it isn't even necessary to validate this information on your input forms - the validation is best performed by contacting the courier with the information input by the user when requesting a delivery cost. (And if this validation fails, email the user back.) One suggestion would be to forbid commas in any address fields, because CSV formats are accepted by some couriers with varying degrees of support for quoted fields, and often end up requiring manual intervention.
Unless you are the courier, you should probably not waste your resources on attempting to normalize any other parts of the address.
This is from experience in a business which ships hundreds of international packages a day, and tens of thousands domestically (UK).
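That schema can be sketched roughly like this (column names are illustrative and SQLite is used just for demonstration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE shipping_addresses (
    id           INTEGER PRIMARY KEY,
    country_code TEXT NOT NULL,   -- e.g. ISO 3166-1 alpha-2
    postal_code  TEXT,            -- nullable: not every country has one
    line1        TEXT NOT NULL,   -- free-form: the user decides what goes where
    line2        TEXT,
    line3        TEXT,
    line4        TEXT
)
""")
conn.execute(
    "INSERT INTO shipping_addresses (country_code, postal_code, line1, line2) "
    "VALUES ('GB', 'SW1A 2AA', '10 Downing Street', 'London')"
)

# Couriers mostly key off (country, postcode); the free-form lines
# are passed through untouched.
row = conn.execute(
    "SELECT country_code, postal_code FROM shipping_addresses"
).fetchone()
print(row)  # ('GB', 'SW1A 2AA')
```

The point is that only the two fields the courier actually prices on are structured; everything else stays as the user wrote it.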
People usually solve that problem by learning their address according to Google Maps (the most widely used service). For most of us it's the same address we always use. For somebody it's what they write when they want people or stuff to reach their home.
Why not just store GPS coordinates and do geospatial queries? which gives you nice features like "show me stuff within X kilometres of a shop" and "is cutomer X within bounding box Y?".
That might help to some extent depending on the use case, but the tectonic plates are moving all the time, and shifting a lot in some places where you have geological events like earthquakes. A set of GPS coordinates may no longer point to your current (moved) location accurately in that case.
this is going to be quite random but I recently saw a comment you made about deadlines (https://news.ycombinator.com/item?id=19361199) and am interested in some of your thoughts on what an ideal software development methodology looks like, if you care to talk about it please throw me an email smtheard@gmail.com
I’ve been through this before: storing a couple of addresses for a few hundred different countries in an otherwise very normalized RDBMS. We couldn’t figure out any way to do it other than EAV tables and a defined attribute pattern for each country.
Even though that project is long in the past for me, I’m really curious how other people do this. Anyone come up with something better?
At a past job, we basically did "both" - we'd store our best guess at a normalized address, and we'd store the text representation. For stuff like shipping labels, we'd use the text. For analytics, we'd lean on our best-guess normalization (and understand that there's some potentially significant error).
You can also lean on various APIs to normalize, but nothing that I'm aware of does this well on a global scale.
At that point you might as well just make an "Addresses" or "Address" table with a raw text field for the address. At the end of the day this thing is something that a human will read and map to the real world location.
> At the end of the day this thing is something that a human will read and map to the real world location.
That is quite an assumption.
I personally would use a field with a structured format e.g. XML (most RDBMS systems can deal with XML easily) and different types of address field e.g. postcode, zip code etc.
Then you could use a strategy for each country when dealing with it in your application.
We went the EAV route instead of just a free form text field because we still wanted to expose discrete fields to in the UI and have some front-end validation. But sure, from a data integrity standpoint it’s no better than a text field.
> Even though that project is long in the past for me, I’m really curious how other people do this. Anyone come up with something better?
I've always gone the other way. Bland fields (Lines 1-4, city, state, postcode, country) or - when I didn't need to validate it, a textarea for free form entry.
Most of the time validation has been required by the shipping provider, so freeform hasn't been possible. As the shipping providers haven't got anything 'better', we stick to their format and field lengths (which are terrible for out-of-country addresses)
If you are mailing something, using a plain formatted address string is usually fine.
Here's a harder problem from a previous job:
Given two addresses, tell me if they both point to the same "place".
(As you might imagine, the goal of this was joining various datasets.)
There really is no good way to do this, but to come close, you might consider a few dimensions of the problem:
1. Formatted address - a plain string. What you would write on an envelope.
2. Tokens (what I call "address components") - the bits of information that make up an address. Something like, {"street_number": "10", "street": "downing st", "locality": "london", ...}. You can get these back from most location-focused APIs.
3. Locations. Coordinates of a point or area in space.
4. Time. "Places" change over time. For example, a new business can open in place of another. Buildings are built and demolished. If you are working with data over a long time period, this is more of a concern.
Unfortunately, there are many-to-many relationships between all of these. For any address, there are many possible locations. For any location, there are many possible addresses. For any set of address components, there are many possible formatted addresses. For any address, there may have been multiple different "places" found there over time. Et cetera.
Consider an office building, for an example. That building probably has at least one address corresponding to its main entrance, but it might have other entrances. An office within the building can sometimes use the building's main address as its own address, or the address of its own entrance, or an address with a floor number, or an address with a mailroom number. An office building may have a café (usually off of the lobby) that has its own addresses. Offices in the building can be split and merged over time. I have even seen multiple buildings merged into one. You might also have a co-working space with individual addresses for rooms within it.
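A rough sketch of how those dimensions might combine into a "same place?" heuristic. The field names and the distance threshold here are invented for illustration, not taken from any particular API: compare normalized components first, then fall back to coordinate proximity.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * asin(sqrt(h))

def probably_same_place(x, y, max_metres=30):
    # Dimension 2: address components, compared case-insensitively.
    comp_keys = ("street_number", "street", "locality")
    xc, yc = x.get("components", {}), y.get("components", {})
    shared = [k for k in comp_keys if k in xc and k in yc]
    if shared and all(xc[k].lower() == yc[k].lower() for k in shared):
        return True
    # Dimension 3: fall back to raw coordinate proximity.
    if "location" in x and "location" in y:
        return haversine_m(x["location"], y["location"]) <= max_metres
    return False

a = {"components": {"street_number": "10", "street": "Downing St",
                    "locality": "London"}}
b = {"components": {"street_number": "10", "street": "downing st",
                    "locality": "london"}}
print(probably_same_place(a, b))  # True
```

Even this toy version shows why it's "probably": it says nothing about dimension 4 (time), and a single coordinate threshold can't distinguish two offices in the same building.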
My logic with the address field is to try to consider what is the purpose of the address in the application.
If it is used for someone to send you a letter in the mail (for example: university admission) then you are better off with a big box where the user can write the magic blob of text that usually gets their mail delivered to the right place, up to instructions to the local postman.
If you need to pull off some analytics, or need to estimate shipping, then just extract the fields that your shipping agent demands: country, zip code etc. Leave all else intact in the "magic blob".
I think city, postal code, state/province (optional), and country would be fairly standard. At least in North America, Japan, Australia, and the UK. Of course for country you would want to use standardized country codes, which are somewhat political.
The author describes himself as a "_Trusted_ Consultant" in the blog's subtitle.
The comic strips illustrating the article don't mention permission from any identified license provider, and are cropped; that doesn't help build trust - at least in my case.
Are they used with authorization from their authors?
Author here! Ha, I'm a big follower of Patrick, especially his MicroConfs! I didn't know he had an article on the subject, thanks for linking it!
However I agree with him!
What happens is that in my experience I've yet to encounter a case where what I suggest in the blog doesn't work. No doubt there will be cases where it won't be enough. But reading Patrick's article, clearly the complexity of a system needed to handle all the cases would be huge!
So I guess one has to compromise :)
Yes, compromise in that typically US-centric way, and to hell with any culture that doesn't fit the mold, they can adapt!
It's true, they can adapt! They proved that when they didn't have much choice, back in the ASCII days when you had to conform or stay in the pre-computer age. But maybe we should try to better now that we can...
So, my identity records got mixed up due in part to the impedance mismatch between (first name, last name) and (first name, middle name, last name), resulting in my legal name suddenly changing until I could get the issue sorted out.
Two fields work in the sense that most people are going to have something to put in them, but what that is and what it means isn't always going to match what you intend. Either the system is going to be tolerant of such things (in which case you can probably get away with some variant of (display_name[, external_name])), or you're using it to interface with some other identity provider/consumer that requires a specific format, and that's what should be asked for.
These are really the basics of database design. I would argue that you should do these until you know when you shouldn't. That is, you should normalize to this level until you learn when and why to denormalize some things for performance, if that's ever even needed. I think it means "humble" as in not trying to teach anything complex or revolutionary.
The advice is to set NOT NULL if you think the field can't be null. That's valid, and a matter of encoding the schema to correctly match the data. Same with check values. Keep bad data out of the database wherever you can.
> and constraints are not high-performance
Then turn off constraint checking after you've built your schema and identified them as a performance problem. Don't optimize away your data integrity before you've even determined it's a problem.
If you actually don't care about your schema, just use a text field and throw whatever in it, or use a document store without any strict checking. If you're using SQL, presumably you're using it for a reason, and that's likely so you can take advantage of what it has to offer.
The article actually advises setting fields to NOT NULL unless there's a specific reason to do otherwise.
And re constraints: eh, for most applications this feels like premature optimization. My usual approach is to include whatever constraints are easy to express, but also have the application not rely on the DB for enforcement. That way you get some extra safety checks that might catch bugs, but if they do become a bottleneck you can just drop them.
Not a DBA, just a software dev who's learned not to sacrifice safety for perf unless there's a clear, demonstrable need.
They do, and obviously it is workable, but I think every big MySQL org ends up with a tool/workflow like gh-ost: https://github.com/github/gh-ost which have their own set of challenges, like not playing well with foreign keys.
This is a silly argument. Facebook & Google have made massive investments into the source code of MySQL and have helped make massive operational improvements to it over the last decade. It's always glaringly obvious when people have no experience with something and then throw shade at it, like the obligatory Microsoft-bash whenever that topic comes up.
> Never store firstname and lastname in the same field
Funny - where would you store the middle name then? And the second middle name? The married name, in cultures where the spouse changes or doubles theirs? Of course, in normal form you need a name table, with an order column, and so on.
Anyway, this article goes straight for the notions everyone can fetch from any book and skips the interesting and difficult part of database design, which is understanding your domain, so that no future expansion is prevented by a wrong cardinality on a relation that was simplified from reality.
Because if it's a user database, you might get away with just the name, so you can address people during support calls. If it's an accounting database, you might want to know someone's name at the time of invoicing, keeping multiple records with start and end validity times. And if it's an FBI database for protected identities, you might have to support multiple name aliases referring to a single person.
I'd store the middle name in the first_name column. For example, "Juan Martin" would be in the first_name column and "Del Potro" in the last_name. Apart from the middle/first name incongruence that you may highlight, I believe this design is not awfully bad and should serve in most situations.
why on earth would you, and the article writer for that matter, use a column named first_name for storing the middle names as well?
just call it name!
as I said above, the article goes to great lengths about normal form, and skimps on the important part: the data model is a representation of the real world, and a first_name column with middle names inside is basically a straight-up lie about what the column models and what data it contains.
there's not one way to represent names, there are application specific domain where generalities need to be more or less detailed, but the core aspect of it all is that the data model shouldn't lie about what it is that it's modelling.
Since everyone has to have an opinion and provide it, here's mine, maybe a retitle?
>A humble guide to transactional database schema design
There are a number of things that don't apply to analytical database design and I'm seeing the two confused fairly often, then again, this comes entirely from my point of view so take it for what it's worth.
I work on an e-commerce platform that's live in about 20 countries. One of the hardest records to normalize correctly is the address, because not only do users have very different rules depending on their country (e.g. not all addresses in Turkey have post codes, GB has some very weird building names/numbers...), we also connect with vendor API's that have their own ideas of how to model an address.
My personal opinion on this is something like address_line_1, address_line_2, city, region, post_code, with each field being varchar(max). Anyone else have an opinion on how to do this?
After spending a long time trying to figure this out I came to the opposite conclusion of the article. Unless you're only operating in one/a small number of countries trying to normalise addresses is too much of a moving target.
The schema I ended up with for my use case is: country (required), region (optional depending on country), address (rest of address text) and a lookup to get geographical coordinates for the address (with user confirmation to tell if it's somewhat accurate). Filtering/sorting on country and region then gets you pretty far and for the more complicated queries I found it took less time to use the coordinates with PostGIS than trying to match up bits of the address text.
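That schema might look something like this (SQLite and the sample rows are purely for illustration; a real deployment would use PostGIS for the coordinate queries):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE addresses (
    id      INTEGER PRIMARY KEY,
    country TEXT NOT NULL,  -- required
    region  TEXT,           -- optional, depending on the country
    address TEXT NOT NULL,  -- the rest of the address, free text
    lat     REAL,           -- geocoded, user-confirmed
    lon     REAL
)
""")
rows = [
    ("TR", None, "Kadikoy, Istanbul", 40.99, 29.03),
    ("GB", "England", "The Old Mill House, near Little Snoring", 52.86, 0.95),
]
conn.executemany(
    "INSERT INTO addresses (country, region, address, lat, lon) "
    "VALUES (?, ?, ?, ?, ?)", rows
)

# Filtering on country/region gets you far without parsing the blob.
n = conn.execute(
    "SELECT COUNT(*) FROM addresses WHERE country = 'GB'"
).fetchone()[0]
print(n)  # 1
```

Anything fancier (proximity, containment) runs against lat/lon rather than against the free-text field.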
Yeah, I guess it really depends on your use case and your overall system too. Our system connects to multiple vendor API's, so having one field for the rest of the address text would be very difficult -- we'd have to try and automatically split it into constituent parts depending on the requirements of vendors. Some of our data then goes to credit check API's too, so an incorrect address can result in an (incorrectly) lost order for us.
I think that's the best way of going about it as far as cultural differences are concerned. Same thing with names, where you'll have firstname (name_1), lastname (name_2)
Recently I ran into a project where everything is stored in around 3k unique fields, with many levels of nesting, in a JSONB column with NO schema or documentation. Generated JSDOC from the data and I'm hoping to create some kind of schema.
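The schema-recovery idea above can be sketched simply: walk every document and record each key path together with the value types seen there. This is my own illustrative helper (`collect_paths` and the sample documents are not from the project described):

```python
import json

# Recover an approximate schema from untyped JSON payloads by collecting
# every key path and the set of value types observed at that path.
def collect_paths(value, prefix="", out=None):
    out = {} if out is None else out
    if isinstance(value, dict):
        for k, v in value.items():
            collect_paths(v, f"{prefix}.{k}" if prefix else k, out)
    elif isinstance(value, list):
        for v in value:
            collect_paths(v, prefix + "[]", out)
    else:
        out.setdefault(prefix, set()).add(type(value).__name__)
    return out

docs = [
    json.loads('{"user": {"id": 1, "tags": ["a"]}}'),
    json.loads('{"user": {"id": "oops"}}'),  # inconsistent type
]
paths = {}
for d in docs:
    collect_paths(d, out=paths)
print(paths)  # {'user.id': {'int', 'str'}, 'user.tags[]': {'str'}}
```

Paths that accumulate more than one type (like `user.id` here) are exactly the spots where a schemaless column has let bad data in.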
You can use JS on the server. Just don't do it this way ...
“Beginning database design” By Clare Churcher is a really accessible book on these issues, very practical (starts off with simple spreadsheet tables!) and takes you through step by step to understanding the problems with and the need for the different levels of normalization.
How I wish for nothing more than an alternative to SQL to communicate with databases, something that automatically infers coupling tables too. Set theory is nice. SQL.. personally, hard pass.
What parts do you find difficult? I agree that the benefit to having something learn the schema and the relationships would be nice but I get that through something like DBeaver already.
That's one issue I see with NoSQL: data is not normalized, and consistency goes out of the window once the database and the volume of data are big enough. That might not matter if you work on a search engine, social network, or music suggestion app, but for most business apps and most websites, having inconsistency is really bad.
On the other hand, having some data denormalized can improve performance as you can avoid expensive joins. It's for designer to see if they need to trade consistency for performance and in what measure. I'd avoid it, I'd prefer to add more servers, use load balancing, use caches.
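The consistency cost being traded away can be shown in a toy example (table and data are mine, purely for illustration): a denormalized copy of a customer name inside `orders` avoids a join, but goes stale the moment the customer renames.

```python
import sqlite3

# A denormalized customer_name column inside orders: faster to read,
# but it silently diverges from the source of truth after an update.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        customer_name TEXT   -- denormalized copy, avoids a join
    );
    INSERT INTO customers VALUES (1, 'Acme Ltd');
    INSERT INTO orders VALUES (100, 1, 'Acme Ltd');
    UPDATE customers SET name = 'Acme Group' WHERE id = 1;  -- rename
""")
joined, copied = conn.execute("""
    SELECT c.name, o.customer_name
    FROM orders o JOIN customers c ON c.id = o.customer_id
""").fetchone()
print(joined, copied)  # Acme Group Acme Ltd  -- the copy is stale
```

Keeping the copy fresh requires extra machinery (triggers, application code, or batch repair), which is why denormalizing needs to be done judiciously.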
In my programming life I've barely used SQL databases so far, and there is something I've never understood about these recommendations. Maybe someone here can explain that to me.
I thought databases are designed to organize the data optimally internally and execute all operations on them as efficiently as possible. Why do you then need to organize databases so stringently? Doesn't that mean that there is something wrong with how the db works?
To me the advice always seemed similar to bit twiddling instead of relying on the compiler to optimize. Shouldn't the database lay out the data optimally internally?
I'm not here to say the author is right. S/he is making a lot of assumptions, and as you can see in this thread, everyone is saying it depends.
But one fact is that reducing redundancy in a database is a GOOD thing.
To make a comparison, I'd say it's like when I have to maintain a codebase that is large, and I have to pass a constant (let's say, ERR_STATUS = 1) around in some functions in various folders.
I wouldn't want to have to define it in every single file. One would want that in a header, in a library or in a file that defines static values somewhere. Anywhere but only one location for it.
A database is a bit like a dumping ground. Once the business gets big enough, a lot of other applications are going to plunge in and do things with the data. I wouldn't want to have inconsistent data. As a programmer, if one had inconsistent things in one's source code, one could deal with it. But data is different: data is produced by users and interactions and so on. Once you begin to have inconsistent data, you can't trust it anymore until you find out exactly what caused it. A lot of wasted time. And that leads to a lot of problems.
But I agree with you about your point on performance. However by organising data correctly, having indexes on things that matter, joining in the right way, one's using business knowledge and programming knowledge to optimise the database. And that's not something that can be done 100% on its own (as of now).
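The "indexes on things that matter" point is visible directly in the query plan (a minimal SQLite demonstration; the exact plan wording varies between SQLite versions):

```python
import sqlite3

# The same query goes from a full table scan to an index search once we
# tell the database which column our business logic filters on.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")

plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?", ("a@b.c",)
).fetchone()[3]

conn.execute("CREATE INDEX idx_users_email ON users(email)")

plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?", ("a@b.c",)
).fetchone()[3]

print(plan_before)  # a SCAN of users
print(plan_after)   # a SEARCH using idx_users_email
```

The database picks the plan on its own, but only because the schema gave it an index to pick: that's the "you have to help the database so it can help you" part.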
The schema establishes the contract between you the user and the database.
You have to help the database so that it can help you.
The information you are giving it is mostly things that it can't reliably know or decide on your behalf (such as whether you want to store two things together as one thing or not, like first name and last name).
The database still does a ton of stuff for you that you don't have to think about. For example, SQL itself is by definition like that: you declare what you want to take out, and the database figures out how to do that for you (this part is similar to the compiler optimisation of your example).
Going by your example, not having a schema is similar to saying "why do I have to write code, can't the compiler just write the code for me?".
In both cases, on one hand yes the compiler or the database does the magic for you, but at the same time it can't work without your input and decisions (writing code, or the schema).
The DB will work just fine regardless of how you organize your data, some queries might just not be possible or efficient. A well designed database will easily serve many different type of applications at the same time.
The power in databases lies in their ability to easily transform or restructure the same source data in multiple kinds of different ways. The catch is that it has to be arranged in a specific way (normalization) for this to happen. Normalization is actually based on math and set theory.
My humble opinion is that poor choices in database design are the source of a great deal of technical debt.
I used to avoid making schema changes as they were "scary" (until I got a decent understanding of Django Migrations). Just a quick hack to make up for it in the application layer was far easier. Then another, and another until you have real mess on your hands.
And avoid nulls when you can. Unfortunately, most of the time when I start a new database design there is already existing, messy data in Excel sheets, and then you can't avoid them.
1) There is no "good design". It is always a compromise of trying to satisfy various practical constraints. OLAP vs OLTP probably rings a bell here. And this Last / First name thing is laughable. The complexity of handling all possible human names can be insane. Depending on the goal the proper design can be anything starting from single field and up to some relational structure.
2) The article itself looks like let's just write something up to score some points.
> 4) Never store firstname and lastname in the same field
I have the opposite opinion: unless you have a specific reason why you need to know one's "first name" and "last name" (e.g. to interface with some other system that requires names to be split up like that), just use a single field and stop caring about whether or not the name conforms to your preconceived notions of format and structure.
(fwiw, I like the article, and I find it suitably humble)
Just want to hang a thought here ... depending on the precise definition of First Normal Form, it is either impossible to avoid[0] or impossible to enforce[1] without further constraints.
[0] if definition is "must be rectangular", i.e. all columns have the same number of rows and all rows have the same number of columns.
[1] if the definition includes "do not allow duplicate rows"
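Footnote [1] can be demonstrated concretely: a bare relational table happily accepts duplicate rows, and only a further constraint (here, a composite primary key of my choosing) enforces that part of the definition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Without any key, "no duplicate rows" is not enforced...
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
conn.execute("INSERT INTO t VALUES (1, 2)")
conn.execute("INSERT INTO t VALUES (1, 2)")       # duplicate, accepted
dupes = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]

# ...whereas a composite primary key makes the duplicate a hard error.
conn.execute("CREATE TABLE t2 (a INTEGER, b INTEGER, PRIMARY KEY (a, b))")
conn.execute("INSERT INTO t2 VALUES (1, 2)")
rejected = False
try:
    conn.execute("INSERT INTO t2 VALUES (1, 2)")  # violates the key
except sqlite3.IntegrityError:
    rejected = True
print(dupes, rejected)  # 2 True
```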
RE: not storing address in a single field, splitting an address into semantic fields feels like a premature optimisation. You’re forced to anticipate all the fields that you may want to aggregate by, before you may actually even use them.
Instead, you could just create a custom function that takes in a string and tries to identify the city or street number or whatever element you want extract from the address e.g.
select * from users where extract_city(address) = 'london';
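The `extract_city` function in that query is hypothetical; a deliberately naive version can be registered as a SQLite user-defined function (my assumption here: the city is the last comma-separated part of the address string):

```python
import sqlite3

# Naive stand-in for the hypothetical extract_city: take the last
# comma-separated segment of the address, lowercased.
def extract_city(address):
    return address.rsplit(",", 1)[-1].strip().lower()

conn = sqlite3.connect(":memory:")
conn.create_function("extract_city", 1, extract_city)
conn.execute("CREATE TABLE users (name TEXT, address TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [
    ("Ann", "10 Downing St, London"),
    ("Bob", "1600 Pennsylvania Ave, Washington"),
])
rows = conn.execute(
    "SELECT name FROM users WHERE extract_city(address) = 'london'"
).fetchall()
print(rows)  # [('Ann',)]
```

One caveat worth noting: a function applied to a column in the WHERE clause generally can't use a plain index on that column, so this approach forces a full scan unless the database supports expression indexes.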
Use third normal form. - Till your system gets 10 years old and linking your user's first name to their address requires joins on 10 tables with millions of rows.
Firstname lastname is too fragile and oversimplified. Suffix, prefix and a variable number of middle names/initials can also be necessary. Occasionally, legal name vs. nickname is also needed. But if there's no pressing need for grouping or sorting by surname, why piecemeal any of it out?