A humble guide to database schema design (mikealche.com)
414 points by helloiloveyou on April 7, 2020 | 245 comments



Most of this advice is stuff they tell you in school that doesn’t matter at all in reality.

Real advice:

- Normalize until it hurts, denormalize until it works. (I.e. prefer normal forms but denormal will have better performance.)

- You need nullable dates and nullable numbers. You never need a nullable string. If you must distinguish blank string from unset, add a column called “unset”. Using null for this will just result in bad data entry.

- Names should be stored as display_name, first_name, and last_name. A display name is not the same as just first plus last. They have different purposes, so don’t shove them together.

- When in doubt, just have a column full of JSON and figure it out later.

- Offset pagination is bad for performance. Avoid it at all costs unless you want some spider indexing page 10,432 of your archives to eat all your CPU.


Fuck me, are you trolling?

Guy with actual DB experience here, so:

> Normalize until it hurts, denormalize until it works. (I.e. prefer normal forms but denormal will have better performance.)

No it won't, it depends entirely on the situation. Denormalised data (besides allowing update discrepancies) can expand the amount of data, thus pushing otherwise memory-resident working sets onto disk. Big slowdown. You are wrong. The correct answer is "it depends" - and denormalising really is a last resort that has to be done judiciously and with care.

> You need nullable dates and nullable numbers. You never need a nullable string.

Utter unjustified garbage. The correct answer (from a practical standpoint, not a super-normalised-form one) is: "it depends"

> When in doubt, just have a column full of JSON and figure it out later.

OMG please never let me come near any fool who does this. I was called in to consult on a database problem and they did the same but with XML. It was really bad, and bad for performance and comprehension.

Please nobody listen to such 'advice'.


>I was called in to consult on a database problem and they did the same but with XML. It was really bad, and bad for performance and comprehension.

I can see XML being really bad for performance, but given that JSON column data is handled like native data in modern databases (and you can index/query on fields in the JSON, which pretty much negates the performance impact), using a json/jsonb column is actually a decent piece of advice.
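For anyone unfamiliar, here's a minimal Postgres sketch of what that looks like (table and column names are invented for illustration):

    CREATE TABLE events (
        id      bigserial PRIMARY KEY,
        payload jsonb NOT NULL
    );

    -- GIN index over the whole document supports containment queries (@>):
    CREATE INDEX events_payload_gin ON events USING gin (payload);

    -- Or index a single extracted field with an ordinary expression index:
    CREATE INDEX events_payload_user_idx ON events ((payload ->> 'user_id'));

    SELECT * FROM events WHERE payload @> '{"type": "signup"}';
    SELECT * FROM events WHERE payload ->> 'user_id' = '42';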


You are spot on in picking me up on this, thanks. It needs clarification. xml/json is neither right nor wrong, it depends. Performance, well, it depends. What is unquestionably wrong is the poster's advice "When in doubt, just have a column full of JSON and figure it out later".

Pouring structurable data into an unstructured heap is just laziness and will hurt you later. I've seen the cost of deciding to 'leave it for later' - it's high! I've been involved in cleaning it up. It's horrible.

Years ago I picked up a book written in the 1960s (so pre-IT) for secretaries, and it had a beautiful line in it: "the filing cabinet is a place where you find things, not lose them". Still applies. Always will.


> "the filing cabinet is a place where you find things, not lose them"

"Those who cannot remember the past are condemned to repeat it." - Santayana

“Explanations exist; they have existed for all time; there is always a well-known solution to every human problem—neat, plausible, and wrong.” - Mencken

Software project management is full of thin and runny rationalizations...I (HHOS) include this form under the heading of "Data Abnormalization"...


I've had a lot of experience. You can learn from it or not. If not, feel free to prove Santayana right again, and at great personal cost :) In fact, I encourage you to!


XML was there, being handled natively in databases, long before they even looked at JSON.


Is there a reason then not to use a single column with JSON? The columns would then be "virtual" (the column jsondata would spawn a column jsondata.firstname etc.)

This question is out of pure curiosity; I do not deal with databases, so I hope I have understood your comment correctly.


Of course the answer to everything is always “it depends”. The point is to give generalizations that point people in the right direction to start.


It must be said that the JSON performance of Postgres is good.


I believe the right answer is somewhere in the middle. I know a couple of successful products where the DB schema is completely abused. Yet they have been a commercial success for over 8 years. A good DB specialist also knows the implications of a 'bad' design choice. It's all about the impact at a later moment. Just like software design in general. Finding the balance between complexity and simplicity. Between slow and elegant versus fast and ugly. Balance is key


I did say clearly it depends. And that's because it does.

> Between slow and elegant versus fast and ugly

No! Unqualified, this is a false dichotomy! Because often you can have elegant and fast, and that's because elegance of design often gives you that speed. Not universally true, but much more often than not.


>>I did say clearly it depends. And that's because it does.

You've arrived a tad hot to the discussion to now start to backtrack on how it depends.

After all, if it depends for your defense then shouldn't it also depend for the OP's point? Because if you know that it depends but still start to accuse others of trolling, you are not being helpful nor honest.


Wasting your design time allotment in rationalizing futile compromise is the flip side of seeing everything as a dichotomy. The reality modeled by human discourse may be only consensual, not actual...


> - Names should be stored as display_name, first_name, and last_name.

No.

“[…] refer people to this post the next time they suggest a genius idea like a database table with a first_name and last_name column.”

https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-...

You could, conceivably, have a "display_name" column, an "informal_name" column, a "formal_name" column, etc. depending on your needs. But for the vast majority of people and purposes, a single "name" column will suffice, and will steer programmers away from making unfounded assumptions about people’s names.


But then you don't have my informal name. When I receive an informal update email, it would be clunky to have it begin "Dear Max Barraclough".

You can't automatically deduce that the way to address me informally is Max, without making assumptions that will break outside the English speaking world. See Points 18, 24, and 30.

The W3C suggest that one way to handle names is to have a Full name field and a What should we call you? field.

https://www.w3.org/International/questions/qa-personal-names


If the database owner is not billing you, he probably doesn't need your formal name, so he can have just one column, and call it "name".

Otherwise, yes, he'll need the 2 columns.


You'll regret this decision as soon as you have to integrate with another system that uses separate first and last names. Or as soon as you want to build a sorted list of five thousand names or output a bibliography in a specific format or anything.

Technically the right approach to the problem of storing international names would be to store markup: a string with an associated structure and a discriminant:

    <name lang="en">
      <first>Kenneth</first>
      <middle>Elton</middle>
      <last>Kesey</last>
    </name>
    <name lang="ru">
      <given>Fyodor</given>
      <patronymic>Mikhailovich</patronymic>
      <surname>Dostoyevsky</surname>
    </name>
    <name lang="es">
      <given>Gabriel</given>
      <family-paternal>García</family-paternal>
      <family-maternal>Márquez</family-maternal>
    </name>
You can then have precise automated ways to transform this into whatever derived form you need. Using a single name field is just avoiding the problem. Using multiple fields is a maintenance burden, because in most cases fields are not fully independent and you have to update them all to keep the data coherent.


Using markup does not really change anything, that is just a way of serializing a data structure. You can represent the same information in normalized form.


Of course, I don't mean storing literal XML, but I think it's important to support generic markup in databases. It's not that hard, after all; something like these five tables:

    Markup (markupId, string)
    Ns (nsId, uri)
    QName (qNameId, nsId, localName)
    Elt (markupId, eltId, qNameId, 
      start, length /* string indices */, 
      level, index /* or other way to represent hierarchy */ )
    Attr (eltId, qNameId, value)
would be enough to store and fully index any XML out there.

I think markup is an important separate data type. We have records, we have scalars, now we have a hybrid type, markup: a string (or maybe a byte/bit stream) + records linked to indexes in this string/stream and maybe forming a hierarchy.


A database should store the actual data, not some serialized format. You don't store the comma tokens from a CSV file in the database either - you store data which can be serialized to CSV. The same information could be serialized as CSV, XML or JSON depending on the context.

Storing a serialized format as an opaque string or blob might make sense in particular cases.


A) the proposal is crazy. You probably don’t need to know if someone has a patronym or a kunya or whatever. Base what you store on what you need.

B) Storing this as normal data would be an intense pain in the ass to use, for no gain in performance or correctness. Use JSONB, this is what it’s for.


If you actually need to know such details then it should be stored in normalized form.

Say you have a visa database which must be compatible with multiple naming systems from different countries and where no information can be discarded for security or legal reasons. Then you store the data you need in normalized form. Why would you use weird hacks for essential information?

And if you don't need all that detail - then you just save the name as a string and discard the metadata.

I can't see it would ever make sense to store a name in JSON format inside a relational database. Either the information is important or it isn't.


A) In many cases you don't, but if you store data about authors in a bibliographic database then you probably do. And it's just a part of the package: for example, Donald Knuth went as far as to print people names in their native language in addition to English in his books' indices :)


This comment is culturally insensitive. You are implicitly assuming that your culture's way of expressing names is all that matters


Matters for what? A database is a tool to solve a problem, not a timeless repository of deep cultural meaning.

If I make a webapp that is only localized in English, display, first, last will let me solve my common problems: I need to say "My account (John Smith)" in the corner (display name), and I want to be able to write emails that say "John, we know you're wondering how the company that sold you a pair of pants three years ago is feeling about COVID-19…" (first name) and list articles by author sorted by last name. Those are all common use cases for apps used by English speakers.

Now, if I were making an app to do voting in Myanmar, I would need to deal with there not being last names for many Burmese. If I was trying to track Arab speaking terrorists around the world, I'd want a long list of their aliases and kunya in multiple romanizations (I think the CIA used to prefer "UBL" for Osama bin Laden because they called him "Usama"). If I was making a library app, I might want to have the English romanized name plus the Unicode native language name. Tons of possible problem spaces with different solutions.

I just think tagging the different name parts for "Gabriel José de la Concordia García Márquez" is going to be overkill for most English language apps, because when are you going to look someone up by maternal family name versus just doing a full-text search for "Garcia Marquez"?


Missing the point. If you just have a single "name" field then you can accommodate any culture perfectly.


This fits the 75% use case, but misses things that do come up, like alphabetizing by last name and emails with casual forms of address.


I'm not arguing for storing a serialized format; instead I suggest parsing it and ending up with something like this:

    NAME ( 1, "Fyodor Mikhailovich Dostoyevsky" )
    TAG ( FIRSTNAME, 1, 0, 5 ) 
    TAG ( LASTNAME, 1, 20, 30 ) 
    TAG ( PATRONYMIC, 1, 7, 18 )
I'm not saying it's efficient (UPD: speedy), but it's sufficient to get the required details and can be mechanically transformed into a more efficient form. UPD: And it's very flexible and easy to extend. I believe it can handle just about any kind of naming from "Falsehoods about names".


But why include all the cruft in the first place? You could just save this information (in normalized form):

  FIRSTNAME: Fyodor
  LASTNAME: Dostoyevsky
  PATRONYMIC: Mikhailovich


This one lacks ordering. It's not a problem for Russian names as the rules are fixed, but may be a problem for other ones.

And how would it scale? E.g. John Ronald Reuel Tolkien and George Herbert Walker Bush have two middle names, while the full name of García Márquez is Gabriel José de la Concordia García Márquez. Markup handles this uniformly, creating only as many metadata records as necessary, but a field-based model will need fields for all of this that will stay empty most of the time, and likely some meta fields to set additional quirky flags.


Regardless of how complex the naming schemes you need to support are, you just make the data more complex and harder to query by storing tag names and character indexes and whatnot.


Please don't do that. I hope you haven't implemented such a set of tables in any database.


Not yet :) But I may just for the sake of it.


That's a good idea. You'll learn firsthand why it's a bad idea.


Isn’t that what RDF is all about? Each field being a named property of a resource, the properties themselves being properties of other resources describing the structure of the first properties. And so on, going on as meta as necessary


There is no need for an arbitrary "markup" that you then restrict based on your use-case (in your case it's some sort of internationally-aware storage of different data).

Databases are designed to store data, and storing markup like your example above is basically a misuse on some level. Your example above has a clear structure, and that structure can be encoded as raw data in database tables. So instead of having a generic set of tables (as per your other comment) to "store markup" data, you would rather come up with tables that represent your actual requirement and what that markup is meant to indicate.

Some database also have inheritance models which allow you to store the data above quite trivially. You basically have a "base" and then extend it with specific fields for each type (en, ru, es, etc).

So please, don't put markup in your database like that. It's a big no-no, and I'm sure your DBAs would tell you that very quickly. Else, just throw that crap into a document store and make it someone else's problem, as is unfortunately all too common lately.


What you want is not markup, but tagged unions. It's a real shame SQL and many other languages don't support them.

Also, your example about Spanish names doesn't take into account that nowadays parents can choose the order of family names, and that gay marriage has been legal for 15 years. Which reflects a deeper truth: don't make too strong assumptions about reality, for it may change.


Tagged unions, if I understand what they are, can be done in SQL:

    create table Foo(
      fooId integer primary key autoincrement,
      type integer /* tag/discriminant */ );
    create table FooA(
      fooId integer primary key references Foo(fooId)
      /* , data specific to FooA */ );
    create table FooB(
      fooId integer primary key references Foo(fooId)
      /* , data specific to FooB */ );
They're not quite markup though. Markup has a string/stream with associated records. The number of these records, their kind, and their placement varies, so it can handle ordering issues, multiple similar names, or a missing name (e.g. a missing middle name), or ignore parts of a name quite naturally.


That could be an implementation of tagged unions. Without language level support (pattern matching, exhaustiveness checking, etc.) you gain very little from this.


Yes, you can simulate them, but it's painful.


It's not a simulation, it's a data model for reducing the pain of what's really a hard problem - the names we call people are not unique, they are colloquial identifiers (HHOS).


Joined tables on the same primary key are a simulation of tagged unions. Proper tagged unions would solve this more concisely and with more clarity.


I'd really like to see an RDBMS with built-in support for both inheritance and tagged unions. That would solve 90% of my schema woes.


Hum... Who exactly is entering that data into the database? And are they ok with entering things with that level of detail?


Your American is showing? It's just a standard name input form but internationalized.


Hum... I'm not from the US and do not live there if that's what you are asking.

People usually have enough trouble separating the first name from whatever you call the rest of it and deciding what part goes where, and you expect them to be able to tell what the cultural roots of the name are?

Keep in mind that not all names in databases are filled in by their owners. Besides, I really doubt I would be able to enter my name into whatever you are designing without some amount of lying.


I spent 13 years maintaining a large-ish system that separated first- and last names for customers.

Separating makes searching/indexing on last name easier.

The alternative would be to put last names first by convention, which renders them unusable for anything else; or go insane trying to find a universal way of splitting and recombining for display.

Keeping them separate inevitably results in a convention to put company names in the last name field for corporations, and failed searches when the convention is broken.

It's one of those questions that I suspect will keep bugging me until I die.


> Separating makes searching/indexing on last name easier.

What do you mean "last name"?

If you have two last names, then what?

If you have no last name, then what?

> Keeping them separate inevitably results in a convention to put company names in the last name field for corporations

If you only have a single field for name, this problem disappears.


I mean everything but the first name.

Yes, but then you have to deal with the indexing/search problem instead.


Yeah, but if you only have one last name field, you can only index on people's first last name, not their second last name? How are you going to solve that?


> Separating makes searching/indexing on last name easier.

The point is that such a search or index is a mistake. It's simply not meaningful based on the real properties of names, it just happens to work 99% of the time if you're only dealing with a mostly homogenous culture.


Which is good enough in plenty of situations, compared to the "perfect" alternative that doesn't work at all.


> Keeping them separate inevitably results in a convention to put company names in the last name field for corporations

In most (IMO sane) designs, a company name would usually be a separate field (or better yet: companies would be represented in an entirely different table).


But in many cases the customer is either an individual or a corporation. Keeping them in separate tables turns simple tasks such as generating a list of customers into joins. Which brings us back to normalization: too much of it kills performance.


I feel like if a join is killing performance, either it's a really weird join (e.g. joining on an OR) or you ain't indexing.

Regardless, you could always just assume everyone's a "corporation" (and for individual customers, that "corporation" would just be a reflection of the individual customer's data). That, or only deal with contacts within a "corporation" (and an individual customer would just be a contact with a null "corporation"). I've seen both approaches in practice, and while they have their tradeoffs, they work reasonably well.

"Corporation" is a pretty wonky term for this, anyway (not all non-individual customers are corporations); "organization" would be better. But I digress.


Also, for international users browsing in English, labeling the fields "given name" and "family name" is far clearer than using first and last names.


Exactly what I was thinking - plus a field to state the order they should be displayed in, seeing as Asian cultures usually put the family name first.


Shouldn't that be a localization/translation issue?

My experience living in Japan is that they will write family name first if using Japanese, but given name first if using a European language.


Also, in some cultures like in Spain the legal surname is typically the first family name, which might be the first of two or three family names.


What do you mean "the legal surname"? I'm Spanish and my legal name is my full name with both family names. You can't just ignore one. Also, Spanish people have two family names, but they can be composite, so they may look as three or four family names.

However, most people are usually known by one family name, but which one is used depends on which one is more easily recognizable.


> What do you mean "the legal surname"? I'm Spanish and my legal name is my full name with both family names.

I'm referring to the primer apellido. Even government forms require that people break down their surnames into the first surname and all trailing surnames.


From the article:

> This list is by no means exhaustive. If you need examples of real names which disprove any of the above commonly held misconceptions, I will happily introduce you to several.

...

> 40. People have names.


Newborns have no known names until the birth certificate is filled out. There is still data about their medical care that needs to be stored in a database.


Prince, famously, didn't have what we'd recognise as a name for years. All the "Unknown Soldier"s don't have names. "John/Jane Doe" doesn't technically have a name.


These examples seem somewhat contrived, but of course it comes down to the use case of the database. I'm sure Prince remained registered under his legal name on contracts, checks etc., even if he used the symbol in marketing. But a database of bodies in a morgue obviously needs to be able to support unidentified bodies.


Interesting, I think you're right. According to Wikipedia, he only changed his stage name, not his legal name. The BBC didn't seem to make the distinction.

https://en.wikipedia.org/wiki/Prince_(musician) , https://www.bbc.co.uk/news/magazine-36107590


Some oauth systems will give you a user id or an email but not permission to see a name, for example.


How do you store addresses? There’s a use case for breaking things up.


If I had a dime for address forms that have "state" as a mandatory field, while having a country dropdown that contains countries that do not have states, I'd be rich, RICH I TELL YOU!


> Names should be stored as display_name, first_name, and last_name. A display name is not the same as just first plus last. They have different purposes, so don’t shove them together.

It all comes down to the purpose of the database. There is no one-size-fits-all solution. For a hospital which needs to register newborns, you can't even count on everyone having a name. For a library database, you need both the first name and last name of authors, since books are sorted by the author's last name - but having a "display_name" would be redundant and violate normalization.

If you can get away with having a single "name" field, this will save you lots of trouble. Only add additional fields if you need them for some use case.


> “ Offset pagination is bad for performance. Avoid it at all costs unless you want some spider indexing page 10,432 of your archives to eat all your CPU.”

I love this! It’s so true; only bots dive that deep, and it’s usually a total waste. But what do you recommend instead? Should you just deny any calls past page 10? Or do away with pagination altogether?



Thanks! This is really useful! I didn't know that limit offset scanned all rows before the offset.


If there's somehow no need to show anything past page 10, then by all means, just do that. But most applications can't get away with that.

There are plenty of resources for pagination online.

For example, your [Next Page] button could use a `where date > {date of last item on prev page}` since you have an index on date. e.g. <a href="/things?after={prevDate}">Next</a>.

A forum topic could have a consecutive and contiguous "index" on each of its posts that starts at 0 and increments. Now going to page 1000 is simply a matter of grabbing posts with index between 990 and 1000, which is instant.


Don’t use offset for paging. Presumably you’re sorting the results by something, add a condition on that value in the where clause based on the last row in the previous page.


Super fun until you have a column with a number of duplicates greater than the page size. Now you'll have to add a second column to your order clause, your serialized cursor and your where clause. Oh, and don't forget to check whether your database driver uses the same date precision as your database does, because you might be in for another surprise.
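To make that concrete, a rough sketch of keyset pagination with a unique tie-breaker column (names are illustrative; Postgres and recent MySQL accept the row-value comparison, elsewhere you spell it out as two conditions):

    -- First page
    SELECT id, created_at, title
    FROM posts
    ORDER BY created_at DESC, id DESC
    LIMIT 20;

    -- Next page: carry the (created_at, id) of the last row shown in the cursor.
    -- The id tie-breaker keeps duplicate created_at values from breaking paging.
    SELECT id, created_at, title
    FROM posts
    WHERE (created_at, id) < (:last_created_at, :last_id)
    ORDER BY created_at DESC, id DESC
    LIMIT 20;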


> denormal will have better performance

No it won't! In general it will have worse performance. Deliberate denormalization can be used to optimize for particular queries, but it will make other operations more expensive and/or error-prone. So it has to be done very carefully and with a full understanding of cost vs benefit.

And many database systems support materialized queries which give the same optimization without having to compromise the schema.


Have you ever used a complex database in a formal, by-the-book, third normal form?

The number of tables explodes, the number of joins becomes a pain and every single data change requires careful planning.

As for the "many database systems support materialized queries", I only know of a few that truly do: Oracle and DB2.

PostgreSQL supports them, but you need to handle updating them yourself. SQL Server kind of supports them, but they are limited.

In other RDBMSs you can work around that using triggers (or similar concepts), but that IMHO is definitely a bad practice, and the times I've used it (usually because there was no other option) have only brought me grief.
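For reference, the Postgres flavour looks roughly like this (view and table names are invented); the REFRESH is exactly the part you have to schedule yourself, e.g. from cron or after the relevant writes:

    CREATE MATERIALIZED VIEW order_totals AS
    SELECT customer_id, sum(amount) AS total
    FROM orders
    GROUP BY customer_id;

    -- A unique index is required for CONCURRENTLY, which avoids blocking readers:
    CREATE UNIQUE INDEX ON order_totals (customer_id);

    REFRESH MATERIALIZED VIEW CONCURRENTLY order_totals;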


Denormalization means you (by definition) have duplicate and redundant data in the tables. This certainly requires careful planning around updates to avoid data corruption.

> The number of tables explodes

This is IMHO a backwards way to understand normalization. Normalization tells you how many "logical relations" you have in the data - whether you like it or not. If you decide to squash multiple logical relations into the same physical table then you get redundant data (by definition). This might be a fine optimization, but it doesn't reduce the inherent complexity of the data model. It just saves some joins at the cost of more complex update logic.


I think there's a decent amount of hyperbole, but even taken at face value: Every single data change should require careful planning. Especially if it's a schema change, and not just a query.

You can do a lot with generated code, refactoring, etc. Data is much less agile, so I'm a lot more conservative when it comes to that end. Never mind that with decent naming, table comments etc. it's good documentation in itself.


> No it wont! In general it will have worse performance.

Since the advice is to only denormalize to improve performance, this would be a contradiction. Yes it makes everything more error-prone, but if you're profiling to check performance, then you only keep the denormalizations that improve the performance you care about.


Yep, exactly, materialised views are the ideal way to create query specific schema - but there's always a cost, namely, refreshing them needs to be done thoughtfully.


Materialized views are great, but it is just a denormalized form with in-database updating instructions.


I prefer to think of it as a cached query result. Normalization is not compromised since there is no way to introduce inconsistencies.


If you do a non-atomic update like “add 10 to transactions total” where total is a materialized view, it could bite you, but yes, it is less likely.


> You never need a nullable string. If you must distinguish blank string from unset, add a column called “unset”. Using null for this will just result in bad data entry.

What's an example of a data entry error that results from this scenario? Wouldn't it entirely depend on the data entry interface?


Someone is filling out the form. They skip MI. Do I mark it as null or blank? The only way to trust this is to have a big checkbox that says “does not have a middle initial” and stores a boolean. But realistically, who cares? Unless you’re dealing with legal names, you don’t need to distinguish the cases.
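As a sketch of what that amounts to in a schema (column names are hypothetical): the string is never NULL, and the explicit "has none" case gets its own boolean driven by that checkbox.

    CREATE TABLE person (
        person_id         bigserial PRIMARY KEY,
        middle_initial    text    NOT NULL DEFAULT '',   -- '' simply means "not provided"
        no_middle_initial boolean NOT NULL DEFAULT false  -- true only if the checkbox was ticked
    );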


OK, but why does that require the database to have non-nullable string types?


No reason per se, but it’s inconvenient and I’ve seen tons of bugs from “NoneType has no method ‘uppercase()’” that this eliminates.


Sorry, I can't get behind changing the database design to accommodate under-specified data contracts and/or lazy software development.


I can't get behind letting theoretical database design concerns trump real world problems with unexpected nulls leading to 500s in production. :-)


Your main database should always be 100% normalized. You can add a separate database for lookups later that can be as denormalized as you want.


That assumes the people who read it have actually been to school; there are a LOT who never learned anything about database design.

Your other points make a lot of assumptions and there are a lot of footguns in them, but other commenters have pointed those out already.


> Never store firstname and lastname in the same field

I disagree. Why would you want to force the "firstname lastname" pattern on people? What if someone has three names? Or only one? Just use a "name" field!


Scandinavian here, our names don't really fit this pattern very well either.

My last name has an "ö" in it. That doesn't work in 99% of so-called global websites, so I have to transcribe it as "oe" instead, which is just wrong.

My husband has three first names and one last name. However, his actual name is his second first name. If his full name is "A B C D" and you ask him what his name is, he'd say "B D". If you know him you can call him "B"; if you don't, you can call him "Mr D" or "Mr B D". Most American companies call him "Mr A" or "Mr A D", which is just wrong.

I have a friend with two first names and two last names, but his second first name is his name, and he prefers his first last name. If his full name is "A B C D", he'd say his name is "B C" if you ask him. Calling him "Mr A" is just wrong, and calling him "Mr A D" is super duper wrong.

(In government systems in our home country, this is of course handled. You can "mark" which first name(s) are preferred, so mail to my husband from either government or private companies will always be correctly addressed to "B D" and nothing else.)


Some people have nicknames and prefer to be called that. A well-known singer preferred to use a symbol instead of a name.

People can have thousands of preferences and an app can't account for all of them, but it can cover the majority of use cases.


My examples aren't about nicknames, they're about the difference between someone's full name, and their name.

And even in American name culture it's not obvious how to transform someone's legal name to their name. How would you encode that the name of "William Henry Gates III" is actually "Bill Gates" ?

"William" - nope.

"Mr III" - no no no.

"Mr William H. G. III" - nooooooooooo.


That was my point: trying to account for every preference a user might have or every particular scenario is too complex and takes too much time and too many human resources. The right thing is to address the majority of your user base.

To take your example, why should my app encode that "William Henry Gates III" is actually "Bill Gates"? I will provide a form with first name and last name and Mr Gates can register as "Bill" "Gates" or "William Henry" "Gates".

I have three given names and I find it a non-issue completing forms. And that's because a string can contain white spaces or even be empty, so you can encode any arbitrary combination of first name + last name.

Also, people get too much into politics, trying to be "revolutionary" instead of trying to solve the technical issue at hand.

I can solve a technical problem in 2 minutes or get into political philosophy and argue for 2 years without solving anything and without helping anyone. I'd rather solve the problem.


Don't dismiss nicknames. In Germany, e.g., you can put an artist name on your ID card and your passport.


This isn't about nicknames, it's about cultural differences.


Absolutely! Having 'name_given', 'name_middle', 'name_family' or some nonsense like this is one of the most common mistakes in database (and application) design, usually made by culturally narrow-minded Western developers.

Take names in Myanmar [1] as an example. From Wikipedia:

> Burmese names lack the serial structure of most modern names. The Bamars have no customary patronymic or matronymic system and thus there is no surname at all.

That said, it is hard to push back on a business requirement to sort users A-Z based on family name, just because.

[1] https://en.wikipedia.org/wiki/Burmese_names


It’s not really a mistake to localize your application design and UX to the first major market that you are targeting.

To immediately target the entire world will leave you burning cash as you figure out all 200+ countries' customs, or with a hopelessly generic app that is hard to use for everyone, pleases no one, and is hard to extend in culturally relevant ways.


This kind of mentality led to the huge fuckup that was ASCII code pages, and we are still suffering fallout from that.

I still cannot input one of the letters in my name into most systems, for example.

There is a decent middle ground: you can store a first name (or equivalent) and last name (or equivalent) for sorting purposes... and a display name which is culture-sensitive.


> usually made by culturally narrow-minded Western developers.

Usually made by developers who realize this will suit 99% of their users. I'm culturally eastern myself and realize this.


Non-American here. I believe US citizens often have a middle name. We Spanish-speaking people have 2 last names. Japanese reverse the "firstname" and "lastname". I had a lot of trouble when I tried to buy a ticket on a Chinese website a few years back. So no, 99% of users don't follow the "firstname" and "lastname" pattern. We might put up with it, but it doesn't suit 99% of the users as stated.


To clarify the comment you are replying to (not the author): 99% of their users. Clearly 99% of people globally aren't going to follow that custom, but many (most?) applications written in the US have a user base where 99% of their users will follow the custom. Switching to something more globally applicable isn't free either - using a single name field has downsides like making sorting by last name more difficult (a ubiquitous thing in the US) and not being able to display first vs. last name where culturally expected.


But I am replying exactly to that. I don't believe 99% of the users of any application with a non-trivial number of users have exactly a "firstname" + "lastname".

In the US it's not rare to have a middle name. And there are many Latinos, Japanese and Chinese in the US as well, none of whom follow the convention.


A lot of US-based startups target the global market early on, although they behave as if they only had to cater to US users.

As a side note, HN commenters seem to fall into two categories: the first group seems to think HN is a mostly US-only forum, whereas the second group seems to think HN is an international forum. I don't know what the truth is or whether there is one at all. But I know that it sometimes irritates me when commenters or posters assume that they are talking only to US citizens.


I suspect it's usually accidental but I feel very little rancor towards them. You solve things the way you think about them (until yesterday it wouldn't have ever struck me that a currency symbol could be in the middle of the value). And that's how, I suppose, you end up with a bunch of people with US Visas that say FNU (first name unknown) in front because they only have one name.


That's why you store a "name" field, and then spend the rest of your carrier writing parsing logic to answer silly management questions.


A Burmese here, and thanks for pointing that out! I have to explain to other people that we don't follow the father's name most of the time. Some names include the father's name at the end, e.g. if my father's name is Myint Swe, my name could be Ko Ko Myint Swe.


It is typical to sort such things in multiple columns. You just need to decide if people with no last name sort first or last.


+1 for "name", though you may want different fields for different display purposes: https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-...


The same also applies for the structured address format. I've seen some stupid decisions in the name of this.

Post codes: 5 years ago, my country brought in post codes. Prior to this there were none. It was not for the benefit of our postal system, who stated they didn't really care, but for badly designed web forms (and the occasional rural house with an unhelpful address).

They also may not fit your convention of what addresses in your country are. The obvious examples are stuff like Japan and one or two central European cities that are addressed "Building number, Block Number, City", but the aforementioned unhelpful rural addresses include a former friend of mine who had to give his address as "House name, near village, County". Actually the village was not near but it got the mail to the right post office which knew what to do with it.

As a counterpoint, I've had places that have very strict fields like "Street Name" (20 char limit), City, County, Country. For a few years I lived on a street who's name appeared three times in the same city. The actual form to get it delivered was "Street name, Locality, City", but because of the length limits I had to sneak the locality into the City field, which luckily worked as the City and County shared a name.


Hopping on the topic of addresses and not completely related to your point - but if anyone's having challenges with handling address fields I'd highly recommend checking out Shopify's address package:

https://github.com/Shopify/quilt/tree/master/packages/addres...

And corresponding blog post:

https://engineering.shopify.com/blogs/engineering/handling-a...


This seems like it hits their graphql api. Is this free to use?


> For a few years I lived on a street who's name appeared three times in the same city

You should try going to Atlanta and finding a business on “Peachtree” street. [0]

[0] - https://en.wikipedia.org/wiki/Peachtree_Street


I agree with everything you mentioned in your post.

That being said:

> For a few years I lived on a street who's[sic] name appeared three times in the same city

This is why you have postcodes ;p


>> For a few years I lived on a street who's[sic] name appeared three times in the same city

> This is why you have postcodes ;p

I guess, but it was foreign vendors which expected "Street Name, Postcode, County". Locals and the postal service were working fine with "Street Name, Locality, County"


Just store GPS coordinates, floor number and apartment number. :D


All fine and dandy until the geography shifts.


Very cool to see mention of "one name only" case. My wife had only one name before we married. She's Indonesian. When we got married in Australia our forms pretty much crashed the Government's Birth's, Death's and Marriages system. We got bounced around between all different people trying to compute the reality that in some parts of the world, people have only one name.


One of my Indonesian friends also had only one name. She named herself Noel as she felt that was better than being referred to as Ms. Null.


I once worked on a project which used the Dow Jones list of people of interest, a list of politicians and terrorists, mostly. One day the software began to fail catastrophically when parsing the latest version of that list. That day, a Mr. Null was added to that list ;-)


As in no name at all? Now that’s an entirely new level. Pretty forward-thinking parents considering today’s surveillance... maybe they were phreakers who tried to name him ' OR '1'='1 but the Government didn’t accept special characters, so they resorted to naming him null.

:)


Reminds me of little Bobby Tables.

https://xkcd.com/327/


In the US, she would have been forced to use the infamous "FNU" (first name unknown)

https://uk.usembassy.gov/visas/visa-errors/name-appears-as-f...

because governments prefer not respecting people's names and customs, just so they can push their own customs (and in that case their name too, because "FNU" is now the 30,000th most popular name in the US)


Is there a "last name unknown" as well? There's absolutely individuals for which this is true as well.


A family friend had this problem when he immigrated to the UK from India. He just repeated his single name as it was both his forename and his surname.


Yeah, "humble guide..." using never/always in 3 points, seems like author thinks he saw everything but he is up for unpleasant surprise in the future :)


You seem to be assuming that each field:

1. Could not contain whitespace.

2. Would have a minimum length.

TFA specifies neither of these requirements. You could have someone's first name (which I would recommend calling given name) be "Pablo Emilio" and their surname be "Escobar Gaviria". Or alternatively you could have someone with a given name of "Sukarno" and an empty string for the family name. Patronyms and matronyms are sort of ambiguous, but would probably be categorized as surnames.

I use a single name field, but I don't think a given name and surname split is crazy.


My favorite way to do things is to just ask for a "full name"; and then, separately, a "calling name" (i.e. "what do we call you?")

The "full name" is for putting in the c/o part of a mailing address; while the "calling name" is for showing beside your profile picture or for saying "Dear [name]" in a mail-merge.

Neither of these is a nickname. They're both your "real" name. But neither of them are your legal name, either. You're free to make either of them whatever you want, and to make them both entirely unrelated to one-another. They're just two forms of "your name" that you might expect to see in one place or another. And we can ask for them, so that we can then spit them back out at you in the situations that call for them.

Doing things this way skips the entire minefield of "given names", "family names", titles/appellations, etc. You've got use-cases for someone's name? Ask the user directly, for each use-case, what text they'd like to appear. There's not too many use-cases, really; you only need to ask a couple such questions, in the end.
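In schema terms that is about as small as it gets (column names are just for illustration):

    CREATE TABLE app_user (
        user_id      bigserial PRIMARY KEY,
        full_name    text NOT NULL,  -- what goes on the envelope
        calling_name text NOT NULL   -- "what should we call you?"
    );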


This is the correct answer. The question to ask is, what are you doing with the names? Splitting first and last only makes sense if the thing you’re doing is listing people by last name. Sometimes you do that! But not always and certainly not in all countries.


No, I think GP is assuming that names are very complex, and even the assumption that people have a first and last name isn't always correct. You can just use a name field, and let people supply the name they want to use.

What do you really need a first and last name separated for anyways? Sorting? A salutation on correspondence? Sorting by last name isn't usually important[1], and you can generally use a full name in a salutation.

Depending on the domain needs, you choose the right representation. If you expect US citizens and want their name as entered on their tax returns or social security card, then you match what those can hold and note to the user that you are expecting those values. If you're just collecting a name to have a name, why put an extra constraint in there that doesn't matter?

1: Not to mention you then get to mostly ignore the craziness that is utf-8 and combining characters and cases and sorting.


> What do you really need a first and last name separated for anyways? Sorting?... Sorting by last name isn't usually important...

When paying customers / clients want to sort by last name, because “that’s how we do things”, then you sort by last name. So it’s only important when you want to make money.

Edit to add: your customers also don’t care that someone somewhere has 6 first names and no last name. “This wasn’t a problem with our old software.”


If customers need it, then it's a requirement. Nothing I said indicates you should leave out customer requirements. In fact, I specifically alluded to including separate names in cases where it's needed.

Sometimes the customer is you or your org, and in those cases it may not be a requirement.


Exactly as you say: if customers (that is, the people paying for the software, not necessarily the people filling in forms) need it, then it's a requirement.

The point is that while sorting by last name (say) might not seem important to an engineer, and requiring a last name might seem outright stupid to a person filling in a form, nevertheless it is often an important requirement. Frequently folks will observe some "stupid" form, and link to some "falsehoods engineers believe about X" document, suggesting that engineering "got it wrong", when in fact they have simply misunderstood who the software is actually for.


Wait till you get into some Dutch surnames - "van der Kleij", "van den Bosch", "de Jager" - which are not patronyms. They would still fit the "first name, last name" pattern, but what you assume is the last name can contain a https://en.wikipedia.org/wiki/Tussenvoegsel

Though a lot of pages in NL have a separate field for it.


Or German ("von Trips") or Italian ("De Sica") and possibly other countries.

Anyway, people who move to other countries have to adapt. For the database of a local business it is often a good idea to stick with the format of official documents, because everybody has to fill out forms on government websites regardless of where they were born. IDs, taxes, healthcare, schools etc.


Arabic culture will sometimes adopt a name after you have a child as “father of X”. https://en.wikipedia.org/wiki/Kunya_(Arabic)


Yes, Arabs use "Ibn" and jews use "Ben". Slavs use father's name and "ich" suffix. People in Iceland use different kind of patronimics.

You can either use just one name field if your logic doesn't need a name or surname or provide two fields for name and surname and let the user handle the splitting.

I have three first names. When I fill a form, if it's an official form or if it's for business purposes I fill all my names as first name, if not, I only write the name I use.


Ibn is a patronym, but a kunya is the other way: your name is the name of your child. E.g. Palestine’s Abbas is also “Abu Mazen”, the father of Mazen.


Exactly what I meant: the fields CAN contain whitespace. The first_name, last_name split is one that I find useful.


Much of the official paperwork in my life has "[First Initial] [Middle name]" in the first name field, which works (usually). But why should I have to do that? It's a pain in the ass.


The usual pattern in this case is to put the first name and then all the rest of the names in the first_name field, and the last name in the last_name field. Users can be made aware of this on the UI.

Another option is to create a middle_names or other_names field; it's first_name, then other_names, then last_name, in that order.


One way to look at it: if you have more than one name, you'll usually have a preferred order of writing them, regardless of culture. The first name used will have different protocols and meanings in different cultures, but it is still the first name, literally.


What if you're ordering by 'last name'?

Say it's an HR app. Do you just assume the last word in the name is the last name?


Design guidance aside, this is not a good article because it doesn't show how to fix the presented problem scenarios.

"This is not first normal form" - OK, how would you fix it?

"This is not second normal form" - OK, how would you fix it?

"This is neither second nor third normal form" - OK, how would you fix it?

If you're going to show problems, show solutions as well.

The rest is a big red flag list of falsehoods that programmers believe about names and addresses.


Although others have replied to you, I'll take your comment as constructive criticism. Though the tone could be better.


That's exactly what it does. When talking about each normal form, it gives an example and what to change.


> it gives an example and what to change

It gives an example of something wrong. Not once does it give an example of something right. It says "change this" but doesn't show what to change it to.


First normal form:

> Then the “areas” column has values that aren’t atomic. ... So this table is not in first normal form. ... To be in first normal form you should store a single value per field

Second normal form:

> In this design employee name can be directly inferred from employee_id because the idea is that the name of an employee is uniquely defined by it’s id. ... So we have two examples of columns where the columns can be inferred from only part of the primary key.

Don't include those fields. It's not explicitly said, but I think the author assumes that if you're told not to include data that can be inferred otherwise, and these fields can be inferred otherwise, then they cause a problem and the solution is to remove them. You aren't losing data; it's already been covered that it can be inferred.

Third normal form:

> For a table to be in third normal form then it must be in second normal form and there should not be a non prime attribute* that depends transitively from the primary key. ... In this table the department_number can be inferred from the employee_id, but the department_name can be inferred from the department_number!

Same as second normal form. These fields cause problems, you already have the data elsewhere, get rid of them.

Perhaps you expected an article that said which commands to run to alter the tables. It's pretty obvious that wasn't the goal of this article. That said, it had everything you asked for, as long as you read it all and paid attention.
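To make the implied fixes concrete, here is a rough sketch using the blog's employee/department example (column names are approximations of the post's, not quotes from it):

    -- 3NF: department_name lives only on the department, keyed by department_number.
    CREATE TABLE department (
        department_number bigint PRIMARY KEY,
        department_name   text NOT NULL
    );

    -- 2NF: employee_name depends on the whole key (employee_id), so it lives here and nowhere else.
    CREATE TABLE employee (
        employee_id       bigint PRIMARY KEY,
        employee_name     text NOT NULL,
        department_number bigint NOT NULL REFERENCES department(department_number)
    );

    -- 1NF: the multi-valued "areas" column becomes one row per value.
    CREATE TABLE employee_area (
        employee_id bigint NOT NULL REFERENCES employee(employee_id),
        area        text   NOT NULL,
        PRIMARY KEY (employee_id, area)
    );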


> To be in first normal form you should store a single value per field

Where? How? How instead should you store groups of things? Blog doesn't say.

> Don't include those fields. You aren't losing data, it's already been covered that it can be inferred.

If you just remove fields, then you absolutely are losing data. Your software can't magically infer it unless you put the values and the relationship somewhere else. So where should the data go instead? What should that look like? Blog doesn't say. Blog doesn't even actually say that you should remove those fields. That's just one possible course of action and you've now put words in blog's mouth.

> These fields cause problems, you already have the data elsewhere, get rid of them.

Again, just deleting fields is 100% losing data unless you put those values somewhere else. How should you handle that? Where should they go? Blog doesn't say.


> Where? How? How instead should you store groups of things? Blog doesn't say.

By normalizing. This is obviously a guide for what to do and not do, not a 101-level course on database design and normalization. Your first step should be to google database normalized forms, and even specifically second normal form or third normal form. I know from experience that Wikipedia has articles covering these. I've read portions of them.

> If you just remove fields, then you absolutely are losing data. _YOU_ might be able to infer it, but your software sure as hell can't unless you put the data somewhere else. So where should the data go instead? What should that look like? Blog doesn't say. Blog doesn't even actually say that you should remove those fields. That's just one possible course of action.

> Again, just deleting fields is 100% losing data unless you put those values somewhere else. Where should they go? Blog doesn't say

The point is that if you're working towards normalizing that data, then you have the data in a different location.

The blog entry also doesn't explain what SQL is, or what a primary key or foreign key is. When you encounter terms that are not defined, you should make sure you understand them. If you don't, then the information is likely not aimed at you. If you don't understand normalization, you aren't going to understand what's being recommended. This obviously isn't a guide on how to normalize perfectly, it's a recommendation to do certain things, some of them to do with normalization, and a justification for why.

Consider it the equivalent to telling someone that they should eat a balanced diet for their health and notes some problems too much of certain types of food can cause. It's still valid advice even if it doesn't come with an explanation of exactly what to eat, and references medical phenomena without explaining what they are. That doesn't make it a bad article, it just makes it not terribly useful for people that it isn't aimed at.


A good ER database textbook illustrates normal forms much better than a blog post.


I find that textbooks focus way too much on the strict CS definition of the various normal forms as opposed to practical examples.


That's a sweeping over-generalization. I said find a good textbook.


The poster said "I find that...", which is equivalent to "In my experience...". How is someone posting their experience the same as making a "sweeping over-generalization"? They never flat-out said "ALL textbooks".

Also, surely if it were not accurate you could easily drop the name of a good textbook without any trouble, since you apparently have enough experience with them to make the distinction.


If you're dealing with international addresses, point 3 becomes very challenging. The 'tokens' of an address are called different things everywhere, take different forms, and sometimes don't make much sense to compare. Figuring out a good balance of usability and generality can be really tricky.

Here's nearly 100 pages on the subject (from the perspective of addressing mail for USPS): http://www.columbia.edu/~fdc/postal/


I think programmers over complicate things too much when it comes to addresses. When you're dealing with a global system that is able to deliver a package to "the 3rd house on the right, past the pond, with the red fence," or whatever, in rural anywhere, you probably should just treat it like we all do in the real world, as just a string to be interpreted in context.


This fails as soon as you talk to an API that requires normalization, need to aggregate your data, or, really, do anything other than try to deliver it.


Lmao, you're really going to hate me when I tell you I don't think we should be using addresses for anything other than delivering physical things. I know it's a cop out answer. But I don't have a better one.


> Lmao, you're really going to hate me when I tell you I don't think we should be using addresses for anything other than delivering physical things

This is not always an option. For example, in some jurisdictions, you need to calculate, charge, and remit sales taxes based on client location, even if you are selling digital goods.


However, these locational jurisdiction details will be unique (i.e. a state, a country, a city, a trading bloc, a tax code etc).

These can be different from the address details, which are nothing more than a postal address to where the goods/letters need to be posted.

In other words, the former details should be explicitly provided in a fault-tolerant manner (i.e. drop-down lists) and not parsed from the details found in the address.


USA mailing addresses are strictly that -- an identifier for the convenience of the Post Office which may or may not correspond to the underlying jurisdictions for things like political boundaries, tax authorities, etc.

I learned this when setting up voter databases. A house that is in a given city/precinct/etc (the atomic identity is parcel number) may have a mailing address in a completely different city. There are many such cases for Los Gatos vs Saratoga, for example. Also, the USPS considers every address to be in some "city" even on unincorporated parcels.

Back when phone area codes and prefixes had precise geographical meanings, many edge cases ended up on the "wrong" area code (e.g. Sunnyvale numbers being in 415/650 instead of 408 between the 85 freeway and the Sunnyvale border).


The local taxes are rarely based on the street address and typically coincide with a larger municipal division.

The problem in that case is trying to use the “address” for two purposes which may not be aligned: getting goods to the buyer and finding their tax domicile.


  local taxes are rarely based on the street address
No, they're based on the underlying geographical boundaries tied to the parcel number. This usually corresponds to the city in the mailing address (in contiguously incorporated lots) but not always.


On the e-commerce platform I work on we need to send customer data, including addresses, to multiple vendor APIs. You usually can't design your db schema in isolation.


Some counter-examples include Route Planning or determining where to locate a Consolidating Freight Service. If your addresses are not normalized, you lose the ability to see that two addresses listed as "from city center, three lefts" and "from city center, one right" take you to neighboring locations.


These are not counter examples, they are examples of the types of problems that I think turn out very poorly when they rely on normalized addresses. What I'm saying is we shouldn't be using addresses for things like this. I realize I'm talking about, basically, upending several entire industries or something, so I know it's unreasonable, pragmatically.

I just think that the hard problem of 100% accurate address normalization suffers from an extremely fat tail[0] of edge case issues, and becomes economically unviable to solve, very quickly.

[0]: https://en.wikipedia.org/wiki/Fat-tailed_distribution


Except they don't turn out very poorly, they provide obvious value for a bunch of real-world companies doing it right now.

They don't provide 100% perfect results, but that's why we still make humans.


95% of the problem is having an up-to-date record of zip/postcodes. As long as you treat the postcode as a separate field, the rest of the address can be handled by the courier. Most couriers will go off the postcode to determine whether or not they can deliver a package and cost the delivery - even if the rest of the address is garbage.

IMO, a database schema for addresses should consist of: (Country, Postal code, Address). I would split Address into 4 strings and let the user fill out whatever they want in them, in whatever order. I'd also suggest that it isn't even necessary to validate this information on your input forms - the validation is best performed by contacting the courier with the information input by the user when requesting a delivery cost (and if this validation fails, email the user back). One suggestion would be to forbid commas in any address fields, because CSV formats are accepted by some couriers with varying degrees of support for quoted fields, and often end up requiring manual intervention.

Unless you are the courier, you should probably not waste your resources on attempting to normalize any other parts of the address.

This is from experience in a business which ships hundreds of international packages a day, and tens of thousands domestically (UK).
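
For what it's worth, a minimal sketch of the layout described above (Postgres flavour; table and column names are just my own illustration, and the comma rule could be enforced with simple check constraints):

    create table shipping_address (
        id           bigserial primary key,
        country_code char(2)      not null,  -- ISO 3166-1 alpha-2
        postal_code  varchar(16)  not null,  -- validated by asking the courier, not by us
        line_1       varchar(255) not null,
        line_2       varchar(255),
        line_3       varchar(255),
        line_4       varchar(255)
    );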


People usually solve that problem by learning their address according to Google Maps (the most widely used service). For most of us it's the same address we always use. For some, it's whatever they write when they want people or stuff to reach their home.


Why not just store GPS coordinates and do geospatial queries? That gives you nice features like "show me stuff within X kilometres of a shop" and "is customer X within bounding box Y?".
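
For example, with PostGIS and a geography column (table and column names here are made up), the "within X kilometres" query is roughly:

    -- customers within 5 km of shop 42; geography distances are in metres
    select c.id
    from customers c, shops s
    where s.id = 42
      and ST_DWithin(c.location, s.location, 5000);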


That might help to some extent depending on the use case, but the tectonic plates are moving all the time, and shifting a lot in some places where you have geological events like earthquakes. A set of GPS coordinates may no longer point to your current (moved) location accurately in that case.


you could have 500 people all at the same lat/lon, with different floors, no?


This is going to be quite random, but I recently saw a comment you made about deadlines (https://news.ycombinator.com/item?id=19361199) and am interested in some of your thoughts on what an ideal software development methodology looks like. If you care to talk about it, please throw me an email: smtheard@gmail.com


I don't think any single answer to a question that broad should ever really be trusted.

It totally depends on the size and scope of what you are trying to do, what you are building, who your customers are, and like 100 other questions.


I’ve been through this before: storing a couple of addresses for a few hundred different countries in an otherwise very normalized RDBMS. We couldn’t figure out any way to do it other than EAV tables and a defined attribute pattern for each country.
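
For anyone unfamiliar with the pattern, here's a rough sketch of what such an EAV layout can look like (names are mine, not the original project's):

    create table address (
        id      bigserial primary key,
        country char(2) not null
    );

    -- one row per (address, attribute); the set of expected
    -- attributes is defined per country elsewhere
    create table address_attribute (
        address_id bigint not null references address(id),
        attribute  text   not null,  -- e.g. 'postcode', 'prefecture', 'line_1'
        value      text   not null,
        primary key (address_id, attribute)
    );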

Even though that project is long in the past for me, I’m really curious how other people do this. Anyone come up with something better?


At a past job, we basically did "both" - we'd store our best guess at a normalized address, and we'd store the text representation. For stuff like shipping labels, we'd use the text. For analytics, we'd lean on our best-guess normalization (and understand that there's some potentially significant error).

You can also lean on various APIs to normalize, but nothing that I'm aware of does this well on a global scale.


At that point you might as well just make an "Addresses" or "Address" table with a raw text field for the address. At the end of the day this thing is something that a human will read and map to the real world location.


> At the end of the day this thing is something that a human will read and map to the real world location.

That is quite an assumption.

I personally would use a field with a structured format, e.g. XML (most RDBMSs can deal with XML easily), with different types of address field, e.g. postcode, zip code, etc.

Then you could use a strategy for each country when dealing with it in your application.


We went the EAV route instead of just a free-form text field because we still wanted to expose discrete fields in the UI and have some front-end validation. But sure, from a data integrity standpoint it’s no better than a text field.


> Even though that project is long in the past for me, I’m really curious how other people do this. Anyone come up with something better?

I've always gone the other way. Plain fields (lines 1-4, city, state, postcode, country) or, when I didn't need to validate it, a textarea for free-form entry.

Most of the time validation has been required by the shipping provider, so freeform hasn't been possible. As the shipping providers haven't got anything 'better', we stick to their format and field lengths (which are terrible for out-of-country addresses)


If you are mailing something, using a plain formatted address string is usually fine.

Here's a harder problem from a previous job:

Given two addresses, tell me if they both point to the same "place".

(As you might imagine, the goal of this was joining various datasets.)

There really is no good way to do this, but to come close, you might consider a few dimensions of the problem:

1. Formatted address - a plain string. What you would write on an envelope.

2. Tokens (what I call "address components") - the bits of information that make up an address. Something like, {"street_number": "10", "street": "downing st", "locality": "london", ...}. You can get these back from most location-focused APIs.

3. Locations. Coordinates of a point or area in space.

4. Time. "Places" change over time. For example, a new business can open in place of another. Buildings are built and demolished. If you are working with data over a long time period, this is more of a concern.

Unfortunately, there are many-to-many relationships between all of these. For any address, there are many possible locations. For any location, there are many possible addresses. For any set of address components, there are many possible formatted addresses. For any address, there may have been multiple different "places" found there over time. Et cetera.

Consider an office building, for an example. That building probably has at least one address corresponding to its main entrance, but it might have other entrances. An office within the building can sometimes use the building's main address as its own address, or the address of its own entrance, or an address with a floor number, or an address with a mailroom number. An office building may have a café (usually off of the lobby) that has its own addresses. Offices in the building can be split and merged over time. I have even seen multiple buildings merged into one. You might also have a co-working space with individual addresses for rooms within it.


Thanks, this is a fascinating resource!

My logic with the address field is to consider what the purpose of the address is in the application.

If it is used for someone to send you a letter in the mail (for example: university admission) then you are better off with a big box where the user can write the magic blob of text that usually gets their mail delivered to the right place, up to instructions to the local postman.

If you need to pull off some analytics, or need to estimate shipping, then just extract the fields that your shipping agent demands: country, zip code etc. Leave all else intact in the "magic blob".


I think city, postal code, state/province (optional), and country would be fairly standard. At least in North America, Japan, Australia, and the UK. Of course for country you would want to use standardized country codes, which are somewhat political.


I don't mind web pages (or whatever we are talking about) having lots of fields for an address, as long as I don't have to type something in all of them.

But it's super annoying when I have to make up things for fields that don't make any sense where I live.


Off-topic comment / feedback about the blog post:

The author describes himself as a "_Trusted_ Consultant" in the blog's subtitle.

The comic strips illustrating the article are cropped and do not mention permission from any identified license provider, which does not help build trust - at least in my case.

Are they used with authorization from their authors?


This person would disagree about point 4: https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-...


Author here! Ha, I'm a big follower of Patrick, especially his microconfs! I didn't know he had an article on the subject, thanks for linking it!

However I agree with him!

What happens is that in my experience I've yet to encounter a case where what I suggest in the blog doesn't work. No doubt there will be cases where it will not be enough. But reading Patrick's article, clearly the complexity of a system needed to handle all the cases would be huge! So I guess one has to compromise :)


Yes, compromise in that typically US-centric way, and to hell with any culture that doesn't fit the mold, they can adapt!

It's true, they can adapt! They proved that when they didn't have much choice, back in the ASCII days when you had to conform or stay in the pre-computer age. But maybe we should try to do better now that we can...


So, my identity records got mixed up due in part to the impedance mismatch between (first name, last name) and (first name, middle name, last name), resulting in my legal name suddenly changing until I could get the issue sorted out.

Two fields work in the sense that most people are going to have something to put in them, but what that is and what they mean isn't always going to match what you intend. Either the system is going to be tolerant of such things (in which case you can probably get away with some variant of (display_name[, external_name])), or you're using it to interface with some other identity provider/consumer that requires a specific format, and that's what should be asked for.

Names are personal and hard to get right.


Very, very few cases aren't handled by an arbitrary UTF-8 string of not more than, say, 1 KB.


In what way is this guide 'humble'? Seems fairly self-confidently opinionated to me.


These are really the basics of database design. I would argue that you should do these until you know when you shouldn't. That is, you should normalize to this level until you learn when and why to denormalize some things for performance, if that's ever even needed. I think it means "humble" as in not trying to teach anything complex or revolutionary.


I'd agree. The advice to use NULL is usually harmful, and constraints are not high-performance.

One nitpick is that "user" is a reserved word in PG, so "users" is better.

Other than that, I'd agree with most suggestions. But I'd use MySQL. :)

Source: DBA.


> The advice to use NULL is usually harmful

The advice is to set NOT NULL if you think the field can't be null. That's valid, and a matter of encoding the schema to correctly match the data. Same with check values. Keep bad data out of the database wherever you can.
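
Concretely, something along these lines (my own sketch, not lifted from the article):

    create table users (
        id         bigserial primary key,
        email      text not null check (email like '%@%'),  -- required, crude sanity check
        birth_date date,                                     -- genuinely optional, so nullable
        created_at timestamptz not null default now()
    );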

> and constraints are not high-performance

Then turn off constraint checking after you've built your schema and identified them as a performance problem. Don't optimize away your data integrity before you've even determined it's a problem.

If you actually don't care about your schema, just use a text field and throw whatever in it, or use a document store without any strict checking. If you're using SQL, presumably you're using it for a reason, and that's likely so you can take advantage of what it has to offer.


Hey thanks for clarifying it for the other user! I also agree with everything that you're saying, in this and other comments in the thread


The article actually advises setting fields to NOT NULL unless there's a specific reason to do otherwise.

And re constraints: eh, for most applications this feels like premature optimization. My usual approach is to include whatever constraints are easy to express, but also have the application not rely on the DB for enforcement. That way you get some extra safety checks that might catch bugs, but if they do become a bottleneck you can just drop them.

Not a DBA, just a software dev who's learned not to sacrifice safety for perf unless there's a clear, demonstrable need.


Please don't use databases that don't support transactional schema changes. You will hate everything about it.


Facebook, Google, and many other companies run massive MySQL deployments and are doing just fine.


They do, and obviously it is workable, but I think every big MySQL org ends up with a tool/workflow like gh-ost: https://github.com/github/gh-ost which has its own set of challenges, like not playing well with foreign keys.


Or the Percona tools, which support stuff like online schema change.


Big MySQL orgs are sharded, which greatly reduces the utility of foreign key constraints anyway.


Given unlimited VC money to burn, I could engineer a high-traffic SaaS running on a beowulf cluster of Juiceros.

Would that be a good idea? Maybe.


This is a silly argument. Facebook & Google have made massive investments into the source code of MySQL and have helped make massive operational improvements to it over the last decade. It's always glaringly obvious when people have no experience with something and then throw shade at it, like the obligatory Microsoft-bash whenever that topic comes up.


They also have massive engineering budgets.


> Never store firstname and lastname in the same field

Funny, where would you store the middle name then? And the second middle name? The married name in cultures where the spouse changes it or doubles it? Of course, in normal form you need a name table, with an order column, and so on and so forth.

Anyway, this article goes straight to the notions everyone can fetch in any book and skips the interesting and difficult part of database design, which is understanding your domain so that no future expansion is prevented by a wrong cardinality on a relation that was simplified from reality.

Because if it's a user database you might get away with just the name, so you can address people during support calls. If it's an accounting database, you might want to know someone's name at the time of invoicing, having multiple records with start and end validity times. And if it's an FBI database for protected identities, you might have to have multiple name aliases referring to a single person.


I'd store the middle name in the first_name column. For example, "Juan Martin" would go in the first_name column and "Del Potro" in the last_name. Apart from the middle/first name incongruence that you may highlight, I believe this design is not awfully bad and should serve in most situations.


If you used that to address someone in emails, it'd sound very awkward to include a middle name when someone doesn't usually go by it.


Many Americans have two word first names like “Sue Ellen” or “Mary Anna” as well as a middle name. Double last names are even more common.


Why on earth would you, and the article writer for that matter, use a column named first_name for storing the middle names as well?

just call it name!

As I said above, the article goes to great lengths about normal form, and skimps on the important part: the data model is a representation of the real world, and a first_name column with middle names inside is basically a straight-up lie about what the column models and what data it contains.

There's not one way to represent names; there are application-specific domains where generalities need to be more or less detailed. But the core aspect of it all is that the data model shouldn't lie about what it is that it's modelling.


Since everyone has to have an opinion and provide it, here's mine: maybe a retitle?

>A humble guide to transactional database schema design

There are a number of things that don't apply to analytical database design, and I'm seeing the two confused fairly often. Then again, this comes entirely from my point of view, so take it for what it's worth.


I work on an e-commerce platform that's live in about 20 countries. One of the hardest records to normalize correctly is the address, because not only do users have very different rules depending on their country (e.g. not all addresses in Turkey have post codes, GB has some very weird building names/numbers...), we also connect with vendor APIs that have their own ideas of how to model an address.

My personal opinion on this is something like address_line_1, address_line_2, city, region, post_code, with each field being varchar(max). Anyone else have an opinion on how to do this?


After spending a long time trying to figure this out, I came to the opposite conclusion from the article. Unless you're only operating in one or a small number of countries, trying to normalise addresses is too much of a moving target.

The schema I ended up with for my use case is: country (required), region (optional depending on country), address (rest of address text) and a lookup to get geographical coordinates for the address (with user confirmation to tell if it's somewhat accurate). Filtering/sorting on country and region then gets you pretty far and for the more complicated queries I found it took less time to use the coordinates with PostGIS than trying to match up bits of the address text.
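
Roughly, in Postgres/PostGIS terms (the exact column types are my assumption):

    create table address (
        id       bigserial primary key,
        country  char(2) not null,          -- required
        region   text,                      -- optional depending on country
        address  text not null,             -- the rest of the address as entered
        location geography(Point, 4326)     -- user-confirmed coordinates for PostGIS queries
    );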


Yeah, I guess it really depends on your use case and your overall system too. Our system connects to multiple vendor APIs, so having one field for the rest of the address text would be very difficult -- we'd have to try and automatically split it into constituent parts depending on the requirements of vendors. Some of our data then goes to credit check APIs too, so an incorrect address can result in an (incorrectly) lost order for us.

It's a challenging area all round.


I think that's the best way of going about it as far as cultural differences are concerned. Same thing with names, where you'll have firstname (name_1), lastname (name_2)


Recently I ran into a project where everything is stored in around 3k unique fields, with many levels of nesting, in a JSONB column with NO schema or documentation. I generated JSDoc from the data and I'm hoping to create some kind of schema.

You can use JS on the server. Just don't do it this way ...


“Beginning Database Design” by Clare Churcher is a really accessible book on these issues, very practical (starts off with simple spreadsheet tables!), and takes you step by step through understanding the problems with, and the need for, the different levels of normalization.


I’m just here with popcorn and for the comments.

Lucky for me, SQL just clicked, and when I wrote my first schema, third normal form is more or less how it came out, so I knew I was on to something.

I would say a study of how the internals of your database of choice work would help in schema design.


How I wish for nothing more than an alternative to SQL to communicate with databases, something that automatically infers coupling tables too. Set theory is nice. SQL... personally, hard pass.


What parts do you find difficult? I agree that the benefit to having something learn the schema and the relationships would be nice but I get that through something like DBeaver already.


That's one issue I see with NoSQL: data is not normalized, and consistency goes out of the window if the database is big enough and the volume of data is big enough. That might not matter if you work on a search engine, social network, or music suggestion app, but for most business apps and for most websites, having inconsistency is really bad.

On the other hand, having some data denormalized can improve performance as you can avoid expensive joins. It's for the designer to decide whether they need to trade consistency for performance, and in what measure. I'd avoid it; I'd prefer to add more servers, use load balancing, use caches.


In my programming life I've barely used SQL databases so far, and there is something I've never understood about these recommendations. Maybe someone here can explain that to me.

I thought databases are designed to organize the data optimally internally and execute all operations on them as efficiently as possible. Why do you then need to organize databases so stringently? Doesn't that mean that there is something wrong with how the db works?

To me the advice always seemed similar to bit twiddling instead of relying on the compiler to optimize. Shouldn't the database lay out the data optimally internally?


I'm not here to say the author is right. S/he is making a lot of assumptions, and as you can see in this thread, everyone is saying it depends.

But one fact is that reducing redundancy in a database is a GOOD thing.

To make a comparison, I'd say it's like when I have to maintain a codebase that is large, and I have to pass a constant (let's say, ERR_STATUS = 1) around in some functions in various folders.

I wouldn't want to have to define it in every single file. One would want that in a header, in a library or in a file that defines static values somewhere. Anywhere but only one location for it.

A database is a bit like a dumping ground. Once the business gets big enough, a lot of other applications are going to plunge in and do things with the data. I wouldn't want to have inconsistent data. As a programmer, if one had inconsistent things in one's source code, one could deal with it. But data is different; data is produced by users and interactions and so on. Once you begin to have inconsistent data, you can't trust it anymore until you find out exactly what did it. A lot of wasted time. And that leads to a lot of problems.

But I agree with you about your point on performance. However, by organising data correctly, having indexes on things that matter, and joining in the right way, one is using business knowledge and programming knowledge to optimise the database. And that's not something that can be done 100% automatically (as of now).


The schema establishes the contract between you the user and the database. You have to help the database so that it can help you.

The information you are giving it is mostly things that it can't reliably know or decide on your behalf (such as whether you want to store two things together as one, like first name and last name).

The database still does a ton of stuff for you that you don't have to think about. For example, SQL itself by definition is like that. You declare what you want to take out and the database figures out how to do that for you (this part is similar to the compiler optimisation of your example).

Going by your example, not having a schema is similar to saying "why do I have to write code, can't the compiler just write the code for me?".

In both cases, on one hand yes the compiler or the database does the magic for you, but at the same time it can't work without your input and decisions (writing code, or the schema).


The DB will work just fine regardless of how you organize your data; some queries might just not be possible or efficient. A well-designed database will easily serve many different types of applications at the same time.

The power in databases lies in their ability to easily transform or restructure the same source data in multiple kinds of different ways. The catch is that it has to be arranged in a specific way (normalization) for this to happen. Normalization is actually based on math and set theory.


My humble opinion is that poor choices in database design are the source of a great deal of technical debt.

I used to avoid making schema changes as they were "scary" (until I got a decent understanding of Django Migrations). Just a quick hack to make up for it in the application layer was far easier. Then another, and another until you have real mess on your hands.

And avoid nulls when you can. Unfortunately, most of the time when I've done a new database design, there is existing messy data in Excel sheets and you can't avoid it.


We have so many tables with both is_active and active_col in them...


I've got 2 things to say:

1) There is no "good design". It is always a compromise, trying to satisfy various practical constraints. OLAP vs OLTP probably rings a bell here. And this last/first name thing is laughable. The complexity of handling all possible human names can be insane. Depending on the goal, the proper design can be anything from a single field up to some relational structure.

2) The article itself looks like let's just write something up to score some points.


> 4) Never store firstname and lastname in the same field

I have the opposite opinion: unless you have a specific reason why you need to know one's "first name" and "last name" (e.g. to interface with some other system that requires names to be split up like that), just use a single field and stop caring about whether or not the name conforms to your preconceived notions of format and structure.


(fwiw, I like the article, and I find it suitably humble)

Just want to hang a thought here ... depending on the precise definition of First Normal Form, it is either impossible to avoid[0] or impossible to enforce[1] without further constraints.

[0] if definition is "must be rectangular", i.e. all columns have the same number of rows and all rows have the same number of columns.

[1] if the definition includes "do not allow duplicate rows"


The Data Model Resource Book [1] is a good resource on how to design databases. It is a kind of reference book.

[1] https://www.wiley.com/en-us/The+Data+Model+Resource+Book%2C+...


This article is perfect if you're just getting started.

There's lore around design tricks to get the most out of a database, but it's often implementation or domain specific.

You'll learn what rules to break, and why, as you solve real life production problems.

FWIW - you can go a long, long, long way before de-normalising is actually necessary.


I wish there were a DBMS which satisfied all aspects of the CAP theorem and also provided high performance.

Until then, we are stuck with RDBMS and NoSql as lesser evils based on the use cases.


NoSQL doesn't magically solve the CAP theorem; in my experience it actually makes it worse.


Make the primary key the thing you query by. Treat a sql database as similar to a nosql database (no joins) and your performance will be amazing.
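
In other words, key-value-style access, something like this sketch:

    -- the composite primary key is exactly what every lookup uses
    create table user_settings (
        user_id bigint not null,
        setting text   not null,
        value   text,
        primary key (user_id, setting)
    );

    select value from user_settings where user_id = 42 and setting = 'theme';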


RE: not storing address in a single field, splitting an address into semantic fields feels like a premature optimisation. You’re forced to anticipate all the fields that you may want to aggregate by, before you may actually even use them.

Instead, you could just create a custom function that takes in a string and tries to identify the city or street number or whatever element you want to extract from the address, e.g.

    select * from users where extract_city(address) = 'london';
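
A naive sketch of such a function in Postgres (extract_city and its parsing rule are purely illustrative; real address parsing is far messier):

    -- assume the city is the second-to-last comma-separated token
    create function extract_city(address text) returns text as $$
        select lower(trim(parts[array_length(parts, 1) - 1]))
        from (select string_to_array(address, ',') as parts) t;
    $$ language sql immutable;

    -- being immutable, it can also back an expression index so the filter stays fast
    create index users_city_idx on users (extract_city(address));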


Use the array data type for simple collections if using Postgres because it saves a join.
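
For example (a sketch: tags stored on the post instead of a separate post_tags join table):

    create table posts (
        id   bigserial primary key,
        tags text[] not null default '{}'
    );

    -- containment query; can be indexed with GIN on larger tables
    select id from posts where tags @> array['postgres'];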


A piece of advice I think is worth adding to the list: use indexing.


This is a DDD aggregates design guide, not a database schema one.


Use third normal form - until your system gets 10 years old and linking your user's first name to their address requires joins on 10 tables with millions of rows.


Firstname lastname is too fragile and oversimplified. Suffix, prefix and a variable number of middle names/initials can also be necessary. Occasionally, legal name vs. nickname is also needed. But if there's no pressing need for grouping or sorting by surname, why piecemeal any of it out?



