A humble guide to database schema design (mikealche.com)
414 points by helloiloveyou on April 7, 2020 | 245 comments



Most of this advice is stuff they tell you in school that doesn’t matter at all in reality.

Real advice:

- Normalize until it hurts, denormalize until it works. (I.e. prefer normal forms but denormal will have better performance.)

- You need nullable dates and nullable numbers. You never need a nullable string. If you must distinguish blank string from unset, add a column called “unset”. Using null for this will just result in bad data entry.

- Names should be stored as display_name, first_name, and last_name. A display name is not the same as just first plus last. They have different purposes, so don’t shove them together.

- When in doubt, just have a column full of JSON and figure it out later.

- Offset pagination is bad for performance. Avoid it at all costs unless you want some spider indexing page 10,432 of your archives to eat all your CPU.


Fuck me, are you trolling?

Guy with actual DB experience here, so:

> Normalize until it hurts, denormalize until it works. (I.e. prefer normal forms but denormal will have better performance.)

No it won't, it depends entirely on the situation. Denormalised data (besides allowing update discrepancies) can expand the amount of data, thus pushing otherwise memory-resident working sets onto disk. Big slowdown. You are wrong. The correct answer is "it depends" - and denormalising really is a last resort that has to be done judiciously and with care.

> You need nullable dates and nullable numbers. You never need a nullable string.

Utter unjustified garbage. The correct answer (from a practical standpoint, not a super-normalised-form one) is: "it depends"

> When in doubt, just have a column full of JSON and figure it out later.

OMG please never let me come near any fool who does this. I was called in to consult on a database problem and they did the same but with XML. It was really bad, and bad for performance and comprehension.

Please nobody listen to such 'advice'.


>I was called in to consult on a database problem and they did the same but with XML. It was really bad, and bad for performance and comprehension.

I can see XML being really bad for performance, but given that JSON column data is handled like native data in modern databases (and you can index/query on fields in the JSON, which pretty much negates the performance impact), using a json/jsonb column is actually a decent piece of advice.
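For anyone unfamiliar, here's a minimal Postgres sketch of what that looks like (table and column names are invented for illustration):

    CREATE TABLE events (
        id      bigserial PRIMARY KEY,
        payload jsonb NOT NULL
    );

    -- GIN index over the whole document supports containment queries (@>):
    CREATE INDEX events_payload_gin ON events USING gin (payload);

    -- Or index a single extracted field with an ordinary expression index:
    CREATE INDEX events_payload_user_idx ON events ((payload ->> 'user_id'));

    SELECT * FROM events WHERE payload @> '{"type": "signup"}';
    SELECT * FROM events WHERE payload ->> 'user_id' = '42';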


You are spot on in picking me up on this, thanks. It needs clarification. xml/json is neither right nor wrong, it depends. Performance, well, it depends. What is unquestionably wrong is the poster's advice "When in doubt, just have a column full of JSON and figure it out later".

Pouring structurable data into an unstructured heap is just laziness and will hurt you later. I've seen the cost of deciding to 'leave it for later' - it's high! I've been involved in cleaning it up. It's horrible.

Years ago I picked up a book written in the 1960s (so pre-IT) for secretaries, and it had a beautiful line in it: "the filing cabinet is a place where you find things, not lose them". Still applies. Always will.


> "the filing cabinet is a place where you find things, not lose them"

"Those who cannot remember the past are condemned to repeat it." - Santayana

“Explanations exist; they have existed for all time; there is always a well-known solution to every human problem—neat, plausible, and wrong.” - Mencken

Software project management is full of thin and runny rationalizations...I (HHOS) include this form under the heading of "Data Abnormalization"...


I've had a lot of experience. You can learn from it or not. If not, feel free to prove Santayana right again, and at great personal cost :) In fact, I encourage you to!


XML was there, being handled natively in databases, long before they even looked at JSON.


Is there a reason then not to use a single column with JSON? The columns would then be "virtual" (the column jsondata would spawn a column jsondata.firstname etc.)

This question is out of pure curiosity; I do not deal with databases, so I hope I have understood your comment correctly.


Of course the answer to everything is always “it depends”. The point is to give generalizations that point people in the right direction to start.


It must be said that the JSON performance of Postgres is good.


I believe the right answer is somewhere in the middle. I know a couple of successful products where the DB schema is completely abused. Yet they have been a commercial success for over 8 years. A good DB specialist also knows the implications of a 'bad' design choice. It's all about the impact at a later moment. Just like software design in general. Finding the balance between complexity and simplicity. Between slow and elegant versus fast and ugly. Balance is key


I did say clearly it depends. And that's because it does.

> Between slow and elegant versus fast and ugly

No! Unqualified, this is a false dichotomy! Because often you can have elegant and fast, and that's because elegance of design often gives you that speed. Not universally true, but much more often than not.


>>I did say clearly it depends. And that's because it does.

You've arrived a tad hot to the discussion to now start to backtrack on how it depends.

After all, if it depends for your defense then shouldn't it also depend for the OP's point? Because if you know that it depends but still start to accuse others of trolling, you are not being helpful nor honest.


Wasting your design time allotment in rationalizing futile compromise is the flip side of seeing everything as a dichotomy. The reality modeled by human discourse may be only consensual, not actual...


> - Names should be stored as display_name, first_name, and last_name.

No.

“[…] refer people to this post the next time they suggest a genius idea like a database table with a first_name and last_name column.”

https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-...

You could, conceivably, have a "display_name" column, an "informal_name" column, a "formal_name" column, etc. depending on your needs. But for the vast majority of people and purposes, a single "name" column will suffice, and will steer programmers away from making unfounded assumptions about people’s names.


But then you don't have my informal name. When I receive an informal update email, it would be clunky to have it begin "Dear Max Barraclough".

You can't automatically deduce that the way to address me informally is Max, without making assumptions that will break outside the English speaking world. See Points 18, 24, and 30.

The W3C suggest that one way to handle names is to have a Full name field and a What should we call you? field.

https://www.w3.org/International/questions/qa-personal-names


If the database owner is not billing you, he probably doesn't need your formal name, so he can have just one column, and call it "name".

Otherwise, yes, he'll need the 2 columns.


You'll regret this decision as soon as you have to integrate with another system that uses separate first and last names. Or as soon as you want to build a sorted list of five thousand names or output a bibliography in a specific format or anything.

Technically the right approach to the problem of storing international names would be to store markup: a string with an associated structure and a discriminant:

    <name lang="en">
      <first>Kenneth</first>
      <middle>Elton</middle>
      <last>Kesey</last>
    </name>
    <name lang="ru">
      <given>Fyodor</given>
      <patronymic>Mikhailovich</patronymic>
      <surname>Dostoyevsky</surname>
    </name>
    <name lang="es">
      <given>Gabriel</given>
      <family-paternal>García</family-paternal>
      <family-maternal>Márquez</family-maternal>
    </name>
You can then have precise automated ways to transform this into whatever derived form you need. Using a single name field is just avoiding the problem. Using multiple fields is a maintenance burden, because in most cases fields are not fully independent and you have to update them all to keep the data coherent.


Using markup does not really change anything, that is just a way of serializing a data structure. You can represent the same information in normalized form.


Of course, I don't mean storing literal XML, but I think it's important to support generic markup in databases. It's not that hard, after all; something like these five tables:

    Markup (markupId, string)
    Ns (nsId, uri)
    QName (qNameId, nsId, localName)
    Elt (markupId, eltId, qNameId, 
      start, length /* string indices */, 
      level, index /* or other way to represent hierarchy */ )
    Attr (eltId, qNameId, value)
would be enough to store and fully index any XML out there.

I think markup is an important separate data type. We have records, we have scalars, now we have a hybrid type, markup: a string (or maybe a byte/bit stream) + records linked to indexes in this string/stream and maybe forming a hierarchy.


A database should store the actual data, not some serialized format. You don't store the comma tokens from a CSV file in the database either - you store data which can be serialized to CSV. The same information could be serialized as CSV, XML or JSON depending on the context.

Storing a serialized format as an opaque string or blob might make sense in particular cases.


A) the proposal is crazy. You probably don’t need to know if someone has a patronym or a kunya or whatever. Base what you store on what you need.

B) Storing this as normal data would be an intense pain in the ass to use, for no gain in performance or correctness. Use JSONB, this is what it’s for.


If you actually need to know such details then it should be stored in normalized form.

Say you have a visa database which must be compatible with multiple naming systems from different countries and where no information can be discarded for security or legal reasons. Then you store the data you need in normalized form. Why would you use weird hacks for essential information?

And if you don't need all that detail - then you just save the name as a string and discard the metadata.

I can't see it would ever make sense to store a name in JSON format inside a relational database. Either the information is important or it isn't.


A) In many cases you don't, but if you store data about authors in a bibliographic database then you probably do. And it's just a part of the package: for example, Donald Knuth went as far as to print people names in their native language in addition to English in his books' indices :)


This comment is culturally insensitive. You are implicitly assuming that your culture's way of expressing names is all that matters


Matters for what? A database is a tool to solve a problem, not a timeless repository of deep cultural meaning.

If I make a webapp that is only localized in English, display, first, last will let me solve my common problems: I need to say "My account (John Smith)" in the corner (display name), and I want to be able to write emails that say "John, we know you're wondering how the company that sold you a pair of pants three years ago is feeling about COVID-19…" (first name) and list articles by author sorted by last name. Those are all common use cases for apps used by English speakers.

Now, if I were making an app to do voting in Myanmar, I would need to deal with there not being last names for many Burmese. If I was trying to track Arab speaking terrorists around the world, I'd want a long list of their aliases and kunya in multiple romanizations (I think the CIA used to prefer "UBL" for Osama bin Laden because they called him "Usama"). If I was making a library app, I might want to have the English romanized name plus the Unicode native language name. Tons of possible problem spaces with different solutions.

I just think tagging the different name parts for "Gabriel José de la Concordia García Márquez" is going to be overkill for most English language apps, because when are you going to look someone up by maternal family name versus just doing a full-text search for "Garcia Marquez"?


Missing the point. If you just have a single "name" field then you can accommodate any culture perfectly.


This fits the 75% use case, but misses things that do come up, like alphabetizing by last name and emails with casual forms of address.


I'm not arguing for storing a serialized format; instead I suggest parsing it and ending up with something like this:

    NAME ( 1, "Fyodor Mikhailovich Dostoyevsky" )
    TAG ( FIRSTNAME, 1, 0, 5 ) 
    TAG ( LASTNAME, 1, 20, 30 ) 
    TAG ( PATRONYMIC, 1, 7, 18 )
I'm not saying it's efficient (UPD: speedy), but it's sufficient to get the required details and can be mechanically transformed into a more efficient form. UPD: And it's very flexible and easy to extend. I believe it can handle just about any kind of naming from "Falsehoods about names".


But why include all the cruft in the first place? You could just save this information (in normalized form):

  FIRSTNAME: Fyodor
  LASTNAME: Dostoyevsky
  PATRONYMIC: Mikhailovich


This one lacks ordering. It's not a problem for Russian names as the rules are fixed, but may be a problem for other ones.

And how would it scale? E.g. John Ronald Reuel Tolkien and George Herbert Walker Bush have two middle names, while the full name of García Márquez is Gabriel José de la Concordia García Márquez. Markup handles this uniformly, creating only as many metadata records as necessary, but a field-based model will need fields for all of this that will stay empty most of the time, and likely some meta fields to set additional quirky flags.


Regardless of how complex the naming schemes you need to support are, you just make the data more complex and harder to query by storing tag names and character indexes and whatnot.


Please don't do that. I hope you haven't implemented such a set of tables in any database.


Not yet :) But I may just for the sake of it.


That's a good idea. You'll learn firsthand why it's a bad idea.


Isn’t that what RDF is all about? Each field being a named property of a resource, the properties themselves being properties of other resources describing the structure of the first properties. And so on, going on as meta as necessary


There is no need for an arbitrary "markup" that you then restrict based on your use-case (in your case it's some sort of internationally-aware storage of different data).

Databases are designed to store data, and storing markup like your example above is basically a misuse on some level. Your example above has a clear structure, and that structure can be encoded as raw data in database tables. So instead of having a generic set of tables (as per your other comment) to "store markup" data, you would rather come up with tables that represent your actual requirement and what that markup is meant to indicate.

Some database also have inheritance models which allow you to store the data above quite trivially. You basically have a "base" and then extend it with specific fields for each type (en, ru, es, etc).

So please, don't put markup in your database like that. It's a big no-no, and I'm sure your DBAs would tell you that very quickly. Else, just throw that crap into a document store and make it someone else's problem, as is unfortunately all too common lately.


What you want is not markup, but tagged unions. It's a real shame SQL and many other languages don't support them.

Also, your example about Spanish names doesn't take into account that nowadays parents can choose the order of family names, and that gay marriage has been legal for 15 years. Which reflects a deeper truth: don't make too strong assumptions about reality, for it may change.


Tagged unions, if I understand what they are, can be done in SQL:

    create table Foo(
      fooId integer primary key autoincrement,
      type integer /* tag/discriminant */ );
    create table FooA(
      fooId integer primary key references Foo(fooId)
      /* , data specific to FooA */ );
    create table FooB(
      fooId integer primary key references Foo(fooId)
      /* , data specific to FooB */ );
They're not quite markup though. Markup has a string/stream with associated records. The number of these records, their kind, and their placement varies, so it can handle ordering issues, multiple similar names, or a missing name (e.g. a missing middle name), or ignore parts of a name quite naturally.


That could be an implementation of tagged unions. Without language level support (pattern matching, exhaustiveness checking, etc.) you gain very little from this.


Yes, you can simulate them, but it's painful.


It's not a simulation, it's a data model for reducing the pain of what's really a hard problem - the names we call people are not unique, they are colloquial identifiers (HHOS).


Joined tables on the same primary key are a simulation of tagged unions. Proper tagged unions would solve this more concisely and with more clarity.


I'd really like to see an RDBMS with built-in support for both inheritance and tagged unions. That would solve 90% of my schema woes.


Hum... Who exactly is entering that data into the database? And are they ok with entering things with that level of detail?


Your American is showing? It's just a standard name input form but internationalized.


Hum... I'm not from the US and do not live there if that's what you are asking.

People usually have enough trouble separating the first name from whatever you call the rest of it and deciding what part goes where, and you expect them to be able to tell what the cultural roots of the name are?

Keep in mind that not all names in databases are filled in by their owners. Besides, I really doubt I would be able to enter my name into whatever you are designing without some amount of lying.


I spent 13 years maintaining a large-ish system that separated first- and last names for customers.

Separating makes searching/indexing on last name easier.

The alternative would be to put last names first by convention, which renders them unusable for anything else; or go insane trying to find a universal way of splitting and recombining for display.

Keeping them separate inevitably results in a convention to put company names in the last name field for corporations, and failed searches when the convention is broken.

It's one of those questions that I suspect will keep bugging me until I die.


> Separating makes searching/indexing on last name easier.

What do you mean "last name"?

If you have two last names, then what?

If you have no last name, then what?

> Keeping them separate inevitably results in a convention to put company names in the last name field for corporations

If you only have a single field for name, this problem disappears.


I mean everything but the first name.

Yes, but then you have to deal with the indexing/search problem instead.


Yeah, but if you only have one last name field, you can only index on people's first last name, not their second last name? How are you going to solve that?


> Separating makes searching/indexing on last name easier.

The point is that such a search or index is a mistake. It's simply not meaningful based on the real properties of names, it just happens to work 99% of the time if you're only dealing with a mostly homogenous culture.


Which is good enough in plenty of situations, compared to the "perfect" alternative that doesn't work at all.


> Keeping them separate inevitably results in a convention to put company names in the last name field for corporations

In most (IMO sane) designs, a company name would usually be a separate field (or better yet: companies would be represented in an entirely different table).


But in many cases the customer is either an individual or a corporation. Keeping them in separate tables turns simple tasks such as generating a list of customers into joins. Which brings us back to normalization: too much of it kills performance.


I feel like if a join is killing performance, either it's a really weird join (e.g. joining on an OR) or you ain't indexing.

Regardless, you could always just assume everyone's a "corporation" (and for individual customers, that "corporation" would just be a reflection of the individual customer's data). That, or only deal with contacts within a "corporation" (and an individual customer would just be a contact with a null "corporation"). I've seen both approaches in practice, and while they have their tradeoffs, they work reasonably well.

"Corporation" is a pretty wonky term for this, anyway (not all non-individual customers are corporations); "organization" would be better. But I digress.


Also, for international users browsing in English, labeling the fields "given name" and "family name" is far clearer than using first and last names.


Exactly what I was thinking - plus a field to state the order they should be displayed in, seeing as Asian cultures usually put the family name first.


Shouldn't that be a localization/translation issue?

My experience living in Japan is that they will write family name first if using Japanese, but given name first if using a European language.


Also, in some cultures like in Spain the legal surname is typically the first family name, which might be the first of two or three family names.


What do you mean "the legal surname"? I'm Spanish and my legal name is my full name with both family names. You can't just ignore one. Also, Spanish people have two family names, but they can be composite, so they may look as three or four family names.

However, most people are usually known by one family name, but which one is used depends on which one is more easily recognizable.


> What do you mean "the legal surname"? I'm Spanish and my legal name is my full name with both family names.

I'm referring to the primer apellido. Even government forms require that people break down their surnames into the first surname and all trailing surnames.


From the article:

> This list is by no means exhaustive. If you need examples of real names which disprove any of the above commonly held misconceptions, I will happily introduce you to several.

...

> 40. People have names.


Newborns have no known names until the birth certificate is filled out. There is still data about their medical care that needs to be stored in a database.


Prince, famously, didn't have what we'd recognise as a name for years. All the "Unknown Soldier"s don't have names. "John/Jane Doe" doesn't technically have a name.


These examples seem somewhat contrived, but of course it comes down to the use case of the database. I'm sure Prince remained registered under his legal name on contracts, checks etc., even if he used the symbol in marketing. But a database of bodies in a morgue obviously needs to be able to support unidentified bodies.


Interesting, I think you're right. According to Wikipedia, he only changed his stage name, not his legal name. The BBC didn't seem to make the distinction.

https://en.wikipedia.org/wiki/Prince_(musician) , https://www.bbc.co.uk/news/magazine-36107590


Some oauth systems will give you a user id or an email but not permission to see a name, for example.


How do you store addresses? There’s a use case for breaking things up.


If I had a dime for address forms that have "state" as a mandatory field, while having a country dropdown that contains countries that do not have states, I'd be rich, RICH I TELL YOU!


> Names should be stored as display_name, first_name, and last_name. A display name is not the same as just first plus last. They have different purposes, so don’t shove them together.

It all comes down to the purpose of the database. There is no one-size-fits-all solution. For a hospital which needs to register newborns, you can't even count on everyone having a name. For a library database, you need both the first name and last name of authors, since books are sorted by the author's last name - but having a "display_name" would be redundant and violate normalization.

If you can get away with having a single "name" field, this will save you lots of trouble. Only add additional fields if you need them for some use case.


> “ Offset pagination is bad for performance. Avoid it at all costs unless you want some spider indexing page 10,432 of your archives to eat all your CPU.”

I love this! It’s so true; only bots dive that deep, and it’s usually a total waste. But what do you recommend instead? Should you just deny any calls past page 10? Or do away with pagination altogether?



Thanks! This is really useful! I didn't know that limit offset scanned all rows before the offset.


If there's somehow no need to show anything past page 10, then by all means, just do that. But most applications can't get away with that.

There are plenty of resources for pagination online.

For example, your [Next Page] button could use a `where date > {date of last item on prev page}` since you have an index on date. e.g. <a href="/things?after={prevDate}">Next</a>.

A forum topic could have a consecutive and contiguous "index" on each of its posts that starts at 0 and increments. Now going to page 1000 is simply a matter of grabbing posts with index between 990 and 1000, which is instant.


Don’t use offset for paging. Presumably you’re sorting the results by something, add a condition on that value in the where clause based on the last row in the previous page.


Super fun until you have a column with a number of duplicates greater than the page size. Now you'll have to add a second column to your order clause, your serialized cursor and your where clause. Oh, and don't forget to check whether your database driver uses the same date precision as your database does, because you might be in for another surprise.
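To make that concrete, a rough sketch of keyset pagination with a unique tie-breaker column (names are illustrative; Postgres and recent MySQL accept the row-value comparison, elsewhere you spell it out as two conditions):

    -- First page
    SELECT id, created_at, title
    FROM posts
    ORDER BY created_at DESC, id DESC
    LIMIT 20;

    -- Next page: carry the (created_at, id) of the last row shown in the cursor.
    -- The id tie-breaker keeps duplicate created_at values from breaking paging.
    SELECT id, created_at, title
    FROM posts
    WHERE (created_at, id) < (:last_created_at, :last_id)
    ORDER BY created_at DESC, id DESC
    LIMIT 20;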


> denormal will have better performance

No it won't! In general it will have worse performance. Deliberate denormalization can be used to optimize for particular queries, but it will make other operations more expensive and/or error-prone. So it has to be done very carefully and with a full understanding of cost vs benefit.

And many database systems support materialized queries which give the same optimization without having to compromise the schema.


Have you ever used a complex database in a formal, by-the-book, third normal form?

The number of tables explodes, the number of joins becomes a pain and every single data change requires careful planning.

As for the "many database systems support materialized queries", I only know of a few that truly do: Oracle and DB2.

PostgreSQL supports them, but you need to handle updating them yourself. SQL Server kind of supports them, but they are limited.

In other RDBMSs you can work around that using triggers (or similar concepts), but that IMHO is definitely a bad practice, and the times I've used it (usually because there was no other option) have only brought me grief.
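For reference, the Postgres flavour looks roughly like this (view and table names are invented); the REFRESH is exactly the part you have to schedule yourself, e.g. from cron or after the relevant writes:

    CREATE MATERIALIZED VIEW order_totals AS
    SELECT customer_id, sum(amount) AS total
    FROM orders
    GROUP BY customer_id;

    -- A unique index is required for CONCURRENTLY, which avoids blocking readers:
    CREATE UNIQUE INDEX ON order_totals (customer_id);

    REFRESH MATERIALIZED VIEW CONCURRENTLY order_totals;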


Denormalization means you (by definition) have duplicate and redundant data in the tables. This certainly requires careful planning around updates to avoid data corruption.

> The number of tables explodes

This is IMHO a backwards way to understand normalization. Normalization tells you how many "logical relations" you have in the data - whether you like it or not. If you decide to squash multiple logical relations into the same physical table then you get redundant data (by definition). This might be a fine optimization, but it doesn't reduce the inherent complexity of the data model. It just saves some joins at the cost of more complex update logic.


I think there's a decent amount of hyperbole, but even taken at face value: Every single data change should require careful planning. Especially if it's a schema change, and not just a query.

You can do a lot with generated code, refactoring, etc. Data is much less agile, so I'm a lot more conservative when it comes to that end. Never mind that with decent naming, table comments etc. it's good documentation in itself.


> No it wont! In general it will have worse performance.

Since the advice is to only denormalize to improve performance, this would be a contradiction. Yes it makes everything more error-prone, but if you're profiling to check performance, then you only keep the denormalizations that improve the performance you care about.


Yep, exactly, materialised views are the ideal way to create query specific schema - but there's always a cost, namely, refreshing them needs to be done thoughtfully.


Materialized views are great, but it is just a denormalized form with in-database updating instructions.


I prefer to think of it as a cached query result. Normalization is not compromised since there is no way to introduce inconsistencies.


If you do a non-atomic update like “add 10 to transactions total” where total is a materialized view, it could bite you, but yes, it is less likely.


> You never need a nullable string. If you must distinguish blank string from unset, add a column called “unset”. Using null for this will just result in bad data entry.

What's an example of a data entry error that results from this scenario? Wouldn't it entirely depend on the data entry interface?


Someone is filling out the form. They skip MI. Do I mark it as null or blank? The only way to trust this is to have a big checkbox that says “does not have a middle initial” and stores a boolean. But realistically, who cares? Unless you’re dealing with legal names, you don’t need to distinguish the cases.
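As a sketch of what that amounts to in a schema (column names are hypothetical): the string is never NULL, and the explicit "has none" case gets its own boolean driven by that checkbox.

    CREATE TABLE person (
        person_id         bigserial PRIMARY KEY,
        middle_initial    text    NOT NULL DEFAULT '',   -- '' simply means "not provided"
        no_middle_initial boolean NOT NULL DEFAULT false  -- true only if the checkbox was ticked
    );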


OK, but why does that require the database to have non-nullable string types?


No reason per se, but it’s inconvenient and I’ve seen tons of bugs from “NoneType has no method ‘uppercase()’” that this eliminates.


Sorry, I can't get behind changing the database design to accommodate under-specified data contracts and/or lazy software development.


I can't get behind letting theoretical database design concerns trump real world problems with unexpected nulls leading to 500s in production. :-)


Your main database should always be 100% normalized. You can add a separate database for lookups later that can be as denormalized as you want.


That assumes the people who read it have actually been to school; there are a LOT who never learned anything about database design.

Your other points make a lot of assumptions and there are a lot of footguns in them, but other commenters have pointed those out already.


> Never store firstname and lastname in the same field

I disagree. Why would you want to force the "firstname lastname" pattern on people? What if someone has three names? Or only one? Just use a "name" field!


Scandinavian here, our names don't really fit this pattern very well either.

My last name has an "ö" in it. That doesn't work in 99% of so-called global websites, so I have to transcribe it as "oe" instead, which is just wrong.

My husband has three first names and one last name. However, his actual name is his second first name. If his full name is "A B C D" and you ask him what his name is, he'd say "B D". If you know him you can call him "B"; if you don't, you can call him "Mr D" or "Mr B D". Most American companies call him "Mr A" or "Mr A D", which is just wrong.

I have a friend with two first names and two last names, but his second first name is his name, and he prefers his first last name. If his full name is "A B C D", he'd say his name is "B C" if you ask him. Calling him "Mr A" is just wrong, and calling him "Mr A D" is super duper wrong.

(In government systems in our home country, this is of course handled. You can "mark" which first name(s) are preferred, so mail to my husband from either government or private companies will always be correctly addressed to "B D" and nothing else.)


Some people have nicknames and prefer to be called that. A well-known singer preferred to use a symbol instead of a name.

People can have thousands of preferences and an app can't account for all of them, but it can cover the majority of use cases.


My examples aren't about nicknames, they're about the difference between someone's full name, and their name.

And even in American name culture it's not obvious how to transform someone's legal name to their name. How would you encode that the name of "William Henry Gates III" is actually "Bill Gates" ?

"William" - nope.

"Mr III" - no no no.

"Mr William H. G. III" - nooooooooooo.


That was my point: trying to account for every preference a user might have or every particular scenario is too complex and takes too much time and too many human resources. The right thing is to address the majority of your user base.

To take your example, why should my app encode that "William Henry Gates III" is actually "Bill Gates"? I will provide a form with first name and last name and Mr Gates can register as "Bill" "Gates" or "William Henry" "Gates".

I have three given names and I find it a non-issue completing forms. And that's because a string can contain white spaces or even be empty, so you can encode any arbitrary combination of first name + last name.

Also, people get too much into politics, trying to be "revolutionary" instead of trying to solve the technical issue at hand.

I can solve a technical problem in 2 minutes or get into political philosophy and argue for 2 years without solving anything and without helping anyone. I'd rather solve the problem.


Don't dismiss nicknames. In Germany, e.g., you can put an artist name on your ID card and your passport.


This isn't about nicknames, it's about cultural differences.


Absolutely! Having 'name_given', 'name_middle', 'name_family' or some nonsense like this is one of the most common mistakes in database (and application) design, usually made by culturally narrow-minded Western developers.

Take names in Myanmar [1] as an example. From Wikipedia:

> Burmese names lack the serial structure of most modern names. The Bamars have no customary patronymic or matronymic system and thus there is no surname at all.

That said, it is hard to push back on a business requirement to sort users A-Z based on family name, just because.

[1] https://en.wikipedia.org/wiki/Burmese_names


It’s not really a mistake to localize your application design and UX to the first major market that you are targeting.

To immediately target the entire world will leave you burning cash as you figure out all 200+ countries' customs, or with a hopelessly generic app that is hard to use for everyone, pleases no one, and is hard to extend in culturally relevant ways.


This kind of mentality led to the huge fuckup that was ASCII code pages, and we are still suffering fallout from that.

I still cannot input one of the letters in my name into most systems, for example.

There is a decent middle ground: you can store a first name (or equivalent) and last name (or equivalent) for sorting purposes... and a display name which is culture-sensitive.


> usually made by culturally narrow-minded Western developers.

Usually made by developers who realize this will suit 99% of their users. I'm culturally eastern myself and realize this.


Non-American here. I believe US citizens often have a middle name. We Spanish-speaking people have 2 last names. Japanese reverse the "firstname" and "lastname". I had a lot of trouble when I tried to buy a ticket on a Chinese website a few years back. So no, 99% of users don't follow the "firstname" and "lastname" pattern. We might put up with it, but it doesn't suit 99% of the users as stated.


To clarify the comment you are replying to (not the author): 99% of their users. Clearly 99% of people globally aren't going to follow that custom, but many (most?) applications written in the US have a user base where 99% of their users will follow the custom. Switching to something more globally applicable isn't free either - using a single name field has downsides like making sorting by last name more difficult (a ubiquitous thing in the US) and not being able to display first vs. last name where culturally expected.


But I am replying exactly to that. I don't believe 99% of the users of any application with a non-trivial number of users have exactly a "firstname" + "lastname".

In the US it's not rare to have a middle name. And there are many Latinos, Japanese and Chinese in the US as well, none of whom follow the convention.


A lot of US-based startups target the global market early on, although they behave as if they only had to cater to US users.

As a side note, HN commenters seem to fall into two categories: the first group seems to think HN is a mostly US-only forum, whereas the second group seems to think HN is an international forum. I don't know what the truth is or whether there is one at all. But I know that it sometimes irritates me when commenters or posters assume that they are talking only to US citizens.


I suspect it's usually accidental but I feel very little rancor towards them. You solve things the way you think about them (until yesterday it wouldn't have ever struck me that a currency symbol could be in the middle of the value). And that's how, I suppose, you end up with a bunch of people with US Visas that say FNU (first name unknown) in front because they only have one name.


That's why you store a "name" field, and then spend the rest of your carrier writing parsing logic to answer silly management questions.


A Burmese here, and thanks for pointing that out! I have to explain to other people that we don't follow the father's name most of the time. Some names include the father's name at the end, e.g. if my father's name is Myint Swe, my name could be Ko Ko Myint Swe.


It is typical to sort such things in multiple columns. You just need to decide if people with no last name sort first or last.


+1 for "name", though you may want different fields for different display purposes: https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-...


The same also applies for the structured address format. I've seen some stupid decisions in the name of this.

Post codes: 5 years ago, my country brought in post codes. Prior to this there were none. It was not for the benefit of our postal system, who stated they didn't really care, but for badly designed web forms (and the occasional rural house with an unhelpful address).

They also may not fit your convention of what addresses in your country are. The obvious examples are stuff like Japan and one or two central European cities that are addressed "Building number, Block Number, City", but the aforementioned unhelpful rural addresses include a former friend of mine who had to give his address as "House name, near village, County". Actually the village was not near but it got the mail to the right post office which knew what to do with it.

As a counterpoint, I've had places that have very strict fields like "Street Name" (20 char limit), City, County, Country. For a few years I lived on a street who's name appeared three times in the same city. The actual form to get it delivered was "Street name, Locality, City", but because of the length limits I had to sneak the locality into the City field, which luckily worked as the City and County shared a name.


Hopping on the topic of addresses and not completely related to your point - but if anyone's having challenges with handling address fields I'd highly recommend checking out Shopify's address package:

https://github.com/Shopify/quilt/tree/master/packages/addres...

And corresponding blog post:

https://engineering.shopify.com/blogs/engineering/handling-a...


This seems like it hits their graphql api. Is this free to use?


> For a few years I lived on a street who's name appeared three times in the same city

You should try going to Atlanta and finding a business on “Peachtree” street. [0]

[0] - https://en.wikipedia.org/wiki/Peachtree_Street


I agree with everything you mentioned in your post.

That being said:

> For a few years I lived on a street who's[sic] name appeared three times in the same city

This is why you have postcodes ;p


>> For a few years I lived on a street who's[sic] name appeared three times in the same city

> This is why you have postcodes ;p

I guess, but it was foreign vendors which expected "Street Name, Postcode, County". Locals and the postal service were working fine with "Street Name, Locality, County"


Just store GPS coordinates, floor number and apartment number. :D


All fine and dandy until the geography shifts.


Very cool to see mention of "one name only" case. My wife had only one name before we married. She's Indonesian. When we got married in Australia our forms pretty much crashed the Government's Birth's, Death's and Marriages system. We got bounced around between all different people trying to compute the reality that in some parts of the world, people have only one name.


One of my Indonesian friends also had only one name. She named herself Noel as she felt that was better than being referred to as Ms. Null.


I once worked on a project which used the Dow Jones list of people of interest, a list of politicians and terrorists, mostly. One day the software began to fail catastrophically when parsing the latest version of that list. That day, a Mr. Null was added to that list ;-)


As in no name at all? Now that’s an entirely new level. Pretty forward-thinking parents considering today’s surveillance... maybe they were phreakers who tried to name him ' OR '1'='1 but the Government didn’t accept special characters, so they resorted to naming him null.

:)


Reminds me of little Bobby Tables.

https://xkcd.com/327/


In the US, she would have been forced to use the infamous "FNU" (first name unknown)

https://uk.usembassy.gov/visas/visa-errors/name-appears-as-f...

because governments prefer not respecting people's names and customs, just so they can push their own customs (and in that case their name too, because "FNU" is now the 30,000th most popular name in the US)


Is there a "last name unknown" as well? There's absolutely individuals for which this is true as well.


A family friend had this problem when he immigrated to the UK from India. He just repeated his single name as it was both his forename and his surname.


Yeah, "humble guide..." using never/always in 3 points, seems like author thinks he saw everything but he is up for unpleasant surprise in the future :)


You seem to be assuming that each field:

1. Could not contain whitespace.

2. Would have a minimum length.

TFA specifies neither of these requirements. You could have someone's first name (which I would recommend calling given name) be "Pablo Emilio" and their surname be "Escobar Gaviria". Or alternatively you could have someone with a given name of "Sukarno" and an empty string for the family name. Patronyms and matronyms are sort of ambiguous, but would probably be categorized as surnames.

I use a single name field, but I don't think a given name and surname split is crazy.


My favorite way to do things is to just ask for a "full name"; and then, separately, a "calling name" (i.e. "what do we call you?")

The "full name" is for putting in the c/o part of a mailing address; while the "calling name" is for showing beside your profile picture or for saying "Dear [name]" in a mail-merge.

Neither of these is a nickname. They're both your "real" name. But neither of them are your legal name, either. You're free to make either of them whatever you want, and to make them both entirely unrelated to one-another. They're just two forms of "your name" that you might expect to see in one place or another. And we can ask for them, so that we can then spit them back out at you in the situations that call for them.

Doing things this way skips the entire minefield of "given names", "family names", titles/appellations, etc. You've got use-cases for someone's name? Ask the user directly, for each use-case, what text they'd like to appear. There's not too many use-cases, really; you only need to ask a couple such questions, in the end.
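In schema terms that is about as small as it gets (column names are just for illustration):

    CREATE TABLE app_user (
        user_id      bigserial PRIMARY KEY,
        full_name    text NOT NULL,  -- what goes on the envelope
        calling_name text NOT NULL   -- "what should we call you?"
    );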


This is the correct answer. The question to ask is, what are you doing with the names? Splitting first and last only makes sense if the thing you’re doing is listing people by last name. Sometimes you do that! But not always and certainly not in all countries.


No, I think GP is assuming that names are very complex, and even the assumption that people have a first and last name isn't always correct. You can just use a name field, and let people supply the name they want to use.

What do you really need a first and last name separated for anyways? Sorting? A salutation on correspondence? Sorting by last name isn't usually important[1], and you can generally use a full name in a salutation.

Depending on the domain needs, you choose the right representation. If you expect US citizens and want their name as entered on their tax returns or social security card, then you match what those can hold and note to the user that you are expecting those values. If you're just collecting a name to have a name, why put an extra constraint in there that doesn't matter?

1: Not to mention you then get to mostly ignore the craziness that is utf-8 and combining characters and cases and sorting.


> What do you really need a first and last name separated for anyways? Sorting?... Sorting by last name isn't usually important...

When paying customers / clients want to sort by last name, because “that’s how we do things”, then you sort by last name. So it’s only important when you want to make money.

Edit to add: your customers also don’t care that someone somewhere has 6 first names and no last name. “This wasn’t a problem with our old software.”


If customers need it, then it's a requirement. Nothing I said indicates you should leave out customer requirements. In fact, I specifically alluded to including separate names in cases where it's needed.

Sometimes the customer is you or your org, and in those cases it may not be a requirement.


Exactly as you say: if customers (that is, the people paying for the software, not necessarily the people filling in forms) need it, then it's a requirement.

The point is that while sorting by last name (say) might not seem important to an engineer, and requiring a last name might seem outright stupid to a person filling in a form, nevertheless it is often an important requirement. Frequently folks will observe some "stupid" form, and link to some "falsehoods engineers believe about X" document, suggesting that engineering "got it wrong", when in fact they have simply misunderstood who the software is actually for.


Wait till you get into some Dutch surnames - "van der Kleij", "van den Bosch", "de Jager" - which are not patronyms. They would still fit the "first name, last name" pattern, but what you assume is the last name can contain a https://en.wikipedia.org/wiki/Tussenvoegsel

Though a lot of pages in NL have a separate field for it.


Or German ("von Trips") or Italian ("De Sica") and possibly other countries.

Anyway, people who move to other countries have to adapt. For the database of a local business it is often a good idea to stick with the format of official documents, because everybody has to fill out forms on government websites regardless of where they were born. IDs, taxes, healthcare, schools etc.


Arabic culture will sometimes adopt a name after you have a child as “father of X”. https://en.wikipedia.org/wiki/Kunya_(Arabic)


Yes, Arabs use "Ibn" and jews use "Ben". Slavs use father's name and "ich" suffix. People in Iceland use different kind of patronimics.

You can either use just one name field if your logic doesn't need a name or surname or provide two fields for name and surname and let the user handle the splitting.

I have three first names. When I fill a form, if it's an official form or if it's for business purposes I fill all my names as first name, if not, I only write the name I use.


Ibn is a patronym, but a kunya is the other way: your name is the name of your child. E.g. Palestine’s Abbas is also “Abu Mazen”, the father of Mazen.


Exactly what I meant: the fields CAN contain whitespace. The first_name, last_name split is one that I find useful.


Much of the official paperwork in my life has "[First Initial] [Middle name]" in the first name field, which works (usually). But why should I have to do that? It's a pain in the ass.


The usual pattern in this case is to put the first name and then all the rest of the names in the first_name field, and the last name in the last_name field. Users can be made aware of this on the UI.

Another option is to create a middle_names or other_names field; it's first_name, then other_names, then last_name, in that order.


One way to look at it: if you have more than one name, you'll usually have a preferred order of writing them, regardless of culture. The first name used will have different protocols and meanings in different cultures, but it is still the first name, literally.


What if you're ordering by 'last name'?

Say it's an HR app. Do you just assume the last word in the name is the last name?


Design guidance aside, this is not a good article because it doesn't show how to fix the presented problem scenarios.

"This is not first normal form" - OK, how would you fix it?

"This is not second normal form" - OK, how would you fix it?

"This is neither second nor third normal form" - OK, how would you fix it?

If you're going to show problems, show solutions as well.

The rest is a big red flag list of falsehoods that programmers believe about names and addresses.


Although others have replied to you, I'll take your comment as constructive criticism. Though the tone could be better.


That's exactly what it does. When talking about each normal form, it gives an example and what to change.


> it gives an example and what to change

It gives an example of something wrong. Not once does it give an example of something right. It says "change this" but doesn't show what to change it to.


First normal form:

> Then the “areas” column has values that aren’t atomic. ... So this table is not in first normal form. ... To be in first normal form you should store a single value per field

Second normal form:

> In this design employee name can be directly inferred from employee_id because the idea is that the name of an employee is uniquely defined by it’s id. ... So we have two examples of columns where the columns can be inferred from only part of the primary key.

Don't include those fields. It's not explicitly said, but I think the author assumes that if you're told not to include data that can be inferred otherwise, and these fields can be inferred otherwise, then they cause a problem and the solution is to remove them. You aren't losing data; it's already been covered that it can be inferred.

Third normal form:

> For a table to be in third normal form then it must be in second normal form and there should not be a non prime attribute* that depends transitively from the primary key. ... In this table the department_number can be inferred from the employee_id, but the department_name can be inferred from the department_number!

Same as second normal form. These fields cause problems, you already have the data elsewhere, get rid of them.

Perhaps you expected an article that said which commands to run to alter the tables. It's pretty obvious that wasn't the goal of this article. That said, it had everything you asked for, as long as you read it all and paid attention.
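To make the implied fixes concrete, here is a rough sketch using the blog's employee/department example (column names are approximations of the post's, not quotes from it):

    -- 3NF: department_name lives only on the department, keyed by department_number.
    CREATE TABLE department (
        department_number bigint PRIMARY KEY,
        department_name   text NOT NULL
    );

    -- 2NF: employee_name depends on the whole key (employee_id), so it lives here and nowhere else.
    CREATE TABLE employee (
        employee_id       bigint PRIMARY KEY,
        employee_name     text NOT NULL,
        department_number bigint NOT NULL REFERENCES department(department_number)
    );

    -- 1NF: the multi-valued "areas" column becomes one row per value.
    CREATE TABLE employee_area (
        employee_id bigint NOT NULL REFERENCES employee(employee_id),
        area        text   NOT NULL,
        PRIMARY KEY (employee_id, area)
    );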


> To be in first normal form you should store a single value per field

Where? How? How instead should you store groups of things? Blog doesn't say.

> Don't include those fields. You aren't losing data, it's already been covered that it can be inferred.

If you just remove fields, then you absolutely are losing data. Your software can't magically infer it unless you put the values and the relationship somewhere else. So where should the data go instead? What should that look like? Blog doesn't say. Blog doesn't even actually say that you should remove those fields. That's just one possible course of action and you've now put words in blog's mouth.

> These fields cause problems, you already have the data elsewhere, get rid of them.

Again, just deleting fields is 100% losing data unless you put those values somewhere else. How should you handle that? Where should they go? Blog doesn't say.


> Where? How? How instead should you store groups of things? Blog doesn't say.

By normalizing. This is obviously a guide for what to do and not do, not a 101-level course on database design and normalization. Your first step should be to google database normalized forms, and even specifically second normal form or third normal form. I know from experience that Wikipedia has articles covering these. I've read portions of them.

> If you just remove fields, then you absolutely are losing data. _YOU_ might be able to infer it, but your software sure as hell can't unless you put the data somewhere else. So where should the data go instead? What should that look like? Blog doesn't say. Blog doesn't even actually say that you should remove those fields. That's just one possible course of action.

> Again, just deleting fields is 100% losing data unless you put those values somewhere else. Where should they go? Blog doesn't say

The point is that if you're working towards normalizing that data, then you have the data in a different location.

The blog entry also doesn't explain what SQL is, or what a primary key or foreign key is. When you encounter terms that are not defined, you should make sure you understand them. If you don't, then the information is likely not aimed at you. If you don't understand normalization, you aren't going to understand what's being recommended. This obviously isn't a guide on how to normalize perfectly, it's a recommendation to do certain things, some of them to do with normalization, and a justification for why.

Consider it the equivalent to telling someone that they should eat a balanced diet for their health and notes some problems too much of certain types of food can cause. It's still valid advice even if it doesn't come with an explanation of exactly what to eat, and references medical phenomena without explaining what they are. That doesn't make it a bad article, it just makes it not terribly useful for people that it isn't aimed at.


A good ER database textbook illustrates normal forms much better than a blog post.


I find that textbooks focus way too much on the strict CS definition of the various normal forms as opposed to practical examples.


That's a sweeping over-generalization. I said find a good textbook.


The poster said "I find that...", which is equivalent to "In my experience...". How is someone posting their experience the same as making a "sweeping over-generalization"? They never flat-out said "ALL textbooks".

Also, surely if it were not accurate you could easily drop the name of a good textbook without any trouble, since you apparently have enough experience with them to make the distinction.


If you're dealing with international addresses, point 3 becomes very challenging. The 'tokens' of an address are called different things everywhere, take different forms, and sometimes don't make much sense to compare. Figuring out a good balance of usability and generality can be really tricky.

Here's nearly 100 pages on the subject (from the perspective of addressing mail for USPS): http://www.columbia.edu/~fdc/postal/


I think programmers over complicate things too much when it comes to addresses. When you're dealing with a global system that is able to deliver a package to "the 3rd house on the right, past the pond, with the red fence," or whatever, in rural anywhere, you probably should just treat it like we all do in the real world, as just a string to be interpreted in context.


This fails as soon as you talk to an API that requires normalization, need to aggregate your data, or, really, do anything other than try to deliver it.


Lmao, you're really going to hate me when I tell you I don't think we should be using addresses for anything other than delivering physical things. I know it's a cop out answer. But I don't have a better one.


> Lmao, you're really going to hate me when I tell you I don't think we should be using addresses for anything other than delivering physical things

This is not always an option. For example, in some jurisdictions, you need to calculate, charge, and remit sales taxes based on client location, even if you are selling digital goods.


However, these locational jurisdiction details will be unique (i.e. a state, a country, a city, a trading bloc, a tax code etc).

These can be different from the address details, which are nothing more than a postal address to where the goods/letters need to be posted.

In other words, the former details should be explicitly provided in a fault-tolerant manner (i.e. drop-down lists) and not parsed from the details found in the address.


USA mailing addresses are strictly that -- an identifier for the convenience of the Post Office which may or may not correspond to the underlying jurisdictions for things like political boundaries, tax authorities, etc.

I learned this when setting up voter databases. A house that is in a given city/precinct/etc (the atomic identity is parcel number) may have a mailing address in a completely different city. There are many such cases for Los Gatos vs Saratoga, for example. Also, the USPS considers every address to be in some "city" even on unincorporated parcels.

Back when phone area codes and prefixes had precise geographical meanings, many edge cases ended up on the "wrong" area code (e.g. Sunnyvale numbers being in 415/650 instead of 408 between the 85 freeway and the Sunnyvale border).


The local taxes are rarely based on the street address and typically coincide with a larger municipal division.

The problem in that case is trying to use the “address” for two purposes which may not be aligned: getting goods to the buyer and finding their tax domicile.


  local taxes are rarely based on the street address
No, they're based on the underlying geographical boundaries tied to the parcel number. This usually corresponds to the city in the mailing address (in contiguously incorporated lots) but not always.


On the e-commerce platform I work on we need to send customer data, including addresses, to multiple vendor APIs. You usually can't design your db schema in isolation.


Some counter-examples include Route Planning or determining where to locate a Consolidating Freight Service. If your addresses are not normalized, you lose the ability to see that two addresses listed as "from city center, three lefts" and "from city center, one right" take you to neighboring locations.


These are not counter examples, they are examples of the types of problems that I think turn out very poorly when they rely on normalized addresses. What I'm saying is we shouldn't be using addresses for things like this. I realize I'm talking about, basically, upending several entire industries or something, so I know it's unreasonable, pragmatically.

I just think that the hard problem of 100% accurate address normalization suffers from an extremely fat tail[0] of edge case issues, and becomes economically unviable to solve, very quickly.

[0]: https://en.wikipedia.org/wiki/Fat-tailed_distribution


Except they don't turn out very poorly, they provide obvious value for a bunch of real-world companies doing it right now.

They don't provide 100% perfect results, but that's why we still make humans.


95% of the problem is having an up-to-date record of zip/postcodes. As long as you treat the postcode as a separate field, the rest of the address can be handled by the courier. Most couriers will go off the postcode to determine whether or not they can deliver a package and cost the delivery - even if the rest of the address is garbage.

IMO, a database schema for addresses should consist of: (Country, Postal code, Address). I would split Address into 4 strings and let the user fill out whatever they want in them, in whatever order. I'd also suggest that it isn't even necessary to validate this information on your input forms - the validation is best performed by contacting the courier with the information input by the user when requesting a delivery cost (and if this validation fails, email the user back). One suggestion would be to forbid commas in any address fields, because CSV formats are accepted by some couriers with varying degrees of support for quoted fields, and often end up requiring manual intervention.

Unless you are the courier, you should probably not waste your resources on attempting to normalize any other parts of the address.

This is from experience in a business which ships hundreds of international packages a day, and tens of thousands domestically (UK).
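
For what it's worth, a minimal sketch of the layout described above (Postgres flavour; table and column names are just my own illustration, and the comma rule could be enforced with simple check constraints):

    create table shipping_address (
        id           bigserial primary key,
        country_code char(2)      not null,  -- ISO 3166-1 alpha-2
        postal_code  varchar(16)  not null,  -- validated by asking the courier, not by us
        line_1       varchar(255) not null,
        line_2       varchar(255),
        line_3       varchar(255),
        line_4       varchar(255)
    );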


People usually solve that problem by learning their address according to Google Maps (the most widely used service). For most of us it's the same address we always use. For some, it's whatever they write when they want people or stuff to reach their home.


Why not just store GPS coordinates and do geospatial queries? That gives you nice features like "show me stuff within X kilometres of a shop" and "is customer X within bounding box Y?".
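
For example, with PostGIS and a geography column (table and column names here are made up), the "within X kilometres" query is roughly:

    -- customers within 5 km of shop 42; geography distances are in metres
    select c.id
    from customers c, shops s
    where s.id = 42
      and ST_DWithin(c.location, s.location, 5000);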


That might help to some extent depending on the use case, but the tectonic plates are moving all the time, and shifting a lot in some places where you have geological events like earthquakes. A set of GPS coordinates may no longer point to your current (moved) location accurately in that case.


you could have 500 people all at the same lat/lon, with different floors, no?


This is going to be quite random, but I recently saw a comment you made about deadlines (https://news.ycombinator.com/item?id=19361199) and am interested in some of your thoughts on what an ideal software development methodology looks like. If you care to talk about it, please throw me an email: smtheard@gmail.com


I don't think any single answer to a question that broad should ever really be trusted.

It totally depends on the size and scope of what you are trying to do, what you are building, who your customers are, and like 100 other questions.


I’ve been through this before: storing a couple of addresses for a few hundred different countries in an otherwise very normalized RDBMS. We couldn’t figure out any way to do it other than EAV tables and a defined attribute pattern for each country.
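
For anyone unfamiliar with the pattern, here's a rough sketch of what such an EAV layout can look like (names are mine, not the original project's):

    create table address (
        id      bigserial primary key,
        country char(2) not null
    );

    -- one row per (address, attribute); the set of expected
    -- attributes is defined per country elsewhere
    create table address_attribute (
        address_id bigint not null references address(id),
        attribute  text   not null,  -- e.g. 'postcode', 'prefecture', 'line_1'
        value      text   not null,
        primary key (address_id, attribute)
    );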

Even though that project is long in the past for me, I’m really curious how other people do this. Anyone come up with something better?


At a past job, we basically did "both" - we'd store our best guess at a normalized address, and we'd store the text representation. For stuff like shipping labels, we'd use the text. For analytics, we'd lean on our best-guess normalization (and understand that there's some potentially significant error).

You can also lean on various APIs to normalize, but nothing that I'm aware of does this well on a global scale.


At that point you might as well just make an "Addresses" or "Address" table with a raw text field for the address. At the end of the day this thing is something that a human will read and map to the real world location.


> At the end of the day this thing is something that a human will read and map to the real world location.

That is quite an assumption.

I personally would use a field with a structured format, e.g. XML (most RDBMSs can deal with XML easily), with different types of address field, e.g. postcode, zip code, etc.

Then you could use a strategy for each country when dealing with it in your application.


We went the EAV route instead of just a free-form text field because we still wanted to expose discrete fields in the UI and have some front-end validation. But sure, from a data integrity standpoint it’s no better than a text field.


> Even though that project is long in the past for me, I’m really curious how other people do this. Anyone come up with something better?

I've always gone the other way. Plain fields (lines 1-4, city, state, postcode, country) or, when I didn't need to validate it, a textarea for free-form entry.

Most of the time validation has been required by the shipping provider, so freeform hasn't been possible. As the shipping providers haven't got anything 'better', we stick to their format and field lengths (which are terrible for out-of-country addresses)


If you are mailing something, using a plain formatted address string is usually fine.

Here's a harder problem from a previous job:

Given two addresses, tell me if they both point to the same "place".

(As you might imagine, the goal of this was joining various datasets.)

There really is no good way to do this, but to come close, you might consider a few dimensions of the problem:

1. Formatted address - a plain string. What you would write on an envelope.

2. Tokens (what I call "address components") - the bits of information that make up an address. Something like, {"street_number": "10", "street": "downing st", "locality": "london", ...}. You can get these back from most location-focused APIs.

3. Locations. Coordinates of a point or area in space.

4. Time. "Places" change over time. For example, a new business can open in place of another. Buildings are built and demolished. If you are working with data over a long time period, this is more of a concern.

Unfortunately, there are many-to-many relationships between all of these. For any address, there are many possible locations. For any location, there are many possible addresses. For any set of address components, there are many possible formatted addresses. For any address, there may have been multiple different "places" found there over time. Et cetera.

Consider an office building, for an example. That building probably has at least one address corresponding to its main entrance, but it might have other entrances. An office within the building can sometimes use the building's main address as its own address, or the address of its own entrance, or an address with a floor number, or an address with a mailroom number. An office building may have a café (usually off of the lobby) that has its own addresses. Offices in the building can be split and merged over time. I have even seen multiple buildings merged into one. You might also have a co-working space with individual addresses for rooms within it.


Thanks, this is a fascinating resource!

My logic with the address field is to consider what the purpose of the address is in the application.

If it is used for someone to send you a letter in the mail (for example: university admission) then you are better off with a big box where the user can write the magic blob of text that usually gets their mail delivered to the right place, up to instructions to the local postman.

If you need to pull off some analytics, or need to estimate shipping, then just extract the fields that your shipping agent demands: country, zip code etc. Leave all else intact in the "magic blob".


I think city, postal code, state/province (optional), and country would be fairly standard. At least in North America, Japan, Australia, and the UK. Of course for country you would want to use standardized country codes, which are somewhat political.


I don't mind web pages (or whatever we are talking about) having lots of fields for an address, as long as I don't have to type something in all of them.

But it's super annoying when I have to make up things for fields that don't make any sense where I live.


Off-topic comment / feedback about the blog post:

The author describes himself as a "_Trusted_ Consultant" in the blog's subtitle.

The comic strips illustrating the article are cropped and do not mention permission from any identified license provider, which does not help build trust - at least in my case.

Are they used with authorization from their authors?


This person would disagree about point 4: https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-...


Author here! Ha, I'm a big follower of Patrick, especially his microconfs! I didn't know he had an article on the subject, thanks for linking it!

However I agree with him!

What happens is that in my experience I've yet to encounter a case where what I suggest in the blog doesn't work. No doubt there will be cases where it will not be enough. But reading Patrick's article, clearly the complexity of a system needed to handle all the cases would be huge! So I guess one has to compromise :)


Yes, compromise in that typically US-centric way, and to hell with any culture that doesn't fit the mold, they can adapt!

It's true, they can adapt! They proved that when they didn't have much choice, back in the ASCII days when you had to conform or stay in the pre-computer age. But maybe we should try to do better now that we can...


So, my identity records got mixed up due in part to the impedance mismatch between (first name, last name) and (first name, middle name, last name), resulting in my legal name suddenly changing until I could get the issue sorted out.

Two fields work in the sense that most people are going to have something to put in them, but what that is and what they mean isn't always going to match what you intend. Either the system is going to be tolerant of such things (in which case you can probably get away with some variant of (display_name[, external_name])), or you're using it to interface with some other identity provider/consumer that requires a specific format, and that's what should be asked for.

Names are personal and hard to get right.


Very, very few cases aren't handled by an arbitrary UTF-8 string of not more than, say, 1 KB.


In what way is this guide 'humble'? Seems fairly self-confidently opinionated to me.


These are really the basics of database design. I would argue that you should do these until you know when you shouldn't. That is, you should normalize to this level until you learn when and why to denormalize some things for performance, if that's ever even needed. I think it means "humble" as in not trying to teach anything complex or revolutionary.


I'd agree. The advice to use NULL is usually harmful, and constraints are not high-performance.

One nitpick is that "user" is a reserved word in PG, so "users" is better.

Other than that, I'd agree with most suggestions. But I'd use MySQL. :)

Source: DBA.


> The advice to use NULL is usually harmful

The advice is to set NOT NULL if you think the field can't be null. That's valid, and a matter of encoding the schema to correctly match the data. Same with check values. Keep bad data out of the database wherever you can.
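
Concretely, something along these lines (my own sketch, not lifted from the article):

    create table users (
        id         bigserial primary key,
        email      text not null check (email like '%@%'),  -- required, crude sanity check
        birth_date date,                                     -- genuinely optional, so nullable
        created_at timestamptz not null default now()
    );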

> and constraints are not high-performance

Then turn off constraint checking after you've built your schema and identified them as a performance problem. Don't optimize away your data integrity before you've even determined it's a problem.

If you actually don't care about your schema, just use a text field and throw whatever in it, or use a document store without any strict checking. If you're using SQL, presumably you're using it for a reason, and that's likely so you can take advantage of what it has to offer.


Hey thanks for clarifying it for the other user! I also agree with everything that you're saying, in this and other comments in the thread


The article actually advises setting fields to NOT NULL unless there's a specific reason to do otherwise.

And re constraints: eh, for most applications this feels like premature optimization. My usual approach is to include whatever constraints are easy to express, but also have the application not rely on the DB for enforcement. That way you get some extra safety checks that might catch bugs, but if they do become a bottleneck you can just drop them.

Not a DBA, just a software dev who's learned not to sacrifice safety for perf unless there's a clear, demonstrable need.


Please don't use databases that don't support transactional schema changes. You will hate everything about it.


Facebook, Google, and many other companies run massive MySQL deployments and are doing just fine.


They do, and obviously it is workable, but I think every big MySQL org ends up with a tool/workflow like gh-ost: https://github.com/github/gh-ost which has its own set of challenges, like not playing well with foreign keys.


Or the Percona tools, which support stuff like online schema change.


Big MySQL orgs are sharded, which greatly reduces the utility of foreign key constraints anyway.


Given unlimited VC money to burn, I could engineer a high-traffic SaaS running on a beowulf cluster of Juiceros.

Would that be a good idea? Maybe.


This is a silly argument. Facebook & Google have made massive investments into the source code of MySQL and have helped make massive operational improvements to it over the last decade. It's always glaringly obvious when people have no experience with something and then throw shade at it, like the obligatory Microsoft-bash whenever that topic comes up.


They also have massive engineering budgets.


> Never store firstname and lastname in the same field

Funny, where would you store the middle name then? And the second middle name? The married name in cultures where the spouse changes it or doubles it? Of course, in normal form you need a name table, with an order column, and so on and so forth.

Anyway, this article goes straight to the notions everyone can fetch in any book and skips the interesting and difficult part of database design, which is understanding your domain so that no future expansion is prevented by a wrong cardinality on a relation that was simplified from reality.

Because if it's a user database you might get away with just the name, so you can address people during support calls. If it's an accounting database, you might want to know someone's name at the time of invoicing, having multiple records with start and end validity times. And if it's an FBI database for protected identities, you might have to have multiple name aliases referring to a single person.


I'd store the middle name in the first_name column. For example, "Juan Martin" would go in the first_name column and "Del Potro" in the last_name. Apart from the middle/first name incongruence that you may highlight, I believe this design is not awfully bad and should serve in most situations.


If you used that to address someone in emails, it'd sound very awkward to include a middle name when someone doesn't usually go by it.


Many Americans have two word first names like “Sue Ellen” or “Mary Anna” as well as a middle name. Double last names are even more common.


Why on earth would you, and the article writer for that matter, use a column named first_name for storing the middle names as well?

just call it name!

As I said above, the article goes to great lengths about normal form, and skimps on the important part: the data model is a representation of the real world, and a first_name column with middle names inside is basically a straight-up lie about what the column models and what data it contains.

There's not one way to represent names; there are application-specific domains where generalities need to be more or less detailed. But the core aspect of it all is that the data model shouldn't lie about what it is that it's modelling.


Since everyone has to have an opinion and provide it, here's mine: maybe a retitle?

>A humble guide to transactional database schema design

There are a number of things that don't apply to analytical database design, and I'm seeing the two confused fairly often. Then again, this comes entirely from my point of view, so take it for what it's worth.


I work on an e-commerce platform that's live in about 20 countries. One of the hardest records to normalize correctly is the address, because not only do users have very different rules depending on their country (e.g. not all addresses in Turkey have post codes, GB has some very weird building names/numbers...), we also connect with vendor APIs that have their own ideas of how to model an address.

My personal opinion on this is something like address_line_1, address_line_2, city, region, post_code, with each field being varchar(max). Anyone else have an opinion on how to do this?


After spending a long time trying to figure this out, I came to the opposite conclusion from the article. Unless you're only operating in one or a small number of countries, trying to normalise addresses is too much of a moving target.

The schema I ended up with for my use case is: country (required), region (optional depending on country), address (rest of address text) and a lookup to get geographical coordinates for the address (with user confirmation to tell if it's somewhat accurate). Filtering/sorting on country and region then gets you pretty far and for the more complicated queries I found it took less time to use the coordinates with PostGIS than trying to match up bits of the address text.
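
Roughly, in Postgres/PostGIS terms (the exact column types are my assumption):

    create table address (
        id       bigserial primary key,
        country  char(2) not null,          -- required
        region   text,                      -- optional depending on country
        address  text not null,             -- the rest of the address as entered
        location geography(Point, 4326)     -- user-confirmed coordinates for PostGIS queries
    );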


Yeah, I guess it really depends on your use case and your overall system too. Our system connects to multiple vendor APIs, so having one field for the rest of the address text would be very difficult -- we'd have to try and automatically split it into constituent parts depending on the requirements of vendors. Some of our data then goes to credit check APIs too, so an incorrect address can result in an (incorrectly) lost order for us.

It's a challenging area all round.


I think that's the best way of going about it as far as cultural differences are concerned. Same thing with names, where you'll have firstname (name_1), lastname (name_2)


Recently I ran into a project where everything is stored in around 3k unique fields, with many levels of nesting, in a JSONB column with NO schema or documentation. I generated JSDoc from the data and I'm hoping to create some kind of schema.

You can use JS on the server. Just don't do it this way ...


“Beginning Database Design” by Clare Churcher is a really accessible book on these issues, very practical (starts off with simple spreadsheet tables!), and takes you step by step through understanding the problems with, and the need for, the different levels of normalization.


I’m just here with popcorn and for the comments.

Lucky for me, SQL just clicked, and when I wrote my first schema, third normal form is more or less how it came out, so I knew I was on to something.

I would say a study of how the internals of your database of choice work would help in schema design.


How I wish for nothing more than an alternative to SQL to communicate with databases, something that automatically infers coupling tables too. Set theory is nice. SQL... personally, hard pass.


What parts do you find difficult? I agree that the benefit to having something learn the schema and the relationships would be nice but I get that through something like DBeaver already.


That's one issue I see with NoSQL: data is not normalized, and consistency goes out of the window if the database is big enough and the volume of data is big enough. That might not matter if you work on a search engine, social network, or music suggestion app, but for most business apps and for most websites, having inconsistency is really bad.

On the other hand, having some data denormalized can improve performance as you can avoid expensive joins. It's for the designer to decide whether they need to trade consistency for performance, and in what measure. I'd avoid it; I'd prefer to add more servers, use load balancing, use caches.


In my programming life I've barely used SQL databases so far, and there is something I've never understood about these recommendations. Maybe someone here can explain that to me.

I thought databases are designed to organize the data optimally internally and execute all operations on them as efficiently as possible. Why do you then need to organize databases so stringently? Doesn't that mean that there is something wrong with how the db works?

To me the advice always seemed similar to bit twiddling instead of relying on the compiler to optimize. Shouldn't the database lay out the data optimally internally?


I'm not here to say the author is right. S/he is making a lot of assumptions, and as you can see in this thread, everyone is saying it depends.

But one fact is that reducing redundancy in a database is a GOOD thing.

To make a comparison, I'd say it's like when I have to maintain a codebase that is large, and I have to pass a constant (let's say, ERR_STATUS = 1) around in some functions in various folders.

I wouldn't want to have to define it in every single file. One would want that in a header, in a library or in a file that defines static values somewhere. Anywhere but only one location for it.

A database is a bit like a dumping ground. Once the business gets big enough, a lot of other applications are going to plunge in and do things with the data. I wouldn't want to have inconsistent data. As a programmer, if one had inconsistent things in one's source code, one could deal with it. But data is different; data is produced by users and interactions and so on. Once you begin to have inconsistent data, you can't trust it anymore until you find out exactly what did it. A lot of wasted time. And that leads to a lot of problems.

But I agree with you about your point on performance. However, by organising data correctly, having indexes on things that matter, and joining in the right way, one is using business knowledge and programming knowledge to optimise the database. And that's not something that can be done 100% automatically (as of now).


The schema establishes the contract between you the user and the database. You have to help the database so that it can help you.

The information you are giving it is mostly things that it can't reliably know or decide on your behalf (such as whether you want to store two things together as one, like first name and last name).

The database still does a ton of stuff for you that you don't have to think about. For example, SQL itself by definition is like that. You declare what you want to take out and the database figures out how to do that for you (this part is similar to the compiler optimisation of your example).

Going by your example, not having a schema is similar to saying "why do I have to write code, can't the compiler just write the code for me?".

In both cases, on one hand yes the compiler or the database does the magic for you, but at the same time it can't work without your input and decisions (writing code, or the schema).


The DB will work just fine regardless of how you organize your data; some queries might just not be possible or efficient. A well-designed database will easily serve many different types of applications at the same time.

The power in databases lies in their ability to easily transform or restructure the same source data in multiple kinds of different ways. The catch is that it has to be arranged in a specific way (normalization) for this to happen. Normalization is actually based on math and set theory.


My humble opinion is that poor choices in database design are the source of a great deal of technical debt.

I used to avoid making schema changes as they were "scary" (until I got a decent understanding of Django Migrations). Just a quick hack to make up for it in the application layer was far easier. Then another, and another until you have real mess on your hands.

And avoid nulls when you can. Unfortunately, most of the time when I've done a new database design, there is existing messy data in Excel sheets and you can't avoid it.


We have so many tables with both is_active and active_col in them...


I've got 2 things to say:

1) There is no "good design". It is always a compromise, trying to satisfy various practical constraints. OLAP vs OLTP probably rings a bell here. And this last/first name thing is laughable. The complexity of handling all possible human names can be insane. Depending on the goal, the proper design can be anything from a single field up to some relational structure.

2) The article itself looks like let's just write something up to score some points.


> 4) Never store firstname and lastname in the same field

I have the opposite opinion: unless you have a specific reason why you need to know one's "first name" and "last name" (e.g. to interface with some other system that requires names to be split up like that), just use a single field and stop caring about whether or not the name conforms to your preconceived notions of format and structure.


(fwiw, I like the article, and I find it suitably humble)

Just want to hang a thought here ... depending on the precise definition of First Normal Form, it is either impossible to avoid[0] or impossible to enforce[1] without further constraints.

[0] if definition is "must be rectangular", i.e. all columns have the same number of rows and all rows have the same number of columns.

[1] if the definition includes "do not allow duplicate rows"


The Data Model Resource Book [1] is a good resource on how to design databases. It is a kind of reference book.

[1] https://www.wiley.com/en-us/The+Data+Model+Resource+Book%2C+...


This article is perfect if you're just getting started.

There's lore around design tricks to get the most out of a database, but it's often implementation or domain specific.

You'll learn what rules to break, and why, as you solve real life production problems.

FWIW - you can go a long, long, long way before de-normalising is actually necessary.


I wish there were a DBMS which satisfied all aspects of the CAP theorem and also provided high performance.

Until then, we are stuck with RDBMS and NoSql as lesser evils based on the use cases.


NoSQL doesn't magically solve the CAP theorem; in my experience it actually makes it worse.


Make the primary key the thing you query by. Treat a sql database as similar to a nosql database (no joins) and your performance will be amazing.
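
In other words, key-value-style access, something like this sketch:

    -- the composite primary key is exactly what every lookup uses
    create table user_settings (
        user_id bigint not null,
        setting text   not null,
        value   text,
        primary key (user_id, setting)
    );

    select value from user_settings where user_id = 42 and setting = 'theme';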


RE: not storing address in a single field, splitting an address into semantic fields feels like a premature optimisation. You’re forced to anticipate all the fields that you may want to aggregate by, before you may actually even use them.

Instead, you could just create a custom function that takes in a string and tries to identify the city or street number or whatever element you want to extract from the address, e.g.

    select * from users where extract_city(address) = 'london';
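
A naive sketch of such a function in Postgres (extract_city and its parsing rule are purely illustrative; real address parsing is far messier):

    -- assume the city is the second-to-last comma-separated token
    create function extract_city(address text) returns text as $$
        select lower(trim(parts[array_length(parts, 1) - 1]))
        from (select string_to_array(address, ',') as parts) t;
    $$ language sql immutable;

    -- being immutable, it can also back an expression index so the filter stays fast
    create index users_city_idx on users (extract_city(address));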


Use the array data type for simple collections if using Postgres because it saves a join.
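
For example (a sketch: tags stored on the post instead of a separate post_tags join table):

    create table posts (
        id   bigserial primary key,
        tags text[] not null default '{}'
    );

    -- containment query; can be indexed with GIN on larger tables
    select id from posts where tags @> array['postgres'];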


A piece of advice I think is worth adding to the list: use indexing.


This is a DDD aggregates design guide, not a database schema one.


Use third normal form - until your system gets 10 years old and linking your user's first name to their address requires joins on 10 tables with millions of rows.


Firstname lastname is too fragile and oversimplified. Suffix, prefix and a variable number of middle names/initials can also be necessary. Occasionally, legal name vs. nickname is also needed. But if there's no pressing need for grouping or sorting by surname, why piecemeal any of it out?



