Hacker News new | past | comments | ask | show | jobs | submit login

"Consider a database of actual, real world, gravestone inscriptions. If someone's gravestone stone has "1890-02-30" inscribed on it, I know thats wrong but I don't care, I need to store it exactly as is for historical purposes, I don't want a DB crash or need to recompile postgres to accept it, I don't want to force the users to falsify gravesite records, I don't want to have to store as a CHAR or VARCHAR and have to write my own date handling routines in my app..."

I think this is a crucial point of distinction between the two philosophies: structure defined in the query; and structure structure defined before data is loaded.

Structure defined in the query is the more obvious approach. You collect all of the data, and write queries over that data that handle all the cases. The queries often become quite complex and error-prone. Even if the query is slightly wrong, the result generally looks about right. Queries may take a long time to develop and get right, and may react badly to new data that is loaded (e.g. "I thought that was a number field, but now it has letters"). This approach is useful when you are trying to interpret the input data in several different ways -- in other words, when the query is helping you determine the nature of the data you have.

Defining structure before loading is generally more robust and less error-prone, but requires planning that may be frustrating to people just trying to get their hands on the data. The queries generally don't have branches or special cases, so usually if the query runs at all, it will give the right answer. If someone is trying to file an expense, and the receipt says Feb 30th, the accountants still don't want to see the expense as happening on Feb 30th. If they let it in, it could (potentially) break all of their other queries by creating inconsistencies (e.g. it happens after one month is closed and before the next is opened, and it causes the accounts to be out of balance somewhere). So, the person filing the expense has some extra work to do -- maybe they need to look at their bank statement to see what day it really happened, and add a note saying the receipt has the wrong date, in case anyone does an audit.

Broadly speaking, the first philosophy is easier for writers of data, and people writing the applications that help people input data (in part, because you never have to tell the user that the data is wrong, and they need to reexamine their records). The second philosophy is easier and more reliable for the people querying the data, but harder for the people trying to input data and the people trying to write applications to help input data (because they have to handle more error cases and try to provide context so the user can correct them).

In your example, it all depends on what you are trying to ultimately do with the data you collect. The easiest thing to do is to take the data in an even more raw form: just have people take pictures and automatically upload them. But it's awfully difficult to query pictures, so you have to demand a little more structure at load time if you want to query the data at all. I'm not sure what the right balance is for you -- maybe they have a 13th month or a 32nd day, so you should just ask for 3 integers. Or maybe people put question marks or ranges (e.g. born sometime between X and Y), and you want to represent those as well. But the more of that you do, the more burden you put on query writers, and the higher the chance that you get wrong results.




In one of my other posts in the object-relational series I noted that select * has very different implications in an object-relational vs a strictly relational model. In a strictly relational model you want your SQL query to define your data structures on output. In an object-relational model often times you want your data structures to be formed properly so the db can do other object-relational stuff with them later. So there select * becomes very useful as a way of ensuring that the data structures on output can be simply re-used later.

Of course if you are doing pure physical storage queries, select * is probably not what you want but if you have a logical model built up, you may want to do select * from it in order to ensure that your output matches some specific set of rules.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: