Hacker News new | past | comments | ask | show | jobs | submit login
NoSQL Data Modeling Techniques (highlyscalable.wordpress.com)
226 points by plasma on March 1, 2012 | hide | past | favorite | 30 comments



First off I find the NoSQL term in itself very strange. How can you say anything intelligent about "everything that is not using SQL as a query language"? Its like talking about NoJava, instead of talking about Ruby.

Props for a well written article with lots of nice graphs but I dont agree with much of its content.

A few examples:

"software applications are not so often interested in in-database aggregation"

In my experience this is what 99% of business support apps are doing. Doing this aggregation in procedural application code will only give you more code to maintain and more bugs.

"joins are often handled at design time as opposed to relational model where joins are handled at query execution time"

Im glad you know beforehand about your changing requirements over the next 10 years and can "design" your joins for every eventuality right now. It feels like the exact oppisite of agile.

I also agree with the very insightful comment by Voice in the wind (comment #4 below the article)


"software applications are not so often interested in in-database aggregation"

Applications that used to be desktops apps are now moving to the web, and these "cloud applications" are prime use-cases for NoSQL. Apps like Gmail, Google Docs, Dropbox are cases where a NoSQL might be a better fit than SQL, the same way Word.exe doesn't use SQL internally, instead it uses linked lists and hashmaps [1].

"In my experience this is what 99% of business support apps are doing."

Business apps are the original, killer use-case for SQL, and here NoSQL makes less sense, I agree.

[1] Some apps like Firefox started using SQLite for local storage, a technical decision I don't agree with.


Applications that used to be desktops apps are now moving to the web, and these "cloud applications" are prime use-cases for NoSQL. Apps like Gmail, Google Docs, Dropbox are cases where a NoSQL might be a better fit than SQL

I don't see anything in Gmail and Google Docs data usage needs that make NoSQL a better fit for them --only the need to scale with no sharing etc.

(OTOH Dropbox might fit the case, they only need some hierarchical store a la filesystem, with versioning).

the same way Word.exe doesn't use SQL internally, instead it uses linked lists and hashmaps

Yes, but Word.exe internally deals with just ONE document and it's data structures, does not manage metadata for many documents and does not have to provide aggregated info on them. That would be Windows Explorer, that manages many .docs. And that one, MS did have a plan to make more SQL-like.


NoSQL is a better fit for Gmail because you wouldn't want one gigantic 'emails' or 'contacts' table, which is how you would model it in SQL. You also want to cache the hell out of it, which is easier done in a NoSQL model (e.g. if the cache layer is integrated and hence uses the same API as the persistent stuff).


Huh, why wouldn't you want those tables? You seem to be confusing the need to scale with how you would design the system.

Somewhere I worked before we used to import 100,000s of a companies email into the project system to display them against the system. They were all in an Emails table which just stored the body, I think there was an email header table and an email contacts table which was relationally linked directly to the Person table.

Worked fantastically, used to run an IMAP server off those tables for outlook integration.


I wasn't talking about "email" as a use-case, I was talking about "Gmail". Obviously you can serve 1000s of email users using an SQL and IMAP database.


NoSQL is a better fit for Gmail because you wouldn't want one gigantic 'emails' or 'contacts' table, which is how you would model it in SQL.

Per gmail account, you'd still dump all the mails into one gigantic emails table, either an SQL or a NoSQL one.

As for the totality of the gmail accounts, well, you'd use sharding, which is the same whether you use NoSQL or SQL.

You also want to cache the hell out of it, which is easier done in a NoSQL model (e.g. if the cache layer is integrated and hence uses the same API as the persistent stuff).

But a SQL DB can also use the same transparent API for the persistent stuff with their standard Query API. It wouldn't even need to parse the SQL statement, just hash it and compare it. And you could also use a layer that makes it transparent, such as an ORM.


So you'd have one 'emails' table per Inbox, right? So for 1 million users, you'd have 1 million tables, and no SQL database can handle that.


I agree that NoSQL is a strange term, although perhaps not quite for the same reason. I agree with the statement by @batista that SQL in this context is a defined way of thinking about problems. But NoACID seems much more appropriate especially as many "NoSQL" systems get SQL interfaces stuck on top.


First off I find the NoSQL term in itself very strange. How can you say anything intelligent about "everything that is not using SQL as a query language"? Its like talking about NoJava, instead of talking about Ruby.

Only it's not like that, because SQL is a well defined way of thinking at problems, not just a particular implementation of it.

So, NoSQL is like saying NonFunctional. Or NonImperative. Or NoRegister VMs, etc...


SQL is a language that is not particularly well defined, nor essential to solving problems. What you're referring to as a "well defined way of thinking" is the relational model. There have been relational databases that did not use SQL as their query language, and although they're not widespread, the same points the NoSQL community makes against e.g. Postgres could be made against them.


Like most buzz words, its popularity has overshadowed it's usefulness. I agree that a word to categorize these recently popular non-relational databases would be handy. But nosql is really not accurate. Unless you abandon all attempts to interpret it literally. When I see "nosql" that now tranlates into "that group of recent and fashionable databases which are primarily key/value stores with some traditional database features added in to varying degrees". So at least it means something (to me anyway), and it's abstracting quite a mouth full. Much better than "Web2.0" was. That always just translated simply to "ajax with nice UI" for me, but I saw the term used in so many ways that didn't fit that definition.


"that group of recent and fashionable databases which are primarily key/value stores with some traditional database features added in to varying degrees" is exactly what the term means.

So I guess it works.


Yeah, I think that's good. It's kind of like the term "Modernism," which makes no inherent sense but made sense when it was coined as a response to particular styles that came before it and were seen as, well, not modern enough. (Art people will tell me I'm wrong, but close enough.)

NoSQL refers to the databases that are proposed as alternatives in areas where programming culture had gotten used to thinking that the only solution was an ACID-compliant relational database. It means, "hey, this isn't the model you're used to, but give it a try anyway." You can't figure out the word without knowing the culture it emerged from.

And now I'm getting post-modern.


This is an interesting article.

I am primarily a PostgreSQL guy who does all sorts of things like hierarchical data representation in SQL. While these things have come a long way in the past few years. This being said, the more I read about NoSQL data modelling techniques, the more it occurs to me that some of these techniques may work well in relational data environments where data is read-frequent/write-seldom.

In LedgerSMB (http://www.ledgersmb.org) we already use key-value modelling in cases where it makes sense (system settings, and a few other things).

Currently what hierarchical stuff we are doing wouldn't benefit from the ideas in this paper, but I wouldn't rule it out for some other things in the future.

I guess what this is reinforcing for me is that NoSQL and SQL models are not entirely mutually exclusive.....


NoSQL is about the interface, not the implementation. NoSQL databases provide a better impedance match out of the box for some applications.

What is often lost in the conversation is that you can do the same thing using a competent SQL database engine if you can deal with the complexity. But you have to use SQL, which for some applications is a poor interface, and you have to configure the engine for your application and workload. This adds complexity to the process. If you have great database architects and DBAs, NoSQL does nothing that you can't do on a really good SQL engine. Most startups have neither the people nor money for that.

The vast majority of databases, whether labeled SQL or NoSQL, implement the same relational operator algorithms under the hood. They are not intrinsically different in that regard. Even graph databases, which in theory cannot be expressed in a simple relational algebra, can be and are expressed in practice as recursive relational algebras. As long as databases are using the same algorithms and representations they will have the same limitations.


Most startups also don't have complicated data models or a high enough traffic to justify having a DBA and commit premature optimization by denormalizing data right off the bat to speed up retrieval in your NoSQL DB.

As many have said before, NoSQL is a premature optimization in that all it does is to remove some restrictions in your technology stack to let you move come complexity such as data validation and the ability to easily aggregate data up the stack.

The need for NoSQL is a rich man's problem. When you organize your data like that using the article's techniques, you are going to have to write a lot of very odd looking code and tightly coupled code to do even some basic reporting. E.g when you try to query from the many-to-one direction.

If you are following the Lean Startup methodology at all, you should be aware that being able to measure things is crucial early on when you are trying to reach a business goal. Writing bunch of crazy for loops and map reduce stuff in the application layer isn't exactly easy to write, look at or maintain.


The impedance mismatch is real which anyone who has seen stored procs written by app developers can verify....

What NoSQL does is give you a light-weight object store which could have some very cool uses. Those uses are also pretty narrow for the reasons the article mentions (assuming the questions you have now are all the questions that are important for example).

However, I think some of the solutions may help in relational environments with the edge cases.


That's a little weak as an argument isn't it. SQL and the relational model has nothing to do with stored procedures and you certainly don't have to, or need to use stored procedures for reporting. In fact, prior to 2005, MySQL didn't support stored precedures for the longest time. How do you think those people did reporting? I don't know how you do reporting, but mind involves just a ridiculously simple and ugly web app that essentially just generates some HTML tables from a bunch of SQLAlchemy queries.

Why do I get the sense that the real reason most people use NoSQL stems from their destain of SQL, whatever that reason maybe...


I am a total relational guy, BTW. But the fact is that there is a mismatch between how you have to think when doing SQL queries (thinking in sets) and OO programming (thinking in instructions). Anyone who has dealt with stored procedures written by app folks understands what a mess you get when you try to program one side in techniques aimed at the other.

While NoSQL is a good choice for some environments, namely ones where ad hoc reporting is not likely to be needed and where other methods of interop are preferred (LDAP being a great example of something that could benefit from a NoSQL back-end), the fact is that this actually shows that, more often than not, you lose more than you gain by getting rid of the mismatch....

IOW, I think it is a moderately weak case for NoSQL in some environments and a strong case against in a much larger number of environments.....


Even if it's not write-infrequent, you can still try these ideas in SQL if you use a write-optimized database like tokudb (the one I work on).


I like the comparison of the design themes of relational modeling and NoSQL modeling as, respectively, "what answers do I have?" and "what questions do I have?"


Yes, a provocative thought. But I don't really buy it.

I think it's clearer to see the distinction between an information model and a storage model: The fundamental goal of an information model is to provide a mechanism to convert input information into output information. But if the model is so degenerate that it just echos back the same data in the same format that was put in (at a different time), then it's probably better to call it a storage model.

The relational model is a logic system that lets us use predicate calculus to infer from one set of propositions taken to be true, a second set which is by implication true. That is, it provides a means to obtain encoded answers in response to encoded questions applied to a body of encoded facts.

And a KV store is clearly a storage model.

And there are some in-between, such as the CouchDB document store, that can transform one body of information to another, but also pay attention to physical storage issues.


To me the key point is that in relational databases you describe - declaratively - the information you want and leave it to the database to figure out to retrieve it (x); whereas in most NoSQL databases you describe how to retrieve the information.

(x) this is not the entire truth, as SQL is actually a (somewhat) unfortunate mix between the declarative relational calculus and the procedural relational algebra.


I don't know enough to understand what makes relational algebra procedural. I read the "Haskell more successful cousin" article and I thought that the relational algebra part was what made SQL and Haskell similar.


Party like it's 1969! http://en.wikipedia.org/wiki/Network_model

Not that many old things don't work, well in some circumstances even. I'm just having a hard time seeing how throwing away ACID and denormalizing data is "post modern" rather than "back to the future".

Denormalization? How about a materialized view to support common searches to reduce I/O from assembling data.

Aggregates? You could probably abuse and extend entity attribute values to (physically) cluster arbitrary / sparse / repeating field values around a common parent.

It's actually a pretty good article, otherwise, I should probably leave myself a marker to find it. Sooner or later, I'll end up having to maintain one of these things -- long gone contractors will build a "latest and greatest" app using a database system not unlike a 1970 mainframe. There are a number of good work-arounds in the article for dealing with systems with poor indexing capabilities -- a single key field to be filtered.


I've the feeling that when talking about Data Modeling, Redis really does not fit into the key-value category.


To the extent that one could talk about data modeling using the constructs of an imperative language e.g. C then one can do the same with Redis (with the caveat that Redis lacks a reference type). K/V only stores restrict the semantics to that of maps e.g. map["foo"]=bar, but if they introduce richer operations e.g. map-reduce on the K/V containers (e.g. map.apply(func)), then it is pretty much Redis restricted to its String type.

I personally have a preference for unified views of systems so my bias is to look at the entire hierarchy of storage (image) and memory model (semantics) as a singular space, with back end disks as Ln to L1 cache on the CPU. In this light, to me Redis is a memory manager(/cache) + DSL, serving at Lx (where x is somewhere south of local Disk and north of a durable and consistent distributed FS backend e.g. HDFS).


That's somewhat true. There's no need to model data structures like trees, maps, sets and lists in a data store that supports them natively.

This article is still useful when using Redis because the article describes techniques apart from modeling fundamental data structures; most of the indexing techniques are still applicable to Redis, for instance.

As an example, the Composite Key Index technique could be accomplished using a sorted set to store the ordered set of keys and Redis's k/v functionality to retrieve the values.


I had a good talk with a potential customer two days ago and we covered non-relational databases in some detail. I blog about non-relational databases a lot so he thought it was 'religion' for me, but this is not the case at all: I default to wanting to use PostgreSQL (or another relational database) and instead choose MongoDB, CouchDB, etc. depending on special requirements.

As per the article, it is required to understand the CAP theorem and specific capabilities of different data stores.

Client library support varies a lot. Some ORM tools like Ruby's Datamapper let you design composite storage schemes using a relational database and, for example, MongoDB. For document oriented data stores I very much prefer writing client apps in languages like Ruby and CLojure that have a nice syntax for maps, etc.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: