Hacker News new | past | comments | ask | show | jobs | submit login
NoSQL: If only it was that easy (bjclark.me)
88 points by bjclark on Aug 5, 2009 | hide | past | favorite | 43 comments



I'll be honest (and chances are I'll get a great explanation as to why I'm wrong), I don't understand the whole YeSQL-NoSQL argument.

Relational databases, such as MySQL and PostgreSQL, are great for some things -- especially when you need to (put simply) divide and recombine various pieces of information easily. JOINs. UNIONs, etc. are fantastic when you need them, and it's where relational databases excel.

On the other side of the equation are hash and "document" databases, which are mostly key-value stores with some unique functionality (e.g. the indexing feature of Tokyo Cabinet Tables). These are great when you need to store a large volume of data, but don't need to frequently recombine the stored data in many different ways. You can run simple queries with some hash DBs and retrieve specific data, by key, at the drop of a hat; this is where (for the most part) key-value stores win out over relational databases.

It is clear, at least to me, that these two classes of DB serve two different purposes, with some overlap. By that logic, I see no reason why they can't happily co-exist in a single project. In fact, I'm using multiple database formats in a single application without any problem. Sure, there is a little more management and some extra logistics involved, but the point is that each type of database is used with its strengths in mind.

I'm not going to force MySQL to be a giant hash table when something else will do the job better, and I'm not going to force Tokyo Tyrant to try and be a relational database (not that I really could).

Am I wrong in finding a happy medium between multiple technologies?


I wrote a key-value store['] and now I'm building a service which handles lots of data/requests that uses it. My short conclusion: KV stores are to (My)SQL what Assembly is to C++[''].

As in, when you design a relational DB by the book, you first create a normalized schema with indexes, and subsequently issue SELECTs with WHEREs and JOINs and GROUP BYs and ORDER BYs and LIMITs against them in a declarative manner.

With a KV store you basically have direct access to the btree layer, so you have fine control over what's going on. Just as with Assembly, it is much harder to organize your data/program; most people don't need this kind of low-level access, and will shoot themselves in the leg given the possibility. Currently it only makes sense to use a KV store if you need this kind of low level access for performance/scalability reasons (or you want really strong replication which only exists in Keyspace, which is a KV store).

Hardcore SQL admins will say that one can optimize the RDBMS , and they're right; this is the usual trade-off when your high-level abstraction's default behaviour is too slow and you start to break open the abstraction to tune the underlying layer: in my opinion, it makes sense to just abondon the higher-level abstraction at this point --- but this is a matter of taste since you're loosing many other conveniencies along the way.

The other use-case is sharding. With a KV store it's much more natural to store parts of your tree on different servers and just issue GETs and SETs over the network. This transparency is one of the things you gain if you go down a level in the layers of abstraction.

['] http://scalien.com/keyspace

[''] Prolog really


Am I wrong in finding a happy medium between multiple technologies?

No, you are correct. The main source of contention lies in people trying to use a particular technology outside of its comfort zone. Many in the NoSQL crowd suggest that this is a common symptom of SQL installations. Unfortunately, a good portion of the arguments that address this point are hyperbolic and seem to suggest that SQL is rarely the right tool. This is what keeps degrading the signal to noise ratio.


To elaborate on qhoxie's statement, NoSQL is great for some things, but there is this idea out there that it's a panacea for scaling your webapp. And while all the things highlighted in the article are performant, few of them even scale, and none of them are perfect.


I think you are missing another fundamental difference. The NoSQL solutions are more like a dynamically typed language where as the YeSQL ones are statically typed.

In NoSQL, I don't have to tell the database layer what I'm stuffing into the value (or even how I'm building the keys). Of course the application is taking on a lot of responsibility that previously belonged in the persistence layer.

I think this is the true draw to the NoSQL philosophy. Especially when you consider the apparent (to me) rise in meta-programming.


On the other hand, it also means that if you want the database to check any structural constraints on your data, such as that any comments posted must be associated with a valid user, you're stuck writing it in PHP, Ruby, etc., rather than using functionality in a database that has likely already had man-centuries of optimization and debugging. Oh, and every check means sending the entire record (or table!) over the network, unpacking, probably instantiating objects, and iterating over it in PHP/Ruby/Python/etc., repacking it (with stored statements, if you're lucky), and then sending it back. That's a lot of extra processing that RDBMSs are already designed to handle internally.

Static typing is a good analogy, but the main purpose of static typing is expressing the semantics of data structures in a way that the system itself can reason with them. (OCaml or Haskell's type systems are much better examples of static typing's utility than Java's or C's, FWIW.) If you're still prototyping and/or the data structures are still in flux, then something completely dynamic will avoid a lot of extra work, but once things settle down, having the database itself be smarter about processing the data is worth consideration. (Of course, if your entire data set fits in a Python dictionary, then using a real database is overkill anyway.)


I don't know that being schemaless is fundamentally different. Instead of thinking that key/value stores are schemaless, I think of them as having a single rigid schema: key and value. The data you stick into a k/v db will still have essentially, a schema, you just end up writing your application to be flexible to data inconsistency.

The comparison to dynamic vs static typing doesn't hold much water.

Can you explain how meta-programming relates to this topic at all?


Metaprogramming: I think he/she is associating schema-less DBs with dynamically typed languages and metaprogramming, and static typing with languages such as C that tend to lack sophisticated metaprogramming facilities.

I'm not sure it's a valid comparison, though: OCaml has a statically typed Lisp-style macro engine (camlp4), for example. (In all honesty, though, I've never used it. Lazy evaluation, the packaging system, and other language features cover many of the same use cases.) I think it's a case of assuming the C family's type system is the cutting edge of static typing, when it's actually pretty archaic.

The point about K/V databases having a rigid schema of (key_type -> untyped_value) is a good one, by the way. Of course, association tables (AKA "dictionaries") are pretty versatile as a basic collection type - looking at Lua, Python, Awk, or Javascript. They're not ideal for all cases, but they're a good start for most.


Sure I can see the single rigid schema view. But I'd rather see the K/V as just a persistence mechanism. I view the schema as being more related to the application.

meta-programming relationship: Since you can store anything in the value part of a K/V store. You could (and I do) store the schema definitions for the other values in the data store. And while you are at it, you could store code segments.

So imagine the following key store (in pseudo code):

{ key: 1728273, value: { fields: { first_name: string, last_name: string } } }

{ key: 8274289, value: { type: 1728273, data: { first_name: "Bruce", last_name: "Saunders" }}}

You could stuff code segments in there in a similar way. So I think K/V stores are potentially highly related to meta-programming. Though certainly the two could exist entirely separately of each other.


I may be missing something, but can't you create reasonably performant KV stores in relational databases? I'm thinking, for example, of IOT tables (indexed by key) in Oracle, or possibly pl/sql function caching in 11g.


No, you're not. This point is made in the article.


The subjunctive: if only it were that easy. Actually, it is in English because we only ever really notice it in a few places...

Grammar nitpickery aside, good article. Is client-side hashing really that big a deal? I don't know from first hand experience because I've never bothered with these things yet.


Just to clarify, client-side == application-side, or the application as a client of the database.

Is client-side hashing really that big a deal?

No, it's not that big of a deal, but I tend to feel that any layer of logic that can be done on the DB end rather than the implementation end is better off in the former. Voldemort is a good example, as noted in the article.


It's not a big deal when you do it the first time. It's a big deal when you need to start adding servers. What do you do?


I don't have scaling problems, I don't have sites that big, yet.


He's pointing out a situation where it becomes a bigger deal, to answer your question. While there are solutions in practice for this situation, it can be painful.


I thought he wanted to know what I did. Since I already said I don't know much about this problem domain, I'd be curious why it's a problem when you start adding servers.


One such bad situation would be if you have numeric keys that you shard based on a mod 2. You do this because you want to distribute data between 2 servers, so one will get all even keys and the other will get the odd keys.

A few months down the line your load gets so high that you need to add a third server, but your sharding mechanism is based on mod 2 - so you need to hack around this in one of many ways to utilize the third server effectively.

Make sense?


So don't people just publish client libs that handle the details for you? It doesn't seem like an unsurmountable problem, but once again, I've never dealt with it in practice.


So don't people just publish client libs that handle the details for you?

Sometimes, but oftentimes sharding issues are unique enough that they require custom solutions when the backend technology does not support it natively.


And these sharding issues don't happen with the things he marks as "scales!"?


That is correct. Things like S3 can expand without issue, and others like Voldemort handle partitioning data natively.



I thing there's an explanation to the NoSQL movement and why non-rdbms are gaining popularity these days.

Not a long time ago, if you had to handle 1 billion records then you were already big with piles of $ sitting beside you. Then you would just call Oracle or Greenplum (or whatever) and say "Hey, I've got some million $ here, care for it? I just want to be fast(tm)"

Lately it is quite possible to be small and forced to handle millions of records daily or struggle with new technologies that require less man power for development but need scaling a lot earlier (e.g Rails). You have limited resources and if you can get away with Tokyo Cabinet on a single machine instead of a 100k mysql cluster then you get to live longer.

As any new cool technology it'll get some hype and then fall back to being the right tool for the job. It's good to have some alternatives to RDBMS, some problems are really not relational.


Before relational databases, you know what was the standard? Hierarchical databases, aka key-value stores. http://en.wikipedia.org/wiki/Database_management_system#Hist... Since then, object oriented databases have come and gone - another hierarchical DB.

I wasn't there when it happened in the 70's, but I think it's essential to understand why relational beat hierarchical then; and to understand what has changed since; because only then can we make an informed prediction.

I think the issues were efficiency and ability to analyze the data. Anybody know what the actual reasons were?


There's a major niche in which hierarchial DBs are quite common - filesystems. While there have been attempts to graft relational features onto them ("journaling" overlaps strongly with transactions, symlinks, etc.), apparently most people haven't historically considered the ability to do arbitrary joins over a whole filesystem to be worth the impact it would probably have on performance.

Also: Hierarchial databases are very different from key/value stores. K/Vs are flat (joining, batch consistency checks, etc. are usually done clientside in a faster/safer language than the DB was implemented in, such as PHP), while hierarchial databases are (naturally) hierarchial. If a client searches for data in a directory, it will only see the nodes in that directory, or those found by walking any subdirectories. While this can be very fast (most records won't be in the same directory, so the first operation has already reduced the search space considerably), much data fits poorly in trees, and constantly rearranging subdirectories or adding symlinks is at best a workaround for the massive conceptual mismatch.

OTOH, it's a good fit for some cases - we're communicating on a hierarchial database right now. As moving one comment thread from one post or parent comment to another would be incredibly rare (if it were possible), a lot of DB functionality to allow flexibility in data modeling can be discarded. Modeling the discussion as a tree of nodes (S-exps, in this case) with pointers to parent comments, subthreads, user IDs, etc. is pretty straightforward.


A hierarchical databases is not a key/value store: http://en.wikipedia.org/wiki/Hierarchical_model

Though you are right that the reason RDBMS became the norm has a lot to do with working with the data (analyzing to use your word).

A key/value store is effectively just an associative array, while damn useful for certain things, I am hesitant to call that a DBMS.


object oriented databases have come and gone

They have never arrived in the mainstream, but they exist as a niche luxury item, mainly because the mainstream use broken languages with broke runtimes that can't persist objects between sessions without getting their pointers in a twist. So they settled for the next best thing: Object Relational Mapping, along with a matching pair of Impedance Mismatch. It's ORMs that are coming and going; at least in the industry hype-machine.


Why doesn't Oracle Coherence ever come up in these discussions of RDBMS alternatives?

I've been going over its documentation, and it looks like the cat's meow in getting rid of the database scalability bottleneck. Think of it as memcached done right. Coherence has automatic clustering, needs almost no configuration, supports many flexible distributed schemes, easy API. With Clojure and Scala, you don't even have to use Java to use it.

Has no one heard of it? Does it get no love because it's commercial and (probably) expensive?


>> Does it get no love because it's commercial and (probably) expensive?

Yes.


Peer pressure. I have never seen this in any other light.

The "versus" mentality of this stupid meme comes from those either proselytizing or stubborn. Plain and simple.

If you're too ignorant to understand what is the appropriate tool for your situation, or offended by the perceived alternatives, then you should withdraw from this topic.


If you're too ignorant to understand

That's not fair. Everyone starts out as ignorant. I think the "versus" mentality helps less informed people separate two fundamentally different approaches to persistence.

There's a lot of information in this post. Granted, it's just one person's opinion. Take it for what it's worth and move on.


I think what happened here is that a subtle debate about whether the characteristics of relational databases are a good fit for web development* has been hyped into some kind of movement. It doesn't help that a many peoples' first impressions of RDBMSs these days probably involve reading just enough of a MySQL tutorial to use it for a PHP project. Working with relational databases without understanding the relational model (e.g., understanding indexing, the rationale for normalization, good schema design, etc.), can lead to disappointing performance, and in those circumstances, people are unlikely to have someone better-informed convince them otherwise. (The object/relational mismatch is part of it, as well.)

* Among other things, RDBMSs put a lot of resources into ACID transactions -- in the use cases for which they were designed, data is expected to outlive some employees, and just dropping part of an update to e.g. a medical record or insurance claim would be a disaster. Similarly, the databases can do elaborate checks on the data, and any potential changes, to ensure that it's always internally consistent. Losing a youtube comment wouldn't be a big deal, though, and if you're willing to cut corners on transaction guarantees or internal consistency, there are performance benefits. (Being extremely protective of the data is a good default, IMHO.) And, yes, not just relational DMBSs are ACID, but many "NoSQL" dbs have that trade-off in mind.


Great post. NoSQL seems to be a band-aid. And for some problems, the best solution is to rip off the band-aid :)

My thoughts at http://hypecycles.wordpress.com/2009/08/05/look-ma-nosql/

SQL has survived because it is reasonably portable, is descriptive and has a rich ecosystem. NoSQL is a response to a problem but it is one that will not get mainstream adoption because it is non-standard.

It is a great stepping stone; what comes out of it is hopefully a standards based scalable data storage and retrieval system.


Coincidence... I'm looking for some simple caching layer right now. But what I need is that data is distributed to all server nodes. I don't care about insert time that much. And the only thing I need to store is a 'unique string -> string' map. (very short strings)

Can someone recommend any solution? I'd be perfectly happy with a "replicating memcached"...


Most of the noSQL information I've read appears to be more focused on caching, speed, scalability, distribution, etc. To me SQL is a query language, not an RDBMS. "noSQL" is unfortunately misnamed (and rhetorically loaded) although it's much catchier than "noRDBMS".


"What am I going to build my next app on? Probably Postgres. Will I use NoSQL? Maybe"

I just spent some time evaluating several k/v stores for a new project. My conclusion: These things may be useful when I'm ready to optimize. Until then, I'll put everything in postgres.


I stopped reading as soon as he posted something looking like a dictionary definition.


Hopefully you don't always stop reading thoughtful pieces for trivial reasons.


I don't think the primary drive or hype behind NoSQL is performance. I think it is that SQL has become some sort of norm for data persistance, when in many cases a SQL database isn't necessary, or even optimal.


Right: there are multiple possible benefits of "NOSQL" (nonrelational): (1) scalability; (2) better performance, and (3) easier coding and development especially with agile development methodologies. With certain of these products, notably the document-oriented stores (MongoDB&CouchDB), development is more natural as object<->relational mapping goes away. That is a nice benefit even for the smallest of projects.


Absolutely, it feels like you're just working with hashes and arrays, as opposed to SQL-tables masquerading as such.


im sure people affiliated with or users of the respective projects will take issue with points made. all told, however, the main takeaway for me is what i already knew and which the author mainly drives home in the last paragraph. specifically, use and take advantage of all the tools out there including the ones in the nosql camp and the more mature rdbms camp. and simply, get started already!

projects im working on now sometimes have the same data represented differently in different systems to take advantage of what those systems have to offer. many of them mentioned in the article.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: