NoSQL Databases: A Survey and Decision Guidance (medium.com/baqend-blog)
297 points by DivineTraube on Aug 15, 2016 | 73 comments



On my personal projects, my default database choice is Postgres unless I have a glaring, obvious reason to choose something NoSQL (in which case I like CouchDB if the situation calls for a document store, don't know much about the others).

At work, it doesn't matter if we have a perfect use case for a particular NoSQL solution (I'm working on something now that I think would be a great fit for a Graph database like Neo4J). We use Oracle, because it's expensive, we're already paying for it, so damn it, we're going to use it.


RethinkDB is my default at this point. Relational documents provide 99.9% of the features I need. When I need actual graph data, I play around with ArangoDB.


As a fairly staunch advocate of PostgreSQL for the 99% use case, I'd be really interested to hear what makes RethinkDB stand out.


The admin UX... setting up a redundant sharded cluster with auto-failover in RethinkDB is in the box, straightforward and easy. With PostgreSQL you're stuck choosing between at least a half dozen less-than-optimal solutions.

That would be the single biggest difference... Neither has particularly great DBaaS offerings, outside of RDS Postgres, though if you're on Amazon, you can use Compose.io's RethinkDB option there too.

RethinkDB has sharding + redundancy similar to Elasticsearch or Cassandra, while having some other features that put it more in line with a SQL RDBMS, and an administrative UI that's second to none... and I mean none.


Combine the best ideas from a relational database and a document store, sprinkle in some realtime change feeds, and add an awesome admin interface. That's about all I need for a modern web application.


RethinkDB is also my default at this point; the scaling features are great. But there are several major unresolved issues; check their issue tracker on GitHub for reference.


Anything serious that can corrupt or eat data?


No, not really. It's really solid. But stuff like correlating writes with feeds isn't possible at the moment.


One thing we've done in the past is to include the feed name / source in the document (just an FYI... not sure if it's applicable to your use case or not).


I do the same.

If I'm doing something where I'm storing JSON (e.g., from an external service), but only care about certain bits of it, I'll store the bits I care about in columns, then put the whole JSON object into a jsonb column.

The key advantage that I see is that when I decide that a new piece of the JSON needs to be cared about, I can do something like this:

    (modify data types as required)

    ALTER TABLE blah ADD COLUMN newthing text;
    CREATE INDEX ON blah (newthing); -- if needed
    UPDATE blah SET newthing = jsoncolumn->>'new_thing';


You don't even need these additional columns in PostgreSQL, thanks to expression indexes! [1]

Just add an expression index for your new field:

    CREATE INDEX ON blah USING BTREE ((jsoncolumn->>'new_thing'));
Then queries like the following are blazingly fast thanks to the index:

    SELECT * FROM blah WHERE jsoncolumn->>'new_thing' = 'some_value';
This is less error prone than managing the consistency of the additional columns on your own. And if you still need some kind of pseudo-table, consider creating a simple view for that.

[1] https://www.postgresql.org/docs/current/static/indexes-expre...


Just a nitpick, but I believe it would be more efficient to create the index as the last step here.


Not necessarily.

If the table is already fairly large you'd have to use Postgres's CREATE INDEX CONCURRENTLY to avoid locking the table while it builds the index. The only downside is that you have very little control over the process.

If you already have the data available in a JSONB column, you can create the column, create the index, and then have your app code fall back to the JSON data whenever the column is null during the transition, while a background job populates rows individually or in small batches to avoid disrupting the user experience (see the sketch below).
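To make that concrete, here's a rough sketch of the batched backfill (assuming the blah table from above and an id primary key, which I'm making up). Run it repeatedly until it reports 0 rows updated:

    -- backfill only rows that still need it, 1000 at a time
    UPDATE blah SET newthing = jsoncolumn->>'new_thing'
    WHERE id IN (
        SELECT id FROM blah
        WHERE newthing IS NULL AND jsoncolumn ? 'new_thing'
        LIMIT 1000
    );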


This was just an example of what you can do. I'd do this sort of thing while developing an application, but if it were in Production I'd take a much closer look at how the steps in my change would impact the running application.


you should be able to index on an expression without adding the column.

eg

  create index on blah((jsoncolumn->>'new_thing'));
although I haven't tried it myself yet, I've been meaning to play with that feature.


One could argue that Neo4j is much closer to a relational database than to a NoSQL object store, since a relational database is effectively a graph. You could model graph data like Neo4j's in a relational database simply by keeping a table of nodes and a table of edges (effectively a join table between nodes).

The advantage of Neo4j is that its query language is explicitly optimized for the graph use case, whereas querying a relational database as a graph engine requires fairly complex SQL.
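To sketch what that looks like (table names made up): the model itself is just two tables, but even a basic reachability query already needs a recursive CTE:

    CREATE TABLE nodes (id bigint PRIMARY KEY, label text);
    CREATE TABLE edges (
        src bigint REFERENCES nodes(id),
        dst bigint REFERENCES nodes(id),
        PRIMARY KEY (src, dst)
    );

    -- all nodes reachable from node 1
    WITH RECURSIVE reachable(id) AS (
        SELECT dst FROM edges WHERE src = 1
        UNION
        SELECT e.dst FROM edges e JOIN reachable r ON e.src = r.id
    )
    SELECT n.* FROM nodes n JOIN reachable r ON n.id = r.id;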


When would I use Neo4j over a relational sql database like PG/MySQL?


If you are writing lots of joins in SQL, that might be a good use case for a graph database like Neo4j. Neo4j treats relationships as first class citizens, so really anytime you care about relationships in your data you might have a good graph use case.

One way to think about relationships in graph databases is as materialized joins. Relationships are modeled and stored explicitly so there is no join at query time, only a constant time graph traversal. There are differences in flexibility of data model, ease of expressing traversals in the query language, etc but I think traversal vs JOIN is a good way to compare relational vs graph.


My understanding is that it's for cases with widespread, conditional joins. I haven't ever implemented one, though, so somebody feel free to correct me if I'm wrong here:

Example: News Feed

- JOIN with my friends

- JOIN recent posts

- JOIN topics / keywords that I tend to care about

- Filter by people I interact with more

- Filter out shared posts from sources I ignore

- Sort by some combination of recency, likes, likes from my friends

The idea, it seems, is to be optimized for wide joins.
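Very roughly, this is the kind of SQL pile you'd otherwise end up writing (all table and column names here are hypothetical):

    SELECT p.*
    FROM friendships f
    JOIN posts p        ON p.author_id = f.friend_id
    JOIN post_topics pt ON pt.post_id = p.id
    JOIN my_topics mt   ON mt.topic_id = pt.topic_id
    LEFT JOIN ignored_sources ig
           ON ig.source_id = p.source_id AND ig.user_id = f.user_id
    WHERE f.user_id = 42
      AND ig.source_id IS NULL                        -- drop ignored sources
      AND p.created_at > now() - interval '1 day'     -- recent posts only
    ORDER BY p.created_at DESC, p.like_count DESC;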


Ultimately it's up to you, of course. But the most obvious use case would be when you're building something explicitly modeled onto a graph, like a social network.


If you are interested in graph databases (full disclosure: I am the author) you should check out https://github.com/amark/gun .

- It is Open Source under MIT license, very liberal, at no cost.

- It has property graph features like Neo4j.

- It does realtime updates, like Firebase.

- You can store JSON and Relational Documents, like RethinkDB.

- Offline first (or mobile first), like CouchDB.

- Upcoming release has been benchmarked at 25M+ reads/sec and 6K+ writes/sec.

- The community is really friendly and accepting.


That's us with Teradata, but in fairness, it does everything we want from an operational standpoint really well. A new project might present a use case where we might not want TD as the first choice, but TD ends up doing it well enough anyway. And it's expensive, we're already paying for it, so damn it, we're going to use it.


I agree.

However, re: document store, I've been bouncing Postgres around in my head for that: all required columns (i.e., columns that always exist or will be queried on) are normal SQL columns, and everything else is stashed in one or more hstore columns.
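Something like this, as a sketch (names made up; jsonb would work just as well as hstore):

    CREATE EXTENSION IF NOT EXISTS hstore;

    CREATE TABLE docs (
        id     bigserial PRIMARY KEY,
        title  text NOT NULL,   -- always exists, queried on
        extras hstore           -- everything else
    );

    -- querying a non-column attribute; missing keys just return NULL
    SELECT * FROM docs WHERE extras -> 'color' = 'blue';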


> everything else is stashed in one or more hstore columns.

This has the same problem as most NoSQL solutions: You don't get the benefit of a schema, and schemas are damn useful things.


Not everything needs a schema.

The alternative I've also been bouncing around in my head is multiple tables: objects (rows in the "primary" table) have traits (rows in "secondary" tables with identical primary keys) that break out columns according to what they do.

Having 9000 columns in a single table is a bit unwieldy.
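A sketch of the trait idea (names made up): a shared primary key, one table per trait, joined in only when you actually need that trait:

    CREATE TABLE objects (
        id   bigint PRIMARY KEY,
        name text NOT NULL
    );

    CREATE TABLE geo_trait (
        id  bigint PRIMARY KEY REFERENCES objects(id),
        lat double precision,
        lon double precision
    );

    -- only objects that actually have the geo trait come back
    SELECT o.name, g.lat, g.lon
    FROM objects o
    JOIN geo_trait g USING (id);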


>Not everything needs a schema.

Everything does. Data without schema are just noise.

The real question is whether you'll enforce it at the DB level (and suffer some inflexibility) or at the code level (which, obviously, has to know what each field is and which fields are available, i.e. enforcing a schema at runtime) and accept potentially corrupted data, management overhead, etc.


> Everything does. Data without schema are just noise.

Love that concise adage. Will probably adopt it when explaining databases to novices.


There are times when you don't care about the schema, though. E.g. when the data is mostly just pass-through, and you need to get it from here to there. Sometimes you need just a tiny bit of schema to know where "there" is (when that's encoded in the data), but you don't really care about the rest at all.


To clarify, I meant exactly that: enforced at the DB level, or enforced in the code base.


Everything has a schema. There's no 'need' about it.


I guess you are talking about database normalization. And it is part of schema design. https://en.wikipedia.org/wiki/Database_normalization



>Not everything needs a schema.

Any good examples?


One example would be ingesting structured or semi-structured data from sources that you don't control.

You may know some invariants, but much can change without notice. So you want to be able to work with the structure you have without preventing non-conforming data from entering your system.

In some cases schema conformance is just delayed, in other cases it is never achieved completely or not even a goal.


> one example would be ingesting data from structured or semi-structured sources that you don't control

Can you give a more specific example?


For instance, we need to retrieve statistical data on various macro economic indicators from various statistics offices and international organisations. There is considerable overlap in the fields they use but it's rarely exact and often you can't merge them because they do not refer to the exact same entity or the data uses incompatible units. It's impossible to properly model all of it before storing it because so much changes all the time and it's all noisy and partly broken.

A similar thing happens when you retrieve data on securities and companies from various exchanges, from the SEC, from national registries all over the world or you try to include XBRL from different countries.

And then you often have documents (like quarterly reports) that contain structured fields and tables but not in a formally specified syntax. You don't know exactly what fields will be in those documents before you parse them. So you parse the documents, store key/value pairs, and then you clean them up gradually.
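To make that concrete (just a sketch, not our actual schema): a dumb key/value staging table is enough to explore what actually shows up in the documents before modelling anything properly:

    CREATE TABLE staged_fields (
        source      text,
        document_id text,
        key         text,
        value       text,
        loaded_at   timestamptz DEFAULT now()
    );

    -- see which keys actually occur before deciding what to model
    SELECT key, count(*) FROM staged_fields
    GROUP BY key ORDER BY count(*) DESC;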

There are tons of situations like this in data integration. It's a never ending cleanup and merge process. You can use RDBMS for all of that but they're not always the best tool for the job (but they are still my preferred tool most of the time).


Having worked on that sort of process many times, I'm of the opinion that a message queue is the ideal solution there, not a database. If you're storing the data for the purpose of processing it again later, it should probably be ephemeral and fast, rather than long-lived and flexible.


That doesn't work for us (beyond the first stage), because the fields we extract from the original source are not ephemeral.

We need to store the key/value pairs and explore them in a reasonably productive fashion (i.e using queries) in order to come up with machine learning algorithms. And any new algorithms we write need access to all historical data.


Metrics. A metric has a known source, a timestamp, a name and a value. It can also ship with an arbitrary number of descriptive fields.

Similarly, events.


The known source might very well be expressed as a relation between two entities: a metric entity and a source entity.

The source entity, more often than not, is also complemented with other data that needs to remain a part of the persistence layer.


Metadata can be quite variable. Library, catalogue, picture tags. The majority of terms are common, but some can be pretty specific and (as a developer) you'd need to store them. You might not have control over the schema or even have a "finite" set of possibilities.

Imagine you want to store random metadata from a digital camera picture, or perhaps even XML/HTML attributes. You can create another table and add each new attribute – join on query – but if you don't plan to search for that data directly, it's easier to skip normalisation and dump the original set into a JSON(B) or HStore field. You don't have to add every possible attribute to your data model or schema, you can carry data along and not analyse it if it's not relevant to you.
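For the camera example, something along these lines (a sketch; the field names are made up):

    CREATE TABLE photos (
        id       bigserial PRIMARY KEY,
        taken_at timestamptz,
        exif     jsonb          -- whatever the camera happened to emit
    );

    -- query an attribute you never modelled explicitly
    SELECT id FROM photos WHERE exif->>'LensModel' = 'EF50mm f/1.8 STM';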


At a previous company I worked at, there was a table that maxed out PostgreSQL's column limit. It did not need to be that wide. It was much better suited as a ~30 column table with a single hstore column (does the key exist? return the value; otherwise, null), as 99% of the values in those columns were completely empty, and PostgreSQL does not support sparse tables (the "right" solution here).


You can define arbitrary constraints of any level of complexity with Postgres for any column, be it plain int, hstore, jsonb or anything else. Most of that will also be efficiently indexable out of the box.
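For example (a sketch, assuming a hypothetical events table with a jsonb payload column):

    -- constrain a field inside the jsonb document
    ALTER TABLE events
        ADD CONSTRAINT amount_is_positive
        CHECK ((payload->>'amount')::numeric > 0);

    -- and an expression index to make lookups on that field fast
    CREATE INDEX ON events ((payload->>'amount'));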


You can totally do this; you can even go one further and put the value into a JSON column, which allows you to run queries on the value.


One slightly concerning thing about this is that the table at the end, describing what the popular NoSQL DBs do, is populated by (and here I quote): "The methodology used to identify the specific system properties consists of an in-depth analysis of publicly available documentation and literature on the systems. Furthermore, some properties had to be evaluated by researching the open-source code bases, personal communication with the developers as well as a meta-analysis of reports and benchmarks by practitioners."

Which, looking at the table, doesn't seem to include Aphyr's work (Jepsen), so much of what is being said may in fact be wrong. I'd prefer a chart populated from that (though admittedly such a chart would largely read "undefined", "inconsistent", etc.).


We tried our best to incorporate these results in the cases where they reflect fundamental shortcomings, not bugs (of which Aphyr found quite a few). But you are right, we could include a list of popular examples where description and experimental findings diverge. Which cases did you have in mind? MongoDB being CP?


And Redis, yeah. In general, seeing his work has made me very cagey about trusting any DB's marketing claims; I'd love to see citations for the resources behind each claim, as that would tell me "this has been tested by a third party" vs. "this has been claimed in marketing docs and YMMV; you should probably test it yourself".


This article will soon become outdated: CouchDB 2.0 is in release candidate stage, and will soon be released. It features built-in and automatic sharding and clustering and a new, MongoDB-inspired document query language...and lots of small improvements as well.

For more information, read the excellent blog posts from their official blog: https://blog.couchdb.org/


I'm also very curious about CouchDB 2.0 and whether CouchDB will be able to make a comeback. When I talked to Adam Kocoloski about a year ago, I got the impression that it takes a very good technical approach (consistent hashing, causal hash histories, MongoDB-style queries, etc.). However, the implementation is now mainly driven by Cloudant, and CouchDB is kept alive by them. I hope they manage to rekindle the old fascination it once had.

In any case I will be happy to update the article and include it.


Anyone looking for more information on CouchDB 2 redundancy should check out this article: http://docs.couchdb.org/en/1.6.1/cluster/theory.html

(Ignore that it's under /en/1.6.1, built-in clustering is 2+ only.)

I'm really excited about CouchDB 2.0.


This is pretty useful for someone like me who's been in large companies using Teradata forever, just to understand the NoSQL products out there and potential use cases. On any given day, I feel like I've fallen way behind on database technology simply because we're fairly locked in, admittedly very happy with TD. It's helpful to keep up here though, because we do have some projects we'd like to do that would probably work better outside of an RDBMS.

I'd love to see something similar on the various ETL/data pipeline tools. In that regard, we still write a ton of SAS code because we have a variety of sources, which SAS handles well; it's pretty solid if you need to do some data cleansing on the way into the target, and the code itself is very batch-friendly and maintainable. It's been a while since I've surveyed alternatives.


Great to see RethinkDB covered (although it would have been better if it had been included in the primary comparison tables too).

It feels like a second-generation NoSQL database where most of the drawbacks of earlier NoSQL databases have been taken care of!


I enjoyed this topic while researching New Data: a Field Guide [1]. In my research I've come to really enjoy ArangoDB. It has good support for documents, performance, relational data queries and basic graphs. Without doing the research I don't think I would have ever dug into the new tools. If you ever get a chance to do a small prototype, look into using these tools.

1- https://leanpub.com/NewDataAFieldGuide


I'd point out that there are arguments[1] that 'CA' is not a real choice you can make when it comes to the CAP theorem.

1: https://codahale.com/you-cant-sacrifice-partition-tolerance/


Not even a mention of ZODB. While it's true it's Python-only, it's certainly one of the most mature NoSQL databases around.


The context the author is describing is clustered databases; does ZODB offer a clustering option that is equally mature?


You mean like ZEO? Yes, it's probably been around for a couple of decades now.


I don't think so. I am referring to something that allows you to scale your database cluster horizontally by adding more servers. ZEO appears to only support a single server.


This is a fantastic overview of the issues surrounding DBs. Would love to see some deeper discussion of graph databases in the next version!


Felix, author of the article and founder of Baqend [1] here. I'm happy to answer any questions. If you have suggestions for the next version of the article, we're eager to hear them.

[1] http://www.baqend.com


Boo. It doesn't include a comparison against Google's NoSQL SaaS offerings (Cloud Datastore and Bigtable).

I personally use Cloud Datastore and find it a great fit; I wish there were some comparisons against it.



If you want an introduction to NoSQL, there is a good webinar at http://www.prohuddle.com/webinars/nosql/NoSQL_Distilled.php, which includes discussions, examples and comparisons.


> Redis, as the only non-partitioned system in this comparison ...

This is a bit misguided; Redis Cluster has been GA for over a year.


I agree, Redis Cluster has been available for a while. From a practical perspective we did not deem it production-ready yet, due to various shortcomings, e.g.:

- Redis Cluster being neither AP nor CP, with rather unsatisfying justifications [1]

- Strong scalability issues, for instance with Pub/Sub [2]

But you are right, I think Redis Cluster could be added as a separate system.

[1] https://aphyr.com/posts/283-jepsen-redis [2] https://github.com/antirez/redis/issues/2672


A couple of points, just for clarity:

1) It is true that Redis is not AP or CP, but this is not covered by [1], which covers Redis Sentinel (a previous version with different behavior compared to Sentinel v2, btw). So, while Redis Cluster is just an eventually consistent system that guarantees neither strong consistency nor availability (so it is not AP nor CP), the linked post is not related.

2) Redis Cluster is actually very scalable, since it is a flatly partitioned, non-proxied system. Pub/Sub is not very scalable in Redis Cluster, but Pub/Sub is a minor sub-system of Redis; most users look at the ability to scale the key-value store, which is... 90% of what Redis is, so I think your claim is not justified.

3) The fact of not being AP or CP does not mean that a system is not useful. It depends on the business requirements. In fact, most SQL databases + failover setups, which are what keep a sizable portion of the big services of the internet up, are not AP or CP either.

Database system features are related to use cases. I believe that the Redis Cluster properties cover a huge set of real-world use cases; in fact it is the most actively mentioned/requested feature right now, even if we document the tradeoffs and the actual behavior of the system in very clear terms. So people know what they want. IMHO, excluding systems because of what you think are acceptable tradeoffs is not a good idea.

Otherwise you should also mention that SQL+ Main_widely_used_failover_solutions systems have the same unacceptable shortcomings, for instance.

Actually, if the Redis Cluster implementation is made the center of future Redis development, it could easily become the most used sub-system of Redis, as Sentinel already is becoming. There are many devs who know what they are doing and know how to apply things depending on the requirements, even if those things are not CP or AP.


I was oversimplifying, of course; Redis Cluster definitely deserves a place in the comparison. We are actually very satisfied with using Redis in production, but had to work around the Pub/Sub limitations using keyspace notifications, which is why I'm biased on that point.

In any case, it's difficult in terms of presentation, since a "normal" master-slave-replicated Redis is a totally different system from Redis Cluster, as the distribution model of Redis Cluster also affects the functional properties to a large extent, e.g. regarding atomic MULTI/Lua blocks and all types of multi-key operations.

The failover solutions for relational database systems are a good point, they too should be included.

Regarding use cases, I totally agree: every system should be discussed in the light of the use cases it tries to solve. I personally think that a Redis Cluster with a choice for trading consistency against latency would open up a whole new range of use cases that are currently not a good fit for Redis Cluster. For example, we have a Redis-Coordinator project (not open source, yet) which behaves similarly to a scalable ZooKeeper (BTW also neither AP nor CP [1]). However, it has weaker guarantees, due to the lack of tuning knobs in Redis and Redis Cluster regarding consistency.

[1] https://martin.kleppmann.com/2015/05/11/please-stop-calling-...


That's a good idea. At the very least, add a note about it in the meantime, maybe. BTW, as a caching-only system Redis can scale exactly like memcache if your client supports it. I did this in one project and it worked great.

EDIT: Also, it might be too niche, but you can also have a look at redis+dynomite, which makes redis behave like a distributed database. https://github.com/Netflix/dynomite


Serious question: what is to stop everyone (where a K/V store is not enough) from just chucking some form of SQL server in a VM on a ramdisk and using that? I've always wondered about that idea.


Lost efficiency. If you want a semi-persistent relational DB, you can use BerkeleyDB. There are also embedded DBs for Java, and probably for other languages. If you do this, you're not wasting RAM on the VM and the guest OS; the DB gets its memory from the larger application it's embedded in.

edited for typo.


Slightly off topic, but as someone who's been out of touch with DB-using apps for a while: what's the latest on the ORM impedance mismatch? (I mean the guidelines or best practices.)


The TL;DR was TL;DR. No, seriously: about halfway through, the buzzwords and jargon got to be too much for me. Let me know if there's anything interesting in there.


Great work!



