Creating a Document-Store Hybrid in Postgres 9.5 (andyet.com)
65 points by craigkerstiens on Feb 5, 2016 | 38 comments



I'm often surprised at how many of the new web frameworks only support something like MongoDB. There's nothing wrong with MongoDB (as far as I know), but when I'm building something new, I'm planning for a long haul, and I expect to want/need real database features in the future, even if I don't need them today.

Given that PostgreSQL has JSON data types, and performance when using those data types is competitive with the dedicated JSON or document databases, I honestly can't think of any reason to use something else. Especially given the long reliability history, the large pool of expertise, the tooling, etc. of PostgreSQL vs. the other options.
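
For example, a jsonb column with a GIN index gets you indexed containment queries right next to ordinary relational columns (a minimal sketch; the table and field names are made up):

    CREATE TABLE events (
        id      bigserial PRIMARY KEY,
        created timestamptz NOT NULL DEFAULT now(),
        body    jsonb NOT NULL
    );
    CREATE INDEX events_body_idx ON events USING gin (body);

    -- find events whose JSON body contains {"type": "signup"}
    SELECT id, created FROM events WHERE body @> '{"type": "signup"}';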

Am I missing something in this assessment of the situation? I see so many smart people deploying web applications (in particular) on MongoDB and others...why are they choosing it over PostgreSQL?


These web frameworks are all about reducing learning curves and making developers productive as soon as possible. MongoDB is easy to learn and will be fine for some percentage of use cases. The developers behind these web frameworks are often not coming from an enterprise systems / traditional RDBMS background and are not necessarily aware of the devops trade-offs between MongoDB and a traditional RDBMS.

I work for a software vendor and this is a fairly typical scenario for customers of our dbShards product, who often have an app that starts to get traction and then requires a more reliable and extensible back-end database that is easier to apply schema changes to without downtime.


For us, it boiled down to multi-master writes, and global replication (multi-datacenters). Cassandra worked best for our use case. Moreover, as others have pointed out, high availability with fault tolerance is another major reason. CAP theorem is real, and there are trade-offs to be made in several use cases.


Still, with Cassandra you lose transactions, and you can't start "small", as small as you could with PostgreSQL. A 10,000-user service works fine with a database running on 512 MB of RAM, while Cassandra needs at least 1 GB (running on less might work, but mostly you will run into longer GC pauses; it's really hard to trim the JVM down below that. There are ways to do it, but it's much harder than just using something different).

And most people start with as few as 10,000 users. Scaling means starting from 1 and going to X, and PostgreSQL is good enough for most things. You don't need multi-master replication for most workloads, and most people don't need write scalability; for the things that do need some writes, they can easily cache (see Instagram).


Just off the top of my head:

- First class support for sharding & replication

- Pluggable storage engine

- Capped collection (i.e. tailable table/collection)

- A query language which doesn't boil down to string concatenation (easier to manipulate in a language of your choice, but harder for human-friendly ad-hoc query writing)

Not that PostgreSQL doesn't offer any of the above in some form or another; just that MongoDB does seem to make the first three easy to deploy and manage.


MongoDB easy; PostgreSQL hard. The latter requires extensive planning, which slows you down when features need to go out now.


Oh no! Someone needs to model their data before writing code??? The horror!

You're going to pay for getting the data model right eventually. The only question is - do you want to pay for it now or later? If you push it off until later, it's generally more painful to make changes. Or, you have to maintain a sub-optimal model for as long as the application exists.


I take it you must be new to software development.

Because I have never seen or heard of a situation where the data model has been designed upfront and never changed along the way. And you seem to be turning this into a black/white scenario. In almost all cases you do your best effort upfront and then iterate.

The advantage of these NoSQL/schemaless-style databases is that you can do this migration in your app instead of in your app + database. And unfortunately schema/data migration tools for databases really aren't that great, even after all these decades.


The disadvantage of NoSQL is that you do the migration in your app rather than in your database. This requires all programs that touch the database to know about every historic migration that ever happened, since there is still data in the database that uses the old schema.

With a SQL database, there is one source of truth: the current schema. It is not necessary to maintain historic information about all previous versions of the schema in every program that uses the database; they only have to know about the current version.
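
Concretely, a schema change is a one-off migration, and afterwards every program sees the same shape (sketch with hypothetical names):

    -- one-off migration; the schema itself records the new shape
    ALTER TABLE users ADD COLUMN display_name text;
    UPDATE users SET display_name = username WHERE display_name IS NULL;
    ALTER TABLE users ALTER COLUMN display_name SET NOT NULL;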


One problem with these schema-less style databases is that you still actually have a schema. You just don't have a good reference for what it is anymore since it's trapped in your code, likely in multiple files.


It is short-sighted to load everything in the application, because at best utilizing the data in another context is going to involve a major (possibly painful) code refactor. You are more tightly coupling your data to your application stack, which isn't the best idea in my experience.

As an aside, Postgres supports multiple schemas for a given database, so it is trivial to support multiple versions of a single table, provide legacy support via views, etc.
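
For example, the old layout can live on as a view in a legacy schema while the real table evolves underneath it (sketch; the names are hypothetical):

    CREATE SCHEMA legacy;
    -- old clients keep reading legacy.users while the canonical
    -- table in public grows new columns
    CREATE VIEW legacy.users AS
        SELECT id, username, email FROM public.users;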


> why are they choosing it over PostgreSQL

Because it's fast, portable and easy to use. Pretty much what most developers want.

People can bring up the tired "oh but you will lose your data" jokes but it really isn't that much worse than anything else on the market. They all have bugs at some point. And I really don't see how PostgreSQL is any more reliable or has a larger expertise pool/tooling etc. If you were talking about MySQL/Oracle I would agree but not PostgreSQL.


Most of this work, and much more, including compatibility with the MongoDB protocol (including replication) is already done in ToroDB (https://github.com/torodb/torodb).

Hope this self-plug is helpful :)


Neat little tutorial. One of the things I'd suggest to the author is to perhaps add in something on creating standard and materialized views for some of the more complicated queries, to show more of the interesting data manipulation options that postgresql gives you.


Interesting post, but I don't understand his way of deleting rows that aren't in data.json. Creating a separate table seems overly complex: why not just store a list of IDs in memory and delete any that aren't in that set? Creating a list of 100,000,000 integers in Python takes 900 MB (4.2 GB for a set!), which isn't that much. Stick it in one big 'DELETE WHERE id NOT IN (...)' query.

Or better yet just TRUNCATE beforehand.


Using NOT EXISTS() is a lot better than NOT IN(). The latter, as defined by the SQL standard, has confusing behaviour with NULLs. x NOT IN(y1, y2) is defined to expand to x <> y1 AND x <> y2. If any of these are NULL, the whole expression is NULL. Which means no rows match. Fun, ain't it?
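
Easy to see for yourself (a sketch you can paste into psql):

    SELECT 'found' WHERE 1 NOT IN (2, NULL);    -- no rows: 1 <> NULL is NULL
    SELECT 'found' WHERE NOT EXISTS (
        SELECT 1 FROM (VALUES (2), (NULL)) v(x) WHERE v.x = 1
    );                                          -- returns 'found'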


> Stick it in one big 'DELETE WHERE id NOT IN (...)' query.

I think that's a terrible idea, because it will scan the entire NOT IN (...) array for each row. Since it's NOT IN, it will even have to scan the whole thing every time. A temporary table is generally the way to go for unbounded lists of things; there's even support for it in PostgreSQL and the SQL standard.
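
A minimal sketch of that approach (table and column names are hypothetical):

    CREATE TEMP TABLE keep_ids (id bigint PRIMARY KEY);
    COPY keep_ids FROM STDIN;   -- or INSERT the ids pulled out of data.json
    ANALYZE keep_ids;

    DELETE FROM items i
    WHERE NOT EXISTS (SELECT 1 FROM keep_ids k WHERE k.id = i.id);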


I always assumed pg would be smart and use something with constant lookup time. This[1] is a query plan 9.5 generated for a table containing 10,000,000 sequential rows, executing "DELETE WHERE id IN (select * from generate_series(1, 1000000));". Maybe you're right: it spends 5.8 seconds on the nested loop. I've only used this for smaller volumes where an extra table might be overkill, but I guess with a lot more rows a temporary table is better.

1. http://tatiyants.com/pev/#/plans/plan_1454709615696


PG is doing the smart thing; it's optimizing for the common case of short IN/NOT IN literals in SQL queries. Simple arrays have much better constant factors.

If the clause is a subselect (like yours) and is also an actual table (unlike yours), it will be smart and rewrite it as a JOIN, which will generally have better complexity if the indexes are correct.
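
For example, when the ids to delete live in a real table you can write the delete as a join directly (sketch with hypothetical table names):

    -- the planner can hash or merge join stale_ids against items
    DELETE FROM items i
    USING stale_ids s
    WHERE i.id = s.id;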


You're right there. This was a naive stab at it that I hadn't properly thought through. I'll try to update the post this weekend.


Only thing I would caution here is that Postgres HA is notoriously difficult to set up and get right. Something like CouchDB is designed from the ground up for high availability and scalability. So if your design calls for only a document store, probably best to avoid Postgres.


This is flat-out misleading. PostgreSQL is also designed "from the ground up for high availability and scalability," it's just optimized for a different -- and far more common and useful -- set of system assumptions and constraints. Unfortunately those assumptions and constraints don't make good blog or product marketing material copy. "We have a typical LOB app," "our data needs are nothing extraordinary," etc.


Postgres isn't designed from the ground up for high availability or scalability. Most of the in-built functionality for setting up HA wasn't introduced until 9.0, and the process of setting up HA postgres with automatic failover requires third-party tools. As for scalability, postgres can only scale vertically.

If you need a solution that can scale writes to more than one node, or a solution that has first-party support for HA with automatic failover, you shouldn't be using postgres.

As an aside, I find the dogmatic "just use postgres for everything" choir just as bad as the marketing BS associated with noSQL databases.


Unless you are using synchronous replication (locks up transactions if there is an issue with replication) there's no safe way to do automatic failover with an RDBMS without using shared storage (expensive, pain in the butt).
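
For reference, enabling synchronous replication is just a couple of settings (the standby name here is made up):

    -- commits wait for at least this standby to confirm the WAL
    ALTER SYSTEM SET synchronous_standby_names = 'standby1';
    ALTER SYSTEM SET synchronous_commit = 'on';
    SELECT pg_reload_conf();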

Corosync and pacemaker make setting up automatic failover with shared storage relatively painless. I don't do it this way personally because shared storage is a giant pain.


"High availability" and "high availability with automatic failover" are not the same thing. (The tricky bit is actually the un-failing-over part.) I assure you that people were setting up high availability systems with PostgreSQL 7.x in 2004 or so.

Very few organizations need to scale their database in any other direction than vertically. If it works for Stack Overflow, it's certainly likely to work for your application.


It's also a lack of understanding of the problem you're trying to solve.

For example at my previous job we had a dedicated data store for performing lookups such as ip -> zip code and latitude/longitude -> zip code.

The company decided to use Oracle Coherence and store all the data in memory, because it would be fast. To store all of that information they needed 16 m3.medium machines.

Last year they had great success optimizing it, because they managed to replace the 16 m3.medium machines with just 3 c3.2xlarge machines running MongoDB (the data was ~12 GB).

I did a POC and put the data in PostgreSQL with proper columns and indices (I just needed to install ip4range and PostGIS), and the whole dataset fit in 600 MB! The queries took at most 2 ms on a cold cache but were generally sub-millisecond, because all the data fit in RAM.
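
Roughly like the following, although this sketch uses only the built-in int8range type rather than ip4r/PostGIS, so treat it as an approximation:

    CREATE TABLE ip_to_zip (
        ip_range int8range NOT NULL,   -- IPv4 addresses encoded as bigints
        zip      text      NOT NULL
    );
    CREATE INDEX ip_to_zip_range_idx ON ip_to_zip USING gist (ip_range);

    -- zip code for a single address (192.0.2.235 as an integer)
    SELECT zip FROM ip_to_zip WHERE ip_range @> 3221226219::int8;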


It seems like more of a problem with the people than the technologies.

Why would you need 16 m3.medium machines (60 GB RAM total) to store 600 MB of data? I've used Oracle Coherence and other grid technologies and something doesn't sound right here. It is just a couple of distributed Java HashMaps we are talking about here.

Likewise, if 600 MB of data expands to 12 GB in MongoDB, then something is very, very wrong with the design of your schema.


Yes, the problem was the people. Too much politics, and that's why I left.

Why more data? In the case of IP geolocation, neither of those technologies understood IPs, not to mention being able to create a proper index for ranges.

So in the case of Mongo, to get good performance they decided to generate every possible IPv4 address and map it to a zip code. To increase efficiency they stored every IP as a 64-bit integer.

In Coherence they did the same thing, but I guess less efficiently (I did not look at how it was done, since at the time Coherence was in the process of being eliminated). I'm guessing maybe they stored it as a string?

Also note that Coherence is a distributed cache that is supposed to withstand a couple of nodes going down, so a lot of data was duplicated.


> It seems more of a problem with the people than technologies.

Isn't that what "lack of understanding of the problem you're trying to solve" means?

Though in this case, if the problem had "IP4 ranges" and "geographic data and computation functions" in its scope, then MongoDB is quite inadequate compared to PostgreSQL.


> Very few organizations need to scale their database in any other direction than vertically

Almost every major organisation and plenty of startups/SMEs are investing big in analytics programs. And whilst they don't have massive data sets, they expect real-time performance, so being able to scale horizontally is important.

If you could vertically scale I/O that would be one thing but you can't.


That's a bunch of read-only replicas though; RDBMSes can do that. Or if it's really big data, then we are talking about Redshift, Greenplum, or Teradata.

There's no need for something like CouchDB.


PL/Proxy?


My best attempt at PostgreSQL marketing material copy. :-) http://www.brightball.com/postgresql/why-should-you-learn-po...


Good stuff, except LISTEN/NOTIFY: it's too dependent on the driver. I mean, some drivers need some kind of polling, which is awful. Mostly you can get around this issue by opening another connection, without going through your database pool, and building a simple driver just for LISTEN/NOTIFY (JDBC, I'm looking at you).
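
The SQL side is trivial; it's the delivery to the client that varies by driver (channel and payload here are made up):

    -- in the dedicated listening session, outside the pool
    LISTEN cache_invalidation;

    -- in any other session, e.g. from a trigger or app code
    NOTIFY cache_invalidation, 'users:42';
    -- or with a computed payload
    SELECT pg_notify('cache_invalidation', 'users:' || 42::text);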


Most of the uses of it that I've seen have taken this approach with a listener on a thread using a different connection.


PostgreSQL HA is not that difficult to set up; repmgr makes the task pretty easy. If you are talking about automatic failover then that's another story entirely, because it is simply not appropriate for relational databases unless you are using shared storage (expensive, painful) - but if you want to go that route, corosync and pacemaker have your back.


PG HA really isn't difficult to set up.

You have to know about your environment and how you use the DB, but if that's the limiting factor then I'm really not sure why you'd try to set up HA.


I've read many answers to your comment saying "but it's not that hard". So to all those people: please write a blog post, or send a link to one, that explains how to set up an HA cluster with automatic failover. I spent some time trying to set up a simple pg cluster once (not even HA, just duplicated writes) and it was anything but straightforward.



