I remember when the Bigtable paper was released. It was very early in my career and I remember it sounding so alien to me. Sure, I had Memcached in my stack, but no SQL? It seemed like something they had to trade off to be able to build the kind of services they offer. I felt the same way after reading Dynamo.
Sure, I thought a lot about data design. I thought about usage patterns to inform how we denormalize. And I grew into using, e.g., Gearman to pre-compute dozens of tables every night. I evolved, a bit.
But a few years ago, a little before this OP was written, I had a great experience with some Facebook engineers and had an a-ha moment that has made me a much better software engineer. Basically, I realized that I needed to let my data be itself. If I have inherently relational data, then it should be in a relational database. But I've built EAVs, queues, heaps, lists, all of these on top of MySQL and Postgres. Let that data be itself.
We have more options now than ever before. K/V stores, column stores, etc. I use a lot of Cassandra. A lot of Redis. Some Mongo. And I put a lot more in flat files than I ever thought I would.
I know a lot of people who are smarter than me left the womb knowing these things. But for me it was transformational and has made me much happier. I realized how much energy I wasted fighting my own tools.
Does this mean that in a single project you might use two different storage solutions? A SQL DB for relational data and NoSQL for queues, heaps, etc.?
I'm asking because for me it's a bit of a paradigm shift. I have used plain files and a SQL DB in the same project before, but never two different databases.
The other two commenters here beat me to it, but yes, absolutely. But be sure to right-size. If you don't have scalability issues, then your life is easier. Postgres has a k/v store option. Use that. Get more complex if you need to.
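If it helps, here's a minimal sketch of that k/v option, assuming Postgres's hstore extension (JSONB is the other common route). The DSN, table, and keys are all made up, via psycopg2:

```python
# A minimal "Postgres as a k/v store" sketch, assuming the hstore extension.
import psycopg2
import psycopg2.extras

conn = psycopg2.connect("dbname=app")  # hypothetical DSN
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS hstore")
psycopg2.extras.register_hstore(conn)  # adapt Python dicts <-> hstore

with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS kv_demo (
            id    serial PRIMARY KEY,
            attrs hstore
        )
    """)
    # Store an arbitrary bag of key/value pairs in one column.
    cur.execute("INSERT INTO kv_demo (attrs) VALUES (%s)",
                ({"theme": "dark", "lang": "en"},))
    cur.execute("SELECT attrs -> 'theme' FROM kv_demo")
    print(cur.fetchone())
```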
It's also perfectly acceptable to duplicate data -- store everything in Postgres or MySQL as your "master", cache all query results in Redis to avoid hitting the database whenever possible, and then build out lists/indexes/copies in a NoSQL database. Because yeah, corruption happens and it's great to be able to wipe and rebuild.
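As a rough illustration of that duplicate-and-rebuild setup, a cache-aside sketch in Python; the DSN, table, key format, and TTL are all made up:

```python
# Cache-aside sketch: Postgres is the "master", Redis is a disposable
# cache you can wipe and rebuild at any time.
import json

import psycopg2
import redis

r = redis.Redis()
pg = psycopg2.connect("dbname=app")

def get_item(item_id):
    key = f"item:{item_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)          # cache hit
    with pg, pg.cursor() as cur:           # cache miss: go to the master
        cur.execute("SELECT id, title FROM items WHERE id = %s", (item_id,))
        row = cur.fetchone()
    if row is None:
        return None
    item = {"id": row[0], "title": row[1]}
    r.set(key, json.dumps(item), ex=300)   # short TTL; safe to wipe anytime
    return item
```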
A common pattern would be... I dunno... think of a newsfeed.
* Each item is one entry in an "items" table, maybe with a "likes" int column.
* For each item, you have a sorted set in Redis of likers.
* For every user, you have a sorted set for their newsfeed, with the IDs of each item.
* When they hit the page, you load that sorted set, then query your database for the first 20 items by ID (so it's a primary-key lookup, and it probably won't even need to hit the database, because your Redis cache layer will already have that ID in cache).
* You'd use Redis's MGET (multi-get) to reduce round-trip calls.
* When somebody "likes" something, you write a job to a queue, which will eventually add the liker to the bottom of the sorted set, then probably store the details of the Like itself (timestamp, user, etc.) in a slower datastore, which could even be a transaction log that you'd never read again unless you had some significant issue.
I've never actually built a newsfeed, so there are certainly issues here, but maybe it will be helpful as a general idea. Remember, somebody (I can't remember whom) said "The only truly hard things in computer science are cache invalidation and naming things." Cache invalidation (knowing when to go back to source data because your cache is stale) is still an "analog" problem. By that I mean there are a thousand shades of grey, and you have to pick a strategy that works for you.
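To make the shape of it concrete, here's a rough Python sketch of that read path and the like-queue write path, assuming redis-py. Every key name, the page size, and `load_items_by_id` are hypothetical:

```python
import json

import redis

r = redis.Redis()
PAGE = 20

def load_items_by_id(ids):
    # Stand-in for the primary-key SELECT against the "items" table,
    # which would also backfill the Redis cache.
    return [{"id": i} for i in ids]

def feed_page(user_id, offset=0):
    # Newest-first slice of the user's feed sorted set (members = item IDs).
    ids = r.zrevrange(f"feed:{user_id}", offset, offset + PAGE - 1)
    if not ids:
        return []
    # One MGET instead of 20 GETs, to cut round trips.
    cached = r.mget([f"item:{i.decode()}" for i in ids])
    items, misses = [], []
    for item_id, raw in zip(ids, cached):
        if raw is not None:
            items.append(json.loads(raw))
        else:
            misses.append(int(item_id))
    if misses:
        items.extend(load_items_by_id(misses))
    return items

def on_like(user_id, item_id):
    # Fire-and-forget: a worker later does ZADD on a likers set and writes
    # the Like details (timestamp, user, ...) to the slower datastore.
    r.rpush("jobs:likes", json.dumps({"user": user_id, "item": item_id}))
```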
In the stack I am working on, we have a variety of databases, each serving a different type of data storage:
- Memcached: caching data that doesn't need to persist
- Redis: caching data that needs to be persisted short term but not long term (e.g. sessions)
- MySQL: for user-like data (account details, addresses, projects, ...)
- DynamoDB: for millions of data points that only need to be queried in one dimension, so they are not related or compared to one another, e.g. give me all values from this table with a given datatype, between two dates (see the sketch after this list)
- MongoDB: for millions of data points that need to be queried at deeper levels
- etc.
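For the DynamoDB case, a hedged boto3 sketch of that one-dimensional query, assuming a made-up "datapoints" table with `datatype` as the partition key and an ISO-date `ts` sort key:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("datapoints")

# One dimension: everything of a given datatype between two dates.
resp = table.query(
    KeyConditionExpression=(
        Key("datatype").eq("temperature")
        & Key("ts").between("2014-01-01", "2014-02-01")
    )
)
points = resp["Items"]
```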
Great, thank you for the detailed overview. If you don't mind me asking, do you ever end up having problems with consistency? (I imagine it could happen if some data was written to two databases and the second one rejected the transaction.)
Well, the downside of using multiple DB stores is that the logic of keeping everything consistent is in the hands of the developer. So you have to make sure that everything is written correctly.
For instance, if you write to MySQL and Mongo, but Mongo is down, you'll either have to queue the data item somewhere for a write once the system is back up, or have a migration system in place that takes everything from MySQL since the downtime and writes it back to Mongo.
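A sketch of that "queue it for later" fallback, assuming MySQL has already committed and Mongo is the secondary write; the connection settings, collection, and retry key are all illustrative:

```python
import json

import redis
from pymongo import MongoClient, errors

r = redis.Redis()
mongo = MongoClient(serverSelectionTimeoutMS=2000)

def mirror_to_mongo(doc):
    try:
        mongo.app.items.insert_one(dict(doc))  # copy: insert_one adds _id
    except errors.PyMongoError:
        # Mongo is down or rejected the write; a retry worker replays
        # this queue once the system is back up.
        r.rpush("retry:mongo", json.dumps(doc, default=str))
```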
Depending on the type of data, we have a few mitigating factors: for some data stores it is not that big of a deal if a write to the second layer (e.g. the cache) fails, as we can rewrite it the next time it is requested from the first layer.
Yes. Always use the right tool for the job. It's okay to have some duplication (an object stored in Postgres and also part of a Mongo doc or cached in Redis). Don't forget storage is cheap.