I remember when the Bigtable paper was released. It was very early in my career and I remember it sounding so alien to me. Sure, I had Memcached in my stack, but no SQL? It seemed like something they had to trade off to be able to build the kind of services they offer. I felt the same way after reading Dynamo.
Sure, I thought a lot about data design. I thought about usage patterns to inform how we denormalize. And I grew into using, e.g., Gearman to pre-compute dozens of tables every night. I evolved, a bit.
But a few years ago, a little before this OP was written, I had a great experience with some Facebook engineers and had an a-ha moment that has made me a much better software engineer. Basically, I realized that I needed to let my data be itself. If I have inherently relational data, then it should be in a relational database. But I've built EAVs, queues, heaps, lists, all of these on top of MySQL and Postgres. Let that data be itself.
We have more options now than ever before. K/V stores, column stores, etc. I use a lot of Cassandra. A lot of Redis. Some Mongo. And I put a lot more in flat files than I ever thought I would.
I know a lot of people who are smarter than me left the womb knowing these things. But for me it was transformational and has made me much happier. I realized how much energy I wasted fighting my own tools.
Does this mean that in a single project you might use two different storage solutions? A SQL DB for relational data and NoSQL for queues, heaps, etc.?
I'm asking because for me it's a bit of a paradigm shift. I have used plain files and a SQL DB in the same project before, but never two different databases.
The other two commenters here beat me to it, but yes, absolutely. But be sure to right-size. If you don't have scalability issues, then your life is easier. Postgres has a k/v store option. Use that. Get more complex if you need to.
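If it helps, here's a minimal sketch of that k/v option, assuming Postgres's hstore extension (JSONB is the other common route). The DSN, table, and keys are all made up, via psycopg2:

```python
# A minimal "Postgres as a k/v store" sketch, assuming the hstore extension.
import psycopg2
import psycopg2.extras

conn = psycopg2.connect("dbname=app")  # hypothetical DSN
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS hstore")
psycopg2.extras.register_hstore(conn)  # adapt Python dicts <-> hstore

with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS kv_demo (
            id    serial PRIMARY KEY,
            attrs hstore
        )
    """)
    # Store an arbitrary bag of key/value pairs in one column.
    cur.execute("INSERT INTO kv_demo (attrs) VALUES (%s)",
                ({"theme": "dark", "lang": "en"},))
    cur.execute("SELECT attrs -> 'theme' FROM kv_demo")
    print(cur.fetchone())
```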
It's also perfectly acceptable to duplicate data -- store everything in Postgres or MySQL as your "master", cache all query results in Redis to avoid hitting the database whenever possible, and then build out lists/indexes/copies in a NoSQL database. Because yeah, corruption happens and it's great to be able to wipe and rebuild.
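As a rough illustration of that duplicate-and-rebuild setup, a cache-aside sketch in Python; the DSN, table, key format, and TTL are all made up:

```python
# Cache-aside sketch: Postgres is the "master", Redis is a disposable
# cache you can wipe and rebuild at any time.
import json

import psycopg2
import redis

r = redis.Redis()
pg = psycopg2.connect("dbname=app")

def get_item(item_id):
    key = f"item:{item_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)          # cache hit
    with pg, pg.cursor() as cur:           # cache miss: go to the master
        cur.execute("SELECT id, title FROM items WHERE id = %s", (item_id,))
        row = cur.fetchone()
    if row is None:
        return None
    item = {"id": row[0], "title": row[1]}
    r.set(key, json.dumps(item), ex=300)   # short TTL; safe to wipe anytime
    return item
```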
A common pattern would be... I dunno... think of a newsfeed.
* Each item is one entry in an "items" table, maybe with a "likes" int column.
* For each item, you have a sorted set in Redis of likers.
* For every user, you have a sorted set for their newsfeed, with the IDs of each item.
* When they hit the page, you load that sorted set, then query your database for the first 20 items by ID (so it's a primary-key lookup, and it probably won't even need to hit the database, because your Redis cache layer will already have that ID in cache).
* You'd use Redis's MGET (multi-get) to reduce round-trip calls.
* When somebody "likes" something, you write a job to a queue, which will eventually add the liker to the bottom of the sorted set, then probably store the details of the Like itself (timestamp, user, etc.) in a slower datastore, which could even be a transaction log that you'd never read again unless you had some significant issue.
I've never actually built a newsfeed, so there are certainly issues here, but maybe it will be helpful as a general idea. Remember, somebody (I can't remember whom) said "The only truly hard things in computer science are cache invalidation and naming things." Cache invalidation (knowing when to go back to source data because your cache is stale) is still an "analog" problem. By that I mean there are a thousand shades of grey, and you have to pick a strategy that works for you.
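To make the shape of it concrete, here's a rough Python sketch of that read path and the like-queue write path, assuming redis-py. Every key name, the page size, and `load_items_by_id` are hypothetical:

```python
import json

import redis

r = redis.Redis()
PAGE = 20

def load_items_by_id(ids):
    # Stand-in for the primary-key SELECT against the "items" table,
    # which would also backfill the Redis cache.
    return [{"id": i} for i in ids]

def feed_page(user_id, offset=0):
    # Newest-first slice of the user's feed sorted set (members = item IDs).
    ids = r.zrevrange(f"feed:{user_id}", offset, offset + PAGE - 1)
    if not ids:
        return []
    # One MGET instead of 20 GETs, to cut round trips.
    cached = r.mget([f"item:{i.decode()}" for i in ids])
    items, misses = [], []
    for item_id, raw in zip(ids, cached):
        if raw is not None:
            items.append(json.loads(raw))
        else:
            misses.append(int(item_id))
    if misses:
        items.extend(load_items_by_id(misses))
    return items

def on_like(user_id, item_id):
    # Fire-and-forget: a worker later does ZADD on a likers set and writes
    # the Like details (timestamp, user, ...) to the slower datastore.
    r.rpush("jobs:likes", json.dumps({"user": user_id, "item": item_id}))
```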
In the stack I am working on, we have a variety of databases, each serving a different type of data storage:
- Memcached: caching data that doesn't need to persist
- Redis: caching data that needs to be persisted short term but not long term (e.g. sessions)
- MySQL: for user-like data (account details, addresses, projects, ...)
- DynamoDB: for millions of data points that only need to be queried in one dimension, so they are not related or compared to one another, e.g. give me all values from this table with a given datatype, between two dates (see the sketch after this list)
- MongoDB: for millions of data points that need to be queried at deeper levels
- etc.
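For the DynamoDB case, a hedged boto3 sketch of that one-dimensional query, assuming a made-up "datapoints" table with `datatype` as the partition key and an ISO-date `ts` sort key:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("datapoints")

# One dimension: everything of a given datatype between two dates.
resp = table.query(
    KeyConditionExpression=(
        Key("datatype").eq("temperature")
        & Key("ts").between("2014-01-01", "2014-02-01")
    )
)
points = resp["Items"]
```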
Great, thank you for the detailed overview. If you don't mind me asking, do you ever end up having problems with consistency? (I imagine it could happen if some data was written to two databases and the second one rejected the transaction.)
Well, the downside of using multiple DB stores is that the logic of keeping everything consistent is in the hands of the developer. So you have to make sure that everything is written correctly.
For instance, if you write to MySQL and Mongo, but Mongo is down, you'll either have to queue the data item somewhere for a write once the system is back up, or have a migration system in place that takes everything from MySQL since the downtime and writes it back to Mongo.
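A sketch of that "queue it for later" fallback, assuming MySQL has already committed and Mongo is the secondary write; the connection settings, collection, and retry key are all illustrative:

```python
import json

import redis
from pymongo import MongoClient, errors

r = redis.Redis()
mongo = MongoClient(serverSelectionTimeoutMS=2000)

def mirror_to_mongo(doc):
    try:
        mongo.app.items.insert_one(dict(doc))  # copy: insert_one adds _id
    except errors.PyMongoError:
        # Mongo is down or rejected the write; a retry worker replays
        # this queue once the system is back up.
        r.rpush("retry:mongo", json.dumps(doc, default=str))
```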
Depending on the type of data, we have a few mitigating factors: for some data stores it is not that big of a deal if a write to the second layer (e.g. the cache) fails, as we can rewrite it the next time it is requested from the first layer.
Yes. Always use the right tool for the job. It's okay to have some duplication (an object stored in Postgres and also part of a Mongo doc or cached in Redis). Don't forget storage is cheap.