When I read posts like this all but confirming MongoDB isn't the great product 10Gen makes it out to be, I wonder how the heck MongoDB is still even relevant. Then I remind myself that 10Gen has one of the best marketing and sales teams in the game at the moment.
While MongoDB has improved greatly over previous versions, I can't help but feel that if 10Gen put as much effort into improving their product as they do into selling it, Mongo would be a force to be reckoned with!
MongoDB is good at some things, but I think most people that try and fail with it fall into one of two camps: 10Gen sold them into it or they bought into the hype without assessing project requirements and ensuring MongoDB was a sensible choice.
There is one thing MongoDB does spectacularly well - you can feed in arbitrary JSON and get the same JSON back out. (No need to define schemas or play any kind of system/db administrator.) Even the queries have the same "shape" as JSON, so no need for another arbitrary query language.
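To illustrate the "queries have the same shape as JSON" point, here is a minimal sketch in plain Python, with dicts standing in for BSON documents. The song data is made up; in real MongoDB you'd write roughly `db.songs.find({"artist": ...})`, and the filter would have exactly this shape:

```python
# Documents and queries share the same JSON-like shape: a query is just a
# partial document listing the key/value pairs that must match.

songs = [
    {"artist": "Nina Simone", "title": "Sinnerman", "year": 1965},
    {"artist": "Miles Davis", "title": "So What", "year": 1959},
]

def matches(doc, query):
    """True if every key/value pair in the query appears in the document."""
    return all(doc.get(k) == v for k, v in query.items())

query = {"artist": "Nina Simone"}  # the filter mirrors the document shape
results = [d for d in songs if matches(d, query)]
print(results[0]["title"])  # Sinnerman
```

No schema was declared anywhere above, which is exactly the low-friction experience being described.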
It will eventually bite you, and bite you hard. But you'll be well into the millions of records before that happens. Developers and products below that number will have very smooth sailing. And some live there permanently. One project I worked on years ago involved a music catalogue. Did you know there are only about 20 million songs?
The main problem is that things get very painful as you get bigger, especially for writes. A doubling of write activity can lead to calamitous drops in performance. This is especially bizarre because the data model means they could easily support multiple concurrent writers. Heck, even a lock per 2GB data file would quickly help with concurrency.
They take this same "single" approach in other places. For example, building an index is single-threaded. I did a restore the other day and then had to wait 8 days while it rebuilt indexes. One CPU was pegged but everything else was idle!
It's telling that those Jira tickets are both in the top five most-voted open server tickets, and that they date from 2010 and before. (I know you know this, Roger, but) TokuMX completely resolves both of them.
But does TokuMX solve all the other issues, like write concurrency, eye-wateringly slow (and single-threaded) index building, or there being no practical way to reduce the file sizes? (repairDatabase takes forever and requires as much free space as is already in use, and compact also needs free space and doesn't remove any files.)
Write concurrency: yes, TokuMX does not have a database-level reader/writer lock.
Index Building: yes, fractal trees can write data much more efficiently, so if index building is a problem, I bet TokuMX solves it.
Practically reducing file size: to be honest, I'm not sure, because thanks to our strong compression this hasn't been a general issue for our users. Our reindex command could reduce file size, but I can't point to examples.
One of our big goals is to address storage issues MongoDB has.
Yes. It does bite you quickly. As you add more models you start to duplicate a lot more information, and by that point you'd think relational makes sense, but you have to keep using MongoDB. Your only options are to embed or to reference. And still, there is no JOIN in Mongo, so you end up iterating over many collections and combining the results in your application code.
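The "combine within your application code" part looks roughly like this; a minimal sketch in plain Python, with hypothetical "users" and "orders" collections standing in for documents already fetched from MongoDB:

```python
# Without server-side JOINs, the app fetches from each collection
# separately and stitches the results together itself.

users = [
    {"_id": 1, "name": "alice"},
    {"_id": 2, "name": "bob"},
]
orders = [
    {"user_id": 1, "total": 30},
    {"user_id": 1, "total": 12},
    {"user_id": 2, "total": 7},
]

# Step 1: index one side by the reference key (what SQL would do for you).
users_by_id = {u["_id"]: u for u in users}

# Step 2: walk the other side and attach the referenced document's fields.
joined = [
    {"name": users_by_id[o["user_id"]]["name"], "total": o["total"]}
    for o in orders
]
print(joined)
```

Every app that references rather than embeds ends up writing some variant of this by hand, which is exactly the pain being described.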
I think as PostgreSQL continues to improve its JSON data type, people will look at SQL again, even if they just need to get a basic model working. Because in the end, working with constraints can help. Well, either side will bite you, but one has to weigh the trade-offs... and sure, that's a tough question.
Personally, I think the company formerly known as 10gen is doing a lot of good work, particularly with the aggregation framework, but it's orthogonal to the improvements we've made with the storage subsystem.
Like any other NoSQL datastore, things like MongoDB aren't as general-purpose as an RDBMS, no matter how the marketing and sales teams make them out to be. Cassandra is no different.
People need to read the documentation (say what you will about mongo but 10Gen docs are pretty damn good) more thoroughly before blindly implementing stuff on NoSQL and complaining. (I've been guilty of this in the past.)
For me anyone that uses the term "NoSQL" to group completely different databases simply doesn't understand what they are talking about.
Cassandra and MongoDB are completely fine as general purpose databases. With CQL3 Cassandra is just as easy to use as any RDBMS with the advantage of being infinitely more scalable and easier to manage.
> Cassandra and MongoDB are completely fine as general purpose databases.
Whenever I hear this, I feel that the person saying it either doesn't have enough experience or that every problem to them looks like a nail for their particular flavor of NoSQL. That's like saying a street bike is a general purpose bicycle. I'm sure someone out there can make it work on an outdoor offroad trail, but is it ideal?
Cassandra is really awesome at writes, but reads are really expensive. Atomicity at the row level only (unless you want that 30% performance hit) and eventual consistency make it not great for real-time work. CQL3 is an improvement, but it's still not as good as ANSI SQL. I still have to do a lot of extra work.
MongoDB does really well with tree-based data. The second you try to do join-style queries outside of that document tree... major performance hit. (Maybe this has changed, knowing the speed Mongo evolves.) When data is larger than RAM... major performance hit... not so great for something like logging.
Using an RDBMS for everything isn't great, but the pain you suffer is a lot less, especially since these days more and more of them have some sort of federation or clustering solution. That said, some NoSQL datastores are better than an RDBMS for certain applications, just not for all of them.
Reads definitely aren't cheap, not compared to the hyper-optimized write path. Cassandra reads can however be exploited to make a lot of common time-series reads extremely cheap by amortizing the cost of the read over a lot of returned data (think a clustered row with lots of byte-sorted columns). Not that RDBMSes can't do the same thing, but it's a complex challenge to linearly scale and maintain availability.
Cassandra also exploits SSDs quite well by avoiding write-amplification problems, given that all write IO is sequential only.
I'm a huge fan of both traditional RDBMSes and alternative data stores. The most important rule when trying out data stores is not to bring one into work when it's not appropriate, and to evaluate the hell out of any solution before you write a single line of code.
Most of the time, PostgreSQL will do 99.9% of what you need and then some.
> This is getting old, and is just plain untrue. Reads are more expensive than the extremely cheap writes, but they really aren't expensive.
It has improved a lot over time, but it isn't a myth: it's not good out of the box. You have to spend some time on configuration and mulling over the schema to mitigate it, which is something you don't have to think about as much with other data stores.
I agree that Cassandra CAN be used for basically everything a RDBMS can, but to claim it can be used just as easily is a stretch (based on my experience migrating from PostgreSQL to Cassandra). It lacks many administration tools, and for data with complex, non-hierarchical relationships, data modeling is more complicated and you must do any joins yourself. Sure you can get major performance and scalability gains if you use it correctly, but you also give up being fully ACID.
As a developer, using C* for all data in any complex application is going to be harder up front than using an RDBMS. From what I've heard and learned about C*, though, it is less painful to use C* for a highly performant/scalable database than to try to scale most or all RDBMSs. From what I have read, Mongo is also hard to scale.
> 10Gen have one of the best marketing and sales teams in the game at the moment.
No doubt. I have 2 MongoDB mugs even though I vowed never to touch a product of theirs again (that was after the whole "we'll throw your write requests over the wall and pray" fiasco).
Knowing some of the FC people first hand, MongoDB did actually serve them fairly well for a substantial amount of time. Until they started hitting max limits, it didn't really make sense to move to C*.
Going straight to C* or something like it would have been almost cargo-cultish (e.g. "if we build industrial strength, we will get industrial levels of traffic").
It sounds like you're assuming that productivity and power are opposites. Four years ago or even two, Cassandra was much harder to develop against than MongoDB, but that's not the case today.
Exactly. Can you imagine all of the Java/C developers telling Ruby developers that they are idiots and don't know anything about programming? Simply because they chose a technology that is designed for developer productivity at the expense of scalability.
Because that's what seems to happen in every database discussion.
I'm not disagreeing with you, but this is not what the article is about:
"MongoDB was not a mistake. It let us iterate rapidly and scaled reasonably well. Our mistake was when we decided to put off fixing MongoDB’s deployment, instead vertically scaling (...) By the time we had cycles to spend, it was too late to shard effectively. It would have been terribly painful and unacceptably slowed our cluster for days."
It seems they weren't sharding their data. The advantage of popular NoSQL databases like MongoDB is that they allow easier horizontal scaling than general-purpose RDBMSs (though this is debatable: Postgres and Oracle allow you to make the same trade-offs as NoSQL databases, they just don't force you to).
When you read the rest of the article, they explain that when they had to make a painful transition anyway, they chose Cassandra, with a different set of advantages and disadvantages that suited their needs.
MongoDB is still relevant because it is well supported. The product is well documented, excellent tooling is available, it is widely adopted, so that when you have a question, you can often google the answer.
It has an elegant query language; in particular, the built-in aggregation framework is far more convenient than having to write map-reduce functions for every query. It is easy to deploy and use. All these things make it a product that is pleasant to use from the point of view of a developer or DBA. I think you underestimate the importance of non-technical advantages and disadvantages.
I am just not convinced it is the best database, because I just don't see a use case for a 'general purpose' NoSQL database. For general-purpose storage, RDBMSs like Postgres and Oracle are great. They support sharding if you really want it, and even allow indexing of unstructured data these days. They don't force you to use joins and transactions if you don't need them, but at least they support those features when you do.
As someone who works extensively in "Enterprise IT" it's the unifying factor in most successful middleware companies. I'm working with a CMS that has barely updated a single core feature in almost 10 years, consistently releases with glaring bugs and rakes in millions in license fees. How? Sales team always show a polished demo and don't give out dev licenses for evaluation.
MongoDB has marketed the database as a general-purpose one when in reality it doesn't even come close. And the case is the same for all the other NoSQL systems. No more general-purpose databases? The developers at http://www.amisalabs.com are tackling today's database problems.
We were a young startup and made a few crucial mistakes. MongoDB was not a mistake. It let us iterate rapidly and scaled reasonably well.
A good solution for X scale might not be so good for 10X scale. But by the same token, a 10X scale solution for a 1X scale problem isn't a good idea either.
I think part of the lesson from the post is that MongoDB works best when you start sharding from the get-go, especially if you know you're going to have mounds of data.
Rebalancing shards in MongoDB sucks, especially if you have any kind of traffic. Which means that even if you start out with just a few shards you either need to keep up with growing your number of shards so that none of them are ever more than ~30% full or else you're in for a painful reshard experience.
At least this is what happened to me, and my encounters with MongoDB (née 10gen) were unsuccessful in speeding up our resharding.
In an ideal world, you'd monitor and plan your capacity proactively - I don't think there's really any magic button for - "My system is at capacity - horizontally scale it now, with no downtime!".
Cassandra certainly makes it a lot easier. We've doubled our cluster with no downtime. Even adding a non-double amount of nodes can be performed without any downtime (though slightly more impact).
Viber, one of the largest over-the-top messaging apps, recently shared their conversion from Mongo to Couchbase. They ended up needing less than half of the original servers, with better performance.
Like always: "Use the right tool for the job". I don't think this was MongoDB's (10gen's) fault; they (Viber) chose the wrong database type for their needs.
This article caught my interest as I've been reading into Cassandra. But some previous research had me thinking that Cassandra works best with under a TB/node. Is SQL still better when you have really large nodes (16-32TB) and only really want to scale out for more storage?
I'm currently humming along happily with Postgres, but some of the distributed features, and availability of Cassandra look really nice.
Cassandra 2.0 can handle 5TB per node easily, 10TB with some care. Best to scale out, not up.
That said, if someone else has already made the hardware choice for you, you can always run multiple C* nodes on a single machine. I know several production clusters that fit this description.
PostgreSQL can handle petabytes easily. However if you need to query a petabyte of data, then you need to rethink your solution. PrestoDB + Hive + Hadoop may be what you need.
It's much more about desired usage patterns than amount of storage. Cassandra and RDBMS's differ quite a lot in how you replicate, consistency guarantees, performant read patterns, performant write patterns, how you handle recovery, etc... If you intend to bring anything to scale it helps to understand the strengths and weaknesses of the underlying architecture.
We run at about 1TB a node and it works well (high write load things like metrics and telemetry data). But we also use SQL server where appropriate (i.e. transactional account stuff).
I am a fan of using the right tool for the right job providing you have the team to support it.
It's not supposed to display on engineering posts, but something must have changed or broken recently. We know you guys aren't interested in marketing content.
Again, sorry about that and thanks for bringing it up.
"To buy us time, we ‘sharded’ our MongoDB cluster. At the application layer. We had two MongoDB clusters of hi1.4xlarges, sent all new writes to the new cluster, and read from both..."
I'm curious about this. Why were you doing the sharding manually in your application layer? Picking a MongoDB shard key - something like the id of the user record - would produce some fairly consistent write-load distribution across clusters. Regardless - it seems like write-load was a problem for you, yet you sent all the write load to the new cluster - why not split it?
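To illustrate how a hashed shard key spreads write load, here is a toy sketch in Python. The hash, key, and shard count are illustrative only, not MongoDB's actual hashed-sharding implementation (which hashes the key's BSON value internally):

```python
import hashlib

def shard_for(user_id, num_shards):
    """Assign a document to a shard by hashing its shard key.
    A stand-in for a hashed shard key; MongoDB's real hash differs."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % num_shards

# Writes for many users end up spread across both clusters rather than
# all landing on the newest one.
counts = {0: 0, 1: 0}
for uid in range(10_000):
    counts[shard_for(uid, 2)] += 1
print(counts)
```

Because the assignment is deterministic, reads for a given user always go to the same shard, while the aggregate write load stays roughly balanced.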
As explained, it was a stop-gap solution for data storage only, we did not have a problem with write load on SSDs.
We were at the point that MongoDB sharding was just as difficult to deploy as moving to Cassandra, which better fit our goals of availability. MongoDB sharding isn't instant by any means for existing clusters.
Yet another shining example of throwing money and time away to work within AWS constraints when bare metal and OpenStack (1) would have solved it cheaper (2) and arguably faster.
1 (if you insist on cloud provisioning instances, even though it makes little sense if the resources are as strictly dedicated as they are in this case)
2 (VASTLY, over time -- these guys are pissing money away at AWS and I hope their investors know it)
We understand that AWS comes at a premium, however we find the opportunity cost of losing the agility we have on AWS at this stage of our organization more expensive than the delta in cost between moving on to our own hardware and AWS.
Our organization is acutely aware of our costs and still strives to minimize them. Our move to Cassandra saved 79% over continuing to run our reserved SSD nodes & backup replica.