A Year of MongoDB (speakerdeck.com)
275 points by j4mie on May 10, 2013 | 129 comments



This is the problem with most of the guys who go with MongoDB. Obviously, this person is very technical, so I am not flaming or accusing him, but this is my view of the rest of them, who pick MongoDB without having a clue as to why (hipsters), or as to when they should use a NoSQL db and when they shouldn't.

I do not hesitate to admit that I was a hipster sometime back too. I chose MongoDB for many of my projects and it went well, till it reached some kind of moderate scale where I realized it was a terrible choice going with a NoSQL db (Sometimes, I'd have to duplicate data because there were no Joins, etc). And that's when you start to realize that NoSQL is not a pure-white solution. It is designed to satisfy very specific use-cases. Relational databases are really good enough for 99% of the use cases out there.

Unless you are COMPLETELY unable to design your schema in a relational database, you SHOULD NOT simply opt for a NoSQL database. The claimed NoSQL performance benefits will easily be outrun by a terribly designed schema if you use the wrong db for the wrong scenario. Trust me, MySQL has had so much negativity because of these hipsters, but even something as basically relational as MySQL scales really, reaaallllly well. In fact, many top guys still use MySQL in production to this day, for a reason.[1]

Next time you launch your start-up, spend some time carefully evaluating your db design decisions, as the wrong db for the wrong use-case could easily become the most expensive mistake of your startup.

[1]http://www.quora.com/Quora-Infrastructure/Why-does-Quora-use...


Disclaimer, I work for RethinkDB a competitor of MongoDB.

I completely agree with you that NoSQL is not a pure-white solution, but I disagree that it's only useful for specific use-cases. NoSQL has frequently been touted as this magic concept that makes databases scale well. It doesn't, as many of the poorly scaling NoSQL databases on the market today show. The most you can say is that it maybe lets you ignore a few particularly hard-to-scale operations. However, there's also no reason that a NoSQL database can't scale well; it's just that you have to actually design it in such a way that it does, and Mongo failed at this.

Joins are a very useful feature of structured databases, but there's no reason you can't have them in a NoSQL database. I speak from experience here: RethinkDB has joins (distributed joins, no less). They're just as useful in a NoSQL world as they are in a SQL world.
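
For a rough sense of what that looks like in our Python driver (a sketch only - the table and field names are made up for illustration):

    import rethinkdb as r

    conn = r.connect('localhost', 28015)

    # Join each post to its author's user document on the server side
    # (this works across shards), then merge the two docs together.
    cursor = r.db('blog').table('posts') \
        .eq_join('author_id', r.db('blog').table('users')) \
        .zip() \
        .run(conn)

    for post in cursor:
        print post['title']  # post now also carries the author's fields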

I think the author's analogy "NoSQL : SQL :: Dynamic Typing : Static Typing" (to use SAT syntax) is a good one. I know some damn good programmers who prefer dynamically typed languages like LISP, and I know some damn good ones who prefer statically typed languages like Haskell, and I think programmers can be similarly productive with either a NoSQL or a SQL database backing them up. Currently NoSQL has the very real downside that none of the NoSQL databases on the market have really nailed it (ours included, but we're working on it), whereas there are some very well established and well tested SQL databases on the market, so it's a perfectly rational choice to say you prefer the better-tested SQL route. However, I contend there's nothing inherently flawed about the NoSQL concept.


Isn't Google BigQuery SQL on top of a schemaless database?


Well, SQL and NoSQL get thrown around a lot but generally people mean 'standard relational' vs. 'newer, potentially schemaless' DB.

By a 'SQL/NoSQL' view, Cassandra has CQL which isn't ANSI standard but close enough to be a SQL.


It's a distributed, near-real-time querying system for data at rest that supports a subset of SQL.

Google has a scalable SQL database called Spanner[1]. They also have a proper NoSQL database, internally called Megastore[2], that provides distributed transactions based on entity-groups on top of BigTable. It's also known as the High Replication Datastore and is available via App Engine[3].

[1] https://en.wikipedia.org/wiki/Spanner_%28database%29

[2] https://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper32.pdf

[3] And hopefully via Google Compute, pending an announcement in a few days at I/O. That is, if Google actually wants their Cloud to succeed.


All I can say is this: if the saying "Always plan to throw away your MVP" is true, then I can't see any other storage solution than MongoDB (or a similar schema-less document-store DB) for MVPs. The speed of development and flexibility are simply worth it. Yes, it is hard to refactor a live product and move it from MongoDB to MySQL/Postgres, but it has been done before, and you only do that if you get traction, so it's a good problem to have.

I would start with MongoDB, get a grip on what on earth the product is doing, finalize the schema on the fly based on A/B tests, customer feedback and analytics, and only once the schema is finalized move it to a SQL database if needed.

If you are not using a good ORM + DB migration system, then MongoDB will make perfect sense when you are quickly iterating through ideas and trying to find a product/market match. You really have no clue what your data schema is going to look like in the end, so why confine it at the start? The vast majority of startups sadly won't get anywhere near the phase where they hit scale-related performance issues, so choosing MongoDB for prototyping your business absolutely makes sense to me.


You make a VERY big "if" in your first sentence, one that (admittedly anecdotally) I've very rarely seen hold true in tech companies. Much more often, the MVP becomes the product, and all those shortcuts and poor design decisions come back to kill your productivity when it becomes necessary to refactor foundational tech/designs that have metastasized throughout the codebase.

I am curious if this is other folks' experience as well, or do you actually throw out the MVP and start all over at some point? If so, at what point do you make the break?


In real life, you are totally right. The initial test that succeeds becomes (or continues to be) the real product. I agree with the parent poster that Mongo is awesome for testing, if not awesome for scale. I think the point here is that tech co-founders/leads need to make it clear that this is a debt that will need to be paid if things take off.


If NoSQL's speed and flexibility get you to an MVP, and the MVP gets you traction, then all the hell you may go through to rearchitect for SQL is probably worth it.

Premature scalability is as dumb as premature optimization and premature generality.


An MVP is used to get funding, which extends your runway so you have time to build your product correctly.


When I develop against MongoDB in python, ruby, or clojure, I consider it less a database and more a persistence layer for my dicts/hashes/hash-maps, with a handy-dandy query functionality built in.

Hell, in clojure I usually develop against a hash-map of hash-maps stored in-memory until the project gets far enough to re-implement the data layer (which I've hopefully abstracted well enough to not be a big deal).

Think of Mongo less like a schemaless postgres, and more like a persistent redis. Ease-of-use is its true advantage, and I think anyone who exceeds its capacity should be happy to have the problem.
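
Concretely, that mode of use looks something like this (a minimal pymongo sketch; the collection and field names are invented):

    from pymongo import MongoClient

    db = MongoClient().myapp  # localhost, database 'myapp'

    # Persist a plain dict as-is, with no schema declared anywhere.
    db.users.insert({'name': 'alice', 'tags': ['admin', 'beta'], 'logins': 3})

    # The handy-dandy query functionality: match on any field.
    alice = db.users.find_one({'name': 'alice'})

    # Update it much like you'd mutate the dict in memory.
    db.users.update({'name': 'alice'}, {'$inc': {'logins': 1}})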


I tend to treat Mongo similarly. At my job, we hate Mongo and I've stood up and cursed it more than once.

Plenty of personal projects end up running on Mongo, because it has a rich query layer for quick hacking. I never expect these to exceed the amount of RAM on my machine, either.


This is really the killer feature of MongoDB: its API embeds very nicely in host languages and is very intuitive to use. I work for a competitor called RethinkDB, and we've spent a lot of effort trying to make something that can compete in terms of ease of use. The leap we're really trying to make is for people not to have to throw their database out with their MVP.


Couldn't agree more.

I've worked for 15 years with relational databases and for the past 6 months with MongoDB at my startup. For me, the agility of not having to define my data in advance, and not having to "work" so hard to explain to the DBMS how my schema should behave, is no less than a game changer in my ability to build the MVP quickly. I just can't imagine having to invest all the labor of modeling everything I did with a normalized database and an ORM.


Really? For me, Mongo is not a database. Like you, I worked 15 years with RDBMSes, and then a year with Mongo. Mongo has no security. The Mongo cluster solution especially makes it easy to steal and manipulate data without notice. Everyone who has access to the server can enter the database. The implemented "security" can be turned off with a restart.

And nowadays it is easy to make changes to a data model in an RDBMS. It is highly flexible too.

But the other really bad thing in Mongo is the performance of complex queries. It is a nice bin for things nobody needs, or for dumping documents into, but that's it with Mongo.


I don't understand how this argument keeps coming up. If you use Rails, schema changes and migrations are DEAD EASY, and I am not even a professional developer. What time exactly are you saving?


Schema changes are not dead easy when you have hundreds of millions of rows and/or hundreds of GBs of data.


Please don't conflate MongoDB and NoSQL. NoSQL is about the right tool for the job. For instance, many of our Couchbase customers are very technical and risk-averse. They choose Couchbase because other solutions (often MySQL or Oracle) have become the wrong tool for the job as requirements change.

Mongo is a little different because they actively position themselves as a MySQL replacement for your average Rails app.


"NoSQL is about the right tool for the job."

I never liked that rationale:

* The job almost certainly will change. Data lives a long time, usually much longer than the original application.

* It's likely to conflate marketing claims with actual fitness for a purpose. A special-purpose system with special-purpose marketing may sound great if the special case lines up with what you're doing. But that doesn't mean that it's actually a better fit for that special purpose. And, thanks to confirmation bias, people almost always think their situation does sound like the special purpose for which the system was built.

* A database system is a system, not a tool. A tool is used in a particular instance for a short period of time, the results are obvious, and it's easy to switch tools if one is not working out.

Of course, there are always valid reasons to choose the special-purpose system, and I'm not saying you shouldn't have used Couchbase. I just don't like the "right tool for the job" argument.


I've heard DBAs complain that many programmers see data as not all that important: that the data simply exists to run the program. This is obviously not the case in any serious real-life environment, but I think it's the sort of attitude that many NoSQL supporters have.


One thing that becomes much easier with the JSON document model is sync, which is a valuable abstraction over the network layer. http://blog.couchbase.com/why-mobile-sync

Another thing you win with the document model is lots of primary-key lookups, making it easy to model your data access around your hot path. http://blog.couchbase.com/performance-oriented-architecture


   I just don't like the "right tool for the job" argument.
Then say "right technology for the task" instead.

There is no ONE database that works perfectly in every situation.


Installing a database system is rarely a "task". It's usually something that will live a long time, and there will be many long-term consequences.


Back in the day, many startups died with a gigantic Oracle database that cost them months of runway, because they had to "think long-term".

Build for what you need now, and just make sure your "central repository of truth" isn't lossy. You can always replicate for specific use cases later.


I'm not saying that you shouldn't make short-term compromises. I'm saying that "right tool for the job" is sloppy reasoning.

Even in the short term, it might be much faster to develop against a good RDBMS than a nosql system. I think many developers are so focused on "abstracting away the data store" that they forget what a good RDBMS can do for them.


I think the big problem is that most "NoSQL" engines are naive in their approach, thinking that the people at Oracle, Sybase, IBM, etc. are stupid.

Very few people need scalability beyond what a regular RDBMS can do. So yes, RDBMS have limits, but you'll most likely never reach them.


I would have to agree. Stackoverflow is in the alexa top 100, with over 400 million page views per month, and it runs off a single MS SQL Server instance (with a hot replica), with a Redis cache in front of it.

All the high-scale users of nosql systems end up having to learn a lot of arcane technical wizardry to make it scale (as demonstrated by the slidedeck linked here), and I'm not convinced it's actually easier to scale something like mongo than a traditional RDBMS.


No, there is something specific to MongoDB. Some NoSQL dbs are fine for some purposes (multi-master sync for CouchDB, graphs for Neo4j...).

MongoDB's problem is that it promises too much and fails to deliver on a lot of levels. You can't expect to do all the operations one can do on an RDBMS on a NoSQL database and still scale horizontally while maintaining data integrity. A good NoSQL db should have few features, but features that work very well. I like CouchDB, for instance; I use it in my apps to store customer info that needs to be synced across multiple devices, but I would NOT build an e-commerce app with it alone...

NoSQL dbs solve specific problems. They do not solve the problems relational databases are for.

So selling MongoDB as a replacement for PostgreSQL or MySQL is a blatant lie. It will not work.


You are pretty violent against MongoDB. Any reason why?


I suspect the reason that Mongo attracts so much hate is that it's more heavily marketed/evangelised than other NoSQL DBs - so more in the public eye and people expect more from it. Since the reality is that Mongo is not yet a technically strong product (regardless of its merits in terms of ease of use and getting simple things done quickly), it's bound to attract more negative attention.


Bingo. My brief experience with Mongo a few years ago was in a situation where it got selected in order to use something "new and cool". My job was to produce some reports from the data. Guess what reports require? Joins. The whole mess was slow and sucked, and the client belatedly realized that MongoDB was a shitty choice for what they needed, which could have been handled by Postgres without breaking a sweat.


Having spent the last two years trying to make MongoDB work at large scale and only succeeding because I have a ton of resources (200+ db hosts), I can say it's more than just the spotlight.

It is a fundamentally broken product.


Fair enough - I have no direct personal experience with Mongo myself. Coming from a more traditional DB background, what I've read indicates it has a lot of technological flaws, but I didn't want to be too harsh in my judgement without personal experience.

For my own curiosity, what issues caused you the most trouble? The lack of transactions combined with the extremely coarse grained write locking were what put me off the most.


Read performance compared to a lot of other offerings is a significant issue.

However, one of the biggest issues is the entire sharding design, which is incredibly delicate and wouldn't pass even the most basic high-availability requirements. I could go into a lot of detail about why it's bad, but it would take too long.

After that, it's the hardcoded limitations that prevent true multi-datacentre sharding, the previously mentioned CPU issues with mongos, broken replica selection, the inability to control primary/replica settings manually, etc.

The list is virtually endless.


I used to spend a lot of time in the RDBMS space, and I'd hear the DBAs bitch about how NoSQL was just a bullshit excuse for not having any discipline around your data model, and I'd defend it, saying, "Well, you have to understand, certain use cases, etc. etc. etc."

And then I see a presentation like this and think, "Yes, of course, you should have put that stuff in a relational database; what were you thinking?"

Of course, most of the open source databases do sharding and replication and the like equally terribly, so it's not a perfect solution....


I tend to take the view that an RDBMS works until it doesn't. In most cases it's things like click data or event data that end up taking billions of rows, in which case you aren't going to want it in an RDBMS anyway (you'd like to, but hah).

I'd like LevelDB on MySQL personally.


"till it reached some kind of moderate scale where I realized it was a terrible choice going with a NoSQL db (Sometimes, I'd have to duplicate data because there were no Joins, etc)"

That statement doesn't make sense to me. One of the first things you do with a SQL database when you reach moderate scale is denormalize data so you don't have to do joins, or shard, which prevents a lot of joins. Joins are super convenient, but they don't scale.
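
For instance, in Mongo terms you'd copy the fields you'd otherwise join on into the document itself (a pymongo-style sketch; the names are invented):

    from pymongo import MongoClient

    db = MongoClient().blog
    author_id = 42  # stand-in for the user's _id

    # Write time: duplicate the author's name into each post so the
    # read path never needs a join.
    db.posts.insert({'title': 'A Year of MongoDB',
                     'author_id': author_id,
                     'author_name': 'j4mie'})

    # Read path: one query, no join.
    posts = db.posts.find({'author_name': 'j4mie'})

    # The price: a rename has to touch every post by that author.
    db.posts.update({'author_id': author_id},
                    {'$set': {'author_name': 'jamie'}},
                    multi=True)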


You denormalize SOME joins.

In nearly all cases they are very useful, but once in a while one of them gets too slow, so you do a bit more work enforcing the consistency by hand, accept a bit more risk, and denormalize it.

Throwing all of them away because a few may create problems later is a bit of an overreaction.


How many joins do you think this page uses? A normalized version that doesn't scale probably uses 25-50. I'll bet the scalable implementation uses close to zero.


I would argue that it's actually a multi-tiered question, where the choice isn't really "NoSQL or RDBMS" but "NoSQL, or an ORM talking to an RDBMS, or going straight to the RDBMS".

Here, the naive developer uses MongoDB as an all-in-one that rolls in the one part of an ORM that too often gets sole consideration, "How do I persist my objects?", while effectively ignoring everything else. Try as I might to eloquently phrase my frustration with all this, it usually ends up in a rant against those relying on automagic who don't feel the need to understand every decision being made for them.


I have to completely agree... in many cases a classic SQL RDBMS is probably best. In some (such as single records having very dynamically shaped data, where most instances can be brought up based on a key alone, and searches are on a few keys), NoSQL is better. In some, having data replicated from one to the other works well: one for primary records, one for search/display. I've even contemplated scenarios where ElasticSearch could be the source of record. It really just depends on what you want/need from your data and searching.


If you use a Java stack like JSP or RingoJS, I would just use H2.

http://www.h2database.com/html/main.html

It beats MySQL and PostgreSQL in performance.


What does "pure-white solution" mean?


In this case, I believe it means all benefit and no disadvantage.


I'm a little blown away by the total lack of technical understanding when it comes to MongoDB. This person is obviously very technical, so why would he have chosen MongoDB in the first place? It isn't like MongoDB's technical shortcomings are a secret. The description of how MongoDB does sharding and distributed queries should have immediately raised red flags for anyone with even a modicum of CAP understanding. If it sounds complex, a total hack, and impossible to manage, it probably is.

For gosh sakes, even the 10gen guys admit MongoDB lost data a year(!) ago. [1] If a database lost data in a single server configuration, why would you trust it as a cluster?

1: http://www.dbms2.com/2011/04/04/the-mongodb-story/


Because developers without significant database / data store experience are choosing where to put their data because it's "easy to use".

Yes, RDBMSes and similar approaches are difficult to work with [1] and introduce "impedance mismatch". Yes, joins can be slow. Yes, scaling can be difficult and/or expensive. But choosing where to put data without understanding (or accepting) why the above is difficult is only asking for history to repeat itself.

[1] "Database guy here". I know databases and SQL. I would not consider choosing what language to implement a web application tier on, because I don't have enough relevant experience. I focus on what I have battle experience in.


"Easy to use" is a perfectly legitimate criteria to optimize for. For most startups performance/scale is a total non-issue, most of the time even something like berkleydb would do. At most early stage startups developer time is the single biggest bottleneck.

If you make a non-optimal choice upfront, you can always migrate to another database later on (database migrations are painful, but in practice they're something you have to do sooner or later in any case).

(obviously this applies to the scaling argument; if you need transactions you should pick a database which supports them)


Sometimes the decision of what technology to use and the task of making it work for your needs are entirely disconnected.


Unless you make a technology decision without understanding what you need to make it work.


... which is why any company with an "Architecture Group" has lost its way. When you have a small group that is rewarded with the job of playing around with new technologies and pushing them to the rank and file to implement, bad technical decisions will follow.

(I have a rant for the "Scrum Group", aka "the Agile Police", but I'll save it for another opportunity)


My point was, sometimes you, as the person who needs to make it work, weren't the one who chose what to use. You have a job: use this tool for this task. Do what you can.


   If a database lost data in a single server configuration, why would you trust it as a cluster?
Please be serious here. EVERY database has at one point lost data due to a bug. It doesn't mean that there is some systemic problem or that you should jump ship immediately.

To this day Foursquare, who are the ones referred to in the article, still use MongoDB:

http://engineering.foursquare.com/2012/11/27/mongodb-at-four...


I was more surprised by a technical post complaining about workers, waiting on IO, and context switching. There are ways to handle concurrency and network communication without large numbers of context switches and threads.


A few comments on the problems:

* CPU bottleneck. The mongod is not usually bound by CPU, except when building indexes on existing data (which shouldn't really happen in production). The issue he's talking about is contention between the web server (or workers) and the mongos. This isn't anything unexpected. It's recommended to put the mongos onto the application server; you then scale this by adding CPUs initially, but later by adding multiple application nodes and using a load balancer. Or perhaps by splitting the mongos onto dedicated nodes?

* Virtualisation: VMs are notorious for variable performance because you have the overhead of the hypervisor but, more importantly, are sharing resources with others. We run performance-critical apps on dedicated servers and reserve VMs for tools or things which aren't high-throughput, e.g. a MongoDB arbiter.

* EBS: This has had known performance issues for years. It's fine as a basic file store but should never be used for databases. PIOPs are the way around this but local instance storage is also an option.

* No transactions: MongoDB has never had them. This is known.

* Schemaless != no schema design. It makes it easy to play around but you still need to think through things carefully. See http://blog.serverdensity.com/mongodb-schema-design-pitfalls...

* No joins. Again, it has never had joins. This is known.


There are a number of very large bottlenecks in MongoDB, and one of the worst is the mongos process.

I can't run more than 100 processes on a server talking to a mongos process, because with any more than that, mongos uses 70% of the CPU power on a 32-core box. 70% of all CPU for database transaction overhead! After that, it simply stops working.

The next one is the insane 20K connection limit per mongod process that is hardcoded into the binary. When asked about it, 10gen says they just decided years ago that that was the limit because no one would ever have more than 20GB in a server. So, I can't just run more mongos processes on other boxes - it hits the 20K limit.

I could go on but the point is, MongoDB is really badly designed.


"[EBS is] fine as a basic file store but should never be used for databases."

Doesn't Heroku use EBS for all of their postgres databases?

It may be the case that postgres works better on EBS than mongo does. Postgres has a traditional write-ahead log that minimizes (and spreads out) block writes and hides latencies. Mongo does not.


WAL only helps so much, since you also need seeks for reads. Cassandra has a WAL + log-structured storage + no read-before-update design, so it basically eliminates seeks on writes entirely, and EBS is still ass for workloads that don't fit in cache. Which, if you're bothering to use C*, is almost all of them.


That's ideal, but many kinds of updates require some kind of read.

I agree in general though.


Agreed on all counts. Sounds like he is complaining more about issues with VM performance than MongoDB performance on VMs.

The joins statement kills me. If somehow the database design requires joins, Mongo is fast enough to run two queries and then let you work with them in code.


> If somehow the database design requires joins, Mongo is fast enough to run two queries and then let you work with them in code.

That's definitely not true. What if you were planning on filtering after the join? You may find yourself pulling millions of records. The bandwidth alone would bring you down. I work with MongoDB, and once in a while I really miss joins. You can't emulate joins in any reasonable amount of time in Mongo.


+1. Joins are efficiently done by databases, not by applications - at least on any reasonable dataset.

Joining in the application works for a few thousand records.


Depends on what you are joining. Selecting five rows from one (sharded) table and then looking up the matching five rows from another (sharded) table by primary key is almost as fast from the application as it would be if done by the database itself.
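
In pymongo terms, something like this (a sketch; the collection and field names are invented):

    from pymongo import MongoClient

    db = MongoClient().shop

    # Step 1: grab the five rows from one collection...
    orders = list(db.orders.find({'status': 'open'}).limit(5))

    # Step 2: ...then batch-fetch the matching rows by primary key.
    ids = [o['user_id'] for o in orders]
    users = dict((u['_id'], u) for u in db.users.find({'_id': {'$in': ids}}))

    # Stitch them together client-side: two round trips total.
    for o in orders:
        o['user'] = users.get(o['user_id'])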

Additionally, many NoSQL databases let you store an arbitrary amount of data in a row, so you don't need (big) intermediate join tables.

However, I agree that a database can be much faster when you want to join two huge datasets, because then it has all the fancy ways of fast joining, like sort-merge join or hash join. But if you need to join huge datasets, you're screwed anyway, because such joins don't scale at all and are still really, really slow - they need to do at least one sequential scan (and typically 3) over each full dataset. Definitely not something you want to do in your OLTP app.

I saw an OLTP app where joining a few thousand rows (yeah, thousand, not million!) killed the app performance totally, up to the point that a single user using the app had to wait >10 seconds for a page refresh. The app was probably done by someone thinking that joins are free.


> That's definitely not true. What if you were planning on filtering after the join? You may find yourself pulling millions of records. The bandwidth alone would bring you down. I work with MongoDB, and once in a while I really miss joins. You can't emulate joins in any reasonable amount of time in Mongo.

I agree that joins should be done by Mongo (as a feature, similar to RethinkDB), but they can get you in trouble when you convert a collection from a local one into one sharded across multiple servers. You'd then incur huge network load that could potentially bring your system down.

If you've decided that the data in some particular collection is potentially huge - just don't use joins on it (except maybe REALLY once in a while).


"What if you were planning on filtering after the join? You may find yourself pulling millions of records."

Then you always filter before doing the join. Problem solved.


Are we pretending that there aren't filters that could depend on both halves of the join?


> PIOPs are the way around this but local instance storage is also an option.

PIOP is not ringing a bell - I'll stick my neck out and ask what it is.


Amazon EBS Provisioned IOPS. Standard EBS gives around 100 IOPS (Input/Output Operations Per Second). With PIOP you can get more if you pay more.
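
If it helps, provisioning one looks roughly like this with boto (a 2.x-era sketch - check the docs for the exact signature):

    import boto.ec2

    conn = boto.ec2.connect_to_region('us-east-1')

    # 'io1' is the Provisioned IOPS volume type: you specify (and pay
    # for) an IOPS rate instead of taking the ~100 IOPS EBS default.
    vol = conn.create_volume(size=100, zone='us-east-1a',
                             volume_type='io1', iops=1000)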


Completely off-topic (well, that's my nickname), but I'm seeing more and more beautiful slide decks on HN, from a purely aesthetic point of view. This deck has beautiful fonts and a beautiful color scheme, and it is nicely designed.

My question is: how are they made? Keynote, Powerpoint, HTML...? Are they made with the help of a graphic designer? They look completely outside of the reach of the average technical developer. Or do they use a pre-made theme?


Zach Holman has written about this: http://zachholman.com/posts/slide-design-for-developers/

In fact, this slide deck was extremely reminiscent of his style, down to the font and some other details. I wouldn't be surprised if the person who designed this deck was influenced by some of Holman's previous presentations.


Eh... Mitsuhiko is known to have a good eye for design. See his Flask website, etc.


> My question is: how are they made? Keynote, Powerpoint, HTML...? Are they made with the help of a graphic designer? They look completely outside of the reach of the average technical developer. Or do they use a pre-made theme?

Author of the talk/slide deck here. I make all my slide decks with Keynote, primarily because it has the best font rendering and presenter display of the options available on OS X. They do not use a pre-made theme, but I'm always looking for inspiration from other presentations and websites.

It might be out of reach for the average technical developer, but that's just because the average developer does not do slides. That stuff can be practiced. My slides got considerably better (I think, at least) the more presentations I did.


Very beautiful indeed, and it makes reading very easy. My only wish is that the URLs were working, or at least that copy+pasting worked (Chrome on Windows here). The last page of the document had 4 URLs, and I had to type them by hand...


Similarly off-topic: does SpeakerDeck provide a mobile version (not an app)?



As I clicked forward in the slideshow, I kept expecting to find something to disagree with but it never happened. Mostly I nodded to myself.

I have a huge investment in MongoDB at this point, both financially and in equipment - over 200 machines dedicated in various separate clusters - and a pivot to another datastore at this point would be a significant re-engineering effort.

All the people talking about changing databases after the MVP have clearly never had to deal with a typical hockey stick growth profile and having to allocate engineering resources based on need - either making the product better, or wasting time changing your database and losing traction.

Anyway, I'm hoping posts like this can dissuade people from choosing MongoDB for anything destined for high throughput Enterprise-level production environments.

I just wish there was something I could easily drop in to replace MongoDB, but none of the available options quite fit the same document-store model while making better use of available resources and providing much better performance.


Who is still surprised by this? I feel that after 2-3 years of the litany of stories and cases like this, it should shock absolutely no one anymore.


He is a relatively popular programmer so people will still comment on his presentation (even if it is a little redundant at this point). Unsung database veterans like Tony Marston have been saying things like this for years but few people took them seriously.


Yes, his web site is definitely not cool, but thank you for linking to it. I'll have to brew a decent coffee or three over the next week and take some time to read some of his stuff because it looks pretty good. Discounting the animated spiders and mention of COBOL!


but have you seen his website? uses frames, comic-sans and talks about COBOL. Obviously, NOT COOL. /s


In all of these discussions about MongoDB on HN, I always wonder why people aren't talking about the 1% of use cases that it's good for. Maybe it is too obvious for most folks, but anyway I will just regurgitate the use cases the MongoDB folks have documented officially, here: http://docs.mongodb.org/manual/use-cases/. They have literally like three major use cases.


Sounds like RethinkDB would be the answer to our prayers. How close is it to being "ready"?


Hi jonny_eh,

We're currently winding down release 1.5, which will include secondary indexes. After that, the next release will be all about being production-ready and having better performance. We're hoping to have that out 3-4 months from now. We'll be posting updates about the process via GitHub, Twitter, and our blog.


slava@rethink here -- we're working with a small group of early customers (providing support, driving product direction, etc.) If you're interested in working together, shoot me an e-mail -- slava@rethinkdb.com


Seems RethinkDB is still relational. Meh.

If you're using C#, I'd recommend RavenDB.


RethinkDB sounds like it's document-oriented, from their site: "RethinkDB is built to store JSON documents, and scale to multiple machines with very little effort. It has a pleasant query language that supports really useful queries like table joins and group by, and is easy to setup and learn."

Also not sure why relational == Meh.

RethinkDB does sound promising, though not ready for primetime. The following (critical) features are still in the development pipeline: secondary indexes, a db backup tool.


Definitely, we're leaving the production-ready tag off until we've built in some of these features.

Secondary index support is ready to go, we'll be releasing v1.5 with it in a few days.


I read your site, and your system sounds very promising, as I said in my previous post. It seems like you've identified some of the key problems with earlier NoSQL implementations and that you're trying to solve them. Schema-free was always a feature I could understand, but "no joins" always sounded like an anti-feature to me. The promise of consistency and automatic sharding / replication is very nice, but I'll remain skeptical until I hear about some production implementations.


My apologies, I thought RethinkDB was a relational database; the mention of joins gave me that impression, as I don't know of any other non-relational database that directly supports joins.


What do you mean by the term relational? Do you just mean 'Doesn't use JSON'?



Mike @ RethinkDB here -- the document model is similar to MongoDB's in that you store JSON documents and there are no schemas, so it's not a relational database.

Our query language (ReQL) does support a lot of SQL-like queries, such as group-by and JOINs, but there are no relations defined.

It's also worth noting that a great team has sprung up to build a C# / .NET driver for RethinkDB (https://github.com/mfenniak/rethinkdb-net).


Interesting, I will have to have a look.

I built a tool to track rankings for part of Reed Elsevier, and I used MongoDB, but in a hybrid form: a MySQL db for the fixed stuff and MongoDB for the representation of the first 120 results for a given language, engine, locale triple.


Hi Mike!

That's good to know. I'd like to give it a try, but only OS X and Linux seem to be supported.


That's right-- an enterprising soul is working on porting us to FreeBSD (https://github.com/rethinkdb/rethinkdb/pull/688), but we are notably missing Windows support-- we hope to get to it after adding some more core features (secondary indexes, etc.)


1. RethinkDB is a document-oriented database; you store data very similarly to how you would in MongoDB / CouchDB.

2. Relational == Meh? Please elaborate, we're all waiting.


The relational model is not suitable for what most people are using relational databases for.

That's why there's a whole bunch of ORMs, with varying degrees of magic. It's also the reason some people were crazy enough to use relational DBs for key-value storage (eg. Reddit).


"is not suitable" is no more a concrete answer than "meh".

Please try and give reasons rather than restating the assertion.


http://ayende.com/blog/153026/embracing-ravendb

The points in the blog post apply to most non-relational databases.


That blog post is based on some pretty flawed understandings. The reason for normalisation (which I assume is what it's referring to) is not simply to improve write performance - indeed, it can make that slower too. It's because normalisation makes maintaining data integrity vastly easier. It can also improve performance, in as much as it improves the cacheability of your data by reducing its size. Further, when it comes to more complex data types, that's not a fundamental limitation of relational DBs - Postgres, for example, supports some quite rich datatypes.

I have no doubt that document stores save you a bit of time in the early stages of a system. Not sure if I'd trust them to be reliable in the long run as your application-stored schema changes over time, though.


>> If you're using C#, I'd recommend RavenDB

I would give anything to get back the time I spent developing my app using Raven, and develop it with SQL Server instead. Ironically, I feel so trapped by the schemalessness now that there is a not-insignificant amount of data to port if I do need to make a significant schema change.

DO NOT FEAR THE JOINS!


What did you not like about raven?


What makes you recommend RavenDB for C#? Just the .NET API? Honestly curious.


I can confirm that the map-reduce thing does not hold up to any performance expectations.


I assume your experience was on version < 2.4?

Now that 2.4 is out I'm interested in seeing how much the switch to multithreaded V8 has improved that. http://docs.mongodb.org/manual/release-notes/2.4-javascript/


I've run quite a few m/r queries against 2.4.2. The performance is still poor enough that I'd say it's only usable for occasional ad-hoc work.

More seriously, though, running queries against a master node will regularly (maybe one in two queries) cause it to fail over and elect a new primary. You'd think you should be able to run the queries against a slave, which you sort of can - unless the result set is bigger than the maximum document size (16 megs), in which case you're limited to the master, as you can't write to a collection on a slave.

We're using Mongo fairly successfully but it has a lot of issues, particularly around administration tasks. Map reduce work gets done in Hadoop.


Well my 2 cents to the discussion.

We really tried to make Mongo work over the last 1.5 years. We decided to go with Mongo mostly because of JSON and geospatial indexing.

We came to realize that for our business case it is not the right tool. That said - and I have worked with databases for 20+ years now - I don't see what problem Mongo solves.


My first thoughts when I read this:

1. It's easy to screw up a relational database. I've seen more than a few mature relational databases... Most of them have plenty of sins, near-crippling performance snafus and other horrible legacy. Any database that is big enough and growing enough is a beast to manage.

2. From the slides, I think this guy took "schema-less" as a cue to stuff completely arbitrary data into MongoDB. No wonder his indexes went crazy. You still need to think about the data you're storing and the relationships. You need a design, whatever database you use.

3. Any relational database I've seen at scale has a lot of flattening. I've seen intra-day transaction dbs that are completely flattened. If your next port of call is a highly normalised relational database, you're going to hit another wall fast enough.

4. Two-phase commit. Seriously. Forget it. I've spent half my career in financial institutions. Quick, fail-fast processing with a reconciliation process is by far the most common approach. 2PC actually slows you down and introduces another component that gets in the way. It's used very sparingly (and even then, it usually causes a world of pain).


I think some people misunderstand the primary purpose of schema-less design. It's not about typing, it's about document flexibility: getting rid of EAV tables (read: Magento) and storing document-specific information. Typing obviously comes into play, but it is only half the topic.

If you are building a system where the schema is the same for all records, then you really shouldn't be using Schema-less design.
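
A concrete way to picture it (a pymongo sketch; the field names are invented): two products in the same collection, each carrying only the attributes that apply to it, instead of EAV rows reassembled at query time.

    from pymongo import MongoClient

    db = MongoClient().store

    # Each document carries only its own attributes...
    db.products.insert({'sku': 'TSHIRT-1', 'kind': 'apparel',
                        'size': 'L', 'color': 'navy'})
    db.products.insert({'sku': 'SSD-256', 'kind': 'storage',
                        'capacity_gb': 256, 'interface': 'SATA'})

    # ...and document-specific attributes are directly queryable.
    ssds = db.products.find({'capacity_gb': {'$gte': 128}})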


In a few projects I've worked on in the past year and a half, I've used Entity Framework in C# and added a Data NVarChar(MAX) field to each table. Then I added a base class with a UseData(Action<JObject>) method that passes in a Json.NET-based JObject to manipulate. Adding additional properties that don't need to be indexed, and handling default values, then becomes fairly easy with getters/setters. I also have a TempData table that is pretty basic, where the core data is JObject-based.

It's not the fastest option, but it has worked pretty well for me. It worked out well for holding temporary values, or other values that don't need to be indexed, or where the shape can change dramatically. I tend to store transaction details (credit card vs. PayPal, etc.) in JObjects, since the shape can be different, with a key that can pull the right properties out via .ToObject<ConcreteInstance>().

In other cases, I've mirrored data to Mongo, so display versions of records can be pulled up denormalized from a single record/authority (the source records are across 30+ joins, and fairly expensive with a 50:1 view:edit ratio).

I will say that using MongoDB with NodeJS has to be the most seamless combination of tools I have ever worked with. I've written a few services based on this combination and love the development output. Fortunately my needs have been limited enough that I have not hit too many walls. Most of the issues I have experienced relate to geo-indexes combined with other data, the limits on multi-key indexes, and no secondary indexes.

I think more people need to consider how their data is shaped, and used and go from there.


I'd love to see a blog post on your Entity Framework JObject implementation!


After 6 intensive months with MongoDB to build my MVP, I just love it.

For sure, it's not perfect. The lack of joins is a shame, but it can be solved quite easily outside the database.

But the ease of use and speed of development are such a HUGE advantage. Not having to break my schema into normalized relations and define it in the DB saved me literally days of work.

I can imagine that 1 year from now, when our product is more mature, we'll be leaving MongoDB for another SQL or NoSQL database. Doesn't matter. The benefit that MongoDB gives us now justifies the costs that may be incurred years from now.


Totally agree, and not just because of your last name.


These guys are lucky they didn't try Cassandra. That's really Mongo's problem: it's too close to a regular SQL solution. You have a sharded NoSQL data store that performs in-store filtering and sorting? You can run aggregation queries? Compound indexes? Amazing! Tell me more.

The moral of the story is: unless you can justify a NoSQL datastore for your particular solution and you can live without joins, stick with a regular SQL db.


"These guys are lucky they didn't try Cassandra." Could you explain that further? I'm currently using Cassandra for a new project, so this sentence caught my eyes.


Just like with Mongo or any NoSQL/non-traditional solution, you have to understand how the trade-offs and capabilities of the database relate to what you're using the database for. You also have to design your data storage with these tradeoffs in mind.

For example, joins. There are no joins in Mongo or Cassandra, and anything working around joins is simply not going to be as fast as a traditional database's join. If you need to do joins all the time, you will be in pain. So the answer is to denormalize your data, such that joins are not necessary for frequent operations.

In particular, with Cassandra, while it's great at many things, such as write speed and write availability, you have to be very careful with your data design to get the results that you need. And you have to be cognizant about the querying that you need to do.

Cassandra has really weak in-store aggregation and filtering - there is almost no in-store aggregation, and there is no filtering other than by a prefix of a column name or a key (a prefixed subset). So if your column names are made out of composite parts A:B:C, you can scan for A:* or A:B:* (or A:[some value of B to some other value of B]), but you can't do *:B:* or *:B:C.
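
For example, with pycassa-style composite column slices (a sketch; the exact tuple-slicing semantics vary by version):

    import pycassa

    pool = pycassa.ConnectionPool('mykeyspace')
    events = pycassa.ColumnFamily(pool, 'events')

    # Composite column names (a, b, c): slicing on a prefix works...
    events.get('rowkey', column_start=('A1',),
               column_finish=('A1',))           # A:*
    events.get('rowkey', column_start=('A1', 'B1'),
               column_finish=('A1', 'B1'))      # A:B:*

    # ...but there is no way to express *:B:* - you cannot filter on
    # the middle of the composite without scanning everything.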

The advanced trick is to use ordered rows - which are strongly discouraged, because you can shoot yourself in the foot with a key-distribution hotspot - which allows you another axis of prefixed-subset filtering. But only one more axis.

Sorting? Cassandra doesn't sort. The Cassandra project leadership thinks that sorting should be done in the client. If you want to filter a subset of keys in the shape of A:B:C, e.g. get all keys with a certain value of A and sort by B:C, you have to do the sorting yourself. If you want to do a top-N report, you have to retrieve all that data to your client and then sort.

The only sorting in Cassandra is the hierarchical column (and optionally key) ordering. So if you want to have quick top-N reporting functionality on values A and B from an A:B data tuple, you end up maintaining two indices (i.e. precomputing query results). One such index has columns that start with A and another starts with B.

But then the indexing support is particularly weak. Secondary indexing is only done on values, so if you want to index portions of your keys, that's not natively supported. Also, only in Cassandra 1.2 is indexing finally "write-only," instead of "read-then-write." (Write-only performance is much faster.)

There are no triggers, so you can't write custom indices where you can atomically perform "read-then-write" operations to maintain an index. Instead, you have to write all such custom indexing logic yourself and take a hit for the transmission of all the indexing mutations over the network wire. This hurts particularly bad when you have a cluster distributed over geographical regions (i.e. slow/expensive link).

Cassandra does have the ability to count the number of columns, but only in one row (with only the same prefixed-subset filtering available). Counting columns across multiple rows is not available, even if those rows are co-located on the same node.

Map-reduce is available, but it is not suitable for frequent queries (not meant to be run quickly, just like map-reduce in Mongo is not something you want to be hitting very frequently).

So, of course, whether these are issues for you depends entirely on your data design. There are many things that Cassandra does well, and certain data shapes for which it is just diesel. It's quite ops-friendly; rolling full-uptime upgrades are reliable and a key priority for the Cassandra team.

So Cassandra is even more specialized in terms of its uses than Mongo. If the original author of the presentation tried to use Cassandra for the same kind of data he used for Mongo, he probably would've written an even more scathing article.


Cassandra and many other NoSQL databases are designed primarily for OLTP workloads, not OLAP. OLTP is almost exclusively "find me something by primary key" (see the TPC-C benchmark used for RDBMSes). Sorting huge amounts of data, top-N queries, skyline queries, aggregation, joining huge data sets, and complex filtering belong to the analytics world, not OLTP. Unless your whole database is very tiny, let's say 10MB, those operations are pretty expensive even in RDBMSes. That's why those features are deliberately not included in Cassandra. By contrast, MongoDB took a different route - they included some of those features and then seriously underdelivered on many of them.


Not totally related to the article, but every technology has its use. I get really annoyed when people ask me why I don't use Ruby or MongoDB at my job. I hate this movement towards "blog technologists" who basically read something in a blog, maybe (probably not) try it out themselves on something small, and assume it is the best for everyone - and that if you don't use it you are stupid.


What version of Mongo and pymongo is he using? The connection client looks old (not using MongoClient), and the compound index selection issue (or the lack of it with $and) doesn't exist in 2.4.1. I'm curious how many of these issues have been resolved...


Is there a video somewhere of the actual talk?


I just submitted a related article on Hyperdex which, although it's not Python, has a very good Python interface.

https://news.ycombinator.com/item?id=5686973


HyperDex looks quite interesting indeed, but after getting burned by new products claiming too much, I'm perhaps overly cautious. Also, it doesn't appear to be able to change schemas after creation, which is a significant issue.


HyperDex looks good, but I'm personally a tad alarmed that they include transactions only as a proprietary plugin at undetermined cost.


The question I think is fundamental, but that I didn't see asked or answered, is this: if we had a data store that had full performance, full scalability, etc., would we design it as a relational database or not? Said otherwise: is the debate about NoSQL an optimisation issue or a design issue? (Keep in mind pg's article about what a language would look like in a hundred years.)


Slide 56 states: "Schema vs. Schema-less is just a different version of dynamic typing vs. static typing."

Wrong.

If anything, it is a different version of weak typing vs. strong typing. That is totally orthogonal.


The problem with most of these databases is that they are trying to do too much.

Either the database is great for ad-hoc queries and flexibility or it's great for performance and scale.


Why are the URLs not working in the SpeakerDeck document? Captain Obvious here, but isn't that the purpose of the Web? It's the WEB... make those links work!


I think this is a fair comment. SpeakerDeck focuses on preserving the presentation layout with absolute fidelity, and it does that by using static images. It's a trade-off. It's certainly possible to implement some kind of link detection; I hope they are working on it or plan to.


I feel like there is a lot of context missing for some of these slides. Will there be video available of the talk?


Sounds like 90% of your problems could have been solved if you'd spent a day or two researching what you were about to build your entire company around.

Sounds like the "We Fail" slide might be the most accurate one there.



