It's almost like the commenters who are bashing the author of the post did not read the bit in bold, which is his main point:
If you tell a database to store something, and it doesn’t complain, you should safely assume that it was stored.
This has nothing to do with the 2GB limitation. Nowhere in the documentation does it mention that it will silently discard your data. And what happens with the 64-bit version if you run out of disk space? More silently discarded data?
I know a lot of you may have cut your teeth on MySQL which, in its default configuration, will happily truncate your strings if they are bigger than a column. Guess what? Anyone serious about databases does not consider MySQL to be a proper database with those defaults. And with this, neither is MongoDB, though it may have its uses if you don't need to be absolutely certain that your data is stored.
EDIT: Thanks for pointing out getLastError. My point still stands, since guaranteed persistence is optional rather than the default. In fact, reading more of the docs points out that some drivers can call getLastError by default to ensure persistence. That means that MongoDB + Driver X can be considered a database, but not MongoDB on its own.
I'm just struggling to imagine being willing to lose some amount of data purely for the sake of performance, so philosophically it's not a database unless you force it to be. Much like MySQL.
EDIT2: Not trying to be snarky here, but I would love to hear about datasets people have where missing random data would not be an issue. I'm serious, just want to know what the use case is that MongoDB's default behaviour was designed for.
EDIT3: (Seriously) I'm sure MongoDB works splendidly when you set up your driver to ensure that a certain number of servers will confirm receipt of the data (if your driver supports such an option); nowhere am I disputing that. But that number really should have a lower bound of 1, enforced by MongoDB itself. And to the guy who called me stupid: you are what's wrong with HN.
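For concreteness, here is roughly what that knob looks like from the application side. A minimal sketch, assuming the pymongo driver of this era (the 2.x API, where 'safe' and 'w' were keyword arguments; later versions folded these into write concerns):

    from pymongo import Connection
    from pymongo.errors import OperationFailure

    posts = Connection('localhost', 27017)['blog']['posts']

    # Default: fire-and-forget. insert() returns as soon as the message is
    # on the wire; server-side failures are never reported back.
    posts.insert({'title': 'hello'})

    # safe=True makes the driver issue getLastError and raise on failure;
    # w=2 additionally blocks until two servers acknowledge the write
    # (only meaningful against a replica set).
    try:
        posts.insert({'title': 'hello'}, safe=True, w=2, wtimeout=1000)
    except OperationFailure as e:
        print 'write was not acknowledged:', e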
Let's say, outside of the tech world: when you send a post card (a cheap one) to a friend, you won't receive any delivery confirmation. You just send it and go do whatever you please, trusting the post card will get there. If it doesn't, no biggie; you'll send another on your next trip anyway. No hurt feelings.
But let's say you need to send me a check. You want to know whether I received it, especially because sometimes I don't cash checks right away. Without confirmation it would be difficult for you to decide whether to cancel the previous check and send another, or do nothing, because I could be at that very moment trying to cash the check, or it could be lost somewhere. Delivery confirmation is an add-on where you receive proof that the envelope got there, but see, it takes time for that confirmation to arrive. It's expensive. If you are sending a $0.01 check, you can just send another if the recipient asks.
Is it not relevant for a blog? Your business website? Your toy application? It is even relevant for a chat system!
And the flaw of your argument: even if there are other, more important things for an application, let's just make everything other than the #1 feature shit.
I'm just saying some databases are like the mail. A chat system is one such case.
And the flaw of your argument: even if there are other, more important things for an application, let's just make everything other than the #1 feature shit.
I don't actually understand what you mean here, but since you say it's the flaw of my argument, I'm very interested in it. Could you rephrase briefly?
Is it actually unimportant if a chat message is dropped? It seems damn important to me: what use is a chat app if someone sends you an important message and you never receive it? I could see that being true for something like anonymized logs that you're only going to look at in aggregate, but just silently ignoring chat messages really doesn't seem acceptable to me.
Well, in practice, it's not too uncommon to send chat messages or SMS messages that just vanish, or arrive out of order, or arrive the day after they were sent. People do not, then, say that SMS is completely useless; instead, they accept that once in a while a message won't get through, and that they should call if it's important.
I'm not saying it's not at all important that chat messages actually get sent, and if it happens every single day to a user, then they might well look for alternatives, but it's not of the same importance as losing a banking transaction. If accepting that occasional writes will be dropped on the floor allows you to get your product out in October instead of December, that could be an acceptable tradeoff. Certainly not every use case is like this, but some are.
I mean, is it desirable behaviour? Would you not want a chat program which never dropped your messages? If so, we should work towards that, not accept 1 in 1000 messages being lost because "there are more important things to worry about". The most important feature should be prioritized over less important ones, but that should not make us forget them; they should be as good as possible as well.
I guess what I'm trying to say is this: You cannot ignore all the other features except the biggest one.
Not really, in my experience. Losing chat messages or logs is among the worst things a chat application can do.
I take your point, but I think consistency is still one of the most important attributes for anything that is going to store data. Why even use a database if your data matters so little? Just throw it into memory or memcache.
Better analogy: you ask about a deposit that isn't reflected in your balance, and they say, "Sorry, you didn't choose the account option that prevents money from vanishing. It was noted in the fine print; didn't you read it?" Then you get a new bank.
If you can turn that behavior on and off (by using getLastError or whatever), why not have this feature?
If I'm logging upvotes on a post or comments on a blog, which is about as serious as what 99% of these b.s. startups are doing, I think it's fair to ignore errors.
I do agree that this should be pointed out in huge blinking letters though, or be a driver flag that is on by default. The amount of people who don't know this about Mongo, but are still using it to store gigs of data, is horrifying.
I am reminded of the time, back deep in the past of MySQL, that someone complained about MySQL not providing locks. The developers' replies amounted to "But it is FAST!" The reply was "But the results are often wrong", and the developers again: "But it is FAST!"
Acceptable software, particularly in the class of databases, is obligated to tell you that it didn't complete your request. This is not an option.
The problem has been and always will be that MongoDB straddles the line between a caching product and a database product, regardless of what 10gen decides to call it. It's extremely frustrating that 10gen can't embrace this fact, and instead perpetuates marketing that causes the product to be perceived as flawed by their target audience. But once you've discovered this, you can use it appropriately (either by overriding the default silent failure to use it for durable persistence, or by only using it for caching or as an eventually consistent store).
That isn't official documentation as far as I can tell, though. I don't think I should need to read 5 chapters into a separate book for something so seemingly fundamental.
IMO, this is important enough information that it should be mentioned from the start, but it isn't in the tutorial[1], nor can I find it in the FAQ[2].
You're right, it's not official documentation, but it was the first thing I read when I decided to start learning Mongo. I've also seen 10gen hand out hard copies at meetups in NYC. Anecdotal, I know, but maybe helpful to someone.
EVERY database call should be wrapped in exception handling to make sure that any errors (e.g. connection errors) are handled appropriately. MongoDB is no different in this case.
If it blows your mind that people would write to mongo without making sure the write succeeded, then doesn't that make the default behaviour itself mindblowing?
Perhaps a better option would be to have an 'unsafe_write' function. But then of course the benchmarks would look less impressive, since they'd have to call a function with 'unsafe' in its name.
MongoDB: "Okay, I've accepted your request. I'll get around to it eventually. Go about your business, there's no sense in you hanging around here waiting on me."
Or, if you really want to be sure it's done:
Me: "MongoDB, please store this. It's important, so let me know when it's done."
MongoDB: "Sure boss. This'll take me a little bit, but you said it's important, so I'm sure you don't mind waiting. I'll let you know when it's done."
Thank you for taking the time to respond in kind; I do not disagree with what you have stated. I disagree with choosing this by default; it violates the principle of designing tools and APIs for use by the general programming public in such a way that they fall into the "pit of success". http://blogs.msdn.com/brada/archive/2003/10/02/50420.aspx
To me, the choice of performance over reliability is the hallmark of mongodb, for better or worse.
I agree with you, incidentally. I think it might be a better design decision to be slower (but more reliable) and to fail loudly by default (10gen started down this path a few versions ago by turning journaling on by default). It's messy, but messes get people's attention, at least.
That said, I think that people really do overblow the issue and make mountains out of that particular molehill, because all the tools are there to make it do what you want. Many times, it comes down to people expecting that MongoDB will magically conform to their assumptions at the expense of conforming to others' assumptions. Having explicit knowledge of the ground rules for any piece of technology in your stack should be the rule rather than the exception.
Right, but no-one actually programs like that anymore. You expect an exception to be raised in the event of failure. When did you actually write, or even see (no pun intended) code where every function call was followed by an if statement on its return code?
And I say this as an old-skool C guy who does do this in critical sections of code... But for everything else I'm in a language like OCaml that behaves sanely, using a DB like Oracle that behaves sanely.
If a link to "Write Concern" prominently visible at the start of the first page of the official documentation for the Ruby API does not seem important enough to look at, I don't know what to tell you, except RTFM.
'Success' and 'Failure' are fuzzy concepts when writing to distributed databases, and you need to tell Mongo which particular definition fits your needs. The 'unsafe' default in mongo is controversial, but ranting about what a "proper database" is without even reading the docs is stupid. Instead, let's rant about what a "proper developer" should do when using a new system...
"I'm just struggling to imagine being willing to lose some amount of data purely for the sake of performance...".
A foursquare check-in database could be an example where performance is actually way more valuable than consistency. (I have no idea what database they use)
>I know a lot of you may have cut your teeth on MySQL which, in its default configuration, will happily truncate your strings if they are bigger than a column. Guess what? Anyone serious about databases does not consider MySQL to be a proper database with those defaults. And with this, neither is MongoDB, though it may have its uses if you don't need to be absolutely certain that your data is stored.
Nice ad hominem there. MongoDB isn't DB2, just as MySQL wasn't. Both can still be used to build very good products; in fact, I'd go so far as to say they lead to better products than "proper" databases.
I've usually used pymongo [1], whose docs serve as my API reference, and I don't believe I've ever seen this limitation listed there. I also rummaged around the admin area of the mongodb site and don't recall seeing the limitation there.
I'm really glad I haven't deployed mongo now in a production 32-bit system.
Response to EDIT2: Where can data loss be acceptable?
If you have a relatively speedy messaging system where messages are removed or outdated on receipt. I'm sure there are other specialty needs.
For pymongo there is a "safe" parameter you can apply to operations or the connection as a whole. 10gen made a really stupid default decision here. Instead of calling it safe they should have called it async, and it should have defaulted to off.
So by default Mongo write operations are asynchronous and you have to explicitly ask for error codes later.
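If it helps anyone, here's a sketch of where that parameter can live, assuming the pymongo 2.x API (the names moved around in later releases):

    from pymongo import Connection

    # Per-connection: every write on this connection is followed by a
    # getLastError check, and server-side failures raise exceptions.
    conn = Connection('localhost', 27017, safe=True)
    events = conn['app']['events']

    events.insert({'type': 'click'})              # checked, via the connection default
    events.insert({'type': 'click'}, safe=False)  # explicitly back to fire-and-forget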
There's a bigger question here. I get why Diego was flabbergasted by the default, and I also hear legitimate claims that the documentation should have been read. But what I want to know is: why are MongoDB advocates in such a bad mood?
It's legit to criticize a language or a database. However, it seems to me that when MongoDB gets involved, the tone is far more aggressive and defensive. What's up with that? It's just software, bits and config files. It's not like someone called your mom a harlot.
Here's what I think. New developers, for a long time, have come into the industry and become overwhelmed with everything they need to learn. Let's take typical database servers. Writing a SELECT is easy enough, but to truly be an expert you have to learn about data writing operations, indexing, execution plans, triggers, replication, sharding, GRANTs, etc. As it's a mature technology, you start out barely an apprentice, with all these experienced professionals around you.
In recent years, software development has really been turned on its head. We're not building apps using the same stack we've used for a long time: OO + RDBMS + view layer + physical hardware. The younger the technology, the better, it seems. In theory, a 3 year developer and a 20 year developer are now pretty equal when we're talking about a technology that's been around 2-3 years. That wouldn't be true if we were dealing with say, OO design patterns. (Even when new languages come along, you still get to keep your experience in the core competencies.)
Attacks on these new technologies are perceived as an assault on this new world order, and those who have walked into being one of the "newb elite" respond emotionally to what they see as a battle for the return to the old guard. Am I totally off base here?
I think you're way off base. Even granting your claims about experience with traditional databases (which I disagree with), we don't see the same kind of emotional tone when talking about equally new datastores like redis or couchdb.
MongoDB was very aggressively marketed; its advocates produced benchmarks comparing it directly to traditional relational databases as though the use cases were the same. I think that set the tone for future discussion in a way that's still being felt.
If you're as old as your opinions suggest you'll remember the early days of Java were very similar - Sun marketing pushed it no end, and so tempers ran high and discussions were emotionally charged in a way that never happened when talking about perl or python or TCL.
I'm not terribly old: 35. Been doing web development as a career since 1999.
More relevant is my experience. I didn't come in when Java came out. I started (1997-1998) with some high-level dynamic web languages: ASP classic and ColdFusion (to this day, I still do CF; I'm a CF user group manager and I speak at CF conferences). Building HTML and JavaScript since 1996 (GeoCities, HotDog, and HomeSite). Nerded around with programming 1995-1997 in high school (TI Basic, Pascal, and QBasic). In the days when I started web development, a lot of folks were still monkeying around with Perl and flat files. I can't really speak to the early days of Java: until 2000, I didn't really use it. ColdFusion 6 went from C++ to Java, at which point CF devs ran on the JVM and could target it.
From the beginning I was a consumer of RDBMSes. Started with Access and moved on to SQL Server. There wasn't a need to know the full DB, only the pieces you needed for CRUD. Perhaps for newbs that has changed, and they have to learn the full SQL administrative experience. Personally I doubt it. Do some db migrations in Rails: you don't even need to know what SQL engine you're running on. (A good thing, IMO, but it still means a lesser body of knowledge.)
Good point that a lot of products try so hard to be the "new sexy" that they suggest an inaccurate comparison, or at best, implement a subset of what they're trying to replace.
While I agree with the sentiment you're expressing here on some levels (people don't like it when you insult things they like, as they feel they need to defend their choice or themselves because of it), I don't think that quite applies here.
This is a case where, although the ultimate complaint of the author is the behavior of the product (which is documented, but un-intuitive in nature unless you've read up on the issue), it's the way in which he chose to frame the problem that is getting people upset.
This is a known issue, even if it seems like a completely poor design decision. What I think most people here take issue with is that the author did almost no research on the topic, got himself into a problem, and is trying to blame it on Mongo.
I don't think my GP was complaining about people saying the OP was wrong, but the vitriol associated with it. This article has prompted some of the ugliest comments I've seen on HN, even worse than Apple/Google crap articles.
Telling somebody they are wrong is one thing, calling them moronic or stupid is quite another.
I think this is an evolution of the language wars wherein immature[1] developers align themselves with a technology and mix up criticisms of the technology with criticisms of themselves. This seems to be part of the need humans have to be part of a community.
1. Immature in this context has nothing to do with age. Rather, it is an attitude that shows when any developer has not experienced and internalized enough technology to realize every single technology has fundamental problems, sucks in some way, yet is still usually pretty amazing nonetheless, especially within the context of its creation.
> In theory, a 3 year developer and a 20 year developer are now pretty equal when we're talking about a technology that's been around 2-3 years.
Hopefully the 20-year dev can recognize the new thing as new and possibly immature, and can identify some areas of weakness when compared to tools with a successful history.
> Attacks on these new technologies are perceived as an assault on this new world order, and those who have walked into being one of the "newb elite" respond emotionally to what they see as a battle for the return to the old guard. Am I totally off base here?
2) Feel they had wool pulled over their eyes unexpectedly.
Let's talk about the wool. MongoDB was initially marketed with stupid little benchmarks (which were later removed as a matter of policy). Those benchmarks were what people saw and showed their bosses and colleagues before deciding "this is the one". Yes, they picked a bad tool and should have RTFM'd, I would normally say, but not for MongoDB.
They marketed themselves as a "database" while at the same time shipping with durability turned off. Yes, you can write very fast if you don't acknowledge that data has hit the disk buffers. I wasn't fooled, I saw the throughput rates and thought, something is fishy. But a lot didn't.
Mind you, I would have no problem with this design decision if there were a bright red flashing warning on the front page saying what the default settings are and what they could do to your data. There wasn't one.
As developers (programmers, whatever you want to call us), we feel that when other developers market things aimed at us, they should be somewhat more honest than, say, someone selling rejuvenating magnetic bracelets on TV at 4am. I think that is where the passionate discussion comes from.
Well you could just as easily say that the old guard get upset when all their hard-earned knowledge stops being relevant, so they respond emotionally as well. But arguing about nosql is just something that happens around here...
Very true and wise. As a "mid-guard" developer, it's a fight between "get off my lawn" and "pay attention, lest you become old and irrelevant". I like NoSQL, but I feel that Mongo has made some sacrifices to move to the front. Given that performance, I think plenty of new devs have bought into it, and get upset easily at challenges. Maybe they'd have a different response were they committed to Riak or CouchDB.
It's pretty incredible that the author of a post called "I’ll Give MongoDB Another Try. In Ten Years." criticises a comment on this same post telling him to read the tutorial all the way through as "unnecessarily aggressive".
Aside from that, though, the 32-bit limitation is clear in the documentation and present on the download page. It's fine not to read the documentation before you use something, but you can't then complain that it did something you did not expect. MongoDB is a little different from other databases. So is Redis. You can't blow off everything that is conceptually different.
This may be a case when using the package manager is not always the best option.
If you're talking about Ubuntu, I can attest that the default PM there is several versions out of date for a lot of things, and thus to get the version you'd expect, you're forced to install by hand.
Also, even using the PM version, didn't you get a warning when you started the server? I thought Mongo threw up a warning at start time about this exact issue (the 2GB limitation, not the silent failures)
Second every word here. The version in Ubuntu's repositories is not the latest (which is 2.2), and they can't be more explicit about it than pointing it out on the download page and giving a message upon database startup.
The version in Ubuntu's repositories is not the latest (which is 2.2)
What does the author's complaint have to do with the version Ubuntu is distributing? Are the 32-bit limitations present in Ubuntu's version not present in the most recent version? If they are, then who cares which of them he installed?
they can't be more explicit about it than pointing it out on the download page and giving a message upon database startup
Uhh, yeah they can. On Debian-derived systems like Ubuntu you can make your .deb packages throw up dialogs that the user has to read and agree to before installation via debconf (http://www.fifi.org/cgi-bin/man2html/usr/share/man/man8/debc...). There's probably a way to do the same thing in RPM-based systems as well. If the warning is something that every user of the software needs to see, putting up a warning dialog and requiring the user to confirm having seen it before installation starts would probably be appropriate.
They could also write an error to the database's error log whenever data is discarded due to the 32-bit limitation. Someone mentioned above that it puts a message at the start of the log, but if that's the case IMO it's insufficient; most of the time people interact with logs by looking in them for a particular moment in time, not by reading them from the first line on. Logging the error on or near the moment the data loss happens would make the issue visible to people using logs in this manner.
>Uhh, yeah they can. On Debian-derived systems like Ubuntu you can make your .deb packages throw up dialogs that the user has to read and agree to before installation via debconf (http://www.fifi.org/cgi-bin/man2html/usr/share/man/man8/debc...). There's probably a way to do the same thing in RPM-based systems as well. If the warning is something that every user of the software needs to see, putting up a warning dialog and requiring the user to confirm having seen it before installation starts would probably be appropriate.
Right, but the only person with the ability to do that is the Ubuntu maintainer of the package. MongoDB has no control over what they do and should not be held responsible for their actions.
Maybe.... just maybe you should read the documentation of the database you're installing before you actually start using it in production.
I'm sure this only bit the author because he was using MongoDB for a toy project, and in a real system he'd have done due diligence first.
I'm not a fan of MongoDB myself, but if I were to use it I know that I must read about every option available, because by default MongoDB's team chose settings that are suited for speed and not reliability, durability, or (if I'm being less charitable) even sanity.
We've been interviewing candidates to join our team for a few months, and I've also lent some interviewing support to other startups in our area.
I've noticed a trend across 20+ candidates, all of whom are smart people: they are using Mongo without actually understanding what the hell it's trying to solve by getting away from the RDBMS paradigm.
I'm not sure if this is because 10gen markets it as a general purpose tool, but I have yet to talk with a candidate who can actually describe why they were using the DB vs. a SQL database. I'm all for learning new things, but I can't help but wonder if the string of negative MongoDB posts is coming from people who pick it b/c it's new, then realise pretty far in that this is nothing like a normal DB, and "having no schema" isn't really a reason to go with a tool as foundational as a data store.
I think Mongo is great for the really specific problems it's designed to solve. It's probably pretty bad for a general purpose tool, but I'd be surprised if anyone serious actually considers it one.
> but I can't help but wonder if the string of negative MongoDB posts is coming from people who pick it b/c it's new, then realise pretty far in that this is nothing like a normal DB, and "having no schema" isn't really a reason to go with a tool as foundational as a data store.
My observation has been that a substantial number of people pick NoSQL stores because they don't really understand RDBMSs, and can't be bothered to learn.
I don't mean this as a dig at NoSQL in general (there are perfectly valid reasons to want some NoSQL features), but the hype train does attract a lot of people who just want the new hotness.
> It's probably pretty bad for a general purpose tool, but I'd be surprised if anyone serious actually considers it one.
I have talked to more than one 10gen marketing bro who insisted that MongoDB is appropriate for any and all use cases, transient to archival. It's pretty disingenuous if you ask me.
What about scalability? Trying to cluster and shard MySQL is a very difficult task, but with MongoDB it is trivial. No schema can be good, but scaling out easily is the big plus I see.
I think blaming the user here is partially valid (he didn't read the docs), but that's not the whole story.
There is a discontinuity between the ease-of-use story and the blame-the-user story, regardless of how well documented the async insert behavior is.
And it doesn't have to be this way. There are ways of designing interfaces, APIs, and even naming that go a long way to prevent your users from shooting themselves in the foot.
Take Postgres. It also supports at least a couple kinds of async insert, one of which is part of libpq (the Postgres C client library). It's called "PQsendQuery" and it's documented under the "Asynchronous Command Processing" section. It's hard to imagine a user trying to use that and expecting it to return an error code or exception. Even if the user doesn't read the docs, or reads some fragment from a blog post, they will still see that the name suggests async and that it returns an int rather than a PGresult (which means it obviously doesn't fit into the normal sync pattern).
There is no reason mongo couldn't be clear about this distinction -- say, rename "insert" to "async_insert" and have "insert" be a wrapper around async_insert and getLastError. But instead, it's the user's fault because they didn't read the docs.
Careful API design is important to reduce the frequency of these kinds of errors. In postgres, it's relatively hard to shoot yourself in the foot this badly in such a simple case. I'm sure there are gotchas, but there is a conscious effort to prevent surprises of this sort.
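To make that concrete, here is a hypothetical wrapper; none of these names are real MongoDB APIs, it's just the proposal above sketched against a pymongo-style collection:

    class ExplicitCollection(object):
        # Wraps a pymongo collection so the unsafe path is named as such.

        def __init__(self, collection):
            self._col = collection
            self._db = collection.database

        def async_insert(self, doc):
            # Fire-and-forget: returns before the server has confirmed anything.
            return self._col.insert(doc, safe=False)

        def insert(self, doc):
            # Checked by default: send the write, then block on getLastError.
            # (Both calls must ride the same connection; pymongo's per-thread
            # socket binding is what makes this pairing work.)
            self._col.insert(doc, safe=False)
            status = self._db.command('getlasterror')
            if status.get('err'):
                raise RuntimeError(status['err'])
            return status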
> There is no reason mongo couldn't be clear about this distinction -- say, rename "insert" to "async_insert" and have "insert" be a wrapper around async_insert and getLastError. But instead, it's the user's fault because they didn't read the docs.
Because if you don't read enough of the docs to understand that 'insert' is asynchronous insert, you don't understand MongoDB and haven't done your research.
Why should 'insert' default to synchronous? Why shouldn't we instead have a sync_insert function instead? The only reason is that you're assuming familiarity for people coming from SQL/synchronous-oriented DBMS, but why should they be forced into an awkward design just because it's what people are familiar with from other DBMS?
A good system is forgiving; it encourages exploration; if there's a choice between safety and performance it defaults to safety. If/when profiling shows the safe behaviour to be a bottleneck, then users can Google the issue and discover "Oh, I just need to set flag X; I can live with the consequences here".
Expecting the user to be an expert in your product from the start is simply not realistic; a well-designed system facilitates use by people of varying levels of expertise.
> A good system is forgiving; it encourages exploration; if there's a choice between safety and performance it defaults to safety.
Not if you're choosing a system that's explicitly marked for performance over safety.
> Expecting the user to be an expert in your product from the start
The 'product' in this case is a non-relational database, not an iGadget. The user can and should be expected to be familiar with the main strengths and weaknesses of the database as a whole.
There is no way you can convince me that someone who has done a reasonable level of due-diligence in investigating MongoDB can be surprised when it behaves asynchronously.
Kudos to you for doing your research. If you're saying "don't use MongoDB without doing at least N days of research first", then you're very much at odds with (my perception of) the 10gen marketing message.
I think you're right though: MongoDB should not be used without _lots_ of research into its limitations.
> I think you're right though: MongoDB should not be used without _lots_ of research into its limitations.
That's true about any database, not just MongoDB; nothing new here.
> then you're very much at odds with (my perception of) the 10gen marketing message.
10gen is fairly straightforward about the original issue, having blogged openly several times about their decisions. But at the end of the day, any engineer should do research beyond the simple marketer's pitch.
I won't doubt that there are people who make snap judgements about fundamental architecture based on marketing pitches[1], but that's very unfortunate, and the marketers really can't be blamed, especially when they make no effort to conceal the truth or deceive you!
> That's true about any database, not just MongoDB; nothing new here.
That's exactly the point where we started. A well-designed system fails "safe"; it should obey the principle of least surprise. Specifically: MongoDB should default to synchronous writes to disk on every commit; official drivers should default to acknowledging every network call; MongoDB shouldn't allow remote access from the network by default. Once you want higher performance or remote access, you can read about the configuration options to change and learn on-the-fly, evaluating the trade-offs as needed.
Other systems are safe by default (e.g. PostgreSQL), and their out-of-the-box performance and setup complexity suffer because of it. MongoDB could ship "safe" (with the same trade-offs), but chooses not to. That sort of marketing-led decision-making has no place in my technology stack.
The Principle of Least Surprise has local scope. You may be surprised to find asynchronous writes on an arbitrary database, but not for a database that is documented, advertised, and marketed as asynchronous-by-default.
'Surprise' is relative to the current environment and paradigm (in this case, asynchronicity). If you find that surprising, then you should have read the basic documentation properly.
> MongoDB could ship "safe" (with the same trade-offs), but chooses not to.
Because that's one of the main points of choosing MongoDB...
This "main point" is never mentioned in their philosophy page. And the introduction mentions "Optional streaming writes (no acknowledgements)" which sounds like the default is synchronous writes.
I admit that the default unsafe tuning of MongoDB becomes quite obvious when you read more of the manual, but I can hardly say 10gen is without blame for causing this confusion.
I understand where you're coming from, though I disagree.
I hope you continue to explain these caveats to everyone considering MongoDB. I hope you recognize that not everyone is an expert in these limitations, and that you clearly explain to those that might not know it that MongoDB's "2GB limit" really means "data loss"; as does 'asynchronous'. Then you'll see fewer blog posts from people that didn't see through the marketing speak and were bitten by the defaults.
Right now, I think all these blog posts describing MongoDB losing data or performing poorly are getting upvoted because people are learning of these limitations for the first time.
It's not that way because somebody in the 70's flipped a coin and decided that sync was heads.
It's because it's a reasonable assumption to make. Data loss shouldn't be a surprise; if I need speed and am willing to risk data loss I should have the option, but should explicitly choose to use it.
> if I need speed and am willing to risk data loss I should have the option, but should explicitly choose to use it.
You did, by choosing to use MongoDB.
(And if you chose MongoDB without being aware of that implication, you didn't choose MongoDB for the right reasons or didn't do your due diligence, because you cannot understand MongoDB's use case and tradeoffs if you were unaware of this.)
This is the latest in a long line of negative posts on MongoDB based solely on first impressions because either:
1) it does not behave exactly like SQL
2) the user didn't read any more than a Quickstart Guide
3) the user fundamentally misunderstands the aim of the new technology or the application it is intended for
Ember.js suffers from the same ignorance.
What makes it worse is all the morons who upvote without even reading the detail purely because the title reinforces some misconceived bias they already have.
'NoSQL' is part of the problem. This technology has absolutely no comparison with SQL other than it persists data.
This technology has absolutely no comparison with SQL other than it persists data
Except that apparently under certain circumstances it doesn't persist data, which was the author's point.
Personally I wouldn't be upset about a limitation like the one described as much as I would be upset about the database not logging an error when it discards the data. Logs are a primary way you figure out what's wrong when your application isn't behaving as expected. If you open the logs and see a bunch of "32-bit capabilities exceeded, please buy a real computer" messages, you learn what the problem is. If the database error logs are empty, that implies that everything is working fine, when in this case it clearly isn't.
I'm sorry. Which part of "It silently ignored my data" do you not understand?
You call people "morons", yet it appears that you did not read the article yourself.
Whether SQL or not, scalable or not, old or new, or whatever... Is completely immaterial here.
When a database silently stops accepting data, and apparently has done so for 3 years, you have to at least admit that there are strange design goals at play.
Now, the entire claim of the article might be incorrect. Did you verify that yourself?
It's stated plainly and prominently on http://www.mongodb.org/downloads that the 32-bit version is limited to 2GB. It's mentioned elsewhere in the documentation too, but the OP didn't bother to read it. "A gem and two lines" and it worked, so he expected it to work forever. That's not how engineers usually work. Most of the time, they over-engineer, not the other way around! They research the hell out of any new technology they want to use. I'm definitely less talented than the OP and others on HN, but even I know a hell of a lot about Redis, MongoDB and CouchDB, and I haven't even started to write a line of code.
And anyone who has read more than an introduction to Mongo knows that you SHOULD use getLastError to be safe. If you do that, no data will be silently dropped.
I think they need to change the word "limit". This is what "limit" means:
[root@li321-238 tmp]# dd if=/dev/zero of=./filesystem bs=1M count=128
[root@li321-238 tmp]# mkfs.ext2 -F filesystem
[root@li321-238 tmp]# mkdir myfilesystem
[root@li321-238 tmp]# mount -o loop filesystem myfilesystem
[root@li321-238 tmp]# dd if=/dev/zero of=myfilesystem/im_too_large bs=1M count=129
dd: writing `myfilesystem/im_too_large': No space left on device
That is, a "limit" means the program stops and complains. It's "limited".
A program that continues along without issue, only changing its behavior in some unannounced (documented or not) way, is not "limited". It's free as a bird.
I think you're overlooking asynchronous writes. Exceptions kind of suck in the asynchronous world, because by the time you need to clean up after a write error you have no idea where you are in your code.
With a getLastError model, you can do your work, then go check for errors when you're really ready.
I'm not saying it's a great API, but it does make sense in context. No idea why the tutorial the OP followed didn't talk about the differences, or why async is hard.
Surely "getLastError" is an extremely questionable concept in the asynchronous world? How do I know the 'last' error is the one relating to the operation my code just executed?
Presumably this has to be done in the driver directly after the insert call - on the same connection, to ensure that you actually get the last error, and not someone else's error, if you have several instances writing to the db?
Mongo's wire protocol actually has request IDs, but it appears db.$cmd.findOne({getlasterror:1}) doesn't use that. Instead you have to send it over the same connection as the operation in question, and if you had to reconnect you're just fucked and will never know what you may or may not have committed.
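In driver terms that ambiguity looks something like this; a sketch assuming pymongo 2.x, where safe=True pairs the write and the getLastError on one socket:

    from pymongo import Connection
    from pymongo.errors import AutoReconnect

    items = Connection()['app']['items']
    try:
        items.insert({'sku': 42}, safe=True)  # write + getLastError, same socket
    except AutoReconnect:
        # The connection died between the write and its acknowledgement. The
        # insert may or may not have been applied; all you can do is re-read
        # (or make the write idempotent) and retry.
        pass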
While this article is a bit flippant, I think ten years is a pretty good number when you consider the vast amount of engineering effort that has already been poured into projects like Postgres.
This brings me back to the recent discussion about reading other people's code: it is almost certainly smarter to extend an existing database until it's capable of meeting your needs, rather than write one from scratch.
The fact that many programmers don't see it that way is a testament to their irrational fear of diving into other people's code.
10 years, and PostgreSQL still has no easy, manageable solution for replication or sharding. And its JSON support is still nothing more than a bolted-on hack on top of a BLOB.
People need to stop acting like PostgreSQL is some holy grail database. It isn't.
Correctly and efficiently querying sharded tables is not only a very complicated dark art but also heavily patented. I thought they had a replication story, though.
Not reading the documentation (or hell, the red "note" text under each 32 bit download link at http://www.mongodb.org/downloads) for basic limitations will bite you in the ass in 2022 just as easily as it will in 2012.
Except that it was hidden deeply in the documentation. It's like making a car that has the gas and brake pedals switched, and then blaming accidents on people not reading section 5 of the owner's manual.
I'm hardly an inexperienced programmer. I've used Cassandra, SimpleDB, Voldemort, etc. I wrote part of the Inktomi Search Engine in the 90s, and plenty of (what today would be called) NoSQL stores over the years.
A default that's so counterintuitive for a database should be featured prominently with a huge neon sign. It wasn't in the Ruby tutorial, or in any of the many documents I read. It's buried deep in the Mongo website, and the first Google match about the 32-bit limitation is a blog post from 2009.
As the OP points out, the limit is pretty clearly specified on the download page, and there's a "note" linking to the limit right next to where it says "32-bit".
Sometimes you just have to admit you screwed up and didn't read the documentation. Everyone does it, we're hackers, we'd much rather play with technology than read docs.
Being limited to 2GB, and failing silently after 2GB are different things. I know about the 2GB limit but I also would have expected an error. (Though I think I managed to enable safe mode for my internal app.)
Even your own blog post says that you basically just followed the getting started guide for Ruby. Personally, I would not use an untested, brand-new-to-me technology on anything that "had to work". And if this wasn't that important, then chalk it up to a learning experience. MongoDB's decisions might not fit your personal style, but your attitude towards learning is a poor model for technology.
I installed it through Ubuntu (apt-get). Most people must (or should) be installing MongoDB through a package manager. Once again, assuming that people will see the warning because it's on the download page is shortsighted.
See dmaz's comment about the log files. I'm a complete MongoDB noob as well, but found out pretty quickly that there's a 2GB limit on 32-bit systems. It really is hard to miss.
It blares it at you if you try to start up a 32-bit Mongo binary. And it's on the downloads page. And in the documentation. And in every blog post about MongoDB ever.
That 2009 post is the canonical post about the issue, which is why it has such page rank. Its position is a consequence of the fact that it's linked to from all over the web, not because nobody has discussed it since.
Saying that different defaults should be documented prominently is like saying that because every piece of software is different, you should be required to read the documentation before you use it...
I'm a pretty big detractor of mongo, but I don't agree with this post. One of mongo's main design decisions is to defer writes, making this sort of thing possible. I think it's a crappy tradeoff but it is one of the things that makes mongodb mongodb. If you use it without knowing this you haven't researched it well.
I'm no expert in this area, but maybe if you want to use mongo for logging? Or things like that.
I kinda like the TCP vs. UDP analogy. Sometimes you care more about speed than precision. A few dropped items in a log? Not a big deal. I'd rather have that than be forced to use a more expensive machine for the job.
That said, I absolutely think the default should be the TCP way.
Well, any time you'd rather have speed over completeness. Maybe you're aggregating tweets from the Twitter API and if the occasional one goes missing, it's not a big deal, or perhaps you can grab it on the next update. Maybe you're generating a real-time stats dashboard for your site and if one pageview gets lost every million, it's not a big deal.
Look, I agree that in most cases you probably want to do everything you can to make your data 100% complete. But failed writes should be really rare, and there are plenty of times I'd trade the rare missing write for cheaper/faster database servers.
The way MongoDB is designed, this would be outside the scope of a driver. A storage library that's based on the ruby MongoDB driver could certainly do it, though. That's what the OP ideally should have been using. In fact MongoDB would be a good choice for his use case, if he would switch to a 64-bit VM and handle error conditions (heh).
If your failure modes are uncorrelated (i.e., spread across datacenter facilities with separate power supplies), you might be happy knowing a majority has accepted the write in memory, even though none of them have stored it yet (because that's slower if you're on spinning rust).
I've been using MongoDB for experimental work for about 6 months. It has a few amazing advantages: it's the MVP/POC king, you just do it; it's the agile iteration master. It's definitely not the best choice for doing any statistics, or any financial-style transaction handling.
However, it's starting to feel like being anti-MongoDB is just considered cool today. When I see someone who has worked with MongoDB for a year, upgraded to 2.2, knows it inside out, and still hates it, I will listen and start to worry. But until then, I'm going to keep using it, and saving time.
I used MongoDB one afternoon, and guess what! It doesn't have table-locking writes?! :)
In all seriousness, I built a 10 machine Mongo cluster, talked with a 10gen consultant a full day, went to Mongo meetup, and ran all sorts of benchmarks before ever using it in production. I still don't feel like I have the expertise to write a snarky blog post about it.
Not really following the snark there. Are you trying to compare MongoDB to MySQL's MyISAM storage engine? Like there aren't numerous other extremely valid RDBMS solutions out there which don't do table locks during a write? (MySQL InnoDB, Percona, Maria, Aria, PostgreSQL, Firebird, etc...)
Worse, it's the necessary tradeoff of one of MongoDB's self-proclaimed benefits (asynchronicity) - it's not an edge case; it's part of the core reasons you'd switch to MongoDB (or at least take into account when making that choice). If that surprises you, it's because you haven't switched for the right reasons or haven't researched enough to know that every benefit has a tradeoff (and to know what those tradeoffs are).
To give an analogy, it'd be as if someone read this post and decided to use a SQL database solely because they care about write durability... and then complained when they "suddenly" encountered an error while trying to include an extra field in an INSERT on the fly. ('You mean SQL has fixed schemas??')
If they just wrote "async" on that page somewhere, most experienced programmers would immediately understand the implications, and also understand how the amazing performance was being achieved.
The MongoDB "way" is that clients know the importance of their data and can choose write strategies which make the proper trade-off between insertion throughput/latency and durability/consistency.
People tend to look at NoSQL and wonder why it doesn't function like MySQL, then they loudly complain how bad the software is. Nobody is writing articles about how Memcached doesn't function like MySQL.
I had a very similar experience about a year ago. Except instead of running out of RAM on a 32-bit instance, I was running out of disk on a 64-bit instance. That's right, my database ran out of disk, and the driver didn't throw an exception.
Yes, I realize that there's a "safe=True" option to my python driver. But I'm writing to a database. As others have said here and elsewhere, the default behavior of a database and its drivers should be to complain loudly when a write fails. It is ridiculous that safe!=True by default. If I want to turn off this feature to improve performance, I will.
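For anyone who wants to see the difference for themselves, a quick demo, assuming pymongo 2.x against a local mongod; a duplicate _id is a guaranteed server-side write error, so it stands in for "out of disk" here:

    from pymongo import Connection
    from pymongo.errors import DuplicateKeyError

    col = Connection()['test']['dupes']
    col.drop()                               # start from a clean collection
    col.insert({'_id': 1}, safe=True)

    col.insert({'_id': 1})                   # default: the failure is swallowed
    try:
        col.insert({'_id': 1}, safe=True)    # safe: the same failure raises
    except DuplicateKeyError:
        print 'now the driver complains, like a database should'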
When I wrote some hobby code for Postgres using the PHP driver I had to manually check error codes after each and every operation. This came as no surprise to me.
Exception-throwing database drivers are a relatively new thing, not an old thing. The only thing MongoDB does differently is that its writes are fire-and-forget: the database hasn't returned a response of any kind by the time the function returns.
In native code you can forget about using exceptions in a database driver, because exception handling can be exceptionally broken on some platforms. SmartOS, I am looking in your direction.
Exception throwing is not that new. It's just a question of what is the style in the language you're using. In Java, for example, throwing exceptions has been the norm since JDBC was invented in 1997. PHP is definitely different in that exceptions are rare. Same story with C++. I'm not super experienced with Ruby, but they seem pretty common there, so I would've expected to get one.
Indeed, but by then there will be more advancements in software to complement the changes. If you are in the technology business, it is assumed that you will keep up with technology.
Interesting that you're quoting the Zen of Python, but using Ruby. I wonder if the Python mongo client would have the same behaviour.
There seems to be a number of people commenting, telling you to read the documentation, but I'm with you, that is completely counter-intuitive behaviour and should be viewed as a bug.
As I understand it, it has everything to do with the client library, some clients may call getLastError on every operation and raise errors when they occur, for example.
Great point. That's usually how you can tell when a technology is starting to disrupt things: really smart people / experts in their field (which Diego definitely is) start to bash it. In this case, he has a point but has way overblown things. But that's OK; it means Mongo is on the right track.
This type of attitude is not constructive. Of course I read the documentation. Much more than "copy and paste from the tutorial." I looked at tons of code samples as needed, read blog posts, etc. The limitation wasn't obvious at all.
This reminds me of the attitude that I had to correct in developers that worked for me:
- There is a huge difference between "it works" and "it does what the user expects in a friendly way."
Steve Jobs said that if you need to read a user manual (particularly to do the most vanilla usage of a product), the problem is the product. Not you.
> Steve Jobs said that if you need to read a user manual (particularly to do the most vanilla usage of a product), the problem is the product. Not you.
He's talking about consumer products, not databases that were intended for use by technology experts. There's a big difference there.
The onus is on you to understand the limitations of software before you start using it. You complain that the 32-bit warning doesn't show up in the package manager, but you still should have read the documentation before committing to a new technology. It's that simple.
Is it a flaw that mongo doesn't work well on 32 bit systems? Maybe. Probably.
Is it a flaw that you didn't do the requisite research before committing to a database and subsequently complaining about it? Definitely.
Are you really arguing that you shouldn't have to study documentation? First, blog posts and code samples are not documentation (in this case, at least).
If you were working for me as a developer and had the attitude that you shouldn't have to _thoroughly_ read the manual and notes for something like MongoDB, I'd let you go. Steve Jobs was not a programmer.
You're comparing something like an iPhone, intended for the average Joe, to a complex system intended for developers and especially data architects. Assuming software to be intuitive is a good sign of a bad programmer in my book; any good engineer would never assume anything.
Heck, I learned about error handling in Mongo in the first hour of learning it. Same for the 2GB limitation of the 32-bit version. The mongo manual is very well done and also happens to be fully indexed in Google.
In the time you've spent defending yourself here and on your blog, you could have learned how to use MongoDB properly, or just gone and bought a 64-bit computer.
The Jobs quote is only relevant for mass-market consumer technology. Nobody would argue that you should be able to operate an MRI machine or an F-16 fighter plane without reading a manual.
The limitation WAS obvious to a great number of people. It's in the documentation, it's on the download page, and it's posted in several blogs available via google searches.
Beyond that, I'm not sure why anyone would run a production system on a 32-bit machine anymore. Sure, the failing-silently part sucks, but this seems much more like a poor deployment than an actual bug in MongoDB as the root cause.
That quote is absolutely correct for a consumer product such as a phone, but it's disingenuous to apply it to a highly complex product used entirely for bespoke development, aimed at some of the most technical individuals around.
To be fair, if the tutorial is official, it could do a better job of educating about handling the error conditions of each operation. Like it or not, a large number of tutorial readers will do exactly that: copy and paste code without going into the depths of what's going on there.
I ran into this same nasty surprise building a prototype to store requests in Mongo instead of Postgres. It was enough to scare me away, too. Glad I noticed it while it was still just a script+Makefile simulation.
Another problem with Mongo I never heard anyone else raise is that there are no namespaces. If I install Mongo, all the tables/collections live in the same namespace. What if I want to use it for multiple projects? How do other people solve this problem?
Can you elaborate please? With a Postgres/MySQL/Oracle installation I can say `CREATE DATABASE` and get a new namespace. I couldn't find anything like that with Mongo. Am I just missing something?
Ah, somewhere along the line I got the impression that in Mongo a "database" and a "collection" were the same thing, but that's not true. Glad to know this was just my mistake!
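For anyone else who had the same misconception: the database, not the collection, is the namespace unit, so per-project isolation is just separate database names on one server. A sketch, assuming pymongo:

    from pymongo import Connection

    conn = Connection()
    blog = conn['project_blog']    # one database per project...
    shop = conn['project_shop']    # ...same server, fully separate namespace

    blog['posts'].insert({'title': 'hi'}, safe=True)
    shop['orders'].insert({'total': 9.99}, safe=True)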
I agree with posts stating that you should read the docs before using a tool you don't know. But I also think that these two really important points should be mentioned in the Getting Started guide, in bold:
- The 32-bit 2GB limitation (seriously, when I started with MongoDB I wasn't expecting this!)
- The fire-and-forget policy
These are really not points to be discovered in chapter whatever of the docs.
I can think of several cases where throwing an exception is counterintuitive to Mongo's design and applications. Say your app returns control to the user while storing data asynchronously; throwing an exception might not be the best way of handling errors. In fact, if throwing exceptions were Mongo's default, I wonder how long it would take for a blog post entitled "Mongo blew up my app" to appear.
Interesting that you switched to Couch. I was hesitant to recommend it after reading the post because I feared you had been turned off JSON stores entirely; glad to hear that's not the case.
In general it feels like Couch actually takes storing data seriously. Append-only and whatnot. It's slower and a little bulkier than Mongo, but it does the important things right (1.0 bugs notwithstanding.)
I'd love a follow-up blog post on your experience with Couch.
Doesn't the write result contain the info? E.g., if the write failed, won't it contain the error if you just check for it? If so (and I haven't checked, but I assume it's so), then this post is equivalent to ranting about Go's lack of exception handling. Like it or don't, it is what it is; you can either use it, or fork it and make your own database / language.
I welcome the post. Even though most of my stuff runs on 64-bit, I actually do have a few 32-bit systems here and there. I never knew, because, as the OP mentions, it's not written anywhere _obvious_.
Another thing I didn't realize was that, because of the memory-mapped storage (which I guess is fine performance-wise), it's hard to estimate memory usage on a machine. From what I understand there is no way to cap memory usage, which means the only way to limit the memory used is to keep the size of the database below your RAM. Quite important things to know, IMHO.
Yes, there is a small "note" there.
But for me the problem is not that the author didn't know about the 2GB data limit.
The problem is that MongoDB didn't complain when he was inserting data above the limit. A data store that doesn't complain when it runs out of space? That should be mentioned as the biggest problem with the 32-bit version.
One of the main reasons I hear people advocating MongoDB is its ease of horizontal scaling, via replication and auto-sharding. I wonder how many projects have such large data sets that they really require sharding of their data?
I understand having another node or two for failover, but I reckon that with the specs of the largest offerings from AWS or Linode, most people will never need to worry about this and can manage everything on one Postgres or MySQL db. Why complicate things before you have to?