I work at an ed-tech company in the Bay Area. We're a for-profit company, but we've found success building study tools aimed at students and teachers rather than going through top-down school district procurement. If you're interested in working in this space, or just in chatting, feel free to message me. Same to others!
This isn't unique to Mongo; other systems that shard on a user-generated key (HBase, for example) have issues with this too. We developed this UUID generator as an attempt at a more thoughtful shard key: https://github.com/groupon/locality-uuid.java
We encountered some of these same issues and wrote this library to mitigate them: https://github.com/groupon/locality-uuid.java. I think UUIDs make good unique ids overall, particularly in distributed environments where id generation can't be coordinated, but they should be used carefully, as the article notes.
As far as I can tell, neither MySQL nor MariaDB has a function for generating type 4 UUIDs. It's of course possible to generate the UUIDs on the client side, but then it's not really an alternative to auto-incrementing surrogate keys.
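For what it's worth, client-side generation is only a couple of lines. Here's a minimal sketch in Java; the users table, JDBC URL, and credentials are hypothetical, and java.util.UUID.randomUUID() produces a type 4 UUID:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.UUID;

    public class ClientSideUuid {
        public static void main(String[] args) throws Exception {
            // Hypothetical schema: users(id CHAR(36) PRIMARY KEY, name VARCHAR(100))
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost/test", "user", "password");
            String id = UUID.randomUUID().toString(); // type 4 UUID, generated client-side
            PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO users (id, name) VALUES (?, ?)");
            ps.setString(1, id);
            ps.setString(2, "alice");
            ps.executeUpdate();
            ps.close();
            conn.close();
        }
    }

Of course, this is exactly the point above: the key is generated by every writer rather than by the database, which is what auto-increment gives you for free.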
Here's my theory on MongoDB, having spent a lot of time thinking about it:
There is going to be some huge number of new developers every year for the foreseeable future, and Mongo is the lowest barrier to entry database available. These people don't know SQL and don't care that "Postgres is better". Mongo does a few useful things really well, and everything else (like sharding) is a moat around that core value proposition, which is why Mongo gets away with being bad at a lot of stuff.
Mongo is good at:
* Starting up (when you download it, you literally just get a mongod binary).
Mongo is bad at:
* Analytical queries. I've literally written a book about Mongo and still can't remember most of the query syntax.
* Efficient data storage - documents take a huge amount of space.
* Scaling.
* Sharding - this feature has always been half-baked.
* A lot of other stuff.
Your immediate inclination is probably to say something like "but look at (Couchbase|MySQL|Cassandra|Rethink|Postgres), it's so much better written and it does (document storage|relational querying|scaling|sharding|json documents now in version 9.4!!!!!) so much better". Again, that's not the point; the lesson here is that MongoDB is a thing because a low barrier to entry is a better feature for many users.
What's funny is that on all but one of those points, Solr beats the snot out of Mongo. The only debatable one is ease of starting up, which is a silly criterion for a project with any sort of lifespan. Even then, Solr startup is really quite easy.
Obviously Solr uses special-purpose indexing that one might argue is a bad fit in some cases, but if you're using Mongo anyway...
Postgres is good, but its horizontal scaling "story" is not great, obviously. Still, you'll take much longer to find the edges of Postgres's goodness than to bump into Mongo's limitations.
For a developer who knows some other language but not SQL and just wants to throw a datastore on, MS Access is a higher barrier to entry than MongoDB.
It might be a lower barrier to entry to learn Access from the beginning than (programming language) + (MongoDB) -- though I'm somewhat dubious of even that.
Since the newer versions of Access use SQL Server Express as a backend, I would expect them to be significantly more reliable and performant than MongoDB.
While its intentions are noble, in my experience the GPL is Free as in "you probably need a lawyer to ensure you're not violating it". Doubly so for AGPL.
The irony of RMS spreading his own vision of "Freedom" is that the GPL restricts things he doesn't agree with, which seems to violate the most basic principle there. I've been involved with several organizations that, as a matter of policy, avoid GPL code in favor of MIT/BSD because it raises too many issues. I'm not saying the GPL hasn't done good things (my understanding is that much of Apple's compiler work is public because they started with GPL code), but the wind seems to be blowing toward less restrictive OSS licenses.
This is a problem close to my heart, thanks for pointing it out. I have a project up at http://commonwealth.io attempting to reduce the barrier to using data like this. Even when the data is accessible, it could be in any format, and you have to load it into a DB to get any value from it.
The idea is to enable real SQL queries directly on data sets, so you don't need to worry about accessibility or how you'll query them. There's not much data in it now, and it doesn't solve your immediate problem, but it's perhaps a better model for storing and representing this kind of data.
Since the article is a resubmission, I hope you all will excuse a comment resubmission:
This is the 3rd or 4th time I've seen this article in the past few days, so I've decided to post my take. I work with Mongo in a production setting, but I'm hesitant to post because these threads tend to turn into pointless arguments. So let me stress: this is not an attack on the blog post; I'm hoping to improve the discussion here. Mongo has some real problems, but the ones mentioned here are not among them. Here goes, written alongside the article's sections.
- "It lies." Mongo used to have a default where a driver would fire off a write and not check that it succeeded with the server. This was very obviously a decision made to improve benchmark performance, though I imagine a benchmark with only default settings would be rather naive. Regardless, yes, the default was a stupid corporate decision but it is well known and should be apparent if you're deploying a Mongo cluster. Additionally, as the author notes, this default has changed and this entire point is no longer a concern.
- "Its Slow." A real point he raises is that its a little wonky you need to send a separate message for getLastError. I suspect this is an artifact of Mongo's historical internal structure. http://docs.mongodb.org/meta-driver/latest/legacy/mongodb-wi.... . If you look at the wire protocol, I think it is designed such that only the OP_QUERY and OP_GET_MORE message types get an OP_REPLY back. getLastError is a command, which are run through an OP_QUERY message. He notes that using this check affects performance. It does, but lets dig into this:
"Using this call requires doubling the cost of every write operation." If the author benchmarked this, I suspect he would find that it vastly more than doubles the latency of a single write operation from the client's perspective. My understanding is that when performing this kind of safe write, the driver sends an OP_INSERT, for which it doesn't have to wait for a reply, then immediately sends an OP_QUERY message (getLastError), on which it hangs waiting for the OP_REPLY. In other words, this is now slower because we've created a synchronous operation immediately after just firing off the insert command. Again, its a little wonky that we send Mongo two messages, but one is immediately after the other and that is vastly overshadowed by the fact that we now have to wait for a reply. I believe 1 synchronous send and receive is unavoidable to ensure a safe write in ANY system, and the argument about sending 2 messages really boils down to sending about 200 bytes over a socket vs 400 bytes, I personally don't worry about it.
- "It doesn't work pipelined / it doesn't work multithreaded." He is also missing the real complexity here, and this is where I have a problem with this blog post because this is not a theoretical discussion, if its content is true then it should just be proven rather than FUD launched into the world. As noted in the docs ( http://docs.mongodb.org/manual/reference/command/getLastErro.... ), getLastError applies for the socket over which the command was run, so its up to the driver to execute the getLastError on the same socket as the write, which is an implementation detail and a solved problem. The way the drivers do this in practice is you set the write concern and the driver takes care of the rest. If you run getLastError manually then it depends on the driver, but for Java the correct procedure is addressed at http://docs.mongodb.org/ecosystem/drivers/java-concurrency/ . So for the fastest possible safe performance, you multithread (which is effectively pipelining from the server's perspective) and run operations with the driver's thread-safe connection pool. Suffice to say people actually use these drivers in a multithreaded context in the real world, and they work.
- "WriteConcerns are broken." There are several relevant settings here, including write concern acknowledgement and fsync, the author is confused about how these map to the Java driver WriteConcern enum values. The acknowledgement setting (elegantly stored in a variable named "w", thanks 10gen) is the number of replicas that must confirm the write before the driver believes it has succeeded. I personally set this to 1, but you could potentially wait for the entire replica set to acknowledge. The fsync setting is whether or not this acknowledgement means that the server on these machines has completed the write in memory, or actually synced the data to disk. I set this to false for performance. There is an excellent StackOverflow answer on Java Mongo driver configuration at http://stackoverflow.com/questions/6520439/how-to-configure-.... .
The author also spends time noting that if you only ensure the write succeeds on a single machine, and you irreparably lose that machine before replication, then the data is lost. This is obviously true for every distributed system.
I've run a 4-shard Mongo cluster in a multithreaded production environment that handled several hundred million writes. For part of that period our cluster was extremely unstable because of a serious data-corrupting Mongo bug (more on this in a second). I haven't done a full audit, but based on our logs, in about 6 months I've seen exactly 1 write go missing (which I believe ended up in the rollback log), so I'm personally not concerned about the things mentioned in the blog post. I've also been happy with performance as long as the data size stays under the memory limit. If your data exceeds memory, Mongo essentially falls on its face, though that's hard to avoid in any system once queries require disk access.
Mongo is not without its problems, however. As I mentioned, QA is a real concern: we hit a subtle bug when we upgraded to v2.2 that caused data corruption when profiling was turned on and the cluster was under high load. It was very difficult to debug and really should have been caught by 10gen before the release.
Another serious problem is that sharding configuration is still somewhat immature, and it seems like every new release is described as "well, it used to be bad, but we finally fixed it". Here is an example: you pick a shard key that can't be split into small enough chunks, and shard balancing silently fails. Ok, so you pick a better shard key, but you can't migrate to a new shard key, so you have to drop the collection and start again. Except dropping a collection distributed across a cluster is buggy, so you can't recreate the collection with the same name and a different shard key. So you pick a new name for your collection, and the original sits around in a weird broken state forever unless you completely blast your cluster and start from scratch. This sort of thing is not fun!
Mongo has many pros and cons; personally, I think its real advantage is simplicity for developers, which makes it worth putting up with the other stuff. Sorry for being long-winded; hopefully this has been useful.
I'm the author of the blog post, and I just saw this resubmission on HN. I wanted to quickly touch on some of these points:
* "It lies." Mongo lied at the time I wrote the post, and while the default has been changed, the "this entire point is no longer a concern" seems overly optimistic. See here for further details where successful writes are lost: http://aphyr.com/posts/284-call-me-maybe-mongodb
* "It's slow." We have indeed benchmarked Mongo's behavior, and, even though developers give up on consistency and fault-tolerance by using MongoDB, they don't get high performance in return for their tradeoff:
http://hyperdex.org/performance/
* "It doesn't work pipelined / it doesn't work multithreaded." I don't quite see a correctness argument here. I read through the sources of the Java driver at the time I wrote the blog post. If you follow the pattern described in the link you provided (java-concurrency), you will find that thread A can issue getLastError and receive results for thread B's operations. That is broken. The word "thread-safe" does not mean what you think it means. "People use it without ill effects" isn't as strong an argument as "I read through the code and it looks broken." Further, when 10gen responded to this blog post, they were unable to refute the technical point:
http://hackingdistributed.com/2013/02/07/10gen-response/
Agreed that Mongo's simplicity is a big draw for many developers new to NoSQL. Sadly, the system provides very weak properties, and applications built on top end up having even weaker properties still.
I'm a Dvorak user; I think the article is statistically correct but misses the broader point.
- The most important part of the Dvorak layout is that the most frequent English letters are on the home row, with the vowels pushed to the left. This means you're less likely to move your fingers off the home row and more likely to alternate hands between letters. That's the theory of why Dvorak is faster and more comfortable. It's a little crazy that E isn't on the Qwerty home row.
- It was much harder to switch to a new layout than I expected. It took about a month of typing with Dvorak every day, and it destroyed my ability to type on a Qwerty keyboard. For a period during the transition I couldn't really type well on either layout: not fun, and not easy to explain to my boss why I suddenly couldn't fucking type. Interestingly, I can still type fine on my iPhone's Qwerty keyboard, so that is apparently a separate process in my brain.
- I think the article is similar to saying "there's no statistical proof that using the metric system is faster". Using the metric system is the kind of thing that makes sense intuitively, but if you took 100 scientists using imperial measurements and retrained them to use the metric system, it would be hard to conclusively prove that it is _BETTER_.
- To extrapolate a little, I think there is a broader point here: statistics are usually used to make an argument, often deceptively. It is incredibly hard to create a clean sample in the real world, and even then it's difficult to extract real meaning from the numbers. Remember, the average person has fewer than 2 legs.
- My personal feeling is that Dvorak is a LOT more comfortable and that I type maybe 10% faster with it, though I can't really back that up. Maybe it's just that I learned it second, or maybe I'm fooling myself, who knows. But I spend something like 10 hours a day typing, so if I type 10% faster over my lifetime then I've... turned a profit? Maybe I'll spend those extra days at the end of my life doing a better study on how much faster sailors can type on Dvorak.
Even if you don't use Dvorak, I HIGHLY recommend swapping your Escape and Caps Lock keys, especially if you use Vim. Think about how much more you use Escape than Caps Lock. I use a tool called PCKeyboardHack to configure that remap and Dvorak on OS X; on Windows, there's a tool called AutoHotkey.
> - The most important part of the Dvorak layout is that the most frequent English letters are on the home row, with the vowels pushed to the left. This means you're less likely to move your fingers off the home row and more likely to alternate hands between letters. That's the theory of why Dvorak is faster and more comfortable. It's a little crazy that E isn't on the Qwerty home row.
> - It was much harder to switch to a new layout than I expected. It took about a month of typing with Dvorak every day, and it destroyed my ability to type on a Qwerty keyboard. For a period during the transition I couldn't really type well on either layout: not fun, and not easy to explain to my boss why I suddenly couldn't fucking type. Interestingly, I can still type fine on my iPhone's Qwerty keyboard, so that is apparently a separate process in my brain.
This so accurately describes my experience that I'm now curious whether others respond the same way. I took on Dvorak as a learning challenge as I was finishing my degree (and also to stop people from using my user account at work). It took me 4 weeks to become comfortable in Dvorak, one year to become proficient (Dvorak wpm > qwerty wpm), and about two years to regain proficiency in qwerty. It helps tremendously that I use both equally throughout the day thanks to shared computers.
As a side note: holy hell, writing HN comments within Reeder on iOS is a pain.
I am also a Dvorak user. I used to be a hunt-and-punch typist, and switching to Dvorak forced me to learn to touch type because the keycaps no longer matched the layout.
Touch typing (probably even on qwerty) is so much better than hunting and punching that there is no comparison. If anyone out there hunts and punches, I highly recommend that you switch your keyboard to Dvorak, make your desktop background an image of the layout, and give it a chance.
I struggle to imagine why the caps lock key exists at all, and even granting that it should, I can't imagine why it is typically so big and placed in one of the best spots on the keyboard. The only time I ever use caps lock is to turn it off after pressing it by accident.
I love these maps for some reason. If you're interested in them, I've drawn a chart of cable crossings at http://commonwealth.io/chart/5628626. It's backed by data in a public Mongo instance, which you can query by pointing a Mongo driver at commonwealth.io, port 27017.
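If you want to poke at it, here's a minimal sketch with the Java driver; I'm only listing database names, since I haven't spelled out the collection names above:

    import com.mongodb.MongoClient;

    public class Commonwealth {
        public static void main(String[] args) throws Exception {
            // Public instance mentioned above; list what's available first,
            // then drill into databases and collections from there.
            MongoClient mongo = new MongoClient("commonwealth.io", 27017);
            for (String name : mongo.getDatabaseNames()) {
                System.out.println(name);
            }
            mongo.close();
        }
    }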