Just a heads up if you are looking to dive right in: The client libraries still lag behind Riak itself, so it might be a while before you get all the goodness unless you plan to roll your own.
Yup. For haskell users, I have a trivial fork of Mailrank's riak-haskell-client that is tested to work with riak-1.0rc1; I haven't tried the secondary indices stuff yet, but it's possible that it would work too. It's at https://github.com/tsuraan/riak-haskell-client for anybody who wants to try it out.
I checked the protobuf spec, though, and Riak 1.0 is completely compatible with the older client libraries. All the changes are backward compatible. So you get perhaps 90% of the new hotness.
There will be a minor gem release (pre/beta/something) next week that supports Riak 1.0 features, then in the next major release those will also bubble up to the Ripple document layer. Sorry for the delay.
Most of the HTTP and PBC APIs are the same, so everything that has already been implemented should still work. I know they haven't added official support for secondary indexes, but I hacked them together real quick here: https://github.com/highgroove/ripple
I think I heard the 1.0-ready Ripple will be out sometime next week. As others have said, all clients will work with Riak 1.0; they just might not support all of the new hotness.
Congratulations! I hope no Basho employees were injured in the last couple days of what, reading between the lines, sounded like a hard fought battle!
I'm about to dig in and read the release notes, upgrade etc, but I've been keeping up with the betas, so here's my early thoughts on Riak 1.0.
With 1.0 Basho has really started to separate from the pack.
I think secondary indexes are very well implemented, and while they can be expanded feature-wise over the years, as an 80% solution they are a no-brainer. They pack the double whammy of cutting developer time and, by efficiently reducing the scope of Map/Reduce processing, boosting performance.
I think the real sleeper hit, though, is riak_pipe. This moves Riak from just being a "database" or even a "batch processing system" into a realtime platform. I think in 3 years, this will be seen as the feature that put the elbow in Riak's growth. I'm hoping to have high level support for riak_pipe in Nirvana when it's released, and can't wait to start using it. Once again, I think you've saved me a couple months of work.
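For anyone who hasn't looked at secondary indexes yet, here is a rough Python sketch of how I understand the HTTP side to work: an object is tagged with an `x-riak-index-<field>` header when it is stored, and it is later looked up by exact value or by range under `/buckets/<bucket>/index/...`. The helper names, the `users` bucket, and the field names are made up for illustration, and the sketch only builds strings; check the docs for your Riak version (and backend) before relying on it.

```python
from urllib.parse import quote


def index_header(field):
    """Header name used to tag an object with an index entry when storing it."""
    return "x-riak-index-" + field


def index_query_url(base, bucket, field, value):
    """URL for an exact-match index lookup (returns matching keys)."""
    parts = [bucket, "index", field, str(value)]
    return base + "/buckets/" + "/".join(quote(p, safe="") for p in parts)


def index_range_url(base, bucket, field, start, end):
    """URL for a range lookup on an index (start and end inclusive)."""
    return index_query_url(base, bucket, field, start) + "/" + quote(str(end), safe="")


if __name__ == "__main__":
    base = "http://localhost:8098"  # assumed local node on the default HTTP port
    print(index_header("email_bin"))
    print(index_query_url(base, "users", "email_bin", "foo@example.com"))
    print(index_range_url(base, "users", "age_int", 21, 30))
```

The point for Map/Reduce is that the key list coming back from such a lookup can feed a job instead of a whole bucket, which is where the reduced scope comes from.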
I know a lot of work was done on supporting new backends, specifically, LevelDB, and consolidating/unifying the existing ones (like merging ETS and Cache into the new RAM backend.) I think a blog post on each of the backends and when best to use them would be very useful (though this might be covered in the new docs.)
And Search Integration. I think this is the first NoSQL solution with built in, scalable, full text search.
Before, if you'd decided to go NoSQL, you kinda had to decide which architecture worked best for you and hope they had the features you wanted. I chose Riak because, I believe, it has the best architecture for the class of problems I'm solving... but now it also has a very complete set of features. I'm not sure if any of the competition is as complete out of the box, but even if they are, Riak should be in a lot more evaluations than it has been in the past.
Further, everything is so elegantly engineered that you've built an exceedingly attractive platform for us developers.
riak_pipe... moves Riak from just being a "database" or even a "batch processing system" into a realtime platform
Interesting - could you explain more about this? I've not really grokked Riak's map-reduce yet, and all I understood from the blog post about riak_pipe was that it was a new layer under the hood but didn't change the programming model for map-reduce queries. Is it simply that it's so much faster as to permit new use cases?
This is all personal opinion, of course. I think the key to what makes Riak great is that it is fully distributed. Every node is a peer, and this eliminated single points of failure. But this also makes things a challenge for organizing work. Riak was previously a fairly monolithic product[1] with a set of features, including being a KV database and doing Map-Reduce processing. At some point Basho, wisely, decided that making the product more modular would allow them to be more agile in their development.
So, they split the KV database from the ring code, creating Riak_Core and Riak_KV. Riak_Core allows you to create a ring of virtual nodes on a cluster of physical nodes, and spread work around it (essentially the dynamo concept). Riak_KV then became an application running on the virtual nodes of Riak_Core, providing a key-value database. At this point (i.e., post Riak_Core split, but pre-Riak 1.0) Riak_KV also managed the Map-Reduce functionality.
With Riak Core, you can create an application that does whatever kind of work you want, and spread it around a dynamo style ring. The ring is just a way of partitioning up work using a hash function so that it can be evenly distributed across the virtual nodes (Which are cleverly distributed across the physical nodes in the cluster.)
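To make the ring idea concrete, here is a minimal Python sketch, not Riak Core's actual code. It assumes a SHA-1 hash space split into equal partitions (64 here) and hands partitions to physical nodes round-robin; the real claim logic is more involved, and the function names are invented for this example.

```python
import hashlib

RING_SIZE = 2 ** 160  # size of the SHA-1 hash space


def partition_for(key, num_partitions=64):
    """Map a key (bytes) to one of num_partitions equal slices of the ring."""
    digest = int.from_bytes(hashlib.sha1(key).digest(), "big")
    return digest * num_partitions // RING_SIZE


def assign_vnodes(num_partitions, nodes):
    """Assumed round-robin ownership: partition p belongs to nodes[p % len(nodes)]."""
    return {p: nodes[p % len(nodes)] for p in range(num_partitions)}


if __name__ == "__main__":
    owners = assign_vnodes(64, ["node1", "node2", "node3", "node4"])
    p = partition_for(b"users/alice")
    print("partition", p, "is owned by", owners[p])
```

Because the hash spreads keys evenly over the partitions and the partitions are spread evenly over the machines, adding a node only means handing some partitions to it.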
Riak Pipe is an abstraction on top of Riak Core that lets you build a pipeline of processing. Each stage in the pipeline is called a fitting. Each fitting has a function (that does the work) and a function to decide which vnode to do the work on. When the pipeline has data going through it, the vnodes that get work create queues and worker processes to do the work. A key feature of this is that if a queue gets full, earlier fittings in the pipeline are stopped from adding to it, such that their queues will eventually get full too (say, if there's a very slow process near the end of the pipe), producing a "back pressure" that prevents work from overwhelming the cluster (or a particular vnode).
So, for Riak 1.0, they re-worked their Map Reduce implementation to run on Riak Pipe. This will allow for more flexible map-reduce jobs in the future (maybe even now). As an example of how the map-reduce implementation works, a map phase might be described as a fitting that uses a word-count function (to do the work) and uses the hash of the piece of data from Riak KV to determine which vnode to run on. So, as you fill the pipe with documents to have their words counted, the tasks get spread to fittings across the cluster, and then each fitting sends its results to the appropriate vnode for the next stage in the pipe (which might be reduce) ... and here's the key point... without it having to talk to the node that started the job. Previously, the node that started the map-reduce job (I believe) had to coordinate it across the cluster... now it self-coordinates.
The great thing about Riak Pipe, though, is that it is (as I see it) essentially a realtime processing engine. Say you had a job where you were monitoring the Twitter firehose for mentions of your company. The task is relatively straightforward, but you wouldn't want it to all be running on a single node, right? So, you'd have the fitting's work function be the code that scans for your company name in the tweet and flags it, and the function that determines which vnode to run on could be a random hash (so it's evenly distributed across all vnodes). When the firehose overwhelms your cluster, you don't find yourself swapping, because back pressure will stop new tweets from going into the pipe, and if you need to add capacity you just add a new machine to the cluster.
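The back pressure part is easier to see in code. What follows is a single-process Python analogy, not riak_pipe's API: each "fitting" is a worker thread with a bounded inbox, and a full inbox blocks whoever is trying to add to it. The names (`run_pipeline`, `queue_size`) are invented for this sketch, and the vnode-selection function is left out entirely.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the input


def run_pipeline(inputs, stages, queue_size=2):
    """Run each stage function in its own thread, linked by bounded queues."""
    queues = [queue.Queue(maxsize=queue_size) for _ in range(len(stages) + 1)]

    def worker(fn, inbox, outbox):
        while True:
            item = inbox.get()
            if item is _DONE:
                outbox.put(_DONE)
                return
            # put() blocks while the downstream queue is full: back pressure.
            outbox.put(fn(item))

    results = []

    def collect():
        while True:
            item = queues[-1].get()
            if item is _DONE:
                return
            results.append(item)

    threads = [
        threading.Thread(target=worker, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]
    threads.append(threading.Thread(target=collect))
    for t in threads:
        t.start()

    for item in inputs:
        queues[0].put(item)  # the producer itself stalls when stage 1 is full
    queues[0].put(_DONE)

    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(range(5), [lambda x: x + 1, lambda x: x * 2]))
```

If the last stage is slow, its inbox fills, the stage before it blocks on `put()`, and the stall walks back to the producer, so at most `queue_size` items are ever waiting in front of any stage.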
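Here is a toy Python version of that word-count example, under the assumption that the routing function is simply a hash of the word modulo the number of partitions. It runs in one process, so "sending to a vnode" is just picking a dictionary entry, and every function name is made up for illustration.

```python
import hashlib
from collections import Counter, defaultdict


def map_word_count(document):
    """Map step: emit (word, 1) for every word in the document."""
    return [(word, 1) for word in document.lower().split()]


def route(word, num_partitions):
    """Routing step: the word's hash decides which partition reduces it."""
    digest = hashlib.sha1(word.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_partitions


def word_count(documents, num_partitions=4):
    """Map every document, route results by hash, then reduce per partition."""
    partitions = defaultdict(Counter)
    for document in documents:
        for word, n in map_word_count(document):
            partitions[route(word, num_partitions)][word] += n
    # Each word lives in exactly one partition, so merging is a plain union.
    total = Counter()
    for counts in partitions.values():
        total.update(counts)
    return dict(total)


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog"]))
```

Since the hash alone determines where a word's counts end up, no central coordinator has to tell the map workers where to send their output.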
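The firehose example boils down to two small functions, sketched here in Python with invented names: a work function that checks a tweet for the company name, and a routing function that picks a vnode at random. It is a single-process illustration with no back pressure, and it says nothing about how riak_pipe actually schedules work.

```python
import random


def mentions_company(tweet, company):
    """Work function: does the tweet mention the company (case-insensitive)?"""
    return company.lower() in tweet.lower()


def random_vnode(num_vnodes, rng=random):
    """Routing function: pick any vnode, so load spreads evenly on average."""
    return rng.randrange(num_vnodes)


def dispatch(tweets, company, num_vnodes=8, rng=random):
    """Send each tweet to a random vnode and record what that vnode flagged."""
    flagged_by_vnode = {v: [] for v in range(num_vnodes)}
    for tweet in tweets:
        vnode = random_vnode(num_vnodes, rng)
        if mentions_company(tweet, company):
            flagged_by_vnode[vnode].append(tweet)
    return flagged_by_vnode


if __name__ == "__main__":
    sample = ["Basho shipped Riak 1.0", "lunch time", "congrats basho!"]
    print(dispatch(sample, "basho", num_vnodes=4))
```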
I'm still wrapping my head around some parts of Riak Pipe. I think that it will turn out to be a really killer feature.
[1] Seems silly to call any cluster of a bunch of erlang processes "monolithic", but a better word is escaping me at the moment.
Interestingly, the first time I saw Riak I was interested in its potential for becoming a stream processing engine. Now, they are following this path. This is a very timely move as there's an increasing interest in this area and many emergent technologies like Yahoo's S4 and Twitter's Storm.
This sounds too flattering toward Riak; it almost sounds like an ad rather than an external congratulation. But what are the downsides & disadvantages of Riak when compared with other contenders? Why do you like it so much more?
I think that Basho deserves congratulations for moving the ball forward in the NoSQL space. If CouchDB had just achieved something similar, I'd have written a similar though admittedly shorter post.
By definition, every NoSQL solution has the downside of not having SQL. Given the popularity of SQL, this immediately rules them out for a lot of people.
Compared to other solutions, Riak, obviously is advantaged from my perspective, given what I value, which is why I chose it.
I'm building a (soon to be open sourced) web development platform on top of Riak.
I am not interested in debating the merits of other solutions, so I won't participate in followups when people disagree with what I say below. That's fine, everyone values different things and has different priorities for various features to solve their different problems.
Here's how Riak compares to the competition, from my perspective:
CouchDB--
CouchDB supports replication from any CouchDB to any other CouchDB with the changes feed. This is a killer feature, and one that Riak doesn't really have. With Cloudant, they have taken CouchDB and spread it over a dynamo style ring, which makes it, in some ways, similar to Riak. CouchDB essentially pre-computes its views, which makes it not a good match for my purposes (which is why I started looking elsewhere in the first place... in fact it was CouchDB using erlang_js, a Basho library, that was my first exposure to Basho). I think CouchDB is probably a great database for a lot of purposes, but to be honest, after the merger with Membase it seems like CouchIO/CouchOne disappeared into a puff of marketing terms and I haven't been able to make heads or tails of what's changed with it over the last year or so.
Cassandra--
Looked into it a couple times, tried to make heads or tails of it, couldn't really, possibly because I'm looking for a document oriented database to begin with. Didn't like the scalability story either. I don't know if I'll ever need to scale or not. I'm going to start with a small cluster that fits my duplication/safety requirements more than anything... but if we do need to scale, I know that re-architecting things is the last thing I'm going to want to be worrying about, as there are going to be many other things to deal with.
MongoDB--
Have never understood the appeal of this. They choose speed over robustness (which is the opposite of what I would choose) and their scalability story is not the no-brainer, no-thought, just-add-a-server, don't-worry-about-it approach that I think is important. I'm sure MongoDB is faster than Riak on a single node. But scaling from 1M requests a day to 100B requests a day will be much easier and faster (in terms of development time and headaches) with Riak... at least that's what I believe.
Hadoop--
A big old rambling project, and a cluster of open source solutions. Whatever you want to do with Hadoop, someone's done it, and if it's at all common, 8 people have done it in slightly different ways. I think Pig is really wonderful, and given the release of 1.0, something like Pig is the only big feature that Basho hasn't really addressed... (you mean I have to write my queries in erlang, bob?) but operationally, Hadoop is too confusing, too much of a moving target, and involves too many decisions that don't fit my personal style. (For many years Java was my favorite language, but I have to admit I'm an erlang snob these days. If you're not writing a distributed platform in erlang, the first thing I'm going to want to know is why, and the rest of the evaluation will suffer under that cloud. I'm more proficient in Java than erlang, but I'm comfortable looking at Riak internals, while the thought of looking at Hadoop internals fills me with great dread. In well designed erlang programs, a module is a single file, and doesn't have a lot of dependencies... I imagine the same functionality in Hadoop will be spread across dozens of classes, though I might be wrong.)
SQL-Anything--
I need just-add-servers-and-don't-worry-about-it robustness and scalability. I don't want to think about sharding, or architecting my applications to support the database... the database should STFU and do its job, and grow with me adding servers. I don't have the budget for an ops guy. I need to run without an ops guy for quite a while. SQL itself really doesn't add anything for me.
So, in summary, a couple alternatives have nice features that Riak could use, but there aren't any real disadvantages or downsides when compared to them for Riak... at least based on what's valuable to me. (which is minimizing my development time and greatly minimizing my time wearing the operations hat.)
Edited: Fixed where I mistakenly typed "CouchDB" (the apache project) when I meant "CouchIO/CouchOne" (the company.) Also upgraded CouchDB's replication from "nice" to "killer" which more accurately represents my opinion of it.
Yes, CouchDB is an Apache project and Couchbase Single Server is a CouchOne product. Except that they are almost exactly the same. Couchbase Single Server is already pre-built into easily installable rpms and has GeoCouch integrated, but not a whole lot more compared to CouchDB trunk (am I wrong?).
So as a developer you want to play around with Couch. Which one do you pick? ... Exactly. Aside from terminology, there is basically a fork at the moment. I understand that CouchOne Server will eventually combine Membase + CouchDB, and that is a great step forward, but currently it makes things a bit harder. It is also not exactly easy to find a comparison of exactly which features are in which. So it is sort of guesswork.