I think your statement is both out of date and overly broad. You also imply that the database itself would be limited to 500k keys, which is silly, when you really mean a single map-reduce job. And further, are you really doing MR over your entire dataset all the time, or would key filtering, ranges, or secondary indexes be a better fit? It's easy to do M/R in Riak over only the relevant subset of your data.
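For concreteness, here is a minimal sketch of what I mean, using Riak's HTTP MapReduce endpoint with a secondary-index range as the job input. The host, bucket name (`logs`), index name (`date_bin`), and date range are all hypothetical placeholders; the point is that only the keys matching the index range ever reach the map phase.

```python
# Minimal sketch: a Riak MapReduce job scoped by a secondary-index (2i)
# range, so it folds over a bounded set of keys rather than the whole
# bucket. Bucket, index, and range values here are hypothetical.
import json
import requests

job = {
    # 2i range input: Riak resolves this to just the matching keys.
    "inputs": {
        "bucket": "logs",
        "index": "date_bin",
        "start": "2012-01-01",
        "end": "2012-01-07",
    },
    # Map-only query using a JS built-in that ships with Riak; reduce
    # phases can be appended to this list in the same way.
    "query": [
        {"map": {"language": "javascript", "name": "Riak.mapValuesJson"}}
    ],
}

resp = requests.post(
    "http://localhost:8098/mapred",
    headers={"Content-Type": "application/json"},
    data=json.dumps(job),
)
print(resp.json())
```

The same endpoint also accepts key-filter and explicit bucket/key inputs, which is all I mean by "the correct amount of data": you pick the input that bounds the job.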
Hadoop may have been better for what you are doing, and logging metrics is a particular use case where a specialized database is most appropriate.
But it is incorrect to imply that Riak falls over at some specific key count; that is simply untrue. With Riak you can always add more nodes if you need more capacity, and MapReduce runs in a distributed fashion, so adding nodes adds MapReduce capacity too. It's not perfect, but it is not brittle.
That thread shows that the particulars of your claims about Riak are actually false. Further, it seems you didn't take the time to understand how Riak could solve your problem, and so concluded that it cannot.
"If large-scale mapreduce (more than a few hundred thousand keys) is
important, or listing keys is critical, you might consider HBase."
"Riak can also collapse in horrible ways when asked to list huge numbers
of keys. Some people say it just gets slow on their large installations.
We've actually seen it hang the cluster altogether. Try it and find out!"
I chose the most polite way to point out his error, and now you are compounding it by attempting to rebut me with quotes that don't actually rebut me if you know what you're doing. Listing all keys is a function meant for debugging, not for running in production; if you're building M/R jobs on top of it, you don't know what you're doing. The person you're quoting did, in fact, say they were doing MR jobs over billions of keys. Further, the person who made that recommendation doesn't work for Basho, and saying the OP should consider HBase is not the same as saying that Riak can't do it.
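To make the distinction concrete, this is the operation in question, sketched against a stock Riak HTTP endpoint (host and bucket name are hypothetical placeholders). Answering it forces Riak to fold over every key on every vnode, which is exactly why it's a debugging aid and not a production job input:

```python
# Sketch of a full key listing, which should never back a production
# M/R job: Riak must walk every key on every vnode to answer it.
# Host and bucket are hypothetical placeholders.
import requests

resp = requests.get(
    "http://localhost:8098/buckets/mybucket/keys",
    params={"keys": "true"},
)
print(len(resp.json()["keys"]))
```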
If you want to say I'm wrong, make a specific argument. Don't selectively quote things out of context that don't actually rebut my position; that is profoundly dishonest. It is a way of pretending to rebut someone without saying anything yourself, so you can't be pinned to any statement. It is disingenuous.
I'm really tired of having to rebut these argument-from-ignorance "rebuttals" here on HN.
"The person you're quoting, in fact, said they were doing MR jobs over billions of keys"
Ctrl-F "billions" turns up exactly one match in the post I was quoting; there is no other reference to very large MR jobs in it.
"At Showyou, we're also building a custom backend called Mecha which integrates Riak and SOLR, specifically for this kind of analytics over billions of keys. We haven't packaged it for open-source release yet"
So the OP is supposed to use an unreleased experimental custom backend to do his big mapreduce jobs?
I am the person being quoted. You are correct that key listing is not suitable for production use. We definitely don't do MR jobs over billions of keys: our huge data queries are powered by Mecha, which uses Solr.