I actually made a fork of redis (as opposed to a reimplementation), with the design challenge of "smallest possible patch" (to make it easier to merge forward), which used LMDB as a graveyard of "old keys" for this very purpose.
git://git.saurik.com/redis.git
I do not quite remember if there were any outstanding "didn't quite finish this corner case" items, but I don't think there were. (If I looked at it I could tell, but I'm on my phone, and should be looking at other things right now anyway.)
(edit: To be clear, I mean in "normal usage". I almost certainly didn't think through issues involving, say, replication, to make certain the semantics were always sensible; my goal was "doesn't crash or break or corrupt".)
(I think the problem was mostly that the higher-level architecture I was putting this piece into underwent a slight change that made me think I might need something slightly different, and the result became "since I literally only use the APPEND command, I can probably come up with something that more directly solves my problem".)
If you have any questions, I'm on IRC (various networks, including freenode and irc.saurik.com as "saurik").
But I remember that some time ago antirez was trying to implement the same idea (it was called "virtual memory" [1]), but failed [2], mainly due to the complexity of the implementation.
Yeah. He, however, was also developing the storage backend himself, so that's already one major source of complexity you don't have if you use an off-the-shelf embedded key-value library. He is also simply working with a wide set of use cases, both at the time and on his roadmap, where VM causes things to "kind of suck". I certainly did not have the goal of making everything "work well"; I just wanted things to "not break horribly" (I added a sentence to my previous post making this more clear). If I had actually built something that would be epic for some wide range of use cases, I'd have done something more with it than just leaving it in a public git repository for months ;P. In the end this didn't even satisfy my use case...
Agreed. I think the main mistake was the (widely held) belief that the key performance factor in k-v databases is complicated multi-level caching, which is obviously hard to implement, let alone implement correctly.
By the way, what was the storage that satisfied your 'append only' use case at the end of the day? AFAIK plain LMDB could get performance on par with raw disk I/O in this case.
What I was working on was a side project, and my time got diverted to things that were more important and I haven't been able to get back to it; the key thing I needed was a very fast "append" primitive (as in, amortized constant-time append performance) that would be very tightly encoded.
In essence, I wanted some kind of vector array implementation (std::vector, java.util.ArrayList, etc.) on a server, but in a situation where a very large number of the values would end up becoming "dead weight" at some point and, while I needed them to not be deleted, could be flushed to disk.
(At the higher level of my system I am able to shard the strings into which I'm appending so they don't themselves get so large that the eviction/recovery process becomes insane, but within windows of time I want near constant-time performance doing random indexing into the appended data.)
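(To illustrate the pattern, not my actual code: a minimal sketch in C with hiredis, where the "vec:" key naming, record size, and shard size are all invented for illustration. APPEND gives the amortized constant-time growth, and GETRANGE gives random indexing within a shard.)

    /* Sketch: fixed-width records appended to sharded Redis strings.
     * RECORD_SIZE, RECORDS_PER_SHARD, and the "vec:" keys are hypothetical. */
    #include <string.h>
    #include <hiredis/hiredis.h>

    #define RECORD_SIZE       16          /* tightly encoded, fixed width */
    #define RECORDS_PER_SHARD (1 << 20)   /* keeps any one key from growing unbounded */

    static void append_record(redisContext *c, long long i, const char rec[RECORD_SIZE]) {
        /* APPEND is amortized O(1); the shard id bounds the size of each key. */
        redisReply *r = redisCommand(c, "APPEND vec:%lld %b",
                                     i / RECORDS_PER_SHARD, rec, (size_t)RECORD_SIZE);
        freeReplyObject(r);
    }

    static int read_record(redisContext *c, long long i, char out[RECORD_SIZE]) {
        long long off = (i % RECORDS_PER_SHARD) * (long long)RECORD_SIZE;
        /* GETRANGE does the random indexing into the appended data. */
        redisReply *r = redisCommand(c, "GETRANGE vec:%lld %lld %lld",
                                     i / RECORDS_PER_SHARD, off, off + RECORD_SIZE - 1);
        int ok = r && r->type == REDIS_REPLY_STRING && r->len == RECORD_SIZE;
        if (ok) memcpy(out, r->str, RECORD_SIZE);
        if (r) freeReplyObject(r);
        return ok;
    }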
Redis, with the caveat that it has a needlessly and seemingly painfully wasteful format for its append-only durability solution, pretty much does this, but it is RAM-only, so I set things up such that old keys could be evicted to LMDB and later could be seamlessly recovered when/if required.
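(The eviction/recovery primitive itself is small with the LMDB C API. Something like the following sketch, which is not the actual patch: error handling is elided, and the database/function names are invented.)

    /* Sketch: stash a cold key's serialized value in LMDB, fetch it on demand. */
    #include <string.h>
    #include <lmdb.h>

    static MDB_env *env;
    static MDB_dbi graveyard;

    void graveyard_open(const char *path) {
        MDB_txn *txn;
        mdb_env_create(&env);
        mdb_env_set_mapsize(env, 1UL << 34);     /* 16 GiB map; tune to taste */
        mdb_env_open(env, path, 0, 0664);
        mdb_txn_begin(env, NULL, 0, &txn);
        mdb_dbi_open(txn, NULL, 0, &graveyard);  /* the unnamed (default) database */
        mdb_txn_commit(txn);
    }

    void evict(const char *key, const void *val, size_t len) {
        MDB_txn *txn;
        MDB_val k = { strlen(key), (void *)key }, v = { len, (void *)val };
        mdb_txn_begin(env, NULL, 0, &txn);
        mdb_put(txn, graveyard, &k, &v, 0);
        mdb_txn_commit(txn);                     /* durable once this returns */
    }

    int recover(const char *key, void *out, size_t cap, size_t *len) {
        MDB_txn *txn;
        MDB_val k = { strlen(key), (void *)key }, v;
        int rc;
        mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
        rc = mdb_get(txn, graveyard, &k, &v);
        if (rc == 0 && v.mv_size <= cap) {
            memcpy(out, v.mv_data, v.mv_size);   /* copy out before the txn ends */
            *len = v.mv_size;
        }
        mdb_txn_abort(txn);
        return rc == 0;
    }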
I do not believe "plain LMDB" (without Redis) is a good solution to this problem, but I am willing to believe that I just don't understand how to use it very well (I never found any real "documentation", so I was reading all the comments in the code, looking at the API, checking the mailing list, and falling back on some guess/check; not many people use it).
(edit: Just realized this maybe didn't answer the question while re-reading our thread. Will continue with more focus.)
One thing that I was dissatisfied with here is that the durability of the eviction to LMDB is still only going to be on a single computer; for my purpose, once a key is evicted, its latency requirements drop dramatically, so I could potentially store it to something much slower (like S3).
Honestly, though, I don't quite remember what I decided was sub-optimal about this, because my brain is right now filled with other things that I don't quite want to fully context switch out for the purpose of trying to bring this project back to the front of my mind ;P. It worked, though.
LMDB is only meant to be an embedded engine. Distribution/replication are features to add in higher level layers. E.g., memcacheDB built on LMDB exists, you can add repcached or something else to do replication over that. We're also adding LMDB to HyperDex and Riak. We may end up writing our own cluster manager at some point, if we don't like the ones we can work with today. But the key is to stick to Unix philosophy - write small modules that focus on one particular task. LMDB addresses storage in a single node. Some other component above it can do distribution/sharding/whatever.
That's exactly what I was accomplishing by having redis do key eviction to LMDB ;P. To be clear: it wasn't that LMDB wasn't awesome for the purpose of adding disk persistence and key eviction to redis, it was that having unclustered copies of redis with disk persistence turned out to not nail what I wanted to accomplish. I figured I'd reevaluate redis's cluster support at some later time, or build something more custom now that I've figured out what primitives I needed with more certainty.
(The other comment I made about "plain LMDB" was to address a question posed by snaky, where I was assuming either embedding it into my app server or having a thin wrapper; I was assuming the suggestion was that LMDB could handle "append to key" in a durable and fast manner without relying on redis to first buffer the data to be written en masse as larger chunks, which is pretty much what it is doing in this architecture I was designing.)
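(For what I mean by buffering, a sketch only, not my code: accumulate small appends in RAM, as Redis is effectively doing in this architecture, and push them into LMDB as one larger chunk per write transaction. The "chunk:" keys and the 1 MiB threshold are made up.)

    /* Sketch: buffer small appends and flush them to LMDB as one big value. */
    #include <stdio.h>
    #include <string.h>
    #include <lmdb.h>

    #define CHUNK_CAP (1 << 20)               /* flush once ~1 MiB has accumulated */

    struct append_buf { char data[CHUNK_CAP]; size_t used; unsigned next_chunk; };

    /* Assumes n <= CHUNK_CAP; error handling elided. */
    void buf_append(struct append_buf *b, MDB_env *env, MDB_dbi dbi,
                    const void *p, size_t n) {
        if (b->used + n > CHUNK_CAP) {        /* one large value, one txn */
            char keybuf[32];
            MDB_txn *txn;
            MDB_val k, v;
            snprintf(keybuf, sizeof keybuf, "chunk:%u", b->next_chunk++);
            k.mv_size = strlen(keybuf); k.mv_data = keybuf;
            v.mv_size = b->used;        v.mv_data = b->data;
            mdb_txn_begin(env, NULL, 0, &txn);
            mdb_put(txn, dbi, &k, &v, 0);
            mdb_txn_commit(txn);
            b->used = 0;
        }
        memcpy(b->data + b->used, p, n);
        b->used += n;
    }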
(BTW, I've never talked to you, but I knew one of your former students. Hello! ;P)
OK, that makes sense. Yah, it sounded like that to me too, but LMDB's Append mode is only usable when all the input is already in sorted order. Great for bulk loads of prepared data, not much help with randomly arriving input.
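(For example, a sketch with invented record contents: MDB_APPEND just fast-paths the put onto the end of the tree, and mdb_put returns MDB_KEYEXIST as soon as a key arrives out of order.)

    /* Sketch: MDB_APPEND bulk load; keys must be inserted in sorted order. */
    #include <stdint.h>
    #include <string.h>
    #include <lmdb.h>

    int bulk_load(MDB_env *env, MDB_dbi dbi, uint64_t nrecs) {
        MDB_txn *txn;
        int rc = mdb_txn_begin(env, NULL, 0, &txn);
        if (rc) return rc;
        for (uint64_t i = 0; i < nrecs && rc == 0; i++) {
            /* big-endian keys so memcmp order matches numeric order */
            uint64_t key = __builtin_bswap64(i);
            char val[64] = "prepared record";
            MDB_val k = { sizeof key, &key }, v = { sizeof val, val };
            rc = mdb_put(txn, dbi, &k, &v, MDB_APPEND);
        }
        if (rc == 0) rc = mdb_txn_commit(txn); else mdb_txn_abort(txn);
        return rc;
    }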
Oh, you are so right about documentation. I found the presentation slides on the LMDB page to be the only viable source of 'why and how' information, which in my opinion is (ok, almost) always way more important than an API reference (especially an autogenerated one). It was funny to find the Python bindings' docs more useful than the original, huh.
Feel free to submit doc patches. I'm pretty sure I've made all of the relevant information/description/motivation available, it sounds like you just have to collect it into a single place now.
You can also catch us on irc #openldap if you want to ask questions about it and aren't posting to the openldap-technical mailing list. The py-lmdb guy has been on irc with us a lot lately, which is probably why his docs have so much info - he keeps asking questions and getting answers.
I'm afraid that for me, as a non-native English speaker (to say the least), writing the perfect documentation would be a little suboptimal, unfortunately.
And I agree with you that all the relevant pieces of information are there, and for me personally, as well as for saurik, the source code is actually more than enough, especially considering it is really clear and compact.
I just wish LMDB were more popular and widely adopted, because it's a really brilliant piece of software. And I think documentation is one of the main factors in that, that's all.
Anyway, thanks for the pointer. I'm thinking about the potential complexity of adding to LMDB the ability to use secondary indexes (separately stored, due to the carefully tuned main storage scheme), so I'm sure I'll come discuss it with you on IRC and the mailing list.
Yeah, I've focused my attention more on the code than the docs, which is pretty typical for me. But frankly I think the doxygen stuff is already very illuminating.
For secondary indexing, you should take a look at what slapd back-mdb does. I know that BDB supported secondary indexes as a built-in feature, but we never used it because it was too slow. What we're doing now is already quite fast.
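(Not what back-mdb does verbatim, but the rough shape of it: keep the secondary index in a separately stored named database, duplicates allowed, mapping attribute value to primary key, and update both databases under the same write transaction. The database names and the "email" attribute here are invented.)

    /* Sketch: primary table plus a separately stored secondary index,
     * both updated in one transaction. */
    #include <string.h>
    #include <lmdb.h>

    int put_with_index(MDB_env *env,
                       const char *id, const void *record, size_t reclen,
                       const char *email /* indexed attribute */) {
        MDB_dbi primary, by_email;
        MDB_txn *txn;
        MDB_val k, v;
        int rc = mdb_txn_begin(env, NULL, 0, &txn);
        if (rc) return rc;
        mdb_dbi_open(txn, "primary",  MDB_CREATE, &primary);
        mdb_dbi_open(txn, "by_email", MDB_CREATE | MDB_DUPSORT, &by_email);

        /* primary: id -> record */
        k.mv_size = strlen(id); k.mv_data = (void *)id;
        v.mv_size = reclen;     v.mv_data = (void *)record;
        rc = mdb_put(txn, primary, &k, &v, 0);

        /* secondary: email -> id (MDB_DUPSORT lets one email map to many ids) */
        if (rc == 0) {
            k.mv_size = strlen(email); k.mv_data = (void *)email;
            v.mv_size = strlen(id);    v.mv_data = (void *)id;
            rc = mdb_put(txn, by_email, &k, &v, 0);
        }
        if (rc == 0) return mdb_txn_commit(txn);
        mdb_txn_abort(txn);
        return rc;
    }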