LevelDB: A Fast Persistent Key-Value Store

gojomo · on July 27, 2011

An interesting development a while back that I'm surprised hasn't received more attention was Oracle's release of a SQLite-based interface to BDB:

http://www.oracle.com/technetwork/database/berkeleydb/overvi...

It's essentially drop-in compatible with SQLite, but with added concurrency and speed for most operations. (The concurrency addresses a major issue usually keeping SQLite as a prototyping/single-user-only option in web development.)

With LevelDB as a BSD-licensed alternative to BDB, I wonder:

(1) How would the LevelDB-vs-SQLite benchmarks change against SQLite+BDB backend?

(2) Could a SQLite fork with a LevelDB backend get a performance boost?

gaborcselle · on July 27, 2011

Thanks for the link! You could theoretically just compile this file against SQLite-based BDB: http://code.google.com/p/leveldb/source/browse/trunk/doc/ben... And get the numbers yourself. (If you do, please post them here.)

est · on July 28, 2011

> SQLite-based interface to BDB

One thing I didn't get about SQL API for BDB, how does something like

    select * from users where name!='tom'

work ?

gojomo · on July 28, 2011

You really don't know or care that you're using BDB; it works (to the user) just like SQLite. (Behind the scenes, it's using BDB for the tables/indexes, and so would do various full- or partial- table scans much like SQLite's native on-disk format.)

dchest · on July 27, 2011

See also:

Previous discussion - http://news.ycombinator.com/item?id=2526032

Benchmarks vs Kyoto TreeDB and SQLite3 - http://leveldb.googlecode.com/svn/trunk/doc/benchmark.html (discussion - http://news.ycombinator.com/item?id=2813061)

Benchmarks vs InnoDB - http://blog.basho.com/2011/07/01/Leveling-the-Field/

gaborcselle · on July 27, 2011

Hi there! I'm a YC alum (reMail W09) and helped Jeff and Sanjay with LevelDB. Let me know if you have any questions about LevelDB and I'll see if I can help.

clumsysmurf · on July 27, 2011

I would really like to use this from my Android Java application. Is this possible, and what would be the best way to accomplish this?

aashay · on July 28, 2011

How does this compare to other persistent key-value stores such as Membase?

dlsspy · on July 28, 2011

Membase is a clustered data storage service your application uses.

LevelDB is a persistence library.

That makes LevelDB the kind of thing you plug into membase to get the unique properties it has to offer (or at least for fun).

pjscott · on July 28, 2011

In fact, the Riak guys are planning on doing exactly that: offering LevelDB as one of the storage back-ends, perhaps even the default.

http://blog.basho.com/2011/07/01/Leveling-the-Field/

gojomo · on July 27, 2011

Any comparisons of performance or functionality against BerkeleyDB?

sallen · on July 27, 2011

Does this system have transactions and ACID guarantees?

stonemetal · on July 28, 2011

Sorry if this is a bit off topic but it seems to me like most of Google's opensource projects are more source available than open source. Do they actually take contributions from the community or are they all like android, source made available when its "ready for public consumption"?

LevelDB sounds like something I would like to contribute to but if the reception is going to be chilly I won't bother, maybe pick up mongo or redis instead.

gleb · on July 27, 2011

The synchronous writes benchmark is interesting. This is normally bound by # seeks your disk can do per second, which is mostly a function of rotational speed. With 7200RPM drive you get 7200/60 = 120 of these a second. So the 100 and 110 numbers for competitors make sense. 2,400 for LevelDB does not.

Is LevelDB batching writes or is there something more interesting going on?

agazso · on July 27, 2011

If you are writing sequentially, then you can write more than the number of seeks.

And that is exactly what LevelDB is doing: writing a log (sequential), and when the memorychunk is full, it is writing it to disk sorted (this is also sequential).

leif · on July 28, 2011

flushing the log in an LSM is only kinda sequential, sadly

leif · on July 28, 2011

Data structures which require a disk seek per random insert are obsolete. LevelDB is using a Log-Structured Merge Tree, one of many write-optimized data structures (but not the best).

DanWaterworth · on July 28, 2011

This link, comparing LSM trees with fractal trees, is quite interesting: http://www.quora.com/What-are-the-major-differences-between-...

stephth · on July 27, 2011

Is LevelDB batching writes

Yes, updates can be done in one atomic batch. Please correct me if I'm wrong, but I don't think Tokyo Cabinet allows it without Tokyo Tyrant.

groby_b · on July 27, 2011

If you write full disk blocks, wouldn't the disk cache hide the seek latency?

gleb · on July 27, 2011

Having write disk cache on would certainly explain it. But that leaves the question of discrepancy with numbers with competitors.

You turn off write-through caching on disks when you run a database unless you are willing to accept corruption (which is worse than data loss) on power outage. And that's why you can't get acceptable write performance out of database without a battery-backed RAID controller (or something other kind of RAM-based write cache with a battery backup).

Here's a simple way to test # fsyncs/s (a.k.a. commit rate) on your system:

  sysbench --test=fileio --file-fsync-freq=1 --file-num=1 \
   --file-total-size=16384 --file-test-mode=rndwr run --max-time=10 \
   | grep "Requests/sec"

groby_b · on July 28, 2011

If cache is on, any performance discrepancy can be explained away by "usage patterns" :)

Also, do you really mean turn off write-through, or did you mean write-behind? (I can't see how write-through would cause corruption, but maybe I'm missing something...)

Also, I wouldn't be surprised if there's a discrepancy in the flushing code across systems. God knows flushing a file to disk in cross-platform code is an arcane science :)

And finally, as somebody else pointed out, LevelDB seems to order write access sequentially as much as possible.

jchrisa · on July 27, 2011

Bindings for Node.js if anyone is interested: https://github.com/creationix/node-leveldb

stephth · on July 27, 2011

From the announcement: it has already been ported to a variety of Unix based systems, Mac OS X, Windows, and Android.

It's worth noting that the makefile includes options to build for iOS. I've successfully done it and my next iOS app will include LevelDB. Also worth noting, thanks to the iOS devices SSDs, it's much faster than with the traditional HDD machines.

taylorbuley · on July 27, 2011

Upcoming versions of the Chrome browser include an implementation of the IndexedDB HTML5 API that is built on top of LevelDB

Really excited about seeing IndexedDB run atop of this

rektide · on July 27, 2011

Open sourced as of March 18th, 2011: http://code.google.com/p/leveldb/source/browse/trunk/LICENSE...

That initial checkin: http://code.google.com/p/leveldb/source/detail?r=2

gaborcselle · on July 27, 2011

Yes, we put up the Google Code site incognito mode back then, but have since added a number of bugfixes and optimizations, so we're actually comfortable announcing the project now.

swah · on July 27, 2011

Interesting how, like in the open-sourced protobuf, there are no commits by Jeff or Sanjay...

gaborcselle · on July 27, 2011

dgrogan and myself have been batching changes to LevelDB from our internal code repository to put them on the Google Code page. Playing Google Code site admin didn't seem to me like a good use of Jeff and Sanjay's time.

swah · on July 27, 2011

Yep, its great you guys could separate it from internal dependencies! Congrats.

shadowmatter · on July 27, 2011

Jeff and Sanjay wrote the original protocol buffer implementation. The project was taken over by Kenton Varda, who rewrote the C++ and Java parts; this is what was open sourced. See http://temporal.fateofio.org/files/resume

swah · on July 27, 2011

Which is what I point as being interesting.

newhouseb · on July 27, 2011

How is this different than BDB?

davidhollander · on July 27, 2011

BDB is a key\value store for unordered data more similar to Tokyo Cabinet hash databases. Tokyo Cabinet hash databases are a much faster option than BDB if you only need unordered data.

LevelDB is for if you need ordered data, and a more appropriate comparison would be against a B+\tree database.

stephth · on July 27, 2011

LevelDB is for if you need ordered data

LevelDB is slower with random reads, but that doesn't mean you shouldn't use it for unordered data - it's still quite fast.

davidhollander · on July 27, 2011

>LevelDB is slower with random reads, but that doesn't mean you shouldn't use it for unordered data - it's still quite fast.

In a positive analysis (should rather than shouldn't), assuming no default choice, it seems rational to use Tokyo Cabinet or CDB _hashmaps_ for unordered data, and LevelDB for ordered data, from a datastructure and performance standpoint. To assert more would probably need a specific use case for context.

stephth · on July 27, 2011

In a positive analysis (should rather than shouldn't), assuming no default choice, it seems rational to use Tokyo Cabinet or CDB _hashmaps_ for unordered data, and LevelDB for ordered data, from a datastructure and performance standpoint.

It's as rational as doing optimizations. If the specific performance is extremely critical, yes, it definitely makes sense. But LevelDB does well with random reads [1]. With all its features and its permissive license, I think it's a strong contender as developer's go-to embedded key-value db, like SQLite for relational data.

Don't get me wrong, I do think your comparison is valuable, but I'm afraid it could be misleading; I specially found the wording LevelDB is for if you need ordered data misleading. Someone could read it and assume that LevelDB doesn't do well with unordered data.

[1] Compare today's benchmarks with these http://fallabs.com/tokyocabinet/benchmark.pdf, it looks like random reads in LevelDB are quite faster than BDB.

davidhollander · on July 28, 2011

> and its permissive license

Other variables such as licensing are context dependent. In my context, I generally use embedded databases in shared library form to store arbitrary serialized Lua (much faster to reload than JSON, no need to compile protocol buffers) on my own servers, so in my development context LGPL vs BSD is irrelevant. But perhaps not for an iPhone developer. TokyoCabinet already has excellent bindings for nearly every language I've used, but thats irrelevant to a developer whose application is also C.

> Compare today's benchmarks with these http://fallabs.com/tokyocabinet/benchmark.pdf, it looks like random reads in LevelDB are quite faster than BDB.

I'd bet BDB is a snail, BUT the benchmark you just linked tests 0 databases in common with the one released by Google. What can be observed is that using a hashmap (TC) was over 7X faster than randomly accessing the fastest ordered data structure by the same author (TC-BT-RND) :)

stephth · on July 27, 2011

you don't need to pay anyone to use it in your commercial software.

newhouseb · on July 27, 2011

I suppose if you are shipping proprietary binaries, then yes. But otherwise it's effectively GPL'ed.

stephth · on July 27, 2011

Exactly. BDB is GPL with an (non-publicly priced) commercial license to embed in apps, and LevelDB is BSD. With LevelDb, one of the most practical use case of a serverless db - embedding it in a client-side application - isn't crippled by it's own license.

jcapote · on July 27, 2011

it would be cool to make a leveldb backed fork of redis

stanleydrew · on July 27, 2011

Pardon my ignorance but what's backing redis currently?

stephth · on July 27, 2011

Redis. :) But more importantly, it runs as a server. I think what @jcapote meant is being able to use Redis operations without a server, like Leveldb or sqlite. I would love to see that. There's already some effort towards that direction [1], but using a google backed project instead of building a full library from the ground up could be a saner approach.

[1] https://github.com/seppo0010/redislite

swah · on July 28, 2011

I love the insight about how fast compression (Snappy) is like having faster hard drives.

timr · on July 27, 2011

http://www.hnsearch.com/search#request/all&q=leveldb

swah · on July 27, 2011

Your point?

timr · on July 27, 2011

That leveldb has been discussed several times on HN in the last two months. I just didn't break out the links from the search UI.

Downvoters: links to previous context are generally considered a good thing here.

swah · on July 27, 2011

Oh, then its easier to understand that in this query: http://www.hnsearch.com/search#request/all&q=leveldb&...

timr · on July 28, 2011

Yeah, the search results when I linked them had those results. The index was subsequently updated, so the top hits are now this thread, and this article. Oops.

mumrah · on July 27, 2011

Anyone know how LevelDB compares to Voldemort? From a cursory glance, they are identical in their simple API (get, put, delete)

strlen · on July 28, 2011

Voldemort developer here--

Voldemort to LevelDB is what MySQL is to InnoDB: Voldemort is a distributed system that allows multiple engines to be plugged in. Mostly commonly, companies use either BerkeleyDB or MySQL as a storage engine. LinkedIn, Mendeley, EBay and others also use the read only storage engine, where the data is pre-built in Hadoop and loaded into Voldemort.

I am really excited about LevelDB: while there are higher priority projects on my plate right now, we'd very much like to see a LevelDB storage engine. If anyone is interested in contributing one, they're welcome.

The steps are:

1) Creating JNI bindings to LevelDB (or creating a .so version of LevelDB and creating JNA bindings)

2) Implementing the StorageEngine interface with the bindings, including passing in any configuration.

Here is an example of a third party InnoDB/Haildb storage engine for Voldemort:

https://github.com/sunnygleason/v-storage-haildb

gaborcselle · on July 27, 2011

My understanding is that Voldemort is a distributed key-value storage system, while LevelDB is a local on-disk key-value storage system.

skanuj · on July 27, 2011

Note to myself : Search before you post. Apologies, I checked new and front page only!

dchest · on July 27, 2011

You were right to post it -- it's a new blog post announcing the non-beta release and benchmarks.

newman314 · on July 28, 2011

How well does LevelDB work for a mobile device? This might be a nice use case.

overred · on July 28, 2011

LSM-Tree is good!

swah · on July 27, 2011

So that's what Jeff does!

trungonnews · on July 27, 2011

how is this different from membase?

eis · on July 27, 2011

Uhm shouldn't that be ovious by reading the high level descriptions of each? They are for completely different use cases. Membase is a distributed Key/Value server and LevelDB is a Key/Value library.

thomas11 · on July 27, 2011

I see the difference between a server and a library, but both can often be used for the same use case. Just recently I evaluated a few data stores for a project and I didn't care all that much about the distinction. For the servers you're gonna use an API for your programming language anyway, so the programming model isn't that different.

stonemetal · on July 27, 2011

About like the difference between mysql and sqlite.