Diskhash – Disk-based, persistent hash tables

bane · on July 9, 2017

One of the nice things about using mmap'd data structures is that you don't have to slurp the entire thing into memory to work on it. The OS's virtual memory system will only load a few pages at a time for whatever's needed. This means that in practical applications you're likely to be within an order of magnitude of RAM speed if you're using an SSD.

Secondly, the other nice bit is that you can use structures larger than the physical RAM on the system. It's kind of like getting a system with enormous amounts of slow-ish RAM.

Thirdly, you can build a whole bunch of these for different purposes and just open the ones you need, and not bother with the rest, again like having even more RAM.

Finally, you can multi-process trivially on these as the OS sorts out the mess and suddenly it's like having a slightly slow multi-million dollar multi-tb shared memory supercomputer from 5-8 years ago at your disposal. Works great on Beowulf clusters if your disk storage is shared on fast links and can handle the IOPS.

The number of applications you can do with a 5-8 year old super computer are vast. I've seen 20tB classifiers built on more or less commodity hardware (<$100k for a cluster of cheap hardware and one big disk) that were previously thought to require a $5 million dollar supercomputer just because of the large shared memory. You can probably build an equivalent machine for <$10k these days off of Newegg and some decent NAS boxes.

wfunction · on July 9, 2017

> One of the nice things about using mmap'd data structures is that you don't have to slurp the entire thing into memory to work on it.

Right, but one of the not-so-nice-things is that you can't do the I/O asynchronously, so you can end up with poor performance depending on the access pattern. (I guess you can, with another thread running in the background touching pages, but it's more of a pain and you're not really guaranteed the data will stay in memory until you use them.) [Edit: Actually I guess if you touch by writing to those pages rather than just reading from them then they'll have to stay in memory... though do note that I'm assuming no swap here.]

luispedrocoelho · on July 9, 2017

At some cost to portability, IIRC, on Linux you can ask the operating system to keep pages in memory. Other systems will probably have similar functionality.

otterley · on July 9, 2017

They can remain paged in, but unlike on FreeBSD, you can't control when dirty pages are flushed to the backing filesystem on Linux. Specifically, the MAP_NOSYNC option doesn't exist on Linux (see https://www.freebsd.org/cgi/man.cgi?sektion=2&query=mmap for a description).

wfunction · on July 9, 2017

Windows has NtLockVirtualMemory, but (a) it requires special permissions (meaning random apps can't do it without admin privileges), and (b) something about it feels like it's the wrong way to do it, but I can't pin down what it is exactly.

otterley · on July 9, 2017

How does it compare to mdbm and lmdb?

mdbm: https://github.com/yahoo/mdbm

lmdb: https://symas.com/lightning-memory-mapped-database/

fanf2 · on July 9, 2017

It sounds like djb's cdb to me http://cr.yp.to/cdb.html

otterley · on July 9, 2017

cdb isn't memory mapped, and you can't update individual keys.

gravypod · on July 8, 2017

The project's readme on Pypi reads "The wrappers follow similar APIs with variations to accomodate the language specificity.".

Beautiful. There are so many times in my career that I've had to use poorly hobbled together wrappers for other languages that I now really appreciate it when someone takes their time to make their API "Fit" with the language.

SwellJoe · on July 8, 2017

While this is pretty neat, it reminds me a lot of Perl tie, which is kinda considered an anti-pattern in modern Perl (though there are places where it still makes sense and is still used). Any data structure in Perl can be connected to a wide variety of backing stores using tie: http://perldoc.perl.org/perltie.html

It was introduced in Perl 5.0, so it's been around for about a quarter century, and it's generic in the sense that there are tons of backends that use the tie interface. BerkeleyDB and gdbm and flat files and probably any DBD SQL database are among the backing store options. But, again, a lot of people think it should be avoided in the general case, even though it seems like a really nice simple solution to a common problem (I've got these hashes/arrays/whatever, and I want to save them on disk so they're persistent and shareable).

This implementation can't delete keys (yet?) which seems to limit its utility for general purpose tasks. I also wonder about reliability. When I first started reading I assumed it'd be used for small things, but he's working with a 32GB file shared across multiple processes. That's a lot of opportunity for data corruption.

mycelium · on July 9, 2017

You insinuate without providing the reasons behind it. I say this because diskhash and tie look like extremely useful things used carefully. This sort of thing solves exactly a problem I have right now.

> Perl tie, which is kinda considered an anti-pattern in modern Perl ... a lot of people think it should be avoided in the general case, even though it seems like a really nice simple solution to a common problem

Why? Any links to blog posts?

> When I first started reading I assumed it'd be used for small things, but he's working with a 32GB file shared across multiple processes. That's a lot of opportunity for data corruption.

I've used something similar and data corruption because of size is low on my list of concerns. If your storage layer fails in this manner silently, you have much bigger problems.

SwellJoe · on July 9, 2017

It's not an opinion I hold strongly, it's just something I've noticed. I've seen more than one talk where someone apologized for implementing something with 'tie'.

Here's one blog post about it (still opinion):

http://www.skrenta.com/2007/05/tie_considered_harmful.html

And, this article from the Modern Perl site itself has mixed opinions on the subject, but comes down on the side of not using it, in the general case, but knowing how and when to use it can be really valuable (Tie is the last subject covered on the page):

http://modernperlbooks.com/books/modern_perl_2014/11-what-to...

I'm not trying to say there's no place for it. Tie in Perl has been used for really cool stuff.

mycelium · on July 9, 2017

Appreciate the response and reasoning!

raarts · on July 8, 2017

Interesting. I used a disk-based hash table very similar to this one in DOS Point-Of -Sale applications in 1989. Goal was to have consistent lookup time when scanning barcode labels. Of course my average lookup times were much slower due to low seek times.

rkv · on July 8, 2017

Nice and simple. I tend to use a wrapper around BSD tree.h and SQLite for persistence. Longer insert/update times but quick access and you get the SQL language and tools for free.

fnl · on July 8, 2017

Quick access? Tree lookup is O(log n) on a rb-tree. Hashtables are O(1). And this Diskhash thing claims to scale to billions of entries. Not sure your options are comparable, therefore.

_3u10 · on July 9, 2017

Claims is the operative word. Features generally aren't free. If mmap'ing files was some sort of performance godsend don't you think most DB manufacturers would just rip out all their custom caching code and replace it with mmap?

fnl · on July 9, 2017

Absolutely correct! I guess a comparison with various levels of "rows" and concurrent access might be needed before judging any way...

adrianN · on July 9, 2017

log(billions) is not very large, constant factors matter in practice. You can also do things with trees that are very difficult with hashes (e.g. ordered traversal). This is why databases still ship with tree indices.

rurban · on July 8, 2017

I usually use gperf created tables or cdb tables for that purpose.

fnl · on July 8, 2017

Neither of those seem to easily scale to billions of keys, though.

rurban · on July 9, 2017

For gperf I had to add >int support by myself. https://github.com/rurban/gperf/commits/long_keys

cdb is limited to 2GB, yes.

fnl · on July 9, 2017

Thanks for the work on and link to gperf! Looks interesting, to say the least.

aboodman · on July 8, 2017

This is a similar idea to what nbs uses internally:

https://github.com/attic-labs/noms/tree/master/go/nbs

... but the idea is developed to support fast mutations as well as other related datastructures (sets, lists, etc).

dward · on July 8, 2017

What happens if the process can't shut down cleanly and writes are in flight? Is there a way to check for hash table corruption? Is there a plan to support deletion/compaction/defrag? Why would I use this over something like LevelDB?

jitl · on July 8, 2017

I think the use-cases are different here. This library is set up for write-once, then read-only access later. LevelDB is all about mutability. I think the use-cases here is for constant-time lookups on fixed-size data sets. LevelDB is a tree-based map, where this is a hashmap, which may give better performance for the author's purposes.

From the second paragraph:

My usage is mostly to build the hashtable once and then reuse it many times. Several design choices reflect this bias and so does performance. Building the hash table can take a while. A big (roughly 1 billion entries) table took almost 1 hour to build. This compares to about 10 minutes for building a Python hashtable of the same size.

GhotiFish · on July 8, 2017

This is such a clean concept.

If you have a tiny application, doing a tiny thing, do you need redis? Do you need sqlite? do you need to write some custom-text-file-ma-bob? nah. I feel like simple simple little tools like this are in short supply.

luispedrocoelho · on July 9, 2017

OP here.

Use redis if you can. Use something like diskhash if you cannot afford the extra overhead of sqlite/redis.

In my case, the alternative we were using before is an in-memory hash table built at startup from a text-based representation of the data. It was pretty good and worked very well up to millions of entries, but at the current scale we work with, it takes least 10 minutes at startup were necessary to build up the thing and it used ~200GB.

_3u10 · on July 9, 2017

They aren't in short supply, just about every shop I've worked at some tool decided to invent something like this instead of just using redis / sqlite.

fl0wenol · on July 9, 2017

Yeah but insofar as the article's author's revelation of how fast you can compile C code, maybe the correct answer is to use gperf?

If it's more dynamic than that I mean I guess it's cool to have a hash in a backing store that's dead simple.

But at sizes big enough and with enough dynamism that gperf or something similar isn't appropriate I think you should be considering dbm or SQLite, sort of by definition.

SwellJoe · on July 8, 2017

While I kinda agree, I also have found, most of the time, that by the time I finish building something that I thought didn't need a bigger thing...it actually needs a bigger thing, and I have to pull in SQLite or something else to make it all work reliably and fast and for all the data-related features that ended up being needed by the end.

_y4o5 · on July 9, 2017

Without going into to the code itself I'm always wary of these hobby projects that then get used at work when they are by the same author. Who now really owns the IP?

luispedrocoelho · on July 9, 2017

Look at the source code license, it's MIT licensed.

We are a publicly funded research organization, so even our "work code" is open source.

nkristoffersen · on July 10, 2017

Ok, I know this doesn't help the conversation, but I kept reading "Dickhash - Dick-based, persisted hash tables".

I was so confused.

ape4 · on July 9, 2017

sqlite gives you a bit more.

luispedrocoelho · on July 9, 2017

Yes, it does give you a lot more and is great.

Unfortunately, on my initial tests, it just was not fast enough.