Relativistic hash tables (lwn.net)
187 points by yiransheng on Sept 24, 2014 | 46 comments



I wrote the paper and dissertation this work was based on, and I'm happy to answer any questions people might have.

(Really awesome to see a production implementation of this.)


The application of a causally consistent model for hash table operations is brilliant; treating CPUs and memory as a "traditional" distributed system is brilliant.

I know I said it twice already, but: brilliant.

Reading your dissertation now.


I basically think of a single computer as a distributed system now. In other words, it's not a Turing machine / state machine any more; it's a network of such things.

Once you have multiple CPUs, you can't avoid concurrency, and message passing and immutability become common themes. The strategy for many years was to naively add locks to traditional stateful algorithms and mutable data structures, but this often leads to bad (and unpredictable) performance.

There is an elaborate backward-compatible illusion presented by the architecture, C, and most languages built on top of C. Internal buses and caches are not exposed to you as a programmer, but they are there and affect performance. But if you think of your app from the OS perspective rather than the language perspective, the distributed model becomes clearer and more natural.


Thank you; you just made my day.


Great; thanks for volunteering to answer questions! I'm still reading your paper; please excuse me if this is covered.

There is a technique for eliminating the long pauses from rebuilding called "incremental global rebuilding". The big idea is that the new structure is built piece-by-piece, taking a few building steps for each update to the existing structure. The new construction is finished before the existing structure fills up (or empties out, for deletes). Depending on the variant, updates may be applied as the new structure is being built, batched to be applied later, or applied in a secondary "ghost" structure.
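
Roughly, in code (a minimal single-threaded sketch with made-up names, nothing like the kernel's actual code):

    #include <stdlib.h>

    struct node  { struct node *next; unsigned long key; };
    struct table { struct node **buckets; size_t nbuckets; };

    /* One table in use, one being built; lookups during a rebuild must check
     * both. dst.buckets == NULL means no rebuild is in progress. */
    struct incr_hash {
        struct table cur, dst;
        size_t migrate_pos;          /* next bucket of cur to migrate */
    };

    /* Called from every insert/delete: move a few buckets per update, so the
     * rebuild cost is spread out instead of paid in one long pause. */
    static void migrate_step(struct incr_hash *h, size_t steps)
    {
        while (h->dst.buckets && steps--) {
            struct node *n = h->cur.buckets[h->migrate_pos];

            while (n) {                                  /* move one whole bucket */
                struct node *next = n->next;
                size_t b = n->key % h->dst.nbuckets;

                n->next = h->dst.buckets[b];
                h->dst.buckets[b] = n;
                n = next;
            }
            h->cur.buckets[h->migrate_pos] = NULL;

            if (++h->migrate_pos == h->cur.nbuckets) {   /* rebuild finished: switch over */
                free(h->cur.buckets);
                h->cur = h->dst;
                h->dst.buckets = NULL;
                h->migrate_pos = 0;
                break;
            }
        }
    }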

It has been used in dozens of data structures, and was, I think, made famous by a series of papers by Overmars and co-authors in the early 80s. He wrote a book about related techniques called "The Design of Dynamic Data Structures".

Does this technique not apply to resizing hash tables in a concurrent kernel?


A large part of the overhead of concurrent data structures lies not in the work itself, but in the synchronization used to avoid corruption to the data structure. The technique you're describing could potentially allow a resize to occur incrementally, without blocking other operations for a long time, but that doesn't take away the need to synchronize between resize, write, and read operations.

RCU-based algorithms allow readers to proceed with absolutely no locking, compare-and-swap, atomic operations, or other expensive steps. In order to support that, any modification to the data structure (such as a write or resize) must make sure the structure remains in a completely valid state after every individual modification. The hash resize algorithm I wrote provides that feature, ensuring that the resize does not disrupt concurrent readers.
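
In code, the reader side looks roughly like this (a simplified sketch, not the actual rhashtable implementation; foo and foo_lookup are made-up names, and the caller is assumed to already be inside rcu_read_lock()):

    #include <linux/rculist.h>

    struct foo {
        unsigned long key;
        struct hlist_node node;        /* embedded in the hashed object itself */
    };

    /* No spinlocks, no compare-and-swap, no atomic ops: just a pointer walk
     * whose loads are ordered so a concurrent insert or resize never exposes
     * a broken chain to this reader. */
    static struct foo *foo_lookup(struct hlist_head *buckets, unsigned int mask,
                                  unsigned long key)
    {
        struct foo *f;

        hlist_for_each_entry_rcu(f, &buckets[key & mask], node)  /* trivial hash for illustration */
            if (f->key == key)
                return f;
        return NULL;
    }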


Are you aware of any way that the work of Overmars et al. fails to meet the criterion "any modification to the data structure (such as a write or resize) must make sure the structure remains in a completely valid state after every individual modification"?


Nice algorithm. The diagrams in the paper are wonderfully clear and useful.

My instinct on reading it is that allowing readers to continue during a resize is a solid improvement, but that any approach based on buckets using linked pointers (open chaining) should be considered "low performance" by definition.

Do you think this approach is enough to make them competitive? Is there a parallel approach that would work for faster hashes that don't chase pointers? Would there be a way to "compact" the hash on resize so that at least the pointers are tightly arranged?


> Nice algorithm. The diagrams in the paper are wonderfully clear and useful.

Thanks! LaTeX and TikZ are awesome; I would not have wanted to draw those diagrams in any kind of WYSIWYG diagramming tool.

> My instinct on reading it is that allowing readers to continue during a resize is a solid improvement, but that any approach based on buckets using linked pointers (open chaining) should be considered "low performance" by definition.

See my comment at https://news.ycombinator.com/item?id=8365436 ; for many of the hashes in the Linux kernel, the hash nodes are too large to store directly in the buckets, and other parts of the system need to maintain references to them, so a resize cannot copy them and free the originals. And if you're storing a pointer to a node rather than a node, then you'll have one indirection (with associated memory/cacheline fetch) in each bucket anyway, whether you use a closed (non-chaining) table or an open (chaining) table.


How does this compare against Hash Array Mapped Tries? It looks like the number of nodes you'd need to change would be smaller, and updates would be less dramatic. (They're common in functional programming for persistent data structures, and you get almost as good performance as a hash table with a really wide trie.)


The table expansion operation seems complex. Is it really faster than just rebuilding the hash table (by concurrently reading it), switching to the new hash table, then deleting the old table (use RCU to determine when it's safe to delete)?
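
Something like this, I mean (hand-waving sketch with made-up names; rebuild_copy() and free_table() are hypothetical):

    #include <linux/mutex.h>
    #include <linux/rculist.h>

    struct table {
        size_t nbuckets;
        struct hlist_head buckets[];
    };

    static struct table __rcu *active_table;
    static DEFINE_MUTEX(table_mutex);              /* serializes writers */

    static void resize(size_t new_nbuckets)
    {
        struct table *old_tbl, *new_tbl;

        mutex_lock(&table_mutex);
        old_tbl = rcu_dereference_protected(active_table,
                                            lockdep_is_held(&table_mutex));
        new_tbl = rebuild_copy(old_tbl, new_nbuckets); /* hypothetical: populates a fresh table */
        rcu_assign_pointer(active_table, new_tbl);     /* new readers now see the new table */
        mutex_unlock(&table_mutex);

        synchronize_rcu();                             /* wait for readers still on the old table */
        free_table(old_tbl);                           /* hypothetical: frees the old buckets/copies */
    }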

Perhaps copying is slow if the nodes are large. Perhaps one way around this is to have two bucket pointers within each node and swap between them whenever the table is rewritten.


In addition to the overhead of copying all the elements, you'd break any code that holds references to those elements. In quite a few hash tables in the kernel, other code holds long-running references to hash nodes and expects them to stay alive.

When I first started working on hash table algorithms, years ago, I started out by developing an algorithm to move a single node between buckets because the key changed. My first attempt involved making a copy of the node. That works fine in a standalone hash table, but it won't work in many real kernel data structures (for instance, dcache), because it breaks other pointers to the (reference-counted) node.

That's why the algorithm shown here does not copy any nodes.

As for having two bucket pointers within each node: we've looked at that approach, as have others, but it uses far more memory. If you have a hash table with millions of entries, you don't want to add 8 bytes to every one of them.


What's wrong with the following much simpler construction: each node contains an array of TWO links, and the hash table itself contains an "int link" field.

Readers traverse the chain using the pointer indexed by the link field.

Resizing rebuilds the chains by storing the new links into each node's (1-link) slot. When resizing is done, just set link = 1-link. The "only" thing you have to guard against is extremely slow readers, i.e., you must make sure the table does not get resized twice while a single reader is still traversing some chain.
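
In rough code (made-up names, ignoring the memory ordering that actually makes or breaks this; I've given the table two bucket arrays as well, so a resize can change the bucket count):

    #include <stddef.h>

    struct node {
        struct node *next[2];            /* one chain link per table generation */
        unsigned long key;
    };

    struct table {
        struct node **buckets[2];        /* one bucket array per generation */
        size_t nbuckets[2];
        int link;                        /* which generation readers follow */
    };

    static struct node *lookup(struct table *t, unsigned long key)
    {
        int g = t->link;                 /* read the generation index once */
        struct node *n = t->buckets[g][key % t->nbuckets[g]];

        while (n && n->key != key)
            n = n->next[g];
        return n;
    }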

According to the LWN article, concurrent updates are generally serialized against resizing anyway.

What am I missing...? [ya, filling in the details and formal proof of correctness]


Looking at that quickly, I don't see a problem - that should work. Although you're doubling the size of the hash table itself.

You can even do (atomic) refcounting to make sure that everyone is done using the old chain before resizing, although this may slow things down. (So the hash table contains a field of two integers, which indicate the number of readers currently traversing index 0 and 1. Once the value of the "old" index reaches 0, you know it is safe to resize again.)

And you should be able to extend this to k versions easily, which would mean that (k-2) slow readers won't inhibit resizing.

Although, all of that being said, I don't really see the advantage of this over a cuckoo hash table.


That works, and has been tested, but would drastically increase memory usage. And you still need a way to track those readers still looking at the previous version, which you'd probably still want to use RCU for.


I skimmed through the LWN article and it appears to oversimplify the situation by implying that bucket lists are singly linked. They aren't (or at least they weren't in the kernel's hlist a few years ago); there's also a back pointer, and I'm not really sure how that fits in, i.e. one can't simply re-link the tail of one bucket to the head of another when merging them... In other words it seems like LWN dumbed the whole thing down significantly, no?


Readers don't look at the prev pointers, only the next pointers, so the writers can safely change prev pointers concurrently without worrying about readers.


I found the incremental approach used in redis to be most interesting (http://antirez.com/news/63).

Can you comment on how one can safely enumerate "all" keys of the hashtable while minimizing contention?

It would be great to know whether, in principle, rhashtable can offer such a capability.


> Can you comment on how one can safely enumerate "all" keys of the hashtable while minimizing contention?

for each bucket:
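
Roughly (a sketch only, not the rhashtable API; foo is whatever structure embeds the hlist_node):

    #include <linux/rculist.h>

    struct foo { unsigned long key; struct hlist_node node; };

    static void walk_all(struct hlist_head *buckets, size_t nbuckets,
                         void (*visit)(struct foo *))
    {
        struct foo *f;
        size_t i;

        /* One RCU read-side section over the whole walk: it never blocks
         * writers or a resize, but concurrent inserts/removes may or may not
         * be seen. For a huge table you'd probably chunk this up rather than
         * hold one long read-side section. */
        rcu_read_lock();
        for (i = 0; i < nbuckets; i++)
            hlist_for_each_entry_rcu(f, &buckets[i], node)
                visit(f);
        rcu_read_unlock();
    }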


It looks from the article that this results in bounded-time concurrent reads and writes, even in the worst case, as long as resizes can complete often enough to not be overwhelmed by the writers, is that correct?

This looks huge for smoothing the latency spikes out of software networking on Linux.


> It looks from the article that this results in bounded-time concurrent reads and writes, even in the worst case, as long as resizes can complete often enough to not be overwhelmed by the writers, is that correct?

Mostly so, yes. Writers can still contend with each other, but that's less critical on a read-heavy data structure.


> ...I'm happy to answer any questions people might have.

Would the same principles that apply for chaining apply to linear probing and cuckoo hashing as well? Thanks!


In case people find it interesting, this is very similar to how one efficiently builds a search engine for high QPS and update rate (lockless realtime document index), though that example is slightly more involved than a hash map.

The basic premise is that the entry point to your data structure (or its internal pointers) can change over time. You don't dismantle older entry points/pointers until all of the readers that started before the change have finished, but you never have to wait for readers that started after it.


I'd be interested in learning more about this.


Oof, this may take a much longer blog post, but here is the very high level basic view.

The basic construction on one doc-sharded server looks like:
1) Maximum valid local_docid
2) A map of local_docid => state (valid, deleted)
3) A map of token_id (indexed term) => map of local_docids to positions in doc.

On document update, you increment the next local docid. You then rip through the doc and extract the tokens. For each token, you insert the docid,position into map (3). Then you add the document to map (2) with state "valid", and finally increment (1).

On query, you first copy (1), then do the typical AND/OR retrieval over (3). Any docids seen higher than (1) are ignored, and any docs retrieved are then filtered by (2).

In this model, (1) is a volatile memory access. (2) and (3) are very similar to this "relativistic hash map".

Deletions are complicated, and usually you filter out invalid docids from (3) as a background compaction process.
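
In rough C-struct terms (made-up names; a real index would use compressed postings, not linked lists):

    struct posting { unsigned int docid, position; struct posting *next; };

    enum doc_state { DOC_VALID = 1, DOC_DELETED = 2 };

    struct index_shard {
        unsigned int max_valid_docid;        /* (1): published high-water mark */
        enum doc_state *doc_state;           /* (2): local_docid -> valid/deleted */
        struct posting **postings_by_token;  /* (3): token_id -> posting chain */
    };

    /* Query side: snapshot (1) once, then filter retrieved docids by it and (2). */
    static int doc_visible(const struct index_shard *s, unsigned int snapshot,
                           unsigned int docid)
    {
        return docid <= snapshot && s->doc_state[docid] == DOC_VALID;
    }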


Maybe I'm daft, but the growing and shrinking explained here look like how you'd do it for any hash table; is that not how a normal one works? Do normal hash tables "freeze the world" to change tables or something?

Looks to me like an "RCU grace period" (not sure what this is, sleep maybe?) is introduced to allow concurrent threads time to "finish reads" in between pointer changes.


With a normal hash table, you'd acquire either a table-wide lock or every per-bucket lock before resizing the table. With this algorithm, readers can continue to run concurrently with the resize, and the concurrent resize will never cause a reader to incorrectly fail to find a hash element.


This issue arises even in completely single-threaded programs that use hash tables. Suppose you're iterating over a hash, invoking a callback function for each element; the callback has access to the hash and inserts new elements. An insertion during the callback could trigger a growth. If you allow the hash table reorganization to happen right away, it has to be reconciled with the in-progress traversal, so that the traversal doesn't miss elements or visit some elements more than once.
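
A toy single-threaded illustration of the hazard (made-up names):

    #include <stddef.h>

    struct node  { struct node *next; unsigned long key; };
    struct table { struct node **buckets; size_t nbuckets; };

    static void for_each(struct table *t,
                         void (*cb)(struct table *, unsigned long))
    {
        for (size_t i = 0; i < t->nbuckets; i++)        /* nbuckets can change... */
            for (struct node *n = t->buckets[i]; n; n = n->next)
                cb(t, n->key);  /* ...if cb() inserts and triggers a rehash, i and n
                                   now describe a layout that no longer exists */
    }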


I think that if you insert an element during iteration, it is unpredictable whether that element will be visited during the iteration or not, so it is probably a bad idea even if no resize happens.


It's a bad idea if you don't have a well-defined mode of operation; for example, you're fine if the hash table uses copy-on-write semantics.


It looks to me like the major thing here is the grace period, right? Otherwise you could get "pointer-changing" happy and mess up readers traversing bucket lists, right?


If I understand correctly, readers never need to wait for the grace period. Careful ordering of operations ensures they can charge at full steam and yet always have a consistent view of the hash table.

The grace period is only necessary to ensure that no readers could possibly still be using a block before it's moved or freed. For example, when enlarging a table, if an object gets moved from a big bucket into a smaller one, there's a chance that readers on that object would suddenly find themselves in the wrong bucket. Waiting for the grace period ensures that there are no readers anymore and it's safe to move the object.

Very similar to RCU. Paul McKenney has had some great RCU articles on LWN in the past.


How does the "grace period" work in practice? Is it really a set amount of wall time?


No, it's a construct from RCU. There are rcu_read_lock() and rcu_read_unlock(), which despite the names are lightweight, CPU-local operations that don't take any locks. synchronize_rcu() waits for all currently active readers, but does not wait for any new readers.
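
The canonical usage pattern looks roughly like this (sketch; find_object() and use_object() are made-up names, and the writer is assumed to already be serialized against other writers):

    #include <linux/rculist.h>
    #include <linux/slab.h>

    struct object { unsigned long key; struct hlist_node node; };

    /* Reader: a short critical section, no locks or atomic ops taken. */
    void reader(unsigned long key)
    {
        struct object *obj;

        rcu_read_lock();
        obj = find_object(key);        /* hypothetical lookup: plain pointer walk */
        if (obj)
            use_object(obj);           /* finish (or take a refcount) before unlocking */
        rcu_read_unlock();
    }

    /* Writer, removing an object: */
    void remove_object(struct object *obj)
    {
        hlist_del_rcu(&obj->node);     /* unlink: new readers can no longer find obj */
        synchronize_rcu();             /* wait for all pre-existing readers to finish */
        kfree(obj);                    /* now no reader can still be holding obj */
    }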


For an explanation of "RCU grace periods", read "The RCU Api, 2010/2014 Edition":

2010: http://lwn.net/Articles/418853/
2014: http://lwn.net/Articles/609904/


It depends on the hash table. Cuckoo hashing is a lot easier, since you can resize and reorder the buckets concurrently in a new array; exposing it to new readers is then a single pointer swap. When each bucket is a linked list you have a lot more work to do!


Cuckoo hashing, and other closed-hashing (non-chaining) approaches, don't allow other portions of the system to maintain references to hash nodes. Nodes within hash tables in the Linux kernel often consist of some other data structure with an embedded hlist_node, allowing the data structure to be dropped into a hash bucket.

Non-chaining hash tables make the most sense when you can fit the entire hash nodes directly into the array of buckets; that way, you never dereference a pointer at all. However, when storing larger structures, you typically need to put a pointer into the bucket rather than the structure itself. Once you've done that, you've already introduced an indirection and an extra memory read (or cacheline fetch). With open chaining, if you've sized the hash table to have on average 1 node per bucket, you have that same single indirection going from the bucket to the node, and then the node has all the data, so you don't need another.


I was presuming that you'd want more nodes per bucket, but I suppose once you have a hash that can expand and contract efficiently that becomes less of a problem. The issue, though, with having an average of 1 node per bucket is that about 1/3 of the buckets will have no nodes (wasted space), and about 1/3 will have 2 or more (longer lookups).

Yes, for an efficient closed hash of a larger structure, you'd need to store both a pointer to the structure and a means of disambiguation, so that you don't have to follow the pointer to (probabilistically) test whether you have a matching entry. For a cuckoo, this might be the hash of the other slot, so it's not entirely wasted space, but it would add 8B or so to each entry. On the other hand, we're presuming a large struct anyway, so this might not be too much extra burden.

The model that I'm mentally comparing to is a 2-way cuckoo hash with 8 slots per bucket, with a single SIMD compare to test for a match. It's an approach that makes a lot of sense to me. It's not quite what they describe, but if you haven't seen it, you might check out this paper: https://www.cs.princeton.edu/~mfreed/docs/cuckoo-eurosys14.p...

In any case, neat stuff to think about. Thanks for answering questions throughout the thread.


Why do you need to maintain a reference to a hash node?

Or do you mean maintaining a reference to the value stored?

It seems to me that, even with a layer of indirection with a cuckoo hash table (i.e., your hash table is an array of pointers to elements), you still come out ahead with cuckoo hashing as opposed to this method. If nothing else, your lookups are worst-case constant time.

In other words: what am I missing here?


> Why do you need to maintain a reference to a hash node?

> Or do you mean maintaining a reference to the value stored?

The hash node is the value stored; many hash tables are maps from keys to objects, and the objects get linked directly into the hash chain without further indirection.

The Linux kernel hash tables have a structure hlist_node which contains a next and previous pointer, and objects linkable into a hash table will contain an hlist_node as a field. This avoids having an extra level of indirection to get from a hash node to the hashed object.
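
Concretely (my_object is made up; the hlist_node layout is roughly the one in <linux/types.h>):

    struct hlist_node {
        struct hlist_node *next;       /* next element in the chain */
        struct hlist_node **pprev;     /* points at the previous element's next field */
    };

    struct my_object {
        unsigned long key;
        struct hlist_node hash_node;   /* lives inside the object; no separate allocation */
        /* ... the object's other fields ... */
    };

    /* pos is the struct hlist_node * you have while walking the chain;
     * hlist_entry() (a container_of wrapper) gets you back to the object: */
    struct my_object *obj = hlist_entry(pos, struct my_object, hash_node);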


Forgive me if this is stupid, but could this work for implementing a concurrent VM?


Yes, that's exactly the kind of program you might want to use this for.


Could someone explain all of the ground symbols in the diagrams?


Those indicate the NULL at the end of a hash chain.


Null next pointers?


Or 0x00000001 next pointers, if it's using a hlist_nulls. See http://lwn.net/Articles/609904/





