To anyone wondering why stuff like this matters, it's because the benefits of functional programming reach new heights when coupled with efficient immutable structures. In C++, for example, you can do functional programming in the most basic sense of the word, and it's actually pretty fun. But it can be very expensive because you are not working on efficient data structures that support the mangling and idioms that make FP really shine. There have been interesting efforts to bring structures like this to C++ but nothing mature or known widespread. When people talk about FP, it's about a lot more than what you can do with functions and expressions; it's about making sure that these functional manipulations remain very fast without lots of copying, and that's what is so fascinating about data structure research: how it supports new ways to write programs.
This is really cool work by Michael, a collection of configurable data structures.
Underlying most of them is CHAMP - a compressed hash array map trie. Essentially it's a trie over the hash of the objects inserted in the map. It's compressed using a clever technique that involves bitmaps.
A made a toy implementation of it to get a sense of how it works. There are some accompanying notes that you might find useful: https://github.com/norswap/triemap
Could you expand a bit on the differences between CHAMPs and HAMTs? Most of the info I see about CHAMPs makes them seem very similar, but with a slightly different node structure.
From the bottom of the Capsule github page [1]: " HAMTs already feature efficient lookup, insert, and delete operations, however due to their tree-based nature their memory footprints and the runtime performance of iteration and equality checking lag behind array-based counterparts.
We introduce CHAMP (Compressed Hash-Array Mapped Prefix-tree), an evolutionary improvement over HAMTs. The new design increases the overall performance of immutable sets and maps. Furthermore, its resulting general purpose design increases cache locality and features a canonical representation."
That doesn't answer any of my questions, and they don't have any citation in that block of text. Real details could be in any of the 8 papers linked, or none at all.
That doesn't feel like a problem? As a PhD thesis, the cited works should cover the subjects, but there is no requirement for the citations to be direct, or even useful to the general public who want to "learn more" =)
If you just want to know the difference between CHAMP and HAMT, then Google gets you that information pretty swiftly. Maybe you were looking for https://blog.acolyer.org/2015/11/27/hamt which discusses HAMT and then discusses the improvements CHAMP brings to the table.
I was trying to tactfully suggest that google would have trivially found that answer instead of waiting for someone on HN to deliver the answer on a silver plate =)
I already found that "answer" and it wasn't very good or descriptive, so I was hoping someone on HN could give me an actual high-quality answer. People here often pull through with good, detailed answers.
They increase cache locality and runtime performance of iteration and equality checking. If that answer isn't sufficient, try watching the ~30min Clojure West video on the github page. The speaker seems to be a bit new to public speaking but his talk on the subject was easy to follow and informative for me.
The talk at Clojure West wasn't given by myself, but I found it worthwhile linking. Rather it was given by someone who independently picked up my research results and replicated them in the context of ClojureScript (https://github.com/bendyworks/lean-map). The authors independently confirmed the performance improvements of CHAMP over HAMT that I was observing.
Why should that be the case? Can you point to sources where you read that? In my experience, CHAMP clearly has advantages over HAMT in this scenario.
To answer your questions about using vectors of coordinates of keys: it depends on the design implementation of the vector's hash code, regardless if you use HAMT or CHAMP.
Using collections as keys in other collections is in general a performance sensitive subject. The available HAMT implementations in Clojure and Scala fail to deliver here.
The case study in Chapter 3.7 nests hash-sets into hash-sets (i.e., Set.Immutable<Set.Immutable<K>> sets). The CHAMP implementation yields minimal speedups of ~10x over Clojure and Scala due to the way it calculates and incrementally updates the collection's hash code.
I was there and you are right, he was very nervous. However his talk was excellent and approachable for everyone reading the above comment. Explains the differences quite well. Certainly worth a talk. I've asked on the clojure slack group if the proposal had gone anywhere and it seems to have stalled unfortunately.
It's all in the README above. But basically each node can have up to 32 (2^5) children. Instead of allocating an array of 32 elements in each node, you have a 32-bits bitmap where the nth bit is set only if the corresponding children is present.
All the children are "compressed" into a single array. To be able to index a child quickly, it suffices to count the number of bit set below the bit corresponding to the child (something that can be done with some masking and a intrinsic bit count operation that maps to an ASM instruction).
This is a bit simplified. In reality each node distinguishes between its children that are actual key-value map entries and those that are nodes.
Additionally, entries are not stored in an extra "Entry" object, but instead occupy two slot in the storage array.
Nevertheless that is your principle: each node is sparse and only allocates storage for the children that are actually present (at the cost of maintaining two bitmaps: one for children entries and one for children nodes).
I'm pretty sure that's exactly like a HAMT except that HAMTs only use one bitmap. I'm not sure why the CHAMP is referred to as "compressed", then, since HAMTs are compressed using the same bitmap technique with a slightly different node structure.
In principle both HAMT and CHAMP compress sparse associative arrays by construction. However, the bitmap encoding of CHAMP results in a better compression for map data structures, enabled by permuting the elements that are stored in the node's array.
This is in particular the case when compared to Clojure's PersistentHashMap. Clojure reserves two array slots per entry to either store a) two pointers inline (i.e., key and value tuple), or b) a single reference to a sub-trie in the second slot. At runtime Clojure checks if the first slot is equal to 'null' for distinguishing cases a)/b) and doing the correct type casts at runtime. The thesis visualizes this in chapter 3, page 42.
Chris Okasaki's work (cited several times in this thesis) is worth reading if you want to learn more about these kinds of data structures. His thesis is here: https://www.cs.cmu.edu/~rwh/theses/okasaki.pdf and parts of it should be easy to grok by every software engineer. Chris also wrote a book on the same topic.
Yup. Purely Functional Data Structures (thesis as linked, book: https://www.amazon.com/Purely-Functional-Structures-Chris-Ok...) is awesome. I especially found some of the tree-based recursive algorithms eye-opening (once you spend the time to wrap your head around them).
IMHO required reading if you're doing any heavy FP work.
Okasaki's work depends crucially on laziness for turning amortized time bounds into per operation time bounds. Is that the case with the linked work too?
Did you mean "3.6 Benchmarks: CHAMP versus Clojure’s and Scala’s HAMTs"?
"Speedups Compared to Clojure’s Maps: In every runtime measurement CHAMP is
better than Clojure. CHAMP improves by a median 72 % for Lookup, 24 % for Insert,
and 32 % for Delete. At iteration and equality checking, CHAMP significantly outperforms
Clojure. Iteration (Key) improves by a median 83 %, and Iteration (Entry) by
73 %. Further, CHAMP improves on Equality (Distinct) by a median 96 %, and scores
several magnitudes better at Equality (Derived).
Speedups Compared to Clojure’s Sets: The speedups of CHAMP for sets are similar
to maps across the board, with exception of insertion and deletion where it scores
even better."
Sure, you can point to algorithms that nobody experienced ever uses. But if you e.g. compare qsort usage with heapsort usage, people use qsort because of it's better constant factors, even though the worst case complexity is worse.
The lookup performance gain is largely due to the CHAMP implementation using the default Java equality semantics, while Clojure's maps use Clojure's more expensive equality semantics. See https://github.com/lacuna/bifurcan/blob/master/doc/benchmark... for a more in depth illustration of this.
However, it is still a meaningful incremental improvement over Clojure's implementation for iteration and equality checks, among others.
Adoption (especially in industry) takes time, but there's definitely interest in the Clojure and Scala communities.
The CHAMP data structure was already picked up by the ClojureScript community (https://github.com/bendyworks/lean-map), but still requires some work to be upstreamed to Clojure I guess.
I personally (I'm the author of the thesis linked above) plan to work together with the Scala folks at Lightbend for the collections overhaul that's planned for Scala 2.13 (see https://github.com/scala/collection-strawman).
Any idea about CHAMP and Haskell? IIRC the reception of HAMT was somewhat lukewarm back in the day, and AFAIK to this day the Haskell standard library doesn't use them.
Very interesting! We've been using HAMTs in IPFS [1] to make huge directories more efficient (think NPM or Wikipedia), and the memory profile has been a pain, so this looks like a welcome improvement.