Exotic data structures (concatenative.org)
248 points by wslh on Oct 14, 2012 | 25 comments



This is a pretty good list of data structures that are out there, are genuinely useful, and to which most programmers never seem to have been introduced. I regularly use variants of some of the data structures listed (plus several that did not make the list) and frequently get questions about their design and structure from programmers who have never seen them.

In practice, you can get away with knowing just a handful of data structures, and most programmers do. However, when it comes time to profile, it is valuable to have alternatives to what the standard libraries and implementations offer. For any given set of constraints on the intended use case and workload for a classic container, there is usually a data structure that is nearly ideal for that use case. Also, there are often neat ideas buried in some of these data structures that are worth learning for their own sake.

For the work that I do (high-performance C++), the first pass will often use the C++ STL. When we start profiling, the STL is usually among the first things that need to be replaced. Having a large number of workload-optimized data structures in the toolkit for the same abstract container makes this relatively easy.


That's interesting, I always thought of the STL as a well-designed library. Or is it just not specialised enough to fit your case?


STL is good for what it is. The limitation is that there is little ability to specify behaviors, use cases, data distributions, workload profiles, etc., because that would effectively break the abstraction and expose the internals.

However, the STL still requires an implementation, and so the implementors select algorithms that give adequate performance across a wide range of scenarios but are usually significantly suboptimal for any specific scenario. Consequently, a custom implementation tuned for a specific scenario will often handily outperform the STL implementation. For the kinds of applications I work on, poor-fitting STL implementations are frequently major bottlenecks in the profiler.

It is plausible to add an enormous number of implementations and data structures to STL that would allow someone to select/construct the perfect data structure for a particular use case. However, that would also add a level of complexity to the STL that probably is not worth the benefit for many users.


> The limitation is that there is little ability to specify behaviors, use cases, data distributions, workload profiles, etc., because that would effectively break the abstraction and expose the internals.

STL's limitation is its design. That's the real reason why STL acceptance is low (for a Standard library), not the often-cited compatibility issues.


> STL's limitation is its design.

Care to elaborate more on this?


Can you give a concrete example of a refactoring you had to do?


I cannot speak to jandrewrogers' specific reasons, but the two major sticking points I've seen developers hit are the STL's bad allocator design and the lack of intrusive data structures. Check out EASTL for more information [1][2].

[1] http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n227... [2] https://github.com/paulhodge/EASTL
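To illustrate the second point, here is a minimal sketch of the intrusive-container idea (hypothetical types, not EASTL's actual API): the link field lives inside the element, so inserting it allocates nothing and never copies the payload, unlike a std::list of the same type.

    // Hypothetical element type that carries its own list linkage.
    struct Connection {
        int fd = -1;
        Connection* next = nullptr;   // intrusive link
    };

    struct ConnectionList {
        Connection* head = nullptr;

        // O(1), no allocation, no copy of the payload.
        void push_front(Connection& c) { c.next = head; head = &c; }

        // Unlink by pointer fixup; the element itself is untouched.
        void remove(Connection& c) {
            for (Connection** p = &head; *p != nullptr; p = &(*p)->next) {
                if (*p == &c) { *p = c.next; c.next = nullptr; return; }
            }
        }
    };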


STL uses 'value semantics'. Just use push_back in a loop and observe how often the entire contents of a std::vector get copied. Not to mention std::sort.
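A quick way to observe it (a small sketch with a hypothetical Tracked type; exact counts depend on the library's growth factor):

    #include <cstdio>
    #include <vector>

    // Counts its own copy constructions. Declaring a copy constructor
    // suppresses the implicit move constructor, so reallocation falls
    // back to copying, as it did before C++11.
    struct Tracked {
        static int copies;
        Tracked() = default;
        Tracked(const Tracked&) { ++copies; }
    };
    int Tracked::copies = 0;

    int main() {
        std::vector<Tracked> v;
        for (int i = 0; i < 1000; ++i)
            v.push_back(Tracked{});   // each reallocation re-copies every element already stored
        std::printf("without reserve: %d copies\n", Tracked::copies);

        Tracked::copies = 0;
        std::vector<Tracked> w;
        w.reserve(1000);              // one allocation up front
        for (int i = 0; i < 1000; ++i)
            w.push_back(Tracked{});
        std::printf("with reserve:    %d copies\n", Tracked::copies);   // exactly one per push_back
    }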


vector::emplace_back ameliorates this by allowing new elements to be constructed in place [1].

[1] http://en.cppreference.com/w/cpp/container/vector/emplace_ba...
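For example (a trivial sketch):

    #include <string>
    #include <utility>
    #include <vector>

    int main() {
        std::vector<std::pair<int, std::string>> v;
        v.push_back(std::make_pair(1, std::string("one")));  // builds a temporary, then moves/copies it in
        v.emplace_back(2, "two");                             // constructs the pair directly in the vector's storage
    }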


Value semantics are the only sensible default in a language like C++. If you need reference semantics, store (smart) pointers explicitly.
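For instance (hypothetical BigRecord type), the owning-pointer version keeps reallocation cheap without giving up clear ownership:

    #include <memory>
    #include <string>
    #include <vector>

    struct BigRecord { std::string blob; };

    int main() {
        // Value semantics: the vector owns full copies of each record.
        std::vector<BigRecord> by_value;
        by_value.push_back(BigRecord{std::string(1 << 20, 'x')});

        // Explicit reference semantics: the vector owns pointers, so
        // reallocation shuffles pointers rather than megabyte payloads.
        std::vector<std::unique_ptr<BigRecord>> by_pointer;
        by_pointer.push_back(std::make_unique<BigRecord>());
        by_pointer.back()->blob.assign(1 << 20, 'x');
    }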


Nice article, although a bit terse. One part that jumped out at me was:

> It is possible to implement fully persistent arrays in O(1) space per update, with O(lglgN) time for access and updates [4].

This is surprising because it would guarantee at most a log(log(n)) slowdown for any immutable data structure (instead of the log(n) bound I was aware of). But then I checked out the paper and it turns out that the complexity is conditional:

> Fully persistent array use can also be provided at O(1) amortized cost, provided that the algorithm satisfies a simple requirement as to uniformity of access. For those algorithms which do not access the array uniformly or single-threadedly, array reads or updates take at most O(log n) amortized time, where n is the size of the array

By uniformity they mean that no index is used more than K times the average use-per-index (for some constant K). A simple example of a data structure breaking that condition is a heap: the top element is accessed on every query, but the bottom elements, which dominate the average, are each accessed only once per n queries (not a constant fraction).


Yeah, something like a finger tree is amortized constant time for sequential access, but log(n) for random access away from the ends.

What is nearly ALWAYS overlooked are constant factors. A finger tree is in a nice complexity class, but it still carries a lot of per-object overhead compared to an array, which can turn a 10MB chunk of data into a 100MB data structure.


I would especially appreciate seeing how one might go about implementing these in some of the concatenative languages listed there, as these kinds of data structures tend to be pointer- and memory-based, which I've found difficult to replicate in Forth et al.


The first data structure, the compact dynamic array, is explained beautifully for an undergraduate audience in Pat Morin's open-source textbook Open Data Structures. He calls them rootish array stacks.

http://opendatastructures.org/ods-java/2_6_RootishArrayStack...
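A rough sketch of the indexing trick, following Morin's presentation (shrinking and error handling omitted): block b holds b+1 elements, so element i lives in block i2b(i) = ceil((-3 + sqrt(9 + 8i)) / 2), which gives O(1) gets and sets while wasting only O(sqrt(n)) space.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    struct RootishArrayStack {
        std::vector<std::vector<int>> blocks;   // block b has size b+1
        std::size_t n = 0;

        // Which block does index i live in?
        static std::size_t i2b(std::size_t i) {
            return static_cast<std::size_t>(
                std::ceil((-3.0 + std::sqrt(9.0 + 8.0 * static_cast<double>(i))) / 2.0));
        }

        int& at(std::size_t i) {                // O(1) random access
            std::size_t b = i2b(i);
            return blocks[b][i - b * (b + 1) / 2];
        }

        void push_back(int x) {
            std::size_t b = i2b(n);
            if (b == blocks.size())
                blocks.emplace_back(b + 1);     // grow by one block of size b+1
            at(n) = x;
            ++n;
        }
    };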


I tried to find projects or libraries that have implemented the split hashtable, but searching for "split hashtables" gave few relevant results.

Does this go by other names?


For sure read this as "Erotic data structures."


split-hashtables look pretty cool.


Some of the ones that didn't make it:

Smushed list. Size O(1). The smushed list is a list of variables (of the same type), stored in a single variable of that type. To produce the smushed list, simply XOR all the elements of the list together, then store. To get a value back, simply XOR the smushed list with all the elements other than the one you want. Smushing is also embarrassingly parallel (you can smush two halves separately and then smush the results), so producing smushed lists is blazingly fast.

Unlinked list. O(n). This is slightly faster than a linked list, and acts as a "black box". Simply allocate nodes that are not linked to each other in any way. The data normally stays out of the way of your program, but in case of a core dump you can find it again. NOTE: If your language does reference-counting this will not work. Get a real language that does what you say.

Search-bush. Search trees are good at bisecting data, but they are not really conducive to a random walk for inspiration. Begin by constructing a binary search tree, keeping track of all the nodes you've added, and simply add a third, random, pointer to each node - have it point at a random node somewhere in the tree. In the search algorithm, either follow the left, right, or random node, depending on how much meandering you are interested in doing. The journey is the destination.


I'm curious, under what conditions is a smushed list useful?


Actually, this one is real. AKA "parity". Another example (with n=2) is to create a doubly-linked list while only storing one pointer (ptr = prev_node xor next_node).
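A sketch of that xor trick (illustrative only; hiding pointers this way confuses garbage collectors and most debuggers):

    #include <cstdint>

    // Each node stores prev ^ next in a single field; a traversal only
    // needs to remember the node it just came from to recover the next one.
    struct Node {
        int value = 0;
        std::uintptr_t link = 0;   // (uintptr_t)prev ^ (uintptr_t)next
    };

    inline std::uintptr_t addr(Node* p) {
        return reinterpret_cast<std::uintptr_t>(p);
    }

    inline Node* xor_next(Node* prev, Node* cur) {
        return reinterpret_cast<Node*>(cur->link ^ addr(prev));
    }

    // Walk forward from the head: start with prev == nullptr.
    inline int sum(Node* head) {
        int total = 0;
        for (Node *prev = nullptr, *cur = head; cur != nullptr; ) {
            Node* next = xor_next(prev, cur);
            total += cur->value;
            prev = cur;
            cur = next;
        }
        return total;
    }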


Oh, I hadn't made the connection between those and the general statement. Thanks for the clarification.


Funny, but why did you post this twice from two accounts?


Search-bush sounds pretty neat, thanks!


compact-array seems like a specialization of a rope, with the tree structure arbitrarily capped at depth=2, and without the benefits of O(log(n)) insertion.


Capping the depth gives you O(1) gets and sets though. Note that compact-arrays grow breadth-wise whereas ropes grow depth-wise. Breadth-wise growth trades insertion speed for random access speed.



