This is a pretty good list of data structures that are out there, that are genuinely useful, and that most programmers never seem to have been introduced to. I regularly use variants of some of the data structures listed (plus several that did not make the list) and frequently get questions about the design and structure from programmers who have never seen them.
In practice, you can get away with knowing just a handful of data structures and most programmers do. However, when it comes time to profile it is valuable to have alternatives to what the standard libraries and implementations offer. For any given set of constraints on the intended use case and workload for a classic container, there is usually a data structure that is nearly ideal for that use case. Also, there are often neat ideas buried in some of these data structures that are worth learning for their own sake.
For the work that I do (high-performance C++) the first pass will often use the C++ STL. When we start profiling, the STL is usually among the first things that need to be replaced. Having a large number of workload-optimized data structures in the toolkit for the same abstract container makes this relatively easy.
STL is good for what it is. The limitation is that there is little ability to specify behaviors, use cases, data distributions, workload profiles, etc., because that would effectively break the abstraction and expose the internals.
However, STL still requires an implementation and so the implementors select algorithms that will give adequate performance across a wide range of scenarios but which are usually significantly suboptimal for any specific scenario. Consequently, a custom implementation tuned for a specific scenario will often handily out-perform the STL implementation. For the kinds of applications I work on, poor-fitting STL implementations are frequently major bottlenecks in the profiler.
It is plausible to add an enormous number of implementations and data structures to STL that would allow someone to select/construct the perfect data structure for a particular use case. However, that would also add a level of complexity to the STL that probably is not worth the benefit for many users.
> The limitation is that there is little ability to specify behaviors, use cases, data distributions, workload profiles, etc because that would effectively break the abstraction and expose the internals.
STL's limitation is its design. That's the real reason why STL acceptance is low (for a Standard library), not the often cited compatibility issues.
I cannot speak to jandrewrogers' specific reasons, but the two major sticking points I've seen developers hit are the STL's bad allocator design and the lack of intrusive data structures. Check out EASTL for more information [1][2].
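For readers who haven't run into the intrusive style, here is a minimal sketch of an intrusive doubly-linked list in C++. This is illustrative only, not EASTL's actual API: the point is just that the links live inside the user's object, so insertion and removal never allocate and an object can be unlinked in O(1) without searching any container.

```cpp
#include <cassert>

// Minimal intrusive doubly-linked list sketch (illustrative only, not EASTL's API).
// The link fields live inside the user's object, so list operations never allocate.
struct IntrusiveNode {
    IntrusiveNode* prev = nullptr;
    IntrusiveNode* next = nullptr;
};

struct IntrusiveList {
    IntrusiveNode head;                        // circular sentinel
    IntrusiveList() { head.prev = head.next = &head; }

    void push_back(IntrusiveNode& n) {         // O(1), no allocation
        n.prev = head.prev;
        n.next = &head;
        head.prev->next = &n;
        head.prev = &n;
    }
    static void erase(IntrusiveNode& n) {      // O(1), no container lookup needed
        n.prev->next = n.next;
        n.next->prev = n.prev;
        n.prev = n.next = nullptr;
    }
};

// Usage: embed the node in your own type (hypothetical example type).
struct Connection {
    int fd = -1;
    IntrusiveNode link;                        // membership in one list
};

int main() {
    IntrusiveList active;
    Connection c;
    active.push_back(c.link);
    IntrusiveList::erase(c.link);              // unlink without knowing the list
    assert(active.head.next == &active.head);  // list is empty again
}
```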
Nice article, although a bit terse. One part that jumped out at me was:
> It is possible to implement fully persistent arrays in O(1) space per update, with O(lglgN) time for access and updates [4].
This is surprising because it would guarantee at most a log(log(n)) slowdown for any immutable data structure (instead of the log(n) bound I was aware of). But then I checked out the paper and it turns out that the complexity is conditional:
> Fully persistent array use can also be provided at O(1) amortized cost, provided that the algorithm satisfies a simple requirement as to uniformity of access. For those algorithms which do not access the array uniformly or single-threadedly, array reads or updates take at most O(log n) amortized time, where n is the size of the array.
By uniformity they mean no index is used more than K times the average use per index (for some constant K). A simple example of a data structure breaking that condition is a heap: the top element is accessed on every query, but the bottom elements, which dominate the average, are each accessed only once every n queries (not a constant).
Yeah, something like a finger tree is amortized constant time for sequential access, but log(n) for random access away from the ends.
What is nearly ALWAYS overlooked is constant factors. A finger tree is in a nice complexity class, but it still carries a lot of per-object overhead compared to an array, which can turn a 10MB chunk of data into a 100MB data structure.
I would especially appreciate seeing how one might go about implementing these in some of the concatenative languages listed there, as these kinds of data structures tend to be pointer- and memory-based, which I've found difficult to replicate in Forth et al.
The first data structure, the Compact dynamic array, is explained beautifully for an undergraduate audience in Pat Morin's open-source textbook Open Data Structures. He calls them rootish array stacks.
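For the curious, here is a minimal sketch of the index math behind a rootish array stack, loosely following Morin's presentation (get/set and push_back only; growth/shrink housekeeping and error handling omitted):

```cpp
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Sketch of a rootish array stack: block b holds b+1 elements, so at most
// O(sqrt(n)) slots are ever allocated but unused.
struct RootishArrayStack {
    std::vector<int*> blocks;   // block b has length b+1
    std::size_t n = 0;

    // Which block does logical index i live in? (Morin's i2b formula.)
    static std::size_t i2b(std::size_t i) {
        double db = (-3.0 + std::sqrt(9.0 + 8.0 * static_cast<double>(i))) / 2.0;
        return static_cast<std::size_t>(std::ceil(db));
    }

    int& at(std::size_t i) {
        std::size_t b = i2b(i);
        std::size_t j = i - b * (b + 1) / 2;   // offset within block b
        return blocks[b][j];
    }

    void push_back(int x) {
        std::size_t b = i2b(n);
        if (b == blocks.size()) blocks.push_back(new int[b + 1]);
        n++;
        at(n - 1) = x;
    }
};

int main() {
    RootishArrayStack s;
    for (int i = 0; i < 10; i++) s.push_back(i * i);
    std::printf("%d %d\n", s.at(3), s.at(9));   // prints: 9 81
    for (int* blk : s.blocks) delete[] blk;
}
```

Because the blocks grow in size 1, 2, 3, ..., wasted capacity stays within O(sqrt(n)), which is where the "compact" in compact dynamic array comes from, while gets and sets remain O(1).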
Smushed list. Size O(1). The smushed list is a list of variables (of the same type), stored in a single variable of that type. To produce the smushed list, simply XOR all the elements of the list together, then store. To get a value back, simply XOR the smushed list by all the elements other than the one you want. Smushing is also embarrassingly parallel (you can smush two halves separately and then smush the results) so producing smushed lists is blazingly fast.
Unlinked list. O(n). This is slightly faster than a linked list, and acts as a "black box". Simply allocate nodes that are not linked to each other in any way. The data normally stays out of the way of your program, but in case of a core dump you can find it again. NOTE: If your language does reference-counting this will not work. Get a real language that does what you say.
Search-bush. Search trees are good at bisecting data, but they are not really conducive to a random walk for inspiration. Begin by constructing a binary search tree, keeping track of all the nodes you've added, and simply add a third, random, pointer to each node - have it point at a random node somewhere in the tree. In the search algorithm, either follow the left, right, or random node, depending on how much meandering you are interested in doing. The journey is the destination.
Actually, this one is real. AKA "parity". Another example (with n=2) is to create a doubly-linked list while only storing one pointer (ptr = prev_node xor next_node).
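A minimal sketch of that n=2 parity trick, an XOR-linked list that stores a single prev^next word per node. It assumes you traverse from an end and always carry the previous node's address; as the joke above notes, a GC or reference counter would indeed lose track of the nodes.

```cpp
#include <cstdint>
#include <cstdio>

// XOR-linked list sketch: each node stores only prev XOR next, so traversal
// must remember the address of the node it just came from.
struct XorNode {
    int value;
    std::uintptr_t link;   // addr(prev) ^ addr(next); a null neighbor counts as 0
};

static std::uintptr_t addr(const XorNode* p) {
    return reinterpret_cast<std::uintptr_t>(p);
}

int main() {
    XorNode a{1, 0}, b{2, 0}, c{3, 0};
    // Build a <-> b <-> c.
    a.link = addr(&b);              // prev is null, contributes 0
    b.link = addr(&a) ^ addr(&c);
    c.link = addr(&b);              // next is null, contributes 0

    // Forward traversal: next = prev ^ link.
    std::uintptr_t prev = 0;
    XorNode* cur = &a;
    while (cur) {
        std::printf("%d ", cur->value);
        std::uintptr_t next = prev ^ cur->link;
        prev = addr(cur);
        cur = reinterpret_cast<XorNode*>(next);
    }
    std::printf("\n");              // prints: 1 2 3
}
```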
compact-array seems like a specialization of a rope, with the tree structure arbitrarily capped at depth=2, and without the benefits of O(log(n)) insertion.
Capping the depth gives you O(1) gets and sets though. Note that compact-arrays grow breadth-wise whereas ropes grow depth-wise. Breadth-wise growth trades insertion speed for random access speed.