Since Okasaki's work there have been several advances and new developments in the field (source: MirrorThink.ai):
1. PaC-trees: Supporting Parallel and Compressed Purely-Functional Collections - 2022: This paper introduces PaC-trees, a purely functional data structure that supports parallel and compressed collections. PaC-trees are designed to be efficient in terms of space and time complexity while maintaining the benefits of functional data structures.
2. Proving tree algorithms for succinct data structures - 2019: This paper discusses the development of tree algorithms for succinct data structures, which are compact representations of data that support efficient query and update operations.
There is also some work in other fields:
1. CyBy2: a strongly typed, purely functional framework for chemical data management - 2019-12-30: This paper presents CyBy2, a purely functional framework for managing chemical data. The framework is designed to be strongly typed and enforce referential transparency through the type system.
2. chemf: A purely functional chemistry toolkit - 2012-12-20: This paper introduces chemf, a purely functional chemistry toolkit that provides a set of algorithms and data structures for working with chemical data in a functional programming context.
I'm not so sure about CyBy2: its persistence is based on "conventional" relational DBs. Judging by the paper, the main goal wasn't developing a specialised functional data structure to store molecules.
RRB trees in particular allow you to benefit from some of the cache efficiency of regular vectors, while supporting operations like immutable update. They are used in languages like Clojure and Scala.
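For a concrete feel, a minimal Scala sketch (Scala's built-in Vector is a closely related radix-balanced trie rather than a full RRB tree):

    // "Updating" a persistent vector returns a new vector that shares
    // almost all structure with the original; both versions stay valid.
    val v1 = Vector(1, 2, 3, 4, 5)
    val v2 = v1.updated(2, 42) // copies only the path to the leaf, not the whole vector

    // v1 is untouched: Vector(1, 2, 3, 4, 5)
    // v2 reflects the change: Vector(1, 2, 42, 4, 5)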
I'm reading this book right now. It's really great so far!
I've been working a lot with Trees in Clojure, and have been hitting serious limitations of my understanding.
I also found this YouTube video from a Clojure conference that reviews some different strategies for tree traversal in Clojure: https://youtu.be/YgvJqWiyMRY
I thought that learning a functional Lisp would make it really easy to traverse trees, since that is what the language is doing to actually execute its code.
Turns out all the stuff I want to do is simply hard.
Functional data structures are really awesome though, it just seems to take a bit of up front investment.
This has been a topic I've wanted to get into for a few years now, specifically because of Clojure! So if you have any additional recommendations I'd appreciate it.
I really enjoyed Friedman's book `Scheme and the Art of Programming` because it filled in
some pieces missing from "The Little Schemer" (and "Seasoned Schemer"). Building stuff like `kons`, `map` and all that `letrec` stuff.
But the big difference between Scheme and Clojure is that in Scheme, while it's "functional by concept," you get to escape with `(set! ...)` whenever you want to have a more traditional ds/algo (like say doing the n queens problem).
In Clojure you can kind of do that by either escaping to Java, using protocols (with atoms), or using transients, but I often feel like there's a "more functional way to do stuff that hasn't been taught to me."
I've opened up either the Okasaki dissertation or book or both, but I've always had trouble reading it, and then sticking with it. And some stuff like looking at Rosetta code and saying "to reverse a linked list in a lisp is easy... because it's a cons cell" seems like cheating. Almost like showing up to an interview, saying your "linked list" is implemented in an array structure and then calling `reverse()` on it.
Will watch that talk from 2014, must not have seen it before.
I guess, conceptually, day-to-day things in Clojure do feel pretty natural, even easier, and I think I have a decent understanding of it. But then when I look at leetcode-type problems, or something more involved, it takes a lot of mental effort to translate to it. Especially things like `big O` get thrown away in my mental model. I get it, persistent data structures and all that, but there's still a mystery there.
I feel like they are.. not so awesome: they are grossly inefficient due to all the pointer chasing and are pretty much guaranteed to be slower than the alternative since they trash your cache.
I wish I could remember the exact book; I think it's _Writing Solid Code_. There was a long example about the Excel evaluation engine back in the day when it was shipped on CDs and had to be perfect. The approach was: use one big, dumb, slow, but understandable and correct implementation; in parallel, use the lightning-fast, super-whiz-bang new implementation. Since both shared the same interface, they could be swapped out, or both run at the same time.
I think there is real value in starting with the pure functional versions, then swapping out when needed. One problem that seems fairly common in large codebases is using an object as a hash key. But as the code grows, that object sort of gets passed around and eventually somebody updates it without rehashing. That's a tough bug to find. They are for me, anyway.
This is one of those rare cases where you can actually make it faster later without trashing the overall design.
I'd encourage starting with the pure functional version first, every time. I'd go further and say, leave some opportunity to flip back to the pure functional version in dev builds for isolating heisenbugs.
Blah blah, grain of salt, free advice, your mileage may vary. Games have different requirements than webpages; everything is contextual.
This is one rare case where having the capability to swap back and forth is always worth it. "Just use it, and think hard before giving it up" is a really good default.
You feel this way, but if you do the math and run the benchmarks you will be surprised.
You might not be running GHC's RTS on a resource-constrained target platform, but you can generate C code from Haskell that could run on that platform, leveraging the guarantees you get from Haskell's advanced type system.
Update: the point is that GHC can optimize code, and some of those optimizations can avoid pointer chasing! But like most optimizing compilers, GHC sometimes misses them, and you have to do a bit of work to tell it where it can avoid chasing unnecessary pointers (or where it should or shouldn't inline, give it some help to know where it can do fusion, etc.).
There are entire fields of research dedicated to studying cache oblivious persistent data structures, and they're almost always trees.
One of the key points of immutable data structures is that they're easy to reason about, given that the invariants are upheld. Designing an efficient memory representation that upholds those invariants is an entirely separate problem domain. Not every pointer dereference is slow, and not every tree is represented with pointers.
They're only grossly inefficient if you wouldn't otherwise need to frequently make copies of large mutable data structures. It all depends on what you're trying to do.
I agree that it's a great book; but I wouldn't recommend it as an entry point into understanding purely functional data structures.
Okasaki fleshes out a couple of examples and explains his thought processes and heuristics for developing such data structures quite well; but they're very much still notes on the early-stage exploration of the concept.
From a historical perspective it's still a fascinating read, though, and I definitely recommend it if you want to read up on the origins of immutable data structures.
This is the canonical guide to reasoning about amortized runtimes. The exponentially growing ArrayBuffer (O(1) amortized append) is the classical data structure used to teach amortized runtimes, but the O(1) amortized functional queue Okasaki presents here gives a much better intuition for what amortized runtimes are all about.
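To make that concrete, here is a minimal Scala sketch of the classic two-list queue (my own naming, not code lifted from the thesis): enqueue conses onto the rear list; when the front runs dry, the rear is reversed in one O(n) step whose cost averages out to O(1) per operation.

    // Front holds elements ready to dequeue; rear holds newly
    // enqueued elements in reverse order.
    final case class BatchedQueue[A](front: List[A], rear: List[A]) {
      def enqueue(a: A): BatchedQueue[A] =
        BatchedQueue(front, a :: rear) // O(1)

      def dequeue: Option[(A, BatchedQueue[A])] = front match {
        case x :: xs => Some((x, BatchedQueue(xs, rear))) // O(1)
        case Nil => rear.reverse match { // occasional O(n) reversal, amortized O(1)
          case x :: xs => Some((x, BatchedQueue(xs, Nil)))
          case Nil     => None // queue is empty
        }
      }
    }

Note that this simple version's bound only holds if each queue version is used once (ephemerally); a good chunk of the thesis is about using lazy evaluation to keep the amortized bounds even when old versions are reused.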
It's weird that there are so many claims in here that the data structures and algorithms are perfectly performant yet there isn't even one look at generated assembly or any acknowledgement of the underlying system that is supposed to run the code. Proving things are Big O performant is neat, but at some point the code has to hit hardware.
It's not weird; this is standard in academic computer science. It would be weird to do otherwise. In a theoretical dissertation/paper like this you can't just randomly bring up compiled assembly; it's completely and utterly off topic, no more on topic than bringing up whether the code was run by an interpreter, or on the JVM, or transpiled to Haskell, or run on a GPU, etc.
So do most algorithms papers that have benchmarks in them. It's not always useful though, because information about how this algorithm compares to some other one on a PDP-10 doesn't necessarily translate to modern machines.
I can't speak for other fields, but in statistics and machine learning, it's not uncommon to see at least some practical experiments or simulations alongside theoretical results, or at least some discussion of practical considerations.
There are definitely plenty of pure theory papers as well, and pure theory is still a contribution. However one would at least hope to see subsequent work that does focus on the practical aspect.
You're right, there are plenty of papers focusing on real-world performance. I chose not to capture the nuance because I wasn't sure how to express it succinctly.
How I see it: computer science as an academic field is roughly split between theory and systems. The theory folks tend to evaluate progress through proofs. The systems folks tend to evaluate progress through experiments.
I’ve vouched for this, because it brings up an interesting point (albeit one that has proven provocative and been expressed before). Also, many of your comments are dead.
Big O/algorithmic complexity is interesting in that it tends to abstract away the architecture of the underlying processor. A copy is a copy is a copy. Arithmetic is arithmetic is arithmetic. Regardless of what instructions/processing units the underlying processor has. Regardless of data access. Regardless of parallelization. Regardless of memory complexity. All we use are brutish units of “work” — despite not all “workers” being equal.
It reminds me a bit of scientific management: treating your “workers” as interchangeable units that carry out what has been decided to be the most efficient movements and processes, completely disregarding the individual characteristics of the worker. For a humanist example, consider the individual quirks of physiology: not only the size, length, and composition of the skeletal system (muscles, bones, tendons, and so on) that make specific movements and workloads best suited to a given person, but also psychology, the brain as its own work-producing part that is uniquely suited to specific workloads. Scientific management ignores all of this in favor of an all-encompassing abstract “best workstyle” built from external metrics like “time to pick up a box using X, Y, or Z form.”
The same parallel can be drawn to computers: different processors and collections of hardware (essentially the “body and brain”) have different quirks and different workloads they perform best at. For example, GPUs are much more useful for vectorized/parallel workloads where the operations are relatively simple (e.g. matrix arithmetic). While you can run an iterative matrix multiplication algorithm on a CPU in n^3 time, your data access costs will be significant. Instead, running a parallel algorithm on a GPU, with RAM very close by, you can achieve log^2 n.
This is where CS research really shines: not running away from the realities of physical systems, but embracing them.
Other commenters have addressed the key point - it's not weird. CLRS doesn't look at generated assembly either.
One other thing I want to add, though, is that this book was published in 1996. The CPU landscape at the time was far more diverse, both in terms of ISA (which assembly?) and also in terms of capabilities. Even on the x86 line, plenty of people were still using 486s, which weren't even superscalar, and thus had pretty strikingly different performance characteristics than, say, a Pentium Pro (a "modern", OoO CPU). Tying your algorithmic analysis to a particular piece of hardware would be pretty useless; providing a satisfactory look at the performance on so many different (micro-)architectures could risk overwhelming the content.
Moreover, at the time, performance was, maybe, a little more straightforward, making your concerns perhaps not quite as important. Memory was slower than the CPU but not the way it is now. Caches were much smaller and less useful. Branch prediction was more primitive. CPUs had relatively weak reordering capabilities, if any at all, and simpler pipelines. There were few dedicated SIMD instructions. This was a world where linked lists still often made sense, because the hardware penalties you paid for them were smaller. In other words: in 1996 the on-paper performance wasn't at quite as big a risk of diverging quite so wildly from the real-world performance, so keeping things simple and focusing on the on-paper performance was less objectionable.
Yeah, Big-O notation is only part of the picture and yes, "galactic algorithms" that are theoretically efficient but only for astronomically big inputs do exist, but in many cases (especially if your algorithm isn't doing anything particularly deep or clever) linear is faster than quadratic which is faster than exponential on real world input, too.
Mathematics is the science of modeling and abstraction and that abstraction exists in order to make the process of knowledge gathering easier. It works very well for the sciences, so I would say the method has been proven. Of course, if the models turn out to be completely inappropriate, that would be different, but so far, the simple machine models used (implicitly or explicitly) in the analysis of algorithms seem to be rather reasonable.
The alternative that you suggest, benchmarking concrete implementations of algorithms, has so many confounders that it would be very hard to execute properly. I'm sure people are doing this too, but the formal big-O proof has a different quality to it because it does not depend on those particulars.
How helpful would it be to see 30 year old generated assembly and benchmarks for an i486 with a tiny cache, no prefetching, and relatively tiny memory vs instruction latency compared to today’s CPUs?
is this being upvoted onto the homepage based on upvoters actually understanding that this paper from 1996 is of contemporary relevance and interest or more due to keywords like "pure", "functional", "data" and "structure"?
I can’t speak for everyone, but I upvoted it from nostalgia, having read the book version over a decade ago. I happened to be thinking about ordering a copy for the office just yesterday.
I've recently used a number of structures that I learned from this book. Though I don't know if the text represents the state of the art in purely functional data structures, it's a pretty seminal work in the area.
> is this being upvoted onto the homepage based on upvoters actually understanding that this paper from 1996 is of contemporary relevance and interest …?
The book is on Amazon, but this submission had those keywords, plus it's a PDF.
Of course it is of contemporary relevance; functional programming (FP) is all the rage.
The tricky bits are questions like:
How does this jibe with existing JS functional constructs like "fantasy land," for example?
How to "recruit" more folks to FP, or even a hybrid approach of objects interacting functionally?
Game jams using more FP-like data structures? Or more HN submissions like that.
The harder things to evaluate are a lot of other topics, news-like but investigative and curious, or sites that are essentially selling a service (versus teaching the mechanism behind it).
For SaaS stuff, since HN is about startups, I have to let it slide. But the hacking piece is when one person accomplishes something with persistent decomposition of sequential problems, or does something clever using tools or ideas from a different context.
To Queue a Mockingbird
The Call-stack of the Wild
Tessellation of the d'Urbervilles
One Flew Over the Cuckoo Hash
The Data of the Woosters
Brideshead Re-visitor
The Catcher in the Trie
Les Miser-tables
The Nat-elim of Monte Cristo
I would love to be educated. I've seen claims about the merit and value of functional programming throughout my (nowadays relatively long) programming career. In practice I've never once seen those values or merits come to fruition - just the same cycle all software goes through.
My very direct experience recently has been Scala + cats resulted in the same buggy nonperformant software it was meant to prevent. I understand that bad programmers produce bad programs, regardless of language, but I feel pretty strongly that good tools prevent some amount of the typical "bad" that makes bad programs bad (ignoring the obviously maliciously bad examples). So I don't really understand, and would like to understand, if, how and when pure FP (and I suppose FP in general) actually improve quality of code/life outside of toy examples.
title typo: "Purely Functional Data Structure" -> "Purely Functional Data Structures" (pluralization)
reads a bit weird otherwise - sounds like it's discussing a particular purely functional data structure when it's actually a survey of many (pretty much the canonical survey of them, in fact).
What are software bugs that can be avoided by choosing data structures like these?
I'm making a broad, high-level presentation about immutability in technology. At my company we have folks who have heard of it in the context of ransomware-resilient backups, others who have heard of it in the context of infrastructure as code, and very few who have heard of it in terms of data structures (distributed and non-distributed). My goal is to showcase the concept in various contexts so that people can better understand its role as a key design choice in technology.
Personally I have no experience working on software that utilizes these, so if others here do, I would appreciate your input on how these make your software more reliable.
The emphasis on software reliability and bugs-avoided is because the audience works under the company's risk-management division.
Purely functional data structures are a means, not an end in themselves.
Programming with immutable values and the more declarative style they support does design out at least one infamous source of bugs: shared mutable state. That has become increasingly important in recent years, for example because modern chips support ever increasing numbers of concurrent threads in hardware and because a lot of the software we write is now part of distributed systems.
That programming style has other advantages in terms of readability and reasoning about code. If you can be confident that a particular name in the code always refers to the same value once it’s been defined, tracing the flow of data through a function or through an entire system often becomes much easier. Having more understandable code naturally tends to improve reliability, among other advantages.
If we adopt that programming style then we still need to represent collections of values and different relationships within our data, but we can’t always use “traditional” data structures to do that any more because many of them rely on mutability, which we have excluded. So we need other ways to represent structured data that are compatible with immutability and declarative style, and that is what purely functional data structures give us.
_Immutable_ data structures (not 100% the same thing, but a lot of overlap) avoid all kinds of concurrency problems, because you can safely pass them around and you'll never get a data race. You don't even need any locking (just to make things complicated, _lock-free_ data structures are another closely related but not identical concept).
Once you're running a distributed system, this kind of stuff comes into its own.
Careful with this wording. They avoid shared-memory mutations. They don't necessarily eliminate data races; rather, they just change them, since by definition every edit is now creating stale data.
In distributed software at large, this is a distraction, since most passing of messages from one distributed piece to another was already doing a copy across mediums, such that sent data was already immutable from the sender's perspective. (Granted, the preparation step can have you step on your own feet.)
As the peer comment points out, functional data structures force you to be deliberate about where/when state changes take place. In particular, they force you to be cognizant of which version of a data structure a particular piece of code is referencing.
Immutable structures can help significantly in reducing race conditions in parallel code, as it is impossible for one thread to modify part of a data structure while another thread is accessing it. Each thread sees a consistent 'version' of the data structure until the code explicitly replaces it with a new version.
Immutability can also help with memory optimizations: a node of one data structure can be safely re-used in another data structure. For example, if you need to retain older versions of a tree, you can share common nodes (this is how Git encodes version changes).
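A minimal sketch of that sharing (hypothetical Scala types, not from the book): inserting into a persistent binary search tree copies only the nodes on the path from the root, so the old and new versions share everything else.

    sealed trait Tree
    case object Leaf extends Tree
    final case class Node(left: Tree, value: Int, right: Tree) extends Tree

    // Copies only the O(depth) nodes on the search path; all other
    // nodes are shared between the old and new versions.
    def insert(t: Tree, x: Int): Tree = t match {
      case Leaf => Node(Leaf, x, Leaf)
      case n @ Node(l, v, r) =>
        if (x < v)      Node(insert(l, x), v, r) // right subtree reused as-is
        else if (x > v) Node(l, v, insert(r, x)) // left subtree reused as-is
        else n                                   // already present: same tree
    }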
Purely functional data structures are very common in purely functional languages like Haskell but are also used in non functional languages via libraries like immutable.js.
At a high level, immutability forces you to be extremely deliberate about state changes in your application. This improves reasoning/understanding, reduces bugs, and eases debugging.
An example of immutability that you might be familiar with would be React props/state. You don't modify your state. This makes reasoning about state much simpler.
Disclaimer: Not a programmer - I do not have a CS degree. One of my minors as an undergrad was MIS and while I took a database programming course as a senior, my knowledge is limited.
The high-level explanation I recall reading years ago somewhere (Joel on Software, I think??) was that a) Functional programs have no side effects, leading to the notion that b) They can, without much effort, be adapted for parallel tasks.
The example here was MapReduce, which was the original guts of the Google Search algorithm.
E.g., things are fast because we can send your query to many, many computers to find an answer, and chances are good that the machine answering is (relatively) close to you, which means low wait times to get your answer.
Modern consensus is done by constructing a replicated log, an append-only data structure shared across multiple workers. It's what you use Paxos for, and it's what Raft streamlines.
Okasaki got me interested in confluently persistent data structures, way back in the 2000s.
They seem magical! To be able to combine data from the past with current data, efficiently!
They are almost always trees, with the exception of skip lists, with all operations O(log n).
After creating my own programming language Enchilada that is based on immutable data structures, I started considering what I deemed "next level":
Uniquely represented confluently persistent data structures
Combined with a Merkle tree encoding of such uniquely represented data structures (they are almost always trees), you can efficiently and incrementally authenticate them. Think 'blockchain' on steroids, with incremental cryptographic hashes. Or torrents, if you are into that kind of thing.
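To sketch the general shape of that idea in Scala (this is not Enchilada's actual encoding, just an illustration): give each node a cached hash of its entire subtree, so an update only recomputes hashes along the changed path, and equal subtrees can be recognized and authenticated by hash alone.

    import java.security.MessageDigest

    sealed trait MTree { def hash: Array[Byte] }
    case object MLeaf extends MTree {
      val hash: Array[Byte] =
        MessageDigest.getInstance("SHA-256").digest(Array.emptyByteArray)
    }
    final case class MNode(left: MTree, value: Array[Byte], right: MTree) extends MTree {
      // The node's hash covers its whole subtree, Merkle-style.
      val hash: Array[Byte] =
        MessageDigest.getInstance("SHA-256").digest(left.hash ++ value ++ right.hash)
    }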
Why would we want to use purely functional data structures? When do the pros of functional data structures outweigh the additional complexity? Are there scenarios when a project would want to pivot from a regular data structure to a purely functional one?
The most common point is that they're safe to share between threads making parallel algorithms easier to invent, understand, and implement correctly.
You can also safely re-use sub-structures without performing a deep copy. For example, if you want to keep a sub-tree around for later you can do that in O(1) time because it's safe to keep a reference to it. If it is a mutable tree you don't know what's going to happen to it so you need to do a deep copy of the entire sub-tree you're holding on to. This can save a lot on memory allocation and copying depending on your use case.
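A minimal Scala illustration with immutable lists:

    // Keeping a sub-structure for later is just keeping a reference; no
    // defensive deep copy is needed, because nothing can mutate it.
    val xs = List(1, 2, 3, 4)
    val suffix = xs.tail // O(1): shares xs's cells
    val ys = 0 :: xs     // O(1): one new cell pointing at xs
    // xs, suffix and ys all coexist and share memory safely.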
But after I learned JAX/Flax, a light turned on inside my head, and I now write functional deep learning code as much as I can. All my side projects and new code are purely functional in JAX/Flax. PyTorch has a functional API known as functorch, and I have used it in one project.
Where lots and lots of data in 3-, 4-, or 5-dimensional tensors exists, and you need to run lots of transformations on it, and then you need to multiply a huge number of tensors a thousand times each second, functional code makes much more sense and gives immense sanity and peace of mind.
Those of you writing Deep Learning code, learn functional programming principles (immutable data, pure functions, leaving no side effect, etc.), and apply them to DL via functorch or JAX.
JAX is very barebones and will require you to write much more code for the same task than you write in PyTorch.
Functorch is still new, and honestly, there is little to learn if you already know JAX. There are some talks from Meta, and then there is always the docs.
There's a tradeoff between the complexity of the implementation of a data structure, and the use of one. While the complexity of implementing purely functional data structures is often (except maybe in the case of singly linked lists) higher than their mutable counterparts, actually using them in a program is simpler and less error prone.
There are obviously other tradeoffs as well, like performance and memory usage.
they have some nice properties that you might want to benefit from. For example, they're persistent: every previous state of the data structure is still in there... aka snapshots!
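For example, in Scala (a trivial sketch):

    // Every update to an immutable map yields a new version; the old
    // versions remain valid snapshots for free.
    val v0 = Map.empty[String, Int]
    val v1 = v0 + ("a" -> 1)
    val v2 = v1 + ("b" -> 2)
    // v0, v1 and v2 are all still usable, e.g. for undo or auditing:
    assert(v1.get("b").isEmpty && v2("b") == 2)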
There is no additional programming complexity in using Scala's Map vs Java's HashMap. They are both maps with keys and values. The Java one you update in place:
var m = new HashMap<K,V>()
m.put(k, v)
the Scala one you "update" by creating new maps,
var m = Map.empty[K, V]
m += (key, value)
In practice it's mostly the same except you can share the immutable one around without being scared someone will mutate it, but it takes more memory per instance than the mutable version and creates more GC churn, which may or may not be an issue.
As is explained in the abstract, purely functional data structures are most useful when you want to avoid assignment, either due to the restrictions of your programming environment (e.g. you're programming in Haskell), or because that is a property you're interested in for other reasons (e.g. you want to ensure immutability due to concurrency concerns, or exploit persistence to improve performance).
This book is near and dear to my heart. Back in the mists of time (~05?) when I was learning Haskell I reimplemented several of these in order to get my head around things. Lots of fond memories.
For me, the most mind-blowing part of Okasaki's book was the chapter on "numerical representations". Never looked at it like that before I read that chapter. While the other chapters certainly introduced material that was new to me at the time, this one took some things I knew and added a whole new dimension to them.
I have a personal pet peeve about the misuse of terminology when dealing with such names, for which the only solution is to go read the original reference to figure out what they meant by it.
E.g., in this case, to describe a data structure as "purely functional" makes zero sense to me intuitively at first. You need to go read the thesis and realise they're referring to data structures implemented as algebraic data types, which in the context of a purely functional language can themselves be thought of as functions, and can therefore be described as 'functional' ...
But unless you do that, the first thought is going to be "huh? can arrays be further split into imperative vs functional?" "Does he mean immutable?" "Can I use functional arrays in c?" "Are they faster/safer than normal arrays?".
By contrast, I think "Purely Functional Data-Structure Types" would have been a far more intuitive term ... but I suppose the author may have felt that clarifying further wasn't punchy enough, and could have made the thesis more wordy...
TypeScript is much less readable than Haskell and OCaml, but you can easily find translations to TypeScript such as https://github.com/skeate/lambdata.
> TypeScript is much less readable than Haskell and OCaml
That's like saying that Norwegian is much less readable than Italian. It is in the eye of the beholder. They can both express the same concepts but which one is more readable depends on which one you already know.
Only in the Turing tarpit sense. Out of the box, they have very different capabilities. For example:
Higher-kinded types: easy in Haskell, hard in OCaml, mostly impossible in TypeScript.
First-class modules: OCaml has them; TypeScript can sort of simulate them with extraordinarily unsafe prototype-mangling stuff that you should never ever use; impossible in Haskell.
Open variants: easy in OCaml and TypeScript, hard in Haskell.