Git is a purely functional data structure (2013)

BenoitEssiambre · on Dec 9, 2017

I sometimes like to explain things the other way around. Immutability being version control for program state.

I rarely use functional programming but I certainly see its appeal for certain things.

I think the concept of immutability confuses people. It really clicked for me when I stopped thinking of it in terms of things not being able to change and started instead to think of it in terms of each version of things having different names, somewhat like commits in version control.

Functional programming makes explicit, not only which variables you are accessing, but which version of it.

It may seem like you are copying variables every time you want to modify them but really you are just giving different mutations, different names. This doesn't mean things are actually copied in memory. The compiler doesn't need to keep every versions. If it sees that you are not going to reference a particular mutation, it might just physically overwrite it with the next mutation. In the background "var a=i, var b=a+j", might compile as something like "var b = i; b+=j";

runeks · on Dec 9, 2017

> I think the concept of immutability confuses people.

I think it confuses people because it’s framed oddly. Immutability isn’t about being unable to mutate state, it’s about no longer using containers (registers) as variables, such that the equal operator actually means “equals” as opposed to “store”.

In most programming languages, the equal operator works as a “store” operation, which stores a value in a named container/register. In “immutable by default” languages like Haskell, the equal operator actually means “equals”, as in “is synonymous with”.

The essence of immutability is referencing values directly, through synonyms, as opposed to storing them in named registers for later retrieval. When it’s done this way, immutability no longer makes sense: is the number 3 mutable? Can the number 3 be mutated into 4, or are they just two distinct numbers?

canes123456 · on Dec 9, 2017

I think I understand the point you are making but this is way more confusing for me. Partly this can be blamed on imperative lanuages and = vs ==.

sbergot · on Dec 10, 2017

Not that in mutable languages == can either mean storage equality or value equality. In immutable languages there is only value equality.

giesch · on Dec 10, 2017

In a procedural language:

x == 3 means: Take the value out of the box x. Is it 3?

x = 3 means: x is a box, put 3 in it. The 3 lasts forever, or until someone changes it.

In a functional language:

x = 3 means: Take the value out of box x. Is it 3?

let x = 3 in [...] means: x is a box, put a 3 in it. The 3 lasts for only the [...], but cannot be undone.

nybble41 · on Dec 11, 2017

Correction: In a functional language (specifically Haskell):

  x = 3 means: x is defined as 3.
  x == 3 means: Is x equal to 3?
  let x = 3 in [...] means: within the scope of [...], x is defined as 3.

There is no specific point where something is "put into the box" or "taken out of the box". In fact, there is no "box". The functional programming concept of "bindings" does not correspond to the imperative concept of "variables", not even the oxymoron known was "constant variables". (If you want variables in Haskell you need IORef or STRef, where the load and store operations are explicit monadic actions.)

sobellian · on Dec 10, 2017

> In the background "var a=i, var b=a+j", might compile as something like "var b = i; b+=j";

Believe it or not, many compilers internally represent mutable variables as a sequence of immutable variables: https://en.wikipedia.org/wiki/Static_single_assignment_form

Edit: I should clarify that I mean "immutable" in the context of primitive values like integers.

macintux · on Dec 9, 2017

That's an interesting way to describe it. I've talked a lot about immutability at conferences, but I've never thought about it in those terms. Thanks.

masklinn · on Dec 9, 2017

That realisation has been used by environments like Elm's Reactor[0] or its (sadly defunct) Time Traveling debugger. I think Om also had something like that.

Basically, if you use persistent data structures and a unified application state, you can keep a list of all previous application states and you can browse it or ship it for debugging, it's not that expensive.

[0] in debug mode, you get a list of all events having occurred in your application and can instantly move back to that state, and you can import/export that state history: http://elm-lang.org/blog/the-perfect-bug-report

arximboldi · on Dec 9, 2017

I am working on something like that for C++, a library + debugger called Lager. Quite experimental yet, but been using it to write apps with SDL and Ncurses and I am happy with how it is going... In the end it is a very simple architecture, implementable in any language. Good immutable data-structures are important for big programs though.

Library: https://github.com/arximboldi/lager

Ncurses example (full text-editor): https://github.com/arximboldi/ewig

SDL example: https://twitter.com/sinusoidalen/status/939539173584916480

Immutable data: https://github.com/arximboldi/immer

(There was a talk about Lager at MeetingCpp this year, still waiting for it to be published online...)

malkia · on Dec 9, 2017

Thank you Juan for the talks (I watched both the CppNow, and CppCon, though the latter one is even better). Also for the wonderful libs, and the real-life example (your text editor).

At work, we are pondering on a new way to split our game levels editing code, and persistent data structures came to me as one idea, in order to handle UNDO in generic way, and they by diff, or other means visualize the changes. I've read Okasaki's book long time ago, but seeing this implemented in a imperative language is really helpful.

Also pretty much enjoyed the post-modern talk too :) - hehehe

Thank you!

arximboldi · on Dec 9, 2017

Thanks a lot Malkia, that is super encouraging!

Editing code for game levels seems to me like a quite good use-case for this. "Editing" software with rich hierarchical documents is the itch I originally wanted to scratch with this work, even though there are many other use-cases too. Feel free to send me an email if you move further with that idea or if you want to discuss any of these topics with me :-)

masklinn · on Dec 9, 2017

> Good immutable data-structures are important for big programs though.

I'm guessing you mean good in the sense of well-implemented (performant), but I think good in terms of interface is also very important

For instance in my experience & opinion Immutable.JS is not very fun to use regardless of its implementation, because the interface does not feel natural for the language (not that the designers can really be faulted, the user-accessible part of the language simply limits what they can do). I think a similar library for Python would have similar issues (due to Python being very statements-oriented which is antithetical to persistent data structures).

no_wizard · on Dec 10, 2017

Python however does provide some immutable constructs as part of it’s core language (Tuples)

It also has the ability to really take any object and attach states to it at whim, including immutability.

I think that’s a better trade off than JS

masklinn · on Dec 10, 2017

I think you completely missed my point. That aside,

> Python however does provide some immutable constructs as part of it’s core language (Tuples)

That doesn't really help when you're looking for persistent datastructures.

> It also has the ability to really take any object and attach states to it at whim, including immutability.

No, and if it did that would not be an advantage in this context.

bjoli · on Dec 11, 2017

Hi arximboldi!

I actually have a half-started emacs-like editor built using Qt that is based on ewig. Not anywhere near usable yet, but my plan is to make a fast editor with fast highlighting and such, and have everything accessible to be scripted in guile. I doubt it will actually materialise, but if I ever get a proof of concept it will be on github.

I love immer! It is friggin excellent.

arximboldi · on Dec 13, 2017

Hey that's awesome! Let me know if you ever would like me to take a look at your code or anything. I'd be happy to link it from the Immer site too when you are ready to promote it :-)

runT1ME · on Dec 9, 2017

Nice, good explanation. And one way to look at the State Monad is when you always want the 'latest version' of an immutable 'value' that keeps getting new versions. :)

antoaravinth · on Dec 10, 2017

That's good explanation. I guess Immutable.js does uses the same concept behind the scene to retrieve each references i.e like commits. Looks like Immutable.js uses Tries data structures for such operations. May be I'm wrong here as well.

Edmond · on Dec 9, 2017

Perhaps for those familiar with "functional data structures" such an analogy is helpful but I find it easier to simply explain git for what it is without adding more exotic nomenclature to it.

Git lets you do version control via full snapshots as opposed to just tracking diffs (even though it does actually do this too behind the scene).

You can think of a full snapshot as saving a copy of your project structure every time you do a commit. The key trick is that git doesn't actually create new copies of the content for each commit but simply maintains a tree structure whose nodes are pointers (via hashing) to the content they represent.

The complication from git is not in understanding the core concept but knowing how best to apply them. There are all sorts of crazy workflows you could implement by manipulating git pointers and their associated patches. As with anything that is flexible, difficulty comes in knowing how to constraint yourself when using it.

pdonis · on Dec 9, 2017

> git doesn't actually create new copies of the content for each commit

More precisely, it doesn't create new copies of content that you didn't change. For example, if you have 100 files in your repo and you change one of them and then commit, git creates a new copy of the content of the file you changed--a new blob storing the new file content--and a new tree object that references the new blob instead of the old one, plus the other 99 blobs that store the contents of files you didn't change; the new commit object then references the new tree object (plus the message and metadata). But git never stores diffs between old and new content; it just creates a new blob every time the content of a file changes.

jdmichal · on Dec 9, 2017

> But git never stores diffs between old and new content; it just creates a new blob every time the content of a file changes.

Git pack files compress objects by storing them as diff files going backwards. That is, it stores the most recent state in full, then uses patches to go backwards. Because you're more likely to need a recent version in full than an older one.

https://git-scm.com/book/en/v2/Git-Internals-Packfiles

emmelaich · on Dec 10, 2017

This is true but packfiles are an implementation detail.

It's still useful and more accurate conceptually to consider every commit as a complete snapshot of the state of code that point.

jdmichal · on Dec 11, 2017

That can be said of every version control system. Restoration of state to any given version is their defining feature. How they achieve that is always an implementation detail, but those details can still be important and interesting.

bbatha · on Dec 11, 2017

Git commits are composed of all of the files in the commit, it’s parent and the commit message. This is an important guarantee that each checkout is valid without the rest of the repo. This allows you to have a lot of exotic implementations guarantee consistency between them. Meaning if your GitHub you can distribute commits across many servers. Or your Microsoft and you build partial checkouts for Gvfs. It’s what allows Git LFS to keep many of git’s core guarantees while making tradeoffs to improve areas where git is traditionally weak.

emmelaich · on Dec 12, 2017

Sorta true but see what bbatha said.

There are people who distinguish changeset oriented and snapshot oriented and will hotly debate that one or the other is better.

But as you say, restoration of state is a necessary and defining feature.

nine_k · on Dec 9, 2017

Exactly. Those familiar with functional data structures would thus point you at git and say: heres a structure from the Okasaki book that you use every day.

One more step is to point that futures / promises, and even lists, are monads that a e.g. JS programmer uses every day, too. It reminds me of the old literary character who did not know that he's been speaking prose all his life.

rst · on Dec 9, 2017

Particularly since a git repo, as a whole, isn't a functional data structure. The commits (and the graph in which they're embedded) are immutable, but the mapping between branch names and commits gets mutated all the time. (To say nothing of slightly deeper esoterica like the index and stashes.)

masklinn · on Dec 9, 2017

That's pretty functional still, the "branches" are just (in clojure terms) atoms, aka atomically mutable references to a structure (namely a commit which is some metadata + a reference to a tree).

kazinator · on Dec 9, 2017

> Git lets you do version control via full snapshots as opposed to just tracking diffs.

That's completely orthogonal to whether the version history is immutable with each branch being like singly linked list where we "cons" new things onto the front.

misnome · on Dec 9, 2017

That’s the point; because sentences like

> the version history is immutable with each branch being like a singly linked list where we “cons” new things onto the front

Just contributes to the general confusion around git where people decide it’s too complicated to learn.

(Apologies if my sarcasm detector is just broken today)

kazinator · on Dec 9, 2017

Git is a purely functional data structure, except for the mutating head pointers, rewriting of tags, various state in the repo related to things like on-going rebases, cherry picks, bisects, ... oh and the index which is one object changing in-place (not to mention working tree, of course).

vickychijwani · on Dec 10, 2017

While what you say is technically accurate, I think you're missing the bigger picture here. The author's point still stands: git's commit history (arguably the most important part of a version-control system) can be viewed as a purely-functional data structure, and that view has practical benefits too. I tried to explain more here: https://news.ycombinator.com/item?id=15892013

rubenbe · on Dec 9, 2017

I often recommend people to read "Git Internals". If you know how git works internally, it's much easier to understand how it works and the reasons behind it.

https://git-scm.com/book/en/v1/Git-Internals

nine_k · on Dec 9, 2017

Even more enlightening is The git Parable.

http://tom.preston-werner.com/2009/05/19/the-git-parable.htm...

randomsearch · on Dec 9, 2017

This suggests a poor abstraction.

goialoq · on Dec 9, 2017

"Internals" is a poor choice of term. "Data structure" is a better term. Git is "plumbing and porcelain". The plumbing is the core of git. Porcelain are shortcuts. In general, Torvalds projects (Linux, Git) aren't big on abstractions that maximize simplicity-of-use, they focus on doing complex things correctly and quickly. Adding abstraction makes it hard to get details correct and run quickly.

randomsearch · on Dec 9, 2017

Agreed. Git seems like a good internal design with a terrible interface.

mannykannot · on Dec 9, 2017

I think the author has a point in saying that learning Git by trying to map it to Subversion is not the best way to do it, but I don't think analyzing it as a functional data structure adds much insight. To me, it is easier to understand when you look at its purpose, and how it solves the problems of that domain - and the biggest difficulties of version control are on account of the problem being essentially one of distributed, lockless concurrency, something not mentioned in this article.

icc97 · on Dec 9, 2017

I found the explanation from the Immutable JS presentation easier to understand when talking about Immutable data structures [0]

[0]: https://youtu.be/I7IdS-PbEgI?t=5m7s

dmitshur · on Dec 10, 2017

Even after 4 years, this remains my favorite, most influential article that helped me understand and feel comfortable with git. It's just a very good analogy.

dustingetz · on Dec 9, 2017

for a real database that works like git, see http://www.datomic.com

if git killed svn, datomic kills postgres

xj9 · on Dec 9, 2017

if it were open source maybe, but i'm definitely not going to switch (even though i might want to) for licensing reasons. in fact, i have more motivation to write an libre datomic clone than to pay cognitect anything for their proprietary db.

dustingetz · on Dec 9, 2017

do it!

greghendershott · on Dec 9, 2017

Depending on the definition of "it", someone has: https://github.com/tonsky/datascript

dustingetz · on Dec 9, 2017

datomic is a distributed system, datascript is an in memory data structure (it is very cool though, i use it)

macintux · on Dec 9, 2017

As much as I appreciate datomic, that's a poor conclusion to draw. git is objectively better than svn. Postgres is not objectively worse than datomic; there are things that datomic simply can't do efficiently.

dustingetz · on Dec 9, 2017

if you go back and read the git vs svn flame wars back when it first came out, people said the same thing. it happens every time there is a paradigm shift technology. the reactjs flame wars of 2013/4 were particularly brutal as everyone and their mom felt qualified to comment. the key idea here is that, at scale, immutability is just better, at nearly everything

Scarbutt · on Dec 9, 2017

datomic is to slow on writes, even slower than sqlite, that's because it serializes all writes.

ioquatix · on Dec 9, 2017

.... and one day I had the crazy idea to make a database on top of it: https://github.com/ioquatix/relaxo because the underlying immutable data structure makes this quite feasible.

snissn · on Dec 9, 2017

Is git a blockchain?

icebraining · on Dec 9, 2017

They both use Merkle/Hash trees:

Hash trees are used in the IPFS, Btrfs and ZFS file systems, BitTorrent protocol, Dat protocol, Apache Wave protocol, Git and Mercurial distributed revision control systems, the Tahoe-LAFS backup system, the Bitcoin and Ethereum peer-to-peer networks, the Certificate Transparency framework, and a number of NoSQL systems like Apache Cassandra, Riak and Dynamo.

https://en.wikipedia.org/wiki/Merkle_tree

masklinn · on Dec 9, 2017

Mostly the reverse, if you need to relate them a blockchain is a degenerate Git (history, which is a subset of Git itself):

A Git history is a DAG[0] (each commit can have multiple parents) and beyond that a polytree (it can have multiple roots); while a blockchain is an arborescence[2] (there's a single root — the "genesis block"; and each block can only have a single parent).

Further, beyond the technicalities blockchains are generally very linear (the side-chains tend to be pretty short, forks aside) while Git repositories can be extremely broad (have lots of concurrent branches).

[0] https://en.wikipedia.org/wiki/Directed_acyclic_graph

[1] https://en.wikipedia.org/wiki/Polytree

[2] https://en.wikipedia.org/wiki/Arborescence_(graph_theory)

finnthehuman · on Dec 9, 2017

Is a hot dog a sandwich?

The answer depends entirely on definition. What properties of bitcoin are essential to a blockchan vs which properties are simply how bitcoin happens to use a blockchain?

If blockchain just means the Merkle tree, then yes.

If it means Merkle tree + a computational consensus system for adding nodes, then no.

linschn · on Dec 9, 2017

Short answer, yes. The current commit includes the hash of its parent(s), so its own hash reflects the whole history, and one can not change the history without also changing the current hash. Just like a block contains the hash of the previous block.

sparkie · on Dec 9, 2017

That's a Merkle Tree. A blockchain is an application of a Merkle tree in which each node contains transaction data, and a majority of clients agree that the longest chain of blocks is the correct one.

Git also uses a Merkle-DAG, but it is not a blockchain.

mbrock · on Dec 9, 2017

a majority of clients agree that the longest chain of blocks is the correct one

If you squint, that's kind of true for Git repositories, too.

The version with the most "proof of work" on it is likely the master branch.

Of course the incentives are very different... but still, the similarity is somewhat illuminating, I think.

misnome · on Dec 9, 2017

But this ‘definition’ seems to needlessly tie the definition to the kind of data you are transmitting. And if that were the case, couldn’t a software diff be considered a kind of transaction?

doug1001 · on Dec 10, 2017

well the top-level git data structure is pretty close to eg, Scala's Vector, which is an immutable container implemented as a tree with a high branching factor of 32. Modification to such a vector, rebound to a new variable, relies on structural sharing of the original (http://www.codecommit.com/blog/scala/implementing-persistent...)

erikb · on Dec 9, 2017

A data structure cannot be functional. I understand what he's trying to say, and agree with most of it, but the word "functional" is purely wrong. What he wants to say is "good". But not all "functional programming" is good, nor is all good programming functional by necessity, despite what your local Lambda The Ultimate nerd tries to tell you.

The Best, when it comes to data structures is a Directed, Acyclic Graph. For instance your typical linux filesystem is a DAG. But there's one problem with DAGs: When they reach a certain complexity human brains are not fit enough to parse them anymore. (programs still can though)

So in many circumstances at least a human programmer needs to take a look at the state of your program and make assumptions about its correctness, which is called debugging. And that's why in Good programs we often use Good data structures instead of The Best.

Good data structures are key->value stores (which you may know as "hash tables" or "dictionaries"), trees, and trees in a simplified special form: lists, each of them being somewhat able to represent the other two, if one can accept a performance hit and/or increased complexity in source code. Dictionaries, trees, lists. That's it. And you do that in every programming language that is at least a little bit interested in being Good.

So there's nothing special or functional about git's data structures, it's just normal Good programming, and a few programmers who are so good at programming that they don't even need to mention it anymore, they breath good programms.

Then of course to the normal bread-earning coder good programs are a rare sight. But the reason is not that they are really rare, the reason is that successful business doesn't really require Good programs to succeed. Mediocre programs are good enough to earn their rent, and most of us spend most of our coding hours to earn our rent.

All that being said, if you don't just want to make money, go and spend some time studying git internals. It will teach you a lot more than most of your teachers/professors taught you combined. Sadly the source code is written by Linux gurus, who like to encrypt their source code with a very special key that only people from their tribe can understand. But the Git Book is actually good enough that you can study quite a lot of the internals from that book. I also suggest writing your own git in your favorite programming language once, to really understand it.

how_gauche · on Dec 9, 2017

> A data structure cannot be functional. I understand what he's trying to say, and agree with most of it, but the word "functional" is purely wrong. What he wants to say is "good". But not all "functional programming" is good, nor is all good programming functional by necessity, despite what your local Lambda The Ultimate nerd tries to tell you.

Why is it that whenever functional programming comes up, "real programmers" come out of the woodwork to bloviate, only to expose just how little they actually know about the topic?

"Purely functional" in this case is a jargon term that relates to Okasaki's 1996 PhD thesis, available online (https://www.cs.cmu.edu/~rwh/theses/okasaki.pdf). It's a classification for data structures much in the way "regular" or "context-free" or "turing-complete" serve as classifications for grammars.

For a data structure to be "purely functional" (and no, article author does not mean "good" here) simply means that it's implementable in a pure functional language without mutation. Examples of non-functional data structures would be ones where mutation is intrinsic to how its algorithms work: traditional linear probing hash tables, union-find for graphs, etc.

By this technical criterion, the git object graph is clearly a purely functional data structure, sorry.

erikb · on Dec 9, 2017

See the other leaf of this purely functional comment tree for my answer.

erikb · on Dec 12, 2017

Since there is even another branch now, I paste it:

I feel like I already said "I understand but I disagree". A round wheel is more useful for a car than a triangular wheel. That doesn't mean it's a "car wheel". It's just as good on a horse wagon or a bike.

LukeShu · on Dec 9, 2017

As others have said, "purely functional data structure" is a jargon term with a precise meaning--that fact that this is true indicates that data structures can, in fact, be functional in this sense. To claim that "the word 'functional' is purely wrong" is, well, purely wrong; the data structure in Git being described is purely functional.

Beyond that, you seem to have shown misunderstanding of what that means (despite your claims of "I understand but I disagree"), as the other examples you give DAGs, dictionaries, trees, lists, and this comment thread, do not have the interesting properties of purely functional data structures that make them worth discussing in the context of Git. Namely, the immutability of existing entries.

The thesis of the article isn't that purely functional data structures are Good, or that functional programming is Good. Those are quality judgments that are not present in the article. The thesis is: Explaining Git in terms of the purely functional data structure it uses might be helpful to learning Git.

erikb · on Dec 9, 2017

Yes, I didn't accept that part of the logic. Neither do I believe that immutability is something good, nor do I believe that git is very immutable, nor that git gains from immutability.

I believe mutability is a trade off. You can make objects more immutable by being less efficient in terms of storage size, search time, source code simplicity. In exchange for these negative attributes you get source code you can proof more easily. It's good when that attribute is needed, but not most of the time.

Git is only immutable on that meta level where we talk about the object key-value store. The implementation uses mutable files though, and even steps away from that immutable key-value store when the size gets too big and needs to store more efficiently.

And on the user level you can amend, edit, squash and delete commits easily, which is actually a strength of git not a weakness. People love it for being more mutable than SVN.

Dylan16807 · on Dec 9, 2017

> Git is only immutable on that meta level where we talk about the object key-value store. The implementation uses mutable files though, and even steps away from that immutable key-value store when the size gets too big and needs to store more efficiently.

Functional languages are only immutable on that meta level where we talk about objects in memory. The implementation uses mutable RAM though, and even garbage collects when the size gets too big and needs to store more efficiently.

Unless you want to narrow the term 'functional' so much it describes nothing at all, Git is a functional data structure.

erikb · on Dec 9, 2017

It seems like you say that with the goal of proofing me wrong with an example. But what you say is

(a) an argument helping my thesis by showing that you can program functional languages quite well by using mutable recourses

(b) the biggest users of what you write there are also the biggest enemies of the status quo you describe. Functional programmers really hate that computers are so unlike what they love.

Dylan16807 · on Dec 9, 2017

> (a) an argument helping my thesis by showing that you can program functional languages quite well by using mutable recourses

You can implement a functional language with mutable resources, and you can implement a functional data structure like git with mutable resources. How does it help your thesis that git isn't a functional data structure?

> (b) the biggest users of what you write there are also the biggest enemies of the status quo you describe. Functional programmers really hate that computers are so unlike what they love.

Whatever, something that is purely functional all the way down to the wires is impossible.

Any definition of "functional" the rejects every single functional language is an objectively failed definition. A usable definition will go ahead and permit the git data structure to be called functional.

vickychijwani · on Dec 10, 2017

> (a) an argument helping my thesis by showing that you can program functional languages quite well by using mutable recourses

Let me preface this by revealing that I'm neither a functional programmer, nor am I convinced that functional programming or immutability is any sort of silver bullet.

Your original thesis was that "a data structure cannot be functional," which you later defended by saying "git is only immutable on the meta level, its implementation uses mutable files", which, while being 100% true, is also true of the category of programming languages we call "functional" as they rely on mutable RAM. That is what the GP has been trying to explain to you, but your obvious hate for anything labelled "functional" makes you blind to all this.

Sorry mate, you just have to concede the point here. This is not a discussion about whether functional programming is any good, which is what you seem to be treating it as.

pqh · on Dec 9, 2017

And yet, thankfully, the old versions of those commits are still present in the reflog for some time. Making it easy to recover from finger slips.

davidcuddeback · on Dec 9, 2017

> A data structure cannot be functional.

I think the author is using the term "purely functional data structure" to mean a data structure that lends itself to an implementation in a purely functional language [1,2].

[1] https://en.wikipedia.org/wiki/Purely_functional_data_structu...

[2] https://www.amazon.com/Purely-Functional-Structures-Chris-Ok...

erikb · on Dec 9, 2017

I feel like I already said "I understand but I disagree".

A round wheel is more useful for a car than a triangular wheel. That doesn't mean it's a "car wheel". It's just as good on a horse wagon or a bike.

davidcuddeback · on Dec 9, 2017

You can disagree if you want. I just don't want other readers to be misled into thinking that "purely functional data structure" isn't a term of art. Given the number of references to Okasaki's book in this thread, I feel like the interested reader is sufficiently armed to learn more that I don't feel the need to continue this discussion.

You may not find the definition useful, and that's alright. I won't try to convince you.

_0w8t · on Dec 9, 2017

Git is not a purely functional data structure [1]

[1] man git-rebase

19870213 · on Dec 9, 2017

But git-rebase does not alter existing commits in the commit tree, it simply creates a new branch (meaning new commits) on the tree.

Simon_says · on Dec 9, 2017

Alright wiseguy, git gc.

davidcuddeback · on Dec 9, 2017

I don't understand what point you're trying to make. Unreachable data can be garbage-collected in Haskell, ML, Clojure, Erlang, and many other functional (and non-functional) programming languages. What about GC is supposed to refute that a data structure is functional?

Simon_says · on Dec 9, 2017

19870213's point was that a git rebase is functional because it doesn't change existing commits, rather it only makes new ones and moves branch pointers. This is correct. You can also say that git gc is functional because it's only garbage collecting. But you can't have it both ways. The cumulative effect of rebase followed by gc is to delete commits that have a branch pointing to them.

rootlocus · on Dec 9, 2017

> The cumulative effect of rebase followed by gc is to delete commits that have a branch pointing to them.

This is both false and irrelevant.

It's false because after the rebase, the branch you just rebased won't point to the old commits. Unless you had other branches there, they would be orphaned. Also, according to the "Notes" section of the git-gc documentation [0]:

> git gc tries very hard not to delete objects that are referenced anywhere in your repository. In particular, it will keep not only objects referenced by your current set of branches and tags, but also objects referenced by the index, remote-tracking branches, refs saved by git filter-branch in refs/original/, or reflogs (which may reference commits in branches that were later amended or rewound).

Since the commit you were on before making the rebase is in the reflog, it will actually not be GCed (yet) even though there are no branches pointing to them.

It's irrelevant because, even if it was true, as long as there are no more objects referencing that commit, it's perfectly eligible for gc. I don't understand your argument that "you can't have it both ways".

[0] https://git-scm.com/docs/git-gc

davidcuddeback · on Dec 9, 2017

> The cumulative effect of rebase followed by gc is to delete commits that have a branch pointing to them.

That would be a strictly broken GC if it removes commits that are reachable from branch pointers. Please file a bug report with the git project if you've observed this happening, because it would be inconsistent with the git-gc(1) manpage, which says it only removes unreachable objects.

dustingetz · on Dec 9, 2017

hey you're getting downvoted bc you misunderstand so insistently and cockily, maybe try asking a question

dustingetz · on Dec 9, 2017

gc is actually essential to implementing pure functional programming efficiently on top of imperative hardware

(see: structure sharing, an essential optimization that lets functional data structures reuse bits of each other internally for low-cost copy-and-modify, at the cost of making a data structure no longer responsible for freeing its own memory since it overlaps)

klodolph · on Dec 9, 2017

If Git were purely functional, you would expect rebase not to modify the existing data in any way, and indeed that is exactly what happens. You can create, delete, or modify only the top-level pointers: branch names, reflog, etc. Instead, rebase creates a completely new set of commits, and points the current branch at a new one. This is exactly how functional data structures work in e.g. Haskell, where "inserting" an element into a dictionary means that you get a new copy of the dictionary, and all existing references to the original are unmodified.

_0w8t · on Dec 9, 2017

Git rebase alters the structures that are relevant for me, like heads of named branches. In Haskell let bindings are immutable. To reference to the results one has to put them into new bindings. I.e. if Git was purely functional, the rebase would create new names for branches.

_fq4v · on Dec 9, 2017

You can easily shadow a name in a let binding in Haskell

For example,

let x = 5 in let x = 6 in putStrLn (show x)

would print 6. Unless you turn on compiler warnings, there is never a need to choose new names for variables if the old version is no longer needed.

catnaroek · on Dec 10, 2017

Shadowing isn't the same thing as redefining. Try evaluating the following expression:

    let x = 4 ; f y = x + y in let x = 6 in f 10

rootlocus · on Dec 9, 2017

If you want immutable branches, you can use tags https://git-scm.com/book/en/v2/Git-Basics-Tagging

masklinn · on Dec 9, 2017

> In Haskell let bindings are immutable.

Haskell has mutable refs. That's what Git branches are.

catnaroek · on Dec 10, 2017

But the contents of a mutable ref aren't let-bound.

fiatjaf · on Dec 9, 2017

So your entire point amounts to naming?

fiatjaf · on Dec 9, 2017

I don't see that anywhere. A rebase is just a fork.

ianamartin · on Dec 10, 2017

I find this conversation fascinating because there is so much disagreement on the meaning of "functional" and "immutable"

What I've gathered so far from reading the article and the comments is that some people who are in the know about a very specific paper agree that Git is a purely functional data structure. And that others look at the ways you can use Git and point a finger and say, "Look! It can be mutated! Therefore it cannot be functional!" And the response to that is, "Don't be so technical about how you define functional. Or immutable. You know it when you see it."

Is this some kind of Obi-wan Kenobi from a certain point of view stuff? Why is this so difficult to get a handle on?

If a thing says immutable on the tin, and it's mutable, how is that purely functional? I know, read the paper. I know. But still, it's a legit question.

It seems to me that a data structure so amazing as being purely functional shouldn't be so easy to misunderstand as what we're seeing here. And it's clearly being misunderstood. And not only by me.

how_gauche · on Dec 10, 2017

> some people who are in the know about a very specific paper

This stuff isn't obscure just because you don't happen to know about it already. Chris Okasaki's publications have been cited over 1000 times, mostly for his PhD work -- those papers, especially on functional data structures and amortization analysis in lazy languages, are considered foundational for a whole research area in Computer Science.

> Is this some kind of Obi-wan Kenobi from a certain point of view stuff? Why is this so difficult to get a handle on?

Did you learn calculus overnight, or expect to understand a technical conversation about differential equations and Cauchy sequences without taking two years of classes or reading a couple of really thick books first? Why should this be any different?

> If a thing says immutable on the tin, and it's mutable, how is that purely functional? .... It seems to me that a data structure so amazing as being purely functional shouldn't be so easy to misunderstand as what we're seeing here. And it's clearly being misunderstood. And not only by me.

Sigh. Explaining it properly would involve a tour though the untyped lambda calculus, simply-typed lambda calculus and the Curry-Howard isomorphism, a discussion on denotational vs. operational semantics (Hoare logic, functional interpreters, type-preserving compilation, small-step vs large-step operational semantics, etc).

TL;DR: "purely functional" is a description of the program's meaning (in a technical sense), not its implementation.

ianamartin · on Dec 13, 2017

Okay, so if a program means for an ibject to be immutable but it actually is mutable then it’s still immutable if you explain it in terms of basic CS theory.

Got it. Thanks for the attitude.

vickychijwani · on Dec 10, 2017

> "Look! It can be mutated! Therefore it cannot be functional!" And the response to that is, "Don't be so technical about how you define functional. Or immutable. You know it when you see it."

I think I can offer something to the discussion here, as I'm straddling the 2 camps - having a fairly intricate understanding of git (I've held training seminars at my company about it), and basic familiarity with functional (technically "persistent") data structures thanks to a long-standing interest in Clojure.

What the functional camp is talking about, when they refer to git as a "purely functional data structure", is the core of git's implementation - commits are immutable snapshots of the repository arranged in a directed acyclic graph (or a tree, if there are no merge commits), and branches are "just" movable pointers to some commit. In this view, everything other than this core model of history as a graph is extra details layered on top.

The other camp (let's call them the "git camp") looks at git in its entirety, including branches, tags, the index, the working tree, stashes, and so on, and they can't help but come to the conclusion you mentioned - "Branches are mutable references! The index is mutable! The working tree is mutable! How can git possibly be called a functional data structure?"

While this 2nd perspective is understandable and technically correct, I think it misses the point the other camp is trying to make: that the immutable commit is the "fundamental unit of git" so to speak, that we interact with everyday, and that most of git's history-manipulating commands (git commit, git commit --amend, git rebase, git merge, git cherry-pick) can be described in terms of purely-functional operations on the commit graph. Once you understand this, many complex git scenarios become easier to understand (rebase, and filter-branch, for example), so this view does have practical value. Then branches, tags, the index, stashes, etc can indeed be understood separately.

In fact, this perspective of git as a data structure is so useful, that I rarely use "git log" as it is, preferring these additional flags to view the entire graph instead:

git log --graph --all --decorate --oneline

I hope that helps, cheers!

ianamartin · on Dec 13, 2017

Thank you that. My understanding is that the unit of work in git—the commit—is not actually immutable. Commits can be changed. Am I wrong about that?

vickychijwani · on Dec 14, 2017

> Am I wrong about that?

Sorry, but yes :)

The commit is immutable because when you "amend" a commit, git actually creates a new commit object, assigning it a new SHA1 hash. This means that the original contents from before the amend (i.e., the "old" commit) are still accessible via the old SHA1 hash - proving that an "amended" commit is actually a copy of the old one, not an in-place mutation.

Think of it like this - if I copy a text file and edit the copy, would you say that I've "mutated" the original file? No, right? That's what git does - it never actually mutates a commit :)

EDIT: to clarify further, git identifies each commit with a unique SHA1 hash which is assigned when the commit is created, and that hash depends on the contents of the files + the current timestamp + the SHA1 of its parent commit + some other things. Which means, if you try to edit the contents of the commit in any way at all (or even if you create another commit with the same exact files at a different time, or as a child of a different commit), its SHA1 hash will be different, i.e., it is a new commit by definition.