Diffsitter: A tree-sitter based AST difftool to get meaningful semantic diffs

affyboi · on July 19, 2021

Hey guys, author of the project here. I'll admit I was a bit surprised to see this on the front page, I want to say that this is very much alpha software (I very deliberately can't let it hit 1.0 yet), so definitely very rough around the edges. I've been very busy with work recently so I've definitely had a lull in my open source contributions, but I hope that I can get back on track soon (TM).

Right now I have plans to implement the Myers diff algorithm to make diffsitter more efficient (profiling shows that the current biggest bottleneck is memory allocations), and after that I wanted to start experimenting with different heuristics to make the diffs themselves more useful. Support for the unified diff format is also in my plans.

It's nice to see people discussing the project, I'm very flattered that you guys have taken the time to look at it, and please feel free to open issues with features that you want to see or any shortcomings you've found.

The backstory for this project is that I once had to review a diff in a golang codebase that realigned a bit struct because someone added a new field, and I was slightly annoyed that I was seeing all these extra lines changed even though I really only cared about the fields that were added, so I did a little weekend project to see if it was possible to do a diff directly on the AST instead, and I've been working on this in small pieces ever since.

a-dub · on July 19, 2021

cool idea!

have you thought about using the ast representation to regenerate "canonicalized" versions of the source files upon which you can then represent the diffs (and context)?

affyboi · on July 21, 2021

Sorry I'm not sure I quite follow by what you mean by canonical, do you something like applying a standard style?

a-dub · on July 22, 2021

yeah, but more aggressive. for both the left and right side, parse to an ast and then go back to text, but do things like enforce orderings on things where orderings don't matter and strip out comments, and force whitespace to be clearly defined... so you have a "just the facts" view into what changed that plays nice with existing diff tools.

(i'm not suggesting this be committed, but rather it be a view)

nemetroid · on July 18, 2021

The example is very confusing. According to the output of diffsitter, the diff is (in order):

1. "let x = 1" was removed

2. "fn add_one {" was removed

3. a closing brace was added

4. "fn addition() {" was added

5. "fn add_two() {" was added

(1) is neat, (2) is reasonable. (3) showing up before both (4) and (5) is super weird. Is there a bug, or does this just demonstrate a disconnect between how a parser parses the code and how a human parses it?

comex · on July 18, 2021

Indeed, the example seems a little underwhelming.

It does demonstrate that diffsitter can tell that `fn main() {` was not semantically changed at all, while still being able to cope with syntactically invalid code like `fn add_one {`.

But look at the following, where I've attempted to manually convert diffsitter's generated diff into a unified diff format, making it easier to understand what's going on:

    1   fn main() {
    2 -     let x = 1;
    3 - fn add_one {
    4 + }
    5 + fn addition() {
    6   }
    7 + fn add_two() {
    8   }

Look at the closing brace on line 6, which doesn't have a - or + in front of it, meaning that the diff tool has found a closing brace in the old code and one in the new code and decided that they correspond to each other. But this correspondence is not syntactically meaningful. In the old code, that closing brace belongs to `fn main`. In the new code, however, it belongs to `fn addition`, while `fn main`'s closing brace is now represented by the "inserted" closing brace on line 4!

Mixing up braces this way is a typical weakness of traditional diff tools. I'd hope that a syntax-aware diff tool could prevent such mix-ups, and instead generate a diff like:

    1   fn main() {
    2 -     let x = 1;
    3   }
    4 - fn add_one {
    5 - }
    6 + fn addition() {
    7 + }
    8 + fn add_two() {
    9 + }

Even though this diff is one line longer (while diff tools typically aim to produce the shortest possible diff), it's more semantically correct, and would be much easier to deal with if a merge conflict came into play.

Unfortunately, diffsitter does not do this, at least not in that demo. And I don't see how it would be [edit: caused by] a disconnect between how a parser parses the code and how a human parses it. The parser has the same idea of closing braces belonging to particular opening constructs as humans do.

feanaro · on July 18, 2021

> And I don't see how it would be because of a disconnect between how a parser parses the code and how a human parses it. The parser has the same idea of closing braces belonging to particular opening constructs as humans do.

I don't understand the argument you made here. Why couldn't it do this?

Compared to the first snippet, in the second snippet:

1. There is still a function called `main`, but it has no lines, compared to a single line before. The conclusion is that this line was removed.

2. There is no longer an `add_one` function.

3. There are two new functions, `addition` and `add_two`.

This is exactly what your wanted diff is showing and all of those things can be determined by a parser.

comex · on July 18, 2021

I agree. I think you misinterpreted my comment. By "I don't see how it would be because of a disconnect", I meant "I don't see how the issue could be caused by a disconnect". (Maybe you read it as "I don't see how diffsitter would do that, because there's a disconnect"?) The suggestion of a disconnect came from the parent comment, and I was attempting to refute it.

feanaro · on July 18, 2021

Oops, yep. I mentally dropped a few words there from the quoted part, even though I re-read multiple times. Thanks for bearing with me.

marcodiego · on July 18, 2021

Definitely underwhelming. Actually I couldn't see how it is better than "diff -u".

affyboi · on July 19, 2021

Probably more just a function of my poor heuristics (or lack thereof), diffsitter does equality comparisons on tree sitter nodes, which can also cause weird stuff like this where text gets grouped together.

I'm redoing the diffing algo anyways, so I'll keep this in mind for the next go.

Edit: it's because I decided to split line-by-line a while back because I thought it would make the diffs easier to read (rather than squashing all of the tokens together, which is not very aesthetically pleasing), but also gives us unintuitive results like this.

zomglings · on July 19, 2021

My team has a similar project (Locust: https://github.com/bugout-dev/locust) where the goal is to learn the semantic meanings of code changes in git commits, GitHub PRs, etc.

Since we took git diffs as a target for semantic analysis, we have a different approach to our diffs. We start with line-by-line diffs (specifically using "git diff") and then take a semantic diff by superimposing the git diff information on top of the initial and terminal ASTs.

This makes the diff calculation cheaper because we don't have to do full diff between trees.

Haven't updated the code in a few months, but my team is actively using Locust on public GitHub repos to learn the semantics of those code bases. We do plan to do some work on it soon to make it easier to make Locust easier to use (especially as a library).

Really need to sit down and take a proper look at tree-sitter. We currently support Locust diffs for Python, Javascript, and Java, but each one is custom written and implements the same basic algorithm. It looks like tree sitter might just crush this problem for us.

affyboi · on July 19, 2021

I would recommend using tree-sitter, it's probably easier not to reinvent the wheel, especially when parsing something like C++. It's also really fast

zomglings · on July 19, 2021

Thanks, definitely planning to look at it. Would really love to implement the semantic metadata extraction once in C/Rust and then have support for a bunch of different languages.

It is especially exciting that they aim to provide useful information even in the presence of syntax errors.

touisteur · on July 18, 2021

I wish we could standardise on such tools for all languages, maybe through the language server protocol? I'm afraid of building and maintaining yet another kind of parser (as awesome as tree-sitters are) and to be on the language update treadmill then...

z3t4 · on July 18, 2021

the language server protocol was a great initiative, it's however very difficult to implement into editors, and the popular language servers, such as the one VScode uses for Python is closed source and only available for VScode...

To be fair, Treesitter is also very hard to use/implement into an editor. It however seem more portable as it can compile to wasm, if you could only figure out how to use it...

I think the problems with Treesitter could be solved with documentation and examples. I don't think the problems with the language server protocol can be fixed as there is too much work to make a full featured language server, so no one will be willing to maintain them.

While the language server turns the editor into a simple io-terminal, Treesitter is like: - here's a tree - you parse it. I would like there to be a middle ground, where you could for example request a list of all variables in scope in the source code on row X. Something ontop of Treesitter, or maybe Treesitter "plugins".

moons_spawn · on July 19, 2021

Most language servers are open-source. Microsoft's pylance extension for vscode, is the proprietary version of the open-source pyright server which can be used independently.

A language client is certainly difficult but only has to be created once per editor, and it might be a good thing that the lsp spec is so broad since server authors can implement features at their own pace, while users can have a core subset of features from the start.

I haven't seen any issues with maintenance with any servers I use, though I have seen development on some servers stop when there are better are newer alternatives for a specific language.

chipotle_coyote · on July 19, 2021

You're absolutely right in theory. In practice... well, I've beta-tested two Mac native, non-VSCode editors that have integrated LSP support, and it's become pretty clear that a lot of language servers have only been tested against Visual Studio Code, and sometimes have rather, ah, let's say eccentric interpretations of the spec. (For instance, they may not set mode flags on responses that give guidance to the client; they may ignore initialization flags set by the client telling the server what kind of formats they can handle in responses; they may even crash if the client hasn't implemented a request they expect to be there even if the spec says supporting that is optional.)

I like that LSP has become a multi-editor standard, but I wish language server developers would test their servers against at least a few other editors, or with users of other editors reporting bugs.

touisteur · on July 19, 2021

What's probably missing is a standard test suite for the protocol... Can't one generate one with model-based testing or another rather automatic mean. Who's the boss of the lsp project? :-)

I'm adding something from my open tabs: https://fitzgeraldnick.com/2020/08/24/writing-a-test-case-ge...

thayne · on July 19, 2021

And tsserver doesn't even comply with the LSP spec, which means non-vscode editors need to choice between buggy support for typescript or also implementing a second protocol that is sort of but not quite LSP.

See https://github.com/microsoft/TypeScript/issues/39459

Pozzuh · on July 19, 2021

The Theia IDE people have implemented a LSP-wrapper around tsserver [1], which works all right. Of course, the reason that Typescript does not comply to the spec is that it was built before the LSP existed.

[1] https://github.com/theia-ide/typescript-language-server

brundolf · on July 19, 2021

rust-analyzer [1] goes through the LSP and it's fantastic. It provides an incredibly rich experience nearly on-par with that of TypeScript in VSCode, but for Rust. And it's both third-party and open-source.

[1] https://rust-analyzer.github.io/

Pozzuh · on July 19, 2021

I am doing research into using the Language Server Protocol for applications other than editor integration. Specifically, I am looking at calculating software metrics using the LSP. My use case is definitely doable. However, the LSP is not a great general abstraction layer of source code, something like Diffsitter would probably not be possible.

def-lkb · on July 18, 2021

The output is a bit underwhelming. You might be interested in https://codinuum.github.io/gallery-cca/ It is not based on tree sitter , the parsers they use are quite impressive. Yet a difficult part is computing large tree diff (it is O(n^3) if I remember well), the authors devise heuristics that are efficient in practice and should be adaptable to tree sitter.

dataflow · on July 18, 2021

How in the world does it handle C++? With macros and all that being in C++, surely it's relying on some heuristics? (Which is fine; it's still an improvement. Just wondering if it's really working semantically.)

Also, one nice thing about diff is that it also gives you a patch that will turn the input into the output. Can this do that?

0b01 · on July 19, 2021

These tools simply don’t work well on anything more complicated than S-expressions. Every time I look into this space I end up disappointed.

tyingq · on July 18, 2021

Someone is working on tree-sitter for Perl too: https://github.com/ganezdragon/tree-sitter-perl

Which is supposed to be difficult, if not impossible: https://www.perlmonks.org/?node_id=663393

I suppose, though, if it's for diffs and syntax highlighting, flaws matter less.

dataflow · on July 18, 2021

Yeah I honestly wouldn't mind it even if it's a text-based heuristic; anything that's an improvement over line-by-line diff is a big win. I'm just wondering whether/how it's actually doing what it's actually advertised as doing, since it seems so much more difficult.

affyboi · on July 19, 2021

Short answer is that it doesn't: it lets tree-sitter do the heavy lifting to parse everything

dataflow · on July 19, 2021

Thanks! But then I guess my question is how does tree-sitter do it? I assume it has to choke on at least macros, right? Or does it call the actual compiler routines (like libclang)?

conradev · on July 18, 2021

This caused me to search and see if there are tools to leverage ASTs when resolving merge conflicts, and it looks like there are!

https://www.semanticmerge.com/

Would be cool if diffsitter added support for facilitating three way merges

bhl · on July 18, 2021

Anyone know how to compute the diff between two ASTs if the edits _are_ known, either in general or with tree-sitter.

The method used by diffsitter is to compute the path with minimal editing distance: https://github.com/afnanenayet/diffsitter/blob/main/src/ast.....

In the case where the edits are known (because you're re-creating the AST per keystroke), you can save parsing time by passing in the old tree and a set of edits: https://tree-sitter.github.io/tree-sitter/using-parsers#edit....

Intuitively, I feel like there should be a way to reduce diffing time as well to avoid O(mn) run and space time that comes from the dynamic programming solution to compute the minimal edit.

z3t4 · on July 19, 2021

I've made a incremental parser, and solved it by only re-parsing the changed code block. And the diff would then be the entire branch, but you could compare the old branch with the new branch and get a more fine detailed diff. It's hard to know beforehand what will get better performance, I found that for small files < 100 LOC it was faster to re-parse the entire file.

0b01 · on July 19, 2021

You are correct. It is indeed possible to use dynamic programming. For more info check out [0]. OP’s project probably takes too long on nontrivial merges in its current state.

[0] https://thume.ca/2017/06/17/tree-diffing/

bhl · on July 19, 2021

That link does go over an optimized tree-diffing algorithm for ASTs, but the question I'm wondering is how to extend incremental parsing into incremental diffing, as opposed to full parse into full diff.

the_benno · on July 19, 2021

The short answer is that it's unsolved in general and just done ad-hoc in practice. I'm working on incremental & demand-driven analysis techniques for my PhD research and currently building essentially what you describe. The state-of-the-art is probably this technique from this year's PLDI [1] -- last I heard they were using a batch parser but planning on integrating with tree-sitter at some point. [1] https://dl.acm.org/doi/10.1145/3453483.3454052

affyboi · on July 19, 2021

Yes, it's not currently very performant for larger files (have a branch in the works to fix this). Checked out the link, it's an interesting read

armchairhacker · on July 18, 2021

Looks similar to https://github.com/github/semantic

bhl · on July 18, 2021

Github is also behind the library that diffsitter builds upon: https://github.com/tree-sitter/tree-sitter.

bostonvaulter2 · on July 19, 2021

How can you tell that GitHub is behind tree-sitter? I don't see anything of the sort from that repository link. Is Max employed by GitHub?

bhl · on July 19, 2021

`tree-sitter` has been post to the website a bunch of times so I remembered from those submissions: https://hn.algolia.com/?query=tree-sitter

Also these resources:

https://github.blog/2018-10-31-atoms-new-parsing-system/

https://www.youtube.com/watch?v=a1rC79DHpmY

the_benno · on July 19, 2021

They actually dropped support for diffing and incremental parsing a few months back, when they changed up some internal structures to what they call "precise AST datatypes" (which made diffing quite tricky in the type system).

My impression (from the outside looking in) is that there weren't many people making use of those capabilities, so it wasn't a high priority feature. A bit of a shame, though -- it's a really cool tool and I enjoyed building some things on top of it. I switched to a different front-end, though, when I saw that they had removed those features and all mention of incrementality/differencing from the documentations.

I'd be happy to be wrong about this, if anyone has more up-to-date information about semantic!

dataflow · on July 18, 2021

That one (like most such tools) doesn't handle C++.

pabs3 · on July 19, 2021

A related tool for token based authorship information:

https://github.com/cregit/cregit https://lwn.net/Articles/698425/

iwwr · on July 18, 2021

Is this tool obsoleted by autoformatters?

craftkiller · on July 18, 2021

It could be useful if it was modified to become a great merge tool. I've been using kdiff3 as my git mergetool and its always splits my code up at the worst spots because it does not understand python.

codetrotter · on July 18, 2021

Not unless you you have a git hook to ensure that every commit is run through the auto formatter, and you additionally ensure that everyone that commits to the repositories you work on have this hook installed too. And even then, many people still interact with repositories that are outside of their power to enforce such rules upon.

sahkopoyta · on July 18, 2021

>git hook to ensure that every commit is run through the auto formatter, and you additionally ensure that everyone that commits to the repositories you work on have this hook installed too

Eg. with node project this is a trivial task with tools like Husky in place

kevincox · on July 18, 2021

Not necessarily. You could also run the "unformatted" files through the formatter before doing the diff.

polynomial · on July 18, 2021

sorry for the dumb question but why can't you use GH Actions for this? (Instead of making sure all committers have the hook installed.)

fierro · on July 18, 2021

no, because the bigger win isn't ingoring whitespacing. It's ignore refactoring changes like renaming functions

affyboi · on July 19, 2021

Not if they do any sort of alignment (like structs in golang)

erik_seaberg · on July 19, 2021

This tool works with the author’s intended layout, a big advantage over autoformatters which destroy and replace it poorly (it’s probably an AI-complete problem).