Hey guys, author of the project here. I'll admit I was a bit surprised to see this on the front page, I want to say that this is very much alpha software (I very deliberately can't let it hit 1.0 yet), so definitely very rough around the edges. I've been very busy with work recently so I've definitely had a lull in my open source contributions, but I hope that I can get back on track soon (TM).
Right now I have plans to implement the Myers diff algorithm to make diffsitter more efficient (profiling shows that the current biggest bottleneck is memory allocations), and after that I wanted to start experimenting with different heuristics to make the diffs themselves more useful. Support for the unified diff format is also in my plans.
It's nice to see people discussing the project, I'm very flattered that you guys have taken the time to look at it, and please feel free to open issues with features that you want to see or any shortcomings you've found.
The backstory for this project is that I once had to review a diff in a golang codebase that realigned a bit struct because someone added a new field, and I was slightly annoyed that I was seeing all these extra lines changed even though I really only cared about the fields that were added, so I did a little weekend project to see if it was possible to do a diff directly on the AST instead, and I've been working on this in small pieces ever since.
have you thought about using the ast representation to regenerate "canonicalized" versions of the source files upon which you can then represent the diffs (and context)?
yeah, but more aggressive. for both the left and right side, parse to an ast and then go back to text, but do things like enforce orderings on things where orderings don't matter and strip out comments, and force whitespace to be clearly defined... so you have a "just the facts" view into what changed that plays nice with existing diff tools.
(i'm not suggesting this be committed, but rather it be a view)
The example is very confusing. According to the output of diffsitter, the diff is (in order):
1. "let x = 1" was removed
2. "fn add_one {" was removed
3. a closing brace was added
4. "fn addition() {" was added
5. "fn add_two() {" was added
(1) is neat, (2) is reasonable. (3) showing up before both (4) and (5) is super weird. Is there a bug, or does this just demonstrate a disconnect between how a parser parses the code and how a human parses it?
It does demonstrate that diffsitter can tell that `fn main() {` was not semantically changed at all, while still being able to cope with syntactically invalid code like `fn add_one {`.
But look at the following, where I've attempted to manually convert diffsitter's generated diff into a unified diff format, making it easier to understand what's going on:
Look at the closing brace on line 6, which doesn't have a - or + in front of it, meaning that the diff tool has found a closing brace in the old code and one in the new code and decided that they correspond to each other. But this correspondence is not syntactically meaningful. In the old code, that closing brace belongs to `fn main`. In the new code, however, it belongs to `fn addition`, while `fn main`'s closing brace is now represented by the "inserted" closing brace on line 4!
Mixing up braces this way is a typical weakness of traditional diff tools. I'd hope that a syntax-aware diff tool could prevent such mix-ups, and instead generate a diff like:
Even though this diff is one line longer (while diff tools typically aim to produce the shortest possible diff), it's more semantically correct, and would be much easier to deal with if a merge conflict came into play.
Unfortunately, diffsitter does not do this, at least not in that demo. And I don't see how it would be [edit: caused by] a disconnect between how a parser parses the code and how a human parses it. The parser has the same idea of closing braces belonging to particular opening constructs as humans do.
> And I don't see how it would be because of a disconnect between how a parser parses the code and how a human parses it. The parser has the same idea of closing braces belonging to particular opening constructs as humans do.
I don't understand the argument you made here. Why couldn't it do this?
Compared to the first snippet, in the second snippet:
1. There is still a function called `main`, but it has no lines, compared to a single line before. The conclusion is that this line was removed.
2. There is no longer an `add_one` function.
3. There are two new functions, `addition` and `add_two`.
This is exactly what your wanted diff is showing and all of those things can be determined by a parser.
I agree. I think you misinterpreted my comment. By "I don't see how it would be because of a disconnect", I meant "I don't see how the issue could be caused by a disconnect". (Maybe you read it as "I don't see how diffsitter would do that, because there's a disconnect"?) The suggestion of a disconnect came from the parent comment, and I was attempting to refute it.
Probably more just a function of my poor heuristics (or lack thereof), diffsitter does equality comparisons on tree sitter nodes, which can also cause weird stuff like this where text gets grouped together.
I'm redoing the diffing algo anyways, so I'll keep this in mind for the next go.
Edit: it's because I decided to split line-by-line a while back because I thought it would make the diffs easier to read (rather than squashing all of the tokens together, which is not very aesthetically pleasing), but also gives us unintuitive results like this.
My team has a similar project (Locust: https://github.com/bugout-dev/locust) where the goal is to learn the semantic meanings of code changes in git commits, GitHub PRs, etc.
Since we took git diffs as a target for semantic analysis, we have a different approach to our diffs. We start with line-by-line diffs (specifically using "git diff") and then take a semantic diff by superimposing the git diff information on top of the initial and terminal ASTs.
This makes the diff calculation cheaper because we don't have to do full diff between trees.
Haven't updated the code in a few months, but my team is actively using Locust on public GitHub repos to learn the semantics of those code bases. We do plan to do some work on it soon to make it easier to make Locust easier to use (especially as a library).
Really need to sit down and take a proper look at tree-sitter. We currently support Locust diffs for Python, Javascript, and Java, but each one is custom written and implements the same basic algorithm. It looks like tree sitter might just crush this problem for us.
Thanks, definitely planning to look at it. Would really love to implement the semantic metadata extraction once in C/Rust and then have support for a bunch of different languages.
It is especially exciting that they aim to provide useful information even in the presence of syntax errors.
I wish we could standardise on such tools for all languages, maybe through the language server protocol? I'm afraid of building and maintaining yet another kind of parser (as awesome as tree-sitters are) and to be on the language update treadmill then...
the language server protocol was a great initiative, it's however very difficult to implement into editors, and the popular language servers, such as the one VScode uses for Python is closed source and only available for VScode...
To be fair, Treesitter is also very hard to use/implement into an editor. It however seem more portable as it can compile to wasm, if you could only figure out how to use it...
I think the problems with Treesitter could be solved with documentation and examples. I don't think the problems with the language server protocol can be fixed as there is too much work to make a full featured language server, so no one will be willing to maintain them.
While the language server turns the editor into a simple io-terminal, Treesitter is like: - here's a tree - you parse it. I would like there to be a middle ground, where you could for example request a list of all variables in scope in the source code on row X. Something ontop of Treesitter, or maybe Treesitter "plugins".
Most language servers are open-source. Microsoft's pylance extension for vscode, is the proprietary version of the open-source pyright server which can be used independently.
A language client is certainly difficult but only has to be created once per editor, and it might be a good thing that the lsp spec is so broad since server authors can implement features at their own pace, while users can have a core subset of features from the start.
I haven't seen any issues with maintenance with any servers I use, though I have seen development on some servers stop when there are better are newer alternatives for a specific language.
You're absolutely right in theory. In practice... well, I've beta-tested two Mac native, non-VSCode editors that have integrated LSP support, and it's become pretty clear that a lot of language servers have only been tested against Visual Studio Code, and sometimes have rather, ah, let's say eccentric interpretations of the spec. (For instance, they may not set mode flags on responses that give guidance to the client; they may ignore initialization flags set by the client telling the server what kind of formats they can handle in responses; they may even crash if the client hasn't implemented a request they expect to be there even if the spec says supporting that is optional.)
I like that LSP has become a multi-editor standard, but I wish language server developers would test their servers against at least a few other editors, or with users of other editors reporting bugs.
What's probably missing is a standard test suite for the protocol... Can't one generate one with model-based testing or another rather automatic mean. Who's the boss of the lsp project? :-)
And tsserver doesn't even comply with the LSP spec, which means non-vscode editors need to choice between buggy support for typescript or also implementing a second protocol that is sort of but not quite LSP.
The Theia IDE people have implemented a LSP-wrapper around tsserver [1], which works all right. Of course, the reason that Typescript does not comply to the spec is that it was built before the LSP existed.
rust-analyzer [1] goes through the LSP and it's fantastic. It provides an incredibly rich experience nearly on-par with that of TypeScript in VSCode, but for Rust. And it's both third-party and open-source.
I am doing research into using the Language Server Protocol for applications other than editor integration. Specifically, I am looking at calculating software metrics using the LSP. My use case is definitely doable. However, the LSP is not a great general abstraction layer of source code, something like Diffsitter would probably not be possible.
The output is a bit underwhelming. You might be interested in https://codinuum.github.io/gallery-cca/
It is not based on tree sitter , the parsers they use are quite impressive. Yet a difficult part is computing large tree diff (it is O(n^3) if I remember well), the authors devise heuristics that are efficient in practice and should be adaptable to tree sitter.
How in the world does it handle C++? With macros and all that being in C++, surely it's relying on some heuristics? (Which is fine; it's still an improvement. Just wondering if it's really working semantically.)
Also, one nice thing about diff is that it also gives you a patch that will turn the input into the output. Can this do that?
Yeah I honestly wouldn't mind it even if it's a text-based heuristic; anything that's an improvement over line-by-line diff is a big win. I'm just wondering whether/how it's actually doing what it's actually advertised as doing, since it seems so much more difficult.
Thanks! But then I guess my question is how does tree-sitter do it? I assume it has to choke on at least macros, right? Or does it call the actual compiler routines (like libclang)?
Intuitively, I feel like there should be a way to reduce diffing time as well to avoid O(mn) run and space time that comes from the dynamic programming solution to compute the minimal edit.
I've made a incremental parser, and solved it by only re-parsing the changed code block. And the diff would then be the entire branch, but you could compare the old branch with the new branch and get a more fine detailed diff. It's hard to know beforehand what will get better performance, I found that for small files < 100 LOC it was faster to re-parse the entire file.
You are correct. It is indeed possible to use dynamic programming. For more info check out [0]. OP’s project probably takes too long on nontrivial merges in its current state.
That link does go over an optimized tree-diffing algorithm for ASTs, but the question I'm wondering is how to extend incremental parsing into incremental diffing, as opposed to full parse into full diff.
The short answer is that it's unsolved in general and just done ad-hoc in practice. I'm working on incremental & demand-driven analysis techniques for my PhD research and currently building essentially what you describe. The state-of-the-art is probably this technique from this year's PLDI [1] -- last I heard they were using a batch parser but planning on integrating with tree-sitter at some point.
[1] https://dl.acm.org/doi/10.1145/3453483.3454052
They actually dropped support for diffing and incremental parsing a few months back, when they changed up some internal structures to what they call "precise AST datatypes" (which made diffing quite tricky in the type system).
My impression (from the outside looking in) is that there weren't many people making use of those capabilities, so it wasn't a high priority feature. A bit of a shame, though -- it's a really cool tool and I enjoyed building some things on top of it. I switched to a different front-end, though, when I saw that they had removed those features and all mention of incrementality/differencing from the documentations.
I'd be happy to be wrong about this, if anyone has more up-to-date information about semantic!
It could be useful if it was modified to become a great merge tool. I've been using kdiff3 as my git mergetool and its always splits my code up at the worst spots because it does not understand python.
Not unless you you have a git hook to ensure that every commit is run through the auto formatter, and you additionally ensure that everyone that commits to the repositories you work on have this hook installed too. And even then, many people still interact with repositories that are outside of their power to enforce such rules upon.
>git hook to ensure that every commit is run through the auto formatter, and you additionally ensure that everyone that commits to the repositories you work on have this hook installed too
Eg. with node project this is a trivial task with tools like Husky in place
This tool works with the author’s intended layout, a big advantage over autoformatters which destroy and replace it poorly (it’s probably an AI-complete problem).
Right now I have plans to implement the Myers diff algorithm to make diffsitter more efficient (profiling shows that the current biggest bottleneck is memory allocations), and after that I wanted to start experimenting with different heuristics to make the diffs themselves more useful. Support for the unified diff format is also in my plans.
It's nice to see people discussing the project, I'm very flattered that you guys have taken the time to look at it, and please feel free to open issues with features that you want to see or any shortcomings you've found.
The backstory for this project is that I once had to review a diff in a golang codebase that realigned a bit struct because someone added a new field, and I was slightly annoyed that I was seeing all these extra lines changed even though I really only cared about the fields that were added, so I did a little weekend project to see if it was possible to do a diff directly on the AST instead, and I've been working on this in small pieces ever since.