Shameless plug: I've written difftastic[1], a tool that builds ASTs and then does a structural diff of them. You can use it with git too.
It's an incredibly hard problem though, both from a computational complexity point of view, and trying to build a comprehensible UI once you've done the structural AST diff.
I think part of the problem is that seemingly everyone is trying to make a version control tool that is agnostic to all languages, both computationally and UI-wise. But C++ users expect to see different things than JavaScript users, and so forth.
I’m surprised at the lack of hyper-specific language version control tools. I thought about making a side project for one in Julia a while back but not quite sure how it would look. Some random thoughts:
- info on type, name, constant changes
- let me checkout older revisions of individual functions / objects / whatever
- on unit test result changes for functions that have unit tests
- when changes are simply a refactor and are functionally the same
Most repositories I work on don't have only one language. They have at the very least two, like the main language and maybe markdown for README files, then configuration like .ini or .toml, json stuff, yml, xml, etc. And then you might have bash scripts, Dockerfiles, other build tool languages, etc. And those are only text files. You probably will also have images, maybe zipped stuff, office documents and more, all not the "core" repository content, but stored nearby and versioned alongside.
Building a hyper-focused tool won't be very useful; expect to need at least rudimentary support for other file types.
This doesn’t really detract from my point - the “best” tool would use knowledge of python for python files, json for json files, and so forth. I think you’re just saying you’d want multiple of these rolled into a single tool as opposed to standalone, which is fair. I think any tool would have to be compatible with git, or layer on top of it, so that git is available as a fallback.
Every change is different in the same way every program is unique, the change of a couple of characters will alter the meaning. I think you have to try to write a diff UI to understand why it is hard.
Difftastic, Meld, diff -u, Word and other tools are amazing because they are useful in many scenarios. Getting the UI right has been a long process, and being able to grok the changes is still hard even with the best tooling. It is also a question of tool adoption: it takes a long time to understand how a tool works.
Ah, yes, I knew I was forgetting one project. difftastic is very cool, thanks for writing it!
How well do existing VCSs integrate with it? Did you feel restricted at any point by writing a diffing tool, instead of basing a new VCS around this concept? Do you think a deeper integration would allow supporting other functionality beyond diffing, like automatic merging, conflict resolution, etc.?
I agree that it's a very difficult problem. But as an industry, we have more than enough smart people and resources to work on it, which if solved would greatly improve our collective QoL. I can't imagine the amount of time and effort we've wasted fighting with version control tools over the years, and a tool that solved these issues in a smarter way would make our lives much easier.
Git supports external diffing tools really well with GIT_EXTERNAL_DIFF, which you can use with difftastic[1]. Other VCSs are less flexible. For example, I haven't found a nice way of getting a pager when using difftastic with Mercurial.
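A minimal sketch of the GIT_EXTERNAL_DIFF approach, written in Python so it can be scripted. It assumes the difftastic binary is installed on your PATH as `difft`; the helper names `difftastic_env` and `git_diff_with_difftastic` are made up for illustration and are not part of git or difftastic.

```python
import os
import subprocess


def difftastic_env(base_env=None):
    """Return a copy of the environment with GIT_EXTERNAL_DIFF pointing at difft.

    Assumption: the difftastic executable is named `difft` and is on PATH.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["GIT_EXTERNAL_DIFF"] = "difft"
    return env


def git_diff_with_difftastic(*git_diff_args):
    """Run `git diff` in the current directory, delegating file diffs to difft."""
    result = subprocess.run(
        ["git", "diff", *git_diff_args],
        env=difftastic_env(),
        check=False,
    )
    return result.returncode
```

Calling `git_diff_with_difftastic("HEAD~1")` inside a repository should behave like running `git diff HEAD~1` with the environment variable set for that one command, so nothing in your git config changes.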
> Did you feel restricted at any point by writing a diffing tool, instead of basing a new VCS around this concept?
Oh, that's an interesting question! Difftastic has been a really big project[2] despite its limited scope and I'm less interested in VCS implementation.
I think text works well as the backing store for a VCS. There are a few systems that have structured backends (e.g. Monticello for Smalltalk), but they're more constrained. You can only store structured content (e.g. Monticello requires Smalltalk code) and it must be well-formed (your VCS must understand any future syntax you use).
Unison[3] is a really interesting project in this space, it stores code by hash in a sqlite backend. This makes some code changes trivial, such as renames.
From the perspective of a text diff, an AST diff is lossy. If you add an extra blank line between two unchanged functions, difftastic ignores it. That's great for understanding changes, but not for storage.
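To make the lossiness concrete, here is a small sketch that uses Python's built-in `ast` module as a stand-in for a structural parser (difftastic itself uses tree-sitter, not this). The helper `same_structure` is a hypothetical name. Two sources that differ only by blank lines produce the same tree, so a tree-based comparison cannot recover the original text.

```python
import ast


def same_structure(source_a, source_b):
    """Compare two Python sources by syntax tree, ignoring layout.

    ast.dump omits line and column attributes by default, so blank
    lines between definitions do not affect the result.
    """
    return ast.dump(ast.parse(source_a)) == ast.dump(ast.parse(source_b))


before = "def f():\n    return 1\ndef g():\n    return 2\n"
# Same code with two extra blank lines between the functions.
after_blank = "def f():\n    return 1\n\n\ndef g():\n    return 2\n"
# A real change: g now returns a different constant.
after_edit = "def f():\n    return 1\ndef g():\n    return 3\n"

print(same_structure(before, after_blank))
print(same_structure(before, after_edit))
```

The first comparison treats the files as identical even though a text diff would report added lines, which is why a structural diff is fine for review but not enough on its own as a storage format.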
I already use delta[1] as a diff viewer, but I suppose GIT_EXTERNAL_DIFF is a deeper integration than just a pager. I've been aware of your project for some time now, but haven't played around with it since I wasn't sure if it would help with automatic conflict resolution, and other issues Git often struggles with. But I'll give it a try soon, thanks again.
I wasn't familiar with Unison. It looks interesting. We definitely need more novel approaches to programming, especially since our field will radically change in a few years as AI becomes more capable.
For languages that have strong IDE refactoring support and userbases that use it, a (future) solution would be for the IDE to autocommit along the way with metadata to explain what happened: "removed unused function based on suggestion", "extracted duplicate", "renamed public method taxed to isTaxed and updated usages across files x, y and z, developer comment: every other one of these methods follows the pattern isSomething".
The last example also adds a new feature: an option for a developer to add a comment on an automated refactor.
Ordinary commits could exist on top of this as milestones.
I wouldn't be totally surprised if sooner or later Jetbrains does this. I feel they are creating their own, often better, versions of everything, and version control could be an obvious next step.
As someone who often prefers other solutions to theirs, I'd prefer if someone else does it first so I end up with something I can use across NetBeans, VS Code, eclipse etc and not something like Kotlin which forces me to use IntelliJ. (Don't get me wrong, IntelliJ is great, I just have NetBeans as my personal favorite.)
Another enthusiastic Bangle.js user here: I had the original and used it, programmed it until the strap (integrated into the body) broke.
Apparently you can actually connect it to phone notifications using gadgetbridge[0] but I didn't have much success when I tried it. The BLE was a little flaky at the best of times (pairing to a PC for programming failed more often than I'd like).
Banglejs2 user here, Gadgetbridge works perfectly fine for my basic usage.
idk if Bangle1 strap is different but (don't remember exact measurement) you can put any standard watch strap with a normal strap pin on it. I replaced the broken stock strap with a nylon one off the net and it's great.
tree-sitter has first class support for parsing errors as ERROR nodes in the output tree. I treat these as just another atom in the s-expression.
In practice I haven't noticed any issues yet. I suspect that difftastic doesn't see many syntax errors because users have fixed most of them by the time they run the diff. When you look at diffs of committed code you hopefully have no parse errors at all.
The tree-sitter parsers could still reject valid code I suppose. I worry slightly more about the parsers getting precedence/associativity wrong, but it would be hard to construct an example that produces identical parse trees due to incorrect precedence.
Every graph vertex represents a pair of pointers (or positions) to AST nodes. So in the example, the start program is `A` and the end program is `X A`. The positions point to AST nodes in these programs.
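A minimal sketch of that idea, assuming flat lists of atoms instead of nested trees and a uniform cost of 1 per added or removed atom (both are simplifications of mine, not difftastic's actual graph or cost model). Each vertex is a pair `(i, j)` of positions into the two programs, and Dijkstra's algorithm finds the cheapest route from the start of both to the end of both. The function name `shortest_edit_path` is hypothetical.

```python
import heapq


def shortest_edit_path(lhs, rhs):
    """Dijkstra over vertices (i, j): i is a position in lhs, j in rhs."""
    start, goal = (0, 0), (len(lhs), len(rhs))
    # Heap entries: (cost so far, vertex, edits taken to reach it).
    heap = [(0, start, [])]
    seen = set()
    while heap:
        cost, (i, j), path = heapq.heappop(heap)
        if (i, j) == goal:
            return path
        if (i, j) in seen:
            continue
        seen.add((i, j))
        # Free edge: both positions point at equal atoms, advance both.
        if i < len(lhs) and j < len(rhs) and lhs[i] == rhs[j]:
            step = path + [("unchanged", lhs[i])]
            heapq.heappush(heap, (cost, (i + 1, j + 1), step))
        # Cost-1 edge: the atom on the left has no counterpart.
        if i < len(lhs):
            step = path + [("removed", lhs[i])]
            heapq.heappush(heap, (cost + 1, (i + 1, j), step))
        # Cost-1 edge: the atom on the right has no counterpart.
        if j < len(rhs):
            step = path + [("added", rhs[j])]
            heapq.heappush(heap, (cost + 1, (i, j + 1), step))
    return []


# The example from the text: start program `A`, end program `X A`.
print(shortest_edit_path(["A"], ["X", "A"]))
# → [('added', 'X'), ('unchanged', 'A')]
```

The interesting design work in a real tool is in the edge costs and in handling nesting, which is where "several possible minimal diffs" can differ in how readable they are.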
I try to use 'vertex' consistently for graph vertices, to avoid confusion with AST/s-expression nodes. If you have any suggestions for better terminology I'd be very interested too :)
Most tree diffing papers that I've seen focus on either (1) providing a minimal diff and accepting the performance cost or (2) providing a relatively minimal diff and focusing on the performance.
I've generally found that you need a minimal diff to get a good result, so papers in (2) are less applicable. I've also found several cases where there are several possible minimal diffs, but there's a clear 'correct' answer from the user's perspective.
Difftastic doesn't handle moves: the edit set is add, remove, or replace a comment with a similar comment. If you reorder functions, it will take the largest unchanged subset. Moves are hard to model in a diffing algorithm, but they're also very hard to display coherently in the UI.
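As an illustration of "largest unchanged subset", here is a sketch that uses a textbook longest-common-subsequence computation over function names. This is an assumption about what the phrase amounts to in the simplest case, not difftastic's implementation, and the function names in the example are invented.

```python
def longest_common_subsequence(old, new):
    """Return the longest run of items that keep their relative order."""
    rows, cols = len(old), len(new)
    # table[i][j] = LCS length of old[i:] and new[j:].
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if old[i] == new[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the top-left corner to recover one such subsequence.
    kept, i, j = [], 0, 0
    while i < rows and j < cols:
        if old[i] == new[j]:
            kept.append(old[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return kept


# `main` was moved from the bottom of the file to the top.
old_order = ["parse", "render", "main"]
new_order = ["main", "parse", "render"]
print(longest_common_subsequence(old_order, new_order))
# → ['parse', 'render']
```

With no move operation available, `main` falls outside the unchanged subset, so it would be reported as removed from one place and added in another.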
I know a few code forge websites (e.g. Phabricator) show moves in a fairly comprehensible way, although they're all based on line-based diffs.
Thanks for the feedback! I'll look to clarify the wording here. The failed example was a real output that difftastic gave in early versions.
I agree that the classic red/green colour scheme of diffs isn't great for colourblind users. I've asked a few colourblind developers and they were happy with terminal ANSI colours (which is what difftastic uses), because they can configure each colour individually.
You can use difftastic as your default git diff tool, but you can also use it as an opt-in diffing tool. I recommend using it as an opt-in, but defining a git alias so you can do 'git difft'.
[1]: https://github.com/wilfred/difftastic