
I'm not familiar with Pijul, and haven't finished watching this presentation, but IME the problem with modern version control tools is that they still rely on comparing lines of plain text, something we've been doing for decades. Merge conflicts are an issue because our tools are agnostic about the actual content they're tracking.

Instead, the tools should be smarter and work on the level of functions, classes, packages, sentences, paragraphs, or whatever primitive makes sense for the project and file that is being changed. In the case of code bases, they need to be aware of the language and the AST of the program. For binary files, they need to be aware of the file format and its binary structure. This would allow them to show actually meaningful diffs, and minimize the chances of conflicts, and of producing a corrupt file after an automatic merge.
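A rough sketch of what "working on the level of functions" could mean, using Python's stdlib ast module (the example sources and function names are invented for illustration): compare the set of top-level functions rather than lines, and report additions, removals, and changes per function.

```python
# Sketch: a function-level diff. Instead of comparing lines, compare the
# top-level functions of two versions of a Python file.
import ast

def top_level_functions(source: str) -> dict:
    """Map each top-level function name to a normalized dump of its AST."""
    tree = ast.parse(source)
    return {node.name: ast.dump(node)
            for node in tree.body
            if isinstance(node, ast.FunctionDef)}

def semantic_diff(old: str, new: str) -> dict:
    before, after = top_level_functions(old), top_level_functions(new)
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "changed": sorted(name for name in before.keys() & after.keys()
                          if before[name] != after[name]),
    }

old_src = "def f():\n    return 1\n\ndef g():\n    return 2\n"
new_src = "def f():\n    return 1\n\ndef h():\n    return 3\n"
print(semantic_diff(old_src, new_src))
# {'added': ['h'], 'removed': ['g'], 'changed': []}
```

Because `ast.dump` omits position info by default, moving a function around in the file without touching its body registers as no change at all, which is exactly the kind of noise a line diff can't ignore.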

There has been some research in this area, and there are a few semantic diffing tools[1,2,3], but I'm not aware of this being widely used in any VCS.

Nowadays, with all the machine learning advances, the ideal VCS should also use ML to understand the change at a deeper level, and maybe even suggest improvements. If AI can write code for me, it could surely understand what I'm trying to do, and help me so that version control is entirely hands-free, instead of having to fight with it, and be constantly aware of it, as I have to do now.

Or, since it's more than likely that humans won't be writing code or text in the near future, we'll skip the next revolution in VCS tools, and AI will be able to version its own software. /sigh

I just finished watching the presentation, and Pijul seems like an iterative improvement over Git. Nothing jumped out at me like a killer feature that would make me want to give it a try. It might be because the author focuses too much on technical details and fixing Git's shortcomings, instead of taking a step back and rethinking what a modern VCS tool should look like today.

[1]: https://semanticdiff.com/

[2]: https://github.com/trailofbits/graphtage

[3]: https://github.com/GumTreeDiff/gumtree




Shameless plug: I've written difftastic[1], a tool that builds ASTs and then does a structural diff of them. You can use it with git too.

It's an incredibly hard problem though, both from a computational complexity point of view and in terms of building a comprehensible UI once you've done the structural AST diff.

[1]: https://github.com/wilfred/difftastic


I think part of the problem is that everyone seems to be trying to make a version control tool that is agnostic to all languages, both computationally and UI-wise. But C++ users expect to see different things than JavaScript users, and so forth.

I’m surprised at the lack of hyper-specific language version control tools. I thought about making a side project for one in Julia a while back but not quite sure how it would look. Some random thoughts:

- info on type, name, constant changes

- let me checkout older revisions of individual functions / objects / whatever

- info on unit test result changes for functions that have unit tests

- info on when changes are simply a refactor and are functionally the same
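The idea of checking out older revisions of individual functions could be prototyped on top of any line-based VCS by extracting one function's source from each revision of a file. A sketch in Python, with the revision texts inlined as stand-ins for something like `git show <rev>:<file>`:

```python
# Sketch: view the history of a single function, ignoring the rest of the file.
import ast

def function_source(source: str, name: str):
    """Return the source text of the named top-level function, if present."""
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, ast.FunctionDef) and node.name == name:
            return ast.get_source_segment(source, node)
    return None

# Two hypothetical revisions of the same file.
revisions = [
    "def area(r):\n    return 3.14 * r * r\n",
    "import math\n\ndef area(r):\n    return math.pi * r ** 2\n",
]

history = [function_source(rev, "area") for rev in revisions]
for i, snapshot in enumerate(history):
    print(f"--- area, revision {i} ---")
    print(snapshot)
```

The imports and surrounding code change between revisions, but the extracted history shows only what happened to `area` itself.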


Most repositories I work on don't have only one language. They have at the very least two, like the main language and maybe markdown for README files, then configuration like .ini or .toml, json stuff, yml, xml, etcpp. And then you might have bash scripts, Dockerfiles, other build tool languages, etcpp. And those are only text files. You probably will also have images, maybe zipped stuff, office documents and more, all not the "core" repository content, but stored nearby and versioned alongside.

Building a hyper-focussed tool won't be very useful; expect to at least rudimentarily support other file types.


This doesn’t really detract from my point: the “best” tool would use knowledge of Python for Python files, JSON for JSON files, and so forth. I think you’re just saying you’d want multiple of these rolled into a single tool as opposed to standalone, which is fair. I think any tool would have to be compatible with Git / layer on top of it, so it’s available as a fallback.


Off topic, but “etcpp” is a new one for me. Wiktionary suggests it is German, which amuses me because I write “etc. usw.” to mean the same thing.


Whoops :D I thought it was more international than this, since it's pretty much as Latin as etc. Interesting!


I read it as a little joke referring to C++, for what it's worth.


What's the difference between etcpp and etc?


The pp implies there are many more things, as opposed to just some more things. https://en.m.wiktionary.org/wiki/etc._pp.


Sibling is right, although for me it's mostly a habit; I rarely use etc.


Can you imagine having to learn a bunch of different language specific version control tools? Sounds like a hassle to me.


No, because I only code in one language.


Every change is different in the same way every program is unique; the change of a couple of characters will alter the meaning. I think you have to try to write a diff UI to understand why it is hard.

Difftastic, Meld, diff -u, Word and other tools are amazing because they are useful in many scenarios. Getting the UI right has been a long process, and being able to grok the changes is still hard even with the best tooling. It is also a question of tool adoption; it takes a long time to understand how a tool works.


Ah, yes, I knew I was forgetting one project. difftastic is very cool, thanks for writing it!

How well do existing VCSs integrate with it? Did you feel restricted at any point by writing a diffing tool, instead of basing a new VCS around this concept? Do you think a deeper integration would allow supporting other functionality beyond diffing, like automatic merging, conflict resolution, etc.?

I agree that it's a very difficult problem. But as an industry, we have more than enough smart people and resources to work on it, which if solved would greatly improve our collective QoL. I can't imagine the amount of time and effort we've wasted fighting with version control tools over the years, and a tool that solved these issues in a smarter way would make our lives much easier.


> How well do existing VCSs integrate with it?

Git supports external diffing tools really well with GIT_EXTERNAL_DIFF, which you can use with difftastic[1]. Other VCSs are less flexible. For example, I haven't found a nice way of getting a pager when using difftastic with mercurial.

> Did you feel restricted at any point by writing a diffing tool, instead of basing a new VCS around this concept?

Oh, that's an interesting question! Difftastic has been a really big project[2] despite its limited scope and I'm less interested in VCS implementation.

I think text works well as the backing store for a VCS. There are a few systems that have structured backends (e.g. monticello for smalltalk), but they're more constrained. You can only store structured content (e.g. monticello requires smalltalk code) and it must be well-formed (your VCS must understand any future syntax you use).

Unison[3] is a really interesting project in this space, it stores code by hash in a sqlite backend. This makes some code changes trivial, such as renames.

From the perspective of a text diff, an AST diff is lossy. If you add an extra blank line between two unchanged functions, difftastic ignores it. That's great for understanding changes, but not for storage.
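The lossiness is easy to demonstrate with Python's own ast module: two sources that differ only in blank lines parse to indistinguishable trees, so an AST-level comparison sees no change at all.

```python
import ast

before = "def f():\n    return 1\n\ndef g():\n    return 2\n"
after = "def f():\n    return 1\n\n\n\ndef g():\n    return 2\n"  # extra blank lines

# A text comparison reports a change...
print(before == after)  # False
# ...but the ASTs are identical, so a structural diff sees nothing.
print(ast.dump(ast.parse(before)) == ast.dump(ast.parse(after)))  # True
```

Great for reviewing, but a backing store built on this would silently drop the author's whitespace choices.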

[1]: https://difftastic.wilfred.me.uk/git.html

[2]: https://www.wilfred.me.uk/blog/2022/09/06/difftastic-the-fan...

[3] https://www.unison-lang.org/


Thanks for replying.

I already use delta[1] as a diff viewer, but I suppose GIT_EXTERNAL_DIFF is a deeper integration than just a pager. I've been aware of your project for some time now, but haven't played around with it since I wasn't sure if it would help with automatic conflict resolution, and other issues Git often struggles with. But I'll give it a try soon, thanks again.

I wasn't familiar with Unison. It looks interesting. We definitely need more novel approaches to programming, especially since our field will radically change in a few years as AI becomes more capable.

[1]: https://github.com/dandavison/delta


For languages that have strong IDE refactoring support, and user bases that use it, a (future) solution would be for the IDE to auto-commit along the way, with metadata to explain what happened: "removed unused function based on suggestion", "extracted duplicate", "renamed public method taxed to isTaxed and updated usages across files x, y and z; developer comment: every other of these methods follows the pattern isSomething".

The last example also adds a new feature: an option for a developer to add a comment on an automated refactor.

Ordinary commits could exist on top of this as milestones.

I wouldn't be totally surprised if sooner or later JetBrains does this. They are creating their own, often better, versions of everything, I feel, and version control could be an obvious next step.

As someone who often prefers other solutions to theirs, I'd prefer if someone else does it first, so I end up with something I can use across NetBeans, VS Code, Eclipse etc., and not something like Kotlin which forces me to use IntelliJ. (Don't get me wrong, IntelliJ is great, I just have NetBeans as my personal favorite.)


I've been using difftastic as default diff for the past few months. Thank you.

It works well, and the only time I switch back is when comparing long strings.


I was hoping to see something like this. Thank-you!


I disagree. Merge conflicts are just a fact of life, and line-granularity has good usability properties (displaying and editing). `git` has issues, but I don't see merge conflict granularity being one of them, especially when projects enforce consistent, automatic formatting.

I agree however that while Pijul is technically very interesting, it doesn't seem to have any killer features that would overcome the cost of switching to a niche version control.


> line-granularity has good usability properties

I was reviewing a PR today and have to disagree with you. There was a single value change in a JSONL test data file. This is nightmarish to read in regular git diffs: the change git thought was happening was (with text wrapping) a full page worth of JSON, rather than identifying that it was a single word change. And because it is JSONL, the file could not be split into different lines without altering its semantics.

I don’t think it’s unreasonable we could be a little smarter here.
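The record-level report I mean is easy to sketch (field names hypothetical): parse each JSONL line and compare the objects, so a one-value change is reported as exactly that instead of a page of wrapped text.

```python
import json

# Two versions of one JSONL record; only "price" actually changed.
old_line = '{"id": 7, "name": "widget", "price": 10, "tags": ["a", "b"]}'
new_line = '{"id": 7, "name": "widget", "price": 12, "tags": ["a", "b"]}'

def changed_fields(old: str, new: str) -> dict:
    """Compare two JSONL records as objects, not as text."""
    a, b = json.loads(old), json.loads(new)
    return {k: (a.get(k), b.get(k))
            for k in a.keys() | b.keys()
            if a.get(k) != b.get(k)}

print(changed_fields(old_line, new_line))  # {'price': (10, 12)}
```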


JSONL and line-oriented version control are never going to play nicely together.

Imagine doing code review in a language where every function had to be written on one line!

The two techniques I use to dodge this:

1/ Switch from JSONL to a list of objects then pretty print it to be line oriented.

2/ Compress the test data to discourage viewing it altogether, and make people describe what’s being changed rather than leaving it up to the diff to show it.
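Technique 1 takes only a few lines of Python (data inlined for illustration): parse each JSONL line, then re-emit the records as a pretty-printed list of objects so every scalar lands on its own line and stays diff-friendly.

```python
import json

jsonl = '{"id": 1, "v": "x"}\n{"id": 2, "v": "y"}\n'

# Parse each JSONL line, then pretty-print the whole thing as one JSON array.
records = [json.loads(line) for line in jsonl.splitlines()]
pretty = json.dumps(records, indent=2, sort_keys=True)
print(pretty)
```

The trade-off is that the file is no longer JSONL, so whatever consumes it has to read a JSON array instead (or convert back before use).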


Diffs come with column offset coordinates, though, so perhaps it's your diff editor that needs some more muscle?


I can believe that they can come with that... but `man diff` doesn't mention it at all, I can't find any description of the diff/patch format that mentions it, and I've literally never seen it. I do frequently get multi-megabyte single-line diffs though, which amount to just one or two characters changing.

Other tools exist, of course, but if you're going to be shipping around stuff that `git apply` or thousands of other tools can handle, it has to be in "the" standard format.


Yeah, meld handles this fine.


Simply auto-format that jsonl file and your git is going to be happy.


You could just use a better diff tool. Git itself is format-agnostic.


This is an area where it should be easier to build tools. I have tools for diffing k/v and structured formats, but they are local on my machine. Being able to publish those to GitLab, GitHub and Bitbucket would be great.

Edit: Worst is diffing YAML files; you really need a stable way to parse those.


These ideas have mileage but as long as the source code is plain text, the way we represent changes is always going to be text based too.

Wouldn’t a better starting point be to change the way we represent source code, then let the patch tools follow?

Your editor knows exactly what steps you took to make your change. A semantic VCS like you describe sounds very similar to reaching eventual consistency in a distributed data structure by sharing streams of edits between peers.

Personally, I’m a firm believer in text. Auto formatting code so that changes are line oriented helps a lot. So does a good culture of namespaces and separation of concerns. Conflicts happen when two people working on different things have to edit the same code. You can dodge that by more carefully structuring your project.

There’s a reason why software projects aren’t just single-file piles of symbols. Code is primarily supposed to be human readable, and the more legible the project the better shape it is in for good ole line oriented diffing and patching.


Plain text is only storage; I happily posit that plain text as storage is not the issue.

It's the tooling above that's lacking. From editors to source control, it's all text/buffer/line/character oriented, which does have its benefits. Sure there's syntax highlight, folding, symbol search and whatnot but semantically these tools only superficially understand code itself and certainly don't operate on code, they only pretend to and are fundamentally text editors. We're getting there with LSPs and error-tolerant parsers but they still map back to text for us to interact.

Tools like gofmt, black, ruby standard and such already kind of abstract away text as storage: you write code in whatever way and it gets transformed right under your feet. In some way as a dev you already don't care about the text, it gets handled for you, but it still maps back to text because editors can't handle anything else.

Similarly, LSPs are in my mind quite nerfed because they have to do a whole back-and-forth-to-text dance. Vim text objects kind of go in that direction as well, where you think about higher-level constituents than text (arguments, methods, etc). Imagine being able to bind the understanding of LSPs right into semantic Vim language objects without them having to go through text!

I dream of an editor where I can open a bunch of functions or classes or namespaces (not files) in buffers that have an understanding of the constituents, and it would all map back to files for storage behind the scenes. I believe it doesn't have to go full-tilt Smalltalk-like; the Clojure conf Overtone demo from years ago is almost there, although not quite.


I think it’s possible to go to a function level, but you basically need to stop using the file system. We come back to the question of storing code in some sort of db based storage, which can then contain all these tools built in. I can see this type of system being used more and more with the lambda / edge / micro service systems where it simplifies data synchronization. However git / nextBestThing will keep on being used as long as we write code in text files.


I had been thinking something similar to this.

Would different users have different schemas in their VCS db? Would every pull request be a db migration XD?


Hasn't that been done 30 years ago? Didn't Envy work on the program structure?

But it is Smalltalk so code was not stored as plain text.

Now if only Smalltalk had changed with the times, e.g. added typing, used native GUIs and not set itself off in a walled expensive garden.


You can tell git to use a different executable as its diff tool. I agree, and I'm curious whether such a tool would satisfy my needs. I think this problem is particularly hard since the diff tool needs to understand the programming language. We should have one diff tool per language IMO.

Edit: Related SO https://stackoverflow.com/questions/523307/semantic-diff-uti...


Not sure how I feel about “ML” that would likely change over time being used in a VCS. This would make commits, or whatever unit of work you want to save, non-deterministic. Also, as people we still care about file formats and likely want to track them. If anything, what you are talking about would just be a different view in a VCS that would still want to track file-level changes, if it was ever adopted.

For what you're talking about, though, I don't think the fundamental VCS really matters. You can do everything you are talking about with a tool that uses the diff from git.


A while back I saw a paper[1] from someone who integrated semantic diff into a VCS. They said that it works well for top-level changes to the file (moving classes around, etc.), but it doesn't work as well for changes inside functions. For changes at the statement level, text diff worked better. [1] Unfortunately, I don't remember the name of the paper though :(


Counterpoint: line by line comparison gives you 90% of the value at 1% of the complexity. That is a pretty tremendous local maximum.


> If AI can write code for me, it could surely understand what I'm trying to do.

You are anthropomorphizing LLMs. Essentially, they are just conditional probability distributions over tokens. That does not require or imply understanding or reasoning skills.


What does understanding and reasoning mean?


We don't know. The nature of consciousness is an unsolved problem.

We do know that LLMs fall into a certain category of mistake that most educated humans look at and go "HA! What was it thinking??"

It's not that humans don't also make those types of errors - it's that we recognize them quickly when they're pointed out to us and usually describe the error as a "stupid mistake," "brain fart," or similar name intended to show explicitly "gosh, I totally failed to actually think before I did that."

The LLMs show no sign of such self-awareness or, well, "intelligence," loose and squishy as those words are.

Maybe GPT-5 will fix that, but so far it doesn't look that way.


For a step back moving away from text into ASTs there's a bunch of interesting projects.

  * https://unisonweb.org: Unison, a programming language that abstracts names and builds a store of canonical functions
  * https://lamdu.org: Lamdu, a programming language that's meant to be edited as a tree, and its accompanying editor.
There's many more akin projects listed in https://github.com/yairchu/awesome-structure-editors/blob/ma...

I can't wait fast enough for these ideas to reshape how we deal with programs and build stuff.

Also, I wouldn't take too much credit away from projects like Pijul that, maybe more practically, slowly steer us where we want to go. I find it hard to believe that something new will suddenly replace everything given the sheer amount of things that would be left behind and can't rapidly be ported into new shiny technology for various reasons.


> There's many more akin projects listed in https://github.com/yairchu/awesome-structure-editors/blob/ma...

Awesome link! How did you find that link btw? I feel like I scoured the web so many times on this topic and missed all of this stuff.

> I can't wait fast enough for these ideas to reshape how we deal with programs and build stuff.

Same here. I feel like we can't leave plain text completely behind though. There has to be some two-way sync to the structured model.


Merge conflicts don't go away when a diff tool understands syntax.

Semantic conflicts happen even when there are no textual conflicts. E.g. one developer removes a function and all calls to it. In parallel, another developer adds a new use of the removed function somewhere, in a file that the other developer didn't even touch. Cherry pick those changes and you have a broken program.
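A toy reconstruction of that scenario, with the two "files" inlined as strings (names hypothetical): each side's change is internally consistent and there is no textual conflict, yet the combined program fails at runtime.

```python
# Branch A's merged result for main.py: helper() was removed along with
# its only call site.
merged_main = "def main():\n    return 1\n"

# Branch B's merged result for other.py: a brand-new call to helper()
# was added, in a file branch A never touched.
merged_other = "def report():\n    return helper() + 1\n"

# The textual merge is clean: the branches edited different files.
namespace = {}
exec(merged_main, namespace)
exec(merged_other, namespace)

try:
    namespace["report"]()
    print("merge produced a working program")
except NameError as err:
    print("clean merge, broken program:", err)
```

No diff granularity, line-based or AST-based, catches this; only something that resolves references across the whole project (a compiler, a test suite, or a much smarter tool) can.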


That's true, but those are logical conflicts, which could arguably also be taken into account by a (much) smarter AI-powered VCS.

The conflicts most people experience on a daily basis are with the tool being confused about changes to symbol names, function signatures, and with the context around the hunk changing, which has no relevance to the change itself. These can mostly be resolved by the tool having more awareness of the program structure, understanding the intention of the change, and knowing how to produce a valid result.

Version control would be much more useful if the tools kept track of semantic changes in a project, instead of line-based differences without any awareness of the content. Existing semantic diffing tools show that this is indeed a better approach, but as pointed out[0], it's a very difficult problem to solve.

[0]: https://news.ycombinator.com/item?id=37097171


> These can mostly be resolved by the tool having more awareness of the program structure, understanding the intention of the change, and knowing how to produce a valid result.

If you have a three-way merge tool which does all these things, you can use it with Git.

> Version control would be much more useful if the tools kept track of semantic changes in a project, instead of line-based differences without any awareness of the content.

So the good news for you is that Git doesn't track any differences at all. Every commit stores a full snapshot of the files. There is no limitation on how smart your merge tooling can be, as long as it can work with three artifacts: the ancestor code and the two parallel derivatives of it to be merged.

The idea that a version control system must track detailed differences is false, and a bad requirement. Additionally, if the internal tracking representation has to follow the different syntaxes of umpteen languages, that's a hyperbad requirement.
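The three-artifact shape can be sketched at any granularity. Here is a toy key-level three-way merge over parsed dictionaries rather than lines (the config keys are made up): each key is resolved by asking which side diverged from the common ancestor.

```python
def three_way_merge(base: dict, ours: dict, theirs: dict):
    """Merge two derivatives of `base` key by key, flagging true conflicts."""
    merged, conflicts = {}, {}
    for key in base.keys() | ours.keys() | theirs.keys():
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:
            value = o            # both sides agree (or neither changed it)
        elif o == b:
            value = t            # only theirs changed it
        elif t == b:
            value = o            # only ours changed it
        else:
            conflicts[key] = (o, t)   # both changed it differently
            continue
        if value is not None:    # None here means the key was deleted
            merged[key] = value
    return merged, conflicts

base = {"timeout": 30, "retries": 3, "debug": False}
ours = {"timeout": 60, "retries": 3, "debug": False}  # we changed timeout
theirs = {"timeout": 30, "retries": 5}                # they changed retries, deleted debug
print(three_way_merge(base, ours, theirs))
```

Nothing here needed the VCS to store diffs; the "smartness" lives entirely in the merge tool's model of the content.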


Can't find the source right now, but I think I've read a discussion on Pijul's forum about its ability to change the tokenizer depending on the file type, for a more meaningful granularity level. I think someone was talking about plugging tree-sitter in there to get an AST.


This would only work if the language is suitable for it (Lisp and Smalltalk come to mind, but even there, having a comment can screw things up).


How about a few examples of how this semantic / AST approach is a game changer?

An extra parameter is added to an existing function: how does this look?

Similar functionality is extracted to a parameterised new function: how does this look?

I’m sure diff and code review tools will evolve, but it’s helpful for people to talk about what it would look like, to make it less nebulous.


Coccinelle is an interesting project, relatively widely used by the Linux kernel developers. Some examples on their website:

https://coccinelle.gitlabpages.inria.fr/website/sp.html

https://coccinelle.gitlabpages.inria.fr/website/rules/


What (if anything) would we lose if we're just shipping around AST's and their associated deltas?

Would we lose formatting, etc?


What's the difference between shipping AST with formatting stripped and shipping code that's been automatically formatted? I feel like the only difference is the configuration required to enforce the latter and different modes of failure. Adopting Prettier in my team was the best decision ever, so liberating. More languages should have a single, mandatory way to format code, without any ways to opt out.
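One way to see the similarity, using Python's ast module (`ast.unparse` requires Python 3.9+): round-tripping source through the AST yields a canonical rendering, much like a mandatory formatter would, so two formattings of the same code collapse to one representation.

```python
import ast

# Same code, two formattings.
messy = "def f( x ):\n    return (x\n            + 1)\n"
tidy = "def f(x):\n    return x + 1\n"

def canonical(src: str) -> str:
    # Parse and re-emit: the formatting is gone, the structure remains.
    return ast.unparse(ast.parse(src))

print(canonical(messy) == canonical(tidy))  # True
```

Shipping the AST is essentially shipping `canonical(src)`; the open question is only whose rendering rules get baked in.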


> More languages should have a single, mandatory way to format code, without any ways to opt out.

Strongly disagree. Maybe if you're in a very domain-constrained environment, I could see this being valuable. But I write graphics and simulation code all day, which involves a lot of translating math expressions. A compiler insisting on me using PascalCase (like, for example, .NET does) leads to very unreadable translations of formulas. And I'm not of the opinion that a system making me rewrite variable names to "meaningful names" helps understanding of the underlying math much, if you need to do symbol manipulation or read background papers anyway.

Trust your users. Give them the tools to enforce safety barriers for themselves. Give them sensible defaults, sure. But give them ways to opt-out if they know that they need to break the conventions.


Just one example: Empty lines are used to visually structure blocks of code within a function. Those can't be recovered from an AST. (Unless, of course, you make such formatting choices part of the abstract syntax.)


> More languages should have a single, mandatory way to format code, without any ways to opt out.

Strong disagree, but I definitely agree that every project (or team) should have its own standards for formatting that can be automatically applied.

In Ruby land, Rubocop has been a win at the companies where I've worked. Greatly reduces grumbling about formatting. And VSCode/Sublime/etc can format code automatically.


> Would we lose formatting, etc?

Basically you'd lose the entire source representation of your code so you are essentially shipping binaries at that point. You could annotate the AST with hints to recover the original source, but once you have an AST you also have the option of transpiling to other languages/representations.

This is essentially what things like the JVM, .Net, wasm, and any sort of embedded virtual machine are. The AST is kind of just the byte-code that gets executed since the machine abstraction isn't really tied to physical architecture.


I think the problem with that is that it's a massive amount of work, after which you get a fragile system (what if the AST changes?) that doesn't really mean much less work. Merge conflicts will still happen if two people change the same thing.


Your "rethinking" is "let's AI write and version control the code."

Very deep.

AI, for the moment, cannot write anything outside of what is already written. It just so happens that my interests are in the very strange problems where no code has been written thus far, except by me and for me. AI is not helpful there at all.

I am not even starting on tasks that are formulated like "DUZ BY ... WITH KUMAJ is not supported" and have to be solved in a code base that is a million LOC or so (50 million bytes) and is itself part (and user) of a much, much larger code base.

Finally, git was a de-improvement on darcs, which predates git by two years, if I remember correctly. Darcs was way ahead of git in everything but speed, including an attempt to view text as structure (darcs has a Rename change where one identifier gets renamed into another). Pijul is a contemporary rethinking of what darcs offered.

So, Pijul is not an iterative improvement upon git. It is an improvement upon darcs, which represents the start of a separate lineage of DVCS that tried to include many of the things you mentioned.

Except using AI, of course.


> Your "rethinking" is "let's AI write and version control the code."

It's not. There are several semantic diffing tools that do a much better job at showing relevant changes than simple line-based diffs. This is all done without AI.

My point is that a VCS written from the ground up with this knowledge would offer a much better UX than current tools.

My second point mentions AI as the next step in the progression, since it's clear that it will affect how humans write code in the very near future.

> Very deep.

I don't appreciate the snark.

This is not some deeply unique line of thinking.

> AI, for the moment, cannot write anything outside of what is already written.

You're underestimating the power of combining written chunks of code to produce a unique solution. Most software is written by glueing existing code together and using libraries. AI can do this reasonably well today. What do you think it will be capable of in 5 years? 10?

> It just so happens that my interests are in the very strange problems where no code has been written thus far, except by me and for me. AI is not helpful there at all.

I think you're overestimating the uniqueness of your code. Is it really all original? You're writing everything from scratch with novel, never before seen approaches? I doubt that very much.

There's a reason why design patterns exist. Many solutions can benefit from following existing patterns. AI today can help automate with writing common patterns, and also with any chore work like writing tests. It's helpful even if it's not writing those novel solutions you still have to do yourself.

Besides, none of this is relevant for a VCS. AI doesn't need to have superhuman programming skills to manage versions. It would just be foolish to not use its current capabilities to understand changes better, and help us take the chore out of dealing with version management.


>Most software is written by glueing existing code together and using libraries. AI can do this reasonably well today.

>I think you're overestimating the uniqueness of your code. Is it really all original? You're writing everything from scratch with novel, never before seen approaches? I doubt that very much.

You can doubt that. Yes, of course.

One kind gentleman here introduced me to worst-case optimal join algorithms and I designed one myself. It uses Bloom filters represented as binary decision diagrams, really nothing fancy.

You can try asking AI to write you a Bloom filter represented as binary decision diagrams. I doubt you will get anything interesting and/or useful.

>There's a reason why design patterns exist. Many solutions can benefit from following existing patterns.

This is history repeating itself.

Functional languages do not require that many design patterns. In fact, functional programming can hide complexity so well that uncovering it can produce an exponential blow-up in code size. And type systems greatly benefit the writing of complex systems. I know because I have programmed in almost all kinds of languages (typed, untyped, dynamic, static, Turing-complete, total, etc.) and programming paradigms in existence (including term rewriting systems).

I doubt you use Haskell for your work, which is functional and has rich and powerful type system.

The same will be with AI. You can proclaim it will change the world, but similar tools readily available for quite some time did not.

Returning to AI and VCS, the presentation that started our discussion shows a principal problem with VCS, at 7:06 or so. We can make a changeset that will provoke any merging/conflict-resolution algorithm into making a silent mistake. Pijul postpones this problem, I think; it does not get rid of it.

AI does not report a problem; it hides it, and lies. Nobody, to my knowledge, has been able to make any AI admit it does not have a clue about a thing it clearly has no clue about.

So if we will certainly have a mistake in our merging process, AI will certainly lie about it. The code after merging-with-AI has to be tested and reviewed; there are no shortcuts there.

So, why bother?



