The state of merging technology

wscott · on Dec 14, 2023

The piece of BitKeeper I wish people would steal is the smerge gca conflict format (see https://www.bitkeeper.org/man/smerge.html). Example:

               <<<<<<< local slib.c 1.642.1.6 vs 1.645
                    sc = sccs_init(file, INIT_NOCKSUM|INIT_SAVEPROJ, s->proj);
               -    assert(sc->tree);
               -    sccs_sdelta(sc, sc->tree, file);
               +    assert(HASGRAPH(sc));
               +    sccs_sdelta(sc, sccs_ino(sc), file);
               <<<<<<< remote slib.c 1.642.1.6 vs 1.642.2.1
               -    sc = sccs_init(file, INIT_NOCKSUM|INIT_SAVEPROJ, s->proj);
               +    sc = sccs_init(file, INIT_NOCKSUM|INIT_SAVEPROJ, p);
                    assert(sc->tree);
                    sccs_sdelta(sc, sc->tree, file);
               >>>>>>>

Here we have a code conflict, and rather than showing you what the file looks like on the two sides, it shows you what each side changed relative to the GCA. So we get two unified diffs: the local side made one edit, and the remote side made another. That makes it obvious how to resolve the conflict without losing a change.

This works for the criss-cross case because that GCA is really a set of common revisions merged together.
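To make the idea concrete, here is a minimal Python sketch that renders a conflict as two diffs against the common ancestor. It is not BitKeeper's smerge code: the function names `side_diff` and `gca_conflict` are made up for illustration, the markers are simplified (no file names or revision numbers), and it assumes the GCA text is already available as a list of lines.

```python
import difflib

def side_diff(gca, side):
    """Diff one side against the GCA: '  ' unchanged, '- ' removed, '+ ' added."""
    out = []
    matcher = difflib.SequenceMatcher(a=gca, b=side, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out += ["  " + line for line in gca[i1:i2]]
        else:
            # replace/delete/insert: show what left the GCA, then what came in
            out += ["- " + line for line in gca[i1:i2]]
            out += ["+ " + line for line in side[j1:j2]]
    return out

def gca_conflict(gca, local, remote):
    """Render a conflict as two diffs relative to the common ancestor."""
    return (["<<<<<<< local"] + side_diff(gca, local)
            + ["<<<<<<< remote"] + side_diff(gca, remote)
            + [">>>>>>>"])
```

Calling `gca_conflict(gca, local, remote)` with the three versions of the hunk gives a list of lines in roughly the shape of the example above, with each side shown as its own edit.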

teraflop · on Dec 14, 2023

Git has something similar if you turn on the "diff3" conflict style, and I can't for the life of me understand why it's not on by default, because there are many situations where you just don't have enough information to properly resolve a merge without it.

wscott · on Dec 14, 2023

The "diff3" style looks like this:

           <<<<<<< HEAD
                sc = sccs_init(file, INIT_NOCKSUM|INIT_SAVEPROJ, s->proj);
                assert(HASGRAPH(sc));
                sccs_sdelta(sc, sccs_ino(sc), file);
           ||||||| merged common ancestors
                sc = sccs_init(file, INIT_NOCKSUM|INIT_SAVEPROJ, s->proj);
                assert(sc->tree);
                sccs_sdelta(sc, sc->tree, file);
           =======
                sc = sccs_init(file, INIT_NOCKSUM|INIT_SAVEPROJ, p);
                assert(sc->tree);
                sccs_sdelta(sc, sc->tree, file);
           >>>>>>> c4892343......

That contains the same information, but you have to parse it yourself and it is not nearly as fast to see what changed.

However, it is faster to edit since you don't need to remove the diff markers.

pabs3 · on Dec 14, 2023

BTW zdiff3 is a newer version of that style that is slightly better than diff3.

juped · on Dec 14, 2023

zdiff3 is NOT "better", it's a matter of personal preference. it moves common lines out of the conflicted hunks, which may be highly confusing. some people use it anyway
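For anyone wondering what "moves common lines out" means in practice, here is a rough Python sketch of the idea only, not git's implementation. The function name `hoist_common_lines` is hypothetical, and it looks only at the two sides of a single conflicted hunk.

```python
def hoist_common_lines(ours, theirs):
    """Split two conflict sides into (prefix, ours_rest, theirs_rest, suffix).

    Lines that both sides share at the start or end of the hunk are pulled
    out, leaving a smaller conflict region in the middle.
    """
    limit = min(len(ours), len(theirs))
    n = 0  # length of the common leading run
    while n < limit and ours[n] == theirs[n]:
        n += 1
    m = 0  # length of the common trailing run (not overlapping the prefix)
    while m < limit - n and ours[-1 - m] == theirs[-1 - m]:
        m += 1
    prefix = ours[:n]
    suffix = ours[len(ours) - m:]
    return prefix, ours[n:len(ours) - m], theirs[n:len(theirs) - m], suffix
```

The base section of the hunk is left untouched in this sketch, which is part of why the result can be confusing: the ancestor text shown may no longer line up with the trimmed sides.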

juped · on Dec 14, 2023

If you ask on / search the history of the mailing list, you'll see that in complex merges with synthetic intermediate parents it can produce some really gnarly output, which is the main reason why (coupled with git's general conservatism).

tome · on Dec 16, 2023

> turn on "diff3" ... there are many situations where you just don't have enough information to properly resolve a merge without it

That's right. Here's an example I wrote up:

https://stackoverflow.com/a/63739655/997606

gavinhoward · on Dec 14, 2023

I'm making a VCS based on the weave.

Your wish will be answered; I wanted a conflict-aware format, and I will definitely plunder the smerge format.

kragen · on Dec 14, 2023

some relevant context is that bram and his brother ross developed an early decentralized version control system named 'codeville', more or less contemporary with git and mercurial; they put a lot of work into figuring out how merging should handle different hairy scenarios

bram was deeply disappointed that the systems that got widely adopted, like git, did a terrible job with these hairy scenarios, since he knew that it was possible to do much better

ajb · on Dec 14, 2023

Oh yeah, there was a bit of a Cambrian explosion of version control systems around then, due to everyone getting fed up with CVS as well as the bitkeeper debacle. Codeville, TLA, monotone, vesta...

I quite liked the idea of vesta, which included a build system and with hindsight looks a lot like nix/guix

kragen · on Dec 14, 2023

yeah, bram and len featured a lot of them in codecon. but i don't agree with your explanation of why it happened

svn was the result of everyone getting fed up with cvs; basically it does the same thing as cvs, but does it in a less janky way, and with atomic commits. but it still suffers from cvs's design weaknesses

vesta was a digital research (decwrl?) project from the previous millennium; peter deutsch told me about it at the time, but it was still proprietary, and it took them a while to be able to open-source it. it was basically a clone of dsee, just like clearcase, though perhaps better done. it wasn't motivated by dissatisfaction with cvs and in fact couldn't do things cvs could do

i think the main thing that kicked off the cambrian explosion wasn't 'everyone getting fed up with cvs' but rather tom lord (rip) writing arch (tla, later baz and bzr) which demonstrated to everyone that it was possible to do enormously better than cvs/svn, with features like atomic commits, forking your own branches without permission from the core team, decentralization, and serving from regular ftp or web servers (no special server software)

these design features were ideologically driven on tom's part; he wanted to give ordinary users version-control tools that were just as powerful as the ones used by core teams on projects like freebsd and apache, motivated by the same egalitarianism that had led him to become an employee of the fsf

and i think graydon hoare's monotone was the thing that most inspired the other follow-on systems, like git, mercurial, codeville, fossil, maybe darcs, and maybe even baz and bzr

maybe kernel hackers getting experience with bitkeeper in 02000 to 02005 added motivation for moving to better-than-cvs-and-svn models too tho

shlomi fish's site from the time in question has a lot of material on what was happening, including even lesser known version tracking systems like aegis: https://better-scm.shlomifish.org/aegis/

hyperthesis · on Dec 14, 2023

Historically, bitkeeper directly led to git.

But you're saying arch, then monotone, came before bitkeeper? What innovations did each provide?

(git's innovation was content-based addressing, so the data structure does the heavy lifting. bitkeeper used sha1 hashes for decentralization - was that its main contribution?)

Probably the internet enabled the decentralized version control Cambrian explosion (or, at least, a Cambrian explosion)

BTW: fun fact re "merging": bitkeeper had first-class renaming, which git lost. A process of subtraction as well as addition.

justin_ · on Dec 14, 2023

Linus himself has credited Monotone with the content-addressing by SHA1: https://marc.info/?l=git&m=114685143200012

I think the main issue with Monotone was the performance. Linus also hates databases and C++.

--

Hoare didn't come up with this idea either, but he did apply it to version control. He had potentially been influenced by his earlier work on distributed file systems and object systems. Here's his 1999 project making use of hashes: https://web.archive.org/web/20010420023937/venge.net/graydon...

He was in contact with Ian Clarke of Freenet fame (also 1999). There seems to have been a rise in distributed and encrypted communications around the time, as kragen mentions in his other post.

BitTorrent would also come to use hashes for identifying torrents in a tracker, and would come out in 2001, created by Bram Cohen, the author of the post here :)

kragen · on Dec 14, 2023

thanks for digging up these links

interestingly it does say bk used md5 in some way; i'm not sure how i overlooked that when i was looking at the code earlier, but indeed md5 is used in lots of places (though apparently not for naming objects the way git and monotone do)

the crucial way bittorrent uses hashes actually is for identifying chunks of files (a .torrent file is mostly chunk hashes by volume); that's why it was immune to the poisoning attacks used against other p2p systems in the early 02000s where malicious entities would send you bogus data. once you had the correct .torrent file, you could tell good data from bad data. using the infohash talking to the tracker is convenient but, as i understand it, there isn't really a security reason for it; the tracker doesn't verify you're really participating productively in the swarm, it just sends your IP to other peers in case you might. so there isn't a strong reason to keep torrent infohashes from colliding
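a minimal python sketch of that per-chunk check, for illustration only: the tiny piece size and the function names are made up, and a real .torrent stores the sha1 piece hashes concatenated inside a bencoded structure rather than as a python list

```python
import hashlib

PIECE_LEN = 4  # hypothetical tiny piece size so the example is easy to follow

def piece_hashes(data, piece_len=PIECE_LEN):
    """SHA-1 digest of each fixed-size piece, as a .torrent would list them."""
    return [hashlib.sha1(data[i:i + piece_len]).digest()
            for i in range(0, len(data), piece_len)]

def verify_piece(index, piece, hashes):
    """Accept a piece from a peer only if it matches the known hash."""
    return hashlib.sha1(piece).digest() == hashes[index]
```

once you hold the right list of hashes, `verify_piece` rejects bogus data no matter which peer sent it, which is the property that defeats poisoning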

hyperthesis · on Dec 14, 2023

Right, bitkeeper doesn't name files with hashes like git does. But it uses sha1 (or similar) for decentralization: to tell if two remote files are the same.

Another player is tridge's rsync, which also uses hashes like that.

kragen · on Dec 14, 2023

aha, thanks

kragen · on Dec 14, 2023

bitkeeper definitely led directly to git

but bitkeeper predated arch and monotone

i listed arch's innovations above (some of which were also in bitkeeper, though i don't think that's where tom got them; atomic commits in particular were in all kinds of version control systems). as i understand it, git got content-based addressing from monotone. but monotone didn't invent that either; merkle invented it for his dissertation in 01979
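for concreteness, here is how content-based addressing looks for a git blob under the sha1 object format: the object's name is the hash of a short header plus the content. the function name `git_blob_id` is made up for this sketch; the header layout is what `git hash-object` uses for blobs

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Name a blob by its content: SHA-1 over 'blob <size>\\0' + content."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# The empty blob always gets the same name in every repository:
# git_blob_id(b"") == "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
```

two repositories that have never talked to each other assign the same name to the same content, so no central naming authority is needed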

the current version of bitkeeper (7.3.3) doesn't use sha1 except to import and export to git (look for yourself: https://www.bitkeeper.org/downloads/7.3.3/bk-7.3.3.src.tar.g...), so i think you might have that part wrong

the internet predated the decentralized version control cambrian explosion by about 17 years, if we count from the tcp/ip flag day, or by 32 years if we count from the first arpanet connections. it was clearly a crucial ingredient but it wasn't the limiting reagent

hyperthesis · on Dec 14, 2023

Thanks; maybe critical mass for the internet triggered it?

oh yeah, I recall merkle now; but maybe git first applied to decentralization?

maybe bitkeeper doesn't use sha1 specifically, but some similar hash?

kragen · on Dec 14, 2023

i don't know much about how bitkeeper works, but i don't think it uses secure-hash-based naming of any kind

i think these are some of the things that led up to the transition to decentralized source control:

1. linus torvalds didn't want to use cvs because the social process of the linux kernel already depended on being able to ship patches around willy-nilly, but did want some kind of version control system

2. larry mcvoy had worked on a decentralized source control system called teamware, at sun, before the very first version of linux https://www.krsaborio.net/linux-kernel/research/2002/0528.ht... so he proposed to build a decentralized version control system to solve linus's problem, initially called bitsccs https://lkml.org/lkml/1998/9/30/122 but later called bitkeeper

3. tom lord decided the way we were doing version control was wrong and bad and spent years annoying the hell out of everyone and building software to demonstrate that a much better way was possible, and finally he convinced a lot of people, who started building better versions of the kinda janky arch/tla

4. there had been a lot of work on merkle graphs over the years, mostly for cryptographic applications, but in particular in the late 90s for decentralized filesystems; things like pgp, freenet, mojonation, bittorrent, and tahoe-lafs were popularizing this remarkable fact of being able to assign decentralized, secure names to pieces of content as long as they didn't have to be human-readable (a trilemma tahoe's designer zooko would formalize as 'zooko's triangle' until satoshi nakamoto found a solution). it may or may not be relevant that merkle's foundational patent expired in 01996; i think it's maybe more relevant that napster took off hugely in 02000, and suddenly decentralized and peer-to-peer systems that didn't have a central naming authority became an extremely fashionable thing to work on

5. larry got pissed off at tridge for trying to make a bitkeeper-compatible system and revoked the bitkeeper license after 5 years of people using it for linux. linus tried a bunch of the new free-software decentralized version control systems, including monotone, but none of them were adequate, so he decided to make a really stupid, basic version control system that would work well enough, and that was git

6. some kids started github, and they did a really good job of building a new kind of forge, and that took a lot of the pain out of using git. also because of how they set up the namespace the barrier to starting a new project there was much lower than on sourceforge, because you could call the project, like, 'notes', and because it was inside the namespace of your username there was no implicit claim to be the one and only notes project for the world.

you could definitely argue that critical mass for the internet was the thing that triggered so much interest in decentralized systems. but then again, zooko and ian clarke had spent a decade already trying to figure out how to protect human rights on the internet, and so maybe they were going to build decentralized systems once they figured out how, regardless of how many or how few people they served. or maybe if larry hadn't revoked the license, linus wouldn't have written git, and without linus's superb quality of performance engineering, people would have kept using svn except in cases where they really needed decentralization, and maybe mercurial wouldn't have become decently performant without competition from git. or maybe without 9/11 the internet would have developed in a totally different direction. i don't really know what other paths history might have taken

fmajid · on Dec 14, 2023

Hi Kragen!

Don't forget Darcs, Bazaar and Mercurial. I think it's the needs of Open Source collaboration that drove this convergent evolution of DVCS, and the real conceptual breakthrough was getting rid of RCS-like sequential revision numbers.

The commercial world lagged. Apple were certainly late adopters: in 2010 Xcode still didn't support git, and Subversion was the only choice. Microsoft, of course, was a late if enthusiastic adopter because of its Ballmer-era aversion to anything Linux.

I personally prefer Fossil and used its forge-like CVStrac in the guise of Gittrac for years, but for better or worse Git's tooling integration won out.

kragen · on Dec 14, 2023

hi fazal!

i mostly agree, and certainly didn't mean to slight darcs, bazaar and mercurial; darcs in particular was my version tracker of choice before switching to git

but i think 'the needs of Open Source collaboration' are somewhat more plastic and historically contingent than you imply. if you read producing open source software you'll see a snapshot of the dominant social practices of open source collaboration in the world git and bazaar were born into (which of course you also experienced, but others reading this comment may not have). those practices still survive in places, like netbsd and debian

arch, git, and family were designed to support a different set of social practices, practices that were at the time marginal in part because of the practical difficulty of applying them without software support. tom lord's radical program was to change the software landscape on which open-source collaboration happened in order to make those social practices viable

i agree that globally sequential revision numbers are incompatible with decentralization in the pre-nakamoto world, because they demand consensus, and decentralized consensus was infeasible until nakamoto. it's still probably too costly for this purpose

hyperthesis · on Dec 15, 2023

I'm reading fmajid as meaning that decentralization is antithetical to the control that companies require - so it must grow elsewhere.

By "dominant social practices" I think you mean cathedral as opposed to bazaar? (esr)

So it looks to me like: internet adoption made decentralization possible; a decentralization practice arose, exemplified by linus and publicised by esr, thus creating decentralization-tool demand; and then all the technical progress you mention was harnessed, including ideas applied from other fields (like merkle trees, originally meant for encryption), creating decentralization supply.

fmajid · on Dec 15, 2023

I don't think decentralization is the issue; after all, most companies adopted Git eventually (or a forked and rewritten Mercurial in the case of Meta). It's just that they did not feel the burning need the way some distributed teams like the Linux kernel team did.

And Kragen is right that open-source communities vary widely in culture and processes. Linux has its email patch based workflow, as does Sourcehut. There is the popular pull request model invented or at least popularized by GitHub. OpenBSD still uses its bizarre CVS workflow and seems happy with it.

kragen · on Dec 15, 2023

pretty much

keep in mind, tho, that teamware was decentralized version control inside sun in the 80s or very early 90s, so i don't think it's antithetical to something companies require

i've never used teamware and i don't have a good handle on how it worked

with respect to merkle trees, it's true that encryption was merkle's original intended use for them, but i don't know if anyone has ever used them for that except as an experiment. surely not since rsa was published

hyperthesis · on Dec 15, 2023

Good point about teamware, though I think sun was a very unusual company, and particularly network focussed. Also, the internet itself had a decentralized design (for military robustness, IIRC).

Maybe making the core data structure be a merkle tree (not a separate data structure for validation) was new... at least, for version control? It seems monotone used merkle trees as an overlay https://decomposition.al/blog/2019/05/31/how-i-learned-about...

I'm mainly going by larry, who told me that git's data structure wasn't from bitkeeper (but was "all his", IIRC). The significance is that this data structure can't model renames (of course, could do them as an overlay or decoration - though git doesn't)

hyperthesis · on Dec 15, 2023

Thanks for laying all that out!

kragen · on Dec 15, 2023

sure, i hope i'm not getting too much wrong

sockaddr · on Dec 14, 2023

> 02000 to 02005

I think this is the first time I've seen this five-digit notation used in the wild after reading about The Long Now Foundation using it years ago.

ComputerGuru · on Dec 14, 2023

It’s how you know a kragen post on HN.

ajb · on Dec 14, 2023

Fair enough, your history is probably more accurate.

I didn't know Tom Lord had died. And not very old either darn it :-(

kragen · on Dec 14, 2023

yeah, it's a huge loss

sitkack · on Dec 14, 2023

Tom Lord has died (berkeleydailyplanet.com) https://news.ycombinator.com/item?id=32155067

kuahyeow · on Dec 14, 2023

Didn't Git have a new default merge strategy, `ort` https://github.com/git/git/blob/master/Documentation/RelNote... ?

juped · on Dec 14, 2023

histogram is a diff algorithm

skywal_l · on Dec 14, 2023

From the article:

> switch the default 3 way merge algorithm to histogram

toomim · on Dec 15, 2023

Yeah, that was poorly written and confused me too.

It should say "switch the default diff algorithm to histogram." The diff algorithm is used within the recursive 3-way merge algorithm.

jez · on Dec 14, 2023

The biggest day-to-day merge conflict annoyance I have is situations like this. As far as I know, there’s no solution:

    commit aaaaaaaa
    diff --git a/README.md b/README.md
    --- a/README.md
    +++ b/README.md
    @@ -1,4 +1,4 @@
    -<p align="center">
    +<p align="left">
       <img width="200" src="logo.svg">
     </p>


    commit bbbbbbbb
    diff --git a/README.md b/README.md
    --- a/README.md
    +++ b/README.md
    @@ -1,5 +1,5 @@
     <p align="center">
    -  <img width="200" src="logo.svg">
    +  <img width="345" src="logo.svg">
     </p>

     # Project

Commit aaaaaaaa changes some text on line 1.

Commit bbbbbbbb changes some unrelated text on line 2.

I don’t want to have to manually resolve this—just merge the lines. If there’s a semantic conflict I’ll let the tests sort it out. When I’ve looked in the past there wasn’t a merge strategy that fixes this.

wscott · on Dec 14, 2023

BitKeeper was able to merge that successfully by looking at the revision history and seeing that the changes involved just those lines. If one of them added a line between the two being changed then it would still be a conflict.
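As a rough illustration of that adjacent-line case (not how smerge actually works, which uses the revision history), here is a Python sketch of a line-by-line three-way merge. The function name `merge_same_shape` is hypothetical, and it assumes lines pair up by position, so it bails out whenever the three versions differ in length.

```python
def merge_same_shape(base, local, remote):
    """Line-by-line three-way merge; returns merged lines, or None on conflict."""
    if not (len(base) == len(local) == len(remote)):
        # A side added or removed lines: positions no longer pair up,
        # so treat it as a conflict and leave it to a real merge tool.
        return None
    merged = []
    for b, l, r in zip(base, local, remote):
        if l == r:
            merged.append(l)      # both sides agree (changed or not)
        elif l == b:
            merged.append(r)      # only remote touched this line
        elif r == b:
            merged.append(l)      # only local touched this line
        else:
            return None           # both sides changed the same line differently
    return merged
```

With jez's README hunk, the `align` change and the `width` change touch different lines, so this sketch merges them cleanly; if either side had inserted a line in between, it would report a conflict.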

I spent a year looking at interesting merges (and commits fixing bad merges) in the Linux kernel and making a catalog of interesting cases before writing 'smerge' for BitKeeper.

It is impossible to make a perfect merge tool, but we can do a lot better than diff3.

toomim · on Dec 15, 2023

> The next easiest win would be to use the weave algorithm for non-rebase merges.

The weave is essentially a CRDT, but invented in 1977 by SCCS. Anyone interested in weaves should consider modern CRDT tech. I've been keeping some notes on the correspondence within the Braid.org group (e.g. https://braid.org/meeting-60/sccs-is-a-time-collapse) and encourage anyone interested to reach out!