
Historically, bitkeeper directly led to git.

But you're saying arch, then monotone, came before bitkeeper? What innovations did each provide?

(git's innovation was content-based addressing, so the data structure does the heavy lifting. bitkeeper used sha1 hashes for decentralization - was that its main contribution?)
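
(Roughly, content addressing means an object's name is just the hash of its bytes, so identical content gets identical names on every machine with no central registry. A toy Python sketch of git's blob naming, using the usual "blob <length>\0" header convention:

    import hashlib

    def git_blob_id(content: bytes) -> str:
        # git names a blob by hashing a short header plus the raw bytes
        header = b"blob %d\x00" % len(content)
        return hashlib.sha1(header + content).hexdigest()

    # the same bytes get the same name everywhere, with nobody handing out numbers
    print(git_blob_id(b"hello world\n"))  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad

)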

Probably the internet enabled the decentralized version control Cambrian explosion (or, at least, a Cambrian explosion)

BTW, fun fact re "merging": bitkeeper had first-class renaming, which git lost. A process of subtraction as well as addition.




Linus himself has credited Monotone with the content-addressing by SHA1: https://marc.info/?l=git&m=114685143200012

I think the main issue with Monotone was the performance. Linus also hates databases and C++.

--

Hoare didn't come up with this idea either, but he did apply it to version control. He may have been influenced by his earlier work on distributed file systems and object systems. Here's his 1999 project making use of hashes: https://web.archive.org/web/20010420023937/venge.net/graydon...

He was in contact with Ian Clarke of Freenet fame (also 1999). There seems to have been a rise in distributed and encrypted communications around the time, as kragen mentions in his other post.

BitTorrent would also come to use hashes for identifying torrents in a tracker, and would come out in 2001, created by Bram Cohen, the author of the post here :)


thanks for digging up these links

interestingly it does say bk used md5 in some way; i'm not sure how i overlooked that when i was looking at the code earlier, but indeed md5 is used in lots of places (though apparently not for naming objects the way git and monotone do)

actually, the crucial way bittorrent uses hashes is for identifying chunks of files (a .torrent file is mostly chunk hashes by volume); that's why it was immune to the poisoning attacks used against other p2p systems in the early 02000s, where malicious entities would send you bogus data. once you had the correct .torrent file, you could tell good data from bad data. using the infohash when talking to the tracker is convenient but, as i understand it, there isn't really a security reason for it; the tracker doesn't verify that you're really participating productively in the swarm, it just sends your IP to other peers in case you might. so there isn't a strong reason to keep torrent infohashes from colliding
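
a toy sketch of that integrity check (made-up python, not bittorrent's actual code; 'piece_hashes' here stands in for the list of 20-byte digests parsed out of the .torrent):

    import hashlib

    def verify_piece(index, data, piece_hashes):
        # one sha1 digest per fixed-size piece comes from the trusted .torrent,
        # so a piece is only accepted if its bytes hash to the expected value
        return hashlib.sha1(data).digest() == piece_hashes[index]

    # a peer that sends bogus bytes fails this check, and the poisoned piece
    # gets thrown away instead of ending up in the downloaded file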


Right, bitkeeper doesn't name files with hashes like git does. But it uses sha1 (or similar) for decentralization: to tell if two remote files are the same.

Another player is tridge's rsync, which also uses hashes like that.
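
A rough sketch of that compare-by-hash idea in Python (far simpler than rsync's real rolling-checksum protocol; the function names are made up):

    import hashlib

    def digest(path):
        # hash the whole file in chunks so huge files don't need to fit in memory
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 16), b""):
                h.update(chunk)
        return h.hexdigest()

    def needs_transfer(local_digest, remote_digest):
        # exchanging a few dozen bytes of digest is enough to decide whether the
        # (possibly huge) file itself has to cross the network at all
        return local_digest != remote_digest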


aha, thanks


bitkeeper definitely led directly to git

but bitkeeper predated arch and monotone

i listed arch's innovations above (some of which were also in bitkeeper, though i don't think that's where tom got them; atomic commits in particular were in all kinds of version control systems). as i understand it, git got content-based addressing from monotone. but monotone didn't invent that either; merkle invented it for his dissertation in 01979

the current version of bitkeeper (7.3.3) doesn't use sha1 except to import and export to git (look for yourself: https://www.bitkeeper.org/downloads/7.3.3/bk-7.3.3.src.tar.g...), so i think you might have that part wrong

the internet predated the decentralized version control cambrian explosion by about 17 years, if we count from the tcp/ip flag day, or by 32 years if we count from the first arpanet connections. it was clearly a crucial ingredient but it wasn't the limiting reagent


Thanks; maybe critical mass for the internet triggered it?

oh yeah, I recall merkle now; but maybe git was the first to apply it to decentralization?

maybe bitkeeper doesn't use sha1 specifically, but some similar hash?


i don't know much about how bitkeeper works, but i don't think it uses secure-hash-based naming of any kind

i think these are some of the things that led up to the transition to decentralized source control:

1. linus torvalds didn't want to use cvs because the social process of the linux kernel already depended on being able to ship patches around willy-nilly, but did want some kind of version control system

2. larry mcvoy had worked on a decentralized source control system called teamware, at sun, before the very first version of linux https://www.krsaborio.net/linux-kernel/research/2002/0528.ht... so he proposed to build a decentralized version control system to solve linus's problem, initially called bitsccs https://lkml.org/lkml/1998/9/30/122 but later called bitkeeper

3. tom lord decided the way we were doing version control was wrong and bad and spent years annoying the hell out of everyone and building software to demonstrate that a much better way was possible, and finally he convinced a lot of people, who started building better versions of the kinda janky arch/tla

4. there had been a lot of work on merkle graphs over the years, mostly for cryptographic applications, but in particular in the late 90s for decentralized filesystems; things like pgp, freenet, mojonation, bittorrent, and tahoe-lafs were popularizing the remarkable fact that you can assign decentralized, secure names to pieces of content as long as the names don't have to be human-readable (a trilemma tahoe's designer zooko would formalize as 'zooko's triangle', which looked unsolvable until satoshi nakamoto found a solution). it may or may not be relevant that merkle's foundational patent expired in 01996; i think it's maybe more relevant that napster took off hugely in 02000, and suddenly decentralized and peer-to-peer systems that didn't have a central naming authority became an extremely fashionable thing to work on

5. larry got pissed off at tridge for trying to make a bitkeeper-compatible system and revoked the bitkeeper license after 5 years of people using it for linux. linus tried a bunch of the new free-software decentralized version control systems, including monotone, but none of them were adequate, so he decided to make a really stupid, basic version control system that would work well enough, and that was git

6. some kids started github, and they did a really good job of building a new kind of forge, and that took a lot of the pain out of using git. also because of how they set up the namespace the barrier to starting a new project there was much lower than on sourceforge, because you could call the project, like, 'notes', and because it was inside the namespace of your username there was no implicit claim to be the one and only notes project for the world.

you could definitely argue that critical mass for the internet was the thing that triggered so much interest in decentralized systems. but then again, zooko and ian clarke had spent a decade already trying to figure out how to protect human rights on the internet, and so maybe they were going to build decentralized systems once they figured out how, regardless of how many or how few people they served. or maybe if larry hadn't revoked the license, linus wouldn't have written git, and without linus's superb quality of performance engineering, people would have kept using svn except in cases where they really needed decentralization, and maybe mercurial wouldn't have become decently performant without competition from git. or maybe without 9/11 the internet would have developed in a totally different direction. i don't really know what other paths history might have taken


Hi Kragen!

Don't forget Darcs, Bazaar and Mercurial. I think it's the needs of Open Source collaboration that drove this convergent evolution of DVCS, and the real conceptual breakthrough was getting rid of RCS-like sequential revision numbers.

The commercial world lagged. Certainly Apple were late adopters: Xcode didn't support git in 2010, when Subversion was the only choice, and Microsoft of course was a late if enthusiastic adopter because of its Ballmer-era aversion to anything Linux.

I personally prefer Fossil and used its forge-like CVStrac in the guise of Gittrac for years, but for better or worse Git's tooling integration won out.


hi fazal!

i mostly agree, and certainly didn't mean to slight darcs, bazaar and mercurial; darcs in particular was my version tracker of choice before switching to git

but i think 'the needs of Open Source collaboration' are somewhat more plastic and historically contingent than you imply. if you read producing open source software you'll see a snapshot of the dominant social practices of open source collaboration in the world git and bazaar were born into (which of course you also experienced, but others reading this comment may not have). those practices still survive in places, like netbsd and debian

arch, git, and family were designed to support a different set of social practices, practices that were at the time marginal in part because of the practical difficulty of applying them without software support. tom lord's radical program was to change the software landscape on which open-source collaboration happened in order to make those social practices viable

i agree that globally sequential revision numbers are incompatible with decentralization in the pre-nakamoto world, because they demand consensus, and decentralized consensus was infeasible until nakamoto. it's still probably too costly for this purpose
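
a toy illustration of why content-derived names sidestep that consensus problem (made-up python, not any real vcs's code):

    import hashlib

    def content_id(parent_id, change):
        # a revision's name is derived from what it contains and points at, so
        # two offline clones never have to agree on "what number comes next"
        return hashlib.sha1(parent_id.encode() + change).hexdigest()

    head = "0" * 40
    alice = content_id(head, b"alice's change")
    bob = content_id(head, b"bob's change")
    assert alice != bob  # distinct work gets distinct names without asking a server

    # with globally sequential numbering, both clones would have minted "r101"
    # and somebody would have to arbitrate the clash at merge time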


I'm reading fmajid as meaning that decentralization is antithetical to the control that companies require - so it must grow elsewhere.

By "dominant social practices" I think you mean cathedral as opposed to bazaar? (esr)

So it looks to me like: internet adoption made decentralization possible; a decentralization practice arose, exemplified by linus and publicised by esr, thus creating decentralization-tool demand; and then all the technical progress you mention was harnessed, including ideas applied from other fields, like merkle trees that were originally for encryption, creating decentralization supply.


I don't think decentralization is the issue, after all most companies adopted Git eventually (or a forked and rewritten Mercurial in the case of Meta). It's just they did not feel the burning need the way some distributed teams like the Linux kernel team did.

And Kragen is right that open-source communities vary widely in culture and processes. Linux has its email patch based workflow, as does Sourcehut. There is the popular pull request model invented or at least popularized by GitHub. OpenBSD still uses its bizarre CVS workflow and seems happy with it.


pretty much

keep in mind, tho, that teamware was decentralized version control inside sun back in the 80s or very early 90s, so i don't think it's antithetical to something companies require

i've never used teamware and i don't have a good handle on how it worked

with respect to merkle trees, it's true that encryption was merkle's original intended use for them, but i don't know if anyone has ever used them for that except as an experiment. surely not since rsa was published


Good point about teamware, though I think sun was a very unusual company, and particularly network focussed. Also, the internet itself had a decentralized design (for military robustness, IIRC).

Maybe making the core data structure be a merkle tree (not a separate data structure for validation) was new... at least, for version control? It seems monotone used merkle trees as an overlay https://decomposition.al/blog/2019/05/31/how-i-learned-about...

I'm mainly going by larry, who told me that git's data structure wasn't from bitkeeper (but was "all his", IIRC). The significance is that this data structure can't model renames (of course, one could do them as an overlay or decoration - though git doesn't)
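
A toy illustration of that limitation, modeling a git tree as nothing more than a map from path to blob hash (the helper below is hypothetical, not anything git ships):

    # nothing in the history records "old.txt became new.txt"; the rename just
    # looks like one path vanishing and another appearing with the same blob hash
    before = {"old.txt": "3b18e512dba79e4c8300dd08aeb37f8e728b8dad"}
    after = {"new.txt": "3b18e512dba79e4c8300dd08aeb37f8e728b8dad"}

    def guess_renames(before, after):
        # tools like `git log --follow` can only infer renames after the fact,
        # e.g. by pairing a deleted path with an added path whose hash matches
        gone = {h: p for p, h in before.items() if p not in after}
        added = [(p, h) for p, h in after.items() if p not in before]
        return [(gone[h], p) for p, h in added if h in gone]

    print(guess_renames(before, after))  # [('old.txt', 'new.txt')]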


Thanks for laying all that out!


sure, i hope i'm not getting too much wrong



