Make your monorepo feel small with Git’s sparse index

arxanas · on Nov 11, 2021

The index as a data structure is really starting to show its age, especially as developers adapt Git to monorepo scale. It's really fast for repositories up to a certain size, but big tech organizations grow exponentially, and start to suffer performance issues. At some point, you can't afford to use a data structure that scales with the size of the repo, and have to switch to one that scales with the size of the user's change.

I spent a good chunk of time working around the lack of sparse indexes in libgit2, which produced speedups on the order of 500x for certain operations, because reading and writing the entire index is unnecessary for most users of a monorepo: https://github.com/libgit2/libgit2/issues/6036. I'm excited to see sparse indexes make their way into Git proper.

Shameless plug: I'm working on improving monorepo-scale Git tooling at https://github.com/arxanas/git-branchless, such as with in-memory rebases: https://blog.waleedkhan.name/in-memory-rebases/. Try it out if you work in a Git monorepo.

stormbrew · on Nov 11, 2021

> I'm working on improving monorepo-scale Git tooling at https://github.com/arxanas/git-branchless

I'm intrigued by this but the readme could maybe use some work to describe how you envision it being used day-to-day? All the examples seem to be about using it to fix things but I'm not at all clear how it helps enable a new workflow.

Even if it was just a link to a similar tool?

arxanas · on Nov 11, 2021

Thanks for the feedback. I also received this request today to document a relevant workflow: https://github.com/arxanas/git-branchless/issues/210. If you want to be notified when I write the documentation (hopefully today?), then you can watch that issue.

There's a decent discussion here on "stacked changes": https://docs.graphite.dev/getting-started/why-use-stacked-ch..., with references to other articles. This workflow is sometimes called development via "patch stack" or "stacked diffs". But that's just a part of the workflow which git-branchless enables.

The most similar tool would be Mercurial as used at large companies (and in fact, `git-branchless` is, for now, just trying to get to feature parity with it). But I don't know if the feature set which engineers rely on is documented anywhere publicly.

I use git-branchless 1) simply to scale to a monorepo, because `git move` is a lot faster than `git rebase`, and 2) to do highly speculative work and jump between many different approaches to the same problem (a kind of breadth-first search). I always had this problem with Git where I wanted to make many speculative changes, but branch and stash management got in the way. (For example, it's hard to update a commit which is a common ancestor of two or more branches. `git move` solves this.) The branchless workflow lets me be more nimble and update the commit graph more deftly, so that I can do experimental work much more easily.

Buttersite · on Nov 12, 2021

Maybe this is a sign that monorepos are the problem?

arxanas · on Nov 12, 2021

Why would this indicate that monorepos are the problem?

chongli · on Nov 12, 2021

It’s a philosophical conflict. Git was designed for distributed, decentralized version control and collaboration. Monorepos are inherently centralized. Thus git is the wrong tool for the job and all of this effort to adapt git to monorepo workflows runs counter to its philosophy.

smrq · on Nov 12, 2021

I think you're conflating two different forms of "centralized". There is nothing inherently un-git about centralizing a bunch of code in a single repo; the decentralization that git is all about is that the repo state is decentralized across multiple machines.

status_quo69 · on Nov 12, 2021

I dunno, the linux kernel is pretty big and centralized by that definition, probably a monorepo by some definitions. I don't think this is a square peg round hole problem, I think that the operations that git must do coupled with the inevitable massive files that every company shoves into their repos (assets, binaries, you name it) makes git chug quite a bit.

arxanas · on Nov 12, 2021

There are decentralized monorepos, such as gecko-dev (https://github.com/mozilla/gecko-dev), which presumably has several forks in products like Iceweasel.

I think the monorepo workflows which Git isn't good at are things like branching, code review, and feature/version management. But there's no reason that Git should have to be slow just because a repository is large. It's more things like "merge commits don't scale in a monorepo", which I would agree with, but that's not related to the performance of the index data structure.

harvie · on Nov 11, 2021

I hope one day all of this will be as easy as in SVN. eg.:

I have repository https://example.com/myrepo

And i can simply do:

svn co https://example.com/myrepo/some/directory/

And i can work with that subdirectory as if it was actual repo. Completely transparently.

This i really miss in git.

Gigachad · on Nov 11, 2021

I'm working in hell right now. The current company has the site frontend, backend, and tests in separate repos and it's basically impossible to do anything without force merging because the build is broken without a chicken and egg situation between the 3 pull requests.

laurent123456 · on Nov 11, 2021

I worked at a company that not only did that, but also decided to split the main web app into multiple repos, one per country. It was so much fun to do anything in this project.

xorcist · on Nov 11, 2021

Now, that's a microservice if there ever was one!

Iwan-Zotow · on Nov 12, 2021

that is some good office porno!

hinkley · on Nov 11, 2021

You might want the front end and backend to be compatible across revision numbers, so that’s less of a problem and more of an impedance mismatch with your process.

The tests are a trickier problem, in theory you could lock version numbers (the new behavior increments to the new tests) but tests have bugs, and this doesn’t solve two people making behavioral changes at the same time.

Short of merging the tests into the respective projects, going full feature toggle for all breaking changes is the only other lever that comes to mind.

Gigachad · on Nov 11, 2021

My thinking is that the tests _have_ to be merged with frontend since they do stuff like "Find the button with the content 'continue' and then click it" There is no such thing as backwards compatible changes to button content but yeah you are right, it would be possible to create backwards compatible API changes but I don't know if it's even worth it since we are the only users of the api, and it is only used for the frontend.

nitrogen · on Nov 12, 2021

There is no such thing as backwards compatible changes to button content

If your form html is reasonably semantically structured, and/or you give meaningful IDs to each interactive element, then your tests can be made somewhat robust against content changes. E.g. "click button #next-step" rather than "click the word Next".

If your frontend is a desktop or mobile app, look for ways to introspect the widget hierarchy to do the same thing (I know this is possible in e.g. Swing).

Vinnl · on Nov 12, 2021

But generally you'll want to emulate how users use your application, which is looking for a button with a particular label (bonus points if your tests are looking for the accessible label), rather than something with a given ID.

erik_seaberg · on Nov 11, 2021

This. Deployments and rollbacks are not atomic, so atomic commits do not make frontend+backend breaking changes safe.

williamvds · on Nov 11, 2021

With shallow checkouts cloning is much quicker. You could try combining it with sparse checkouts too. You can even have Git fetch the full history in the background, and from a quick test you can do stuff like commit while it's fetching. Obviously the limited history means commands like log and blame will be inaccurate until it's done.

  $ git clone --depth=1 <url>
  $ cd repo
  $ git fetch --unshallow &
  $ <do work>

vtbassmatt · on Nov 11, 2021

This is really expensive for the server (the unshallow step specifically). Consider a partial clone here instead.

williamvds · on Nov 12, 2021

Is unshallowing any more expensive than a normal clone? The shallow clone is for the benefit of the user so they can get to work sooner.

vtbassmatt · on Nov 12, 2021

It is, because of the way the server has to recompute packfiles. I realize that you may choose to optimize for the user over the server, but a partial (blobless) clone should satisfy both.

erik_seaberg · on Nov 12, 2021

Does the server rule out sending a packfile that you only need 95% of? That seems like a good thing to optimize.

zwieback · on Nov 11, 2021

Yeah, that's really the one thing I miss from my SVN days. I'm also still using Perforce, which can do even crazier things with workspace mappings.

xxpor · on Nov 11, 2021

As someone raised on git, and is seriously confused/scared every time he has to use svn, I'm not sure what you mean. I use the folder git creates as a repo too?

afranchuk · on Nov 12, 2021

The subtlety is that the checkout command as written is checking out only a subdirectory, not the entire repository. Look closely at the given checkout url.

xxpor · on Nov 12, 2021

Ahhhh, that's quite interesting. I could see why that'd be a big deal for a monorepo.

williamvds · on Nov 12, 2021

Per my understanding of SVN, a checkout only downloads the lastest version of all the files ("pristines"). Git on the other hand clones just about everything from the remote repository, giving you a fully-fledged repository locally. SVN operations that need more history like log and blame will communicate with the repository for that info.

azornathogron · on Nov 12, 2021

This isn't example isn't about not copying history (partial clone on the time dimension) it is about not copying the full directory tree but only a subdirectory (partial clone on the spatial dimension).

aranchelk · on Nov 11, 2021

The more I've come to rely on techniques like canary testing and opt-in version upgrades the more firmly I believe one of the main motivations for monorepos is flawed: at any given time there may not be a fact of the matter as to which single version of an app or service is running in an environment.

At places I’ve worked when we thought about canary testing we ignored the fact that there were multiple versions of the software running in parallel, we classified it as part of an orchestration process and not a reality about the code or env, but we really did have multiple versions of a service running at once, sometimes for days.

Similarly if you’ve got a setup where you can upgrade users/regions/etc piecemeal (opt-in or by other selection criteria) I don’t know how you reflect this in a monorepo. I’m curious how Google actually does this as I recall they have offered user opt-in upgrades for Gmail. My suspicion is this gets solved with something like directories ./v2/ and ./v3/ — but that’s far from ideal.

jeffbee · on Nov 11, 2021

> the main motivations for monorepos is flawed: at any given time there may not be a fact of the matter as to which single version of an app or service is running in an environment.

Your understanding of the motivations for monorepo is flawed. I've never heard anyone even advocate for this as a reason for monorepos. For some actual reasons people use monorepos, see https://danluu.com/monorepo/

Regarding your question, which I re-emphasize has got nothing to do with the arrangement of the source code, the solution is to simply treat your protocol as API and follow these rules: 1) Follow "Postel's Law", accepting to the best of your abilities unknown messages and fields; 2) never change the meaning of anything in an existing protocol. Do not change between incompatible types, or change an optional item to required; 3) Never re-use a retired aspect of the protocol with a different meaning; 4) Generally do not make any incompatible change to an existing protocol. If you must change the semantics, then you are making a new protocol, not changing the old one,

> I don’t know how you reflect this in a monorepo

You don't. Why would the deployed version of some application be coded in your repo? It's simply a fact on the ground and there's no reason for that to appear in source control.

aranchelk · on Nov 11, 2021

We may just be talking past each other, but in the link you provided, sections "Simplified dependencies" and (to a lesser extent) "Cross-project changes" are pretty much exactly what I'm talking about.

joshuamorton · on Nov 11, 2021

They aren't, because those discussions are all related to link-time stuff (if I update foo.h and bar.c that depends on foo.h, I can do so atomically, because those are built into the same artifact).

As soon as you discuss network traffic (or really anything that crosses an RPC boundary), things get more complicated, but none of that has anything to do with a monorepo, and monorepos still sometimes simplify things.

So there's a few tools that are common: feature flags, 3 stage-rollouts, and probably more that are relevant, but let's dive into those first two.

Feature "flags" are often dynamically scoped and runtime-modifiable. You can change a feature flag via an RPC, without restarting the binary running. This is done by having something along the lines of

    if (condition_that_enables_feature()) {
        do_feature_thing()
    } else {
        do_old_thing()
    }

A/B testing tools like optimizely and co provide this, and there are generic frameworks too. `condition_that_enables_feature()`, here is a dynamic function that may change value based on the time of day, the user, etc. (think something like `hash(user.username).startswith(b'00') and user.locale == 'EN'`). The tools allow you to modify these conditions and push and change the conditions all without restarts. That's how you get per-user opt-in to certain behaviors. Fundamentally, you might have an app that is capable of serving two completely different UIs for the same user journey.

Then you have "3-phase" updates. In this process, you have a client and server. You want to update them to use "v2" of some api, that's totally incompatible with v1. You start by updating the server to accept requests in either v1 or v2 format. That's stage one. Then you update the clients to sent requests in v2 format. That's stage two. Then you remove all support for v1. That's stage three.

When you canary a new version of a binary, you'll have the old version that only supports v1, and the canary version that supports v1 and v2. If it's the server, none of the clients use v2 yet, so this is fine. If it's the client, you've already updated the server to support v2, so it works fine.

Note again that all of this happens whether or not you use a monorepo.

lmm · on Nov 12, 2021

> When you canary a new version of a binary, you'll have the old version that only supports v1, and the canary version that supports v1 and v2. If it's the server, none of the clients use v2 yet, so this is fine. If it's the client, you've already updated the server to support v2, so it works fine.

But the supposed benefit of the monorepo was that you could update everything from v1 to v2 in one atomic commit, and that hasn't actually happened. You have to make the changes in exactly the same way as if your client and server were in separate repositories; indeed it may be harder to get your integration tests to test every combination that will actually be deployed. So when there's a network boundary in between - or more to the point a deployment boundary in between - you end up with the costs of the monorepo but not the benefit.

joshuamorton · on Nov 12, 2021

> But the supposed benefit of the monorepo was that you could update everything from v1 to v2 in one atomic commit

No, a benefit of a monorepo is that you can update your build-time dependencies atomically. No one has claimed that a monorepo lets you magically do atomic RPC API migrations.

> You have to make the changes in exactly the same way as if your client and server were in separate repositories; indeed it may be harder to get your integration tests to test every combination that will actually be deployed.

It won't though: if you cover both code paths at each stage of the three stages, you'll cover all possibilities (its also not clear that you need integration tests to cover this, you can fake the client or server and test against both versions and be fine). This is the same if you've got multi or mono-repo, although with a monorepo there's a decent argument that maintaining integration tests is easier, as you can build both the client and server from within the repo.

> So when there's a network boundary in between - or more to the point a deployment boundary in between - you end up with the costs of the monorepo but not the benefit.

This presumes that there are some costs. I'd counter that there's no real difference in this situation, and so when choosing mono- vs. multi-repo, this kind of situation shouldn't influence your decision, because it doesn't matter. You should base your decision on the cases where the choice has an impact.

lmm · on Nov 12, 2021

> No one has claimed that a monorepo lets you magically do atomic RPC API migrations.

Plenty of people have claimed and continue to claim that a monorepo lets you magically do atomic API migrations, and skate over how the fact that many APIs go over RPC these days.

> It won't though: if you cover both code paths at each stage of the three stages, you'll cover all possibilities (its also not clear that you need integration tests to cover this, you can fake the client or server and test against both versions and be fine). This is the same if you've got multi or mono-repo, although with a monorepo there's a decent argument that maintaining integration tests is easier, as you can build both the client and server from within the repo.

This relies on assuming that you perfectly maintain the previous codepath as-is, otherwise your tests won't cover what's actually going to run. In a multirepo setup the idea that you run your integration tests against version X of server code and version Y of client code is already something you're conscious of (and you absolutely do need to integration test your real client with your real server, otherwise you won't uncover your false assumptions about the behaviour of one or the other), so testing the new release of the server with the unmodified old release of the client is something you can do very naturally, whereas in a monorepo you have to make a very deliberate effort to test anything other than current client against current server.

numbsafari · on Nov 12, 2021

> Plenty of people have claimed and continue to claim that a monorepo lets you magically do atomic API migrations, and skate over how the fact that many APIs go over RPC these days.

Nobody that knows what they are talking about makes this claim. People making this claim are confused and misunderstand. Just because there are a group of people who don't understand the thing they are advocating for, doesn't make the thing any better or worse for whatever purpose they are espousing it for. It just means that they don't know what they are talking about (see also "even a broken clock is right twice a day").

To wit: if I'm using a monorepo to produce my back-end service and the client libraries I ship to my customers, who must update their deployed version themselves, how do these proponents of "magic monorepo atomic RPC rollouts" propose that works exactly? Clearly, it doesn't. Updating your own environments is really no different. All deployments are eventually consistent. C'est la vie.

> whereas in a monorepo you have to make a very deliberate effort

When dealing with API or data model changes, you have to deal with the same, deliberate effort to schedule/stage the release of your changes regardless of approach that is used.

lmm · on Nov 12, 2021

> Nobody that knows what they are talking about makes this claim. People making this claim are confused and misunderstand. Just because there are a group of people who don't understand the thing they are advocating for, doesn't make the thing any better or worse for whatever purpose they are espousing it for. It just means that they don't know what they are talking about (see also "even a broken clock is right twice a day").

Right, but, frankly, that's the statement that a lot of hype and even adoption of monorepos is being driven by people who don't know what they're talking about, for reasons that are nonsense.

> When dealing with API or data model changes, you have to deal with the same, deliberate effort to schedule/stage the release of your changes regardless of approach that is used.

If you align the repository - particularly, the level of granularity at which you tag and at which you check out - with the level of granularity at which you deploy, that can simplify your working practices and make some important details more visible. If your day-to-day development workflow is that you check out version 506 of "everything", than naturally nudges you to thinking of this global monotonic version which is misleading when it comes to actually deploy things. If it's natural to checkout version 523 of foo and version 416 of bar in your day-to-day work, that can help you align more closely with a world where version 523 of foo and version 416 of bar is what you actually have deployed on your servers.

joshuamorton · on Nov 12, 2021

If you're using anything approaching continuous integration/deployment, the value of checking out version 523 of foo and 416 of bar is low, because by the time you test and submit your change, one or the other will have changed.

When you have relatively small services, continuously updated, there's not a need to do the sort of precise cross-version testing you're advocating for, as long as you maintain some level of per-feature forward and backward compatibility across your api surfaces, which is far easier than doing cross-version integration tests for every change.

Put another way, you should decouple yourself from caring about particular versions and think more about which features are supported by what's currently deployed. The question you ask shouldn't be "is version X in production", it's "is feature Foo safe to depend on". Unless something is wrong, feature Foo should work just as well in version X and X+1 and X+2, so caring about precisely which you're using is too low level.

lmm · on Nov 15, 2021

Well, that's a matter of opinion, and by no means the only way to run a development workflow.

I find testing the new version of the server against the actual version of the client that is deployed in production, or vice versa, extremely valuable for avoiding issues.

joshuamorton · on Nov 15, 2021

So let's say you make a change on the client today, how do you ensure that the version of the server you test against today is the same version of the server that you deploy against tomorrow (keep in mind one of your concerns was that at any given time you'd have multiple versions in prod at once)?

lmm · on Nov 16, 2021

You have as part of your deployment workflow that you run the integration tests with exactly the version you're about to deploy. And you have a process step to ensure that you don't deploy both the server and client at the same time. That's quite doable and worthwhile once you step back from the idea that everything needs to be continuously deployed, IME.

joshuamorton · on Nov 12, 2021

> This relies on assuming that you perfectly maintain the previous codepath as-is, otherwise your tests won't cover what's actually going to run.

At each commit, your integration test covers the code path that runs at that CL. So for the client, when you switch from the V1 API to the V2 API, you also swap the integration test. Sure you're no longer testing V1, but that's fine, because the artifact being deployed no longer uses the V1 api.

Since the server needs to support both versions, then you'd probably want, for some period of time, tests exercising both paths, but its not challenging to do that. You don't touch the existing tests, and solely add a new test case exercising the new path.

There's no need for explicit cross-version testing with particular versions, because the way you stage the upgrades ensures all the cross-compatibility you need.

> so testing the new release of the server with the unmodified old release of the client is something you can do very naturally

No, this happens perfectly naturally in the monorepo case. When you upgrade the server to support both APIs, you haven't yet touched the client, so you're naturally testing it on the unmodified old client behavior. I'll reiterate, by doing the upgrade in the three-phase way I mentioned, this isn't a problem, and I'm not sure why you think it is.

lmm · on Nov 15, 2021

> Since the server needs to support both versions, then you'd probably want, for some period of time, tests exercising both paths, but its not challenging to do that. You don't touch the existing tests, and solely add a new test case exercising the new path.

But that's a manual, deliberate process. You're effectively recreating what your VCS does for you by having explicit separate codepaths for each API version.

joshuamorton · on Nov 15, 2021

Can you explain to me how you'd avoid having said separate codepaths and tests in a multi-repo situation?

You'd still need to update the server to support both API versions during the transition period. We agree the change can't be made atomically in a monorepo, but it also can't be done atomically in a multi-repo environment.

So, since you have two code paths in prod, you also need them in your tests. It's a manual process either way, and completely independent of your vcs.

lmm · on Nov 16, 2021

Say you currently have server version 5 and client version 3 deployed. You've already tested those. Now you make a change to the shared protocol, and you decide you're going to deploy the server first and then the client (or vice versa). So you bump the server version to 6 in your integration tests, check if they pass, then deploy server version 6 (progressively, but that's fine, you've tested both the combination of client 3 and server 5 and client 3 and server 6). Then you bump the client version to 4 in your integration tests, check if they pass, and deploy client version 4.

Importantly, this is the completely natural way to do every release and deploy; you'd have to go out of your way not to do this. Even if you thought version 6 didn't contain any protocol changes, you'd still deploy it exactly the same way. So you end up testing the real combination of code that's about to get deployed, before you deploy it, every time.

rfoo · on Nov 12, 2021

> monorepo lets you magically do atomic API migrations

I'd argue that there are many more APIs which just don't go over network. With monorepo you can do cross-component refactor very easily, this is especially handy when you are working on a widely-used library.

Imagine that you are working on your language's standard library and you want to fix a small API mis-design.

lmm · on Nov 12, 2021

I think having a single repo for a library that's versioned and released as a single component makes a lot of sense, but combining separate services that are deployed separately into a single repository is not a good idea. The latter is what people generally mean by "monorepo" AFAICS.

rfoo · on Nov 13, 2021

What I mean is you put your library and all services using it in a single repo.

lmm · on Nov 15, 2021

Well, you said language standard library, so that would mean keeping every program written in that language in a single repo, which seems obviously absurd.

howinteresting · on Nov 11, 2021

In general, it is a good practice to try and maximize compile-time resolution of dependencies and minimize network resolution of them. Services are great when the working set doesn't fit in RAM or the different parts have different hardware needs, but trying to make every little thing its own service is foolish.

Doing so makes this a less pertinent problem.

coryrc · on Nov 11, 2021

Only binaries are released. Binaries are timestamped and linked to a changelist.

The "opt-in upgrades" are all live code. I know more than a few "foo" "foo2" directories, but I wouldn't want an actively-delivered, long-running service to be living in a feature branch so I would still expect anyone to be using a similar naming scheme.

ASinclair · on Nov 11, 2021

> I’m curious how Google actually does this

Branches are cut for releases. Binaries are versioned.

eximius · on Nov 11, 2021

that doesnt seem like a problem with monorepos (or otherwise).

youd just need to tag your releases, right?

aranchelk · on Nov 11, 2021

I don't think it's a problem, rather I'm challenging a touted benefit.

In large monorepos the supposition is you've got a class of compatible apps and services bundled together. Version dependencies are somewhat implicit: the correct version for each project to interoperate is whatever was checked-in together in your commit.

I don't know how it works in practice at different orgs, but there's certainly an idea I've heard repeated that you can essentially test, build, and deploy your monorepo atomically, but the reality in my experience is you can't escape the need to think about compatibility across multiple versions of services once you use techniques like canary testing or targeted upgrades.

staticassertion · on Nov 11, 2021

You still have to think about compatibility across versions - that does not go away in a monorepo, and you should use protocols that enforce compatible changes. The monorepo just tells you that all tests pass across your entire codebase given a change you made to some other part.

eximius · on Nov 11, 2021

thats fair. you require reasonable deployment intervals and may need to wait to merge based on deployment. Workflow actions that can check whether a commit is deployed in a given environment are invaluable

jeffbee · on Nov 11, 2021

> may need to wait to merge based on deployment

Again, this fundamentally misunderstands the purpose of the source code repo and how it relates to the build artifacts deployed in production. If you are waiting for something to happen in production before landing some change, that tells me right there you have committed some kind of serious error.

joshuamorton · on Nov 11, 2021

I'd caveat this with code change.

Its very common to need to wait for some version of a binary to be live before updating some associated configuration to enable a feature in that binary (since the dynamic configuration isn't usually versioned with the binary). It's possible that some systems exist that fail quiet, with a non-existent configuration option being silently ignored, but the ones I know of don't do that.

eximius · on Nov 12, 2021

I'm not sure I agree?

If I'm waiting on an update in a 3rd party library... you have to wait for the 3rd party library to ship. Different binaries are effectively that.

So, yea, I could also feature flag everything, flip the flag when the conditions are right, but sometimes its easier /simpler to just wait to merge.

sroussey · on Nov 11, 2021

This is still true, but to a matter of degree. Even in the feature flagged deploys mixed with canary the permutations are all evident, and ideally tested.

Also, you wouldn’t expect a schema change to occur with code that requires it. Those will need to happen earlier.

Real systems are complex. A monorepo is one attempt at capping the complexity to known permutations. For smaller teams, it might collapse to a single one.

haberman · on Nov 11, 2021

> The index file stores a list of every file at HEAD, along with the object ID for its blob and some metadata. This list of files is stored as a flat list and Git parses the index into an array.

I'm surprised that the index is not hierarchical, like tree objects in Git's object storage.

With tree objects (https://git-scm.com/book/en/v2/Git-Internals-Git-Objects#_tr...), each level of the hierarchy is a separate object. So you would only need to load directories that are interesting to you. You could use a single hash compare to determine that two directories are identical without actually recursing into them.

In particular, I can't understand why you would need a full list of all files to create a commit. If your commit is known not to touch certain directories, it should be able to simply refer to the existing tree object without loading or expanding it.

I guess that's what this sparse-index work is doing. I'm just surprised it didn't already work that way.

arxanas · on Nov 11, 2021

It makes more sense if you think of the index as a structure meant specifically to speed up `git status` operations. (It was originally called the "dircache"! See https://github.com/git/git/commit/5adf317b31729707fad4967c1a...) We desperately want to reduce the number of file accesses we have to make, so directly using the object database and a tree object (or similar structures) would more than double file accesses.

There's performance-related metadata in the index which isn't in tree objects. For example, the modified-time of a given file exists in its index entry, which can be used to avoid reading the file from disk if it seems to be unmodified. If you have to do a disk lookup to decide whether to read a file from disk, then the overhead is potentially as much as the operation itself.

There's also semantic metadata, such as which stage the file is in (for merge conflict resolution).

It's worth noting that you can turn on the cache tree extension (https://git-scm.com/docs/index-format#_cache_tree) in order to speed up commit operations. It doesn't replace objects in the index with trees, but it does keep ranges of the index cached, if they're known to correspond to a tree.

junon · on Nov 11, 2021

What I'd really like to see is Git have the ability to consolidate repeat submodules down into a single set of objects in the super repository. Currently cloning the same submodule results in a copy of the repository for each path, which is absurd.

It's been something on my list to address on the mailing lists for a while, just haven't had time.

nightpool · on Nov 11, 2021

Is there a point in having a monorepo if you're all in on the microservices approach? I'm a big microservices skeptic, but as far as I understand it, the benefit of microservices is independence of change & deployment enforced by solid API contracts—don't you give that all up when you use a monorepo? What does "Monorepo with microservices" give you that a normal monolithic backend doesn't?

(Obviously e.g. an image resizer or something else completely decoupled from your business logic should be a separate service / repo anyway—my point is more along the lines of "If something shares code, shouldn't it share a deployment strategy?")

echelon · on Nov 11, 2021

> Is there a point in having a monorepo if you're all in on the microservices approach?

Monorepos are excellent for microservices.

- You can update the protobuf service graph (all strongly typed) easily and make sure all the code changes are compatible. You still have to release in a sensible order to make sure the APIs are talking in an expected way, but this at least ensures that the code agrees.

- You can address library vulns and upgrades all at once for everything. Everything can get the new gRPC release at the same time. Instead of having app owners be on the hook for this, a central team can manage these important upgrades and provide assistance / pairing for only the most complex situations.

- If you're the one working on a very large library migration, you can rebase daily against the entire fleet of microservices and not manage N-many code changes in N-many repos. This makes huge efforts much easier. Bonus: you can land incrementally across everything.

- If you're the one scoping out one of these "big changes", you can statically find all of the code you'll impact or need to understand. This is such an amazing win. No more hunting for repos and grepping for code in hundreds of undiscovered places.

- Once a vuln is fixed, you can tell all apps to deploy after SHA X to fix VULN Y. This is such an easy thing for app owners to do.

- You can collect common service library code in a central place (eg. internal Guava, auth tools, i18n, etc). Such packages are easy to share and reuse. All of your internal code is "vendored" essentially, but you can choose to depend on only the things you need. A monorepo only feels heavy if you depend on all the things (or your tooling doesn't support git or build operations - you seriously have to staff a monorepo team).

- Other teams can easily discover and read your code.

Monorepos are the best possible way to go as long as you have the tooling to support it. They fall over and become a burden if they're not seriously staffed. When they work, they really work.

echelon · on Nov 11, 2021

Should have also mentioned: all those changes to cross cutting library code will trigger builds and tests of the dependent services. You can find out at once what breaks. It's a superpower.

nightpool · on Nov 11, 2021

None of this addresses my question—what benefits do you get from having monorepo-with-microservices over a monolithic backend? All of the things you mentioned would be even easier with a monolithic backend.

echelon · on Nov 11, 2021

At scale, you almost never want a monolith.

- Monoliths force you to coordinate deploys, which can be a pain. It's difficult to verify your changes in concert with others.

- Bugs from one team can cause a rollback for everyone.

- One N+1 query, nonlinear algorithm, or GC-contentious endpoint can send the entire system into a spiral and cause non-linear behavior.

- You don't need to scale everything the same (services, databases, etc.) Your authentication pieces need massive scale: 10,000 QPS in an active-active configuration. Your random endpoint for the "XYZ promotion" does 0.5 QPM.

- Engineers that build platforms can operate in Java/Go/Rust, engineers that build product can build in Ruby/Python/Node.

- Payment data growing faster than user data? Scale your payment pieces. Shard their databases. They're safely cordoned off and isolated from anything else. The payments team(s) can focus on this one really hard problem without anyone else knowing or caring at all.

- Engineers can have privileged access to the resources their team consumes and needs. You don't want your customer loyalty team having access to payment team data. Or your appointments team having access to user accounts and authentication.

- You can hire for teams/concerns, and you have an interesting metric: number of services per engineer or team. It's actually pretty useful.

- Mistakes in one corner can sometimes be completely isolated and not take down the entire company. This isn't always the case. Critical infra (auth, service discovery, networking) can take everything out. But your appointments or customer success app won't have any impact to the other business areas. Also, it's great to have those super mission-critical folks focused on only the extremely important function they serve.

- Tech debt can live "under the covers". External teams calling your services might not know about the mess or the ongoing migrations under the hood. They don't need to know. (Granted, if your data model at scale sucks, this doesn't really afford much. But it is an easier puzzle to decompose.)

tl;dr - teams are focused on things they own, scope is controlled, blast radius is limited, pieces scale independently as needed.

staticassertion · on Nov 11, 2021

They solve different problems, some of which may overlap I suppose.

For one thing you get clear ownership of deployed code. There isn't one monolithic service that everyone is responsible for babying, every team can baby their own, even if they all share libraries and whatnot.

You also get things like fault isolation and fine grained scaling too.

johnmaguire · on Nov 11, 2021

Monorepo with microservices gives you the ability to scale and perform SRE-type maintenance at a granular level. Teams maintain responsibility for their service, but are more easily able to refactor, share code, pull dependencies like GraphQL schemas into the frontend, etc. across many services.

nightpool · on Nov 11, 2021

So basically each team has to reinvent devops from the ground up, and staff their own on call rotation, instead of having a centralized devops function that provides a stable platform? That sounds horrendous.

Although, that said, I can at least see the benefits of the "1 service per team" methodology, where you have a dedicated team that's independently responsible for updating their service. I'm more used to associating "microservices" with the model where a single team is managing 5 or 6 interacting services, and the benefits there seem much smaller.

johnmaguire · on Nov 11, 2021

> That sounds horrendous.

Different teams can make their own decisions, but as a developer on a team that ran our own SRE, I found it came with many advantages. Specifically, we saw very little downtime, and when outages did occur were very prepared to fix it as we knew the exact state of our services (code, infrastructure, recent changes and deploys.) Additionally, we had very good logging and metrics because we knew what we'd want to have in the event of a problem.

And I'm not sure what you mean "from the ground up." We were able to share a lot of Ansible playbooks, frameworks, and our entire observability stack across all teams.

But I think you may also be missing the rest of my post. This is only one possible advantage. Even if the team doesn't perform their own SRE, these services can be scaled independently - both in terms of infrastructure and codebase - even while sharing code (including things like protocol data structures, auth schemes, etc.)

A service that receives 1 SAML Response from an IdP (identity provider) per day per user may not need the same resources as a dashboard that exposes all SP (service providers) to a user many times a day. And an administration panel for this has still different needs.

Yet, all of these services may communicate with each other.

rsj_hn · on Nov 11, 2021

Yeah, I've seen it used to allow teams to use consistent frameworks and libraries across many different microservices. Think of authentication, DB clients, logging, webservers, grpc/http service front ends, uptime oracle -- there's lots of cross cutting concerns that are shared code among many microservices.

So the next thing you decide to do is create some microservice framework that bundles all that stuff in and allow your microservice team to write some business logic on top. But now 99% of your executables are in this microservice framework that everyone is using, and that's the point where a lot of companies go the monorepo route.

Actually most companies do some mix -- have a lot of stuff in a big repo and then other smaller repos alongside that.

anon9001 · on Nov 11, 2021

This is well written and deserves my upvote, because sparse-checkout is part of git and knowing how it works is useful.

That said, there's absolutely no reason to structure your code in a monorepo.

Here's what I think GitHub is doing:

1) Encourage monorepo adoption

2) Build tooling for monorepos

3) Selling tooling to developers stranded in monorepos

Microsoft, which owns GitHub, created the microsoft/git fork linked in the article, and they explain their justification here: https://github.com/microsoft/git#why-is-this-fork-needed

> Well, because Git is a distributed version control system, each Git repository has a copy of all files in the entire history. As large repositories, aka monorepos grow, Git can struggle to manage all that data. As Git commands like status and fetch get slower, developers stop waiting and start switching context. And context switches harm developer productivity.

I believe that Google's brand is so big that it led to this mass cognitive dissonance, which is being exploited by GitHub.

To be clear, here are the two ideas in conflict:

* Git is decentralized and fast, and Google famously doesn't use it.

* Companies want to use "industry standard" tech, and Google is the standard for success.

Now apply those observations to a world where your engineers only use "git".

The result is market demand to misuse git for monorepos, which Microsoft is pouring huge amounts of resources into enabling via GitHub.

It makes great sense that GitHub wants to lean into this. More centralization and being more reliant on GitHub's custom tooling is obviously better for GitHub.

It just so happens that GitHub is building tools to enable monorepos, essentially normalizing their usage.

Then GitHub can sell tools to deal with your enormous monorepo, because your traditional tools will feel slow and worse than GitHub's tools.

In other words, GitHub is propping up the failed monorepo idea as a strategy to get people in the pipeline for things like CodeSpaces: https://github.com/features/codespaces

Because if you have 100 projects and they're all separate, you can do development locally for each and it's fast and sensible. But if all your projects are in one repo, the tools grind to a halt, and suddenly you need to buy a solution that just works to meet your business goals.

ratww · on Nov 11, 2021

> That said, there's absolutely no reason to structure your code in a monorepo.

Bullshit. There are very good reasons to use it in some situations. My company is using it and it's a tremendous productivity boon. And Git works perfectly fine for smaller scales.

Obviously, "because Google does it" is a terrible reason. But it's disingenuous to say that's the only reason people are doing it. Not everyone is a moron.

dboreham · on Nov 11, 2021

> it's a tremendous productivity boon

Curious to hear more specifics on this. Did you migrate from separate repos to a monorepo and subsequently measure improved productivity as a result?

ratww · on Nov 11, 2021

Correct. We measured how long it took to integrate changes in the core libraries into the consumers (multiple PRs) versus doing it on a monorepo (single PR for change). We ran them together for a couple weeks and the difference was big.

The biggest differences were in changes that would break the consumers. For this cases we had to go back and patch the original library, or revert and start from scratch. But even in the easy changes, just the "bureaucracy" of opening tens of pull-requests, watching a few CI pipelines and getting them approved by different code owners was also large.

Now, whenever we have changes in one of the core libraries, we also run full tests in the library consumers. With tests running in parallel, sometimes it takes 20 minutes (instead of 4, 5 hours) to get a patch affecting all frontends tested, approved and merged into the main branch.

Also, everyone agreed that having multiple PRs open is quite stressful.

nitrogen · on Nov 12, 2021

Now, whenever we have changes in one of the core libraries, we also run full tests in the library consumers.

This can also be done with CI triggers across repos.

ratww · on Nov 12, 2021

For most CI/host combinations this requires an ad-hoc solution involving calling APIs that's hardly trivial, both for triggering the CI and blocking the PR. We didn't want to change our perfectly fine solution. Thus, the monorepo killed a handful of birds with one stone.

anon9001 · on Nov 11, 2021

I'm glad you're having a good experience now, and git as a monorepo will work fine at smaller scales, but you will outgrow it at some point.

When you do, you have two choices. You can either commit to the monorepo direction and start using non-standard tooling that sacrifices decentralization, or you can break up your repo into smaller manageable repos.

I don't have any problem with small organizations throwing everything into one git repo because it's convenient.

My objection is that when you eventually do hit the limits of git, will you choose to break the fundamentals of git decentralization as a workaround? Or will you break up the repo into a couple of other repos with specific purposes?

I don't like that GitHub makes money by encouraging people to make the wrong choice at that juncture.

rsj_hn · on Nov 11, 2021

> I'm glad you're having a good experience now, and git as a monorepo will work fine at smaller scales, but you will outgrow it at some point.

I would say the opposite. A lot of companies are fine with independent teams using their own versions of dependencies and their own versions of core code, but at some point that becomes unmanageable and you need to start using a common set of dependencies and the same version of base frameworks to reduce the complexity. That means pushing a patch to a framework means all the teams are upgraded. Monorepos are the most common solution to enforce that behavior.

Look, this is all dealing with the problem of coordination in large teams. Different organizations have different capacities for coordination, and so it's like squeezing a balloon -- yes, you want more agility to pick your own deps but then the cost of that is dealing with so much complexity when you need to push a fix to a commonly used framework or when a CVE is found in a widely used dep and needs to be updated by 1000 different teams all manually.

There is no "right" way. It's just something organizations have to struggle with because it's going to be costly no matter what, and all that matters is what type of cost your org is most easily able to bear. That will decide whether you use a monorepo or a bunch of independent repos, whether you go for microservices or a monolith, and most companies will do some mix of all of the above.

erik_seaberg · on Nov 12, 2021

Being on a known-good version of a dependency is tech debt that I want my team to pay down when we have the slack. Alpha testing every commit of everything in prod is not our job. Google3 had company-wide build breaks so often (and hours of retest backlog) that teams had a “build cop” whose job was to urgently complain about them. Eventually they had to pick a set of active heavily-used libraries (e.g., bigtable) and set up the build system so most consumers would not depend on them at the very latest changelist.

anon9001 · on Nov 11, 2021

> Monorepos are the most common solution to enforce that behavior.

Yes. This is very accurate and also the problem. Monorepos are being used as a political tool to change behavior, but the problem is that it has severe technical implications.

> There is no "right" way.

With git, there is a "wrong" way, and that's not separating your project into different repos. It causes real world technical problems, otherwise we wouldn't have this article posted in the first place.

> It's just something organizations have struggle with because it's going to be costly no matter what, and all that matters is what type of cost your org is most easily able to bear.

It's not a coin toss whether monorepos will have better or worse support from all standard git tooling. It will be worse every time.

The amount of tooling required to enforce dependency upgrades, code styles, security checks, etc across many repos is significantly less than the amount of tooling required to successfully use a monorepo.

philosopher1234 · on Nov 11, 2021

If you want to play right and wrong, I will say that noe it’s the right way since there is support for sparse checkouts in git.

This isn’t a useful game to play.

ratww · on Nov 12, 2021

> The amount of tooling required to enforce dependency upgrades, code styles, security checks, etc across many repos is significantly less than the amount of tooling required to successfully use a monorepo.

This is completely false. You only need extra tooling on monorepo for gigantic scale. For 90% or maybe even 99% of companies out there using version control, plain git and a "bunch of folders" is all you need, honest to god. This way you can reap most benefits and very few downsides. Detecting changes in folders is pretty easy, as most CIs/CDs have it built-in. Ff not it's easier to program a script than setup multirepo tooling.

eximius · on Nov 11, 2021

If you are in an enterprise setting, you don't need decentralized version control.

So, yea, for companies, monorepos are a no brainer in a lot of ways.

For open source, separate repos makes more sense.

To expand on corporate monorepos, if you can still set up access control (e.g., code owners to review additions by domain) and code visibility (so there isn't _unlimited_ code sharing), then I can't think of a reason to not use monorepos.

ratww · on Nov 11, 2021

When I hit the limits of git then I will worry about it.

One of our tasks when building the monorepo was proving it was possible to split it again. It was trivial and we have tools to help us avoid complexity.

We’re not using Github so that part doesn’t apply to me.

Also, nice of you to assume we’ll get to Google scale, but thanks to the monorepo, I was able to make a few pull requests reducing duplication and reducing line count of app by thousands ever since. So I really don’t see us getting into Google scale anytime soon. We're downsizing.

I also find it ironic that you’re accusing people of “copying Google” in a parent post but you’re the one assuming that everyone will hit Google limits…

anon9001 · on Nov 11, 2021

If you ever do hit a git limit where it's no longer comfortable to keep the whole repo on each developer machine, I would encourage you to split up the repo into separate project-based repos rather than switching to Microsoft's git fork.

As a best practice, there's a reason that Linus started git in a separate repo, rather than as part of the Linux project. The reason is that if you put too many projects into one git repo, and it gets too large, you do eventually hit a scale where it becomes a problem.

A very simple way to mitigate that is to keep each project in its own repo, which you can easily do once you start hitting git scale problems.

Thankfully, one of the original git use cases was to decompose huge svn repos into smaller git repos, so the tooling required is already built in.

> I honestly find it ironic that you’re accusing people of “copying google” in a parent post but you’re the one assuming that everyone will hit Google limits…

I think you got the wrong take there. I'm saying that Google's monorepo approach is only valid because they invested so heavily into building custom tooling to handle it. We don't have access to those tools and therefore shouldn't use their monorepo approach.

If you're going to use git, you're going to have the most success using it as intended, which is some logical separation of "one repo per project" where "project" doesn't grow too out of hand. The Linux kernel could be thought of as a large project that git still handles just fine.

Tragically, I think if Google did opensource their internal vcs and monorepo tooling, they would immediately displace git as the dominant vcs and we would regress back to trunk-based development.

ratww · on Nov 11, 2021

> As a best practice, there's a reason that Linus started git in a separate repo, rather than as part of the Linux project. The reason is that if you put too many projects into one git repo, and it gets too large, you do eventually hit a scale where it becomes a problem.

Nope. The reason they are in a different repositories is that they are completely different things, with virtually zero code shared between them. There is no point in putting them in a single repository. The teams are different, the process is different, the release cycle is different.

The reason people like me use monorepos is not because we mistakenly believe "each company needs only one repo". It's because there is a tangible and measurable advantage in grouping some things together. That's it.

> If you're going to use git, you're going to have the most success using it as intended, which is some logical separation of "one repo per project" where "project" doesn't grow too out of hand.

No, that's patently false and absurd. I had the experience of using one repo per project and it was a horrible experience full of bureaucracy, and I know other people with the same experience. A git monorepo is working fine for my company and has been for years. I know other companies who also feel the same, and no, the "sky is not falling" as you seem to believe.

Claiming that "there can be only valid method, and it's mine" is bullshit. Engineering is not about silver bullets.

> Tragically, I think if Google did opensource their internal vcs and monorepo tooling, they would immediately displace git as the dominant vcs and we would regress back to trunk-based development.

You keep forgetting that not every company operates at the Google level. Git is more than fine for 90% of the cases out there. There is more to tech than FAANG.

> If you ever do hit a git limit where it's no longer comfortable to keep the whole repo on each developer machine, I would encourage you to split up the repo into separate project-based repos rather than switching to Microsoft's git fork.

Sorry to be blunt, but I'm not taking advice from an internet stranger with opinions grounded purely on dogma, and with an axe to grind. I will carefully study my options and take a decision based on my needs. There are no silver bullets in engineering.

IshKebab · on Nov 11, 2021

> you will outgrow it at some point

Given that Google and Microsoft use monorepos that seems unlikely!

anon9001 · on Nov 11, 2021

Google had to build an internal version control system as an alternative to git and perforce to support their monorepo.

Microsoft forked git and layered their own file system on top of it to support a centralized git workflow so that they could have a monorepo.

jayd16 · on Nov 11, 2021

Your logic is circular.

    No one should work on monorepos because...
    monorepos are bad because...
    git can't easily handle them and we shouldn't fix that because...
    No one should work on monorepos...

Clearly there are reasons people like monorepos and it makes sense to update git to support the workflow.

anon9001 · on Nov 11, 2021

That isn't circular. The conclusion should be that git, a decentralized vcs, should not take on changes to make it a centralized vcs.

If you think that git needs to be "fixed" or "updated" to support a centralized vcs server to do partial updates over the network, then I think you've missed the point of git.

dlp211 · on Nov 11, 2021

Having had used both, Google's implementation is IMO the superior version of monorepo. Really, Google's Engineering Systems are just better than anything that I have ever used anywhere else.

anon9001 · on Nov 11, 2021

This is exactly as I'd expect.

If you want a centralized, trunk-based version control, don't use git.

It's funny how each company decides to solve these problems.

Google called in the computer scientists and designed a better centralized vcs for their purposes. Good on them. It'd be great if they open sourced it. So typical of Google to invent their own thing and keep it private.

Microsoft took the most popular vcs (git), and inserted a shim layer to make it compatible with their use case. How expected that Microsoft would build a compatibility shim that attempts to hide complexity from the end user.

Meanwhile, Linux and Git are plugging along just fine, in their own separate repos, even though many people work on both projects.

erik_seaberg · on Nov 12, 2021

Google did not design a VCS from scratch, they adopted Perforce and started replacing its backend as they hit scaling limits. They eventually added some kind of copy-on-write FUSE layer that looks like your workspace contains the entire monorepo, which was pretty cool.

It would be hard to open source if its dependencies only exist on Borg. I think that was a problem with Blaze (now Bazel).

IshKebab · on Nov 11, 2021

> So typical of Google to invent their own thing and keep it private.

Yeah like their build system... Bazel, that's completely closed source.

vtbassmatt · on Nov 12, 2021

Sparse index has literally nothing to do with VFS for Git. The sparse index features are in microsoft/git because it's a convenient, deployed-at-scale test-bed before upstreaming into git/git.

omegalulw · on Nov 11, 2021

Your take is extremely biased. You only just discuss why monorepos are bad.

Here's some of the many reasons why monorepos are excellent:

- Continuous integration. Every project is almost always using the lastest code from other projects and libraries it depends on.

- Builds from scratch are very easy and don't need extravagant tooling.

- Problems due to build versions in dependency management are reduced (everyone is expected to use HEAD).

- The whole organization settles on a common build patterns - so if you want to add a new dependency you wouldn't need to struggle with their build system. Conversely, you need to write lesser documentation on how to build your code - cause that's now standard.

anon9001 · on Nov 11, 2021

Heh, the major problems that I've run into using monorepos in the real world at scale are:

- CI breaks all the time. Even one temperamental test from anywhere else in the organization can cause your CI run to fail.

- Building the monorepo locally becomes very complicated, even to just get your little section running. Now all developers need all the tools used in the monorepo.

- Dependencies get upgraded unexpectedly. Tests aren't perfect, so people upgrade dependencies and your code inevitably breaks.

It's cool that everyone is on the same coding style, but that's very much achievable with a shared linter configuration.

dlp211 · on Nov 11, 2021

Your problem isn't monorepo, it's bad tooling. Tests should only execute against code that changed. Builds should only build the thing you want to build, not the whole repository.

anon9001 · on Nov 11, 2021

Yes!

The problem is choosing a monorepo because the tooling isn't suited for monorepos.

Trying to build a monorepo with git is like trying to build your CRUD web app frontend in c++.

Sure, you can do it. Webassembly exists and clang can compile to it. I wouldn't recommend it because the tooling doesn't match your actual problem.

Or maybe a better example is that it's like deciding the browser widgets aren't very good, so we'll re-render our own custom widgets with WebGL. Yes, this is all quite possible, and your result might get to some definition of "better", but you're not really solving the problem you had of building a CRUD web app.

Can Microsoft successfully shim git so that it appears like a centralized trunk-based monorepo, the way you'd find at an old cvs/svn/perforce shop? Yes, they did, but they shouldn't have.

My thesis is they're only pushing monorepos because it helps GitHub monetize, and I stand by that.

> Tests should only execute against code that changed. Builds should only build the thing you want to build, not the whole repository.

How do you run your JS monorepo? Did you somehow get bazel to remote cache a webpack build into individual objects, so you're only building the changes? Can this even be done with a modern minimization tool in the pipeline? Is there another web packager that does take a remotely cachable object-based approach?

I don't know enough about JS build systems to make a monorepo work in any sensible way that utilizes caching and minimizes build times. If anything good comes out of the monorepo movement, it will be a forcing function that makes JS transpilers more cacheable.

And all this for what? Trunk-based development? So we can get surprise dependency updates? So that some manager feels good that all the code is in one directory?

The reason Linus invented git in the first place was because decentralized is the best way to build software. He literally stopped work on the kernel for 2 weeks to build the first version of git because the scale by which he could merge code was the superpower that grew Linux.

If you YouTube search for "git linus" you can listen to the original author explain the intent from 14 years ago: https://www.youtube.com/watch?v=4XpnKHJAok8

If this is a topic you're passionate about, I'd encourage you to watch that video, as he addresses why decentralizing is so important and how it makes for healthy software projects. It's also fun to watch old Googlers not "get it".

He was right then and he's right now. It's disappointing to see so much of HN not get it.

vtbassmatt · on Nov 12, 2021

You have cause and effect reversed. There are teams/groups/customers running into scale problems with Git, and they aren't interested in switching away from Git. So we're teaching Git to meet their needs. There's not much of a moat (monetization potential) here since we're upstreaming everything to core Git.

Disclosure: I'm the product manager for Git Systems at GitHub, where much of this work is based.

anon9001 · on Nov 12, 2021

I understand that your customers want very large git repositories, so that's the need you're going to serve. As a (very large) bonus, you're going to sell a lot of cloud services to people that can't manage to work with their monorepo on local machines. I won't be surprised if in 5 years, most big companies are just leasing GitHub instances as the only reasonable way to edit/build/test/deploy all of their code from the browser. I think it's a concerning trend if you're not GitHub.

Also, I appreciate that you're upstreaming to core git and not trying to build a technical moat.

But why do so many people want these monorepos?

This seems like a serious developer education problem to me.

If I'm wrong that Google is the cause of this cultural shift toward monorepos, what do you think caused it?

I can't figure out why so many people are married to git but insist on leaning into its weaknesses (large repos) instead of its strengths (smaller composable repos).

In siblings to this thread, we have people reminiscing about the good old days of subversion and asking for git to emulate those features. It's quite frustrating to watch.

I think I've been operating under the illusion that most developers understood git well enough to realize it's a better model than svn. But in reality developers would have been more than happy to use "SvnHub" if that had come along first.

azornathogron · on Nov 12, 2021

> But why do so many people want these monorepos?

Lots of people elsewhere in the thread have given reasons that I won't repeat, but I will say I think it's a very different situation for an engineering-focused company compared to, for example, open source projects.

In a company you get a lot of advantages from uniformity of tools and practices and dependencies across the many projects/products that the company has, and you have the organizational structure to onboard new employees onto that way of working and to maintain the level of uniformity and actually benefit from it.

In open source, every project and even every contributor is sovereign unto themselves. In this situation, the difficulties of cross-project changes are primarily human organizational challenges - even if you could collect a lot of projects into a monorepo, you couldn't get the big benefits from it because you wouldn't be able to get everyone to agree to stick to a single set of tools and do the other things that make monorepos powerful.

I think monorepos are great for companies.

But using git as a basis for a monorepo system is a bad idea for the reasons you suggest. It's totally the wrong tool for the job.

I remember the transition period when (nearly) everyone in the open source world moved over multiple years from subversion - or in some cases cvs - to git. The advantages of decentralised development for open source were really clear, and so was the way git tracked branch and merge history (when git was first developed, subversion barely tracked merge history at all, which made it easy to end up with bad merges after accidentally merging the same change multiple times). And at the scale of repositories in the open source world, git was massively faster than svn. But the speed doesn't scale, and if you've got a monorepo you almost certainly aren't doing real decentralised development, and the merge tracking in subversion was fixed (well... maybe; honestly I haven't been paying enough attention to be sure). So seriously SvnHub would actually be a better basis for the monorepo world. It's almost unfortunate that git took over so comprehensively that now companies want to use it for a different context where it really doesn't shine at all.

CRConrad · on Nov 13, 2021

> subversion barely tracked merge history at all, which made it easy to end up with bad merges after accidentally merging the same change multiple times

Weird. Shouldn't that be just a no-op?

azornathogron · on Nov 13, 2021

It's been more than a decade so I almost certainly oversimplified the problem.

But to give an approximate explanation: Subversion's model of how branches work is fundamentally different to git's model. In fact in a sense subversion doesn't have a concept of branches at all in its core data model. Subversion just gives you a directory tree with a linear history, and then makes it cheap to copy directories around. The recommended layout for a subversion repository looks something like this:

    - /tags
    - /tags/release-v1
    - /tags/release-v2
    - /branches
    - /branches/my-feature-branch
    - /trunk/...

If you want to create a branch you just copy your trunk directory to a new name under /branches. In order to reliably merge back & forth between branches or between a branch and trunk (which are just directories in the repository), you need more than just finding the diff between the latest state of each branch; you really need to know about the historical relationships between them: When the directory was created what was it copied from, what other merges have happened since then? But this information - at least "what merges have already been done" literally wasn't tracked until about subversion 1.5 (which was actually contemporaneous with a lot of the migrations of subversion to git, at least in my recollection).

Some references for you:

- The classic "svn book" about branching: https://svnbook.red-bean.com/en/1.7/svn.branchmerge.using.ht... - note the section at the bottom "The Key Concepts Behind Branching" which notes "First, Subversion has no internal concept of a branch—it knows only how to make copies. When you copy a directory, the resultant directory is only a “branch” because you attach that meaning to it."

- The subversion 1.5 release notes explanation of merge tracking: https://subversion.apache.org/docs/release-notes/1.5.html#me...

To try to place this in time, I was using subversion (and was pretty happy with it!) in 2007. By 2009 I was using git wherever I could. Of course, the widespread migration toward git throughout the open source ecosystem was spread over a lot more years than this. But hopefully that gives some context.

CRConrad · on Nov 14, 2021

Ah, thanks, TIL something again.

Seems (at least the original version you described of) subversion was in some/several/many ways genuinely inferior to git.

But yeah, having thought about it a bit more re-applying the same change again in any versioning system is of course not necessarily "a no-op" if they're not immediately following each other; if other changes that partially reverted the original ones have happened in between. Then re-applying the original changeset would of course re-do that part of the original changes.

erik_seaberg · on Nov 12, 2021

> its strengths (smaller composable repos)

I think this is happening because git submodules have a hard and confusing CLI even compared to the rest of git.

> would have been more than happy to use "SvnHub"

svnmerge.py caused almost as many trainwrecks as it avoided. I don’t think resolving merge conflicts can be made to work so long as the repo relies on being informed of all file moves and copies, because devs just didn’t do that consistently.

vtbassmatt · on Nov 12, 2021

Thank you for your reply. I think you raise some great points, and I'll respond to a few that I have knowledge of.

> But why do so many people want these monorepos?

Google-copying is part of it for sure. And I agree with your position - copying Google isn't a great reason on its own. Some more valid reasons: code which deploys together often wants to live together. Common dependencies can be easier to find (and use) if they're right there in the repo. Related, making cross-cutting changes can be easier when you can make an atomic change everywhere. Also, big repos often started out as small repos and then grew over time; the cost of a major change might outweigh the friction caused by keeping Git in place.

Consider also that we might not all be talking about the same thing. Some people (even here in this topic) consider Linux to be "large" and "a monorepo". Linux isn't notably big anymore, and it's not unusually challenging to host or to use locally. It's arguably a monorepo since it contains most of the code for the Linux kernel, but to me, "monorepo" implies needing special attention. So I probably wouldn't classify Linux as a monorepo for this discussion.

> I can't figure out why so many people are married to git but insist on leaning into its weaknesses

I'm sure there are many reasons. The common one I hear in my role boils down to, essentially, "Git is the de facto standard". That can be expressed in several ways: "it's harder to attract and retain engineers with an uncommon toolset"; "we want to focus on our core business, not on innovating in version control"; "all the other tools in our kit work with Git". (NB: I put those in scare quotes to distinguish that they aren't my or GitHub's position, they're things I've heard from others. They're not direct quotes, though.)

I talk with customers weekly who want to mash dozens of independent repos (Git or otherwise) together. If they aren't going to reap any of the benefits mentioned above or elsewhere in this topic, I strongly advise against it. At the end of the day, GitHub doesn't care if you have one giant monorepo or 1000 tiny ones; the product works pretty well for both. I suppose that's why I felt compelled to reply to your thread in particular -- yes, we're investing in monorepos, but no, it's not because we're trying to drive people to them.

anon9001 · on Nov 12, 2021

Thanks for engaging in discussion about this. If nothing else, you're building my confidence in GitHub.

> Consider also that we might not all be talking about the same thing.

I think this is 90% of the confusion/disagreement in this thread.

I've always thought that it's fine to use git however you want, but if you hit a bottleneck (it gets too big and becomes slow), then you split your repo along logical boundaries (usually a large library can be split out and versioned, for example).

Somewhere over the last 15 years, that's changed, and the zeitgeist is now "mash dozens of independent repos into one repo" no matter the situation. Everyone in this thread that's suggested monorepos aren't the way to go has been downvoted, which caught me by surprise.

> I suppose that's why I felt compelled to reply to your thread in particular -- yes, we're investing in monorepos, but no, it's not because we're trying to drive people to them.

I believe you're sincere, and perhaps I was a bit too cynical about GitHub's motivations. Sorry about that, this topic is just frustrating and GH is in a position to help make things better.

Do you think GitHub could offer some kind of practical guide about when to use monorepos and what their limitations are?

I think part of the problem is that git's docs and the git-scm book aren't going to prescribe a way to use the software, because it's intentionally extremely flexible. Git users appreciate this, but GitHub users might lack good guidance.

As another reply pointed out, this might also have its origins in the git-submodule porcelain being confusing and underutilized.

Most GitHub users have probably never used submodules, don't know when a git repo would start to slow down due to size, aren't sure how to split out part of a repo while preserving history, and probably haven't thought too much about internal dependency versioning.

> to me, "monorepo" implies needing special attention

I think you and I are actually in total agreement, but the vast majority of corporate GitHub users have no context about where git came from, what it's good at, what it's limitations are, and how to use it for more than trunk-based development.

The ideas that "Linux is a monorepo" or "monorepos are the only natural way to manage code for any project" or that git should be "fixed" to support centralization should be concerning to GH.

I suspect these people don't have "monorepos" in the way that you and I are talking about them. They probably just have mid-sized repos that haven't need to be split up yet.

Even if you can support those customers as they grow into monorepos without friction due to enormous technical efforts, we're failing to teach a generation of engineers how to think about their most fundamental tools.

I appreciate that GitHub is trying to use technology to smooth out these points of confusion and improve git to work in every scenario, but publishing customer education materials about how to make good decisions about source code management would also help a lot.

> I talk with customers weekly who want to mash dozens of independent repos (Git or otherwise) together. If they aren't going to reap any of the benefits mentioned above or elsewhere in this topic, I strongly advise against it.

It sounds like you're giving good advice to the teams that you talk to, but what can I show my team that's authoritative from GitHub that says "don't mash dozens of independent repos together and then blur the lines so you can't tell what's independent anymore"?

I thought this was obvious, but it's not, and I don't know how to get people to understand it.

This is a particularly bad problem when your independent repos are working fine, but then there's a company-wide initiative to go "monorepo", and it's obvious in advance that the resulting monorepo won't be usable without a lot of extra work.

Maybe I've just been unlucky, but every time I've had a monorepo experience, it's been exactly that approach. And, as I'm sure you can tell by this thread, I haven't had much luck in convincing other engineers that mashing all unrelated code into one repo is a silly thing to do.

tsimionescu · on Nov 11, 2021

Monorepos are much easier for everyone to use, and are the only natural way to manage code for any project. You keep talking about Google, but a much more famous monorepo is Linux itself. Perhaps Linus Torvalds has fallen into Google's hype?

The fact that git is very poor at scaling monorepos might mean that it's a bad idea to use git for larger organizations, not that it's a bad idea to use monorepos. If git can be improved to work with monorepos, all the better.

anon9001 · on Nov 11, 2021

> Monorepos are much easier for everyone to use, and are the only natural way to manage code for any project.

I strongly disagree with that, but I'll let this blog post explain it better than I can: https://medium.com/@mattklein123/monorepos-please-dont-e9a27...

> You keep talking about Google, but a much more famous monorepo is Linux itself.

I thought it was fairly well known that monorepos came directly from Google as part of their SRE strategy. It didn't even come into common usage until around 2017 (according to wikipedia). If I'm remembering correctly, the SRE book recommends it, and that's why it gained popularity.

Also, I don't believe that Linux is a valid interpretation of "monorepo". Linux is a singular product. You can't build the kernel without all of the parts.

A better example would be if there was a "Linus" repo that contained both git and linux. There isn't, and for good reason.

> The fact that git is very poor at scaling monorepos might mean that it's a bad idea to use git for larger organizations, not that it's a bad idea to use monorepos. If git can be improved to work with monorepos, all the better.

Any performance improvement in git is welcome, but anything that sacrifices a full clone of the entire repository is antithetical to decentralization.

The whole point of git is decentralized source code.

cdcarter · on Nov 11, 2021

I think it's at least somewhat fair to call Linux a monorepo. There are a lot of drivers included in the main tree. They don't need to be, (we know this because there are also lots of drivers not in the source tree). But by including them, the kernel devs can make large changes to the API and all the drivers in one go. This is a classic "why use a monorepo".

solarmist · on Nov 11, 2021

Monorepos (up to a certain size where git starts getting too slow) are easier to use unless you have sufficient investment into dev tooling.

I think "monorepo" here is a shorthand for large, complex repos with long histories which git does not scale well to whether or not it is all of the repos for an organization. For example I'd call the Windows OS a monorepo for all of the important reasons.

dataangel · on Nov 11, 2021

> Also, I don't believe that Linux is a valid interpretation of "monorepo". Linux is a singular product. You can't build the kernel without all of the parts.

But it’s also larger scale than the vast majority of startups will ever reach. My work has had the same monorepo for 8 years with over 100 employees now and git has had few problems.

xorcist · on Nov 12, 2021

> You can't build the kernel without all of the parts.

You most certainly can. Loadable modules have been part of the kernel for over 20 years now.

The fact that many drivers exists out-of-tree should be enough to settle that particular argument.

There are libraries in the kernel perfectly usable on its own. There are also many scripts, analysis and testing tools that live in the kernel that build and runs separately from the kernel itself. Then there's a whole lot of documentation.

Linux is what git what designed for. It sits in a singular respository and no other repositories are needed to build a working product. It can accurately be described as a monorepo, if that particular distinction was important.

tsimionescu · on Nov 12, 2021

I'd also note that Linux doesn't use any kind of dependency management tools between all of its sub-components: everything builds using what Git keeps on disk.

Also, there are no external dependencies: if you want a new library to be used, you copy its source to the kernel source tree - a classic monorepo solution, that detractors claim "doesn't scale".

anarazel · on Nov 12, 2021

Which pre-existing libraries have been integrated that way into the kernel? There's a few tiny pieces, but largely you can't just use pre-existing libs inside the kernel.

howinteresting · on Nov 11, 2021

> The whole point of git is decentralized source code.

The "whole point of git" is to provide value to its users. Full decentralization is not necessary for that.

naniwaduni · on Nov 12, 2021

Providing version control is also not strictly necessary to providing value to users. But no, I'm pretty sure git's whole point is "to provide value to its users".

tsimionescu · on Nov 12, 2021

> I strongly disagree with that, but I'll let this blog post explain it better than I can: https://medium.com/@mattklein123/monorepos-please-dont-e9a27...

That article completely misses the point about project history - a monorepo has the full history of your project, while multiple repos split that history. If you have a good split, everything is fine, but if you're moving code between repos relatively often, everything gets muddled.

It also overstates the need for build artifact management in a monorepo. For 3rd party code that has a good package management solution you can use that, while still using the simpler solution of no dependency management for internal libraries, in the good old C tradition. I will again point to the Linux kernel as a good example of doing this successfully - they don't do any kind of versioned build artifacts for any of the many tens of libraries they use - they just rely on git.

> I thought it was fairly well known that monorepos came directly from Google as part of their SRE strategy. It didn't even come into common usage until around 2017 (according to wikipedia). If I'm remembering correctly, the SRE book recommends it, and that's why it gained popularity.

While Google may have coined this term in 2017, the idea of keeping all of an organization's code into a single repo was around since forever. The ~1k dev company I work for had a single Perforce repo with history going back to 1998 or something like that.

> Also, I don't believe that Linux is a valid interpretation of "monorepo". Linux is a singular product. You can't build the kernel without all of the parts.

This is probably the core of disagreement actually. Linux of course has all sorts of internal parts that can be considered libraries/modules. They famously have a huge amount of drivers, but there are also things like kernel-space implementations of much of the C stdlib, such as kmalloc, compression libraries, a unit testing framework, the ebpf compiler and runtime, more than 40 file systems, network stacks for various protocols, and so many more. Nothing prevents the kernel team from splitting up the kernel into a core 'app' + tens of libraries and add tooling to stitch these together from separate repos (since it's C, you could theoretically do all of this at the single source file level, even). Not to mention, if you want to use a 3rd party library in Linux, there is only one way to do it: you copy its source into the kernel source tree.

Of course, no one is suggesting such a thing, because it's well accepted that the kernel is a 'single product'. But multiple-repo advocates usually miss the fact that everyone starts with a single repo and a single product, and then they gradually isolate parts of that product as 'libraries', and they gradually split up functionalities as 'separate products', and it's not always clear when this separation is actually finished enough to make the decision to split it off into its own repo, if ever.

> Any performance improvement in git is welcome, but anything that sacrifices a full clone of the entire repository is antithetical to decentralization.

> The whole point of git is decentralized source code.

Partial clones are still decentralized source, these concepts are completely orthogonal. As long as I can clone all of the code I am working on and all of its history, it obviously makes no difference if this is a single repo from a multi-repo project or it is a single part of a monorepo.

Also, the whole point of git is managing source code history. The decentralized part is only important for decentralized projects, like Linux. Most projects, either private or public, work in a centralized way, and maintain a central source control server, and local git clones are just nice-to-have caches.

And I want to emphasize again: I am talking about most open source projects here as well, even some of that large ones like Apache. Even GNU typically works in this centralized manner: you get the central repo for some project, you make changes, you rebate your changes on master, you send a patch for review to the mailing list, and if the patch is accepted, you ore some maintainer commit it to the central repo.

In a decentralized workflow like Linux, you start off by cloning one or more relevant authoritative repos (Linus' repo, or Debian's repo etc), you make your changes, you format them as a patch, and you send the patch to the maintainers of the repos you want to change - e.g. Debian's repo for a security fix maybe, or the kernel maintainer for the subsystem you want to change. If they like your patch they take it and put it in their repo. If it's important enough, it will then slowly percolate through the ecosystem in various ways - the maintainer will eventually push it to Linus to be included in an official release, and Debian will eventually take it from Linus' repo. Some mainly not even wait for it to make it to an official Linux release - a cutting-edge distro may directly take changes from another maintainer's repo.

There are extremely few projects that work in this manner.

Edit: added a few more details on just how much stuff is in the Linux source tree, and how much they work exactly like the linked article claims will never scale (no dependency management, no 3rd party artifacts etc).

CRConrad · on Nov 13, 2021

> Nothing prevents the kernel team from splitting up the kernel into a core 'app' + tens of libraries

Hmm -- would a different gitting strategy on the part of the kernel dev team perhaps influence their take on the whole monolithic vs microkernel question? :-)

ajkjk · on Nov 11, 2021

Very much doubt that's their corporate strategy. More likely it's as simple as: lots of people have monorepos; they have lots of issues with Git and Github; Github wants their business.

Orphis · on Nov 11, 2021

> Git is decentralized and fast, and Google famously doesn't use it.

Most (all?) of Google OSS software is hosted on either Gerrit or Github. Git is not used by the "google3" monorepo, but it's used by quite a few major projects.

solarmist · on Nov 11, 2021

From my understanding Microsoft is doing it because they want to use git for developing windows which is(was?) a large monorepo.

saghm · on Nov 12, 2021

> Then GitHub can sell tools to deal with your enormous monorepo, because your traditional tools will feel slow and worse than GitHub's tools

The "tool" in this case is a feature being added to upstream git, which everyone gets for free. It seems reasonable to assume other git providers will also be adding support for it server-side where needed; I have to imagine Gitlab either already is or will be doing work to support monorepos as well, assuming there's enough demand for it. At worst, this seems like it could be trying to lock people into git generally rather than GitHub specifically, but more likely this is just work being done to support something that (potential) customers are already asking for.

jeffbee · on Nov 11, 2021

> Git is ... fast, and Google ... doesn't use it.

Everything about git is orders of magnitude slower than the monorepo in use at Google. Git is not fast, and its slowness scales with the size of your repo.

rurban · on Nov 11, 2021

Even better looks the new OTR merge strategy, which benefits everyone. Not only the tiny monorepo userbase.

rq1 · on Nov 11, 2021

What I was looking for recently is a way to make “sparse push”. And trigger a chain reaction with hooks.

Didn’t find anything interesting.

speedgoose · on Nov 11, 2021

"One of the biggest traps for smart engineers is optimizing something that shouldn't exist."

Elon Musk.

solarmist · on Nov 11, 2021

Let's make the snarky comment into a helpful comment. Why do you think it shouldn't exist?

speedgoose · on Nov 11, 2021

Monorepos create more issues than what they solve.

ratww · on Nov 11, 2021

That's completely false. Monorepos don't really create issues at all when done properly, and when used in situations where they make sense.

In smaller scales, for example, they're fantastic for productivity, and my company is not looking back.

jeremyjh · on Nov 11, 2021

This is no different from saying "monorepo bad". Aside from performance issues in git, why would a monorepo be bad? It seems very natural to me to have a whole system referenced with a single branch/tag that must all pass CI together. Otherwise supporting projects can introduce breaking changes downstream that are not apparent before they hit master.

anaganisk · on Nov 16, 2021

Somehow its working for google? May be im wrong but last I heard they use mono repos.

solarmist · on Nov 11, 2021

Such as? That's just parroting “common opinion” otherwise.

speedgoose · on Nov 11, 2021

The second paragraph of the article we are discussing about for example.

But you can find a list on Wikipedia and make your own opinion : https://en.m.wikipedia.org/wiki/Monorepo

tsimionescu · on Nov 11, 2021

So the Linux devs have no idea how to properly use Git?

speedgoose · on Nov 11, 2021

I'm not sure whether the Linux kernel git repository qualifies as a Monorepo.

tsimionescu · on Nov 12, 2021

Of course it does. It includes core kernel code, drivers, countless libraries (compression libraries, fonts, a unit testing framework, various data structure implemetations, and many many others), file systems, network stacks, a programming language runtime+compiler (ebpf), and many other subsystems.

According to multi-repo advocates, it would obviously be split up into at least a dozen different repos, with dependency management tools to specify which version of zlib or ebpf to use, with a dedicated repo just for the build logic, of course.

tambourine_man · on Nov 11, 2021

Supporting evidence?

speedgoose · on Nov 11, 2021

The Wikipedia article about monorepos has a good summary

https://en.m.wikipedia.org/wiki/Monorepo

Then you can do your own opinion, I'm sharing mine.

jeremyjh · on Nov 11, 2021

Apart from performance issues that article offers more (and more significant) advantages than it does drawbacks, so it really does not support your statement.

anaganisk · on Nov 16, 2021

And then he comes up with Hyperloop.

CRConrad · on Nov 13, 2021

Wow, so he's actually said something smart for a change.

joconde · on Nov 11, 2021

Was he talking about something specific?

plopz · on Nov 11, 2021

I remember him saying something like that during a walkthrough of the base building starship and if I recall it was in reference to overengineering something about the grid fins.

junon · on Nov 11, 2021

He was talking about the battery vibrator plates or something in Tesla cars.