Scaling monorepo maintenance (github.blog)
296 points by pimterry on May 1, 2021 | 75 comments



This was great! My summary:

A git packfile is an aggregated and indexed collection of historical git objects which reduces the time it takes to serve requests for those objects, implemented as two files: .pack and .idx. GitHub was having issues maintaining packfiles for very large repos in particular because regular repacking always has to repack the entire history into a single new packfile every time -- which means the total work grows quadratically over time. GitHub's engineering team ameliorated this problem in two steps: 1. Enable repos to be served from multiple packfiles at once, 2. Design a packfile maintenance strategy that uses multiple packfiles sustainably.

Multi-pack indexes are a new git feature, but the initial implementation was missing the performance-critical reachability bitmaps for multi-pack indexes. In general, index files store object names in lexicographic order and point to the named object's position in the associated packfile. As a first step toward implementing reachability bitmaps for multi-pack indexes, they introduced a reverse index file (.rev) which maps packfile object positions back to index file name offsets. This alone brought a big performance improvement, and it also filled in the missing piece needed to implement multi-pack bitmaps.
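
To make the forward/reverse relationship concrete, here's a tiny conceptual sketch (not the on-disk .idx/.rev formats; the object names and pack order are made up):

    # Conceptual only: what the .idx and .rev files answer, not their on-disk
    # layout. Object names and pack order here are invented.
    pack_order = ["c3f6", "a1b2", "d4e5"]  # objects in the order they sit in the .pack
    idx_order = sorted(pack_order)         # the .idx lists names lexicographically...
    pack_pos = {name: i for i, name in enumerate(pack_order)}  # ...mapping each name to its pack position

    # The .rev file is the inverse: given a position in the pack, which entry
    # of the sorted index (and hence which object name) lives there?
    rev = [None] * len(pack_order)
    for idx_pos, name in enumerate(idx_order):
        rev[pack_pos[name]] = idx_pos

    print(idx_order)  # ['a1b2', 'c3f6', 'd4e5']
    print(rev)        # [1, 0, 2]  (pack position -> index position)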

With the issues of serving repos from multiple packs solved, they needed a way to use multiple packs efficiently to reduce maintenance overhead. They chose to maintain historical packfiles in geometrically increasing sizes. That is, during the maintenance job, consider the N most recent packfiles: if the sum of the sizes of packfiles [1, N] is less than the size of packfile N+1, then packfiles [1, N] are repacked into a single packfile and the job is done; if their summed size is greater than the size of packfile N+1, then iterate and compare packfiles [1, N+1] against packfile N+2, and so on. This results in a set of packfiles where each file is roughly double the size of the previous one when ordered by age, which has a number of beneficial properties for both serving and the average-case maintenance run. Funnily enough, this selection procedure struck me as similar to the game "2048".
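
A toy version of that selection step, with hypothetical pack sizes listed newest/smallest first (real git rolls packs up by object count with a configurable factor, via git repack --geometric=<n>):

    # Illustration of the idea only, not git's actual implementation.
    def split_point(sizes):
        """Return N such that packs[0:N] get rolled up into one new pack."""
        total = 0
        for n, size in enumerate(sizes):
            total += size
            # If everything seen so far is still smaller than the next pack,
            # rolling it up keeps the "each pack roughly doubles" shape.
            if n + 1 < len(sizes) and total < sizes[n + 1]:
                return n + 1
        return len(sizes)  # no cut point: repack everything into one pack

    sizes = [1, 1, 3, 10, 25]
    n = split_point(sizes)
    print(sizes[:n], "->", [sum(sizes[:n])] + sizes[n:])  # [1, 1] -> [2, 3, 10, 25]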


Very well written post and upstream work is always appreciated.

I also really like monorepos, but Git and GitHub really don't work well at all for them.

On the Git side there is no way to clone only parts of a repo or to limit access by user. All the Git tooling out there, from the CLI to the various IDE integrations, is very ill suited to a huge repo with lots of unrelated commits.

On the GitHub side there is no separation between the different parts of a monorepo in the UI (issues, PRs, CI), the workflows, or the permission system. Sure, you can hack something together with labels and custom bots, but it always feels like a hack.

Using Git(hub) for monorepos is really painful in my experience.

There is a reason why Google, Facebook et al. have heaps of custom tooling.


I really like monorepos. But I find that it's almost never a good idea to hide parts of the source code from developers. And if there's some secret sauce that's so sensitive that only a single-digit number of developers in the whole company can access it, then it's probably okay to have a separate repository just for it.

Working in environments where different people have partial access to different parts of the code never felt productive to me -- often, figuring out who can take on a task and how to grant all the necessary access takes longer than the task itself.


It's funny that you mention this as if monorepos of course require custom tooling. Google started with off-the-shelf Perforce and that was fine for many years, long after their repo became truly huge. Only when it became monstrously huge did they need custom tools and even then they basically just re-implemented Perforce instead of adopting git concepts. You, too, can just use Perforce. It's even free for up to five users. You won't outgrow its limits until you get about a million engineer-years under your belt.

The reason git doesn't have partial repo cloning is because it was written by people without regard to past experience of software development organizations. It is suited to the radically decentralized group of Linux maintainers. It is likely that your organization much more closely resembles Google or Facebook than Linux. Perforce has had partial checkout since ~always, because that's a pretty obvious requirement when you stop and think about what software development companies do all day.


It’s somewhat mind boggling that no one has made a better Perforce. It has numerous issues and warts. But it’s much closer to what the majority of projects need than Git, imho. And for bonus points I can teach an artist/designer how to safely and correctly use Perforce in about 10 minutes.

I’ve been using Git/Hg for years and I still run into the occasional Gitastrophe where I have to Google how to unbreak myself.


Git has recently added sparse checkout, and there is also Virtual File System for Git from Microsoft.
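
For what it's worth, a rough sketch of what that looks like when driven from Python (requires git 2.25+ and a server that supports partial clone; the URL and paths are made up):

    # Blobless partial clone plus sparse checkout: fetch commits and trees up
    # front, file contents on demand, and only materialize chosen directories.
    import subprocess

    def sparse_clone(url, dest, paths):
        subprocess.run(
            ["git", "clone", "--filter=blob:none", "--sparse", url, dest],
            check=True,
        )
        subprocess.run(
            ["git", "-C", dest, "sparse-checkout", "set", *paths],
            check=True,
        )

    # sparse_clone("git@example.com:org/monorepo.git", "monorepo", ["services/payments"])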

In my experience git/VCS is not the issue for a monorepo. Build, test, automation, deployments, CI/CD are way harder. You will end up with a bunch of shell scripts, makefiles, Grunt, and a combination of ugly hacks. If you are smart you will adopt something like Bazel and have a dedicated tooling team. If you see everything as a nail, you will split the monorepo into an unmaintainable mess of small repos that slowly rot away.


I have had the exact opposite experience. Long term maintenance in a monorepo is much easier because all developers have visibility into changes. Tooling isn’t really all that complex (at least to me, and I’m not a devops engineer either). It’s way more complex to me to have multiple repositories. I’d rather have hundreds of directories that follow a uniform structure than hundreds of git repositories that may not.


For me the complexity comes when you have many isolated projects sharing code. With a monorepo that shared code is immediately consumed by all isolated projects, most of which the developer is not actively watching for breakages.

This means that for every single commit, every single test suite has to be run across the whole organization. For a small startup that's probably fine, but for a larger organization that adds a ton of issues and extra time. If your change to a shared project breaks another project that you may not be familiar with, you end up not only with a delayed context switch (since you don't know about it until the CI has finished running all tests across the whole org), but now you have to figure out how to address it before someone else changes code in that shared project and causes a merge conflict.

Compare that to a split-repo setup where projects are sharing code indirectly (via package management systems). When you change code other projects will only consume the new code when they actually upgrade their packages, and usually a developer that is familiar with the code base will actively hit that breakage in real time.

Granted, this has its own trade offs such as all your systems may be running different versions of shared code, and the delay in breakage detection can cause some back and forth in breakage between isolated projects.


This to me sounds like a breakdown of documentation and software architecture patterns, not so much a flaw of the monorepo. I’d rather it come up at that stage anyway, even if it’s a pain, because it will help reinforce better patterns and documentation as people feel these pain points. Shared libraries should have a code owner who is responsible for this, unless it’s agreed that the change is straightforward enough for an individual dev to work on, no?

That’s my take anyway


That's why I mentioned Bazel: once a monorepo reaches a certain size you need to invest in the right tooling. What people do instead is split into multiple repos and keep using the same tooling. By splitting the repo you exchange the easy problem of scaling tooling for the very hard problem of multi-repo/micro-service coordination. IMO, the biggest problem with splitting the repo is Conway's Law: once split, you change important dynamics of product development.


On this note, GitHub does reach out to customers with monorepos and is aware of their shortcomings. I think over time we will see them change to have better support. It's only a matter of time.


I’ve always found that the biggest issue with monorepos is the build tooling. I can’t get my head around Bazel and other Blaze derivatives enough to extend them to support any interesting case, and Nix has too many usability issues to use productively (and I’ve been in an organization that gave it an earnest shot).


Yeah, in my experience it takes X years of experience at a company that is already using a monorepo + blaze-derivative tooling, for the costs of setup and maintenance to make it worth adoption.

That X is mostly a function of how open you are to the idea that maintaining + using a centralized, language-agnostic build graph is a Useful Thing.

But once that X is crossed, I don't think it's possible to happily go back to a non-monorepo environment. :)

FWIW, we've used Pants successfully for years, and it has excellent Python support. However, we're in a long-haul migration to Bazel, so you might be better off revisiting Bazel if you want to invest in something that's extremely likely to pay dividends over the next 10 years.


We used Pants as well, and it worked alright for Python. But its codebase was a mess and it was also difficult to extend. I get the strong feeling that there is a better way to make a build system—perhaps something like Nix but without so many usability problems—but no one is investing in that right now.


With GitHub Actions you can quite easily specify a workflow for parts of your repo (simple file path filter https://docs.github.com/en/actions/reference/workflow-syntax...). So you can basically just write one workflow for each project in the monorepo and have only those run where changes occurred.


That’s what I’m doing now, but you still have to push artifacts up to a repository and downstream jobs have to pull them down. Further still, managing the boilerplate for GitHub actions is no small task (e.g., you have a dozen Python projects and you want the actions for each to look similar), so to solve for that you have to build some hack that involves generating the Actions YAML before you commit or similar. Not the end of the world, but still far from my ideal state.


>have to build some hack that involves generating the Actions YAML before you commit or similar

You could probably use jsonnet expressions and invoke them from a pre-commit hook.

That seems fairly elegant to me.


I’m not using jsonnet, but I am generating them effectively from a pre-commit hook. One of the jobs that I generate validates that the actions that were committed are up-to-date. This works well because I don’t have too many dependencies, but if you have deep dependency trees then the builds will end up taking a long time and you end up reinventing Nix or Bazel poorly.
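
For anyone curious, a hypothetical sketch of that kind of generator; the project names and workflow template are made up, and the same script doubles as the CI validation job via --check:

    # Generates one GitHub Actions workflow per project from a shared template,
    # so the YAML files stay uniform. Run from a pre-commit hook to (re)write
    # them, and with --check in CI to fail if a committed workflow has gone stale.
    import pathlib
    import sys

    PROJECTS = ["billing", "ingest", "api"]  # made-up project directories

    def render(name):
        return "\n".join([
            f"name: {name}",
            "on:",
            "  push:",
            f"    paths: ['{name}/**']",
            "jobs:",
            "  test:",
            "    runs-on: ubuntu-latest",
            "    steps:",
            "      - uses: actions/checkout@v2",
            f"      - run: pip install -e ./{name} && pytest {name}",
        ]) + "\n"

    def main(check):
        stale = []
        for name in PROJECTS:
            path = pathlib.Path(".github/workflows", f"{name}.yml")
            want = render(name)
            if check:
                if not path.exists() or path.read_text() != want:
                    stale.append(str(path))
            else:
                path.parent.mkdir(parents=True, exist_ok=True)
                path.write_text(want)
        if stale:
            print("out-of-date workflows:", ", ".join(stale))
        return 1 if stale else 0

    if __name__ == "__main__":
        sys.exit(main(check="--check" in sys.argv))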


Are nix and Bazel in the same category? I have used nix very little and never used Bazel. But I have used Make and jsonnet pretty often. Can you re-run Make in a pipeline to verify that the actions in use match what would be generated based on the repo state?


> Are nix and Bazel in the same category?

Nix bills itself as a fully reproducible package manager and Bazel as a fully reproducible build tool. In the abstract, both of these reproducibly build software, so that puts them in the same category in general.

> Can you re-run Make in a pipeline to verify that the actions in use match what would be generated based on the repo state?

Yes, and I’m doing something similar although not with Make. The problem is that this home-grown build system doesn’t do incremental builds, so everything will be fully built from source every time, and that can take a while for very deep dependency graphs.

Conceivably your “Actions generator” script could generate only the jobs that need to be run based on what changed since the previous commit, referencing artifacts from the previous commit for those build steps which weren’t invalidated. This is a neat concept, but we’re well on our way to reinventing Nix or Bazel.
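
A rough sketch of the change-detection half of that idea, assuming each top-level directory in the repo is a project (the artifact-reuse part is left out):

    # Ask git which files changed since the previous commit and map them to
    # top-level project directories; those are the only jobs worth generating.
    import subprocess

    def changed_projects(base="HEAD~1"):
        out = subprocess.run(
            ["git", "diff", "--name-only", base, "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout
        return {line.split("/", 1)[0] for line in out.splitlines() if "/" in line}

    if __name__ == "__main__":
        for project in sorted(changed_projects()):
            print(f"would emit build/test jobs for {project}")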


Can you please give an example of such an interesting case? I am genuinely curious.

And I agree with the general point that monorepos require a great build tooling as a match.


An interesting case of what? A monorepo? Issues with Bazel? With Nix?


An interesting case that's hard to support with Bazel.

Note: I don't quite like Bazel; but my take is that Bazel supports too many configurations and options: my workflows are always much simpler. Which is why I am curious to hear what your workflows require.


It’s been a while since I tried it (maybe 2018?), but Python 3 support straight up didn't work despite the documentation. It also seemed like adding support for your own toolchain required you to write Java plugins in a very confusing, inheritance-for-its-own-sake, early-2000s-OOP way.


Got it. I can relate: while Python 3 support is probably fixed now, I agree that Bazel tooling for adding toolchains has always been abysmal. Things have gotten somewhat better (https://docs.bazel.build/versions/master/toolchains.html) since it's now in Skylark, not in Java, but I still don't like it (aka not productive).

Overall, custom toolchains are common (especially, in the embedded world), and a good build system should make that very easy.


I wouldn't call it painful exactly but I'll be happy when shallow and sparse cloning become rock solid and boring.


The article is really impressive. It's nice to see GitHub contribute changes back to the git project, and to know that the two work closely together.


It's in their mutual interest. Imagine what happens to GIThub if GIT goes out of fashion.


Yes, isn't it nice when interests of multiple parties are aligned such that they help each other make progress towards their shared goals?


Well, it's nice to see they're rational, indeed.


Other rational companies could try to fix this without contributing upstream. Doing it upstream benefits competitors like gitlab. So yeah! It's nice seeing this kind of behavior


First, they not only contributed upstream; upstream developers also contributed to this patch. I.e., they got help from outside GitHub to make this patch possible.

Second, if they had decided to fork Git, then they'd have to maintain this fork forever.

Third, this fork could over time become visibly, or even worse subtly, incompatible with stock Git, which is still the Git running on GitHub users' machines, and the two need to interact with each other in a 100% compatible manner.

So, in this case, not contributing upstream was literally a no-go. The only rational choice was not to fork Git.


30 minute read + the Git object model = mind boggled.

I'd have appreciated a series of articles instead of one, for me it's way too much info to take in in one sitting.


It was a lot to digest, but it was also all one continuous thought.

If it was broken up, I don't think it would have been nearly as good. And I don't think I would have been able to keep all the context to understand smaller chunks.

I really enjoyed the whole thing.


I'm currently working through the book Building Git. Best $30 I've spent in a while. It's about 700 pages, but 200 pages in and I can stage files to/from the index, make commits, and see the current status (although not on a repo with packfiles).

I'm thinking about writing a blog post where I write a git commit with hexdump, zlib, and vim.
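
Roughly what that boils down to, sketched in Python rather than hexdump and vim; the blob content is arbitrary, and commits and trees use the same "<type> <size>\0<body>" framing:

    # Minimal sketch of a loose git object: frame the content, SHA-1 it for the
    # object name, zlib-compress it, and drop it under .git/objects/.
    import hashlib
    import pathlib
    import zlib

    def write_blob(repo, content):
        store = b"blob %d\x00" % len(content) + content
        sha = hashlib.sha1(store).hexdigest()
        path = pathlib.Path(repo, ".git", "objects", sha[:2], sha[2:])
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(zlib.compress(store))
        return sha

    # Should agree with `git hash-object -w` for the same bytes:
    # print(write_blob(".", b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a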


I love when people use Git in ways I haven't thought about before. Reminds me of the first time I played around with 'blobs.'


It is a great writeup. I wonder how gitlab solves this problem.


GL packs refs at various times and frequencies depending on usage: https://docs.gitlab.com/ee/administration/housekeeping.html

It works well for most repos but as you start to get out to the edges of lots of commits it can cause slowness. GL admins can repack reasonably safely at various times to get access speedups, but the solution that is presented in the blog would def speed packing up.

(I work as a Support Engineering Leader at GL but I'm reading HN for fun <3)


They might have yet to encounter it. Git is hosting some really big repos.


Oh, we def have. I've seen some large repos (50+GB) in some GL installations.


Mono-repos are like having a flat directory structure.

Sure it's simple, but it makes it hard to find anything if you have a lot of stuff/people. Submodules and package managers exist for a reason.


> Sure it's simple but it makes it hard to find anything if you have a lot of stuff/people.

I think this is a bad analogy. Looking up a file or directory in a monorepo isn't harder than looking up a repository. In fact, I'd argue it's easier, as we've developed decades of tooling for searching through filesystems, while for searching through remotely hosted repositories you're dependent on the search function of the repository host, which is often worse.


Note: for the sake of discussion I’m assuming when we say monorepo we mean monorepo and associated tools used to manage them

The trade off is simplified management of dependencies. With a monorepo, I can control every version of a given dependency so they’re uniform across packages. If I update one package it is always going to be linked to the other in its latest version. I can simplify releases and managing my infrastructure in the long term, though there is a trade off in initial complexity for certain things if you want to do something like, say, only running tests in CI for packages that have changed (useful in some cases).

It’s all trade offs, but the quality of code has been higher for our org in a monorepo on average


I've found that many developers do not pay attention to dependency management, so this approach of "it's either in the repo or it doesn't exist" is actually a nice guard rail.

I'm reading between the lines here, but I'm assuming you've set up your tooling to enforce this. As in: the various projects in the repo don't just optionally decide to have external references, i.e., Maven Central, npm, etc.

This puts quite a lot of "stuff" in the repo, but with improvements like this article mentioned, makes monorepos in git much easier to use.

I'd have to think you could generate a lot of automation and reports triggered by commits pretty easily, too. I'd say that would make the monorepo even easier to observe, with a modicum of the tooling required to maintain independent repositories.


That is accurate, I wouldn't use a monorepo without tooling, and in the JavaScript / TypeScript ecosystem, you really can't do much without tooling (though npm supports workspaces now, it doesn't support much else yet, like plugins or hooks etc).

I have tried in the past to achieve the same goals, particularly around the dependency graph and not duplicating functionality found in shared libraries (though this concern goes hand in hand with solving another concern I have, which is documentation enforcement). It was just not really possible in a way that I could automate with a high degree of accuracy and confidence without even more complexity, like having to use some kind of CI integration to pull dependency files across packages and compare them. In a monorepo I have a single tool that does this for all dependencies whenever any package.json file or the lock file is updated.

If you care at all about your dependency graph, and in my not so humble opinion every developer should have some high-level awareness here in their given domain, I haven't found a better solution that is less complex than leveraging a monorepo


To scale a monorepo, you need to split it up into multiple repos; that way each repo can be maintained independently by a separate team...

We can call it a multi-monorepo, that way our brainwashed managers will agree to it.


And that way, you can't have atomic updates across the repositories and need to synchronize them all the time, great.


We have repository systems built for centralized atomic updates and giant monorepos, like SVN. The question is why we are trying to have Git do this, when it was explicitly designed with the exact opposite goal. Is this an attempt to do SVN in Git so we get to keep the benefits of the former and the cool buzzword-factor of the latter? I don't know.

Also when I try to think about reasons to have atomic cross-project changes, my mind keeps drawing negative examples, such as another team changing the code of your project. Is that a good practice? Not really. Well, unless all projects are owned by the same team, it'll happen in a monorepo.

Atomic updates not scaling beyond a certain technical level is often a good thing, because they also don't scale on the human and organizational level.


1. You determine that a library used by a sizable fraction of the code in your entire org has a problem that’s critical to fix (maybe a security issue, or maybe the change could just save millions of dollars in compute resources, etc.), but the fix requires updating the use of that library in ~30 call sites spread across the codebases of ~10 different teams.

2. You create a PR that fixes the code and the problematic call sites in a single commit. It gets merged and you’re done.

In the multi-repo world, you need to instead:

1. Add conditional branching in your library so that it supports both the old behavior and new behavior. This could be an experiment flag, a new method DoSomethingV2, a new constructor arg, etc. Depending on how you do this, you might dramatically increase the number of call sites that need to be modified.

2. Either wait for all the problematic clients to update to the new version of your library, or create PRs to manually bump their version. Whoops - turns out a couple of them were on a very old version, and the upgrade is non-trivial. Now that’s your problem to resolve before you proceed.

3. Create PRs to modify the calling code in every repo that includes problematic calls, and follow up with 10 different reviewers to get them merged.

4. If you still have the stamina, go through steps 1-3 again to clean up the conditional logic you added to your library in step 1.

Basically, if code calls libraries that exist in different repos, then making backwards-incompatible changes to those libraries becomes extremely expensive. This is bad, because sometimes backwards-incompatible changes would have very high value.

If the numbers from my example were higher (e.g. 1000 call sites across 100 teams), then the library maintainer in a monorepo would probably still want to use a feature flag or similar to avoid trying to merge a commit that affects 1000 files in one go. However, the library maintainer’s job is still dramatically easier, because they don’t have to deal with 100 individual repos, and they don’t need to do anything to ensure that everyone is using the latest version of their library.


Your monorepo scenario makes the following unlikely assumptions:

1. A critical security/performance fix has no other recourse than breaking the interface compatibility of a library. The far more common scenario is that it can be fixed in the implementation without BC breaks (otherwise systems like semver wouldn't make sense).

2. The person maintaining the library knows the codebases of 10 teams better than those 10 teams do, so that person can patch their projects better and faster than the actual teams.

As a library maintainer, you know the interface of your library. But that's merely the "how" on the other end of those 30 call sites. You don't know the "why". You can easily break their projects, even though your code compiles just fine. So that'd be a reckless approach.

Also your multi-repo scenario is artificially contrived. No, you don't need conditional branching and all this nonsense.

In the common scenario, you just push a patch that maintains BC and tell the teams to update and that's it.

And if you do have BC breaks, then:

1. Push a major version with the BC breaks and the fix.

2. Push a patch version deprecating that release and telling developers to update.

That's it. You don't need all this nonsense you listed.


I’ve lived both lives. It absolutely is an ordeal making changes across repos. The model you are highlighting opens up substantial risk that folks don’t update in a timely manner. What you are describing is basically just throwing code over the wall and hoping for the best.


We're still people capable of speech and communication, if something is that urgent, we can communicate it to our peers.

Changing code under their noses risks breaking a bunch of projects. We can also fix this by communicating instead, right? But if we CAN communicate... then we can go back to the previous option (telling them to update), as it becomes just as viable.

Communicating is always essential, and can't be avoided.


Semver is a second-rate coping mechanism for when better coordination systems don't exist.


Patching the code of 10 projects you don't maintain isn't an example of a "coordination system". It's an example of avoiding having one.

In multithreading this would be basically mutable shared state with no coordination. Every thread sees everything, and is free to mutate any of it at any point. Which as we all know is a best practice in multithreading /s


The same code can have multiple overlapping sets of maintainers. For example, one team can be responsible for business logic while another team can manage core abstractions shared by many product teams. Yet another team may be responsible for upgrading to newer toolchains and language features. They'll all want to touch the same code but make different, roughly orthogonal changes to it.

Semver provides just a few bits of information, not nearly enough to cover the whole gamut of shared and distributed responsibility.

The comparison with multithreading is not really valid, since monorepos typically linearize history.


Semver was enough for me to very simply resolve a scenario above that was presented as some kind of insurmountable nightmare. So I think semver is just fine. It's an example of a simple, well designed abstraction. Having "more bits" is not a virtue here.

I could have some comments on your "overlapping responsibilities" as well, but your description is too abstract and vague to address, so I'll pass on that. But you literally described the concept of a library at one point. There's nothing overlapping about it.


> You create a PR that fixes the code and the problematic call sites in a single commit. It gets merged and you’re done.

What happens when you roll this out and partway through the rollout an old version talks to a new version? I thought you still needed backwards compat? I'm a student and I've never worked on a project with no-downtime deploys, so I'm interested in how this can be possible.


Version skew is only an issue when there's cross service communication. One service deployed on two different codepaths (in this case, using a different library implementation) is completely fine.


My example was a library, so both the library and its caller are in the same release artifact (binary/JAR/whatever).

If you’re changing the interface of an RPC service, then you can’t do that in a single commit, and need to fall back to something like the second approach, but with even more caution to make sure you properly account for releases and the possibility of rollbacks.


The philosophy differences between git and svn aren't limited to one centralized repo vs. multiple repos.

With git, you get a local stage/history which lets you rework/reorder your commits for clarity before pushing. It also allows for more options to resolve conflicts, although this increased ability has brought its own problems.


I also find git’s branching model to be far superior to subversion’s.

I assume that this is due to its design center around distributed repositories and patch files. But it’s still there even when using a central repo and a monorepo structure.


Of course I want people who care about modernizing code to come in and modernize my code (such as upgrades to newer language versions). Why should the burden be distributed when it can be concentrated among experts?

I leverage type systems and write tests to catch any mistakes they might make.


What do atomic source updates get you if you don't have atomic deploys? I'm just a student but my impression is that literally no one serious has atomic deploys, not even Google, because the only way to do it is scheduled downtime.

If you need to handle different versions talking to each other in production it doesn't seem any harder to also deal with different versions in source, and I'd worry atomic updates to source would give a false sense of security in deployment.


It's not just atomic updates, the critical part is to run all the relevant integration jobs. The aim is that when a change goes to the reference branch, you must understand what other code is going to get impacted. Without that, you can break builds depending on the repo you changed, and if the broken repos don't change often, the breakage could be discovered months later. If you're in that situation, you'll randomly discover piles of debt when you use a slow-moving repo.

Atomic deploys are not as important, because you can still decide to version your APIs or releases even if you're using a monorepo.

That being said, you can use multiple repos and still mostly avoid trouble by choosing how to cut your codebase (HR software is likely not going to depend heavily on presale, for instance). The metric to optimize is to minimize the required version bumps.


Git has submodules, which solve this, but for some reason people decided to stop using them a while back.


Yes, for good reason: they're a poor solution to most needs, are needlessly error prone, and so they make life more difficult especially among components with heavy developer traffic. As someone who's used both, I vastly prefer the monorepo approach, even if git is kind of slow sometimes.


> If you need to handle different versions talking to each other in production it doesn't seem any harder to also deal with different versions in source

It's much more annoying to deal with multi-repo setups and it can be a real productivity killer. Additionally, if you have a shared dependency, now you have to juggle managing that shared dep. For example, repo A needs shared lib Foo@1.2.0 and repo B needs Foo@1.3.4, because developers on team A didn't update their dependencies often enough to keep up with version bumps from the Foo team. Now there's a really weird situation going on in your company where not all teams are on the same page. A naive monorepo forces that shared dep change to be applied across the board at once.

Edit: In regards to your "old code talking to new version" problem, that's a culture problem IMO. At work we must always consider the fact that a deployment rollout takes time, so our changes in sensitive areas (controllers, jobs, etc) should be as backwards compatible as possible for that one deploy, barring a rollback of some kind. We have linting rules and a very stupid bot that posts a message reminding us of that fact if we're trying to change something sensitive to version changes, but the main thing that keeps it all sane is we have it all collectively drilled into our heads from the first time that we deploy to production that we support N number of versions backwards. Since we're in a monorepo, the PR to rip out the backwards compat check usually goes in immediately after a deployment is verified as good. In a multi-repo setup, ripping that compat check out would require _another_ version bump and N number of PRs to make sure that everyone is on the same page. It really sucks.


Yes you can, it happens when you bump the sub module reference. This is how reasonable people use git.


Submodules often provide a terrible user experience because they are locked to a single version. To propagate a single commit, you need to update every single dependent repository. In some contexts that can be helpful, but in my experience it's mostly an enormous hassle.

Also it's awful that a simple git pull doesn't actually pull updated submodules, you need to run git submodule update (or sync or whatever it is) as well.

I don't want to work with git submodules ever again. The idea is nice, but the user experience is really terrible.


And woe unto junior developers who change into the submodule directory and do a git commit; it's made infinitely worse if that's followed by a git push, because now there's a SHA hanging out in the repo which works on one machine but which no one else's submodule update will see without surgery.

I'm not at my computer to see if modern git prohibits that behavior, but it is indicative of the "watch out" that comes with advanced git usage: it is a very sharp knife


Looking back, I just do not understand why git came up with this awkward mess of submodules. Instead it should have a way to say that a particular directory is self-contained and that any commit affecting it should be two objects. The first is the commit object for the directory using only relative paths. The second is the commit for the rest of the code with a reference to it. Then one could just pull any repository into the main repository and use it normally.

git subtree tries to emulate that, but it does not scale to huge repositories as it needs to change all commits in the subtree to use new nested paths.


Or define your interfaces properly, version them, and publish libraries (precompiled, ideally) somewhere outside of your source repo. Your associated projects depend on those rather than random chunks of code that happen to be in the same file structure. It's more work, but it encourages better organization in general and saves an incredible amount of time later on for any complex project.


I don’t like this because it assumes that all of those repositories are accessible all of the time to everyone who might want to build something. If one repo for some core artifact becomes unreachable, everyone is dead in the water.

Ideally “cached on the network” could be a sort of optional side effect, like with Nix, but you can still reproducibly build from source. That said, I can’t recommend Nix, not for philosophical reasons, but for lots of implementation details.


Using the expression "that's how reasonable people do ..." is not a great conversation starter.

I've always had a bad experience using submodules; they're imo the poor developer's versioning tool. It's useful when you use a language without a good build/packaging tool, but otherwise I'm better off letting the language-specific tool fetch the depended-on code.


If the project has good separation of concerns, you don't need atomic updates. Good separation of concerns yields many benefits beyond ease of project management. It requires a bit more thought, but if done correctly, it's worth many times the effort.

Good separation of concerns is like earning compound interest on your code.

Just keep the dependencies generic and tailor the higher level logic to the business domain. Then you rarely need to update the dependencies.

I've been doing this on commercial projects (to much success) for decades; before most of the down-voters on here even wrote their first hello world programs.



