I understand that your customers want very large git repositories, so that's the...

azornathogron · on Nov 12, 2021

> But why do so many people want these monorepos?

Lots of people elsewhere in the thread have given reasons that I won't repeat, but I will say I think it's a very different situation for an engineering-focused company compared to, for example, open source projects.

In a company you get a lot of advantages from uniformity of tools and practices and dependencies across the many projects/products that the company has, and you have the organizational structure to onboard new employees onto that way of working and to maintain the level of uniformity and actually benefit from it.

In open source, every project and even every contributor is sovereign unto themselves. In this situation, the difficulties of cross-project changes are primarily human organizational challenges - even if you could collect a lot of projects into a monorepo, you couldn't get the big benefits from it because you wouldn't be able to get everyone to agree to stick to a single set of tools and do the other things that make monorepos powerful.

I think monorepos are great for companies.

But using git as a basis for a monorepo system is a bad idea for the reasons you suggest. It's totally the wrong tool for the job.

I remember the transition period when (nearly) everyone in the open source world moved over multiple years from subversion - or in some cases cvs - to git. The advantages of decentralised development for open source were really clear, and so was the way git tracked branch and merge history (when git was first developed, subversion barely tracked merge history at all, which made it easy to end up with bad merges after accidentally merging the same change multiple times). And at the scale of repositories in the open source world, git was massively faster than svn. But the speed doesn't scale, and if you've got a monorepo you almost certainly aren't doing real decentralised development, and the merge tracking in subversion was fixed (well... maybe; honestly I haven't been paying enough attention to be sure). So seriously SvnHub would actually be a better basis for the monorepo world. It's almost unfortunate that git took over so comprehensively that now companies want to use it for a different context where it really doesn't shine at all.

CRConrad · on Nov 13, 2021

> subversion barely tracked merge history at all, which made it easy to end up with bad merges after accidentally merging the same change multiple times

Weird. Shouldn't that be just a no-op?

azornathogron · on Nov 13, 2021

It's been more than a decade so I almost certainly oversimplified the problem.

But to give an approximate explanation: Subversion's model of how branches work is fundamentally different to git's model. In fact in a sense subversion doesn't have a concept of branches at all in its core data model. Subversion just gives you a directory tree with a linear history, and then makes it cheap to copy directories around. The recommended layout for a subversion repository looks something like this:

    - /tags
    - /tags/release-v1
    - /tags/release-v2
    - /branches
    - /branches/my-feature-branch
    - /trunk/...

If you want to create a branch you just copy your trunk directory to a new name under /branches. In order to reliably merge back & forth between branches or between a branch and trunk (which are just directories in the repository), you need more than just the diff between the latest state of each branch; you really need to know about the historical relationships between them: when the directory was created, what was it copied from, and what other merges have happened since then? But this information - at least "what merges have already been done" - literally wasn't tracked until about subversion 1.5 (which was actually contemporaneous with a lot of the migrations from subversion to git, at least in my recollection).
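To make the "what merges have already been done" point concrete, here is a rough toy sketch in Python. It is not how Subversion is implemented: `Branch`, `merged_revs` and `merge_revision` are made-up names, and a "revision" here is just "append one line". The set of merged revision numbers is only a stand-in for the bookkeeping that subversion 1.5 added (as far as I know it is stored in the `svn:mergeinfo` property).

```python
from dataclasses import dataclass, field


@dataclass
class Branch:
    """Hypothetical toy branch: a list of lines plus optional merge bookkeeping."""
    lines: list = field(default_factory=list)
    merged_revs: set = field(default_factory=set)  # stand-in for merge-tracking metadata


def merge_revision(branch, rev, added_line, track):
    """Merge one revision (modelled as 'append a line') into a branch.

    With track=False nothing remembers that the revision was merged,
    so merging it a second time applies the change a second time.
    """
    if track and rev in branch.merged_revs:
        return False  # already merged, so skip it
    branch.lines.append(added_line)
    if track:
        branch.merged_revs.add(rev)
    return True


# Without tracking: the same revision merged twice lands twice.
untracked = Branch()
merge_revision(untracked, 42, "fix: handle empty input", track=False)
merge_revision(untracked, 42, "fix: handle empty input", track=False)
print(len(untracked.lines))  # 2

# With tracking: the second merge of revision 42 is skipped.
tracked = Branch()
merge_revision(tracked, 42, "fix: handle empty input", track=True)
merge_revision(tracked, 42, "fix: handle empty input", track=True)
print(len(tracked.lines))  # 1
```

Real merges work on diffs rather than appended lines, so a duplicate merge usually shows up as a conflict or a doubled hunk rather than a neatly repeated line, but the missing piece of information is the same.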

Some references for you:

- The classic "svn book" about branching: https://svnbook.red-bean.com/en/1.7/svn.branchmerge.using.ht... - note the section at the bottom "The Key Concepts Behind Branching" which notes "First, Subversion has no internal concept of a branch—it knows only how to make copies. When you copy a directory, the resultant directory is only a “branch” because you attach that meaning to it."

- The subversion 1.5 release notes explanation of merge tracking: https://subversion.apache.org/docs/release-notes/1.5.html#me...

To try to place this in time, I was using subversion (and was pretty happy with it!) in 2007. By 2009 I was using git wherever I could. Of course, the widespread migration toward git throughout the open source ecosystem was spread over a lot more years than this. But hopefully that gives some context.

CRConrad · on Nov 14, 2021

Ah, thanks, TIL something again.

Seems (at least the original version you described of) subversion was in some/several/many ways genuinely inferior to git.

But yeah, having thought about it a bit more, re-applying the same change again in any versioning system is of course not necessarily "a no-op" if the two applications don't immediately follow each other, i.e. if other changes that partially reverted the original ones have happened in between. Then re-applying the original changeset would of course re-do that part of the original changes.
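A tiny Python sketch of that reasoning, under a big simplifying assumption: a changeset is modelled as "set these keys to these values", whereas real version control systems apply diffs/hunks. The `apply_changeset` helper and the `timeout`/`retries` keys are invented for illustration.

```python
def apply_changeset(state, changeset):
    """Return a new state with every key named in the changeset overwritten."""
    new_state = dict(state)
    new_state.update(changeset)
    return new_state


original = {"timeout": 10, "retries": 1}
changeset = {"timeout": 30, "retries": 5}

# First application of the changeset.
after_first = apply_changeset(original, changeset)

# Re-applying it immediately changes nothing: a no-op.
print(apply_changeset(after_first, changeset) == after_first)  # True

# A later, unrelated change partially reverts the original one.
after_revert = apply_changeset(after_first, {"retries": 1})

# Re-applying the original changeset now re-does the reverted part.
after_reapply = apply_changeset(after_revert, changeset)
print(after_reapply == after_revert)  # False
print(after_reapply)  # {'timeout': 30, 'retries': 5}
```

So whether a repeated merge is harmless depends on what happened in between, which is exactly why the system needs to remember what it has already merged.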

erik_seaberg · on Nov 12, 2021

> its strengths (smaller composable repos)

I think this is happening because git submodules have a hard and confusing CLI even compared to the rest of git.

> would have been more than happy to use "SvnHub"

svnmerge.py caused almost as many trainwrecks as it avoided. I don’t think resolving merge conflicts can be made to work so long as the repo relies on being informed of all file moves and copies, because devs just didn’t do that consistently.

vtbassmatt · on Nov 12, 2021

Thank you for your reply. I think you raise some great points, and I'll respond to a few that I have knowledge of.

> But why do so many people want these monorepos?

Google-copying is part of it for sure. And I agree with your position - copying Google isn't a great reason on its own. Some more valid reasons: code which deploys together often wants to live together. Common dependencies can be easier to find (and use) if they're right there in the repo. Related, making cross-cutting changes can be easier when you can make an atomic change everywhere. Also, big repos often started out as small repos and then grew over time; the cost of a major change might outweigh the friction caused by keeping Git in place.

Consider also that we might not all be talking about the same thing. Some people (even here in this topic) consider Linux to be "large" and "a monorepo". Linux isn't notably big anymore, and it's not unusually challenging to host or to use locally. It's arguably a monorepo since it contains most of the code for the Linux kernel, but to me, "monorepo" implies needing special attention. So I probably wouldn't classify Linux as a monorepo for this discussion.

> I can't figure out why so many people are married to git but insist on leaning into its weaknesses

I'm sure there are many reasons. The common one I hear in my role boils down to, essentially, "Git is the de facto standard". That can be expressed in several ways: "it's harder to attract and retain engineers with an uncommon toolset"; "we want to focus on our core business, not on innovating in version control"; "all the other tools in our kit work with Git". (NB: I put those in scare quotes to distinguish that they aren't my or GitHub's position, they're things I've heard from others. They're not direct quotes, though.)

I talk with customers weekly who want to mash dozens of independent repos (Git or otherwise) together. If they aren't going to reap any of the benefits mentioned above or elsewhere in this topic, I strongly advise against it. At the end of the day, GitHub doesn't care if you have one giant monorepo or 1000 tiny ones; the product works pretty well for both. I suppose that's why I felt compelled to reply to your thread in particular -- yes, we're investing in monorepos, but no, it's not because we're trying to drive people to them.

anon9001 · on Nov 12, 2021

Thanks for engaging in discussion about this. If nothing else, you're building my confidence in GitHub.

> Consider also that we might not all be talking about the same thing.

I think this is 90% of the confusion/disagreement in this thread.

I've always thought that it's fine to use git however you want, but if you hit a bottleneck (it gets too big and becomes slow), then you split your repo along logical boundaries (usually a large library can be split out and versioned, for example).

Somewhere over the last 15 years, that's changed, and the zeitgeist is now "mash dozens of independent repos into one repo" no matter the situation. Everyone in this thread that's suggested monorepos aren't the way to go has been downvoted, which caught me by surprise.

> I suppose that's why I felt compelled to reply to your thread in particular -- yes, we're investing in monorepos, but no, it's not because we're trying to drive people to them.

I believe you're sincere, and perhaps I was a bit too cynical about GitHub's motivations. Sorry about that; this topic is just frustrating, and GH is in a position to help make things better.

Do you think GitHub could offer some kind of practical guide about when to use monorepos and what their limitations are?

I think part of the problem is that git's docs and the git-scm book aren't going to prescribe a way to use the software, because it's intentionally extremely flexible. Git users appreciate this, but GitHub users might lack good guidance.

As another reply pointed out, this might also have its origins in the git-submodule porcelain being confusing and underutilized.

Most GitHub users have probably never used submodules, don't know when a git repo would start to slow down due to size, aren't sure how to split out part of a repo while preserving history, and probably haven't thought too much about internal dependency versioning.

> to me, "monorepo" implies needing special attention

I think you and I are actually in total agreement, but the vast majority of corporate GitHub users have no context about where git came from, what it's good at, what its limitations are, and how to use it for more than trunk-based development.

The ideas that "Linux is a monorepo" or "monorepos are the only natural way to manage code for any project" or that git should be "fixed" to support centralization should be concerning to GH.

I suspect these people don't have "monorepos" in the way that you and I are talking about them. They probably just have mid-sized repos that haven't needed to be split up yet.

Even if you can support those customers as they grow into monorepos without friction due to enormous technical efforts, we're failing to teach a generation of engineers how to think about their most fundamental tools.

I appreciate that GitHub is trying to use technology to smooth out these points of confusion and improve git to work in every scenario, but publishing customer education materials about how to make good decisions about source code management would also help a lot.

> I talk with customers weekly who want to mash dozens of independent repos (Git or otherwise) together. If they aren't going to reap any of the benefits mentioned above or elsewhere in this topic, I strongly advise against it.

It sounds like you're giving good advice to the teams that you talk to, but what can I show my team that's authoritative from GitHub that says "don't mash dozens of independent repos together and then blur the lines so you can't tell what's independent anymore"?

I thought this was obvious, but it's not, and I don't know how to get people to understand it.

This is a particularly bad problem when your independent repos are working fine, but then there's a company-wide initiative to go "monorepo", and it's obvious in advance that the resulting monorepo won't be usable without a lot of extra work.

Maybe I've just been unlucky, but every time I've had a monorepo experience, it's been exactly that approach. And, as I'm sure you can tell by this thread, I haven't had much luck in convincing other engineers that mashing all unrelated code into one repo is a silly thing to do.