Advantages of Monolithic Version Control (danluu.com)
92 points by benkuhn on May 18, 2015 | 64 comments



I view the repo as the unit of versioning. If something has its own release cycle with its own semver number, it should be in its own repo, and vice versa. Monorepos make sense if your whole site is at a single version, as in the article. But if you want libraries that have stable versions (which I find useful, because it allows teams to own codebases - and I find Maven is much, much better than this article seems to think), then it's worth putting them in their own repositories.


As a counter-argument: I think that using a monorepo encourages a steady and careful approach to API deprecation, which is actually healthier than semver. In fact, I'll contend that semver is nothing more than an effort to bring monorepo-like bug fixes to environments that can't otherwise have monorepo-like management.

Here's the thing: if all the code is in a single repo, it's really easy for me to find all users of a given function. This means that, if I need to alter the behavior of a function, I can sanely pursue any of three options:

  1. Know for a fact I can alter the behavior without impacting
     anyone.
  2. Refactor across the whole code base in one shot.
  3. Deprecate the API and provide a replacement immediately,
     then work with teams to get off the deprecated API.
In all three cases, I can trivially know exactly what I'm impacting and make a decision on how I want to impact it. As a bonus, not only can I much more directly control deprecation times, but it's dramatically less likely that one of my client programs keeps deploying with an outdated, insecure/buggy version of my library.
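
To make option 3 concrete, here is a minimal sketch (Python, with invented names; a sketch of the pattern, not anyone's actual code) of a deprecation shim living in the monorepo, where every remaining caller of the old name is one grep away:

    # Hypothetical library module inside the monorepo; names are invented.
    import warnings

    def fetch_user(user_id, include_deleted=False):
        """New API: callers opt in to deleted users explicitly."""
        return {"id": user_id, "deleted_included": include_deleted}

    def get_user(user_id):
        """Deprecated shim, kept only until every in-repo caller
        (all findable with: git grep -n "get_user(") moves to fetch_user()."""
        warnings.warn("get_user() is deprecated; use fetch_user()",
                      DeprecationWarning, stacklevel=2)
        return fetch_user(user_id)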

In my view, the whole point of semver is handling situations wherein I cannot do that. For example, if I publish a library on GitHub, it's simply not reasonable for me to know who's using my library and why. Thus, semver provides a contract between me and you: since I can't know what you're doing, I promise not to do certain things with my library so that you can use it with some confidence. But there is a cost here: it's harder for you to track upstream, it's harder for me to know who's still using what in the older library, and it's harder for me to figure out how I can help people upgrade.
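
As a rough sketch of that contract (plain Python; versions as (major, minor, patch) tuples; an illustration of the idea, not a full semver implementation):

    # The consumer accepts any release with the same major version, because
    # the author promises that breaking changes bump the major number.
    def is_compatible(installed, required):
        """Both arguments are (major, minor, patch) tuples."""
        return installed[0] == required[0] and installed >= required

    assert is_compatible((1, 4, 2), (1, 3, 0))      # minor/patch upgrade: OK
    assert not is_compatible((2, 0, 0), (1, 3, 0))  # major bump: breaking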

I'm emphatically not saying semver is bad, but I am saying that it's deliberately compensating for source code federated across many repos. A properly managed monorepo keeps you from having to worry about it.


I think Conway's law applies: the structure of the repositories reflects the communication structure of the teams that are using them. SemVer is appropriate when you have a low-bandwidth communication channel with the other side, when it's too expensive to go into the details of what and how you need to change, so a summary is appropriate.

In an organization where everyone's familiar with and works on every codebase, you don't need that. But my experience is that codebases need an owning team, that teams larger than ~12 become unwieldy, and that, from what I've heard from friends who work there, even e.g. Facebook tends to work in smaller teams than that.

So fine, I know that the people who depend on my library are the Foo team in Australia and the Bar team in New York. But I still wouldn't want to refactor their codebase without, at least, a code review from someone on that team - in which case I want separate pull requests for the library change, the project foo change, and the project bar change. And I don't want one PR to block the others, so I want to make a library release but have foo depend on the old version until their change goes through. In short, I think I'd always take option 3.


You don't need a monolithic repo to do that. Just pull in the library as a git submodule, and you control exactly which version (by git commit hash) of the library you use. And in a single commit, you can change that commit hash and change the code using the library, just as you could in a monolithic repo.


With subrepos, you -- the consumer -- are still stuck with either stagnation ("stability") or taking on a bunch of refactors whenever you update the reference to upstream.

The producer in that case is obviously completely unaware of what you're doing so they cannot make decisions based on that fact. They may, for example, deprecate an API and provide no direct equivalent because they don't know anyone's relying on it. Then you suddenly have to come up with additional code to restore an equivalent API...

The subrepo approach is OK if you're tracking some external code. In that case it's comparable to semver but possibly better integrated with your build/tooling, so you win something. But if it's internal to your organization, you're hacking around what a single repo could provide you.


This is exactly the opposite of what I was arguing for. In a monolithic repo, I can see and control what the consumers are doing. Your version is actually just a way to implement semver. (Or, possibly, to do something worse than semver; subrepos, by themselves, provide no stability contract, so I have to think really hard about updating my subrepos.)


> Monorepos make sense if your whole site is at a single version, as in the article.

I cannot imagine that this is true for Google. There is no way Gmail is locked to the same version as Search or the same version as Maps.


And even if it were, surely they wouldn't be locked to the same version as Chrome or Android?

Basically, I think lmm and gecko are both right: there should be exactly one repository per project, where a project is defined in terms of a set of stable interfaces with the outside world. Whether the total set of stuff your company is doing should be one project, or a few, or many depends on the nature of the stuff you do, what it has to deal with, and your company structure; but the answer to that should determine your repository structure.


> Of course, there are downsides to using a monorepo. I’m not going to discuss them because the downsides are already widely known and discussed.

I searched but I honestly can't find them. The only things I can think of are: implementation problems (repo too large, every git pull takes more time than strictly necessary for your project) and tainting tools like "git log".

Those don't seem very fundamental. Solvable by a "monogit" wrapper, if someone put their mind to it. Or are they?

Is there a fundamental problem I'm missing? I feel stupid for asking, given how matter-of-factly he dismissed it :(

EDIT: I guess what I'm trying to say is: monorepos feel, theoretically, like a strict superset of many independent ones. It's just the tooling that makes it less convenient.


"It's too slow" is not a trivial problem to solve. Anything but.

At some large companies the simple act of making a commit takes the better part of an hour. Same with checking out, updating, etc.

This has serious implications for the development style and productivity. At the very least, not every developer can adapt to such a style.


    "It's too slow" is not a trivial problem to solve.
    Anything but.
Depends on your viewpoint, but I'd argue it's not actually that hard to solve unless you also require that you solve it within the context of Git. For example, in most systems, you can do something called a narrow checkout, where you only grab part of the tree of the whole repository. Subversion, Monotone, Perforce, and (soon) Mercurial all support this, whereas Git does not--and, the last time I looked at the Git protocol, it wouldn't have been sane to implement. (That was about two years ago at this point, so I apologize if it has changed.)

Beyond that, even local commit speed in a single huge repo is actually pretty easy to solve if you have file watchers (powered by e.g. inotify or kqueue), which at least Mercurial has, and I'd honestly swear Git does also.

At that point, I don't think there's a lot left that inherently has to make commits take the better part of an hour. If you are at such a company, please take a moment to run strace or something; I'm really curious where the tool's spending so much time.


> whereas Git does not [...]

From memory, Git's "sparse checkout" feature can grab a part of a larger repo.


Just as an update for you: You can do "shallow" clones in git (clone up to a maximum given depth of history).

Since Version 1.9 (released early 2014), you can use such a repo like a normal one (push, clone from it, ...)


Shallow is not narrow, and I specifically meant narrow in this particular context. A narrow clone gives you the full history for part of the tree.


Permissions? I.e., different teams with different push/pull rights for different repositories. We have that where I work; for most projects it seems like needless complexity, but some things are more closely controlled: payment handling, ops (where a change could bring down the whole org, not just an app), etc. There's also the issue of contractors being brought in; generally they see a restricted subset of the codebase.


A lot of the VCSes that use the monorepo model allow fine-grained permissions on various parts of the repo (this necessitates being able to restrictively fetch only part of the repo).


The speed is a serious downside.

Go on vacation for two weeks? You get to go on another vacation while you wait for your monorepo to update. Similarly, good luck ever trying to work from a coffee shop -- at the very least you probably have 50 MB of updates, because everyone in the company is committing code all the time.


> You get to go on another vacation while you wait for your monorepo to update.

If you care, you should read about how Google makes sure this doesn't happen.

> By putting our revision history in the cloud we provide engineers with complete access to all the source, and yet almost no time is spent checking out code.

http://google-engtools.blogspot.com/2011/06/build-in-cloud-a...

http://google-engtools.blogspot.com/2011/08/build-in-cloud-h...


This is only true if you do two unrelated things simultaneously: have a monolithic repo, and keep the whole thing checked out at all times. Provided you have a sane build system (Pants, Blaze, Buck, etc.), then the only thing that should need a full checkout is the CI server; you should be able to safely work from a narrow clone.

That said, I find your note a bit amusing. I'd have agreed a decade ago, but nowadays, 50 MB should take effectively zero time for you to both download and apply. It's worth revisiting things like this as LANs and the internet get faster; I think the speed comment is a bit outdated now.


True, though Pants right now does not have Git support for checking out only the dependencies it actually needs for the build.

I agree that it should take effectively zero time for me to download and apply 50 MB of updates, but poor-quality internet is still pretty easy to find (try working on plane wifi) -- maybe my coffee shops have worse internet than yours!


I'm going to assume the internal network at a Google office is going to make moving even gigabytes of updates trivial.


Biggest problem is also its greatest benefit. If everything CAN access anything, everything WILL eventually access everything.

Without any barriers, it's way too easy for things to devolve into spaghetti code.


This exactly.


Could an alternative be to use individual repos and also a meta-repo? The metarepo contents are the commit ids of each individual repo.

So let's say you want to update a repo which depends on another one:

Update project A, commit changes.

# Your product hasn't changed at this point

Update project B, commit changes.

# Your product hasn't changed at this point

In the metarepo, check out the new 'master' branches of projects A and B, and commit that to the metarepo

# Your product is now updated!
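
For what it's worth, a minimal sketch of what the meta-repo could hold (Python; repo names and hashes are made up, and this is more or less what git submodules record for you):

    # pins.py - the meta-repo's only content: sub-repo name -> commit id.
    # (Names and hashes below are invented for illustration.)
    PINS = {
        "project-a": "9b1c2f0e7d",
        "project-b": "47d8aa31c5",
    }

    def release(pins, repo, new_commit):
        """Only committing an updated pin to the meta-repo changes the product."""
        return {**pins, repo: new_commit}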


This sounds a lot like git submodules.


Or Mercurial subrepos.

Both of which are a far greater pain in the ass than monorepos.


Or git subtree


Both variants are nasty in a huge repo. The quoted example at the end of the article is a good illustration. If you have different repos, you sometimes can't push to repoA because you are depending on repoC. If everything is in one repo, it should just work. But that's not really the case: you still need to wait for changeC before you can go and do changeA. Same problem, independent of the repo situation.

The solution is that changeA must be backwards compatible. In a complex system you always need to have some kind of backwards compatibility, at least for some time.

In the end, neither monorepos nor multi-repos really work in huge, complex scenarios.


Malarky.

>> With multiple repos... having to split a project because it’s too big or has too much history for your VCS is not optimal... With a monorepo, projects can be organized and grouped together in whatever way you find to be most logically consistent, and not just because your version control system forces you to organize things in a particular way.

Uuuuh, if your one project has to be split because it's too big for your VCS, then you aren't going to make that thing smaller by putting multiple projects in with it.

>> A side effect of the simplified organization is that it’s easier to navigate projects.

That's a UI issue. Build a better UI, don't use a dirty hack, especially one that has other implications.

>> A side effect of that side effect is that, with monorepos, it’s often the case that it’s very easy to get a dev environment set up to run builds and tests.

With the growing trend of package managers being able to install dependencies straight from a git repository, I don't see this being an issue much longer. Again, this is a UI issue.

>> This probably goes without saying, but with multiple repos, you need to have some way of specifying and versioning dependencies between them.

Yeah, no shit, that's just good software development. The argument here is that a monorepo lets you be lazy.


You're missing a huge part of the benefit, which is that when you go and do internal API cleanups you don't have an awkward dance across N repositories, you just have a change that's atomic across the entire project.

It's a huge benefit to be able to have all your mobile apps and web apps and whatever in the same repo, because then you can easily see who calls what, and how various RPCs are used, etc. Don't knock it until you've tried it.

(I used to think monorepos were dumb, but over the course of several years I came around.)


That's the laziness I'm talking about. These companies provide public APIs, and we all have the same headaches of having to update our software to those APIs. By dogfooding your own APIs in the same way your customers eat them, you force yourself to communicate change correctly.


You say "lazy", I say "not wasting time on unnecessary things". Versioning dependencies is something done to solve problems, not an inherently good moral imperative.


The right approach depends on your configuration management strategy, which in turn depends on the kind of product you are making.

If you are developing a big system with lots of components that need to work together (e.g. an embedded machine vision system with a multitude of related data-recording, calibration, simulation & test utilities and subsystems), then the matrix showing which versions and configurations are compatible with one-another quickly grows to an unmanageable size.

Unless you have a god-like configuration management system, the only practical approach is to go "green trunk" and co-version all of your components. Granted, this doesn't necessarily force you to use a single repo, but a single repo is (at least initially) the simplest approach.

Of course, you could go "old-skool" and define interfaces up-front and then freeze them, but this just slows your development down to a snail's pace. Better (IMHO) to co-version your components and then lean on your integration tests to maintain compatibility.


Of course ... when your development team grows really big, then this may well become untenable ... but "green trunk" should work in development teams up to a couple hundred developers or so.


One big issue where I would love Open Source to innovate: Test downstream.

If you change an Open Source library, there is no way to check with the users of your library, because they are hard to track down and use various different build processes. In a monorepo, this is much easier: you could create an automatic "build and test everything".
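
For instance, a hedged sketch of "build and test everything" over a monorepo checkout (Python; the directory layout and the use of pytest are assumptions for illustration):

    import pathlib, subprocess

    # Treat every directory containing a pyproject.toml as a project and run
    # its tests; a failure here is a downstream breakage caught before release.
    failures = []
    for marker in sorted(pathlib.Path(".").rglob("pyproject.toml")):
        project = marker.parent
        result = subprocess.run(["python", "-m", "pytest", str(project)])
        if result.returncode != 0:
            failures.append(str(project))

    print("downstream breakage in:", failures or "none")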


We've been working on tooling for Rust that runs tests across everything in the package repository, as a way of hopefully detecting regressions. https://internals.rust-lang.org/t/regression-report-beta-201...


For a moment I was a little frustrated at how we split everything into separate repos at work, mostly due to the difficulty of finding and refactoring code.

And while a monorepo would help a lot with discoverability, I think the promises this article makes about cross-project changes are a bit optimistic, since it ignores the difficulty of doing deployments in live distributed systems. Even if you have a single git repo, there will certainly be an order you need to deploy services in so that things don't break when the API consumer gets updated before the API provider. Google/FB/Twitter/etc. certainly have a better deployment system than we do, so maybe for them it's easy, but it's not something that is just solved by going to a monorepo.


I wonder if Google is still using Perforce? That's one of the arguments used at our company to argue that perforce handles huge repos very nicely.


Look at the apache SVN repo. http://svn.apache.org/repos/asf/


<wrong>Google is still using don't-look-behind-the-curtain-it's-totally-not-Perforce-but-it's-Perforce</wrong>

Google is using a Perforce lookalike that is not actually Perforce, but they're moving to Mercurial. Subversion also scales up to similar sizes, although I confess to not knowing what "similar" means. (I do know that Google did an experiment and concluded that Subversion would scale to their needs, but they were already committed to Perforce by that point.)

EDIT: Got corrected by someone who has good reason to know.


> Google is using a Perforce lookalike that is not actually Perforce, but they're moving to Mercurial.

Do you have a source for this move to Mercurial? This is the second time I've heard it, but I couldn't find a source before.


I don't have a publishable source, no, but you can look at the number of @google.com email addresses on the Mercurial mailing list if you want to do some inference of your own, or just hang in there for a bit and you won't have to. :)


I find it curious you were corrected. I think I observed Google employees contributing to improved Mercurial scaling, which also hints in that direction?


I was corrected on the claim that Google uses Perforce. They apparently do not anymore, but rather wrote their own Perforce workalike.


Oh thanks, I read the correction the wrong way around.


For years, large codebases at Adobe and Microsoft were built using Perforce. I don't know if they still are today, but the argument that Perforce is good at handling huge codebases is valid.


There's an older paper/blog from a Google admin describing the set of scripts they use to periodically reboot their Perforce servers.

That should tell you how well Perforce handles huge repos, and I imagine it's not gotten better.


I think there's an important distinction between operational/infrastructure/support issues and development workflows and ease of use. While Perforce itself might be a beast to support from an Ops perspective it may provide enough positive benefits to the developers that it's well worth the trade off.


I think having to reboot the server periodically (as well as the problems that prompted me to find out how Google runs theirs) is evidence enough to argue against "<that> perforce handles huge repos very nicely".


When people say that Perforce handles huge repos well, they're referring to a characteristic of the technology, not the server implementation of the technology. Git, on its own, is not very user-friendly when it comes to large repos or binary files. It's a limitation of the technology regardless of server implementation.

Believe me, I'm not saying that Perforce is flawless. But there's a reason it's used heavily in the game dev and media industries. Does it suck to administer? Certainly. Does that mean that Google's problems with scaling or operating something like Perforce apply to most of us? No. Most of us can resolve issues with Perforce by applying some limitations and governance, because we're not operating at the massive and distributed scale that Google is.


> Me: I think engineers at FB and Google are probably familiar with using smaller repos (doesn’t Junio Hamano work at Google?), and they still prefer a single huge repo for [reasons].

I'm a former such engineer; I still prefer smaller repos. There's enough engineers at both companies that I can assure you such opinions (and knowledge) are quite varied.

> it’s often the case that it’s very easy to get a dev environment set up to run builds and tests.

I've worked with both; in both cases, the workflow was essentially a checkout, followed by a build, followed by running the tests. I've found this is more a product of the environment (i.e., do the developers care about tests being easy to run) than the VCS in use.

> With a monorepo, you just refactor the API and all of its callers in one commit.

I'd restate this: with a monorepo, you must refactor the API and all of its callers in one commit. You cannot do it gradually, or you will break someone. A gradual refactor is only possible in multiple repositories, specifically multiple repositories that obey something resembling semantic versioning. You make your breaking change, and because it is a breaking change, you up the version to indicate that. Reverse-dependencies wishing to update then must make the change, but can do so at their leisure.

I've seen some truly heroic work done to get "APIs with thousands of usages across hundreds of projects get refactored". Sometimes it _is_ easy: you can track down the callers with a grep, and fix them with a Perl script. But I think you must limit yourself to changes of that nature: massive refactors too great for a script would leave you to edit the call sites by hand. Though with thousands of callers this is probably true anyway, I find that having to move even a couple dozen call sites through a major change (such as one where the paradigm expressed by the API is completely wrong) is difficult if you must update them all at once.
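
For the easy case, the grep-plus-script refactor really can be this small; a rough sketch (Python; old_name/new_name are placeholders, and a real rename at scale usually wants a proper codemod tool):

    import pathlib, re, subprocess

    # List every tracked file that mentions the old call, then rewrite it.
    grep = subprocess.run(["git", "grep", "-l", "old_name("],
                          capture_output=True, text=True)

    for path in grep.stdout.splitlines():
        p = pathlib.Path(path)
        p.write_text(re.sub(r"\bold_name\(", "new_name(", p.read_text()))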

Last, the most common "monorepo" system I've seen is Perforce, and compared to git it has such stark usability issues that I'd rather not go back to it (staging, git add -p, bisect, real branches). This comment though,

> where it’s impossible to do a single atomic commit across multiple files

I would hesitate to use "atomic" to describe commits in Perforce; if you check out CL X, make some changes, and "commit" ("submit" is Perforce's term), the parent of your new CL might be Y, not X, and you might get no warnings about this, either. Collisions on an individual file will prevent the submit from going through, but changes on separate files (that together represent a human-level merge conflict) will not get caught. (They wouldn't show as merge conflicts in git, either, but git will tell you that someone updated the code and refuse your push; unit tests must catch these, but in Perforce's case, you must run them after making your change permanently visible to the world.)


It is not difficult to address the "atomic" commit issue you mention via the use of triggers. And indeed streams provide it out of the box.

These days, with the use of shelving, there are integrations with CI tools such as Jenkins which provide "pre-flight" builds before checkin - thus before your changes are world visible.


I've found companies typically use monolithic development because they don't get all of the benefits of modular development, e.g. they don't re-use code because they only have one website (and don't care about open sourcing). I'll stick with modular all the way. :)


Most of the arguments in the article were in cases where you do reuse code though.

If you didn't reuse code (across repos) then you wouldn't have any of the problems that a monolithic repo solves.


If you're willing to "move fast and break things" then doing an API change in one go across all your projects is great. But in the kind of industry where downtime is unacceptable, if project A and project B both use library C, when you make a change to library C you want to be able to release and test project A and B separately, according to their own project lifecycles. So you need the ability to release and tag library C on its own, and it makes sense for it to be in a separate repository.


I just cannot imagine trying to make sense of the history of a giant monorepo.

What I don't get is the example from Twitter. If I need a fellow developer to fix projectB and projectC for me to fix projectA, I ask him to fix it. As soon as he has committed it to their respective repos, the build server picks it up, and I can expect that the next time it builds projectA, it pulls the latest versions of projectB and C and uses them.


If you always pull the latest version of other projects, you have no way to have a consistent snapshot of a working system. If you use a monorepo, it means that you always know which versions of every file in every project are in use at a particular commit. This of course is also true if you use separate projects and pin their dependencies. But if you use separate projects and always grab the latest version of each subproject, there's no way to ensure you have compatible versions of different projects in use.


At least in TeamCity, I can tell it what version of an artifact I want; it's even possible from within the VCS with some PowerShell magic. So a non-breaking change like a bugfix will update the build number, but not the rest. If the fix includes breaking changes, I want to adapt my code anyway, and then update the reference to the newest version.

For our .net projects, I've set up an internal NuGet repo to handle internal dependencies.

This honestly sounds more like the best workaround for tooling that isn't good.


You don't need a monorepo to solve this problem. The dependent projects would depend on a specific version of an artifact, rather than a 'latest' pointer. Example: Project 1 depends on version 1.2.3 of Library A, and Library A builds are hosted in an artifact repo. Project 2 needs an update to Library A, and those updates are built and released as version 1.3.0. Project 1 continues to use version 1.2.3, until it's updated to work with the new 1.3.0 API.


Of course you don't need a monorepo to solve the dependency problem. My point was that just using the latest version of all projects in different repos is not necessarily an acceptable solution.


If you want to make a backwards-incompatible API change, it's very convenient if you can change both sides in the same commit.


But if you do versioning properly, then you can change "both sides" in one commit by including the change of the dependency version.


Exactly - an extra step in the development process: you need to do an official release of both libraries/projects - do your work (same as in a monolithic repo), bump the version x2, update dependencies x2, make a tag x2, and hope others haven't changed anything and made a release in between, or you'd have to do it all again. That goes against "move fast and break things", or at least the move fast bit.


Taking NuGet as an example - you make a change, bump the version number, commit and push it. The build server automatically pushes it to your internal NuGet server. Now your IDE will notify you that there is an update to a package you use, and you can update, make your own bugfix, and commit it.

The alternative with a monorepo is - you make a change, bump the version number, commit and push it. You watch for the commit you need (in a monorepo this might be harder than it has to be; alternatively the developer can send you an IM) and hopefully there aren't any merge conflicts when it's done. You pull the commit, make your fix, and commit it.

It's largely a matter of poor tooling.



