I always find monorepo/polyrepo discussions tiresome, mostly because people take the limitations of git and existing OSS tooling as a given and project those failings onto whichever paradigm they are arguing against.
I'm pretty excited for new OSS source control tools that will hopefully help us move past this discussion. In particular, Meta's Sapling[0] seems like a pretty exciting step forward, though they've only released a client so far. (MS released its VFS for Git a while back, but unfortunately it's now deprecated.)
It's like telling someone that throwing all their stuff in one huge box is always better than using smaller boxes. Obviously it depends on the situation.
I strongly prefer the simplicity of a monorepo, but I once worked on a project that used three repos and kept them in sync by having IntelliJ manage the branches. Make a new branch, and you make it in all three repos simultaneously. Switch branches, and you switch in all three. That made it very convenient.
The project I'm currently working on just switched from polyrepo to monorepo. Interestingly, front and back end were in a single repo, but there was another repo with a bunch of definitions and datatypes, and a third with a frontend component library that was meant to be shared with another team, but that never happened. And that just made development really awkward.
I think polyrepo only makes sense if you actually have multiple teams with clearly separated responsibilities. But then each team still effectively works monorepo, don't they?
I'm on the same page with you. A repository is a boundary of responsibility, and repositories should ideally be able to evolve independently of each other.
Trying to develop software in multiple repos with a single team does not make sense and creates extra load. The reverse is also true, and creates a risk of collisions, since different teams can touch the same file unintentionally.
Extending from that point, I don't think Git is a bad or insufficient VCS. Like every piece of software it has opinions, modes of operation, expectations of its users, and limitations. One needs to understand the tool one is working with.
People badmouthing tools because they don't work the way they expect really rubs me the wrong way sometimes. If you can hold a hammer wrong, you can hold software wrong, too. This is why people have been saying RTFM since forever.
> I think polyrepo only makes sense if you actually have multiple teams with clearly separated responsibilities. But then each team still effectively works monorepo, don't they?
If you have a cross-functional team they might make a repo for the frontend and a repo for the backend, unless steered to do otherwise.
In my personal experience, relying on IntelliJ syncs and not knowing how git works is how we got several emergency production reverts applied in a matter of days, because someone accidentally kept deploying broken changes to production while thinking they were only working locally.
The monorepo decision has little to do with VCS from my perspective - I can't think of a single case where git was the make-or-break decision point. It's primarily about operations, testing, dependency management, and release processes.
For me it comes down to this: do you want to put in the effort up front to integrate all your dependencies in a systemic way at development time? Or do you want small pieces that can evolve independently, effectively deferring system integration concerns to release time?
Monorepo or Manyrepo - either way, someone has to roll up their sleeves and figure out how all the libraries and services fit together. It's just a matter of when and where you do that.
Can't we just generalize the package manager already and push it into VCS? I just want to commit some code and roll it out in the next release. Somewhere in the tree is a top level makefile or something. Stop making this complicated.
When it comes to scale, force versions to be incremented at the same time across the lot. You can even spin up a new deploy set and gracefully handoff load.
My point is. Just let the source code, in whatever language, be distributed as packages in as modular a way as developers want. One way or another you're going to end up with a makefile or shell script that builds the damn thing. If you don't, then someone fucked up and your build is effectively broken. Monorepo or not.
> My point is. Just let the source code, in whatever language, be distributed as packages in as modular a way as developers want. One way or another you're going to end up with a makefile or shell script that builds the damn thing. If you don't, then someone fucked up and your build is effectively broken. Monorepo or not.
The point of a monorepo is that it is not at all modular. You can upgrade shared dependencies all in one go. You can make systems-wide changes with confidence. The monorepo lets you move everything together in lockstep.
Monorepos also allow for incredible sharing potential, but that's less of a selling point.
Yeah, kinda. If you've got a frontend (e.g. phone app) and a backend, you still need to think about your upgrade scenarios. You probably still need some concept of versioning so you can keep track of keeping backend support for whatever apps will still be in the wild for a while.
I say this as a big fan of monorepos - they're great but they don't solve all problems.
For sure. You're not even absolved of deploying internal microservices in the correct order during certain classes of migrations, or even changing fields within a single service. Systems at scale are hard and require discipline.
Either one (a monorepo minus its major downsides, or a polyrepo minus its major downsides) would effectively make the discussion of the trade-offs moot. Whichever happens first will likely "win": once people adopt it, there won't be enough gained by deviating from the norm to justify switching.
> people take the limitations of git and existing OSS tooling as a given and project those failings onto whichever paradigm they are arguing against.
Very much agree. We have the whole "rebase to a single commit for your PR" vs. "keep a history of what actually happened" argument. One side wants to view concise, comprehensible change histories and be able to bisect them to see the origins of bugs etc, the other wants to use an rcs/vcs for one of the primary tasks an rcs/vcs is supposed to undertake - recording and keeping safe a version history of code as it is developed. To each the other side is wrong to even want that.
There have been source control systems in the past that would cater to both quite happily. Nightmarish, terrible, slow, heavyweight source control systems that involved learning an entire configuration language to use effectively, and which I certainly wouldn't recommend using today! (Rational Clearcase, I'm looking at you). They have existed and conceivably could do so again.
There is a middle ground between "retain the history of every typo anyone ever made" and "squash a month's worth of work into a single commit". Without having to learn a huge amount you can rebase your private branch now and again to squash all those typos and "fix the tests" commits into a single coherent commit describing the step towards the feature you're working on.
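For anyone who hasn't tried it, the flow is roughly this (a sketch, assuming the branch was cut from origin/main):

    git fetch origin
    git rebase -i origin/main
    # In the editor, leave one "pick" per logical step and mark the
    # "fix typo" / "fix the tests" commits as "fixup" (or "squash" to
    # keep their messages), then save and exit.
    git push --force-with-lease

The --force-with-lease keeps you from clobbering anything someone else pushed to the branch in the meantime.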
What I'd really love in that context is for Github's PR interface to surface individual commits better. I want to be able to step through each commit, reviewing the incremental changes towards a fully working feature, rather than have to review the entire thing as one big blob.
It already does that. Click on the individual commit to only see those changes, then click next to continue. That's the only way I review commits, and it has been available for at least a few years
> Who wants to record typos or patch after patch in your private branch? That has absolutely no value.
A couple of years ago I wrote most of a custom X509 validation stack in Java before realising we didn't need one after all: there was a way to do what we needed with the standard stuff, so it wasn't in the final PR. Three months later things changed, I did need one after all, and being able to look it up saved me several days of work.
It can have a huge amount of value.
Who cares about revision history being neat and atomic? It's not been of the slightest consequence to me. It's not like there's a realistic maximum number of revisions you can store in your repo.
But to the original point, clearly both of these features have a use-case, and people want them. Other source control systems in the past (which were much worse in many other ways) catered for this. But the current tension only exists because the dominant source control system doesn't really allow you to pick and choose how you see the data.
I'm so glad that the days of ClearCase are over, managing multisite replicated vobs and whatever bullshit viewspec required to make releases work was a nightmare that I wouldn't wish on my worst enemy. <big tech company> also had some horrendous frontend to clearcase that was actually used by the engineers that had strange and wonderful interactions that only the guy that left 5 years ago knew about and left for us to re-discover.
One reason I like cleaning up the history before merging is that anyone can then `git blame` and land on a commit that shows the feature/bugfix as a whole and hopefully with a clear explanation in the commit message. Not a bunch of "Fix typo" kind of commits.
IMO the biggest drawback to a monolith, maybe beyond those listed, is losing the 1-1 mapping of changes to CI to releases. If you know "this thing is broken", the commit log is a fantastic resource to figure out what changed which may have broken it. You submit a PR, and CI has to run; getting most CI platforms cleanly configured to, say, "only run the user-service tests if only the user-service changes" isn't straightforward. I understand there are some tools (Bazel?) which can do it, but the vast, vast majority of systems won't do it near-out-of-the-box, and will require messy pipelines and shell scripts to get rolling.
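To make that concrete, the "messy shell script" usually ends up looking something like this (a sketch only; the services/ layout and the Make target are assumptions, not something a CI platform gives you out of the box):

    #!/usr/bin/env bash
    set -euo pipefail
    # Compare against the merge base with main, not just the last commit.
    base=$(git merge-base origin/main HEAD)
    if git diff --name-only "$base" HEAD | grep -q '^services/user-service/'; then
      make -C services/user-service test
    else
      echo "user-service unchanged; skipping its tests"
    fi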
There are also challenges with local dev tooling. Many VSCode language extensions, for example, won't operate at their best if the "language signaler" file (e.g. package.json for JS) isn't in the root of the repo; from just refusing to function, to intellisensing code in another project into this one, all manner of oddities.
Meanwhile, I don't think the purported advantages are all that interesting. Being able to write the blog post and make the change in the same PR? I've never been in a role where we'd actually want that; we want the change running silently in prod, then publish the blog post, then flip a feature flag or adjust A/B test proportions. Even an argument like "you can add schemas and the API changes in one go": you can do that without a "monorepo", just co-locate your schemas with the API itself. This isn't controversial or weird; this is how everywhere I've worked operates.
None of that is to even approach the scale challenges that come with monorepos at even reasonable sizes; which you can hit pretty quickly if you're, say, vendoring dependencies. Trying to debug something while the Github or Bitbucket UI is falling over because the repo is too large, or the file list is too long, isn't fun.
I'm not going to assert this is a hill I would die on, but I'm pretty staunchly in the "one service, for one deployment artifact, for one CI pipeline, for one repo" camp.
As a big fan of the monorepo approach personally, I would say the biggest benefit is being able to always know exactly what state the system was in when a problem occurred.
I've worked in large polyrepo environments. By the time you get big enough that you have internal libraries that depend on other internal libraries, debugging becomes too much like solving a murder mystery. In particular, on more than one occasion I had a stacktrace that was impossible with the code that should have been running. A properly-configured monorepo basically makes that problem disappear.
This is more of a problem the bigger you are, however.
I think we're just reducing down to "programming at scale" is hard, at some point.
Sure; that is a really big problem, and it becomes a bigger problem the bigger you are. But, as you become bigger: the monorepo is constantly changing. Google has an entire team dedicated to the infrastructure of running their monorepo. Answering the question: "For this service, in this monorepo, what is the set of recent changes" actually isn't straightforward without additional tooling. Asking "what PRs of the hundred open PRs am I responsible for reviewing" isn't straightforward without, again, additional tooling. Making the CI fast is hard, without additional tooling. Determining bounded contexts between individuals and teams is hard, without additional tooling.
The biggest reason why I am anti-monorepo is mostly that: advocates will stress "all of this is possible (with additional tooling)", but all of this is possible, today, just by using different repos. And I still haven't heard a convincing argument for what benefits monorepos carry.
Maybe you could argue "you know exactly what state the system was in when something happened", sure. But when you start getting CI pipelines that take 60 minutes to run, or failed deployments, or whathaveyou; even that isn't straightforward.
And I also question the value of that; sure you have a single view of state, but problems don't generally happen "point in time"; they happen, and then continue to happen. So if we start an investigation by saying "this shit started at 16:30 UTC", the question we first want to have answered is "what changed at 16:30 UTC". Having a unified commit log is great, but realistically: a unified CI deploy log is far more valuable, and every CI provider under the sun just Does That. It doesn't mean squat that some change was merged to master at 16:29 if it didn't hit prod until 16:42; the problem started at 16:30; the unified commit log is just adding noise.
You had the wrong tools. It doesn't matter if you have a monorepo or not, you will need tools to manage your project.
I'm on a multirepo project and we can't have that problem because we have careful versioning of what goes together. Sure many combinations are legal/possible, but we control/log exactly what is in use.
> Sure many combinations are legal/possible, but we control/log exactly what is in use.
I'll acknowledge our tooling could have been better, but isn't it better to just be able to check out one revision of one repo and have confidence that you're looking at the code that was running?
If I have a services based architecture then I can jump straight to the repo for that particular service and have confidence that it is the code that is running.
So instead of adopting a system that makes the problem we’re discussing not possible you use a human-backed matrix of known compatible versions?
Like you do you but I’ve never seen “just apply discipline” or “just be careful” ever work. You either make something impossible, with tooling or otherwise, or it will happen.
No, it is a tool backed matrix. Illegal combinations are not possible, and we have logs of exactly what was installed so we can check that revision out anytime
To solve this properly you need to store the deployed/executed commit id of every service. That could be in the logs, in a label/annotation of a Kubernetes object, or somewhere else. But this has nothing to do with whether you use a monorepo or multiple smaller repositories. In some of my projects, we use the commit of the source repo as the Docker tag, and we make sure that the Docker image build is as reproducible as possible. I.e. we don't always build with the latest commit of an internal library, but with the one that is pinned in the dependency manifest of our build tool. Since updating all those internal dependencies is a hassle, that part is automated: there is an auto-generated merge request to update a dependency for every downstream project, so all the downstream pipelines can run their test suites before an update gets merged. Once in a while that fails, and then a human has to adapt the downstream project to its latest dependencies. In a monorepo that work has to be done as well, but for all downstream projects at once.
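A minimal sketch of that pattern (the registry and service names are placeholders):

    sha=$(git rev-parse HEAD)
    docker build -t registry.example.com/user-service:"$sha" .
    docker push registry.example.com/user-service:"$sha"
    # Surface the same value at runtime so every log line can be traced
    # back to an exact source revision, e.g.:
    #   docker run -e GIT_COMMIT="$sha" registry.example.com/user-service:"$sha"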
Submodules are hell. I work somewhere with a polyrepo topology, with the inevitable "shared" bits ending up integrated into other repos as submodules. Nothing has been more destructive to productivity and caused so many problems.
It can now, but that's not the default. The defaults for submodule suck, because they match the behavior of old versions of git for backwards compatibility.
Yeah. Leaving the UX-issues aside. Don't ever use submodules to manage dependencies inside of each polyrepo, it will eventually accumulate duplicate, conflicting and out of date sub-dependencies. Package managers exist for a reason. The only correct way to use submodules is a root-level integration-repository, being the only repo that is allowed to have submodules.
The only problem I have with a monorepo, is that sometimes I need to share code between completely different teams. For example, I could have a repo that contains a bunch of protobuf definitions so that every team can consume them in their projects. It would be absurd to shove all of those unrelated projects into one heaping monorepo.
Well that's what a monorepo is! I work on one, it's very large, other teams can consume partial artifacts from it (because we have a release system that releases parts of the repo to different locations) but if they want to change anything, then yeah they have to PR against the giant monorepo. And that's good!
Teams out of the repo have to be careful about which version they pull and when they update etc. However, if you are a team IN the monorepo, you know that (provided you have enough tests) breaking changes to your upstream dependencies will make your tests fail which will block the PR of the upstream person making the changes. This forces upstream teams to engage (either by commits or by discussions) with their clients before completing changes, and it means that downstream teams are not constantly racing to apply upgrades or avoiding upgrades altogether.
I work on shared library code and the monorepo is really crucial to keeping us honest. I may think for example that some API is bad and I want to change it. With the monorepo I can immediately see the impact of a change and then decide whether it's actually needed, based on how many places downstream would break.
Ok. I've had some time to think about this, and I am warming up to the idea. It would sure simplify a lot of challenging coordination problems. My only real concern is that the repo may grow so large it becomes very slow. Doubly so if someone commits some large binaries.
I am certainly not a heavy user, but for work I've made myself a "workflow" repository which pulls together all the repositories related to one task. This works super well. There sure is a bit of weirdness in managing them, but I found it manageable. But I'll admit that I don't really use the submodules for much more than initial cloning, maybe I'd experience more problems if I did.
Yes, but it's because submodules are a badly architected, badly implemented train wreck.
There are many good and easy solutions to this problem, all of which were not implemented by git.
git is a clean and elegant system overall, with submodules as by far the biggest wart in that architecture. They should be doused with gasoline and burned to the ground.
I like using submodules for internal dependencies I might modify as part of an issue. I like conan or cargo for things I never will. I don't particularly like conan. Perhaps bazel, hunter, meson or vcpkg are all better.
But this is a discussion of dependencies between services. You need more tooling for managing inter-service dependencies as opposed to package dependencies within one monolith.
> I've worked in large polyrepo environments. By the time you get big enough that you have internal libraries that depend on other internal libraries, debugging becomes too much like solving a murder mystery. In particular, on more than one occasion I had a stacktrace that was impossible with the code that should have been running. A properly-configured monorepo basically makes that problem disappear.
On the contrary, a monorepo makes it impossible because you can't ever check out what was actually deployed. If what was running at the time was two different versions of the same internal library in service A and service B, that sucks but if you have separate checkouts for service A and service B then it sucks less than if you're trying to look at two different versions of parts of the same monorepo at the same time.
There is no source of truth for "what was deployed at time T" except the orchestration system responsible for the deployment environment. There is no relationship between source code revision and deployed artifacts.
Hopefully you have a tag in your VCS for each released/deployed version. (The fact that tags are repository-global is another argument for aligning your repository boundaries with the scope of what you deploy).
Why not? I’m doing it right now. The infrastructure is versioned just like the app and I can say with certainty that we are on app version X and infra version Y.
I even have a nice little db/graph of what versions were in service at what times so I can give you timestamp -> all app and infra versions for the last several years.
Unless your infrastructure is a single deployable artifact, its "version" is a function of all of the versions of all of the running services. You can define a version that establishes specific versions of each service, but that's an intent, not a fact -- it doesn't mean that's what's actually running.
Am I missing some nuance here? Yes the infra version is an amalgamation of the fixed versions of all the underlying services. Once the deploy goes green I know exactly what’s running down to the exact commit hashes everywhere. And during the deploy I know that depending on the service it’s either version n-1 or n.
The kinds of failures you’re describing are throw away all assumptions and assume that everything from terraform to the compiler could be broken which is too paranoid to be practically useful and actionable.
If deploy fails I assume that new state is undefined and throw it away, having never switched over to it. If deploy passes then I now have the next known good state.
Oh, this implies you're deploying your entire infrastructure, from provisioned resources up to application services, with a single Terraform command, and managed by a single state file. That's fine and works up to a certain scale. It's not the context I thought we were working in. Normally multi-service architectures are used in order to allow services to be deployed independently and without this form of central locking.
If what was deployed was foo version x and bar version y, it's a lot easier to debug by checking out tag x in the foo repo and tag y in the bar repo than achieving the same thing in a monorepo.
I'm not sure I understand how that scenario would arise with a monorepo. The whole point of a monorepo is that everything changes together, so if you have a shared internal library, every service should be using the same version of that library at all times.
And every service deploys instantly whenever anything changes?
(I actually use that as my rule of thumb for where repository splits should happen: things that are deployed together should go in the same repo, things that deploy on different cycles should go in different repos)
Not necessarily instantly, but our CD is fast enough that changes are in production 5-10 minutes after hitting master.
But what's more valuable is that our artifacts are tagged with the commit hash that produced them, which is then emitted with every log event, so you can go straight from a log event to a checked-out copy of every relevant bit of code for that service.
Admittedly this doesn't totally guarantee you won't ever have to worry about multiple monorepo revisions when you're debugging an interaction between services, but I haven't found this to come up very much in practice.
Edit: I should also clarify, a change to any internal library in our monorepo will cause all services that consume that library to be redeployed.
> What do you do with libraries shared between different deployment targets?
The short answer is "make an awkward compromise". If it's a library that mostly belongs to A but is used by B then it can live in A (but this means you might sometimes have to release A with changes just for the sake of B); if it's a genuinely shared library that might be changed for the sake of A or B then I generally put it in a third repo of its own, meaning you have a two-step release process. The way to mitigate the pain of that is to make sure the library can be tested on its own without needing A or B; all I can suggest about the case where you have a library that's shared between two independent components A and B but tightly coupled to them both such that it can't really be tested on its own is to try to avoid it.
That's a great test and I think an argument for monorepo for most companies. Unless you work on products that are hermetically sealed from each other, there's very likely going to be tight dependencies between them. Your various frontends and backends are going to want to share data models for the stuff they're exchanging between them for example. You don't really want multiple versions of this to exist across your deployments, at least not long term
I think it's maybe an argument for a single repo per (two-pizza) team. Beyond that, you really don't want your components to be that tightly coupled together (e.g. you need each team to be able to control their own release cycles independently of each other). Conway's law works both ways.
If they have independent release cycles, they shouldn't be tightly coupled (sharing models etc. beyond a specific, narrowly-scoped, and carefully versioned API layer), and in that case there is little benefit and nontrivial cost to having them be in a monorepo.
Not GP, but I use versioned packages (npm, nuget, etc) for that. They're published just like they're an open source project, ideally using semantic versioning or matching the version of a parent project (in cases where eg we produce a client library from the same repo as the main service).
I've had the exact opposite experience. We have a polyrepo setup with four repos in the main stack (and a comical number of repos across the entire product, but that's a different story). My top pain point - almost painful enough to force a full consolidation - is trying to find the source of a regression.
We semi-regularly discover regressions in production and want to know when they were introduced. On any other project I've worked on, that can be done with a simple git bisect. I can tell you that trying to bisect across four repos is not fun. If everything were in a monorepo, I would be able to run the full stack at any point in time.
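In a monorepo that's a stock bisect; the check script and the known-good tag here are assumptions:

    git bisect start
    git bisect bad HEAD
    git bisect good v2024.06.0             # last release known to be good
    git bisect run ./check-regression.sh   # exits non-zero when the bug reproduces
    git bisect reset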
Now, if all your APIs are stable, this won't be as bad. But if you're actively developing your project and your APIs are private, I can only assume this pain will be ever present.
I think my counterpoint is: generally, if I'm playing the part of the owner of some system N layers deep in the rat's nest of corporate systems, I don't even want to think specifically about what broke. I know the dependencies of my system; if I have a dependency on the Users Service, and it looks like something related to the Users Service broke, my first action is probably to go into their Slack channel and say "hey, we're seeing some weird behavior from the Users system; did y'all change something?"
At the end of the day; they're going to know best. Maybe code changed. Maybe someone kubectl edit'ed something manually. Not everything is represented in code.
The problem is that in microservice environments, a lot of complexity and source of bugs are (hidden) in the complex interactions between different components.
I also believe that this mentality of siloing/compartmentalization and habit of throwing things over the fence leads to ineffective organization.
After close to a decade of working in various microservice based organizations, I came to a big-ish monolith project (~100 devs). Analyzing bugs is now fun, being able to just step through the code for the whole business transaction serially in a debugger is an underrated boost. I still need to consult the code owners of a given module sometimes, but the amount of information I'm able to extract on my own is much higher than in microservice deployments. As a result, I'm much more autonomous too.
> Maybe code changed. Maybe someone kubectl edit'ed something manually. Not everything is represented in code.
That's honestly one of the big problems in microservices as well.
> After close to a decade of working in various microservice based organizations, I came to a big-ish monolith project (~100 devs). Analyzing bugs is now fun, being able to just step through the code for the whole business transaction serially in a debugger is an underrated boost. I still need to consult the code owners of a given module sometimes, but the amount of information I'm able to extract on my own is much higher than in microservice deployments. As a result, I'm much more autonomous too.
Could you expand on how you manage ownership of this monolith? Do you run all the modules in the same fleet of machines or dedicated? Single global DB or dedicated DB per module (where it makes sense, obviously)?
Because where I work we have a big monolith with a similar team size and it's a royal PITA, especially when something explodes or is about to explode (but we have a single shared DB approach, due to an older Rails limitation, and we have older Rails because it is difficult to even staff a dedicated team that takes care of tending the lower-level or common stuff in the monolith).
> Could you expand on how you manage ownership of this monolith?
We have a few devops teams (code + deployment) and platform teams (platform/framework code), the remaining teams (which form the majority of devs) own various feature slices. The ownership is relatively fluid, and it's common that teams will help out in areas outside of their expertise.
> Do you run all the modules in the same fleet of machines or dedicated?
Not sure if I understand. All modules run in the same JVM process running on ~50 instances. There are some specialized instances for e.g. batch processing, but they are running the same monolith, just configured differently.
> Single global DB or dedicated DB per module (where it makes sense, obviously)?
There is one main schema + several smaller ones for specific modules. Most modules use the main schema, though. Note that "module" here is a very vague term. It's a Java application which doesn't really have support for full modules (neither packages nor Java 9 modules count). "module" is more like a group of functionality.
> and we have older Rails because it is difficult to even staff a dedicated team that takes care of tending the lower-level or common stuff in the monolith).
This is usually a management problem that they don't pay attention to technical debt and just let it grow out of control to the point where it's very difficult to tackle it.
The critical part of the success of this project is that engineering has (and historically had) a strong say in the direction of the project.
But aren't microservices specifically designed to be able to split responsibility for a large system between multiple teams? If everybody debugs and fixes bugs across the whole landscape, then everybody has to be familiar with everything, which means you are losing the benefits. Occasionally, it might be helpful to debug the whole stack at once. But I wouldn't trust a landscape where that is needed too often. It might be that the chosen abstractions don't fit well.
> But aren't microservices specifically designed to be able to split responsibility for a large system between multiple teams?
That's the idea, but business transactions usually span multiple services and bugs often aren't scoped to a specific service.
> If everybody debugs and fixes bugs across the whole landscape, then everybody has to be familiar with everything
A lot of things can be picked up along the way while you're debugging, and I'm usually able to identify the problem and sometimes even fix it.
> It might be that the chosen abstractions don't fit well.
Very often the case. Once created, services remain somewhat static, and their purpose and responsibility often get muddy, mostly because "refactoring" a microservice architecture is just very expensive and work-intensive. Moving code between modules within a monolith is rather easy (with IDE support); moving code between services is usually not trivial at all.
But that's just that one scenario you've described.
It's also common that if you have a dozen repos that maybe only one has changed and so when there is a defect it's trivial to determine what caused the regression.
I don't think mono or poly repos are better when it comes to triaging faults. They each have strengths and weaknesses.
>> I'm pretty staunchly in the "one service, for one deployment artifact, for one CI pipeline, for one repo" camp
This seems reasonable if nothing is shared. If there are any shared libraries then you are back to binary sharing (package managers, etc.) with this approach.
This looks trivial now, but when you multiply the number of directories by 8 or so it becomes a very nasty mess very quickly.
I think that the idea of only running what changed makes a lot of sense, I just think that managing that in declarative yml falls apart VERY quickly once you hit an inkling of scale.
I just want to comment that you are correct. Bazel allows that and so should any tool that can build dependencies DAGs. Once you have that it's absolutely feasible.
The major issue is that you need to be diligent at bookkeeping your dependencies. Bazel enforces that in the BUILD files and since everything is run in a sandbox you can't easily take shortcuts or you'll get missing dependencies errors.
>Many VSCode language extensions, for example, won't operate at their best if the "language signaler" file (e.g. package.json for JS) isn't in the root of the repo; from just refusing to function
With VSCode, a workaround is to use workspaces. Define a workspace and add each subproject folder as its own entity. VSCode will treat each folder as a project root where the language specific tooling will work as expected.
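The workspace file itself is tiny; something along these lines (folder names are made up):

    # Example project.code-workspace, then open it with the VSCode CLI:
    #   { "folders": [ { "path": "services/frontend" },
    #                  { "path": "services/backend" },
    #                  { "path": "libs/shared" } ] }
    code project.code-workspace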
Your examples of CI and VSCode are on point, and in the bigger picture it's always about tooling.
The mono/multi repo argument is fundamentally boring to me because it always boils down to whether the shape of the tooling problem is easier to work with on this or that side of the divide.
The answer is always whichever tradeoffs work best for your situation, and the reason at the end of your post is as good a reason as any.
> If you know "this thing is broken", the commit log is a fantastic resource to figure out what changed which may have broke it. You submit a PR, and CI has to run; getting most CI platforms cleanly configured to, say, "only run the user-service tests if only the user-service changes" isn't straightforward. I understand there are some tools (Bazel?) which can do it, but the vast, vast majority of systems won't near-out-of-the-box, and will require messy pipelines and shell scripts to get rolling.
I'm not very familiar with the recently trending monorepo tools, but don't they generally provide a way to declare the dependencies between subrepos and prevent each subrepo from importing or otherwise depending on anything outside of those declared dependencies? If that's the case, then wouldn't CI be able to use that same dependency graph to determine when it needs to rebuild/redeploy each particular subrepo?
Well, there aren't "subrepos" … it's all one giant monorepo.
And … no? No, CI tools don't. There's generally not a tool that has the dependency graph, and it's typically not recorded. (Excepting bazel, which set out to solve this problem; lo and behold it was designed by a company with a monorepo, too.)
Some CI systems I've seen have half-assed attempts at it, such as "only run this CI step if the files given by this glob changed". But a.) it requires listing a transitive list of all globs that would apply to the current step, so it's not a good way to manage things and b.) every time I've seen this mis-feature, "change" is described as "in this commit"; that's incorrect. (I have base commit B, I push changes '1 and '2, for commit graph B - '1 - '2 ; CI detects for a step that the globbed files didn't change in '2, and ignores '1. The branch is green. I merge. The result merge commit 'M changes the union of files, so now the tests run, and the commit — now on HEAD — breaks the build. A subsequent unrelated commit M - '3 doesn't modify the relevant code; CI skips the tests and delivers a green result on a broken codebase. People erroneously think "problem fixed". I have seen this all play out in person, multiple times.)¹
(A "much easier" approach is to simply cache a single build step: you hash your inputs, and compute a cache keys; see if your output is cached. If yes use cache, if no build. Computing the cache key is the tricky part and risks that famous "top n problems in computer science … cache invalidation" quote.)
¹while I know how to compute better git diffs, the differences between diffing against the common ancestor, diffing the result once the commit gets merged, etc. are subtle. Most devs are shockingly inexperienced with git and don't even get this far into the problem, and CI systems' insistence on only running on, e.g., "pushes" doesn't help.
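For what it's worth, the diffs that avoid the failure mode above aren't complicated, they're just not what CI systems hand you by default (branch names assumed):

    # On a branch: everything the branch changes relative to where it forked
    git diff --name-only origin/main...HEAD
    # On main, after a merge: everything the merge brought in (vs. first parent)
    git diff --name-only HEAD^1 HEAD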
Sure. You can't go halfway on monorepos, where you check in all the code into one spot but don't build any tooling to manage it. You need to use something like Bazel/Blaze, Buck, or other tools that are meant to own responsibility for managing dependencies between projects.
> Well, there aren't "subrepos" … it's all one giant monorepo.
I think op meant organizationally you still have logical components.
> And … no? No, CI tools don't. There's generally not a tool that has the dependency graph, and it's typically not recorded. (Excepting bazel, which set out to solve this problem; lo and behold it was designed by a company with a monorepo, too.)
Everywhere I've worked that has had a monorepo (Google, Facebook), that's definitely the case. The CI automation would query Buck/Bazel to figure out the set of dependencies impacted by a PR. Of course, some PRs would have outsized impacts (e.g. changing libc) but at the same time, there's probably not much better than that.
Apple was a bit different. While nominally each project had its own git repository, you uploaded code to a central monorepo that organized everything by release. And it built every project incrementally and relinked/rebuilt whichever projects were impacted. They didn't at the time have a centralized CI system. But Apple's system evolved from many decades ago and was a sane choice for building an OS back then.
Google's approach is, I think, generally accepted as a more effective strategy in some ways if you're going down that route. That being said, at Google scale you're shipping so much code that there are still challenges. For example, there's so much code being changed at Google's scale that they have to bundle things together into a single CI pass, because there's just insufficient compute capacity available to do everything and avoid serialization of unrelated components. Of course, probabilistically there's a non-zero chance that something is broken, and they intelligently bisect and figure out what change needs to be omitted from the ship. Very complicated.
But I think most people underappreciate that they'll never encounter these kinds of problems. If you just go all in on monorepo + Bazel + Bazel-aware CI setup + build artifact caching, you're done and don't have to think about builds or code management very much after that. That's a really big superpower.
> I think op meant organizationally you still have logical components.
Yes, my impression was that all of these monorepo tools have a first-class notion of the subrepos/projects/workspaces/whatever that make up the monorepo. If you don’t have that, then I guess I don’t really know what you mean when you say you have a monorepo.
I don't understand. You can have 1 repo with multiple services that can be deployed independently.
Edit: perhaps the difference is that you said "monolith." I guess I'm not sure precisely what you mean by this, but context makes it seem like you're using it synonymously with monorepo. Since that's what this thread is about.
There's a very simple solution to that, if your systems & processes are reasonably lightweight.
Just build and deploy everything on every merge. Compute is fairly cheap, and if running in parallel, it doesn't have to take long.
You can also take it a step further and have "mono" binaries/container images, where you specify the service to execute as the first argument.
I've been doing this for about 5 years now, having a single output artifact for each language being used. It works great.
If you're careful about your optimisations, you can go from hitting the merge button to having 100+ services deployed on production in about 60 seconds
Arguably it's a bit of an extremist approach, but if you have a situation where technically you're deploying thousands of times a day, you get pretty good at making the process reliable and robust
> If you know "this thing is broken", the commit log is a fantastic resource to figure out what changed which may have broke it
git supports both diff and log for specific directories, although this may not help you if the issue was with a dependency in another folder that was updated.
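e.g. (directory names and the tag are just illustrative):

    git log --oneline -- services/user-service/
    git diff v1.4.0..HEAD -- services/user-service/ libs/shared/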
But the point is that the tooling doesn't help you with it. Those are building blocks that you might build a "does this need to get build/deployed?" (& if not, what is the result of the build) mechanism with, but they are not that mechanism.
I agree that the blog post didn't make a very strong argument, but there are some inaccuracies in your comment as well.
With regard to the "only run X tests if X changed" problem, Bazel, Buck, and all the other monorepo build tools do this. I mean sure, if you're using some build system not meant for a monorepo you're going to have a terrible time, but who is really going to spend weeks or months converting to a monorepo and not also switch to Bazel (or something akin to it?) at the same time. In fact, I would say switching to Bazel (or Buck, etc.) for builds is a prerequisite to even starting on the path to switch to a monorepo.
This is just a really useful feature even if you're not in a monorepo. Sometimes you're changing some core header file or whatever and you really do need to run nearly all the tests in your test suite. Sometimes you're just changing some fairly self-contained file and only a few tests need to run. Sometimes you change some docs in the repo and you don't need to run any tests at all. Bazel will just do this automatically for local builds (it knows what tests have transitive dependencies that have changed since the last time those tests were run), and setting it up in CI is a few lines of bash or Python. To set this up in CI you basically just check which files have changed since the last time CI ran (e.g. using git diff), then you use bazel query to find all test targets that have transitive dependencies on those files, then you feed that list of test targets to bazel test. You can set this up per-developer branch for instance, so that if you have a bunch of developers all running tests on the same set of CI machines you get good caching.
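The CI side of that really is only a few lines. A rough sketch (the labels are example inputs; the step that maps changed files to source-file labels is glossed over here):

    changed_labels="//libs/auth:token.cc //services/user:handlers.py"  # example input
    bazel query "tests(rdeps(//..., set($changed_labels)))" \
      | xargs --no-run-if-empty bazel test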
With regard to colocating schemas in APIs, yes you can do that but it's really annoying to do with Protobuf/Thrift. First of all protobuf and thrift require that IDLs exist locally so they can do code generation, so if you have protobuf files split into multiple services you need a way to distribute them all which is super annoying. Additionally, in some cases there isn't a clear single owner of a particular IDL struct, for example let's say you have some date or time struct that many protobuf messages want to use in their fields. Which service do you define it in? Ignoring that, it is REALLY USEFUL to be able to modify the code for the producer of a message and the consumer of a message all at once, without having to make multiple commits in multiple repositories. This is especially true when it comes to testing. I have thing A producing a new field X, I want B to use the new field X, and I want to test that B uses it correctly. When everything is in one repo this just works, with multiple repos I need to first add the code to thing A, do a release of A (even if that's just making and pushing a git commit), then update B to consume the new thing and add the test, then if I realize I messed something up I need to go update repo A again to test it, and so on. Obviously this works and tons of people do it, but it sucks. I had to do this at my last job (which wasn't using a monorepo), and it worked but it was cumbersome and I hated it.
With regard to scaling, a lot has changed in git in the last two years to make it possible to run huge git monorepos without any weird hacks. The most notable such feature is sparse indexes, which let you clone a subset of a git repo locally and have it work normally. Here's a GitHub blog post about sparse indexes: https://github.blog/2021-11-10-make-your-monorepo-feel-small... . They also have a monorepo tag which you can use to look at other blog posts about monorepos (and as you'll note, most of these are pretty recent): https://github.blog/tag/monorepo/
The biggest downside of a monorepo in my opinion is that there are a lot of things that Bazel makes way harder than the default package manager for language X. Practically speaking it's probably going to be hard to use Bazel if you don't have a dedicated build team with experts who can spend all the time to figure out how to make Bazel work well in your organization. This is pretty different from just using pip or npm or yarn or whatever, where you can get things just working in a couple of minutes and the work to maintain the build system is probably just collectively a few hours of work a week from people spread throughout the organization who can be on any team. For a small organization I can't see it being worth the effort unless you already have a lot of engineers who have a background in Bazel, for example. But there's definitely a point where the high entry-level cost to Bazel and a monorepo makes sense.
> you need a way to distribute them all which is super annoying
Only if your tools are bad. In the stack I'm used to they're just another artifact that gets published as part of a module build.
> Additionally, in some cases there isn't a clear single owner of a particular IDL struct, for example let's say you have some date or time struct that many protobuf messages want to use in their fields. Which service do you define it in?
A common library, just like any other kind of code.
> Ignoring that, it is REALLY USEFUL to be able to modify the code for the producer of a message and the consumer of a message all at once, without having to make multiple commits in multiple repositories. This is especially true when it comes to testing. I have thing A producing a new field X, I want B to use the new field X, and I want to test that B uses it correctly. When everything is in one repo this just works, with multiple repos I need to first add the code to thing A, do a release of A (even if that's just making and pushing a git commit), then update B to consume the new thing and add the test, then if I realize I messed something up I need to go update repo A again to test it, and so on.
That's actually really important because it forces you to test all the intermediate states. If you can just change everything at once then you will, and you probably won't bother testing everything that actually gets deployed, and so you get into a situation where if you could deploy the new version of A and B at exactly the same time then it would work, but when there's any overlap in the rollout everything breaks horribly.
It really sucks to manage shared libraries, across many clients in a few languages. Any update requires updating the version everywhere, and you have to independently debug which version of which library was being used by which application at the time a bug occurred. It works, it isn't impossible, obviously lots of people manage it successfully (polyrepos are more common than monorepos in my experience), but it's a giant pain and it sucks.
> Any update requires updating the version everywhere
Much of the benefit of using an IDL like this is to be (mostly) forward compatible, so you don't have to upgrade everywhere immediately.
> you have to independently debug which version of which library was being used by which application at the time a bug occurred
You have to do that anyway; it's easier if your repository history reflects the reality of which applications were upgraded and deployed at which times. There's nothing worse than having to debug a deployed system that was deployed from parts of a single repo but at several different points in that repo's history.
Splitting your business’s mission across hard repository boundaries implies that… you know what you are doing! If that’s you then congratulations. Also: you’re kidding yourself.
For the rest of us, being virtually unable to rethink the structure of our components because they are ossified as named repositories is a technical and social disaster. Whole teams that should have faded and been reabsorbed elsewhere will live forever because the effort to dismantle and re-absorb their code into other components is astronomical.
The value we bring as engineers is in making sequences of small changes that keep us moving towards our business goal. Boundaries that get in the way are anathema to good engineering. It's exactly as if you were unable to move code between top-level directories of your project. Ridiculous.
> Rolling out API changes concomitantly with downstream changes to the documentation or the OpenAPI spec.
> Introducing feature-level changes and the blog post announcing those changes.
These are horrible reasons to use a monorepo. Commits are not units of deployment. Even if you're pushing every system to prod on every commit, you'd still basically always want to make the changes incrementally, system by system, and with a sensibly sequenced rollout plan.
To take one of the examples above, why would you ever have the code implementing a feature and the announcement blog post in the same commit? The feature might not work correctly. You'd want to be able to test it in a staging environment first, right? Or if you don't have staging, be able to run it in prod behind a feature flag gated to only test users, or as a dark launch, or something to verify that the feature is working before letting real users at it and having it crash your systems, cause data corruption, or hit some other critical problem that would necessitate a rollback. But none of this pre-testing is possible if the code changes are really being made in the same commit as the public announcement.
And talking of rolling back... When you revert the code changes that are misbehaving, what are you doing with the blog post? Unpublish it? Or do some kind of a dirty partial rollback that just reverts the code and leaves the blog post in place?
The same goes for any kind of cross-project change[0], some of which appear more compelling on the surface than the "code and blog post in one" use case (e.g. refactoring an API by changing the interface and callers at the same time). Monorepos allow for making such changes atomically, but you'd quickly find out that it's a bad idea. There are great reasons to use monorepos, but this is not it.
> you'd still basically always want to make the changes incrementally, system by system, and with a sensibly sequenced rollout plan.
Depends. It's significantly faster to deploy everything at the same time and accept that unlucky requests might end up in a weird state than to safely sequence changes.
In SRE phrasing, I'm choosing to spend our error budget to maximize change velocity by giving up on compatibility during deploys by skipping a multi-stage rollout plan. In return, I can condense a rollout to a single commit and deploy. A 99.9% availability target yields up to 86 seconds per day to pretend that deploys are "atomic".
Did you ever have to roll back some unlucky changes? Specifically rolling back, not fixing them frantically with several layers of fixes on top of the buggy deploy?
Yes, absolutely. Nonetheless, I think the author may be right, except that by "Google" they mean "large." This is a fundamental misunderstanding of just how large Google and its peers are. I think it's more interesting to consider three sizes.
If you're small, everything will fit nicely in a monorepo.
If you're large, you'll want lots of repos. There aren't really any off the shelf monorepo options that scale super well, so using a bunch of small repos is a great way to deal with the problem. Plus, you probably don't have a full time staff babysitting the source repos, so you want some isolation. If someone in another org is breaking stuff left and right, you don't want the other orgs to be affected.
If you're GIGANTIC, monorepos are a pretty great option again. You'll probably have to build your own and then have a full time group of people maintain it, but that's not a huge problem for you because you're a gigantic tech company. You can set up an elaborate build system that takes advantage of the fact that the entire system is versioned together, which can let you almost completely eliminate version dependency hell. You can customize all of your tools to understand the rules for your new system. It's a huge undertaking, but it pays off because you've got a hundred thousand software engineers.
> There aren't really any off the shelf monorepo options that scale super well
How can you say this when Perforce on a single machine took Google to absolutely terrifying scale? There is no chance that your mid-sized software company will even slightly tax the abilities of Perforce.
What I believe you meant was there aren't really any good options to make git tolerable for non-trivial projects, and with that I wholeheartedly agree. And that's why these threads are so tiresome: they always boil down to people talking about what git can and cannot do.
Google wrote a whole paper on the fact that, with the help of a beast of a single computer, they were able to get Perforce to work for 10,000 employees averaging about 3 commits per second (20 million commits over 11 years) and a much higher volume of other queries. That white paper pointed out that Google had taken performance to the "edge of Perforce's envelope" and they were only able to do that by treating Perforce's performance limitations as a major concern and striping a fleet of hard drives on that machine with RAID 10.
That's not an endorsement for a company as big as Google was then looking for an easy, off the shelf solution. It'd probably be just fine for a company of hundreds, but so would git.
On the other hand, if you play to its strengths, it's probably a great choice. Maybe a team of dozens of content developers checking in large assets for videogames. Perfectly great use case for Perforce.
Yes, but the Linux kernel is still "non-trivial". You said git is not tolerable for non-trivial projects. I think you just meant that it isn't tolerable for "incredibly large" repos, which I do think is right.
It's just a boring semantic point that I'm making, that "non-trivial" was a hyperbolic word choice.
I'd argue that using git's sparse-checkout functionality and enforcing clean commits (such as via the patch-stack workflow and maintaining a hard line approach against diff-noise) does a lot of heavy lifting for handling git monorepos.
Sparse checkouts, shallow fetches/clones, partial clones, etc. allow you to work with an egregiously large repository without ever needing to actually mess with the whole thing. Most existing build tooling can be made to work with these features pretty easily, although some tools are easier than others.
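For reference, the combination that does most of that heavy lifting looks roughly like this (repo URL and paths are placeholders):

    git clone --filter=blob:none --sparse https://example.com/big-monorepo.git
    cd big-monorepo
    git sparse-checkout set services/user-service libs/shared
    git sparse-checkout add tools/ci   # widen the checkout later as needed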
Enforcing clean commits avoids the issues with keeping track of individual project histories and past that the existing git tooling largely already supports filtering commits to only expose commits relevant to specific directories/pathspecs.
---
The only time I really see an organisation outgrowing a monorepo is if the org is incapable of or unwilling to maintain strict development and integration policies.
Also worth noting because I don't see it mentioned enough: not everything has to be in the same monorepo. Putting all closely related products and libraries in the same monorepo is kosher, but there's little reason for unrelated parts of an org's software to all be in the same monorepo. So what might be 50-200 independent projects/repos could be 3-20 monorepos with occasional dependencies on specific projects in the other monorepos.
It really paints a picture of the author's credentials to be making this declaration.
All signs for the rest of the world point to the opposite conclusion: unless you're Google-scale, you don't have Google level resources. Google has more engineers working on developer experience than most companies will ever have period.
And monorepos work best when workflows are carefully thought out with clever application specific tooling.
-
I think the author is probably working on a 1-10 developer project (and I'm leaning towards 1) and has confused the convenience of having things in reach when the entire system fits in your mind with the general benefits of a monorepo.
I also wonder if they read any of the letters they linked to...
> All signs for the rest of the world point to the opposite conclusion: unless you're Google-scale, you don't have Google level resources. Google has more engineers working on developer experience than most companies will ever have period.
I feel like this was true ~5 years ago, but these days the tooling around scaling monorepos is safely supporting O(100) developers without a lot of overhead.
> And monorepos work best when workflows are carefully thought out with clever application specific tooling.
I don't see any meaningful distinction with how well thought out workflows need to be between mono and polyrepo.
> I don't see any meaningful distinction with how well thought out workflows need to be between mono and polyrepo.
You completely failed to parse the sentence. The point being made isn't "well-thought-out workflows are only for monorepos"; the qualifier applies to the "needing clever application-specific tooling" part.
Vendoring/Versioning for discrete packages is a heavily invested in problem space for most tech stacks you'll come across. But if you build a monorepo and don't end up with a build system that takes on those responsibilities, you end up with something that doesn't scale to even moderately large interconnected components.
OP has such a tiny project that I'm not convinced they're even dealing with dependencies in a traditional sense. But well before you get to Google scale, you'll run into situations where you just want to change one thing and don't want to change every single downstream dependency which normally would have been isolated via a discrete package that doesn't have to change in lockstep. And then that exact same pain starts to exist for deployments and needs to be worked around.
> I feel like this was true ~5 years ago, but these days the tooling around scaling monorepos is safely supporting O(100) developers without a lot of overhead.
Nothing about the above has changed in the last 5 years; it's kind of the ground truth of monorepos vs. multiple repos. You're the first person I've ever seen imply monorepos don't offload complexity to tooling, even amongst proponents.
> But well before you get to Google scale, ... doesn't have to change in lockstep.
Not updating dependencies is the equivalent of never brushing your teeth. Yes, you can ship code faster in the short term, but version skew will be a huge pain in the future. A little maintenance every day is preferable to ten root canals in a few years.
As you scale a small company it's exceedingly rare to not need 10 root canals along the way. Meanwhile it's exceedingly common to need to pivot quickly even if it comes at the cost of near-term engineering rigor.
I feel obliged to point out that I work at a company that uses a monorepo, so this isn't a "never use monorepos" counter-post. Instead my points are borderline tautological:
There's a balancing of near-term sacrifice vs. long-term sustainability. But you need good reasons to pick the side of the scale that historically has had fewer resources invested into it, and that puts the onus on your engineering team to adjust to the knock-on effects of that disparity while still building a fledgling company.
> Not updating dependencies is the equivalent of never brushing your teeth
That's a strawman: the choice is not between updating and not updating. The choice is between updating on my terms or not.
I recently updated stripe from 2.x.x to 5.x.x in one of the projects. That's several years without updates. Wouldn't it be fun if somebody was forced to update multiple projects every single time stripe ships a new minor version? And if we were to do the true monorepo, at what pace do you think stripe would be updated, if it was their responsibility to update all dependents?
You're conflating management of external dependencies with internal dependencies. Ideally, Stripe is actually acting as a Library Vendor here, and so these are long-lived major versions with well-defined upgrade paths and surface area. Within your company you don't want every team to have to operate as a Library Vendor, and you also want to take advantage of the command economy you operate in to drive changes across the company rapidly.
Also, Amazon went through this whole thing. They have tons of tooling built up around managing different versions of external and internal dependencies and rolling them out in a distributed fashion. They are doing polyrepo at a scale that is unmatched by anyone else. And you know what they've settled on? Teams getting out of sync with the latest versions of dependencies is a Really Bad Thing, and you get barked at by a ton of systems if your software is stale on the order of days/weeks.
> Within your company you don't want every team to have to operate as a Library Vendor
But you want some teams to operate this way. And the best way to do it is by drawing boundaries at the repo level.
This is similar to monolith-services debate. Once monolith gets big enough there's benefit in breaking it down a bit. Technically nothing prevents you from keeping it modular. Except that humans just really suck at it.
> take advantage of the command economy you operate in to drive changes across the company rapidly
Driving changes across the company is a self-serving middle-manager goal. There's a reason why central planning fails at scale every single time it is attempted.
> Teams getting out of sync with the latest versions of dependencies is a Really Bad Thing
It definitely can be a bad thing. But you know what's even worse? Not having the option to get out of sync. If getting out of sync is a problem, polyrepo offers simple tooling to address it.
The assumption that you are making is that polyrepos will spend the vast amount of engineering effort to maintain a stable interface. Paraphrasing Linus: “we never break userspace.”
In practice internal teams don’t have this type of bandwidth. They need to make changes to their implementations to fix bugs, add optimizations, add critical features, and can’t afford backporting patches to the 4 versions floating around the codebase.
Repos work for open source precisely because open source libraries generally don’t have a strong coupling between implementers and users. That’s the exact opposite for internal libraries.
> In practice internal teams don’t have this type of bandwidth
You don't need bandwidth to maintain backward compatibility in polyrepo. As you said yourself, you need loose coupling.
When you are breaking backward compatibility, the amount of bandwidth required to address it is the same in mono- and polyrepos (with some exceptions benefitting polyrepos).
The big difference though is whose bandwidth we are going to spend. Correct me if I'm wrong, but my understanding is that at Google it's the responsibility of the dependency to update its dependents. E.g. if the compiler team is breaking the compiler, they are also responsible for fixing all of the code that it compiles.
So you're not developing your package at your own pace, you are limited by company pace. The more popular a compiler is, the slower it is going to be developed. You're slowing down innovation for the sake of predictability. To some degree you can just throw money at the problem, which is why big companies are the only ones who can afford it.
> can’t afford backporting patches to the 4 versions floating around the codebase
Backporting happens in open-source because you don't control all your user's dependencies. Someone can be locked into a specific version of your package through another dependency, and you have no way of forcing them to upgrade. But if we're talking about internal teams, upgrading is always an option, you don't have to backport (but you still have the option, and in some cases it might make business sense).
> open source libraries generally don’t have a strong coupling between implementers and users. That’s the exact opposite for internal libraries.
I disagree. There's always plenty of opportunities for good boundaries in internal libraries.
Though I'll grant you, if you draw bad boundaries, polyrepo will have the problems you're describing. But that's the difference between those two: monorepo is slow and predictable, polyrepo is fast and risky. You can reduce polyrepo risks by hiring better engineers, you can speed up monorepo (to a certain degree) by hiring more engineers.
When there's competition, slow and predictable always loses. Partially that's why I believe Google can't develop any good products in-house: pretty much all their popular products (other than search) are acquisitions.
> Vendoring/Versioning for discrete packages is a heavily invested in problem space for most tech stacks you'll come across.
100% disagree. The problem of "How do I define my dependencies and have a package manager reify that into a concrete set of versioned dependencies" may be a solved problem, but the tools for tracking dependencies across many repos and driving upgrades is neolithic. About the only company I've seen that does this well is Amazon, and they have yet to sell us version sets as a service.
> OP has such a tiny project that I'm not convinced they're even dealing with dependencies in a traditional sense. But well before you get to Google scale, you'll run into situations where you just want to change one thing and don't want to change every single downstream dependency which normally would have been isolated via a discrete package that doesn't have to change in lockstep. And then that exact same pain starts to exist for deployments and needs to be worked around.
As I alluded to above, the dual of this is that getting everyone to update their dependencies is orders of magnitude more difficult when you have a polyrepo setup, even if we're working under the ideal situation where repo setup is standardized to such a degree that a person can parachute into a repo and become effective within minutes.
> Nothing about the above has changed in the last 5 years; it's kind of the ground truth of monorepos vs. multiple repos. You're the first person I've ever seen imply monorepos don't offload complexity to tooling, even amongst proponents.
Both polyrepo and monorepo have complexity in scaling that is handled by their tooling. The difference historically is that OSS polyrepo tooling has been better and better-integrated, because that's just how most things are built in any language, but that has been improving over the past ~half decade:
* Bazel maintenance complexity has dropped precipitously, and many of the initial bottlenecks you hit with it have been solved in OSS.
* If you're anti-bazel, gradle and cargo monorepo support is decently good. I believe the same is true in js these days, but I don't have hands-on experience
* Services for managing monorepos like sourcegraph for codesearch or mergify for submit queue now exist that make it easy to adopt the patterns that work well at large companies.
* Microsoft and others have invested in git to improve scalability of developing against large repos.
* You have OSS tools like git-branchless that further improve the experience of working in a monorepo
There are a bunch of companies in the O(100) - O(1000) developer range that are using this stuff and it works very well.
It's awkwardly phrased, yes, but what the author is saying is that Google-tier scale companies will have an awful time migrating from poly to mono-repo, and a not fun time being mono-repo for the first few years.
For companies not at this tier, it isn't that hard to migrate to a monorepo and the benefits will be more immediate because the tools (e.g. git) won't be screaming under the load.
(My personal 2c is that you can be well below Google-scale and still hit the limits of the common tooling when using monorepos. Canva, Stripe, and Twitter are examples)
My understanding is that Meta/Google had to rewrite tons of stuff (distributed file systems, VCS, search tools, build tools...) and need many teams to maintain all these systems, and sometimes their developer experience is inferior to what you get with off-the-shelf OSS. I'm not super familiar with this topic as I don't work with systems of that scale, but I'm wondering: was it worth it or necessary? What would be the alternative?
> sometimes their developer experience is inferior to what you get with off-the-shelf OSS
Working at Meta, being able to use Sapling (our Mercurial fork) is actually one of the highlights and I was pretty miserable every time I needed to go back to Git for my open-source work — thankfully it was open-sourced with a git backend a couple of weeks ago, so now I get to combine the excellent sapling CLI with the github hosting service :D
> sometimes their developer experience is inferior to what you get with off-the-shelf OSS.
The internal version control system that I use (Fig, which is apparently based off Mercurial) is so good I've basically never had to think while using it. I can't really say the same about git.
CodeSearch is also, AFAIK, a lot better than what you would get OSS.
I think the productivity gains across 100K engineers is definitely worth it.
I have to say, I really enjoy using Google monorepo and all tools available to me.
Having said that, it's not free of course. There are many engineering teams and computer power being invested to provide this environment to all googlers.
I've seen what happens when an org tried to adopt a Google-style monorepo without making the investments in build tooling and cultural change. It was a disaster.
All of those things are needed even at orgs much smaller than Google, or you will end up with an unbuildable, unmaintainable, unreleasable mess.
For orgs that can't make those investments, I think a repo per team is the best approach. Each team can treat their repo like their own little Google-style monorepo if they want to.
A lot of the work is only because of the large size of the codebase. E.g. Forge and ObjFS, which is necessary because compiling the really large binaries on a normal workstation would OOM it. Or take days. https://bazel.build/basics/distributed-builds
If your codebase is "normal-sized," you don't need nearly that amount of infrastructure. There is probably some growing pain when transitioning from normal-sized to "huge," but that's part of the growing pain for any startup. You're going to have to hire people to work on internal tooling anyway; setting up a distributed build and testing service (especially now there are so many open-source and hosted implementations) is worth the effort once you're starting to scale. You're going to have to set that up regardless of a mono-repo or many separate repos.
It's probably only worth hiring serious, dedicated teams that work on building like Google once your CI costs are a significant portion of operation. That probably won't happen for a while for most startups.
I think that's a bit misleading (disclaimer: I very much like Bazel existing, though I think a better version of it could exist somewhere).
Surely a lot of work is put into Bazel core to support huge workflows. But a huge amount of work is put into simply getting tools to work in hermetic environments! Especially web stack tooling is so bad about this that lots of Bazel tools are automatically patching generated scripts from npm or pip, in order to get things working properly.
There is also incidental complexity when it comes to running Bazel itself, because it uses symlinks to support sandboxing by default. I have run into several programs that do "is file" checks that think that symlinks are not files.
Granted, we are fortunate that lots of that work happens in the open and goes beyond Google's "just vendor it" philosophy. But Docker's "let's put it all in one big ball of mud" strategy papers over a lot of potential issues that you have to face front-on with Bazel.
Personally I think this is what companies should do -- it guarantees hermeticity as you say, guards against NPM repo deletion (left-pad) and supply chain attacks. But for people who are used to just `npm install` there is a lot more overhead.
Personally I don't think there is that much value to society in endlessly vendoring exactly the same code in various places. This is why we have checksums!
I understand that Google will do this stuff to remove certain stability issues (and I imagine they have their own patches!), but I don't think that this is the fundamental issue relative to practical integration issues that are solvable but tedious.
EDIT: I do think people have reasons for vendoring, of course; I just don't think it should be the default behavior unless you have a good reason.
For everyone who complains about monorepos, remember that some of the most forward-thinking engineering companies like Google and FB also use monorepos. All the arguments people make in favor of polyrepo really come down to the lack of strong tooling for monorepos. That's also why Google and FB would not have scaled if they were using GitHub / GitLab, and instead had to build their own. Also, Google's original source control was built on top of Perforce!
The issue with building a custom monorepo system that can handle Google's and Facebook's scale is that it fails to scale down, even to moderately large project and organizational size. It's expensive (think at least 7 figures opex) and not what most people should be doing.
git, for all its issues (and I'm a git-hater), scales down to an individual coder and scales up (with a lot of hacks, the hacks being used varying depending on whether you're taking a poly or mono approach) to companies that employ thousands of developers.
Polyrepo scales to thousands coders without anyone even noticing, and that's the beauty of it.
Just look at the size of node_modules in an average project. You stand on the shoulders of thousands of other engineers and you don't even think about it. That's your polyrepo at work.
Now imagine that every time a dependency wants to ship a new version, its maintainers have to update all of the dependents. That's your monorepo.
It is quite obvious which one is more scalable.
The only real benefit monorepo has is that every dependency is always at its latest version. But the cost to achieve that... let's just say there's a reason you mostly hear about monorepos from Google.
For anyone who complains about dictatorships, remember some of the most resource-abundant countries are dictatorships.
Those companies should be one of last places to look for good software development practices. They have absolutely no incentive to recognize their own mistakes.
a) There is no correlation between mono/poly repos and your ability to scale. There are many examples of successful companies using either approach.
b) As a general rule people should be cautious about adopting approaches and technologies from Google, Meta without a clear understanding of why they need it. What works at their scale doesn't always apply to smaller teams.
> remember some of the most forward thinking engineering companies like Google and FB also use monorepos
As a counterpoint though I’d say that the issues Google and FB face, particularly in terms of the sheer scale of the work they’re doing, are pretty unique.
Google literally invents programming languages for domains it feels needs them, I’m not about to blindly follow that practise either.
The way at least FB is using a monorepo is very different from any kind of monorepo most people imagine. It's not just about tooling; git itself could never handle it. I am all for using a monorepo, but Google and FB having one is not an argument for it.
IIRC originally neither mercurial nor git would scale enough, but mercurial was much more willing to accept scalability-related patches, so long as they didn’t harm the more common small-scale use-cases. After a couple of years of submitting patches for moderate improvements, meta wanted to make some more controversial large-scale changes like dropping support for sequential commit numbers, and ended up hard-forking and breaking compatibility to do that. That incompatible-but-better-performing fork then stayed internal for a while, before recently being released as Sapling.
As a bonus, part of the rewrite of the internal storage engine involved creating a storage engine abstraction layer, which in turn made it easy to add Git as a backend :D
I'd like to add my own PSA in concurrence with this.
Just use a monorepo. Use tooling to work around its limitations if you reach that point.
I work in a SaaS that's polyrepo based, having split from its original monorepo as part of a microservices push (which never succeeded, leaving us stuck half way in the worst of both worlds).
Nothing has been more destructive to productivity than the polyrepos and their consequences. We're talking a 20% engineering spend dead weight loss.
This is stark obvious to every single engineer, but trying to get people to accept that fact and sign off on a project to coalesce them back into a monorepo is just insurmountable.
Polyrepos are irreversible damage, stay away from them. Hold the line on your monorepo. One organisation, one repo.
As a special case: polyrepos are fine when you have a genuine plugin architecture. To be a genuine one: you have public-facing, end-to-end documentation on how to use it.
I would use chrome extensions vs. chrome itself as an example. Or same for VSCode. Then vendor plugins can be in their own repos.
I think otherwise for simple module A and module B that interact privately, in an adhoc way, the separation of concerns is not there and monorepo is better.
Absolutely this. I am experiencing a polyrepo setup of 254 repositories. The level of productivity loss is at a scale I don't think can be adequately described in words.
Too small and you end up needing to deploy 5 apps to get your feature out and tests are difficult to coordinate.
Too large and the huge CI suite takes forever, people step on each others toes and it can be difficult to get things done. Access control and permissions is also difficult if you ever want to transfer ownership to another team.
The happy middle ground is to follow Conway's law a little and have a few related apps/modules belonging to the same team in one repo, with a CI that has a simple integration test for all of them. It's fairly natural to achieve this if you don't have people following cargo cult memes.
I would like for any article insisting I adopt any practice to put a little more effort into doing so. This seems to amount to "it's good", "I should have done it sooner", "it's not so bad, trust me", and "smart people like it, too". Could have been a tweet.
Probably disagree - Google has built a boat-load of infrastructure to enable a mono-repo which you don't have and which (I don't think) is even available, either OSS or commercial. Use modules/libraries as an alternative to separate repos. This may require a few more commits for distributed changes, but that isn't hard and forces you to consider realities like "what if A service is deployed before B service?", which happen whether or not you're using a mono-repo. Also use data-lakes for data or generated artifacts (something based on S3 or an S3-alike).
My understanding is that Nrwl's Nx is built by ex-Google employees, that their integrated repo style is based on the approach used at Google, and that it's open source (they make money with their paid build-caching solution).
Most companies aren't Google. So, the fact that they can't run the exact same tools doesn't seem like a real blocker when alternatives exist.
I've used Nx a bit before and intend to for my next project. Even as a solo dev initially, I'm optimistic it will help me keep things organized and catch potential issues with shared code via the dependency graph.
It's just one piece of the puzzle though. Have you noticed how companies running Bazel usually have a dedicated team? Many orgs are absolutely not interested in that.
Earlier in my career I was so excited to learn about microservices, as I was working with a huge monolith at the time. Then I went a bit overboard, and for a new prototype I had all these services in different repos and running independently. It was all very cool and I felt very accomplished, only to later (and now painfully obvious) be pointed out it was completely unnecessary, to the point each function basically was its own deployment. But not in a way that made any sense. If you were looking for the _most_ expensive way of running things, maybe.. hah
Tunnel vision is a powerful thing, and if this is how people are splitting repos nowadays, it seems I might not have been the only one chopping up a perfectly good dish only to be left with the individual ingredients.
You can have microservices and different repos, microservices and a monorepo, or a monolith and monorepo. In principle you could even have a monolith split across different repos (e.g. each library in a separate repo). What's included in a deployed binary has nothing to do with how the source code is structured.
Yes, that's not lost on me; my point, however, might have been.
The point was, when learning this by yourself, you can really go down the wrong path with the right advice at the wrong time. This seems to hold true for the monorepo discussion as it did/does for microservices.
For me, this boils down to a question of "what set of problems do you want to solve?" and "what set of problems do you want the system to solve for you?"
To this end, I am generally a monorepo proponent. For the systems that I work on, the problems of "what gets built" and "how do you version the components" are an easier set of problems to solve than "how do I ensure that I don't break anything else when I update this method?" and "if I do break something when I update the method, where is everything that needs to change?"
I've had projects that were fractured across half a dozen repos where libraries were versioned and tagged and deployed, and then the next downstream was updated to use the newly tagged version... and something broke. This was especially bad when the "something broke" was part of another team's code or it had been thrown over a wall. A bunch of changes, many PRs - some of which are just updating a version to pull from the artifact repo.
When things were updated, I took the opportunity to move all of those projects into one monorepo and I only had one PR to review - that passed all the tests for all of the projects contained within the monorepo.
Yes, you still have deployments and tagging and the "well, just because you revved X to 2.3.0 doesn't mean that Y is at 2.3.0 too" problem (strict semver can get a bit on the messy side).
It's a question of which problems do you want to solve. I tend to find that me solving the "how do I set up the ci file and coordinate version numbering" is an easier problem to solve than "all these builds need to be updated in a bunch of different repos and one of them breaks."
Someone gets a few lines of code under their belt and now they overwhelmingly know what is best for all.
It's simple, if it gets released together, put it in the same repo (even if it's a big repo like the Linux kernel). If not, separate it out so you can stay sane.
If it's not released together but lives in one repo anyway, then it's a monorepo, also defined as a big ball of spaghetti.
I have my doubts about the mono repo approach. From the top of my head:
1. It might increase the complexity of other processes on that repo: CI/CD configuration, makefiles, branching strategies, codeowners, etc.
2. The versioning might lose its meaning
3. Blast radius in case of screwing things up accidentally.
I personally split repositories based on responsibilities: here the code base, here the IaC, here the configuration, here the manifests. Always using a standard and predefined naming convention for the repository names. That being said, as always, it depends. I might embrace the monorepo if the context demands it and it has been properly discussed and evaluated.
It hasn't been like that most of the time in my experience. You have to integrate into the same CI/CD pipeline, say, from the code base side: linters, tests and builds; from the IaC side: formatting, validation, plan; from the manifests side: YAML linters, builds (e.g. kustomize), dry runs, etc. Now think of all the previous stuff, but also add the logic for something common like "based on the branch, run this and not that over this environment". Of course you could separate the previous into different pipelines, but then you might have to somehow control the order of execution of each. It is possible; however, it might be complex. Also, one must consider that sometimes reaching a consensus between devops and developers can be tricky, and you might end up with people stepping on each other's toes.
> 2. Versioning becomes easier
How so? From my point of view, if I find a repository containing one specific thing (e.g. a Terraform module) and I see the release version "1.0.1", I get a clear idea of what it implies. However, if that module is versioned along with the Terraform files themselves as well as the code base, what was fixed in that "1.0.1" version? The Terraform module? Something in the app itself? Of course, I can go to the release notes and spend some time reading them, but you'd better be following some good practices in that regard, otherwise it's going to be time consuming.
> 3. Blast radius is similar but cleanup is easier
I could buy this one. However, is it preferable to assume the risk because of how easy it is to clean things up in case they go south, or is it better to avoid the risk in the first place by having everything separated from the start?
I guess context (complexity of the project, culture, communication across teams) is the key.
Here’s why I’ll never go back to anything but a monorepo.
My previous org decided on this for their source control/build dependency layout:
1. Monolith, home to front and back end, artisanal handcrafted openapi spec that plugged into nothing and was just for show.
2. Hand-painted service-cli library, which was supposedly going to be open source for customers but never was, and only mattered to the front end, back in the monolith. We manually kept it in sync with the backend, to everybody’s benefit (YFR). Hope you see it: backend changes left the monolith, went into the service-cli, and came back into the monolith. Face, meet wall.
3. A knapsack full of tiny libraries which all could have been wrapped up in one. They were all dependencies that plugged into the monolith and the service-cli. And they all had subdependencies.
4. Everything version-wise was synced via private npm registry. So if you wanted to make a change that went into the next version of monolith, but the code had to live in a leaf dependency way down at the bottom, you had to run a relay race marathon of red-tape PRs that you had to bribe your coworkers to pay attention to and/or approve.
The entire time, several of us on the team were begging the core team to repair this and they basically told us “no we believe in the tenets of sadism, now take your allotted pain, feature-bitch.”
Even in a private monorepo, Google has tools like Copybara that make it somewhat easy to 'open source' small pieces of it and sync them in and out of a 'public' repo. The benefits of a monorepo and just having to 'git checkout' and your entire environment 'just works' (I'm looking at you, scripting languages and build systems) is not to be taken lightly.
Where I work we just package everything (nugets, python packages, npm) on our Artifactory. Contracts dependencies (DLLs, protobufs) are also distributed as packages. We made it easy to fetch and test the source and allow developers to develop, debug and test those dependencies with their own project if needed.
Every time we try to assemble repositories in macro-repos we always end up regretting it. Multiple dedicated repositories allow autonomy for teams and enforce modularity and coding as a library. Monorepos have a tendency of becoming huge merge trains, easily and often derailed, with lots of fear of being blamed for stepping on someone else's toes.
We update often all our projects knowing full well that not doing so is just borrowing development time at high interest rate.
As a side-note when we do have to do an assembly of different code base, we use git-subrepo: https://github.com/ingydotnet/git-subrepo which provide the best of both submodules and subtree.
Do not listen to pithy blog posts. Seriously. There's too many factors to consider that aren't in the blog post. They get upvoted on HN here quickly, and most often when they are written from ignorance, which seems super confident, and leaves out all the detail and nuance that could make such an upvote questionable. I'm pretty sure the upvotes are based on an emotional reaction to the title alone. (The "reasonable drawbacks" part of this post should have been like 5 pages; and the same for multirepo)
If you really want to know whether you should use a monorepo, first go buy a book about them. Then talk to senior leaders in organizations that have gone through an entire growth cycle and migrated from multirepo to monorepo or vice versa. The random dude on the random blog post doesn't have the perspective to inform you properly about the implications. Hell, I've done the switch twice, and I still wouldn't call myself an expert.
The whole thing really comes down to this difference: If you want to change one piece of software, it's faster and easier with multirepo; If you want to change many pieces of software, it's faster and easier with monorepo.
All a repository really is, is a collection of files in one "bucket". No magic involved, no secret sauce, no woo-woo philosophical jargon to intone, no technical mastery to obtain. There are other features of a repository, like branches, tags, subrepos, LFS, etc. But you don't even really have to use any of those things. At its core it's just files in buckets.
The complexity comes in once you have to start working with these buckets of files. How do I change a lot of files in one bucket? How do I change a lot of files in a lot of buckets? When I want to perform an action with the files ("build", "test", "deploy", etc), in one or many buckets, how do I do that?
How do I do these things with multiple people simultaneously changing the same files? Or files that depend on files that depend on files? How do I coordinate the subsequent actions of the first actions on the buckets on the files with multiple people testing multiple changes simultaneously?
No matter what option you choose, there will be choices you have to make which require compromise. You will eventually have to work around the complicated consequences of these compromises. No matter what you choose, there is always a way to make things easier, but it always requires work to get there.
So really the question isn't monorepo or multirepo. It's what actions do you want to make easy first, and what consequences you want to spend engineering effort to fix first. But at the end of the line, assuming your product exists long enough, everybody ends up with basically the same system, because all of them, when people use them in complex ways, require complex solutions.
Anecdotally, as a developer I really appreciate a well-thought-out monorepo. All the code is there. You don't have to dig around anything legacy or outdated. Searching is a breeze and it's _fast_. Updating both simple and complex logic is equally easy. You don't need to somehow discover that some other repo uses the thing you changed. Everything has the same deployment process. The tests test the thing, as soon as you push it - instead of months down the line, when someone finds forgotten repo again and oh, it's never been tested with the new functionality in our main repos. Dependency hell doesn't really exist.
There seems to be more discipline around monorepos because they're collectively owned, whereas many repos across an org each become someone's baby, and practices diverge until teams in the _same company_ fork each others' repos because they don't get along.
This is the first I've heard of "microrepositories", but it appears the author may have just been overzealous with partitioning their repositories and has now settled on what most of us would just call a "repo".
How do monorepos work with something like gitlab or github, which expect a CI file in the repo root?
My workplace has dozens of repos, and they all have separate CI needs and test suites. Trying to shoehorn them all into a single set of CI pipelines would be very difficult, even with the include and parent-child pipeline features.
I have the most experience with Nx and can only speak to that. Nx projects are tied together with a dependency graph. The workflow/pipeline replaces something like "npm build" with "nx affected --target=build --base=main". Nx compares the code to the base branch, identifies the leaves of the tree (projects) affected by the code change, and runs the specified task for each project.
Each project can specify what that task (technically the "target") means in its own context by directing it to run a specific "executor." For some projects, "build" may run a Webpack executor; others may map it to run a Vite executor.
Edit: To summarize, if you change something deep down in the tree like a design token, you can run the "test" and "build" tasks for everything affected by that change to make sure none of your apps break. If those projects are SPAs built in different frameworks, each can map the task to run the framework-specific "test" or "build" executor. Now you have automatic assurance that all your apps still work. If that sounds like a lot for the CI to do, that's where distributed caching of the task results comes in.
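As a concrete sketch of the commands involved (targets and branch names are whatever your workspace defines; this assumes a reasonably recent Nx):

  # test and build only the projects reachable from the changed files
  npx nx affected --target=test --base=main
  npx nx affected --target=build --base=main

  # visualize which projects a change actually touches
  npx nx graph

In CI you would typically point --base at the merge-base with the default branch so only the delta of the PR gets rebuilt.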
on:
  pull_request:
    types:
      - opened
    paths:
      - '.github/workflows/**'
      - '.github/CODEOWNERS'
      - 'package*.json'
That triggers when those files change.
The "get them all to run in a single repo" is the challenge of the monorepo. The trade off is "if you change a common library, how do you trigger the test suites in all of the downstream repos to run with the new common library and verify that they still work?"
With a monorepo, that second part is solved for you... at the cost of needing to solve the problem that you're asking about.
Having worked for a company that had a monorepo approach and seeing what happens when it hits its apex of issues I don't think I would ever advocate for one again. I'm sure it works for some companies, but when it goes wrong it can literally destroy an entire company.
This might be an unpopular opinion, but to me, this is just an unsolved problem. Between polyrepos and monorepos, I’m not convinced either of them is a clear winner. I’ve used both structures and as a project grows, I dislike them both for different reasons.
To all the folks thinking that this project is too small to be representative of the benefits of monorepo -- both real world success stories and theory point towards monorepo being superior.
I've seen this at several orgs. A polyrepo runs into dependency hell, so they go to microservices. Now you have dependency hell and microservice hell. Furthermore, your microservices have APIs that need to be updated, so each change needs to be done in multiple microservice repos. So to solve one problem you have now created three problems.
Monorepos reduce dependency hell to a source control problem. The vast majority of monorepo issues people encounter are due to using 'git', which is why Google and Meta do not use 'git' at all.
Polyrepo is literally the only way you can run open source projects. Particularly in academia. People constantly confuse the open source world for how they should run their private companies.
The evidence from the most successful software companies runs directly contrary to this model.
My 2 cents: monorepos try to solve what is actually a series of devops problems, by essentially abusing the way an organization or individual uses git (or the VCS of your choice), tools that were never designed for devops in the first place.
To elaborate: I think you are much better off, whether as a team or solo developer, developing small, isolated, and tested / reliable libraries and / or packages that can be combined as needed. Ideally, each package or library can build, test, and ship itself at the CI/CD level. Trying to encapsulate all that into one giant conglomerate of potentially different languages, file types, frameworks and much more sounds like exactly what it is: a mess!
Developing components like this puts a lot of pressure on the components to each look like a piece of Real Software.
Versioning, stability, changelogs, coordinating new features as beta releases followed by deprecation announcements and finally deprecation. None of this is unreasonable, technically — my Linux OS is made of many small stable projects after all, none of whom communicate with each other beyond broadcasting NEWS.md etc with each release.
It’s also exhausting. Do your competitors hamper themselves by cosplaying a bazaar of open source projects coordinated by whatever your internal equivalent of the Debian project is? No! Move everything under one roof to truly reflect your org structure and take advantage of the fact that you all report to one CTO, you all have real-time communication between each other, and you probably know where everyone else is sat for eight hours a day. Move fast.
> take advantage of the fact that you all report to one CTO
How many companies even have a CTO that knows or cares about version control? In most companies version control is managed by IT if at all, and code for different products is split across different lines of business.
> These tools that were never designed for devops in the first place.
No. Git was specifically developed to manage the Linux kernel project, a famous example of a monorepo.
Every instance of polyrepo architecture I have seen in the wild, comes with some kind of bolt on system which emulates a monorepo. Such as: a top-level manifest file which pins the versions of every repo to a known-good value. With the understanding that if you were to pick any other combination of versions, You Are On Your Own.
My current company does this and it's quite unpleasant. I call it "monorepo as a service".
> Git was specifically developed to manage the Linux kernel project, a famous example of a monorepo.
Is Linux a monorepo? It's one project with a single release artifact. To me, Linux just looks like a big project. I'd define a monorepo as containing more than one project (e.g., if Linux and Git shared the same repo).
Huh? Monorepo refers to how you store your source code, regardless of how many binaries you build and run, not to shipping your system as one giant monolithic binary.
No! I'm tired of being told how to architect systems because other folks refuse to grow their knowledge.
Because I'm good at my job, a monorepo would be pants-on-head stupid. Because I know how to make separate codebases interact with one another successfully, it would be a massive mistake to lose those advantages, all because you (not me) run into problems when you try to do what I find perfectly natural.
And if you work with me, you will learn, and I will guide you, and in time you will experience systems design as I do, and no longer be shackled to only doing the things you understand today.
But no, "do this because I don't know how to do that" will never be a winning argument.
Loved this comment. We use multi-repos and the only drawback was at the beginning, when we were deciding our linter rules and had to make multiple PRs (or when some other library had to be stabilized first). Other than that, it has never annoyed me in any other way. Independent CI/CD configuration, independent versioning for each repo and independent commit logs are some of the reasons I like it. Maybe everything I mention is simply a tooling problem, but until that tooling exists for us mortals and not just internally for Google, I will stick with multi-repos. I understand the reasoning behind monorepos but it simply is not enough to persuade me at the moment.
The only thing that I would like us to solve with a tool when it comes to the multi-repos, is to force which versions of the separate repos are compatible together. For the time being it's manageable without automation but it's definitely a high priority.
I myself am skeptical of monorepos, and I have only worked with polyrepos, but I am curious: what strategies do you use that help make you effective with multiple code bases?
Keeping the spirit of microservices alive is super duper important, so not entwining two services with shared code or responsibilities is key, and forcing interaction to happen through their defined APIs.
Yeah, that's pretty much what I do as well. Shared libraries get their own repo as well and are published in a package manager. Honestly, in terms of enabling concurrent development across multiple teams while simply using common OSS software, polyrepo seems a lot easier.
In my experience you must write more tooling to get the monorepo to behave as you'd expect, and being able to use off-the-shelf tools for CI/CD and orchestration is a huge win vs. having to convince those tools that deploying both the periodic service workers as well as an HTTP backend is actually Totally Normal when it's not.
I have worked at places that have micro-repos, places that have polyrepos, and places that have monorepos. From my observation monorepos are a sign of ossification and resistance to new patterns and new technologies.
IMO it's not really a monorepo. There's a single codebase really. The author hasn't hit the real monorepo pain points yet: git operations slowing down, CI not handling the scale of commit events from a single source, instability of the codebase due to many parallel changes, painful merges, etc.
The difference between 150K LOC and Google's largest codebase in the world is huge and AFAIK google has a monorepo.
From my experience real monorepo requires a lot of tooling and maintenance.
I was curious why it was a big deal either way for a smaller codebase/documents to use a monorepo or not. It's mentioned:
> You may be reading this short essay and thinking: "Yeah, no shit! Of course you do not need a monorepo when you're talking about, like, 150K lines of code."
> Maybe it's some abstract sense of purity; maybe it's because you didn't realize most IaaS let you deploy from a subfolder. Whatever the reason is: trust me. Move it into a monorepo. If you do and you end up regretting it, [email me] and I'll donate to the charity of your choice.
So the new thing being said here is that a monorepo makes sense even at smaller scales. This seems to be on the far other end of the scale, with a main app, supporting infra, docs, marketing, etc.
I've certainly used fullstack repos with backend and frontend together and it's fantastic to use and deploy from commits. I can't infer from this article or my experience what it would be like to work with a monorepo at medium scale say 10-100 apps with 10s of teams. That would take more adaptation than simply "deploy from folder". Dependencies as mentioned in another comment and being able to detect if each target should be retested/deployed seems like the tough one.
I said 10-100 apps, and should maybe have said 5-10 teams rather than 10 teams, but 10 dev teams in a company to me is medium-sized. Or do you mean it's less than medium-sized?
One problem he didn't mention with the poly-repo approach is that repositories are like rabbits: as soon as you have more than one they begin breeding. Repositories, plural, beget more repositories.
The problem of course is that the only way to share code between repos is to create another repo. As the saying goes, once you have two things, you will soon want three.
"Oh, repo B needs to use some code in repo A. We'll just split that out into a separate repo."
Be very wary of this impulse. Soon, you end up in the situation I'm in now, where you have hundreds of tightly coupled repos, some with a single file! Worse, these all compose at runtime into only one application running on one host.
Consider instead that perhaps repo A and repo B should be joined. Or, (yes I dare say it), that perhaps the code should be duplicated.
This bugs me about Java software. Instead of having a single library, the library maintainers create some interface and then 30 more jars that implement it - xwidget-dropbox, xwidget-google-drive, xwidget-onedrive, etc... It's just really stupid. Everyone just pull down xwidget and move forward. Same goes for npm packages and front end things, rely on your treeshaker and ES6
You have a monorepo anyway, and the choice is between dealing with it via automated or manual wrappers around tools like git (e.g., manually cloning relevant repositories and constructing a build environment referencing all the right versions appropriately) or using dedicated monorepo tooling.
That's still maybe an interesting question, but when appropriately framed it highlights that the real issue is that monorepo tooling may or may not fit your use case well, and you might be better served by hacking together more primitive solutions or making your own tooling.
IMO, the major point of the article still stands; git and whatnot are pretty good, and the build/deploy tools interacting with your code are probably terrible, so absent major outside pressures you'll go a long ways figuring out how to leverage git (in a monorepo configuration) against jenkinbuckethubernetes, and that's often much less painful than inevitably figuring out how to marry two ancestrally disjoint codebases.
Does this discussion not also apply to most mono vs poly scenarios? Monolith vs microservices comes to mind. The infamous monolith kernel vs micro kernel discussion between Tannenbaum and Torvalds. Even small companies vs large ones.
We shift the complexity and pain to where we are most comfortable with it. Or where we see the best benefits from the trade off.
Another big disadvantage of monorepos is the size. If you're having to clone the repo in its entirety on a routine basis (think CI/CD pipelines, or there's some other repo that needs to maintain multiple work trees referencing different commits), that repo being some massive monstrosity can be a real pain in the ass.
I mean, the only ways to avoid a monorepo are either submodules or a package manager. Do devs really think these are better than a folder?
Having used both they both introduce unneeded complexity and get in the way of actual development. If you use a sensible and modern CI they all support a monorepo flow.
Once you reach a certain scale in a particular type of company the repo just becomes another type of org unit. Your giant fish swallows a minnow, it looks good on paper, someone does a half arse integration, then that repo becomes the one. The one that your manager says, under no circumstances no-one commits to, because then it's OUR repo (with whatever baggage that brings)... The org depends on the code, but the org actively resists it like a virus. This kills any attempt at the mono repo at any largish non FAANG org.
People keep raving about google3, but it had recurring problems with changelists (far outside your own project) that break the world and stop you from testing or deploying anything new for hours. Eventually they had to give up on using everything at head, and instead use known-good versions of major shared dependencies that were a few hours or days behind head.
I want to bump our dependencies when we have the slack to handle the risks. Alpha testing every commit of every library is not really how our team is supposed to be spending our time.
Most opinions about monorepos conflate monocultures in source control with monocultures in build systems or workflows. Those two considerations are technically orthogonal. You can have one big repo with N build systems (like a node app and a python app in the same repo with disjoint workflows). And you can have N repos that have a common build system (each can combine into one big build with cloning and/or submoduling).
I find most of these conversations are more interesting when people decide to make that distinction.
This is a major sticking point for us. Many of our repos are 3rd party, made available to us under different licenses. Some require individuals to be licensed, and so we use ACLs for both read and write access. Some of our source is export-controlled and this also is enforced via ACLs at the repo level.
For your 2nd drawback, I can recommend checking out git subtree. It allows 2 way split and merge of any subtree of a repo. You can easily open source part of your repo in a “fork” and keep it up to date with your mainline as well as incoming pull requests with very little effort.
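A rough sketch of what that looks like (the prefix path and public remote here are hypothetical):

  # carve the subdirectory's history out onto its own branch
  git subtree split --prefix=libs/widget -b widget-export

  # publish it to the public mirror
  git push git@github.com:example/widget.git widget-export:main

  # later, merge accepted public PRs back into the monorepo's subdirectory
  git subtree pull --prefix=libs/widget git@github.com:example/widget.git main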
I couldn't agree more with the author, there's nothing worse than the self-flagellation of having a single change in one repository triggering a snowball effect where you have to update a dozen of other repositories, open another dozen pull requests to be approved, and carefully time releases so you don't break anything.
So the author seems to have a problem with premature optimization. It's not that you should use a monorepo for everything; it's just that when starting out you will probably be better off with a single repo, and then make a separate one if the need arises.
Anyone have resources on how to actually perform the migration of many repos to a monorepo? Currently facing the same issue and want to make the consolidation move, but not sure how to merge the git history. This article and the previous one it links to are missing all the implementation details.
It takes a bit of time. You’d want to do it one by one. I’ve done it the other way (mono repo code to library) and it’s worth the time to keep the history around where you are working. I just did a google search. First one that comes up tonight is this article: https://medium.com/@ayushya/move-directory-from-one-reposito...
Don’t, keep things simple. Maintain a separate archive of each individual repository and any history can be referenced in that repository when necessary.
I just finished a migration like that, and after some research ended up writing shell code to implement 3 methods and put them to the test. Maybe it is useful for you so I'll leave it here:
My first step was to learn the procedure that the devs behind GStreamer used recently, when they went ahead and decided to merge all their disparate repos into a single monorepo. For my needs it is the least adequate of the three, because it will break as soon as there is any collision between the desired destination and the already existing files of a repo.
The second method is one proposed in the JDriven blog. It doesn't suffer collisions because the movement into destination subdir of each repo is done separately on the original repos, and then they are merged with everything already in place.
Lastly, I found a third method which is the one I liked the most. It uses git subtrees, to perform the move and merge all in the target repo (the monorepo itself), without touching the original repos (unlike the JDriven proposal), so everything is tidy and left in place.
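For reference, the subtree variant boils down to something like this, run from inside the monorepo (remote name, URL and prefix are made up):

  git remote add projA git@github.com:example/projA.git
  git fetch projA
  # import projA under projects/projA, carrying its full history into the monorepo
  git subtree add --prefix=projects/projA projA main

Repeat per repository; each import comes in as a merge whose second parent is the old repo's history, so nothing is lost.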
How many merge conflicts is it going to introduce with just a team of 4?
Monorepo also means merging both frontend and backend into one, even with totally different languages? Should be fine with a traditional web framework, but how does it work for an API + SPA setup?
Meta, Google, Microsoft, Twitter -- all of these companies use a monorepo. I wonder if taking on the monorepo (and managing it successfully) actually has something to do with companies becoming successful.
[0] https://engineering.fb.com/2022/11/15/open-source/sapling-so...