
As a big fan of the monorepo approach personally, I would say the biggest benefit is being able to always know exactly what state the system was in when a problem occurred.

I've worked in large polyrepo environments. By the time you get big enough that you have internal libraries that depend on other internal libraries, debugging becomes too much like solving a murder mystery. In particular, on more than one occasion I had a stacktrace that was impossible with the code that should have been running. A properly-configured monorepo basically makes that problem disappear.

This is more of a problem the bigger you are, however.



I think at some point we're just reducing this down to "programming at scale is hard".

Sure; that is a really big problem, and it becomes a bigger problem the bigger you are. But, as you become bigger: the monorepo is constantly changing. Google has an entire team dedicated to the infrastructure of running their monorepo. Answering the question: "For this service, in this monorepo, what is the set of recent changes" actually isn't straightforward without additional tooling. Asking "what PRs of the hundred open PRs am I responsible for reviewing" isn't straightforward without, again, additional tooling. Making the CI fast is hard, without additional tooling. Determining bounded contexts between individuals and teams is hard, without additional tooling.

The biggest reason I am anti-monorepo is that advocates will stress "all of this is possible (with additional tooling)", but all of it is possible, today, just by using different repos. And I still haven't heard a convincing argument for what benefits monorepos carry.

Maybe you could argue "you know exactly what state the system was in when something happened", sure. But when you start getting CI pipelines that take 60 minutes to run, or failed deployments, or whathaveyou; even that isn't straightforward.

And I also question the value of that; sure you have a single view of state, but problems don't generally happen "point in time"; they happen, and then continue to happen. So if we start an investigation by saying "this shit started at 16:30 UTC", the question we first want to have answered is "what changed at 16:30 UTC". Having a unified commit log is great, but realistically: a unified CI deploy log is far more valuable, and every CI provider under the sun just Does That. It doesn't mean squat that some change was merged to master at 16:29 if it didn't hit prod until 16:42; the problem started at 16:30; the unified commit log is just adding noise.


You had the wrong tools. It doesn't matter if you have a monorepo or not, you will need tools to manage your project.

I'm on a multirepo project and we can't have that problem because we have careful versioning of what goes together. Sure, many combinations are legal/possible, but we control/log exactly what is in use.


> Sure many combinations are legal/possible, but we control/log exactly what is in use.

I'll acknowledge our tooling could have been better, but isn't it better to just be able to check out one revision of one repo and have confidence that you're looking at the code that was running?


It depends on your architecture.

If I have a services based architecture then I can jump straight to the repo for that particular service and have confidence that it is the code that is running.


So instead of adopting a system that makes the problem we’re discussing not possible you use a human-backed matrix of known compatible versions?

Like you do you but I’ve never seen “just apply discipline” or “just be careful” ever work. You either make something impossible, with tooling or otherwise, or it will happen.


No, it is a tool-backed matrix. Illegal combinations are not possible, and we have logs of exactly what was installed, so we can check that revision out at any time.


To solve this properly you need to store the deployed/executed commit id of any service. That could be in the logs, in a label/annotation of a Kubernetes object, or somewhere else. But this has nothing to do with whether you use a monorepo or multiple smaller repositories.

In some of my projects, we use the commit of the source repo as the docker tag, and we make the docker image build as reproducible as possible. I.e. we don't always build with the latest commit of an internal library, but with the one that is pinned in the dependency manifest of our build tool. Since updating all those internal dependencies is a hassle, it's automated: an auto-generated merge request updates each dependency for every downstream project, so all the downstream pipelines can run their test suites before an update gets merged. Once in a while that fails, and a human has to adapt the downstream project to its latest dependencies. In a monorepo that work has to be done as well, but for all downstream projects at once.
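A minimal sketch of the commit-as-tag idea (the registry and service names are made up, and the throwaway repo exists only so the sketch runs anywhere):

```shell
# Sketch: use the source commit as the image tag, so any running container
# maps back to an exact checkout. Registry/image names are hypothetical.
set -eu
workdir=$(mktemp -d)
cd "$workdir"
git init -q
git -c user.email=ci@example.com -c user.name=ci commit -q --allow-empty -m "initial"

commit=$(git rev-parse --short HEAD)
image="registry.example.com/orders-service:${commit}"
echo "would build and push: ${image}"
# docker build -t "${image}" . && docker push "${image}"
```

Any log line or orchestrator label carrying that tag then points straight back to the commit that produced the artifact.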


Could it be that submodules are underused?


Submodules are hell. I work somewhere with a polyrepo topology, with the inevitable "shared" bits ending up integrated into other repos as submodules. Nothing has been more destructive to productivity and caused so many problems.

A plain old monorepo really is the best.


Git submodules are really a PITA.

The fact that git checkout did not update submodules was a major design flaw in my opinion.


It can now, but that's not the default. The defaults for submodules suck, because they match the behavior of old versions of git for backwards compatibility.
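For reference, the opt-in looks like this (a sketch; the throwaway repo is only there so it runs anywhere):

```shell
# Sketch: opt in to submodule-aware checkouts (git >= 2.13).
# This is off by default for backwards compatibility.
set -eu
workdir=$(mktemp -d)
cd "$workdir"
git init -q
git config submodule.recurse true   # per-repo; add --global to apply everywhere
git config submodule.recurse        # prints: true
# One-off alternatives that don't touch config:
# git checkout --recurse-submodules <branch>
# git clone --recurse-submodules <url>
```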


Yeah. UX issues aside: don't ever use submodules to manage dependencies inside each polyrepo, or you will eventually accumulate duplicate, conflicting, and out-of-date sub-dependencies. Package managers exist for a reason. The only correct way to use submodules is a root-level integration repository, the only repo that is allowed to have submodules.
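A minimal sketch of that integration-repo layout (all names are made up, and a local stand-in repo replaces what would normally be a remote URL):

```shell
# Sketch: a root-level integration repo is the only repo with submodules,
# pinning each component at an exact commit.
set -eu
work=$(mktemp -d)
cd "$work"
# Stand-in for a component repo (normally a remote URL):
git init -q shared
git -C shared -c user.email=ci@example.com -c user.name=ci \
  commit -q --allow-empty -m "initial"
# The integration repo pins the component at an exact commit:
git init -q integration
cd integration
git -c protocol.file.allow=always \
  submodule add "$work/shared" components/shared
git -c user.email=ci@example.com -c user.name=ci \
  commit -q -m "pin components/shared"
git submodule status components/shared   # shows the pinned commit
```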


The only problem I have with a monorepo, is that sometimes I need to share code between completely different teams. For example, I could have a repo that contains a bunch of protobuf definitions so that every team can consume them in their projects. It would be absurd to shove all of those unrelated projects into one heaping monorepo.


Well that's what a monorepo is! I work on one, it's very large, other teams can consume partial artifacts from it (because we have a release system that releases parts of the repo to different locations) but if they want to change anything, then yeah they have to PR against the giant monorepo. And that's good!

Teams out of the repo have to be careful about which version they pull and when they update etc. However, if you are a team IN the monorepo, you know that (provided you have enough tests) breaking changes to your upstream dependencies will make your tests fail which will block the PR of the upstream person making the changes. This forces upstream teams to engage (either by commits or by discussions) with their clients before completing changes, and it means that downstream teams are not constantly racing to apply upgrades or avoiding upgrades altogether.

I work on shared library code and the monorepo is really crucial to keeping us honest. I may think for example that some API is bad and I want to change it. With the monorepo I can immediately see the impact of a change and then decide whether it's actually needed, based on how many places downstream would break.


Ok. I've had some time to think about this, and I am warming up to the idea. It would sure simplify a lot of challenging coordination problems. My only real concern is that the repo may grow so large it becomes very slow. Doubly so if someone commits some large binaries.


It does become slow eventually, and yes you need discipline and tooling to block people from dumping everything in it.

You do need a lot of code / developers before you outgrow git; cf. the Linux kernel.


I’m a git nerd and even I struggle with the submodule UI, there are probably a lot of people who just can’t deal with it.


I am certainly not a heavy user, but for work I've made myself a "workflow" repository which pulls together all the repositories related to one task. This works super well. There sure is a bit of weirdness in managing them, but I found it manageable. But I'll admit that I don't really use the submodules for much more than initial cloning, maybe I'd experience more problems if I did.


Yes, but it's because submodules are a badly architected, badly implemented train wreck.

There are many good and easy solutions to this problem, all of which were not implemented by git.

git is a clean and elegant system overall, with submodules as by far the biggest wart in that architecture. They should be doused with gasoline and burned to the ground.


I like using submodules for internal dependencies I might modify as part of an issue. I like conan or cargo for things I never will. I don't particularly like conan. Perhaps bazel, hunter, meson or vcpkg are all better.


> internal libraries that depend on other internal libraries

This is where you start to develop nostalgia for well-structured monolithic apps.


I can check out a git revision and the library dependencies will be handled transparently by the package manager.

No doubt this is possible with service approach, but it means additional layers of complexity added on top.


This should happen on monorepos as well as per-service repos. So it is not an argument for either side of that discussion.


But this is a discussion of dependencies between services. You need more tooling for managing inter-service dependencies as opposed to package dependencies within one monolith.


> I've worked in large polyrepo environments. By the time you get big enough that you have internal libraries that depend on other internal libraries, debugging becomes too much like solving a murder mystery. In particular, on more than one occasion I had a stacktrace that was impossible with the code that should have been running. A properly-configured monorepo basically makes that problem disappear.

On the contrary, a monorepo can make it impossible, because no single checkout corresponds to what was actually deployed. If what was running at the time was two different versions of the same internal library in service A and service B, that sucks, but having separate checkouts for service A and service B sucks less than trying to look at two different versions of parts of the same monorepo at the same time.


There is no source of truth for "what was deployed at time T" except the orchestration system responsible for the deployment environment. There is no relationship between source code revision and deployed artifacts.


Hopefully you have a tag in your VCS for each released/deployed version. (The fact that tags are repository-global is another argument for aligning your repository boundaries with the scope of what you deploy).
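A sketch of what that could look like (the tag naming scheme is made up, and the throwaway repo is only there so it runs anywhere):

```shell
# Sketch: record every deploy as a git tag, so any incident timestamp
# maps to an exact checked-out tree. Tag names are hypothetical.
set -eu
workdir=$(mktemp -d)
cd "$workdir"
git init -q
git -c user.email=ci@example.com -c user.name=ci commit -q --allow-empty -m "initial"

git tag "deploy/payments/2024-06-01T16-30Z"
# Later, during an incident: what was deployed around 16:30?
git tag --list 'deploy/payments/*'
# git checkout "deploy/payments/2024-06-01T16-30Z"   # the exact deployed tree
```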


Of a service, yes. Of the entire infrastructure, no.


Why not? I’m doing it right now. The infrastructure is versioned just like the app and I can say with certainty that we are on app version X and infra version Y.

I even have a nice little db/graph of what versions were in service at what times so I can give you timestamp -> all app and infra versions for the last several years.


Unless your infrastructure is a single deployable artifact, its "version" is a function of all of the versions of all of the running services. You can define a version that establishes specific versions of each service, but that's an intent, not a fact -- it doesn't mean that's what's actually running.


Am I missing some nuance here? Yes the infra version is an amalgamation of the fixed versions of all the underlying services. Once the deploy goes green I know exactly what’s running down to the exact commit hashes everywhere. And during the deploy I know that depending on the service it’s either version n-1 or n.

The kinds of failures you’re describing are throw away all assumptions and assume that everything from terraform to the compiler could be broken which is too paranoid to be practically useful and actionable.

If deploy fails I assume that new state is undefined and throw it away, having never switched over to it. If deploy passes then I now have the next known good state.


Oh, this implies you're deploying your entire infrastructure, from provisioned resources up to application services, with a single Terraform command, and managed by a single state file. That's fine and works up to a certain scale. It's not the context I thought we were working in. Normally multi-service architectures are used in order to allow services to be deployed independently and without this form of central locking.


If what was deployed was foo version x and bar version y, it's a lot easier to debug by checking out tag x in the foo repo and tag y in the bar repo than achieving the same thing in a monorepo.


Of course, but this is entirely possible with a monorepo.


Possible perhaps, but not easy by any means.


Then you should build one. E.g. GitLab can create special git references for every deployment it has ever made.


Our artifacts are tagged with their git commit and build time, which then gets emitted with every log event.
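A minimal sketch of that pattern (the log format and helper function are made up; the throwaway repo is only there so it runs anywhere):

```shell
# Sketch: stamp a build with its commit and build time, then emit both
# with every log line. Format and names are hypothetical.
set -eu
workdir=$(mktemp -d)
cd "$workdir"
git init -q
git -c user.email=ci@example.com -c user.name=ci commit -q --allow-empty -m "initial"

GIT_COMMIT=$(git rev-parse --short HEAD)
BUILD_TIME=$(date -u +%Y-%m-%dT%H:%M:%SZ)
log() { echo "commit=${GIT_COMMIT} built=${BUILD_TIME} msg=\"$*\""; }
log "handling request 42"
```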


I'm not sure I understand how that scenario would arise with a monorepo. The whole point of a monorepo is that everything changes together, so if you have a shared internal library, every service should be using the same version of that library at all times.


And every service deploys instantly whenever anything changes?

(I actually use that as my rule of thumb for where repository splits should happen: things that are deployed together should go in the same repo, things that deploy on different cycles should go in different repos)


Not necessarily instantly, but our CD is fast enough that changes are in production 5-10 minutes after hitting master.

But what's more valuable is that our artifacts are tagged with the commit hash that produced them, which is then emitted with every log event, so you can go straight from a log event to a checked-out copy of every relevant bit of code for that service.

Admittedly this doesn't totally guarantee you won't ever have to worry about multiple monorepo revisions when you're debugging an interaction between services, but I haven't found this to come up very much in practice.

Edit: I should also clarify, a change to any internal library in our monorepo will cause all services that consume that library to be redeployed.


Which CD are you using @xmodem?


Buildkite, with our own orchestration layer built on top.


> things that are deployed together should go in the same repo, things that deploy on different cycles should go in different repos

What do you do with libraries shared between different deployment targets?


> What do you do with libraries shared between different deployment targets?

The short answer is "make an awkward compromise". If it's a library that mostly belongs to A but is used by B, then it can live in A (but this means you might sometimes have to release A with changes just for the sake of B). If it's a genuinely shared library that might be changed for the sake of A or B, then I generally put it in a third repo of its own, meaning you have a two-step release process. The way to mitigate the pain of that is to make sure the library can be tested on its own, without needing A or B. As for the case where a library is shared between two independent components A and B but tightly coupled to both, such that it can't really be tested on its own: all I can suggest is to try to avoid it.


If you have a library that is tightly coupled to A and B, then A and B are effectively coupled.

Ergo, put all three into a single repo because you pretty much have to deploy all three together.

The test for the tightness of coupling is to ask whether A and B can use different versions of the library. If not, they are tightly coupled.


That's a great test, and I think an argument for a monorepo for most companies. Unless you work on products that are hermetically sealed from each other, there are very likely going to be tight dependencies between them. Your various frontends and backends are going to want to share data models for the stuff they're exchanging, for example. You don't really want multiple versions of those to exist across your deployments, at least not long term.


I think it's maybe an argument for a single repo per (two-pizza) team. Beyond that, you really don't want your components to be that tightly coupled together (e.g. you need each team to be able to control their own release cycles independently of each other). Conway's law works both ways.


? You can totally have independent release cycles between multiple targets within a monorepo.


If they have independent release cycles, they shouldn't be tightly coupled (sharing models etc. beyond a specific, narrowly-scoped, and carefully versioned API layer), and in that case there is little benefit and nontrivial cost to having them be in a monorepo.


Not GP, but I use versioned packages (npm, nuget, etc) for that. They're published just like they're an open source project, ideally using semantic versioning or matching the version of a parent project (in cases where eg we produce a client library from the same repo as the main service).



