Monorepoize – Bash scripts for creating a monorepo out of smaller repos (github.com/gigamonkey)
137 points by oumua_don17 on April 27, 2020 | 102 comments



After working on a monorepo and then the split up repos for the same codebase, I cannot fathom why somebody would want to take small repos and merge them in a single repo. The mess and complexity just increase.


It's about having boundaries that make sense. The best scenario is having multiple codebases where, when someone wants to change something, it can be done by modifying code in just one repo.

However, the worst scenario is also having multiple codebases, but when someone wants to change something, they end up having to go through many codebases and have to deal with all the environment setup, coordination, and possibly the worst of them all - the versions.

A monorepo is a straightforward way to sidestep both the best and the worst scenario at the same time. You give up the codebase-splitting game, never worry about it again, and focus on things that matter more.


I worked in a company where a million line project was split in about 130 repos. Features or bug fixes frequently required syncing PRs between multiple repos. We needed scripts to create feature branches in all repos, or bump up the version everywhere.

4 or 5 repos would probably have been sane. 130 was hell to manage.


If that happens often you have the wrong repo split. This isn't a condemnation of the multi repo approach, only your architecture.

Note that this is about tradeoffs. If you have a monorepo this problem goes away, but now you need to manage the problems of monorepos. If you have multiple repos you get this and other problems instead. Pick the right tradeoffs for your own needs. The only thing wrong is claiming your answer is the right one for everyone else.


> If that happens often you have the wrong repo split.

Yes, but: you pick out a nice, clean, perfectly segregated architecture. Two years later, things have evolved, and a new feature that no one initially thought of is requested and involves refactoring 25 repos. Oops.

Side note: not my architecture. I only worked there for a short while. I have no skin in the game either way, and I'm sure I agree with you that mono- and multi-repos are the right solution for different problems.

Just trying to expose problems I've seen happen in a massive multirepo approach. Everyone thought it was the correct solution after being burned by a massive monolith that over two decades became an entangled mess. They thought they had the perfect architecture, the "correct repo split". In practice it's almost impossible, which is a major tradeoff to be considered.

Minimizing the number of repos and splitting only when absolutely necessary at least has a chance of reducing that complexity. The default should be "as few repos as possible", not "we're going multirepo, let us break this down into as many small parts as possible as our first step", which might be what the architects were used to in modular program architecture (small, reusable functions/classes).


> The default should be "as few repos as possible"

In any corporate/enterprise-y environment with multiple teams, the default should be the smallest piece likely to change ownership.

When you see a pattern where a feature involves refactoring a bunch of repos, then your initial system breakdown was horrendously off-track and you lost sight of your domains. Been there; it is fixable, but it takes a while.

> Everyone thought it was the correct solution

There is no correct solution, there is the best solution for your problem set. Sometimes that's a monolith, and sometimes that's several small repos. Every once in a while, it's even a monorepo.


Even with the best architecture, you probably have some shared libraries that are used across teams. These will often need to be updated, and they will affect several other packages (other repos, in the multirepo case). The only way to avoid this is to duplicate all code. Moreover, you can have a great architecture, but your teams are "feature-oriented" instead of service oriented (or some other vertically oriented team structure) in which case they'll need to touch many different repos (I don't imagine feature-oriented repos are a good idea) per ticket.

I'm increasingly convinced that the monorepo approach is the right one, but the tooling needs to get better. Bazel is a hydra of incidental complexity and Nix has the right core idea but suffers from a variety of other problems (poor documentation, gratuitously and unnecessarily unfamiliar, too few escape hatches, etc). These issues aren't fundamental to the domain, and there's no reason these tools couldn't change course nor is there a reason that another tool couldn't come in and eat their lunch.


> you need to manage the problems of monorepos

I have not seen a problem on monorepos that does not exist on multirepos, other than more quickly reaching tools' repo size limits (but this is becoming less of a problem all the time on e.g. git and mercurial, and supposedly was never a problem with perforce).

I have seen many problems multirepos have that monorepos do not - such as that multirepos can (and almost always do) have a wrong split, and monorepos by definition cannot.


> I have not seen a problem on monorepos that does not exist on multirepos

How do you refer to a library in a monorepo from a different repo?

The workflow seems to me to entail an all-or-nothing approach.


I never tried it, but if I needed it, I would either use a sparse submodule checkout (if it’s possible; I don’t know whether it is), or submodule the entire repo and then symbolic-link into the internal part I need.

If all else fails (e.g. you need 50 files out of a 100GB repo, which makes things too slow), I would set up a git subtree extract as a checkin hook, which would mean there exists another repo/branch automatically updated on checkin - and which you could use any way you want, e.g, simple submodule.

Do note that’s just what I said: the problem, “I only need a small part of the repo” is independent of it being a monorepo or multirepo.

You would only avoid this problem in a multirepo if you cut it to pieces in exactly the right places. It’s just a matter of whether the parts of the repo you don’t need are too large to make this comfortable, which is more likely in a monorepo as it is larger, but it will happen in a big enough multirepo as well.


I worked in a company that had at least 100 repos, often with interdependencies, and was managing it well using svn.

Then we migrated to Git, which doesn't have a good equivalent to svn's externals, and things kind of went to heck for a while. We eventually hacked together an in-house system that brought back some of what we were missing, but it never worked as smoothly as svn externals did.

I was originally one of the advocates of moving to Git. Hindsight being 20/20, after the dust settled I realized we would have been fine staying on svn, anyway. With such a high number of individual repositories, blocking on each other and nasty merge conflicts were rarely a problem in practice because we had over 100 swim lanes clearly marked in the pool. Lots more swim lanes than we had developers.

FWIW, one team dealt with it by migrating just their stuff to their own monorepo, and I wasn't too impressed by that outcome, either. They started seeing an outsize rate of technical debt accumulation as things that previously were easier to keep well modularized started naturally getting more tangled together. I realize that is a problem that can easily be managed with just a little effort. But, like every other dev problem that can easily be managed with just a little effort, it wasn't.


git has submodules.

A repo can thus host any number of sub-repos. Using submodules can also be pretty much like having a monorepo if you have a master repo with all the others under it. This way, you can still publish things like a public API as a repo.


Submodules don't allow for atomic commits that span repos, though. Atomic cross-project commits are the most compelling selling feature for monorepos IMHO.

(A common case for this is changing a function's API, and fixing all downstream consumers, in one tidy, holistic PR).


> Submodules don't allow for atomic commits that span repos, though.

They can, but there's little tooling support to help you do so.

You can make a new commit in one repo, and then in one atomic commit, you can pull that new commit into a parent repo's submodule and fix the users calling into that submodule.


They don't work quite the same way as svn externals, though, in a way that didn't work for us. In particular, they leave no good way to deal with merging in upstream changes in a submodule's code if you've also made some of your own (provisional) edits.


Often there are implicit dependencies between (versions of) the many repos. E.g. where I work, it was decided to put test data in a separate repo from the code (to keep the latter repo small in size). But now, when you add a new test case, it will fail on older versions of the code.

With a monorepo, you can always check out a consistent version of all parts.


And when your CI fails, it points to a specific commit, rather than "the commit that triggered it, plus any commit in a dependency repo around the same time." And if you need to maintain old release branches, each one is a single branch, and your tooling doesn't need to know anything special about what branches of other repos to check out. And `git bisect` works.

All of these things are possible without a monorepo, by using higher-level tooling to manage the relationship between repos. It could be `git submodule`, or Repo, or whatever. But all of those things have their own downsides.


git submodules tracks the version of the sub repos.

So if you organize your releases as a master repo with all needed repos as submodules, your sub-repo version tracking is already done for you.

All that while still allowing a sub-repo to move forward faster than other repos. With a monorepo, you can only allow some subsystems to go forward faster than others by making them live permanently in separate branches.

Basically, in a monorepo you have to use branches to do what separate repos normally do for free.

(Then there is the issue that with a monorepo, any screw up screws everybody. You're all on the same boat, all the time.)


Yeah, we have set up a "parent" repo with git submodules for all the individual repos. But it was an afterthought and our workflow has not really caught up to the new possibilities.


I feel like the library needed is actually the opposite. Take a monorepo, do some code analysis, split into many repositories. Monorepos are fool's gold unless you can have a team whose sole job is managing monorepo complexity, and even then I might ask "Why waste a team on monorepos?"


The problem is that we have crammed many related but distinct concepts--versioning, package management, access control, issue tracking, project management, licensing, etc.--into a single envelope called the "repo".

Having a single top level version and commit-log for all an organization's code is a huge win. But that doesn't mean you necessarily want to manage those other more granular concerns at the top level too. With tooling that does a better job drawing these boundaries, we should be able to have the best of both worlds (instead of the worst, which is where we are with the currently dominant repo-management tools).


I don't really agree. Rather than stuffing all your code into one envelope and then building tooling to decipher it, why not build tooling that allows commit log and versioning information for an arbitrary amount of repositories? Why does code have to live side-by-side to manage metadata about it?


That works too. I think you could arrive at the ideal setup from either direction: a monorepo with appropriate boundaries or multi-repos with an “umbrella” versioning layer for all of them. The end result is more or less the same.


I guess I think one of those scenarios comes with all of the benefits but far fewer of the drawbacks than the other. But you're right, you can skin a cat however you want.


The biggest barrier to monorepos is size, which is a problem for `git log` and related (e.g., `git blame`). The Microsoft work on Git will make a lot of that trouble go away, but I suppose there will always be some issues to working with a huge monorepo.

Q: But what's the alternative? A: A repo forest, or a repo web.

Well, repo forests/webs have similar issues crop up anyways. You could say that whether a mega-codebase is spread over a mega-monorepo or a mega-repoforest doesn't change the fact that it's a mega-codebase -- big comes with problems no matter what.

And many enterprises can't help getting to have megacodebases. Operating systems (including distros in the Linux case) are huge. So are ecosystems for various popular programming languages. Enterprise apps easily add up to more in large enterprises.

So once you're dealing with mega-codebases... if you can find a mega-monorepo that works, that's probably where you want to be.


On the contrary, you should provide a (good) argument before splitting a project. One such argument might be that some parts, for example configuration code, are on such a different cadence that they should have a different branch and release model. Or that some parts must be kept unreadable for most developers. Or that the project has simply grown too large.

A good rule of thumb can be how much larger your project is than the Linux kernel itself. The kernel seems to constantly get near the number of objects that git can comfortably handle without any major speedbumps.

Until then, do not bother. Be prepared that splitting a project in multiple repos will always require some sort of external tooling as soon as tickets or pull request workflows span multiple repos.


Depends on what you are working on. The kernel is nothing like an enterprise environment. If you have services in your architecture that are deployed separately, that is a perfect argument for splitting your repository.


In git I fully agree, and I wonder which company successfully runs a monorepo in git.

For me, I prefer git submodules, which seem to have the benefits of both a monorepo and separate repos.


> I wonder which company successfully runs a monorepo in git.

Microsoft


Worth noting that due to the size of the Windows repo, they ended up having to extend git significantly

https://devblogs.microsoft.com/bharry/scaling-git-and-some-b...

https://github.com/microsoft/VFSForGit


For which product is this? It can't be all of them right?


Windows is in one repo at Microsoft, I believe.

https://devblogs.microsoft.com/bharry/the-largest-git-repo-o...


Yeah but they basically wrote their own wrapper around it, which is to say Microsoft has fallen into one of their usual patterns:

1. Picking a trendy tool

2. Mis-using trendy tool

3. Rewriting trendy tool so it's no longer trendy tool but something custom and not quite standards-compliant with its own weird behaviors and bugs

4. Complaining that the standard is wrong


They wrote a wrapper around it because git took so much time for every basic operation; even a `git status` would take minutes. But still, it's not a fork: you can either use it or not without impacting anyone else using the repo. And they also made it available to everyone, as good open source citizens should.


Microsoft does definitely use git, monorepo I am not so sure.



That doesn't say that everything related to Windows is developed on the same repo, just the kernel and several core components.


How did I miss this! Can you imagine someone claiming Windows was going to be in git as recently as 10 years ago? This world never ceases to amaze me. What happened to SourceSafe?


SourceSafe was never used for anything major internally. They had their own custom source control system, based on Perforce, I believe, and then moved to TFS, and then they switched to git.

SourceSafe was something inflicted on SMBs :-D


Interesting, didn't realize that. I haven't worked there but knew some people who did, and they were very proud of dogfooding - I guess at least the OS and IDEs, though maybe not the VCS.


There have been many posts in their tech blogs about why they chose a monorepo


Khan Academy has a monorepo in git. With 10 years of history in it, it's not always pleasant but does come with some advantages.


I've been looking at that approach. We have a large "app" that has a lot of boundaries (parts even being in different languages), but everything is run (love being able to docker-compose up my entire stack) and deployed in unison. However, it does make for a messy single repo.


> I wonder which company successfully runs a monorepo in git.

Google. They're not using git, but their own scm: https://research.google/pubs/pub45424/


I read somewhere on the internet that Google uses a monorepo. I want to be like Google, so I need to use a monorepo.


In the last few years, I saw a lot of people using monorepos, and discouraging using submodules.

Frankly, I don't know why. For small projects, sure, it is an overhead. For anything larger, I find it useful to encapsulate things, be it as installable packages or, if that is not the case, at least as other repositories.

In the last months I split one project into 4 repos (https://github.com/Quantum-Game/) and couldn't be happier about that move. It makes the code cleaner, Pull Requests more separated, etc.


Once your code grows and you end up with too many repos you'll be yearning to get back to monorepos. It lets you make changes in one go instead of having 3-4 PRs that all have to be merged at the same time or otherwise the build breaks.

Having separate repos only makes sense if they are maintained by separate teams with a well defined API in-between that doesn't change often. If the same team maintains all repos then it is just overhead, especially if you end up with 100 repos and you have significantly fewer than 100 engineers in your team.

There are also potential build time savings when having a monorepo since you can do parallel builds across different subdirs without serialising at package boundaries.


I maintain such a project with many dependencies which I also maintain separately. Yes it can be a lot of extra work to publish but this argument is based on the assumption that dependencies need to be changed constantly and this assumption is wrong. There should be pressure to design dependencies in such a way that they are modular and don't need to be changed often - This unpleasantness associated with updating deep dependencies is good and necessary. These should rarely be updated, if they need to be updated often, then they were designed incorrectly.

Many of the dependencies and sub-dependencies which I maintain are several years old. They are small enough and stable enough that they basically never need to change.

If I need to update them (which is rare), I may need to do a cascading update of multiple dependents and this can take a while but if I have to do this big update only once every 6 months, it's totally worth having them in separate repos.


That depends on the scale of the project, of course. If you have thousands of engineers as MicroFaceGoog does, then it’s a daily, even hourly, occurrence for many dependencies.

You are not MicroFaceGoog, of course, so you only do that every 6 months.

However, in what way are separate repos nicer for you? Personally, I settled on a monorepo with a “$dep-devel” branch for every dependency, which I occasionally have checked out in multiple working trees at the same time, and which I feel gives me all the benefit a multirepo could have (whatever they may be, I can’t find them) while still making tracking, branching and merging across dependencies trivial.


> Having separate repos only makes sense if they are maintained by separate teams with a well defined API

Agree! I've found that a multirepo setup only works if there's an obvious boundary. Similar to SOA, distributed monoliths - or distributed monorepos - are problematic.


> Having separate repos only makes sense if they are maintained by separate teams with a well defined API

If some logic is modular then it should be in a separate repo. If it's not modular then it should be part of the same repo; just regular files inside directories... IMO, monorepos try to have it both ways but this doesn't make any sense; it encourages developers to write logic which is "somewhat modular".

But "somewhat modular" is not modular at all. Either it's modular and it can easily be extracted out completely or it's not and then it should be embedded into the project's code.

Code which is not modular should never need to be shared with other projects.


This is a ridiculous oversimplification that doesn't scale at all to larger orgs. Every modular component in its own repo? That basically means creating a repo for every library. You could easily be managing thousands of repos with negative benefit.

People seem to be afraid of using a single repo for some reason. It's really simple though. You should have one repo per "release cycle" that you have. Anything meant to be deployed together should be in the same repo. Things not meant to be deployed together should be in different repos. With the right tooling, that is what solves most problems.


I have two things that are not deployed together. However, there are common algorithms between the two that I'd rather not maintain by copy and paste. A monorepo doesn't make sense because there is always something close to release and getting risk averse (there are legal consequences to some bugs), so nothing can ever be updated.


The release cycle of that common code is also a big factor.

If your common code changes fairly irregularly, you are correct, monorepos have reduced benefit.

If your common code is as likely to change as anything else, or more likely, then your multirepo setup can begin to incur more cost.


>> Every modular component in its own repo

My original statement was slightly misleading. I meant to say "every modular component which needs to be reused by different projects" obviously I'm not advocating that every class should be in a separate repo; that would be madness. My point is; if it's modular and needs to be reused in different projects, then it should be a separate repo, otherwise, it can just be added as files inside the current project repo; no need to publish those as separate standalone libraries for download. If the component is not modular, then it shouldn't be shared with any other projects.


> It lets you make changes in one go instead of having 3-4 PRs that all have to be merged at the same time or otherwise the build breaks.

Can't you solve this by pointing submodules in the parent repo to a specific commit SHA before the breaking change? Then update the pointer in parent repo when you've made the corresponding changes.

I understand it's still added work compared to a monorepo, but seems like this could be a controlled way to handle submodules when necessary.


I saw some projects in which my instinct was "too many repos" (e.g. https://github.com/ActivityWatch/activitywatch), exactly for the reasons you mentioned.

However, in my case, as the project gets bigger, I appreciate it more. In the first place, it makes code reusable, and forces you to encapsulate functionality as much as possible. (Yes, having non-encapsulated code would be a nightmare.)

To get to the extreme, in the Node ecosystem code is pretty much structured as many, many small packages. (Bear in mind that turning something into a package is a much stricter requirement than making it effectively a subfolder.)


There’s a longer discussion here which may interest you: https://danluu.com/monorepo/

But if your code uses someone else’s libraries that don’t change, or has super well defined interfaces between parts that never change, or is going to be a mix of languages, then maybe you won’t have issues or see an advantage in monorepos.

If you have a library in a monorepo then it is much easier to improve the library in a breaking way and upgrade all its dependants. If you have many repos then this work gets more distributed (so one person does it many times, or lots of people who are less expert in the change to the library have to do it), or libraries are just not frequently updated and you get the same helper functions not being upstreamed, or you use different interfaces for the same library in different repos.


One size rarely fits all.

Moving to mono repos has been great for my team for reducing build times and removing the overhead of understanding decades of cruft built up to make a huge mono repo manageable.

But with turnover of both projects and people, and many projects not requiring active development, there's an awful lot of orphaned repos, with less than one person dedicated supporting them now.

In my environment at least, you can't just stop development. Packages need to be kept up to date as vulnerabilities are discovered, and using Azure DevOps means we need to move along with changes to the build process. Infrastructure and secrets policy changes come from outside and require us to make changes.

Making these kinds of changes to a handful of repos is quite a bit of overhead. It's clear now that we went too small on the repo size, given the tools we have for the maintenance tasks we have to do.

And people advocate for even smaller repos...


The benefits of monorepos are (IME) all in people and organisation scale. For example, I've found it orders of magnitude easier to align tooling, quality, release and standards on a monorepo with 100's of 1000's of engineers working on it simultaneously.

The point you've made is a good one though, the idea of monorepos at small scale has come up a lot recently and I'm not sure the arguments for it are as compelling (I don't see many compelling arguments _against_ it either mind you). What I _have_ seen though is the pain large orgs go through going from multi-to-mono repos, I think Twitter did it recently and it was pretty painful.


Do they use git or some other versioning system?


Most orgs at that scale build their own tooling to make working with monorepos more ergonomic, in my experience. They're based on git (or other DVCS tools like Mercurial) but they allow you to easily do things like partial clones and company specific workflow tasks like submitting patches for review or tagging in weird and arcane ways.

A small, illustrative example of this is the tooling that's built for Chrome's source code[0] with the depot_tools project [1].

[0]: https://chromium.googlesource.com/chromium/src/+/master/docs...

[1]: https://chromium.googlesource.com/chromium/tools/depot_tools


I have worked on projects organized both ways with git and my observation is that regardless of the choices, the right tooling can make the workflows a lot more fluent.

For separate repos, share code as much as possible through a packaging system so one does not have to make a lot of refactoring across multiple repos. It sounds backward but an auto minor version update can ease a lot of merging pains in a CI environment.
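To make the "auto minor version update" idea concrete, here is a minimal sketch in Python. It assumes plain `MAJOR.MINOR.PATCH` version strings and treats any newer release with the same major number as safe to pick up automatically; the function names are hypothetical and not taken from any particular packaging tool.

```python
from typing import Iterable, Optional, Tuple

Version = Tuple[int, int, int]


def parse(version: str) -> Version:
    """Parse a plain 'MAJOR.MINOR.PATCH' string (assumed format, no pre-release tags)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)


def pick_auto_update(current: str, published: Iterable[str]) -> Optional[str]:
    """Return the newest published version with the same major number as
    `current`, or None if nothing newer and compatible exists."""
    cur = parse(current)
    compatible = [
        candidate
        for candidate in (parse(v) for v in published)
        if candidate[0] == cur[0] and candidate > cur
    ]
    if not compatible:
        return None
    return ".".join(str(part) for part in max(compatible))
```

A CI job could call something like `pick_auto_update("1.4.2", ["1.4.3", "1.5.0", "2.0.0"])` and open a bump PR whenever the result is not `None`, leaving major upgrades to a human.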

For monorepos, figure out as early as possible what/how things should be shared and separated. I worked on a project with 20+ services and websites that form a whole product; each service chooses its own languages and build systems but shares deployment interfaces for unified service discovery. CI got tricky as it can be blocked because of an unrelated change. I have not found the unified commit history super helpful, as I only have context on a few of the services and most of the time I only look at histories with `git log my/path`.


I feel bad about plugging myself into a post to Peter Seibel's work, but if you're interested in this kind of git preservation, I wrote a script to move a subfolder between two repos:

https://github.com/jakub-g/git-move-folder-between-repos-kee...

(I didn't use it for a while, but it worked for my case a few times in the past).


Why is there so much code? Both python and bash. What are the edge cases that it is handling? What niceties does it introduce that you might otherwise miss out on?

I combined ~10 repos last year using a few one-liners and loops.


Having done this migration, I recommend you look into filter-repo, https://github.com/newren/git-filter-repo

I don't remember the specifics but the method used here didn't produce the results we were looking for when migrating long histories and lots of branches.


The single biggest impediment to monorepos might be Jenkins. Monorepos need a radically different way to do CI/CD and none of the existing tooling does it well. There is probably a market out there for some tooling to do good monorepo CI/CD without requiring hacky scripts and polling.


This is really a build system issue - ie figuring out which parts of the (mono)repo are affected by which change. Bazel, for example, depends on explicit, per-directory, BUILD files, and does a fairly good job of finding dependants.
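As a rough illustration of "figuring out which parts of the monorepo are affected by which change", here is a minimal Python sketch. It is not how Bazel works internally; it assumes a hand-written map from each target directory to the directories it depends on (the `deps` argument and all directory names are hypothetical) and walks the reversed graph from whatever the changed files touch.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional, Set


def affected_targets(deps: Dict[str, List[str]], changed_files: Iterable[str]) -> Set[str]:
    """Return every target that must be rebuilt or retested for a change.

    `deps` maps a target directory to the target directories it depends on.
    """
    # Invert the graph: for each target, who depends on it?
    dependants: Dict[str, Set[str]] = {target: set() for target in deps}
    for target, needs in deps.items():
        for dep in needs:
            dependants.setdefault(dep, set()).add(target)

    def owner(path: str) -> Optional[str]:
        # The owning target is the longest target directory containing the file.
        matches = [t for t in dependants if path == t or path.startswith(t + "/")]
        return max(matches, key=len) if matches else None

    # Seed the walk with the targets that directly own a changed file.
    queue = deque(t for t in {owner(p) for p in changed_files} if t is not None)
    affected: Set[str] = set(queue)

    # Breadth-first walk over everything that transitively depends on them.
    while queue:
        current = queue.popleft()
        for dependant in dependants.get(current, ()):
            if dependant not in affected:
                affected.add(dependant)
                queue.append(dependant)
    return affected
```

A CI job would then build and test only the returned targets; files that belong to no known target (a top-level README, say) trigger nothing in this sketch, which a real setup might want to treat more conservatively.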


Bazel works great for our pretty simplistic monorepo. My main gripe is that it's a pretty big pain to add support for custom build requirements. There have been times where it takes a whole sprint to implement build functionality because, imo, it's hard to run with unless you've been in the ecosystem for a while to understand everything.

I'm curious though, am I the only one who feels this way?


Repo[1] is another tool for this, which is pretty well-tested and widely used.

[1] https://gerrit.googlesource.com/git-repo/


My initial thought was: "Oh! This lets me quickly build a repo that merges all the smaller repos using git submodules!"

And then I see, it doesn't do it, it just merges the repos... Why would you do that?

Honestly, I don't know why git submodules have had such a bad reputation for years. They work out of the box. You can check out subrepos, but you don't have to if you don't want to. You can set up separate branches, separate commits, and separate CI/CD workflows for each of the repos.

AND you're still able to have all the monorepo advantages: to lock down the dependencies to an exact version; to let a dev/CI pull all of them at one go.

Why wouldn't you just use git submodules for that?


I have a custom script for my monorepo that circumvents sub-modules while getting most of their benefits.

Basically, on git pull it goes through and finds all the folders with `.x_git`, renames them to `.git` and pulls. On push it does the inverse. My submodules are really just copies of one small part of the monorepo, but it allows me to have a monorepo with some parts of it being public.

So far it's been amazing. There's no real line between sub-repos and sub-folders, I can push a commit that modifies many of them with a single PR, and it's obvious that the monorepo is for internal stuff while the public repo is for public-facing discussions... I really like it.
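For readers curious what the renaming step of such a script could look like, here is a minimal Python sketch. It is a guess at the idea, not the commenter's actual script: it only covers swapping nested `.x_git` directories to `.git` and back, leaves the monorepo's own top-level `.git` alone, and does not run `git pull` or `git push` itself. The function names are hypothetical.

```python
from pathlib import Path
from typing import List

HIDDEN = ".x_git"  # parked metadata directory, name taken from the comment above
ACTIVE = ".git"    # name git expects for a live repository


def _rename_nested(root: Path, src: str, dst: str) -> List[Path]:
    """Rename every nested directory called `src` to `dst` below `root`."""
    renamed: List[Path] = []
    # Collect matches first so renaming does not disturb the directory walk.
    for found in sorted(root.rglob(src)):
        if not found.is_dir() or found.parent == root:
            continue  # never touch the monorepo's own top-level metadata
        target = found.with_name(dst)
        if target.exists():
            continue  # refuse to clobber an existing directory
        found.rename(target)
        renamed.append(target)
    return renamed


def expose_subrepos(root: Path) -> List[Path]:
    """Turn parked `.x_git` directories into live `.git` directories."""
    return _rename_nested(root, HIDDEN, ACTIVE)


def hide_subrepos(root: Path) -> List[Path]:
    """Park nested `.git` directories again so the monorepo sees plain folders."""
    return _rename_nested(root, ACTIVE, HIDDEN)
```

A wrapper would call `expose_subrepos`, run the per-folder git command in each returned directory's parent, and then call `hide_subrepos` before committing to the monorepo.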


What is the advantage over using `git subtree`?


Git subtree wouldn't preserve hashes of the existing history?


You can preserve history. You can also squash it.


Maybe I don't know about git subtree. I guess it can "preserve history" in the sense of keeping a corresponding new commit for every old commit, but they wouldn't have the same hash id?


Everything stays the same as in the old subtree: commit message, history, object ID, commit ID.

git-subtree(1) also allows splitting a subtree.

git-subtree(1) uses plumbing git commands, which are a stable interface, unlike this one, which uses porcelain commands.

IIRC, Git’s maintainer uses git-subtree(1) to merge gitk and git-gui to Git.


I'm constantly amazed at how fashions come and go in cycles. Only a few years ago everyone was vigorously splitting everything up into microservices and now the monolithic repository is the latest trend.


Both of those are current trends - microservices in a monorepo - as far as I can tell. (Not a fan of either.)


Right, microservices beget monorepos because managing permissions and build pipelines across hundreds of repos isn't pleasant. As mentioned elsewhere in this thread, coordination is really challenging across multiple repos and versions.


From having done this a couple of times in the last few years: make sure you use the mandatory history rewrite in this process to get rid of history you don't want. Team Christmas party videos that got checked into master (just kidding) can easily be removed before switching.


It seems like GitHub and GitLab are written for multi-repo work, especially the CI. Is there a selection of tools to make monorepo life on those services easier?


Big corporations are extremely inefficient. Why does everyone want to copy them?

And why do we use all the bulky tools they produce when there are far simpler, better and more open alternatives available?


Because it turns out big corporations that use multirepos tend to be even less efficient (about versioning) than those that use monorepos.

What are those far simpler, better, more open alternatives?


> Because it turns out big corporations that use multirepos tend to be even less efficient (about versioning) than those that use monorepos.

How do you measure that exactly?


Personally, informal survey of people I know who work at big corps.

Google and Microsoft have both evaluated this internally and reached that conclusion, with some of the evaluation criteria and conclusions publicly documented (though I don’t have links available on my phone, Google will likely find them for you).


Microsoft doesn't really use a monorepo.

This leaves companies like Google and Facebook that made a decision fairly early in their existence, set constraints, and then spent several hundred engineer-years developing their own infrastructure and tools to support that decision.

Does it work? Yes. Is it more efficient than what other companies of similar scale are doing? I don't believe anyone knows.


From reading articles, I got the impression Windows is one monorepo (but it doesn’t include Office) and e.g. Visual Studio is another. So it’s true that Microsoft doesn’t use a monorepo as a whole, but e.g. just Windows itself is a codebase significantly larger and more diverse than 99% of companies would have, and it does have multiple “independent” products: Notepad is an independent module from Solitaire and they are both independent of essentially everything else, whereas e.g. File Explorer, the win32 subsystem and the kernel often need to be updated together for new features.

GoogBook made it early, but re-evaluated. For sure, they found the cost of switching is higher than any potential benefit. I must say that I have not found anyone describing benefits other than “git is faster for smaller repositories” (which was very true but is only slightly true with sparse checkouts and shallow branches these days) and “I like it that way”.

Yosefk has a convincing article about “if your culture is bad, it doesn’t matter; if your culture is good, monorepos are technically easier until you hit some scale wall but then they don’t become harder than multirepo at that scale” IIRC.

But w.r.t Microsoft I admit I’m not very well informed, thank you for setting the record straight.


> it turns out big corporations that use multirepos tend to be even less efficient (about versioning) than those that use monorepos

Amazon disagrees.


Does Amazon optimize for efficiency in software engineering, or just for the overall quantity of output? My sense, having spoken with Amazon SWEs and a fair number of engineering leaders is that they optimize for overall output, but not for efficiency (output per worker).


Well, the measure of efficiency most of the world cares about is “overall profit”, which — around the efficient frontier — has a good proxy with “overall output”.

I feel your “output per worker” measure of average individual efficiency is not a good metric to optimize - just fire everyone except your best employee and you’ve maximized it.

Edit: employer -> employee typo


I think your first point is good and agree with it. Even within optimizing for that (rather than output/worker or output/salary), I think it's fair to examine whether specific choices are optimized for the same case that you are facing. Not everything done by very successful companies should be replicated in your business.


Proper secure development lifecycle approach for all software projects producing code?


What are those?

There are tens of such approaches, many of them practiced at large corps, and in my experience almost none of them deliver the supposed benefits, with few exceptions (NASA, MISRA) that are cost prohibitive except for very narrow fields.

And what does that have to do with “copying large corps” and/or monorepo vs multirepo?


I think it's because corporations treat developers like untrustworthy and replaceable cogs in a big machine - IMO, this is what's wrong, not multi-repos. If corporations gave their developers responsibility and ownership over the maintenance of specific repos which are heavily depended upon, then multi-repos would not be a problem. It's a great source of pride for developers to feel responsible for something and to take care of upstream developers.

About simpler alternatives, non-corporate tools and libraries are often much better than corporate-built ones. For example, React was built by Facebook and is inferior to VueJS which was built independently yet React is much more popular.

On top of this, I contend that if Evan You (the creator of VueJS) hadn't worked at Google, very few people would be using VueJS today; but the popularity has more to do with public image than technical merit. VueJS is associated with Google even though it was written independently; something so simple and elegant (especially at the time it came out) could only have been written by an indie dev. Too many cooks do spoil the broth unfortunately.


I disagree about the nature of the problem.

You can’t give sole ownership of a project people depend on to just one person - a bus factor of 1 is unacceptable in most circumstances. Busses happen, people leave, etc. Some things are also too big for one person.

And when you give it to a team of proud capable developers, you will get ego wars, blame throwing, etc. with probability that increases quickly with team size.

And that’s with capable, proud, responsible, cooperative people.

But not all your people will be like that all the time: some will be burned out, some have no passion and just try to do the minimum to not get fired. Some are sociopaths who actively try to shift work to others and only pride themselves on work they managed to avoid.

Mom and pop software shops can afford non-replaceable-cogs, Google can’t (and afaik didn’t even in the old days when they were run by engineers and not MBAs).

And this whole discussion of copying corps was in the context of mono Vs multi repo....

(For the record, VueJS and Evan You are awesome)


Giving ownership to a single person doesn't imply a bus factor of one. As long as someone can take over, it's not a problem.

I've had this happen lots of times in my career, and most of the time dealing with such projects was faster than onboarding in a company-owned repo by at least an order of magnitude. I really mean it: minutes instead of days to be productive.

In my experience most engineers have the best intentions and want to do the best work possible, until you force them into working in a bad environment where he's treated like... a cog in the machine with zero ability to influence his own productivity, happiness or even the quality of his code.

The biggest reason large companies "can't afford" those things is that they're hell-bent on turning engineers into replaceable cogs, out of ideology.


> In my experience most engineers have the best intentions and want to do the best work possible

I agree, most of them do, but I've also been in a place that hires the best of the best, and despite the best intentions, some people cannot put their ego aside and the overall result is somewhere between less-than-ideal and downright horrible.


> ...until you force them into working in a bad environment where he's treated like... a cog in the machine with zero ability to influence his own productivity, happiness or even the quality of his code.


I disagree (unless your "bad environment" gets tautologically redefined to mean any place where people with good intentions don't get good results.... which you might).

I was working in a place that employed incredibly smart and capable but highly opinionated people, and did not treat them as cogs, and did let them influence their productivity, happiness, and quality of the code.

However, there was a major philosophical disagreement between the "functional programming" camp and the "object oriented" camp (with a minor "procedural programming" camp), and designs occasionally became "I am smarter than you" contests, the overall system being extremely incoherent because every side was able to influence their own productivity.

It worked, because they were smart. It was not a bad environment for any single person. But it would have been a much better place for everyone if there was a benevolent dictator (but nevertheless, dictator) who enforced one way or the other - the randomly reached middle ground has all of the deficiencies of all approaches, and none of the benefits.


If "it worked" and "was not a bad environment", I really don't see the issue here (unless your "it worked" gets redefined to mean "it didn't work"... which you might). Sounds like a nice and intellectually stimulating environment, and that they had best intentions. Honestly, someone wanting a dictator has bigger ego problems than those people.


Very long onboarding, impedance mismatches everywhere, tons of unneeded glue logic. Lots of consternation. It worked in the sense that it wasn’t a failure like a significant number of software projects. It could have worked significantly better if the people involved didn’t feel they needed to show off.

And my model for a dictator is Linus Torvalds, Guido van Rossum, Andreas Rumpf; if you can’t appreciate the management model and abilities that a benevolent dictator brings, we will just have to agree to disagree.



