We Put Half a Million Files in One Git Repository, Here’s What We Learned (canvatechblog.com)
222 points by Maxious on June 16, 2022 | 257 comments



It took me a while to find out what Canva actually does, but from https://www.canva.com it appears they are an online design / collab tool.

To be fair, that could mean a _lot_ of functionality and code providing a rich, SPA, JS heavy experience. Modern JS frameworks aren't exactly known for being concise.

But still, 60 million lines of code and half a million files is a sure sign that someone said at one point "sure, we can throw all these generated files into git!". Didn't someone on the team say "hey, it takes 10 seconds to run git status, can we move this junk out and do this another way??"

Given that 70% of their repo is generated files, that discussion and the tradeoffs involved don't get nearly enough attention from OP.


> Didn't someone on the team say "hey, it takes 10 seconds to run git status, can we move this junk out and do this another way??"

Why do you assume they didn't?

Just because they arrived at a different conclusion than you doesn't mean they didn't think about it. It might very well mean you did not consider the tradeoffs they had to take into account, mainly because you're out of the loop.


One reason for doing this is that someone might be using a very small set of the monorepo, so having everything prebuilt means that in general things are very fast, without constantly having to rebuild chunks of your system.

This is, I think, very common with stuff like browsers, where you have artifact checkouts that basically include built stuff since otherwise you're sitting there compiling forever all the time.

Stuff like Bazel in theory helps with this, but tools that help with this are either super idiosyncratic about how they work (meaning hard to adopt) or outright don't work.

I mean personally I would find that pretty annoying and unclean but I like DAGs.


They’re not assuming anything, but stating that the post itself does not cover this well/at all.


The post itself covers in the intro that they made the conscious decision to go with a monorepo, accepting the downsides of it. Much more than this article though, I'd like to read one discussing that decision and why they went that direction.


Canva gives users without design tools expertise the ability to make fairly polished looking graphics with a super easy and intuitive interface. (As a designer, I can assure you that polished looking is not the same thing as designed.) It’s a very popular service, so they’re dealing with huge scale. Intuitive interfaces often come with complex mechanisms and lots of assets, and they have clients on every major mobile and desktop platform, and the web. They also do a ton of heavy graphics processing that is likely done in lower-level languages than the interface.

The likelihood of a 2000 employee software company simply not considering that they could streamline their build process is pretty slim.


They have obviously invested a lot over time into streamlining their build process: so much so that they're publishing an article about it.

All of the problems they are having are basically due to their use of a monorepo: they do explain that they made the decision early, but I wonder what advantages over multiple repos they are seeing that made all this trouble worth it?


We can talk about the advantages of monorepos, but your question is phrased in a way that makes me think that you don't see any "trouble" in multiple repositories.

I would encourage you to do some research and keep an open mind.


I see "trouble" in all approaches: their solutions to manage their git monorepo approach it by switching to multi-repo emulation.


They seem to have purposely left that out of the scope of this article. There are myriad articles about the benefits and downsides of monorepos vs multiple repos.


In another sibling comment I mentioned how I looked through their engineering blog and didn't see any post where they have talked about any of the benefits they are enjoying due to their use of monorepo.

The question is specifically about their usecase, since not everybody would hit the same bottlenecks as they did with git monorepos.

Iow, have they stopped and thought whether it's still worth it (eg how often do their engineers make use of the monorepo benefits like cross-project refactorings)?


Don't rely on this anecdotal heuristic. Have a look at some enterprises. My experience: "How to solve scaling issues?" - "Automation? No, another team." ;)


This is not a 15k employee enterprise riveting features onto a codebase from the 90s— it’s a <10yo company that makes one product. Big difference in approach.


hey hey author here, xlf files are translations that are coupled with the texts we set in the code, so they're not really generated; I admit that was misleading. What I wanted to get across is that they're not touched directly by engineers, but they're still created through our translation pipeline where real humans translate them.


How do you split your XLIFF files? Does each project get one big one and the proliferation is simply due to number of languages, or do you have a more granular split (eg. if you've got one component, it will have dozens of XLIFF files for every language, instead of one per language)?

By the numbers you mention, 70% of the files being translations makes the ratio of code files to translation files roughly 1:3, so unless you only support 3-5 languages, it's definitely not one XLIFF file per source file; I wonder at what granularity it is?

(My experience is mostly with localizations using GNU gettext tools, and you usually do a small-finite-number of PO files per project per language, where that small-finite number is exactly one for like 99% of projects)


It's one XLIFF file per locale per component, not including the source en_US. We currently support 104 locales.

More info: https://news.ycombinator.com/item?id=28931601


Thanks: that's still a lot of components (3000+) if you've got at least 300k xlf files!

Good job managing all that regardless of the approach :)


One of my biggest pet peeves: tracking generated files in version control.

The only exception is our generated OpenAPI spec, because we want people to be explicit about modifying the API, and have a CI task that verifies that the API and OpenAPI spec match.
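
In CI this can be as simple as regenerating the spec and failing the build on any difference (script and file names below are placeholders, not our actual setup):

  # regenerate the spec, then fail if the committed copy has drifted
  ./scripts/generate-openapi.sh > openapi.generated.yaml
  diff -u openapi.yaml openapi.generated.yaml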


One alternative that enables you to keep generated files out but still feel like there's an explicit human check in place is to add a gated confirmation step in CI to confirm that the changes to the generated spec match expectations.

Something like: "This change will result in the following new API endpoints: ... do you wish to continue?"


Hm... An interesting thought!

What does it compare against though? Do we need to add more state to the CI? We kinda like the interface being part of version control and having an audit chain that's part of the code.


Yeah either you'd have to maintain the generated spec as a versioned artifact hosted wherever is most convenient for you, or the CI could actually generate the before and after specs based on the PR diff. If the calculation of the spec is computationally expensive (it shouldn't be) then the latter approach could be a problem.


you could put it in the pre-commit githook
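
Something like this as .git/hooks/pre-commit, for example (a sketch only; the generator command is made up):

  #!/bin/sh
  # block the commit if the committed spec no longer matches the code
  ./scripts/generate-openapi.sh > /tmp/openapi.yaml || exit 1
  if ! diff -q openapi.yaml /tmp/openapi.yaml; then
    echo "openapi.yaml is out of date; regenerate it and add it to the commit"
    exit 1
  fi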


That's not enforced centrally (or visible to the reviewer), so it's hard to trust that everyone will remember to have it installed.

Though even without that, I'm not sure how it'll even mechanically work.


Remember back when people recommended committing node_modules into git?


Ah - that would explain why at my current job there was a node_modules directory in git with nearly 2 million lines of Javascript within.

It is gone now.


The days before lock files were a thing and 'it works on my machine!' was rampant.


Lock files protect you from the version changing out from under you, but modules disappearing from NPM is a thing that happens. Yes, you can use artifactory or similar as a proxy but that requires infrastructure that you may not want to run. That is all to say: there are situations where committing node_modules is the least evil.


... are we no longer doing 'works on my machine' ?


Ostensibly if it works in Alice's Docker instance, it will run in Bob's Docker instance too.


Well... unless some devs have M1 Macs, and some of the Docker layers are not available for arm, or the other way around, not available for amd64. Gives interesting issues.


Except for weird Docker edge cases (extremely rare, but does happen).


Not rare at all.

Docker is a congregation of technologies held together with duct tape and glue.

Eg. permissions handling is completely different on Macs with Docker Desktop from the Linux dockerd stuff: on Macs, it automatically translates user ownership for any "mounted" local storage (like your code repository), whereas on Linux user IDs and host system permissions are preserved. Have some developers use Macs and others use Linux, and attempt to do proper permissions setup (otherwise known as "don't run as root"), and you are looking for some fun debugging sessions.


> Docker is a congregation of technologies held together with duct tape and glue.

No, it's not. What a wild conclusion to reach from the example you gave.


At companies that don't check in node_modules or build folders, and that are using standard packaging tooling like maven or yarn or npm or what-have-you? Yes, I haven't experienced that in like 15 years.


Ugh

The price of letting less experienced people "go crazy" in the repo


Npm didn't support lockfiles until version 5, released in 2017, Yarn had them at launch in 2016. Before that committing node_modules was often used as a form of vendoring, to get reproducible builds.

If a new project these days commits node_modules to git, it's likely a mistake, but for legacy projects started before 2017 it was the lesser of two evils.

Edit: spelling.


Hm, this project was started in 2017. The node_modules directory was for Serverless (a tool written in Javascript), not the website itself (which was written in AngularJS - probably not the best choice in 2017 either).


s/was often used as a/was the only practical/

Prior to lock files (and potentially after, as checked-in files are beyond trivial to modify and review and that can be worthwhile) committing dependencies in some form was basically the only reasonable way to have reproducible builds, unless you wanted to build your own package manager / lock file implementation.

Which is what Yarn did.


Or sane people wanting to have some cheap, low effort way to track changes in their project's dependencies.


Based on how brittle GitHub Actions is, I'd be ready to commit node_modules, except that I'm building cross-platform software with native dependencies.


`npm rebuild` should rebuild native code in your committed node_modules


Pretty sure that recommendation came from a Git hosting service that charged by the megabyte.


Not everyone has the skills to build the toolset and use it. My brother called last night to help him change some SASS variables in a bootstrap theme. He’s a data scientist and had no idea how to build bootstrap’s js and apply the new variables. If bootstrap came from npm fully built, over half of his problems he called me about (15 times!) would have been avoided.


  > it takes 10 seconds to run git status
People coming from the SVN world do not think that this is unusual or problematic. And unfortunately even recently I've seen SVN still in use at large legacy companies.


I don't think it's unfortunate. We use subversion for development in our team and it does everything we need it to. We looked into git and didn't find it offered any features that would significantly improve our process, but found 1000 more ways to shoot ourselves in the foot

For many processes I think SVN is (and has been for many, many years) an absolutely fine method of version control.


  > didn't find it offered any features that would significantly improve our process, but found 1000 more ways to shoot ourselves in the foot
You're not wrong about this.

I really like git for the cheap branching, which encourages branching and merging often. But SVN might have cheap branching now, as another commenter implies.


From my experience SVN isn’t significantly slower than git.


My experience is that anything dealing with a branch, especially but not exclusively creating branches, is very slow in SVN for a repo of any real size, basically anything with a framework.

I do not remember if "stat" was particularly slow, but SVN in general is slow.


Huh. 10 gigabyte svn repo at work spanning about 40 projects. Creating branches is virtually instantaneous. It's just a copy, which is a free operation (just a link). Curious as to why it would be slow for you.

  svn cp https://server/svn/trunk/project/ https://server/svn/branches/project/ticket -m "making a branch here"

svn status, even for an entire repo checkout (which is not common) is also fast.

And yeah, it has the virtue of simplicity as well as doing very well at narrow and shallow checkouts, even though I'd love to have mercurial's feature set.

It's also rather good in the "wiki" situation since people can operate on their single files without needing to update, sync and merge.

https://www.bitquabit.com/post/unorthodocs-abandon-your-dvcs...

A fun rant, even though git has gotten better-ish at large files.


  > Creating branches is virtually instantaneous. It's just a copy which is a free operation (just a link).
Copy is not a "free" operation, but a symlink is close to "free" if you're measuring disk space.

What version of SVN are you using? I'm certain that older SVN versions would actually copy the entire project's files, not symlinks but real copies. That would take forever and running out of disk space was a real concern.


An svn copy is just a link. It has always been a "free" operation (and yes, the analogy would be to a symlink). I'm not aware of any version of svn that behaved the way you described and I've used it for a couple of decades.

I can perhaps imagine a large repo plus a broken svn client requiring checking out unneeded portions of trees to do a copy, but no client I've used works like that.

Hm. Another theory. Perhaps someone who knew nothing about svn and was using TortoiseSVN's Windows file manager integration was doing a Windows file manager copy, then checking that in as a "branch" with the only link being the commit message instead of using svn's copy which is free and properly links content. That would indeed be an expensive operation, and the wrong thing to do.


Yeah, branches are slow. Otoh blame is fast compared to git.


These days mercurial has a cached blame/annotate called "fast annotate", which I love because of one particular awesome feature, --deleted, which must be seen to be appreciated I think.

I have this alias in my .hgrc file:

  fad = fastannotate -u -n -wbB --deleted

It's by the Facebook engineer Jun Wu who also made the even more awesome "absorb"


What is wrong with SVN?


> Given that 70% of their repo is generated files, that discussion and the tradeoffs involved don't get nearly enough attention from OP.

It's perfectly fine to use Git to track things other than sourcecode. In fact, right on the manpage, Git calls itself "the stupid content tracker".

I've been using Git with git-annex to track archival files with their associated metadata. We keep our data separate from our sourcecode, and segment our data into individual Git repositories for each collection. Git gives us many features that we would have had to build into our app in other ways (data integrity, fixity, etc), though this came with costs.
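
The day-to-day workflow is pretty simple, something like this (collection paths are just examples):

  git annex add collections/2022/    # large files become annexed pointers in git
  git commit -m "Add 2022 accessions"
  git annex sync --content           # propagate metadata and file contents to other remotes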

To my eye it probably would have been better for Canva to use multiple separate repositories instead of a monorepo, but I'm not them and their use-case is not mine.


I think it is a crowd-dumbing effect. Since hundreds of engineers share the monorepo, no one can or cares to make the decision, or is able to push for an alternative. Even when everyone is complaining, that is still far from everyone agreeing on the alternative. The crowd settles at the lowest common denominator.


Hey everyone, author here, the article is a bit misleading in that .xlf files aren't really generated files, they're created through our translation pipeline by real humans. I considered them generated in the sense that they're not directly worked on by engineers who have to deal with them in the repository.

The content of these translation files is a snapshot in time aligned with the text in our product, so by simply removing them we would lose all the changes made to translations each time texts are changed.

Sorry about that, hope this clears it up!

Edit: for more information on translation and xlf files, we have another blog post all about them https://canvatechblog.com/how-to-design-in-every-language-at...


Can't those xlf files be stored in a separate repo or in an object storage and let the build system fetch them from there?


That wouldn't be called a "monorepo" then :D

Obviously, the problems they are solving (and admitting to solving) are due to their dedication to the monorepo.

With all the effort spent on working around the drawbacks, I really wonder what advantages they are seeing that make it worth their while?


Well, there is monorepo and monorepo. Git was primarily made to host code, not necessarily artifacts. I would categorize those files as artifacts, and in my opinion it would still be a code monorepo if everything else lived in a single repo.


They are code files managed by a different set of people (translators), not generated artifacts (as they explain elsewhere in this discussion).

If we are being pedantic, git was not designed to host multiple projects in a single repo (otherwise, git would have been a subdirectory in the kernel tree). But tools are made without knowing how they'll be used, and that's ok, so I wouldn't stress on what the purpose for monorepo was, but how it's used and what value it brings.


>They are code files managed by a different set of people (translators), not generated artifacts (as they explain elsewhere in this discussion

Translations do not look like code to me. Rather human generated artifacts.

What I understand is that there is a strong tie/match needed between versions of these translation files and the code itself, so I believe this is where having everything in the same repo would make sense, having the translators update those files when code has been modified...


I think it's relatively normal to include build artifacts in a monorepo when you don't want users to have to be able to build every single one. Especially if you don't want to have to buy a license for every developer for specialist software that only needs to be used by a few people.


Given the religious nature of the "monorepo" vs. "not-monorepo" argument, I would venture a guess that suggesting storing these xlf files elsewhere would upset someone's notion of what a "monorepo" Should Properly Be.


Having them under version control seems very important. Monorepo makes that easy


Thanks for the additional info. I'm happy to hear that localization is a priority for you.

> English source strings used in our frontend code live in Typescript files (.messages.ts). Source strings used in our backend code live in Java files (Message.java). Our internationalization (i18n) pipeline converts this into a series of XLIFF files (ending in .xlf), with one file for each locale. All these files live in the repository, but the translated .xlf files should never be modified by hand since they are updated automatically when strings get translated.

Isn't it cache?


Not quite. The .messages.ts and Message.java files are only in English; those are then converted into XLIFF files for other languages. The XLIFF files are the source of truth for what the string should be in different locales, so they're not a cache.


If it is output of some other tool, then it is a build artifact and there is no need to check it in.

I have this argument with people all the time and the conclusion is always like: "it is too hard to integrate the generator with the build system so we check them in".

The big problem with generated files is merge conflicts. How do you resolve a merge conflict on generated files, especially if they are binary?


What's the value of a monorepo if developers only ever check out a small subset of it? Wouldn't multiple repos allow greater scale without any practical reduction in utility?

For example, all the localisation files could live in a separate project (if we accept the need to commit them at all). Some tools would be needed to deal with the inevitable problem that developer working sets would not align with project boundaries, but that seems like an easier job than making git scale while maintaining response times.


One of the benefits of monorepo is refactoring. You can just apply a renaming command across all of the files in the solution and all the related names are properly updated.

Not that easy to get this to work on multiple repos.
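
As a crude sketch of what that looks like in a single checkout (real setups usually go through an IDE or language-aware tooling, and the names here are invented):

  # rename a symbol everywhere it appears, in one commit (GNU sed assumed)
  git grep -lw OldWidgetName | xargs sed -i 's/\bOldWidgetName\b/NewWidgetName/g'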


I think it's important to understand how often is the OP's company benefiting from this: is it worth the trouble they've gone through through the years? I've checked the rest of engineering posts, and none of them talk of the benefits of using a monorepo explicitly.

Monorepos, like basically all solutions, solve some problems and introduce new ones which you didn't have before. Which drawbacks are more acceptable depends on each individual case.


All developer activities related to code – refactoring, editing, building etc all happen in a developer's workspace. A workspace can be composed of multiple git repo checkouts. An infrequent activity like renaming a lot of files can be done with minor inconveniences even if they are spread across the workspace in different repos.

Only the code that is closely related – read/modified/built together frequently – should live in the same repo. If two pieces of code that don't have much to do with each other (that is, they are not read/modified/built by a developer in a single developer workflow frequently) live in the same repo, then they are just being a burden to the overall development lifecycle of devs who work on those disjoint sets of code.

The unrelated code in the same repo is a distraction to the developers who checkout that code as it costs storage space, iops, cpu cycles and network bandwidth to lug that code around, load/index in IDEs, track changes, build and discard dependency graphs by build/dependency systems etc. Then, to deal with these issues more complexities are incurred. Instead, it is better to optimise for the common case and deal with the complexity only for the rare cases.


How would you do refactoring over monorepo if you have sparse checkout?


CI/CD builds the entire project, so if you make a breaking change in a library the build will fail.


Sure, and what next? So considering that on big projects CI/CD may take up to an hour (in one of my projects it took 4 hours) the feedback loop would be great


> on big projects CI/CD may take up to an hour

No, it may not. Perhaps occasionally it does. That is a bug that you must fix - a pipeline that takes even 30 minutes is horrifically slow.


One of the nice things about using Bazel is caching builds, avoiding rebuilding parts of the monorepo that are completely unaffected by someone's changes.
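
Concretely that's just pointing builds at a shared cache; the flags are standard Bazel options, the endpoint is made up:

  # reuse action outputs across machines and CI runs
  bazel build //... --disk_cache="$HOME/.cache/bazel-disk" --remote_cache=grpcs://bazel-cache.internal:443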


Sure, how did we ever manage to rename something without monorepos. Oh wait, maybe that's what this "versioning" thing is all about.


Right, it's "I have to send 5 PRs to 5 different repos, get them all cross merged, and then at the end it's wrong anyways so I have to start all over".

Multirepo management is extremely frustrating compared to "it's all in the same folder".


How many scenarios are there where the rename both matters (beyond taste and philosophy) and is across interface boundaries?

Surely if it is an advantage to rename once in a ginormous, single code base there must also be leaky abstractions, poorly defined interfaces, god objects, etc present at the same time?

Whenever I find I need to rename anything across domains, it's a matter of updating the "core" repository and then just pulling the newest version.


Monorepo is not necessarily synced deployment, and even if it was, each deployment of a single component is usually racy with itself (as long as you're deploying to at least two nodes).

Which means that you've got to do independent backwards-compatible changes anyway, and that for anything remotely complex, you are better off with separate branches (and PR/MRs) anyway.

Monorepos mostly have a benefit for trivial changes across all components (eg. we've decided to rename our "Shop" to "Shoppe"), where with multiple repos it doesn't really take much to explain, but it is a lot of tedious work to get all the PRs up and such.


I think that when you have large enough systems that works. I do not believe that "microservice" is the right size for repo splits.

Sometimes you have to ship a feature. Shipping that requires changing 3 parts of your app. A lot of times that _entire_ set of changes is less than 100 lines of code.

Having a full vision of what is being accomplished across your system in one go is very helpful for reviewing code! It justifies the need for changes, makes it easier to propose alternatives, and makes the go/no-go operation much more straightforward.

At a smaller scale, you often see the idea of splitting frontend and backend into separate repos. Of course you can ship an API and then ship the changes to the frontend. But for a lot of trivial stuff, just having both lets you actually see API usage in practice.

I think this is much more applicable for companies under 100 people though. When you get super large you're going to put into place a superstructure that will cause these splits anyways.


TBH, I am not a fan of frontend/backend split either: ideally, you'd still be splitting per component/service, so frontend and backend for a single thing could live in the same place: you get the benefit of seeing the API in use with each PR, without the costs of monorepo otherwise.

Most projects start out as monoliths (which is good) and splitting up on this axis is unfortunately very hard/costly.


This is why I'm a fan of tools like Bazel, where you can still get most of the tooling benefits from a single repo, but get testing speed benefits (and, if you roll that way, the design benefits from the separation) of a multirepo setup.

Unfortunately it's hard for me to recommend Bazel, it's such an uphill climb to get things working within that system.


What I took away from TFA is that monorepo management at this scale is “extremely frustrating” too.

ISTM that the complexity of managing any repo will be bounded by the size of that repo; a monorepo, being unbounded in size, will, in time, become arbitrarily complex to manage.

While a multirepo might occasionally require developers to apply changes to more than one repo at a time, I’ve never found this to be much more than a minor inconvenience; one that could be solved readily with simple tooling, if we had ever felt that the “problem” was even worth solving.


At my $dailyjob we (kinda unfortunately) went with tons of repositories and libraries upon libraries, and the only sane way for me to make changes across multiple repos is combining them into single build locally. In .NET it's not that complex - remove a Nuget dependency from your project, and add reference to locally checked-out repository and make sure you're using proper tags. It's mundane, happens to be frustrating, but I can make it work.
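
It usually boils down to two commands per dependency (package and project names here are placeholders):

  # swap the NuGet dependency for the locally checked-out source
  dotnet remove package Acme.Core
  dotnet add reference ../acme-core/src/Acme.Core/Acme.Core.csproj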


The process you're describing looks like some trial and error PRs...

Multirepo also allows you to roll out that change incrementally instead of big banging all the time.


Well for trivial changes it's even worse, cuz instead of "change 3 files across this boundary" it's "send 2 sets of changes to different places, babysit them until merging, then send a third PR in the integration point to use the updated version and then get it merged".

Meanwhile reviewers don't have context about changes, so it's easier to get lost in the weeds.

It's not always this, of course. But I think that way too many tools are based on "repo" being the largest element, so things like cross-repo review are just miserable.


But in the monorepo you almost never can do the change in a single commit as it will cause incompatibilities during gradual deployment


Canva engineer here: we do compatibility checking of interservice contracts (Proto) to ensure that gradual deploys are always safe and can always be safely rolled back.
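
(For anyone wanting to set up something similar: one off-the-shelf way to run this kind of check for Protobuf is buf's breaking-change detection, shown here purely as an illustration rather than a description of our internal tooling.)

  # fail CI if the current .proto files break compatibility with main
  buf breaking --against '.git#branch=main'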


Google does such a thing in its monorepo quite routinely.


It takes X*N amount of work to merge change X across N repos. 1 repo just takes X.

Then there's version management. Do all your repos use the same versioning scheme? "They should", but in the real world, they sometimes don't. Whereas if you only have 1 repo, you are guaranteed 1 versioning scheme, and 1 version for everything.

How do you know which version of what correlates to what else? With N repos, do you maintain a DAG which maps every version of every repo to every other repo, so when you revert a change from 1 repo, you can go back in history and revert all the other repos to their versions from the same time? Most people do not, so reverting a change leads to regressions. With a monorepo, there is only one version of everything and everything is in lock-step with everything else, so you can either revert a single change, or do an entire rollback of everything, with 1 command.

How do you deploy changes? If each repo has an independent deployment process (if your repos even have a deployment process that isn't just waiting for Phil to do something from his laptop), are you going to deploy each one at a time, or all at once? What if one of them fails? How do you find out when they've all passed and deployed successfully? Pull up 5 different CI results in your browser every couple hours, and when one fails, go ask that team to fix something? If you only have 1 repo, there is 1 deploy process (though different jobs) and merging triggers exactly what needs to happen in exactly the right order.

The reason people use multirepos is they don't want to build a fully automated CI/CD pipeline. They don't want to add tests and quality gates, they don't want to set up a deployment system that can handle all the code. They just want to keep their own snowflake repo of code and deal with everything via manual toil. But at scale (not "Google scale", but just "We have 6 different teams working on one product" scale) it becomes incredibly wasteful to have all these divergent processes and not enough automation. Multirepo wastes time, adds complexity, and introduces errors.


> What's the value of a monorepo if developers only ever check out a small subset of it?

There's a couple of different views of this:

If the subsets are overlapping then the monorepo has had great value. Let's say you've got modules A, B, C and D. Dev 1 is interested in A and B, Dev 2 is interested in B and C, etc. In a multirepo world you have to draw a line somewhere and if someone has concerns overlapping that line then they're going to have to play the 'updating two projects with versioning' game.

The other way of looking at it is "data model" vs "presentation". Too often with git we confuse the two. Sparse checkout is a way of presenting the relevant subset to each user. It is nice to be able to consider that separately from whether we want to store all that data together.
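
With modern git that presentation layer is just sparse checkout (directory names here are made up):

  # check out only the modules you actually work on
  git sparse-checkout init --cone
  git sparse-checkout set modules/editor modules/shared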


> For example, all the localisation files could live in a separate project (if we accept the need to commit them at all).

That's the wrong way to split files: it's as if you said let's split a monorepo so all the .sh files are in one repo, all the Makefiles are in another, all the .py files in yet another...

What you want is to split into "natural" repositories instead. Having 50 or 150 localisation files in an otherwise 40-file repo is not a big deal for anyone. Of course, how the split happens would have an outsized influence on the ergonomics.

Also note that localisation files are tightly linked to source code (the way they use them, similar to GNU gettext model, though they do use XLIFF): you put English strings in the source code, and when you change them (reword, fix typos, or outright change them), all translations need to get their English version updated and translations potentially needing updates marked as such. In short, they are managing their translations as source code (even if translators would be using translation tools akin to IDEs for development).


If you can check out your monorepo as if it's multiple repos, but then also check it out as a monorepo when you want, that seems to me to have more utility than splitting into multiple repos, where you can never check it out as a monorepo.


In a world where submodules worked (side note: We use PlasticSCM which has xlinks [0] which are substantially better than submodules, but Plastic itself has its own set of problems), you could have each "subrepo" as an independent repo, and then have a monorepo comprised entirely of submodules.

If submodules worked.

[0] https://www.plasticscm.com/documentation/xlinks/plastic-scm-...


> without any practical reduction in utility?

There is a massive loss in dependency management if you move to multiple repos.

Do polyrepo build systems exist that give you the same capabilities as bazel? Particularly with regard to querying your dependency graph.


Atomic linearisable updates to your code.


The point is that developers can check out bigger subsets if they need.


Pretty interesting.... I don't know that much about git, but still fun to read. I guess the main takeaway is don't put all your eggs in one basket? Although it kinda seems like they are going to stick with the monorepo, ("Here’s how we solve them at Canva")

Also, I looked up .xlf files and I still don't understand. It's xml, that part makes sense, but it's basically a config file? To tell what process to read which files?

Also, I've heard of Canva, but had no idea they were this big/ubiquitous/whatever.... and learning about pseudo localization is interesting too. And the graph for lines of code looks pretty exponential, maybe it's common up to a point, but if it continues at that rate, it will be infinite by about 2026 (okay, I just made that number up, but you get the idea)


.xlf isn't a config file format here. It refers to XLIFF, a standard for translation files. https://en.wikipedia.org/wiki/XLIFF


I looked at the Wikipedia article and a couple other pages, I guess the key word is "localization", which basically seems to mean language. I'm pretty sure I knew and forgot that.

The comments in this thread from the devs have been informative as well.


Putting my rusty sysadmin hat on, I would wager a guess that if it keeps on growing exponentially something will break. For example, hitting some hard limit like maximum number of files possible in a file system. Seems like you’re playing with fire to me.


I would think the same... and from reading the comments in this thread, it seems like a pretty spicy topic.


>just under 60 million lines of code in 2022.

I'm always surprised at the fact that almost every product has more lines of code than the entire Linux repo. The scale of these products is astounding.


From the rest of the article, it sounds like a big chunk of these lines are from generated files. What I don't understand is why they're checking in generated files into Git.


I've been switching a lot of generated files to being checked in (with CI verifying they haven't drifted from the source). The primary motivation has been performance. For example, in Rust code, it means I don't need to foist the code-gen process and all the dependencies needed for it on dependent crates. I've seen this play out similarly in other build systems and circumstances. The key is the data needs to be independent of other factors (like the system doing the generation) and the rate of change of the code-gen source and generator has to be relatively low.
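
The drift check itself is a couple of lines of CI; the generator invocation below is specific to my setup, so treat it as a placeholder:

  # re-run code-gen, then fail if the committed output no longer matches
  cargo run -p xtask -- codegen
  git diff --exit-code -- src/generated/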


They're using git as a cache. Having generated files stored there means they're available if they're needed (eg in CI) without needing further access controls, they're versioned, and it's a simple and understandable strategy. As the article states, most devs are set up to ignore those files so they're not much of a source of the slowness. It's a common pattern for apps that have to serve lots of different locales.


I don't think it's a good idea to store a cache in Git. Any file remains in the repository forever once committed, and the local/remote repository becomes unnecessarily big.


The bigger issues for me are it makes history impossible to read (every change is hidden in an avalanche of crap), merges are a mess (you definitely want to spend forever merging autogened files, right?), PR reviews are annoying, etc.


Depends how much generated stuff is there. We have our GraphQL schema in git even though it's auto generated via a library. But it's useful in PRs to see exactly how the schema changed as a result of the root change.


Yeah if there's not a lot of it, and if it's easy to regenerate, it can be fine.


You may want to set the '-diff' attribute for these files so that git will not show diffs for these, instead it will show 'Binary files differ'.

There is also '-merge', which will cause git to not attempt to merge the contents, but just ask you to pick a side.

The challenge however is then verifying the contents of these files in things like merge requests.

See https://gitirc.eu/gitattributes.html
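
For files like the .xlf translations in the article, that's a one-line .gitattributes entry (pattern chosen just for this example):

  # treat translation files as opaque for diffing and merging
  echo '*.xlf -diff -merge' >> .gitattributes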


It's still surprising to have any generated things there. E.g. you could make the same case for keeping built binaries in Git as well.

Is there a reason why that type of file couldn't be better placed in an artifact repository, or just generated and consumed in CI as part of generating a final build output?


> It's still surprising to have any generated things there. E.g. you could make the same case for keeping built binaries in Git as well.

This is not surprising at all. In fact, it's quite standard to commit string translations. Just because you can run the code generation/string replacement step as part of the build that does not mean it's a good idea to generate everything from scratch at every single build.

String translations hardly change once they are introduced, running the build step takes significant amounts of time, and if anything fails then your product can break in critical and hard to notice ways.


I'm not saying don't have the translations at all. I'm saying: 1) caching things in git in general is a bad idea; why is it not in this case? 2) these are not - to my understanding - the raw resource files, but rather machine-generated intermediate files. This is why it's about caching, rather than minimal source files.

Additionally, to respond to your comment, if string translations don't change much then it may be possible to push them out as an internal 3rd-party library, and then they're even quicker to build.


> I'm not saying don't have the translations at all. I'm saying: 1) caching things in git in general is a bad idea (...)

You're missing the point. Storing translated files is caching things in git, and it is not a bad idea. It's a standard practice that saves your neck.

You either place faith on a build step working deterministically when it was not designed to work like that, or you track your generated files in your version control system.

If you decide to put faith on your ability to run deterministic builds with a potentially non-deterministic system, you waste minutes with each build regenerating files that you could very well have checked out and in the process risk sneaking in hard to track bugs. Then you need to have internationalization test steps for each localization running as part of your integration tests to verify if your build worked, which consume even more resources.

Or... you stash them in git?

You use git to track changes, regardless of where they came from. Just because you place faith in some build step to always work deterministically that does not mean you are following a good practice and everyone else around you is wrong.


> You either place faith on a build step working deterministically when it was not designed to work like that

I'm sorry, what? Why would a build not work deterministically?

> If you decide to put faith on your ability to run deterministic builds with a potentially non-deterministic system

If your build is non-deterministic, how can you have any faith in the binaries it produces? You would have much larger problems in that case.

> You use git to track changes, regardless of where they came from

You probably don't want to do that if it is 70% of your codebase and slows down all your developers' git.

> Then you need to have internationalization test steps for each localization running as part of your integration tests to verify if your build worked

I'm convinced you've never used a build system before. Your build should fail if required files are missing. Downloading translation files at build time from some artefact repository vs storing them in git is how a lot of companies do it.


> I'm sorry, what? Why would a build not work deterministically?

Because they don't and never did?

Do you understand build systems and individual tools were not designed to ensure deterministic behavior?

https://reproducible-builds.org/docs/deterministic-build-sys...

Anyone with any professional experience developing software can tell you countless war stories involving bugs that popped up when building the exact same project separate times. What leads you to believe that translations are any different? In fact, more often than not we see unexpected changes during translation update steps.

> If your build is non-deterministic, how can you have any faith in the binaries it produces?

First of all, all builds are not deterministic by default.

To start to come close to get a deterministic build, you need to do all your own legwork after doing all your homework.

Did you ever do any sort of this work? You didn't, did you? You're not looking, and are instead just placing blind faith in stuff continuing to work by coincidence, aren't you?

> You probably don't want to do that (...)

Yes, I do. Anyone with their head on their shoulders wants to do that. It's either that or waste time tracking bugs that you allowed to go to production. Do you want to waste your time hunting down easily avoidable and hard to track bugs? Most of the professional world doesn't.


It is definitely possible to have determinism in a CI build step, and it's possible to have checks for it. If one needs determinism and a cache, they can store the files on S3 or some other place instead of git. Re-generating the files every time on the build isn't the only alternative. Instead of generate-and-commit, generate and upload. The difficulty is the same for developers.

If one has to be more granular than that, and have versioning and verification against the repository, they can still store the multiple versions on another service and store the hashes on git. Even though I'm not a fan of this for translation (especially if you have lots of languages/lots of strings), since there's an advantage of decoupling the translation process from the development process.

The problem with storing those files on git is that it can cause more problems, including developer experience issues.

It depends on how much you're storing on git. Some CSS files? Fine. 70% of files of the project, like in this case, slowing down everyone's workflow? Definitely not.


> Just because you place faith in some build step to always work deterministically that does not mean you are following a good practice and everyone else around you is wrong.

You're also doing that everywhere else. How do you think anything works? Why do you think Git is deterministic somehow? Why more so than including some files in a build?


Just an example, I had the non-deterministic case using JAXB to generate java classes from XSD Schema files. Running an ANT jaxb task to generate the classes from the same schema files would generate different class files each time. The class files were functionally the same, however it would reorder methods, the order of the variable definitions etc. Possibly due to some internal code using a Map vs List, so order was not guaranteed. In our case the schema files were in Source Control, the Java/Class files were not, the Java/Class files were generated by the build, packaged to a jar and published to our artifact repository.


> Is there a reason why that type of file couldn't be better placed in an artifact repository, or just generated and consumed in CI as part of generating a final build output?

No reason at all, but when you need the files during development, and testing, and CI, and in production, and you don't want those things to fail when your artefact repo or source of data is down, then putting the latest versions in git makes sense.

The cost of having them in the repo is a tiny bit more complexity in your git workflow and config. The benefit is being able to access those files everywhere you access the code. It seems like a no-brainer to me.


> place into an artifact repository

This adds yet another moving part to the system, and another place things can go wrong.

> generated and consumed in CI as part of generating a final build output

This can get quite slow, and on larger projects you have to expend a lot of effort to keep build times reasonable.

Also, if you're serving a library for public consumption, you generally don't want to add the burden of extra build steps for the user to follow before they can use it. If it can all be automated to the point of invisibility to the user that's fine, but often it can't.



Sorry I've been a bit misleading. These xlf files aren't generated, they're just not interacted with by engineers but they're still created and edited by humans as translations. We want to keep track of them so that if we deploy a different commit, the texts and translations in other languages will match


Reading the article they are not generated files, but files that are never touched by developers. Translators will work with those files. I expect that for translators they have a different sparse checkout that only fetches .xlf files for their target languages.


xlf files are usually generated. They're XML. No one wants to write that by hand.


Also plain text files are usually generated. They are arrays of 1s and 0s. No one wants to write that by hand.

I think that this is not a sensible definition of a generated file. A more sensible definition is that a generated file is created automatically from some source, which is not user input (i.e. an other file). This means generated files do not need to be kept under git, as long as their source is checked in.

Translation files, even if they are not created with a plain text editor but with some other tool that handles the XML layer, are clearly not generated, as long as the translation is done by a human.


For the app I work at the moment we use https://lokalise.com/. We add translation strings to a SaaS app, and then the translation team translate them. I've written a build tool that downloads the translation JSON files from the API using the CLI, or as part of our CI process. Other teams have tools that download their language packs for different iOS and Android apps. The translations are versioned in Lokalise and we using a branching strategy to manage the work. Lokalise has an option to generate xlf files (and JSON, xliff, arb, etc).

This is a very typical workflow. Most people are not out there modifying xlf files by opening them in a text editor. For a start, translations usually aren't done by developers.

(Huge shoutout to Lokalise btw. I can highly recommend it. It makes building a multi-lingual app across different platforms so much easier.)


This still doesn't make them generated files!

You opted to keep translation files out of version control; you could also keep images there, or source files. All this stuff is the (pretty direct) output of non-deterministic human intervention.

(BTW, how do you build an old version of your application? Is lokalise able to give you the appropriate translations for a specific git commit / app version?)


.docx files are archives of xml-files. No one wants to write that by hand.

Or, in more words: The format of the files is just the representation on disk - it’s not directly connected to how the files are generated or edited. XML files can be written by hand with suitable editor support.


I think we have different ideas of what "written by hand" means. Someone making a Word doc is not writing XML by hand.


As a counter-point, package managers generate “lock files” that are designed to be tracked in VC.

For these translation files, I’d imagine there may be occasional work to modify them even after they are initially generated.


Yes, the article glosses over that a little bit, and it's an unusual decision.


We check in some generated CSS files that are generated by an external theme CLI, just to be sure that after a version update we can track all changes in the generated CSS files.


If they're generated you can just re-generate from source every time you need to track changes.

You're using git as a cache. You don't need to version a cache.


Regenerating certain things might be fast, but some might not be. Hundreds of engineers pushing code and having to wait for these to be regenerated both locally and on CI means that caching is quite cheap after all.


Note that these files are not statically generated; they are translation files, generated by translators.


The natural tendency in almost any software is to keep adding and adding, while rarely throwing anything out. More features, more code, more supported platforms, more supported languages, more this, more that, more, more, more.

If it was a physical product, you couldn't keep making it bigger and more complex ad infinitum, because making a physical thing bigger takes more material, and bounded physical resources would be consumed. With software, it's all just bits, and computers can hold a lot of bits.

This leads to bigger problems than just git running slowly.


Canva is absolutely crushing the established design market with their product right now so I'd say having all those features is pretty important for a good tool these days.


As well as the generated files the sibling mentioned, I wonder how much is due to vendoring.


> git status takes 10 seconds on average

> running these commands multiple times a day reduces the total productive time engineers have every day

I love the attention paid to this. Often opportunities to prioritise seemingly small efficiency gains are neglected.

At 10 seconds per command, an engineer that uses git status 50 times per day spends ~10 minutes per day waiting; an entire work week per year!! Well above the threshold warranting optimisation, and that doesn't even factor in distractions and context switching.


It’s actually even worse than that I think. If something takes over a certain amount of time, then I’m more likely to go do something else while I wait, like check Hackernews. And there goes 20 minutes.


I created a `beep` alias that just does `echo "\x07\x07\x07"` which triggers the system notification sound three times.

Then if I have a command that will take a while, like a stupidly long `git status`, I do `git status && beep`.


I'm hoping at least you acknowledge this is _your_ problem, rather than a tooling problem or the like. You just can't expect everything to give you immediate feedback after a couple seconds.

This is something that's really a degradation of the newer generations of engineers, since I clearly remember the time where these "somethings" would never take less than a couple minutes, and people did not immediately flee to their nearest distraction, but actually planned their time around it. In fact, if you go further back, these "somethings" would have taken hours, and the older generations still got work done.


That's a lot of drama, existential angst, and nostalgia around the equivalent of "sometimes during the commercials I go to the bathroom, and when I do I sometimes miss a minute of the show because I don't get back in time."


I used to work at a place where git status would take 2 minutes (or more). You just stop using it and rely more on your memory, or run in it parallel while you continue working on something else, etc.


I'm glad I'm not the only one that thinks that way. Even ~3s every git status is driving me insane now.


> "...we found that .xlf files made up almost 70% of the total number of files. These .xlf files are generated... they are never manually edited by engineers..."

First thought is: why not zip/tar away all of these "convenience" files per generation, and add a line to the build/install script to unpack them after checkout?

Additionally, add .xlf files to .gitignore to exclude them from the untracked list.

No one cares to diff them as long as their contents are consistent with the checkout. Text compresses quite efficiently, so this should not introduce any unreasonable build/install delays.


I made another comment as well, though the tl;dr is these xlf files are translations tied to texts in code, so we can't simply ignore them from the repository. The changes have to be kept so that if we, say, revert to a certain commit, all the translations match with the texts of headers, buttons, etc...


>...The changes have to be kept so that if we say revert to a certain commit, all the translations match with the texts of headers, buttons, etc...

The mentioned .zip file is to be kept in the repo. Instead of whatever number of individual .xlf files per generation, these would get zipped together (say, 'assets/xlf.zip') before the commit and the resulting .zip added to the commit.

Similarly, when reverting or on a checkout, it's the .zip that gets checked out and then the .xlf files are unpacked.

The packing/unpacking could be done by the same process that handles the .xlf generation (??build).

Also this may be automated by git-hooks, though it's more natural to handle the packing of assets during the build stage.
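
Roughly (paths made up, and assuming *.xlf is in .gitignore as suggested above):

  # pack, run by whatever step produces/updates the .xlf files
  find . -name '*.xlf' -not -path './assets/*' | tar czf assets/xlf.tar.gz -T -
  # unpack, run by the build script or a post-checkout hook
  tar xzf assets/xlf.tar.gz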


Could these translations be moved to another repository? Maybe they could be published as a separate NPM package that devs could install if they needed to look at the translations?


Then you get the annoying problem of having to push an update to that repo and wait for the new version before you can merge changes into the new repo which use the new version. It’s tightly coupled, so they should be co-located.


The entire workflow could remain the same while tar/zipping goes on in the background...


I appreciate this post. It's nice to see that there are other teams that feel some of the pain points of git (and unsurprising that most of the responses are "you're holding it wrong"). The fact is that git doesn't scale to _very large_ repos. We've seen it time and time again, but there isn't really a great alternative. Perforce is.... Perforce (centralized, very expensive to license, branches are incredibly expensive, and streams still feel like a band aid even years and years later). PlasticSCM (which we use at work) is fine, but closed source, mildly expensive, and has a terrible UX.


Yeah, I think choosing git is fine, and choosing to have a monorepo is fine, but you probably don't want to do both.

Or, at the very least, once your git monorepo reaches a certain size, you should either split it up, or switch to something that handles monorepos better. Even if that something is e.g. a virtual filesystem on top of git.


Just saying that Perforce "is Perforce" is pretty lazy. Perforce is great in my experience. It scales up to repo sizes you are unlikely to ever reach and has a UX that makes actual sense.


In my defence, I elaborated in parentheses immediately after saying it. It's _very_ centralised and online only. There's no concept of any local work/branches so all prototypes are checked in (or in my experience people manually shelve things and juggle shelves around).

Perforce's branches are a disaster and should have been deprecated years ago and replaced with streams a decade ago. Branches are _incredibly_ slow, they're straight up copies, and they are pretty much isolated from each other. Streams are an improvement, but still are very primitive compared to git's branches - the change tracking across streams is poor, and the enforced hierarchy has too many escape hatches that can make a gigantic mess. Streams are loosely enforced with views which can't be customised per workspace, a major regression from branches. In practice, every team I've worked on has had a "convert merge to edit" style action to fix perforce's messed up idea of a merge. Stream switching is also dog slow (on my last project, it was quicker to delete the 150GB workspace, and re-sync than have perforce actually figure out what had changed).

Perforce is eye-wateringly expensive, and very difficult to license - licenses are 4 figures per seat per year for medium-sized businesses (and close to 4 figures for small companies), and maintaining a p4 server is genuine work. The hosted offerings of Perforce (Assembla only - https://get.assembla.com/pricing/) are very lackluster, and very limited (no triggers unless you pay the "contact us" pricing).

My experience in large teams was that running perforce at scale involves not using some/many of the features, and that actually keeping it running well pretty much requires an active support contract from perforce.

All that said, P4V is by far the best GUI client for any VCS, it _handles_ binary files, and p4 sync's performance makes git clone feel like you're working on a 56k modem. It's just that when you want to do anything other than submit or sync, the wheels fall off the track.


I used Perforce for a decade and did not create a branch, not even once. People who want to branch in Perforce are often trying to bring a git mindset to a different tool. With a trunk-based edit/sync/submit workflow where you have different p4 clients for your different projects (what you would use various branches for in git) you need not branch.


> I used Perforce for a decade and did not create a branch, not even once.

That's because P4's branches are a pale imitation of what branches can be (and to be fair to P4, branches long predate git, so they can't exactly up and change the behaviour). However, there's really no excuse for what they did with streams: they bolted a loosely enforced hierarchy onto the existing branch system, created a split in the tooling, and shipped something that has as many footguns as problems it solves.

> People who want to branch in Perforce are often trying to bring a git mindset to a different tool. With a trunk-based edit/sync/submit workflow where you have different p4 clients for your different projects (what you would use various branches for in git) you need not branch.

"often" is a very nebulous adjective, and a naive view of what git branches do. Perforce and a task branch based workflow is a terrible idea, yes. If you want to do the PR based flow that github and gitlab encourage, you're going to have a bad time. P4's shelves are an excellent tool, but they encourage ad-hoc and self managed version control. Shelves to bring changes across streams (if you're using them), Shelves to share a WIP or a quick change with someone else, and iterating back and forth with shelves with v1 v2 v3 etc in them, shelves for temporary debugging code/non prod features/work in progress feature that's ticking along in the background.

Again, I'm not saying perforce has no place, but git's branches are a force multiplier, even with trunk-based development.


There was pijul, which allowed partial pulls IIRC. I haven't used it, so I can't really recommend it.


I tried pijul [0] last time this conversation came up and despite its claims it is not fast.

[0] https://news.ycombinator.com/item?id=29992875


Just seeing this now. If you look at the conversation, this was fixed within a few hours.


But that's what those responses are all about: git doesn't scale, so don't use it if you need scale. If there are no tools available that scale, then change your approach so you don't need to face this scaling problem at all.


Maybe at their scale it makes more sense to switch to a VCS like Eden? https://github.com/facebookexperimental/eden

Eden's equivalent of 'git status' should run almost instantaneously, as checkouts are hosted by a virtual file system (FUSE) that tracks changes.


I feel like I've read about several big companies using monorepos, but I've never understood why. It feels like the source-control equivalent of writing your code in one big file.

Does anyone have any good resources for why and how best to implement a monorepo?


You kind of have to experience the worst of both worlds to understand where and how each method works and breaks down.

With multiple repos it's harder for teams to share code and collaborate. Each team has a repo that becomes a little fiefdom where they are oblivious to who is using their code and how they're using it. Suddenly they'll push out what they think is an innocuous refactor and inadvertently break core functionality other teams took a dependency on for better or worse.

So what happens is the team with a dependency now copies the old code into their repo and takes on all the extra burden of maintaining this old version, trying to backport fixes, etc. It becomes an enormous mess and time sink. No one ever has time to go back and fix things, and when they eventually are forced to do so it costs more time and effort than it would have taken to do it right from the start. You'll also run into horrible versioning problems where you're stuck on old version X but depend on widget foo, which needs current version Y of that dependency.

You might say, well, bad on that team: they should have engaged the product managers, made sure their dependencies and usage were well tracked with them, been looped into the process of changes, etc. But in the real world, when your boss says feature X needs to be shipped in a few days, all of that process goes out the window.


This is valid. But I think it would make sense to challenge this, and even (try to) leave, if product is dictating to tech what to do with their repos... especially if there's a boss figure pushing imaginary deadlines without talking to the team beforehand. There's a job crisis now, I know, but millions of files were not checked in during 2021 alone. We should do better when it comes to organizing how we work.


good resources:

1) https://trunkbaseddevelopment.com/monorepos/

2) and (but I don't know it as well) https://monorepo.tools/


It’s mostly about avoiding code and work duplication. At scale, the waste from duplicate work across teams can be massive (think about setting up CI tooling, for example). A monorepo lets you solve tooling/build problems once and for all. The main drawback is the scalability of the tools involved, like git.


> The main drawback is scalability of the tools involved like git.

And if you can employ enough engineers to break git, you can probably afford a team to work on scaling git.


Git starts to break at 200-300 engineers pushing into it. Scaling git would take 200 more :)


Everything described in the post can be done by one engineer.


Here are a few reasons that I've heard and experienced

1. Standardized tooling

2. Fewer issues related to dependencies

3. Hermetic tests

4. Reduced code duplication and easier code sharing


Did you consider VFS For Git and Scalar from Microsoft? What was the result? https://github.com/microsoft/VFSForGit

https://github.com/microsoft/scalar


VFS for Git is deprecated, although something incorporating its approach is likely the best way to implement a real monorepo, which AFAIK doesn't exist outside of proprietary implementations used inside certain big tech companies.


Did anyone use scalar from Microsoft?

https://github.com/microsoft/scalar


As mentioned in the blog, "we are moving towards providing a known version of git" - probably Microsoft Git, which includes the scalar command but also upstreams many of the optimizations to git core: https://github.com/microsoft/scalar#why-did-scalar-move

disclaimer: canva staff working on source control


We use scalar at Microsoft. Working with the massive scale of some repos would be near impossible with "vanilla" Git.
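For anyone curious, a rough sketch of how Scalar is typically used (the repo URL and path here are placeholders); scalar clone sets up a partial, sparse clone with background maintenance enabled:

  # blobless partial clone with sparse-checkout and scheduled maintenance
  scalar clone https://example.com/big/monorepo.git

  # or enroll an existing clone so it gets the same config and maintenance
  scalar register /path/to/existing/clone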


I don't understand the logic of combining microservices with a monorepo. The whole point of microservices is that you don't care what is beyond the external contract of the service. Who cares how each individual team decides to name their stuff. Why do microservice teams need to care about having every single service checked out? Who or what would be bulk applying changes to all services? This is madness.


Suppose you want to deprecate an API you wrote in favor of something else for $valid_reasons.

In a monorepo with the right tooling, I can make a branch where I delete the API and get pretty immediate feedback as to every module I have broken. From there I can update all the call sites and I also know which teams/engineers I should give a heads-up to.
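As a concrete (if hypothetical) example of what that tooling can look like: another commenter notes Canva uses Bazel, and in a Bazel monorepo the reverse-dependency lookup is a one-liner (the target names below are made up):

  # every target that transitively depends on the API target
  bazel query 'rdeps(//..., //libs/payments:api)'

  # build/test just that affected set to see what actually breaks
  bazel build $(bazel query 'rdeps(//..., //libs/payments:api)')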

In a multi-repo world, this is much more difficult. Even learning what all the reverse dependencies are could be a challenge. Most likely, other teams have to pull in my changes on their own schedule, and other teams have very different incentives than my own. The cost of mistakes is therefore higher because they are more difficult to undo.

Monorepos are very important if you want to empower engineers to achieve broad changes across the org. Some people fundamentally disagree with this methodology. And a fair number of people just don’t care enough - marking the old thing deprecated, calling it a day, and letting it rot for eternity is good enough for them. These are the same people who you’ll find saying “not my job” a lot, in my experience.


The perceived ease of that operation in a monorepo is what makes it dangerous. Unless these changes all map to a single monolithic service, then even though you have updated all call sites, those callers will not all be deployed at the same time, meaning that as the change rolls out you may see random breakage. With a polyrepo, the deployment boundary can align with the code boundary, making the rollout problem obvious.


People point this out often, but in practice I have never seen this cause issues. Remove the callers first, then remove the endpoint... it's pretty obvious the order in which things need to be done.

What I have seen as a real problem, time and time again, is having trouble locating all usages of an API in a multi-repo scenario.

Anyone who fucks this up in a monorepo will probably fuck it up worse with multiple repos.


I think the tricky issue is updating the structure or semantics of an existing call across services. A monorepo makes it easy to make these kinds of changes in code, while making it non-obvious that they are dangerous to deploy; it is a giant foot-gun. Examples along this line include updating the name of an RPC or endpoint, changing the request/response structure, and so on. Of course you could argue that "you just shouldn't", with which I agree, but the point is that making those kinds of changes should actually be really hard, instead of really easy.


Ah yes, this is definitely a challenge. At my day job we use protobuf and follow the (imo well documented and well evangelized) best practice of forbidding breaking changes to protos, so I almost forgot about this class of problem. At least for changes to structure. Changes to semantics can still happen but I don’t know that I’ve ever seen anyone cause a major issue while keeping structures compatible.

We have escape hatches, which I mainly use when deleting code.


Microservices in a monorepo is the least friction path towards a distributed ball of mud.

Whether this is a pattern or an anti-pattern depends on whether you want a single engineer to be able to change the entire architecture to “just ship it”, or whether you value conceptual integrity more.



Yeah, git-meta is a reasonable alternative to megamonorepos. It's basically a repo-mega-forest.

A few things:

  - your build system will need more code
    to deal with a repo-mega-forest than
    a megamonorepo
  
  - code indexers may not be able to see
    cross-repo dependencies
etc.

You'll have pain no matter what. My preference would be for a megamonorepo approach to scale properly. That means partial/sparse cloning, as well as shallow cloning, and also all the hacks that are supposed to make git-status and git-log (and git-blame, and...) fast in partial clones.
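Roughly what those hacks look like with stock git today; the repo URL and directory names are placeholders, and which knobs actually help will depend on the repo:

  # blobless partial clone: full history, file contents fetched on demand
  git clone --filter=blob:none https://example.com/big/monorepo.git
  cd monorepo

  # cone-mode sparse-checkout: only materialise the directories you work on
  git sparse-checkout init --cone
  git sparse-checkout set web/editor shared/i18n

  # untracked cache and the filesystem monitor speed up git status in big worktrees
  git config core.untrackedCache true
  git config core.fsmonitor true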


> These .xlf files are generated and contain translated strings for each locale

... and they make up 70% of their repo

Why would you include generated files in a repo?

Do they take too long to remake?

[EDIT]: especially given the fact that they're using Bazel, which is supposed to be the bee's knees of build systems?


The article uses the term generated, but a more precise term could be "produced by translators". The files can't be simply regenerated. We've recently published [1] a separate post that explains the localisation process in more detail.

[1]: https://news.ycombinator.com/item?id=28931601


I get a headache imagining everything at Google in a monorepo. So how do you reconcile this single Git repository approach with Infoworld's "The case against monorepos"? [1]:

Reason #1. Monorepos go against single-team ownership principles

Reason #2. Monorepos encourage bad practices involving massive refactoring

Reason #3: Small repositories are better than large ones

[1] https://www.infoworld.com/article/3638860/the-case-against-m...


With regard to 3:

> In Google’s case, more than 45,000 changes are made to its monorepo every day. This code management becomes an exponential problem in overhead as the number of developers of an application grows, and the number of components within the application expands.

Speaking from experience, the overhead of code management is far less at Google than any other place I've worked at, even on projects with just a couple hundred lines of code. If this author is imagining a monorepo means each engineer having to constantly check out terabytes of code to make any change and race with other developers to merge to HEAD, they don't really understand the landscape well enough to write an article criticizing monorepos (which certainly have real cons, particularly with the source control tools that are available to most companies).

Of course, you need actual tooling and infrastructure for that beyond what git offers.


One person's STOP is another's silo

One person's massive refactoring is another's tech debt reduction

One person's multirepo is another's inability to find that broken bit of upstream code


I think the biggest problem here is that git (out of the box) is not well suited to a monorepo.

But in terms of the reasons in "The case against monorepos":

> Reason #1. Monorepos go against single-team ownership principles

I think it's up for debate whether or not single-team ownership is desirable. But even if it is, I don't see the difference. Just have teams own their directories within the monorepo.

If you want to enforce that a team owns a particular part of the repo, just put some rules into the code review tool to ensure that a change to that component can't be merged without someone from that team reviewing/approving it.
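GitHub's CODEOWNERS convention is one common way to wire that up (GitLab and Gerrit have equivalents); the paths and team names below are invented for illustration:

  # .github/CODEOWNERS - combined with branch protection, changes under these
  # paths require a review from the owning team
  /payments/    @acme/payments-team
  /editor/      @acme/editor-team
  /i18n/*.xlf   @acme/localisation-team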

> Reason #2. Monorepos encourage bad practices involving massive refactoring

Again, this just appears to be the author's opinion that massive refactoring is somehow problematic, without providing much of an argument against it. If you're changing service APIs, then sure, you need to be careful about the order in which you roll out changes. But if you're changing the APIs of shared libraries, then being able to do a single large refactoring is absolutely valuable.

> Reason #3: Small repositories are better than large ones

This is just a tooling issue. You can still separate projects by directories, and if your tools allow you to just check out a subdirectory, or if you have some sort of virtual filesystem on top of your source control, then you don't have to pay the penalty of pulling down the entire repository.


> > Reason #3: Small repositories are better than large ones

> This is just a tooling issue. You can still separate projects by directories, and if your tools allow you to just check out a subdirectory, ...

Even a forest of repos has tooling issues: your build, code review, and code indexing tools will need to support the forest, and that's a lot of work. It's probably comparable to the work needed to make a monorepo work.

The alternative to the monorepo isn't all rosy.


These engineering blogs are intended to promote the engineering culture of the company for recruitment purposes, and, as a corollary, are an opportunity for the engineers to self-promote and boost their resume/promotion-worthiness internally.

The sentiment on this thread, if it is indicative of the greater talent pool, suggests this blog post is having the complete opposite effect.

I remember when Uber was proud of their thousands of repos. Here it’s the 60 million lines of code. It’s not just red flags, but seems like stuff that might get leaked to Programming Horrors / WTFs.


I checked out their careers page after reading this.


I wonder why these generated files are not part of a separate distributed, HA, cloud-managed document store. Seems like a perfect use case for it. This looks like git abuse to me.


They made a bad design decision 10 years ago, have been fighting the fallout for years, and will be doing so forever and ever because things will only ever grow.

They wrote a blog post on how clever they think all their workarounds are, at least one of which involves sparse-checkout -- which is perilously close to chopping up your monorepo into several, while still pretending monorepo is fine.

I feel like somebody's job and/or ego is heavily invested in keeping things as they are, even if it demonstrably does not scale to their needs, and the solution is blindingly obvious to even a casual observer.

That is institutional insanity.


So every company ever? I love these comments, as if hindsight is not 20/20 and it would have been so simple for you to have made all of the right engineering decisions as an armchair CTO. Give me a break. Git repos get out of control, bad decisions are made, and this is an interesting solution and write-up. I'm so over the whole “just make perfect decisions all the time and you would never need X” comments. Guess what? Everybody makes mistakes.


This isn't just a matter of hindsight. They don't even have that, as demonstrated by the fact they're putting ever more work into their workaround instead of fixing the glaringly obvious problem: it's a monorepo with half a million files growing ever bigger.

Certainly I wouldn't have decided to put all that junk into a single git repo from the start, since I'm not an idiot. But, even if I did, or I inherited something like that, I would fix it, not double down as they're doing.


What would be a better structure?


In a word, SQLite. You can always put your files in a file. ("Yo dawg, I heard you like files so I...")

Depending on your use cases a zip file can work, e.g. Python packages can be imported from within a zip file, and the standard lib is distributed that way.


monorepo until it becomes too big -> then, splitting it into 2-3 repos, until each one also gets too big to manage...


The problem is not the repository, it’s having the translation data in the repo.

We have the same issue at $dayjob: the repo is quite large, and 80% of it is translation data. Even though they compress ridiculously well (99% last I checked), the number of translation files and the number of exports make them the vast majority of the cost.

However that structure remains a convenient nuisance, and more importantly removing them would really only be useful if we rewrote the entire repository, which breaks all working copies.

There’s been a task in a wishlist for years now, but the business incentive just isn’t there.

Edit: actually the translation files are 80% of the working copy; they’re closer to 90% of the repo, and on the far side.


But isn't it sort of correct to have translation data, which is text data and for which there might be bug fixes, in the repo?


Which is why it commonly is there, but at the same time the translation data is kinda independent: you can run the software with none or only part of the translations; as far as the software is concerned, it’s usually close to configuration data or assets.

Not to mention translations often have their own lifecycle, e.g. it’s common to find translatable strings which are untranslated or incorrectly translated, and to want to ship updates independently of the software’s.

As such keeping the translations outside the source is also perfectly defensible.


Yes


> The problem is not the repository, it’s having the translation data in the repo.

Does this data change as often as the code does? If not then get it out of the repo.


What if it changes more often than the code? Throw out the code?


There's no material difference between taking the translations out of the repo and taking the code out of the repo. "She shouldn't divorce him, he should divorce her!"


Yeah, this sounds more like a pile of technical debt attached to a competent business.


Yet another example of Additive Bias: https://brainlenses.substack.com/p/additive-bias


The Linux kernel has about 30 million lines of code.

Windows 10 has about 60 million.

I doubt that any other software project has as many.

How would you even end up with 500K source files in the repo?


> to just under 60 million lines of code in 2022

I'd be somewhat intimidated on my first day of work if I cloned a repo with that much sloc...


git status scales with the number of files in the repo. Ask HN - Are there any common git operations that scale with the number of commits?


Repository maintenance operations like "git gc" scale with the total number of objects in the repo.
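If you want a feel for those numbers on your own repo, a couple of stock commands give a rough picture (output will obviously vary):

  # loose and packed object counts plus on-disk size
  git count-objects -v -H

  # number of commits reachable from HEAD
  git rev-list --count HEAD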


I used to work at a company where our git repository would get so polluted with change-related junk that we had to call (many times literally call) Bitbucket and ask them to run a cleanup on the remote. Otherwise, fetch/pull and many operations on the remote would take forever, or would often even fail completely due to timeouts or lack of RAM.

It's Thursday and you wanna release a hotfix - but nope - all builds are failing because git can't cope with junk and your CI can't proceed. You then have to call BB instead and ask them to clean up, hoping they'll do it within the same day. Ah, exciting times.


git log ;)


I believe git log is limited by the number of lines on the screen...


It may operate in a streaming fashion if it's going to a pager, but naw it'll output however much you want.

Technically it has modes which need to scan all commits in the current branch as well, like the ones that grep for certain changes.


I've had millions of files in git using git-annex. It kind of works if you're very patient, but it's definitely not something I'd recommend. I wish there were a solution, such as git-annex, that worked well with millions of files.


If you don't have a couple hundred engineers who would write a custom file system for git (Microsoft), or who would take an existing source control system and nearly fully rewrite it (Facebook), or who would write a custom source control system from scratch (Google, Yandex, etc.), DON'T USE A MONOREPO.

Otherwise you risk ending up in a situation where hundreds of your engineers have to spend tens of minutes every time they simply need to push their code changes.


Google did not need to write their own SCM to reach the scale mentioned in the article. Off-the-shelf Perforce served that company well enough when it had tens of thousands of engineers and millions of files. They had 11 million files in a monorepo with one 256GB machine which most of you would consider a bare bones desktop computer at this point.

https://www.perforce.com/sites/default/files/still-all-one-s...


I mean, if you have hundreds of engineers, that might be sound advice; if you have 20, you’ll probably be fine.


Of course, you shouldn't start a project with 10 repos. But the downside is that while you live in a single repo, all the tooling gets tuned to work with a single repo. And then there is no way to go multi-repo when you need to.

So advice would be: start small but think big :)


There is no point in making any argument with someone who has the conviction of their religion behind them. Followers of the monorepo religion are no different.


Somehow, it seems so backwards. We make our repositories more accessible to automated tools and less accessible to people, and instead of investing coding time into better tools, we invest it into local workarounds on every single developer's machine.

This is probably not a general conclusion about monorepos, but the notion of "let's put auto-generated stuff into version control" sounds like the point in time where it all went awry. Yes, there is ClearCase, where this is the way to do it; Perforce probably does this too; and there are areas where you are supposed to archive your generated binaries with your code and scripts for later audits. But I'd argue most people shouldn't be doing this, since there are ways to separate source and release artifacts once and for all, in a way that makes both bots and people happy, without locally-brewed scripts for checking out code.


>across half a million files

>we found that .xlf files made up almost 70% of the total number of files

If one xlf file is used to keep the translations for one language, how the heck do you have so many then? Makes no sense.

If you have many xlf files for 1 language, may God save your soul.


One xlf per language _per component_.

5 components, 5 languages, would be 25 xlf files.


I don't think I've ever seen "because your mono-repo is too big for git" used as an argument for micro-services, but maybe at this point it's valid.


You can put microservices in a monorepo. You just put them in their own folder.


Of course, but can you split a monolith across multiple repos?


Sure. It'd drastically complicate the build process and CI for little benefit compared to other approaches (e.g. sparse checkout), but you could even in principle create a repo for each code directory and stitch them all together.


>It'd drastically complicate the build process and CI

This is a part of most monorepos. I figure having only a single build artifact is rare.


Yes. With git submodules.
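A rough sketch of that stitching (URLs and paths made up); the parent repo only records a commit pointer per submodule, so you'd still need tooling to keep them in sync:

  # in the top-level "monolith" repo
  git submodule add https://example.com/acme/billing.git services/billing
  git submodule add https://example.com/acme/editor.git services/editor
  git commit -m "Add component repos as submodules"

  # fresh checkouts need the submodules pulled explicitly
  git clone --recurse-submodules https://example.com/acme/monolith.git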


Canva use microservices FWIW


This is funny; I stopped reading when it started talking about how engineer X will never touch section Y of the code. OK, then put them in separate repos.


How is that any better than telling git to ignore them? Especially when they're tightly coupled translations.


Every time I read a monorepo story my takeaway is that it's a really bad idea. Segregation of responsibility is a good thing.


the easy solution is to put node_modules in your .gitignore :D


Just don’t run git gc

Runs forever even on small repos


The fact that Canva has a `Source Control team` with at least 5 people on it (going by the thanks at the bottom of the article) means they should probably try a different approach.

I think it's a cool company, with a good product, but they're WAY too small to be having a "source control team" on staff. That's at least $1.2MM a year in salary/benefits cost.


It's not exactly a great name for the team, but it gets the point across - we handle developer experience from the point you want to push your code to when you merge it, so we also work on communicating with CI/CD, code review/ownership tooling, GitOps bots, etc. Git performance is a big issue right now, but there's no end in sight to all the other scaling problems we could work on, so if it stops making sense, we can work on something else :)

disclaimer: canva staff working on (for now) source control


I am curious why git submodules couldn't be used here, and perhaps scripted to match particular workflows.


That's fair, and I think you're doing great work. I enjoyed the article


Author here. Not everyone on the team works on git and git performance. Our team works on a variety of things that touch "code" in general, like our ownership system and tooling, access control, bots and automation, etc.


Sorry, there is a quote from the article: "Kudos to other members of the Source Control team — Alex Sadleir, Wesley Li, Adam Murray, Matthew Chhoeu — who work on improving git performance at Canva"

So I assumed 5 people were working specifically on Git performance ;-P


Heh, I can see how this can be misinterpreted :P

That's the entire Source Control team (of 5 people) today. Among other things, they work on improving Git performance ;)


Thanks for the clarification. How many people work on Developer Productivity in total, and what percentage of the total engineering headcount is that?


They have “hundreds of developers.” It seems completely reasonable for them to have a team focused on DX. It’s one of my most common recommendations for rapidly-growing companies, which are often focused on features at the expense of productivity. A DX team is one of the lowest-hanging fruits for fixing the problems that causes.

(Improved testing practices is often second, but that’s not a low-hanging fruit. It’s a big, juicy fruit that’s waaaaay up in the canopy. And then comes joint product decision-making within the team, which may as well be on another planet entirely.)


Canva is worth around $40 billion; it's not really a small company.


Yeah... also once such a team exists, they are naturally going to resist ideas like not having a monorepo but just using git submodules, not keeping the history of every generated file forever etc.

Because then they won't need a 5-man team to handle the fallout any more.


That bit surprised me also. I don't have much experience with companies of this size, so I have no real idea; how many people would you expect to work on source control at a company of this size? I had a look at LinkedIn; they have 4,345 employees.


A good proportion to start with is usually considered to be 10% of engineers working on Developer Productivity, and it should increase to 15% as the company grows. But regarding source control, it really depends on the approach: if you are on a polyrepo, then 0 is enough. If you decide to write your own source control system like Facebook or Google, then maybe a couple of hundred engineers full time until the project is finished :)


Really weird. But maybe this team is in a country where developers cost 100x less? (: Still weird though.



