We Put Half a Million Files in One Git Repository, Here’s What We Learned (canvatechblog.com)
222 points by Maxious on June 16, 2022 | 257 comments



It took me a while to find out what Canva actually does, but from https://www.canva.com it appears they are an online design / collab tool.

To be fair, that could mean a _lot_ of functionality and code providing a rich, SPA, JS heavy experience. Modern JS frameworks aren't exactly known for being concise.

But still, 60 million lines of code and half a million files is a sure sign that someone said at one point "sure, we can throw all these generated files into git!". Didn't someone on the team say "hey, it takes 10 seconds to run git status, can we move this junk out and do this another way??"

Given that 70% of their repo is generated files, that discussion and the tradeoffs involved don't get nearly enough attention from OP.


> Didn't someone on the team say "hey, it takes 10 seconds to run git status, can we move this junk out and do this another way??"

Why do you assume they didn't?

Just because they arrived at a different conclusion than you doesn't mean they didn't think about it. It might very well mean you did not consider the tradeoffs they had to take into account, mainly because you're out of the loop.


One reason for doing this is that someone might be using a very small set of the monorepo, so having everything prebuilt means that in general things are very fast, without constantly having to rebuild chunks of your system.

This is, I think, very common with stuff like browsers, where you have artifact checkouts that basically include built stuff since otherwise you're sitting there compiling forever all the time.

Stuff like Bazel in theory helps with this, but tools that help with this are either super idiosyncratic about how they work (meaning hard to adopt) or outright don't work.

I mean personally I would find that pretty annoying and unclean but I like DAGs.


They’re not assuming anything, but stating that the post itself does not cover this well/at all.


The post itself covers in the intro that they made the conscious decision to go with a monorepo, accepting the downsides of it. Much more than this article though, I'd like to read one discussing that decision and why they went that direction.


Canva gives users without design tools expertise the ability to make fairly polished looking graphics with a super easy and intuitive interface. (As a designer, I can assure you that polished looking is not the same thing as designed.) It’s a very popular service, so they’re dealing with huge scale. Intuitive interfaces often come with complex mechanisms and lots of assets, and they have clients on every major mobile and desktop platform, and the web. They also do a ton of heavy graphics processing that is likely done in lower-level languages than the interface.

The likelihood of a 2000 employee software company simply not considering that they could streamline their build process is pretty slim.


They have obviously invested a lot over time into streamlining their build process: so much so that they're publishing an article about it.

All of the problems they are having are basically due to their use of a monorepo: they do explain that they made the decision early, but I wonder what advantages over multiple repos they are seeing that made all this trouble worth it?


We can talk about the advantages of monorepos, but your question is phrased in a way that makes me think that you don't see any "trouble" in multiple repositories.

I would encourage you to do some research and keep an open mind.


I see "trouble" in all approaches: their solutions to manage their git monorepo approach it by switching to multi-repo emulation.


They seem to have purposely left that out of the scope of this article. There are myriad articles about the benefits and downsides of monorepos vs multiple repos.


In another sibling comment I mentioned how I looked through their engineering blog and didn't see any post where they have talked about any of the benefits they are enjoying due to their use of monorepo.

The question is specifically about their usecase, since not everybody would hit the same bottlenecks as they did with git monorepos.

Iow, have they stopped and thought whether it's still worth it (eg how often do their engineers make use of the monorepo benefits like cross-project refactorings)?


Don't rely on this anecdotal heuristic. Have a look at some enterprises. My experience: "How to solve scaling issues?" - "Automation? No, another team." ;)


This is not a 15k employee enterprise riveting features onto a codebase from the 90s— it’s a <10yo company that makes one product. Big difference in approach.


hey hey author here, xlf files are translations that are coupled with the texts we set in the code, so they're not really generated; I admit that was misleading. What I wanted to get across is that they're not touched directly by engineers, but they're still created through our translation pipeline where real humans translate them.


How do you split your XLIFF files? Does each project get one big one and the proliferation is simply due to number of languages, or do you have a more granular split (eg. if you've got one component, it will have dozens of XLIFF files for every language, instead of one per language)?

By the numbers you mention, 70% of the files being translations makes the ratio of code files to translation files roughly 1:3, so unless you only support 3-5 languages, it's definitely not one XLIFF file per source file; I wonder at what granularity it is?

(My experience is mostly with localizations using GNU gettext tools, and you usually do a small-finite-number of PO files per project per language, where that small-finite number is exactly one for like 99% of projects)


It's one XLIFF file per locale per component, not including the source en_US. We currently support 104 locales.

More info: https://news.ycombinator.com/item?id=28931601


Thanks: that's still a lot of components (3000+) if you've got at least 300k xlf files!

Good job managing all that regardless of the approach :)


One of my biggest pet peeves: tracking generated files in version control.

The only exception is our generated OpenAPI spec, because we want people to be explicit about modifying the API, and have a CI task that verifies that the API and OpenAPI spec match.
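
In CI this can be as simple as regenerating the spec and failing the build on any difference (script and file names below are placeholders, not our actual setup):

  # regenerate the spec, then fail if the committed copy has drifted
  ./scripts/generate-openapi.sh > openapi.generated.yaml
  diff -u openapi.yaml openapi.generated.yaml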


One alternative that enables you to keep generated files out but still feel like there's an explicit human check in place is to add a gated confirmation step in CI to confirm that the changes to the generated spec match expectations.

Something like: "This change will result in the following new API endpoints: ... do you wish to continue?"


Hm... An interesting thought!

What does it compare against though? Do we need to add more state to the CI? We kinda like the interface being part of version control and having an audit chain that's part of the code.


Yeah either you'd have to maintain the generated spec as a versioned artifact hosted wherever is most convenient for you, or the CI could actually generate the before and after specs based on the PR diff. If the calculation of the spec is computationally expensive (it shouldn't be) then the latter approach could be a problem.


you could put it in the pre-commit githook
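
Something like this as .git/hooks/pre-commit, for example (a sketch only; the generator command is made up):

  #!/bin/sh
  # block the commit if the committed spec no longer matches the code
  ./scripts/generate-openapi.sh > /tmp/openapi.yaml || exit 1
  if ! diff -q openapi.yaml /tmp/openapi.yaml; then
    echo "openapi.yaml is out of date; regenerate it and add it to the commit"
    exit 1
  fi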


That's not enforced centrally (or visible to the reviewer), so it's hard to trust that everyone will remember to have it installed.

Though even without that, I'm not sure how it'll even mechanically work.


Remember back when people recommended committing node_modules into git?


Ah - that would explain why at my current job there was a node_modules directory in git with nearly 2 million lines of Javascript within.

It is gone now.


The days before lock files were a thing and 'it works on my machine!' was rampant.


Lock files protect you from the version changing out from under you, but modules disappearing from NPM is a thing that happens. Yes, you can use artifactory or similar as a proxy but that requires infrastructure that you may not want to run. That is all to say: there are situations where committing node_modules is the least evil.


... are we no longer doing 'works on my machine' ?


Ostensibly if it works in Alice's Docker instance, it will run in Bob's Docker instance too.


Well... unless some devs have M1 Macs, and some of the Docker layers are not available for arm, or the other way around, not available for amd64. Gives interesting issues.


Except for weird Docker edge cases (extremely rare, but does happen).


Not rare at all.

Docker is a congregation of technologies held together with duct tape and glue.

Eg. permissions handling is completely different on Macs with Docker Desktop from the Linux dockerd stuff: on Macs, it automatically translates user ownership for any "mounted" local storage (like your code repository), whereas on Linux user IDs and host system permissions are preserved. Have some developers use Macs and others use Linux, and attempt to do proper permissions setup (otherwise known as "don't run as root"), and you are looking for some fun debugging sessions.


> Docker is a congregation of technologies held together with duct tape and glue.

No, it's not. What a wild conclusion to reach from the example you gave.


At companies that don't check in node_modules or build folders, and that are using standard packaging tooling like maven or yarn or npm or what-have-you? Yes, I haven't experienced that in like 15 years.


Ugh

The price of letting less experienced people "go crazy" in the repo


Npm didn't support lockfiles until version 5, released in 2017, Yarn had them at launch in 2016. Before that committing node_modules was often used as a form of vendoring, to get reproducible builds.

If a new project these days commits node_modules to git, it's likely a mistake, but for legacy projects started before 2017 it was the lesser of two evils.

Edit: spelling.


Hm, this project was started in 2017. The node_modules directory was for Serverless (a tool written in Javascript), not the website itself (which was written in AngularJS - probably not the best choice in 2017 either).


s/was often used as a/was the only practical/

Prior to lock files (and potentially after, as checked-in files are beyond trivial to modify and review and that can be worthwhile) committing dependencies in some form was basically the only reasonable way to have reproducible builds, unless you wanted to build your own package manager / lock file implementation.

Which is what Yarn did.


Or sane people wanting to have some cheap, low effort way to track changes in their project's dependencies.


Based on how brittle GitHub Actions is, I'd be ready to commit node_modules, except that I'm building cross-platform software with native dependencies.


`npm rebuild` should rebuild native code in your committed node_modules


Pretty sure that recommendation came from a Git hosting service that charged by the megabyte.


Not everyone has the skills to build the toolset and use it. My brother called last night to help him change some SASS variables in a bootstrap theme. He’s a data scientist and had no idea how to build bootstrap’s js and apply the new variables. If bootstrap came from npm fully built, over half of his problems he called me about (15 times!) would have been avoided.


  > it takes 10 seconds to run git status
People coming from the SVN world do not think that this is unusual or problematic. And unfortunately even recently I've seen SVN still in use at large legacy companies.


I don't think it's unfortunate. We use subversion for development in our team and it does everything we need it to. We looked into git and didn't find it offered any features that would significantly improve our process, but found 1000 more ways to shoot ourselves in the foot

For many processes I think SVN is (and has been for many, many years) an absolutely fine method of version control.


  > didn't find it offered any features that would significantly improve our process, but found 1000 more ways to shoot ourselves in the foot
You're not wrong about this.

I really like git for the cheap branching, which encourages branching and merging often. But SVN might have cheap branching now, as another commenter implies.


From my experience SVN isn’t significantly slower than git.


My experience is that anything dealing with a branch, especially but not exclusively creating branches, is very slow in SVN for a repo of any real size, basically anything with a framework.

I do not remember if "stat" was particularly slow, but SVN in general is slow.


Huh. 10 gigabyte svn repo at work spanning about 40 projects. Creating branches is virtually instantaneous. It's just a copy, which is a free operation (just a link). Curious as to why it would be slow for you.

  svn cp https://server/svn/trunk/project/ https://server/svn/branches/project/ticket -m "making a branch here"

svn status, even for an entire repo checkout (which is not common) is also fast.

And yeah, it has the virtue of simplicity as well as doing very well at narrow and shallow checkouts, even though I'd love to have mercurial's feature set.

It's also rather good in the "wiki" situation since people can operate on their single files without needing to update, sync and merge.

https://www.bitquabit.com/post/unorthodocs-abandon-your-dvcs...

A fun rant, even though git has gotten better-ish at large files.


  > Creating branches is virtually instantaneous. It's just a copy which is a free operation (just a link).
Copy is not a "free" operation, but a symlink is close to "free" if you're measuring disk space.

What version of SVN are you using? I'm certain that older SVN versions would actually copy the entire project's files, not symlinks but real copies. That would take forever and running out of disk space was a real concern.


An svn copy is just a link. It has always been a "free" operation (and yes, the analogy would be to a symlink). I'm not aware of any version of svn that behaved the way you described and I've used it for a couple of decades.

I can perhaps imagine a large repo plus a broken svn client requiring checking out unneeded portions of trees to do a copy, but no client I've used works like that.

Hm. Another theory. Perhaps someone who knew nothing about svn and was using TortoiseSVN's Windows file manager integration was doing a Windows file manager copy, then checking that in as a "branch" with the only link being the commit message instead of using svn's copy which is free and properly links content. That would indeed be an expensive operation, and the wrong thing to do.


Yeah, branches are slow. Otoh blame is fast compared to git.


These days mercurial has a cached blame/annotate called "fast annotate", which I love because of one particular awesome feature, --deleted, which must be seen to be appreciated I think.

I have this alias in my .hgrc file:

  fad = fastannotate -u -n -wbB --deleted

It's by the Facebook engineer Jun Wu who also made the even more awesome "absorb"


What is wrong with SVN?


> Given that 70% of their repo is generated files, that discussion and the tradeoffs involved don't get nearly enough attention from OP.

It's perfectly fine to use Git to track things other than sourcecode. In fact, right on the manpage, Git calls itself "the stupid content tracker".

I've been using Git with git-annex to track archival files with their associated metadata. We keep our data separate from our sourcecode, and segment our data into individual Git repositories for each collection. Git gives us many features that we would have had to build into our app in other ways (data integrity, fixity, etc), though this came with costs.
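
The day-to-day workflow is pretty simple, something like this (collection paths are just examples):

  git annex add collections/2022/    # large files become annexed pointers in git
  git commit -m "Add 2022 accessions"
  git annex sync --content           # propagate metadata and file contents to other remotes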

To my eye it probably would have been better for Canva to use multiple separate repositories instead of a monorepo, but I'm not them and their use-case is not mine.


I think it is a crowd-dumbing effect. Since hundreds of engineers share the monorepo, no one can or cares to make the decision, or is able to push for an alternative. Even when everyone is complaining, that is still far from everyone agreeing on the alternative. The crowd settles at the lowest common denominator.


Hey everyone, author here, the article is a bit misleading in that .xlf files aren't really generated files, they're created through our translation pipeline by real humans. I considered them generated in the sense that they're not directly worked on by engineers who have to deal with them in the repository.

The content of these translation files is a snapshot in time aligned with the text in our product, so by simply removing them we would lose all the changes made to translations each time texts are changed.

Sorry about that, hope this clears it up!

Edit: for more information on translation and xlf files, we have another blog post all about them https://canvatechblog.com/how-to-design-in-every-language-at...


Can't those xlf files be stored in a separate repo or in an object storage and let the build system fetch them from there?


That wouldn't be called a "monorepo" then :D

Obviously, the problems they are solving (and admitting to solving) are due to their dedication to the monorepo.

With all the effort spent on working around the drawbacks, I really wonder what advantages they are seeing that make it worth their while?


Well, there is monorepo and monorepo. Git was primarily made to host code, not necessarily artifacts. I would categorize those files as artifacts, and in my opinion it would still be a code monorepo if everything else lived in a single repo.


They are code files managed by a different set of people (translators), not generated artifacts (as they explain elsewhere in this discussion).

If we are being pedantic, git was not designed to host multiple projects in a single repo (otherwise, git would have been a subdirectory in the kernel tree). But tools are made without knowing how they'll be used, and that's ok, so I wouldn't stress on what the purpose for monorepo was, but how it's used and what value it brings.


>They are code files managed by a different set of people (translators), not generated artifacts (as they explain elsewhere in this discussion

Translations do not look like code to me. Rather human generated artifacts.

What I understand is that there is a strong tie/match needed between versions of these translation files and the code itself, so I believe this is where having everything in the same repo would make sense, having the translators update those files when code has been modified...


I think it's relatively normal to include build artifacts in a monorepo when you don't want users to have to be able to build every single one. Especially if you don't want to have to buy a license for every developer for specialist software that only needs to be used by a few people.


Given the religious nature of the "monorepo" vs. "not-monorepo" argument, I would venture a guess that suggesting storing these xlf files elsewhere would upset someone's notion of what a "monorepo" Should Properly Be.


Having them under version control seems very important. Monorepo makes that easy


Thanks for the additional info. I'm happy to hear that localization is a priority for you.

> English source strings used in our frontend code live in Typescript files (.messages.ts). Source strings used in our backend code live in Java files (Message.java). Our internationalization (i18n) pipeline converts this into a series of XLIFF files (ending in .xlf), with one file for each locale. All these files live in the repository, but the translated .xlf files should never be modified by hand since they are updated automatically when strings get translated.

Isn't it cache?


Not quite. The .messages.ts and Message.java files are only in English; those are then converted into XLIFF files for other languages. The XLIFF files are the source of truth for what the string should be in different locales, so they're not a cache.


If it is output of some other tool, then it is a build artifact and there is no need to check it in.

I have this argument with people all the time and the conclusion is always like: "it is too hard to integrate the generator with the build system so we check them in".

The big problem with generated files is merge conflicts. How do you resolve a merge conflict on generated files, especially if they are binary?


What's the value of a monorepo if developers only ever check out a small subset of it? Wouldn't multiple repos allow greater scale without any practical reduction in utility?

For example, all the localisation files could live in a separate project (if we accept the need to commit them at all). Some tools would be needed to deal with the inevitable problem that developer working sets would not align with project boundaries, but that seems like an easier job than making git scale while maintaining response times.


One of the benefits of monorepo is refactoring. You can just apply a renaming command across all of the files in the solution and all the related names are properly updated.

Not that easy to get this to work on multiple repos.
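
As a crude sketch of what that looks like in a single checkout (real setups usually go through an IDE or language-aware tooling, and the names here are invented):

  # rename a symbol everywhere it appears, in one commit (GNU sed assumed)
  git grep -lw OldWidgetName | xargs sed -i 's/\bOldWidgetName\b/NewWidgetName/g'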


I think it's important to understand how often is the OP's company benefiting from this: is it worth the trouble they've gone through through the years? I've checked the rest of engineering posts, and none of them talk of the benefits of using a monorepo explicitly.

Monorepos, like basically all solutions, solve some problems and introduce new ones which you didn't have before. Which drawbacks are more acceptable depends on each individual case.


All developer activities related to code – refactoring, editing, building etc all happen in a developer's workspace. A workspace can be composed of multiple git repo checkouts. An infrequent activity like renaming a lot of files can be done with minor inconveniences even if they are spread across the workspace in different repos.

Only the code that is closely related – read/modified/built together frequently – should live in the same repo. If two pieces of code that don't have much to do with each other (that is, they are not read/modified/built by a developer in a single developer workflow frequently) live in the same repo, then they are just being a burden to the overall development lifecycle of devs who work on those disjoint sets of code.

The unrelated code in the same repo is a distraction to the developers who checkout that code as it costs storage space, iops, cpu cycles and network bandwidth to lug that code around, load/index in IDEs, track changes, build and discard dependency graphs by build/dependency systems etc. Then, to deal with these issues more complexities are incurred. Instead, it is better to optimise for the common case and deal with the complexity only for the rare cases.


How would you do refactoring over monorepo if you have sparse checkout?


CI/CD builds the entire project, so if you make a breaking change in a library the build will fail.


Sure, and what next? So considering that on big projects CI/CD may take up to an hour (in one of my projects it took 4 hours) the feedback loop would be great


> on big projects CI/CD may take up to an hour

No, it may not. Perhaps occasionally it does. That is a bug that you must fix - a pipeline that takes even 30 minutes is horrifically slow.


One of the nice things about using Bazel is caching builds, avoiding rebuilding parts of the monorepo that are completely unaffected by someone's changes.
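
Concretely that's just pointing builds at a shared cache; the flags are standard Bazel options, the endpoint is made up:

  # reuse action outputs across machines and CI runs
  bazel build //... --disk_cache="$HOME/.cache/bazel-disk" --remote_cache=grpcs://bazel-cache.internal:443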


Sure, how did we ever manage to rename something without monorepos. Oh wait, maybe that's what this "versioning" thing is all about.


Right, it's "I have to send 5 PRs to 5 different repos, get them all cross merged, and then at the end it's wrong anyways so I have to start all over".

Multirepo management is extremely frustrating compared to "it's all in the same folder".


How many scenarios are there where the rename both matters (beyond taste and philosophy) and is across interface boundaries?

Surely if it is an advantage to rename once in a ginormous, single code base there must also be leaky abstractions, poorly defined interfaces, god objects, etc present at the same time?

Whenever I find I need to rename anything across domains, it's a matter of updating the "core" repository and then just pulling the newest version.


Monorepo is not necessarily synced deployment, and even if it was, each deployment of a single component is usually racy with itself (as long as you're deploying to at least two nodes).

Which means that you've got to do independent backwards-compatible changes anyway, and that for anything remotely complex, you are better off with separate branches (and PR/MRs) anyway.

Monorepos mostly have a benefit for trivial changes across all components (eg. we've decided to rename our "Shop" to "Shoppe"), where with multiple repos it doesn't really take much to explain, but it is a lot of tedious work to get all the PRs up and such.


I think that when you have large enough systems that works. I do not believe that "microservice" is the right size for repo splits.

Sometimes you have to ship a feature. Shipping that requires changing 3 parts of your app. A lot of times that _entire_ set of changes is less than 100 lines of code.

Having a full vision of what is being accomplished across your system in one go is very helpful for reviewing code! It justifies the need for changes, makes it easier to propose alternatives, and makes the go/no-go operation much more straightforward.

At a smaller scale, you often see the idea of splitting frontend and backend into separate repos. Of course you can ship an API and then ship the changes to the frontend. But for a lot of trivial stuff, just having both lets you actually see API usage in practice.

I think this is much more applicable for companies under 100 people though. When you get super large you're going to put into place a superstructure that will cause these splits anyways.


TBH, I am not a fan of frontend/backend split either: ideally, you'd still be splitting per component/service, so frontend and backend for a single thing could live in the same place: you get the benefit of seeing the API in use with each PR, without the costs of monorepo otherwise.

Most projects start out as monoliths (which is good) and splitting up on this axis is unfortunately very hard/costly.


This is why I'm a fan of tools like Bazel, where you can still get most of the tooling benefits from a single repo, but get testing speed benefits (and, if you roll that way, the design benefits from the separation) of a multirepo setup.

Unfortunately it's hard for me to recommend Bazel, it's such an uphill climb to get things working within that system.


What I took away from TFA is that monorepo management at this scale is “extremely frustrating” too.

ISTM that the complexity of managing any repo will be bounded by the size of that repo; a monorepo, being unbounded in size, will, in time, become arbitrarily complex to manage.

While a multirepo might occasionally require developers to apply changes to more than one repo at a time, I’ve never found this to be much more than a minor inconvenience; one that could be solved readily with simple tooling, if we had ever felt that the “problem” was even worth solving.


At my $dailyjob we (kinda unfortunately) went with tons of repositories and libraries upon libraries, and the only sane way for me to make changes across multiple repos is combining them into single build locally. In .NET it's not that complex - remove a Nuget dependency from your project, and add reference to locally checked-out repository and make sure you're using proper tags. It's mundane, happens to be frustrating, but I can make it work.
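
It usually boils down to two commands per dependency (package and project names here are placeholders):

  # swap the NuGet dependency for the locally checked-out source
  dotnet remove package Acme.Core
  dotnet add reference ../acme-core/src/Acme.Core/Acme.Core.csproj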


The process you're describing looks like some trial and error PRs...

Multirepo also allows you to roll out that change incrementally instead of big banging all the time.


Well for trivial changes it's even worse, cuz instead of "change 3 files across this boundary" it's "send 2 sets of changes to different places, babysit them until merging, then send a third PR in the integration point to use the updated version and then get it merged".

Meanwhile reviewers don't have context about changes, so it's easier to get lost in the weeds.

It's not always this, of course. But I think that way too many tools are based on "repo" being the largest element, so things like cross-repo review are just miserable.


But in the monorepo you almost never can do the change in a single commit as it will cause incompatibilities during gradual deployment


Canva engineer here: we do compatibility checking of interservice contracts (Proto) to ensure that gradual deploys are always safe and can always be safely rolled back.
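
(For anyone wanting to set up something similar: one off-the-shelf way to run this kind of check for Protobuf is buf's breaking-change detection, shown here purely as an illustration rather than a description of our internal tooling.)

  # fail CI if the current .proto files break compatibility with main
  buf breaking --against '.git#branch=main'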


Google does such a thing in its monorepo quite routinely.


It takes X*N amount of work to merge change X across N repos. 1 repo just takes X.

Then there's version management. Do all your repos use the same versioning scheme? "They should", but in the real world, they sometimes don't. Whereas if you only have 1 repo, you are guaranteed 1 versioning scheme, and 1 version for everything.

How do you know which version of what correlates to what else? With N repos, do you maintain a DAG which maps every version of every repo to every other repo, so when you revert a change from 1 repo, you can go back in history and revert all the other repos to their versions from the same time? Most people do not, so reverting a change leads to regressions. With a monorepo, there is only one version of everything and everything is in lock-step with everything else, so you can either revert a single change, or do an entire rollback of everything, with 1 command.

How do you deploy changes? If each repo has an independent deployment process (if your repos even have a deployment process that isn't just waiting for Phil to do something from his laptop), are you going to deploy each one at a time, or all at once? What if one of them fails? How do you find out when they've all passed and deployed successfully? Pull up 5 different CI results in your browser every couple hours, and when one fails, go ask that team to fix something? If you only have 1 repo, there is 1 deploy process (though different jobs) and merging triggers exactly what needs to happen in exactly the right order.

The reason people use multirepos is they don't want to build a fully automated CI/CD pipeline. They don't want to add tests and quality gates, they don't want to set up a deployment system that can handle all the code. They just want to keep their own snowflake repo of code and deal with everything via manual toil. But at scale (not "Google scale", but just "We have 6 different teams working on one product" scale) it becomes incredibly wasteful to have all these divergent processes and not enough automation. Multirepo wastes time, adds complexity, and introduces errors.


> What's the value of a monorepo if developers only ever check out a small subset of it?

There's a couple of different views of this:

If the subsets are overlapping then the monorepo has had great value. Let's say you've got modules A, B, C and D. Dev 1 is interested in A and B, Dev 2 is interested in B and C, etc. In a multirepo world you have to draw a line somewhere and if someone has concerns overlapping that line then they're going to have to play the 'updating two projects with versioning' game.

The other way of looking at it is "data model" vs "presentation". Too often with git we confuse the two. Sparse checkout is a way of presenting the relevant subset to each user. It is nice to be able to consider that separately from whether we want to store all that data together.
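
With modern git that presentation layer is just sparse checkout (directory names here are made up):

  # check out only the modules you actually work on
  git sparse-checkout init --cone
  git sparse-checkout set modules/editor modules/shared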


> For example, all the localisation files could live in a separate project (if we accept the need to commit them at all).

That's the wrong way to split files: it's as if you said let's split a monorepo so all the .sh files are in one repo, all the Makefiles are in another, all the .py files in yet another...

What you want is to split into "natural" repositories instead. Having 50 or 150 localisation files in an otherwise 40-file repo is not a big deal for anyone. Of course, how the split happens would have an outsized influence on the ergonomics.

Also note that localisation files are tightly linked to source code (the way they use them, similar to GNU gettext model, though they do use XLIFF): you put English strings in the source code, and when you change them (reword, fix typos, or outright change them), all translations need to get their English version updated and translations potentially needing updates marked as such. In short, they are managing their translations as source code (even if translators would be using translation tools akin to IDEs for development).


If you can check out your monorepo as if it's multiple repos, but then also check it out as a monorepo when you want, that seems to me to have more utility than splitting into multiple repos, where you can never check it out as a monorepo.


In a world where submodules worked (side note: We use PlasticSCM which has xlinks [0] which are substantially better than submodules, but Plastic itself has its own set of problems), you could have each "subrepo" as an independent repo, and then have a monorepo comprised entirely of submodules.

If submodules worked.

[0] https://www.plasticscm.com/documentation/xlinks/plastic-scm-...


> without any practical reduction in utility?

There is a massive loss in dependency management if you move to multiple repos.

Do polyrepo build systems exist that give you the same capabilities as bazel? Particularly with regard to querying your dependency graph.


Atomic linearisable updates to your code.


The point is that developers can check out bigger subsets if they need.


Pretty interesting.... I don't know that much about git, but still fun to read. I guess the main takeaway is don't put all your eggs in one basket? Although it kinda seems like they are going to stick with the monorepo, ("Here’s how we solve them at Canva")

Also, I looked up .xlf files and I still don't understand. It's xml, that part makes sense, but it's basically a config file? To tell what process to read which files?

Also, I've heard of Canva, but had no idea they were this big/ubiquitous/whatever.... and learning about pseudo localization is interesting too. And the graph for lines of code looks pretty exponential, maybe it's common up to a point, but if it continues at that rate, it will be infinite by about 2026 (okay, I just made that number up, but you get the idea)


.xlf isn't a config file format here. It refers to XLIFF, a standard for translation files. https://en.wikipedia.org/wiki/XLIFF


I looked at the Wikipedia article and a couple other pages, I guess the key word is "localization", which basically seems to mean language. I'm pretty sure I knew and forgot that.

The comments in this thread from the devs have been informative as well.


Putting my rusty sysadmin hat on, I would wager a guess that if it keeps on growing exponentially something will break. For example, hitting some hard limit like maximum number of files possible in a file system. Seems like you’re playing with fire to me.


I would think the same... and from reading the comments in this thread, it seems like a pretty spicy topic.


>just under 60 million lines of code in 2022.

I'm always surprised at the fact that almost every product has more lines of code than the entire Linux repo. The scale of these products is astounding.


From the rest of the article, it sounds like a big chunk of these lines are from generated files. What I don't understand is why they're checking in generated files into Git.


I've been switching a lot of generated files to being checked in (with CI verifying they haven't drifted from the source). The primary motivation has been performance. For example, in Rust code, it means I don't need to foist the code-gen process and all the dependencies needed for it on dependent crates. I've seen this play out similarly in other build systems and circumstances. The key is the data needs to be independent of other factors (like the system doing the generation) and the rate of change of the code-gen source and generator has to be relatively low.
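
The drift check itself is a couple of lines of CI; the generator invocation below is specific to my setup, so treat it as a placeholder:

  # re-run code-gen, then fail if the committed output no longer matches
  cargo run -p xtask -- codegen
  git diff --exit-code -- src/generated/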


They're using git as a cache. Having generated files stored there means they're available if they're needed (eg in CI) without needing further access controls, they're versioned, and it's a simple and understandable strategy. As the article states, most devs are set up to ignore those files so they're not much of a source of the slowness. It's a common pattern for apps that have to serve lots of different locales.


I don't think it's a good idea to store a cache in Git. Any file remains in the repository forever once committed, and the local/remote repository becomes unnecessarily big.


The bigger issues for me are it makes history impossible to read (every change is hidden in an avalanche of crap), merges are a mess (you definitely want to spend forever merging autogened files, right?), PR reviews are annoying, etc.


Depends how much generated stuff is there. We have our GraphQL schema in git even though it's auto generated via a library. But it's useful in PRs to see exactly how the schema changed as a result of the root change.


Yeah if there's not a lot of it, and if it's easy to regenerate, it can be fine.


You may want to set the '-diff' attribute for these files so that git will not show diffs for these, instead it will show 'Binary files differ'.

There is also '-merge', which will cause git to not attempt to merge the contents, but just ask you to pick a side.

The challenge however is then verifying the contents of these files in things like merge requests.

See https://gitirc.eu/gitattributes.html
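
For files like the .xlf translations in the article, that's a one-line .gitattributes entry (pattern chosen just for this example):

  # treat translation files as opaque for diffing and merging
  echo '*.xlf -diff -merge' >> .gitattributes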


It's still surprising to have any generated things there. E.g. you could make the same case for keeping built binaries in Git as well.

Is there a reason why that type of file couldn't be better placed in an artifact repository, or just generated and consumed in CI as part of generating a final build output?


> It's still surprising to have any generated things there. E.g. you could make the same case for keeping built binaries in Git as well.

This is not surprising at all. In fact, it's quite standard to commit string translations. Just because you can run the code generation/string replacement step as part of the build that does not mean it's a good idea to generate everything from scratch at every single build.

String translations hardly change once they are introduced, running the build step takes significant amounts of time, and if anything fails then your product can break in critical and hard to notice ways.


I'm not saying don't have the translations at all. I'm saying: 1) caching things in git in general is a bad idea; why is it not in this case? 2) these are not - to my understanding - the raw resource files, but rather machine-generated intermediate files. This is why it's about caching, rather than minimal source files.

Additionally, to respond to your comment, if string translations don't change much then it may be possible to push them out as an internal 3rd-party library, and then they're even quicker to build.


> I'm not saying don't have the translations at all. I'm saying: 1) caching things in git in general is a bad idea (...)

You're missing the point. Storing translated files is caching things in git, and it is not a bad idea. It's a standard practice that saves your neck.

You either place faith on a build step working deterministically when it was not designed to work like that, or you track your generated files in your version control system.

If you decide to put faith on your ability to run deterministic builds with a potentially non-deterministic system, you waste minutes with each build regenerating files that you could very well have checked out and in the process risk sneaking in hard to track bugs. Then you need to have internationalization test steps for each localization running as part of your integration tests to verify if your build worked, which consume even more resources.

Or... you stash them in git?

You use git to track changes, regardless of where they came from. Just because you place faith in some build step to always work deterministically that does not mean you are following a good practice and everyone else around you is wrong.


> You either place faith on a build step working deterministically when it was not designed to work like that

I'm sorry, what? Why would a build not work deterministically?

> If you decide to put faith on your ability to run deterministic builds with a potentially non-deterministic system

If your build is non-deterministic, how can you have any faith in the binaries it produces? You would have much larger problems in that case.

> You use git to track changes, regardless of where they came from

You probably don't want to do that if it is 70% of your codebase and slows down all your developers' git.

> Then you need to have internationalization test steps for each localization running as part of your integration tests to verify if your build worked

I'm convinced you've never used a build system before. Your build should fail if required files are missing. Downloading translation files at build time from some artefact repository vs storing them in git is how a lot of companies do it.


> I'm sorry, what? Why would a build not work deterministically?

Because they don't and never did?

Do you understand build systems and individual tools were not designed to ensure deterministic behavior?

https://reproducible-builds.org/docs/deterministic-build-sys...

Anyone with any professional experience developing software can tell you countless war stories involving bugs that popped up when building the exact same project separate times. What leads you to believe that translations are any different? In fact, more often than not we see unexpected changes during translation update steps.

> If your build is non-deterministic, how can you have any faith in the binaries it produces?

First of all, all builds are not deterministic by default.

To start to come close to get a deterministic build, you need to do all your own legwork after doing all your homework.

Did you ever do any sort of this work? You didn't, did you? You're not looking, and are instead just placing blind faith in stuff continuing to work by coincidence, aren't you?

> You probably don't want to do that (...)

Yes, I do. Anyone with their head on their shoulders wants to do that. It's either that or waste time tracking bugs that you allowed to go to production. Do you want to waste your time hunting down easily avoidable and hard to track bugs? Most of the professional world doesn't.


It is definitely possible to have determinism in a CI build step, and it's possible to have checks for it. If one needs determinism and a cache, they can store the files on S3 or some other place instead of git. Re-generating the files every time on the build isn't the only alternative. Instead of generate-and-commit, generate and upload. The difficulty is the same for developers.

If one has to be more granular than that, and have versioning and verification against the repository, they can still store the multiple versions on another service and store the hashes on git. Even though I'm not a fan of this for translation (especially if you have lots of languages/lots of strings), since there's an advantage of decoupling the translation process from the development process.

The problem with storing those files on git is that it can cause more problems, including developer experience issues.

It depends on how much you're storing on git. Some CSS files? Fine. 70% of files of the project, like in this case, slowing down everyone's workflow? Definitely not.


> Just because you place faith in some build step to always work deterministically that does not mean you are following a good practice and everyone else around you is wrong.

You're also doing that everywhere else. How do you think anything works? Why do you think Git is deterministic somehow? Why more so than including some files in a build?


Just an example, I had the non-deterministic case using JAXB to generate java classes from XSD Schema files. Running an ANT jaxb task to generate the classes from the same schema files would generate different class files each time. The class files were functionally the same, however it would reorder methods, the order of the variable definitions etc. Possibly due to some internal code using a Map vs List, so order was not guaranteed. In our case the schema files were in Source Control, the Java/Class files were not, the Java/Class files were generated by the build, packaged to a jar and published to our artifact repository.


> Is there a reason why that type of file couldn't be better placed in an artifact repository, or just generated and consumed in CI as part of generating a final build output?

No reason at all, but when you need the files during development, and testing, and CI, and in production, and you don't want those things to fail when your artefact repo or source of data is down, then putting the latest versions in git makes sense.

The cost of having them in the repo is a tiny bit more complexity in your git workflow and config. The benefit is being able to access those files everywhere you access the code. It seems like a no-brainer to me.


> place into an artifact repository

This adds yet another moving part to the system, and another place things can go wrong.

> generated and consumed in CI as part of generating a final build output

This can get quite slow, and on larger projects you have to expend a lot of effort to keep build times reasonable.

Also, if you're serving a library for public consumption, you generally don't want to add the burden of extra build steps for the user to follow before they can use it. If it can all be automated to the point of invisibility to the user that's fine, but often it can't.



Sorry I've been a bit misleading. These xlf files aren't generated, they're just not interacted with by engineers but they're still created and edited by humans as translations. We want to keep track of them so that if we deploy a different commit, the texts and translations in other languages will match


Reading the article they are not generated files, but files that are never touched by developers. Translators will work with those files. I expect that for translators they have a different sparse checkout that only fetches .xlf files for their target languages.


xlf files are usually generated. They're XML. No one wants to write that by hand.


Also plain text files are usually generated. They are arrays of 1s and 0s. No one wants to write that by hand.

I think that this is not a sensible definition of a generated file. A more sensible definition is that a generated file is created automatically from some source, which is not user input (i.e. an other file). This means generated files do not need to be kept under git, as long as their source is checked in.

Translation files, even if they are not created with a plain text editor but with some other tool that handles the XML layer, are clearly not generated, as long as the translation is done by a human.


For the app I work at the moment we use https://lokalise.com/. We add translation strings to a SaaS app, and then the translation team translate them. I've written a build tool that downloads the translation JSON files from the API using the CLI, or as part of our CI process. Other teams have tools that download their language packs for different iOS and Android apps. The translations are versioned in Lokalise and we using a branching strategy to manage the work. Lokalise has an option to generate xlf files (and JSON, xliff, arb, etc).

This is a very typical workflow. Most people are not out there modifying xlf files by opening them in a text editor. For a start, translations usually aren't done by developers.

(Huge shoutout to Lokalise btw. I can highly recommend it. It makes building a multi-lingual app across different platforms so much easier.)


This still doesn't make them generated files!

You opted to keep translation files out of version control; you could also keep images there, or source files. All this stuff is the (pretty direct) output of non-deterministic human intervention.

(BTW, how do you build an old version of your application? Is lokalise able to give you the appropriate translations for a specific git commit / app version?)


.docx files are archives of xml-files. No one wants to write that by hand.

Or, in more words: The format of the files is just the representation on disk - it’s not directly connected to how the files are generated or edited. XML files can be written by hand with suitable editor support.


I think we have different ideas of what "written by hand" means. Someone making a Word doc is not writing XML by hand.


As a counter-point, package managers generate “lock files” that are designed to be tracked in VC.

For these translation files, I’d imagine there may be occasional work to modify them even after they are initially generated.


Yes, the article glosses over that a little bit, and it's an unusual decision.


We check in some generated CSS files that are generated by an external theme CLI, just to be sure that after a version update we can track all changes in the generated CSS files.


If they're generated you can just re-generate from source every time you need to track changes.

You're using git as a cache. You don't need to version a cache.


Regenerating certain things might be fast, but some might not be. Hundreds of engineers pushing code and having to wait for these to be regenerated both locally and on CI means that caching is quite cheap after all.


Note that these files are not statically generated; they are translation files, generated by translators.


The natural tendency in almost any software is to keep adding and adding, while rarely throwing anything out. More features, more code, more supported platforms, more supported languages, more this, more that, more, more, more.

If it was a physical product, you couldn't keep making it bigger and more complex ad infinitum, because making a physical thing bigger takes more material, and bounded physical resources would be consumed. With software, it's all just bits, and computers can hold a lot of bits.

This leads to bigger problems than just git running slowly.


Canva is absolutely crushing the established design market with their product right now so I'd say having all those features is pretty important for a good tool these days.


As well as the generated files the sibling mentioned, I wonder how much is due to vendoring.


> git status takes 10 seconds on average

> running these commands multiple times a day reduces the total productive time engineers have every day

I love the attention paid to this. Often opportunities to prioritise seemingly small efficiency gains are neglected.

At 10 seconds per command, an engineer that uses git status 50 times per day spends ~10 minutes per day waiting; an entire work week per year!! Well above the threshold warranting optimisation, and that doesn't even factor in distractions and context switching.


It’s actually even worse than that I think. If something takes over a certain amount of time, then I’m more likely to go do something else while I wait, like check Hackernews. And there goes 20 minutes.


I created a `beep` alias that just does `echo "\x07\x07\x07"` which triggers the system notification sound three times.

Then if I have a command that will take a while, like a stupidly long `git status`, I do `git status && beep`.


I'm hoping at least you acknowledge this is _your_ problem, rather than a tooling problem or the like. You just can't expect everything to give you immediate feedback after a couple seconds.

This is something that's really a degradation of the newer generations of engineers, since I clearly remember the time where these "somethings" would never take less than a couple minutes, and people did not immediately flee to their nearest distraction, but actually planned their time around it. In fact, if you go further back, these "somethings" would have taken hours, and the older generations still got work done.


That's a lot of drama, existential angst, and nostalgia around the equivalent of "sometimes during the commercials I go to the bathroom, and when I do I sometimes miss a minute of the show because I don't get back in time."


I used to work at a place where git status would take 2 minutes (or more). You just stop using it and rely more on your memory, or run in it parallel while you continue working on something else, etc.


I'm glad I'm not the only one that thinks that way. Even ~3s every git status is driving me insane now.


> "...we found that .xlf files made up almost 70% of the total number of files. These .xlf files are generated... they are never manually edited by engineers..."

First thought is: why not zip/tar away all of these "convenience" files per generation, and add a line to the build/install script to unpack them after checkout?

Additionally, add .xlf files to .gitignore to exclude them from the untracked list.

No one cares to diff them as long as their contents are consistent with the checkout. Text compresses quite efficiently, so this should not introduce any unreasonable build/install delays.


I made another comment as well, though the tl;dr is these xlf files are translations tied to texts in code, so we can't simply ignore them from the repository. The changes have to be kept so that if we, say, revert to a certain commit, all the translations match with the texts of headers, buttons, etc...


>...The changes have to be kept so that if we say revert to a certain commit, all the translations match with the texts of headers, buttons, etc...

The mentioned .zip file is to be kept in the repo. Instead of whatever number of individual .xlf files per generation, these would get zipped together (say, 'assets/xlf.zip') before the commit and the resulting .zip added to the commit.

Similarly, when reverting or on a checkout, it's the .zip that gets checked out and then the .xlf files are unpacked.

The packing/unpacking could be done by the same process that handles the .xlf generation (??build).

Also this may be automated by git-hooks, though it's more natural to handle the packing of assets during the build stage.
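
Roughly (paths made up, and assuming *.xlf is in .gitignore as suggested above):

  # pack, run by whatever step produces/updates the .xlf files
  find . -name '*.xlf' -not -path './assets/*' | tar czf assets/xlf.tar.gz -T -
  # unpack, run by the build script or a post-checkout hook
  tar xzf assets/xlf.tar.gz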


Could these translations be moved to another repository? Maybe they could be published as a separate NPM package that devs could install if they needed to look at the translations?


Then you get the annoying problem of having to push an update to that repo and wait for the new version before you can merge changes into the new repo which use the new version. It’s tightly coupled, so they should be co-located.


The entire workflow could remain the same while tar/zipping goes on in the background...


I appreciate this post. It's nice to see that there are other teams that feel some of the pain points of git (and unsurprising that most of the responses are "you're holding it wrong"). The fact is that git doesn't scale to _very large_ repos. We've seen it time and time again, but there isn't really a great alternative. Perforce is.... Perforce (centralized, very expensive to license, branches are incredibly expensive, and streams still feel like a band aid even years and years later). PlasticSCM (which we use at work) is fine, but closed source, mildly expensive, and has a terrible UX.


Yeah, I think choosing git is fine, and choosing to have a monorepo is fine, but you probably don't want to do both.

Or, at the very least, once your git monorepo reaches a certain size, you should either split it up, or switch to something that handles monorepos better. Even if that something is e.g. a virtual filesystem on top of git.


Just saying that Perforce "is Perforce" is pretty lazy. Perforce is great in my experience. It scales up to repo sizes you are unlikely to ever reach and has a UX that makes actual sense.


In my defence, I elaborated in parentheses immediately after saying it. It's _very_ centralised and online only. There's no concept of any local work/branches so all prototypes are checked in (or in my experience people manually shelve things and juggle shelves around).

Perforce's branches are a disaster and should have been deprecated years ago and replaced with streams a decade ago. Branches are _incredibly_ slow, they're straight up copies, and they are pretty much isolated from each other. Streams are an improvement, but still are very primitive compared to git's branches - the change tracking across streams is poor, and the enforced hierarchy has too many escape hatches that can make a gigantic mess. Streams are loosely enforced with views which can't be customised per workspace, a major regression from branches. In practice, every team I've worked on has had a "convert merge to edit" style action to fix perforce's messed up idea of a merge. Stream switching is also dog slow (on my last project, it was quicker to delete the 150GB workspace, and re-sync than have perforce actually figure out what had changed).

Perforce is eye-wateringly expensive, and very difficult to license - licenses are 4 figures per seat per year for medium-sized businesses (and close to 4 figures for small companies), and maintaining a p4 server is genuine work. The hosted offerings of Perforce (Assembla only - https://get.assembla.com/pricing/) are very lackluster, and very limited (no triggers unless you pay the "contact us" pricing).

My experience in large teams was that running perforce at scale involves not using some/many of the features, and that actually keeping it running well pretty much requires an active support contract from perforce.

All that said, P4V is by far the best GUI client for any VCS, it _handles_ binary files, and p4 sync's performance makes git clone feel like you're working on a 56k modem. It's just that when you want to do anything other than submit or sync, the wheels fall off the track.


I used Perforce for a decade and did not create a branch, not even once. People who want to branch in Perforce are often trying to bring a git mindset to a different tool. With a trunk-based edit/sync/submit workflow where you have different p4 clients for your different projects (what you would use various branches for in git) you need not branch.


> I used Perforce for a decade and did not create a branch, not even once.

That's because P4's branches are a pale imitation of what branches can be (and to be fair to P4, branches long predate git, so they can't exactly up and change the behaviour). However, there's really no excuse for what they did with streams: they bolted a loosely enforced hierarchy onto the existing branch system, created a split in the tooling, and shipped something that has as many footguns as problems it solves.

> People who want to branch in Perforce are often trying to bring a git mindset to a different tool. With a trunk-based edit/sync/submit workflow where you have different p4 clients for your different projects (what you would use various branches for in git) you need not branch.

"often" is a very nebulous adjective, and a naive view of what git branches do. Perforce and a task branch based workflow is a terrible idea, yes. If you want to do the PR based flow that github and gitlab encourage, you're going to have a bad time. P4's shelves are an excellent tool, but they encourage ad-hoc and self managed version control. Shelves to bring changes across streams (if you're using them), Shelves to share a WIP or a quick change with someone else, and iterating back and forth with shelves with v1 v2 v3 etc in them, shelves for temporary debugging code/non prod features/work in progress feature that's ticking along in the background.

Again, I'm not saying perforce has no place, but git's branches are a force multiplier, even with trunk-based development.


There was pijul, which allowed partial pulls IIRC. I haven't used it, so I can't really recommend it.


I tried pijul [0] last time this conversation came up and despite its claims it is not fast.

[0] https://news.ycombinator.com/item?id=29992875


Just seeing this now. If you look at the conversation, this was fixed within a few hours.


But that's what those responses are all about: git doesn't scale, so don't use it if you need scale. If there are no tools available that scale, then change your approach so you don't need to face this scaling problem at all.


Maybe at their scale it makes more sense to switch to a VCS like Eden? https://github.com/facebookexperimental/eden

Eden's equivalent of 'git status' should run almost instantaneously, as checkouts are hosted by a virtual file system (FUSE) that tracks changes.


I feel like I've read about several big companies using monorepos, but I've never understood why. It feels like the source-control equivalent of writing your code in one big file.

Does anyone have any good resources for why and how best to implement a monorepo?


You kind of have to experience the worst of both worlds to understand where and how each method works and breaks down.

With multiple repos it's harder for teams to share code and collaborate. Each team has a repo that becomes a little fiefdom where they are oblivious to who is using their code and how they're using it. Suddenly they'll push out what they think is an innocuous refactor and inadvertently break core functionality other teams took a dependency on for better or worse.

So what happens is the team with a dependency now copies the old code into their repo and takes on all the extra burden of maintaining this old version, trying to backport fixes, etc. It becomes an enormous mess and time sink. No one ever has time to go back and fix things, and when they eventually are forced to do so it costs more time and effort than it would have taken to do it right from the start. You'll also run into horrible versioning problems where you're stuck on old version X but depend on widget foo, which needs current version Y of that dependency.

You might say, well, bad on that team: they should have engaged the product managers, made sure their dependencies and usage were well tracked with them, been looped into the process of changes, etc. But in the real world, when your boss says feature X needs to be shipped in a few days, all of that process goes out the window.


This is valid. But I think it would make sense to challenge this, and even (try to) leave, if product is dictating to tech what to do with their repos... especially if there's a boss figure pushing imaginary deadlines without talking to the team beforehand. There's a job crisis now, I know, but millions of files were not checked in during 2021 alone. We should do better when it comes to organizing how we work.


good resources:

1) https://trunkbaseddevelopment.com/monorepos/

2) and (but I don't know it as well) https://monorepo.tools/


It’s mostly about avoiding code and work duplication. At scale, the waste from duplicate work across teams can be massive (think about setting up CI tooling, for example). A monorepo lets you solve tooling/build problems once and for all. The main drawback is the scalability of the tools involved, like git.


> The main drawback is scalability of the tools involved like git.

And if you can employ enough engineers to break git, you can probably afford a team to work on scaling git.


Git starts to break at 200-300 engineers pushing into it. Scaling git would take 200 more :)


Everything described in the post can be done by one engineer.


Here are a few reasons that I've heard and experienced

1. Standardized tooling

2. Fewer issues related to dependencies

3. Hermetic tests

4. Reduced code duplication and easier code sharing


Did you consider VFS For Git and Scalar from Microsoft? What was the result? https://github.com/microsoft/VFSForGit

https://github.com/microsoft/scalar


VFS for Git is deprecated, although something incorporating its approach is likely the best way to implement a real monorepo, which AFAIK doesn't exist outside of proprietary implementations used inside certain big tech companies.


Did anyone use scalar from Microsoft?

https://github.com/microsoft/scalar


As mentioned in the blog, "we are moving towards providing a known version of git" - probably Microsoft Git, which includes the scalar command but also upstreams many of the optimizations to git core: https://github.com/microsoft/scalar#why-did-scalar-move

disclaimer: canva staff working on source control


We use scalar at Microsoft. Working with the massive scale of some repos would be near impossible with "vanilla" Git.
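For anyone curious, a rough sketch of how Scalar is typically used (the repo URL and path here are placeholders); scalar clone sets up a partial, sparse clone with background maintenance enabled:

  # blobless partial clone with sparse-checkout and scheduled maintenance
  scalar clone https://example.com/big/monorepo.git

  # or enroll an existing clone so it gets the same config and maintenance
  scalar register /path/to/existing/clone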


I don't understand the logic of combining microservices with a monorepo. The whole point of microservices is that you don't care what is beyond the external contract of the service. Who cares how each individual team decides to name their stuff. Why do microservice teams need to care about having every single service checked out? Who or what would be bulk applying changes to all services? This is madness.


Suppose you want to deprecate an API you wrote in favor of something else for $valid_reasons.

In a monorepo with the right tooling, I can make a branch where I delete the API and get pretty immediate feedback as to every module I have broken. From there I can update all the call sites and I also know which teams/engineers I should give a heads-up to.
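As a concrete (if hypothetical) example of what that tooling can look like: another commenter notes Canva uses Bazel, and in a Bazel monorepo the reverse-dependency lookup is a one-liner (the target names below are made up):

  # every target that transitively depends on the API target
  bazel query 'rdeps(//..., //libs/payments:api)'

  # build/test just that affected set to see what actually breaks
  bazel build $(bazel query 'rdeps(//..., //libs/payments:api)')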

In a multi-repo world, this is much more difficult. Even learning what all the reverse dependencies are could be a challenge. Most likely, other teams have to pull in my changes on their own schedule, and other teams have very different incentives than my own. The cost of mistakes is therefore higher because they are more difficult to undo.

Monorepos are very important if you want to empower engineers to achieve broad changes across the org. Some people fundamentally disagree with this methodology. And a fair number of people just don’t care enough - marking the old thing deprecated, calling it a day, and letting it rot for eternity is good enough for them. These are the same people who you’ll find saying “not my job” a lot, in my experience.


The perceived ease of that operation in a monorepo is what makes it dangerous. Unless these changes all map to a single monolithic service, then even though you have updated all call sites, those callers will not all be deployed at the same time, meaning that as the change rolls out you may see random breakage. With a polyrepo, the deployment boundary can align with the code boundary, making the rollout problem obvious.


People point this out often, but in practice I have never seen this cause issues. Remove the callers first, then remove the endpoint... it's pretty obvious the order in which things need to be done.

What I have seen as a real problem, time and time again, is having trouble locating all usages of an API in a multi-repo scenario.

Anyone who fucks this up in a monorepo will probably fuck it up worse with multiple repos.


I think the tricky issue is updating the structure or semantics of an existing call across services. A monorepo makes it easy to make these kinds of changes in code, while making it non-obvious that they are dangerous to deploy; it is a giant foot-gun. Examples along this line include updating the name of an RPC or endpoint, changing the request/response structure, and so on. Of course you could argue that "you just shouldn't", with which I agree, but the point is that making those kinds of changes should actually be really hard, instead of really easy.


Ah yes, this is definitely a challenge. At my day job we use protobuf and follow the (imo well documented and well evangelized) best practice of forbidding breaking changes to protos, so I almost forgot about this class of problem. At least for changes to structure. Changes to semantics can still happen but I don’t know that I’ve ever seen anyone cause a major issue while keeping structures compatible.

We have escape hatches, which I mainly use when deleting code.


Microservices in a monorepo is the least friction path towards a distributed ball of mud.

Whether this is a pattern or an anti-pattern depends on whether you want a single engineer to be able to change the entire architecture to “just ship it”, or whether you value conceptual integrity more.



Yeah, git-meta is a reasonable alternative to megamonorepos. It's basically a repo-mega-forest.

A few things:

  - your build system will need more code
    to deal with a repo-mega-forest than
    a megamonorepo
  
  - code indexers may not be able to see
    cross-repo dependencies
etc.

You'll have pain no matter what. My preference would be for a megamonorepo approach to scale properly. That means partial/sparse cloning, as well as shallow cloning, and also all the hacks that are supposed to make git-status and git-log (and git-blame, and...) fast in partial clones.
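Roughly what those hacks look like with stock git today; the repo URL and directory names are placeholders, and which knobs actually help will depend on the repo:

  # blobless partial clone: full history, file contents fetched on demand
  git clone --filter=blob:none https://example.com/big/monorepo.git
  cd monorepo

  # cone-mode sparse-checkout: only materialise the directories you work on
  git sparse-checkout init --cone
  git sparse-checkout set web/editor shared/i18n

  # untracked cache and the filesystem monitor speed up git status in big worktrees
  git config core.untrackedCache true
  git config core.fsmonitor true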


> These .xlf files are generated and contain translated strings for each locale

... and they make up 70% of their repo

Why would you include generated files in a repo?

Do they take too long to remake?

[EDIT]: especially given the fact that they're using Bazel, which is supposed to be the bee's knees of build systems?


The article uses the term generated, but a more precise term could be "produced by translators". The files can't be simply regenerated. We've recently published [1] a separate post that explains the localisation process in more detail.

[1]: https://news.ycombinator.com/item?id=28931601


I get a headache imagining everything at Google in a monorepo. So how do you reconcile this single Git repository approach with Infoworld's "The case against monorepos"? [1]:

Reason #1. Monorepos go against single-team ownership principles

Reason #2. Monorepos encourage bad practices involving massive refactoring

Reason #3: Small repositories are better than large ones

[1] https://www.infoworld.com/article/3638860/the-case-against-m...


With regard to 3:

> In Google’s case, more than 45,000 changes are made to its monorepo every day. This code management becomes an exponential problem in overhead as the number of developers of an application grows, and the number of components within the application expands.

Speaking from experience, the overhead of code management is far less at Google than any other place I've worked at, even on projects with just a couple hundred lines of code. If this author is imagining a monorepo means each engineer having to constantly check out terabytes of code to make any change and race with other developers to merge to HEAD, they don't really understand the landscape well enough to write an article criticizing monorepos (which certainly have real cons, particularly with the source control tools that are available to most companies).

Of course, you need actual tooling and infrastructure for that beyond what git offers.


One person's STOP is another's silo

One person's massive refactoring is another's tech debt reduction

One person's multirepo is another's inability to find that broken bit of upstream code


I think the biggest problem here is that git (out of the box) is not well suited to a monorepo.

But in terms of the reasons in "The case against monorepos":

> Reason #1. Monorepos go against single-team ownership principles

I think it's up for debate whether or not single-team ownership is desirable. But even if it is, I don't see the difference. Just have teams own their directories within the monorepo.

If you want to enforce that a team owns a particular part of the repo, just put some rules into the code review tool to ensure that a change to that component can't be merged without someone from that team reviewing/approving it.
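GitHub's CODEOWNERS convention is one common way to wire that up (GitLab and Gerrit have equivalents); the paths and team names below are invented for illustration:

  # .github/CODEOWNERS - combined with branch protection, changes under these
  # paths require a review from the owning team
  /payments/    @acme/payments-team
  /editor/      @acme/editor-team
  /i18n/*.xlf   @acme/localisation-team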

> Reason #2. Monorepos encourage bad practices involving massive refactoring

Again, this just appears to be the author's opinion that massive refactoring is somehow problematic, without providing much of an argument against it. If you're changing service APIs, then sure, you need to be careful about the order in which you roll out changes. But if you're changing the APIs of shared libraries, then being able to do a single large refactoring is absolutely valuable.

> Reason #3: Small repositories are better than large ones

This is just a tooling issue. You can still separate projects by directories, and if your tools allow you to just check out a subdirectory, or if you have some sort of virtual filesystem on top of your source control, then you don't have to pay the penalty of pulling down the entire repository.


> > Reason #3: Small repositories are better than large ones

> This is just a tooling issue. You can still separate projects by directories, and if your tools allow you to just check out a subdirectory, ...

Even a forest of repos has tooling issues: your build, code review, and code indexing tools will need to support the forest, and that's a lot of work. It's probably comparable to the work needed to make a monorepo work.

The alternative to the monorepo isn't all rosy.


These engineering blogs are intended to promote the engineering culture of the company for recruitment purposes, and, as a corollary, are an opportunity for the engineers to self-promote and boost their resume/promotion-worthiness internally.

The sentiment on this thread, if it is indicative of the greater talent pool, suggests this blog post is having the complete opposite effect.

I remember when Uber was proud of their thousands of repos. Here it’s the 60 million lines of code. It’s not just red flags, but seems like stuff that might get leaked to Programming Horrors / WTFs.


I checked out their careers page after reading this.


I wonder why these generated files are not part of a separate distributed, HA, cloud-managed document store. Seems like a perfect use case for it. This looks like git abuse to me.


They made a bad design decision 10 years ago, have been fighting the fallout for years, and will be doing so forever and ever because things will only ever grow.

They wrote a blog post on how clever they think all their workarounds are, at least one of which involves sparse-checkout -- which is perilously close to chopping up your monorepo into several, while still pretending monorepo is fine.

I feel like somebody's job and/or ego is heavily invested in keeping things as they are, even if it demonstrably does not scale to their needs, and the solution is blindingly obvious to even a casual observer.

That is institutional insanity.


So every company ever? I love these comments, as if hindsight is not 20/20 and it would have been so simple for you to have made all of the right engineering decisions as an armchair CTO. Give me a break. Git repos get out of control, bad decisions are made, and this is an interesting solution and write-up. I'm so over the whole “just make perfect decisions all the time and you would never need X” comments. Guess what? Everybody makes mistakes.


This isn't just a matter of hindsight. They don't even have that, as demonstrated by the fact they're putting ever more work into their workaround instead of fixing the glaringly obvious problem: it's a monorepo with half a million files growing ever bigger.

Certainly I wouldn't have decided to put all that junk into a single git repo from the start, since I'm not an idiot. But, even if I did, or I inherited something like that, I would fix it, not double down as they're doing.


What would be a better structure?


In a word, SQLite. You can always put your files in a file. ("Yo dawg, I heard you like files so I...")

Depending on your use cases a zip file can work, e.g. Python packages can be imported from within a zip file, and the standard lib is distributed that way.


monorepo until it becomes too big -> then, splitting it into 2-3 repos, until each one also gets too big to manage...


The problem is not the repository, it’s having the translation data in the repo.

We have the same issue at $dayjob: the repo is quite large, and 80% of it is translation data. Even though they compress ridiculously well (99% last I checked), the number of translation files and the number of exports make them the vast majority of the cost.

However that structure remains a convenient nuisance, and more importantly removing them would really only be useful if we rewrote the entire repository, which breaks all working copies.

There’s been a task in a wishlist for years now, but the business incentive just isn’t there.

Edit: actually the translation files are 80% of the working copy; they’re closer to 90% of the repo, and on the far side.


But isn't it sort of correct to have translation data, which is text data and for which there might be bug fixes, in the repo?


Which is why it commonly is there, but at the same time the translation data is kinda independent: you can run the software with none or only part of the translations; as far as the software is concerned, it’s usually close to configuration data or assets.

Not to mention translations often have their own lifecycle, e.g. it’s common to find translatable strings which are untranslated or incorrectly translated, and to want to ship updates independently of the software’s.

As such keeping the translations outside the source is also perfectly defensible.


Yes


> The problem is not the repository, it’s having the translation data in the repo.

Does this data change as often as the code does? If not then get it out of the repo.


What if it changes more often than the code? Throw out the code?


There's no material difference between taking the translations out of the repo and taking the code out of the repo. "She shouldn't divorce him, he should divorce her!"


Yeah, this sounds more like a pile of technical debt attached to a competent business.


Yet another example of Additive Bias: https://brainlenses.substack.com/p/additive-bias


The Linux kernel has about 30 million lines of code.

Windows 10 has about 60 million.

I doubt that any other software project has as many.

How would you even end up with 500K source files in the repo?


> to just under 60 million lines of code in 2022

I'd be somewhat intimidated on my first day of work if I cloned a repo with that much sloc...


git status scales with the number of files in the repo. Ask HN - Are there any common git operations that scale with the number of commits?


Repository maintenance operations like "git gc" scale with the total number of objects in the repo.
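If you want a feel for those numbers on your own repo, a couple of stock commands give a rough picture (output will obviously vary):

  # loose and packed object counts plus on-disk size
  git count-objects -v -H

  # number of commits reachable from HEAD
  git rev-list --count HEAD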


I used to work at a company where our git repository would get so polluted with change-related junk that we had to call (many times literally call) Bitbucket and ask them to run a cleanup on the remote. Otherwise, fetch/pull and many operations on the remote would take forever, or would often even fail completely due to timeouts or lack of RAM.

It's Thursday and you wanna release a hotfix - but nope - all builds are failing because git can't cope with junk and your CI can't proceed. You then have to call BB instead and ask them to clean up, hoping they'll do it within the same day. Ah, exciting times.


git log ;)


I believe git log is limited by the number of lines on the screen...


It may operate in a streaming fashion if it's going to a pager, but naw it'll output however much you want.

Technically it has modes which need to scan all commits in the current branch as well, like the ones that grep for certain changes.


I've had millions of files in git using git-annex. It kind of works if you're very patient, but it's definitely not something I'd recommend. I wish there were a solution, such as git-annex, that worked well with millions of files.


If you don't have a couple hundred engineers who would write a custom file system for git (Microsoft), or who would take an existing source control system and nearly fully rewrite it (Facebook), or who would write a custom source control system from scratch (Google, Yandex, etc.), DON'T USE A MONOREPO.

Otherwise you risk ending up in a situation where hundreds of your engineers have to spend tens of minutes every time they simply need to push their code changes.


Google did not need to write their own SCM to reach the scale mentioned in the article. Off-the-shelf Perforce served that company well enough when it had tens of thousands of engineers and millions of files. They had 11 million files in a monorepo with one 256GB machine which most of you would consider a bare bones desktop computer at this point.

https://www.perforce.com/sites/default/files/still-all-one-s...


I mean, if you have hundreds of engineers, that might be sound advice; if you have 20, you’ll probably be fine.


Of course, you shouldn't start a project with 10 repos. But the downside is that while you live in a single repo, all the tooling gets tuned to work with a single repo. And then there is no way to go multi-repo when you need to.

So advice would be: start small but think big :)


There is no point in making any argument with someone who has the conviction of their religion behind them. Followers of the monorepo religion are no different.


Somehow, it seems so backwards. We make our repositories more accessible to automated tools and less accessible to people, and instead of investing coding time into better tools, we invest it into local workarounds on every single developer's machine.

This is probably not a general conclusion about monorepos, but the notion of "let's put auto-generated stuff into version control" sounds like the point in time where it all went awry. Yes, there is ClearCase, where this is the way to do it; Perforce probably does this too; and there are areas where you are supposed to archive your generated binaries with your code and scripts for later audits. But I'd argue most people shouldn't be doing this, since there are ways to separate source and release artifacts once and for all, in a way that makes both bots and people happy, without locally-brewed scripts for checking out code.


>across half a million files

>we found that .xlf files made up almost 70% of the total number of files

If one xlf file is used to keep the translations for one language, how the heck do you have so many then? Makes no sense.

If you have many xlf files for 1 language, may God save your soul.


One xlf per language _per component_.

5 components, 5 languages, would be 25 xlf files.


I don't think I've ever seen "because your mono-repo is too big for git" used as an argument for micro-services, but maybe at this point it's valid.


You can put microservices in a monorepo. You just put them in their own folder.


Of course, but can you split a monolith across multiple repos?


Sure. It'd drastically complicate the build process and CI for little benefit compared to other approaches (e.g. sparse checkout), but you could even in principle create a repo for each code directory and stitch them all together.


>It'd drastically complicate the build process and CI

This is a part of most monorepos. I figure having only a single build artifact is rare.


Yes. With git submodules.
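A rough sketch of that stitching (URLs and paths made up); the parent repo only records a commit pointer per submodule, so you'd still need tooling to keep them in sync:

  # in the top-level "monolith" repo
  git submodule add https://example.com/acme/billing.git services/billing
  git submodule add https://example.com/acme/editor.git services/editor
  git commit -m "Add component repos as submodules"

  # fresh checkouts need the submodules pulled explicitly
  git clone --recurse-submodules https://example.com/acme/monolith.git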


Canva use microservices FWIW


This is funny; I stopped reading when it started talking about how engineer X will never touch section Y of the code. OK, then put them in separate repos.


How is that any better than telling git to ignore them? Especially when they're tightly coupled translations.


Every time I read a monorepo story my takeaway is that it's a really bad idea. Segregation of responsibility is a good thing.


the easy solution is to put node_modules in your .gitignore :D


Just don’t run git gc

Runs forever even on small repos


The fact that Canva has a `Source Control team` with at least 5 people on it (going by the thanks at the bottom of the article) means they should probably try a different approach.

I think it's a cool company, with a good product, but they're WAY too small to be having a "source control team" on staff. That's at least $1.2MM a year in salary/benefits cost.


It's not exactly a great name for the team, but it gets the point across - we handle developer experience from the point you want to push your code to when you merge it, so we also work on communicating with CI/CD, code review/ownership tooling, GitOps bots, etc. Git performance is a big issue right now, but there's no end in sight to all the other scaling problems we could work on, so if it stops making sense, we can work on something else :)

disclaimer: canva staff working on (for now) source control


I am curious why git submodules couldn't be used here, and perhaps scripted to match particular workflows.


That's fair, and I think you're doing great work. I enjoyed the article


Author here. Not everyone on the team works on git and git performance. Our team works on a variety of things that touch "code" in general, like our ownership system and tooling, access control, bots and automation, etc.


Sorry, there is a quote from the article: "Kudos to other members of the Source Control team — Alex Sadleir, Wesley Li, Adam Murray, Matthew Chhoeu — who work on improving git performance at Canva"

So I assumed 5 people were working specifically on Git performance ;-P


Heh, I can see how this can be misinterpreted :P

That's the entire Source Control team (of 5 people) today. Among other things, they work on improving Git performance ;)


Thanks for the clarification. How many people work on Developer Productivity in total, and what percentage of the total engineering headcount is that?


They have “hundreds of developers.” It seems completely reasonable for them to have a team focused on DX. It’s one of my most common recommendations for rapidly-growing companies, which are often focused on features at the expense of productivity. A DX team is one of the lowest-hanging fruits for fixing the problems that causes.

(Improved testing practices is often second, but that’s not a low-hanging fruit. It’s a big, juicy fruit that’s waaaaay up in the canopy. And then comes joint product decision-making within the team, which may as well be on another planet entirely.)


Canva is worth around $40 billion; it's not really a small company.


Yeah... also once such a team exists, they are naturally going to resist ideas like not having a monorepo but just using git submodules, not keeping the history of every generated file forever etc.

Because then they won't need a 5-man team to handle the fallout any more.


That bit surprised me also. I don't have much experience with companies of this size, so I have no real idea; how many people would you expect to work on source control at a company of this size? I had a look at LinkedIn; they have 4,345 employees.


A good proportion to start with is usually considered to be 10% of engineers working on Developer Productivity, and it should increase to 15% as the company grows. But regarding source control, it really depends on the approach: if you are on a polyrepo, then 0 is enough. If you decide to write your own source control system like Facebook or Google, then maybe a couple of hundred engineers full time until the project is finished :)


Really weird. But maybe this team is in a country where developers cost 100x less? (: Still weird though.



