
> The Google codebase includes approximately one billion files and has a history of approximately 35 million commits spanning Google’s entire 18-year existence.

Wait, that's an average of nearly 30 new files per commit. Not 30 files changed per commit, but whatever changes are happening to existing files, plus 30 brand new files. For every single commit.
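As a quick sanity check on that figure, here is the arithmetic using only the two rounded numbers from the quote (the variable names are just for illustration):

```python
# Rounded figures taken from the quoted sentence of the paper.
total_files = 1_000_000_000
total_commits = 35_000_000

# Average number of files per commit, if every counted file was
# introduced by some commit.
files_per_commit = total_files / total_commits
print(f"{files_per_commit:.1f} files per commit")  # 28.6 files per commit
```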

Although...

> The total number of files also includes source files copied into release branches, files that are deleted at the latest revision, [...]

I'm not quite sure what this is saying.

Is it saying that if `main` contains 1,000 files, and then someone creates a branch called `release`, then the repo now contains 2,000 files? And if someone then deletes 500 files from `main` in the next commit, the repo still contains 2,000 files, not 1,500?

If that's the case, why not just call every different version of every file in the repo a different file? If I have a new repo and in the first commit I create a single 100-line file called `foo.c`, and then I change one line of `foo.c` for the second commit, do I now have a repo with two files?

I mean, if you look at the plumbing for e.g. `git`, yes, the repo is storing two file objects for the repo history. But I don't think I've ever seen someone discuss the Linux git repo and talk about the total number of file objects in the repo object store. And when the linked paper itself mentions Linux, it says "The Linux kernel is a prominent example of a large open source software repository containing approximately 15 million lines of code in 40,000 files" - and in that case it's definitely not talking about the total number of file objects in the store.
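To illustrate the "two file objects" point without needing a repo: as far as I know, git names a blob by hashing a short header plus the file contents. The sketch below assumes the default SHA-1 object format, and `git_blob_id` is a made-up helper name, not a git API.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Object id git assigns to a blob (assumes the SHA-1 object format)."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# First commit: a 100-line foo.c. Second commit: one line changed.
lines_v1 = [f"int x{i} = {i};\n" for i in range(100)]
lines_v2 = list(lines_v1)
lines_v2[41] = "int x41 = 0;\n"

id_v1 = git_blob_id("".join(lines_v1).encode())
id_v2 = git_blob_id("".join(lines_v2).encode())

# Different contents hash to different ids, so the history holds two
# blobs even though the working tree only ever has one foo.c.
print(id_v1 != id_v2)  # True
```

Identical contents always hash to the same id, which is why an unchanged file on a new branch does not add another blob.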

I don't think it's entirely clear what the paper even means when it talks about "a file" in a source code repository, or if it even means the same thing consistently. I'm not sure it's using the most obvious interpretation, but I can't understand why it would pick a non-obvious interpretation. Especially if it's not going to explain what it means, let alone explain why it chose one meaning over another.




you're misunderstanding a bunch of things.

> The total number of files also includes source files copied into release branches

I guess you haven't used Perforce or similar. a branch is a sparse copy of just the changed files/directories. they are not used very much.

> files that are deleted at the latest revision

so it means "one billion files have existed in the history repo, some are currently deleted".

> I don't think it's entirely clear what the paper even means when it talks about "a file" in a source code repository,

seems pretty clear - a source code repo has lots of files. at the most recent revision, some exist, some were deleted in some past revision. more will be added (and deleted) in later revisions.

it's very much not the same model as git.

hope that clears things up.


> you're misunderstanding a bunch of things.

It certainly feels that way :-)

> > The total number of files also includes source files copied into release branches

> I guess you haven't used Perforce or similar. a branch is a sparse copy of just the changed files/directories.

Still not sure I see the distinction. Surely "sparse" or "not sparse" is an implementation detail. If I create a new branch in git, the files that are unchanged from its parent branch share the same storage, but the files that have changed use their own storage.

> so it means "one billion files have existed in the history repo, some are currently deleted".

I guess I'm struggling to understand what the point of this metric is? I get why "Total number of commits", "Total storage size of repo in GB/TB/PB", "Number of files in current head/main/trunk", or even "total number of distinct file revisions in repo history" could be useful metrics.

But why "number of files (including ones that have been deleted)"? What can we do with this number?
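To make the difference between these counts concrete, here is a toy sketch using the 1,000-file example from earlier in the thread. It is only a model under stated assumptions: branches are treated as directory prefixes (roughly the Perforce-style view described above), a commit is a plain dict, and none of this reflects how Google's system actually stores anything.

```python
from typing import Optional

# Toy model: a commit maps path -> new content, or None for a deletion.
Commit = dict[str, Optional[str]]

def replay(history: list[Commit]) -> tuple[int, int, int]:
    head: dict[str, str] = {}    # files present at the latest revision
    ever_seen: set[str] = set()  # every path that has ever existed
    revisions = 0                # every stored version of every path
    for commit in history:
        for path, content in commit.items():
            if content is None:
                head.pop(path, None)  # deleted at head, still in history
            else:
                head[path] = content
                ever_seen.add(path)
                revisions += 1
    return len(head), len(ever_seen), revisions

history: list[Commit] = [
    {f"main/f{i}.c": "v1" for i in range(1000)},     # 1,000 files on main
    {f"release/f{i}.c": "v1" for i in range(1000)},  # branch modelled as a copy
    {f"main/f{i}.c": None for i in range(500)},      # delete 500 from main
]

at_head, ever_existed, total_revisions = replay(history)
print(at_head, ever_existed, total_revisions)  # 1500 2000 2000
```

Under this model, "files including deleted ones" is the middle number. It only grows over time, whereas the head count can shrink.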

> hope that clears things up.

It's helping. Thanks.



