> The Google codebase includes approximately one billion files and has a history...

bananapub · on Feb 13, 2023

you're misunderstanding a bunch of things.

> The total number of files also includes source files copied into release branches

I guess you haven't used Perforce or similar. a branch is a sparse copy of just the changed files/directories. they are not used very much.

> files that are deleted at the latest revision

so it means "one billion files have existed in the history of the repo, some are currently deleted".

> I don't think it's entirely clear what the paper even means when it talks about "a file" in a source code repository,

seems pretty clear - a source code repo has lots of files. at the most recent revision, some exist, some were deleted in some past revision. more will be added (and deleted) in later revisions.

it's very much not the same model as git.

hope that clears things up.

Karellen · on Feb 13, 2023

> you're misunderstanding a bunch of things.

It certainly feels that way :-)

> > The total number of files also includes source files copied into release branches

> I guess you haven't used Perforce or similar. a branch is a sparse copy of just the changed files/directories.

Still not sure I see the distinction. Surely "sparse" or "not sparse" is an implementation detail. If I create a new branch in git, the files that are unchanged from its parent branch share the same storage, but the files that have changed use their own storage.
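
To make the git side of that concrete, here is a minimal sketch of content-addressed storage. The blob ID calculation follows how git hashes a blob (SHA-1 over a `blob <size>\0` header plus the content); the `store` dict, the `put` helper, and the file names and contents are made up for illustration and say nothing about how git (or Perforce) lays things out on disk.

```python
import hashlib

def blob_id(content: bytes) -> str:
    # git's blob ID: SHA-1 of "blob <size>\0" followed by the content
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Hypothetical object store: blob ID -> content
store = {}

def put(content: bytes) -> str:
    # Identical content always maps to the same ID, so it is stored once
    oid = blob_id(content)
    store[oid] = content
    return oid

# A "branch" here is just a mapping of path -> blob ID (illustrative only)
main = {"a.c": put(b"int a;\n"), "b.c": put(b"int b;\n")}

# New branch: starts with the same blob IDs, no content is copied
release = dict(main)

# Only the file that changes on the branch gets a new blob
release["b.c"] = put(b"int b = 1;\n")

print(len(store))                      # unique blobs stored across both branches
print(main["a.c"] == release["a.c"])   # unchanged file shares one blob
```

Two branches with two paths each give four (branch, path) pairs, but only three blobs are stored, which is the sense in which unchanged files "share the same storage".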

> so it means "one billion files have existed in the history of the repo, some are currently deleted".

I guess I'm struggling to understand what the point of this metric is? I get why "Total number of commits", "Total storage size of repo in GB/TB/PB", "Number of files in current head/main/trunk", or even "total number of distinct file revisions in repo history", could be useful metrics.

But why "number of files (including ones that have been deleted)"? What can we do with this number?
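
For what it's worth, here is a rough sketch of how these counts differ on a toy history. The history format (one dict per revision, `None` meaning a deletion), the `repo_metrics` function, and the paths are all hypothetical; this is not how the paper or any real VCS computes them.

```python
# Toy history: each revision maps path -> new content, or None for a deletion.
# Paths and contents are invented for illustration.
history = [
    {"a.c": "v1", "b.c": "v1"},   # rev 1: add a.c and b.c
    {"a.c": "v2"},                # rev 2: modify a.c
    {"b.c": None, "c.c": "v1"},   # rev 3: delete b.c, add c.c
]

def repo_metrics(history):
    head = {}             # path -> content at the latest revision
    ever_existed = set()  # every path that has existed at any revision
    file_revisions = 0    # each add or modify counts as one file revision
    for rev in history:
        for path, content in rev.items():
            if content is None:
                head.pop(path, None)   # deleted at head, still in ever_existed
            else:
                head[path] = content
                ever_existed.add(path)
                file_revisions += 1
    return {
        "commits": len(history),
        "files_at_head": len(head),
        "files_ever_existed": len(ever_existed),
        "file_revisions": file_revisions,
    }

print(repo_metrics(history))
```

On this history the "files ever existed" count (3) sits between "files at head" (2) and "distinct file revisions" (4), which is the number the quoted sentence seems to be describing.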

> hope that clears things up.

It's helping. Thanks.