Measuring the many sizes of a Git repository

bluedino · on March 6, 2018

>> What we find is that many of the repositories that tax our servers the most are not unusually big. The most challenging repositories to host are often those that have an unusual internal layout that Git is not optimized for.

Like CocoaPods!

https://github.com/CocoaPods/CocoaPods/issues/4989#issuecomm...

liquid_room · on March 7, 2018

GitHub is pretty awesome :)

colemannugent · on March 6, 2018

I chuckled a bit when their git-sizer tool pointed out a high level of concern for the 66 parent octopus merge in the Linux kernel.

See https://www.destroyallsoftware.com/blog/2017/the-biggest-and... for the story behind the Cthulhu commit.

LukeShu · on March 7, 2018

From the article:

> Update: it was [an accident][1], which Linus responded to in his usual fashion.

> [1]: http://lkml.iu.edu/hypermail/linux/kernel/1603.2/01926.html

Those who only know Linus from his rants might be surprised that here "his usual fashion" means:

- Acknowledging that the root cause was GitHub's documentation being misleading.

- Not blaming the contributor for being misled by GitHub: "I can see why that documentation would make you think it's the right thing to do."

- Admitting that the ease with which the accident happened is a deficiency in Git's UI.

- CCing the Git maintainer to discuss improving Git to make it harder to do this by accident. (This eventually led to the --allow-unrelated-histories flag being needed to do this kind of merge.)

pandem · on March 6, 2018

> The Linux kernel has been developed over 25 years by thousands of contributors, so it is not at all alarming that it has grown to 1.5 GB. But if your weekend class assignment is already 1.5 GB, that’s probably a strong hint that you could be using Git more effectively!

Git is only 12 years old, how does Linux have 25 years of history there? As far as I know Linux used patches on mailing lists before git, are those also somehow transferred to the repo?

geofft · on March 6, 2018

"The repo" only dates back to v2.6.12-rc2 shortly after the first release of git, but there are repos with imported code from previous VCSes:

https://stackoverflow.com/questions/3264283/linux-kernel-his...

https://landley.net/kdocs/fullhist/

The first link describes using git's "grafts" feature to make the UI believe that the first commit in the normal repo actually has parents, which means you can use the repo normally and agree with everyone else about commit numbers, but also `git log` will go all the way back to Linux 0.0.1. I had this setup on my work machine in 2012 and it was useful a few times, but in the last couple of years I haven't really needed to see history past 2.6.12.

(But yes, the history repos don't explain the size of the normal linux.git repo - except to the extent that you need to spend over a decade writing an OS to get even that many lines of code in the first commit and that much activity shortly thereafter.)

__david__ · on March 6, 2018

It doesn't:

    commit 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 (tag: refs/tags/v2.6.12-rc2)
    Author: Linus Torvalds <torvalds@ppc970.osdl.org>
    Date:   Sat Apr 16 15:20:36 2005 -0700

    Linux-2.6.12-rc2

    Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

mdaniel · on March 6, 2018

> patches on mailing lists before git, are those also somehow transferred to the repo

Given that conceptually git is just a linked list of patches, I can't imagine why they wouldn't have that history

pedrocr · on March 6, 2018

Actually that's what VCSs before git used to be and what git changed. Git doesn't keep patches, it keeps full states of the repository in a content addressable fashion. It's one of its key insights. Instead of having to have an always correct way to encode deltas just encode the state itself and leave it to the tools to figure out what the diff should be. That way you're not encoding in your disk format something that can be done better in a later version of the tool.
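
A minimal sketch of the content-addressing idea, in Python. Git names a file's contents (a blob) by hashing a short header plus the raw bytes, so identical content gets the same ID no matter which commit it appears in. The function name `git_blob_id` is made up for illustration, and the sketch assumes the classic SHA-1 object format:

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git would assign to a blob with this content."""
    # A blob is hashed as b"blob <size>\0" followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # The same bytes always yield the same ID, regardless of history.
    print(git_blob_id(b"hello\n"))
```

If Git is installed, the result should agree with `git hash-object` run on a file holding the same bytes.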

simcop2387 · on March 7, 2018

That said, git doesn't just store direct copies either. It bundles objects up into what it calls packfiles, which apply compression and delta encoding to reduce disk space and make it quicker to find a given version of a file.

https://git-scm.com/book/en/v2/Git-Internals-Packfiles

pedrocr · on March 7, 2018

Git the tool does packfiles but that's an implementation detail. Git the VCS can work with any object storage backend.

kzrdude · on March 6, 2018

Well, I don't think anyone has the complete history in terms of patches. It just doesn't exist, without being reconstructed.

chimeracoder · on March 7, 2018

> Given that conceptually git is just a linked list of patches, I can't imagine why they wouldn't have that history

You can construct a linked list of patches for any series of commits, but git doesn't actually store patches or diffs - only raw content.
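
A rough illustration of that point in Python: given only full snapshots, a patch between any two of them can be computed on demand. This uses the standard library's `difflib` rather than Git's own diff machinery, and `patch_between` is a hypothetical helper name:

```python
import difflib


def patch_between(old: str, new: str, path: str = "file.txt") -> str:
    """Derive a unified diff from two full snapshots of a file."""
    diff = difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff)


if __name__ == "__main__":
    # Only the snapshots are "stored"; the patches are derived when asked for.
    snapshots = ["one\n", "one\ntwo\n", "one\n2\n"]
    for old, new in zip(snapshots, snapshots[1:]):
        print(patch_between(old, new))
```

Because the diff is computed at read time, a better diff algorithm in a later version of the tool improves the output for old history too.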

Volt · on March 6, 2018

They had version control before Git. Git was born after BitKeeper's licensing changed.

ktpsns · on March 6, 2018

I noticed just today that GitHub has a number of counter-measures for absurd git repositories built in when you try to push something. For instance, I imported a huge (3GB, mostly due to large frequently updated files in the history) subversion repository to git and got failures due to individual commits exceeding 100MB. This was quite helpful to bring the size of my repository to a reasonable state. Tools like https://rtyley.github.io/bfg-repo-cleaner/ are indispensable for doing this kind of filtering without headaches.
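
A small sketch, in Python, of a pre-push sanity check along these lines. It only walks the working tree (not the history), and the 100 MB threshold is simply the figure mentioned above, not something the script queries from GitHub. `files_over_limit` is a hypothetical helper name:

```python
import os

# Assumed threshold: 100 MB, the figure from the comment above.
LIMIT_BYTES = 100 * 1024 * 1024


def files_over_limit(root: str, limit: int = LIMIT_BYTES):
    """Yield (path, size) for regular files under root that are larger than limit."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into Git's own object store.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            size = os.path.getsize(path)
            if size > limit:
                yield path, size


if __name__ == "__main__":
    for path, size in files_over_limit("."):
        print(f"{size / 1024 / 1024:8.1f} MB  {path}")
```

Large files that exist only in older commits will not show up here; finding those needs a history-aware tool such as git-sizer or BFG.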

nerdponx · on March 7, 2018

BFG is a lifesaver. If only I could convince my employer to donate! (Yes I already donated privately).

lilyball · on March 6, 2018

This looks great. Surprised to see no package manager support for it though. I'd love to see MacPorts or Nix support for this.

deadbunny · on March 7, 2018

While they're at it why not: dpkg, docker, entropy, flatpak, guix, ipkg, netpkg, opkg, pkgng, pacman, rpm, snappy?

It's better to leave packaging to each distro's maintainers, or to super keen folks who want to do it specifically for your project, rather than spending 80% of your time preparing the release packaging for every single package manager there is. Even then, those folks will only be super keen about one or two platforms.

lilyball · on March 7, 2018

Two things:

1. Releasing a new binary tool without any package manager support just sucks for your users in general, because it means they're required to manually install it and most of them will probably end up with a horribly-outdated version of your tool installed for a long time because their package manager can't ever tell them that it's out of date.

2. macOS isn't a distro, so you can't just say "let your distro maintainers do it". If you don't submit to MacPorts, the only way you'll get in there is if someone else steps up to submit on your behalf, but that kinda sucks because your package will likely end up out-of-date in MacPorts unless the volunteer maintainer is super diligent about noticing new releases and updating the Portfile.

Nix is a more general-purpose packaging system, but it also suffers from this problem. In fact, in my experience, Nix packages do tend to be out of date for a while before someone notices and fixes it.

FWIW I don't really expect people to actually submit their own tools to Nix anyway, because there's a fairly steep learning curve there, but it would be really awesome if people did. But submitting to MacPorts is more straightforward.

deadbunny · on March 8, 2018

> 2. macOS isn't a distro, so you can't just say "let your distro maintainers do it". If you don't submit to MacPorts, the only way you'll get in there is if someone else steps up to submit on your behalf, but that kinda sucks because your package will likely end up out-of-date in MacPorts unless the volunteer maintainer is super diligent about noticing new releases and updating the Portfile.

So by that logic I have to support mac/nix over every other system by default as they don't have maintainers? That sounds like a mac/nixos problem, not a developer problem.

lilyball · on March 8, 2018

If you want your tool to actually get used, you should put in at least a little effort towards trying to get it in package managers. I don't know why you're acting so surprised about that.

harshbutfair · on March 7, 2018

This seems like a very useful tool, and it provides useful information on our repository.

But in RHEL7 it gives this error: error: couldn't open Git repository: git rev-parse failed: Unknown option: -C

I assume it requires a later version of git.

jwilk · on March 7, 2018

git-sizer uses the -C option, which was added in git v1.8.5:

https://github.com/git/git/commit/44e1e4d67d5148c245db362cc4...

RHEL7 ships with git v1.8.3.1.

There might be other compatibility issues, e.g.: https://github.com/github/git-sizer/issues/18
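
A quick way to check for this up front, sketched in Python. The 1.8.5 threshold comes from the commit linked above; the helper names `parse_git_version` and `supports_dash_c` are made up for illustration:

```python
import re
import subprocess

# `git -C <path>` first appeared in Git 1.8.5, per the commit linked above.
MIN_VERSION_FOR_DASH_C = (1, 8, 5)


def parse_git_version(text: str) -> tuple:
    """Extract the numeric version from `git --version` output."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def supports_dash_c(version_text: str) -> bool:
    """True if this Git version is new enough to accept the -C option."""
    return parse_git_version(version_text) >= MIN_VERSION_FOR_DASH_C


if __name__ == "__main__":
    out = subprocess.run(
        ["git", "--version"], capture_output=True, text=True, check=True
    ).stdout
    print(out.strip(), "->", "OK" if supports_dash_c(out) else "too old for -C")
```

Comparing tuples of integers avoids the trap of comparing version strings, where "1.8.10" would sort before "1.8.5".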

deadbunny · on March 7, 2018

If you need modern (but stable and maintained by Rackspace) packages in CentOS check out https://ius.io/

zspitzer · on March 7, 2018

I've always wondered why GitHub doesn't display the size of files at the folder level. The only way on the website is to drill down to the individual file.

drvd · on March 7, 2018

[flagged]

sctb · on March 7, 2018

I can't tell if this is a reverse troll or a regular troll. But please don't do this; comment civilly and substantively instead.

https://news.ycombinator.com/newsguidelines.html