
i'm not sure git is the right choice for versioning big binary files.



The project is based on git-annex, which is an extension that treats big binary files differently. Namely, it doesn't check in the file contents, so you don't get full-file versioning. You can find out more at http://git-annex.branchable.com/.


Technically you can use the SHA backend for git-annex, so the actual file contents are keyed and tracked by their hash, giving you "full file versioning". The content just isn't checked into git itself.
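
Roughly what that looks like on the command line (just a sketch; backend names vary a bit between git-annex versions, SHA256 assumed here):

    # key annexed files by their SHA256 hash instead of the default backend
    echo '* annex.backend=SHA256' >> .gitattributes

    git annex init "my laptop"
    git annex add bigfile.iso     # content goes under .git/annex keyed by its hash;
                                  # only a small symlink/pointer is committed
    git commit -m "add bigfile.iso"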


Additionally, there is a bup backend for git-annex. Bup is targeted as an rsync-like backup tool that can do incremental backups and ought to work well with large binary files. https://github.com/apenwarr/bup/
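
If I remember the syntax right, hooking bup up as a git-annex special remote goes roughly like this (remote name and host are made up):

    # one-time setup of a bup-backed special remote
    git annex initremote mybup type=bup encryption=none buprepo=example.com:~/annex.bup

    # push large file contents into the bup repository
    git annex copy bigfile.iso --to mybup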

It was through Bup that I originally discovered git-annex.


hm ok, looks interesting, i'll have to read more about its internals, thanks.


git-annex allows managing files with git, without checking the file contents into git. While that may seem paradoxical, it is useful when dealing with files larger than git can currently easily handle, whether due to limitations in memory, time, or disk space.

-- http://git-annex.branchable.com/
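
The day-to-day flow is plain git plus a couple of annex commands; a rough sketch:

    git annex init "workstation"
    git annex add video.mkv       # content stored in .git/annex, a symlink is committed
    git commit -m "add video"

    git annex get video.mkv       # fetch the content from a remote that has it
    git annex drop video.mkv      # free local space (refuses unless another copy exists)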


ok, read it. what's the advantage of using this over rsync? it's git + haskell + rsync, loads of dependencies, to make git do something it's not designed to do. or am i missing something crucial?


If you're using git for revision control, it lets you "version" files without having all the historical versions locally.

Example: You check in bigencryptedfile.big, which is 100MB. Then you modify it, and check it in again. Repeat 8 more times.

In git, with or without git-annex, you can check out the repository at any point and end up with the local files from that revision.

In normal git, your local repository is now a gigabyte (the encryption in this hypothetical file prevents git from being able to delta-compress; in reality git would likely be able to compress it somewhat, but it still may be hundreds of megabytes).

With git-annex, all the previous copies are stored on the SERVER, but not in git itself. Even if you don't care about local hard drive space since hard drives are cheap, consider that if I then clone your repository, I would only need to download 100MB instead of 1GB. The downside, of course, is that you need to be connected to the server to get historical versions of a particular file.
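
Getting an old revision back looks roughly like this (file name from the example above; a remote is assumed to still hold the old content):

    # check out the old tree; the committed symlink now points at the old content's key
    git checkout HEAD~5 -- bigencryptedfile.big

    # fetch that old content from whichever remote still has it
    git annex get bigencryptedfile.big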

When you're dealing with, say, a game project with 20GB of binary data that's been versioned 20x on average, you end up with 400GB to clone your art repository, which is a non-trivial download size. And if, for some reason, you want your repository cloned to multiple folders on your drive, then again you're using 20GB each instead of 400GB each. Even on cheap hard drives, multiple folders of 400GB each add up quickly.

EDIT: OH, and one other advantage of doing it this way: If you just use rsync, and you accidentally overwrite a file and don't notice for a day or two, rsync will happily destroy your backup file as well, while git-annex will just store a new revision. Should have thought of that first. ;)


git-annex keeps track of what file is where, including any duplicate copies you wish to keep on other storage media. If I want some file I archived, git-annex will tell me which external disk it's on (and it can do S3 and some other online storage services too). rsync keeps track of nothing between runs.
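
For example (remote name is made up; S3 wants your credentials in AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY):

    # which repositories/disks have a copy of this file?
    git annex whereis photos-2010.tar

    # an S3 special remote is set up once, roughly like this:
    git annex initremote cloud type=S3 encryption=shared
    git annex copy photos-2010.tar --to cloud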

I find it really useful for archiving large files - an entirely different use case from git's usual one.


It also keeps the hash of the files, so you can verify their integrity even without comparing to a different copy.
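
Something like this (remote name hypothetical):

    # recompute the local content's hash and compare it to the key
    git annex fsck somefile.tar

    # or, if I recall correctly, check the copy on a particular remote
    git annex fsck somefile.tar --from cloud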


Why? What is?


because your local history is bound to get really big in a short period of time?

what is? well, I don't really think anything is. it works with svn and dropbox, but that doesn't mean either of those is a good choice.

obviously git etc. is mainly designed for text files. i've long thought about what the right way to approach this issue is.

edit: will read more about annex


Some of the commercial version-control systems handle big binaries reasonably well. That's one reason many game companies, for example, use Perforce, since it doesn't choke on piles of art assets.



