While I'm sure this will help some people use git to address a use case that was previously impossible with git, I can't help but feel that it is a bad step overall for the git ecosystem.
It appears to centralize a distributed version control system, with no option to continue using it in a distributed fashion. What would be wrong with fixing/enhancing the existing git protocols to enable shallow fetching of a commit (I want commit A, but without objects B and C, which are huge)? Git already fully supports working from a shallow clone (not the full history), so it wouldn't be too much of a stretch to make it work with shallow trees (I didn't fetch all of the objects).
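For the history side, git already has the plumbing; a rough illustration (the repo URL is just a placeholder):

  git clone --depth 1 https://github.com/example/project.git   # shallow history: newest commit only
  git fetch --unshallow                                         # deepen to the full history later

A "shallow tree" mode would be the analogous thing for individual huge objects.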
I'm sure git LFS was the quickest way for github to support a use case, but I'm not sure it is the best thing for git.
You could extend the git-lfs "pointer" file to support secure distributed storage using Convergent Encryption [1]. Right now, it's 3 lines:
version https://git-lfs.github.com/spec/v1
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
size 12345
By adding an extra line containing the CHK-SHA256 (Content Hash Key), you could use a distributed p2p network like Freenet to store the large files, while keeping the data secure from other users (who don't have the OID).
version https://git-lfs.github.com/spec/v2proposed
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
chk sha256:8f32b12a5943f9e0ff658daa9d22eae2ca24d17e23934d7a214614ab2935cdbb
size 12345
That's how Freenet / Tahoe-LAFS / GNUnet work, basically.
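A very rough sketch of what the client side could look like for one blob (the use of openssl/AES-CTR here is my own assumption, not part of any spec): the key is the SHA-256 of the plaintext, so only someone who already has the OID can decrypt, yet identical files always produce identical ciphertext and dedupe on the p2p store.

  key=$(sha256sum big-asset.psd | cut -d' ' -f1)                       # key = H(plaintext)
  openssl enc -aes-256-ctr -K "$key" -iv 00000000000000000000000000000000 \
      -in big-asset.psd -out big-asset.psd.enc
  chk=$(sha256sum big-asset.psd.enc | cut -d' ' -f1)                   # value for the proposed "chk" line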
Mercurial marks their Largefiles[0] support as a "feature of last resort", i.e. enabling this breaks the core concept of what a DVCS is, as you now have a central authority you need to talk with. But at the same time, many people who use Git and Hg use them with a central authoritative repo.
++! When I was in the games industry, it was extremely important to have this feature (and yes, it was a last resort!). This is why, back then, we chose Mercurial over Git.
Unfortunately there was a lot of wackiness, and far too often assets got out of sync. We ended up regressing and putting large assets (artwork) into a Subversion repo instead.
I wish there were a better option, such as truncating the history of largefiles, but that seems to break the concept of Git/Mercurial even more than the current "fix".
Indeed that was about 5 years ago. The problems generally were around assets getting out of sync, and occasionally corruption when uploading to the large-file storage server.
Problems generally occurred when a client timed out in the middle of an upload or download.
These were troublesome issues (and silent failures), which made it unusable for production use.
Hope they got it fixed; it was a great concept, and well ahead of Git in attempting to solve this!
git-lfs (and similar systems) split up the storage of objects into the regular git object store (for small files) and the large file storage. This allows you to configure how you get and push large files independently of how you get and push regular objects.
A shallow clone gives you some `n` commits of history (and the objects that they point to). Using LFS allows you to have some `m` commits worth of large files.
If you want a completely distributed workflow, and have infinite local storage and infinite bandwidth, you can fetch all the large files when you do a normal `git fetch`. However, most people don't, so you can tweak this to get only the parts of the large-file history that you're interested in.
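For example (flag and config names as I remember them from the git-lfs docs; check `git lfs fetch --help` for your version):

  git lfs fetch origin                         # just the objects for the current checkout
  git lfs fetch --all origin                   # every object ever referenced, fully-distributed style
  git config lfs.fetchinclude "textures/**"    # only auto-pull large files under textures/
  git config lfs.fetchexclude "video/**"       # never auto-pull these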
Indeed this is a trade-off that requires some centralization, but so does your proposed solution of a shallow clone. This adds some subtlety and configurability around that.
Perhaps this could be done by adding a .git_sections file which keeps track of different sets of files you might want to check out but don't need to. You could define different targets (and a default): say you are working on a large video game, you could have one repository for everything, but define "artists", "programmers", and "full" targets, where artists keep their huge assets together with the rest of the repo and programmers do shallow pulls, not constantly fetching asset files which may or may not be necessary for what they're working on.
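Something like this, purely hypothetical (the format is made up, just to sketch the idea):

  # .git_sections
  [target "full"]
      paths = *
  [target "artists"]
      paths = code/** assets/**
  [target "programmers"]
      paths = code/**
  default = programmers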
Neat! Is there a way for me to serve those files so people can then use my repository as the authoritative source? I see that they still have the concept of remotes, so maybe things are getting there?
git lfs push should work to push files that are new to the repo, but I'm not sure it works to push files that exist in the repo but are new to the server (because it is a new remote).
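If I'm reading the docs right, something like this should seed a brand-new remote with every object the repo references, not just the freshly added ones (remote name is made up):

  git remote add mirror git@my-server:project.git
  git push --all mirror
  git lfs push --all mirror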
To us, things like git-lfs encourage us to ensure there is an open source package that supports it, to keep the D in DVCS, so we'll add support to our community edition too.
GitLab giving me an alternative is awesome and commendable. If I'm able to clone a repo from GitHub, including all of the LFS objects, and then push it all to GitLab, that would be better than nothing! Is GitHub contributing to your effort to have an LFS server that is open and free?
Let me give an example of one way it hurts the existing git ecosystem. Someone decides to include their external source dependencies for their project as tarballs using LFS (which is probably dumb and not the use case that LFS is trying to support, but people will do it nonetheless). Now I want to mirror that repository inside my company's firewall, which hosts its git repositories using just git over ssh. Without LFS, I would just do 'git clone --mirror && git push --mirror' and I internally have a mirror that is easy to keep up-to-date, is dependable, supports extending the source in a way that is easy to contribute back, etc.
Now what options do I have with LFS (outside of GitLab)? Create a tarball with all the source plus LFS objects in it? Create a new repository that doesn't use LFS and commit the files to that? Each of these is less than ideal and makes contributing back to the project harder.
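For comparison, the LFS-aware equivalent of that mirror looks roughly like the following, and it only works if the internal host also speaks the LFS API, which plain git over ssh does not (host names are placeholders):

  git clone --mirror https://github.com/example/project.git
  cd project.git
  git lfs fetch --all origin
  git remote add internal git@internal-host:project.git
  git push --mirror internal
  git lfs push --all internal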
Imagine instead a world where this happened: Github.com announces that they are adding large file support to git. These large files will use an alternative object store, but the existing git, HTTP, and SSH protocols will be extended to support fetching these new objects. When support for it lands in the git mainline repository, suddenly everyone will be able to take advantage of it, regardless of how they choose to host their repositories!
I admire GitLab for creating an open source server implementation. I just wish that GitHub had done it a different way that would have been better for the overall git community (not just GitHub users).
Usually large files are binary blobs (PSD, .ma, etc.), and it becomes incredibly easy to blow away someone's work by not pulling before every file you edit (or when two people edit at the same time).
As much as some people hate Perforce, that's exactly what it is set up to do. Plus its binary syncing algorithms are top-notch. We used to regularly pull a ~300 GB art repo (for gamedev) in ~20 minutes.
Git is great for code but this seems like square peg, round hole to me.
I've read the replies to vvanders and he's correct. With binaries you really want some sort of global locking (easy with a centralized system, hard with a distributed system).
I believe his (her?) point is that for a very large class of binaries there is just no upside in parallel development; one guy is going to squash the other guy's work. You want to serialize those efforts.
We don't have global locks yet but we know how to do them, just waiting for the right sales prospect to "force" us to do them. I'm 90% sure we could add them to BK in about a week.
Git annex solves this without locking or losing any of the versions: the actual files get different names (based on hashes of the contents), which are referenced by a symlink tracked by git. If two people edit the same file, pointing the symlink at different filenames, you get a regular git merge conflict.
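Roughly, in the default (indirect) mode:

  git annex add bigmodel.blend    # content moves under .git/annex/objects/, named by its hash
  ls -l bigmodel.blend            # the working-tree file is now a symlink to that object
  git commit -m "add model"       # git itself only tracks the small symlink
  # If two people point the symlink at different content, `git merge` raises an
  # ordinary conflict on the symlink; both versions' content is still available.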
No /automatic/ merge resolution. Obviously you have a tool that can open the files (if you edited them in the first place), and you can use that to view the differences and replay one set of changes. The fact that the SCM detected the conflict, alerted you, and allowed you to resolve it is a solid improvement over not using an SCM. Further, automatic merge resolution isn't always possible with text-based assets either (and even when it is, it isn't always the correct option!).
Yet I can still use it to resolve a merge conflict.
In Photoshop you would do this by opening both images, and visually comparing them to see what's different, then copying the appropriate parts from one to another. Instead of just visually comparing them, you might combine them into one file as separate layers and use a difference blend to see what changed. If the tool doesn't support that, use ImageMagick to generate a graphical diff (either the `compare` or `composite` commands), and then copy the relevant parts from one to another.
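For example (with both versions exported/flattened to PNG first):

  compare theirs.png mine.png diff-highlight.png                            # marks differing pixels
  convert theirs.png mine.png -compose difference -composite diff-mask.png  # raw per-pixel difference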
We have fancy tools to help us, but fundamentally, merging is a /human/ operation, that requires human judgment to see how multiple sets of changes can be made to coexist. And that doesn't require a tool (though it can certainly help).
> In Photoshop you would do this by opening both images, and visually comparing them to see what's different, then copying the appropriate parts from one to another.
Good luck.
The point the parent was trying to make is that the lock operation in SVN was quite convenient for preventing a dual-edit scenario on assets that aren't easy to merge, like 3D meshes, scenes, PSDs, etc. It's easy to sit in the ivory tower of text merge resolution given how easy it is in comparison. The atoms of change in other tools are quite a bit less obvious. Sure, you can diff a mesh, but merging usually just means redoing it or picking one or the other.
Cool, how do you merge After Effects, AutoCAD, Cinema 4D, Unreal 3 packages, Illustrator, Sketch, Blender, XSI, Lightwave, or any of the other production packages that I've seen used in shipping actual products? What happens if no one used layers in your Photoshop file, and the history was collapsed to save performance?
There's a reason Pixar and most mid-to-large game dev studios use Perforce or similar tech: fundamentally you need locking if you're working with binary assets.
Your intelligent merge tool has access to the file history. If one user modifies a layer, and the other one squashes them down, in the merge you probably want to apply the changes in that order, even if it's out-of-order chronologically. If the file format has a full edit history baked in, great; even more info for the intelligent merge. Maybe the in-file history can even be kept in sync with the repository level history.
In the current ecosystem you probably need locking for your sanity, but some day software will suck less.
I can't imagine this is intended to compete head-to-head with something like Perforce. As you've pointed out it simply can't. But for a repo that's mostly code with binary assets that get updated occasionally it's probably a god-send.
> In Photoshop you would do this by opening both images, and visually comparing them to see what's different, (...)
Yeah, good luck with that. With software like Photoshop, not every change is obvious or easily visible. Maybe the other guy tweaked the blending parameters of a layer, or reconfigured some effects layers. Or modified the document metadata. Or did thousands of other things that are not immediately visible, or for which it is hard to determine the sequence in which changes should be reapplied.
Maybe you can manually merge the two files to some reasonably good approximation of the intended result, but you can never be sure you didn't miss something important.
Merging tools for text files show enough information for you to know when you've seen every change made. You can't have that with complex binary formats used by graphics programs, mostly because those formats were never explicitly designed to support merging.
> Yeah, good luck with that. With software like Photoshop, not every change is obvious or easily visible.
Well, there's the time-honored technique of "rapidly switch between windows that are zoomed to the same place." But, more rigorously, I mentioned a way to do this; there are tools that can do a diff of raster images--which is what you are making at the end of the day with Photoshop. Sure it can't tell you what blurring parameters someone changed, but you can see that the blur changed, then you can go look at the parameters.
> Or did thousands of other things that are not immediately visible
The trickiness of that situation isn't unique to binary formats. It comes up with code too.
> Maybe you can manually merge the two files to some reasonably good approximation of the intended result, but you can never be sure you didn't miss something important.
That's just as true with code as it is with other formats!
> because those formats were never explicitly designed to support merging.
Neither was text. We just ended up making some tools that were reasonably decent at it.
I've been there, I've done that. I've done the 3-way merge with Photoshop files, and resolved the conflicts with 2 different people working on an InDesign file, and broken down to running `diff` on hexdumps of PDF files. Resolving merges with things that don't have nice tools for it isn't fun.
But it's a /lie/ to claim that a conflict for binary formats is "game over, you're just going to steamroll someone's work, there is no path to merge resolution". It's not a fun path, but it's not game over. Which is all I was really trying to refute.
(aside: It's interesting to me that this chain of comments went from being upvoted last night to downvoted this morning.)
> Well, there's the time-honored technique of "rapidly switch between windows that are zoomed to the same place." But, more rigorously, I mentioned a way to do this; there are tools that can do a diff of raster images--which is what you are making at the end of the day with Photoshop. Sure it can't tell you what blurring parameters someone changed, but you can see that the blur changed, then you can go look at the parameters.
I guess this could work in simple cases and if you accept a less-than-pixel-perfect standard; I can see how this will fail when several people are working on a single file for long (because not everything that is important is visible in a visual diff; at the very least you'd end up overwriting whatever scaffolding the other guy set up for his own work). But at that point I'd be questioning any workflow that requires two or more people to work simultaneously on a single asset.
> The trickiness of that situation isn't unique to binary formats. It comes up with code too.
> That's just as true with code as it is with other formats!
Not really - text files don't contain any more data than you can see when you open them in your editor. With text, you see everything. When you open a 3D model or a PSD file, or even a Word document, what you see is just the tip of the iceberg.
> But it's a /lie/ to claim that a conflict for binary formats is "game over, you're just going to steamroll someone's work, there is no path to merge resolution". It's not a fun path, but it's not game over. Which is all I was really trying to refute.
I can agree with that. It's not impossible to do such merges; worst case scenario, one will end up praying to a hex editor like you say you did. It can even be fun sometimes. I guess what 'vvanders was arguing about is practicality - you can do it if you're willing to invest the time, but it's much better to not have to do it at all.
> (aside: It's interesting to me that this chain of comments went from being upvoted last night to downvoted this morning.)
HN moves in a mysterious way
its voting to perform;
A reader questions his comment's downvotes,
And thus ensues shitstorm.
That is to say, sometimes it's so random and fluctuating that personally, I stopped caring. If I get downvotes I usually a) already know I deserve them for being an ass and/or factually incorrect, b) have someone tell me why I deserve them, or c) assume they're random fluctuations and not worth getting upset about. I think we're dealing with type c) now.
So your suggestion for merging two 3D models that were made in Blender is to use ImageMagick compare to see what's different and then copy the differences from one Blender file to the other?
Is that really inherent to the data, or a case of useful merge plug-ins just not having been written yet for the proprietary file formats of big closed source apps?
Source code doesn't ~really merge all that well either; there's just been a big community of software developers collaboratively fixing up their tools for collaboration.
Of course we have it easy because the tools we're using are ~made of the same stuff we work with daily. No amount of Photoshop filter experience will enable you to write a program that intelligently merges two Photoshop files.
After we add support for git-lfs we plan to add web UI locking for files. This will allow you to lock files when browsing them and prevent others from uploading them.
Won't central storage for the large files also make it straightforward to add locking functionality in a future version of git-lfs, or as an add-on? I agree it sure looks like an omission to have a VCS that is centralized and aimed at binary data without any locking functionality.
I don't even necessarily need my client to mark files as read-only on my local machine. A system that just lets me query who has the lock on a certain file, and lets me take the lock if no one has it, is miles ahead of shouting/email.
A remote-only locking system should be pretty easy to implement, e.g. by just dropping a "filename.userid.lock" file into the filesystem next to the file in question.
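As a sketch (the host and directory names are made up; mkdir is used because it fails atomically if the lock already exists):

  name="hero.psd"
  if ssh assets-server "mkdir locks/$name.$USER.lock 2>/dev/null"; then
      echo "lock on $name acquired"
  else
      echo "$name is already locked by:"; ssh assets-server "ls -d locks/$name.*.lock"
  fi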
Translation: because it is not useful for you it's not useful for anybody else.
That's nonsense of course.
There are a lot of use cases where this would be very helpful without locking (e.g. jars/DLLs).
This is useful now, locking can come later. We don't have to solve every conceivable problem all at once. More progress is made in small incremental steps than big bang leaps.
Is it safe to assume that GitLab's implementation of Git LFS will allow hosting the file storage server on premises, and potentially on another machine than the one running GitLab?
Will you be supporting both indefinitely or is there a plan to transition to a single well-supported solution for large files over the coming N releases?
There is a cost to everything and we try to be pragmatic. Git Annex is causing lots of work for us in the Omnibus packages. Video 2000 was also technically better than VHS, but people still stopped producing equipment for it much sooner.
I haven't been following the various Git large file solutions - can someone comment on how this implementation compares to git-annex or whatever else is out there?
BAM works on a similar idea: instead of saving large files in the local repository, users can save them on a centralized server. This saves disk space and network transfer time.
However, unlike other solutions, BAM preserves the semantics of distributed development.
Instead of requiring a single or standardized set of servers, every user can have a different BAM server. Data is moved between servers automatically and on demand.
One group in an office might use a single BAM server for storing all their data close and locally. When another development group is started in India, they can use a server local to them. The binary assets will automatically transfer to the India server as commits are pulled between sites.
This allows centralized storage of your data and yet still supports having a team work while completely disconnected from the internet.
I've been using BAM for quite a while (I'm one of the developers of it). I use it to store my photos. I have 55 GB of photos in there, and backing them up is just:
cd photos
bk push
Works pretty well. When my mom was still alive we pushed them to her iMac, and the screen saver was pointed at that directory. So she got to see the kids and I got another backup.
I worked at Unity for a couple years and they are one of the biggest users of (and maintainers of) the Mercurial LargeFiles extension, so I was using that on a daily basis.
I agree that it should be a measure of last resort, but if you can't avoid working with big binary files, it makes the difference between a workflow that is a bit more cumbersome, and one that just grinds to a halt. Getting this functionality in git is great. And it'll mean a huge step forward in collaboration tools for game developers. You pretty much can't avoid big binary files when making games - and so far they've been stuck with SVN or Perforce (or the more adventurous ones are trying out PlasticSCM, which apparently is pretty nice too, but is proprietary and doesn't have a big ecosystem around it like git does). I hope this can lead to a boom of game developers using git.
Yup. I'm using git for game source code, and I often hold off on any commits to graphics/music until the project is done. Any workaround outside git means you have two systems to manage, and it can get really painful.
Not sure if they were working together with GitHub on this, but Microsoft also announced today that Visual Studio Online Git repos now support Git-LFS with unlimited free storage:
Where are the files actually stored? I hear "git lfs server" in the demo video, can this be changed? Can I init my repo and tell it to push all my objects to my own private s3 bucket, or can I only rely on some outside lfs server I don't control?
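As far as I can tell, git-lfs reads an `lfs.url` setting (which can be committed via a .lfsconfig file) that points the client at any server implementing the LFS API, so in principle you could stand up your own endpoint in front of an S3 bucket (the server name below is made up):

  git config -f .lfsconfig lfs.url "https://lfs.example.internal/my-repo"
  git add .lfsconfig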
This could be a corollary to P. Graham's "do things that don't scale": don't do anything that involves plonking stupidly large files into version control.
The fact that both Atlassian and GitHub intended to unveil their own almost identical competing solutions, both built in Go, in consecutive sessions at the Git Merge conference (without either being aware of the other) is pretty hilarious.
Git-lfs has been helpful for managing my repo of scientific research data. Hundreds of large-ish Excel files, PNGs, and HDF5 files add up quickly if you're doing lots of small edits.
There are still some warts (don't forget `git lfs init` after cloning!), but it's mostly fast and transparent. I also ponied up $5 a month to get 50 gigs or so of LFS storage. Decent deal imho.
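The post-clone dance I mean, for anyone who hits the same wart (`git lfs init` may be named `git lfs install` in newer releases):

  git clone https://github.com/example/research-data.git
  cd research-data
  git lfs init     # set up the smudge/clean filters
  git lfs pull     # replace the pointer files with the real content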
They do have a reference implementation of the serverside here: https://github.com/github/lfs-test-server - though they themselves don't consider it production ready. But I'm sure it'll either get there in time, or another open source implementation will rise to the challenge (cf. syste's comment about GitLab planning support for this: https://news.ycombinator.com/item?id=10313495 )