Hiya HN. I was ranting on Mastodon earlier today because I feel like people learn git the wrong way - from the outside in, instead of the inside out. I reasoned that git internals are pretty simple and easy to understand, and that the supposedly obtuse interface makes a lot more sense when you approach it with an understanding of the fundamentals in hand. I said that the internals were so simple that you could implement a workable version of git using only shell scripts inside of an afternoon. So I wrapped up what I was working on and set out to prove it.
Five hours later, it had turned into less of a simple explanation of "look how simple these primitives are, we can create them with only a dozen lines of shell scripting!" and more into "oh fuck, I didn't realize that the git index is a binary file format". Then it became a personal challenge to try and make it work anyway, despite POSIX shell scripts clearly being totally unsuitable for manipulating that kind of data.
Anyway, this is awful, don't use it for anything, don't read the code, don't look at it, just don't.
Now the name makes even more sense. I first read it as sh/git, but reading it as something that starts inside and slowly works its way out is now my preferred explanation of the name.
If it's so hard to learn about this tool and so easy to learn about it the wrong way, that's a pretty obvious hint that there's something wrong with the tool.
Having to learn about the internals is a giveaway that the tool suffers from poor encapsulation.
To me git is definitely one of those tools where one should satisfice and not learn it deeply, because it's not worth the effort. One can successfully stick to a simple workflow and ignore anything the git astronauts come up with, like git flow, if they want to keep their sanity and focus on what matters - creating quality software. And almost any team has some git fetishist who will be thrilled to help when things go south. And if they don't, it's probably for the better.
> If it's so hard to learn about this tool and so easy to learn about it the wrong way, that's a pretty obvious hint that there's something wrong with the tool.
Not necessarily. This may mean (and I think in this case, it does) that people are too afraid to learn about those "internals" - or should I say, the mental model behind the tool (and then some of those people write tutorials for others, perpetuating the problem). And with a "monkey see, monkey do" approach, people can fail at anything, up to and including tying their own shoelaces.
There is no such thing as a perfect encapsulation. Not in programming, and especially not in the physical world. "Internals" are ever-present and leak into view all the time. A good abstraction is just one that you can use day-to-day without constantly minding what's going on behind the scenes.
More importantly though, when you're just learning a bunch of git commands in isolation ("monkey see, monkey do"), you're not learning a tool/an abstraction - you're just learning its interface. That's sometimes OK, but in general, for effective use of an abstraction it's better to learn what moving pieces it is abstracting away. Which, in the case of Git, is that it's a DAG. DAGs are kind of fundamental in programming, too; it's good to understand them.
It's part of a bigger issue - source control tools have wildly different fundamental models. Moving to git from anything else will be confusing, because users jump to conclusions about what commands and operations are doing. It's different from other tools in important ways.
Can't use svn for the life of me, though I used it for 10 years before trying git for a real dvcs need (remote team for a while)...
Just the idea pains me. Missing git add -p, cherry-pick & rebase -i so much I immediately put git-svn on if I have to go back...
Also, it makes telecommuting easier, asynchronous team work so much simpler...
I think the key 'abstraction' that people don't understand is cherry-pick. I can't explain clearly in fine details /how/ it works, but it is the base of so much of git's power...
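Roughly, something like this (the hash is hypothetical, and it assumes a branch named main): cherry-pick re-applies another commit's changes on top of your current branch as a brand-new commit.

    git checkout main
    git cherry-pick abc1234        # abc1234 is some commit from another branch
    git log -1 --format='%H %s'    # a new hash, same change and message as abc1234

That "new commit, new hash" detail is most of what trips people up later.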
Monkey see, monkey do would be watching someone use git and imitating them. Reading the manual or an overview of the commands and using them is how learning new tools typically works.
It should be possible to learn the commands for creating a branch, uploading our changes or making a commit like it was/is possible for all version control tools and then move on with our professional lives, which likely revolve around writing software and not fumbling with git.
By the way, I love your conversation about Merkle trees below; it was one of the most surreal things I've read lately. :-)
> It should be possible to learn the commands for creating a branch, uploading our changes or making a commit like it was/is possible for all version control tools
The problem revolves around the fact that, despite the same names being used, git!branch != svn!branch, git!commit != svn!commit, etc. They serve related purposes - but not the same ones, because the concepts behind them are different. Learning a tool means learning those concepts. So in the process of learning "git commit" and "git branch", you're supposed to pick up on the "pointer to a node in a DAG" thing - otherwise you haven't learned "git commit", you've learned something else that's vaguely similar. And then you'll have difficulties when its behavior goes against your expectations.
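You can even see it in the repository itself - a (non-packed) branch is literally a tiny file holding a commit hash, nothing more. A quick sketch, assuming a branch named master:

    cat .git/refs/heads/master    # a single commit hash (if the ref hasn't been packed)
    git rev-parse master          # same hash, and works even for packed refs

Nothing about that file "contains" commits; it just points at one node in the DAG.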
But they behave pretty darn close to SVN branches and commits and... they're also called branches and commits. It's clear why that happened - no one would have used a weird tool which turned the old concepts on their heads, so git was being taught based on comparisons with existing tools.
Now that git's very popular, the teachers have become arrogant and are claiming that our mental models for how VCS work are wrong and we should instead adapt our thinking to the git internals. In almost all other professions a confusing tool is scorned, but only developers are expected to learn how all sorts of weird contraptions work and then anyone who can't keep up is scorned instead.
git is almost 15 years old and here we have yet another attempt at clarifying how it works to the masses. Why are there so many git GUIs and tutorials and attempts to clarify how this tool works? It's a freaking VCS, not rocket science. git took something that used to be straightforward and doable by any developer and turned it into an over-complicated mess.
Now here's a question for you: why do you defend this anti-developer tool instead of siding with fellow developers?
And one step beyond that, that it's a Merkle tree. It's key to understanding stuff like "if I change a commit, it changes all commits after that" or "if I move (cherry-pick, rebase) this commit, I'm creating a new one, not really moving it".
> And one step beyond that, that it's a Merkle tree.
Not really, not every blockchain is a Merkle tree. Since Git history is not linear, you can't order the commits in any canonical way. You definitely can order them in some way (like "git log" does) and then construct a tree for that list of hashes, but that is not a really useful computation. Git repo integrity is verified simply by the HEAD commit hash, because you normally clone the entire repository anyway.
Is git even a Merkle tree? Git history forms a DAG, not a tree. But I'm not 100% sure about how the hashes are computed - whether they work on the DAG, or on some local subtree.
No, it’s not. Commit hashes are exactly what it says on the tin: hash sums of commit objects, which are basically text files that include hash sums of tree objects (directory tree state), hash sums of parent commits, your commit message and other metadata. You can see it with "git cat-file":
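The output is shaped roughly like this (the hashes, name and dates below are made up):

    $ git cat-file commit HEAD
    tree 9bdc2d5e2a8f8d7c3b1a0e6f4d5c2b1a0e9f8d7c
    parent 2f7a1c0b9e8d7c6b5a4f3e2d1c0b9a8f7e6d5c4b
    author Jane Doe <jane@example.com> 1550000000 +0000
    committer Jane Doe <jane@example.com> 1550000000 +0000

    Commit message goes here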
Git prefixes object content with the object type ("commit" in the case of commits) and its size in bytes (textual, decimal), terminated by a null byte. And hashes all of that to get the commit hash.
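You can reproduce it by hand with nothing but plumbing - a sketch, assuming a SHA-1 repository and a printf that emits a NUL for '\0' (most do; sha1sum itself isn't POSIX, but it's everywhere):

    git cat-file commit HEAD > c.txt
    size=$(( $(wc -c < c.txt) ))
    { printf 'commit %s' "$size"; printf '\0'; cat c.txt; } | sha1sum
    git rev-parse HEAD    # should print the same hash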
That conversation already went off the tracks at "internally", since it shouldn't matter at all how it works internally. We do not learn how most things work in detail, because we wouldn't have time to live our life if we did.
In this particular case, git is a version control tool and it supports various typical operations for such tools. One should be able to learn the commands and then successfully use the tool. If that's not possible, I continue to assert that there's a problem with the tool.
It also speaks to the quality of users as a whole - we somehow forgot to RTFM at all, or to point people to the documentation, while simultaneously accepting the lack of documentation in a vicious cycle.
> The conclusion I draw from this is that you can only really use Git if you understand how Git works. Merely memorizing which commands you should run at what times will work in the short run, but it’s only a matter of time before you get stuck or, worse, break something.
Things began to click for me as soon as I read this in its intro section:
> Beginners to this workflow should always remember that a Git branch is not a container of commits, but rather a lightweight moving pointer that points to a commit in the commit history.
    A---B---C
            ↑
         (master)
> When a new commit is made in a branch, its branch pointer simply moves to point to the last commit in the branch.
    A---B---C---D
                ↑
             (master)
> A branch is merely a pointer to the tip of a series of commits. With this little thing in mind, seemingly complex operations like rebase and fast-forward merges become easy to understand and use.
This "moving pointer" model of Git branches led me to instant enlightenment. Now I can apply this model to other complicated operations too like conflict resolution during rebase, interactive rebase, force pushes, etc.
If I had to select a single most important concept in Git, I would say it is this: "A branch is merely a pointer to the tip of a series of commits."
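You can also watch the pointer move - a sketch, assuming master is the checked-out branch (--allow-empty just avoids having to stage anything):

    git rev-parse master                             # some hash X
    git commit --allow-empty -m "move the pointer"
    git rev-parse master                             # a new hash...
    git rev-parse master^                            # ...whose parent is X

The branch ref is the only thing that moved; all the old commits are untouched.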
And you can see this structure if you add "--graph --oneline --decorate --color" to any "git log" command. IIRC some of those are unnecessary in recent versions of git; I just remember needing all of them at the point I started using it regularly.
I have a bash function for it (with a ton of other customizations, but it boils down to this):
function pwlog() {
    git log "$@" --graph --oneline --decorate --color | less -SEXIER
}
pwlog --all -20
(...in that "less" command, "S" truncates instead of wraps lines, one "E" exits at EOF, "X" prevents screen-clearing, and "R" is to keep the color output. The second "E" does nothing special, it and "I" (case-insensitive search) are just to complete the word)
> The order of preference is the $GIT_PAGER environment variable, then core.pager configuration, then $PAGER, and then the default chosen at compile time (usually less).
>This "moving pointer" model of Git branches led me to instant enlightenment.
As opposed to any other VCS? Feels like that model is the only one that works with SVN too. I struggle to see how "branch is a container of commits" is a viable model to begin with.
It's a good-enough description of SVN, where branches exist in the same directory tree, commits are tied to the branch by way of the path, and the merge tools are "merge this batch of commits from branch A to trunk" (you don't have to take the whole branch at once).
One of the biggest hurdles my co-workers have had learning git after having used svn for years is the "bucket of commits" mental model they've built up for branches. A common question is how to merge a single commit.
Mercurial branches are different from git branches; they're topological structures that emerge when a revision gets an alternate child. They're like growing and stopping lines of development. They exist on their own; Mercurial simply allows you to give them names. What git calls branches is what Mercurial calls bookmarks.
It is the tip for that branch. Even if there exist other commits building on the commit the current branch points to, the pointer is still at the tip for that branch.
The point is that a branch is simply a pointer to a commit that automatically encapsulates all of the parent commits.
I think I see what it means though. The branch uses that commit as a new tip to then branch off of, not necessarily meaning a new branch starts at the existing 'tip'.
>The conclusion I draw from this is that you can only really use Git if you understand how Git works.
I say this as someone who uses git regularly, and who prefers it to all other version control systems I have tried:
A tool that breaks the principle of encapsulation by forcing you to grok its internals if you are to have any hope of understanding its arcane and inconsistent usage syntax is frankly not a very good tool.
By contrast, I don't understand how vim works beyond the base conceptual level (keypress goes in, character shows up on screen or command is executed) and yet I don't have any trouble using it. I don't need to know vim's internals to use it effectively. Vim is a good tool.
> I don't understand how vim works beyond the base conceptual level
How much time have you spent trying to figure out how to change the font size in Vim, rotate text 90° in Vim, recalculate a formula in Vim, or insert an image into a document you're editing in it? If the answer is “none”, you probably have a pretty deep understanding of the data model Vim manipulates, even if you aren't aware of it.
On the other hand, I understand the data model of git and can't do the most basic shit without looking up which invocation I need via search engine/man pages. Like... deleting a branch? `git branch -d` (-D for forced deletion). Deleting a remote? `git remote rm`. Knowing the model teaches me nothing about the UI.
This seems like a good opportunity to plug two aliases I wrote about a year ago that have been very helpful for cleaning up all the extraneous branches that would show up when I ran `git branch`.
I run `git listdead` after I merge and delete a branch and do the next fetch (`git pull --rebase`). That lists the branches that can now be deleted.
Then I run `git prunedead` and it actually removes them.
Previously if I ran `git branch` it would list every development branch I had ever created in nearly a decade of work. Now it lists maybe ten branches.
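Roughly, they boil down to something like this - a simplified sketch rather than my exact definitions, and it assumes fetch pruning is enabled so deleted upstreams show up as "gone":

    # in ~/.gitconfig
    [fetch]
        prune = true
    [alias]
        # local branches whose upstream branch no longer exists
        listdead = "!git branch -vv | grep ': gone]'"
        # delete them (assumes you're not currently on one of them)
        prunedead = "!git listdead | awk '{print $1}' | xargs -r git branch -D"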
> rotate text 90° in Vim, recalculate a formula in Vim, or insert an image into a document you're editing in it
I'm not sure if I'm missing some features in Vim or you're actually pulling my leg by forcing me to notice that I know more about text than I care to admit :-)
Using Vim requires you to understand how the data it operates on is structured. The same applies to Git. Plain text is just a lot simpler than a VCS repository.
But you can use your awesome vim skill to feed text into an OpenOffice document, and while the OOo internals are probably 100x more complicated than vim's, the user interface for "text on my screen" stays the same, and the transition is smooth, even though the internals underneath are vastly different. If git requires 'everyone' to know the internals before they can use it, as opposed to rcs, cvs, svn, perforce users who can more easily flip around between those for most basic usage, then it's on git for having a complicated shell around a complicated set of internals.
It would have been nice if there was a simpler shell around the complex machinery for those (us?) who don't want to do crazy stuff, who don't need to be able to do crazy stuff and who could settle for only the simple 90% of the tooling like we do with the alternatives, but are forced to use git for external reasons.
Plain text is internally ropes or something, indented with meters of vimscript, colored with a syntax model that is okay to change but hard to create from scratch, etc. It's all hidden from a regular user who uses a subset of all features.
But if you ignore the non-ms movement, shortcuts and advanced transforms, it is still a text editor that everyone may use. You can't put your text (the text, not the current mode!) into a state that looks okay but requires a vim guru to continue, or forces you to start over because something is broken in the model. That's different from git issues, where the working copy looks okay but the branch and merge are broken in subtle ways.
>Plain text is just a lot simpler than a VCS repository.
Than a Git repository, not a VCS one. Not saying that VCS = plain text, but much simpler models exist for merging teh codes.
Scott Chacon wrote a book on git internals that was published by peepcode some time ago. Searching for where to buy it turned up this HN thread:
https://news.ycombinator.com/item?id=7999515
>I reasoned that git internals are pretty simple and easy to understand, and that the supposedly obtuse interface makes a lot more sense when you approach it with an understanding of the fundamentals in hand.
Everybody's brain is different, but even though I actually understand all of git's internals (the "plumbing"), it doesn't help me with the git commands (the "porcelain").
Yes, I know that git is a DAG (Directed Acyclic Graph), and that HEAD is a pointer, and the file format of BLOBs and SHAs, etc. If I were to implement a DVCS, I would inevitably end up reinventing many of the same technical architecture decisions that Linus came up with. But none of that insider knowledge really helps me remember git syntax if I haven't been using it in more than a month. Even though I grok git's mental model, I still can't answer the top-voted "git" questions on Stackoverflow without a cheat sheet: https://stackoverflow.com/questions/tagged/git?tab=Votes
The git UI and unintuitive syntax is just too hard for me to remember unless I use it every day.
In contrast... In vi or MS Word, I can effectively modify text without digging into underlying "rope data structure"[1]. In databases & SQL, I can "INSERT INTO x" without learning the "internals" of b-trees[2]. In Photoshop, I can stack layers without learning the math "plumbing" of alpha blending[3]. And yet for some reason, Git in particular needs people to learn it "inside out" more so than other tools. Not sure why Git needs this cognitive prerequisite.
Unfortunately, I think you have to learn this way because everyone who came before you also did. So not only is the porcelain oriented towards this understanding, but so are people's existing repositories and workflows.
People regularly use the limited subset of git that Github permits without learning how it works internally. If only the 'edit this file' button were powerful enough to do actual work... It isn't, and that's the other problem: internals knowledge actually helps you do day-to-day versioning tasks. The reality is that one day two developers will submit PRs that conflict, and you'll have to find a way to merge them both, and knowing how rebasing works inside absolutely helps you. The analogy is more MSWord style-stacking than manipulating ropes directly, because Git does manage to completely hide some of its guts. (Object storage, compression, transfer come to mind.)
Though git's interface is highly flawed, I do think a good VCS needs to expose more of its storage model to the user (or at least a model isomorphic to it) than most apps.
In vi and Word, you're not worried about state changes outside of saving the current state and, possibly, undoing some number of steps. In a VCS, you might need to check out, merge, or compare arbitrary states from the history, and doing this inherently requires a deeper understanding of how the history is stored. A good VCS should expose these internals in a clear way. In my experience, working with even 1 teammate immediately requires you to have some mental model of how your VCS deals with merging different histories.
That said, it's up to the VCS's interface to make these things clear. Git's mental model is simple enough, and the porcelain can do some of this stuff very well, but the CLI is arcane; I end up storing extremely common functions as shortcuts because I'd never remember them or want to type them, even though I use them dozens of times a day.
> oh fuck, I didn't realize that the git index is a binary file format
I wonder if the project (or likely, another one) might be better served by implementing the index using plain text (or whatever else might be more natural for shell wrangling) to elucidate the conceptual structure rather than matching git literally.
PS: The name is very apropos. One doesn’t see too many such fitting opportunities — feels warm and fuzzy to see this one well used :-)
Why do you think it's the wrong way? I sit somewhere in between and think that some people want to know the details and learning from inside is a good idea. But some other people want to simply be users and for the tool to get out of their way - and that's also good. So if the docs or the UX make either way hard or less effective, that's on the docs or the UX to improve.
Sure, but I also use a mouse. I could learn how the optical mouse works. I have some guesses about it too, but never actually learned the details.
But I'm a user of it - it works even if I don't understand exactly how and nobody tells me that I learned using the mouse "the wrong way" because of it.
Yes, and you don't have to spend one hour learning how to push a button.
A version control system is tackling a non-trivial problem. Go learn it properly, otherwise you'll be one of the 'users' that, at best, will be stumped on trivial issues, losing productivity and running to others for help. At worst, you'll be making bad decisions and dragging down your team.
Would you also say that you don't need to learn anything and can just "guess" while working with a programming language?
> Go learn it properly, otherwise you'll be one of the 'users' that, at best, will be stumped on trivial issues, losing productivity and running to others for help. At worst, you'll be making bad decisions and dragging down your team.
I have no problem with people on my team asking each other for help, and I definitely don't consider it bad for productivity when they do. If someone on my team suggested people asking them for help was bad I would bring it up in their next one to one because that's a really bad sign something is wrong.
If everyone on my team decided to learn the internals of git so they didn't need to ask one another when a problem arose I would be genuinely concerned about how the team is working.
I think the point was "trivial issues". What if a Java programmer kept asking their teammates if the statement terminator in Java is colon or semicolon?
A typical mouse has 2-3 buttons, a wheel, and an X/Y axis.
Git is 100x more complex, if not more.
Give me a break. I'm sick of people glamorizing the idea that you should have your hand held through each step of every tool you use and never expend any effort on becoming an expert in the tools of your trade. Git is an engineering tool, designed by and for professionals. Imagine this kind of obscene complacency in other fields.
I meant the tool-vs-internals idea, not the specific example here. If you want something comparable in complexity: we learn programming from `print "hello world"`, not from memory models and assembly. Some people even just start with `=sum(...)` in excel. Every programmer pretty much stops at the level that's useful and productive for them.
There's often sentiment that people should know more, but I don't think I've ever seen anyone saying starting programming from high level is "the wrong way".
An example from outside our field: doctors learn both how to use ultrasound (USG) and how it works. But in every case I've seen, it was in that order: practice, then internals.
>we learn programming from `print "hello world"`, not from memory models and assembly
You're talking to the wrong crowd with me, you know. I disagree with this approach, too. Maybe we start with "hello world" to get a taste, but the first thing we should do is start breaking it down.
You understand git. If someone said you need to know how sed works, or grep, or babel, or clang, or bash, or perl, or literally any other tool that you use regularly but don't know the internals of, then you'd quite reasonably say they were wrong, because you're an expert in what matters to you and you can do your job without knowing how something else works. Perhaps you should try to respect that other people choose to become experts in things that aren't git.
If you use any of the tools listed, or any other tool, several times a day then it is reasonable to know the internals at least a bit. If you use grep (or git - the same applies) once a month then it's fine to just memorize some commands.
I don't think it's asking to have one's hand held to complain about git's poor interface. There's no reason other than lazy design to have a tool where to show all remotes it's
$ git remote -v
But to show all branches it's
$ git branch -a
It's like it's been purposefully designed to be obtuse.
`git remote` and `git branch` list the remotes and branches respectively.
Adding -v makes both of these verbose. It will additionally show what each branch/remote is "pointing at".
Adding -a to `git branch` shows remote-tracking branches in addition to local branches. This is not normally interesting, so the default is to list only local branches.
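Illustrative output, with made-up branch names:

    $ git branch            # local branches only
    * master
      feature-x
    $ git branch -a         # local + remote-tracking branches
    * master
      feature-x
      remotes/origin/master
      remotes/origin/feature-y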
...yes, remote tracking branches are interesting as well, I don't know a situation where they wouldn't be.
There's plenty of weirdness in Git, but honestly my main complaint is that the interface is awful and the documentation makes Dostoyevski look modern and sleek.
You clearly have not ever worked on repos where nobody ever cleans up after themselves as far as feature branches go. I'm working with repos with remote branches numbering in the hundreds. `git branch -a` is pretty useless at this point unless paired with grep.
We're all talking about learning the internals of git and how it works, not its poorly formed command lines. Pointing out how shitty the interface can be doesn't mean you shouldn't learn how your tools work.
Every other field makes damn sure that their tools are usable, comfortable and as safe as they can be. While programmers act like it's your fault if you hurt yourself while using a chainsaw-hammer.
I've watched children use a mouse for the first time, and there is definitely an internal model you needed to learn. No, you don't need to learn exactly how optics work, in the same way that you don't need to learn exactly how Git is reading from files, or how its hashing algorithm is implemented.
But you do need to understand that the mousepad doesn't correspond to points on the screen, and you have to learn to treat it more like a treadmill than anything else. Going back in time and thinking about it from a rollerball perspective can help with that -- new users have a tendency to use something like 90% more space because they don't grok that for long movements they have to pick up the mouse.
People are bringing up the mouse as simple because they're used to using mice. But hand anyone a mouse for the first time and you'll find out that they aren't simple. They're just doing comparatively less than Git, so the problem space is slightly easier to tackle. And that's even ignoring the hand-eye coordination problem we take for granted, and that can take weeks for someone new to computers to get over.
Talking about internal mechanics is broadly useful when teaching computer literacy -- everything from mice, to copy/cut-paste, to shift-selection of files, to the file browser itself benefits from trying to build a systemic, mental model of some kind of behind-the-scenes abstraction.
git is a data structure manipulation tool, mouse is a cursor manipulation tool. A much better analogy would be trying to use a mouse without having a good idea of what the cursor is for.
The only difference is that grasping the cursor will probably take you minutes because it's a simple concept, and grasping the data structure takes a bit more effort because it's just a more complicated topic.
You don't need to know the implementation details of git, but you need to know the data structure it operates on, cause otherwise you're just walking in the dark.
I'm not certain that git will be the dominant VCS forever, as I'd used CVS, Perforce, Subversion, Mercurial, in various degrees when they were dominant (or at least relevant).
Who knows, maybe Linus will have another epiphany, while Microsoft somehow mismanages GitHub and squanders all the goodwill away. Then, a group of upstarts...
That said, wanting to learn git's internals for the sake of knowledge is fine as motivation.
shrug I'm probably never going to work at either of those places. For pretty much everywhere else, Git works just fine.
Maybe if I play my cards right, I'll use git for the rest of my career. If not, maybe there will be something new eventually, but I imagine that the concepts learned in mastering Git would still be useful.
You should give Mercurial a try. When you enable the changeset editing extensions it does everything Git does and is much easier to use and understand. I've trained people on both systems and Mercurial is much less creaky. The only reason everyone uses Git is historical: most of the extensions I'm referring to weren't added to Mercurial until around '09, when Git had already started seeing widespread use.
I used mercurial quite a bit back in 2010. It's nice, but I don't see the value in sinking a bunch of time into it these days.
Every one of the projects that I interact with regularly are in a Git repo on some kind of Git hosting service and the projects are run by people who understand/use Git regularly. For those projects, switching to Mercurial is a net loss, even just considering the time it takes to migrate the codebase + related processes (think CI, issue queue integration, even the repo hosting itself).
Sure, I could use hg-git, but that doesn't gain me much either: now I'm the guy with the weird setup. If something goes wrong with my setup, it's too weird for other people to help with. If something goes wrong with somebody else's setup, I'm not that helpful because I have a weird setup.
I found Mercurial to be just as capable and much easier to learn. I’m really only considering moving my team to Git as a least-common-denominator move since nobody really makes tools for Mercurial outside of Facebook. Otherwise it’s harder to learn Git and definitely it’s harder to train interns and juniors to use it.
Google uses a custom implementation of the Perforce interface called Piper. Google has looked at git and mercurial, but have concluded that they can't scale to the level they need it to. Read more about it here:
Because the porcelain is a nonsensical pile of crap on its own, you really cannot make sense of it from the top down; it actively resists that approach.
That’s not an assertion that it’s a good thing mind, just that it’s the only one: learning git from the bottom up is much easier than top-down, and people who dislike that approach are simply hosed.
I think it's more like learning enough SQL to build an application without learning about indexes and transactions: you can be sufficiently productive until you encounter corner cases or things go wrong.
> I view not learning the basic concepts of how git works inside like trying to learn SQL without knowing what a table is.
That's complete nonsense. A table is not a low-level implementation detail of SQL; it's a core feature.
And I don't have to know how tables are represented on disk, or what exactly they store, to acquire a good intuition of how things work.
SQL is in and of itself an abstraction decoupled from the underlying concerns of implementation and execution. Something git’s porcelain definitely is not.
Git internals are fine. Now, if the utils gave access to human-oriented operations with those internals without the user googling every time they need something not-yet-memorized, that would be splendid. As it is, the utils are already pretty shitty without a reimplementation.
eh, common.sh is a lot more complicated than it could be. example:
write_hex() {
    hex="$1"
    echo "$hex" | sed -e 's/../&\n/g' | while read -r hexbyte; do
        printf "\\x$hexbyte"
    done
}
you can imagine other shenanigans with xargs or something, but I think this strikes the best balance between performance and readability (as far as shell script goes).
read_int16 and read_int32 don't work on big-endian systems, or if int is 16 bits instead of 32. the latter issue can be easily fixed by explicitly specifying -td2/-td4, but the former issue is not so easy. I think it requires either figuring out the endianness beforehand, or better, something like this:
od -An -tx1 -j"$offs" -N4 "$path" | while read a b c d; do
    echo $(( (0x$a << 24) | (0x$b << 16) | (0x$c << 8) | 0x$d ))
done
oddly, this is used in ls-files already. and yes, I checked: 0x$a is POSIX, and the arithmetic evaluation size must be at least a signed long, which is at least 32 bits.
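read_int16 can get the same treatment - a sketch (I'm guessing at the argument order common.sh uses, so adjust accordingly; index fields are big-endian/network order):

    read_int16() {
        # $1 = file, $2 = byte offset
        od -An -tx1 -j"$2" -N2 "$1" | while read -r a b; do
            echo $(( (0x$a << 8) | 0x$b ))
        done
    }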
'for x in $y; do printf "$a%s$b" "$x"; done' is equivalent to 'printf "$a%s$b" $y' (assuming neither a nor b contain format specifiers). similarly, 'for i in {1..100}; do printf "$a"; done' is equivalent to 'printf "$a%s.0" {1..100}'. unfortunately, brace expansion is not POSIX, but these are both significantly more efficient (both in code size and execution time) than the loop methods.
sha1sum is not POSIX. I think shell arithmetic provides you enough tools to implement https://en.wikipedia.org/wiki/SHA-1#SHA-1_pseudocode directly, although it may be slightly slower than a C implementation. awk is probably faster than shell.
Such a long text... only to end up arguing for a non-POSIX solution. But the initial goal of the project was having POSIX code, and that can indeed be a valid goal. So the whole post is "a lot more complicated than it could be." TL;DR: you wouldn't make POSIX code.
what the fuck? I specified POSIX alternatives to non-POSIX uses in the code. only one feature, brace expansion, is not POSIX, so I specifically did not recommend its use.
The specific part that I understood as your argument for non-POSIX solutions:
"'for x in $y; do printf "$a%s$b" "$x"; done' is equivalent to 'printf "$a%s$b" $y' (assuming neither a nor b contain format specifiers). similarly, 'for i in {1..100}; do printf "$a"; done' is equivalent to 'printf "$a%s.0" {1..100}'. unfortunately, brace expansion is not POSIX, but these are both significantly more efficient (both in code size and execution time) than the loop methods."
I didn't understand why you write that part at all, considering the goals of the program we discuss (which is to demonstrate some git primitives in POSIX compliant shell code).
'for x in $y; do printf "$a%s$b" "$x"; done' is used in the code already. I am proposing that it be changed to 'printf "$a%s$b" $y', which is also POSIX compliant, shorter (even including a comment), and faster. I included the part about brace expansion as a side note, not proposing that it be used.
I'm sorry if I misunderstood you. I'm indeed interested in which of the things you wrote you would actually suggest changing, as I also looked at his code, and read here that it was done in a short time, so I also believe there are possibilities for improvement. Specifically, reimplementing the SHA calculation itself should be a non-goal, in my opinion. The point is just that the .sh code itself works on all POSIX shells, not that the whole system has to be POSIX-only: calculating a SHA in shell is surely not the point of demonstrating how git works.
Would you say that if the data format were, say, JSON, YAML, TOML, or some other more human- and bash-friendly format, it would have been easy to implement with your current experience?
Not to vouch for having Git store things in a human-readable format. But I often think about how inefficient JSON APIs and YAML storage formats are (in parse time) just for the benefit of a user debugging them or discovering the API through a browser. And since most people use a JSON prettifier plugin or a tool like Postman anyway, what is the benefit of the line format being character strings? Wouldn't a binary package be just as easily translatable into human-readable, JSON-formatted output as a compacted JSON string is?
> what is the benefit of the line format being character strings? Wouldn't a binary package be just as easily translatable into human-readable, JSON-formatted output as a compacted JSON string is?
One benefit is that I can look at an arbitrary file/response and be able to tell with a fairly high certainty whether it's JSON, YAML, or TOML, but there's no way that I tell whether it's messagepack, bson, or protobufs.
Most of the time you know the format you expect to decode; you don't have to guess it anyway.
But I think you should be able to detect the type of a binary encoding just as well, as their specs are pretty specific. Maybe not at a glance as a human, but that is the point I'm making: should all line formats be made absolutely human-readable and parsable at a glance, just for the benefit of debugging, at the cost of performance? With just a simple lens tool you can look at the data in a completely different (human-friendly) view. Tools like this already exist in the form of Wireshark, only they mostly operate at a deeper level.
Even with the self deprecating nature of this effort and project: you continue to produce a lot of open source contributions. I see your work and blog all over the place. You’re a machine!
Is this a reference to something? I can't imagine it would take 500mb to reimplement git in JavaScript. This one's <75kb and full of explanations: https://github.com/maryrosecook/gitlet
It's a dig at the typical size of a `node_modules` folder. These often are very large, and often do contain several thousand files, largely due to transitive dependencies.
For anyone looking to use Git in JS, check out https://github.com/isomorphic-git/isomorphic-git. I've had great success with it and really like it as a library. Its API design is good and it's tree-shakeable, so the size of the library is very reasonable, even if taking it as a whole.
Indeed, I've come to the same conclusions myself. I'm the resident git expert at my job but, quite honestly, the only thing that separates me is that I've learnt git's internals. But I learnt it because I'm lazy and it's easier, not because I want to be the best or anything.
So many people say they "know" git, but then I watch them work and it's "git commit -a" all the time and "git clone" when something doesn't look right. It's really amazing how people refuse to learn this essential tool.
I wanted to have the staging area, and I had decided upfront that I wouldn't make the repository state inconsistent between shit and git. So the index needed to be done.
Also, you need to generate a tree out of something. Could just hash the entire worktree every time, but that would be pretty lame.
What I never fully understood is why the staging area isn't simply a commit that gets amended repeatedly as files are staged into it. Maybe just a tree pointer, since the rest of the commit data isn't available until commit, but you could fill in some placeholders. ("(staging)" for the message, current times for the timestamps, etc.)
(Note that index doubles for other functions like merging/conflict resolution, but I never thought that was a good thing, and could be separated out.)
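To be fair, the plumbing already gets you most of the way to that model whenever you want it - a sketch (file name made up; commit-tree just prints a hash and doesn't move any ref):

    git add notes.txt
    tree=$(git write-tree)                           # snapshot the index as a tree object
    git commit-tree "$tree" -p HEAD -m "(staging)"   # wrap it in a throwaway commit

That's more or less what `git commit` itself does under the hood, plus moving the branch ref.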
I've wanted this for a long time, and also a frequently-amended working-tree-as-a-commit.
Why? I prefer a each branch to have its own staging area and working tree, which maps better to my mental model of "branch as an under-development feature".
Currently my workflow to achieve this involves a lot of stashing.
The 'staging' area is implemented through the index. And the index is used for more things than just deciding what gets into the next commit. A lot of git's speed comes from caching stat data of files so that it does not have to hash the complete working tree for each operation. That's not something you can just ignore.
Someone proposed splitting this up the other day, but even that would come at the cost of performance and an increase of complexity.
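You can actually peek at that cached stat data with plumbing (file name hypothetical; the --debug output format is explicitly unstable, but it looks roughly like this):

    $ git ls-files --debug -- README.md
    README.md
      ctime: 1550000000:0
      mtime: 1550000000:0
      dev: 2049  ino: 131077
      uid: 1000  gid: 1000
      size: 1234  flags: 0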
Wow, I just totally assumed the index was a tree object. It seems much less elegant that it isn't.
Maybe it isn't because it would necessitate creating a lot of blob objects as you staged and unstaged changes, which might not get garbage collected for some time. I can't see any other reason.
For anyone wanting to understand Git from inside out at a fundamental level, I can't recommend the 'Building Git' book enough.
The code is Ruby, but there's enough explanation for each snippet to be able to follow along in whatever language one prefers. I had no problems with translating to Go, for example.
> I said that the internals were so simple that you could implement a workable version... inside of an afternoon. So I wrapped up what I was working on and set out to prove it.
Been there. Done that. With other things, not git.
I suspect many others here have too.
> Five hours later, it had turned into less of a simple explanation of "look how simple these primitives are, we can create them with only a dozen lines of shell scripting!" and more into "oh fuck, I didn't realize that ...". Then it became a personal challenge to try and make it work anyway...
Yep. Been there too. Done that too. Again, with other things, not git.
You're not the right person to ask about this, but if Drew is around I would love to hear high-level details about how this setup works and what the average monthly cost is.
I can see that it's open source[0], and I'm very tempted to copy it. I'm already in the midst of migrating all of my video hosting to peertube, but I don't have a solution I'm confident in for livestreaming other than Twitch -- especially because when in the rare instances where I do stream coding sessions they can go up to 5 or 6 hours, at which point archiving and storing that video starts to look a lot more costly.
I don't use it very often. I just threw it up on a Linode with minimal effort so I could have a working live streaming setup. You'll note from the readme:
>This is the website for my self-hosted livestreaming platform (aka bag of hacks dumped into a server).
PeerTube is nice in theory but in practice it's been really really unreliable for me.
Archival/storage of video shouldn't be that costly.
With 5400rpm drives (better for archival than more or less any other type of storage media, including faster hard drives), it looks like the going rate is about a United States cent per gigabyte. Two for 7200rpm drives from manufacturers that seem to produce the most reliable drives on the market, consumer-side.
A setup that could survive through a reasonable amount of drive failure, then, seems to be relatively inexpensive, so long as you're not trying to archive your video In The Cloud®.*
Delivery being costly is a myth propagated by Big Cloud®. Any dollar store VPS that isn't DO will have more than enough for streaming video all day every day.
That's irrelevant, though, given the person's question was about storage and archival.
It all depends on the use-case/context. Hosting a single video with few concurrent views is cheaper to do on VPS. Hosting videos with short response time in any region, with high resiliency, etc. is likely cheaper to do on CDNs. It's not a myth. It's "general advice may not work for you".
Not sure what you mean by false advertisement. Australia-Netherlands (common European pop) connection is often >300ms from a home connection. Home in Australia to Sydney pop is likely <10ms. It makes a massive difference with many small resources, or restarted transfers. That's just physics at some point.
But does it really? I can see why a large company wants to squeeze milliseconds out of asset delivery, but as a watcher of a small independent creator I would have no problem waiting a second for the video to start playing.
Latency directly impacts bandwidth, which impacts quality, since all current-gen user-facing live streaming protocols that matter (HLS, DASH) are layered on top of HTTP (on top of TCP), and that's already the best trade-off for end-user delivery today.
For VOD it's less of an issue since you can just maintain a larger buffer, but with live that's a trade-off with being closer to the live edge or choosing poorer quality. It works OK for some cases, it's bad for others (like sports, or when letters on the screen become illegible due to compression artifacts).
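A back-of-the-envelope version of "latency impacts bandwidth" (assuming a fixed receive window and ignoring loss and window scaling): throughput is capped at roughly window / RTT.

    # 64 KiB window; results are in bits per millisecond, i.e. kbit/s
    echo $(( 64 * 1024 * 8 / 300 ))   # 300 ms RTT -> ~1747 kbit/s (~1.7 Mbit/s)
    echo $(( 64 * 1024 * 8 / 10 ))    # 10 ms RTT  -> ~52428 kbit/s (~52 Mbit/s)

Real stacks scale the window, but at the live edge you don't have much room to buffer your way out of it.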
Building your own CDN off of el cheapo VPSs is theoretically viable, the beauty of HLS and DASH is they're 100% plain old HTTP, so just drop Varnish, add GeoDNS on route53 and off you go. Actually I'd love to have the time to try that :)
Here, the roundtrip latency is ~14ms within the country (e.g. from here to capital city), and 40ms to the closest AWS or GCP datacenters (both are in Frankfurt).
I once landed a job as a web developer in a marketing department where IT was gatekeeping production hard.
Although all the code was in git (the real git), deployment involved a magical shell script someone in IT had written ages ago. Only after a bunch of rocky, outage-causing deployments did we have the sense to start digging into this magic script.
It turned out it was just a bad, thousands of lines long re-implementation of git using perl and mysql that captured and stored diffs and rsync'd them to production. ...And this was well into the era of CI/CD and infrastructure automation tools.
Eventually, the company brought in new IT leadership that put an end to that kind of nonsense, which freed us in marketing to buy purpose-built PaaS for our needs.
Obviously do this kind of crazy stuff as you please on the side. But at work, build only what uniquely adds value to your company. For most, dev tools are probably pretty low on that list.
I really disagree, a small amount of customisation of tools can reap significant productivity gains.
I think your anecdote is really just an example of bad management, not bad tooling. Could easily be that some management doofus had prohibited git from being installed on the 'production' system.
A table saw by itself is mostly useless. With a fence and a miter gauge, it becomes useful. With a push block, stop block, subfence, outfeed table, infeed table, featherboard, crosscut sled, tenon jig, and dado set, it is the single most useful tool in a woodshop. Keep in mind it is still one tool and all those accessories are not "tooling", they are accessories to a single tool that increase what you can do with it. The tool always works the same way, and anyone can use it with any of those accessories in any woodshop in the world. In effect it becomes a new, larger solution, made up of many features that extend the utility of the tool.
That's not really what we have. Mostly what we have are jigs. A jig isn't an outfeed table or a crosscut sled. It's a hack for a particular job. If you need to make one specific cut 300 times, you nail together some scrap wood, dial in the miter gauge, angle the saw blade, and make your cuts. And the jig is scrap once again.
But in the heady new world of "DevOps engineering", the jig is now "tooling", and we pat ourselves on the back that we were able to nail some scrap wood together and claim it created business value. Of course, it's not a shitty jig like in the bad old days of shell scripts ("ha ha! remember when we were productive with this simple code that was portable and not gigantic or complicated? how foolish!"), because instead of making it out of scrap wood, we now make it out of scrap steel with a MIG welder. We're advanced now.
And I'll go further. The fact that most woodworkers make their own tablesaw extensions is illustrative of the problem: craftspeople like having fun with their toys. Is there value and experience and dollar savings you get out of making your own crosscut sled? Sure! But it'll also take you 1-2 days of buying parts, measuring, cutting, gluing, clamping, drying, aligning, and finishing. Any business with any sense should have paid $100 to just buy a complete crosscut sled made of aluminum with a good design that will last forever. But they are too dumb to notice they're spending an inordinate amount of time and money on craftspeople making jigs.
I wish the ghost of W. Edwards Deming would rise from the grave and call us what we are: bullshit artists.
If I could buy the software equivalent of spending "$100 to just buy a complete crosscut sled made of aluminum with a good design that will last forever", I would have done that.
That is arguably what P/S/SaaS is, instead of building bespoke or Chef'd instances. But I'm not one to reimplement something in F77 in Julia just to have something to blog about.
Regardless, sometimes a DSL is just what you need, and you'd better have someone who likes creating compilers do it. Otherwise it's like when builders do wood things without talking to a carpenter first.
In my opinion, this is simply an over-extended metaphor. Programming ultimately is not carpentry, and custom tooling is justifiable in many more cases.
My wife is a woodworker by trade and this metaphor is on-point.
I return to my original point: if your company is working on something where "production" is novel/unique/a differentiator, then you probably need to invest a little time in how you manage and deploy to production (e.g. you need something that's more than just a jig, and you can't go down to the store and buy it because it literally doesn't exist).
There is probably a certain point in scaling an engineering org (maybe 50+ devs) where you inevitably have to devote some engineering time to this anyway (e.g. you adopted/bought a tool that requires non-trivial maintenance and customization).
If, on the other hand, you're working on something where production and deployment are a well-understood--maybe even commoditized domain--then you should direct your precious engineering time elsewhere.
Marketing website infrastructure has its nuances, but it's well-understood. CRUD apps that talk to databases are a similarly well-understood area.
>Mostly what we have are jigs. [...] Of course, it's not a shitty jig like in the bad old days of shell scripts[.] Instead of making it out of scrap wood, we now make it out of scrap steel with a MIG welder. We're advanced now.
Am I reading that correctly? A porcelain command is one that's not supposed to be used in scripts, but the --porcelain flag is for when you do want to use things in scripts?
I have it because when I make a command line mistake, I more often say "shit" than "fuck". It feels more natural to type what I'm actually thinking. :)
The WYAG book many have referenced is pretty good. Haven't gotten through it yet, but I'm looking to do a Go implementation. I also want to create my own SCM, if I have the brainpower for it, to test some new workflow ideas.
It's slightly more complicated than that. The `git` command line itself was done in shell. Initially, `git foo` just ran `git-foo-script`, and then those were also written in shell. But the actual sha1 and packfile stuff was always in C from the beginning.
Sorry, I wasn't expecting this to HN before I had written some porcelain commands. I pushed an updated README that explains how to write commits with this.
This is essentially adding a blob inside .git/objects by taking a sha1 hash of the data after prepending a header, i.e. hashing "blob <length>\00<data>". Then the first two characters of the hash are used as a directory name and the remaining 38 act as the file name, inside which the zlib-compressed data is stored. Nice project for learning the git internals.
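In shell terms, roughly (hypothetical file name; assumes a SHA-1 repo, and sha1sum isn't POSIX but is widely available):

    file=hello.txt
    size=$(( $(wc -c < "$file") ))
    { printf 'blob %s' "$size"; printf '\0'; cat "$file"; } | sha1sum
    git hash-object "$file"    # same hash; add -w to actually write the compressed object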
Git is already an insulting term where I come from - I remember one of our children, years ago when they were young, looking at my screen and saying "why are you typing git ??" in a shocked sort of tone.
They're much less concerned about their language now.
I don't mean to overstate the case - it's not a swearword or the sort of thing you'd really censor, just a playground term for a mean person. All the same it's an ugly word and (however irrationally) this is one of the reasons I prefer Mercurial to this day.