I really do not see the problem with the rails commit history. Having those merge messages that point back to a pull request leaves lots of documentation about what was done and why it was done. I really do not know why people are so particular about their git log either. It shows an accurate history of the repo, not a revised cleaned up history.
However, being able to edit pull requests before merging was a good thing to learn. In the long run, though, I think I'd want to learn to do it with plain old git rather than tack another tool onto my workflow.
So to be clear, every commit in the ActiveMerchant history also points back at the initiating PR via the commit message including the PR number (which gets linked up). So there's no real loss of information in that sense
And I hear you re: adding another tool, as I'm very cautious about that myself. That said, `git am` is "git canon", and all hub is really adding is the ability for it to slurp Github URLs as well as Maildirs.
Neat, related trick: add ".patch" to the end of any PR url on Github and you'll see a formatted patch for the PR. All hub is really doing is slurping that and passing it to git as though it came from a mailing list.
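Without hub, the same flow is just a pipe away (a sketch; the PR number here is made up):

    # Grab the formatted patch for a PR and apply it as regular commits,
    # preserving the contributor's authorship:
    curl -sL https://github.com/Shopify/active_merchant/pull/123.patch | git am -3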
> So there's no real loss of information in that sense
There is, in reality. Try to git bisect a bug. (Yes, I read the part about bisect in the post. But thanks, I'd rather search for the bug in several 10-line patches than in one 1000+ line patch.)
Most of the contributions I'm merging in are 10 lines, but a lot of them (half?) take two or more (sometimes many more) commits to converge on those 10 lines, and none of those intermediate commits has useful information in it. The other half of contributions are only a single commit already.
If someone contributes a true 1000+ line contribution that I'm even willing to take (yikes!) then you better believe I want that in multiple, logical commits, and it may even be me that ends up doing the splitting.
I'm sure a lot of it just has to do with what types of contributions a given OSS project receives, so YMMV.
MMDV: on my current project, code reviews don't pass and merge unless the automated CI (builds, tests, lints, etc.) passes, so if your new feature takes 1000 lines of code, so be it. We do try to keep to the minimum necessary change, though, and focus on tight conceptual chunks. But a patch should never break the tests, especially not just to keep it below a certain size; otherwise how do you do automated bisects?
His answer there on this topic boils down to “tell the patch author that they made a mistake, and steadfastly refuse to do anything with it yourself until the patch author fixes it”. This is the whole point of this article: that is simply not pragmatic. You can get changes that aren’t quite perfect, where the original user is not willing to make the changes you require. When a contribution ends up abandoned in that imperfect state, Linus’s answer is effectively to ignore it.
Linus' situation is that his ability to handle incoming contributions is a real bottleneck. In that case it is worthwhile to say, "Make my life easy or I ignore you." And because people want their stuff in his project, they do wind up making his life easy.
Most open source maintainers are not in that position.
Yeah, I agree Linus takes things too far at times. My response was mostly because the author's "this is the way Linus would have done it" stuff. I do agree that we should be willing to do a little cleanup ourselves when feasible. But I disagree with the author that we should edit /somebody else's commits/.
Well, quoting from that message, "If it's not ready, you send patches around". I think the big difference is that I'm willing to take patches that aren't ready and get them ready, whereas I'm sure Linus usually pushes back on the contributor to clean up their stuff. And I do think merges are super useful in a lot of cases, but Linus added `git am` too, and I like using all the tools at my disposal when they fit.
Yeah, I feel the same way. I've never been one to squash my commits, even when I commit something stupid and have to revert it. We have three owners and one collaborator on a JavaScript project right now; we've been good at keeping up with pull requests and closing off bugs. I can see where a bigger project like ActiveMerchant could get nuts if only one or two people are handling it.
There's an issue here where people are conflating two definitions of "history". Sure, what happened in the real world can't be changed, but why should we be constrained by that in the worlds we construct in software? Just because someone calls a record of development "history" doesn't make it inviolate, and quite frankly, as a maintainer, I don't care about every little sneeze that a developer had on a project. I want each commit to compile and pass tests at a minimum, so that I can do automated git bisects. I would prefer to have a coherent history of well thought out conceptual chunks, so that I can figure out how things are supposed to be, not what some sleep-deprived programmer was thinking at 3AM on a Sunday.
> I want each commit to compile and pass tests at a minimum, so that I can do automated git bisects.
What stops you from doing automated git bisects when some commits shouldn't be tested? Returning a 125 from your script if it doesn't compile will skip that commit. When it comes to GH you can skip any commit where the message doesn't start with "Merge pull request #" for example.
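As a rough sketch of such a script (assuming `make` / `make test` is the check you care about):

    #!/bin/sh
    # bisect-check.sh -- run as: git bisect run ./bisect-check.sh
    git log -1 --format=%s | grep -q '^Merge pull request #' || exit 125  # not a PR merge: skip
    make || exit 125                                                      # doesn't compile: skip
    make test                                                             # non-zero exit = bad commit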
If you're squashing commits together and only leaving chunks that compile & pass the tests, what are you losing by skipping commits that don't compile & pass the tests?
Can you come up with a concrete example of where this would be a problem? I might be missing something but I can't picture a case where there'd be an issue.
I guess different people have different workflows. I prefer to make many tiny atomic commits, which gives me a very granular ability to backtrack if things go wrong.
For "conceptual chunks", I use branches or tags, and don't worry about the individual commits very much.
Same here. I don't generally care how "messy" the git commit log is, but when we reach a point where something is notable, then it's time for a tag/release or whatever.
There's an argument that bisect is only useful if no commit is broken. If, as is often the case with my history, most commits don't compile, maybe there's an argument for squashing?
Would you not just skip those? Skipping non-compiling code is the canonical example of "git bisect skip".
You could also have a little script that does something like:
    # Usage: gbisect_prs <bad> <good>
    git bisect start "$1" "$2"
    # Pre-skip every commit in the range whose message isn't a PR merge
    git rev-list --invert-grep --grep='^Merge pull request #' "$2..$1" |
        while read -r sha; do git bisect skip "$sha"; done
That would tell git to ignore any commit that's not a PR merge.
You could also just include this in your bisect script if you're not doing it by hand: return 125 if it's not a commit you want to test.
This works as long as there are only a few commits which don't build. It fails miserably if, say, 25% of all commits don't build (don't ask). In such a situation it becomes increasingly hard to tell what actually broke whatever functionality you're interested in: The commit you end up with, or one of the N prior commits which don't build (and which are incidentally usually full of "noise").
> The commit you end up with, or one of the N prior commits which don't build (and which are incidentally usually full of "noise").
The alternative, where things are squashed together, would leave you with just one commit but it'd be those N combined together (at least, it could well be more).
You can still diff between the last good and the first known bad. If necessary you can branch and squash the non-builders to see what happened. At worst the intermediate commits give you the same problem you would have had if you had squashed to start with.
> At worst the intermediate commits give you the same problem you would have had if you had squashed to start with.
Well, if you're going to be bisecting a lot or compilation takes a while, it saves you time to do the squash now (when you know which commits are supposed to compile and which aren't) rather than when you come back to it.
It might make sense to squash so you only have compile-ready commits in your log.
In a sense I work this way, without the squashing. When I write a test, the build fails and nothing has been done on the application codebase so I don't commit. When I fix the build by writing new code, I commit.
Maybe I should be doing small intermediate commits and squashing the commits into one commit.
> Maybe I should be doing small intermediate commits and squashing the commits into one commit.
I would recommend trying that, or at least something similar: Try committing regularly (this is useful if you ever want to go back to a prior state while working or e.g. pick up where you left off on a different machine, etc), but reorganizing and cleaning up your topic branch (via rebase -i, etc.) into a few commits before merging (fast forward). What is "a few commits"? Well, it's a bit of a judgement call, but I just group stuff into logicially coherent units which make sense as single units of functionality. Of course you must make sure each individual commit at least builds sensibly and preferably that all tests run too.
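A sketch of that flow (branch names are placeholders):

    git checkout topic
    git rebase -i -x 'make test' master   # squash/fixup WIP commits into logical units,
                                          # re-running the tests after each rewritten commit
    git checkout master
    git merge --ff-only topic             # fast forward, no merge commit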
I think a lot of it depends on context. In a private repo I'm right there with you. But when I look at a public repo with a lot of contributors, I want to know what the "logical" history is a lot more than I want to know about the three steps it took to get a feature ready for public consumption.
And one of the major benefits of git is that you can do short, small commits often, to your private repo. When your fix or feature add is ready, you can squash them into a more coherent commit that builds and passes tests (so that it doesn't break bisect) and push it to the public repo.
I don't understand the bisect argument, maybe I'm missing something but can't you just skip everything that doesn't build and pass the tests? Returning a 125 from your script automatically skips the commit.
And how do you know that wasn't the commit you were looking for? The whole point of bisect is that you don't have to do a manual bug hunt; you can create a script that tests for the thing you are looking for. If it's something static sure, that's easy. But if it's something more complicated and you don't know what's causing it (a prime candidate for bisect), you will need to build and analyze the behavior of the built executable. Non-building commit = more manual bughunting.
But aren't we talking about squashing those commits? You're not going to lose anything by skipping commits between PRs instead of squashing everything between PRs into one commit, you can only narrow it down more.
> you can create a script that tests for the thing you are looking for.
Yes, which you can tell to skip anything that doesn't compile/pass tests. Return 125 from your script and it'll be skipped.
> Non-building commit = more manual bughunting.
But you can just do a loop of "make || git bisect skip" and skip everything that doesn't compile, surely. Or if you're using a script, just return 125 for things you want to skip.
I always use "--no-ff" when merging topic branches to master. The advantage is that you can keep individual commits on the topic branch readable and still get an overview of the changes between two releases with "git log --first-parent".
You can also see the set of changes in a topic merge easily with "git log ${merge}^1..${merge}^2", whereas if you use a fast-forward merge it's not at all obvious which sets of commits are related.
We use the merge commits to trace back a given deploy or machine image to the pull request that caused it. So from a given deployed AMI, a Jenkins job can point to the pull request, and hence the full context of the feature or bug description, the back-and-forth of code review, etc.
The problem is that pull requests are one-way, read-only collaborations. If I want to make changes, I have to close the contributor's PR and open my own. With write access, I could do:
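Presumably something along these lines (just a sketch of the remote-based flow; repo and branch names are placeholders):

    git remote add contributor git@github.com:contributor/project.git
    git fetch contributor
    git checkout -b their-fix contributor/their-fix
    # make the tweaks, commit, then push back so the PR updates in place:
    git push contributor their-fix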
The result is the same, but I've found the `git am` workflow to be much smoother vs. mucking around with remotes. Some of that could be due to the nature of the OSS project I work on - ActiveMerchant - since just in the last month we've had 30+ unique contributors, and most of them have contributed a single change.
My general recommendation is to make sure you try out the `git am` flow for a bit, but then just do what works best for you.
Keep in mind that `git am` is not identical to pulling the commits. You might be applying the commits on a different base, and certainly your committer field will be different (not to mention any substantive changes you make). As a result, the sha1 of the commits you create will be different than those of the original submitter.
This makes life harder for the submitter, because they cannot ask a simple question: were my commits merged into the upstream repository? Because no, they weren't; but commits that are the moral equivalent were. Usually `git log --cherry-pick` can correlate the two, but not always.
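For example, something like this (branch names are placeholders):

    # Show the commits unique to each side; patch-equivalent commits are omitted
    git log --oneline --left-right --cherry-pick origin/master...contributor/topic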
We do use `git am` upstream when working on git itself, for the reasons you indicate in the article (plus we like mailing-list based review). But it does come at a cost in managing the various versions of patches.
This is why it is a good idea for proposed patches to be sent to a mailing list, where people other than the repo owner can more easily review patches, and then to have those patches tracked via patchwork.
And when I accept a patch, even if I've had to edit the commit summary so it is parsable English (not all of my contributors speak English as a first language), I send a reply via e-mail saying "Thanks, applied".
Agreed, editing contributions does add a bit of extra work for contributors post-inclusion, especially if they've been using their own branch in production. That said, since this is a PR flow, the contributor will get an email from Github that their PR has been closed, and they can track directly from that PR to the closing commit.
While I've often wished that Github had a proper mailing list per repository that could be used to discuss changes, I have to say that one awesome thing about PR's is how nice they make reviewing a patch. The diff view is just awesome, so now that I have a `git am` style flow with them it would be hard for me to switch to just passing patches around on a mailing list. That said, a mailing list would still be a welcome adjunct to the PR flow if done right.
Well, the commits may be changed during pulling if the maintainer deems it necessary (or a merge just ain't happening cleanly), so it seems like the solution would be to include the original SHAs in the `git am`'d commit messages. Then the submitter knows what they intended to get pulled got in, which should resolve patch verification, right?
I think maybe you overlooked the grandparent's suggestion; you don't have to mess with adding contributors' remotes. Open pull request heads are available on your existing remote.
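e.g. (the PR number here is made up):

    git fetch origin pull/123/head:pr-123   # fetch the PR's head into a local branch
    git checkout pr-123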
>The simple solution, though, is to require pull requests to come in a feature branch, and flat out reject any that target master. /shrug
Agreed, and in my experience, every major open source project I'm familiar with requires pull requests to feature branches. In fact, most small projects use the same workflow.
Is this not the case with most open source projects?
There are a lot of small, one-off, often very useful utilities that people are now sharing with each other via Github (I'm guilty/a participant in this phenomenon), and many noble users of these utilities want to help, contribute, and send PRs... a non-negligible number of them new to Git.
So, is it more of a PITA to set up a feature branch for a single python script and instruct users in your CONTRIBUTING file to 'make sure they submit PRs to branch XYZ!' or just deal with the odd occasional PR to master? Folks new to Git will probably just send a PR to master anyway (I believe OP addresses the 'new user' issue as well, having to explain Git commands to users in comments on a PR)
That all being said, I still typically follow the workflow shadowmint outlines above.
Using git remotes and distinct branches per contribution is totally legit, and I still do it sometimes. But, I'm often/usually dealing with contributions that have already been reviewed and don't need a whole feature branch/review within the root repository before inclusion. And for that setup - where it's almost ready - it's so much easier to just `git am` it into master, make necessary tweaks, and push.
In general I'd just encourage maintainers to try the `git am` flow, especially on small to medium complexity contributions where it looks mostly ready to go and you don't want to do another week of ping-pong with the contributor just to get a variable renamed or some whitespace fixed.
As I tell my kids, "Just try one bite of <food I find delicious>. If you don't like it, that's cool, then it's more for me!" :-)
You totally could do that, but it doesn't really help with the lack of interest problem he described. If the contributor can't be arsed to fix their patch, can we expect them to be arsed to merge a patch to their patch and then re-PR it?
He seems to be complaining primarily about the quality of commit messages, missing emails, commit sign-offs, and things like that. It is more about bureaucracy (which is arguably important on a project of Linux's size) than about the workflow itself.
It's about the patches not being fit for merging. It might be because of the commit sign-offs, or it might be because it was missing error checking, or any number of things. The real question is which incurs more overhead --- asking the contributor to fix it, and then needing to wonder when/if the contributor will fix it, or to just make the d*mned changes yourself, while preserving bisectability.
Another very common one isn't about bureaucracy, but about making the commit messages readable --- including the stack trace of the failure you are fixing (which again is important for non-toy projects that are being distributed in other products, and where people may cherry-pick fixes into a stable branch used for release purposes). Or if you have contributors from around the world whose native language is not English, and you want to make it easier for other people to understand what the heck is going on without having to read the contents of each and every diff.
This is fundamental to project health, which means that a proper workflow is fundamental to project health. Given that the vast majority of Github repos are toy-sized, or end up being abandoned, maybe that's fine for Github. But for any project where I have hopes that it will turn into something real (and if I don't have that hope, why would I waste time on it?), I'm not going to accept pull requests from Github.
That's cuz Linus Torvalds doesn't have much of a bedside manner. I agree with y'all on the quality commit message stuff, but that last third was all conjecture on your part, homey.
The last third is based on my extensive opinion having worked on open source software for my entire career, including being the ext4 subsystem maintainer and the e2fsprogs author and maintainer.
These quality issues become a problem sooner than I would have thought, and Linux is big enough that they can just demand contributors know their stuff. But I found that in between "hobby project of my own" and "huge project where people beg me to contribute", there's a big space of "big enough that it gets contributions, but small enough that it's not worth turning contributors away for small quality issues". And I love that hub + `git am` lets me take those contributions but still maintain the workflow I want on the project.
You know, when I started reading this I thought that I would disagree with you and write a grumpy comment to that effect. ("Considered Harmful" usually makes me grumpy.) However, you convinced me otherwise. Congratulations on a good "Considered Harmful" piece!
Here's another example: I recently made a PR to a project to fix a broken URL. I changed the wrong file and forgot about the PR. The maintainer had to close the PR, make the change himself, and explain why/what he did. This whole process would've been a lot simpler if Github allowed people to merge to another branch.
> This whole process would've been a lot simpler if Github allowed people to merge to another branch.
Is this a problem that comes from forking? I raise PRs on my projects onto various different branches all the time. Or maybe I've misunderstood the issue.
I just tried this on an open pull request we had. Pretty OK experience. For some reason, though, there was a merge conflict when running the "hub am -3 <url>" command. In my case it was easy to fix, but Github reported the PR to be mergeable, so maybe Github uses a different merge strategy for PRs than "hub am" does out of the box?
Yeah, I don't get this hub thing. When I want to manually merge on a Github project I just follow the GH instructions on the PR page to manually checkout the fork, do my work there, and merge it locally then push. Why do you need another tool to do this?
As far as I know hub is more than just a command line tool for pull requests. It extends git with a lot of Github niceness. But in the end some people just like CLIs better - and in this case it exposes some features that aren't directly available in the UI.
FWIW hub is even handy for the "add a remote for the fork, check out the fork branch" flow, since it lets you just `git remote add ntalbott` instead of having to find or derive the full url for the remote.
As I understand it, `git am` replays the series of commits on top of the branch you run it in, whereas a merge commit takes the branch's combined end result and applies it in one go (while keeping the full history of the commits being merged). So it's definitely a different strategy, and you can get conflicts where you wouldn't with a merge commit. As you said, though, they're super easy to fix when they happen. Often when I see a conflict like this it's because there are 52 (OK, exaggerating, a little) commits in the contribution, and using the `git am` flow is a huge win since I'm now in a good spot to squash them down to one or two commits.
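When that happens, an interactive rebase right after applying does the squashing (a rough sketch, assuming the patches went on top of origin/master; the PR URL is made up):

    hub am -3 https://github.com/Shopify/active_merchant/pull/123
    git rebase -i origin/master   # mark the extra commits as "squash"/"fixup"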
Just a note to people who want to try this: According to the post you should use the command "git am -3 <url>", this seems to be a typo and should be: "hub am -3 <url>"
I'm not sure how I feel about someone writing something and making me the author. I guess attribution makes sense in the form of paraphrasing. But I tend to think of commits as literal quotations.
I see what you mean, and it almost seems like the "sin" isn't the additional editing, it's the subsequent rebase. It's one thing to require contributors to rebase, but it seems like another to rebase for them. Maybe I'm missing something, but based on the "History Is Written By the Victors" section of TFA I would have been surprised to see anyone other than @ntalbott at the top of https://github.com/Shopify/active_merchant/graphs/contributo...
You may think of commits as literal quotations of the "Author", but it's important to realize that git doesn't. If you want to know who's ultimately responsible for a commit, you always want to look at the "Committer" since that's who actually signed on the dotted line so to speak.
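You can see both fields side by side with something like:

    # Author and committer often differ when a maintainer applies a patch with `git am`
    git log -5 --format='%h  author: %an <%ae>  committer: %cn <%ce>'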
I'm not concerned with who's ultimately responsible for the commit. I'm concerned with being called the author on something I didn't write. These distinctions are not made or enforced by git. At the end of the day, it's how humans interpret two pieces of metadata.
The first hit on Google and Bing for "git committer vs author" brings up this SO entry:
I've always just pulled in the contributor's branch, made additional commits to fix things up, and squashed all the commits before pushing up to GitHub. Not very difficult
Maybe, but... probably not. The thing is, I want to work with the changes locally before they get included. So it can't be a 100% web workflow. That said, there might be ways the web UI could make the transition to the CLI easier, and it would be great if Github promoted the `git am` flow on the PR pages as well.
What works best for me is to use a combination of the Pull Request button and to manually merge or rebase changes when necessary.
I used to strictly stay away from the Pull Request button, because it made my history "messy". Now I care less about that and more about the convenience (when it's appropriate).
I wish github would either detect when a branch is merged back into master and automatically close the PR, or allow for some more complex merge operation. Squash merge would be particularly helpful.
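In the meantime you can get the same effect locally (a sketch; `topic` is a placeholder):

    git checkout master
    git merge --squash topic   # stage the branch's combined diff without committing
    git commit                 # one commit containing the whole topic branch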
Which essay I already knew about and chose to willfully ignore when I titled the OP. We could chat over beers sometime as to whether I'm a bad person for doing so :-)
It's a play on the classic Dijkstra letter, "Goto Considered Harmful." Also yes, I think it's harmful. A co-worker and I wanted to use a new up-and-coming FOSS project that's hosted on GitHub, but we needed a certain killer feature. Said co-worker worked for about 4 weekends in a row and fully implemented it, with nice abstraction and separation of concerns. The code was then turned down because it was "too separated, and unnecessarily abstracted", which we disagreed with. Thus the pull request, with an amazing feature, sits languishing, because neither he nor I is going to spend another few weekends working on a project that won't integrate new features unless they're perfect and conform to the original author's idea of "perfect code".
So what you have is a fork, with a fully implemented feature, that didn't make it back upstream.
This happens all the time, and is the nature of open source. No project is obligated to take your contribution, and that happens for a myriad of reasons that you are not obligated to understand.
It's fundamentally a human problem, not a software one. There's nothing wrong with the software here—it does everything required and more to be able to manage the codebase.
Take your fork, explain the situation in the README and let it be. It'll be in the fork graph and the list on Github. People can find it. If they like your feature, maybe they'll use yours instead. If enough people get behind it and ask for it, or say "hey this was in a PR, why wasn't it merged?", the pressure could be enough for the original author to just accept the PR.
In any case, this is not a problem with the pull request feature. It's a simple collaboration problem, and a PR is only one method of communication you can use to solve it. If you give up after throwing a PR into the void (not that I'm saying you did), you shouldn't expect instant success.
Translation: "My co-worker wrote a clusterfuck of indirection spaghetti and the open-source volunteer refused to maintain it!"
Every PR that adds a feature adds future support and maintenance effort. If you and your buddy are unwilling to spend the time to get it right, why are you expecting the maintainer to spend the time to support it?
In short, it was a perfect implementation of adding support for dynamically updating scripts via NuGet for a C# port of HuBot by typing "<robotName> update-scripts". This is the standard way of managing dependencies, and would mean not having to manually upload scripts to a folder and restart the bot every time you need to add or change the scripts.
The maintainer shortly thereafter attempted to do it himself, his way, and ended up with a bug-ridden, memory-leaking implementation that barely worked and now sits languishing in a lonely branch.
Don't make assumptions about the code you didn't see and the attitudes involved coders might have. We would have been (and still would be) more than happy to help maintain it had the bot turned out useful for us, but because it was a new project and the structure was changing constantly, attempting to fork and integrate upstream changes would have been a nightmare, so we decided to go a different direction.
The right way to do this is to talk to the maintainer about how he wants the feature implemented first. Especially if it involves four weekends of work.
Depends on the size of the feature, and the maintainer, but you're right about "talk to the maintainer."
Some projects, you'll want to do a weekend of really rough work, contact the maintainer, get their thoughts. Others, you'll want to do 8 days work, get it far enough along to show, get their thoughts.
Often though, before you write a line of code, talk to them.
I was once on the other side of this. A nice guy wrote a whole ton of code for a module, refactored things to the max, made it "testable" and a bunch of other things. He was very nice, very enthusiastic, and really wanted to help.
But I had to say no because it didn't fit well within the overall project, and didn't really add any value but a lot of complexity. I felt very bad to turn him down but otherwise I'd be now responsible for code I have no interest in and code I don't feel helps users.
The right way is to talk to the maintainer and agree on what to do first. Or be prepared to run a fork.