Test && commit || revert (medium.com/kentbeck_7670)
96 points by grzm on Sept 28, 2018 | hide | past | favorite | 87 comments


One release of the GHC Haskell compiler, long ago, had a bug where it put all your source files in the list of build artifacts it had created. When GHC ran into a type error, it deleted all its build artifacts from the current run.

There are stories that a few people argued to keep it around as a feature, for training in mental type checking.


Do people not typically modify the tests in the change where they introduce a feature?

I have never worked anywhere where I would check in a change without the tests passing. If you are making a huge refactoring, you do it a step at a time, with each incremental change still having passing tests.

When I was at Google I believe that changes were automatically rolled back if they broke enough tests globally. I don't recall anyone ever complaining except that the rollback was slower than they would have preferred. You simply can't check in a library change that breaks half the company, after all.

If people are writing adequate tests and consider commits to be "change lists" instead of just a random thing they want to back up from their local machine... this is a no-brainer. But it does require the culture of having your code reviewer check that they think there's enough code coverage. Otherwise "passing tests" doesn't mean anything.

(Google's tools happily showed lines of code that were covered by tests in the review UI, so you have objective and subjective data as a code reviewer to look at before approving.)


Note that this post is talking about the idea as a follow-on to the idea of automatically running tests and then generating a commit, from every filesystem-observable change (https://medium.com/@kentbeck_7670/limbo-on-the-cheap-e4cfae8...).

I'm sure your (manual) commits keep together the changes to the code and the matching changes to the test that keep the test green—but your Ctrl+S'es probably don't. In this strange workflow, the tests are required to be green every time you press Ctrl+S; and any time you press Ctrl+S and the tests aren't green, everything you did since the last time the tests were green disappears from the worktree.
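Concretely, that save hook amounts to something like the following POSIX-shell helper (a minimal sketch of my own, not Kent Beck's actual script; the `tcr` name and commit message are hypothetical):

```shell
#!/bin/sh
# Hypothetical TCR step: run the given test command; if it passes,
# commit everything; if it fails, discard everything back to the
# last green commit.
tcr() {
  if "$@"; then
    git add -A && git commit -m "tcr: working"
  else
    git reset --hard HEAD
  fi
}
```

Wire `tcr ./test.sh` to the editor's save event (or a file watcher such as inotifywait) and you get exactly the behavior above: every save either commits or erases your work.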


I just can't imagine how that would scale in any significant way.

One of the many points of automated testing is that it's automated. I don't have to mentally run all of my tests in the back of my head to tell whether or not my code is correct. With this setup, I do. I become scared to even hit Ctrl+S in case I made a typo somewhere. Or perhaps even worse, what if I made a typo in my test and just committed a half-finished implementation? I've taken automated testing, one of the most brilliant time-saving inventions in all of programming, and turned it into a chore.

And as a programmer, that's supposed to encourage me to do more TDD?

I understand that the article is specifically about addressing those kinds of doubts, but... I dunno. I can't imagine ever trying this for any program bigger than a couple thousand lines of code.

If the goal is to have tiny changes, why not just count up the changelog and revert it if anyone ever tries to commit more than 25 lines of code? It seems like that would be just as effective, and potentially less annoying.


Oh, that's... odd.


The way I read it, it’s not on every save, but on every time you try running the tests.


Yeah, this is how I’ve worked for 16 years. The unit tests always pass. Maybe once or twice a year you can go way out on a limb and do something exploratory where you don’t care about the tests passing for every commit, but... part of TDD and CI is that the tests always pass.

In some ways, I think the feature branch->PR model of today is a step backwards from traditional central server single-trunk development of years gone by (CVS/Perforce/SVN). Branches and PRs work better for F/OSS development, and for monster refactoring, but I kinda long for the days of a normal team just committing to master/trunk several times a day.


I think PRs are logically the same as Perforce changelists. The fact that you can pick apart a PR into multiple commits does pollute the history when you merge, though, unless your tools treat the PR as a single atomic commit (for bisection, etc.)

I have never treated the individual commits inside a PR as anything other than "I'd like to backup my workstation to the cloud now", but if people are treating them as something meaningful, I think that's a mistake.


I don't know if I'm reading your comment correctly, sorry if I'm not... but individual commits within a PR are absolutely not a backup. Every commit in the PR has to be a logical unit, meaningful, atomic and stand on its own. The PR, as a high level feature, is just a grouping. But single commits have to be useful and be a history/build up of how high level features are being developed. This granularity is important to logically isolate the changes and make them bisectable among other things.


I think the popularity of rebase merging in Git is partly to blame here. Rebase merging tries to make everything seem like real history, but no one ever seems to have a solid idea of what that is actually supposed to mean (and CI systems don't make it easy to configure "test all commits" or to declare what you want your commit history to mean).


There's no such thing as "real history" - even without rebasing, your commits do not reflect the development of the code; they reflect practically arbitrary moments in time at which you chose to commit.

The idea behind rebasing is not to make it seem like "real history", but to have all changes in a commit to be related. They're not a reflection of how you built the code, but how you can understand it. They're documentation, and as such, you spend effort on them.


Yup. I always picked "merge" instead of "rebase" in Gerrit because it's not like I tested the result of applying each patch to the master branch. All I did was decide that mine worked and that I wanted to merge my branch with the upstream.

People like using clever features to make the history inaccurate, which seems like a bad idea to me. (It's "cleaner", but wrong.)


If you internally rebase to cleanup history you have to test again. Ignoring that step is wrong. Local rebase can be good and useful.


Once your test suite is big enough that it won't run within a minute or two, it's kind of hard to keep a "green on each commit" workflow, though. Especially with UI and visual diff tests in the mix, I certainly don't want to tie up my local computer for a full test run when I'm just fixing a small bug; pushing the commits in a branch and letting the full suite run against that while I move on to another task is a much better use of resources, and that's where the PR model seems most useful to me.

Are there workflows that avoid this problem but still enforce every commit being green?


Sure. I separate heavyweight tests out into integration/acceptance tests (their side effects make them slow, and also mean they're not really unit tests). So I don't commit until the unit tests pass, and then a PR triggers the heavyweight tests on my build server, to be sure the branch is clean before the merge to master.
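One such split can be sketched as a Git pre-commit hook that gates local commits on the fast unit suite only (the `./unit-tests.sh` runner and the `precommit_check` name are hypothetical):

```shell
#!/bin/sh
# Sketch of a pre-commit hook body (.git/hooks/pre-commit): run only
# the fast unit suite locally; the slow integration/visual suite runs
# in CI against the PR branch instead.
precommit_check() {
  if ! ./unit-tests.sh; then
    echo "unit tests failed; commit aborted" >&2
    return 1
  fi
}
```

Every local commit stays green by the unit suite's definition, while the expensive runs happen asynchronously on the branch.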

But if there's too much divergence between unit test and integration test pass/fail rates, then unit test coverage needs to be improved; obviously, easier said than done when you have old pre-TDD code that's tightly coupled and/or code that's interacting with the system and incurring side effects that are hard/expensive to test.


In Phabricator you make your changes on a local branch, running "arc diff" to post or update a changeset. CI runs on each update to the changeset. You're expected to have CI passing before you "arc land" (squash merge to master and push), and usually before you solicit code review. Commit whatever garbage you want locally and even publish it to the Diff, but your code doesn't touch master in the central repository until reviewed and passing.


> But it does require the culture of having your code reviewer check that they think there's enough code coverage. Otherwise "passing tests" doesn't mean anything.

Replying just to highlight that it's possible, even common, to have a scenario in which:

> "passing tests" doesn't mean anything.


Back in my last job, management tried to get automated standards for code coverage. This led to people writing unit tests that looked like:

  //implementation
  var CONSTANTS = {
    RED : "red"
  };

  //test
  test(function () {
    assert.strictEqual(CONSTANTS.RED, "red");
  });
I spent a lot of time arguing (unsuccessfully) that these kinds of tests accomplished literally nothing. I heard everything from, "well, this lets someone know that the constant shouldn't change" to "well, writing it twice will encourage programmers to be more careful."

The worst was that people started doing it with translation strings. So we'd send a config file to the doc team, they'd rephrase something to be clearer, or remove a misspelling, and suddenly tests are failing and the translation string can't be committed until the next day. We had a release build delayed over this once.

Meanwhile, because the coverage report is already showing the code path that displays the string as green, nobody has written tests to figure out if the dialog containing the string actually shows up on screen.

Sometimes I feel like the entire concept of code coverage was just a mistake ;)


This only comes up if you have data that isn't bundled with logic: as long as there is some logic to test, even enums end up with full coverage.

The outstanding issue with Java code coverage tools is switch statements, because there are unhandled paths generated by the compiler, but there is an easy solution: don't use switch statements. Enums are basically always a better option anyway.


Code coverage is a very poor measure of quality of your tests. You can have lower coverage and great tests, and high coverage with completely useless tests.

I use test coverage as part of my workflow. Run my tests with coverage on and make sure my new code is fully covered. Kinda like a sanity check to make sure I'm not missing anything. Or if I'm stepping into old code and I want to see how well it's tested, I'll run the full suite with coverage on and see how bad it is.

The second management starts asking for x% code coverage is when it starts to break down, imo.


I do this fairly often. I find it can aid communication and recall.

Few examples:

Say we have a parser and we discover that it chokes on certain input. I'll start a new branch and first commit will be "Failing test case for XXX" to demonstrate the bug. Then I go about fixing it (or sending it off to a colleague who is better suited to do the work). Once it's fixed, it goes into PR and it gets merged into master only if CI likes it. No one is inconvenienced by the fact that that particular test fails in one of the commits...

In this case, there are couple of benefits:

1. I get to see the tests go red. It's happened to me a few times that I write the test and the fix in the same breath because the fix seems simple, but it turns out the fix did nothing because the test wasn't triggering the bug. I find that I'm less likely to do a commit that says "add failing test" than I am to dive right into fixing the code before running the tests.

2. It's easy for anyone to go back in time and see that the test indeed went red, that is, to verify that the provided input indeed demonstrates the bug. If the fix and the failing test case were in the same commit, one would have to carefully check out just some files and not others.

3. If the bug is being handed off to someone, it's pretty obvious where to start: there's a single failing test when they begin looking into it.

In a similar vein, another use case for a commit that breaks the build is as a means of communication when "architecting" new features and delegating work. We are all developers, so code sometimes comes more easily than prose, and creating a new test case that fails is more expressive and less ambiguous than paragraphs of text in the ticket.

Say we get a ticket that says "We are having problems with trolls", and we want to make it a bit harder for them to register (contrived example and silly solution). I can create a new branch, make the following commit and assign to a junior developer:

    + assert_false(
    +   UserRepository->create( email => 'blah@mailinator.com' ),
    +   'Do not allow throwaway email addresses to register'
    + );
This way, whoever gets to work on the feature has a great starting point: the ticket explains the business problem, and the failing test puts them right where they need to start looking. It can be a time saver for everyone involved vs. me updating the ticket and saying: "Hi DevJunior, I think the way to solve this is to hook into the registration process. The create method in UserRepository is probably the best place to implement this - it should return failure if someone tries to register with @mailinator.com"

Lastly, it's great to have a failing test to come back to when needing to leave a certain branch for a while - be it end of the day, or switching focus to more urgent matters. I find I remember quicker where I left off if test suite tells me, than if everything is green and I have to hunt what was the next thing that was needed to complete the feature... Basically, using VCS and the testsuite as a distributed TODO list :)


On some projects (e.g. Mesa), the test projects (piglit, etc) are decoupled from the project. It's not unusual that tests for a feature land after the feature is merged.


This is the exact method I use to teach TDD and it works. It works really well.

The article mentions the sunk costs fallacy. This is true. In the workshop I give you tougher and tougher parameters, shorter and shorter time frames. If your tests aren’t passing you reset to the last commit. You can’t commit without passing tests. People learn, a) write smaller simpler tests, b) code is cheap. The third time you write something you’ll write it faster and better. Better to do that in a tight loop than over a prolonged dev-test-refactor cycle that could last days to years.

You also learn how to refactor in the green. Boy is that a good skill to have.


But doesn't the classical TDD process start by writing a test and making sure it fails before the code is changed?


Generally if I make a change and nothing fails, I assume something has gone even more wrong than usual...


heh, i love it. Sounds like a military type of training, lol. I'll give it a try.


I don't like lumping "add test & pass" together. It's very easy to accidentally write a test that doesn't test what you think it tests, and actually passes even without the "fix" applied. I do think it's very important to verify that the test fails without the associated change applied.

This breaks `git bisect`, so maybe they shouldn't be separate commits in the public commit log, but they should be separate commits while you're working on it.


... except seeing the test fail is how I know my test might be working.

It's trivial to write passing tests.


That's why the red-green-refactor style of TDD is so useful. You start by writing a failing test, then make it pass with the simplest possible change, and then refactor the code while keeping everything passing. You'd be hard pressed to write a test which fails and then passes by accident. It leads to extremely clean code IME.


In my experience this approach offers nothing, because the whole challenge is in writing the right tests, and you almost never know what the test is supposed to be like until after writing code for a while. There’s a symbiosis between knowing what test needs to be written and knowing what code needs to be written.

No part of specs, TDD, BDD, etc., actually ever captures the reality of this, despite every Name Brand Workflow always claiming to.


> There’s a symbiosis between knowing what test needs to be written and knowing what code needs to be written.

This should only be the case while unfamiliar with the language or project at hand. In my experience, once you're familiar with the project and the language, mulling over a requirement for a short time should be enough to split it into smaller features which are simple to test in isolation.


It has pretty much nothing to do with your experience level, your familiarity with the project or the language or tooling. When those things matter, they amplify this phenomenon, but the baseline effect is already high.

It’s a function of requirements being ambiguous by design. Non-engineering stakeholders have a severe need for requirements and specs to be fungible and ambiguous, such that whether or not the state of the project satisfies requirements is fundamentally not codified in any objective document and no test can reflect useful objective measures of it. Tests can only loosely approximate it and the ways to do so are always shifting and changing and up to a subjective interpretation of some non-technical people.

Perhaps the only objective measure that non-tech people will agree to be specific about is financial cost, and even then usually only after it becomes a problem.

That’s every project in a product-oriented tech company period, and anyone claiming otherwise is full of it.

There can be isolated, tiny pockets of maintenance work or last-stage dev work once you pass a point of no return, when business people can no longer sustainably lobby for requirements to continue to be ambiguous and fungible, and finally you can start proving things with accurately scoped tests, but it's so late in the game that the idea you could do something like TDD at that point is comical.

Instead, becoming a good product-oriented developer means in part learning how to write well factored tests on the fly at the same time you are writing the code to be tested, and keeping testing code clean and set up with fixtures and set up to run in highly automated ways, so that when you’re inevitably ripping stuff out of the implementation for the 18th time due to yet another priority pivot from sales, you can nimbly also update the tests as you go.


I've been using TDD for years (since 2000!) and my experience doesn't match yours. I've used it on countless projects, big and small.

I think the disconnect is that you're assuming the tests are about requirements, when in fact the tests people write with TDD are much more tightly focused.


> “in fact the tests people write with TDD are much more tightly focused.”

This usually doesn't help and sometimes hurts, because if the tests are tightly scoped yet the requirements are ambiguous, it usually means the tests are very related to the current implementation details, which are going to rapidly change due to scope pivots.

TDD simply doesn't work when you don't know what tests to write or how to structure tests to accommodate scope and priority changes, which is all the time. Starting out by writing a test then coding to pass that test is baking in an assumption about that test which is almost surely already wrong, and just creates rework to undo a tightly focused test.


When TDDing the tests are going to change a lot, just like the code you are saying you write anyway. I recommend the book Refactoring for the tools so you can make the changes without breaking the test, then clean up the now-unimportant tests, and then clean up the unused code. That way the tests stay green even when making major, breaking changes.

Being able to slice projects that way, in terms of functional behavior, is absolutely a skill that requires practice. You are already thinking about what your code needs to do: the practice is to get good at expressing that in terms the computer understand. Once you are in the habit, the tests make it easier to evolve the code rather than harder.

It is possible to be a good developer without that skill, but learning and practicing the skill will make you better.


I'm sorry you're going to feel I'm full of it, but I've seen TDD work in three different companies. Of course, TDD is not going to fix broken project management such as blindly trying to second guess vague requirements. Other types of QA are also necessary, including (but not limited to) client UAT and sign-off on features, to make sure you're continuously working towards something the client would want to use.


I appreciate the comment but for me seeing is believing. I’ve heard many people make claims like yours, but across 5 “enterprise” software jobs at large companies (many thousands of engineers) and 2 jobs at medium startups ~10 yrs old and 300+ engineers, I’ve never ever seen any type of formal methodology make any impact or difference of any kind.

I’ve only ever seen non-engineers arguing about process crap and making claims like this, while engineers are forced to learn a bizarre skill set of how to actually solve problems despite this quarter’s latest and greatest new process thingy that’s going to fix it all.


That's the thing, if my test is green on the first try, I assume something is horribly horribly wrong..


Indeed. I always have to change a condition in the test to make sure that it's actually running, rather than that I just happened to get it right the first time ;)


This sounds highly useful for changes you can do in a few hours. Or for toy projects where I can keep the whole application in my head.

I have no idea how this should work on a big refactoring where you touch every single one of your, say, 200 files and add something to every single method. (Don't ask; assume a change that moves every single primary key to a composite key, so every getX(id) would become getX(id1, id2).)

I mean, we were multiple people - we split up the work because a lot of it wasn't even very error prone, but this took days (ok, weeks). Are you supposed to just not commit or even show the stuff you did to the others? (Yes, there were many, many reviews and pair programming sessions in between.) Of course this was on a feature branch. Yes, it lived too long; yes, there were still bugs at the end when we fixed the tests. I don't think it would've gone better or quicker if we had meticulously broken down the exact same change (add an id to every getter and setter at the controller and db-layer level). Actually, we tried - and as expected, it took a deep dive through all the layers.


I don't think the contrast to TDD (made in the image more than the text) is right: why not split 'add test & pass' into 'add test' & 'pass', and then you still have TDD.

Sure, you have one commit where the test was broken. But the product was broken anyway, and now you have a commit proving the test works.


The author describes the script in https://medium.com/@kentbeck_7670/limbo-on-the-cheap-e4cfae8...

"test && git commit -am working"

This seems like a recipe for a useless commit history.


The workflow is using Git as a realtime state-synchronization primitive (like Operational Transformations in Google Docs), not an SCM. The point is to push the commits over to the other person as soon as they're created, and for the other person to pull them and rebase against them as soon as they're created.

Unlike character-by-character OT sync, though, each synced Git commit does represent a working codebase. You could theoretically retroactively grab huge numbers of these commits and squash them down and describe the result—though multiple independent things might have happened in them that would be very difficult to untangle, given that, in this workflow, it's entirely possible that two separate people are working collaboratively on two separate features, in the same file, at the same time.


I assume that commits are being made to some sort of topical branch (Edit: I guess not? That seems like a poor decision). Personally, I encourage the developers across our organization to do an interactive rebase prior to pushing to a remote, to "curate" their commits (squash/fixup).

This greatly improves the utility of the commit history, at the expense of losing the actual commit history as work progressed:

  - FeatureA should ensure valid user session
  - fix test
  - missing semicolon
  - null check
  - fix incompatibility with FeatureB 
  - shit broke the build
  - address PR comments


Probably useless in terms of existing, standard development processes, but it seems to me like this Limbo effort is to come up with a new one (in which we can assume, by the principle of charity, that this is not entirely useless).


This brings Vigil lang back to mind:

https://github.com/munificent/vigil

It's a programming language that erases any code that fails to meet contracts or throws an exception.


We did this in CI at an earlier job, and what we found is that it works if and only if you have a good test suite. In our case we had too many false positives because of badly written legacy tests, so it was not a usable approach.


The problem with this is making changes TOO small - then you can find you are testing the wrong behaviour just to make your code committable. I think there is a balance in finding the right size of increment while still keeping direction.


This is how I've developed code for almost a decade. For a given test we have the following behavior triple: desired result, expected result, and observed result. A test fails if its expected result differs from the observed result. The set of development work is every place the desired and expected results differ. The CI is passing when no test changes behavior. There's a revert "window" where a change-in-state can be blessed.


What's the difference between a desired and expected result?


Winning lottery is your desired result. Not winning it is your expected result.


Desired is what you want to happen, expected is what you expect to happen, and observed is what happened with the current run.


From the perspective of a test that only passes or fails, this doesn't gain you anything.

I suppose we could imagine a test system which includes more metadata, and allows for placeholder values to let tests pass while producing an annotation that the product is not production ready.


What am I missing? If commit is a possible action, then we have a unit of code not yet in the repo. So if that is the case, won't the revert be reverting a previous, unrelated commit? If we can't commit unless the tests pass, then why would we ever revert?

I believe what I'm missing is taking the word `revert` too literally. I suspect it's actually `test && commit || checkout .` essentially.


It's:

    ./test.sh && (git add . && git commit -m "auto") || (git reset --hard HEAD)


You're thinking in terms of Git. Most version control systems before it used "revert" to mean undoing local changes, not applying the reverse of a commit.


It's not git, I think revert means "revert to state at last commit"


How do high-quality narrative comments get written in this workflow?

Either I write the comment up-front and risk it being deleted, or I write the comments after the tests pass and thereby propagate uncommented code.


Commit your comments. Your tests should pass.


The idea of this workflow is that saving runs the tests, and then either commits or deletes all changes. I think in this workflow I would write the code, get the tests to pass (thereby creating a commit), and then create a new commit with good comments.


Sure. Or, write your comments. Run the script, tests pass, comments are committed.


This sounds like a sure way to enforce a local optimum. If you can't make mistakes and break enough things at a time to change the architecture, in order to fix the architecture itself, it means your architecture is going to be extremely sticky, and there will be no chance that design mistakes will be corrected.


Owning production systems means you can't make mistakes and break things. People are already doing what you're saying can't be done, everyday, when they change production systems.

He's saying you should figure out the path from A to B and get there in small steps. And make sure your tests pass at each step. Don't make big changes that are hard to propagate and merge with others' changes.

Basically, avoid the classic "nobody change anything, I'm doing a merge today!" situation.


> People are already doing what you're saying can't be done, everyday, when they change production systems.

Only seemingly so. People do break tests when they are writing (or removing) code, and even intermediate commits often don't pass tests (they thus have to be squashed before the merge; they only exist to make reviewers' jobs easier, and keeping them risks a faulty bisect, for example). Development and deployment are separate things.


Test && (commit || revert)?

Or (test && commit) || revert?


I'm thinking they mean the Bourne(-style) shell, where this syntax means that test is run, and then either commit or revert is run depending on its exit status. That is, it is shorthand for:

  if test
  then
    commit
  else
    revert
  fi
Presumably the point is to encourage you to keep your code changes and tests so small that you are OK with losing work.


It's actually equivalent to

  if test; then
    commit || revert
  else
    revert
  fi
Or in another language:

  try {
    test
    commit
  } catch {
    revert
  }
This is because

  test && commit
is evaluated, and the resulting value is the left side of

  expr || revert
Not sure if that was intentional, but it's a very nice one liner assuming either your commit never fails, or your commit runs extra tests, and you want to revert if those fail.


Assuming it's shell, they are always processed in left-to-right order, so it's (A && B) || C, or (A || B) && C, with no proper order of operations. In this case, it's closer to test ? commit : revert though since the commit can't fail (but that's not a bash operation)


The latter. Including reverting if commit fails. I'm not sure if they meant that literally, but it could actually be useful to revert if commit fails, since commit might have hooks that run extra tests.
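A quick truth table makes the grouping concrete (POSIX shell, with `true`, `false`, and `echo` standing in for the real test/commit/revert commands):

```shell
#!/bin/sh
# a && b || c groups as { a && b; } || c, so c runs when EITHER
# a or b fails:
true  && echo commit || echo revert   # test green, commit ok    -> commit
false && echo commit || echo revert   # test red                 -> revert
true  && false       || echo revert   # test green, commit fails -> revert
```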


test && commit || revert

is grouped like

test × commit + revert

so it would be the latter.


In C and derived languages, this is true; not in the POSIX shell, though, where these are on the same precedence level, and associate left to right.

  $ echo 4 || echo 5
  4
  $ echo 4 || false && echo 5
  4
  5
This is because the grouping is (echo 4 || false) && echo 5.


I feel this comes from a frustrating kind of purism, like XP, which is a nice thought experiment but completely unworkable in the real world.

The real-world consequence of this would be people not writing tests until after they are feature-complete so they don't risk losing work.


The author founded XP.

Anyway, I suspect he's proposing it not as a serious methodology, but as a thought experiment to spur you into how you can make your changes smaller, simpler, and more atomic.


You are correct, he's not proposing it as a serious methodology. He ends the article by saying

> I don’t suggest you try “test && commit || revert” because it’s better than what you do now. I suggest you try it because:

> - It’s cheap.

> - You’re bound to learn something.


You could couple this with code-coverage requirements, like coverage must always increase or stay level with every commit. That can still be gamed but at least it's harder and more obvious to other devs
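A sketch of such a ratchet as a shell CI gate; the `.coverage-baseline` file, the `check_coverage` name, and the integer-percent format are all assumptions (real coverage tools report fractional numbers that would need awk or similar to compare):

```shell
#!/bin/sh
# Hypothetical coverage ratchet: fail the build if coverage fell
# below the committed baseline; otherwise move the baseline up.
check_coverage() {
  current=$1
  baseline=$(cat .coverage-baseline)
  if [ "$current" -lt "$baseline" ]; then
    echo "coverage dropped: ${current}% < ${baseline}%" >&2
    return 1
  fi
  echo "$current" > .coverage-baseline
}
```

As the comment above says, this can still be gamed with assertion-free tests; it just makes the gaming more visible in review.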


This can backfire in unexpected ways. Don't forget that deduplicating tested code decreases the coverage ratio. You don't want to forbid this, though.


People would just write tests which can never fail, and (in the best case) add the assertions in the end, one by one.


Speaking of tests that never fail... I've seen contractors hand over code claiming they'd met the 75% test coverage required by Salesforce's platform while never actually asserting anything.


The times I found XP unworkable in the real world all fit the scenario of "We're doing XP, but not pair programming because that's silly, we don't do x and y either because those are impractical."

The few projects that fully embraced it worked surprisingly well.


Disagree about XP being unworkable in the real world; I've done it on multiple projects at multiple companies.

People's preconceptions about what XP is are unworkable, sure, but that's on them.


Erm, what about test coverage? You could prevent committing untested code.


Not all code should have unit tests.


Relevant comment from a thread a few days ago: https://news.ycombinator.com/item?id=18043293


Thank god for buffers :)


My habit of CTRL-Z'ing out of multiple running vims has saved the day a few times.



Sure, but what I meant was that a deleted file that's still in a vim buffer may be rewritten.



