Mutation Driven Testing: When TDD Just Isn’t Good Enough (rajivprab.com)
125 points by ingve on Feb 6, 2021 | 67 comments



I've deployed mutation testing extensively in libsecp256k1 for the past five years or so, to good ends.

It's turned up some testing inadequacies here and there, and even a substantial performance improvement ( https://twitter.com/pwuille/status/1348835954396516353 ). I don't believe it's yet caught a bug there, but I do feel a lot more confident in the tests as a result.

I've also deployed it to a lesser degree in the Bitcoin codebase and turned up some minor bugs as a result of the tests being improved to pass mutation testing.

The biggest challenge I've seen for most parties adopting mutation testing is that, to begin with, you must have 100% branch coverage of the code you might mutate, and very few ordinary pieces of software reach that level of coverage.

The next issue is that in C/C++ there really aren't any great tools that I'm aware of-- so every effort needs to be homebrewed.

My process is to have a harness script (sketched below) that:

1. Makes a modification (e.g. a Python script that does small search-and-replace substitutions one at a time, line by line, or just doing it by hand).

2. Attempts to compile the code (if it fails, moves on to the next change).

3. Compares the hash of the optimized binary to a collection of already tested hashes and moves on to the next if it's already been seen.

4. Runs the tests and, if they pass, saves off the diff.

5. Goto 1.
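
A rough sketch of what such a harness can look like in Python (the mutation generator, build command, test command, and binary path here are all placeholders, not the actual scripts I use):

    # Sketch only: mutations(), the build/test commands and the binary path
    # are placeholders for whatever your project actually uses.
    import hashlib, pathlib, subprocess

    def mutations():
        # Placeholder: yield paths to one-line patch files, however generated.
        yield from sorted(pathlib.Path("mutations").glob("*.patch"))

    seen = set()        # hashes of optimized binaries already tested
    survivors = []      # mutants that compiled, were novel, and passed the tests

    for patch in mutations():                                      # 1. apply one small change
        subprocess.run(["git", "apply", str(patch)], check=True)
        try:
            if subprocess.run(["make", "-j8"]).returncode != 0:    # 2. doesn't compile: skip
                continue
            h = hashlib.sha256(pathlib.Path("out/prog").read_bytes()).hexdigest()
            if h in seen:                                          # 3. optimizer produced a binary we've already tested
                continue
            seen.add(h)
            if subprocess.run(["make", "check"]).returncode == 0:  # 4. tests still pass: a surviving mutant
                survivors.append(patch)
        finally:
            subprocess.run(["git", "apply", "-R", str(patch)], check=True)  # undo, then 5. goto 1

    for patch in survivors:
        print("tests did not catch:", patch)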

Then I go back through the diffs and toss the ones that obviously have no meaningful effect, and lob the remaining diffs over to other contributors to figure out if they're false positives or to improve the tests.


This automated approach reminds me of fuzzing.


"property testing" is closer to fuzzing: it runs a parametric test with randomly generated inputs (following an input specification). Usefully, good property testing systems will try to reduce the input when they find a failure (simplify possibly extremely complex input to try and minimise them to the smallest set of operations triggering them).

Mutation testing alters the system under test to identify under-specified (under-tested) portions, rather than explore the input space.


The system under test could be considered as an input to the specification / test suite. Mutation testing could then be viewed as fuzzing of the test suite, with every pass result after a mutation being considered a failure of the test suite.


And with the property-based approach, test inputs can be generated that automatically increase coverage or that kill mutants.

The bane of mutation testing is equivalent mutants: mutations that aren't actually bugs, because they leave the program correct.


> The bane of mutation testing is equivalent mutants: mutations that aren't actually bugs, because they leave the program correct.

That's one reason I check the hash of the optimized binaries-- often the compiler manages to convert different but equivalent code into the same binary. It's still not enough, but it's an easy filter to apply with no false exclusions.


Good idea.


Arguably fuzzing is property testing with a constrained set of properties, although in practice the communities differ.


I don't think fuzzing usually has a concept of automatic shrinking.


I think shrinking is very important, but I guess I don't see it as part of the definition. Both sides have some particulars of their approaches that aren't well reflected on the other side, but I still say it's reasonable to think of them as doing fundamentally the same thing.


Fuzz-testing can be amazingly useful at finding bugs, crashes, and security issues.

I'm looking forward to seeing more people use it, once it has been added to the golang testing tools:

https://go.googlesource.com/proposal/+/master/design/draft-f...

In several of my projects I thought I had good testing, but still found issues within 20 minutes of fuzzing. Wonderful to be able to find them so easily.


That's pretty much it, but it fuzzes the code rather than data.


re: absence of C++ tools

Have you looked at Mull[1] or Dextool[2]?

[1] https://github.com/mull-project/mull

[2] https://github.com/joakim-brannstrom/dextool/tree/master/plu...


Also, it's possible to mutate binaries.


I agree with the author's philosophy, but the approach described only gives you confidence at the time the code is written/tested. If someone changes adjacent code, you can no longer assume that your manual mutation testing is still valid. At some point (either in age or size or complexity of the codebase) manual mutation testing is going to decrease in effectiveness until the ROI of doing it is hard to justify. Automation is really key.

There are lots of great tools that help with mutation testing, though they can be expensive to run (depending on how they work and how many tests you have). In a past life, I wrote my own mutation testing library which ran the tests after each mutation and generated a "reverse code coverage" report: essentially a report of which lines/functions/statements/etc. did not cause the tests to fail when mutated. Where code coverage should ideally approach 100%, reverse code coverage should be near 0%. If you take the intersection of a coverage report with a reverse coverage report, you can easily find code that is executed, but whose behavior is not checked.
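
As a rough illustration of that intersection idea (the report file format here is made up, one "file:line" entry per line, not that library's actual output):

    # Hypothetical post-processing: lines executed by the tests (coverage)
    # whose mutations never made any test fail (reverse coverage).
    def read_report(path):
        # Made-up format: one "source_file:line_number" entry per line.
        with open(path) as f:
            return {line.strip() for line in f if line.strip()}

    covered = read_report("coverage.txt")            # ideally ~100% of the code
    unkilled = read_report("reverse_coverage.txt")   # ideally ~0% of the code

    # Executed but never actually verified: good candidates for stronger assertions.
    for loc in sorted(covered & unkilled):
        print(loc)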


Relevant: https://research.google/pubs/pub46584/

From the abstract:

> We focus on a code-review based approach and consider the effects of surfacing mutation results on developer attention. The described system is used by 6,000 engineers in Google on all code changes they author or review, affecting in total more than 14,000 code authors as part of the mandatory code review process.


What many hardcore testing advocates don't want to accept is their tests will always be inadequate. I've found that bringing up mutation testing or bebugging tends to draw dirty looks from such people.

The last 100% test coverage advocate I mentioned it to said it would be a waste of developer effort. I assume they feel that effort would be better spent writing more tests.


For any Python users, there's a library that automates mutation testing by parsing the AST: https://github.com/EvanKepner/mutatest


And for property-based testing there's Hypothesis too: https://hypothesis.readthedocs.io/en/latest/


There's mutmut (I'm the maintainer), cosmic-ray, and mutpy too. In fact those are the established players. I had never heard of mutatest before! I will have to try it.


Oh cool, I'd never heard of those, funny enough. I'll have to look into those myself!


While it sounds like a “good idea” mentally, it also seems completely unrealistic and impractical.

Basically, what this is is writing tests for your tests. And because the input of these tests is functions, you need to be able to generate functions. That's nice, but it's a pain considering the only solution proposed is “just do it manually”, which is neither exhaustive nor trustworthy.

Also, every single one of the author’s examples is caught by an actually good testing tool like QuickCheck.

https://en.wikipedia.org/wiki/QuickCheck


> Also, every single one of the author’s examples is caught by an actually good testing tool like QuickCheck.

I'd be interested to see what a sufficiently strong QuickCheck specification of this problem would look like. I've used it a bit in the past, but not enough that I could reliably get it to produce all the interesting failure modes and know the expected result for each case.


> I'd be interested to see what a sufficiently strong QuickCheck specification of this problem would look like.

I would write something like this in haskell:

    spec :: [Integer] -> Property
    spec xs =
       length xs <= 2  ==>  fun (intercalate "," (map show xs)) == sum xs
This captures the three requirements, but not the implicit fourth requirement that the function throws an exception for other inputs.


Nor does this exercise the trimming of the substrings, for example. This is good for testing the happiest path, I agree. I was interested in the tedious testing of all the unhappy paths.


> not the implicit fourth requirement that the function throws an exception for other inputs.

You could probably generate invalid inputs by taking a list of strings as input. Though of course at that point the property test has to reimplement half the function.

That's an issue I often end up having with property tests: the oracle for interesting properties is as complex as the SUT, so you end up writing two of them.
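
For what it's worth, a rough Python/Hypothesis version of both properties might look something like this (`fun` here is a toy stand-in for the hypothetical function under test from the spec upthread, and the ValueError is an assumed failure mode):

    # Sketch only: `fun` is a toy stand-in for the SUT (sums a comma-separated
    # string of at most two integers); ValueError is an assumed failure mode.
    from hypothesis import given, strategies as st

    def fun(s: str) -> int:
        parts = [p.strip() for p in s.split(",")] if s else []
        if len(parts) > 2:
            raise ValueError("too many numbers")
        return sum(int(p) for p in parts)

    @given(st.lists(st.integers(), max_size=2))
    def test_valid_input(xs):
        assert fun(",".join(str(x) for x in xs)) == sum(xs)

    @given(st.lists(st.integers(), min_size=3))
    def test_too_many_numbers_rejected(xs):
        try:
            fun(",".join(str(x) for x in xs))
            assert False, "expected an exception"
        except ValueError:
            pass

And generating fully arbitrary invalid strings is exactly where the oracle starts to mirror the implementation, as you say.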


I'm a big fan of mutation testing and I've converted a few other devs I've worked with to it. I use a tool called Stryker Mutator in C# and JavaScript/TypeScript to automate the bug-injection. It adds a little bit of overhead - a small-ish TS project where our normal test suite runs in 2 minutes or so takes about 20-25 with Stryker - but it has definitely found things that our tests weren't really covering as well as they should have before.


This approach, like regular TDD, is upper bounded by the imagination of the tester. You would only catch bugs you can invent.


Ammann and Offutt's 'Introduction to Software Testing' [0] describes model driven test design and criteria-based testing (e.g. input domain) as approaches to being thorough about testing.

[0] https://cs.gmu.edu/~offutt/softwaretest/


I don’t really get what you are saying. Are you talking about automated tools as a better option?

Edit: To elaborate a bit, programming is naturally limited by the programmer's imagination, too. But what alternative is offered is a much more interesting topic than just disregarding something because it isn't a perfect solution.


We have a job that runs https://pitest.org/; we analyse the report and tweak the codebase as per the results. Not sure it's ever found a bug that's likely to happen in prod, but it definitely gives us a confidence boost.


We use this as well. It caught some lurking bugs when we first turned it on. Since then it catches things before merging, so it is harder to keep metrics. It did just point out a bug in a PR of mine the other day so it’s at least doing something :)


I have also found that most TDD practitioners don't even know about generative tests (QuickCheck and similar), when these kinds of tests - when well written - can catch much more subtle bugs than unit tests. Also, there comes a point where you should invest effort in monitoring rather than testing.

Testing with mutations is certainly interesting but I never had the opportunity to try it.


> don't even know [...] QuickCheck and similars

True, but in my experience this kind of testing tool only complements testing. It should never be used to replace proper manually written tests, as it is probability based, and as long as the input domain is large enough it's quite possible to miss very obvious bugs, not just in one run but repeatedly.

Though if you are under time pressure it can be a good idea to replace writing some relatively unimportant tests with writing generative tests. Just only start doing so after you cover the most important parts with your manual tests.

It's kinda sad, but for a lot of applications, making the main features work right in their main use-case, and bringing out more main features, is more important than making all features always right but having fewer of them. In the end an imperfect but reasonably well-working product (especially if all the small demo cases work) sells better than a perfect but very constrained product.


What I dislike about heavy test processes is that you basically might be writing b*lshit until your product owner validates the implementation.


> On the one hand, [TDD is] too strict. Insisting on writing tests first often gets in the way of the exploratory work

Who are the proponents of TDD that promote 100% adherence even when it's not a good match for the situation? I keep coming across this claim, but it's not how I learned TDD, and I wonder if it's a straw man.


> Who are the proponents of TDD that promote 100% adherence even when it's not a good match for the situation?

While nobody proposes TDD "when it's not a good match", plenty of people overestimate the cases where they think it's a good match.

Plenty of TDD proponents believe you shouldn't write a single line of code without a test first. I've met them, and they believe it's a good match for the situation.

It's not helped by notorious failures like the infamous sudoku puzzle debacle -- a case where it was evidently not suitable, yet Jeffries went ahead and tried it anyway (and failed, predictably). The conclusion that TDD was not suitable for this kind of algorithmic exploratory development was somehow never reached...


I dealt with this a bunch. I think it’s a natural tendency of humans to hear a new idea and consider a simpler less nuanced version of it before they fully grasp it.

As an example: I was taught about object-calisthenics[1] in school. The lecturer presented a straw man version of it, and I thought of it like that for a while. What’s worse, when I later decided it was a good exercise and explained it to others, I noticed I always had to reiterate multiple times that this set of rules is meant as an exercise in exploration, not a set of hard rules for whatever piece of code you are going to write next.

TDD is just so easy to turn into a straw man, and I think it has so many hardcore fans that it becomes even harder to present the nuanced form of it.

[1] https://williamdurand.fr/2013/06/03/object-calisthenics/


Robert Martin is the dude that said the only acceptable target is 100%. Of course he also said you shouldn't plan to actually hit that target, just get as close as is possible without having to test things that don't matter like frameworks.

Kent Beck recently went through and clarified TDD and he definitely doesn't advocate 100%. https://youtube.com/playlist?list=PLlmVY7qtgT_lkbrk9iZNizp97...


Honestly, depending on how you interpret TDD it might be too strict, but things which I believe are never too strict and always a good idea are:

- Writing tests first.

- Having a write test => write code loop.

But there are some people who are pedantic about how small the loop needs to be, or who insist that TDD excludes doing any planning about aspects like how you will likely structure code at the larger scale and similar (i.e. software architecture).

At the same time there are tool/library/framework/language combinations which make testing in general really hard and can play really badly with TDD. While I believe such tools should be avoided, there are situations in which you cannot do so, and in which, furthermore, due to e.g. time constraints you are simply not able to do any proper testing, including TDD. It's kind of a nightmare situation, but it does happen.

EDIT: Yes I realized I responded to the wrong comment :(


Depends what the exploratory work is.

If you know what result you want, it's easy to apply: assert on the output.

If you don't know what you want, you can't apply it, because you can't write the asserts.

So you can do exploratory implementation as long as you know the result you're looking for. But I'd also argue you want to have an idea about what you want before you start writing code. So I don't find that many places where it doesn't apply.

One of the few ones where it doesn't apply is tweaking the UI for looks.


I might be wrong, but I'm fairly certain I've seen Uncle Bob make statements along the lines of "there are vanishingly small scenarios in which 100% code coverage is not applicable" (heavy paraphrasing). Not arguing for or against, just making that observation. I could be misremembering.


100% code coverage and 100% adherence to TDD are two different things.


Every senior engineer who joined a startup before me. It can take a while to disabuse them of this.


When teaching TDD a strict adherence often is appropriate to help learners develop good habits. But like many such rules experts get to break it, because they have the expertise to know when it's appropriate to do so.


Ehh.. some people who wanna make it their edge in the swe community.


I’ve used the infection PHP library (https://github.com/infection/infection) in an API SDK that I maintain.

My experiences were very similar to the author’s when I first started using it. Even though my test coverage was near 100%, the mutations introduced revealed that in large part my tests were fallible due to assumptions I’d made when writing them.

I’ve incorporated mutation testing as the final step in my CI workflow as a test for my tests. It’s a fair bit of work the first time it’s run (especially with larger libraries), but in my opinion vital as a pairing with tests.


Can't you do this with expectations of failure, especially if your functions under test aren't total functions?

I really like encoding failure types into the type signature of my code (assuming types).

    def a(x: str) -> Union[str, AnErrorType]:

That way, it is clear that your code may fail, and so you must test the unhappy path as well. Limits the places you might be able to throw and read/write input, and requires an effect type, but keeping functions total allows for fewer bugs and fewer places to test horrible things happening.
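
Fleshed out a little (a made-up example of the idea, not anything from the article):

    # Made-up example: the failure is part of the return type, so the unhappy
    # path is hard to forget when writing tests.
    from dataclasses import dataclass
    from typing import Union

    @dataclass
    class ParseError:
        message: str

    def parse_port(x: str) -> Union[int, ParseError]:
        if not x.isdigit():
            return ParseError(f"not a number: {x!r}")
        port = int(x)
        if not 0 < port < 65536:
            return ParseError(f"out of range: {port}")
        return port

    def test_happy_path():
        assert parse_port("8080") == 8080

    def test_unhappy_path():
        assert isinstance(parse_port("http"), ParseError)
        assert isinstance(parse_port("70000"), ParseError)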


It’s not a bad practice; however, it suffers from the same “low-hanging fruit” issue that affects TDD. It relies on the developers being able to predict faults.

In many cases, this isn’t too difficult, but most bugs I encounter are ones that I never would have considered; no matter how much thought I devoted to the matter.

In my experience, there’s just no way to predict all the bugs, and it isn’t helpful to ever assume that my test suites have full coverage, even if I use a scientific approach to writing tests.

The author mentioned one aspect of TDD that has always bothered me: that I can’t “explore.” My basic design philosophy is “Pave The Bare Spots”[0]. This means that I develop the design as I develop the code[1].

Rigid philosophies like TDD spike this methodology. What I tend to do, is rely on test harnesses, as opposed to unit tests[2].

In any case, I definitely support any effort to improve fundamental software quality. I feel as if the classic “rush to MVP at any cost” approach results in enormous tech debt that never gets repaid.

Obligatory XKCD: https://xkcd.com/2030/

[0] https://littlegreenviper.com/miscellany/the-road-most-travel...

[1] https://littlegreenviper.com/miscellany/evolutionary-design-...

[2] https://littlegreenviper.com/miscellany/testing-harness-vs-u...


> there’s just no way to predict all the bugs

This is the Nirvana Fallacy, "if it isn't perfect, it's useless". Fine, you can't predict all the bugs. But you can predict some, perhaps many. That is strictly more robust than not predicting at all.

> The author mentioned one aspect of TDD that has always bothered me: that I can’t “explore.” ... Rigid philosophies like TDD spike this methodology.

"Spike" is the word TDD weirdoes like me use to say we are exploring. TDD requires enough knowledge to specify the problem in a way that drives out code. When you don't have that knowledge, you explore first.

I've had codebases where I wrote and discarded untested code multiple times in order to understand what design I needed. Once I began to grok the problem, I backed out and then test-drove my way back in.


Thanks. I wouldn't call TDD folks "weirdos." I 100% support the goals of TDD, and I think that it's a great discipline.

But as I explained in [2], I usually (not all the time) tend to write unit tests after the fact. This is something that seems to get TDD folks all hot and bothered.

My development testing is usually done with test harnesses. You can call it "spike," or whatever.

Actually, as I write this, I am taking a break from some fairly significant refactoring of a backend server that I wrote a couple of years ago[0]. It's a layered system, with each layer having a standalone product lifecycle and integrated tests.

I have a plan for the feature that I'm adding, but not a full project timeline. I have already encountered a couple of places where I deviated from my plan, and I'm barely getting started.

At this layer, the tests are more like test harnesses than complete unit tests. By the time I get to BASALT (the top layer), the tests are pretty much complete unit tests, examining and reporting on results. At this level, my tests basically output runtime data beneath an explanation of what we want to see in that data, which means I need to spend quality time reviewing the output. By the time I get to BASALT, I can just scan the reports, looking for red and green; which is good, because I run thousands of tests by then. At this point, I'll be running fewer than a hundred tests.

So I guess all my tests are "spike" tests.

[0] https://riftvalleysoftware.com/work/open-source-projects/#ba...


I did a poor job of explaining myself. By "spike" I mean exploratory coding without test-driving. Sometimes you don't know enough about the problem or the solution spaces to test-drive effectively. That means hacking around for a while to get your bearings.

Test harnesses are really a different thing and I think they're always a good idea. I like to test "outside-in", I think it's good at preventing low-level assumptions from upsetting the top level. But again, it depends on the code and context.

As an aside, small studies of test-before vs test-after show that in terms of bug yield, there's no major difference. Over the long term I think they diverge. That means that the magic of TDD isn't that it causes you to write better code than writing tests afterwards. It's that it forces you to write tests at all. TDD is eating your vegetables first.


> TDD is eating your vegetables first.

I like that philosophy.

I spent many years, at an insanely quality-driven company. I wanted to strangle the QA folks on many occasions, but they trained me to not accept crap.

According to a lot of people, that disqualifies me from working at startups.

I'll have to let the folks at the startup I'm working with know that. They'll need to shop around for a slob that will work for free.


Running full mutation testing may be overkill in many situations like people have mentioned here, but the idea is still useful during normal development and review. If you find a problem in the tests, submit the mutation that proves the tests still pass with it applied.


Used to be called bebugging


More like bugging in this case.

The idea here is to deliberately inject bugs to see if your tests catch them.


I didn’t know this had a name! I always thought this was just “testing done right”. :)


At this point I would advocate against mutation driven test and just go for property based testing.

Much less manual work and with more stable results.


The two are complementary. In particular, PBT can be used to generate new minimized inputs that kill mutants, without the need for a test oracle to say what the code should be doing.


This doesn't really sound like a new coding methodology, it's just describing a way to write good tests. TDD gets a special label because it flips the normal coding and testing process.


That's because TDD is primarily a design technique, not a test technique.

Tests are a (very) nice additional benefit, particularly because it has a simple way of ensuring coverage: you're only allowed to write production code if there is a failing test case, and only enough to make it pass.


I hate it as a design technique. It encourages you to write code you don't understand just so that it passes the tests.

For example: you want a function to test if a number is prime. You write the tests: 4 is not prime, 5 is prime, 6 is not, 7 is prime, 8 and 9 are not. Now you write a function: n is prime if it is odd. It fails the tests, because it says that 9 is prime. You "fix" the code by checking that the number is not divisible by 3, it passes the tests, done.
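
To make that concrete, here's a sketch of the failure mode being described (not anyone's real code):

    # Passes all six example cases above, yet is not a primality test at all.
    def is_prime(n):
        return n % 2 != 0 and n % 3 != 0

    assert not is_prime(4)
    assert is_prime(5)
    assert not is_prime(6)
    assert is_prime(7)
    assert not is_prime(8)
    assert not is_prime(9)
    # ...but is_prime(25) is True, and is_prime(2) and is_prime(3) are False.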

Instead of writing code to solve a problem, you write tests to model the problem then you write code to pass the tests, there is an extra level of indirection and something may get lost along the way.

The only significant advantage of TDD I can think of is that you can't "forget" to write tests.


Very cool!

In fact, you are almost doing TDD, but missing the final step: Refactor Mercilessly

   http://c2.com/xp/RefactorMercilessly.html
TDD directly helps you with interface design, that is how the pieces fit together. It helps you indirectly with the design of your implementations, because it enables this merciless refactoring.

"you write tests to model the problem then you write code to pass the tests"

Exactly! And once you have the code that passes the tests, and have written enough of it to have somewhat decent coverage for the problem-space, you can then refactor the implementation.

That is where you can go wild with your ideas, because you are only making changes to code that is already working and protected by a test-suite.

   https://stackoverflow.com/questions/78806/refactor-mercilessly-or-build-one-to-throw-away#78844

For me this splitting of the very difficult programming tasks into 3 distinct and much simpler steps is the beauty of TDD:

1. Figure out the problem (write a spec as executable tests)

2. Make it work (just write trivial code to get the tests green)

3. Refactor (go wild with your design ideas, make sure tests stay green)


> design technique

IMHO this isn't quite right; in my experience it's a programming technique, i.e. a way to approach programming applications. But it's not a technique for designing applications.

Sure, sometimes you can "just write" an application without caring about design. But in many, many other cases just doing TDD without any proper design will leave you destined to fail to deliver a well-working product. (At least in the field of larger applications, which often need interoperability with all kinds of other applications.)


It is a design technique, that is, one technique to help you design programs. It is not the only technique or a fully comprehensive and brainless methodology that you just plug into and out pop beautifully designed programs.

You still need to think, for example. And you can't skip the "refactoring" step, because that is where you get to do much of the design that TDD helps you with.


I'm a big fan of TDD and I personally get very little value out of the tests I write. However, they generate considerably more value when other engineers start working on the code. Also, when I haven't looked at a code base for a sufficiently long time I'm effectively no longer the engineer who wrote the tests. In both of these cases they become wonderful because, by design, TDD tests capture intent.

I am a little leery of yet another buzzword though. Reading the article just made me think yeah, you should rigorously think about what you're actually specifying, whether it's by formal or informal reasoning. I would actually claim the code the author picked on was in fact correct by definition since it satisfied its tests. It just didn't do what the article author figured it should do. The "mutation" is just a relatively cumbersome way to think through strengthening the specification. It's not clear to me that it's more productive to identify gaps in an implicit specification by breaking one's implementation rather than strengthening the tests, but I wouldn't be surprised if I have done so in some cases where for whatever reason it struck me as more pragmatic.


> get very little value out of the tests

I have a bad tendency to overcomplicate things, because I often see many of the potential future complications and other aspects of the larger picture which I really should ignore at that point in the writing.

In turn, a TDD programming approach gives me a nice degree of value at the moment of writing. The larger and more complicated the resulting program will be, the more value I get out of it.

> think through strengthening the specification.

Having a weak/incorrect specification is a very common problem on larger projects, especially if the projects are written by one company for a different company (and just them).

Though mutation testing your code won't help there, similar practices can help in finding gaps and problems with a non-code specification, too.



