Automatic bug-repair system fixes 10 times as many errors as its predecessors (news.mit.edu)
137 points by jonbaer on Jan 30, 2016 | 54 comments



Link to the paper mentioned in the article: http://people.csail.mit.edu/rinard/paper/popl16.pdf

Quick summary: they trained on apr, curl, httpd, libtiff, php, python, svn, and wireshark.

It basically trains a machine learning algorithm on the abstract syntax trees (ASTs) of the code before and after successful patches. It then applies the learned model's predictions to real-world defects and sees what comes out.
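
To make the idea concrete, here's a toy sketch of the learn-from-patches-then-score workflow (my own illustration, not the paper's actual feature set or pipeline; the training pairs are made up):

    import ast
    from collections import Counter
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    def node_counts(src):
        # Bag of AST node types: a deliberately crude stand-in for real features
        return Counter(type(n).__name__ for n in ast.walk(ast.parse(src)))

    # Tiny made-up (buggy, fixed) pairs; label 1 = "looks like a correct fix"
    pairs = [("x = a / b", "x = a / b if b else 0"),
             ("y = lst[i]", "y = lst[i] if i < len(lst) else None")]
    vec = DictVectorizer()
    X = vec.fit_transform([node_counts(fixed) for _, fixed in pairs] +
                          [node_counts(buggy) for buggy, _ in pairs])
    y = [1] * len(pairs) + [0] * len(pairs)
    model = LogisticRegression().fit(X, y)

    # Score a candidate patch for some new defect
    candidate = "z = d / e if e else 0"
    print(model.predict_proba(vec.transform([node_counts(candidate)]))[0][1])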


Oh man. I'm already behind on reviewing the Pull Requests I get from human contributors. :(


This is undoubtedly cool research, but how useful is the automatic generation of a fix? In practice, finding the bug is by far the most important step. Once you've identified the problematic code, it's normally trivial to fix it.


That makes me think of the amount of satisfaction one will receive pushing MIT's autofix-it button as opposed to those seconds or minutes when you've caught the fucker and go in for the final fix.

Those moments of updating the files: That little euphoria of finding the fix but needing to focus to get it done correctly and finally hitting enter for the last time...ahhh.

I'll take that instead.


Yeah, but sometimes that is preceded by unfamiliar or unwanted behavior in a production system and years being taken off one's life.


Well said, amen!


I'd be interested in a tool that simply fixes the bugs found with static code analysis tools like findbugs. Some of them aren't very meaningful, like dead code... but if it could make good choices about how to handle slightly more complex situations like NPEs and that sort of thing, it could make a lot of large corporate code bases a lot less buggy.


IDEA does some of this.


If it can identify a bug and fix it, then it's fully automatic. If it's good enough, we could eventually give it the password to our source control systems and let it do its thing unsupervised. As a funny thought, I imagine a future virus that gains write access to all github projects but instead of deleting everything it just fixes all the bugs.


A better question is how many bugs does it find by inspection. I could easily see this as a step in a code review where it has the chance to flag potentially buggy code prior to committing it.


That's purely static analysis [1], a well-developed field with many existing tools across many languages. [2] One company, Coverity, was purchased for $350 million. [3]

[1] https://en.wikipedia.org/wiki/Static_program_analysis

[2] https://en.wikipedia.org/wiki/List_of_tools_for_static_code_...

[3] https://en.wikipedia.org/wiki/Coverity


What immediately comes to mind is that trivial repairs may soon be done by "Prophet", or its progeny, instead of a human...sobering...


Unfortunately, any tool that auto-fixes code ends up being something people treat as normal to rely on. Yesterday it was "compilation warnings? the code will still work". Today it's "semicolons? the runtime can put them where needed". Tomorrow it will be "code bugs? Prophet will fix it".


Progeny is the key word in my reply...this sounds as if a move towards a more complex "auto-correction" is afoot...


/* TODO: Prophet, you know the drill... I want some PID control here, but I am busy chatting up the secretary. Good boy. Also, Rick from Recruiting was fired today, with no replacement... */


I wonder sometimes what will cause all us thousands of software-engineers to become obsolete.

Not that long ago, we'd be a massive workforce of engineers, armed with sliderules, calculating away.

http://s7.computerhistory.org/is/image/CHM/500004055-05-01?$...

In 100 years, we'll be the people in that poster, but what will replace us? Smart software? Quantum computing?


Aren't we those people in the poster? Didn't they become us?


Did you operate a slide rule in a previous job?


You must have missed the part of that advertisement which reads "No longer must valuable engineering personnel ... spend priceless creative time at routine figuring".

They're not obsolete at all, they're just doing better things with their time.


The full paper can be found on Martin Rinard's website [1].

[1] http://people.csail.mit.edu/rinard/paper/


After reading about the quality of the patches produced by these in "Is the Cure Worse Than the Disease? Overfitting in Automated Program Repair" (https://www.cs.umd.edu/~tedks/papers/2015-fse-cure-or-diseas...), I'm not so interested. One has to have a very awesome test suite to not risk that overfitting.


One thing I don't get from the article is whether it can really catch my logical bugs. I'm not trying to bring this research down, because I don't doubt the results, but I am curious, and a bit skeptical, whether this is a true step forward in catching "real" bugs or just surface ones. Even a sophisticated algorithm such as this one, which takes into account features such as variable scope and the operations performed on variables, and is trained on real examples of bug patches from open source projects, can it really understand my project? My logical bugs are specific to my project. It kind of reminds me of a recommendation IntelliJ once gave me for the following code:

return {left: am.left, top: am.right};

"WARNING: 'right' should probably not be assigned to 'top'"

This sounds like the kind of conclusion that such an algorithm might come to after seeing enough examples of similar bugs. It turns out that my code in this situation is actually correct, but in order to know this, you would have to have a thorough understanding of the rest of the program.


Of course it can't catch your logical bugs. Please see my comment below.


I read it and it's a great overview of the different kinds of bugs that happen in the real world. Certainly more examples than the article provided. But your comment seems to be focused more on how a good language with a proper type system can prevent bugs from happening in the first place, which is definitely a good point, but not what this article was getting at, I think.


I'm not sure what your question is. If I review that piece of code, I think there's probably a bug, the same one IntelliJ noticed. If it's not a bug, it deserves at least a comment on top. But if, to know it's not a bug, I need to have "a thorough understanding of the rest of the program", then the function it lives in is probably unreadable, in which case IntelliJ's warning is making your code better.


Uh, OK, let's say "a thorough understanding of the rest of the function" instead. Does my question make more sense? The reviewer (human or computer) needs to follow the logic of the function, and understand its true purpose, in order to know that the line which looks like a bug actually isn't one. I feel that most of my bugs are domain-specific logic bugs. My original comment wasn't so much a question as skepticism that such an algorithm can catch the hardest bugs, despite the improved metrics compared to other algorithms.


I wonder if this can be re-purposed to find zero-day exploits.


Not sure about the posted paper, but another recent work that comes to mind is from David Brumley's group at CMU:

http://security.ece.cmu.edu/aeg/

"Automatic Exploit Generation" which does basically this. (I haven't read enough of either paper to understand how similar the analyses are, but both seem to be based on symbolic execution.)


Probably not, however I assume it could be trained on zero-day exploit fixes, and automatically try to generate a fix when a new zero-day exploit is found by a human.

If you have a tool that can automatically find zero-day exploits, then you could try to couple the two and have a piece of software telling you: "Tool 1: I think I found this zero-day exploit; Tool 2 (the one in this paper): here is how I think you could fix it: ..."


Are there any languages that, by design, lend themselves to automated error detection? I feel like it's either not possible or so possible that Haskell is the answer.


Certain languages lend themselves better to static analysis than others. Some qualities (there are others) that simplify static analysis include:

* static (not necessarily explicit) typing

* no dynamic metaprogramming (static OK)

* no "eval"

* no first-class functions (second-class OK)

* no prototype-based OOP (class-based OK)

* no mutable variables (lexical shadowing OK)

* no mutable values, or very strict aliasing rules

* no unbounded recursion

* no dynamically defined structures

Of course you can perform useful static analysis with only some of these guidelines met. E.g. Erlang's Dialyzer does pretty well with dynamic typing, first-class functions, and unbounded recursion, because these features generally aren't abused in Erlang. (Though this took a hit due to the recent introduction of the `maps` feature, which was designed in such a way as to encourage violating that last guideline, despite proposals for a similar feature which would not have violated it.)

Surprisingly, C also meets all but two of these guidelines, and, although it is an egregious violator of both of those, it is somewhat amenable to static analysis (see Coverity).

JavaScript, on the other hand, is notoriously difficult to statically analyze, since it not only permits code to violate all the above guidelines, but it's common for code to violate all of them.
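
To make the contrast concrete, here's a toy illustration (Python used for convenience; not from any particular analyzer) of code a static analyzer can reason about versus code it generally cannot:

    def analyzable(x: int) -> int:
        return x + 1                      # call targets and types known statically

    def dynamic_dispatch(obj, method_name, payload):
        # First-class, runtime-chosen callable: the analyzer cannot know which
        # function runs here, or what it does to payload, without executing code.
        return getattr(obj, method_name)(payload)

    def opaque(snippet):
        return eval(snippet)              # arbitrary code; static analysis gives up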


Even though static analysis of C/C++ works reasonably well in Coverity (at least well enough to buy and use the product), there are enough false positives and corner cases to make the thought of "automatic" fixes unrealistic. It finds serious bugs, but it also gets tripped up on certain patterns unless you add modeling or embed pseudo-pragmas for it to process.


Yes I know. But parent was asking about detection, not correction (which the OP is about).

On that topic though, it is notable that Erlang's Dialyzer, unlike most static analysis tools and type systems, guarantees no false positives; it only reports errors via code paths it can prove are taken. This makes it very easy to integrate into existing development workflows.


What is wrong with first-class functions other than function-pointer indirect jumps? If that is the problem, then polymorphism (virtual functions) should also be in the list.


Pardon if this is a naive question, but are there languages designed with static analysis in mind?


There are, but they're not too widely used. Cyclone [1] and SPARK [2] are two examples. Of course many others may have been designed in such a way that they are easily analyzable, without that explicitly being a goal.

[1] https://en.wikipedia.org/wiki/Cyclone_(programming_language)

[2] https://en.wikipedia.org/wiki/SPARK_(programming_language)


Kotlin, designed by the IntelliJ people. Maybe Ceylon? I don't remember.


Thanks for that response. I didn't expect to get that much useful information.


I think it depends on what you mean by "error". Certainly Haskell is not a silver bullet here.

As a person who works at a large organisation focused on correctness, there are some crazy intersections between what "error" means, what programs do, and how humans work. In general, if you can clearly define what "error" means, you can mostly solve it with a sufficiently good language. That is not so easy.

The "automatic"[1] hammers we seem to have are good type systems and "Design By Contract" (think Eiffel, but having a strong proof assistant is a much more powerful case). In this case, I'm discussing the problem from a security point of view:

Some examples in order of difficulty:

* Process errors outside software; OK, these are usually not the programmer's fault. For example, Amazon account hijacking via customer service. No way to point-fix; DBC or typing won't help you.

* Software interaction with humans; I've seen regexes to disallow shipping (e.g. for identity documents) to P.O boxes, which allow the input "POSTAL BOX" rather than a variant of "PO BOX", "P.O BOX", "POSTAL BOX", "LOCKED BAG" or so on. The problem is that the data will be eventually interpreted by a human, who will correct the error and ensure delivery. AFAICT not fixable with DBC or typing - you need a business process improvement here.

* Ambient or environmental changes. The call you used to make to a third-party library or binary was secure, but now it isn't. For example, your app used RC4 or $RANDOM_BINARY, and it used to be secure, but now it isn't because your environment changed out from under you. Now your code needs a countermeasure or a fix.

* Subtle business logic errors such as missing a permissions check on private profile or data views - probably only fixable with proof or DBC, type system is not the right hammer here although maybe with effort you can sort of do it. Probably encoding rapidly changing business rules into your types is not a good idea though.

* Obvious business logic errors, e.g. allowing -1 books to be ordered in an online store; fixable with a good type system or DBC (e.g. type "Quantity" is a natural number rather than a machine type; see the sketch after this list).

* XSS or SQLI - should be fixed by type system alone in a sufficiently advanced language.

It's mostly the relatively trivial stuff at the bottom of this stack of possible errors which is fixable by good programming languages, assuming of course that your DBC logic or way you structure your typing is correct.
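
Picking up the "Quantity" example from the list above, here's a toy sketch (Python used for illustration; a real type system or DBC tool would enforce this statically or via contracts) of making "-1 books" unrepresentable:

    class Quantity:
        """A quantity of items: '-1 books' is unrepresentable."""
        def __init__(self, n: int):
            if not isinstance(n, int) or n < 1:
                raise ValueError("quantity must be a positive whole number")
            self.value = n

    def order_books(title: str, qty: Quantity) -> str:
        # Business logic only ever sees valid quantities
        return "ordered %d copies of %s" % (qty.value, title)

    print(order_books("SICP", Quantity(3)))   # fine
    order_books("SICP", Quantity(-1))         # rejected at the boundary: ValueError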

As a total aside, what would be really interesting to me would be a powerful "business process definition language", which could map data flow between systems and humans - and actually describe the real-world information flows when a user contacts customer support to reset their password. If that could interact with contracts or proofs in individual applications, my world would be rocked.

[1] Not requiring humans to check when changes happen, although obviously human effort is required during system definition.


There is a subset of Ada called SPARK that is intended to allow for at least some degree of automated formal verification. Which is not quite the same as automated error detection, but I would assume if your code passes the verification, the chance of bugs hiding in it is pretty much nil (assuming, of course, the verifier is not buggy itself).


Sure -- C is a great way to let programmers produce error-ridden code that compiles.


    there are indeed universal properties of 
    correct code that you can learn from one 
    set of applications and apply to another 
    set of applications
Pardon the speculative thought, but wouldn't a program that detects "universal properties of correct code" be equivalent to a program that detects whether another program will halt? Hence, impossible?


It's a machine learning based system, so it's probabilistic. You can certainly write programs that can detect when some other programs halt, just not universal ones. For example, it would be relatively trivial to look for a "while true:" loop that has no end conditions.
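
As a toy sketch of that trivial case (assuming Python source; this is my own illustration, not the paper's approach):

    import ast

    def obvious_infinite_loops(src):
        flagged = []
        for node in ast.walk(ast.parse(src)):
            if (isinstance(node, ast.While)
                    and isinstance(node.test, ast.Constant)
                    and node.test.value is True):
                # No break/return/raise anywhere inside the loop
                if not any(isinstance(n, (ast.Break, ast.Return, ast.Raise))
                           for n in ast.walk(node)):
                    flagged.append(node.lineno)
        return flagged

    print(obvious_infinite_loops("while True:\n    pass\n"))   # -> [1]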


The halting problem is about arbitrary programs, not a specific one. Be careful in how you apply knowledge.

    while True:
        pass
is trivial to detect. Now make slightly more complex versions of the same.


I think the difference is in one case you are talking about specific properties of specific programs, and in the other you are talking about a universal question about any program.


No, the two are not equivalent.

Running programs have state; that's the issue with the halting problem.


There is a small downside: the potential for learning less from your mistakes, especially when one hardly does any code review.


Anyone who has read this paper: how is their learning algorithm any different than a linear SVM?


It's either an SVM or multinomial logistic regression, but the bulk of the work is not the choice/design of model, it's structuring the data in a meaningful way.

The main issues in many interesting ML projects are 'how do you feed the data to the model' and 'how do you determine its success/failure'.

For instance: you could try to train a Pacman AI by feeding the game state into some model and asking for an up/down/left/right output. But a lower bound on the number of game states would be 2^(number of dots possible on the level), making it impractical or impossible to store such a model in memory, much less train it.

The strategy would be to encode the game state into a manageable number of features. This is a lossy process and the hope is that your set of features is small enough to be trained on, yet meaningful enough that the model can learn from them.

In the paper, they parse the data (code) into a syntax tree and compare it with the patched version. This identifies the 'point' in the tree where the patch modification takes place. The 'point', the 'type of modification', and a 'collection of neighboring points' are the features that are fed into the model.
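
As a rough illustration of that step (my own sketch using Python's ast module, not the paper's actual algorithm), finding the modified 'point' and its neighbours might look like:

    import ast

    def modification_point(before_src, after_src):
        # Naive sketch: find the first top-level statement whose AST differs
        # between the buggy and patched versions, plus its neighbours.
        before = ast.parse(before_src).body
        after = ast.parse(after_src).body
        for i, (b, a) in enumerate(zip(before, after)):
            if ast.dump(b) != ast.dump(a):
                neighbours = [type(n).__name__ for n in before[max(0, i - 1):i + 2]]
                return {"index": i,
                        "before_node": type(b).__name__,
                        "after_node": type(a).__name__,
                        "neighbours": neighbours}
        return None

    print(modification_point("x = a / b\ny = x + 1\n",
                             "x = a / b if b else 0\ny = x + 1\n"))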

tl;dr, yep, probably linear svm or multinomial logistic


I think low-hanging fruit would be a system that backs out commits and system updates until tests pass again. Has that been done?

That way you could at least keep a system running until you can fix it.

What do you guys think?


Essentially, this is git bisect.


To add to this 100% correct comment, if you haven't tried using git bisect, I highly recommend it. The only thing you need is an automated way of determining if a specific sha is "good" or "bad" based on the question you are trying to answer. Often that's "do the tests pass?" but (AFAIK) it can be any property of the state of the code.

A++, will use again. (err, hopefully not, but you know what I mean)
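
For the common "do the tests pass?" case, a minimal bisect script might look like this (assuming a pytest-based suite; the test command is a placeholder), run via `git bisect run python check.py`:

    import subprocess
    import sys

    result = subprocess.run(["python", "-m", "pytest", "-q"])
    # git bisect run treats exit 0 as "good", most non-zero codes as "bad",
    # and 125 as "skip this commit" (e.g. it doesn't build).
    sys.exit(0 if result.returncode == 0 else 1)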


A better and simpler way is to not allow code into the main branch that causes tests to fail.


Does it find incorrect locking bugs in multi-threaded code?



