quick summary:
They trained on apr, curl, httpd, libtiff, php, python, svn, and wireshark.
It basically trains the machine learning algorithm on the abstract syntax tree (AST) of the code before and after successful patches.
It then applies the machine learning predictions to real-world defects and sees what comes out of it.
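A rough sketch of the before/after idea, using Python's ast module purely as a stand-in (the paper works on C code and a much richer feature set, so this only shows the shape of the approach):

    import ast

    buggy = "def area(w, h):\n    return w + h\n"
    fixed = "def area(w, h):\n    return w * h\n"

    def node_summaries(src):
        # Flatten the AST into (node type, line number) pairs.
        return [(type(n).__name__, getattr(n, "lineno", None))
                for n in ast.walk(ast.parse(src))]

    before, after = node_summaries(buggy), node_summaries(fixed)
    # Nodes present only in the patched tree approximate the "patch site"
    # that a learned model would be trained on.
    changed = [n for n in after if n not in before]
    print(changed)  # -> [('Mult', None)]: the operator the patch swapped in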
This is undoubtedly cool research, but how useful is the automatic generation of a fix? In practice, finding the bug is by far the most important step. Once you've identified the problematic code, it's normally trivial to fix it.
That makes me think about how much satisfaction one would get from pushing MIT's autofix-it button, as opposed to those seconds or minutes when you've caught the fucker and go in for the final fix.
Those moments of updating the files: That little euphoria of finding the fix but needing to focus to get it done correctly and finally hitting enter for the last time...ahhh.
I'd be interested in a tool that simply fixes the bugs found by static code analysis tools like FindBugs. Some of them aren't very meaningful, like dead code... but if it could make good choices about how to handle slightly more complex situations, like NPEs and that sort of thing, it could make a lot of large corporate code bases a lot less buggy.
If it can identify a bug and fix it, then it's fully automatic. If it's good enough, we could eventually give it the password to our source control systems and let it do its thing unsupervised. As a funny thought, I imagine a future virus that gains write access to all GitHub projects but, instead of deleting everything, just fixes all the bugs.
A better question is how many bugs it finds by inspection. I could easily see this as a step in a code review, where it has the chance to flag potentially buggy code prior to committing it.
That's purely static analysis [1], a well-developed field with many existing tools across many languages. [2] One company, Coverity, was purchased for $350 million. [3]
Unfortunately, any tool that auto-fixes code seems to end up being relied on as if that were normal. Yesterday it was "compilation warnings? code will still work". Today it's "semicolons? the runtime can put them where needed". Tomorrow it will be "code bugs? Prophet will fix it".
/* TODO: Prophet - you know the drill.. i want some PID control here - but i am busy chatting up the secretary. Good Boy. Also Rick from Recruiting was fired today - with no replacement..
 */
you must have missed the part of that advertisement which reads "No longer must valuable engineering personnel ... spend priceless creative time at routine figuring"
They're not obsolete at all, they're just doing better things with their time.
After reading about the quality of the patches produced by these systems in "Is the Cure Worse Than the Disease? Overfitting in Automated Program Repair" (https://www.cs.umd.edu/~tedks/papers/2015-fse-cure-or-diseas...), I'm not so interested. You need a very good test suite to avoid that kind of overfitting.
One thing that I don't get from the article is whether it can really catch my logical bugs. I'm not trying to bring this research down, because I don't doubt the results, but I am curious, and a bit skeptical, about whether this is a true step forward in catching "real" bugs or just surface ones, even with an algorithm as sophisticated as this. It takes into account features such as variable scope and the operations performed on variables, and it is trained on real examples of bug patches from open source projects. But can it really understand my project? My logical bugs are specific to my project. It kind of reminds me of a recommendation IntelliJ once gave me for the following code:
return {left: am.left, top: am.right};
"WARNING: 'right' should probably not be assigned to 'top'"
This sounds like the kind of conclusion that such an algorithm might come to after seeing enough examples of similar bugs. It turns out that my code in this situation is actually correct, but in order to know this, you would have to have a thorough understanding of the rest of the program.
I read it, and it's a great overview of the different kinds of bugs that happen in the real world. Certainly more examples than the article provided. But your comment seems to be focused more on how a good language with a proper type system can prevent bugs from happening in the first place, which is definitely a good point, but not what this article was getting at, I think.
I'm not sure what your question is. If I review that piece of code, I think there's probably a bug, the same one IntelliJ noticed. If it's not a bug, it deserves at least a comment on top. But if, to know it's not a bug, I need "a thorough understanding of the rest of the program", then the function it's in is probably unreadable. In which case IntelliJ's warning is making your code better.
Uh, ok, let's say "a thorough understanding of the rest of the function" instead. Does my question make more sense? The reviewer (human or computer) needs to follow the logic of the function, and understand its true purpose, in order to know that the line which looks like a bug actually isn't one. I feel that most of my bugs are domain-specific logic bugs. My original comment wasn't so much a question as skepticism that such an algorithm can catch the hardest bugs, despite the improved metrics compared to other algorithms.
"Automatic Exploit Generation" which does basically this. (I haven't read enough of either paper to understand how similar the analyses are, but both seem to be based on symbolic execution.)
Probably not; however, I assume it could be trained on zero-day exploit fixes and automatically try to generate a fix when a new zero-day exploit is found by a human.
If you have a tool that can automatically find 0-day exploits, then you could try to couple the two and have a piece of software telling you: "Tool 1: I think I found this 0-day exploit; Tool 2 (the one in this paper): here is how I think you could fix it: ..."
Are there any languages that deliberately lend themselves to automated error detection? I feel like it's either not possible or so possible that Haskell is the answer.
Certain languages lend themselves better to static analysis than others. Some qualities (there are others) that simplify static analysis include:
* static (not necessarily explicit) typing
* no dynamic metaprogramming (static OK)
* no "eval"
* no first-class functions (second-class OK)
* no prototype-based OOP (class-based OK)
* no mutable variables (lexical shadowing OK)
* no mutable values, or very strict aliasing rules
* no unbounded recursion
* no dynamically defined structures
Of course you can perform useful static analysis with only some of these guidelines met. E.g. Erlang's Dialyzer does pretty well with dynamic typing, first-class functions, and unbounded recursion, because these features generally aren't abused in Erlang. (Though this took a hit recently with the introduction of the `maps` feature, which was designed in such a way as to encourage violating that last guideline, despite proposals for a similar feature which would not have violated it.)
Surprisingly, C also meets all but two of these guidelines, and, although it is an egregious violator of both of those, it is somewhat amenable to static analysis (see Coverity).
JavaScript, on the other hand, is notoriously difficult to statically analyze, since it not only permits code to violate all the above guidelines, but it's common for code to violate all of them.
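To make that concrete, here is a tiny sketch (in Python rather than JavaScript, just to keep all the examples here in one language) of the kind of dynamism an analyzer cannot see through:

    import random

    class Config:
        pass

    cfg = Config()

    # Dynamic metaprogramming: which attribute exists is only decided at
    # runtime, so a static analyzer cannot tell whether the read is safe.
    setattr(cfg, random.choice(["timeout", "retries"]), 5)
    print(getattr(cfg, "timeout", "undefined"))  # may or may not be set

    # eval: the code being executed is built at runtime, so its effects
    # are invisible to a purely static tool.
    op = random.choice(["+", "*"])
    print(eval("2 " + op + " 3"))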
Even though static analysis of C/C++ works reasonably well in Coverity (at least well enough to buy and use the product), there are enough false positives and corner cases to make the thought of "automatic" fixes unrealistic. It finds serious bugs, but it also gets tripped up on certain patterns unless you add modeling or embed pseudo-pragmas for it to process.
Yes I know. But parent was asking about detection, not correction (which the OP is about).
On that topic though, it is notable that Erlang's Dialyzer, unlike most static analysis tools and type systems, guarantees no false positives; it only reports errors via code paths it can prove are taken. This makes it very easy to integrate into existing development workflows.
What is wrong with first-class functions other than function-pointer indirect jumps? If that is the problem, then polymorphism (virtual functions) should also be in the list.
There are, but they're not too widely used. Cyclone [1] and SPARK [2] are two examples. Of course many others may have been designed in such a way that they are easily analyzable, without that explicitly being a goal.
I think it depends on what you mean by "error". Certainly Haskell is not a silver bullet here.
As a person who works at a large organisation focused on correctness, I see some crazy intersections between what "error" means, what programs do, and how humans work. In general, if you can clearly define what "error" means, you can mostly solve it with a sufficiently good language. That is not so easy.
The "automatic"[1] hammers we seem to have are good type systems and "Design By Contract" (think Eiffel, though having a strong proof assistant is a much more powerful version of the same idea). Here I'm discussing the problem from a security point of view:
Some examples in order of difficulty:
* Process errors outside software; OK, these are usually not the programmer's fault. For example, Amazon account hijacking via customer service. No way to point-fix; DBC or typing won't help you.
* Software interaction with humans; I've seen regexes meant to disallow shipping (e.g. of identity documents) to P.O. boxes that only block exact variants like "PO BOX", "P.O BOX", "POSTAL BOX", "LOCKED BAG" and so on, so a slightly different spelling sails through. The problem is that the data will eventually be interpreted by a human, who will correct the error and ensure delivery. AFAICT not fixable with DBC or typing - you need a business process improvement here.
* Ambient or environmental changes. The call you used to make to a third-party library or binary was secure, but now it isn't. For example, your app used RC4 or $RANDOM_BINARY, and it used to be secure, but now it isn't because your environment changed out from under you. Now your code needs a countermeasure or a fix.
* Subtle business logic errors, such as missing a permissions check on private profile or data views - probably only fixable with proof or DBC; a type system is not the right hammer here, although maybe with effort you can sort of do it. Encoding rapidly changing business rules into your types is probably not a good idea, though.
* Obvious business logic errors, e.g. allowing -1 books to be ordered in an online store; fixable with a good type system or DBC. (e.g. type "Quantity" is a natural number rather than a machine type).
* XSS or SQLI - should be fixed by type system alone in a sufficiently advanced language.
It's mostly the relatively trivial stuff at the bottom of this stack of possible errors that is fixable by good programming languages, assuming, of course, that your DBC logic or the way you structure your types is correct.
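As a small illustration of the "Quantity" point above, a minimal sketch with hypothetical names (note that Python enforces this at runtime, so it's closer to DBC than to a type-system guarantee):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Quantity:
        """A count of items: must be a non-negative whole number."""
        value: int

        def __post_init__(self):
            # Reject -1 books at the boundary instead of deep inside order logic.
            if not isinstance(self.value, int) or self.value < 0:
                raise ValueError(f"not a valid quantity: {self.value!r}")

    def order_books(qty: Quantity):
        print(f"ordering {qty.value} book(s)")

    order_books(Quantity(3))        # fine
    try:
        order_books(Quantity(-1))   # rejected before any order is placed
    except ValueError as err:
        print("rejected:", err)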
As a total aside, what would be really interesting to me would be a powerful "business process definition language", which could map data flow between systems and humans - and actually describe the real-world information flows when a user contacts customer support to reset their password. If that could interact with contracts or proofs in individual applications, my world would be rocked.
[1] Not requiring humans to check when changes happen, although obviously human effort is required during system definition.
There is a subset of Ada called SPARK that is intended to allow for at least some degree of automated formal verification. Which is not quite the same as automated error detection, but I would assume if your code passes the verification, the chance of bugs hiding in it is pretty much nil (assuming, of course, the verifier is not buggy itself).
"there are indeed universal properties of correct code that you can learn from one set of applications and apply to another set of applications"
Pardon the speculative thought, but wouldn't a program that detects "universal properties of correct code" be equivalent to a program that detects whether another program will halt? Hence, impossible?
It's a machine learning based system, so it's probabilistic. You can certainly write programs that can detect when some other programs halt, just not universal ones. For example, it would be relatively trivial to look for a "while true:" loop that has no end conditions.
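A hedged sketch of that kind of narrow check, using Python's ast module (it only catches the most blatant `while True:` loop with no break or return, which is exactly the point):

    import ast

    def obvious_infinite_loops(source):
        """Flag `while True:` loops whose body contains no break or return."""
        hits = []
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.While):
                always_true = (isinstance(node.test, ast.Constant)
                               and node.test.value is True)
                has_exit = any(isinstance(n, (ast.Break, ast.Return))
                               for n in ast.walk(node))
                if always_true and not has_exit:
                    hits.append(node.lineno)
        return hits

    print(obvious_infinite_loops("while True:\n    x = 1\n"))  # -> [1]
    print(obvious_infinite_loops("while True:\n    break\n"))  # -> []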
I think the difference is in one case you are talking about specific properties of specific programs, and in the other you are talking about a universal question about any program.
It's either an SVM or multinomial logistic regression, but the bulk of the work is not the choice or design of the model; it's structuring the data in a meaningful way.
The main issues in many interesting ML projects are 'how do you feed the data to the model' and 'how do you determine its success/failure'.
For instance: you could try to train a Pacman AI by feeding the game state into some model and asking for an up/down/left/right output. But a lower bound on the number of game states would be 2^(number of dots possible on the level), making it impractical or impossible to store such a model in memory, much less train it.
The strategy would be to encode the game state into a manageable number of features. This is a lossy process and the hope is that your set of features is small enough to be trained on, yet meaningful enough that the model can learn from them.
In the paper, they parse the data (code) into a syntax tree and compare it with the patched version. This identifies the 'point' in the tree where the patch modification takes place. The 'point', the 'type of modification', and a 'collection of neighboring points' are the features that are fed into the model.
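A loose sketch of that pipeline shape, with made-up integer features and toy data, and scikit-learn's LogisticRegression standing in for whichever model the paper actually uses:

    from sklearn.linear_model import LogisticRegression

    # Each candidate patch is reduced to a small fixed-length feature vector,
    # e.g. [modification kind, node kind at the patch point, neighbouring
    # node kinds, ...] encoded as integers. These rows are invented.
    X = [
        [0, 3, 1, 4],   # e.g. "insert if-guard before a pointer dereference"
        [1, 2, 2, 2],   # e.g. "change a binary operator"
        [0, 3, 1, 2],
        [2, 1, 4, 4],
    ]
    y = [1, 0, 1, 0]    # 1 = patch resembles historical correct fixes

    model = LogisticRegression().fit(X, y)

    # Rank new candidate patches by the model's probability of being a real fix.
    candidate = [[0, 3, 2, 4]]
    print(model.predict_proba(candidate)[0][1])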
tl;dr: yep, probably a linear SVM or multinomial logistic regression
To add to this 100% correct comment, if you haven't tried using git bisect, I highly recommend it. The only thing you need is an automated way of determining if a specific sha is "good" or "bad" based on the question you are trying to answer. Often that's "do the tests pass?" but (AFAIK) it can be any property of the state of the code.
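For example, you would run `git bisect start`, mark a known bad and a known good commit with `git bisect bad` and `git bisect good <sha>`, and then hand the search over to `git bisect run python check_state.py`, where check_state.py is any script (hypothetical name, and here assuming a pytest suite) whose exit code answers the "good or bad" question:

    # check_state.py -- exit 0 if this revision is "good", non-zero if "bad".
    # git bisect run uses the exit code to drive the binary search automatically.
    import subprocess
    import sys

    # Any checkable property of the working tree works; here it is simply
    # "do the tests pass?".
    result = subprocess.run(["python", "-m", "pytest", "-q"])
    sys.exit(result.returncode)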
A++, will use again. (err, hopefully not, but you know what I mean)