Scaling Static Analyses at Facebook

wallstprog · on July 29, 2019

I think one of the unstated problems with static analysis is just keeping track of the results. I know that when I started working with these tools, it was a huge PITA just dealing with the various output files.

That's why I created tools to convert the output from different tools into a common CSV format that can be databased and used to compare output from different tools, or from different versions of the code (e.g., after fixing errors reported by the tools).

These tools currently work with cppcheck, clang and PVS-Studio and can be found here: http://btorpey.github.io/blog/categories/static-analysis/

nh_99 · on July 29, 2019

Interesting approach. Where I work, we use Jenkins for collecting results. That way, for each build of our application we have a history of results for static analysis. Jenkins has good tools for storing and displaying this information, as well as the ability to show trends over time.

wallstprog · on July 29, 2019

If Jenkins works for you, great. It does seem to support both clang and cppcheck, although not PVS-Studio (which is one of the better tools out there in my experience).

Personally, I'm happier with plain old text files that can be manipulated with awk, grep, etc., can be databased if needed (since they're csv files) -- and can also be compared using my all-time favorite software, Beyond Compare. (http://btorpey.github.io/blog/2013/01/29/beyond-compare/).

nickpsecurity · on July 29, 2019

One of the things I like about this article is that it gives another example showing how formal methods catches deep errors unlikely to be caught with human review or testing:

"Overall, the error trace found by Infer has 61 steps, and the source of null, the call to X509 _ gmtime _ adj () goes five procedures deep and it eventually encounters a return of null at call-depth 4. "

I think the example Amazon gave for TLA+ was thirty-something steps. Most people's minds simply can't track 61 steps into software. Tests always have a coverage issue.

SanchoPanda · on July 29, 2019

> Zoncolan catches more SEVs than either manual security reviews or bug bounty reports. We measured that 43.3% of the severe security bugs are detected via Zoncolan. At press time, Zoncolan's "action rate" is above 80% and we observed about 11 "missed bugs."

>. For the server-side, we have over 100-million lines of Hack code, which Zoncolan can process in less than 30 minutes. Additionally, we have 10s of millions of both mobile (Android and Objective C) code and backend C++ code

> All codebases see thousands of code modifications each day and our tools run on each code change. For Zoncolan, this can amount to analyzing one trillion lines of code (LOC) per day.

11 "missed bugs" on the 100 mm server-side lines of code per run, or ever?

m0zg · on July 29, 2019

Also, the main issue with static analysis tools tends to be not false negatives, but false positives. That is, they churn out tons and tons of alerts that aren't actually bugs. Some such systems alert so much that they aren't worth using.

Matthias247 · on July 29, 2019

Yes, that's the main culprit with traditional static analysis. No one wants to review the results, because the amount of signal to noise is far too low. And also since it's an optional thing and not enforced by the compiler.

I think this is where languages with stronger inbuilt analysis (e.g. Rust) win: The results are better, and since the analysis is always running as part of a compiler pass there are no huge jumps in indicated bugs at once (like what would happen if one would run Coverity on a legacy C++ codebase).

apaprocki · on July 29, 2019

From experience on large codebases, get to -Wall -Wextra “clean” in both the latest versions of GCC and Clang and then tools like Coverity will produce much more useful results. The signal it provides to me at that point is exactly what it is meant to provide: mostly improper error handling analysis and N-level deep branches that result in a poor result due to an error or bad decision in another file that a human would not associate with the current call chain or think to look at. To be fair, the tools work much better when you know you have complicated pieces that you spend a little time writing correct models for (e.g. custom assertion/error handling, runtime supplied vtables, custom allocators, etc.).

pm215 · on July 29, 2019

Yes, I think the 'done by the compiler' part is really important for takeup. You can see a good example of a variant of this in the article:

> diff time [ie in the standard code-review workflow] deployment saw a 70% fix rate, where a more traditional "offline" or "batch" deployment (where bug lists are presented to engineers, outside their workflow) saw a 0% fix rate

That's the difference between "static analysis presented as part of the workflow a developer goes through anyway" and "static analysis presented after the fact". If you're in a position to enforce a code-review workflow that tools can hook into then "at code review time" works, but "at compile time" is better still since it shortens the feedback loop and ensures that everybody sees the issues while they're thinking about the code, even for smaller situations with more ad-hoc or nonexistent code review setups.

taeric · on July 29, 2019

This is less true of more advanced static analysis tools.

I mean, ultimately we agree. Most people don't trust static analysis tools because they have had bad experiences with them. I just suspect most people should try them again. The state of the art is quite good in that space.

muglug · on July 29, 2019

It sounds (from the article) like they have some sort of heuristic for determining potential severity, and they're ok with more false-positives in areas where the potential damage from a false-negative is very high.

m0zg · on July 29, 2019

I might be biased, but I've never seen these systems work well in practice. Some 15-17 years ago Microsoft depoloyed a system called PreFix which would find genuine, hard to find bugs, but then bury them under a mountain of false positives, so few teams ran it, and even fewer looked at the results. I like what LLVM did in this area. Its SCA is not very comprehensive (so it can't be relied upon for deep analysis), but when it does find something it's usually a legit issue.

But the balance of deep analysis and low false positives remains elusive. I'd be really stunned if FB really achieved a breakthrough in this area.

I do want to be wrong about this.

d7gibo · on July 29, 2019

The mountain of false positives isn't an issue if you run the static analysis tools from the start of the project's development.

seanmcdirmid · on July 29, 2019

It is as your code base becomes littered with annotations to suppress each historic false positive.

nickpsecurity · on July 30, 2019

You kind of explained it yourself saying 15-17 years ago. Lots of things can improve in that time-frame. PreFix wasn't even that focused on reducing false positives IIRC. Some today are focused on keeping false positives down. A few benchmarked are here in this also-biased article:

https://runtimeverification.com/match/1.0-SNAPSHOT/docs/benc...

I bring them up because they made the open-source K Framework and a C semantics. Another commenter says PVS-Studio is pretty good. Since Synopsis owns Coverity now, I'd recommend RV-Match (little to no false positives) followed by PVS-Studio.

m0zg · on July 31, 2019

Don't know about Coverity, but a customer did a Fortify scan on our code last year (as a part of acceptance), and it didn't find any issues. Which I find really hard to believe that there aren't any issues, seeing that the codebase is 200KLOC of pretty gnarly C++ with probably another 500KLOC in source-level C++ deps (we compile from source where we can). Either the team was comprised entirely of gods of C++ programming, or it doesn't really "fortify" all that much. I'll let you decide which one is more probable.

j88439h84 · on July 29, 2019

In python, Pylint and mypy find real bugs all the time, plenty of false positives but still very usable.

clarus · on July 29, 2019

Pylint and mypy are about syntax and type-checking. While I agree these kind of tools work well, if think that by static analysis people usually imply something which goes farther than that. For example, the languages Facebook is citing (C++, Java, ...) already include type-checking in the compiler.

j88439h84 · on July 29, 2019

Pylint catches lots of semantic bugs, far from just syntax.

muglug · on July 29, 2019

It's explained in the article:

> We also use the traditional security programs to measure missed bugs (that is, the vulnerabilities for which there is a Zoncolan category), but the tool failed to report them. To date, we have had about 11 missed bugs, some of them caused by a bug in the tool or incomplete modeling.

A missed bug is presumably one that the tool is designed to spot, but which it didn't during the period in which it has been running.

sanxiyn · on July 29, 2019

We should start to run Infer on all open source C and C++ code in existence.

edwcross · on July 29, 2019

Not only Infer, but other static analyzers would also be useful.

Hopefully Software Heritage (https://www.softwareheritage.org) will help with that.

apaprocki · on July 29, 2019

Coverity scans most open-source software you likely depend on: https://scan.coverity.com/projects

mhxion · on July 29, 2019

Is there something wrong with acm's load balancer or whatever? First managed to read to the end of the article, but to download the PDF showed "Oops! This website is under heavy load." Now article page is under heavy load too.

Edit: It worked again right after I posted this comment.

fullsend · on July 29, 2019

Always cool to read about scale.