Lessons from Building Static Analysis Tools at Google (acm.org)
208 points by DannyBee on April 30, 2018 | 40 comments



It's worth noting that in the .NET world, there is a useful extension point provided by the Roslyn compiler that can be used for shipping analyzers and code fixes using the standard packaging tools (NuGet). Static analysis distributed this way is automatically integrated into both IDE and CI workflows, as it's directly supported by the compiler.

This means that you can ship static analysis checks alongside your library. For example, suppose you have a unit testing library that expects to be used in a certain way that cannot be completely described using the type system -- say, it expects that test methods marked with its test attribute have to be public. Such a library can include an analyzer that checks usages of the library's types and reports warnings. Users of the library don't have to do a thing to register the analyzer; it just works.
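To make that constraint concrete, here is the analogous situation in Java/JUnit 4 terms (a hypothetical illustration, not the .NET code being described): the type checker accepts both methods below, but a check shipped with the testing library could flag the second one at review time instead of letting it fail only when the test runner executes.

    import org.junit.Test;
    import static org.junit.Assert.assertEquals;

    public class CalculatorTest {
        // Fine: public test method, the runner will pick it up.
        @Test
        public void addsTwoNumbers() {
            assertEquals(4, 2 + 2);
        }

        // Also compiles, but the JUnit 4 runner rejects non-public @Test
        // methods only at runtime. A library-shipped analyzer can report
        // this as a warning while the code is still being written.
        @Test
        void subtractsTwoNumbers() {
            assertEquals(0, 2 - 2);
        }
    }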


Adding to why static analysis tools are not used:

1. The false positives sometimes lead to ugly workarounds needed to quiet the message.

2. There's no standard, meaning different SA tools produce very different results, often contradictory.

A better approach is to improve the language. For example, `(a < b < c)` trips many static analyzers. In D, the grammar was changed so `a < b < c` is not valid. And the C/C++ committees should really deprecate `l` as a valid integer literal suffix :-)


Regarding 1., it's not just ugly workarounds: remember how Debian broke their OpenSSL PRNG for several years to silence warnings from analysis tools [1] – read the impact section and shiver if you haven't heard of this one before. There are real-world risks associated with spurious warnings.

[1]: https://github.com/g0tmi1k/debian-ssh


Well, that's not quite an accurate representation of the issue.

OpenSSL was trying to use uninitialized memory to seed the PRNG in one of the calls. The fix silenced the warnings by removing (almost) all seeding of the PRNG, instead of just the uninitialized memory seeding.

To his credit, the patch author did attempt to ask the OpenSSL developers if the patch was doing things correctly. However, the developers apparently didn't watch their own mailing list, and this didn't come out until after the problems were made public and the developers were laughing about what a n00b the Debian contributor was.


I hate to misrepresent facts, but after looking over what I wrote and reading what I linked for the first time in many years I don’t see any misrepresentation of the issue. Yes, what you write adds additional detail, but nowhere do I or what I link go after the Debian contributor; I was simply trying to make the point that spurious errors can lead us astray. Thank you for adding more nuance and do correct me if I messed up somewhere as I am unable to see it myself.


This sort of problem is probably why the article doesn't recommend compiler warnings as a useful intervention point: they can be ignored, but then continue to annoy future developers who aren't responsible for the mistake.


In Rust, static analysis ("lint") is part of the language, and every lint can be disabled on a per-function or per-module basis. So ugly workarounds are never required.


Anything that has a type system is using static analysis.


What happens when the style guide authors inevitably declare that no lint messages should be disabled, ever?


Well, that's self-inflicted damage, isn't it? You can move to another team with less brain-damaged style guide authors.


Style guides are usually per company or at least per department. Leaving them behind isn't usually in the cards without severe other consequences.

What I usually do is write a Python program that doesn't complain about style violations but just fixes them -- like autopep8, but with much more: correctly capitalizing function names and variables, making strings consistent, adding or taking away long-quotes, putting conditions in ifs, even renaming shadowed variables (I built that in last week, after what feels like the 100th time I hit a bug due to shadowing). Makes people wonder how I can possibly program so fast under such onerous rules.

On the downside: I've worked here for 6+ years now and still don't know (or care) about most of the style guide, and I frequently get confused when people ask questions "because you have written the most code, you must know, right? Capitalize functions or not? ... Ehm, well ...". Also, people ignore these things, so I have to run it on any file I edit and submit a "style fixes" CL before doing anything else. But since git became an option, that's not much of a problem.

That's the trouble with style guides: 95% of what they do COULD have been implemented in editors if we demanded that style guide authors weren't "architects" and policy-makers, but programmers who really know what they're doing. Very little of what they do requires a policy; it can just be described as a parse-tree transform. Instead, they're people who feel they can push their opinion on a large group of people, just because it makes them feel a bit better. That they can't actually program a fix, but instead just create another problem, doesn't seem to bother any of them, ever.

And we actually reward people for this. I don't get that. These people CREATE a problem, then somehow negotiate this style with a department, slowing everyone down for no good reason other than that they don't know how to do it automatically, and they usually actually get rewarded for providing a bad solution. Like they "introduce standards". Even just providing a style-guide-checker with a consistent interface (so IDE integration would be easy) is too much to ask at every place I've ever worked.

No, they substitute 10% of all the labor in this company for a job a simple Perl script COULD do.


1 is why I refuse to allow any static check in my company if you can show a real false positive from it in our code. If there is any doubt, someone will quickly try to cast every instance of that error into doubt, and it is up to me (as the maintainer of the static checks) to prove them wrong.

I have never seen the contradictory results from 2, so I don't know what you are talking about. If it is what I think, I would reject one of the two tools just so that I had a single, easy-to-apply set of rules with no false positives.

Static analysis is very valuable, but only when you can treat it like the hand of some god that will throw lightning at any developer who writes a detected error. I'm currently looking at 1 suppression per 2 million lines of code. I think there is more value in more checks, but unless they can be fully automated without telling everybody how to suppress false positives, I won't consider them.


You don't allow any type checking?

You can tune an analysis to be considerably more conservative and reduce false positives. The analysis becomes more like a semantically aware formatter. Doing most things related to pointers will lead to FPs, but you can still get value even if you are terrified of FPs.


We do lots of checks. Most of our code is C++, where we get a fair amount "for free", and the few times type checks get in the way, they are easy to bypass in a natural way.


> 1 is why I refuse to allow any static check in my company if you can show a real false positive in our code.

So no linters either?


We have a couple of linters. They all have a zero-false-positives policy. We have about 1 suppression per 2 million lines of code because of bugs in the linters we use - every one of those suppressions was accepted as a bug by upstream before we suppressed it.

I know that there are lots of checks that would be useful that we are missing. However they are not worth looking at because every false positive destroys some trust, and soon they are not trusted even when they are right.


Re. #1

It sounds to me like the issue isn't that false positives exist (you aren't ever going to fully get rid of them), it's that the SA tools need easy "escape hatches" for those false positives.

Even something as simple as an editor plugin to let you click on a specific "error" or "warning" and click "silence" to add a magic comment or something to explicitly "allow" the line/function/file.

It reduces the amount of extra work from a false positive to next to nothing, it still can and will catch issues before they hit production, and it can make iffy code more explicit (the added magic comment essentially saying "yes, I did mean to do that").
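For a concrete Java example of such an escape hatch: Error Prone (the tool from the article) honors @SuppressWarnings with the check name, scoped to the annotated element. The check name below (ReferenceEquality) is a real one, but treat the snippet as an illustrative sketch rather than a recommendation.

    class TokenCache {
        // Hypothetical case: the analyzer flags == on objects as a likely
        // bug, but here the strings are interned and identity comparison
        // is intentional. The annotation silences only this one check on
        // this one method and documents the decision in the code.
        @SuppressWarnings("ReferenceEquality")
        static boolean sameToken(String a, String b) {
            return a == b;
        }
    }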


1. The magic comments are specific to which analysis tool you're using, so don't work well with portable code.

2. Those magic comments are put in the source code, meaning they uglify the code in github, etc.

Of course, one could create a separate database for the annotations, but that adds another layer of complexity that discourages users from adopting the tool.


I disagree.

They are as portable as the code is. Yes, they might only work with a single language, requiring another "format" for other linters in other languages, but they will stay with the code wherever it goes.

And they aren't "uglifying" the code any more than type annotations or variable names. They are an additional hint to the user and the system about what a specific line or block of code is doing. They are just an annotation that the compiler ignores (for the most part).


The PHP I write using a JetBrains IDE, with the most extensive static analysis plugin I could find, has that about 5 keystrokes away when you have the cursor on the squiggly part. It's about the same number of keystrokes, +-2 or so (navigation), to list all occurrences of that specific warning. So yes, what you ask for is already a thing in (more or less) standard tooling.


3. The checks are skipped for all the code in third_party which you are using.


I disagree that improving the language is a better approach. Rather, it is easier at that point. It is harder in that it requires rewriting all of the projects you already had.

And it is wishful thinking to think you will not have regressions in either correctness or performance. Often both.

Edit: to be explicit, I'm not saying it is worse, either. Just has trade offs.


At LibreOffice we (mostly I) do whole-program analyses, and they are fairly useful, e.g. https://cgit.freedesktop.org/libreoffice/core/tree/compilerp...

That said, the analyses (a) are specific to our team, (b) require about 2 hours each to compute on a 6-core box, (c) are only run every couple of months, (d) are run and applied by the same person, (e) have a false positive rate of around 1%, so require manual review, and (f) often make assumptions that are only valid for how we use C++ and various libraries (without some "well, just don't do that" assumptions you can't do any useful analysis across a C++ codebase :-)


We use Coverity and, even though some members of the team really dislike it, I think it is quite helpful to detect things that would go unnoticed otherwise.

However, our project is relatively small, so we are normally done in ~15 minutes. We run it right before upstreaming patches, which means that the tests run once or twice a day.


If there are too many false positives, it just becomes an annoyance. If the false-positive rate is too high, it can even have the opposite effect: code detected to have bugs is more likely not to have a bug. The maximum tolerable false-positive rate is around 5%, I think. Anything above that and it would be more effective to show warnings at random locations.


The article claims 10% is a sweet spot for Google.


If I read it correctly, that 10% was only for the static checks flagged in code reviews. They have some checks that are 0%.

Which is to say, there are some checks that are always right, and Google doesn't allow violations of them. The 10% seems to refer to checks that catch significant bugs - things bad enough that they are worth looking at, even though the check isn't perfect.


"Google does not have infrastructure support to run interprocedural or whole-program analysis at Google scale..."

I am a bit unclear about this statement. Does it really mean that Google uses a single, monolithic program to do everything from advertisement payment tracking to the GMail user interface? Or does it mean they don't have the infrastructure to do analysis of many individual programs? (Surely, that can't be true!?)


I read that part more as: we don't have the necessary infrastructure to compute whole-program analyses, and we don't have sufficient justification to invest the effort in creating such infrastructure.


Google uses a monorepo. This means that a single definition can make its way into an outrageous number of binaries. This doesn't lend itself to nice tight boundaries between programs.


"Attempt 1. Bug dashboard. Initially, in 2006, FindBugs was integrated as a centralized tool that ran nightly over the entire Google codebase, producing a database of findings engineers could examine through a dashboard. Although FindBugs found hundreds of bugs in Google's Java codebase, the dashboard saw little use because a bug dashboard was outside the developers' usual workflow, and distinguishing between new and existing static-analysis issues was distracting.

"Attempt 2. Filing bugs. The BugBot team then began to manually triage new issues found by each nightly FindBugs run, filing bug reports for the most important ones. In May 2009, hundreds of Google engineers participated in a companywide "Fixit" week, focusing on addressing FindBugs warnings.3 They reviewed a total of 3,954 such warnings (42% of 9,473 total), but only 16% (640) were actually fixed, despite the fact that 44% of reviewed issues (1,746) resulted in a bug report being filed. Although the Fixit validated that many issues found by FindBugs were actual bugs, a significant fraction were not important enough to fix in practice. Manually triaging issues and filing bug reports is not sustainable at a large scale.

"Attempt 3. Code review integration. The BugBot team then implemented a system in which FindBugs automatically ran when a proposed change was sent for review, posting results as comments on the code-review thread, something the code-review team was already doing for style/formatting issues. Google developers could suppress false positives and apply FindBugs' confidence in the result to filter comments. The tooling further attempted to show only new FindBugs warnings but sometimes miscategorized issues as new. Such integration was discontinued when the code-review tool was replaced in 2011 for two main reasons: the presence of effective false positives caused developers to lose confidence in the tool, and developer customization resulted in an inconsistent view of analysis results."

Once upon a time, I had a co-worker on a C++ project commit a change disabling -Wall with a comment that can loosely be paraphrased as, "No one has time for that."

It's good to see Google has the same issues.


Semmle (and its open source product LGTM) is a really great tool in this space.


Very nice to integrate static analysis with code review tools. Did anybody here successfully integrate Java static analysis (e.g. FindBugs) with GitLab merge requests?

How would someone integrate static analysis to the workflow if there's no code review?

My experience with bug dashboards is the same as Google's. It is outside the developers' workflow.


Is there an "Error Prone"-like tool for JavaScript or Angular?


Hi, I started Error Prone and now work on the Angular team :)

In addition to tslint (the "old way"), we now have http://tsetse.info, which follows the Error Prone model of baking checks into the TypeScript compiler and failing the compilation the same way the type checker does (for the error case).

Still have some work to do to make warnings appear in code review like we talk about in the paper...


> Because checks in Error Prone are self-contained and written against the javac abstract syntax tree ... it is relatively easy for developers outside the team to contribute checks... As of January 2018, 733 checks had been contributed by 162 authors.

Somewhat interestingly, Clippy is an analogous tool for Rust and has 181 contributors now, more than Error Prone. https://github.com/rust-lang-nursery/rust-clippy
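For a sense of what "self-contained and written against the javac abstract syntax tree" looks like in practice, here is a rough sketch of the shape of an Error Prone check. The check itself (flagging redundant String.toString() calls) is made up for illustration, and the annotation fields and matcher helpers vary a bit between Error Prone versions.

    import com.google.errorprone.BugPattern;
    import com.google.errorprone.VisitorState;
    import com.google.errorprone.bugpatterns.BugChecker;
    import com.google.errorprone.matchers.Description;
    import com.google.errorprone.matchers.Matcher;
    import com.google.errorprone.matchers.method.MethodMatchers;
    import com.sun.source.tree.ExpressionTree;
    import com.sun.source.tree.MethodInvocationTree;

    @BugPattern(
        name = "RedundantStringToString",
        summary = "Calling toString() on a String is redundant",
        severity = BugPattern.SeverityLevel.WARNING)
    public class RedundantStringToString extends BugChecker
        implements BugChecker.MethodInvocationTreeMatcher {

      // Matches invocations of String.toString() in the javac AST.
      private static final Matcher<ExpressionTree> TO_STRING =
          MethodMatchers.instanceMethod()
              .onExactClass("java.lang.String")
              .named("toString");

      @Override
      public Description matchMethodInvocation(
          MethodInvocationTree tree, VisitorState state) {
        // Report a finding at the call site, or stay silent.
        return TO_STRING.matches(tree, state)
            ? describeMatch(tree)
            : Description.NO_MATCH;
      }
    }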


Why is that relevant to this post?


Because it provides a comparison point to a statistic in the article? GP was pretty explicit about how it related to the article


Might be apples to oranges, if one tool is internal and the other isn't.


Error Prone is not internal: https://github.com/google/error-prone



