A team at Google pulled 20 TB(!) of SWF files out of their crawl and ran them through a simple algorithm that determined the 20,000-file subset that exercised the maximum number of basic blocks in Adobe's Flash Player.
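That selection step is basically a greedy set cover: keep picking the file that adds the most basic blocks you haven't seen yet, and stop when nothing adds anything new. A minimal sketch of the idea, assuming you've already recorded a per-file coverage bitmap from an instrumented run (the structs, sizes, and names here are invented for illustration; this isn't Google's tooling):

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_BLOCKS 65536                  /* assumed size of the coverage bitmap */
    #define WORDS      (MAX_BLOCKS / 64)

    struct sample {
        const char *path;                     /* path to the SWF file */
        uint64_t    cov[WORDS];               /* basic blocks this file hit */
    };

    /* How many blocks does s add on top of what we've already covered? */
    static int new_coverage(const struct sample *s, const uint64_t *seen)
    {
        int n = 0;
        for (int w = 0; w < WORDS; w++)
            n += __builtin_popcountll(s->cov[w] & ~seen[w]);
        return n;
    }

    /* Greedy set cover: repeatedly pick the file adding the most unseen blocks.
     * Marks kept samples with chosen[i] = 1; returns how many were kept. */
    int minimize_corpus(const struct sample *all, int count, char *chosen)
    {
        uint64_t seen[WORDS] = {0};
        int kept = 0;

        memset(chosen, 0, (size_t)count);
        for (;;) {
            int best = -1, best_gain = 0;
            for (int i = 0; i < count; i++) {
                if (chosen[i])
                    continue;
                int gain = new_coverage(&all[i], seen);
                if (gain > best_gain) { best_gain = gain; best = i; }
            }
            if (best < 0)                     /* nothing adds new coverage: stop */
                break;
            chosen[best] = 1;
            kept++;
            for (int w = 0; w < WORDS; w++)   /* fold its blocks into the seen set */
                seen[w] |= all[best].cov[w];
        }
        return kept;
    }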
Then, using 2000 CPU cores at Google for 3 weeks, they flipped random bits in those 20,000 SWF files and fed them through an instrumented Flash Player.
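The mutation itself is about as dumb as fuzzing gets: read a file in, flip a few random bits, write it back out, and see whether the target falls over on it. A toy mutator along those lines (a sketch, not Google's actual harness; the 8-flips-per-file knob is arbitrary):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Flip `nflips` random bits in the buffer, in place. */
    static void flip_random_bits(unsigned char *buf, size_t len, int nflips)
    {
        for (int i = 0; i < nflips; i++) {
            size_t byte = (size_t)rand() % len;   /* which byte to mangle */
            int    bit  = rand() % 8;             /* which bit within it */
            buf[byte] ^= (unsigned char)(1u << bit);
        }
    }

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s input.swf mutated.swf\n", argv[0]);
            return 1;
        }
        srand((unsigned)time(NULL));

        FILE *in = fopen(argv[1], "rb");
        if (!in) { perror(argv[1]); return 1; }
        fseek(in, 0, SEEK_END);
        long len = ftell(in);
        rewind(in);
        if (len <= 0) { fprintf(stderr, "empty or unreadable input\n"); return 1; }

        unsigned char *buf = malloc((size_t)len);
        if (!buf || fread(buf, 1, (size_t)len, in) != (size_t)len) {
            fprintf(stderr, "read failed\n");
            return 1;
        }
        fclose(in);

        flip_random_bits(buf, (size_t)len, 8);    /* 8 flips per file: arbitrary knob */

        FILE *out = fopen(argv[2], "wb");
        if (!out) { perror(argv[2]); return 1; }
        fwrite(buf, 1, (size_t)len, out);
        fclose(out);
        free(buf);
        return 0;
    }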
Result: 80 code changes in Flash Player to fix security bugs from the resulting crashes.
This is great stuff; you can imagine a very well-organized adversary spending the money on comparable compute resources, and even (if you stretch) obtaining the non-recoverable engineering time to build a comparably sophisticated fuzzing farm. But no other entity, except perhaps Microsoft, can generate the optimal corpus of SWF files to fuzz from.
DO PDF NEXT, GOOGLE.
You've got to ask yourself: in a year or so, if there are still regular updates for exploitable zero-day memory corruption flaws in Flash, even after Google exhaustively tests the input to every basic block in the player with the union of all file format features observed on the entire Internet, what does that say about the hardness of making software resilient to attack?
"what does that say about the hardness of making software resilient to attack?"
Well, we already know the answer to that question. The long-term solution is to build software out of safer building blocks. A good example is high-level programming languages. When you write C and use C strings (or other unstructured "blocks-o-ram" data structures), you have to "get it right" every single time you touch a string. In a codebase the size of Flash's, this probably amounts to tens of thousands of possible bugs. But if you write it in a high-level language, the runtime implementor only has to get it right once -- and there's no way you can get it wrong, even if you want to. (The middle ground is something like better string-handling functions; OpenBSD tried to do this with strl*, and there are libraries like bstring that represent strings properly. But you can still do unsafe pointer math on these strings, so the benefit is limited.)
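To make the "get it right every single time" point concrete, here's the shape of the bug at a single call site, next to the strl* middle ground (illustrative code, nothing to do with Flash's actual source; note that strlcpy is a BSD-ism, so on Linux you'd pull it in from libbsd):

    #include <stdio.h>
    #include <string.h>   /* strlcpy lives here on BSD/macOS; on Linux, libbsd's <bsd/string.h> */

    void print_tag_name_unsafe(const char *name)
    {
        char buf[32];
        /* Classic bug: if `name` came out of the file and is longer than 31
         * bytes, strcpy happily writes past the end of `buf`. Every call
         * site like this is another chance to get it wrong. */
        strcpy(buf, name);
        printf("tag: %s\n", buf);
    }

    void print_tag_name_strlcpy(const char *name)
    {
        char buf[32];
        /* The strl* middle ground: strlcpy truncates instead of overflowing
         * and returns the length it wanted, so truncation is detectable.
         * But you still have to remember to do this, every single time. */
        if (strlcpy(buf, name, sizeof buf) >= sizeof buf)
            fprintf(stderr, "tag name truncated\n");
        printf("tag: %s\n", buf);
    }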
The way it stands right now, it's cheaper to write software with the approach of "throw some trained monkeys at it and hope for the best". But in the future, we're going to need to do a lot more thinking and a lot less typing if we want to write secure software without performance penalties.
Binary file formats make it worse, because you are dealing with untrusted data disguised as a C struct. For example, even multiplying two integers that are too big can overflow, and C will silently truncate the result.
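Concretely, the classic shape of that bug is something like this hypothetical parser (not actual Flash code): two 32-bit fields from the file get multiplied to size an allocation, the product wraps, the buffer comes back tiny, and the copy that follows runs right off the end of it.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical header, read straight out of an untrusted file. */
    struct image_header {
        uint32_t width;
        uint32_t height;
    };

    unsigned char *load_pixels_buggy(const struct image_header *h, const unsigned char *data)
    {
        /* If width * height wraps around 32 bits (say 0x10000 * 0x10001
         * truncates to 0x10000), malloc returns a tiny buffer... */
        uint32_t size = h->width * h->height;
        unsigned char *pixels = malloc(size);
        if (!pixels)
            return NULL;

        /* ...but the copy below still walks width * height bytes, far past
         * the end of the allocation. */
        for (uint32_t y = 0; y < h->height; y++)
            memcpy(pixels + (size_t)y * h->width,
                   data   + (size_t)y * h->width,
                   h->width);
        return pixels;
    }

    unsigned char *load_pixels_checked(const struct image_header *h, const unsigned char *data)
    {
        /* The fix is a check you have to remember at every such multiplication. */
        if (h->width == 0 || h->height == 0)
            return NULL;
        if (h->height > UINT32_MAX / h->width)
            return NULL;                          /* would overflow: reject the file */
        uint32_t size = h->width * h->height;
        unsigned char *pixels = malloc(size);
        if (!pixels)
            return NULL;
        memcpy(pixels, data, size);
        return pixels;
    }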
OK, I am going to say that this is just a little scary, scale-wise. And I am thinking that the 2000 cores they used were some teeny fraction of what they might have deployed if they really needed it.
On the subject of raw computing power, if you live in the US you've probably heard about some NSA or CIA data facility being installed in your general region, and how the local power company built new infrastructure just to power the building. If Google can throw 2000 cores at securing software, how many can a government throw at breaking it, e.g. in preparation for the next iteration of Stuxnet?
The cluster is interesting, but not as interesting as the giant corpus of SWF files Google got to use. Do you think the government has a crawl as complete as Google's under its hat? How? People notice when the Googlebot does new things. Wouldn't we have noticed the Fedbot?
Quite true. Google does have a lot of data. But I'd wager the NSA has just as much data, just from different sources. Maybe they couldn't fuzz Flash with the optimal set of .swf files, but they could mine vast numbers of voice conversations for correlations.
Additionally, years ago a friend I'd lost contact with caught up with me and told me he'd found a cached copy of a website I'd taken down in his employer's equivalent of the Wayback Machine. His employer was a branch of the federal government. I know my anecdote doesn't prove anything, let alone come close to addressing the difficulty of crawling the web without anyone noticing (intercepting all HTTP traffic in transit?), but the fact remains that there are literally tons of computers doing something for the government.
I suspect "fedbot" works by calling up google and saying "Hi, it's us again, we've got another white van on the way to the googleplex, have a petabyte or two of the Internet ready for us to collect in 20 minutes. thanks"
Copyright law generally allows copying for research purposes that don't compete commercially with the author's use of the work. For example, see https://w2.eff.org/IP/DMCA/Felten_v_RIAA/ (And Google, unlike some of us, can afford lawyers to press that point in court if anyone tries to sue them.)