
It starts with an invalid .jpg (literally a text file containing "hello"), and by trying over and over, changing random bytes and tracing the execution of the decoder program as it is fed the corrupted input, it will drill deeper and deeper into the program until it has gotten far enough that the input is actually a valid .jpg, without any human input.

Fuzzing like this is a very effective technique for finding (security) bugs in programs that parse input, because you will quickly end up with "impossible" input nobody thought to check for (but is close enough that it won't be rejected outright), and whoops there's your buffer overflow.

In this particular case, the fuzzer is going beyond just throwing random input, as it considers which changes to the input trigger new code paths in the target binary, and therefore should have a higher success rate in triggering bugs compared to just trying random stuff. And don't forget, this will work with any type of program and file type, not just .jpgs and the djpeg binary.
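The feedback loop described above can be sketched in a few lines. This is a toy illustration only, not afl's actual algorithm: `toy_decoder`, `mutate`, and `fuzz` are hypothetical names, and the "instrumentation" is simply the target reporting which branches it reached. The assumed target checks a three-byte header one byte at a time, so blind random input would match it once in roughly 256^3 (about 16.7 million) tries, while keeping every input that reaches a new branch gets there in a few thousand.

```python
import random

# Assumed toy header: the JPEG start-of-image marker plus one more 0xff byte.
MAGIC = b"\xff\xd8\xff"


def toy_decoder(data: bytes) -> set:
    """Hypothetical stand-in for an instrumented target.

    Returns the set of branch ids reached while parsing `data`.
    """
    covered = {"entry"}
    for i, expected in enumerate(MAGIC):
        if len(data) <= i or data[i] != expected:
            return covered  # rejected early, shallow coverage only
        covered.add(f"magic{i}")
    covered.add("header_ok")
    return covered


def mutate(data: bytes, rng: random.Random) -> bytes:
    """Overwrite one randomly chosen byte with a random value."""
    if not data:
        return bytes([rng.randrange(256)])
    buf = bytearray(data)
    buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)


def fuzz(seed_input: bytes, max_iters: int, rng: random.Random):
    """Keep any mutated input that reaches a branch not seen before."""
    queue = [seed_input]
    seen = set(toy_decoder(seed_input))
    for _ in range(max_iters):
        candidate = mutate(rng.choice(queue), rng)
        coverage = toy_decoder(candidate)
        if not coverage <= seen:  # new branch reached: worth keeping
            seen |= coverage
            queue.append(candidate)
            if "header_ok" in coverage:
                return candidate
    return None


if __name__ == "__main__":
    found = fuzz(b"hello", 200_000, random.Random(0))
    print(found)
```

The only design choice that matters here is the `queue.append`: a partially correct input is kept and mutated further instead of being thrown away, which is what lets the search "drill deeper" one check at a time.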



>In this particular case, the fuzzer is going beyond just throwing random input, as it considers which changes to the input trigger new code paths in the target binary, and therefore should have a higher success rate in triggering bugs compared to just trying random stuff.

To expand on this, techniques like this are called whitebox fuzzing (or maybe graybox in afl's case). At the extreme, whitebox fuzzers even incorporate constraint solvers to directly solve for inputs that take the program down previously unexplored paths. One very impressive project is the SAGE whitebox fuzzer [1,2,3], which is in production use at Microsoft (sadly, it is an internal project). I work in the related field of automated test generation, but all my tools are very much research-grade. In SAGE, however, they've done all the work of figuring out how 24/7 whitebox fuzzing can be integrated into the development process. I am somewhat envious of the researchers who get to work in an environment where that is possible. If you're interested, I very much recommend reading the papers on SAGE.

[1] Poster about SAGE: http://research.microsoft.com/en-us/um/people/pg/public_psfi...

[2] An approachable article on SAGE: http://research.microsoft.com/en-us/um/people/pg/public_psfi...

[3] The paper with all the details: http://research.microsoft.com/en-us/projects/atg/ndss2008.pd...
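To make the constraint-solving idea concrete, here is a deliberately tiny sketch. It is not how SAGE works internally: real whitebox fuzzers collect symbolic path constraints and hand them to an SMT solver, whereas this toy only handles "byte i must equal x" checks, and `traced_decoder`, `solve_flip`, and `explore` are hypothetical names made up for the example.

```python
# Assumed toy header, as a stand-in for the checks a real parser performs.
MAGIC = b"\xff\xd8\xff"


def traced_decoder(data: bytes) -> list:
    """Hypothetical target that records each branch condition it evaluates.

    Each record is (index, expected_byte, taken).
    """
    path = []
    for i, expected in enumerate(MAGIC):
        taken = i < len(data) and data[i] == expected
        path.append((i, expected, taken))
        if not taken:
            break  # the parser rejects the input here
    return path


def solve_flip(data: bytes, constraint: tuple) -> bytes:
    """Toy 'solver': build an input satisfying data[index] == expected."""
    index, expected, _ = constraint
    buf = bytearray(data.ljust(index + 1, b"\x00"))
    buf[index] = expected
    return bytes(buf)


def explore(seed: bytes):
    """Repeatedly solve the first failed branch until none fail."""
    data, runs = seed, 0
    while True:
        runs += 1
        failed = [c for c in traced_decoder(data) if not c[2]]
        if not failed:
            return data, runs
        data = solve_flip(data, failed[0])


if __name__ == "__main__":
    print(explore(b"hello"))  # (b'\xff\xd8\xfflo', 4)
```

Starting from "hello", each run fails one check, the failed condition is solved directly, and the fourth run passes the whole header. No random guessing is involved, which is the appeal of the approach. The cost is needing a solver that can handle whatever conditions the real program contains.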


The main problem with SAGE is that at least outside Microsoft, it exists just as a series of (very enthusiastic) papers :-)

So, while I suspect it's very cool, it's also a bit of a no-op for everybody else. It's also impossible to independently evaluate the benefits: for example, its performance cost, the amount of fine-tuning and configuration required for each target, the relative gains compared to less sophisticated instrumented fuzzing strategies, etc.


Great links, thanks.


Why does it use fuzzed input in the first place? Couldn’t one just use random input from the beginning instead? It would be effectively equivalent, but fuzzing a "hello" string seems roundabout.


Well, "hello" is pretty random. :) It was probably just used for dramatic effect in the demo, and you have to start with something - of course even a 0 byte file would be enough.

You could also have started with a valid .jpg with lots of complicated embedded exif metadata sections etc., and have a good chance of triggering bugs in those code paths without having to "discover exif" first.


From the article: it works without any special preparation: there is nothing special about the "hello" string.


He said it took a day to find good jpg images. If you started the program with a valid input, then it would take much less time to explore the other code paths.


In this case "hello" was just a pseudorandom starter to seed the fuzzer.



