> There are products (Nexus Firewall) that can check dependencies for vulnerabil...

hoosieree · on Nov 14, 2023

My research is about detecting semantically similar executable code inside obfuscated and stripped programs. I don't know what commercial antivirus or vulnerability scanners use internally, but it's possible to generate similarity scores between an obfuscated/stripped unknown binary and a bunch of known binaries. I suspect commercial scanners use a lot of heuristics. I know IDA Pro has a plugin for "fingerprinting" but it's based on hashes of byte sequences and can be spoofed.

My approach is basically: train a model on a large obfuscated dataset seeded with "known" examples. While you can't say with certainty what an unknown sample contains, you can determine how similar it is to a known sample, so you can spend more of your time analyzing the really weird stuff.
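To make the "how similar is it to a known sample" step concrete, here is a minimal sketch. It assumes a trained model has already mapped each function or binary to a fixed-length embedding vector. The vectors, the sample names, and the choice of cosine similarity are all illustrative assumptions, not the actual pipeline described above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_known_samples(unknown, known):
    """Return (name, score) pairs for every known sample, most similar first."""
    scores = [(name, cosine_similarity(unknown, vec)) for name, vec in known.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical embeddings; a real model would produce much longer vectors.
known_embeddings = {
    "sort_routine": [0.9, 0.1, 0.0],
    "hash_routine": [0.1, 0.8, 0.3],
    "parser_routine": [0.0, 0.2, 0.9],
}
unknown_embedding = [0.8, 0.2, 0.1]

ranking = rank_known_samples(unknown_embedding, known_embeddings)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Samples whose best score is low against everything in the known set are the "really weird stuff" worth a closer manual look.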

The hardest part in my opinion is generating the training data. You need a good source code obfuscator for your language. I've seen a lot of papers that use obfuscator-llvm[1] to obfuscate the IR during compilation. I use Tigress[2] to obfuscate the source code because it provides more diversity, but it only supports C.
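As a rough sketch of how such a dataset could be generated, the snippet below only builds Tigress command lines, one per (function, transformation) pair, without running them. The flag spellings (`--Environment`, `--Transform`, `--Functions`, `--out`) and the environment string are assumed from the Tigress documentation and should be checked against your installed version; the file and function names are hypothetical.

```python
from itertools import product
from pathlib import Path


def tigress_commands(source, functions, transforms, out_dir):
    """Build one Tigress argv list per (function, transform) pair; nothing is executed."""
    commands = []
    for function, transform in product(functions, transforms):
        stem = Path(source).stem
        out_file = (Path(out_dir) / f"{stem}_{function}_{transform}.c").as_posix()
        commands.append([
            "tigress",
            "--Environment=x86_64:Linux:Gcc:4.6",  # assumed value; match your toolchain
            f"--Transform={transform}",
            f"--Functions={function}",
            f"--out={out_file}",
            str(source),
        ])
    return commands


# Hypothetical inputs: one C file, two functions, two transformations.
commands = tigress_commands("sample.c", ["fib", "main"], ["Flatten", "Split"], "obf")
for argv in commands:
    print(" ".join(argv))
```

Each resulting C file would then be compiled and stripped, with the (source function, transformation) pair kept as the training label.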

[1]: https://github.com/obfuscator-llvm/obfuscator/wiki/Installat...

[2]: https://tigress.wtf/

hun3 · on Nov 14, 2023

Great work! For unobfuscated or lightly packed binaries, I guess your approach could mostly work.

One question: how do you detect when a binary is intentionally made to statically look similar to one binary, while its behavior actually mimics another?

hoosieree · on Nov 16, 2023

That's a good question. There are Tigress transformations [1,2] that seem highly relevant to this goal, but they're harder to work with because the resulting C code isn't always compilable without errors.

In my work I'm not looking for intentional spoofing, but the obfuscations I do use [3,4,5,6,7] end up building very similar control flow structures for different functions. Maybe that fits the spirit of your question... Let me know if not.
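A toy illustration of why different functions can end up with very similar control flow: the sketch below models control-flow flattening in the abstract (every block is reached from, and returns to, a single dispatcher) and compares CFG shapes by their out-degree counts. This is a simplified model for intuition only, not Tigress's actual Flatten implementation; the graphs and the `dispatch` node name are hypothetical.

```python
from collections import Counter


def flatten(cfg):
    """Toy control-flow flattening: route every block through one dispatcher node."""
    blocks = sorted(cfg)
    flat = {"dispatch": list(blocks)}  # assumes no original block is named "dispatch"
    for block in blocks:
        # Blocks that had successors now jump back to the dispatcher;
        # exit blocks (no successors) stay exits.
        flat[block] = ["dispatch"] if cfg[block] else []
    return flat


def shape(cfg):
    """Multiset of out-degrees: a crude, label-free summary of a CFG's shape."""
    return Counter(len(successors) for successors in cfg.values())


# Two different functions as adjacency lists: a straight-line chain and an if/else diamond.
chain_cfg = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
branch_cfg = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}

same_before = shape(chain_cfg) == shape(branch_cfg)
same_after = shape(flatten(chain_cfg)) == shape(flatten(branch_cfg))
print("same shape before flattening:", same_before)
print("same shape after flattening:", same_after)
```

After flattening, the distinguishing information moves out of the graph structure and into the data that drives the dispatcher, which is part of what makes purely structural matching harder.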

So far I'm doing purely static analysis and control flow, but the broader field of reverse engineering includes dynamic/symbolic analysis where you track values through a running/simulated program. Those techniques give great results but are very costly to run.

I've been focusing on making cheap/static analysis better, so I haven't explored the dynamic/symbolic side at all yet.

[1]: https://tigress.wtf/virtualize.html

[2]: https://tigress.wtf/jitter.html

[3]: https://tigress.wtf/flatten.html

[4]: https://tigress.wtf/split.html

[5]: https://tigress.wtf/merge.html

[6]: https://tigress.wtf/encodeArithmetic.html

[7]: https://tigress.wtf/inline.html