
> There are products (Nexus Firewall) that can check dependencies for vulnerabilities

Products can only do best-effort scanning for known patterns of vulnerabilities, and those patterns are only added after someone (or something) discovers the issue and verifies it's not a false alarm. In that window, scanners are blind, and anything in your infrastructure can run the malicious code and get compromised. (Google supply chain incidents on npm or PyPI.)

In general, there is always a way to bypass such scanning, since the language is Turing-complete and unsandboxed. Any claim to go beyond best effort is a lie, or simply impractical. Systems running the latest software and antivirus get hacked all the time. Nothing can really stop them. Why?

- https://web.archive.org/web/20160424133617/http://www.pcworl...

- https://security.stackexchange.com/questions/201992/has-it-b...

- https://en.wikipedia.org/wiki/Rice%27s_theorem

> and either block them from entering the network or fail CI pipelines.

Also, relying solely on an endpoint security product for protection is dangerous, since security products themselves get hacked all the time. Sonatype, for example:

- https://nvd.nist.gov/vuln/search/results?form_type=Basic&res...




My research is about detecting semantically similar executable code inside obfuscated and stripped programs. I don't know what commercial antivirus or vulnerability scanners use internally, but it's possible to generate similarity scores between an obfuscated/stripped unknown binary and a bunch of known binaries. I suspect commercial scanners use a lot of heuristics. I know IDA Pro has a plugin for "fingerprinting" but it's based on hashes of byte sequences and can be spoofed.

My approach is basically: train a model on a large obfuscated dataset seeded with "known" examples. While you can't say with certainty what an unknown sample contains, you can determine how similar it is to a known sample, so you can spend more of your time analyzing the really weird stuff.
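To make the scoring idea concrete, here's a toy Python sketch. It is not my actual pipeline (the real features come from a trained model over obfuscated binaries, not a hand-built opcode histogram), but it shows the shape of "compare an unknown sample against known ones and rank by similarity":

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def opcode_histogram(opcodes, vocab):
    """Toy feature extractor: count how often each mnemonic occurs."""
    return [opcodes.count(op) for op in vocab]

VOCAB = ["mov", "add", "xor", "jmp", "call", "ret"]

known   = opcode_histogram(["mov", "mov", "add", "call", "ret"], VOCAB)
unknown = opcode_histogram(["mov", "add", "add", "call", "ret"], VOCAB)

# High score -> close to a known sample; low score -> "weird", worth a human look.
score = cosine_similarity(known, unknown)  # ~0.86 with these toy features
```

In practice you'd rank every unknown function against the whole known corpus and triage from the bottom of the list up.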

The hardest part in my opinion is generating the training data. You need a good source code obfuscator for your language. I've seen a lot of papers that use obfuscator-llvm[1] to obfuscate the IR during compilation. I use Tigress[2] to obfuscate the source code because it provides more diversity, but it only supports C.

[1]: https://github.com/obfuscator-llvm/obfuscator/wiki/Installat...

[2]: https://tigress.wtf/
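As a rough sketch of the dataset-generation step: one seed program gets fanned out into many labeled variants by crossing transforms with random seeds. The flag names below follow the Tigress docs as I remember them; double-check them against tigress.wtf before running anything:

```python
import itertools

# Transforms assumed available in Tigress; seeds add diversity per transform.
TRANSFORMS = ["Flatten", "Split", "Merge", "EncodeArithmetic", "Inline"]
SEEDS = range(3)

def variant_commands(source="seed.c", func="main"):
    """Build (but don't run) one obfuscation command per (transform, seed) pair."""
    cmds = []
    for transform, seed in itertools.product(TRANSFORMS, SEEDS):
        out = f"{transform.lower()}_{seed}.c"
        cmds.append(
            f"tigress --Seed={seed} --Transform={transform} "
            f"--Functions={func} --out={out} {source}"
        )
    return cmds

# 5 transforms x 3 seeds = 15 variants of the same function, all sharing
# one ground-truth label for similarity training.
cmds = variant_commands()
```

The important property is that every variant keeps the seed program's label, so the model learns that wildly different-looking code can be "the same" function.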


Great work! For unobfuscated or lightly packed binaries, I guess your approach could mostly work.

One question: how do you detect when a binary is intentionally made to statically look similar to one binary, while its behavior actually mimics another?


That's a good question. There are Tigress transformations [1,2] that seem highly relevant to this goal, but they're harder to work with because the resulting C code doesn't always compile cleanly.

In my work I'm not looking for intentional spoofing, but the obfuscations I do use [3,4,5,6,7] end up building very similar control flow structures for different functions. Maybe that fits the spirit of your question... Let me know if not.

So far I'm doing purely static analysis of control flow, but the broader field of reverse engineering includes dynamic/symbolic analysis, where you track values through a running or simulated program. It gets great results but is very costly to run.

I've been focusing on making cheap/static analysis better, so I haven't explored the dynamic/symbolic side at all yet.

[1]: https://tigress.wtf/virtualize.html

[2]: https://tigress.wtf/jitter.html

[3]: https://tigress.wtf/flatten.html

[4]: https://tigress.wtf/split.html

[5]: https://tigress.wtf/merge.html

[6]: https://tigress.wtf/encodeArithmetic.html

[7]: https://tigress.wtf/inline.html
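To make the static-vs-dynamic gap above concrete, here's a toy Python illustration (not from my pipeline): two functions that look different statically but behave identically, which a cheap dynamic check on random inputs will catch even though byte-level comparison would not.

```python
import random

def plain(x):
    return x * 2

def disguised(x):
    # Mixed boolean-arithmetic rewrite: (x | x) + (x & x) == x + x == 2 * x.
    return (x | x) + (x & x)

def behaviorally_similar(f, g, trials=1000):
    """Cheap dynamic check: do f and g agree on random inputs?"""
    return all(f(v) == g(v) for v in (random.randrange(1 << 16) for _ in range(trials)))

print(behaviorally_similar(plain, disguised))  # prints True
```

Real dynamic/symbolic tools prove this kind of equivalence rather than sampling it, which is where the cost comes from.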



