Can I just say, thanks to the person who posted this for waiting until this week to do so. (Side note: I suspect it was due to the recent coverage from C++ Weekly which is a great resource: https://www.youtube.com/watch?v=h3F0Fw0R7ME)
As recently as last week we had some horrible performance problems but it looks like the queue (https://dogbolt.org/queue) is mostly still fine! Other than the long pole of a few of the decompilers being backed up, things are humming along quite smoothly! Josh + Glenn have done some great work on it! (https://github.com/decompiler-explorer/decompiler-explorer/c...)
Wow, I really could have used this for my Ph.D. research (deep learning for obfuscated code).
I ditched Ghidra early on in my experiments in favor of angr, because Ghidra did not play nicely with multiprocessing and I had a lot of data to process. Well, maybe it does, but it was much easier for me to achieve the same thing with angr.
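For what it's worth, the pattern that worked for me looks roughly like this (a sketch only; `analyze_one`/`analyze_all` are names I made up, and it assumes angr's `Project`/`CFGFast` API): one angr project per worker process.

```python
import multiprocessing as mp

def analyze_one(path):
    # angr is imported inside the worker so the parent process stays light
    # and each worker builds its own Project (they aren't picklable, so
    # they can't be shared across processes anyway).
    import angr
    proj = angr.Project(path, auto_load_libs=False)
    cfg = proj.analyses.CFGFast()  # fast static CFG recovery
    return path, len(cfg.kb.functions)

def analyze_all(paths, jobs=4):
    # One binary per task; results come back as (path, function_count) pairs.
    with mp.Pool(jobs) as pool:
        return dict(pool.map(analyze_one, paths))
```

The per-worker import is the whole trick: each process owns its analysis state, so nothing heavyweight has to cross the process boundary.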
Love the name! Although I feel compelled to point out that Compiler Explorer is the name of the project and Godbolt is its author's last name, but I suppose if people are to the point of using Godbolt as a verb the ship has sailed.
My work focuses on recognizing known functions in obfuscated binaries, but there are some papers you might want to check out related to deobfuscation, if not necessarily using ML for deobfuscation or decompilation.
My take is that ML can soundly defeat the "easy" and more static obfuscation types (encodings, control flow flattening, splitting functions). It's low hanging fruit, and it's what I worked on most, but adoption is slow. On the other hand, "hard" obfuscations like virtualized functions or programs which embed JIT compilers to obfuscate at runtime... as far as I know, those are still unsolved problems.
https://www.jinyier.me/papers/DATE19_Obf.pdf uses deobfuscation for RTL logic (FPGA/ASIC domain) with SAT solvers. Might be useful for a point of view from a fairly different domain.
https://advising.cs.arizona.edu/~debray/Publications/generic... uses "semantics-preserving transformations" to shed obfuscation. I think this approach is the way to go, especially when combined with dynamic/symbolic analysis to mitigate virt/jit types of transformations.
Eventually I think SBOM tools like Black Duck[1] and SLSA[2] will incorporate ML to improve the accuracy of even figuring out what dependencies a piece of software actually has.
> My take is that ML can soundly defeat the "easy" and more static obfuscation types (encodings, control flow flattening, splitting functions). It's low hanging fruit, and it's what I worked on most, but adoption is slow.
If I wanted to implement my own toy Hex-Rays-like decompiler using a few of these techniques to decompile x86-64 binaries, is there any high-quality, up-to-date paper/resource you would recommend?
Or do you think that "A Generic Approach to Automatic Deobfuscation of Executable Code" paper is a good enough start?
"A Generic Approach" seems like a good starting point for a classical approach: building a set of reusable components and heuristics to recognize idioms, etc.
Might also be worth considering an approach integrating LLMs for summarizing code. Maybe you could fine-tune a pretrained model that already "understands" source code to associate sources with generated code? If going this route I would still probably use a disassembler to preprocess, and maybe also extract basic blocks to use as my "target" domain for fine-tuning.
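To make the basic-block idea concrete, here's a sketch of the classic "leader" algorithm for splitting a linear disassembly into basic blocks. In practice the instruction list would come from a disassembler such as capstone; here it's hard-coded so the sketch stays self-contained, and the branch mnemonic set is deliberately simplified.

```python
# Very simplified: real x86-64 has many more control-flow mnemonics.
BRANCHES = {"jmp", "je", "jne", "jz", "jnz", "call", "ret"}

def basic_blocks(insns):
    """insns: list of (addr, mnemonic, operand) tuples in address order."""
    addrs = [a for a, _, _ in insns]
    leaders = {addrs[0]}  # the first instruction always starts a block
    for i, (addr, mnem, op) in enumerate(insns):
        if mnem in BRANCHES:
            # The instruction after a branch starts a new block...
            if i + 1 < len(insns):
                leaders.add(addrs[i + 1])
            # ...and so does the branch target, when it's a literal address.
            try:
                leaders.add(int(op, 16))
            except ValueError:
                pass
    blocks, current = [], []
    for ins in insns:
        if ins[0] in leaders and current:
            blocks.append(current)
            current = []
        current.append(ins)
    if current:
        blocks.append(current)
    return blocks

demo = [
    (0x00, "mov", "eax, 1"),
    (0x05, "cmp", "eax, 2"),
    (0x08, "jne", "0x10"),
    (0x0A, "mov", "ebx, 3"),
    (0x10, "ret", ""),
]
print([len(b) for b in basic_blocks(demo)])  # -> [3, 1, 1]
```

Blocks like these, serialized as token sequences, are one plausible "target" domain for the fine-tuning idea above.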
As for Tigress, I used it extensively and found it to be really great most of the time. There are some limitations to be aware of: it only works with C code, and you have to turn your multi-file projects into a single file with a main() function. Also, its C parser (CIL) has some limitations (e.g. doesn't recognize the word static in "struct foo x[static 1]") so you might need to translate your C code first. I translated manually because it was a really rare issue for the code I started with. I also had mixed results using Virtualize and JIT. Sometimes they would emit invalid code, so I ended up just throwing out that data.
In my view, the up-and-coming Tigress challenger is obfuscator-llvm. I think it is very promising for future work because it inherently supports more languages than only C. But currently obfuscator-llvm is much more limited (~3 transformations compared to ~48). So if you're using C, today I would pick Tigress.
IRs generally aren't suited to examination by a human when you're starting with a full binary. I would imagine something like that would only work well for very small bits of assembly. Relatedly, you might be interested in BNIL, which is an entire stack of ILs that Binary Ninja is built on. (You can see it exposed in the cloud.binary.ninja UI or the demo.)
Qemu works by translating a binary to an IR then doing stuff with it. Valgrind likewise. There's an optimiser called bolt (associated with facebook) which has the same idea.
Yup, I'm aware of both of those, but none of the tools listed so far are intended for their IR to be human-consumable, unlike disassemblers and decompilers. You think disassembly is verbose compared to a decompiler? Go look at the equivalent VEX (Valgrind's IR) for any non-trivial disassembly. It's suuuper verbose.
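For the curious, it's easy to see for yourself with pyvex (which ships with angr). A sketch, assuming pyvex and archinfo are installed; `show_vex` is my name for it:

```python
def show_vex(code: bytes, addr: int = 0x400000):
    # Imported inside the function so the sketch loads even without
    # pyvex/archinfo installed.
    import pyvex
    import archinfo
    # Lift raw machine code at the given address to a VEX IR superblock.
    irsb = pyvex.lift(code, addr, archinfo.ArchAMD64())
    irsb.pp()  # pretty-prints the block: many IR statements per instruction

# e.g. show_vex(b"\x48\x01\xd8\xc3")  # add rax, rbx ; ret
```

Even those two instructions expand into a long run of temporaries, flag-thunk writes, and guest-state puts, which is exactly why nobody reads VEX for fun.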
As far as I know, BNIL (https://docs.binary.ninja/dev/bnil-overview.html) is the only one that is designed to be readable and it still wouldn't make sense to include it in an IL comparison such as the one done here for decompilation in my opinion.
Speaking of decompilers, would Binary Ninja be a safe bet to pick? I've been told IDA is the gold standard, but it's also expensive for someone who wants to recreationally reverse engineer.
Binja's decompiler is more-or-less fine. It's not as mature as IDA's or Ghidra's, but it's not a bad decompiler.
Though for me the big selling point of Binja is the Intermediate Languages (ILs). High-Level IL is the decompiler output, but you also get Low-Level and Medium-Level ILs as steps between assembly and source. If the decompiler output is a bit funky, you can look at the ILs to get a better idea of what is happening. The ILs are also just much nicer to read than plain assembly, so I tend to use them a lot.
It's a feature that isn't really matched on any other platform. Ghidra and IDA both have a single IL that is more machine-readable compared to Binja's human-readable ones.
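As a rough sketch of what that looks like from the Python API (assuming a recent Binary Ninja with headless access; exact property names vary a bit across versions, and `dump_ils` is my own name):

```python
def dump_ils(path):
    # Binary Ninja's Python API needs a licensed install; imported inside
    # the function so the sketch loads without it.
    import binaryninja
    bv = binaryninja.load(path)
    for func in bv.functions:
        print(func.name)
        # The same function at each level of lift: LLIL -> MLIL -> HLIL.
        for name, il in (("LLIL", func.llil),
                         ("MLIL", func.mlil),
                         ("HLIL", func.hlil)):
            print(f"  [{name}]")
            for insn in il.instructions:
                print(f"    {insn.address:#x}  {insn}")
```

Diffing the three dumps for one funky-looking function is a quick way to see at which lifting stage the decompiler lost the plot.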
Honestly, just use Ghidra. It has its quirks but it's pretty good. And open source.
If it's good enough for the NSA it's probably good enough for recreational use.
The code is open source and has been looked at by several people over the years.
It would be quite hard for the NSA to sneak in a backdoor, though it is never out of the question.
Still, the risk is extremely minuscule compared to the alternatives, which are not even open source.
Very nice. As a parallel: I've been working on an emulator project recently, implementing my own disassembler, and I keep thinking about how I would turn patterns of machine code into a generalized form that could then be turned into something like C-like pseudo-code. So this has really been tempting me lately to implement my own toy decompiler.
BinaryNinja does this. They have several layers of intermediate representations[1], which they build their decompiler on top of. Ghidra does something similar with their PCode: they disassemble to PCode and then decompile the PCode[2].
Love this - I can almost imagine the convincing for other companies wasn't even needed when they realized a small binary size and comparison to competitors would net them more business. A perfect little solution for triaging issues between services and comparing solutions.
That was indeed the logic. The two main commercial solutions included (Binary Ninja, made by Vector 35, where I'm one of the founders) and Hex-Rays both pay for all the hosting costs. And it's not particularly cheap -- there's a fair amount of compute to drive the decompilers, especially as some of them are... not very efficient.
Binary Ninja's queue is likewise empty and keeps up just fine as well. It's not a coincidence that the two commercial products funding it are both confident enough to put their stuff online like this.
> All submitted binaries are saved and made available to any of the authors of the tools used so they may improve their decompilers. If you're such an author who would like access, let us know!
If you believe that content you submit to websites is not examined by interested parties associated with that website, then - I have a bridge to sell you... or perhaps I should say a Google account to give you, free of charge.
> In short: your source code is stored in plaintext for the minimum time feasible to be able to process your request. After that, it is discarded and is inaccessible. In very rare cases your code may be kept for a little longer (at most a week) to help debug issues in Compiler Explorer.
My bias may be showing, being a ctf-scene enthusiast. Most of these (tools on dogbolt) look like foss utilities you can run yourself. The rest, I'd imagine you are welcome to pay for licenses. Binary Ninja in particular, while maybe not cheap for everybody, isn't sky-high.
1. If a third-party does their link-shortening, which gets the program text, then - it doesn't matter how nice they are. And if that party is Google then, well...
2. The language you quoted still allows them to keep effectively all information through mining aspects of it rather than keeping the entire code as a stretch of plain text.
3. If Godbolt or its servers are subject to US law, then there may be National Security Letters which compel it to pass information on to the US government, and to keep that secret. And this is not a conspiracy theory; this is what Snowden exposed about Google, Apple, Microsoft, Yahoo, etc.
So - I respect and like the GodBolt'ers, but you don't have a good guarantee of your data being kept private.
I think they changed it recently, but all of the code you submit is embedded in the URL (after an anchor). So it's stored by Google's link-shortening service, but is resubmitted to the site every time you load it.
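To be clear about the mechanics, this is the general "state in the URL fragment" trick. Illustrative only; this is not Compiler Explorer's actual encoding scheme, just a minimal sketch of the idea using stdlib compression and URL-safe base64:

```python
import base64
import zlib

def encode_fragment(source: str) -> str:
    # Compress the source and encode it so it can live after the '#'.
    packed = zlib.compress(source.encode("utf-8"))
    return base64.urlsafe_b64encode(packed).decode("ascii")

def decode_fragment(fragment: str) -> str:
    packed = base64.urlsafe_b64decode(fragment.encode("ascii"))
    return zlib.decompress(packed).decode("utf-8")

url = "https://example.org/#" + encode_fragment("int main() { return 0; }")
# The fragment is never sent to the server in the HTTP request itself;
# the page's JavaScript reads location.hash and resubmits the code on load.
```

That last point is why the "stored by the link shortener but resubmitted on every load" distinction matters: the shortener sees the full fragment, while an ordinary web server only sees the path.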
The name of this is a reference to the incredibly useful godbolt compiler explorer. If you are interested in this you will likely enjoy the other as well:
Sculpey terracotta would be a fitting choice. It's pretty easy to sculpt when kneaded, bakes in a conventional oven, and keeps its details. Perfect for silicone mold making.
It's never been called anything but either "GCC Explorer" or "Compiler Explorer", by me, anyway... The URL it's accessible for is an accident of the one I had hanging around :) (it's now available at compiler-explorer.com too, but...the name other people use has stuck so I'll never be able to reclaim my own domain...)
I think you _could_ reclaim your own domain if you wanted. You'd want to have a banner at the top with a clear note directing people to the new domain for the compiler explorer, so that people realize immediately that you're not domain squatting. A few people might put up a stink, but I'm pretty confident that most people wouldn't mind, especially since the tool itself is so useful. The name, for those who don't know it as your last name, is fun, but it isn't the reason people use the tool. Eventually, over enough time, people would start remembering the new URL, and you could shrink or remove the banner (and/or put a note elsewhere on the page).
Honestly "godbolt" is so memorable I can find it instantly even though I rarely use it; but "compiler-explorer" sounds like some generic SEO spam site that I'd probably never click on.
Even then, the internet (and even books) is full of "godbolt" links, to the tool itself and to specific code samples. It will take quite some time until all those become irrelevant.
Links to specific examples are less of a problem, as he could redirect those to compiler-explorer.com and just keep that redirect up forever. Really the only URL that would need to be "reclaimed" is https://godbolt.org/, and having a prominent link to compiler-explorer.com there would solve that issue.
OTOH, the godbolt domain is at least not actively used under a number of other TLDs, so getting one of those might be an easier option.
It’s such a memorable name for a tool like that. Other than losing your domain name to the topic, how do you feel about the de facto name?
To a far far lesser degree, I’ve experienced many examples of “you named it X but everyone at work calls it Y and now you have to live with that.” It used to really irk me for some reason.
It is a fantastic name for an equally fantastic tool. The day I found out it was your last name made me chuckle and like it even more. And since I am here: thank you very much for it!
I always call it the compiler explorer but the url, as a sibling comment says, is memorable.
Could be misremembering, but IIRC it was called Compiler Explorer and used to live only on a subdomain of godbolt.org. But, it was so useful that it became presumably vastly higher traffic than the personal homepage part and people often referred to it as just "Godbolt" probably because it sounds cooler and is shorter than saying "Compiler Explorer" (and it may not be obvious the domain name is a last name rather than just a cool name for something.)
To be fair it's an amazing last name and it feels like there probably is a story, it just has to do with this guy's ancestors rather than the assembler tool we all know and love.
It makes for a nice parallel, since the original version of godbolt was just a split tmux session with vim running on one side, and "watch 'gcc -S -o /dev/stdout'" on the other. The main advantage of putting it online is not needing all of the compilers locally.
It might also be a bit of a portmanteau with a second reference to dogpile.com which was a pre-Google "search engine" that compiled search results from multiple search engines. Back in the day you often had to separately search altavista.com, lycos.com, askjeeves.com, yahoo.com, etc. because some of them would work for your query but others would not and it was difficult to predict the performance of any particular search engine, but usually at least one of them would have the result you wanted/needed.
Dogpile was an automated way to search all of the search engines at the same time with one query.