Decompilation is often the least important (and least reliable) part of IDA/Ghid...

aleclm · 2024-03-29T07:30:32 1711697432

> Decompilation is often the least important (and least reliable) part of IDA/Ghidra

This is something all people using decompilers say and sort of shows how low is trust towards decompilers. Expectations have always been rather low.

I've been there, but this does not have to be the case, the whole reason why we started rev.ng is to prove that expectations can be raised.

Apart from accuracy, which is difficult but engineering work, why don't decompilers emit syntactically valid C? Have you ever tried to re-compile code from any decompiler? It's a terrible experience.

rev.ng only emits valid C code, and we test it with a bunch of -Wall -Wextra:

https://github.com/revng/revng-c/blob/develop/share/revng-c/...

Other key topic: data structures. When reversing I spend half of the time renaming things and half of the time detecting data structures. The help I get from decompilers in latter is basically none.

rev.ng, by default, detects data structures on the whole binary, interprocedurally, including arrays. See the linked list example in the blog post. We also have plans to detect enums and other stuff.

Clearly we're not there yet, we still need to work on robustness, but our goal is to increase the confidence in decompilers and actually offer features that save time. Certain tools have made progress in improving the UI and the scripting experience, but there's other things to do beyond that.

I see this a bit like the transition from the phase in which C developers where using macros to ensure things were being inlined/unrolled to the phase where they stopped doing that because compilers got smart enough to the right thing and to do it much more effectively.

jcranmer · 2024-03-29T14:50:07 1711723807

Here's my issue with decompilers:

I don't want to look at assembly code. I'd rather see expression trees, expressed in C-like syntax, than trying to piece together variables from two-address or three-address instructions. Looking at assembly tends to lead to brain farts like "wait, was the first or second operand the output operand?" (really, fuck AT&T syntax) or "wait, does ja implement ugt or sgt?"

So that means I want to look at something vaguely C-like. But the problem is that the C type system is too powerful for decompilers to robustly lift to, and the resulting code is generally at best filled with distractions of wait-I-can-fix-this excessive casting and at worst just wrong. And when it's wrong, I have to resort to staring at the assembly, which (for Ghidra at least) means throwing away a lot of the notes I've accumulated because they don't correlate back to underlying assembly.

So what I really want isn't something that can emit recompilable C code, that's optimizing for something that doesn't help me in the end. What I want is robust decompilation to something that lets me ignore the assembly entirely. I'm a compiler writer, I can handle a language where integers aren't signed but the operands are.

aleclm · 2024-03-29T15:32:25 1711726345

I 120% agree with what you're saying, but emitting valid C is kinda part of what you're asking, in design terms.

Our goal is: omit all the casts that can be omitted without changing the semantics according to C. In fact, we have a PR doing exactly this (still on the old repo, hopefully it will go in soon).

But, how can you expect to be able to be strict with what C allows you to do implicitly, if you're not even emitting valid C? For instance, thanks to the fact that we emit valid C, we could test if the assembly emitted by a compiler is the same before and after removing redundant casts.

My point is that emitting valid C is kind of a prerequisite for what you're asking, a rather low bar to pass, but that, in practice, no mainstream decompiler passes. It's pretty obvious the decompiled code will often be redundant and outright wrong if you don't even guarantee it's syntactically valid. Then clearly it's not a panacea, but it's an important design criterion and shows the direction we want to go.

As for comments: we still haven't implemented inline comments, but they will be attached to program addresses, so they will be available both in disassembly and decompiled C. It's not very hard to do, but that needs some love.

jcranmer · 2024-03-29T16:53:45 1711731225

One of the blog posts I keep meaning to write but never quite get around to is a post that C is not portable assembly. What is necessary is decompilation to a portable C-like assembly, but that target is not C, and I think focusing on creating valid C tends to drag you towards suboptimal decisions, even leaving aside issues like "should SLL decompile to x << y or x << (y % 32)?"

In my experience with Ghidra, I've just seen far too many times where Ghidra starts with wrong types for something and the result becomes gibberish--even just plain dropping stuff altogether. There are some cases where it's clear it's just poor analysis on Ghidra's part (e.g., it doesn't seem to understand stack slot reuse, and memcpy-via-xmm is very confusing to it). And Ghidra's type system lacks function pointer types, which is very annoying when you're doing vtable-heavy C++ code.

I do like the appeal of a recompileable target language. But that language need not be C--in fact, I'm actually sketching out the design of such a language for my own purposes in being able to read LLVM IR without going crazy (which means I need to distinguish between, e.g., add nuw and just plain add).

Analysis necessarily involves multiple levels. Given that a lot of the type analysis today tends to be crap, I'd rather prefer to have the ability to see a more solid first-level analysis that does variable recovery and works out function calling conventions so that it can inform my ability to reverse engineer structures or things like "does this C++ method return a non-trivial struct that is an implicit first parameter?"

(Also, since I'm largely looking at C++ code in practice, I'd absolutely love to be able to import C++ header files to fill in known structure types.)

aleclm · 2024-03-29T17:28:02 1711733282

> should SLL decompile to x << y or x << (y % 32)?

I think this a bit of a misguided question. The hardware has a precise semantic defined, usually. QEMU's << behaves similarly to C (undefined behavior for rhs > 32), but this means that the lifter (still QEMU) will account for this and emit code preserving the semantics.

tl;dr: the code we emit should do the right thing depending on what the original instruction did, without making assumptions on what happens in case of C undefined behaviors.

> Ghidra's type system lacks function pointer types

Weird limitation, we support those.

> it doesn't seem to understand stack slot reuse

That's a tricky one. We're now re-designing certain parts of the pipeline to enable LLVM to promote stack accesses to SSA values, which basically solves the stack slot reuse. This is probably one of the most important features experienced reversers ask for.

> that language need not be C--

Making up your own language is temptation one should resist.

Anyway, we're rewriting our backend using an MLIR dialect (we call it clift) which targets C but should be good enough to emit something "similar to C but slightly different". It might make sense to have a different backend there. But a "standard C" backend has to be the first use case.

We thought about emitting C++, it would make our life simpler. But I think targeting non-C as the first and foremost backend would be a mistake.

Also, a Python backend would be cool.

> Analysis necessarily involves...

I would be interested in discussing more what exactly you mean here. Why don't you join our discord server?

> I'd absolutely love to be able to import C++ header files to fill in known structure types

We have a project for importing from header files. Basically we want use a compiler to turn them into DWARF debug symbols and then import those. Not too hard.

pfez · 2024-03-29T23:49:23 1711756163

> I do like the appeal of a recompileable target language. But that language need not be C.

Hey! Thanks for the very interesting feedback!

I also strongly feel the appeal of having a decompiler emit a recompilable language. But I want to stress that's not just appealing for it's own sake. It opens up the possibility of consumption by other tools, which is a great opportunity.

Basically, until the decompiler only emits some half-baked pseudocode that looks like C and humans can understand, that "language" is only an output format. It's the end of the journey from the binary. You can look at it, you can reason about it, you can even edit it change types and rename stuff, but its final purpose (and the only purpose of any adjustments you do to it) is for human consumption and understanding.

Don't get me wrong, human understanding is great, but it has shortcomings, and it doesn't scale.

On the other hand, the very moment a decompiler starts emitting decompiled code in a language that is parsable from other tools, its output stops being the end of the journey. In a way, it becomes yet another intermediate language, at a different level of abstraction, that can be consumed by other tools. Think any static analysis tool that usually requires having access to the source code, except now you can throw the decompiled code at it and get useful information about your binary.

And not hypothetically speaking. At rev.ng we have a PoC where we detect memory bugs like use-after-free in a binary, without access to the original source code, but using CodeQL or clang-static-analyzer on the decompiled C code. With all the nice reports that usually come with these tools, telling you the conditions that must be verified during the execution in order for the bug to be triggered. So, it is entirely possible to use C-bases source-level static analysis tools to automate at least some part of the grinding analysis job on a binary.

Take this with a grain of salt. It's a PoC. We haven't realeased it and it's not production grade yet, even if we're planning to show it around :) Also, I'm definitely not saying that's a silver bullet for every problem, or that it can solve stuff at every level of abstraction. But it's to make a point: decompiling to a recompilable language is a great opportunity to tap the potential of the analysis tools available for that language.

And if that's a direction you want to go, it suddenly becomes very important that the language you decompile to has a large pool of powerful robust and battle-tested static analysis tools. That's definitely true for C, not so much for a custom language you roll on your own. Which is not to say your custom language isn't good, but AFAIU from your message you are designing it basically for being able to better read LLVM IR yourself without going crazy. So it seems to me to be something designed for your own eyes and mind, not for mass consumption form other analysis tools. And even if it turns out to be good for consumption by other tools, it's hard to beat the amount of engineering effort that has been put into static analysis tools for C, that already available off the shelf.

So, all in all, I totally agree with you on the appeal of a recompilable target language. On that language being C or not, I really think it depends what you're trying to do. If you're trying to improve human understanding of the code, in the right conditions, I can see your point. If the decompiled code is just a starting point for other tools, I still think nothing beats C (yet?).

> Ghidra's type system lacks function pointer types

Wow! I think this is really crippling, and even without considering C++. I can think of many C codebases where people just do "C-with-classes" with a bunch of struct with function pointer fields.

> the C type system is too powerful for decompilers to robustly lift to, and the resulting code is generally at best filled with distractions of wait-I-can-fix-this excessive casting and at worst just wrong.

> I've just seen far too many times where Ghidra starts with wrong types for something and the result becomes gibberish--even just plain dropping stuff altogether.

Besides the lack of function pointers, which I can't say loud enough how crippling I think it is, I'd be really interested in knowing more about the specifics of your complaints on plain-wrong type recovery. I second the invite to join our Discord server!

j-krieger · 2024-03-29T10:20:01 1711707601

What happens if you put in a binary which outputs C-like machine code, like Rust (llvm) or zig?

aleclm · 2024-03-29T11:16:10 1711710970

Languages with a rich standard library and generating a lot of code for you usually need some love to get rid/represent idiomatically common patterns and to detect common data structures.

We haven't looked into it yet, but the automatic data structure recognition might help.

Frankly, Rust looks particularly scary: https://media.ccc.de/v/37c3-11684-rust_binary_analysis_featu...

tux3 · 2024-03-29T12:27:13 1711715233

Oh, very nice! I've dealt with forsaken deeply abstract vtable mazes of hell, but the idea of using a ton of sum types, dynamic dispatch, async everywhere, and long iterator chains would make for some deliciously unreadable binaries!

Sesse__ · 2024-03-29T12:40:53 1711716053

> Other key topic: data structures. When reversing I spend half of the time renaming things and half of the time detecting data structures. The help I get from decompilers in latter is basically none.

That's funny, because I've used both Hex-Rays and Ghidra, and gotten lots of help with data structures. The interactivity really helps a bunch with filling in the blanks.

aleclm · 2024-03-29T12:47:00 1711716420

In IDA you basically have only detection of stack frame layout (in a quite confusing fashion) and "create struct out of this pointer", which is something you have to do manually and its intraprocedural.

Imagine this being done automatically, across all of the binary. If you pass a pointer to another function the type is correct and you build the type from all the functions using it.

Then obviously the user needs to fix things, but boostrapping can definitely be hugely improved.

Sesse__ · 2024-03-29T22:32:11 1711751531

I'm sure user-defined structs can benefit from combining information from multiple functions, but saying that what you get today is “basically none” is a bit of an overstatement. Also, the special (and important!) case of operating system ABI structs is great, and that information propagates throughout function calls.

saagarjha · 2024-03-29T09:28:53 1711704533

Curious what you do when you encounter an instruction you don't model

aleclm · 2024-03-29T09:35:41 1711704941

That's unlikely, since we use QEMU as a lifter, which sometimes supports new instructions before they hit silicon.

However, I think we'll emit a call to some `noreturn` function. Basically we emit a call to `abort`.

saagarjha · 2024-03-29T09:46:40 1711705600

Right but you do see how this means that you need to lift code that has semantics that cannot be modeled in C?

aleclm · 2024-03-29T10:13:37 1711707217

Sure, in those cases we emit calls to C functions. The only thing we need to know is what registers are taken as input, what registers are output and what registers are preserved.

In QEMU parlance, these are helper functions, and they have actual implementations. But for decompilation purposes, you don't need to implement them. You just need to know how they interact with the registers.

vient · 2024-03-29T04:58:10 1711688290

Huh, for me as a malware analyst previously and a reverse engineer in general, decompilation is the most important part of such tools. It's all about speed, pseudo-C of some kind lets you roughly understand what's going on in a function in seconds. I guess you can become pretty fast with assembly too, but C is just a lot more dense.

Regarding reliability, I would say that Hex-Rays is pretty reliable (at least for x86) if you know its limitations, like throwing away all code in catch blocks. Usually wrong decompilation is caused by either wrong section permissions, or wrong function signature, both of them can be fixed. It can have bad time when stack frame size goes "negative" or some complex dynamic stack array logic is involved, which are usually signs of obfuscation anyway.

It was less reliable 10 years ago though.. Also even now hex-rays weirdly does not support some simple instructions like movbe.

saagarjha · 2024-03-29T03:58:36 1711684716

I hear this a lot and in my experience people who Ghidra or IDA and don’t use the decompiler are exceptionally rare. Why would you suffer that when you can use something else for what you actually want?

dvzk · 2024-03-29T04:54:04 1711688044

I didn't say I never use it, just that it's not always the core feature. This will depend heavily on your field, but in my past work, the features that were way more essential are: scripting (+ IR lifting), xrefs, CFGs, labels/notes (in a persistent DB).

In my experience decompilers will totally ignore or fail on certain types of malicious code, so they mainly exist to assist disassembly analysis. And for that purpose, they save us an incredible amount of human hours.

aleclm · 2024-03-29T07:12:21 1711696341

For scripting, our approach is to give you access to the project file (just a YAML file), and you can make changes from any scripting language you want. Everything the user can customize is in there, all the rest is deterministically produced from that file.

I really disliked the fact that you usually need to buy into the version of Python that $TOOL requires you to use, or the fact itself that you need to use a specific language.

Can parse YAML? You're mostly done.

The "project file" is what we call the model: https://docs.rev.ng/user-manual/model-tutorial/

For xrefs, CFG and the rest: we have all of that in the UI, but we also produce them in a rich way. For instance, when we emit disassembly and decompiled code, we actually emit plain text + HTML-like markup to provide metainformation for navigation (basically, xrefs) and highlighting. So you can use all that from any language that can parse HTML/XML. It's called PTML: https://docs.rev.ng/references/ptml/

For lifting: we use LLVM IR as our internal representation. This means that: 1) you don't have to learn an IR that no one else uses, 2) you can use off the shelf tools (e.g., KLEE for symbolic execution) but you can also use all the standard LLVM optimizations and analyses and 3) you can recompile it, but we're not into the binary translation business anymore.

znpy · 2024-03-29T08:04:36 1711699476

> 3) you can recompile it, but we're not into the binary translation business anymore

How comes?

aleclm · 2024-03-29T08:21:37 1711700497

Short answer: if you want to execute a program (maybe with some instrumentation, for fuzzing purposes) it's much easier to adopt a dynamic approach (i.e., emulation or virtualization). With static binary translation you can get better performance, but there's a lot of other things you need to get 100% right and that with a dynamic approach are a given (e.g., the CFG).

There's much more space of improvement in the field of analyzing code (as opposed to running it), so we're investing our energies there.

Then we're strong believers in integrating dynamic and static information, for instance see PageBuster: https://rev.ng/blog/pagebuster

But other than that, static binary translation is a feature of rev.ng in maintenance mode.