
One of the blog posts I keep meaning to write but never quite get around to is a post arguing that C is not portable assembly. What is necessary is decompilation to a portable C-like assembly, but that target is not C, and I think focusing on creating valid C tends to drag you towards suboptimal decisions, even leaving aside issues like "should SLL decompile to x << y or x << (y % 32)?"

In my experience with Ghidra, I've just seen far too many times where Ghidra starts with wrong types for something and the result becomes gibberish--even just plain dropping stuff altogether. There are some cases where it's clear it's just poor analysis on Ghidra's part (e.g., it doesn't seem to understand stack slot reuse, and memcpy-via-xmm is very confusing to it). And Ghidra's type system lacks function pointer types, which is very annoying when you're doing vtable-heavy C++ code.

I do like the appeal of a recompileable target language. But that language need not be C--in fact, I'm actually sketching out the design of such a language for my own purposes in being able to read LLVM IR without going crazy (which means I need to distinguish between, e.g., add nuw and just plain add).

Analysis necessarily involves multiple levels. Given that a lot of the type analysis today tends to be crap, I'd rather prefer to have the ability to see a more solid first-level analysis that does variable recovery and works out function calling conventions so that it can inform my ability to reverse engineer structures or things like "does this C++ method return a non-trivial struct that is an implicit first parameter?"
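
To illustrate the "implicit first parameter" question, here is a rough C sketch of the two views of the same function. It assumes an ABI that returns large structs through a hidden pointer supplied by the caller (System V x86-64 does this for structs too big for registers); the names make_big, make_big_lowered and hidden_ret are made up for the example, and real C++ ABIs differ in where exactly the hidden pointer goes.

```c
#include <assert.h>

struct big { long a, b, c, d; };

/* Source-level view: the struct is returned by value. */
struct big make_big(long seed) {
    struct big r = { seed, seed + 1, seed + 2, seed + 3 };
    return r;
}

/* Roughly what the binary does under the assumed ABI: the caller passes
 * a hidden pointer to the return slot, and the callee fills it in and
 * hands the same pointer back. A first-level analysis has to notice
 * that hidden_ret is a return slot and not an ordinary argument. */
struct big *make_big_lowered(struct big *hidden_ret, long seed) {
    hidden_ret->a = seed;
    hidden_ret->b = seed + 1;
    hidden_ret->c = seed + 2;
    hidden_ret->d = seed + 3;
    return hidden_ret;
}
```

Getting this wrong shifts every later argument by one position, which is the kind of error that makes the downstream type analysis fall apart.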

(Also, since I'm largely looking at C++ code in practice, I'd absolutely love to be able to import C++ header files to fill in known structure types.)




> should SLL decompile to x << y or x << (y % 32)?

I think this is a bit of a misguided question. The hardware usually has precisely defined semantics. QEMU's << behaves similarly to C's (undefined behavior for rhs >= 32), but this means that the lifter (still QEMU) accounts for this and emits code preserving the semantics.

tl;dr: the code we emit should do the right thing depending on what the original instruction did, without making assumptions on what happens in case of C undefined behaviors.
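
As a minimal sketch of what "doing the right thing" could look like for a 32-bit shift left: the first function assumes an instruction that only looks at the low 5 bits of the shift amount (MIPS SLLV and 32-bit x86 SHL work this way), the second assumes an instruction that yields 0 for shift amounts of 32 or more. The function names are invented for illustration and this is not the code QEMU or our backend actually emits.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed ISA semantics: only the low 5 bits of the shift amount are
 * used. Masking keeps the C shift amount in [0, 31], so the C
 * expression never hits undefined behavior. */
uint32_t sll32_masking(uint32_t x, uint32_t y) {
    return x << (y & 31u);
}

/* Assumed ISA semantics: shifting by 32 or more produces 0. The
 * out-of-range case is handled explicitly instead of being left to
 * whatever the C compiler decides to do. */
uint32_t sll32_saturating(uint32_t x, uint32_t y) {
    return y < 32u ? x << y : 0u;
}
```

Neither version is a plain x << y, and which one is correct depends only on the original instruction, not on C.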

> Ghidra's type system lacks function pointer types

Weird limitation, we support those.

> it doesn't seem to understand stack slot reuse

That's a tricky one. We're now re-designing certain parts of the pipeline to enable LLVM to promote stack accesses to SSA values, which basically solves the stack slot reuse problem. This is probably one of the most important features experienced reversers ask for.
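
A hand-written sketch of why this matters, not actual decompiler output: in the first function a single variable stands in for a stack slot that holds a pointer first and an integer later, so it needs casts; in the second the two lifetimes are split into separately typed variables, which is roughly what promotion to SSA values buys you. All names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Naive recovery: one variable per stack slot. The slot is reused, so
 * the single variable is typed as a raw integer and needs casts. */
unsigned byte_sum_naive(const unsigned char *buf, size_t len) {
    uintptr_t slot = (uintptr_t)(buf + len);      /* use 1: end pointer */
    const unsigned char *end = (const unsigned char *)slot; /* register copy */
    slot = 0;                                     /* use 2: accumulator */
    for (const unsigned char *p = buf; p != end; p++)
        slot += *p;
    return (unsigned)slot;
}

/* After splitting the slot by lifetime: two variables, each with the
 * type that matches how it is used, and no casts. */
unsigned byte_sum_split(const unsigned char *buf, size_t len) {
    const unsigned char *end = buf + len;
    unsigned sum = 0;
    for (const unsigned char *p = buf; p != end; p++)
        sum += *p;
    return sum;
}
```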

> that language need not be C--

Making up your own language is a temptation one should resist.

Anyway, we're rewriting our backend using an MLIR dialect (we call it clift) which targets C but should be good enough to emit something "similar to C but slightly different". It might make sense to have a different backend there. But a "standard C" backend has to be the first use case.

We thought about emitting C++; it would make our lives simpler. But I think targeting non-C as the first and foremost backend would be a mistake.

Also, a Python backend would be cool.

> Analysis necessarily involves...

I would be interested in discussing more what exactly you mean here. Why don't you join our discord server?

> I'd absolutely love to be able to import C++ header files to fill in known structure types

We have a project for importing from header files. Basically we want to use a compiler to turn them into DWARF debug symbols and then import those. Not too hard.


> I do like the appeal of a recompileable target language. But that language need not be C.

Hey! Thanks for the very interesting feedback!

I also strongly feel the appeal of having a decompiler emit a recompilable language. But I want to stress that it's not just appealing for its own sake. It opens up the possibility of consumption by other tools, which is a great opportunity.

Basically, as long as the decompiler only emits some half-baked pseudocode that looks like C and that humans can understand, that "language" is only an output format. It's the end of the journey from the binary. You can look at it, you can reason about it, you can even edit it, change types, and rename stuff, but its final purpose (and the only purpose of any adjustments you make to it) is human consumption and understanding.

Don't get me wrong, human understanding is great, but it has shortcomings, and it doesn't scale.

On the other hand, the very moment a decompiler starts emitting decompiled code in a language that is parsable by other tools, its output stops being the end of the journey. In a way, it becomes yet another intermediate language, at a different level of abstraction, that can be consumed by other tools. Think of any static analysis tool that usually requires access to the source code, except now you can throw the decompiled code at it and get useful information about your binary.

And not hypothetically speaking. At rev.ng we have a PoC where we detect memory bugs like use-after-free in a binary, without access to the original source code, but using CodeQL or clang-static-analyzer on the decompiled C code. With all the nice reports that usually come with these tools, telling you the conditions that must hold during execution for the bug to be triggered. So, it is entirely possible to use C-based source-level static analysis tools to automate at least some part of the grinding analysis job on a binary.

Take this with a grain of salt. It's a PoC. We haven't released it and it's not production grade yet, even if we're planning to show it around :) Also, I'm definitely not saying that it's a silver bullet for every problem, or that it can solve stuff at every level of abstraction. But it's to make a point: decompiling to a recompilable language is a great opportunity to tap the potential of the analysis tools available for that language.

And if that's a direction you want to go, it suddenly becomes very important that the language you decompile to has a large pool of powerful, robust, and battle-tested static analysis tools. That's definitely true for C, not so much for a custom language you roll on your own. Which is not to say your custom language isn't good, but AFAIU from your message you are designing it basically to be able to read LLVM IR yourself without going crazy. So it seems to me to be something designed for your own eyes and mind, not for mass consumption by other analysis tools. And even if it turns out to be good for consumption by other tools, it's hard to beat the amount of engineering effort that has been put into static analysis tools for C, which are already available off the shelf.

So, all in all, I totally agree with you on the appeal of a recompilable target language. On that language being C or not, I really think it depends on what you're trying to do. If you're trying to improve human understanding of the code, in the right conditions, I can see your point. If the decompiled code is just a starting point for other tools, I still think nothing beats C (yet?).

> Ghidra's type system lacks function pointer types

Wow! I think this is really crippling, even without considering C++. I can think of many C codebases where people just do "C-with-classes" with a bunch of structs with function pointer fields.
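
For anyone who hasn't run into the pattern, this is a minimal sketch of what I mean by "C-with-classes". The struct and function names are made up for the example; the point is only that the fields area and scale have function pointer types, and a type system that cannot express them loses the signature of every indirect call through the struct.

```c
#include <assert.h>

/* Hypothetical "C-with-classes" object: the struct carries its own
 * function pointers, playing the role of a tiny vtable. */
struct shape {
    int width;
    int height;
    int (*area)(const struct shape *self);
    void (*scale)(struct shape *self, int factor);
};

int rect_area(const struct shape *self) {
    return self->width * self->height;
}

void rect_scale(struct shape *self, int factor) {
    self->width *= factor;
    self->height *= factor;
}

/* "Constructor": wires the methods into the object. */
struct shape make_rect(int w, int h) {
    struct shape s = { w, h, rect_area, rect_scale };
    return s;
}
```

A call site then looks like r.area(&r), and without the field types a decompiler can only show it as a call through an untyped pointer.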

> the C type system is too powerful for decompilers to robustly lift to, and the resulting code is generally at best filled with distractions of wait-I-can-fix-this excessive casting and at worst just wrong.

> I've just seen far too many times where Ghidra starts with wrong types for something and the result becomes gibberish--even just plain dropping stuff altogether.

Besides the lack of function pointers, which I can't say loud enough how crippling I think it is, I'd be really interested in knowing more about the specifics of your complaints on plain-wrong type recovery. I second the invite to join our Discord server!



