Is there any way to use this on a DOS EXE? This would be a lovely tool to port/remaster old games (i.e. decompile, replace rendering methods with modern equivalents, compile again -- an "offline" version of the "online" idea I was going for with http://www.gabrielgambetta.com/remakes.html)
Yesteryear's C compilers weren't great, optimization-wise. But anything remotely modern would just brute force its way through a DOS-era program, well optimized or not.
It probably needs a profile for the MZ format. And probably for 16-bit x86. I'm considering trying it on some .com binaries, to see what it does (since it said it can handle raw binaries). I doubt it'll work immediately, because I doubt that their code handles memory segmentation properly, 16 bit pointers and data, etc.
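To make the "MZ format plus 16-bit segmentation" point a bit more concrete, here is a minimal sketch (not anything RetDec actually does) of reading the 28-byte DOS MZ header and turning a real-mode segment:offset pair into a linear address. The names `MZHeader`, `parse_mz_header`, `linear_address` and `entry_file_offset` are made up for illustration, and relocations and overlays are ignored.

```python
import struct
from collections import namedtuple

# Field names are illustrative; the layout is the classic 28-byte DOS MZ header.
MZHeader = namedtuple(
    "MZHeader",
    "magic last_page_bytes pages relocs header_paragraphs min_alloc max_alloc "
    "ss sp checksum ip cs reloc_table_offset overlay",
)

def parse_mz_header(data: bytes) -> MZHeader:
    """Unpack the fixed part of an MZ header (little-endian 16-bit fields)."""
    if len(data) < 28:
        raise ValueError("too short to be an MZ executable")
    hdr = MZHeader._make(struct.unpack_from("<2s13H", data, 0))
    if hdr.magic not in (b"MZ", b"ZM"):
        raise ValueError("missing MZ signature")
    return hdr

def linear_address(segment: int, offset: int) -> int:
    """Real-mode address: segment * 16 + offset, wrapped to 20 bits as on an 8086."""
    return ((segment << 4) + offset) & 0xFFFFF

def entry_file_offset(hdr: MZHeader) -> int:
    """File offset of the entry point: header size plus CS:IP inside the load image.

    Assumption: CS is treated as a plain unsigned segment relative to the image start.
    """
    return hdr.header_paragraphs * 16 + linear_address(hdr.cs, hdr.ip)
```

A decompiler that assumes a flat address space has to carry the segment half of every far pointer through its analysis, which is the part I'd expect to break first.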
Cool. Finally, a free alternative to the IDA decompiler plugin. IDA is still better due to its interactive nature, i.e. you can explore the code and rename variables/functions as you keep on exploring. I hope this evolves into something like that.
Disassembly is a much easier task than decompilation, since it's a mostly mechanical process. Decompilation requires you to undo the optimizations/transformations the compiler did as it generated the binary, which is much harder.
That said, radare2 is still cool, and a GUI (Cutter) is in the works.
Exactly what I was thinking. I remember my sophomore year of CS undergrad I emailed IDA's support asking if there was a student license or some equivalent to learn how to use it. Props to them for at least replying, but their answer was a firm no and was disappointing to say the least.
It's not written in a Cyrillic language. It's written in English.
Nobody comes along and points out that in Spanish the "g" works differently, and it bugs them to see words with "g" in them.
It's one thing to do faux-Cyrillic and get the letters wrong. It's quite another to do something silly to a latin letter, and get complaints that it resembles a non-latin letter.
It doesn't resemble the Cyrillic letter---it is it. "R" is an English letter but not a Cyrillic one, and "Я" is a Cyrillic letter but not an English one, and by flipping them horizontally you transform them into each other. I imagine for many of the 7,574,303 people in Russia who speak English and probably also a fair chunk of the 854,955 Americans who speak Russian (and presumably have mastered both alphabets), it's annoying. Not a huge deal, just annoying.
There are so many symbols from different languages that resemble each other. If I use a smiley face, that doesn't mean I used a "ü" or a "ツ" from another alphabet just because it looks similar. A backwards R is visually the same as a Cyrillic character, but that doesn't mean I'm writing in Cyrillic, just like a "P" is visually the same as a Cyrillic character but doesn't mean I'm writing in Cyrillic.
Does anyone know why Intel discontinued their Tamper Protection Toolkit? They had an obfuscation compiler that would turn compiled C code into self-encrypting/decrypting code. The idea was that if you disassembled the code at any point you would get mostly garbage instructions. I always wondered how a decompiler could get around that.
Google for anything Rolf Rolles has published on the topic; believe it or not, there are general approaches to solving this. Someone already mentioned dumping the text segment, but that only works for silly 90s-era obfuscators.
Contemporary obfuscators _rewrite_ the protected code as a series of instructions executed on a virtual machine whose bytecode (and bytecode semantics!) are randomly generated at build time. The solution (AIUI) is symbolic execution of the instructions to determine their underlying architectural effect, synthesize some compiler IR that is equivalent to those effects, run an optimization pass (like a regular compiler) over that IR, and finally generate x86 from the result.
The optimization passes are necessary to remove side effects that do not impact the state of the program ("noise"), which modern obfuscators like Themida insert a ton of into the instruction stream.
In other words, rather than attempt to dump some particular part of the program, the binary as a whole is statically analysed to determine, regardless of the indirections inserted by any obfuscation pass, what machine instructions are ultimately executed for a given program input. The abstract representation is then compiled to an equivalent new program which is much easier to read, because all of the indirections and noise have been optimized away.
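As a toy illustration of the "optimize the noise away" step, here is a minimal sketch of backward dead-code elimination over a made-up three-address IR. It assumes straight-line code, no memory writes or flags, and that you already know which registers matter at the end (`live_out`). None of this is taken from Rolles' tooling; the IR shape and the name `eliminate_dead_code` are purely hypothetical.

```python
def eliminate_dead_code(instrs, live_out):
    """Drop instructions whose result is never used.

    instrs:   list of (dest, op, srcs) tuples; srcs mixes register names (str)
              and integer constants.
    live_out: registers whose final values matter to the rest of the program.
    """
    live = set(live_out)
    kept = []
    # Walk backwards: an instruction matters only if its destination is live.
    for dest, op, srcs in reversed(instrs):
        if dest in live:
            live.discard(dest)  # this definition satisfies the later uses
            live.update(s for s in srcs if isinstance(s, str))  # its inputs become live
            kept.append((dest, op, srcs))
    kept.reverse()
    return kept

# Lifted "trace" with junk interleaved, the way an obfuscator might emit it.
lifted = [
    ("eax", "mov", (5,)),
    ("ebx", "mov", (123,)),      # noise
    ("ebx", "xor", ("ebx", 77)), # noise
    ("eax", "add", ("eax", 3)),
    ("ecx", "mov", ("eax",)),    # noise if ecx is not live afterwards
]
cleaned = eliminate_dead_code(lifted, live_out={"eax"})
```

Real obfuscators hide the noise behind memory accesses and flag dependencies, so an actual pass needs a much richer notion of "side effect" than this.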
When I was reading about Rolles' work initially, I couldn't help but imagine this is the kind of approach Geordi La Forge would have come up with if cracking an encrypted binary were ever the plot for an episode of Star Trek :)
iirc it wasn't very advanced. They would 'mov' the decrypted instructions to a region in memory (always the same one), executed it, then save register state and go on to decrypt the next set of instructions.
Breaking it involved monitoring the memory for the decrypted instructions, and dumping them right before they were executed. I don't remember if there were any additional complications with stuff like conditional jumps.
There has been some malware that did just that - it was still possible to record the trace of instructions being executed along with the current instruction pointer to be able to reconstruct the binary quite well.
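The trace idea can be sketched roughly like this, assuming you already have a list of (instruction pointer, raw instruction bytes) pairs from a debugger or emulator. The function name `rebuild_from_trace` is made up, and the sketch assumes each address only ever holds one instruction during the run; if the packer reuses one buffer for every decrypted block, you would need to key on something more than the address.

```python
def rebuild_from_trace(trace):
    """Merge executed instructions into contiguous (start_address, bytes) regions.

    trace: iterable of (address, raw_bytes) pairs, one per executed instruction.
    """
    image = {}
    for addr, raw in trace:
        for i, byte in enumerate(raw):
            image[addr + i] = byte  # later executions overwrite earlier ones

    regions = []
    for addr in sorted(image):
        # Extend the current region if this byte directly follows it.
        if regions and regions[-1][0] + len(regions[-1][1]) == addr:
            regions[-1][1].append(image[addr])
        else:
            regions.append((addr, bytearray([image[addr]])))
    return [(start, bytes(buf)) for start, buf in regions]

# Example trace: nop, a short jump over two bytes, then ret; the nop runs twice.
example_trace = [
    (0x1000, b"\x90"),
    (0x1001, b"\xeb\x02"),
    (0x1005, b"\xc3"),
    (0x1000, b"\x90"),
]
regions = rebuild_from_trace(example_trace)
```

Code that never executed during the recorded run simply won't show up, so coverage depends on the inputs you feed the sample.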
>"As we announced in our Botconf 2017 presentation at the beginning of December (slides), RetDec, our machine-code decompiler, is now open, which means anyone can freely use it, study its source code, modify it, and redistribute it."
Judging by the slides linked above, this looks like it was a really fascinating talk.
Does anybody know when or if this presentation was recorded or if it will be made available? I would love to watch this.
Really cool stuff. I don't like being negative when it comes to fantastic moves like this, but I'm still really disappointed that it doesn't support 64-bit executables.
x86_64 has calling conventions (namely __fastcall) which are more inconvenient to decode than x86 __cdecl or __stdcall, where every argument is passed on the stack. Most symbolic engines work only on x86 for the same reason.
I've used RetDec before; its output is quite nice. I even had some problems with it (doing dumb stuff like putting in executables that were beyond the limits they imposed on their website) and whoever they had supporting it was quite friendly in helping me anyway.
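To show what "more inconvenient to decode" means in practice, here is a small sketch of where the N-th integer or pointer argument lives at function entry under three conventions. The register orders and stack offsets are the standard ones for 32-bit cdecl, Microsoft x64 and System V AMD64; the function name `arg_location` and the convention labels are made up, and floats, structs and varargs are ignored.

```python
MS_X64_INT_REGS = ("rcx", "rdx", "r8", "r9")
SYSV_X64_INT_REGS = ("rdi", "rsi", "rdx", "rcx", "r8", "r9")

def arg_location(convention: str, index: int) -> str:
    """Location of the index-th (0-based) integer/pointer argument at function entry."""
    if convention == "cdecl32":
        # [esp] holds the return address; every argument is a 4-byte stack slot.
        return f"[esp+{4 * (index + 1)}]"
    if convention == "ms_x64":
        if index < len(MS_X64_INT_REGS):
            return MS_X64_INT_REGS[index]
        # Return address (8 bytes) plus 32 bytes of shadow space precede stack args.
        return f"[rsp+{8 + 32 + 8 * (index - 4)}]"
    if convention == "sysv_x64":
        if index < len(SYSV_X64_INT_REGS):
            return SYSV_X64_INT_REGS[index]
        # Only the return address sits below the first stack argument.
        return f"[rsp+{8 + 8 * (index - 6)}]"
    raise ValueError(f"unknown convention: {convention}")
```

With cdecl a decompiler can count stack slots; with the register-based conventions it has to work out which of those registers were actually set as arguments and which just hold leftovers from earlier code.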
Looks like it's also relying on LLVM for disassembly? Ouch; that's an incredibly bad idea if you're trying to analyze malicious or unusual code (it's not designed for that), but I guess it's the easiest for a proof of concept like this.
Although, there's no way an AV company doesn't have its own disassembler, but those are almost always treated as trade secrets (especially the stuff that isn't in the spec / the spec is wrong). They'll probably hook it up to that before doing any real work with it themselves.
> Looks like it's also relying on LLVM for disassembly?
[wild speculation here] I suspect they're using llvm to go from an ast to c(++) code since they have tooling for stuff like that.
Now I have to find me a binary-blob kernel module that manufacturers like to put out and see what the C code it spits out looks like -- another wasted day methinks...
You could just ..uh.. acquire a copy of IDA that has the AMD64 decompiler. It's more mature and spits out C code of wildly varying readability, though only for one function at a time.
This is awesome! I had been using retdec.com since before Avast bought AVG (where RetDec was originally developed). I'm very excited to do away with the limits the website imposed.
Oh, I thought that we do this sort of thing whenever there's a new open source project posted here. People seem to want to know what it's made of and potentially its (S)LOC.
I wish GitHub let me customise which directories to include. My project shows up as 74% Ruby, but a lot of that is some specs that aren't really part of the project, and in reality it's 60% Ruby.
Two single line files that have a file extension making them look like ASP.net code is how I read that.
It could be placeholder scripts that redirect to the correct location, for instance a default.asp/index.asp file that does nothing but redirect to index.html or some other default that IIS doesn't recognise out of the box. This would catch cases where someone has just dumped the web assets on an IIS share (IIS doesn't, or at least didn't use to, consider index.html as a potential default document). In classic ASP this would be something like the line: