Just from a quick look at their description of how it works, this appears to be essentially a glorified pattern-matching approach to decompilation. That is generally the worst approach to take, since it's extremely brittle: trivial optimization or obfuscation will break it.
A modern decompiler framework tends to take a very different approach. For native executables, the hardest problems are actually disassembly and higher-level type recovery. Bytecode is easier in that the type information is essentially retained and code is very nicely and obviously differentiated from data; no need to worry about jump tables or constant pools.
Control flow for a decompiler is recovered via some form of control flow structuring; the problem is essentially a graph grammar parsing problem, but this is not an easy problem to solve. In practice, it's the existence of goto-like edges (gotos, breaks, and continues) that really screws things up. It's usually easy to tell what the structure should look like without these edges, but identifying which edges fall into this category turns out to be challenging in practice. Especially if you start trying to undo jump threading.
The other interesting issue is working out what the decompiled variables are and should have been. SSA is actually a pretty clean way to separate out different variables that get assigned to the same register (this identification problem exists even in bytecode, as local variable slots do get reused whenever possible), but doing the PHI elimination and finding out where you've got bitcasted type magic going on (more of an issue for native code) can be a challenge.
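To make the SSA idea concrete, here is a toy renaming pass over straight-line three-address code (an illustrative format of my own, not any real decompiler's IR); with no branches there are no phi nodes to place:

    # Toy SSA-style renaming: every fresh assignment to a slot gets a new
    # version number, splitting one reused slot into distinct variables.
    def to_ssa(instructions):
        version = {}

        def use(name):
            # Reads refer to the latest version; bare literals pass through.
            return f"{name}_{version[name]}" if name in version else name

        out = []
        for dest, op, args in instructions:
            args = tuple(use(a) for a in args)
            version[dest] = version.get(dest, 0) + 1
            out.append((f"{dest}_{version[dest]}", op, args))
        return out

    # Slot r0 is reused for two unrelated values; SSA splits it into r0_1 and r0_2.
    code = [("r0", "const", ("1",)),
            ("r1", "add", ("r0", "r0")),
            ("r0", "const", ("2",)),
            ("r2", "mul", ("r0", "r1"))]
    for inst in to_ssa(code):
        print(inst)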
Do note that CPython bytecode has explicit break, continue, and "for loop" instructions, and (of course) the language itself has no goto, so you can unambiguously recover high-level structure from a CFG far more easily than you might with compiled C.
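You can see those instructions with dis; exactly which opcodes show up depends on the interpreter version (SETUP_LOOP and BREAK_LOOP exist only up to CPython 3.7, while GET_ITER/FOR_ITER mark for loops in all versions):

    import dis

    def f(xs):
        total = 0
        for x in xs:      # GET_ITER / FOR_ITER; plus SETUP_LOOP on <= 3.7
            if x < 0:
                break     # BREAK_LOOP on <= 3.7, a plain jump on newer versions
            total += x
        return total

    dis.dis(f)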
Yes and no. If you are sure that you start out with something generated by Python, then, assuming the Python compiler is faithful, you can of course always produce source code that doesn't have gotos in it. If instead someone crafts custom bytecode, like I did here: http://rocky.github.io/pycon2018-light.co/#/ it might not have a high-level Python translation.
When there is a translation, is it always unambiguous? No. The same bytecode could get decompiled as "a and b" or as "if a: b".
The reason this happens less often in Python, especially in earlier versions, is that its code generation is very much template driven, and there is very little in the way of the kinds of compiler optimization that would otherwise make decompiling control flow harder.
Using the example given above, I believe earlier versions of Python would always enclose the "then" statement/expression in "push_block"/"pop_block" pairs, even though in certain situations (when no new variables are introduced in the "then" block) this isn't needed. But the fact that those two additional instructions are there can then distinguish between the two different constructs.
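You can watch this on your own interpreter; how close the two instruction sequences come, and whether any extra block bookkeeping appears, varies by CPython version:

    import dis

    # Compare what your CPython emits for the two superficially different forms.
    dis.dis(compile("a and b", "<stmt>", "exec"))
    print("-" * 30)
    dis.dis(compile("if a: b", "<stmt>", "exec"))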
It is sort of like matching ancestry based on the largely "junk" DNA strands that exist in a person.
Finally, I should note that Python's control structures with "else" clauses, like "for/else", "while/else", and "try/else", make control flow detection much harder. And this is the major weakness of uncompyle6.
That's why I started https://github.com/rocky/python-control-flow to try to address this. In contrast to the current uncompyle6 code, which has a lot of hacky control flow detection, this creates a control flow graph and computes dominators and reverse dominators to help distinguish these high-level structures better.
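For reference, here is the textbook iterative dominator computation over a small CFG; this is the standard algorithm, not python-control-flow's actual code:

    def dominators(cfg, entry):
        # cfg maps each node to its set of successors.
        preds = {n: set() for n in cfg}
        for n, succs in cfg.items():
            for s in succs:
                preds[s].add(n)
        dom = {n: set(cfg) for n in cfg}   # start maximal, then shrink
        dom[entry] = {entry}
        changed = True
        while changed:
            changed = False
            for n in cfg:
                if n == entry:
                    continue
                new = set(cfg)
                for p in preds[n]:
                    new &= dom[p]
                new |= {n}
                if new != dom[n]:
                    dom[n] = new
                    changed = True
        return dom

    # A diamond with a loop back-edge: entry -> a -> (b|c) -> d -> a or exit.
    cfg = {"entry": {"a"}, "a": {"b", "c"}, "b": {"d"}, "c": {"d"},
           "d": {"a", "exit"}, "exit": set()}
    print(sorted(dominators(cfg, "entry")["d"]))  # ['a', 'd', 'entry']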
I'm confused: you say it's taking the worst approach, but then you just list a bunch of difficulties with doing it better... what is the criticism here? When I read your first paragraph I expected you were going to explain how to do it better.
Instead of taking a quick look at the description of how it works, why not spend a little more time to understand it better before passing judgement? I have written http://rocky.github.io/Deparsing-Paper.pdf to help people who want to understand how this fits into the conventional decompiler landscape, and how things are a little different here.
Spoiler alert: it is no more a glorified pattern-matching approach than a dragon-book-style compiler is.
The README mentions that some Dropbox source code is included, though I don't see it in the repo?
We include Dropbox's Python 2.5 bytecode
Dropbox is pretty well known for building their desktop clients in Python. I recall they're stuck on Python 2 and use some obfuscation and opcode remapping with a custom interpreter, so it's a little hard to decompile but still possible. Would love to see what this decompiler can generate for them, though.
> I recall they're stuck on Python 2 and use some obfuscation and opcode remapping with a custom interpreter so its a little hard to decompile but still possible.
Basically, Dropbox remapped the opcodes to different values and shipped RC4-encrypted .pyc files with a custom Python binary that had the key inside it.
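For flavor, undoing such a remap is conceptually just a table lookup per instruction. The table below is made up for illustration, and the sketch assumes CPython 3.6+'s fixed two-byte instructions; Dropbox's Python 2.5 bytecode has variable-length instructions, so it would need a real decoder:

    # Hypothetical custom -> stock opcode table; recovering the real one is
    # the hard part (e.g. by diffing compiles of known-plaintext modules).
    STOCK_FROM_CUSTOM = {0x7A: 0x64, 0x53: 0x47}

    def unremap(co_code: bytes) -> bytes:
        # On CPython 3.6+ each instruction is a two-byte (opcode, oparg)
        # unit, so opcodes sit at even offsets; args pass through untouched.
        out = bytearray(co_code)
        for i in range(0, len(out), 2):
            out[i] = STOCK_FROM_CUSTOM.get(out[i], out[i])
        return bytes(out)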
> There are also customized Python interpreters, notably Dropbox, which use their own magic and encrypt bytecode. With the exception of the Dropbox's old Python 2.5 interpreter this kind of thing is not handled.
Could this be good for converting Python 3 code back to Python 2 (to target CentOS 7 specifically)? The final code doesn't need to look nice, it just needs to run.
Because that would mean all my customers have to find and install Python 3. Most of them won't allow EPEL packages to be installed because they're not supported, and even if they did there's a huge barrier to asking people to add an extra repository.
The difference between 2 and 3 is more than the language: the stdlib changed as well, as did the behavior of some data structures, and some features just can't be ported back (e.g. Unicode handling, async/await, etc.).
But if you don't want your target machines to install anything, then use Nuitka (http://nuitka.net/) on the dev machine to make a standalone executable out of your Python program, and distribute that.
The dev machine needs to be a CentOS 7 box to make sure the result will work on the targets (there are other possibilities, but they're more complicated). Other than that, you won't need to provide anything else. Python will be embedded in the resulting executable, will work transparently, and will not conflict with any existing setup.
As a bonus, the program will run up to 4 times faster, and won't need any admin rights to be installed.
Alternatively, you can use the python-future library (https://python-future.org) to make your program compatible with both Python 2 and 3. But it's much more work.
Unlikely. The main difficult issue between python 2 and python 3 is in library changes, where features such as what type range() returns or the handling of Unicode strings impact code correctness. Transpilation would mostly be an issue of providing the expected library semantics along with a fairly simple AST transformation, which doesn't require the lossiness of compilation/decompilation.
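As a taste of what such an AST transformation looks like, here is a sketch that rewrites f-strings (a 3.6+ feature) into .format() calls that Python 2 can at least parse. It ignores conversion flags and format specs to stay short, and ast.unparse needs Python 3.9+ to run:

    import ast

    class FStringToFormat(ast.NodeTransformer):
        # Rewrite f"...{x}..." into "...{}...".format(x).
        def visit_JoinedStr(self, node):
            self.generic_visit(node)
            parts, args = [], []
            for value in node.values:
                if isinstance(value, ast.FormattedValue):
                    parts.append("{}")
                    args.append(value.value)
                else:  # literal chunk: escape braces so .format leaves them be
                    parts.append(value.value.replace("{", "{{").replace("}", "}}"))
            call = ast.Call(
                func=ast.Attribute(value=ast.Constant("".join(parts)),
                                   attr="format", ctx=ast.Load()),
                args=args, keywords=[])
            return ast.copy_location(call, node)

    tree = FStringToFormat().visit(ast.parse('print(f"2 + 2 = {2 + 2}")'))
    ast.fix_missing_locations(tree)
    print(ast.unparse(tree))  # print('2 + 2 = {}'.format(2 + 2))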
Does anyone use this and have a good example they've round-tripped through the CPython compiler and this decompiler? I would assume variable names are lost, like in most other bytecode compiler languages.
As for round-tripping, I make use of that idea in testing the decompiler and/or looking for bugs (which are numerous). Python has an extensive test suite for its many library programs (in later versions, in excess of 800), which are largely written in Python.
The cool thing about a unit test program is that it is self-checking.
uncompyle6 is far from perfect, so no, it doesn't round-trip all of the unit test programs that it could. (There are some unit test programs that are too fragile, like those that expect line numbers to be exactly the same, as might occur in tests of a debugger.)
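A minimal version of that round-trip smoke test might look like the following; I'm assuming the uncompyle6 command is on PATH and writes the decompiled source to stdout (check the behavior of your installed version):

    import os, py_compile, subprocess, tempfile, textwrap

    # A self-checking program: if the decompiled version runs the assert
    # cleanly, the round trip preserved behavior (not necessarily the text).
    source = textwrap.dedent("""\
        def check():
            total = 0
            for i in range(10):
                if i % 2:
                    continue
                total += i
            assert total == 20
        check()
        """)

    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "sample.py")
        pyc = os.path.join(tmp, "sample.pyc")
        with open(src, "w") as f:
            f.write(source)
        py_compile.compile(src, cfile=pyc)

        recovered = subprocess.run(["uncompyle6", pyc], capture_output=True,
                                   text=True, check=True).stdout
        exec(compile(recovered, "<recovered>", "exec"), {})
        print("round trip OK")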
The biggest problem, as I've said many times, is handling control flow properly. However, that could be solved much better if someone with more time to spend on this wanted to.
Actually, variable names should largely /not/ be lost - Python is insanely dynamic and needs variable/function/etc. names to remain.
At most you might lose the names of local variables inside a function (i.e. variables that, due to internal scope, have a limited lifespan), but member variables, global variables, function names, class names, method names, etc. would all need to be kept in the bytecode.
Considering `globals` and `locals` exist, I would be very surprised if any name got lost. Python is so dynamic you can actually mutate the bytecode of a function at runtime[0] (Lisp-esque), so the compiler can't actually make any assumptions and optimizations based on your source code.
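For instance, here is a minimal sketch of patching a live function's code object (CPython 3.8+, where code objects grew a replace() method):

    def answer():
        return 41

    # Build a new code object with one constant changed, then swap it in.
    code = answer.__code__
    new_consts = tuple(42 if c == 41 else c for c in code.co_consts)
    answer.__code__ = code.replace(co_consts=new_consts)
    print(answer())  # 42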
Ah, that's unfortunate. I sort of assumed the compiler would be able to optimize stack variables and only do the slow thing if `locals` was actually invoked.
It's a common joke to say that if you just use shorter names in C, your program might run faster. It's not true of course, but in Python, it might. All those names are used in hash computations and lookups, and you may shave a few microseconds if you make your program unmaintainable.
Actually, no, shortening names won't speed bytecode up in the way you seem to be suggesting. Variables are indexed by position in arrays like co_varnames, co_names, co_consts and so on. See for example https://docs.python.org/3.8/library/dis.html?highlight=co_na...
There are maps in the code object that will turn the index back into a name, but that's only used when some sort of introspection occurs, such as you might find in a debugger or traceback.
The name found in an "import", of course, is a string, so there might be some small advantage there in a shorter name.
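A quick check with dis shows locals being addressed by slot index, with the (long) name kept alongside only for display and introspection:

    import dis

    def f():
        an_extremely_long_local_variable_name = 1
        return an_extremely_long_local_variable_name

    dis.dis(f)   # STORE_FAST 0 / LOAD_FAST 0; the name is just an annotation
    print(f.__code__.co_varnames)  # ('an_extremely_long_local_variable_name',)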