Just from a quick look at their description of how it works, this appears to be essentially a glorified pattern-matching approach to decompilation. That is generally the worst approach to take, since it's extremely brittle: trivial optimization or obfuscation will break it.
A modern decompiler framework tends to take a very different approach. For native executables, the hardest problems are actually disassembly and higher-level type recovery. Bytecode is easier in that the type information is essentially retained and code is very nicely and obviously differentiated from data; no need to worry about jump tables or constant pools.
Control flow for a decompiler is recovered via some form of control flow structuring; the problem is essentially a graph grammar parsing problem, but this is not an easy problem to solve. In practice, it's the existence of goto-like edges (gotos, breaks, and continues) that really screws things up. It's usually easy to tell what the structure should look like without these edges, but identifying which edges fall into this category turns out to be challenging in practice. Especially if you start trying to undo jump threading.
The other interesting issue is working out what the decompiled variables are and should have been. SSA is actually a pretty clean way to separate out different variables that get assigned to the same register (this identification problem exists even in bytecode, as local variable slots do get reused whenever possible), but doing the PHI elimination and finding out where you've got bitcasted type magic going on (more of an issue for native code) can be a challenge.
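To make the SSA idea concrete, here is a toy renaming pass over straight-line three-address code (an illustrative format of my own, not any real decompiler's IR); with no branches there are no phi nodes to place:

    # Toy SSA-style renaming: every fresh assignment to a slot gets a new
    # version number, splitting one reused slot into distinct variables.
    def to_ssa(instructions):
        version = {}

        def use(name):
            # Reads refer to the latest version; bare literals pass through.
            return f"{name}_{version[name]}" if name in version else name

        out = []
        for dest, op, args in instructions:
            args = tuple(use(a) for a in args)
            version[dest] = version.get(dest, 0) + 1
            out.append((f"{dest}_{version[dest]}", op, args))
        return out

    # Slot r0 is reused for two unrelated values; SSA splits it into r0_1 and r0_2.
    code = [("r0", "const", ("1",)),
            ("r1", "add", ("r0", "r0")),
            ("r0", "const", ("2",)),
            ("r2", "mul", ("r0", "r1"))]
    for inst in to_ssa(code):
        print(inst)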
Do note that CPython bytecode has explicit break, continue, and "for loop" instructions, and (of course) the language itself has no goto, so you can unambiguously recover high-level structure from a CFG far more easily than you might with compiled C.
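You can see those instructions with dis; exactly which opcodes show up depends on the interpreter version (SETUP_LOOP and BREAK_LOOP exist only up to CPython 3.7, while GET_ITER/FOR_ITER mark for loops in all versions):

    import dis

    def f(xs):
        total = 0
        for x in xs:      # GET_ITER / FOR_ITER; plus SETUP_LOOP on <= 3.7
            if x < 0:
                break     # BREAK_LOOP on <= 3.7, a plain jump on newer versions
            total += x
        return total

    dis.dis(f)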
Yes and no. If you are sure that you start out with something generated by Python, then, assuming the Python compiler is faithful, you can of course always produce source code that doesn't have gotos in it. If instead someone crafts custom bytecode, like I did here: http://rocky.github.io/pycon2018-light.co/#/ it might not have a high-level Python translation.
When there is a translation, is it always unambiguous? No. The same bytecode could get decompiled as "a and b" or as "if a: b".
The reason this happens less often in Python, especially in earlier versions, is that its code generation is very much template driven, and there is very little in the way of the kinds of compiler optimization that would otherwise make decompiling control flow harder.
Using the example given above, I believe earlier versions of Python would always enclose the "then" statement/expression in "push_block"/"pop_block" pairs, even though in certain situations (when no new variables are introduced in the "then" block) this isn't needed. But the fact that those two additional instructions are there can then distinguish between the two different constructs.
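You can watch this on your own interpreter; how close the two instruction sequences come, and whether any extra block bookkeeping appears, varies by CPython version:

    import dis

    # Compare what your CPython emits for the two superficially different forms.
    dis.dis(compile("a and b", "<stmt>", "exec"))
    print("-" * 30)
    dis.dis(compile("if a: b", "<stmt>", "exec"))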
It is sort of like matching ancestry based on the largely "junk" DNA strands that exist in a person.
Finally, I should note that Python's control structures with "else" clauses, like "for/else", "while/else", and "try/else", make control flow detection much harder. And this is the major weakness of uncompyle6.
That's why I started https://github.com/rocky/python-control-flow to try to address this. In contrast to the current uncompyle6 code, which has a lot of hacky control flow detection, this creates a control flow graph and computes dominators and reverse dominators to help distinguish these high-level structures better.
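For reference, here is the textbook iterative dominator computation over a small CFG; this is the standard algorithm, not python-control-flow's actual code:

    def dominators(cfg, entry):
        # cfg maps each node to its set of successors.
        preds = {n: set() for n in cfg}
        for n, succs in cfg.items():
            for s in succs:
                preds[s].add(n)
        dom = {n: set(cfg) for n in cfg}   # start maximal, then shrink
        dom[entry] = {entry}
        changed = True
        while changed:
            changed = False
            for n in cfg:
                if n == entry:
                    continue
                new = set(cfg)
                for p in preds[n]:
                    new &= dom[p]
                new |= {n}
                if new != dom[n]:
                    dom[n] = new
                    changed = True
        return dom

    # A diamond with a loop back-edge: entry -> a -> (b|c) -> d -> a or exit.
    cfg = {"entry": {"a"}, "a": {"b", "c"}, "b": {"d"}, "c": {"d"},
           "d": {"a", "exit"}, "exit": set()}
    print(sorted(dominators(cfg, "entry")["d"]))  # ['a', 'd', 'entry']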
I'm confused: you say it's taking the worst approach, but then you just list a bunch of difficulties with doing it better... what is the criticism here? When I read your first paragraph I expected you were going to explain how to do it better.
Instead of taking a quick look at the description of how it works, why not spend a little more time to understand it better before passing judgement? I have written http://rocky.github.io/Deparsing-Paper.pdf to help people who want to understand how this fits into the conventional decompiler landscape, and how things are a little different here.
Spoiler alert: it is no more a glorified pattern-matching approach than a dragon-book-style compiler is.
The README mentions that some Dropbox source code is included, though I don't see it in the repo?
We include Dropbox's Python 2.5 bytecode
Dropbox is pretty well known for building their desktop clients in Python. I recall they're stuck on Python 2 and use some obfuscation and opcode remapping with a custom interpreter, so it's a little hard to decompile but still possible. Would love to see what this decompiler can generate for them, though.
> I recall they're stuck on Python 2 and use some obfuscation and opcode remapping with a custom interpreter so its a little hard to decompile but still possible.
Basically, Dropbox remapped the opcodes to different values and shipped RC4-encrypted .pyc files with a custom Python binary that had the key inside it.
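For flavor, undoing such a remap is conceptually just a table lookup per instruction. The table below is made up for illustration, and the sketch assumes CPython 3.6+'s fixed two-byte instructions; Dropbox's Python 2.5 bytecode has variable-length instructions, so it would need a real decoder:

    # Hypothetical custom -> stock opcode table; recovering the real one is
    # the hard part (e.g. by diffing compiles of known-plaintext modules).
    STOCK_FROM_CUSTOM = {0x7A: 0x64, 0x53: 0x47}

    def unremap(co_code: bytes) -> bytes:
        # On CPython 3.6+ each instruction is a two-byte (opcode, oparg)
        # unit, so opcodes sit at even offsets; args pass through untouched.
        out = bytearray(co_code)
        for i in range(0, len(out), 2):
            out[i] = STOCK_FROM_CUSTOM.get(out[i], out[i])
        return bytes(out)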
> There are also customized Python interpreters, notably Dropbox, which use their own magic and encrypt bytecode. With the exception of the Dropbox's old Python 2.5 interpreter this kind of thing is not handled.
Could this be good for converting Python 3 code back to Python 2 (to target CentOS 7 specifically)? The final code doesn't need to look nice, it just needs to run.
Because that would mean all my customers have to find and install Python 3. Most of them won't allow EPEL packages to be installed because they're not supported, and even if they did there's a huge barrier to asking people to add an extra repository.
The difference between 2 and 3 is more than the language: the stdlib changed as well, as did the behavior of some data structures, and some features just can't be ported back (e.g. Unicode handling, async/await, etc.).
But if you don't want your target machines to install anything, then use Nuitka (http://nuitka.net/) on the dev machine to make a standalone executable out of your Python program, and distribute that.
The dev machine needs to be a CentOS 7 box to make sure the result will work on the targets (there are other possibilities, but they're more complicated). Other than that, you won't need to provide anything else. Python will be embedded in the resulting executable, will work transparently, and will not conflict with any existing setup.
As a bonus, the program will run up to 4 times faster, and won't need any admin rights to be installed.
Alternatively, you can use the python-future library (https://python-future.org) to make your program compatible with both Python 2 and 3. But it's much more work.
Unlikely. The main difficult issue between python 2 and python 3 is in library changes, where features such as what type range() returns or the handling of Unicode strings impact code correctness. Transpilation would mostly be an issue of providing the expected library semantics along with a fairly simple AST transformation, which doesn't require the lossiness of compilation/decompilation.
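As a taste of what such an AST transformation looks like, here is a sketch that rewrites f-strings (a 3.6+ feature) into .format() calls that Python 2 can at least parse. It ignores conversion flags and format specs to stay short, and ast.unparse needs Python 3.9+ to run:

    import ast

    class FStringToFormat(ast.NodeTransformer):
        # Rewrite f"...{x}..." into "...{}...".format(x).
        def visit_JoinedStr(self, node):
            self.generic_visit(node)
            parts, args = [], []
            for value in node.values:
                if isinstance(value, ast.FormattedValue):
                    parts.append("{}")
                    args.append(value.value)
                else:  # literal chunk: escape braces so .format leaves them be
                    parts.append(value.value.replace("{", "{{").replace("}", "}}"))
            call = ast.Call(
                func=ast.Attribute(value=ast.Constant("".join(parts)),
                                   attr="format", ctx=ast.Load()),
                args=args, keywords=[])
            return ast.copy_location(call, node)

    tree = FStringToFormat().visit(ast.parse('print(f"2 + 2 = {2 + 2}")'))
    ast.fix_missing_locations(tree)
    print(ast.unparse(tree))  # print('2 + 2 = {}'.format(2 + 2))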
Does anyone use this and have a good example they've round-tripped through the CPython compiler and this decompiler? I would assume variable names are lost, like in most other bytecode compiler languages.
As for round-tripping, I make use of that idea in testing the decompiler and/or looking for bugs (which are numerous). Python has an extensive test suite for its many library programs (in later versions, in excess of 800), which are largely written in Python.
The cool thing about a unit test program is that it is self-checking.
uncompyle6 is far from perfect, so no, it doesn't round-trip all of the unit test programs that it could. (There are some unit test programs that are too fragile, like those that expect line numbers to be exactly the same, as might occur in tests of a debugger.)
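A minimal version of that round-trip smoke test might look like the following; I'm assuming the uncompyle6 command is on PATH and writes the decompiled source to stdout (check the behavior of your installed version):

    import os, py_compile, subprocess, tempfile, textwrap

    # A self-checking program: if the decompiled version runs the assert
    # cleanly, the round trip preserved behavior (not necessarily the text).
    source = textwrap.dedent("""\
        def check():
            total = 0
            for i in range(10):
                if i % 2:
                    continue
                total += i
            assert total == 20
        check()
        """)

    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "sample.py")
        pyc = os.path.join(tmp, "sample.pyc")
        with open(src, "w") as f:
            f.write(source)
        py_compile.compile(src, cfile=pyc)

        recovered = subprocess.run(["uncompyle6", pyc], capture_output=True,
                                   text=True, check=True).stdout
        exec(compile(recovered, "<recovered>", "exec"), {})
        print("round trip OK")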
The biggest problem, as I've said many times, is handling control flow properly. However, that could be solved much better if someone with more time to spend on this wanted to.
Actually, variable names should largely /not/ be lost - Python is insanely dynamic and needs variable/function/etc. names to remain.
At most you might lose the names of local variables inside a function (i.e. variables that, due to internal scope, have a limited lifespan), but member variables, global variables, function names, class names, method names, etc. would all need to be kept in the bytecode.
Considering `globals` and `locals` exist, I would be very surprised if any name got lost. Python is so dynamic you can actually mutate the bytecode of a function at runtime[0] (Lisp-esque), so the compiler can't actually make any assumptions and optimizations based on your source code.
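For instance, here is a minimal sketch of patching a live function's code object (CPython 3.8+, where code objects grew a replace() method):

    def answer():
        return 41

    # Build a new code object with one constant changed, then swap it in.
    code = answer.__code__
    new_consts = tuple(42 if c == 41 else c for c in code.co_consts)
    answer.__code__ = code.replace(co_consts=new_consts)
    print(answer())  # 42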
Ah, that's unfortunate. I sort of assumed the compiler would be able to optimize stack variables and only do the slow thing if `locals` was actually invoked.
It's a common joke to say that if you just use shorter names in C, your program might run faster. It's not true of course, but in Python, it might. All those names are used in hash computations and lookups, and you may shave a few microseconds if you make your program unmaintainable.
Actually, no, shortening names won't speed bytecode up in the way you seem to be suggesting. Variables are indexed by position in arrays like co_varnames, co_names, co_consts and so on. See for example https://docs.python.org/3.8/library/dis.html?highlight=co_na...
There are maps in the code object that will turn the index back into a name, but that's only used when some sort of introspection occurs, such as you might find in a debugger or traceback.
The name found in an "import", of course, is a string, so there might be some small advantage there in a shorter name.
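A quick check with dis shows locals being addressed by slot index, with the (long) name kept alongside only for display and introspection:

    import dis

    def f():
        an_extremely_long_local_variable_name = 1
        return an_extremely_long_local_variable_name

    dis.dis(f)   # STORE_FAST 0 / LOAD_FAST 0; the name is just an annotation
    print(f.__code__.co_varnames)  # ('an_extremely_long_local_variable_name',)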