Ok, one way to do interpretation is to do a loop like while True: instr = decode...

int3 · on March 30, 2012

Thanks for the explanation! It makes sense. I was just wondering though -- why is it necessary for generate_block / decode_instruction to be called on the fly, instead of as a pre-compilation step? I guess you need some sort of runtime information for the compilation?

reginaldo · on March 31, 2012

What you're looking for is called static recompilation. It's what Emscripten [1] does: it statically recompiles LLVM bitcode to JavaScript. Emscripten is a wonderful project, by the way. After compiling from LLVM to JavaScript, it applies an algorithm (the relooper) to the "intermediate" JavaScript code. The result resembles the structure of the original code so that, for instance, if the original code has a loop, the final JavaScript code will also have a loop. It is a true gem, and it's part of what makes Emscripten so awesome and fast.

I believe I could do static recompilation for programs that run in a bare-bones environment and do not employ self-modifying code. The more general problem of finding all executable code in a blob of arbitrary binary data (which I would have to solve if I wanted to run an arbitrary program unmodified, which I do) is equivalent to the Halting Problem, that is, undecidable in general, so that is a no-go. This means that a general solution would have to employ a hybrid approach of static and dynamic recompilation.
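To make the code-discovery problem concrete, here is a minimal sketch on a made-up toy instruction set (the opcodes and both function names are assumptions for illustration, not anything from the actual emulator). A static pass follows only the jumps whose targets are visible in the code, so it stops at an indirect jump, while actually running the program reveals the code behind it.

```python
# Toy instruction set (hypothetical, for illustration only):
#   ("load", n)   put the constant n into the single register
#   ("jmp", a)    direct jump to address a
#   ("jmpr",)     indirect jump to the address held in the register
#   ("halt",)     stop
#   ("data", n)   filler that is never meant to execute

PROGRAM = [
    ("load", 4),   # 0: the jump target is computed at run time
    ("jmpr",),     # 1: indirect jump, target invisible to a static pass
    ("data", 0),   # 2
    ("data", 0),   # 3
    ("load", 7),   # 4: real code, reachable only through the jmpr
    ("halt",),     # 5
]


def discover_static(program, entry=0):
    """Collect addresses reachable by following only statically known flow."""
    found = set()
    worklist = [entry]
    while worklist:
        pc = worklist.pop()
        if pc in found or not 0 <= pc < len(program):
            continue
        found.add(pc)
        op = program[pc][0]
        if op == "jmp":
            worklist.append(program[pc][1])
        elif op in ("jmpr", "halt"):
            continue  # jmpr: target unknown; halt: no successor
        else:
            worklist.append(pc + 1)  # ordinary instruction falls through
    return found


def discover_dynamic(program, entry=0):
    """Run the program and record every address that actually executes."""
    executed = set()
    pc, reg = entry, 0
    while True:
        executed.add(pc)
        instr = program[pc]
        op = instr[0]
        if op == "load":
            reg = instr[1]
            pc += 1
        elif op == "jmp":
            pc = instr[1]
        elif op == "jmpr":
            pc = reg
        elif op == "halt":
            return executed
        else:
            raise ValueError(f"executed non-code at address {pc}")


print(sorted(discover_static(PROGRAM)))   # [0, 1]
print(sorted(discover_dynamic(PROGRAM)))  # [0, 1, 4, 5]
```

In this toy case a smarter static pass could track the constant in the register, but the target of a real indirect jump can depend on arbitrary computation or input, which is where the general problem becomes undecidable.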

For well-behaved programs, generating code on the fly does not add significant overhead (that's what the profiler tells me, if I recall correctly), so I'm not sure the hybrid approach is worth its weight in complexity.
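A rough sketch of why on-the-fly generation tends to be cheap: each block is translated once, on first use, and every later visit is just a cache lookup. The names generate_block and decode_instruction are borrowed from the question above, but their bodies here, the toy opcodes, and the run helper are assumptions for illustration, not the emulator's real code.

```python
# Toy opcodes (hypothetical): every instruction is an (op, arg) pair.
#   ("set", n)  counter = n        ("add", n)  acc += n
#   ("dec", n)  counter -= n       ("jnz", a)  jump to a if counter != 0
#   ("jmp", a)  jump to a          ("halt", None)  stop

PROGRAM = [
    ("set", 3),      # 0
    ("add", 5),      # 1: loop head
    ("dec", 1),      # 2
    ("jnz", 1),      # 3
    ("halt", None),  # 4
]


def decode_instruction(memory, pc):
    """Stand-in decoder: in this toy, memory already holds decoded pairs."""
    return memory[pc]


def generate_block(memory, start):
    """Emit Python source for the straight-line run at `start` and compile it.

    The generated function mutates `state` and returns the next address,
    or None when the program halts.
    """
    lines = ["def block(state):"]
    pc = start
    while True:
        op, arg = decode_instruction(memory, pc)
        pc += 1
        if op == "set":
            lines.append(f"    state['counter'] = {arg}")
        elif op == "add":
            lines.append(f"    state['acc'] += {arg}")
        elif op == "dec":
            lines.append(f"    state['counter'] -= {arg}")
        elif op == "jmp":
            lines.append(f"    return {arg}")
            break
        elif op == "jnz":
            # pc now points at the fall-through instruction
            lines.append(f"    return {arg} if state['counter'] != 0 else {pc}")
            break
        elif op == "halt":
            lines.append("    return None")
            break
        else:
            raise ValueError(f"unknown opcode {op!r} at address {pc - 1}")
    namespace = {}
    exec("\n".join(lines), namespace)
    return namespace["block"]


def run(memory, entry=0):
    """Dispatch loop: translate a block on first use, reuse it afterwards."""
    state = {"acc": 0, "counter": 0}
    cache = {}
    translations = 0
    pc = entry
    while pc is not None:
        block = cache.get(pc)
        if block is None:
            block = cache[pc] = generate_block(memory, pc)
            translations += 1
        pc = block(state)
    return state, translations


state, translations = run(PROGRAM)
print(state["acc"], translations)  # 15 3
```

The loop body runs three times but only three blocks are ever translated (at addresses 0, 1, and 4), and that count stays at three however large the loop counter is. A real implementation would also need to invalidate cached blocks when the program writes over its own code, which this sketch ignores.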

Besides, I don't know if I'd be able to run a full OS, because those need interrupts (in my case, there's currently a timer tick and a UART interrupt when input is entered through the keyboard), and I haven't thought about how to handle them.

[1] https://github.com/kripken/emscripten

int3 · on April 2, 2012

Yup, I'm aware of Emscripten, which was part of the reason behind my question. I hadn't considered the lack of code/data separation when dealing with pure binary code.

I'm actually working on a JVM interpreter in CoffeeScript now, so this topic is particularly interesting to me. Thanks again for taking the time to explain things.