It's quite impressive to read a blog post that analyzes a competitor's announcement carefully (with no ill will), performs tests to show the validity of the analysis, provides the source for the test, and is written cleanly and concisely. I wish more posts on the internet were like this.
Actually, direct threading can be done with C/C++: I just wrote a switch statement with 256 case labels and made sure the switch control variable is unsigned char. With -O3 it gave me exactly the code I wanted - no range checking anymore, since the control variable is unsigned char anyway.
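For anyone who wants to try the switch approach, here is a minimal sketch in C. The opcodes (OP_PUSH, OP_ADD, OP_MUL, OP_HALT) and the function name run_switch are made up for illustration. Only four cases are shown, so a default branch stands in for the other 252. Whether the compiler drops the range check once all 256 values are covered depends on the compiler and flags, so check the generated assembly rather than assuming it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes, for illustration only. */
enum { OP_PUSH = 0, OP_ADD = 1, OP_MUL = 2, OP_HALT = 3 };

/* Runs bytecode with one central switch.
   Returns the top of the stack at OP_HALT, or -1 on an unknown opcode. */
int run_switch(const unsigned char *code)
{
    int stack[64];
    size_t sp = 0;
    const unsigned char *pc = code;

    for (;;) {
        /* 8-bit control variable: it can only hold 256 distinct values. */
        unsigned char op = *pc++;
        switch (op) {
        case OP_PUSH:               /* operand is the next byte */
            assert(sp < 64);
            stack[sp++] = *pc++;
            break;
        case OP_ADD:
            assert(sp >= 2);
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_MUL:
            assert(sp >= 2);
            sp--;
            stack[sp - 1] *= stack[sp];
            break;
        case OP_HALT:
            assert(sp >= 1);
            return stack[sp - 1];
        default:                    /* placeholder for the remaining cases */
            return -1;
        }
    }
}
```

Calling run_switch on the bytes for "push 2, push 3, add, push 4, mul, halt" evaluates (2 + 3) * 4. Note that every opcode still goes through the single dispatch point at the top of the loop.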
Not quite. If you're dispatching based on a sequence of opcodes, that's indirect threading. Direct threading uses a sequence of pointers to the opcode implementations, and jumps directly to them. Also, one of the key benefits of threaded interpreters is that each opcode implementation ends by decoding the next instruction and jumping to its implementation. This duplication of code gives the processor more context for branch prediction, and can lead to fewer pipeline stalls. With a switch statement, the decoding and branching all happen at the top of the loop.
It seems you can't have a direct-threaded interpreter for the JVM or CLR, where the opcodes conform to a specification that can't change between versions of the interpreter.
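To make the distinction concrete, here is a rough sketch of direct threading in C. It relies on the "labels as values" extension (&&label and goto *ptr) supported by GCC and Clang, which is not standard C. The opcodes, MAX_CODE, and the function name run_direct_threaded are hypothetical, and the sketch assumes the program ends with OP_HALT.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes, for illustration only. */
enum { OP_PUSH = 0, OP_ADD = 1, OP_MUL = 2, OP_HALT = 3 };

#define MAX_CODE 64

/* Translates bytecode into an array of handler addresses, then runs it.
   Assumes the program is well formed and ends with OP_HALT. */
int run_direct_threaded(const unsigned char *code, size_t len)
{
    /* Handler addresses indexed by opcode (GNU C extension). */
    static void *const handlers[] = { &&do_push, &&do_add, &&do_mul, &&do_halt };

    void *threaded[MAX_CODE];
    int stack[64];
    size_t sp = 0;

    assert(len <= MAX_CODE);

    /* One-time translation: each opcode becomes a pointer to its handler;
       the operand of OP_PUSH is stored inline in the next slot. */
    for (size_t i = 0; i < len; i++) {
        unsigned char op = code[i];
        assert(op <= OP_HALT);
        threaded[i] = handlers[op];
        if (op == OP_PUSH) {
            assert(i + 1 < len);
            i++;
            threaded[i] = (void *)(intptr_t)code[i];
        }
    }

    void **ip = threaded;

/* Every handler ends with its own fetch-and-jump. */
#define NEXT() goto *(*ip++)

    NEXT();

do_push:
    assert(sp < 64);
    stack[sp++] = (int)(intptr_t)*ip++;
    NEXT();
do_add:
    assert(sp >= 2);
    sp--;
    stack[sp - 1] += stack[sp];
    NEXT();
do_mul:
    assert(sp >= 2);
    sp--;
    stack[sp - 1] *= stack[sp];
    NEXT();
do_halt:
    assert(sp >= 1);
    return stack[sp - 1];

#undef NEXT
}
```

The executed sequence holds pointers rather than opcodes, and each handler has its own indirect jump instead of returning to a shared dispatch point. The translation from bytecode happens once, before execution, so the byte-level opcode encoding is only an input format here.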
That link includes a link to the Lua mailing list post by Mike Pall, the author of LuaJIT. It includes some great info and links to papers if you read a bit in. I found it extremely informative, and not just about the Lua-specific implementation.