Show HN: Open Source emulator in Javascript that runs Linux (ubercomp.com)
65 points by reginaldo on March 29, 2012 | 15 comments



Since there seems to be some interest in the subject, and I don't want to hijack the great thread about the worst Linux PC ever [1], I decided to submit this as a separate thread.

Some things to keep in mind:

1. Very early stage. Runs better on Chrome.

2. If you think this sucks, you're probably comparing my code to Fabrice Bellard's Jslinux [2], and since Bellard is a genius, his code (which is unfortunately still not available in readable form) is probably much better than mine. But progress is being made.

3. Still, the potential is there: soon it will be possible to have a somewhat standard programming stack and write GUI applications that run off a canvas, as I said in [3].

4. For Chrome there is a nice terminal emulator (hterm) that I lifted from the Chromium source tree. The other browsers use a textarea.

5. The code, of course: https://github.com/ubercomp/jslm32/

If anyone has questions, I'll be happy to answer them.

[1] http://news.ycombinator.com/item?id=3767410
[2] http://bellard.org/jslinux
[3] http://news.ycombinator.com/item?id=3769486

EDIT: Added items 4 and 5.


Hey reginaldo, for some reason I cannot enter any text in the terminal, so I cannot log in. I'm using Chrome 18 on OS X Lion. Just in case, I tried downloading the code and running it locally: same result. Any ideas?


Yeah, it's a known bug. The hterm terminal emulator runs in its own iframe, so you'll have to click somewhere on the screen (outside the terminal) first. If that doesn't work, then it's a new bug...


Very nice. Is there any chance of a follow-up on your comment in the previous thread:

"I'll write a post describing what I did to take boot time from 2.5 minutes to 10 seconds."

My interests:

- how did you find what to optimise?
- what approach did you use?


> - how did you find what to optimise?
> - what approach did you use?

I used profilers: the Chrome developer tools and Firebug. The Chrome profiler is very nice and easy to use, but it looks like it is a statistical profiler. When the usual invocation takes less than 1ms, it might get the times wrong (there might be a bottleneck that the Chrome profiler is not showing).

The Firebug profiler is more precise. I believe it traces every executed instruction, which also makes your program much slower to run in it, but it does sometimes give you interesting insights.
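
For what it's worth, both profilers also let you scope the measurement to just the hot region from code, which helps when the page does a lot besides emulation. A tiny sketch (the entry point name run_emulator_slice is invented for the example):

  console.profile("interpreter loop");
  run_emulator_slice();                // hypothetical entry point for the hot loop
  console.profileEnd("interpreter loop");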

Also, there's intuition. I wanted to keep the code readable. The interpreter loop is basically something to the effect of

  while (true) {
    op = decode(mmu.read_32(cs.pc)); // fetch and decode
    cs.pc += 4;
    optable[op.opcode](op);          // dispatch to this opcode's handler
  }
which I believe is very readable, but it runs counter to everything one expects about writing fast Javascript code that runs in a tight loop. Bellard told me he thinks my dynamically recompiled version might be slower than a thoroughly optimized interpreter version, and he may very well be right. But I found the readability of the interpreter to be of the utmost importance, at least until I had a working "dynamic recompilation" version.

To optimize the interpreter version, the profiler showed me that basically two things take up most (about 65-70%) of the time:

* Instruction decoding
* Memory access

There's not much I found to make instruction decoding faster, but the memory access time could be improved. In the beginning, I used the form mmu.read_32(addr) to read from any region of memory, including RAM. Most accesses are of course to RAM, so it made sense to optimize for this fact (despite it being bad for readability).

So nowadays, for memory accesses, I do:

  op = (v8[rpc] << 24) | (v8[rpc + 1] << 16) | (v8[rpc + 2] << 8) | v8[rpc + 3];
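
The general read path then becomes a fast check for RAM with a fallback to the slow device dispatch. This is only a hypothetical sketch (ram_base, ram_size, and the function name read_32 are stand-ins, not the actual code):

  function read_32(addr) {
      // Fast path: the address is in RAM; read straight from the byte view.
      if (addr >= ram_base && addr < ram_base + ram_size) {
          var i = addr - ram_base;
          return ((v8[i] << 24) | (v8[i + 1] << 16) |
                  (v8[i + 2] << 8) | v8[i + 3]) >>> 0; // >>> 0 keeps it unsigned
      }
      // Slow path: device/MMIO accesses still go through the full MMU dispatch.
      return mmu.read_32(addr);
  }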

Some other things I found randomly. The processor in the board I'm emulating has a 75 MHz clock. At that emulated frequency, Linux was spending a lot of time in the "idle task". When I told the emulator that the frequency was about 8 MHz, the problem went away.

Then I thought to myself: it would be neat if I could implement parts of the libc in Javascript. That's how https://github.com/ubercomp/jslm32/blob/master/src/lm32_libc... came to be. So I implemented some string.h functions in Javascript (memcpy, memmove, and memset), and voilà, boot time went down by about a factor of three.
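
The point is that one host-side loop over the RAM view replaces many thousands of fetched, decoded, and dispatched guest instructions. A hypothetical sketch of the memset case (host_memset, v8, and ram_base are illustrative names, not the actual code):

  function host_memset(dest, value, count) {
      var i = dest - ram_base; // translate the guest address to a RAM view index
      var end = i + count;
      value &= 0xff;
      while (i < end) {
          v8[i++] = value;     // one host store per byte, no fetch/decode/dispatch
      }
  }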

But at the end of the day, lm32_libc_dev is a hack with very negative consequences: I would have to modify every program I wanted to run on the emulated board to use my libc functions instead. I did this with the Linux kernel and it was a bad experience. So I focused my energy on the dynamic code generation side of things.

I can tell you I have succeeded: the jslibc thing is now deleted from my local source tree and I haven't experienced any slowdown whatsoever.


Thanks for this, it's a great piece of work and very easy to read. Nice code.

> "it looks like it is a statistical profiler. When the usual invocation takes less than 1ms, it might get the times wrong"

I don't understand this. As long as the statistical profiler can interrupt at any point, surely a routine that runs 10k times at 0.1ms each will still be interrupted a similar number of times as one that runs once for 1s? Or are you thinking that there may be a bias to the sampling (e.g. the method of interrupting means we're not likely to see the first N ops after a method call or something)?

Did you see a significant difference in the profile between the two profiler approaches (firebug/chrome)?

> For memory accesses.

I wonder if your RAM accesses are mostly 32-bit aligned and made 4 bytes at a time? (I guess you could instrument to find out?) If so, you might get a further speedup by having the RAM array stored as 32-bit quantities instead of bytes, doing:

  op = v32[rpc >> 2]; // only works for aligned code

and then having the 8/16-bit RAM reads/writes be special cases on top of that? That could also give a speedup on memmove etc. (4x speedup?), making the libc hack less necessary?


> I don't understand this.

Well, I won't speculate on the inner workings of the Chrome profiler, but what I can say is that profiling in Chrome and in Firefox gives different results (which may of course be caused by differences in the Javascript engines).

> I wonder if your RAM accesses are mostly 32-bit

All accesses on the LM32 processor are aligned: half-word accesses have 2-byte alignment, and word accesses have 4-byte alignment. So in theory it would be easy.

I use ArrayBuffers for the RAM array (where supported). You can have different views of the same ArrayBuffer, so in theory I could have the RAM ArrayBuffer with an 8-bit view, a 16-bit view, and a 32-bit view, each used for accesses of its size. In that sense, you're absolutely right, and I think a good speedup would be achieved by that change (something like 2x, as most accesses are in fact 32-bit).

I don't do that for endianness reasons. The LM32 is big endian, and most devices nowadays are little endian. Unfortunately, the state of ArrayBuffer is still kind of chaotic, in that the endianness of a typed-array view is the endianness of the host (this is done for performance reasons). So I can't do the v32[rpc >> 2] thing. Theoretically, I could if programs had read-write consistency, i.e., if they only used the same access size for reading and writing a given piece of data, which they don't. A program that tests the endianness of the processor, for instance, does not have read-write consistency.
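
To make that concrete, here is the classic endianness probe. It writes through a 32-bit view and reads back through an 8-bit view of the same buffer, so a host-endian 32-bit RAM view would give it the wrong answer when emulating the big endian LM32 on a little endian host:

  var buf = new ArrayBuffer(4);
  var v32 = new Uint32Array(buf);
  var v8 = new Uint8Array(buf);
  v32[0] = 0x01020304;
  // v8[0] is 0x01 on a big endian host and 0x04 on a little endian one.
  console.log(v8[0] === 0x01 ? "big endian" : "little endian");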

I could use a DataView to access memory in an endianness-independent way, but I tried it and it's actually much slower.

To make memory faster, one of two things can be done: 1) emulate a little endian processor instead, or 2) have a fast byteswap operator in Javascript, without a function call (never gonna happen).

So since 2 is never going to happen, the solution is to tackle 1.


Ugly thought - if "almost all" accesses are Read-Write consistent, maybe an exception list of addresses could be used?

Basically do the 4-byte cell approach, and make most of your RAM the "fast" endianness, with some 4-byte cells in the "correct" endianness, because you know they are being accessed in a way that cares. (Since you're interpreting, I'm guessing this is reasonably easily detectable?)

It does put an "is-addr-on-exception-list" check on your memory access path, but if that's a hash lookup it could be OK. If you can further guarantee that your text segments are always fast-endian, then your opcode dispatch loop can still go direct and skip that test.
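
Something like this hypothetical sketch, say (all names invented):

  // RAM stored as host-endian 32-bit words for speed.
  var v32 = new Uint32Array(ram_buffer);
  var needs_guest_order = {}; // word addresses known to be accessed "caringly"

  function read_32_fast(addr) {
      var word = v32[addr >> 2];
      if (needs_guest_order[addr & ~3]) {
          // Rare case: byteswap back to guest (big endian) order.
          word = ((word & 0xff) << 24) | ((word & 0xff00) << 8) |
                 ((word >>> 8) & 0xff00) | (word >>> 24);
      }
      return word >>> 0;
  }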


That would actually work, I guess :). I'm pretty sure the memcpy, memmove, etc. implementations I've seen for this architecture all work by copying one byte at a time (argh!), but that problem can be solved with some smart-if-ugly hackery.

But, as you said yourself, it would be kind of messy, in that it would be hard to be certain that it works in 100% of the cases.

I think it would be easier to just change the toolchain code and make it think the processor is little endian.


You mentioned in the 'Worst Linux PC' thread that your implementation generates JS code on the fly and 'captures up to three backwards jumps'. Do you mind explaining in more detail how this works?


Ok, one way to do interpretation is to have a loop like

  while True:
    instr = decode_instruction(pc)
    execute_instruction(instr)  # also updates pc (next instruction or jump target)

Where execute_instruction is usually a big switch that has every opcode in it or, in my case, a call to a function in a table whose indices are opcodes.
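
In Javascript, the table version looks roughly like this (the opcode constant and register fields are invented for illustration; cs is the cpu state, assumed to be in the enclosing scope):

  var optable = [];
  optable[OP_XOR] = function (op) {
      cs.regs[op.rd] = cs.regs[op.ra] ^ cs.regs[op.rb]; // rd = ra ^ rb
  };
  // ...one entry per opcode; dispatch is then just: optable[op.opcode](op);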

The problem is that you decode the same instructions many times (in a loop, for instance). A slightly more elaborate version is like this:

  while True:
    if pc not in cached_code:
      block = generate_block(pc)
      cached_code[pc] = block
    run_from_cache(pc)

The generate_block function decodes and generates code for every instruction until it finds a jump. When it finds one, it finishes up that block, which is then added to the cache.

Now, whenever there is a jump to an address that is in the cache, the cached code can be executed directly.
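
In Javascript, "generating code" usually means building a source string and compiling it with the Function constructor. This is only a hypothetical sketch of what generate_block might look like (the decode/emit step is elided):

  function generate_block(pc) {
      var src = "var count = 0;\n" +
                "block_loop: while (++count < 3) {\n" +
                "  block_choice: switch (cs.next_pc) {\n";
      // ...decode instructions from pc onward, appending one "case" per
      // instruction, until a jump instruction ends the block...
      src += "  default: break block_loop;\n  }\n}\n";
      return new Function("cs", src); // compile the block into a real JS function
  }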

Finally, the "captures up to three backwards jumps" thing:

My first try was to have the cached code just be a regular function without loops: a block of code is executed and then control is given back to the main loop.

But a significant number of jumps have targets inside the same block of code (anything you run in a loop, for instance). So, to get better performance, I generate, for each block of code, a function that has a loop in it.

It has a rather characteristic look:

  function code_block(cs) { // cs is the "cpu state"
      var count = 0;
      cs.next_pc = BLOCK_START;
      block_loop: while (++count < 3) {
          block_choice: switch (cs.next_pc) {
              case BLOCK_START:
                  cs.r0 = cs.r0 ^ cs.r0;
              case BLOCK_START + 4: // regular instructions fall through
              case CONDITIONAL_BRANCH_ADDR:
                  if (branch_condition) {
                      cs.next_pc = BRANCH_TARGET;
                      break block_choice; // captures backwards jumps targeting the block
                  }
              case UNCONDITIONAL_BRANCH_ADDR:
                  cs.next_pc = BRANCH_TARGET;
              default:
                  break block_loop; // target not in this block: back to the main loop
          }
      }
  }
The "captures up to three backwards jumps things" is controlled by the ++count < 3 loop condition. 3 was obtained by trial and error and looks like a good number for my machine (I could, for instance, put 100 in there, but then my javascript functions would take too much time to return and the javascript engines don't like this).


Thanks for the explanation! It makes sense. I was just wondering though -- why is it necessary for generate_block / decode_instruction to be called on the fly, instead of as a pre-compilation step? I guess you need some sort of runtime information for the compilation?


What you're looking for is called static recompilation. It's what Emscripten [1] does: it statically recompiles LLVM bitcode to Javascript. Emscripten is a wonderful project, by the way. After compiling from LLVM to Javascript, it applies an algorithm (the relooper) to the "intermediate" Javascript code, so that the result resembles the structure of the original: if the original code has a loop, the final Javascript code will also have a loop. It is a true gem, and it's part of what makes Emscripten so awesome and fast.

I believe I could do static recompilation for programs that run in a barebones environment and do not employ self-modifying code. But the more general problem of finding all executable code in a blob of arbitrary binary data (which I would have to solve if I wanted to run arbitrary programs unmodified, which I do) is equivalent to the halting problem, so that is a no-go. This means that, to have a general solution, a hybrid approach of static and dynamic recompilation would have to be employed.

For well-behaved programs, code generation on the fly does not pose a significant overhead (that's what the profiler tells me, if I recall correctly), so I'm not sure the hybrid approach is worth its weight in complexity.

Besides, I don't know if I'd be able to run a full OS that way, because those need interrupts (in my case, currently there's a timer tick and the UART interrupt when input is entered through the keyboard), and I haven't thought about handling that.

[1] https://github.com/kripken/emscripten


Yup, I'm aware of Emscripten, which was part of the reason behind my question. I hadn't considered the issue of lack of code / data separation when dealing with pure binary code.

I'm actually working on a JVM interpreter in Coffeescript now, so this topic is particularly interesting to me. Thanks again for taking the time to explain things.





