Hacker News new | past | comments | ask | show | jobs | submit login

Is the code up anywhere? I’d be interested in studying a lisp compiler.



Yes:

http://www.kylheku.com/cgit/txr/tree/share/txr/stdlib/

Files compiler.tl, optimize.tl. asm.tl.

Main entry points into it are the functions compile and compile-file, as well as compile-toplevel which is public and used by those two.


There's some interesting syntax in those files that I haven't seen before. What's ^((,me.(get-sidx 'exception-subtype-p) mean? Specifically, dot followed by open paren. Is it a method call?

In my own lisp, I settled on a convention like (foo (.bar 1) (.baz 2)) but that might be an interesting alternative.

^ seems like quasiquote. Is it?


Yes, ^ is quasiquote. The only other weird thing is that ,* is splicing, and not ,@. With that mapping, your quasiquote skills transfer.

` cannot be quasiquote because it's used for quasistrings. @ is also a notational prefix with a number of uses.

When an earmuffed special variable has to be inserted inserted without splicing, it is written , *variable* with extra whitespace. ,**variable* inserts *variable* with splicing.

   obj.(fun arg ...)
is indeed a method call.

Within certain constraints, a.b.c.d stands for (qref a b c d). The elements can be symbols or compound expressions. obj.(fun arg) is (qref obj (fun arg)). The qref macro transforms that into the method calls and slot accesses. From time to time qref appears in backquote templates. The syntax ^(lorem ipsum ,var.(abc)) calls the abc method on var and inserts the return value into the template, which is wrong if what we want is for just the value of var to be inserted into the template, in order to generate a method call on whatever syntax var stands for. That can only be directly written as ^(lorem ipsum (qref ,foo (abc))).

  ^(... ,me.(get-sidx 'exception-subtype-p) ...)
means that we call the get-sidx method on the me object with that exception-subtype-p symbol as its argument. The return value of that method calls is inserted.

get-sidx builds a table of symbols which have integer indices, referenced from the virtual machine code. The catch needs to be able to call exception-subtype-p at run-time, so it nails down an integer index for it and generates a gcall instruction referencing it.

The symbol index looks like a vector in the compiled VM description; the VM turns it into an array which provides efficient access to global bindings: the symbol is looked up the first time the index position is accessed, and the binding is forwarded into the table for faster access next time.


The objection I have about me.(get-sidx 'foo) as opposed to me (.get-sidx 'foo) is that get-sidx will be expanded by symbol macros, which you almost never want. Ditto for me.sidx: sidx will be substituted by any symbol macro named sidx, whereas me .sidx won't. (me.sidx can be expanded into me .sidx automatically, but most lisps, including my own, tend to expand it to something like ((idx me sidx)) which turned out to be a mistake, since sidx ends up expanded in later macroexpansion passes.)

It's not a theoretical objection, FWIW. I've been bitten by this in practice, and wince each time it comes up.

Cool lisp. What's it called?


This is TXR Lisp.

I assure you that me.(get-sidx 'foo) is (qref me (get-sidx 'foo)), where (get-sidx 'foo) is not an ordinary function call, but a macro argument that is never treated as a form to be expanded and evaluated:

From the repl:

  1> (expand 'me.(get-foo 'bar))
  (call (slot me 'get-foo)
        me 'bar)
So as you can see get-foo turns into a quoted symbol used for a slot lookup. The qref macro receives (get-foo 'bar) unevaluated and does this simple transformation.

Don't let the repetition of me fool you; a gensym will be used when it matters:

  2> (expand '(car x).(get-foo 'bar))
  (let ((#:g0012 (car x)))
    (call (slot #:g0012 'get-foo)
          #:g0012 'bar))
Likewise

  3> (expand 'me.sidx)
  (slot me 'sidx)
Again, by the me.sidx -> (qref me sidx) -> (slot me 'sidx) route. me.sidx is read syntax; nothing sees that but the parser. You can just pretend it doesn't exist and read it as (qref me sidx), just like you can read 'foo as (quote foo).

In any expression where some symbol sidx is susceptible for being treated as a symbol macro, that (almost) implies it is being treated as an evaluated form, which would be the bad thing. If we have some (. obj sidx) where sidx could be expanded, it means that sidx is being evaluated as a variable, which is the bigger problem. If the . operator can prevent that evaluation, surely it can prevent the expansion.

The reason for "(almost)" is that macros can do anything. We can have a (mac foo) which symbol-macro-expands foo, but doesn't evaluate it.

I have an operator like that in TXR Lisp called equot: expand macro, but suppress evaluation ("expand-quote").

  1> (symacrolet ((a (+ 2 2)))
       a)
  4
  2> (symacrolet ((a (+ 2 2)))
       (equot a))
  (+ 2 2)
But me.sidx doesn't do anything like equot on sidx; just regular quote!


I have my own Lisp compiler as well if you are interested: https://bernsteinbear.com/blog/lisp/


Bonus points for (a) being (essentially) a literate program, and (b) being in C. Thanks!


Plus, it has a nice exercise for he reader:

Comment above Compile_let:

  // This is let, not let*. Therefore we keep track of two environments -- the
  // parent environment, for evaluating the bindings, and the body environment,
  // which will have all of the bindings in addition to the parent. This makes
  // programs like (let ((a 1) (b a)) b) fail
But, let and let* can easily be handled by the same logic, with just some conditional adjustments. In some ways it is easier than interpreting both let and let* with the same interpreter code.

Here is why: in a compiler, we can fold the let* bindings into a single environment. The abstract model is that each new binding is a new environment which sees the previous bindings. But in fact, that doesn't have to be the reality.

To compile let and let* with the same logic all we have to do is:

- create an empty environment for the variables

- add variables to that environment as we iterate over the init expressions: after treating each init expression, we then add its corresponding variable to the environment

Then the difference between let and let* is that:

- for let, compile each init expression in the original, incoming environment

- for let*, compile each init expression in the new environment.

The extend-as-you-go logic takes care of the scoping.

(Note that because let* usefully allows duplicate variables, you have to make it possible to add the same variable to the env more than once, and be sure that the lookup finds the most recent one.)

The down-side of the flat environment is that a lexical closure captured by the init form of an earlier variable will capture later variables. The closure's body cannot see those variables, but the garbage collector probably can.

    (let* ((light-object (cons 1 2))
           (fun (lambda () light-object))
           (expensive-object (allocate-lots)))
       ...)
If fun escapes, the environment object it references will hold a reference onto expensive-object, even though the lambda only refers to light-object, unless the implementation goes out of its way to fix this. Whereas, if multiple cascaded environments were used (or a naive alist) then that entire part of the environment that holds expensive-object would become garbage, if the only object that is reachable is that lambda.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: