Rhine – A Lisp on LLVM (github.com/artagnon)
143 points by artagnon on Sept 6, 2014 | 91 comments


Nice, but the author could join efforts with one of the few attempts to have a native Clojure compiler instead.

Another remark is that it would have been nicer to have the language implemented in itself instead of OCaml.

Other than that, very nice work.


> Nice, but the author could join efforts with one of the few attempts to have a native Clojure compiler instead.

Clojure has a layer of complexity originating from its JVM background. It makes sense not to use the Clojure codebase as is.

The JVM is the reason I am not using Clojure, and I can't be the only one in this position. I need interoperability with native libs; the JVM is just a hindrance for me.

> Another remark is that it would have been nicer to have the language implemented in itself instead of OCaml.

Definitely, but bootstrapping a language is quite a lot of work. It makes a lot of sense to build the "Stage 0" compiler in OCaml or Haskell (which are excellent languages with good libs for building compilers).

If it catches on, I'm sure this project will eventually become self-hosting.


Many modern high-level languages use some kind of VM (most notably Clojure on the JVM, and JavaScript on V8). A VM is a means to provide a JIT, do runtime optimizations, and handle garbage collection. With AOT compilation you don't get a chance to do anything about the generated code at runtime. That said, the startup time of a monster like the JVM is definitely problematic; the approach I'm taking is a thin VM.


> Many modern high-level languages use some kind of VM

I don't think this is a good approach, and it's very nice to see projects like yours that try to shift the paradigm. LLVM can provide many of the benefits of bytecode-based VMs, yet give efficient native code in the end.

How did you go about the GC? Did you grab a GC from somewhere else or build your own? Did you use LLVM shadow stacks and/or GC annotations? I guess Lisp is a lot easier to build a GC for than other languages.

VMs promise many things: performance, safety, and security. In practice, few of those promises were ever fulfilled. Performance can be on par with native code in the best case, but is usually 2-10x behind; safety may be a bit better, but we still get crashes because of incorrect memory use, and the Oracle JVM has had more zero-day vulnerabilities than any other software I can name. I think it's time to move on.


One issue with the current state of affairs with VM runtimes is that most developers without compiler knowledge now think memory-safe runtime == VM.

I really think Java took a wrong turn there, by not providing native compilers as part of the standard toolchain.

At least .NET has NGEN and now will get .NET Native compilers as well.

Yes, the Sun/Oracle JVM has had quite a few security vulnerabilities, but many were caused by the native code in the VM stack, while others were indeed caused by real bugs in library code. Still, way fewer exploits than in C and friends.

Besides, that was one JVM; there are plenty of others out there that share 0% of their implementation with Oracle's.


The plan is to eventually stick with LLVM MCJIT with custom optimizations. There's really no point in producing an AOT binary.

I don't do GC yet.

I suspect it happens because the VM becomes too large and complex for anyone to debug.


Are you doing some fancy region analysis (e.g., like in Harlan) instead of GC?


> I need interoperability with native libs, JVM is just a hindrance for me.

Have you checked out CFFI/Common Lisp? I've heard Chicken Scheme and Bigloo are also quite good when it comes to this.


Actually, there's a hidden motive behind not targeting Clojure: the eventual plan is to allow Elisp-like mutability so it can be used to build an Emacs-like editor.

I chose OCaml because of the relatively mature OCaml-LLVM bindings. A self-hosting compiler is a lot of work: it'll require Rhine-LLVM bindings first.


So you're building an Emacs-like editor on an Elisp-like language. What are you hoping to do differently than Emacs?

(Not being snarky, languages and text editors are both interesting projects and worth experimenting with, and combining the two is especially interesting to me.)


A new editor that copied Emacs' good Lisp support and insane customizability via hooks for everything, but had an interface approachable for regular desktop users (tabs instead of buffers, good CUA-style keybinds out of the box, no surprising weirdness like nonlinear undo, etc.), is about the single most important thing anyone could do to increase the popularity of the Lisp family. Right now, if you want to write Lisp, your choices are:

Emacs, which is utterly unapproachable for newcomers, and will always be uncomfortable for people who grew up with "modern" desktop GUIs, because it and most of its userbase predate established interface standards and they like it that way.

DrRacket, which is a step in the right direction but is Racket-specific, bloated, slow, and somewhat buggy.

LispWorks, which is proprietary and prohibitively expensive unless you sign up for and are allowed to use the crippleware version.

Given those choices, it's not surprising that most people just give up on Lisp and use languages that you could comfortably program in Notepad if you had to.


Why do you need mutability for that? Doesn't LT's approach give the same flexibility without the need for mutation everywhere? Or is it mainly a concern of performance?


I didn't quite understand how LT works. I have the following tasks in mind:

1. Set indent-tabs-mode in all modes which use that variable. So, if I install a hs-mode, the indent-tabs-mode in my .emacs should be the one that it picks up.

2. Add multiple independent hooks to c-mode-hook from different modes.

3. Change the global keymap in different modes, as and when they are loaded.

Aren't mutable variables the most obvious way to solve these problems?


> Aren't mutable variables the most obvious way to solve these problems?

Yes, but the mutation can be entirely limited to the reference. The data structures do not have to be mutable. It's a trade-off, of course, because you can't pass somebody just part of a data structure (like a sub-map) and have them mutate it there, but I think Clojure has shown that in practice this isn't really much of a loss.
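
A minimal sketch of that pattern in Clojure: the atom is the only mutable cell, while the maps inside it stay immutable (the variable names here are illustrative, not from any real editor):

    ;; The atom is the sole point of mutation; its contents are persistent maps.
    (def config (atom {:indent-tabs-mode false}))

    (def old @config)                           ; capture the current value
    (swap! config assoc :indent-tabs-mode true) ; "mutate" by swapping in a new map

    old      ;=> {:indent-tabs-mode false}, untouched
    @config  ;=> {:indent-tabs-mode true}

    ;; The trade-off mentioned above: you can't hand out a sub-map for
    ;; in-place mutation; callers must update through the root reference:
    (swap! config update-in [:c-mode :offset] (fnil inc 0))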


An editor that's fast, modern, and portable to the three major platforms like Sublime, and extensible like Emacs, could compete with Atom and LightTable.


One of the major problems with Emacs is that it's hard to isolate, debug, and rollback Elisp fragments; the only way to fix many problems is to restart Emacs. I still haven't figured out how to solve this problem with Rhine: keep (limited) history for all variables? How do we keep track of the effect of specific function callchains?
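
One possible direction, sketched with Clojure's add-watch (a real API; everything else here is hypothetical): if every editor variable is a reference, a watch can record a bounded history of superseded values for later rollback:

    (def history (atom []))   ; bounded log of old values

    (defn track! [v]
      (add-watch v :history
                 (fn [_key _ref old-val _new-val]
                   (swap! history #(vec (take-last 100 (conj % old-val)))))))

    (def fill-column (atom 80))
    (track! fill-column)
    (reset! fill-column 100)
    @history  ;=> [80]

Attributing each mutation to a specific function callchain is the harder part; the watch would have to capture a stack trace or some dynamic context at mutation time.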


I wish I could write my own editor someday to learn about those things ;)


Fair enough. Thanks for jumping in.


Do you happen to know, from a community perspective, which native Clojure compiler project is farthest along, or the most usable?


I barely follow it, but I think they are all pretty stale at the moment.


OCaml is really good for writing compilers in, so why not use it?


It depends. OCaml is great for compilers, but having the compiler of a language in that same language lowers the barrier of entry for contributors and could lead to more innovation down the road.


I don't see this as an issue; the fact that OCaml is such a great compiler language means that plenty of good compiler writers already know and use OCaml.

Facebook wrote the compiler for their Hack language (basically a better PHP) in OCaml, for example.

Bootstrapped compilers are nice but can come later, after the language gains traction in the community. This language is brand new.


Sure it is, but I think writing bootstrappable compilers is a better approach.


I have never understood this (admittedly common) mindset. If writing a compiler does not play to your language's core strengths, what have you gained by insisting on bootstrapping it? You might argue that writing a compiler provides a good stress test of the language and stdlib, but so would identifying a large project that does play to its strengths, and developing it in parallel with the language.

And contrariwise, bootstrapping a compiler might even be a net negative: not just because you have passed up the chance to use a language better suited to the task, but because you have complicated your build process, created potential morasses when you have bugs in both your language design and your compiler implementation and can't fix them separately, and are implicitly prioritising those features of your design and implementation that are useful for writing compilers, even though that might not be a very large use case at all.

In the days when people were building up from assembly, bootstrapping compilers made a lot of sense. But now that there are existing, well-established languages that have proven their value in this specialised problem domain, it makes no sense to ignore them and insist on bootstrapping for its own sake.


Any Turing complete language can be used to write compilers.

For me, bootstrapping is the best way of testing whether the language design is sound, and of becoming independent of other tooling, as you happen to mention.

Many language designers who don't follow this process never fully experiment with their own language, as they spend most of their time in the compiler.

As for tooling, the over-reliance of many designers on C and similar toolchains has created the wrong belief that only those languages can be used to write compilers, to the point of the famous statement in the language design community: "My compiler compiles yours".

For me, the only place where bootstrapping is to be avoided is DSLs.


The risk is that instead of designing your language with your original goals in mind, you bend the design to suit the task at hand - writing a compiler. You add features that you wouldn't otherwise, because they make that job easier, and then they remain as ugly warts for the rest of the language's life.

Also, compilers are a particular subset of programs - they're pure functions. By bootstrapping do we end up with nothing but languages suited for this kind of problem?

http://tratt.net/laurie/blog/entries/the_bootstrapped_compil...


So Rhine, being a Lisp, may not be a good language for implementing a big pure function involving a lot of tree transformations. Is that what I'm reading?


I'm talking in general.

If your goal is to write a language that is very good for writing compilers, then bootstrapping is probably a great idea. If you have other goals, such as being well suited to CRUD applications, concurrency, parallelism, graphics, you may find that bootstrapping is not the most productive activity.

More broadly, if your goal is to produce a useful language, why would bootstrapping help you achieve that goal, beyond writing any other system in your language?


I certainly can imagine that tradeoff happening, in the abstract at least.

Was there a language/compiler combination that you worked on where you experienced this pitfall first hand, are there historical anecdotes from language/compiler authors, or did you arrive at this a priori?


ML? I believe the genesis of 'Meta Language' and 'Standard Meta Language' was language design being done by people who wrote nothing but experimental compilers. Not sure how often they were bootstrapping, but it would be a similar issue.


Solarsail -- not true. For one example, BNF is a meta-language (a language about language), and it is extremely useful in the real world.

As the author of a computer language expert system, I refer to its rules language as a meta-language (and rules as meta-code) because it manipulates language content symbolically. And, as it happens, it actually executes BNF in order to parse the languages it consumes, the BNF being delivered to it via that same rules language.

Code content represented in my system's Internal Representation (IR) is not itself a meta-language; it's more of a "super-language", being synthesized across real languages. But that IR's architecture does comprise a kind of meta-language, because it is a structure that represents language content in an abstracted way: it is "about" the very concept of a computer language.

As the author of such a system, I do meta every day, meta-meta often, and triple-meta occasionally.


(Sorry I didn't see your reply earlier)

Sounds like an interesting project. I didn't point to ML to say that meta-languages generally, or ML itself, were useless. Rather, that ML is specialized in a way that wouldn't aid in writing things completely unrelated to compilers. (Tho I hadn't thought of using one in an expert system.) Say, writing relational databases, game AIs, or GUI interfaces.


So, what is the point here? That writing a compiler in your new language is a good way of dogfooding? Wouldn't that mean that any other complex application would do? Or is it that most new languages are more powerful than the portable languages (Java, C, JavaScript) they are written in, and should thus ease the pain of compiler writing?

FTR, the first version of gcc was written in Scheme, but even the C++ frontend of gcc is written in C rather than C++.


> That writing a compiler in your new language is a good way of dogfooding? Wouldn't that mean that any other complex application would do?

Yes, but when creating a new language, its own compiler, or at least the runtime/library, is usually the first complex application.

> ... but even the C++ frontend of gcc is written in C rather than C++

It was written in C; now it is C++.

http://lwn.net/Articles/542457/


Rust was originally written in OCaml, and then ported to Rust once things were working well enough.


Rust has an FFI to C, which is how it calls out to LLVM. Can you point me to the part of the codebase that codegens the basic constructs?


I _believe_ that https://github.com/rust-lang/rust/blob/master/src/librustc_l... is the LLVM wrapper, and that https://github.com/rust-lang/rust/tree/master/src/librustc/m... is what does the actual generation, but frankly, I am quite bad with the compiler internals.


I wrote a little about the guts of the Rust compiler earlier this year. Shouldn't be too out of date, should get you pointed in the right direction if you're interested in the details:

http://tomlee.co/2014/04/03/a-more-detailed-tour-of-the-rust...

At the time of writing, Rust would use LLVM to generate object files from LLVM IR, then link 'em together using the system's C compiler.


Aren't the JVM and Java libraries (i.e. "batteries") one of the main selling points of Clojure?


> Aren't the JVM and Java libraries (i.e. "batteries") one of the main selling points of Clojure?

Yeah, but that only appeals to the people who are already committed to the JVM (or .NET), and that is a shrinking population. There's a whole layer of complexity to deal with in managed code and OOP languages, and there's a bit of an impedance mismatch between a Lisp and a Java-style library and the VM itself.

For people working with native code or otherwise not in the JVM ecosystem, Clojure is not really ideal.


> Yeah, but that only appeals to the people who are already committed to the JVM (or .NET), and that is a shrinking population.

Looking at the enterprise world, I would say it is still growing.


The enterprise world isn't exactly known for adopting new languages, JVM or not JVM.

But you could provide some kind of source if you disagree; all the statistics I've seen have had Java and JVM languages in a steady downward slope for the past five years or so.


The source is the projects my employer gets, which I cannot disclose.

It is an international consulting company that does large IT projects, mostly multi-site. You can imagine the possible candidates.

We only get JVM (only Java) and .NET (C# / VB.NET) requests for proposals for greenfield projects. Any other platform requests tend to be pure data spikes.


Couldn't disagree more. I work in the enterprise space and there are plenty of companies using Groovy, Scala, Clojure as well as Python/Ruby on the JVM.

I know it's anecdotal, but what platform do you actually think enterprises are using? Companies are looking for cheaper options, so .NET isn't exactly on fire.


> there are plenty of companies using Groovy, Scala, Clojure as well as Python/Ruby on the JVM

Jython would be by far the least used of those alternatives, and Scala the most used. Clojure's coming up strong, JRuby's making slow inroads, Kotlin's getting some traction at Ceylon's expense, and Groovy's on the slide.


> Companies are looking for cheaper options so .Net isn't exactly on fire.

I work with Fortune 500 companies that have 100% Microsoft stacks.

MSDN yearly licenses are a drop in the ocean in these companies.


Clojure also targets JavaScript and .NET.

One of the selling points of Clojure is that it embraces the libraries of the targeted eco-system.

The Clojure community discussed, some time ago, adding conditional support so that common libraries can be built across multiple runtimes, similar to what Common Lisp offers.


Doesn't targeting LLVM get you this kind of flexibility "for free"?


Yes and no. It can help with different target backends (including JavaScript; not sure about the JVM or .NET), but you still have to do work to make sure that the calling conventions of your language and the target's match.


Sure, but AFAIK many/most of the "supports JavaScript" languages out there rely on Emscripten, so LLVM is involved anyway. By its nature, LLVM gives you more flexibility in the long term than VMs that were intended as a final destination.


> Sure, but AFAIK many/most of the "supports JavaScript" languages out there rely on Emscripten, so LLVM is involved anyway.

Yes, yes. LLVM definitely helps here. But it doesn't come "for free", which is what the GP post asked about. You still need some "glue" code to make it work.


Someone mentioned the JVM interop of Clojure is its main selling point. While it's definitely awesome to have that great integration (I'm a heavy interop user), for me, the main strength is having persistent data structures at its base for everything. I myself have played with the thought of creating a native runtime based on Clojure's syntax and data structures. I can't see that this project uses persistent data structures. Am I wrong? Please enlighten me.


> for me, the main strength is having persistent data structures at its base for everything.

Agreed, but also:

- The ISeq abstraction

- Software Transactional Memory for state management

- Structure-sharing of the persistent structures, for added efficiency/performance

And of course, all the nice reader macros that make things so much easier to read.
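
For anyone who hasn't seen them, a few of those reader shorthands:

    #{1 2 3}      ; set literal
    #"\d+"        ; compiled regex
    #(+ % 1)      ; anonymous function, same as (fn [x] (+ x 1))
    @a            ; deref, read as (deref a)
    'foo          ; quote, read as (quote foo)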

This project is cool, but what I'd rather see is Clojure becoming parasitic, living on all the host VMs it can. We've got the CLR version, and ClojureScript for V8/JS, but we could also have it properly on top of Python, Ruby[1], Lua, Erlang, and LLVM.

[1]: I know Rouge exists, but I'm not sure how well it's progressing.


Clojure itself would also love to become parasitic, just as you described. It's actually one of its explicitly stated goals, if I'm correct.

Apparently, implementing Clojure on LLVM is no easy feat though, judging from previous discussions on the mailing list. [1] There is some work on it already [2,3,4], and I'm sure we'll get there eventually.

[1] https://groups.google.com/forum/#!topic/clojure-dev/bex25u9h... [2] https://github.com/ohpauleez/cljs-terra [3] https://github.com/halgari/clojure-metal [4] https://github.com/halgari/mjolnir


I would say it is also a matter of focus.

A real compiler backend takes a bit more than a weekend project, and it is hard to know the real reasons, technical or personal, for the current state of these projects.


I think the real problem with the native Clojure compilers is that Clojure is designed to be symbiotic with host platforms that offer a much higher level of abstraction. For example, to perform well, all those persistent data structures require an advanced garbage collector. The JVM, the CLR, and the leading JavaScript implementations offer this; it would take a lot of work to implement one in the absence of one of these mature platforms.


> Structure-sharing of the persistent structures, for added efficiency/performance

I think the structural sharing is the definition of persistent data structures, as distinct from immutable data structures in general.


> I think the structural sharing is the definition of persistent data structures, as distinct from immutable data structures in general.

Oh, yes, you are very much correct. I just wanted to highlight that it is itself a feature of Clojure's behavior.


Not directly, but nearly. Structural sharing / copy-on-write is basically the only way to satisfy the actual definition.
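
This is easy to observe in Clojure with lists, where consing shares the tail instead of copying it; a tiny demonstration:

    (def xs (list 1 2 3))
    (def ys (cons 0 xs))        ; O(1): a new cell pointing at xs

    (identical? (rest ys) xs)   ;=> true, the tail is the very same object

    ;; An immutable-but-not-persistent structure would have to copy all
    ;; of xs here rather than share it.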


ISeq is considered by some to be harmful, partly because, due to ISeq overuse, Clojure code is cluttered with unnecessary and costly (each one makes a new copy) "casts" (or what is the proper Clojurian term for this?) from one kind of container to another. There is an opinion that classic lists, sequences (only on lists, vectors, and strings), plus series are good enough.
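
If I understand the complaint, it's about round trips like the following, where seq-returning operations lose the original container type and you pay an O(n) copy to get it back:

    (rest [1 2 3])        ;=> (2 3), a seq, no longer a vector
    (vec (rest [1 2 3]))  ;=> [2 3], but only after copying into a fresh vector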


STM/refs are not really used in the Clojure community; it's a great concept, but in practice almost no one uses it, at least judging from public Clojure projects.


At some point most applications have to have some state. Most Clojure projects are nice functional libraries that work without state, but somewhere there are one or two refs (or atoms). A good example is Datomic, where Rich Hickey said it only has 6 refs. It's 99% persistent, but somewhere you need at least a pointer to your current state. Of course you can push the problem to your database or some libraries, but I really like refs with their optimistic locking, and am using them to get in-memory consistency.
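
The pattern in miniature (the account names are made up; dosync retries the transaction on conflict, which is the optimistic locking mentioned above):

    (def accounts  (ref {:alice 100, :bob 0}))
    (def audit-log (ref []))

    (defn transfer! [from to amount]
      (dosync  ; both refs change together, or not at all
        (alter accounts update-in [from] - amount)
        (alter accounts update-in [to] + amount)
        (alter audit-log conj [from to amount])))

    (transfer! :alice :bob 25)
    @accounts  ;=> {:alice 75, :bob 25}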


Wouldn't you generally want an STM in your domain model in places where it's essential rather than in (ideally generic) open source libraries?


Currently, vectors are the only real data structures; they're naively persistent: operations like rest and cons create a fresh copy (I haven't implemented COW yet). I'm not quite sure how to implement it for `setq` though (a feature not present in Clojure, but something that's necessary for building an editor like Emacs): when a variable is set to a new value, where does the reference to the old value live, for GC and access?


>> ... `setq` though (a feature not present in Clojure...

I am not familiar with Elisp, but from the first Google result it seems that the equivalent is alter!/swap!.


Only special ref objects are mutable in Clojure (via STM). I was talking of `setq` in the context of mutating normal variables. It's perhaps a bad idea, and I have to study how LT manages its plugin system without mutation.


You don't have to mutate them. You can create a new persistent variant and just mutate the ref to point to the new one. Users of that "variable" store the ref instead of the value (well, actually, they get the choice of which to do). If users have held on to references to the old values, the GC doesn't collect them. If they haven't, the GC can collect them just like any other circumstance involving persistent data structures.


Actually, you can have real mutable variables as fields in a type, although you should think twice about doing so in practice as mutable variables are often not the solution. But here's an example!

(deftype SomeTypeWithMutableFields [^:unsynchronized-mutable some-mutable-field ^:volatile-mutable another-mutable-field])

:unsynchronized-mutable corresponds to a normal mutable Java variable, and :volatile-mutable corresponds to a Java variable with the volatile modifier.

They are private by default; you can only set! them within the lexical scope of the deftype definition. If you needed to expose fast host mutation publicly, I would create a setter/getter interface (via definterface), implement it on your deftype, and use set! locally to mutate the field there.
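
Spelling that last suggestion out (IFieldAccess and Holder are made-up names):

    (definterface IFieldAccess
      (getField [])
      (setField [v]))

    (deftype Holder [^:unsynchronized-mutable field]
      IFieldAccess
      (getField [_] field)
      (setField [_ v] (set! field v)))  ; set! is legal here, inside the deftype

    (def h (Holder. 0))
    (.setField h 42)
    (.getField h)  ;=> 42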


The JVM as primary target is the main selling point; just watch the author's talks. Back then, Rich Hickey described the JVM as "the default platform of choice" for "serious development" (read: corporations). OK, I am paraphrasing a bit, but he explicitly praised the JVM.



Which is very interesting, because it's statically typed!

Unlike Rhine, it's not Clojure-inspired, though.


Also, I think eudoxia's request in the README on GitHub was pretty clear about posting to HN. Haha.


I think the holy grail may be the combination of working in a very expressive high level language, but knowing and seeing exactly what that generates at the low level. (And in those few cases where it matters, being able to control/influence/change what it generates at the low level).

So this sounds pretty interesting.


FWIW, SBCL does exactly that. You can defun a function and disassemble it right down to assembly. What this project does is slightly different: it uses LLVM to achieve the same thing.


FYI - that disassemble feature is part of the CL spec - it's not unique to SBCL.

http://www.lispworks.com/documentation/HyperSpec/Body/f_disa...


Yeah, I realize that, but good point. LLVM is very interesting to me, but you're right that it's not as utterly low-level as native assembly. It is the most interesting assembly ("assembly-like"?) language to me at this time.

I didn't know SBCL did that. Cool.


All CLs have a #'disassemble, although its output format is left undefined.


To use LLVM to achieve the same thing would be to find a way to bootstrap SBCL through LLVM, which would, again, be very cool.


There are occasional discussions on sbcl-devel about it, but I'm not aware of anyone committing serious resources to it (it's not really a weekend-project, or even single-GSoC-student, amount of work). There is also some skepticism about whether it's worth doing at all, because of worries that it will be difficult to get good GC performance out of LLVM: http://marc.info/?l=sbcl-devel&m=136219729420475&w=2

(Reference [1] in that email points to: https://github.com/elliottslaughter/rust-gc-notes, which documents difficulties trying to build a precise GC for Rust on top of LLVM.)


Simply doing a correct precise GC with LLVM is still a work in progress: http://www.philipreames.com/Blog/


Taking GC seriously in the design of a low-level compiler-target language was one of the aspects that I found very interesting about the C-- project of Simon Peyton Jones & Norman Ramsey. But it unfortunately didn't get anywhere near the same backing/momentum as LLVM, and now seems defunct.


Indeed, and I found the whole concept to be excellent.

Unfortunately, as far as I can tell, it's defunct outside of its use in GHC. E.g. the Quick C-- compiler was recently archived here: https://github.com/nrnrnr/qc--


Julia has some nice functionality for introspecting both lowered and type-inferred ASTs, the corresponding LLVM IR, and finally the resulting assembly code. (several Lisp systems have this kind of thing too)



I believe Haskell compilers are pretty amenable to stuff like this - watching "Data Parallel Haskell", it was interesting to see how they actually map out contiguous memory, etc. https://www.youtube.com/watch?v=NWSZ4c9yqW8

Sadly, it doesn't seem to be as tightly integrated as what you'd see in SBCL from a developer's PoV, but I'm certainly intrigued by their ability to achieve some very specific implementation while separating it from the expression of the algorithms.


You can add Scheme to the list of languages that do this. CHICKEN in particular works by compiling highly expressive Scheme to C, which can be inspected, modified, etc. Also the CHICKEN compiler has a few dozen switches to tweak output, but it's fully open-source (BSD) if that's interesting.

If C is not "low level" enough, what are you looking for?

(It occurs to me, the C code can be compiled with clang, which provides further opportunities for low-level output.)


What is the reason for using a product of all primitive types for boxing, instead of a tagged union?


Because one type may use more than one field. Vectors and strings use the nr field additionally, and functions use the vector/fenv field additionally. Instead of creating a hierarchy, I figured a flat structure would be easier to understand.


I was going to ask about this as well. It seems like there's a lot of wasted space in how things are represented. The first field already tells you what type of object you're dealing with. Since you're using 64-bit integers, you could represent pointers, bools, ints and floats all in the same memory space. I guess there might be some types that use more than one field, but certainly a bool is not going to.

There's also the issue of alignment with the usage of 1-bit and 1-byte types...


Lisp? So, it runs Maxima?



