> Nice, but the author could join efforts with one of the few attempts to have a native Clojure compiler instead.
Clojure has a layer of complexity originating from its JVM background. It makes sense not to use the Clojure codebase as is.
The JVM is the reason I am not using Clojure, and I can't be the only one in this position. I need interoperability with native libs; the JVM is just a hindrance for me.
> Another remark is that it would have been nicer to have the language written in itself instead of OCaml.
Definitely, but bootstrapping a language is quite a lot of work. It makes a lot of sense to build the "Stage 0" compiler in OCaml or Haskell (both excellent languages with good libraries for building compilers).
If it catches on, I'm sure this project will eventually become self-hosting.
Many modern high-level languages use some kind of VM (most notably Clojure on the JVM, and JavaScript on V8). A VM is a means to provide a JIT, runtime optimizations, and garbage collection. AOT compilation means you don't get a chance to do anything about the generated code at runtime. That said, the startup time of a monster like the JVM is definitely problematic; the approach I'm taking is a thin VM.
> Many modern high-level languages use some kind of VM
I don't think this is a good approach and it's very nice to see projects like yours that try to shift the paradigm. LLVM can provide many of the benefits of bytecode based VMs, yet give efficient native code in the end.
How did you go about the GC? Did you grab a GC from somewhere else or build your own? Did you use LLVM shadow stacks and/or GC annotations? I guess Lisp is a lot easier to build a GC for than other languages.
VMs make many promises: performance, safety, and security. In practice, few of those promises have been fulfilled. Performance can be on par with native code in the best case but is usually 2-10x behind; safety may be a bit better, but we still get crashes from incorrect memory use, and the Oracle JVM has had more zero-day vulnerabilities than any other software I can name. I think it's time to move on.
One issue with the current state of affairs with VM runtimes is that most developers without compiler knowledge now think memory-safe runtime == VM.
I really think Java took a wrong turn there by not providing native compilers as part of the standard toolchain.
At least .NET has NGEN and will now get .NET Native compilers as well.
Yes, the Sun/Oracle JVM has had quite a few security vulnerabilities, but many were caused by native code in the VM stack, while others were indeed caused by real bugs in library code. Still, far fewer exploits than in C and friends.
Besides, that was one JVM; there are plenty of others out there that share none of Oracle's implementation.
Actually, there's a hidden motive behind not targeting Clojure: the eventual plan is to allow Elisp-like mutability so Rhine can be used to build an Emacs-like editor.
I chose OCaml because of the relatively mature OCaml-LLVM bindings. A self-hosting compiler is a lot of work: it'll require Rhine-LLVM bindings first.
So you're building an Emacs-like editor on an Elisp-like language. What are you hoping to do differently than Emacs?
(Not being snarky, languages and text editors are both interesting projects and worth experimenting with, and combining the two is especially interesting to me.)
A new editor that copied emacs' good Lisp support and insane customizability via hooks for everything but had an interface that was approachable for regular desktop users (tabs instead of buffers, good CUA-style keybinds out of the box, no surprising weirdness like nonlinear undo, etc) is about the single most important thing anyone could do to increase the popularity of the Lisp family. Right now, if you want to write Lisp, your choices are:
emacs, which is utterly unapproachable for newcomers, and will always be uncomfortable to use for people that grew up with "modern" desktop GUIs, because it and most of its userbase predate established interface standards and they like it that way.
DrRacket, which is a step in the right direction but is Racket-specific, bloated, slow, and somewhat buggy.
LispWorks, which is proprietary and prohibitively expensive unless you sign up for and are allowed to use the crippleware version.
Given those choices, it's not surprising that most people just give up on Lisp and use languages that you could comfortably program in Notepad if you had to.
Why do you need mutability for that? Doesn't LT's approach give the same flexibility without the need for mutation everywhere? Or is it mainly a concern of performance?
I didn't quite understand how LT works. I have the following tasks in mind:
1. Set indent-tabs-mode in all modes that use that variable. So if I install hs-mode, it should pick up the indent-tabs-mode set in my .emacs.
2. Add multiple independent hooks to c-mode-hook from different modes.
3. Change the global keymap in different modes, as and when they are loaded.
Aren't mutable variables the most obvious way to solve these problems?
> Aren't mutable variables the most obvious way to solve these problems?
Yes, but the mutation can be entirely limited to the reference. The data structures do not have to be mutable. It's a trade-off, of course, because you can't pass somebody just part of a data structure (like a sub-map) and have them mutate it there, but I think Clojure has shown that in practice this isn't really much of a loss.
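A minimal sketch of that idea in Clojure (the `config` atom and its keys are just illustrative): mutation is confined to the reference, while every value it points to stays immutable.

```clojure
;; The "variable" is an atom (a mutable reference); the maps it holds
;; are ordinary immutable Clojure maps.
(def config (atom {:indent-tabs-mode false}))

;; "Mutating" the variable just swaps in a new immutable map.
(swap! config assoc :indent-tabs-mode true)

;; Readers deref the atom for the current immutable snapshot.
(:indent-tabs-mode @config) ;; => true
```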
One of the major problems with Emacs is that it's hard to isolate, debug, and roll back Elisp fragments; the only way to fix many problems is to restart Emacs. I still haven't figured out how to solve this problem with Rhine: keep (limited) history for all variables? How do we keep track of the effects of specific function call chains?
It depends. OCaml is great for compilers, but having the compiler of a language in that same language lowers the barrier of entry for contributors and could lead to more innovation down the road.
I don't see this as an issue; the fact that OCaml is such a great language for compilers means that plenty of good compiler writers already know and use it.
Facebook wrote the compiler for their Hack language (basically a better PHP) in OCaml, for example.
Bootstrapped compilers are nice but can come later, after the language gains traction in the community. This language is brand new.
I have never understood this (admittedly common) mindset. if writing a compiler does not play to your language's core strengths, what have you gained by insisting on bootstrapping it? you might argue that writing a compiler provides a good stress test of the language and stdlib, but so would identifying a large project that does play to its strengths, and developing it in parallel with the language.
and contrariwise, bootstrapping a compiler might even be a net negative, not just because you have passed up on the chance to use a language that is better suited to the task, but because you have complicated your build process, created potential morasses when you have bugs with both your language design and your compiler implementation and you can't fix them separately, and are implicitly prioritising those features of your design and implementation that are useful for writing compilers, even though that might not be a very large use case for it at all.
in the days where people were building up from assembly, bootstrapping compilers made a lot of sense. but now that there are existing, well established languages that have proven their value in the specialised problem domain, it makes no sense to ignore them and insist on bootstrapping for its own sake.
Any Turing complete language can be used to write compilers.
For me, bootstrapping is the best way of testing whether the language design is sound, and of becoming independent of other tooling, as you happen to mention.
Many language designers who don't follow this process never fully experiment with their own language, because they spend most of their time in the compiler.
As for tooling, the over-reliance of many designers on C and similar toolchains has created the wrong belief that only those languages can be used to write compilers, to the point of the famous statement in the language design community: "my compiler compiles yours".
For me, the only place where bootstrapping should be avoided is DSLs.
The risk is that instead of designing your language with your original goals in mind, you bend the design to suit the task at hand - writing a compiler. You add features that you wouldn't otherwise, because they make that job easier, and then they remain as ugly warts for the rest of the language's life.
Also, compilers are a particular subset of programs - they're pure functions. By bootstrapping do we end up with nothing but languages suited for this kind of problem?
So Rhine, being a Lisp, may not be a good language for implementing a big pure function involving a lot of tree transformations. Is that what I'm reading?
If your goal is to write a language that is very good for writing compilers, then bootstrapping is probably a great idea. If you have other goals, such as being well suited to CRUD applications, concurrency, parallelism, graphics, you may find that bootstrapping is not the most productive activity.
More broadly, if your goal is to produce a useful language, why would bootstrapping help you achieve that goal, beyond writing any other system in your language?
I certainly can imagine that tradeoff happening, in the abstract at least.
Was there a language/compiler combination that you worked on where you experienced this pitfall first hand, are there historical anecdotes from language/compiler authors, or did you arrive at this a priori?
ML? I believe the genesis of 'Meta Language' and 'Standard Meta Language' was language design being done by people who wrote nothing but experimental compilers. Not sure how often they were bootstrapping, but it would be a similar issue.
Solarsail -- not true. For one example, BNF is a meta-language (a language about language), and it is extremely useful in the real world.
As the author of a computer language expert system, I refer to its rules language as a meta-language (and rules as meta-code) because it manipulates language content symbolically. And, as it happens, it actually executes BNF in order to parse the languages it consumes, the BNF being delivered to it via that same rules language.
Code content represented in my system's Internal Representation (IR) is not itself a meta-language; it's more of a "super-language", being synthesized across real languages. But that IR's architecture does comprise a kind of meta-language; because it is a structure that represents language content in an abstracted way, it is "about" the very concept of a computer language.
As the author of such a system, I do meta every day, meta-meta often, and triple-meta occasionally.
Sounds like an interesting project. I didn't point to ML to say that meta-languages generally, or ML itself, are useless. Rather, that ML is specialized in a way that wouldn't aid in writing things completely unrelated to compilers. (Though I hadn't thought of using one in an expert system.) Say, writing relational databases, game AIs, or GUI interfaces.
So, what is the point here?
That writing a compiler in your new language is a good way of dogfooding? Wouldn't that mean that any other complex application would do?
Or is it that most new languages are more powerful than the portable languages (Java, C, JavaScript) they are written in, and should thus ease the pain of compiler writing?
FTR, the first version of gcc was written in Scheme, but even the C++ frontend of gcc is written in C rather than C++.
I wrote a little about the guts of the Rust compiler earlier this year. Shouldn't be too out of date, should get you pointed in the right direction if you're interested in the details:
> Aren't the JVM and Java libraries (i.e. "batteries") one of the main selling points of Clojure?
Yeah, but that only appeals to people who are already committed to the JVM (or .NET), and that is a shrinking population. There's a whole layer of complexity to deal with in managed code and OOP languages, and there's a bit of an impedance mismatch between a Lisp and a Java-style library and the VM itself.
For people working with native code or otherwise not in the JVM ecosystem, Clojure is not really ideal.
The enterprise world isn't exactly known for adopting new languages, JVM or not JVM.
But you could provide some kind of source if you disagree; all the statistics I've seen have had Java and JVM languages on a steady downward slope for the past five years or so.
The source is the projects my employer gets, which I cannot disclose.
It is an international consulting company that does large IT projects, mostly multi-site. You can imagine the possible candidates.
We only get JVM (only Java) and .NET (C# / VB.NET) requests for proposals for greenfield projects. Any other platform requests tend to be pure data spikes.
Couldn't disagree more. I work in the enterprise space and there are plenty of companies using Groovy, Scala, Clojure as well as Python/Ruby on the JVM.
I know it's anecdotal, but what platform do you actually think enterprises are using? Companies are looking for cheaper options, so .NET isn't exactly on fire.
> there are plenty of companies using Groovy, Scala, Clojure as well as Python/Ruby on the JVM
Jython would be by far the least used of those alternatives, whereas Scala the most used. Clojure's coming up strong, JRuby's making slow inroads, Kotlin's getting some traction at Ceylon's expense, whereas Groovy's on the slide.
Clojure also targets JavaScript and .NET.
One of the selling points of Clojure is that it embraces the libraries of the targeted ecosystem.
The Clojure community discussed adding conditional support some time ago, to be able to build common libraries across multiple runtimes, similar to what Common Lisp offers.
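That discussion is presumably about what later shipped as reader conditionals in Clojure 1.7 (`.cljc` files); roughly, a form per target platform is chosen at read time (the `now-ms` name here is just illustrative):

```clojure
;; In a .cljc file, a reader conditional picks the form for the platform
;; the file is being loaded on.
(defn now-ms []
  #?(:clj  (System/currentTimeMillis)
     :cljs (.getTime (js/Date.))))
```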
Yes and no. It can help with different target backends (including JavaScript, not sure about JVM or .NET) but you still have to do work to make sure that the calling conventions of your language and the target's match.
Sure, but AFAIK many/most of the "supports Javascript" languages out there rely on emscripten, so LLVM is involved anyway. By its nature, LLVM gives you more flexibility in the long term than VMs that were intended as a final destination.
> Sure, but AFAIK many/most of the "supports Javascript" languages out there rely on emscripten, so LLVM is involved anyway.
Yes, yes. LLVM definitely helps here. But it doesn't come "for free", which is what the GP post asked. You still need some "glue" code to make it work.
Someone mentioned that Clojure's JVM interop is its main selling point. While it's definitely awesome to have that great integration (I'm a heavy interop user), for me the main strength is having persistent data structures at its base for everything. I myself played with the thought of creating a native runtime based on Clojure's syntax and data structures. I can't see that this project uses persistent data structures. Am I wrong? Please enlighten me.
> for me, the main strength is having persistent data structures at its base for everything.
Agreed, but also:
- The ISeq abstraction
- Software Transactional Memory for state management
- Structure-sharing of the persistent structures, for added efficiency/performance (sketched below)
And of course, all the nice reader macros that make things so much easier to read.
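To illustrate the structure-sharing point above, a tiny sketch (the names are arbitrary): updating a map returns a new map, but the untouched values are the very same objects, not copies.

```clojure
(def m1 {:xs [1 2 3] :n 1})
(def m2 (assoc m1 :n 2))

;; The :xs vector was not copied; both maps point at the same object.
(identical? (:xs m1) (:xs m2)) ;; => true
```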
This project is cool, but what I'd rather see is Clojure becoming parasitic, living on all the host VMs it can. We've got the CLR version, and ClojureScript for V8/JS, but we could also have it properly on top of Python, Ruby[1], Lua, Erlang, and LLVM.
[1]: I know Rouge exists, but I'm not sure how well it's progressing.
Clojure itself would also love to become parasitic, just as you described. It's actually one of its explicitly stated goals, if I'm correct.
Apparently implementing Clojure on LLVM is no easy feat, though, judging from previous discussions on the mailing list. [1] There is some work on it already [2,3,4], and I'm sure we'll get there eventually.
A real compiler backend takes a bit more than a weekend of work, and it is hard to know the real reasons, technical or in the coders' personal lives, for the current state of those projects.
I think the real problem with the native Clojure compilers is that Clojure is designed to be symbiotic with host platforms that offer a much higher level of abstraction. For example, to perform well, all those persistent data structures require an advanced garbage collector. The JVM, the CLR, and the leading JavaScript implementations offer this; it would take a lot of work to implement one in the absence of one of these mature platforms.
ISeq is considered harmful by some: partly due to ISeq overuse, Clojure code is cluttered with unnecessary and costly (copy-making) "casts" (or what is the proper Clojurian term for this?) from one kind of container to another.
There is an opinion that classic lists, sequences (only over lists, vectors, and strings), plus series are good enough.
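A small illustration, assuming this is the kind of coercion meant: sequence functions return seqs, so code that wants the original collection type back has to rebuild it.

```clojure
(def v [1 2 3])

(map inc v)        ;; => (2 3 4)  -- a (lazy) seq, no longer a vector
(vec (map inc v))  ;; => [2 3 4]  -- rebuilding the vector costs another pass
```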
STM/refs are not really used in the Clojure community; it's a great concept, but in practice almost no one uses it, at least judging from public Clojure projects.
At some point most applications have to have some state. Most Clojure projects are nice functional libraries that work without state, but somewhere there are one or two refs (or atoms). A good example is Datomic, where Rich Hickey said it has only 6 refs. It's 99% persistent, but somewhere you need at least a pointer to your current state. Of course you can push the problem onto your database or some library, but I really like refs with their optimistic locking and am using them to get in-memory consistency.
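A minimal sketch of that "pointer to your current state" style with refs (the account names are made up): the values are plain persistent maps and vectors, and `dosync` gives optimistically locked, transactional updates across both refs.

```clojure
(def accounts (ref {:alice 100 :bob 50}))
(def history  (ref []))

(defn transfer! [from to amount]
  (dosync                                      ;; retried automatically on conflict
    (alter accounts update-in [from] - amount)
    (alter accounts update-in [to] + amount)
    (alter history conj [from to amount])))

(transfer! :alice :bob 25)
@accounts ;; => {:alice 75, :bob 75}
```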
Currently, vectors are the only real data structures; they're naively persistent: operations like rest and cons create a fresh copy (I haven't implemented COW yet). I'm not quite sure how to implement it for `setq`, though (a feature not present in Clojure, but something that's necessary for building an editor like Emacs): when a variable is set to a new value, where is the reference to the old value kept, for GC and access?
Only special ref objects are mutable in Clojure (via STM). I was talking of `setq` in the context of mutating normal variables. It's perhaps a bad idea, and I have to study how LT manages its plugin system without mutation.
You don't have to mutate them. You can create a new persistent variant and just mutate the ref to point to the new one. Users of that "variable" store the ref instead of the value (well, actually, they get the choice of which to do). If users have held on to references to the old values, the GC doesn't collect them. If they haven't, the GC can collect them just like any other circumstance involving persistent data structures.
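A sketch of that with an atom (the names are illustrative): anyone who captured the old value still has it, and the GC reclaims it only once no one does.

```clojure
(def state (atom [1 2 3]))      ;; the "variable" is the atom, not the vector

(def snapshot @state)           ;; a consumer grabs the current immutable value

(reset! state (conj @state 4))  ;; "setq": repoint the atom at a new value

@state    ;; => [1 2 3 4]
snapshot  ;; => [1 2 3]  -- still reachable here, so it is kept;
          ;;    drop the last reference and it is GC'd like any other value
```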
Actually, you can have real mutable variables as fields in a type, although you should think twice about doing so in practice as mutable variables are often not the solution. But here's an example!
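A minimal sketch of such a deftype (ICounter/Counter are just illustrative names), with a mutable field, a definterface that exposes it, and set! used inside the method bodies:

```clojure
(definterface ICounter
  (bump [])
  (value []))

(deftype Counter [^:unsynchronized-mutable n]
  ICounter
  (bump [this]
    (set! n (inc n))  ;; set! on the mutable field, only legal inside deftype methods
    n)
  (value [this] n))

(def c (->Counter 0))
(.bump c)   ;; => 1
(.value c)  ;; => 1
```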
:unsynchronized-mutable corresponds to a normal mutable Java variable, and :volatile-mutable corresponds to a Java variable with the volatile modifier.
They are private by default; you can only set! them within the lexical scope of the deftype definition. If you need to expose fast host mutation publicly, I would create a setter/getter interface (via definterface), implement it on your deftype, and use set! locally to mutate the field there.
The JVM as primary target is the main selling point; just watch the author's talks. Back then, Rich Hickey described the JVM as "the default platform of choice" for "serious development" (read: corporations). OK, I am paraphrasing a bit, but he explicitly praised the JVM.
I think the holy grail may be the combination of working in a very expressive high level language, but knowing and seeing exactly what that generates at the low level. (And in those few cases where it matters, being able to control/influence/change what it generates at the low level).
FWIW, SBCL does exactly that. You can defun a function and disassemble it right down to assembly. What this project does is slightly different: it uses LLVM to achieve the same thing.
Yeah I realize that, but good point. LLVM is very interesting to me, but you're right that it's not as utterly low level as native assembly. It is the most interesting assembly ("assembly-like"?) language to me at this time.
There are occasional discussions on sbcl-devel about it, but I'm not aware of anyone committing serious resources to it (it's not really a weekend-project, or even single-GSoC-student, amount of work). There is also some skepticism about whether it's worth doing at all, because of worries that it will be difficult to get good GC performance out of LLVM: http://marc.info/?l=sbcl-devel&m=136219729420475&w=2
Taking GC seriously in the design of a low-level compiler-target language was one of the aspects that I found very interesting about the C-- project of Simon Peyton Jones & Norman Ramsey. But it unfortunately didn't get anywhere near the same backing/momentum as LLVM, and now seems defunct.
Indeed, and I found the whole concept to be excellent.
Unfortunately, as far as I can tell, it's defunct outside of its use in GHC. E.g. the Quick C-- compiler was recently archived here: https://github.com/nrnrnr/qc--
Julia has some nice functionality for introspecting both lowered and type-inferred ASTs, the corresponding LLVM IR, and finally the resulting assembly code. (several Lisp systems have this kind of thing too)
I believe Haskell compilers are pretty amenable to stuff like this - watching "Data Parallel Haskell", it was interesting to see how they actually map out contiguous memory, etc. https://www.youtube.com/watch?v=NWSZ4c9yqW8
Sadly, it doesn't seem to be as tightly integrated as what you'd see in SBCL from a developer's PoV, but I'm certainly intrigued by their ability to achieve some very specific implementation while separating it from the expression of the algorithms.
You can add Scheme to the list of languages that do this. CHICKEN in particular works by compiling highly expressive Scheme to C, which can be inspected, modified, etc. Also the CHICKEN compiler has a few dozen switches to tweak output, but it's fully open-source (BSD) if that's interesting.
If C is not "low level" enough what are you looking for?
(It occurs to me, the C code can be compiled with clang, which provides further opportunities for low-level output.)
Because one type may use more than one field. Vectors and strings use the nr field additionally, and functions use the vector/fenv field additionally. Instead of creating a hierarchy, I figured a flat structure would be easier to understand.
I was going to ask about this as well. It seems like there's a lot of wasted space in how things are represented. The first field already tells you what type of object you're dealing with. Since you're using 64-bit integers, you could represent pointers, bools, ints and floats all in the same memory space. I guess there might be some types that use more than one field, but certainly a bool is not going to.
There's also the issue of alignment with the usage of 1-bit and 1-byte types...
Another remark is that it would have been nicer to have the language written in itself instead of OCaml.
Other than that, very nice work.