
Are there any courses on creating a new language, interpreter, and compiler from scratch? Like not using LLVM or similar to generate intermediate representation but to literally pick an ISA and, well, compile for it.



Various books do this. Essentials of Compilation is a recent one: https://github.com/IUCompilerCourse/Essentials-of-Compilatio...

You can find a PDF and the associated course with a very minimal amount of following links.

If you are thinking of making your own language, it's also good to learn something about programming language theory, if you don't already know it. Many languages repeat mistakes that were solved 25+ years ago. PLAI is good for that: https://www.plai.org/


I compiled a basic C program and then read the output of gcc -S to see the assembly.

Using what I learned from the GCC output, I then wrote a basic amd64 expression compiler with simple live ranges, linear-scan register allocation, and ANF (A-normal form), using postorder AST traversal for code generation. It can only evaluate nested add and mul expressions.

https://github.com/samsquire/compiler or on replit: https://replit.com/@Chronological/Compiler3#main.py
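For a flavor of what that looks like, here's a minimal sketch in Python (not the code from the repo above): it does the same postorder traversal over nested add/mul expressions, but skips live ranges and register allocation entirely and eagerly spills every intermediate result to the hardware stack.

    def emit(node, out):
        # Postorder: compile both children first, then combine their results.
        if isinstance(node, int):
            out.append(f"    push ${node}")
            return
        op, left, right = node
        emit(left, out)
        emit(right, out)
        out.append("    pop %rbx")    # right operand
        out.append("    pop %rax")    # left operand
        out.append("    add %rbx, %rax" if op == "+" else "    imul %rbx, %rax")
        out.append("    push %rax")   # result goes back on the stack

    lines = []
    emit(("*", ("+", 1, 2), 3), lines)   # (1 + 2) * 3
    print("\n".join(lines))

Eager spilling is slow but it means you never run out of registers, which is why it's a common first step before real allocation.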

Finding opcodes is a struggle, but these references help: https://www.felixcloutier.com/x86/ and https://www.cs.uaf.edu/2016/fall/cs301/lecture/09_28_machine... (the table at the bottom is really useful)
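As a sanity check against those tables, you can hand-assemble a single instruction. For example, "add rax, rbx" in x86-64 is a REX.W prefix, the 0x01 opcode (ADD r/m64, r64), and a ModRM byte:

    REX_W  = 0x48                        # REX prefix with W=1: 64-bit operands
    OPCODE = 0x01                        # ADD r/m64, r64
    modrm  = (0b11 << 6) | (3 << 3) | 0  # mod=11 (register direct), reg=rbx (3), rm=rax (0)
    print(bytes([REX_W, OPCODE, modrm]).hex())  # prints "4801d8"

Feeding 48 01 d8 to a disassembler gets you back "add rax, rbx", which is a nice way to confirm you've read the encoding tables correctly.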

To learn code generation without targeting amd64, I designed an imaginary assembly language, then wrote a switch-based virtual machine and a compiler for it with a frontend that looks similar to JavaScript.

https://github.com/samsquire/multiversion-concurrency-contro... — see ProgramParser.java, LanguageInterpreter.java, and LanguageInterpreterRunner.java (which has main()).
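The switch-on-opcode pattern itself fits in a few lines. Here's a minimal sketch in Python (which has no switch, so an if/elif chain stands in), using a made-up three-opcode ISA rather than the one from the repo above:

    PUSH, ADD, PRINT = range(3)

    def run(program):
        stack, pc = [], 0
        while pc < len(program):
            op = program[pc]
            if op == PUSH:             # operand follows the opcode
                pc += 1
                stack.append(program[pc])
            elif op == ADD:            # pop two values, push their sum
                stack.append(stack.pop() + stack.pop())
            elif op == PRINT:          # pop and print the top of the stack
                print(stack.pop())
            pc += 1

    run([PUSH, 2, PUSH, 3, ADD, PRINT])   # prints 5

A compiler for a VM like this just has to flatten the AST into that opcode list, which is why it's a gentler target than a real ISA.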


Intel's documentation covers encoding. I remember it being relatively heavy going to decipher, though it's been a few years since I've worked on x64.


Check out https://mitpress.mit.edu/9780262047760/essentials-of-compila...

The author has made complete draft PDFs for both the Racket and Python versions available here: https://wphomes.soic.indiana.edu/jsiek/


Thanks, highly appreciated.

When it comes to CS book authors, I tend to be biased towards what Jeremy Siek himself markets as "Proven in the classroom". Many book authors lack this experience and simply write for themselves, which is okay, but it can result in bad didactics. Good teachers and authors from academia are invaluable.


That's a great point about experience teaching the topic. (I've written a few articles where I certainly wished for that experience.)

I just wish textbooks didn't have their own bad tendencies: they have pablum as an attractor, because most students just want to get through the class without doing much worse than their peers. Even without this problem, there's a more basic one: as with enterprise software, the decision to buy the book is typically not up to the user. "Will people actually want to read this on their own time?" is a strong driver of quality, even though it has pitfalls too.


Precisely made for "creating a new language, interpreter, and compiler from scratch": http://craftinginterpreters.com


This does not cover compilers. I know because I specifically asked munificent (the author) about this before and he mentioned some other books to look at.


He's being modest :) Sure, the book doesn't cover many traditional compiler topics such as register allocation and code optimization. However, it does cover parsing and translating to intermediate representation, which is 90% of what you need to get your own programming language off the ground. And to be honest, many university compiler courses won't have time to go much farther than that anyway.


What do you mean exactly by "compilers"? Because the book features a chapter on how to compile source code to virtual machine byte code.


Most of the time when people talk about compiled languages/compilers, they mean languages that compile to native machine code, not languages that compile to a bytecode/run on a virtual machine.


Compiling to machine code rather than VM bytecode is a difference in degree rather than kind, and doing it well rather than naively is a deep topic.

Simple machine code generation with basic register allocation and evaluation which eagerly spills to the stack is not super hard. Producing fast code instead is a bottomless well.


Is that true? I think of C# and Java as compiled languages and they run on the CLR/JVM which are both virtual machines.


It's just the nomenclature I'm used to: compiled languages are so called because there is no VM, just machine code that runs directly, while C# and Java still require the CLR/JVM to do JIT compilation at runtime. But it's mostly splitting hairs; there is no agreed-upon definition of what makes a language compiled or interpreted.


Both have JITs. So someone wrote a compiler from CIL/JVM bytecode into native code.


There's literally no difference in the theory or technique of compiling for a virtual machine or a real one. Real machines are just messier and have a bunch of edge cases and weirdness that require you to read those multi-kilopage manuals to figure out. Having one or more intermediate steps is useful for portability and optimization though, so it's probably a good idea to target a virtual machine unless you only intend to ever target one specific architecture that you know inside and out.


Plot twist - the developer then runs the code in qemu for rapid iteration and ease of debugging... ;)


Yes, but it's still the most beginner-friendly book on implementing programming languages that I've ever seen. It covers VMs in the second part, and going from that to asm is not a huge leap.


Compiling to machine code is not that hard once you get to the IR or bytecode phase, check out this toy compiler:

https://github.com/byo-books/pretty_laughable_lang/


Is that really true? I would expect that going from IR to native code would be the hardest part, because you have to worry about architecture and supporting different ISAs. IR can be simple because it is abstract and hides lots of the dirty details of the hardware.


Generating code is not all that hard; generating fast code is a challenge. Encoding instructions on some architectures is hard. If you have a nice, general IR, naively generating native code can be just a table lookup with a register tracking/allocation routine.
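A minimal sketch of that table-lookup approach in Python, with an invented three-instruction IR and a naive allocator that hands out registers in order and never spills:

    TEMPLATES = {                       # IR opcode -> assembly template
        "load": "    mov ${imm}, {dst}",
        "add":  "    add {src}, {dst}",
        "mul":  "    imul {src}, {dst}",
    }
    REGS = ["%rax", "%rbx", "%rcx", "%rdx"]

    def lower(ir):
        alloc, out = {}, []             # virtual register -> physical register
        def reg(v):
            if v not in alloc:
                alloc[v] = REGS[len(alloc)]   # naive: no spilling, <= 4 live values
            return alloc[v]
        for op, src, dst in ir:
            if op == "load":
                out.append(TEMPLATES[op].format(imm=src, dst=reg(dst)))
            else:
                out.append(TEMPLATES[op].format(src=reg(src), dst=reg(dst)))
        return out

    # t1 = 2; t2 = 3; t2 = t1 * t2
    print("\n".join(lower([("load", 2, "t1"), ("load", 3, "t2"), ("mul", "t1", "t2")])))

Everything hard about real backends (spilling, calling conventions, instruction selection) is hidden in the "naive" caveats here, which is the point: the naive version is genuinely small.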


UPenn's CIS341[1] is pretty nice. It uses a bottom-up approach: assignments start with a 64-bit integer-only x64 simulator, then move up to compiling LLVM IR to x64, then compiling the lecturer's language ('Oat') to LLVM IR, and finally optimisations like dead code elimination, register allocation, variable substitution, etc.

[1]: https://www.seas.upenn.edu/~cis3410/current/


https://www.nand2tetris.org/ is pretty neat. The compiler in part 2 isn’t very advanced (it doesn’t cover much in the space of optimization or the state of the art), but taken as a whole you end up designing the chip, ISA, VM, language, and compiler, so it’s pretty good for that aspect.


When I did my undergraduate degree, you could opt to do that for one extra point. Compiling for an ISA is not that much more difficult than compiling for a bytecode IR like that of Java or .NET, which I imagine is the typical target of these courses.




Check out Project Oberon by Niklaus Wirth. It involves building a compiler from scratch for an ISA designed from scratch, and building an OS for it. You can even put the processor onto an FPGA and run the code you compiled on physical hardware you designed.


The prior course involves implementing a code-generating compiler, with optimisation passes etc., for a language.



