How to learn compilers: LLVM Edition

CalChris · on Nov 5, 2021

1. Getting Started with LLVM Core Libraries

It's a bit dated (covers DAGISel rather than GlobalISel) but it gives a thorough introduction.

2. LLVM Developer Meeting tutorials

These are really good although you'll have to put them in order yourself. They will be out of date, a little. LLVM is a moving target. Also, you don't have to go through every tutorial. For example, MLIR is not for me.

3. LLVM documentation

I spent less time reading this than going through the Developer Meeting tutorials. I generally use it as a reference.

4. Discord, LLVM email list, git blame, LLVM Weekly

... because you will have questions.

5. MyFirstTypoFix (in the docs)

... when it comes time to submit a patch.

6. Mips backend

If you're doing a backend, you will need a place to start. The LLVM documentation points you to the horribly out of date SPARC backend. Don't even touch that. AArch64 and x86 are very full featured and thus very complex (100 kloc+). Don't use those either. RISC-V is ok but concerns itself mostly with supporting new RISC-V features rather than keeping up to date with LLVM compiler services. Don't use that either although definitely work through Alex Bradbury's RISC-V backend tutorials. Read the Mips backend. It is actively maintained. It has good GlobalISel support almost on par with the flagship AArch64 and x86 backends.

BTW, Chris Lattner is a super nice guy.

jcranmer · on Nov 5, 2021

> 4. Discord, LLVM email list, git blame

and don't forget IRC!

UncleOxidant · on Nov 5, 2021

> LLVM Developer Meeting tutorials

Are these all in one place or scattered about?

CalChris · on Nov 5, 2021

Either llvm.org under Developer Meetings or the LLVM Youtube channel. The advantage of llvm.org is that it has a lot of the PDFs for the presentations as well as some old, pre-Youtube tutorials.

chrisaycock · on Nov 5, 2021

I learned a lot about LLVM by looking at the compiler output from Clang:

  clang -emit-llvm -S sample.cpp

The article mentions Clang's AST, which can also be emitted:

  clang -Xclang -ast-dump -fsyntax-only sample.cpp

And for checking compiler outputs across lots of languages and implementations, there's always Matt Godbolt's Compiler Explorer:

  https://godbolt.org

anonymousDan · on Nov 5, 2021

I have to say personally I find general program analysis (e.g. for security) a much more interesting topic than most vanilla compiler courses. For example I recently came across this course by the maintainers of soot: https://youtube.com/playlist?list=PLamk8lFsMyPXrUIQm5naAQ08a...

Any pointers to similar courses much appreciated!

the_benno · on Nov 5, 2021

Anders Moeller and Michael Schwartzbach's book [1] on static program analysis is a fantastic resource, with (I think) a great balance of theory and practice. If you want to get really deep into the theory of program analysis, Patrick Cousot just published an incredibly thorough book on abstract interpretation (just got my copy this week, so haven't fully explored enough to have much of an opinion on it as a pedagogical resource)

[1] cs.au.dk/~amoeller/spa

anonymousDan · on Nov 6, 2021

Thanks, yes I've read the Moeller book and I agree with your evaluation. I wasn't aware of the Cousot book though, I'll be sure to check it out.

matt_d · on Nov 6, 2021

More program analysis & LLVM resources (books, courses, and talks): https://gist.github.com/MattPD/00573ee14bf85ccac6bed3c0678dd...

igotsideas · on Nov 6, 2021

Thank you for YouTube link. I would of never found this and it looks like gold.

andrewchambers · on Nov 5, 2021

More great things:

- https://c9x.me/compile/

- https://github.com/vnmakarov/mir

gavinray · on Nov 6, 2021

Thank you for posting these. I had seen and read about both of these before, and had been blown away by both of them -- but since had forgotten they existed!

  (For MIR overview, I recommend reading the Redhat blogposts mentioned in README rather than the README itself)

If I implement a language for learning purposes, I will implement it first in QBE and then MIR and then LLVM, and publish a retrospective analysis + comparison + thoughts.

I think that would make really interesting reading!

tester34 · on Nov 5, 2021

since here's many compiler hackers then I'd want to ask question:

How do you distribute your frontend with LLVM?

Let's say that I have lexer, parser and emitter written in e.g Haskell (random example)

I emit LLVM IR and then I use LLVM to generate something other

but the problem is, that I need to have LLVM binaries and I'd rather avoid telling people that want to contribute to my OSS project to install LLVM because it's painful process as hell

So I thought about just adding "binaries" folder to my repo and put executables there, but the problem is that they're huge as hell! and also when you're on linux, then you don't need windows' binaries

Another problem is that LLVM installer doesnt include all LLVM components that I need (llc and wasm-ld), so I gotta compile it and tell cmake (iirc) to generate those

I thought about creating 2nd repo where there'd be all binaries compiles for all platforms: mac, linux, windows and after cloning my_repo1, then instruction would point to download specific binaries

How you people do it?

xrisk · on Nov 5, 2021

You can statically link LLVM, no problem.

In fact, you never have to call any binaries specifically; just do it through code and everything should link at compile-time and become one big binary.

10000truths · on Nov 5, 2021

This is in fact what Zig does. Everything is statically linked into one binary that is used for compiling, linking, building, testing etc.

jcranmer · on Nov 5, 2021

When you have an LLVM frontend, what you generally do is have your driver run the optimization and code generation steps itself using the LLVM APIs rather than using opt/llc binaries to drive this step. That way, you don't need the LLVM binaries, just the libraries that you statically link into your executable.

For example, all of the code in clang to do this is located in https://github.com/llvm/llvm-project/blob/main/clang/lib/Cod...

tester756 · on Nov 5, 2021

What if my frontend is written in non cpp? e.g haskell, js, java, c#, etc.

jcranmer · on Nov 5, 2021

You use the LLVM-C bindings via your favorite FFI mechanism to generate the code then, usually.

staticfloat · on Nov 5, 2021

In the Julia world, we make redistributable binaries for all sorts of things; you can find lots of packages here [0], and for LLVM in particular (which Julia uses to do its codegen) you can find _just_ libLLVM.so (plus a few supporting files) here [1]. If you want a more fully-featured, batteries-included build of LLVM, check out this package [2].

When using these JLL packages from Julia, it will automatically download and load in dependencies, but if you're using it from some other system, you'll probably need to manually check out the `Project.toml` file and see what other JLL packages are listed as dependencies. As an example, `LLVM_full_jll` requires `Zlib_jll` [3], since we build with support for compressed ELF sections. As you may have guessed, you can get `Zlib_jll` from [4], and it thankfully does not have any transitive dependencies.

In the Julia world, we're typically concerned with dynamic linking, (we `dlopen()` and `dlsym()` our way into all our binary dependencies) so this may not meet all your needs, but I figured I'd give it a shout out as it is one of the easier ways to get some binaries; just `curl -L $url | tar -zxv` and you're done. Some larger packages like GTK need to have environment variables set to get them to work from strange locations like the user's home directory. We set those in Julia code when the package is loaded [5], so if you try to use a dependency like one of those, you're on your own to set whatever environment variables/configuration options are needed in order to make something work at an unusual location on disk. Luckily, LLVM (at least the way we use it, via `libLLVM.so`) doesn't require any such shenanigans.

[0] https://github.com/JuliaBinaryWrappers/ [1] https://github.com/JuliaBinaryWrappers/libLLVM_jll.jl/releas... [2] https://github.com/JuliaBinaryWrappers/LLVM_full_jll.jl/rele... [3] https://github.com/JuliaBinaryWrappers/LLVM_full_jll.jl/blob... [4] https://github.com/JuliaBinaryWrappers/Zlib_jll.jl/releases [5] https://github.com/JuliaGraphics/Gtk.jl/blob/0ff744723c32c3f...

Micoloth · on Nov 5, 2021

I’ll take advantage of this comment to ask the tangential question: Where can i learn how llvm “compilation” works in Julia?

I know code is only supposed to be JIT’ed and then executed by the runtime (that’s why PackageCompiler exists), but still I’d like to know more about how it works..

Like, if i write a simple pure function in Julia and call code_llvm on it… How “standalone” is the llvm code (if that is even a thing)? When does GC get called? How exactly does the generated code depend on the runtime?

Is there any good explanation of this?

staticfloat · on Nov 6, 2021

To add to Keno's sibling comment, Julia, as a JIT compiler, essentially creates large chunks of standalone, "static" code, and runs those as much as it can, breaking out into the "dynamic" runtime when it has reached the limits of type inference or for some other reason needs to return to the runtime to perform dynamic dispatch etc... in these instances, we break out of the standalone code and start using the runtime to do things like determine where to jump next (or whether to compile another chunk of static code and jump to that). Note that these chunks of static code can be both smaller or larger than a function, it all depends on what Julia can compile in one go without needing to break out into the dynamic environment.

KenoFischer · on Nov 5, 2021

> Like, if i write a simple pure function in Julia and call code_llvm on it… How “standalone” is the llvm code (if that is even a thing)? When does GC get called? How exactly does the generated code depend on the runtime?

It's standalone unless there's explicit calls to the runtime in it. The most common runtime support is probably heap allocation, so if you see a `jl_alloc_obj` in there, that gets lowered to runtime calls eventually. GC gets called during allocation if the runtime thinks there's been enough garbage generated to have a collection be worth it.

HowardStark · on Nov 5, 2021

Not a compiler hacker and unfamiliar with the scene but is there a specific reason that `git-lfs` wouldn’t work? It’s the first thing that came to mind reading this. You can also pretty easily fetch specific objects as opposed to everything, so in your README you could direct contributors to only fetch specific binaries for given tasks.

Lamad123 · on Nov 5, 2021

You could've mentioned Crafting Interpreters. It's an excellent book!!! (At least for the rest of us, mere mortals!)

dlsa · on Nov 6, 2021

I'd second this. Been a good read. Learnt a lot. Now I have a DSL for some really strange usecases and its been a blast. Never thought having an interpreter of my own would be so much fun.

Lamad123 · on Nov 6, 2021

Cool! I just started reading it about a week ago, and almost having that feeling I had when I first learned how to program.