Compilers work by taking a (computer-)linguistic analysis of the source code text, a syntax tree, and converting it into an intermediate representation (IR), which is then progressively lowered to raw machine code by a series of transformations.
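If you want to watch a file move through those stages yourself, Clang can dump each one (the file names here are just placeholders):

    // pipeline.cpp
    //   Syntax tree:  clang++ -fsyntax-only -Xclang -ast-dump pipeline.cpp
    //   LLVM IR:      clang++ -S -emit-llvm pipeline.cpp -o pipeline.ll
    //   Machine code: clang++ -S pipeline.cpp -o pipeline.s
    int add(int a, int b) { return a + b; }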
Such IRs generally take the form of a connected graph of basic blocks (BBs). Basic blocks have a formal structure: they connect to other basic blocks only at their tops and bottoms, and are never multiply-connected internally. A perhaps simplistic way to visualize them is as the polygons on a flowchart.
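As a rough sketch (the block labels are mine, not anything Clang prints), even a small branch splits a function into several blocks:

    int abs_val(int x) {
        if (x < 0)   // BB1 (entry): test x, branch to BB2 or BB3
            x = -x;  // BB2: negate, fall through to BB3
        return x;    // BB3: the single exit block
    }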
So the article, which assumes that you know this background, shows what kinds of basic blocks result in Clang's intermediate representation, before any lowering or optimization, from the syntax tree. It skips over the part where the syntax tree is constructed from the source code shown.
In many compilers, you don't get to see the intermediate representation. With open-source compilers like OpenWatcom C/C++ and GCC you can always go and look at the source code to see what the IR data structures are, but the IR itself lives in memory (or in internal temporary files) and is not accessible outwith the compiler programs themselves. Clang's design, in contrast, involves a publicly specified intermediate representation with documented serialization formats, so it can be written to a file and then accessed with other, potentially third-party, tools.
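For example (file names arbitrary), the same module can be serialized as binary bitcode or as textual IR and then picked apart with the standalone LLVM tools:

    // demo.c
    //   clang -c -emit-llvm demo.c -o demo.bc    (binary bitcode)
    //   llvm-dis demo.bc -o demo.ll              (textual IR)
    //   llvm-bcanalyzer demo.bc                  (inspect the bitstream)
    int square(int n) { return n * n; }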
It's not as clean as clang/llvm, but you can see the intermediate steps of GCC with the -fdump-tree-all and -fdump-rtl-all options. The GIMPLE IR is not so hard to follow (not as hardcore as the GCC docs...). It can be interesting to see the effect of front- or back-end compiler options on your code and to track down how an optimization works.
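For instance (the numbering in the dump file names varies between GCC versions):

    /* foo.c -- dump a file per GIMPLE pass:
         gcc -O2 -fdump-tree-all -c foo.c
       leaves dumps like foo.c.*t.optimized next to the object file */
    int twice(int x) { return x + x; }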
You can also write compiler plugins in GCC. Not as fun as clang/llvm, but manageable.
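A minimal skeleton looks roughly like this (a sketch from memory; check the GCC plugin docs for the exact build incantation for your GCC version):

    /* hello-plugin.c -- a minimal GCC plugin sketch. Build and use, roughly:
         g++ -shared -fPIC -fno-rtti \
             -I`gcc -print-file-name=plugin`/include \
             hello-plugin.c -o hello-plugin.so
         gcc -fplugin=./hello-plugin.so foo.c */
    #include "gcc-plugin.h"
    #include "plugin-version.h"
    #include <stdio.h>

    int plugin_is_GPL_compatible;  /* GCC refuses to load a plugin without this */

    static void on_finish(void *gcc_data, void *user_data) {
        printf("compilation finished\n");
    }

    int plugin_init(struct plugin_name_args *info,
                    struct plugin_gcc_version *version) {
        if (!plugin_default_version_check(version, &gcc_version))
            return 1;  /* built against a different GCC */
        register_callback(info->base_name, PLUGIN_FINISH, on_finish, NULL);
        return 0;
    }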
> A connected graph of basic blocks (BBs) is generally how such IRs work
I was wondering: when is this graph actually built? Is it before creating the IR or after? The way I see it, you could generate IR without worrying about the control-flow graph, then build the CFG from the IR in order to perform common optimizations or register allocation. Is that correct?
In LLVM, a function is a linked list of BBs, each of which is a linked list of instructions. The final instruction in each basic block is a special kind of instruction (a "terminator") that includes a list of all possible successor BBs.
In effect, the IR is exactly a control-flow graph represented in adjacency list form, so it's not possible to construct the IR without constructing the control flow graph. You could theoretically write the IR in textual form, but that's definitely not the common case, and you generally need to construct the CFG anyways to properly emit the IR (particularly since adding the phi nodes for SSA requires knowing the predecessors of every basic block).
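To make that concrete, here's a sketch using LLVM's C++ API (names and the build line are mine; API as of recent LLVM releases) that constructs a tiny two-armed CFG by hand. Note how the terminators and the phi node both carry explicit pointers to other blocks; that is the adjacency list:

    // cfg.cpp -- build with something like:
    //   clang++ cfg.cpp `llvm-config --cxxflags --ldflags --libs core support`
    #include "llvm/IR/IRBuilder.h"
    #include "llvm/IR/LLVMContext.h"
    #include "llvm/IR/Module.h"
    #include "llvm/Support/raw_ostream.h"

    int main() {
        llvm::LLVMContext ctx;
        llvm::Module mod("demo", ctx);
        auto *i32 = llvm::Type::getInt32Ty(ctx);
        auto *fnTy = llvm::FunctionType::get(i32, {i32, i32}, false);
        auto *fn = llvm::Function::Create(
            fnTy, llvm::Function::ExternalLinkage, "max", mod);
        auto *x = fn->getArg(0);
        auto *y = fn->getArg(1);

        // Four blocks, appended to the function's list of BBs.
        auto *entry = llvm::BasicBlock::Create(ctx, "entry", fn);
        auto *useX  = llvm::BasicBlock::Create(ctx, "use_x", fn);
        auto *useY  = llvm::BasicBlock::Create(ctx, "use_y", fn);
        auto *done  = llvm::BasicBlock::Create(ctx, "done",  fn);

        llvm::IRBuilder<> b(entry);
        // The terminator names its successors: the CFG edges live here.
        b.CreateCondBr(b.CreateICmpSGT(x, y, "cmp"), useX, useY);
        b.SetInsertPoint(useX);
        b.CreateBr(done);
        b.SetInsertPoint(useY);
        b.CreateBr(done);
        // The phi must name its predecessor blocks, which is why you need
        // the CFG in hand before you can finish emitting SSA form.
        b.SetInsertPoint(done);
        auto *phi = b.CreatePHI(i32, 2, "result");
        phi->addIncoming(x, useX);
        phi->addIncoming(y, useY);
        b.CreateRet(phi);

        mod.print(llvm::outs(), nullptr);  // dump the textual IR
    }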
A term I had to look up was "SSA", which it turns out is short for "static single assignment". [1] This is where a variable in the intermediate representation can only be assigned to once, and this has to be at the point of definition. It seems that a single variable in the original C++ can (indeed, must) be translated into multiple SSA variables in the IR, one representing its value after each assignment to it.
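For example (the numbered names are just illustrative, not real syntax):

    int f(int x) {      // x becomes x0 on entry
        int y = x + 1;  // y1 = x0 + 1
        y = y * 2;      // y2 = y1 * 2  (a fresh SSA variable, not a reassignment)
        return y;       // return y2
    }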
There are other classic books, but this is one of the best, and it covers quite a bit of modern material like GC implementations and SSA algorithms for code generation.
There are some well-known complaints about it, but in my opinion none of them really gets in the way of learning about compilers.
I like these books too, but I'm wondering whether they're still a good reference considering they're quite old now (I think the last edition is from 2002 or so).
Well, Clang is a C/C++ compiler front end that by itself only generates highly unoptimized LLVM IR, the intermediate language used in the compilation process. The article illustrates a few slightly unusual features, such as the fact that Clang generates stack allocations liberally, relying on the mem2reg pass to optimize them away.
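You can watch this happen on a one-liner (file names are placeholders):

    /* add1.c -- see the alloca and then its removal:
         clang -O0 -Xclang -disable-O0-optnone -S -emit-llvm add1.c -o add1.ll
         opt -passes=mem2reg -S add1.ll
       At -O0 the argument x gets an alloca plus store/load traffic; mem2reg
       promotes it to an SSA value. (-disable-O0-optnone is needed so opt
       doesn't skip the function.) */
    int add1(int x) { return x + 1; }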
If you really want to learn more, check out the well-written LLVM docs. If you prefer a hands-on approach, follow the Kaleidoscope tutorial.
> we’re not going to look at any non-trivial C++ features
Becomes: we're only going to look at trivial C++ features.
Which is then recognized as a no-op and replaced with the null sentence ;-)
Nice article though.