Advanced Compilers: Self-Guided Online Course (cornell.edu)
662 points by macrolocal on March 13, 2023 | 82 comments



I hope I actually sit down and go through this, because it seems like an excellent resource. I took a grad-level compilers course in college (they just let us take them as undergrads) and it was the most fun I've had at uni. It was all in SML: we built a functional language with a solid type system, ADTs, and parametric polymorphism, and it was so interesting to see that it wasn't as complicated to implement as it seemed (although it was time-consuming to iron out the bugs).

My prof developed a production SML compiler that's fairly widely used, so he was the best resource I could have asked for. Such a humble guy as well. The thing that was most interesting to me was the use of ADTs on the implementation end. I can't imagine how much more tedious it would be to deal with the many kinds of trees a compiler handles without them.
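
To make that concrete for anyone who hasn't seen it, here's a minimal sketch (in OCaml; the SML version is nearly identical) of what I mean: the tree is an ADT, a pass is a pattern match, and the compiler checks that every case is handled:

    (* A tiny expression AST as an algebraic datatype. *)
    type expr =
      | Int of int
      | Add of expr * expr
      | Mul of expr * expr

    (* A constant-folding pass: each tree transformation is a short,
       exhaustiveness-checked function. *)
    let rec fold (e : expr) : expr =
      match e with
      | Int _ -> e
      | Add (a, b) ->
          (match fold a, fold b with
           | Int x, Int y -> Int (x + y)
           | a', b' -> Add (a', b'))
      | Mul (a, b) ->
          (match fold a, fold b with
           | Int x, Int y -> Int (x * y)
           | a', b' -> Mul (a', b'))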

The class's impact on my overall career isn't direct, as I don't work anywhere near the compiler space (unfortunately), but I got so many good type-level concepts out of that course. Plus, it was just a blast. Late nights with a bunch of tmux panes, ironing out bugs to rack up test coverage - such fond memories :)

I highly recommend playing with compiler implementation for anyone who's unfamiliar. There are a lot of good introductory resources out there, and compilers are just such a fascinating part of our industry. A beautiful crossroads between CS theory and practical application.


Your comment seems as good a place as any to drop the requisite Yegge compiler evangelism piece: https://steve-yegge.blogspot.com/2007/06/rich-programmer-foo...


> My prof developed a production SML compiler that's fairly widely used

MLton? I've been playing around on-and-off with a non-production SML compiler and MLton is such a fantastic resource.

This is just a hobby for me, but I have read (and own) several different compiler textbooks. What I really want is one that focuses on compiling functional languages down to assembly.


Yes! Professor Fluet was probably the most passionate professor I had. The man has three kids and a full-time job as a professor at the university, and he still works on MLton; it's incredible.

During our code gen lectures, he told us that MLton's original backend (not sure if it's still the default) generated C and shipped it over to a C compiler to do the dirty work. I know they have an official x86 backend now too, though.

I wish I had resource recommendations for what you're looking for (because it is a super interesting topic), but hopefully MLton can serve as a great resource. Poly/ML may also be worth a look - not sure though.


Adrian is a great guy and I highly recommend all his content.

For undergrad-level PL, I highly recommend Dan Grossman's MOOC[0] or the recorded lectures from a recent offering of his class[1]. Dan's thesis work strongly influenced Rust's borrow checker.

Disclosure: Dan advised me in undergrad and Adrian is a friend.

[0] https://www.coursera.org/learn/programming-languages

[1] https://courses.cs.washington.edu/courses/cse341/19sp/ (ctrl-f "Videos")


That sounded interesting. I think it's probably this Dan: https://homes.cs.washington.edu/~djg/publications/, which would make said thesis "Safe Programming at the C Level of Abstraction": https://homes.cs.washington.edu/~djg/papers/grossman_thesis....


> I wish my school offered something even half as good as this.

Me too. In my school, we had great introductory/mid-level classes, but at the graduate level I found our classes underwhelming. Mostly professors/researchers teaching their narrow specialties and trying to recruit PhD students, without putting much time into their lectures because they didn't care about teaching: a bunch of slides and research papers to read. They wouldn't bother creating a substantial programming project; that was left to colleagues who did less research.


I'm in a Master's program (and possibly jumping to a PhD) at a Big Tech School that you've probably heard of -- and that's my feeling as well.

Fairly underwhelming. Lots of online resources to chase after to catch up to the state of the art, but a lot of the courses are things like the history and development of the technology. Worth knowing, but I don't need multiple year-long adventures in this stuff -- get me to modern, and then let's solve some actual problems!


I used to think like this until I actually spent some time with the modern things after learning a lot of the historical development paths. A lot of research is stitching together disparate ideas that have developed in their own little community silos. The best skills you can learn as a researcher are to appreciate existing research, to recognize (non-surface-level) similarities between ideas, and to synthesize them.

There's an awful lot of cult of the new that goes on in modern research.


> modern things after learning a lot of the historical development paths

While not research, it is nonetheless interesting to watch every generation of programmers (including mine) rediscover the RDBMS after tripping over all the pitfalls of the hot new stuff.


I wish my school offered something even half as good as this.

I saw it about two years ago, and it is a well-prepared course.

>CS 6120 is a PhD-level Cornell CS course by Adrian Sampson on programming language implementation.

I just don't understand why this is "PhD level".


At my school we used Stanford's coursework for the operating systems class, which was amazing: https://web.stanford.edu/class/cs140/projects/pintos/pintos_...

It's really up to the professor.


> I just dont understand why is this "PhD level"

It would have been if the papers covered were closer to the state of the art. The most recent one is from 2015.

However, this course would actually give you the background required to read more recent papers and work on compiler research, so calling it PhD level is not a huge stretch.


Even if you're working at the PhD level, a course is only going to at best give you a survey of material instead of the latest-and-greatest papers hot off the presses at PLDI or ASPLOS or whatever conferences matter most for your field. If you're actually doing research on a particular subfield, then you are going to be finding and reading those papers on your own.

In that vein, it makes a lot more sense to focus on a solid, foundational paper than on whatever incrementalism came out most recently. SLP vectorization is a good example here--yeah, the foundational paper is 20 years old at this point, but you'd rather students read and understand that than whatever addition to SLP vectorization most recently came out of this (or last) year's PLDI. Even the other papers I think I might add or substitute for others in this course aren't going to be substantially newer than the stuff already here.


I mean, a lot of programming language design goes back decades; it's better to focus a course on the basics that every language will have vs. the most recent developments, which haven't fully crystallized yet.


There is a critical difference between programming language design and compiler design.

You are right when it comes to programming language design and somewhat wrong when it comes to compiler design.

If you are doing compiler research/development, you have to be familiar with the latest and greatest research because the margins for improving things are fairly slim. It's usually fine not to be familiar with how things worked ~30 years ago.

For programming languages, you absolutely have to understand the foundations to do absolutely anything.


> If you are doing compiler research/development, you have to be familiar with the latest and greatest research because the margins for improving things are fairly slim. It's usually fine not to be familiar with how things worked ~30 years ago.

On the other hand, if the margins for improving compiler quality are slim, they can be ignored without great consequences: it makes sense to study state-of-the-art research only in the niches that turn out to be relevant to a specific need of a specific project, not blindly and at the expense of important general principles and techniques (i.e., mostly the problems of 75 years ago and the solutions of 25 years ago).


Again, this is not necessarily aimed at someone doing compiler research, but is foundational knowledge for anyone at the PhD level.

For someone doing compiler research, this might be a first graduate-level course to get them used to the techniques and concepts used in more recent work; they might take two or three additional courses getting closer and closer to the state of the art.


I’d guess “PhD level” just means it’s intended for graduate students in a PhD program.

For example, they read and discuss papers rather than just working out of a textbook.


I took this course when it was PL-focused in 2015ish! It starts with an in-depth exploration of the untyped lambda calculus (with the usual fixins like Church numerals, various combinators, etc), spending half a week or so on semantic expansion rules, reduction strategies, recursion, fixpoints, continuations, how one might encode state with only lexical closures, etc. After maybe the tenth lecture, the class takes a quick detour to talk about predicate logic, Hoare logic, dynamic logic, and Kleene algebra, before deriving a typed variant of the lambda calculus as an illustrative pathway into the latter half of the course featuring algebraic datatypes, unification, type inference, and the duality between formal logic propositions and typing systems. We explored subtype polymorphism, bisimulation, equirecursive equality, and closed on a gentle introduction to monads (of all things) and a teaser of category theory (which IMO should have been taught first but oh well).
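
For anyone curious what the Church-numeral material looks like in practice, here's a minimal sketch in OCaml (the course worked in raw lambda calculus on paper, so this is just an executable approximation):

    (* A Church numeral n means "apply f n times". *)
    let zero = fun _f x -> x
    let succ n = fun f x -> f (n f x)
    let add m n = fun f x -> m f (n f x)

    (* Convert back to a machine integer to check our work. *)
    let to_int n = n (fun k -> k + 1) 0
    let () = assert (to_int (add (succ zero) (succ (succ zero))) = 3)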

As more of a math course than a computer engineering course, 6120 was a first-semester Ph.D.-level class, and I certainly wouldn't recommend taking it unless you really wanted to dive deep into the depths of programming language implementation. Incoming students were expected to be fluent in discrete mathematics and formal proof writing. I wasn't, so I found myself falling behind at the start even though I had some familiarity with lambda calculus and algebraic datatypes. You definitely benefit from a strong math background.


You're thinking of 6110 (Advanced Programming Languages), not 6120.


Ah, so I am! Thanks for the correction.


Right after undergrad.


Love the recommendations people are giving out here; it's not an exaggeration to say that HN gave me a better CS education than my M.S.


Whenever I need to learn about any topic, even non-tech ones, I search it on hn.algolia.com.


Same here! Well-curated resource.


Not an exaggeration at all.

HN gave me a much better education in CS and programming than any traditional source.


Are there any courses on creating a new language, interpreter, and compiler from scratch? Like not using LLVM or similar to generate an intermediate representation, but literally picking an ISA and, well, compiling for it.


Various books do this. Essentials of Compilation is a recent one: https://github.com/IUCompilerCourse/Essentials-of-Compilatio...

You can find a PDF and the associated course by following a very minimal number of links.

If you are thinking of making your own language, it's also good to learn something about programming language theory, if you don't already know it. Many languages repeat mistakes that were solved 25+ years ago. PLAI is good for that: https://www.plai.org/


I compiled a basic C program and then read the output of gcc -S to see the assembly.

Using what I learned from the GCC output, I wrote a basic amd64 expression compiler with simple live ranges, linear register allocation, and ANF (A-normal form), using postorder AST traversal for code generation. It can only evaluate nested add and mul expressions.

https://github.com/samsquire/compiler or on replit: https://replit.com/@Chronological/Compiler3#main.py

Finding opcodes is a struggle, but there are these sites: https://www.felixcloutier.com/x86/ and https://www.cs.uaf.edu/2016/fall/cs301/lecture/09_28_machine... (the table at the bottom is really useful).
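
For anyone curious, the core of the postorder approach fits in a few lines. A hypothetical OCaml sketch (registers handed out in order, with no live ranges or spilling, unlike the real thing above):

    (* Postorder codegen for nested add/mul expressions: compile both
       children, then combine their result registers. *)
    type expr = Lit of int | Add of expr * expr | Mul of expr * expr

    let regs = [| "%rax"; "%rbx"; "%rcx"; "%rdx" |]

    (* Returns the index of the register holding the result. *)
    let rec emit next = function
      | Lit n ->
          Printf.printf "  movq $%d, %s\n" n regs.(next);
          next
      | (Add (a, b) | Mul (a, b)) as e ->
          let ra = emit next a in
          let rb = emit (ra + 1) b in
          let op = match e with Add _ -> "addq" | _ -> "imulq" in
          Printf.printf "  %s %s, %s\n" op regs.(rb) regs.(ra);
          ra

    let () = ignore (emit 0 (Add (Lit 1, Mul (Lit 2, Lit 3))))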

To learn code generation not targeting amd64, I wrote an imaginary assembly language, and then wrote a switch-based virtual machine and a compiler for it, with a frontend that looks similar to JavaScript.

https://github.com/samsquire/multiversion-concurrency-contro... in ProgramParser.java, LanguageInterpreter.java, and LanguageInterpreterRunner.java for where main() is.
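
The switch-based VM pattern, for anyone who hasn't written one, is just a loop dispatching on the current opcode. A made-up minimal stack machine in OCaml (match playing the role of switch), not the actual design linked above:

    type op = Push of int | Add | Mul | Print

    let run prog =
      let stack = ref [] in
      let pop () =
        match !stack with
        | x :: rest -> stack := rest; x
        | [] -> failwith "stack underflow"
      in
      List.iter
        (fun instr ->
          match instr with
          | Push n -> stack := n :: !stack
          | Add -> let b = pop () in let a = pop () in stack := (a + b) :: !stack
          | Mul -> let b = pop () in let a = pop () in stack := (a * b) :: !stack
          | Print -> Printf.printf "%d\n" (pop ()))
        prog

    (* (2 * 3) + 1 = 7 *)
    let () = run [Push 2; Push 3; Mul; Push 1; Add; Print]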


Intel's documentation covers encoding. I remember it being relatively heavy going to decipher; it's been a few years since I've worked on x64, though.


Check out https://mitpress.mit.edu/9780262047760/essentials-of-compila...

The author has made complete draft PDFs of both the Racket and Python versions available here: https://wphomes.soic.indiana.edu/jsiek/


Thanks, highly appreciated.

When it comes to CS book authors, I tend to be biased towards what Jeremy Siek himself markets as "proven in the classroom". Many book authors lack this experience and simply write for themselves, which is OK, but it can result in bad didactics. Good teachers and authors from academia are invaluable.


That's a great point about experience teaching the topic. (I've written a few articles where I certainly wished for that experience.)

I just wish textbooks didn't have their own bad tendencies: pablum is an attractor for them, because the average student just wants to get through the class without doing much worse than the other students. Even without this problem, there's a more basic one: as with enterprise software, the decision to buy the book is not typically up to the user. "Will people actually want to read this on their own time?" is a strong driver of quality, even though it has pitfalls too.


Precisely made for "creating a new language, interpreter, and compiler from scratch": http://craftinginterpreters.com


This does not cover compilers. I know because I specifically asked munificent (the author) about this before and he mentioned some other books to look at.


He's being modest :) Sure, the book doesn't cover many traditional compiler topics such as register allocation and code optimization. However, it does cover parsing and translating to intermediate representation, which is 90% of what you need to get your own programming language off the ground. And to be honest, many university compiler courses won't have time to go much farther than that anyway.


What do you mean exactly by "compilers"? Because the book features a chapter on how to compile source code to virtual machine byte code.


Most of the time when people talk about compiled languages/compilers, they mean languages that compile to native machine code, not languages that compile to a bytecode/run on a virtual machine.


Compiling to machine code rather than VM bytecode is a difference in degree rather than kind, and doing it well rather than naively is a deep topic.

Simple machine code generation, with basic register allocation and evaluation that eagerly spills to the stack, is not super hard. Producing fast code, on the other hand, is a bottomless well.
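
To make the "eagerly spills to the stack" version concrete, a hypothetical minimal sketch (OCaml emitting AT&T-style x86-64): every intermediate result goes through the hardware stack, so there's no register allocation at all:

    type expr = Lit of int | Add of expr * expr | Mul of expr * expr

    (* Invariant: after emit e, the value of e is on top of the stack. *)
    let rec emit = function
      | Lit n -> Printf.printf "  pushq $%d\n" n
      | (Add (a, b) | Mul (a, b)) as e ->
          emit a;
          emit b;
          print_string "  popq %rbx\n  popq %rax\n";
          Printf.printf "  %s %%rbx, %%rax\n"
            (match e with Add _ -> "addq" | _ -> "imulq");
          print_string "  pushq %rax\n"

    let () = emit (Mul (Add (Lit 1, Lit 2), Lit 3))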


Is that true? I think of C# and Java as compiled languages and they run on the CLR/JVM which are both virtual machines.


It's just the nomenclature I'm used to: compiled languages are so called because there is no VM, just machine code that runs directly, while C# and Java still require the CLR/JVM to do JIT compilation at runtime. But it's mostly splitting hairs; there is no agreed-upon definition of what makes a language compiled vs. interpreted.


Both have JITs. So someone wrote a compiler from CIL/JVM bytecode into native code.


There's literally no difference in the theory or technique of compiling for a virtual machine or a real one. Real machines are just messier and have a bunch of edge cases and weirdness that require you to read those multi-kilopage manuals to figure out. Having one or more intermediate steps is useful for portability and optimization though, so it's probably a good idea to target a virtual machine unless you only intend to ever target one specific architecture that you know inside and out.


Plot twist - the developer then runs the code in qemu for rapid iteration and ease of debugging... ;)


Yes, but it's still the most beginner-friendly book on implementing programming languages that I've ever seen. It covers VMs in the second part, and going from that to asm is not a huge leap.


Compiling to machine code is not that hard once you get to the IR or bytecode phase; check out this toy compiler:

https://github.com/byo-books/pretty_laughable_lang/


Is that really true? I would expect going from IR to native code to be the hardest part, because you have to worry about the architecture and supporting different ISAs. IR can be simple because it is abstract and hides lots of the dirty details of the hardware.


Generating code is not all that hard. Generating fast code is a challenge. Decoding instructions on some architectures is hard. If you have a nice, general IR, naively generating native code can be just a table lookup plus a register-tracking/allocation routine.


UPenn's CIS341[1] is pretty nice. It uses a bottom-up approach, where assignments start with a 64-bit integer-only x64 simulator, then go up to compiling LLVM IR to x64, then compiling the lecturer's language ('Oat') into LLVM IR, and then optimisations like dead code elimination, register allocation, variable substitution, etc.

[1]: https://www.seas.upenn.edu/~cis3410/current/


https://www.nand2tetris.org/ is pretty neat. The compiler in part 2 isn't very advanced (it doesn't cover much in the space of optimization or the state of the art), but taken as a whole you end up designing the chip, ISA, VM, language, and compiler, so it's pretty good for that aspect.


When I did my undergraduate degree, you could opt to do that for one extra point. Compiling for an ISA is not that much more difficult than compiling for a bytecode IR like that of Java or .NET, which I imagine is the typical target of these courses.




Check out Project Oberon by Niklaus Wirth. It involves building a compiler from scratch for an ISA designed from scratch, and building an OS for it. You can even put the processor onto an FPGA and run the code you compiled on physical hardware you designed.


The prior course involves implementing a code-generating compiler, with optimisation passes etc., for a language.


Jumping ahead to the video on LLVM: the instructor mentions his prior LLVM tutorial/intro, which looks really helpful as an overview, with more informative links. Probably the best getting-started tutorial I've come across yet:

https://www.cs.cornell.edu/~asampson/blog/llvm.html

This is a great resource; I've wanted to learn how to use LLVM to compile to RISC-V assembly for a while, e.g.

https://llvm.org/docs/RISCVUsage.html


The Kaleidoscope tutorial for LLVM is a good intro to read before one gets deeper into LLVM. The tutorial is in C++, but you can follow along in any language with LLVM bindings: https://llvm.org/docs/tutorial/
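
As a taste of what "any language with LLVM bindings" looks like, here's a rough OCaml sketch using the llvm opam package, building a function that returns 6 * 7 (the API names are from memory, so verify against the docs):

    (* Build an LLVM module containing: define i32 @answer() { ret i32 42 } *)
    let () =
      let ctx = Llvm.global_context () in
      let m = Llvm.create_module ctx "demo" in
      let i32 = Llvm.i32_type ctx in
      let fty = Llvm.function_type i32 [||] in
      (* define_function also creates the function's entry block. *)
      let f = Llvm.define_function "answer" fty m in
      let b = Llvm.builder_at_end ctx (Llvm.entry_block f) in
      let v = Llvm.build_mul (Llvm.const_int i32 6) (Llvm.const_int i32 7) "v" b in
      ignore (Llvm.build_ret v b);
      print_string (Llvm.string_of_llmodule m)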


> in the self-guided version, your end-of-semester assignment is to change the world through the magic of compilers.

I’m not feeling any pressure or anything


Is there an equivalent 'Advanced Databases: The Self-Guided Online Course'?

Andy Pavlo's CMU 15-721: Advanced Database Systems is similar; you hack on Postgres to implement a Foreign Data Wrapper (FDW).



That class just uses my slides. It says so at the bottom of the page:

> The lecture slides used in the course are taken from Prof. Andy Pavlo's CMU 15-721 course

Just go to the source: https://15721.courses.cs.cmu.edu/spring2023/


Is there an accompanying textbook for your course or only the papers that are listed?


Not for the Advanced class. We cover state-of-the-art systems. Textbooks are about 10 years behind.


okay I'm curious - what are dirty systems programming skills, professor?


It's one of those things where you know it when you see it. I get 18-year-old CMU students showing up knowing how to write x86 SIMD intrinsics, or ones with LLVM JIT experience. Dirty.


running sed against a binary to replace function calls because the source has been lost to the sands of time


Thanks for this. His lecture on SSA was the best explanation of SSA vs. the alternatives that I've found (it builds up the motivation for why you'd want SSA, which is lacking in other explanations), and unfortunately I'd lost track of where I'd seen it. I was looking for it again recently without much luck (it's not on YouTube), and now here it is.


This one? https://vod.video.cornell.edu/media/1_130pq2fh

"SSA is Functional Programming" (the paper) resonated with me, though I remain unconvinced by phi nodes relative to basic blocks taking arguments.
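
For anyone who hasn't seen the two styles side by side, a rough sketch of the same merge point (notation approximate, not any real IR's exact syntax):

    ; phi-node style (LLVM-like): the merge block selects a value
    ; based on which predecessor control arrived from
    merge:
      %x = phi i32 [ %a, %then ], [ %b, %else ]

    ; block-argument style (MLIR/Cranelift-like): each predecessor
    ; passes the value explicitly, like a tail call to the block
    then: br merge(%a)
    else: br merge(%b)
    merge(%x):
      ...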


Another resource that's great for understanding how programming languages work:

https://www.plai.org/


thanks for sharing this!!!!


I went through part of this while working on some liveness analysis. The lecturer and the notes do a good job of conveying ideas.

Well worth using as a resource.


When am I going to be able to do self-paced, for-credit courses from good schools online without the crazy price tag?


Why should online classes be any cheaper? I'm not talking about self-paced, certificate-only classes where you never interact with a real human. You can already find plenty of those for free from any number of top schools. I'm talking about a real class, for real credit, with a real professor and assignments graded by real humans. Online students end up consuming the same resources as regular students (with the exception of occupying space in a physical classroom). What's more, online students require _extra_ accommodations, since they can't just stop by the professor's office for office hours. They have to set up online meetings for that, etc.


Self-paced is the tough ask. Part of the value of formal schooling is the forcing function of deadlines to make you actually spend the time and absorb the material.


An important component of credit courses is their time-limited nature, so this seems unlikely to happen.


> An important component of credit courses is their time-limited nature

Huh? Why would that be an important component?

I mean, I know that it's typical for courses that come with credit, but in my mind it's just a side effect of the administrative side of providing courses, not something that's at all central to accreditation. I.e., if you build up the knowledge over a longer period (which you can still do with, e.g., resits), that's completely fine as well.


Mostly because self-paced = doesn’t finish.

Read any of the research on MOOCs and it becomes clear that we don’t have a clue how to write a self-paced course people finish with any regularity.


How much do you think it should cost?


Georgia Tech's OMSCS comes in at about $7k total. I think that's a good place to start.


Any other graduate-level self-guided courses worth doing? I'm interested in AI/ML/Statistics.


>This page lists the curriculum for following this course at the university of your imagination, for four imagination credits (ungraded).

Seems rather ridiculous of Cornell to dictate terms for the University of My Imagination (UMI). If an imaginary professor at UMI wants to grade my work, I think they should be allowed to do so, and even grant me an imaginary degree if they feel I have earned it. Guess it is true: universities do stifle creativity.



