Hacker News
Writing a compiler in Python using Lex, Yacc and LLVM (alcidesfonseca.com)
67 points by Alcides on Sept 19, 2009 | hide | past | favorite | 43 comments



Inspired by this, I've started writing a compiler for a subset of Matlab. I'm sick of Matlab being an interpreted language year after year, and while I don't hope to change that, I can at least do a proof-of-concept to lend weight to my derision.

I've got the lexer and parser done, thanks to Parsec, and some basic C code generation as a sanity check. Now for the runtime LLVM code generation, to make it feel like an interpreter! Thanks, HN!


I've been fooling around with this idea for a good year now. I've never found the time, but I think about doing something like it all the time. If I come across free time after my thesis, I hope to write a simple typed Matlab-like clone that targets LLVM code (which supports vectors). To me it seems that there is a vacuum for a simple, typed, fast numerics language with native vector support. Writing numerics in Matlab is great, but if you ever have to loop, it ends up dog slow. Porting the code to C is usually not terribly difficult, but it is always a big pain; it usually yields a 10x or greater speedup, which is pretty significant. It would seem that applied mathematicians and physicists around the world would love a new, modern, fast numerics language to replace coding in C/C++/Fortran.


Could the frontend perhaps be taken from GNU Octave? It claims to be compatible with Matlab...


Could you post a link to Parsec? Quick googling didn't turn up any compiler tools. Thanks!


It's a Haskell library: http://hackage.haskell.org/package/parsec

Here's a tutorial that uses it to parse a simplified Scheme: http://en.wikibooks.org/wiki/Write_Yourself_a_Scheme_in_48_H...



He mentioned that llvm-py documentation was lacking... I was going through it, seems pretty darn good to me (relatively). http://mdevan.nfshost.com/llvm-py/userguide.html


Impractical: 1) the compiler is going to be very slow; 2) if you write a compiler for a low-level language such as C, you need a precise correspondence between the data types used by the CPU, the data types in the compiler's implementation language, and the data types in the target language.


On the contrary. Compilers written in a high-level language are at a distinct advantage: they're easier to optimize (significantly so), easier to debug, and support forms that are difficult to deal with in a low(er)-level language like C or C++.

I do compiler development every day, and I no longer touch low-level compilers. Everything I write is in Boo (for my OS), Ruby (for my startup -- by far my favorite language for compiler dev, despite not liking it in general), or Python. As a result, I have far more maintainable code than the majority, and the code I output is incredibly well optimized.

Edit: To clarify, by 'support forms that are difficult to deal with ...' I mean things like using a pure S-exp structure. Rather than a traditional intermediate form, you can represent your compiler state as an S-exp and iteratively optimize and compile it. Standardizing around a form like that, when it's easy to deal with, greatly simplifies code. (This is actually the reason Ruby is my compiler language of choice these days.)
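A minimal Python sketch of the S-exp idea described above (hypothetical code, not the commenter's actual tooling): compiler state is a nested tuple, and an optimization pass rewrites it iteratively.

```python
# Hypothetical sketch: compiler state as nested tuples (S-exps),
# with a constant-folding pass that rewrites the tree bottom-up.

def fold(node):
    """Fold constant integer arithmetic in an S-exp like ('+', 1, ('*', 2, 3))."""
    if not isinstance(node, tuple):
        return node                     # atom: number or symbol, nothing to do
    op, *args = node
    args = [fold(a) for a in args]      # optimize children first
    if all(isinstance(a, int) for a in args):
        if op == '+':
            return sum(args)
        if op == '*':
            product = 1
            for a in args:
                product *= a
            return product
    return (op, *args)                  # leave non-constant forms alone

print(fold(('+', 1, ('*', 2, 3))))      # collapses to 7
print(fold(('+', 'x', ('*', 2, 3))))    # partial folding: ('+', 'x', 6)
```

Because the output is the same shape as the input, the pass can be re-run until the tree stops changing, which is the "iteratively optimize and compile" workflow the comment describes.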


May I ask you what tools you use in Ruby?


Ok, the discussion is becoming pointless: you like Ruby for compiler construction. Well, I'd rather use a language with proper tail recursion and algebraic data types with pattern matching built in, one that is fast and strongly typed [I believe algebraic types with pattern matching give roughly the same possibilities as dynamic typing for this task], with a standard or at least big industrial backing behind it -- such as F# or Haskell. And at the end of the day I'd get the warm feeling that my code is "green" -- it does not waste joules, because it runs fast.

I appreciate your choice, but I'd myself do differently.


If you read carefully what I said, I never said that high-level languages are bad. I said only this: 1) a compiler written in Python, Ruby, or another "slow" language (do you not believe that Python/Ruby are slow?) is going to take an eternity to compile the Linux kernel; 2) you need a _precise_ mapping between the types in the compiler and the language you implement. It is so important that GCC uses a software library for floating-point computation; otherwise you are locked into your CPU's FP implementation and cannot write cross-compilers.


> You need to have _precise_ mapping between types in compiler and the language you implement.

No. You don't need that. You can treat your data however you want until you place it into the final binary. Before that, you can even treat all your numbers as strings if you really want to. If you need some constant-expression evaluation, you just have to replicate the target machine's math operations in software -- no magic involved here. As long as you get the right result, no one cares what you do internally.

If your compilation machine == target machine, you can even construct a function that calculates the expression and actually run it to get the result.
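In Python, for instance, that trick is a couple of lines: build the constant expression, compile it, and run it on the host. (A sketch only -- it assumes host and target arithmetic agree, which is exactly the precondition the comment states.)

```python
# Sketch: when host arithmetic == target arithmetic, a compiler can
# evaluate a constant expression by compiling and running it directly.
expr = "2 ** 10 + 1"                      # constant expression from the source
code = compile(expr, "<const-expr>", "eval")
folded = eval(code)                       # run it on the host
print(folded)                             # 1025
```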


When you optimize, you need to do the intermediate calculations _exactly_ the same way the target machine does. Even if the number is an ASCII or Unicode string, you need to do the computations strictly the same way the target machine does; otherwise you'd break the programmer's assumptions and produce non-reproducible code.


You said they were impractical, then went on to explain why. I then explained why they are not impractical. The speed I'll give you, but the type issue is nonsensical. GCC's use of a software FP implementation is in support of this. Regardless of the language, you'll need to have an FP implementation that matches the target, which is nearly always done in software.

High-level languages for compilation are, 9 times out of 10, a better choice.


1. FP is almost never done in software; it hurts performance big time. The only reason GCC uses it is that it wants to be cross-platform; otherwise it might have perfectly well used native FP. It is also a lot more convenient to use different floating-point formats when they are supported by the language and hardware you are running on than to mess with libraries.

2. Another example: OCaml. It has no single-precision float type, and that is a reason I am not going to use it for writing a compiler; I'd choose Haskell or F# -- both of them have float32.

3. It depends what kind of HLL we are talking about. Python and Ruby are definitely very poor choices for a compiler: lack of strong typing, only one floating-point type, the need to explicitly watch for integer overflow, poor performance, and lack of a mature, bug-free implementation (compared to C++ or ML, for example). ML-derived languages, however, are very good for writing compilers.


1. Any compiler that supports cross-compilation supports a software FP implementation. GCC, MSVC (CL, that is), and many others. You take a speed hit, but it's downright negligible in a compiler.

2. Having never used OCaml for compiler dev, I can't speak on this.

3. The lack of strong typing has no effect on the ability of Python or Ruby to handle compilation. Like the Lisp family, these are very, very flexible languages with strong benefits to compiler developers, and I wouldn't trade them for anything else I've used.

TBH, I think ML-based languages are poor choices for compiler development. You spend as much time worrying about your own code as you do the code you're compiling, which you just don't do when you write a compiler in Scheme, Ruby, or another 'dynamic' language.


1. No, not any. Many cross-compilers do not, if the binary formats and peculiarities of arithmetic operations map well between the host system and the target system -- why waste time? Example: x86 and x64, two different platforms with the same binary representation. About performance: I am not sure it is negligible -- optimization takes a lot of time in modern compilers.

3. You have not seen ML code, I am sure. ML has type inference; you are not bothered by types too much. And it has pattern matching, which is very good when traversing trees. Neither Scheme nor Python/Ruby has this. ML is also a language with completely formalized semantics, which in combination with strong typing allows you to prove that your compiler is correct by means of formal provers.


I've done quite a lot with the ML family, just not OCaml specifically. Type inference helps, but being able to use an S-exp-style representation à la Lisp is a huge, huge benefit in compilation. You cannot do this in any of the ML family.

Pattern matching is a great feature, and it's available in all of the languages I mentioned. It's built in to Boo, there are many libraries for it in Scheme, and I have my own S-exp pattern-matching libraries for Python and Ruby. It may have come from the ML world, but it's by no means exclusive to it.

As for proving correctness, depending on the language to provide this is simply not realistic in compilers. You've got a much better shot at correctness by making your compiler easier to work on, IMO.


There have been many attempts to produce correctness proofs for compilers of popular languages, C or Ada for example: http://portal.acm.org/citation.cfm?id=1315602 http://pauillac.inria.fr/~xleroy/publi/compiler-certif.pdf


Yes, I'm aware. However, like most such things, they're simply not practical. Proving compiler correctness is insanely difficult (if it weren't, we'd have much better compilers), and it very frequently ignores optimization entirely, making such compilers useless.

In theory, proving correctness is a good way to go. In the real world, it's not remotely viable.


> FP is almost never done in software, it hurts performance big time

During the compilation? Would you even see the difference if the compiler did 0.1*0.2 by emulating cpu processing, instead of via one instruction? The cost of register allocation for a simple function will be greater than optimising many constant fp expressions. It simply doesn't matter here. If you have a convenient type, you use it, if you don't, you download any libieee754 and interface with its functions...
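A sketch of that emulation in Python: the host computes in doubles, so a hypothetical float32 target is simulated by rounding every value through the IEEE 754 single-precision encoding.

```python
import struct

def f32(x):
    """Round a host double to IEEE 754 single precision, in software."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

# Constant-folding 0.1 * 0.2 the way a float32 target would compute it:
target_result = f32(f32(0.1) * f32(0.2))
host_result = 0.1 * 0.2                   # host double arithmetic
print(target_result == host_result)       # False: the two results differ
```

The cost of this round-trip per folded constant is trivial next to the rest of compilation, which is the point being made above.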

Seriously - if you're writing a compiler, there are so many more important problems than lack of mapping between the target and local types :( It may not be an elegant solution to handle fp expression evaluation via an external library, but it's only done during the compile time (so only once) and it's easily solvable.


Okay, I've been wondering about this for a while: what processor would you possibly want to run a compiler on that has floating-point arithmetic incompatible with IEEE 754? Are we still making cross-compilers that run on a VAX?

(Not that this would stop the GNU people. Have you seen the code from bison? They actually emit code for their own reimplementation of memcpy() on some platforms. They are fanatical about compatibility. Or they were last time I checked. Which was sometime during the 90s.)


Many embedded platforms (the vector units on the PS2, several FP coprocessors in the ARM and MIPS world, etc) aren't (entirely) 754-compatible. Since these are commonly cross-compiled, you can easily run into issues.


The Cell's SPEs aren't fully IEEE 754 compatible. Some GPUs may not be.


> Compiler written in python, ruby and other "slow" .. languages going to take eternity to compile linux kernel.

What makes you think everyone is out to write a C compiler?


What makes you think that I think this way? The Linux kernel is just an example of a relatively big program where the performance of the compiler is an important issue.

Another example of a performance-sensitive application of compilers is the background Java compiler in Eclipse -- even for rather small programs it has to be fast. Ruby is 10-100 (actually more than that) times slower than C/Java, so if your optimizer (written in C/Java/Haskell) needs 0.5 seconds to optimize some code, the same code will need (by a very optimistic measure) 30 seconds. Python is not much better. Scheme is a bit better (especially the Stalin compiler), and Lisp is almost good.


Background compilers do not need to perform heavy optimization, so they can skip a lot of time-consuming steps.


Even if they do not need a lot of optimizations, they still have to be fast. You would not like the performance were they written in Ruby.

And do not forget by the way that Ruby is a memory hog.

http://shootout.alioth.debian.org/u32q/benchmark.php?test=al...


Have you actually used Stalin? Compiling anything with it is shockingly slow. It does, of course, produce very fast programs—but it can take minutes to generate them, even in simple cases.


No, I did not, but I read papers written by the author. Indeed it is slow, but it would still be good for a final compilation. Alas, it is R4RS only, and I have also heard that it sometimes generates incorrect code. Still, Scheme is a better choice for performance-sensitive code, in my opinion, than Ruby/Python.


Stalin and MLton are whole-program optimizing compilers. They take a bit longer than other compilers, but the results are impressive.


You probably missed the point -- my concern is not the speed of compilation _of_ _the_ _compiler_ _itself_, which is totally irrelevant. What is important is the performance of the _created_ _compiler_, which is going to be OK, taking into account the quality of Stalin's optimizer.


We can debate the performance of a hypothetical compiler for eternity, or we can cite examples; slow compilers written in a non-C language along with publicly available input data.

i.e. benchmarks or it didn't happen.


Man, you are being difficult. Ruby is well known to be a very slow language (if you want numbers, check http://shootout.alioth.debian.org/u32q/benchmark.php?test=al...). Optimizing compilers are computationally demanding, and the proof (albeit an extreme one) is the Stalin Scheme compiler, which can take minutes to compile code, but the result is sometimes faster than C/C++ code, or at least on par. Now, this compiler is written either in Scheme itself or in C -- I do not remember -- but definitely not in Ruby. If a compiler written in C or Stalin-compiled Scheme takes a minute to compile your code, one written in Ruby will take at least ten minutes.


And I do not know where you saw me saying C is the ultimate language for writing compilers. I never said that, and I do not think so; I just said Python/Ruby are bad for the task.


Sorry, I must have misunderstood your intent then. Python and Ruby are probably fast enough for compiler construction as well; I am sure they could be tweaked by an expert to yield acceptable performance, but I am just not such an expert :-)


Did you read the post? The guy was writing this for an academic project where speed didn't matter, and secondly he was using LLVM for code generation. Low-level details don't really come into play in compiler construction until the code-generation step; most of the work up to that point is text parsing, syntax checking, and AST creation. In other words, tasks suited perfectly to a language like Python.


> low level details don't really come into play in compiler construction until the code generation

Not true. Modern compilers do a lot of optimization long before code generation, and you really want, for example, floating point to behave exactly as it does on the target machine if you collapse operations on constants such as 1.0D + 4.0. And aside from optimization, the lexer has to distinguish single-precision and double-precision floats.
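That lexer distinction is a small matter of tokenizing suffixes. A hedged Python sketch, assuming a hypothetical C-style 'f' suffix for single-precision literals:

```python
import re

# Hypothetical sketch: classify float literals by suffix so later passes
# can apply the right (single- vs double-precision) constant folding.
FLOAT = re.compile(r'(\d+\.\d+)([fF]?)')

def classify(token):
    m = FLOAT.fullmatch(token)
    if m is None:
        raise ValueError("not a float literal: %r" % token)
    return 'float32' if m.group(2) else 'float64'

print(classify('1.5f'), classify('4.0'))  # float32 float64
```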


If what you are saying is true, and the language you write the compiler in needs to mirror the type properties of the target, then there would be no cross-compilers, new architectures, or new languages, because you would be stuck with the type semantics of your original platform. It doesn't make common sense, let alone technical sense.


Of course it is true. And of course it does not make common sense -- common sense is usually made by the brain/mind. If you cannot mirror the properties of the target machine, you cannot optimize code for that target -- period. Any computation made in the compiler will potentially differ from the same, non-optimized computation made on the target. The more computations have been optimized in the compiler, the bigger the probability that you'll get wildly different code.

However, if you do not have the target datatypes you are dealing with during compilation, you are bound to simulate them.

Example: in C, (unsigned char)(2 + 255) == 1 -- the value wraps modulo 256 when stored in an unsigned char.

That comes for free in C. In Ruby you'll have to check for overflow manually, and do it for each operation. You can invent wrapper classes and all kinds of things, but why? Use a language with machine types, and the check will be done for you by your Core Duo.
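For illustration, here is what that manual per-operation bookkeeping looks like in a dynamically typed host -- a sketch in Python (the thread's language of interest), masking after every operation:

```python
# Sketch: emulating C 'unsigned char' arithmetic (mod 256) by hand,
# the per-operation checking a Python/Ruby-hosted compiler would need.

def add_u8(a, b):
    """Add with C unsigned-char semantics: the result wraps modulo 256."""
    return (a + b) & 0xFF

print(add_u8(2, 255))    # 1: wraps, matching the C example above
print(add_u8(100, 100))  # 200: no wrap
```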


> you really want to have for example floating point to be behaving exactly like in target language

You keep writing that like it's a hard thing to do. It isn't. More to the point, it's a small part of the whole problem.


Because as everybody knows, just as the correct way to learn about weightlifting is to jump straight to benching 350lbs, the correct way to learn compilers is to write a complete gcc replacement as your first project.

If you can't manage that, then, geeze, what kind of crappy programmer are you?



