Why split lexing and parsing into two separate phases? (tratt.net)
159 points by ltratt on May 4, 2023 | 122 comments



Compilers used to be my favorite subject in CS (circa 1975) and back then I never saw a single case where lexing was handled in the parser. A few factors back then may have been responsible for this.

First, memory was very limited by today's standards, say 30 to 60 KB. By making lexing a separate pass, the source could be discarded before the parser started, and the tokenized intermediate file written by the lexer could be read during the parsing pass. Typically, the compiler might keep the name table of all identifiers in memory between these passes.

Around the mid 70s, languages that could be compiled in a single pass were investigated. Pascal was one of these. Pascal didn’t support separate compilation either so the linking step was eliminated. Nevertheless, the early Pascal compilers still did lexing separate from parsing; I learned Pascal by studying the source for Wirth’s compiler (it had crazy inconsistent indenting).

The second reason lexing was done separately was performance. Touching every character of the input source file was a major bottleneck for compilers back then, so optimizing the lexer was perceived to be very important. By doing the lexing in a tight loop, rather than calling the lexer once per token, a lot of the overhead associated with those calls was eliminated.

Also, there was a questionable attraction to bottom up parsing. Knuth had shown around 1965 how LR (left to right) shift-reduce parsers could parse a very large family of grammars efficiently. Everyone was enamored with them. I even wrote a set of FORTRAN programs that would construct the SLR tables suitable for a subset of LR parsable grammars. When lex and yacc applications came along, everyone thought that every compiler should be built this way (to be fair, Wirth and Per Brinch Hansen were both designing languages that could easily be parsed by recursive descent because their grammars were LL(1), a smaller family of grammars than LR(n)). I've never seen anyone try to use only an LR parser at the lexical level combined with the normal grammar parsing.

I went back to University for another graduate degree in 1984, and I was surprised to hear the professor that taught the compiler class say that everyone should be using lex and yacc for any compiler development. By then, in the real world, people had discovered that much more meaningful error messages were possible with LL or recursive descent (i.e. top down) parsing.

Now, performance and memory considerations are different. It's practical for compilers to read an entire source file in a single read (or memory map the entire source file), saving all the round trips to the OS for reading input. Top down parsing provides better error messages, and it fits well with parsing all the way down to the lexemes.


By then, in the real world, people had discovered that much more meaningful error messages were possible with LL or recursive descent (i.e. top down) parsing.

This is largely orthogonal to whether you separate lexing out, though - it's perfectly possible (and very pleasant) to use recursive descent parsing over a token stream rather than a character stream.
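
For anyone who hasn't seen it, here is a minimal sketch of recursive descent over a pre-lexed token stream rather than characters (the token kinds and the tiny grammar here are made up):

    type Token = { kind: "num" | "plus" | "star" | "eof"; text: string };

    // Recursive descent over tokens: each grammar rule is a function that
    // peeks at the next token instead of looking at raw characters.
    class Parser {
      constructor(private toks: Token[], private pos = 0) {}
      private peek() { return this.toks[this.pos]; }
      private next() { return this.toks[this.pos++]; }

      // expr := term ("+" term)*
      parseExpr(): number {
        let value = this.parseTerm();
        while (this.peek().kind === "plus") { this.next(); value += this.parseTerm(); }
        return value;
      }

      // term := num ("*" num)*
      parseTerm(): number {
        let value = Number(this.next().text);
        while (this.peek().kind === "star") { this.next(); value *= Number(this.next().text); }
        return value;
      }
    }

    // Tokens for "2 + 3 * 4", as a separate lexer would produce them:
    const toks: Token[] = [
      { kind: "num", text: "2" }, { kind: "plus", text: "+" },
      { kind: "num", text: "3" }, { kind: "star", text: "*" },
      { kind: "num", text: "4" }, { kind: "eof", text: "" },
    ];
    console.log(new Parser(toks).parseExpr()); // 14

The parser itself never touches a character, which is exactly what keeps each rule readable.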


Yes, absolutely. I was just pointing out that in those days there were still people committed to bottom up parsing despite the advantages of top down approaches. The bottom up parsing with LR parsers that I was familiar with always used a table driven approach for the language grammar and assumed a separate lexer, sometimes hand written and sometimes generated by a program like lex. To me, the effort saved by using lex and yacc was completely lost by having to incorporate extra work to get good error messages for users out of LR parsers.

Recursive descent over a character stream might have a negative performance impact, but I'm not sure of this, considering the simple grammar involved in the lexical portion of the overall grammar and the ability of today's tools to optimize function calls via inlining etc. If I was writing a compiler today for modern hardware, I would like to try using recursive descent on a single grammar that went all the way down to the characters.


"I went back to University for another graduate degree in 1984, and I was suprised to hear the professor that taught the compiler class say that everyone should be using lex and yacc for any compiler development. By then, in the real world, people had discovered the much more meaningful error messages were possible with LL or reecursive descent (i.e., to down) parsing."

The error messages complaint is the one I always see. I rarely see other, specific complaints.

Question: Is this "error messages" complaint referring to error messages for lex, yacc or both. Does the complaint apply evenly to both programs. As a flex user, generally, I rarely if ever see error messages.

I like flex because it does not require as much memory as the alternatives. Ideally, I want scanners to perform roughly the same from computer to computer and from input to input, irrespective of the amount of memory available or the size of the input. I like UNIX utilities that read files line by line.


It is not the error messages produced by yacc when you have a bug in your parser. It is the error messages produced by the compiler when the input program does not conform to the language specification. More useful error messages let users more rapidly diagnose the error in their program and fix it.


This means that the "error messages" complaint does not apply to someone who is only using lex and not yacc. Thus, when someone complains about "lex and yacc" and error messages, they are complaining about yacc. I often use flex without using yacc. I am not using it in compiler construction. Thus, according to UncleMeat's comment the "error message" complaints do not apply to flex if used without yacc, i.e,, if used for purposes other than compiler construction..


Complaining about error messages when using LALR is one thing. Constantly singling out "lex and yacc" (maybe because of some bad experience in a compilers course) is another. Yacc is not the only implementation of LALR; there are others, e.g. the Lemon parser used in the SQLite project. There is nothing LALR-specific about lex.


This isn't about these tools in the abstract. This thread is about various compiler architectures.


I saw the source of a Pascal parser which was crazy beautiful and simple, and couldn't find it since. Does anybody know something like that?


Niklaus Wirth had a very readable and concise text about parsing Pascal family languages (originally Pascal style, eventually evolving to an Oberon dialect): https://people.inf.ethz.ch/wirth/CompilerConstruction/Compil...


This may be interesting (YMMV): read page 50 of "Compiler Construction for Digital Computers" (1971), which lists other reasons, some of which are still valid and good to know.


I usually love Laurence's posts. However, one thing he said in this one is wrong in my experience.

> Since some parsing approaches such as recursive descent parsing unify these phases, why do lex and yacc split them apart?

I have written two recursive descent parsers for one small, but production quality compiler. I am in the middle of writing another two, which will also be production quality and much bigger. And I have written various small parsers.

I always use recursive descent, and I always split the lexing and parsing phases.

Now, I fully acknowledge that my experience could be an outlier, but I believe the benefits of lexing are so big that most recursive descent parsers have separate lexers.


I've seen interleaved lexing and parsing, where the recursive descent parser asks for the next token, which is computed on demand.

They're still separate modules. You just don't have to run lexing to completion ahead of time (and as a consequence, it's natural to let lexing be context-sensitive where needed).
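
A rough sketch of that arrangement (a toy, hand-rolled lexer; all names made up): the lexer exposes a nextToken() that scans just enough characters to produce one token, and the parser pulls from it on demand.

    // On-demand lexer: tokens are produced one at a time as the parser asks.
    class Lexer {
      private pos = 0;
      constructor(private src: string) {}

      nextToken(): { kind: string; text: string } {
        while (this.src[this.pos] === " ") this.pos++;          // skip whitespace
        if (this.pos >= this.src.length) return { kind: "eof", text: "" };
        const start = this.pos;
        if (/[0-9]/.test(this.src[this.pos])) {
          while (/[0-9]/.test(this.src[this.pos] ?? "")) this.pos++;
          return { kind: "num", text: this.src.slice(start, this.pos) };
        }
        return { kind: "punct", text: this.src[this.pos++] };   // single-char fallback
      }
    }

Because the parser drives it, nothing stops the parser from telling the lexer which mode to scan in next, which is where the context sensitivity comes from.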


> I've seen interleaved lexing and parsing, where the recursive descent parser asks for the next token, which is computed on demand.

This is how I implemented it in my language. Evaluator asks parser for a value, parser asks lexer for one or more tokens, lexer asks the I/O layer for characters. It's nice.


> I've seen interleaved lexing and parsing, where the recursive descent parser asks for the next token, which is computed on demand.

You even get this for free in a lazy language like Haskell, where your parser can accept a list of tokens `[Token]` whilst the list itself is only computed on-demand whenever the parser tries to get the next one.

Well, except for the context-sensitive option:

> (and as a consequence, it's natural to let lexing be context-sensitive where needed).


This is how I set things up in the little language I am writing in C++. I put the lexer into an iterator.


A JavaScript parser has to do something like this. The text /a/ tokenizes differently depending on where it is in the grammar, e.g. a/b/c is division, (/a/) is a regex.


That's not unusual though and can just be normal lexing (or parsing - IMO it can be a pretty arbitrarily drawn line) e.g. in many languages = 'tokenises differently depending on where it is in the grammar': a = b is assignment; a == b is an equality test.


No, = vs == is purely a lexer decision, easily resolved by greedy _character_ matching.

In JavaScript the following give wildly differently shaped ASTs, since the / character initiates RegEx lexing when it appears in a _value position_, which has different lexing rules than the division operator, which _only_ appears as a potential binary operator. Consider the following:

a = b + /c.y/+d

a = b /c.y/+d

(Given b=1 , c={y:2} and d=3 )

The first assigns to a the result of adding b, the RegExp matching c.y, and +d, giving us the string "1/c.y/3" (JS converts most types to string on addition, and strings are dominant during addition, as concatenation). This is correct, since the + operator before the / character forces the parser to look for a value.

The second reads as: assign to a the value of b divided by property y of c, then divided by d (i.e. the numeric computation (1/2)/3 = 0.166666...). This is because the parser is looking for a binary operator after the b identifier, and when the / character appears it becomes an operator.

So, without knowing the parsing context (operator or value position), the lexer decision is ambiguous.

This is somewhat like how A<B<C>> is ambiguous when parsing C++/Java/C# templates/generics vs. the <, > and >> operators, but that case is usually easier, since the parser could include a hack in the generic parsing code that mutates the token stream if it encounters >> when closing a generic. (The JS ambiguity is worse, though, since RegEx lexing rules are totally different from regular JS lexing.)


True; but the division vs regex example that I replied to was also 'easily decided by greedy character matching and easily resolved'.

I don't know how JS implementations actually do it, but this is what I meant by the line being a bit arbitrary, in my limited experience - you can just lex a 'forward slash token' or whatever and keep your lexer/parser separation even with this ambiguity.


Consider this text: /a+ ”/ How would you tokenize this if you don't know the grammar context? If it is a regex, the space is significant - it is part of the pattern. If it is a division, the quote will start a string.


With horrible hacks in the reference - search for InputElementDiv and InputElementRegExp if you want to read more about it.

It's not so bad in practice, since you only need to look at the previous token to decide whether a / should be parsed as a division or regular expression, once you've excluded the possibility for // and /* comments.

Comments are also why the empty regular expression in JavaScript must be written as /(?:)/
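
In sketch form (token names made up, and deliberately not a complete solution - the "( x ) / b / a" counterexample in the reply below defeats it), the previous-token heuristic looks like:

    // If the previous significant token could end a value, "/" is division;
    // otherwise it starts a regular expression literal.
    const valueEnders = new Set(["identifier", "number", "string", ")", "]"]);

    function slashIsDivision(prevTokenKind: string | null): boolean {
      return prevTokenKind !== null && valueEnders.has(prevTokenKind);
    }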


How about this:

    ( x ) / b / a
It could be part of:

    var x = (x) / b / a;
Or

    if (x) /b/a.test(z)
X could be an arbitrarily complex expression. Looking at one preceding token is not enough to disambiguate the slash.

I suspect the only real solution is to intertwine lexing and parsing, so the parser asks for the next token with a flag indicating context. But this constrains what parsing algorithm can be used.


Most lexers just match the longest legal token, so there is no ambiguity between = and == and the grammar does not influence tokenization.

The issue with JavaScript is that, say, /a+”/ tokenizes differently based on the position in the grammar, so you can't tokenize in a separate pass before parsing.
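
Maximal munch in sketch form (hypothetical names): keep taking characters while a longer operator is still possible, so = vs == never reaches the grammar.

    // Greedy operator lexing: peek one character ahead and take the longest
    // operator that matches.
    function lexOperator(src: string, pos: number): { kind: string; len: number } {
      if (src[pos] === "=") {
        return src[pos + 1] === "=" ? { kind: "eq-eq", len: 2 } : { kind: "assign", len: 1 };
      }
      throw new Error(`unexpected character at ${pos}`);
    }

The JavaScript / problem is precisely the case where no amount of character-level lookahead like this settles it.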


My experience is that splitting lexing and parsing is by far the norm, even in recursive descent.


Oh, thank you! I was wondering if I was an outlier.


Nah, I’ve been toying with writing a compiler and that’s where I’ve gotten to too, it seems to my inexperienced self at least, if what you’re implementing is a specified language, combining the phases makes sense since you only have one thing per language, but while making a language and experimenting, having them be separate helps reduce a lot of complexity in the parser for things like tokens that go across multiple lines.

I have no idea if the above observations are generally true, but that’s been my experience with the handful of parsers I’ve thrown away because they get too unwieldy for what I want.


Funny; my experience is the opposite.

My first production-quality parser was for the `bc` language, which is specified in a standard.

But your experience is just as valid as mine.


I know right?

I had a friend in college who nearly always solved problems 'backwards' to everyone else cause it's just how her brain worked, like we'd do a standard for(int i = 0; i < length; i++) and she'd have for(int i = length; i >=0; i--) kinda backwards.

I don't question it at this point anymore and just marvel at the mathematical validity of it all.


It's buggy though


It definitely can be, and it was mostly an illustrative example for the ‘mathematical validity’ part of my comment. When you’re just starting out, working valid code is better than nothing!


I think you're right that most modern recursive descent parsers have separate lexers, but I think what you'll find in many older recursive descent parsers is that when it's clear which token type may come next, the parser will often call a specialised function that will return the token or throw an error if it's not that specific type.

E.g. you might be calling a "getIdentifier()" function instead of a generic "getToken()".

In practical terms you sort-of have a lexer separation anyway - you will have a set of functions that each lexes one type of token. You just call them directly instead of letting a generic function discriminate between the token types.
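
A sketch of that style (the function name and return shape are made up): the parser calls a token-specific scanner directly when the grammar already knows what must come next.

    // Instead of a generic getToken(), scan exactly one identifier or fail.
    function getIdentifier(src: string, pos: number): { name: string; next: number } {
      if (!/[A-Za-z_]/.test(src[pos] ?? "")) {
        throw new Error(`expected identifier at position ${pos}`);
      }
      let end = pos + 1;
      while (/[A-Za-z0-9_]/.test(src[end] ?? "")) end++;
      return { name: src.slice(pos, end), next: end };
    }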

But a generic separation works also for grammars where a scannerless approach is awkward. It's always possible; worst case you add support for backtracking and/or add helpers for more generic parses, but if you need to jump through hoops like that it quickly gets silly. And if you're first used to the separation there's little reason to write parsers in two different ways depending on the grammar.

Ultimately I think that's the reason why scannerless parsers are relatively rare.


That's how most of my recursive descent parsers have worked. I think the first ones I wrote did have a lexer/parser split and then I realized it was easier to use the specific calls for each expected token class.


Most of mine too. When you're hand-writing the whole thing anyway, and the grammar makes it obvious the next token can only be an identifier, it feels wasteful to call a generic function, and so I rarely add one unless I get to somewhere in the grammar where there's no easy way to determine a single or small set of token types to parse. But at the same time, if you're reaching for automation, the lexer side is easier to automate "cleanly" and most of the automation tools tend to provide a single entry-point.


In my first C compiler, I combined the lexer and preprocessor. That was a mistake. The (much) later one I wrote, Warp, tokenized and then preprocessed. It was much easier to work with.


It is also about performance.

I've watched a video recently from a guy who compiled the entire Linux kernel in under 1 second, I believe. He used TinyC and also noticed that something like 90% of compilation is tokenization of headers that are included many times - there are headers that are included thousands of times in almost every C file - so he ended up caching tokens. So a big reason to have a separate tokenizer is that tokenization is a simpler task and can be optimized with all the low level approaches: perfect hashes, crafted nested switch/if tries, branchless algorithms, compiler intrinsics, etc.

A good tokenizer is about as fast as the speed of writing the output array of records into memory. This means it is important to choose the right memory layout for your tokenized data, so that when the parser reads tokens it has as few cache misses and indirect memory accesses as possible. Tokenization can be thought of as a sort of in-memory compression.
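
One way to read that last point (a hypothetical layout, not what TinyC actually does): store tokens as parallel flat arrays rather than one heap object per token, so the parser walks them with predictable, cache-friendly access.

    // Struct-of-arrays token buffer: kind, start offset and length live in
    // contiguous typed arrays; the token text is just a slice of the source.
    class TokenBuffer {
      kinds: Uint8Array;
      starts: Uint32Array;
      lengths: Uint32Array;
      count = 0;

      constructor(capacity: number) {
        this.kinds = new Uint8Array(capacity);
        this.starts = new Uint32Array(capacity);
        this.lengths = new Uint32Array(capacity);
      }

      push(kind: number, start: number, length: number) {
        // A real implementation would grow the arrays when full.
        this.kinds[this.count] = kind;
        this.starts[this.count] = start;
        this.lengths[this.count] = length;
        this.count++;
      }
    }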


Excellent summary. A couple other reasons for a separate tokenizer:

1. Sometimes all you need is a tokenizer - such as for highlighting in a code editor

2. D has a construct called a token string - where a string literal consists of tokens

3. A separate tokenizer means the lexer and parser can run in separate threads


I won't say this is the explicit reason recursive descent parsers have separate lexers, but there is a major downside to folding the lexer into a recursive descent parser, which would occur when/if you make the leap to packrat parsing. If the parsed language develops far enough you'll often run into edge cases where backtracking can become pathological. One benefit to recursive descent is that you can often just fix this by memoizing the parser functions (called a packrat parser).

However, you really don't want to memoize calls to lexer-y parser functions because it's almost always faster and less memory-intensive to simply re-lex the source string. This gets more complicated if you're parsing a stream (but if your language is being parsed from a stream you should definitely be giving strong preference to language constructs which don't require backtracking).
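
A rough sketch of that split (the memo keying and types are made up): grammar-rule parses get memoized by (rule, position), while token-level scans deliberately don't, since re-scanning a handful of characters is cheaper than a cache entry per position.

    type ParseResult = { node: unknown; next: number } | null;

    // Packrat-style memo table for grammar rules only.
    const memo = new Map<string, ParseResult>();

    function memoized(rule: string, pos: number, parse: () => ParseResult): ParseResult {
      const key = `${rule}@${pos}`;
      if (memo.has(key)) return memo.get(key)!;
      const result = parse();
      memo.set(key, result);
      return result;
    }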


Incremental re-lexing works great in IDEs, where you want to preserve tokens and the state associated with them. It also helps to isolate errors by attributing them to a token that contains the change rather than somewhere far off. Incremental parsing is the same: it costs more in terms of memory and even overall time, but provides better feedback for the IDE.

Batch compilation is a different beast, but I haven't written one of those for a long time.


I have similar compiler history and employ similar practice.

Keep things where they logically belong - the parser is for mapping tokens onto syntax, and for that it needs tokens! It doesn't matter if lexing is done in a separate pass, or "interleaved" in time. What matters is that each "pass" is logically distinct.

Most languages are designed so that it's easy to produce tokens from text, without knowing where you are in the syntax tree - that makes it easy to read code fragments.

(There are languages where the same string can have very distinct structure & meaning depending on context. They tend to cause much swearing and confusion.)


You're quite right, I over-simplified -- mea culpa! That should have said "often unify these phases". FWIW, I've written recursive descent parsers with and without separate lexers, though my sense is that the majority opinion is that "recursive descent" implies "no separate lexer".


For what it's worth, in my little corner of the world, all of the recursive descent parsers I've seen and worked with have separate lexers. I can't recall seeing a single recursive descent parser in industry that didn't separate lexing.

However, I do often see a little fudging the two together for funny corners of the language. Often that just means handling ">>" as right-shift in some contexts and nested generics in others.


That's not my impression of the majority opinion, fwiw. (I wrote my first recursive-descent parser in the 80s and I learned from pretty standard sources like one of Wirth's textbooks.)


As another data point in addition to the sibling comments, all IntelliJ language parsers use recursive descent with a separate lexer.


I don’t think it’s an outlier, I do my lex phase before the parsing phase, it’s not concurrent. However, I find reporting bad tokens and parse errors to work well even if the phases are sequential.


I’ve only written lexing and parsing together in recursive descent parsers. What am I missing out on?


Lexing separately is cleaner in that the lexer's only job is to produce tokens and then the parser's only job is to consume tokens and produce e.g. an AST. It also means that you can use that token stream for other things that don't require a full parse (e.g. for syntax highlighting). The disadvantage is that you're often doubling the allocations and increasing the memory footprint. I alternate between separate lexer, combined recursive descent parser and generated PEG parsers depending on what is more important: speed + maintainability, speed of execution or speed of development


If you find the code wants factoring out (if you don't, just leave it be, parsers are wonderful things to have written but IME writing them is generally a labour of love) then that split tends to be very natural and works well in the vast majority of cases.

It's also worth having a poke around at some of the Racket community's Language Oriented Programming efforts, they usually have a split they call 'reader' versus 'expander' that I found helped me get my head around how the dividing line can be drawn and why you'd want to.

(Racket's approach isn't quite lexer vs. parser AFAICT but while I at least -think- I've understood it well enough to use ideas, I'm not going to pretend I understand it well enough to provide a correct explanation, let alone a well written one)


If you don't think you're missing out, you're not. I think it could be preference.

I just prefer for a lexer to be responsible for handling characters, and a parser be responsible for handling tokens. So Single Responsibility Principle.


I believe the point being made is that while it is possible to merge lexing and parsing with a recursive descent parser, it is not essential to this style, and separating lexing and parsing while doing recursive descent parsing is also a legitimate approach.


Can you expand on why you prefer that approach?


It's much easier to have a grammar that is LL(1) - which also makes recursive descent that much more convenient to write - if you separate.


I think the main reason why you split it is that it simply is easier to get the semantics you want, which is roughly what is being demonstrated here. Working on tokens is easier in the parser because things like the semantics of whitespace and lexical ambiguities can be handled by your algorithm of choice. Having tried to write full grammars using only parser combinators on top of raw codepoint streams, I can say:

- It's completely possible

- BUT the bits that would normally be handled elegantly in the lexing phase are now strewn into every single grammatical element's parser, and it's surprisingly hard to formulate composable solutions that don't lead to undesirable resolutions to ambiguous situations (or rejecting desired grammar, or accepting undesired grammar, etc.)

I think it's possible that with differently defined grammars, it would be fine. For example, JSON should be completely trivial just because of how few tokens there are and how little ambiguity there is; I tend to write JSON parsers directly against the codepoints for this reason, in the odd event I have to write a JSON parser. (Yes, it comes up. Yes, I am aware JSON has some weird edge cases and I shouldn't do this.)


I agree! The article addresses this fairly well.

In my experience, another reason to split is to be able to manipulate the AST independently from parsing. For example, you may have multiple dialects or representations (such as a serialized form) for which you don't require the lexer, which is typical for non-toy or trivial parsers.

Even if tokenization happens during parsing, there is almost always still some separation. Often times it's just semantics, where the lexer doesn't strictly emit tokens, but simply represents the token in-place.

Another (more historic) reason the phases are typically separate is that it allows you to reason about them and the finite state machine. For example, you could write the state transitions by hand in the traditional form and reduce as you go, just like in the dragon book.

If you want to reason about grammars then it behoves you to wield the traditional form.


Yeah, on the note of benefits from the split, I'm actually curious about when the best time to implement C macro expansion is. I think after lexing but potentially before parsing might be a good answer, which I did not expect. Of course... It would be wise to take a look at what the Clang and GCC frontends do, but I haven't gotten the chance.


Traditionally the C Preprocessor is a separate dialect distinct from C itself, and outputs C. So, historically it would be before lexing with respect to C. It's also been used to process non-C text, but that's fairly obscure. The hint is in the name: "Preprocessor"

I'm sure some modern compilers blur the stages nowadays.


Yeah, I totally hear what you are saying. At the same time, I at least do know that Clang and GCC do the preprocessing in-line with parsing, rather than forking out to an external preprocessor. In fact, I assume these days, the way it works is actually that your cpp binary would basically be the compiler except it disables the part after the preprocessor and just dumps the resulting AST.

That said, I have not read a lot of compiler source code to back this thought up. I have read some, but mostly in the C++ backend side, usually to resolve curiosities about vtable implementation details.


The #line directive essentially requires that you either handle macros directly in the compiler or add a syntax extension to carry that info over the wall.


> or add a syntax extension to carry that info over the wall.

Isn't that why #line exists in the first place? A lot of online literature, including the clang and GCC manuals, are worded in a way that makes it seem like the purpose is for consumption by the preprocessor, but I'd be surprised if the original purpose wasn't to communicate line numbers from the preprocessor to the compiler. The #line directive isn't mentioned in K&R; not even in the "ANSI C" second edition, despite C89 standardizing #line.

Of course, it's a preprocessor directive, but leveraging the syntax this way is a nice hack:

1) It's very simple to identify C preprocessor directives (see #2), and cheap to add a pre-pass to a C compiler to identify directives, accepting #line but throwing an error (or complaining about and discarding) any other directive.

2) There's no way to generate valid, non-preprocessor C code that might look like a directive. '#' is not and never will be an operator in C or have any other lexical role; it can only exist within comments, character literals, or string literals.

3) You can gainfully feed #line to the preprocessor, for example from the output of a different transformer like M4.


The Unicode input alphabet is up to 4096 symbols.

The Lexer's output alphabet (a set of tokens), which is the input for a Parser, is up to a few tens of symbols at worst. The smaller the alphabet, the easier it is to make an efficient recursive descent parser.

Regular expressions are Finite-State Machines under the hood. There exists a single minimal deterministic FSM for any regex, and that model is the best in performance. In contrast, for context-free grammars we don't have such a unique computation model that would be best in performance. Because of this, it is reasonable to separate the regex-based part of the grammar (the lexing part) and treat it independently.

Also, when parsing, perhaps it would be easier to recover from input syntax errors from the Token-based input rather than Unicode-based input.

Finally, my personal opinion is that if you write a recursive-descent parser manually, it is really easier to reason about just the limited number of token classes rather than the entire Unicode alphabet.


I used to feel similarly until I spent a bunch of time with nom [1].

There's not really any difference between tokens and grammar rules, in my experience. They're just terminal and non-terminal symbols yeah? Doing these things in one place doesn't get in the way of writing a nice LL recursive-descent parser.

I'm sure it will depend on what you're doing but for me at least splitting into lex and parse has a much higher bar now. I've done it both ways many times and at least for my applications adding a separate lexer has historically been the wrong abstraction in retrospect.

[1] https://blog.logrocket.com/parsing-in-rust-with-nom/


Well, my personal experience was in the opposite direction actually.

I used to use a combinator-based approach without Lex/Syn separation (aka PEGs) for a long time. But then I came to understand that the separation approach actually performs better, and also that working with and debugging token sequences while writing a parser manually is just handier (at least for me).

But this is my personal experience, of course. I do believe too that it all depends on the goal, that parser micro-optimizations are not that critical in many cases, and that the combinator approach actually works quite well too.

As for Nom, I can say that it works quite well. But I think that its performance gains stem from the fact that Rust is a systems PL, and it optimizes function combinations just better than, let's say, JavaScript or Python.

In my incremental parsers library Lady Deirdre I utilize Lex/Syn separation and LL(1) recursive-descent parsing, and it shows much better performance than Tree-Sitter, at least on relatively big files [1].

[1] https://github.com/Eliah-Lakhin/lady-deirdre/tree/master/wor...


> The Unicode input alphabet is up to 4096 symbols.

Huh?

How did you come up with this number?


Ah, I apologize for the misdirection. In general it could be much bigger, of course. Wikipedia [1] says UTF-8 can encode up to 1,112,064 code points. Anyway, it is a much bigger "alphabet" than the usual set of lexer tokens :)

[1] https://en.wikipedia.org/wiki/UTF-8


> easier to recover from input syntax errors from the Token-based input rather than Unicode-based input

Rust has pretty exact mapping from syntax errors to lexical placement. Couldn't a similar approach be used if you have a potential recovery path?


I think it depends on how the recovery path is made.

During the lexical scanning stage you can persist the token's source code sites, and later reuse them to indicate syntax or similar errors.

For example, in Rust when you parse the syntax from the TokenStream of a procedural macro input, you can bind custom errors to any spans of input tokens, and it works just fine.


Some of you may enjoy this fantastic recent paper on how to win both the comprehension benefits of splitting lexer and parser AND the performance benefits of fusing them: https://www.cl.cam.ac.uk/~jdy22/papers/flap-a-deterministic-...


There is an extended version on arXiv: https://arxiv.org/abs/2304.05276.

Also, the artifact (Docker image + some guides, CC BY 4.0) for the OCaml library: https://zenodo.org/record/7824835.

Edit: The opam file actually says the license is MIT. I'd be more inclined to believe that over CC BY 4.0 as far as source code licensing goes.


In addition to this, I always think about Matt Might's "Parsing with Derivatives." [1]

I'll have to read about flap, as it seems to utilize a lexer based on derivatives.

[1]: https://www.youtube.com/watch?v=ZzsK8Am6dKU


Matt Might also did an article about lexing with regex derivatives.

https://matt.might.net/articles/nonblocking-lexing-toolkit-b...


A lexer is just a parser for a regular language. It outputs a linear stream of tokens, conceptually similar to a stream of characters, only smaller in size and easier to create a grammar for, since superfluous elements like spaces are discarded. You could totally parse at the character level if you wanted to, by treating each character as a token; it's just that it makes things needlessly hard.

The general rule seems to be: lexer is for parsing linear structures like numbers and symbols while parsers are for recursive nested structures like lists and objects.


And especially strings, which with all those escape codes are small programming languages in their own right.


The title of this post stood out to me because it's a fairly niche topic that I've been meaning to write a 'rant' post about myself. I think in a very 'pure' parsing theory sense, there is no real advantage to separating the lexer and the parser. I claim that any counter-argument is due to the limitations of individual software tools, and their inability to let us specify the grammar that we actually want.

If I recall correctly, the original historical reason for separating them was due to memory limitations on early mainframe computers. Actual source code would be fed into the 'lexer' and a stream of 'tokens' would come out the other end and then be saved to a file, which would then be fed into the next stage.

Having said this, you can ask "In practice, is there currently an advantage to separating the lexer and the parser with the tools we use now?" The answer is 'yes', but I claim that this is just a limitation of the tools we have available to us today. Tokens are usually described using a simple regular expression, whereas the parsing rules are 'context free', so worst-case complexity of parsing the two is not the same. If you pull tokens into the parser and just parse them as naive 'single-character parser items', then you end up doing more work than you would have otherwise since your parser is going to try all kinds of unlikely and impossible chopped up token combinations. The other big issue is memory savings. Turning every token into a (small) parse tree is going to increase memory usage 10-20x (depending on the length of your tokens).

Personally, I think there is a substantial need for more advanced ways of describing language grammars. The ideal model would be to have various kinds of annotations or modifiers that you could attach to grammar rules, and then the parser generator would do all sorts of optimizations based on the constraints that these annotations imply. Technically, we already do this. It's called putting the lexer in one file and the parser in another file. There is no good reason why we can't specify both using a more unified grammar model and let the parser generator figure out what to do about each rule to be as efficient as possible.

The computer science theory behind grammars seems to have peaked in the 1980s, and there doesn't seem to have been many new innovations that have made it into daily programming life since then.


One good reason (at least in the olden days, where lex and yacc come from) is efficiency. I wrote a compiler for a made-up programming language for a very slow CPU, with a recursive-descent parser, and it spends almost all of its time re-examining characters that have already been examined, checking if they match a new rule, then giving up, backtracking, and trying the exact same characters again in the next rule.

If it was looking at tokens instead of characters this would go a lot faster, because the cost of all that re-examination is proportional to the number of items examined, and there are far fewer tokens than characters.


This is probably the best argument I've seen here. Everyone else is just saying "well, we always just did it that way".

I wonder however: would it be possible to combine lexers and parsers into a single logical construct, but rely on caching to get the same effect?


I had wondered about why the split is useful, and this article has answered the question to some degree. The general idea seems to be "it's easier to split it" and "you help reduce ambiguity in the parser by taking care of that in the lexer".

To those knowledgeable in lexing and parsing, I pose a question: How do you determine what should be part of the lexing phase and what should be part of the parsing phase? Do you choose arbitrarily and move things from one phase to another if it's simpler one way or another? Or is there a better way?


Lexing translates raw characters into non-semantic tokens. Punctuation is a good way to demonstrate what I mean by non-semantic. Take Rust, which uses `{` for both blocks and closure literal expressions; and uses `<` for generic parameter/argument wrappers, and also a binary operator. While lexing, you don't try to determine whether `{` is opening a closure or a block, or whether `<` is a binary operator. You just turn them into the tokens `LeftCurlyBrace` and `LeftAngleBracket`. It's the parser's responsibility to take those tokens and their neighbors and figure out what they're being used for.

The benefit is that this keeps each stage much simpler: lexing doesn't have to do much if any lookaround, and parsing mostly doesn't have to do anything character-wise.
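
In code, that non-semantic layer often amounts to no more than an enum of token kinds plus the raw text (the names here are made up); the "what is this brace for" question is deferred entirely to the parser:

    // The lexer only names the shape of the text; it never decides whether a
    // brace opens a block or a closure, or whether "<" is a comparison.
    enum TokenKind {
      Identifier,
      Number,
      LeftCurlyBrace,
      RightCurlyBrace,
      LeftAngleBracket,
      RightAngleBracket,
    }

    interface Token {
      kind: TokenKind;
      text: string;   // the raw slice of source, e.g. "{" or "foo"
      offset: number; // where it starts, kept for later error reporting
    }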


Thanks. That clears it up a lot.

The example of using the same tokens but for different purposes makes it clear that the lexer shouldn't be in the business of assigning meaning (semantics) to the tokens, and the parser shouldn't be in the business of fiddling with characters when it's something the lexer can do.


If you have half an hour, this is worth a watch as well: https://www.destroyallsoftware.com/screencasts/catalog/a-com...


Lexing should be strictly local and possible to do as a state machine that is reset after each token. So if something depends on the context, save it for the parser.


Yes, that makes sense. Chunks of text that don't depend on context are for the lexer, and context sensitive constructs (i.e. "{" needs a matching "}") are for the parser.


> How do you determine what should be part of the lexing phase and what should be part of the parsing phase?

Recursion. If your thing can contain other things, it requires a full parser. Otherwise a lexer can handle it.

In other words, use a lexer when a regular language is sufficient to describe the input and a parser when it's not.

Lexer:

  12345
Parser:

  [123 [45]]


Roughly lexing handles things that can be recognized with regexes (i.e., that don't have recursively nested structure).


The lexer groups individual characters into the building blocks, tokens, that the language is made of - from keywords such as "if", "for", "function" and symbols like "{", to arbitrary identifiers that don't contribute to the actual meaning or structure of the program, e.g. "foo".

The parser groups the tokens into meaningful statements, it implements the grammar. If the parser sees an "if", it expects some condition next (depending on the grammar of course). If the parser detects statements that are not consistent with the grammar of the language, it should reject the program.

Tldr: they work on different levels - the lexer just finds the individual building blocks, the parser builds the program.


The part I was missing was the idea of keeping ALL meaning/structure in the parser, and the lexer is used to identify chunks of characters that can be used to construct the meaning/structure of the program.


One benefit of splitting the lexer and parser is that you can write tests for the lexer that cover different lexical variations of a token -- e.g. for decimal/double values you could test that `1.`, `.2`, `1.0` all produce the same token type. Then, when writing tests for your parser in the parts of the grammar that have that token type you only need one test to cover all the different lexical variations.

This way, your parser tests are then about covering the different symbol transitions in the grammar, making them much more manageable.
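
For instance (a toy lex() and made-up token names, just to show the shape of such tests):

    // A toy lexer that only recognises numeric literals, enough to show that
    // "1.", ".2" and "1.0" all collapse to the same token kind.
    function lex(src: string): { kind: string; text: string }[] {
      const out: { kind: string; text: string }[] = [];
      for (const m of src.matchAll(/\d+\.\d*|\.\d+|\d+/g)) {
        out.push({ kind: m[0].includes(".") ? "DoubleLiteral" : "IntLiteral", text: m[0] });
      }
      return out;
    }

    for (const text of ["1.", ".2", "1.0"]) {
      console.assert(lex(text)[0].kind === "DoubleLiteral", `expected DoubleLiteral for ${text}`);
    }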

It is possible to have more complex lexers -- typically state-based lexers -- that switch state on given input. You can do this when tokenizing comments, strings, or other constructs so the lexer can emit the correct symbol for things like string interpolation. This can be useful when syntax highlighting or other token-based tasks are separate to parsing.

I can see how a state-based lexer and a parser can be combined to avoid managing the state in the lexer -- that is because the parser will know what the current state is due to where it is in the grammar, so can call the appropriate method on the lexer, or pass the corresponding state information. I've not yet experimented with this approach, but it could be useful in some contexts.

Note: I've written a lexer and parser for a language plugin for IntelliJ which has separate lexer and parser steps, and uses the lexer for syntax highlighting. I'm investigating support for the language server protocol (LSP), so may also experiment with a directed/orchestrated lexer as described above.


> `lex` and `yacc`

yacc is ancient. Even its replacement, bison, is ancient at this point.

I have seen "lexing" done on the parsing level in the past. The same way you can say "a sum expression is a subexpression, a plus sign and another expression" you can say "an identifier is 1 letter followed by 0 or more letters or digits". It was a bit cute, but awkward.

It's very convenient to have two passes. The first one removes unnecessary stuff like superfluous spaces or comments, so that the actual parsing can operate over a simplified space (not having to take into account things like "an expression, 0 or more spaces, the plus sign, 0 or more spaces, and another expression"). At that point, you might as well just use a lexer for phase 1.


There is more to translation of text to AST than parsing and lexing. There are semantic checks.

In addition, most languages are not CFG and require nested contexts.

    "#{"\tThis is some #{'damn'} string".to_upper_case + " hello"}\n"
Also, there are lexerless parsers such as derivative parsing that can be combined together and don't produce a stream of tokens.

https://dl.acm.org/doi/10.1145/2034773.2034801

https://arxiv.org/pdf/1604.04695.pdf


As a counterpoint to the dozens of people saying they've only ever used separate lexers and parsers, Parsing Expression Grammar (PEG) libraries generally combine high-level parsing with low-level lexing. This allows grammars to be composable, such as embedding SQL, JSON, or XML literals directly into code as-is, without a second level of escaping.

In my opinion, a lot of the need for separate lexing was historical, due to the weak optimisations in the compilers of the era. With modern languages like C++ or Rust on top of LLVM, the lexing steps can be inlined efficiently.


When you get into a situation where the approach you take causes people to start playing whack-a-mole to patch issues that arise from said approach, it's a sign that it may not be a good approach after all.

Your IDE will do type checking as well. Yes, the lexer stage will parse all identifiers into 'ID', but to provide a contrived counter-example, what if I do want to allow 'pi' to be assigned based on something that can only be processed in a later stage (perhaps an annotation, or a result of compile time function evaluation, etc). It will push the logic of handling 'ID' to later stages anyway. Now you're handling IDs in at least 2 places - that's strictly worse than the status quo.

Part of our jobs is to be able to reason about trade-offs when making decisions. Pointing out that the lexer has a limitation does not necessarily justify removing that limitation for an edge case because it might be due to that limitation that it can be kept simple for the general case.

Adding complexity to the lexer without simplifying the later stages is a net negative in my opinion, which would make this trade-off not worth it.

Is it an interesting approach? Sure. Is it worth it? Maybe in some very specific use cases, but it doesn't seem like a convincing argument to me to combine the lexing and parsing phases.


> Your IDE will do type checking as well. Yes, the lexer stage will parse all identifiers into 'ID', but to provide a contrived counter-example, what if I do want to allow 'pi' to be assigned based on something that can only be processed in a later stage (perhaps an annotation, or a result of compile time function evaluation, etc). It will push the logic of handling 'ID' to later stages anyway. Now you're handling IDs in at least 2 places - that's strictly worse than the status quo.

Typechecking and access control are clearly semantic analysis considerations, which are better handled in a separate pass on an explicit AST. In particular, they may require a symbol table (i.e. an environment of names) that is not possible to compute during the parsing phase.


Compilers are one of my favorite topics in CS as well, and this post was pretty interesting and easy to read.

I’ll add my own at https://jmmv.dev/2023/01/endbasic-parsing-difficulties.html where I described some difficulties I encountered while trying to apply these more “modern” techniques on a language that was designed before them.


Answer: performance.

Scannerless parser grammars, especially with GLR, are extremely flexible and a pleasure to write. They are the "DWIM" of parsers. But you pay a high price in performance for that flexibility.

Since programming language grammars don't tend to change much, by the time you have a production compiler it's usually worth your time to switch to a split lexer/parser in order to get that extra performance.


One of the coolest parsing libraries I've ever seen doesn't require a separate lexing phase, but IIRC it kinda supports something similar in the form of regex tokens: https://github.com/engelberg/instaparse


I love that supporting left recursion allows you to avoid contorting what would otherwise be the most obvious grammar definition.


Yep! That flexibility runs through Instaparse. Overall, it's so flexible that you can literally copy and paste an EBNF grammar in from a standards document, and it will hand you back a parser. I love it.


The split between lexing and parsing is indeed artificial, and one often has to make use of states, or even cases where the parsing controls the lexing. Try, for example, to parse an HTML file with embedded CSS and/or JavaScript using lex (flex) and yacc (bison).

Recursive descent, where the parser asks the 'scanner' to accept a certain symbol at a certain location in the input stream, resolves most of these problems, such as, for example, the fact that in HTML the tags are case insensitive while in JavaScript the keywords are case sensitive.

If you try to develop a parser that is user-friendly, but maybe not the most efficient (if you allow back-tracking), the whole problem of parsing becomes far less complicated. (There are good methods to counter performance problems due to back-tracking.)


You lex into tokens and then parse the tokens. It’s much easier to write cleaner, extensible, and maintainable code this way. For example, why have the hassle of worrying about whether you’re dealing with a string with numbers or just a number itself while parsing?


The part of my parser that parses tokens looks identical whether or not I'm using a separate lexing stage. I write a rule that expects a number, I write a rule that expects a string. Whether or not the item is a terminal (like it would be with a lexer) or a non-terminal (like it would be with a parser) doesn't really change how the rule is written.


Lexers have a maximal-munch rule. Without a separate lexing stage, that's kinda hard to emulate within a parser (depends on the type of parser, I guess), as you'd end up with additional ambiguities.


So it sounds like lexers have specific advantages with parsing technologies that admit ambiguities? Since I've used PEGs and recursive-descent parsers almost exclusively, that might explain why I haven't had issues.


Yes, but when or if you need to update your list of available terminals, or extend the parser because of introduced ambiguity to the language, having the lexing and parsing be separate reduces the overhead significantly.


When processing "pi = 3", I'd rather have the error "cannot assign to constant" than "unexpected token =".

Generally, whenever I write a parser+checker, I try to have a very relaxed parser and then put all the checks in the checker.


IIRC some compilers have taken this to the extreme, adding parsing rules for outright invalid syntax like missing semicolons, so that they can give cleaner error messages.


It's not just the cleaner error messages - if possible, you want to recover from syntactically invalid input at some point so that you can 1) detect errors that follow, but 2) more importantly, do NOT show errors past the point where the original error has recovered. This last bit is especially important in the context of visual dev tooling like IDEs and other smart editors where intermediate states involve missing semicolons, unmatched braces etc all the time. Roslyn (C#, VB) is one specific example of a parser where this approach is pervasive.


Having recovery points in a grammar is a really nice thing. I was writing a parser for a lisp-like language and turns out it's surprisingly difficult to come up with robust heuristics for parser recovery. People indent lisp code in various ways but eventually I got something that works most of the time. My approach was indentation based, seeking to the next location with compatible number of spaces/tabs at start of line.


Swift applies this across language versions as well. You'll see error messages like "fooWithBar() has been renamed to foo(withBar:)".


Sometimes you can't tokenize free of the parsing context. Is "MODULEPROCEDUREFOO" a Fortran statement that begins a new module called 'procedurefoo', or a reference to a module procedure 'foo'? Fortran compilers that use table-driven parsing typically feed state back to a "statement classifier", and the class of the statement determines how to tokenize it. Fortran compilers that use recursive descent typically just operate on characters; it seems to me to be more robust and error-recoverable to do it that way.


One of my uni subjects required us to write a C to MIPS32 compiler, due to the lecturer being a fan of MIPS. No yacc, no gcc, do it all from scratch.

An interesting lesson from it was that the argument about differences between compiled programming languages and scripts (which also arose during the course) is ultimately pointless. You can write a compiler to compile a scripting language, and you can write an interpreter to interpret and run something like C. So it's not the language which matters but the compiling or interpreting.


For many languages, the lexical part requires an "eager consumer rule"/"maximal munch" in addition to the actual grammar, whereas the remaining grammar does not.


I've been wondering the same thing. I started doing projects that involved parsing during covid lockdown, and since then I've done lots of toy projects and one full-sized programming language, and I very quickly realized that a) lexing didn't really make parsing any easier, and b) any non-trivial lexing (eg. number literals) was just duplicating parsing work anyway. It just seems much simpler to have a single layer that consumes characters off the front of a string


I asked myself the same question when I learned yacc and bison.

For most languages, the tokens are a regular language and a simple regex lexer is enough. Regular expressions used by a lexer end up as a state machine.

It does not make much sense to use a full recursive descent parser just to do lexing, as it will be less efficient and use more memory. Consider that the amount of characters ingested by the lexer is far greater than the tokens ingested by the parser.


Short answer: Some combination of efficiency, simplicity, and error messages.

Mostly efficiency. Lexer-specific code can be significantly faster on input with large tokens.


I for one find this separation so useful that I wrote my parser combinator library for polymorphic token streams where the first layer (à la lex) parses characters into lexer tokens, and the next layer (à la yacc) parses those into an AST, for instance. That also makes it easier to separate orthographic (“syntax”) errors from grammatical errors.


It's more efficient, both in time (optimized character parsing, no repeated work if parser backtracks) and memory usage (store identifiers as an interned byte string instead of a tree of character tokens).
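
A minimal sketch of the interning idea (names made up): identical identifier spellings share one small integer id, so tokens don't carry copies of the text.

    class Interner {
      private ids = new Map<string, number>();
      private names: string[] = [];

      // Return the existing id for this spelling, or assign a new one.
      intern(name: string): number {
        let id = this.ids.get(name);
        if (id === undefined) {
          id = this.names.length;
          this.names.push(name);
          this.ids.set(name, id);
        }
        return id;
      }

      lookup(id: number): string {
        return this.names[id];
      }
    }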

It also makes it easier for some tools to only use the lexer, which makes them more likely to work even if code is not syntactically correct or written for a newer language revision than the tool.


I had no idea what lexing and parsing was when I wrote my first compiler so I tried writing both of these processes in the same module. It was so hard that I naturally split the lexing part of things in a first pass, then realized that it had a name and that it was what everyone was doing. So the simple answer is that it’s the natural way of doing things.


I mean, this is how we write programs usually ain't it? We break the problem up into chunks, composing functions and/or objects on our way to the solution.

In this case there's a fairly natural split, as you mention, so it's natural to use composition to handle this as well.


A lot of the PI drawbacks / justifications for splitting sound like they would be eliminated with the introduction of a cut operator similar to Prolog. My limited experience has not seen any evidence of cut operators being used in language processing; are there downsides that only become evident with experience?


Interesting. I've always been allured by the prospect of parser generators for unified parser/lexers like EBNF & regular right part grammars, so it is nice to hear some of the downsides of these approaches that had never really occurred to me.


two things about splitting into phases:

1. history (scanner, lexer, parser all would not fit on one tape/cassette/disk back then)

2. performance (each component can benefit from specialized optimization techniques without affecting the other)


The old Burroughs COBOL compiler had 8 passes, supposedly so it could run on machines with about a megabyte of disk, and 64Kb of memory.

The output of the 7th stage was an intermediate code; my guess is that part of the reason there were so many stages was to enable most of the compiler to be language-independent. If you renamed the executable for the 8th pass, you could hack on the intermediate code with an editor, and make COBOL programs that could do illegal things, like access IO directly.

I suspect a lot of the passes were common to the ALGOL compiler.


Seems like the author has not taken a look at work such as OMeta [1].

[1] https://en.wikipedia.org/wiki/OMeta



