Parsing Algorithms

ChuckMcM · on Oct 26, 2020

FWIW, parsing and lexical analysis was the CS class I have used most thoroughly in my career. Sure data structures is probably the most often used, but other than hash tables and b-trees, not much of that class was useful. But lexing and parsing? Seems like every other project benefited by it either in handling configuration files, or log/sensor data, or some other need to convert what was human readable into machine manipulable.

That said, I really recommend the crafting interpreters work. It covers all the bases pretty solidly. If you want more depth and theory then get the dragon book (Ullman on compiler design) and read it afterwards :-)

DmitrySoshnikov · on Oct 26, 2020

Yes, in the "Essentials of Interpretation" class (aka "Building an Interpreter from scratch" we focus exactly on runtime semantics, and evaluating the language. The S-expression allows greatly simplifying, focus on runtime specifics themselves, skipping parsing stage altogether.

In "Essentials of Parsing" class (aka "Parsing Algorithms") we shift exactly to the syntax, and understanding the parsing process from within -- this in general may have nothing to do with runtime -- for the same exact syntax you may have different interpreters or VMs (even with different semantics).

chubot · on Oct 27, 2020

Yeah I think that makes sense, and that's how I learned... The s-expressions let you concentrate on the meaning without getting bogged down in details.

And then parsing has some nice algorithms, but it's full of details ... I actually find the parsing part harder in many respects. At least it's more code to write, and test.

fjfaase · on Oct 27, 2020

This course follows the traditional approach to writing parsers. These traditional approaches were developed in times when memory was scarce and where back-tracking was impossible, simply because files were too large to be stored in RAM. Back-tracking parsers are much easier to write and in most cases performance can be brought to acceptable levels by applying caching. I have experimented with developing interpreting parsers that work with user-friendly grammar representations, and discovered that using caching can result in acceptable performance. I also discovered that a small interpreter often is faster than generated code, probably due to a better CPU cache performance. For two examples of these approaches see: https://github.com/FransFaase/IParse and https://github.com/FransFaase/RawParser (WIP).

ssl232 · on Oct 27, 2020

Guido van Rossum talks about this in his series on designing a PEG parser for Python [1], now used in 3.9.

[1] https://medium.com/@gvanrossum_83706/peg-parsing-series-de5d...

DmitrySoshnikov · on Oct 27, 2020

Yes, backtracking still might be an option although has its known limitations in terms of parallel paths. We describe backtracking in this class too.

The LL in the the view of manual Recursive descent is the most used on practice along with combinators and LALR(1).

default-kramer · on Oct 26, 2020

Does anyone know if there are any good resources on "tolerant parsing," if that is the correct terminology? For example, when I write C# in Visual Studio, the IDE remains amazingly helpful even when the code is incomplete and would be rejected by a traditional parser. I'd guess that Microsoft simply has the budget to have the VS/C# dev teams grind out hundreds or thousands of special cases that are specific to C#... but I would love to learn that there are some fundamental techniques that would fit into a few credits worth of CS education.

derriz · on Oct 26, 2020

It's fairly straightforward to offer some functionality in this area with hand-written recursive descent parsers.

For example, a language with a C-like syntax, you'll often be parsing a sequence of statements separated by semi-colons (a block). If a statement fails to parse, you can just consume tokens until you hit the next semi-colon and then try to continue to parse statements from there.

Fairly crude approaches like this are easy to implement (at least with recursive-descent) but can be surprisingly effective. It's easy to construct counter examples where an approach like this will get it wrong but in practice it's hugely more useful to the poor user than just abandoning the parse completely.

eru · on Oct 27, 2020

You can probably make your heuristic even more robust, by looking at hints from indentation and parens and braces in addition to just semicolons.

mzl · on Oct 27, 2020

Another commenter recommended rust-analyzer as a source of inspiration. the main author, matklad, has written some very interesting blog posts describing ideas used that might be of interest to you: * https://rust-analyzer.github.io/blog/2020/09/16/challeging-L... * https://rust-analyzer.github.io/blog/2020/10/24/introducing-... * https://rust-analyzer.github.io/blog/2020/09/28/how-to-make-... * https://matklad.github.io/2020/04/13/simple-but-powerful-pra... * https://matklad.github.io/2020/04/15/from-pratt-to-dijkstra....

DmitrySoshnikov · on Oct 26, 2020

Yes, this is called "parse error recovery" and there are multiple techniques for this. In fact, most of the production parsers support this mode. E.g. when you try executing a C++ or Java file, it shows you all the errors at once instead of failing on first parse error. The way it's achieved is by constructing a "dummy" AST node (caught up to some delimiter, e.g. semicolon in statements) and continue the parsing process as there would be no any error.

MrManatee · on Oct 27, 2020

Tree-sitter is the library that Atom uses to parse as you type. It's incremental: if you just add one character, you don't have to parse the entire file from scratch. And it's robust: it tries to provide useful results even if there are syntax errors.

If you're more interested in the theoretical side of how it works, there are some talks and articles that cover it.

https://tree-sitter.github.io/tree-sitter/

mrits · on Oct 26, 2020

https://github.com/rust-analyzer/rust-analyzer is a good resource. I went there because I was surprised how good the project worked. I found some really great docs and learned quite a bit from the code.

DmitrySoshnikov · on Oct 27, 2020

See this small summary doc on different techniques for error recovery: https://gist.github.com/DmitrySoshnikov/feee52cbfb03b7b69110...

psykotic · on Oct 27, 2020

With your IDE example you need the full parser and type checker to be "tolerant". For recursive-descent parsing, there isn't much to say about theory. You try to pick reliable synchronization points and prevent cascading errors. Here's the classic example:

In a statement-oriented language like C#, synchronizing to the next statement upon finding an error by scanning for a semicolon token is a good place to start. (Statements like 'if' with nested statement blocks as arguments have their own sync logic.) Aside from having reliable sync points, statements (unlike expressions) don't have a type that you need to propagate, so you don't usually have to worry about cascading errors in the type checker from skipping a faulty statement. That said, if you skip a faulty statement that was the sole reference to a local variable, that might get flagged as an 'unreferenced variable' warning. It's often a good idea to disable sensitive warnings (and some errors) for the rest of the function as soon as an error is found.

Probably the most important sync point is at the level of symbol declarations. Even if a function's body is totally botched up, as long as you could successfully resolve the function's signature (name, parameter types and return type) you don't get cascading errors from other functions that reference that function. Resyncing to top-level declarations is so important that if you're designing the syntax it's worth having a dedicated declaration keyword (especially for functions) which is only valid at top level. That way you have a reliable sync point even if everything else is out of wack (unbalanced braces, etc). Something like Go's 'func' keyword is close enough: it can appear in function literals and function types as 'func' followed by '(' but 'func' followed by a name is only valid in top-level declarations.

Fine-grained error recovery for expressions is the biggest problem from both a parsing and type checking perspective. There aren't any reliable expression sync points in a C#-like language and you need to fabricate best-effort types for the faulty expressions (with an 'error' type as a fallback) and make sure that the various operators for combining expressions have heuristics so you don't get spurious cascading errors. It generally involves numerous special cases to good results. If you have a choice in the matter, don't worry about expression error recovery: report the error, sync to the next statement and go for the lower-hanging fruit instead. In my experience it isn't worth it in a statement-oriented language.

If you want to eliminate as many spurious warnings/errors like this as possible during error recovery, you do end up adding many special cases over time. But it's an incremental process and you can get good results immediately with basic recovery techniques.

Various comments:

On the lexer side of things, if you get to design the syntax it helps to avoid multi-line lexemes in the common cases. That's why I cringed a little when I learned that Rust's "..." string literals are multi-line. It's fine to have multi-line lexemes like /* ... */ comments in C and """...""" string literals in Python but make the default choices be single-line lexemes like "..." and // ... so you can sync to the newline.

A benefit of a syntax with indentation-defined block structure is that you don't need to rely on balanced grouping tokens like { ... }, so it's easy and reliable to sync to the outer block levels. (In Python there's a caveat that the lexer suspends indentation tracking when the ([{ nesting level is nonzero, but it's still a robust heuristic even when recovering from an error in a nested state.) In particular, if you require top-level declarations to be at column 0 this avoids the need for dedicated keywords for reliable declaration resync. Then the only thing you have to worry about are unbalanced multi-line lexemes gobbling up chunks of your programs.

While I talked about error recovery, a lot of these syntactic properties help with fast symbol indexing. E.g. it's easy to write a fast symbol indexer when you can just sync to "\n<letter>" for top-level declarations and otherwise only need to worry about rare multi-line lexemes. This lets your outer scan loop avoid the byte-at-a-time bottleneck.

DmitrySoshnikov · on Oct 27, 2020

Great details, thanks!

> A benefit of a syntax with indentation-defined block structure is that you don't need to rely on balanced grouping tokens like { ... }

In fact from the lexer perspective there is no big difference, the matching indent-dedent is the same token type as would be { and }

psykotic · on Oct 27, 2020

Indeed, the difference is that the lexer offers guarantees about the synthetic INDENT/DEDENT tokens. From an error sync perspective, the benefit is that the programmer (redundantly) re-asserts the block level every line by the amount of indentation. As a small addendum on Python's suspension of indentation tracking when nesting > 0, when I designed the syntax for indentation-based block structure in another language, I required such nested code to always have indentation beyond the current block's level even though no INDENT/DEDENT/NEWLINE tokens are emitted in this state. So this was legal:

    x = (1 +
        2)

    x = (1 +
            2)

    x = (1 +
      2)

    x = (1 +
     2)

But this was illegal:

    x = (1 +
    2)

The legal variants are all identical to

    x = (1 + 2)

from the parser's perspective. Adding this restriction (which is already the idiomatic way to indent nested multi-line expressions) means that you can reliably sync to block levels even when recovering from an error in a nested state. If your lexer already strips leading indentation from multi-line string literals you could add a similar constraint for them.

The moral of a lot of these tricks is that by turning idioms and conventions into language enforced constraints you can detect programmer errors more reliably and you can do a better job of error recovery. That said, even in a curly brace language like C# you could still use the indentation structure as a heuristic guide for error recovery--it's just going to be less reliable.

DmitrySoshnikov · on Oct 27, 2020

Yeah, this makes sense, thanks.

cjauvin · on Oct 26, 2020

For an alternative take on a related topic, this is really a fantastically well-written and practical (free) book: http://craftinginterpreters.com

ufo · on Oct 27, 2020

I a big fan of the way Bob Nystrom skips the LR and LL theory and goes straight to recursive descent parsing plus the precedence-climbing trick.

I'm of the opinion that if you have to learn ONE thing about parsing then it should be how to write a recursive descent parser by hand. It is the parsing technique that you are most likely to use in a real project if someone throws a parsing hot potato in your direction.

That said, if you are on the mood to learn at least two things about parsing then there is no way around the LL and LR fundamentals. :)

DmitrySoshnikov · on Oct 27, 2020

Yes, in fact for building a language the parsing stage should be skipped altogether (start with interpreter or bytecode). We do this in the interpreters class. And once you have a fully working VM, _now_ it is a good time to shift to parsing and design a good syntax.

For recursive descent we have a separate class "Building a Recursive descent parser from scratch" which is purely practical coding class for those interested mainly in practice.

_0ffh · on Oct 27, 2020

Absolutely with you on recursive descent! And looking back, from a purely practical perspective, I think the second thing should be TDOP/Pratt parsing. When your RD-parser hits an expression, simply call the TDOP subroutine and let that sort it out. It's just a beautiful (and fast) combination!

FeepingCreature · on Oct 27, 2020

I second this. Recursive descent is so good. I've used it for all my languages with no issues.

DmitrySoshnikov · on Oct 26, 2020

Yeah, this class is specifically on parsing pipeline and syntactic analysis.

For runtime semantics (Interpreters and Virtual Machines) you can address "Essentials of Interpretation" aka "Building an Interpreter from scratch".

mssundaram · on Oct 26, 2020

Not free but also very good is https://interpreterbook.com/

bensonalec · on Oct 26, 2020

Both of Thorsten Ball's books in that series are phenomenal, could not recommend more. He's also got a great discussion of it on the Go Time podcast, https://changelog.com/gotime/28 and https://changelog.com/gotime/107.

mssundaram · on Oct 26, 2020

Thanks for sharing the podcast links, I hadn't known about that

wtetzner · on Oct 26, 2020

Any chance of also including GLL (generalized LL)?

I found the paper (http://dotat.at/tmp/gll.pdf) quite hard to follow, and haven't been able to find a good explanation anywhere else.

samatman · on Oct 26, 2020

I found the code for Instaparse (relatively) easy to follow.

I had considered leaving a comment here like "hey could you cover combinators and PEGs?", but after thinking it over, it's important to limit the scope for a class like this.

It would be pretty great to offer a "201" edition, covering ALL*, GLR, GLL, combinators/PEGs, Earley, parsing-with-derivatives, Marpa, and anything else I might have forgotten: basically a survey of modern parsing algorithms, which frankly, LR and LL are not.

But for your own learning, I bet you could take this course, and then spend some time with Instaparse and the GLL paper, and walk away with a solid understanding of GLL in practice.

DmitrySoshnikov · on Oct 26, 2020

Great point on combinators, PEG, and GLL -- this potentially would be covered in 201 as suggested, since it's good having a foundation of the LL/LR, and then gradually moving to combinators if needed. LALR(1) covers a pretty wide range of the most practical languages.

samatman · on Oct 26, 2020

To a significant degree, the arrow of causality runs LALR(1) -> practical languages, not the other direction!

The languages and formats we use have been heavily shaped by the practical parsing algorithms of the 20th century. An example: you can't have a struct field called "while" in C, because once the lexer declares a token to be a keyword, that's that.

jcranmer · on Oct 26, 2020

Contextual keywords are a growing thing in modern languages. Modern compilers don't tend to have strictly separated lexers and parsers, but instead use a combined lexer/parser model that feeds information back and forth between them. If your language is designed such that you aren't often in multiple potential parse states, then it's easy to feed into the lexer "get me the next token, and by the way, expect function attributes to be keywords right now."

Note that the requirement to not be in multiple potential parse states also tends to boil down to "build a language that's usually LL(1) or LALR(1)."

tgv · on Oct 26, 2020

TBH, that's not a problem of LALR(1), or any of the other, more old fashioned methods. I've written an LL(1) parser generator that (generates a function that) parses modern awk, without semicolons, but with operator-less concatenation, and that can deal with tokens that can be keywords and identifiers (which is also needed for e.g. FORTRAN). As long as your language is deterministic, it can be expressed as an LR grammar, although legibility might suffer.

DmitrySoshnikov · on Oct 26, 2020

Yes, to some degree -- Syntax tool normally support lexer states, and the same "while" token may mean a keyword or the property/field name of a struct. You can find more details of the lexer states in the docs.

dataflow · on Oct 26, 2020

What is GRR? Did you mean GLR?

samatman · on Oct 26, 2020

Yep, that was a typo, fixed it

dataflow · on Oct 26, 2020

Ah okay. Note that GLR isn't exactly modern (1974). However, I would also remove "modern" as a requirement here... what really matters is how good the algorithm is, not how old it is. "Modern" algorithms easily end up being less powerful than the old ones... they just end up becoming popular due to other factors, e.g. simplicity.

samatman · on Oct 26, 2020

"Advanced" would have been more expressive, agreed.

In the sense that LALR and LL with limited lookahead are the introductory parsing algorithms which everyone knows, and there's a reason for that.

DonaldPShimoda · on Oct 26, 2020

Just generalized parsing algorithms in general would be good to include, I think. It looks like the course only plans to cover basic LL/LR, which are admittedly the most commonly used parsers but more would be interesting. A fun one to include might be Might's "Parsing with Derivatives", which is algorithmically novel (though not very performant). I think there was a recent innovation on this: "Parsing with Zippers" at ICFP this year.

Hard agree that GLL is hard to follow. I've read the paper a number of times and struggle with it every time haha.

giraj · on Oct 26, 2020

Thanks for mentioning "Parsing with Zippers"! I read "Parsing with Derivatives" last week and wondered if that could be taken further. The paper can be found here: https://dl.acm.org/doi/10.1145/3408990

DonaldPShimoda · on Oct 26, 2020

It's always fun to spread new research around the community! I remember reading PwD and thinking it was just so cool as an algorithm, though the performance concerns were a bit of a turn-off. Still, I always like to talk to people about it because I just think it's such an elegant approach to the problem.

As for PwZ, I know one of the authors so maybe I'm a little biased, but I also thought his ICFP talk was quite good. It's here: https://youtu.be/fakSKvP9yaM?t=6180

thesz · on Oct 26, 2020

Parsing with derivatives is not all that slow: https://www.microsoft.com/en-us/research/uploads/prod/2019/0...

The paper shows that it can compete with re2 (by google).

DonaldPShimoda · on Oct 26, 2020

This is not what I'm talking about.

"Parsing with Derivatives" is the title of a 2011 paper by Matt Might, David Darais, and Daniel Spiewak that generalizes the Brzozowski derivative from regular expressions to context-free grammars. In the present discussion, I assumed the context to be about CFGs instead of REs because that is most often what people are referring to when we talk about parsers in programming languages, though I admit that I could have been more explicit about this.

What you linked is an improvement of the Brzozowski derivative, but it does not constitute a parser for CFGs.

thesz · on Oct 26, 2020

According to "Parsing Techniques: A Practical Guide" [1], it is quite common for computer languages to have most of the grammar in the regular grammars class and some parts to be, actually, context-free.

[1] https://dickgrune.com/Books/PTAPG_1st_Edition/BookBody.pdf

For example, consider addition and subtraction in most grammars. They can be expressed as "summation ::= factor ((PLUS | MINUS) factor) * " and factor can be similarly defined as "factor ::= multiplicand ((MUL | DIV) multiplicand) * ".

Authors of PTAPG note that these regularities can be exploited for speed. And I think "parsing with derivatives" techniques can be used for speedy parsing too.

DonaldPShimoda · on Oct 26, 2020

Uhh sure, but you're kind of deflecting. The phrase "parsing with derivatives" (or "Might's 'Parsing with Derivatives'" as I wrote in my initial comment to which you first replied) refers specifically to the technique developed by Might et al that generalizes the Brzozowski derivative to CFGs. And, more to the point, their technique has very poor performance, which is addressed directly in the paper.

If you talk to parsing people about "parsing with derivatives", they will undoubtedly assume that you specifically mean the Might et al work and not some other, more general notion, regardless of whether one could technically call such notion a "parsing with derivatives" technique. I have never heard of anybody calling anything else "parsing with derivatives" and, indeed, the paper you linked never uses this phrase (the closest they come is "matching with derivatives", which is a bit different in semantics).

I appreciate the points you've raised, and I do think the RE-parsing paper you linked previously looks very interesting (I hadn't seen it before), but the crux of the issue is that you misinterpreted what I said and haven't yet acknowledged that misinterpretation. Instead, it feels like you're trying to fight me on other points to win back some ground or something, though I hope I'm just misreading this because I find that kind of a frustrating conversational method.

thesz · on Oct 27, 2020

My point was that "parsing with derivatives" not necessarily a slow thing and I provided example supporting that point.

I specifically has been searching for performant regular expression library recently. The "parsing with derivatives" approach can share more of the state, I believe, when doing several matches in parallel (think about trie-encoded dictionary) than DFA-based libraries do and should have smaller startup time.

I have not misinterpreted your argument. I have provided a point where it does break because I consider that point important. The reader of our conversation will, from now on, I hope, not consider the original "parsing with derivatives" paper as the state of the art and, probably, will come up with something himself.

I, actually, did and I am glad you answered my points. They made me thinking.

I think that parsing with derivatives can be used as a tool to parse (in parallel! possibly sharing derivatives computed!) parts of text with regular subgrammars and then something like CYK can be applied (again, in parallel like in [1]) to the regions parsed.

[1] https://jyp.github.io/pdf/PP.pdf

PS

Please note that in [1] they show that chart parsing struggle with exactly regular subgrammars - typical chart parsing algorithm has O(n^2) complexity for repetitions expressed with asterisk in regular grammars. I do not think my "solution" is necessarily better than in the [1], but I think I have to think about it more.

DonaldPShimoda · on Oct 27, 2020

I want to preface this by saying that most of this comment is kind of blunt and wordy, but I really don't intend it to be taken rudely. I just wanted to address your points very explicitly to explain my position more clearly. I value your contributions and hope you'll take this response in the spirit I intend it.

> My point was that "parsing with derivatives" not necessarily a slow thing and I provided example supporting that point.

I actually don't think you supported this point, because I specifically said (emphasis new):

> A fun one to include might be Might's "Parsing with Derivatives", which is algorithmically novel (though not very performant).

I was explicitly talking about the 2011 paper titled "Parsing with Derivatives" which generalizes Brzozowski's derivatives from REs to CFGs, and which has absolutely terrible performance compared to most other parsers for CFGs.

You went on a bit of a tangent by arguing about "using derivatives to parse REs in general" (to paraphrase). This was never the scope of what I was talking about. Whether derivatives can be used to parse REs efficiently is irrelevant when I was specifically referring to the 2011 paper, which provides a novel, general parsing algorithm — not LL, not LR, not GLL, not GLR, but just "general", for all CFGs in existence.

> I have not misinterpreted your argument. I have provided a point where it does break because I consider that point important.

I still think you misinterpreted me, because you're talking about "how can derivatives be used to parse things efficiently" while I was talking about "the contributions of this specific 2011 paper that introduced a new general parsing algorithm that can handle all CFGs and which happens to be called 'Parsing with Derivatives'".

At this point I want to say that I'm not trying to be mean here, and I really hope that's not how I'm coming across. I just think something isn't connecting in our dialogue, so I'm trying to be overly explicit and cautious in my wording to make sure I don't introduce any possibility for misinterpretation.

> The reader of our conversation will, from now on, I hope, not consider the original "parsing with derivatives" paper as the state of the art and, probably, will come up with something himself.

See, this is the thing. The 2011 paper I referenced isn't about REs, but your papers were about REs. The 2011 paper introduced a brand-new general parsing algorithm to the world. It was subsequently refined by Adams, Hollenbeck, and Might in 2016's "On the Complexity and Performance of Parsing with Derivatives" [0], and then again refined by Darragh and Adams in "Parsing with Zippers" [1] (published at ICFP this year). To the best of my knowledge, the materials you've linked so far do not even reference the 2011 paper and so cannot be said to improve on the state of the art in that regard, where "state of the art" refers to "state of parsing all CFGs using a derivative-based approach rooted in the Brzozowski derivative". They instead start with the 1964 Brzozowski paper that originated the derivative of regular expressions and improve on that directly — meaning they are not addressing CFGs in general, as the 2011 paper I was talking about does. Your paper and the 2011 paper are more like siblings in that they both descend from the same 1964 paper, but solve different problems. So I think it would be misleading to suggest that your paper is an improvement on the state of the art of the 2011 paper because they're really very different problem domains, despite both involving "parsing" and "derivatives".

> I think that parsing with derivatives can be used as a tool to parse (in parallel! possibly sharing derivatives computed!) parts of text with regular subgrammars and then something like CYK can be applied (again, in parallel like in [1]) to the regions parsed.

So, if I'm understanding you right, you're suggesting using a hybrid approach. I think that's an interesting idea! I don't know that I've seen much done like that. I do know that CYK is almost completely ignored in PL these days, so perhaps there is some other strategy to use there to augment the derivatives-of-REs subcomponents. I'll have to think about that. What a neat idea!

[0] https://doi.org/10.1145/2980983.2908128

[1] https://doi.org/10.1145/3408990

eru · on Oct 27, 2020

Parsing with derivatives has been around for a long time, at least for regular languages.

DonaldPShimoda · on Oct 27, 2020

Yes, the technique was introduced in 1964 by Brzozowski. But the Brzozowski derivative (as it is now known) was generalized to handle all CFGs (instead of only REs) in the 2011 Might et al paper titled "Parsing with Derivatives". This work was then refined this year with "Parsing with Zippers", which generalizes Huet's Zipper data structure to represent CFGs (instead of acyclical, deterministic tree structures) and uses that to re-build the 2011 algorithm with a better representation, achieving greater performance with a small algorithm.

eru · on Oct 30, 2020

Thanks! I'll check out the papers.

DmitrySoshnikov · on Oct 26, 2020

Yes, GLL is a good algorithm and I potentially going to publish it separately as a single public video.

wtetzner · on Oct 26, 2020

That would be fantastic!

siraben · on Oct 27, 2020

One parsing technique that covers a lot of ground for little code is parser combinators[0]. And the more space and time efficient Parsec library[1]. Recursive descent is not difficult to hand-roll but the correspondence between the code and BNF can be more obscured, whereas with parser combinators the correspondence is quite clear.

[0] http://www.cs.nott.ac.uk/~pszgmh/monparsing.pdf

[1] https://dspace.library.uu.nl/bitstream/handle/1874/2535/2001...

redwoolf · on Oct 26, 2020

This is great! I'm working on designing a language right now and I'm just getting to the point where I have to parse the AST. Looking forward to taking this course, I just purchased it on Udemy.

DmitrySoshnikov · on Oct 26, 2020

Congrats, and hope this makes building of your parser easy and fun!

This course covers a lot of parsing theory as well, and if you're interested in pure practical (manual) parser, there will be also "Building a Recursive descent parser from scratch", which is mainly coding class and is an extension for the "Parsing Algorithms".

jeofken · on Oct 27, 2020

If you are implementing it in a functional and GC language (like OCaml, Scheme/Clojure, JavaScript, F#, Elm, Haskell) using parser combinators is a lovely technique where you make composable mini-parsers. Worth researching! As a side effect (hehe) you’ll learn how to simulate state in a pure language...

Enjoy

LordGrey · on Oct 27, 2020

I really liked the intro. I purchased both the "Building an Interpreter from scratch" and the "Parsing Algorithms" courses from Udemy and went through the first four modules of the former. Very clear presentation!

DmitrySoshnikov · on Oct 27, 2020

Thank you for the feedback, glad you liked it, and glad to see more people interested in deeper CS topics!

bgorman · on Oct 26, 2020

Thanks for building this, I completed the "make a lisp" project, but the parsing stage was greatly simplified so it is nice to have a resource to learn more about the parsing/lexing stage.

DmitrySoshnikov · on Oct 26, 2020

Absolutely! S-expression (used in Scheme, Lisp, etc) is a great AST-based syntax to start building an interpreter right away. But for fully ergonomic language you would need a parser for a more complex syntax.

humanlion87 · on Oct 26, 2020

Apologies for the tangential question. I am currently working on a codebase that involves trying to understand code that parses and input string (that has various search parameters and values) and generates an appropriate SQL query to search the associated database. I am struggling to build a mental model of how this parsing works. Will going through one of the resources mentioned in this thread help with my understanding?

DmitrySoshnikov · on Oct 26, 2020

Yes, if you need to parse that input string to generates an appropriate SQL query, you would need to have a small DSL (domain-specific language) for that "string", whatever it contains. If the string contains SQL-like syntax, e.g. "SELECT name from users", then yes, it would be easy to build a grammar for this.

red_admiral · on Oct 27, 2020

I don't mind paying for quality content (and I mind paying in cash a lot less than paying with personal data), but if someone wants a free alternative to this, I recommend https://craftinginterpreters.com/ . Bit lighter on the theory side, but exactly what you need if you want to write a parser in practice.

kayson · on Oct 27, 2020

This is really great! Thanks for posting. Watched the first video on youtube and just grabbed the rest on Udemy. Your style works perfectly for me.

DmitrySoshnikov · on Oct 27, 2020

Thanks for the feedback, and glad to see more engineers interested in deeper topics!

MaxBarraclough · on Oct 26, 2020

Seems the place to ask: what do folks make of the IELR algorithm? [0] Apparently GNU Bison supports it these days. [1]

[0] https://cs.stackexchange.com/a/99463/

[1] https://en.wikipedia.org/wiki/GNU_Bison

rmasters · on Oct 27, 2020

I think its great. IELR is a straightforward optimization that just makes sense to use when building an LR family parser.

You can think of LALR(1) as just taking an LR(1) state graph and merging together nodes that are compatible (in that they are essentially the same parsing state, but differ in which "lookahead" tokens are valid). A grammar is consider to be LALR(1) if combining states in this way still results in a correct parser.

IELR, which is derived from David Pager's PGM and Lane Tracing "Minimal LR" techniques from the 1970s, applies a stricter compatibility test for nodes. This check usually allows a large majority of merges while rejecting the ones that will lead to parsing errors. In this way, a more sophisticated grammar, all the way up to LR(1) can still be processed correctly using a table size much closer to what you get with LALR.

You essentially get the best of both worlds and its tragic that this fairly simple technique is so obscure. I believe the reason for the obscurity of these techniques is that Pager's PGM and lane tracing papers are extremely terse, somewhat confusing, and lacking in sufficient detail in certain areas. For example, the PGM paper on p.256 crucially notes that successors may need to be "regenerated as a distinct state" without further explanation. (Until his protege Chen provided some pseudo-code 34 years later in his dissertation of 2009. BTW, I have verified that the Wisent and Menhir parsers and others implement PGM correctly, if anyone is looking for actual details.)

Note that there is some additional time and complexity required to detect conflicts and regenerate/split nodes and so the benefits or IELR are not entirely free.

More obscurely, it should be noted that IELR suffers the same problem as LALR in that combining states can introduce conflicts between tokens when context aware lexing [Nawrocki 1991] is being utilized. (BTW, Tree Sitter uses context aware lexing but its not very common otherwise - yacc/bison doesn't AFAIK). Suddenly tokens are eligible for matching alongside other tokens that normally would not be matched together. Unless your tokens are globally conflict-free or you've got a priority scheme that always resolves conflicts properly (but then you wouldn't need context-aware-lexing would you?), you'll start matching tokens that shouldn't be matched in a particular parsing state. There are ways to avoid those conflics as well (see PSLR - also from IELR author Joel Denny) but it is a lot more work. Even if you avoid conflicts, invalid content can match tokens that shouldn't be matched which complicates error reporting because you get a confusing parse error instead of a correct tokenization error. So this is one case where the full LR(1) may be preferred over IELR.

But I can't think of a convincing argument for preferring LALR over IELR.

DmitrySoshnikov · on Oct 27, 2020

That's a great point and I have the https://github.com/DmitrySoshnikov/syntax/issues/99 to add support for IELR in Syntax.

davidkellis · on Oct 26, 2020

I'd like to see something like this for practitioners. I kind of have a feel for what's out there, but I don't know of anything that is: 1. pleasant to use 2. simple 3. scannerless 4. supports left recursion that produces a left-associative parse

norswap · on Oct 26, 2020

Shameless self-promotion: https://github.com/norswap/autumn

DmitrySoshnikov · on Oct 26, 2020

Yes, we use LALR(1) parsing mode to build the actual parser, and it exactly supports Left recursive grammars (which are much more elegant than LL). We also don't focus much on scanner (tokenizer) since this is a topic of Regular expressions and Finite automata which we discuss in detail in the separate class "Building a RegExp machine".

bmn__ · on Oct 26, 2020

Marpa meets 2.a) 2.b) 3.

1. is subjective.

xc468 · on Oct 27, 2020

I recommend compilers course by Alex Aiken from Stanford open

DmitrySoshnikov · on Oct 27, 2020

Professor Aiken is a great teacher and I love his compilers course. However as for the parsering stage, that course goes as maximum as to SLR(1) which is pretty "toy" parsing mode. That's the problem with a combined "compilers class" -- one simply can't put everything, and everything is becoming slightly superficial. That's why I have Parsers and Garbage Collectors class as separate and fully specialized course.

_0ffh · on Oct 27, 2020

Funny, I used to dig into everything about parsing that I could get my grubby fingers on, and now I don't even care anymore.

Use recursive descent + TDOP for expressions!

nly · on Oct 27, 2020

TDOP / Pratt parsing is basically the same as recursive descent, it's just an optimization where rules with the same or lower precedence are matched iteratively rather than recursively.

I can't find the link now but I found a great page once where this was shown, and it made all the complexity vanish for me.

_0ffh · on Oct 27, 2020

Yeah, the advantage here is that once you wrote your Pratt parser it's a nice, quite generic piece of code that you can just drop into any new project, reconfigure a bit and it does it's job.

Don't know what page you mean, but personally I first stumbled upon it on Douglas Crockford's page [1]. A couple of years later Eli Bendersky did a nice writeup for Python [2].

I wrote my first Pratt parser in D, based on Crockford's. It was a learning project that ended up as a Javascript interpreter. I parsed JS into a Lispy syntax tree and added a simple Lisp interpreter based on Norvig's [3].

In the meantime I've switched my default language (to Nim [4]), but I've since never used anything but RD and/or Pratt for any parsing job.

[1] https://crockford.com/javascript/tdop/tdop.html

[2] https://eli.thegreenplace.net/2010/01/02/top-down-operator-p...

[3] https://norvig.com/lispy.html

[4] https://nim-lang.org/

nly · on Oct 27, 2020

I found it : https://www.engr.mun.ca/~theo/Misc/exp_parsing.htm#more_clim...

It shows the transformations that get you from recursive descent to table driven precedence climbing

Pratt/TDOP and precedence climbing are actually the same algorithm: https://www.oilshell.org/blog/2016/11/01.html

...so basically everything can be derived from simple, straightforward predictive recursive descent.

esrh · on Oct 27, 2020

While i find the topic really interesting, I'm don't enjoy or learn very well from video lectures.

Is there some comprehensive textbook on the topic?

DmitrySoshnikov · on Oct 27, 2020

Yes, I recommend "Parsing Techniques" book.

andreygrehov · on Oct 26, 2020

Very well done! I like video animations. What software do you use to make them?

DmitrySoshnikov · on Oct 26, 2020

Thanks for the feedback, glad you like it. For production I use combination of software: Keynote, Notability, Good notes, Camtasia, and live editing on iPad.

waynesonfire · on Oct 26, 2020

super interesting subject; can anyone recommend resources for self learners?

egonschiele · on Oct 26, 2020

Isn't this for self-learners?

> The course starts now and never ends! It is a completely self-paced online course - you decide when you start and when you finish.

rschachte · on Oct 26, 2020

Is the site down?

DmitrySoshnikov · on Oct 26, 2020

Should be up by now; seems auto-DDOS'ed, lol