Earley Parsing Explained (loup-vaillant.fr)
53 points by rwmj on Sept 15, 2015 | hide | past | favorite | 15 comments



I'm looking at this, and the Earley items with the "fat dot" look a heck of a lot like LR(0) kernel items in LALR(1) parser generation.

Then it hits me: Earley parsing is to LALR(1) (vaguely) as NFA simulation is to a DFA table.

If we look at it at a very high level: Earley makes these items dynamically while scanning the input, grouping them into sets representing states, whereas an LALR parser generator generates the items and groups them into sets statically, while processing the grammar, producing a lookahead-driven push-down table which is then applied to the input.

Analogously, NFA simulation computes sets of states dynamically according to the input, by performing closures on the NFA graph, whereas DFA construction does the same thing statically, in the absence of input.
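To make the "dynamically built item sets" concrete, here is a toy Earley recognizer (my own sketch, not code from the article) for a hypothetical grammar Sum -> Sum '+' Num | Num; Num -> 'n'. Each Item is a rule with a fat-dot position plus the index of the set it started in:

```python
from collections import namedtuple

# An Earley item: a rule with a "fat dot" position, plus the index of the
# state set where the item was started (its origin).
Item = namedtuple("Item", "head body dot origin")

# Hypothetical toy grammar: Sum -> Sum '+' Num | Num ; Num -> 'n'
GRAMMAR = {
    "Sum": [["Sum", "+", "Num"], ["Num"]],
    "Num": [["n"]],
}

def earley_recognize(tokens, start="Sum"):
    # One state set per input position, built dynamically while scanning.
    sets = [[] for _ in range(len(tokens) + 1)]

    def add(i, item):
        if item not in sets[i]:
            sets[i].append(item)

    for body in GRAMMAR[start]:
        add(0, Item(start, tuple(body), 0, 0))

    for i in range(len(tokens) + 1):
        for item in sets[i]:  # sets[i] may grow while we iterate
            if item.dot < len(item.body):
                sym = item.body[item.dot]
                if sym in GRAMMAR:  # predict: expand a nonterminal in place
                    for body in GRAMMAR[sym]:
                        add(i, Item(sym, tuple(body), 0, i))
                elif i < len(tokens) and tokens[i] == sym:  # scan a terminal
                    add(i + 1, Item(item.head, item.body,
                                    item.dot + 1, item.origin))
            else:  # complete: advance the dot in the origin set's parents
                for parent in sets[item.origin]:
                    if (parent.dot < len(parent.body)
                            and parent.body[parent.dot] == item.head):
                        add(i, Item(parent.head, parent.body,
                                    parent.dot + 1, parent.origin))

    return any(it.head == start and it.dot == len(it.body) and it.origin == 0
               for it in sets[len(tokens)])
```

(Nullable rules need extra care in the complete step; this sketch ignores them.) The predict/scan/complete loop is exactly the dynamic analogue of the closure/goto computation an LALR generator performs statically.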


Slight tangent, but interesting: Earley became a therapist and has been practicing since 1973. From his bio:

> Jay also has a Ph.D. in computer science from Carnegie-Mellon University and was formerly on the U.C. Berkeley faculty, where he published 12 computer science papers, one of which was voted one of the best 25 papers of the quarter century by the Communications of the A.C.M.

https://selftherapyjourney.com/Pattern/Beginning/Who_We_Are....


Traditional shift/reduce parsers are impractical to write without a specialised tool and often difficult to debug; PEG parsers and parser combinators are great to work with but are often inefficient and produce unhelpful syntax errors. If there's some other parsing scheme that could be a best of both worlds, I would love to learn more.


I believe that the Marpa parser may be what you're looking for. It is a C library whose most refined API is in Perl 5; improvements are in the works to make it easier to build interfaces to the library in other languages.

https://jeffreykegler.github.io/Marpa-web-site/
http://savage.net.au/Marpa.html


Unfortunately, there is a catch:

http://loup-vaillant.fr/tutorials/earley-parsing/parser

Once you have a successful parse, extracting the tree from it is a bit like pulling teeth. You have to perform searches on the Earley set data, and deal with ambiguities at that point.

None of the mainstream methods has any major difficulty popping out abstract syntax trees in a straightforward syntax-directed manner.


There are efficient ways to do it, though (the article links to Scott's work), even if they are not always easy to understand.


An optimised Packrat parser is very efficient (especially when combined with Pratt parsing), and its error handling can be on the same level as the best handcrafted production parsers (like those of Clang or GCC) - it is easy to mix recovery rules and arbitrarily complex messages into a PEG.
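For readers unfamiliar with the Pratt half of that combination, here is a minimal precedence-climbing expression parser (my own illustration; grammar and token set are made up, not from any parser mentioned above):

```python
import re

# Hypothetical token set: integers, + - * /, parentheses.
TOKEN = re.compile(r"\s*(\d+|[+\-*/()])")

def tokenize(src):
    pos, out = 0, []
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if not m:
            raise SyntaxError(f"bad input at {pos}")
        out.append(m.group(1))
        pos = m.end()
    return out

# Binding powers: the heart of a Pratt parser.
PREC = {"+": 10, "-": 10, "*": 20, "/": 20}

def parse(tokens):
    expr, rest = parse_expr(tokens, 0)
    if rest:
        raise SyntaxError(f"trailing tokens: {rest}")
    return expr

def parse_expr(tokens, min_prec):
    lhs, tokens = parse_atom(tokens)
    # Keep consuming operators that bind at least as tightly as min_prec.
    while tokens and tokens[0] in PREC and PREC[tokens[0]] >= min_prec:
        op, tokens = tokens[0], tokens[1:]
        rhs, tokens = parse_expr(tokens, PREC[op] + 1)  # +1 => left-assoc
        lhs = (op, lhs, rhs)
    return lhs, tokens

def parse_atom(tokens):
    if not tokens:
        raise SyntaxError("unexpected end of input")
    tok, rest = tokens[0], tokens[1:]
    if tok == "(":
        expr, rest = parse_expr(rest, 0)
        if not rest or rest[0] != ")":
            raise SyntaxError("expected ')'")
        return expr, rest[1:]
    if tok.isdigit():
        return int(tok), rest
    raise SyntaxError(f"unexpected token {tok!r}")
```

The precedence table replaces the tower of one-rule-per-precedence-level nonterminals a pure PEG would need, which is why the two techniques combine so well.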

I do not understand why PEG is so persistently misunderstood and unjustly criticised.


I've always loved traditional recursive descent parsers.

You have to learn a bit about transforming the target grammar into the required form. But they are nice because they require no tools at all, and, being hand-written, they can be among the most efficient available.

This link seems to have good info on the grammar transformations required to put a language in the required form.

http://www.cs.engr.uky.edu/~lewis/essays/compilers/rec-des.h...
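As a tiny example of the transformation (my own, not from the linked essay): the left-recursive rule Expr -> Expr '+' Term becomes Expr -> Term ('+' Term)*, which a recursive descent function can implement directly as a loop:

```python
# Hand-written recursive descent for the toy grammar
#   Expr -> Term ('+' Term)* ; Term -> NUMBER
# after eliminating left recursion (the "required form").
def parse_expr(tokens, pos=0):
    value, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        rhs, pos = parse_term(tokens, pos + 1)
        value += rhs  # left-associative fold; evaluates instead of building a tree
    return value, pos

def parse_term(tokens, pos):
    tok = tokens[pos]
    if not tok.isdigit():
        raise SyntaxError(f"expected number at position {pos}, got {tok!r}")
    return int(tok), pos + 1
```

One function per nonterminal, one loop per repetition: the code mirrors the transformed grammar line for line.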


I quite like ANTLR4's ALL(*) parsing. It always seems to hit the sweet spot between reasonable speed and user friendliness. http://www.antlr.org/papers/allstar-techreport.pdf


This one uses the shift/reduce approach but can be applied to any context-free grammar, efficiently: [1]

[1] http://scottmcpeak.com/elkhound/


A good LL(1) parser generator will be able to provide helpful error messages and good performance, at the cost of not being able to parse all grammars you might come up with.
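The table such a generator emits is simple enough to show inline. A sketch (my own toy example, not output from any particular tool) for the LL(1) grammar S -> '(' S ')' S | ε:

```python
# (nonterminal, lookahead) -> production body; an LL(1) generator
# computes this table from FIRST and FOLLOW sets.
TABLE = {
    ("S", "("): ["(", "S", ")", "S"],
    ("S", ")"): [],  # epsilon: ')' is in FOLLOW(S)
    ("S", "$"): [],  # epsilon at end of input
}

def ll1_parse(tokens):
    tokens = list(tokens) + ["$"]
    stack, pos = ["$", "S"], 0
    while stack:
        top = stack.pop()
        look = tokens[pos]
        if top == look:  # terminal (or end marker) matches the input
            pos += 1
        elif (top, look) in TABLE:  # expand a nonterminal via the table
            stack.extend(reversed(TABLE[(top, look)]))
        else:
            raise SyntaxError(f"unexpected {look!r} at position {pos}")
    return True
```

The error case falls out naturally: any (nonterminal, lookahead) hole in the table is a syntax error, and the expected-token set for the message is just the lookaheads that do have entries.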


LL(1) cannot recover that easily, which makes error reporting less useful.


This depends on the parser implementation. For example, https://github.com/yorickpeterse/ruby-ll lets you customize the error messages as the default ones can be a little bit confusing at times. An example of this is https://github.com/YorickPeterse/oga/blob/0fd6fd8645e57ea4b2... which changes messages from "unexpected T_FOO, expected T_BAR" to "unexpected end of input, expected element closing tag".

Having said that, a hand-written parser _will_ beat LL(1) (or any parser generator, for that matter) when it comes to error reporting, though this depends on the amount of effort the programmer puts into it.


I'm talking about recovery, not messages - things like "skip to the next ';' and continue parsing the next statement".
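That panic-mode strategy is a few lines in a hand-written parser. A sketch (my own; the NAME '=' NUMBER ';' statement form is invented for illustration):

```python
def parse_statements(tokens):
    stmts, errors, pos = [], [], 0
    while pos < len(tokens):
        try:
            stmt, pos = parse_statement(tokens, pos)
            stmts.append(stmt)
        except SyntaxError as e:
            errors.append(str(e))
            # Panic mode: discard tokens up to and including the next ';',
            # then resume parsing statements.
            while pos < len(tokens) and tokens[pos] != ";":
                pos += 1
            pos += 1  # step past the ';' (or run off the end)
    return stmts, errors

# Hypothetical statement form: NAME '=' NUMBER ';'
def parse_statement(tokens, pos):
    def expect(pred, what):
        nonlocal pos
        if pos >= len(tokens) or not pred(tokens[pos]):
            raise SyntaxError(f"expected {what} at token {pos}")
        tok = tokens[pos]
        pos += 1
        return tok
    name = expect(str.isidentifier, "a name")
    expect(lambda t: t == "=", "'='")
    value = expect(str.isdigit, "a number")
    expect(lambda t: t == ";", "';'")
    return (name, int(value)), pos
```

One bad statement produces one diagnostic and the parser keeps going, which is the behaviour that is awkward to retrofit onto a generated LL(1) table.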

> or any parser generator for that matter

PEG-based generators can be just as good as handcrafted ones.


As can Earley-based ones, since the full parsing state is available (and, for that matter, possible to modify - it is quite possible to make up rules as you go).



