I'm looking at this, and the Earley items with the "fat dot" look a heck of a lot like LR(0) kernel items in LALR(1) parser generation.
Then it hits me: it looks like Earley parsing is (vaguely) to LALR(1) what NFA simulation is to a DFA table.
At a very high level: Earley constructs these items dynamically while scanning the input, grouping them into sets that represent states, whereas an LALR parser generator generates the items and groups them into sets statically, while processing the grammar, producing a table for a lookahead-driven push-down automaton that is then applied to the input.
Analogously, NFA simulation computes sets of states dynamically according to the input, by performing closures on the NFA graph, whereas DFA construction does the same closure work statically, in the absence of input.
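To make the dynamic side concrete, here is a minimal Earley recognizer sketch in Python (toy grammar; all names are my own, not from any particular implementation, and nullable rules are not handled). The thing to notice is that the item sets, the analogue of LR states, are built per input position while scanning rather than once up front from the grammar:

```python
from collections import namedtuple

Item = namedtuple("Item", "lhs rhs dot start")   # lhs -> rhs, dot position, origin set index

# Toy grammar: S -> S '+' N | N ;  N -> '1'
GRAMMAR = {
    "S": [("S", "+", "N"), ("N",)],
    "N": [("1",)],
}

def earley_recognize(tokens, start="S"):
    sets = [[] for _ in range(len(tokens) + 1)]   # one item set per input position

    def add(i, item):
        if item not in sets[i]:
            sets[i].append(item)

    for rhs in GRAMMAR[start]:
        add(0, Item(start, rhs, 0, 0))

    for i in range(len(tokens) + 1):
        for item in sets[i]:                      # the set grows while we iterate
            if item.dot < len(item.rhs):
                sym = item.rhs[item.dot]
                if sym in GRAMMAR:                # predict: expand a nonterminal
                    for rhs in GRAMMAR[sym]:
                        add(i, Item(sym, rhs, 0, i))
                elif i < len(tokens) and tokens[i] == sym:
                    # scan: terminal matches, advance the dot into the next set
                    add(i + 1, Item(item.lhs, item.rhs, item.dot + 1, item.start))
            else:                                 # complete: advance items waiting on lhs
                for waiting in sets[item.start]:
                    if (waiting.dot < len(waiting.rhs)
                            and waiting.rhs[waiting.dot] == item.lhs):
                        add(i, Item(waiting.lhs, waiting.rhs,
                                    waiting.dot + 1, waiting.start))

    return any(it.lhs == start and it.dot == len(it.rhs) and it.start == 0
               for it in sets[-1])

print(earley_recognize(["1", "+", "1"]))   # True
print(earley_recognize(["1", "+"]))        # False
```

The familiar "fat dot" lives in each `Item`; an LALR generator computes closures over these same kinds of items once, over the grammar alone, and bakes the result into a table, which is exactly the dynamic-versus-static contrast above.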
Slight tangent, but interesting: Earley became a therapist and has been practicing since 1973. From his bio:
> Jay also has a Ph.D. in computer science from Carnegie-Mellon University and was formerly on the U.C. Berkeley faculty, where he published 12 computer science papers, one of which was voted one of the best 25 papers of the quarter century by the Communications of the A.C.M.
Traditional shift/reduce parsers are impractical to write without a specialised tool and often difficult to debug; PEG parsers and parser combinators are great to work with but are often inefficient and produce unhelpful syntax errors. If there's some other parsing scheme that could be a best-of-both-worlds, I would love to learn more.
I believe the Marpa parser may be what you're looking for. It's a C library whose most refined API is in Perl 5, and improvements are in the works to make it easier to produce interfaces to the library in other languages.
Once you have a successful parse, extracting the tree from it is a little bit like pulling teeth. You have to perform searches on the Earley set data and deal with ambiguities at that point.
None of the mainstream methods has any major difficulty popping out abstract syntax trees in a straightforward, syntax-directed manner.
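For contrast with chart-searching, here is what "syntax-directed" means in practice: a toy hand-written rule (hypothetical grammar, nothing from the thread) that returns its AST node as it parses, so there is no separate pass over parser data afterwards:

```python
def parse_sum(tokens, pos=0):
    """sum := NUM ('+' NUM)* ; returns (ast, next_pos)."""
    node, pos = int(tokens[pos]), pos + 1          # leaf node for the first number
    while pos < len(tokens) and tokens[pos] == "+":
        rhs, pos = int(tokens[pos + 1]), pos + 2
        node = ("+", node, rhs)                    # build the tree as we go
    return node, pos

print(parse_sum(["1", "+", "2", "+", "3"])[0])     # ('+', ('+', 1, 2), 3)
```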
An optimised packrat parser is very efficient (especially when combined with Pratt parsing), and its error handling can be on the same level as the best hand-crafted production parsers (like those in Clang or GCC): it is easy to mix recovery rules and arbitrarily detailed messages into a PEG (see the sketch below).
I do not understand why PEG is so persistently misunderstood and unjustly criticised.
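For anyone wondering what "mixing messages into a PEG" looks like, here is a minimal, hypothetical sketch (not any particular library; all class and method names are made up): packrat-style memoisation plus an "expected X" message recorded at the farthest failure position, which is the usual foundation for decent PEG error reporting. Pratt integration and recovery rules are omitted to keep it short.

```python
class PEG:
    def __init__(self, text):
        self.text = text
        self.memo = {}        # (rule, position) -> result: packrat memoisation
        self.far_pos = 0      # farthest position at which anything failed
        self.expected = set() # what was expected there

    def fail(self, pos, what):
        if pos > self.far_pos:
            self.far_pos, self.expected = pos, set()
        if pos == self.far_pos:
            self.expected.add(what)
        return None

    def lit(self, pos, s):
        if self.text.startswith(s, pos):
            return s, pos + len(s)
        return self.fail(pos, repr(s))

    def rule(self, name, pos):
        key = (name, pos)
        if key not in self.memo:
            self.memo[key] = getattr(self, name)(pos)
        return self.memo[key]

    def num(self, pos):                 # num := [0-9]+
        end = pos
        while end < len(self.text) and self.text[end].isdigit():
            end += 1
        return (int(self.text[pos:end]), end) if end > pos else self.fail(pos, "a number")

    def expr(self, pos):                # expr := num ('+' num)*
        r = self.rule("num", pos)
        if r is None:
            return None
        val, pos = r
        while self.lit(pos, "+") is not None:
            r = self.rule("num", pos + 1)
            if r is None:
                return None             # '+' with no right operand is a hard error
            rhs, pos = r
            val += rhs
        return val, pos

    def parse(self):
        r = self.rule("expr", 0)
        if r is not None and r[1] == len(self.text):
            return r[0]
        raise SyntaxError(f"at offset {self.far_pos}: expected "
                          + " or ".join(sorted(self.expected)))

print(PEG("1+2+3").parse())   # 6
PEG("1+?").parse()            # raises SyntaxError: at offset 2: expected a number
```

In the labelled-failure style of PEG error handling, a recovery rule is roughly another alternative that skips ahead to a synchronisation token and records a message instead of failing, which is what makes richer diagnostics practical.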
I've always loved traditional recursive descent parsers.
You have to learn a bit about transforming the target grammar into the required form, but they are nice because they require no tools at all. Being hand-written, they can be among the most efficient parsers available.
This link seems to have good info on the operations required to put a language in the required form.
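As a concrete example of "putting the language in the required form": the natural rule `expr := expr '+' term | term` is left-recursive, which plain recursive descent cannot use directly; rewriting it as `expr := term ('+' term)*` turns the recursion into a loop. A toy hand-written parser under that transformation (hypothetical grammar, evaluating as it goes to keep it short):

```python
import re

def tokenize(src):
    return re.findall(r"\d+|[()+*]", src)

def parse_expr(toks, i=0):     # expr := term ('+' term)*   (was: expr := expr '+' term | term)
    v, i = parse_term(toks, i)
    while i < len(toks) and toks[i] == "+":
        rhs, i = parse_term(toks, i + 1)
        v += rhs
    return v, i

def parse_term(toks, i):       # term := factor ('*' factor)*
    v, i = parse_factor(toks, i)
    while i < len(toks) and toks[i] == "*":
        rhs, i = parse_factor(toks, i + 1)
        v *= rhs
    return v, i

def parse_factor(toks, i):     # factor := NUM | '(' expr ')'
    if toks[i] == "(":
        v, i = parse_expr(toks, i + 1)
        assert toks[i] == ")", f"expected ')' at token {i}"
        return v, i + 1
    return int(toks[i]), i + 1

print(parse_expr(tokenize("2*(3+4)"))[0])   # 14
```

Note how operator precedence falls out of the rule layering (factor inside term inside expr) rather than from any table.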
A good LL(1) parser generator will be able to provide helpful error messages and good performance, at the cost of not being able to parse all grammars you might come up with.
Having said that, a hand-written parser _will_ beat LL(1) (or any parser generator, for that matter) when it comes to error reporting, though this depends on the amount of effort the programmer puts into it.
As can Earley-based ones, since the full parsing state is available (and, for that matter, possible to modify: it is quite possible to make up rules as you go).