On the Complexity and Performance of Parsing with Derivatives (arxiv.org)
182 points by ingve on June 25, 2016 | 114 comments



Summary of parsing with derivatives (Brzozowski 1964) for the unfamiliar (like me):

  For example, with respect to the character f, the derivative
  of the language for which [[L]] = {foo, frak, bar} is
  Df (L) = {oo, rak}. Because foo and frak start with the
  character f and bar does not, we keep only foo and frak
  and then remove their initial characters, leaving oo and rak.


  We repeat this process with each character in the input
  until it is exhausted. If, after every derivative has been performed,
  the resulting set of words contains the empty word,
  then there is some word in the original language consisting
  of exactly the input characters, and the language accepts
  the input. All of this processing takes place at parse time, so
  there is no parser-generation phase.
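
For concreteness, here's that recognition procedure as a minimal Python sketch, treating the language as a plain set of strings (toy code, not the paper's; `derive` and `accepts` are my names):

    def derive(language, ch):
        """Keep the words starting with ch, minus their first character."""
        return {w[1:] for w in language if w[:1] == ch}

    def accepts(language, word):
        for ch in word:
            language = derive(language, ch)
        return "" in language  # empty word present => input was in the language

    assert derive({"foo", "frak", "bar"}, "f") == {"oo", "rak"}
    assert accepts({"foo", "frak", "bar"}, "frak")
    assert not accepts({"foo", "frak", "bar"}, "fra")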
The specific algorithm they build on (Might 2011) extends this to memoize repeated derivatives of the same arguments and to allow lazy evaluation in the case of recursion.
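
The regular-expression version of the derivative is easy to sketch, and the memoization amounts to caching derivatives keyed on (expression, character). This toy stops at regular languages; the full Might 2011 machinery additionally needs laziness and fixpoints to handle recursive CFGs:

    from functools import lru_cache

    # Regexes as hashable tuples: ("empty",), ("eps",), ("chr", c),
    # ("alt", a, b), ("seq", a, b), ("star", a).

    def nullable(r):
        tag = r[0]
        if tag == "eps" or tag == "star": return True
        if tag == "alt": return nullable(r[1]) or nullable(r[2])
        if tag == "seq": return nullable(r[1]) and nullable(r[2])
        return False  # "empty" and "chr"

    @lru_cache(maxsize=None)  # memoize repeated derivatives of the same arguments
    def derive(r, c):
        tag = r[0]
        if tag in ("empty", "eps"):
            return ("empty",)
        if tag == "chr":
            return ("eps",) if r[1] == c else ("empty",)
        if tag == "alt":
            return ("alt", derive(r[1], c), derive(r[2], c))
        if tag == "seq":
            first = ("seq", derive(r[1], c), r[2])
            return ("alt", first, derive(r[2], c)) if nullable(r[1]) else first
        return ("seq", derive(r[1], c), r)  # "star"

    def matches(r, s):
        for c in s:
            r = derive(r, c)
        return nullable(r)

    ab_star = ("star", ("seq", ("chr", "a"), ("chr", "b")))
    assert matches(ab_star, "abab") and not matches(ab_star, "aba")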

But this algorithm was implemented in a mind-bendingly-slow way:

   For example, they report
  that a 31-line Python file took three minutes to parse!
They prove that the worst case is actually cubic time and provide a concise cubic-time implementation in Racket: https://bitbucket.org/ucombinator/derp-3

Their implementation is orders of magnitude faster than prior PWD implementations, but still orders of magnitude slower than a Bison parser for the same grammar, which they attribute to language choice (C vs Racket).

Please correct me if I'm wrong about any of this! Summarizing research like this helps me understand it, but that doesn't mean I actually understand it.


Yeah, that's accurate! I'm glad we finally know the running time for parsing with derivatives.

Let me add some more background, about stuff that I would expect to be covered in this paper's related work section, except that it doesn't have one...

Most well-known parsing algorithms that handle arbitrary CFGs run in cubic time, so this paper's O(n^3) running time is about the best you would expect. Some other parsing algorithms that also handle arbitrary CFGs in cubic time include CYK[1], Earley[2], and GLR[3]. I used to be excited about parsing with derivatives because of its simplicity, but CYK and Earley parsers are both actually simpler than parsing with derivatives, once you've thrown this paper's optimizations in.

[1] https://en.wikipedia.org/wiki/CYK_algorithm [2] https://en.wikipedia.org/wiki/Earley_parser [3] https://en.wikipedia.org/wiki/GLR_parser

[EDIT: Previously, I stated "CFGs, in general, can't be parsed in better than cubic time.", which is incorrect. Thanks, pacala.]


CFG parsing can be done better than O(n^3), reducing to Boolean matrix multiplication, which is about O(n^2.37).

http://theory.stanford.edu/~virgi/cs367/papers/valiantcfg.pd...

https://en.wikipedia.org/wiki/Coppersmith%E2%80%93Winograd_a...

Agreed that CYK and Earley are rather simple to implement. Though Earley has an edge, as its best case is O(n), whereas CYK is O(n^2) best case.


> CFG parsing can be done better than O(n^3), reducing to Boolean matrix multiplication, which is about O(n^2.37).

Is there a known lower bound on running time to parse CFGs, like O(n^2)?


It is known that any fast CFG parsing algorithm can be turned into a fast boolean matrix multiplication algorithm.

So if CFG parsing was n^2, so is boolean matrix multiplication :)

See http://arxiv.org/abs/cs/0112018


One important note is that this holds for general grammars. If the grammar is restricted somehow, for example to LR(k), then the best complexity is linear (there are even parsers that manage both LR(k) grammars in linear time and general parsing in cubic time).


I'm confused by this argument. To show that

> if CFG parsing was n^2, so is boolean matrix multiplication

wouldn't you need to show that any boolean matrix multiplication can be turned into CFG parsing (and not the converse)?


Great question, I just learned something.

A boolean matrix multiplication problem can be turned into a CFG problem. Therefore, an algorithm that solves CFG problems can be turned into an algorithm that solves boolean matrix multiplication.


Sorry, yes, I deleted a line accidentally, but it turns out you can show both directions, and to prove a tight bound, you need to show they reduce to each other without changing the complexity class.


Out of curiosity, do any of the other algorithms have the ability to be "paused" arbitrarily, i.e. with inverted control? I'm thinking specifically of parsing input from an asynchronous interface.

Derivatives do that well, since input is essentially pushed into the parser.


Pausing is very straightforward with bottom up algorithms like GLR. The entire state of the parse is stored in an explicit stack data structure, as opposed to implicitly in the program's call stack as with top-down algorithms. So resuming the parse is as simple as resuming the parse loop using a previously-stored stack.
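
A minimal sketch of the idea, with a hand-rolled toy reduce step instead of real GLR tables (hypothetical names; the point is only that all parse state lives in an explicit stack):

    class PausableParser:
        """Toy shift-reduce loop for E -> E '+' NUM | NUM."""
        def __init__(self):
            self.stack = []  # the entire parse state; survives any pause

        def feed(self, token):
            # May be called whenever input arrives, e.g. from an async source.
            if token == "+":
                self.stack.append("+")
            else:  # a number: shift it, then reduce E '+' NUM if possible
                if len(self.stack) >= 2 and self.stack[-1] == "+":
                    self.stack.pop()
                    left = self.stack.pop()
                    self.stack.append(("+", left, token))
                else:
                    self.stack.append(token)

        def result(self):
            assert len(self.stack) == 1
            return self.stack[0]

    p = PausableParser()
    for tok in [1, "+", 2]:
        p.feed(tok)       # could pause here; state is just p.stack
    p.feed("+"); p.feed(3)
    print(p.result())     # ('+', ('+', 1, 2), 3)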


I think this is a bit of an orthogonal issue. Inversion of control is a problem that can be solved at the language level with coroutines / continuations / threads / etc without forcing you to choose a special type of algorithm.


Earley parsing can handle asynchronous input just fine.


> CYK and Earley parsers are both actually simpler than parsing with derivatives, once you've thrown this paper's optimizations in.

Yeah, GLR is very elegant as well; I'm a little puzzled by this paper's initial claim (and similar claims in the Yacc is Dead paper):

> Current algorithms for context-free parsing inflict a trade-off between ease of understanding, ease of implementation, theoretical complexity, and practical performance. No algorithm achieves all of these properties simultaneously.

GLR has good performance and isn't much harder to understand than LR, which is a very elegant theory. So I assume that 'ease of implementation' is the property the authors think GLR lacks. But ease is a little hard to measure, until you have an equally fast and production-ready implementation of both algorithms.


I'm one of the paper authors. That's a great summary. Thanks!

We're hopeful that this new algorithm already makes PwD practical enough for many use cases, and we have significant further improvements in progress.

Stay tuned!


Thanks for looking that up. I had not realized that Boyer-Moore is an optimized subset of derivative parsing.


Note that the complexity is still O(n^3), which is higher than that of other algorithms for most "real-world" grammars.

I read Might's initial paper on derivative parsing a while ago, and I agree that it really seems very elegant at first when compared e.g. to bottom-up parsing. The problem of the approach though is that parse tree generation is more tricky than for other methods and (IMHO) renders the technique considerably less elegant. Of course, if only recognition is needed then this technique is straightforward, but in most practical use cases the user will want to generate a parse tree as well.

Also, the parser implementations discussed in the paper (and in Might's original work) are only able to parse a few hundred lines of code per second (for a small language), which is very slow compared to e.g. generated recursive-descent or bottom-up parsers, which can achieve several hundred thousand lines of code per second for real-world languages.

So although it's an interesting concept I personally think it will not be of very high relevance to practical parser generation, at least for use cases where efficiency and scalability matters.

I have been working on a derivative-based parser generator, though, to see if I can get anywhere near the performance of my currently used PEG parser (which uses conditional memoization, i.e. packrat parsing, to speed up the parsing).


Also, as other people have already pointed out, the time-complexity calculations in the academic literature are not always relevant to practical parsing:

1) A parser with exponential worst-case complexity might still beat a polynomial- or linear-time parser in practice, depending on the grammar and the actual code parsed.

2) For real-world grammars, constants (i.e. the c in O(c*n)) will be the determining factor between two alternative parsing techniques with identical time-complexity. As an example, memoization-based (packrat) parsing in combination with PEG achieves the same time complexity as shift-reduce style parsers for many grammars, but in practice the memory allocations and bookkeeping required to do the memoization make the latter approach much faster.
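
For concreteness, here is a tiny packrat sketch. Memoizing on (rule, position) is exactly what buys the linear-time guarantee, and also where the allocations and bookkeeping mentioned above come from (toy PEG, not a real parser):

    from functools import lru_cache

    TEXT = "1+2+3"

    @lru_cache(maxsize=None)   # the packrat table: (rule, position) -> result
    def expr(pos):
        # expr <- num ('+' expr)?   (PEG, with backtracking on failure)
        end = num(pos)
        if end is not None and end < len(TEXT) and TEXT[end] == "+":
            rest = expr(end + 1)
            if rest is not None:
                return rest
        return end

    def num(pos):
        return pos + 1 if pos < len(TEXT) and TEXT[pos].isdigit() else None

    assert expr(0) == len(TEXT)   # the whole input parses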

I have implemented several different parsers myself, and while it is pretty straightforward to write a parser for a real-world language (e.g. Python) today, achieving very good parsing speed is not. As an example, on my machine, the Python parser can process about 100-300k lines of code per second (including AST generation) while a comparable packrat PEG parser is slower by a factor of 5-10. (For many use cases, parsing at 10,000 loc/second is still good enough, though it should not be considered fast.)


So, if applicable here, what do you parsing experts think of GLL parsing?

http://dotat.at/tmp/gll.pdf

It mentions retaining easy implementation and debugging of LL parsers while knocking out limitations with some LR-style stuff. A hybrid of sorts. Result is cubic time.


GLL is fantastic! Unlike Earley and GLR, implementing GLL parser combinators is pretty easy. It's pretty fast and has some other nice features (it can be implemented to produce a lazy stream of possible parse sequences).

It is, IMO, the current state-of-the-art in parsing general CFGs. Derivative parsing is very interesting and may overtake GLL and related algorithms.


Appreciate the feedback. My plan was to push some formal methods or LANGSEC types to try to do a verified GLL if the opportunity arises. So far, SLR [1], LR(1) [2], and PEG [3] have been verified for correctness. Do you think a verified GLL generator is the ideal next target, given the benefits?

[1] http://users.cecs.anu.edu.au/~aditi/esop.pdf

[2] http://pauillac.inria.fr/~xleroy/publi/validated-parser.pdf

[3] https://arxiv.org/pdf/1105.2576.pdf


Not trained as a computer scientist, I find myself in the position of those six blind men and an elephant.

I have bits and pieces of information but not a coherent whole. I would suspect that this parsing-by-derivative thing is connected to the fact that there is a one-to-one correspondence between rational polynomials and regular languages. What feels tantalizing to me is the following: given the connection between regular languages and neural networks, other fast parsing techniques (meaning those that are not based on differentiation) would probably imply something about fast back-propagation with automatic differentiation. Would love to hear more on this.


I'm currently working on a parser generator with ease of use and performance in mind.

https://github.com/dragostis/pest


Looks interesting! Have you implemented any more complex grammars with this, e.g. Python or Javascript? Would be interested to see how they perform.

Also, does it support multi-stage parsing, i.e. generation of a token tree and then a parse tree from it?


There is a WIP php parser developed by the community: https://github.com/steffengy/pesty-php/blob/master/src/parse...

And, yes, it does support multi-stage parsing. After the tokens are generated, there is a process! macro which handles them. This is where you can produce an AST.


Being a regular developer how does this help me?


It doesn't. This is designed for more general context-free developers.


Quite the memer, aren't you?


No memes involved; that was a 100% artisanal hand-crafted joke. I'm sorry if you didn't like it, though.


It probably doesn't.


For me "Figure 6. Performance of various parsers" is unreadable.


It's amazing that text parsing -- one of the first problems studied in CS -- is still such a difficult problem (conceptually). I look forward to the day when the average undergrad graduating with a computer science degree will be able to write a context-free-language parser from scratch in a few days.


Text parsing for programming languages is NOT a difficult problem. It is very easy actually, much easier than most academics would have you believe.

What they are doing is trying to write theories and build conceptual systems about how to do things. That is their job. But when it comes to practical matters, the best route to take, as someone who wants to build a working compiler that gives good error messages and where the parser does not hamstring the rest of it, is to ignore almost all that stuff and just type the obvious code.


> Text parsing for programming languages is NOT a difficult problem.

That depends entirely on 1) what grammar you're parsing, 2) what memory pressure you're in, 3) how fast you want to do it, and 4) how comprehensive you want the error handling. It's actually quite difficult to parse C++ on a low-memory embedded device without resorting to clever techniques.


Ideally, you don't even target a low-memory embedded device with C++, let alone process C++ on it.

Anyway, C++ is just a strawman that seems to have been contrived deliberately in order to wreck most of the rational arguments defending "programming languages are easy to parse".

Parsing a programming language is exactly as difficult as the hole we dig ourselves into. Programming language parsing isn't some information processing problem that exists in the world, or in another domain, that computer science is tasked with solving; it's a problem that computer scientists and software people inflict on themselves.

By making languages hard to parse by machine, we are also making them hard to use by humans. Humans need to parse the language also, and not have a magic oracle for doing so. (Indentation only goes so far, and can tell a lie).

Still, there aren't actually difficult problems in parsing C++. It's more of a "death by a thousand blows". There is just the sheer size of it alone. The C++ parser source file in GCC is larger than some programming language implementations: their parsers, interpreters/compilers and all intrinsic routines combined. It's hard to comprehend the C++ grammar so that you can identify where the ambiguities lie, and form a strategy to deal with them, so as not to be surprised later.


> It's actually quite difficult to parse C++

This is a specific problem of C++ though, given the fact that its grammar is extremely complex.


I'm pretty curious what kind of project gave zeroxfe experience with parsing C++ on low-memory embedded devices.


It also depends on what functionality you want your parser to support. In theory, parsing just means "accept or reject". In practice, you need to do something like build a tree, and there are many different possible tree representations depending on what you want to do.

Clang aims to support very general transformation of C++ with their parser and AST. I was shocked to learn that Clang's AST has 60K lines of header files! That's not including the code, and it's not templates! That's class definitions!!! (wc -l include/clang/AST/*.h)


This is not because of parsing complexity; it is only because C++ is an awful language for writing ASTs and parsers in. Without all the boilerplate, an AST can be much smaller.


You haven't said much, because you can say that about anything. "Drinking water from a glass is a hard problem when you have to drink it through this straw that is 1 mm wide and you're blindfolded"

> It's actually quite difficult to parse C++ on a low-memory embedded device without resorting to clever techniques.

Compiling C++ programs on low-memory embedded devices is not a mainstream use-case. Unless you can demonstrate that it is, your point is moot.


Why would you parse C++ on a low memory device anyway? What are you going to do with it? Anything that follows the parsing would require far more memory.


What algorithm are you using with the "obvious code"? Recursive descent? That definitely works well in many cases, but not all.

Here are some things that I think are difficult about parsing:

1. The algorithm to use depends heavily on the language, and it takes some experience to know which algorithm to use for which language. Some languages are hard to write predictive recursive descent parsers for -- they can require a lot of backtracking, which is exponential in general. C would be a good example -- it is more efficiently parsed with a bottom-up algorithm than top down (recursive descent), because of the lookahead issue. In contrast, languages like Python or Go are designed to be LL(1) -- you just have to look at a single token -- "def" or "func" -- to know what kind of thing you need to parse next.

Recursive descent was also considered too slow for expression parsing, where you have left recursion. If you have 13 levels of precedence, then you require 13 nested function calls to parse an atom. So you basically have three choices: Pratt parsing, the shunting-yard algorithm, or precedence climbing (I've implemented #1 and #3 and like #3 the most). So you have to change the algorithm for the sublanguage of expressions, i.e. compose two different algorithms.

(I think these days nobody cares about 13 levels of function calls, unless you are parsing gigabytes of C++ code. Of course, this happens literally all the time, because of header file explosion ...)

2. If you know exactly what language you are going to parse, writing a hand-written parser is straightforward (although laborious). If you don't know the language, then refactoring parsers can be quite difficult. You need good test cases not to break corner cases.

A Lua paper talked about how until Lua 4.0 they used yacc while they were designing the language, and then when the language became more fixed in Lua 5.0, they switched to a hand-written parser. This seems to be very common. I think gcc also switched from yacc to a hand-written parser.

3. It's not straightforward to combine two parsers for two sublanguages and get a parser for the combined language (or get errors if the languages can't be composed). Many real languages are really several different sublanguages combined. C++ has a normal mathematical expression language, and a template expression language (there was that notorious >> vs foo<bar<T>> ambiguity in C++). Unix sh is about 5 different sublanguages mashed together (not counting external utilities). Parsing HTML requires parsing JS and CSS.

http://tratt.net/laurie/blog/entries/parsing_the_solved_prob...

4. Writing secure parsers. This is an issue for languages like JavaScript, where the input is untrusted. If you write a parser in C, I pretty much guarantee I will find a crash or buffer overflow in it (through fuzzing or otherwise). Basically ALL parsers in production have crash bugs uncovered after YEARS of widespread use, e.g. Python, Clang, etc.

See http://lcamtuf.blogspot.com/2015/04/finding-bugs-in-sqlite-e... . Note how mature sqlite is and how extensively it is tested -- you can still shake obscure bugs out quite easily.

5. Writing reusable parsers.

5a. Try writing a parser that allows you to reformat your source code, with whitespace and comments preserved. Most languages have multiple parsers these days -- i.e. the Java parser in your compiler is not the same parser that's used by the IDE.

5b. Try writing a parser with the hooks necessary to support code completion and auto-correct. There was a good video by Anders Hejlsberg here that mentioned that these types of parsers have to parse MORE THAN the language, so that they can continue to provide the required functionality when the code in the IDE is malformed.

FWIW I recall that Steve Wozniak also disclaimed any background in parsing algorithms, and that he said he invented his own table driven parsing algorithm for BASIC. That is totally plausible, but I also think he would have a hard time writing certain parsers for certain languages, supporting advanced functionality, without some study.

EDIT: #6: The line between the lexer and parser is not clear -- it depends on the language being parsed, and sometimes even on the implementation language.

In theory, you have a pipeline from the lexer to the parser. In practice, the lexer and the parser often need to communicate, resulting in circularity. This communication can be hard to reason about and debug.

Moreover, you can often solve parsing problems (i.e. resolve ambiguity) in either the lexer or the parser. (I think Go's and JavaScript's semicolon insertion is done in the lexer; a rough sketch follows below.) If you pick the wrong choice, then you can have bugs or a lot of extra complexity. It takes experience and thought to get these things right.

(Side note: depending on the language, there can be more stages than just lexing and parsing. Unix shell has at least 3 stages.)
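
To illustrate the semicolon-insertion point, a rough sketch of the Go-style rule done in the lexer (simplified; the real rule consults Go's full set of statement-ending tokens):

    # Insert ';' at a newline if the previous token could end a statement.
    ENDERS = {"IDENT", "NUMBER", ")", "}"}   # simplified token kinds

    def insert_semicolons(tokens):           # tokens: (kind, text) pairs
        out = []
        for kind, text in tokens:
            if kind == "NEWLINE":
                if out and out[-1][0] in ENDERS:
                    out.append((";", ";"))
            else:
                out.append((kind, text))
        return out

    toks = [("IDENT", "x"), ("=", "="), ("NUMBER", "1"), ("NEWLINE", "\n"),
            ("IDENT", "f"), ("(", "("), (")", ")"), ("NEWLINE", "\n")]
    print(insert_semicolons(toks))  # ';' appears after the 1 and after the ')'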


C would be a good example -- it is more efficiently parsed with a bottom-up algorithm than top down (recursive descent), because of the lookahead issue.

Which C compilers use a bottom-up algorithm? I'm fairly certain GCC and Clang both use top-down recursive-descent parsers.

It would be great to write your grammar once and get security, code-formatting, autocomplete, etc. for free, but in practice, you'll need a lot of control over things like error reporting. It's hard to match the control you have with plain recursive-descent code.


That's a good question, and you're of course right about GCC and Clang using recursive descent.

Well, I said that bottom-up is more efficient for such languages (with C style declaration syntax), not that it's used in production C compilers :)

My feeling is that they use recursive descent for reasons of readability and maintainability, error handling, debugging, etc. (although I have seen a couple hand-written bottom up parsers for other languages)

I suspect that they have ad hoc tricks for the specific lookahead cases, or perhaps they just live with it because it doesn't come up often enough in real C code (there are plenty of other reasons C and C++ parsers are slow; this is an algorithmic reason, which may not be the bottleneck)

In practice I think there is a sliding scale from strict "recursive descent" to "ad hoc bunch of recursive procedures". Depending on how you organize the output data structures of the parser, I guess you can just delay knowing what production you are in. It doesn't have to be organized strictly according to the grammar. A lot of C parsers I've looked at use untyped (homogeneous) tree output, which may help with this. Every function returns Node* rather than an object of a specific type.

But I'm not an expert on GCC or Clang, so I would be interested to hear anything to the contrary. (I'm still sort of shocked by how huge the Clang front end is; the lib/Lex/ dir is 21K lines of code; lib/Parse/ is 33K lines; and that's not even close to the full front end -- it doesn't even count the 60K lines of AST headers that I mentioned before...)

Of course, the danger with a more ad hoc structure is that it's harder to reason about what language you're actually parsing. The only solution seems to be rather exhaustive test cases (e.g. as mentioned on https://gcc.gnu.org/wiki/New_C_Parser) It's almost mind-boggling to try to understand what language 33K lines of code are recognizing, without some higher level structure.

This question addresses the same issue, but there's no real answer.

http://stackoverflow.com/questions/5992130/how-do-you-parse-... (ignore the fact that the question is misusing the term "context sensitive")

"When you dive into the parens you don't know if you're parsing a type, or an array of type, or an expression (and whether it's l-value or r-value), so you have to somehow postpone that decision"


1. You're arguing against a strawman (maybe one that lives in some basements) if you think anybody's parsing expressions by writing down a CFG and translating that into a strawman "recursive descent" parser. A hand-rolled parser will not traipse down 13 levels to parse an atom. Also, you don't have to "compose two different algorithms" per se, there's no impedance mismatch.

2. I've found refactoring hand-rolled parsers to be just fine, I believe the things I needed to do would not be so convenient with parsing tools, though it depends on what direction I'd have gone down at a certain point in time.

3. The problem there is that of proper language design. You need to combine your sub-languages in a way such that there isn't parsing problems. Your language needs to be parsed by humans, too, after all.

4. This is actually a solved problem, it's just the solution isn't evenly distributed. I think that parsing code is much more low-risk than other C code I've written, but there are people who do things a lot more bozotically...

5a. Does the parser have special obligations beyond recording comments and source locations here?

4a^H^H5b. Yeah, now, the question is, how did they solve that problem?


Your comments betray a lot of ignorance, so I will just point out the glaring inaccuracies in #1. Here are two different expression parsers that are likely installed on machines you use, and they absolutely make a function call for each level of precedence (eval1, eval2, ..., exp1, exp2, ...)

http://code.metager.de/source/xref/gnu/coreutils/src/expr.c

https://opensource.apple.com/source/bash/bash-76.1/bash/expr...

My point is not that this is BAD -- it probably doesn't matter on modern machines. I'm just saying there are choices to make. Parsing requires some knowledge and choices -- that's all. You don't just "type out the code".

#3 is not a refutation, because all the languages I mentioned are extraordinarily common. The answer to "parsing real languages is hard" is not "don't parse real languages".

I can also tell by #4 that you don't have any background in security. It's not about YOUR code per se. It's about code that is deployed and that we all rely on.


1. Did I not say there might be some living strawmen out there?

3. Did I say we shouldn't parse real languages? No, I said the problem was that it's not straightforward to combine two languages into a new one. The problem's not at the level of parsers.

4. Did I not say it wasn't evenly distributed?


"You're arguing against a strawman (maybe one that lives in some basements) if you think anybody's parsing expressions by writing down a CFG and translating that into a strawman "recursive descent" parser. A hand-rolled parser will not traipse down 13 levels to parse an atom."

So, how do you handle expression precedence? (I haven't hand written a parser since I made peace and developed a working relationship with bison.)


A Pratt-style parser makes expression precedence easy in a recursive descent parser:

    "use strict";
    var tokens = [2, '+', 5, '+', 9, '+', 4];
    var PREC = {'+': 1, '-': 1, '*': 2, '/': 2};

    function expr(prec) {
      var value = tokens.shift();          // left operand
      while (true) {
        var nextToken = tokens[0];
        var nextPrec = PREC[nextToken];
        if (!nextPrec) break;              // not an operator
        if (nextPrec <= prec) break;       // binds no tighter than our caller: stop
        tokens.shift();                    // consume the operator
        var rest = expr(nextPrec);         // parse the right side at the operator's precedence
        value = [nextToken, value, rest];  // fold into a left-associative tree
      }
      return value;
    }

    console.log(expr(0));
I find this technique useful because I can implement it in any language without being dependent on a parser generator.


Precedence is actually not hard. The way I do it is:

(1) Parse everything as though it were left-to-right.

(2) After each node is parsed, look at its immediate descendants and rearrange links as necessary. (Nodes in parentheses are flagged so you don't rearrange them.)
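
A minimal sketch of that rearrangement (toy code, not Jai's actual parser; it ignores the parenthesization flags from step 2):

    PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

    def make_node(op, left, right):
        # Built left-to-right; if the left child binds looser than op,
        # steal its right subtree:  (2 + 3) * 4  becomes  2 + (3 * 4).
        if isinstance(left, tuple) and PREC[left[0]] < PREC[op]:
            return (left[0], left[1], make_node(op, left[2], right))
        return (op, left, right)

    def parse(tokens):                # alternating [operand, op, operand, ...]
        node, i = tokens[0], 1
        while i < len(tokens):
            node = make_node(tokens[i], node, tokens[i + 1])
            i += 2
        return node

    print(parse([2, "+", 3, "*", 4]))   # ('+', 2, ('*', 3, 4))
    print(parse([2, "*", 3, "+", 4]))   # ('+', ('*', 2, 3), 4)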

I can tell that the person above who is listing off a bunch of reasons not to use "recursive descent" hasn't written a compiler by hand ever (or not well). Most of the things he is talking about are easier to do by hand than in some complicated and relatively inflexible system.

Note that 'prediction' is mostly a red herring since you can look as many tokens ahead as you want before calling the appropriate function to handle the input. You would need to have a pathologically ambiguous language in order to make that part hard, and if your language is that ambiguous, it is going to confuse programmers!

In general, parsing is easy (if you know how to program well in the first place) and is only made more difficult/inflexible/user-unfriendly by using parsing tools. That doesn't mean that academic theories about parsing are bad -- it's good that we understand deep things about grammars -- but that does not mean you should use those systems to generate your source code. (I do think it's a good idea to use a system like that to spot ambiguities in your grammar and decide how to handle them, because otherwise it's easy to be ignorant... But I would not use them to generate code!)


Precedence isn't hard -- but it's often not done with recursive descent. In other words, it's not just "typing out the code".

I think you're misreading my reply. A hand-written recursive descent works a lot of the time, and is almost certainly the right "default". But my point is that it doesn't work all the time. If it works for you, great! That doesn't mean all parsing is easy. Try adding an autoformatter like gofmt or IDE completion for Jai, and see how your parser changes (or if you have to write an entirely new parser).

In particular, I'm NOT advocating that all parsers should written with parsing tools. I think you must have read that somewhere else.

In fact, I used EXACTLY the same strategy as you mention on my recent shell parser. I machine-checked the grammar with ANTLR, and wrote some test cases for it. Then I transcribed that grammar to recursive descent by hand, along with several other ad hoc strategies that you need to parse an ad hoc language like shell.

(ANTLR is a great front end, but not necessarily a good back end. The generated Java code is probably OK, but the C++ code it generates is preposterous. Yacc is not that much better, i.e. with globals all over the place.)

"Just type out the code" is a horrible strategy because you will end up not knowing what language you've designed. If your post advocated "prototype your grammar with a tool, and then manually transcribe the grammar to code", then I'd be inclined to agree.

But that's not what you said. It seems like you are more interested in telling everyone what a great programmer you are, and how things are so easy for you, rather than spreading any interesting knowledge about parsing.


"Try adding an autoformatter like gofmt or IDE completion for Jai, and see how your parser changes (or if you have to write an entirely new parser)."

It would not change at all, and I have no idea why you think it would, except to guess that the model you have in your head of a hand-written parser kind of sucks. They don't have to suck.

"...not knowing what language you've designed." I have no idea what you're on about here either.

Look, I think you are making things a lot harder than they are. I am not bragging ... I used to build lexers and parsers by hand 23+ years ago when I was a student in college and had almost no programming experience compared to what I have now. It is not hard. If you think it's hard, something is missing in your knowledge set.

(I also built stuff using parser tools 23+ years ago, and was able to very clearly contrast the two methods. Parser tools have gotten slightly better since then, but not much.)


I don't think you've really thought about those things, or you don't understand what I'm saying.

Here is the recent video from Hejlsberg I mentioned: https://news.ycombinator.com/item?id=11685317

I doubt your parser is even close to doing what he is talking about. I think you're an advocate of Microsoft's debugger over GDB, so hopefully what they are saying carries some weight (i.e. a little computer science doesn't hurt). They aren't just a bunch of eggheads; they are actually building language tools that people use.

And again, I'm not claiming that ALL parsing is hard. Sometimes you can just write a hand written lexer and recursive descent parser and be done with it. But that's not true in all cases, particularly for "real world" languages, or when you are designing a brand new language.

Saying parsing is easy is like saying "I wrote a game in BASIC in 8th grade, so writing games is easy". (And yes, I've actually heard statements like that ...)

I was researching Jai awhile back, and was definitely curious about it. I don't think you have released the code, so I can't tell which of these statements is true:

1) Parsing is easy for you

2) You think parsing is easy, but your language and your parser are full of bugs

I'm perfectly willing to believe #1 (honestly). But that doesn't mean all parsing is easy, for say a competent programmer in some other domain. It doesn't help anyone to say "just type out the code". Your comment about how you do precedence is a little more helpful.


I watched that talk when it appeared on the HN front page, and I actually think the whole methodology he is talking about is misguided. I don't find any of the "incremental program understanding" stuff in Visual Studio to be useful at all. I wish it were not there because it only causes problems and distractions.

It's a case where some people are choosing to do something that is a lot harder than a straightforward parse ... but as a user, a straightforward parse is actually what I want.

That said, even if you thought this was the right way to go, I am not sure that the internals of their code would look anything like the kinds of parsing tools you are talking about, so I am not sure it supports your point in any way.

> And again, I'm not claiming that ALL parsing is hard.

Parsing is easy. The video you link above is harder, but that's not really parsing any more, it's more like "make sense of this text that is sort of like a working program", which is more like an AI problem.

But anyway. It's pretty clear you haven't written many parsers (or any) so I am going to stop arguing. If I were to "win" this argument I wouldn't get anything out of it. I am trying to help by disabusing people of the notion that certain things are harder than they've been indoctrinated to think. If you don't want that help, fine ... just keep doing what you do and the world will keep going onward.


> I doubt your parser is even close to doing what he is talking about.

You act like if someone hasn't architecture-astronauted all their software to meet any hypothetical future requirement, you've won. The truth is, making a parser that can restart incrementally and reuse above-and-below data structures is just another requirement. It shapes the software design and would probably result in a more iterator-oriented approach -- see, we can already imagine how it might be built. It's like you're saying, "Writing a piddly game in BASIC isn't all that, let me show you how hard it is to write this Tetris program." Maybe a better analogy is those people that thought OOP was some deep advanced stuff.

And note again parsing is the easiest part of that software, a neat and well-contained problem.

See also https://hn.algolia.com/?query=by:WalterBright%20parser&sort=... for the opinion of somebody else that knows things about the relative difficulty of parsing.


> Try adding an autoformatter like gofmt

Since you seem to know so much about this, what's the big deal here? And how does using some parsing tool help?


Step 1. Begin writing a parser.

Step 2. Notice that writing functions parsing "expressions," "expressions without equals signs," "expressions without equals signs or plus/minus signs," ... is very repetitive.

Step 3. Being a competent programmer, find a way to solve that problem. (Instead of N functions, make one function that takes a parameter specifying the precedence level. Edit: Durr, I'm a moron, this doesn't really describe how to solve the problem of making N recursive calls. But still, you just figure it out, see my other post for how I actually tend to do it. Edit: Or judofyr's post describes it too.)


> which is exponential in general

In Packrat it's linear.

> Pratt parsing

And it's a natural match for Packrat, they work together really well.

> writing a hand-written parser is straightforward (although laborious)

Not that much more laborious than writing a BNF. You can still use a parser generator, just add some annotations for the error recovery, error reporting, pretty printing and indentation.

> It's not straightforward to combine two parsers for two sublanguages

With PEG or another lexerless parsing it's totally trivial. You can mix any languages you like.

> Writing reusable parsers.

Again, trivial with PEG, where parsers can be extensible and generic, and can be inherited in full or partially by the consequent parsers.

> Try writing a parser that allows you to reformat your source code, with whitespace and comments preserved.

And again, trivial with the lexerless PEG-based parsers.

> Try writing a parser with the hooks necessary to supports code completion and auto-correct.

You're getting this for free when using PEG.

> The line between the lexer and parser is not clear

Just forget about the lexers, once and for all. It's 21st century already. Burn your Dragon Book.


> With PEG or another lexerless parsing it's totally trivial. You can mix any languages you like.

Not strictly true; combining PEG parsers will always work, but it might not give you the answers you want. If you have some outer layer that has 'value=' and you want to follow it by an expression in various sub-languages, you have to try each language in turn - if language A has some weird corner-case where it happens to accept something that's an important and common case in language B, language A will always win unless you swap the two languages around, in which case B might recognise and mask some important syntax from A.

Worse, your combined language parser cannot tell you that the combination of A and B cause a problem, because PEG parsers don't really support that kind of consistency-checking. It's just a bug that'll crop up at run-time.

You can get around this by manipulating the outer syntax to have special "language A" and "language B" prefixes to mark what syntax to expect, or by manually merging them to create "language AB" which has the syntax priorities you want. But in both cases, that's (potentially delicate and thoughtful) human intervention, not "straightforward combining of two parsers".
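
A tiny demonstration of that masking effect, using regexes as stand-ins for the two sub-language recognizers:

    import re

    def peg_choice(alternatives, text):
        # Ordered choice: the first alternative that matches wins;
        # the rest are never consulted.
        for name, pattern in alternatives:
            m = pattern.match(text)
            if m:
                return name, m.group(0)
        return None

    A = re.compile(r"[a-z]+")        # language A: bare identifiers
    B = re.compile(r"[a-z]+\(\)")    # language B: calls like foo()

    print(peg_choice([("A", A), ("B", B)], "foo()"))  # ('A', 'foo') -- B is masked
    print(peg_choice([("B", B), ("A", A)], "foo()"))  # ('B', 'foo()')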


> because PEG parsers don't really support that kind of consistency-checking

Not true at all. You can easily check if a new parser is breaking anything in the old one.

And in practice you never mix languages at the same level. A more typical example of such a mixing would be, say, regexp syntax in javascript.

EDIT: if you want more details on a theory of this static PEG verification, they will be available some time later when I polish my next bunch of commits.


I like PEGs a lot and even wrote my own PEG-like parsing language. The main problem I found was that, in practice, mixing lexing and parsing is a bad idea, so I have separate lexing in my system. It depends on the language, but I would say it's true for all programming languages.

It's just obvious that programming languages have separate lexical and grammatical structure. If you want to disprove that, show me some languages like C, Java, Python, etc. expressed as PEGs.

PEGs have been around for 12 years now; I don't see them being deployed widely. There are probably multiple reasons for that, but I believe usability in terms of expressing real languages is a big one. (People always harp on ordered choice; I don't think it's that big a deal because when you write a recursive descent parser, you're mostly using ordered choice too.)

You want to do as much work in the less powerful computational paradigm as possible -- that is, lex with regular languages. And then use the more powerful paradigm (PEGs or CFG algorithms) on top of that token stream.

I believe that lexing and parsing were combined in PEGs for reasons of academic presentation and bootstrapping, not for usability or practicality for recognizing real languages.

Several of your other points are wrong, but I'll leave it at that.


> mixing lexing and parsing is a bad idea

Why exactly? I find it rather liberating to mix lexing and parsing, for any possible language.

> show me some languages like C, Java, Python, etc. expressed as PEGs.

Feel free to browse my github repos, username 'combinatorylogic'.

There are a lot of parsers in PEG, including C and Verilog.

One of the very good reasons to do such a mix is to be able to have nested comments (which rules out regular expressions).

> I don't see them being deployed widely

Do you see any new languages deployed widely? Inertia is very strong here.

> Several of your other points are wrong, but I'll leave it at that.

Mind elaborating? All my points are based on years of practice in writing parsers for all languages imaginable.


> Text parsing for programming languages is NOT a difficult problem

And of course that's why the world is littered with a million 100%-functional C++ parsers, right?


C++ is a bad example because its grammar is not actually context free.

Writing a recursive descent parser is actually quite intuitive. It's the solution that you'd eventually arrive at if given the problem of parsing some specific context free language.

GLR, GLL and PwD are solutions to parsing arbitrary CFGs (i.e. grammars that can have more than one parse tree) in optimal time.


Given that parsing C++ may be Turing complete¹ I'm not exactly sure that's the best example to use. The world _is_ littered with a million JavaScript and Python parsers.

¹ http://stackoverflow.com/questions/14589346/is-c-context-fre...


It kind of is yeah. A new language is posted on reddit every day.


In Jai, did you write the parser yourself or use a parser generator like yacc? In my toy languages I find writing my own parser more fun and not hard. I find it harder to make yacc do things such as indentation significance.


I'm pretty sure he mentioned he wrote his own parser.

https://twitter.com/jonathan_blow/status/734817293331357697


Yes, it is hand-written. The parser is MUCH easier to build than the interesting/new parts of the compiler are. Lexing is the absolute easiest thing, then parsing is 2nd place.


> Text parsing for programming languages is NOT a difficult problem

Counterpoint: Ohloh tried to build a simple line counter. After 7 years even with community support they never really got it right.

https://github.com/blackducksoftware/ohcount/tree/master/src...


I don't know what Ohloh is but it looks like it's some kind of web software, in which case I am totally unsurprised.

Are you confident that the same programmers could have successfully built a line-counter if they built it using parser tools?


Saying "it's not a difficult problem" and then turning around and implying "... but only if you're not one of those lousy web programmers" sure sounds like moving the goalpost.


Well, I think a lot of web programmers do not really know how to program.

If someone is going to be offended that a potential employer asks them to reverse a linked list in an interview -- something that seems a bit trendy in the web world lately and several such articles have made the HN front page -- then look, that person does not really know how to program, so of course they think it's hard to do basic stuff. Such a person's opinion on how hard things are is not that relevant to how hard they are given a reasonable background education.

Probably this sounds snobby to some people, but look, programming well is a never-ending pursuit, you can spend your whole life getting better, but it won't help anyone advance if we all pretend that everyone is good already.


The directory I linked shows Ragel lexers. It was a command line tool that fed a web app. All it had to do was count lines of code, blank lines, and comment lines for a variety of languages.

It took years and they never got it right. Partly because every technical decision was based on neato-tool-of-the-month in the early Ruby/Zed Shaw era. But mostly because it's hard.


Care to explain why you think it's hard? (For someone with a basic education in programming, say, someone with a bachelor's degree from a reasonable school). Exactly what part of this problem is hard?


You seem to be treating this like "Knuth solved it in 1965, how hard can it be?" I provided a link to 100 language parsers with commit history. A quick Google reveals the most frequent committer has a CS degree from Caltech. Either it's hard or they're idiots. (I suppose we both can be right.)

Personally I wouldn't want to write something that can parse the Javascript embedded in the HTML emitted by a PHP script.

Subtle details like "screw that, life's way too short" can conspire to make "easy" things hard.


I look forward to the day when the average undergrad graduating with a computer science degree will be able to write a context-free-language parser from scratch in a few days.

Does recursive-descent count? If so, I've come across material on the Internet suggesting that a simple arithmetic expression parser/evaluator is an assignment in some first and second-year CS courses, often covered at the same time as trees and tree-traversal.

Of course, being taught and doing an assignment on something is not quite the same as really knowing, as according to some sources many of those graduates can barely manage FizzBuzz.

On the other hand, recursive descent seems to be one of those techniques that everyone eventually manages to reinvent even if they've never been taught about it, as long as they're aware of recursion.


I don't think it's true that a lot of people really reinvent recursive descent parsing. Or rather, I think a lot more programmers who don't understand language concepts wind up writing broken pieces of what could be a recursive descent parser. The programming piece of a recursive descent parser is all easy - the tricky parts are having a language spec, operating on the language spec to get the appropriate normal forms (Chomsky and Greibach), and then turning the spec into a parser.


No, you don't need to know anything about normal forms to make a good parser.


How do you handle those sorts of things?


Looking these things up, Chomsky normal form is clearly irrelevant, and Greibach normal form forces "progress" to be made. The equivalent of this is, let's say you hand-roll a parser and notice that it's recursing infinitely. Then you realize, "Durr, you bozo, you need to force it to consume something or it'll recurse forever without going anywhere!"

And then you solve that problem by having it consume something first. For example, if you're parsing an expression a+b*c+d, and thanks to some naive thinking about left-associative operators, it recursively calls itself without consuming data. So you'll say, okay, first I need to chew off something that consumes data. Then -- what else can you do -- see if you've got an operator, and grow an expression from there. That's a well-contained situation that marches you straight into the right solution.
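
In code, the fix amounts to replacing the left-recursive rule with a loop that consumes an atom first (toy sketch):

    # Instead of the left-recursive  expr -> expr '+' term | term,
    # which recurses forever in naive recursive descent:

    def parse_expr(tokens):
        node = parse_term(tokens)            # always consume something first
        while tokens and tokens[0] == "+":
            tokens.pop(0)
            node = ("+", node, parse_term(tokens))
        return node

    def parse_term(tokens):
        node = tokens.pop(0)
        while tokens and tokens[0] == "*":
            tokens.pop(0)
            node = ("*", node, tokens.pop(0))
        return node

    print(parse_expr(["a", "+", "b", "*", "c", "+", "d"]))
    # ('+', ('+', 'a', ('*', 'b', 'c')), 'd')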


Yes,

I'm scanning the original video introducing derivatives[1].

The need for parsing is ubiquitous, yet never mind CS students: in my experience, a significant number of professional programmers cannot create a basic parser either from scratch or with a standard tool.

Edit: The video picks on a hand-written C parser. Such a thing isn't necessarily bad IF it's a formally derived recursive descent parser. But just ad-hoc parsing without being informed by a grammar is pretty much waiting for disaster.

[1] https://www.youtube.com/watch?v=ZzsK8Am6dKU


It's a pain and not something you really keep touching again and again once you've got it baked.

I've done it a number of times and I've had to relearn ANTLR each time (or fall back to flex/yacc or perl if I'm feeling grungy or go commando with a hand-crafted FSM if I'm on an odd machine architecture)

Is this something that you'd expect people who code even daily to be able to whip out without looking at a man page?


Sure,

All of the parser-generator tools are really opaque and not something that can be easily used for simple things. On the other hand, recursive descent parsers are something that can be whipped up easily if you have a little project. And the problem is that a lot of programmers both can't produce a recursive descent parser AND don't know when their problem domain is moving into a place where such a thing is needed.


Fair enough. They're not rocket science conceptually - but I have to relearn the tools each time. Not a huge deal - a couple of hours - but I was thinking that you were claiming I needed to have it down as cold as when I got out of CS 202 (or whatever that thing was).


I don't understand why people use FSMs or parser tools.

I wrote the parser for my script language, jerboa ( https://github.com/FeepingCreature/jerboa ) over a few days; it's (I thought) entirely standard recursive descent single-pass. This is how I've always written parsers. Where would you even use FSMs for parsers?


> Where would you use FSMs for parsers?

For the lexing (lexical analysis) stage of the parser. I know that the alternative to using a lexer (for which the tokens are specified by regular languages, thus lexed by FSMs) is to use so-called scannerless parsing (this is possible since every regular-language is context-free, thus the tokens can also be specified by the original context-free grammar). But using this alternative is usually much more painful since you usually want the grammar to have a nice property that allows linear-time parsing and a unique parse tree. LR(k), LL(k) etc. are such properties. Such properties are much more difficult to ensure for scannerless parsing.
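
For what it's worth, the FSM need not be an explicit table; the lexer states can live in the control flow of hand-rolled code (toy sketch, made-up token names):

    def lex(s):
        tokens, i = [], 0
        while i < len(s):
            c = s[i]
            if c.isspace():                          # skip-whitespace state
                i += 1
            elif c.isdigit():                        # NUMBER state
                j = i
                while j < len(s) and s[j].isdigit():
                    j += 1
                tokens.append(("NUMBER", s[i:j])); i = j
            elif c.isalpha() or c == "_":            # IDENT state
                j = i
                while j < len(s) and (s[j].isalnum() or s[j] == "_"):
                    j += 1
                tokens.append(("IDENT", s[i:j])); i = j
            else:
                tokens.append(("PUNCT", c)); i += 1  # single-char tokens
        return tokens

    print(lex("x1 + 42"))  # [('IDENT', 'x1'), ('PUNCT', '+'), ('NUMBER', '42')]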


Except no, lexing is trivial; it is by far the easiest part of writing a compiler. You don't need anything fancy, you just type in the code.


When you add unicode, it can become more complicated (or rather, cumbersome)...


Well, you just need to call unicode_next_character all the time instead of saying s++, similarly for whitespace, similarly for asking whether a character can initiate or continue an identifier, etc. It does not change the basic nature of the task at all.


Sure, but if you are using a language without support for unicode, and you don't use a dedicated library (which would already be using a kind of lexer, wouldn't it?), you also have to parse these unicode characters yourself.


A unicode_next_character function is very simple to write regardless of unicode support in your language.

I usually write it as a small (256 byte) lookup table where each entry tells you how many characters to skip next. If you don't use a lookup table, it's 4 single-line `if` statements (and if you do, it's a one-liner, plus however many lines the table takes).
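
The "4 single-line ifs" version looks roughly like this (sketch; assumes well-formed UTF-8 and skips validating continuation bytes):

    def utf8_char_len(first_byte):
        if first_byte < 0x80: return 1            # 0xxxxxxx: ASCII
        if first_byte & 0xE0 == 0xC0: return 2    # 110xxxxx
        if first_byte & 0xF0 == 0xE0: return 3    # 1110xxxx
        return 4                                  # 11110xxx

    def next_character(buf, i):
        """Return (bytes of the character starting at buf[i], next index)."""
        n = utf8_char_len(buf[i])
        return buf[i:i + n], i + n

    s = "héllo".encode("utf-8")
    ch, i = next_character(s, 1)
    print(ch.decode("utf-8"), i)  # é 3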


There's free source code for basic Unicode operations all over the internet.


Here's a video of Rob Pike, developer of Go, talking about a lexer he built for the Go text/template package:

https://www.youtube.com/watch?v=HxaD_trXwRE

It supports unicode.


If you want to write a lexer from scratch, this is IMHO not trivial. If you use a generator it is much easier, but so is also using a parser generator for the CFG parsing stage.


If you can't write a lexer by hand, just forget trying to write a compiler that does anything interesting, because the lexer is MUCH easier than any other part of the compiler.

There are a lot of reasons for this, but one of the basic ones is that the lexer does not need to interact in a complex way with the compiler's state. It is a relatively simple pipeline where characters go in one end and tokens come out the other.


Hand-rolling a lexer seems pretty trivial to me. The code writes itself. A generator is a tool whose limitations you have to work around, be it for a lexer or a parser.


The only actual increase in difficulty from scannerless parsing is having to janitor whether a function will always slurp up whitespace before it finishes. Linear-time is easy, questions of unique parse tree are irrelevant because the only ambiguity is who owns a piece of whitespace that gets discarded anyway.


What wolfgke said. I (personally) use FSMs as my swiss-army chainsaw a lot. I realize that this isn't common practice, but I came from silicon, so what are ya gonna do?


Which is funny because a recursive descent parser is actually very simple.


I agree completely, but it seems a lot of others (unfortunately) don't, as this rather saddening link shows:

http://stackoverflow.com/questions/28256/equation-expression...


Yes, the thing about the inability of a given programmer to produce a recursive descent parser is that it generally comes from an actual ignorance of what an abstract language is, rather than from a particular technical inability. Especially coming from naive object orientation, a language seems like it should just be a thing, an "object", and so you should be able to define a "language object" and go from there. But actually an abstract language operates on a "whole other level" than the usual hierarchy of members and methods described in simple object-oriented design.

Anyway, hopefully the derivative idea is to have constructs kind of like regular expressions but actually reflecting language structure, and able to generate good and understandable parsers.


Writing a parser by hand can be a bit repetitive, but it's actually not that bad with a bit of experience. Many production parsers are written by hand.

Here's a good article about parsing expressions: http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-e...

And my half-baked explanation which may not make sense unless you've already tried it: https://plus.google.com/+BrianSlesinsky/posts/hMr5NEnQcxC


With the vanishing of required compilers courses, I am unsure that this will ever be the case.


I've taught undergrads CKY in a half-hour, but only when limited to Chomsky-form grammars. This is still a general parsing algorithm, but the restriction is very annoying in practice. Turning a grammar into Chomsky form is quite annoying, too.


> Turning a grammar into Chomsky form is quite annoying, too.

This can be done automatically.


Sure, but you haven't exactly taught undergrads how to parse arbitrary grammars until you teach them that automatic algorithm, too, and that algorithm is harder to teach than simple CKY.


Can they not? I had at least two undergrad classes (Programming Languages and Automata Theory) that taught how to do efficient CFG parsing.


The problem is that the languages people actually use are NOT context free. C and C++ are known not to be context free, and (somewhat surprisingly to me) it's not clear that Java and Python are context-free either. I also bet that JavaScript's semicolon insertion rule makes it not context-free (perhaps among other things).

http://trevorjim.com/parsing-not-solved/

Just because you use yacc doesn't mean it's context free. The entire design of yacc explicitly lets you break out of that paradigm -- i.e. with arbitrary semantic actions in a Turing-complete language (C).

There is a fairly big gap between theory and practice. Even if parsing with derivatives were fast, or even it were LINEAR, it wouldn't really "solve" parsing in any meaningful sense. Because it's too weak to parse most real languages.


The Python grammar can be formulated in a context-free form (to my knowledge), except for the indentation rules that require access to a stack, albeit only in the preprocessor (which replaces the indentation with INDENT / DEDENT tokens). Another complication is that NEWLINE characters that occur inside parentheses are optional terminals, whereas outside they are relevant (this does not require a context to be parsed though, it just makes the tokenization step a bit more complex).
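
That preprocessor is small; a rough sketch of the stack-based INDENT/DEDENT pass (simplified: spaces only, no continuation lines or tabs):

    def indent_tokens(lines):
        stack = [0]     # the indentation stack that breaks context-freeness
        out = []
        for line in lines:
            if not line.strip():
                continue                      # blank lines are ignored
            width = len(line) - len(line.lstrip(" "))
            if width > stack[-1]:
                stack.append(width)
                out.append("INDENT")
            while width < stack[-1]:          # may emit several DEDENTs
                stack.pop()
                out.append("DEDENT")
            out.append(("LINE", line.strip()))
        while len(stack) > 1:                 # close any open blocks at EOF
            stack.pop()
            out.append("DEDENT")
        return out

    src = ["if x:", "    y = 1", "z = 2"]
    print(indent_tokens(src))
    # [('LINE', 'if x:'), 'INDENT', ('LINE', 'y = 1'), 'DEDENT', ('LINE', 'z = 2')]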


Yes, that is exactly what's mentioned here, on the same site that I linked:

http://trevorjim.com/python-is-not-context-free/

(Are you the author? :) )

Why does this matter? Because AFAICT the current best practice for implementing languages is to 1) prototype the language with a tool, then 2) throw that out and write the parser by hand.

If you skip step 1, then it will be unclear what language you've defined (and it's not always obvious how to break up the procedures). If you skip step 2 and auto-generate your code instead, you typically don't have enough expressive power, and you have to hack around it (communication with globals, etc.)

Tools and theory are focused excessively on CFGs (i.e. in this thread, parsing with derivatives). Whereas real languages always have non-context-free parts. Granted, Python is probably one of the most friendly and well-designed, and writing the lexer by hand and generating the parser works well there. I don't think that, in the case of Python, writing the parser by hand would have been a good idea.


Not the author, just wrote a Python parser recently. The problem with most grammar specifications is that they are not clear about the tokenization stage and assume that you already have a valid token stream to work with, which is of course not the case usually. The Python grammar specification for example does not even mention anything about how one should implement the tokenizer (to my knowledge).

That said, most programming languages can be formulated in an almost context-free way.

For real-world parsers, I think speed is just valued much higher than elegance, which often makes sense, especially for scripting languages where the parsing speed is directly proportional to program startup time (leaving caching aside).


Agreed, although I would not say "most" can be formulated as CFGs. There is the lexer issue you mention, and that is mentioned in that article, but there are also tons of other issues in real languages.

Languages that are close to context free: Python, Java, JavaScript, R, Go, awk. (and I think Standard ML and OCaml)

Languages that are not even close: C, C++, Perl, Unix shell, make.

Languages I'm not sure about but which probably fall in the "no" category: PHP, Ruby, C#.

So I think that is enough to warrant some research into parsing tools for non-CFGs. (PEGs are interesting and I explored them for some of these applications)


Parsing is a solved problem. Any other research in this area is done out of curiosity, not necessity.



