Hacker News new | past | comments | ask | show | jobs | submit login

As far as I know C is context-free. If C's grammar is ambiguous, that just makes it an ambiguous context-free grammar [1].

[1]: https://en.wikipedia.org/wiki/Ambiguous_grammar




The context-free terminology doesn't apply to C because the ambiguity rests in not knowing what is the lexical category of foo in "foo * bar". That information comes from semantics.

A language whose semantics you have to understand to understand its syntax is outside of the entire formal language realm inside which we formally recognize a "context-free" category. That formal language realm is purely syntactic: the terminal symbols are just "dumb" and stand for themselves. They are subject to rules which indicate how those symbols are combined.

C is "quasi context free" in that if we resolve what the raw tokens mean (which one is a type and which one isn't), then the resulting language is context-free. In fact stronger than context free: amenable not only to LALR(1) parsing approaches but to recursive descent, with a suitable refactoring of the left-recursive rules given in the standard.


Your post doesn't really agree with what I was taught.

> the ambiguity rests in not knowing what is the lexical category of foo in "foo * bar".

Right, so there are 2 possible parse trees. Being able to disambiguate them with more information doesn't mean it's not context-free. Given just a CFG, you were still able to determine that you have a syntactically valid C program.

> A language whose semantics you have to understand to understand its syntax is outside of the entire formal language realm inside which we formally recognize a "context-free" category.

A formal language is just the set of strings that belong to it. That's it. An absurd example might be "the set of strings corresponding to valid C programs that do not terminate". To recognize such a language, you would not only have to understand the semantics of C, you would even need to solve the halting problem, but that doesn't make it "outside of the entire formal language realm".


> Right, so there are 2 possible parse trees.

Not with the same symbols, though.

> A formal language is just the set of strings that belong to it. That's it.

That is correct, and in that framework, we cannot reason about the meaning of "foo"; that if it's a type apply this rule otherwise that rule.

If we do that reasoning, what we're really doing is replacing "foo" with one of two other symbols, and then parsing those; then we have different strings (so of course the parse trees cannot be the same, no matter what).


We don't need to decide to apply this rule sometimes but other times that rule. When we come to an ambiguous string, we apply both rules nondeterministically. Here's a pseudo-grammar that I think describes both possibilities.

  Statement -> Type Op Var ";"
  Statement -> Var Op Var ";" 
  Op -> "*"
I think the disconnect here is that most lexers would insist on feeding a different token to the parser depending on whether "foo" is a Type or a Var, and choosing one or the other means the lexer might feed it bad information. The pragmatic way to implement this in a real compiler is to add some logic to the lexer so that it's working with more information than just the syntax. Theoretically though, all that really matters is that both possibilities are syntactically valid. So another way to implement it would be to pass on both possibilities, letting the parser eliminate the one that doesn't compile.


Indeed. There is a rule that Type generates Identifier, which can generate an example like foo. Likewise, Var generates Identifier, which generates foo. But these generations are not purely grammar rules; Type can only generate foo if there exists a declaration in the semantic space. If we regard it as purely a grammar rule, then we have a straightforward ambiguity in a context-free language. It is context-free simply because the rules are all of the form one_sym -> zero_or_more_syms.

If C were parsed this way, nondeterministically, ultimately the ambiguity would be resolved by looking up the type info anyway. (The interesting possibility exists, though, that in some cases the type info could be inferred, based on how the declared identifier is used.)




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: