It seems to me that the real issue is not mentioned by the original question. Yes, "a b(c);" has ambiguous semantics; its interpretation depends on whether c is a type. But such ambiguity is not a matter of syntactic correctness.
However, in a function prototype, we can name the parameters. Thus,
a b(c x);
is syntactically correct C++ if c is a type, but not if it isn't. A correct grammar for C++ must deal with this issue, which cannot be settled by a CFG.
Also, a couple of notes.
(1) "Context-sensitive" does not mean "not context-free". In fact, strictly speaking, every context-free grammar is a context-sensitive grammar. And there are grammars that are neither. So the original question should really be restated.
(2) The long answer makes a very important point:
> ... the standard does not attempt to provide a complete formal grammar, and ... it chooses to write some of the parsing rules in technical English.
> What looks like a formal grammar in the C++ standard is not the complete formal definition of the syntax of the C++ language.
Correct me if I'm wrong here but isn't that exactly the line between syntax and semantics? I'd think that
a b(c x);
is syntactically correct C++ but still not valid C++ code (as in: a well defined line of C++). As far as I know, syntax doesn't depend on whether a type or an identifier or whatever actually exists.
No; if "c" is not a type, then there are no grammar rules that will allow you to produce that statement. So, as you parse that text, when you get to "c", you have to ask, "Is 'c' a type?" The answer you get determines which subsequent grammar rules you apply. That context - having to know something about "c" that was established elsewhere - isn't technically semantics. It's just an assertion that "c" appeared elsewhere in the text and that, when we parsed it, it sat in a particular location in the grammar (which happens to mean it is a type - but the parser doesn't need to know what that means).
An example of syntactically correct, but semantically undefined code is:
*(static_cast<int*>(NULL)) = 42;
The difference between here and above is that there is nothing in the grammar that can tell you "this is an error". It's only an error because of the semantics that we've given to the operations.
I don't think what you've said is precisely true...
a b(c x);
can be syntactically correct even if c is not a type. Trivially, any of the four tokens could be preprocessor definitions. Even if we are talking about preprocessed C++, the following statement is syntactically correct:
Preprocessor tokens: by that logic, just about anything is syntactically correct. I'd prefer to deal with the syntax of a program after the preprocessor is done with it.
Using "new": Yes, that's right. OTOH, if "c" is literally "c", and not a placeholder for something else, then it looks like my statement is correct. More generally, let c be a placeholder for any identifier, and my statement is correct.[1]
C++ is absolutely not context-free. It is probably the worst offender among mainstream languages: parsing it is in fact undecidable.
http://yosefk.com/c++fqa/defective.html#defect-2
Compiler writers have to use heuristics to sort out all the ambiguities and cross their fingers that they are correct.
Lua might be context-free. I forgot how to do rigorous proofs. But it is one of the cleanest and most elegant languages ever made. See the complete (tiny) syntax of Lua in extended BNF at the very bottom of the manual.
http://www.lua.org/manual/5.2/manual.html
I don't think any commonly used language has a truly context-free grammar (including Lisp, actually, though Lisp's deviations are both minor and straightforward). That said, Lisp and Go are the two that come the closest, in my recollection.
We actually had this discussion on #go-nuts a few weeks ago - part of the confusion stems from the fact that most language compilers enforce requirements that aren't covered by the parser. In other words, a parser oftentimes accepts (fails to reject) invalid strings.
Easy examples of requirements that violate context-freedom are the requirement that variables be declared before use, or the inability to divide an integer literal by 0. (The latter is a compiler error in Go[0] - I'm not sure about C++, but I know the former is a requirement in both languages).
Finally, we should make a distinction - even if the set of valid programs could be described perfectly by a context-free grammar (and therefore recognized by a pushdown automaton), chances are the actual parser in the language's compiler doesn't implement the full set of rules, which again reinforces the point that most parsers are biased in favor of accepting potentially invalid strings.
> If you look up the definition of context-free languages, it will basically tell you that all grammar rules must have left-hand sides that consist of exactly one non-terminal symbol
That's not exactly true - one can rewrite (normalize) a context-free grammar into this form, but CFGs are often written in an "unnormalized" form which has multiple nonterminal symbols, because it tends to be clearer to read in many cases.
I think you are conflating the grammar checking of a language with all the checks done in other compilation phases.
For example, in a context-free language, if "A : B;" is a variable declaration, A is always a type and B is always a variable name. In the expression "C = C + 1", C would always be a variable name even if C has not been declared; in that case the error will be detected in the next compilation phase, not during grammar checking.
In the same way, all type checking is done after grammar checking.
For example :
int i = "iyi";
is a valid line for C++ grammar but invalid for the type checker.
> I think you are conflating the grammar checking of a language with all the checks done in other compilation phases.
The main point is that for some languages those two kinds of checks cannot be easily separated. The top answer to the linked question demonstrates that for C++. It turns out that in C++ typen<1>() can either be parsed as a template instantiation or not parsed at all (i.e. producing a syntax error) as (typen < 1)>().
The syntactic correctness of this expression depends on the result of evaluating a fine piece of template metaprogramming - there must be more than one grammar-checking phase, each one dependent on the evaluation of the previous one. This gives a good hint as to why C++ compile times are sometimes so huge.
Actually, having to draw that distinction is one motivation for van Wijngaarden grammars such as was used to specify Algol 68; that let the language definition be purely grammatical rather than having potentially ambiguous prose "semantics" implemented in ad hoc code.
I'm a bit confused about Go's grammar. I thought I read somewhere that Go could be parsed by a CFG.
But when I looked into it just now, the introduction to the language spec [1] says "The grammar is compact and regular ...". But how can the grammar be regular? According to Wikipedia, "regular grammars" describe languages that can be expressed using regular expressions. [2][3] Wouldn't matching parentheses and braces require at least a context-free language rather than a regular one?
Regular here does not mean regular in the technical sense; it means there aren't any weird special-case constructs that break the rest of the parsing (for a particularly egregious example, consider the Perl report sublanguage, which is nothing like the rest of the language).
"I don't think any commonly used language has a truly context-free grammar"
It's a matter of semantics. (doubly so)
These errors that you describe during compilation are defined by what? Not by the syntax (at least not in r5rs Scheme [1]) but by the semantics. Conflating the two would, of course, give the impression of non-context-free language syntaxes, because if we require the syntax to also be the semantics of the language, then the syntax must be able to evaluate the program, i.e. the syntax must describe a Turing machine.
I think the confusion is compilation. Compilation is a series of evaluation steps taken before complete evaluation.
All the errors (as they exist in Scheme) that you describe are DEFINED by the semantics of r5rs Scheme. For example, the semantics of variable reference basically say: if the variable has been bound, then use its value to continue the evaluation. If it is unbound, then stop evaluating and signal the error "undefined variable".
====
"That's not exactly true - one can rewrite (normalize) a context-free grammar into this form, but CFGs are often written in an "unnormalized" form which has multiple nonterminal symbols, because it tends to be clearer to read in many cases."
I think you misunderstand what the OP is stating. OP refers to the left hand sides:
E -> E + E
| E * E
Certainly the left hand side in a CFG normally has a single non-terminal. I suppose you could write a grammar with the left hand sides containing multiple non-terminal symbols and then, if it turns out the grammar you wrote is in fact context-free, you could re-write it in the above form, but does anyone actually do this?
>Easy examples of requirements that violate context-freedom are the requirement that variables be declared before use, or the inability to divide an integer literal by 0. (The latter is a compiler error in Go[0] - I'm not sure about C++, but I know the former is a requirement in both languages).
In C++ it's usually just a warning, plus a runtime error. Dividing by zero is actually used in a few cases as a deliberate way to kill a process.
Well, yes, though by definition that's what Common Lisp-style macros will lead to.
By 'closest', I simply meant that these deviations are easy to analyze or account for in an informal sense - for example, by forbidding/exempting user-defined macros. (Naturally, they cannot be precisely accounted for without executing the program or solving the halting problem, since the definition of the macro may itself require executing arbitrary code).
The deviations of languages like C++ and Java, for example, are hairier to try and enumerate, even informally (imprecisely) - I remember doing this as an exercise for my compilers class in school.
Best comment on SO was from jpalecek, who observed that the requirement that variables be declared before use prevents the language from being context-free. Of course that requirement is not enforced in the parser, so you might not care about it.
Can I reply with another question: why does it matter? I mean, even if C++ were context-free, that does not make it any better or worse of a language. It certainly does not alleviate the fact that the best-case number of passes required to compile C++ code is (if I'm not mistaken) 3 all the way up to 6 for proper optimization.
(Note: I love C++. But that doesn't take away from the fact that's one of the worst languages to parse and compile.)
As pointed out in Dan's answer, the phrase "grammar of a programming language" is not well-defined. Even "syntax of a programming language" is not well-defined.
Until someone defines those terms accurately, trying to answer the question is a waste of time.
"Why does my division produce an integer in Python" is google friendly.
Essentially, you should be able to work out the rough content of a question based on its title. I'm sure there's some irony about context and grammars hidden in this response.