It seems to me that the real issue is not mentioned by the original question. Yes, "a b(c);" has ambiguous semantics; its interpretation depends on whether c is a type. But such ambiguity is not a matter of syntactic correctness.
However, in a function prototype, we can name the parameters. Thus,
a b(c x);
is syntactically correct C++ if c is a type, but not if it isn't. A correct grammar for C++ must deal with this issue, which cannot be settled by a CFG.
Also, a couple of notes.
(1) "Context-sensitive" does not mean "not context-free". In fact, strictly speaking, every context-free grammar is a context-sensitive grammar. And there are grammars that are neither. So the original question should really be restated.
(2) The long answer makes a very important point:
> ... the standard does not attempt to provide a complete formal grammar, and ... it chooses to write some of the parsing rules in technical English.
> What looks like a formal grammar in the C++ standard is not the complete formal definition of the syntax of the C++ language.
Correct me if I'm wrong here but isn't that exactly the line between syntax and semantics? I'd think that
a b(c x);
is syntactically correct C++ but still not valid C++ code (as in: a well defined line of C++). As far as I know, syntax doesn't depend on whether a type or an identifier or whatever actually exists.
No; if "c" is not a type, then there are no grammar rules that will allow you to produce that statement. So, as you parse that text, when you get to "c", you have to ask, "Is 'c' a type?" The answer you get determines which subsequent grammar rules you apply. That context - having to know something about "c" that was established elsewhere - isn't technically semantics. It's just an assertion that "c" appeared elsewhere in the text and that, when we parsed it, it sat in a particular location in the grammar (which happens to mean it is a type - but the parser doesn't need to know what that means).
An example of syntactically correct, but semantically undefined code is:
*(static_cast<int*>(NULL)) = 42;
The difference between here and above is that there is nothing in the grammar that can tell you "this is an error". It's only an error because of the semantics that we've given to the operations.
I don't think what you've said is precisely true...
a b(c x);
can be syntactically correct even if c is not a type. Trivially, any of the four tokens could be preprocessor definitions. Even if we are talking about preprocessed C++, the following statement is syntactically correct:
Preprocessor tokens: by that logic, just about anything is syntactically correct. I'd prefer to deal with the syntax of a program after the preprocessor is done with it.
Using "new": Yes, that's right. OTOH, if "c" is literally "c", and not a placeholder for something else, then it looks like my statement is correct. More generally, let c be a placeholder for any identifier, and my statement is correct.[1]
C++ is absolutely not context-free. It is probably the worst offender among mainstream languages: parsing it is in fact undecidable.
http://yosefk.com/c++fqa/defective.html#defect-2
Compiler writers have to use heuristics to sort out all the ambiguities and cross their fingers that they are correct.
Lua might be context-free. I forgot how to do rigorous proofs. But it is one of the cleanest and most elegant languages ever made. See the complete (tiny) syntax of Lua in extended BNF at the very bottom of the manual.
http://www.lua.org/manual/5.2/manual.html
I don't think any commonly used language has a truly context-free grammar (including Lisp, actually, though Lisp's deviations are both minor and straightforward). That said, Lisp and Go are the two that come the closest, in my recollection.
We actually had this discussion on #go-nuts a few weeks ago - part of the confusion stems from the fact that most language compilers enforce requirements that aren't covered by the parser. In other words, a parser oftentimes accepts (fails to reject) invalid strings.
Easy examples of requirements that violate context-freedom are the requirement that variables be declared before use, or the inability to divide an integer literal by 0. (The latter is a compiler error in Go[0] - I'm not sure about C++, but I know the former is a requirement in both languages).
Finally, we should make a distinction - even if the set of valid programs could be described perfectly by a context-free grammar (and therefore recognized by a pushdown automaton), chances are the actual parser in the language's compiler doesn't implement the full set of rules, which again reinforces the point that most parsers are biased in favor of accepting potentially invalid strings.
> If you look up the definition of context-free languages, it will basically tell you that all grammar rules must have left-hand sides that consist of exactly one non-terminal symbol
That's not exactly true - one can rewrite (normalize) a context-free grammar into this form, but CFGs are often written in an "unnormalized" form which has multiple nonterminal symbols, because it tends to be clearer to read in many cases.
I think you are conflating the grammar checking of a language with all the checks done in other compilation phases.
For example, in a context-free language, if "A : B;" is a variable declaration, A is always a type and B is always a variable name. In the expression "C = C + 1", C would always be a variable name even if C has not been declared; in that case the error will be detected in the next compilation phase, not during grammar checking.
In the same way, all type checking is done after grammar checking.
For example :
int i = "iyi";
is a valid line for C++ grammar but invalid for the type checker.
> I think you are conflating the grammar checking of a language with all the checks done in other compilation phases.
The main point is that for some languages those two kinds of checks cannot be easily separated. The top answer to the linked question demonstrates that for C++. It turns out that in C++ typen<1>() can either be parsed as a template instantiation or not parsed at all (i.e. producing a syntax error) as (typen < 1)>().
The syntactic correctness of this expression depends on the result of evaluating a fine piece of template metaprogramming - there must be more than one grammar-checking phase, each one dependent on the evaluation of the previous one. This gives a good hint as to why C++ compile times are sometimes so huge.
Actually, having to draw that distinction is one motivation for van Wijngaarden grammars such as was used to specify Algol 68; that let the language definition be purely grammatical rather than having potentially ambiguous prose "semantics" implemented in ad hoc code.
I'm a bit confused about Go's grammar. I thought I read somewhere that Go could be parsed by a CFG.
But when I looked into it just now, the introduction to the language spec [1] says "The grammar is compact and regular ...". But how can the grammar be regular? According to Wikipedia, "regular grammars" describe languages that can be expressed using regular expressions. [2][3] Wouldn't matching parentheses and braces require at least a context-free language rather than a regular one?
Regular here does not mean regular in the technical sense; it means there aren't any weird special-case constructs that break the rest of the parsing (for a particularly egregious example, consider the Perl report sublanguage, which is nothing like the rest of the language).
"I don't think any commonly used language has a truly context-free grammar"
It's a matter of semantics. (doubly so)
These errors that you describe during compilation are defined by what? Not by the syntax (at least not in r5rs Scheme [1]) but by the semantics. Conflating the two would, of course, give the impression of non-context-free language syntaxes, because if we require the syntax to also be the semantics of the language, then the syntax must be able to evaluate the program, i.e. the syntax must describe a Turing machine.
I think the confusion is compilation. Compilation is a series of evaluation steps taken before complete evaluation.
All the errors (as they exist in Scheme) that you describe are DEFINED by the semantics of r5rs Scheme. For example, the semantics of variable reference basically say: if the variable has been bound, then use its value to continue the evaluation. If it is unbound, then stop evaluating and signal the error "undefined variable".
====
"That's not exactly true - one can rewrite (normalize) a context-free grammar into this form, but CFGs are often written in an "unnormalized" form which has multiple nonterminal symbols, because it tends to be clearer to read in many cases."
I think you misunderstand what the OP is stating. OP refers to the left hand sides:
E -> E + E
| E * E
Certainly the left hand side in a CFG normally has a single non-terminal. I suppose you could write a grammar with the left hand sides containing multiple non-terminal symbols and then, if it turns out the grammar you wrote is in fact context-free, you could re-write it in the above form, but does anyone actually do this?
>Easy examples of requirements that violate context-freedom are the requirement that variables be declared before use, or the inability to divide an integer literal by 0. (The latter is a compiler error in Go[0] - I'm not sure about C++, but I know the former is a requirement in both languages).
In C++ it's usually just a warning, plus a runtime error. Dividing by zero is actually used in a few cases as a deliberate way to kill a process.
Well, yes, though by definition that's what Common Lisp-style macros will lead to.
By 'closest', I simply meant that these deviations are easy to analyze or account for in an informal sense - for example, by forbidding/exempting user-defined macros. (Naturally, they cannot be precisely accounted for without executing the program or solving the halting problem, since the definition of the macro may itself require executing arbitrary code).
The deviations of languages like C++ and Java, for example, are hairier to try and enumerate, even informally (imprecisely) - I remember doing this as an exercise for my compilers class in school.
Best comment on SO was from jpalecek, who observed that the requirement that variables be declared before use prevents the language from being context-free. Of course that requirement is not enforced in the parser, so you might not care about it.
Can I reply with another question: why does it matter? I mean, even if C++ were context-free, that does not make it any better or worse of a language. It certainly does not alleviate the fact that the best-case number of passes required to compile C++ code is (if I'm not mistaken) 3 all the way up to 6 for proper optimization.
(Note: I love C++. But that doesn't take away from the fact that's one of the worst languages to parse and compile.)
As pointed out in Dan's answer, the phrase "grammar of a programming language" is not well-defined. Even "syntax of a programming language" is not well-defined.
Until someone defines those terms accurately, trying to answer the question is a waste of time.
"Why does my division produce an integer in Python" is google friendly.
Essentially, you should be able to work out the rough content of a question based on its title. I'm sure there's some irony about context and grammars hidden in this response.