That's a good question, and you're of course right about GCC and Clang using recursive descent.
Well, I said that bottom-up is more efficient for such languages (with C style declaration syntax), not that it's used in production C compilers :)
My feeling is that they use recursive descent for reasons of readability and maintainability, error handling, debugging, etc. (although I have seen a couple hand-written bottom up parsers for other languages)
I suspect that they have ad hoc tricks for the specific lookahead cases, or perhaps they just live with it because it doesn't come up often enough in real C code (there are plenty of other reasons C and C++ parsers are slow; this is an algorithmic reason, which may not be the bottleneck)
In practice I think there is a sliding scale from strict "recursive descent" to "ad hoc bunch of recursive procedures". Depending on how you organize the output data structures of the parser, I guess you can just delay knowing what production you are in. It doesn't have to be organized strictly according to the grammar. A lot of C parsers I've looked at use untyped (homogeneous) tree output, which may help with this. Every function returns Node* rather than an object of a specific type.
But I'm not an expert on GCC or Clang, so I would be interested hear anything to the contrary. (I'm still sort of shocked by how huge the Clang front end is; the lib/Lex/ dir is 21K lines of code; lib/Parse/ is 33K lines; and that's not even close to the full front end -- it doesn't even count the 60K lines of AST headers that I mentioned before...)
Of course, the danger with a more ad hoc structure is that it's harder to reason about what language you're actually parsing. The only solution seems to be rather exhaustive test cases (e.g. as mentioned on https://gcc.gnu.org/wiki/New_C_Parser) It's almost mind-boggling to try to understand what language 33K lines of code are recognizing, without some higher level structure.
This question addresses the same issue, but there's no real answer.
"When you dive into the parens you don't know if you're parsing a type, or an array of type, or an expression (and whether it's l-value or r-value), so you have to somehow postpone that decision"
Well, I said that bottom-up is more efficient for such languages (with C style declaration syntax), not that it's used in production C compilers :)
My feeling is that they use recursive descent for reasons of readability and maintainability, error handling, debugging, etc. (although I have seen a couple hand-written bottom up parsers for other languages)
I suspect that they have ad hoc tricks for the specific lookahead cases, or perhaps they just live with it because it doesn't come up often enough in real C code (there are plenty of other reasons C and C++ parsers are slow; this is an algorithmic reason, which may not be the bottleneck)
In practice I think there is a sliding scale from strict "recursive descent" to "ad hoc bunch of recursive procedures". Depending on how you organize the output data structures of the parser, I guess you can just delay knowing what production you are in. It doesn't have to be organized strictly according to the grammar. A lot of C parsers I've looked at use untyped (homogeneous) tree output, which may help with this. Every function returns Node* rather than an object of a specific type.
But I'm not an expert on GCC or Clang, so I would be interested hear anything to the contrary. (I'm still sort of shocked by how huge the Clang front end is; the lib/Lex/ dir is 21K lines of code; lib/Parse/ is 33K lines; and that's not even close to the full front end -- it doesn't even count the 60K lines of AST headers that I mentioned before...)
Of course, the danger with a more ad hoc structure is that it's harder to reason about what language you're actually parsing. The only solution seems to be rather exhaustive test cases (e.g. as mentioned on https://gcc.gnu.org/wiki/New_C_Parser) It's almost mind-boggling to try to understand what language 33K lines of code are recognizing, without some higher level structure.
This question addresses the same issue, but there's no real answer.
http://stackoverflow.com/questions/5992130/how-do-you-parse-... (ignore the fact that the question is misusing the term "context sensitive")
"When you dive into the parens you don't know if you're parsing a type, or an array of type, or an expression (and whether it's l-value or r-value), so you have to somehow postpone that decision"