I don't think you can write a production quality parser for any "real" language in 3 weeks ... You can get something working in 3 weeks, but then you'll be adding features and fixing bugs for a year or more.
If you take something like Python or JavaScript, the basics are simple, but there are all sorts of features like splatting, decorators, a very rich function argument syntax, etc., and subtle cases to debug, like the rules of what's allowed on the LHS of an assignment. JavaScript has embedded regexes, and now both languages have template strings, etc. It's a huge job.
It's not necessarily hard, but it takes a lot of work, and you will absolutely learn a lot of stuff about the language long after 3 weeks. I've programmed in Python for 18 years and still learned more about the language from just working with the parser, not even implementing it!
And this doesn't even count error recovery / dealing with broken code ...
I don't see what is challenging about any of what you mentioned. Furthermore, parsing a language is not the same thing as verifying that what is parsed is semantically valid. Python is almost a context-free language, with the exception of how it handles indentation. With indentation taken into account, the entire language can be parsed directly from the following grammar using something like yacc:
JavaScript is not strictly context-free either, but like Python the vast majority of it is, and the parts that are not context-free can be worked around. Furthermore, the entire grammar is available here:
It isn't trivial to work around the parts that aren't context free, but it's also nothing insurmountable that requires more than 120 hours of effort. The document explicitly points out which grammar rules are not context free and gives an algorithm that can be used as an alternative.
Parsing is really not as challenging a job as a lot of people make it out to be, and it's an interesting exercise to try yourself to get an intuitive feel for it. You can use a compiler-compiler (like yacc) if you feel like it to just get something up and running, but the downside of such tools is that they do very poorly with error handling. Rolling your own hand-written parser gives much better error messages and really is nothing that crazy. C++ is the only mainstream language I can think of that has a grammar so unbelievably complex that it would require a team of people working years to implement properly (and in fact none of the major compilers implement a proper C++ parser).
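To make the "hand-written parser is nothing that crazy" point concrete, here's a minimal recursive-descent parser for arithmetic expressions in Python. The grammar and names are my own toy example, not from any of the languages discussed, but the structure (one function per grammar rule, precedence encoded by which rule calls which) is exactly how hand-written parsers for real languages are organized:

```python
# Minimal hand-written recursive descent parser for arithmetic
# expressions. One method per grammar rule; operator precedence
# falls out of expr() calling term() calling factor().
import re

TOKEN_RE = re.compile(r"\s*(?:(\d+)|(\S))")

def tokenize(src):
    tokens = []
    for num, op in TOKEN_RE.findall(src):
        tokens.append(("NUM", int(num)) if num else ("OP", op))
    tokens.append(("EOF", None))
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def next(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    # expr := term (('+' | '-') term)*
    def expr(self):
        node = self.term()
        while self.peek() in (("OP", "+"), ("OP", "-")):
            _, op = self.next()
            node = (op, node, self.term())
        return node

    # term := factor (('*' | '/') factor)*
    def term(self):
        node = self.factor()
        while self.peek() in (("OP", "*"), ("OP", "/")):
            _, op = self.next()
            node = (op, node, self.factor())
        return node

    # factor := NUM | '(' expr ')'
    def factor(self):
        kind, val = self.next()
        if kind == "NUM":
            return val
        if (kind, val) == ("OP", "("):
            node = self.expr()
            kind, val = self.next()
            if (kind, val) != ("OP", ")"):
                raise SyntaxError("expected ')', got %r" % (val,))
            return node
        raise SyntaxError("unexpected token %r" % (val,))

print(Parser(tokenize("1 + 2 * (3 - 4)")).expr())
# → ('+', 1, ('*', 2, ('-', 3, 4)))
```

A real language adds many more rules, but each one follows this same shape, which is why the job is a lot of work rather than deep difficulty.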
For statically typed languages things get harder because you first need to parse into an AST and then perform semantic analysis on it, but if all you need is syntax highlighting, you can skip the semantic analysis.
> but if all you need is syntax highlighting, you can skip over the semantic analysis.
I wish we could move toward semantic highlighting.
I will chime in with you and agree, as a writer and teacher of parsers: it doesn't have to be that hard. In fact, if you implement your parser as a PEG, it really doesn't have to be much longer than the input to a parser generator like yacc. Parser combinators strongly resemble EBNF notation; it's almost a direct translation. That's why parser generators are possible to write in the first place. But in my opinion they are wholly unnecessary, since the grammar itself is really all you need if you've designed your grammar correctly. Just by expressing the grammar you're 90% of the way to implementing it.
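As an illustration of how closely combinators can mirror EBNF, here's a tiny parser-combinator sketch in Python. All the names (`lit`, `seq`, `alt`, `many`) are my own; real combinator libraries differ in detail but follow the same pattern:

```python
# Tiny parser-combinator sketch: each combinator returns a function
# taking (text, pos) and returning (value, new_pos), or None on failure.

def lit(s):                       # match a literal string
    def p(text, pos):
        if text.startswith(s, pos):
            return s, pos + len(s)
        return None
    return p

def seq(*parsers):                # EBNF: a b c
    def p(text, pos):
        out = []
        for q in parsers:
            r = q(text, pos)
            if r is None:
                return None
            v, pos = r
            out.append(v)
        return out, pos
    return p

def alt(*parsers):                # EBNF: a | b  (PEG: ordered choice)
    def p(text, pos):
        for q in parsers:
            r = q(text, pos)
            if r is not None:
                return r
        return None
    return p

def many(q):                      # EBNF: { a }  (zero or more)
    def p(text, pos):
        out = []
        while True:
            r = q(text, pos)
            if r is None:
                return out, pos
            v, pos = r
            out.append(v)
    return p

# EBNF: digits = digit, { digit } ;  translates almost verbatim:
digit = alt(*[lit(c) for c in "0123456789"])
digits = seq(digit, many(digit))

print(digits("42x", 0))  # → (['4', ['2']], 2)
```

The grammar rule and the code are nearly the same text, which is the point: expressing the grammar gets you most of the way to the parser.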
The thing is, for IDE purposes “production ready” has a different definition. The thing shouldn’t have 100% parity with the compiler, it should be close enough to be useful, and it must be resilient. This is definitely not trivial, but is totally within the reach of a single person.
> And this doesn't even count error recovery / dealing with broken code ...
With a hand-written parser, you mostly get error resilience for free. In rust-analyzer's parser, there's very little code which explicitly deals with recovery. The trick is, during recursive descent, to just not bail on the first error.
Those are some very nice insights, thanks for sharing them! Can you recommend a good resource on writing a parser by hand that doesn't bail on the first error? Or would you instead suggest studying the source code for e.g. the rust-analyzer parser?
In that case, they're parsing a Haskell-like language and can use indentation as a guide for how far to skip ahead.
In a C-like language, I'd imagine you'd use braces or semicolons to see how far to skip ahead: the error bubbles up to a parser that knows how to recover (say, a statement or function body), which scans ahead to where it thinks its node ends and returns an error node, allowing the parent to continue.
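A minimal sketch of that recovery strategy in Python, under an invented toy grammar (a "statement" is just a letter followed by `;`): on an error, the statement parser scans ahead to a synchronizing token and emits an error node instead of raising, so the enclosing block keeps parsing.

```python
# Sketch of "don't bail on the first error": on a parse error inside a
# statement, skip ahead to a synchronizing token (';' or '}') and emit
# an error node so the parent parser can continue.

SYNC = {";", "}"}

def parse_block(tokens):
    """tokens: a list of single-character tokens. The 'grammar' here is
    hypothetical: a statement is a letter followed by ';'."""
    stmts, i = [], 0
    while i < len(tokens) and tokens[i] != "}":
        node, i = parse_stmt(tokens, i)
        stmts.append(node)
    return stmts

def parse_stmt(tokens, i):
    if tokens[i].isalpha() and i + 1 < len(tokens) and tokens[i + 1] == ";":
        return ("stmt", tokens[i]), i + 2
    # Error: don't raise. Scan forward to a sync token, wrap the skipped
    # span in an error node, and let the caller keep going.
    start = i
    while i < len(tokens) and tokens[i] not in SYNC:
        i += 1
    if i < len(tokens) and tokens[i] == ";":
        i += 1                      # consume the ';' we synced on
    return ("error", tokens[start:i]), i

print(parse_block(list("a;1+;b;")))
# → [('stmt', 'a'), ('error', ['1', '+', ';']), ('stmt', 'b')]
```

The statements before and after the broken one still parse, which is exactly what an IDE needs from resilient parsing.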