> Well, because it’s gosh-darn hard to do it the right way.
I think this overstates the difficulty. This of course depends a lot on the language, but for a reasonable one (not C++) you can just go and write the parser by hand. I’d ballpark this as three weeks, if you know how the thing is supposed to work.
> it doesn’t have to redo the whole thing on every keypress.
This is probably what makes the task seem harder than it is. Incremental parsing is nice, but not mandatory. rust-analyzer and most IntelliJ parsers re-parse the whole file on every change (IJ does incremental lexing, which is simple).
> The reason (most) LSP servers don’t offer syntax highlighting is because of the drag on performance.
I am surprised to hear that. We never had performance problems with highlighting on the server in rust-analyzer. I remember that for Emacs specifically there were client side problems with parsing LSP JSON.
> Every keystroke you type must be sent to the server, processed, a partial tree returned, and your syntax highlighting updated.
That’s not the bottleneck for syntax highlighting, typechecking is (and it’s typechecking that makes highlighting especially interesting).
In general, my perception of what’s going on with proper parsing in the industry is a bit different. I’d say status quo from five years back boils down to people just getting accustomed to the way things were done. Compiler authors generally didn’t think about syntax highlighting or completions, and editors generally didn’t want to do the parsing stuff. JetBrains were the exception, as they just did the thing. In this sense, LSP was a much-needed stimulus to just start doing things properly. People were building rich IDE experiences before LSP just fine (see dart analyzer), it’s just that relatively few languages saw it as an important problem to solve at all.
I don't think you can write a production quality parser for any "real" language in 3 weeks ... You can get something working in 3 weeks, but then you'll be adding features and fixing bugs for a year or more.
If you take something like Python or JavaScript, the basics are simple, but there are all sorts of features like splatting, decorators, a very rich function argument syntax, etc. and subtle cases to debug, like the rules of what's allows on LHS of assignment. JavaScript has embedded regexes, and now both languages have template strings, etc. It's a huge job.
It's not necessarily hard, but it takes a lot of work, and you will absolutely learn a lot of stuff about the language long after 3 weeks. I've programmed in Python for 18 years and still learned more about the language from just working with the parser, not even implementing it!
And this doesn't even count error recovery / dealing with broken code ...
I don't see what is challenging about any of what you mentioned, furthermore parsing a language is not the same thing as verifying that what is parsed is a semantically valid. Python is almost a context free language with the exception of how it handles indentation. With indentation taken into account, the entire language can be parsed directly from the following grammar using something like yacc:
JavaScript is not a strictly context free grammar either, but like Python the vast majority of it is and the parts that are not context free can be worked around. Furthermore the entire grammar is available here:
It isn't trivial to work around the parts that aren't context free, but it's also nothing insurmountable that requires more than 120 hours of effort. The document explicitly points out which grammar rules are not context free and gives an algorithm that can be used as an alternative.
Parsing is really not as challenging a job as a lot of people make it out to be and it's an interesting exercise to try yourself and get an intuitive feel for. You can use a compiler compiler (like yacc) if you feel like it to just get something up and running, but the downside of such tools is they do very poorly with error handling. Rolling out a hand written parser gives much better error messages and really is nothing that crazy. C++ is the only mainstream language I can think of that has a grammar so unbelievably complex that it would require a team of people working years to implement properly (and in fact none of the major compilers implement a proper C++ parser).
For statically typed languages things get harder because you first need to parse an AST, and then perform semantic analysis on it, but if all you need is syntax highlighting, you can skip over the semantic analysis.
> but if all you need is syntax highlighting, you can skip over the semantic analysis.
I wish we could move toward semantics highlighting.
I will chime in with you though and agree, as a writer and teacher of parsers, it doesn’t have to be that hard. In fact, if you implement your parser as a PEG, it really doesn’t have to be much longer than the input to a parser generator like YACC. Parser combinators strongly resemble the ebnf notation, it’s almost a direct translation. That’s why parser generators are possible to write in the first place. But in my opinion they are wholly unnecessary, since true grammar itself is really all you need if you’ve designed your grammar correctly. Just by expressing the grammar you’re 90% of the way to implementing it.
The thing is, for IDE purposes “production ready” has a different definition. The thing shouldn’t have 100% parity with the compiler, it should be close enough to be useful, and it must be resilient. This is definitely not trivial, but is totally within the reach of a single person.
> And this doesn't even count error recovery / dealing with broken code ...
With a hand written parser, you mostly get error resilience for free. In rust-analyzer’s parser, there’s very little code which explicitly deals with recovery. The trick, is, during recursive descent, to just not bail on the first error.
Those are some very nice insights, thanks for sharing them! Can you recommend a good resource on writing a parser by hand that doesn't bail on the first error? Or would you instead suggest studying the source code for e.g. the rust-analyzer parser?
In that case, they're parsing a haskell-like language and can use indentation as a guide for how far to skip ahead.
In a C-like language, I'd imagine you'd use braces or semicolons to see how far to skip ahead - the error bubbles up to a parser that knows how to recover, like say a statement or function body, it scans ahead to where it thinks its node ends and returns an error node, allowing the parent to continue.
> I remember that for Emacs specifically there were client side problems with parsing LSP JSON.
I am given to understand that this is not a problem any more (since Emacs 27.1). Before that, the JSON parser was written in elisp which is a slow language (though somewhat mitigated with recent native-compilation). But now Emacs has preference to just use native bindings (jansson), and afaik this had solved most of the performance grievances raised by LSP clients.
> I think this overstates the difficulty. This of course depends a lot on the language, but for a reasonable one (not C++) you can just go and write the parser by hand.
I don't agree. Newer languages are all being designed with the constraint that the grammar should be easy to parse and not require indefinite lookahead and full compilation to get back on track after an error.
That's a big change from the C/C++ heritage.
It's no coincidence that "modern" languages (call it the last 10 or so years) tend to have things like explicit variable assignment (let-statement-like) and delimiters between variable and type, for example.
> Newer languages are all being designed with the constraint that the grammar should be easy to parse
I think that says less about the difficulty of parsing and more that language designers have realised that 'easy to parse' is not incompatible with good readability and terse syntax. In fact, the two go hand in hand: languages that are easy for computers to understand are often easy for users to understand too.
This has nothing to do with old or new and everything with both C and C++ being serious aberrations in programming language design. Most languages not directly influenced by C (new or old) simply don't have these bizarre issues. Also a lot of languages are becoming significantly harder to parse as time goes on (python for example).
> Most languages not directly influenced by C (new or old) simply don't have these bizarre issues
I don't agree. Lisp is "easy" to parse, but difficult to add structure to. Tcl similarly. Typeless languages are now out of favor--everybody wants to be able to add types.
Perl is a nightmare and probably undecidable. Satan help you if you miss a period in COBOL because God sure won't. FORTRAN is famous for it's DO LOOP construct that would hose you.
About the only language that wasn't hot garbage to parse was Pascal. And I seem to recall that was intentional.
> I don't agree. Lisp is "easy" to parse, but difficult to add structure to.
I have no idea what you mean by this, or how you think it relates to your original claim that having languages with a less terrible grammar than C++ or even C is some recent development.
> Perl is a nightmare
And it's pretty clearly C-inspired, even if it added lots of new syntactic horrors of its own invention. Also, it's late 80ies not early 70ies, so hardly a poster-case for languages becoming grammatically saner.
> About the only language that wasn't hot garbage to parse was Pascal.
In addition to Pascal and Lisp which you already mentioned Algol, Prolog, APL, Smalltalk are all famous languages from around the same time as C or significantly older and none of them are "hot garbage to parse". Neither are important 80ies languages like Postscript or SML. In fact the only significant extant 70s language I can think of from the top of my head that is syntactically noticeably more deranged than C and maybe even C++ is TeX.
> And I seem to recall that was intentional.
Well yes, why would anyone create a language that's really hard to parse for no discernible benefit? This is not the counterintuitive recent insight you make it out to be. If anything, the trend for popular languages would seem to become harder to parse -- none of the significant languages from the 2000s (like Swift, Julia or Rust) are anywhere as easy to parse as the languages I listed above.
Readers, please don't accept anything anyone writes about "FORTRAN", unless in a historical context. They probably last encountered the leading edge of the language 40 years ago.
Python is not typeless, it is strongly typed. Each value has one, precisely known type. Names may refer to values of different types, which is the "dynamic" part of pythons typing.
Javascript is weakly typed and most of its mayhem comes from there.
This is me being obtuse, but it seems like an appropriate time to ask... What is the difference? You mention that each value has one known type in a strongly typed language. Isn't this the case for Javascript as well? I'm having a difficult time trying to conjure a situation in JS where a value has multiple types (but I'm certainly no expert in JS).
It's a bit of a mixed bag and the terminology is difficult to grasp. I'd say Tcl and Bash are languages that only have strings ('stringly typed') that can be interpreted in a number of ways. JavaScript, PHP, and SQLite's SQL OTOH have lots of implicit type coercion---`1` and `'1'` can both act as a string, a number, or a boolean.
Python is considerably more picky in what it allows in which constructs; it does all the numerical typing stuff implicitly (so switching between integer, big int, and float happens most of the time without users even knowing about it), and b/c of backwards compatibility, coercions between numbers and booleans still happen (`True + 1` is still `2` in Python3.9). By extension, this includes empty lists, strings, and dictionaries evaluating to `False` in appropriate contexts.
I believe that in retrospect most of these efforts—coming up with a Just Works arrangement of strategically placed implicit coercions—that so much defined the rise of scripting languages in the 90s are questionable. Subsequently, many millions of USD went into making JavaScript faster than reasonable (on V8) giving its pervasive dynamic nature. Just bolting down types and banning any implicit coercions would have given it similar performance gains with a fraction of effort and a fraction of the resulting complexity. Backward compatibility could have been done with a pragma similar to the existing `'use strict'`.
I guess what I want to say is that strong and loose typing exists on a continuum much like human languages are never just of a single idealized kind.
I had an interesting experience making a simple "Wiki" style editor for a web app back around 2008 or so. To my surprise, even an ANTLR-generated JavaScript parser could easily handle keystroke-by-keystroke parsing and fully updating the entire DOM in real time, up to about 20-30KB of text. After 60KB the performance would drop visibly, but it was still acceptable.
A hand-tuned Rust parser on a 2021 machine? I can imagine it handling hundreds of kilobytes without major issues.
Still, there's some "performance tuning itch" that this doesn't quite scratch. I can't get past the notion that this kind of things ought to be done incrementally, even when the practical evidence says that it's not worth it.
> This is probably what makes the task seem harder than it is. Incremental parsing is nice, but not mandatory. rust-analyzer and most IntelliJ parsers re-parse the whole file on every change (IJ does incremental lexing, which is simple).
Glances at the memory usage of Goland in a moderately sized project and weeps
> I think this overstates the difficulty. This of course depends a lot on the language, but for a reasonable one (not C++) you can just go and write the parser by hand. I’d ballpark this as three weeks, if you know how the thing is supposed to work.
Having a parser which generates an AST is just the first step. Then, you actually need to implement all the rules of the language, so for instance the scoping rules, how the object system works, any other built-in compound/aggregate types, other constructs like decorators, generics, namespaces or module systems, and on and on and on. Depending on the language, this will usually be the main work.
And then of course there's dynamic typing - if you want to enable smart completions for a dynamically typed language, you need to implement some kind of type deduction. This alone can take a lot of time to implement.
If you want syntax highlighting, the AST is enough to generate pretty colours for the source code. If you want semantic highlighting… sure, that's another story entirely. And even then you don't necessarily have to do as much work as the compiler itself.
And don't even try to be smart with dynamically typed languages, it cannot possibly be reliable, short of actually executing the program. If your program are short enough you won't need it, and if you do need such static analysis… consider switching to a statically typed language instead.
Yeah, to clarify, memoization happens after parsing. So for syntax highlighting we have a situation where from-scratch parsing is faster than incremental typechecking.
I was under the impression that rust-analyzer (and more generally LSP) provides augmentative (contextual) syntax highlighting, whereas most of the highlighting still comes from editor-specific configuration. Is this not the case? If so I would be thrilled; as someone authoring a custom language right now it has been very frustrating to not be able to provide a single source of syntax highlighting for all popular editors.
rust-analyzer highlights everything, I have an empty vs code theme for it somewhere. But yeah, in general LSP highlighting is specified in augmentative way.
I think this overstates the difficulty. This of course depends a lot on the language, but for a reasonable one (not C++) you can just go and write the parser by hand. I’d ballpark this as three weeks, if you know how the thing is supposed to work.
> it doesn’t have to redo the whole thing on every keypress.
This is probably what makes the task seem harder than it is. Incremental parsing is nice, but not mandatory. rust-analyzer and most IntelliJ parsers re-parse the whole file on every change (IJ does incremental lexing, which is simple).
> The reason (most) LSP servers don’t offer syntax highlighting is because of the drag on performance.
I am surprised to hear that. We never had performance problems with highlighting on the server in rust-analyzer. I remember that for Emacs specifically there were client side problems with parsing LSP JSON.
> Every keystroke you type must be sent to the server, processed, a partial tree returned, and your syntax highlighting updated.
That’s not the bottleneck for syntax highlighting, typechecking is (and it’s typechecking that makes highlighting especially interesting).
In general, my perception of what’s going on with proper parsing in the industry is a bit different. I’d say status quo from five years back boils down to people just getting accustomed to the way things were done. Compiler authors generally didn’t think about syntax highlighting or completions, and editors generally didn’t want to do the parsing stuff. JetBrains were the exception, as they just did the thing. In this sense, LSP was a much-needed stimulus to just start doing things properly. People were building rich IDE experiences before LSP just fine (see dart analyzer), it’s just that relatively few languages saw it as an important problem to solve at all.