Tree Sitter and the Complications of Parsing Languages (masteringemacs.org)
225 points by podiki on Nov 24, 2021 | 134 comments



> Well, because it’s gosh-darn hard to do it the right way.

I think this overstates the difficulty. This of course depends a lot on the language, but for a reasonable one (not C++) you can just go and write the parser by hand. I’d ballpark this as three weeks, if you know how the thing is supposed to work.

> it doesn’t have to redo the whole thing on every keypress.

This is probably what makes the task seem harder than it is. Incremental parsing is nice, but not mandatory. rust-analyzer and most IntelliJ parsers re-parse the whole file on every change (IJ does incremental lexing, which is simple).

> The reason (most) LSP servers don’t offer syntax highlighting is because of the drag on performance.

I am surprised to hear that. We never had performance problems with highlighting on the server in rust-analyzer. I remember that for Emacs specifically there were client side problems with parsing LSP JSON.

> Every keystroke you type must be sent to the server, processed, a partial tree returned, and your syntax highlighting updated.

That’s not the bottleneck for syntax highlighting, typechecking is (and it’s typechecking that makes highlighting especially interesting).

In general, my perception of what’s going on with proper parsing in the industry is a bit different. I’d say status quo from five years back boils down to people just getting accustomed to the way things were done. Compiler authors generally didn’t think about syntax highlighting or completions, and editors generally didn’t want to do the parsing stuff. JetBrains were the exception, as they just did the thing. In this sense, LSP was a much-needed stimulus to just start doing things properly. People were building rich IDE experiences before LSP just fine (see dart analyzer), it’s just that relatively few languages saw it as an important problem to solve at all.


I don't think you can write a production quality parser for any "real" language in 3 weeks ... You can get something working in 3 weeks, but then you'll be adding features and fixing bugs for a year or more.

If you take something like Python or JavaScript, the basics are simple, but there are all sorts of features like splatting, decorators, a very rich function argument syntax, etc., and subtle cases to debug, like the rules of what's allowed on the LHS of an assignment. JavaScript has embedded regexes, and now both languages have template strings, etc. It's a huge job.

It's not necessarily hard, but it takes a lot of work, and you will absolutely learn a lot of stuff about the language long after 3 weeks. I've programmed in Python for 18 years and still learned more about the language from just working with the parser, not even implementing it!

And this doesn't even count error recovery / dealing with broken code ...


I don't see what is challenging about any of what you mentioned. Furthermore, parsing a language is not the same thing as verifying that what is parsed is semantically valid. Python is almost a context-free language, with the exception of how it handles indentation. With indentation taken into account, the entire language can be parsed directly from the following grammar using something like yacc:

https://docs.python.org/3/reference/grammar.html
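
To make the indentation wrinkle concrete: the usual trick is a pre-lexing pass that turns changes in leading whitespace into synthetic INDENT/DEDENT tokens, after which an ordinary context-free parser takes over. A toy sketch of that pass (this is not CPython's tokenizer):

    // Turn leading whitespace into INDENT/DEDENT tokens, tracked on a stack.
    fn indent_tokens(source: &str) -> Vec<String> {
        let mut levels = vec![0]; // currently open indentation widths
        let mut tokens = Vec::new();
        for line in source.lines().filter(|l| !l.trim().is_empty()) {
            let width = line.len() - line.trim_start().len();
            if width > *levels.last().unwrap() {
                levels.push(width);
                tokens.push("INDENT".to_string());
            }
            while width < *levels.last().unwrap() {
                levels.pop();
                tokens.push("DEDENT".to_string());
            }
            tokens.push(format!("LINE({})", line.trim()));
        }
        while levels.len() > 1 { // close whatever is still open at EOF
            levels.pop();
            tokens.push("DEDENT".to_string());
        }
        tokens
    }
A real tokenizer also has to reject inconsistent dedents and suspend indentation tracking inside brackets and after line continuations, which is where the remaining fiddliness lives.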

JavaScript's grammar is not strictly context free either, but like Python's, the vast majority of it is, and the parts that are not context free can be worked around. Furthermore, the entire grammar is available here:

https://262.ecma-international.org/5.1/#sec-A

It isn't trivial to work around the parts that aren't context free, but it's also nothing insurmountable that requires more than 120 hours of effort. The document explicitly points out which grammar rules are not context free and gives an algorithm that can be used as an alternative.

Parsing is really not as challenging a job as a lot of people make it out to be, and it's an interesting exercise to try yourself to get an intuitive feel for it. You can use a compiler compiler (like yacc) if you feel like it to just get something up and running, but the downside of such tools is they do very poorly with error handling. Rolling your own hand-written parser gives much better error messages and really is nothing that crazy. C++ is the only mainstream language I can think of that has a grammar so unbelievably complex that it would require a team of people working for years to implement properly (and in fact none of the major compilers implement a proper C++ parser).

For statically typed languages things get harder because you first need to parse an AST, and then perform semantic analysis on it, but if all you need is syntax highlighting, you can skip over the semantic analysis.


> but if all you need is syntax highlighting, you can skip over the semantic analysis.

I wish we could move toward semantics highlighting.

I will chime in with you though and agree, as a writer and teacher of parsers, that it doesn't have to be that hard. In fact, if you implement your parser as a PEG, it really doesn't have to be much longer than the input to a parser generator like YACC. Parser combinators strongly resemble EBNF notation; it's almost a direct translation. That's why parser generators are possible to write in the first place. But in my opinion they are wholly unnecessary, since the grammar itself is really all you need if you've designed your grammar correctly. Just by expressing the grammar you're 90% of the way to implementing it.
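
To illustrate how direct the translation is, here's a hand-rolled sketch (toy grammar, made-up names) where the EBNF rule `list = "[" number { "," number } "]"` maps onto the code almost symbol for symbol:

    // Each little parser consumes a prefix of the input and returns the rest.
    fn lit<'a>(input: &'a str, expected: &str) -> Option<&'a str> {
        input.strip_prefix(expected)
    }

    // number = digit { digit }
    fn number(input: &str) -> Option<&str> {
        let rest = input.trim_start_matches(|c: char| c.is_ascii_digit());
        if rest.len() < input.len() { Some(rest) } else { None }
    }

    // list = "[" number { "," number } "]"
    fn list(input: &str) -> Option<&str> {
        let mut rest = number(lit(input, "[")?)?;
        while let Some(after_comma) = lit(rest, ",") {
            rest = number(after_comma)?;
        }
        lit(rest, "]")
    }
`list("[1,22,3]")` returns `Some("")`, and any syntax error surfaces as `None` from exactly the rule that failed.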


The thing is, for IDE purposes “production ready” has a different definition. The thing shouldn’t have 100% parity with the compiler, it should be close enough to be useful, and it must be resilient. This is definitely not trivial, but is totally within the reach of a single person.

> And this doesn't even count error recovery / dealing with broken code ...

With a hand-written parser, you mostly get error resilience for free. In rust-analyzer's parser, there's very little code which explicitly deals with recovery. The trick is, during recursive descent, to just not bail on the first error.
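
A sketch of the shape that takes, to make it concrete (toy code, not rust-analyzer's actual parser):

    #[derive(Clone, Copy, PartialEq)]
    enum Token { FnKw, Junk, Eof }

    enum Item { Fn, Error }

    struct Parser { tokens: Vec<Token>, pos: usize, errors: Vec<String> }

    impl Parser {
        fn peek(&self) -> Token { *self.tokens.get(self.pos).unwrap_or(&Token::Eof) }
        fn bump(&mut self) { self.pos += 1; }
        fn error(&mut self, msg: &str) { self.errors.push(msg.to_string()); }
    }

    // Parse a file as a flat list of items; an unexpected token becomes an
    // Error node instead of aborting, so one typo doesn't blank the tree.
    fn parse_items(p: &mut Parser) -> Vec<Item> {
        let mut items = Vec::new();
        while p.peek() != Token::Eof {
            match p.peek() {
                Token::FnKw => { p.bump(); items.push(Item::Fn); } // body elided
                _ => {
                    p.error("expected an item");
                    p.bump(); // consume one token and try again
                    items.push(Item::Error);
                }
            }
        }
        items
    }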


Those are some very nice insights, thanks for sharing them! Can you recommend a good resource on writing a parser by hand that doesn't bail on the first error? Or would you instead suggest studying the source code for e.g. the rust-analyzer parser?


I can't answer this well and don't know of any resources, but I have seen it before in the parser for sixten:

    https://github.com/ollef/sixten/blob/60d46eee20abd62599badea85774a9365c81af45/src/Frontend/Parse.hs#L458
In that case, they're parsing a Haskell-like language and can use indentation as a guide for how far to skip ahead.

In a C-like language, I'd imagine you'd use braces or semicolons to see how far to skip ahead: the error bubbles up to a parser that knows how to recover, like say a statement or function body; it scans ahead to where it thinks its node ends and returns an error node, allowing the parent to continue.
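
In token terms the scan-ahead step can be tiny. A sketch, with chars standing in for tokens:

    // Skip forward from a parse error to a likely recovery point: just past
    // the next ';', or at a '}' (which the enclosing block will consume).
    fn sync_point(tokens: &[char], error_at: usize) -> usize {
        let mut i = error_at;
        while i < tokens.len() {
            match tokens[i] {
                ';' => return i + 1, // resume just past the statement end
                '}' => return i,     // leave the brace for the parent parser
                _ => i += 1,
            }
        }
        i // end of file: nowhere better to recover to
    }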


I don’t have a good resource, but https://m.youtube.com/watch?v=0HlrqwLjCxA is probably an OK start.


Thanks a lot, I'll give it a watch!

Also, thanks dunham for the sixten suggestion!


> I remember that for Emacs specifically there were client side problems with parsing LSP JSON.

I am given to understand that this is not a problem any more (since Emacs 27.1). Before that, the JSON parser was written in elisp, which is a slow language (though somewhat mitigated by recent native-compilation). But now Emacs prefers to use native bindings (jansson), and afaik this has solved most of the performance grievances raised by LSP clients.


> I think this overstates the difficulty. This of course depends a lot on the language, but for a reasonable one (not C++) you can just go and write the parser by hand.

I don't agree. Newer languages are all being designed with the constraint that the grammar should be easy to parse and not require indefinite lookahead and full compilation to get back on track after an error.

That's a big change from the C/C++ heritage.

It's no coincidence that "modern" languages (call it the last 10 or so years) tend to have things like explicit variable declarations (let-statement-like) and delimiters between variable and type, for example.


> Newer languages are all being designed with the constraint that the grammar should be easy to parse

I think that says less about the difficulty of parsing and more that language designers have realised that 'easy to parse' is not incompatible with good readability and terse syntax. In fact, the two go hand in hand: languages that are easy for computers to understand are often easy for users to understand too.


This has nothing to do with old or new and everything with both C and C++ being serious aberrations in programming language design. Most languages not directly influenced by C (new or old) simply don't have these bizarre issues. Also a lot of languages are becoming significantly harder to parse as time goes on (python for example).


> Most languages not directly influenced by C (new or old) simply don't have these bizarre issues

I don't agree. Lisp is "easy" to parse, but difficult to add structure to. Tcl similarly. Typeless languages are now out of favor--everybody wants to be able to add types.

Perl is a nightmare and probably undecidable. Satan help you if you miss a period in COBOL because God sure won't. FORTRAN is famous for its DO loop construct that would hose you.

About the only language that wasn't hot garbage to parse was Pascal. And I seem to recall that was intentional.


> I don't agree. Lisp is "easy" to parse, but difficult to add structure to.

I have no idea what you mean by this, or how you think it relates to your original claim that having languages with a less terrible grammar than C++ or even C is some recent development.

> Perl is a nightmare

And it's pretty clearly C-inspired, even if it added lots of new syntactic horrors of its own invention. Also, it's late '80s, not early '70s, so hardly a poster case for languages becoming grammatically saner.

> About the only language that wasn't hot garbage to parse was Pascal.

In addition to Pascal and Lisp, which you already mentioned, Algol, Prolog, APL and Smalltalk are all famous languages from around the same time as C or significantly older, and none of them are "hot garbage to parse". Neither are important '80s languages like PostScript or SML. In fact, the only significant extant '70s language I can think of off the top of my head that is syntactically noticeably more deranged than C, and maybe even C++, is TeX.

> And I seem to recall that was intentional.

Well yes, why would anyone create a language that's really hard to parse for no discernible benefit? This is not the counterintuitive recent insight you make it out to be. If anything, the trend would seem to be for popular languages to become harder to parse -- none of the significant languages of the 2010s (like Swift, Julia or Rust) are anywhere near as easy to parse as the languages I listed above.


Readers, please don't accept anything anyone writes about "FORTRAN", unless in a historical context. They probably last encountered the leading edge of the language 40 years ago.


> Typeless languages are now out of favor

Javascript, Python, ... aren't they THE two most popular languages?


And what are the two biggest things about those languages?

Python recently added gradual typing due to overwhelming pressure. And everybody is using Typescript instead of Javascript.


Python is not typeless, it is strongly typed. Each value has one, precisely known type. Names may refer to values of different types, which is the "dynamic" part of Python's typing.

Javascript is weakly typed and most of its mayhem comes from there.


This is me being obtuse, but it seems like an appropriate time to ask... What is the difference? You mention that each value has one known type in a strongly typed language. Isn't this the case for Javascript as well? I'm having a difficult time trying to conjure a situation in JS where a value has multiple types (but I'm certainly no expert in JS).


It's a bit of a mixed bag and the terminology is difficult to grasp. I'd say Tcl and Bash are languages that only have strings ('stringly typed') that can be interpreted in a number of ways. JavaScript, PHP, and SQLite's SQL OTOH have lots of implicit type coercion---`1` and `'1'` can both act as a string, a number, or a boolean.

Python is considerably more picky in what it allows in which constructs; it does all the numerical typing stuff implicitly (so switching between integer, big int, and float happens most of the time without users even knowing about it), and b/c of backwards compatibility, coercions between numbers and booleans still happen (`True + 1` is still `2` in Python 3.9). By extension, this includes empty lists, strings, and dictionaries evaluating to `False` in appropriate contexts.

I believe that in retrospect most of these efforts—coming up with the Just Works arrangement of strategically placed implicit coercions that so much defined the rise of scripting languages in the 90s—are questionable. Subsequently, many millions of USD went into making JavaScript faster than reasonable (on V8) given its pervasive dynamic nature. Just bolting down types and banning all implicit coercions would have given it similar performance gains with a fraction of the effort and a fraction of the resulting complexity. Backward compatibility could have been handled with a pragma similar to the existing `'use strict'`.

I guess what I want to say is that strong and loose typing exists on a continuum much like human languages are never just of a single idealized kind.


JavaScript will dynamically change the effective type of a value like a string or number into another depending on the operation being performed on it:

    '2' * 3  // =>  6
    '2' + 3  // => '23'


The context was parsing, not semantics. "Typeless" meant "lacking type annotations", not directly to do with static/dynamic or weak/strong typing.

(Though Python does have optional type annotations these days.)


I had an interesting experience making a simple "Wiki" style editor for a web app back around 2008 or so. To my surprise, even an ANTLR-generated JavaScript parser could easily handle keystroke-by-keystroke parsing and fully updating the entire DOM in real time, up to about 20-30KB of text. After 60KB the performance would drop visibly, but it was still acceptable.

A hand-tuned Rust parser on a 2021 machine? I can imagine it handling hundreds of kilobytes without major issues.

Still, there's some "performance tuning itch" that this doesn't quite scratch. I can't get past the notion that this kind of thing ought to be done incrementally, even when the practical evidence says that it's not worth it.


> This is probably what makes the task seem harder than it is. Incremental parsing is nice, but not mandatory. rust-analyzer and most IntelliJ parsers re-parse the whole file on every change (IJ does incremental lexing, which is simple).

Glances at the memory usage of Goland in a moderately sized project and weeps


> I think this overstates the difficulty. This of course depends a lot on the language, but for a reasonable one (not C++) you can just go and write the parser by hand. I’d ballpark this as three weeks, if you know how the thing is supposed to work.

Having a parser which generates an AST is just the first step. Then, you actually need to implement all the rules of the language, so for instance the scoping rules, how the object system works, any other built-in compound/aggregate types, other constructs like decorators, generics, namespaces or module systems, and on and on and on. Depending on the language, this will usually be the main work.

And then of course there's dynamic typing - if you want to enable smart completions for a dynamically typed language, you need to implement some kind of type deduction. This alone can take a lot of time to implement.
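
To give a feel for the simplest possible version of that, a toy sketch (a real analyzer additionally has to deal with control flow, reassignment, and unknown imports):

    #[derive(Clone, Copy, Debug, PartialEq)]
    enum Type { Int, Str, Unknown }

    enum Expr {
        IntLit(i64),
        StrLit(String),
        Add(Box<Expr>, Box<Expr>),
    }

    // Best-effort inference: return Unknown instead of failing, because an
    // IDE has to keep answering queries even on code that doesn't typecheck.
    fn infer(expr: &Expr) -> Type {
        match expr {
            Expr::IntLit(_) => Type::Int,
            Expr::StrLit(_) => Type::Str,
            Expr::Add(lhs, rhs) => match (infer(lhs), infer(rhs)) {
                (Type::Int, Type::Int) => Type::Int,
                (Type::Str, Type::Str) => Type::Str, // '+' as concatenation
                _ => Type::Unknown,
            },
        }
    }
Smart completion then reduces to asking what the inferred type offers.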


If you want syntax highlighting, the AST is enough to generate pretty colours for the source code. If you want semantic highlighting… sure, that's another story entirely. And even then you don't necessarily have to do as much work as the compiler itself.
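
Concretely, whether you walk tokens or AST nodes, purely syntactic highlighting boils down to a kind-to-style table. A sketch with made-up names:

    enum TokenKind { Keyword, Ident, StringLit, Comment, Punct }

    // A highlighted region of the buffer: byte range plus a display style.
    struct Highlight { start: usize, end: usize, style: &'static str }

    fn style_for(kind: &TokenKind) -> &'static str {
        match kind {
            TokenKind::Keyword => "bold",
            TokenKind::StringLit => "green",
            TokenKind::Comment => "italic gray",
            TokenKind::Ident | TokenKind::Punct => "default",
        }
    }

    // No type information needed: one pass over the lexed spans suffices.
    fn highlight(tokens: &[(usize, usize, TokenKind)]) -> Vec<Highlight> {
        tokens.iter()
            .map(|(s, e, k)| Highlight { start: *s, end: *e, style: style_for(k) })
            .collect()
    }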

And don't even try to be smart with dynamically typed languages, it cannot possibly be reliable, short of actually executing the program. If your programs are short enough you won't need it, and if you do need such static analysis… consider switching to a statically typed language instead.


Rust-analyzer uses Salsa, an incremental computation library that uses memoization.

https://github.com/rust-analyzer/rust-analyzer/blob/master/d...


Interesting choice to reply to matklad, one of rust-analyzer's primary authors, to explain how it works.


Nah, it’s a totally valid question: I indeed didn’t clarify where incrementality starts to happen.


I slapped myself in the face.


Yeah, to clarify, memoization happens after parsing. So for syntax highlighting we have a situation where from-scratch parsing is faster than incremental typechecking.
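
Reduced to a sketch, the arrangement looks something like this (salsa itself does far more, e.g. dependency tracking and early cutoff):

    use std::collections::HashMap;

    // Parsing reruns from scratch on every edit; the expensive derived
    // queries are memoized per (file, revision) instead.
    struct Analysis {
        types: HashMap<(String, u64), String>,
    }

    impl Analysis {
        fn type_info(&mut self, file: &str, text: &str, revision: u64) -> &String {
            self.types
                .entry((file.to_string(), revision))
                .or_insert_with(|| {
                    let _tree = parse(text); // cheap enough to redo every time
                    typecheck(text)          // this is the part worth caching
                })
        }
    }

    fn parse(_text: &str) -> Vec<()> { Vec::new() }              // stand-in
    fn typecheck(_text: &str) -> String { "types...".into() }    // stand-in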


Thanks for explaining and making rust-analyzer!


I was under the impression that rust-analyzer (and more generally LSP) provides augmentative (contextual) syntax highlighting, whereas most of the highlighting still comes from editor-specific configuration. Is this not the case? If so I would be thrilled; as someone authoring a custom language right now it has been very frustrating to not be able to provide a single source of syntax highlighting for all popular editors.


rust-analyzer highlights everything, I have an empty VS Code theme for it somewhere. But yeah, in general LSP highlighting is specified in an augmentative way.


Before this conversation gets sidetracked by talk about language servers: as the article points out, tree sitter tends to need to be a bit closer to the environment to be effective.

There’s still work to do, but having tree sitter in neovim feels like a great step forward.


Yes, it's more for syntax highlighting where you don't want the lag of an external server and don't need the deep language analysis needed for diagnostics, refactoring, etc. I'm not sure what other use cases it would be superior to LSP for, but I'm sure there are some.


It can also be used for text editing, e.g. changing, deleting, swapping of function arguments or any other text object defined by the language syntax.


Cursor movement as well.


Author here. Yes, both are very useful, with some overlap of purpose, but work well together.


Thanks for the article. Even though I'm not active on the development side in either editor, I love the idea that people are toiling away on these same sorts of enhancements in both environments (and I get the benefit in neovim).


> Semantic Bovinator

Heh. A long time ago I wrote a video game[1] somewhat similar to Williams Defender, and casting about for some sort of "theme" for the game, I hit upon the "editor wars", the ancient storied battle between vi and emacs. You are ostensibly "vi", (a little spaceship vaguely reminiscent of the Vipers from Battlestar Galactica) cruising through system memory, evading system processes, GDB instances, etc trying to recover your ".swp" files. How to represent Emacs? Obviously, via a giant blimp! and I could display all sorts of messages on the side of the blimp, singing the praises of Emacs, and disparaging fans of vi. And the Emacs blimp had a "memory leak", which meant that pieces of the xemacs source code would literally leak out of the back end of the blimp, with the letters floating lazily away, like smoke. So that meant I had to take a look at the xemacs source, dig through it and try to find some funny bits to put in. Of course, "semantic bovinate" jumped out at me.[2]

[1] https://github.com/smcameron/wordwarvi [2] https://github.com/smcameron/wordwarvi/blob/master/wordwarvi...


That is gorgeous! Thanks for sharing!


Check out the project page here: https://tree-sitter.github.io/tree-sitter/

Quite a lot of languages are already supported, it's really nice to see. I might have a use for such a library for a personal project :)

You can play around with the playground here: https://tree-sitter.github.io/tree-sitter/playground


I suppose that these days I am one of the few professional programmers who has an active dislike of syntax highlighting. I find it immensely distracting. The only stuff I allow the highlighter to touch are my comments (I turn them bold) and I consider this a somewhat frivolous indulgence.

(I appreciate the complexity of the problem, btw)


To each their own, and fortunately most (all?) editors allow such features to be turned off.

On the other hand, I find the "frivolous indulgence" perspective extremely obnoxious along with the related implication of moral or technical superiority of not using syntax highlighting.

As a side note, the way it helps many people who prefer it has some fascinating cog-psych underpinnings: https://en.wikipedia.org/wiki/Visual_search

Sometimes I wonder if those who don't prefer it might have some synesthesia which might allow their brain hardware to provide what the syntax highlighting does for the rest of us.


Yes it's pure egotism. "I don't need silly colours to code, not like those noobs."

You get the same attitude a lot for things like autocomplete and even for static typing.


I guess it turns out actively depriving yourself of relevant information at a glance isn’t that popular, no.


While I'm aware that OP and I are in a minority - there is a cognitive overhead to having that information surfaced at all times when you may be trying to focus on something at e.g. the method level rather than the individual syntactic element level, and if that cognitive overhead exceeds the utility of having that information available, the sensible answer is to turn the highlighting off.

If I could have some sort of focus follows mind where highlighting automatically happens commensurate to what level of granularity I'm currently thinking about the code at I would be extremely interested, but absent "focus follows mind" it's a trade-off that everybody has to make for themselves.

Some people prefer to highlight almost everything, some almost nothing, some people find it helpful for some languages/tasks but not for others.

It's similar IME to the extent to which preferred debugging styles (printf versus interactive versus hybrid versus situational choices) are also something people have to figure out, and, well, different people are different, and that's neither a bad thing nor an avoidable thing.


I wonder if perhaps this is also a generational thing. Programmers from before syntax highlighting became popular would be less likely to prefer it, no? I’m not even sure if programmers from current/recent generations ever prefer not to have syntax highlighting, but I’m genuinely curious if there are such people out there.


I've known people who've started without synhi but now wouldn't leave home without it, and people who started with rich synhi everywhere and now avoid it except for the first couple months of learning a new language, so it's certainly not just generational.

I would not at all be surprised if people who started off one way or the other (for generational reasons or indeed any 'whatever environment they were first introduced to' style reasons) are less likely to end up switching, but that probably says more about perceived switching costs than about what would be most comfortable for somebody.

e.g. I know people who took a month to be comfortable without synhi but then loved that, and I've spent weeks trying to be more comfortable -with- and given up, and honestly anything that half screws your productivity for over a week is going to be a hard sell even if the end result -would- be better (waves in "also, still can't manage to drive emacs" ;)


I grew up with fully fledged syntax highlighting, but I still prefer to use really minimal themes as I find them to reduce cognitive overhead and eye strain.

It was hard for the first few hours, but then I eventually got used to it, and now I can't use anything else.

I know this is not quite as extreme as working without syntax highlighting :)


I started before syntax highlighting became widely used (in the Emacs 18 era), but was super-excited for syntax highlighting when it became available (Emacs 19 and XEmacs), and probably went overboard with it. These days, I prefer minimal syntax highlighting.


Not sure which editor I'm thinking about, but I do remember the exact feature you're describing implemented in one I used a while ago.

Ie, the paragraph (or block of code) your cursor is focused on is visible, the rest of the code is blurred out.


I recall that as well, my guess would be Sublime Text. Ah, it might have been this plugin for Atom, I was using that for a while:

https://github.com/davidleghorn/atom-focus-mode

Edited to add that I found this for VS Code which I might try:

https://marketplace.visualstudio.com/items?itemName=imagio.v...


I think I encountered this feature in an editor made specifically for markdown and markdown only.

Although I can definitely imagine this being a plugin in different editors


The IDE could temporarily turn off all syntax highlighting outside the node that has the cursor.


Where the cursor is and where my mental focus is don't necessarily match, and it's really the granularity problem - expression level versus statement versus block versus etc. that causes 'mismatch between highlighting and focus' for me at least.

It certainly sounds like an experiment that would be interesting to try, though.


The way I see it, most syntax highlighting is actively adding mostly irrelevant information to the cognitive load of programming: stuff that should be obvious if you know the language. It does as little for my understanding as a novel in which, say, every proper noun was printed in red.

I can imagine more useful highlighting than color coding the types of the symbols encountered. Lighting up the active scopes. Giving the same hue to names that look like each other. There are probably highlighters out there that do that. But "simple" syntax highlighting is still the norm.


The same argument could be made for seeing anything in colour:

> most colours are actively adding irrelevant information to the cognitive load of existing. It should be obvious that apples are red and the sky is blue.

That’s silly, because it does add relevant information. Obviously it’s a spectrum - too many colours can hide information, but when used appropriately it’s fine.

Also everyone is different. Perhaps your brain gets distracted by the colours more than the majority of people.


> Lighting up the active scopes

As you had guessed a little later, there are a few different emacs packages that do this. One of them is "rainbow parentheses" that gives every bracket a different colour (remember that emacs supports lisp, so differentiating between lots of different parentheses is arguably more useful in emacs than any other editor). [0].

Another one is highlight parentheses [1] which highlights all parens that enclose the cursor position, and gives a darker colour to those "further away" from the cursor.

[0] https://github.com/Fanael/rainbow-delimiters

[1] https://sr.ht/~tsdh/highlight-parentheses.el/


> Giving the same hue to names that look like each other.

Emacs' 'Rainbow Identifiers' does that. I like it.

    https://github.com/Fanael/rainbow-identifiers


Define relevant


While I don't fully disable syntax highlighting, I use a minimal theme [0,1] that only has highlighting for comments, strings and globals. It reduces eye strain for me, and I never find myself relying on highlighting to navigate through code.

LSPs provide an "outline" which can be very useful to navigate through code. I find "jump to symbol" function in my text editor to be faster than scanning all of the code to find the line.

Also most themes dim the comments, but IMO if something in the code needed an explanation, it should be brighter, not dimmer.

[0]: https://github.com/tonsky/sublime-scheme-alabaster

[1]: https://github.com/gargakshit/vscode-theme-alabaster-dark


> Also most themes dim the comments, but IMO if something in the code needed an explanation, it should be brighter, not dimmer.

That makes me crazy! I use base2tone, which is not nearly as minimal as your theme but more than most, and I modify the comments to be bright.


Syntax highlighting is pretty redundant. Some interesting alternative uses of colour information are given at https://buttondown.email/hillelwayne/archive/syntax-highligh... (e.g. colouring different scopes, or different imports)

I also like the idea of using colour to distinguish different identifiers, e.g. https://wordsandbuttons.online/lexical_differential_highligh...

https://medium.com/@evnbr/coding-in-color-3a6db2743a1e

https://zwabel.wordpress.com/2009/01/08/c-ide-evolution-from...


A few years ago I switched my color theme to something very simple, just as an experiment.

Somehow I never found a need to change that. I highlight comments, keywords, and strings. Comment and string highlights are helpful if they contain code-like text, to make them obviously not-code. Keywords give some structure to the text.

Everything else is frivolous to me. Books do not highlight verbs in green, either.


> Books do not highlight verbs in green, either.

While I will not argue with your general point -- I also don't really need highlighting and I read a lot of plaintext code -- I wonder about this.

Would this make languages easier for non-native speakers? Would it improve comprehension?

It's funny that the industry spends so much time on syntax highlighting for programming languages, when humanity's written languages are arguably more complex and difficult to parse and master.


> Would this make languages easier for non-native speakers? Would it improve comprehension?

When I've been trying to learn languages, I can typically part-of-speech tag unknown words quite easily (common prefixes/suffixes/word length/sentence position give lots of information – and some of this is shared across languages as well). The comprehension difficulty is nearly always due to content words I haven't seen before (or have forgotten).


I think the point is that books do highlight things: Headlines, italics, Capitalization. Just not silly technicalities like parts of speech.


Not to bikeshed on this, but I have a pretty strong preference for minimalist syntax highlighting. I'm currently using tao-themes in emacs: matching light and dark themes that are grayscale or sepiatoned, and mostly use character properties like bold or italic along with a few shades of gray. Much more calming than the usual "angry fruit salad on black" programmer themes, but also providing more intuitive information than no syntax highlighting.


Thanks for the recommendation! I've been on the lookout for a good monochrome theme, this looks great to me w/ boxes off


I feel the same way. I never understood what the point of highlighting certain keywords, or whether something is a type or a function, would be; it's all obvious from the grammar and where things are positioned anyway. And when I read code I want to read all of it, not draw any particular attention to "if" or "else".


Keyword highlighting is explicitly called out as an antipattern in the book Human Factors and Typography for More Readable Programs, which I highly recommend.


I would normally respond that, as others have pointed out, you're basically saying you prefer to "hide" information that, to most people, is relevant (is this a keyword, a global or local variable, a type, a method, a static function...). But I've noticed that when I'm doing code review, using the shitty BitBucket interface which shows everything red or green, without any code highlighting, actually helps me a little bit to focus on the changeset as opposed to what the code is actually doing in general. This is helpful because the changeset is what I care about when doing a review (what's different than before is the first question, with understanding what the code is actually doing coming second)... Later, I might need to look at the code in my IDE with proper highlighting to better understand what the changed code is actually doing in more detail, but that's rarely needed (unless it's comprehensive changes).

So, it occurred to me that whether syntax highlighting is actually useful depends somewhat on the context, what are you trying to do?!

I suppose it's easy to extend that realization to people who are different and might feel overloaded by information more easily, so I can sympathize with what you're saying (hope this doesn't sound condescending, I am just trying to say people can have very different cognition overload levels, regardless of how capable they actually are in general).


Anyone interested in syntax highlighting should read the book Human Factors and Typography for More Readable Programs. The majority of the book is devoted to non-color techniques, but they do present some ideas for how to effectively use color near the end.

Much of syntax highlighting in the wild is junk, just distracting eye candy.


how do you know when you have missed one single parenthesis somewhere?


I am very fond of things akin to vim's 'showmatch' mode where when I close a paren or block or etc. the editor highlights the opening element for a second or so and then returns to baseline.

(I have almost no -ambient- highlighting in my baseline but I know lots of people who do and still derive great value from showmatch for the feedback - from discussions with other people rainbow parens style lisp modes seem to provide a maximum overkill approach to that question but I very much prefer maximum underkill in my own tooling even while wanting to be very sure I'm not making it unduly difficult for collaborators with opposite preferences)


Emacs blink-matching-paren minor mode does this. It’s quite configurable, like every other thing in Emacs.

Some of the alternatives can be found by starting at: https://www.gnu.org/software/emacs/manual/html_node/emacs/Ma...


Does it sound presumptuous if I say that hardly ever happens? And if it does, the compiler never fails to remind me. :-)


There's no correct answer here, it's totally subjective.

I can sympathise with both sides; I like syntax highlighting when it's done well - when it's distracting I turn it off.

Seeing a keyword highlighted within a comment is an instant red flag - unfortunately it happens loads in Azure Data Studio (which I need to occasionally use).

Never happened in TreeSitter though.


parentheses matching is a surprisingly non-trivial problem. (i.e. simple counting of opening and closing parens isn't sufficient, given that quoted parens shouldn't be counted) For humans, ignoring quoted parens is maybe easier, but i would say it's a flex to assert that you can tell if (3,(g(f('('),x))) is balanced at a glance.

even if you can, surely you're wasting time and/or focus on an automatable task.
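
the counting caveat in code form: a bare counter calls f('(') unbalanced, so you need at least an in-string flag (sketch: single quotes only, no escapes):

    fn balanced(src: &str) -> bool {
        let (mut depth, mut in_string) = (0i64, false);
        for c in src.chars() {
            match c {
                '\'' => in_string = !in_string,
                '(' if !in_string => depth += 1,
                ')' if !in_string => {
                    depth -= 1;
                    if depth < 0 { return false; } // closer with no opener
                }
                _ => {}
            }
        }
        depth == 0 && !in_string
    }

    // balanced("(3,(g(f('('),x)))") == true, balanced("f('(')") == true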

But we all have something we do 'the hard way' because it feels like more effort to relearn the task than its worth, or because we tried the easy way once and were put off by some side-effect.

paren highlighting never comes as a single unit, its always packaged with other 'helpful' tools, some subset of which will always be infuriating to someone.


> For humans, ignoring quoted parens is maybe easier, but i would say it's a flex to assert that you can tell if (3,(g(f('('),x))) is balanced at a glance.

True, but mentally balancing parentheses is usually something that you do while writing the code: you push/pop a little stack in your head and this becomes second nature.

Mentally verifying if parentheses are balanced while reading code is hardly ever required. You can usually safely assume that they are (unless that darn compiler tells you otherwise).


you're probably right, difficulties when writing are mainly due to tools 'helpfully' adding a ket as soon as i type a bra.

maybe i just don't have that stack well enough built in my head--if i'm editing in a plugin-free vim, i do find i have to backtrack and count to make sure i've put the right number of kets at the end of a nested expression.

if i used s-expressiony instead of tab-heavy languages more often i'm sure i'd be better at it.


Not the OP, but my editor colors the parenthesis red in case of an imbalance [0].

I do have a minimal amount of highlighting though.

[0]: https://imgur.com/haVWset


I am in love with language servers, the quality of life improvement is just unreal.


Wait until you try a fully featured "real" IDE. The features language servers provide are only some of the many things that IDE users have had for literally decades.


It's kind of hilarious that programmers, who learn again and again the value of decoupling and cohesion, fell so hard for the idea of an Integrated Development Environment. There's nothing about syntactic/semantic code analysis, to pick one example, that requires it to be packaged along with a particular text editor in a single big blob.

Ironically, the most successful IDEs today, the Jetbrains ones, are demonstrations of this. They are built out of reusable components that are combined to produce a wide range of IDEs for different languages.

LSP and DAP aren't perfect, but they're a huge step in the right direction. There's no reason people shouldn't be able to use the editor of their choice along with common tooling for different languages. The fact that IDEs had (for a while) better autocomplete, for example, than emacs wasn't because of some inherent advantage an IDE has over an editor. It's because the people that wrote the language analysis tools behind that autocomplete facility deliberately chose to package them in such a way that they could only be used with one blessed editor. It's great to see the fight back against that, and especially so to see Microsoft (of all people) embracing it with LSP, Roslyn, etc.


The technical design isn't the user experience. IDEs are an integrated user experience. It literally doesn't matter to the user how nicely decoupled everything is or isn't under the hood if the end results are indistinguishable.

One point in favor of tight integration and against LSPs is that editing programs isn't like editing unstructured text at all and shouldn't be presented as such. There are tons of ways in which the IDE UX can be enhanced using syntactic and semantic knowledge of programs. Having a limited and standardized interface between the UI and a module providing semantic information will just hamper such innovation.


The user experience is entirely determined by the technical design. The difference between emacs or vim and the editor built into say Visual Studio is enormous, and if a developer is prevented from using the former (if that's what they're comfortable with) alongside the language analysis capabilities of the latter, that has a huge impact on the user's experience.

It's true that if you own both the editor and the language analysis tools you can more rapidly add new capabilities, but many facilities that were historically the domain of IDEs, such as autocomplete, are very easy to standardise an interface for, and this has been done. Supporting such interfaces doesn't prevent you from also supporting nonstandard/internal interfaces for more cutting-edge capabilities. The argument made by Jetbrains is similar to the one you've made and it's entirely false. They could easily support LSP and it would have no impact on their ability to innovate. They refuse to do so for purely business reasons (as is their right).

Editing instead of replying as the depth limit is reached (bad form, perhaps, but gmueckl's reply is in the form of a question and I'd like to respond): The necessary UI capabilities for the features you describe already exist in emacs. Multiple alternative implementations of them, actually (lsp-mode vs eglot). It's the editor's job to provide the UI and the LSP server's job to provide the backend. The interface between them is easy to standardise and it has been done (yes, even for the features you mention).


For example, how do you use a feature like "Navigate to derived symbols" without the required UI integration (prompt for which derived symbol to go to, opening up the correct code location...)? How do you define an "Extract interface" or "Extract base class" refactoring (name of new class/interface, members to extract, abstract or not...)? There are tons of UX aspects to good code navigation and refactoring. In order to get the equivalent of the feature set of, say, Resharper into vi or emacs, you'd have to add tons of new UI stuff. And once you do that, you are back to pretty tight coupling.


The necessary UI capabilities for the features you describe already exist in emacs. Multiple alternative implementations of them, actually (lsp-mode vs eglot). It's the editor's job to provide the UI and the LSP server's job to provide the backend. The interface between them is easy to standardise and it has been done (yes, even for the features you mention).

(Seems I can reply after all so have done so, and now I appear unable to edit the GP and remove the text above from it :-/)


How does emacs display a dialog where you can select all the methods to extract along with some additional UI elements for extra options?


Emacs can be described as an interactive, lisp-based environment for building textual UIs (TUIs). It's very easy to extend it with arbitrary, dynamic behaviours, including ones that need to collect information from the user.

As a trivial example, let's say for some reason I keep needing to generate the sha256 hash of a password and add it to the current file. I could add this to my .emacs:

    (defun my-insert-password-hash (password)
      "Prompt for PASSWORD and insert its SHA-256 hex digest at point."
      (interactive "MPassword: ")
      (insert (secure-hash 'sha256 password)))

    (global-set-key (kbd "C-c p h") 'my-insert-password-hash)
Now if I hit Ctrl-C followed by p then h I will be prompted for a password in the minibuffer and the hash of the string I provide will be added where my cursor is. I didn't need to write any GUI code.

This kind of user interaction can equally easily allow the user to select from lists of dynamically determined options, including very large ones (with nice fuzzy matching menus if you use ivy, helm or similar). It's also trivial to write functions that prompt for several pieces of information, ask for different information depending on context, etc.

In the case of LSP, the server only has to provide information about what options are available and what possible responses are permitted. It's easy for emacs to dynamically provide the corresponding UI.


Ok, so you're not pointing to an implementation of the feature that I asked about. Instead, you're asking me to implement it as a special case. That's backwards.

You're effectively confirming that there needs to be feature-specific integration code for each and every navigation/refactoring/... feature in the editor. Once you have that, you again have tight coupling.


Not at all. The server allows for discovery of code actions and their parameters and the UI is displayed in response, by code that knows nothing about particular actions. It goes something like this (won't be accurate in the details, check the lsp spec [1] under Code Action Request if you want chapter and verse):

User to emacs: I want to perform a code action.

Emacs to server: The selection is this. What code actions can you perform?

Server to emacs: I can perform actions called "Extract Method", "Extract Base Class", etc.

Emacs to user: Choose an action from this list

User to emacs: I'll do "Extract method"

Emacs to server: We're going with "Extract Method", what info do you need?

Server to emacs: I need an item called "method name" which is a string and an item called "visibility", which is one of the following: "public", "private", "protected".

Emacs to user: Enter method name... Select visibility...

Emacs to server: Here are the parameter values.

Server: Here are the edits necessary to perform the code action

Emacs: Updates code.
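
For reference, the first step of that exchange is an ordinary JSON-RPC request on the wire. A sketch of its shape using the serde_json crate (field layout per the spec linked below):

    use serde_json::{json, Value};

    // "The selection is this. What code actions can you perform?"
    fn code_action_request() -> Value {
        json!({
            "jsonrpc": "2.0",
            "id": 1,
            "method": "textDocument/codeAction",
            "params": {
                "textDocument": { "uri": "file:///src/lib.rs" },
                "range": {
                    "start": { "line": 10, "character": 0 },
                    "end":   { "line": 24, "character": 1 }
                },
                "context": { "diagnostics": [] }
            }
        })
    }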

No special per-action code is required. If you want to see an implementation have a look at lsp-mode[2].

I hope that makes it clear. I've spent more of my day explaining this than I should have now, so I'll leave it there.

[1] https://microsoft.github.io/language-server-protocol/specifi...

[2] https://emacs-lsp.github.io/lsp-mode/


Please forget "Extract Method" as an example. It's simply far too trivial. Imagine extracting a new superclass or interface from a class with 100 methods where you need to pick the set of methods to extract. You can't do that with 100 y/n choices.


>Having a limited and standardized interface between the UI and a module providing semantic information will just hamper such innovation.

Maybe. That can certainly be a downside of standardisation in general. However, it doesn't necessarily follow in all cases, and this is, I think, one where it doesn't. The features LSP provides are stable - and have been standard across most editors/IDEs for quite some time. Implementing them once for N editors, rather than N times, is just far more approachable (and appealing) for language tooling developers.

It doesn't stop those developers (or anyone else) adding features beyond the LSP standard. But that means doing it in an editor-specific way. Which is no worse than where we were before anyway.


DAP being the Debug Adapter Protocol described here? https://microsoft.github.io/debug-adapter-protocol/


Yep


I think the primary reason why IDEs are generally better than maximally customised editors like vim, emacs, sublime or vscode and whatnot is pretty simply put: money.

People buy IDE -> money goes to improving the IDE -> IDE gets better

People download one of 6 competing open-source plugins -> a couple of people improve it a little -> 3 years pass, the author loses interest -> someone else reinvents the wheel, and there are now 7 competing open-source plugins, 3 of which are good but not maintained anymore.

Great features require time, I just don't see non-commercial work succeeding here.

That doesn't mean it's not possible to create fantastic commercial open-source standalone language tools, it's just not happening for some reason. Probably just because most businesses are still hesitant to open-source their core business?


"Commercial open-source" will always be oxymoronic to me, despite all the kumbaya naysayers. There's a basic law of nature at work here which US patent law and a fictional character ("if you're good at something, never do it for free") understood. To a programmer, opening the source essentially renders the product gratis, some custom integration work notwithstanding.


Yeah I get that, and I think that's the primary reason it's difficult to do well.

I think there _are_ ways to do it right. For instance, open-sourcing a Windows application is not necessarily problematic if 99.9% of your user base has never ever compiled something from source. Heck, my father is the kind of person who doesn't know the difference between "Windows" and "gmail". He has purchased software for his business once, it would've made no difference to him whether it was open-source or not.

Despite my believing that it's possible, I can't really think of any examples other than Red Hat and Qt off the top of my head...


> open-sourcing a Windows application is not necessarily problematic if 99.9% of your user base has never ever compiled something from source.

What a fatuous remark. Suppose I publish the Coca-Cola recipe to Pepsi drinkers. I'm fairly sure the recipe will eventually get around to a home brewer who's sick of paying the Coca-Cola company for its product.


I'm old enough to have used IDEs, the issue is that my job involves dealing with multiple different languages and markup files. In turn, a general purpose editor with language servers just suits my workflow better.


Try a JetBrains IDE? They handle pretty much any language you can think of, including fun things like Lua with ERB substitutions.


Yeah, IntelliJ Ultimate with plugins gets pretty close to VSCode in terms of language support.


surely any decent IDE will handle that nicely?

I used to work with Eclipse and it supported everything* through plugins just nicely.

* ..that I was using at the time: java, xml, python, html, jsp, javascript


For me, there is no IDE feature that can compete with the experience of editing in vim/neovim. When I use any other editor I just feel like I have a hand tied. The development of LSP and tree-sitter just makes the whole experience even better.


LSP is just a generalization of the implementations IDEs have had for decades.


Could you provide some concrete examples?

I ask because modern editors can do most things people often regard as IDE only but there is still the odd gem that’s worth hearing about.


I'm not familiar with state of the art for language servers but here's common IntelliJ refactors I use across Go and TypeScript (and Java a while ago):

- Add a parameter to a method signature and fill it in with a default value.

- Reorder method parameters.

- Extract a block of code to a method that infers the correct input and output types.

The most advanced refactoring I've done with IntelliJ is structural replace for Java which can do something like: for every private method matching the name "*DoThing" defined on a class that extends Foo, insert a log line at the beginning of the method: https://www.jetbrains.com/help/idea/structural-search-and-re...

I make heavy use of the "integrated" aspect of IntelliJ. One of the nicer benefits is that SQL strings are type-checked against a database schema.


All of these are doable with LSP. I'm a big fan of the "LSP rename" action which will rename a particular semantic item (e.g. a method) across files, or the refactoring actions (e.g. change an "if let" into a "match" in Rust).


Everything is doable with LSP as it is an extensible RPC essentially.

But the above things are not done in LSP generally. It doesn’t have first-class support for structural search replace. It doesn’t have support for interactive refactors which require user input.


Do you have an example of a language server capable of structural refactoring of the type mentioned in the GP? The “semantic rename” is table stakes from ~20 years ago in IntelliJ and ReSharper, and even Eclipse.


Rust Analyzer can do structural search and replace [0], though I've never used it nor IntelliJ's so I can't compare how capable both features are.

[0] https://rust-analyzer.github.io/manual.html#structural-searc...


Structural refactoring as in extract function/variable? You can find that in Rust Analyzer https://rust-analyzer.github.io/manual.html#assists-code-act...


I want to customize every last element of my editor, have native VI bindings for everything, and run in a terminal. What IDE does that?


I have not ever had a good experience with an IDE. They are always bloated messes that try to force you into their shitty project structure.


Sure, but a lot of us don’t like that extra overhead. LSP is great in that it’s a tool you can tap into to use in a workflow that’s best for you. More of a library than an application (not technically, but in terms of how you use it).


VSCode is a "real" IDE.

> The features language servers provide are only some of the many things that IDE users have had for literally decades.

Yes of course, because that's what they were explicitly designed to do. The novel thing about language servers isn't that they enable code intelligence features like auto-complete and variable renaming. It's that they do so over a standard protocol that any editor or IDE (or website or CI system or ...) can use.


It's hard for an open source community to build features that compete with a commercial offering like that of Microsoft (or Borland in the 90s.)

And the reason for that is mostly down to fragmentation: the vim guys are doing their thing; the Emacs theirs, etc.

Now focus that energy into a singular project like a Language Server and the payout is likely to be many orders of magnitude greater.


Language server is a Microsoft offering


It's an open standard started and maintained by Microsoft.

I think that's a bit different from just being a "Microsoft offering".


"Hey I use an IDE, im special"


I'm a maintainer of a CLI HTTP client with a plain text file format, Hurl [1]. I would like to begin to add support for various IDEs (VSCode, IntelliJ), starting from syntax highlighting, but I have a hard time getting started.

I struggle with many "little" details, for instance: syntax errors should be exactly the same in the terminal and in the IDE. Should I reimplement exactly the same parsing or should I reuse the CLI tool's parser? If I reuse it, how do I implement things given that, for instance, IntelliJ plugins are written in Java/Kotlin, while VS Code plugins are JavaScript/TypeScript, and Hurl is written in Rust...

Very hard to figure it all out when it's not your core domain.

[1] https://hurl.dev


If it's something simple (and it sounds like it is) then I would strongly recommend just making a single parser library that you use in both the language server and CLI. That's what I've done for my RPC format.

I used Nom. Even though it's not incremental, parsing is easily fast enough to just reparse the entire document on each change.
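
For flavor, a first cut at a request line in Nom might look like this (hypothetical code, nom 7 API, not Hurl's actual grammar):

    use nom::{
        branch::alt,
        bytes::complete::tag,
        character::complete::{not_line_ending, space1},
        IResult,
    };

    // Parse e.g. "GET https://example.org" into (method, url).
    fn request_line(input: &str) -> IResult<&str, (&str, &str)> {
        let (input, method) = alt((tag("GET"), tag("POST")))(input)?;
        let (input, _) = space1(input)?;
        let (input, url) = not_line_ending(input)?;
        Ok((input, (method, url)))
    }
The same function can then back both the CLI and a language server, so error reporting can't drift apart between the two.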

An alternative is to just use Tree Sitter as your parser for the CLI too. You won't use the incremental parsing feature in the CLI but that's fine.

Supporting IntelliJ may be tricky but there is a WIP plugin that adds LSP support.



tree-sitter is a great framework. I have used it quite a bit in the past. I even created a small library on top of it, called tree-hugger (https://github.com/autosoft-dev/tree-hugger). Really enjoyed their playground as well.


> The reason (most) LSP servers don’t offer syntax highlighting is because of the drag on performance. Every keystroke you type must be sent to the server, processed, a partial tree returned, and your syntax highlighting updated. Repeat that up to 100 words per minute (or whatever your typing speed is) and you’re looking at a lot of cross-chatter that is just better suited for in-process communication.

While I agree... he might be surprised to know that that is what all language servers do anyway, even if they don't provide syntax highlighting. Every keystroke gets sent over the LSP. As JSON. It's amazing it works as well as it does.
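
For scale, a single keystroke becomes roughly the following notification (a sketch via the serde_json crate; field shape per the LSP textDocument/didChange spec):

    use serde_json::{json, Value};

    // A single character typed at line 3, column 17 of an open document.
    fn did_change_notification() -> Value {
        json!({
            "jsonrpc": "2.0",
            "method": "textDocument/didChange",
            "params": {
                "textDocument": { "uri": "file:///src/main.rs", "version": 42 },
                "contentChanges": [{
                    "range": {
                        "start": { "line": 3, "character": 17 },
                        "end":   { "line": 3, "character": 17 }
                    },
                    "text": "x"
                }]
            }
        })
    }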


Not coming from the vim/Emacs world, I fail to understand what tree-sitter is compared to a language server. Why would I need both?


The article talks about why LSP servers don't typically implement syntax highlighting (performance).


Here's a great video on the topic by one of Neovim's core team

https://www.youtube.com/watch?v=c17j09vY5sw


Thanks, that is very helpful.

I was wondering because if both achieve similar goals it makes no sense to run them both, but now I can educate myself.



