Let's write a treesitter major mode for Emacs (masteringemacs.org)
209 points by nanna on Sept 14, 2023 | 84 comments



BTW:

While Emacs 29.1 comes with "treesitter" built-in, you still need to manually build and install a tree-sitter grammar for each language, which implements the actual language-specific parser. Doing this yourself can be fiddly and frustrating.

I had a quick success with using this convenience script: https://github.com/casouri/tree-sitter-module/. It provides fully-automated builds for the most popular languages (including typescript, c and c++).

This is how it works for "typescript":

1. Clone the repository: https://github.com/casouri/tree-sitter-module/

2. Install "build-essential" (which provides a C/C++ compiler) if you're on Linux.

3. Run "./build.sh typescript" from within the repo.

4. Copy the resulting shared library from "dist/libtree-sitter-typescript.so" into your "~/.emacs.d/tree-sitter/".

5. Open a random typescript file and try "M-x typescript-ts-mode" which should not give you any error but instead nice syntax highlighting.
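
For step 5, a quick way to confirm that Emacs actually sees the library from step 4 is to ask it directly. This is only a sketch: it assumes Emacs 29+, and the extra directory added to `treesit-extra-load-path` is a hypothetical location, only needed if the library does not live in "~/.emacs.d/tree-sitter/".

```elisp
;; Sketch: check that the typescript grammar from step 4 can be loaded.
;; Assumes Emacs 29+; the extra directory below is hypothetical.
(require 'treesit)

;; Only needed when the .so is NOT in ~/.emacs.d/tree-sitter/.
(add-to-list 'treesit-extra-load-path
             (expand-file-name "~/src/tree-sitter-module/dist/"))

(if (and (fboundp 'treesit-available-p)
         (treesit-available-p)
         (treesit-language-available-p 'typescript))
    (message "typescript grammar found")
  (message "typescript grammar not found; check the library path"))
```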

You might find there is a treesitter plugin for your language available and it is even supported by "tree-sitter-module" but there is still no major mode, yet. Happened to me for Perl 5.


Technically in Emacs 29.1 tree-sitter is still only an optional build option, which a given package maintainer may have 'built in' to your package. It isn't actually a default. If you build it from source you need to pass the --with-tree-sitter flag to ./configure. See:

https://www.masteringemacs.org/article/how-to-get-started-tr...

What I read from this is that tree-sitter isn't considered quite ready by the Emacs maintainers, perhaps because of the restricted number of actual treesitter modes, or maybe because the treesitter support itself is not quite considered there yet?
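
If you are not sure how your package maintainer built Emacs, you can check from inside Emacs. A minimal sketch, assuming nothing beyond stock Emacs:

```elisp
;; Sketch: report whether this Emacs was built with tree-sitter support.
(cond
 ((not (fboundp 'treesit-available-p))
  (message "This Emacs predates built-in tree-sitter support"))
 ((treesit-available-p)
  (message "tree-sitter support is compiled in"))
 (t
  (message "This Emacs was built without tree-sitter support")))

;; The configure flags used for the build are also recorded here:
(message "%s" system-configuration-options)
```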


I was lazy. I installed Emacs from Alex Murray's snap:

  sudo snap install emacs --classic

It works without problems because it uses snapcraft's "classic" confinement. It comes with both native compilation and tree-sitter support.


It's more a matter of external dependencies. Libjansson (JSON serialisation) isn't a default either, and neither is libgccjit (native elisp compilation).


I found this snippet in one of Mickey's earlier tree-sitter posts that works great. It does require searching through the tree-sitter repo to make sure your paths are correct:

  (setq treesit-language-source-alist
      '((typescript "https://github.com/tree-sitter/tree-sitter-typescript" "master" "typescript/src")
        (tsx "https://github.com/tree-sitter/tree-sitter-typescript" "master" "tsx/src")))

  (mapc #'treesit-install-language-grammar (mapcar #'car treesit-language-source-alist))


For the record, it's in Mickey's How to Get Started with Tree-Sitter post:

https://www.masteringemacs.org/article/how-to-get-started-tr...


Or you can just "M-x treesit-install-language-grammar" then follow the prompts.


I tried that but unsuccessfully. Did not get an error but also no grammar for Typescript or C/C++. Perhaps mileage varies with the language.

Edit: I can confirm it works nicely for "javascript". Cool!


I have done typescript successfully. It uses a different source folder than the default, as the repo contains grammars for both tsx and typescript. C/C++ could be a similar situation. The installer should prompt you for it during the setup.


Is there anything that returns a parse tree of an org document? A while ago I wrote some super hacky elisp to navigate around the structure of a giant org mode doc, but it was rickety and terrible and constantly breaking.

Part of this is surely that I don't know wtf I'm doing, but it seemed like there was not an underlying data structure held in memory that you could conveniently query / manipulate, but rather, most of the existing org functionality built some kind of structure each time you did an operation.

Would appreciate any pointers, code examples, tutorials that show how to effectively navigate / manipulate an org structure and have it reflected in the buffer, if there is such a thing.


Organice is org-mode, but as a React app. They have a pretty complete parser. https://github.com/200ok-ch/organice#background-information


https://orgmode.org/worg/dev/org-element-api.html

But even with this I found it pretty awful.
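
For anyone who wants to try it anyway, here is a minimal sketch of walking the parse tree with the org-element API. The function name is made up for illustration; `org-element-parse-buffer`, `org-element-map` and `org-element-property` are the real entry points.

```elisp
;; Sketch: collect (LEVEL . TITLE) for every headline in an Org buffer.
;; `my/org-headline-outline' is a hypothetical name, not part of Org.
(require 'org)
(require 'org-element)

(defun my/org-headline-outline ()
  "Return a list of (LEVEL . TITLE) pairs for the current Org buffer."
  ;; Passing 'headline as granularity skips parsing of body contents.
  (org-element-map (org-element-parse-buffer 'headline) 'headline
    (lambda (hl)
      (cons (org-element-property :level hl)
            (org-element-property :raw-value hl)))))
```

Note that the tree is rebuilt on every call to `org-element-parse-buffer`, which matches the observation above that there is no long-lived structure you hold on to and mutate.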


What was awful about it? It's been a while since setting up but my config uses it to pluck the important bits from my Org library into SQLite as I edit; it works well enough and wasn't difficult to set up or understand. Admittedly it is relegated to this one step so that everything else can query docs with SQL but I was quite happy that the API exists so as not to do the parsing myself with the mistakes and minor deviations that would entail.


In org-alert we use `org-map-entries` and a simple `org-alert--parse-entry` function for stripping out the details we're looking for. Depending on what you want, it's not exactly a data structure, but maybe it will help you get started!

https://github.com/spegoraro/org-alert/blob/master/org-alert...
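
To give a feel for the approach, here is a rough sketch of the `org-map-entries` pattern. It is not the org-alert code: the function name and the choice to collect only heading text are assumptions for illustration.

```elisp
;; Sketch: visit every TODO entry in the buffer and collect its heading.
;; `my/org-todo-headings' is a hypothetical name, not taken from org-alert.
(require 'org)

(defun my/org-todo-headings ()
  "Return the heading text of every TODO entry in the current Org buffer."
  (org-map-entries
   ;; Called with point at each matching headline.
   (lambda () (org-get-heading t t t t))
   "TODO=\"TODO\""   ; match only entries in the TODO state
   'file))           ; scan the whole current file
```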


While it doesn't properly understand the structure, you can move around pretty well with Imenu or (configured) org-goto. I assume it's also possible to make something for it so that it takes nesting into consideration like it does for some programming languages. My org files are only a couple of thousand lines though, so I don't know how they perform when it gets larger than that.


This is from the author of the excellent book Mastering Emacs.

I am very far from being knowledgeable about programming on the Emacs platform, but I am trying to learn. I grabbed the name M-x-AI.com a while back with the goal of integrating other people’s Emacs packages with some of my own hacks into a better AI dev work environment and writing a short book on it. I have been using Emacs since, I think, 1982. There are so many good new packages for integrating CoPilot, GPT-4, etc., as well as major Emacs platform improvements that are too many to list.


Out of interest, do you use Emacs as an alternative to Jupyter for interactive work (examining plots etc.)?

If so, which modes and packages do you use?


Jupyter can be used with Org Babel (interactive output, ANSI escape codes, plots). I haven't tried widgets; Shiny apps mostly cover this use case for me (https://shiny.posit.co/).

https://github.com/emacs-jupyter/jupyter#org-mode-source-blo...


I'm not trying to bash Emacs or treesitter or anyone. But I find it mildly amusing that after so many decades, parsing and syntax highlighting aren't a perfectly solved problem, considering programming languages are the most used tools for developers.


Parsing of a correct program is a pretty "solved" problem.

But fast enough re-parsing of fragments and recovery from errors is a much more complex problem, one that often doesn't have a single correct answer. It's also a much newer problem in as much as syntax-highlighting is a much newer feature, being preceded largely by "offline" pretty-printers with very different constraints.

The extent to which modern compilers try to parse past errors still varies greatly, with a whole lot not even trying to.

But having just any recovering parser also does not mean the problem is solved. E.g. you've typed "foo". Now you type "(". It'd be very annoying if your editor now re-colors everything as an error, so you typically want some error recovery. But how soon? Do you assume the tokens immediately afterwards are part of what was a valid expression until you typed "foo", or are they a valid part of an argument list? And where do they end? Do you just delay re-parsing until the user has typed more? Or left the line? Sometimes that can help, sometimes it will just make things worse.

Some parsing methods work fine if you assume you can "reset" the parse at many different points, which tends to constrain the area considered an error and so reduces the size of a typical re-parse. Those same methods will fail badly if you want stricter re-parsing that may frequently trigger re-parsing most of a file, for example.

A lot of this is subjective, and picking the "right" way of handling it largely comes down to unpacking humans' unstated preferences, and trying to reconcile competing and possibly contradictory preferences.
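
To see how one recovering parser answers these questions in practice, you can feed it a half-typed fragment and look at where it places the error. The sketch below uses Emacs 29's treesit API; the helper name is hypothetical, and it assumes the grammar for the language you pass in is already installed.

```elisp
;; Sketch: parse a (possibly broken) snippet and report the recovery result.
;; `my/treesit-parse-report' is a hypothetical helper; it assumes Emacs 29+
;; with tree-sitter support and an installed grammar for LANGUAGE.
(require 'treesit)

(defun my/treesit-parse-report (code language)
  "Parse CODE with LANGUAGE's grammar; return error flag and tree as text."
  (with-temp-buffer
    (insert code)
    (let ((root (treesit-parser-root-node
                 (treesit-parser-create language))))
      (list :has-error (treesit-node-check root 'has-error)
            :tree (treesit-node-string root)))))

;; Example call (requires the python grammar):
;; (my/treesit-parse-report "foo(" 'python)
```

Comparing the printed tree for "foo" and "foo(" shows which tokens the parser decided to keep as valid and which it wrapped in error or missing nodes.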


> it's also a much newer problem in as much as syntax-highlighting is a much newer feature

Just an aside. I wouldn't associate syntax highlighting with new. It's almost as old as text editors. Accurate syntax highlighting in real-time visual text editors is over 40 years old by now and was becoming common by the late 1980s.

GNU Emacs gained syntax highlighting in 1989 and it was considered late to the party. Many programming editors and IDEs had syntax highlighting by then.


I wonder why there aren't more tools to allow structural editing like Lisp has. This way the file can never be in an unparsable state.


You may be aware but the author of TFA also has a tree-sitter based minor mode called Combobulate for exactly that:

https://www.masteringemacs.org/article/combobulate-structure...

There is also evil-textobj-tree-sitter for tree-sitter based text objects for Evil mode:

https://github.com/meain/evil-textobj-tree-sitter


I spent a couple of months working with a tool along those lines. Unfortunately, it's one of those issues where it won't work without the galactic emperor mandating these tools under pain of death. A colleague using VS Code would commit code with unbalanced parentheses. The tooling was smart enough that it could load the file, even though it wasn't parsable, but there was no structural way to fix the missing closing brace. Adding a closing brace automatically inserted another opening brace, leaving the original mismatch.


This is not fully correct. You are right that Lisp has very regular atoms that you can read, but full parsing is a different matter. Consider that (let (a 0) (1+ a)) reads fine as a list but is not valid syntax, so if you were to add color for the different parts, it would fail. Indeed, if you want to label the parts of the program tree, you have to parse it as more than just "lists of atoms."


because in most languages there's a level of abstraction between the syntax of the language and the actual data structure of the program. To use the much maligned term, most languages aren't homoiconic, the internal structure of the program is hidden away from the programmer.

Lisps essentially have no syntax, that's why it is trivial to manipulate a lisp program structurally.


You can do this with most other languages too, but I think the challenge is that a lot of us have habits that we'd need to change. E.g. with structural editing I couldn't easily decide I want to add exception handling around a block in Ruby by writing "begin", then moving down, and writing "rescue ..." and "end" - I'd likely need to mark the block first. That may well end up being faster, but it's a fairly big change.

I wonder, though, if a "semi-structural" editor with a keypress to "complete what is otherwise an error" would work well. E.g. I write "begin" and so unbalance an expression, and the editor looks at what I typed that changed the parse from successful to an error and puts up options to insert "rescue ... end" or just "end". Maybe coupled with a warning on save if the file does not parse.


This is what treesitter enables!


I clarified exactly why it is the way it is here:

https://www.masteringemacs.org/article/tree-sitter-complicat...

And also why using LSP to furnish your editor with highlight markers is an inelegant solution for many languages.


The problem is that a good editor-compatible tool for parsing and syntax highlighting is at cross-purposes with what you want from a compiler.

A good overview here: https://matklad.github.io/2022/04/25/why-lsp.html


Insightful article. Key point:

> Fourth, Microsoft itself doesn’t try to take advantage of M + N. There’s no universal LSP implementation in VS Code. Instead, each language is required to have a dedicated plugin with physically independent implementations of LSP.

There's no (official) LSP implementation for Typescript either. Instead of using LSP, Microsoft maintains tsserver which uses a custom protocol for better integration.

Take-home message: don't try to write a universal tool that solves everything, as it will be lowest common denominator. Create building blocks such as Tree-sitter, which make specific (M×N) integrations easier and more powerful.


But Tree Sitter is an M+N tool.

Also LSP was never going to completely eliminate the need for language-specific plugins. But it does significantly reduce the amount of work needed to build each one, providing a consistent baseline.


In my classification, Tree-sitter is a building block (a parser), not a tool (compiler, editor, highlighter etc.) And as seen in this HN submission about implementing HTML support for Emacs, it's used in M×N integrations.

I didn't know tree-sitter-highlight exists though. It seems to provide M+N highlighting and includes a CLI tool. https://tree-sitter.github.io/tree-sitter/syntax-highlightin...

Similarly, tree-sitter tags seems to provide M+N code indexing: https://tree-sitter.github.io/tree-sitter/code-navigation-sy...


It's an M+N tool in my opinion because it parses M languages with a uniform interface, for use in N higher-level tools. I can use the same Python grammar/parser with any tool that's based on Tree Sitter.


Interestingly, treesitter is designed for this space (parsing invalid structures with a time component) but it’s still not used as a base for LSPs. I once asked why on HN and people that know more than me said it wasn’t suitable.


tree-sitter definitely can be used as an incremental parser for an LSP. But the original purpose was and still is to provide a standard format for an editor-agnostic way to parse languages.

As somebody who wrote a tree-sitter grammar recently I can confirm that the library definitely has its share of... Interesting choices. But there's nothing else like it as most parser generators are non-incremental, don't do generalized parsing and don't provide decent error recovery.


> has its share of… Interesting choices

Curious to hear more.


Well, nothing truly horrible but here are things I find inconvenient (but working nevertheless) when using the sdk.

1. NodeJS-based utilities. The tooling around JS doesn't lend itself well to integration with the dev's environment.

2. The sdk itself is a mix of Rust, javascript and C. I'd say that this is at least one language too many - makes it harder to contribute. Ideally such fundamental projects should be as simple and homogeneous as possible.

3. I keep wondering if generating languages other than C should be supported.

The core idea of the library is brilliant either way so it sort of compensates all of the above.


Until TreeSitter came along the effort to add support for new languages to your editor would be gargantuan.

Now it's much, much easier providing there's a TreeSitter parser for your language.

I don't know of anything else that bridges the gap like this.


Note that not that long ago, computers were simply not powerful enough to run that in acceptable time. Dedicated parsers can be much faster.


It's a tough problem. Steve Yegge blogged about the complexities involved when he wrote js2-mode:

https://steve-yegge.blogspot.com/2008/03/js2-mode-new-javasc...

I guess comp. sci. people studying languages have been more interested in syntactically valid programs than the opposite.


Well, I bet most syntactically valid programs are not interested in comp-sci people!


The language extension strategies for CSS prettifiers are usually pretty reasonable.

In editors this always seems extremely esoteric comparatively: I've tried doing it in a few.

I'm sure brilliant people find it easy, but I'm merely average on a good day.

I haven't tried extending any of these modern electron based editors, can anyone speak to that?


There will never be "perfectly" solved problems at that complexity level. There are always changing requirements and room for improvement. Make it faster, add new features, use new hardware abilities, follow the flavors of this decade: this is a never-ending game of catching up.


Well, there are more problems like that. You'd think diffing is a solved problem, and yet we still struggle with syntax-aware diffs (I use difftastic, which is great, but doesn't always work well, and is under constant development).


Interesting, it's also based on parsing by Tree-sitter: https://difftastic.wilfred.me.uk/parsing.html


It's not really a solved problem in general. Most editors appear to use TextMate grammars which nobody likes. Otherwise you have to implement it using whatever custom setup your specific editor uses. It just happens that most languages have some poor soul who set this up already. Emacs is actually on the better side because tree-sitter is a much better setup for writing grammars.


This is that. Tree Sitter has become one of the foundational advances that is allowing us to make progress on solving that problem.


What would you consider a perfectly solved problem in this case? I.e. how is current development experience bad and how it could be better?


In a perfect world of emacs/tree-sitter, I would imagine tree-sitter to be a single minor mode. You don't need 'java-ts-mode' and 'c-ts-mode'. Just a single 'treesitter-mode' toggle and it will call the treesitter binary to do the hard lifting.

I guess the reality just isn't so rosy.


A `treesitter-mode` is a so-so idea, because the way different languages are interacted with is different. The way Lisp sexps are handled in an IDE is different from the way Python's syntax is handled. It doesn't make sense to try and have one mode that handles both - it is better to have many modes that interact with the common data structures that treesitter provides and then provide specific convenience functions.

Even if all it was supposed to do is syntax highlighting, different languages are usually highlighted differently and presumably the person maintaining the bindings would need to be a language expert with strong opinions. The options are they either maintain their own mode or a module in `treesitter-mode` which is basically 2 ways of describing the same situation.

Although to address your other point it is quite funny that after 40 years and 29 versions of experimenting in OS design, GNU Emacs is starting to implement syntax highlighting properly for its system text editor.


A treesitter-mode would make sense for organizational reasons and user experience. But under the hood it would be just a proxy mode which figures out the language and loads the appropriate treesitter sub-modes. And maybe in a more advanced version it could also mix modes for different languages.


But you're describing Emacs. Emacs figures out the language and loads the appropriate sub-modes. It even calls them modes. Modes are mixable.

We're in a transition period while everything is rewritten to use tree-sitter. In a few years all the default major modes in Emacs are probably going to be tree-sitter based - unless the maintainers believe them to be better than tree-sitter.
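
During the transition you can opt in per language. A small sketch for an init file, assuming Emacs 29+ and that the python grammar is installed (python is only an example here):

```elisp
;; Sketch: prefer the tree-sitter variant of a major mode when its grammar
;; is available. Assumes Emacs 29+; python is just an example language.
(when (and (fboundp 'treesit-available-p)
           (treesit-available-p)
           (treesit-language-available-p 'python))
  ;; Anything that would have selected python-mode now gets python-ts-mode.
  (add-to-list 'major-mode-remap-alist '(python-mode . python-ts-mode)))
```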


> Emacs figures out the language and loads the appropriate sub-modes.

Not really. You set up hooks, usually keyed by file extension, and Emacs executes the hooks. Emacs itself has no understanding of languages. And this will not work well when no hook is executed.
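
For readers who haven't seen it, the wiring being described looks roughly like this. A sketch only: the setup function is hypothetical, and the regexp is just one way to match TypeScript files.

```elisp
;; Sketch: file-name pattern -> major mode, then a hook run by that mode.
;; `my/ts-setup' is a hypothetical function used for illustration.
(defun my/ts-setup ()
  "Buffer-local tweaks applied whenever `typescript-ts-mode' starts."
  (setq-local indent-tabs-mode nil))

;; Files ending in .ts open in typescript-ts-mode ...
(add-to-list 'auto-mode-alist '("\\.ts\\'" . typescript-ts-mode))
;; ... and the mode then runs its hook.
(add-hook 'typescript-ts-mode-hook #'my/ts-setup)
```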

> Modes are mixable.

You can load multiple minor modes, but there is always just one major mode per buffer, and languages are usually major modes. But OK, treesitter can be handled as a minor mode. But it still needs explicit support. You cannot give treesitter control over a range of text, let it figure out the language automatically and let it do its thing. This needs explicit support.


> GNU Emacs is starting to implement syntax highlighting properly for its system text editor

What do you mean?


It is a classic joke. Emacs is often a little behind competing text editors (Vim has better keybindings IMO, things like Visual studio have better IDE support, Atom text editor turns out to have better syntax highlighting, etc, etc).

So, the line of the joke goes, it must not really be a text editor because it is not very good at it and has a wild array of other capabilities. Like "M-x dunnet". Must be an OS.

Of course the reality is that Emacs just copies good ideas from other people. A pretty normal Emacs setup uses Vim keybindings, LSP and now tree-sitter to get a good experience programming.


> Of course the reality is that Emacs just copies good ideas from other people. A pretty normal Emacs setup uses Vim keybindings, LSP and now tree-sitter to get a good experience programming.

Too bad other people don't copy good ideas from Emacs. Why other text editors/processors have no equivalent of view-lossage is a mystery to me. Also, where-is, describe-key, insert-char, transpose-chars and lots of other very useful stuff...


I'm not the GP, but personally I'd consider syntax highlighting perfectly solved if it was handled in the language server. That way we could have highlighting performed by 100% accurate parsers instead of approximations. It wouldn't be as fast as in-editor simple tree-sitter highlighting, but I don't see any issue with my text appearing uncolored at first and gaining color as my editor polls the language server as I type.

Currently language servers support semantic highlighting which is close, but as far as I can tell, it seems meant to be supplemental to in-editor local highlighting rather than a full replacement.


You would only get the top half of your code highlighted, because a compiler doesn't usually continue when encountering syntax errors due to partial code.


> You would only get the top half of your code highlighted because a compiler

Your comment implies a language server taps into the language's compiler/interpreter. That is a popular misconception. Almost all LSP servers don't actually use the compiler backend of the language they are servicing. All the LSP server implementations I've seen, at least, just use a static parsing approach (or even worse, just a tokenizer) similar to what Treesitter does. It is just not limited to a single file.

That's why I still think Treesitter has the potential to not only improve syntax highlighting but also to simplify language servers or even -- in the long run, with some extensions of the language modes -- replace them altogether.


But some do. Most notably clangd. Still, clangd tries to parse broken code using recovery heuristics (which can still be useful in a compiler to produce decent error messages).


> Your comment implies a language server taps into the language's compiler/interpreter. That is a popular misconception.

But it is a misconception which framed the discussion, not a misconception of the answerer. The claim was that LSP was

> performed by 100% accurate parsers instead of approximations


If the code was valid at some point (e.g. when opening it) it could remember the old colorings of the rest of the code.

Though handling opening and closing a string and still ending up in an illegal state would need some more logic.. Maybe not worth it.


> If the code was valid at some point (e.g. when opening it) it could remember the old colorings of the rest of the code.

If you continue down this line of thinking a few more steps, eventually you produce tree-sitter.


That's sort-of already the case whenever we start a string in any language that supports multi-line strings. You type the first " and then the rest of your file gets re-interpreted as a string. It is annoying but I still think it's a worthy trade-off, because I spend a lot more time reading code than being mid-edit with invalid syntax.


Syntax errors are not limited to just typing a quote


I never said they were. I'm just pointing out that we already experience the same problem in the case of quotes without the world ending, and I don't think expanding the problem to all syntax errors will make it much worse.


It would make it significantly worse, because every incomplete statement that you're in the middle of writing would break all syntax highlighting right after your cursor. And if you use any modern editor or IDE, this doesn't happen.


> You would only get the top half of your code highlighted, because a compiler doesn't usually continue when encountering syntax errors due to partial code.

Sounds like a feature - don't need to scan for tiny squiggly red lines, or examine some tiny font output in a status bar somewhere.

I'd love it.


But you want syntax highlighting performed by approximations, because when editing your code is only approximately valid.


It would be even better if the code was always valid because we only edit the AST itself, like what paredit does for Lisp modes.


No.

(The code being valid s-expressions doesn’t even guarantee it’s syntactically valid after macro expansion, but even if it did, this is a bad idea.)


One interesting instance of semantic highlighting occurs in the Lean 4 language. It has extensible syntax whose parsers can be expressed in the language itself. Since parsing and evaluation are intertwined, highlighting can only really be effectively achieved in an LSP lest you re-implement the entirety of Lean.


Is anyone using treesitter with lsp-mode?

I see some people say it's possible and use both together but I thought for the most part language servers offer the same set of features, and probably better? My current mental model for how to use them together is that the majority of the languages I quickly read I set up treesitter for speed. For languages I read extensively or write I set up a language server.


they are mostly used for different things.

lsp (and lsp-mode) are mostly concerned with IDE functionality- go-to definition, show references, displaying project errors in real time without explicitly building, etc.

tree-sitter builds a syntax tree of your source code; its applications are things like syntax highlighting and structural navigation of your code.

there is some overlap in functionality, lsp has somewhat supported mechanisms for syntax highlighting iirc, but they are fairly orthogonal overall

so yes, it makes sense to use them together
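
If you want to poke at that syntax tree yourself, here is a small sketch of structural navigation with the built-in treesit functions. The command name is made up; it assumes Emacs 29+ and a buffer that is already in a tree-sitter major mode.

```elisp
;; Sketch: show the chain of node types from the root down to point.
;; `my/treesit-describe-node-at-point' is a hypothetical command name.
(require 'treesit)
(require 'subr-x)

(defun my/treesit-describe-node-at-point ()
  "Show the node type at point, preceded by the types of its ancestors."
  (interactive)
  (unless (treesit-parser-list)
    (user-error "No tree-sitter parser in this buffer"))
  (let ((node (treesit-node-at (point)))
        (types '()))
    ;; Walk upwards; pushing leaves the root first in the list.
    (while node
      (push (treesit-node-type node) types)
      (setq node (treesit-node-parent node)))
    (message "%s" (string-join types " > "))))
```

Emacs also ships `M-x treesit-explore-mode` and `M-x treesit-inspect-mode` if you would rather browse the tree interactively.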


Quick follow up, looked into the LSP spec and it does define "SemanticTokens" since 3.16

https://microsoft.github.io/language-server-protocol/specifi...

Obviously it depends on the LSP implementation but it looks expressive enough to replace treesitter?
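
For a sense of what the server actually sends: the spec encodes tokens as a flat integer array, five integers per token (deltaLine, deltaStartChar, length, tokenType, tokenModifiers), with positions relative to the previous token. A sketch of a decoder, with a hypothetical function name; mapping the type and modifier integers to names via the server's legend is left out.

```elisp
;; Sketch: decode the relative semantic-token encoding into absolute
;; positions. `my/lsp-decode-semantic-tokens' is a hypothetical helper.
(defun my/lsp-decode-semantic-tokens (data)
  "Decode DATA, a flat list of integers in LSP semantic-token encoding.
Return a list of plists with absolute :line and :char positions."
  (let ((line 0) (char 0) (tokens '()))
    (while data
      (let ((dline (nth 0 data))
            (dchar (nth 1 data)))
        ;; The start char is relative only when the token stays on the
        ;; same line as the previous one.
        (setq char (if (zerop dline) (+ char dchar) dchar)
              line (+ line dline))
        (push (list :line line :char char
                    :length (nth 2 data)
                    :type (nth 3 data)
                    :modifiers (nth 4 data))
              tokens))
      (setq data (nthcdr 5 data)))
    (nreverse tokens)))

;; Example call with three encoded tokens:
;; (my/lsp-decode-semantic-tokens '(2 5 3 0 3  0 5 4 1 0  3 2 7 2 0))
```

The editor still has to turn these spans into faces itself, and the tokens only arrive after a round trip to the server, which is part of why this tends to supplement local highlighting rather than replace it.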


That was interesting, thanks for pointing it out

I was tremendously sad to see that the Typescript Language Server wasn't owned by Microsoft <https://microsoft.github.io/language-server-protocol/impleme...>, since if there was any sanity in the world a spec bump would travel with a reference implementation showing how they envision such a thing being used

But, I found that the Typescript Language Server that they did list does indeed have a semantic-tokens module in it, although it's much shorter than I would have expected from reading that section in the spec: https://github.com/typescript-language-server/typescript-lan...


That code implements a (luckily) simple transformation from Microsoft's tsserver format to Microsoft's LSP format:

> Transforms the semantic token spans given by the ts-server into lsp compatible spans


Interesting, I know LSPs provide me some syntax highlighting and some structural navigation but I haven't compared the two APIs directly. I assumed most LSPs are a superset.


I use both. In my experience, syntax highlighting with language servers is slower than with tree-sitter.

It stands to reason: a language server often does way more than just incremental parsing of the source code into a concrete syntax tree. By limiting itself to syntax, tree-sitter can be much faster.


Do you keep the lsp semantic tokens capability enabled along with treesitter or only use treesitter for syntax highlights?


I've been using it a bit but it's still not on par with, well, vscode. It tends to be a bit slow on big files (say 10000+ lines): when you start typing an f-string in Python such as 'print(f"p={', once the opening brace is typed, it can get noticeably slow.

But well, I still love emacs :-)


It's so hard to give up your custom environment and keyboard muscle memory! (25 year emacs user here)


35 years here. 2/3rds of that without any syntax highlighting. My brain is a pretty good parser and I don't like visual distraction. Can you imagine if books had colored syntax highlighting?

I have now embraced some syntax highlighting, but with a very subtle color scheme. I tried LSP for a moment but the performance was such that it was a net negative.

I allocate a week over the holiday for "tooling refactors" and I'll probably kick the tires of emacs 29. The big refactor is building a new rig.



