While Emacs 29.1 comes with "treesitter" built-in, you still need to manually build and install a treesitter language plugin for each language, which implements the actual language-specific parser. Doing this yourself can be fiddly and frustrating.
I had quick success using this convenience script: https://github.com/casouri/tree-sitter-module/. It provides fully-automated builds for the most popular languages (including typescript, c and c++).
This is how it works for "typescript":
1. Clone the repository: https://github.com/casouri/tree-sitter-module/
2. Install "build-essential" (it provides a C/C++ compiler) if you're on Linux.
3. Run "./build.sh typescript" from within the repo.
4. Copy the resulting shared library from "dist/libtree-sitter-typescript.so" into your "~/.emacs.d/tree-sitter/".
5. Open a random typescript file and try "M-x typescript-ts-mode", which should not give you any error but instead nice syntax highlighting.
You might find that a treesitter plugin is available for your language, and that it is even supported by "tree-sitter-module", but that there is no major mode for it yet. That happened to me with Perl 5.
Technically in Emacs 29.1 tree-sitter is still only an optional build option, which a given package maintainer may have 'built in' to your package. It isn't actually a default. If you build it from source you need to pass the --with-tree-sitter flag to ./configure. See:
What I read from this is that tree-sitter isn't considered quite ready by the Emacs maintainers, perhaps because of the restricted number of actual treesitter modes, or maybe because the treesitter support itself is not quite considered there yet?
I found this snippet in one of Mickey's earlier tree-sitter posts that works great. It does require searching through the tree-sitter repo to make sure your paths are correct:
I have done typescript successfully. It uses a different install folder than the default, as the repo has tree-sitter grammars for both tsx and typescript. C/C++ could be a similar situation. The installer should prompt you for it during the setup.
Is there anything that returns a parse tree of an org document? A while ago I wrote some super hacky elisp to navigate around the structure of a giant org mode doc, but it was rickety and terrible and constantly breaking.
Part of this is surely that I don't know wtf I'm doing, but it seemed like there was not an underlying data structure held in memory that you could conveniently query / manipulate, but rather, most of the existing org functionality built some kind of structure each time you did an operation.
Would appreciate any pointers, code examples, tutorials that show how to effectively navigate / manipulate an org structure and have it reflected in the buffer, if there is such a thing.
What was awful about it? It's been a while since setting up but my config uses it to pluck the important bits from my Org library into SQLite as I edit; it works well enough and wasn't difficult to set up or understand. Admittedly it is relegated to this one step so that everything else can query docs with SQL but I was quite happy that the API exists so as not to do the parsing myself with the mistakes and minor deviations that would entail.
In org-alert we use `org-map-entries` and a simple `org-alert--parse-entry` function for stripping out the details we're looking for. Depending on what you want, it's not exactly a data structure, but maybe it will help you get started!
While it doesn't properly understand the structure, you can move around pretty well with Imenu or (configured) org-goto. I assume it's also possible to make something for it so that it takes nesting into consideration like it does for some programming languages. My org files are only a couple of thousand lines though, so I don't know how they perform when a file gets larger than that.
This is from the author of the excellent book Mastering Emacs.
I am very far from being knowledgeable about programming on the Emacs platform, but I am trying to learn. I grabbed the name M-x-AI.com a while back with the goal of integrating other people’s Emacs packages with some of my own hacks into a better AI dev work environment and writing a short book on it. I have been using Emacs since, I think, 1982. There are so many good new packages for integrating CoPilot, GPT-4, etc., as well as major Emacs platform improvements that are too many to list.
Jupyter can be used with Org Babel (interactive output, ANSI escape codes, plots). I haven't tried widgets; Shiny apps mostly cover this use case for me: https://shiny.posit.co/
I'm not trying to bash Emacs or treesitter or anyone. But I find it mildly amusing that after so many decades, parsing and syntax highlighting aren't a perfectly solved problem, considering programming languages are the most used tools for developers.
Parsing of a correct program is a pretty "solved" problem.
But fast enough re-parsing of fragments and recovery from errors is a much more complex problem that often doesn't have a single correct answer. It's also a much newer problem, inasmuch as syntax highlighting is a much newer feature, preceded largely by "offline" pretty-printers with very different constraints.
The extent to which modern compilers try to parse past errors still varies greatly, with a whole lot not even trying to.
But having just any recovering parser also does not mean the problem is solved. E.g. you've typed "foo". Now you type "(". It'd be very annoying if your editor now re-colors everything as an error, so you typically want some error recovery. But how soon? Do you assume the tokens immediately afterwards are part of what was a valid expression until you typed "foo", or are they a valid part of an argument list? And where do they end? Do you just delay re-parsing until the user has typed more? Or left the line? Sometimes that can help, sometimes it will just make things worse.
Parsing methods that work fine if you assume you can "reset" the parse at many different points (which tends to constrain the area considered an error, and so reduces the size of a typical re-parse) will fail badly if you want stricter re-parsing that may frequently trigger reparsing most of a file, for example.
A lot of this is subjective, and picking the "right" way of handling it largely comes down to unpacking humans' unstated preferences, and trying to reconcile competing and possibly contradictory preferences.
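To illustrate the "reset" idea, here is a minimal Python sketch. It is a toy, not how any real editor or tree-sitter works: it assumes a made-up language where ";" ends a statement, and the names `balanced` and `error_statements` are invented for this example. Treating ";" as a point where the parse can reset keeps a missing parenthesis confined to one statement, while a strict whole-file check can only say that the file is broken.

```python
def balanced(text):
    """Return True if every "(" in text has a matching ")"."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # a ")" with no opener
    return depth == 0


def error_statements(src):
    """Indices of the ';'-separated statements that fail to parse.

    Each ';' is treated as a synchronisation point (an assumption of
    this toy language), so an error never leaks into the next statement.
    """
    return [i for i, stmt in enumerate(src.split(";")) if not balanced(stmt)]


source = "a = f(1); b = g(; c = h(2)"
print(error_statements(source))  # only the middle statement is flagged
print(balanced(source))          # the strict whole-file check just fails
```

The trade-off described above shows up even here: the more places you allow a reset, the smaller the error region, but the more you are guessing about what the user meant.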
> it's also a much newer problem inasmuch as syntax highlighting is a much newer feature
Just an aside. I wouldn't associate syntax highlighting with new. It's almost as old as text editors. Accurate syntax highlighting in real-time visual text editors is over 40 years old by now and was becoming common by the late 1980s.
GNU Emacs gained syntax highlighting in 1989 and it was considered late to the party. Many programming editors and IDEs had syntax highlighting by then.
I spent a couple of months working with a tool along those lines. Unfortunately, it's one of those issues where it won't work without the galactic emperor mandating these tools under pain of death. A colleague using VS Code would commit code with unbalanced parentheses. The tooling was smart enough that it could load the file, even though it wasn't parsable, but there was no structural way to fix the missing closing brace. Adding a closing brace automatically inserted another opening brace, leaving the original mismatch.
This is not fully correct. You are right that lisp has very regular atoms that you can read, but the full parsing is different. Consider that (let (a 0) (1+ a)) is not valid syntax, so if you were to add color for the different parts, it would fail. Indeed, if you want to label the parts of the program tree, you have to parse it more than just "lists of atoms."
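To make the difference between reading and parsing concrete, here is a minimal Python sketch. It is not Emacs code, and the names `read`, `is_symbol` and `let_bindings_ok` are made up for illustration. The reader happily turns the form into nested lists, and only a second pass that knows the shape of `let` (each binding is a symbol or a list starting with a symbol) notices that the binding list is wrong.

```python
import re


def read(src):
    """Read one s-expression into nested Python lists of atom strings."""
    tokens = src.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_form():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while pos < len(tokens) and tokens[pos] != ")":
                items.append(read_form())
            if pos >= len(tokens):
                raise SyntaxError("missing closing parenthesis")
            pos += 1  # consume the ")"
            return items
        if tok == ")":
            raise SyntaxError("unexpected closing parenthesis")
        return tok

    form = read_form()
    if pos != len(tokens):
        raise SyntaxError("trailing tokens after the form")
    return form


def is_symbol(atom):
    """Treat any atom that is not a plain number as a symbol (a simplification)."""
    return isinstance(atom, str) and not re.fullmatch(r"[+-]?\d+(\.\d*)?", atom)


def let_bindings_ok(form):
    """Check the shape (let (BINDING...) BODY...)."""
    if not (isinstance(form, list) and len(form) >= 2 and form[0] == "let"):
        return False
    bindings = form[1]
    if not isinstance(bindings, list):
        return False
    for b in bindings:
        if isinstance(b, list):
            # (symbol) or (symbol value)
            if not (1 <= len(b) <= 2 and is_symbol(b[0])):
                return False
        elif not is_symbol(b):
            return False  # e.g. the bare 0 in (let (a 0) ...)
    return True


bad = read("(let (a 0) (1+ a))")
good = read("(let ((a 0)) (1+ a))")
print(bad, let_bindings_ok(bad))
print(good, let_bindings_ok(good))
```

Both forms read fine as "lists of atoms"; only the `let`-aware check tells them apart, which is the extra parsing a highlighter would need.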
because in most languages there's a level of abstraction between the syntax of the language and the actual data structure of the program. To use the much maligned term, most languages aren't homoiconic: the internal structure of the program is hidden away from the programmer.
Lisps essentially have no syntax, that's why it is trivial to manipulate a lisp program structurally.
You can do this with most other languages too, but I think the challenge is that a lot of us have habits that we'd need to change. E.g. with structural editing I couldn't easily decide I want to add exception handling around a block in Ruby by writing "begin", then moving down, and writing "rescue ..." and "end" - I'd likely need to mark the block first. That may well end up being faster, but it's a fairly big change.
I wonder, though, if a "semi-structural" editor with a keypress to "complete what is otherwise an error" would work well. E.g. I write "begin" and so unbalance an expression; the editor looks at what I typed that changed the parse from successful to an error and puts up options to insert "rescue ... end" or just "end". Maybe coupled with a warning on save if the file does not parse.
> Fourth, Microsoft itself doesn’t try to take advantage of M + N. There’s no universal LSP implementation in VS Code. Instead, each language is required to have a dedicated plugin with physically independent implementations of LSP.
There's no (official) LSP implementation for Typescript either. Instead of using LSP, Microsoft maintains tsserver which uses a custom protocol for better integration.
Take-home message: don't try to write a universal tool that solves everything, as it will be lowest common denominator. Create building blocks such as Tree-sitter, which make specific (M×N) integrations easier and more powerful.
Also LSP was never going to completely eliminate the need for language-specific plugins. But it does significantly reduce the amount of work needed to build each one, providing a consistent baseline.
In my classification, Tree-sitter is a building block (a parser), not a tool (compiler, editor, highlighter etc.) And as seen in this HN submission about implementing HTML support for Emacs, it's used in M×N integrations.
It's an M+N tool in my opinion because it parses M languages with a uniform interface, for use in N higher-level tools. I can use the same Python grammar/parser with any tool that's based on Tree Sitter.
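For anyone unfamiliar with the M+N versus M×N shorthand, the arithmetic is simple enough to sketch in Python. The function name and the numbers below are made up purely for illustration: with a shared interface you write one grammar per language and one adapter per tool, without one you write a separate integration for every pair.

```python
def integrations_needed(num_languages, num_tools, shared_interface):
    """Count the pieces of glue code someone has to write and maintain."""
    if shared_interface:
        # one grammar per language plus one adapter per tool
        return num_languages + num_tools
    # every tool needs its own support for every language
    return num_languages * num_tools


# Hypothetical numbers: 10 languages, 5 tools.
print(integrations_needed(10, 5, shared_interface=True))
print(integrations_needed(10, 5, shared_interface=False))
```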
Interestingly, treesitter is designed for this space (parsing invalid structures with a time component) but it’s still not used as a base for LSPs. I once asked why on HN and people that know more than me said it wasn’t suitable.
tree-sitter definitely can be used as an incremental parser for an LSP. But the original purpose was and still is to provide a standard format for an editor-agnostic way to parse languages.
As somebody who wrote a tree-sitter grammar recently I can confirm that the library definitely has its share of... Interesting choices. But there's nothing else like it as most parser generators are non-incremental, don't do generalized parsing and don't provide decent error recovery.
Well, nothing truly horrible but here are things I find inconvenient (but working nevertheless) when using the sdk.
1. NodeJS-based utilities. The tooling around js doesn't lend itself well to integration with the dev's environment.
2. The sdk itself is a mix of Rust, javascript and C. I'd say that this is at least one language too many - makes it harder to contribute. Ideally such fundamental projects should be as simple and homogeneous as possible.
3. I keep wondering if generating languages other than C should be supported.
The core idea of the library is brilliant either way so it sort of compensates all of the above.
There will never be "perfectly" solved problems at that complexity level. There are always changing requirements and room for improvement. Make it faster, add new features, use new hardware abilities, follow the flavors of the decade: it is an eternal game of catching up.
Well, there are more problems like that. You'd think diffing is a solved problem, and yet we still struggle with syntax-aware diffs (I use difftastic, which is great, but doesn't always work well, and is under constant development).
It's not really a solved problem in general. Most editors appear to use TextMate grammars which nobody likes. Otherwise you have to implement it using whatever custom setup your specific editor uses. It just happens that most languages have some poor soul who set this up already. Emacs is actually on the better side because tree-sitter is a much better setup for writing grammars.
In a perfect world of emacs/tree-sitter, I would imagine tree-sitter to be a single minor mode. You don't need 'java-ts-mode' and 'c-ts-mode'. Just a single 'treesitter-mode' toggle and it will call the treesitter binary to do the hard lifting.
A `treesitter-mode` is a so-so idea, because the way you interact with different languages is different. The way Lisp sexps are handled in an IDE is different from the way Python's syntax is handled. It doesn't make sense to try and have one mode that handles both - it is better to have many modes that interact with the common data structures that treesitter provides and then provide language-specific convenience functions.
Even if all it was supposed to do is syntax highlighting, different languages are usually highlighted differently and presumably the person maintaining the bindings would need to be a language expert with strong opinions. The options are they either maintain their own mode or a module in `treesitter-mode`, which is basically 2 ways of describing the same situation.
Although to address your other point it is quite funny that after 40 years and 29 versions of experimenting in OS design, GNU Emacs is starting to implement syntax highlighting properly for its system text editor.
A treesitter-mode would make sense for organizational reasons and user experience. But under the hood it would be just a proxy mode which figures out the language and loads the appropriate treesitter sub-modes. And maybe in a more advanced version it could also mix modes for different languages.
But you're describing Emacs. Emacs figures out the language and loads the appropriate sub-modes. It even calls them modes. Modes are mixable.
We're in a transition period while everything is rewritten to use tree-sitter. In a few years all the default major modes in Emacs are probably going to be tree-sitter based - unless the maintainers believe the existing modes to be better than the tree-sitter ones.
> Emacs figures out the language and loads the appropriate sub-modes.
Not really. You set up hooks, usually based on file extension, and Emacs executes the hooks. Emacs itself has no understanding of languages. And this will not work well when no hook is executed.
> Modes are mixable.
You can load multiple minor-modes, but there is always just one major-mode per buffer, and languages are usually major-modes. But ok, treesitter can be handled as a minor mode. It still needs explicit support, though. You cannot give treesitter control over a range of text, let it figure out the language automatically and let it do its thing.
It is a classic joke. Emacs is often a little behind competing text editors (Vim has better keybindings IMO, things like Visual studio have better IDE support, Atom text editor turns out to have better syntax highlighting, etc, etc).
So, the line of the joke goes, it must not really be a text editor because it is not very good at it and has a wild array of other capabilities. Like "M-x dunnet". Must be an OS.
Of course the reality is that Emacs just copies good ideas from other people. A pretty normal Emacs setup uses Vim keybindings, LSP and now tree-sitter to get a good experience programming.
> Of course the reality is that Emacs just copies good ideas from other people. A pretty normal Emacs setup uses Vim keybindings, LSP and now tree-sitter to get a good experience programming.
Too bad other people don't copy good ideas from Emacs. Why other text editors/processors have no equivalent of view-lossage is a mystery to me. Also, where-is, describe-key, insert-char, transpose-chars and lots of other very useful stuff...
I'm not the GP, but personally I'd consider syntax highlighting perfectly solved if it was handled in the language server. That way we could have highlighting performed by 100% accurate parsers instead of approximations. It wouldn't be as fast as in-editor simple tree-sitter highlighting, but I don't see any issue with my text appearing uncolored at first and gaining color as my editor polls the language server as I type.
Currently language servers support semantic highlighting which is close, but as far as I can tell, it seems meant to be supplemental to in-editor local highlighting rather than a full replacement.
You would only get the top half of your code highlighted because a compiler doesn't usually continue when encountering syntax errors due to partial code.
> You would only get the top half of your code highlighted because a compiler
Your comment implies a language server taps into the language's compiler/interpreter. That is a popular misconception. Almost all LSP servers don't actually use the compiler backend of the language they are servicing. All the LSP server implementations I've seen, at least, just use a static parsing approach (or even worse, just a tokenizer) similar to what Treesitter does. It is just not limited to a single file.
That's why I still think Treesitter has the potential to not only improve syntax highlighting but also to simplify language servers or even -- in the long run with some extensions of the language modes -- replace them altogether.
But some do. Most notably clangd. Still, clangd tries to parse broken code using recovery heuristics (which can still be useful in a compiler to produce decent error messages).
That's sort-of already the case whenever we start a string in any language that supports multi-line strings. You type the first " and then the rest of your file gets re-interpreted as a string. It is annoying but I still think it's a worthy trade-off, because I spend a lot more time reading code than being mid-edit with invalid syntax.
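The quote effect is easy to reproduce with a toy highlighter. This is a minimal Python sketch, assuming a made-up language where strings are delimited by double quotes, may span lines, and have no escapes; the function name `classify` is invented for the example. Every character is labelled either "S" (string) or "c" (code), and dropping a single closing quote flips everything after it.

```python
def classify(src):
    """Label each character 'S' (inside a string literal) or 'c' (code)."""
    labels = []
    in_string = False
    for ch in src:
        if ch == '"':
            labels.append("S")          # the quote itself belongs to the string
            in_string = not in_string   # toggle on every quote, no escapes
        else:
            labels.append("S" if in_string else "c")
    return "".join(labels)


complete = 'x = "hi"\ny = 1'
mid_edit = 'x = "hi\ny = 1'   # closing quote not typed yet

print(classify(complete))
print(classify(mid_edit))
```

In the mid-edit version the whole second line is labelled as string content, which is the "rest of your file gets re-interpreted" behaviour described above.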
I never said they were. I'm just pointing out that we already experience the same problem in the case of quotes without the world ending, and I don't think expanding the problem to all syntax errors will make it much worse.
It would make it significantly worse. Because every incomplete statement that you're just writing would break all syntax highlighting right after your cursor. And if you use any modern editor or IDE this doesn't happen.
> You would only get the top half of your code highlighted because a compiler doesn't usually continue when encountering syntax errors due to partial code
Sounds like a feature - don't need to scan for tiny squiggly red lines, or examine some tiny font output in a status bar somewhere.
One interesting instance of semantic highlighting occurs in the Lean 4 language. It has extensible syntax whose parsers can be expressed in the language itself. Since parsing and evaluation are intertwined, highlighting can only really be effectively achieved in an LSP lest you re-implement the entirety of Lean.
I see some people say it's possible and use both together but I thought for the most part language servers offer the same set of features, and probably better? My current mental model for how to use them together is that the majority of the languages I quickly read I set up treesitter for speed. For languages I read extensively or write I set up a language server.
lsp (and lsp-mode) are mostly concerned with IDE functionality: go-to definition, show references, displaying project errors in real time without explicitly building, etc.
tree-sitter builds a syntax tree of your source code; its applications are things like syntax highlighting and structural navigation of your code.
there is some overlap in functionality, lsp has somewhat supported mechanisms for syntax highlighting iirc, but they are fairly orthogonal overall
I was tremendously sad to see that the Typescript Language Server wasn't owned by Microsoft <https://microsoft.github.io/language-server-protocol/impleme...>, since if there was any sanity in the world a spec bump would travel with a reference implementation showing how they envision such a thing being used
But, I found that the Typescript Language Server that they did list does indeed have a semantic-tokens module in it, although it's much shorter than I would have expected from reading that section in the spec: https://github.com/typescript-language-server/typescript-lan...
Interesting, I know LSPs provide me some syntax highlighting and some structural navigation but I haven't compared the two APIs directly. I assumed most LSPs are a superset.
I use both. In my experience, syntax highlighting with language servers is slower than with tree-sitter.
It stands to reason: a language server often does way more than just incremental parsing of the source code into a concrete syntax tree. By limiting itself to syntax, tree-sitter can be much faster.
I've been using it a bit but it's still not on par with, well, vscode. It tends to be a bit slow on big files (say 10000+ lines): when you type an f-string in python such as 'print(f"p={', it can get noticeably slow once the opening brace is typed in.
35 years here. 2/3rds of that without any syntax highlighting. My brain is a pretty good parser and I don't like visual distraction. Can you imagine if books had colored syntax highlighting?
I have now embraced some syntax highlighting, but with a very subtle color scheme. I tried LSP for a moment but the performance was such that it was a net negative.
I allocate a week over the holiday for "tooling refactors" and I'll probably kick the tires of emacs 29. The big refactor is building a new rig.