Topiary: A code formatting engine leveraging Tree-sitter

hardwaregeek · on May 20, 2023

I'd be curious to know how effective this is at getting formatting close to a Prettier or rustfmt level of quality. I know that for Prettier there was a heavy layer of custom logic around the core printer just to get output that looked decent. Of course even if it only gets 80-90% of the way there, that's still a massive achievement.

It's fascinating seeing these tools that facilitate building better programming language experiences. I've called them "Tooling for Tooling", basically tools that make it easier to create tools like formatters, linters, etc.

ErinvanderVeen · on May 22, 2023

Hi!

We share your curiosity about the effectiveness of our approach! Right now, we want to make Topiary great for languages with less complex formatting rules, and "good enough" for languages that are a bit more complex, so that, on a one-off basis, you don't feel the need to get the dedicated formatter.

We don't yet have any ambitions to compete directly with Prettier and rustfmt among others.

Having said that, we are quite proud of how the OCaml rules turned out, and even had some great results with the Rust rules.

As we explore more, and expand the complexity of our tree-sitter scopes, who knows what kind of things we might be able to format!

It's all very exciting!

afiori · on May 21, 2023

It might be that JS/TS creates very hard to handle code, but I find Prettier's choices quite disappointing.

aidos · on May 20, 2023

Nice work!

I keep wondering, is there a reason everything isn’t just based off treesitter these days? If I were tasked with writing tsserver, my instinct would be to layer it over treesitter. Does anyone know if there are practical reasons that doesn’t happen, or is it just legacy?

The neovim world is slowly converging on using it for syntax. This project uses it for formatting. I personally use it in neovim for things like highlighting and formatting sql within strings in python.

hardwaregeek · on May 20, 2023

It's really annoying to produce an AST from tree-sitter. I tried writing my programming language parser with tree-sitter and it was a huge pain[1]. Anything with error recovery or good error messages is hard to customize too. If you want the ability to work with partial pieces of code in a homogeneous syntax tree and not an AST, then tree-sitter is great. Otherwise it's definitely rough around the edges.

[1]: https://uptointerpretation.com/posts/vicuna-update/

vkazanov · on May 20, 2023

I hope the author of the blog post realizes that treesitter was never meant to be "an editable ast" or something. It is a partial ast parser taking into account presence of errors and error reporting. It is written in C for a reason: it needs to quickly get a partial ast and report errors in it.

There are SO MANY perfect tools for language implementation and ast manipulation..! Starting with ML family languages that are very good at this.

hardwaregeek · on May 20, 2023

There aren’t that many great tools that also buy you into a syntax highlighting, code searching and the general tree-sitter ecosystem. If tree-sitter could produce an AST, it’d be a very compelling option for writing a compiler. Not to mention the robust, fast incremental parsing is precisely what modern compilers need, see Roslyn’s red green trees or rust analyzer’s Rowan crate.

vkazanov · on May 20, 2023

Having a reasonable formal grammar is like 90% of making a parser. Treesitter or not.

hardwaregeek · on May 21, 2023

If you’re writing a parser for a simple language, sure. But most programming languages have a sophisticated enough grammar, with stringent enough performance requirements and error handling requirements, that a library that can function as a single source of truth for your compiler and your tooling is very valuable. Take JavaScript, which has extremely hairy logic around JSX parsing and arrow functions. Or C’s issues with preprocessing.

vkazanov · on May 21, 2023

So you want to use a hammer for cutting trees because an axe looks somewhat similar anyway and it just doesn't make sense to have both :-) Don't blame your hammer for being a bad axe!

People have been looking for a universal approach to parsing for so long... maybe there is one, maybe not, but treesitter was never meant to be one.

And it's great for what it does!

hardwaregeek · on May 21, 2023

No...more like we're building a sophisticated infrastructure for cutting trees that handles trees that are malformed, processes them super efficiently, and handles all sorts of different species. And you seem to think that I want an axe.

If you want to create a parser for a toy language that produces an AST or a single error, then sure, that's trivial. But if you want a parser that does good error recovery, produces a high fidelity CST, and reuses memory in an efficient manner (red-green trees ideally), that's a lot of work. And that's table stakes for good programming language tooling. We're not in the era of emacs plugins that do regex syntax highlighting and call it a day. If there was a framework that could accomplish this, and function as a parser for the compiler (which is not so crazy, since most modern compilers are also the engines for tooling, i.e. language servers), that would be very valuable.

I agree that tree-sitter was never meant to be a universal solution, but I think it easily could be with some adjustments. And because of the existing infrastructure, because of the existing parsers, I think that it's reasonable to consider pushing tree-sitter in that direction instead of creating yet another parsing framework.
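
For readers unfamiliar with the red-green trees mentioned above, here is a minimal Python sketch of the idea. It is not Roslyn's or Rowan's actual API; the names `Green`, `Red`, and `replace_child` are made up for illustration. Green nodes are immutable and store only their kind, text, and children, so unchanged subtrees can be shared between versions of a tree. Red nodes are thin wrappers created on demand that add a parent pointer and an absolute offset.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterator, Optional, Tuple


@dataclass(frozen=True)
class Green:
    """Immutable node: no parent pointer, no absolute position."""
    kind: str
    text: str = ""                       # only meaningful for tokens
    children: Tuple[Green, ...] = ()

    @property
    def width(self) -> int:
        # A token's width is its text length; an inner node's width
        # is the sum of its children's widths.
        if self.children:
            return sum(child.width for child in self.children)
        return len(self.text)


@dataclass(frozen=True)
class Red:
    """On-demand view of a green node with a parent and an absolute offset."""
    green: Green
    parent: Optional[Red]
    offset: int

    def children(self) -> Iterator[Red]:
        position = self.offset
        for child in self.green.children:
            yield Red(child, self, position)
            position += child.width


def replace_child(node: Green, index: int, new_child: Green) -> Green:
    """Return a new green node; all untouched children are reused as-is."""
    kids = list(node.children)
    kids[index] = new_child
    return Green(node.kind, node.text, tuple(kids))


# Usage: edit "1+2" into "1+42" while sharing the unchanged tokens.
one, plus, two = Green("number", "1"), Green("op", "+"), Green("number", "2")
expr = Green("binary", children=(one, plus, two))
edited = replace_child(expr, 2, Green("number", "42"))
```

After the edit, `edited.children[0]` is the very same object as `one`, which is the memory reuse the comment refers to; a real implementation would also rebuild the spine of ancestors up to the root.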

vkazanov · on May 22, 2023

Let's see if somebody can come up with something replacing treesitter :-)

As somebody who came up with a couple of quick modes and parsers for Emacs and in Emacs Lisp I can say that for people like myself it's a blessing. I sincerely hate how there are numerous implementations of everything in dozens of editors out there, but nobody benefits from each other's work in a reasonable way... Treesitter's universal community-centric approach kind of resonates with the stronger side of OSS: suddenly all of these little steps individuals do contribute to the ecosystem as a whole.

Now, admittedly, all I need is an axe. I know I need an axe, treesit gives it to me and this makes me a happy little contributor.

So let's say somebody comes up with a factory of a tool. All inside: properly incremental, smart error handling, tree editing, transformations and stuff. Something tells me it would be much harder to contribute a simplified barely working grammar for that thing. And this kind of kills the point of an emacsy-ish moonlight hacker tool.

It would be useful, sure, but would it work in practice?

hardwaregeek · on May 24, 2023

Oh I totally agree with your sentiment about tree-sitter. That's why I want it to be extended in functionality. It makes so much sense to have a single place where one parser can be written and everybody benefits. Much like language servers.

Where I disagree is that IMO, tree-sitter already is very close to this ideal model. It has incremental parsing. It has great tree querying. Where it needs help is an AST facade over the raw syntax tree, which is very much feasible. rust-sitter[1] does it for instance. Tree-editing and tree construction is also very much doable. I don't think it'd have an impact on grammar construction at all. As for error recovery, I think it could function as a reparsing feature where you can drop down to a manual parser (or even a secondary grammar) that is more tolerant. Or an error recovery function that can be written in any language. tree-sitter already has the ability to use a manual lexer written in native code, so this is not such a stretch.

[1]: https://github.com/hydro-project/rust-sitter

junon · on May 20, 2023

Tree sitter isn't exactly the most ergonomic API or structure to use. In my opinion Neovim stuff moving entirely to TS made it worse than the existing tools that were out there. But for something like this, I think TS fits pretty nicely into the use case.

IshKebab · on May 20, 2023

TreeSitter is easy to use, pretty language neutral and it is tolerant to errors. Plus at this point people have already implemented support for a ton of languages.

My only issue with it is that it really only does half the job. You get a CST of sorts, but if you want to do anything with it you pretty much have to hand write another parser for that node tree.

In contrast parser combinator libraries like Nom and Chumsky give you "the final output".
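
To make the "hand write another parser for that node tree" step concrete, here is a minimal Python sketch of lowering a generic syntax tree into a typed AST. `CstNode` is a hypothetical stand-in for the kind of untyped node a tree-sitter binding gives you (a kind string, source text, children); it is not the tree-sitter API, and the node kinds (`number`, `binary_expression`, `parenthesized_expression`) are assumptions chosen for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class CstNode:
    """Hypothetical untyped syntax-tree node: every construct looks the same."""
    kind: str
    text: str = ""
    children: List[CstNode] = field(default_factory=list)


@dataclass
class Num:
    value: int


@dataclass
class BinOp:
    op: str
    left: Expr
    right: Expr


Expr = Union[Num, BinOp]


def lower(node: CstNode) -> Expr:
    """Walk the untyped tree and build typed AST nodes, dropping punctuation."""
    if node.kind == "number":
        return Num(int(node.text))
    if node.kind == "parenthesized_expression":
        # Parentheses only matter for grouping, so the AST omits them.
        inner = [c for c in node.children if c.kind not in ("(", ")")]
        return lower(inner[0])
    if node.kind == "binary_expression":
        left, op, right = node.children
        return BinOp(op.text, lower(left), lower(right))
    raise ValueError(f"unhandled node kind: {node.kind}")
```

Every new construct in the grammar needs another branch in `lower`, which is the duplicated effort the comment is describing.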

hardwaregeek · on May 20, 2023

Chumsky is so good for parsing. I’m not a huge combinator fan but man, it works so well. If only the errors weren’t so horrendous.

yewenjie · on May 20, 2023

VS Code and Monaco still don't support tree-sitter.

ckolkey · on May 20, 2023

The tree-sitter ecosystem continues to expand thanks in large part to neovim's enthusiastic adoption. It's a shame Atom was sunset in favor of VS Code - having spawned both Electron and Tree-sitter, I think it's clear which was more visionary.

ajoberstar · on May 20, 2023

VS Code brought Language Server Protocol, so I don't think the "vision" balance is as one-sided as you imply.

dbalatero · on May 20, 2023

And the debug adapter protocol IIRC.

vkazanov · on May 20, 2023

It continues to expand because there's nothing else like it: a language and editor agnostic tool for implementing parsers for IDEs. All of the editors come up with all kinds of niche hacks and DSLs. All editors keep reimplementing broken, partial, regexp soups..

I can only see more editors just giving up on this and going all in on treesitter.

It's not only neovim, btw, the next emacs is also gonna have it included.

rowanG077 · on May 20, 2023

Doesn't LSP subsume tree-sitter?

solarkraft · on May 20, 2023

The tree sitter ecosystem is very cool. I'm happy it exists.

My research into programming language parsing started with a very specific problem: I like folding code, and I like disabling ("commenting out") code to test behavior. Well, but (with rare exceptions: Xcode and nowadays some languages in VSCode) "commenting out" code breaks folding. I never got around to really solving it, but the learning involved (including about tree sitter) was very cool.

frou_dh · on May 20, 2023

It's ironic that just as the excellent https://github.com/ocaml-ppx/ocamlformat is seemingly closing in on a 1.0 after ~4 years of development, here comes the implication that it's not good enough.

junon · on May 20, 2023

Neat. I tried to do this exact thing a while back, leveraging TS as well, and struggled to find a generalized rule engine for it. I'll give this a try later, been hoping for something like this.

xupybd · on May 20, 2023

Can someone explain why semantic whitespace wouldn't work with a tool like this?

Xophmeister · on May 20, 2023

Topiary contributor here: In theory, I think a simple semantic whitespace language could work, provided the Tree-Sitter grammar for that language is adequate. Python, for example, might be possible, as long as we ignore things like line-continuations.

mhh__ · on May 20, 2023

You don't want an AST or even a full parser for formatting most languages.

Tree-sitter deals with errors better than most parser generators but if you just lex and separate into chunks then you can much more flexibly format broken code.

ErinvanderVeen · on May 22, 2023

Hej!

I agree with you in that there are many languages where skipping parsing altogether could still result in a good formatter, and I would love to see a Topiary-like project attempt it.

I don't feel confident in saying that that holds for most languages however, worrying that it can lead to a lot of ambiguity in languages with more complex formatting conventions.

Regardless, the eventual goal of Topiary is to be able to format the widest possible spectrum of languages, and so limiting ourselves to just lexing didn't seem like the right choice at the time.

Like you mention, this does mean we give up being able to format broken code. In fact, we currently even ensure that TS is able to parse the entire input before formatting. This is a shame, but ultimately what we decided was the best approach for Topiary to achieve its goal.
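
A rough Python sketch of that "parse everything or refuse" check, under stated assumptions: this is not Topiary's code, `Node` is a hypothetical stand-in for a parsed tree node, and `format_if_clean` is a made-up helper. The one thing taken from tree-sitter itself is that regions it cannot parse show up in the tree as nodes of type `ERROR`.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """Hypothetical stand-in for a parsed syntax-tree node."""
    type: str
    children: List["Node"] = field(default_factory=list)


def contains_error(root: Node) -> bool:
    """Return True if any node in the tree is an ERROR node."""
    stack = [root]
    while stack:
        node = stack.pop()
        if node.type == "ERROR":
            return True
        stack.extend(node.children)
    return False


def format_if_clean(root: Node, source: str, formatter: Callable[[str], str]) -> str:
    """Run the formatter only when the whole input parsed without errors."""
    if contains_error(root):
        raise ValueError("input does not parse cleanly; refusing to format")
    return formatter(source)
```

The trade-off is the one described above: a formatter that trusts the tree can never mangle code it misunderstood, but it also cannot help with a file that is temporarily broken.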

xigoi · on May 20, 2023

How do you produce something like this with just lexing?

    aaa(
      bbb,
      ccc(
        ddd,
        eee,
      ),
      fff,
    )

mhh__ · on May 20, 2023

Configure the line breaking heuristic properly.
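
As one possible reading of this answer, here is a minimal Python sketch of a formatter that only lexes. It tracks bracket depth and uses the crudest heuristic available: always break after `(` and `,`, and add a trailing comma before `)`. This is an illustration only, not how Topiary or any particular tool works; `format_call` and the token pattern are assumptions, and a realistic heuristic would also consider line width before deciding to break.

```python
import re

# Tokens: identifiers, parentheses, and commas. Whitespace is skipped.
TOKEN = re.compile(r"[A-Za-z_]\w*|[(),]")


def format_call(source: str, indent: str = "  ") -> str:
    lines = []
    current = ""   # text of the line being built, without indentation
    depth = 0      # number of currently open parentheses

    def flush() -> None:
        nonlocal current
        lines.append(indent * max(depth, 0) + current)
        current = ""

    for token in TOKEN.findall(source):
        if token == "(":
            current += token
            flush()
            depth += 1
        elif token == ",":
            current += token
            flush()
        elif token == ")":
            if current:
                # Last item had no trailing comma: add one.
                current += ","
                flush()
            depth -= 1
            current = ")"
        else:
            current += token
    if current:
        flush()
    return "\n".join(lines)


print(format_call("aaa(bbb, ccc(ddd, eee), fff)"))
```

Because nothing here needs a complete parse, the same function still produces indented output for unterminated input such as `aaa(bbb, ccc(ddd`.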

jensenbox · on May 20, 2023

Can this implement/emulate something like Python Black?

ErinvanderVeen · on May 22, 2023

Hi!

We are not sure right now because Topiary is still very much an experiment.

Having said that, we are constantly surprised by what we can do with Topiary. So with a dedicated Python developer willing to draft a set of rules, it might just be possible!