Hacker News
Topiary: A code formatting engine leveraging Tree-sitter (tweag.io)
120 points by Xophmeister on May 20, 2023 | 33 comments


I'd be curious to know how effective this is at getting formatting close to a Prettier or rustfmt level of quality. I know that for Prettier there was a heavy layer of custom logic around the core printer just to get output that looked decent. Of course even if it only gets 80-90% of the way there, that's still a massive achievement.

It's fascinating seeing these tools that facilitate building better programming language experiences. I've called them "Tooling for Tooling", basically tools that make it easier to create tools like formatters, linters, etc.


Hi!

We share the same curiosity for the effectiveness of our approach! Right now, we want to make Topiary great for languages with less complex formatting rules, and "good enough" for languages that are a bit more complex, where, on a one-off basis, you don't feel the need to get the dedicated formatter.

We don't yet have any ambitions to compete directly with Prettier and rustfmt among others.

Having said that, we are quite proud of how the OCaml rules turned out, and even had some great results with the Rust rules.

As we explore more, and expand the complexity of our tree-sitter scopes, who knows what kind of things we might be able to format!

It's all very exciting!


It might be that JS/TS creates very hard to handle code, but I find Prettier's choices quite disappointing.


Nice work!

I keep wondering, is there a reason everything isn’t just based off treesitter these days? If I were tasked with writing tsserver, my instinct would be to layer it over treesitter. Does anyone know if there are practical reasons that doesn’t happen, or is it just legacy?

The neovim world is slowly converging on using it for syntax. This project for formatting. I personally use it in neovim for things like highlighting and formatting sql within strings in python.


It's really annoying to produce an AST from tree-sitter. I tried writing my programming language parser with tree-sitter and it was a huge pain[1]. Anything with error recovery or good error messages is hard to customize too. If you want the ability to work with partial pieces of code in a homogeneous syntax tree and not an AST, then tree-sitter is great. Otherwise it's definitely rough around the edges.

[1]: https://uptointerpretation.com/posts/vicuna-update/


I hope the author of the blog post realizes that treesitter was never meant to be "an editable ast" or something. It is a partial ast parser taking into account the presence of errors and error reporting. It is written in C for a reason: it needs to quickly get a partial ast and report errors in it.

There are SO MANY perfect tools for language implementation and ast manipulation..! Starting with ML family languages that are very good at this.


There aren’t that many great tools that also buy you into syntax highlighting, code searching and the general tree-sitter ecosystem. If tree-sitter could produce an AST, it’d be a very compelling option for writing a compiler. Not to mention the robust, fast incremental parsing is precisely what modern compilers need, see Roslyn’s red-green trees or rust-analyzer’s rowan crate.


Having a reasonable formal grammar is like 90% of making a parser. Treesitter or not.


If you’re writing a parser for a simple language, sure. But most programming languages have a sophisticated enough grammar, with stringent enough performance requirements and error handling requirements, that a library that can function as a single source of truth for your compiler and your tooling is very valuable. Take JavaScript, which has extremely hairy logic around JSX parsing and arrow functions. Or C’s issues with preprocessing.


So you want to use a hammer for cutting trees because an axe looks somewhat similar anyway and it just doesn't make sense to have both :-) don't blame your hammer for being a bad axe!

People have been looking for universal approach to parsing for so long... maybe there is one, maybe not, but treesitter was never meant to be one.

And it's great for what it does!


No...more like we're building a sophisticated infrastructure for cutting trees that handles trees that are malformed, processes them super efficiently, and handles all sorts of different species. And you seem to think that I want an axe.

If you want to create a parser for a toy language that produces an AST or a single error, then sure, that's trivial. But if you want a parser that does good error recovery, produces a high fidelity CST, and reuses memory in an efficient manner (red-green trees ideally), that's a lot of work. And that's table stakes for good programming language tooling. We're not in the era of emacs plugins that do regex syntax highlighting and call it a day. If there was a framework that could accomplish this, and function as a parser for the compiler (which is not so crazy, since most modern compilers are also the engines for tooling, i.e. language servers), that would be very valuable.

I agree that tree-sitter was never meant to be a universal solution, but I think it easily could be with some adjustments. And because of the existing infrastructure, because of the existing parsers, I think that it's reasonable to consider pushing tree-sitter in that direction instead of creating yet another parsing framework.
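
For readers unfamiliar with the "red-green trees" mentioned above, here is a minimal sketch of the idea in Python. It is only loosely modelled on the concept and is not the API of Roslyn, rowan, or tree-sitter; the names `Green`, `Red`, and `replace_child` are made up for illustration. Green nodes are immutable and position-independent, so an edit can rebuild only the path to the changed node and share everything else. Red nodes are thin wrappers that add parent links and absolute offsets on demand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Green:
    """Immutable, position-independent node (hypothetical name)."""
    kind: str
    text: str = ""
    children: tuple["Green", ...] = ()

    @property
    def width(self):
        # Recomputed on each call to keep the sketch short;
        # a real implementation would cache this.
        if self.children:
            return sum(c.width for c in self.children)
        return len(self.text)


class Red:
    """Wrapper adding a parent link and an absolute offset to a green node."""

    def __init__(self, green, parent=None, offset=0):
        self.green, self.parent, self.offset = green, parent, offset

    def children(self):
        pos = self.offset
        for g in self.green.children:
            yield Red(g, self, pos)
            pos += g.width


def replace_child(green, index, new_child):
    """Return a new parent; untouched siblings are shared, not copied."""
    kids = green.children[:index] + (new_child,) + green.children[index + 1:]
    return Green(green.kind, green.text, kids)


# foo(x), then rename foo -> longer_name
args = Green("args", children=(Green("(", "("), Green("ident", "x"), Green(")", ")")))
call = Green("call", children=(Green("ident", "foo"), args))
edited = replace_child(call, 0, Green("ident", "longer_name"))

shared = edited.children[1] is call.children[1]
old_offsets = [r.offset for r in Red(call).children()]
new_offsets = [r.offset for r in Red(edited).children()]
```

After the edit, the `args` subtree is the very same object in both trees, while the red wrappers report its new absolute offset without the green nodes storing any positions.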


Let's see if somebody can come up with something replacing treesitter :-)

As somebody who came up with a couple of quick modes and parsers for Emacs and in Emacs Lisp I can say that for people like myself it's a blessing. I sincerely hate how there are numerous implementations of everything in dozens of editors out there, but nobody benefits from each other's work in a reasonable way... Treesitter's universal community-centric approach kind of resonates with the stronger side of OSS: suddenly all of these little steps individuals do contribute to the ecosystem as a whole.

Now, admittedly, all I need is an axe. I know I need an axe, treesit gives it to me and this makes me a happy little contributor.

So let's say somebody comes up with a factory of a tool. All inside: properly incremental, smart error handling, tree editing, transformations and stuff. Something tells me it would be much harder to contribute a simplified barely working grammar for that thing. And this kind of kills the point of an emacsy-ish moonlight hacker tool.

It would be useful, sure, but would it work in practise?


Oh I totally agree with your sentiment about tree-sitter. That's why I want it to be extended in functionality. It makes so much sense to have a single place where one parser can be written and everybody benefits. Much like language servers.

Where I disagree is that IMO, tree-sitter already is very close to this ideal model. It has incremental parsing. It has great tree querying. Where it needs help is an AST facade over the raw syntax tree, which is very much feasible. rust-sitter[1] does it for instance. Tree-editing and tree construction is also very much doable. I don't think it'd have an impact on grammar construction at all. As for error recovery, I think it could function as a reparsing feature where you can drop down to a manual parser (or even a secondary grammar) that is more tolerant. Or an error recovery function that can be written in any language. tree-sitter already has the ability to use a manual lexer written in native code, so this is not such a stretch.

[1]: https://github.com/hydro-project/rust-sitter


Tree sitter isn't exactly the most ergonomic API or structure to use. In my opinion Neovim stuff moving entirely to TS made it worse than the existing tools that were out there. But for something like this, I think TS fits pretty nicely into the use case.


TreeSitter is easy to use, pretty language neutral and it is tolerant to errors. Plus at this point people have already implemented support for a ton of languages.

My only issue with it is that it really only does half the job. You get a CST of sorts, but if you want to do anything with it you pretty much have to hand write another parser for that node tree.

In contrast parser combinator libraries like Nom and Chumsky give you "the final output".
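
To make the "hand write another parser for that node tree" point concrete, here is a small sketch in Python of lowering a CST into a typed AST. The `CstNode` class is a stand-in for the kind of generic node a parser like tree-sitter hands back (a kind string, some text, children); it is not tree-sitter's actual API, and the node kinds used here are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CstNode:
    """Generic concrete-syntax node (hypothetical stand-in, not a real API)."""
    kind: str
    text: str = ""
    children: list["CstNode"] = field(default_factory=list)


@dataclass
class Num:
    value: int


@dataclass
class BinOp:
    op: str
    left: "Num | BinOp"
    right: "Num | BinOp"


def lower(node):
    """Walk the untyped CST and build typed AST nodes, dropping punctuation."""
    if node.kind == "number":
        return Num(int(node.text))
    if node.kind == "parenthesized_expression":
        # Parentheses only matter for grouping; keep the inner expression.
        inner = [c for c in node.children if c.kind not in ("(", ")")]
        return lower(inner[0])
    if node.kind == "binary_expression":
        left, op, right = node.children
        return BinOp(op.text, lower(left), lower(right))
    raise ValueError(f"unhandled CST node kind: {node.kind}")


# CST for: (1 + 2) * 3
example_cst = CstNode("binary_expression", children=[
    CstNode("parenthesized_expression", children=[
        CstNode("(", "("),
        CstNode("binary_expression", children=[
            CstNode("number", "1"), CstNode("+", "+"), CstNode("number", "2"),
        ]),
        CstNode(")", ")"),
    ]),
    CstNode("*", "*"),
    CstNode("number", "3"),
])

ast = lower(example_cst)
```

Every node kind needs its own case like this, which is the second layer of work the comment describes; a parser combinator would build `Num` and `BinOp` directly while parsing.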


Chumsky is so good for parsing. I’m not a huge combinator fan but man, it works so well. If only the errors weren’t so horrendous.


VS Code and Monaco still don't support tree-sitter.


The tree-sitter ecosystem continues to expand thanks in large part to neovim's enthusiastic adoption. It's a shame Atom was sunset in favor of VS Code - having spawned both Electron and Tree-sitter, I think it's clear which was more visionary.


VS Code brought Language Server Protocol, so I don't think the "vision" balance is as one-sided as you imply.


And the debug adapter protocol IIRC.


It continues to expand because there's nothing else like it: a language and editor agnostic tool for implementing parsers for IDEs. All of the editors come up with all kinds of niche hacks and DSLs. All editors keep reimplementing broken, partial, regexp soups.

I can only see more editors just giving up on this and going all in on treesitter.

It's not only neovim, btw, the next emacs is also gonna have it included.


Doesn't LSP subsume tree-sitter?


The tree sitter ecosystem is very cool. I'm happy it exists.

My research into programming language parsing started with a very specific problem: I like folding code, and I like disabling ("commenting out") code to test behavior. Well, but (with rare exceptions: Xcode and nowadays some languages in VSCode) "commenting out" code breaks folding. I never got around to really solving it, but the learning involved (including about tree sitter) was very cool.


It's ironic that just as the excellent https://github.com/ocaml-ppx/ocamlformat is seemingly closing in on a 1.0 after ~4 years of development, here comes the implication that it's not good enough.


Neat. I tried to do this exact thing a while back, leveraging TS as well, and struggled to find a generalized rule engine for it. I'll give this a try later, been hoping for something like this.


Can someone explain why semantic whitespace wouldn't work with a tool like this?


Topiary contributor here: In theory, I think a simple semantic whitespace language could work, provided the Tree-Sitter grammar for that language is adequate. Python, for example, might be possible, as long as we ignore things like line continuations.


You don't want an AST or even a full parser for formatting most languages.

Tree-sitter deals with errors better than most parser generators but if you just lex and separate into chunks then you can much more flexibly format broken code.


Hej!

I agree with you in that there are many languages where skipping parsing altogether could still result in a good formatter, and I would love to see a Topiary-like project attempt it.

I don't feel confident in saying that that holds for most languages however, worrying that it can lead to a lot of ambiguity in languages with more complex formatting conventions.

Regardless, the eventual goal of Topiary is to be able to format the widest possible spectrum of languages, and so limiting ourselves to just lexing didn't seem like the right choice at the time.

Like you mention, this does mean we give up being able to format broken code. In fact, we currently even ensure that TS is able to parse the entire input before formatting. This is a shame, but ultimately what we decided was the best approach for Topiary to achieve its goal.


How do you produce something like this with just lexing?

    aaa(
      bbb,
      ccc(
        ddd,
        eee,
      ),
      fff,
    )


Configure the line breaking heuristic properly.
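
As a rough illustration of what a lexing-only approach could look like for the snippet above, here is a sketch in Python. It is not how Topiary works, and not necessarily what the commenter has in mind: it only tokenizes and tracks bracket depth, and its "heuristic" is the crudest possible one (always break after `(` and `,`, and add a trailing comma). The names `tokenize` and `format_calls` are made up for this example.

```python
import re


def tokenize(src):
    """Pick out identifiers and the tokens ( ) , and ignore everything else."""
    return re.findall(r"[A-Za-z_]\w*|[(),]", src)


def format_calls(src, indent="  "):
    """Re-indent nested calls using only the token stream and a depth counter."""
    lines, current, depth = [], "", 0
    for tok in tokenize(src):
        if tok == "(":
            lines.append(indent * depth + current + "(")
            current, depth = "", depth + 1
        elif tok == ",":
            lines.append(indent * depth + current + ",")
            current = ""
        elif tok == ")":
            if current:
                # Last argument (or a nested close): emit with a trailing comma.
                lines.append(indent * depth + current + ",")
            depth -= 1
            current = ")"
        else:
            current = f"{current} {tok}".strip()
    if current:
        lines.append(indent * depth + current)
    return "\n".join(lines)


formatted = format_calls("aaa(bbb,ccc(ddd,eee),fff)")
```

This reproduces the layout in the question for that input. A realistic version would decide per bracket pair whether to break, for example based on line width, and that decision is where the absence of a syntax tree starts to matter.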


Can this implement/emulate something like Python Black?


Hi!

We are not sure right now because Topiary is still very much an experiment.

Having said that, we are constantly surprised by what we can do with Topiary. So with a dedicated Python developer willing to draft a set of rules, it might just be possible!



