Frawk: A fast, JITted, statically-typed AWK written in Rust (github.com/ezrosent)
145 points by benhoyt on Feb 15, 2022 | 38 comments



I tried this last year when I started writing a book with Rust CLI tools like ripgrep, hck, sd, huniq, etc. I was impressed with the performance. I found a few syntax issues when I simply tried to run my gawk one-liners with frawk, and the author fixed them. Hope the project crosses version 1.0 soon.


One of my recent projects required a fast awk implementation, so I tried frawk and was surprised by how robust it is. Although we eventually chose https://github.com/noyesno/awka for better compatibility with native awk scripts, it's still wonderful to see a project bringing recent advances in programming-language implementation to the ancient awk.


Note: You can do many of these one-liners using tools that understand Python: Xonsh or Pyp. You won't need to learn Awk -- let alone any non-standard dialect of Awk. It won't be as fast as some Awk implementations (maybe), but you'll have a gentle on-ramp and off-ramp to a more featureful programming language.

I think the problem with DSLs is when a problem only partially fits into what they can do.
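To give a sense of the trade-off, here is a minimal sketch in plain Python (not Pyp or Xonsh syntax) of the classic awk one-liner `{ s += $2 } END { print s }`. The function name `sum_column` is made up for illustration, and the handling of non-numeric fields is a simplification of what awk does.

```python
def sum_column(lines, column=2):
    """Sum a 1-based, whitespace-separated column across all lines."""
    total = 0.0
    for line in lines:
        fields = line.split()  # splits on runs of whitespace, like awk's default
        if len(fields) < column:
            continue  # the column is missing on this line, so it adds nothing
        try:
            total += float(fields[column - 1])
        except ValueError:
            # Simplification: awk would use any leading numeric prefix;
            # here a non-numeric field simply counts as 0.
            pass
    return total
```

From a shell pipeline this would be called as `sum_column(sys.stdin)` after `import sys`, which is noticeably more typing than the awk version.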


I would kindly disagree here.

Having been a dedicated Python lover for many years, I have come to the conclusion that what can be scripted with AWK should be scripted in AWK (over Python, Ruby, Perl, etc.). I'm not saying that you should write big apps in it, but for small scripts AWK is an absolutely fine alternative to the major scripting languages, with lots of benefits: it is universally available (as part of POSIX) and very compliant (the language standard has been almost unchanged for over 30 years now). As they say: "A good programmer chooses the most powerful tool for the job; the best programmer chooses the least powerful tool for the job". Also see [1].

There is absolutely nothing wrong with learning AWK. It's a very small language that you can grasp fully in hours or days, and you can be sure you know it all, since it's very unlikely to change any time soon. Besides, the classic book [2] by A., W., K. is an absolute pleasure to read. Amazingly, it's still totally relevant, despite having been published in 1988.

Shameless plug: I'm the author of a task/command runner [3] implemented almost 100% in AWK, and I still think it was the perfect choice of language for this project.

[1] https://en.wikipedia.org/wiki/Rule_of_least_power

[2] https://archive.org/download/pdfy-MgN0H1joIoDVoIC7/The_AWK_P...

[3] https://github.com/xonixx/makesure


> As they say: "A good programmer chooses the most powerful tool for the job; the best programmer chooses the least powerful tool for the job". Also see [1].

I don't agree with this. For a network protocol, sure. But not for what Awk does.

Tim Berners Lee's argument (in the Wikipedia page) does not apply to Awk.

> it is universally available (as part of POSIX) and very compliant (the language standard has been almost unchanged for over 30 years now).

There's tons of incompatible dialects. I think that shows the problem with what you're saying.


>There's tons of incompatible dialects. I think that shows the problem with what you're saying.

To my knowledge the major dialects in use are:

- One True Awk (aka bwk) (https://github.com/onetrueawk/awk) - this one is bundled in all *BSD/macOS

- Gawk (https://www.gnu.org/software/gawk/) - this one is bundled in most Linux distros

- mawk (https://invisible-island.net/mawk/) - bundled in some Linux distros (?), known as the fastest byte-code compiled implementation.

All three have very good compatibility, but Gawk is a superset of the POSIX standard. I have some evidence here, since I regularly test [1] against these implementations and even some others, like GoAWK.

[1] https://github.com/xonixx/makesure/actions/runs/1830978431


I learned awk a while back, thinking along the same lines. When I learned that arguments and locals are the same thing (locals are declared as extra function parameters), I decided that there's a reason why everyone uses Python/Ruby/etc. nowadays. Even the designers of awk realize now that this was a mistake, as I recall. The rule of least power is a good guideline, but awk unfortunately has basic design mistakes that modern languages correct.


If you’re writing the kind of script that should be written in Awk, then the upsides will far outweigh that downside. Awk is not designed for 500-line, 50-function scripts; it’s designed for 50-line, 1- or 2-function scripts.


What are the chances that Xonsh or Pyp will be installed on a system I wish to use?


The submission is about a non-standard dialect of Awk. There is zero chance you'll have that on any random system you might want to use. If it's between that and Pyp, why not use Pyp?

And also, in the technology business, you are supposed to make progress happen. Sometimes, that involves making older technology obsolete. Making things obsolete is sometimes good, especially if the older thing has major problems.


> The submission is about a non-standard dialect of Awk

True, but the comment I responded to also has this:

> You won't need to learn Awk

Regardless, let's accept the claim that it's only about non-standard frawk.

> If it's between that and Pyp, why not use Pyp?

Because, as the instructions[0] point out

> Run pip install pypyp (note the extra "yp"!)

> pyp requires Python 3.6 or above.

To do so it's necessary to install at least pip and Python; I expect that I can produce a binary for frawk and drop it in far more easily. I'm not even sure what advantage Pyp has over using a Python REPL (or Ruby or Perl, etc.).

> And also, in the technology business, you are supposed to make progress happen.

I do that by writing good code and improving processes.

As I've pointed out, I fail to see how needing to include the Python ecosystem is "progress" over a binary that is tiny, fast, and works. Calling things obsolete because they're "old" is the argument of a teenager. I apologise if that sounds rude, but I don't know how else to put it, that's what it is.

[0] https://github.com/hauntsaninja/pyp


> As I've pointed out, I fail to see how needing to include the Python ecosystem is "progress" over a binary that is tiny, fast, and works. Calling things obsolete because they're "old" is the argument of a teenager. I apologise if that sounds rude, but I don't know how else to put it, that's what it is.

That is indeed rude. And it's not what I said. I'm talking about the proliferation of DSLs like Awk in the world of Unix. I think this makes Unix harder to learn. I also think that each DSL has random limitations. I think installing Python libraries is much simpler, and you get more complete tooling, and a gentler learning curve. If each Unix DSL becomes its own Python library, then it's easier to use them together nicely.

Python can be substituted with another Turing-complete language. Why not just use libraries and abbreviate them for command line use?


> I'm talking about the proliferation of DSLs like Awk in the world of Unix

I get that there is a learning curve for Unix. However, a complaint about a "proliferation" in the context of a tool which is decades old seems a bit off. The response here of "just install Python" doesn't work for $random_unix for a variety of reasons. Such reasons include lack of root access or business rules which prevent modification (such as for a production system where you do have root access).

> Python can be substituted with another Turing-complete language. Why not just use libraries and abbreviate them for command line use?

This suggests that we take a tool such as awk which is decades old, stable and efficient and replace it with an equivalent implementation based on an interpreted language. This process would take years to complete and then would end up with a tool which consumes more resources and is slower than the original. I don't see the payoff here.


> That is indeed rude. And it's not what I said.

This is what you wrote.

> And also, in the technology business, you are supposed to make progress happen. Sometimes, that involves making older technology obsolete. Making things obsolete is sometimes good, especially if the older thing has major problems.

Awk does not have "major problems" and I fail to see how I misrepresented you.

Aside from that, Python is newer than perl and Awk and anyone who can write perl can write Awk without too much trouble. If we apply the proliferation principle then what good does having more general purpose languages like Python do?

> I think installing Python libraries is much simpler

Than a single binary? I'm interested, how is it done?

> Why not just use libraries and abbreviate them for command line use?

Chesterton's fence comes to mind. Because they're reliable, quick, are usually preinstalled or easily installed if not, lightweight, well known, and don't require learning a particular programming language nor the install of its comparably giant ecosystem.


Since AWK is a DSL for text processing, it can be (and usually is) much more elegant and efficient for (surprise!) text-processing tasks than general-purpose languages like Python.

Not long ago I was really fascinated by this example I came across: https://pmitev.github.io/to-awk-or-not/Python_vs_awk/.


FWIW, Pyp is now broken due to the python2 -> python3 silliness


Interesting (and a bit surprising) that the Cranelift backend keeps up with the LLVM backend reasonably well, especially with parallelism enabled: https://github.com/ezrosent/frawk/blob/master/info/performan...


There is also goAWK[0], a re-implementation of AWK in Go.

[0] https://github.com/benhoyt/goawk


GoAWK is a great tool. Added benefit of POSIX-compliance, it builds FAST, and is just generally much more reliable and enjoyable to work with.

I'll be honest, I'm still quick to reach for perl/python for most awk-able tasks, but, I'm often impressed with Awk and its ilk.


Drat:

> To a first approximation, it is an implementation of the AWK language; many common Awk programs produce equivalent output when passed to frawk.

So it sounds like it's (intentionally) not a drop-in replacement. That's probably reasonable in context, but unfortunate in that it makes adoption harder.


The Overview page has plenty of details explaining the differences and why they exist: https://github.com/ezrosent/frawk/blob/master/info/overview....


The only real-world difference that I see would be that frawk uses Rust's regex syntax, not awk's regex syntax. I'm actually willing to write that off as an improvement, though it may break scripts. I'd like to see this hidden behind a flag, with the classic awk regex syntax used when the flag is absent. For now, while the classic syntax is unavailable, frawk should error with a message that classic syntax is not yet supported. Right now it is too big a footgun.

That said, gawk uses PCRE but awk does not. So varying regex implementations already infect the awk ecosystem. I believe that Debian distros actually install mawk when installing awk via the standard repos. mawk has its own incompatibilities with awk, such as empty string representation.


>gawk uses PCRE

ERE with GNU extensions would be more apt here (PCRE has many more powerful features): https://www.gnu.org/software/gawk/manual/html_node/GNU-Regex...


How would a statically typed language be a drop-in for something that isn't?


This is a different compiler/interpreter for the same language - nothing prevents it from being a drop-in replacement.


But the specific design of Frawk is that it's statically typed. Hence both the title of this thread and the description in the linked documentation. If Frawk sees that you've got a variable N, and you add 1 to it, Frawk concludes that's a number. Seems fair enough and it makes Frawk faster.

However, Frawk may, for example, be quite sure that something is a number, so when there isn't a value for it, Frawk says the value is zero. Awk isn't fixated on the idea that it's always a number: if there's no value, awk will say it is the empty string, and why not.

Lots of Awk programs don't care, but if yours does it will behave differently under Frawk.
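A toy model of that difference, as a minimal sketch in Python. It only mimics the behaviour described above and is not how either implementation works internally; both function names are made up.

```python
def awk_style_lookup(table, key):
    # Dynamic model: a missing value has no fixed type and prints as "".
    return table.get(key, "")

def inferred_numeric_lookup(table, key):
    # Static model: the value was inferred to be a number, so the default is 0.
    return table.get(key, 0)

counts = {"a": 2}
print(f"[{awk_style_lookup(counts, 'b')}]")         # prints []
print(f"[{inferred_numeric_lookup(counts, 'b')}]")  # prints [0]
```

A script that sums the values never notices the difference; a script that prints them, or tests for "no value yet" by comparing against the empty string, does.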


I guess the assumption is that working AWK programs will typecheck.


In the example I noticed a PREPARE block in addition to the well known BEGIN and END blocks. After a few minutes of searching it appears to be frawk specific:

https://github.com/ezrosent/frawk/blob/master/info/paralleli...

"Because the repeated map references are both annoying to write and inefficient to execute, frawk has a PREPARE block which executes in the worker threads at the end of its input"
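As I read that quote, the pattern is: each worker accumulates into local variables, a per-worker step at the end of its input packs them into a map, and a final step merges the maps. A loose Python analogy follows. It is not frawk's implementation, `worker` and `parallel_mean` are made-up names, and threads are used only to mirror the shape, not for speed.

```python
from concurrent.futures import ThreadPoolExecutor

def worker(chunk):
    """Process one worker's share of the input using only local state."""
    local_count = 0
    local_sum = 0.0
    for line in chunk:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        local_count += 1
        local_sum += float(fields[0])
    # "Prepare" step: runs once, at the end of this worker's input,
    # and packs the local variables into a map that can be merged.
    return {"count": local_count, "sum": local_sum}

def parallel_mean(lines, workers=4):
    """Mean of the first field of every non-blank line, or None if there are none."""
    chunks = [lines[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(worker, chunks))
    # "End" step: a single thread merges the per-worker maps.
    merged = {"count": 0, "sum": 0.0}
    for partial in partials:
        for key, value in partial.items():
            merged[key] += value
    if merged["count"] == 0:
        return None
    return merged["sum"] / merged["count"]
```

The hot loop only touches plain local variables; the map is written once per worker, not once per line.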


> You will need to install Rust

Am I right in saying Rust can produce executables that can be run without any Rust runtime etc being on the target system? Why would you need to install Rust in cases like this?


Yes you are, but the author hasn't provided those executables, so you need to produce one for your platform yourself, for which you need to install Rust.

(Presumably it's deemed not worthwhile providing them while it's unstable/in active pre-1.0 development.)


Rust is a compiler, so you need to install it to compile this application, which you can then use without Rust.

You need an oven to bake a cake but not to eat one.


Java is both a compiler and a run time. You can ship someone a compiled Java and they will still need the JVM to actually run it.

Sometimes you need the oven to bake a cake and eat it too.


Well you can AOT Java to a stand-alone binary too - but none of this is relevant to Rust.


Instead of building and installing it with cargo, the project could offer a binary release.


> Lack of support for structured CSV input data.

An alternative approach that can sometimes be used: use TSV (tab-separated values), which makes parsing trivial and works great with a wide range of tools; conversion from CSV to TSV can be done with csvkit or similar.
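If csvkit isn't at hand, the conversion can be sketched with Python's standard `csv` module instead. The function name `csv_to_tsv` is made up, and replacing embedded tabs and newlines with spaces is an assumption about what the downstream tools can tolerate.

```python
import csv
import io

def csv_to_tsv(csv_text):
    """Convert CSV text to TSV, flattening tabs and newlines inside fields."""
    # newline="" lets the csv module handle line endings inside quoted fields.
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    out = io.StringIO()
    for row in reader:
        cleaned = [
            field.replace("\t", " ").replace("\r", " ").replace("\n", " ")
            for field in row
        ]
        out.write("\t".join(cleaned) + "\n")
    return out.getvalue()
```

After this, a plain tab field separator is enough on the awk side, since no field can contain a tab or a line break any more.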


There's also this for the Python fans:

https://github.com/alecthomas/pawk


A bit offtopic, but has someone implemented a structured regex awk? I'm constantly hoping it will become a thing.


Neat tool! I like this as a small example of SSA-based dataflow analysis.



