Awk: Power and Promise of a 40 yr old language (2021) (fosslife.org)
229 points by sargstuff on Jan 15, 2023 | 95 comments




Brian Kernighan recently told me that "Al and Peter and I" (that is, A, W, and K) are working on a second edition of the book "The AWK Programming Language". The first edition is excellent (it inspired me to write an AWK interpreter in Go). Anyway, I'm really looking forward to seeing what the second edition looks like. I think it'll help renew interest in AWK in general, and no doubt bring some of the examples up to date (though the book has aged very well!).


Also, Unicode support being added to onetrueawk: https://news.ycombinator.com/item?id=32534173


This is incredible news, thanks for sharing. I would never have guessed that, really. I wonder what their motivation is (why now?).

The AWK book was one of the fundamental books I used to teach myself some coding. The precision of the language is remarkable; I wonder how much will be different in the new edition.


Glad to hear that. Brian Kernighan said on Numberphile last year that he was thinking of updating the AWK book. Looking forward to it as well :)


Any links to posts about Kernighan's take on gagh[1]?

[1] https://www.startrek.com/database_article/gagh


I try not to plug it every time I see a submission about awk, but here I must recommend freedomben's presentation "awk: Hack the Planet!" [0]

It starts off with similar background and praise for a few minutes before spending the rest of the hour crash-coursing you through awk, then leaving you with a very approachable, digestible, and realistic set of exercises (and, optionally, a second video covering his solutions).

I keep a text file with a copy of the questions and my solutions on a public gh repo, so I can quickly refer to it from anywhere when needed.

I am much more powerful on the command line because of it.

[Edit]

[0] https://github.com/FreedomBen/awk-hack-the-planet

Originally encountered @ https://news.ycombinator.com/item?id=25144697


I'm glad I indulged in your recommendation - thanks! The lecture quality is unrivaled when held up against the YouTube tutorials of today.

For anyone else considering investing the time: I am extremely satisfied with the 2 hours it took to learn + practice the basics. As far as high-yield learning investments go, I’d already put awk up there with my time spent learning Vim and Git.



I wish Awk had capture groups. It would fit in so well with typical Awk one-liners to be able to say:

    awk '/foo=([0-9]+)/ { print $1 }'
although I suppose the syntax would have to be different since $1 has a meaning already.

Yes, gawk has a function that returns capture groups, but it's a bit verbose for one-liners. Instead I switch to Perl:

    perl -nE 'if (/foo=([0-9]+)/) { say $1 }'
But I wish I could just use Awk.


You can still do it with awk, but with different "ergonomics":

  awk -F= '$2 ~ /[0-9]+/ { print $2 }'
With imaginative choice of FS and RS you can push it very far.

Whether other people having to deal with such code will appreciate your imagination is another matter, though.

Edit: I missed the detail where you want to specifically match "foo" as lhs, and anywhere on the line. So the correct condition would be even lengthier : ^ ) You have a valid point. Captures would provide for shorter patterns.
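
For illustration, a sketch of that lengthier variant using plain POSIX match() with RSTART/RLENGTH (the hardcoded 4 skips over "foo="):

    $ echo 'bar foo=42 baz' | awk 'match($0, /foo=[0-9]+/) { print substr($0, RSTART + 4, RLENGTH - 4) }'
    42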


I recently saw someone on StackOverflow asking for something similar: access, from the condition-action body, to the text which matched the regex.

So I added a feature to the TXR Lisp awk macro. There is now an Awk variable called res which holds the result of the condition. If the condition is a regex, then that has the matching part. The fact that the action is executing tells us that the result of the condition is true; but res gives us the specific true value, like the it in anaphoric if macros.

This made it into release 284.


Can use sed. One way is to use a "sentinel", for example, a non-printable character that is absent from the input.

   sed -n "s/SENTINEL//g;s/\(.*\)\(foo=\)\([0-9]*\)\(.*\)/SENTINEL\3/;/SENTINEL/!d;s/SENTINEL//;/./p"
shortened to

   x=$(echo x|tr x '\34');
   sed -n "s/$x//g;s/\(.*\)\(foo=\)\([0-9]*\)\(.*\)/$x\3/;/$x/!d;s/$x//;/./p"
Using flex is another option. Faster than AWK, Perl, sed and similarly ubiquitous.

   flex -8iCrf -o/dev/stdout << eof|cc -xc -O3 -std=c89 -pedantic -W -Wall -static /dev/stdin
     int fileno(FILE *);
     #define J BEGIN
     #define E ECHO
     int x;
   %option noyywrap noinput nounput
   %s x1 x2
   %%
   foo=[^\n] yyless(1);J x1;x++;
   <x1>[0-9]+ if(x==1)E;J x2;
   <x2>\n E;x=0;J 0;
   <x2>.
   .|\n
   %%
   int main(){ yylex();exit(0);}
   eof


I don't use `match` often, so I get the `match(str, regex)` order wrong. So yeah, it would be nice if gawk automatically provided the capture groups via some special variable.

Not really verbose for this particular example though:

    $ echo 'foo=42' | awk 'match($0, /foo=([0-9]+)/, m){print m[1]}'
    42
    $ echo 'foo=42' | perl -nE 'if (/foo=([0-9]+)/) { say $1 }'
    42
    # TIMTOWTDI 
    $ echo 'foo=42' | perl -nE 'say $1 if /foo=([0-9]+)/'
    42


GNU Awk version 5 has dynamic regexps. [1]
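
A dynamic regexp is just a string used where a regexp is expected, so the pattern can arrive via a variable; a minimal sketch:

    $ echo 'foo=42' | awk -v pat='foo=[0-9]+' '$0 ~ pat { print }'
    foo=42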

[1] 3.6 Using Dynamic Regexps: https://www.gnu.org/software/gawk/manual/html_node/Computed-...


I have to admit I never really learned awk, but whenever I want to do something where I think "awk would be good for this", I use perl. Are there things for which awk is a significantly better tool?


Ideal for embedded systems where resources are limited.


I wish grep had capture groups so I could do things like

    grep "\((\d+)\)" -print "$1" file.txt


ripgrep is a modern, faster grep written in Rust - some cons, many pros.

It'll do capture groups with search | replace

    # remove square brackets that surround digit characters
    $ echo '[52] apples [and] [31] mangoes' | rg '\[(\d+)]' -r '$1'
    52 apples [and] 31 mangoes


Pleasantly surprised to see an example from my ebook :)

Add `-o` option to get only the digits:

    $ echo '[52] apples [and] [31] mangoes' | rg -o '\[(\d+)]' -r '$1'
    52
    31


Congratulations (?) for making the top five google results for "ripgrep capture groups" [1].

I cite sources way more often than not; this time I got lazy after dithering over whether to go with the definitive ripgrep source page [2] or a decent-looking third-party(?) tutorial... pressed for time, I did neither.

[1] https://learnbyexample.github.io/learn_gnugrep_ripgrep/ripgr...

[2] https://github.com/BurntSushi/ripgrep


[1] gives some background about the issue and possible solution(s).

[1] : https://unix.stackexchange.com/questions/536657/how-to-refer...


If PCRE is supported:

    grep -oP '\(\K\d+(?=\))'
The above will give all matches in a line, though. You can remove the `(?=\))` part if numbers are always enclosed in `()` or if you don't care about the `)`.


Lookarounds are a bit messy. It's easier to just pipe into a second grep.
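
E.g., a sketch for the parenthesized-digits case above (GNU grep's -o assumed):

    $ echo 'x (42) y' | grep -o '([0-9]*)' | grep -o '[0-9]*'
    42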


certainly gets rid of one type of edge case.


I use awk for lots of things, including running simple numerical models, since its results are totally portable between OSes etc. for my purposes.

And I learnt a new thing from this article, even though I have been using awk for decades... Functions. Yep!
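
For anyone else who missed them, a minimal sketch (column choice and names made up):

    # mean of column 1; extra parameters declared on a function act as locals
    function mean(total, count) {
        return count ? total / count : 0
    }
    { sum += $1; n++ }
    END { printf "mean: %.2f\n", mean(sum, n) }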


Yep, functions! I used to write a fair amount of Awk code back in the late '80s and early '90s. I treated Awk as a "real" programming language and tried to make the code nice and readable. This of course involved a lot of use of functions.

I only have a couple of surviving examples of the code from back then, but here they are for the curious:

https://github.com/geary/awk

LJPII.AWK is probably the best example. It made a nicely formatted printout of source code on my HP LaserJet II printer. I wish I had one of the printouts it generated, but they are long gone.

Hmm... I wonder if my Brother printer supports the old LaserJet II control codes? Or maybe there is an emulator online?

The code was written for Thompson Awk (TAWK), so some bits would need to be adapted to modern Awks.


Oooooooo, that's how functions should be done.

Here's a recent one of mine (albeit as usual embedded in a /bin/sh script) that I should try to functionify!

https://www.earth.org.uk/script/BibTeX-to-HTML.sh

Thanks!


You can check if the Brother printer supports PCL, which it likely does. Somewhere online there will be an explanation of the differences between GNU awk and the version you used.


JIT-compiled Awk in Rust

https://github.com/ezrosent/frawk


Discussed 11 months ago:

https://news.ycombinator.com/item?id=30343373 (38 comments)

I wish this "AWK-like language" came with performance benchmark comparisons.

Edit: Thank you, benhoyt!


It has very detailed benchmarks (the "Benchmarks" link in the README): https://github.com/ezrosent/frawk/blob/master/info/performan...


Awk is an amazingly powerful tool. I remember writing an awk script in an interview once (instead of using Python), and the person giving the interview was amazed at how quickly it could be written and how fast it ran.


Poplar is an also-ran (from that time frame). Normally I prefer to concentrate on ideas (of which there are a few in the link below) rather than people ... but in this case, for context, it's worth noting the authors:

https://www.softwarepreservation.org/projects/poplar/doc/Mor...


Previous discussion: https://news.ycombinator.com/item?id=28441887 (251 points | Sept 7, 2021 | 118 comments)


In my opinion Perl is much more powerful than Awk, available just about everywhere Awk is available, and even explicitly takes inspiration from Awk for some features (like BEGIN blocks).


I agree. Yet I have used awk in a few places instead. Part of it is that it's easier to get a less powerful language accepted by colleagues, when both are in the "completely foreign to me" category.


Yes, but across 40 years, Perl versions introduced compatibility issues.


Really not a fan of awk. It looks nice, but I have inherited a lot of it in the past and know how many footguns there are. At the risk of trashing the Unix philosophy of using text as a communication protocol: I've seen terrible, terrible mistakes when it comes to repetitively re-parsing data at each pipe step.

The finest was a 3rd party who accidentally added a space in a data feed. This was dutifully sucked out of their SFTP server via a bash script, pre-processed using awk into the standard internal format and then picked up by a cron job which ran a python script to inject it into postgres. The outcome was of course that the columns were offset by 1. This caused a huge asset valuation dip and some market alarm.

A proper parser would have rejected the whole dataset as the data row could not be parsed.


I get that AWK encourages this type of code, but I would still argue it's not really "an AWK problem". It's more of a problem of how carefully you model your inputs.

Model inputs in great detail and you can throw out a lot of invalid data, but it takes longer to get the code running. Model the input only very crudely and you're up and running quicker, but more open to broken expectations.

Of course, it's way too easy to go the latter route for most programmers...


Classic CS off by one error.


I wrote an awk script to rip through supposedly ASCII text files to deal with CP1252 extended-ASCII characters that would creep in. Those characters played havoc with the output of our bindery’s commercial inkjet print heads. That stuff was fast on ancient equipment.
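
Not the original script, but a minimal sketch of the idea - in a byte-oriented locale, squash anything outside printable ASCII (tabs included) to a space:

    $ LC_ALL=C awk '{ gsub(/[^ -~]/, " "); print }' suspect.txt   # suspect.txt is hypothetical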


I do something similar to convert supposed plain text to 7-bit US-ASCII. I use a sed script however...


Don't shoot me, but I've used awk to create binary MIDI and WAV files...
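
Roughly, the trick is that printf "%c" turns numeric codes into bytes (a sketch; values above 127 need a byte-oriented locale):

    # emit the 4-byte "MThd" tag that starts a MIDI file, from numeric codes
    BEGIN {
        n = split("77 84 104 100", b, " ")
        for (i = 1; i <= n; i++) printf "%c", b[i]
    }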


Ok, even though it hasn't been ported as an awk extension under gawk, will GNU poke [1][2] with a command-line pipe do (perhaps with enough sed, too)?

[1] : https://www.gnu.org/software/poke/

[2] : https://kernel-recipes.org/en/2019/talks/gnu-poke-an-extensi...


awk's system() makes it possible to avoid being BASH'ed.
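
A sketch of what I mean (directory name hypothetical; mind the quoting on untrusted input):

    $ printf 'alpha\nbeta\n' | awk '{ system("mkdir -p staging/" $1) }'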


Hmmm aWk!


When appropriate, don't pass up opportunities to ingest Gawk [1]!

edit: especially since Unicode support allows for use of native Klingon fonts.

[1] https://www.startrek.com/database_article/gagh


> Very few people still code with the legacies of the 1970s: ML, Pascal, Scheme, Smalltalk.

Freepascal/Delphi user and Smalltalk lover here.

I beg to differ.


Very few is not none, but I still think “very few” is an appropriate description.


Our flagship product is still Delphi based. It's an old application, but yeah.


What does your Delphi-based product do?


It's a radio dispatch console.

They're currently trying to port it to C#, but it's slower going than developing in Delphi; ironically, the thing that's making us move to C# is developer availability.


COBOL legacy programmers make quite a bit because there are so few of them and it's no longer actively/widely taught at the university level.

The multi-decade investment in COBOL for critical systems (banking) does not make for a quick/easy switch.


I think the main thing Awk is missing is standard support for CSV (with quoting). The recent goawk has it, though.


Many (most?) Linux distros that have gawk also ship gawkextlib, which includes gawk-csv.

https://gawkextlib.sourceforge.net/csv/gawk-csv.html


Thanks, I didn't know about that. For one-liners, a simple command-line option would be preferable, though.


Thanks for the plug (I'm the author of GoAWK). Yeah, I'm hoping the CSV feature will really be useful for data science and the like. There are so many CSVs pushed around these days. See more here: https://benhoyt.com/writings/goawk-csv/
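
Roughly, with a hypothetical prices.csv that has a header row, named-field access looks like this (see the article for the exact flags):

    $ goawk -i csv -H '{ print @"name", @"price" }' prices.csv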



You can get pretty far with FPAT, and as mentioned elsewhere, gawkextlib is available when FPAT doesn't quite cut it.


I find that opinionated tools like awk are esoteric and very niche, but for things they do, they do very well. Writing awk scripts for simple text transformations brings me immense pleasure.


yes, maybe esoteric today, but not when it was the only tool available (~late 80's) across multiple platforms: Mac, Unix flavors, VAX, Sperry Rand, Burroughs, Wang, Sun, Alpha, and x86.


yup. I wonder "what if" awk had added records (dotted r.attrib notation) and namespaces. I created sizeable awk scripts back in the day, and I missed the former dearly and the latter somewhat.


Nah (humorous observation): using system() is the only one true way to do namespaces.

The way gawk handles function parameters can simulate records (dotted r.attrib notation).

Gawk's @ addition [1] permits namespaces (include, load) and other fun syntax/semantic separation.

[1] https://www.gnu.org/software/gawk/manual/html_node/Index.htm...
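
A minimal sketch of the @namespace syntax (gawk 5+):

    @namespace "geo"
    # known as geo::dist outside this namespace
    function dist(x1, y1, x2, y2) {
        return sqrt((x2 - x1) ^ 2 + (y2 - y1) ^ 2)
    }

    @namespace "awk"
    BEGIN { print geo::dist(0, 0, 3, 4) }   # prints 5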


I tend to forget that awk isn't the typical choice for programs in the 20k-100k line range.


I don't mind AWK for super simple things, but there is a reason why the sysadmins from the Bad Old Days(tm) who now have grey beards all converged to Perl.

Practically everything you can do in AWK can be done just as easily and quickly in Perl. And Perl absolutely wins when you need to do that one extra thing that AWK really just can't do.

And I say this as a person who switched over from Perl to Python eons ago.


We didn’t all converge to perl. Some of us avoided it like the plague. I saw some reference to CPAN recently and it made me feel… uncomfortable.

Whatever things I couldn’t do in sh or Awk, I would do in C.


Good lord, why?

Seriously, Perl is an okay language for quick and dirty things of a tiny/small size. Yeah, it's not the best language for a large development project, but if you do need to parse /etc/passwd or something, not only is it perfectly good as-is, you'll certainly find something on CPAN that already does it well.

I can't imagine why would one want to do that kind of thing in C. It's just unnecessarily painful, and you'll spend 90% of the time on doing things that don't solve the actual task you need to be solved.

Yeah, in modern times it's gone way downhill, but that's mostly if you intend to do something big with it. I wouldn't use it to start a new, fully featured CMS. But for sysadmin type stuff as an alternative to sh/awk it's still just as usable as ever.


I know loads of people loved Perl, and I'm not suggesting that my perspective is mainstream or even defensible. I'm just saying that I was a young sysadmin in the "bad old days", but that I didn't like Perl. Of course I did use an awful lot of Perl scripts.

That said, I have to revise my original comment. I forgot that I was a big fan of Tcl/Tk/Expect back in the day. So it's not like my taste is better than anyone else's :)


Greybeard here. Just blogged about this very topic a couple of weeks ago: “The Unreasonable Effectiveness of Awk”

https://stephenramsay.net/posts/unreasonable-awk.html


Nice!

You could also use

    BEGIN {
      FS = ": "
    }
which sets the field separator. This would put the keywords into $2 and remove the need for the gensubs :)


It's much shorter to set the field separator with an option: awk -F": " or awk -F ": " do the same thing.


Well, look at that. Thanks!


In both your essays, you have some text surrounded by square brackets: are these intended to render as hyperlinks?


Yes. Are they not!?


They are now. Some of them weren't. Maybe it was a local error for me. Or maybe PEBCAK.

Anyway, cool blog. I'm reading it.


Perl doesn't work so well for /etc stuff when /usr, /usr/local, and /opt aren't available.


I have a perennial fascination with awk. It's one of the first things you find in /bin alphabetically, on almost any Unix. It's small but powerful, and somewhat mysterious.

Sadly I've learned it and cheatsheeted it for future reference, but I never find myself reaching for it. Part of it is preferring Python over shell scripting, maybe - awk fits better in a shell-scripting world.


Like the scripts that web browsers use?


I'm using a 20 KiB awk script that I wrote from scratch to calculate taxes owed from investments. German tax code is a bit tricky when your broker is abroad because you have to calculate everything yourself. Hilariously some German brokers don't even apply the FIFO rule correctly. Nevermind regulations about fees or double taxation treaties. FX rates for conversion into EUR are extracted either using rga from Swissquote's own PDFs, or downloaded off the internet. The transaction history itself is one big CSV export from their web site that is also parsed and analyzed using awk. For tax time I call the script with the year I want and transfer everything into official forms. Without spending 3 days in Excel. Or 10. I can't praise awk enough.


I quite like awk mainly because the entire manual is only 500 lines long.

http://man.openbsd.org/awk


I have used awk in a pipeline to pick out a field or fields forever:

   foo | awk '{print $4}'
I know there are easier ways. Heck, I even wrote a command called "words" (like "foo|words 4" or "words 1 3 2")... but I forget to use it.
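
(For the curious, roughly how such a helper can look - a sketch, not my original:)

    # usage: foo | words 1 3 2
    words() {
        awk -v cols="$*" '{
            n = split(cols, c, " ")
            for (i = 1; i <= n; i++)
                printf "%s%s", $(c[i]), (i < n ? OFS : ORS)
        }'
    }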


If you're using Awk, you owe it to yourself to check out CppAwk

https://www.kylheku.com/cgit/cppawk/about/


I don’t know any programming languages…but I feel like I “get” awk…

Also, this blog post helps with the above. For some reason I like how the author cites technical reviews of the article and email correspondence as references.


Missed opportunity to use one of my favourite words: grok [0].

And congratulations. I’ve been trying for the past year to learn awk, and while I can pretty reliably split text files and extract columns, I’m pretty far from being able to grok awk.

[0]: https://www.merriam-webster.com/dictionary/grok


I actually hate it for this purpose; in the book it also means to drink deeply, and it has psycho-sexual connotations.

Why not just say understand and move on?


The book that coined the term was written over 60 years ago. Its usage has evolved since then.


Are there any benchmarks of how much faster or more memory-efficient it might be for parsing large CSV files (~GB size) versus popular Python or R tools?


Where’s Bryan Cantrill in this thread?


I wish Awk had built-in CSV parsing support (quotes, commas inside quotes, etc.)...


GNU Awk can recognize fields using a regular expression, rather than separators.

See the "Defining Fields by Content" topic in the manual, which is based around the FPAT variable, specific to Gawk:

https://www.gnu.org/software/gawk/manual/html_node/Splitting...
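
The manual's CSV-ish pattern, roughly (a sketch; it doesn't handle doubled quotes inside a field):

    $ echo 'one,"two, with comma",three' | gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" } { print $2 }'
    "two, with comma"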


tyingq in this post noted that CSV support in awk is available via a library.


awk was my earliest exposure to the <pattern> fires <action> model of programming. those were the days.


Alternate reality: had Lisp been less dominant in AI, it might have resulted in an awk version of parallel make & AI-awk dominance.


> Awk, as created by Alfred Aho, Peter J. Weinberger, and Brian Kernighan (who drew on their initials to create the name of the utility)

TIL



