Awk: Power and Promise of a 40 yr old language (2021) (fosslife.org)
229 points by sargstuff on Jan 15, 2023 | 95 comments




Brian Kernighan recently told me that "Al and Peter and I" (that is, A, W, and K) are working on a second edition of the book "The AWK Programming Language". The first edition is excellent (it inspired me to write an AWK interpreter in Go). Anyway, I'm really looking forward to seeing what the second edition looks like. I think it'll help renew interest in AWK in general, and no doubt bring some of the examples up to date (though the book has aged very well!).


Also, Unicode support being added to onetrueawk: https://news.ycombinator.com/item?id=32534173


This is incredible news, thanks for sharing. I would never have guessed that, really. I wonder what their motivation is (why now?).

The AWK book was one of the fundamental books I used to teach myself some coding. The precision of the language is remarkable; I wonder how much will be different in the new edition.


Glad to hear that. Brian Kernighan said on Numberphile last year that he was thinking of updating the AWK book. Looking forward to it as well :)


Any links to posts about Kernighan's take on gagh[1]?

[1] https://www.startrek.com/database_article/gagh


I try not to plug it every time I see a submission about awk, but here I must recommend freedomben's presentation "awk: Hack the Planet!" [0]

It starts off with similar background and praise for a few minutes before spending the rest of the hour crash-coursing you through awk, then leaving you with a very approachable, digestible, and realistic set of exercises (and, optionally, a second video covering his solutions).

I keep a text file with a copy of the questions and my solutions on a public gh repo, so I can quickly refer to it from anywhere when needed.

I am much more powerful on the command line because of it.

[Edit]

[0] https://github.com/FreedomBen/awk-hack-the-planet

Originally encountered @ https://news.ycombinator.com/item?id=25144697


I'm glad I indulged in your recommendation - thanks! The lecture quality is unrivaled when held up against the YouTube tutorials of today.

For anyone else considering investing the time: I am extremely satisfied with the 2 hours it took to learn + practice the basics. As far as high-yield learning investments go, I’d already put awk up there with my time spent learning Vim and Git.



I wish Awk had capture groups. It would fit in so well with typical Awk one-liners to be able to say:

    awk '/foo=([0-9]+)/ { print $1 }'
although I suppose the syntax would have to be different since $1 has a meaning already.

Yes, gawk has a function that returns capture groups, but it's a bit verbose for one-liners. Instead I switch to Perl:

    perl -nE 'if (/foo=([0-9]+)/) { say $1 }'
But I wish I could just use Awk.


You can still do it with awk, but with different "ergonomics":

  awk -F= '$2 ~ /[0-9]+/ { print $2 }'
With imaginative choice of FS and RS you can push it very far.

Whether other people having to deal with such code will appreciate your imagination is another matter, though.

Edit: I missed the detail where you want to specifically match "foo" as lhs, and anywhere on the line. So the correct condition would be even lengthier : ^ ) You have a valid point. Captures would provide for shorter patterns.
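
For illustration, a sketch of that lengthier variant using plain POSIX match() with RSTART/RLENGTH (the hardcoded 4 skips over "foo="):

    $ echo 'bar foo=42 baz' | awk 'match($0, /foo=[0-9]+/) { print substr($0, RSTART + 4, RLENGTH - 4) }'
    42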


I recently saw someone on StackOverflow asking for something similar: access, from the condition-action body, to the text which matched the regex.

So I added a feature to the TXR Lisp awk macro. There is now an Awk variable called res which holds the result of the condition. If the condition is a regex, then that has the matching part. The fact that the action is executing tells us that the result of the condition is true; but res gives us the specific true value, like the it in anaphoric if macros.

This made it into release 284.


Can use sed. One way is to use a "sentinel", for example, a non-printable character that is absent from the input.

   sed -n "s/SENTINEL//g;s/\(.*\)\(foo=\)\([0-9]*\)\(.*\)/SENTINEL\3/;/SENTINEL/!d;s/SENTINEL//;/./p"
shortened to

   x=$(echo x|tr x '\34');
   sed -n "s/$x//g;s/\(.*\)\(foo=\)\([0-9]*\)\(.*\)/$x\3/;/$x/!d;s/$x//;/./p"
Using flex is another option. Faster than AWK, Perl, sed and similarly ubiquitous.

   flex -8iCrf -o/dev/stdout << eof|cc -xc -O3 -std=c89 -pedantic -W -Wall -static /dev/stdin
     int fileno(FILE *);
     #define J BEGIN
     #define E ECHO
     int x;
   %option noyywrap noinput nounput
   %s x1 x2
   %%
   foo=[^\n] yyless(1);J x1;x++;
   <x1>[0-9]+ if(x==1)E;J x2;
   <x2>\n E;x=0;J 0;
   <x2>.
   .|\n
   %%
   int main(){ yylex();exit(0);}
   eof


I don't use `match` often, so I get the `match(str, regex)` order wrong. So yeah, it would be nice if gawk automatically provided the capture groups via some special variable.

Not really verbose for this particular example though:

    $ echo 'foo=42' | awk 'match($0, /foo=([0-9]+)/, m){print m[1]}'
    42
    $ echo 'foo=42' | perl -nE 'if (/foo=([0-9]+)/) { say $1 }'
    42
    # TIMTOWTDI 
    $ echo 'foo=42' | perl -nE 'say $1 if /foo=([0-9]+)/'
    42


GNU Awk version 5 has dynamic regexps. [1]
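
A dynamic regexp is just a string used where a regexp is expected, so the pattern can arrive via a variable; a minimal sketch:

    $ echo 'foo=42' | awk -v pat='foo=[0-9]+' '$0 ~ pat { print }'
    foo=42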

[1] 3.6 Using Dynamic Regexps: https://www.gnu.org/software/gawk/manual/html_node/Computed-...


I have to admit I never really learned awk, but whenever I want to do something where I think "awk would be good for this", I use perl. Are there things for which awk is a significantly better tool?


Ideal for embedded systems where resources are limited.


I wish grep had capture groups so I could do things like

    grep "\((\d+)\)" -print "$1" file.txt


ripgrep is a modern, faster grep written in Rust - some cons, many pros.

It'll do capture groups with search | replace

    # remove square brackets that surround digit characters
    $ echo '[52] apples [and] [31] mangoes' | rg '\[(\d+)]' -r '$1'
    52 apples [and] 31 mangoes


Pleasantly surprised to see an example from my ebook :)

Add `-o` option to get only the digits:

    $ echo '[52] apples [and] [31] mangoes' | rg -o '\[(\d+)]' -r '$1'
    52
    31


Congratulations (?) for making the top five google results for "ripgrep capture groups" [1].

I cite sources way more often than not; this time I got lazy after dithering over whether to go with the definitive ripgrep source page [2] or a decent-looking third-party(?) tutorial... pressed for time, I did neither.

[1] https://learnbyexample.github.io/learn_gnugrep_ripgrep/ripgr...

[2] https://github.com/BurntSushi/ripgrep


[1] gives some background about the issue and possible solution(s).

[1] : https://unix.stackexchange.com/questions/536657/how-to-refer...


If PCRE is supported:

    grep -oP '\(\K\d+(?=\))'
The above will give all matches in a line, though. You can remove the `(?=\))` part if numbers are always enclosed in `()` or if you don't care about the `)`.


Lookarounds are a bit messy. It's easier to just pipe into a second grep.
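
E.g., a sketch for the parenthesized-digits case above (GNU grep's -o assumed):

    $ echo 'x (42) y' | grep -o '([0-9]*)' | grep -o '[0-9]*'
    42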


certainly gets rid of one type of edge case.


I use awk for lots of things, including running simple numerical models, since its results are totally portable between OSes etc. for my purposes.

And I learnt a new thing from this article, even though I have been using awk for decades... Functions. Yep!
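
For anyone else who missed them, a minimal sketch (column choice and names made up):

    # mean of column 1; extra parameters declared on a function act as locals
    function mean(total, count) {
        return count ? total / count : 0
    }
    { sum += $1; n++ }
    END { printf "mean: %.2f\n", mean(sum, n) }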


Yep, functions! I used to write a fair amount of Awk code back in the late '80s and early '90s. I treated Awk as a "real" programming language and tried to make the code nice and readable. This of course involved a lot of use of functions.

I only have a couple of surviving examples of the code from back then, but here they are for the curious:

https://github.com/geary/awk

LJPII.AWK is probably the best example. It made a nicely formatted printout of source code on my HP LaserJet II printer. I wish I had one of the printouts it generated, but they are long gone.

Hmm... I wonder if my Brother printer supports the old LaserJet II control codes? Or maybe there is an emulator online?

The code was written for Thompson Awk (TAWK), so some bits would need to be adapted to modern Awks.


Oooooooo, that's how functions should be done.

Here's a recent one of mine (albeit as usual embedded in a /bin/sh script) that I should try to functionify!

https://www.earth.org.uk/script/BibTeX-to-HTML.sh

Thanks!


You can check if the Brother printer supports PCL, which it likely does. Somewhere online there will be an explanation of the differences between GNU awk and the version you used.


JIT-compiled Awk in Rust

https://github.com/ezrosent/frawk


Discussed 11 months ago:

https://news.ycombinator.com/item?id=30343373 (38 comments)

I wish this "AWK-like language" came with performance benchmark comparisons.

Edit: Thank you, benhoyt!


It has very detailed benchmarks (the "Benchmarks" link in the README): https://github.com/ezrosent/frawk/blob/master/info/performan...


Awk is an amazingly powerful tool. I remember writing an awk script in an interview once (instead of using Python), and the person giving the interview was amazed at how quickly it could be written and how fast it ran.


Poplar is an also-ran (from that time frame). Normally I prefer to concentrate on ideas (of which there are a few in the link below) rather than people ... but in this case, for context, it's worth noting the authors:

https://www.softwarepreservation.org/projects/poplar/doc/Mor...


Previous discussion: https://news.ycombinator.com/item?id=28441887 (251 points | Sept 7, 2021 | 118 comments)


In my opinion Perl is much more powerful than Awk, available just about everywhere Awk is available, and even explicitly takes inspiration from Awk for some features (like BEGIN blocks).


I agree. Yet I have used awk in a few places instead. Part of it is that it's easier to get a less powerful language accepted by colleagues, when both are in the "completely foreign to me" category.


Yes, but across 40 years, Perl versions introduced compatibility issues.


Really not a fan of awk. It looks nice, but I have inherited a lot of it in the past and know how many footguns there are. At the risk of trashing the Unix philosophy of using text as a communication protocol: I've seen terrible, terrible mistakes when it comes to repetitively re-parsing data at each pipe step.

The finest was a 3rd party who accidentally added a space in a data feed. This was dutifully sucked out of their SFTP server via a bash script, pre-processed using awk into the standard internal format and then picked up by a cron job which ran a python script to inject it into postgres. The outcome was of course that the columns were offset by 1. This caused a huge asset valuation dip and some market alarm.

A proper parser would have rejected the whole dataset as the data row could not be parsed.


I get that AWK encourages this type of code, but I would still argue it's not really "an AWK problem". It's more of a problem of how carefully you model your inputs.

Model inputs in great detail and you can throw out a lot of invalid data, but it takes longer to get the code running. Model the input only very crudely and you're up and running quicker, but more open to broken expectations.

Of course, it's way too easy to go the latter route for most programmers...


Classic CS off by one error.


I wrote an awk script to rip through supposedly ASCII text files to deal with CP1252 extended-ASCII characters that would creep in. Those characters played havoc with the output of our bindery’s commercial inkjet print heads. That stuff was fast on ancient equipment.
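
Not the original script, but a minimal sketch of the idea - in a byte-oriented locale, squash anything outside printable ASCII (tabs included) to a space:

    $ LC_ALL=C awk '{ gsub(/[^ -~]/, " "); print }' suspect.txt   # suspect.txt is hypothetical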


I do something similar to convert supposed plain text to 7-bit US-ASCII. I use a sed script however...


Don't shoot me, but I've used awk to create binary MIDI and WAV files...
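
Roughly, the trick is that printf "%c" turns numeric codes into bytes (a sketch; values above 127 need a byte-oriented locale):

    # emit the 4-byte "MThd" tag that starts a MIDI file, from numeric codes
    BEGIN {
        n = split("77 84 104 100", b, " ")
        for (i = 1; i <= n; i++) printf "%c", b[i]
    }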


Ok, even though it hasn't been ported as an awk extension under gawk, will GNU poke [1][2] with a command-line pipe do (perhaps with enough sed, too)?

[1] : https://www.gnu.org/software/poke/

[2] : https://kernel-recipes.org/en/2019/talks/gnu-poke-an-extensi...


awk's system() makes it possible to avoid being BASH'ed.
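
A sketch of what I mean (directory name hypothetical; mind the quoting on untrusted input):

    $ printf 'alpha\nbeta\n' | awk '{ system("mkdir -p staging/" $1) }'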


Hmmm aWk!


When appropriate, don't pass up opportunities to ingest Gawk [1]!

edit: especially since Unicode support allows for use of native Klingon fonts.

[1] https://www.startrek.com/database_article/gagh


> Very few people still code with the legacies of the 1970s: ML, Pascal, Scheme, Smalltalk.

Freepascal/Delphi user and Smalltalk lover here.

I beg to differ.


Very few is not none, but I still think “very few” is an appropriate description.


Our flagship product is still Delphi based. It's an old application, but yeah.


What does your Delphi-based product do?


It's a radio dispatch console.

They're currently trying to port it to C#, but it's slower going than developing in Delphi; ironically, the thing that's making us move to C# is developer availability.


COBOL legacy programmers make quite a bit because there are so few of them and it's no longer actively/widely taught at the university level.

The multi-decade investment in COBOL for critical systems (banking) does not make for a quick/easy switch.


I think the main thing Awk is missing is standard support for CSV (with quoting). The recent goawk has it, though.


Many (most?) Linux distros that have gawk also ship gawkextlib, which includes gawk-csv.

https://gawkextlib.sourceforge.net/csv/gawk-csv.html


Thanks, I didn't know about that. For one-liners, a simple command-line option would be preferable, though.


Thanks for the plug (I'm the author of GoAWK). Yeah, I'm hoping the CSV feature will really be useful for data science and the like. There are so many CSVs pushed around these days. See more here: https://benhoyt.com/writings/goawk-csv/
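
Roughly, with a hypothetical prices.csv that has a header row, named-field access looks like this (see the article for the exact flags):

    $ goawk -i csv -H '{ print @"name", @"price" }' prices.csv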



You can get pretty far with FPAT, and as mentioned elsewhere, gawkextlib is available when FPAT doesn't quite cut it.


I find that opinionated tools like awk are esoteric and very niche, but for things they do, they do very well. Writing awk scripts for simple text transformations brings me immense pleasure.


yes, maybe esoteric today, but not when it was the only tool available (~late 80's) across multiple platforms: Mac, Unix flavors, VAX, Sperry Rand, Burroughs, Wang, Sun, Alpha, and x86.


yup. I wonder "what if" awk had added records (dotted r.attrib notation) and namespaces. I created sizeable awk scripts back in the day, and I missed the former dearly and the latter somewhat.


Nah (humorous observation): using system() is the only one true way to do namespaces.

The way gawk handles function parameters can simulate records (dotted r.attrib notation).

Gawk's @ addition [1] permits namespaces (include, load) and other fun syntax/semantic separation.

[1] https://www.gnu.org/software/gawk/manual/html_node/Index.htm...
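
A minimal sketch of the @namespace syntax (gawk 5+):

    @namespace "geo"
    # known as geo::dist outside this namespace
    function dist(x1, y1, x2, y2) {
        return sqrt((x2 - x1) ^ 2 + (y2 - y1) ^ 2)
    }

    @namespace "awk"
    BEGIN { print geo::dist(0, 0, 3, 4) }   # prints 5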


I tend to forget that awk isn't the typical choice for programs in the 20k-100k line range.


I don't mind AWK for super simple things, but there is a reason why the sysadmins from the Bad Old Days(tm) who now have grey beards all converged to Perl.

Practically everything you can do in AWK can be done just as easily and quickly in Perl. And Perl absolutely wins when you need to do that one extra thing that AWK really just can't do.

And I say this as a person who switched over from Perl to Python eons ago.


We didn’t all converge to perl. Some of us avoided it like the plague. I saw some reference to CPAN recently and it made me feel… uncomfortable.

Whatever things I couldn’t do in sh or Awk, I would do in C.


Good lord, why?

Seriously, Perl is an okay language for quick and dirty things of a tiny/small size. Yeah, it's not the best language for a large development project, but if you do need to parse /etc/passwd or something, not only is it perfectly good as-is, you'll certainly find something on CPAN that already does it well.

I can't imagine why would one want to do that kind of thing in C. It's just unnecessarily painful, and you'll spend 90% of the time on doing things that don't solve the actual task you need to be solved.

Yeah, in modern times it's gone way downhill, but that's mostly if you intend to do something big with it. I wouldn't use it to start a new, fully featured CMS. But for sysadmin type stuff as an alternative to sh/awk it's still just as usable as ever.


I know loads of people loved Perl, and I'm not suggesting that my perspective is mainstream or even defensible. I'm just saying that I was a young sysadmin in the "bad old days", but that I didn't like Perl. Of course I did use an awful lot of Perl scripts.

That said, I have to revise my original comment. I forgot that I was a big fan of Tcl/Tk/Expect back in the day. So it's not like my taste is better than anyone else's :)


Greybeard here. Just blogged about this very topic a couple of weeks ago: “The Unreasonable Effectiveness of Awk”

https://stephenramsay.net/posts/unreasonable-awk.html


Nice!

You could also use

    BEGIN {
      FS = ": "
    }
which sets the field separator. This would put the keywords into $2 and remove the need for the gensubs :)


It's much shorter to set the field separator with an option: awk -F": " or awk -F ": " do the same thing.


Well, look at that. Thanks!


In both your essays, you have some text surrounded by square brackets: are these intended to render as hyperlinks?


Yes. Are they not!?


They are now. Some of them weren't. Maybe it was a local error for me. Or maybe PEBCAK.

Anyway, cool blog. I'm reading it.


Perl doesn't work so well for /etc stuff when /usr, /usr/local, and /opt aren't available.


I have a perennial fascination with awk. It's one of the first things you find in /bin alphabetically, on almost any Unix. It's small but powerful, and somewhat mysterious.

Sadly I've learned it and cheatsheeted it for future reference, but I never find myself reaching for it. Part of it is preferring Python over shell scripting, maybe - awk fits better in a shell-scripting world.


Like the scripts that web browsers use?


I'm using a 20 KiB awk script that I wrote from scratch to calculate taxes owed from investments. German tax code is a bit tricky when your broker is abroad because you have to calculate everything yourself. Hilariously some German brokers don't even apply the FIFO rule correctly. Nevermind regulations about fees or double taxation treaties. FX rates for conversion into EUR are extracted either using rga from Swissquote's own PDFs, or downloaded off the internet. The transaction history itself is one big CSV export from their web site that is also parsed and analyzed using awk. For tax time I call the script with the year I want and transfer everything into official forms. Without spending 3 days in Excel. Or 10. I can't praise awk enough.


I quite like awk mainly because the entire manual is only 500 lines long.

http://man.openbsd.org/awk


I have used awk in a pipeline to pick out a field or fields forever:

   foo | awk '{print $4}'
I know there are easier ways. Heck, I even wrote a command called "words" (like "foo|words 4" or "words 1 3 2")... but I forget to use it.
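
(For the curious, roughly how such a helper can look - a sketch, not my original:)

    # usage: foo | words 1 3 2
    words() {
        awk -v cols="$*" '{
            n = split(cols, c, " ")
            for (i = 1; i <= n; i++)
                printf "%s%s", $(c[i]), (i < n ? OFS : ORS)
        }'
    }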


If you're using Awk, you owe it to yourself to check out CppAwk

https://www.kylheku.com/cgit/cppawk/about/


I don’t know any programming languages…but I feel like I “get” awk…

Also, this blog post helps with the above. For some reason I like how the author cites technical reviews of the article and email correspondence as references.


Missed opportunity to use one of my favourite words: grok [0].

And congratulations. I’ve been trying for the past year to learn awk, and while I can pretty reliably split text files and extract columns, I’m pretty far from being able to grok awk.

[0]: https://www.merriam-webster.com/dictionary/grok


I actually hate it for this purpose; in the book it also means to drink deeply, and it has psycho-sexual connotations.

Why not just say understand and move on?


The book that coined the term was written over 60 years ago. Its usage has evolved since then.


Are there any benchmarks of how much faster or more memory-efficient it might be for parsing large CSV files (~GB size) versus popular Python or R tools?


Where’s Bryan Cantrill in this thread?


I wish Awk had built-in CSV parsing support (quotes, commas inside quotes, etc.)...


GNU Awk can recognize fields using a regular expression, rather than separators.

See the "Defining Fields by Content" topic in the manual, which is based around the FPAT variable, specific to Gawk:

https://www.gnu.org/software/gawk/manual/html_node/Splitting...
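
The manual's CSV-ish pattern, roughly (a sketch; it doesn't handle doubled quotes inside a field):

    $ echo 'one,"two, with comma",three' | gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" } { print $2 }'
    "two, with comma"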


tyingq in this post noted that CSV support in awk is available via a library.


awk was my earliest exposure to the <pattern> fires <action> model of programming. those were the days.


Alternate reality: had Lisp been less dominant in AI, it might have resulted in an awk version of parallel make & AI-awk dominance.


> Awk, as created by Alfred Aho, Peter J. Weinberger, and Brian Kernighan (who drew on their initials to create the name of the utility)

TIL



