Hacker News
Awk Technical Notes (maximullaris.com)
153 points by todsacerdoti on March 29, 2023 | 39 comments


My very first job getting paid to write software was writing scripts in Awk to parse and analyze some software log files, for a faculty software researcher, in, maybe, 1997? I didn't know Awk before; it's just what I inherited. Spent a few hours with the O'Reilly book, and I was like, okay, sure, let's go.

As the stuff we were doing in that project got more complex, at some point someone suggested to teenage me "You might want to look at Perl for this now," and then I moved to that. (with the Camel O'Reilly book, of course!)

Haven't touched either one in years now.

Learning new things can be much more overwhelming for me now, I don't know how much is me vs environment. But I am nostalgic for those days where I'd sit down with a print book, and within hours have a grasp of the fundamentals, or within days feel like I had basic fundamental conceptual understanding of the whole dang thing (not of every possible feature, but of the conceptual framework, the big picture).


I read a different book, written by the creators of AWK themselves. But the experience was much the same. You can read it in one sitting.

Or, rather, you can read the first two chapters easily in one sitting. Chapter one gives a brief overview and examples. Chapter two describes the whole language, every function and every variable! The rest of the book is just more examples. I really love this style!


If I write more than about 50 bytes of awk I inevitably end up using perl instead because it's much more powerful. One example: I wanted to convert a column in the format a|b|c to ['a', 'b', 'c']. Doing that in awk is painful. In Perl, there's join and map; it takes a few seconds.


Yes, my impression is there are fewer such books written nowadays that explain the conceptual fundamentals, unfortunately.


One of the reasons is that the frameworks such a book would use change completely every 6-12 months. If you write a book that uses k8s, after 12-24 months the examples you use might not even run.

Whereas if I pick up the Awk book that OP referred to, it's likely I could still use it to learn; you can't really say the same of most of the modern tech stack.


I'm already a casual awk enthusiast but I'm really hoping to find an opportunity to use it for a "real" software project soon. I've been reading the gawk user manual, and suffice it to say, the power and features of the language are dramatically underutilized in most of the things people normally do with it (my most common use case is probably a hybrid of grep and cut).

https://www.gnu.org/software/gawk/manual/gawk.html


I wrote an IRC bot in it, one of those "paste a line of code and the bot will evaluate it and print the result" bots that you find in programming language channels. It's not a particularly big or "real" project, but it definitely fulfills the need of having a bot in that particular IRC channel.

awk is great for it because IRC (or at least the subset that the bot cares about) is relatively easy to parse, and shelling out to the shell script that does the actual code evaluation and printing the result back is also fairly straightforward. Someone else used to have such a bot before, but they had written it in Rust with a bajillion dependencies; if I had done that I would've had to update dependencies and redeploy it every other week. In contrast, I deployed my awk version once and then basically haven't touched it in years.
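
As a rough illustration of why IRC suits awk's field-splitting model, a hypothetical sketch (not the bot's actual code):

```shell
# An IRC message is space-delimited: :prefix COMMAND target :trailing.
# Default field splitting does most of the work; two subs finish the job.
echo ':nick!user@host PRIVMSG #chan :hello world' | awk '{
    nick = $1
    sub(/^:/, "", nick); sub(/!.*/, "", nick)    # ":nick!user@host" -> "nick"
    msg = $0
    sub(/^[^ ]+ [^ ]+ [^ ]+ :/, "", msg)         # keep only the trailing parameter
    print nick, "said in", $3 ":", msg
}'
```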


This sounded interesting enough that I went and found the source code on GitHub: https://github.com/Arnavion/evalr. Starred. :-)

The following caught my attention in the bash wrapper:

  coproc GAWK {
          gawk ...
  }
  
  <&"${GAWK[0]}" openssl s_client -connect "$IRC_SERVER" -quiet >&"${GAWK[1]}"
This is a cool way to make awk talk over a socket without relying on Gawk-specific features. For sockets without TLS you can replace openssl(1) with nc(1). I'll keep it in mind.

Edit: You can also use http://www.dest-unreach.org/socat/ and https://nmap.org/ncat/ with awk:

  socat "OPENSSL:$host:$port" 'EXEC:awk ...'
  
  socat "TCP:$host:$port" 'EXEC:awk ...'
  
  ncat --exec '/usr/bin/awk ...' --ssl "$host:$port"
  
  ncat -e '/usr/bin/awk ...' "$host:$port"


I recently wrote a program of slightly over 200 lines in portable AWK: https://gitlab.com/dbohdan/humsize. I wrote it for a specific operating system (NetBSD), but I have ended up using it everywhere; the portability helped. I can recommend AWK for small utilities that transform text in a line- and column-oriented manner and don't need libraries.

The main difficulties were making the command-line interface and testing. You can't have flags that begin with a dash in portable AWK without a shell wrapper, and I didn't want one. I settled on manually parsing key=value options, which I don't think are bad, just nonstandard. They look like this:

  humsize format=%6.1f%1s 'zero=  empty'
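
One way to parse such options portably is to scan ARGV in a BEGIN block (a hypothetical sketch, not humsize's actual code): record anything of the form key=value and blank it out so awk won't treat it as a file operand.

```shell
# Portable key=value option parsing: index()/substr() split on the first
# "=", so values containing "=" survive intact.
awk 'BEGIN {
    for (i = 1; i < ARGC; i++) {
        if (index(ARGV[i], "=") > 1) {
            key = substr(ARGV[i], 1, index(ARGV[i], "=") - 1)
            opts[key] = substr(ARGV[i], index(ARGV[i], "=") + 1)
            ARGV[i] = ""
        }
    }
    print "format =", opts["format"]
}' format=%6.1f%1s
```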
There is no standard way to test AWK code. For testing I wrote a shell script that checks the program's outputs with grep: https://gitlab.com/dbohdan/humsize/-/blob/122aaed8d65dc8c285.... Don't do this; your tests should give the user (you) better feedback. You may think your program doesn't need anything but a couple of trivial tests that won't ever change; it is a pain when you inevitably are proven wrong. I should have instead had a directory with reference outputs and diffed against them to see what went wrong (my own example: https://github.com/dbohdan/initool/blob/72f65d3fde245ff8660c...).

To ensure I didn't introduce portability issues, I set up testing against different awks in GitLab CI.

  image: debian:bullseye-slim

  before_script:
    - apt update
    - apt install -y busybox gawk mawk original-awk
    - ln -s "$(which busybox)" awk
    - busybox wget -O goawk.tar.gz https://github.com/benhoyt/goawk/releases/download/v1.21.0/goawk_v1.21.0_linux_amd64.tar.gz
    - tar xzvf goawk.tar.gz 

  test:
     script:
       - AWK=false ./test || true
       - AWK=./awk ./test
       - AWK=gawk ./test
       - AWK=./goawk ./test
       - AWK=mawk ./test
       - AWK=original-awk ./test
Edit: Rephrased and added a nicer shell test example.


I’m usually not a big side project guy, but I successfully used AWK to solve an IRL problem last year. It really helped solidify my understanding of the language.

The problem was that the Garmin GPS data for a bike ride I had just completed had split into multiple rides. I used AWK to stitch together the data into one file. I also did some basic linear interpolation to fill in missing data points.

The GPS data is formatted as XML and I was able to parse it fairly robustly using AWK.


How did you parse XML with AWK? I would never think of using AWK for XML data. I'd even steer clear of CSV data unless I could guarantee no in-field commas or newlines.


Commas are easy if it's quoted. I just first run an awk script that uses " as the field separator and substitutes or deletes commas in odd numbered fields (as long as that's acceptable for your use case). Then with `-F,` I always check that NF is the same for all lines in the csv before proceeding.
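
A sketch of that trick, assuming each line starts outside a quote (in which case quoted content lands in the even-numbered fields):

```shell
# Split on the quote character; fields inside quotes alternate with
# fields outside, so commas can be neutralized only inside quotes.
echo 'x,"a,b",y' | awk -F'"' -v OFS='"' '{
    for (i = 2; i <= NF; i += 2) gsub(/,/, ";", $i)   # even fields are quoted
    print
}'
```

This turns `x,"a,b",y` into `x,"a;b",y`, after which `-F,` splits cleanly.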

Depending on how the xml is structured, it can be possible to just pattern match on the tags if you have something simple to do.


Yes, this is it. I pattern matched on tags to create a simple state machine. Then I extracted values by splitting lines on commas and quotes.
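
For GPX-style data, that kind of tag-driven extraction might look something like this (a hypothetical sketch with assumed tag and attribute names, not the commenter's actual code):

```shell
# Pattern match on tags; pull attribute values out with match()/substr().
printf '%s\n' '<trkpt lat="1.0" lon="2.0">' '  <ele>34.5</ele>' '</trkpt>' | awk '
/<trkpt / { match($0, /lat="[^"]*"/); lat = substr($0, RSTART + 5, RLENGTH - 6)
            match($0, /lon="[^"]*"/); lon = substr($0, RSTART + 5, RLENGTH - 6) }
/<ele>/   { gsub(/<\/?ele>| /, ""); print lat, lon, $0 }'
```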


I find that I tend to use AWK for text munging tasks that are too small to call a "project".


Big, big fan of AWK. It sometimes feels like ancient, alien UNIX technology to me. But lately I've been gravitating more and more towards perl. You can write the same one liners (with perl -e and friends), it has superb support for regexes and it's just a more capable language (as expected, not bashing AWK).


You can use Ruby for this task too. I used to use Perl for throwaway one-liners, but on advice I switched to Ruby because of the bigger community and I'm pretty happy with it.

(Python isn't as nice for one-liner text processing, both because of the lack of Awk heritage--so no built-in regex syntax--and because of the indentation-based syntax requiring newlines for most things.)


General performance rankings:

grep|ripgrep > awk|sed > most scripting languages > shell

gawk has regexps.

We use Ruby a bit at work. Most coworkers hate it, internal customers scoff at it, and no one's interested in mastering it, using it properly, or considering even fundamental software engineering principles. Tech debt piles up and no one wants to touch it because there's no performance review KPI credit for it.


I like your list. As a Ruby writer for many years I’ll add Crystal > Ruby. Being able to deploy a single binary is such a boon that even the other myriad improvements aren’t worth mentioning, especially if we’re in the “what you might do with Awk” territory. Go users probably feel that way too but I know much less about that.


Awk is an improvement on most of its successors.

(h/t Tony Hoare)


The author is really persuading me to learn awk, because he addresses the very reasons I avoid it and argues that they are faulty, and I consider his reasoning decent.


> The absence of GC allows to keep the language implementation very simple, thus fast and portable. Also, with predictable memory consumption. To me, this qualifies AWK as perfect embeddable language, although, for some reason this niche is firmly occupied by (GC-equipped) Lua

The absence of a GC is nice for an embedded language, but I don't think that should be the only criterion. Unless you need an embedded language that processes text one line at a time, awk is probably not a good fit.


I used AWK for many years, but one day I realized that I had pushed AWK beyond what it's meant for, same as the author here. A classic red flag from the article:

    function NUMBER(    res) {
      return (tryParse1("-", res) || 1) &&
        (tryParse1("0", res) || tryParse1("123456789", res) && (tryParseDigits(res)||1)) &&
        (tryParse1(".", res) ? tryParseDigits(res) : 1) &&
        (tryParse1("eE", res) ? (tryParse1("-+",res)||1) && tryParseDigits(res) : 1) &&
        asm("number") && asm(res[0])
    }
why put yourself through this, when you can just do something like this instead:

    package parse
    
    import "strconv"
    
    func parse_float(s string) (float64, error) {
       return strconv.ParseFloat(s, 64)
    }
    
    func parse_int(s string) (int64, error) {
       return strconv.ParseInt(s, 10, 64)
    }


Not disagreeing with the overall point but that particular example is from an AWK JSON parser implementation so the whole point is to do it in AWK. If you look at the entire file it's not too bad considering.

Funnily the actual Go JSON decoder code ends up doing something similar during scanning:

https://github.com/golang/go/blob/master/src/encoding/json/d...


Depends on the AWK implementation, apparently.

    bash> awk -v v="80.1%" 'BEGIN{print v+0.1}'
    80.2
gawk has `strtonum`. But yes, parsing in awk generally looks like a pain. With plain positive/negative ints though, not so hard:

    echo "123456" | awk '{
        if ($0 ~ /^-?[0-9]+$/) {
            num = 0
            sign = 1
            start = 1
            if (substr($0, 1, 1) == "-") {
                sign = -1
                start = 2
            }
            for (i = start; i <= length($0); i++) {
                digit = substr($0, i, 1)
                num = num * 10 + digit
            }
            num = sign * num
            print "The integer is:", num
        } else {
            print "Invalid input string:", $0
        }
    }'


As mentioned, the example you quoted is from a pure-AWK JSON parser. I don't dispute that AWK has issues, but AWK is one of those languages that magically coerces strings to numbers, so you can just write `"1" + 2 + "3.5"` and it'll work.
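
The coercion is easy to demonstrate; awk converts a string to the number its leading numeric prefix denotes:

```shell
# Strings in numeric context are converted automatically.
awk 'BEGIN { print "1" + 2 + "3.5" }'   # prints 6.5
```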


> AWK was designed to not require a GC

...

> The most substantial consequence is that it’s forbidden to return an array from a function, you can return only a scalar value.

This doesn't make sense to me. Does someone understand what it means?

In e.g. C++ a function can return an array without any GC or refcounting, by "moving" the array into the caller's stack.


If arrays were first-class values we would be able to construct circular references:

    a[1] = a
Even if we only allowed it on return values:

    function f(x) {
        return x
    }

    a[1] = f(a)


I am not an expert but I’d say it’s because Awk arrays are associative; they are more like maps than slices, to use Go terminology. And IIRC (it’s been a while) the array values are not strongly typed. So I think you could even say:

  a[1] = "hello"
  a["world"] = 2
That means that - unlike C arrays - Awk arrays are not a simple, addressable byte range, but a complex data structure with lots of pointers.

I suppose you could come up with a way to serialise the array and pop it on the stack but that would be a lot of work, and for the kind of things I use Awk for, the arrays would often be huge.
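
The mixed subscripts are easy to verify (iteration order over an awk array is unspecified, hence the sort):

```shell
# awk arrays are associative: numeric and string keys coexist, because
# numeric subscripts are converted to strings internally.
awk 'BEGIN { a[1] = "hello"; a["world"] = 2
             for (k in a) print k, "=>", a[k] }' | sort
```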


I'm not really convinced by that argument. No matter how big your associative array is, it is still represented by a pointer to the initial/root element. It should be quite possible to move the pointer into the caller.

Maybe the language is simpler without it, and that can be a good reason to avoid it. But I don't buy that it has anything to do with GC.


An 18 episode/chapter series about GNU Awk in HPR https://hackerpublicradio.org/series.php?id=94


OMG never realized that $ is an infix operator - Plenty of times where I needed something like $(NF-1) and instead used verbose stuff like NF==5 { ... } NF==6 { ... }
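
Since `$` takes an arbitrary expression as its operand, `$(NF-1)` directly addresses the second-to-last field:

```shell
# $ applied to an expression: print the second-to-last field.
echo 'a b c d e' | awk '{ print $(NF-1) }'   # prints d
```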


I don't think you actually mean "infix"?


sorry - unary


[flagged]


This commenter is a troll; see their history if the above comment isn't enough.

Edit: what the fuck is going on?


Looks like they have a history of using chatGPT to post comments, specifically.


Right, but somehow they've been the top post for 1/2 hour and I got modded way down for pointing out it was an obvious troll. I hesitate to comment because I assume that's what the script kiddie is looking for out of this.


I think the problem you're running into is that this particular comment looks human written?


I don't get how you could conclude I'm a troll? I'm not spamming nor arguing with anyone, just sharing my opinions and experiences.


No, you're copy/pasting low-effort ChatGPT babble.



