I was privileged to be one of the technical reviewers for this book. There's a fair bit of the original content (which is still great), but Kernighan's done a great job with some good restructuring and some significant updates, too. The early chapters are very hands-on, with something of a focus on "exploratory data processing", particularly with CSV files. Big data with AWK, you could say.
Gawk and awk will soon have a new "--csv" option that enables proper CSV input mode (parsing files with quoted and multiline fields per the CSV RFC). I'm really glad Arnold Robbins added a robust "--csv" implementation to Gawk, too, because that's really the most-heavily used version of AWK nowadays. I've already got CSV support in my own GoAWK implementation, and I'll be adding "--csv" to make it compatible.
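Once the flag lands, usage should look roughly like this (a sketch based on the description above; the data file is hypothetical):

    # Print the second column of a CSV file, with quoted commas and
    # embedded newlines handled properly instead of naively splitting on ","
    awk --csv '{ print $2 }' people.csv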
I'm really glad this new updated version is coming out!
It's a crying shame we never settled on a control-character-separated text format. There are ASCII control characters for record and field (unit) separators. A bit of user-space support for that would have been great.
As I recall, you can tell Awk to use the control characters as record and field separators. Not helpful if you're getting your data from others, but if you're working by yourself, you have the option. I've come to use control characters as a default because it makes life so much easier.
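For example (a sketch using the ASCII unit separator, 0x1F, as the field separator and the ASCII record separator, 0x1E, as the record separator; octal escapes are the portable way to write them in awk strings):

    # Field separator = ASCII US (\037), record separator = ASCII RS (\036)
    awk 'BEGIN { FS = "\037"; RS = "\036" } { print $2 }' data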
lolive, VisiData has some Excel support. However, don't expect VisiData to be a full blown editor for Excel files. It can provide a view of the data in an Excel spreadsheet.
If you have a python installation available, openpyxl[1] is great both for converting to .csv and for packaging .csv outputs as .xlsx (which is really zipped .xml, anyway).
It is a shame. I have been using tab-separated sheets recently as it allows me to simply not care about almost any possible character in my strings...apart from tabs of course. But those are far less common than commas, and putting strings in quotes 100% of the time looks messy to me.
To be really useful as a format it would just need for text editors to:
- display something distinct for the field separator (some editors do this)
- treat the record separator character like a carriage return (not aware of any editors that do this)
Tab-delimited "csv" formats are quite common (e.g. the CONLL format family for many natural language processing tasks) and also supported by common tools such as MS Excel for decades already.
Awk is really great. For those who know nvm [1]: I used awk to make `nvm ls-remote` run more than 10 times faster [2] by replacing the related shell script with around 60 lines of awk [3], and I was quite happy with the improvement.
It's not really a one-liner, nor anything big, but you can take it as an example of how awk really isn't just for one-liners.
Meanwhile, having `--csv` support is really nice. I'd also like to see things like a builtin `length` function become standard.
But length() is standard POSIX, no? Even length(array) has been approved by POSIX [1] but not yet included in the spec (they're very slow to update the spec for some reason). Both forms have been supported in onetrueawk, Gawk, mawk, and Busybox awk for a long time.
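For reference, both forms on one line (this should behave the same in any of those implementations):

    awk 'BEGIN { split("a,b,c", arr, ","); print length("hello"), length(arr) }'
    # prints "5 3": length of the string, and number of elements in the array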
Our data product is delivered in CSV format. Even though I create user documentation mainly using csvkit, grep and sed, I would love to convert all those solutions to AWK. Sometimes AWK is more readable than sed, and csvkit requires installation.
It would be nice to have an AWK cookbook for CSV. In terms of CSV manipulation and querying, there is only a limited number of operations, and I think there is potential to standardize those operations using AWK.
It's nice that everyone is supporting this, I've written a portable awk module that takes control of the parsing and it is SLOW (and a little buggy). I'm a little bummed that nobody will use it but this is truly a step in the right direction.
I guess for people who are still using nawk, you can set up an AWK envvar so you can { awk -f $AWKU/ucsv.awk -f <(echo '{print NR, $1}') }
Would you say the first few chapters are enough to get the 75-80% usefulness for mere mortals like me who will never try to master the full language? Or is the material fairly sprinkled throughout the whole tome?
Yes, definitely. The first three chapters would be more than enough for that: 1) An Awk Tutorial, 2) Awk in Action, and 3) Exploratory Data Analysis. For most people who just want to use AWK for one-liners on the command line, you can stop there. The rest of the chapters are about writing larger (still small! but not one-liner) programs in AWK to create reports, little languages, and experiment with algorithms.
Fantastic news. I’ve tried lots of new CLI tools but they always seem to fall between too little functionality (eg. xsv) and too much (VisiData). AWK is just right.
Awk is awesome! Glad that they are looking to modernize the book. It wasn't really necessary, all the code examples in the original edition of the book still run just fine, although some are somewhat dated, like printing ASCII bar graphs. They also had examples of writing VMs, parsers and interpreters in the book, which run on modern implementations.[0]
The language has some quirks. To declare temporary variables, it's common practice to add extra arguments to functions that won't be used. And traversal of associative arrays is implementation-dependent. I'm not sure what the situation is regarding locale and UTF-8 support.
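The extra-argument convention looks like this (a small made-up example; the parameters after the gap are never passed by callers and act as locals):

    # sum_fields is a hypothetical helper: i and total are "locals",
    # declared as extra parameters that callers simply don't pass.
    function sum_fields(n,    i, total) {
        total = 0
        for (i = 1; i <= n; i++)
            total += $i
        return total
    }
    { print sum_fields(NF) }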
EDIT: Looks like Brian Kernighan added Unicode support last year.[1]
What would you suggest as an alternative to printing ASCII bar graphs? I do that all the time. Takes 20 seconds and often makes distributions, modalities, and patterns over time obvious right away.
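For anyone who hasn't tried it, the kind of thing I mean is only a few lines of awk (a rough sketch; assumes one numeric value per input line):

    # Scale each value to a bar of at most 50 '#' characters.
    { v[NR] = $1; if ($1 > max) max = $1 }
    END {
        for (i = 1; i <= NR; i++) {
            n = (max > 0) ? int(50 * v[i] / max) : 0
            bar = ""
            for (j = 0; j < n; j++) bar = bar "#"
            printf "%10g %s\n", v[i], bar
        }
    }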
`sparklines`[1] is good for an overall low-res view. `termgraph`[2] is sometimes better for a higher-res, more capable view (but can be finicky about the data.)
Sure but, e.g., sparklines can show me the shape of my 60 numbers[1] more effectively on a single line of 60 characters[2] than an ASCII bar chart which would be 60 lines (without binning).
Is there a particular benefit in writing a VM in AWK, placed in a big BEGIN block? Very similar code can be written in Perl or Python. Isn't the strength of AWK in its line-matching capability, being able to pattern-match a line against a block of code?
> Is there a particular benefit in writing a VM in AWK
Not really. Later on the book just ran out of line-matching examples to go through and started doing regular programming instead :P. When I actually write AWK code I rely on line-matching and using a variable to handle state.
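The state-variable pattern I mean is roughly this (marker names invented for illustration):

    # Flip a flag on marker lines and act only while inside a block.
    /^BEGIN-SECTION$/ { in_section = 1; next }
    /^END-SECTION$/   { in_section = 0; next }
    in_section        { print "inside:", $0 }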
At the time, awk was the only scripting language (other than shell) generally available on Unix systems. Perl, Tcl, Python didn't exist yet. So awk was often used for general-purpose programming.
There are many systems which lack Perl or Python, but include awk.
You might be carrying an Android device at the moment --- if you drop to its default userland, that provides a bunch of utilities, including awk, via Busybox. But not, so far as I'm aware, either Perl or Python.
(You can of course install Termux which will then give you both Perl and Python, along with Node.js, ruby, and a whole slew of other scripting and compiled languages. But so long as we're considering stock installs, it's sed and awk.)
awk can be mastered by just reading the man page. The book doesn't take long to read either. Once you understand the simple principles, you can write an infinite number of scripts for all kinds of tasks.
See, when I'm writing a shell script interactively and work myself into a corner, I reach for awk, struggle with it for a bit, and then either:
1) succeed, and regret the messiness of the solution
or
2) fail, and find a non-awk way to handle it.
I really tried to like awk, but its portability hasn't been enough of a feature to raise it above other scripting languages for me. Especially if I'm going to end up in an editor.
"Dark corners are basically fractal - no matter how much you illuminate, there is always a smaller but darker one." - - Brian Kernighan (quoted in the GNU Awk book)
Awk has always been a language that I loved but have struggled to use beyond quick jobs parsing text files. I understand it is meant to be used for exactly that, but the fact that it is simple, fast and lightweight sometimes makes me want to do something more with it; yet when I start trying to do something besides parsing text, I find that it starts becoming awkward (pun intended?).
> but the fact that it is simple, fast and lightweight
I see awk as a DSL to be honest. Yes, it can be used as a general purpose language, but that quickly becomes, as you say, awkward :D
Like many DSLs, it is simple, fast and lightweight as long as it is used for its intended purpose. Once you start using it for something else, these advantages evaporate pretty quickly, because then you have to essentially work around the DSL design to get it to do what you want.
One simple thing I do with awk is to create a command processor: read one line at a time and do things on my data as a response. This is very useful because you can make your command as powerful as needed and call other unix tools as a result.
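A minimal sketch of the idea (the command names here are made up; the point is that each input line gets dispatched to whatever action you like, including calls to other tools):

    # Run with: awk -f repl.awk   (then type commands on stdin)
    $1 == "hello" { print "hi there"; next }
    $1 == "run"   { sub(/^run[ \t]+/, ""); system($0); next }
    $1 == "quit"  { exit }
                  { print "unknown command:", $0 }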
I find it pretty nice for writing simple preprocessors. For example I have one which takes anything between two marker lines and pipes it through a command (one invocation per block). Awk has an amazing pipe operator which lets you do something like this:
... {
    print $0 | "command"
}
"command" is executed once, and the pipe is kept open until closed explicitly by close("command"), at which point the next invocation will execute it again. The command string itself acts as a key for the pipe file descriptor.
And of course, no mention of awk is complete without the "uniq" implementation, which beats the coreutils uniq in every way possible (by supporting arbitrary expressions as keys and not requiring sorted input):
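(Presumably the familiar one-liner; the key can be any expression you index the array with:)

    awk '!seen[$0]++' file       # whole line as the key
    awk '!seen[$1,$3]++' file    # or, say, fields 1 and 3 as the key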
I had no idea about this "keep the pipe open" behaviour. I thought it would spawn the binary on every print statement and thus didn't consider it in the past. But now...
This is exactly why I moved from AWK to Perl for these quick jobs a couple of years ago. If you stick to an AWK-like subset, Perl is also simple, fast and lightweight. If you want to grow your scripts (and you have a lot of discipline) Perl – in contrast to AWK – gives you enough noose to hang^W^W^W^Wthe tools you need.
I write bash, python and nodejs all day, and have no professional history with Perl.
One day while avoiding working on something important, I spent half a day learning Perl in order to implement something related to a build tool that was being used in the important thing I was avoiding.
I was blown away. It's a really delightful language. Its big downfall is that it makes it feel good to do something "clever."
Perl is a joy to write, and a devil to read. I liked it, and wish I had started my career earlier so I could have enjoyed Perl in its heyday.
The same shortcut syntax that people complain about does make perl really handy for one-time tasks where you're iterating on ideas. Lots of features there that make that easy. One example:
#!/usr/bin/perl
while (<>) {
    # various processing here
    # $ARGV is set to either "-" for piped input, or the current filename
    # $_ is the data of the current line
}
That (<>) construct accepts data from stdin, redirection or file(s) named as arguments and iterates over the data. There's lots of things like that throughout the language.
> Perl? Wow. Is that better than bash, python or even nodejs? Why write in Perl over these?
It depends on scale.
If you have some quick parsing to do, then awk will get you started quickly, but as you expand your experimentation on what you want to extract/manipulate, it may not be easy to add onto the awk beginnings of your "one liner".
But if you start with awk-like† syntax but invoking it with Perl, then if you find you have to expand, Perl has more elbow room.
The intention is not to 'go big', which those other languages may be better at, but to more easily 'start small'.
† IIRC, Larry Wall wanted a utility that had awk/(s)ed-like syntax for text manipulation, just 'with more'.
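The awk-like entry point is Perl's autosplit mode, roughly like this (a sketch with a made-up condition; -n loops over lines, -a splits each line into @F, -l handles newlines):

    # $F[0], $F[1], ... play the role of awk's $1, $2, ...
    perl -lane 'print $F[0] if $F[2] > 100' data.txt
    # roughly equivalent to:
    awk '$3 > 100 { print $1 }' data.txt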
Have you ever tried to dig a hole? What tool did you use?
- Want to cut through and move loam, compost, sandy, and compacted soil? You're gonna want a rounded shovel.
- Want to break up rocky, clay soil? A pick mattock will penetrate deep, breaking up soil, shattering smaller rocks, and is used as a lever to uproot. A tiller is a faster method but disturbs the soil more.
- Want to dig a narrow, deep hole? An auger will quickly break up rocks and soil in a shaft and move them upwards.
What do you use the Perl tool for?
- Quickly and efficiently open files, read line by line, analyze text, and perform any kind of operation you can think of, with complex data structures, objects and modular code, using very few lines of code.
- Executing external commands with a shell, returning their output, and making complex yet short programs easily with arguments to the interpreter from a command line.
Absolutely. It is comparable to python in some ways, but makes it much easier to write quick one-liners using regexes and data manipulation, and to scale those up to real programs. It fills the gap between bash scripts using awk, grep and sed, and C/java/C#. Compared to bash scripting, perl is a real programming language. The documentation and library ecosystem are excellent, backwards compatibility is legendary, yet it supports modern Unicode. The syntax is weird, but try it for a bit, read the man pages, it's not that hard. The OO system is weirder, and I wouldn't make complex class hierarchies in it, but it is usable.
I like how Awk is just a single executable. A single-executable Perl that includes only the core library would be great. There is Microperl [0, 1], but no idea how well it compiles with more up-to-date Perl versions.
It can be very useful and they are pretty robust.
I often found Perl scripts running for years and years without issues at different companies.
My main issue with Perl-scripts is that they often are not "readable" by anybody but the original creator, who of course has left the company. (Not a fault of Perl itself, though.)
But your mileage may vary, and any script can be made (un)readable.
I've always found it weird that people bash on Perl relentlessly for being hard to read and then turn around and praise Rust's syntax when it is full of stuff like this:
>> My main issue with Perl-scripts is that they often are not "readable" by anybody but the original creator.
Anyone writing Perl scripts like this should not be trusted with any programming language.
Perl scripts are no less readable than bash scripts or Awk scripts. This is because so much of Perl was written to do the same work as bash, awk, sed, and the other related Unix text processing command line programs, but all under one roof.
The truth is that insulting Perl is considered stylish by some, so many people do it despite knowing little to nothing about Perl and never having used it.
However, if you want Perl to be hilariously unreadable, why not write it in Latin:
There's a limited problem domain where it's unquestionably the best. Perl beats awk and bash at their own game on their home turf. That's the best way to put it. It's faster, has more shortcuts, less warts, more power, and more readability when well written, and while aged and not huge by modern standards, CPAN (like pypi or npm) is incredible for a hyper-powered awk and bash mash-up for those tasks at the edge of of that limited problem domain. It's installed almost everywhere, so almost always available.
That stuff is just awkward and painful in Python by comparison.
I don't write Perl code, but its CLI has been a very good way to replace sed with something decent. sed not supporting Perl regex syntax, the most common kind of regex out there by far, is frankly disappointing. Even grep was able to put it together and add the -P switch. But sed is still stuck in the prehistoric syntax of ERE ("Extended Regular Expressions", as described in man pages), which e.g. instead of \d for a digit uses [[:digit:]], a syntax present in... zero? other tools or programming environments.
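A concrete example of the difference (the file name is a placeholder; both commands should do the same thing):

    # perl lets you write \d:
    perl -pe 's/\d+/N/g' numbers.txt
    # sed needs the POSIX character class (and -E for ERE):
    sed -E 's/[[:digit:]]+/N/g' numbers.txt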
Better than BASH? Mostly. Better than Python, subjective as you would have to use them both yourself. I lean towards Perl as I like sigils to denote things. I have nothing against Python though. Both are typically installed as a default now. I have never used nodejs for sys admin work.
Perl is super-specialized at reporting (that's in fact the "r" in Perl). In particular there's a bunch of extremely useful implicitly defined variables that take their context from your place in a line-by-line loop through a text file.
Perl is a great language, but please listen to this old perl programmer's advice:
1. You can write totally unreadable perl. It is probably the single worst language in this regard most programmers will run into. Be careful to make your code readable.
2. Keep your perl programs small. 200-300 lines is a good limit.
So for quick bang it out scripts that want to parse text etc... perl is great. For writing a major application, not so much.
I have found a handful of unconventional applications for awk -- I once needed a tiny pcm pulsewave generator, and awk was surprisingly decent for the job [1].
Aside from that I've mostly been using it for quick statistics [2], but it quickly moves into perl territory...
It's a language for creating quick alternative views from line- and column-oriented text streams. That means, take the output of another tool and represent it in a different way.
Ok, dumb question: Is the link supposed to link to the actual book (i.e., is the book free and/or open source) or is this just a page of miscellaneous interesting links about the book (which we can pay for, later, when it's published).
I was expecting the book, but the page itself says "This page is a placeholder for material related to the second edition of The AWK Programming Language."
It's fine if this is a placeholder page (and an awesome excuse to read and talk about AWK here on HN :) ), but I want to be sure that I'm not missing the book itself.
What I understand from the page is that the Second Edition of the book will reside in the page when it is released (the reason why it says it is a "placeholder").
I think the page description is quite clear: it contains material related to the book. Not the book itself. So I would guess all downloadable code and perhaps supplementary material.
One of my first big projects at my first job fresh out of college was using sed & awk to semi-automate the transformation of semi-unstructured data into a database.
IIRC I couldn't completely automate it because the data contained author names following naming conventions from all over the world (parsing names correctly is deceptively complex). They had somewhat arbitrary numbers of initials, ranging from 0-3.
Again, IIRC, I could easily accommodate 0 or 1 initial (followed by \.) but trying for more would make the regex I was using too greedy and pull in part of the article abstract. These were scientific books and journals.
So I scripted a sed & awk program to detect the possibility of > 1 initial, and when that occurred, I'd pipe the record into nano for a quick review where I manually inserted the correct \. characters for the initials.
It was decades of back-catalogue publications for digitization, so I sat there for days, listening to music on an original 1st gen iPod, waiting for my duct-taped kludge of a program to pipe one of thousands of records into a nano session every few minutes. This was on an Apple G4 workstation running OS X, where I earned my real bash scripting chops. It was an awful hack by today's standards, but at the time, accomplishing what was expected to be a 1-year-long project in ~1 month was seen as nearly miraculous.
I know lots of people like awk, but I pretend it doesn't exist. Why? Here's my comment on this from 6 years ago[0],
>I used awk until I learned Python (long ago). For me, awk was yet another example of the "worse is better" approach to things so common in unix. For example, if you make a syntax error, you might get a message like "glob: exec error," rather than an informative message. "Worse is better" is probably a good strategy in business and for getting things done, but still, mediocrity and the sense of entitlement that so often goes with carelessness, sickens me.
You are missing out. As a former data engineer/current SRE, I spend my entire day with VSCode/Python/Notebooks/CoPilot banging out python code - but whenever I need to do a complex analysis of a semistructured text file in < 60 seconds, awk is my twitch reflex tool. It can trivially do state transition based on patterns in the file, as well as populate hashes from one file and use them in analysis of the next file in just a few characters.
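That two-file trick, for anyone who hasn't seen it, is roughly this (a sketch; assumes the first file maps ids to names and the second has ids in its first column):

    # FNR == NR is true only while reading the first file: build the hash there.
    FNR == NR  { name[$1] = $2; next }
    # Second file: annotate each line with the looked-up name, if any.
    $1 in name { print $0, name[$1] }

Run as: awk -f join.awk ids.txt data.txt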
Awk's claim to fame in my world is that its cognitive activation energy, for anyone who has taken the 3-4 hours to learn the language from start to finish (and that's the awesome thing about the language - it really is about 3 hours of concentrated attention), is essentially nil. You see a bunch of ugly, not-really-structured 500 MB text files that you can't pull into pandas or easily parse into python dicts? No problem - awk will tear through them for you and get the information you want in < 60 seconds, including the time you took to write your (almost always single) line of code.
Point taken. I have a Python program that is an elemental version of awk, and I use that for the odd task. I can modify it if needed and I have the entire Python library to help me. Is the text Unicode? HTML? These little details matter.
I'm not complaining that someone banged out awk (speaking figuratively) on a Friday afternoon to do something and not have to stay after work. Excellent! My complaint is that the failure to address technical debt has negatively affected the productivity of millions, if not tens of millions, of people, often working under pressure, for DECADES.
I'm not sure what technical debt you are referring to. Awk is designed to do one very simple job, and it does so using a language that I can usually teach to new SREs in < 2 Hours with 9-10 follow up tasks that drill in their understanding.
It's benefited from extraordinarily enlightened stewardship, kept its minimalism and strengths, and will finally get a key enhancement (UTF-8 support).
The first edition manual is probably the greatest example I've ever seen of technical writing as well.
I will bet you $1000 that time spent learning Awk will lead to better results much faster than time spent polluting your privileged user directories with Python's excuse for "dependency management"
For many python users, it’s the only language they know. Often, they see programming in python as part of their “identity”, so they’re overly invested in it, to the detriment of other wonderful languages, like awk.
I used to code perl myself, back in the day - but I came to appreciate the simplicity of awk, and now it’s one of my favourites. I no longer code perl, as a consequence, as I believe awk to be far more elegant! I wouldn’t have done so, if I was overly invested in being a “perl programmer”.
Specifically, Awk is a good solution to a problem that should never have existed in the first place. Why am I having to write these bespoke parsers for the random mess of output formats that you get from the UNIX command line?
Well, the fact is that I have to write such parsers. That's very sad, but has no chance of being fixed. So it's good to know Awk.
I think Erik Naggum had this exact criticism of Perl.
Seems like the best time to ask since this is an awk thread: if anyone has a line on the original artwork or a source for the awk t-shirt please let me know. From memory it's of a gangly bird jumping / parachuting from an airplane (DC3?) and captioned with awk's infamous catch-all error message: "Awk: bailing out near line one".
One of the first utilities I had to get to grips with way back was awk, and it serves me well to this day. Best bang for buck investment of time in my entire career. Even today I still use some variant of awk -F(x) '{print $x}'.
This is good news, because you have to pay a lot for a used copy of the first edition nowadays. I hope the spirit remains the same as in the first edition.
I read the first edition so many times as a young kid... AWK was just such a cool name when I would go to the library and grab a book out of the stacks trying to learn something new.
Honestly after watching a lot of Kernighan interviews and reading his original book on C he is a very great communicator. I wonder how different the software world would have been without him at Bell Labs. Would Unix and C have become as widely used as quickly?
Awk is old but great, designed to chew through lines of text files with ease, and has great defaults that minimize the amount of awk code you actually have to write to do anything. It's underrated.
I like the idea of Unix pipelines, but I hate all the sublanguages, awk being one of the biggest. I scratched my itch and built my own shell, marcel: https://github.com/geophile/marcel.
I mention this specifically, here, because of the CSV point. Marcel handles CSV, e.g. "read --csv foobar.csv" reads the foobar.csv file, parses the input (getting quotes and commas correct), and yields a stream of Python tuples, splitting each line of the CSV into the elements of the output tuples.
Marcel also supports JSON input, translating JSON structures into Python equivalents. (The "What's New" section of marcel's README has more information on JSON support, which was just added.)
I usually use this awk function to parse CSV in awk:
# This function takes a line, i.e. $0, and treats it as a line of CSV, breaking
# it into individual fields and storing them in the passed-in field array. It
# returns the number of fields found, 0 if none found. It takes account of CSV
# quoting, and also commas within CSV quoted fields, but doesn't remove the
# quotes from the parsed fields.
# use in code like:
#   number_of_fields = parse_csv_line($0, csv_fields)
#   csv_fields[2]    # get second parsed field in $0
function parse_csv_line(line, field,    _field_count) {
    _field_count = 0
    # Treat each line as a CSV line and break it up into individual fields
    while (match(line, /(\"([^\"]|\"\")+\")|([^,\"\n]+)/)) {
        field[++_field_count] = substr(line, RSTART, RLENGTH)
        line = substr(line, RSTART+RLENGTH+1, length(line))
    }
    return _field_count
}
It's not perfect but gets the job done most of the time and works across all awk implementations.
I FINALLY started learning awk in the past couple weeks. I think I was intimidated because awk can be very terse, and there are some default actions that aren't clear when you first start looking at awk scripts.
My other problem is that I want to accomplish things, not learn a tool, and it generally takes me a bit longer than it should to decide to actually learn something and not just hack at it.
Yes, because you'll be done with your thing before others figure out how to lay out your spreadsheet. Also, your solution will be reusable.
(based on my experience where people who could've benefited from awk for a one-liner dependably reach for sheets/excel rather than something like python or perl)
I wish I used awk all the time, but every time I use it the knowledge I gain doesn't stick. Could be due to its arcane syntax, which is just too hard for me to remember.
Yeah this solves the "I don't use it enough to remember it problem". ChatGPT eliminates the first hurdle of using it, so I'm likely to use it more, and then hopefully it will start to stick.