which is nice for more complex "one-liners".
Awk is also standard in all POSIX environments, and the mawk flavor is extremely fast (relevant if you are processing huge files).
There is a superb book, The AWK Programming Language, which teaches a lot about programming in addition to the awk language. Good discussion and a link to the PDF here: [0]
The AWK Programming Language is the best programming book ever because it lets you learn the language through interesting problems (like writing a very small assembler).
I believe the book's been out of print for some time. I've no idea if that's why you were downvoted. I wouldn't knowingly post a link to anything pirated. That PDF is linked from a ton of sites, and the site I found it on seemed reputable, though I don't remember which it was at the moment.
Fast, but partly because it's not Unicode-aware: it treats strings as 8-bit character sequences rather than UTF-8. Often that's fine, if non-ASCII characters are only passed through unmodified, but it requires some care to avoid problems.
I ran into this in practice because I was using awk to convert paper titles from "Title Case" to APA-style "Only first word of title capitalized" case. The garbled output led me down a rabbit hole where I discovered that only some awks support Unicode locales, and the default awk on Debian (mawk) isn't one of them.
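A quick way to see the difference (a sketch, assuming a UTF-8 locale and that both awks are installed):

$ echo 'Étude' | gawk '{ print tolower($0) }'   # gawk honors the locale
étude
$ echo 'Étude' | mawk '{ print tolower($0) }'   # mawk sees raw bytes
Étude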
See my post way down at the bottom; I do mention this. GNU awk also has a lot of useful extensions and builtins, so it's sometimes painful to use a plain POSIX awk. But when you need the speed, it's nice to know mawk is out there.
I'd love to see the mawk compilation technology merged to GNU awk. Or mawk updated with Unicode support and a few of the GNU extensions.
The other item on my awk-like wishlist is CSV support, i.e., splitting $1 .. $N by CSV rules instead of just a field separator. I usually end up copying CSVs into PostgreSQL, because it is fast and then I can process them very flexibly, but it's a bit heavy for things that could be one-liners. Also, PostgreSQL won't load malformed CSV, but I suspect awk could be less picky.
There is Miller, which hits some of these points, but I haven't really grokked it yet and the syntax seems ... awkward.
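On the CSV wishlist: GNU awk can get partway there with its FPAT variable, which defines what a field is rather than what separates fields. A sketch (data.csv is a stand-in; this pattern, taken from the gawk manual, does not handle escaped quotes inside a quoted field):

gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" } { print $2 }' data.csv

So a line like alice,"hello, world",42 yields $2 = "hello, world" instead of "hello.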
I have been reading through it and running code snippets from there as I find time. It's awesome!
If anyone's interested, I also update my notes from the book on my blog: https://scripter.co/notes/awk. I plan to have that notes post contain the entire book when I am done.
Pretty much this whole thread is people coming up with better ways of solving the problem. I love Hacker News! :)
But I think the point of the blog (for me) is showing what 'ed' is and what you can use it for (the problem at hand is secondary). And it did exactly that for me: I knew about 'ed' but never really understood what made it special or where/how you could use it. Thanks Julia for enlightening me!
> I knew about 'ed' but never really understood what made it special or where/how you could use it.
ed script is one of the formats supported by diff and patch (with their -e or --ed command-line switch). (diff generates the script by itself, but patch just pipes it to ed.)
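For example, with two toy files:

$ printf 'foo\nbar\nbaz\n' > old
$ printf 'foo\nquux\nbaz\n' > new
$ diff -e old new
2c
quux
.

The output is literally ed commands: change line 2, with the replacement text terminated by a lone dot.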
This problem is easily solved with a regexp that processes the input as a whole, rather than working on it line by line. I had this need often enough that I exposed the Go regexp engine as a command-line tool, regrep, which can insert "elephant" after "baz" with:
go get github.com/orivej/unix/regrep
regrep s '(\n( *-) baz\n)' $'$1$2 elephant\n' < input.yaml
It only processes standard input and cannot by itself replace the contents of an input file with its output; but another tool, inplace, helps:
go get github.com/orivej/unix/inplace
find . -name '*.yaml' -exec inplace {} regrep s '(\n( *-) baz\n)' $'$1$2 elephant\n' \;
The commands set the argument list to a list of three files (by default, it is set to the filenames you passed to vim on the command-line). Then, autowrite is enabled which automatically saves each buffer after editing it. Finally, argdo runs a command on each argument file.
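Spelled out as the ex commands you would type inside vim (a sketch; the filenames are hypothetical and the g command is borrowed from the ex example elsewhere in this thread):

:args a.html b.html c.html
:set autowrite
:argdo g/search string/.,+20d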
Why write the ‘ex’ commands to a file? Why not just do the echo inline, like this
for f in *.html; do
echo -e "g/search string/ .,+20 d\nx" | ex - "$f"
done
?
(I also added quotes to the $f dereference in case of file names containing white space, and the -e flag to echo to expand \n to newline. In case of a Bourne shell without support for -e in echo, I would probably use “{ echo "g/..."; echo "x"; } | ex - ...” instead of using \n.)
Also, in a production script I would probably have used “find . -maxdepth 1 -name "*.html" -print0 | while IFS= read -r -d '' f; do ...; done” instead of a ‘for’ loop over a pathname expansion, in order to guard against there being no html files, in which case a ‘for’ loop over a pathname expansion would pass the literal string “*.html” as the file name argument to ‘ex’.
I've recently used EDLIN from an 80s version of MS-DOS.
After spending 5 minutes with the manual, I realized it was the best line editor I've ever used.
ed is very similar, but it doesn't come with a nice manual.
The info page is chaotic and doesn't start from the beginning. You can cover typical usage in 5 lines, but no, you have to read through 25 pages of stuff just to figure out a sensible command. I just use vim or nano.
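For what it's worth, those 5 lines of typical usage might look like this (the annotations on the right are mine, not part of ed's input):

$ ed file.txt
,p              print the whole buffer
/foo/           jump to the next line matching "foo"
s/foo/bar/      substitute on the current line
w               write the buffer back to the file
q               quit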
> ed is very similar, but it doesn't come with a nice manual
In many cases it indeed doesn't. Historically, that documentation void was meant to be filled by learn tutorials[1] and a written introduction in volume 2A of the manual[2,3]. Unfortunately, the learn tutorials can't really be made use of in a modern environment; you’d actually have to set up a PDP-11 emulator with V7 (though prebuilt images exist) and work with that, an environment where backspace doesn’t really work out of the box.
OpenBSD ed(1)[4] tries, but it's just not quite there as an introduction.
I'm much more familiar with sed than ed, so here's how I would do this:
sed '/baz/{s/.*/&\n&/;s/baz/elephant/2}' input.txt
or, slightly more readable
sed '/baz/ {
s/.*/&\n&/
s/baz/elephant/2
}' input.txt
The first substitution appends a copy of the line to the pattern space, the second substitution replaces the second occurrence of "baz" with "elephant".
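A quick check with GNU sed on a toy input:

$ printf '  - foo\n  - baz\n' > input.txt
$ sed '/baz/{s/.*/&\n&/;s/baz/elephant/2}' input.txt
  - foo
  - baz
  - elephant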
This being said, I went ahead and bought the book mentioned in the article [0] - a neat little read.
To use this solution with a version of sed that does not accept newlines in patterns (i.e. to make it portable), one has to put the commands in a sed commands file and run it with sed -f.
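That is, something like this (a sketch; the backslash followed by a literal newline is the portable way to embed a newline in the replacement):

$ cat insert.sed
/baz/{
s/.*/&\
&/
s/baz/elephant/2
}
$ sed -f insert.sed input.txt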
How to make the one-liner portable without using a sed commands file?
Maybe something like:
sed 's/baz/elephant/;/^ \{2\}- elephant/{h;G;};/^ \{4\}- elephant/{h;G;};s/elephant/baz/' foo|sed -a wfoo
1. s/baz/elephant/
2. duplicate that line if two or four space indent
3. s/elephant/baz/
4. save
N.B. no temp file used to save changes
cf. jvns.ca blog:
1. search for baz
2. copy that line and paste it
3. s/baz/elephant/
4. save and quit
When I first encountered the command line in college, my professor introduced me to vi and the basic bash commands, but I wasn't familiar with any other scripting languages (or even, if memory serves, the concept of a 'scripting language'). As a result, I ended up creating a pretty dizzying array of ed scripts until someone introduced me to sed and the fact that you can use bash as a scripting language.
An older version of the article contained the following:
> I had one extra weird requirement which was that some of the lines were indented with 2 spaces, and some with 4 spaces. The - elephant line needed to have the same indentation as the previous line.
Chapter 20 of O'Reilly's "Unix Power Tools, 3rd Edition" is all about batch editing and covers ed/ex as well.
Maybe there's an old copy of "Unix Power Tools" over in your server room or an abandoned cubicle in the office... the content has not changed much in the ensuing decades!
Ed is pretty complete, to the point that it was a little too powerful when it was part of a security problem with FreeBSD's patch. https://securitytracker.com/id/1033188
I do this fairly regularly in various shell scripts, but less than I used to ever since “sed” introduced the --in-place option, which makes sed more useful for my purposes most of the time.
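For example (GNU sed; BSD sed wants an explicit, possibly empty, backup suffix after -i):

sed --in-place 's/foo/bar/g' config.txt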
It's worth noting that the "enter a single . on a line to signal the end of an input" convention found its way into mail/mailx and SMTP too. The good thing is that it means you don't need to insert special characters like Esc (Ctrl+[, 27, 0x1B, whatever you want to call it) into your script; the bad thing is when you do want to add a line containing a single "."... whereby ed and SMTP have diverged with different "escaping" conventions.
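Concretely: SMTP "dot-stuffs" by doubling a leading dot, which the receiver strips, whereas with ed one common workaround is to enter a placeholder line and then substitute the dot in afterwards. A sketch (file.txt is hypothetical):

$ printf 'a\nplaceholder\n.\ns/^placeholder$/./\nw\nq\n' | ed -s file.txt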
Good to know about ed! Since no one else has mentioned this: Emacs keyboard macros seem much easier to me, especially since, beyond the basic editing stuff of awk/ed/sed, I can leverage all the editor extensions and modifications I have accumulated over the years. That is, unless it's tens of thousands of files and the edit is exceptionally simple; I would write a script in that case too.
Since discovering them, I use Emacs keyboard macros all the time.
Let's say I have:
key value
salt pepper
fish chips
vodka orange
rum cola
and I want the second column in uppercase.
<F3> to start recording a macro. Alt →, → to position the cursor at "v" (or just →→→→→→ if this is a fixed width column), then Alt U to uppercase the next word. → to move the cursor one forward, to the start of the next line. <F4> to finish recording the macro.
Then press <F4> five times to run the macro five times.
(Explanation intended for users who've never used Emacs before. Of course, there are optimizations.)
But in this example I would have probably used an ex command:
:%normal wgUaw
(for every line, do as if I had typed wgUaw) or visually selected the second column as a block and just pressed U.
I'm genuinely interested in someone showing how to do this kind of transformation in popular modern editors such as Atom and VSCode. Is there as flexible a way as in the classic editors?
You can do it using multiple cursors. In Sublime, you place the cursor at the beginning of “value”, then press Ctrl+Shift+Down until the end - there will be a cursor on every line. Then you press Ctrl+Shift+Right to select all values in the second column. Then press Ctrl+Shift+P and choose “Convert to Upper Case”, or just Ctrl+K, Ctrl+U.
The main reasons I am not moving to something more trendy are keyboard macros and the fact that Emacs is designed to be equal parts runtime and editor. I think that for most purposes Emacs and Vim are equivalent, but virtually all modern editors are lacking compared to a good Emacs/Vim configuration.
Microoptimizing a couple of things, I'd do this from the top of the buffer:
<F3> M-f M-u C-n C-a C-0 <F4>
I particularly like combining the "finish recording" and "running the macro" steps into a single <F4>. Plus, using a numeric argument of 0 seems better than counting lines.
There is another program I use for editing that is older than ed.
It is written in asm.
I think it may actually be faster than sed (and sed is faster than AWK, Lua, Perl, Python, etc.)
1.spt:
; x = " - baz"
; y = " - elephant"
;a a = input :f(end)
; output = a
; a ? x :s(d)f(a)
;d output = y
; :(a)
;end
spitbol 1.spt < foo
Depends on the awk implementation and the task. However, even GNU awk (gawk) is very fast, and mawk is astonishing.
Here is a simple example: count the lines, words, and characters in a 65MB text file (10 copies of a novel stuck together).
Testing on Ubuntu GNU/Linux 16.10, reporting the middle of three tries:
export LANG=ASCII # avoid differences due to unicode
$ time -p wc big10.txt
1284570 10956950 64886660 big10.txt
real 0.29
user 0.28
sys 0.01
$ time -p gawk '{l+=1; w+=NF; c+=length($0)+1} END {print l, w, c}' big10.txt
1284570 10956950 64886660
real 0.55
user 0.53
sys 0.01
Not bad: gawk takes less than twice as long as wc, which is the standard tool for this.
$ time -p mawk '{l+=1; w+=NF; c+=length($0)+1} END {print l, w, c}' big10.txt
1284570 10956950 64886660
real 0.35
user 0.33
sys 0.01
But mawk is only 20% slower than wc. For a script!
Just for a check, even Python is not terrible at this:
#!/usr/bin/python
# Python 2: count lines, words, and characters, like wc
import sys
l, w, c = 0, 0, 0
for line in file(sys.argv[1], "rb"):
    l += 1
    w += len(line.split())
    c += len(line)
print l, w, c
$ time -p ./wc.py big10.txt
1284570 10956950 64886660
real 0.87
user 0.86
sys 0.01
As with k, I am lacking in spitbol experience and so the counts are not identical to wc. Also I am using 10MB of big10.txt instead of the entire file.
1.spt:
;* m line count, c word count, o char count
;* p word pattern
; n = "0123456789"
; w = n &ucase &lcase "-"
; p = break(w) span(w)
;a a = input :f(c)
; o = o + size(a)
; m = m + 1
;b a ? p = :f(a)
; c = c + 1 :(b)
;c output = m ' ' c ' ' o
;end
dd if=big10.txt bs=5m count=2 of=10m.txt
time -p wc 10m.txt
201346 1763181 10485760 10m.txt
real 0.51
user 0.44
sys 0.00
time -p spitbol 1.spt < 10m.txt
201347 1770831 10235302
real 0.33
user 0.31
sys 0.01
time -p wc big10.txt
1284570 10956950 64886660 big10.txt
real 2.76
user 2.68
sys 0.08
Trying this as a novice with k3.
Being a novice, 2 out of 3 counts are incorrect, and this is probably not the fastest solution.
Total "words" in the example was simply AWK's NF. But looking at big10.txt, there are anomalies such as words separated by "--" instead of a space.
Here I used a non-space character followed by a space. Far from accurate, but not too far.
1.k:
w:0:"big10.txt";v:,/$w
m:v _ss "[^ ] " / "word": char followed by space
#w / lines
1+#m / words
#v / characters
time -p k 1
1284570
10019630
63602090
real 2.70
user 2.40
sys 0.28
Counting lines with sed
time -p wc -l big10.txt
1284570 big10.txt
real 0.13
user 0.06
sys 0.07
time -p sed -n '$!d;=' big10.txt
1284570
real 0.29
user 0.19
sys 0.09
That is a slow computer; mine is a pre-Haswell i3.
$ time -p sed -n '$!d;=' big10.txt
1284570
real 0.07
user 0.06
sys 0.00
time -p mawk 'END {print NR}' big10.txt
1284570
real 0.04
user 0.03
sys 0.00
$ time -p gawk 'END {print NR}' big10.txt
1284570
real 0.14
user 0.13
sys 0.00
$ time -p wc -l big10.txt
1284570 big10.txt
real 0.02
user 0.02
sys 0.00
Counts for words and chars are closer but still short due to inexperience using k.
But it appears the script is now faster than wc.
time -p wc big10.txt
1284570 10956950 64886660 big10.txt
real 2.78
user 2.66
sys 0.12
time -p k 1
1284570
10956830
63602090
real 2.57
user 2.42
sys 0.14
It's closer because vi was built on top of ex -- vi's ":" commands are just ex commands. In fact, on my Linux box, /bin/ex is just a symbolic link to /bin/vi:
$ ls -l /bin/ex
lrwxrwxrwx 1 root root 2 Oct 2 2017 /bin/ex -> vi
The original code for vi was written by Bill Joy in 1976, as the visual mode for a line editor called ex that Joy had written with Chuck Haley. Bill Joy's ex 1.1 was released as part of the first BSD Unix release in March 1978.[1]
(Bill Joy went on to become a co-founder of Sun Microsystems.)
[0] https://news.ycombinator.com/item?id=13451454