Batch editing files with ed (jvns.ca)
151 points by weinzierl on May 12, 2018 | hide | past | favorite | 72 comments



Awk really is the tool of choice for this sort of thing:

  $ awk '{print}/baz/ {sub("baz", "elephant"); print}' jvns.txt
  foo:
    - bar
    - baz
    - elephant
    - bananas
Since the script is single quoted you can also lay it out legibly:

  $ awk '
  {print}
  /baz/ {
      sub ("baz", "elephant")
      print
  }
  '
which is nice for more complex "one-liners". Awk is also standard in all POSIX environments, and the mawk flavor is extremely fast (relevant if you are processing huge files).

There is a superb book, The AWK Programming Language, which teaches a lot about programming in general in addition to the awk language. Good discussion and a link to the PDF here: [0]

[0] https://news.ycombinator.com/item?id=13451454


The AWK Programming Language is the best programming book ever because it lets you learn the language through interesting problems (like writing a very small assembler).


It's available online. Thanks for the tip.

https://ia802309.us.archive.org/25/items/pdfy-MgN0H1joIoDVoI...


Come on, don't promote piracy of the book! It's worth buying! Mods, please take down this link.


Hmm, someone downvoted me. Can the downvoter explain what was wrong in what I said?


I believe the book's been out of print for some time. I've no idea if that's why you were downvoted. I wouldn't post a link to a pirated anything. That PDF is linked from a ton of sites, and the site I found it on seemed reputable, though I don't remember which it was at the moment.


> I believe the book's been out of print for some time.

Hmm, I did not know that (I have a print copy of that book).

> I wouldn't post a link to a pirated anything.

OK. But just in case someone still wants a physical copy, they can get it used from Amazon (I can only speak for Amazon in the US) for ~$3.


> the mawk flavor is extremely fast

Fast, but partly because it's not Unicode-aware: it treats strings as 8-bit character sequences rather than UTF-8. Often that's fine, as long as non-ASCII characters are only passed through unmodified, but it requires some care to avoid problems.

    $ echo $LANG
    en_GB.UTF-8

    $ echo "ÜNICÖDE" | gawk '{print tolower($0)}'
    ünicöde

    $ echo "ÜNICÖDE" | mawk '{print tolower($0)}'
    �nic�de
I ran into this in practice because I was using awk to convert paper titles from "Title Case" to APA-style "Only first word of title capitalized" case. The garbled output led me down a rabbit hole where I discovered that only some awks support Unicode locales, and the default awk on Debian (mawk) isn't one of them.
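Before relying on locale-aware case conversion, you can probe the local awk up front (a sketch, assuming a UTF-8-capable environment; a byte-oriented awk like mawk leaves the non-ASCII letter untouched):

```shell
# Probe whether `awk` applies locale-aware case conversion.
# \303\234 is UTF-8 "Ü"; a Unicode-aware awk lowercases it to "ü" (\303\274),
# while a byte-oriented awk passes it through unchanged.
probe=$(printf '\303\234' | awk '{ print tolower($0) }')
if [ "$probe" = "$(printf '\303\274')" ]; then
    echo unicode-aware
else
    echo "byte-oriented; consider gawk"
fi
```

The same probe works for checking whether the locale itself (not just the awk binary) is set up for UTF-8.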


See my post down at the bottom; I do mention this. GNU awk also has a lot of useful extensions and builtins, so it's sometimes painful to use a plain POSIX awk. But when you need the speed, it's nice to know mawk is out there.

I'd love to see the mawk compilation technology merged to GNU awk. Or mawk updated with Unicode support and a few of the GNU extensions.

The other item on my awk-like wishlist is CSV support, i.e. splitting $1 .. $N by CSV rules instead of just a field separator. I usually end up copying CSVs into PostgreSQL because it is fast and then I can process them very flexibly, but it's a bit heavy for things that could be one-liners. Also, PostgreSQL won't load malformed CSV, but I suspect awk could be less picky.
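gawk's FPAT extension gets part of the way there, defining fields by what they contain instead of by a separator (a sketch; it copes with quoted commas but not with embedded newlines):

```shell
# FPAT: a field is either a run of non-comma characters or a quoted string.
# The quoted field "b,c" survives as one field instead of splitting in two.
printf '%s\n' 'a,"b,c",d' |
    gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" } { print NF; print $2 }'
```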

There is Miller, which hits some of these points, but I haven't really grokked it yet and the syntax seems ... awkward.


+1 for The AWK Programming language book!

I have been reading through it and running code snippets from there as I find time. It's awesome!

If anyone's interested, I also update my notes from the book on my blog: https://scripter.co/notes/awk. I plan for that notes post to contain the entire book when I am done.


Pretty much this whole thread is people coming up with better ways of solving the problem. I love Hacker News! :) But I think the point of the blog (for me) is showing what 'ed' is and what you can use it for (the problem at hand is secondary). And it did exactly that for me: I knew about 'ed' but never really understood what made it special or where/how you could use it. Thanks Julia for enlightening me!


> I knew about 'ed' but never really understood what made it special or where/how you could use it.

ed script is one of the formats supported by diff and patch (with their -e or --ed command line switch). (diff generates the script by itself, but patch just pipes it to ed.)
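As a concrete sketch (hypothetical file names), diff -e emits plain ed commands, and appending a w command makes the script replayable with ed itself:

```shell
# Generate an ed change script turning old.txt into new.txt.
printf 'foo\nbar\n'      > old.txt
printf 'foo\nelephant\n' > new.txt
diff -e old.txt new.txt > edit.ed || :   # diff exits 1 when files differ
cat edit.ed                              # plain ed commands: 2c / elephant / .

# Replay the script; diff -e omits the final write, so append "w".
# (Guarded, since ed is not installed by default everywhere.)
if command -v ed >/dev/null 2>&1; then
    cp old.txt patched.txt
    { cat edit.ed; echo w; } | ed -s patched.txt
    cmp patched.txt new.txt && echo identical
fi
```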

ed scripts are used at Apple to maintain their patches for Python: see e.g. https://opensource.apple.com/source/python/python-97.50.7/2.... and others in https://opensource.apple.com/source/python/python-97.50.7/2....


Fun side note: `patch` runs ed, and ed allows you to run arbitrary commands, resulting in a security issue:

https://rachelbythebay.com/w/2018/04/05/bangpatch/

This was used for the https://holeybeep.ninja/ April Fools' joke release.


This problem is easily solved with a regexp that processes the input as a whole, rather than working on it line by line. I had this need often enough that I exported the Go regexp engine as a command-line tool, regrep, which can insert "elephant" after "baz" with:

  go get github.com/orivej/unix/regrep
  regrep s '(\n( *-) baz\n)' $'$1$2 elephant\n' < input.yaml
It only processes standard input and cannot by itself replace the contents of an input file with its output; but another tool, inplace, helps:

  go get github.com/orivej/unix/inplace
  find . -name '*.yaml' -exec inplace {} regrep s '(\n( *-) baz\n)' $'$1$2 elephant\n' \;


It's close to the idea of structural regular expressions [0]. I'm still waiting for the awk from the paper.

[0] http://doc.cat-v.org/bell_labs/structural_regexps/


In practice I often use Vim instead.

    :args file1.txt file2.txt file3.txt
    :set autowrite
    :argdo norm /- baz/<CR>yypwCelephants
The commands set the argument list to a list of three files (by default, it is set to the filenames you passed to vim on the command-line). Then, autowrite is enabled which automatically saves each buffer after editing it. Finally, argdo runs a command on each argument file.



Thanks for that tip. I've used ex for similar things when editing hundreds of files.

This example searches each HTML file in a directory for a line with a string and then deletes a number of lines:

    $ echo "g/search string/ .,+20 d\nx" >> exscript
    $ for f in *.html
      do
          ex - $f < exscript
      done


Why write the ‘ex’ commands to a file? Why not just do the echo inline, like this

   for f in *.html; do
       echo -e "g/search string/ .,+20 d\nx" | ex - "$f"
    done
?

(I also added quotes to the $f dereference in case of file names containing white space, and the -e flag to echo to expand \n to newline. In case of a Bourne shell without support for -e in echo, I would probably use “{ echo "g/..."; echo "x"; } | ex - ...” instead of using \n.)

Also, in a production script I would probably have used “find . -maxdepth 1 -name "*.html" -print0 | xargs --null --no-run-if-empty | while read f; do ...; done” instead of a ‘for’ loop from a pathname expansion, in order to guard against there being no html files, in which case a ‘for’ loop from a pathname expansion otherwise would be passing the literal string “*.html” as the file name argument to ‘ex’.
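A variant of the same guard that also avoids the word-splitting pitfall of piping xargs output into `read` is to let find spawn the shell itself (a sketch; `sed -n 1p` stands in for the real `ex` edit):

```shell
mkdir -p htmldemo && cd htmldemo
printf '<html>a</html>\n' > 'file with spaces.html'

# -exec sh -c '...' sh {} + passes each matched name as a positional
# parameter, so spaces in names survive and zero matches run nothing.
find . -maxdepth 1 -name '*.html' -exec sh -c '
    for f do
        sed -n 1p "$f"     # stand-in for: ex - "$f" < exscript
    done
' sh {} +
```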


Thanks for the tips. I put them in a file, because I saw someone using ed last year in a script and started looking at ex from there.

    diff -e file1 file2 > ed_script
Using echo (along with your other suggestions) is probably better for that example.


Note: I forgot the “--max-args=1” option to xargs.


I've recently used EDLIN from an 80s version of MS-DOS.

After spending 5 minutes with the manual, I realized it was the best line editor I've ever used.

ed is very similar, but it doesn't come with a nice manual. The info page is chaotic and doesn't start from the beginning. You can cover typical usage in 5 lines, but no, you have to read through 25 pages of stuff just to figure out a sensible command. I just use vim or nano.
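The "typical usage in 5 lines" claim can be made concrete: a (append), . (end input), s (substitute), w (write), q (quit) cover most sessions. A non-interactive sketch (guarded, since ed isn't installed by default everywhere):

```shell
printf 'one\ntwo\n' > f.txt
if command -v ed >/dev/null 2>&1; then
    # a = append after current line, . = end input,
    # 1s = substitute on line 1, w = write, q = quit
    printf '%s\n' 'a' 'three' '.' '1s/one/ONE/' 'w' 'q' | ed -s f.txt
    cat f.txt    # ONE / two / three
fi
```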


> ed is very similar, but it doesn't come with a nice manual

In many cases it indeed doesn't. Historically, that documentation void was meant to be filled by learn tutorials[1] and a written introduction in volume 2A of the manual[2,3]. Unfortunately, the learn tutorials can't really be made use of in a modern environment; you’d actually have to set up a PDP-11 emulator with V7 (though prebuilt images exist) and work with that, an environment where backspace doesn’t really work out of the box.

OpenBSD ed(1)[4] tries, but it's just not quite there as an introduction.

[1] http://a.papnet.eu/UNIX/v7/files/doc/07_learn.pdf

[2] http://a.papnet.eu/UNIX/v7/files/doc/04_edtut.pdf

[3] http://a.papnet.eu/UNIX/v7/files/doc/05_adved.pdf

[4] https://man.openbsd.org/ed.1


25 pages that could be shown in one page if they skipped all the talk and just showed carefully selected examples.


That seems to be the theme of all info pages, and most GNU manpages. I suppose it beats no docs at least.


My understanding is that man/info is meant more as a reference and less as a first-time user's guide.


Good man pages can be great first time user guides.


I'm much more familiar with sed than ed, so here's how I would do this:

  sed '/baz/{s/.*/&\n&/;s/baz/elephant/2}' input.txt
or, slightly more readable

  sed '/baz/ {
           s/.*/&\n&/
           s/baz/elephant/2
       }' input.txt
The first substitution appends a copy of the line to the pattern space, the second substitution replaces the second occurrence of "baz" with "elephant".
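Run against the post's sample input it looks like this (assuming GNU sed, since \n in the replacement text is a GNU extension):

```shell
# Duplicate each matching line in the pattern space, then replace the
# second "baz", so the indentation of the copy is preserved as-is.
printf 'foo:\n  - bar\n  - baz\n  - bananas\n' |
    sed '/baz/{s/.*/&\n&/;s/baz/elephant/2}'
```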

This being said, I went ahead and bought the book mentioned in the article [0] - a neat little read.

[0]: https://www.michaelwlucas.com/tools/ed


To use this solution with a version of sed that does not accept newlines in patterns (i.e. to make it portable), one has to put the commands in a sed commands file and run it with sed -f.

How to make the one-liner portable without using a sed commands file?

Maybe something like:

  sed 's/baz/elephant/;/^ \{2\}- elephant/{h;G;};/^ \{4\}- elephant/{h;G;};s/elephant/baz/' foo|sed -a wfoo

  1. s/baz/elephant/ 
  2. duplicate that line if two or four space indent
  3. s/elephant/baz/
  4. save
N.B. no temp file used to save changes

cf. jvns.ca blog:

  1. search for baz
  2. copy that line and paste it
  3. s/baz/elephant/
  4. save and quit
N.B. temp file in $TMPDIR used to save changes


When I first encountered the command line in college, my prof introduced me to vi and the basic bash commands, but I wasn't familiar with any other scripting languages (or even, if memory serves, the concept of a 'scripting language'). As a result, I ended up creating a pretty dizzying array of ed scripts until someone introduced me to sed and the fact that you can use bash as a scripting language.


Ed is the standard text editor https://www.gnu.org/fun/jokes/ed-msg.html


Nice article, good to hear ed is not dead. =)

You could also just add the text after the matching line. A little simpler and more straightforward.

    $ cat > /tmp/ed-script
    /baz
    a
      - elephants
    .
    w
    q
    $ cat /tmp/2
    foo:
      - bar
      - baz
      - bananas
    $ cat /tmp/ed-script | ed /tmp/2
    33
      - baz
    47
    $ cat /tmp/2
    foo:
      - bar
      - baz
      - elephants
      - bananas
    $


I don’t think this matches the spec that the new line must have the same number of leading spaces as the surrounding lines.


Weird, after rereading the article, it seems like I may have imagined that part.


An older version of the article contained the following:

> I had one extra weird requirement which was that some of the lines were indented with 2 spaces, and some with 4 spaces. The - elephant line needed to have the same indentation as the previous line.


Well! That explains the .t. Thanks! :)


My guess is that the - is replacing something like an org-mode cookie, which could be in a few states that one might want to preserve.


Chapter 20 of O'Reilly's "Unix Power Tools, 3rd Edition" is all about batch editing and covers ed/ex as well.

Maybe there's an old copy of "Unix Power Tools" over in your server room or an abandoned cubicle in the office... the content has not changed much in the ensuing decades!


Ed is pretty complete, to the point that it was a little too powerful when it was part of a security problem with FreeBSD's patch. https://securitytracker.com/id/1033188


   echo -e '/-baz\n+1\ni\n-elephant\n.\nw\nq\n'|ed foo
but ed requires a temp file in $TMPDIR to save changes

for speed, put $TMPDIR on memory file system

sed requires no temp file

   1.sed:
   /-baz/a\
   -elephant
   
   
   sed -f 1.sed foo|sed -a wfoo
works with all versions of sed; e.g., not all versions support "\n" in patterns, nor the so-called "edit-in-place" automatic temp file creation and removal


Sure, there is more than one way to do X. It would be useful to learn why you prefer sed over ed.


The sed command doesn't get the indentation right, though, as the article says it could be indented by two or four spaces.


   1.sed:
   s/- baz/- elephant/;
   /^  - elephant/{h;G;}
   /^    - elephant/{h;G;}
   s/- elephant/- baz/;

   sed -f 1.sed foo|sed -a wfoo
or

   1.sed:
   /^  - baz/a\
     - elephant
   

   /^    - baz/a\
       - elephant
   

   sed -f 1.sed foo|sed -a wfoo


  sed s/baz/baz\\nelephants/


that doesn't match the indentation on the subsequent line though, does it?


You are missing the dash that the line starts with


  sed -i.bak s/baz/"baz\\n  - elephants"/ *.txt


The article says that the indentation can be two or four spaces, though.


I do this fairly regularly in various shell scripts, but less now than previously ever since “sed” introduced the --in-place option, making it more useful for my purposes most of the time.


It's worth noting that the "enter a single . on a line to signal the end of input" convention found its way into mail/mailx and SMTP too. The good thing is that it means you don't need to insert special characters like Esc (Ctrl+[, 27, 0x1B, whatever you want to call it) into your script; the bad thing is when you do want to add a line containing a single "."... and here ed and SMTP have diverged with different "escaping" conventions.
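The SMTP side of that escaping is "dot-stuffing" (RFC 5321): the sender doubles any leading dot so it can't be mistaken for the terminating lone ".", and the receiver strips one dot back off. Both directions can be sketched with sed:

```shell
# Sender side: double a leading "." on each line before transmission.
body=$(printf '%s\n' 'hello' '.signature')
stuffed=$(printf '%s\n' "$body" | sed 's/^\./../')
printf '%s\n' "$stuffed"    # ".signature" has become "..signature"

# Receiver side: strip one leading "." back off, recovering the body.
printf '%s\n' "$stuffed" | sed 's/^\.\././'
```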


Good to know about ed! Since no one else has mentioned this: Emacs' keyboard macros seem much easier to me, especially since beyond the basic editing stuff of awk/ed/sed I can leverage all the editor extensions and modifications I have accumulated over the years. That is, unless it's tens of thousands of files and the edit is exceptionally simple. I would write a script in that case too.


Since discovering them, I use Emacs keyboard macros all the time.

Let's say I have:

  key   value
  salt  pepper
  fish  chips
  vodka orange
  rum   cola
and I want the second column in uppercase.

<F3> to start recording a macro. Alt →, → to position the cursor at "v" (or just →→→→→→ if this is a fixed width column), then Alt U to uppercase the next word. → to move the cursor one forward, to the start of the next line. <F4> to finish recording the macro.

Then press <F4> five times to run the macro five times.

(Explanation intended for users who've never used Emacs before. Of course, there are optimizations.)
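For comparison, the same table edit as a non-interactive one-liner (note that awk rebuilds the record, so the column whitespace is normalized to single spaces rather than preserved byte-for-byte as the macro does):

```shell
# Uppercase the second whitespace-separated column of each line.
printf 'salt  pepper\nfish  chips\n' |
    awk '{ $2 = toupper($2); print }'
```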


The equivalent in vim is:

• qq to start recording a macro in register q,

• w to jump to the second word,

• gUaw to "go uppercase a word",

• j to move to next line (↓ works as well),

• q to stop recording,

• 5@q to apply macro in register q five times.

But in this example I would have probably used ex command:

    :%normal wgUaw
(for every line, do as if I had typed wgUaw) or visually selected the second column as a block and just pressed U.

I'm genuinely interested in someone showing how to do this kind of transformation in popular modern editors such as Atom and VSCode. Is there as flexible a way as in the classic editors?


You can do it using multiple cursors. In Sublime, you place the cursor at the beginning of “value”, then press Ctrl+Shift+Down until the end - there will be a cursor on every line. Then you press Ctrl+Right to select all values in the second column. Then press Ctrl+Shift+P and choose “Convert to Upper Case”, or just Ctrl+KU.


The main reason I am not moving to something more trendy is keyboard macros, and the fact that Emacs is designed to be equal parts runtime and editor. I think that for most purposes Emacs and vim are equivalent, but virtually all modern editors are lacking compared to a good Emacs/vim configuration.


Microoptimizing a couple of things, I'd do this from the top of the buffer:

    <F3> M-f M-u C-n C-a C-0 <F4>
I particularly like combining the "finish recording" and "running the macro" steps into a single <F4>. Plus, using a numeric argument of 0 seems better than counting lines.


There is another program I use for editing that is older than ed. It is written in asm. I think it may actually be faster than sed (and sed is faster than AWK, Lua, Perl, Python, etc.)

  1.spt:

  ; x = "  - baz" 
  ; y = "  - elephant"
  ;a a = input :f(end)
  ;  output = a
  ;  a ? x :s(d)f(a) 
  ;d output = y
  ; :(a)
  ;end

  spitbol 1.spt < foo


>sed is faster than AWK

Depends on the awk implementation and the task. However even gnu awk (gawk) is very fast and mawk is astonishing.

Here is a simple example: count the lines, words, and characters in a 65MB text file (10 copies of a novel stuck together).

Testing on Ubuntu GNU/Linux 16.10, reporting the middle of three tries:

  export LANG=ASCII    # avoid differences due to unicode
  $ time -p wc big10.txt 
   1284570 10956950 64886660 big10.txt
  real 0.29
  user 0.28
  sys 0.01
  $ time -p gawk '{l+=1; w+=NF; c+=length($0)+1} END {print l, w, c}' big10.txt
   1284570 10956950 64886660
  real 0.55
  user 0.53
  sys 0.01
Not bad, gawk is less than twice as slow as wc which is the standard tool for this.

  $ time -p mawk '{l+=1; w+=NF; c+=length($0)+1} END {print l, w, c}' big10.txt
   1284570 10956950 64886660
  real 0.35
  user 0.33
  sys 0.01
But mawk is only 20% slower than wc. For a script!

Just for a check, even python is not terrible at this:

  #!/usr/bin/python
  import sys
  l, w, c = 0, 0, 0
  for line in file(sys.argv[1], "rb"):
      l += 1
      w += len(line.split())
      c += len(line)
  print l, w, c
  
  $ time -p ./wc.py big10.txt 
  1284570 10956950 64886660
  real 0.87
  user 0.86
  sys 0.01
About 3 times slower than wc and mawk.


Here is how spitbol script measures against wc.

As with k, I am lacking in spitbol experience, so the counts are not identical to wc's. Also, I am using 10MB of big10.txt instead of the entire file.

  1.spt:

  ;* m line count, c word count, o char count 
  ;* p word pattern

  ; n = "0123456789"
  ; w = n &ucase &lcase "-"
  ; p = break(w) span(w)
  ;a a = input :f(c)
  ;  o = o + size(a) 
  ;  m = m + 1
  ;b a ? p = :f(a)
  ;  c = c + 1 :(b)
  ;c output = m ' ' c ' ' o
  ;end

  dd if=big10.txt bs=5m count=2 of=10m.txt

  time -p wc 10m.txt

  201346 1763181 10485760 10m.txt
  real         0.51
  user         0.44
  sys          0.00

  time -p spitbol 1.spt < 10m.txt

  201347 1770831 10235302
  real         0.33
  user         0.31
  sys          0.01
It appears that spitbol script is faster than wc.


On a much slower computer...

  time -p wc big10.txt
  1284570 10956950 64886660 big10.txt

  real         2.76
  user         2.68
  sys          0.08
Trying this as a novice with k3.

Because I'm a novice, 2 out of 3 counts are incorrect, and this is probably not the fastest solution.

Total "words" in the example was simply AWK's NF. But looking at big10.txt, there are anomalies such as words separated by "--" instead of spaces.

Here I used non-space character followed by space. Far from accurate but not too far.

  1.k: 
  w:0:"big10.txt";v:,/$w
  m:v _ss "[^ ] " / "word": char followed by space
  #w   / lines
  1+#m / words
  #v   / characters

  time -p k 1

  1284570
  10019630
  63602090

  real         2.70
  user         2.40
  sys          0.28

Counting lines with sed

  time -p wc -l big10.txt
  1284570 big10.txt

  real         0.13
  user         0.06
  sys          0.07

  sed -n '$!d;=' big10.txt
  1284570

  real         0.29
  user         0.19
  sys          0.09


That is a slow computer, mine is a pre-haswell i3.

  $ time -p sed -n '$!d;=' big10.txt
  1284570
  real 0.07
  user 0.06
  sys 0.00

  time -p mawk 'END {print NR}' big10.txt
  1284570
  real 0.04
  user 0.03
  sys 0.00

  $ time -p gawk 'END {print NR}' big10.txt
  1284570
  real 0.14
  user 0.13
  sys 0.00

  $ time -p wc -l big10.txt
  1284570 big10.txt
  real 0.02
  user 0.02
  sys 0.00


Revised 1.k.

  w:0:"big10.txt";v:{" ",x}'w;u:{#v[x] _ss " [^ ]"}'!#v;t:{#w[x]}'!#w

  #w / lines
  +/u / words
  +/t / chars
Counts for words and chars are closer but still short due to inexperience using k.

But it appears the script is now faster than wc.

  time -p wc big10.txt

  1284570 10956950 64886660 big10.txt
  real         2.78
  user         2.66
  sys          0.12

  time -p k 1

  1284570
  10956830
  63602090

  real         2.57
  user         2.42
  sys          0.14


How can I download big10.txt or the novel to recreate it?


big10.txt is just 10 copies of big.txt from the Peter Norvig spelling corrector essay [0].

[0] http://www.norvig.com/big.txt


There is also ex, which is a sort of cousin of ed, but also closer to the nowadays more familiar vi(m).


It's closer because vi was built on top of ex -- vi's ":" commands are just ex commands. In fact, on my Linux box, /bin/ex is just a symbolic link to /bin/vi:

    $ ls -l /bin/ex
    lrwxrwxrwx 1 root root 2 Oct  2  2017 /bin/ex -> vi
The original code for vi was written by Bill Joy in 1976, as the visual mode for a line editor called ex that Joy had written with Chuck Haley. Bill Joy's ex 1.1 was released as part of the first BSD Unix release in March 1978.[1]

(Bill Joy went on to become a co-founder of Sun Microsystems.)

"ex" stood for the extended version of ed.

[1] https://en.wikipedia.org/wiki/Vi
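Since they share the same lineage (and often the same binary), ed-style batch edits work through ex too. A sketch of the thread's elephant edit, guarded because ex may be provided by vim or absent entirely:

```shell
printf 'foo:\n  - baz\n' > sample.txt
if command -v ex >/dev/null 2>&1; then
    # -s: silent/script mode; g// runs the substitution on every match.
    ex -s sample.txt <<'EOF'
g/- baz/s/baz/elephant/
wq
EOF
    cat sample.txt
fi
```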


ex == vi(m). Really, it's the same executable. You switch from vi to ex with Q, and from ex to vi with vi^M.

I loved ex's open mode. It's kind of a line-mode version of vi. Great to use over 1200 baud. Looks like vim doesn't have it. Sad.


Good article. It has everything - a problem and a good solution to this problem.


?


sed combined with iTerm's abilities to type in multiple panes lets me edit 100s of config files across multiple servers at the same time.


(Novice k user.)

    1.k:

    /k3
    v:"- baz";u:"  - elephant";t:"foo";s:"\n"
    w:_ssr[,/$0:t;v;v,u]
    w[1_ {x-2}'(&w="-")]:s
    t 0:_ssr[w;s;s," "],s

    k 1


Php -> one line



