Everyone forgets the brilliant and sometimes crazy BSD ones:
- column: create columns / tables from input data
- tr: substitute / delete chars
- join: like a database join, but for text files
- comm: like diff, but you can use it programmatically to select the lines that appear only in one file, only in the other, or in both.
- paste: put file lines side-by-side
- rs: reshape arrays
- jot: generate random or sequence data
- expand: convert tabs to spaces
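A quick illustration of a few of these together (output will vary, since jot -r is random; the paste line uses bash process substitution):
jot 5                                           # the numbers 1 through 5, one per line
jot -r 5 1 100                                  # five random numbers between 1 and 100
paste <(jot 5) <(jot -r 5 1 100) | column -t    # put them side by side, lined up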
Looks like 6 of those 8 are present on just about any modern Linux as well (most via GNU coreutils; column comes from util-linux). 'rs' and 'jot' are the two missing from most default Linux installs. On Debian you can install them via the packages 'rs' and 'athena-jot'.
I'm old-fashioned, so I use "sed 100q" instead of the newer "head -100". It saves a keystroke, too.
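Both of these print the first 100 lines of a (hypothetical) access.log; sed simply quits after line 100:
sed 100q access.log
head -100 access.log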
There are enough variations in ways to do things on Unix that I've sometimes wondered about how easy it would be to identify a user by seeing how they accomplish a common task.
For instance, I noticed at one place I worked that even though everyone used the same set of options when doing a "cpio -p", everyone had their own order they wrote them in. Seeing one "cpio -p" command was sufficient to tell which of the half dozen of us had run it.
I think I'm the only one where I work who uses "sed Nq" instead of "head -N", so that would fingerprint me.
I sorta had this happen to me once. I have used "lsl" as an alias for long directory listings for longer than I can remember. And just out of habit it was almost always the first command I typed when logging into any box anywhere.
So one day I telnetted into a Solaris machine and immediately typed "lsl" before doing anything else. A short while later a colleague came to my cube. He had been snooping the hme1 interface and saw me login. He didn't need to trace the IP because he knew it was me when he saw 3 telnet packets with "l" "s" "l" in them.
AWK is worth learning completely. It hits a real sweet spot in terms of minimizing the number of lines of code needed to write useful programs in the world of quasi-structured (not quite CSV but not completely free form) data. You can learn the whole language and become proficient in an afternoon.
I recommend "The AWK Programming Language" by Aho, Kernighan, and Weinberger, though it seems to be listed for a hilariously high price on Amazon at the moment. Maybe try to pick up a used copy.
>I recommend "The AWK Programming Language" by Aho, Kernighan, and Weinberger
I concur with this recommendation. "The AWK Programming Language", at a little over 100 pages, is a classic of programming language instruction. The book jumps right into use cases; it does not waste one's time. This book should be required reading for anyone contemplating writing a handbook on any programming language; if it were, my CS bookshelf would be several feet thinner and several times more informative.
I'd advise against learning more than basic usage of awk, and spending the time on more versatile languages instead. You can do very neat tricks with sed and awk, but when the problems get more complex, it is a lot faster to use a smarter language. And if you know that language well, you will find it can be just as concise for the relatively simple tasks too.
When Perl was created, one of its advertised goals was to avoid all the time lost trying to work around the limitations of awk, sed and shell.
I recall there was a pointer to a great old AWK tutorial some time ago - something along the lines of 'how to approach the awk language....' - has anyone kept the link?
By the way: what people need to understand is that in order to use awk efficiently, you'll either use associative arrays or structure your script like a sed script; otherwise it will be slow. The other interesting thing is awk's regex engine, a Thompson NFA, which, from what I hear, is around 7 times faster than the backtracking PCRE-style engines used in Perl, PHP, Python and Ruby.
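For instance, a minimal sketch of the associative-array style, assuming a pipe-delimited data.csv like the one in the article (treating field 2 as some category column, which is just an assumption here):
# count how many rows fall into each value of column 2
awk -F "|" '{ count[$2]++ } END { for (k in count) print count[k], k }' data.csv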
One of my favorite little tools that makes all these others better is pv -- PipeViewer.
Use it any place in a pipeline to see a progress meter on stderr. Very handy when grepping through a bunch of big log files looking for stuff. Here is a quick strawman example:
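Something along these lines (filenames are hypothetical):
pv huge.log | grep ERROR
pv huge.log.gz | gunzip | grep ERROR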
The second form will show the progress of the decompression process.
You can also adjust the buffer size and set the expected size for the time estimate (otherwise it has to fstat the input); the progress meter itself goes to stderr, so it stays out of the data flowing through stdout.
Not sure if you prefer binary installs or whether you compile your installs yourself... but I'm sure you could get this to compile on FreeBSD with a little work.
Here's one way to do this with the standard "dc" (RPN calculator) utility:
echo "1\n2\n3\n+\n+\np\n" | dc -
Or, a little more legibly:
> dc
1
2
3
+
+
p
6
[Edited to fix bug]
Not sure how to automate this to sum 1000 values without needing to explicitly insert 999 + signs, though. Haven't explored dc in depth myself yet. There's probably some way to do it with a macro or something, but it may not be pretty.
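There is a macro way; here's a sketch that sums a hypothetical numbers.txt (one number per line) by repeatedly adding while more than one value is left on the stack:
( cat numbers.txt; echo '[+z1<a]sa z1<a p' ) | dc
# [+z1<a] pops and adds the top two values, then re-runs itself while the stack depth (z) is greater than 1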
Yeah, I wrote my own sum utility in Python... the syntax is just "sum 1" or "sum 2" for the column, with a -d delimiter flag. In retrospect I guess it could have been a one-line awk script. But if you are doing this kind of data processing, it makes sense to have an hg/git repo of aliases and tiny commands that you sync around from machine to machine. You shouldn't have to write "sum" more than once.
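The awk version would look roughly like this (sumcol and its argument handling are just a sketch, not the actual utility):
# sum column N of delimiter-separated input; delimiter defaults to whitespace
sumcol() { awk -F"${2:- }" -v col="$1" '{ s += $col } END { print s }'; }
sumcol 4 '|' < data.csv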
Another useful one is "hist" which is sort | uniq -c | sort -n -r.
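In other words, something like this in your shell rc (the field number here is just an assumption about a common-format Apache log, where field 9 is the status code):
alias hist='sort | uniq -c | sort -n -r'
awk '{print $9}' access.log | hist    # e.g. counts per HTTP status code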
I've never aliased it, but yes I use your 'hist' a lot. Useful for things like "categorise log errors" etc.
Does everyone else edit command history, stacking up 'grep -v xxxx' in the pipeline to remove noise?
If I'm working on a new pipeline, my normal workflow is something like:
head file # See some representative lines
head file | grep goodstuff
head file | grep goodstuff | grep -v badstuff
head file | grep ... | grep ... | sed -e 's/cut out/bits/' -e 's/i dont/want/'
head file | grep ... | grep ... | sed -e 's/cut out/bits/' -e 's/i dont/want/' | awk '{print $3}' # get a col
head file | grep ... | grep ... | sed -e 's/cut out/bits/' -e 's/i dont/want/' | awk '{print $3}' | sort | uniq -c | sort -nr # histogram, as in the parent comment
Then I edit the 'head' into a 'cat' and handle the whole file. Basically all done with bash history editing (I'm a 'set -o vi' person for vi keybindings in bash, emacs is fine too :-)
Yeah, this is my quick-and-dirty way of looking at referers in Apache logs, built up from a few history edits. It excludes some bot-like stuff (many bots give a plus-prefixed URL in the user-agent string) and referer strings from my own domain, removes query strings, and cleans up trailing slashes:
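Roughly like this; the domain, log path, and exact field handling are placeholders rather than my actual command, and it assumes the combined log format (referer is the fourth quote-delimited field):
# referers: drop bot hits (plus-prefixed UA URLs), own-domain referers, query strings, trailing slashes
grep -v '+http' access.log \
  | awk -F'"' '{print $4}' \
  | grep -v '^-$' \
  | grep -v 'mydomain\.example' \
  | sed -e 's/?.*$//' -e 's|/$||' \
  | sort | uniq -c | sort -nr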
Doesn't work. By default, echo doesn't translate \n into a newline, so you have to add the -e flag. Then, bc doesn't like the extra plusses at the end, so you have to either add the -n flag to echo and remove the last \n, or somehow trim the newlines from the end beforehand.
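A trick that sidesteps both problems, assuming the numbers sit one per line in a file, is to let paste insert the plus signs (no trailing operator, no echo escapes):
paste -sd+ numbers.txt | bc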
I was hoping to see an article about some neat new utilities specifically tailored for doing advanced data analysis.
Instead this is a set of basic examples of bog-standard tools that every newbie *nix user should be already familiar with: cat, awk, head, tail, wc, grep, sed, sort, uniq
>"tools that every newbie nix user should be already familiar with"
The key word is should... you might be surprised how many "not newbie" *nix users aren't aware of those commands, or of how to use them in this fashion. Especially awk.
Don't forget there are only ever going to be more unix newbies in the world. It's not like they're a dying breed. There are more people than ever who have never been exposed to unix tools who might benefit from them (myself included several years ago).
Lots of people are familiar with the "basics" of each of these commands but many of them (awk, sed) are very powerful utilities that can do much, much more than it appears at first glance.
A commenter on the article pointed out the "Useless use of cat".
What most users probably don't realize is that the redirection can be anywhere on the line, not just at the beginning. Putting an input redirection at the beginning of the command can make the data flow clearer: from the input file, through the command, to stdout:
< data.csv awk -F "|" '{ sum += $4 } END { printf "%.2f\n", sum }'
(This only works for simple commands; you can't do `< file if blah; then foo; else bar; fi`)
"Useless use of cat" is one of those boring pedantic comments that makes me cringe. Who cares? It's usually much more straightforward to build a pipeline from left to right, particularly for people who are just learning this stuff.
> It's usually much more straightforward to build a pipeline from left to right, particularly for people who are just learning this stuff.
True, however, people pointing out UUOC are in fact pointing out that you should not be building a pipeline at all. If you want to apply an awk / sed / wc / whatever command to a file, then you should just do that instead of piping it through an extraneous command.
Sure, as people always mention, in your actual workflow you might have a cat or grep already, and are building a pipeline incrementally; there's no reason to remove previous stuff to be "pure" or whatever. But if you're giving a canonical example, there's no reason to add unneeded commands.
Please be very careful doing math with bash and awk...
cat data.csv | awk -F "|" '{ sum += $4 } END { printf "%.2f\n", sum }'
From that command alone, it's unclear whether the sum will be accurate; it depends on the inputs and on the precision of awk's floating-point arithmetic. See (D.3 Floating-Point Number Caveats): http://www.delorie.com/gnu/docs/gawk/gawk_260.html
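To make the caveat concrete:
awk 'BEGIN { for (i = 0; i < 10; i++) s += 0.1; printf "%.17g\n", s }'
# prints 0.99999999999999989, not 1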
> Please be very careful doing math with bash and awk...
I don't see how that's more true for awk than it is for any other programming language. Awk uses double precision floating point for all numeric values, which isn't a horrible choice for a catch-all numeric type.
I know purists always complain about unnecessary cats, but I always find it useful to start with "head" or "tail" in the first position to figure out my pipeline, and then replace it with cat when it's all working.
And if the extra cat is actually making a measurable difference, maybe that's a good signal that it's time to rewrite it in C.
Some people prefer the first, longer-winded way because it's more explicit. To some --- myself included --- it makes more sense because it explicitly breaks each function into a separate step; I'm explicitly telling the system to print the contents of data.txt rather than implicitly doing so. I'll happily type those five extra characters for that additional clarity.
Or you can actually use the Linux commands by installing Cygwin. Pretty much my first conscious action when I wake up stranded on a desert Windows system.
I couldn't care less which you choose so long as you're getting a proper Linux toolset.
Powershell is a skill I don't have yet which carries over to ... precisely one declining technical dinosaur (with a penchant for expiring its skillsets).
The Linux toolbox is a set of skills I embarked on learning over a quarter-century ago, most of which goes back another decade or further (the 'k' in 'awk' comes from Brian Kernighan, one of Unix's creators). And while some old utilities are retired and new ones replace them (telnet / rsh for ssh, sccs/rcs for git), much of the core has remained surprisingly stable over time.
The main difference between MinGW and Cygwin appears to be how Windows-native they are considered, which for my own purposes has been an entirely irrelevant distinction, though if you're building applications on top of the tools it might matter to you.
One trick I like is to feed two files into awk: /dev/stdin and some other file I'm interested in. Here's an example: look up the subject names for a list of serial numbers in an OpenSSL index.txt.
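Sketched from memory rather than copied; serials.txt is hypothetical, and the field positions assume the usual tab-separated CA index.txt layout with the serial in column 4 and the subject DN in column 6:
cat serials.txt | awk -F'\t' '
    NR == FNR { want[$1] = 1; next }   # first file: /dev/stdin (the serial numbers)
    $4 in want { print $6 }            # second file: index.txt (print matching subject DNs)
' /dev/stdin index.txt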
I find myself using this idiom (feeding data from a file to awk and selecting it with data from standard input) again and again. It's a great way to scale shell scripts to take multiple arguments while avoiding opening the same file N times, or doing clunky things with awk's -v flag.
If this interests you, you should check out Joyent's new Manta service, which lets you do this type of thing on your data via their infrastructure. It's really cool.
If I needed to do this type of thing on 10 TB of data, it would probably take me longer to get the data to them than it would to just run it on my own hardware.
Apparently there's a need for it, though, or it wouldn't exist.
This entire HN thread is a perfect example of why we built Manta. Lots of engineers/scientists/sysadmins/... already know how to (elegantly) process data using Unix tools, augmented with scripts. Manta isn't about always needing to work on a 10TB dataset (you can), but about it being always available, and stored ready to go. I know we can't live without it for running our own systems -- all logs in the entire Joyent fleet are rotated and archived in Manta, and we can perform both recurring/automated and ad-hoc analysis on the dataset, without worrying about storage shares, or ETL'ing from cold storage to compute, etc. And you can sample as little or as much as you want. At least to us (and I've run several large distributed systems in my career), that has tremendous value, and we believe it does for others as well. And that's just one use case (log processing).
Wow, this looks great. My ideal cloud-computing platform is basically something like xargs -P or GNU parallel, but with the illusion that I'm running it on a machine with infinite CPU cores and RAM (charged for usage, of course). I was spoiled early on by having once had something almost like that, via a very nice university compute cluster, where your data was always available on all nodes (via NFS), and you just prefixed your usual Unix commands with a job-submit command, which did the magic of transparently running stuff wherever it wanted to run it. Apart from the slight indirection of using the job-submit tool, it almost succeeded in giving the illusion of ssh-ing into a single gazillion-core big-iron machine, which is more or less the user experience I want. But I haven't found a commercial offering where I can get an account on a big Unix cluster and just get billed for some function of my (disk space, CPU usage, RAM usage) x time.
Cloud services are amazing in a lot of ways, but so far I've found them much more heavyweight for the use-case of running ad-hoc jobs from the Unix command line. You don't really want to write Hadoop code for exploratory data analysis, and even managing a little fleet of bashreduce+EC2 instances that get spun up and down on demand is error-prone and tedious, turning me more into the cluster administrator rather than a user, which is what I'd rather be. Admittedly it's possible that could be abstracted out better in the case where you don't mind latency: I often don't mind if my jobs queue up for a few minutes, which would mean a tool could spin up EC2 instances behind the scenes and then tear them down without me noticing. But I haven't found anything that does that transparently yet, and Manta looks like a more direct implementation of the "illusion of running on an N-core machine for arbitrary N" idea that seems in the same cost ballpark. Definitely going to do some experimentation here, to see if 2010s technology will enable me to keep using a 1970s-era data-processing workflow.
I know Manta has default software packaged, but is it possible to install your own, like ghci or julia? Or is that something that needs to be brought in as an asset? This isn't necessarily a feature request, just trying to figure out how it works. https://apidocs.joyent.com/manta/compute-instance-software.h...
Yeah, that's why we generate ~/reports for you every hour - that's what our billing runs off of. I know there's an internal "turn that into daily $" script somebody wrote -- we'll get that put out as a sample job.
I actually was about to post this -- this guide is great.
As an undergraduate, this was what was given to us to help demonstrate above-introductory command line tools/pipes.
Wow, I have to say I use all these commands - these are also particularly useful while testing Hadoop streaming jobs since you can test locally on your shell using "cat | map | sort | reduce" (replace cat with head if you want) and then actually run it in Hadoop.
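In other words, something like this, where mapper.py and reducer.py stand in for whatever your streaming job actually runs:
cat input.txt | ./mapper.py | sort | ./reducer.py > output.txt
# swap "cat input.txt" for "head -1000 input.txt" to smoke-test on a sample first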
Really? In a recent test of a whole bunch of languages (scripting, compiled and JVM, but not Kona or kdb), our awk test was beaten only by C. awk was so far ahead that its run time beat the others' compile + run time.
Yes, really. Nothing I know of beats the speed of k for this column-oriented type of task. Kernighan himself tested it against awk many years ago and if I recall it was generally faster even in that set of tests.
Where can I see your experimental design? I'd like to try to replicate your results.
I've been using Ubuntu for about a year now, and although I feel comfortable doing a lot of things with the command line, I'm not sure I really know enough about *nix. I wish there were a website with the 20-30 most useful Unix commands and very clear language as to what they do, with examples. Although I've used all the tools in this post, I still enjoyed the examples.
If your text manipulation programs are locale-aware, they may be interpreting the input as a multibyte encoding, and need to do a lot more work in preprocessing to get semantically correct operation. For example, a Unicode-aware grep may understand more forms of equivalence, similarly for sorting. See e.g. http://en.wikipedia.org/wiki/Unicode_equivalence
With the C locale, text is more or less treated as plain bytes.
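Hence the classic speed trick of forcing the C locale per command (filenames hypothetical):
LC_ALL=C grep 'ERROR' big.log
LC_ALL=C sort big.txt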
According to the comments in that thread, this issue was fixed in GNU grep 2.7 (my system currently has grep 2.14 on it, so this must have been some time ago).
It's no longer quadratic in so many cases, but it's still true that UTF-8 string operations require, in the best case, several CPU cycles per character consumed, even when the input is an ASCII subset. LC_ALL=C pretty much guarantees one or fewer CPU cycles per input character. Basics like strlen and strchr and strstr are significantly faster in "C" locale.
Note that the standard Solaris versions of many commands are substantially faster than their GNU equivalents in the 'C' and multi-byte locales, so this advice doesn't necessarily apply.
That's part of why Solaris continues to use them in favour of GNU alternatives (although the GNU alternatives are available easily in /usr/gnu/bin).