GNU datamash (gnu.org)
236 points by jonbaer on Aug 4, 2014 | 19 comments


This is great work, and runs fast. The documentation is well done and has plenty of examples.

Here's an example of datamash and R with timing.

    time datamash sstdev 1 < data.txt
    288891.28552648
    0.76s user 0.01s system 99% cpu 0.775 total

    time R --vanilla --slave -e \
    "x <- read.table('data.txt', header=F); sd(x\$V1);"
    288891.3
    2.68s user 0.06s system 99% cpu 2.761 total
(The data.txt file is 1 million lines, each line a random number 1 to 1 million. The timing is on a MacBook Pro Retina 13" 2014)


For this example (1 column data), you get much closer results using R's scan function rather than read.table

  awk 'END{for(i=0;i<1000000;i++){ print int(rand() * 1000000) } }' </dev/null > data.txt
   
  time datamash sstdev 1 < data.txt
  288619.72189328
  0.72s user 0.01s system 99% cpu 0.736 total

  time R --vanilla --slave -e 'sd(scan("data.txt"))'
  Read 1000000 items
  [1] 288619.7
  1.09s user 0.04s system 99% cpu 1.134 total
R's read.table performance is fairly slow by default because it has to infer the types of the columns and check for inline comments, quotes, etc.

This seems to me more like a replacement for awk and bash one-liners than for the tasks I would use R for.

For instance, counting unique elements:

  #naive approach
  time (sort data.txt | uniq | wc -l)
  632209
  13.09s user 0.04s system 101% cpu 12.984 total

  #using hashing
  time (awk '!a[$0]++' data.txt | wc -l)
  632209
  1.34s user 0.03s system 100% cpu 1.360 total

  #R
  time R --vanilla --slave -e 'length(unique(scan("data.txt")))'
  Read 1000000 items
  [1] 632209
  1.20s user 0.04s system 99% cpu 1.244 total

  #datamash
  time datamash countunique 1 <data.txt
  632209
  0.83s user 0.01s system 99% cpu 0.840 total
Quite good performance in that case, although R surprised me here as well.
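A side note on the hashing one-liner above (my own illustration, not from the original comment): `!a[$0]++` prints each line only the first time it appears, because the array lookup yields 0 (false) on first sight, so the negation is true, and the post-increment then marks the line as seen:

```shell
# a[$0] is 0 the first time a line is seen, so !a[$0] is true and the
# line is printed; the ++ then makes it non-zero for later duplicates.
printf 'x\ny\nx\nz\ny\n' | awk '!a[$0]++'
# prints:
# x
# y
# z
```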


In fact, the documentation mentions that their operators are tested to match those of R https://www.gnu.org/software/datamash/manual/datamash.html#S... which looks like a pretty neat idea.


That's not fair; you're counting R's start-up time and all the guesswork that read.table does and datamash doesn't have to do.


Well, it is fair if all you want to do is mash some data together. Why shouldn't startup times be taken into consideration?


Because with R you can, and usually do, do multiple things with multiple data sources within one session, which effectively amortizes the start-up and load time. Even if reading one file and calculating a mean with a single script is the only thing you do, you can use Rscript, which runs R without loading heavy stuff like the methods package.


Welp, I'm outta business... https://github.com/bagrow/datatools


You provide multivariate statistics, and this doesn't.


You can always contribute to the GNU project.


Apologies for the tangential question, but how does one find the public key for (something like) datamash?

Downloaded: datamash-1.0.6.tar.gz and datamash-1.0.6.tar.gz.sig

Then did:

  gpg --verify datamash-1.0.6.tar.gz.sig datamash-1.0.6.tar.gz
Which results in:

  gpg: Signature made Tue 29 Jul 2014 03:30:23 PM PDT using   RSA key ID 3657B901
  gpg: Can't check signature: public key not found
Where can one import that public key, and is it the public key for datamash or gnu?


  gpg --search-keys 3657B901

  (1) Assaf Gordon <agordon@wi.mit.edu> 4096 bit RSA key 2272BC86, created: 2014-07-09, expires: 2015-07-09

Initial announcement ... http://lists.gnu.org/archive/html/info-gnu/2014-07/msg00007....


thank you! (moral of story: don't start the search by visiting keyserver sites like https://pgp.mit.edu/)


Don't forget the FreeBSD 'ministat' tool, which supports fewer operations but will draw ASCII-art histograms:

https://github.com/thorduri/ministat


Sweet.

I have a little awk routine that I wrote some years back that does much of this -- it computes (or tabulates) n, sum, min, max, mean, median, standard deviation, and percentiles of the input data series. For generating quick stats, it's quite useful.
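For anyone curious, a one-pass awk sketch of the streaming parts of such a routine: n, sum, min, max, mean, and sample standard deviation. (Median and percentiles need the data sorted first, so they're omitted here; this is my own illustration, not the commenter's actual script.)

```shell
printf '%s\n' 1 2 3 4 5 | awk '
  NR == 1 { min = max = $1 }
  {
    n++; sum += $1; sumsq += $1 * $1
    if ($1 < min) min = $1
    if ($1 > max) max = $1
  }
  END {
    mean = sum / n
    # sample standard deviation (n-1 denominator, like datamash sstdev)
    sd = sqrt((sumsq - n * mean * mean) / (n - 1))
    printf "n=%d sum=%g min=%g max=%g mean=%g sstdev=%g\n", n, sum, min, max, mean, sd
  }'
# prints: n=5 sum=15 min=1 max=5 mean=3 sstdev=1.58114
```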

I'm looking forward to datamash turning up in my Debian repos.


The page mentions Windows, but there aren't any binaries available for it. Am I missing something?


I LOVE the interface and the variety of operations, especially the grouping functionality! Thank you for making my life much easier. I would love to see more R operations such as "sample" or "rnorm" added in a later version.


I used to switch between awk/gawk, Python, and R for different file, numeric, textual, and statistical operations. This is great; I would definitely use it.


I love it. No more loading tables into R just to transpose them. Just do:

  cat table.txt | datamash transpose
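On systems where datamash isn't packaged yet, a transpose can be sketched in plain awk (my own illustration; note that datamash separates output fields with tabs by default, while this uses awk's default OFS of a single space):

```shell
printf '1 2 3\n4 5 6\n' | awk '
  # buffer every cell, remembering the widest row
  { for (i = 1; i <= NF; i++) cell[i, NR] = $i; if (NF > cols) cols = NF }
  END {
    # emit one output row per input column
    for (i = 1; i <= cols; i++) {
      row = cell[i, 1]
      for (j = 2; j <= NR; j++) row = row OFS cell[i, j]
      print row
    }
  }'
# prints:
# 1 4
# 2 5
# 3 6
```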


http://www.gnu.org/software/datamash/manual/datamash.html

This looks pretty cool. Anyone used it in "real life"?



