GNU datamash (gnu.org)
236 points by jonbaer on Aug 4, 2014 | 19 comments


This is great work, and runs fast. The documentation is well done and has plenty of examples.

Here's an example of datamash and R with timing.

    time datamash sstdev 1 < data.txt
    288891.28552648
    0.76s user 0.01s system 99% cpu 0.775 total

    time R --vanilla --slave -e \
    "x <- read.table('data.txt', header=F); sd(x\$V1);"
    288891.3
    2.68s user 0.06s system 99% cpu 2.761 total
(The data.txt file is 1 million lines, each line a random number 1 to 1 million. The timing is on a MacBook Pro Retina 13" 2014)


For this example (1 column data), you get much closer results using R's scan function rather than read.table

  awk 'END{for(i=0;i<1000000;i++){ print int(rand() * 1000000) } }' </dev/null > data.txt
   
  time datamash sstdev 1 < data.txt
  288619.72189328
  0.72s user 0.01s system 99% cpu 0.736 total

  time R --vanilla --slave -e 'sd(scan("data.txt"))'
  Read 1000000 items
  [1] 288619.7
  1.09s user 0.04s system 99% cpu 1.134 total
R's read.table performance is fairly slow by default because it has to infer the types of the columns and check for inline comments, quotes, etc.

This seems to me more like a replacement for awk and bash one-liners than for the tasks I would use R for.

For instance, counting unique elements:

  #naive approach
  time (sort data.txt | uniq | wc -l)
  632209
  13.09s user 0.04s system 101% cpu 12.984 total

  #using hashing
  time (awk '!a[$0]++' data.txt | wc -l)
  632209
  1.34s user 0.03s system 100% cpu 1.360 total

  #R
  time R --vanilla --slave -e 'length(unique(scan("data.txt")))'
  Read 1000000 items
  [1] 632209
  1.20s user 0.04s system 99% cpu 1.244 total

  #datamash
  time datamash countunique 1 <data.txt
  632209
  0.83s user 0.01s system 99% cpu 0.840 total
Quite good performance in that case, although R surprised me here as well.
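A side note on the hashing one-liner above (my own illustration, not from the original comment): `!a[$0]++` prints each line only the first time it appears, because the array lookup yields 0 (false) on first sight, so the negation is true, and the post-increment then marks the line as seen:

```shell
# a[$0] is 0 the first time a line is seen, so !a[$0] is true and the
# line is printed; the ++ then makes it non-zero for later duplicates.
printf 'x\ny\nx\nz\ny\n' | awk '!a[$0]++'
# prints:
# x
# y
# z
```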


In fact, the documentation mentions that their operators are tested to match those of R https://www.gnu.org/software/datamash/manual/datamash.html#S... which looks like a pretty neat idea.


That's not fair; you're counting R's start-up time and all the guesswork that read.table does and datamash doesn't have to do.


Well, it is fair if all you want to do is mash some data together. Why shouldn't startup times be taken into consideration?


Because with R you can, and usually do, do multiple things with multiple data sources within one session, which effectively amortizes the start-up and load time. Even if reading one file and calculating a mean with a single script is the only thing you do, you can use Rscript, which runs R without loading heavy stuff like the methods package.


Welp, I'm outta business... https://github.com/bagrow/datatools


You provide multivariate statistics, and this doesn't.


You can always contribute to the GNU project.


Apologies for the tangential question, but how does one find the public key for (something like) datamash?

Downloaded: datamash-1.0.6.tar.gz and datamash-1.0.6.tar.gz.sig

Then did:

  gpg --verify datamash-1.0.6.tar.gz.sig datamash-1.0.6.tar.gz
Which results in:

  gpg: Signature made Tue 29 Jul 2014 03:30:23 PM PDT using   RSA key ID 3657B901
  gpg: Can't check signature: public key not found
Where can one import that public key, and is it the public key for datamash or gnu?


  gpg --search-keys 3657B901

  (1) Assaf Gordon <agordon@wi.mit.edu> 4096 bit RSA key 2272BC86, created: 2014-07-09, expires: 2015-07-09

Initial announcement ... http://lists.gnu.org/archive/html/info-gnu/2014-07/msg00007....


thank you! (moral of story: don't start the search by visiting keyserver sites like https://pgp.mit.edu/)


Don't forget the FreeBSD 'ministat' tool, which supports fewer operations but will draw ASCII-art histograms:

https://github.com/thorduri/ministat


Sweet.

I have a little awk routine that I wrote some years back that does much of this -- it computes (or tabulates) n, sum, min, max, mean, median, standard deviation, and percentiles of the input data series. For generating quick stats, it's quite useful.
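For anyone curious, a one-pass awk sketch of the streaming parts of such a routine: n, sum, min, max, mean, and sample standard deviation. (Median and percentiles need the data sorted first, so they're omitted here; this is my own illustration, not the commenter's actual script.)

```shell
printf '%s\n' 1 2 3 4 5 | awk '
  NR == 1 { min = max = $1 }
  {
    n++; sum += $1; sumsq += $1 * $1
    if ($1 < min) min = $1
    if ($1 > max) max = $1
  }
  END {
    mean = sum / n
    # sample standard deviation (n-1 denominator, like datamash sstdev)
    sd = sqrt((sumsq - n * mean * mean) / (n - 1))
    printf "n=%d sum=%g min=%g max=%g mean=%g sstdev=%g\n", n, sum, min, max, mean, sd
  }'
# prints: n=5 sum=15 min=1 max=5 mean=3 sstdev=1.58114
```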

I'm looking forward to datamash turning up in my Debian repos.


The page mentions Windows, but there aren't any binaries available for it. Am I missing something?


I LOVE the interface and the variety of operations, especially the grouping functionality! Thank you for making my life much easier. I would love to see more R operations such as "sample" or "rnorm" added in a later version.


I used to switch between awk/gawk, Python, and R for different file, numeric, textual, and statistical operations. This is great; I would definitely use it.


I love it. No more loading tables into R just to transpose them. Just do:

  cat table.txt | datamash transpose
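On systems where datamash isn't packaged yet, a transpose can be sketched in plain awk (my own illustration; note that datamash separates output fields with tabs by default, while this uses awk's default OFS of a single space):

```shell
printf '1 2 3\n4 5 6\n' | awk '
  # buffer every cell, remembering the widest row
  { for (i = 1; i <= NF; i++) cell[i, NR] = $i; if (NF > cols) cols = NF }
  END {
    # emit one output row per input column
    for (i = 1; i <= cols; i++) {
      row = cell[i, 1]
      for (j = 2; j <= NR; j++) row = row OFS cell[i, j]
      print row
    }
  }'
# prints:
# 1 4
# 2 5
# 3 6
```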


http://www.gnu.org/software/datamash/manual/datamash.html

This looks pretty cool. Anyone used it in "real life"?



