This is great work, and runs fast. The documentation is well done and has plenty of examples.
Here's an example of datamash and R with timing.
time datamash sstdev 1 < data.txt
288891.28552648
0.76s user 0.01s system 99% cpu 0.775 total
time R --vanilla --slave -e \
"x <- read.table('data.txt', header=F); sd(x\$V1);"
288891.3
2.68s user 0.06s system 99% cpu 2.761 total
(The data.txt file is 1 million lines, each line a random number 1 to 1 million. The timing is on a MacBook Pro Retina 13" 2014)
For this example (1-column data), you get much closer results using R's scan function rather than read.table:
awk 'END{for(i=0;i<1000000;i++){ print int(rand() * 1000000) } }' </dev/null > data.txt
time datamash sstdev 1 < data.txt
288619.72189328
0.72s user 0.01s system 99% cpu 0.736 total
time R --vanilla --slave -e 'sd(scan("data.txt"))'
Read 1000000 items
[1] 288619.7
1.09s user 0.04s system 99% cpu 1.134 total
R's read.table is fairly slow by default because it has to infer the column types and check for inline comments, quotes, etc.
To me this seems more like a replacement for awk and bash one-liners than for tasks I would use R for. For instance, counting unique elements:
#naive approach
time (sort data.txt | uniq | wc -l)
632209
13.09s user 0.04s system 101% cpu 12.984 total
#using hashing
time (awk '!a[$0]++' data.txt | wc -l)
632209
1.34s user 0.03s system 100% cpu 1.360 total
#R
time R --vanilla --slave -e 'length(unique(scan("data.txt")))'
Read 1000000 items
[1] 632209
1.20s user 0.04s system 99% cpu 1.244 total
#datamash
time datamash countunique 1 <data.txt
632209
0.83s user 0.01s system 99% cpu 0.840 total
Quite good performance in that case, although R surprised me here as well.
That's partly because with R you can, and usually do, do multiple things with multiple data sources within one session, which effectively amortizes the start-up and load time. Even if reading one file and calculating a mean or something with a single script is the only thing you do, you can use Rscript, which runs R without loading heavy stuff like the methods package.
I have a little awk routine, written some years back, that does much of this -- it computes (or tabulates) n, sum, min, max, mean, median, standard deviation, and percentiles of the input data series. It's quite useful for generating quick stats.
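In case it's useful, here's a rough sketch of that kind of routine (a reconstruction, not the original script; percentiles are left out for brevity):

```shell
# reads a single column of numbers, prints n, sum, min, max, mean,
# median, and sample standard deviation
printf '1\n2\n3\n4\n5\n' | awk '
{ v[NR] = $1; s += $1; ss += $1 * $1 }
END {
  n = NR; mean = s / n
  sd = (n > 1) ? sqrt((ss - n * mean * mean) / (n - 1)) : 0
  # insertion sort for the median (portable; gawk users could call asort)
  for (i = 2; i <= n; i++) {
    x = v[i]
    for (j = i - 1; j >= 1 && v[j] > x; j--) v[j + 1] = v[j]
    v[j + 1] = x
  }
  med = (n % 2) ? v[(n + 1) / 2] : (v[n / 2] + v[n / 2 + 1]) / 2
  printf "n=%d sum=%g min=%g max=%g mean=%g median=%g sstdev=%g\n",
         n, s, v[1], v[n], mean, med, sd
}'
# prints: n=5 sum=15 min=1 max=5 mean=3 median=3 sstdev=1.58114
```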
I'm looking forward to datamash turning up in my Debian repos.
I LOVE the interface and the variety of operations, especially the grouping functionality! Thank you for making my life much easier. I would love to see more R operations such as "sample" or "rnorm" added in later versions.