Everyone forgets the brilliant and sometimes crazy BSD ones:
- column: create columns / tables from input data
- tr: substitute / delete chars
- join: like a database join, but for text files
- comm: like diff, but you can use it programmatically to select the lines that appear only in one file, only in the other, or in both.
- paste: put file lines side-by-side
- rs: reshape arrays
- jot: generate random or sequence data
- expand: convert tabs to spaces
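A quick illustration of a few of these together (output will vary, since jot -r is random; the paste line uses bash process substitution):
jot 5                                           # the numbers 1 through 5, one per line
jot -r 5 1 100                                  # five random numbers between 1 and 100
paste <(jot 5) <(jot -r 5 1 100) | column -t    # put them side by side, lined up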
Looks like 6 of those 8 are present on just about any modern Linux as well (most via GNU coreutils; column comes from util-linux). 'rs' and 'jot' are the two missing from most default Linux installs. On Debian you can install them via the packages 'rs' and 'athena-jot'.
I'm old-fashioned, so I use "sed 100q" instead of the newer "head -100". It saves a keystroke, too.
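Both of these print the first 100 lines of a (hypothetical) access.log; sed simply quits after line 100:
sed 100q access.log
head -100 access.log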
There are enough variations in ways to do things on Unix that I've sometimes wondered about how easy it would be to identify a user by seeing how they accomplish a common task.
For instance, I noticed at one place I worked that even though everyone used the same set of options when doing a "cpio -p", everyone had their own order they wrote them in. Seeing one "cpio -p" command was sufficient to tell which of the half dozen of us had run it.
I think I'm the only one where I work who uses "sed Nq" instead of "head -N", so that would fingerprint me.
I sorta had this happen to me once. I have used "lsl" as an alias for long directory listings for longer than I can remember. And just out of habit it was almost always the first command I typed when logging into any box anywhere.
So one day I telnetted into a Solaris machine and immediately typed "lsl" before doing anything else. A short while later a colleague came to my cube. He had been snooping the hme1 interface and saw me login. He didn't need to trace the IP because he knew it was me when he saw 3 telnet packets with "l" "s" "l" in them.
AWK is worth learning completely. It hits a real sweet spot in terms of minimizing the number of lines of code needed to write useful programs in the world of quasi-structured (not quite CSV but not completely free form) data. You can learn the whole language and become proficient in an afternoon.
I recommend "The AWK Programming Language" by Aho, Kernighan, and Weinberger, though it seems to be listed for a hilariously high price on Amazon at the moment. Maybe try to pick up a used copy.
>I recommend "The AWK Programming Language" by Aho, Kernighan, and Weinberger
I concur with this recommendation. "The AWK Programming Language", at a little over 100 pages, is a classic of programming language instruction. The book jumps right into use cases; it does not waste one's time. This book should be required reading for anyone contemplating writing a handbook on any programming language; if it were, my CS bookshelf would be several feet thinner and several times more informative.
I'd advise against learning more than basic usage of awk, and spending the time on more versatile languages instead. You can do very neat tricks with sed and awk, but when the problems get more complex, it is a lot faster to use a smarter language. And if you know that language well, you will find it can be just as concise for the relatively simple tasks too.
When Perl was created, one of its advertised goals was to avoid all the time lost trying to work around the limitations of awk, sed and shell.
I recall there was a pointer to a great old AWK tutorial some time ago - something along the lines of 'how to approach the awk language....' - has anyone kept the link?
By the way: what people need to understand is that in order to use awk efficiently, you'll either use associative arrays or structure your script like a sed script; otherwise it will be slow. The other interesting thing is awk's regex engine, a Thompson NFA, which, from what I hear, is around 7 times faster than the backtracking PCRE-style engines used in Perl, PHP, Python and Ruby.
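For instance, a minimal sketch of the associative-array style, assuming a pipe-delimited data.csv like the one in the article (treating field 2 as some category column, which is just an assumption here):
# count how many rows fall into each value of column 2
awk -F "|" '{ count[$2]++ } END { for (k in count) print count[k], k }' data.csv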
One of my favorite little tools that makes all these others better is pv -- PipeViewer.
Use it any place in a pipeline to see a progress meter on stderr. Very handy when grepping through a bunch of big log files looking for stuff. Here is a quick strawman example:
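Something along these lines (filenames are hypothetical):
pv huge.log | grep ERROR
pv huge.log.gz | gunzip | grep ERROR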
The second form will show the progress of the decompression process.
You can also adjust the buffer size and set the expected size for the time estimate (otherwise it has to fstat the input); the progress meter itself goes to stderr, so it stays out of the data flowing through stdout.
Not sure if you prefer binary installs or whether you compile your installs yourself... but I'm sure you could get this to compile on FreeBSD with a little work.
Here's one way to do this with the standard "dc" (RPN calculator) utility:
echo "1\n2\n3\n+\n+\np\n" | dc -
Or, a little more legibly:
> dc
1
2
3
+
+
p
6
[Edited to fix bug]
Not sure how to automate this to sum 1000 values without needing to explicitly insert 999 + signs, though. Haven't explored dc in depth myself yet. There's probably some way to do it with a macro or something, but it may not be pretty.
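There is a macro way; here's a sketch that sums a hypothetical numbers.txt (one number per line) by repeatedly adding while more than one value is left on the stack:
( cat numbers.txt; echo '[+z1<a]sa z1<a p' ) | dc
# [+z1<a] pops and adds the top two values, then re-runs itself while the stack depth (z) is greater than 1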
Yeah, I wrote my own sum utility in Python... the syntax is just "sum 1" or "sum 2" for the column, with a -d delimiter flag. In retrospect I guess it could have been a one-line awk script. But if you are doing this kind of data processing, it makes sense to have an hg/git repo of aliases and tiny commands that you sync around from machine to machine. You shouldn't have to write "sum" more than once.
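The awk version would look roughly like this (sumcol and its argument handling are just a sketch, not the actual utility):
# sum column N of delimiter-separated input; delimiter defaults to whitespace
sumcol() { awk -F"${2:- }" -v col="$1" '{ s += $col } END { print s }'; }
sumcol 4 '|' < data.csv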
Another useful one is "hist" which is sort | uniq -c | sort -n -r.
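In other words, something like this in your shell rc (the field number here is just an assumption about a common-format Apache log, where field 9 is the status code):
alias hist='sort | uniq -c | sort -n -r'
awk '{print $9}' access.log | hist    # e.g. counts per HTTP status code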
I've never aliased it, but yes I use your 'hist' a lot. Useful for things like "categorise log errors" etc.
Does everyone else edit command history, stacking up 'grep -v xxxx' in the pipeline to remove noise?
If I'm working on a new pipeline, my normal workflow is something like:
head file # See some representative lines
head file | grep goodstuff
head file | grep goodstuff | grep -v badstuff
head file | grep ... | grep ... | sed -e 's/cut out/bits/' -e 's/i dont/want/'
head file | grep ... | grep ... | sed -e 's/cut out/bits/' -e 's/i dont/want/' | awk '{print $3}' # get a col
head file | grep ... | grep ... | sed -e 's/cut out/bits/' -e 's/i dont/want/' | awk '{print $3}' | sort | uniq -c | sort -nr # histogram, as in the parent comment
Then I edit the 'head' into a 'cat' and handle the whole file. Basically all done with bash history editing (I'm a 'set -o vi' person for vi keybindings in bash, emacs is fine too :-)
Yeah, this is my quick-and-dirty way of looking at referers in Apache logs, built up from a few history edits. It excludes some bot-like stuff (many bots give a plus-prefixed URL in the user-agent string) and referer strings from my own domain, removes query strings, and cleans up trailing slashes:
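Roughly like this; the domain, log path, and exact field handling are placeholders rather than my actual command, and it assumes the combined log format (referer is the fourth quote-delimited field):
# referers: drop bot hits (plus-prefixed UA URLs), own-domain referers, query strings, trailing slashes
grep -v '+http' access.log \
  | awk -F'"' '{print $4}' \
  | grep -v '^-$' \
  | grep -v 'mydomain\.example' \
  | sed -e 's/?.*$//' -e 's|/$||' \
  | sort | uniq -c | sort -nr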
Doesn't work. By default, echo doesn't translate \n into a newline, so you have to add the -e flag. Then, bc doesn't like the extra plusses at the end, so you have to either add the -n flag to echo and remove the last \n, or somehow trim the newlines from the end beforehand.
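A trick that sidesteps both problems, assuming the numbers sit one per line in a file, is to let paste insert the plus signs (no trailing operator, no echo escapes):
paste -sd+ numbers.txt | bc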
I was hoping to see an article about some neat new utilities specifically tailored for doing advanced data analysis.
Instead this is a set of basic examples of bog-standard tools that every newbie *nix user should be already familiar with: cat, awk, head, tail, wc, grep, sed, sort, uniq
>"tools that every newbie nix user should be already familiar with"
The key word is should... you might be surprised how many "not newbie" *nix users aren't aware of those commands, or of how to use them in this fashion. Especially awk.
Don't forget there are only ever going to be more unix newbies in the world. It's not like they're a dying breed. There are more people than ever who have never been exposed to unix tools who might benefit from them (myself included several years ago).
Lots of people are familiar with the "basics" of each of these commands but many of them (awk, sed) are very powerful utilities that can do much, much more than it appears at first glance.
A commenter on the article pointed out the "Useless use of cat".
What most users probably don't realize is that the redirection can be anywhere on the line, not just at the beginning. Putting an input redirection at the beginning of the command can make the data flow clearer: from the input file, through the command, to stdout:
< data.csv awk -F "|" '{ sum += $4 } END { printf "%.2f\n", sum }'
(This only works for simple commands; you can't do `< file if blah; then foo; else bar; fi`)
"Useless use of cat" is one of those boring pedantic comments that makes me cringe. Who cares? It's usually much more straightforward to build a pipeline from left to right, particularly for people who are just learning this stuff.
> It's usually much more straightforward to build a pipeline from left to right, particularly for people who are just learning this stuff.
True, however, people pointing out UUOC are in fact pointing out that you should not be building a pipeline at all. If you want to apply an awk / sed / wc / whatever command to a file, then you should just do that instead of piping it through an extraneous command.
Sure, as people always mention, in your actual workflow you might have a cat or grep already, and are building a pipeline incrementally; there's no reason to remove previous stuff to be "pure" or whatever. But if you're giving a canonical example, there's no reason to add unneeded commands.
Please be very careful doing math with bash and awk...
cat data.csv | awk -F "|" '{ sum += $4 } END { printf "%.2f\n", sum }'
From that command alone, it's unclear whether the sum will be accurate; it depends on the inputs and on the precision of awk's floating-point arithmetic. See (D.3 Floating-Point Number Caveats): http://www.delorie.com/gnu/docs/gawk/gawk_260.html
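To make the caveat concrete:
awk 'BEGIN { for (i = 0; i < 10; i++) s += 0.1; printf "%.17g\n", s }'
# prints 0.99999999999999989, not 1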
> Please be very careful doing math with bash and awk...
I don't see how that's more true for awk than it is for any other programming language. Awk uses double precision floating point for all numeric values, which isn't a horrible choice for a catch-all numeric type.
I know purists always complain about unnecessary cats, but I always find it useful to start with "head" or "tail" in the first position to figure out my pipeline, and then replace it with cat when it's all working.
And if the extra cat is actually making a measurable difference, maybe that's a good signal that it's time to rewrite it in C.
Some people prefer the first, longer-winded way because it's more explicit. To some --- myself included --- it makes more sense because it explicitly breaks each function into a separate step; I'm explicitly telling the system to print the contents of data.txt rather than implicitly doing so. I'll happily type those five extra characters for that additional clarity.
Or you can actually use the Linux commands by installing Cygwin. Pretty much my first conscious action when I wake up stranded on a desert Windows system.
I couldn't care less which you choose so long as you're getting a proper Linux toolset.
Powershell is a skill I don't have yet which carries over to ... precisely one declining technical dinosaur (with a penchant for expiring its skillsets).
The Linux toolbox is a set of skills I embarked on learning over a quarter-century ago, most of which goes back another decade or further (the 'k' in 'awk' comes from Brian Kernighan, one of Unix's creators). And while some old utilities are retired and new ones replace them (telnet / rsh for ssh, sccs/rcs for git), much of the core has remained surprisingly stable over time.
The main difference between MinGW and Cygwin appears to be how Windows-native they are considered, which for my own purposes has been an entirely irrelevant distinction, though if you're building applications on top of the tools it might matter to you.
One trick I like is to feed two files into awk: /dev/stdin and some other file I'm interested in. Here's an example: look up the subject names for a list of serial numbers in an OpenSSL index.txt.
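Sketched from memory rather than copied; serials.txt is hypothetical, and the field positions assume the usual tab-separated CA index.txt layout with the serial in column 4 and the subject DN in column 6:
cat serials.txt | awk -F'\t' '
    NR == FNR { want[$1] = 1; next }   # first file: /dev/stdin (the serial numbers)
    $4 in want { print $6 }            # second file: index.txt (print matching subject DNs)
' /dev/stdin index.txt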
I find myself using this idiom (feeding data from a file to awk and selecting it with data from standard input) again and again. It's a great way to scale shell scripts to take multiple arguments while avoiding opening the same file N times, or doing clunky things with awk's -v flag.
If this interests you, you should check out Joyent's new Manta service, which lets you do this type of thing on your data via their infrastructure. It's really cool.
If I needed to do this type of thing on 10 TB of data, it would probably take me longer to get the data to them than it would to just run it on my own hardware.
Apparently there's a need for it, though, or it wouldn't exist.
This entire HN thread is a perfect example of why we built Manta. Lots of engineers/scientists/sysadmins/... already know how to (elegantly) process data using Unix tools, augmented with scripts. Manta isn't about always needing to work on a 10TB dataset (you can), but about it being always available, and stored ready to go. I know we can't live without it for running our own systems -- all logs in the entire Joyent fleet are rotated and archived in Manta, and we can perform both recurring/automated and ad-hoc analysis on the dataset, without worrying about storage shares, or ETL'ing from cold storage to compute, etc. And you can sample as little or as much as you want. At least to us (and I've run several large distributed systems in my career), that has tremendous value, and we believe it does for others as well. And that's just one use case (log processing).
Wow, this looks great. My ideal cloud-computing platform is basically something like xargs -P or GNU parallel, but with the illusion that I'm running it on a machine with infinite CPU cores and RAM (charged for usage, of course). I was spoiled early on by having once had something almost like that, via a very nice university compute cluster, where your data was always available on all nodes (via NFS), and you just prefixed your usual Unix commands with a job-submit command, which did the magic of transparently running stuff wherever it wanted to run it. Apart from the slight indirection of using the job-submit tool, it almost succeeded in giving the illusion of ssh-ing into a single gazillion-core big-iron machine, which is more or less the user experience I want. But I haven't found a commercial offering where I can get an account on a big Unix cluster and just get billed for some function of my (disk space, CPU usage, RAM usage) x time.
Cloud services are amazing in a lot of ways, but so far I've found them much more heavyweight for the use-case of running ad-hoc jobs from the Unix command line. You don't really want to write Hadoop code for exploratory data analysis, and even managing a little fleet of bashreduce+EC2 instances that get spun up and down on demand is error-prone and tedious, turning me more into the cluster administrator rather than a user, which is what I'd rather be. Admittedly it's possible that could be abstracted out better in the case where you don't mind latency: I often don't mind if my jobs queue up for a few minutes, which would mean a tool could spin up EC2 instances behind the scenes and then tear them down without me noticing. But I haven't found anything that does that transparently yet, and Manta looks like a more direct implementation of the "illusion of running on an N-core machine for arbitrary N" idea that seems in the same cost ballpark. Definitely going to do some experimentation here, to see if 2010s technology will enable me to keep using a 1970s-era data-processing workflow.
I know Manta has default software packaged, but is it possible to install your own, like ghci or julia? Or is that something that needs to be brought in as an asset? This isn't necessarily a feature request, just trying to figure out how it works. https://apidocs.joyent.com/manta/compute-instance-software.h...
Yeah, that's why we generate ~/reports for you every hour - that's what our billing runs off of. I know there's an internal "turn that into daily $" script somebody wrote -- we'll get that put out as a sample job.
I actually was about to post this -- this guide is great.
As an undergraduate, this was what was given to us to help demonstrate above-introductory command line tools/pipes.
Wow, I have to say I use all these commands - these are also particularly useful while testing Hadoop streaming jobs since you can test locally on your shell using "cat | map | sort | reduce" (replace cat with head if you want) and then actually run it in Hadoop.
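In other words, something like this, where mapper.py and reducer.py stand in for whatever your streaming job actually runs:
cat input.txt | ./mapper.py | sort | ./reducer.py > output.txt
# swap "cat input.txt" for "head -1000 input.txt" to smoke-test on a sample first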
Really? In a recent test of a whole bunch of languages (scripting, compiled and JVM, but not Kona or kdb), our awk test was beaten only by C. awk was so far ahead that its run time beat the others' compile + run time.
Yes, really. Nothing I know of beats the speed of k for this column-oriented type of task. Kernighan himself tested it against awk many years ago and if I recall it was generally faster even in that set of tests.
Where can I see your experimental design? I'd like to try to replicate your results.
I've been using Ubuntu for about a year now, and although I feel comfortable doing a lot of things with the command line, I'm not sure I really know enough about *nix. I wish there were a website with the 20-30 most useful Unix commands and very clear language as to what they do, with examples. Although I've used all the tools in this post, I still enjoyed the examples.
If your text manipulation programs are locale-aware, they may be interpreting the input as a multibyte encoding, and need to do a lot more work in preprocessing to get semantically correct operation. For example, a Unicode-aware grep may understand more forms of equivalence, similarly for sorting. See e.g. http://en.wikipedia.org/wiki/Unicode_equivalence
With the C locale, text is more or less treated as plain bytes.
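Hence the classic speed trick of forcing the C locale per command (filenames hypothetical):
LC_ALL=C grep 'ERROR' big.log
LC_ALL=C sort big.txt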
According to the comments in that thread, this issue was fixed in GNU grep 2.7 (my system currently has grep 2.14 on it, so this must have been some time ago).
It's no longer quadratic in so many cases, but it's still true that UTF-8 string operations require, in the best case, several CPU cycles per character consumed, even when the input is an ASCII subset. LC_ALL=C pretty much guarantees one or fewer CPU cycles per input character. Basics like strlen and strchr and strstr are significantly faster in "C" locale.
Note that the standard Solaris versions of many commands are substantially faster than their GNU equivalents in the 'C' and multi-byte locales, so this advice doesn't necessarily apply.
That's part of why Solaris continues to use them in favour of GNU alternatives (although the GNU alternatives are available easily in /usr/gnu/bin).