The author begins with a fairly idiomatic shell pipeline, but in the search for performance the pipeline transforms into an awk script. Not that I have anything against awk, but I feel like that kinda runs against the premise of the article. The article ends up demonstrating the power of awk over pipelines of small utilities.
Another interesting note: the script as-is could mis-parse the data. The grep should use '^\[Result' instead of 'Result'. I think this nicely demonstrates the fragility of the ad-hoc parsers that are common in shell pipelines.
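A minimal sketch of the anchoring issue, assuming PGN-style tag lines such as [Result "1-0"] (my assumption about the input; games.pgn is a hypothetical filename):

    # Unanchored pattern: matches any line that merely contains "Result",
    # e.g. a comment or annotation inside a game body.
    grep 'Result' games.pgn

    # Anchored pattern: matches only lines that begin with the [Result tag.
    grep '^\[Result' games.pgn

The anchored version is what you actually mean, and the difference is invisible until the data happens to contain the word elsewhere.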
It probably depends on what you are trying to accomplish... I think a lot of us would reach for a scripting language to run through this (relatively small) amount of data... node.js does piped streams of input/output really well. And perl is the granddaddy of this type of input processing.
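For instance, a rough sketch of the same line filtering done the perl way, as a stream over stdin rather than slurping the file (my own illustration, not from the article; games.pgn is again a hypothetical filename):

    # Keep only the [Result tag lines, then tally the distinct results.
    perl -ne 'print if /^\[Result/' games.pgn | sort | uniq -c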
I wouldn't typically reach for a big data solution short of hundreds of gigs of data (which is borderline, but will only grow from there). I might even reach for something like ElasticSearch as an interim step, which will usually be enough.
If you can dedicate a VM in a cloud service to a single one-off task, that's probably a better option than creating a Hadoop cluster for most workloads.