There was a similar article (2014) that is also interesting. I think too many of us see new and shiny and immediately glom onto it, forgetting that the UNIX/regex fathers knew a thing or two about crunching data.
> often people use Hadoop and other so-called Big Data ™ tools for real-world processing and analysis jobs that can be done faster with simpler tools and different techniques.
Right tool for the right job, as always. For a 2-3GB dataset you don't need to bother with Hadoop, just as for a 2-3PB dataset you probably shouldn't bother with awk.
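To make that concrete, here's a rough sketch of the kind of single-pass awk job that handles a few GB comfortably on one machine (the file name, delimiter, and column are made up for illustration):

    # sum column 3 of a ~2-3GB CSV in one streaming pass, no cluster required
    awk -F',' '{ sum += $3; n++ } END { print n " rows, total " sum }' events.csv

It streams the file, so memory use stays flat regardless of input size.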
I'd like to think that most 2-3PB datasets can feasibly be partitioned into GB-sized chunks. I rather suspect it's more common to expand GB datasets into PB ones, though. :(
Yeah, I decided to read that book and install the Docker image. The thing is, I'm not quite sure how to set up the whole thing on Linux, as I also need Python, some database, and gcc. I think I should be able to find some tutorials for Ubuntu.
Although the Docker image is based on Alpine Linux, examining the corresponding Dockerfile [1] may provide some guidance on how to install the tools and their requirements on Ubuntu. Let me know if you have any questions. Always happy to help.
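For what it's worth, if the toolchain really boils down to Python, a database, and gcc as mentioned above, something along these lines usually gets you most of the way on Ubuntu. The package list here is a guess; the Dockerfile remains the authoritative reference for what the book actually needs:

    # rough Ubuntu equivalent of an Alpine toolchain; adjust to whatever the Dockerfile lists
    sudo apt-get update
    sudo apt-get install -y build-essential python3 python3-pip sqlite3
    # build-essential pulls in gcc/make; sqlite3 stands in for "some database" here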
https://adamdrake.com/command-line-tools-can-be-235x-faster-...