There was a similar article (2014) that is also interesting. I think too many of us see new and shiny and immediately glom onto it, forgetting that the UNIX/regex fathers knew a thing or two about crunching data.
> often people use Hadoop and other so-called Big Data ™ tools for real-world processing and analysis jobs that can be done faster with simpler tools and different techniques.
Right tool for the right job, as always. For a 2-3GB dataset you don't need to bother with Hadoop, just as for a 2-3PB dataset you probably shouldn't bother with awk.
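To make that concrete, here's a rough sketch of the kind of single-pass awk job that handles a few GB comfortably on one machine (the file name, delimiter, and column are made up for illustration):

    # sum column 3 of a ~2-3GB CSV in one streaming pass, no cluster required
    awk -F',' '{ sum += $3; n++ } END { print n " rows, total " sum }' events.csv

It streams the file, so memory use stays flat regardless of input size.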
I'd like to think that most 2-3PB datasets can feasibly be partitioned into GB-sized chunks. I rather suspect it's more common to expand GB datasets into PB ones, though. :(
Yeah, I decided to read that book and install the Docker image. The thing is, I'm not quite sure how to set up the whole thing on Linux, as I also need Python, some database, and gcc. I think I should be able to find some tutorials for Ubuntu.
Although the Docker image is based on Alpine Linux, examining the corresponding Dockerfile [1] may provide some guidance on how to install the tools and their requirements on Ubuntu. Let me know if you have any questions. Always happy to help.
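For what it's worth, if the toolchain really boils down to Python, a database, and gcc as mentioned above, something along these lines usually gets you most of the way on Ubuntu. The package list here is a guess; the Dockerfile remains the authoritative reference for what the book actually needs:

    # rough Ubuntu equivalent of an Alpine toolchain; adjust to whatever the Dockerfile lists
    sudo apt-get update
    sudo apt-get install -y build-essential python3 python3-pip sqlite3
    # build-essential pulls in gcc/make; sqlite3 stands in for "some database" here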
https://adamdrake.com/command-line-tools-can-be-235x-faster-...