
There was a similar article (2014) that is also interesting. I think too many of us see new and shiny and immediately glom onto it, forgetting that the UNIX/regex fathers knew a thing or two about crunching data.

https://adamdrake.com/command-line-tools-can-be-235x-faster-...
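For a flavor of what that article is getting at, a single streaming awk pass is often all you need; this is a minimal sketch, not the article's exact commands, and the file name and column position are made up for illustration:

    # Count occurrences of each value in the 3rd whitespace-separated
    # column of a large text file, in one streaming pass (no cluster needed).
    # games.txt and the column index are hypothetical.
    awk '{ counts[$3]++ } END { for (k in counts) print k, counts[k] }' games.txt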




> often people use Hadoop and other so-called Big Data ™ tools for real-world processing and analysis jobs that can be done faster with simpler tools and different techniques.

Right tool for the right job, as always. For a 2-3 GB dataset you don't need to bother with Hadoop, just as for a 2-3 PB dataset you probably don't want to rely on awk.


I'd like to think that most 2-3 PB datasets could feasibly be partitioned into GB-sized chunks. I rather suspect it's more common to expand GB datasets into PB ones, though. :(
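To make the partitioning idea concrete, something like this works when the records are independent; it's a rough sketch and the file names, chunk size, and per-chunk script are all hypothetical:

    # Split a large TSV into ~1 GB pieces on line boundaries, process each
    # piece independently in parallel, then merge the per-chunk results.
    # huge.tsv and process_chunk.sh are placeholder names.
    split --line-bytes=1G huge.tsv chunk_
    for f in chunk_*; do
      ./process_chunk.sh "$f" > "$f.out" &
    done
    wait
    cat chunk_*.out > results.txt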


Thanks for the link. Very interesting.

In OP's article there is a link to a book "Data Science at the Command Line" which sounds quite relevant: https://www.datascienceatthecommandline.com


Yeah, I decided to read that book and install the Docker image. The thing is, I'm not quite sure how to set the whole thing up on Linux, as I also need Python, some database, and gcc. I think I should be able to find some tutorials for Ubuntu.


Although the Docker image is based on Alpine Linux, examining the corresponding Dockerfile [1] may provide some guidance on how to install the tools and their requirements on Ubuntu. Let me know if you have any questions. Always happy to help.

[1] https://github.com/datascienceworkshops/dockerfiles/blob/mas...
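If you'd rather skip Docker on Ubuntu, a starting point for the pieces mentioned above might look like this; the package choices are assumptions on my part, and the Dockerfile linked above remains the authoritative list of what the book actually expects:

    # Rough Ubuntu equivalents of "Python, some database, and gcc".
    # Package names are assumptions; adjust to match the Dockerfile.
    sudo apt-get update
    sudo apt-get install -y build-essential     # gcc, make, headers
    sudo apt-get install -y python3 python3-pip
    sudo apt-get install -y sqlite3             # lightweight stand-in for "some database"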


You might be on to something there.



