So what is used instead of Hadoop currently?

serial_dev · on March 8, 2022

Whether a technology can replace Hadoop in an organization depends on many factors, but some technologies that solve at least in part similar problem are Apache Storm, Spark, Flink, Kafka Streams, and maybe BigQuery?

Or, as the original article says, some companies just use some command line tools, shell scripts.

It's been a couple of years since I was interested in Data Engineering, so my knowledge on this topic is some years behind.

riknox · on March 8, 2022

I've not seen Storm being used anywhere sane for a few years at least now, and from a glance at job postings it looks unlikely. Spark, Kafka Streams etc. are definitely used in a modern data platform from my experience.

I think we're seeing a big shift with Hadoop-like workloads being moved onto cloud providers, so BigQuery, Amazon EMR etc.

bsenftner · on March 8, 2022

I'm curious what constitutes "big data" anymore. In an intermediate machine learning course, we train on nearly a petabyte of data using Google Colab and Jupyter Notebooks. Nobody discusses the size of the data requiring any special treatment due to its size... would not 95% of a petabyte be "big data"?

happymellon · on March 8, 2022

Big data is a shifting concept as computers gain more storage and faster commodity processors.

My general rule of thumb is whether it is too big to put on my laptop. So greater than a couple of Tb's.

pilotneko · on March 8, 2022

What course are you taking? Imagenet is only 150 GB, and Common Crawl is only 320 TB.

Big data is a moving target, but I’m comfortable defining it as data too large to fit in memory. Obviously, you can always get a bigger node, my rule is thumb is that if you need generators, you are working with big data.

rurban · on March 8, 2022

GNU parallel. not moreutils parallel, and not xargs