
The problem with such courses is that what counts as "massive" tends to change every year, roughly exponentially.

For example, if I can get a single machine with 32 cores and 1TB of memory, what is "massive" in this context?




The datasets grow to meet the computing available to them. The things gathering the data themselves become more powerful, and so more of that data makes it downstream.

I'd define "massive" data as anything where n^2 is too big, where "too big" means bigger than either my RAM or my patience.
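
As a rough sketch of where that line sits (assuming 8-byte values and a 64GB machine, both made-up numbers):

    import math

    ram_bytes = 64 * 1024**3   # assume a 64GB machine
    bytes_per_value = 8        # 64-bit floats
    # largest n for which a dense n x n matrix still fits in RAM
    max_n = int(math.sqrt(ram_bytes / bytes_per_value))
    print(max_n)               # ~92k rows; past that, n^2 storage or work is "too big"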


I've heard everything from "it doesn't fit in Excel" to "it doesn't fit in RAM on a standard dev's laptop (~10GB)" to "it doesn't fit in RAM on a decent-sized EC2 instance (~250GB)".


I started worrying at one point that all the techniques I learned early in my career for working with big data were becoming obsolete, but they aren't. What you once needed to do to make things possible at all, you now need to do to make them fast.


Isn't it the same as before? If 4GB of data was too big because you had 2GB of RAM, then the methods used back then are the same ones you would apply to a 500GB dataset that can't fit in a 250GB RAM machine, right?

New issues appear when you have to analyze 2TB on a 32GB RAM machine, but when the order of magnitude of the gap is the same, aren't the issues, and thus the answers, the same as before?
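
For instance, the old trick of streaming the data in chunks instead of loading it all at once still applies. A minimal sketch with pandas (the file and column names are made up):

    import pandas as pd

    total = 0
    # process a file much larger than RAM one chunk at a time
    for chunk in pd.read_csv("events.csv", chunksize=1_000_000):
        total += chunk["amount"].sum()   # any per-chunk aggregation works here
    print(total)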


No, because the number of use cases where you have 1TB or 2TB of data is comparatively small.

Also, the remaining use cases (which now fit into a single machine's memory) can be handled much more efficiently with in-memory algorithms instead of I/O-based ones.

The goal of Hadoop, as well as most of the theory on disk-based indices (e.g. B-trees), was to overcome I/O bottlenecks. But as memory gets bigger and cheaper, there is a trend to drop Hadoop in favor of reading data directly from cloud storage into memory.
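
For example, instead of running a Hadoop job, you can often pull a Parquet file straight from object storage into memory and work on it there. A rough sketch with pandas (the bucket, file, and column names are made up; it assumes the data actually fits in RAM and an s3fs/pyarrow backend is installed):

    import pandas as pd

    # read columnar data directly from cloud storage into memory,
    # with no Hadoop cluster in between
    df = pd.read_parquet("s3://my-bucket/events.parquet", columns=["user_id", "amount"])
    print(df["amount"].mean())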



