It's slightly mystifying. The only company I've worked at that did "big data" _r...

madhadron · on Nov 12, 2019

You can't put Hadoop on your resume if you solved the problem in RAM instead.

RivieraKid · on Nov 12, 2019

I've always disliked the term "big data" because all of the attempts at a definition seemed either stupid or vague. After a while, I came up with this definition: it's a set of technnologies used for processing data that are too large to be processed in a single machine.

mumblemumble · on Nov 12, 2019

The thing that gets me about that definition is that "too large to be processd in a single machine" leaves out a lot of variables. How's the machine specced? How's it being analyzed? Using what kinds of software?

If the only single-machine option you consider is Pandas, which doesn't do streaming well and is built on a platform that makes ad-hoc multiprocessing a chore, you'll hit the ceiling a lot faster than if you had done it in Java, which might in turn be hard to push as far as something like C# (largely comparable to Java, but some platform features make it easier to be frugal with memory and mind your cache lines) or, dare I say it, something native like ocaml or C++.

Alternatively, if you start right off with Spark, you won't be able to push even one node as far as if you hadn't, because Spark is designed from the ground up for running on a cluster, and therefore has a tendency to manage memory the same way a 22-year-old professional basketball player handles money. It makes scale-out something of a self-fulfilling prophecy.

Also, as someone who was doing distributed data processing pipelines well before Hadoop and friends came along, I'm not sure I can swallow "big data" being one and the same as "handling data that is too big to run on one computer." Big data sort of implies a certain culture of handling data at that scale, too.

Because of that, I tend to think of "big data" as describing a culture as much as it describes anything practical. It's a set (not the only set) of technologies for procesing data on multiple machines. Whether you actually need multiple machines to do the job seems to be less relevant than the marketing team at IBM (to pick an easy punching back) would have us believe.

toast0 · on Nov 12, 2019

Saying big data is data too large to process on a single machine purposefully leaves out the spec of the machine.

That's because a reasonably sized machine from today is much larger than one from five years ago. And an unreasonably large machine today is also larger but yet more achievable.

A basic dual Epyc system can have 128 cores, and 2TB of ram. Someone mentioned 24 TB of ram, which is probably not a two socket system.

You can do a lot with 2TB of ram.

e12e · on Nov 12, 2019

And there are still some use cases beyond the single machine: eg CERN.

But I think it's quite safe to say that it's not often because you need to process so much data, but rather that your experiment is a fire hose of data, and you're not sure what you want to keep, and what you can summarize - until after you've looked at the data.

And there might be a reason to keep an archive of the raw data as well.

Another common use case would be seismic data from geological/oil surveys.

But "human generated" data, where you're doing some kind of precise, high value recording, like click streams, card transactions etc might be "dense", but usually quite small compared to such "real world sampling".

colechristensen · on Nov 12, 2019

And more often than not, it's a set of technologies used for processing data on many machines which could be processed faster on one.

hinkley · on Nov 11, 2019

People do not want to hear about how they’re working way, way too hard to feel special.

pjmlp · on Nov 12, 2019

So far my use of "big data" was on networking management solutions for mobile networks, containing years of telecomunication data growing by the second, Oracle and SQL Server OLAP engines managed any kind of query without much sweat, other that fine tuning queries and indexes.