Ask HN: What Happened to Big Data?
66 points by night-rider on June 23, 2022 | 74 comments
There was a buzzword, Big Data, which regularly popped up in tech news. I haven’t seen the term used much lately. What advances are being made on it?

I imagine with various privacy scandals it fell out of favour since your data should be /your/ data only.

And many have talked about data being the ‘new oil’ when really it should be reframed as radioactive waste.

What happened to using this term to hype up your brand: ‘We use Big Data to infer information about how to improve and go forward’?

Was it just a hyped up buzzword?



To echo what others have said, the practice of big data has become so normalized that the term "big data" -- as a label calling attention to something new -- is not needed as much as before.

A similar language history happened to terms like "dynamic web" or "interactive web". In the late 1990s, when JavaScript started to be heavily used, we called attention to that new trend as the "dynamic web". Today, the "interactive web" phrase has mostly gone away. But that doesn't mean that JavaScript-enabled web pages were a fad. On the contrary, we're using more JavaScript than ever before. We just take it as a given that the web is interactive.

Examples of the rise & fall of these phrases in language use, which peaked around 2004:

https://books.google.com/ngrams/graph?content=dynamic+web&ye...

https://books.google.com/ngrams/graph?content=interactive+we...


I am too young to remember "information superhighway".

But the mere transfer of text files over HTTP or some other protocol was a buzzword once!


Responsive web design is another of those.


Responsive web design, mobile-first design, progressive web apps: there are a lot of phrases that have done the rounds over the past decade or so. Dynamic HTML (DHTML) was a thing for a while too, early drag-and-drop and other such interactive things on a website.


Remember "multimedia" computers? Those were the days


That's a big part. It's also the case that a lot (but not all) of the technologies associated with Big Data were found to be less than broadly useful (many of the NoSQL database technologies, arguably Hadoop, etc.)


Like a lot of fads in IT, Big Data sounded like "if you have a lot of data you can monetize it", so companies threw 7+ figures at the technology and then realised that you can have more data than you know what to do with and still not be able to monetize it (obviously some did/still do). Even at a simple level, working at a data collection company, it is very clear that lots of people want to collect as much as possible, only to then do precisely nothing useful with the results.

Then Machine Learning comes along, and these same people think it means you can just feed the beast your big data and it will be clever enough to tell you what you want to know. Then the same companies realise they still don't have the skills to work out what to tell the ML algorithm to do.


> Then Machine Learning comes along, and these same people think it means you can just feed the beast your big data and it will be clever enough to tell you what you want to know. Then the same companies realise they still don't have the skills to work out what to tell the ML algorithm to do.

My favorite one of these was an ML model demo we got, which offered the incredibly insightful analysis: "As customer dissatisfaction increases, customer satisfaction decreases."


Best one I've encountered is: "Customers whose accounts are up for renewal, are the ones most at risk of churning"


One insight we found (with just a calculator) is that 85% of our customers that say they will leave will stop being our customers.

Sometimes you don't need a huge machine learning model. Just ask the customer, record the answer, and record the outcome.
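
Something like this is all the "model" that takes (column names are made up for illustration): one groupby over the survey answers and the observed outcomes.

    import pandas as pd

    # Toy survey data: did the customer say they would leave, and did they churn?
    df = pd.DataFrame({
        "said_will_leave": [True, True, False, True, False, False],
        "churned":         [True, True, False, False, False, True],
    })

    # Churn rate among customers who said they would leave vs. those who didn't.
    print(df.groupby("said_will_leave")["churned"].mean())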


Rather than the skills, the difficulty most companies have with ML is that they don't understand their data before they begin. There's a data engineering piece of work required before the data science.


It became ubiquitous.

Now it is everywhere. Enterprises use it constantly.

A laptop hard disk is now capable of holding databases with tens of millions of rows.

Traditional "Data Science" and modern Deep Learning rely entirely on it. Millions of datapoints are used to create models everyday.

A sensor on a human wrist collects and stores thousands of data points each day.

So do refrigerators, cars, and your washing machine, with the ubiquity of IoT.

Giant tech cos use billions of rows each day to show users products, or sell their attention as products.

Big Data became ubiquitous. And it became so common that nobody calls it that anymore.

Tools like BigQuery, Dask, and even Pandas and SQL can handle hundreds of thousands to hundreds of millions of rows (or other structures) with normal, everyday programming and commands.
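
A rough sketch of what that looks like in practice (the file glob and column names are made up): Dask reads the data in partitions, so a laptop can aggregate far more rows than fit in memory using ordinary pandas-style code.

    import dask.dataframe as dd

    # Lazily read a directory of CSVs that is too large to load at once.
    events = dd.read_csv("events/*.csv")

    # Build the aggregation; nothing is computed yet.
    daily = events.groupby("date")["value"].sum()

    # Execute out-of-core, partition by partition.
    print(daily.compute())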


> A laptop hard disk is now capable of holding databases with tens of millions of rows.

> A sensor on a human wrist collects and stores thousands of data points each day.

I feel we have a very different view of what comprises big data :)


Yeah, I work with hundreds of GBs of data every day. I have worked on a dataset with 40 million images in the past. I am also aware that OpenAI, Google AI, etc. train on billions of images. Internally, I am aware of Amazon, Google, and Meta handling really large datasets.

But that is now.

If you are old enough you must remember the early years of big data hype. The bar was not far from millions or tens of millions of rows.

Re: sensors on Fitbits, I thought everyone would read between the lines and consider that hundreds of thousands of these devices sold every year (every month?) will definitely amount to "big data".

Either these companies are plainly hoarding all of it and running some kind of analysis, or maybe they are doing federated learning. From the companies' standpoint, yes, it is big data.


For a blast from the past, I can really recommend Tanenbaum's Computer Architecture to younger people. Great book for understanding how a computer works, and the practical examples he gives really illuminate how far we've come!


It's just considered "data" these days. We just look at the Vs of the data and adjust based on those. High velocity? Do X. High volume? Accommodate Y. High variety? ... The other side of things is that the underlying data quality often had tons of issues, so there's been a lot of focus on data observability (which isn't sexy at all).

Still tons of folks out there using Hadoop (ew), Snowflake, etc. New technologies coming out include things like Trino, Apache Iceberg, etc. So it's there ... just no one cares about the moniker ... just getting things done.


It's simply become the norm. Companies store and analyze lots of data all the time. It's no longer special but simply table stakes. Look at the valuations of Snowflake and Databricks.


I disagree. Big data came associated with a new wave of algorithms. Too big to handle? Use new algorithms, maybe not 100% accurate, but they can handle the load. And streaming data as opposed to static data.

There are a lot of approaches like Change Data Capture (CDC) or HyperLogLog - but the norm? Far from it.
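
To make the "not 100% accurate but can handle the load" idea concrete, here is a toy HyperLogLog-style sketch in Python (simplified: no small/large-range corrections), which approximates a distinct count in a few kilobytes instead of storing every key:

    import hashlib

    P = 14                      # 2^14 registers, roughly 16 KB
    M = 1 << P
    registers = [0] * M

    def add(item: str) -> None:
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        bucket = h >> (64 - P)                    # first P bits pick a register
        rest = h & ((1 << (64 - P)) - 1)
        rank = (64 - P) - rest.bit_length() + 1   # leading zeros + 1
        registers[bucket] = max(registers[bucket], rank)

    def estimate() -> float:
        alpha = 0.7213 / (1 + 1.079 / M)
        return alpha * M * M / sum(2.0 ** -r for r in registers)

    for i in range(1_000_000):
        add(f"user-{i}")
    print(round(estimate()))    # close to 1,000,000, within a few percent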

I think the marketing BS fell out of fashion when every database designer became a data scientist, but that's another issue.


Big Data was about being able to store and query data beyond the limits of a single machine or existing database. The point being to store as much data as you possibly can and then extract value out of it. That grew out of Hadoop/MapReduce which let you cheaply store and access data that doesn't fit into one machine. Streaming was not part of the initial marketing pitch.
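
For anyone who never touched it, the MapReduce model itself is simple; here is a toy single-process illustration (word counting, the canonical example) of the map / shuffle / reduce phases that Hadoop distributed across machines:

    from collections import defaultdict

    docs = ["big data is big", "data about data"]

    # Map phase: emit (word, 1) for every word in every document.
    mapped = [(word, 1) for doc in docs for word in doc.split()]

    # Shuffle phase: group the emitted values by key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Reduce phase: aggregate each group, here by summing the counts.
    counts = {word: sum(values) for word, values in groups.items()}
    print(counts)   # {'big': 2, 'data': 3, 'is': 1, 'about': 1}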

That said, if you want to do streaming nowadays, you just integrate with Segment. If you want to track your database then you can dump data using Fivetran. If you want to track client events in excruciating detail then you can use Fullstory/Heap to do so in real time. That's all now table stakes for any company and outsourced to those platforms.


I'm non-technical, on the business/strategy side of things at a tech company. I have no idea what any of those things are, but I interact with my company's data lake/warehouse. I don't even know the right term for it, but it's the source of truth for all reports, dashboards and presentations. I don't know what they used before this, but I imagine it was a bit more painful to use.


Those algorithms and improvements in large-scale data processing got bundled away into a platform/infra layer that a developer or user interfaces with, unaware of what's going on in the background to produce the results they want.


Most companies probably realized that they don't have Big Data problems because they only have a limited amount of data which you actually can process in an acceptable amount of time on a single Postgres instance. Distributed data processing has a huge upfront tax and you really only want to be doing it if the data set is enormous.
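
To illustrate (table and column names are made up): a plain aggregate on a single Postgres instance is often the entire "pipeline" a business actually needs.

    import psycopg2

    conn = psycopg2.connect("dbname=analytics")
    with conn, conn.cursor() as cur:
        # Monthly revenue, straight from the transactional database.
        cur.execute("""
            SELECT date_trunc('month', created_at) AS month, sum(amount)
            FROM orders
            GROUP BY 1
            ORDER BY 1
        """)
        for month, total in cur.fetchall():
            print(month, total)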

I guess it is similar to other technologies which most companies or developers would really never need due to their limited scale, like distributed databases, NoSQL or microservices: it is interesting technology, and engineers would like to get their hands on it because that's what the big boys play with, even if they don't really need it. In the meantime, the industry hypes it because the technology is difficult, so they know they can make money doing consulting.

I'm not saying that it is not useful technology; I work at a company where we had the need to go from Postgres to "Big Data" tooling. But for tons of businesses it just doesn't make any sense. And even in our case, one of the questions I ask most frequently is: what business decision are you taking based on processing this enormous amount of data? Can we not take the same decision based on less data?


It’s been the same experience for me. Until you get into petabyte levels of data, a single replicated and vertically scaled pg is probably going to be just fine. Quite a few orgs probably realized that eventually, after sinking $MM into some Hadoop/Spark setup in the mid-2010s.

If what you’re doing is

1. Easily parallelizable

2. CPU intensive

3. In 10’s of petabytes or more

Then one of these machine-gun-like setups makes sense in 2022. Otherwise YAGNI (you aren’t gonna need it)


"For Basecamp, we take a somewhat different path. We are on record about our feelings about sharding. We prefer to use hardware to scale our databases as long as we can, in order to defer the complexity that is involved in partitioning them as long as possible — with any luck, indefinitely."

— Mark Imbriaco, 37Signals

These quotes are from 2009 and 2010, and yet here most of us are in 2022, having learned the lesson the hard way over the last decade that there is no refuting this simple logic. I'll add my own truism: All else being equal, designing and maintaining simpler systems will always cost less than complex ones.

quote references:

https://signalvnoise.com/posts/2479-nuts-bolts-database-serv... http://37signals.com/svn/posts/1509-mr-moore-gets-to-punt-on...


Don't forget evolution: businesses do change. A simple system is far easier to evolve than a complex one, too.


In addition to the skeptical comments, I think infrastructure and best practice also caught up such that what used to be big data is not so big anymore.

Storing data on S3 or using BigQuery removes a lot of the challenges compared to doing this stuff in the data centre. You then also have services such as EMR, Databricks and Snowflake to acquire the tooling and platforms as IaaS/SaaS. The actual work then moves up the stack.

Businesses are doing more with data than ever before and the volumes are growing. I just think the challenge moved on from managing large datasets as a result of new tooling, infrastructure and practices.


If you used Big Data tech (Hadoop/YARN/Spark) when you didn't actually have big data (petabytes), it was slower than columnar databases, so the shine wore off.


This. The falling cost of storage and increasing speed of SSDs mean that for most use cases a column store database is significantly faster and cheaper.

Plus people started wising up to COST.
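
A tiny sketch of why the columnar layout wins for analytics (file and column names are made up, and this assumes pandas with pyarrow installed): reading one column of a Parquet file touches only that column's bytes, not the whole table.

    import pandas as pd

    # Write a wide table once, in the columnar Parquet format.
    df = pd.DataFrame({
        "user_id": range(1_000_000),
        "revenue": [1.0] * 1_000_000,
        "notes":   ["..."] * 1_000_000,
    })
    df.to_parquet("events.parquet")

    # Only the 'revenue' column is read back from disk for this query.
    revenue = pd.read_parquet("events.parquet", columns=["revenue"])
    print(revenue["revenue"].sum())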


Yeah. Companies don't like it when their expensive fancy new Hive/Hadoop cluster takes longer than their existing Oracle or SQL Server DB to run even a moderately complex query that's core to their business.

For some reason there are ridiculous levels of FOMO in executive ranks, so any new trend is something they need to jump on, as if it will be what keeps their company around in 10 years. The result is fad-jumping, which I've seen happen from Big Data to ML to Blockchain, costing companies millions that could have been better invested in their own products or offerings and actually competing better. It's a really expensive education for leadership IMO.


It might sound trite, but we got "big disk", "big memory", "big cpu" and "big gpu" instead.

It's crazy how much you can do with one machine these days. Hence you often just have "data". And then snowflake/bigquery/redshift if it literally can't fit on a machine (which is rare).


Not to mention big compression and big vectorization. I'm right this minute messing around with a trillion+ row dataset in ClickHouse. It runs fine on a VM with 36 vCPUs and 1.8 TB of storage. (AWS c5.9xlarge instance type, EBS gp2 storage)
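
For what it's worth, querying that kind of dataset from Python is a single call. A hedged sketch using the clickhouse-driver package (the table and column names here are made up):

    from clickhouse_driver import Client

    client = Client(host="localhost")

    # Aggregate events per day; ClickHouse does the heavy lifting server-side.
    rows = client.execute(
        "SELECT toDate(ts) AS day, count() FROM events GROUP BY day ORDER BY day"
    )
    for day, n in rows:
        print(day, n)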


c5.9xlarge -> 72 GiB memory + 36 vCPUs (EBS storage) - $1.53/hour on-demand (N. Virginia).


It's not very expensive, as your note implies. This is probably all you need for most analytic projects. Well, plus a couple of replicas in case the first one catches on fire.


If you need big data, the thing you are looking for is small and the effect size of what you are optimizing will also be small.


Very well put.


> Was it just a hyped up buzzword?

Yes.

I can't tell you how many meetings I've been in where someone was pitching a big data idea and the meeting ended when we all realized that if it fits on a $50 thumb drive it isn't big data.


Data may be the new oil (which I don't agree with), but it looked about like this: https://en.wikipedia.org/wiki/Petroleum_industry_in_Ohio#/me...

You can call data the new oil when someone invades a country to secure a data center.

I don't think anything fell out of favor, and things are a long way from data being "your data only", although you have been given some rights in that regard.

Nothing happened to it. Big data always represented pushing the boundaries of what could be done when dealing with large amounts of data. After a while the technology matured to the point where working with large datasets just became something you did. There was a lot of hype to it and many organizations unnecessarily went along for the ride. It's also a balance between current technology and the economics of compute, storage, networking, etc. As the balance changes, what and how you do things also changes.


> You can call data the new oil when someone invades a country to secure a data center

State-sponsored hacking for purposes of data exfiltration has been going on for years.


... and data leaks are the new oil spills ...


Big Data was about the giddy excitement of being able to run some fancy predictive model on a large amount of data and get some sort of incredible benefit. Now most organizations have tried it. The few that actually benefit now take it for granted. The rest have moved on, although they still have a team of data engineers babysitting a legacy Hadoop cluster.


Under most circumstances, there was no "big data" in the first place, and many businesses discovered that you can't magically derive insight, create value, build features, solve problems, etc. by performing aggregation and data science on most forms of data, especially if that data is not "big." No one can even define what "big" means. How many "rows" of data do you need for it to be "big"? How many "columns" per record? How many relationships both implicit and explicit? How fast does that data change or grow? How structured or unstructured? It's entirely a value judgment that many engineers and managers had no business determining as being "big."

The true "big data" became so ubiquitous and accessible that there became no reason for anyone to care about it outside the bubble of Silicon Valley. It's just data, and really was all along.


Well, with the introduction of Kubernetes as a platform and other cloud solutions, most "big data" just became "data".

It's amazing to see that nowadays the persistent volume claim used for logging is, on average, much bigger than the average dedicated machine was about 10 years ago.


You definitely shouldn't be storing logs on giant PVCs; they should be ingested into a log aggregator.


Yes, of course, but what I meant is that it's not uncommon to see Elasticsearch instances which themselves need huge PVCs, etc. I was just abbreviating to make a point =)

What I was trying to say is just that the scale of data we work on has changed, and hence "big data" became "data"; what is now considered "big data" really would have been called insane just 10 years ago.


You wrote, "And many have talked about data being the ‘new oil’ when really it should be reframed as radioactive waste."

It's true that privacy regulations have made personally identifiable information (PII) into something that is challenging to store, like radioactive waste.

But most of the world's big data is not PII. For example, the huge amount of data being produced by modern telescopes and particle physics labs is about things like stars and subatomic particles, not people.

The world has less than 8e9 people, but there are around 1e11 stars in our galaxy, and there are more than 1e11 galaxies in the observable universe.


The buzzword had many definitions, and with time, people realized that what they are dealing with is not BIG data, but just data.

People tried to define big data in terms of the size of the data set. The best definition of big data I've heard is "a data storage and/or processing system that cannot handle the amount of data on one physical machine and needs distributed storage and/or processing".

That's a lot of data. Most people and companies are not dealing with big data.

Kind of like everything being "blockchain" at one point. Eventually people realized that the word has a specific meaning that does not apply to many things.


If the comments here are anything to go by, “Big Data” has simply become “data”.


These days I hear a lot of "AIML"

It doesn't make much sense to me, as I've never seen anyone use anything you'd find in an AI book that you wouldn't also find in a machine learning book.

For big data, I think the terminology waned, but data engineers internalized the desire to scale everything they make to handle big data. So data engineering teams are still using things like Spark (or Databricks) even if their datasets aren't big enough to need that.


A cursory search of DDG News & Google News reveals "big data" is still a widely used buzzword in headlines. I don't think it went anywhere.


The goalposts moved a bit. We now talk about privacy, since there is at least something we could potentially accomplish there. Fighting big data is impossible; corps are collecting more and more of it and we fall for it. We are giving it away by using free products that are backed by advertising.


A new one is born: extreme data.

For example: https://www.horizon-europe.gouv.fr/extreme-data-mining-aggre...

But most people work on small to medium data.


It was just a hyped-up buzzword, and new ones have been substituted now. "Machine learning" had broader appeal; many places didn't have that much data, and the ones that did have "big" data largely didn't find much signal in the noise even with AI/ML tools.


- hyped buzzword

- catch-all excuse to record everything forever without having an idea how to use that data

- actually hard problems


The Big Data Problem in a nutshell: the more data you have, the easier it is to draw wrong conclusions from it.

Otherwise known as "you tend to find what you're looking for": hidden biases in the query will ignore data that doesn't support it.
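
A quick way to see the effect (pure noise, made-up feature count): with enough columns to search through, something always looks predictive by chance.

    import numpy as np

    rng = np.random.default_rng(0)
    target = rng.normal(size=200)               # the metric we want to "explain"
    features = rng.normal(size=(200, 5000))     # 5000 columns of pure noise

    # Correlate every noise column with the target and keep the best one.
    corr = np.array([np.corrcoef(features[:, j], target)[0, 1]
                     for j in range(features.shape[1])])
    best = int(np.abs(corr).argmax())
    print(f"feature {best} 'predicts' the target, r = {corr[best]:.2f}")
    # A respectable-looking correlation, found purely by searching hard enough.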


Big Data has always been a marketing paradigm. We've always had lots of data; we just didn't process it for business intelligence before.

"The advances in computing have made it easier to accomplish tasks that were completely unnecessary before"


Everyone who has data is still doing it, the buzzword just went out of fashion. Now it’s data science, analytics, ML eng. What truly ended is “big data” meaning “we’ll come take your logs and magically transform your business.”


Two things:

1. It morphed into ML, as the dirty secret of most ML projects is that they're predominantly about data. Put another way, you can't derive a model from nothing.

2. You mentioned privacy scandals, but things like CCPA + GDPR legitimately did make larger corporations pause and ask "Do we actually need this information?" where prior to that everyone was a hoarder "just in case"


Now people throw everything in “data lakes”. It’s already so complicated to handle the ingestion that they don’t even want to try to do anything with that much data.


We have configured our lakes to flow into a data deltas, data gulfs or sometimes directly into a data ocean as we see fit.

We solidify our known good historic data into immutable data icebergs. We distill the data ocean into data vapor, allow it to condense in the cloud where it precipitates onto lakes, deltas, gulfs, and users as data rain.

Finally some small % of the data vapor escapes into our data vacuum where it drifts and mingles with other matter, eventually accumulating into a new data moon, data planet, or data star.

My new startup, Biggest Data, is looking for funding now.


Compliance and web3 arrived.

A lot of companies didn't even need to go the Hadoop route. CSVs, Jupyter notebooks and SQL databases are very powerful tools for most companies.


Big data is just data now.

We nonchalantly spin up massive 1 TB+ RAM clusters to process our data without really admiring how much data it actually is.


Here's a rule of thumb: anything with the name "big" before it is bad.

Big oil. Big bad. Big lie. Big brother. Big apple. Etc...


We still have teams doing work that would probably qualify, but it's definitely not the craze anymore.


I think that "big data" has simply become "data".


It's parked next to the information superhighway.


Never really went away, never really arrived: https://www.youtube.com/watch?v=pcBJfkE5UwU


It was consumed by AI


See “Serverless”, “cloud native”, “zero trust” etc

Hype.


Zero trust isn't hype.


I'll bite: name one use case that has taken off?


Unless it's changed in the ~6 years since I left, Google uses zero trust (via BeyondCorp) for everything internally.


But that hasn't been replicated or "productized" the way the hype suggested. Google did a really good job there, but was in control of all their infra and had very smart people.


Basically.


As a consultant, my executive clients simply aren't helped by phrases such as big data. A big pool of garbage is still garbage. Instead, I've dropped back into the traditional vocabulary of data and analytics, and I talk about statistics and deep learning as a goal, and data engineering as the means.



