I was working in data analytics and data science a decade ago, and we stored everything, not aggregates, and pushed it all through Hadoop. I have been "out of the game" since then. What has changed that makes people say "store everything" is a new phenomenon? (Genuine question, because I am clearly missing something.)
It’s not a new phenomenon so much as it has emerged as an important shift from the status quo 20 years ago.
What’s changed in the last 10 years are the access patterns. There’s increased demand for arbitrary query access over the raw data. The most impactful technology changes have been about pushing the access layer (queries, stream and batch processing, dashboards, BI tools, etc.) down as close to the raw data as possible and making that performant. What’s fallen out of that are better MPP OLAP databases (Snowflake), new columnar formats (Parquet), and SQL as the transform layer (dbt).
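To make the "arbitrary query access over the raw data" point concrete, here is a minimal sketch, not anyone's actual pipeline. It uses Python's built-in sqlite3 as a stand-in for a real warehouse, and the table and column names (raw_events, daily_rollup, and so on) are made up for illustration. The rollup can answer the question it was built for, but an unplanned question can only be answered from the raw rows.

```python
import sqlite3


def build_demo():
    """Create an in-memory database with raw events plus a pre-aggregated rollup."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical raw event table: one row per thing that happened.
    conn.execute(
        "CREATE TABLE raw_events (user_id TEXT, country TEXT, event TEXT, amount REAL)"
    )
    rows = [
        ("u1", "DE", "purchase", 10.0),
        ("u1", "DE", "refund", -10.0),
        ("u2", "US", "purchase", 25.0),
        ("u3", "US", "purchase", 5.0),
    ]
    conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?, ?)", rows)
    # An aggregate-only pipeline would keep just this table and drop raw_events.
    conn.execute(
        "CREATE TABLE daily_rollup AS "
        "SELECT country, SUM(amount) AS revenue FROM raw_events GROUP BY country"
    )
    return conn


def purchasers_without_refund(conn):
    """An unplanned question: which users purchased and never refunded?

    The rollup has no user_id or event column, so only the raw rows can answer it.
    """
    query = """
        SELECT user_id
        FROM raw_events
        GROUP BY user_id
        HAVING SUM(event = 'refund') = 0 AND SUM(event = 'purchase') > 0
        ORDER BY user_id
    """
    return [row[0] for row in conn.execute(query)]


if __name__ == "__main__":
    conn = build_demo()
    print(purchasers_without_refund(conn))
```

The rollup here is per-country revenue, which is fine until someone asks a per-user question. Keeping the raw table and making queries against it fast is the shift described above.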
Why did SQL actually "re-emerge" as the transformation layer? I thought things first shifted from SQL to query builders inside Talend, Matillion, etc. Why SQL again now?
Probably just the emergence of dbt? I’ve only been doing ETL for a couple of years personally, but I couldn’t imagine using so much SQL in our pipelines without a framework like dbt.
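For anyone who hasn't seen the pattern, here is a rough sketch of the "SQL as the transform layer" idea. It is not dbt's API, just the general shape: each transformation is a plain SELECT, and a small runner materializes it as a table so the next SELECT can build on it. It assumes sqlite3 as a stand-in warehouse, and the model names (stg_orders, customer_totals) and the run_models helper are hypothetical.

```python
import sqlite3

# Hypothetical "models": name -> SELECT statement, listed in dependency order.
# A real framework would work out the order from references between models.
MODELS = {
    "stg_orders": (
        "SELECT id, customer, amount FROM raw_orders WHERE amount IS NOT NULL"
    ),
    "customer_totals": (
        "SELECT customer, SUM(amount) AS total FROM stg_orders GROUP BY customer"
    ),
}


def run_models(conn, models):
    """Materialize each SELECT as a table, in the order given."""
    for name, select_sql in models.items():
        conn.execute(f"DROP TABLE IF EXISTS {name}")
        conn.execute(f"CREATE TABLE {name} AS {select_sql}")


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_orders (id INTEGER, customer TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO raw_orders VALUES (?, ?, ?)",
        [(1, "a", 10.0), (2, "a", 5.0), (3, "b", None), (4, "b", 7.0)],
    )
    run_models(conn, MODELS)
    return conn.execute(
        "SELECT customer, total FROM customer_totals ORDER BY customer"
    ).fetchall()


if __name__ == "__main__":
    print(demo())
```

The transformations stay as readable SQL that runs inside the database, and the framework only handles the boilerplate around them (ordering, materialization, rebuilds).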