There's a lot of confusion around data lakes. One source of confusion is that "data lake" versus "data warehouse" is often presented as a choice, where you can have either:
1. A data lake, where all data is stored in its native format (CSV, JSON, ...), in an object store (S3, GCS, ...), with the schema defined on read (Hive, Presto, ...).
2. A data warehouse, where all the data is organized into highly structured tables (a star schema) in a commercial database (Snowflake, Redshift, ...).
This is a false choice! Modern data warehouses, particularly Snowflake and BigQuery, are fully capable of storing semi-structured data.
Furthermore, you do not need to curate your data into a star schema before loading it. The ideal way to set up a modern data warehouse is to establish a "staging" schema that matches the source, and then transform that data into a star schema or data marts using SQL. In this scenario, your "data lake" and "data warehouse" are just two different schemas within the same database.
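To make that concrete, here's a rough sketch of that setup in Snowflake-flavored SQL. All of the schema, table, and column names are made up for the example, and BigQuery would use its JSON functions rather than the VARIANT colon syntax:

    -- "Data lake" side: a staging schema that mirrors the source,
    -- keeping semi-structured payloads as-is in a VARIANT column.
    create schema if not exists staging;
    create table if not exists staging.orders_raw (
        loaded_at timestamp_ntz default current_timestamp(),
        payload   variant
    );

    -- "Data warehouse" side: a star-schema table built from staging with plain SQL.
    create schema if not exists analytics;
    create or replace table analytics.fct_orders as
    select
        payload:order_id::string          as order_id,
        payload:customer_id::string       as customer_id,
        payload:amount::number(12,2)      as amount,
        payload:created_at::timestamp_ntz as created_at
    from staging.orders_raw;

Here the staging schema plays the role of the "lake", the analytics schema plays the role of the "warehouse", and both live in the same database.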
There are still some scenarios where it makes sense to build a data lake in addition to a data warehouse, primarily future-proofing. I wrote a blog post where I tried to outline these scenarios: https://fivetran.com/blog/when-to-adopt-a-data-lake
Has anyone written about privacy implications of data lakes and data warehouses? The Extract in ETL is usually supposed to filter out private data, but if instead all of the raw native data is dumped into a data lake, what ensures that data is handled with the same care as the individual systems that normally handle the data? What stops some random business analyst from running individual or aggregated queries that would be contractually or legally forbidden?
The solution is to divide your data lake into different zones with access control, so that users can only access what they're allowed to. That said, it's a lot of work to do this properly, so it's often neglected.
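As a rough sketch of the zone idea, assuming the zones are modeled as separate schemas in a Snowflake-style warehouse or lakehouse catalog (the schema and role names here are invented; on a plain object store like S3 you'd express the same thing with bucket prefixes and IAM policies instead):

    -- One schema per zone: raw lands everything, curated holds vetted data.
    create schema if not exists raw_zone;
    create schema if not exists curated_zone;

    -- Analysts only ever see the curated zone.
    create role if not exists analyst;
    grant usage on schema curated_zone to role analyst;
    grant select on all tables in schema curated_zone to role analyst;

    -- Only the loading/engineering role can touch the raw zone.
    create role if not exists data_engineer;
    grant usage on schema raw_zone to role data_engineer;
    grant select, insert on all tables in schema raw_zone to role data_engineer;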
Short version: you need to identify data that absolutely must not be retained and either block it or hash it as close as possible to the source. This means you still have to do a little transformation before you load into your data lake/warehouse.
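Something like the following, as a hedged sketch, assuming the engine you extract with (or stage in) exposes a SHA2 function; the table and column names are invented:

    -- Run at (or as close as possible to) the source, before anything lands
    -- in the broadly accessible lake/warehouse schemas.
    select
        user_id,
        sha2(email, 256) as email_hash,  -- keep a joinable token, not the address
        country,
        signup_date
        -- ssn, date_of_birth: blocked entirely, never extracted
    from source_db.users;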
Second, you need to identify the soft constraints and enforce them with the access controls of your data warehouse. This is (another) reason why you should use a relational database like Snowflake or BigQuery as your primary data store, and treat any nonrelational data lake, like Parquet-in-S3, as a backup/staging area for one or more relational stores.
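For example, in Snowflake-style SQL (the policy, role, and table names are purely illustrative), a column masking policy plus a grant covers the "soft constraint" case:

    -- Only a designated role sees raw email addresses; everyone else gets a redacted value.
    create or replace masking policy email_mask as (val string) returns string ->
        case when current_role() in ('PII_READER') then val else '*** redacted ***' end;

    alter table analytics.dim_customers
        modify column email set masking policy email_mask;

    -- Analysts can query the table, but never the raw email values.
    grant select on table analytics.dim_customers to role analyst;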
This! And the two can coexist as well: a data lake for everything (partitioned appropriately) and a modern data warehouse for the stuff from your data lake that you want to curate (of course with partially structured intake first).
Interesting. But what is delta.io then, given that you are storing data not in its pure native format (CSV, JSON, TSV, ...) but in Parquet files? Or does it even matter?