There's a lot of confusion around data lakes. One source of confusion is that "data lake" versus "data warehouse" is often presented as a choice, where you can have either:
1. A data lake, where all data is stored in its native format (CSV, JSON, ...), in an object store (S3, GCS, ...), with the schema defined on read (Hive, Presto, ...).
2. A data warehouse, where all the data is organized into highly structured tables (a star schema) in a commercial database (Snowflake, Redshift, ...).
This is a false choice! Modern data warehouses, particularly Snowflake and BigQuery, are fully capable of storing semi-structured data.
Furthermore, you do not need to curate your data into a star schema before loading it. The ideal way to set up a modern data warehouse is to establish a "staging" schema that matches the source, and then transform that data into a star schema or data marts using SQL. In this scenario, your "data lake" and "data warehouse" are just two different schemas within the same database.
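To make that concrete, here's a rough sketch of that setup in Snowflake-flavored SQL. All of the schema, table, and column names are made up for the example, and BigQuery would use its JSON functions rather than the VARIANT colon syntax:

    -- "Data lake" side: a staging schema that mirrors the source,
    -- keeping semi-structured payloads as-is in a VARIANT column.
    create schema if not exists staging;
    create table if not exists staging.orders_raw (
        loaded_at timestamp_ntz default current_timestamp(),
        payload   variant
    );

    -- "Data warehouse" side: a star-schema table built from staging with plain SQL.
    create schema if not exists analytics;
    create or replace table analytics.fct_orders as
    select
        payload:order_id::string          as order_id,
        payload:customer_id::string       as customer_id,
        payload:amount::number(12,2)      as amount,
        payload:created_at::timestamp_ntz as created_at
    from staging.orders_raw;

Here the staging schema plays the role of the "lake", the analytics schema plays the role of the "warehouse", and both live in the same database.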
There are still some scenarios where it makes sense to build a data lake in addition to a data warehouse, primarily future-proofing. I wrote a blog post where I tried to outline these scenarios: https://fivetran.com/blog/when-to-adopt-a-data-lake
Has anyone written about privacy implications of data lakes and data warehouses? The Extract in ETL is usually supposed to filter out private data, but if instead all of the raw native data is dumped into a data lake, what ensures that data is handled with the same care as the individual systems that normally handle the data? What stops some random business analyst from running individual or aggregated queries that would be contractually or legally forbidden?
The solution is to divide your data lake into different zones with access control, so that users can only access what they're allowed to. That said, it's a lot of work to do this properly, so it's often neglected.
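As a rough sketch of the zone idea, assuming the zones are modeled as separate schemas in a Snowflake-style warehouse or lakehouse catalog (the schema and role names here are invented; on a plain object store like S3 you'd express the same thing with bucket prefixes and IAM policies instead):

    -- One schema per zone: raw lands everything, curated holds vetted data.
    create schema if not exists raw_zone;
    create schema if not exists curated_zone;

    -- Analysts only ever see the curated zone.
    create role if not exists analyst;
    grant usage on schema curated_zone to role analyst;
    grant select on all tables in schema curated_zone to role analyst;

    -- Only the loading/engineering role can touch the raw zone.
    create role if not exists data_engineer;
    grant usage on schema raw_zone to role data_engineer;
    grant select, insert on all tables in schema raw_zone to role data_engineer;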
Short version: you need to identify data that absolutely must not be retained and either block it or hash it as close as possible to the source. This means you still have to do a little transformation before you load into your data lake/warehouse.
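Something like the following, as a hedged sketch, assuming the engine you extract with (or stage in) exposes a SHA2 function; the table and column names are invented:

    -- Run at (or as close as possible to) the source, before anything lands
    -- in the broadly accessible lake/warehouse schemas.
    select
        user_id,
        sha2(email, 256) as email_hash,  -- keep a joinable token, not the address
        country,
        signup_date
        -- ssn, date_of_birth: blocked entirely, never extracted
    from source_db.users;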
Second, you need to identify the soft constraints and enforce them with the access controls of your data warehouse. This is (another) reason why you should use a relational database like Snowflake or BigQuery as your primary data store, and treat any nonrelational data lake, like Parquet-in-S3, as a backup/staging area for one or more relational stores.
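For example, in Snowflake-style SQL (the policy, role, and table names are purely illustrative), a column masking policy plus a grant covers the "soft constraint" case:

    -- Only a designated role sees raw email addresses; everyone else gets a redacted value.
    create or replace masking policy email_mask as (val string) returns string ->
        case when current_role() in ('PII_READER') then val else '*** redacted ***' end;

    alter table analytics.dim_customers
        modify column email set masking policy email_mask;

    -- Analysts can query the table, but never the raw email values.
    grant select on table analytics.dim_customers to role analyst;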
This! And the two can coexist as well: a data lake for everything (partitioned appropriately) and a modern data warehouse for the stuff from your data lake that you want to curate (of course with partially structured intake first).
Interesting. But what is delta.io then, given that you are storing data not in its pure native format (CSV, JSON, TSV, ...) but in Parquet files? Or does it even matter?