Databricks open-sources Delta Lake to make data lakes more reliable (techcrunch.com)
142 points by solidangle on April 24, 2019 | hide | past | favorite | 54 comments



There's a lot of confusion around data lakes. One source of confusion is that "data lake" versus "data warehouse" is often presented as a choice, where you can have either:

1. A data lake, where all data is stored in its native format (CSV, JSON, ...), in an object store (S3, GCS, ...), with the schema defined on read (Hive, Presto, ...).

2. A data warehouse, where all the data is organized into highly structured tables (star schema) in a commercial database (Snowflake, Redshift, ...).

This is a false choice! Modern data warehouses, particularly Snowflake and BigQuery, are fully capable of storing semi-structured data.

Furthermore, you do not need to curate your data into a star schema before loading it. The ideal way to set up a modern data warehouse is to establish a "staging" schema that matches the source, and then transform that data into a star schema or data marts using SQL. In this scenario, your "data lake" and "data warehouse" are just two different schemas within the same database.
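
To make that concrete, here's a toy sketch of the staging-then-transform pattern. It uses sqlite3 so it's self-contained; in practice "staging" and the modeled layer would be two schemas in Snowflake/BigQuery, and the table/column names here are made up:

    import sqlite3

    con = sqlite3.connect(":memory:")

    # "Staging" table mirrors the source as-is (hypothetical orders feed).
    con.execute("CREATE TABLE staging_orders (order_id TEXT, customer TEXT, amount_cents INTEGER, created_at TEXT)")
    con.execute("INSERT INTO staging_orders VALUES ('o1', 'acme', 1999, '2019-04-01')")

    # Modeled layer: reshape staging rows into an analytics-friendly table with plain SQL.
    con.execute("""
        CREATE TABLE fct_orders AS
        SELECT order_id,
               customer,
               amount_cents / 100.0 AS amount_usd,
               DATE(created_at)     AS order_date
        FROM staging_orders
    """)
    print(con.execute("SELECT * FROM fct_orders").fetchall())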

There are still some scenarios where it makes sense to build a data lake in addition to a data warehouse, primarily future-proofing. I wrote a blog post where I tried to outline these scenarios: https://fivetran.com/blog/when-to-adopt-a-data-lake


Has anyone written about privacy implications of data lakes and data warehouses? The Extract in ETL is usually supposed to filter out private data, but if instead all of the raw native data is dumped into a data lake, what ensures that data is handled with the same care as the individual systems that normally handle the data? What stops some random business analyst from running individual or aggregated queries that would be contractually or legally forbidden?


The solution is to divide your Data Lake into different zones with access control, so that users can only access what they're allowed to. That said, it's a lot of work to do this properly, so it's often neglected.


That is a great question, we have written about this too: https://fivetran.com/blog/how-fivetran-helps-you-stay-compli...

Short version: you need to identify data that absolutely must not be retained and either block it or hash it as close as possible to the source. This means you still have to do a little transformation before you load into your data lake/warehouse.

Second, you need to identify the soft constraints and enforce them with the access controls of your data warehouse. This is (another) reason why you should use a relational database like Snowflake or BigQuery as your primary data store, and treat any nonrelational data lake like Parquet-in-S3 as a backup/staging area for 1 or more relational stores.
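
To make the first point concrete, here's a minimal sketch of pseudonymizing an identifier before it ever lands in the lake/warehouse. The field names, the keyed-hash choice, and the salt handling are my assumptions, not a description of any particular product:

    import hashlib
    import hmac

    SECRET_SALT = b"rotate-me-and-keep-in-a-secrets-manager"

    def pseudonymize(value: str) -> str:
        # Keyed hash: the same email always maps to the same token (joins still
        # work), but the raw value never reaches the warehouse.
        return hmac.new(SECRET_SALT, value.encode("utf-8"), hashlib.sha256).hexdigest()

    record = {"email": "jane@example.com", "plan": "enterprise", "mrr": 499}
    record["email"] = pseudonymize(record["email"])
    print(record)  # email is now an opaque token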


This! And the two can coexist as well -- Data lake for everything (partitioned appropriately) and modern data warehouse for the stuff from your data lake that you want to curate (of course with partially structured intake first).


Interesting. But what is delta.io then, as you are storing data not in its pure native format (CSV, JSON, TSV, ...) but in parquet files? Or does it even matter?


We (data curation lab at Univ of Toronto) are doing research in data lake discovery problems. One of the problems we are looking at is how to efficiently discover joinable and unionable tables. For example, find all the rental listings from various sources to create a master list (union); or find tables such as rental listings and school districts that can be used to augment each other (join). The technical challenges in finding joinable and unionable tables in data lakes involve the following: (1) the data schema is often inconsistent and poorly managed, so we can’t simply rely on that schema; and (2) the scale of data lakes can be on the order of hundreds of thousands of tables, making a content-based search algorithm expensive. We came up with some solutions based on data sketches, with several published papers [1,2,3]. The python library “datasketch” was a byproduct of this work.
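
If anyone wants to play with the idea, here is a tiny sketch of how datasketch can surface candidate joinable columns with MinHash + LSH (toy data; the papers above use containment-based search such as LSH Ensemble and JOSIE rather than plain Jaccard):

    from datasketch import MinHash, MinHashLSH

    def column_minhash(values, num_perm=128):
        m = MinHash(num_perm=num_perm)
        for v in values:
            m.update(str(v).encode("utf-8"))
        return m

    # Made-up column value sets from tables in a lake.
    columns = {
        "listings.postal_code": ["M5V", "M4C", "M1B", "M6G"],
        "schools.postal_code":  ["M5V", "M1B", "M9N", "M6G"],
        "listings.price":       ["1200", "1850", "2400", "990"],
    }

    sketches = {name: column_minhash(vals) for name, vals in columns.items()}
    lsh = MinHashLSH(threshold=0.5, num_perm=128)
    for name, m in sketches.items():
        lsh.insert(name, m)

    # Columns whose value sets look similar to schools.postal_code are join candidates.
    print(lsh.query(sketches["schools.postal_code"]))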

Many challenges remain though, and we would like to explore some of the more pertinent ones. In fact, we are conducting a survey to understand the current state of data lakes in industry and the challenges experienced. If you're interested in learning more, see what we came up with here: https://www.surveymonkey.com/r/R7MYXSJ - would love to see what the HN community thinks about the current state of data lakes.

[1] http://www.vldb.org/pvldb/vol9/p1185-zhu.pdf [2] http://www.vldb.org/pvldb/vol11/p813-nargesian.pdf [3] http://www.cs.toronto.edu/~ekzhu/papers/josie.pdf


I don't really understand the concept of Data Lake, and wikipedia isn't helping much... is it just a buzzword for a collection of data stores?


It's basically a reliable, horizontally scalable object store + a collection of data storage and processing engines.

So amongst the cloud providers, AWS calls a combination of S3 + Glue + Athena (for example) a "data lake", where S3 is the object storage which can store data in various formats, and Glue and Athena are used to transform/process/query the data. See a more detailed article/guide here: https://aws.amazon.com/lake-formation/
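
For a feel of what the query side looks like from code, here's a rough sketch against Athena via boto3 (bucket, database, and table names are placeholders):

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # Athena reads the files straight out of S3, using the Glue catalog for the schema.
    resp = athena.start_query_execution(
        QueryString="SELECT event_type, COUNT(*) FROM raw_events GROUP BY event_type",
        QueryExecutionContext={"Database": "my_lake_db"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/queries/"},
    )
    print(resp["QueryExecutionId"])  # poll get_query_execution() for status, then fetch results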

If you didn't want to put anything into the cloud and keep all your services on-premise, a local Hadoop cluster could be a data lake, for example using HDFS + Zookeeper + YARN + Hive.

[This is a huge over-simplification because it's late and I really should be going home :)]


To storage and compute I would also add metadata; neglecting it is why most data lake projects end up as data swamps.

When people just dump data into their storage, they end up having a really hard time sharing it within their organization.


I am personally inclined to lump this in with the other enterprisey buzzwords that are generally safe to avoid. From my perspective, the 'data lake' is the actual collection of business systems and your ability to extract useful information from them.

Trying to come up with some unified standard or common API for the extraction, transformation and implementation of useful data from a heterogeneous collection of systems sounds like a problematic task at best.

The other, more likely, interpretation of 'data lake' is that it is the staging ground between your ability to do the above stated activity and other downstream systems interested in the data. If the idea is that you are creating the actual normalization layer, I feel like this is still more in the realm of SQL/ETL, as there really isn't any other direction to move that would reduce your entropy in a valuable way (relative to your time invested).

SQLite or Postgres is usually the right choice. This simple rule can help you avoid a lot of pain. Once you have convinced everyone that Postgres is to be used, the only other real barriers are your ability to get a data transport to each business system and the authoring of some SQL scripts. The workload of building a SQL representation of any particular business system is fairly predictable once you break it down to entity-relationship abstractions. Also, using a language with powerful class/object, serialization and database support such as C# or Java can cut your workload by orders of magnitude if you choose a SQL architecture. In C# for instance, you can just write POCOs and use Entity Framework to build out all of the SQL for you. This is not the most performant option by a long shot, but it can get you going incredibly quickly on a first iteration.
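
For a non-C# analogue, the same "declare plain classes, let the ORM emit the DDL/SQL" idea looks roughly like this in Python with SQLAlchemy (a sketch only; table and column names are illustrative):

    from sqlalchemy import create_engine, Column, Integer, String, ForeignKey
    from sqlalchemy.orm import declarative_base, relationship, Session

    Base = declarative_base()

    class Customer(Base):
        __tablename__ = "customers"
        id = Column(Integer, primary_key=True)
        name = Column(String, nullable=False)
        orders = relationship("Order", back_populates="customer")

    class Order(Base):
        __tablename__ = "orders"
        id = Column(Integer, primary_key=True)
        customer_id = Column(Integer, ForeignKey("customers.id"))
        amount_cents = Column(Integer)
        customer = relationship("Customer", back_populates="orders")

    engine = create_engine("sqlite:///:memory:")  # swap for a postgresql:// URL in practice
    Base.metadata.create_all(engine)              # the ORM generates the DDL for you

    with Session(engine) as session:
        session.add(Customer(name="acme", orders=[Order(amount_cents=1999)]))
        session.commit()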


> Once you have convinced everyone that Postgres is to be used

... the data has already disappeared.

Systems get shut down and replaced. Operational systems may discard history.

By the time you get a fully operational data warehouse set up, it may be too late to preserve the data.



Awesome link!

The key line for me:

"The data lake stores raw data, in whatever form the data source provides."

The emphasis on"raw" was his, not mine.


Thanks from me, too :-)


Going through the process of defining and planning to build a data lake with my team right now.

A data lake is, like you said, a collection of data stores, and the industry as a whole hasn't defined it very well past that.

IMHO - A (useful) data lake is a platform that can support any type of data store in any format (be it relational, flat, graph, document, etc) and offer a way to consistently query it. A (useful) data lake does much in the way of managing metadata about those data stores that makes it easy to consume.


I have heard these promises of a single query language for different data formats many times, and the results (foisted upon developers and customers) have always been far more painful than just sticking to SQL.


To be fair, most data stores have an SQL layer available, even if it's not optimal.


Plug: We help discover, catalog and search for metadata and lineage in your data lake. We support all major hadoop distros and cloud platforms. Check us out! https://www.waterlinedata.com


AFAIK, the data lake is the next step in the evolution of the data warehouse. Instead of storing data in a data mart/data warehouse, the concept of the data lake (as a design pattern) is that you don't need schemas up front (support for unstructured data), and you get better support for auditing, data governance/democratization, and schema (?) evolution.


I wouldn't say you don't have schemas; rather you have schema-on-read instead of schema-on-write, and you use an extract-load-transform pattern instead of extract-transform-load. The data is replicated as-is into the data lake and only then do you figure out what to do with it.


Yes, in my mind this is the key of a data lake. Take all your raw data and store it somewhere, then provide ways for people to access and query the raw data.

This means ingestion is faster (no transformation) and you don't throw away any data that you might want later. If multiple teams want to query the same data in different ways they have the ability to do so. And ideally it prevents data silos because everyone can stuff their raw data into a master data lake and each team has access to all the data but is responsible for doing the work to make it look like they want.

Reality of the above obviously doesn't always match the theory but schema-on-read/ELT are the easiest ways to handle the above. Typically this involves some kind of Hadoop-style technology, like Hive or SparkSQL for SQL-based querying, Spark for non-SQL, etc. But you've always got the raw data and can go back and re-ELT it from the data lake if your needs change.
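
A minimal PySpark sketch of that flow (paths and field names are placeholders): land the raw JSON as-is, infer the schema at read time, expose it to SQL, and write a curated table back:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("elt-example").getOrCreate()

    # "Extract-Load" already happened: raw files were copied into the lake untransformed.
    raw = spark.read.json("s3a://my-lake/raw/clickstream/")  # schema inferred on read

    # "Transform" happens later, per team, against the same raw data.
    raw.createOrReplaceTempView("clickstream_raw")
    daily = spark.sql("""
        SELECT to_date(event_ts) AS day, COUNT(*) AS events
        FROM clickstream_raw
        GROUP BY to_date(event_ts)
    """)
    daily.write.mode("overwrite").parquet("s3a://my-lake/curated/clickstream_daily/")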


I don't know much about the strict definition, but that's how I use them. I have had several clients that want to analyze data they didn't capture in their schema. I'd say: disk is cheap. Throw everything in there (medical records, events, etc.). If we need it later, we'll fish it out. Ugly, but simple.


Note that a data lake does not necessarily replace the data warehouse, but rather often complements it. As such, you store your raw data from various sources in a centralised data store (Hadoop-like, NoSQL, etc.). From there, you prune, clean, select, and potentially aggregate data that you would like to provide in a quality-controlled way to your business users, in a data warehouse. This data warehouse most often will be a more traditional relational data store (usually some flavour of SQL database), which allows users to select data from a curated, pre-selected slice of the overall data stored in the data lake, and which enables easier integration with common reporting tools, whether more traditional standard reporting tools or self-service BI tools.


Practical definitions:

Data lake = place to store unstructured raw data. Usually as files in an object store or Hadoop/HDFS cluster. Analyse with data processing/SQL frameworks. Schema may be part of the data (parquet, avro) or on-read (raw csvs or json modeled into tables).

Data warehouse = place to store processed (semi)structured data. Usually in a distributed columnstore database. Analyze with SQL. Schema is pre-defined by the tables. Usually for smaller, faster, real-time queries or as a cache in-front of a data lake.

Some cloud data warehouses like BigQuery and Snowflake can also query unstructured files and even run on top of the object store so the boundaries are getting blurred. Will probably converge at some point in the future.


The number of definitions here suggests it's not well defined; one more :)

A data lake is when you store your data in object storage (S3, GCS etc) as opposed to a filesystem (HDFS) or some indexed datastore (Redshift etc).

This potentially saves a lot of money because you can scale compute separately from storage, and object storage can be extremely cost effective compared to running a distributed filesystem.

Where the two overlap is when you store something like parquet in object storage: the file format is somewhat indexed already, so you spend a bit more money preprocessing it but save a lot of money querying it.

I think whether it's "raw" json or log files or preprocessed parquet doesn't really determine whether it's a data lake or not.
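
To illustrate the parquet point: a columnar reader only has to pull the columns (and row groups) it needs instead of scanning whole files, which is where most of the query savings come from. A tiny pyarrow sketch, with a placeholder path and column names:

    import pyarrow.parquet as pq

    # Only the requested column chunks are read, not the entire file.
    table = pq.read_table("events/part-0001.parquet",
                          columns=["event_type", "user_id"])
    print(table.num_rows, table.schema.names)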


But interestingly delta.io only supports HDFS currently.


It is a fancy name for a data swamp. :) On a more serious note, data warehouses are nowadays being called data lakes; I'm not sure why people prefer this term, maybe because some folks like to think that it is OK to just throw all the data in one place and it will be fine to query it like that. I still prefer data warehouse.


Some call it "data dump" too.


It kind of makes sense for a cloud to form above a lake.

On another tangent, I wonder if it would be possible to make an un-warrantable cloud? Would it be feasible to create a distributed cloud within the borders of the US or another industrialized country which doesn't actually exist at a particular address?


Other efforts in improving the parquet datasets on cloud storage world:

https://github.com/apache/incubator-iceberg

https://github.com/apache/incubator-hudi

Happy to see Delta go open source.


They tried to keep it closed and sell it as a premium service, but it looks like they need help from the open source community to make the product better. Great to see. Databricks has its roots in open source (the founder created Spark) and it's great that they're still releasing a lot of open source code rather than keeping everything private.


What are the other alternatives for data lakes that can be used (both open source and closed)?


Apache Iceberg is probably the closest product to what Databricks is open sourcing, but none of these products are everything that's needed for data lake management.

What these products do is make it as easy to use decoupled storage and compute as your analytics system as it would be to use a fully managed analytics DBMS system.


Yes, when I heard about Delta I thought the same. Would love to see a comparison between Delta and Iceberg. I wonder if Ryan Blue is on HN.


You can see some related but not directly comparable efforts in Hudi: https://eng.uber.com/uber-big-data-platform/

Or https://iceberg.apache.org/

Which both keep track of data versioning and management of file based datasets on object stores.

Most people today have very ad hoc approaches to handling data versioning and lineage on Hadoop datasets.


Throwing my own project's hat in the ring, Pachyderm[0] is open source, written in Go and built on Docker and Kubernetes. It version-controls your data, makes modifications atomic and tracks data lineage. You implement your pipelines in containers, so any tool you can put in a container can be used (and horizontally scaled) on Pachyderm.

[0] https://github.com/pachyderm/pachyderm


DIY is probably the biggest alternative. I think Databricks saw so many customers solving the same problem and decided to make a service of it. Great idea and a bit surprising that it hasn't continued as an Enterprise feature that they only offer customers.


Hive ACID solves a similar set of problems with a similar design (base Parquet/ORC files + delta files in the same format). It's a bit more "managed" in that compaction is automatic and it supports features like column-level access control, which aren't possible when you're executing an end-user's Scala code against a directory of files.

https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/usi...



I don’t think Amundsen is comparable. Amundsen is essentially a data catalog that has ambition to get into master data management.

Delta here is adding features more closely associated with RDBMSes or MPP data warehouses to the big data world of Spark data pipelines over parquet data on object stores.


You're absolutely correct, thanks. Edited for accuracy.


That's what I'd be interested in, too :-)


I appreciate the thought TechCrunch put into the image representing ACID-compliant data lakes.


What I don't like about these ACID storage layers is that they reduce compatibility between different query engines. For example, Spark cannot read Hive ACID tables natively and Hive cannot read Spark Delta tables either. Then there are other tools such as Presto or Drill which can read neither.

When you use an ACID storage layer, you're kinda locked into one solution for both ETL and query, which is not nice.


Recently I was interested in learning more about data lakes: how to design and maintain them.

There is a lot of information in articles and blogs, but I prefer books as a solid source of structured and aggregated information.

Surprisingly, I found just a single proper book on the topic: https://www.amazon.com/Enterprise-Big-Data-Lake-Delivering/d...


What's the difference between a Delta Lake and Change Data Capture? Seems like in both cases you're creating a type 2 dimension against a source table.


It is a different technology entirely. CDC is just the log of changes on a relational table. Delta Lake appears to be providing more native administrative capabilities to a data lake implementation (schemas, transactions, cataloging).


With CDC you are typically looking at the change logs of a relational database.

With a (batch) Data Lake you accept as input different file types (JSON, CSV, AVRO) from various systems. Could be Hadoop systems, could be from a COBOL system on a mainframe.


The Delta here isn’t so much related to CDC. This is a replacement for the ad hoc methods people use now to keep multiple readers and writers of datasets in parquet format in agreement when using Spark’s dataframe API.


Are there other ways of implementing ACID transactions on Spark tables?


What do you mean Spark tables? Generally speaking it is a bad idea to try to combine ACID with data warehouses.


Sounds cool, but then I'd have to use Spark...


Cloud -> Data lake -> Data stream -> Data Ocean -> Cloud



