What are the other alternatives for data lakes that can be used (both open sourc...

zjaffee · on April 24, 2019

Apache Iceberg is probably the closest product to what Databricks is open sourcing, but none of these products covers everything that's needed for data lake management.

What these products do is make decoupled storage and compute as easy to use as your analytics system as a fully managed analytics DBMS would be.

groodt · on April 24, 2019

Yes, when I heard about Delta I thought the same. Would love to see a comparison between Delta and Iceberg. I wonder if Ryan Blue is on HN.

mobileexpert · on April 24, 2019

You can see some related but not directly comparable efforts in Hudi; see https://eng.uber.com/uber-big-data-platform/

Or https://iceberg.apache.org/

Both keep track of data versioning and manage file-based datasets on object stores.

Most people today have very ad hoc approaches to handling data versioning and lineage on Hadoop datasets.

jdoliner · on April 24, 2019

Throwing my own project's hat in the ring, Pachyderm[0] is open source, written in Go and built on Docker and Kubernetes. It version-controls your data, makes modifications atomic and tracks data lineage. You implement your pipelines in containers, so any tool you can put in a container can be used (and horizontally scaled) on Pachyderm.

[0] https://github.com/pachyderm/pachyderm

century19 · on April 24, 2019

DIY is probably the biggest alternative. I think Databricks saw so many customers solving the same problem and decided to make a service of it. Great idea and a bit surprising that it hasn't continued as an Enterprise feature that they only offer customers.

monkeyoct · on April 24, 2019

Hive ACID solves a similar set of problems with a similar design (base Parquet/ORC files + delta files in the same format). It's a bit more "managed" in that compaction is automatic and it supports features like column-level access control, which aren't possible when you're executing an end-user's Scala code against a directory of files.

https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/usi...
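To make the "base files + delta files" idea concrete, here is a toy in-memory sketch in Python. It is not Hive's implementation or API: the `Table` and `Delta` classes, the integer row keys, and the `commit`/`read`/`compact` names are all hypothetical, chosen only to illustrate the design. Readers merge the base with every committed delta, and compaction folds the deltas back into a new base.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

Row = Dict[str, object]


@dataclass
class Delta:
    # One committed batch of changes: row key -> new row, or None for a delete.
    changes: Dict[int, Optional[Row]]


@dataclass
class Table:
    # Hypothetical stand-ins for a base file and its ordered delta files.
    base: Dict[int, Row] = field(default_factory=dict)
    deltas: List[Delta] = field(default_factory=list)

    def commit(self, changes: Dict[int, Optional[Row]]) -> None:
        # Writers never touch the base; each commit appends one new delta.
        self.deltas.append(Delta(dict(changes)))

    def read(self) -> Dict[int, Row]:
        # Readers start from the base and replay the deltas in commit order.
        view = dict(self.base)
        for delta in self.deltas:
            for key, row in delta.changes.items():
                if row is None:
                    view.pop(key, None)  # delete
                else:
                    view[key] = row      # insert or update
        return view

    def compact(self) -> None:
        # Compaction rewrites the base from the merged view and drops the deltas.
        self.base = self.read()
        self.deltas = []


table = Table(base={1: {"name": "a"}, 2: {"name": "b"}})
table.commit({2: None, 3: {"name": "c"}})  # delete row 2, insert row 3
table.commit({1: {"name": "a2"}})          # update row 1
snapshot = table.read()
table.compact()
```

After `compact()` the table holds no deltas, and `table.read()` returns the same rows as `snapshot`. In this sketch compaction is an explicit call; the comment above describes Hive running it automatically.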

fimbulvetr · on April 24, 2019

AWS has one in preview: https://aws.amazon.com/lake-formation/

mobileexpert · on April 24, 2019

I don’t think Amundsen is comparable. Amundsen is essentially a data catalog with ambitions to get into master data management.

Delta here is adding features more closely associated with RDBMSs or MPP data warehouses to the big data world of Spark data pipelines over Parquet data on object stores.

fimbulvetr · on April 24, 2019

You're absolutely correct, thanks. Edited for accuracy.

lichtenberger · on April 24, 2019

That's what I'd be interested in, too :-)