
What other alternatives for data lakes are available (both open source and closed)?



Apache Iceberg is probably the closest product to what Databricks is open sourcing, but none of these products are everything that's needed for data lake management.

What these products do is make decoupled storage and compute as easy to use for analytics as a fully managed analytics DBMS.


Yes, when I heard about Delta I thought the same. Would love to see a comparison between Delta and Iceberg. I wonder if Ryan Blue is on HN.


You can see some related but not directly comparable efforts in Hudi; see https://eng.uber.com/uber-big-data-platform/

Or https://iceberg.apache.org/

Both keep track of data versioning and manage file-based datasets on object stores.

Most people today have very ad hoc approaches to handling data versioning and lineage on Hadoop datasets.


Throwing my own project's hat in the ring, Pachyderm[0] is open source, written in Go and built on Docker and Kubernetes. It version-controls your data, makes modifications atomic and tracks data lineage. You implement your pipelines in containers, so any tool you can put in a container can be used (and horizontally scaled) on Pachyderm.

[0] https://github.com/pachyderm/pachyderm


DIY is probably the biggest alternative. I think Databricks saw so many customers solving the same problem and decided to make a service of it. Great idea and a bit surprising that it hasn't continued as an Enterprise feature that they only offer customers.


Hive ACID solves a similar set of problems with a similar design (base Parquet/ORC files + delta files in the same format). It's a bit more "managed" in that compaction is automatic and it supports features like column-level access control, which aren't possible when you're executing an end-user's Scala code against a directory of files.

https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/usi...
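
To make the base-plus-delta idea concrete, here is a toy sketch in Python. It is not Hive's implementation: in-memory dicts stand in for the Parquet/ORC files, rows are keyed by a made-up integer ID, and the names `merge_on_read` and `compact` are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, object]
# One delta "file": an ordered list of (op, key, row) records.
# row is None for deletes. This layout is an assumption for the sketch.
Delta = List[Tuple[str, int, Optional[Row]]]


def merge_on_read(base: Dict[int, Row], deltas: List[Delta]) -> Dict[int, Row]:
    """Return the current table view: base rows with deltas applied in order."""
    view = dict(base)  # shallow copy so the base itself is never mutated
    for delta in deltas:
        for op, key, row in delta:
            if op == "delete":
                view.pop(key, None)
            elif op in ("insert", "update"):
                view[key] = row
            else:
                raise ValueError(f"unknown op: {op}")
    return view


def compact(base: Dict[int, Row], deltas: List[Delta]) -> Tuple[Dict[int, Row], List[Delta]]:
    """Fold every delta into a new base, leaving no deltas to merge at read time."""
    return merge_on_read(base, deltas), []


base = {1: {"name": "a"}, 2: {"name": "b"}}
deltas = [
    [("update", 1, {"name": "a2"}), ("insert", 3, {"name": "c"})],
    [("delete", 2, None)],
]
view = merge_on_read(base, deltas)
new_base, new_deltas = compact(base, deltas)
```

Readers pay the merge cost on every read until compaction runs, which is why having compaction happen automatically matters.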



I don’t think Amundsen is comparable. Amundsen is essentially a data catalog with ambitions to get into master data management.

Delta is adding features more closely associated with RDBMSs or MPP data warehouses to the big data world of Spark pipelines running over Parquet data on object stores.


You're absolutely correct, thanks. Edited for accuracy.


That's what I'd be interested in, too :-)



