Apache Hudi 1.0 released with secondary indexes for data lakehouses (apache.org)
11 points by v5c6 5 months ago | 9 comments



Really curious about the performance gain we can get from the secondary index, and what it costs.


It’s comparable to any database index: the gain depends on the selectivity of the query. On a 10TB TPC-DS web_sales table with 1:150 selectivity, we see an impressive 95% gain.

If the query fetches most of the records, for example, the gains will be lower.
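
To make the selectivity point concrete, here is a toy sketch in Python. It is not Hudi's implementation, and every name in it (`build_files`, `build_index`, `files_to_scan`) is made up for illustration. It assumes the index maps each column value to the set of files containing it, so a lookup only has to open those files.

```python
from collections import defaultdict


def build_files(num_files, rows_per_file, distinct_values):
    """Lay out (row_id, value) rows across files; value cycles through distinct_values."""
    files = []
    row_id = 0
    for _ in range(num_files):
        rows = []
        for _ in range(rows_per_file):
            rows.append((row_id, row_id % distinct_values))
            row_id += 1
        files.append(rows)
    return files


def build_index(files):
    """Toy secondary index: column value -> set of file ids that contain it."""
    index = defaultdict(set)
    for file_id, rows in enumerate(files):
        for _, value in rows:
            index[value].add(file_id)
    return index


def files_to_scan(index, value):
    """Number of files a lookup on `value` must open when using the index."""
    return len(index.get(value, set()))


# Selective predicate: 150 distinct values spread over 100 files of 10 rows each.
selective_files = build_files(100, 10, 150)
selective_index = build_index(selective_files)
print(files_to_scan(selective_index, 7), "of", len(selective_files))      # 7 of 100

# Non-selective predicate: only 2 distinct values, so every file matches.
broad_files = build_files(100, 10, 2)
broad_index = build_index(broad_files)
print(files_to_scan(broad_index, 0), "of", len(broad_files))              # 100 of 100
```

With the selective layout the index prunes all but a handful of files; with the broad one it prunes nothing, and you pay for the index lookup on top of the full scan.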


Is this too little, too late while Snowflake and Databricks are marketing Iceberg full steam? Maybe Hudi will hang on a little longer than Delta if it builds new things like this?


An open source community cannot out-market big vendors. But it can certainly out-execute them, and judicious engineers will continue making choices based on technical evaluations, which keeps it going.

I’d be very surprised if Delta goes away, since Iceberg still is not feature-complete enough to replace it. Databricks has a somewhat confusing position now, which is hurting them. It’ll be interesting to watch.


First lakehouse system to introduce secondary indexes!


Are there good use cases for indexing on data lakes?


On the write side, indexes directly make upserts faster; on the read side, queries benefit from fast lookups.
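
A rough sketch of the write-side half, again a toy model in Python and not how Hudi actually stores or indexes data. The helpers `upsert_with_scan` and `upsert_with_index` are hypothetical, and it assumes a key-to-file index is already built. Both return how many files had to be touched to apply one upsert.

```python
def upsert_with_scan(files, key, value):
    """No index: probe files in order until the key is found; return files probed."""
    for probed, rows in enumerate(files, start=1):
        if key in rows:
            rows[key] = value
            return probed
    files[-1][key] = value  # new key: write it to the last file
    return len(files)


def upsert_with_index(files, key_index, key, value):
    """With a key -> file id index: go straight to the right file; return files touched."""
    file_id = key_index.get(key)
    if file_id is None:
        file_id = len(files) - 1  # new key: write it to the last file
        key_index[key] = file_id
    files[file_id][key] = value
    return 1


# 50 files with 10 keys each: k0..k9 in file 0, k10..k19 in file 1, and so on.
files = [{f"k{i * 10 + j}": 0 for j in range(10)} for i in range(50)]
key_index = {key: file_id for file_id, rows in enumerate(files) for key in rows}

print(upsert_with_scan(files, "k495", 1))              # probes all 50 files
print(upsert_with_index(files, key_index, "k495", 2))  # touches 1 file
```

The same key-to-location map is what makes point lookups cheap on the read side.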


I thought it was more for writes, but from reading here it looks like the index will also help reads.


Yes. Indexes are integrated into reads, with (near) standard SQL for managing them.



