Apache Hudi 1.0 released with secondary indexes for data lakehouses

redhouse · 2024-12-17T16:35:11 1734453311

Really curious about the performance gain we can get from the secondary index, and what it costs.

v5c6 · 2024-12-17T17:31:54 1734456714

It’s comparable to any database index: the gain depends on the selectivity of the query. On a 10TB TPC-DS web_sales table with 1:150 selectivity, we see an impressive 95% gain.

If the query fetches most of the records, for example, the gains will be lower.
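To make the selectivity point concrete, here is a toy sketch. It is not how Hudi stores or reads its index; the table, the `customer` column, and the helper names are all made up for illustration. It only counts how many rows a query touches with a value-to-position index versus a full scan.

```python
from collections import defaultdict

# Hypothetical table: 1,500 rows, 150 distinct customer ids, 10 rows each.
rows = [{"row_id": i, "customer": i % 150} for i in range(1500)]

def build_secondary_index(rows, column):
    """Map each value of `column` to the positions of the rows holding it."""
    index = defaultdict(list)
    for pos, row in enumerate(rows):
        index[row[column]].append(pos)
    return index

def rows_read_with_index(index, wanted):
    """Rows touched when the index points straight at the matches."""
    return sum(len(index.get(value, [])) for value in wanted)

def saving(rows, index, wanted):
    """Fraction of a full scan that the index lets the query skip."""
    return 1 - rows_read_with_index(index, wanted) / len(rows)

index = build_secondary_index(rows, "customer")
selective = saving(rows, index, {7})          # 1 value of 150: 1:150 selectivity
broad = saving(rows, index, set(range(140)))  # 140 of 150 values: most rows match
print(f"selective: {selective:.1%}, broad: {broad:.1%}")
```

In this toy model the selective query skips almost the whole scan, while the broad one skips very little. Real numbers depend on file layout, index lookup cost, and the engine, so treat it as intuition only.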

dunwaldo · 2024-12-17T16:46:41 1734454001

Is this too little, too late while Snowflake and Databricks are marketing Iceberg full steam? Maybe Hudi will hang on a little longer than Delta if it builds new things like this?

v5c6 · 2024-12-17T17:35:35 1734456935

An open source community cannot out-market big vendors. But it can certainly out-execute them, and judicious engineers will continue making choices based on technical evaluations, which keeps it going.

I’d be very surprised if Delta goes away, since Iceberg still is not feature-complete enough to replace it. Databricks has a somewhat confusing position now, which is hurting them. It’ll be interesting to watch.

sudha_sakthee · 2024-12-17T16:14:13 1734452053

First lakehouse system to introduce secondary index!

cloud8bits · 2024-12-17T15:31:25 1734449485

Are there good use cases for indexing on data lakes?

sudha_sakthee · 2024-12-17T16:15:44 1734452144

On the write side, upserts get faster directly from the indexes; on the read side, queries benefit from fast lookups.
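A rough sketch of the two sides, under stated assumptions: the file layout, the `city` column, and the function names below are hypothetical and not Hudi's format or API. A key-to-file index lets an upsert find its target without scanning files, and a value-to-keys index lets a read jump to the matching records.

```python
from collections import defaultdict

# Hypothetical layout: file name -> {record key: row}.
files = {
    "f1": {"k1": {"city": "sf"}, "k2": {"city": "nyc"}},
    "f2": {"k3": {"city": "sf"}},
}

# Write-side index: record key -> file that holds it.
key_index = {key: name for name, recs in files.items() for key in recs}

# Read-side (secondary) index: city -> record keys.
city_index = defaultdict(set)
for recs in files.values():
    for key, row in recs.items():
        city_index[row["city"]].add(key)

def upsert(key, row, default_file="f2"):
    """Update the record where it already lives, else insert into default_file."""
    target = key_index.get(key, default_file)  # one lookup, no file scan
    old = files[target].get(key)
    if old is not None:
        city_index[old["city"]].discard(key)   # keep the read index in sync
    files[target][key] = row
    key_index[key] = target
    city_index[row["city"]].add(key)
    return target

def read_by_city(city):
    """Fetch matching records via the secondary index instead of scanning."""
    return {k: files[key_index[k]][k] for k in sorted(city_index.get(city, ()))}

updated_in = upsert("k2", {"city": "sf"})    # existing key: rewritten in f1
inserted_in = upsert("k4", {"city": "nyc"})  # new key: lands in the default file
sf_keys = list(read_by_city("sf"))
```

The point of the sketch is only that both indexes have to be maintained on every write, which is the cost side of the trade-off.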

dunwaldo · 2024-12-17T16:52:50 1734454370

I thought it was more for writes, but from reading here it looks like the index will also help reads.

v5c6 · 2024-12-17T17:32:37 1734456757

Yes. Indexes are integrated into reads, with (near) standard SQL for managing them.