Hacker News new | past | comments | ask | show | jobs | submit login
Optimizing Postgres text search with trigrams (alexklibisz.com)
139 points by petercooper on Feb 23, 2022 | hide | past | favorite | 15 comments



That was a very interesting article, and I have not seen such a thorough text on this specific kind of searches in Postgres so far. What I found interesting are the comments on using GiST instead of GIN, turns out I had some misconceptions here. For the text search feature the Postgres documentation recommends GIN indexes, I somehow missed that for Trigram indexes the GiST version is more powerful.

I'll have to play around with this at some time, but I am wondering how GiST and GIN compare if I don't order by similarity. I was also surprised I never heard about the siglen option, but it turns out this is new since Postgres 13. It looks like this has a big effect and is a useful knob to tune this.


Thanks for the feedback -- sorry I missed this a couple days ago! Had no clue it got posted to HN.

FWIW, I've found GIN is a bit faster if you're just looking to filter. IIRC it was maybe 10-15% faster for the particular use-case I was looking at. So, worth a try, but don't expect a 10x improvement.


This is great, hadn’t heard about siglen!

For the solution at the end, you’ll often find yourself wanting to query across multiple tables instead of multiple columns in a single table, in which case creating a materialised view with a single search term column and a GIST index is the best I’ve come up with. Obviously you then have a refresh to do instead of the expression being maintained as part of the index.

But no matter what, if your search isn’t fast, you’re probably doing something wrong. Postgres is great for this stuff (obviously for longer documents and fuzzier searches we can argue about quality).


I'm currently thinking about the multi table search problem and lean towards updating the index table in the ORM (save method of the corresponding Django models in this case). Postgres doesn't seem to support incremental updates of materialized views yet, and always recomputing the whole table just because some few rows changed seems very much overkill to me. That it integrates better into the rest of the app and makes configuring the indexed columns easier is just a bonus.

Not that Postgres probably wouldn't handle the full table refresh effortlessly as it generally does - at least at my scale. But that feels kind of icky to me and I'm surprised there doesn't seem to be a better solution for this yet - everyone building a search function for their site must run into this and I'd expect it to scale terribly at dimensions startups generally aim at.


I've personally had some painful experiences with refreshing materialized views in Postgres. In particular, highly variable performance on read replicas that were receiving a refreshed matview every few minutes. Maybe we were just doing it wrong, but I tend to avoid it if I can. Plus the eventual consistency can introduce confusion.

In any case, there's an interesting feature called Incremental View Maintenance that is being worked on by some Postgres developers: https://wiki.postgresql.org/wiki/Incremental_View_Maintenanc...

This would let us define a materialized view that gets automatically updated as the source tables change. When I last checked (late 2021), they were saying it might land in PG15.


This is a really interesting area, would love it if Postgres provided a strong option here. In the meantime, Materialize has very good Postgres integration if it might work for you:

https://materialize.com/


I’d advise just using materialized views. Yes, they don’t support incremental updates. But anything homemade will eventually be buggy somewhere and cause the index to get out of date in unexpected and probably unmonitored ways.


As long as there's a key on the materialized view you can rebuild it concurrently with no locks, and if you remember to vacuum there's no particular overhead to doing this regularly. Smaller more targeted updates seem good to me though if you want things instantly in the index.


Triggers would be another less fragile way to do this.


Thanks for the feedback -- sorry I missed this a couple days ago! Had no clue it got posted to HN.

I've deployed a solution that uses roughly this same method with multiple tables. I experimented with a materialized view that would centralize all the text columns, but ultimately found that it was much simpler and fast enough to have a single expression index in each of the tables. Then the query either joins the tables and checks each column, or you can run a separate query for each of the tables and stitch together the results in app code.


I went through much of this a few weeks ago, and this article would have saved me quite a bit of time. Like others here, siglen is new to me, so I'll have to have a play around with that.

One thing that bothered me was how cumbersome the queries become once you start having to concatenate and coalesce a whole bunch of columns together. Postgres has a variadic function called "concat_ws" which ought to be useful here, especially as it implicitly coalesces NULL inputs to empty string. However, it's not declared to be immutable, so can't be used in an expression index.

The solution (from Erwin Brandstetter's answer on Stack Overflow) is to use an immutable variant of this, constrained to only handle data of type "text".

https://stackoverflow.com/questions/54372666/create-an-immut...

With this immutable concat function in place, and a trigram expression index making use of it, the WHERE clauses become a lot more concise, for example, the clause in the final query of the article reduces to just:

  WHERE immutable_concat_ws(' ', asin, reviewer_id, reviewer_name, summary) ilike '%' || input.q || '%'


Yeah it's a shame the concat_ws function doesn't quite work here.

I've had success using Slick, an ORM-ish Scala library, to abstract away this tedious concatenation in app code.


Thanks for sharing this! Great stuff.

Also, there is this presentation that might be complementary to the one shared here.

http://matheusoliveira.s3-website-us-east-1.amazonaws.com/pr...

The presentation above is really good for people interested in FTS on PostgreSQL. The company behind this presentation does tens of thousands of sophisticated searches per second. And it's 100% powered by PostgreSQL.

Thanks


Kudos to author .... Extremely well written article.


Learned a bunch, thanks!




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: