Pg_bm25: Elastic-Quality Full Text Search Inside Postgres

samokhvalov · on Oct 8, 2023

I checked the benchmarks and was surprised to see that native search is (a) so slow (seconds), and (b) demonstrating O(N) behavior – with indexing, it should not happen at all.

Indeed, looking at the benchmark source code (thanks for providing it!), it completely lacks index for the native case, leading to a false statement the that native full-text search indexes Postgres provides (usually GIN indexes on tsvector columns) are slow.

https://github.com/paradedb/paradedb/blob/bb4f2890942b85be3e... – here the tsvector is being built. But this is not an index. You need CREATE INDEX ... USING gin(search_vector);

This mistake could be avoided if bencharks included query plans collected with EXPLAIN (ANALYZE, BUFFERS). It would quickly become clear that for the "native" case, we're dealing with SeqScan, not IndexScan.

GINs are very fast. They are designed to be very fast for search – but they have a problem with slower UPDATEs in some cases.

Another point, fuzzy search also exists, via pg_trgm. Of course, dealing with these things require understanding, tuning, and usually a "lego game" to be played – building products out of the existing (or new) "bricks" totally makes sense to me.

philippemnoel · on Oct 8, 2023

One of the ParadeDB authors here, hey! Thanks for pointing this out, you're completely right. That's an oversight on our end. We'll update the benchmarks and re-run them to correct this :)

gvkhna · on Oct 8, 2023

Great to hear, a benchmark against trigram searching with gin index would also be great. There are multiple ways to do full text search with postgres and they’re all insanely fast and memory efficient. Benchmarking various methods for comparison would be helpful.

https://www.crunchydata.com/blog/postgres-full-text-search-a...

philippemnoel · on Oct 8, 2023

Thanks for sharing, will look to add a benchmark for that as well

some_developer · on Oct 8, 2023

I learned the hard way that Gin updates are too slow, and in my case it was not even 100 updates per seconds on average, but could peak to 1000.

How does Pg_bm25 compare here with maintaining the index & performance?

unblough · on Oct 8, 2023

If I am understanding your experience correctly the colloquial wisdom here is to use GIN on static data and GIST on dynamic data.

> In choosing which index type to use, GiST or GIN, consider these performance differences:

> GIN index lookups are about three times faster than GiST

> GIN indexes take about three times longer to build than GiST

> GIN indexes are moderately slower to update than GiST indexes, but about 10 times slower if fast-update support was disabled (see Section 54.3.1 for details)

> GIN indexes are two-to-three times larger than GiST indexes

> As a rule of thumb, GIN indexes are best for static data because lookups are faster. For dynamic data, GiST indexes are faster to update. Specifically, GiST indexes are very good for dynamic data and fast if the number of unique words (lexemes) is under 100,000, while GIN indexes will handle 100,000+ lexemes better but are slower to update.

https://www.postgresql.org/docs/9.1/textsearch-indexes.html

arecurrence · on Oct 8, 2023

This sort of thing is more common with postgres than you'd think. I interviewed a candidate once whose company completely replaced querying in their postgres with elasticsearch because they could not figure out how to speed up certain text search queries. Nothing they tried would use the index.

westurner · on Oct 9, 2023

"Preferred Index Types for Text Search" https://www.postgresql.org/docs/current/textsearch-indexes.h... :

> There are two kinds of indexes that can be used to speed up full text searches: GIN and GiST. Note that indexes are not mandatory for full text searching, but in cases where a column is searched on a regular basis, an index is usually desirable.

gvkhna · on Oct 8, 2023

I had same thought as soon as I read the article, with a gin index the benchmarks would be wildly different and not sure why they didn’t compare against that. Of course a non indexed search is going to be slow.

I was looking for comparison against a gin index specifically, without it pros/cons unclear.

dekimir · on Oct 8, 2023

I still can't figure out how pg_trgm is supposed to work for multi-term searches and how to ensure the dictionary table it needs stays up-to-date. Is there a good writeup somewhere?

retakeming · on Oct 8, 2023

Blog post author and one of the pg_bm25 contributors here. Super excited to see the interest in pg_bm25!

pg_bm25 is our first step in building an Elasticsearch alternative on Postgres. We built it as a result of working on hybrid search in Postgres and becoming frustrated with Postgres' sparse feature set when it comes to full text search.

To address a few of the discussion points, today pg_bm25 can be installed on self-hosted Postgres instances. Managed Postgres providers like RDS are pretty restrictive when it comes to the Postgres extension ecosystem, which is why we're currently working on a managed Postgres database called ParadeDB which comes with pg_bm25 preinstalled. It'll be available in private beta next week and there's a waitlist on our website (https://www.paradedb.com/).

ralusek · on Oct 8, 2023

For what it's worth, the single biggest selling point to a better search, for me, would be not having to deal with additional infrastructure and all the hassle that comes with keeping data in sync. I would be very reluctant to move off of RDS/Aurora, and therefore have my principal motivation to use something like this is greatly negated.

I understand that it becomes very hard to monetize if you're not able to offer your own hosted service, and I don't have a solution for that, but not supporting RDS is going to really diminish the product for many people.

philippemnoel · on Oct 8, 2023

Our goal is for one day ParadeDB to be a viable alternative to AWS RDS/Aurora, so that like you say, you don't need to keep data in-sync and can just use one system (ParadeDB). Soon it will be possible for you to have ParadeDB running on your AWS (utilizing your cloud credits+all security/privacy guarantees) but be managed via the ParadeDB dashboard, similar to how Aurora works from a developer UX.

Of course if you are 100% attached to AWS RDS itself (rather than the convenience of AWS RDS, which is replicable by ParadeDB), then there's not much we can do here, as we also need to eat :')

olivermuty · on Oct 8, 2023

Will you be providing this for bring-your-own-compute in general? There is a gaping hole in the market for this. All the big vendors that provide postgres as a service require you to be on very specific types of hosting like aws fargate, google gke etc (looking at you Crunchydata).

We are using Scaleway (french cloud) which is heaven when it comes to GDPR and Schrems compliance, but once we grow out of their managed db offerings or if we want something their managed db offering does not provide we are out of luck.

Been looking for a year more or less now and I am simply unable to find something that doesnt amount to us just paying a fraction of a consulting FTE to be our lightweight DBA. There are only so many ways you can set up postgres HA, it is amazing that no one has made a product out of doing it for someone else yet.

philippemnoel · on Oct 9, 2023

Hey! Absolutely, we would love to offer as many cloud providers as possible for our compute backend. We're starting with AWS, and will be adding other clouds based on demand. I've added Scaleway to our list, and if you'd like to help us bring ParadeDB to Scaleway we would love to work together to make it happen faster.

In the meantime, you can self-host ParadeDB on Scaleway directly by running the Docker container. Hope this helps!

ddorian43 · on Oct 15, 2023

Only one i know is elest.io

wdb · on Oct 8, 2023

Yes, I have a similar feeling towards Cloud SQL for Postgres. Would be great if Azure/GCP would be supported in some manner

paulddraper · on Oct 8, 2023

What are the features of RDS/Aurora that you need?

Also, it would be possible to set up a logical PG replica.

ralusek · on Oct 8, 2023

Being in my VPC, having the support and track record of AWS, scaling to 128TB without me having to think about it, easy snapshots/backups.

raybb · on Oct 9, 2023

Could this also work as an alternative to Apache Solr? If so might be worth while to market it that way a bit.

I don't really know much about Solr but just started using it while helping with a project for openlibrary.org and it seems pretty alright but I'm still not totally sure I understand what makes it popular.

rdhatt · on Oct 11, 2023

Solr and Elasticsearch are both Java servers built on top of the Java search library Lucene. There are plenty of articles on the internet describing how they differ. However since they share the same core, so they are very similar as well. For the context of this discussion, you can consider Solr & Elasticsearch as interchangeable - a potayto, potahto situation.

phamilton · on Oct 8, 2023

With an AGPL license, does that make it unlikely to be included in hosted environments like RDS?

My understanding of the spirit of the license is that it should be fine as long as modifications are made available. Anyone know of any existing extensions in RDS that are AGPL?

allan_s · on Oct 8, 2023

Related question, could it be possible that at some point postgresql natively implements that algorithm ? Or as there is already an extension doing it , regardless of the licence , it is unlikely that patches in that direction will be accepted ?

j45 · on Oct 8, 2023

Running it for your own purposes as part of a solution that includes search should be fine under AGPL.

If your product is elastic search built into Postgres as a repackaged and direct competitor to this search plug-in, that’s where my understanding is over the line.

allan_s · on Oct 8, 2023

yes I understand I can do that, and I also understand why the authors chose to do that, I would have done the same.

My point of view is more from a small saas company perspective (i.e 100% pragmatic):

1. I want as less vendor as possible, especially on something as mission critical as my database 2. I already use AWS RDS and it comes with a LOT of nice things (managed, multi-az, easy backup/restore story, etc.)

In that situation:

1. hosting myself is not an option because I will loose all the niceties that I will have to reimplement 2. buying from a 3rd party is not an option either because: 1. What if they go bankrupt ? 2. We are ISO 27001 and they may be not ISO 27001 themselves or forever. 3. If I choose a vendor because it's "postgres + feature A" then if there's an other vendor selling "postgres + feature B" (timescaledb etc.) what do I do ?

That's why I was more interested in knowing if that specific could one day be implemented in postgres directly (as there's already tsvector).

Once again I'm 100% behind them to have chosen a restrictive license if they plan on selling it, but in that case their interested and mine are not aligned, and that's fine.

j45 · on Oct 13, 2023

That's a really fair use case acknlowedging personal preference to interpret how you like it.

I find some of the built in services on clouds are just open source libraries that are packaged up to increase tie in to that platform.

I like cloud, but cloud agnostically, and hybrid/private clouds in the mix with that seem like a good skill to at least be able to consider thinking through.

philippemnoel · on Oct 8, 2023

ParadeDB author here -- correct! We plan to offer a hosted version soon and the idea behind picking AGPL is to be as permissive as possible so that people can use the product for free, but also protect ourselves from abuse in case a large company, say AWS, were to want to ship it in their own environment.

In fact, we went through much questioning wondering to go with ELv2, Apache, AGPL, etc. before settling on AGPL

j45 · on Oct 13, 2023

Appreciate the response! This would be a great blog post btw.

adobrawy · on Oct 8, 2023

See who made pg_bm25 - vendor of database based on PostgreSQL. Most likely they would like offer that as hosted solution itself, so they attempt avoid Elasticsearch / Terraform-like drama using AGPL license from beginning.

klysm · on Oct 8, 2023

I forget, does AWS let you use custom extensions from pgrx?

adobrawy · on Oct 8, 2023

No, they allow use Rust for custom functions (alternatively to PL/SQL) only.

hardwaresofton · on Oct 8, 2023

pgrx is one of the greatest enabling innovations in the PG ecosystem in a long time.

Awesome to see so many high quality extensions come out of it.

https://github.com/pgcentralfoundation/pgrx

philippemnoel · on Oct 8, 2023

pgrx is awesome and making pg_bm25 would've been infinitely more challenging without it. Check them out if you want to make a Postgres extension, we can't recommend them enough

zombodb · on Oct 8, 2023

Thank you. I’ll pass this on to the team.

wkoszek · on Oct 8, 2023

Hey guys. Congratulations - this is an exciting development. Can you show some benchmarks around showing the count of matches -- `select count() from table where text match is there`?

This was the top reason that made us (Segmed.ai) give up on PostgreSQL FTS -- our folks require a very exact count of matches for medical conditions that are present in 20M reports. And doing COUNT() in PostgreSQL was crazy, crazy slow. If your extension could do simple len(invertedindex[word]) that would already be a great improvement.

ELK has it immediately, but at a cost of being one more thing to maintain, and the whole Logstash thing is clunky. I'd love to use FTS inside of PostgreSQL.

benpacker · on Oct 8, 2023

I’m not sure if Postgres could support that type of operation directly via count() since I don’t know if the fact that no other filters are present is available to the Index Access Method API.

It might be possible to do a separate function though, like:

select pg_bm25_direct_count(‘term’)*

dekimir · on Oct 8, 2023

If you do that, I can update postgres-searchbox [1] to use it for better frontend experience.

[1] https://www.npmjs.com/package/postgres-searchbox

wkoszek · on Oct 8, 2023

That would be fine--basically any way of achieving it would be fine. As of now, in PostgreSQL's FTS, I don't think there's any way to do this fast enough to give it back to the user.

retakeming · on Oct 8, 2023

Thanks!

We released support for metrics aggregations a few days ago, including count: https://docs.paradedb.com/aggregations/metrics#count.

We haven't gotten around to benchmarking aggregations - that's the focus for next week and we'll publish them once they're done. I would suspect that it's a lot faster than Postgres aggregates since it leverages Tantivy Columnar.

francoismassot · on Oct 8, 2023

Nice! I would be very interested by your benchmark, don't hesitate to jump in the quickwit discord server to talk about the results. https://discord.quickwit.io/

machty · on Oct 11, 2023

What kind of "consistency" do bm25 indexes offer? e.g. I think ElasticSearch is eventually consistent and is constantly indexing in the background and classic Postgres GIN indexes have configuration like `gin_pending_list_limit` and `fastupdate` functionality to avoid slowdowns on insertions (and then you get slowdowns when an insert hits the threshold and triggers the catch-up indexing).

philippemnoel · on Oct 12, 2023

ParadeDB and pg_bm25 offer weak consistency. pg_bm25 doesn't slow down transactions for indexing, and like ElasticSearch it becomes become eventually consistent shortly after (typically at most a few seconds, altough your mileage may vary based on the amount of data modified in the transaction(s)).

mkleczek · on Oct 8, 2023

This is really exciting and I hope to try it out at my company ASAP.

iamdanieljohns · on Oct 8, 2023

Seems really really cool. Is this a full DB, as in they have to take PG source, put in tantivy and their sauce, compile, and distribute? Or is this an extension? If it's the latter, what's the point of putting DB at the end of the name?

iamdanieljohns · on Oct 8, 2023

Ok, all caught up now. Great work and best of luck!

When it comes to the business model: it seems an acqui-hire by Supabase/Neon/etc would be the best bet. It insures the team's focus is on the core product instead of the litany of things to figure out when creating a pg hosting service (payments, downtime, upgrades, customer support, ...) in this highly competitive and demanding market.

ckok · on Oct 9, 2023

Does this also cover some kind of facetted search? (Counting the different colored and sized t-shirt) in an efficient way? As that is also a large part that elastic can do but PostgreSQL isn't very good at.

antman · on Oct 8, 2023

An important step, could be a good combination with pg_vector if they are fast enough

pritambaral · on Oct 8, 2023

I believe the parent project — paradedb — already does that, for their support of HNSW indexes.

philippemnoel · on Oct 8, 2023

That's right, we do support pgvector (it is pre-installed on ParadeDB) and support full HNSW. In fact, we even have another extension, called pg_search, which is the combination of searching on pgvector and pg_bm25 for better results! Topic of another blog post to come sometime soon :)

tristan957 · on Oct 9, 2023

Interesting that you guys are the same people behind Whist. I once interviewed there at your behest, and never heard back. It seems like that venture fizzled out?

rawsh · on Oct 8, 2023

Is it possible to use this for hybrid search in combination with pg_embedding? My understanding is that hybrid search currently requires syncing with Postgres

philippemnoel · on Oct 8, 2023

Yes! We have another extension, pg_search, which is specifically for hybrid search using pg_bm25+pgvector. You can find it here: https://github.com/paradedb/paradedb/tree/dev/pg_search

anon373839 · on Oct 8, 2023

This is very exciting. BM25 in Postgres will enable really nice search experiences to be built in projects where Elasticsearch is just too much complexity.

est · on Oct 8, 2023

looks like a cool project https://github.com/paradedb/paradedb

ashu1461 · on Oct 8, 2023

I wonder how do legacy search players like elastic / solr compete against the new age startups combining semantic and regular search ?

binarymax · on Oct 8, 2023

Lots of reasons:

1) switching search engines is hard when you’ve built your information needs around one. I’ve led lots of search engine migrations and they’re not fun. I even gave a talk on the problems companies face when doing so. https://haystackconf.com/us2020/search-migration-circus/

2) lots of the new search startups don’t offer full feature coverage. So just because a company is the new hotness it doesn’t mean it can fill the need of someone entrenched in Solr/elastic

3) why risk going to a startup when they haven’t proven they’ll be around in 3 to 5 years?

4) incumbent search engines eventually catch up at the speed of the enterprise market. Why spend a year migrating when the engine your using will implement the feature for you within that timeframe?

ntonozzi · on Oct 8, 2023

By adding the features that those new age startups launch: https://www.elastic.co/guide/en/elasticsearch/reference/curr...

Building a classic text search engine is way harder than building a KNN engine, and bolting a KNN engine into a term search engine is easier than the other way around.

vb-8448 · on Oct 8, 2023

Reading "legacy" near "elastic" make me feel a little bit old :D :D

BTW, if you are one of the leaders of the market, you don't need to continuously improve, just wait and let your competitors do the research job and implement only when the feature is mature.

ashu1461 · on Oct 8, 2023

:D :D

Sorry my question was on the basis of the quality of the results, simply put .. how does players who have good semantic search turn out against "legacy" players who had good text search

jillesvangurp · on Oct 8, 2023

They are part of the hype. Lucene has vector search capabilities. Elasticsearch and Opensearch have support for that (slightly different implementations). I assume solr has similar capabilities. The combination of traditional search and vector search makes a lot of sense from a cost control point of view. Vector search at scale is expensive. The smaller the result set, the cheaper it is to do vector search over it. So using a cheap traditional search to limit the results before you run vector search makes a lot of sense.

Also, bm25 holds up well against vector search. A well tuned model can outperform it but many off the shelf models struggle to do that. Vector search is a useful tool but so far it's not a one size fits all solution that "just works". It's something that can work really well if you know what you are doing and with a lot of tuning. With things like Elasticsearch you can try both approaches.

philippemnoel · on Oct 8, 2023

pg_bm25/ParadeDB author here. What we're doing is building an opinionated alternative within PostgreSQL. If you are not using Postgres, or want your system to be separate, Elastic is still the best choice and will likely remain so.

Other people have brought up great points for why or why not to switch. Our vision for this is that ParadeDB is not merely "better" than Elastic, but rather different. Elastic will never be a PostgreSQL database, and we'll never be a NoSQL search engine. If you want one or the other, you'll pick either ParadeDB or Elastic.

kriz9 · on Oct 8, 2023

Who is the competition besides Algolia? Last I checked most of the competition is either very expensive or very feature limited compared to Elastic/Solr.

pg_bot · on Oct 8, 2023

Meilisearch seems like it is the best open source option.

https://www.meilisearch.com/

ashu1461 · on Oct 8, 2023

I think pretty much all the companies who provide vector search are indirect competitors

canadiantim · on Oct 8, 2023

ParadeDB and the work they’re doing with this extension is incredibly exciting. Love to see it.

eclectic29 · on Oct 8, 2023

Is BM25 still used by "modern" search engines? I wasn't aware.

mugivarra69 · on Oct 8, 2023

is this better than lucene

benpacker · on Oct 8, 2023

The underlying engine, Tantivy, has better performance characteristics than Lucene.

You can compare Lucene to Tantivy and can compare Elasticsearch to pg_bm25 or ParadeDB

fiedzia · on Oct 8, 2023

It's faster, but misses tons of features, starting with geosearch. Hopefully they will come with wider use.

PSeitz · on Oct 9, 2023

The issue for geo search is here: https://github.com/quickwit-oss/tantivy/issues/44

stopman · on Oct 8, 2023

Excited to give this a try.