Hacker News new | past | comments | ask | show | jobs | submit login
Pg_bm25: Elastic-Quality Full Text Search Inside Postgres (paradedb.com)
206 points by billwashere on Oct 8, 2023 | hide | past | favorite | 71 comments



I checked the benchmarks and was surprised to see that native search is (a) so slow (seconds), and (b) demonstrating O(N) behavior – with indexing, it should not happen at all.

Indeed, looking at the benchmark source code (thanks for providing it!), it completely lacks index for the native case, leading to a false statement the that native full-text search indexes Postgres provides (usually GIN indexes on tsvector columns) are slow.

https://github.com/paradedb/paradedb/blob/bb4f2890942b85be3e... – here the tsvector is being built. But this is not an index. You need CREATE INDEX ... USING gin(search_vector);

This mistake could be avoided if bencharks included query plans collected with EXPLAIN (ANALYZE, BUFFERS). It would quickly become clear that for the "native" case, we're dealing with SeqScan, not IndexScan.

GINs are very fast. They are designed to be very fast for search – but they have a problem with slower UPDATEs in some cases.

Another point, fuzzy search also exists, via pg_trgm. Of course, dealing with these things require understanding, tuning, and usually a "lego game" to be played – building products out of the existing (or new) "bricks" totally makes sense to me.


One of the ParadeDB authors here, hey! Thanks for pointing this out, you're completely right. That's an oversight on our end. We'll update the benchmarks and re-run them to correct this :)


Great to hear, a benchmark against trigram searching with gin index would also be great. There are multiple ways to do full text search with postgres and they’re all insanely fast and memory efficient. Benchmarking various methods for comparison would be helpful.

https://www.crunchydata.com/blog/postgres-full-text-search-a...


Thanks for sharing, will look to add a benchmark for that as well


I learned the hard way that Gin updates are too slow, and in my case it was not even 100 updates per seconds on average, but could peak to 1000.

How does Pg_bm25 compare here with maintaining the index & performance?


If I am understanding your experience correctly the colloquial wisdom here is to use GIN on static data and GIST on dynamic data.

> In choosing which index type to use, GiST or GIN, consider these performance differences:

> GIN index lookups are about three times faster than GiST

> GIN indexes take about three times longer to build than GiST

> GIN indexes are moderately slower to update than GiST indexes, but about 10 times slower if fast-update support was disabled (see Section 54.3.1 for details)

> GIN indexes are two-to-three times larger than GiST indexes

> As a rule of thumb, GIN indexes are best for static data because lookups are faster. For dynamic data, GiST indexes are faster to update. Specifically, GiST indexes are very good for dynamic data and fast if the number of unique words (lexemes) is under 100,000, while GIN indexes will handle 100,000+ lexemes better but are slower to update.

https://www.postgresql.org/docs/9.1/textsearch-indexes.html


This sort of thing is more common with postgres than you'd think. I interviewed a candidate once whose company completely replaced querying in their postgres with elasticsearch because they could not figure out how to speed up certain text search queries. Nothing they tried would use the index.


"Preferred Index Types for Text Search" https://www.postgresql.org/docs/current/textsearch-indexes.h... :

> There are two kinds of indexes that can be used to speed up full text searches: GIN and GiST. Note that indexes are not mandatory for full text searching, but in cases where a column is searched on a regular basis, an index is usually desirable.


I had same thought as soon as I read the article, with a gin index the benchmarks would be wildly different and not sure why they didn’t compare against that. Of course a non indexed search is going to be slow.

I was looking for comparison against a gin index specifically, without it pros/cons unclear.


I still can't figure out how pg_trgm is supposed to work for multi-term searches and how to ensure the dictionary table it needs stays up-to-date. Is there a good writeup somewhere?


Blog post author and one of the pg_bm25 contributors here. Super excited to see the interest in pg_bm25!

pg_bm25 is our first step in building an Elasticsearch alternative on Postgres. We built it as a result of working on hybrid search in Postgres and becoming frustrated with Postgres' sparse feature set when it comes to full text search.

To address a few of the discussion points, today pg_bm25 can be installed on self-hosted Postgres instances. Managed Postgres providers like RDS are pretty restrictive when it comes to the Postgres extension ecosystem, which is why we're currently working on a managed Postgres database called ParadeDB which comes with pg_bm25 preinstalled. It'll be available in private beta next week and there's a waitlist on our website (https://www.paradedb.com/).


For what it's worth, the single biggest selling point to a better search, for me, would be not having to deal with additional infrastructure and all the hassle that comes with keeping data in sync. I would be very reluctant to move off of RDS/Aurora, and therefore have my principal motivation to use something like this is greatly negated.

I understand that it becomes very hard to monetize if you're not able to offer your own hosted service, and I don't have a solution for that, but not supporting RDS is going to really diminish the product for many people.


Our goal is for one day ParadeDB to be a viable alternative to AWS RDS/Aurora, so that like you say, you don't need to keep data in-sync and can just use one system (ParadeDB). Soon it will be possible for you to have ParadeDB running on your AWS (utilizing your cloud credits+all security/privacy guarantees) but be managed via the ParadeDB dashboard, similar to how Aurora works from a developer UX.

Of course if you are 100% attached to AWS RDS itself (rather than the convenience of AWS RDS, which is replicable by ParadeDB), then there's not much we can do here, as we also need to eat :')


Will you be providing this for bring-your-own-compute in general? There is a gaping hole in the market for this. All the big vendors that provide postgres as a service require you to be on very specific types of hosting like aws fargate, google gke etc (looking at you Crunchydata).

We are using Scaleway (french cloud) which is heaven when it comes to GDPR and Schrems compliance, but once we grow out of their managed db offerings or if we want something their managed db offering does not provide we are out of luck.

Been looking for a year more or less now and I am simply unable to find something that doesnt amount to us just paying a fraction of a consulting FTE to be our lightweight DBA. There are only so many ways you can set up postgres HA, it is amazing that no one has made a product out of doing it for someone else yet.


Hey! Absolutely, we would love to offer as many cloud providers as possible for our compute backend. We're starting with AWS, and will be adding other clouds based on demand. I've added Scaleway to our list, and if you'd like to help us bring ParadeDB to Scaleway we would love to work together to make it happen faster.

In the meantime, you can self-host ParadeDB on Scaleway directly by running the Docker container. Hope this helps!


Only one i know is elest.io


Yes, I have a similar feeling towards Cloud SQL for Postgres. Would be great if Azure/GCP would be supported in some manner


What are the features of RDS/Aurora that you need?

Also, it would be possible to set up a logical PG replica.


Being in my VPC, having the support and track record of AWS, scaling to 128TB without me having to think about it, easy snapshots/backups.


Could this also work as an alternative to Apache Solr? If so might be worth while to market it that way a bit.

I don't really know much about Solr but just started using it while helping with a project for openlibrary.org and it seems pretty alright but I'm still not totally sure I understand what makes it popular.


Solr and Elasticsearch are both Java servers built on top of the Java search library Lucene. There are plenty of articles on the internet describing how they differ. However since they share the same core, so they are very similar as well. For the context of this discussion, you can consider Solr & Elasticsearch as interchangeable - a potayto, potahto situation.


With an AGPL license, does that make it unlikely to be included in hosted environments like RDS?

My understanding of the spirit of the license is that it should be fine as long as modifications are made available. Anyone know of any existing extensions in RDS that are AGPL?


Related question, could it be possible that at some point postgresql natively implements that algorithm ? Or as there is already an extension doing it , regardless of the licence , it is unlikely that patches in that direction will be accepted ?


Running it for your own purposes as part of a solution that includes search should be fine under AGPL.

If your product is elastic search built into Postgres as a repackaged and direct competitor to this search plug-in, that’s where my understanding is over the line.


yes I understand I can do that, and I also understand why the authors chose to do that, I would have done the same.

My point of view is more from a small saas company perspective (i.e 100% pragmatic):

1. I want as less vendor as possible, especially on something as mission critical as my database 2. I already use AWS RDS and it comes with a LOT of nice things (managed, multi-az, easy backup/restore story, etc.)

In that situation:

1. hosting myself is not an option because I will loose all the niceties that I will have to reimplement 2. buying from a 3rd party is not an option either because: 1. What if they go bankrupt ? 2. We are ISO 27001 and they may be not ISO 27001 themselves or forever. 3. If I choose a vendor because it's "postgres + feature A" then if there's an other vendor selling "postgres + feature B" (timescaledb etc.) what do I do ?

That's why I was more interested in knowing if that specific could one day be implemented in postgres directly (as there's already tsvector).

Once again I'm 100% behind them to have chosen a restrictive license if they plan on selling it, but in that case their interested and mine are not aligned, and that's fine.


That's a really fair use case acknlowedging personal preference to interpret how you like it.

I find some of the built in services on clouds are just open source libraries that are packaged up to increase tie in to that platform.

I like cloud, but cloud agnostically, and hybrid/private clouds in the mix with that seem like a good skill to at least be able to consider thinking through.


ParadeDB author here -- correct! We plan to offer a hosted version soon and the idea behind picking AGPL is to be as permissive as possible so that people can use the product for free, but also protect ourselves from abuse in case a large company, say AWS, were to want to ship it in their own environment.

In fact, we went through much questioning wondering to go with ELv2, Apache, AGPL, etc. before settling on AGPL


Appreciate the response! This would be a great blog post btw.


See who made pg_bm25 - vendor of database based on PostgreSQL. Most likely they would like offer that as hosted solution itself, so they attempt avoid Elasticsearch / Terraform-like drama using AGPL license from beginning.


I forget, does AWS let you use custom extensions from pgrx?


No, they allow use Rust for custom functions (alternatively to PL/SQL) only.


pgrx is one of the greatest enabling innovations in the PG ecosystem in a long time.

Awesome to see so many high quality extensions come out of it.

https://github.com/pgcentralfoundation/pgrx


pgrx is awesome and making pg_bm25 would've been infinitely more challenging without it. Check them out if you want to make a Postgres extension, we can't recommend them enough


Thank you. I’ll pass this on to the team.


Hey guys. Congratulations - this is an exciting development. Can you show some benchmarks around showing the count of matches -- `select count() from table where text match is there`?

This was the top reason that made us (Segmed.ai) give up on PostgreSQL FTS -- our folks require a very exact count of matches for medical conditions that are present in 20M reports. And doing COUNT() in PostgreSQL was crazy, crazy slow. If your extension could do simple len(invertedindex[word]) that would already be a great improvement.

ELK has it immediately, but at a cost of being one more thing to maintain, and the whole Logstash thing is clunky. I'd love to use FTS inside of PostgreSQL.


I’m not sure if Postgres could support that type of operation directly via count() since I don’t know if the fact that no other filters are present is available to the Index Access Method API.

It might be possible to do a separate function though, like:

select pg_bm25_direct_count(‘term’)*


If you do that, I can update postgres-searchbox [1] to use it for better frontend experience.

[1] https://www.npmjs.com/package/postgres-searchbox


That would be fine--basically any way of achieving it would be fine. As of now, in PostgreSQL's FTS, I don't think there's any way to do this fast enough to give it back to the user.


Thanks!

We released support for metrics aggregations a few days ago, including count: https://docs.paradedb.com/aggregations/metrics#count.

We haven't gotten around to benchmarking aggregations - that's the focus for next week and we'll publish them once they're done. I would suspect that it's a lot faster than Postgres aggregates since it leverages Tantivy Columnar.


Nice! I would be very interested by your benchmark, don't hesitate to jump in the quickwit discord server to talk about the results. https://discord.quickwit.io/


What kind of "consistency" do bm25 indexes offer? e.g. I think ElasticSearch is eventually consistent and is constantly indexing in the background and classic Postgres GIN indexes have configuration like `gin_pending_list_limit` and `fastupdate` functionality to avoid slowdowns on insertions (and then you get slowdowns when an insert hits the threshold and triggers the catch-up indexing).


ParadeDB and pg_bm25 offer weak consistency. pg_bm25 doesn't slow down transactions for indexing, and like ElasticSearch it becomes become eventually consistent shortly after (typically at most a few seconds, altough your mileage may vary based on the amount of data modified in the transaction(s)).


This is really exciting and I hope to try it out at my company ASAP.


Seems really really cool. Is this a full DB, as in they have to take PG source, put in tantivy and their sauce, compile, and distribute? Or is this an extension? If it's the latter, what's the point of putting DB at the end of the name?


Ok, all caught up now. Great work and best of luck!

When it comes to the business model: it seems an acqui-hire by Supabase/Neon/etc would be the best bet. It insures the team's focus is on the core product instead of the litany of things to figure out when creating a pg hosting service (payments, downtime, upgrades, customer support, ...) in this highly competitive and demanding market.


Does this also cover some kind of facetted search? (Counting the different colored and sized t-shirt) in an efficient way? As that is also a large part that elastic can do but PostgreSQL isn't very good at.


An important step, could be a good combination with pg_vector if they are fast enough


I believe the parent project — paradedb — already does that, for their support of HNSW indexes.


That's right, we do support pgvector (it is pre-installed on ParadeDB) and support full HNSW. In fact, we even have another extension, called pg_search, which is the combination of searching on pgvector and pg_bm25 for better results! Topic of another blog post to come sometime soon :)


Interesting that you guys are the same people behind Whist. I once interviewed there at your behest, and never heard back. It seems like that venture fizzled out?


Is it possible to use this for hybrid search in combination with pg_embedding? My understanding is that hybrid search currently requires syncing with Postgres


Yes! We have another extension, pg_search, which is specifically for hybrid search using pg_bm25+pgvector. You can find it here: https://github.com/paradedb/paradedb/tree/dev/pg_search


This is very exciting. BM25 in Postgres will enable really nice search experiences to be built in projects where Elasticsearch is just too much complexity.


looks like a cool project https://github.com/paradedb/paradedb


I wonder how do legacy search players like elastic / solr compete against the new age startups combining semantic and regular search ?


Lots of reasons:

1) switching search engines is hard when you’ve built your information needs around one. I’ve led lots of search engine migrations and they’re not fun. I even gave a talk on the problems companies face when doing so. https://haystackconf.com/us2020/search-migration-circus/

2) lots of the new search startups don’t offer full feature coverage. So just because a company is the new hotness it doesn’t mean it can fill the need of someone entrenched in Solr/elastic

3) why risk going to a startup when they haven’t proven they’ll be around in 3 to 5 years?

4) incumbent search engines eventually catch up at the speed of the enterprise market. Why spend a year migrating when the engine your using will implement the feature for you within that timeframe?


By adding the features that those new age startups launch: https://www.elastic.co/guide/en/elasticsearch/reference/curr...

Building a classic text search engine is way harder than building a KNN engine, and bolting a KNN engine into a term search engine is easier than the other way around.


Reading "legacy" near "elastic" make me feel a little bit old :D :D

BTW, if you are one of the leaders of the market, you don't need to continuously improve, just wait and let your competitors do the research job and implement only when the feature is mature.


:D :D

Sorry my question was on the basis of the quality of the results, simply put .. how does players who have good semantic search turn out against "legacy" players who had good text search


They are part of the hype. Lucene has vector search capabilities. Elasticsearch and Opensearch have support for that (slightly different implementations). I assume solr has similar capabilities. The combination of traditional search and vector search makes a lot of sense from a cost control point of view. Vector search at scale is expensive. The smaller the result set, the cheaper it is to do vector search over it. So using a cheap traditional search to limit the results before you run vector search makes a lot of sense.

Also, bm25 holds up well against vector search. A well tuned model can outperform it but many off the shelf models struggle to do that. Vector search is a useful tool but so far it's not a one size fits all solution that "just works". It's something that can work really well if you know what you are doing and with a lot of tuning. With things like Elasticsearch you can try both approaches.


pg_bm25/ParadeDB author here. What we're doing is building an opinionated alternative within PostgreSQL. If you are not using Postgres, or want your system to be separate, Elastic is still the best choice and will likely remain so.

Other people have brought up great points for why or why not to switch. Our vision for this is that ParadeDB is not merely "better" than Elastic, but rather different. Elastic will never be a PostgreSQL database, and we'll never be a NoSQL search engine. If you want one or the other, you'll pick either ParadeDB or Elastic.


Who is the competition besides Algolia? Last I checked most of the competition is either very expensive or very feature limited compared to Elastic/Solr.


Meilisearch seems like it is the best open source option.

https://www.meilisearch.com/


I think pretty much all the companies who provide vector search are indirect competitors


ParadeDB and the work they’re doing with this extension is incredibly exciting. Love to see it.


Is BM25 still used by "modern" search engines? I wasn't aware.


is this better than lucene


The underlying engine, Tantivy, has better performance characteristics than Lucene.

You can compare Lucene to Tantivy and can compare Elasticsearch to pg_bm25 or ParadeDB


It's faster, but misses tons of features, starting with geosearch. Hopefully they will come with wider use.


The issue for geo search is here: https://github.com/quickwit-oss/tantivy/issues/44


Excited to give this a try.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: